An intelligent FAQ chatbot for the CodeAlpha Artificial Intelligence Internship program, built as part of TASK 2 of the internship.
This chatbot leverages Natural Language Processing (NLP) and Large Language Models (LLMs) to answer user questions about the CodeAlpha AI Internship. It uses a hybrid approach:
- NLP Preprocessing – Tokenizes, cleans, and lemmatizes user input using SpaCy.
- Intent Matching – Uses TF-IDF Vectorizer and Cosine Similarity to match user queries against a curated FAQ dataset.
- LLM Fallback – When no good FAQ match is found (similarity below threshold), it falls back to an LLM API for conversational responses.
- LLM Enhancement – For high-similarity matches, the FAQ answer is optionally formatted politely by the LLM.
| Requirement | Implementation |
|---|---|
| FAQ Dataset | 25 comprehensive FAQs in faq_data.json covering internship details, perks, tasks, submission, etc. |
| NLP Preprocessing | SpaCy pipeline for tokenization, stop word removal, punctuation cleaning, and lemmatization |
| Intent Matching | TF-IDF Vectorizer (with unigrams + bigrams) + Cosine Similarity scoring |
| LLM Fallback | OpenAI-compatible API (chatgpt-4o), used when the best similarity score falls below the 0.60 threshold |
| LLM Enhancement | Matched FAQ answers are optionally reformatted politely by the LLM |
| FastAPI Backend | /chat endpoint with Pydantic validation and CORS support |
| Modern Chat UI | ChatGPT-style interface with message bubbles, typing indicator, and suggestion chips |
| Environment Variables | API key stored in .env file, loaded via python-dotenv |
CodeAlpha_Chatbot_FAQ/
│
├── backend/
│   ├── main.py              # FastAPI server with /chat, /health, /faqs endpoints
│   ├── nlp_engine.py        # NLP preprocessing, TF-IDF, Cosine Similarity, LLM fallback
│   ├── faq_data.json        # FAQ dataset (25 questions & answers)
│   ├── requirements.txt     # Python dependencies
│   └── .env                 # Environment variables (API_KEY, BASE_URL, MODEL)
│
├── frontend/
│   ├── index.html           # Chat UI structure with welcome screen & suggestion chips
│   ├── style.css            # Modern ChatGPT-inspired responsive styling
│   └── script.js            # Fetch API communication, message handling, typing indicator
│
└── README.md                # This documentation file
| Component | Technology |
|---|---|
| Backend Framework | FastAPI (Python) |
| NLP Library | SpaCy (en_core_web_sm model) |
| Vectorization | Scikit-learn TF-IDF Vectorizer |
| Similarity Metric | Cosine Similarity (sklearn) |
| LLM Integration | OpenAI Python SDK (custom base URL) |
| Frontend | HTML5, CSS3, Vanilla JavaScript |
| Environment Config | python-dotenv |
| Server | Uvicorn (ASGI) |
- Python 3.9+ installed on your system
- pip package manager
- A modern web browser (Chrome, Firefox, Edge, Safari)
# If using Git
git clone <your-repo-url>
cd CodeAlpha_Chatbot_FAQ

# Navigate to the backend directory
cd backend
# Create a virtual environment (recommended)
python -m venv venv
# Activate the virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
# Install Python dependencies
pip install -r requirements.txt
# Download the SpaCy English language model
python -m spacy download en_core_web_sm

The .env file is already included in the backend/ directory with the following configuration:
API_KEY=
BASE_URL=
MODEL=chatgpt-4o
SIMILARITY_THRESHOLD=0.60
HOST=0.0.0.0
PORT=8000
⚠️ Security Note: Never commit your .env file to a public repository. Add it to .gitignore.
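The variables above could be read in Python roughly as follows. This is a minimal sketch that uses only the standard library; the `Settings` dataclass and the `load_settings` helper are hypothetical names, and the defaults simply mirror the values listed above. In the project itself, python-dotenv would load the .env file into the environment before values like these are read.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the values defined in .env."""
    api_key: str
    base_url: str
    model: str
    similarity_threshold: float
    host: str
    port: int


def load_settings(env=None) -> Settings:
    """Read settings from a mapping (defaults to os.environ).

    Assumption: python-dotenv's load_dotenv() has already been called,
    so the .env values are present in os.environ.
    """
    env = os.environ if env is None else env
    return Settings(
        api_key=env.get("API_KEY", ""),
        base_url=env.get("BASE_URL", ""),
        model=env.get("MODEL", "chatgpt-4o"),
        similarity_threshold=float(env.get("SIMILARITY_THRESHOLD", "0.60")),
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8000")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise without touching the real environment.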
# Make sure you're in the backend/ directory with venv activated
cd backend
python main.py

The server will start at http://localhost:8000. You should see:
============================================================
CodeAlpha FAQ Chatbot - Starting Server
============================================================
Host: 0.0.0.0
Port: 8000
Docs: http://0.0.0.0:8000/docs
============================================================
Simply open the frontend/index.html file in your web browser:
# Option 1: Double-click index.html in your file explorer
# Option 2: Open from terminal (macOS)
open frontend/index.html
# Option 3: Open from terminal (Linux)
xdg-open frontend/index.html
# Option 4: Open from terminal (Windows)
start frontend\index.html

You can also use a simple HTTP server:
# From the project root directory
cd frontend
python -m http.server 3000
# Then open http://localhost:3000 in your browser

User Input
     │
     ▼
┌─────────────────────────┐
│   SpaCy Preprocessing   │  Tokenize → Clean → Lemmatize
└──────────┬──────────────┘
           │
           ▼
┌─────────────────────────┐
│  TF-IDF Vectorization   │  Convert text to numerical vectors
└──────────┬──────────────┘
           │
           ▼
┌─────────────────────────┐
│    Cosine Similarity    │  Find best matching FAQ
└──────────┬──────────────┘
           │
     ┌─────┴──────┐
     │            │
Score ≥ 0.60   Score < 0.60
     ▼            ▼
┌──────────┐ ┌──────────────┐
│ Return   │ │ LLM Fallback │
│ FAQ      │ │ Generate new │
│ Answer   │ │ response     │
│ (+format)│ │              │
└──────────┘ └──────────────┘
- Tokenization – Text is split into individual tokens using SpaCy's tokenizer.
- Lowercasing – All text is converted to lowercase for uniformity.
- Stop Word Removal – Common English stop words (the, is, at, etc.) are removed.
- Punctuation Removal – Punctuation and whitespace tokens are filtered out.
- Lemmatization – Each token is converted to its base dictionary form (e.g., "running" → "run", "interns" → "intern").
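The preprocessing steps above can be sketched as a single function. This is a hedged illustration, not the code in nlp_engine.py: the names `preprocess` and `load_spacy_pipeline` are hypothetical, and the function only relies on SpaCy's standard token attributes (`is_stop`, `is_punct`, `is_space`, `lemma_`).

```python
def preprocess(text, nlp):
    """Lowercase, drop stop words/punctuation/whitespace, and lemmatize.

    `nlp` is any callable that returns tokens exposing SpaCy's token
    attributes: is_stop, is_punct, is_space and lemma_.
    """
    doc = nlp(text.lower())
    lemmas = [
        tok.lemma_
        for tok in doc
        if not (tok.is_stop or tok.is_punct or tok.is_space)
    ]
    return " ".join(lemmas)


def load_spacy_pipeline(model_name="en_core_web_sm"):
    """Load the SpaCy model named in this README (imported lazily)."""
    import spacy  # needs spacy and the downloaded model to be installed

    return spacy.load(model_name)


# Usage, assuming the model has been downloaded:
#   nlp = load_spacy_pipeline()
#   cleaned = preprocess("What are the perks of the internship?", nlp)
```

Taking the pipeline as a parameter keeps the model load out of the per-request path, so it can be created once and shared.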
- The preprocessed FAQ questions are transformed into TF-IDF vectors using unigrams and bigrams.
- When a user sends a message, it is preprocessed and transformed using the same vectorizer.
- Cosine Similarity is computed between the user's vector and each FAQ vector.
- The FAQ with the highest similarity score is selected as the best match.
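To make the arithmetic behind the matching step concrete, here is a standard-library sketch of TF-IDF over unigrams and bigrams followed by cosine similarity. It is not the project's code, which uses scikit-learn's TF-IDF Vectorizer; the function names are hypothetical, and the smoothed IDF with L2 normalization is only modeled on scikit-learn's defaults. The sketch assumes a non-empty list of already preprocessed FAQ questions.

```python
import math
from collections import Counter


def ngrams(text):
    """Return unigrams and bigrams of a whitespace-tokenized string."""
    words = text.split()
    return words + [" ".join(pair) for pair in zip(words, words[1:])]


def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(ngrams(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def tfidf_vector(text, idf):
    """L2-normalized TF-IDF weights; terms unseen during fitting are ignored."""
    counts = Counter(t for t in ngrams(text) if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def best_match(query, faq_questions):
    """Return (index, score) of the FAQ question most similar to the query."""
    idf = fit_idf(faq_questions)
    faq_vectors = [tfidf_vector(q, idf) for q in faq_questions]
    query_vector = tfidf_vector(query, idf)
    scores = [cosine(query_vector, v) for v in faq_vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

Because the query is vectorized with the IDF table fitted on the FAQ questions, words that never appear in the dataset contribute nothing to the score, which is why unrelated queries score near zero.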
- Threshold: If the best similarity score is below 0.60, the system falls back to the LLM API.
- Fallback: The user's question is sent to the LLM with a system prompt about CodeAlpha, generating a conversational response.
- Enhancement: If the score is 0.60 or higher and use_llm_formatting is enabled, the matched FAQ answer is sent to the LLM with a prompt to rephrase it politely and conversationally. Otherwise the raw FAQ answer is returned as is.
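The threshold logic above amounts to a three-way branch. The sketch below is an illustration under stated assumptions: `route_response`, `ask_llm`, and `format_with_llm` are hypothetical names, and the two callables stand in for whatever LLM calls the project actually makes. The source labels match the response sources this README documents.

```python
def route_response(question, faq_answer, score, ask_llm, format_with_llm,
                   threshold=0.60, use_llm_formatting=True):
    """Pick a response path from the best similarity score.

    ask_llm and format_with_llm are hypothetical callables; each takes a
    string and returns a string.
    """
    if score < threshold:
        # No good FAQ match: let the LLM answer the question directly.
        return {"response": ask_llm(question), "source": "llm_fallback"}
    if use_llm_formatting:
        # Good match: ask the LLM to rephrase the stored answer.
        return {"response": format_with_llm(faq_answer),
                "source": "faq_llm_formatted"}
    # Good match, formatting disabled: return the stored answer unchanged.
    return {"response": faq_answer, "source": "faq_direct"}
```

A score of exactly 0.60 takes the FAQ path here, which agrees with the "Score ≥ 0.60" branch in the flow diagram.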
POST /chat: Send a message and receive a chatbot response.
Request Body:
{
  "message": "What is the CodeAlpha AI Internship?",
  "use_llm_formatting": true
}

Response:
{
  "response": "The CodeAlpha AI Internship is a fantastic virtual program...",
  "source": "faq_llm_formatted",
  "similarity_score": 0.85,
  "matched_question": "What is the CodeAlpha Artificial Intelligence Internship?"
}

Response Sources:
| Source | Description |
|---|---|
| faq_direct | Raw FAQ answer returned directly (when use_llm_formatting is false) |
| faq_llm_formatted | FAQ answer reformatted politely by the LLM |
| llm_fallback | LLM-generated response when no good FAQ match is found |
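For callers that prefer Python over a browser, the request and response shapes above can be exercised with the standard library alone. This is a sketch, not part of the project: `build_chat_request` and `send_chat` are hypothetical helpers, and the default base URL assumes the server is running locally on port 8000 as described in this README.

```python
import json
import urllib.request


def build_chat_request(message, use_llm_formatting=True,
                       base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the /chat endpoint."""
    payload = json.dumps(
        {"message": message, "use_llm_formatting": use_llm_formatting}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(message, **kwargs):
    """Send the request and decode the JSON reply (server must be running)."""
    with urllib.request.urlopen(build_chat_request(message, **kwargs)) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage, with the backend running:
#   reply = send_chat("What is the CodeAlpha AI Internship?")
#   print(reply["source"], reply["response"])
```

Checking `reply["source"]` against the table above tells the caller whether the text came from the FAQ dataset or from the LLM fallback.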
- GET /health: Check API health and status.
- GET /faqs: List all FAQ questions with their IDs.
- Engine statistics endpoint: Get engine statistics (FAQ count, TF-IDF matrix shape, threshold, etc.).
- GET /docs: Interactive Swagger UI documentation.
# Health check
curl http://localhost:8000/health
# Send a chat message
curl -X POST http://localhost:8000/chat \
-H "Content-Type: application/json" \
-d '{"message": "What is the CodeAlpha AI Internship?", "use_llm_formatting": true}'
# List all FAQs
curl http://localhost:8000/faqs

Open http://localhost:8000/docs in your browser for an interactive API testing interface.
- SpaCy over NLTK: SpaCy provides a more efficient and modern NLP pipeline with better lemmatization accuracy and faster processing compared to NLTK.
- TF-IDF with Bigrams: Using both unigrams and bigrams captures more context in the FAQ matching process, improving accuracy for queries that match phrases rather than individual words.
- LLM Enhancement: Instead of just returning raw FAQ answers, the LLM reformulates them conversationally, making the chatbot feel more natural and engaging.
- Singleton Pattern: The NLP engine is initialized once and reused across all requests, avoiding the overhead of loading SpaCy models repeatedly.
- Vanilla JS Frontend: A clean, dependency-free frontend ensures easy setup with no build tools required, while still providing a professional ChatGPT-style experience.
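The singleton decision in the list above can be realized in several ways, and this README does not show which one the project uses. One common option, sketched here with a hypothetical `NLPEngine` stand-in and a hypothetical `get_engine` accessor, is to cache the constructor call with `functools.lru_cache`.

```python
from functools import lru_cache


class NLPEngine:
    """Hypothetical stand-in for the engine in nlp_engine.py."""

    load_count = 0  # counts how many times the expensive setup ran

    def __init__(self):
        # The real engine would load the SpaCy model and fit TF-IDF here.
        NLPEngine.load_count += 1


@lru_cache(maxsize=1)
def get_engine() -> NLPEngine:
    """Create the engine on first call and return the cached instance after."""
    return NLPEngine()
```

Every request handler can then call `get_engine()` freely; only the first call pays the model-loading cost.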
This project is built as part of the CodeAlpha Artificial Intelligence Internship (TASK 2). Feel free to use and modify for educational purposes.
CodeAlpha AI Intern – TASK 2: FAQ Chatbot with NLP and LLM Fallback