🤖 CodeAlpha FAQ Chatbot

An intelligent FAQ chatbot for the CodeAlpha Artificial Intelligence Internship program, built as part of TASK 2 of the internship.

📋 Project Overview

This chatbot leverages Natural Language Processing (NLP) and Large Language Models (LLMs) to answer user questions about the CodeAlpha AI Internship. It uses a hybrid approach:

NLP Preprocessing – Tokenizes, cleans, and lemmatizes user input using SpaCy.
Intent Matching – Uses TF-IDF Vectorizer and Cosine Similarity to match user queries against a curated FAQ dataset.
LLM Fallback – When no good FAQ match is found (similarity below threshold), it falls back to an LLM API for conversational responses.
LLM Enhancement – For high-similarity matches, the FAQ answer is optionally formatted politely by the LLM.

🎯 TASK 2 Requirements Met

Requirement	Implementation
FAQ Dataset	25 comprehensive FAQs in `faq_data.json` covering internship details, perks, tasks, submission, etc.
NLP Preprocessing	SpaCy pipeline for tokenization, stop word removal, punctuation cleaning, and lemmatization
Intent Matching	TF-IDF Vectorizer (with unigrams + bigrams) + Cosine Similarity scoring
LLM Fallback	OpenAI-compatible API (chatgpt-4o) when similarity < 0.60 threshold
LLM Enhancement	Matched FAQ answers are optionally reformatted politely by the LLM
FastAPI Backend	`/chat` endpoint with Pydantic validation and CORS support
Modern Chat UI	ChatGPT-style interface with message bubbles, typing indicator, and suggestion chips
Environment Variables	API key stored in `.env` file, loaded via `python-dotenv`

📁 Project Structure

CodeAlpha_Chatbot_FAQ/
│
├── backend/
│   ├── main.py              # FastAPI server with /chat, /health, /faqs endpoints
│   ├── nlp_engine.py        # NLP preprocessing, TF-IDF, Cosine Similarity, LLM fallback
│   ├── faq_data.json        # FAQ dataset (25 questions & answers)
│   ├── requirements.txt     # Python dependencies
│   └── .env                 # Environment variables (API_KEY, BASE_URL, MODEL)
│
├── frontend/
│   ├── index.html           # Chat UI structure with welcome screen & suggestion chips
│   ├── style.css            # Modern ChatGPT-inspired responsive styling
│   └── script.js            # Fetch API communication, message handling, typing indicator
│
└── README.md                # This documentation file

🛠️ Tech Stack

Component	Technology
Backend Framework	FastAPI (Python)
NLP Library	SpaCy (en_core_web_sm model)
Vectorization	Scikit-learn TF-IDF Vectorizer
Similarity Metric	Cosine Similarity (sklearn)
LLM Integration	OpenAI Python SDK (custom base URL)
Frontend	HTML5, CSS3, Vanilla JavaScript
Environment Config	python-dotenv
Server	Uvicorn (ASGI)

🚀 Setup & Installation

Prerequisites

Python 3.9+ installed on your system
pip package manager
A modern web browser (Chrome, Firefox, Edge, Safari)

Step 1: Clone or Download the Project

# If using Git
git clone <your-repo-url>
cd CodeAlpha_Chatbot_FAQ

Step 2: Set Up the Backend

# Navigate to the backend directory
cd backend

# Create a virtual environment (recommended)
python -m venv venv

# Activate the virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

# Install Python dependencies
pip install -r requirements.txt

# Download the SpaCy English language model
python -m spacy download en_core_web_sm

Step 3: Configure Environment Variables

The .env file is already included in the backend/ directory with the following configuration:

API_KEY=
BASE_URL=
MODEL=chatgpt-4o
SIMILARITY_THRESHOLD=0.60
HOST=0.0.0.0
PORT=8000

⚠️ Security Note: Never commit your .env file to a public repository. Add it to .gitignore.

Step 4: Start the Backend Server

# Make sure you're in the backend/ directory with venv activated
cd backend
python main.py

The server will start at http://localhost:8000. You should see:

============================================================
  CodeAlpha FAQ Chatbot - Starting Server
============================================================
  Host: 0.0.0.0
  Port: 8000
  Docs: http://0.0.0.0:8000/docs
============================================================

Step 5: Open the Frontend

Simply open the frontend/index.html file in your web browser:

# Option 1: Double-click index.html in your file explorer

# Option 2: Open from terminal (macOS)
open frontend/index.html

# Option 3: Open from terminal (Linux)
xdg-open frontend/index.html

# Option 4: Open from terminal (Windows)
start frontend/index.html

You can also use a simple HTTP server:

# From the project root directory
cd frontend
python -m http.server 3000
# Then open http://localhost:3000 in your browser

💬 How It Works

Architecture Flow

User Input
    │
    ▼
┌─────────────────────────┐
│   SpaCy Preprocessing   │  Tokenize → Clean → Lemmatize
└──────────┬──────────────┘
           │
           ▼
┌─────────────────────────┐
│   TF-IDF Vectorization  │  Convert text to numerical vectors
└──────────┬──────────────┘
           │
           ▼
┌─────────────────────────┐
│   Cosine Similarity     │  Find best matching FAQ
└──────────┬──────────────┘
           │
     ┌─────┴──────┐
     │ Score ≥ 0.60│  Score < 0.60
     ▼             ▼
┌──────────┐  ┌──────────────┐
│ Return   │  │ LLM Fallback │
│ FAQ      │  │ Generate new │
│ Answer   │  │ response     │
│ (+format)│  │              │
└──────────┘  └──────────────┘

NLP Preprocessing Pipeline

Tokenization – Text is split into individual tokens using SpaCy's tokenizer.
Lowercasing – All text is converted to lowercase for uniformity.
Stop Word Removal – Common English stop words (the, is, at, etc.) are removed.
Punctuation Removal – Punctuation and whitespace tokens are filtered out.
Lemmatization – Each token is converted to its base dictionary form (e.g., "running" → "run", "interns" → "intern").

Intent Matching

The preprocessed FAQ questions are transformed into TF-IDF vectors using unigrams and bigrams.
When a user sends a message, it is preprocessed and transformed using the same vectorizer.
Cosine Similarity is computed between the user's vector and each FAQ vector.
The FAQ with the highest similarity score is selected as the best match.

LLM Fallback & Enhancement

Threshold: If the best similarity score is below 0.60, the system falls back to the LLM API.
Fallback: The user's question is sent to the LLM with a system prompt about CodeAlpha, generating a conversational response.
Enhancement: If the score is above 0.60, the matched FAQ answer is sent to the LLM with a prompt to rephrase it politely and conversationally.

🔌 API Endpoints

POST `/chat`

Send a message and receive a chatbot response.

Request Body:

{
  "message": "What is the CodeAlpha AI Internship?",
  "use_llm_formatting": true
}

Response:

{
  "response": "The CodeAlpha AI Internship is a fantastic virtual program...",
  "source": "faq_llm_formatted",
  "similarity_score": 0.85,
  "matched_question": "What is the CodeAlpha Artificial Intelligence Internship?"
}

Response Sources:

Source	Description
`faq_direct`	Raw FAQ answer returned directly (when `use_llm_formatting` is false)
`faq_llm_formatted`	FAQ answer reformatted politely by the LLM
`llm_fallback`	LLM-generated response when no good FAQ match found

GET `/health`

Check API health and status.

GET `/faqs`

List all FAQ questions with their IDs.

GET `/stats`

Get engine statistics (FAQ count, TF-IDF matrix shape, threshold, etc.).

GET `/docs`

Interactive Swagger UI documentation.

🧪 Testing

Test the API Directly

# Health check
curl http://localhost:8000/health

# Send a chat message
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What is the CodeAlpha AI Internship?", "use_llm_formatting": true}'

# List all FAQs
curl http://localhost:8000/faqs

Test via Swagger UI

Open http://localhost:8000/docs in your browser for an interactive API testing interface.

📝 Key Design Decisions

SpaCy over NLTK: SpaCy provides a more efficient and modern NLP pipeline with better lemmatization accuracy and faster processing compared to NLTK.
TF-IDF with Bigrams: Using both unigrams and bigrams captures more context in the FAQ matching process, improving accuracy for queries that match phrases rather than individual words.
LLM Enhancement: Instead of just returning raw FAQ answers, the LLM reformulates them conversationally, making the chatbot feel more natural and engaging.
Singleton Pattern: The NLP engine is initialized once and reused across all requests, avoiding the overhead of loading SpaCy models repeatedly.
Vanilla JS Frontend: A clean, dependency-free frontend ensures easy setup and no build tools required, while still providing a professional ChatGPT-style experience.

📜 License

This project is built as part of the CodeAlpha Artificial Intelligence Internship (TASK 2). Feel free to use and modify for educational purposes.

👤 Author

CodeAlpha AI Intern – TASK 2: FAQ Chatbot with NLP and LLM Fallback

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
backend		backend
frontend		frontend
LICENSE		LICENSE
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

🤖 CodeAlpha FAQ Chatbot

📋 Project Overview

🎯 TASK 2 Requirements Met

📁 Project Structure

🛠️ Tech Stack

🚀 Setup & Installation

Prerequisites

Step 1: Clone or Download the Project

Step 2: Set Up the Backend

Step 3: Configure Environment Variables

Step 4: Start the Backend Server

Step 5: Open the Frontend

💬 How It Works

Architecture Flow

NLP Preprocessing Pipeline

Intent Matching

LLM Fallback & Enhancement

🔌 API Endpoints

POST /chat

GET /health

GET /faqs

GET /stats

GET /docs

🧪 Testing

Test the API Directly

Test via Swagger UI

📝 Key Design Decisions

📜 License

👤 Author

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

POST `/chat`

GET `/health`

GET `/faqs`

GET `/stats`

GET `/docs`

Packages