A production-style Retrieval-Augmented Generation (RAG) system built with FastAPI + React, designed for intelligent PDF understanding through advanced retrieval engineering.
Unlike simple "upload PDF → ask GPT" projects, OctoVector-AI implements a complete retrieval pipeline with:
- Dense Retrieval
- Sparse Retrieval
- Hybrid Search
- Reciprocal Rank Fusion (RRF)
- Cross Encoder Re-ranking
- Grounded Prompt Generation
- Hallucination Reduction
The system focuses on one principle:
Better retrieval = better answers.
- PDF text extraction
- Text cleaning and normalization
- Semantic chunking
- Sentence-aware segmentation
- SentenceTransformer embeddings
- Precomputed embedding storage
- Efficient lifecycle management
- Dense retrieval using FAISS
- Sparse retrieval using BM25
- Hybrid retrieval pipeline
- Reciprocal Rank Fusion (RRF)
- Cross Encoder reranking
- Heuristic relevance boosting
- Precision-focused retrieval optimization
- Gemini-powered response generation
- Grounded prompt construction
- Context injection
- Hallucination reduction
- FastAPI architecture
- Async APIs
- CORS support
- Logging
- Persistent upload storage
- React upload interface
- Question-answer workflow
- Real-time interaction
                    ┌─────────────────────────┐
                    │     React Frontend      │
                    │ Upload + Ask Questions  │
                    └────────────┬────────────┘
                                 │
                                 ▼
                    ┌─────────────────────────┐
                    │     FastAPI Backend     │
                    └────────────┬────────────┘
                                 │
         ┌───────────────────────┴──────────────────────┐
         │                                              │
         ▼                                              ▼
┌──────────────────┐                           ┌──────────────────┐
│   /upload API    │                           │    /query API    │
└────────┬─────────┘                           └────────┬─────────┘
         │                                              │
         ▼                                              ▼
┌──────────────────┐                           ┌──────────────────────┐
│  PDF Ingestion   │                           │   Query Processing   │
│     Cleaning     │                           │   Hybrid Retrieval   │
│     Chunking     │                           │      Reranking       │
└────────┬─────────┘                           └────────┬─────────────┘
         │                                              │
         ▼                                              ▼
┌──────────────────┐                           ┌──────────────────────┐
│ Embedding Model  │                           │  Context Selection   │
│ SentenceTransform│                           └────────┬─────────────┘
└────────┬─────────┘                                    │
         ▼                                              ▼
┌──────────────────┐                           ┌──────────────────────┐
│ FAISS Vector DB  │                           │ Prompt Construction  │
└────────┬─────────┘                           └────────┬─────────────┘
         │                                              │
         ▼                                              ▼
┌──────────────────┐                           ┌──────────────────────┐
│ Stored Embeddings│                           │  Gemini Generation   │
└──────────────────┘                           └────────┬─────────────┘
                                                        ▼
                                               ┌──────────────────────┐
                                               │  Grounded Response   │
                                               └──────────────────────┘
POST /upload
Handles:
- PDF upload
- Text extraction
- Cleaning
- Semantic chunking
- Embedding generation
- Vector preparation
POST /query
Handles:
- Hybrid retrieval
- Reranking
- Context selection
- Prompt construction
- Gemini generation
- Answer delivery
PDF Upload
↓
Disk Persistence
↓
Text Extraction
↓
Cleaning
↓
Chunking
↓
Embedding Generation
↓
Store Chunks + Embeddings
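The chunking step in the flow above can be illustrated with a minimal sketch of sentence-aware chunking. This is not the project's actual implementation: the function names `split_sentences` and `chunk_sentences`, the `max_chars` and `overlap` parameters, and the regex-based sentence splitter are all assumptions made for illustration.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(text: str, max_chars: int = 500, overlap: int = 1) -> list[str]:
    """Pack whole sentences into chunks of roughly max_chars characters.

    Sentences are never split, so a chunk can exceed max_chars when a single
    sentence (or the carried-over overlap) is long. The last `overlap`
    sentences of a chunk are repeated at the start of the next one.
    """
    chunks: list[str] = []
    current: list[str] = []
    for sentence in split_sentences(text):
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            # Close the current chunk and carry the overlap forward.
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    for chunk in chunk_sentences("A one. B two. C three.", max_chars=13, overlap=1):
        print(chunk)
```

Keeping sentence boundaries intact and repeating one sentence across neighbouring chunks is one simple way to preserve local context; a production splitter would likely need to handle abbreviations and PDF line breaks as well.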
User Question
↓
retrieve_chunks()
↓
Hybrid Retrieval
↓
RRF Fusion
↓
Cross Encoder Reranking
↓
generate_response()
↓
Gemini Answer
Responsible for:
- PDF extraction
- Normalization
- Semantic chunking
Responsible for:
- Dense retrieval
- Sparse retrieval
- FAISS search
- BM25 search
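To make the sparse side of retrieval concrete, here is a small pure-Python sketch of BM25 scoring. It is an illustration only: the project may well use a library for this, and the whitespace tokenizer, the function name `bm25_scores`, and the parameter defaults `k1=1.5` and `b=0.75` are assumptions. The IDF uses the common variant with `+ 1` inside the logarithm so scores stay non-negative.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase whitespace tokenizer (an assumption, kept simple on purpose)."""
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Return one BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    df: Counter = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization penalizes documents longer than average.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


if __name__ == "__main__":
    chunks = [
        "faiss dense vector index",
        "bm25 sparse keyword ranking",
        "react upload interface",
    ]
    print(bm25_scores("sparse bm25", chunks))
```

Only chunks that share at least one exact term with the query receive a non-zero score, which is why lexical search complements the embedding-based dense search rather than replacing it.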
Responsible for:
- RRF Fusion
- Cross Encoder reranking
- Heuristic boosting
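As an illustration of the fusion step, Reciprocal Rank Fusion gives each chunk a score of 1 / (k + rank) in every ranking it appears in and sums those scores. The sketch below assumes chunk IDs are strings and uses k = 60, a commonly used constant; the function name `rrf_fuse` and the value of k used by this project are assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk IDs with Reciprocal Rank Fusion.

    Each ranking lists IDs from best to worst. Returns (id, score) pairs
    sorted by descending fused score, with ties broken by ID.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    dense_ranking = ["c1", "c2", "c3"]   # hypothetical FAISS result order
    sparse_ranking = ["c2", "c4", "c1"]  # hypothetical BM25 result order
    fused = rrf_fuse([dense_ranking, sparse_ranking])
    print([chunk_id for chunk_id, _ in fused])  # ['c2', 'c1', 'c4', 'c3']
```

Because RRF works on ranks rather than raw scores, it can merge FAISS similarities and BM25 scores without having to normalize them onto a common scale.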
Responsible for:
- Prompt construction
- Citation-aware context injection
- Hallucination reduction
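A grounded prompt of the kind described here might be assembled roughly as follows. This is a hedged sketch, not the project's actual prompt: the function name `build_grounded_prompt`, the instruction wording, and the `[n]` citation format are assumptions.

```python
def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    """Build a prompt that restricts the model to the retrieved context.

    Each chunk gets a numeric tag so the answer can cite its sources.
    """
    context = "\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the numbered context below.\n"
        "Cite the supporting passages like [1]. If the context does not "
        "contain the answer, say that you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


if __name__ == "__main__":
    print(build_grounded_prompt(
        "What does RRF do?",
        ["RRF merges several rankings into one.", "BM25 is a lexical scorer."],
    ))
```

The explicit instruction to admit when the context lacks the answer is the part aimed at reducing hallucinations; the numbered tags are what make citation-aware answers possible.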
Responsible for:
- Gemini interaction
- Response synthesis
Responsible for:
- API orchestration
- Frontend integration
- Lifecycle management
Combines:
- Semantic search
- Lexical search
- Retrieval fusion
This combination typically returns more relevant results than dense or sparse retrieval alone, because each method catches matches the other misses.
Cross Encoder reranking improves ranking quality after retrieval and increases answer relevance.
Semantic chunking preserves more context than fixed-size chunking.
Reciprocal Rank Fusion (RRF) combines the outputs of multiple ranking systems into a single ranked list.
Grounded prompting reduces hallucinations and improves answer reliability.
- FastAPI
- Python
- SentenceTransformers
- FAISS
- BM25
- Gemini API
- React
- Dense Retrieval
- Sparse Retrieval
- Hybrid Search
- Cross Encoder Reranking
- RRF Fusion
project/
│
├── frontend/
│   └── React UI
│
├── main.py
├── ingestion/
├── retrieval/
├── generation/
├── embeddings/
├── uploads/
│
├── vector_store/
│
└── README.md
This system follows a retrieval-first architecture:
The quality of generated answers depends primarily on retrieval quality.
Instead of relying solely on an LLM, the system prioritizes strong information retrieval before generation.
- Multi-document support
- Streaming responses
- Persistent vector database
- User authentication
- Conversation memory
- Citation highlighting
- Kubernetes deployment
- Semantic chunking
- Embeddings
- Text preprocessing
- BM25
- FAISS
- Hybrid Retrieval
- RRF
- Reranking
- Prompt engineering
- Grounding
- Hallucination control
- FastAPI
- Modular architecture
- Logging
- Persistence
- React integration
- Async workflows
Built with retrieval engineering at the core.
The name OctoVector AI combines two ideas that represent the system’s core architecture.
Octo is inspired by the octopus, symbolizing intelligence, adaptability, and multiple components working together in parallel, much like the system's multi-stage pipeline of retrieval, fusion, reranking, and generation.
Vector represents embedding vectors, the foundation of semantic search, enabling the system to understand meaning and context beyond simple keyword matching.
Together, OctoVector AI represents:
- Parallel Intelligence → multiple retrieval strategies working together
- Semantic Depth → understanding meaning through embeddings
- Precision & Adaptability → refining and retrieving the most relevant information
The name reflects a system designed to intelligently navigate and refine knowledge across multiple layers to deliver accurate, context-aware answers.