Retrieval-Augmented Generation is the backbone of "chat with your documents" systems. This project implements one end to end — and measures whether it actually works.
A conversational assistant that answers questions about your documents (PDF, Word) through a complete RAG pipeline, with source citations and quantitative evaluation of answer quality.
🎥 Showcase
🎯 Overview
AskMyDocs lets you query a document in natural language. You ask a question, and it retrieves the relevant passages from the document and generates a sourced answer that relies solely on the provided content, which limits hallucination.
The project implements the full RAG (Retrieval-Augmented Generation) chain, from document ingestion to answer generation, plus an evaluation harness to objectively measure its performance.
- 📄 Ingestion of PDF and Word (.docx) documents
- ✂️ Smart chunking with metadata preservation (source, page)
- 🔍 Semantic search via multilingual embeddings (optimized for French)
- 🤖 Answer generation with Gemini, or with Ollama (local inference) when GDPR (RGPD) compliance is required, strictly grounded in the retrieved context
- 📌 Source citations (document + page) for every answer
- 💬 Chat interface with history (Streamlit)
- 📊 Evaluation harness: annotated dataset, retrieval and generation metrics
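To make "chunking with metadata preservation" concrete, here is a minimal sketch. It is not the project's splitter (which relies on langchain-text-splitters); it uses a plain fixed-size window with overlap, and the names `Chunk`, `chunk_pages`, `chunk_size`, `overlap` and `min_chars` are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str  # file name the chunk came from
    page: int    # 1-based page number


def chunk_pages(pages, source, chunk_size=200, overlap=40, min_chars=20):
    """Split each page into overlapping windows, keeping source and page.

    `pages` is a list of page texts. Windows shorter than `min_chars`
    (for example a bare page number) are dropped.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for page_number, text in enumerate(pages, start=1):
        text = text.strip()
        for start in range(0, len(text), step):
            piece = text[start:start + chunk_size].strip()
            if len(piece) >= min_chars:
                chunks.append(Chunk(piece, source, page_number))
            # Stop once the current window already reaches the page end.
            if start + chunk_size >= len(text):
                break
    return chunks
```

Because every chunk carries its own `source` and `page`, a citation can be produced later without going back to the original file.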
Q: "What is the deadline to notify a personal data breach?"
A: A personal data breach must be notified to the supervisory authority within 72 hours of becoming aware of it [page 52].
📄 Sources: RGPD.pdf — p.51, p.52
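An answer like the one above can cite a page only if the page numbers are visible to the model. The following sketch shows one way a grounded prompt and a source line might be assembled. The prompt wording, the function names `build_prompt` and `format_sources`, and the dictionary keys are assumptions for illustration, not the project's actual prompt.

```python
def build_prompt(question, passages):
    """Assemble a context-only prompt from retrieved passages.

    `passages` is a list of dicts with the hypothetical keys
    "text", "source" and "page".
    """
    context_blocks = [
        f"[{p['source']} | page {p['page']}]\n{p['text']}" for p in passages
    ]
    context = "\n\n".join(context_blocks)
    return (
        "Answer the question using only the context below.\n"
        "Cite the page of every fact as [page N].\n"
        "If the context does not contain the answer, say that you "
        "cannot find it in the document.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


def format_sources(passages):
    """Render the source line shown under an answer, pages deduplicated."""
    by_source = {}
    for p in passages:
        by_source.setdefault(p["source"], set()).add(p["page"])
    return "; ".join(
        f"{source}: " + ", ".join(f"p.{n}" for n in sorted(pages))
        for source, pages in by_source.items()
    )
```

The explicit refusal instruction is what makes a refusal rate measurable later: the model has a sanctioned way to say the answer is not in the document.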
PDF / DOCX
    │
    ▼
  Loader ──────► Splitter ──────► Embedder ──────► ChromaDB
(extraction)    (chunking)    (vectorization)  (vector storage)
                                                      │
                                                      ▼
                              Question ──► Semantic search (top-K)
                                                      │
                                                      ▼
                                        Generation (Gemini or Ollama)
                                                      │
                                                      ▼
                                           Answer + cited sources
The pipeline is exposed through two high-level functions in rag.py:
- ingest(file_path): loads, chunks and indexes a document
- ask(question): retrieves the relevant passages and generates the answer
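The retrieval step inside the question-answering path can be pictured with a toy example. The sketch below ranks stored vectors by cosine similarity in pure Python. In the project this work is delegated to sentence-transformers and ChromaDB, so the functions `cosine_similarity` and `top_k`, the three-dimensional vectors and the payloads are all illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed_chunks, k=3):
    """Return the k (score, payload) pairs most similar to the query.

    `indexed_chunks` is a list of (vector, payload) pairs standing in
    for the vector store.
    """
    scored = [
        (cosine_similarity(query_vector, vector), payload)
        for vector, payload in indexed_chunks
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions.
index = [
    ([1.0, 0.0, 0.0], {"page": 52, "text": "breach notification"}),
    ([0.0, 1.0, 0.0], {"page": 10, "text": "definitions"}),
    ([0.9, 0.1, 0.0], {"page": 51, "text": "supervisory authority"}),
]
results = top_k([1.0, 0.0, 0.0], index, k=2)
pages = [payload["page"] for _, payload in results]
```

Here `pages` contains the two pages whose vectors point in almost the same direction as the query, and the unrelated page is left out.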
| Layer | Tool | Why this choice |
|---|---|---|
| Language | Python 3.11+ | Standard for ML/data work |
| Dependency management | uv | Fast, modern, reproducible locks |
| PDF / DOCX extraction | pypdf, python-docx | Lightweight, no system deps |
| Chunking | langchain-text-splitters | Robust recursive splitting |
| Embeddings | sentence-transformers (paraphrase-multilingual-MiniLM-L12-v2) | Local, free, multilingual (French) |
| Vector store | ChromaDB | Zero-config local persistence |
| LLM | Google Gemini or Ollama | Gemini: generous free tier for development; Ollama: local inference for GDPR (RGPD) compliance |
| UI | Streamlit | Fast Python-native UI |
| Evaluation | custom annotated dataset + custom metrics | Full control over what's measured |
| Tests | pytest | Coverage of core logic and edge cases |
# Clone the repository
git clone https://github.com/YOUR-USERNAME/askmydocs.git
cd askmydocs
# Install dependencies with uv
uv sync
# Configure the API key
cp .env.example .env
# Edit .env and add your Google AI Studio key
Get a free API key from Google AI Studio.
uv run streamlit run app.py
Upload a document in the sidebar, index it, then ask your questions.
# Run the full pipeline on a document
uv run python -m askmydocs.rag data/uploads/my_document.pdf "My question?"
The project includes an evaluation harness that measures the pipeline's performance on a dataset of annotated questions (with reference pages and expected keywords).
uv run python -m askmydocs.eval.runner data/uploads/RGPD.pdf
The harness decouples retrieval from generation so that the source of any failure can be pinpointed: a low hit rate points to a retrieval problem, while a good hit rate paired with refusals points to a prompt or generation problem.
| Metric | What it measures |
|---|---|
| Hit rate | Does retrieval find at least one relevant page? |
| Precision | What proportion of retrieved chunks is relevant? |
| Keyword recall | Does the generated answer contain the expected facts? |
| Refusal rate | How often the LLM responds "I can't find this" |
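The four metrics can be sketched in a few lines of Python. This is one plausible reading of the definitions in the table, not the code in metrics.py: the record keys, the substring-based keyword matching and the refusal markers are all assumptions.

```python
def hit(retrieved_pages, relevant_pages):
    """True if at least one retrieved page is a reference page."""
    return bool(set(retrieved_pages) & set(relevant_pages))


def precision(retrieved_pages, relevant_pages):
    """Share of retrieved chunks whose page is a reference page."""
    if not retrieved_pages:
        return 0.0
    relevant = set(relevant_pages)
    matches = sum(page in relevant for page in retrieved_pages)
    return matches / len(retrieved_pages)


def keyword_recall(answer, expected_keywords):
    """Share of expected keywords found in the answer (case-insensitive)."""
    if not expected_keywords:
        return 1.0
    text = answer.lower()
    found = sum(keyword.lower() in text for keyword in expected_keywords)
    return found / len(expected_keywords)


def is_refusal(answer, markers=("can't find", "cannot find")):
    """Heuristic refusal detection; the marker strings are assumptions."""
    text = answer.lower()
    return any(marker in text for marker in markers)


def summarize(records):
    """Average the four metrics over a list of evaluation records.

    Each record is a dict with the hypothetical keys "retrieved_pages",
    "relevant_pages", "answer" and "expected_keywords".
    """
    n = len(records)
    if n == 0:
        raise ValueError("no records to summarize")
    return {
        "hit_rate": sum(
            hit(r["retrieved_pages"], r["relevant_pages"]) for r in records
        ) / n,
        "precision": sum(
            precision(r["retrieved_pages"], r["relevant_pages"]) for r in records
        ) / n,
        "keyword_recall": sum(
            keyword_recall(r["answer"], r["expected_keywords"]) for r in records
        ) / n,
        "refusal_rate": sum(is_refusal(r["answer"]) for r in records) / n,
    }
```

Hit rate and precision depend only on the retrieved pages, while keyword recall and refusal rate depend only on the generated answer, which is what lets a failure be attributed to one stage or the other.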
| Metric | Score |
|---|---|
| Hit rate | TBD |
| Precision | TBD |
| Keyword recall | TBD |
| Refusal rate | TBD |
Reading the numbers: precision is expected to be low on this corpus — with only 1–2 relevant pages out of 88, even perfect retrieval cannot score high. The meaningful signals here are hit rate (is the right page found?) and keyword recall (is the answer factually correct?).
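The ceiling on precision is simple arithmetic. The sketch below assumes that a question has a fixed number of relevant chunks and that retrieval returns exactly K chunks. The value of K is not stated here, so the values in the loop are illustrative, and a relevant page split into several chunks would raise the ceiling.

```python
def precision_ceiling(relevant_chunks, k):
    """Best achievable precision when k chunks are retrieved."""
    if k <= 0:
        raise ValueError("k must be positive")
    return min(relevant_chunks, k) / k


# With 2 relevant chunks, the ceiling falls as more chunks are retrieved.
for k in (3, 5, 10):
    print(f"K={k}: ceiling={precision_ceiling(2, k):.2f}")
```

With two relevant chunks and K = 5, for instance, a perfect retriever still scores only 0.4.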
A few real problems solved while building this — and what they taught me:
- Embedding model / language mismatch — the initial English-centric model poorly separated French text (a discriminative gap of only ~0.08 between related and unrelated sentence pairs). Diagnosing this and switching to a multilingual model more than doubled the gap. Lesson: match the embedding model to the language of your corpus, and measure it rather than assume.
- Extraction noise → index pollution — figure-heavy pages produced 1-character chunks (bare page numbers) that polluted the vector index and surfaced as irrelevant top results. Added a minimum-length filter, covered by a unit test. Lesson: in RAG, ingestion quality matters as much as the model — garbage in, garbage out.
- External API resilience — the LLM API intermittently returned 429 (rate limit) and 503 (overload) responses during batch evaluation. Added retry-with-backoff covering both, plus graceful degradation so a single failure doesn't discard the whole run.
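The resilience pattern from the last point can be sketched as follows. This is a generic retry with exponential backoff and jitter plus per-question error capture, not the project's code: `ApiError`, `call_with_retry`, `run_batch` and the default delays are hypothetical, and a real client library would raise its own exception types.

```python
import random
import time

RETRYABLE_STATUS = {429, 503}


class ApiError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"API returned {status}")
        self.status = status


def call_with_retry(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `func`, retrying on 429/503 with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except ApiError as error:
            # Give up on non-retryable errors or when attempts are exhausted.
            if error.status not in RETRYABLE_STATUS or attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))


def run_batch(questions, answer_fn, **retry_options):
    """Answer every question; record failures instead of aborting the run."""
    results = []
    for question in questions:
        try:
            answer = call_with_retry(lambda: answer_fn(question), **retry_options)
            results.append({"question": question, "answer": answer, "error": None})
        except ApiError as error:
            results.append({"question": question, "answer": None, "error": str(error)})
    return results
```

Passing `sleep` as a parameter keeps the backoff testable without real waiting, and storing the error next to the question means one failed call leaves the rest of the evaluation run usable.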
The notebook notebooks/01_exploration.ipynb projects the corpus embedding space into 2D (PCA). It shows the thematic clustering of chunks and the position of a query among the relevant passages.
askmydocs/
├── src/askmydocs/
│   ├── config.py        # Centralized configuration
│   ├── types.py         # Shared types (TypedDict)
│   ├── loader.py        # PDF / DOCX extraction
│   ├── splitter.py      # Chunking
│   ├── embedder.py      # Embedding generation
│   ├── vectorstore.py   # Storage and search (ChromaDB)
│   ├── rag.py           # Pipeline orchestration
│   ├── eval/            # Evaluation harness
│   │   ├── metrics.py
│   │   └── runner.py
│   └── llm/             # Answer generation
│       ├── gemini.py    # Use Gemini for performance
│       ├── ollama.py    # Use Ollama for data privacy
│       └── pompt.py     # Prompt used by the LLM
├── notebooks/           # Exploration and visualization
├── tests/               # Unit tests (pytest)
├── data/                # Documents and evaluation dataset
└── app.py               # Streamlit interface
uv run pytest -v
- Result re-ranking with a cross-encoder
- Hybrid search (semantic + keyword / BM25)
- Expand the evaluation dataset
- Containerized deployment (Docker) to the cloud
- OCR support for scanned PDFs
MIT




