📚 AskMyDocs

Retrieval-Augmented Generation is the backbone of "chat with your documents" systems. This project implements one end to end — and measures whether it actually works.

A conversational assistant that answers questions about your documents (PDF, Word) through a complete RAG pipeline, with source citations and quantitative evaluation of answer quality.

[Screenshot: AskMyDocs interface]

🎥 Showcase

How to run Streamlit

Load your document

How it works

🎯 Overview

AskMyDocs lets you query a document in natural language. You ask a question; it retrieves the relevant passages from the document and generates a sourced answer. The model is instructed to rely solely on the retrieved content, which limits hallucination.

The project implements the full RAG (Retrieval-Augmented Generation) chain, from document ingestion to answer generation, plus an evaluation harness to objectively measure its performance.

✨ Features

  • 📄 Ingestion of PDF and Word (.docx) documents
  • ✂️ Smart chunking with metadata preservation (source, page)
  • 🔍 Semantic search via multilingual embeddings (optimized for French)
  • 🤖 Answer generation with Gemini, or with a local Ollama model for GDPR compliance, strictly grounded in the retrieved context
  • 📌 Source citations (document + page) for every answer
  • 💬 Chat interface with history (Streamlit)
  • 📊 Evaluation harness: annotated dataset, retrieval and generation metrics

💬 Example

Q: "What is the deadline to notify a personal data breach?"

A: A personal data breach must be notified to the supervisory authority within 72 hours of becoming aware of it [page 52].

📄 Sources: RGPD.pdf — p.51, p.52

🏗️ Architecture

PDF / DOCX
    │
    ▼
  Loader  ──────►  Splitter  ──────►  Embedder  ──────►  ChromaDB
(extraction)     (chunking)      (vectorization)    (vector storage)
                                                            │
                                                            ▼
                              Question ──► Semantic search (top-K)
                                                            │
                                                            ▼
                                            Generation (Gemini or Ollama)
                                                            │
                                                            ▼
                                              Answer + cited sources

The pipeline is exposed through two high-level functions in rag.py:

  • ingest(file_path) — loads, chunks and indexes a document
  • ask(question) — retrieves the relevant passages and generates the answer
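
As a rough illustration of what these two functions coordinate, the sketch below wires the stages together in plain Python. It is not the project's code: `ToyIndex`, `Chunk`, the bag-of-words `embed`, the fixed-size character chunking and the `generate` callback are all assumptions standing in for the real loader, recursive splitter, sentence-transformers model, ChromaDB and LLM client.

```python
import math
import re
from collections import Counter
from collections.abc import Callable
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str  # file the chunk came from
    page: int    # 1-based page number, kept for citations


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real pipeline would call an embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyIndex:
    """In-memory stand-in for the vector store (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.items: list[tuple[Counter, Chunk]] = []

    def ingest(self, pages: list[str], source: str, chunk_size: int = 200) -> int:
        """Chunk each page, attach (source, page) metadata, index the chunks."""
        added = 0
        for page_number, page_text in enumerate(pages, start=1):
            for start in range(0, len(page_text), chunk_size):
                piece = page_text[start:start + chunk_size].strip()
                if piece:
                    self.items.append((embed(piece), Chunk(piece, source, page_number)))
                    added += 1
        return added

    def search(self, question: str, top_k: int = 3) -> list[Chunk]:
        """Return the top_k chunks ranked by cosine similarity to the question."""
        query = embed(question)
        ranked = sorted(self.items, key=lambda item: cosine(query, item[0]), reverse=True)
        return [chunk for _, chunk in ranked[:top_k]]

    def ask(self, question: str, generate: Callable[[str], str], top_k: int = 3) -> dict:
        """Retrieve context, build a grounded prompt, and delegate to the LLM callback."""
        chunks = self.search(question, top_k)
        context = "\n".join(f"[page {c.page}] {c.text}" for c in chunks)
        prompt = (
            "Answer using only the context below and cite pages. "
            "If the answer is not in the context, say you cannot find it.\n\n"
            f"{context}\n\nQuestion: {question}"
        )
        return {"answer": generate(prompt), "sources": [(c.source, c.page) for c in chunks]}


# Demo with a stubbed LLM: only retrieval and citation plumbing are exercised.
index = ToyIndex()
index.ingest(
    ["Cats sleep most of the day.", "A breach must be notified within 72 hours."],
    source="demo.pdf",
)
result = index.ask("When must a breach be notified?", generate=lambda prompt: "stub answer", top_k=1)
print(result["sources"])  # [('demo.pdf', 2)]
```

Passing the LLM as a callback keeps retrieval testable without network access, which is also what makes it possible to evaluate retrieval and generation separately.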

🛠️ Tech Stack

| Layer | Tool | Why this choice |
| --- | --- | --- |
| Language | Python 3.11+ | Standard for ML/data work |
| Dependency management | uv | Fast, modern, reproducible locks |
| PDF / DOCX extraction | pypdf, python-docx | Lightweight, no system deps |
| Chunking | langchain-text-splitters | Robust recursive splitting |
| Embeddings | sentence-transformers (paraphrase-multilingual-MiniLM-L12-v2) | Local, free, multilingual (French) |
| Vector store | ChromaDB | Zero-config local persistence |
| LLM | Google Gemini or Ollama | Gemini: generous free tier for development; Ollama: local inference for GDPR compliance |
| UI | Streamlit | Fast Python-native UI |
| Evaluation | custom annotated dataset + custom metrics | Full control over what's measured |
| Tests | pytest | Coverage of core logic and edge cases |

🚀 Installation

# Clone the repository
git clone https://github.com/YOUR-USERNAME/askmydocs.git
cd askmydocs

# Install dependencies with uv
uv sync

# Configure the API key
cp .env.example .env
# Edit .env and add your Google AI Studio key

Get a free API key from Google AI Studio.

💻 Usage

Run the application

uv run streamlit run app.py

Upload a document in the sidebar, index it, then ask your questions.

Command line

# Run the full pipeline on a document
uv run python -m askmydocs.rag data/uploads/my_document.pdf "My question?"

📊 Evaluation

The project includes an evaluation harness that measures the pipeline's performance on a dataset of annotated questions (with reference pages and expected keywords).

uv run python -m askmydocs.eval.runner data/uploads/RGPD.pdf

Metrics measured

The harness decouples retrieval from generation, to precisely diagnose the source of any failure — a low hit rate points to a retrieval problem, while a good hit rate paired with refusals points to a prompt or generation problem.
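
One way to turn this diagnostic rule into a check, as a sketch only: the function name and both thresholds are illustrative assumptions, not values taken from the project.

```python
def diagnose(
    hit_rate: float,
    refusal_rate: float,
    min_hit_rate: float = 0.8,      # assumed threshold, tune per corpus
    max_refusal_rate: float = 0.2,  # assumed threshold, tune per corpus
) -> str:
    """Point at the stage most likely responsible for poor answers."""
    if hit_rate < min_hit_rate:
        # The right pages are not being retrieved, so generation cannot succeed.
        return "retrieval"
    if refusal_rate > max_refusal_rate:
        # The right pages are retrieved but the model still declines to answer.
        return "generation"
    return "ok"


print(diagnose(hit_rate=0.45, refusal_rate=0.10))  # retrieval
print(diagnose(hit_rate=0.95, refusal_rate=0.40))  # generation
```

Retrieval is checked first because a low hit rate makes the refusal rate uninformative: refusing is the correct behaviour when the context does not contain the answer.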

| Metric | What it measures |
| --- | --- |
| Hit rate | Does retrieval find at least one relevant page? |
| Precision | What proportion of retrieved chunks is relevant? |
| Keyword recall | Does the generated answer contain the expected facts? |
| Refusal rate | How often the LLM responds "I can't find this" |
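
The four metrics could be computed along these lines. This is a sketch under assumptions: the `EvalRecord` fields, the page-level definition of relevance and the refusal phrase are hypothetical and may differ from what `eval/metrics.py` actually does.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One evaluated question (field names are illustrative)."""
    retrieved_pages: list[int]    # page of each retrieved chunk, in rank order
    reference_pages: set[int]     # pages annotated as relevant
    expected_keywords: list[str]  # facts the answer should mention
    answer: str                   # text produced by the LLM


REFUSAL_MARKER = "i can't find"  # assumed refusal phrase; depends on the prompt


def hit(record: EvalRecord) -> bool:
    """True if at least one retrieved chunk comes from a reference page."""
    return any(page in record.reference_pages for page in record.retrieved_pages)


def precision(record: EvalRecord) -> float:
    """Share of retrieved chunks that come from a reference page."""
    if not record.retrieved_pages:
        return 0.0
    relevant = sum(page in record.reference_pages for page in record.retrieved_pages)
    return relevant / len(record.retrieved_pages)


def keyword_recall(record: EvalRecord) -> float:
    """Share of expected keywords found in the answer (case-insensitive)."""
    if not record.expected_keywords:
        return 1.0
    answer = record.answer.lower()
    found = sum(keyword.lower() in answer for keyword in record.expected_keywords)
    return found / len(record.expected_keywords)


def is_refusal(record: EvalRecord) -> bool:
    return REFUSAL_MARKER in record.answer.lower()


def summarize(records: list[EvalRecord]) -> dict[str, float]:
    """Average each per-question metric over the dataset."""
    if not records:
        raise ValueError("no records to summarize")
    n = len(records)
    return {
        "hit_rate": sum(hit(r) for r in records) / n,
        "precision": sum(precision(r) for r in records) / n,
        "keyword_recall": sum(keyword_recall(r) for r in records) / n,
        "refusal_rate": sum(is_refusal(r) for r in records) / n,
    }


records = [
    EvalRecord([52, 51, 10, 3], {52}, ["72 hours"], "Within 72 hours [page 52]."),
    EvalRecord([1, 2], {40}, ["20 million"], "I can't find this in the document."),
]
print(summarize(records))
```

Note that the first record is a hit with full keyword recall even though its precision is only 0.25: one chunk in four comes from the reference page.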

Results on the GDPR document (88 pages, 626 chunks)

| Metric | Score |
| --- | --- |
| Hit rate | TBD |
| Precision | TBD |
| Keyword recall | TBD |
| Refusal rate | TBD |

Reading the numbers: precision is expected to be low on this corpus — with only 1–2 relevant pages out of 88, even perfect retrieval cannot score high. The meaningful signals here are hit rate (is the right page found?) and keyword recall (is the answer factually correct?).

🧠 Engineering notes

A few real problems solved while building this — and what they taught me:

  • Embedding model / language mismatch — the initial English-centric model poorly separated French text (a discriminative gap of only ~0.08 between related and unrelated sentence pairs). Diagnosing this and switching to a multilingual model more than doubled the gap. Lesson: match the embedding model to the language of your corpus, and measure it rather than assume.
  • Extraction noise → index pollution — figure-heavy pages produced 1-character chunks (bare page numbers) that polluted the vector index and surfaced as irrelevant top results. Added a minimum-length filter, covered by a unit test. Lesson: in RAG, ingestion quality matters as much as the model — garbage in, garbage out.
  • External API resilience — the LLM API intermittently returned 429 (rate limit) and 503 (overload) responses during batch evaluation. Added retry-with-backoff covering both, plus graceful degradation so a single failure doesn't discard the whole run.
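
The three fixes above can be sketched as follows. Everything here is an assumption made for illustration: the gap is taken to be the difference of mean cosine similarities, the 20-character threshold is arbitrary, and `TransientAPIError` is a hypothetical exception standing in for whatever the real LLM client raises.

```python
import math
import random
import time


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def discriminative_gap(related_pairs, unrelated_pairs) -> float:
    """Mean similarity of related pairs minus mean similarity of unrelated pairs."""
    related = [cosine(a, b) for a, b in related_pairs]
    unrelated = [cosine(a, b) for a, b in unrelated_pairs]
    return sum(related) / len(related) - sum(unrelated) / len(unrelated)


def drop_short_chunks(chunks: list[str], min_chars: int = 20) -> list[str]:
    """Discard chunks too short to carry meaning, such as bare page numbers."""
    return [chunk for chunk in chunks if len(chunk.strip()) >= min_chars]


class TransientAPIError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed call."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_retry(fn, max_attempts: int = 5, base_delay: float = 1.0, sleep=time.sleep):
    """Call fn(), retrying on 429/503 with exponential backoff plus jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientAPIError as err:
            retryable = err.status in (429, 503)
            if not retryable or attempt == max_attempts - 1:
                raise  # non-retryable status, or retries exhausted
            # Delay doubles each attempt; jitter avoids synchronized retries.
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

The `sleep` parameter is injectable so that unit tests can run the retry logic without actually waiting.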

🔬 Embedding visualization

The notebook notebooks/01_exploration.ipynb projects the corpus embedding space into 2D (PCA). It shows the thematic clustering of chunks and the position of a query among the relevant passages.

[Figure: PCA visualization of embeddings]

📁 Project structure

askmydocs/
├── src/askmydocs/
│   ├── config.py          # Centralized configuration
│   ├── types.py           # Shared types (TypedDict)
│   ├── loader.py          # PDF / DOCX extraction
│   ├── splitter.py        # Chunking
│   ├── embedder.py        # Embedding generation
│   ├── vectorstore.py     # Storage and search (ChromaDB)
│   ├── rag.py             # Pipeline orchestration
│   ├── eval/              # Evaluation harness
│   │   ├── metrics.py
│   │   └── runner.py
│   └── llm/               # Answer generation
│       ├── gemini.py      # Gemini backend, for performance
│       ├── ollama.py      # Ollama backend, for data privacy
│       └── pompt.py       # Prompt used for the LLM
├── notebooks/             # Exploration and visualization
├── tests/                 # Unit tests (pytest)
├── data/                  # Documents and evaluation dataset
└── app.py                 # Streamlit interface

🧪 Tests

uv run pytest -v

🔭 Possible improvements

  • Result re-ranking with a cross-encoder
  • Hybrid search (semantic + keyword / BM25)
  • Expand the evaluation dataset
  • Containerized deployment (Docker) to the cloud
  • OCR support for scanned PDFs

📄 License

MIT
