▶ Live demo: insightrag.dexdevs.com (runs in your browser on the free offline backend). Browse all 10 portfolio demos via the "all demos" link.
Production-grade, agentic RAG over your own documents: hybrid retrieval (BM25 + dense), cross-encoder reranking, grounded answers with citations, RAGAS-style evaluation, and safety guardrails. Runs fully offline and free by default (no API key, no GPU), and swaps to real local or hosted models with one environment variable per component.
Try it locally in one line:
streamlit run ui/streamlit_app.py
Load the bundled sample docs and ask away: no setup, no keys.
Hosted demo (Hugging Face Space / Streamlit Community Cloud) link and a 20–30 s screen-capture GIF will be added here. See docs/ for the deployment notes.
On the bundled 12-question EdTech/finance benchmark, with the default fully offline backends:
- Recall@5 = 0.92, MRR = 0.93: hybrid retrieval matches or beats BM25-only and dense-only on every metric.
- Cross-encoder reranking cuts prompt context by ~58% (≈933 → ≈389 tokens/query) while keeping the right passage on top, directly lowering LLM cost and latency.
- Faithfulness = 0.91 with an honest refusal path when retrieval is weak (the assistant says "I don't know" instead of hallucinating).
- Prompt-injection detection F1 = 1.0 and PII redaction accuracy = 1.0 on the bundled labelled sets.
Every number above is reproducible with one command and no API key; see Evaluation.
flowchart LR
subgraph Ingest
A[PDF / Markdown / TXT] --> B[Chunker]
B --> C[Embedder]
C --> D[(Vector store)]
end
Q[User query] --> G1{Injection guard}
G1 -- blocked --> X[Refuse + flag]
G1 -- ok --> R[Hybrid retrieval<br/>BM25 + dense]
D --> R
R --> RR[Cross-encoder rerank]
RR --> GEN{Enough context?}
GEN -- no --> RF[Refuse honestly]
GEN -- yes --> L[LLM: grounded answer + citations]
L --> G2[PII redaction]
G2 --> ANS[Answer + citations + latency]
Full design rationale and trade-offs: docs/architecture.md.
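The flowchart shows BM25 and dense results feeding one hybrid retrieval step, but this README does not say how the two rankings are merged. One common option is reciprocal rank fusion (RRF). The sketch below assumes that technique; `rrf_fuse`, the constant `k=60`, and the document ids are illustrative and not taken from the repo.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc ids with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Hypothetical helper, not the repo's API.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_ranking = ["doc_a", "doc_b", "doc_c"]   # made-up lexical ranking
dense_ranking = ["doc_b", "doc_d", "doc_a"]  # made-up dense ranking
fused = rrf_fuse([bm25_ranking, dense_ranking])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that both retrievers rank highly (`doc_b`, `doc_a`) rise to the top, which is the behaviour a hybrid step is after. RRF needs only ranks, so it avoids normalising BM25 and cosine scores onto a common scale.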
git clone https://github.com/ranafaraz/InsightRAG.git
cd InsightRAG
pip install -e ".[dev]"
# Ask a question (ingests a file/dir first)
python -m rag.cli ask "What does InsightRAG use to rerank candidates?" --path docs/
uvicorn app.main:app --reload
# -> http://localhost:8000/docs
curl -X POST localhost:8000/ingest/text -H "Content-Type: application/json" \
-d '{"texts":["InsightRAG reranks candidates with a cross-encoder."]}'
curl -X POST localhost:8000/chat -H "Content-Type: application/json" \
-d '{"query":"What reranks candidates?"}'
pip install -e ".[ui]"
streamlit run ui/streamlit_app.py
docker compose up --build # API on http://localhost:8000
The default backends are deterministic and offline. Switch to real backends via .env (copy from .env.example):
| Component | Offline default | Real backend | Env var |
|---|---|---|---|
| Embeddings | hash | sentence-transformers (BGE) | EMBEDDING_BACKEND |
| Reranker | lexical | cross-encoder (MiniLM) | RERANK_BACKEND |
| LLM | stub | ollama (Llama 3.1) / openai | LLM_BACKEND |
| Vector store | memory | chroma (persistent) | VECTOR_STORE |
pip install -e ".[local]" # sentence-transformers + chroma
# .env:
# EMBEDDING_BACKEND=sentence-transformers
# RERANK_BACKEND=cross-encoder
# LLM_BACKEND=ollama (needs `ollama serve` + `ollama pull llama3.1:8b`)
# VECTOR_STORE=chroma
For OpenAI: set LLM_BACKEND=openai and OPENAI_API_KEY in .env. Keys are never committed.
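To illustrate how an environment variable can select a backend behind a small interface, here is a minimal sketch. Only the `EMBEDDING_BACKEND` variable and its `hash` default come from this README; the `Embedder` protocol, `HashEmbedder`, `make_embedder`, and the 64-dimensional vector size are hypothetical names and values chosen for the example, and the repo's actual providers may look different.

```python
import hashlib
import os
from typing import List, Optional, Protocol


class Embedder(Protocol):
    """Anything with an embed() method can serve as an embedding backend."""

    def embed(self, text: str) -> List[float]: ...


class HashEmbedder:
    """Deterministic bag-of-words embedder: each token is hashed into a bucket."""

    def __init__(self, dim: int = 64) -> None:
        self.dim = dim

    def embed(self, text: str) -> List[float]:
        vec = [0.0] * self.dim
        for token in text.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "big") % self.dim
            vec[bucket] += 1.0  # count tokens per bucket
        return vec


def make_embedder(backend: Optional[str] = None) -> Embedder:
    """Pick a backend by name, falling back to the EMBEDDING_BACKEND env var."""
    name = backend or os.environ.get("EMBEDDING_BACKEND", "hash")
    if name == "hash":
        return HashEmbedder()
    # A real registry would construct sentence-transformers etc. here.
    raise ValueError(f"backend not covered by this sketch: {name!r}")


embedder = make_embedder("hash")
vector = embedder.embed("hybrid retrieval beats hybrid guessing")
print(len(vector), sum(vector))  # 64 5.0
```

Because the hash embedder needs no model download and always returns the same vector for the same text, a test suite built on it can run offline and give identical results on every machine.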
python -m eval_harness.harness # regenerates eval_harness/RESULTS.md
Retrieval ablation (offline backends, 12 questions / 12 docs):
| Configuration | Recall@3 | Recall@5 | MRR | Context Precision@5 |
|---|---|---|---|---|
| BM25 only | 0.917 | 0.917 | 0.929 | 0.183 |
| Dense only | 0.917 | 0.917 | 0.819 | 0.183 |
| Hybrid (BM25+dense) | 0.917 | 0.917 | 0.931 | 0.183 |
| Hybrid + reranker | 0.917 | 0.917 | 0.917 | 0.183 |
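For readers unfamiliar with the column headers, the following sketch shows the standard definitions of Recall@k and MRR on a toy run. The function names and the three toy queries are made up for illustration; the harness in `eval_harness/` may structure its metrics differently.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant ids that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Toy data: (ranked results, relevant ids) per query.
runs = [
    (["d1", "d2", "d3"], ["d1"]),  # relevant doc at rank 1
    (["d4", "d5", "d6"], ["d5"]),  # relevant doc at rank 2
    (["d7", "d8", "d9"], ["d0"]),  # relevant doc not retrieved
]
recall_at_3 = mean(recall_at_k(ranked, rel, 3) for ranked, rel in runs)
mrr = mean(reciprocal_rank(ranked, rel) for ranked, rel in runs)
print(round(recall_at_3, 3), round(mrr, 3))  # 0.667 0.5
```

Recall@k only asks whether the right passage made the cut, while MRR rewards putting it first, which is why two configurations in the table can share a recall value and still differ on MRR.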
Answer quality & cost:
| Metric | Value |
|---|---|
| Faithfulness | 0.907 |
| Answer relevancy | 0.212 |
| Answer correctness (token F1) | 0.415 |
| Avg latency / query | 1.7 ms |
| Prompt tokens before → after rerank | ≈933 → ≈389 (−58%) |
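The "token F1" in the answer-correctness row is usually computed as the harmonic mean of token-level precision and recall between the generated and reference answers. The sketch below assumes that standard definition with plain whitespace tokenisation; `token_f1` is an illustrative name, and the harness may normalise text differently.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between two strings (lowercased, whitespace-split)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers count as a match; one empty answer does not.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1(
    "a cross-encoder reranks candidates",
    "InsightRAG reranks candidates with a cross-encoder",
)
print(round(score, 3))  # 0.8
```

Here all four predicted tokens appear in the six-token reference, so precision is 1.0, recall is 4/6, and F1 is 0.8. A short but correct answer is penalised on recall, which is one reason token F1 can look modest even when answers are faithful.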
Guardrails:
| Guardrail | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|
| Prompt-injection detection | 1.000 | 1.000 | 1.000 | 1.000 |
| PII redaction | n/a | n/a | n/a | 1.000 |
These come from the deterministic offline backends, so they regenerate identically on any machine. The hashing embedder is bag-of-words, so the BM25-vs-dense gap is intentionally small here. The CI eval gate (eval_harness/gate.py) fails the build if recall@5, faithfulness, injection-F1 or PII accuracy regresses.
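A gate of this kind can be as simple as comparing each metric with a minimum value. The sketch below is an assumption about the general shape, not a copy of `eval_harness/gate.py`: the threshold numbers, the metric keys, and the `failed_metrics` helper are all hypothetical.

```python
# Hypothetical minimums; the real thresholds live in eval_harness/gate.py.
THRESHOLDS = {
    "recall@5": 0.90,
    "faithfulness": 0.85,
    "injection_f1": 0.95,
    "pii_accuracy": 0.95,
}


def failed_metrics(results, thresholds=THRESHOLDS):
    """Return the gated metrics that are missing or below their minimum."""
    return [
        name
        for name, minimum in thresholds.items()
        if results.get(name, 0.0) < minimum
    ]


# Values from the tables above, plus a simulated faithfulness regression.
current = {
    "recall@5": 0.917,
    "faithfulness": 0.907,
    "injection_f1": 1.0,
    "pii_accuracy": 1.0,
}
regressed = dict(current, faithfulness=0.70)

print(failed_metrics(current))    # []
print(failed_metrics(regressed))  # ['faithfulness']
```

In CI, a script like this would exit with a non-zero status whenever the returned list is non-empty, which is what makes the build fail.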
With real models (sentence-transformers BGE + cross-encoder MiniLM; full table in eval_harness/RESULTS_real_models.md):
| Configuration | Recall@3 | Recall@5 | MRR |
|---|---|---|---|
| BM25 only | 1.000 | 1.000 | 0.944 |
| Dense only | 1.000 | 1.000 | 1.000 |
| Hybrid (BM25+dense) | 1.000 | 1.000 | 1.000 |
Semantic dense retrieval lifts MRR from 0.944 to 1.000: it ranks the right passage first on the queries where BM25 gets close but not first. Regenerate with:
EMBEDDING_BACKEND=sentence-transformers RERANK_BACKEND=cross-encoder python -m eval_harness.harness
rag/            ingest · chunk · retrieve · rerank · generate · pipeline · providers/
app/            FastAPI service (/health, /ingest, /chat)
guardrails/     prompt-injection screen + PII redaction
ui/             Streamlit demo
eval_harness/   metrics · ablation harness · CI eval gate · bundled benchmark
tests/          pytest suite (offline, no downloads)
docs/           architecture + diagrams + demo GIF
- Provider-agnostic from day one. Embeddings, reranking, generation and storage each sit behind a tiny protocol with a deterministic offline backend. That makes the repo trivially reproducible (CI runs the whole pipeline with no model downloads) and means swapping Llama for OpenAI is a one-line config change, not a rewrite.
- Evaluation is a first-class citizen, not a screenshot. Recall@k, RAGAS-style faithfulness/relevancy, and guardrail precision/recall are computed by a harness and gated in CI, the same discipline that keeps a production RAG system from silently regressing.
- Honest behaviour over impressive demos. The system refuses when context is insufficient and cites the exact passage behind every claim, which is what actually builds user trust.
See docs/architecture.md for the full decision log.
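The refusal behaviour described in the last point can be pictured as a check on retrieval strength before any answer is generated. The sketch below assumes the check is a minimum rerank score; the README does not state the actual criterion, and `answer_or_refuse`, `min_score=0.5`, and the sample passages are hypothetical.

```python
REFUSAL = "I don't know"


def answer_or_refuse(passages, min_score=0.5):
    """Return a cited answer, or refuse when no passage clears min_score.

    `passages` is a list of (text, rerank_score) pairs. The threshold rule
    is an assumption made for this sketch.
    """
    supported = [(text, score) for text, score in passages if score >= min_score]
    if not supported:
        return {"answer": REFUSAL, "citations": []}
    supported.sort(key=lambda pair: pair[1], reverse=True)
    best_text, _ = supported[0]
    # Stand-in for the LLM call: echo the strongest passage as the answer.
    return {"answer": best_text, "citations": [text for text, _ in supported]}


weak = [("Unrelated passage about billing.", 0.12)]
strong = [
    ("InsightRAG reranks candidates with a cross-encoder.", 0.91),
    ("Chunks are embedded at ingest time.", 0.20),
]
print(answer_or_refuse(weak)["answer"])       # I don't know
print(answer_or_refuse(strong)["citations"])  # only the 0.91 passage
```

Only passages that clear the threshold are returned as citations, so every cited passage is one the answer could actually have drawn on.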
pip install -e ".[dev]"
ruff check .
pytest -q # 27 tests, fully offline
GitHub Actions runs lint + tests on Python 3.10–3.12 and the eval gate on every push/PR.
MIT © Rana Faraz