Production RAG engine built in Java and Python. The HNSW vector index is implemented from scratch — no FAISS, no ChromaDB, no Pinecone.
84 vectors · 31 Wikipedia articles · 35ms query latency · recall@10 = 0.984
Search "difference between BERT and GPT" across 31 Wikipedia articles — results from 4 different source documents returned in 35ms:
| Rank | Source | RRF Score |
|---|---|---|
| 1 | wikipedia-transformer_chunk_2 | 0.0325 |
| 2 | gpt-4_chunk_0 | 0.0317 |
| 3 | bert_language_model_chunk_1 | 0.0308 |
| 4 | bert_language_model_chunk_2 | 0.0303 |
flowchart TD
Client(["Client
(Browser / curl)"])
Client -->|POST /ingest| DC
Client -->|POST /search| QC
Client -->|GET /| Dashboard["React Dashboard
live metrics + search UI"]
subgraph Ingest Pipeline
DC["DocumentChunker
sentence-aware
sliding window
overlap=2 sentences"]
EC1["EmbeddingClient
batch embed ~28ms"]
DC --> EC1
end
subgraph Search Pipeline
QC["QueryCache
LRU 10k entries
ReadWriteLock
hit = <1ms"]
EC2["EmbeddingClient
all-MiniLM-L6-v2
384-dim, ~4ms warm"]
QC -->|miss| EC2
QC -->|hit| Results
end
EC1 --> HNSW
EC1 --> BM25
EC1 --> IP
EC2 --> HNSW
EC2 --> BM25
HNSW["HnswIndex
hand-rolled HNSW
M=16, ef=200
recall@10=0.984"]
BM25["BM25Index
inverted index
k1=1.5, b=0.75"]
IP["IndexPersistence
binary float32
atomic rename
<15ms load"]
HNSW --> RRF
BM25 --> RRF
RRF["ReciprocalRankFusion
RRF = 1/60+rank_dense
+ 1/60+rank_bm25"]
RRF --> Results(["Ranked Results"])
Request flow: Client → QueryCache → embed query → HNSW + BM25 in parallel → RRF fusion → ranked chunks
Ingest pipeline:
- Document arrives at POST /ingest
- DocumentChunker splits text into overlapping sentence chunks (~256 tokens, 2-sentence overlap)
- EmbeddingClient sends chunks in one batch to the Python embedding service (~28ms)
- Each chunk vector inserted into HnswIndex and BM25Index
- IndexPersistence serializes to disk atomically — index survives restarts
Search pipeline:
- Query arrives at POST /search
- QueryCache checks LRU cache — hit returns instantly (<1ms)
- Cache miss — EmbeddingClient embeds the query (~4ms warm)
- HnswIndex.search() beam search finds nearest dense neighbors
- BM25Index.search() keyword scoring runs in parallel
- ReciprocalRankFusion merges both ranked lists
- Top-K chunks returned with RRF scores
Implements Malkov and Yashunin 2018 — the algorithm inside Pinecone, Weaviate, and Qdrant.
Layer 2: entry ──────────────────────── nodeA
Layer 1: entry ──── nodeB ──────────── nodeA ──── nodeC
Layer 0: entry ─ n ─ n ─ nodeB ─ n ─ nodeA ─ n ─ nodeC ─ ...
(all nodes, dense local connections)
- Layer assignment: floor(-ln(uniform) * mL) where mL = 1/ln(M)
- Insert: greedy descent from top layer, beam search, bidirectional connections
- Search: greedy descent to layer 1, beam search at layer 0 with efSearch candidates
- Parameters: M=16, efConstruction=200, efSearch=50
Standard BM25 scoring — catches exact keyword matches that dense retrieval misses.
score(q,d) = sum[ IDF(t) * (TF * (k1+1)) / (TF + k1*(1-b + b*|d|/avgdl)) ]
k1 = 1.5 (term frequency saturation)
b = 0.75 (document length normalization)
Combines dense and sparse results without normalizing scores across different scales.
RRF(d) = 1/(60 + rank_dense) + 1/(60 + rank_bm25)
Only uses rank position — dense cosine distances and BM25 scores never need to be compared directly.
LRU cache with ReadWriteLock:
- Many concurrent readers allowed simultaneously, never blocking each other
- Writers get exclusive access only during cache updates
- 10,000 entry capacity, evicts least-recently-used on overflow
Atomic write pattern — same durability approach as LevelDB and RocksDB:
- Write HNSW graph to hnsw.index.tmp (float32 vectors + neighbor lists)
- Write metadata to hnsw.meta.tmp (JSON — config, entry point, layer count)
- Atomic rename .tmp to final file
- Crash before rename leaves old index intact
Gao et al. 2022. Short queries embed in question-space; documents embed in answer-space. Gap causes poor retrieval on vague queries.
Solution: generate a hypothetical answer with the LLM, embed that instead. The hypothetical answer lives in answer-space, much closer to real document embeddings.
| Metric | Value |
|---|---|
| HNSW Recall@10 (1000 vectors, 50 queries) | 0.984 |
| Embedding — cold start | ~1500ms |
| Embedding — warm single text | ~4ms |
| Embedding — warm batch of 3 | ~28ms |
| Search — cold (no cache) | ~35ms |
| Search — cached | <1ms |
| Index load from disk on restart | <15ms |
| Dataset | 31 Wikipedia articles, 84 vectors |
Full results: benchmarks/RESULTS.md
Requirements: Java 17+, Maven, Python 3.10+
git clone https://github.com/sameer-sde/vektr.git
cd vektr
# Start embedding service
pip install -r ml/requirements.txt
python3 ml/embed_server.py &
# Build and start server
mvn package -q -DskipTests
java -Xmx512m -jar target/vektr-1.0.0.jar &
# Open dashboard
open http://localhost:8080
# Run recall benchmark
mvn test
# Bulk ingest Wikipedia articles
python3 ml/ingest_wikipedia.py# Ingest a document
curl -X POST http://localhost:8080/ingest \
-H 'Content-Type: application/json' \
-d '{"doc_id":"my-doc","text":"Your document text."}'
# Search
curl -X POST http://localhost:8080/search \
-H 'Content-Type: application/json' \
-d '{"query":"your question","k":5}'
# Health check
curl http://localhost:8080/health
# Metrics
curl http://localhost:8080/metricsvektr/
├── src/main/java/com/vektr/
│ ├── index/
│ │ ├── HnswIndex.java # HNSW graph: insert, search, restore
│ │ ├── HnswNode.java # Node: vector + neighbor lists per layer
│ │ ├── HnswConfig.java # M, efConstruction, efSearch, distance
│ │ └── DistanceFunction.java # COSINE, EUCLIDEAN, DOT
│ ├── retrieval/
│ │ ├── BM25Index.java # Inverted index with BM25 scoring
│ │ └── ReciprocalRankFusion.java
│ ├── pipeline/
│ │ ├── DocumentChunker.java # Sentence-aware sliding window chunker
│ │ └── EmbeddingClient.java # HTTP client for Python service
│ ├── query/
│ │ ├── QueryCache.java # LRU cache with ReadWriteLock
│ │ └── QueryRewriter.java # HyDE query rewriting
│ ├── storage/
│ │ └── IndexPersistence.java # Binary serialization, atomic rename
│ └── server/
│ └── VektrServer.java # Undertow HTTP, all routes
├── src/test/
│ └── HnswIndexTest.java # Recall@K benchmark, correctness tests
├── ml/
│ ├── embed_server.py # FastAPI embedding microservice
│ ├── ingest_wikipedia.py # Bulk Wikipedia ingestion
│ └── requirements.txt
└── benchmarks/RESULTS.md
- Malkov, Y. A., & Yashunin, D. A. (2018). Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. https://arxiv.org/abs/1603.09320
- Gao, L., et al. (2022). Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE). https://arxiv.org/abs/2212.10496
- Cormack, G. V., et al. (2009). Reciprocal Rank Fusion outperforms Condorcet and individual rank learning methods. SIGIR 2009.
- Robertson, S., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond.
