
feat: simpler iterative retrieval (BM25 → rerank → grade → expand) #25

Open
dominikpeter wants to merge 3 commits into main from feat/iterative-retrieval-v2


@dominikpeter
Member

Summary

Replaces the fan-out of fast-path / filter-task / swarm-rescue / escalate with a clean iterative loop that matches the design you sketched: BM25 with extended queries → top 20 → rerank → grader → on fail expand (more variants + semantic) → iterate up to 5 times.

Flow

preprocess ∥ detect_filter_intent          (parallel LLM calls)
seed variants = LLM synonyms + edit-distance-1 typo rewrites

for iter_n in range(max_iter):
    broad_search(variants, semantic=iter_n≥1?0.5:None)   │
                     ∥                                    │  batched via
    filter_search(intent, variants) if intent             │  Meilisearch
                                                          │  multi-search
    merge (always keep both arms) → pool
    stage-aware merge with priority → take top 20 → rerank → pin filter top-5
    grader:
        top-1 score ≥ 0.9             → pass, break
        else LLM relevance_check:
            makes_sense ∧ conf ≥ 0.7  → pass, break
    if fail: LLM generates more variants, loop
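
The grader step in the flow above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: only the two thresholds (0.9 and 0.7) and the `makes_sense` / confidence fields come from the description; the `Verdict` type, the `grade` function, and the callable-based interface are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable

SCORE_FLOOR = 0.9       # top-1 score short-circuit (value from the PR description)
CONFIDENCE_FLOOR = 0.7  # LLM confidence floor (value from the PR description)


@dataclass
class Verdict:
    """Hypothetical shape of the LLM relevance_check result."""
    makes_sense: bool
    confidence: float


def grade(top1_score: float, relevance_check: Callable[[], Verdict]) -> bool:
    """Return True when the pool passes and the loop should break."""
    if top1_score >= SCORE_FLOOR:
        # Confident match: skip the LLM call entirely.
        return True
    verdict = relevance_check()
    return verdict.makes_sense and verdict.confidence >= CONFIDENCE_FLOOR
```

Passing the relevance check as a callable keeps the LLM call lazy, so the short-circuit branch never pays for it.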

Key design choices

  • Always keep the unfiltered arm: even when filter-intent fires, we still run broad search in parallel and merge both. When the intent LLM picks the wrong supplier_name value, broad_docs still has the right hits.
  • Stricter grader: 0.9 score floor (was 0.7) + 0.7 LLM confidence floor (was 0.6). Marginal pools now iterate instead of early-exiting.
  • Progressive semantic weight: iter 0 is BM25-only (if preprocess didn't set a ratio); iter 1+ blends in 0.5 semantic. Matches the ‘bm25 first, then hybrid, then more variants’ design.
  • Stage-aware merge (_merge_with_priority): later iters can only ADD candidates — the first 5 slots are reserved for earlier-stage hits, so expansion never demotes a correctly-surfaced doc.
  • Multi-search everywhere: every broad/filter/rescue call batches variants via Meilisearch /multi-search (one HTTP round-trip per stage).
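
To make the progressive semantic weight and the batching concrete, here is a hedged sketch of how one stage's variants might be packed into a single Meilisearch `/multi-search` request body. The function names, the index name `articles`, and the embedder name `default` are assumptions for illustration; the `queries`, `indexUid`, `q`, `limit`, `filter`, and `hybrid.semanticRatio` keys are standard Meilisearch search parameters. The sketch also assumes that from iter 1 onward the ratio is always 0.5, regardless of what preprocess set.

```python
from typing import Any, Dict, List, Optional


def semantic_ratio(iter_n: int, preprocess_ratio: Optional[float] = None) -> Optional[float]:
    """iter 0: BM25-only unless preprocess set a ratio; iter 1+: blend in 0.5."""
    if iter_n >= 1:
        return 0.5
    return preprocess_ratio


def build_multi_search_body(
    variants: List[str],
    index_uid: str,
    ratio: Optional[float],
    limit: int = 20,
    embedder: str = "default",        # assumed embedder name
    filter_expr: Optional[str] = None,
) -> Dict[str, Any]:
    """Build one request body covering every variant (one HTTP round-trip)."""
    queries = []
    for variant in variants:
        query: Dict[str, Any] = {"indexUid": index_uid, "q": variant, "limit": limit}
        if ratio is not None:
            query["hybrid"] = {"semanticRatio": ratio, "embedder": embedder}
        if filter_expr:
            query["filter"] = filter_expr
        queries.append(query)
    return {"queries": queries}


# Broad arm for iteration 1: two variants, hybrid search at ratio 0.5.
body = build_multi_search_body(
    ["Flaschenöffner", "Bieröffner"], "articles", semantic_ratio(1)
)
```

The filter arm would use the same builder with `filter_expr` set, so both arms can share one request.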

Eval

3-run means on tests.eval_v2 (40 OneTrade DE base cases + Article + Supplier):

                  Hit@5
baseline (main)   122/189 ±2
this PR           127/189 ±1

+5 hits, +2.7pp, first client-side win above noise floor since #18.

Latency: ~500-800ms per query on fast path (grader short-circuit). Worst case with 3 iters: ~4-6s.

Removed

  • _aescalate_if_needed helper (~80 LOC) — absorbed into the loop
  • Fast-path with relevance-check gate — fast-accept was bypassing filter too aggressively
  • HyDE task in _aretrieve_documents: _asearch handles HyDE internally

Net: -53 LOC (361 → 308), simpler control flow.

🤖 Generated with Claude Code

dominikpeter and others added 3 commits April 21, 2026 07:56
Restructures _aretrieve_documents' tail into a staged escalation:

  Stage 0 (always)  standard retrieval + filter + pin  (unchanged)
  Stage 1 (on fail) swarm_retrieve variants, merged with stage-0 via
                    _merge_with_priority so pinned hits stay in top-k

The grader (relevance_check) short-circuits on top-1 BM25 score ≥ 0.7
(cheap heuristic for confident matches) and only fails on confidently-
negative LLM verdicts — marginal cases skip expansion. Updated grader
prompt recognises synonyms/paraphrases/multilingual equivalents so
"Bieröffner" ≈ "Flaschenöffner" counts as a match.

_merge_with_priority is a new helper: reserves the first `keep_top`
slots for the prior stage, RRF-fuses remaining candidates from both
stages. Prevents later-stage noise from demoting earlier hits.
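
One possible reading of that helper, as a sketch only: the real `_merge_with_priority` is not shown on this page, so the signature, the use of document ids as plain strings, and the RRF constant `k = 60` (the conventional default) are all assumptions.

```python
from typing import Dict, List


def merge_with_priority(
    prior: List[str], new: List[str], keep_top: int = 5, k: int = 60
) -> List[str]:
    """Reserve the first keep_top slots for the prior stage, RRF-fuse the rest.

    Assumes each input ranking is ordered best-first and has no duplicate ids.
    """
    head = prior[:keep_top]
    reserved = set(head)
    scores: Dict[str, float] = {}
    for ranking in (prior, new):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in reserved:
                continue  # already pinned in the head
            # Reciprocal rank fusion: sum 1 / (k + rank) over both rankings.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    tail = sorted(scores, key=lambda doc_id: -scores[doc_id])
    return head + tail


merged = merge_with_priority(["a", "b", "c"], ["x", "c", "a"], keep_top=2)
# "a" and "b" stay pinned; "c" outranks "x" because it appears in both stages.
```

Because the head is copied verbatim from the prior stage, a later stage can reorder only the tail.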

Stage 2 (filter-free broad) intentionally omitted — regressed more
cases than it helped on the current benchmark.

Eval: baseline 122/189 ±2 → with loop 123/189 ±2 (within noise but
architecturally cleaner; handles synonym gaps without manual tuning).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Replaces the fan-out fast-path/filter-task/swarm/escalate pileup with a
clean loop that matches the "smart iterative" mental model:

  1. preprocess ∥ detect_filter_intent (parallel LLM calls)
  2. seed variants = LLM synonyms + edit-distance-1 typo candidates
  3. for iter in range(max_iter):
       - broad search + (if intent) filter search, batched via
         Meilisearch multi-search, always keeping both arms so the
         unfiltered rescue stays alive even when filter fires
       - stage-aware merge into running pool (_merge_with_priority),
         take top 20, rerank, pin filter_docs to top-5
       - grader: top-1 score ≥ 0.9 short-circuit, else LLM relevance
         check (accept on makes_sense ∧ confidence ≥ 0.7)
       - pass → break; fail → LLM generates fresh variants, loop
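
For step 2, edit-distance-1 typo candidates could be generated along these lines. This is a generic sketch in the style of well-known spelling correctors, not the PR's `_typo_candidates`: the function name, the German-flavoured alphabet, and counting an adjacent transposition as distance 1 (Damerau-Levenshtein) are assumptions, and a real pipeline would presumably filter the output against an index vocabulary.

```python
import string
from typing import Set

# Assumed alphabet: lowercase ASCII plus German umlauts and eszett.
ALPHABET = string.ascii_lowercase + "äöüß"


def typo_candidates(word: str, alphabet: str = ALPHABET) -> Set[str]:
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    return (deletes | transposes | replaces | inserts) - {word}


candidates = typo_candidates("flashce")
# The intended spelling "flasche" is reachable by swapping "h" and "c".
```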

Stricter grader (0.9 score floor, 0.7 LLM confidence) keeps the loop
running on ambiguous pools instead of early-exiting. Always merging
broad+filter catches cases where the intent LLM narrowed to the wrong
supplier field.

Eval (3-run): baseline 122/189 ±2 → 127/189 ±1 (+5 hits, +2.7pp).
Above noise floor, first real client-side win since #18.

Removes ~200 LOC of escalate/fast-path scaffolding.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Per .agents/skills/deslop guidelines:
- Shorter docstrings (remove rambling examples that belong in tests).
- Drop nested _emit helper in _typo_candidates, inline early-return.
- Collapse seen/skip pattern in _merge_with_priority.

No behaviour change.