Skip to content

knowledge: O(1) exists + cached lower in InMemoryVaultBackend#150

Open
Antawari wants to merge 1 commit into
mainfrom
ishtar/inmemoryvaultbackend-o1-hash-index
Open

knowledge: O(1) exists + cached lower in InMemoryVaultBackend#150
Antawari wants to merge 1 commit into
mainfrom
ishtar/inmemoryvaultbackend-o1-hash-index

Conversation

@Antawari
Copy link
Copy Markdown
Contributor

Summary

The in-memory backend ships as the default when knowledge is unconfigured — its docstring said "for testing" but the factory returns it for the bare enabled=False and backend="memory" paths. Production users hit it unless they explicitly opt into LanceDB. The pre-fix performance characteristics:

Before After
exists(content_hash) O(N) — any(...) linear scan O(1) — set lookup
query() per-entry hot loop e.content.lower() re-allocated per entry per call reads from a lower-case cache populated at store()
10k-entry ingest with exists-then-store guard 3.571s on this CPU (quadratic) sub-second (linear)

Ingest paths (KnowledgeIngestConsumer._store, knowledge.ingest.*, TechScanner.scan_and_store) all call exists() before store(). Stacked: ingesting N entries previously required N linear-scan exists checks → O(N²) on the shipping default.

Changes

src/bonfire/knowledge/memory.py

  • Added _hashes: set[str] side index for O(1) exists().
  • Added _lower_content: dict[str, str] keyed by entry_id, populated at store() time. query() reads from the cache in its substring-match hot loop.
  • Preserved _entries: list[VaultEntry] as the public attribute — the conformance test in test_knowledge_memory.py asserts backend._entries == [] and isinstance(..., list) directly. Adding side indexes keeps the contract intact.
  • store() is append-only on all three (_entries, _hashes, _lower_content) — no silent dedup on hash collision; the existing test_store_appends_to_entries contract requires append-on-every-store. Callers run the exists()-then-store() guard pattern.
  • Updated module + class docstrings to acknowledge production-default usage (the docstring previously said "for testing" — that was misleading given the factory's behavior).

tests/unit/test_knowledge_memory_performance.py (new · Knight RED)

Five tests — the 10k benchmark is the load-bearing RED:

  • test_10k_entry_ingest_under_one_second — 10k entries with exists-then-store guard, asserts < 1.0s wall clock. Pre-fix on this CPU: 3.571s. Post-fix: included in the full suite's 1.98s total run.
  • test_query_with_large_vault_does_not_allocate_per_entry_lower — 100 queries over 5k-entry vault, < 2.0s. Future-regression guard for the lower-cache.
  • test_entries_list_grows_on_every_store_even_with_hash_collision — pins the no-silent-dedup contract.
  • test_exists_returns_true_for_any_stored_hash — O(1) exists correctness sweep over 50 stores.
  • test_query_lower_cache_handles_uppercase_content — case-insensitive substring match preserved.

TDD verification

  • Knight RED state confirmed against pre-fix code: 1 fail / 4 pass (the 10k ingest is the load-bearing RED; the other 4 are correctness regression guards that should remain green).
  • Warrior GREEN state confirmed after fix: 5/5 pass.
  • Existing memory suite regression: 17/17 still green (test_knowledge_memory.py).
  • Caller regression: 45/45 still green (factory + ingest + consumer + backfill tests).
  • ruff check + ruff format --check on changed files: clean.

Test plan

  • pytest tests/unit/test_knowledge_memory_performance.py — confirm 5/5 green.
  • pytest tests/unit/test_knowledge_memory.py — confirm 17/17 still green (no regression).
  • (Optional manual) Run bonfire backfill_all against a large session log; expect linear scaling.

🤖 Generated with Claude Code

The in-memory backend ships as the default when knowledge is
unconfigured — its docstring said "for testing" but the factory
returns it for the bare enabled=False and backend="memory" paths.
Production users hit it unless they explicitly opt into LanceDB.

Pre-fix:
  - exists(content_hash) is `any(e.content_hash == h for e in self._entries)`
    — linear scan per call.
  - query() allocates `e.content.lower()` per entry per call — hot-path
    string allocation that grows linearly with vault size.

Ingest paths (KnowledgeIngestConsumer._store, knowledge.ingest.*,
TechScanner.scan_and_store) ALL call exists() before store(). Stacked
together: ingesting N entries requires N linear-scan exists checks ⇒
O(N²) ingest on the shipping default.

## Fix

src/bonfire/knowledge/memory.py:

  - Added `_hashes: set[str]` side index for O(1) exists() — replaces
    the prior O(N) `any(...)` scan.
  - Added `_lower_content: dict[str, str]` keyed by entry_id, populated
    at store() time. query() now reads pre-computed lowercase from this
    cache instead of allocating `e.content.lower()` per entry per call.
  - Preserved `_entries: list[VaultEntry]` as the public attribute — the
    innovative-lens conformance test in test_knowledge_memory.py asserts
    `backend._entries == []` and `isinstance(..., list)` directly.
  - `store()` is append-only on all three (_entries, _hashes,
    _lower_content) — no silent dedup on hash collision; the existing
    test_store_appends_to_entries contract requires append-on-every-store.
  - Updated module + class docstrings to acknowledge production-default
    usage (acceptance criterion #3 in the ticket).

## Tests

tests/unit/test_knowledge_memory_performance.py (new · Knight RED):

  Five tests:
    - 10k-entry ingest with exists-then-store guard pattern, asserts
      < 1.0s wall clock. Pre-fix on this CPU: 3.571s. Post-fix: included
      in the suite's 1.98s total run.
    - 100 queries over 5k-entry vault, asserts < 2.0s (future regression
      guard — the .lower() cache enforces this).
    - Hash-collision append regression guard (_entries grows on every
      store, no silent dedup).
    - Exists-after-50-stores correctness sweep.
    - Lower-case cache handles uppercase content (case-insensitive
      substring match preserved).

## Verification

  pytest tests/unit/test_knowledge_memory_performance.py
    5 passed (RED→GREEN verified)

  pytest tests/unit/test_knowledge_memory.py (existing — full suite)
    17 passed (zero regressions)

  pytest tests/unit/test_knowledge_factory.py test_knowledge_ingest.py \
         test_knowledge_consumer.py test_knowledge_backfill.py
    45 passed (callers of InMemoryVaultBackend — zero regressions)

  ruff check + format on changed files: clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant