knowledge: O(1) exists + cached lower in InMemoryVaultBackend by Antawari · Pull Request #150 · BonfireAI/bonfire

Antawari · 2026-05-27T22:45:20Z

Summary

The in-memory backend ships as the default when knowledge is unconfigured — its docstring said "for testing" but the factory returns it for the bare enabled=False and backend="memory" paths. Production users hit it unless they explicitly opt into LanceDB. The pre-fix performance characteristics:

	Before	After
`exists(content_hash)`	O(N) — `any(...)` linear scan	O(1) — `set` lookup
`query()` per-entry hot loop	`e.content.lower()` re-allocated per entry per call	reads from a lower-case cache populated at `store()`
10k-entry ingest with exists-then-store guard	3.571s on this CPU (quadratic)	sub-second (linear)

Ingest paths (KnowledgeIngestConsumer._store, knowledge.ingest.*, TechScanner.scan_and_store) all call exists() before store(). Stacked: ingesting N entries previously required N linear-scan exists checks → O(N²) on the shipping default.

Changes

src/bonfire/knowledge/memory.py

Added _hashes: set[str] side index for O(1) exists().
Added _lower_content: dict[str, str] keyed by entry_id, populated at store() time. query() reads from the cache in its substring-match hot loop.
Preserved _entries: list[VaultEntry] as the public attribute — the conformance test in test_knowledge_memory.py asserts backend._entries == [] and isinstance(..., list) directly. Adding side indexes keeps the contract intact.
store() is append-only on all three (_entries, _hashes, _lower_content) — no silent dedup on hash collision; the existing test_store_appends_to_entries contract requires append-on-every-store. Callers run the exists()-then-store() guard pattern.
Updated module + class docstrings to acknowledge production-default usage (the docstring previously said "for testing" — that was misleading given the factory's behavior).

tests/unit/test_knowledge_memory_performance.py (new · Knight RED)

Five tests — the 10k benchmark is the load-bearing RED:

test_10k_entry_ingest_under_one_second — 10k entries with exists-then-store guard, asserts < 1.0s wall clock. Pre-fix on this CPU: 3.571s. Post-fix: included in the full suite's 1.98s total run.
test_query_with_large_vault_does_not_allocate_per_entry_lower — 100 queries over 5k-entry vault, < 2.0s. Future-regression guard for the lower-cache.
test_entries_list_grows_on_every_store_even_with_hash_collision — pins the no-silent-dedup contract.
test_exists_returns_true_for_any_stored_hash — O(1) exists correctness sweep over 50 stores.
test_query_lower_cache_handles_uppercase_content — case-insensitive substring match preserved.

TDD verification

Knight RED state confirmed against pre-fix code: 1 fail / 4 pass (the 10k ingest is the load-bearing RED; the other 4 are correctness regression guards that should remain green).
Warrior GREEN state confirmed after fix: 5/5 pass.
Existing memory suite regression: 17/17 still green (test_knowledge_memory.py).
Caller regression: 45/45 still green (factory + ingest + consumer + backfill tests).
ruff check + ruff format --check on changed files: clean.

Test plan

pytest tests/unit/test_knowledge_memory_performance.py — confirm 5/5 green.
pytest tests/unit/test_knowledge_memory.py — confirm 17/17 still green (no regression).
(Optional manual) Run bonfire backfill_all against a large session log; expect linear scaling.

🤖 Generated with Claude Code

The in-memory backend ships as the default when knowledge is unconfigured — its docstring said "for testing" but the factory returns it for the bare enabled=False and backend="memory" paths. Production users hit it unless they explicitly opt into LanceDB. Pre-fix: - exists(content_hash) is `any(e.content_hash == h for e in self._entries)` — linear scan per call. - query() allocates `e.content.lower()` per entry per call — hot-path string allocation that grows linearly with vault size. Ingest paths (KnowledgeIngestConsumer._store, knowledge.ingest.*, TechScanner.scan_and_store) ALL call exists() before store(). Stacked together: ingesting N entries requires N linear-scan exists checks ⇒ O(N²) ingest on the shipping default. ## Fix src/bonfire/knowledge/memory.py: - Added `_hashes: set[str]` side index for O(1) exists() — replaces the prior O(N) `any(...)` scan. - Added `_lower_content: dict[str, str]` keyed by entry_id, populated at store() time. query() now reads pre-computed lowercase from this cache instead of allocating `e.content.lower()` per entry per call. - Preserved `_entries: list[VaultEntry]` as the public attribute — the innovative-lens conformance test in test_knowledge_memory.py asserts `backend._entries == []` and `isinstance(..., list)` directly. - `store()` is append-only on all three (_entries, _hashes, _lower_content) — no silent dedup on hash collision; the existing test_store_appends_to_entries contract requires append-on-every-store. - Updated module + class docstrings to acknowledge production-default usage (acceptance criterion #3 in the ticket). ## Tests tests/unit/test_knowledge_memory_performance.py (new · Knight RED): Five tests: - 10k-entry ingest with exists-then-store guard pattern, asserts < 1.0s wall clock. Pre-fix on this CPU: 3.571s. Post-fix: included in the suite's 1.98s total run. - 100 queries over 5k-entry vault, asserts < 2.0s (future regression guard — the .lower() cache enforces this). - Hash-collision append regression guard (_entries grows on every store, no silent dedup). - Exists-after-50-stores correctness sweep. - Lower-case cache handles uppercase content (case-insensitive substring match preserved). ## Verification pytest tests/unit/test_knowledge_memory_performance.py 5 passed (RED→GREEN verified) pytest tests/unit/test_knowledge_memory.py (existing — full suite) 17 passed (zero regressions) pytest tests/unit/test_knowledge_factory.py test_knowledge_ingest.py \ test_knowledge_consumer.py test_knowledge_backfill.py 45 passed (callers of InMemoryVaultBackend — zero regressions) ruff check + format on changed files: clean Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

knowledge: O(1) exists + cached lower in InMemoryVaultBackend#150

knowledge: O(1) exists + cached lower in InMemoryVaultBackend#150
Antawari wants to merge 1 commit into
mainfrom
ishtar/inmemoryvaultbackend-o1-hash-index

Antawari commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Antawari commented May 27, 2026

Summary

Changes

TDD verification

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant