knowledge: O(1) exists + cached lower in InMemoryVaultBackend#150
Open
Antawari wants to merge 1 commit into
Open
knowledge: O(1) exists + cached lower in InMemoryVaultBackend#150Antawari wants to merge 1 commit into
Antawari wants to merge 1 commit into
Conversation
The in-memory backend ships as the default when knowledge is
unconfigured — its docstring said "for testing" but the factory
returns it for the bare enabled=False and backend="memory" paths.
Production users hit it unless they explicitly opt into LanceDB.
Pre-fix:
- exists(content_hash) is `any(e.content_hash == h for e in self._entries)`
— linear scan per call.
- query() allocates `e.content.lower()` per entry per call — hot-path
string allocation that grows linearly with vault size.
Ingest paths (KnowledgeIngestConsumer._store, knowledge.ingest.*,
TechScanner.scan_and_store) ALL call exists() before store(). Stacked
together: ingesting N entries requires N linear-scan exists checks ⇒
O(N²) ingest on the shipping default.
## Fix
src/bonfire/knowledge/memory.py:
- Added `_hashes: set[str]` side index for O(1) exists() — replaces
the prior O(N) `any(...)` scan.
- Added `_lower_content: dict[str, str]` keyed by entry_id, populated
at store() time. query() now reads pre-computed lowercase from this
cache instead of allocating `e.content.lower()` per entry per call.
- Preserved `_entries: list[VaultEntry]` as the public attribute — the
innovative-lens conformance test in test_knowledge_memory.py asserts
`backend._entries == []` and `isinstance(..., list)` directly.
- `store()` is append-only on all three (_entries, _hashes,
_lower_content) — no silent dedup on hash collision; the existing
test_store_appends_to_entries contract requires append-on-every-store.
- Updated module + class docstrings to acknowledge production-default
usage (acceptance criterion #3 in the ticket).
## Tests
tests/unit/test_knowledge_memory_performance.py (new · Knight RED):
Five tests:
- 10k-entry ingest with exists-then-store guard pattern, asserts
< 1.0s wall clock. Pre-fix on this CPU: 3.571s. Post-fix: included
in the suite's 1.98s total run.
- 100 queries over 5k-entry vault, asserts < 2.0s (future regression
guard — the .lower() cache enforces this).
- Hash-collision append regression guard (_entries grows on every
store, no silent dedup).
- Exists-after-50-stores correctness sweep.
- Lower-case cache handles uppercase content (case-insensitive
substring match preserved).
## Verification
pytest tests/unit/test_knowledge_memory_performance.py
5 passed (RED→GREEN verified)
pytest tests/unit/test_knowledge_memory.py (existing — full suite)
17 passed (zero regressions)
pytest tests/unit/test_knowledge_factory.py test_knowledge_ingest.py \
test_knowledge_consumer.py test_knowledge_backfill.py
45 passed (callers of InMemoryVaultBackend — zero regressions)
ruff check + format on changed files: clean
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The in-memory backend ships as the default when knowledge is unconfigured — its docstring said "for testing" but the factory returns it for the bare
enabled=Falseandbackend="memory"paths. Production users hit it unless they explicitly opt into LanceDB. The pre-fix performance characteristics:exists(content_hash)any(...)linear scansetlookupquery()per-entry hot loope.content.lower()re-allocated per entry per callstore()Ingest paths (
KnowledgeIngestConsumer._store,knowledge.ingest.*,TechScanner.scan_and_store) all callexists()beforestore(). Stacked: ingesting N entries previously required N linear-scan exists checks → O(N²) on the shipping default.Changes
src/bonfire/knowledge/memory.py_hashes: set[str]side index for O(1)exists()._lower_content: dict[str, str]keyed byentry_id, populated atstore()time.query()reads from the cache in its substring-match hot loop._entries: list[VaultEntry]as the public attribute — the conformance test intest_knowledge_memory.pyassertsbackend._entries == []andisinstance(..., list)directly. Adding side indexes keeps the contract intact.store()is append-only on all three (_entries,_hashes,_lower_content) — no silent dedup on hash collision; the existingtest_store_appends_to_entriescontract requires append-on-every-store. Callers run theexists()-then-store()guard pattern.tests/unit/test_knowledge_memory_performance.py(new · Knight RED)Five tests — the 10k benchmark is the load-bearing RED:
test_10k_entry_ingest_under_one_second— 10k entries withexists-then-storeguard, asserts < 1.0s wall clock. Pre-fix on this CPU: 3.571s. Post-fix: included in the full suite's 1.98s total run.test_query_with_large_vault_does_not_allocate_per_entry_lower— 100 queries over 5k-entry vault, < 2.0s. Future-regression guard for the lower-cache.test_entries_list_grows_on_every_store_even_with_hash_collision— pins the no-silent-dedup contract.test_exists_returns_true_for_any_stored_hash— O(1) exists correctness sweep over 50 stores.test_query_lower_cache_handles_uppercase_content— case-insensitive substring match preserved.TDD verification
test_knowledge_memory.py).ruff check+ruff format --checkon changed files: clean.Test plan
pytest tests/unit/test_knowledge_memory_performance.py— confirm 5/5 green.pytest tests/unit/test_knowledge_memory.py— confirm 17/17 still green (no regression).bonfire backfill_allagainst a large session log; expect linear scaling.🤖 Generated with Claude Code