Drop the Library semantic-search layer; grep/read is the search path #215
Merged
The deploy's sync-supabase job applies supabase/schema.sql on every merge to main, so telling the user to run `mise run sync` after a cloud merge is redundant noise in an end-of-task summary. Record that in the schema-changes section so future sessions don't repeat the reminder.
The per-chunk embedding layer behind doc_search was the wrong level of abstraction for a few-dozen-document corpus. It underperformed (semantic ranking surfaced the table of contents and definitions over the operative clauses - the model said it wouldn't have found the answer with doc_search alone) and cost a heavy backfill (thousands of chunks for a multi-MB upload, hours of cron sweeps before a fresh doc was searchable). Routing by doc_list + pinpointing by doc_grep + reading by doc_read covers the job, works the instant a document is uploaded, and is a lot less machinery.

Removed:
- the doc_search tool and searchDocumentsSemantic;
- the document_chunks table, its claim/save/search RPCs, and the embed-input source (schema.sql carries idempotent drops so a sync cleans the old objects off projects that ran the chunked schema);
- chunkText + chunk insertion on upload (insertDocumentChunks), and the chunkText unit tests;
- the embeddings backfill's sixth source (docs back to five).

Changed:
- the Library drawer search now does a substring match over the user's documents (SupabaseService.searchDocuments) instead of semantic chunk search; dropped the per-result snippet that only made sense for passage hits;
- LIBRARY_BLOCK teaches list -> grep -> read (with a nudge to broaden the regex with synonyms, since there's no fuzzy fallback);
- doc tool descriptions and the dev/user docs updated throughout.

The one real tradeoff: paraphrase recall. Grep needs the model to guess synonyms when the user's wording diverges from the document's; the prompt nudges it to do exactly that. For terminology-heavy reference docs (contracts, policies, tax) the wording is usually consistent, so this bites rarely.
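As a rough illustration of the drawer change described above (a substring match over documents, returning whole documents with no passage snippet), here is a minimal in-memory sketch. It is not the code behind `SupabaseService.searchDocuments`: the `LibraryDoc` shape, the `searchDocsBySubstring` name, and the choice to match both title and extracted text are assumptions made for the example, and the real method most likely queries the database instead.

```typescript
// Hypothetical document shape for illustration only; the real `documents`
// row has its own columns.
interface LibraryDoc {
  id: string;
  title: string;
  text: string; // extracted text kept on the document itself
}

// Case-insensitive substring match over title and extracted text.
// Returns whole documents, not passages, so there is no per-result snippet.
// (Assumption: the real search may match different fields.)
function searchDocsBySubstring(docs: LibraryDoc[], query: string): LibraryDoc[] {
  const needle = query.trim().toLowerCase();
  if (needle === "") return []; // an empty query matches nothing
  return docs.filter(
    (d) =>
      d.title.toLowerCase().includes(needle) ||
      d.text.toLowerCase().includes(needle),
  );
}

// Made-up sample data.
const sampleDocs: LibraryDoc[] = [
  {
    id: "d1",
    title: "Master Services Agreement",
    text: "Termination requires 30 days' written notice.",
  },
  {
    id: "d2",
    title: "Expense Policy",
    text: "Receipts are required for all claims.",
  },
];

const byText = searchDocsBySubstring(sampleDocs, "  TERMINATION ");
const byTitle = searchDocsBySubstring(sampleDocs, "policy");
console.log(byText.map((d) => d.id), byTitle.map((d) => d.id));
```

Because the match runs over text that already exists on the document, a freshly uploaded file is searchable immediately, which is the property the commit message highlights.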
SYNOPSIS

Drop the Library's semantic-search layer; exact `grep`-then-`read` becomes the search path. Plus a CLAUDE.md note that cloud merges auto-sync the schema.

PURPOSE

The per-chunk embedding layer behind `doc_search` was the wrong level of abstraction for a few-dozen-document corpus. It underperformed (semantic ranking surfaced the table of contents and definitions over the operative clauses - the model reported it wouldn't have found the answer with `doc_search` alone) and was expensive (thousands of chunks for a multi-MB upload; hours of cron sweeps before a fresh doc was searchable).

DESCRIPTION

Layer 1 - how it worked. Upload chunked the extracted text, each chunk embedded via the backfill, `doc_search` cosine-ranked chunks. The drawer search and `LIBRARY_BLOCK` were built around that.

Layer 2 - what this removes/changes. Gone: the `doc_search` tool + `searchDocumentsSemantic`; the `document_chunks` table + its claim/save/search RPCs + the embed-input source; `chunkText` + chunk insertion on upload + the chunk unit tests; the backfill's sixth source (embeddings doc back to five). Changed: the drawer search is now a substring match over documents (`searchDocuments`), dropping the passage snippet; `LIBRARY_BLOCK` teaches `doc_list -> doc_grep -> doc_read` and nudges the model to broaden the regex with synonyms; tool descriptions + dev/user docs updated throughout.

`schema.sql` carries idempotent `drop` statements (drop function/table if exists) where the chunk objects were defined, so a sync cleans them off any project that ran the chunked schema - the standard way to retire an object under the re-apply-the-whole-file model.

Layer 3 - how that fixes PURPOSE. Routing by `doc_list` + pinpointing by `doc_grep` + reading by `doc_read` covers the job, works the instant a document is uploaded (no backfill wait), and is far less machinery.

The one real tradeoff: paraphrase recall. Grep needs the model to guess synonyms when the user's wording diverges from the document's; the prompt nudges it to. For terminology-heavy reference docs (contracts, policies, tax) wording is usually consistent, so this bites rarely.
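To make the list -> grep -> read routing and the synonym nudge concrete, here is a small self-contained sketch. The functions `docList`, `docGrep`, `docRead`, and `synonymPattern` are hypothetical stand-ins written for this example; they do not reproduce the signatures or behavior of the real `doc_list`, `doc_grep`, and `doc_read` tools, and the sample contract is made up.

```typescript
// Hypothetical in-memory document: one entry per line of extracted text.
interface Doc {
  id: string;
  title: string;
  lines: string[];
}

interface GrepHit {
  docId: string;
  line: number; // 1-based line number
  text: string;
}

// Step 1 (list): route by title to pick a candidate document.
function docList(docs: Doc[]): { id: string; title: string }[] {
  return docs.map(({ id, title }) => ({ id, title }));
}

// Step 2 (grep): pinpoint matching lines inside one document.
function docGrep(docs: Doc[], docId: string, pattern: RegExp): GrepHit[] {
  const doc = docs.find((d) => d.id === docId);
  if (!doc) return [];
  const hits: GrepHit[] = [];
  doc.lines.forEach((text, i) => {
    if (pattern.test(text)) hits.push({ docId, line: i + 1, text });
  });
  return hits;
}

// Step 3 (read): fetch an inclusive 1-based line range around a hit.
function docRead(docs: Doc[], docId: string, from: number, to: number): string {
  const doc = docs.find((d) => d.id === docId);
  return doc ? doc.lines.slice(from - 1, to).join("\n") : "";
}

// Broaden the search with synonyms: escape each term, then OR them together.
// No "g" flag, so RegExp.test stays stateless across lines.
function synonymPattern(terms: string[]): RegExp {
  const escaped = terms.map((t) => t.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"));
  return new RegExp("(" + escaped.join("|") + ")", "i");
}

const docs: Doc[] = [
  {
    id: "contract",
    title: "Master Services Agreement",
    lines: [
      "Table of Contents",
      "1. Definitions",
      "2. Termination",
      'Definitions: "Agreement" means this contract.',
      "Either party may terminate this Agreement on 30 days' written notice.",
      "Fees are payable within 14 days of invoice.",
    ],
  },
];

// The user asks "how do I cancel?", but the contract says "terminate".
const titles = docList(docs);
const hits = docGrep(docs, titles[0].id, synonymPattern(["cancel", "terminat"]));
const clause = docRead(docs, "contract", 5, 5);
console.log(hits.map((h) => h.line), clause);
```

A pattern built from "cancel" alone would return no hits here, which is the paraphrase-recall tradeoff in miniature: the synonym stem is what reaches both the heading and the operative clause.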
Notes for reviewers:
- [...] `document_chunks` (extracted text is retained on `documents`, so nothing the user cares about is lost).
- [...] `mise run sync` after cloud merges); kept as its own commit, hence rebase-merge.

Gate green: svelte-check 0 errors, lint, knip, 1769 tests, build (no warnings), markdownlint.
Generated by Claude Code