
Drop the Library semantic-search layer; grep/read is the search path #215

Merged
sysread merged 2 commits into main from claude/affectionate-ritchie-1jcnn
May 29, 2026

Conversation

@sysread (Owner) commented May 29, 2026

SYNOPSIS

Drop the Library's semantic-search layer; exact grep-then-read becomes the search path. Plus a CLAUDE.md note that cloud merges auto-sync the schema.

PURPOSE

The per-chunk embedding layer behind doc_search was the wrong level of abstraction for a few-dozen-document corpus. It underperformed (semantic ranking surfaced the table of contents and definitions over the operative clauses - the model reported it wouldn't have found the answer with doc_search alone) and was expensive (thousands of chunks for a multi-MB upload; hours of cron sweeps before a fresh doc was searchable).

DESCRIPTION

Layer 1 - how it worked. Upload chunked the extracted text, the backfill embedded each chunk, and doc_search cosine-ranked the chunks against the query. The drawer search and LIBRARY_BLOCK were built around that.
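For reviewers who never touched the old path, a minimal sketch of cosine-ranking chunks, assuming an in-memory list. The `Chunk` shape, `cosine`, and `rankChunks` are hypothetical names for illustration; the removed implementation ran through the document_chunks RPCs and is not reproduced here.

```typescript
// Hypothetical chunk shape; the real document_chunks columns are not shown in this PR.
interface Chunk {
  documentId: string;
  text: string;
  embedding: number[];
}

// Cosine similarity of two equal-length vectors; 0 when either vector has zero norm.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Score every chunk against the query embedding and keep the top `limit`.
function rankChunks(query: number[], chunks: Chunk[], limit: number): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosine(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, limit)
    .map((x) => x.chunk);
}
```

Every chunk needs an embedding before it can be ranked at all, which is why a fresh upload stayed unsearchable until the backfill caught up.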

Layer 2 - what this removes/changes. Gone: the doc_search tool + searchDocumentsSemantic; the document_chunks table + its claim/save/search RPCs + the embed-input source; chunkText + chunk insertion on upload + the chunk unit tests; the backfill's sixth source (embeddings doc back to five). Changed: the drawer search is now a substring match over documents (searchDocuments), dropping the passage snippet; LIBRARY_BLOCK teaches doc_list -> doc_grep -> doc_read and nudges the model to broaden the regex with synonyms; tool descriptions + dev/user docs updated throughout.
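A rough sketch of what a substring match over documents can look like, assuming an in-memory list. The `LibraryDocument` shape, the fields searched (title and extracted text), case-insensitivity, and the empty-query behavior are all assumptions for illustration; the real searchDocuments queries Supabase and its exact filter is not shown in this description.

```typescript
// Hypothetical document shape; field names are assumptions.
interface LibraryDocument {
  id: string;
  title: string;
  text: string;
}

// Case-insensitive substring match over title and extracted text.
// Returning nothing for a blank query is an assumption, not the app's confirmed behavior.
function searchDocumentsSketch(
  docs: LibraryDocument[],
  query: string,
): LibraryDocument[] {
  const needle = query.trim().toLowerCase();
  if (needle === "") return [];
  return docs.filter(
    (doc) =>
      doc.title.toLowerCase().includes(needle) ||
      doc.text.toLowerCase().includes(needle),
  );
}
```

The result is a whole document rather than a passage, which is why the per-result snippet no longer has anything to show.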

schema.sql carries idempotent drop statements (drop function/table if exists) where the chunk objects were defined, so a sync cleans them off any project that ran the chunked schema - the standard way to retire an object under the re-apply-the-whole-file model.

Layer 3 - how that fixes PURPOSE. Routing by doc_list + pinpointing by doc_grep + reading by doc_read covers the job, works the instant a document is uploaded (no backfill wait), and is far less machinery.
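The list -> grep -> read flow can be sketched over an in-memory store. The function names mirror the tools, but the signatures, the `StoredDoc` and `GrepHit` shapes, and line-based addressing are assumptions made for illustration, not the tools' actual contracts.

```typescript
// Hypothetical shapes; the real tool payloads are not shown in this PR.
interface StoredDoc {
  id: string;
  title: string;
  text: string;
}

interface GrepHit {
  docId: string;
  line: number; // 1-based line number within the document text
  text: string;
}

// Route: list titles so the caller can choose which document to look in.
function docList(docs: StoredDoc[]): { id: string; title: string }[] {
  return docs.map(({ id, title }) => ({ id, title }));
}

// Pinpoint: return every line of one document that matches the pattern.
function docGrep(doc: StoredDoc, pattern: RegExp): GrepHit[] {
  const hits: GrepHit[] = [];
  doc.text.split("\n").forEach((text, index) => {
    // Reset lastIndex so a global or sticky pattern does not carry state between lines.
    pattern.lastIndex = 0;
    if (pattern.test(text)) {
      hits.push({ docId: doc.id, line: index + 1, text });
    }
  });
  return hits;
}

// Read: return an inclusive, 1-based line range from one document.
function docRead(doc: StoredDoc, fromLine: number, toLine: number): string {
  return doc.text
    .split("\n")
    .slice(Math.max(fromLine - 1, 0), toLine)
    .join("\n");
}
```

Nothing here depends on precomputed state, so a document is searchable as soon as its extracted text is stored.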

The one real tradeoff: paraphrase recall. Grep needs the model to guess synonyms when the user's wording diverges from the document's; the prompt nudges it to. For terminology-heavy reference docs (contracts, policies, tax) wording is usually consistent, so this bites rarely.
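One way to picture "broaden the regex with synonyms": build a single case-insensitive alternation from a list of guessed terms. `escapeRegExp` and `synonymPattern` are hypothetical helpers for illustration; in practice the model writes the pattern itself when it calls doc_grep.

```typescript
// Escape regex metacharacters so each synonym is matched literally.
function escapeRegExp(term: string): string {
  return term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Combine the guessed synonyms into one case-insensitive alternation.
function synonymPattern(terms: string[]): RegExp {
  const alternatives = terms
    .map((term) => term.trim())
    .filter((term) => term.length > 0)
    .map(escapeRegExp);
  if (alternatives.length === 0) {
    throw new Error("at least one non-empty term is required");
  }
  return new RegExp(`(?:${alternatives.join("|")})`, "i");
}
```

If the user asks about "cancelling" and the contract says "terminate", a pattern built from `["cancel", "terminat", "end the agreement"]` still lands on the clause, provided the guesses include the document's wording.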

Notes for reviewers:

  • The chunk-table drop is intentional and destructive to document_chunks (extracted text is retained on documents, so nothing the user cares about is lost).
  • Second commit is an unrelated one-line CLAUDE.md convention note (stop reminding about mise run sync after cloud merges); kept as its own commit, hence rebase-merge.

Gate green: svelte-check 0 errors, lint, knip, 1769 tests, build (no warnings), markdownlint.


Generated by Claude Code

claude added 2 commits May 29, 2026 20:29
The deploy's sync-supabase job applies supabase/schema.sql on every
merge to main, so telling the user to run `mise run sync` after a cloud
merge is redundant noise in an end-of-task summary. Record that in the
schema-changes section so future sessions don't repeat the reminder.
The per-chunk embedding layer behind doc_search was the wrong level of
abstraction for a few-dozen-document corpus. It underperformed (semantic
ranking surfaced the table of contents and definitions over the
operative clauses - the model said it wouldn't have found the answer
with doc_search alone) and cost a heavy backfill (thousands of chunks
for a multi-MB upload, hours of cron sweeps before a fresh doc was
searchable). Routing by doc_list + pinpointing by doc_grep + reading by
doc_read covers the job, works the instant a document is uploaded, and
is a lot less machinery.

Removed:
- the doc_search tool and searchDocumentsSemantic;
- the document_chunks table, its claim/save/search RPCs, and the
  embed-input source (schema.sql carries idempotent drops so a sync
  cleans the old objects off projects that ran the chunked schema);
- chunkText + chunk insertion on upload (insertDocumentChunks), and the
  chunkText unit tests;
- the embeddings backfill's sixth source (docs back to five).

Changed:
- the Library drawer search now does a substring match over the user's
  documents (SupabaseService.searchDocuments) instead of semantic chunk
  search; dropped the per-result snippet that only made sense for
  passage hits;
- LIBRARY_BLOCK teaches list -> grep -> read (with a nudge to broaden
  the regex with synonyms, since there's no fuzzy fallback);
- doc tool descriptions and the dev/user docs updated throughout.

The one real tradeoff: paraphrase recall. Grep needs the model to guess
synonyms when the user's wording diverges from the document's; the
prompt nudges it to do exactly that. For terminology-heavy reference
docs (contracts, policies, tax) the wording is usually consistent, so
this bites rarely.

@sysread merged commit 4c7d4a0 into main May 29, 2026
1 check passed
@sysread deleted the claude/affectionate-ritchie-1jcnn branch May 29, 2026 21:09