Drop the Library semantic-search layer; grep/read is the search path by sysread · Pull Request #215 · sysread/nak

sysread · 2026-05-29T21:08:57Z

SYNOPSIS

Drop the Library's semantic-search layer; exact grep-then-read becomes the search path. Also adds a CLAUDE.md note that cloud merges auto-sync the schema.

PURPOSE

The per-chunk embedding layer behind doc_search was the wrong level of abstraction for a few-dozen-document corpus. It underperformed (semantic ranking surfaced the table of contents and definitions over the operative clauses - the model reported it wouldn't have found the answer with doc_search alone) and was expensive (thousands of chunks for a multi-MB upload; hours of cron sweeps before a fresh doc was searchable).

DESCRIPTION

Layer 1 - how it worked. Upload chunked the extracted text, the backfill embedded each chunk, and doc_search cosine-ranked the chunks. The drawer search and LIBRARY_BLOCK were built around that.

Layer 2 - what this removes/changes. Gone: the doc_search tool + searchDocumentsSemantic; the document_chunks table + its claim/save/search RPCs + the embed-input source; chunkText + chunk insertion on upload + the chunk unit tests; the backfill's sixth source (embeddings doc back to five sources). Changed: the drawer search is now a substring match over documents (searchDocuments), and the passage snippet is dropped; LIBRARY_BLOCK teaches doc_list -> doc_grep -> doc_read and nudges the model to broaden the regex with synonyms; tool descriptions + dev/user docs updated throughout.
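For reviewers who want a feel for the new drawer behaviour, here is a minimal in-memory sketch of a substring match over documents. It is not the code in this PR: the real searchDocuments lives on SupabaseService and queries the database. The `LibraryDocument` shape, the `searchDocumentsSketch` name, case-insensitive matching, and matching on both title and extracted text are all assumptions made for illustration.

```typescript
// Hypothetical document shape; the real `documents` columns are not shown in this PR.
interface LibraryDocument {
  id: string;
  title: string;
  extractedText: string;
}

// In-memory stand-in for a substring search over documents.
// Assumption: matching is case-insensitive and covers title and extracted text.
function searchDocumentsSketch(
  docs: LibraryDocument[],
  query: string,
): LibraryDocument[] {
  const needle = query.trim().toLowerCase();
  if (needle === "") return []; // an empty query matches nothing
  return docs.filter(
    (doc) =>
      doc.title.toLowerCase().includes(needle) ||
      doc.extractedText.toLowerCase().includes(needle),
  );
}

const sampleDocs: LibraryDocument[] = [
  { id: "1", title: "Lease agreement", extractedText: "The tenant shall pay rent monthly." },
  { id: "2", title: "Travel policy", extractedText: "Economy fares are reimbursed." },
];

// Results are whole documents, with no per-passage snippet.
const hitIds = searchDocumentsSketch(sampleDocs, "RENT").map((doc) => doc.id);
console.log(hitIds);
```

Because the result is a list of documents rather than passages, there is nothing for a snippet to point at, which is why the per-result snippet goes away.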

schema.sql carries idempotent drop statements (drop function/table if exists) where the chunk objects were defined, so a sync cleans them off any project that ran the chunked schema - the standard way to retire an object under the re-apply-the-whole-file model.

Layer 3 - how that fixes PURPOSE. Routing by doc_list + pinpointing by doc_grep + reading by doc_read covers the job, works the instant a document is uploaded (no backfill wait), and is far less machinery.
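The list -> grep -> read flow can be sketched against an in-memory store. The tool names come from this PR, but the signatures, return shapes, and line-based addressing below are assumptions; the actual tools may take different arguments.

```typescript
// Hypothetical stored-document shape for illustration only.
interface StoredDoc {
  id: string;
  title: string;
  text: string;
}

const store: StoredDoc[] = [
  {
    id: "lease",
    title: "Lease agreement",
    text: "1. Definitions\n2. Rent is due on the first day of each month.\n3. Late fees apply after five days.",
  },
  { id: "policy", title: "Travel policy", text: "1. Scope\n2. Economy fares are reimbursed." },
];

// Step 1, route: list what exists so the caller can pick a document.
function docList(): { id: string; title: string }[] {
  return store.map(({ id, title }) => ({ id, title }));
}

// Step 2, pinpoint: return 1-based line numbers of lines matching the pattern.
function docGrep(id: string, pattern: RegExp): { line: number; text: string }[] {
  const doc = store.find((d) => d.id === id);
  if (!doc) return [];
  // Strip the g/y flags so RegExp.test does not carry lastIndex between lines.
  const stateless = new RegExp(pattern.source, pattern.flags.replace(/[gy]/g, ""));
  return doc.text
    .split("\n")
    .map((text, index) => ({ line: index + 1, text }))
    .filter((entry) => stateless.test(entry.text));
}

// Step 3, read: return an inclusive 1-based line range.
function docRead(id: string, fromLine: number, toLine: number): string {
  const doc = store.find((d) => d.id === id);
  if (!doc) return "";
  return doc.text.split("\n").slice(fromLine - 1, toLine).join("\n");
}

const listing = docList();
const matches = docGrep("lease", /rent/i);
const firstLine = matches[0]?.line ?? 1;
const passage = docRead("lease", firstLine, firstLine + 1);
console.log(listing, matches, passage);
```

Nothing here depends on a precomputed index, so a document is searchable as soon as its text is stored.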

The one real tradeoff: paraphrase recall. Grep needs the model to guess synonyms when the user's wording diverges from the document's; the prompt nudges it to. For terminology-heavy reference docs (contracts, policies, tax) wording is usually consistent, so this bites rarely.
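To make the synonym nudge concrete, here is a small sketch of how a guessed synonym list could be turned into one case-insensitive alternation. The `broadenPattern` helper and the sample clause are hypothetical; in the PR the model writes the regex itself in response to the prompt.

```typescript
// Escape regex metacharacters so each term is matched literally.
function escapeRegex(term: string): string {
  return term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build one case-insensitive alternation from a list of candidate wordings.
function broadenPattern(terms: string[]): RegExp {
  const alternatives = terms
    .map((term) => term.trim())
    .filter((term) => term.length > 0)
    .map(escapeRegex);
  if (alternatives.length === 0) {
    throw new Error("at least one non-empty term is required");
  }
  return new RegExp(`(?:${alternatives.join("|")})`, "i");
}

// The user asks about "cancelling"; the document says "terminate".
const clause = "Either party may terminate this agreement on thirty days' notice.";
const narrowHit = broadenPattern(["cancel"]).test(clause);
const broadHit = broadenPattern(["cancel", "terminate", "rescind"]).test(clause);
console.log(narrowHit, broadHit);
```

The narrow pattern misses the clause and the broadened one finds it, which is the recall gap the prompt nudge is meant to cover.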

Notes for reviewers:

The chunk-table drop is intentional and destructive to document_chunks (extracted text is retained on documents, so nothing the user cares about is lost).
Second commit is an unrelated one-line CLAUDE.md convention note (stop reminding about mise run sync after cloud merges); kept as its own commit, hence rebase-merge.

Gate green: svelte-check 0 errors, lint, knip, 1769 tests, build (no warnings), markdownlint.

Generated by Claude Code

The deploy's sync-supabase job applies supabase/schema.sql on every merge to main, so telling the user to run `mise run sync` after a cloud merge is redundant noise in an end-of-task summary. Record that in the schema-changes section so future sessions don't repeat the reminder.

The per-chunk embedding layer behind doc_search was the wrong level of abstraction for a few-dozen-document corpus. It underperformed (semantic ranking surfaced the table of contents and definitions over the operative clauses - the model said it wouldn't have found the answer with doc_search alone) and cost a heavy backfill (thousands of chunks for a multi-MB upload, hours of cron sweeps before a fresh doc was searchable). Routing by doc_list + pinpointing by doc_grep + reading by doc_read covers the job, works the instant a document is uploaded, and is a lot less machinery.

Removed:
- the doc_search tool and searchDocumentsSemantic;
- the document_chunks table, its claim/save/search RPCs, and the embed-input source (schema.sql carries idempotent drops so a sync cleans the old objects off projects that ran the chunked schema);
- chunkText + chunk insertion on upload (insertDocumentChunks), and the chunkText unit tests;
- the embeddings backfill's sixth source (docs back to five).

Changed:
- the Library drawer search now does a substring match over the user's documents (SupabaseService.searchDocuments) instead of semantic chunk search; dropped the per-result snippet that only made sense for passage hits;
- LIBRARY_BLOCK teaches list -> grep -> read (with a nudge to broaden the regex with synonyms, since there's no fuzzy fallback);
- doc tool descriptions and the dev/user docs updated throughout.

The one real tradeoff: paraphrase recall. Grep needs the model to guess synonyms when the user's wording diverges from the document's; the prompt nudges it to do exactly that. For terminology-heavy reference docs (contracts, policies, tax) the wording is usually consistent, so this bites rarely.

claude added 2 commits May 29, 2026 20:29

sysread merged commit 4c7d4a0 into main May 29, 2026
1 check passed

sysread deleted the claude/affectionate-ritchie-1jcnn branch May 29, 2026 21:09

sysread merged 2 commits into main from claude/affectionate-ritchie-1jcnn