feat: IVF coarse quantizer for large corpora (full parity, format v2) #2

Merged
a-tokyo merged 5 commits into main from feat/ivf on Jun 10, 2026

a-tokyo commented Jun 10, 2026

Owner

Summary

Ships the final roadmap wave: an opt-in IVF coarse quantizer (ivf: { nlist, nprobe? }) on TurboQuantIndex, with full parity through IdMapIndex and Collection (config + per-query nprobe + full remove support + serialization), and bumps the version to 0.0.3 (merging this PR auto-publishes).

What's inside

  • core/kmeans — seeded k-means++ init + Lloyd iterations (Lloyd 1982; Arthur & Vassilvitskii 2007), spherical mode for cosine/dot, plain L2 for euclidean, deterministic empty-cluster repair. Bit-reproducible per (data, seed).
  • core/search.searchSlots — probed-subset scan sharing one prepareScan validation path with searchFlat, so both throw identical typed errors; nprobe = nlist reproduces the flat scan bit-for-bit (the IVF analog of the WASM ≡ scalar oracle, asserted in tests).
  • index/coarse.CoarseQuantizer — centroids + posting lists kept in lockstep with the index's swap-remove storage via slot→list/slot→position arrays (O(1) removal patching; invariant-fuzzed).
  • Training UX mirrors calibration — fit-and-freeze from the first non-empty batch when it has ≥ nlist vectors; a smaller first batch freezes the index flat forever. ivfActive getter on all layers.
  • Serialization format v2 — always-written ivf presence byte + nlist/nprobe/centroids/listForSlot (postings rebuilt on load); hardened BAD_IVF validation, bounds-checked before any allocation; v2-only readers per D-010 (ADR D-016 added).
  • bench:ivf — 20k × 768-d clustered, 4-bit: 11.4× QPS at the flat scan's recall (nprobe = nlist/16), 22.8× at nprobe = 1; results JSON committed.
  • Docs — README scope/quickstart/roadmap, guide (new IVF section + error tables), api-reference, architecture, serialization (v2 layout), benchmarks.
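The lockstep bookkeeping described for index/coarse can be sketched as follows. This is a minimal sketch under stated assumptions, not the library's CoarseQuantizer: the class name `PostingLists` and its methods are hypothetical, and slots are assumed to be dense indices where removing a slot moves the last slot into the hole. This version makes no attempt to preserve slot order inside a list.

```typescript
// Hypothetical sketch: posting lists kept in lockstep with swap-remove storage.
class PostingLists {
  private lists: number[][];
  private listForSlot: number[] = []; // slot -> list id
  private posForSlot: number[] = []; // slot -> position inside that list

  constructor(nlist: number) {
    this.lists = Array.from({ length: nlist }, () => [] as number[]);
  }

  get size(): number {
    return this.listForSlot.length;
  }

  // Register the next dense slot as a member of `list` and return the slot.
  add(list: number): number {
    const slot = this.listForSlot.length;
    this.listForSlot.push(list);
    this.posForSlot.push(this.lists[list].length);
    this.lists[list].push(slot);
    return slot;
  }

  // Mirror a swap-remove in the vector storage: the last slot moves into `slot`.
  remove(slot: number): void {
    const last = this.listForSlot.length - 1;
    if (slot < 0 || slot > last) throw new RangeError(`slot ${slot} out of range`);
    this.unlink(slot);
    if (slot !== last) {
      // The vector that lived at `last` now lives at `slot`: repoint its entry.
      const list = this.listForSlot[last];
      const pos = this.posForSlot[last];
      this.lists[list][pos] = slot;
      this.listForSlot[slot] = list;
      this.posForSlot[slot] = pos;
    }
    this.listForSlot.pop();
    this.posForSlot.pop();
  }

  // O(1) unlink: overwrite the entry with the list's tail, then drop the tail.
  private unlink(slot: number): void {
    const list = this.lists[this.listForSlot[slot]];
    const pos = this.posForSlot[slot];
    const tail = list[list.length - 1];
    list[pos] = tail;
    this.posForSlot[tail] = pos;
    list.pop();
  }

  slots(list: number): readonly number[] {
    return this.lists[list];
  }
}
```

Both arrays are indexed by slot, so a removal touches a constant number of entries regardless of list length, which is what makes the O(1) patching possible.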

Verification

  • 408 tests pass (58 new), coverage 98.9 / 95.8 / 100 / 99.7 against the 90% gate
  • typecheck, lint, format:check, build all green
  • Oracles: nprobe = nlist ≡ flat (exact indices+scores, all metrics); interleaved add/remove parity vs a flat twin; serialization round-trip incl. crafted-buffer corruption tests for every BAD_IVF branch
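The nprobe = nlist oracle rests on a coverage property: every slot belongs to exactly one posting list, so probing all lists yields the same candidate set a flat scan would visit. A minimal sketch of that property follows. The function names `probe` and `candidateSlots`, the toy data, and the use of squared L2 distance are assumptions for illustration, not the library's API.

```typescript
// Squared Euclidean distance; used here only to rank centroids.
function sqDist(a: number[], b: number[]): number {
  let s = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    s += d * d;
  }
  return s;
}

// Return the ids of the `nprobe` lists whose centroids are nearest to the query.
function probe(centroids: number[][], query: number[], nprobe: number): number[] {
  const ranked = centroids.map((c, list) => ({ list, d: sqDist(c, query) }));
  ranked.sort((x, y) => x.d - y.d || x.list - y.list); // ties broken by list id
  return ranked.slice(0, nprobe).map((r) => r.list);
}

// Gather the slots stored in the probed lists, in ascending slot order.
function candidateSlots(postings: number[][], lists: number[]): number[] {
  const out: number[] = [];
  for (const l of lists) {
    for (const slot of postings[l]) out.push(slot);
  }
  return out.sort((a, b) => a - b);
}

const centroids = [[0, 0], [10, 10], [0, 10]];
const postings = [[0, 3], [1], [2, 4]]; // every slot appears in exactly one list
const query = [1, 1];

// nprobe = 1 scans only the nearest cell; nprobe = nlist scans every slot.
const narrow = candidateSlots(postings, probe(centroids, query, 1));
const full = candidateSlots(postings, probe(centroids, query, centroids.length));
console.log(narrow, full);
```

Smaller nprobe values trade recall for speed by shrinking the candidate set, which is the tradeoff the benchmark figures above quantify.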

Merging publishes quantvec@0.0.3 to npm via release-publish.yml.

🤖 Generated with Claude Code

Ships the final roadmap wave: opt-in `ivf: { nlist, nprobe? }` on
TurboQuantIndex, flowing through IdMapIndex and Collection.

- core/kmeans: seeded k-means++ + Lloyd (spherical for cosine/dot, L2 for
  euclidean), empty-cluster repair, deterministic per (data, seed)
- core/search: searchSlots probed-subset scan sharing prepareScan validation
  with searchFlat — nprobe = nlist reproduces the flat scan bit-for-bit
- index/coarse: CoarseQuantizer with posting lists in lockstep with
  swap-remove (O(1) membership patching; full remove parity)
- training mirrors calibration: fit-and-freeze from the first ≥ nlist batch
- io/serialize: format VERSION 2 with hardened ivf section (BAD_IVF)
- bench:ivf: 11.4x QPS at the flat scan's recall (nprobe = nlist/16),
  22.8x at nprobe = 1 (20k x 768-d clustered, 4-bit)
- docs: roadmap closed out, guide/api-reference/architecture/serialization
  updated; ADR D-016; version 0.0.3

408 tests, coverage 98.9/95.8/100/99.7 (gate 90).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
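The seeded k-means++ plus Lloyd training named in the commit message above can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the code in core/kmeans: the PRNG (`mulberry32`), the function names, and the `number[][]` data layout are all hypothetical, only plain L2 is covered (no spherical mode), and empty clusters simply keep their previous centroid instead of being repaired.

```typescript
// Small seeded PRNG (mulberry32). An assumption for this sketch, not the library's RNG.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function sqDist(a: number[], b: number[]): number {
  let s = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    s += d * d;
  }
  return s;
}

// k-means++ seeding: each new centroid is sampled with probability
// proportional to its squared distance from the nearest chosen centroid.
function kmeansPlusPlusInit(data: number[][], k: number, rand: () => number): number[][] {
  const centroids = [data[Math.floor(rand() * data.length)].slice()];
  const d2 = data.map((x) => sqDist(x, centroids[0]));
  while (centroids.length < k) {
    const total = d2.reduce((s, v) => s + v, 0);
    let pick = data.length - 1; // fallback if rounding leaves r non-negative
    let r = rand() * total;
    for (let i = 0; i < data.length; i++) {
      r -= d2[i];
      if (r < 0) {
        pick = i;
        break;
      }
    }
    const next = data[pick].slice();
    centroids.push(next);
    data.forEach((x, i) => {
      d2[i] = Math.min(d2[i], sqDist(x, next));
    });
  }
  return centroids;
}

// Lloyd iterations: assign each point to its nearest centroid, then recompute means.
function kmeans(data: number[][], k: number, seed: number, iters = 10): number[][] {
  let centroids = kmeansPlusPlusInit(data, k, mulberry32(seed));
  const dim = data[0].length;
  for (let it = 0; it < iters; it++) {
    const sums = Array.from({ length: k }, () => new Array<number>(dim).fill(0));
    const counts = new Array<number>(k).fill(0);
    for (const x of data) {
      let best = 0;
      let bestD = sqDist(x, centroids[0]);
      for (let c = 1; c < k; c++) {
        const d = sqDist(x, centroids[c]);
        if (d < bestD) {
          best = c;
          bestD = d;
        }
      }
      counts[best]++;
      for (let j = 0; j < dim; j++) sums[best][j] += x[j];
    }
    // Simplification: an empty cluster keeps its old centroid (no repair step).
    centroids = centroids.map((c, i) =>
      counts[i] === 0 ? c : sums[i].map((s) => s / counts[i]),
    );
  }
  return centroids;
}
```

Because all randomness flows through the seeded generator and the loops run in a fixed order, calling `kmeans` twice with the same data and seed returns identical centroids, which is the property the "deterministic per (data, seed)" claim refers to.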

vercel Bot commented Jun 10, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| quantvec | Ready | Preview, Comment | Jun 10, 2026 3:11pm |

Comment thread src/index/coarse.ts Fixed
…alse positive

CodeQL (js/implicit-operand-conversion) failed to resolve the module-level
bigint const and flagged the XOR operand as possibly undefined. Inlining the
literal also lets us drop the explicit 64-bit mask: createRng masks bigint
seeds itself and the separator fits in the mask, so XOR-then-mask equals
mask-then-XOR — the derived seed is bit-identical.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
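The mask argument in the commit message above follows from bitwise AND distributing over XOR: `(s ^ c) & m` equals `(s & m) ^ (c & m)`, and `c & m` is just `c` when the constant fits in the mask. A small check of that identity is sketched below; the separator value and function names are hypothetical stand-ins, not the constants used in src/index/coarse.ts.

```typescript
const MASK_64 = (1n << 64n) - 1n;
// Hypothetical separator; any constant that fits in 64 bits behaves the same way.
const SEPARATOR = 0x9e3779b97f4a7c15n;

// Explicit mask applied after the XOR (the form the commit removed).
function xorThenMask(seed: bigint): bigint {
  return (seed ^ SEPARATOR) & MASK_64;
}

// Mask applied first (as a masking RNG constructor would), XOR done on the masked seed.
function maskThenXor(seed: bigint): bigint {
  return (seed & MASK_64) ^ SEPARATOR;
}

// Seeds wider than 64 bits and negative seeds are included on purpose.
const seeds = [0n, 1n, (1n << 64n) + 5n, (1n << 200n) - 1n, -7n];
const allEqual = seeds.every((s) => xorThenMask(s) === maskThenXor(s));
console.log(allEqual);
```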

Copilot AI left a comment

Pull request overview

This PR adds an opt-in IVF (inverted-file) coarse quantizer to scale search to larger corpora by partitioning vectors into k-means cells and scanning only the probed cells’ slots, while preserving parity across TurboQuantIndex, IdMapIndex, and Collection. It also bumps the serialization format to v2 to persist IVF state (plus docs/benchmarks) and releases quantvec@0.0.3.

Changes:

  • Introduces seeded k-means++/Lloyd training (core/kmeans) and an IVF coarse quantizer with posting-list bookkeeping (index/coarse), integrated into TurboQuantIndex search/add/remove/serialize flows.
  • Adds searchSlots (subset scan) sharing the same validation path as searchFlat, enabling probed-slot scanning with flat-scan parity when nprobe = nlist.
  • Bumps serialization to format v2 with an always-present IVF section flag + IVF state; updates tests, docs, benchmarks, and package version/scripts.
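The idea of an always-present IVF flag whose declared sizes are checked before anything is allocated can be sketched as below. The byte layout here is invented purely for illustration and is NOT quantvec's actual v2 layout (see docs/serialization.md for that); the `FormatError` class and `readIvfSection` function are likewise hypothetical. Only the `BAD_IVF` code name comes from the PR.

```typescript
class FormatError extends Error {
  code: string;
  constructor(code: string, message: string) {
    super(message);
    this.name = "FormatError";
    this.code = code;
  }
}

// Hypothetical layout (little-endian), for illustration only:
// [u8 present][u32 nlist][u32 dim][f32 x nlist x dim centroids]
function readIvfSection(buf: Uint8Array, offset: number): Float32Array[] | null {
  const view = new DataView(buf.buffer, buf.byteOffset, buf.byteLength);
  if (offset + 1 > buf.byteLength) throw new FormatError("BAD_IVF", "missing presence byte");
  const present = view.getUint8(offset);
  if (present === 0) return null; // flag is always written; 0 means "no IVF state"
  if (present !== 1) throw new FormatError("BAD_IVF", "presence byte must be 0 or 1");
  if (offset + 9 > buf.byteLength) throw new FormatError("BAD_IVF", "truncated header");

  const nlist = view.getUint32(offset + 1, true);
  const dim = view.getUint32(offset + 5, true);
  if (nlist === 0 || dim === 0) throw new FormatError("BAD_IVF", "nlist and dim must be positive");

  // Compare the declared payload size with the bytes that actually remain
  // BEFORE allocating, so a crafted header cannot force a huge allocation.
  const needed = nlist * dim * 4;
  const remaining = buf.byteLength - (offset + 9);
  if (needed > remaining) throw new FormatError("BAD_IVF", "centroid payload exceeds buffer");

  const centroids: Float32Array[] = [];
  for (let l = 0; l < nlist; l++) {
    const c = new Float32Array(dim);
    for (let j = 0; j < dim; j++) {
      c[j] = view.getFloat32(offset + 9 + (l * dim + j) * 4, true);
    }
    centroids.push(c);
  }
  return centroids;
}
```

Writing the flag unconditionally means a reader never has to infer from the buffer length whether an IVF section follows, and every malformed case maps to one typed error code.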

Reviewed changes

Copilot reviewed 28 out of 29 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/io/serialize.ts | Serialization format v2: adds IVF section (flag + nlist/nprobe + centroids + listForSlot) and BAD_IVF validation. |
| src/io/serialize.test.ts | Tests for v2 IVF round-trip and crafted-buffer corruption branches; adjusts offsets for new IVF flag byte. |
| src/index/turboquant-index.ts | Adds IVF option types/validation, coarse-quantizer training/freezing, IVF search path (probe + searchSlots), payload/bytes round-trip. |
| src/index/turboquant-index.test.ts | End-to-end IVF correctness/parity tests: training semantics, nprobe=nlist oracle, recall sanity, remove parity, mask behavior, ser/de. |
| src/index/id-map-index.ts | Plumbs IVF training + per-query nprobe through the id-mapped wrapper; exposes ivfActive. |
| src/index/id-map-index.test.ts | Tests IVF passthrough: training via addWithIds, remove parity, filter composition, ser/de preserves IVF + ids. |
| src/index/coarse.ts | New CoarseQuantizer: trains centroids, maintains postings with O(1) swap-remove bookkeeping, probes nearest cells. |
| src/index/coarse.test.ts | Determinism/invariants/fuzz tests for coarse quantizer; probe behavior; fromState round-trip. |
| src/index.ts | Exports IvfOptions type from the public entrypoint. |
| src/ergonomic/types.ts | Adds ivf config and per-search nprobe to the ergonomic Collection API surface. |
| src/ergonomic/collection.ts | Passes IVF config into IdMapIndex; forwards per-query nprobe into search options. |
| src/ergonomic/collection.test.ts | Tests Collection IVF config passthrough and composition with filters + deletes. |
| src/core/search.ts | Refactors shared validation/query prep into prepareScan; adds searchSlots and INVALID_SLOT. |
| src/core/search.test.ts | Adds oracle tests: searchSlots(allSlots) ≡ searchFlat, subset-only behavior, mask handling, validation parity. |
| src/core/kmeans.ts | New deterministic seeded k-means++ + Lloyd implementation with spherical mode and empty-cluster repair. |
| src/core/kmeans.test.ts | Tests validation, determinism, clustering behavior, spherical norms, empty-cluster repair, duplicates. |
| README.md | Updates scope/quickstart/roadmap to document IVF as shipped and provide usage example. |
| package.json | Bumps version to 0.0.3; adds bench:ivf script. |
| package-lock.json | Lockfile version bump to 0.0.3. |
| docs/worklog/DECISIONS.md | Adds ADR D-016 documenting IVF design, training/freeze semantics, v2 serialization decision. |
| docs/serialization.md | Updates spec to v2 body layout including calibration + IVF sections and validation codes. |
| docs/roadmap.md | Moves IVF to shipped; planned section now empty. |
| docs/guide.md | Adds IVF guide section and updates error-code tables for new IVF/search errors. |
| docs/benchmarks.md | Adds IVF benchmark results and interpretation (recall/QPS tradeoff). |
| docs/architecture.md | Updates module map and scope statement to include IVF coarse quantizer. |
| docs/api-reference.md | Documents ivf options, ivfActive, and per-query nprobe in public APIs. |
| benchmarks/results/ivf-d768.json | Commits IVF benchmark results JSON for the documented run. |
| benchmarks/ivf.ts | New deterministic IVF benchmark harness and results writer. |
| .agents/plans/2026-06-10-ivf-wave.md | Implementation plan artifact for the IVF wave. |

Comment thread src/index/turboquant-index.ts
Comment thread src/index/turboquant-index.ts
… fitting

Addresses the two Copilot review findings on #2:

- Extract the query validation (length / finiteness / non-zero) from the shared
  scan preamble into an exported validateQuery, and run it in the IVF search
  branch BEFORE centroid probing — a malformed query now throws the same typed
  SearchError as the flat path without ever reaching the probe arithmetic.
- TurboQuantIndex.add() now validates the batch atomically (validateVectorBatch,
  as IdMapIndex/Collection already do) whenever a calibration/IVF training
  decision is pending: a non-finite or zero row can no longer poison the
  about-to-be-frozen calibration fit or k-means centroids, and a failed first
  batch leaves the index completely unchanged (nothing appended, decision not
  frozen — a later clean batch still trains).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
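The first fix in the commit message above, validating the query before any centroid probing, might look roughly like this. It is a sketch under stated assumptions: `SearchError` and `validateQuery` are names taken from the commit message, but the error codes (`QUERY_LENGTH`, `QUERY_NON_FINITE`, `QUERY_ZERO`), the signatures, and the `searchIvf` wrapper are hypothetical placeholders, not the library's actual API.

```typescript
class SearchError extends Error {
  code: string;
  constructor(code: string, message: string) {
    super(message);
    this.name = "SearchError";
    this.code = code;
  }
}

// Checks length, finiteness, and non-zero norm. The codes are placeholders.
function validateQuery(query: Float32Array, dim: number): void {
  if (query.length !== dim) {
    throw new SearchError("QUERY_LENGTH", `expected ${dim} values, got ${query.length}`);
  }
  let normSq = 0;
  for (let i = 0; i < query.length; i++) {
    const v = query[i];
    if (!Number.isFinite(v)) {
      throw new SearchError("QUERY_NON_FINITE", `non-finite value at index ${i}`);
    }
    normSq += v * v;
  }
  if (normSq === 0) throw new SearchError("QUERY_ZERO", "query has zero norm");
}

// Hypothetical IVF entry point: validation runs first, so a malformed query
// never reaches the probe arithmetic.
function searchIvf(
  query: Float32Array,
  dim: number,
  probe: (q: Float32Array) => number[],
): number[] {
  validateQuery(query, dim);
  return probe(query);
}
```

Sharing one validator between the flat and IVF paths is what lets both throw the same typed error for the same bad input.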

a-tokyo commented Jun 10, 2026

Owner Author

Both review findings addressed in d6a64ee:

  1. Probe-before-validation — query validation (length / finiteness / non-zero) is now extracted into an exported validateQuery shared by the scan preamble, and the IVF branch runs it before centroid probing. A malformed query throws the identical typed SearchError on the flat and IVF paths, verified by a test that asserts the same codes across both. (Note: the error shape was already identical — searchSlots re-validates — but probing garbage first was wasted work and fragile.)

  2. Training on an invalid first batch: TurboQuantIndex.add() now runs validateVectorBatch atomically whenever a calibration/IVF training decision is pending (matching what IdMapIndex.addWithIds/Collection.upsert already did). A non-finite/zero row can no longer poison the about-to-be-frozen centroids or calibration fit; the failed batch leaves the index completely unchanged and the decision not frozen. One correction to the finding: the freeze flag was only ever set after a successful train, so a throwing train never froze state. The real hazard was NaN rows silently poisoning centroids before encode rejected them mid-batch, which this fixes. Covered by new tests for both the IVF and calibration cases.
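The atomic first-batch behavior described in item 2 can be illustrated with a toy index. This is a sketch under stated assumptions rather than TurboQuantIndex itself: `TinyIndex`, its fields, and the stand-in "training" step are hypothetical, and only the name `validateVectorBatch` comes from the comment above (its signature here is assumed).

```typescript
// Reject the whole batch if any row has the wrong length, a non-finite value,
// or zero norm. Signature assumed for illustration.
function validateVectorBatch(batch: number[][], dim: number): void {
  batch.forEach((row, r) => {
    if (row.length !== dim) throw new RangeError(`row ${r}: expected ${dim} values`);
    let normSq = 0;
    for (const v of row) {
      if (!Number.isFinite(v)) throw new RangeError(`row ${r}: non-finite value`);
      normSq += v * v;
    }
    if (normSq === 0) throw new RangeError(`row ${r}: zero vector`);
  });
}

class TinyIndex {
  rows: number[][] = [];
  decisionPending = true; // training decision not yet frozen
  trainedOn = 0; // number of rows the stand-in training step saw
  private dim: number;

  constructor(dim: number) {
    this.dim = dim;
  }

  add(batch: number[][]): void {
    if (this.decisionPending) {
      // Validate everything first: a throw here happens before any mutation.
      validateVectorBatch(batch, this.dim);
      if (batch.length > 0) {
        this.trainedOn = batch.length; // stand-in for calibration / k-means fitting
        this.decisionPending = false; // freeze only after a successful fit
      }
    }
    // Per-row checks in the encode path are omitted from this sketch.
    for (const row of batch) this.rows.push(row.slice());
  }
}

const idx = new TinyIndex(2);
let firstBatchThrew = false;
try {
  idx.add([[1, 0], [NaN, 1]]); // second row is invalid
} catch {
  firstBatchThrew = true;
}
// Nothing was appended and the decision is still open, so a clean batch trains.
idx.add([[1, 0], [0, 1]]);
```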

411 tests green, coverage gate unchanged.

Copilot AI left a comment


Pull request overview

Copilot reviewed 28 out of 29 changed files in this pull request and generated 1 comment.

Comment thread src/index/turboquant-index.ts
…e by construction)

The probed scan visits slots in posting-list order and the top-k heap keeps
tied candidates by visit order, so the "nprobe = nlist == flat scan" oracle
held by tie-dynamics luck rather than by construction (cross-cell exact-score
ties could in principle diverge at the k boundary). Special-case nprobe ===
nlist to searchFlat: canonical slot order, identical validation, and it skips
a pointless full centroid probe. Adds a duplicate-vector boundary-tie
regression test guarding the routing.

Addresses the third Copilot finding on #2.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

a-tokyo commented Jun 10, 2026

Owner Author

Third finding addressed in the latest commit: nprobe === nlist now routes through searchFlat directly (canonical slot order, same validation), so the oracle holds by construction instead of by tie dynamics — and a full-probe pass is skipped as a bonus.

One honest note on severity: identical vectors always assign to the same cell (deterministic nearest-centroid), and within a cell posting order preserves ascending slots, so the realistic duplicate-vector case did not actually diverge (verified by running the new boundary-tie test against the pre-fix code — it passed there too). The divergence requires exact f32 score ties across different cells, which is essentially impossible to construct through the rotation+quantization pipeline. Still worth fixing exactly as suggested: the guarantee should not depend on that reasoning. The new duplicate-vector regression test guards the routing.
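To make the tie-order concern concrete, here is a toy top-k selection in which exact score ties are resolved by visit order. It is a simplified stand-in (a small sorted list rather than the library's heap), with made-up scores and a hypothetical `topK` function, meant only to show how two visit orders over the same slots can disagree at the k boundary.

```typescript
interface Hit {
  slot: number;
  score: number;
}

// Keep the k highest scores; on an exact tie the earlier-visited slot wins.
function topK(visitOrder: number[], scores: number[], k: number): Hit[] {
  const best: Hit[] = [];
  for (const slot of visitOrder) {
    const hit: Hit = { slot, score: scores[slot] };
    // Insert after every existing hit whose score is >= this one (stable).
    let i = best.length;
    while (i > 0 && best[i - 1].score < hit.score) i--;
    best.splice(i, 0, hit);
    if (best.length > k) best.pop();
  }
  return best;
}

// Slots 1 and 2 have exactly equal scores and are assumed to sit in different cells.
const scores = [0.9, 0.5, 0.5];
const flatOrder = [0, 1, 2]; // canonical ascending slot order
const probedOrder = [2, 0, 1]; // the cell holding slot 2 happens to be visited first

const flatHits = topK(flatOrder, scores, 2).map((h) => h.slot);
const probedHits = topK(probedOrder, scores, 2).map((h) => h.slot);
console.log(flatHits, probedHits);
```

Routing nprobe === nlist through the flat scan removes this dependence on visit order, so the equivalence no longer relies on such ties being practically unreachable.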

vercel Bot requested a deployment to Preview on June 10, 2026 15:09 (Abandoned)
a-tokyo changed the title from "feat: IVF coarse quantizer for large corpora (full parity, format v2) — v0.0.3" to "feat: IVF coarse quantizer for large corpora (full parity, format v2)" on Jun 10, 2026
a-tokyo merged commit 61caf5a into main on Jun 10, 2026
1 of 2 checks passed
3 participants