Filter safety by shaleenji · Pull Request #250 · endee-io/endee

shaleenji · 2026-05-15T08:09:44Z

Pull Request

Summary

Splits the work from filter_pass into the portion that is byte-compatible with filter indexes built by master and ships it as filter_safety. The portion that changes the on-disk bucket layout, the numeric sortable-key domain, and the upsert semantics — and therefore requires a reindex — is deliberately deferred to a follow-up PR ("Part 2"). The split is documented in docs/filter_bucket_format_followup.md.

Net effect on an existing deployment: drop in, restart, queries continue to return the same answers, plus the new $gt / $gte / $lt / $lte operators, filter input validation, defensive bitmap deserialization, and a batch of perf and code-hygiene work. No rebuild required.
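To illustrate the semantics of the four new range operators, here is a minimal sketch of evaluating one numeric condition such as `{"$gte": 10}` against one stored value. `RangeOp`, `parse_range_op`, and `matches` are hypothetical names invented for this example; it does not show how endee evaluates these operators against its numeric index.

```cpp
#include <cassert>
#include <optional>
#include <string_view>

// Hypothetical tags for the four range operators named in this PR.
enum class RangeOp { Gt, Gte, Lt, Lte };

// Map an operator string to its tag. Unknown operators yield std::nullopt
// so the caller can reject the filter instead of guessing.
inline std::optional<RangeOp> parse_range_op(std::string_view op) {
    if (op == "$gt")  return RangeOp::Gt;
    if (op == "$gte") return RangeOp::Gte;
    if (op == "$lt")  return RangeOp::Lt;
    if (op == "$lte") return RangeOp::Lte;
    return std::nullopt;
}

// Evaluate `value <op> bound` for a single stored numeric value.
inline bool matches(RangeOp op, double value, double bound) {
    switch (op) {
        case RangeOp::Gt:  return value >  bound;
        case RangeOp::Gte: return value >= bound;
        case RangeOp::Lt:  return value <  bound;
        case RangeOp::Lte: return value <= bound;
    }
    return false;  // not reached for valid enum values
}
```

The strict and inclusive variants differ only at the boundary: `matches(RangeOp::Gt, 3.0, 3.0)` is false, while `matches(RangeOp::Gte, 3.0, 3.0)` is true.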

Headline items in this PR:

New numeric query operators: $gt, $gte, $lt, $lte.
Filter parameter validation (server side) + reject : inside filter keys / values.
Safe roaring-bitmap deserialization (readSafe + internal_validate + payload-size check). Bitmap byte format unchanged — valid master-built bitmaps still parse.
Perf: zero-copy reads from MDBX, meta fetched only for filter-matching vectors, batched / addMany numeric inserts, bounded MDBX write transactions.
Refactors: OperationResult return type plumbed through filter call sites; unified add_filters_from_json.
Build: macOS-friendly clang detection via xcrun.
Tests: new ndd_request_validation_test covering the validation work. 42 pass, 7 skip (3 Part-2 regression alarms guarded by GTEST_SKIP, 4 benchmarks gated on ENDEE_BENCH_DB), 0 fail.
Docs: filter.md expanded; new docs/filter_bucket_format_followup.md enumerates exactly what Part 2 must do to remove the Part-1 carry-forwards (Bucket::count field, is_number_integer branch in sortable_from_json, etc.).
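The "reject : inside filter keys / values" item above can be sketched as a small validation helper. `ValidationResult` and `validate_filter_pair` are hypothetical names; the PR routes errors through its own OperationResult type, whose shape is not shown on this page, and the real validation likely checks more than this.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical result type standing in for the PR's OperationResult.
struct ValidationResult {
    bool ok;
    std::string message;  // empty when ok is true
};

// Reject any filter key or value that contains ':'.
inline ValidationResult validate_filter_pair(std::string_view key,
                                             std::string_view value) {
    if (key.find(':') != std::string_view::npos) {
        return {false, "filter key must not contain ':'"};
    }
    if (value.find(':') != std::string_view::npos) {
        return {false, "filter value must not contain ':'"};
    }
    return {true, ""};
}
```

Returning a message alongside the flag lets the server report which part of the filter was rejected rather than failing silently.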

Type of Change

Bug fix (safe bitmap deserialization hardens against truncated / garbage payloads)
New feature ($gt / $gte / $lt / $lte operators, filter parameter validation)
Breaking change
Documentation update (filter.md + new follow-up doc)
Refactor / cleanup (OperationResult plumbing, unified add_filters_from_json, batched inserts)
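The batched inserts and bounded write transactions mentioned in this PR follow a common pattern: split a large set of updates into chunks of at most a fixed size and commit each chunk separately. The sketch below shows only that chunking logic. `write_in_bounded_chunks` and its callback are hypothetical, and no MDBX calls appear because the PR's actual transaction code is not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Apply `updates` in chunks of at most `max_per_txn`, calling `write_chunk`
// once per chunk with a pointer to the first element and the chunk length.
// In a real store, each call would open, fill, and commit one write
// transaction. Returns the number of chunks written.
template <typename T, typename WriteChunk>
std::size_t write_in_bounded_chunks(const std::vector<T>& updates,
                                    std::size_t max_per_txn,
                                    WriteChunk write_chunk) {
    assert(max_per_txn > 0 && "chunk size must be positive");
    std::size_t chunks = 0;
    for (std::size_t start = 0; start < updates.size(); start += max_per_txn) {
        // The last chunk may be shorter than max_per_txn.
        const std::size_t n = std::min(max_per_txn, updates.size() - start);
        write_chunk(updates.data() + start, n);
        ++chunks;
    }
    return chunks;
}
```

Bounding the chunk size caps how much work any single transaction holds, at the cost of losing atomicity across the whole batch.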

Related Issue

Closes # N/A

Checklist

Code compiles and tests pass — 42 pass, 7 skip, 0 fail in ndd_filter_test + ndd_request_validation_test. Skips are intentional (3 Part-2 regression alarms with explanatory GTEST_SKIP messages, 4 benchmarks gated on ENDEE_BENCH_DB).
New tests added where applicable — tests/request_validation_test.cpp covers the new validation path.
Documentation updated if needed — docs/filter.md expanded; docs/filter_bucket_format_followup.md lists everything deferred to Part 2 and the Part-1 carry-forwards Part 2 must remove.
No unintended breaking changes — Part-1 invariants verified at branch tip: Bucket::is_empty() still checks only ids.empty(); Bucket::serialize / deserialize still write/read the count field; Filter::sortable_from_json still branches on is_number_integer() → int_to_sortable; store_vectors_batch does NOT take is_new_to_db. Indexes built on master remain readable byte-for-byte.
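For readers unfamiliar with sortable keys, the sketch below shows one well-known way to map signed integers to unsigned keys whose unsigned ordering matches the signed ordering: flip the sign bit. The functions `int_to_sortable_sketch` and `sortable_to_int_sketch` are illustrative assumptions; endee's real `int_to_sortable` may use a different width or mapping.

```cpp
#include <cassert>
#include <cstdint>

// Map a signed 64-bit integer to an unsigned key by flipping the sign bit.
// Negative inputs land in the lower half of the key space and non-negative
// inputs in the upper half, so comparing keys as unsigned integers gives
// the same order as comparing the original signed values.
inline std::uint64_t int_to_sortable_sketch(std::int64_t v) {
    return static_cast<std::uint64_t>(v) ^ (std::uint64_t{1} << 63);
}

// Inverse mapping. The cast back to a signed type wraps modulo 2^64
// (guaranteed since C++20; the behavior of mainstream compilers before it).
inline std::int64_t sortable_to_int_sketch(std::uint64_t key) {
    return static_cast<std::int64_t>(key ^ (std::uint64_t{1} << 63));
}
```

Because keys written with one encoding are only comparable with keys written by the same encoding, changing the numeric key domain is the kind of change that forces a reindex, which is why it is deferred to Part 2.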

Cherry-pick of 02acc13 limited to Part-1-compatible tests. Adds request_validation_test.cpp covering filter parameter validation from 3e33557 and wires it into tests/CMakeLists.txt. The remaining contents of 02acc13 (vector_storage_test.cpp, numeric_index_stress_test.cpp, tests/repo_filter.py, and the new TEST_F additions in filter_test.cpp) exercise Part-2 behavior (bitmap-only bucket state, unified float numeric encoding, upsert cleanup, deleteFilter meta sync) and are deferred to Part 2.

Records the four filter_pass commits skipped from the Part-1 split (546430d, b0e8425, e9cca02, 4cb445d), the hpp->cpp refactor (7743296) deferred to be bundled with the bucket layout change, and the Part-2 test files split out of 02acc13. Documents the Part-1 carry-forwards (Bucket count field, sortable_from_json int branch) that exist to keep filter_safety byte-compatible with master-built indexes and that Part 2 should remove.

Three Hypothesis tests from a46d0b8 (safe filter bitmap deserialization) assert behavior that only exists after Part 2:

- Hypothesis2.SaturationCreatesBitmapOnlyEntries: expects Bucket::add to route delta-0 inserts past MAX_SIZE into the summary bitmap (546430d).
- Hypothesis4.DeserializeRejectsLegacyCountFormat: expects the count-less deserializer to reject the legacy on-disk shape (546430d).
- Hypothesis4.ReadSummaryBitmapRejectsLegacyCountFormat: expects read_summary_bitmap to reject the same shape via an alignment check; Part 1 intentionally removed that check because the count field is still part of the layout.

Each test now calls GTEST_SKIP() with a message pointing at docs/filter_part2_followups.md. Part 2 must remove these skips when the underlying fixes land.
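The general idea behind a payload-size check during deserialization can be sketched on a made-up layout. The format below (`[uint32 count][count x uint32 id]`) and the function `read_ids_checked` are assumptions for illustration only; they are not endee's Bucket or roaring-bitmap format, and the PR's actual path relies on readSafe and internal_validate.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

// Hypothetical wire layout, used only for this sketch (host byte order):
//   [uint32 count][count x uint32 id]
inline std::optional<std::vector<std::uint32_t>> read_ids_checked(
    const unsigned char* buf, std::size_t len) {
    std::uint32_t count = 0;
    if (buf == nullptr || len < sizeof(count)) {
        return std::nullopt;  // truncated before the header
    }
    std::memcpy(&count, buf, sizeof(count));

    // Payload-size check: the declared count must account for exactly the
    // remaining bytes, so truncated and over-long buffers are both rejected.
    // Dividing (rather than multiplying count) avoids integer overflow.
    const std::size_t remaining = len - sizeof(count);
    if (remaining % sizeof(std::uint32_t) != 0 ||
        remaining / sizeof(std::uint32_t) != count) {
        return std::nullopt;
    }

    std::vector<std::uint32_t> ids(count);
    if (count > 0) {
        std::memcpy(ids.data(), buf + sizeof(count), remaining);
    }
    return ids;
}
```

Checking the declared size against the bytes actually available before copying is what keeps a garbage header from driving an out-of-bounds read or an oversized allocation.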

github-actions · 2026-05-15T08:15:41Z

VectorDB Benchmark - Ready To Run

CI passed ([lint + unit tests](https://github.com/endee-io/endee/actions/runs/25908084844)); benchmark options unlocked.

Post one of the commands below. Only members with write access can trigger runs.

Available Modes

Mode	Command	What runs
Dense	`/correctness_benchmarking dense`	HNSW insert throughput · query P50/P95/P99 · recall@10 · concurrent QPS
Hybrid	`/correctness_benchmarking hybrid`	Dense + sparse BM25 fusion · same suite + fusion latency overhead

Infrastructure

Server	Role	Instance
Endee Server	Endee VectorDB — code from this branch	`t2.large`
Benchmark Server	Benchmark runner	`t3a.large`

Both servers start on demand and are always terminated after the run — pass or fail.

How Correctness Benchmarking Works

1. Post /correctness_benchmarking <mode>
2. Endee Server is created  →  this branch's code is deployed  →  Endee starts in the chosen mode
3. Benchmark Server is created  →  benchmark suite is transferred
4. Benchmark Server runs correctness benchmarking against Endee Server
5. Results posted back here  →  pass/fail + full metrics table
6. Both servers terminated   →  always, even on failure

After a new push, CI must pass again before this menu reappears.

Move the implementations of CategoryIndex, NumericIndex, Bucket, and Filter from their respective headers into new translation units. The headers now expose only types, declarations, and the tiny inline accessors (sortable_from_float family, Bucket::get_value / is_full / is_empty). Behavior is unchanged; this is a build-time refactor.

Define NDD_FILTER_SOURCES once in the root CMakeLists.txt and pull it into both NDD_CORE_SOURCES (for the main binary) and the ndd_filter_test target so the implementations are linked in both places.

Add #include <thread> to settings.hpp. It uses std::thread::hardware_concurrency() but was relying on a transitive include from the old filter.hpp; the trimmed filter.hpp no longer pulls in <thread>, so the test build broke without this fix.

Verified: ndd_filter_test (42 pass, 7 skip, 0 fail) and ndd_request_validation_test (6 pass, 0 fail) match the pre-split results; ndd-avx2 builds clean.
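The include fix described above comes down to each header including what it uses directly. A minimal sketch, assuming a hypothetical helper `default_worker_count` (not a function from settings.hpp):

```cpp
#include <cassert>
#include <thread>  // declares std::thread::hardware_concurrency()

// Hypothetical helper: pick a worker count. hardware_concurrency() may
// return 0 when the value is not computable, so fall back to 1.
inline unsigned default_worker_count() {
    const unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1u : n;
}
```

With the explicit include, the header keeps compiling regardless of what other headers happen to pull in.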

shaleenji · 2026-05-15T09:39:18Z

================================================================================
Commit : `15192bd`
Short : `15192bd`
Subject: Fix FP16 NEON build on AArch64 CPUs without FP16FML support (#168)
When : 2026-05-15 09:38:04 +0000

LABEL FILTER RESULTS
FilterBoost(%) LabelPct Concurrency Test QPS P99(s) P95(s) Recall
0 0.5 16 test_1 1385.2552 0.0085 0.0078 0.9778
0 0.5 16 test_2 1441.1741 0.0083 0.0075 0.9778
0 0.5 16 test_3 1411.8426 0.008 0.0075 0.9778
0 0.2 16 test_1 728.8524 0.0143 0.0132 0.978
0 0.2 16 test_2 738.2412 0.0145 0.0133 0.978
0 0.2 16 test_3 729.5199 0.0146 0.0133 0.978
0 0.1 16 test_1 459.5179 0.021 0.0198 0.9793
0 0.1 16 test_2 464.8781 0.021 0.0198 0.9793
0 0.1 16 test_3 459.6361 0.0205 0.0193 0.9793
0 0.05 16 test_1 269.6389 0.0345 0.0317 0.9785
0 0.05 16 test_2 273.076 0.033 0.0315 0.9785
0 0.05 16 test_3 271.1142 0.0346 0.0318 0.9785
0 0.02 16 test_1 160.6592 0.0545 0.0506 0.9762
0 0.02 16 test_2 159.4784 0.0551 0.0512 0.9762
0 0.02 16 test_3 161.0747 0.0525 0.0498 0.9762
0 0.01 16 test_1 119.5435 0.0725 0.069 0.9659
0 0.01 16 test_2 117.2639 0.0729 0.0693 0.9659
0 0.01 16 test_3 118.2296 0.0713 0.0681 0.9659
0 0.002 16 test_1 2263.0854 0.005 0.0045 0.9999
0 0.002 16 test_2 2282.6694 0.005 0.0044 0.9999
0 0.002 16 test_3 2221.3103 0.0059 0.0048 0.9999
0 0.001 16 test_1 2908.372 0.0029 0.0029 0.9998
0 0.001 16 test_2 2918.447 0.0037 0.0031 0.9998
0 0.001 16 test_3 2913.0584 0.0032 0.003 0.9998

INT FILTER RESULTS
FilterBoost(%) IntFilterRate Concurrency Test QPS P99(s) P95(s) Recall
0 0.99 16 test_1 118.8615 0.0694 0.0659 0.9652
0 0.99 16 test_2 120.7722 0.0706 0.0671 0.9652
0 0.99 16 test_3 124.4392 0.0712 0.0677 0.9652
0 0.80 16 test_1 538.3185 0.019 0.0178 0.9793
0 0.80 16 test_2 534.2095 0.0201 0.0178 0.9793
0 0.80 16 test_3 524.4287 0.0203 0.0179 0.9793
0 0.50 16 test_1 703.7076 0.0145 0.0134 0.9783
0 0.50 16 test_2 702.1918 0.0151 0.0138 0.9783
0 0.50 16 test_3 715.0694 0.0173 0.015 0.9783
0 0.01 16 test_1 742.828 0.0158 0.0133 0.974
0 0.01 16 test_2 752.7353 0.0155 0.0136 0.974
0 0.01 16 test_3 750.9379 0.0142 0.0133 0.974

shaleenji · 2026-05-15T10:54:35Z

================================================================================
Commit : `d1a5522`
Short : `d1a5522`
Subject: filter: split headers into hpp + cpp
When : 2026-05-15 10:39:01 +0000

LABEL FILTER RESULTS
FilterBoost(%) LabelPct Concurrency Test QPS P99(s) P95(s) Recall
0 0.5 16 test_1 1382.0941 0.0084 0.0075 0.9787
0 0.5 16 test_2 1383.0399 0.0081 0.0073 0.9787
0 0.5 16 test_3 1430.8326 0.0083 0.0076 0.9787
0 0.2 16 test_1 750.5759 0.0142 0.013 0.9773
0 0.2 16 test_2 747.8254 0.0147 0.0135 0.9773
0 0.2 16 test_3 741.3709 0.0137 0.0126 0.9773
0 0.1 16 test_1 454.8632 0.0203 0.0193 0.9801
0 0.1 16 test_2 463.8765 0.0206 0.0191 0.9801
0 0.1 16 test_3 461.1711 0.0204 0.0189 0.9801
0 0.05 16 test_1 276.1424 0.0327 0.0308 0.9782
0 0.05 16 test_2 269.995 0.0336 0.0314 0.9782
0 0.05 16 test_3 271.265 0.0326 0.0305 0.9782
0 0.02 16 test_1 162.0203 0.0526 0.0499 0.9739
0 0.02 16 test_2 159.6628 0.0534 0.0498 0.9739
0 0.02 16 test_3 163.1439 0.0552 0.0508 0.9739
0 0.01 16 test_1 117.433 0.0707 0.0677 0.9645
0 0.01 16 test_2 118.8786 0.0708 0.0674 0.9645
0 0.01 16 test_3 118.2547 0.0731 0.07 0.9645
0 0.002 16 test_1 2915.8864 0.0045 0.0036 0.9999
0 0.002 16 test_2 2928.9737 0.0038 0.0034 0.9999
0 0.002 16 test_3 2934.104 0.0033 0.0032 0.9999
0 0.001 16 test_1 2864.5193 0.0029 0.0027 0.9998
0 0.001 16 test_2 2907.3291 0.0027 0.0024 0.9998
0 0.001 16 test_3 2909.2607 0.0025 0.0023 0.9998

INT FILTER RESULTS
FilterBoost(%) IntFilterRate Concurrency Test QPS P99(s) P95(s) Recall
0 0.99 16 test_1 118.0655 0.0703 0.067 0.9651
0 0.99 16 test_2 118.9901 0.0707 0.0668 0.9651
0 0.99 16 test_3 120.5183 0.0707 0.0678 0.9651
0 0.80 16 test_1 572.2425 0.0176 0.0161 0.9802
0 0.80 16 test_2 578.8009 0.0174 0.0163 0.9802
0 0.80 16 test_3 582.8413 0.0171 0.0157 0.9802
0 0.50 16 test_1 798.4126 0.0137 0.0121 0.9784
0 0.50 16 test_2 783.5072 0.0151 0.0126 0.9784
0 0.50 16 test_3 811.5474 0.0128 0.012 0.9784
0 0.01 16 test_1 832.6413 0.0132 0.0118 0.9746
0 0.01 16 test_2 828.8141 0.0132 0.0118 0.9746
0 0.01 16 test_3 828.8333 0.0126 0.0118 0.9746
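One way to compare the two benchmark comments is to average the three runs per configuration and compute the relative change. The helpers below are a hypothetical sketch for that arithmetic, not part of the benchmark tooling.

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Arithmetic mean of the QPS values from repeated runs (must be non-empty).
inline double mean_qps(const std::vector<double>& runs) {
    assert(!runs.empty());
    return std::accumulate(runs.begin(), runs.end(), 0.0) /
           static_cast<double>(runs.size());
}

// Relative change of `candidate` versus `baseline`, as a percentage.
inline double percent_change(double baseline, double candidate) {
    return (candidate - baseline) / baseline * 100.0;
}
```

For the IntFilterRate 0.50 rows, the three runs at 15192bd average about 707 QPS and the three at d1a5522 about 798 QPS, a change of roughly +12.8%. With only three runs per configuration, differences of this size should be read with run-to-run noise in mind.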

shaleenji · 2026-05-15T10:59:59Z

Server: 8 CPU, 32GB RAM
Client: 4 CPU, 16GB RAM (client concurrency for vectordbbench: 16)

shaleengarg added 27 commits May 15, 2026 07:22

filter

9903d7f

removing dead code

159e205

unified implementation of add_filters_from_json

693badb

grouping numeric insertions for transactionality and performance

7dd581f

addMany instead of a looped add

89e9df0

cleanup

eda67ec

put batch todo comments

0ff2f58

commenting for better understanding

fa9aa99

name changes

7047255

docs updated for understanding

6643ccf

timing function to time individual components of filterd search

0a6697b

no need to copy data from mdbx

d51372d

using return type OperationResult to propagate the logs

29a60a4

comments updated

5d6ae77

filter adding gt, gte, lt, lte

9601ac5

reject filters with : in key or value

bba8e6a

do meta data fetch only for vectors that satisfy the filters

ecbee5d

safe filter bitmap deserialization

4db88aa

filter parameters validation

861a0c5

bounding the filter mdbx size by reducing the number of updates within a txn

f1cd4f5

removing search timing for testing

5edc082

filter docs

c2b1778

mac compile time flags to use xcrun to find the correct clang version

e56debe

filter bucket format followup

89e32ea

shaleenji wants to merge 28 commits into master from filter_safety

2 participants
