fix(sf-324): WriteBehind read surface buffer-aware + flush-vs-coalesce seq-identity (eviction read-consistency)#88

Merged: ivkan merged 6 commits into main from fix/sf-324-eviction-readconsistency on Jun 25, 2026

ivkan commented on Jun 25, 2026

What

Makes the WriteBehind datastore read surface buffer-aware so active eviction no longer exposes stale/missing reads for acked-but-unflushed writes, and closes a flush-vs-coalesce staging race surfaced while fixing it.

Fixes TODO-539 (the TODO-484 G4b soak finding): with active eviction, scan_values/scan_values_batched/enumerate_leaves/list_maps delegated straight to inner redb while load/load_all overlaid the pending buffer — so the QUERY full-scan + Merkle read paths returned stale/missing values for acked writes (2421/4000 keys diverged pre-crash; no-eviction control passed).

Changes

  • Buffer-aware overlay (write_behind.rs): scan_values/scan_values_batched/enumerate_leaves/list_maps now mirror the load/load_all overlay: a staged Some overlays the buffered value, a staged None suppresses the durable row, and is_backup=true short-circuits to inner (staging holds primary writes only). A merkle_leaf_hash of None (OrMap, no leaf) is propagated as-is, with no placeholder leaf.
  • Flush-vs-coalesce seq-identity fix (write_behind.rs): background flush + max-retry discard now drop a staging slot only when its seq matches the flushed entry (clear_staging_if_current); stage() inserts are monotonic by seq so a concurrent older write can't clobber a newer staged value. processors.rs left unchanged (the prescribed merge fix R6 was empirically refuted via A/B).
  • Soak harness (benches/soak_harness/): readback hardened against interleaved live pushes; residency-coupled Merkle gates carved to non-fatal [TODO-530] pending; env-gated server-log passthrough.

Verification

  • AC2–AC8 unit tests green on real redb + WriteBehind; stage_is_monotonic_by_seq green.
  • cargo fmt --check 0 · cargo clippy --all-targets --all-features -- -D warnings 0 · lib suite 1495 / 0.
  • AC1 (re-scoped, user decision): active-eviction no-crash soak → PASS + no-eviction control → PASS (pre-crash divergence 0, was 300–600). The crash-loop whole-soak + live two-client Merkle disagreement + post-restart paths are a different residency-coupled Merkle mechanism, carved to TODO-530.
  • Fresh-context review: v1 CHANGES_REQUESTED → Fix Response v1 → v2 APPROVED. Pre-finalize consolidated cross-vendor net (glm-5.2, full diff) independently confirmed the three core invariants; all advisory findings dispositioned to TODO-541 (pre-existing, masked by keyed ScanProcessor).

Follow-ups

TODO-530 (residency-independent Merkle / SYNC-treewalk), TODO-540 (Fix C: eviction skips pending), TODO-541 (raw scan_values overlay hardening + terminal-discard/list_maps edge cases).

Next

On green CI → merge → TODO-484 G4b soak re-run under crash + ACTIVE eviction as the closing gate.

ivkan added 6 commits June 25, 2026 11:52
…ver-log passthrough

recv_decoded skipped to the first decodable message of any type, so a live
ServerEvent delta (a QUERY_SUB registers a live subscription) interleaving with
a request-response reply made read_all bail with 'expected QUERY_RESP, got
ServerEvent' under churn. Skip unsolicited ServerEvent/ServerBatchEvent/
QueryUpdate/JournalEvent pushes while awaiting the actual reply.
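The skip rule above can be sketched as a small receive loop. This is a synchronous illustration only: the `Message` enum, its payloads, and `recv_reply` are hypothetical stand-ins (only the variant names come from the commit message), and the real harness presumably reads from a socket rather than a queue.

```rust
use std::collections::VecDeque;

/// Hypothetical message kinds. The variant names mirror the push types named
/// in the commit message; the payloads are placeholders.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Message {
    QueryResp(Vec<String>),
    ServerEvent(String),
    ServerBatchEvent(Vec<String>),
    QueryUpdate(String),
    JournalEvent(String),
}

impl Message {
    /// True for messages the server may push at any time, unrelated to the
    /// request currently awaiting a reply.
    fn is_unsolicited_push(&self) -> bool {
        matches!(
            self,
            Message::ServerEvent(_)
                | Message::ServerBatchEvent(_)
                | Message::QueryUpdate(_)
                | Message::JournalEvent(_)
        )
    }
}

/// Pops messages until the first one that is not an unsolicited push.
/// Returns `None` if the inbox runs dry before a reply arrives.
fn recv_reply(inbox: &mut VecDeque<Message>) -> Option<Message> {
    while let Some(msg) = inbox.pop_front() {
        if msg.is_unsolicited_push() {
            continue; // live delta interleaved with the reply: skip it
        }
        return Some(msg);
    }
    None
}
```

With this shape, a `ServerEvent` arriving ahead of the `QueryResp` is dropped by the reader instead of failing the read.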

Add env-gated SOAK_SERVER_LOG_PASSTHROUGH to mirror server log lines to the
harness stderr for operator diagnostics (e.g. confirming eviction fires).

Surfaced while running the TODO-484 G4b soak re-run; needed to observe real
convergence under active eviction (see TODO-539).

Under active eviction (SPEC-323), a buffered-but-not-yet-flushed record can
have its resident engine copy evicted while the value lives only in the
write-behind staging buffer. load()/load_all() already overlay staging for
read-your-writes, but scan_values/scan_values_batched/enumerate_leaves/
list_maps delegated straight to the durable inner store, so the full-scan
query path and the Merkle leaf source returned stale-or-missing values for
acked writes (read-your-writes broken under memory pressure).

Overlay the map's pending staging set onto each durable read:
- buffered Some replaces the durable row/leaf (newer buffered value;
  enumerate recomputes the leaf hash so a buffered-then-flushed record
  leaves the Merkle root unchanged)
- buffered None (pending delete) suppresses the durable row/leaf so an
  evicted-then-deleted key cannot resurrect
- staging-only keys (buffered, never flushed) are emitted exactly once: in
  the first scan_values batch (resumed batches overlay only) and after the
  durable leaf enumeration, propagating merkle_leaf_hash's None for OrMap
  entries that contribute no leaf
- list_maps unions the durable catalog with staging-only maps
- is_backup scans delegate straight to inner: staging holds only non-backup
  writes, so overlaying a backup scan would double-count
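The overlay rules in the list above can be sketched as follows. This is a simplified model, not the code in write_behind.rs: `overlay_scan`, the `String` key/value types, and the use of a `BTreeMap` for the durable rows are all assumptions made for illustration, and batching and leaf hashing are left out.

```rust
use std::collections::{BTreeMap, HashMap};

/// Overlays a pending staging set onto rows read from the durable store.
/// `staging` maps key -> Some(value) for a buffered write and key -> None
/// for a buffered delete. Hypothetical shape, not the real WriteBehind types.
fn overlay_scan(
    durable: &BTreeMap<String, String>,
    staging: &HashMap<String, Option<String>>,
    is_backup: bool,
) -> Vec<(String, String)> {
    // Backup scans go straight to the durable rows: staging holds only
    // non-backup writes, so overlaying here would double-count.
    if is_backup {
        return durable.iter().map(|(k, v)| (k.clone(), v.clone())).collect();
    }

    let mut out = Vec::new();
    for (key, durable_value) in durable {
        match staging.get(key) {
            // Buffered write: the newer staged value replaces the durable row.
            Some(Some(buffered)) => out.push((key.clone(), buffered.clone())),
            // Buffered delete: suppress the durable row (no resurrection).
            Some(None) => {}
            // Nothing pending for this key: emit the durable row unchanged.
            None => out.push((key.clone(), durable_value.clone())),
        }
    }

    // Staging-only keys (buffered, never flushed) are emitted exactly once,
    // after the durable rows. Sorting is only for a deterministic result.
    let mut staging_only: Vec<(String, String)> = staging
        .iter()
        .filter(|(k, _)| !durable.contains_key(*k))
        .filter_map(|(k, v)| v.as_ref().map(|v| (k.clone(), v.clone())))
        .collect();
    staging_only.sort();
    out.extend(staging_only);
    out
}
```

A staging-only pending delete produces no row at all, which matches the AC6 expectation that a delete-only entry is not surfaced.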

Staging set is collected upfront per map, bounded by the buffer capacity.

Drive a WriteBehindDataStore over a real redb inner with a long flush delay
so writes stay buffered, then assert the overlay surface:
- AC2: buffered-only key surfaces in scan; no double-count after flush;
  newer buffered write overrides an older flushed value, exactly once
- AC3: a buffered pending delete hides the flushed value from scan and
  enumerate_leaves (no resurrection)
- AC4: leaf hash is identical buffered vs flushed (Merkle root unchanged);
  OrMap entry yields a leaf, OrTombstones (merkle_leaf_hash None) yields none
- AC5: backup scan/enumerate is the inner result, no non-backup overlay
- AC6: list_maps unions a staging-only map, excludes a delete-only map
- AC7: multi-batch scan over flushed + staging-only keys returns the full
  key set exactly once, no boundary miss, no staging-only duplicate

The background flush drained an older write, persisted it, then cleared the
key's staging slot unconditionally. Because the partition-queue lock is released
during the persist, a newer write can coalesce into staging in that window; the
unconditional clear then wiped it, dropping read-your-writes back to the stale
durable value. Under active eviction the resident copy is gone, so the staging
overlay is the only correct source -- this is the soak's pre-crash
`expected=2 actual=1` residual.
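A minimal sketch of the race window described above, contrasting an unconditional clear with a seq-checked one. `Slot`, `Staging`, and `clear_if_current` are hypothetical stand-ins for illustration (the real helper is named clear_staging_if_current, and its signature is not shown here); locking and the actual persist step are omitted.

```rust
use std::collections::HashMap;

/// A staging slot tagged with the sequence of the write that produced it.
#[derive(Debug, Clone, PartialEq)]
struct Slot {
    seq: u64,
    value: Option<String>, // None models a pending delete
}

#[derive(Default)]
struct Staging {
    slots: HashMap<String, Slot>,
}

impl Staging {
    /// Pre-fix behaviour: drop the slot no matter which write owns it.
    fn clear_unconditionally(&mut self, key: &str) {
        self.slots.remove(key);
    }

    /// Seq-checked clear: drop the slot only if it still belongs to the
    /// write that was just flushed. Returns whether the slot was removed.
    fn clear_if_current(&mut self, key: &str, flushed_seq: u64) -> bool {
        let is_current = self
            .slots
            .get(key)
            .map_or(false, |slot| slot.seq == flushed_seq);
        if is_current {
            self.slots.remove(key);
        }
        is_current
    }
}
```

If a write with seq 2 coalesces into the slot while seq 1 is being persisted, `clear_if_current(key, 1)` leaves the slot alone, whereas `clear_unconditionally` would discard the seq 2 value and expose the stale durable row.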

Tag each staging slot with its originating write's sequence and remove it only
when that seq still matches the flushed entry (clear_staging_if_current); a
newer coalesced write survives until its own flush. pending_count still
decrements per terminal flush. Adds ac8 white-box test driving the exact
flush-vs-coalesce race window.

After the read-surface fix, the soak's pre-crash full-scan convergence passes
under active eviction, but two clients still read different Merkle roots (live)
and the root + delta read-back change across restart. These are one defect: the
Merkle root/index is built from the resident set, so eviction (mutating
residency) and kill -9 (dropping the in-memory index) both change it. That is
TODO-530's residency-independent-Merkle / SYNC-treewalk track, not the
read-surface buffer-awareness this spec delivers.
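To make the residency coupling concrete, here is a toy model in which the root is an XOR of leaf hashes over whatever set it is given. The hashing scheme, `leaf_hash`, and `root_of` are illustrative assumptions only and say nothing about how the real Merkle index is built; the point is just that a root computed over the resident set moves when a key is evicted, even though the durable data is unchanged.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Toy leaf hash over a key/value pair (illustrative, not the real scheme).
fn leaf_hash(key: &str, value: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    value.hash(&mut hasher);
    hasher.finish()
}

/// Toy order-independent root: XOR of all leaf hashes in `entries`.
fn root_of(entries: &BTreeMap<String, String>) -> u64 {
    entries
        .iter()
        .fold(0, |acc, (key, value)| acc ^ leaf_hash(key, value))
}
```

Calling `root_of` on a resident map with one key removed gives a different value than calling it on the full durable map, which is the shape of disagreement the two gates were tripping on.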

Route the live two-client Merkle disagreement and the post-restart
merkle-root/delta/query gates to the non-fatal pending_gates bucket (logged,
[TODO-530]-tagged), so the active-eviction soak reports the read-surface
capability it actually verifies. Rename pending_322b -> pending_gates.

…ening)

Cross-vendor (glm-5.2) review of the seq-identity fix flagged that two
concurrent writers on the same key can interleave between next_sequence() and
the staging insert, letting an older write's value clobber the newer staged
one (a plain insert is last-writer-wins by wall-clock, not by seq). That would
make a slot's seq untruthful and break the identity clear_staging_if_current
relies on. Route all three staging inserts through a monotonic stage() helper
that only replaces the slot when the incoming seq >= the current seq. Adds
stage_is_monotonic_by_seq test.
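The monotonic insert can be sketched with the standard `HashMap` entry API. `StagedSlot` and this free-function `stage` are hypothetical shapes for illustration; the real helper's signature is not shown in this PR text. The comparison follows the commit message: replace when the incoming seq is greater than or equal to the current one.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// A staged value tagged with its originating write's sequence number.
#[derive(Debug, Clone, PartialEq)]
struct StagedSlot {
    seq: u64,
    value: Option<String>, // None models a pending delete
}

/// Stores `incoming` unless the slot already holds a strictly newer seq.
/// Returns true if the slot now holds `incoming`.
fn stage(slots: &mut HashMap<String, StagedSlot>, key: &str, incoming: StagedSlot) -> bool {
    match slots.entry(key.to_string()) {
        Entry::Occupied(mut occupied) => {
            if incoming.seq >= occupied.get().seq {
                occupied.insert(incoming); // newer or equal seq wins
                true
            } else {
                false // an older write arrived late: keep the newer value
            }
        }
        Entry::Vacant(vacant) => {
            vacant.insert(incoming);
            true
        }
    }
}
```

Because a late-arriving older write is rejected, the seq stored in a slot always describes the value stored next to it, which is the property the seq-checked clear depends on.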
@cloudflare-workers-and-pages

Deploying topgun with Cloudflare Pages

Latest commit: 5e6b09a
Status: ✅  Deploy successful!
Preview URL: https://350f7588.topgun-f45.pages.dev
Branch Preview URL: https://fix-sf-324-eviction-readcons.topgun-f45.pages.dev

ivkan merged commit 4916d31 into main on Jun 25, 2026
16 checks passed
ivkan deleted the fix/sf-324-eviction-readconsistency branch on June 25, 2026 16:23