Skip to content

fix(sf-323): activate cost-based eviction — measure per-record cost + mark records clean once durable#87

Merged
ivkan merged 11 commits into
mainfrom
fix/sf-323-eviction-cost
Jun 24, 2026
Merged

fix(sf-323): activate cost-based eviction — measure per-record cost + mark records clean once durable#87
ivkan merged 11 commits into
mainfrom
fix/sf-323-eviction-cost

Conversation

@ivkan

@ivkan ivkan commented Jun 24, 2026

Copy link
Copy Markdown
Member

Summary

Activates cost-based eviction, which was dormant in production. Two gates were keeping the eviction orchestrator from ever firing:

  1. Cost hardwired to 0 — every record's metadata.cost was 0 at all three write sites, so owned_entry_cost() summed to 0, current_ram stayed below the high-water mark, and eviction never engaged.
  2. Records permanently dirtyevict_lru filters !is_dirty(), but on_store() had zero production callers, so every put()-written record stayed dirty forever and was never eligible for eviction (a second dormancy gate found mid-implementation).

What changed

  • Cost is genuinely measured → stored → read end-to-end: estimated_cost(&RecordValue) returns the rmp_serde::to_vec_named serialized byte length (floor ≥ 1) + key.len(), stamped at put(), get() lazy-load, and hydrate_loaded().
  • Records marked clean once durable (R6) via a new atomic StorageEngine::mark_stored(key, now) (DashMap get_mut per-key write lock) called after a successful WAL-fsync'd data_store.add(). Null-store-backed records correctly stay dirty.
  • Cost stays local & never-serializedRecordMetadata keeps no serde derives, enforced at compile time via assert_not_impl_any!(RecordMetadata: Serialize, DeserializeOwned).
  • Serde-failure cost fallback → tracing::warn! + bounded COST_ESTIMATE_FALLBACK = 4096 (a huge sentinel would make the orchestrator thrash evicting innocent records under an unreachable ceiling).

Correctness contract

mark_stored's last_update_time <= now is a best-effort timestamp guard, not a per-write identity guard — under a same-millisecond concurrent same-key put() it can transiently mark a not-yet-durable record clean. This window is bounded and self-healing (the concurrent persist completes independently of eviction; put() acks only after its own persist), so no acked-durable write is ever lost. The trait doc and put() comment state this honestly. The airtight per-write-token fix is tracked in TODO-537.

Verification

  • cargo fmt --check ✓ · cargo clippy --all-targets --all-features -- -D warnings ✓ · cargo test --release -p topgun-server --lib → 1487/0/0
  • Behavioral e2e (AC1–AC4, AC6): eviction fires under real cost pressure (drives the real private tick() defect branch, asserts dirty_count()==0 + >0 evicted, non-vacuous); query correct under active eviction; restart correct under active eviction (TODO-530 holds); cost non-leakage; write-through clean-marking.
  • Two independent review nets: fresh-context opus impl-reviewer (Review v4 APPROVED, wiring grep-confirmed) + glm-5.2 cross-vendor net (clean — confirmed the deferred items are correctly scoped).

Review trail

v1 CHANGES_REQUESTED (R6 TOCTOU) → Fix Response v1 (atomic mark_stored) → v2 APPROVED → v3 overturned by cross-vendor (timestamp-vs-identity guard) → Fix Response v2 (honest contract + 4 minors) → v4 APPROVED → post-v4 cross-vendor net clean.

Follow-ups

  • TODO-537 — per-write identity-token clean-marking (close the ms-tie false-clean window).
  • TODO-538 — cost-model heap accuracy (serialized-size proxy under-counts true heap footprint).
  • TODO-484 — G4b soak re-run under crash + active eviction to confirm TODO-530 closed end-to-end now eviction actually fires.

Note on file count

7 source files touched (vs the 5-file Rust spec cap) + test-only Cargo.toml/Cargo.lock. Breach-by-design: the R6 correctness fix is a single unit (one trait method + its sole impl), consistent with SPEC-310a/318/322c precedent.

ivkan added 11 commits June 24, 2026 20:11
…nvariant (G1)

- Add pub fn estimated_cost(value: &RecordValue) -> u64 in record.rs using
  rmp_serde::to_vec_named with >= 1 floor (never panic on write path)
- Document that cost field is local-only/never-wire in RecordMetadata doc
- RecordMetadata intentionally retains NO Serialize/Deserialize derive
- Call sites must add key.len() as u64 to capture key heap contribution
…coding test (G2)

- put(): measure estimated_cost(&value) + key.len() before value moves into record
- get() lazy-load: measure cost of loaded value + key.len() on cache miss
- hydrate_loaded(): measure cost of hydrated value + key.len()
- size_and_owned_entry_cost_reflect_records: assert > 0 (was assert_eq == 0)
  The old assertion encoded the dormant-eviction bug; now guards the fix
…n (G3)

- AC1 (eviction_orchestrator.rs): real tick() fires under real cost pressure;
  drives private tick() from within the same cfg(test) mod; asserts > 0 records
  evicted and owned_entry_cost drops; non-vacuous (flush-then-eligible ordering)
- AC2 (eviction_cost_test.rs): query correct under active eviction; full datastore
  scan via scan_values returns complete record set with evicted keys non-resident;
  proves SPEC-322c streaming path is load-bearing
- AC3 (eviction_cost_test.rs): restart correct under active eviction; fresh engine
  over same redb, all written keys visible via datastore scan (TODO-530 closed)
- AC4 (eviction_cost_test.rs): cost non-leakage; RecordValue round-trip via
  rmp_serde carries no cost field; NullDataStore probe confirms metadata never
  reaches the persistence layer
- storage/mod.rs: register eviction_cost_test under #[cfg(test)]
- record.rs: add #[must_use] to estimated_cost() (must_use_candidate)
- eviction_orchestrator.rs: backtick-quote identifiers in doc comments
- eviction_cost_test.rs: backtick-quote identifiers in doc comments + fmt reformat
…able (R6)

- put() with Client/CrdtMerge provenance: after data_store.add() succeeds,
  re-fetch the record from the engine and call metadata.on_store(now), marking
  it clean and eligible for eviction. Gated on !is_null() so NullDataStore
  (no real persistence) leaves records dirty as required.
- get() lazy-load: records loaded from the datastore enter the engine clean
  (last_stored_time = now via on_store) — already persisted, immediately
  re-evictable in the evict→reload steady state.
- hydrate_loaded(): same as get() — datastore-materialized records enter clean.

Without this fix, evict_lru's !is_dirty() filter (line 416) excluded every
put()-written record permanently, making eviction dead in production even after
the cost fix (R2). on_store() had zero production callers before this commit.
AC1 (eviction_orchestrator.rs):
- Remove manual last_stored_time re-insertion hack that faked dirty→clean
  transition. With R6, put() over a real redb datastore calls on_store(now)
  after add() succeeds, marking records genuinely clean. dirty_count() == 0
  assertion added to prove real eligibility without manipulation.

AC2/AC3 (eviction_cost_test.rs):
- Remove write_and_flush_records helper's snapshot re-insertion hack.
  Replaced by write_records() which simply calls put() — records are clean
  because R6 marks them via the real write-through path. Added dirty_count()
  assertion in AC2 to confirm real eligibility.

AC6 (eviction_cost_test.rs, new):
- Part 1: put() with Client provenance over real redb → dirty_count() == 0
  AND evict_lru evicts the record (genuinely eligible, no hack).
- Part 2: put() over NullDataStore → dirty_count() >= 1, evict_lru returns 0
  (no real persistence → stays dirty, correctly ineligible for eviction).
- Part 3: get() lazy-load → record enters engine clean (dirty_count() == 0),
  evict_lru can evict it immediately (evict→reload steady state works).
…ored

- Fixed: put() Step 6 re-fetch+re-put (engine.get -> on_store -> engine.put)
  was a read-modify-write with no per-key lock across Step 4 -> Step 6. Under
  concurrent same-key/same-partition writes it could re-put a clean-stamped
  value over a newer durable write (lost write + false-clean).
- Add StorageEngine::mark_stored(key, now): atomic in-place clean-mark under
  the DashMap per-key write lock; applies only when last_update_time <= now so
  a newer concurrent (still-dirty) write is never falsely marked evictable and
  an already-clean newer record is never regressed.
- Replace put() Step 6 RMW with engine.mark_stored; no whole-record re-put.
- Add 3 unit tests: marks resident clean, skips newer write, absent-key false.
- Correct the StorageEngine::mark_stored trait doc and the put() write-through
  comment: the last_update_time<=now guard is a best-effort timestamp mark, NOT
  a per-write identity check. State plainly that two concurrent same-key puts in
  the same millisecond can transiently mark the other writer's not-yet-durable
  record clean, and why that is bounded + self-healing (concurrent persist
  completes independently of eviction; put() acks only after its own persist, so
  no acked-durable write is lost). Replaces the prior over-stated guarantee.
- Trace mark_stored's false return (record evicted or superseded after persist)
  to surface write-then-immediately-evict churn.
- Replace the silent unwrap_or(1) (which blinds the eviction orchestrator to a
  record's true size) with a tracing::warn! plus COST_ESTIMATE_FALLBACK. The
  sentinel is deliberately moderate, not u64::MAX-large: a value that fails to
  encode also cannot be persisted, so it stays dirty/un-evictable, and a huge
  sentinel would make the orchestrator thrash evicting innocent records under a
  ceiling it can never reach.
- Document that estimated_cost is a deliberate second encode of the value
  (persistence encodes it again); folding the two is a tracked follow-up.
- Add static_assertions dev-dependency and assert_not_impl_any!(RecordMetadata:
  Serialize, DeserializeOwned) in the AC4 module. Turns the cost-non-leakage
  invariant from a comment into an enforced contract: adding a serde derive to
  RecordMetadata now fails to compile, not just at runtime.
@cloudflare-workers-and-pages

Copy link
Copy Markdown

Deploying topgun with  Cloudflare Pages  Cloudflare Pages

Latest commit: e58f016
Status: ✅  Deploy successful!
Preview URL: https://d650e983.topgun-f45.pages.dev
Branch Preview URL: https://fix-sf-323-eviction-cost.topgun-f45.pages.dev

View logs

@ivkan ivkan merged commit 48b3266 into main Jun 24, 2026
16 checks passed
@ivkan ivkan deleted the fix/sf-323-eviction-cost branch June 24, 2026 20:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant