fix(sf-323): activate cost-based eviction — measure per-record cost + mark records clean once durable#87
Merged
Merged
Conversation
…nvariant (G1) - Add pub fn estimated_cost(value: &RecordValue) -> u64 in record.rs using rmp_serde::to_vec_named with >= 1 floor (never panic on write path) - Document that cost field is local-only/never-wire in RecordMetadata doc - RecordMetadata intentionally retains NO Serialize/Deserialize derive - Call sites must add key.len() as u64 to capture key heap contribution
…coding test (G2) - put(): measure estimated_cost(&value) + key.len() before value moves into record - get() lazy-load: measure cost of loaded value + key.len() on cache miss - hydrate_loaded(): measure cost of hydrated value + key.len() - size_and_owned_entry_cost_reflect_records: assert > 0 (was assert_eq == 0) The old assertion encoded the dormant-eviction bug; now guards the fix
…n (G3) - AC1 (eviction_orchestrator.rs): real tick() fires under real cost pressure; drives private tick() from within the same cfg(test) mod; asserts > 0 records evicted and owned_entry_cost drops; non-vacuous (flush-then-eligible ordering) - AC2 (eviction_cost_test.rs): query correct under active eviction; full datastore scan via scan_values returns complete record set with evicted keys non-resident; proves SPEC-322c streaming path is load-bearing - AC3 (eviction_cost_test.rs): restart correct under active eviction; fresh engine over same redb, all written keys visible via datastore scan (TODO-530 closed) - AC4 (eviction_cost_test.rs): cost non-leakage; RecordValue round-trip via rmp_serde carries no cost field; NullDataStore probe confirms metadata never reaches the persistence layer - storage/mod.rs: register eviction_cost_test under #[cfg(test)]
- record.rs: add #[must_use] to estimated_cost() (must_use_candidate) - eviction_orchestrator.rs: backtick-quote identifiers in doc comments - eviction_cost_test.rs: backtick-quote identifiers in doc comments + fmt reformat
…able (R6) - put() with Client/CrdtMerge provenance: after data_store.add() succeeds, re-fetch the record from the engine and call metadata.on_store(now), marking it clean and eligible for eviction. Gated on !is_null() so NullDataStore (no real persistence) leaves records dirty as required. - get() lazy-load: records loaded from the datastore enter the engine clean (last_stored_time = now via on_store) — already persisted, immediately re-evictable in the evict→reload steady state. - hydrate_loaded(): same as get() — datastore-materialized records enter clean. Without this fix, evict_lru's !is_dirty() filter (line 416) excluded every put()-written record permanently, making eviction dead in production even after the cost fix (R2). on_store() had zero production callers before this commit.
AC1 (eviction_orchestrator.rs): - Remove manual last_stored_time re-insertion hack that faked dirty→clean transition. With R6, put() over a real redb datastore calls on_store(now) after add() succeeds, marking records genuinely clean. dirty_count() == 0 assertion added to prove real eligibility without manipulation. AC2/AC3 (eviction_cost_test.rs): - Remove write_and_flush_records helper's snapshot re-insertion hack. Replaced by write_records() which simply calls put() — records are clean because R6 marks them via the real write-through path. Added dirty_count() assertion in AC2 to confirm real eligibility. AC6 (eviction_cost_test.rs, new): - Part 1: put() with Client provenance over real redb → dirty_count() == 0 AND evict_lru evicts the record (genuinely eligible, no hack). - Part 2: put() over NullDataStore → dirty_count() >= 1, evict_lru returns 0 (no real persistence → stays dirty, correctly ineligible for eviction). - Part 3: get() lazy-load → record enters engine clean (dirty_count() == 0), evict_lru can evict it immediately (evict→reload steady state works).
…ored - Fixed: put() Step 6 re-fetch+re-put (engine.get -> on_store -> engine.put) was a read-modify-write with no per-key lock across Step 4 -> Step 6. Under concurrent same-key/same-partition writes it could re-put a clean-stamped value over a newer durable write (lost write + false-clean). - Add StorageEngine::mark_stored(key, now): atomic in-place clean-mark under the DashMap per-key write lock; applies only when last_update_time <= now so a newer concurrent (still-dirty) write is never falsely marked evictable and an already-clean newer record is never regressed. - Replace put() Step 6 RMW with engine.mark_stored; no whole-record re-put. - Add 3 unit tests: marks resident clean, skips newer write, absent-key false.
- Correct the StorageEngine::mark_stored trait doc and the put() write-through comment: the last_update_time<=now guard is a best-effort timestamp mark, NOT a per-write identity check. State plainly that two concurrent same-key puts in the same millisecond can transiently mark the other writer's not-yet-durable record clean, and why that is bounded + self-healing (concurrent persist completes independently of eviction; put() acks only after its own persist, so no acked-durable write is lost). Replaces the prior over-stated guarantee. - Trace mark_stored's false return (record evicted or superseded after persist) to surface write-then-immediately-evict churn.
- Replace the silent unwrap_or(1) (which blinds the eviction orchestrator to a record's true size) with a tracing::warn! plus COST_ESTIMATE_FALLBACK. The sentinel is deliberately moderate, not u64::MAX-large: a value that fails to encode also cannot be persisted, so it stays dirty/un-evictable, and a huge sentinel would make the orchestrator thrash evicting innocent records under a ceiling it can never reach. - Document that estimated_cost is a deliberate second encode of the value (persistence encodes it again); folding the two is a tracked follow-up.
- Add static_assertions dev-dependency and assert_not_impl_any!(RecordMetadata: Serialize, DeserializeOwned) in the AC4 module. Turns the cost-non-leakage invariant from a comment into an enforced contract: adding a serde derive to RecordMetadata now fails to compile, not just at runtime.
Deploying topgun with
|
| Latest commit: |
e58f016
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://d650e983.topgun-f45.pages.dev |
| Branch Preview URL: | https://fix-sf-323-eviction-cost.topgun-f45.pages.dev |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Activates cost-based eviction, which was dormant in production. Two gates were keeping the eviction orchestrator from ever firing:
0— every record'smetadata.costwas0at all three write sites, soowned_entry_cost()summed to0,current_ramstayed below the high-water mark, and eviction never engaged.evict_lrufilters!is_dirty(), buton_store()had zero production callers, so everyput()-written record stayed dirty forever and was never eligible for eviction (a second dormancy gate found mid-implementation).What changed
estimated_cost(&RecordValue)returns thermp_serde::to_vec_namedserialized byte length (floor ≥ 1) +key.len(), stamped atput(),get()lazy-load, andhydrate_loaded().StorageEngine::mark_stored(key, now)(DashMapget_mutper-key write lock) called after a successful WAL-fsync'ddata_store.add(). Null-store-backed records correctly stay dirty.RecordMetadatakeeps no serde derives, enforced at compile time viaassert_not_impl_any!(RecordMetadata: Serialize, DeserializeOwned).tracing::warn!+ boundedCOST_ESTIMATE_FALLBACK = 4096(a huge sentinel would make the orchestrator thrash evicting innocent records under an unreachable ceiling).Correctness contract
mark_stored'slast_update_time <= nowis a best-effort timestamp guard, not a per-write identity guard — under a same-millisecond concurrent same-keyput()it can transiently mark a not-yet-durable record clean. This window is bounded and self-healing (the concurrent persist completes independently of eviction;put()acks only after its own persist), so no acked-durable write is ever lost. The trait doc andput()comment state this honestly. The airtight per-write-token fix is tracked in TODO-537.Verification
cargo fmt --check✓ ·cargo clippy --all-targets --all-features -- -D warnings✓ ·cargo test --release -p topgun-server --lib→ 1487/0/0tick()defect branch, assertsdirty_count()==0+ >0 evicted, non-vacuous); query correct under active eviction; restart correct under active eviction (TODO-530 holds); cost non-leakage; write-through clean-marking.Review trail
v1 CHANGES_REQUESTED (R6 TOCTOU) → Fix Response v1 (atomic
mark_stored) → v2 APPROVED → v3 overturned by cross-vendor (timestamp-vs-identity guard) → Fix Response v2 (honest contract + 4 minors) → v4 APPROVED → post-v4 cross-vendor net clean.Follow-ups
Note on file count
7 source files touched (vs the 5-file Rust spec cap) + test-only
Cargo.toml/Cargo.lock. Breach-by-design: the R6 correctness fix is a single unit (one trait method + its sole impl), consistent with SPEC-310a/318/322c precedent.