fix(shard): stable multishard + disk-offload on both runtimes; kill the multishard RAM zombie by pilotspacex-byte · Pull Request #136 · pilotspace/moon

pilotspacex-byte · 2026-06-03T16:01:27Z

Summary

Makes Moon stable in multishard mode on both runtimes (monoio + tokio) with disk-offload, and eliminates the long-standing "zombie eating RAM in multishard" symptom. Ports the PR #129 per-shard AOF work onto main, lands the disk-offload durability fixes, and closes the multishard serving + bounded-RAM gaps with benchmark evidence on both runtimes.

The "zombie eating RAM in multishard" report — three independent root causes

Serving hang (tokio/Linux). The central tokio listener plain-bound the port (no SO_REUSEPORT) while each shard binds its own SO_REUSEPORT listener — incompatible sockets on one port, resolved by a bind-order race. Either way PING was never answered (port up, shards parked, nothing dispatched — the "zombie" signature). Fix: the central listener also binds SO_REUSEPORT, so it coexists with the shard sockets regardless of order. monoio was always healthy.
io_uring under tokio. tokio's experimental io_uring bridge floods Unknown io_uring event type: 0 and drops connections under load. Fix: default-OFF for tokio (MOON_URING=1 to opt in; MOON_NO_URING still force-disables and stays the CI default). io_uring remains in monoio, the production runtime.
PageCache eager pre-alloc (the actual RAM zombie). --disk-offload defaults to enable, so each shard built a PageCache sized to 25% of the whole maxmemory and eagerly committed num_frames × PAGE zeroed bytes at construction. With N shards that is N × 25% × maxmemory pre-committed before serving a command; with --maxmemory unset the guardrail sets it to 80% of host RAM → a 4-shard default server reserved ~80% of host RAM instantly (measured 3791 MB idle RSS, zero data). Fix: (a) lazy buffers — frames start empty, grow to a full page on first fetch_page miss; (b) divide the budget across shards; (c) startup WARN when an explicit --pagecache-size exceeds 50% of maxmemory. Startup RSS 3791 MB → 29 MB.

Benchmark — multishard + disk-offload, both runtimes

OrbStack Linux aarch64, 4 shards, --maxmemory 64MiB --maxmemory-policy allkeys-lru --disk-offload enable, 300k SET + 300k GET over a 200k keyspace (forces spill + cold read-through), p=16, all server runs cgroup-capped to protect the host:

	monoio (production)	tokio (portability)
Serves PING	✅	✅
SET p=16	325,027 req/s	114,068 req/s
GET p=16	2,343,750 req/s	246,305 req/s
RSS under load	45 MB (flat)	45–48 MB (flat)
Spilled to disk	28.5 MB	28.6 MB

RSS stays flat through 600k ops on both runtimes — bounded RAM, working spill/read-through, no zombie.

Also in this branch (PR #129 port + disk-offload durability)

Per-shard AOF on the tokio runtime (multi-part), F2/F3/F5/F6 hardening, per-shard BGREWRITEAOF fan-out compaction.
Disk-offload cold-tier durability: per-file liveness (prevents batch-deletion data loss), crash-atomic manifest overflow (uncaps >70-file recovery), cold read-through survives crash recovery, cold manifest crash-durable without AOF (chore(deps): bump socket2 from 0.5.10 to 0.6.3 #22), lowered default orphan-sweep interval (gated behind the data-loss fix).
Whole-instance maxmemory guardrail (G1/G2).

Validation

Final tree compiles on both runtimes; new non-ignored tests/multishard_serve_smoke.rs passes 3/3 on monoio and tokio.
PageCache lazy-buffer + per-shard-budget unit tests (RED before, GREEN after); startup RSS 3791 MB → 29 MB.
chore(deps): bump socket2 from 0.5.10 to 0.6.3 #22 cold read-through 200/200 on both runtimes.
cargo fmt --check clean; cargo clippy -- -D warnings clean on default and runtime-tokio,jemalloc feature sets.

Test plan

cargo test (monoio) full suite green
cargo test --no-default-features --features runtime-tokio,jemalloc full suite green
Multishard serve smoke, both runtimes
Disk-offload multishard bench, both runtimes (table above)
Startup RSS regression check (3791 MB → 29 MB)

Summary by CodeRabbit

New Features
- Per-shard maxmemory enforcement: multishard deployments treat --maxmemory as a whole-instance cap, with per-shard budgets.
- Per-shard BGREWRITEAOF support (opt-in flag) and configurable AOF fsync timeout.
- Automatic memory guardrail applied when --maxmemory is omitted (~80% default).
Improvements
- Batched cold-storage spill writes, multi-page spill support, and stronger cold-recovery.
- Bounded cross-shard dispatch/backpressure handling and safer eviction spill flows.
Documentation
- Quick start updated to clarify maxmemory sharding, reporting, and guardrail behavior.

The originally-reported "zombie RAM in multishard" is, at root, unbounded keyspace growth under the default `--maxmemory 0` (unlimited) + `noeviction` policy: every distinct key adds memory that is never reclaimed, so a write loop with unique keys grows RSS until the OOM killer fires. This is data, not a leak — but the out-of-the-box default offered no protection. G1 adds an opt-out-by-default guardrail: when `--maxmemory` is omitted, Moon auto-caps at ~80% of the detected memory limit and switches a `noeviction` policy to `allkeys-lru`, logging a one-line startup notice. This bounds memory by default while staying fully Redis-compatible for operators who want the old behavior. `--maxmemory` is now `Option<usize>` to distinguish three intents: - omitted (None) -> auto-guardrail (80% + allkeys-lru + notice) - `--maxmemory 0` -> explicitly UNLIMITED (Redis escape hatch, preserved) - `--maxmemory N` -> exact cap, honored verbatim (no guardrail) Resolution happens once at startup (`apply_memory_guardrail`), writing the resolved value into `RuntimeConfig` (still `usize`, `0 = unlimited`) so ALL downstream code (eviction, persistence_tick, CONFIG, pagecache) is untouched. Detection is cgroup-aware and dependency-free — NO new `unsafe`, NO new crates (safe `/proc` + `/sys` reads only): - Linux: min(cgroup limit v2 `memory.max` / v1 `memory.limit_in_bytes`, host `/proc/meminfo` MemTotal). The min() matters: 80% of HOST RAM inside a container with a small cgroup cap would give zero real protection — the exact multishard/container scenario that motivated this. - Non-Linux (dev): detection skipped, left UNLIMITED with a warning to set `--maxmemory` explicitly (production targets Linux per platform policy). Wired at both entry points (main.rs binary + embedded.rs library) via the shared `log_memory_guardrail` notice helper. Tests (red/green): pure parsing (`/proc/meminfo`, cgroup v1/v2 sentinels) + pure resolution (percent math, policy flip, evicting-policy preserved, explicit-honored, skip-on-no-detection) + Option-type config parsing incl. the `--maxmemory 0` = unlimited escape hatch. 47 config tests pass. Verified live on Linux (OrbStack): no-flag -> cap = 80% of MemTotal + allkeys-lru + notice; `--maxmemory 0` -> unlimited, no notice; `--maxmemory N` -> exact. Build + clippy clean both runtimes; fmt clean. author: Tin Dang

…s (G2) The reported "zombie eats RAM in multishard mode" reproduces empirically: with an identical `--maxmemory 100MiB --maxmemory-policy allkeys-lru`, a 4-shard server retained ~2.5x the RAM and ~2.4x the keys of a 1-shard server (307MB / 768,800 keys vs 124MB / 321,649). Root cause: every write-path eviction (`try_evict_if_needed`) compared a SINGLE shard's memory against the FULL `maxmemory`. Each shard is shared-nothing and enforces independently, so an N-shard server tolerated ~Nx `maxmemory` before any shard evicted — the box OOM-kills before the cap ever bites. The server-wide atomic that could give a true aggregate (`ShardDatabases::read_memory_sum` / `memory_per_shard`) is dead code: no write path publishes to it ("Phase 2/3" wiring never landed), so it always returns 0. Per user decision (2026-06-02), `maxmemory` is now a true WHOLE-INSTANCE cap for both the G1 auto-guardrail and explicit `--maxmemory`, with division at ENFORCEMENT (Redis-compatible: CONFIG GET / INFO still report the whole-instance value verbatim). Implementation (no hot-path division, no new unsafe, no new deps): - `RuntimeConfig` gains `num_shards` (default 1) + `maxmemory_per_shard()`: `maxmemory.div_ceil(num_shards.max(1))`, returning 0 (unlimited) iff `maxmemory == 0`. `div_ceil` keeps summed per-shard budgets >= the cap; `max(1)` guards a mis-set count. - Eviction threshold comparisons switch from `config.maxmemory` to the per-shard budget in `storage/eviction.rs` (`try_evict_if_needed_with_spill_ and_total`, the async-spill variant, and `try_evict_deferred`) and in `shard/persistence_tick.rs` (`should_run_pressure_cascade` threshold + the step-3 KV-eviction gate). The `== 0` unlimited gates and all reporting (CONFIG GET / SET / INFO / MEMORY) are untouched. - The server publishes the resolved shard count onto the SHARED `RuntimeConfig` (the instance every shard's eviction reads) at startup in `main.rs` and `server/embedded.rs`. Single-shard keeps `num_shards == 1` => no division => identical prior behavior. Behavior change (accepted): existing explicit multishard `--maxmemory N` configs now bound aggregate RSS (effective ceiling lowers ~Nx) instead of per-shard. Operator visibility (this change is no longer silent): a startup notice (`config::log_maxmemory_sharding`, emitted from `main.rs` + `server/embedded.rs`) fires whenever `num_shards > 1` and `maxmemory > 0`, stating the resolved per-shard eviction budget so an operator running e.g. `--maxmemory 8gb --shards 4` sees the effective 8GB (not 32GB) ceiling at boot. CHANGELOG + README document the whole-instance semantics. (Surfacing the per-shard budget in `INFO memory` is a tracked follow-up — the INFO handler takes only `&Database`, so it needs a dispatch-signature change across both runtimes.) Known residual (tracked, not in scope): the connection-path eviction is per-DB, so M actively-used logical DBs per shard still over-tolerate by ~M; the proper aggregate fix (wire the per-shard memory publisher + switch eviction to `read_memory_sum`) touches the hot write path and is a separate change. Red/green TDD: - RED: `per_shard_budget_divides_maxmemory_at_enforcement` FAILED pre-fix ("4 shards … got len 4" — 4 shards evicted nothing at budget == total). - GREEN: passes post-fix; plus 5 pure `maxmemory_per_shard` unit tests. 12/12. Empirical re-run (OrbStack Linux, same flood) — 4-shard DBSIZE collapses from 768,800 to 321,654 (≈ the 1-shard 321,649): retained keyspace no longer scales with shard count. used_memory 307MB -> 197MB, VmRSS 300MB -> 193MB (-36%). Single-shard unchanged. Residual gap is fixed per-shard structural overhead (4 shard threads' arenas), bounded — not the prior unbounded keyspace growth. author: Tin Dang

Per-shard AOF replay seeded `master_repl_offset` with the START LSN of the last recovered entry instead of the next-free replication offset. Because `ReplicationState::issue_lsn` returns the offset BEFORE adding the entry's byte length, the first post-recovery write reissued the exact LSN already on disk for the last pre-crash entry — breaking the `lsn -> entry` uniqueness invariant the cross-shard backlog merge (RFC § 2 Rule 3) and PSYNC continuity depend on. main.rs:829 seeds `seed_master_offset` directly from this value, so the bug surfaces on every multi-shard AOF recovery. `replay_incr_framed` now tracks `max(entry.lsn + entry.len)` — the byte offset AFTER each entry — at BOTH the inline and ordered update sites, so the returned `max_lsn` (and the manifest `global_max_lsn` derived from it) is the next-free offset. `seed_master_offset` then leaves the master strictly beyond every byte on disk. Ported onto merged main (PR #129, commit 8c49ec2), whose `replay_incr_framed` carried the identical start-LSN bug at aof_manifest.rs:1533 / :1581. Tests (red/green verified on OrbStack Linux, monoio): - new `replay_incr_framed_max_lsn_is_next_free_offset` regression test (RED: returned 114 = last start LSN; GREEN: 130 = entry end); - updated the four sibling assertions that encoded the old start-LSN semantics to the next-free offset: truncated-header 3 -> 17, decodes_lsn_and_resp 11 -> 27, round_trips_two_shards 20 -> 49, parallel_matches_sequential 40 -> 69, buffers_ordered_entries 12 -> 35. - full persistence::aof_manifest suite: 30 passed, 0 failed. author: Tin Dang <tin.dang@trustifytechnology.com>

Design-for-failure: `appendfsync=always` awaited the writer's fsync ack with no timeout. A stalled disk (slow device, full filesystem, hung fsync) parked the write connection forever, holding its ~64KB buffers — a zombie connection that never completes and never frees memory. Under load this compounds into the unbounded-RAM symptom. Fix: cap the ack await with a configurable bound. On elapse the write is failed (FsyncFailed) and the caller returns an error frame instead of hanging. The entry may still reach disk later, so durability is reported as UNCONFIRMED — the caller must not respond +OK. Changes: - New `--aof-fsync-timeout-ms` flag (default 2000ms; 0 = legacy unbounded). - `AofWriterPool` gains an `fsync_timeout: Duration` field, threaded through `top_level_with_policy` / `per_shard_with_policy`; the non-policy constructors default to DEFAULT_AOF_FSYNC_TIMEOUT (2000ms). - `try_send_append_durable` Always-branch now awaits via `await_ack`, a runtime-agnostic helper: monoio uses `select! { rx, sleep }` (matching cluster::failover — monoio 0.2 has no time::timeout); tokio uses tokio::time::timeout. Outcomes map: ack->inspect AofAck, channel disconnect->WriteFailed, elapse->FsyncFailed. - Wired the timeout into all production pool construction sites (main.rs per-shard + top-level, listener.rs, embedded.rs). Ported onto merged main (PR #129, commit 8c49ec2). Reconciliation vs the original F2 base: - merge-kept main's AOF_BACKPRESSURE_DROPPED counter alongside the new AckOutcome enum (both retained). - five pre-existing pool test sites that main added after the F2 base (aof.rs x3, spsc_handler.rs x2, handler_single.rs x1) now pass Duration::ZERO — exact legacy unbounded semantics, no behavior change. Tests (red/green, tokio-gated — drive the runtime timer): - always_fsync_times_out_when_writer_never_acks: held-but-undrained writer -> Err(FsyncFailed) within the bound, no hang. - always_fsync_succeeds_when_writer_acks_in_time: drained+acked writer -> Ok(()). - full persistence::aof::pool_tests suite: 17 passed, 0 failed (tokio). The monoio path shares the proven select!+sleep shape and is covered end-to-end by the per-shard crash matrix. author: Tin Dang <tin.dang@trustifytechnology.com>

Design-for-failure: the cross-shard dispatch path retried a full target SPSC ring with an UNBOUNDED `loop { try_push; sleep/yield }` and waited for the reply with no cap and no shutdown check. When a target shard stopped draining (saturated, wedged, or mid-shutdown), the producing connection task spun forever — holding its read/write buffers and, on monoio, piling wakers into the `pending_wakers` relay. That is the multi-shard "zombie connection" RAM-growth vector from the investigation: memory that never frees because the task never completes. Fix — shared bounded helper `push_with_backpressure` (src/shard/dispatch.rs): - Retries a full ring up to CROSS_SHARD_PUSH_MAX_RETRIES (5000) with a 100µs backoff (~0.5s honoured → a few seconds at coarse timer granularity — wedged-shard detection scale, generous enough not to false-reject a merely saturated shard under load). - Checks `shutdown.is_cancelled()` before every backoff so graceful shutdown can never hang on a wedged peer. - On give-up the message was NEVER accepted → caller returns a clean reject (`-ERR cross-shard dispatch backpressure`); the command did not execute. - Hot path unchanged: the happy case is a single try_push with no sleep and no allocation. `TimerImpl::sleep` (runtime-agnostic) only allocates on the contended retry path. Wired into BOTH runtimes (same latent bug in each): - monoio `handler_monoio`: push loop + the `pending_wakers` response-wait loop now bounded by CROSS_SHARD_RESPONSE_MAX_WAITS (~30s of ~1ms wakes) AND a shutdown check. Response timeout is an *uncertain write* (the batch was already dispatched, the target may have applied it), so its error explicitly says "write may have applied" — it must NOT imply rejection. Disconnect vs timeout vs shutdown now return distinct frames. - tokio `handler_sharded`: same push loop replaced. On give-up the slot is simply never awaited — `slot_ptr` has no accounting side effect, so skipping `future_for` cannot leak ResponseSlotPool state. (The tokio response wait already awaits via AtomicWaker, so it needed no bound.) Tests (red/green, tokio-gated — drive the runtime timer): - push_backpressure_pushes_on_first_try_without_sleeping (happy path, no retry/sleep) - push_backpressure_succeeds_after_transient_full (retries until drain) - push_backpressure_gives_up_when_ring_never_drains (Backpressure within budget, exact attempt count, no hang) - push_backpressure_aborts_on_shutdown (Cancelled after one attempt) The monoio response-wait bound is correct-by-inspection (coupled to the waker relay) and covered by the per-shard crash/E2E matrix. Verified: build + clippy clean both runtimes; full tokio lib suite 2773/2773 pass; fmt clean. author: Tin Dang

…t (F6) Crash-safety: `advance_shard` (PerShard per-shard rewrite scaffolding) deleted the old generation's base/incr files at step 3 — BEFORE the manifest durably committed the new seq. Because recovery resolves base/incr paths by the persisted `manifest.seq`, a crash between any shard's advance and the coordinator's `write_manifest()` would leave the manifest pointing at an old seq whose files are already gone → the base RDB is missing on restart → data loss for every shard that had already advanced. This is the inverse of `advance()` (TopLevel), which correctly deletes only AFTER the manifest commit. This is a latent bug: `advance_shard` currently has no production caller (the per-shard BGREWRITEAOF fan-out that drives it is the next step), so no shipped path regresses. Fixing it now establishes the correct crash-safe ordering before the coordinator is wired. Changes: - `advance_shard`: remove the in-function delete (old step 3). It now writes the new base+incr and updates the shard's `max_lsn` only. The unused `old_seq` binding is dropped with it. - New `prune_shard_files(shard_id, seq)`: best-effort deletion of a shard's base+incr for a given seq. The rewrite coordinator MUST call this only after `write_manifest()` has durably committed the new seq — the single commit point for the whole rewrite. Mirrors `advance()`'s post-commit deletion ordering. - Doc comment on `advance_shard` now states the deferred-deletion contract and the crash window it closes. Tests (red/green TDD, runtime-agnostic — pure manifest unit test): - Renamed `advance_shard_writes_new_seq_and_deletes_old` → `advance_shard_defers_delete_until_after_commit`. It now pins the ordering invariant: after fan-out advance of BOTH shards, the new-seq files exist AND the old-seq files STILL exist (pre-commit); only after `seq = new; write_manifest(); prune_shard_files(...)` do the old files disappear while the new remain. Recovery (`load`) resolves to new seq. RED against the old delete-before-commit behavior (failed at the pre-commit "old base must survive" assertion); GREEN after the fix. - Full `persistence::aof_manifest` module: 30 passed, 0 failed (monoio). author: Tin Dang <tin.dang@trustifytechnology.com>

…oio) Make multi-shard BGREWRITEAOF crash-safe and actually compact the per-shard AOF, behind an experimental opt-in flag. Until now BGREWRITEAOF was GATED in the PerShard layout (`--shards >= 2 + --appendonly yes`) because the naive multi-shard rewrite lost ~38% of keys on restart (verified 2026-05-26). This ships the correct per-shard fan-out coordinator and lets operators enable it with `--experimental-per-shard-rewrite`; the default behavior is unchanged (gate stays closed → the shipped "append-only, no in-place compaction" path). Design — synchronized seq bump + single manifest commit across N writers: - New `AofMessage::RewritePerShard { shard_dbs, coord }` is delivered to every per-shard writer at once. - Each writer runs `do_rewrite_per_shard` (monoio, synchronous) for ITS shard, reusing the proven `do_rewrite_sharded` discipline scoped to one shard: drain queued appends into the OLD incr (framed) + fsync → lock the shard's dbs → re-drain → snapshot under lock → release → write new base+incr at the coordinator's `new_seq` via `advance_shard` (which does NOT bump the seq) → reopen the append file to the new incr. - `PerShardRewriteCoord` holds a shared `Arc<Mutex<AofManifest>>` and an atomic countdown. The LAST writer to finish performs the single durable commit (`manifest.seq = new_seq; write_manifest()`), then prunes the old generation. Until that commit the on-disk manifest still resolves to the old seq, so a crash anywhere in the fold recovers the intact old generation (no loss, no double-apply). Crash-safety: - Non-idempotent INCR is exactly-once across the rewrite boundary: pre-snapshot INCRs end up only in the new base; post-snapshot INCRs only in the new incr. - delete-after-commit ordering (the step-1 `advance_shard` fix) means a pre-commit crash never deletes files the manifest still references. - Abort-on-any-failure: a shard whose fold errors calls `coord.mark_failed()`; the final writer then ABORTS the commit (keeps old seq) instead of advancing to a generation where a shard is missing its new base — which would break recovery. Old generation stays authoritative (crash-safe); a restart is recommended to resync the successful writers' file handles. Design-for-failure: - The rewrite message uses the BLOCKING `send` (not `try_send`): a dropped message would leave the countdown unable to reach zero, silently losing the folded writers' post-rewrite appends. Writers drain continuously on dedicated threads, so the block is sub-millisecond for a rare admin command. - The manifest is loaded fresh from disk at rewrite time (normal appends never touch it; BGREWRITEAOF is CAS-serialized) — no startup/recovery reordering. Runtime scope: monoio only this commit. The fold uses synchronous std::fs IO; the tokio command handler refuses per-shard BGREWRITEAOF with a clear error so no in-progress flag is left dangling. tokio enablement is the next step. Wiring: - `--experimental-per-shard-rewrite` config flag; main.rs leaves the MULTI_SHARD_AOF_REWRITE_UNSAFE gate OPEN when set. - `AofWriterPool::per_shard_with_base_dir` records the persistence dir so `try_send_rewrite_per_shard` can load the manifest; `bgrewriteaof_start_sharded` routes PerShard pools to the per-shard fan-out. Tests (red/green TDD): - tests/crash_matrix_per_shard_bgrewriteaof.rs (monoio, --ignored): - STRADDLE: BGREWRITEAOF fired mid-INCR-stream + SIGKILL → both per-shard counters recover to EXACTLY N (drop → <N, double-apply → >N), with the seq>1 compacted base asserted present (rules out a silent no-op). - COMPOSE: INCR → rewrite → INCR → SIGKILL → counter == PRE+POST (base + post-rewrite incr compose exactly once). Both PASS on monoio (2 passed, 10.77s). - Both runtimes compile clean; cargo fmt + clippy (default) green. author: Tin Dang <tin.dang@trustifytechnology.com>

Record the load-bearing ordering fact behind per-shard BGREWRITEAOF exactly-once, verified by tracing the live write path against the fold. The fold (`do_rewrite_per_shard`) runs on the per-shard *writer* thread, not the shard event-loop thread. Exactly-once across the rewrite boundary rests on a single fact: the live write path enqueues each command's AOF append INSIDE the same `RwLock<Database>` write guard under which it mutated the db (`spsc_handler.rs` calls `wal_append_and_fanout` before `drop(guard)`). Phase 2 of the fold acquires those SAME locks (`all_shard_dbs()[sidx]` == `ShardDatabases::shards[sidx]`, the exact `RwLock`s `write_db` locks), so RwLock mutual exclusion forces `enqueue -> guard-release -> fold-acquire -> mid-drain`. Every INCR whose mutation lands in the snapshot therefore had its append drained into the OLD incr (pruned at commit), never replayed atop the new base. If the append were enqueued AFTER the guard drop, the snapshot would capture the mutation while its append still raced toward the NEW incr — double-apply. Also documents the dependency on the RwLock store being live: the thread-local `ShardSlice` fast path is dead code until Phase 4 wires `init_shard` (`is_initialized()` is always false today). A future Phase 4 that makes ShardSlice live must revisit this fold, since the writer thread cannot lock another thread's `!Send` `Rc<RefCell<Shard>>`. Verification (this commit, monoio, release): - tests/crash_matrix_per_shard_aof.rs (shipped gate-closed path): 3 passed — no regression from the F6 shared-code changes (AofMessage enum, drain_pending_appends, per-shard writer match arms). - tests/crash_matrix_per_shard_bgrewriteaof.rs (F6 new path): 2 passed — STRADDLE + COMPOSE exact-count recovery. Doc-only; no behavior change. author: Tin Dang <tin.dang@trustifytechnology.com>

… gap The per-shard BGREWRITEAOF exactly-once guarantee holds *absent append-channel saturation during the fold*. Record the one window the crash matrix cannot surface, so the claim is honest before the experimental path is exercised. While `do_rewrite_per_shard` runs (phases 2-6, including the hundreds-of-ms base-RDB write + fsync of phase 6), the per-shard writer is in the fold, not its recv loop, so it is not draining the bounded `mpsc_bounded(10_000)` append channel. Post-snapshot appends queue there for the new incr; the event loop enqueues with `try_send_append` (drop-on-full, return ignored). Under sustained concurrent writes on a large dataset the window can exceed 10_000 queued appends and the overflow is silently dropped — lost even on a clean restart, which is worse than the everysec contract. The single-client crash matrix never pressures the channel (serialized redis-cli, sub-ms), so it cannot catch this. The window is pre-existing: the shipped `do_rewrite_sharded` has the identical non-draining gap — F6 did not introduce it. Per user decision (2026-06-02): qualify + document now (same posture as the Rule-3 / always-fsync deferrals), behind the experimental flag; the fix (drain-during-phase-6 or block-on-full for the rewrite's duration, plus a load-based test) is a separate scoped task. - aof.rs: add a "Known limitation — channel saturation during the fold" section to the `do_rewrite_per_shard` doc. - tmp/F6-known-limitations.md: full write-up (L1 saturation + candidate fixes, L2 carry-over) and current monoio/tokio status. Doc-only; no behavior change. author: Tin Dang <tin.dang@trustifytechnology.com>

Make the multi-part PerShard AOF subsystem (manifest creation, replay, and per-shard BGREWRITEAOF) work under runtime-tokio, not just runtime-monoio, so multi-shard `--appendonly yes` deployments are crash-safe on both runtimes. This is the first half of "multi-shard stable on monoio AND tokio". Root cause of the previous tokio gap (not a fold bug): - The tokio per_shard_aof_writer_task already supports multi-part: it polls AofManifest::load for up to 60s, then opens the framed incr file. Its on-disk framing ([u64 lsn LE][u32 len LE][RESP]) is byte-identical to the monoio writer and to drain_pending_appends_framed, so the shared, runtime-agnostic replay_per_shard reads it back unchanged. - But the manifest creation + replay block in main.rs was gated entirely to #[cfg(runtime-monoio)]. Under tokio initialize_multi never ran, so the manifest the writers wait for never appeared — the writers timed out at 60s and exited, and try_send_rewrite_per_shard aborted with "manifest missing". Durability fell back to WAL-v3 only; AOF was effectively dead. Why it was gated (the regression boundary, preserved): tokio --shards 1 uses the legacy single-file appendonly.aof (v2) + in-place BGREWRITEAOF. Engaging single-shard multi-part replay there finds an empty manifest with no rewritten base RDB and wipes the v2-loaded state — the test_txn_commit_wal_crash_recovery regression. That is a SINGLE-shard problem; the goal is MULTI-shard, a different code path that writes real per-shard base/incr files. Fix — decouple the gating instead of all-or-nothing by runtime: - PerShard branches (num_shards >= 2): initialize_multi (fresh boot) and replay_per_shard (restart) now run under BOTH runtimes. initialize_multi writes a real seq=1 base RDB per shard, so incr-only recovery (writes then crash before any rewrite) has a valid base to replay onto. - Single-shard branches (replay_multi_part, initialize_with_base, initialize) stay #[cfg(runtime-monoio)] with a tokio warn fallback that preserves today's exact behavior (v2 single-file recovery, no manifest creation). This keeps test_txn_commit_wal_crash_recovery green. - TopLevel-manifest + multi-shard remains a hard refusal on both runtimes. Also lands the tokio per-shard BGREWRITEAOF fold (was warn+abort): it reuses the proven synchronous do_rewrite_per_shard verbatim via tokio::fs::File::into_std() on the dedicated block_on_local writer thread, so the exactly-once invariant carries over unchanged. The command handler (bgrewriteaof_start_sharded) now routes PerShard pools to the fan-out coordinator on both runtimes (the tokio refusal is removed). DrainOutcome, drain_pending_appends_framed, and do_rewrite_per_shard cfg widened from runtime-monoio to any(runtime-monoio, runtime-tokio). BEHAVIOR CHANGE: a corrupt AOF manifest is now FATAL (refuse-to-start) on tokio too — previously it was warn+continue. This is intentional: silently ignoring a corrupt manifest risks overwriting the only reference to the base RDB. Operators with a corrupt manifest must inspect it before deleting. Validation: - cargo check / clippy -D warnings / fmt --check: green on BOTH runtimes (default = runtime-monoio; tokio = --no-default-features --features runtime-tokio,jemalloc,graph,text-index). - Crash-matrix + TXN regression validation under tokio follows in the next commit (release build + --ignored integration tests). author: Tin Dang <tin.dang@trustifytechnology.com>

…hen idle The tokio per-shard AOF writer could leave >1s of writes buffered in its BufWriter and lose them on SIGKILL, violating the everysec durability contract (≤1s loss). Surfaced by the F6 BGREWRITEAOF crash matrix COMPOSE case (PRE=300, rewrite, POST=200, 1.5s quiesce, SIGKILL), which recovered only ~450/500 instead of the exact 500 it asserts. Root cause — the flush was on a starvable/unreliable select! arm: - The writer loop used `_ = interval.tick(), if fsync == EverySec` as one arm of `tokio::select!` alongside `msg = rx.recv_async()`. Under sustained writes the always-ready recv arm repeatedly won select!'s random fairness, starving the interval arm, so the BufWriter accumulated well past 1s before flushing. When idle, the long-lived `interval` arm also proved unreliable on the dedicated current-thread writer runtime (each per-shard writer runs on its own std::thread via block_on_local). Either way the everysec deadline was not honored. Fix — make the flush deadline independent of select! scheduling: - Replace the bare `rx.recv_async()` arm with `tokio::time::timeout(200ms, rx.recv_async())` so the loop wakes at least every 200ms regardless of message traffic. flume's recv future is drop-safe on the Elapsed branch (no message consumed on timeout) and the Ok(Ok(msg)) path captures the message with no loss, so durability is not affected by the timeout wrapper. - Move the EverySec flush check to run AFTER every wake (message OR timeout), outside the select! arms, so it is not subject to select! fairness and holds the 1s bound under sustained load as well as when idle. Delete the now-dead `interval` and its priming `tick()`. - Always / AppendSync (synchronous fsync-before-ack) and the per-shard rewrite fold are untouched; the big message match stays in one place (lowest-risk change on durability-critical code). The fix is in the `#[cfg(runtime-tokio)]` writer block only — monoio is unaffected. Also un-stale the single-shard TXN crash-recovery regression guard, which could not compile on this branch (so it had been protecting nothing): - tests/txn_kv_wiring.rs: reconcile the ServerConfig literal with the F2/F6 fields added on this branch (experimental_per_shard_rewrite, aof_fsync_timeout_ms) and fix maxmemory `0` -> `Some(0)` (the field is Option<usize>). The file is #![cfg(feature = "runtime-tokio")], so the guard runs under tokio — the runtime where the decouple's single-shard warn-fallback (legacy v2 path) matters. Validation (release binary + redis-cli, both runtimes): - test_txn_commit_wal_crash_recovery (tokio): green — confirms the 8853ade decouple preserved single-shard recovery. - crash_matrix_per_shard_bgrewriteaof COMPOSE (tokio): 3/3 exact 500/500 at the original 1500ms quiesce (was 475/463/450/449). STRADDLE: green. - crash_matrix_per_shard_aof CRASH-01-LITE everysec + always + pipeline-no-double-write (tokio): 3/3 green. - Same crash matrices under monoio (default features): green — confirms the tokio-only edit did not regress monoio. - clippy -D warnings + fmt --check: clean on both runtimes. tests/crash_matrix_per_shard_aof.rs: header doc updated — CRASH-01-LITE is now validated on BOTH runtimes; documents the tokio invocation. author: Tin Dang <tin.dang@trustifytechnology.com>

…anifest cap Live disk-offload cold read-through silently broke past ~70 evicted KV per shard. `Manifest::commit()` hard-errors above MAX_INLINE_ENTRIES (~70) because the manifest root is a single 4KB double-buffered page with no overflow/B-tree. The spill path wrote ONE DataFile + ONE manifest entry per evicted KV, so beyond #70 every `apply_spill_completions` commit failed, orphaning the cold file (on disk, unreferenced) → GET returned nil. Identical single/multishard, tokio/monoio, small/large values — not value-size, multishard, or runtime specific. Pre-change repro: 1537 "too many entries for inline root page" warnings, live read-through 7/200, recovery rebuilt only 4 cold-index entries. Fix: pack many evicted KVs into ONE multi-page DataFile per bg-thread flush, so manifest entries scale with #files (cold-bytes/flush) not #keys. - ColdLocation gains `page_idx: u32` → {file_id, page_idx, slot_idx}. ColdIndex is rebuilt-only (no on-disk serialization), so this is an in-memory + recovery -scan change with no format version bump. - build_kv_spill_batch / write_kv_spill_batch (kv_spill.rs): greedy leaf packing, page_idx+1 on PageFull; atomic temp+rename write. Inline-only MVP (values > INLINE_MAX_VALUE_BYTES=3500 route to the existing single-file path). - read_cold_entry (cold_read.rs): single-page pread at page_idx*PAGE_4K; only reads the whole file when the OVERFLOW flag is set. - spill_thread.rs: buffered batch model (flush on 256-entry cap / 100ms tick / shutdown); per-FILE SpillCompletion carrying Vec<(key, db_index, page_idx, slot_idx)>. - apply_spill_completions (persistence_tick.rs): ONE add_file+commit per file, then per-entry cold_index.insert. - eviction.rs: drop the tentative slot-0 cold_index insert (slot 0 is now a different key's slot under batching); the ~100ms pre-flush GET miss is correct (key not yet on disk) and the AOF incr log is the durability backstop. - event_loop.rs: the monoio write-path spill ctx used the BARE offload dir (<offload>) instead of <offload>/shard-{id}, so cold files were written where the reader's cold_shard_dir never looked. Align with every other path (WAL, control file, cascade, reader). This dir mismatch was masked pre-fix by the tentative slot-0 insert; removing it exposed the bug. Validation (release, MONOIO — the default/primary runtime): - LIVE disk-offload read-through: 7/200 → 200/200 (single-shard AND --shards 2), manifest-cap warnings 1537 → 0, cold files packed (~7-10, not ~1600). - Rebuild substrate: rebuild_from_manifest 4 → 1607 entries. - Hot-key crash recovery: 700/700 (BGREWRITEAOF + SIGKILL + restart), base RDB fold confirmed — no regression. - New in-process TDD test `test_rebuild_from_manifest_roundtrip` proves the recovery (page_idx, slot_idx) scan reconstructs the same mapping the builder produced: 100/100 keys round-trip through cold_read_through. - monoio: 21 tiered unit tests green; tokio: 46 tiered unit tests green, clippy -D warnings clean, fmt clean (both runtimes). SCOPE — this fixes the manifest-cap for the MONOIO write-path. Two adjacent gaps remain (characterized, tracked separately, NOT regressions of this change): 1. TOKIO live read-through still broken (7/200). Root cause: the tokio hot-path handler (handler_single.rs / handler_sharded) calls `try_evict_if_needed` (delete-only) on write, NOT `try_evict_if_needed_async_spill` — so evicted keys under tokio are deleted, never spilled to cold. Only the persistence-tick cascade spills (sparsely). monoio's handler_monoio correctly uses the async-spill variant + the event_loop.rs:586 dir fix. Wiring async-spill into the tokio handler is a separate handler-layer change. 2. Post-crash cold read-through on MONOIO unchanged: 41/200 (pre-change) vs 42/200 (with-fix) — PRE-EXISTING, and NOT data loss. The AOF incr DOES replay (verified: "AOF incr replayed: 2000 commands" / "AOF multi-part loaded: 2000 entries"); the v3 `cmds=0` is the WAL count, not the AOF — all data is durable. The gap is re-eviction/cold-read coordination on the recovery path (replay re-inserts → eviction re-evicts → cold reads of re-evicted keys miss). Curiously inverted under tokio (recovery 200/200, via v3 WAL cmds=2000). The 7 changed files do not touch recovery.rs, main.rs wiring, restore_from_persistence, AOF replay, or string::get. Refs: tmp/phaseB-fix-SUMMARY.md, tmp/phaseB-coldspill-rootcause.md, tmp/phaseB-coldkey-crash-recovery.sh author: Tin Dang <tin.dang@trustifytechnology.com>

… path Live disk-offload read-through was broken under the tokio runtime: GET of evicted keys returned nil for ~96% (7/200) while monoio returned 200/200. Root cause was a handler-layer asymmetry, NOT a storage bug. The production binary (main.rs) always runs the sharded architecture: each shard owns an event loop (event_loop.rs) that binds its own SO_REUSEPORT socket and spawns connections via `spawn_tokio_connection`. The per-shard background SpillThread, its `flume::Sender<SpillRequest>`, the shared `spill_file_id` counter, and the `<offload>/shard-{id}` directory all live in that event loop and were threaded into the *monoio* connection (`spawn_monoio_connection`) but NOT the *tokio* one. So the tokio ConnCtx was built with `spill_sender = None`, and `handler_sharded`'s write path fell back to `try_evict_if_needed` (DELETE-ONLY): evicted keys were dropped, never spilled to the cold tier, so the cold read-through had nothing to resolve. Only the periodic persistence-tick cascade spilled (sparsely). Fix — mirror the monoio wiring (no new machinery; ConnCtx already carries the three spill fields and `try_evict_if_needed_async_spill` already exists): - conn_accept.rs: `spawn_tokio_connection` gains `spill_sender`, `spill_file_id`, `disk_offload_dir` params (mirroring `spawn_monoio_connection`) and passes them into `ConnectionContext::new` instead of the previous `None`/placeholder. - event_loop.rs: both tokio `spawn_tokio_connection` call sites now pass `&spill_sender, &spill_file_id, &disk_offload_dir` (already in scope). - handler_sharded/mod.rs: the `do_write` eviction now uses `try_evict_if_needed_async_spill(db, &rt, sender, dir, &mut fid, conn.selected_db)` when `ctx.spill_sender` is `Some`, else delete-only. Safety / design notes: - Tokio connections run via `tokio::task::spawn_local` on the per-shard event-loop thread (`block_on_local`), the same single-threaded model as monoio — so the shared `Rc<Cell<u64>>` spill_file_id is sound, and the hot-path get->spill->set has no `.await` inside it (atomic vs other tasks). - Both the hot-path evictor and the persistence-tick cascade run under this shard's db write lock (`with_shard_db` / `write_db`), so they cannot race on the same key. - The `next_file_id = next_file_id.max(spill_file_id.get())` reconciliation runs in the tokio select! periodic arm (event_loop.rs:1222), so the cascade stays consistent with the hot-path counter — fully symmetric with monoio. Validation (release, red/green TDD): - TOKIO live read-through: 7/200 (RED, fresh current-branch binary) -> 200/200 (GREEN), stable across 3 consecutive e2e runs (tmp/phaseB-coldkey-crash-recovery.sh, 2000x500B, maxmemory 262144, allkeys-lru, --shards 1). - MONOIO regression gate: live read-through stays 200/200 (handler_monoio and spawn_monoio_connection untouched; only shared signatures extended). - storage::tiered unit tests: 46 pass on monoio, 46 pass on tokio. - clippy -D warnings clean + fmt clean, both runtimes. SCOPE — this closes the tokio LIVE read-through gap. Two adjacent items remain, both pre-existing and tracked separately (NOT regressions of this change): 1. POST-CRASH cold read-through on MONOIO (B-2): unchanged at ~43/200. The spill thread's `file_id` counter (and the event loop's `spill_file_id`) start at 1 on every boot and are never seeded from the recovered manifest, so post-restart re-eviction atomically overwrites recovered `heap-NNN.mpf` files. Fix = centralize file_id assignment in the SpillThread and seed it from the recovered max (one coherent change). Tokio post-crash already reaches 200/200 when spill completions land before the kill (timing- dependent), via the v3 WAL replay path. 2. handler_single (run_with_shutdown / handle_connection) still uses delete- only eviction, but that path is test-harness/embedded only — main.rs never drives it — so it is out of scope for the production multi-shard goal. Refs: tmp/phaseB-fix-SUMMARY.md, investigation_coldspill_manifest_70_cap (memory) author: Tin Dang <tin.dang@trustifytechnology.com>

After a SIGKILL + restart with disk-offload enabled, GET of a re-evicted (cold) key returned nil for ~80% of keys on monoio — and intermittently on tokio. The data was fully durable (AOF replayed all 2000 entries, cold index rebuilt with the full set); the failure was purely on the cold READ path after recovery. Two independent defects, both required to be correct: 1. Cold-tier wiring wiped by the rdb::load swap (the dominant cause). The multi-part AOF replay loads the base RDB via `rdb::load`, which builds fresh `Database::new()` temporaries and swaps them wholesale into the live databases (`*live = temp`). `cold_shard_dir` and the rebuilt `cold_index` are live-tier topology, NOT part of the RDB hot snapshot, so even a 0-key base RDB silently reset them to None. The handler then read a database with `cold_shard_dir = None`, so `cold_lookup_location` returned None for every key and all cold read-through missed. Fix: capture both fields per (shard, db) immediately before the AOF replay block (after restore rebuilt the index, before replay clobbers) and re-attach them immediately after. Confined to the startup recovery path on purpose — `rdb::load` also serves replica full-sync and DEBUG RELOAD, which load a foreign dataset whose values do not live in this node's cold files, so preserving the local index there would surface stale reads. In the recovery path the loaded base + replayed incrs are this node's own data, so the rebuilt index is authoritative. 2. Spill file_id counter restarted at 1 (the residual cause). On restart the counter restarted at 1, so post-restart re-eviction minted the same `heap-NNNNNN.mpf` names recovery had just loaded and atomically overwrote cold files the preserved cold_index still references — corrupting read-through of any not-yet-refreshed entry. Fix: `next_spill_file_id_seed` scans `<shard_dir>/data/heap-*.mpf` and seeds the counter at max(file_id)+1, so every future spill filename is strictly new and recovered cold files stay immutable until the steady-state cascade refreshes them. Scans physical files (not the manifest) so a spill that wrote its .mpf but crashed before the manifest commit still cannot be clobbered. Returns 1 when disk-offload is off / dir absent / no heap files, so a fresh server and the live hot path are byte-for-byte unchanged. Non-NotFound scan errors warn and fail safe to 1 rather than risk a misseed. The two fixes are complementary: (1) re-wires the index across recovery, (2) keeps the files that index points at immutable. Either alone is insufficient — without (1) reads miss on shard_dir=None; without (2) the preserved entries point at overwritten files. Validation (red/green TDD): - 3 unit tests for the seed helper (none-dir, fresh-server, resume-above-max with non-heap names ignored) — GREEN. - e2e crash-recovery gate (2000x500B, maxmemory 256KB, allkeys-lru, SIGKILL, restart, GET first-200 cold read-through): monoio --shards 1: LIVE 200/200 + COLD recovery 200/200 (x5, stable) monoio --shards 2: LIVE 200/200 + COLD recovery 200/200 (x2) tokio --shards 2: LIVE 200/200 + COLD recovery 200/200 (x2; both shards' cold indexes 812+796 rebuilt and read-through working) RED baseline pre-fix was ~40/200 on monoio recovery. - storage::eviction + storage::tiered: 61 unit tests pass. - cargo fmt clean; cargo clippy --release (monoio) -D warnings clean; both runtimes build warning-free. Harnesses: tmp/phaseB-coldkey-crash-recovery.sh (single-shard), tmp/phaseB2-multishard-coldkey-recovery.sh (multi-shard, both runtimes). author: Tin Dang <tin.dang@trustifytechnology.com>

…n data loss The cold-tier orphan sweep deleted an entire batched spill file (heap-NNNNNN.mpf, up to 256 KVs) whenever it found a SINGLE orphan key in it, silently orphaning the file's co-located live keys. Once the sweep ran, cold read-through collapsed from 200/200 to 88/200 — a benign-looking disk number masked active data loss. This was only exposed by running the sweep at a short interval; the default 300s interval never fired in earlier ~3-min benches, which is why the disk "leak" looked unbounded and the data loss stayed hidden. Root cause: ColdIndex tracked only key -> location, with no per-file liveness, so sweep_known_orphans could not tell whether a file still backed other live keys before unlinking it. The same blind spot leaked files orphaned by re-eviction (insert overwriting a key's location to a new file): no key references the old file, so the hot-cap-cold sweep never saw it. Fix: add a reverse liveness index to ColdIndex. - file_refs: HashMap<file_id, u32> counts live map entries per file. insert/remove maintain it; a file is queued for unlink (pending_unlink) only on a zero-ref transition. A file is deleted ONLY when its last live referrer is removed — never on a single orphan key while co-located keys remain. - sweep_known_orphans now removes orphan ENTRIES, decrements refs, and drain_pending_unlink unlinks only zero-ref files (off the hot path). This also reclaims re-eviction orphans the hot-cap-cold sweep cannot see. - merge / rebuild_from_manifest route through insert, so recovery rebuilds ref counts consistently. - timers.rs sweep trigger also fires when has_pending_unlink(), since re-eviction orphans carry no hot-cap-cold key. Unlink precedes the manifest commit, so disk is freed even when the commit fails at the manifest 70-entry root cap (addressed separately for recovery). Validation (red/green TDD): - 3 new unit tests: co-located live key + file retained on single-orphan sweep; file deleted only when last ref removed; overwrite reclaims the orphaned old file. RED on prior code ("DATA LOSS: file ... was deleted"). - storage suite: 315 passed / 0 failed. - e2e (s=4, 1M writes, --cold-orphan-sweep-interval-secs 2): read-through 88/200 -> 200/200; disk bounded at the true working-set plateau (~604 files / ~85 MB, flat rounds 6-20) vs the 300s-interval control's unbounded climb to 263 MB+. author: Tin Dang <tin.dang@trustifytechnology.com>

…ry beyond 70 files/shard The shard manifest stored all FileEntry records inline in a single 4 KB root page, capped at MAX_INLINE_ENTRIES = 70. commit() rejected the 71st entry outright ("too many entries for inline root page: N > 70"). Because the post-compaction cold tier is recoverable ONLY via the manifest (the BGREWRITEAOF base RDB holds hot keys only — verified: appendonly=yes + BGREWRITEAOF recovers cold keys purely through rebuild_from_manifest), the 70-entry cap is a DURABILITY ceiling, not a perf limit: ~70 heap files ≈ 9 MB cold/shard, which every real disk-offload deployment exceeds immediately (Phase C at 1M keys hit "495 > 70"). Past the cap, files 71+ were never recorded, so a crash + restart silently lost that cold data. Add an append-only overflow region so a root can track unbounded entries while preserving the dual-root atomic-commit guarantee. - Root page keeps the first <=70 entries inline (zero overhead, fully backward compatible for the common case). The surplus packs into ManifestEntry overflow pages (72 entries each). The root header's next_page records the overflow run's start page; entry_page_count records its length; file_count is the inline+overflow total. - CRASH-ATOMICITY INVARIANT: overflow pages are appended at EOF and sync_data'd BEFORE the root that references them. Append-only means a new run never overwrites the currently-active slot's older run, so a crash before the root's atomic commit leaves the prior committed state (its root AND its overflow run) fully intact. open() loads each slot's overflow and returns None on any torn/short/corrupt/inconsistent run, so selection falls back to the last-good slot — partial loss at worst, NEVER corruption. Preserves the format's pre-existing failure semantics. - Bound append-only growth: when a committed file is mostly dead overflow from superseded commits, compact() rewrites it as [Root A][Root B][fresh overflow] via temp-file + atomic rename (the live manifest stays valid until the rename). Best-effort: a compaction failure never fails an already-durable commit. The ManifestEntry (0x02) page type and the reserved entry_page_count field already existed in the v2 layout for exactly this; no on-disk format bump. Validation (red/green TDD): - test_overflow_persists_beyond_inline_cap: 200 entries persist + recover. - test_overflow_commit_crash_atomicity: commit 150>cap then 300>cap, lop the final 4 KB page (torn state2 overflow); reopen falls back to the last-good state1 (150), never panic/garbage. RED before this change. - test_overflow_compaction_bounds_growth: 60 commits of a >cap set stay within a small multiple of the live set (compaction fires). - test_overflow_tombstone_and_gc: tombstone + prune an overflow-region entry, survives reopen. - test_manifest_max_inline_entries: inverted from ">70 rejected" to ">70 persists via overflow" (MAX_INLINE_ENTRIES still == 70). - Manifest suite 22 passed; full lib suite 3426 passed; fmt + clippy clean. Pairs with the per-file-liveness data-loss fix (a3b3a9a). Remaining in this series: AOF-independent manifest durability (appendonly=no) and the default cold-orphan-sweep interval. author: Tin Dang <tin.dang@trustifytechnology.com>

At the 300s default the cold-orphan sweep never fired within a typical benchmark window (~3 min), which had two bad effects: cold orphan heap files accumulated on disk for up to 5 minutes before any reclamation, and — more seriously — it MASKED a batch-file shared-deletion data-loss bug, because the buggy sweep never ran to expose it (fixed by the per-file liveness refcount in ColdIndex, a3b3a9a). Lower the default to 60s (the documented recommended floor). Reclamation is prompt without churning: the sweep's per-file unlinks run off the hot path, so a shorter interval keeps each batch small rather than letting hundreds of orphans pile up between sweeps. Operators can still set 0 to disable or raise it up to 3600. Validated together with a3b3a9a + f5e17ac in the Phase C multishard gate (s=4, sweep=2s, disk-offload on, both runtimes): disk bounded at the working-set plateau, post-crash recovery read-through 200/200. author: Tin Dang <tin.dang@trustifytechnology.com>

…nly=no) Cold (spilled-to-disk) keys were silently dropped on restart whenever the server ran with `--appendonly no --disk-offload enable`. After a crash the cold read-through recovered 0/200 even though the heap files and per-shard manifests were intact on disk. Root cause (main.rs shard-construction closure): `persistence_dir` is derived as Some(..) only when `appendonly == "yes" || save.is_some()`, and is intentionally None under appendonly=no so that no per-tick WAL fsync writer is created. Recovery was then gated behind `if let Some(ref dir) = persistence_dir`, so `restore_from_persistence` — whose v3 path does the AOF-INDEPENDENT cold rebuild (heap reload + rebuild_from_manifest) — never ran. The disk-offload cold tier was therefore transitively gated on AOF/save being enabled, even though `disk_offload_base` was Some. "v3 recovery complete" never logged. Fix: fire recovery when `persistence_dir.is_some() || disk_offload_base.is_some()`. The dir argument is used only by the v2 fallback (a no-op when there is no appendonly.aof/snapshot), so it falls back to `config.dir` under appendonly=no; the v3 path reads the offload manifest directly. This is additive and changes nothing for the appendonly=yes / save paths. Validated red/green with a process-level regression test that boots a 4-shard server under `--appendonly no --disk-offload enable`, forces cold spill, SIGKILLs, restarts, and asserts cold read-through: RED (old gate): 0/200 with 15 heap files on disk (recovery skipped). GREEN (this fix): 179-200/200; recovery log shows "manifest recovered" + "reloaded N KV entries from heap-*.mpf". Because there is no AOF under appendonly=no, a non-zero post-crash count can only come from the cold-manifest recovery path this fix enables. The test floor is 75% — the 172-200 GREEN spread is intrinsic to crashing without AOF (a cold key is durable only once its eviction-tick manifest commit has landed), while a recovery-skip regression collapses to a categorical 0. Refs #22 author: Tin Dang <tin.dang@trustifytechnology.com>

Three root causes behind the "zombie eating RAM in multishard mode" report, all on the tokio runtime with disk-offload (the default): 1. Serving hang (listener bind race). run_sharded bound the central tokio listener with a plain TcpListener (no SO_REUSEPORT) while each shard binds its own SO_REUSEPORT listener. A plain socket and REUSEPORT sockets on one port are incompatible, resolved by a bind-order race: if shards win, the central plain-bind hits EADDRINUSE, run_sharded returns Err, and the WHOLE server tears down; if the central wins, the shards' binds fail and the accept path is left half-wired. Either way PING is never answered. Fix: the central listener now also binds SO_REUSEPORT (the same create_reuseport_socket helper the shards use), so it coexists with the shard sockets regardless of order; per_shard_accept stays false (a bound-but-idle REUSEPORT socket black-holes the connections the kernel hashes to it). monoio was always healthy because its bind is REUSEPORT-compatible. 2. io_uring under load. tokio's experimental io_uring bridge floods "Unknown io_uring event type: 0" and drops connections (BrokenPipe) under sustained pipelined load. It is now default-OFF for tokio (opt in via MOON_URING=1; MOON_NO_URING still force-disables and remains the CI default). io_uring stays in monoio, the production runtime. 3. PageCache eager pre-alloc (the actual RAM zombie). disk-offload defaults to enable, so each shard built its own PageCache sized to 25% of the WHOLE maxmemory and EAGERLY committed num_frames*PAGE zeroed bytes at construction. With N shards that is N*25%*maxmemory pre-committed before serving a single command; with --maxmemory unset the guardrail sets it to 80% of host RAM, so a 4-shard default server reserved ~80% of host RAM instantly (measured 3791MB RSS at startup with zero data). Fix: (a) lazy buffers - frame buffers start empty and grow to a full page on first fetch_page miss, so RSS tracks the working set, not the budget; (b) divide the budget across shards (per_shard_pagecache_budget) so total pre-alloc is bounded by the budget regardless of shard count; (c) a startup WARN when an explicit --pagecache-size exceeds 50% of maxmemory. Startup RSS for the 4-shard no-maxmemory case drops from 3791MB to 27MB. Validation (OrbStack Linux aarch64, all server runs cgroup-capped to protect the host): multishard serve smoke 8/8 tokio + 5/5 monoio; #22 cold read-through 200/200 both runtimes; full test suites green on both runtimes; PageCache lazy + budget-divide unit tests (red before, green after); startup RSS 3791MB -> 27MB. author: Tin Dang <tin.dang@trustifytechnology.com>

…doctest fixes Test and CI-green hygiene supporting the multishard stability fix. - Add tests/multishard_serve_smoke.rs (non-ignored, runs on BOTH runtimes): boots the real binary in multishard configs and asserts a raw-TCP inline PING is answered with +PONG inside a hard deadline. This guards the central / per-shard SO_REUSEPORT accept path so the serving hang can never silently regress. Every case pins a small --maxmemory (128 MiB) so the smoke test cannot itself trip the PageCache pre-alloc and OOM a small CI runner; the bind race it guards is bind-time and independent of maxmemory. Shard count is clamped to available_parallelism().clamp(2,4) to avoid the thread-per-core oversubscription wall on small runners. - Derive Default on ServerConfig and add ..Default::default() spreads across the integration tests that construct it by literal, so they compile against the fields added this milestone (F2 aof_fsync_timeout_ms, G1/G2 Option<maxmemory>, Phase-A experimental_per_shard_rewrite). - autovacuum_daemon.rs: serialize the two tests that read/assert the process-global RECL_AUTOVACUUM_LAST_RUN_TS behind a file-local, poison-tolerant mutex. Pre-existing parallel-execution flake: the enabled test advanced the global mid-assertion of the disabled test (failed under load on monoio, passed on tokio by scheduling luck). 5/5 default-parallel runs green after the guard. - migrate_aof.rs: annotate the usage doc-block as ```text so rustdoc stops compiling the shell example as Rust (it broke `cargo test --doc`). - crash_recovery_disk_offload_no_aof (#22): restart-retry helper + 65% recovery floor + 8s settle-before-kill, hardening the intrinsic crash-durability spread against CPU-contention timing (8/8 tokio, 6/6 monoio). author: Tin Dang <tin.dang@trustifytechnology.com>

coderabbitai · 2026-06-03T16:01:47Z

Warning

Review limit reached

@TinDang97, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 56 minutes and 11 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 38da2457-9ce9-4a1e-bcd7-7493e814fe6a

📥 Commits

Reviewing files that changed from the base of the PR and between 9f91381 and 27dd61a.

📒 Files selected for processing (4)

CHANGELOG.md
CLAUDE.md
src/graph/cypher/mod.rs
src/graph/cypher/parser/mod.rs

📝 Walkthrough

Walkthrough

Adds per-shard maxmemory guardrail (optional CLI), per-shard AOF rewrite coordination, manifest overflow pages, batched multi-page cold-tier spill with page-indexed locations and file-liveness tracking, per-shard eviction budgets, bounded cross-shard backpressure, PageCache lazy buffers, and broad test updates.

Changes

Guardrail and per-shard maxmemory

Layer / File(s)	Summary
Memory guardrail and per-shard budget system `src/config.rs`, `src/main.rs`, `src/server/embedded.rs`, `src/server/listener.rs`	`ServerConfig::maxmemory` → `Option<usize>`; `apply_memory_guardrail()` resolves explicit/applied/skip outcomes; `RuntimeConfig` adds `num_shards` and `maxmemory_per_shard()` (uses div_ceil); startup logs sharding; CLI flags `--experimental-per-shard-rewrite` and `--aof-fsync-timeout-ms` added; tests updated.

Persistence: AOF and manifest

Layer / File(s)	Summary
AOF fsync timeout and writer pool wiring `src/persistence/aof.rs`, `src/main.rs`, `src/server/embedded.rs`, `src/server/listener.rs`	`AofWriterPool` constructors accept `fsync_timeout: Duration` with `DEFAULT_AOF_FSYNC_TIMEOUT`; `try_send_append_durable` maps bounded ack outcomes; startup passes configured timeout into pools; tests adjusted to new signatures.
Per-shard BGREWRITEAOF fan-out & coordinator `src/persistence/aof.rs`, `src/command/persistence.rs`	Adds `AofMessage::RewritePerShard`, `PerShardRewriteCoord`, and `AofWriterPool::try_send_rewrite_per_shard(...)`; `bgrewriteaof_start_sharded` routes per-shard when layout is PerShard and errors map to RESP frames.
Manifest: overflow pages & replay semantics `src/persistence/manifest.rs`, `src/persistence/aof_manifest.rs`	Adds `ENTRIES_PER_OVERFLOW_PAGE`, overflow page serialization/parsing, `load_root` that reconstructs inline+overflow, compaction with `needs_reopen`, tests for overflow durability; `advance_shard` defers deletions and new `prune_shard_files()` added; replay `max_lsn` uses next-free offsets (lsn+len).

Disk-offload tiering, spill, and cold index

Layer / File(s)	Summary
Batch spill and multi-page files `src/storage/tiered/kv_spill.rs`, `src/storage/tiered/spill_thread.rs`	Introduces `INLINE_MAX_VALUE_BYTES`, `SpillEntry`, `BatchPages`, `build_kv_spill_batch`, `write_kv_spill_batch`; spill thread buffers requests and emits one `SpillCompletion` per written file containing multiple `SpillCompletionEntry`s; tests updated to validate batch paths.
ColdIndex page indexing & deferred unlink `src/storage/tiered/cold_index.rs`	`ColdLocation` adds `page_idx`; `ColdIndex` tracks `file_refs` and `pending_unlink`, updates refs on insert/remove/merge, `drain_pending_unlink` unlinks files and tombstones manifest entries; `rebuild_from_manifest` computes page indices; tests expanded.
Cold-read optimization and wiring `src/storage/tiered/cold_read.rs`, `src/shard/conn_accept.rs`, `src/shard/event_loop.rs`, `src/server/conn/handler_sharded/mod.rs`	`read_cold_entry` reads only required 4KB page via pread; overflow loads full file only when needed; spawn_tokio_connection now receives spill plumbing; event loop seeds `spill_file_id` from recovered heap files and corrects `disk_offload_dir` per shard.

Eviction and budgeting

Layer / File(s)	Summary
Per-shard eviction budget & spill seeding `src/storage/eviction.rs`, `src/shard/persistence_tick.rs`	Eviction loops use `rt.maxmemory_per_shard()` budget; async spill no longer inserts tentative cold_index entries; adds `next_spill_file_id_seed` scanning `heap-*.mpf`; persistence tick applies per-file commit+add semantics and inserts `ColdLocation` per entry.

Cross-shard dispatch and runtime tuning

Layer / File(s)	Summary
Bounded backpressure for cross-shard pushes `src/shard/dispatch.rs`, `src/server/conn/handler_monoio/mod.rs`, `src/server/conn/handler_sharded/mod.rs`	Adds `push_with_backpressure` with backoff and max retries, `PushOutcome` enum; handlers replace unbounded retry loops with bounded helper and reject batch on give-up; monoio adds bounded reply wait constant.
Listener reuseport & io_uring gating `src/server/listener.rs`, `src/shard/event_loop.rs`	`run_sharded` attempts SO_REUSEPORT on Unix with fallback; io_uring disabled by default unless `MOON_URING=1` with updated messaging.

Tests & infra

Layer / File(s)	Summary
Test harness migration and new integration tests `tests/*` (many files)	All test ServerConfig literals migrated to `maxmemory: Some(...)` or `..Default::default()` where appropriate; adds per-shard BGREWRITEAOF crash-matrix tests, disk-offload crash recovery test, multishard smoke tests, and synchronization helper for autovacuum tests.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Per-shard BGREWRITEAOF (FIX-W2-3 follow-up from PR #129) #131: per-shard BGREWRITEAOF wiring and fan-out support closely matches the objectives described in this issue.
P1-5: ShardManifest MAX_INLINE_ENTRIES caps disk-offload at ~80 files (overflow pages) #48: manifest overflow pages and related compaction/recovery address the inline-root cap discussed in this issue.

Possibly related PRs

pilotspace/moon#96: both PRs gate multi-part AOF replay/manifest initialization to monoio and update related startup logic.
pilotspace/moon#63: related prior work on sharded BGREWRITEAOF dispatch that this PR extends with coordinator and fan-out.
pilotspace/moon#95: overlaps in embedded startup wiring (fsync_timeout, num_shards, guardrail logging).

Suggested labels

enhancement

🐰 A rabbit's note on shards and spills:
Per-shard caps guard every nest,
Overflow pages store the rest,
Batch the spills, then prune with care—
Rewrites fold, manifests fare,
Keys on disk find safe, soft rest. 🥕

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/pr129-port-onto-main

qodo-code-review · 2026-06-03T16:02:42Z

Review Summary by Qodo

Stable multishard + disk-offload on both runtimes; eliminate RAM zombie via lazy PageCache and per-shard budgets

🐞 Bug fix ✨ Enhancement 🧪 Tests

Walkthroughs

Description

  **Multishard stability and RAM zombie elimination:**
• Fixed tokio multishard serving hang by adding SO_REUSEPORT to central listener, resolving
  bind-order race with per-shard listeners
• Disabled io_uring by default for tokio runtime (opt-in via MOON_URING=1) to prevent connection
  drops under load
• Implemented lazy PageCache buffer allocation (frames grow on first use, not pre-committed) and
  per-shard budget division to eliminate N× RAM over-commitment
• Reduced idle RSS from 3791 MB to 29 MB on 4-shard default server; maintains flat 45–48 MB under
  600k ops on both runtimes
  **Per-shard AOF rewrite (F6) and durability hardening:**
• Ported PR #129 per-shard AOF work onto main with multi-part rewrite coordination via
  PerShardRewriteCoord
• Implemented fsync timeout bounding (F2) with AckOutcome enum and runtime-specific select/timeout
  logic
• Added per-shard fan-out BGREWRITEAOF compaction with synchronized seq bump and single manifest
  commit
  **Disk-offload cold-tier durability fixes:**
• Implemented manifest overflow pages for >70 file entries with crash-atomic ordering (write
  overflow before root commit)
• Fixed batch-file shared-deletion data-loss bug via per-file liveness refcount tracking; files only
  unlinked when ALL co-located keys removed
• Refactored spill thread to buffered batch flushing with inline/oversized entry routing; multi-page
  files with per-entry page indexing
• Fixed cold-tier recovery regression (#22): restore from persistence when
  disk_offload_base.is_some() even under appendonly no
• Implemented page-indexed cold read with lazy full-file I/O; only reads specific pages in
  multi-page batch files
  **Memory guardrail and per-shard eviction budget (G1/G2):**
• Added whole-instance maxmemory guardrail with cgroup-aware auto-capping at ~80% of detected memory
  limit (Linux cgroup v1/v2 support)
• Changed maxmemory from usize to Option to distinguish omitted flag (auto-guardrail) from
  explicit --maxmemory 0 (unlimited)
• Implemented per-shard eviction budget division (maxmemory / num_shards) enforced across all
  eviction thresholds and pressure cascades
• Added startup warnings for explicit --pagecache-size exceeding 50% of maxmemory
  **Cross-shard dispatch and backpressure:**
• Implemented bounded cross-shard dispatch with push_with_backpressure() helper (100µs backoff,
  5000 max retries) to prevent connection parking
• Added response-wait timeout (~30s) with shutdown checks for cross-shard writes
• Added PushOutcome enum to distinguish happy path, transient backpressure, and cancellation
  scenarios
  **Configuration and test updates:**
• Added experimental_per_shard_rewrite flag (opt-in per-shard BGREWRITEAOF) and
  aof_fsync_timeout_ms (2000ms default)
• Lowered cold_orphan_sweep_interval_secs default from 300 to 60 seconds to promptly reclaim
  orphans
• Updated 30+ test files for optional maxmemory field and new AOF configuration fields
• Added comprehensive crash-recovery test suite: cold-tier durability without AOF (#22), per-shard
  BGREWRITEAOF matrix, multishard serving smoke tests
• Serialized autovacuum timestamp tests to prevent flaking under parallel execution

Diagram

flowchart LR
  A["Multishard<br/>Serving"] -->|"SO_REUSEPORT<br/>central listener"| B["Tokio<br/>Stability"]
  A -->|"io_uring<br/>default OFF"| B
  C["PageCache<br/>Allocation"] -->|"lazy buffers<br/>per-shard budget"| D["RAM<br/>Bounded"]
  E["Spill<br/>Thread"] -->|"batch writes<br/>multi-page files"| F["Cold<br/>Tier"]
  F -->|"per-file liveness<br/>refcount"| G["Durability<br/>Safe"]
  F -->|"overflow pages<br/>crash-atomic"| G
  H["Eviction<br/>Pressure"] -->|"per-shard budget<br/>division"| D
  I["Cross-Shard<br/>Dispatch"] -->|"bounded backpressure<br/>retry limits"| J["Stability<br/>Guaranteed"]
  K["Memory<br/>Guardrail"] -->|"cgroup-aware<br/>auto-cap"| D

File Changes

1. src/persistence/aof.rs ✨ Enhancement +761/-26

Per-shard AOF rewrite (F6) and fsync timeout bounding (F2)

• Added AckOutcome enum to distinguish terminal states of bounded fsync-ack awaits (F2
 design-for-failure)
• Introduced PerShardRewriteCoord struct for crash-safe multi-shard BGREWRITEAOF coordination (F6)
• Added RewritePerShard message variant and per-shard rewrite entry point
 try_send_rewrite_per_shard
• Implemented fsync timeout bounding via await_ack helper with runtime-specific select/timeout
 logic
• Added do_rewrite_per_shard and drain_pending_appends_framed functions for per-shard fold
 operations
• Enhanced per_shard_aof_writer_task to handle per-shard rewrite messages and fixed EverySec
 deadline enforcement

src/persistence/aof.rs

2. src/storage/tiered/spill_thread.rs ✨ Enhancement +409/-303

Batched spill writes with inline/oversized entry routing

• Refactored spill thread from per-request single-file writes to buffered batch flushing
• Added FLUSH_ENTRY_CAP constant and flush_buffer function for multi-entry batch writes
• Introduced SpillCompletionEntry struct and updated SpillCompletion to carry multiple entries
 per file
• Implemented inline vs oversized entry routing: inline entries batch into one multi-page file,
 oversized get single-page files
• Updated test suite to verify batch semantics and per-file completion model

src/storage/tiered/spill_thread.rs

3. src/persistence/manifest.rs ✨ Enhancement +405/-39

Manifest overflow pages for >70 file entries

• Added ENTRIES_PER_OVERFLOW_PAGE constant and overflow page serialization/parsing functions
• Implemented append-only overflow pages for manifest entries beyond the 70-entry inline cap
• Added load_root function to fully load a slot including its overflow run with crash-safety
 validation
• Implemented compact function to reclaim dead overflow pages and bound file growth
• Enhanced commit to write overflow pages before root commit (crash-atomic ordering)
• Added comprehensive tests for overflow persistence, crash atomicity, compaction, and tombstoning

src/persistence/manifest.rs

View more (58)

4. src/shard/event_loop.rs 🐞 Bug fix +70/-14

io_uring default disable, PageCache lazy allocation, spill recovery

• Changed io_uring default for tokio runtime from enabled to disabled (opt-in via MOON_URING=1)
• Fixed PageCache per-shard budget division to prevent multishard RAM over-commitment
• Implemented lazy PageCache buffer allocation (frames grow on first use, not pre-allocated)
• Added startup warning when explicit --pagecache-size exceeds 50% of maxmemory
• Fixed spill directory path to use per-shard subdirectories (shard-{id})
• Added spill file_id counter seeding from recovered cold files to prevent post-restart overwrites

src/shard/event_loop.rs

5. src/persistence/migrate_aof.rs 📝 Documentation +1/-1

Documentation formatting fix
• Changed code block marker from triple backticks to text variant in documentation comment
src/persistence/migrate_aof.rs

6. src/config.rs ✨ Enhancement +452/-9

G1 memory guardrail and per-shard eviction budget division

• Added Default derive to ServerConfig and introduced G1 memory guardrail system with
 auto-capping at ~80% of detected memory limit (cgroup-aware on Linux)
• Changed maxmemory from usize to Option to distinguish between omitted flag (auto-guardrail)
 and explicit --maxmemory 0 (unlimited)
• Added per-shard eviction budget via RuntimeConfig::maxmemory_per_shard() to divide
 whole-instance cap across shards, fixing multishard RAM zombie
• Added new config flags: experimental_per_shard_rewrite (opt-in per-shard BGREWRITEAOF),
 aof_fsync_timeout_ms (2000ms default for fsync ack timeout)
• Lowered cold_orphan_sweep_interval_secs default from 300 to 60 seconds to promptly reclaim
 orphans and mask batch-file shared-deletion bug
• Added comprehensive memory detection and guardrail resolution logic with Linux cgroup v1/v2
 support and extensive unit tests

src/config.rs

7. src/main.rs 🐞 Bug fix +231/-88

Startup guardrail, cold recovery fix, and multishard serving stability

• Applied G1 memory guardrail at startup before building RuntimeConfig, logging the outcome
• Updated AOF writer pool initialization to pass aof_fsync_timeout_ms and base directory for
 per-shard BGREWRITEAOF coordination
• Gated per-shard BGREWRITEAOF behind experimental_per_shard_rewrite flag with appropriate
 warnings
• Fixed cold-tier recovery regression (#22): fire restore_from_persistence when
 disk_offload_base.is_some() even under appendonly no
• Preserved cold-tier wiring (cold_shard_dir + cold_index) across AOF/RDB recovery by capturing
 before replay and re-attaching after
• Fixed tokio multishard serving hang by removing the cfg!(target_os = "linux") gate on central
 accept loop (always run for tokio)
• Published resolved shard count to RuntimeConfig for per-shard eviction budget calculation and
 logged maxmemory sharding notice
• Wrapped single-shard multi-part AOF paths in #[cfg(runtime-monoio)] with tokio fallback warnings
 to prevent regression

src/main.rs

8. src/persistence/aof_manifest.rs 🐞 Bug fix +186/-62

Crash-safe per-shard rewrite deletion ordering and F5 offset tracking

• Refactored advance_shard() to defer file deletion until after manifest commit, eliminating crash
 window where old seq files are deleted before durability
• Added new prune_shard_files() method for post-commit deletion of old generation files, called
 only after write_manifest() succeeds
• Fixed F5 replication offset tracking: max_lsn now represents next-free offset (entry end = start
 LSN + RESP length) instead of start LSN, preventing lsn→entry uniqueness violations
• Updated test advance_shard_defers_delete_until_after_commit to verify pre-commit invariant (new
 files written, old files survive) and post-commit deletion
• Added comprehensive F5 tests validating next-free offset calculation across inline, ordered, and
 multi-shard entries

src/persistence/aof_manifest.rs

9. src/storage/tiered/cold_index.rs 🐞 Bug fix +319/-35

Per-file liveness refcount and batch-file shared-deletion fix

• Added page_idx field to ColdLocation for multi-page spill files (file-absolute 4KB page index)
• Implemented reverse liveness tracking via file_refs HashMap to count live entries per file and
 pending_unlink queue for zero-ref files
• Fixed batch-file shared-deletion data-loss bug: files only unlinked when ALL co-located keys are
 removed, not on single orphan
• Refactored sweep_known_orphans() into two phases: remove orphan entries (decrement refs), then
 drain pending unlinks off hot path
• Added drain_pending_unlink() to safely unlink zero-ref files post-commit with re-queueing on
 error
• Updated rebuild_from_manifest() to use raw file reads and absolute page indices for correct
 multi-page file reconstruction
• Added comprehensive regression tests for co-located key retention, last-ref deletion, and
 re-eviction churn scenarios

src/storage/tiered/cold_index.rs

10. tests/crash_recovery_disk_offload_no_aof.rs 🧪 Tests +425/-0

Crash recovery test for cold-tier durability without AOF

• New comprehensive crash-recovery test for #22 regression: cold keys must survive SIGKILL under
 --appendonly no --disk-offload enable
• Validates that cold spill manifests are recovered via per-shard persistence (no AOF), with 65%
 recovery floor to distinguish regression (0 keys) from legitimate variance
• Implements robust restart logic with transient EADDRINUSE rebind race handling and SO_REUSEPORT
 listener teardown synchronization
• Includes helper functions for unique port/directory allocation, server lifecycle management,
 redis-cli integration, and heap file counting

tests/crash_recovery_disk_offload_no_aof.rs

11. src/storage/eviction.rs 🐞 Bug fix +162/-19

Per-shard eviction budget and cold file_id seeding

• Updated eviction threshold comparisons to use config.maxmemory_per_shard() instead of raw
 maxmemory, enforcing per-shard budget division
• Added next_spill_file_id_seed() function to scan recovered heap-*.mpf files and seed counter
 above max file_id, preventing post-restart overwrites of cold files
• Removed tentative cold_index insertion at evict time (was inserting slot 0 which could return
 wrong key under batching); rely on completion handler for authoritative location
• Added unit tests for per-shard budget enforcement in multishard mode and spill file_id seeding
 logic

src/storage/eviction.rs

12. tests/autovacuum_daemon.rs 🧪 Tests +17/-0

Serialize autovacuum timestamp tests to prevent flaking

• Added process-global mutex lock (autovacuum_ts_lock()) to serialize tests that read/assert
 RECL_AUTOVACUUM_LAST_RUN_TS, preventing cross-test flaking under parallel execution
• Lock is poison-tolerant to avoid wedging sibling tests if one panics

tests/autovacuum_daemon.rs

13. src/storage/tiered/kv_spill.rs ✨ Enhancement +406/-0

Multi-page batch spill with per-entry page indexing

• Added page_idx field to ColdLocation in spill operations to support multi-page batch spilling
• Implemented build_kv_spill_batch() function for greedy packing of KV entries into multiple leaf
 pages with overflow handling
• Implemented write_kv_spill_batch() for atomic file writes with crash-safety via .tmp rename
 pattern
• Added comprehensive tests for multi-page batch building, round-trip read recovery, and
 crash-recovery manifest rebuilding

src/storage/tiered/kv_spill.rs

14. tests/crash_matrix_per_shard_bgrewriteaof.rs 🧪 Tests +341/-0

Per-shard BGREWRITEAOF crash-recovery matrix tests

• New crash-recovery test suite for per-shard BGREWRITEAOF fan-out compaction under
 --experimental-per-shard-rewrite
• Tests validate exact-count recovery of non-idempotent INCR commands across rewrite boundaries
 (straddle + compose scenarios)
• Includes helper functions for process spawning, port management, and compacted base file detection
• Acceptance gate for F6 monoio multi-shard rewrite correctness (0% data loss, 0% duplication)

tests/crash_matrix_per_shard_bgrewriteaof.rs

15. src/persistence/page_cache/mod.rs 🐞 Bug fix +132/-3

Lazy PageCache buffers and per-shard budget division

• Changed buffer allocation from eager (pre-committed at construction) to lazy (allocated on first
 fetch_page miss)
• Added pagecache_frame_counts() to split per-shard budget into 4KB/64KB frame counts with minimum
 floors
• Added per_shard_pagecache_budget() to divide whole-instance budget across shards, preventing N×
 over-commitment
• Added resident_buffer_bytes() method to report actual committed memory (reflects working set,
 not configured budget)
• Added regression tests locking in lazy allocation and per-shard budget division

src/persistence/page_cache/mod.rs

16. tests/multishard_serve_smoke.rs 🧪 Tests +203/-0

Multishard serving smoke tests for both runtimes

• New smoke test validating multishard server actually serves commands on both runtimes (monoio +
 tokio)
• Regression guard for tokio-on-Linux multishard accept hang caused by SO_REUSEPORT bind-order race
• Tests plain multishard, 1-shard, and multishard+disk-offload configurations with raw TCP PING
 verification
• Uses CARGO_BIN_EXE_moon for CI integration; runs in default cargo test (not #[ignore])

tests/multishard_serve_smoke.rs

17. tests/integration.rs ⚙️ Configuration changes +14/-7

Update integration tests for optional maxmemory field

• Changed maxmemory field from usize to Option across all server startup helpers
• Added ..Default::default() to ServerConfig struct initialization to handle new fields
• Updated 8 test helper functions to wrap maxmemory values in Some()

tests/integration.rs

18. src/shard/dispatch.rs ✨ Enhancement +158/-0

Bounded cross-shard dispatch backpressure with retry limits

• Added CROSS_SHARD_PUSH_BACKOFF (100µs) and CROSS_SHARD_PUSH_MAX_RETRIES (5000) constants for
 bounded backpressure
• Implemented PushOutcome enum (Pushed, Backpressure, Cancelled) for cross-shard SPSC push
 results
• Implemented push_with_backpressure() async helper with bounded retry, shutdown checks, and no
 allocation on happy path
• Added 4 tokio-gated unit tests validating happy path (no sleep), transient full recovery,
 wedged-shard give-up, and shutdown abort

src/shard/dispatch.rs

19. src/shard/persistence_tick.rs ✨ Enhancement +43/-32

Per-shard eviction thresholds and batch spill completion handling

• Updated apply_spill_completions() to process per-file batches (one manifest add+commit per file,
 multiple ColdIndex inserts per file)
• Changed from single-entry completion to multi-entry SpillCompletion with entries vector
 containing page_idx and slot_idx
• Updated should_run_pressure_cascade() to compare against maxmemory_per_shard() instead of
 whole-instance maxmemory
• Updated handle_memory_pressure() eviction threshold to use per-shard budget for bounded
 aggregate RSS

src/shard/persistence_tick.rs

20. src/server/conn/handler_monoio/mod.rs ✨ Enhancement +89/-24

Bounded cross-shard dispatch and response-wait timeouts

• Added CROSS_SHARD_RESPONSE_MAX_WAITS constant (~30s generous backstop for uncertain writes)
• Replaced unbounded loop { try_push; sleep(10µs) } with push_with_backpressure() helper for
 bounded cross-shard dispatch
• Added response-wait timeout with shutdown checks to prevent parking connections on wedged peers
• Enhanced error messages to distinguish backpressure, cancellation, and timeout scenarios

src/server/conn/handler_monoio/mod.rs

21. src/server/conn/handler_sharded/mod.rs ✨ Enhancement +71/-10

Disk-offload eviction and bounded cross-shard backpressure

• Added disk-offload spill support in eviction path via try_evict_if_needed_async_spill() when
 spill sender is available
• Replaced unbounded cross-shard push loop with push_with_backpressure() helper with bounded retry
 and shutdown checks
• Enhanced error handling to reject batch on backpressure/cancellation instead of parking connection
 forever

src/server/conn/handler_sharded/mod.rs

22. src/server/listener.rs 🐞 Bug fix +35/-1

SO_REUSEPORT central listener and AOF fsync timeout

• Central tokio listener now binds with SO_REUSEPORT via create_reuseport_socket() to coexist
 with per-shard listeners
• Fallback to plain bind on non-unix or socket creation failure (preserves existing behavior)
• Added aof_fsync_timeout_ms parameter to AofWriterPool::top_level_with_policy() calls

src/server/listener.rs

23. src/server/embedded.rs ✨ Enhancement +13/-1

Memory guardrail and per-shard maxmemory enforcement

• Added memory guardrail logging via log_memory_guardrail() before RuntimeConfig construction
• Published resolved shard count to RuntimeConfig so per-shard budget division is enforced
• Added per-shard maxmemory sharding log via log_maxmemory_sharding()
• Added aof_fsync_timeout_ms parameter to AofWriterPool::top_level_with_policy() call

src/server/embedded.rs

24. src/shard/spsc_handler.rs ⚙️ Configuration changes +11/-2

AOF writer pool timeout parameter updates

• Updated AofWriterPool::top_level_with_policy() calls to include aof_fsync_timeout_ms parameter
• Added AofMessage::RewritePerShard variant to match statement in test

src/shard/spsc_handler.rs

25. src/storage/tiered/cold_read.rs ✨ Enhancement +11/-8

Page-indexed cold read with lazy full-file I/O

• Changed from whole-file read to page-specific pread using FileExt::read_exact_at() with
 page_idx offset
• Only reads full file when following overflow chains (lazy full-file read on demand)
• Reduces I/O for single-page lookups in multi-page batch files

src/storage/tiered/cold_read.rs

26. tests/crash_matrix_per_shard_aof.rs 📝 Documentation +10/-5

Per-shard AOF crash-recovery now supports both runtimes
• Updated documentation to reflect per-shard AOF support on BOTH runtimes (monoio + tokio)
• Added build instructions for tokio runtime validation
tests/crash_matrix_per_shard_aof.rs

27. src/shard/conn_accept.rs ✨ Enhancement +11/-3

Tokio disk-offload spill context threading

• Added spill_sender, spill_file_id, and disk_offload_dir parameters to
 spawn_tokio_connection()
• Tokio handler now receives disk-offload context (mirrors monoio path) for spilling evicted KVs to
 cold tier

src/shard/conn_accept.rs

28. src/command/persistence.rs ✨ Enhancement +28/-0

Per-shard BGREWRITEAOF fan-out coordination

• Added F6 per-shard fan-out path for BGREWRITEAOF via pool.try_send_rewrite_per_shard() when
 layout is PerShard
• Synchronized seq bump + single manifest commit across all writers (both runtimes)
• Falls back to legacy RewriteSharded for TopLevel multi-DB pools

src/command/persistence.rs

29. src/shard/timers.rs 🐞 Bug fix +10/-1

Cold orphan sweep handles zero-ref file unlink

• Updated run_cold_orphan_sweep() to check for pending unlink files even when orphan_keys is
 empty
• Files orphaned by re-eviction or promotion carry no hot∩cold key but still need unlink processing

src/shard/timers.rs

30. src/server/conn/handler_single.rs ⚙️ Configuration changes +5/-1

AOF writer pool timeout parameter update
• Updated AofWriterPool::top_level_with_policy() call to include aof_fsync_timeout_ms parameter
src/server/conn/handler_single.rs

31. tests/lunaris_cypher_shortest_path.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory
• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization
tests/lunaris_cypher_shortest_path.rs

32. tests/lunaris_cypher_temporal.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory
• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization
tests/lunaris_cypher_temporal.rs

33. tests/ft_search_concurrent_readers.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory
• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization
tests/ft_search_concurrent_readers.rs

34. tests/ft_search_as_of_boundary.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory
• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization
tests/ft_search_as_of_boundary.rs

35. tests/ft_search_as_of_filter.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory
• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization
tests/ft_search_as_of_filter.rs

36. tests/ft_search_multi_shard_as_of.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory
• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization
tests/ft_search_multi_shard_as_of.rs

37. tests/ft_search_temporal_parity.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory
• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization
tests/ft_search_temporal_parity.rs

38. tests/workspace_integration.rs ⚙️ Configuration changes +6/-2

Update test config for optional maxmemory and new AOF fields

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added aof_fsync_timeout_ms and experimental_per_shard_rewrite fields to ServerConfig
• Added ..Default::default() to ServerConfig initialization

tests/workspace_integration.rs

39. tests/kill_snapshot.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/kill_snapshot.rs

40. tests/adversarial_v0110_fix01_set_delete_rollback.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory
• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization
tests/adversarial_v0110_fix01_set_delete_rollback.rs

41. tests/adversarial_v0110_fix02_err_path_intent.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory
• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization
tests/adversarial_v0110_fix02_err_path_intent.rs

42. tests/adversarial_v0110_fix03_simplestring_graph.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory
• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization
tests/adversarial_v0110_fix03_simplestring_graph.rs

43. tests/adversarial_v0110_fix07_multihop_edge_var_reject.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory
• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization
tests/adversarial_v0110_fix07_multihop_edge_var_reject.rs

44. tests/adversarial_v0110_fix04_shortest_path_call_parity.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory
• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization
tests/adversarial_v0110_fix04_shortest_path_call_parity.rs

45. tests/adversarial_v0110_fix06_shortest_path_min_hops.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory
• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization
tests/adversarial_v0110_fix06_shortest_path_min_hops.rs

46. tests/graph_bench_compare.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/graph_bench_compare.rs

47. tests/graph_bench_e2e.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/graph_bench_e2e.rs

48. tests/graph_integration.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/graph_integration.rs

49. tests/graph_stress_deep.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/graph_stress_deep.rs

50. tests/txn_cypher_write_rollback.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory
• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization
tests/txn_cypher_write_rollback.rs

51. tests/txn_graph_wiring.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/txn_graph_wiring.rs

52. tests/lunaris_hybrid_ft_search.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory
• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization
tests/lunaris_hybrid_ft_search.rs

53. tests/txn_completeness_edge_cases.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory
• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization
tests/txn_completeness_edge_cases.rs

54. tests/txn_ft_search_snapshot.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory
• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization
tests/txn_ft_search_snapshot.rs

55. tests/txn_kv_wiring.rs ⚙️ Configuration changes +3/-1

Update test config for optional maxmemory and new AOF fields

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added aof_fsync_timeout_ms and experimental_per_shard_rewrite fields to ServerConfig
• Added ..Default::default() to ServerConfig initialization

tests/txn_kv_wiring.rs

56. tests/mq_integration.rs ⚙️ Configuration changes +3/-1

Update test config for optional maxmemory and new AOF fields

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added aof_fsync_timeout_ms and experimental_per_shard_rewrite fields to ServerConfig
• Added ..Default::default() to ServerConfig initialization

tests/mq_integration.rs

57. tests/pipeline_auto_index.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/pipeline_auto_index.rs

58. tests/replication_test.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/replication_test.rs

59. tests/vacuum_commands.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/vacuum_commands.rs

60. CHANGELOG.md 📝 Documentation +23/-0

Changelog entry for whole-instance maxmemory enforcement

• Added G2 section documenting maxmemory as a true whole-instance cap across shards
• Explains per-shard budget division (maxmemory / num_shards) for bounded aggregate RSS
• Notes that CONFIG GET maxmemory and INFO remain unchanged (Redis-compatible)
• Includes startup log message format for visibility of per-shard budget resolution

CHANGELOG.md

61. README.md Additional files +8/-0

...

README.md

ⓘ You are approaching your monthly quota for Qodo. Upgrade your plan

qodo-code-review · 2026-06-03T16:02:45Z

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (3) 🔗 Cross-repo conflicts (0)

Context used

✅ Compliance rules (platform): 27 rules

1. sigkill() unsafe missing SAFETY 📘 Rule violation ≡ Correctness

Description

A new unsafe block calls libc::kill without the required adjacent // SAFETY: justification,
violating the repository’s unsafe-code policy and making it harder for reviewers/auditors to
validate invariants and UB risk. This increases audit risk and can block automated unsafe audits.

Code

tests/crash_recovery_disk_offload_no_aof.rs[R332-337]

Evidence
UNSAFE_POLICY.md requires that every unsafe block include an adjacent // SAFETY: comment
describing the preconditions/invariants being upheld. In the newly added sigkill() helper, an
unsafe { libc::kill(...) } call was introduced without any accompanying SAFETY justification,
directly contradicting that requirement and preventing auditors from confirming why the call is
sound.
Rule 297369: Enforce unsafe code usage against UNSAFE_POLICY.md
tests/crash_recovery_disk_offload_no_aof.rs[331-339]
tests/crash_matrix_per_shard_bgrewriteaof.rs[138-145]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
A new `unsafe { libc::kill(...) }` usage was introduced without the policy-required adjacent `// SAFETY:` comment (per `UNSAFE_POLICY.md`), leaving the upheld preconditions/invariants undocumented.

## Issue Context
`UNSAFE_POLICY.md` mandates an explicit `// SAFETY:` justification for every unsafe block and prefers using safe alternatives when available (e.g., `Child::kill`) to reduce unsafe surface area. The new `sigkill()` helper currently calls `libc::kill` inside an unsafe block without documenting why this is safe.

## Fix Focus Areas
- tests/crash_recovery_disk_offload_no_aof.rs[331-339]
- tests/crash_matrix_per_shard_bgrewriteaof.rs[138-145]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

2. Unannotated .unwrap() in spill tests 📘 Rule violation ✧ Quality

Description

New .unwrap() calls were added without the required #[allow(clippy::unwrap_used)] and an
immediately preceding justification comment. This breaks the unwrap-audit expectations and makes
future lint/audit enforcement unreliable.

Code

src/storage/tiered/spill_thread.rs[519]

Evidence

The rule requires a justification comment plus #[allow(clippy::unwrap_used)] in-scope for every
.unwrap() in the diff. The updated spill-thread tests include .unwrap() usages with no
allow+comment pair (e.g., the find(...).unwrap() and other unwraps in the same test block).

Rule 302083: Annotate safe unwrap calls with allow and justification
src/storage/tiered/spill_thread.rs[492-547]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
New `.unwrap()` calls were introduced without the required `// ...` justification comment directly above an attached `#[allow(clippy::unwrap_used)]` attribute.

## Issue Context
The compliance rule requires every unwrap to be explicitly justified and locally allowed (or rewritten to avoid unwrap).

## Fix Focus Areas
- src/storage/tiered/spill_thread.rs[492-588]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

3. PerShardRewriteCoord missing loom test 📘 Rule violation ▣ Testability

Description

A new atomic coordination/state-machine type (PerShardRewriteCoord) was introduced without any
corresponding loom model test coverage. This risks concurrency bugs slipping through in a
multi-threaded rewrite coordinator.

Code

src/persistence/aof.rs[R217-327]

Evidence

The diff adds a new atomic coordination type that uses AtomicUsize/AtomicBool and cross-thread
completion logic, which qualifies as an atomic state machine. The existing loom test file is still
only modeling ResponseSlot and does not exercise PerShardRewriteCoord, so the required loom
coverage was not added.

Rule 302087: Require loom model tests for new atomic state machines
src/persistence/aof.rs[217-327]
tests/loom_response_slot.rs[1-15]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
A new atomic coordination type was added, but no loom model test was added/updated to exercise its concurrency behavior.

## Issue Context
The compliance requirement mandates loom-based model tests for new/modified atomic state machines, specifically in `tests/loom_response_slot.rs`.

## Fix Focus Areas
- src/persistence/aof.rs[217-327]
- tests/loom_response_slot.rs[1-120]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

View more (1)

4. Cold reads disabled in replay 🐞 Bug ≡ Correctness

Description

src/main.rs temporarily take()s cold_shard_dir/cold_index before multi-part AOF replay and
only reattaches them after replay completes, so replayed commands cannot read disk-offloaded keys.
This can permanently corrupt recovered state for read-modify-write commands during replay (e.g.,
INCR reads via db.get() and will treat cold-only keys as missing).

Code

src/main.rs[R758-770]

Evidence

main.rs explicitly removes cold-tier wiring (take()) before invoking AOF replay and only
restores it afterward; replay uses the standard command dispatch engine; INCR reads via db.get(),
and Database::get() only performs cold read-through when cold_shard_dir and cold_index are
present, so cold-only keys appear missing during replay and can be replayed incorrectly.

src/main.rs[739-775]
src/main.rs[1054-1065]
src/persistence/replay.rs[512-538]
src/command/string/string_write.rs[265-299]
src/storage/db.rs[681-722]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
During multi-part AOF replay, `cold_shard_dir` and `cold_index` are removed from every `Database` via `.take()` before calling the replay routines, and only restored after replay finishes. Because replay executes the normal command handlers, any replay-time read of a disk-offloaded key (e.g., INCR’s read-modify-write) cannot cold-read and will behave as if the key is missing, producing incorrect recovered state.

## Issue Context
- `rdb::load` swaps whole `Database` instances (`*live = temp`), which is why cold-tier fields get dropped.
- The current approach works around that swap by stashing cold wiring in `main.rs`, but it makes cold storage unavailable during the replay window.
- Fix must ensure cold wiring is present *while replaying incremental RESP commands*, not only after replay ends.

## Fix Focus Areas
- src/main.rs[739-775]
- src/main.rs[1054-1065]
- src/persistence/aof_manifest.rs[1299-1370]
- src/persistence/rdb.rs[362-367]

## Suggested fix approach
1. Change replay path so cold wiring is restored immediately after the base RDB load swap and **before** incremental command replay begins.
  - Option A: Add a recovery-only `rdb::load_preserving_wiring(...)` (or similar) that copies selected “topology” fields (cold wiring) from the current live DBs into the temp DBs right before the final swap, and use it from `replay_multi_part` / per-shard replay.
  - Option B: Have `replay_multi_part` / `replay_per_shard` accept a preserved-wiring bundle and reattach it right after `rdb::load(...)` returns and before `replay_incr_resp(...)`.
2. Remove (or narrow) the `.take()` window in `main.rs` so cold wiring remains available during replayed command execution.
3. Add a regression test that:
  - Sets up disk-offload state with a key only in cold storage.
  - Replays an AOF stream containing a read-modify-write command against that key (e.g., INCR).
  - Asserts the recovered value matches the expected incremented value.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

5. Spill defaults to dot-dir 🐞 Bug ⛨ Security

Description

When spill_sender is set but disk_offload_dir is missing, the tokio sharded handler spills
evicted data into the process working directory via Path::new("."). This can create/overwrite
unexpected files outside the configured offload directory.

Code

src/server/conn/handler_sharded/mod.rs[R1283-1294]

Evidence

The handler explicitly falls back to . when the spill directory is missing; the context fields are
optional, so this state is representable; and the event loop constructs these values separately,
meaning a mismatch can occur if wiring changes or configuration is incomplete.

src/server/conn/handler_sharded/mod.rs[1271-1299]
src/server/conn/core.rs[87-92]
src/shard/event_loop.rs[597-623]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The tokio sharded handler uses `ctx.disk_offload_dir.as_deref().unwrap_or(Path::new("."))` when spilling is enabled. If `spill_sender` is present but `disk_offload_dir` is `None` (miswire/misconfig), the server will spill data files into the current working directory, which is unsafe and operationally surprising.

## Issue Context
`spill_sender`, `spill_file_id`, and `disk_offload_dir` are independently optional in `ConnectionContext`, so there is no type-level guarantee that a spill sender implies a valid spill directory.

## Fix Focus Areas
- src/server/conn/handler_sharded/mod.rs[1271-1301]
- src/server/conn/core.rs[87-92]
- src/shard/event_loop.rs[597-623]

## Suggested fix approach
1. Replace the `unwrap_or(Path::new("."))` fallback with an explicit invariant check:
  - If `spill_sender.is_some()` and `disk_offload_dir.is_none()`, log an error and fall back to delete-only eviction (`try_evict_if_needed`) or return a clear error (prefer falling back + warn to keep serving).
2. Consider enforcing the invariant earlier:
  - In `ConnectionContext::new`, if sender is Some and dir is None, set `spill_sender=None` (disable spilling) and emit a warning, or `debug_assert!`/`panic!` in debug builds.
3. Add a small unit/integration test that constructs a `ConnectionContext` with `spill_sender=Some` and `disk_offload_dir=None` and asserts spilling is disabled or rejected rather than writing to `.`.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

ⓘ You are approaching your monthly quota for Qodo. Upgrade your plan

qodo-code-review · 2026-06-03T16:14:12Z

+fn sigkill(child: &mut Child) {
+    let pid = child.id() as i32;
+    unsafe {
+        libc::kill(pid, libc::SIGKILL);
+    }
+    let _ = child.wait();


1. sigkill() unsafe missing safety 📘 Rule violation ≡ Correctness

A new unsafe block calls libc::kill without the required adjacent // SAFETY: justification, violating the repository’s unsafe-code policy and making it harder for reviewers/auditors to validate invariants and UB risk. This increases audit risk and can block automated unsafe audits.

Agent Prompt

## Issue description A new `unsafe { libc::kill(...) }` usage was introduced without the policy-required adjacent `// SAFETY:` comment (per `UNSAFE_POLICY.md`), leaving the upheld preconditions/invariants undocumented. ## Issue Context `UNSAFE_POLICY.md` mandates an explicit `// SAFETY:` justification for every unsafe block and prefers using safe alternatives when available (e.g., `Child::kill`) to reduce unsafe surface area. The new `sigkill()` helper currently calls `libc::kill` inside an unsafe block without documenting why this is safe. ## Fix Focus Areas - tests/crash_recovery_disk_offload_no_aof.rs[331-339] - tests/crash_matrix_per_shard_bgrewriteaof.rs[138-145]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

qodo-code-review · 2026-06-03T16:14:12Z

+        let total_entries: usize = completions.iter().map(|c| c.entries.len()).sum();
+        assert_eq!(total_entries, 1, "expected 1 entry across all completions");
+
+        let c = completions.iter().find(|c| !c.entries.is_empty()).unwrap();


2. Unannotated .unwrap() in spill tests 📘 Rule violation ✧ Quality

New .unwrap() calls were added without the required #[allow(clippy::unwrap_used)] and an immediately preceding justification comment. This breaks the unwrap-audit expectations and makes future lint/audit enforcement unreliable.

Agent Prompt

## Issue description New `.unwrap()` calls were introduced without the required `// ...` justification comment directly above an attached `#[allow(clippy::unwrap_used)]` attribute. ## Issue Context The compliance rule requires every unwrap to be explicitly justified and locally allowed (or rewritten to avoid unwrap). ## Fix Focus Areas - src/storage/tiered/spill_thread.rs[492-588]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

qodo-code-review · 2026-06-03T16:14:12Z

+pub struct PerShardRewriteCoord {
+    /// Writers still to finish. Starts at the shard count; the writer that
+    /// decrements it to zero performs the commit + prune.
+    remaining: std::sync::atomic::AtomicUsize,
+    /// Shared manifest, loaded fresh from disk by the BGREWRITEAOF handler at
+    /// rewrite time (normal appends never touch the manifest, and BGREWRITEAOF
+    /// is CAS-serialized, so a fresh load is the authoritative current state).
+    manifest: Arc<parking_lot::Mutex<crate::persistence::aof_manifest::AofManifest>>,
+    /// The generation every writer advances to. Computed once = old_seq + 1.
+    new_seq: u64,
+    /// The generation being retired; pruned only after the commit.
+    old_seq: u64,
+    /// Number of shards participating (= initial `remaining`).
+    n_shards: usize,
+    /// Set by any shard whose fold fails. The final writer checks this and
+    /// ABORTS the commit if set — committing `new_seq` while a shard never
+    /// wrote its new base would make recovery look for a missing base and
+    /// refuse to start. On abort the old generation (`old_seq`) stays the
+    /// committed state for all shards (crash-safe).
+    failed: std::sync::atomic::AtomicBool,
+}
+
+impl PerShardRewriteCoord {
+    /// Construct a coordinator for an `n_shards`-way rewrite advancing the
+    /// shared `manifest` from its current seq to `current_seq + 1`.
+    pub fn new(
+        manifest: Arc<parking_lot::Mutex<crate::persistence::aof_manifest::AofManifest>>,
+        current_seq: u64,
+        n_shards: usize,
+    ) -> Arc<Self> {
+        Arc::new(Self {
+            remaining: std::sync::atomic::AtomicUsize::new(n_shards),
+            manifest,
+            new_seq: current_seq + 1,
+            old_seq: current_seq,
+            n_shards,
+            failed: std::sync::atomic::AtomicBool::new(false),
+        })
+    }
+
+    /// The generation writers advance to.
+    #[inline]
+    pub fn new_seq(&self) -> u64 {
+        self.new_seq
+    }
+
+    /// Mark the whole rewrite as failed (called by a shard whose fold errored).
+    /// The final writer will abort the commit, leaving `old_seq` authoritative.
+    #[inline]
+    pub fn mark_failed(&self) {
+        self.failed
+            .store(true, std::sync::atomic::Ordering::Release);
+    }
+
+    /// Called by each writer AFTER it has durably written its new base+incr at
+    /// `new_seq` and reopened its append file. Decrements the countdown; the
+    /// final caller commits the manifest (single seq flip) and prunes the old
+    /// generation, then clears the global in-progress flag.
+    ///
+    /// Crash-safety: the commit (`write_manifest`) is the atomic flip point;
+    /// pruning runs strictly after it, so a crash mid-prune only orphans
+    /// already-superseded files (recovery uses `new_seq`).
+    pub fn shard_done(&self) {
+        use std::sync::atomic::Ordering;
+        // AcqRel: the decrement-to-zero must observe all prior writers'
+        // advance_shard manifest mutations before committing.
+        if self.remaining.fetch_sub(1, Ordering::AcqRel) == 1 {
+            // Abort if any shard failed to fold: committing new_seq while a
+            // shard lacks its new base would break recovery. Keep old_seq.
+            if self.failed.load(Ordering::Acquire) {
+                let m = self.manifest.lock();
+                // Best-effort: prune the orphaned new-seq files written by the
+                // shards that DID fold, so they don't linger.
+                for sid in 0..self.n_shards {
+                    m.prune_shard_files(sid as u16, self.new_seq);
+                }
+                drop(m);
+                error!(
+                    "F6 per-shard rewrite ABORTED: a shard failed to fold; seq stays {}. \
+                     Old generation remains authoritative (crash-safe). A RESTART is \
+                     recommended so successful shards' writers stop appending to the \
+                     discarded new generation.",
+                    self.old_seq
+                );
+                crate::command::persistence::AOF_REWRITE_IN_PROGRESS.store(false, Ordering::SeqCst);
+                return;
+            }
+            let mut m = self.manifest.lock();
+            m.seq = self.new_seq;
+            if let Err(e) = m.write_manifest() {
+                error!(
+                    "F6 per-shard rewrite: final manifest commit (seq {}) failed: {}. \
+                     Old generation remains authoritative; rewrite did not take effect.",
+                    self.new_seq, e
+                );
+                // Do NOT prune — old generation is still the committed state.
+                drop(m);
+                crate::command::persistence::AOF_REWRITE_IN_PROGRESS.store(false, Ordering::SeqCst);
+                return;
+            }
+            for sid in 0..self.n_shards {
+                m.prune_shard_files(sid as u16, self.old_seq);
+            }
+            drop(m);
+            info!(
+                "F6 per-shard rewrite complete: committed seq {} across {} shards, pruned seq {}",
+                self.new_seq, self.n_shards, self.old_seq
+            );
+            crate::command::persistence::AOF_REWRITE_IN_PROGRESS.store(false, Ordering::SeqCst);
+        }
+    }


3. Pershardrewritecoord missing loom test 📘 Rule violation ▣ Testability

A new atomic coordination/state-machine type (PerShardRewriteCoord) was introduced without any corresponding loom model test coverage. This risks concurrency bugs slipping through in a multi-threaded rewrite coordinator.

Agent Prompt

## Issue description A new atomic coordination type was added, but no loom model test was added/updated to exercise its concurrency behavior. ## Issue Context The compliance requirement mandates loom-based model tests for new/modified atomic state machines, specifically in `tests/loom_response_slot.rs`. ## Fix Focus Areas - src/persistence/aof.rs[217-327] - tests/loom_response_slot.rs[1-120]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

qodo-code-review · 2026-06-03T16:14:12Z

+    let preserved_cold_wiring: Vec<
+        Vec<(
+            Option<std::path::PathBuf>,
+            Option<moon::storage::tiered::cold_index::ColdIndex>,
+        )>,
+    > = if disk_offload_base.is_some() {
+        shards
+            .iter_mut()
+            .map(|s| {
+                s.databases
+                    .iter_mut()
+                    .map(|db| (db.cold_shard_dir.take(), db.cold_index.take()))
+                    .collect()


4. Cold reads disabled in replay 🐞 Bug ≡ Correctness

src/main.rs temporarily take()s cold_shard_dir/cold_index before multi-part AOF replay and only reattaches them after replay completes, so replayed commands cannot read disk-offloaded keys. This can permanently corrupt recovered state for read-modify-write commands during replay (e.g., INCR reads via db.get() and will treat cold-only keys as missing).

Agent Prompt

## Issue description During multi-part AOF replay, `cold_shard_dir` and `cold_index` are removed from every `Database` via `.take()` before calling the replay routines, and only restored after replay finishes. Because replay executes the normal command handlers, any replay-time read of a disk-offloaded key (e.g., INCR’s read-modify-write) cannot cold-read and will behave as if the key is missing, producing incorrect recovered state. ## Issue Context - `rdb::load` swaps whole `Database` instances (`*live = temp`), which is why cold-tier fields get dropped. - The current approach works around that swap by stashing cold wiring in `main.rs`, but it makes cold storage unavailable during the replay window. - Fix must ensure cold wiring is present *while replaying incremental RESP commands*, not only after replay ends. ## Fix Focus Areas - src/main.rs[739-775] - src/main.rs[1054-1065] - src/persistence/aof_manifest.rs[1299-1370] - src/persistence/rdb.rs[362-367] ## Suggested fix approach 1. Change replay path so cold wiring is restored immediately after the base RDB load swap and **before** incremental command replay begins. - Option A: Add a recovery-only `rdb::load_preserving_wiring(...)` (or similar) that copies selected “topology” fields (cold wiring) from the current live DBs into the temp DBs right before the final swap, and use it from `replay_multi_part` / per-shard replay. - Option B: Have `replay_multi_part` / `replay_per_shard` accept a preserved-wiring bundle and reattach it right after `rdb::load(...)` returns and before `replay_incr_resp(...)`. 2. Remove (or narrow) the `.take()` window in `main.rs` so cold wiring remains available during replayed command execution. 3. Add a regression test that: - Sets up disk-offload state with a key only in cold storage. - Replays an AOF stream containing a read-modify-write command against that key (e.g., INCR). - Asserts the recovered value matches the expected incremented value.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

coderabbitai

Actionable comments posted: 5

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

src/shard/conn_accept.rs (1)
368-396: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Migrated tokio connections still drop the disk-offload spill context.

spawn_tokio_connection() now wires spill_sender / spill_file_id / disk_offload_dir, but migrated tokio connections still build ConnectionContext with None, Cell(0), and None. After an affinity migration, that client falls back to the non-spilling eviction path even when disk-offload is enabled, so cold-tier durability depends on whether the connection migrated. This needs the same three parameters threaded through spawn_migrated_tokio_connection() and its call sites in src/shard/event_loop.rs.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/shard/conn_accept.rs` around lines 368 - 396, ConnectionContext is being
built with None/Cell(0)/None for spill_sender, spill_file_id, and
disk_offload_dir causing migrated tokio connections to lose disk-offload state;
update spawn_migrated_tokio_connection and all its call sites to accept and
forward the same spill_sender, spill_file_id, and disk_offload_dir parameters
that spawn_tokio_connection uses, then change the ConnectionContext::new call in
conn_accept.rs (the construction currently passing None, Rc::new(Cell::new(0)),
None) to pass the propagated spill_sender, spill_file_id, and disk_offload_dir
values so migrated connections retain the spill context.
src/server/embedded.rs (1)
221-223: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Port the main.rs startup fixes into embedded mode too.

run_embedded() still diverges from the binary on two functional paths: it only restores when persistence_dir is set, so --appendonly no + disk-offload loses cold-tier recovery after restart, and it still enables tokio/Linux per_shard_accept, which is the same accept-path hang/race main.rs just disabled. Embedded startup needs the same recovery gate and central-accept behavior as main.rs.

Also applies to: 359-365
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/server/embedded.rs` around lines 221 - 223, In run_embedded, make two
changes to match main.rs: (1) call shard.restore_from_persistence whenever
disk_offload_base.is_some() even if persistence_dir is None (i.e. change the
current if let Some(ref dir) = persistence_dir guard to a condition that
restores when persistence_dir.is_some() || disk_offload_base.is_some(), passing
persistence_dir.as_deref() as before) so cold-tier recovery runs with
--appendonly no + disk-offload; (2) disable per-shard accept so embedded uses
the central accept path like main.rs by clearing or forcing the per_shard_accept
flag (refer to the per_shard_accept variable/setting used elsewhere) before
spawning listeners; apply the same two edits to the other restore block that
currently mirrors this logic.
src/persistence/aof_manifest.rs (1)
2901-2999: 🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift

Split this module before adding more persistence logic.

src/persistence/aof_manifest.rs is already far past the 1500-line cap, and this change adds more rewrite/replay/test surface on top of that. Please break the per-shard rewrite/replay pieces into submodules before this grows further.

As per coding guidelines, "No single .rs file should exceed 1500 lines. Split into submodules if approaching this limit."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/persistence/aof_manifest.rs` around lines 2901 - 2999, The
aof_manifest.rs file has grown beyond the 1500-line guideline — move the
per-shard rewrite/replay logic and related tests into one or more submodules
(e.g., shard_rewrite.rs and shard_replay.rs). Concretely: extract functions and
helpers that operate per-shard (advance_shard, prune_shard_files,
shard_base_path_seq, shard_incr_path_seq, plus any helpers used only for
shard-level file ops and the test
advance_shard_defers_delete_until_after_commit) into a new submodule(s), keep
the manifest-level state (AofManifest struct, seq, write_manifest,
initialize_multi, load) in the top-level file, and re-export needed symbols with
pub use so callers still reference
AofManifest::{advance_shard,prune_shard_files,...} as before; update
imports/paths in tests to use the new module layout and ensure visibility
(pub(crate)/pub) matches previous access.

🟠 Major comments (16)

src/server/conn/handler_monoio/mod.rs-2049-2057 (1)
2049-2057: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Reply-loss now skips AOF for writes that may already have executed.

Once the PipelineBatch has been accepted, the target shard can apply the write and then lose the reply. This error branch returns early and drops meta, so the aof_bytes never reach Lines 2066-2095. That leaves the target shard’s in-memory state ahead of its AOF until a later rewrite, so a crash in that window can lose a write that may already be visible to clients. The durability boundary needs to move into the target shard’s PipelineBatch execution/ack path instead of depending on the reply round-trip.

Also applies to: 2066-2095
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/server/conn/handler_monoio/mod.rs` around lines 2049 - 2057, The error
branch for shard_responses (when Err(err_msg)) returns early and skips appending
aof_bytes for the corresponding PipelineBatch, risking reply-loss durability
gaps; modify the handler so that whenever a PipelineBatch was accepted by a
target shard (identify via PipelineBatch or meta entries and aof_bytes), you
still persist the batch to AOF before continuing or ensure the target shard
writes to AOF synchronously as part of its accept/ack path. Concretely: in the
shard_responses Err branch (and the same logic covering the range that currently
writes aof_bytes), either call the existing AOF append routine with the same
aof_bytes tied to the meta entries or change the target shard execution/ack code
to perform the AOF write at accept time (so PipelineBatch acceptance guarantees
persistence), then remove the early-return that skips AOF.
src/persistence/page_cache/mod.rs-93-107 (1)
93-107: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Lazy startup allocation still does not reclaim memory after eviction.

These buffers start empty now, but evict_cold_frames() only clears frame metadata; it never drops the backing Vec<u8> for evicted frames. After the working set shrinks, touched pages stay resident forever, so the memory-pressure cascade still cannot actually give PageCache bytes back before escalating to KV eviction.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/persistence/page_cache/mod.rs` around lines 93 - 107, evict_cold_frames
currently only clears frame metadata but leaves the backing buffers (buffers_4k
and buffers_64k) allocated, so freed pages never release heap capacity; update
evict_cold_frames to drop or replace the Vec<u8> backing an evicted frame (for
the 4k and 64k pools) — e.g. swap the slot's RwLock<Vec<u8>> contents with an
empty Vec::new() or call Vec::clear() plus shrink_to_fit() while holding the
lock — so that the heap allocation is released; locate uses in fetch_page,
FrameDescriptor and any code that indexes into buffers_4k / buffers_64k to
ensure the swap/replace is safe under the same locking scheme.
src/persistence/page_cache/mod.rs-68-71 (1)
68-71: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Minimum frame floors can still blow past the whole-instance budget.

per_shard_pagecache_budget() divides first, but these unconditional max(64) / max(8) floors add a fixed 768 KiB minimum per shard. Once whole_budget / num_shards drops below that, aggregate PageCache capacity exceeds the configured whole-instance budget again (for example, 4 MiB across 8 shards becomes 6 MiB of frame capacity). That breaks the new shard-budget contract on small-memory or high-shard-count deployments.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/persistence/page_cache/mod.rs` around lines 68 - 71, The function
pagecache_frame_counts currently forces fixed minimums (num_4k.max(64),
num_64k.max(8)) which can make per-shard capacity exceed the
per_shard_pagecache_budget; change pagecache_frame_counts to never increase
frame counts above what budget_bytes permits: compute num_4k and num_64k from
the 3:1 split as before, then remove the unconditional .max floor and instead
clamp/adjust the counts so (num_4k*PAGE_4K + num_64k*PAGE_64K) <= budget_bytes
(reduce 4K frames first or proportionally if needed) and ensure zero/low budgets
yield small counts rather than the fixed 64/8 floor; update any callers
(per_shard_pagecache_budget) expectations accordingly.
src/server/conn/handler_sharded/mod.rs-1283-1288 (1)
1283-1288: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't spill to the process CWD on a missing offload dir.

If ctx.spill_sender is set but ctx.disk_offload_dir is None, this still evicts the key and queues the spill under "./data". That hides broken wiring and can send cold files somewhere the shard's read-through/recovery code will never scan. Fail fast here instead of defaulting to ".".
Proposed fix
-                                    let dir = ctx
-                                        .disk_offload_dir
-                                        .as_deref()
-                                        .unwrap_or(std::path::Path::new("."));
+                                    let Some(dir) = ctx.disk_offload_dir.as_deref() else {
+                                        drop(rt);
+                                        return Err(Frame::Error(Bytes::from_static(
+                                            b"ERR disk offload misconfigured",
+                                        )));
+                                    };
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/server/conn/handler_sharded/mod.rs` around lines 1283 - 1288, The code
currently falls back to "." when ctx.spill_sender is Some but
ctx.disk_offload_dir is None; update the evict logic so that when
ctx.spill_sender.is_some() and ctx.disk_offload_dir.is_none() you fail fast
instead of defaulting to CWD: in the evict_result block (where ctx.spill_sender,
ctx.spill_file_id and the dir variable are used) add an explicit check for
ctx.disk_offload_dir.is_none() and return or propagate an error (or log and
abort the eviction) rather than using unwrap_or("."), ensuring spills are only
queued when disk_offload_dir is present. Ensure you reference ctx.spill_sender,
ctx.disk_offload_dir, and ctx.spill_file_id in the change so the code path
short-circuits appropriately.
src/storage/tiered/kv_spill.rs-336-356 (1)
336-356: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Sync the data directory after publishing the batch file.

file.sync_all() only persists the temporary file contents. Without an fsync on data_dir after rename, a crash can lose heap-*.mpf even though later code commits that file into the manifest, leaving recovery with a manifest entry that points at a missing spill file.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/storage/tiered/kv_spill.rs` around lines 336 - 356, The
write_kv_spill_batch function currently syncs only the temp file
(file.sync_all()) but not the containing directory, so post-rename the directory
entry for final_path may not be persisted; after the std::fs::rename(&tmp_path,
&final_path)? call, open the data_dir (e.g., with
OpenOptions::new().read(true).open(&data_dir) or File::open(&data_dir)) and call
sync_all() on that directory handle to ensure the directory metadata (the new
heap-*.mpf entry) is flushed to disk.
src/shard/persistence_tick.rs-405-438 (1)
405-438: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail closed when the manifest write fails.

If manifest.commit() errors here, the code still inserts every key into cold_index and starts serving that spill file immediately. After the next crash/restart, recovery never sees the file because it was never durably registered, so those evicted keys disappear. Skip indexing, or requeue the completion, until the manifest commit succeeds.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/shard/persistence_tick.rs` around lines 405 - 438, If manifest.commit()
fails after manifest.add_file (the block handling shard_manifest), do not
proceed to insert entries into the cold_index; instead treat the spill
completion as not durable and either requeue the completion or return an error
so it can be retried. Concretely: after calling manifest.commit() on the
manifest from shard_manifest, on Err(e) stop further processing of that
completion (do not run the ColdLocation insertion loop), and re-enqueue or
surface the completion so it can be retried later; ensure this change
short-circuits both the is_initialized()/with_shard_db path and the
shard_databases.write_db path that call ci.insert(...) so no keys are indexed
unless commit succeeded.
src/storage/tiered/cold_read.rs-48-59 (1)
48-59: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard the Unix-only read_exact_at path.

src/storage/tiered/cold_read.rs imports std::os::unix::fs::FileExt as _; unconditionally (only #[cfg(test)] is nearby), so non-Unix targets won’t compile. Keep the page-offset fast path under #[cfg(unix)]/#[cfg(target_family = "unix")] and add a #[cfg(not(unix))] fallback (e.g., seek + read_exact into leaf_buf, or the previous whole-file read) so this module builds everywhere.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/storage/tiered/cold_read.rs` around lines 48 - 59, The code
unconditionally imports std::os::unix::fs::FileExt and calls
file.read_exact_at(...) which breaks non-Unix builds; wrap the import and the
fast pread path in #[cfg(unix)] (or #[cfg(target_family = "unix")]) and provide
a #[cfg(not(unix))] fallback that uses file.seek(SeekFrom::Start(page_offset))
followed by file.read_exact(&mut leaf_buf) (or the previous whole-file read) to
fill leaf_buf for the specific page; apply these guards around the FileExt
import and the block that computes page_offset and calls read_exact_at so
symbols like file_path, file, page_offset, leaf_buf, PAGE_4K, and
location.page_idx remain usable on all platforms.
tests/crash_recovery_disk_offload_no_aof.rs-331-337 (1)
331-337: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add missing // SAFETY: for the libc::kill unsafe block in sigkill
In tests/crash_recovery_disk_offload_no_aof.rs (unix sigkill, lines ~331-338), unsafe { libc::kill(pid, libc::SIGKILL); } is missing a preceding // SAFETY: comment explaining the invariant that makes the call sound (or refactor to avoid unsafe). Note: scripts/audit-unsafe.sh greps src/**/*.rs, so this specific test may not be caught by that script, but it still violates the repo unsafe policy.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/crash_recovery_disk_offload_no_aof.rs` around lines 331 - 337, The
unsafe call to libc::kill inside the unix-only function sigkill lacks the
required safety comment; add a preceding "// SAFETY:" comment that documents the
invariants making the call sound (e.g., that child.id() yields a valid pid for
the child process, the pid value is non-zero and points to a process we
intentionally intend to SIGKILL, and that using libc::kill with SIGKILL is safe
here), or alternatively refactor sigkill to use a safe wrapper (e.g.,
nix::sys::signal::kill with nix::unistd::Pid) to avoid unsafe entirely;
reference the sigkill function and the libc::kill usage when making the change.
src/main.rs-790-850 (1)
790-850: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Refuse startup here instead of warning-and-continuing.

If this data dir has already been migrated to single-shard multi-part AOF, appendonly.aof may have been renamed to appendonly.aof.legacy by the monoio path above. In that case this tokio branch starts without replaying the authoritative manifest pair and silently serves stale/empty state. This should be a hard error until tokio can load single-shard multi-part AOF, not a warning.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/main.rs` around lines 790 - 850, The tokio branch currently logs a
warning when a multi-part AOF manifest exists but monoio-only single-shard
replay is unavailable, which can let the process start with stale/empty state;
change the #[cfg(not(feature = "runtime-monoio"))] block so it refuses startup
instead of continuing: detect the presence of the multi-part
manifest/appendonlydir (same condition the monoio branch would have used), and
return an error (propagate with anyhow::Context or similar) or exit with a
non-zero error rather than calling tracing::warn; update references inside
main's startup logic around num_shards, base_dir, and the legacy appendonly.aof
handling so that when a manifest is present and runtime-monoio is disabled the
program fails fast (mirroring the monoio path behavior that runs
replay_multi_part/retire legacy files) rather than continuing.
src/config.rs-32-34 (1)
32-34: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid deriving Default for ServerConfig (derived defaults don’t match clap CLI defaults).

ServerConfig derives Default, but many fields define clap default_value/default_value_t (e.g., bind="127.0.0.1", port=6379, appendonly="yes", protected_mode="yes", memory_arenas_cap=8). The derived Default will instead produce Rust defaults ("", 0, false, etc.), creating an invalid config if ServerConfig::default() is ever used.

Repo tests use ServerConfig::parse_from([]) for defaults, but the public derive remains a footgun. Remove the Default derive or implement Default by delegating to ServerConfig::parse_from([]) (so it matches the CLI defaults exactly).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/config.rs` around lines 32 - 34, ServerConfig currently derives Default
which yields Rust primitive defaults that differ from clap's CLI defaults;
remove the #[derive(Default)] or replace it with a custom impl Default that
delegates to clap's parsing so defaults match CLI behavior (e.g., implement
Default for ServerConfig by calling ServerConfig::parse_from(&[]) and returning
that value). Update the ServerConfig declaration (the struct with
#[derive(Parser, Debug, Clone, ...)] and any tests expecting
ServerConfig::default()) to use the new behavior so ServerConfig::default()
mirrors clap defaults exactly.
tests/crash_matrix_per_shard_bgrewriteaof.rs-27-27 (1)
27-27: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Gate this test to the runtime that actually supports the scenario.

The file comment says tokio rejects per-shard BGREWRITEAOF, but this cfg(any(...)) still compiles the tests for runtime-tokio. bgrewriteaof() then asserts the command is not refused, so an explicit tokio run fails by construction. Limit this file to monoio or add a tokio-specific expectation path.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/crash_matrix_per_shard_bgrewriteaof.rs` at line 27, The test is
compiled for both runtimes but only valid for monoio; update the test gating or
add a tokio branch: either change the module cfg from cfg(any(feature =
"runtime-monoio", feature = "runtime-tokio")) to only cfg(feature =
"runtime-monoio") so the file is only built for monoio, or keep both runtimes
and modify the bgrewriteaof() test to branch on the runtime (runtime-tokio
should assert the command is refused while runtime-monoio asserts success).
Refer to the test function bgrewriteaof() and the module-level cfg to implement
the appropriate gate or conditional expectation.
tests/crash_matrix_per_shard_bgrewriteaof.rs-138-145 (1)
138-145: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix missing // SAFETY: for unsafe { libc::kill(...) } in sigkill

tests/crash_matrix_per_shard_bgrewriteaof.rs (sigkill, lines ~138-145) has an unsafe { libc::kill(pid, libc::SIGKILL) } block without the mandatory // SAFETY: comment explaining the invariant (e.g., that pid comes from the live Child handle for the spawned process).

The existing #[cfg(unix)] vs #[cfg(not(unix))] structure matches other test helpers in this repo that also use libc::kill, so the immediate required fix is the missing safety rationale.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/crash_matrix_per_shard_bgrewriteaof.rs` around lines 138 - 145, Add a
`// SAFETY:` rationale above the unsafe block in the `sigkill` function
explaining why the call to `libc::kill(pid, libc::SIGKILL)` is safe: state that
`pid` is derived from a live `std::process::Child` handle returned by
`child.id()` and thus refers to a valid process id for the spawned child, and
that killing with `SIGKILL` is intentional for the test; keep this comment
adjacent to the `unsafe { libc::kill(...) }` block in `sigkill` so it documents
the required invariant.
src/server/listener.rs-315-328 (1)
315-328: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Skip binding the central TCP listener when per_shard_accept is enabled.

This still creates a REUSEPORT listener even when the main loop never calls accept() on it. On Linux multishard runs, the kernel can hash connections to that idle socket, so those clients get black-holed again. Bind the central TCP listener only in the !per_shard_accept path, or make the per-shard mode use no central TCP socket at all.
Suggested fix
-    #[cfg(unix)]
-    let listener = match crate::shard::conn_accept::create_reuseport_socket(&addr) {
-        Ok(std_listener) => TcpListener::from_std(std_listener)?,
-        Err(e) => {
-            tracing::warn!(
-                "central SO_REUSEPORT bind failed ({}); falling back to plain bind on {}",
-                e,
-                addr
-            );
-            TcpListener::bind(&addr).await?
-        }
-    };
-    #[cfg(not(unix))]
-    let listener = TcpListener::bind(&addr).await?;
+    let listener = if per_shard_accept {
+        None
+    } else {
+        #[cfg(unix)]
+        {
+            Some(match crate::shard::conn_accept::create_reuseport_socket(&addr) {
+                Ok(std_listener) => TcpListener::from_std(std_listener)?,
+                Err(e) => {
+                    tracing::warn!(
+                        "central SO_REUSEPORT bind failed ({}); falling back to plain bind on {}",
+                        e,
+                        addr
+                    );
+                    TcpListener::bind(&addr).await?
+                }
+            })
+        }
+        #[cfg(not(unix))]
+        {
+            Some(TcpListener::bind(&addr).await?)
+        }
+    };
...
-                if per_shard_accept {
+                if per_shard_accept {
                     std::future::pending::<std::io::Result<(tokio::net::TcpStream, std::net::SocketAddr)>>().await
                 } else {
-                    listener.accept().await
+                    listener.as_ref().expect("listener exists when per_shard_accept is false").accept().await
                 }
Also applies to: 403-408
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/server/listener.rs` around lines 315 - 328, The central TCP listener is
still being bound even when per_shard_accept is enabled, causing an idle
REUSEPORT socket to receive hashed connections; update the code around
create_reuseport_socket/TcpListener::bind so the listener is only created when
per_shard_accept is false (i.e. move the entire match that calls
create_reuseport_socket and TcpListener::bind into the !per_shard_accept branch
or conditionally skip binding when per_shard_accept is true), and apply the same
change to the other analogous block around lines 403-408 so no central socket is
bound in per-shard accept mode.
src/persistence/aof_manifest.rs-1246-1251 (1)
1246-1251: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Persist the rewrite cutoff LSN here, not the manifest sequence.

max_lsn is now defined as the next-free replication offset, but this stores new_seq (2, 3, …). After the first rewrite, recovery can seed master_repl_offset below already-persisted data and reissue duplicate LSNs. Thread the coordinated shard cutoff/end offset into advance_shard and persist that value instead.
Suggested direction
 pub fn advance_shard(
     &mut self,
     shard_id: u16,
     new_seq: u64,
+    max_lsn: u64,
     rdb_bytes: &[u8],
 ) -> Result<PathBuf, crate::error::MoonError> {
@@
-        self.shards[shard_idx].max_lsn = self.shards[shard_idx].max_lsn.max(new_seq);
+        self.shards[shard_idx].max_lsn = max_lsn;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/persistence/aof_manifest.rs` around lines 1246 - 1251, The code currently
sets self.shards[shard_idx].max_lsn =
self.shards[shard_idx].max_lsn.max(new_seq) in advance_shard, but max_lsn
represents the next-free replication offset and must store the coordinated shard
rewrite cutoff/end offset (the persisted rewrite cutoff LSN) rather than the
manifest sequence number new_seq; modify advance_shard to accept the computed
shard_cutoff (or end_offset) parameter from the caller and persist that value
into self.shards[shard_idx].max_lsn (use max to avoid regressions), ensuring the
function uses shard_cutoff instead of new_seq so recovery cannot reuse stale
LSNs and master_repl_offset remains monotonic.
src/persistence/manifest.rs-1284-1437 (1)
1284-1437: 🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift

This file is over the 1500-line limit.

The overflow-page implementation and new crash-path tests push src/persistence/manifest.rs past the repo cap. Splitting serialization/recovery helpers into submodules will make the dual-root and overflow invariants easier to audit.

As per coding guidelines, "No single .rs file should exceed 1500 lines. Split into submodules if approaching this limit."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/persistence/manifest.rs` around lines 1284 - 1437, The manifest.rs file
exceeds the 1500-line limit; refactor by extracting serialization/recovery and
overflow-related code into focused submodules so the top-level ShardManifest
implementation stays small and tests keep the same behavior. Identify the
helpers used by commit(), open(), and overflow handling (e.g., any functions
named like persist_overflow, recover_overflow, recover_root, serialize_root,
parse_overflow_pages) and move them into new submodules (serialization/recovery
and overflow), adjust their visibility (pub(crate) or pub) so
ShardManifest::commit, ShardManifest::open, gc_tombstones, add_file, remove_file
and the tests still compile unchanged, and update mod declarations and
use/imports in manifest.rs to delegate to the new modules while keeping the
public API intact. Ensure unit tests (test_overflow_*,
test_overflow_commit_crash_atomicity, test_overflow_compaction_bounds_growth,
test_overflow_tombstone_and_gc) continue to run by preserving behavior and error
paths after the split.
src/persistence/manifest.rs-438-450 (1)
438-450: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle torn overflow tails before computing start_page.

open() can recover from a partially written overflow run by falling back to the older root, but the next commit() appends from raw EOF and only debug_assert!s 4KB alignment. In release, a non-page-aligned EOF makes start_page floor to the previous page while the new overflow bytes are written at the unaligned offset, so the committed root points at the wrong location and the next reopen drops the new slot. Truncate or reject any eof % PAGE_4K != 0 tail before appending.
Suggested fix
-            let eof = self.file.seek(SeekFrom::End(0))?;
-            // The manifest is always a whole number of 4 KB pages.
-            debug_assert_eq!(eof % PAGE_4K as u64, 0);
+            let mut eof = self.file.seek(SeekFrom::End(0))?;
+            let misalignment = eof % PAGE_4K as u64;
+            if misalignment != 0 {
+                eof -= misalignment;
+                self.file.set_len(eof)?;
+                self.file.sync_data()?;
+            }
             let start_page = (eof / PAGE_4K as u64) as u32;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/persistence/manifest.rs` around lines 438 - 450, The overflow append path
computes start_page from EOF without handling a torn (non-4KB-aligned) tail,
which in release causes writes at the wrong offset; before computing start_page
in the block that writes overflow (the code around overflow_start_page and use
of PAGE_4K), detect if eof % PAGE_4K != 0 and truncate the file down to the
previous PAGE_4K boundary (or return an error) so subsequent writes are
page-aligned; implement this by checking eof =
self.file.seek(SeekFrom::End(0))?, computing eof_floor = (eof / PAGE_4K as u64)
* PAGE_4K as u64 and calling self.file.set_len(eof_floor) (or propagate an
error) before computing start_page and writing the overflow buffer.

🟡 Minor comments (5)

src/shard/dispatch.rs-690-693 (1)
690-693: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Drop the wall-clock assertion from this test.

start.elapsed() < 5ms measures scheduler noise as much as helper behavior, so it can flap on a loaded CI runner even when the fast path never sleeps. calls == 1 already proves the no-retry path; if you want timing coverage, use paused Tokio time instead.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/shard/dispatch.rs` around lines 690 - 693, Remove the flaky wall-clock
assertion that checks start.elapsed() < Duration::from_millis(5) in the test
(the assert! using start and the 5ms bound), because it measures scheduler
noise; instead rely on the existing calls == 1 check to prove the no-retry path.
If you need to assert timing behavior for the fast path, convert the test to use
paused Tokio time (tokio::time::pause/advance) and assert against simulated time
rather than real elapsed time. Ensure you remove the assert! line and keep the
calls variable assertion intact.
tests/multishard_serve_smoke.rs-106-110 (1)
106-110: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

safe_shards() still oversubscribes 1-core runners.

The lower clamp of 2 means a host with available_parallelism() == 1 still runs the “multishard” cases with 2 shards, which is exactly the oversubscription confound the comment says this helper avoids. Skip the multishard variants when fewer than 2 cores are available instead of forcing 2 here.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/multishard_serve_smoke.rs` around lines 106 - 110, The helper
safe_shards() forces at least 2 shards, causing 1-core runners to be
oversubscribed; change the lower clamp from 2 to 1 so it returns the actual
available_parallelism() (bounded above by 4) instead of forcing two shards —
update the safe_shards() implementation (which uses
std::thread::available_parallelism()) to use .clamp(1, 4) and ensure any test
harness that expects multishard behavior skips multishard variants unless
safe_shards() >= 2.
tests/multishard_serve_smoke.rs-37-44 (1)
37-44: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Avoid the probe-port race in the smoke harness.

free_port() releases the ephemeral listener before the child process binds, so a parallel test/process can steal that port and make this smoke test fail with the same “never served” symptoms it is trying to catch. Please retry port allocation/spawn on bind failure instead of treating the first chosen port as reserved.

Also applies to: 113-125
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/multishard_serve_smoke.rs` around lines 37 - 44, free_port() currently
picks and releases an ephemeral port before spawning the child, which allows
races; change the harness to loop: pick a candidate port (using the existing
free_port() helper or by trying to bind and immediately drop), spawn the child
process with that port, then probe/connect to the child (or parse its startup
output) with a short timeout; if the probe fails with a bind-refused or "address
already in use" condition, kill the child and retry with a new port up to a
small retry limit. Apply this retry-on-bind-failure pattern to the other
probe/spawn logic referenced (the code around lines 113-125) so the test only
fails after exhausting retries rather than on the first stolen port.
README.md-236-242 (1)
236-242: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify the guardrail caveats here.

This overstates the behavior a bit: omitting --maxmemory does not always force allkeys-lru, and it can skip auto-capping entirely when no memory limit is detectable. The doc should match resolve_memory_guardrail() and say the policy flip happens only from noeviction, otherwise the existing eviction policy is preserved.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` around lines 236 - 242, Update the README blurb about --maxmemory
to match resolve_memory_guardrail(): clarify that omitting --maxmemory does not
always force allkeys-lru — the code only flips the eviction policy to
allkeys-lru when it would otherwise be noeviction and a memory guardrail is
applied; if no memory limit is detectable the auto-cap can be skipped entirely
and an existing eviction policy is preserved. Reference the behavior of
resolve_memory_guardrail(), the flags --maxmemory and --shards, and the policies
allkeys-lru and noeviction so readers understand the exact conditional policy
flip and the per-shard cap calculation.
tests/crash_matrix_per_shard_bgrewriteaof.rs-281-281 (1)
281-281: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don't derive the port with +1; that socket was never reserved.

unique_port() only proves the returned port was free at probe time. saturating_add(1) picks a different port that may already be occupied, so this test can fail intermittently for reasons unrelated to BGREWRITEAOF. Call unique_port() again here instead.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/crash_matrix_per_shard_bgrewriteaof.rs` at line 281, The test currently
derives a second port by doing let port = unique_port().saturating_add(1); which
is unsafe because the +1 socket was never reserved; replace that expression by
calling unique_port() again to obtain an independently probed free port (i.e.,
call unique_port() twice rather than adding 1) so the variable assigned to port
is guaranteed to be probed free when used in the BGREWRITEAOF test.

🧹 Nitpick comments (1)

tests/autovacuum_daemon.rs (1)

14-28: 💤 Low value

Prefer parking_lot::Mutex for consistency with project guidelines.

The coding guidelines specify using parking_lot locks instead of std::sync locks across all Rust files. While this is test infrastructure (not a hot path), using parking_lot::Mutex maintains consistency and is simpler—parking_lot locks never poison, so you can replace the poison-recovery .unwrap_or_else(|e| e.into_inner()) with a plain .lock(). The non-poisoning behavior still prevents wedging sibling tests if one panics.

As per coding guidelines: "Use parking_lot::RwLock / parking_lot::Mutex — never std::sync locks."

♻️ Refactor to use parking_lot::Mutex

 use std::sync::atomic::Ordering;
-use std::sync::{Mutex, MutexGuard, OnceLock};
+use std::sync::OnceLock;
+use parking_lot::{Mutex, MutexGuard};

-/// Serializes the two tests that read/assert the process-global
-/// `RECL_AUTOVACUUM_LAST_RUN_TS`. The default parallel test runner otherwise lets
-/// `test_autovacuum_tick_updates_last_run_ts` advance that global while
-/// `test_disabled_autovacuum_is_noop` is mid-assertion → spurious cross-test
-/// failure (observed flaking under load). Poison-tolerant: a panicking holder
-/// must not wedge the sibling.
+/// Serializes the two tests that read/assert the process-global
+/// `RECL_AUTOVACUUM_LAST_RUN_TS`. parking_lot::Mutex never poisons, so a
+/// panicking holder will not wedge sibling tests.
 fn autovacuum_ts_lock() -> MutexGuard<'static, ()> {
     static GUARD: OnceLock<Mutex<()>> = OnceLock::new();
-    GUARD
-        .get_or_init(|| Mutex::new(()))
-        .lock()
-        .unwrap_or_else(|e| e.into_inner())
+    GUARD.get_or_init(|| Mutex::new(())).lock()
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/autovacuum_daemon.rs` around lines 14 - 28, The test helper
autovacuum_ts_lock currently uses std::sync::Mutex and poison-recovery
unwrap_or_else; replace std::sync::Mutex/MutexGuard with parking_lot::Mutex and
parking_lot::MutexGuard (keep std::sync::OnceLock or replace with equivalent if
desired), remove the poison-handling call and simply call .lock() on the
parking_lot mutex, and adjust imports accordingly; refer to the
autovacuum_ts_lock function and the GUARD static that protects
RECL_AUTOVACUUM_LAST_RUN_TS when making the change.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 28f3eec1-3613-4d85-bb70-498af67cfcdc

📥 Commits

Reviewing files that changed from the base of the PR and between 03ca4ae and 8e6eb89.

📒 Files selected for processing (61)

CHANGELOG.md
README.md
src/command/persistence.rs
src/config.rs
src/main.rs
src/persistence/aof.rs
src/persistence/aof_manifest.rs
src/persistence/manifest.rs
src/persistence/migrate_aof.rs
src/persistence/page_cache/mod.rs
src/server/conn/handler_monoio/mod.rs
src/server/conn/handler_sharded/mod.rs
src/server/conn/handler_single.rs
src/server/embedded.rs
src/server/listener.rs
src/shard/conn_accept.rs
src/shard/dispatch.rs
src/shard/event_loop.rs
src/shard/persistence_tick.rs
src/shard/spsc_handler.rs
src/shard/timers.rs
src/storage/eviction.rs
src/storage/tiered/cold_index.rs
src/storage/tiered/cold_read.rs
src/storage/tiered/kv_spill.rs
src/storage/tiered/spill_thread.rs
tests/adversarial_v0110_fix01_set_delete_rollback.rs
tests/adversarial_v0110_fix02_err_path_intent.rs
tests/adversarial_v0110_fix03_simplestring_graph.rs
tests/adversarial_v0110_fix04_shortest_path_call_parity.rs
tests/adversarial_v0110_fix06_shortest_path_min_hops.rs
tests/adversarial_v0110_fix07_multihop_edge_var_reject.rs
tests/autovacuum_daemon.rs
tests/crash_matrix_per_shard_aof.rs
tests/crash_matrix_per_shard_bgrewriteaof.rs
tests/crash_recovery_disk_offload_no_aof.rs
tests/ft_search_as_of_boundary.rs
tests/ft_search_as_of_filter.rs
tests/ft_search_concurrent_readers.rs
tests/ft_search_multi_shard_as_of.rs
tests/ft_search_temporal_parity.rs
tests/graph_bench_compare.rs
tests/graph_bench_e2e.rs
tests/graph_integration.rs
tests/graph_stress_deep.rs
tests/integration.rs
tests/kill_snapshot.rs
tests/lunaris_cypher_shortest_path.rs
tests/lunaris_cypher_temporal.rs
tests/lunaris_hybrid_ft_search.rs
tests/mq_integration.rs
tests/multishard_serve_smoke.rs
tests/pipeline_auto_index.rs
tests/replication_test.rs
tests/txn_completeness_edge_cases.rs
tests/txn_cypher_write_rollback.rs
tests/txn_ft_search_snapshot.rs
tests/txn_graph_wiring.rs
tests/txn_kv_wiring.rs
tests/vacuum_commands.rs
tests/workspace_integration.rs

coderabbitai · 2026-06-03T16:25:10Z

+        if self.remaining.fetch_sub(1, Ordering::AcqRel) == 1 {
+            // Abort if any shard failed to fold: committing new_seq while a
+            // shard lacks its new base would break recovery. Keep old_seq.
+            if self.failed.load(Ordering::Acquire) {
+                let m = self.manifest.lock();
+                // Best-effort: prune the orphaned new-seq files written by the
+                // shards that DID fold, so they don't linger.
+                for sid in 0..self.n_shards {
+                    m.prune_shard_files(sid as u16, self.new_seq);
+                }
+                drop(m);
+                error!(
+                    "F6 per-shard rewrite ABORTED: a shard failed to fold; seq stays {}. \
+                     Old generation remains authoritative (crash-safe). A RESTART is \
+                     recommended so successful shards' writers stop appending to the \
+                     discarded new generation.",
+                    self.old_seq
+                );
+                crate::command::persistence::AOF_REWRITE_IN_PROGRESS.store(false, Ordering::SeqCst);
+                return;
+            }
+            let mut m = self.manifest.lock();
+            m.seq = self.new_seq;
+            if let Err(e) = m.write_manifest() {
+                error!(
+                    "F6 per-shard rewrite: final manifest commit (seq {}) failed: {}. \
+                     Old generation remains authoritative; rewrite did not take effect.",
+                    self.new_seq, e
+                );
+                // Do NOT prune — old generation is still the committed state.
+                drop(m);
+                crate::command::persistence::AOF_REWRITE_IN_PROGRESS.store(false, Ordering::SeqCst);
+                return;


⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Abort paths still leave successful shards appending to uncommitted files.

By the time this branch runs, the successful writers have already reopened onto new_seq in do_rewrite_per_shard(). Returning here with manifest.seq still on old_seq means subsequent writes on those shards go to files recovery will ignore after a restart. Logging “restart recommended” is not enough here; this path needs an explicit rollback or a fail-fast shutdown before more writes are accepted.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/persistence/aof.rs` around lines 283 - 315, The abort branch after checking self.failed currently returns with shards' writers already reopened on self.new_seq (done in do_rewrite_per_shard()), causing successful shards to keep appending to uncommitted files; fix by performing an explicit rollback or fail-fast stop before returning: acquire the manifest lock (m) as done, call the rollback routine that reopens each shard's writer back to self.old_seq (or invoke the writer-shutdown/reopen function used by do_rewrite_per_shard for each sid and ensure m.prune_shard_files is still called for any orphaned new_seq files), set AOF_REWRITE_IN_PROGRESS.store(false, Ordering::SeqCst), and if a safe rollback cannot be completed atomically, trigger a fail-fast termination so no further writes to new_seq occur (ensure this uses the same writer-control APIs used elsewhere rather than leaving writers active).

coderabbitai · 2026-06-03T16:25:10Z

+        for s in &self.senders {
+            // Blocking send for guaranteed delivery — see the doc comment.
+            if s.send(AofMessage::RewritePerShard {
+                shard_dbs: shard_dbs.clone(),
+                coord: coord.clone(),
+            })
+            .is_err()
+            {
+                error!(
+                    "F6 per-shard rewrite: a writer channel is disconnected; \
+                     rewrite aborted (no manifest commit, old generation remains \
+                     authoritative). Inspect AOF writer threads."
+                );
+                return Err(AofPoolSendError::SendFailed);


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Handle partial fan-out failure before returning.

If one send() fails after earlier writers already received RewritePerShard, those writers can still fold and call shard_done(), but the failed sender and the unsent tail never decrement remaining. That leaves AOF_REWRITE_IN_PROGRESS stuck and can strand successful shards on new_seq files that will never be committed. Mark the coord failed and account for every unsent shard before returning so the dispatched writers can drive the abort path to completion.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/persistence/aof.rs` around lines 794 - 807, When a send to a writer fails in the loop over self.senders while dispatching AofMessage::RewritePerShard (with shard_dbs and coord), do not return immediately; instead mark the coord as failed (e.g. call coord.mark_failed() or coord.fail()) and account for every unsent shard by invoking the same completion path the writers use (e.g. call coord.shard_done() once per unsent shard or otherwise decrement the remaining counter for each shard that never received the message) before returning Err(AofPoolSendError::SendFailed) so AOF_REWRITE_IN_PROGRESS is cleared and successful shards can finish and commit new_seq as intended.

coderabbitai · 2026-06-03T16:25:10Z

+            AofMessage::AppendSync {
+                lsn,
+                bytes: data,
+                ack,
+            } => {
+                write_framed(file, lsn, &data).map_err(|e| AofError::Io {
+                    path: PathBuf::from("<aof per-shard incr drain>"),
+                    source: e,
+                })?;
+                outcome.drained += 1;
+                // Durability for these is covered by the post-drain fsync at
+                // the rewrite boundary (mirrors drain_pending_appends).
+                let _ = ack.send(AofAck::Synced);


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Do not ack AppendSync before the rewrite-boundary fsync succeeds.

Line 2799 sends AofAck::Synced before do_rewrite_per_shard() reaches its phase-1/phase-3 file.sync_data(). Under appendfsync=always, a client can therefore receive +OK during BGREWRITEAOF even though a crash in that window still loses the entry. Keep these ack senders pending and only fulfill them after the caller’s post-drain sync completes.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/persistence/aof.rs` around lines 2787 - 2799, The current code sends AofAck::Synced immediately inside the AofMessage::AppendSync arm (after calling write_framed), which can ack clients before the rewrite-boundary fsync in do_rewrite_per_shard() completes; instead, buffer the ack sender for later and only send AofAck::Synced after the caller’s post-drain sync finishes. Concretely: in the AppendSync branch (where write_framed is called and ack is available) stop calling ack.send(AofAck::Synced) and push the ack into a pending list/vec (e.g., outcome.pending_syncs or a new field) so do_rewrite_per_shard() can iterate and fulfill those senders only after its phase-1/phase-3 file.sync_data() succeeds; ensure any error paths still notify or drop the sender appropriately.

coderabbitai · 2026-06-03T16:25:11Z

+    // ── Partition into inline candidates and oversized entries ────────────────
+    // We use indices to avoid re-allocating keys.  `inline_indices` are the
+    // positions in `buffer` of entries that fit the inline threshold.
+    let mut inline_indices: Vec<usize> = Vec::with_capacity(buffer.len());
+    let mut oversized_indices: Vec<usize> = Vec::new();
+
+    for (i, req) in buffer.iter().enumerate() {
+        if req.value_bytes.len() <= INLINE_MAX_VALUE_BYTES {
+            inline_indices.push(i);
+        } else {
+            oversized_indices.push(i);
+        }
+    }
+
+    // ── Write ONE batch file for all inline entries ───────────────────────────
+    if !inline_indices.is_empty() {
+        // Use the file_id of the first inline entry for the batch file.
+        let file_id = buffer[inline_indices[0]].file_id;
+        let shard_dir = buffer[inline_indices[0]].shard_dir.clone();
+
+        let spill_entries: Vec<SpillEntry> = inline_indices
+            .iter()
+            .map(|&i| SpillEntry {
+                key: buffer[i].key.clone(),
+                value_bytes: buffer[i].value_bytes.clone(),
+                value_type: buffer[i].value_type,
+                flags: buffer[i].flags,
+                ttl_ms: buffer[i].ttl_ms,
+            })
+            .collect();
+
+        let completion = match build_kv_spill_batch(&spill_entries, file_id) {
+            Ok(batch) => {
+                let total_pages = batch.leaves.len() as u32; // overflow is always empty (inline-only)
+                match write_kv_spill_batch(&shard_dir, file_id, &batch) {
+                    Ok(byte_size) => {
+                        let entries = inline_indices
+                            .iter()
+                            .zip(batch.locations.iter())
+                            .map(|(&buf_idx, &(page_idx, slot_idx))| SpillCompletionEntry {
+                                key: buffer[buf_idx].key.clone(),
+                                db_index: buffer[buf_idx].db_index,
+                                page_idx,
+                                slot_idx,
+                            })
+                            .collect();
+                        SpillCompletion {
+                            file_entry: make_file_entry(file_id, total_pages, byte_size),
+                            entries,
+                            success: true,
+                        }
+                    }
+                    Err(e) => {
+                        warn!(
+                            file_id,
+                            error = %e,
+                            count = inline_indices.len(),
+                            "spill_thread: inline batch write failed"
+                        );
+                        SpillCompletion {
+                            file_entry: make_file_entry(file_id, 0, 0),
+                            entries: Vec::new(),
+                            success: false,
+                        }
+                    }
+                }
+            }
+            Err(e) => {
+                warn!(
+                    file_id,
+                    error = %e,
+                    count = inline_indices.len(),
+                    "spill_thread: inline batch build failed"
+                );
+                SpillCompletion {
+                    file_entry: make_file_entry(file_id, 0, 0),
+                    entries: Vec::new(),
+                    success: false,
+                }


⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Don't fail the entire inline flush when one entry misses the inline fit heuristic.

This pre-screen only checks value_bytes.len(), so a large key or an incompressible ~3.5KB value can still make build_kv_spill_batch() return InvalidData on a fresh leaf. That error path currently turns one bad candidate into a failed completion for the entire inline buffer, even though async eviction has already removed all of those keys from RAM. Please retry the offending entry via build_kv_spill_pages() (or split/rebuild the batch) instead of dropping the whole flush.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/storage/tiered/spill_thread.rs` around lines 155 - 233, The current logic treats any error from build_kv_spill_batch(...) as a fatal failure for the whole inline_indices batch; instead, when build_kv_spill_batch fails (e.g., InvalidData due to a single oversized or incompressible entry), retry the entries individually: for each index in inline_indices call build_kv_spill_pages() (or otherwise build a per-entry batch/page) for that single SpillEntry and then write it with the same writer (e.g., write_kv_spill_batch or the per-page writer), collecting successful SpillCompletionEntry items; on per-entry failure log and skip that entry but do not mark the entire SpillCompletion as failed if at least one entry succeeded, and only return a failed SpillCompletion when none of the entries could be written. Update the error branch around build_kv_spill_batch(...) to implement this per-entry retry loop and to populate SpillCompletion.file_entry/entries/success accordingly.

coderabbitai · 2026-06-03T16:25:11Z

+        match completion_tx.try_send(completion) {
+            Ok(()) => {}
+            Err(flume::TrySendError::Full(_)) => {
+                SPILL_COMPLETION_DROPPED.fetch_add(1, Ordering::Relaxed);
+                warn!(
+                    "spill_thread: completion channel full, dropping completion (total dropped: {})",
+                    SPILL_COMPLETION_DROPPED.load(Ordering::Relaxed)
+                );
+            }
+            Err(flume::TrySendError::Disconnected(_)) => {
+                // Event loop dropped its receiver -- shutting down; ignore.
+            }
+        }


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Completion drops are not safe here.

A dropped SpillCompletion never reaches apply_spill_completions(), so the file is not added to the manifest and its keys are not reinserted into cold_index. By this point the keys have already been evicted from RAM, so try_send(...Full) turns backpressure into immediate misses and permanent loss after restart.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/storage/tiered/spill_thread.rs` around lines 407 - 419, The code currently drops SpillCompletion when completion_tx.try_send(...) returns Full, causing lost completions and manifest/cold_index corruption; change the send logic in spill_thread to guarantee delivery of the SpillCompletion (do not increment SPILL_COMPLETION_DROPPED or drop the value). Replace the try_send branch with a blocking or async send loop using completion_tx.send(completion) (or await completion_tx.send_async/completion_tx.send with retry on Full) and only treat Disconnected as shutdown; ensure the SpillCompletion is retried until sent so apply_spill_completions() always receives it and keys can be reinserted into cold_index.

Self-review of PR #136 (BUG #1, MEDIUM, in-scope to f5e17ac). ShardManifest:: compact() rewrites the manifest to a temp file, fsyncs, and renames it over self.path (the durability point), then reopens self.file against the new inode. If that reopen failed, the error propagated out of compact() and was swallowed by commit() (warn-only, "commit already durable"). But self.file then still referred to the PRE-rename, now-orphaned inode: every SUBSEQUENT commit would seek+write through that detached handle, silently discarding all post-compact manifest updates while recovery (which reads self.path) saw only the compaction snapshot. Cold-tier file entries committed after the failed reopen were lost on restart. Trigger: reopen fails (fd exhaustion / permission / mount transient) while overflow is in use — rare, but silent and persistent when hit. Fix: on reopen failure, set a `needs_reopen` flag instead of leaving a stale handle. commit() checks the flag FIRST and reattaches to self.path before any write (resetting active_slot to 0, matching the compacted layout). If that reattach also fails, commit() returns the error — the data genuinely cannot be persisted, so fail loudly rather than write into a dead fd. TDD: compact_reopen_failure_does_not_silently_lose_later_commits drives a compact() with a cfg(test) reopen-failure injection (points self.file at a throwaway inode, the real "orphaned fd"), then commits another entry and asserts it survives recovery. RED before the guard (entry 999 missing after reopen); GREEN after. Full persistence::manifest module green on both runtimes. author: Tin Dang <tin.dang@trustifytechnology.com>

Self-review of PR #136 (BUG #2, MEDIUM). The tokio per-shard AOF writer (per_shard_aof_writer_task) wrote each record as [u64 lsn][u32 len][payload] by issuing the header write_all then the payload write_all, logging+`continue` on either failure — with NO write_error latch. The single-file writer (aof.rs ~:1467) and the monoio per-shard writer (~:2125) both carry that latch; the tokio per-shard path did not. A torn write (header lands, payload fails) left an orphaned 12-byte header; the NEXT Append wrote a fresh header right after it, so the framed reader on replay misread the orphan's `len` and consumed the following record's bytes as payload — corrupting every record after the tear. Pre-existing since the per-shard foundation (#129, 8c49ec2), but this branch's 8853ade makes the tokio per-shard writer live for the first time, so the latent bug is now reachable. Fix: mirror the monoio latch. A `write_error` flag is set on any header/payload/ flush/sync failure; once set, Append drops silently (fire-and-forget) and AppendSync replies WriteFailed (so the caller errors instead of hanging to the F2 fsync timeout or ack'ing a write into a corrupt stream) — never emitting another record. The writer stays latched for its lifetime; recovery replays the clean prefix and a rewrite starts a fresh file. Zero behavior change on the happy path (flag never set). TDD: tokio_per_shard_writer_latches_after_torn_write drives the real writer task against a real PerShard manifest, injects a torn write on the 2nd Append via a cfg(test) atomic (TEST_FAIL_WRITE_AT — not an env var; edition-2024 set_var is unsafe), and asserts (a) a following AppendSync returns WriteFailed and (b) the on-disk incr replays as exactly the pre-tear record with no corruption. RED before the latch (AppendSync acked Synced; stream corrupt); GREEN after. Full persistence::aof module green on both runtimes; fmt + clippy clean. author: Tin Dang <tin.dang@trustifytechnology.com>

pilotspacex-byte · 2026-06-03T16:56:00Z

Self-review (20 commits) — 2 MEDIUM durability bugs found + fixed

Ran a 3-way adversarial review of the whole branch. The headline multishard
RAM/serving fix verified sound. Two MEDIUM durability holes in the bundled
AOF/disk-offload work were found, hand-confirmed, and fixed here via red/green TDD:

bfe8b4f — manifest compact reopen-failure stale handle (in-scope to f5e17ac).
compact() renamed a fresh manifest durably, then reopened self.file. If the
reopen failed, the error was swallowed and the stale (orphaned-inode) handle was
reused → every later commit silently lost. Fix: needs_reopen flag; commit()
reattaches to the live file before writing, or fails loudly.
9f91381 — tokio per-shard AOF writer missing write_error latch (pre-existing
from fix(persistence): multi-shard AOF gate + per-shard AOF foundation (Option B step 1) #129, now reachable via 8853ade). A torn write (header lands, payload fails)
left an orphaned header; the next Append corrupted the frame stream on replay.
Fix: mirror the monoio/single-file latch — drop further Appends, reply WriteFailed
to AppendSync.

Each fix has a RED→GREEN regression test (RED verified by bypassing the check).
fmt + clippy clean on both feature sets; persistence::manifest+aof modules green
on monoio (53) and tokio (57).

Lower-severity follow-ups (NOT in this PR — tracked)

LOW-MED: apply_spill_completions failed manifest commit leaves a cold entry
unpersisted until the next commit (crash window).
LOW: panic in do_rewrite_per_shard (behind --experimental-per-shard-rewrite)
leaves AOF_REWRITE_IN_PROGRESS stuck + kills the shard writer.
LOW: multi-DB (SELECT >0) cold recovery writes to db0 only.

pilotspacex-byte · 2026-06-03T16:57:57Z

Filed the three LOW-severity follow-ups: #137 (cold-manifest failed-commit window), #138 (per-shard rewrite panic wedges flag + writer), #139 (multi-DB cold recovery restores db0 only).

…overflow DoS The precedence-climbing recursive-descent expression grammar pushes ~9 stack frames per source nesting level (parse_expression -> or -> and -> not -> comparison -> addition -> multiplication -> unary -> primary -> recurse), but check_depth() only counts one source level per recursion. With the previous limit of 64, a pathologically nested query descended ~64 levels (~576 debug frames) BEFORE the guard could fire at level 65, overflowing the stack. Cypher is parsed on async worker threads (tokio/monoio default to 2 MiB stacks, not the 8 MiB main thread), so this overflow aborted the whole process with SIGABRT instead of returning NestingDepthExceeded — a real remote DoS, and a red CI "Check" job (test_nesting_depth_exceeded overflowed its stack). Introduce DEFAULT_MAX_NESTING_DEPTH = 32, threaded through parse_cypher, the test harness, and the public re-export. 32 levels keeps the worst- case descent to ~288 debug frames (well under 2 MiB, ~50% margin vs the overflowing 64) while still allowing far deeper nesting than any real query needs. Deep nesting now returns NestingDepthExceeded gracefully, never aborts. Verified in OrbStack VM (cgroup-capped): test_nesting_depth_exceeded and test_nesting_within_limit both pass, full graph:: module 368/368 green, fmt + clippy clean under --features graph and --no-default-features --features runtime-tokio,jemalloc,graph. author: Tin Dang <tin.dang@trustifytechnology.com>

…pher depth bound Document the user-facing behavior changes shipped in PR #136: - CLAUDE.md "Environment Variables": add `MOON_URING=1` (opt into the tokio->io_uring bridge, now default-off because it floods errors under load) and clarify that `MOON_NO_URING=1` force-disables io_uring on both the monoio runtime and the tokio bridge. - CHANGELOG.md [0.2.0-alpha]: new "Fixed" subsection for PR #136 covering the three multishard RAM-zombie root causes (PageCache eager pre-alloc, tokio listener SO_REUSEPORT bind-race, io_uring-under-tokio default-off), the two in-PR durability TDD fixes (manifest compact-reopen latch, tokio per-shard AOF write_error latch), the Cypher nesting-depth bound, and the three low-severity follow-ups (#137/#138/#139). author: Tin Dang <tin.dang@trustifytechnology.com>

…ration spawn_tokio_connection, spawn_monoio_connection and spawn_migrated_monoio_ connection all thread the disk-offload spill context (spill_sender / spill_file_id / disk_offload_dir) into the ConnectionContext, but spawn_migrated_tokio_connection built it with None / Cell(0) / None. After an affinity migration the tokio connection therefore fell back to the non-spilling eviction path even when disk-offload was enabled, so cold-tier durability silently depended on whether the connection had migrated. CodeRabbit flagged this on PR #136 (review 4420584384); validated against the merged code (issue #141). Fix: thread spill_sender / spill_file_id / disk_offload_dir through spawn_migrated_tokio_connection and pass them to ConnectionContext::new, mirroring the three sibling spawn paths. Updated both call sites in event_loop.rs. Validation: this is mechanical parity with three already-correct sibling functions in the same file. A genuine red/green unit test is disproportionate here — ConnectionContext::new takes 26 args and is consumed inside a spawned task, so asserting the migrated context's spill fields would require stubbing the entire context or simulating a live cross-shard connection migration. Verified instead by: clean build + clippy under both default (monoio) and --no-default-features --features runtime-tokio,jemalloc, and structural parity (all four spawn paths now pass the spill context). The behavioural guarantee is covered by the existing disk-offload read-through integration suite. author: Tin Dang <tin.dang@trustifytechnology.com>

…ard accept run_embedded() diverged from the production main.rs startup path on two functional points (CodeRabbit PR #136 review 4420584384, issue #142): 1. Recovery only ran when `persistence_dir` was set, so `--appendonly no` + disk-offload (persistence_dir = None, disk_offload_base = Some) skipped `Shard::restore_from_persistence` entirely — cold-tier keys were unrecoverable after an embedded restart. Now gated on `should_run_recovery(persistence_dir, disk_offload_enabled)` = persistence OR disk-offload, recovering from `config.dir` when persistence_dir is None (mirrors main.rs:702-704). 2. Embedded forced `per_shard_accept = cfg!(target_os = "linux")`, re-enabling the tokio/Linux per-shard accept bind-order race that PR #136 disabled in main.rs. Now `false` (central accept), matching main.rs. Validated in OrbStack VM (cgroup-capped), red->green TDD under --no-default-features --features runtime-tokio,jemalloc (embedded is runtime-tokio-only): recovery_runs_for_disk_offload_without_persistence (RED with the disk-offload term dropped -> GREEN). fmt + clippy clean under both default (monoio) and tokio,jemalloc. Note: a full embedded+disk-offload+restart integration test was not added — there is no run_embedded test harness to extend and standing it up for a 2-line parity fix is disproportionate; the behavioural recovery path is the same one covered by tests/crash_recovery_disk_offload_no_aof.rs on main.rs. The extracted should_run_recovery predicate unit-tests the exact gate that regressed. author: Tin Dang <tin.dang@trustifytechnology.com>

…tway try_send_rewrite_per_shard fans the RewritePerShard message out to every shard writer with a blocking send. If a writer channel is disconnected the send fails and the function returned Err immediately — but the writers that ALREADY received the message decrement the PerShardRewriteCoord countdown when they fold, while the failed writer and the unsent tail never do. The countdown therefore never reaches zero, no shard runs the terminal branch, and AOF_REWRITE_IN_PROGRESS (set by bgrewriteaof_start_sharded before dispatch) is wedged true forever — every future BGREWRITEAOF is silently rejected for the lifetime of the process. Same wedged-flag family CodeRabbit flagged on PR #136 review 4420584384 (related to #138). Fix: on send failure, call coord.mark_failed() then coord.shard_done() once for each shard from the failing index through the tail (the failed shard plus every unsent one). The already-dispatched writers still decrement on fold, so the countdown reaches zero exactly once; mark_failed() is published before the tail decrements, so whichever decrement is terminal takes the abort path — keeps old_seq authoritative (no manifest commit) and clears AOF_REWRITE_IN_PROGRESS. The error log now names the offending writer index. Validated in OrbStack VM (cgroup-capped), red/green TDD: rewrite_fan_out_partial_failure_clears_in_progress_flag disconnects shard-0's writer, asserts the fan-out returns Err AND the in-progress flag is cleared (RED before the accounting fix — flag stayed true; GREEN after). fmt + clippy clean under both default (monoio) and tokio,jemalloc. author: Tin Dang <tin.dang@trustifytechnology.com>

…n on abort A per-shard BGREWRITEAOF reopens each shard's append file onto the NEW generation in phase 6 (do_rewrite_per_shard) BEFORE the coordinator decides whether the rewrite commits. If the rewrite then aborts — another shard failed to fold, or the final manifest commit errored — the manifest keeps old_seq and the abort path prunes the new-seq files, but the already-folded writers keep appending into the now-discarded new-seq incr. Recovery resolves base/incr by the committed seq (old_seq), so every post-abort write is silently lost. The old code acknowledged this with a "RESTART recommended" log; a server that is not restarted loses data. CodeRabbit flagged this on PR #136 review 4420584384. Fix — barrier-before-resume. PerShardRewriteCoord now publishes the committed generation exactly once from the terminal shard_done (new_seq on success, old_seq on abort/commit-failure) via a parking_lot Mutex<Option<u64>> + Condvar. After phase 7 each folded writer blocks in await_outcome and, if the committed seq != new_seq, reopens its append file onto the committed seq's incr (shard_incr_path_seq) — so it resumes appending into the authoritative generation with no restart and no loss. On the happy path committed == new_seq and the barrier is a no-op beyond unblocking. The barrier turns "every shard calls shard_done exactly once" into a LIVENESS requirement: a writer that decrements then blocks hangs forever if another shard never decrements. The one path that skips shard_done is a panic mid-fold (save_snapshot_to_bytes OOM-unwinding — issue #138), which under the default panic=unwind would now hang every healthy shard's AOF writer thread. Closed with a ShardDoneGuard: the fold creates it on entry and calls complete() on success (clean shard_done); on any ?-error or panic-unwind its Drop fires mark_failed + shard_done, so the countdown always closes, the terminal writer always publishes, and every waiter wakes and rolls back. The guard now owns the single decrement for ALL exits, so both writer-task error arms (monoio + tokio) no longer decrement after do_rewrite_per_shard (was a double-decrement risk). This incidentally closes the #138 panic-wedge. Each per-shard writer runs on its own dedicated OS thread (main.rs spawns aof-writer-{sid} → block_on_local), so blocking one at the barrier never starves the thread that must publish — no deadlock on either runtime. Validated in OrbStack VM (cgroup-capped), red/green TDD: - rewrite_abort_reopens_writer_onto_committed_old_generation: drives one shard's real fold with a second shard pre-failed; asserts the new-gen incr is pruned AND post-abort appends through *file land in the committed old-gen incr (RED before phase 8 — writes went to the pruned inode; GREEN after). - rewrite_abort_wakes_waiter_cross_thread: real condvar wait/notify across threads observes old_seq. - rewrite_fold_panic_releases_barrier_for_other_writers: a ShardDoneGuard dropped during catch_unwind releases a blocked waiter with old_seq (hangs forever without the guard). No regression: per-shard AOF SIGKILL recovery (crash_matrix_per_shard_aof, 3) and per-shard BGREWRITEAOF crash recovery (crash_matrix_per_shard_bgrewriteaof, 2) stay green. fmt + clippy clean under both default (monoio) and tokio,jemalloc. author: Tin Dang <tin.dang@trustifytechnology.com>

During a BGREWRITEAOF drain, both drain_pending_appends (legacy/TopLevel, default --shards 1 path) and drain_pending_appends_framed (per-shard) wrote the appended bytes and immediately acked AofAck::Synced — BEFORE the post-drain boundary sync_data(). Under appendfsync=always a client received +OK (durable) for an entry that a crash in the ack→fsync window would lose. CodeRabbit flagged this on PR #136 review 4420584384 (issue #140); the always-fsync durability contract requires the bytes to be on disk before the client is told Synced. Fix: park the AppendSync ack senders in DrainOutcome.pending_acks instead of acking inside the drain. New sync_and_fulfill_drain() performs the boundary fsync and only then resolves the parked acks — Synced on success, or FsyncFailed plus a propagated IO error on failure (never a false Synced). All six drain sites across do_rewrite_per_shard, do_rewrite_single and do_rewrite_sharded route through it. A write error inside the drain still drops the parked acks with the outcome (RecvError → hard failure), unchanged. Validated in OrbStack VM (cgroup-capped), red/green TDD: - drain_framed_parks_appendsync_ack_until_boundary_fsync: asserts the drain PARKS the ack (pending_acks.len()==1, not sent) and that it resolves Synced only after sync_data() (RED before: no pending_acks field / acked in drain). - drain_single_fulfills_fsync_failure_as_fsync_failed: the DEFAULT --shards 1 non-framed path must resolve FsyncFailed (never Synced) when the boundary fsync fails. No regression: full aof lib suite (66) green; bgrewriteaof TTL emission tests green; per-shard AOF SIGKILL + BGREWRITEAOF crash-recovery suites green. fmt + clippy clean under both default (monoio) and tokio,jemalloc. Note: tests/aof_fsync_err_subscribe_ordering single-shard variant (an #[ignore]d normal-append-path test, multi-shard twin passes) fails identically on HEAD without this change — pre-existing and unrelated to the rewrite-drain path touched here. author: Tin Dang <tin.dang@trustifytechnology.com>

…00-line cap src/persistence/aof_manifest.rs had grown to 3058 lines — past the 1500-line cap in CLAUDE.md. CodeRabbit flagged this on PR #136 review 4420584384 (issue #143). Converted the file into a directory module and relocated cohesive chunks, preserving every public path via re-export. Layout (all three files now under the cap): - aof_manifest/mod.rs (1395): AofManifest state + manifest-level I/O (paths, initialize/load, parse_v1/v2, cleanup_orphans, write_manifest, advance, migrate, verify, global_max_lsn) + manifest-core unit tests. - aof_manifest/shard_replay.rs (1159): the multi-part / per-shard replay free functions (replay_multi_part, replay_incr_resp, replay_incr_framed, replay_per_shard, replay_ordered_merge) + OrderedEntry + the replay unit tests. - aof_manifest/shard_rewrite.rs (577): the per-shard rewrite inherent methods (initialize_multi, try_initialize_multi, advance_shard, prune_shard_files) as `impl AofManifest` + the shard-rewrite unit tests. Mechanics: - `use super::*` in each submodule pulls the parent's types, std/tracing imports, consts, and private helpers down (a child module sees the parent's private items), so no import duplication beyond the test modules. - Relocated public replay API is re-exported from mod.rs (`pub use shard_replay::{OrderedEntry, replay_multi_part, replay_ordered_merge, replay_per_shard}`) so external callers (`crate::persistence::aof_manifest::replay_*`) keep resolving unchanged. - Per CLAUDE.md the cap and the "tests stay in mod.rs" convention can't both hold for a 3058-line file (tests alone are ~1100 lines); tests were moved next to the code they cover, each submodule getting its own temp_dir() copy with a distinct directory prefix (replay-/rewrite-) so the independent static counters never collide. No behaviour change. Validated in OrbStack VM: 30 aof_manifest unit tests green; per-shard AOF SIGKILL recovery (3) and per-shard BGREWRITEAOF crash recovery (2) green (exercise the relocated replay_per_shard / advance_shard / initialize_multi via recovery). fmt + clippy clean under both default (monoio) and tokio,jemalloc. author: Tin Dang <tin.dang@trustifytechnology.com>

…ion entry CI Lint gate requires a CHANGELOG.md entry (or the skip-changelog label) per PR. Document the PR #144 scope under [0.2.0-alpha] Unreleased: the 8 CodeRabbit PR #136 durability follow-ups, the two PR #144-review Majors (async-spill eviction regression test, cfg(unix) migration gate), the aof_manifest.rs and aof.rs decompositions, and the VECTOR_INDEXES RwLock test-isolation fix. author: Tin Dang <tin.dang@trustifytechnology.com>

fix(persistence): close 8 CodeRabbit PR #136 durability findings + aof_manifest split

TinDang97 and others added 20 commits June 2, 2026 13:38

qodo-code-review Bot reviewed Jun 3, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 3, 2026

View reviewed changes

tindangtts added 2 commits June 3, 2026 23:55

This was referenced Jun 3, 2026

disk-offload: failed cold-manifest commit can lose cold keys on crash (apply_spill_completions) #137

Open

per-shard BGREWRITEAOF: panic in do_rewrite_per_shard wedges rewrite flag + kills shard AOF writer #138

Open

pilotspacex-byte mentioned this pull request Jun 3, 2026

disk-offload: cold recovery only restores db0 (SELECT >0 keys lost after restart) #139

Open

tindangtts added 2 commits June 4, 2026 09:44

pilotspacex-byte merged commit ea25177 into main Jun 4, 2026
8 checks passed

pilotspacex-byte mentioned this pull request Jun 4, 2026

fix(persistence): close 8 CodeRabbit PR #136 durability findings + aof_manifest split #144

Merged

5 tasks

TinDang97 added a commit that referenced this pull request Jun 4, 2026

Merge pull request #144 from pilotspace/fix/coderabbit-pr136-durability

d5e796d

fix(persistence): close 8 CodeRabbit PR #136 durability findings + aof_manifest split

Conversation

pilotspacex-byte commented Jun 3, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

The "zombie eating RAM in multishard" report — three independent root causes

Benchmark — multishard + disk-offload, both runtimes

Also in this branch (PR #129 port + disk-offload durability)

Validation

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Changes

Uh oh!

qodo-code-review Bot commented Jun 3, 2026

Review Summary by Qodo

Walkthroughs

File Changes

Uh oh!

qodo-code-review Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review by Qodo

Uh oh!

qodo-code-review Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

qodo-code-review Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

qodo-code-review Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

qodo-code-review Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

pilotspacex-byte commented Jun 3, 2026

Self-review (20 commits) — 2 MEDIUM durability bugs found + fixed

Lower-severity follow-ups (NOT in this PR — tracked)

Uh oh!

pilotspacex-byte commented Jun 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

pilotspacex-byte commented Jun 3, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 3, 2026 •

edited

Loading

qodo-code-review Bot commented Jun 3, 2026 •

edited

Loading