Skip to content

fix(shard): stable multishard + disk-offload on both runtimes; kill the multishard RAM zombie#136

Merged
pilotspacex-byte merged 24 commits into
mainfrom
fix/pr129-port-onto-main
Jun 4, 2026
Merged

fix(shard): stable multishard + disk-offload on both runtimes; kill the multishard RAM zombie#136
pilotspacex-byte merged 24 commits into
mainfrom
fix/pr129-port-onto-main

Conversation

@pilotspacex-byte

@pilotspacex-byte pilotspacex-byte commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Summary

Makes Moon stable in multishard mode on both runtimes (monoio + tokio) with disk-offload, and eliminates the long-standing "zombie eating RAM in multishard" symptom. Ports the PR #129 per-shard AOF work onto main, lands the disk-offload durability fixes, and closes the multishard serving + bounded-RAM gaps with benchmark evidence on both runtimes.

The "zombie eating RAM in multishard" report — three independent root causes

  1. Serving hang (tokio/Linux). The central tokio listener plain-bound the port (no SO_REUSEPORT) while each shard binds its own SO_REUSEPORT listener — incompatible sockets on one port, resolved by a bind-order race. Either way PING was never answered (port up, shards parked, nothing dispatched — the "zombie" signature). Fix: the central listener also binds SO_REUSEPORT, so it coexists with the shard sockets regardless of order. monoio was always healthy.
  2. io_uring under tokio. tokio's experimental io_uring bridge floods Unknown io_uring event type: 0 and drops connections under load. Fix: default-OFF for tokio (MOON_URING=1 to opt in; MOON_NO_URING still force-disables and stays the CI default). io_uring remains in monoio, the production runtime.
  3. PageCache eager pre-alloc (the actual RAM zombie). --disk-offload defaults to enable, so each shard built a PageCache sized to 25% of the whole maxmemory and eagerly committed num_frames × PAGE zeroed bytes at construction. With N shards that is N × 25% × maxmemory pre-committed before serving a command; with --maxmemory unset the guardrail sets it to 80% of host RAM → a 4-shard default server reserved ~80% of host RAM instantly (measured 3791 MB idle RSS, zero data). Fix: (a) lazy buffers — frames start empty, grow to a full page on first fetch_page miss; (b) divide the budget across shards; (c) startup WARN when an explicit --pagecache-size exceeds 50% of maxmemory. Startup RSS 3791 MB → 29 MB.

Benchmark — multishard + disk-offload, both runtimes

OrbStack Linux aarch64, 4 shards, --maxmemory 64MiB --maxmemory-policy allkeys-lru --disk-offload enable, 300k SET + 300k GET over a 200k keyspace (forces spill + cold read-through), p=16, all server runs cgroup-capped to protect the host:

monoio (production) tokio (portability)
Serves PING
SET p=16 325,027 req/s 114,068 req/s
GET p=16 2,343,750 req/s 246,305 req/s
RSS under load 45 MB (flat) 45–48 MB (flat)
Spilled to disk 28.5 MB 28.6 MB

RSS stays flat through 600k ops on both runtimes — bounded RAM, working spill/read-through, no zombie.

Also in this branch (PR #129 port + disk-offload durability)

  • Per-shard AOF on the tokio runtime (multi-part), F2/F3/F5/F6 hardening, per-shard BGREWRITEAOF fan-out compaction.
  • Disk-offload cold-tier durability: per-file liveness (prevents batch-deletion data loss), crash-atomic manifest overflow (uncaps >70-file recovery), cold read-through survives crash recovery, cold manifest crash-durable without AOF (chore(deps): bump socket2 from 0.5.10 to 0.6.3 #22), lowered default orphan-sweep interval (gated behind the data-loss fix).
  • Whole-instance maxmemory guardrail (G1/G2).

Validation

  • Final tree compiles on both runtimes; new non-ignored tests/multishard_serve_smoke.rs passes 3/3 on monoio and tokio.
  • PageCache lazy-buffer + per-shard-budget unit tests (RED before, GREEN after); startup RSS 3791 MB → 29 MB.
  • chore(deps): bump socket2 from 0.5.10 to 0.6.3 #22 cold read-through 200/200 on both runtimes.
  • cargo fmt --check clean; cargo clippy -- -D warnings clean on default and runtime-tokio,jemalloc feature sets.

Test plan

  • cargo test (monoio) full suite green
  • cargo test --no-default-features --features runtime-tokio,jemalloc full suite green
  • Multishard serve smoke, both runtimes
  • Disk-offload multishard bench, both runtimes (table above)
  • Startup RSS regression check (3791 MB → 29 MB)

Summary by CodeRabbit

  • New Features

    • Per-shard maxmemory enforcement: multishard deployments treat --maxmemory as a whole-instance cap, with per-shard budgets.
    • Per-shard BGREWRITEAOF support (opt-in flag) and configurable AOF fsync timeout.
    • Automatic memory guardrail applied when --maxmemory is omitted (~80% default).
  • Improvements

    • Batched cold-storage spill writes, multi-page spill support, and stronger cold-recovery.
    • Bounded cross-shard dispatch/backpressure handling and safer eviction spill flows.
  • Documentation

    • Quick start updated to clarify maxmemory sharding, reporting, and guardrail behavior.

TinDang97 and others added 20 commits June 2, 2026 13:38
The originally-reported "zombie RAM in multishard" is, at root, unbounded
keyspace growth under the default `--maxmemory 0` (unlimited) + `noeviction`
policy: every distinct key adds memory that is never reclaimed, so a write
loop with unique keys grows RSS until the OOM killer fires. This is data,
not a leak — but the out-of-the-box default offered no protection.

G1 adds an opt-out-by-default guardrail: when `--maxmemory` is omitted, Moon
auto-caps at ~80% of the detected memory limit and switches a `noeviction`
policy to `allkeys-lru`, logging a one-line startup notice. This bounds
memory by default while staying fully Redis-compatible for operators who
want the old behavior.

`--maxmemory` is now `Option<usize>` to distinguish three intents:
- omitted (None)      -> auto-guardrail (80% + allkeys-lru + notice)
- `--maxmemory 0`     -> explicitly UNLIMITED (Redis escape hatch, preserved)
- `--maxmemory N`     -> exact cap, honored verbatim (no guardrail)

Resolution happens once at startup (`apply_memory_guardrail`), writing the
resolved value into `RuntimeConfig` (still `usize`, `0 = unlimited`) so ALL
downstream code (eviction, persistence_tick, CONFIG, pagecache) is untouched.

Detection is cgroup-aware and dependency-free — NO new `unsafe`, NO new
crates (safe `/proc` + `/sys` reads only):
- Linux: min(cgroup limit v2 `memory.max` / v1 `memory.limit_in_bytes`,
  host `/proc/meminfo` MemTotal). The min() matters: 80% of HOST RAM inside
  a container with a small cgroup cap would give zero real protection — the
  exact multishard/container scenario that motivated this.
- Non-Linux (dev): detection skipped, left UNLIMITED with a warning to set
  `--maxmemory` explicitly (production targets Linux per platform policy).

Wired at both entry points (main.rs binary + embedded.rs library) via the
shared `log_memory_guardrail` notice helper.

Tests (red/green): pure parsing (`/proc/meminfo`, cgroup v1/v2 sentinels) +
pure resolution (percent math, policy flip, evicting-policy preserved,
explicit-honored, skip-on-no-detection) + Option-type config parsing incl.
the `--maxmemory 0` = unlimited escape hatch. 47 config tests pass.

Verified live on Linux (OrbStack): no-flag -> cap = 80% of MemTotal +
allkeys-lru + notice; `--maxmemory 0` -> unlimited, no notice;
`--maxmemory N` -> exact. Build + clippy clean both runtimes; fmt clean.

author: Tin Dang
…s (G2)

The reported "zombie eats RAM in multishard mode" reproduces empirically:
with an identical `--maxmemory 100MiB --maxmemory-policy allkeys-lru`, a
4-shard server retained ~2.5x the RAM and ~2.4x the keys of a 1-shard server
(307MB / 768,800 keys vs 124MB / 321,649). Root cause: every write-path
eviction (`try_evict_if_needed`) compared a SINGLE shard's memory against the
FULL `maxmemory`. Each shard is shared-nothing and enforces independently, so
an N-shard server tolerated ~Nx `maxmemory` before any shard evicted — the box
OOM-kills before the cap ever bites. The server-wide atomic that could give a
true aggregate (`ShardDatabases::read_memory_sum` / `memory_per_shard`) is dead
code: no write path publishes to it ("Phase 2/3" wiring never landed), so it
always returns 0.

Per user decision (2026-06-02), `maxmemory` is now a true WHOLE-INSTANCE cap
for both the G1 auto-guardrail and explicit `--maxmemory`, with division at
ENFORCEMENT (Redis-compatible: CONFIG GET / INFO still report the whole-instance
value verbatim).

Implementation (no hot-path division, no new unsafe, no new deps):
- `RuntimeConfig` gains `num_shards` (default 1) + `maxmemory_per_shard()`:
  `maxmemory.div_ceil(num_shards.max(1))`, returning 0 (unlimited) iff
  `maxmemory == 0`. `div_ceil` keeps summed per-shard budgets >= the cap;
  `max(1)` guards a mis-set count.
- Eviction threshold comparisons switch from `config.maxmemory` to the
  per-shard budget in `storage/eviction.rs` (`try_evict_if_needed_with_spill_
  and_total`, the async-spill variant, and `try_evict_deferred`) and in
  `shard/persistence_tick.rs` (`should_run_pressure_cascade` threshold +
  the step-3 KV-eviction gate). The `== 0` unlimited gates and all reporting
  (CONFIG GET / SET / INFO / MEMORY) are untouched.
- The server publishes the resolved shard count onto the SHARED `RuntimeConfig`
  (the instance every shard's eviction reads) at startup in `main.rs` and
  `server/embedded.rs`. Single-shard keeps `num_shards == 1` => no division =>
  identical prior behavior.

Behavior change (accepted): existing explicit multishard `--maxmemory N` configs
now bound aggregate RSS (effective ceiling lowers ~Nx) instead of per-shard.

Operator visibility (this change is no longer silent): a startup notice
(`config::log_maxmemory_sharding`, emitted from `main.rs` + `server/embedded.rs`)
fires whenever `num_shards > 1` and `maxmemory > 0`, stating the resolved
per-shard eviction budget so an operator running e.g. `--maxmemory 8gb
--shards 4` sees the effective 8GB (not 32GB) ceiling at boot. CHANGELOG +
README document the whole-instance semantics. (Surfacing the per-shard budget
in `INFO memory` is a tracked follow-up — the INFO handler takes only
`&Database`, so it needs a dispatch-signature change across both runtimes.)

Known residual (tracked, not in scope): the connection-path eviction is per-DB,
so M actively-used logical DBs per shard still over-tolerate by ~M; the proper
aggregate fix (wire the per-shard memory publisher + switch eviction to
`read_memory_sum`) touches the hot write path and is a separate change.

Red/green TDD:
- RED: `per_shard_budget_divides_maxmemory_at_enforcement` FAILED pre-fix
  ("4 shards … got len 4" — 4 shards evicted nothing at budget == total).
- GREEN: passes post-fix; plus 5 pure `maxmemory_per_shard` unit tests. 12/12.

Empirical re-run (OrbStack Linux, same flood) — 4-shard DBSIZE collapses from
768,800 to 321,654 (≈ the 1-shard 321,649): retained keyspace no longer scales
with shard count. used_memory 307MB -> 197MB, VmRSS 300MB -> 193MB (-36%).
Single-shard unchanged. Residual gap is fixed per-shard structural overhead
(4 shard threads' arenas), bounded — not the prior unbounded keyspace growth.

author: Tin Dang
Per-shard AOF replay seeded `master_repl_offset` with the START LSN of the
last recovered entry instead of the next-free replication offset. Because
`ReplicationState::issue_lsn` returns the offset BEFORE adding the entry's
byte length, the first post-recovery write reissued the exact LSN already
on disk for the last pre-crash entry — breaking the `lsn -> entry`
uniqueness invariant the cross-shard backlog merge (RFC § 2 Rule 3) and
PSYNC continuity depend on. main.rs:829 seeds `seed_master_offset` directly
from this value, so the bug surfaces on every multi-shard AOF recovery.

`replay_incr_framed` now tracks `max(entry.lsn + entry.len)` — the byte
offset AFTER each entry — at BOTH the inline and ordered update sites, so
the returned `max_lsn` (and the manifest `global_max_lsn` derived from it)
is the next-free offset. `seed_master_offset` then leaves the master
strictly beyond every byte on disk.

Ported onto merged main (PR #129, commit 8c49ec2), whose `replay_incr_framed`
carried the identical start-LSN bug at aof_manifest.rs:1533 / :1581.

Tests (red/green verified on OrbStack Linux, monoio):
- new `replay_incr_framed_max_lsn_is_next_free_offset` regression test
  (RED: returned 114 = last start LSN; GREEN: 130 = entry end);
- updated the four sibling assertions that encoded the old start-LSN
  semantics to the next-free offset: truncated-header 3 -> 17,
  decodes_lsn_and_resp 11 -> 27, round_trips_two_shards 20 -> 49,
  parallel_matches_sequential 40 -> 69, buffers_ordered_entries 12 -> 35.
- full persistence::aof_manifest suite: 30 passed, 0 failed.

author: Tin Dang <tin.dang@trustifytechnology.com>
Design-for-failure: `appendfsync=always` awaited the writer's fsync ack
with no timeout. A stalled disk (slow device, full filesystem, hung
fsync) parked the write connection forever, holding its ~64KB buffers —
a zombie connection that never completes and never frees memory. Under
load this compounds into the unbounded-RAM symptom.

Fix: cap the ack await with a configurable bound. On elapse the write is
failed (FsyncFailed) and the caller returns an error frame instead of
hanging. The entry may still reach disk later, so durability is reported
as UNCONFIRMED — the caller must not respond +OK.

Changes:
- New `--aof-fsync-timeout-ms` flag (default 2000ms; 0 = legacy unbounded).
- `AofWriterPool` gains an `fsync_timeout: Duration` field, threaded
  through `top_level_with_policy` / `per_shard_with_policy`; the
  non-policy constructors default to DEFAULT_AOF_FSYNC_TIMEOUT (2000ms).
- `try_send_append_durable` Always-branch now awaits via `await_ack`,
  a runtime-agnostic helper: monoio uses `select! { rx, sleep }` (matching
  cluster::failover — monoio 0.2 has no time::timeout); tokio uses
  tokio::time::timeout. Outcomes map: ack->inspect AofAck, channel
  disconnect->WriteFailed, elapse->FsyncFailed.
- Wired the timeout into all production pool construction sites
  (main.rs per-shard + top-level, listener.rs, embedded.rs).

Ported onto merged main (PR #129, commit 8c49ec2). Reconciliation vs the
original F2 base:
- merge-kept main's AOF_BACKPRESSURE_DROPPED counter alongside the new
  AckOutcome enum (both retained).
- five pre-existing pool test sites that main added after the F2 base
  (aof.rs x3, spsc_handler.rs x2, handler_single.rs x1) now pass
  Duration::ZERO — exact legacy unbounded semantics, no behavior change.

Tests (red/green, tokio-gated — drive the runtime timer):
- always_fsync_times_out_when_writer_never_acks: held-but-undrained writer
  -> Err(FsyncFailed) within the bound, no hang.
- always_fsync_succeeds_when_writer_acks_in_time: drained+acked writer
  -> Ok(()).
- full persistence::aof::pool_tests suite: 17 passed, 0 failed (tokio).
The monoio path shares the proven select!+sleep shape and is covered
end-to-end by the per-shard crash matrix.

author: Tin Dang <tin.dang@trustifytechnology.com>
Design-for-failure: the cross-shard dispatch path retried a full target
SPSC ring with an UNBOUNDED `loop { try_push; sleep/yield }` and waited
for the reply with no cap and no shutdown check. When a target shard
stopped draining (saturated, wedged, or mid-shutdown), the producing
connection task spun forever — holding its read/write buffers and, on
monoio, piling wakers into the `pending_wakers` relay. That is the
multi-shard "zombie connection" RAM-growth vector from the investigation:
memory that never frees because the task never completes.

Fix — shared bounded helper `push_with_backpressure` (src/shard/dispatch.rs):
- Retries a full ring up to CROSS_SHARD_PUSH_MAX_RETRIES (5000) with a
  100µs backoff (~0.5s honoured → a few seconds at coarse timer
  granularity — wedged-shard detection scale, generous enough not to
  false-reject a merely saturated shard under load).
- Checks `shutdown.is_cancelled()` before every backoff so graceful
  shutdown can never hang on a wedged peer.
- On give-up the message was NEVER accepted → caller returns a clean
  reject (`-ERR cross-shard dispatch backpressure`); the command did not
  execute.
- Hot path unchanged: the happy case is a single try_push with no sleep
  and no allocation. `TimerImpl::sleep` (runtime-agnostic) only allocates
  on the contended retry path.

Wired into BOTH runtimes (same latent bug in each):
- monoio `handler_monoio`: push loop + the `pending_wakers` response-wait
  loop now bounded by CROSS_SHARD_RESPONSE_MAX_WAITS (~30s of ~1ms wakes)
  AND a shutdown check. Response timeout is an *uncertain write* (the
  batch was already dispatched, the target may have applied it), so its
  error explicitly says "write may have applied" — it must NOT imply
  rejection. Disconnect vs timeout vs shutdown now return distinct frames.
- tokio `handler_sharded`: same push loop replaced. On give-up the slot is
  simply never awaited — `slot_ptr` has no accounting side effect, so
  skipping `future_for` cannot leak ResponseSlotPool state. (The tokio
  response wait already awaits via AtomicWaker, so it needed no bound.)

Tests (red/green, tokio-gated — drive the runtime timer):
- push_backpressure_pushes_on_first_try_without_sleeping (happy path, no
  retry/sleep)
- push_backpressure_succeeds_after_transient_full (retries until drain)
- push_backpressure_gives_up_when_ring_never_drains (Backpressure within
  budget, exact attempt count, no hang)
- push_backpressure_aborts_on_shutdown (Cancelled after one attempt)
The monoio response-wait bound is correct-by-inspection (coupled to the
waker relay) and covered by the per-shard crash/E2E matrix.

Verified: build + clippy clean both runtimes; full tokio lib suite
2773/2773 pass; fmt clean.

author: Tin Dang
…t (F6)

Crash-safety: `advance_shard` (PerShard per-shard rewrite scaffolding)
deleted the old generation's base/incr files at step 3 — BEFORE the
manifest durably committed the new seq. Because recovery resolves
base/incr paths by the persisted `manifest.seq`, a crash between any
shard's advance and the coordinator's `write_manifest()` would leave the
manifest pointing at an old seq whose files are already gone → the base
RDB is missing on restart → data loss for every shard that had already
advanced. This is the inverse of `advance()` (TopLevel), which correctly
deletes only AFTER the manifest commit.

This is a latent bug: `advance_shard` currently has no production caller
(the per-shard BGREWRITEAOF fan-out that drives it is the next step), so
no shipped path regresses. Fixing it now establishes the correct
crash-safe ordering before the coordinator is wired.

Changes:
- `advance_shard`: remove the in-function delete (old step 3). It now
  writes the new base+incr and updates the shard's `max_lsn` only. The
  unused `old_seq` binding is dropped with it.
- New `prune_shard_files(shard_id, seq)`: best-effort deletion of a
  shard's base+incr for a given seq. The rewrite coordinator MUST call
  this only after `write_manifest()` has durably committed the new seq —
  the single commit point for the whole rewrite. Mirrors `advance()`'s
  post-commit deletion ordering.
- Doc comment on `advance_shard` now states the deferred-deletion
  contract and the crash window it closes.

Tests (red/green TDD, runtime-agnostic — pure manifest unit test):
- Renamed `advance_shard_writes_new_seq_and_deletes_old` →
  `advance_shard_defers_delete_until_after_commit`. It now pins the
  ordering invariant: after fan-out advance of BOTH shards, the new-seq
  files exist AND the old-seq files STILL exist (pre-commit); only after
  `seq = new; write_manifest(); prune_shard_files(...)` do the old files
  disappear while the new remain. Recovery (`load`) resolves to new seq.
  RED against the old delete-before-commit behavior (failed at the
  pre-commit "old base must survive" assertion); GREEN after the fix.
- Full `persistence::aof_manifest` module: 30 passed, 0 failed (monoio).

author: Tin Dang <tin.dang@trustifytechnology.com>
…oio)

Make multi-shard BGREWRITEAOF crash-safe and actually compact the per-shard
AOF, behind an experimental opt-in flag. Until now BGREWRITEAOF was GATED in
the PerShard layout (`--shards >= 2 + --appendonly yes`) because the naive
multi-shard rewrite lost ~38% of keys on restart (verified 2026-05-26). This
ships the correct per-shard fan-out coordinator and lets operators enable it
with `--experimental-per-shard-rewrite`; the default behavior is unchanged
(gate stays closed → the shipped "append-only, no in-place compaction" path).

Design — synchronized seq bump + single manifest commit across N writers:
- New `AofMessage::RewritePerShard { shard_dbs, coord }` is delivered to every
  per-shard writer at once.
- Each writer runs `do_rewrite_per_shard` (monoio, synchronous) for ITS shard,
  reusing the proven `do_rewrite_sharded` discipline scoped to one shard:
  drain queued appends into the OLD incr (framed) + fsync → lock the shard's
  dbs → re-drain → snapshot under lock → release → write new base+incr at the
  coordinator's `new_seq` via `advance_shard` (which does NOT bump the seq) →
  reopen the append file to the new incr.
- `PerShardRewriteCoord` holds a shared `Arc<Mutex<AofManifest>>` and an
  atomic countdown. The LAST writer to finish performs the single durable
  commit (`manifest.seq = new_seq; write_manifest()`), then prunes the old
  generation. Until that commit the on-disk manifest still resolves to the old
  seq, so a crash anywhere in the fold recovers the intact old generation
  (no loss, no double-apply).

Crash-safety:
- Non-idempotent INCR is exactly-once across the rewrite boundary: pre-snapshot
  INCRs end up only in the new base; post-snapshot INCRs only in the new incr.
- delete-after-commit ordering (the step-1 `advance_shard` fix) means a
  pre-commit crash never deletes files the manifest still references.
- Abort-on-any-failure: a shard whose fold errors calls `coord.mark_failed()`;
  the final writer then ABORTS the commit (keeps old seq) instead of advancing
  to a generation where a shard is missing its new base — which would break
  recovery. Old generation stays authoritative (crash-safe); a restart is
  recommended to resync the successful writers' file handles.

Design-for-failure:
- The rewrite message uses the BLOCKING `send` (not `try_send`): a dropped
  message would leave the countdown unable to reach zero, silently losing the
  folded writers' post-rewrite appends. Writers drain continuously on dedicated
  threads, so the block is sub-millisecond for a rare admin command.
- The manifest is loaded fresh from disk at rewrite time (normal appends never
  touch it; BGREWRITEAOF is CAS-serialized) — no startup/recovery reordering.

Runtime scope: monoio only this commit. The fold uses synchronous std::fs IO;
the tokio command handler refuses per-shard BGREWRITEAOF with a clear error so
no in-progress flag is left dangling. tokio enablement is the next step.

Wiring:
- `--experimental-per-shard-rewrite` config flag; main.rs leaves the
  MULTI_SHARD_AOF_REWRITE_UNSAFE gate OPEN when set.
- `AofWriterPool::per_shard_with_base_dir` records the persistence dir so
  `try_send_rewrite_per_shard` can load the manifest; `bgrewriteaof_start_sharded`
  routes PerShard pools to the per-shard fan-out.

Tests (red/green TDD):
- tests/crash_matrix_per_shard_bgrewriteaof.rs (monoio, --ignored):
  - STRADDLE: BGREWRITEAOF fired mid-INCR-stream + SIGKILL → both per-shard
    counters recover to EXACTLY N (drop → <N, double-apply → >N), with the
    seq>1 compacted base asserted present (rules out a silent no-op).
  - COMPOSE: INCR → rewrite → INCR → SIGKILL → counter == PRE+POST (base +
    post-rewrite incr compose exactly once).
  Both PASS on monoio (2 passed, 10.77s).
- Both runtimes compile clean; cargo fmt + clippy (default) green.

author: Tin Dang <tin.dang@trustifytechnology.com>
Record the load-bearing ordering fact behind per-shard BGREWRITEAOF
exactly-once, verified by tracing the live write path against the fold.

The fold (`do_rewrite_per_shard`) runs on the per-shard *writer* thread,
not the shard event-loop thread. Exactly-once across the rewrite boundary
rests on a single fact: the live write path enqueues each command's AOF
append INSIDE the same `RwLock<Database>` write guard under which it
mutated the db (`spsc_handler.rs` calls `wal_append_and_fanout` before
`drop(guard)`). Phase 2 of the fold acquires those SAME locks
(`all_shard_dbs()[sidx]` == `ShardDatabases::shards[sidx]`, the exact
`RwLock`s `write_db` locks), so RwLock mutual exclusion forces
`enqueue -> guard-release -> fold-acquire -> mid-drain`. Every INCR whose
mutation lands in the snapshot therefore had its append drained into the
OLD incr (pruned at commit), never replayed atop the new base. If the
append were enqueued AFTER the guard drop, the snapshot would capture the
mutation while its append still raced toward the NEW incr — double-apply.

Also documents the dependency on the RwLock store being live: the
thread-local `ShardSlice` fast path is dead code until Phase 4 wires
`init_shard` (`is_initialized()` is always false today). A future Phase 4
that makes ShardSlice live must revisit this fold, since the writer thread
cannot lock another thread's `!Send` `Rc<RefCell<Shard>>`.

Verification (this commit, monoio, release):
- tests/crash_matrix_per_shard_aof.rs (shipped gate-closed path):
  3 passed — no regression from the F6 shared-code changes (AofMessage
  enum, drain_pending_appends, per-shard writer match arms).
- tests/crash_matrix_per_shard_bgrewriteaof.rs (F6 new path):
  2 passed — STRADDLE + COMPOSE exact-count recovery.

Doc-only; no behavior change.

author: Tin Dang <tin.dang@trustifytechnology.com>
… gap

The per-shard BGREWRITEAOF exactly-once guarantee holds *absent append-channel
saturation during the fold*. Record the one window the crash matrix cannot
surface, so the claim is honest before the experimental path is exercised.

While `do_rewrite_per_shard` runs (phases 2-6, including the hundreds-of-ms
base-RDB write + fsync of phase 6), the per-shard writer is in the fold, not its
recv loop, so it is not draining the bounded `mpsc_bounded(10_000)` append
channel. Post-snapshot appends queue there for the new incr; the event loop
enqueues with `try_send_append` (drop-on-full, return ignored). Under sustained
concurrent writes on a large dataset the window can exceed 10_000 queued appends
and the overflow is silently dropped — lost even on a clean restart, which is
worse than the everysec contract.

The single-client crash matrix never pressures the channel (serialized
redis-cli, sub-ms), so it cannot catch this. The window is pre-existing: the
shipped `do_rewrite_sharded` has the identical non-draining gap — F6 did not
introduce it.

Per user decision (2026-06-02): qualify + document now (same posture as the
Rule-3 / always-fsync deferrals), behind the experimental flag; the fix
(drain-during-phase-6 or block-on-full for the rewrite's duration, plus a
load-based test) is a separate scoped task.

- aof.rs: add a "Known limitation — channel saturation during the fold" section
  to the `do_rewrite_per_shard` doc.
- tmp/F6-known-limitations.md: full write-up (L1 saturation + candidate fixes,
  L2 carry-over) and current monoio/tokio status.

Doc-only; no behavior change.

author: Tin Dang <tin.dang@trustifytechnology.com>
Make the multi-part PerShard AOF subsystem (manifest creation, replay, and
per-shard BGREWRITEAOF) work under runtime-tokio, not just runtime-monoio,
so multi-shard `--appendonly yes` deployments are crash-safe on both
runtimes. This is the first half of "multi-shard stable on monoio AND tokio".

Root cause of the previous tokio gap (not a fold bug):
- The tokio per_shard_aof_writer_task already supports multi-part: it polls
  AofManifest::load for up to 60s, then opens the framed incr file. Its
  on-disk framing ([u64 lsn LE][u32 len LE][RESP]) is byte-identical to the
  monoio writer and to drain_pending_appends_framed, so the shared,
  runtime-agnostic replay_per_shard reads it back unchanged.
- But the manifest creation + replay block in main.rs was gated entirely to
  #[cfg(runtime-monoio)]. Under tokio initialize_multi never ran, so the
  manifest the writers wait for never appeared — the writers timed out at 60s
  and exited, and try_send_rewrite_per_shard aborted with "manifest missing".
  Durability fell back to WAL-v3 only; AOF was effectively dead.

Why it was gated (the regression boundary, preserved): tokio --shards 1 uses
the legacy single-file appendonly.aof (v2) + in-place BGREWRITEAOF. Engaging
single-shard multi-part replay there finds an empty manifest with no rewritten
base RDB and wipes the v2-loaded state — the test_txn_commit_wal_crash_recovery
regression. That is a SINGLE-shard problem; the goal is MULTI-shard, a
different code path that writes real per-shard base/incr files.

Fix — decouple the gating instead of all-or-nothing by runtime:
- PerShard branches (num_shards >= 2): initialize_multi (fresh boot) and
  replay_per_shard (restart) now run under BOTH runtimes. initialize_multi
  writes a real seq=1 base RDB per shard, so incr-only recovery (writes then
  crash before any rewrite) has a valid base to replay onto.
- Single-shard branches (replay_multi_part, initialize_with_base, initialize)
  stay #[cfg(runtime-monoio)] with a tokio warn fallback that preserves
  today's exact behavior (v2 single-file recovery, no manifest creation). This
  keeps test_txn_commit_wal_crash_recovery green.
- TopLevel-manifest + multi-shard remains a hard refusal on both runtimes.

Also lands the tokio per-shard BGREWRITEAOF fold (was warn+abort): it reuses
the proven synchronous do_rewrite_per_shard verbatim via
tokio::fs::File::into_std() on the dedicated block_on_local writer thread, so
the exactly-once invariant carries over unchanged. The command handler
(bgrewriteaof_start_sharded) now routes PerShard pools to the fan-out
coordinator on both runtimes (the tokio refusal is removed). DrainOutcome,
drain_pending_appends_framed, and do_rewrite_per_shard cfg widened from
runtime-monoio to any(runtime-monoio, runtime-tokio).

BEHAVIOR CHANGE: a corrupt AOF manifest is now FATAL (refuse-to-start) on
tokio too — previously it was warn+continue. This is intentional: silently
ignoring a corrupt manifest risks overwriting the only reference to the base
RDB. Operators with a corrupt manifest must inspect it before deleting.

Validation:
- cargo check / clippy -D warnings / fmt --check: green on BOTH runtimes
  (default = runtime-monoio; tokio = --no-default-features --features
  runtime-tokio,jemalloc,graph,text-index).
- Crash-matrix + TXN regression validation under tokio follows in the next
  commit (release build + --ignored integration tests).

author: Tin Dang <tin.dang@trustifytechnology.com>
…hen idle

The tokio per-shard AOF writer could leave >1s of writes buffered in its
BufWriter and lose them on SIGKILL, violating the everysec durability
contract (≤1s loss). Surfaced by the F6 BGREWRITEAOF crash matrix COMPOSE
case (PRE=300, rewrite, POST=200, 1.5s quiesce, SIGKILL), which recovered
only ~450/500 instead of the exact 500 it asserts.

Root cause — the flush was on a starvable/unreliable select! arm:
- The writer loop used `_ = interval.tick(), if fsync == EverySec` as one
  arm of `tokio::select!` alongside `msg = rx.recv_async()`. Under
  sustained writes the always-ready recv arm repeatedly won select!'s
  random fairness, starving the interval arm, so the BufWriter accumulated
  well past 1s before flushing. When idle, the long-lived `interval` arm
  also proved unreliable on the dedicated current-thread writer runtime
  (each per-shard writer runs on its own std::thread via block_on_local).
  Either way the everysec deadline was not honored.

Fix — make the flush deadline independent of select! scheduling:
- Replace the bare `rx.recv_async()` arm with
  `tokio::time::timeout(200ms, rx.recv_async())` so the loop wakes at
  least every 200ms regardless of message traffic. flume's recv future is
  drop-safe on the Elapsed branch (no message consumed on timeout) and the
  Ok(Ok(msg)) path captures the message with no loss, so durability is not
  affected by the timeout wrapper.
- Move the EverySec flush check to run AFTER every wake (message OR
  timeout), outside the select! arms, so it is not subject to select!
  fairness and holds the 1s bound under sustained load as well as when
  idle. Delete the now-dead `interval` and its priming `tick()`.
- Always / AppendSync (synchronous fsync-before-ack) and the per-shard
  rewrite fold are untouched; the big message match stays in one place
  (lowest-risk change on durability-critical code). The fix is in the
  `#[cfg(runtime-tokio)]` writer block only — monoio is unaffected.

Also un-stale the single-shard TXN crash-recovery regression guard, which
could not compile on this branch (so it had been protecting nothing):
- tests/txn_kv_wiring.rs: reconcile the ServerConfig literal with the F2/F6
  fields added on this branch (experimental_per_shard_rewrite,
  aof_fsync_timeout_ms) and fix maxmemory `0` -> `Some(0)` (the field is
  Option<usize>). The file is #![cfg(feature = "runtime-tokio")], so the
  guard runs under tokio — the runtime where the decouple's single-shard
  warn-fallback (legacy v2 path) matters.

Validation (release binary + redis-cli, both runtimes):
- test_txn_commit_wal_crash_recovery (tokio): green — confirms the 8853ade
  decouple preserved single-shard recovery.
- crash_matrix_per_shard_bgrewriteaof COMPOSE (tokio): 3/3 exact 500/500 at
  the original 1500ms quiesce (was 475/463/450/449). STRADDLE: green.
- crash_matrix_per_shard_aof CRASH-01-LITE everysec + always +
  pipeline-no-double-write (tokio): 3/3 green.
- Same crash matrices under monoio (default features): green — confirms the
  tokio-only edit did not regress monoio.
- clippy -D warnings + fmt --check: clean on both runtimes.

tests/crash_matrix_per_shard_aof.rs: header doc updated — CRASH-01-LITE is
now validated on BOTH runtimes; documents the tokio invocation.

author: Tin Dang <tin.dang@trustifytechnology.com>
…anifest cap

Live disk-offload cold read-through silently broke past ~70 evicted KV per
shard. `Manifest::commit()` hard-errors above MAX_INLINE_ENTRIES (~70) because
the manifest root is a single 4KB double-buffered page with no overflow/B-tree.
The spill path wrote ONE DataFile + ONE manifest entry per evicted KV, so beyond
#70 every `apply_spill_completions` commit failed, orphaning the cold file
(on disk, unreferenced) → GET returned nil. Identical single/multishard,
tokio/monoio, small/large values — not value-size, multishard, or runtime
specific. Pre-change repro: 1537 "too many entries for inline root page"
warnings, live read-through 7/200, recovery rebuilt only 4 cold-index entries.

Fix: pack many evicted KVs into ONE multi-page DataFile per bg-thread flush, so
manifest entries scale with #files (cold-bytes/flush) not #keys.

- ColdLocation gains `page_idx: u32` → {file_id, page_idx, slot_idx}. ColdIndex
  is rebuilt-only (no on-disk serialization), so this is an in-memory + recovery
  -scan change with no format version bump.
- build_kv_spill_batch / write_kv_spill_batch (kv_spill.rs): greedy leaf packing,
  page_idx+1 on PageFull; atomic temp+rename write. Inline-only MVP (values
  > INLINE_MAX_VALUE_BYTES=3500 route to the existing single-file path).
- read_cold_entry (cold_read.rs): single-page pread at page_idx*PAGE_4K; only
  reads the whole file when the OVERFLOW flag is set.
- spill_thread.rs: buffered batch model (flush on 256-entry cap / 100ms tick /
  shutdown); per-FILE SpillCompletion carrying Vec<(key, db_index, page_idx,
  slot_idx)>.
- apply_spill_completions (persistence_tick.rs): ONE add_file+commit per file,
  then per-entry cold_index.insert.
- eviction.rs: drop the tentative slot-0 cold_index insert (slot 0 is now a
  different key's slot under batching); the ~100ms pre-flush GET miss is
  correct (key not yet on disk) and the AOF incr log is the durability backstop.
- event_loop.rs: the monoio write-path spill ctx used the BARE offload dir
  (<offload>) instead of <offload>/shard-{id}, so cold files were written where
  the reader's cold_shard_dir never looked. Align with every other path (WAL,
  control file, cascade, reader). This dir mismatch was masked pre-fix by the
  tentative slot-0 insert; removing it exposed the bug.

Validation (release, MONOIO — the default/primary runtime):
- LIVE disk-offload read-through: 7/200 → 200/200 (single-shard AND --shards 2),
  manifest-cap warnings 1537 → 0, cold files packed (~7-10, not ~1600).
- Rebuild substrate: rebuild_from_manifest 4 → 1607 entries.
- Hot-key crash recovery: 700/700 (BGREWRITEAOF + SIGKILL + restart), base RDB
  fold confirmed — no regression.
- New in-process TDD test `test_rebuild_from_manifest_roundtrip` proves the
  recovery (page_idx, slot_idx) scan reconstructs the same mapping the builder
  produced: 100/100 keys round-trip through cold_read_through.
- monoio: 21 tiered unit tests green; tokio: 46 tiered unit tests green,
  clippy -D warnings clean, fmt clean (both runtimes).

SCOPE — this fixes the manifest-cap for the MONOIO write-path. Two adjacent gaps
remain (characterized, tracked separately, NOT regressions of this change):

1. TOKIO live read-through still broken (7/200). Root cause: the tokio hot-path
   handler (handler_single.rs / handler_sharded) calls `try_evict_if_needed`
   (delete-only) on write, NOT `try_evict_if_needed_async_spill` — so evicted
   keys under tokio are deleted, never spilled to cold. Only the persistence-tick
   cascade spills (sparsely). monoio's handler_monoio correctly uses the
   async-spill variant + the event_loop.rs:586 dir fix. Wiring async-spill into
   the tokio handler is a separate handler-layer change.

2. Post-crash cold read-through on MONOIO unchanged: 41/200 (pre-change) vs
   42/200 (with-fix) — PRE-EXISTING, and NOT data loss. The AOF incr DOES replay
   (verified: "AOF incr replayed: 2000 commands" / "AOF multi-part loaded:
   2000 entries"); the v3 `cmds=0` is the WAL count, not the AOF — all data is
   durable. The gap is re-eviction/cold-read coordination on the recovery path
   (replay re-inserts → eviction re-evicts → cold reads of re-evicted keys miss).
   Curiously inverted under tokio (recovery 200/200, via v3 WAL cmds=2000). The 7
   changed files do not touch recovery.rs, main.rs wiring, restore_from_persistence,
   AOF replay, or string::get.

Refs: tmp/phaseB-fix-SUMMARY.md, tmp/phaseB-coldspill-rootcause.md,
tmp/phaseB-coldkey-crash-recovery.sh

author: Tin Dang <tin.dang@trustifytechnology.com>
… path

Live disk-offload read-through was broken under the tokio runtime: GET of
evicted keys returned nil for ~96% (7/200) while monoio returned 200/200.
Root cause was a handler-layer asymmetry, NOT a storage bug.

The production binary (main.rs) always runs the sharded architecture: each
shard owns an event loop (event_loop.rs) that binds its own SO_REUSEPORT
socket and spawns connections via `spawn_tokio_connection`. The per-shard
background SpillThread, its `flume::Sender<SpillRequest>`, the shared
`spill_file_id` counter, and the `<offload>/shard-{id}` directory all live in
that event loop and were threaded into the *monoio* connection
(`spawn_monoio_connection`) but NOT the *tokio* one. So the tokio ConnCtx was
built with `spill_sender = None`, and `handler_sharded`'s write path fell back
to `try_evict_if_needed` (DELETE-ONLY): evicted keys were dropped, never
spilled to the cold tier, so the cold read-through had nothing to resolve.
Only the periodic persistence-tick cascade spilled (sparsely).

Fix — mirror the monoio wiring (no new machinery; ConnCtx already carries the
three spill fields and `try_evict_if_needed_async_spill` already exists):

- conn_accept.rs: `spawn_tokio_connection` gains `spill_sender`,
  `spill_file_id`, `disk_offload_dir` params (mirroring
  `spawn_monoio_connection`) and passes them into `ConnectionContext::new`
  instead of the previous `None`/placeholder.
- event_loop.rs: both tokio `spawn_tokio_connection` call sites now pass
  `&spill_sender, &spill_file_id, &disk_offload_dir` (already in scope).
- handler_sharded/mod.rs: the `do_write` eviction now uses
  `try_evict_if_needed_async_spill(db, &rt, sender, dir, &mut fid,
  conn.selected_db)` when `ctx.spill_sender` is `Some`, else delete-only.

Safety / design notes:
- Tokio connections run via `tokio::task::spawn_local` on the per-shard
  event-loop thread (`block_on_local`), the same single-threaded model as
  monoio — so the shared `Rc<Cell<u64>>` spill_file_id is sound, and the
  hot-path get->spill->set has no `.await` inside it (atomic vs other tasks).
- Both the hot-path evictor and the persistence-tick cascade run under this
  shard's db write lock (`with_shard_db` / `write_db`), so they cannot race
  on the same key.
- The `next_file_id = next_file_id.max(spill_file_id.get())` reconciliation
  runs in the tokio select! periodic arm (event_loop.rs:1222), so the cascade
  stays consistent with the hot-path counter — fully symmetric with monoio.

Validation (release, red/green TDD):
- TOKIO live read-through: 7/200 (RED, fresh current-branch binary) -> 200/200
  (GREEN), stable across 3 consecutive e2e runs
  (tmp/phaseB-coldkey-crash-recovery.sh, 2000x500B, maxmemory 262144,
  allkeys-lru, --shards 1).
- MONOIO regression gate: live read-through stays 200/200 (handler_monoio and
  spawn_monoio_connection untouched; only shared signatures extended).
- storage::tiered unit tests: 46 pass on monoio, 46 pass on tokio.
- clippy -D warnings clean + fmt clean, both runtimes.

SCOPE — this closes the tokio LIVE read-through gap. Two adjacent items remain,
both pre-existing and tracked separately (NOT regressions of this change):

1. POST-CRASH cold read-through on MONOIO (B-2): unchanged at ~43/200. The
   spill thread's `file_id` counter (and the event loop's `spill_file_id`)
   start at 1 on every boot and are never seeded from the recovered manifest,
   so post-restart re-eviction atomically overwrites recovered `heap-NNN.mpf`
   files. Fix = centralize file_id assignment in the SpillThread and seed it
   from the recovered max (one coherent change). Tokio post-crash already
   reaches 200/200 when spill completions land before the kill (timing-
   dependent), via the v3 WAL replay path.

2. handler_single (run_with_shutdown / handle_connection) still uses delete-
   only eviction, but that path is test-harness/embedded only — main.rs never
   drives it — so it is out of scope for the production multi-shard goal.

Refs: tmp/phaseB-fix-SUMMARY.md,
investigation_coldspill_manifest_70_cap (memory)

author: Tin Dang <tin.dang@trustifytechnology.com>
After a SIGKILL + restart with disk-offload enabled, GET of a re-evicted
(cold) key returned nil for ~80% of keys on monoio — and intermittently on
tokio. The data was fully durable (AOF replayed all 2000 entries, cold index
rebuilt with the full set); the failure was purely on the cold READ path
after recovery. Two independent defects, both required to be correct:

1. Cold-tier wiring wiped by the rdb::load swap (the dominant cause).
   The multi-part AOF replay loads the base RDB via `rdb::load`, which builds
   fresh `Database::new()` temporaries and swaps them wholesale into the live
   databases (`*live = temp`). `cold_shard_dir` and the rebuilt `cold_index`
   are live-tier topology, NOT part of the RDB hot snapshot, so even a 0-key
   base RDB silently reset them to None. The handler then read a database with
   `cold_shard_dir = None`, so `cold_lookup_location` returned None for every
   key and all cold read-through missed.

   Fix: capture both fields per (shard, db) immediately before the AOF replay
   block (after restore rebuilt the index, before replay clobbers) and
   re-attach them immediately after. Confined to the startup recovery path on
   purpose — `rdb::load` also serves replica full-sync and DEBUG RELOAD, which
   load a foreign dataset whose values do not live in this node's cold files,
   so preserving the local index there would surface stale reads. In the
   recovery path the loaded base + replayed incrs are this node's own data, so
   the rebuilt index is authoritative.

2. Spill file_id counter restarted at 1 (the residual cause).
   On restart the counter restarted at 1, so post-restart re-eviction minted
   the same `heap-NNNNNN.mpf` names recovery had just loaded and atomically
   overwrote cold files the preserved cold_index still references — corrupting
   read-through of any not-yet-refreshed entry.

   Fix: `next_spill_file_id_seed` scans `<shard_dir>/data/heap-*.mpf` and
   seeds the counter at max(file_id)+1, so every future spill filename is
   strictly new and recovered cold files stay immutable until the steady-state
   cascade refreshes them. Scans physical files (not the manifest) so a spill
   that wrote its .mpf but crashed before the manifest commit still cannot be
   clobbered. Returns 1 when disk-offload is off / dir absent / no heap files,
   so a fresh server and the live hot path are byte-for-byte unchanged.
   Non-NotFound scan errors warn and fail safe to 1 rather than risk a misseed.

The two fixes are complementary: (1) re-wires the index across recovery,
(2) keeps the files that index points at immutable. Either alone is
insufficient — without (1) reads miss on shard_dir=None; without (2) the
preserved entries point at overwritten files.

Validation (red/green TDD):
- 3 unit tests for the seed helper (none-dir, fresh-server, resume-above-max
  with non-heap names ignored) — GREEN.
- e2e crash-recovery gate (2000x500B, maxmemory 256KB, allkeys-lru, SIGKILL,
  restart, GET first-200 cold read-through):
    monoio --shards 1: LIVE 200/200 + COLD recovery 200/200 (x5, stable)
    monoio --shards 2: LIVE 200/200 + COLD recovery 200/200 (x2)
    tokio  --shards 2: LIVE 200/200 + COLD recovery 200/200 (x2; both shards'
                       cold indexes 812+796 rebuilt and read-through working)
  RED baseline pre-fix was ~40/200 on monoio recovery.
- storage::eviction + storage::tiered: 61 unit tests pass.
- cargo fmt clean; cargo clippy --release (monoio) -D warnings clean; both
  runtimes build warning-free.

Harnesses: tmp/phaseB-coldkey-crash-recovery.sh (single-shard),
tmp/phaseB2-multishard-coldkey-recovery.sh (multi-shard, both runtimes).

author: Tin Dang <tin.dang@trustifytechnology.com>
…n data loss

The cold-tier orphan sweep deleted an entire batched spill file
(heap-NNNNNN.mpf, up to 256 KVs) whenever it found a SINGLE orphan key in
it, silently orphaning the file's co-located live keys. Once the sweep ran,
cold read-through collapsed from 200/200 to 88/200 — a benign-looking disk
number masked active data loss. This was only exposed by running the sweep
at a short interval; the default 300s interval never fired in earlier
~3-min benches, which is why the disk "leak" looked unbounded and the data
loss stayed hidden.

Root cause: ColdIndex tracked only key -> location, with no per-file
liveness, so sweep_known_orphans could not tell whether a file still backed
other live keys before unlinking it. The same blind spot leaked files
orphaned by re-eviction (insert overwriting a key's location to a new file):
no key references the old file, so the hot-cap-cold sweep never saw it.

Fix: add a reverse liveness index to ColdIndex.
- file_refs: HashMap<file_id, u32> counts live map entries per file.
  insert/remove maintain it; a file is queued for unlink (pending_unlink)
  only on a zero-ref transition. A file is deleted ONLY when its last live
  referrer is removed — never on a single orphan key while co-located keys
  remain.
- sweep_known_orphans now removes orphan ENTRIES, decrements refs, and
  drain_pending_unlink unlinks only zero-ref files (off the hot path). This
  also reclaims re-eviction orphans the hot-cap-cold sweep cannot see.
- merge / rebuild_from_manifest route through insert, so recovery rebuilds
  ref counts consistently.
- timers.rs sweep trigger also fires when has_pending_unlink(), since
  re-eviction orphans carry no hot-cap-cold key.

Unlink precedes the manifest commit, so disk is freed even when the commit
fails at the manifest 70-entry root cap (addressed separately for recovery).

Validation (red/green TDD):
- 3 new unit tests: co-located live key + file retained on single-orphan
  sweep; file deleted only when last ref removed; overwrite reclaims the
  orphaned old file. RED on prior code ("DATA LOSS: file ... was deleted").
- storage suite: 315 passed / 0 failed.
- e2e (s=4, 1M writes, --cold-orphan-sweep-interval-secs 2):
  read-through 88/200 -> 200/200; disk bounded at the true working-set
  plateau (~604 files / ~85 MB, flat rounds 6-20) vs the 300s-interval
  control's unbounded climb to 263 MB+.

author: Tin Dang <tin.dang@trustifytechnology.com>
…ry beyond 70 files/shard

The shard manifest stored all FileEntry records inline in a single 4 KB
root page, capped at MAX_INLINE_ENTRIES = 70. commit() rejected the 71st
entry outright ("too many entries for inline root page: N > 70"). Because
the post-compaction cold tier is recoverable ONLY via the manifest (the
BGREWRITEAOF base RDB holds hot keys only — verified: appendonly=yes +
BGREWRITEAOF recovers cold keys purely through rebuild_from_manifest), the
70-entry cap is a DURABILITY ceiling, not a perf limit: ~70 heap files ≈
9 MB cold/shard, which every real disk-offload deployment exceeds
immediately (Phase C at 1M keys hit "495 > 70"). Past the cap, files 71+
were never recorded, so a crash + restart silently lost that cold data.

Add an append-only overflow region so a root can track unbounded entries
while preserving the dual-root atomic-commit guarantee.

- Root page keeps the first <=70 entries inline (zero overhead, fully
  backward compatible for the common case). The surplus packs into
  ManifestEntry overflow pages (72 entries each). The root header's
  next_page records the overflow run's start page; entry_page_count
  records its length; file_count is the inline+overflow total.
- CRASH-ATOMICITY INVARIANT: overflow pages are appended at EOF and
  sync_data'd BEFORE the root that references them. Append-only means a
  new run never overwrites the currently-active slot's older run, so a
  crash before the root's atomic commit leaves the prior committed state
  (its root AND its overflow run) fully intact. open() loads each slot's
  overflow and returns None on any torn/short/corrupt/inconsistent run,
  so selection falls back to the last-good slot — partial loss at worst,
  NEVER corruption. Preserves the format's pre-existing failure semantics.
- Bound append-only growth: when a committed file is mostly dead overflow
  from superseded commits, compact() rewrites it as
  [Root A][Root B][fresh overflow] via temp-file + atomic rename (the live
  manifest stays valid until the rename). Best-effort: a compaction
  failure never fails an already-durable commit.

The ManifestEntry (0x02) page type and the reserved entry_page_count field
already existed in the v2 layout for exactly this; no on-disk format bump.

Validation (red/green TDD):
- test_overflow_persists_beyond_inline_cap: 200 entries persist + recover.
- test_overflow_commit_crash_atomicity: commit 150>cap then 300>cap, lop
  the final 4 KB page (torn state2 overflow); reopen falls back to the
  last-good state1 (150), never panic/garbage. RED before this change.
- test_overflow_compaction_bounds_growth: 60 commits of a >cap set stay
  within a small multiple of the live set (compaction fires).
- test_overflow_tombstone_and_gc: tombstone + prune an overflow-region
  entry, survives reopen.
- test_manifest_max_inline_entries: inverted from ">70 rejected" to
  ">70 persists via overflow" (MAX_INLINE_ENTRIES still == 70).
- Manifest suite 22 passed; full lib suite 3426 passed; fmt + clippy clean.

Pairs with the per-file-liveness data-loss fix (a3b3a9a). Remaining in this
series: AOF-independent manifest durability (appendonly=no) and the default
cold-orphan-sweep interval.

author: Tin Dang <tin.dang@trustifytechnology.com>
At the 300s default the cold-orphan sweep never fired within a typical
benchmark window (~3 min), which had two bad effects: cold orphan heap
files accumulated on disk for up to 5 minutes before any reclamation, and
— more seriously — it MASKED a batch-file shared-deletion data-loss bug,
because the buggy sweep never ran to expose it (fixed by the per-file
liveness refcount in ColdIndex, a3b3a9a).

Lower the default to 60s (the documented recommended floor). Reclamation
is prompt without churning: the sweep's per-file unlinks run off the hot
path, so a shorter interval keeps each batch small rather than letting
hundreds of orphans pile up between sweeps. Operators can still set 0 to
disable or raise it up to 3600.

Validated together with a3b3a9a + f5e17ac in the Phase C multishard gate
(s=4, sweep=2s, disk-offload on, both runtimes): disk bounded at the
working-set plateau, post-crash recovery read-through 200/200.

author: Tin Dang <tin.dang@trustifytechnology.com>
…nly=no)

Cold (spilled-to-disk) keys were silently dropped on restart whenever the
server ran with `--appendonly no --disk-offload enable`. After a crash the
cold read-through recovered 0/200 even though the heap files and per-shard
manifests were intact on disk.

Root cause (main.rs shard-construction closure): `persistence_dir` is derived
as Some(..) only when `appendonly == "yes" || save.is_some()`, and is
intentionally None under appendonly=no so that no per-tick WAL fsync writer is
created. Recovery was then gated behind `if let Some(ref dir) = persistence_dir`,
so `restore_from_persistence` — whose v3 path does the AOF-INDEPENDENT cold
rebuild (heap reload + rebuild_from_manifest) — never ran. The disk-offload
cold tier was therefore transitively gated on AOF/save being enabled, even
though `disk_offload_base` was Some. "v3 recovery complete" never logged.

Fix: fire recovery when `persistence_dir.is_some() || disk_offload_base.is_some()`.
The dir argument is used only by the v2 fallback (a no-op when there is no
appendonly.aof/snapshot), so it falls back to `config.dir` under appendonly=no;
the v3 path reads the offload manifest directly. This is additive and changes
nothing for the appendonly=yes / save paths.

Validated red/green with a process-level regression test that boots a 4-shard
server under `--appendonly no --disk-offload enable`, forces cold spill, SIGKILLs,
restarts, and asserts cold read-through:
  RED  (old gate): 0/200 with 15 heap files on disk (recovery skipped).
  GREEN (this fix): 179-200/200; recovery log shows "manifest recovered" +
                    "reloaded N KV entries from heap-*.mpf".
Because there is no AOF under appendonly=no, a non-zero post-crash count can
only come from the cold-manifest recovery path this fix enables. The test floor
is 75% — the 172-200 GREEN spread is intrinsic to crashing without AOF (a cold
key is durable only once its eviction-tick manifest commit has landed), while a
recovery-skip regression collapses to a categorical 0.

Refs #22

author: Tin Dang <tin.dang@trustifytechnology.com>
Three root causes behind the "zombie eating RAM in multishard mode" report,
all on the tokio runtime with disk-offload (the default):

1. Serving hang (listener bind race). run_sharded bound the central tokio
   listener with a plain TcpListener (no SO_REUSEPORT) while each shard binds
   its own SO_REUSEPORT listener. A plain socket and REUSEPORT sockets on one
   port are incompatible, resolved by a bind-order race: if shards win, the
   central plain-bind hits EADDRINUSE, run_sharded returns Err, and the WHOLE
   server tears down; if the central wins, the shards' binds fail and the accept
   path is left half-wired. Either way PING is never answered. Fix: the central
   listener now also binds SO_REUSEPORT (the same create_reuseport_socket helper
   the shards use), so it coexists with the shard sockets regardless of order;
   per_shard_accept stays false (a bound-but-idle REUSEPORT socket black-holes
   the connections the kernel hashes to it). monoio was always healthy because
   its bind is REUSEPORT-compatible.

2. io_uring under load. tokio's experimental io_uring bridge floods "Unknown
   io_uring event type: 0" and drops connections (BrokenPipe) under sustained
   pipelined load. It is now default-OFF for tokio (opt in via MOON_URING=1;
   MOON_NO_URING still force-disables and remains the CI default). io_uring
   stays in monoio, the production runtime.

3. PageCache eager pre-alloc (the actual RAM zombie). disk-offload defaults to
   enable, so each shard built its own PageCache sized to 25% of the WHOLE
   maxmemory and EAGERLY committed num_frames*PAGE zeroed bytes at construction.
   With N shards that is N*25%*maxmemory pre-committed before serving a single
   command; with --maxmemory unset the guardrail sets it to 80% of host RAM, so
   a 4-shard default server reserved ~80% of host RAM instantly (measured 3791MB
   RSS at startup with zero data). Fix: (a) lazy buffers - frame buffers start
   empty and grow to a full page on first fetch_page miss, so RSS tracks the
   working set, not the budget; (b) divide the budget across shards
   (per_shard_pagecache_budget) so total pre-alloc is bounded by the budget
   regardless of shard count; (c) a startup WARN when an explicit --pagecache-size
   exceeds 50% of maxmemory. Startup RSS for the 4-shard no-maxmemory case drops
   from 3791MB to 27MB.

Validation (OrbStack Linux aarch64, all server runs cgroup-capped to protect the
host): multishard serve smoke 8/8 tokio + 5/5 monoio; #22 cold read-through
200/200 both runtimes; full test suites green on both runtimes; PageCache lazy +
budget-divide unit tests (red before, green after); startup RSS 3791MB -> 27MB.

author: Tin Dang <tin.dang@trustifytechnology.com>
…doctest fixes

Test and CI-green hygiene supporting the multishard stability fix.

- Add tests/multishard_serve_smoke.rs (non-ignored, runs on BOTH runtimes):
  boots the real binary in multishard configs and asserts a raw-TCP inline PING
  is answered with +PONG inside a hard deadline. This guards the central /
  per-shard SO_REUSEPORT accept path so the serving hang can never silently
  regress. Every case pins a small --maxmemory (128 MiB) so the smoke test
  cannot itself trip the PageCache pre-alloc and OOM a small CI runner; the bind
  race it guards is bind-time and independent of maxmemory. Shard count is
  clamped to available_parallelism().clamp(2,4) to avoid the thread-per-core
  oversubscription wall on small runners.

- Derive Default on ServerConfig and add ..Default::default() spreads across the
  integration tests that construct it by literal, so they compile against the
  fields added this milestone (F2 aof_fsync_timeout_ms, G1/G2 Option<maxmemory>,
  Phase-A experimental_per_shard_rewrite).

- autovacuum_daemon.rs: serialize the two tests that read/assert the
  process-global RECL_AUTOVACUUM_LAST_RUN_TS behind a file-local, poison-tolerant
  mutex. Pre-existing parallel-execution flake: the enabled test advanced the
  global mid-assertion of the disabled test (failed under load on monoio, passed
  on tokio by scheduling luck). 5/5 default-parallel runs green after the guard.

- migrate_aof.rs: annotate the usage doc-block as ```text so rustdoc stops
  compiling the shell example as Rust (it broke `cargo test --doc`).

- crash_recovery_disk_offload_no_aof (#22): restart-retry helper + 65% recovery
  floor + 8s settle-before-kill, hardening the intrinsic crash-durability spread
  against CPU-contention timing (8/8 tokio, 6/6 monoio).

author: Tin Dang <tin.dang@trustifytechnology.com>
@coderabbitai

coderabbitai Bot commented Jun 3, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@TinDang97, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 56 minutes and 11 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 38da2457-9ce9-4a1e-bcd7-7493e814fe6a

📥 Commits

Reviewing files that changed from the base of the PR and between 9f91381 and 27dd61a.

📒 Files selected for processing (4)
  • CHANGELOG.md
  • CLAUDE.md
  • src/graph/cypher/mod.rs
  • src/graph/cypher/parser/mod.rs
📝 Walkthrough

Walkthrough

Adds per-shard maxmemory guardrail (optional CLI), per-shard AOF rewrite coordination, manifest overflow pages, batched multi-page cold-tier spill with page-indexed locations and file-liveness tracking, per-shard eviction budgets, bounded cross-shard backpressure, PageCache lazy buffers, and broad test updates.

Changes

Guardrail and per-shard maxmemory

Layer / File(s) Summary
Memory guardrail and per-shard budget system
src/config.rs, src/main.rs, src/server/embedded.rs, src/server/listener.rs
ServerConfig::maxmemoryOption<usize>; apply_memory_guardrail() resolves explicit/applied/skip outcomes; RuntimeConfig adds num_shards and maxmemory_per_shard() (uses div_ceil); startup logs sharding; CLI flags --experimental-per-shard-rewrite and --aof-fsync-timeout-ms added; tests updated.

Persistence: AOF and manifest

Layer / File(s) Summary
AOF fsync timeout and writer pool wiring
src/persistence/aof.rs, src/main.rs, src/server/embedded.rs, src/server/listener.rs
AofWriterPool constructors accept fsync_timeout: Duration with DEFAULT_AOF_FSYNC_TIMEOUT; try_send_append_durable maps bounded ack outcomes; startup passes configured timeout into pools; tests adjusted to new signatures.
Per-shard BGREWRITEAOF fan-out & coordinator
src/persistence/aof.rs, src/command/persistence.rs
Adds AofMessage::RewritePerShard, PerShardRewriteCoord, and AofWriterPool::try_send_rewrite_per_shard(...); bgrewriteaof_start_sharded routes per-shard when layout is PerShard and errors map to RESP frames.
Manifest: overflow pages & replay semantics
src/persistence/manifest.rs, src/persistence/aof_manifest.rs
Adds ENTRIES_PER_OVERFLOW_PAGE, overflow page serialization/parsing, load_root that reconstructs inline+overflow, compaction with needs_reopen, tests for overflow durability; advance_shard defers deletions and new prune_shard_files() added; replay max_lsn uses next-free offsets (lsn+len).

Disk-offload tiering, spill, and cold index

Layer / File(s) Summary
Batch spill and multi-page files
src/storage/tiered/kv_spill.rs, src/storage/tiered/spill_thread.rs
Introduces INLINE_MAX_VALUE_BYTES, SpillEntry, BatchPages, build_kv_spill_batch, write_kv_spill_batch; spill thread buffers requests and emits one SpillCompletion per written file containing multiple SpillCompletionEntrys; tests updated to validate batch paths.
ColdIndex page indexing & deferred unlink
src/storage/tiered/cold_index.rs
ColdLocation adds page_idx; ColdIndex tracks file_refs and pending_unlink, updates refs on insert/remove/merge, drain_pending_unlink unlinks files and tombstones manifest entries; rebuild_from_manifest computes page indices; tests expanded.
Cold-read optimization and wiring
src/storage/tiered/cold_read.rs, src/shard/conn_accept.rs, src/shard/event_loop.rs, src/server/conn/handler_sharded/mod.rs
read_cold_entry reads only required 4KB page via pread; overflow loads full file only when needed; spawn_tokio_connection now receives spill plumbing; event loop seeds spill_file_id from recovered heap files and corrects disk_offload_dir per shard.

Eviction and budgeting

Layer / File(s) Summary
Per-shard eviction budget & spill seeding
src/storage/eviction.rs, src/shard/persistence_tick.rs
Eviction loops use rt.maxmemory_per_shard() budget; async spill no longer inserts tentative cold_index entries; adds next_spill_file_id_seed scanning heap-*.mpf; persistence tick applies per-file commit+add semantics and inserts ColdLocation per entry.

Cross-shard dispatch and runtime tuning

Layer / File(s) Summary
Bounded backpressure for cross-shard pushes
src/shard/dispatch.rs, src/server/conn/handler_monoio/mod.rs, src/server/conn/handler_sharded/mod.rs
Adds push_with_backpressure with backoff and max retries, PushOutcome enum; handlers replace unbounded retry loops with bounded helper and reject batch on give-up; monoio adds bounded reply wait constant.
Listener reuseport & io_uring gating
src/server/listener.rs, src/shard/event_loop.rs
run_sharded attempts SO_REUSEPORT on Unix with fallback; io_uring disabled by default unless MOON_URING=1 with updated messaging.

Tests & infra

Layer / File(s) Summary
Test harness migration and new integration tests
tests/* (many files)
All test ServerConfig literals migrated to maxmemory: Some(...) or ..Default::default() where appropriate; adds per-shard BGREWRITEAOF crash-matrix tests, disk-offload crash recovery test, multishard smoke tests, and synchronization helper for autovacuum tests.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Possibly related PRs

  • pilotspace/moon#96: both PRs gate multi-part AOF replay/manifest initialization to monoio and update related startup logic.
  • pilotspace/moon#63: related prior work on sharded BGREWRITEAOF dispatch that this PR extends with coordinator and fan-out.
  • pilotspace/moon#95: overlaps in embedded startup wiring (fsync_timeout, num_shards, guardrail logging).

Suggested labels

enhancement

🐰 A rabbit's note on shards and spills:
Per-shard caps guard every nest,
Overflow pages store the rest,
Batch the spills, then prune with care—
Rewrites fold, manifests fare,
Keys on disk find safe, soft rest. 🥕

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/pr129-port-onto-main

@qodo-code-review

Copy link
Copy Markdown

Review Summary by Qodo

Stable multishard + disk-offload on both runtimes; eliminate RAM zombie via lazy PageCache and per-shard budgets

🐞 Bug fix ✨ Enhancement 🧪 Tests

Grey Divider

Walkthroughs

Description
  **Multishard stability and RAM zombie elimination:**
• Fixed tokio multishard serving hang by adding SO_REUSEPORT to central listener, resolving
  bind-order race with per-shard listeners
• Disabled io_uring by default for tokio runtime (opt-in via MOON_URING=1) to prevent connection
  drops under load
• Implemented lazy PageCache buffer allocation (frames grow on first use, not pre-committed) and
  per-shard budget division to eliminate N× RAM over-commitment
• Reduced idle RSS from 3791 MB to 29 MB on 4-shard default server; maintains flat 45–48 MB under
  600k ops on both runtimes
  **Per-shard AOF rewrite (F6) and durability hardening:**
• Ported PR #129 per-shard AOF work onto main with multi-part rewrite coordination via
  PerShardRewriteCoord
• Implemented fsync timeout bounding (F2) with AckOutcome enum and runtime-specific select/timeout
  logic
• Added per-shard fan-out BGREWRITEAOF compaction with synchronized seq bump and single manifest
  commit
  **Disk-offload cold-tier durability fixes:**
• Implemented manifest overflow pages for >70 file entries with crash-atomic ordering (write
  overflow before root commit)
• Fixed batch-file shared-deletion data-loss bug via per-file liveness refcount tracking; files only
  unlinked when ALL co-located keys removed
• Refactored spill thread to buffered batch flushing with inline/oversized entry routing; multi-page
  files with per-entry page indexing
• Fixed cold-tier recovery regression (#22): restore from persistence when
  disk_offload_base.is_some() even under appendonly no
• Implemented page-indexed cold read with lazy full-file I/O; only reads specific pages in
  multi-page batch files
  **Memory guardrail and per-shard eviction budget (G1/G2):**
• Added whole-instance maxmemory guardrail with cgroup-aware auto-capping at ~80% of detected memory
  limit (Linux cgroup v1/v2 support)
• Changed maxmemory from usize to Option to distinguish omitted flag (auto-guardrail) from
  explicit --maxmemory 0 (unlimited)
• Implemented per-shard eviction budget division (maxmemory / num_shards) enforced across all
  eviction thresholds and pressure cascades
• Added startup warnings for explicit --pagecache-size exceeding 50% of maxmemory
  **Cross-shard dispatch and backpressure:**
• Implemented bounded cross-shard dispatch with push_with_backpressure() helper (100µs backoff,
  5000 max retries) to prevent connection parking
• Added response-wait timeout (~30s) with shutdown checks for cross-shard writes
• Added PushOutcome enum to distinguish happy path, transient backpressure, and cancellation
  scenarios
  **Configuration and test updates:**
• Added experimental_per_shard_rewrite flag (opt-in per-shard BGREWRITEAOF) and
  aof_fsync_timeout_ms (2000ms default)
• Lowered cold_orphan_sweep_interval_secs default from 300 to 60 seconds to promptly reclaim
  orphans
• Updated 30+ test files for optional maxmemory field and new AOF configuration fields
• Added comprehensive crash-recovery test suite: cold-tier durability without AOF (#22), per-shard
  BGREWRITEAOF matrix, multishard serving smoke tests
• Serialized autovacuum timestamp tests to prevent flaking under parallel execution
Diagram
flowchart LR
  A["Multishard<br/>Serving"] -->|"SO_REUSEPORT<br/>central listener"| B["Tokio<br/>Stability"]
  A -->|"io_uring<br/>default OFF"| B
  C["PageCache<br/>Allocation"] -->|"lazy buffers<br/>per-shard budget"| D["RAM<br/>Bounded"]
  E["Spill<br/>Thread"] -->|"batch writes<br/>multi-page files"| F["Cold<br/>Tier"]
  F -->|"per-file liveness<br/>refcount"| G["Durability<br/>Safe"]
  F -->|"overflow pages<br/>crash-atomic"| G
  H["Eviction<br/>Pressure"] -->|"per-shard budget<br/>division"| D
  I["Cross-Shard<br/>Dispatch"] -->|"bounded backpressure<br/>retry limits"| J["Stability<br/>Guaranteed"]
  K["Memory<br/>Guardrail"] -->|"cgroup-aware<br/>auto-cap"| D

Loading

Grey Divider

File Changes

1. src/persistence/aof.rs ✨ Enhancement +761/-26

Per-shard AOF rewrite (F6) and fsync timeout bounding (F2)

• Added AckOutcome enum to distinguish terminal states of bounded fsync-ack awaits (F2
 design-for-failure)
• Introduced PerShardRewriteCoord struct for crash-safe multi-shard BGREWRITEAOF coordination (F6)
• Added RewritePerShard message variant and per-shard rewrite entry point
 try_send_rewrite_per_shard
• Implemented fsync timeout bounding via await_ack helper with runtime-specific select/timeout
 logic
• Added do_rewrite_per_shard and drain_pending_appends_framed functions for per-shard fold
 operations
• Enhanced per_shard_aof_writer_task to handle per-shard rewrite messages and fixed EverySec
 deadline enforcement

src/persistence/aof.rs


2. src/storage/tiered/spill_thread.rs ✨ Enhancement +409/-303

Batched spill writes with inline/oversized entry routing

• Refactored spill thread from per-request single-file writes to buffered batch flushing
• Added FLUSH_ENTRY_CAP constant and flush_buffer function for multi-entry batch writes
• Introduced SpillCompletionEntry struct and updated SpillCompletion to carry multiple entries
 per file
• Implemented inline vs oversized entry routing: inline entries batch into one multi-page file,
 oversized get single-page files
• Updated test suite to verify batch semantics and per-file completion model

src/storage/tiered/spill_thread.rs


3. src/persistence/manifest.rs ✨ Enhancement +405/-39

Manifest overflow pages for >70 file entries

• Added ENTRIES_PER_OVERFLOW_PAGE constant and overflow page serialization/parsing functions
• Implemented append-only overflow pages for manifest entries beyond the 70-entry inline cap
• Added load_root function to fully load a slot including its overflow run with crash-safety
 validation
• Implemented compact function to reclaim dead overflow pages and bound file growth
• Enhanced commit to write overflow pages before root commit (crash-atomic ordering)
• Added comprehensive tests for overflow persistence, crash atomicity, compaction, and tombstoning

src/persistence/manifest.rs


View more (58)
4. src/shard/event_loop.rs 🐞 Bug fix +70/-14

io_uring default disable, PageCache lazy allocation, spill recovery

• Changed io_uring default for tokio runtime from enabled to disabled (opt-in via MOON_URING=1)
• Fixed PageCache per-shard budget division to prevent multishard RAM over-commitment
• Implemented lazy PageCache buffer allocation (frames grow on first use, not pre-allocated)
• Added startup warning when explicit --pagecache-size exceeds 50% of maxmemory
• Fixed spill directory path to use per-shard subdirectories (shard-{id})
• Added spill file_id counter seeding from recovered cold files to prevent post-restart overwrites

src/shard/event_loop.rs


5. src/persistence/migrate_aof.rs 📝 Documentation +1/-1

Documentation formatting fix

• Changed code block marker from triple backticks to text variant in documentation comment

src/persistence/migrate_aof.rs


6. src/config.rs ✨ Enhancement +452/-9

G1 memory guardrail and per-shard eviction budget division

• Added Default derive to ServerConfig and introduced G1 memory guardrail system with
 auto-capping at ~80% of detected memory limit (cgroup-aware on Linux)
• Changed maxmemory from usize to Option to distinguish between omitted flag (auto-guardrail)
 and explicit --maxmemory 0 (unlimited)
• Added per-shard eviction budget via RuntimeConfig::maxmemory_per_shard() to divide
 whole-instance cap across shards, fixing multishard RAM zombie
• Added new config flags: experimental_per_shard_rewrite (opt-in per-shard BGREWRITEAOF),
 aof_fsync_timeout_ms (2000ms default for fsync ack timeout)
• Lowered cold_orphan_sweep_interval_secs default from 300 to 60 seconds to promptly reclaim
 orphans and mask batch-file shared-deletion bug
• Added comprehensive memory detection and guardrail resolution logic with Linux cgroup v1/v2
 support and extensive unit tests

src/config.rs


7. src/main.rs 🐞 Bug fix +231/-88

Startup guardrail, cold recovery fix, and multishard serving stability

• Applied G1 memory guardrail at startup before building RuntimeConfig, logging the outcome
• Updated AOF writer pool initialization to pass aof_fsync_timeout_ms and base directory for
 per-shard BGREWRITEAOF coordination
• Gated per-shard BGREWRITEAOF behind experimental_per_shard_rewrite flag with appropriate
 warnings
• Fixed cold-tier recovery regression (#22): fire restore_from_persistence when
 disk_offload_base.is_some() even under appendonly no
• Preserved cold-tier wiring (cold_shard_dir + cold_index) across AOF/RDB recovery by capturing
 before replay and re-attaching after
• Fixed tokio multishard serving hang by removing the cfg!(target_os = "linux") gate on central
 accept loop (always run for tokio)
• Published resolved shard count to RuntimeConfig for per-shard eviction budget calculation and
 logged maxmemory sharding notice
• Wrapped single-shard multi-part AOF paths in #[cfg(runtime-monoio)] with tokio fallback warnings
 to prevent regression

src/main.rs


8. src/persistence/aof_manifest.rs 🐞 Bug fix +186/-62

Crash-safe per-shard rewrite deletion ordering and F5 offset tracking

• Refactored advance_shard() to defer file deletion until after manifest commit, eliminating crash
 window where old seq files are deleted before durability
• Added new prune_shard_files() method for post-commit deletion of old generation files, called
 only after write_manifest() succeeds
• Fixed F5 replication offset tracking: max_lsn now represents next-free offset (entry end = start
 LSN + RESP length) instead of start LSN, preventing lsn→entry uniqueness violations
• Updated test advance_shard_defers_delete_until_after_commit to verify pre-commit invariant (new
 files written, old files survive) and post-commit deletion
• Added comprehensive F5 tests validating next-free offset calculation across inline, ordered, and
 multi-shard entries

src/persistence/aof_manifest.rs


9. src/storage/tiered/cold_index.rs 🐞 Bug fix +319/-35

Per-file liveness refcount and batch-file shared-deletion fix

• Added page_idx field to ColdLocation for multi-page spill files (file-absolute 4KB page index)
• Implemented reverse liveness tracking via file_refs HashMap to count live entries per file and
 pending_unlink queue for zero-ref files
• Fixed batch-file shared-deletion data-loss bug: files only unlinked when ALL co-located keys are
 removed, not on single orphan
• Refactored sweep_known_orphans() into two phases: remove orphan entries (decrement refs), then
 drain pending unlinks off hot path
• Added drain_pending_unlink() to safely unlink zero-ref files post-commit with re-queueing on
 error
• Updated rebuild_from_manifest() to use raw file reads and absolute page indices for correct
 multi-page file reconstruction
• Added comprehensive regression tests for co-located key retention, last-ref deletion, and
 re-eviction churn scenarios

src/storage/tiered/cold_index.rs


10. tests/crash_recovery_disk_offload_no_aof.rs 🧪 Tests +425/-0

Crash recovery test for cold-tier durability without AOF

• New comprehensive crash-recovery test for #22 regression: cold keys must survive SIGKILL under
 --appendonly no --disk-offload enable
• Validates that cold spill manifests are recovered via per-shard persistence (no AOF), with 65%
 recovery floor to distinguish regression (0 keys) from legitimate variance
• Implements robust restart logic with transient EADDRINUSE rebind race handling and SO_REUSEPORT
 listener teardown synchronization
• Includes helper functions for unique port/directory allocation, server lifecycle management,
 redis-cli integration, and heap file counting

tests/crash_recovery_disk_offload_no_aof.rs


11. src/storage/eviction.rs 🐞 Bug fix +162/-19

Per-shard eviction budget and cold file_id seeding

• Updated eviction threshold comparisons to use config.maxmemory_per_shard() instead of raw
 maxmemory, enforcing per-shard budget division
• Added next_spill_file_id_seed() function to scan recovered heap-*.mpf files and seed counter
 above max file_id, preventing post-restart overwrites of cold files
• Removed tentative cold_index insertion at evict time (was inserting slot 0 which could return
 wrong key under batching); rely on completion handler for authoritative location
• Added unit tests for per-shard budget enforcement in multishard mode and spill file_id seeding
 logic

src/storage/eviction.rs


12. tests/autovacuum_daemon.rs 🧪 Tests +17/-0

Serialize autovacuum timestamp tests to prevent flaking

• Added process-global mutex lock (autovacuum_ts_lock()) to serialize tests that read/assert
 RECL_AUTOVACUUM_LAST_RUN_TS, preventing cross-test flaking under parallel execution
• Lock is poison-tolerant to avoid wedging sibling tests if one panics

tests/autovacuum_daemon.rs


13. src/storage/tiered/kv_spill.rs ✨ Enhancement +406/-0

Multi-page batch spill with per-entry page indexing

• Added page_idx field to ColdLocation in spill operations to support multi-page batch spilling
• Implemented build_kv_spill_batch() function for greedy packing of KV entries into multiple leaf
 pages with overflow handling
• Implemented write_kv_spill_batch() for atomic file writes with crash-safety via .tmp rename
 pattern
• Added comprehensive tests for multi-page batch building, round-trip read recovery, and
 crash-recovery manifest rebuilding

src/storage/tiered/kv_spill.rs


14. tests/crash_matrix_per_shard_bgrewriteaof.rs 🧪 Tests +341/-0

Per-shard BGREWRITEAOF crash-recovery matrix tests

• New crash-recovery test suite for per-shard BGREWRITEAOF fan-out compaction under
 --experimental-per-shard-rewrite
• Tests validate exact-count recovery of non-idempotent INCR commands across rewrite boundaries
 (straddle + compose scenarios)
• Includes helper functions for process spawning, port management, and compacted base file detection
• Acceptance gate for F6 monoio multi-shard rewrite correctness (0% data loss, 0% duplication)

tests/crash_matrix_per_shard_bgrewriteaof.rs


15. src/persistence/page_cache/mod.rs 🐞 Bug fix +132/-3

Lazy PageCache buffers and per-shard budget division

• Changed buffer allocation from eager (pre-committed at construction) to lazy (allocated on first
 fetch_page miss)
• Added pagecache_frame_counts() to split per-shard budget into 4KB/64KB frame counts with minimum
 floors
• Added per_shard_pagecache_budget() to divide whole-instance budget across shards, preventing N×
 over-commitment
• Added resident_buffer_bytes() method to report actual committed memory (reflects working set,
 not configured budget)
• Added regression tests locking in lazy allocation and per-shard budget division

src/persistence/page_cache/mod.rs


16. tests/multishard_serve_smoke.rs 🧪 Tests +203/-0

Multishard serving smoke tests for both runtimes

• New smoke test validating multishard server actually serves commands on both runtimes (monoio +
 tokio)
• Regression guard for tokio-on-Linux multishard accept hang caused by SO_REUSEPORT bind-order race
• Tests plain multishard, 1-shard, and multishard+disk-offload configurations with raw TCP PING
 verification
• Uses CARGO_BIN_EXE_moon for CI integration; runs in default cargo test (not #[ignore])

tests/multishard_serve_smoke.rs


17. tests/integration.rs ⚙️ Configuration changes +14/-7

Update integration tests for optional maxmemory field

• Changed maxmemory field from usize to Option across all server startup helpers
• Added ..Default::default() to ServerConfig struct initialization to handle new fields
• Updated 8 test helper functions to wrap maxmemory values in Some()

tests/integration.rs


18. src/shard/dispatch.rs ✨ Enhancement +158/-0

Bounded cross-shard dispatch backpressure with retry limits

• Added CROSS_SHARD_PUSH_BACKOFF (100µs) and CROSS_SHARD_PUSH_MAX_RETRIES (5000) constants for
 bounded backpressure
• Implemented PushOutcome enum (Pushed, Backpressure, Cancelled) for cross-shard SPSC push
 results
• Implemented push_with_backpressure() async helper with bounded retry, shutdown checks, and no
 allocation on happy path
• Added 4 tokio-gated unit tests validating happy path (no sleep), transient full recovery,
 wedged-shard give-up, and shutdown abort

src/shard/dispatch.rs


19. src/shard/persistence_tick.rs ✨ Enhancement +43/-32

Per-shard eviction thresholds and batch spill completion handling

• Updated apply_spill_completions() to process per-file batches (one manifest add+commit per file,
 multiple ColdIndex inserts per file)
• Changed from single-entry completion to multi-entry SpillCompletion with entries vector
 containing page_idx and slot_idx
• Updated should_run_pressure_cascade() to compare against maxmemory_per_shard() instead of
 whole-instance maxmemory
• Updated handle_memory_pressure() eviction threshold to use per-shard budget for bounded
 aggregate RSS

src/shard/persistence_tick.rs


20. src/server/conn/handler_monoio/mod.rs ✨ Enhancement +89/-24

Bounded cross-shard dispatch and response-wait timeouts

• Added CROSS_SHARD_RESPONSE_MAX_WAITS constant (~30s generous backstop for uncertain writes)
• Replaced unbounded loop { try_push; sleep(10µs) } with push_with_backpressure() helper for
 bounded cross-shard dispatch
• Added response-wait timeout with shutdown checks to prevent parking connections on wedged peers
• Enhanced error messages to distinguish backpressure, cancellation, and timeout scenarios

src/server/conn/handler_monoio/mod.rs


21. src/server/conn/handler_sharded/mod.rs ✨ Enhancement +71/-10

Disk-offload eviction and bounded cross-shard backpressure

• Added disk-offload spill support in eviction path via try_evict_if_needed_async_spill() when
 spill sender is available
• Replaced unbounded cross-shard push loop with push_with_backpressure() helper with bounded retry
 and shutdown checks
• Enhanced error handling to reject batch on backpressure/cancellation instead of parking connection
 forever

src/server/conn/handler_sharded/mod.rs


22. src/server/listener.rs 🐞 Bug fix +35/-1

SO_REUSEPORT central listener and AOF fsync timeout

• Central tokio listener now binds with SO_REUSEPORT via create_reuseport_socket() to coexist
 with per-shard listeners
• Fallback to plain bind on non-unix or socket creation failure (preserves existing behavior)
• Added aof_fsync_timeout_ms parameter to AofWriterPool::top_level_with_policy() calls

src/server/listener.rs


23. src/server/embedded.rs ✨ Enhancement +13/-1

Memory guardrail and per-shard maxmemory enforcement

• Added memory guardrail logging via log_memory_guardrail() before RuntimeConfig construction
• Published resolved shard count to RuntimeConfig so per-shard budget division is enforced
• Added per-shard maxmemory sharding log via log_maxmemory_sharding()
• Added aof_fsync_timeout_ms parameter to AofWriterPool::top_level_with_policy() call

src/server/embedded.rs


24. src/shard/spsc_handler.rs ⚙️ Configuration changes +11/-2

AOF writer pool timeout parameter updates

• Updated AofWriterPool::top_level_with_policy() calls to include aof_fsync_timeout_ms parameter
• Added AofMessage::RewritePerShard variant to match statement in test

src/shard/spsc_handler.rs


25. src/storage/tiered/cold_read.rs ✨ Enhancement +11/-8

Page-indexed cold read with lazy full-file I/O

• Changed from whole-file read to page-specific pread using FileExt::read_exact_at() with
 page_idx offset
• Only reads full file when following overflow chains (lazy full-file read on demand)
• Reduces I/O for single-page lookups in multi-page batch files

src/storage/tiered/cold_read.rs


26. tests/crash_matrix_per_shard_aof.rs 📝 Documentation +10/-5

Per-shard AOF crash-recovery now supports both runtimes

• Updated documentation to reflect per-shard AOF support on BOTH runtimes (monoio + tokio)
• Added build instructions for tokio runtime validation

tests/crash_matrix_per_shard_aof.rs


27. src/shard/conn_accept.rs ✨ Enhancement +11/-3

Tokio disk-offload spill context threading

• Added spill_sender, spill_file_id, and disk_offload_dir parameters to
 spawn_tokio_connection()
• Tokio handler now receives disk-offload context (mirrors monoio path) for spilling evicted KVs to
 cold tier

src/shard/conn_accept.rs


28. src/command/persistence.rs ✨ Enhancement +28/-0

Per-shard BGREWRITEAOF fan-out coordination

• Added F6 per-shard fan-out path for BGREWRITEAOF via pool.try_send_rewrite_per_shard() when
 layout is PerShard
• Synchronized seq bump + single manifest commit across all writers (both runtimes)
• Falls back to legacy RewriteSharded for TopLevel multi-DB pools

src/command/persistence.rs


29. src/shard/timers.rs 🐞 Bug fix +10/-1

Cold orphan sweep handles zero-ref file unlink

• Updated run_cold_orphan_sweep() to check for pending unlink files even when orphan_keys is
 empty
• Files orphaned by re-eviction or promotion carry no hot∩cold key but still need unlink processing

src/shard/timers.rs


30. src/server/conn/handler_single.rs ⚙️ Configuration changes +5/-1

AOF writer pool timeout parameter update

• Updated AofWriterPool::top_level_with_policy() call to include aof_fsync_timeout_ms parameter

src/server/conn/handler_single.rs


31. tests/lunaris_cypher_shortest_path.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/lunaris_cypher_shortest_path.rs


32. tests/lunaris_cypher_temporal.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/lunaris_cypher_temporal.rs


33. tests/ft_search_concurrent_readers.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/ft_search_concurrent_readers.rs


34. tests/ft_search_as_of_boundary.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/ft_search_as_of_boundary.rs


35. tests/ft_search_as_of_filter.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/ft_search_as_of_filter.rs


36. tests/ft_search_multi_shard_as_of.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/ft_search_multi_shard_as_of.rs


37. tests/ft_search_temporal_parity.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/ft_search_temporal_parity.rs


38. tests/workspace_integration.rs ⚙️ Configuration changes +6/-2

Update test config for optional maxmemory and new AOF fields

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added aof_fsync_timeout_ms and experimental_per_shard_rewrite fields to ServerConfig
• Added ..Default::default() to ServerConfig initialization

tests/workspace_integration.rs


39. tests/kill_snapshot.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/kill_snapshot.rs


40. tests/adversarial_v0110_fix01_set_delete_rollback.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/adversarial_v0110_fix01_set_delete_rollback.rs


41. tests/adversarial_v0110_fix02_err_path_intent.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/adversarial_v0110_fix02_err_path_intent.rs


42. tests/adversarial_v0110_fix03_simplestring_graph.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/adversarial_v0110_fix03_simplestring_graph.rs


43. tests/adversarial_v0110_fix07_multihop_edge_var_reject.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/adversarial_v0110_fix07_multihop_edge_var_reject.rs


44. tests/adversarial_v0110_fix04_shortest_path_call_parity.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/adversarial_v0110_fix04_shortest_path_call_parity.rs


45. tests/adversarial_v0110_fix06_shortest_path_min_hops.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/adversarial_v0110_fix06_shortest_path_min_hops.rs


46. tests/graph_bench_compare.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/graph_bench_compare.rs


47. tests/graph_bench_e2e.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/graph_bench_e2e.rs


48. tests/graph_integration.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/graph_integration.rs


49. tests/graph_stress_deep.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/graph_stress_deep.rs


50. tests/txn_cypher_write_rollback.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/txn_cypher_write_rollback.rs


51. tests/txn_graph_wiring.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/txn_graph_wiring.rs


52. tests/lunaris_hybrid_ft_search.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/lunaris_hybrid_ft_search.rs


53. tests/txn_completeness_edge_cases.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/txn_completeness_edge_cases.rs


54. tests/txn_ft_search_snapshot.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/txn_ft_search_snapshot.rs


55. tests/txn_kv_wiring.rs ⚙️ Configuration changes +3/-1

Update test config for optional maxmemory and new AOF fields

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added aof_fsync_timeout_ms and experimental_per_shard_rewrite fields to ServerConfig
• Added ..Default::default() to ServerConfig initialization

tests/txn_kv_wiring.rs


56. tests/mq_integration.rs ⚙️ Configuration changes +3/-1

Update test config for optional maxmemory and new AOF fields

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added aof_fsync_timeout_ms and experimental_per_shard_rewrite fields to ServerConfig
• Added ..Default::default() to ServerConfig initialization

tests/mq_integration.rs


57. tests/pipeline_auto_index.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/pipeline_auto_index.rs


58. tests/replication_test.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/replication_test.rs


59. tests/vacuum_commands.rs ⚙️ Configuration changes +2/-1

Update test config for optional maxmemory

• Changed maxmemory from usize to Option and wrapped value in Some()
• Added ..Default::default() to ServerConfig initialization

tests/vacuum_commands.rs


60. CHANGELOG.md 📝 Documentation +23/-0

Changelog entry for whole-instance maxmemory enforcement

• Added G2 section documenting maxmemory as a true whole-instance cap across shards
• Explains per-shard budget division (maxmemory / num_shards) for bounded aggregate RSS
• Notes that CONFIG GET maxmemory and INFO remain unchanged (Redis-compatible)
• Includes startup log message format for visibility of per-shard budget resolution

CHANGELOG.md


61. README.md Additional files +8/-0

...

README.md


Grey Divider

ⓘ You are approaching your monthly quota for Qodo. Upgrade your plan

Qodo Logo

@qodo-code-review

qodo-code-review Bot commented Jun 3, 2026

Copy link
Copy Markdown

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (3) 🔗 Cross-repo conflicts (0)

Context used
✅ Compliance rules (platform): 27 rules

Grey Divider


Action required

1. sigkill() unsafe missing SAFETY 📘 Rule violation ≡ Correctness
Description
A new unsafe block calls libc::kill without the required adjacent // SAFETY: justification,
violating the repository’s unsafe-code policy and making it harder for reviewers/auditors to
validate invariants and UB risk. This increases audit risk and can block automated unsafe audits.
Code

tests/crash_recovery_disk_offload_no_aof.rs[R332-337]

Evidence
UNSAFE_POLICY.md requires that every unsafe block include an adjacent // SAFETY: comment
describing the preconditions/invariants being upheld. In the newly added sigkill() helper, an
unsafe { libc::kill(...) } call was introduced without any accompanying SAFETY justification,
directly contradicting that requirement and preventing auditors from confirming why the call is
sound.

Rule 297369: Enforce unsafe code usage against UNSAFE_POLICY.md
tests/crash_recovery_disk_offload_no_aof.rs[331-339]
tests/crash_matrix_per_shard_bgrewriteaof.rs[138-145]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
A new `unsafe { libc::kill(...) }` usage was introduced without the policy-required adjacent `// SAFETY:` comment (per `UNSAFE_POLICY.md`), leaving the upheld preconditions/invariants undocumented.

## Issue Context
`UNSAFE_POLICY.md` mandates an explicit `// SAFETY:` justification for every unsafe block and prefers using safe alternatives when available (e.g., `Child::kill`) to reduce unsafe surface area. The new `sigkill()` helper currently calls `libc::kill` inside an unsafe block without documenting why this is safe.

## Fix Focus Areas
- tests/crash_recovery_disk_offload_no_aof.rs[331-339]
- tests/crash_matrix_per_shard_bgrewriteaof.rs[138-145]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. Unannotated .unwrap() in spill tests 📘 Rule violation ✧ Quality
Description
New .unwrap() calls were added without the required #[allow(clippy::unwrap_used)] and an
immediately preceding justification comment. This breaks the unwrap-audit expectations and makes
future lint/audit enforcement unreliable.
Code

src/storage/tiered/spill_thread.rs[519]

Evidence
The rule requires a justification comment plus #[allow(clippy::unwrap_used)] in-scope for every
.unwrap() in the diff. The updated spill-thread tests include .unwrap() usages with no
allow+comment pair (e.g., the find(...).unwrap() and other unwraps in the same test block).

Rule 302083: Annotate safe unwrap calls with allow and justification
src/storage/tiered/spill_thread.rs[492-547]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
New `.unwrap()` calls were introduced without the required `// ...` justification comment directly above an attached `#[allow(clippy::unwrap_used)]` attribute.

## Issue Context
The compliance rule requires every unwrap to be explicitly justified and locally allowed (or rewritten to avoid unwrap).

## Fix Focus Areas
- src/storage/tiered/spill_thread.rs[492-588]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. PerShardRewriteCoord missing loom test 📘 Rule violation ▣ Testability
Description
A new atomic coordination/state-machine type (PerShardRewriteCoord) was introduced without any
corresponding loom model test coverage. This risks concurrency bugs slipping through in a
multi-threaded rewrite coordinator.
Code

src/persistence/aof.rs[R217-327]

Evidence
The diff adds a new atomic coordination type that uses AtomicUsize/AtomicBool and cross-thread
completion logic, which qualifies as an atomic state machine. The existing loom test file is still
only modeling ResponseSlot and does not exercise PerShardRewriteCoord, so the required loom
coverage was not added.

Rule 302087: Require loom model tests for new atomic state machines
src/persistence/aof.rs[217-327]
tests/loom_response_slot.rs[1-15]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
A new atomic coordination type was added, but no loom model test was added/updated to exercise its concurrency behavior.

## Issue Context
The compliance requirement mandates loom-based model tests for new/modified atomic state machines, specifically in `tests/loom_response_slot.rs`.

## Fix Focus Areas
- src/persistence/aof.rs[217-327]
- tests/loom_response_slot.rs[1-120]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


View more (1)
4. Cold reads disabled in replay 🐞 Bug ≡ Correctness
Description
src/main.rs temporarily take()s cold_shard_dir/cold_index before multi-part AOF replay and
only reattaches them after replay completes, so replayed commands cannot read disk-offloaded keys.
This can permanently corrupt recovered state for read-modify-write commands during replay (e.g.,
INCR reads via db.get() and will treat cold-only keys as missing).
Code

src/main.rs[R758-770]

Evidence
main.rs explicitly removes cold-tier wiring (take()) before invoking AOF replay and only
restores it afterward; replay uses the standard command dispatch engine; INCR reads via db.get(),
and Database::get() only performs cold read-through when cold_shard_dir and cold_index are
present, so cold-only keys appear missing during replay and can be replayed incorrectly.

src/main.rs[739-775]
src/main.rs[1054-1065]
src/persistence/replay.rs[512-538]
src/command/string/string_write.rs[265-299]
src/storage/db.rs[681-722]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
During multi-part AOF replay, `cold_shard_dir` and `cold_index` are removed from every `Database` via `.take()` before calling the replay routines, and only restored after replay finishes. Because replay executes the normal command handlers, any replay-time read of a disk-offloaded key (e.g., INCR’s read-modify-write) cannot cold-read and will behave as if the key is missing, producing incorrect recovered state.

## Issue Context
- `rdb::load` swaps whole `Database` instances (`*live = temp`), which is why cold-tier fields get dropped.
- The current approach works around that swap by stashing cold wiring in `main.rs`, but it makes cold storage unavailable during the replay window.
- Fix must ensure cold wiring is present *while replaying incremental RESP commands*, not only after replay ends.

## Fix Focus Areas
- src/main.rs[739-775]
- src/main.rs[1054-1065]
- src/persistence/aof_manifest.rs[1299-1370]
- src/persistence/rdb.rs[362-367]

## Suggested fix approach
1. Change replay path so cold wiring is restored immediately after the base RDB load swap and **before** incremental command replay begins.
  - Option A: Add a recovery-only `rdb::load_preserving_wiring(...)` (or similar) that copies selected “topology” fields (cold wiring) from the current live DBs into the temp DBs right before the final swap, and use it from `replay_multi_part` / per-shard replay.
  - Option B: Have `replay_multi_part` / `replay_per_shard` accept a preserved-wiring bundle and reattach it right after `rdb::load(...)` returns and before `replay_incr_resp(...)`.
2. Remove (or narrow) the `.take()` window in `main.rs` so cold wiring remains available during replayed command execution.
3. Add a regression test that:
  - Sets up disk-offload state with a key only in cold storage.
  - Replays an AOF stream containing a read-modify-write command against that key (e.g., INCR).
  - Asserts the recovered value matches the expected incremented value.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

5. Spill defaults to dot-dir 🐞 Bug ⛨ Security
Description
When spill_sender is set but disk_offload_dir is missing, the tokio sharded handler spills
evicted data into the process working directory via Path::new("."). This can create/overwrite
unexpected files outside the configured offload directory.
Code

src/server/conn/handler_sharded/mod.rs[R1283-1294]

Evidence
The handler explicitly falls back to . when the spill directory is missing; the context fields are
optional, so this state is representable; and the event loop constructs these values separately,
meaning a mismatch can occur if wiring changes or configuration is incomplete.

src/server/conn/handler_sharded/mod.rs[1271-1299]
src/server/conn/core.rs[87-92]
src/shard/event_loop.rs[597-623]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The tokio sharded handler uses `ctx.disk_offload_dir.as_deref().unwrap_or(Path::new("."))` when spilling is enabled. If `spill_sender` is present but `disk_offload_dir` is `None` (miswire/misconfig), the server will spill data files into the current working directory, which is unsafe and operationally surprising.

## Issue Context
`spill_sender`, `spill_file_id`, and `disk_offload_dir` are independently optional in `ConnectionContext`, so there is no type-level guarantee that a spill sender implies a valid spill directory.

## Fix Focus Areas
- src/server/conn/handler_sharded/mod.rs[1271-1301]
- src/server/conn/core.rs[87-92]
- src/shard/event_loop.rs[597-623]

## Suggested fix approach
1. Replace the `unwrap_or(Path::new("."))` fallback with an explicit invariant check:
  - If `spill_sender.is_some()` and `disk_offload_dir.is_none()`, log an error and fall back to delete-only eviction (`try_evict_if_needed`) or return a clear error (prefer falling back + warn to keep serving).
2. Consider enforcing the invariant earlier:
  - In `ConnectionContext::new`, if sender is Some and dir is None, set `spill_sender=None` (disable spilling) and emit a warning, or `debug_assert!`/`panic!` in debug builds.
3. Add a small unit/integration test that constructs a `ConnectionContext` with `spill_sender=Some` and `disk_offload_dir=None` and asserts spilling is disabled or rejected rather than writing to `.`.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ You are approaching your monthly quota for Qodo. Upgrade your plan

Qodo Logo

Comment on lines +332 to +337
fn sigkill(child: &mut Child) {
let pid = child.id() as i32;
unsafe {
libc::kill(pid, libc::SIGKILL);
}
let _ = child.wait();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

1. sigkill() unsafe missing safety 📘 Rule violation ≡ Correctness

A new unsafe block calls libc::kill without the required adjacent // SAFETY: justification,
violating the repository’s unsafe-code policy and making it harder for reviewers/auditors to
validate invariants and UB risk. This increases audit risk and can block automated unsafe audits.
Agent Prompt
## Issue description
A new `unsafe { libc::kill(...) }` usage was introduced without the policy-required adjacent `// SAFETY:` comment (per `UNSAFE_POLICY.md`), leaving the upheld preconditions/invariants undocumented.

## Issue Context
`UNSAFE_POLICY.md` mandates an explicit `// SAFETY:` justification for every unsafe block and prefers using safe alternatives when available (e.g., `Child::kill`) to reduce unsafe surface area. The new `sigkill()` helper currently calls `libc::kill` inside an unsafe block without documenting why this is safe.

## Fix Focus Areas
- tests/crash_recovery_disk_offload_no_aof.rs[331-339]
- tests/crash_matrix_per_shard_bgrewriteaof.rs[138-145]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

let total_entries: usize = completions.iter().map(|c| c.entries.len()).sum();
assert_eq!(total_entries, 1, "expected 1 entry across all completions");

let c = completions.iter().find(|c| !c.entries.is_empty()).unwrap();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

2. Unannotated .unwrap() in spill tests 📘 Rule violation ✧ Quality

New .unwrap() calls were added without the required #[allow(clippy::unwrap_used)] and an
immediately preceding justification comment. This breaks the unwrap-audit expectations and makes
future lint/audit enforcement unreliable.
Agent Prompt
## Issue description
New `.unwrap()` calls were introduced without the required `// ...` justification comment directly above an attached `#[allow(clippy::unwrap_used)]` attribute.

## Issue Context
The compliance rule requires every unwrap to be explicitly justified and locally allowed (or rewritten to avoid unwrap).

## Fix Focus Areas
- src/storage/tiered/spill_thread.rs[492-588]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Comment thread src/persistence/aof.rs
Comment on lines +217 to +327
pub struct PerShardRewriteCoord {
/// Writers still to finish. Starts at the shard count; the writer that
/// decrements it to zero performs the commit + prune.
remaining: std::sync::atomic::AtomicUsize,
/// Shared manifest, loaded fresh from disk by the BGREWRITEAOF handler at
/// rewrite time (normal appends never touch the manifest, and BGREWRITEAOF
/// is CAS-serialized, so a fresh load is the authoritative current state).
manifest: Arc<parking_lot::Mutex<crate::persistence::aof_manifest::AofManifest>>,
/// The generation every writer advances to. Computed once = old_seq + 1.
new_seq: u64,
/// The generation being retired; pruned only after the commit.
old_seq: u64,
/// Number of shards participating (= initial `remaining`).
n_shards: usize,
/// Set by any shard whose fold fails. The final writer checks this and
/// ABORTS the commit if set — committing `new_seq` while a shard never
/// wrote its new base would make recovery look for a missing base and
/// refuse to start. On abort the old generation (`old_seq`) stays the
/// committed state for all shards (crash-safe).
failed: std::sync::atomic::AtomicBool,
}

impl PerShardRewriteCoord {
/// Construct a coordinator for an `n_shards`-way rewrite advancing the
/// shared `manifest` from its current seq to `current_seq + 1`.
pub fn new(
manifest: Arc<parking_lot::Mutex<crate::persistence::aof_manifest::AofManifest>>,
current_seq: u64,
n_shards: usize,
) -> Arc<Self> {
Arc::new(Self {
remaining: std::sync::atomic::AtomicUsize::new(n_shards),
manifest,
new_seq: current_seq + 1,
old_seq: current_seq,
n_shards,
failed: std::sync::atomic::AtomicBool::new(false),
})
}

/// The generation writers advance to.
#[inline]
pub fn new_seq(&self) -> u64 {
self.new_seq
}

/// Mark the whole rewrite as failed (called by a shard whose fold errored).
/// The final writer will abort the commit, leaving `old_seq` authoritative.
#[inline]
pub fn mark_failed(&self) {
self.failed
.store(true, std::sync::atomic::Ordering::Release);
}

/// Called by each writer AFTER it has durably written its new base+incr at
/// `new_seq` and reopened its append file. Decrements the countdown; the
/// final caller commits the manifest (single seq flip) and prunes the old
/// generation, then clears the global in-progress flag.
///
/// Crash-safety: the commit (`write_manifest`) is the atomic flip point;
/// pruning runs strictly after it, so a crash mid-prune only orphans
/// already-superseded files (recovery uses `new_seq`).
pub fn shard_done(&self) {
use std::sync::atomic::Ordering;
// AcqRel: the decrement-to-zero must observe all prior writers'
// advance_shard manifest mutations before committing.
if self.remaining.fetch_sub(1, Ordering::AcqRel) == 1 {
// Abort if any shard failed to fold: committing new_seq while a
// shard lacks its new base would break recovery. Keep old_seq.
if self.failed.load(Ordering::Acquire) {
let m = self.manifest.lock();
// Best-effort: prune the orphaned new-seq files written by the
// shards that DID fold, so they don't linger.
for sid in 0..self.n_shards {
m.prune_shard_files(sid as u16, self.new_seq);
}
drop(m);
error!(
"F6 per-shard rewrite ABORTED: a shard failed to fold; seq stays {}. \
Old generation remains authoritative (crash-safe). A RESTART is \
recommended so successful shards' writers stop appending to the \
discarded new generation.",
self.old_seq
);
crate::command::persistence::AOF_REWRITE_IN_PROGRESS.store(false, Ordering::SeqCst);
return;
}
let mut m = self.manifest.lock();
m.seq = self.new_seq;
if let Err(e) = m.write_manifest() {
error!(
"F6 per-shard rewrite: final manifest commit (seq {}) failed: {}. \
Old generation remains authoritative; rewrite did not take effect.",
self.new_seq, e
);
// Do NOT prune — old generation is still the committed state.
drop(m);
crate::command::persistence::AOF_REWRITE_IN_PROGRESS.store(false, Ordering::SeqCst);
return;
}
for sid in 0..self.n_shards {
m.prune_shard_files(sid as u16, self.old_seq);
}
drop(m);
info!(
"F6 per-shard rewrite complete: committed seq {} across {} shards, pruned seq {}",
self.new_seq, self.n_shards, self.old_seq
);
crate::command::persistence::AOF_REWRITE_IN_PROGRESS.store(false, Ordering::SeqCst);
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

3. Pershardrewritecoord missing loom test 📘 Rule violation ▣ Testability

A new atomic coordination/state-machine type (PerShardRewriteCoord) was introduced without any
corresponding loom model test coverage. This risks concurrency bugs slipping through in a
multi-threaded rewrite coordinator.
Agent Prompt
## Issue description
A new atomic coordination type was added, but no loom model test was added/updated to exercise its concurrency behavior.

## Issue Context
The compliance requirement mandates loom-based model tests for new/modified atomic state machines, specifically in `tests/loom_response_slot.rs`.

## Fix Focus Areas
- src/persistence/aof.rs[217-327]
- tests/loom_response_slot.rs[1-120]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Comment thread src/main.rs
Comment on lines +758 to +770
let preserved_cold_wiring: Vec<
Vec<(
Option<std::path::PathBuf>,
Option<moon::storage::tiered::cold_index::ColdIndex>,
)>,
> = if disk_offload_base.is_some() {
shards
.iter_mut()
.map(|s| {
s.databases
.iter_mut()
.map(|db| (db.cold_shard_dir.take(), db.cold_index.take()))
.collect()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

4. Cold reads disabled in replay 🐞 Bug ≡ Correctness

src/main.rs temporarily take()s cold_shard_dir/cold_index before multi-part AOF replay and
only reattaches them after replay completes, so replayed commands cannot read disk-offloaded keys.
This can permanently corrupt recovered state for read-modify-write commands during replay (e.g.,
INCR reads via db.get() and will treat cold-only keys as missing).
Agent Prompt
## Issue description
During multi-part AOF replay, `cold_shard_dir` and `cold_index` are removed from every `Database` via `.take()` before calling the replay routines, and only restored after replay finishes. Because replay executes the normal command handlers, any replay-time read of a disk-offloaded key (e.g., INCR’s read-modify-write) cannot cold-read and will behave as if the key is missing, producing incorrect recovered state.

## Issue Context
- `rdb::load` swaps whole `Database` instances (`*live = temp`), which is why cold-tier fields get dropped.
- The current approach works around that swap by stashing cold wiring in `main.rs`, but it makes cold storage unavailable during the replay window.
- Fix must ensure cold wiring is present *while replaying incremental RESP commands*, not only after replay ends.

## Fix Focus Areas
- src/main.rs[739-775]
- src/main.rs[1054-1065]
- src/persistence/aof_manifest.rs[1299-1370]
- src/persistence/rdb.rs[362-367]

## Suggested fix approach
1. Change replay path so cold wiring is restored immediately after the base RDB load swap and **before** incremental command replay begins.
   - Option A: Add a recovery-only `rdb::load_preserving_wiring(...)` (or similar) that copies selected “topology” fields (cold wiring) from the current live DBs into the temp DBs right before the final swap, and use it from `replay_multi_part` / per-shard replay.
   - Option B: Have `replay_multi_part` / `replay_per_shard` accept a preserved-wiring bundle and reattach it right after `rdb::load(...)` returns and before `replay_incr_resp(...)`.
2. Remove (or narrow) the `.take()` window in `main.rs` so cold wiring remains available during replayed command execution.
3. Add a regression test that:
   - Sets up disk-offload state with a key only in cold storage.
   - Replays an AOF stream containing a read-modify-write command against that key (e.g., INCR).
   - Asserts the recovered value matches the expected incremented value.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/shard/conn_accept.rs (1)

368-396: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Migrated tokio connections still drop the disk-offload spill context.

spawn_tokio_connection() now wires spill_sender / spill_file_id / disk_offload_dir, but migrated tokio connections still build ConnectionContext with None, Cell(0), and None. After an affinity migration, that client falls back to the non-spilling eviction path even when disk-offload is enabled, so cold-tier durability depends on whether the connection migrated. This needs the same three parameters threaded through spawn_migrated_tokio_connection() and its call sites in src/shard/event_loop.rs.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/shard/conn_accept.rs` around lines 368 - 396, ConnectionContext is being
built with None/Cell(0)/None for spill_sender, spill_file_id, and
disk_offload_dir causing migrated tokio connections to lose disk-offload state;
update spawn_migrated_tokio_connection and all its call sites to accept and
forward the same spill_sender, spill_file_id, and disk_offload_dir parameters
that spawn_tokio_connection uses, then change the ConnectionContext::new call in
conn_accept.rs (the construction currently passing None, Rc::new(Cell::new(0)),
None) to pass the propagated spill_sender, spill_file_id, and disk_offload_dir
values so migrated connections retain the spill context.
src/server/embedded.rs (1)

221-223: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Port the main.rs startup fixes into embedded mode too.

run_embedded() still diverges from the binary on two functional paths: it only restores when persistence_dir is set, so --appendonly no + disk-offload loses cold-tier recovery after restart, and it still enables tokio/Linux per_shard_accept, which is the same accept-path hang/race main.rs just disabled. Embedded startup needs the same recovery gate and central-accept behavior as main.rs.

Also applies to: 359-365

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/server/embedded.rs` around lines 221 - 223, In run_embedded, make two
changes to match main.rs: (1) call shard.restore_from_persistence whenever
disk_offload_base.is_some() even if persistence_dir is None (i.e. change the
current if let Some(ref dir) = persistence_dir guard to a condition that
restores when persistence_dir.is_some() || disk_offload_base.is_some(), passing
persistence_dir.as_deref() as before) so cold-tier recovery runs with
--appendonly no + disk-offload; (2) disable per-shard accept so embedded uses
the central accept path like main.rs by clearing or forcing the per_shard_accept
flag (refer to the per_shard_accept variable/setting used elsewhere) before
spawning listeners; apply the same two edits to the other restore block that
currently mirrors this logic.
src/persistence/aof_manifest.rs (1)

2901-2999: 🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift

Split this module before adding more persistence logic.

src/persistence/aof_manifest.rs is already far past the 1500-line cap, and this change adds more rewrite/replay/test surface on top of that. Please break the per-shard rewrite/replay pieces into submodules before this grows further.

As per coding guidelines, "No single .rs file should exceed 1500 lines. Split into submodules if approaching this limit."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/persistence/aof_manifest.rs` around lines 2901 - 2999, The
aof_manifest.rs file has grown beyond the 1500-line guideline — move the
per-shard rewrite/replay logic and related tests into one or more submodules
(e.g., shard_rewrite.rs and shard_replay.rs). Concretely: extract functions and
helpers that operate per-shard (advance_shard, prune_shard_files,
shard_base_path_seq, shard_incr_path_seq, plus any helpers used only for
shard-level file ops and the test
advance_shard_defers_delete_until_after_commit) into a new submodule(s), keep
the manifest-level state (AofManifest struct, seq, write_manifest,
initialize_multi, load) in the top-level file, and re-export needed symbols with
pub use so callers still reference
AofManifest::{advance_shard,prune_shard_files,...} as before; update
imports/paths in tests to use the new module layout and ensure visibility
(pub(crate)/pub) matches previous access.
🟠 Major comments (16)
src/server/conn/handler_monoio/mod.rs-2049-2057 (1)

2049-2057: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Reply-loss now skips AOF for writes that may already have executed.

Once the PipelineBatch has been accepted, the target shard can apply the write and then lose the reply. This error branch returns early and drops meta, so the aof_bytes never reach Lines 2066-2095. That leaves the target shard’s in-memory state ahead of its AOF until a later rewrite, so a crash in that window can lose a write that may already be visible to clients. The durability boundary needs to move into the target shard’s PipelineBatch execution/ack path instead of depending on the reply round-trip.

Also applies to: 2066-2095

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/server/conn/handler_monoio/mod.rs` around lines 2049 - 2057, The error
branch for shard_responses (when Err(err_msg)) returns early and skips appending
aof_bytes for the corresponding PipelineBatch, risking reply-loss durability
gaps; modify the handler so that whenever a PipelineBatch was accepted by a
target shard (identify via PipelineBatch or meta entries and aof_bytes), you
still persist the batch to AOF before continuing or ensure the target shard
writes to AOF synchronously as part of its accept/ack path. Concretely: in the
shard_responses Err branch (and the same logic covering the range that currently
writes aof_bytes), either call the existing AOF append routine with the same
aof_bytes tied to the meta entries or change the target shard execution/ack code
to perform the AOF write at accept time (so PipelineBatch acceptance guarantees
persistence), then remove the early-return that skips AOF.
src/persistence/page_cache/mod.rs-93-107 (1)

93-107: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Lazy startup allocation still does not reclaim memory after eviction.

These buffers start empty now, but evict_cold_frames() only clears frame metadata; it never drops the backing Vec<u8> for evicted frames. After the working set shrinks, touched pages stay resident forever, so the memory-pressure cascade still cannot actually give PageCache bytes back before escalating to KV eviction.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/persistence/page_cache/mod.rs` around lines 93 - 107, evict_cold_frames
currently only clears frame metadata but leaves the backing buffers (buffers_4k
and buffers_64k) allocated, so freed pages never release heap capacity; update
evict_cold_frames to drop or replace the Vec<u8> backing an evicted frame (for
the 4k and 64k pools) — e.g. swap the slot's RwLock<Vec<u8>> contents with an
empty Vec::new() or call Vec::clear() plus shrink_to_fit() while holding the
lock — so that the heap allocation is released; locate uses in fetch_page,
FrameDescriptor and any code that indexes into buffers_4k / buffers_64k to
ensure the swap/replace is safe under the same locking scheme.
src/persistence/page_cache/mod.rs-68-71 (1)

68-71: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Minimum frame floors can still blow past the whole-instance budget.

per_shard_pagecache_budget() divides first, but these unconditional max(64) / max(8) floors add a fixed 768 KiB minimum per shard. Once whole_budget / num_shards drops below that, aggregate PageCache capacity exceeds the configured whole-instance budget again (for example, 4 MiB across 8 shards becomes 6 MiB of frame capacity). That breaks the new shard-budget contract on small-memory or high-shard-count deployments.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/persistence/page_cache/mod.rs` around lines 68 - 71, The function
pagecache_frame_counts currently forces fixed minimums (num_4k.max(64),
num_64k.max(8)) which can make per-shard capacity exceed the
per_shard_pagecache_budget; change pagecache_frame_counts to never increase
frame counts above what budget_bytes permits: compute num_4k and num_64k from
the 3:1 split as before, then remove the unconditional .max floor and instead
clamp/adjust the counts so (num_4k*PAGE_4K + num_64k*PAGE_64K) <= budget_bytes
(reduce 4K frames first or proportionally if needed) and ensure zero/low budgets
yield small counts rather than the fixed 64/8 floor; update any callers
(per_shard_pagecache_budget) expectations accordingly.
src/server/conn/handler_sharded/mod.rs-1283-1288 (1)

1283-1288: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't spill to the process CWD on a missing offload dir.

If ctx.spill_sender is set but ctx.disk_offload_dir is None, this still evicts the key and queues the spill under "./data". That hides broken wiring and can send cold files somewhere the shard's read-through/recovery code will never scan. Fail fast here instead of defaulting to ".".

Proposed fix
-                                    let dir = ctx
-                                        .disk_offload_dir
-                                        .as_deref()
-                                        .unwrap_or(std::path::Path::new("."));
+                                    let Some(dir) = ctx.disk_offload_dir.as_deref() else {
+                                        drop(rt);
+                                        return Err(Frame::Error(Bytes::from_static(
+                                            b"ERR disk offload misconfigured",
+                                        )));
+                                    };
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/server/conn/handler_sharded/mod.rs` around lines 1283 - 1288, The code
currently falls back to "." when ctx.spill_sender is Some but
ctx.disk_offload_dir is None; update the evict logic so that when
ctx.spill_sender.is_some() and ctx.disk_offload_dir.is_none() you fail fast
instead of defaulting to CWD: in the evict_result block (where ctx.spill_sender,
ctx.spill_file_id and the dir variable are used) add an explicit check for
ctx.disk_offload_dir.is_none() and return or propagate an error (or log and
abort the eviction) rather than using unwrap_or("."), ensuring spills are only
queued when disk_offload_dir is present. Ensure you reference ctx.spill_sender,
ctx.disk_offload_dir, and ctx.spill_file_id in the change so the code path
short-circuits appropriately.
src/storage/tiered/kv_spill.rs-336-356 (1)

336-356: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Sync the data directory after publishing the batch file.

file.sync_all() only persists the temporary file contents. Without an fsync on data_dir after rename, a crash can lose heap-*.mpf even though later code commits that file into the manifest, leaving recovery with a manifest entry that points at a missing spill file.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/storage/tiered/kv_spill.rs` around lines 336 - 356, The
write_kv_spill_batch function currently syncs only the temp file
(file.sync_all()) but not the containing directory, so post-rename the directory
entry for final_path may not be persisted; after the std::fs::rename(&tmp_path,
&final_path)? call, open the data_dir (e.g., with
OpenOptions::new().read(true).open(&data_dir) or File::open(&data_dir)) and call
sync_all() on that directory handle to ensure the directory metadata (the new
heap-*.mpf entry) is flushed to disk.
src/shard/persistence_tick.rs-405-438 (1)

405-438: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail closed when the manifest write fails.

If manifest.commit() errors here, the code still inserts every key into cold_index and starts serving that spill file immediately. After the next crash/restart, recovery never sees the file because it was never durably registered, so those evicted keys disappear. Skip indexing, or requeue the completion, until the manifest commit succeeds.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/shard/persistence_tick.rs` around lines 405 - 438, If manifest.commit()
fails after manifest.add_file (the block handling shard_manifest), do not
proceed to insert entries into the cold_index; instead treat the spill
completion as not durable and either requeue the completion or return an error
so it can be retried. Concretely: after calling manifest.commit() on the
manifest from shard_manifest, on Err(e) stop further processing of that
completion (do not run the ColdLocation insertion loop), and re-enqueue or
surface the completion so it can be retried later; ensure this change
short-circuits both the is_initialized()/with_shard_db path and the
shard_databases.write_db path that call ci.insert(...) so no keys are indexed
unless commit succeeded.
src/storage/tiered/cold_read.rs-48-59 (1)

48-59: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard the Unix-only read_exact_at path.

src/storage/tiered/cold_read.rs imports std::os::unix::fs::FileExt as _; unconditionally (only #[cfg(test)] is nearby), so non-Unix targets won’t compile. Keep the page-offset fast path under #[cfg(unix)]/#[cfg(target_family = "unix")] and add a #[cfg(not(unix))] fallback (e.g., seek + read_exact into leaf_buf, or the previous whole-file read) so this module builds everywhere.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/storage/tiered/cold_read.rs` around lines 48 - 59, The code
unconditionally imports std::os::unix::fs::FileExt and calls
file.read_exact_at(...) which breaks non-Unix builds; wrap the import and the
fast pread path in #[cfg(unix)] (or #[cfg(target_family = "unix")]) and provide
a #[cfg(not(unix))] fallback that uses file.seek(SeekFrom::Start(page_offset))
followed by file.read_exact(&mut leaf_buf) (or the previous whole-file read) to
fill leaf_buf for the specific page; apply these guards around the FileExt
import and the block that computes page_offset and calls read_exact_at so
symbols like file_path, file, page_offset, leaf_buf, PAGE_4K, and
location.page_idx remain usable on all platforms.
tests/crash_recovery_disk_offload_no_aof.rs-331-337 (1)

331-337: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add missing // SAFETY: for the libc::kill unsafe block in sigkill
In tests/crash_recovery_disk_offload_no_aof.rs (unix sigkill, lines ~331-338), unsafe { libc::kill(pid, libc::SIGKILL); } is missing a preceding // SAFETY: comment explaining the invariant that makes the call sound (or refactor to avoid unsafe). Note: scripts/audit-unsafe.sh greps src/**/*.rs, so this specific test may not be caught by that script, but it still violates the repo unsafe policy.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/crash_recovery_disk_offload_no_aof.rs` around lines 331 - 337, The
unsafe call to libc::kill inside the unix-only function sigkill lacks the
required safety comment; add a preceding "// SAFETY:" comment that documents the
invariants making the call sound (e.g., that child.id() yields a valid pid for
the child process, the pid value is non-zero and points to a process we
intentionally intend to SIGKILL, and that using libc::kill with SIGKILL is safe
here), or alternatively refactor sigkill to use a safe wrapper (e.g.,
nix::sys::signal::kill with nix::unistd::Pid) to avoid unsafe entirely;
reference the sigkill function and the libc::kill usage when making the change.
src/main.rs-790-850 (1)

790-850: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Refuse startup here instead of warning-and-continuing.

If this data dir has already been migrated to single-shard multi-part AOF, appendonly.aof may have been renamed to appendonly.aof.legacy by the monoio path above. In that case this tokio branch starts without replaying the authoritative manifest pair and silently serves stale/empty state. This should be a hard error until tokio can load single-shard multi-part AOF, not a warning.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/main.rs` around lines 790 - 850, The tokio branch currently logs a
warning when a multi-part AOF manifest exists but monoio-only single-shard
replay is unavailable, which can let the process start with stale/empty state;
change the #[cfg(not(feature = "runtime-monoio"))] block so it refuses startup
instead of continuing: detect the presence of the multi-part
manifest/appendonlydir (same condition the monoio branch would have used), and
return an error (propagate with anyhow::Context or similar) or exit with a
non-zero error rather than calling tracing::warn; update references inside
main's startup logic around num_shards, base_dir, and the legacy appendonly.aof
handling so that when a manifest is present and runtime-monoio is disabled the
program fails fast (mirroring the monoio path behavior that runs
replay_multi_part/retire legacy files) rather than continuing.
src/config.rs-32-34 (1)

32-34: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid deriving Default for ServerConfig (derived defaults don’t match clap CLI defaults).

ServerConfig derives Default, but many fields define clap default_value/default_value_t (e.g., bind="127.0.0.1", port=6379, appendonly="yes", protected_mode="yes", memory_arenas_cap=8). The derived Default will instead produce Rust defaults ("", 0, false, etc.), creating an invalid config if ServerConfig::default() is ever used.

Repo tests use ServerConfig::parse_from([]) for defaults, but the public derive remains a footgun. Remove the Default derive or implement Default by delegating to ServerConfig::parse_from([]) (so it matches the CLI defaults exactly).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/config.rs` around lines 32 - 34, ServerConfig currently derives Default
which yields Rust primitive defaults that differ from clap's CLI defaults;
remove the #[derive(Default)] or replace it with a custom impl Default that
delegates to clap's parsing so defaults match CLI behavior (e.g., implement
Default for ServerConfig by calling ServerConfig::parse_from(&[]) and returning
that value). Update the ServerConfig declaration (the struct with
#[derive(Parser, Debug, Clone, ...)] and any tests expecting
ServerConfig::default()) to use the new behavior so ServerConfig::default()
mirrors clap defaults exactly.
tests/crash_matrix_per_shard_bgrewriteaof.rs-27-27 (1)

27-27: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Gate this test to the runtime that actually supports the scenario.

The file comment says tokio rejects per-shard BGREWRITEAOF, but this cfg(any(...)) still compiles the tests for runtime-tokio. bgrewriteaof() then asserts the command is not refused, so an explicit tokio run fails by construction. Limit this file to monoio or add a tokio-specific expectation path.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/crash_matrix_per_shard_bgrewriteaof.rs` at line 27, The test is
compiled for both runtimes but only valid for monoio; update the test gating or
add a tokio branch: either change the module cfg from cfg(any(feature =
"runtime-monoio", feature = "runtime-tokio")) to only cfg(feature =
"runtime-monoio") so the file is only built for monoio, or keep both runtimes
and modify the bgrewriteaof() test to branch on the runtime (runtime-tokio
should assert the command is refused while runtime-monoio asserts success).
Refer to the test function bgrewriteaof() and the module-level cfg to implement
the appropriate gate or conditional expectation.
tests/crash_matrix_per_shard_bgrewriteaof.rs-138-145 (1)

138-145: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix missing // SAFETY: for unsafe { libc::kill(...) } in sigkill

tests/crash_matrix_per_shard_bgrewriteaof.rs (sigkill, lines ~138-145) has an unsafe { libc::kill(pid, libc::SIGKILL) } block without the mandatory // SAFETY: comment explaining the invariant (e.g., that pid comes from the live Child handle for the spawned process).

The existing #[cfg(unix)] vs #[cfg(not(unix))] structure matches other test helpers in this repo that also use libc::kill, so the immediate required fix is the missing safety rationale.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/crash_matrix_per_shard_bgrewriteaof.rs` around lines 138 - 145, Add a
`// SAFETY:` rationale above the unsafe block in the `sigkill` function
explaining why the call to `libc::kill(pid, libc::SIGKILL)` is safe: state that
`pid` is derived from a live `std::process::Child` handle returned by
`child.id()` and thus refers to a valid process id for the spawned child, and
that killing with `SIGKILL` is intentional for the test; keep this comment
adjacent to the `unsafe { libc::kill(...) }` block in `sigkill` so it documents
the required invariant.
src/server/listener.rs-315-328 (1)

315-328: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Skip binding the central TCP listener when per_shard_accept is enabled.

This still creates a REUSEPORT listener even when the main loop never calls accept() on it. On Linux multishard runs, the kernel can hash connections to that idle socket, so those clients get black-holed again. Bind the central TCP listener only in the !per_shard_accept path, or make the per-shard mode use no central TCP socket at all.

Suggested fix
-    #[cfg(unix)]
-    let listener = match crate::shard::conn_accept::create_reuseport_socket(&addr) {
-        Ok(std_listener) => TcpListener::from_std(std_listener)?,
-        Err(e) => {
-            tracing::warn!(
-                "central SO_REUSEPORT bind failed ({}); falling back to plain bind on {}",
-                e,
-                addr
-            );
-            TcpListener::bind(&addr).await?
-        }
-    };
-    #[cfg(not(unix))]
-    let listener = TcpListener::bind(&addr).await?;
+    let listener = if per_shard_accept {
+        None
+    } else {
+        #[cfg(unix)]
+        {
+            Some(match crate::shard::conn_accept::create_reuseport_socket(&addr) {
+                Ok(std_listener) => TcpListener::from_std(std_listener)?,
+                Err(e) => {
+                    tracing::warn!(
+                        "central SO_REUSEPORT bind failed ({}); falling back to plain bind on {}",
+                        e,
+                        addr
+                    );
+                    TcpListener::bind(&addr).await?
+                }
+            })
+        }
+        #[cfg(not(unix))]
+        {
+            Some(TcpListener::bind(&addr).await?)
+        }
+    };
...
-                if per_shard_accept {
+                if per_shard_accept {
                     std::future::pending::<std::io::Result<(tokio::net::TcpStream, std::net::SocketAddr)>>().await
                 } else {
-                    listener.accept().await
+                    listener.as_ref().expect("listener exists when per_shard_accept is false").accept().await
                 }

Also applies to: 403-408

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/server/listener.rs` around lines 315 - 328, The central TCP listener is
still being bound even when per_shard_accept is enabled, causing an idle
REUSEPORT socket to receive hashed connections; update the code around
create_reuseport_socket/TcpListener::bind so the listener is only created when
per_shard_accept is false (i.e. move the entire match that calls
create_reuseport_socket and TcpListener::bind into the !per_shard_accept branch
or conditionally skip binding when per_shard_accept is true), and apply the same
change to the other analogous block around lines 403-408 so no central socket is
bound in per-shard accept mode.
src/persistence/aof_manifest.rs-1246-1251 (1)

1246-1251: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Persist the rewrite cutoff LSN here, not the manifest sequence.

max_lsn is now defined as the next-free replication offset, but this stores new_seq (2, 3, …). After the first rewrite, recovery can seed master_repl_offset below already-persisted data and reissue duplicate LSNs. Thread the coordinated shard cutoff/end offset into advance_shard and persist that value instead.

Suggested direction
 pub fn advance_shard(
     &mut self,
     shard_id: u16,
     new_seq: u64,
+    max_lsn: u64,
     rdb_bytes: &[u8],
 ) -> Result<PathBuf, crate::error::MoonError> {
@@
-        self.shards[shard_idx].max_lsn = self.shards[shard_idx].max_lsn.max(new_seq);
+        self.shards[shard_idx].max_lsn = max_lsn;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/persistence/aof_manifest.rs` around lines 1246 - 1251, The code currently
sets self.shards[shard_idx].max_lsn =
self.shards[shard_idx].max_lsn.max(new_seq) in advance_shard, but max_lsn
represents the next-free replication offset and must store the coordinated shard
rewrite cutoff/end offset (the persisted rewrite cutoff LSN) rather than the
manifest sequence number new_seq; modify advance_shard to accept the computed
shard_cutoff (or end_offset) parameter from the caller and persist that value
into self.shards[shard_idx].max_lsn (use max to avoid regressions), ensuring the
function uses shard_cutoff instead of new_seq so recovery cannot reuse stale
LSNs and master_repl_offset remains monotonic.
src/persistence/manifest.rs-1284-1437 (1)

1284-1437: 🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift

This file is over the 1500-line limit.

The overflow-page implementation and new crash-path tests push src/persistence/manifest.rs past the repo cap. Splitting serialization/recovery helpers into submodules will make the dual-root and overflow invariants easier to audit.

As per coding guidelines, "No single .rs file should exceed 1500 lines. Split into submodules if approaching this limit."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/persistence/manifest.rs` around lines 1284 - 1437, The manifest.rs file
exceeds the 1500-line limit; refactor by extracting serialization/recovery and
overflow-related code into focused submodules so the top-level ShardManifest
implementation stays small and tests keep the same behavior. Identify the
helpers used by commit(), open(), and overflow handling (e.g., any functions
named like persist_overflow, recover_overflow, recover_root, serialize_root,
parse_overflow_pages) and move them into new submodules (serialization/recovery
and overflow), adjust their visibility (pub(crate) or pub) so
ShardManifest::commit, ShardManifest::open, gc_tombstones, add_file, remove_file
and the tests still compile unchanged, and update mod declarations and
use/imports in manifest.rs to delegate to the new modules while keeping the
public API intact. Ensure unit tests (test_overflow_*,
test_overflow_commit_crash_atomicity, test_overflow_compaction_bounds_growth,
test_overflow_tombstone_and_gc) continue to run by preserving behavior and error
paths after the split.
src/persistence/manifest.rs-438-450 (1)

438-450: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle torn overflow tails before computing start_page.

open() can recover from a partially written overflow run by falling back to the older root, but the next commit() appends from raw EOF and only debug_assert!s 4KB alignment. In release, a non-page-aligned EOF makes start_page floor to the previous page while the new overflow bytes are written at the unaligned offset, so the committed root points at the wrong location and the next reopen drops the new slot. Truncate or reject any eof % PAGE_4K != 0 tail before appending.

Suggested fix
-            let eof = self.file.seek(SeekFrom::End(0))?;
-            // The manifest is always a whole number of 4 KB pages.
-            debug_assert_eq!(eof % PAGE_4K as u64, 0);
+            let mut eof = self.file.seek(SeekFrom::End(0))?;
+            let misalignment = eof % PAGE_4K as u64;
+            if misalignment != 0 {
+                eof -= misalignment;
+                self.file.set_len(eof)?;
+                self.file.sync_data()?;
+            }
             let start_page = (eof / PAGE_4K as u64) as u32;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/persistence/manifest.rs` around lines 438 - 450, The overflow append path
computes start_page from EOF without handling a torn (non-4KB-aligned) tail,
which in release causes writes at the wrong offset; before computing start_page
in the block that writes overflow (the code around overflow_start_page and use
of PAGE_4K), detect if eof % PAGE_4K != 0 and truncate the file down to the
previous PAGE_4K boundary (or return an error) so subsequent writes are
page-aligned; implement this by checking eof =
self.file.seek(SeekFrom::End(0))?, computing eof_floor = (eof / PAGE_4K as u64)
* PAGE_4K as u64 and calling self.file.set_len(eof_floor) (or propagate an
error) before computing start_page and writing the overflow buffer.
🟡 Minor comments (5)
src/shard/dispatch.rs-690-693 (1)

690-693: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Drop the wall-clock assertion from this test.

start.elapsed() < 5ms measures scheduler noise as much as helper behavior, so it can flap on a loaded CI runner even when the fast path never sleeps. calls == 1 already proves the no-retry path; if you want timing coverage, use paused Tokio time instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/shard/dispatch.rs` around lines 690 - 693, Remove the flaky wall-clock
assertion that checks start.elapsed() < Duration::from_millis(5) in the test
(the assert! using start and the 5ms bound), because it measures scheduler
noise; instead rely on the existing calls == 1 check to prove the no-retry path.
If you need to assert timing behavior for the fast path, convert the test to use
paused Tokio time (tokio::time::pause/advance) and assert against simulated time
rather than real elapsed time. Ensure you remove the assert! line and keep the
calls variable assertion intact.
tests/multishard_serve_smoke.rs-106-110 (1)

106-110: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

safe_shards() still oversubscribes 1-core runners.

The lower clamp of 2 means a host with available_parallelism() == 1 still runs the “multishard” cases with 2 shards, which is exactly the oversubscription confound the comment says this helper avoids. Skip the multishard variants when fewer than 2 cores are available instead of forcing 2 here.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/multishard_serve_smoke.rs` around lines 106 - 110, The helper
safe_shards() forces at least 2 shards, causing 1-core runners to be
oversubscribed; change the lower clamp from 2 to 1 so it returns the actual
available_parallelism() (bounded above by 4) instead of forcing two shards —
update the safe_shards() implementation (which uses
std::thread::available_parallelism()) to use .clamp(1, 4) and ensure any test
harness that expects multishard behavior skips multishard variants unless
safe_shards() >= 2.
tests/multishard_serve_smoke.rs-37-44 (1)

37-44: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Avoid the probe-port race in the smoke harness.

free_port() releases the ephemeral listener before the child process binds, so a parallel test/process can steal that port and make this smoke test fail with the same “never served” symptoms it is trying to catch. Please retry port allocation/spawn on bind failure instead of treating the first chosen port as reserved.

Also applies to: 113-125

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/multishard_serve_smoke.rs` around lines 37 - 44, free_port() currently
picks and releases an ephemeral port before spawning the child, which allows
races; change the harness to loop: pick a candidate port (using the existing
free_port() helper or by trying to bind and immediately drop), spawn the child
process with that port, then probe/connect to the child (or parse its startup
output) with a short timeout; if the probe fails with a bind-refused or "address
already in use" condition, kill the child and retry with a new port up to a
small retry limit. Apply this retry-on-bind-failure pattern to the other
probe/spawn logic referenced (the code around lines 113-125) so the test only
fails after exhausting retries rather than on the first stolen port.
README.md-236-242 (1)

236-242: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify the guardrail caveats here.

This overstates the behavior a bit: omitting --maxmemory does not always force allkeys-lru, and it can skip auto-capping entirely when no memory limit is detectable. The doc should match resolve_memory_guardrail() and say the policy flip happens only from noeviction, otherwise the existing eviction policy is preserved.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` around lines 236 - 242, Update the README blurb about --maxmemory
to match resolve_memory_guardrail(): clarify that omitting --maxmemory does not
always force allkeys-lru — the code only flips the eviction policy to
allkeys-lru when it would otherwise be noeviction and a memory guardrail is
applied; if no memory limit is detectable the auto-cap can be skipped entirely
and an existing eviction policy is preserved. Reference the behavior of
resolve_memory_guardrail(), the flags --maxmemory and --shards, and the policies
allkeys-lru and noeviction so readers understand the exact conditional policy
flip and the per-shard cap calculation.
tests/crash_matrix_per_shard_bgrewriteaof.rs-281-281 (1)

281-281: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don't derive the port with +1; that socket was never reserved.

unique_port() only proves the returned port was free at probe time. saturating_add(1) picks a different port that may already be occupied, so this test can fail intermittently for reasons unrelated to BGREWRITEAOF. Call unique_port() again here instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/crash_matrix_per_shard_bgrewriteaof.rs` at line 281, The test currently
derives a second port by doing let port = unique_port().saturating_add(1); which
is unsafe because the +1 socket was never reserved; replace that expression by
calling unique_port() again to obtain an independently probed free port (i.e.,
call unique_port() twice rather than adding 1) so the variable assigned to port
is guaranteed to be probed free when used in the BGREWRITEAOF test.
🧹 Nitpick comments (1)
tests/autovacuum_daemon.rs (1)

14-28: 💤 Low value

Prefer parking_lot::Mutex for consistency with project guidelines.

The coding guidelines specify using parking_lot locks instead of std::sync locks across all Rust files. While this is test infrastructure (not a hot path), using parking_lot::Mutex maintains consistency and is simpler—parking_lot locks never poison, so you can replace the poison-recovery .unwrap_or_else(|e| e.into_inner()) with a plain .lock(). The non-poisoning behavior still prevents wedging sibling tests if one panics.

As per coding guidelines: "Use parking_lot::RwLock / parking_lot::Mutex — never std::sync locks."

♻️ Refactor to use parking_lot::Mutex
 use std::sync::atomic::Ordering;
-use std::sync::{Mutex, MutexGuard, OnceLock};
+use std::sync::OnceLock;
+use parking_lot::{Mutex, MutexGuard};

-/// Serializes the two tests that read/assert the process-global
-/// `RECL_AUTOVACUUM_LAST_RUN_TS`. The default parallel test runner otherwise lets
-/// `test_autovacuum_tick_updates_last_run_ts` advance that global while
-/// `test_disabled_autovacuum_is_noop` is mid-assertion → spurious cross-test
-/// failure (observed flaking under load). Poison-tolerant: a panicking holder
-/// must not wedge the sibling.
+/// Serializes the two tests that read/assert the process-global
+/// `RECL_AUTOVACUUM_LAST_RUN_TS`. parking_lot::Mutex never poisons, so a
+/// panicking holder will not wedge sibling tests.
 fn autovacuum_ts_lock() -> MutexGuard<'static, ()> {
     static GUARD: OnceLock<Mutex<()>> = OnceLock::new();
-    GUARD
-        .get_or_init(|| Mutex::new(()))
-        .lock()
-        .unwrap_or_else(|e| e.into_inner())
+    GUARD.get_or_init(|| Mutex::new(())).lock()
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/autovacuum_daemon.rs` around lines 14 - 28, The test helper
autovacuum_ts_lock currently uses std::sync::Mutex and poison-recovery
unwrap_or_else; replace std::sync::Mutex/MutexGuard with parking_lot::Mutex and
parking_lot::MutexGuard (keep std::sync::OnceLock or replace with equivalent if
desired), remove the poison-handling call and simply call .lock() on the
parking_lot mutex, and adjust imports accordingly; refer to the
autovacuum_ts_lock function and the GUARD static that protects
RECL_AUTOVACUUM_LAST_RUN_TS when making the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 28f3eec1-3613-4d85-bb70-498af67cfcdc

📥 Commits

Reviewing files that changed from the base of the PR and between 03ca4ae and 8e6eb89.

📒 Files selected for processing (61)
  • CHANGELOG.md
  • README.md
  • src/command/persistence.rs
  • src/config.rs
  • src/main.rs
  • src/persistence/aof.rs
  • src/persistence/aof_manifest.rs
  • src/persistence/manifest.rs
  • src/persistence/migrate_aof.rs
  • src/persistence/page_cache/mod.rs
  • src/server/conn/handler_monoio/mod.rs
  • src/server/conn/handler_sharded/mod.rs
  • src/server/conn/handler_single.rs
  • src/server/embedded.rs
  • src/server/listener.rs
  • src/shard/conn_accept.rs
  • src/shard/dispatch.rs
  • src/shard/event_loop.rs
  • src/shard/persistence_tick.rs
  • src/shard/spsc_handler.rs
  • src/shard/timers.rs
  • src/storage/eviction.rs
  • src/storage/tiered/cold_index.rs
  • src/storage/tiered/cold_read.rs
  • src/storage/tiered/kv_spill.rs
  • src/storage/tiered/spill_thread.rs
  • tests/adversarial_v0110_fix01_set_delete_rollback.rs
  • tests/adversarial_v0110_fix02_err_path_intent.rs
  • tests/adversarial_v0110_fix03_simplestring_graph.rs
  • tests/adversarial_v0110_fix04_shortest_path_call_parity.rs
  • tests/adversarial_v0110_fix06_shortest_path_min_hops.rs
  • tests/adversarial_v0110_fix07_multihop_edge_var_reject.rs
  • tests/autovacuum_daemon.rs
  • tests/crash_matrix_per_shard_aof.rs
  • tests/crash_matrix_per_shard_bgrewriteaof.rs
  • tests/crash_recovery_disk_offload_no_aof.rs
  • tests/ft_search_as_of_boundary.rs
  • tests/ft_search_as_of_filter.rs
  • tests/ft_search_concurrent_readers.rs
  • tests/ft_search_multi_shard_as_of.rs
  • tests/ft_search_temporal_parity.rs
  • tests/graph_bench_compare.rs
  • tests/graph_bench_e2e.rs
  • tests/graph_integration.rs
  • tests/graph_stress_deep.rs
  • tests/integration.rs
  • tests/kill_snapshot.rs
  • tests/lunaris_cypher_shortest_path.rs
  • tests/lunaris_cypher_temporal.rs
  • tests/lunaris_hybrid_ft_search.rs
  • tests/mq_integration.rs
  • tests/multishard_serve_smoke.rs
  • tests/pipeline_auto_index.rs
  • tests/replication_test.rs
  • tests/txn_completeness_edge_cases.rs
  • tests/txn_cypher_write_rollback.rs
  • tests/txn_ft_search_snapshot.rs
  • tests/txn_graph_wiring.rs
  • tests/txn_kv_wiring.rs
  • tests/vacuum_commands.rs
  • tests/workspace_integration.rs

Comment thread src/persistence/aof.rs
Comment on lines +283 to +315
if self.remaining.fetch_sub(1, Ordering::AcqRel) == 1 {
// Abort if any shard failed to fold: committing new_seq while a
// shard lacks its new base would break recovery. Keep old_seq.
if self.failed.load(Ordering::Acquire) {
let m = self.manifest.lock();
// Best-effort: prune the orphaned new-seq files written by the
// shards that DID fold, so they don't linger.
for sid in 0..self.n_shards {
m.prune_shard_files(sid as u16, self.new_seq);
}
drop(m);
error!(
"F6 per-shard rewrite ABORTED: a shard failed to fold; seq stays {}. \
Old generation remains authoritative (crash-safe). A RESTART is \
recommended so successful shards' writers stop appending to the \
discarded new generation.",
self.old_seq
);
crate::command::persistence::AOF_REWRITE_IN_PROGRESS.store(false, Ordering::SeqCst);
return;
}
let mut m = self.manifest.lock();
m.seq = self.new_seq;
if let Err(e) = m.write_manifest() {
error!(
"F6 per-shard rewrite: final manifest commit (seq {}) failed: {}. \
Old generation remains authoritative; rewrite did not take effect.",
self.new_seq, e
);
// Do NOT prune — old generation is still the committed state.
drop(m);
crate::command::persistence::AOF_REWRITE_IN_PROGRESS.store(false, Ordering::SeqCst);
return;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Abort paths still leave successful shards appending to uncommitted files.

By the time this branch runs, the successful writers have already reopened onto new_seq in do_rewrite_per_shard(). Returning here with manifest.seq still on old_seq means subsequent writes on those shards go to files recovery will ignore after a restart. Logging “restart recommended” is not enough here; this path needs an explicit rollback or a fail-fast shutdown before more writes are accepted.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/persistence/aof.rs` around lines 283 - 315, The abort branch after
checking self.failed currently returns with shards' writers already reopened on
self.new_seq (done in do_rewrite_per_shard()), causing successful shards to keep
appending to uncommitted files; fix by performing an explicit rollback or
fail-fast stop before returning: acquire the manifest lock (m) as done, call the
rollback routine that reopens each shard's writer back to self.old_seq (or
invoke the writer-shutdown/reopen function used by do_rewrite_per_shard for each
sid and ensure m.prune_shard_files is still called for any orphaned new_seq
files), set AOF_REWRITE_IN_PROGRESS.store(false, Ordering::SeqCst), and if a
safe rollback cannot be completed atomically, trigger a fail-fast termination so
no further writes to new_seq occur (ensure this uses the same writer-control
APIs used elsewhere rather than leaving writers active).

Comment thread src/persistence/aof.rs
Comment on lines +794 to +807
for s in &self.senders {
// Blocking send for guaranteed delivery — see the doc comment.
if s.send(AofMessage::RewritePerShard {
shard_dbs: shard_dbs.clone(),
coord: coord.clone(),
})
.is_err()
{
error!(
"F6 per-shard rewrite: a writer channel is disconnected; \
rewrite aborted (no manifest commit, old generation remains \
authoritative). Inspect AOF writer threads."
);
return Err(AofPoolSendError::SendFailed);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Handle partial fan-out failure before returning.

If one send() fails after earlier writers already received RewritePerShard, those writers can still fold and call shard_done(), but the failed sender and the unsent tail never decrement remaining. That leaves AOF_REWRITE_IN_PROGRESS stuck and can strand successful shards on new_seq files that will never be committed. Mark the coord failed and account for every unsent shard before returning so the dispatched writers can drive the abort path to completion.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/persistence/aof.rs` around lines 794 - 807, When a send to a writer fails
in the loop over self.senders while dispatching AofMessage::RewritePerShard
(with shard_dbs and coord), do not return immediately; instead mark the coord as
failed (e.g. call coord.mark_failed() or coord.fail()) and account for every
unsent shard by invoking the same completion path the writers use (e.g. call
coord.shard_done() once per unsent shard or otherwise decrement the remaining
counter for each shard that never received the message) before returning
Err(AofPoolSendError::SendFailed) so AOF_REWRITE_IN_PROGRESS is cleared and
successful shards can finish and commit new_seq as intended.

Comment thread src/persistence/aof.rs
Comment on lines +2787 to +2799
AofMessage::AppendSync {
lsn,
bytes: data,
ack,
} => {
write_framed(file, lsn, &data).map_err(|e| AofError::Io {
path: PathBuf::from("<aof per-shard incr drain>"),
source: e,
})?;
outcome.drained += 1;
// Durability for these is covered by the post-drain fsync at
// the rewrite boundary (mirrors drain_pending_appends).
let _ = ack.send(AofAck::Synced);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Do not ack AppendSync before the rewrite-boundary fsync succeeds.

Line 2799 sends AofAck::Synced before do_rewrite_per_shard() reaches its phase-1/phase-3 file.sync_data(). Under appendfsync=always, a client can therefore receive +OK during BGREWRITEAOF even though a crash in that window still loses the entry. Keep these ack senders pending and only fulfill them after the caller’s post-drain sync completes.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/persistence/aof.rs` around lines 2787 - 2799, The current code sends
AofAck::Synced immediately inside the AofMessage::AppendSync arm (after calling
write_framed), which can ack clients before the rewrite-boundary fsync in
do_rewrite_per_shard() completes; instead, buffer the ack sender for later and
only send AofAck::Synced after the caller’s post-drain sync finishes.
Concretely: in the AppendSync branch (where write_framed is called and ack is
available) stop calling ack.send(AofAck::Synced) and push the ack into a pending
list/vec (e.g., outcome.pending_syncs or a new field) so do_rewrite_per_shard()
can iterate and fulfill those senders only after its phase-1/phase-3
file.sync_data() succeeds; ensure any error paths still notify or drop the
sender appropriately.

Comment on lines +155 to +233
// ── Partition into inline candidates and oversized entries ────────────────
// We use indices to avoid re-allocating keys. `inline_indices` are the
// positions in `buffer` of entries that fit the inline threshold.
let mut inline_indices: Vec<usize> = Vec::with_capacity(buffer.len());
let mut oversized_indices: Vec<usize> = Vec::new();

for (i, req) in buffer.iter().enumerate() {
if req.value_bytes.len() <= INLINE_MAX_VALUE_BYTES {
inline_indices.push(i);
} else {
oversized_indices.push(i);
}
}

// ── Write ONE batch file for all inline entries ───────────────────────────
if !inline_indices.is_empty() {
// Use the file_id of the first inline entry for the batch file.
let file_id = buffer[inline_indices[0]].file_id;
let shard_dir = buffer[inline_indices[0]].shard_dir.clone();

let spill_entries: Vec<SpillEntry> = inline_indices
.iter()
.map(|&i| SpillEntry {
key: buffer[i].key.clone(),
value_bytes: buffer[i].value_bytes.clone(),
value_type: buffer[i].value_type,
flags: buffer[i].flags,
ttl_ms: buffer[i].ttl_ms,
})
.collect();

let completion = match build_kv_spill_batch(&spill_entries, file_id) {
Ok(batch) => {
let total_pages = batch.leaves.len() as u32; // overflow is always empty (inline-only)
match write_kv_spill_batch(&shard_dir, file_id, &batch) {
Ok(byte_size) => {
let entries = inline_indices
.iter()
.zip(batch.locations.iter())
.map(|(&buf_idx, &(page_idx, slot_idx))| SpillCompletionEntry {
key: buffer[buf_idx].key.clone(),
db_index: buffer[buf_idx].db_index,
page_idx,
slot_idx,
})
.collect();
SpillCompletion {
file_entry: make_file_entry(file_id, total_pages, byte_size),
entries,
success: true,
}
}
Err(e) => {
warn!(
file_id,
error = %e,
count = inline_indices.len(),
"spill_thread: inline batch write failed"
);
SpillCompletion {
file_entry: make_file_entry(file_id, 0, 0),
entries: Vec::new(),
success: false,
}
}
}
}
Err(e) => {
warn!(
file_id,
error = %e,
count = inline_indices.len(),
"spill_thread: inline batch build failed"
);
SpillCompletion {
file_entry: make_file_entry(file_id, 0, 0),
entries: Vec::new(),
success: false,
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Don't fail the entire inline flush when one entry misses the inline fit heuristic.

This pre-screen only checks value_bytes.len(), so a large key or an incompressible ~3.5KB value can still make build_kv_spill_batch() return InvalidData on a fresh leaf. That error path currently turns one bad candidate into a failed completion for the entire inline buffer, even though async eviction has already removed all of those keys from RAM. Please retry the offending entry via build_kv_spill_pages() (or split/rebuild the batch) instead of dropping the whole flush.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/storage/tiered/spill_thread.rs` around lines 155 - 233, The current logic
treats any error from build_kv_spill_batch(...) as a fatal failure for the whole
inline_indices batch; instead, when build_kv_spill_batch fails (e.g.,
InvalidData due to a single oversized or incompressible entry), retry the
entries individually: for each index in inline_indices call
build_kv_spill_pages() (or otherwise build a per-entry batch/page) for that
single SpillEntry and then write it with the same writer (e.g.,
write_kv_spill_batch or the per-page writer), collecting successful
SpillCompletionEntry items; on per-entry failure log and skip that entry but do
not mark the entire SpillCompletion as failed if at least one entry succeeded,
and only return a failed SpillCompletion when none of the entries could be
written. Update the error branch around build_kv_spill_batch(...) to implement
this per-entry retry loop and to populate
SpillCompletion.file_entry/entries/success accordingly.

Comment on lines +407 to +419
match completion_tx.try_send(completion) {
Ok(()) => {}
Err(flume::TrySendError::Full(_)) => {
SPILL_COMPLETION_DROPPED.fetch_add(1, Ordering::Relaxed);
warn!(
"spill_thread: completion channel full, dropping completion (total dropped: {})",
SPILL_COMPLETION_DROPPED.load(Ordering::Relaxed)
);
}
Err(flume::TrySendError::Disconnected(_)) => {
// Event loop dropped its receiver -- shutting down; ignore.
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Completion drops are not safe here.

A dropped SpillCompletion never reaches apply_spill_completions(), so the file is not added to the manifest and its keys are not reinserted into cold_index. By this point the keys have already been evicted from RAM, so try_send(...Full) turns backpressure into immediate misses and permanent loss after restart.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/storage/tiered/spill_thread.rs` around lines 407 - 419, The code
currently drops SpillCompletion when completion_tx.try_send(...) returns Full,
causing lost completions and manifest/cold_index corruption; change the send
logic in spill_thread to guarantee delivery of the SpillCompletion (do not
increment SPILL_COMPLETION_DROPPED or drop the value). Replace the try_send
branch with a blocking or async send loop using completion_tx.send(completion)
(or await completion_tx.send_async/completion_tx.send with retry on Full) and
only treat Disconnected as shutdown; ensure the SpillCompletion is retried until
sent so apply_spill_completions() always receives it and keys can be reinserted
into cold_index.

Self-review of PR #136 (BUG #1, MEDIUM, in-scope to f5e17ac). ShardManifest::
compact() rewrites the manifest to a temp file, fsyncs, and renames it over
self.path (the durability point), then reopens self.file against the new inode.
If that reopen failed, the error propagated out of compact() and was swallowed
by commit() (warn-only, "commit already durable"). But self.file then still
referred to the PRE-rename, now-orphaned inode: every SUBSEQUENT commit would
seek+write through that detached handle, silently discarding all post-compact
manifest updates while recovery (which reads self.path) saw only the compaction
snapshot. Cold-tier file entries committed after the failed reopen were lost on
restart. Trigger: reopen fails (fd exhaustion / permission / mount transient)
while overflow is in use — rare, but silent and persistent when hit.

Fix: on reopen failure, set a `needs_reopen` flag instead of leaving a stale
handle. commit() checks the flag FIRST and reattaches to self.path before any
write (resetting active_slot to 0, matching the compacted layout). If that
reattach also fails, commit() returns the error — the data genuinely cannot be
persisted, so fail loudly rather than write into a dead fd.

TDD: compact_reopen_failure_does_not_silently_lose_later_commits drives a
compact() with a cfg(test) reopen-failure injection (points self.file at a
throwaway inode, the real "orphaned fd"), then commits another entry and asserts
it survives recovery. RED before the guard (entry 999 missing after reopen);
GREEN after. Full persistence::manifest module green on both runtimes.

author: Tin Dang <tin.dang@trustifytechnology.com>
Self-review of PR #136 (BUG #2, MEDIUM). The tokio per-shard AOF writer
(per_shard_aof_writer_task) wrote each record as [u64 lsn][u32 len][payload] by
issuing the header write_all then the payload write_all, logging+`continue` on
either failure — with NO write_error latch. The single-file writer (aof.rs
~:1467) and the monoio per-shard writer (~:2125) both carry that latch; the
tokio per-shard path did not. A torn write (header lands, payload fails) left an
orphaned 12-byte header; the NEXT Append wrote a fresh header right after it, so
the framed reader on replay misread the orphan's `len` and consumed the
following record's bytes as payload — corrupting every record after the tear.
Pre-existing since the per-shard foundation (#129, 8c49ec2), but this branch's
8853ade makes the tokio per-shard writer live for the first time, so the latent
bug is now reachable.

Fix: mirror the monoio latch. A `write_error` flag is set on any header/payload/
flush/sync failure; once set, Append drops silently (fire-and-forget) and
AppendSync replies WriteFailed (so the caller errors instead of hanging to the
F2 fsync timeout or ack'ing a write into a corrupt stream) — never emitting
another record. The writer stays latched for its lifetime; recovery replays the
clean prefix and a rewrite starts a fresh file. Zero behavior change on the
happy path (flag never set).

TDD: tokio_per_shard_writer_latches_after_torn_write drives the real writer task
against a real PerShard manifest, injects a torn write on the 2nd Append via a
cfg(test) atomic (TEST_FAIL_WRITE_AT — not an env var; edition-2024 set_var is
unsafe), and asserts (a) a following AppendSync returns WriteFailed and (b) the
on-disk incr replays as exactly the pre-tear record with no corruption. RED
before the latch (AppendSync acked Synced; stream corrupt); GREEN after. Full
persistence::aof module green on both runtimes; fmt + clippy clean.

author: Tin Dang <tin.dang@trustifytechnology.com>
@pilotspacex-byte

Copy link
Copy Markdown
Contributor Author

Self-review (20 commits) — 2 MEDIUM durability bugs found + fixed

Ran a 3-way adversarial review of the whole branch. The headline multishard
RAM/serving fix verified sound. Two MEDIUM durability holes in the bundled
AOF/disk-offload work were found, hand-confirmed, and fixed here via red/green TDD:

  • bfe8b4f — manifest compact reopen-failure stale handle (in-scope to f5e17ac).
    compact() renamed a fresh manifest durably, then reopened self.file. If the
    reopen failed, the error was swallowed and the stale (orphaned-inode) handle was
    reused → every later commit silently lost. Fix: needs_reopen flag; commit()
    reattaches to the live file before writing, or fails loudly.
  • 9f91381 — tokio per-shard AOF writer missing write_error latch (pre-existing
    from fix(persistence): multi-shard AOF gate + per-shard AOF foundation (Option B step 1) #129, now reachable via 8853ade).
    A torn write (header lands, payload fails)
    left an orphaned header; the next Append corrupted the frame stream on replay.
    Fix: mirror the monoio/single-file latch — drop further Appends, reply WriteFailed
    to AppendSync.

Each fix has a RED→GREEN regression test (RED verified by bypassing the check).
fmt + clippy clean on both feature sets; persistence::manifest+aof modules green
on monoio (53) and tokio (57).

Lower-severity follow-ups (NOT in this PR — tracked)

  • LOW-MED: apply_spill_completions failed manifest commit leaves a cold entry
    unpersisted until the next commit (crash window).
  • LOW: panic in do_rewrite_per_shard (behind --experimental-per-shard-rewrite)
    leaves AOF_REWRITE_IN_PROGRESS stuck + kills the shard writer.
  • LOW: multi-DB (SELECT >0) cold recovery writes to db0 only.

@pilotspacex-byte

Copy link
Copy Markdown
Contributor Author

Filed the three LOW-severity follow-ups: #137 (cold-manifest failed-commit window), #138 (per-shard rewrite panic wedges flag + writer), #139 (multi-DB cold recovery restores db0 only).

…overflow DoS

The precedence-climbing recursive-descent expression grammar pushes ~9
stack frames per source nesting level (parse_expression -> or -> and ->
not -> comparison -> addition -> multiplication -> unary -> primary ->
recurse), but check_depth() only counts one source level per recursion.
With the previous limit of 64, a pathologically nested query descended
~64 levels (~576 debug frames) BEFORE the guard could fire at level 65,
overflowing the stack.

Cypher is parsed on async worker threads (tokio/monoio default to 2 MiB
stacks, not the 8 MiB main thread), so this overflow aborted the whole
process with SIGABRT instead of returning NestingDepthExceeded — a real
remote DoS, and a red CI "Check" job (test_nesting_depth_exceeded
overflowed its stack).

Introduce DEFAULT_MAX_NESTING_DEPTH = 32, threaded through parse_cypher,
the test harness, and the public re-export. 32 levels keeps the worst-
case descent to ~288 debug frames (well under 2 MiB, ~50% margin vs the
overflowing 64) while still allowing far deeper nesting than any real
query needs. Deep nesting now returns NestingDepthExceeded gracefully,
never aborts.

Verified in OrbStack VM (cgroup-capped): test_nesting_depth_exceeded and
test_nesting_within_limit both pass, full graph:: module 368/368 green,
fmt + clippy clean under --features graph and
--no-default-features --features runtime-tokio,jemalloc,graph.

author: Tin Dang <tin.dang@trustifytechnology.com>
…pher depth bound

Document the user-facing behavior changes shipped in PR #136:

- CLAUDE.md "Environment Variables": add `MOON_URING=1` (opt into the
  tokio->io_uring bridge, now default-off because it floods errors under
  load) and clarify that `MOON_NO_URING=1` force-disables io_uring on both
  the monoio runtime and the tokio bridge.
- CHANGELOG.md [0.2.0-alpha]: new "Fixed" subsection for PR #136 covering
  the three multishard RAM-zombie root causes (PageCache eager pre-alloc,
  tokio listener SO_REUSEPORT bind-race, io_uring-under-tokio default-off),
  the two in-PR durability TDD fixes (manifest compact-reopen latch, tokio
  per-shard AOF write_error latch), the Cypher nesting-depth bound, and the
  three low-severity follow-ups (#137/#138/#139).

author: Tin Dang <tin.dang@trustifytechnology.com>
@pilotspacex-byte pilotspacex-byte merged commit ea25177 into main Jun 4, 2026
8 checks passed
TinDang97 pushed a commit that referenced this pull request Jun 4, 2026
…ration

spawn_tokio_connection, spawn_monoio_connection and spawn_migrated_monoio_
connection all thread the disk-offload spill context (spill_sender / spill_file_id
/ disk_offload_dir) into the ConnectionContext, but spawn_migrated_tokio_connection
built it with None / Cell(0) / None. After an affinity migration the tokio
connection therefore fell back to the non-spilling eviction path even when
disk-offload was enabled, so cold-tier durability silently depended on whether
the connection had migrated. CodeRabbit flagged this on PR #136 (review
4420584384); validated against the merged code (issue #141).

Fix: thread spill_sender / spill_file_id / disk_offload_dir through
spawn_migrated_tokio_connection and pass them to ConnectionContext::new, mirroring
the three sibling spawn paths. Updated both call sites in event_loop.rs.

Validation: this is mechanical parity with three already-correct sibling
functions in the same file. A genuine red/green unit test is disproportionate
here — ConnectionContext::new takes 26 args and is consumed inside a spawned
task, so asserting the migrated context's spill fields would require stubbing the
entire context or simulating a live cross-shard connection migration. Verified
instead by: clean build + clippy under both default (monoio) and
--no-default-features --features runtime-tokio,jemalloc, and structural parity
(all four spawn paths now pass the spill context). The behavioural guarantee is
covered by the existing disk-offload read-through integration suite.

author: Tin Dang <tin.dang@trustifytechnology.com>
TinDang97 pushed a commit that referenced this pull request Jun 4, 2026
…ard accept

run_embedded() diverged from the production main.rs startup path on two
functional points (CodeRabbit PR #136 review 4420584384, issue #142):

1. Recovery only ran when `persistence_dir` was set, so `--appendonly no` +
   disk-offload (persistence_dir = None, disk_offload_base = Some) skipped
   `Shard::restore_from_persistence` entirely — cold-tier keys were
   unrecoverable after an embedded restart. Now gated on
   `should_run_recovery(persistence_dir, disk_offload_enabled)` =
   persistence OR disk-offload, recovering from `config.dir` when
   persistence_dir is None (mirrors main.rs:702-704).
2. Embedded forced `per_shard_accept = cfg!(target_os = "linux")`, re-enabling
   the tokio/Linux per-shard accept bind-order race that PR #136 disabled in
   main.rs. Now `false` (central accept), matching main.rs.

Validated in OrbStack VM (cgroup-capped), red->green TDD under
--no-default-features --features runtime-tokio,jemalloc (embedded is
runtime-tokio-only): recovery_runs_for_disk_offload_without_persistence
(RED with the disk-offload term dropped -> GREEN). fmt + clippy clean under
both default (monoio) and tokio,jemalloc.

Note: a full embedded+disk-offload+restart integration test was not added —
there is no run_embedded test harness to extend and standing it up for a
2-line parity fix is disproportionate; the behavioural recovery path is the
same one covered by tests/crash_recovery_disk_offload_no_aof.rs on main.rs.
The extracted should_run_recovery predicate unit-tests the exact gate that
regressed.

author: Tin Dang <tin.dang@trustifytechnology.com>
TinDang97 pushed a commit that referenced this pull request Jun 4, 2026
…tway

try_send_rewrite_per_shard fans the RewritePerShard message out to every
shard writer with a blocking send. If a writer channel is disconnected the
send fails and the function returned Err immediately — but the writers that
ALREADY received the message decrement the PerShardRewriteCoord countdown
when they fold, while the failed writer and the unsent tail never do. The
countdown therefore never reaches zero, no shard runs the terminal branch,
and AOF_REWRITE_IN_PROGRESS (set by bgrewriteaof_start_sharded before
dispatch) is wedged true forever — every future BGREWRITEAOF is silently
rejected for the lifetime of the process. Same wedged-flag family CodeRabbit
flagged on PR #136 review 4420584384 (related to #138).

Fix: on send failure, call coord.mark_failed() then coord.shard_done() once
for each shard from the failing index through the tail (the failed shard plus
every unsent one). The already-dispatched writers still decrement on fold, so
the countdown reaches zero exactly once; mark_failed() is published before the
tail decrements, so whichever decrement is terminal takes the abort path —
keeps old_seq authoritative (no manifest commit) and clears
AOF_REWRITE_IN_PROGRESS. The error log now names the offending writer index.

Validated in OrbStack VM (cgroup-capped), red/green TDD:
rewrite_fan_out_partial_failure_clears_in_progress_flag disconnects shard-0's
writer, asserts the fan-out returns Err AND the in-progress flag is cleared
(RED before the accounting fix — flag stayed true; GREEN after). fmt + clippy
clean under both default (monoio) and tokio,jemalloc.

author: Tin Dang <tin.dang@trustifytechnology.com>
TinDang97 pushed a commit that referenced this pull request Jun 4, 2026
…n on abort

A per-shard BGREWRITEAOF reopens each shard's append file onto the NEW
generation in phase 6 (do_rewrite_per_shard) BEFORE the coordinator decides
whether the rewrite commits. If the rewrite then aborts — another shard failed
to fold, or the final manifest commit errored — the manifest keeps old_seq and
the abort path prunes the new-seq files, but the already-folded writers keep
appending into the now-discarded new-seq incr. Recovery resolves base/incr by
the committed seq (old_seq), so every post-abort write is silently lost. The
old code acknowledged this with a "RESTART recommended" log; a server that is
not restarted loses data. CodeRabbit flagged this on PR #136 review
4420584384.

Fix — barrier-before-resume. PerShardRewriteCoord now publishes the committed
generation exactly once from the terminal shard_done (new_seq on success,
old_seq on abort/commit-failure) via a parking_lot Mutex<Option<u64>> +
Condvar. After phase 7 each folded writer blocks in await_outcome and, if the
committed seq != new_seq, reopens its append file onto the committed seq's incr
(shard_incr_path_seq) — so it resumes appending into the authoritative
generation with no restart and no loss. On the happy path committed == new_seq
and the barrier is a no-op beyond unblocking.

The barrier turns "every shard calls shard_done exactly once" into a LIVENESS
requirement: a writer that decrements then blocks hangs forever if another
shard never decrements. The one path that skips shard_done is a panic mid-fold
(save_snapshot_to_bytes OOM-unwinding — issue #138), which under the default
panic=unwind would now hang every healthy shard's AOF writer thread. Closed
with a ShardDoneGuard: the fold creates it on entry and calls complete() on
success (clean shard_done); on any ?-error or panic-unwind its Drop fires
mark_failed + shard_done, so the countdown always closes, the terminal writer
always publishes, and every waiter wakes and rolls back. The guard now owns the
single decrement for ALL exits, so both writer-task error arms (monoio + tokio)
no longer decrement after do_rewrite_per_shard (was a double-decrement risk).
This incidentally closes the #138 panic-wedge.

Each per-shard writer runs on its own dedicated OS thread (main.rs spawns
aof-writer-{sid} → block_on_local), so blocking one at the barrier never
starves the thread that must publish — no deadlock on either runtime.

Validated in OrbStack VM (cgroup-capped), red/green TDD:
- rewrite_abort_reopens_writer_onto_committed_old_generation: drives one shard's
  real fold with a second shard pre-failed; asserts the new-gen incr is pruned
  AND post-abort appends through *file land in the committed old-gen incr (RED
  before phase 8 — writes went to the pruned inode; GREEN after).
- rewrite_abort_wakes_waiter_cross_thread: real condvar wait/notify across
  threads observes old_seq.
- rewrite_fold_panic_releases_barrier_for_other_writers: a ShardDoneGuard
  dropped during catch_unwind releases a blocked waiter with old_seq (hangs
  forever without the guard).
No regression: per-shard AOF SIGKILL recovery (crash_matrix_per_shard_aof, 3)
and per-shard BGREWRITEAOF crash recovery (crash_matrix_per_shard_bgrewriteaof,
2) stay green. fmt + clippy clean under both default (monoio) and
tokio,jemalloc.

author: Tin Dang <tin.dang@trustifytechnology.com>
TinDang97 pushed a commit that referenced this pull request Jun 4, 2026
During a BGREWRITEAOF drain, both drain_pending_appends (legacy/TopLevel,
default --shards 1 path) and drain_pending_appends_framed (per-shard) wrote the
appended bytes and immediately acked AofAck::Synced — BEFORE the post-drain
boundary sync_data(). Under appendfsync=always a client received +OK (durable)
for an entry that a crash in the ack→fsync window would lose. CodeRabbit flagged
this on PR #136 review 4420584384 (issue #140); the always-fsync durability
contract requires the bytes to be on disk before the client is told Synced.

Fix: park the AppendSync ack senders in DrainOutcome.pending_acks instead of
acking inside the drain. New sync_and_fulfill_drain() performs the boundary
fsync and only then resolves the parked acks — Synced on success, or
FsyncFailed plus a propagated IO error on failure (never a false Synced). All
six drain sites across do_rewrite_per_shard, do_rewrite_single and
do_rewrite_sharded route through it. A write error inside the drain still drops
the parked acks with the outcome (RecvError → hard failure), unchanged.

Validated in OrbStack VM (cgroup-capped), red/green TDD:
- drain_framed_parks_appendsync_ack_until_boundary_fsync: asserts the drain
  PARKS the ack (pending_acks.len()==1, not sent) and that it resolves Synced
  only after sync_data() (RED before: no pending_acks field / acked in drain).
- drain_single_fulfills_fsync_failure_as_fsync_failed: the DEFAULT --shards 1
  non-framed path must resolve FsyncFailed (never Synced) when the boundary
  fsync fails.
No regression: full aof lib suite (66) green; bgrewriteaof TTL emission tests
green; per-shard AOF SIGKILL + BGREWRITEAOF crash-recovery suites green. fmt +
clippy clean under both default (monoio) and tokio,jemalloc.

Note: tests/aof_fsync_err_subscribe_ordering single-shard variant (an #[ignore]d
normal-append-path test, multi-shard twin passes) fails identically on HEAD
without this change — pre-existing and unrelated to the rewrite-drain path
touched here.

author: Tin Dang <tin.dang@trustifytechnology.com>
TinDang97 pushed a commit that referenced this pull request Jun 4, 2026
…00-line cap

src/persistence/aof_manifest.rs had grown to 3058 lines — past the 1500-line
cap in CLAUDE.md. CodeRabbit flagged this on PR #136 review 4420584384 (issue
#143). Converted the file into a directory module and relocated cohesive
chunks, preserving every public path via re-export.

Layout (all three files now under the cap):
- aof_manifest/mod.rs (1395): AofManifest state + manifest-level I/O
  (paths, initialize/load, parse_v1/v2, cleanup_orphans, write_manifest,
  advance, migrate, verify, global_max_lsn) + manifest-core unit tests.
- aof_manifest/shard_replay.rs (1159): the multi-part / per-shard replay free
  functions (replay_multi_part, replay_incr_resp, replay_incr_framed,
  replay_per_shard, replay_ordered_merge) + OrderedEntry + the replay unit
  tests.
- aof_manifest/shard_rewrite.rs (577): the per-shard rewrite inherent methods
  (initialize_multi, try_initialize_multi, advance_shard, prune_shard_files) as
  `impl AofManifest` + the shard-rewrite unit tests.

Mechanics:
- `use super::*` in each submodule pulls the parent's types, std/tracing
  imports, consts, and private helpers down (a child module sees the parent's
  private items), so no import duplication beyond the test modules.
- Relocated public replay API is re-exported from mod.rs
  (`pub use shard_replay::{OrderedEntry, replay_multi_part, replay_ordered_merge,
  replay_per_shard}`) so external callers
  (`crate::persistence::aof_manifest::replay_*`) keep resolving unchanged.
- Per CLAUDE.md the cap and the "tests stay in mod.rs" convention can't both
  hold for a 3058-line file (tests alone are ~1100 lines); tests were moved next
  to the code they cover, each submodule getting its own temp_dir() copy with a
  distinct directory prefix (replay-/rewrite-) so the independent static
  counters never collide.

No behaviour change. Validated in OrbStack VM: 30 aof_manifest unit tests green;
per-shard AOF SIGKILL recovery (3) and per-shard BGREWRITEAOF crash recovery (2)
green (exercise the relocated replay_per_shard / advance_shard / initialize_multi
via recovery). fmt + clippy clean under both default (monoio) and tokio,jemalloc.

author: Tin Dang <tin.dang@trustifytechnology.com>
TinDang97 pushed a commit that referenced this pull request Jun 4, 2026
…ion entry

CI Lint gate requires a CHANGELOG.md entry (or the skip-changelog label) per PR.
Document the PR #144 scope under [0.2.0-alpha] Unreleased: the 8 CodeRabbit
PR #136 durability follow-ups, the two PR #144-review Majors (async-spill
eviction regression test, cfg(unix) migration gate), the aof_manifest.rs and
aof.rs decompositions, and the VECTOR_INDEXES RwLock test-isolation fix.

author: Tin Dang <tin.dang@trustifytechnology.com>
TinDang97 added a commit that referenced this pull request Jun 4, 2026
fix(persistence): close 8 CodeRabbit PR #136 durability findings + aof_manifest split
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants