fix(shard): stable multishard + disk-offload on both runtimes; kill the multishard RAM zombie#136
Conversation
The originally-reported "zombie RAM in multishard" is, at root, unbounded keyspace growth under the default `--maxmemory 0` (unlimited) + `noeviction` policy: every distinct key adds memory that is never reclaimed, so a write loop with unique keys grows RSS until the OOM killer fires. This is data, not a leak — but the out-of-the-box default offered no protection. G1 adds an opt-out-by-default guardrail: when `--maxmemory` is omitted, Moon auto-caps at ~80% of the detected memory limit and switches a `noeviction` policy to `allkeys-lru`, logging a one-line startup notice. This bounds memory by default while staying fully Redis-compatible for operators who want the old behavior. `--maxmemory` is now `Option<usize>` to distinguish three intents: - omitted (None) -> auto-guardrail (80% + allkeys-lru + notice) - `--maxmemory 0` -> explicitly UNLIMITED (Redis escape hatch, preserved) - `--maxmemory N` -> exact cap, honored verbatim (no guardrail) Resolution happens once at startup (`apply_memory_guardrail`), writing the resolved value into `RuntimeConfig` (still `usize`, `0 = unlimited`) so ALL downstream code (eviction, persistence_tick, CONFIG, pagecache) is untouched. Detection is cgroup-aware and dependency-free — NO new `unsafe`, NO new crates (safe `/proc` + `/sys` reads only): - Linux: min(cgroup limit v2 `memory.max` / v1 `memory.limit_in_bytes`, host `/proc/meminfo` MemTotal). The min() matters: 80% of HOST RAM inside a container with a small cgroup cap would give zero real protection — the exact multishard/container scenario that motivated this. - Non-Linux (dev): detection skipped, left UNLIMITED with a warning to set `--maxmemory` explicitly (production targets Linux per platform policy). Wired at both entry points (main.rs binary + embedded.rs library) via the shared `log_memory_guardrail` notice helper. Tests (red/green): pure parsing (`/proc/meminfo`, cgroup v1/v2 sentinels) + pure resolution (percent math, policy flip, evicting-policy preserved, explicit-honored, skip-on-no-detection) + Option-type config parsing incl. the `--maxmemory 0` = unlimited escape hatch. 47 config tests pass. Verified live on Linux (OrbStack): no-flag -> cap = 80% of MemTotal + allkeys-lru + notice; `--maxmemory 0` -> unlimited, no notice; `--maxmemory N` -> exact. Build + clippy clean both runtimes; fmt clean. author: Tin Dang
…s (G2)
The reported "zombie eats RAM in multishard mode" reproduces empirically:
with an identical `--maxmemory 100MiB --maxmemory-policy allkeys-lru`, a
4-shard server retained ~2.5x the RAM and ~2.4x the keys of a 1-shard server
(307MB / 768,800 keys vs 124MB / 321,649). Root cause: every write-path
eviction (`try_evict_if_needed`) compared a SINGLE shard's memory against the
FULL `maxmemory`. Each shard is shared-nothing and enforces independently, so
an N-shard server tolerated ~Nx `maxmemory` before any shard evicted — the box
OOM-kills before the cap ever bites. The server-wide atomic that could give a
true aggregate (`ShardDatabases::read_memory_sum` / `memory_per_shard`) is dead
code: no write path publishes to it ("Phase 2/3" wiring never landed), so it
always returns 0.
Per user decision (2026-06-02), `maxmemory` is now a true WHOLE-INSTANCE cap
for both the G1 auto-guardrail and explicit `--maxmemory`, with division at
ENFORCEMENT (Redis-compatible: CONFIG GET / INFO still report the whole-instance
value verbatim).
Implementation (no hot-path division, no new unsafe, no new deps):
- `RuntimeConfig` gains `num_shards` (default 1) + `maxmemory_per_shard()`:
`maxmemory.div_ceil(num_shards.max(1))`, returning 0 (unlimited) iff
`maxmemory == 0`. `div_ceil` keeps summed per-shard budgets >= the cap;
`max(1)` guards a mis-set count.
- Eviction threshold comparisons switch from `config.maxmemory` to the
per-shard budget in `storage/eviction.rs` (`try_evict_if_needed_with_spill_
and_total`, the async-spill variant, and `try_evict_deferred`) and in
`shard/persistence_tick.rs` (`should_run_pressure_cascade` threshold +
the step-3 KV-eviction gate). The `== 0` unlimited gates and all reporting
(CONFIG GET / SET / INFO / MEMORY) are untouched.
- The server publishes the resolved shard count onto the SHARED `RuntimeConfig`
(the instance every shard's eviction reads) at startup in `main.rs` and
`server/embedded.rs`. Single-shard keeps `num_shards == 1` => no division =>
identical prior behavior.
Behavior change (accepted): existing explicit multishard `--maxmemory N` configs
now bound aggregate RSS (effective ceiling lowers ~Nx) instead of per-shard.
Operator visibility (this change is no longer silent): a startup notice
(`config::log_maxmemory_sharding`, emitted from `main.rs` + `server/embedded.rs`)
fires whenever `num_shards > 1` and `maxmemory > 0`, stating the resolved
per-shard eviction budget so an operator running e.g. `--maxmemory 8gb
--shards 4` sees the effective 8GB (not 32GB) ceiling at boot. CHANGELOG +
README document the whole-instance semantics. (Surfacing the per-shard budget
in `INFO memory` is a tracked follow-up — the INFO handler takes only
`&Database`, so it needs a dispatch-signature change across both runtimes.)
Known residual (tracked, not in scope): the connection-path eviction is per-DB,
so M actively-used logical DBs per shard still over-tolerate by ~M; the proper
aggregate fix (wire the per-shard memory publisher + switch eviction to
`read_memory_sum`) touches the hot write path and is a separate change.
Red/green TDD:
- RED: `per_shard_budget_divides_maxmemory_at_enforcement` FAILED pre-fix
("4 shards … got len 4" — 4 shards evicted nothing at budget == total).
- GREEN: passes post-fix; plus 5 pure `maxmemory_per_shard` unit tests. 12/12.
Empirical re-run (OrbStack Linux, same flood) — 4-shard DBSIZE collapses from
768,800 to 321,654 (≈ the 1-shard 321,649): retained keyspace no longer scales
with shard count. used_memory 307MB -> 197MB, VmRSS 300MB -> 193MB (-36%).
Single-shard unchanged. Residual gap is fixed per-shard structural overhead
(4 shard threads' arenas), bounded — not the prior unbounded keyspace growth.
author: Tin Dang
Per-shard AOF replay seeded `master_repl_offset` with the START LSN of the last recovered entry instead of the next-free replication offset. Because `ReplicationState::issue_lsn` returns the offset BEFORE adding the entry's byte length, the first post-recovery write reissued the exact LSN already on disk for the last pre-crash entry — breaking the `lsn -> entry` uniqueness invariant the cross-shard backlog merge (RFC § 2 Rule 3) and PSYNC continuity depend on. main.rs:829 seeds `seed_master_offset` directly from this value, so the bug surfaces on every multi-shard AOF recovery. `replay_incr_framed` now tracks `max(entry.lsn + entry.len)` — the byte offset AFTER each entry — at BOTH the inline and ordered update sites, so the returned `max_lsn` (and the manifest `global_max_lsn` derived from it) is the next-free offset. `seed_master_offset` then leaves the master strictly beyond every byte on disk. Ported onto merged main (PR #129, commit 8c49ec2), whose `replay_incr_framed` carried the identical start-LSN bug at aof_manifest.rs:1533 / :1581. Tests (red/green verified on OrbStack Linux, monoio): - new `replay_incr_framed_max_lsn_is_next_free_offset` regression test (RED: returned 114 = last start LSN; GREEN: 130 = entry end); - updated the four sibling assertions that encoded the old start-LSN semantics to the next-free offset: truncated-header 3 -> 17, decodes_lsn_and_resp 11 -> 27, round_trips_two_shards 20 -> 49, parallel_matches_sequential 40 -> 69, buffers_ordered_entries 12 -> 35. - full persistence::aof_manifest suite: 30 passed, 0 failed. author: Tin Dang <tin.dang@trustifytechnology.com>
Design-for-failure: `appendfsync=always` awaited the writer's fsync ack
with no timeout. A stalled disk (slow device, full filesystem, hung
fsync) parked the write connection forever, holding its ~64KB buffers —
a zombie connection that never completes and never frees memory. Under
load this compounds into the unbounded-RAM symptom.
Fix: cap the ack await with a configurable bound. On elapse the write is
failed (FsyncFailed) and the caller returns an error frame instead of
hanging. The entry may still reach disk later, so durability is reported
as UNCONFIRMED — the caller must not respond +OK.
Changes:
- New `--aof-fsync-timeout-ms` flag (default 2000ms; 0 = legacy unbounded).
- `AofWriterPool` gains an `fsync_timeout: Duration` field, threaded
through `top_level_with_policy` / `per_shard_with_policy`; the
non-policy constructors default to DEFAULT_AOF_FSYNC_TIMEOUT (2000ms).
- `try_send_append_durable` Always-branch now awaits via `await_ack`,
a runtime-agnostic helper: monoio uses `select! { rx, sleep }` (matching
cluster::failover — monoio 0.2 has no time::timeout); tokio uses
tokio::time::timeout. Outcomes map: ack->inspect AofAck, channel
disconnect->WriteFailed, elapse->FsyncFailed.
- Wired the timeout into all production pool construction sites
(main.rs per-shard + top-level, listener.rs, embedded.rs).
Ported onto merged main (PR #129, commit 8c49ec2). Reconciliation vs the
original F2 base:
- merge-kept main's AOF_BACKPRESSURE_DROPPED counter alongside the new
AckOutcome enum (both retained).
- five pre-existing pool test sites that main added after the F2 base
(aof.rs x3, spsc_handler.rs x2, handler_single.rs x1) now pass
Duration::ZERO — exact legacy unbounded semantics, no behavior change.
Tests (red/green, tokio-gated — drive the runtime timer):
- always_fsync_times_out_when_writer_never_acks: held-but-undrained writer
-> Err(FsyncFailed) within the bound, no hang.
- always_fsync_succeeds_when_writer_acks_in_time: drained+acked writer
-> Ok(()).
- full persistence::aof::pool_tests suite: 17 passed, 0 failed (tokio).
The monoio path shares the proven select!+sleep shape and is covered
end-to-end by the per-shard crash matrix.
author: Tin Dang <tin.dang@trustifytechnology.com>
Design-for-failure: the cross-shard dispatch path retried a full target
SPSC ring with an UNBOUNDED `loop { try_push; sleep/yield }` and waited
for the reply with no cap and no shutdown check. When a target shard
stopped draining (saturated, wedged, or mid-shutdown), the producing
connection task spun forever — holding its read/write buffers and, on
monoio, piling wakers into the `pending_wakers` relay. That is the
multi-shard "zombie connection" RAM-growth vector from the investigation:
memory that never frees because the task never completes.
Fix — shared bounded helper `push_with_backpressure` (src/shard/dispatch.rs):
- Retries a full ring up to CROSS_SHARD_PUSH_MAX_RETRIES (5000) with a
100µs backoff (~0.5s honoured → a few seconds at coarse timer
granularity — wedged-shard detection scale, generous enough not to
false-reject a merely saturated shard under load).
- Checks `shutdown.is_cancelled()` before every backoff so graceful
shutdown can never hang on a wedged peer.
- On give-up the message was NEVER accepted → caller returns a clean
reject (`-ERR cross-shard dispatch backpressure`); the command did not
execute.
- Hot path unchanged: the happy case is a single try_push with no sleep
and no allocation. `TimerImpl::sleep` (runtime-agnostic) only allocates
on the contended retry path.
Wired into BOTH runtimes (same latent bug in each):
- monoio `handler_monoio`: push loop + the `pending_wakers` response-wait
loop now bounded by CROSS_SHARD_RESPONSE_MAX_WAITS (~30s of ~1ms wakes)
AND a shutdown check. Response timeout is an *uncertain write* (the
batch was already dispatched, the target may have applied it), so its
error explicitly says "write may have applied" — it must NOT imply
rejection. Disconnect vs timeout vs shutdown now return distinct frames.
- tokio `handler_sharded`: same push loop replaced. On give-up the slot is
simply never awaited — `slot_ptr` has no accounting side effect, so
skipping `future_for` cannot leak ResponseSlotPool state. (The tokio
response wait already awaits via AtomicWaker, so it needed no bound.)
Tests (red/green, tokio-gated — drive the runtime timer):
- push_backpressure_pushes_on_first_try_without_sleeping (happy path, no
retry/sleep)
- push_backpressure_succeeds_after_transient_full (retries until drain)
- push_backpressure_gives_up_when_ring_never_drains (Backpressure within
budget, exact attempt count, no hang)
- push_backpressure_aborts_on_shutdown (Cancelled after one attempt)
The monoio response-wait bound is correct-by-inspection (coupled to the
waker relay) and covered by the per-shard crash/E2E matrix.
Verified: build + clippy clean both runtimes; full tokio lib suite
2773/2773 pass; fmt clean.
author: Tin Dang
…t (F6) Crash-safety: `advance_shard` (PerShard per-shard rewrite scaffolding) deleted the old generation's base/incr files at step 3 — BEFORE the manifest durably committed the new seq. Because recovery resolves base/incr paths by the persisted `manifest.seq`, a crash between any shard's advance and the coordinator's `write_manifest()` would leave the manifest pointing at an old seq whose files are already gone → the base RDB is missing on restart → data loss for every shard that had already advanced. This is the inverse of `advance()` (TopLevel), which correctly deletes only AFTER the manifest commit. This is a latent bug: `advance_shard` currently has no production caller (the per-shard BGREWRITEAOF fan-out that drives it is the next step), so no shipped path regresses. Fixing it now establishes the correct crash-safe ordering before the coordinator is wired. Changes: - `advance_shard`: remove the in-function delete (old step 3). It now writes the new base+incr and updates the shard's `max_lsn` only. The unused `old_seq` binding is dropped with it. - New `prune_shard_files(shard_id, seq)`: best-effort deletion of a shard's base+incr for a given seq. The rewrite coordinator MUST call this only after `write_manifest()` has durably committed the new seq — the single commit point for the whole rewrite. Mirrors `advance()`'s post-commit deletion ordering. - Doc comment on `advance_shard` now states the deferred-deletion contract and the crash window it closes. Tests (red/green TDD, runtime-agnostic — pure manifest unit test): - Renamed `advance_shard_writes_new_seq_and_deletes_old` → `advance_shard_defers_delete_until_after_commit`. It now pins the ordering invariant: after fan-out advance of BOTH shards, the new-seq files exist AND the old-seq files STILL exist (pre-commit); only after `seq = new; write_manifest(); prune_shard_files(...)` do the old files disappear while the new remain. Recovery (`load`) resolves to new seq. RED against the old delete-before-commit behavior (failed at the pre-commit "old base must survive" assertion); GREEN after the fix. - Full `persistence::aof_manifest` module: 30 passed, 0 failed (monoio). author: Tin Dang <tin.dang@trustifytechnology.com>
…oio)
Make multi-shard BGREWRITEAOF crash-safe and actually compact the per-shard
AOF, behind an experimental opt-in flag. Until now BGREWRITEAOF was GATED in
the PerShard layout (`--shards >= 2 + --appendonly yes`) because the naive
multi-shard rewrite lost ~38% of keys on restart (verified 2026-05-26). This
ships the correct per-shard fan-out coordinator and lets operators enable it
with `--experimental-per-shard-rewrite`; the default behavior is unchanged
(gate stays closed → the shipped "append-only, no in-place compaction" path).
Design — synchronized seq bump + single manifest commit across N writers:
- New `AofMessage::RewritePerShard { shard_dbs, coord }` is delivered to every
per-shard writer at once.
- Each writer runs `do_rewrite_per_shard` (monoio, synchronous) for ITS shard,
reusing the proven `do_rewrite_sharded` discipline scoped to one shard:
drain queued appends into the OLD incr (framed) + fsync → lock the shard's
dbs → re-drain → snapshot under lock → release → write new base+incr at the
coordinator's `new_seq` via `advance_shard` (which does NOT bump the seq) →
reopen the append file to the new incr.
- `PerShardRewriteCoord` holds a shared `Arc<Mutex<AofManifest>>` and an
atomic countdown. The LAST writer to finish performs the single durable
commit (`manifest.seq = new_seq; write_manifest()`), then prunes the old
generation. Until that commit the on-disk manifest still resolves to the old
seq, so a crash anywhere in the fold recovers the intact old generation
(no loss, no double-apply).
Crash-safety:
- Non-idempotent INCR is exactly-once across the rewrite boundary: pre-snapshot
INCRs end up only in the new base; post-snapshot INCRs only in the new incr.
- delete-after-commit ordering (the step-1 `advance_shard` fix) means a
pre-commit crash never deletes files the manifest still references.
- Abort-on-any-failure: a shard whose fold errors calls `coord.mark_failed()`;
the final writer then ABORTS the commit (keeps old seq) instead of advancing
to a generation where a shard is missing its new base — which would break
recovery. Old generation stays authoritative (crash-safe); a restart is
recommended to resync the successful writers' file handles.
Design-for-failure:
- The rewrite message uses the BLOCKING `send` (not `try_send`): a dropped
message would leave the countdown unable to reach zero, silently losing the
folded writers' post-rewrite appends. Writers drain continuously on dedicated
threads, so the block is sub-millisecond for a rare admin command.
- The manifest is loaded fresh from disk at rewrite time (normal appends never
touch it; BGREWRITEAOF is CAS-serialized) — no startup/recovery reordering.
Runtime scope: monoio only this commit. The fold uses synchronous std::fs IO;
the tokio command handler refuses per-shard BGREWRITEAOF with a clear error so
no in-progress flag is left dangling. tokio enablement is the next step.
Wiring:
- `--experimental-per-shard-rewrite` config flag; main.rs leaves the
MULTI_SHARD_AOF_REWRITE_UNSAFE gate OPEN when set.
- `AofWriterPool::per_shard_with_base_dir` records the persistence dir so
`try_send_rewrite_per_shard` can load the manifest; `bgrewriteaof_start_sharded`
routes PerShard pools to the per-shard fan-out.
Tests (red/green TDD):
- tests/crash_matrix_per_shard_bgrewriteaof.rs (monoio, --ignored):
- STRADDLE: BGREWRITEAOF fired mid-INCR-stream + SIGKILL → both per-shard
counters recover to EXACTLY N (drop → <N, double-apply → >N), with the
seq>1 compacted base asserted present (rules out a silent no-op).
- COMPOSE: INCR → rewrite → INCR → SIGKILL → counter == PRE+POST (base +
post-rewrite incr compose exactly once).
Both PASS on monoio (2 passed, 10.77s).
- Both runtimes compile clean; cargo fmt + clippy (default) green.
author: Tin Dang <tin.dang@trustifytechnology.com>
Record the load-bearing ordering fact behind per-shard BGREWRITEAOF exactly-once, verified by tracing the live write path against the fold. The fold (`do_rewrite_per_shard`) runs on the per-shard *writer* thread, not the shard event-loop thread. Exactly-once across the rewrite boundary rests on a single fact: the live write path enqueues each command's AOF append INSIDE the same `RwLock<Database>` write guard under which it mutated the db (`spsc_handler.rs` calls `wal_append_and_fanout` before `drop(guard)`). Phase 2 of the fold acquires those SAME locks (`all_shard_dbs()[sidx]` == `ShardDatabases::shards[sidx]`, the exact `RwLock`s `write_db` locks), so RwLock mutual exclusion forces `enqueue -> guard-release -> fold-acquire -> mid-drain`. Every INCR whose mutation lands in the snapshot therefore had its append drained into the OLD incr (pruned at commit), never replayed atop the new base. If the append were enqueued AFTER the guard drop, the snapshot would capture the mutation while its append still raced toward the NEW incr — double-apply. Also documents the dependency on the RwLock store being live: the thread-local `ShardSlice` fast path is dead code until Phase 4 wires `init_shard` (`is_initialized()` is always false today). A future Phase 4 that makes ShardSlice live must revisit this fold, since the writer thread cannot lock another thread's `!Send` `Rc<RefCell<Shard>>`. Verification (this commit, monoio, release): - tests/crash_matrix_per_shard_aof.rs (shipped gate-closed path): 3 passed — no regression from the F6 shared-code changes (AofMessage enum, drain_pending_appends, per-shard writer match arms). - tests/crash_matrix_per_shard_bgrewriteaof.rs (F6 new path): 2 passed — STRADDLE + COMPOSE exact-count recovery. Doc-only; no behavior change. author: Tin Dang <tin.dang@trustifytechnology.com>
… gap The per-shard BGREWRITEAOF exactly-once guarantee holds *absent append-channel saturation during the fold*. Record the one window the crash matrix cannot surface, so the claim is honest before the experimental path is exercised. While `do_rewrite_per_shard` runs (phases 2-6, including the hundreds-of-ms base-RDB write + fsync of phase 6), the per-shard writer is in the fold, not its recv loop, so it is not draining the bounded `mpsc_bounded(10_000)` append channel. Post-snapshot appends queue there for the new incr; the event loop enqueues with `try_send_append` (drop-on-full, return ignored). Under sustained concurrent writes on a large dataset the window can exceed 10_000 queued appends and the overflow is silently dropped — lost even on a clean restart, which is worse than the everysec contract. The single-client crash matrix never pressures the channel (serialized redis-cli, sub-ms), so it cannot catch this. The window is pre-existing: the shipped `do_rewrite_sharded` has the identical non-draining gap — F6 did not introduce it. Per user decision (2026-06-02): qualify + document now (same posture as the Rule-3 / always-fsync deferrals), behind the experimental flag; the fix (drain-during-phase-6 or block-on-full for the rewrite's duration, plus a load-based test) is a separate scoped task. - aof.rs: add a "Known limitation — channel saturation during the fold" section to the `do_rewrite_per_shard` doc. - tmp/F6-known-limitations.md: full write-up (L1 saturation + candidate fixes, L2 carry-over) and current monoio/tokio status. Doc-only; no behavior change. author: Tin Dang <tin.dang@trustifytechnology.com>
Make the multi-part PerShard AOF subsystem (manifest creation, replay, and per-shard BGREWRITEAOF) work under runtime-tokio, not just runtime-monoio, so multi-shard `--appendonly yes` deployments are crash-safe on both runtimes. This is the first half of "multi-shard stable on monoio AND tokio". Root cause of the previous tokio gap (not a fold bug): - The tokio per_shard_aof_writer_task already supports multi-part: it polls AofManifest::load for up to 60s, then opens the framed incr file. Its on-disk framing ([u64 lsn LE][u32 len LE][RESP]) is byte-identical to the monoio writer and to drain_pending_appends_framed, so the shared, runtime-agnostic replay_per_shard reads it back unchanged. - But the manifest creation + replay block in main.rs was gated entirely to #[cfg(runtime-monoio)]. Under tokio initialize_multi never ran, so the manifest the writers wait for never appeared — the writers timed out at 60s and exited, and try_send_rewrite_per_shard aborted with "manifest missing". Durability fell back to WAL-v3 only; AOF was effectively dead. Why it was gated (the regression boundary, preserved): tokio --shards 1 uses the legacy single-file appendonly.aof (v2) + in-place BGREWRITEAOF. Engaging single-shard multi-part replay there finds an empty manifest with no rewritten base RDB and wipes the v2-loaded state — the test_txn_commit_wal_crash_recovery regression. That is a SINGLE-shard problem; the goal is MULTI-shard, a different code path that writes real per-shard base/incr files. Fix — decouple the gating instead of all-or-nothing by runtime: - PerShard branches (num_shards >= 2): initialize_multi (fresh boot) and replay_per_shard (restart) now run under BOTH runtimes. initialize_multi writes a real seq=1 base RDB per shard, so incr-only recovery (writes then crash before any rewrite) has a valid base to replay onto. - Single-shard branches (replay_multi_part, initialize_with_base, initialize) stay #[cfg(runtime-monoio)] with a tokio warn fallback that preserves today's exact behavior (v2 single-file recovery, no manifest creation). This keeps test_txn_commit_wal_crash_recovery green. - TopLevel-manifest + multi-shard remains a hard refusal on both runtimes. Also lands the tokio per-shard BGREWRITEAOF fold (was warn+abort): it reuses the proven synchronous do_rewrite_per_shard verbatim via tokio::fs::File::into_std() on the dedicated block_on_local writer thread, so the exactly-once invariant carries over unchanged. The command handler (bgrewriteaof_start_sharded) now routes PerShard pools to the fan-out coordinator on both runtimes (the tokio refusal is removed). DrainOutcome, drain_pending_appends_framed, and do_rewrite_per_shard cfg widened from runtime-monoio to any(runtime-monoio, runtime-tokio). BEHAVIOR CHANGE: a corrupt AOF manifest is now FATAL (refuse-to-start) on tokio too — previously it was warn+continue. This is intentional: silently ignoring a corrupt manifest risks overwriting the only reference to the base RDB. Operators with a corrupt manifest must inspect it before deleting. Validation: - cargo check / clippy -D warnings / fmt --check: green on BOTH runtimes (default = runtime-monoio; tokio = --no-default-features --features runtime-tokio,jemalloc,graph,text-index). - Crash-matrix + TXN regression validation under tokio follows in the next commit (release build + --ignored integration tests). author: Tin Dang <tin.dang@trustifytechnology.com>
…hen idle The tokio per-shard AOF writer could leave >1s of writes buffered in its BufWriter and lose them on SIGKILL, violating the everysec durability contract (≤1s loss). Surfaced by the F6 BGREWRITEAOF crash matrix COMPOSE case (PRE=300, rewrite, POST=200, 1.5s quiesce, SIGKILL), which recovered only ~450/500 instead of the exact 500 it asserts. Root cause — the flush was on a starvable/unreliable select! arm: - The writer loop used `_ = interval.tick(), if fsync == EverySec` as one arm of `tokio::select!` alongside `msg = rx.recv_async()`. Under sustained writes the always-ready recv arm repeatedly won select!'s random fairness, starving the interval arm, so the BufWriter accumulated well past 1s before flushing. When idle, the long-lived `interval` arm also proved unreliable on the dedicated current-thread writer runtime (each per-shard writer runs on its own std::thread via block_on_local). Either way the everysec deadline was not honored. Fix — make the flush deadline independent of select! scheduling: - Replace the bare `rx.recv_async()` arm with `tokio::time::timeout(200ms, rx.recv_async())` so the loop wakes at least every 200ms regardless of message traffic. flume's recv future is drop-safe on the Elapsed branch (no message consumed on timeout) and the Ok(Ok(msg)) path captures the message with no loss, so durability is not affected by the timeout wrapper. - Move the EverySec flush check to run AFTER every wake (message OR timeout), outside the select! arms, so it is not subject to select! fairness and holds the 1s bound under sustained load as well as when idle. Delete the now-dead `interval` and its priming `tick()`. - Always / AppendSync (synchronous fsync-before-ack) and the per-shard rewrite fold are untouched; the big message match stays in one place (lowest-risk change on durability-critical code). The fix is in the `#[cfg(runtime-tokio)]` writer block only — monoio is unaffected. Also un-stale the single-shard TXN crash-recovery regression guard, which could not compile on this branch (so it had been protecting nothing): - tests/txn_kv_wiring.rs: reconcile the ServerConfig literal with the F2/F6 fields added on this branch (experimental_per_shard_rewrite, aof_fsync_timeout_ms) and fix maxmemory `0` -> `Some(0)` (the field is Option<usize>). The file is #![cfg(feature = "runtime-tokio")], so the guard runs under tokio — the runtime where the decouple's single-shard warn-fallback (legacy v2 path) matters. Validation (release binary + redis-cli, both runtimes): - test_txn_commit_wal_crash_recovery (tokio): green — confirms the 8853ade decouple preserved single-shard recovery. - crash_matrix_per_shard_bgrewriteaof COMPOSE (tokio): 3/3 exact 500/500 at the original 1500ms quiesce (was 475/463/450/449). STRADDLE: green. - crash_matrix_per_shard_aof CRASH-01-LITE everysec + always + pipeline-no-double-write (tokio): 3/3 green. - Same crash matrices under monoio (default features): green — confirms the tokio-only edit did not regress monoio. - clippy -D warnings + fmt --check: clean on both runtimes. tests/crash_matrix_per_shard_aof.rs: header doc updated — CRASH-01-LITE is now validated on BOTH runtimes; documents the tokio invocation. author: Tin Dang <tin.dang@trustifytechnology.com>
…anifest cap Live disk-offload cold read-through silently broke past ~70 evicted KV per shard. `Manifest::commit()` hard-errors above MAX_INLINE_ENTRIES (~70) because the manifest root is a single 4KB double-buffered page with no overflow/B-tree. The spill path wrote ONE DataFile + ONE manifest entry per evicted KV, so beyond #70 every `apply_spill_completions` commit failed, orphaning the cold file (on disk, unreferenced) → GET returned nil. Identical single/multishard, tokio/monoio, small/large values — not value-size, multishard, or runtime specific. Pre-change repro: 1537 "too many entries for inline root page" warnings, live read-through 7/200, recovery rebuilt only 4 cold-index entries. Fix: pack many evicted KVs into ONE multi-page DataFile per bg-thread flush, so manifest entries scale with #files (cold-bytes/flush) not #keys. - ColdLocation gains `page_idx: u32` → {file_id, page_idx, slot_idx}. ColdIndex is rebuilt-only (no on-disk serialization), so this is an in-memory + recovery -scan change with no format version bump. - build_kv_spill_batch / write_kv_spill_batch (kv_spill.rs): greedy leaf packing, page_idx+1 on PageFull; atomic temp+rename write. Inline-only MVP (values > INLINE_MAX_VALUE_BYTES=3500 route to the existing single-file path). - read_cold_entry (cold_read.rs): single-page pread at page_idx*PAGE_4K; only reads the whole file when the OVERFLOW flag is set. - spill_thread.rs: buffered batch model (flush on 256-entry cap / 100ms tick / shutdown); per-FILE SpillCompletion carrying Vec<(key, db_index, page_idx, slot_idx)>. - apply_spill_completions (persistence_tick.rs): ONE add_file+commit per file, then per-entry cold_index.insert. - eviction.rs: drop the tentative slot-0 cold_index insert (slot 0 is now a different key's slot under batching); the ~100ms pre-flush GET miss is correct (key not yet on disk) and the AOF incr log is the durability backstop. - event_loop.rs: the monoio write-path spill ctx used the BARE offload dir (<offload>) instead of <offload>/shard-{id}, so cold files were written where the reader's cold_shard_dir never looked. Align with every other path (WAL, control file, cascade, reader). This dir mismatch was masked pre-fix by the tentative slot-0 insert; removing it exposed the bug. Validation (release, MONOIO — the default/primary runtime): - LIVE disk-offload read-through: 7/200 → 200/200 (single-shard AND --shards 2), manifest-cap warnings 1537 → 0, cold files packed (~7-10, not ~1600). - Rebuild substrate: rebuild_from_manifest 4 → 1607 entries. - Hot-key crash recovery: 700/700 (BGREWRITEAOF + SIGKILL + restart), base RDB fold confirmed — no regression. - New in-process TDD test `test_rebuild_from_manifest_roundtrip` proves the recovery (page_idx, slot_idx) scan reconstructs the same mapping the builder produced: 100/100 keys round-trip through cold_read_through. - monoio: 21 tiered unit tests green; tokio: 46 tiered unit tests green, clippy -D warnings clean, fmt clean (both runtimes). SCOPE — this fixes the manifest-cap for the MONOIO write-path. Two adjacent gaps remain (characterized, tracked separately, NOT regressions of this change): 1. TOKIO live read-through still broken (7/200). Root cause: the tokio hot-path handler (handler_single.rs / handler_sharded) calls `try_evict_if_needed` (delete-only) on write, NOT `try_evict_if_needed_async_spill` — so evicted keys under tokio are deleted, never spilled to cold. Only the persistence-tick cascade spills (sparsely). monoio's handler_monoio correctly uses the async-spill variant + the event_loop.rs:586 dir fix. Wiring async-spill into the tokio handler is a separate handler-layer change. 2. Post-crash cold read-through on MONOIO unchanged: 41/200 (pre-change) vs 42/200 (with-fix) — PRE-EXISTING, and NOT data loss. The AOF incr DOES replay (verified: "AOF incr replayed: 2000 commands" / "AOF multi-part loaded: 2000 entries"); the v3 `cmds=0` is the WAL count, not the AOF — all data is durable. The gap is re-eviction/cold-read coordination on the recovery path (replay re-inserts → eviction re-evicts → cold reads of re-evicted keys miss). Curiously inverted under tokio (recovery 200/200, via v3 WAL cmds=2000). The 7 changed files do not touch recovery.rs, main.rs wiring, restore_from_persistence, AOF replay, or string::get. Refs: tmp/phaseB-fix-SUMMARY.md, tmp/phaseB-coldspill-rootcause.md, tmp/phaseB-coldkey-crash-recovery.sh author: Tin Dang <tin.dang@trustifytechnology.com>
… path
Live disk-offload read-through was broken under the tokio runtime: GET of
evicted keys returned nil for ~96% (7/200) while monoio returned 200/200.
Root cause was a handler-layer asymmetry, NOT a storage bug.
The production binary (main.rs) always runs the sharded architecture: each
shard owns an event loop (event_loop.rs) that binds its own SO_REUSEPORT
socket and spawns connections via `spawn_tokio_connection`. The per-shard
background SpillThread, its `flume::Sender<SpillRequest>`, the shared
`spill_file_id` counter, and the `<offload>/shard-{id}` directory all live in
that event loop and were threaded into the *monoio* connection
(`spawn_monoio_connection`) but NOT the *tokio* one. So the tokio ConnCtx was
built with `spill_sender = None`, and `handler_sharded`'s write path fell back
to `try_evict_if_needed` (DELETE-ONLY): evicted keys were dropped, never
spilled to the cold tier, so the cold read-through had nothing to resolve.
Only the periodic persistence-tick cascade spilled (sparsely).
Fix — mirror the monoio wiring (no new machinery; ConnCtx already carries the
three spill fields and `try_evict_if_needed_async_spill` already exists):
- conn_accept.rs: `spawn_tokio_connection` gains `spill_sender`,
`spill_file_id`, `disk_offload_dir` params (mirroring
`spawn_monoio_connection`) and passes them into `ConnectionContext::new`
instead of the previous `None`/placeholder.
- event_loop.rs: both tokio `spawn_tokio_connection` call sites now pass
`&spill_sender, &spill_file_id, &disk_offload_dir` (already in scope).
- handler_sharded/mod.rs: the `do_write` eviction now uses
`try_evict_if_needed_async_spill(db, &rt, sender, dir, &mut fid,
conn.selected_db)` when `ctx.spill_sender` is `Some`, else delete-only.
Safety / design notes:
- Tokio connections run via `tokio::task::spawn_local` on the per-shard
event-loop thread (`block_on_local`), the same single-threaded model as
monoio — so the shared `Rc<Cell<u64>>` spill_file_id is sound, and the
hot-path get->spill->set has no `.await` inside it (atomic vs other tasks).
- Both the hot-path evictor and the persistence-tick cascade run under this
shard's db write lock (`with_shard_db` / `write_db`), so they cannot race
on the same key.
- The `next_file_id = next_file_id.max(spill_file_id.get())` reconciliation
runs in the tokio select! periodic arm (event_loop.rs:1222), so the cascade
stays consistent with the hot-path counter — fully symmetric with monoio.
Validation (release, red/green TDD):
- TOKIO live read-through: 7/200 (RED, fresh current-branch binary) -> 200/200
(GREEN), stable across 3 consecutive e2e runs
(tmp/phaseB-coldkey-crash-recovery.sh, 2000x500B, maxmemory 262144,
allkeys-lru, --shards 1).
- MONOIO regression gate: live read-through stays 200/200 (handler_monoio and
spawn_monoio_connection untouched; only shared signatures extended).
- storage::tiered unit tests: 46 pass on monoio, 46 pass on tokio.
- clippy -D warnings clean + fmt clean, both runtimes.
SCOPE — this closes the tokio LIVE read-through gap. Two adjacent items remain,
both pre-existing and tracked separately (NOT regressions of this change):
1. POST-CRASH cold read-through on MONOIO (B-2): unchanged at ~43/200. The
spill thread's `file_id` counter (and the event loop's `spill_file_id`)
start at 1 on every boot and are never seeded from the recovered manifest,
so post-restart re-eviction atomically overwrites recovered `heap-NNN.mpf`
files. Fix = centralize file_id assignment in the SpillThread and seed it
from the recovered max (one coherent change). Tokio post-crash already
reaches 200/200 when spill completions land before the kill (timing-
dependent), via the v3 WAL replay path.
2. handler_single (run_with_shutdown / handle_connection) still uses delete-
only eviction, but that path is test-harness/embedded only — main.rs never
drives it — so it is out of scope for the production multi-shard goal.
Refs: tmp/phaseB-fix-SUMMARY.md,
investigation_coldspill_manifest_70_cap (memory)
author: Tin Dang <tin.dang@trustifytechnology.com>
After a SIGKILL + restart with disk-offload enabled, GET of a re-evicted
(cold) key returned nil for ~80% of keys on monoio — and intermittently on
tokio. The data was fully durable (AOF replayed all 2000 entries, cold index
rebuilt with the full set); the failure was purely on the cold READ path
after recovery. Two independent defects, both required to be correct:
1. Cold-tier wiring wiped by the rdb::load swap (the dominant cause).
The multi-part AOF replay loads the base RDB via `rdb::load`, which builds
fresh `Database::new()` temporaries and swaps them wholesale into the live
databases (`*live = temp`). `cold_shard_dir` and the rebuilt `cold_index`
are live-tier topology, NOT part of the RDB hot snapshot, so even a 0-key
base RDB silently reset them to None. The handler then read a database with
`cold_shard_dir = None`, so `cold_lookup_location` returned None for every
key and all cold read-through missed.
Fix: capture both fields per (shard, db) immediately before the AOF replay
block (after restore rebuilt the index, before replay clobbers) and
re-attach them immediately after. Confined to the startup recovery path on
purpose — `rdb::load` also serves replica full-sync and DEBUG RELOAD, which
load a foreign dataset whose values do not live in this node's cold files,
so preserving the local index there would surface stale reads. In the
recovery path the loaded base + replayed incrs are this node's own data, so
the rebuilt index is authoritative.
2. Spill file_id counter restarted at 1 (the residual cause).
On restart the counter restarted at 1, so post-restart re-eviction minted
the same `heap-NNNNNN.mpf` names recovery had just loaded and atomically
overwrote cold files the preserved cold_index still references — corrupting
read-through of any not-yet-refreshed entry.
Fix: `next_spill_file_id_seed` scans `<shard_dir>/data/heap-*.mpf` and
seeds the counter at max(file_id)+1, so every future spill filename is
strictly new and recovered cold files stay immutable until the steady-state
cascade refreshes them. Scans physical files (not the manifest) so a spill
that wrote its .mpf but crashed before the manifest commit still cannot be
clobbered. Returns 1 when disk-offload is off / dir absent / no heap files,
so a fresh server and the live hot path are byte-for-byte unchanged.
Non-NotFound scan errors warn and fail safe to 1 rather than risk a misseed.
The two fixes are complementary: (1) re-wires the index across recovery,
(2) keeps the files that index points at immutable. Either alone is
insufficient — without (1) reads miss on shard_dir=None; without (2) the
preserved entries point at overwritten files.
Validation (red/green TDD):
- 3 unit tests for the seed helper (none-dir, fresh-server, resume-above-max
with non-heap names ignored) — GREEN.
- e2e crash-recovery gate (2000x500B, maxmemory 256KB, allkeys-lru, SIGKILL,
restart, GET first-200 cold read-through):
monoio --shards 1: LIVE 200/200 + COLD recovery 200/200 (x5, stable)
monoio --shards 2: LIVE 200/200 + COLD recovery 200/200 (x2)
tokio --shards 2: LIVE 200/200 + COLD recovery 200/200 (x2; both shards'
cold indexes 812+796 rebuilt and read-through working)
RED baseline pre-fix was ~40/200 on monoio recovery.
- storage::eviction + storage::tiered: 61 unit tests pass.
- cargo fmt clean; cargo clippy --release (monoio) -D warnings clean; both
runtimes build warning-free.
Harnesses: tmp/phaseB-coldkey-crash-recovery.sh (single-shard),
tmp/phaseB2-multishard-coldkey-recovery.sh (multi-shard, both runtimes).
author: Tin Dang <tin.dang@trustifytechnology.com>
…n data loss
The cold-tier orphan sweep deleted an entire batched spill file
(heap-NNNNNN.mpf, up to 256 KVs) whenever it found a SINGLE orphan key in
it, silently orphaning the file's co-located live keys. Once the sweep ran,
cold read-through collapsed from 200/200 to 88/200 — a benign-looking disk
number masked active data loss. This was only exposed by running the sweep
at a short interval; the default 300s interval never fired in earlier
~3-min benches, which is why the disk "leak" looked unbounded and the data
loss stayed hidden.
Root cause: ColdIndex tracked only key -> location, with no per-file
liveness, so sweep_known_orphans could not tell whether a file still backed
other live keys before unlinking it. The same blind spot leaked files
orphaned by re-eviction (insert overwriting a key's location to a new file):
no key references the old file, so the hot-cap-cold sweep never saw it.
Fix: add a reverse liveness index to ColdIndex.
- file_refs: HashMap<file_id, u32> counts live map entries per file.
insert/remove maintain it; a file is queued for unlink (pending_unlink)
only on a zero-ref transition. A file is deleted ONLY when its last live
referrer is removed — never on a single orphan key while co-located keys
remain.
- sweep_known_orphans now removes orphan ENTRIES, decrements refs, and
drain_pending_unlink unlinks only zero-ref files (off the hot path). This
also reclaims re-eviction orphans the hot-cap-cold sweep cannot see.
- merge / rebuild_from_manifest route through insert, so recovery rebuilds
ref counts consistently.
- timers.rs sweep trigger also fires when has_pending_unlink(), since
re-eviction orphans carry no hot-cap-cold key.
Unlink precedes the manifest commit, so disk is freed even when the commit
fails at the manifest 70-entry root cap (addressed separately for recovery).
Validation (red/green TDD):
- 3 new unit tests: co-located live key + file retained on single-orphan
sweep; file deleted only when last ref removed; overwrite reclaims the
orphaned old file. RED on prior code ("DATA LOSS: file ... was deleted").
- storage suite: 315 passed / 0 failed.
- e2e (s=4, 1M writes, --cold-orphan-sweep-interval-secs 2):
read-through 88/200 -> 200/200; disk bounded at the true working-set
plateau (~604 files / ~85 MB, flat rounds 6-20) vs the 300s-interval
control's unbounded climb to 263 MB+.
author: Tin Dang <tin.dang@trustifytechnology.com>
…ry beyond 70 files/shard
The shard manifest stored all FileEntry records inline in a single 4 KB
root page, capped at MAX_INLINE_ENTRIES = 70. commit() rejected the 71st
entry outright ("too many entries for inline root page: N > 70"). Because
the post-compaction cold tier is recoverable ONLY via the manifest (the
BGREWRITEAOF base RDB holds hot keys only — verified: appendonly=yes +
BGREWRITEAOF recovers cold keys purely through rebuild_from_manifest), the
70-entry cap is a DURABILITY ceiling, not a perf limit: ~70 heap files ≈
9 MB cold/shard, which every real disk-offload deployment exceeds
immediately (Phase C at 1M keys hit "495 > 70"). Past the cap, files 71+
were never recorded, so a crash + restart silently lost that cold data.
Add an append-only overflow region so a root can track unbounded entries
while preserving the dual-root atomic-commit guarantee.
- Root page keeps the first <=70 entries inline (zero overhead, fully
backward compatible for the common case). The surplus packs into
ManifestEntry overflow pages (72 entries each). The root header's
next_page records the overflow run's start page; entry_page_count
records its length; file_count is the inline+overflow total.
- CRASH-ATOMICITY INVARIANT: overflow pages are appended at EOF and
sync_data'd BEFORE the root that references them. Append-only means a
new run never overwrites the currently-active slot's older run, so a
crash before the root's atomic commit leaves the prior committed state
(its root AND its overflow run) fully intact. open() loads each slot's
overflow and returns None on any torn/short/corrupt/inconsistent run,
so selection falls back to the last-good slot — partial loss at worst,
NEVER corruption. Preserves the format's pre-existing failure semantics.
- Bound append-only growth: when a committed file is mostly dead overflow
from superseded commits, compact() rewrites it as
[Root A][Root B][fresh overflow] via temp-file + atomic rename (the live
manifest stays valid until the rename). Best-effort: a compaction
failure never fails an already-durable commit.
The ManifestEntry (0x02) page type and the reserved entry_page_count field
already existed in the v2 layout for exactly this; no on-disk format bump.
Validation (red/green TDD):
- test_overflow_persists_beyond_inline_cap: 200 entries persist + recover.
- test_overflow_commit_crash_atomicity: commit 150>cap then 300>cap, lop
the final 4 KB page (torn state2 overflow); reopen falls back to the
last-good state1 (150), never panic/garbage. RED before this change.
- test_overflow_compaction_bounds_growth: 60 commits of a >cap set stay
within a small multiple of the live set (compaction fires).
- test_overflow_tombstone_and_gc: tombstone + prune an overflow-region
entry, survives reopen.
- test_manifest_max_inline_entries: inverted from ">70 rejected" to
">70 persists via overflow" (MAX_INLINE_ENTRIES still == 70).
- Manifest suite 22 passed; full lib suite 3426 passed; fmt + clippy clean.
Pairs with the per-file-liveness data-loss fix (a3b3a9a). Remaining in this
series: AOF-independent manifest durability (appendonly=no) and the default
cold-orphan-sweep interval.
author: Tin Dang <tin.dang@trustifytechnology.com>
At the 300s default the cold-orphan sweep never fired within a typical benchmark window (~3 min), which had two bad effects: cold orphan heap files accumulated on disk for up to 5 minutes before any reclamation, and — more seriously — it MASKED a batch-file shared-deletion data-loss bug, because the buggy sweep never ran to expose it (fixed by the per-file liveness refcount in ColdIndex, a3b3a9a). Lower the default to 60s (the documented recommended floor). Reclamation is prompt without churning: the sweep's per-file unlinks run off the hot path, so a shorter interval keeps each batch small rather than letting hundreds of orphans pile up between sweeps. Operators can still set 0 to disable or raise it up to 3600. Validated together with a3b3a9a + f5e17ac in the Phase C multishard gate (s=4, sweep=2s, disk-offload on, both runtimes): disk bounded at the working-set plateau, post-crash recovery read-through 200/200. author: Tin Dang <tin.dang@trustifytechnology.com>
…nly=no)
Cold (spilled-to-disk) keys were silently dropped on restart whenever the
server ran with `--appendonly no --disk-offload enable`. After a crash the
cold read-through recovered 0/200 even though the heap files and per-shard
manifests were intact on disk.
Root cause (main.rs shard-construction closure): `persistence_dir` is derived
as Some(..) only when `appendonly == "yes" || save.is_some()`, and is
intentionally None under appendonly=no so that no per-tick WAL fsync writer is
created. Recovery was then gated behind `if let Some(ref dir) = persistence_dir`,
so `restore_from_persistence` — whose v3 path does the AOF-INDEPENDENT cold
rebuild (heap reload + rebuild_from_manifest) — never ran. The disk-offload
cold tier was therefore transitively gated on AOF/save being enabled, even
though `disk_offload_base` was Some. "v3 recovery complete" never logged.
Fix: fire recovery when `persistence_dir.is_some() || disk_offload_base.is_some()`.
The dir argument is used only by the v2 fallback (a no-op when there is no
appendonly.aof/snapshot), so it falls back to `config.dir` under appendonly=no;
the v3 path reads the offload manifest directly. This is additive and changes
nothing for the appendonly=yes / save paths.
Validated red/green with a process-level regression test that boots a 4-shard
server under `--appendonly no --disk-offload enable`, forces cold spill, SIGKILLs,
restarts, and asserts cold read-through:
RED (old gate): 0/200 with 15 heap files on disk (recovery skipped).
GREEN (this fix): 179-200/200; recovery log shows "manifest recovered" +
"reloaded N KV entries from heap-*.mpf".
Because there is no AOF under appendonly=no, a non-zero post-crash count can
only come from the cold-manifest recovery path this fix enables. The test floor
is 75% — the 172-200 GREEN spread is intrinsic to crashing without AOF (a cold
key is durable only once its eviction-tick manifest commit has landed), while a
recovery-skip regression collapses to a categorical 0.
Refs #22
author: Tin Dang <tin.dang@trustifytechnology.com>
Three root causes behind the "zombie eating RAM in multishard mode" report, all on the tokio runtime with disk-offload (the default): 1. Serving hang (listener bind race). run_sharded bound the central tokio listener with a plain TcpListener (no SO_REUSEPORT) while each shard binds its own SO_REUSEPORT listener. A plain socket and REUSEPORT sockets on one port are incompatible, resolved by a bind-order race: if shards win, the central plain-bind hits EADDRINUSE, run_sharded returns Err, and the WHOLE server tears down; if the central wins, the shards' binds fail and the accept path is left half-wired. Either way PING is never answered. Fix: the central listener now also binds SO_REUSEPORT (the same create_reuseport_socket helper the shards use), so it coexists with the shard sockets regardless of order; per_shard_accept stays false (a bound-but-idle REUSEPORT socket black-holes the connections the kernel hashes to it). monoio was always healthy because its bind is REUSEPORT-compatible. 2. io_uring under load. tokio's experimental io_uring bridge floods "Unknown io_uring event type: 0" and drops connections (BrokenPipe) under sustained pipelined load. It is now default-OFF for tokio (opt in via MOON_URING=1; MOON_NO_URING still force-disables and remains the CI default). io_uring stays in monoio, the production runtime. 3. PageCache eager pre-alloc (the actual RAM zombie). disk-offload defaults to enable, so each shard built its own PageCache sized to 25% of the WHOLE maxmemory and EAGERLY committed num_frames*PAGE zeroed bytes at construction. With N shards that is N*25%*maxmemory pre-committed before serving a single command; with --maxmemory unset the guardrail sets it to 80% of host RAM, so a 4-shard default server reserved ~80% of host RAM instantly (measured 3791MB RSS at startup with zero data). Fix: (a) lazy buffers - frame buffers start empty and grow to a full page on first fetch_page miss, so RSS tracks the working set, not the budget; (b) divide the budget across shards (per_shard_pagecache_budget) so total pre-alloc is bounded by the budget regardless of shard count; (c) a startup WARN when an explicit --pagecache-size exceeds 50% of maxmemory. Startup RSS for the 4-shard no-maxmemory case drops from 3791MB to 27MB. Validation (OrbStack Linux aarch64, all server runs cgroup-capped to protect the host): multishard serve smoke 8/8 tokio + 5/5 monoio; #22 cold read-through 200/200 both runtimes; full test suites green on both runtimes; PageCache lazy + budget-divide unit tests (red before, green after); startup RSS 3791MB -> 27MB. author: Tin Dang <tin.dang@trustifytechnology.com>
…doctest fixes Test and CI-green hygiene supporting the multishard stability fix. - Add tests/multishard_serve_smoke.rs (non-ignored, runs on BOTH runtimes): boots the real binary in multishard configs and asserts a raw-TCP inline PING is answered with +PONG inside a hard deadline. This guards the central / per-shard SO_REUSEPORT accept path so the serving hang can never silently regress. Every case pins a small --maxmemory (128 MiB) so the smoke test cannot itself trip the PageCache pre-alloc and OOM a small CI runner; the bind race it guards is bind-time and independent of maxmemory. Shard count is clamped to available_parallelism().clamp(2,4) to avoid the thread-per-core oversubscription wall on small runners. - Derive Default on ServerConfig and add ..Default::default() spreads across the integration tests that construct it by literal, so they compile against the fields added this milestone (F2 aof_fsync_timeout_ms, G1/G2 Option<maxmemory>, Phase-A experimental_per_shard_rewrite). - autovacuum_daemon.rs: serialize the two tests that read/assert the process-global RECL_AUTOVACUUM_LAST_RUN_TS behind a file-local, poison-tolerant mutex. Pre-existing parallel-execution flake: the enabled test advanced the global mid-assertion of the disabled test (failed under load on monoio, passed on tokio by scheduling luck). 5/5 default-parallel runs green after the guard. - migrate_aof.rs: annotate the usage doc-block as ```text so rustdoc stops compiling the shell example as Rust (it broke `cargo test --doc`). - crash_recovery_disk_offload_no_aof (#22): restart-retry helper + 65% recovery floor + 8s settle-before-kill, hardening the intrinsic crash-durability spread against CPU-contention timing (8/8 tokio, 6/6 monoio). author: Tin Dang <tin.dang@trustifytechnology.com>
|
Warning Review limit reached
More reviews will be available in 56 minutes and 11 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
📝 WalkthroughWalkthroughAdds per-shard maxmemory guardrail (optional CLI), per-shard AOF rewrite coordination, manifest overflow pages, batched multi-page cold-tier spill with page-indexed locations and file-liveness tracking, per-shard eviction budgets, bounded cross-shard backpressure, PageCache lazy buffers, and broad test updates. ChangesGuardrail and per-shard maxmemory
Persistence: AOF and manifest
Disk-offload tiering, spill, and cold index
Eviction and budgeting
Cross-shard dispatch and runtime tuning
Tests & infra
Estimated code review effort 🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Possibly related PRs
Suggested labels
✨ Finishing Touches🧪 Generate unit tests (beta)
|
Review Summary by QodoStable multishard + disk-offload on both runtimes; eliminate RAM zombie via lazy PageCache and per-shard budgets
WalkthroughsDescription**Multishard stability and RAM zombie elimination:** • Fixed tokio multishard serving hang by adding SO_REUSEPORT to central listener, resolving bind-order race with per-shard listeners • Disabled io_uring by default for tokio runtime (opt-in via MOON_URING=1) to prevent connection drops under load • Implemented lazy PageCache buffer allocation (frames grow on first use, not pre-committed) and per-shard budget division to eliminate N× RAM over-commitment • Reduced idle RSS from 3791 MB to 29 MB on 4-shard default server; maintains flat 45–48 MB under 600k ops on both runtimes **Per-shard AOF rewrite (F6) and durability hardening:** • Ported PR #129 per-shard AOF work onto main with multi-part rewrite coordination via PerShardRewriteCoord • Implemented fsync timeout bounding (F2) with AckOutcome enum and runtime-specific select/timeout logic • Added per-shard fan-out BGREWRITEAOF compaction with synchronized seq bump and single manifest commit **Disk-offload cold-tier durability fixes:** • Implemented manifest overflow pages for >70 file entries with crash-atomic ordering (write overflow before root commit) • Fixed batch-file shared-deletion data-loss bug via per-file liveness refcount tracking; files only unlinked when ALL co-located keys removed • Refactored spill thread to buffered batch flushing with inline/oversized entry routing; multi-page files with per-entry page indexing • Fixed cold-tier recovery regression (#22): restore from persistence when disk_offload_base.is_some() even under appendonly no • Implemented page-indexed cold read with lazy full-file I/O; only reads specific pages in multi-page batch files **Memory guardrail and per-shard eviction budget (G1/G2):** • Added whole-instance maxmemory guardrail with cgroup-aware auto-capping at ~80% of detected memory limit (Linux cgroup v1/v2 support) • Changed maxmemory from usize to Option to distinguish omitted flag (auto-guardrail) from explicit --maxmemory 0 (unlimited) • Implemented per-shard eviction budget division (maxmemory / num_shards) enforced across all eviction thresholds and pressure cascades • Added startup warnings for explicit --pagecache-size exceeding 50% of maxmemory **Cross-shard dispatch and backpressure:** • Implemented bounded cross-shard dispatch with push_with_backpressure() helper (100µs backoff, 5000 max retries) to prevent connection parking • Added response-wait timeout (~30s) with shutdown checks for cross-shard writes • Added PushOutcome enum to distinguish happy path, transient backpressure, and cancellation scenarios **Configuration and test updates:** • Added experimental_per_shard_rewrite flag (opt-in per-shard BGREWRITEAOF) and aof_fsync_timeout_ms (2000ms default) • Lowered cold_orphan_sweep_interval_secs default from 300 to 60 seconds to promptly reclaim orphans • Updated 30+ test files for optional maxmemory field and new AOF configuration fields • Added comprehensive crash-recovery test suite: cold-tier durability without AOF (#22), per-shard BGREWRITEAOF matrix, multishard serving smoke tests • Serialized autovacuum timestamp tests to prevent flaking under parallel execution Diagramflowchart LR
A["Multishard<br/>Serving"] -->|"SO_REUSEPORT<br/>central listener"| B["Tokio<br/>Stability"]
A -->|"io_uring<br/>default OFF"| B
C["PageCache<br/>Allocation"] -->|"lazy buffers<br/>per-shard budget"| D["RAM<br/>Bounded"]
E["Spill<br/>Thread"] -->|"batch writes<br/>multi-page files"| F["Cold<br/>Tier"]
F -->|"per-file liveness<br/>refcount"| G["Durability<br/>Safe"]
F -->|"overflow pages<br/>crash-atomic"| G
H["Eviction<br/>Pressure"] -->|"per-shard budget<br/>division"| D
I["Cross-Shard<br/>Dispatch"] -->|"bounded backpressure<br/>retry limits"| J["Stability<br/>Guaranteed"]
K["Memory<br/>Guardrail"] -->|"cgroup-aware<br/>auto-cap"| D
File Changes1. src/persistence/aof.rs
|
Code Review by Qodo
Context used✅ Compliance rules (platform):
27 rules 1. sigkill() unsafe missing SAFETY
|
| fn sigkill(child: &mut Child) { | ||
| let pid = child.id() as i32; | ||
| unsafe { | ||
| libc::kill(pid, libc::SIGKILL); | ||
| } | ||
| let _ = child.wait(); |
There was a problem hiding this comment.
1. sigkill() unsafe missing safety 📘 Rule violation ≡ Correctness
A new unsafe block calls libc::kill without the required adjacent // SAFETY: justification, violating the repository’s unsafe-code policy and making it harder for reviewers/auditors to validate invariants and UB risk. This increases audit risk and can block automated unsafe audits.
Agent Prompt
## Issue description
A new `unsafe { libc::kill(...) }` usage was introduced without the policy-required adjacent `// SAFETY:` comment (per `UNSAFE_POLICY.md`), leaving the upheld preconditions/invariants undocumented.
## Issue Context
`UNSAFE_POLICY.md` mandates an explicit `// SAFETY:` justification for every unsafe block and prefers using safe alternatives when available (e.g., `Child::kill`) to reduce unsafe surface area. The new `sigkill()` helper currently calls `libc::kill` inside an unsafe block without documenting why this is safe.
## Fix Focus Areas
- tests/crash_recovery_disk_offload_no_aof.rs[331-339]
- tests/crash_matrix_per_shard_bgrewriteaof.rs[138-145]
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
| let total_entries: usize = completions.iter().map(|c| c.entries.len()).sum(); | ||
| assert_eq!(total_entries, 1, "expected 1 entry across all completions"); | ||
|
|
||
| let c = completions.iter().find(|c| !c.entries.is_empty()).unwrap(); |
There was a problem hiding this comment.
2. Unannotated .unwrap() in spill tests 📘 Rule violation ✧ Quality
New .unwrap() calls were added without the required #[allow(clippy::unwrap_used)] and an immediately preceding justification comment. This breaks the unwrap-audit expectations and makes future lint/audit enforcement unreliable.
Agent Prompt
## Issue description
New `.unwrap()` calls were introduced without the required `// ...` justification comment directly above an attached `#[allow(clippy::unwrap_used)]` attribute.
## Issue Context
The compliance rule requires every unwrap to be explicitly justified and locally allowed (or rewritten to avoid unwrap).
## Fix Focus Areas
- src/storage/tiered/spill_thread.rs[492-588]
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
| pub struct PerShardRewriteCoord { | ||
| /// Writers still to finish. Starts at the shard count; the writer that | ||
| /// decrements it to zero performs the commit + prune. | ||
| remaining: std::sync::atomic::AtomicUsize, | ||
| /// Shared manifest, loaded fresh from disk by the BGREWRITEAOF handler at | ||
| /// rewrite time (normal appends never touch the manifest, and BGREWRITEAOF | ||
| /// is CAS-serialized, so a fresh load is the authoritative current state). | ||
| manifest: Arc<parking_lot::Mutex<crate::persistence::aof_manifest::AofManifest>>, | ||
| /// The generation every writer advances to. Computed once = old_seq + 1. | ||
| new_seq: u64, | ||
| /// The generation being retired; pruned only after the commit. | ||
| old_seq: u64, | ||
| /// Number of shards participating (= initial `remaining`). | ||
| n_shards: usize, | ||
| /// Set by any shard whose fold fails. The final writer checks this and | ||
| /// ABORTS the commit if set — committing `new_seq` while a shard never | ||
| /// wrote its new base would make recovery look for a missing base and | ||
| /// refuse to start. On abort the old generation (`old_seq`) stays the | ||
| /// committed state for all shards (crash-safe). | ||
| failed: std::sync::atomic::AtomicBool, | ||
| } | ||
|
|
||
| impl PerShardRewriteCoord { | ||
| /// Construct a coordinator for an `n_shards`-way rewrite advancing the | ||
| /// shared `manifest` from its current seq to `current_seq + 1`. | ||
| pub fn new( | ||
| manifest: Arc<parking_lot::Mutex<crate::persistence::aof_manifest::AofManifest>>, | ||
| current_seq: u64, | ||
| n_shards: usize, | ||
| ) -> Arc<Self> { | ||
| Arc::new(Self { | ||
| remaining: std::sync::atomic::AtomicUsize::new(n_shards), | ||
| manifest, | ||
| new_seq: current_seq + 1, | ||
| old_seq: current_seq, | ||
| n_shards, | ||
| failed: std::sync::atomic::AtomicBool::new(false), | ||
| }) | ||
| } | ||
|
|
||
| /// The generation writers advance to. | ||
| #[inline] | ||
| pub fn new_seq(&self) -> u64 { | ||
| self.new_seq | ||
| } | ||
|
|
||
| /// Mark the whole rewrite as failed (called by a shard whose fold errored). | ||
| /// The final writer will abort the commit, leaving `old_seq` authoritative. | ||
| #[inline] | ||
| pub fn mark_failed(&self) { | ||
| self.failed | ||
| .store(true, std::sync::atomic::Ordering::Release); | ||
| } | ||
|
|
||
| /// Called by each writer AFTER it has durably written its new base+incr at | ||
| /// `new_seq` and reopened its append file. Decrements the countdown; the | ||
| /// final caller commits the manifest (single seq flip) and prunes the old | ||
| /// generation, then clears the global in-progress flag. | ||
| /// | ||
| /// Crash-safety: the commit (`write_manifest`) is the atomic flip point; | ||
| /// pruning runs strictly after it, so a crash mid-prune only orphans | ||
| /// already-superseded files (recovery uses `new_seq`). | ||
| pub fn shard_done(&self) { | ||
| use std::sync::atomic::Ordering; | ||
| // AcqRel: the decrement-to-zero must observe all prior writers' | ||
| // advance_shard manifest mutations before committing. | ||
| if self.remaining.fetch_sub(1, Ordering::AcqRel) == 1 { | ||
| // Abort if any shard failed to fold: committing new_seq while a | ||
| // shard lacks its new base would break recovery. Keep old_seq. | ||
| if self.failed.load(Ordering::Acquire) { | ||
| let m = self.manifest.lock(); | ||
| // Best-effort: prune the orphaned new-seq files written by the | ||
| // shards that DID fold, so they don't linger. | ||
| for sid in 0..self.n_shards { | ||
| m.prune_shard_files(sid as u16, self.new_seq); | ||
| } | ||
| drop(m); | ||
| error!( | ||
| "F6 per-shard rewrite ABORTED: a shard failed to fold; seq stays {}. \ | ||
| Old generation remains authoritative (crash-safe). A RESTART is \ | ||
| recommended so successful shards' writers stop appending to the \ | ||
| discarded new generation.", | ||
| self.old_seq | ||
| ); | ||
| crate::command::persistence::AOF_REWRITE_IN_PROGRESS.store(false, Ordering::SeqCst); | ||
| return; | ||
| } | ||
| let mut m = self.manifest.lock(); | ||
| m.seq = self.new_seq; | ||
| if let Err(e) = m.write_manifest() { | ||
| error!( | ||
| "F6 per-shard rewrite: final manifest commit (seq {}) failed: {}. \ | ||
| Old generation remains authoritative; rewrite did not take effect.", | ||
| self.new_seq, e | ||
| ); | ||
| // Do NOT prune — old generation is still the committed state. | ||
| drop(m); | ||
| crate::command::persistence::AOF_REWRITE_IN_PROGRESS.store(false, Ordering::SeqCst); | ||
| return; | ||
| } | ||
| for sid in 0..self.n_shards { | ||
| m.prune_shard_files(sid as u16, self.old_seq); | ||
| } | ||
| drop(m); | ||
| info!( | ||
| "F6 per-shard rewrite complete: committed seq {} across {} shards, pruned seq {}", | ||
| self.new_seq, self.n_shards, self.old_seq | ||
| ); | ||
| crate::command::persistence::AOF_REWRITE_IN_PROGRESS.store(false, Ordering::SeqCst); | ||
| } | ||
| } |
There was a problem hiding this comment.
3. Pershardrewritecoord missing loom test 📘 Rule violation ▣ Testability
A new atomic coordination/state-machine type (PerShardRewriteCoord) was introduced without any corresponding loom model test coverage. This risks concurrency bugs slipping through in a multi-threaded rewrite coordinator.
Agent Prompt
## Issue description
A new atomic coordination type was added, but no loom model test was added/updated to exercise its concurrency behavior.
## Issue Context
The compliance requirement mandates loom-based model tests for new/modified atomic state machines, specifically in `tests/loom_response_slot.rs`.
## Fix Focus Areas
- src/persistence/aof.rs[217-327]
- tests/loom_response_slot.rs[1-120]
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
| let preserved_cold_wiring: Vec< | ||
| Vec<( | ||
| Option<std::path::PathBuf>, | ||
| Option<moon::storage::tiered::cold_index::ColdIndex>, | ||
| )>, | ||
| > = if disk_offload_base.is_some() { | ||
| shards | ||
| .iter_mut() | ||
| .map(|s| { | ||
| s.databases | ||
| .iter_mut() | ||
| .map(|db| (db.cold_shard_dir.take(), db.cold_index.take())) | ||
| .collect() |
There was a problem hiding this comment.
4. Cold reads disabled in replay 🐞 Bug ≡ Correctness
src/main.rs temporarily take()s cold_shard_dir/cold_index before multi-part AOF replay and only reattaches them after replay completes, so replayed commands cannot read disk-offloaded keys. This can permanently corrupt recovered state for read-modify-write commands during replay (e.g., INCR reads via db.get() and will treat cold-only keys as missing).
Agent Prompt
## Issue description
During multi-part AOF replay, `cold_shard_dir` and `cold_index` are removed from every `Database` via `.take()` before calling the replay routines, and only restored after replay finishes. Because replay executes the normal command handlers, any replay-time read of a disk-offloaded key (e.g., INCR’s read-modify-write) cannot cold-read and will behave as if the key is missing, producing incorrect recovered state.
## Issue Context
- `rdb::load` swaps whole `Database` instances (`*live = temp`), which is why cold-tier fields get dropped.
- The current approach works around that swap by stashing cold wiring in `main.rs`, but it makes cold storage unavailable during the replay window.
- Fix must ensure cold wiring is present *while replaying incremental RESP commands*, not only after replay ends.
## Fix Focus Areas
- src/main.rs[739-775]
- src/main.rs[1054-1065]
- src/persistence/aof_manifest.rs[1299-1370]
- src/persistence/rdb.rs[362-367]
## Suggested fix approach
1. Change replay path so cold wiring is restored immediately after the base RDB load swap and **before** incremental command replay begins.
- Option A: Add a recovery-only `rdb::load_preserving_wiring(...)` (or similar) that copies selected “topology” fields (cold wiring) from the current live DBs into the temp DBs right before the final swap, and use it from `replay_multi_part` / per-shard replay.
- Option B: Have `replay_multi_part` / `replay_per_shard` accept a preserved-wiring bundle and reattach it right after `rdb::load(...)` returns and before `replay_incr_resp(...)`.
2. Remove (or narrow) the `.take()` window in `main.rs` so cold wiring remains available during replayed command execution.
3. Add a regression test that:
- Sets up disk-offload state with a key only in cold storage.
- Replays an AOF stream containing a read-modify-write command against that key (e.g., INCR).
- Asserts the recovered value matches the expected incremented value.
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
There was a problem hiding this comment.
Actionable comments posted: 5
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
src/shard/conn_accept.rs (1)
368-396:⚠️ Potential issue | 🟠 Major | ⚡ Quick winMigrated tokio connections still drop the disk-offload spill context.
spawn_tokio_connection()now wiresspill_sender/spill_file_id/disk_offload_dir, but migrated tokio connections still buildConnectionContextwithNone,Cell(0), andNone. After an affinity migration, that client falls back to the non-spilling eviction path even when disk-offload is enabled, so cold-tier durability depends on whether the connection migrated. This needs the same three parameters threaded throughspawn_migrated_tokio_connection()and its call sites insrc/shard/event_loop.rs.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/shard/conn_accept.rs` around lines 368 - 396, ConnectionContext is being built with None/Cell(0)/None for spill_sender, spill_file_id, and disk_offload_dir causing migrated tokio connections to lose disk-offload state; update spawn_migrated_tokio_connection and all its call sites to accept and forward the same spill_sender, spill_file_id, and disk_offload_dir parameters that spawn_tokio_connection uses, then change the ConnectionContext::new call in conn_accept.rs (the construction currently passing None, Rc::new(Cell::new(0)), None) to pass the propagated spill_sender, spill_file_id, and disk_offload_dir values so migrated connections retain the spill context.src/server/embedded.rs (1)
221-223:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPort the
main.rsstartup fixes into embedded mode too.
run_embedded()still diverges from the binary on two functional paths: it only restores whenpersistence_diris set, so--appendonly no+ disk-offload loses cold-tier recovery after restart, and it still enables tokio/Linuxper_shard_accept, which is the same accept-path hang/racemain.rsjust disabled. Embedded startup needs the same recovery gate and central-accept behavior asmain.rs.Also applies to: 359-365
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/server/embedded.rs` around lines 221 - 223, In run_embedded, make two changes to match main.rs: (1) call shard.restore_from_persistence whenever disk_offload_base.is_some() even if persistence_dir is None (i.e. change the current if let Some(ref dir) = persistence_dir guard to a condition that restores when persistence_dir.is_some() || disk_offload_base.is_some(), passing persistence_dir.as_deref() as before) so cold-tier recovery runs with --appendonly no + disk-offload; (2) disable per-shard accept so embedded uses the central accept path like main.rs by clearing or forcing the per_shard_accept flag (refer to the per_shard_accept variable/setting used elsewhere) before spawning listeners; apply the same two edits to the other restore block that currently mirrors this logic.src/persistence/aof_manifest.rs (1)
2901-2999: 🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy liftSplit this module before adding more persistence logic.
src/persistence/aof_manifest.rsis already far past the 1500-line cap, and this change adds more rewrite/replay/test surface on top of that. Please break the per-shard rewrite/replay pieces into submodules before this grows further.As per coding guidelines, "No single
.rsfile should exceed 1500 lines. Split into submodules if approaching this limit."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/persistence/aof_manifest.rs` around lines 2901 - 2999, The aof_manifest.rs file has grown beyond the 1500-line guideline — move the per-shard rewrite/replay logic and related tests into one or more submodules (e.g., shard_rewrite.rs and shard_replay.rs). Concretely: extract functions and helpers that operate per-shard (advance_shard, prune_shard_files, shard_base_path_seq, shard_incr_path_seq, plus any helpers used only for shard-level file ops and the test advance_shard_defers_delete_until_after_commit) into a new submodule(s), keep the manifest-level state (AofManifest struct, seq, write_manifest, initialize_multi, load) in the top-level file, and re-export needed symbols with pub use so callers still reference AofManifest::{advance_shard,prune_shard_files,...} as before; update imports/paths in tests to use the new module layout and ensure visibility (pub(crate)/pub) matches previous access.
🟠 Major comments (16)
src/server/conn/handler_monoio/mod.rs-2049-2057 (1)
2049-2057:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftReply-loss now skips AOF for writes that may already have executed.
Once the
PipelineBatchhas been accepted, the target shard can apply the write and then lose the reply. This error branch returns early and dropsmeta, so theaof_bytesnever reach Lines 2066-2095. That leaves the target shard’s in-memory state ahead of its AOF until a later rewrite, so a crash in that window can lose a write that may already be visible to clients. The durability boundary needs to move into the target shard’sPipelineBatchexecution/ack path instead of depending on the reply round-trip.Also applies to: 2066-2095
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/server/conn/handler_monoio/mod.rs` around lines 2049 - 2057, The error branch for shard_responses (when Err(err_msg)) returns early and skips appending aof_bytes for the corresponding PipelineBatch, risking reply-loss durability gaps; modify the handler so that whenever a PipelineBatch was accepted by a target shard (identify via PipelineBatch or meta entries and aof_bytes), you still persist the batch to AOF before continuing or ensure the target shard writes to AOF synchronously as part of its accept/ack path. Concretely: in the shard_responses Err branch (and the same logic covering the range that currently writes aof_bytes), either call the existing AOF append routine with the same aof_bytes tied to the meta entries or change the target shard execution/ack code to perform the AOF write at accept time (so PipelineBatch acceptance guarantees persistence), then remove the early-return that skips AOF.src/persistence/page_cache/mod.rs-93-107 (1)
93-107:⚠️ Potential issue | 🟠 Major | ⚡ Quick winLazy startup allocation still does not reclaim memory after eviction.
These buffers start empty now, but
evict_cold_frames()only clears frame metadata; it never drops the backingVec<u8>for evicted frames. After the working set shrinks, touched pages stay resident forever, so the memory-pressure cascade still cannot actually give PageCache bytes back before escalating to KV eviction.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/persistence/page_cache/mod.rs` around lines 93 - 107, evict_cold_frames currently only clears frame metadata but leaves the backing buffers (buffers_4k and buffers_64k) allocated, so freed pages never release heap capacity; update evict_cold_frames to drop or replace the Vec<u8> backing an evicted frame (for the 4k and 64k pools) — e.g. swap the slot's RwLock<Vec<u8>> contents with an empty Vec::new() or call Vec::clear() plus shrink_to_fit() while holding the lock — so that the heap allocation is released; locate uses in fetch_page, FrameDescriptor and any code that indexes into buffers_4k / buffers_64k to ensure the swap/replace is safe under the same locking scheme.src/persistence/page_cache/mod.rs-68-71 (1)
68-71:⚠️ Potential issue | 🟠 Major | ⚡ Quick winMinimum frame floors can still blow past the whole-instance budget.
per_shard_pagecache_budget()divides first, but these unconditionalmax(64)/max(8)floors add a fixed 768 KiB minimum per shard. Oncewhole_budget / num_shardsdrops below that, aggregate PageCache capacity exceeds the configured whole-instance budget again (for example, 4 MiB across 8 shards becomes 6 MiB of frame capacity). That breaks the new shard-budget contract on small-memory or high-shard-count deployments.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/persistence/page_cache/mod.rs` around lines 68 - 71, The function pagecache_frame_counts currently forces fixed minimums (num_4k.max(64), num_64k.max(8)) which can make per-shard capacity exceed the per_shard_pagecache_budget; change pagecache_frame_counts to never increase frame counts above what budget_bytes permits: compute num_4k and num_64k from the 3:1 split as before, then remove the unconditional .max floor and instead clamp/adjust the counts so (num_4k*PAGE_4K + num_64k*PAGE_64K) <= budget_bytes (reduce 4K frames first or proportionally if needed) and ensure zero/low budgets yield small counts rather than the fixed 64/8 floor; update any callers (per_shard_pagecache_budget) expectations accordingly.src/server/conn/handler_sharded/mod.rs-1283-1288 (1)
1283-1288:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon't spill to the process CWD on a missing offload dir.
If
ctx.spill_senderis set butctx.disk_offload_dirisNone, this still evicts the key and queues the spill under"./data". That hides broken wiring and can send cold files somewhere the shard's read-through/recovery code will never scan. Fail fast here instead of defaulting to".".Proposed fix
- let dir = ctx - .disk_offload_dir - .as_deref() - .unwrap_or(std::path::Path::new(".")); + let Some(dir) = ctx.disk_offload_dir.as_deref() else { + drop(rt); + return Err(Frame::Error(Bytes::from_static( + b"ERR disk offload misconfigured", + ))); + };🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/server/conn/handler_sharded/mod.rs` around lines 1283 - 1288, The code currently falls back to "." when ctx.spill_sender is Some but ctx.disk_offload_dir is None; update the evict logic so that when ctx.spill_sender.is_some() and ctx.disk_offload_dir.is_none() you fail fast instead of defaulting to CWD: in the evict_result block (where ctx.spill_sender, ctx.spill_file_id and the dir variable are used) add an explicit check for ctx.disk_offload_dir.is_none() and return or propagate an error (or log and abort the eviction) rather than using unwrap_or("."), ensuring spills are only queued when disk_offload_dir is present. Ensure you reference ctx.spill_sender, ctx.disk_offload_dir, and ctx.spill_file_id in the change so the code path short-circuits appropriately.src/storage/tiered/kv_spill.rs-336-356 (1)
336-356:⚠️ Potential issue | 🟠 Major | ⚡ Quick winSync the data directory after publishing the batch file.
file.sync_all()only persists the temporary file contents. Without anfsyncondata_dirafterrename, a crash can loseheap-*.mpfeven though later code commits that file into the manifest, leaving recovery with a manifest entry that points at a missing spill file.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/storage/tiered/kv_spill.rs` around lines 336 - 356, The write_kv_spill_batch function currently syncs only the temp file (file.sync_all()) but not the containing directory, so post-rename the directory entry for final_path may not be persisted; after the std::fs::rename(&tmp_path, &final_path)? call, open the data_dir (e.g., with OpenOptions::new().read(true).open(&data_dir) or File::open(&data_dir)) and call sync_all() on that directory handle to ensure the directory metadata (the new heap-*.mpf entry) is flushed to disk.src/shard/persistence_tick.rs-405-438 (1)
405-438:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFail closed when the manifest write fails.
If
manifest.commit()errors here, the code still inserts every key intocold_indexand starts serving that spill file immediately. After the next crash/restart, recovery never sees the file because it was never durably registered, so those evicted keys disappear. Skip indexing, or requeue the completion, until the manifest commit succeeds.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/shard/persistence_tick.rs` around lines 405 - 438, If manifest.commit() fails after manifest.add_file (the block handling shard_manifest), do not proceed to insert entries into the cold_index; instead treat the spill completion as not durable and either requeue the completion or return an error so it can be retried. Concretely: after calling manifest.commit() on the manifest from shard_manifest, on Err(e) stop further processing of that completion (do not run the ColdLocation insertion loop), and re-enqueue or surface the completion so it can be retried later; ensure this change short-circuits both the is_initialized()/with_shard_db path and the shard_databases.write_db path that call ci.insert(...) so no keys are indexed unless commit succeeded.src/storage/tiered/cold_read.rs-48-59 (1)
48-59:⚠️ Potential issue | 🟠 Major | ⚡ Quick winGuard the Unix-only
read_exact_atpath.
src/storage/tiered/cold_read.rsimportsstd::os::unix::fs::FileExt as _;unconditionally (only#[cfg(test)]is nearby), so non-Unix targets won’t compile. Keep the page-offset fast path under#[cfg(unix)]/#[cfg(target_family = "unix")]and add a#[cfg(not(unix))]fallback (e.g.,seek+read_exactintoleaf_buf, or the previous whole-file read) so this module builds everywhere.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/storage/tiered/cold_read.rs` around lines 48 - 59, The code unconditionally imports std::os::unix::fs::FileExt and calls file.read_exact_at(...) which breaks non-Unix builds; wrap the import and the fast pread path in #[cfg(unix)] (or #[cfg(target_family = "unix")]) and provide a #[cfg(not(unix))] fallback that uses file.seek(SeekFrom::Start(page_offset)) followed by file.read_exact(&mut leaf_buf) (or the previous whole-file read) to fill leaf_buf for the specific page; apply these guards around the FileExt import and the block that computes page_offset and calls read_exact_at so symbols like file_path, file, page_offset, leaf_buf, PAGE_4K, and location.page_idx remain usable on all platforms.tests/crash_recovery_disk_offload_no_aof.rs-331-337 (1)
331-337:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd missing
// SAFETY:for thelibc::killunsafeblock insigkill
Intests/crash_recovery_disk_offload_no_aof.rs(unixsigkill, lines ~331-338),unsafe { libc::kill(pid, libc::SIGKILL); }is missing a preceding// SAFETY:comment explaining the invariant that makes the call sound (or refactor to avoidunsafe). Note:scripts/audit-unsafe.shgrepssrc/**/*.rs, so this specific test may not be caught by that script, but it still violates the repo unsafe policy.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/crash_recovery_disk_offload_no_aof.rs` around lines 331 - 337, The unsafe call to libc::kill inside the unix-only function sigkill lacks the required safety comment; add a preceding "// SAFETY:" comment that documents the invariants making the call sound (e.g., that child.id() yields a valid pid for the child process, the pid value is non-zero and points to a process we intentionally intend to SIGKILL, and that using libc::kill with SIGKILL is safe here), or alternatively refactor sigkill to use a safe wrapper (e.g., nix::sys::signal::kill with nix::unistd::Pid) to avoid unsafe entirely; reference the sigkill function and the libc::kill usage when making the change.src/main.rs-790-850 (1)
790-850:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRefuse startup here instead of warning-and-continuing.
If this data dir has already been migrated to single-shard multi-part AOF,
appendonly.aofmay have been renamed toappendonly.aof.legacyby the monoio path above. In that case this tokio branch starts without replaying the authoritative manifest pair and silently serves stale/empty state. This should be a hard error until tokio can load single-shard multi-part AOF, not a warning.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/main.rs` around lines 790 - 850, The tokio branch currently logs a warning when a multi-part AOF manifest exists but monoio-only single-shard replay is unavailable, which can let the process start with stale/empty state; change the #[cfg(not(feature = "runtime-monoio"))] block so it refuses startup instead of continuing: detect the presence of the multi-part manifest/appendonlydir (same condition the monoio branch would have used), and return an error (propagate with anyhow::Context or similar) or exit with a non-zero error rather than calling tracing::warn; update references inside main's startup logic around num_shards, base_dir, and the legacy appendonly.aof handling so that when a manifest is present and runtime-monoio is disabled the program fails fast (mirroring the monoio path behavior that runs replay_multi_part/retire legacy files) rather than continuing.src/config.rs-32-34 (1)
32-34:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAvoid deriving
DefaultforServerConfig(derived defaults don’t match clap CLI defaults).
ServerConfigderivesDefault, but many fields define clapdefault_value/default_value_t(e.g.,bind="127.0.0.1",port=6379,appendonly="yes",protected_mode="yes",memory_arenas_cap=8). The derivedDefaultwill instead produce Rust defaults ("",0,false, etc.), creating an invalid config ifServerConfig::default()is ever used.Repo tests use
ServerConfig::parse_from([])for defaults, but the public derive remains a footgun. Remove theDefaultderive or implementDefaultby delegating toServerConfig::parse_from([])(so it matches the CLI defaults exactly).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/config.rs` around lines 32 - 34, ServerConfig currently derives Default which yields Rust primitive defaults that differ from clap's CLI defaults; remove the #[derive(Default)] or replace it with a custom impl Default that delegates to clap's parsing so defaults match CLI behavior (e.g., implement Default for ServerConfig by calling ServerConfig::parse_from(&[]) and returning that value). Update the ServerConfig declaration (the struct with #[derive(Parser, Debug, Clone, ...)] and any tests expecting ServerConfig::default()) to use the new behavior so ServerConfig::default() mirrors clap defaults exactly.tests/crash_matrix_per_shard_bgrewriteaof.rs-27-27 (1)
27-27:⚠️ Potential issue | 🟠 Major | ⚡ Quick winGate this test to the runtime that actually supports the scenario.
The file comment says tokio rejects per-shard
BGREWRITEAOF, but thiscfg(any(...))still compiles the tests forruntime-tokio.bgrewriteaof()then asserts the command is not refused, so an explicit tokio run fails by construction. Limit this file to monoio or add a tokio-specific expectation path.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/crash_matrix_per_shard_bgrewriteaof.rs` at line 27, The test is compiled for both runtimes but only valid for monoio; update the test gating or add a tokio branch: either change the module cfg from cfg(any(feature = "runtime-monoio", feature = "runtime-tokio")) to only cfg(feature = "runtime-monoio") so the file is only built for monoio, or keep both runtimes and modify the bgrewriteaof() test to branch on the runtime (runtime-tokio should assert the command is refused while runtime-monoio asserts success). Refer to the test function bgrewriteaof() and the module-level cfg to implement the appropriate gate or conditional expectation.tests/crash_matrix_per_shard_bgrewriteaof.rs-138-145 (1)
138-145:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFix missing
// SAFETY:forunsafe { libc::kill(...) }insigkill
tests/crash_matrix_per_shard_bgrewriteaof.rs(sigkill, lines ~138-145) has anunsafe { libc::kill(pid, libc::SIGKILL) }block without the mandatory// SAFETY:comment explaining the invariant (e.g., thatpidcomes from the liveChildhandle for the spawned process).The existing
#[cfg(unix)]vs#[cfg(not(unix))]structure matches other test helpers in this repo that also uselibc::kill, so the immediate required fix is the missing safety rationale.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/crash_matrix_per_shard_bgrewriteaof.rs` around lines 138 - 145, Add a `// SAFETY:` rationale above the unsafe block in the `sigkill` function explaining why the call to `libc::kill(pid, libc::SIGKILL)` is safe: state that `pid` is derived from a live `std::process::Child` handle returned by `child.id()` and thus refers to a valid process id for the spawned child, and that killing with `SIGKILL` is intentional for the test; keep this comment adjacent to the `unsafe { libc::kill(...) }` block in `sigkill` so it documents the required invariant.src/server/listener.rs-315-328 (1)
315-328:⚠️ Potential issue | 🟠 Major | ⚡ Quick winSkip binding the central TCP listener when
per_shard_acceptis enabled.This still creates a REUSEPORT listener even when the main loop never calls
accept()on it. On Linux multishard runs, the kernel can hash connections to that idle socket, so those clients get black-holed again. Bind the central TCP listener only in the!per_shard_acceptpath, or make the per-shard mode use no central TCP socket at all.Suggested fix
- #[cfg(unix)] - let listener = match crate::shard::conn_accept::create_reuseport_socket(&addr) { - Ok(std_listener) => TcpListener::from_std(std_listener)?, - Err(e) => { - tracing::warn!( - "central SO_REUSEPORT bind failed ({}); falling back to plain bind on {}", - e, - addr - ); - TcpListener::bind(&addr).await? - } - }; - #[cfg(not(unix))] - let listener = TcpListener::bind(&addr).await?; + let listener = if per_shard_accept { + None + } else { + #[cfg(unix)] + { + Some(match crate::shard::conn_accept::create_reuseport_socket(&addr) { + Ok(std_listener) => TcpListener::from_std(std_listener)?, + Err(e) => { + tracing::warn!( + "central SO_REUSEPORT bind failed ({}); falling back to plain bind on {}", + e, + addr + ); + TcpListener::bind(&addr).await? + } + }) + } + #[cfg(not(unix))] + { + Some(TcpListener::bind(&addr).await?) + } + }; ... - if per_shard_accept { + if per_shard_accept { std::future::pending::<std::io::Result<(tokio::net::TcpStream, std::net::SocketAddr)>>().await } else { - listener.accept().await + listener.as_ref().expect("listener exists when per_shard_accept is false").accept().await }Also applies to: 403-408
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/server/listener.rs` around lines 315 - 328, The central TCP listener is still being bound even when per_shard_accept is enabled, causing an idle REUSEPORT socket to receive hashed connections; update the code around create_reuseport_socket/TcpListener::bind so the listener is only created when per_shard_accept is false (i.e. move the entire match that calls create_reuseport_socket and TcpListener::bind into the !per_shard_accept branch or conditionally skip binding when per_shard_accept is true), and apply the same change to the other analogous block around lines 403-408 so no central socket is bound in per-shard accept mode.src/persistence/aof_manifest.rs-1246-1251 (1)
1246-1251:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftPersist the rewrite cutoff LSN here, not the manifest sequence.
max_lsnis now defined as the next-free replication offset, but this storesnew_seq(2,3, …). After the first rewrite, recovery can seedmaster_repl_offsetbelow already-persisted data and reissue duplicate LSNs. Thread the coordinated shard cutoff/end offset intoadvance_shardand persist that value instead.Suggested direction
pub fn advance_shard( &mut self, shard_id: u16, new_seq: u64, + max_lsn: u64, rdb_bytes: &[u8], ) -> Result<PathBuf, crate::error::MoonError> { @@ - self.shards[shard_idx].max_lsn = self.shards[shard_idx].max_lsn.max(new_seq); + self.shards[shard_idx].max_lsn = max_lsn;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/persistence/aof_manifest.rs` around lines 1246 - 1251, The code currently sets self.shards[shard_idx].max_lsn = self.shards[shard_idx].max_lsn.max(new_seq) in advance_shard, but max_lsn represents the next-free replication offset and must store the coordinated shard rewrite cutoff/end offset (the persisted rewrite cutoff LSN) rather than the manifest sequence number new_seq; modify advance_shard to accept the computed shard_cutoff (or end_offset) parameter from the caller and persist that value into self.shards[shard_idx].max_lsn (use max to avoid regressions), ensuring the function uses shard_cutoff instead of new_seq so recovery cannot reuse stale LSNs and master_repl_offset remains monotonic.src/persistence/manifest.rs-1284-1437 (1)
1284-1437: 🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy liftThis file is over the 1500-line limit.
The overflow-page implementation and new crash-path tests push
src/persistence/manifest.rspast the repo cap. Splitting serialization/recovery helpers into submodules will make the dual-root and overflow invariants easier to audit.As per coding guidelines, "No single
.rsfile should exceed 1500 lines. Split into submodules if approaching this limit."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/persistence/manifest.rs` around lines 1284 - 1437, The manifest.rs file exceeds the 1500-line limit; refactor by extracting serialization/recovery and overflow-related code into focused submodules so the top-level ShardManifest implementation stays small and tests keep the same behavior. Identify the helpers used by commit(), open(), and overflow handling (e.g., any functions named like persist_overflow, recover_overflow, recover_root, serialize_root, parse_overflow_pages) and move them into new submodules (serialization/recovery and overflow), adjust their visibility (pub(crate) or pub) so ShardManifest::commit, ShardManifest::open, gc_tombstones, add_file, remove_file and the tests still compile unchanged, and update mod declarations and use/imports in manifest.rs to delegate to the new modules while keeping the public API intact. Ensure unit tests (test_overflow_*, test_overflow_commit_crash_atomicity, test_overflow_compaction_bounds_growth, test_overflow_tombstone_and_gc) continue to run by preserving behavior and error paths after the split.src/persistence/manifest.rs-438-450 (1)
438-450:⚠️ Potential issue | 🟠 Major | ⚡ Quick winHandle torn overflow tails before computing
start_page.
open()can recover from a partially written overflow run by falling back to the older root, but the nextcommit()appends from raw EOF and onlydebug_assert!s 4KB alignment. In release, a non-page-aligned EOF makesstart_pagefloor to the previous page while the new overflow bytes are written at the unaligned offset, so the committed root points at the wrong location and the next reopen drops the new slot. Truncate or reject anyeof % PAGE_4K != 0tail before appending.Suggested fix
- let eof = self.file.seek(SeekFrom::End(0))?; - // The manifest is always a whole number of 4 KB pages. - debug_assert_eq!(eof % PAGE_4K as u64, 0); + let mut eof = self.file.seek(SeekFrom::End(0))?; + let misalignment = eof % PAGE_4K as u64; + if misalignment != 0 { + eof -= misalignment; + self.file.set_len(eof)?; + self.file.sync_data()?; + } let start_page = (eof / PAGE_4K as u64) as u32;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/persistence/manifest.rs` around lines 438 - 450, The overflow append path computes start_page from EOF without handling a torn (non-4KB-aligned) tail, which in release causes writes at the wrong offset; before computing start_page in the block that writes overflow (the code around overflow_start_page and use of PAGE_4K), detect if eof % PAGE_4K != 0 and truncate the file down to the previous PAGE_4K boundary (or return an error) so subsequent writes are page-aligned; implement this by checking eof = self.file.seek(SeekFrom::End(0))?, computing eof_floor = (eof / PAGE_4K as u64) * PAGE_4K as u64 and calling self.file.set_len(eof_floor) (or propagate an error) before computing start_page and writing the overflow buffer.
🟡 Minor comments (5)
src/shard/dispatch.rs-690-693 (1)
690-693:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winDrop the wall-clock assertion from this test.
start.elapsed() < 5msmeasures scheduler noise as much as helper behavior, so it can flap on a loaded CI runner even when the fast path never sleeps.calls == 1already proves the no-retry path; if you want timing coverage, use paused Tokio time instead.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/shard/dispatch.rs` around lines 690 - 693, Remove the flaky wall-clock assertion that checks start.elapsed() < Duration::from_millis(5) in the test (the assert! using start and the 5ms bound), because it measures scheduler noise; instead rely on the existing calls == 1 check to prove the no-retry path. If you need to assert timing behavior for the fast path, convert the test to use paused Tokio time (tokio::time::pause/advance) and assert against simulated time rather than real elapsed time. Ensure you remove the assert! line and keep the calls variable assertion intact.tests/multishard_serve_smoke.rs-106-110 (1)
106-110:⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
safe_shards()still oversubscribes 1-core runners.The lower clamp of
2means a host withavailable_parallelism() == 1still runs the “multishard” cases with 2 shards, which is exactly the oversubscription confound the comment says this helper avoids. Skip the multishard variants when fewer than 2 cores are available instead of forcing 2 here.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/multishard_serve_smoke.rs` around lines 106 - 110, The helper safe_shards() forces at least 2 shards, causing 1-core runners to be oversubscribed; change the lower clamp from 2 to 1 so it returns the actual available_parallelism() (bounded above by 4) instead of forcing two shards — update the safe_shards() implementation (which uses std::thread::available_parallelism()) to use .clamp(1, 4) and ensure any test harness that expects multishard behavior skips multishard variants unless safe_shards() >= 2.tests/multishard_serve_smoke.rs-37-44 (1)
37-44:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAvoid the probe-port race in the smoke harness.
free_port()releases the ephemeral listener before the child process binds, so a parallel test/process can steal that port and make this smoke test fail with the same “never served” symptoms it is trying to catch. Please retry port allocation/spawn on bind failure instead of treating the first chosen port as reserved.Also applies to: 113-125
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/multishard_serve_smoke.rs` around lines 37 - 44, free_port() currently picks and releases an ephemeral port before spawning the child, which allows races; change the harness to loop: pick a candidate port (using the existing free_port() helper or by trying to bind and immediately drop), spawn the child process with that port, then probe/connect to the child (or parse its startup output) with a short timeout; if the probe fails with a bind-refused or "address already in use" condition, kill the child and retry with a new port up to a small retry limit. Apply this retry-on-bind-failure pattern to the other probe/spawn logic referenced (the code around lines 113-125) so the test only fails after exhausting retries rather than on the first stolen port.README.md-236-242 (1)
236-242:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winClarify the guardrail caveats here.
This overstates the behavior a bit: omitting
--maxmemorydoes not always forceallkeys-lru, and it can skip auto-capping entirely when no memory limit is detectable. The doc should matchresolve_memory_guardrail()and say the policy flip happens only fromnoeviction, otherwise the existing eviction policy is preserved.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@README.md` around lines 236 - 242, Update the README blurb about --maxmemory to match resolve_memory_guardrail(): clarify that omitting --maxmemory does not always force allkeys-lru — the code only flips the eviction policy to allkeys-lru when it would otherwise be noeviction and a memory guardrail is applied; if no memory limit is detectable the auto-cap can be skipped entirely and an existing eviction policy is preserved. Reference the behavior of resolve_memory_guardrail(), the flags --maxmemory and --shards, and the policies allkeys-lru and noeviction so readers understand the exact conditional policy flip and the per-shard cap calculation.tests/crash_matrix_per_shard_bgrewriteaof.rs-281-281 (1)
281-281:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winDon't derive the port with
+1; that socket was never reserved.
unique_port()only proves the returned port was free at probe time.saturating_add(1)picks a different port that may already be occupied, so this test can fail intermittently for reasons unrelated toBGREWRITEAOF. Callunique_port()again here instead.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/crash_matrix_per_shard_bgrewriteaof.rs` at line 281, The test currently derives a second port by doing let port = unique_port().saturating_add(1); which is unsafe because the +1 socket was never reserved; replace that expression by calling unique_port() again to obtain an independently probed free port (i.e., call unique_port() twice rather than adding 1) so the variable assigned to port is guaranteed to be probed free when used in the BGREWRITEAOF test.
🧹 Nitpick comments (1)
tests/autovacuum_daemon.rs (1)
14-28: 💤 Low valuePrefer
parking_lot::Mutexfor consistency with project guidelines.The coding guidelines specify using
parking_lotlocks instead ofstd::synclocks across all Rust files. While this is test infrastructure (not a hot path), usingparking_lot::Mutexmaintains consistency and is simpler—parking_lot locks never poison, so you can replace the poison-recovery.unwrap_or_else(|e| e.into_inner())with a plain.lock(). The non-poisoning behavior still prevents wedging sibling tests if one panics.As per coding guidelines: "Use
parking_lot::RwLock/parking_lot::Mutex— neverstd::synclocks."♻️ Refactor to use parking_lot::Mutex
use std::sync::atomic::Ordering; -use std::sync::{Mutex, MutexGuard, OnceLock}; +use std::sync::OnceLock; +use parking_lot::{Mutex, MutexGuard}; -/// Serializes the two tests that read/assert the process-global -/// `RECL_AUTOVACUUM_LAST_RUN_TS`. The default parallel test runner otherwise lets -/// `test_autovacuum_tick_updates_last_run_ts` advance that global while -/// `test_disabled_autovacuum_is_noop` is mid-assertion → spurious cross-test -/// failure (observed flaking under load). Poison-tolerant: a panicking holder -/// must not wedge the sibling. +/// Serializes the two tests that read/assert the process-global +/// `RECL_AUTOVACUUM_LAST_RUN_TS`. parking_lot::Mutex never poisons, so a +/// panicking holder will not wedge sibling tests. fn autovacuum_ts_lock() -> MutexGuard<'static, ()> { static GUARD: OnceLock<Mutex<()>> = OnceLock::new(); - GUARD - .get_or_init(|| Mutex::new(())) - .lock() - .unwrap_or_else(|e| e.into_inner()) + GUARD.get_or_init(|| Mutex::new(())).lock() }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/autovacuum_daemon.rs` around lines 14 - 28, The test helper autovacuum_ts_lock currently uses std::sync::Mutex and poison-recovery unwrap_or_else; replace std::sync::Mutex/MutexGuard with parking_lot::Mutex and parking_lot::MutexGuard (keep std::sync::OnceLock or replace with equivalent if desired), remove the poison-handling call and simply call .lock() on the parking_lot mutex, and adjust imports accordingly; refer to the autovacuum_ts_lock function and the GUARD static that protects RECL_AUTOVACUUM_LAST_RUN_TS when making the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 28f3eec1-3613-4d85-bb70-498af67cfcdc
📒 Files selected for processing (61)
CHANGELOG.mdREADME.mdsrc/command/persistence.rssrc/config.rssrc/main.rssrc/persistence/aof.rssrc/persistence/aof_manifest.rssrc/persistence/manifest.rssrc/persistence/migrate_aof.rssrc/persistence/page_cache/mod.rssrc/server/conn/handler_monoio/mod.rssrc/server/conn/handler_sharded/mod.rssrc/server/conn/handler_single.rssrc/server/embedded.rssrc/server/listener.rssrc/shard/conn_accept.rssrc/shard/dispatch.rssrc/shard/event_loop.rssrc/shard/persistence_tick.rssrc/shard/spsc_handler.rssrc/shard/timers.rssrc/storage/eviction.rssrc/storage/tiered/cold_index.rssrc/storage/tiered/cold_read.rssrc/storage/tiered/kv_spill.rssrc/storage/tiered/spill_thread.rstests/adversarial_v0110_fix01_set_delete_rollback.rstests/adversarial_v0110_fix02_err_path_intent.rstests/adversarial_v0110_fix03_simplestring_graph.rstests/adversarial_v0110_fix04_shortest_path_call_parity.rstests/adversarial_v0110_fix06_shortest_path_min_hops.rstests/adversarial_v0110_fix07_multihop_edge_var_reject.rstests/autovacuum_daemon.rstests/crash_matrix_per_shard_aof.rstests/crash_matrix_per_shard_bgrewriteaof.rstests/crash_recovery_disk_offload_no_aof.rstests/ft_search_as_of_boundary.rstests/ft_search_as_of_filter.rstests/ft_search_concurrent_readers.rstests/ft_search_multi_shard_as_of.rstests/ft_search_temporal_parity.rstests/graph_bench_compare.rstests/graph_bench_e2e.rstests/graph_integration.rstests/graph_stress_deep.rstests/integration.rstests/kill_snapshot.rstests/lunaris_cypher_shortest_path.rstests/lunaris_cypher_temporal.rstests/lunaris_hybrid_ft_search.rstests/mq_integration.rstests/multishard_serve_smoke.rstests/pipeline_auto_index.rstests/replication_test.rstests/txn_completeness_edge_cases.rstests/txn_cypher_write_rollback.rstests/txn_ft_search_snapshot.rstests/txn_graph_wiring.rstests/txn_kv_wiring.rstests/vacuum_commands.rstests/workspace_integration.rs
| if self.remaining.fetch_sub(1, Ordering::AcqRel) == 1 { | ||
| // Abort if any shard failed to fold: committing new_seq while a | ||
| // shard lacks its new base would break recovery. Keep old_seq. | ||
| if self.failed.load(Ordering::Acquire) { | ||
| let m = self.manifest.lock(); | ||
| // Best-effort: prune the orphaned new-seq files written by the | ||
| // shards that DID fold, so they don't linger. | ||
| for sid in 0..self.n_shards { | ||
| m.prune_shard_files(sid as u16, self.new_seq); | ||
| } | ||
| drop(m); | ||
| error!( | ||
| "F6 per-shard rewrite ABORTED: a shard failed to fold; seq stays {}. \ | ||
| Old generation remains authoritative (crash-safe). A RESTART is \ | ||
| recommended so successful shards' writers stop appending to the \ | ||
| discarded new generation.", | ||
| self.old_seq | ||
| ); | ||
| crate::command::persistence::AOF_REWRITE_IN_PROGRESS.store(false, Ordering::SeqCst); | ||
| return; | ||
| } | ||
| let mut m = self.manifest.lock(); | ||
| m.seq = self.new_seq; | ||
| if let Err(e) = m.write_manifest() { | ||
| error!( | ||
| "F6 per-shard rewrite: final manifest commit (seq {}) failed: {}. \ | ||
| Old generation remains authoritative; rewrite did not take effect.", | ||
| self.new_seq, e | ||
| ); | ||
| // Do NOT prune — old generation is still the committed state. | ||
| drop(m); | ||
| crate::command::persistence::AOF_REWRITE_IN_PROGRESS.store(false, Ordering::SeqCst); | ||
| return; |
There was a problem hiding this comment.
Abort paths still leave successful shards appending to uncommitted files.
By the time this branch runs, the successful writers have already reopened onto new_seq in do_rewrite_per_shard(). Returning here with manifest.seq still on old_seq means subsequent writes on those shards go to files recovery will ignore after a restart. Logging “restart recommended” is not enough here; this path needs an explicit rollback or a fail-fast shutdown before more writes are accepted.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/persistence/aof.rs` around lines 283 - 315, The abort branch after
checking self.failed currently returns with shards' writers already reopened on
self.new_seq (done in do_rewrite_per_shard()), causing successful shards to keep
appending to uncommitted files; fix by performing an explicit rollback or
fail-fast stop before returning: acquire the manifest lock (m) as done, call the
rollback routine that reopens each shard's writer back to self.old_seq (or
invoke the writer-shutdown/reopen function used by do_rewrite_per_shard for each
sid and ensure m.prune_shard_files is still called for any orphaned new_seq
files), set AOF_REWRITE_IN_PROGRESS.store(false, Ordering::SeqCst), and if a
safe rollback cannot be completed atomically, trigger a fail-fast termination so
no further writes to new_seq occur (ensure this uses the same writer-control
APIs used elsewhere rather than leaving writers active).
| for s in &self.senders { | ||
| // Blocking send for guaranteed delivery — see the doc comment. | ||
| if s.send(AofMessage::RewritePerShard { | ||
| shard_dbs: shard_dbs.clone(), | ||
| coord: coord.clone(), | ||
| }) | ||
| .is_err() | ||
| { | ||
| error!( | ||
| "F6 per-shard rewrite: a writer channel is disconnected; \ | ||
| rewrite aborted (no manifest commit, old generation remains \ | ||
| authoritative). Inspect AOF writer threads." | ||
| ); | ||
| return Err(AofPoolSendError::SendFailed); |
There was a problem hiding this comment.
Handle partial fan-out failure before returning.
If one send() fails after earlier writers already received RewritePerShard, those writers can still fold and call shard_done(), but the failed sender and the unsent tail never decrement remaining. That leaves AOF_REWRITE_IN_PROGRESS stuck and can strand successful shards on new_seq files that will never be committed. Mark the coord failed and account for every unsent shard before returning so the dispatched writers can drive the abort path to completion.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/persistence/aof.rs` around lines 794 - 807, When a send to a writer fails
in the loop over self.senders while dispatching AofMessage::RewritePerShard
(with shard_dbs and coord), do not return immediately; instead mark the coord as
failed (e.g. call coord.mark_failed() or coord.fail()) and account for every
unsent shard by invoking the same completion path the writers use (e.g. call
coord.shard_done() once per unsent shard or otherwise decrement the remaining
counter for each shard that never received the message) before returning
Err(AofPoolSendError::SendFailed) so AOF_REWRITE_IN_PROGRESS is cleared and
successful shards can finish and commit new_seq as intended.
| AofMessage::AppendSync { | ||
| lsn, | ||
| bytes: data, | ||
| ack, | ||
| } => { | ||
| write_framed(file, lsn, &data).map_err(|e| AofError::Io { | ||
| path: PathBuf::from("<aof per-shard incr drain>"), | ||
| source: e, | ||
| })?; | ||
| outcome.drained += 1; | ||
| // Durability for these is covered by the post-drain fsync at | ||
| // the rewrite boundary (mirrors drain_pending_appends). | ||
| let _ = ack.send(AofAck::Synced); |
There was a problem hiding this comment.
Do not ack AppendSync before the rewrite-boundary fsync succeeds.
Line 2799 sends AofAck::Synced before do_rewrite_per_shard() reaches its phase-1/phase-3 file.sync_data(). Under appendfsync=always, a client can therefore receive +OK during BGREWRITEAOF even though a crash in that window still loses the entry. Keep these ack senders pending and only fulfill them after the caller’s post-drain sync completes.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/persistence/aof.rs` around lines 2787 - 2799, The current code sends
AofAck::Synced immediately inside the AofMessage::AppendSync arm (after calling
write_framed), which can ack clients before the rewrite-boundary fsync in
do_rewrite_per_shard() completes; instead, buffer the ack sender for later and
only send AofAck::Synced after the caller’s post-drain sync finishes.
Concretely: in the AppendSync branch (where write_framed is called and ack is
available) stop calling ack.send(AofAck::Synced) and push the ack into a pending
list/vec (e.g., outcome.pending_syncs or a new field) so do_rewrite_per_shard()
can iterate and fulfill those senders only after its phase-1/phase-3
file.sync_data() succeeds; ensure any error paths still notify or drop the
sender appropriately.
| // ── Partition into inline candidates and oversized entries ──────────────── | ||
| // We use indices to avoid re-allocating keys. `inline_indices` are the | ||
| // positions in `buffer` of entries that fit the inline threshold. | ||
| let mut inline_indices: Vec<usize> = Vec::with_capacity(buffer.len()); | ||
| let mut oversized_indices: Vec<usize> = Vec::new(); | ||
|
|
||
| for (i, req) in buffer.iter().enumerate() { | ||
| if req.value_bytes.len() <= INLINE_MAX_VALUE_BYTES { | ||
| inline_indices.push(i); | ||
| } else { | ||
| oversized_indices.push(i); | ||
| } | ||
| } | ||
|
|
||
| // ── Write ONE batch file for all inline entries ─────────────────────────── | ||
| if !inline_indices.is_empty() { | ||
| // Use the file_id of the first inline entry for the batch file. | ||
| let file_id = buffer[inline_indices[0]].file_id; | ||
| let shard_dir = buffer[inline_indices[0]].shard_dir.clone(); | ||
|
|
||
| let spill_entries: Vec<SpillEntry> = inline_indices | ||
| .iter() | ||
| .map(|&i| SpillEntry { | ||
| key: buffer[i].key.clone(), | ||
| value_bytes: buffer[i].value_bytes.clone(), | ||
| value_type: buffer[i].value_type, | ||
| flags: buffer[i].flags, | ||
| ttl_ms: buffer[i].ttl_ms, | ||
| }) | ||
| .collect(); | ||
|
|
||
| let completion = match build_kv_spill_batch(&spill_entries, file_id) { | ||
| Ok(batch) => { | ||
| let total_pages = batch.leaves.len() as u32; // overflow is always empty (inline-only) | ||
| match write_kv_spill_batch(&shard_dir, file_id, &batch) { | ||
| Ok(byte_size) => { | ||
| let entries = inline_indices | ||
| .iter() | ||
| .zip(batch.locations.iter()) | ||
| .map(|(&buf_idx, &(page_idx, slot_idx))| SpillCompletionEntry { | ||
| key: buffer[buf_idx].key.clone(), | ||
| db_index: buffer[buf_idx].db_index, | ||
| page_idx, | ||
| slot_idx, | ||
| }) | ||
| .collect(); | ||
| SpillCompletion { | ||
| file_entry: make_file_entry(file_id, total_pages, byte_size), | ||
| entries, | ||
| success: true, | ||
| } | ||
| } | ||
| Err(e) => { | ||
| warn!( | ||
| file_id, | ||
| error = %e, | ||
| count = inline_indices.len(), | ||
| "spill_thread: inline batch write failed" | ||
| ); | ||
| SpillCompletion { | ||
| file_entry: make_file_entry(file_id, 0, 0), | ||
| entries: Vec::new(), | ||
| success: false, | ||
| } | ||
| } | ||
| } | ||
| } | ||
| Err(e) => { | ||
| warn!( | ||
| file_id, | ||
| error = %e, | ||
| count = inline_indices.len(), | ||
| "spill_thread: inline batch build failed" | ||
| ); | ||
| SpillCompletion { | ||
| file_entry: make_file_entry(file_id, 0, 0), | ||
| entries: Vec::new(), | ||
| success: false, | ||
| } |
There was a problem hiding this comment.
Don't fail the entire inline flush when one entry misses the inline fit heuristic.
This pre-screen only checks value_bytes.len(), so a large key or an incompressible ~3.5KB value can still make build_kv_spill_batch() return InvalidData on a fresh leaf. That error path currently turns one bad candidate into a failed completion for the entire inline buffer, even though async eviction has already removed all of those keys from RAM. Please retry the offending entry via build_kv_spill_pages() (or split/rebuild the batch) instead of dropping the whole flush.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/storage/tiered/spill_thread.rs` around lines 155 - 233, The current logic
treats any error from build_kv_spill_batch(...) as a fatal failure for the whole
inline_indices batch; instead, when build_kv_spill_batch fails (e.g.,
InvalidData due to a single oversized or incompressible entry), retry the
entries individually: for each index in inline_indices call
build_kv_spill_pages() (or otherwise build a per-entry batch/page) for that
single SpillEntry and then write it with the same writer (e.g.,
write_kv_spill_batch or the per-page writer), collecting successful
SpillCompletionEntry items; on per-entry failure log and skip that entry but do
not mark the entire SpillCompletion as failed if at least one entry succeeded,
and only return a failed SpillCompletion when none of the entries could be
written. Update the error branch around build_kv_spill_batch(...) to implement
this per-entry retry loop and to populate
SpillCompletion.file_entry/entries/success accordingly.
| match completion_tx.try_send(completion) { | ||
| Ok(()) => {} | ||
| Err(flume::TrySendError::Full(_)) => { | ||
| SPILL_COMPLETION_DROPPED.fetch_add(1, Ordering::Relaxed); | ||
| warn!( | ||
| "spill_thread: completion channel full, dropping completion (total dropped: {})", | ||
| SPILL_COMPLETION_DROPPED.load(Ordering::Relaxed) | ||
| ); | ||
| } | ||
| Err(flume::TrySendError::Disconnected(_)) => { | ||
| // Event loop dropped its receiver -- shutting down; ignore. | ||
| } | ||
| } |
There was a problem hiding this comment.
Completion drops are not safe here.
A dropped SpillCompletion never reaches apply_spill_completions(), so the file is not added to the manifest and its keys are not reinserted into cold_index. By this point the keys have already been evicted from RAM, so try_send(...Full) turns backpressure into immediate misses and permanent loss after restart.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/storage/tiered/spill_thread.rs` around lines 407 - 419, The code
currently drops SpillCompletion when completion_tx.try_send(...) returns Full,
causing lost completions and manifest/cold_index corruption; change the send
logic in spill_thread to guarantee delivery of the SpillCompletion (do not
increment SPILL_COMPLETION_DROPPED or drop the value). Replace the try_send
branch with a blocking or async send loop using completion_tx.send(completion)
(or await completion_tx.send_async/completion_tx.send with retry on Full) and
only treat Disconnected as shutdown; ensure the SpillCompletion is retried until
sent so apply_spill_completions() always receives it and keys can be reinserted
into cold_index.
Self-review of PR #136 (BUG #1, MEDIUM, in-scope to f5e17ac). ShardManifest:: compact() rewrites the manifest to a temp file, fsyncs, and renames it over self.path (the durability point), then reopens self.file against the new inode. If that reopen failed, the error propagated out of compact() and was swallowed by commit() (warn-only, "commit already durable"). But self.file then still referred to the PRE-rename, now-orphaned inode: every SUBSEQUENT commit would seek+write through that detached handle, silently discarding all post-compact manifest updates while recovery (which reads self.path) saw only the compaction snapshot. Cold-tier file entries committed after the failed reopen were lost on restart. Trigger: reopen fails (fd exhaustion / permission / mount transient) while overflow is in use — rare, but silent and persistent when hit. Fix: on reopen failure, set a `needs_reopen` flag instead of leaving a stale handle. commit() checks the flag FIRST and reattaches to self.path before any write (resetting active_slot to 0, matching the compacted layout). If that reattach also fails, commit() returns the error — the data genuinely cannot be persisted, so fail loudly rather than write into a dead fd. TDD: compact_reopen_failure_does_not_silently_lose_later_commits drives a compact() with a cfg(test) reopen-failure injection (points self.file at a throwaway inode, the real "orphaned fd"), then commits another entry and asserts it survives recovery. RED before the guard (entry 999 missing after reopen); GREEN after. Full persistence::manifest module green on both runtimes. author: Tin Dang <tin.dang@trustifytechnology.com>
Self-review of PR #136 (BUG #2, MEDIUM). The tokio per-shard AOF writer (per_shard_aof_writer_task) wrote each record as [u64 lsn][u32 len][payload] by issuing the header write_all then the payload write_all, logging+`continue` on either failure — with NO write_error latch. The single-file writer (aof.rs ~:1467) and the monoio per-shard writer (~:2125) both carry that latch; the tokio per-shard path did not. A torn write (header lands, payload fails) left an orphaned 12-byte header; the NEXT Append wrote a fresh header right after it, so the framed reader on replay misread the orphan's `len` and consumed the following record's bytes as payload — corrupting every record after the tear. Pre-existing since the per-shard foundation (#129, 8c49ec2), but this branch's 8853ade makes the tokio per-shard writer live for the first time, so the latent bug is now reachable. Fix: mirror the monoio latch. A `write_error` flag is set on any header/payload/ flush/sync failure; once set, Append drops silently (fire-and-forget) and AppendSync replies WriteFailed (so the caller errors instead of hanging to the F2 fsync timeout or ack'ing a write into a corrupt stream) — never emitting another record. The writer stays latched for its lifetime; recovery replays the clean prefix and a rewrite starts a fresh file. Zero behavior change on the happy path (flag never set). TDD: tokio_per_shard_writer_latches_after_torn_write drives the real writer task against a real PerShard manifest, injects a torn write on the 2nd Append via a cfg(test) atomic (TEST_FAIL_WRITE_AT — not an env var; edition-2024 set_var is unsafe), and asserts (a) a following AppendSync returns WriteFailed and (b) the on-disk incr replays as exactly the pre-tear record with no corruption. RED before the latch (AppendSync acked Synced; stream corrupt); GREEN after. Full persistence::aof module green on both runtimes; fmt + clippy clean. author: Tin Dang <tin.dang@trustifytechnology.com>
Self-review (20 commits) — 2 MEDIUM durability bugs found + fixedRan a 3-way adversarial review of the whole branch. The headline multishard
Each fix has a RED→GREEN regression test (RED verified by bypassing the check). Lower-severity follow-ups (NOT in this PR — tracked)
|
…overflow DoS The precedence-climbing recursive-descent expression grammar pushes ~9 stack frames per source nesting level (parse_expression -> or -> and -> not -> comparison -> addition -> multiplication -> unary -> primary -> recurse), but check_depth() only counts one source level per recursion. With the previous limit of 64, a pathologically nested query descended ~64 levels (~576 debug frames) BEFORE the guard could fire at level 65, overflowing the stack. Cypher is parsed on async worker threads (tokio/monoio default to 2 MiB stacks, not the 8 MiB main thread), so this overflow aborted the whole process with SIGABRT instead of returning NestingDepthExceeded — a real remote DoS, and a red CI "Check" job (test_nesting_depth_exceeded overflowed its stack). Introduce DEFAULT_MAX_NESTING_DEPTH = 32, threaded through parse_cypher, the test harness, and the public re-export. 32 levels keeps the worst- case descent to ~288 debug frames (well under 2 MiB, ~50% margin vs the overflowing 64) while still allowing far deeper nesting than any real query needs. Deep nesting now returns NestingDepthExceeded gracefully, never aborts. Verified in OrbStack VM (cgroup-capped): test_nesting_depth_exceeded and test_nesting_within_limit both pass, full graph:: module 368/368 green, fmt + clippy clean under --features graph and --no-default-features --features runtime-tokio,jemalloc,graph. author: Tin Dang <tin.dang@trustifytechnology.com>
…pher depth bound Document the user-facing behavior changes shipped in PR #136: - CLAUDE.md "Environment Variables": add `MOON_URING=1` (opt into the tokio->io_uring bridge, now default-off because it floods errors under load) and clarify that `MOON_NO_URING=1` force-disables io_uring on both the monoio runtime and the tokio bridge. - CHANGELOG.md [0.2.0-alpha]: new "Fixed" subsection for PR #136 covering the three multishard RAM-zombie root causes (PageCache eager pre-alloc, tokio listener SO_REUSEPORT bind-race, io_uring-under-tokio default-off), the two in-PR durability TDD fixes (manifest compact-reopen latch, tokio per-shard AOF write_error latch), the Cypher nesting-depth bound, and the three low-severity follow-ups (#137/#138/#139). author: Tin Dang <tin.dang@trustifytechnology.com>
…ration spawn_tokio_connection, spawn_monoio_connection and spawn_migrated_monoio_ connection all thread the disk-offload spill context (spill_sender / spill_file_id / disk_offload_dir) into the ConnectionContext, but spawn_migrated_tokio_connection built it with None / Cell(0) / None. After an affinity migration the tokio connection therefore fell back to the non-spilling eviction path even when disk-offload was enabled, so cold-tier durability silently depended on whether the connection had migrated. CodeRabbit flagged this on PR #136 (review 4420584384); validated against the merged code (issue #141). Fix: thread spill_sender / spill_file_id / disk_offload_dir through spawn_migrated_tokio_connection and pass them to ConnectionContext::new, mirroring the three sibling spawn paths. Updated both call sites in event_loop.rs. Validation: this is mechanical parity with three already-correct sibling functions in the same file. A genuine red/green unit test is disproportionate here — ConnectionContext::new takes 26 args and is consumed inside a spawned task, so asserting the migrated context's spill fields would require stubbing the entire context or simulating a live cross-shard connection migration. Verified instead by: clean build + clippy under both default (monoio) and --no-default-features --features runtime-tokio,jemalloc, and structural parity (all four spawn paths now pass the spill context). The behavioural guarantee is covered by the existing disk-offload read-through integration suite. author: Tin Dang <tin.dang@trustifytechnology.com>
…ard accept run_embedded() diverged from the production main.rs startup path on two functional points (CodeRabbit PR #136 review 4420584384, issue #142): 1. Recovery only ran when `persistence_dir` was set, so `--appendonly no` + disk-offload (persistence_dir = None, disk_offload_base = Some) skipped `Shard::restore_from_persistence` entirely — cold-tier keys were unrecoverable after an embedded restart. Now gated on `should_run_recovery(persistence_dir, disk_offload_enabled)` = persistence OR disk-offload, recovering from `config.dir` when persistence_dir is None (mirrors main.rs:702-704). 2. Embedded forced `per_shard_accept = cfg!(target_os = "linux")`, re-enabling the tokio/Linux per-shard accept bind-order race that PR #136 disabled in main.rs. Now `false` (central accept), matching main.rs. Validated in OrbStack VM (cgroup-capped), red->green TDD under --no-default-features --features runtime-tokio,jemalloc (embedded is runtime-tokio-only): recovery_runs_for_disk_offload_without_persistence (RED with the disk-offload term dropped -> GREEN). fmt + clippy clean under both default (monoio) and tokio,jemalloc. Note: a full embedded+disk-offload+restart integration test was not added — there is no run_embedded test harness to extend and standing it up for a 2-line parity fix is disproportionate; the behavioural recovery path is the same one covered by tests/crash_recovery_disk_offload_no_aof.rs on main.rs. The extracted should_run_recovery predicate unit-tests the exact gate that regressed. author: Tin Dang <tin.dang@trustifytechnology.com>
…tway try_send_rewrite_per_shard fans the RewritePerShard message out to every shard writer with a blocking send. If a writer channel is disconnected the send fails and the function returned Err immediately — but the writers that ALREADY received the message decrement the PerShardRewriteCoord countdown when they fold, while the failed writer and the unsent tail never do. The countdown therefore never reaches zero, no shard runs the terminal branch, and AOF_REWRITE_IN_PROGRESS (set by bgrewriteaof_start_sharded before dispatch) is wedged true forever — every future BGREWRITEAOF is silently rejected for the lifetime of the process. Same wedged-flag family CodeRabbit flagged on PR #136 review 4420584384 (related to #138). Fix: on send failure, call coord.mark_failed() then coord.shard_done() once for each shard from the failing index through the tail (the failed shard plus every unsent one). The already-dispatched writers still decrement on fold, so the countdown reaches zero exactly once; mark_failed() is published before the tail decrements, so whichever decrement is terminal takes the abort path — keeps old_seq authoritative (no manifest commit) and clears AOF_REWRITE_IN_PROGRESS. The error log now names the offending writer index. Validated in OrbStack VM (cgroup-capped), red/green TDD: rewrite_fan_out_partial_failure_clears_in_progress_flag disconnects shard-0's writer, asserts the fan-out returns Err AND the in-progress flag is cleared (RED before the accounting fix — flag stayed true; GREEN after). fmt + clippy clean under both default (monoio) and tokio,jemalloc. author: Tin Dang <tin.dang@trustifytechnology.com>
…n on abort A per-shard BGREWRITEAOF reopens each shard's append file onto the NEW generation in phase 6 (do_rewrite_per_shard) BEFORE the coordinator decides whether the rewrite commits. If the rewrite then aborts — another shard failed to fold, or the final manifest commit errored — the manifest keeps old_seq and the abort path prunes the new-seq files, but the already-folded writers keep appending into the now-discarded new-seq incr. Recovery resolves base/incr by the committed seq (old_seq), so every post-abort write is silently lost. The old code acknowledged this with a "RESTART recommended" log; a server that is not restarted loses data. CodeRabbit flagged this on PR #136 review 4420584384. Fix — barrier-before-resume. PerShardRewriteCoord now publishes the committed generation exactly once from the terminal shard_done (new_seq on success, old_seq on abort/commit-failure) via a parking_lot Mutex<Option<u64>> + Condvar. After phase 7 each folded writer blocks in await_outcome and, if the committed seq != new_seq, reopens its append file onto the committed seq's incr (shard_incr_path_seq) — so it resumes appending into the authoritative generation with no restart and no loss. On the happy path committed == new_seq and the barrier is a no-op beyond unblocking. The barrier turns "every shard calls shard_done exactly once" into a LIVENESS requirement: a writer that decrements then blocks hangs forever if another shard never decrements. The one path that skips shard_done is a panic mid-fold (save_snapshot_to_bytes OOM-unwinding — issue #138), which under the default panic=unwind would now hang every healthy shard's AOF writer thread. Closed with a ShardDoneGuard: the fold creates it on entry and calls complete() on success (clean shard_done); on any ?-error or panic-unwind its Drop fires mark_failed + shard_done, so the countdown always closes, the terminal writer always publishes, and every waiter wakes and rolls back. The guard now owns the single decrement for ALL exits, so both writer-task error arms (monoio + tokio) no longer decrement after do_rewrite_per_shard (was a double-decrement risk). This incidentally closes the #138 panic-wedge. Each per-shard writer runs on its own dedicated OS thread (main.rs spawns aof-writer-{sid} → block_on_local), so blocking one at the barrier never starves the thread that must publish — no deadlock on either runtime. Validated in OrbStack VM (cgroup-capped), red/green TDD: - rewrite_abort_reopens_writer_onto_committed_old_generation: drives one shard's real fold with a second shard pre-failed; asserts the new-gen incr is pruned AND post-abort appends through *file land in the committed old-gen incr (RED before phase 8 — writes went to the pruned inode; GREEN after). - rewrite_abort_wakes_waiter_cross_thread: real condvar wait/notify across threads observes old_seq. - rewrite_fold_panic_releases_barrier_for_other_writers: a ShardDoneGuard dropped during catch_unwind releases a blocked waiter with old_seq (hangs forever without the guard). No regression: per-shard AOF SIGKILL recovery (crash_matrix_per_shard_aof, 3) and per-shard BGREWRITEAOF crash recovery (crash_matrix_per_shard_bgrewriteaof, 2) stay green. fmt + clippy clean under both default (monoio) and tokio,jemalloc. author: Tin Dang <tin.dang@trustifytechnology.com>
During a BGREWRITEAOF drain, both drain_pending_appends (legacy/TopLevel, default --shards 1 path) and drain_pending_appends_framed (per-shard) wrote the appended bytes and immediately acked AofAck::Synced — BEFORE the post-drain boundary sync_data(). Under appendfsync=always a client received +OK (durable) for an entry that a crash in the ack→fsync window would lose. CodeRabbit flagged this on PR #136 review 4420584384 (issue #140); the always-fsync durability contract requires the bytes to be on disk before the client is told Synced. Fix: park the AppendSync ack senders in DrainOutcome.pending_acks instead of acking inside the drain. New sync_and_fulfill_drain() performs the boundary fsync and only then resolves the parked acks — Synced on success, or FsyncFailed plus a propagated IO error on failure (never a false Synced). All six drain sites across do_rewrite_per_shard, do_rewrite_single and do_rewrite_sharded route through it. A write error inside the drain still drops the parked acks with the outcome (RecvError → hard failure), unchanged. Validated in OrbStack VM (cgroup-capped), red/green TDD: - drain_framed_parks_appendsync_ack_until_boundary_fsync: asserts the drain PARKS the ack (pending_acks.len()==1, not sent) and that it resolves Synced only after sync_data() (RED before: no pending_acks field / acked in drain). - drain_single_fulfills_fsync_failure_as_fsync_failed: the DEFAULT --shards 1 non-framed path must resolve FsyncFailed (never Synced) when the boundary fsync fails. No regression: full aof lib suite (66) green; bgrewriteaof TTL emission tests green; per-shard AOF SIGKILL + BGREWRITEAOF crash-recovery suites green. fmt + clippy clean under both default (monoio) and tokio,jemalloc. Note: tests/aof_fsync_err_subscribe_ordering single-shard variant (an #[ignore]d normal-append-path test, multi-shard twin passes) fails identically on HEAD without this change — pre-existing and unrelated to the rewrite-drain path touched here. author: Tin Dang <tin.dang@trustifytechnology.com>
…00-line cap src/persistence/aof_manifest.rs had grown to 3058 lines — past the 1500-line cap in CLAUDE.md. CodeRabbit flagged this on PR #136 review 4420584384 (issue #143). Converted the file into a directory module and relocated cohesive chunks, preserving every public path via re-export. Layout (all three files now under the cap): - aof_manifest/mod.rs (1395): AofManifest state + manifest-level I/O (paths, initialize/load, parse_v1/v2, cleanup_orphans, write_manifest, advance, migrate, verify, global_max_lsn) + manifest-core unit tests. - aof_manifest/shard_replay.rs (1159): the multi-part / per-shard replay free functions (replay_multi_part, replay_incr_resp, replay_incr_framed, replay_per_shard, replay_ordered_merge) + OrderedEntry + the replay unit tests. - aof_manifest/shard_rewrite.rs (577): the per-shard rewrite inherent methods (initialize_multi, try_initialize_multi, advance_shard, prune_shard_files) as `impl AofManifest` + the shard-rewrite unit tests. Mechanics: - `use super::*` in each submodule pulls the parent's types, std/tracing imports, consts, and private helpers down (a child module sees the parent's private items), so no import duplication beyond the test modules. - Relocated public replay API is re-exported from mod.rs (`pub use shard_replay::{OrderedEntry, replay_multi_part, replay_ordered_merge, replay_per_shard}`) so external callers (`crate::persistence::aof_manifest::replay_*`) keep resolving unchanged. - Per CLAUDE.md the cap and the "tests stay in mod.rs" convention can't both hold for a 3058-line file (tests alone are ~1100 lines); tests were moved next to the code they cover, each submodule getting its own temp_dir() copy with a distinct directory prefix (replay-/rewrite-) so the independent static counters never collide. No behaviour change. Validated in OrbStack VM: 30 aof_manifest unit tests green; per-shard AOF SIGKILL recovery (3) and per-shard BGREWRITEAOF crash recovery (2) green (exercise the relocated replay_per_shard / advance_shard / initialize_multi via recovery). fmt + clippy clean under both default (monoio) and tokio,jemalloc. author: Tin Dang <tin.dang@trustifytechnology.com>
…ion entry CI Lint gate requires a CHANGELOG.md entry (or the skip-changelog label) per PR. Document the PR #144 scope under [0.2.0-alpha] Unreleased: the 8 CodeRabbit PR #136 durability follow-ups, the two PR #144-review Majors (async-spill eviction regression test, cfg(unix) migration gate), the aof_manifest.rs and aof.rs decompositions, and the VECTOR_INDEXES RwLock test-isolation fix. author: Tin Dang <tin.dang@trustifytechnology.com>
fix(persistence): close 8 CodeRabbit PR #136 durability findings + aof_manifest split
Summary
Makes Moon stable in multishard mode on both runtimes (monoio + tokio) with disk-offload, and eliminates the long-standing "zombie eating RAM in multishard" symptom. Ports the PR #129 per-shard AOF work onto
main, lands the disk-offload durability fixes, and closes the multishard serving + bounded-RAM gaps with benchmark evidence on both runtimes.The "zombie eating RAM in multishard" report — three independent root causes
SO_REUSEPORT) while each shard binds its ownSO_REUSEPORTlistener — incompatible sockets on one port, resolved by a bind-order race. Either wayPINGwas never answered (port up, shards parked, nothing dispatched — the "zombie" signature). Fix: the central listener also bindsSO_REUSEPORT, so it coexists with the shard sockets regardless of order. monoio was always healthy.Unknown io_uring event type: 0and drops connections under load. Fix: default-OFF for tokio (MOON_URING=1to opt in;MOON_NO_URINGstill force-disables and stays the CI default). io_uring remains in monoio, the production runtime.--disk-offloaddefaults toenable, so each shard built aPageCachesized to 25% of the whole maxmemory and eagerly committednum_frames × PAGEzeroed bytes at construction. With N shards that isN × 25% × maxmemorypre-committed before serving a command; with--maxmemoryunset the guardrail sets it to 80% of host RAM → a 4-shard default server reserved ~80% of host RAM instantly (measured 3791 MB idle RSS, zero data). Fix: (a) lazy buffers — frames start empty, grow to a full page on firstfetch_pagemiss; (b) divide the budget across shards; (c) startup WARN when an explicit--pagecache-sizeexceeds 50% of maxmemory. Startup RSS 3791 MB → 29 MB.Benchmark — multishard + disk-offload, both runtimes
OrbStack Linux aarch64, 4 shards,
--maxmemory 64MiB --maxmemory-policy allkeys-lru --disk-offload enable, 300k SET + 300k GET over a 200k keyspace (forces spill + cold read-through), p=16, all server runs cgroup-capped to protect the host:RSS stays flat through 600k ops on both runtimes — bounded RAM, working spill/read-through, no zombie.
Also in this branch (PR #129 port + disk-offload durability)
Validation
tests/multishard_serve_smoke.rspasses 3/3 on monoio and tokio.cargo fmt --checkclean;cargo clippy -- -D warningsclean on default andruntime-tokio,jemallocfeature sets.Test plan
cargo test(monoio) full suite greencargo test --no-default-features --features runtime-tokio,jemallocfull suite greenSummary by CodeRabbit
New Features
Improvements
Documentation