docs(design): propose per-adapter logical backup format #706
Conversation
Per-adapter directory hierarchy (dynamodb/<table>/items/<pk>.json,
s3/<bucket>/<original/path>, redis/db_<n>/<type>/<key>.{bin,json,jsonl},
sqs/<queue>/messages.jsonl) so a backup is recoverable without elastickv —
cat / jq / find / aws s3 sync / redis-cli are sufficient. Strips internal
magic prefixes and wide-column key encodings; pins read_ts via the lease-
read path and the active timestamp tracker so the dump is consistent and
the compactor cannot retire its versions mid-dump.
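For orientation, here is the hierarchy the description above implies, sketched as a tree together with the MANIFEST.json and CHECKSUMS files discussed later in the thread (layout illustrative, assembled from the paths named in this PR, not copied from the design doc):

```text
<dump-root>/
├── MANIFEST.json                         # the only file a restore tool must read first
├── CHECKSUMS                             # sha256sum(1)-compatible
├── dynamodb/<table>/items/<pk>.json
├── s3/<bucket>/<original/path>
├── redis/db_<n>/<type>/<key>.{bin,json,jsonl}
└── sqs/<queue>/messages.jsonl
```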
📝 Walkthrough: Two design documents introduce logical backup and snapshot decoder functionality. The first specifies the Phase 1 design for extracting point-in-time consistent backups from live clusters using admin RPCs and raft FSM commands. The second standardizes conversion of FSM snapshots into an inspectable, vendor-neutral logical backup format with per-adapter schemas.
@claude review
Claude finished @bootjp's task in 3m 3s. Review:
Critical:
- Replace nonexistent kv.ShardedCoordinator.ScanRange with the actual ShardStore.ScanAt (kv/shard_store.go:106) plus a thin BackupScanner iterator wrapper for paged scans.
- Specify ActiveTimestampTracker.PinWithDeadline plus a per-second sweeper goroutine. The existing Pin has no TTL, so a crashed producer would block compaction forever without this.

Important:
- Add hll/ to the Redis layout tree.
- Replace the .metadata.json sidecar with a .elastickv-meta.json reserved suffix; document collision handling and a rename-collisions flag.
- Spell out the cross-shard wait: BeginBackup polls each group's applied_index until it covers read_ts (5s default deadline) before any scan starts. The 50ms skew alone was a liveness assertion, not an enforced wait.
- Specify CHECKSUMS as sha256sum(1)-compatible (verifiable with sha256sum -c CHECKSUMS from the dump root).
- Add a ListAdaptersAndScopes RPC alongside BeginBackup/EndBackup with ttl_ms / pin_token.

Minor:
- Redis TTL: default skip-expired on restore; preserve-ttl opt-in.
- The aws s3 sync example needs an exclude pattern for sidecars.
- DynamoDB example clarified: composite-key vs hash-only layouts.
- SQS dedup-window reset moved into trade-offs as an explicit callout.
- Redis stream merge mode: refuse a non-empty target; the auto-id strategy falls back to XADD * with an _id_remap.jsonl mapping log.

Tests added: TestPinWithDeadlineExpiry, TestBeginBackupWaitsForLaggingShard, TestBackupScannerPaging, TestS3SidecarSuffixCollision, TestRedisTTLExpiredKeySkippedByDefault, TestRedisStreamMergeRejectsNonEmpty.
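A minimal sketch of what PinWithDeadline plus the per-second sweeper could look like, assuming the tracker keeps pins in a mutex-guarded map; the pin struct, token keying, and field names are hypothetical, not from the codebase:

```go
package kv

import (
	"sync"
	"time"
)

// pin is a hypothetical record of one pinned read timestamp.
type pin struct {
	readTS   uint64
	deadline time.Time
}

// ActiveTimestampTracker sketch: pins keyed by an opaque producer token.
type ActiveTimestampTracker struct {
	mu   sync.Mutex
	pins map[string]pin
}

// PinWithDeadline registers read_ts with a TTL so a crashed producer
// cannot block compaction forever.
func (t *ActiveTimestampTracker) PinWithDeadline(token string, readTS uint64, ttl time.Duration) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.pins[token] = pin{readTS: readTS, deadline: time.Now().Add(ttl)}
}

// sweep runs once per second and drops expired pins, freeing the
// compactor to retire the MVCC versions they protected.
func (t *ActiveTimestampTracker) sweep() {
	for range time.Tick(time.Second) {
		now := time.Now()
		t.mu.Lock()
		for tok, p := range t.pins {
			if now.After(p.deadline) {
				delete(t.pins, tok)
			}
		}
		t.mu.Unlock()
	}
}
```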
Claude-bot review (#706 (comment)) reflected in 5719128.

@claude review
Claude finished @bootjp's task in 4m 11s. Review:
| # | Item | Verdict |
|---|---|---|
| 1 | ScanRange doesn't exist | Replaced with ShardStore.ScanAt + BackupScanner wrapper ✓ |
| 2 | ActiveTimestampTracker no TTL | PinWithDeadline + sweeper specified ✓ |
| 3 | hll/ missing from layout tree | Added ✓ |
| 4 | S3 sidecar collision | .elastickv-meta.json suffix, reserved at dump time ✓ |
| 5 | Cross-shard wait underspecified | applied_index poll in BeginBackup with 5 s deadline ✓ |
| 6 | CHECKSUMS format unspecified | sha256sum(1)-compatible; sha256sum -c CHECKSUMS ✓ |
| 7 | ListAdaptersAndScopes undocumented | RPC sketch added ✓ |
| 8 | Redis TTL on expired keys | skip-expired default, --preserve-ttl opt-in ✓ |
| 9 | aws s3 sync uploads sidecars | --exclude '*.elastickv-meta.json' ✓ |
| 10 | DynamoDB composite vs hash-only | Both layouts shown side by side ✓ |
| 11 | SQS FIFO dedup warning buried | Moved to trade-offs as explicit callout ✓ |
| 12 | Redis stream merge collision | merge refuses non-empty; --stream-merge-strategy=auto-id fallback ✓ |
Overall: the revision is a solid step forward. Items 2 and 3 above (the renewal protocol gap and the ListAdaptersAndScopes binding) are the blockers — both are observable mismatches between the described behavior and the proposed proto that would surface as bugs on day one of Phase 1 implementation. Item 1 (the ScanAt doc comment) needs a one-liner fix in the design to prevent confusion during code review. Items 4–7 are genuinely minor.
Code Review
This pull request proposes a logical backup system for elastickv, transitioning from opaque FSM snapshots to a per-adapter decomposed format using standard types like JSON and JSONL. The design ensures vendor-independent recovery and point-in-time consistency. Feedback focused on scalability and correctness, specifically regarding the performance impact of individual file storage for DynamoDB items, potential POSIX path collisions for S3 keys, and the scalability of monolithic metadata files. Reviewers also suggested implementing a TTL refresh mechanism for large backups and bounding concurrent backup registrations to prevent resource exhaustion.
> `MANIFEST.json` is **the only file a restore tool must read first**.
> Everything else is decoded from its on-disk path and contents.
>
> ## Read-Side Consistency
The default backup_ts_ttl of 30 minutes might be too short for large-scale backups, creating a significant operational risk where the compactor retires MVCC versions still in use, resulting in a corrupted backup. The design must include a mechanism to refresh the TTL and should also detail mitigation strategies for operational risks like rolling upgrades, such as a temporary 'bridge' mode.
References
- When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary "bridge" or "proxy" mode.
> │   └── <table-name>/
> │       ├── _schema.json
> │       └── items/
> │           └── <pk-segment>/[<sk-segment>.]json
Storing each DynamoDB item in an individual file will cause severe performance degradation and filesystem overhead for tables with millions of items. This represents a significant operational risk. Consider adopting a bundled format like JSONL (e.g., items/data-00001.jsonl). Additionally, per repository guidelines, when a design identifies such risks, it must detail mitigation strategies like implementing a temporary 'bridge' or 'proxy' mode for rolling upgrades.
```diff
-│           └── <pk-segment>/[<sk-segment>.]json
+│           └── data-<part-id>.jsonl
```
References
- When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary "bridge" or "proxy" mode.
> recorded in `KEYMAP`).
>
> The directory **is** the index. There is no other side-table that must be
> parsed before the user can find a record.
Preserving S3 object key / separators as filesystem path separators introduces potential collisions that S3 allows but POSIX filesystems do not. For example, S3 allows both an object named path/to/obj and an object named path/to. On a standard filesystem, path/to cannot be both a directory and a regular file. The design should specify how such collisions are handled (e.g., by encoding the leaf node or using a sidecar directory for nested objects).
> - **All other key segments** (DynamoDB partition/sort key bytes, Redis keys,
>   SQS queue names and message IDs) are encoded with the following scheme,
>   which is unambiguous and reversible:
>   - Bytes in `[A-Za-z0-9._-]` pass through.
A single KEYMAP file per adapter scope will not scale well for datasets with millions of keys, becoming a bottleneck for memory and I/O. Furthermore, as this design supports system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, or logical replication) to avoid service interruption during the transition.
References
- When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.
> {
>   "format_version": 1,
Using a single strings_ttl.json file to store TTLs for all Redis strings in a database is not scalable. For a large Redis instance, this file will grow excessively, leading to high memory consumption. It would be more efficient to use a newline-delimited JSON (JSONL) format or to shard the TTL data.
| "preserve_sqs_visibility": false, | ||
| "include_sqs_side_records": false | ||
| }, | ||
| "checksum_algorithm": "sha256", |
To prevent unbounded memory growth and potential Denial of Service (DoS) through excessive BeginBackup calls, the design should specify a fixed bound on the number of concurrent active backup registrations allowed in the active_timestamp_tracker. This aligns with the general rule of applying bounds to collections that grow from external requests.
References
- To prevent unbounded memory growth and potential OOM issues, apply a fixed bound to collections that can grow from external requests, such as pending configuration changes. Reject new requests when the bound is reached.
HIGH:
- TTL renewal made explicit. New RenewBackup RPC; the producer renews every ttl_ms/3 and aborts the dump on renewal failure (continuing past the TTL would silently corrupt the artifact). backup_max_ttl_ms caps a single window; multi-hour dumps work via repeated renewals. (A renewal-loop sketch follows.)

MEDIUM:
- DynamoDB scalability. The default stays per-item (the user's stated requirement: one file per record, named `<PK>.json`). Added opt-in --dynamodb-bundle-mode jsonl for tables where inode count is the binding constraint; a structured warning is emitted when an unbundled scope exceeds 1M items.
- S3 file-vs-directory collisions. The producer detects keys like "path/to" and "path/to/obj" coexisting and renames the shorter to "path/to.elastickv-leaf-data" with a KEYMAP.jsonl + manifest record.
- KEYMAP scaled by switching to KEYMAP.jsonl (line-streamable).
- Redis strings_ttl.json -> strings_ttl.jsonl with one TTL record per line, so the TTL count is unbounded.
- max_active_backup_pins (default 4) caps concurrent BeginBackup registrations; the tracker returns ErrTooManyActiveBackups and the RPC surfaces ResourceExhausted. Prevents a misbehaving caller from holding the compactor open across the entire MVCC retention horizon.

Manifest fields added: s3_collision_strategy, dynamodb_layout, max_active_backup_pins.
CLI flags added: --ttl-ms, --scan-page-size, --dynamodb-bundle-mode/--dynamodb-bundle-size, --rename-collisions.
Tests added: TestS3PathFileVsDirectoryCollision, TestBeginBackupTooManyActiveBackups, TestRenewBackupExtendsDeadline.
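A sketch of the producer-side renewal loop described above (renew every ttl_ms/3, abort the dump on failure); the BackupClient interface and abort callback are hypothetical wiring around the proposed RenewBackup RPC:

```go
package backup

import (
	"context"
	"fmt"
	"time"
)

// BackupClient is a hypothetical wrapper over the proposed admin RPCs.
type BackupClient interface {
	RenewBackup(ctx context.Context, pinToken []byte, ttl time.Duration) error
}

// renewLoop keeps the backup pin alive until ctx is cancelled. Renewal
// failure aborts the dump: continuing past the TTL would let the
// compactor retire pinned MVCC versions and silently corrupt the artifact.
func renewLoop(ctx context.Context, c BackupClient, token []byte, ttl time.Duration, abort func(error)) {
	t := time.NewTicker(ttl / 3)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			if err := c.RenewBackup(ctx, token, ttl); err != nil {
				abort(fmt.Errorf("pin renewal failed, aborting dump: %w", err))
				return
			}
		}
	}
}
```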
Gemini review (#706 (review)) addressed:
- HIGH (line 515) — TTL refresh for large backups
- MEDIUM (line 129) — DynamoDB per-item file scalability
- MEDIUM (line 163) — S3 file-vs-directory POSIX collision
- MEDIUM (line 181) — KEYMAP scalability
- MEDIUM (line 341) — Redis TTL file scalability
- MEDIUM (line 503) — DoS bound on concurrent backup pins

@claude review
Claude finished @bootjp's task in 3m 18s. Review:
| # | Item | Verdict |
|---|---|---|
| HIGH | TTL refresh for long backups | RenewBackup RPC + backup_max_ttl_ms ceiling + pin_renewals_total metric ✓ |
| MEDIUM | DynamoDB per-item file scalability | --dynamodb-bundle-mode jsonl opt-in; 1M-item warning ✓ |
| MEDIUM | S3 file-vs-directory POSIX collision | .elastickv-leaf-data rename + KEYMAP.jsonl + MANIFEST.s3_collision_strategy ✓ |
| MEDIUM | KEYMAP scalability | KEYMAP.jsonl (line-streamable) ✓ |
| MEDIUM | strings_ttl.json scalability | strings_ttl.jsonl ✓ |
| MEDIUM | DoS bound on concurrent backup pins | max_active_backup_pins = 4; ErrTooManyActiveBackups; ResourceExhausted ✓ |
Overall: the Gemini pass was handled cleanly. The five carry-overs from round 2 are all small doc fixes (no new design required); items 6 and 7 above are new minor nits from the Gemini changes. No blockers were introduced. Once items 2 and 3 (the NewBackupScanner placement and the ListAdaptersAndScopes binding) are addressed in the doc, the design is ready for implementation.
Round-2 carry-overs:
- Move NewBackupScanner from *ShardedCoordinator to *ShardStore. ShardedCoordinator (kv/sharded_coordinator.go:124) has no *ShardStore field; ShardStore is the natural home for ScanAt-adjacent code.
- Add a Phase-1 note that ScanAt's doc comment must be updated to reflect that BeginBackup's applied_index fence makes cross-shard ScanAt globally consistent.
- Add pin_token to ListAdaptersAndScopesRequest so the scope scan is bounded to the pinned read_ts; concurrently-created scopes are not surfaced.
- Add hll/ to the Redis db_0 example tree (the top-level layout already had it; the per-adapter example lagged).
- Add --stream-merge-strategy / --preserve-ttl / --preserve-visibility to the elastickv-restore apply CLI block.

New nits from the Gemini changes:
- Pick one form for the DynamoDB bundle-mode flags. Two-flag form (--dynamodb-bundle-mode jsonl + --dynamodb-bundle-size 64MiB) throughout; drop the inline jsonl:size=64MiB form.
- RenewBackupRequest.ttl_ms documents the same 60s-24h range constraint as BeginBackupRequest, bounded above by backup_max_ttl_ms; out-of-range returns InvalidArgument.

Tests added: TestRenewBackupTTLRangeValidation, TestListAdaptersAndScopesAtPinTS.
Round-3 review (#706 (comment)) addressed: carry-overs from round 2 (5 items), new items from the Gemini changes (2 items), new tests.

@claude review
Claude finished @bootjp's task in 3m 46s. Review:
| # | Item | Verdict |
|---|---|---|
| 1 | ScanAt comment still says "NOT consistent" | Phase-1 note added at lines 643–654 directing the comment update ✓ |
| 2 | NewBackupScanner on wrong type | Moved to *ShardStore with explicit rationale (lines 629–641) ✓ |
| 3 | ListAdaptersAndScopesRequest has no pin_token | bytes pin_token = 1 added (line 707); scope scan bounded to read_ts ✓ |
| 4 | hll/ missing from db_0 example tree | Added (lines 441–443) ✓ |
| 5 | --stream-merge-strategy missing from CLI block | Added to restore CLI (line 878) ✓ |
| 6 | --dynamodb-bundle-mode syntax inconsistency | Two-flag form throughout (lines 315, 847–848) ✓ |
| 7 | RenewBackupRequest.ttl_ms range undocumented | 60s–24h constraint + InvalidArgument on out-of-range (lines 693–695) ✓ |
Overall: the design is in very good shape. Items 1 and 2 above (stream and HLL TTL gaps) are the blockers — they are confirmed by the live adapter/redis.go code and would result in silent data loss for any key with an expiry. Item 3 (the XADD NOMKSTREAM semantic error) needs correction before Phase 2 implementation begins or the restore tool will behave incorrectly for skip-existing on streams. Item 4 is a one-liner doc fix.
Important (verified against adapter/redis.go):
- Stream TTLs were silently dropped. Streams can carry a TTL via EXPIRE/PEXPIRE/PEXPIREAT, kept in `!redis|ttl|` by buildTTLElems (adapter/redis.go:2471). Added expire_at_ms to the _meta line so restore re-applies the original epoch.
- HLL TTLs were silently dropped. HLL is reported as redisTypeString but stores its TTL in the legacy scan index (the adapter/redis.go:2319-2320 comment confirms it). Added an hll_ttl.jsonl sidecar at the db_<n> root, same shape as strings_ttl.jsonl. The sidecar is omitted when no HLL key has a TTL.
- Fixed the XADD NOMKSTREAM semantic error in the skip-existing description. NOMKSTREAM only prevents stream creation; it does not skip entries by ID. Replaced with the correct mechanism: attempt XADD and treat the "ID equal or smaller" ERR as "already present, skip."

Minor:
- Added --begin-backup-deadline 5s to the elastickv-backup dump CLI block (the prose mentioned it but the flag list lagged).

Tests added: TestRedisStreamTTLRoundTrip, TestRedisHLLTTLRoundTrip, TestRedisStreamSkipExistingHandlesERR.
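A sketch of the corrected skip-existing mechanism for stream entries, assuming a go-redis-style client; matching on the server's error text is one possible implementation of "treat the ERR as already-present", not the design's mandated one:

```go
package restore

import (
	"context"
	"strings"

	"github.com/redis/go-redis/v9"
)

// restoreStreamEntry re-inserts one dumped stream entry with its original
// ID. XADD with an explicit ID fails when an entry with an equal-or-smaller
// ID already exists in the stream; that specific ERR means "already
// present", so skip-existing treats it as success.
func restoreStreamEntry(ctx context.Context, rdb *redis.Client, stream, id string, fields map[string]interface{}) error {
	err := rdb.XAdd(ctx, &redis.XAddArgs{Stream: stream, ID: id, Values: fields}).Err()
	if err != nil && strings.Contains(err.Error(), "equal or smaller than the target stream top item") {
		return nil // entry already present: skip
	}
	return err
}
```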
Round-4 review (#706 (comment)) addressed: Important (data-loss gaps), Minor, new tests.

@claude review
Claude finished @bootjp's task in 4m 11s. Review:
| # | Item | Verdict |
|---|---|---|
| 1 | Stream TTLs silently dropped | expire_at_ms in _meta line (line 491); prose explains !redis\|ttl\| origin ✓ |
| 2 | HLL TTLs silently dropped | hll_ttl.jsonl sidecar at db_<n> root (lines 507–513); top-level tree updated ✓ |
| 3 | XADD NOMKSTREAM semantic error | Replaced with attempt-XADD-and-treat-error approach; NOMKSTREAM clarified (lines 992–1001) ✓ |
| 4 | --begin-backup-deadline missing from CLI block | Added at line 862 ✓ |
Overall: this design is nearly ready. Item 1 (the node-local pin gap) is the only blocker — it is a correctness issue for any multi-node deployment and the fix requires adding a fan-out mechanism to the BeginBackup / RenewBackup / EndBackup RPC contract. Items 2–4 are one-liners.
Important (correctness gap, verified against main.go):
- The BeginBackup pin was node-local. Each node owns its own ActiveTimestampTracker (main.go:294) and only its compactor consults it (main.go:335). A pin on the receiving node would not protect group leaders on other nodes; their compactors would be free to retire MVCC versions at read_ts mid-dump. Fix: pins propagate through each Raft group's log as BackupPin / BackupExtend / BackupRelease FSM commands. Every replica applies them to its local tracker, so the pin set is replicated and durable across leader changes (a new leader inherits the pin from the same log entry it applied). BeginBackup fans the propose out per group; if any group fails to commit within --begin-backup-deadline, BeginBackup proposes BackupRelease on the groups that did commit and returns Unavailable so retry is clean. RenewBackup / EndBackup fan out the same way; releases are idempotent, so partial state is tolerated. (A fan-out sketch follows.)

Minor:
- Specified the _id_remap.jsonl path: redis/db_<n>/streams/<key>._id_remap.jsonl (next to the source) so tools can find it on disk.
- TestRedisStreamSkipExistingHandlesERR description: "the dump" -> "the restore" (it's a restore-side test).
- Added KEYMAP.jsonl to every per-adapter scope in the layout tree; omitted when empty.

Tests added: TestBeginBackupPinFanOutAllNodes, TestBeginBackupPinSurvivesLeaderChange, TestBeginBackupGroupUnreachable.
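A sketch of that fan-out, assuming a hypothetical Group handle per Raft group; the all-or-release shape is the point, not the exact API:

```go
package kv

import (
	"context"
	"fmt"
)

// Group abstracts one Raft group; both methods are hypothetical stand-ins
// for proposing the BackupPin / BackupRelease FSM entries described above.
type Group interface {
	ProposeBackupPin(ctx context.Context, readTS uint64) error
	ProposeBackupRelease(ctx context.Context, readTS uint64) error
}

// fanOutPins proposes BackupPin on every group concurrently (one goroutine
// per group). On any failure it proposes BackupRelease everywhere —
// releases are idempotent, so releasing groups that never committed is
// harmless — and returns an error so the caller's retry starts clean.
func fanOutPins(ctx context.Context, groups []Group, readTS uint64) error {
	errs := make(chan error, len(groups))
	for _, g := range groups {
		go func(g Group) { errs <- g.ProposeBackupPin(ctx, readTS) }(g)
	}
	var firstErr error
	for range groups {
		if err := <-errs; err != nil && firstErr == nil {
			firstErr = err
		}
	}
	if firstErr != nil {
		for _, g := range groups {
			_ = g.ProposeBackupRelease(context.Background(), readTS)
		}
		return fmt.Errorf("begin backup: %w", firstErr)
	}
	return nil
}
```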
Round-5 review (#706 (comment)) addressed: Important (correctness gap), Minor, new tests.

@claude review
Claude finished @bootjp's task in 5m 13s. Review:
| # | Item | Verdict |
|---|---|---|
| 1 | Phase 1 missing FSM scope | FSM plumbing described in prose (lines 762–864) but Phase 1 file list still incomplete — covered as item #1 above |
| 1 | ScanAt doc comment | Phase 1 note to update comment (lines 664–675) ✓ |
| 2 | NewBackupScanner on wrong type | Moved to *ShardStore with rationale (lines 639–663) ✓ |
| 3 | ListAdaptersAndScopesRequest missing pin_token | bytes pin_token = 1 present (line 728) ✓ |
| 4 | hll/ missing from db_0 example tree | Added (lines 447–448) ✓ |
| 5 | --stream-merge-strategy missing from CLI block | Listed (line 948) ✓ |
| — | Node-local pin gap (Important) | BackupPin via Raft FSM commands; prose complete, FSM file list gap flagged as item #1 above |
| — | _id_remap.jsonl path (Minor) | redis/db_<n>/streams/<key>._id_remap.jsonl (line 1063) ✓ |
| — | Test description typo (Minor) | "the restore" (line 1177) ✓ |
| — | KEYMAP.jsonl in layout trees (Minor) | Added per-scope in top-level tree (lines 128–152) ✓ |
Overall: the proposal is in excellent shape. Item #2 (the Restore-wipes-pins gap) is the one correctness issue that would bite in production: a 2-hour backup on a cluster with a 1-hour retention horizon is exactly where the hidden compaction window appears. Item #1 (FSM file list) is a one-paragraph addition to Phase 1 scope that will save the Phase 1 implementor a debugging session. Items #3 and #4 are implementation guidance nits. After these two important items are addressed the design is ready to ship.
Important (verified against kv/fsm.go):
- Phase 1 scope was missing the FSM plumbing for BackupPin /
BackupExtend / BackupRelease entries. Added explicit items:
- new byte tags 0x03/0x04/0x05 in kv/fsm.go alongside existing
raftEncodeHLCLease=0x02 (kv/fsm.go:116);
- new applyBackupPin/applyBackupExtend/applyBackupRelease
handlers, dispatched from kvFSM.Apply (kv/fsm.go:60) in the
same shape as applyHLCLease;
- new *ActiveTimestampTracker field on kvFSM (kv/fsm.go:26),
wired via NewKvFSMWithHLCAndTracker so the FSM and the
coordinator share one tracker (analogous to the *HLC sharing
invariant in CLAUDE.md);
- new kv/backup_codec.go with the 33/25/17-byte fixed binary
layouts for the three entry types.
- kvFSM.Restore (kv/fsm.go:264-287) wipes in-memory pins. A replica
that installs a Raft snapshot during the backup window loses its
BackupPin (snapshot only carries the Pebble store, not the
tracker). After the leader compacts the log past the BackupPin
entry, that entry is gone; a follower catching up via snapshot
never replays it.
Added explicit safe-bound invariant:
backup_duration + worst_case_compaction_lag < mvcc_retention_horizon
BeginBackup enforces a soft form via --snapshot-headroom-entries
(default 10000). Producer's per-scope expected_keys heuristic
catches mid-dump snapshot-induced shortfalls. Heavier mitigations
(snapshotting tracker state, sibling-pin queries) are explicitly
Phase 3, not Phase 1.
Minor:
- --begin-backup-deadline binding constraint clarified: it is the
applied_index poll for lagging shards, not the BackupPin commit
(which is sub-second on a healthy cluster).
- RenewBackup retries per-group BackupExtend up to 3 times with
500ms backoff before aborting; transient leader elections (<1s
under etcd/raft defaults) no longer abort otherwise-healthy dumps.
Tests added: TestBackupPinFSMCodecRoundTrip,
TestRestoreWipesLocalPins, TestBeginBackupRefusesNearSnapshotThreshold,
TestRenewBackupRetriesLeaderElection.
Round-6 review (#706 (comment)) addressed: Important, Minor, new tests.

@claude review
Claude finished @bootjp's task in 7m 13s. Review:
| # | Item | Verdict |
|---|---|---|
| 1 | FSM plumbing missing from Phase 1 scope | kv/fsm.go byte tags, apply handlers, *ActiveTimestampTracker field, kv/backup_codec.go — all added to Phase 1 scope (lines 1152–1181) ✓ |
| 2 | kvFSM.Restore wipes pins | Safe-bound invariant documented; --snapshot-headroom-entries guard added; expected_keys heuristic referenced (though undefined — item #2 above) ✓ |
| 3 | --begin-backup-deadline binding constraint | Clarified: applied_index poll for lagging shards is the binding wait (lines 864–869) ✓ |
| 4 | RenewBackup retry on leader election | 3 retries with 500 ms backoff before abort (lines 906–917) ✓ |
Overall: the design is in excellent shape for implementation. Items 1 and 2 above are the blockers — the snapshot-headroom check will produce operational confusion if shipped inverted, and the expected_keys heuristic is the only safety net for the snapshot-invalidation corner case. Items 3–5 are one- or two-sentence additions. After those are addressed, the proposal is ready to merge and Phase 1 implementation can begin.
Important (verified against internal/raftengine/engine.go and
internal/raftengine/etcd/engine.go):
- --snapshot-headroom-entries check was inverted. Replaced with the
correct formula:
entries_since_last_snapshot = AppliedIndex - LastSnapshotIndex
remaining_headroom = SnapshotEvery - entries_since_last_snapshot
refuse if: remaining_headroom < --snapshot-headroom-entries
(raftengine.Status has LastSnapshotIndex not FirstIndex, and
defaultSnapshotEvery = 10000 is the per-engine snapshot trigger.)
A freshly-snapshotted cluster has the largest remaining headroom
and is now correctly allowed; the check refuses backups that would
trigger the snapshot-installation corner case mid-dump. Default
--snapshot-headroom-entries lowered from 10000 to 1000 (one-tenth
of SnapshotEvery) since the formula now expresses what the prose
intended.
- expected_keys heuristic was undefined. Replaced with a concrete
expected-keys baseline mechanism:
* BeginBackup runs a count-only ScanAt at read_ts before the pin
installs, returns per-scope (key_count, applied_index) in
BeginBackupResponse.
* Producer compares actual count against baseline; fails with
ErrCompactionDuringDump if below 99% +/- sqrt(baseline).
* Adds keys_only bool parameter to MVCCStore.ScanAt /
ShardStore.ScanAt so the baseline pass skips value
materialization (Pebble iterator with payload trimmed at the
block layer).
Minor:
- _long_keys.jsonl renamed to KEYMAP.jsonl on line 192 to match the
rest of the spec.
- BackupPin fan-out is now explicitly concurrent (one goroutine per
group, blocked on all-or-deadline). 100-shard cluster does not pay
100x serial Raft round-trips.
Tests added: TestExpectedKeysBaselineToleratesTTLExpiry,
TestExpectedKeysBaselineCountOnlyScan. Existing tests
TestRestoreWipesLocalPins and TestBeginBackupRefusesNearSnapshotThreshold
updated to match the corrected formulas.
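A sketch of the two corrected checks above, using the field names from the formulas (all wiring hypothetical):

```go
package kv

import (
	"fmt"
	"math"
)

// refuseNearSnapshot implements the corrected headroom formula: refuse a
// backup when the next snapshot trigger is close enough that a follower
// could install a snapshot (and lose its in-memory pin) mid-dump.
func refuseNearSnapshot(appliedIndex, lastSnapshotIndex, snapshotEvery, headroom uint64) error {
	entriesSinceSnapshot := appliedIndex - lastSnapshotIndex
	if entriesSinceSnapshot >= snapshotEvery {
		return fmt.Errorf("refusing backup: snapshot trigger imminent")
	}
	remaining := snapshotEvery - entriesSinceSnapshot
	if remaining < headroom {
		return fmt.Errorf("refusing backup: %d entries of snapshot headroom left, need %d", remaining, headroom)
	}
	return nil
}

// baselineOK is one reading of the "99% +/- sqrt(baseline)" tolerance:
// fail the dump (ErrCompactionDuringDump) when the actual per-scope count
// falls below 99% of the baseline minus a sqrt-sized statistical allowance
// (which absorbs TTL expiry between the baseline pass and the scan).
func baselineOK(actual, baseline uint64) bool {
	floor := 0.99*float64(baseline) - math.Sqrt(float64(baseline))
	return float64(actual) >= floor
}
```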
Round-7 review (#706 (comment)) addressed: Important, Minor, new tests.

@claude review
Claude finished @bootjp's task in 4m 30s. Review:
| # | Item | Verdict |
|---|---|---|
| 1 | --snapshot-headroom-entries check was inverted | Formula corrected: remaining_headroom = SnapshotEvery - (AppliedIndex - LastSnapshotIndex); freshly-snapshotted cluster correctly allowed ✓ |
| 2 | expected_keys heuristic undefined | Concrete count-only scan baseline: 99% × baseline ± sqrt(baseline) tolerance; keys_only=true path specified ✓ |
| 3 | _long_keys.jsonl vs KEYMAP.jsonl inconsistency | Line 192 now says KEYMAP.jsonl throughout ✓ |
| 4 | firstIndex not in raftengine.Status | Replaced with LastSnapshotIndex; formula uses firstIndex = LastSnapshotIndex + 1 (verified against internal/raftengine/engine.go:60) ✓ |
| 5 | BackupPin fan-out parallelism unspecified | "One goroutine per group" explicit at lines 923–924 ✓ |
Overall: the design is very close to ready. Item 1 (missing proto definitions) and item 2 (count-ordering bug) are the blockers — item 2 is a correctness issue that would make ErrCompactionDuringDump silently ineffective on lagging shards, which is the only detection mechanism for the snapshot-induced pin loss corner case. Both are small additions to existing sections. Items 3 and 4 are one-liner fixes. After these four items, the proposal is ready to merge.
Important:
- BeginBackupResponse proto block was missing definitions. Added
ShardApplied {group_id, applied_index} and ScopeKeyCount
{scope_id, key_count, applied_index_at_count}; added
"repeated ScopeKeyCount expected_keys = 5" to BeginBackupResponse.
- Expected-keys baseline ordering bug. Documented as running "during
step 1" but step 1 is "pick read_ts" and step 2 is shard catch-up.
A lagging shard would return 0 (or partial) for ScanAt at read_ts
before applying through it, producing an undercount and silently
disabling ErrCompactionDuringDump on exactly the shards most at
risk of snapshot-induced pin loss. Moved to "after step 2 (every
shard has applied through read_ts) and before step 3 (pin)".
Minor:
- --snapshot-headroom-entries CLI default 10000 -> 1000 (matches the
one-tenth-of-SnapshotEvery default documented in the prose).
- ScanAt keys_only addition is a source-incompatible Go API change
with ~10 callers in shard_store.go alone. Replaced with a new
ScanKeysAt overload on MVCCStore / ShardStore (returns [][]byte
not []*KVPair); existing ScanAt callers untouched.
Tests adjusted: TestExpectedKeysBaselineCountOnlyScan now verifies
both the new path and the absence of a regression on existing
ScanAt callers; new TestExpectedKeysBaselineRunsAfterShardCatchup
guards against the ordering bug.
Round-8 review (#706 (comment)) addressed: Important, Minor, tests.

@claude review
Claude finished @bootjp's task in 4m 43s. Review:
| # | Item | Verdict |
|---|---|---|
| 1 | Missing ShardApplied and ScopeKeyCount proto definitions | Both defined at lines 714–726; expected_keys = 5 on BeginBackupResponse ✓ |
| 2 | Baseline count runs at step 1 before shards have caught up | Explicitly "After step 2 (every shard has applied through read_ts) and before step 3" at lines 857–868 ✓ |
| 3 | --snapshot-headroom-entries CLI default was 10000, not 1000 | CLI block now shows 1000 at line 1056 ✓ |
| 4 | ScanAt keys_only would break ~10 callers | Replaced with ScanKeysAt overload returning [][]byte; existing callers untouched (lines 1266–1276) ✓ |
Overall: the design is nearly complete. Item 1 (the SnapshotEvery interface gap) requires a one-line addition to the engine interface and one-line correction to the env-var name, but those belong in Phase 1 scope and should be captured before the implementation starts. Item 2 (rolling upgrade safety) is the last true blocker — without it, the first BeginBackup call on a partially upgraded cluster will fail in an undocumented way. Item 3 is a one-line doc fix.
Important (verified against internal/raftengine/{etcd/,}engine.go and
kv/fsm.go):
- SnapshotEvery is not on the raftengine.Engine interface today.
Phase 1 adds StatusReader.SnapshotEvery() uint64 implemented on
the etcd backend by returning the snapshotEveryFromEnv() value
cached at Open. Also corrected the env-var name: there is no
--raftSnapshotEvery flag; the override is
ELASTICKV_RAFT_SNAPSHOT_COUNT (internal/raftengine/etcd/engine.go:93).
- Rolling-upgrade safety. An old kvFSM receiving a 0x03-tagged
BackupPin entry falls through decodeRaftRequests ->
decodeLegacyRaftRequest (kv/fsm.go:140-145) -> proto.Unmarshal ->
Apply error -> apply-loop stall. A single-release introduction
would corrupt apply on un-upgraded replicas the moment the first
BeginBackup runs.
Phase 1 ships in two releases:
Release N: kvFSM.Apply treats unknown tags in [0x03, 0x0F] as
no-ops with an unknown_fsm_tag warning (forward-compat handler,
no other backup machinery active).
Release N+1: BackupPin/Extend/Release handlers ship; BeginBackup
gates on every group member reporting version >= N via
Admin.GetRaftGroups, returning FailedPrecondition otherwise.
Operators may treat N and N+1 as the same calendar release if they
enforce a fleet-wide upgrade window first; the gate still
short-circuits.
Minor:
- ScopeKeyCount.scope_id format documented for all four adapters,
including the redis_databases uint32 -> "redis/db_<n>" mapping.
Tests added: TestForwardCompatUnknownFSMTag,
TestBeginBackupGatesOnMinVersion, TestSnapshotEveryReadsFromEngine.
Round-9 review (#706 (comment)) addressed: Important, Minor, new tests.

@claude review
Claude finished @bootjp's task in 3m 18s. Review:
| # | Item | Verdict |
|---|---|---|
| 1 | SnapshotEvery not on raftengine.Engine interface; wrong env var name | StatusReader.SnapshotEvery() uint64 proposed; env var correctly identified as ELASTICKV_RAFT_SNAPSHOT_COUNT (internal/raftengine/etcd/engine.go:93) ✓ |
| 2 | Rolling-upgrade safety — unknown tag stalls apply loop | Two-release rollout plan (Release N: no-op forward-compat; Release N+1: full handlers + version gate) ✓ |
| 3 | ScopeKeyCount.scope_id undocumented for Redis/SQS | All four adapter mappings documented at lines 722–732; Redis uint32 → "redis/db_<n>" mapping explicit ✓ |
Overall: one confirmed correctness issue remains — the min_backup_version gate depends on a version field that does not exist in RaftGroupState. That's a one-field proto addition and a one-line scope-list update, but without it the gating logic that makes the rolling-upgrade plan safe is unimplementable. Items 2 and 3 are one-liners. After those three fixes, the proposal is ready to merge.
Important (verified against proto/admin.proto):
- The min_backup_version gate claimed Admin.GetRaftGroups already carries a per-node version. It does not. RaftGroupState (proto/admin.proto:38-49) only has raft_group_id, leader_node_id, leader_term, commit_index, applied_index, last_contact_unix_ms. Phase 1 therefore adds string node_version = 7 to RaftGroupState, populated by each admin handler from the build-time version constant. Removed the false "already carries" claim.

Minor:
- MaxEntryBytes is not a constant in the codebase. Replaced with the real limit: MaxSizePerMsg (defaultMaxSizePerMsg = 1<<20, internal/raftengine/etcd/engine.go:55).
- The forward-compat unknown-tag handler position is specified: in kvFSM.Apply immediately after the raftEncodeHLCLease guard and before decodeRaftRequests, so all leading-byte dispatch stays in one place rather than splitting between Apply and the default case of decodeRaftRequests. Tags >= 0x10 still fall through and error, bounding the forward-compat door.
Round-10 review (#706 (comment)) addressed: Important, Minor.

@claude review
Claude finished @bootjp's task in 5m 25s. Review:
| # | Item | Verdict |
|---|---|---|
| 1 | RaftGroupState has no node_version field | Correctly described as Phase 1 addition; "already carries" claim removed (lines 1302–1322) ✓ |
| 2 | MaxEntryBytes doesn't exist | Replaced with MaxSizePerMsg (internal/raftengine/etcd/engine.go:55, confirmed = 1 << 20) (line 1351) ✓ |
| 3 | Forward-compat handler position unspecified | "Immediately after raftEncodeHLCLease guard, before decodeRaftRequests" (lines 1287–1290); matches actual kv/fsm.go:63–69 structure ✓ |
Overall: two confirmed code-vs-doc mismatches remain. Item 1 (the version gate) is the blocker — as written, BeginBackup would always pass the version check because GetRaftGroups only reports the local node's version. Item 2 (the wiring claim) needs a one-line correction plus a one-line Phase 1 scope addition. Item 3 is a two-sentence clarification. After those are addressed, the proposal is ready to merge.
Important (verified against adapter/admin_grpc.go):
- Rolling-upgrade gate redesigned. GetRaftGroups iterates the local AdminServer's registered groups and reports each row from the local engine's Status() (adapter/admin_grpc.go:382-416), so adding node_version to RaftGroupState would only ever report the local admin server's version. Followers on other nodes would not surface, and the gate would falsely pass on a partially-upgraded cluster. Phase 1 instead adds a dedicated GetNodeVersion RPC fanned out by the BeginBackup handler to every peer in AdminServer.members. Each peer responds with its compiled-in version constant; the handler refuses BeginBackup if any peer is below N or unreachable within --begin-backup-deadline. RaftGroupState gains leader_node_version = 7 for UI visibility only (not used for the gate decision).
- AdminServer has no *kv.ShardStore today (confirmed at adapter/admin_grpc.go:62-82). The "existing AdminServer -> ShardStore wiring used by GetRaftGroups" claim was wrong; GetRaftGroups only reaches AdminGroup.Status() and Configuration(). Phase 1 adds the field explicitly, wired in main.go.

Minor:
- The AdminGroup interface (adapter/admin_grpc.go:41-44) does not embed StatusReader, so adding SnapshotEvery() to StatusReader does not reach BeginBackup through s.groups[id]. Phase 1 explicitly extends AdminGroup with SnapshotEvery() uint64; mocks in adapter/admin_grpc_test.go gain one method that returns defaultSnapshotEvery for parity with production.

TestBeginBackupGatesOnMinVersion updated to match the new GetNodeVersion fan-out path; canonical 3-node fixture documented.
Round-11 review (#706 (comment)) addressed: Important, Minor.

@claude review
Claude finished @bootjp's task in 4m 8s. Review:
| # | Item | Verdict |
|---|---|---|
| 1 | Rolling-upgrade gate used GetRaftGroups which only reports local node's version | GetNodeVersion RPC added; BeginBackup fans out to AdminServer.members (and self); RaftGroupState.leader_node_version = 7 for UI only (not gate) ✓ |
| 2 | AdminServer has no *kv.ShardStore field — "existing wiring" claim was wrong | Phase 1 explicitly adds the field and wires it in main.go; false "existing" language removed ✓ |
| 3 | AdminGroup interface doesn't embed StatusReader — SnapshotEvery() unreachable through s.groups[id] | AdminGroup extended with SnapshotEvery() uint64; mocks return defaultSnapshotEvery = 10000 ✓ |
Overall: the design is in excellent shape. Item 1 above (static seed vs. live members) is the only structural correctness gap remaining — the version gate is explicitly designed to prevent backup on partially-upgraded clusters, but it silently misses nodes added after startup. Items 2 and 3 are one-sentence clarifications that will save Phase 1 implementors from having to invent a latency contract. After those three items are addressed, the proposal is ready to merge and Phase 1 implementation can begin.
Important (verified against adapter/admin_grpc.go):
- The BeginBackup gate fan-out target was AdminServer.members (the static bootstrap seed at lines 64 and 89 — never updated after startup). Nodes added later via AddVoter would not appear, silently missing exactly the nodes most likely to be running an older version during a rolling upgrade. Fix: BeginBackup reuses the existing snapshotMembers helper (adapter/admin_grpc.go:158), the same path GetClusterOverview uses, which unions the seed with members discovered from each group's Configuration(). The fan-out target is the live cluster, not the startup snapshot.

Minor:
- GetRaftGroups today is a synchronous local-state read with no network calls. Adding leader_node_version populated via GetNodeVersion(leader) would turn it into a fan-out handler with an unbounded latency tail. Specified an async dial per leader with a 500ms per-leader timeout, "" on failure, and a 10s response cache so rapid polls do not fan out repeatedly.
- The GetNodeVersion failure mode now distinguishes "old version" from "unreachable" in the FailedPrecondition error string so operators can diagnose without out-of-band investigation. Also noted that --begin-backup-deadline is per-peer (fan-out is concurrent), so the worst-case wait is one deadline period regardless of cluster size. (A fan-out sketch follows.)

Tests added: TestBeginBackupGatesIncludesDynamicallyAddedNodes, TestGetRaftGroupsLeaderVersionAsync. TestBeginBackupGatesOnMinVersion updated to assert the new error shapes.
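A sketch of the concurrent version-gate fan-out with a per-peer timeout; the probe callback stands in for the proposed GetNodeVersion RPC, and the plain string comparison is a placeholder for real version parsing:

```go
package adapter

import (
	"context"
	"fmt"
	"time"
)

// gateOnNodeVersions probes every live member concurrently with a per-peer
// deadline, so the worst-case wait is one deadline period regardless of
// cluster size. It distinguishes "unreachable" from "old version" so the
// resulting FailedPrecondition is diagnosable.
func gateOnNodeVersions(ctx context.Context, peers []string, minVersion string, perPeer time.Duration,
	probe func(ctx context.Context, addr string) (string, error)) error {
	type result struct {
		addr, version string
		err           error
	}
	ch := make(chan result, len(peers))
	for _, p := range peers {
		go func(addr string) {
			cctx, cancel := context.WithTimeout(ctx, perPeer)
			defer cancel()
			v, err := probe(cctx, addr)
			ch <- result{addr, v, err}
		}(p)
	}
	for range peers {
		r := <-ch
		switch {
		case r.err != nil: // unreachable within the per-peer deadline
			return fmt.Errorf("begin backup: node %s did not respond within %s: %w", r.addr, perPeer, r.err)
		case r.version < minVersion: // old version (real code would parse, not compare strings)
			return fmt.Errorf("begin backup: node %s reports version %s, need >= %s", r.addr, r.version, minVersion)
		}
	}
	return nil
}
```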
Round-12 review (#706 (comment)) addressed: Important, Minor, new tests.

@claude review
Claude finished @bootjp's task in 3m 41s. Review:
| # | Item | Verdict |
|---|---|---|
| 1 | Static AdminServer.members seed misses dynamically-added nodes | BeginBackup now uses snapshotMembers (adapter/admin_grpc.go:158) — unions seed with live Configuration() — confirmed correct ✓ |
| 2 | GetRaftGroups async-dial latency contract unspecified | 500 ms timeout per leader, empty string on failure, 10 s response cache ✓ |
| 3 | GetNodeVersion conflates old-version and unreachable error | Error message distinguishes "node X reports version vX" from "node X did not respond within Ns" ✓ |
Overall: the design is in excellent shape. Item 1 above (the connection pool gap) is the last structural issue before Phase 1 implementation can begin — without specifying it, two independent implementors will make incompatible choices about credentials and dial options. Item 2 is a one-line addition to the Phase 1 scope list. After those two fixes the proposal is ready to merge.
Important (verified against adapter/admin_grpc.go and
kv/grpc_conn_cache.go):
- AdminServer has no outbound gRPC connection mechanism today.
GetNodeVersion fan-out (in BeginBackup) and the async
leader_node_version probes (in GetRaftGroups) both need to dial
peer admin servers. Without specifying the dialer, two
implementors would make incompatible choices about credentials
and connection lifecycle.
Phase 1 explicitly adds a *kv.GRPCConnCache field to AdminServer
using internal.GRPCDialOptions(), kept separate from
ShardStore.connCache because admin and data-plane addresses
occupy different address spaces (admin cache is keyed on
NodeIdentity.GRPCAddress; data-plane cache is keyed on Raft-
group-leader addresses, which migrate on leader change).
Minor:
- 10s leader_node_version cache is now an explicit AdminServer
struct field (versionCache map[string]struct{...}) listed in
Phase 1 scope rather than implicit prose.
Tests added: TestAdminGRPCConnCacheReuse (verifies cache reuse,
peer-shutdown eviction, independence from ShardStore.connCache).
Round-13 review (#706 (comment)) addressed: Important, Minor, new test.

@claude review
Claude finished @bootjp's task in 5m 34s. Review:
| # | Item | Verdict |
|---|---|---|
| 1 | AdminServer has no outbound gRPC mechanism for GetNodeVersion | *kv.GRPCConnCache added to Phase 1 scope (line 1274); rationale for separate-from-ShardStore cache documented ✓ |
| 2 | 10s leader_node_version cache was implicit prose, not an explicit struct field | versionCache now listed in Phase 1 scope with the field type and TTL contract (lines 1286–1292) ✓ |
Overall: the design is in excellent shape. Item 1 above (auth token propagation) is the last blocker — every BeginBackup attempt on a cluster with an admin token configured would fail with a misleading error. The fix is a one-sentence addition to the fan-out description and a one-liner code pattern. Item 2 is a concurrency-spec nit that prevents a potential data race. After those two are addressed, the proposal is ready to merge and Phase 1 implementation can begin.
Important (verified against adapter/admin_grpc.go:455-505):
- GetNodeVersion is itself an Admin method, gated by AdminTokenAuth on every auth-enabled cluster. The fan-out from BeginBackup through *kv.GRPCConnCache makes outbound calls with no metadata attached, so every dial would return Unauthenticated and the gate would misreport the failure as "node did not respond" on every cluster running --adminToken. Phase 1 explicitly propagates the inbound bearer token to outbound GetNodeVersion calls via metadata.NewOutgoingContext(context.Background(), inMD). The same pattern applies to the async GetNodeVersion(leader) probes in GetRaftGroups. Forwarding the inbound caller's token (not a server-stored value) preserves the operator's authorization decision and avoids privilege escalation by the admin server. (A minimal sketch follows.)

Minor:
- versionCache committed to sync.Map (with key/value types spelled out) so the implementor cannot accidentally produce a data race by reusing groupsMu for cache writes. The previous "(or sync.Map equivalent)" left enough room for a plain map + groupsMu approach that would either contend or race.

Tests added: TestBeginBackupPropagatesAdminAuthToken, TestVersionCacheRaceUnderLoad (the latter runs under -race).
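A minimal sketch of the token forwarding, using only the standard gRPC metadata API; the helper name is illustrative:

```go
package adapter

import (
	"context"

	"google.golang.org/grpc/metadata"
)

// outboundCtxWithCallerToken forwards the inbound caller's metadata
// (including the admin bearer token) onto outbound GetNodeVersion calls,
// so the gate reuses the operator's own authorization rather than a
// server-stored credential.
func outboundCtxWithCallerToken(inbound context.Context) context.Context {
	md, ok := metadata.FromIncomingContext(inbound)
	if !ok {
		return context.Background()
	}
	return metadata.NewOutgoingContext(context.Background(), md)
}
```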
Round-14 review (#706 (comment)) addressed: Important, Minor, new tests.

@claude review
Claude finished @bootjp's task in 4m 24s. Review:
| # | Item | Verdict |
|---|---|---|
| 1 | GetNodeVersion fan-out would fail with Unauthenticated on auth-enabled clusters | metadata.NewOutgoingContext(context.Background(), inMD) pattern documented (lines 1404–1414); forwarding the inbound token, not a server-stored value ✓ |
| 2 | versionCache synchronization left ambiguous — room for data race | Committed to sync.Map with key/value types spelled out (lines 1286–1297) ✓ |
Overall: the design is one small addition away from being ready. Item 1 (the missing context.WithTimeout) is the last correctness gap — as written, the BeginBackup handler would hang indefinitely on any unreachable peer rather than respecting the --begin-backup-deadline budget. Item 2 is a one-line scope-list addition. After both are addressed the proposal is ready to merge.
Split into Phase 0 (offline snapshot decoder) and Phase 1 (live PIT extraction)
The original proposal accumulated significant complexity chasing
cluster-wide point-in-time consistency: BackupPin Raft FSM commands,
GetNodeVersion fan-out, version gate, expected-keys baseline,
admin-side connection cache. That machinery is justified only when
multi-shard cross-group PIT is required.
For many use cases (single-shard clusters, one-time export off
elastickv, recovering from snapshot files when the cluster itself
is gone), a purely offline .fsm-to-directory-tree converter
delivers the original "vendor-independent recovery" goal with no
admin RPCs, no FSM tag, no Raft involvement.
Split:
Phase 0 (new doc, 2026_04_29_proposed_snapshot_logical_decoder.md):
- Offline cmd/elastickv-snapshot-decode and -encode tools
- Owns the per-adapter directory format definition
(DynamoDB / S3 / Redis / SQS layouts, filename encoding,
MANIFEST.json schema, KEYMAP.jsonl, the four-class
internal-state policy).
- Restore via stop-replace-restart of the .fsm + .snap files.
- Trade-offs section explicitly maps when Phase 0 is sufficient
vs. when Phase 1 is required.
Phase 1 (existing doc, retitled and prefaced):
- Title updated to "Live Cluster, PIT-Consistent Extraction
(Phase 1)".
- New Phasing section near the top cross-references Phase 0 as
the format owner.
- Body retains the live-cluster machinery: read_ts pinning,
cross-shard catch-up, expected-keys baseline, version-gated
rolling-upgrade safety, FSM-level pin propagation.
- MANIFEST.phase field discriminates dumps from the two phases
("phase0-snapshot-decode" vs. "phase1-live-pinned").
Both phases produce the same on-disk format so restorers and
external tools do not care which produced a given dump.
Actionable comments posted: 3
🧹 Nitpick comments (2)
docs/design/2026_04_29_proposed_snapshot_logical_decoder.md (1)
13-13: Add language identifiers to fenced code blocks. These blocks currently trigger MD040 warnings. Please tag them (`text`, `json`, `go`, `protobuf`, etc.) so docs lint passes cleanly. Also applies to: 108-108, 209-209, 298-298, 315-315, 420-420, 438-438, 472-472, 481-481.

docs/design/2026_04_29_proposed_logical_backup.md (1)

42-42: Standardize fenced code blocks with explicit languages. These blocks are currently untyped and trigger MD040. Adding explicit languages will improve lint reliability and readability. Also applies to: 148-148, 265-265, 354-354, 386-386, 458-458, 487-487, 518-518, 523-523, 542-542, 553-553, 578-578, 868-868, 1099-1099, 1122-1122, 1136-1136, 1490-1490.
> - If the resulting segment exceeds 240 bytes, the segment is replaced with
>   `<sha256-hex-prefix-32>__<truncated-original>` and the full original
>   bytes are recorded in `<adapter>/<scope>/KEYMAP.jsonl` (the same
>   KEYMAP file used for filename round-trip — long-key fallback and
>   encoded-name reverse-lookup share one append-only stream).
> - **DynamoDB binary partition / sort keys** (the `B` attribute type) are
>   rendered as `b64.<base64url>` so a binary key never collides with a
>   string key whose hex encoding happens to look like base64.
> - **Stream messages** within an SQS queue are not represented by file
>   per message (would be millions of small files); they live in a single
>   `messages.jsonl`. Same rationale for Redis stream entries.
>
> `KEYMAP.jsonl` at each adapter scope root translates the encoded
> filename back to the exact original bytes, in case the user needs to
> feed the data into a system that requires the verbatim key. JSONL
> (one mapping per line, never loaded fully) is used so the file scales
> to millions of keys without becoming a memory bottleneck on either the
> producer or a downstream consumer. The translation is also losslessly
> recoverable from the encoded filename alone — `KEYMAP.jsonl` is a
> convenience, not a correctness dependency. A consumer that does not
> need it can ignore it entirely.
Filename reversibility claim overstates correctness guarantees.
Line 217-Line 220 introduces SHA+truncation fallback with KEYMAP, but Line 235-Line 237 says KEYMAP is not a correctness dependency. For long-key fallback and collision renames, KEYMAP is required to reconstruct original bytes exactly.
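For reference, a sketch of the segment encoding the quoted scheme describes (pass-through for `[A-Za-z0-9._-]`, %HH otherwise, SHA-prefix fallback past 240 bytes); the helper name and the truncation length are illustrative, not from the design doc:

```go
package decoder

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// encodeSegment renders arbitrary key bytes as a filename segment: bytes
// in [A-Za-z0-9._-] pass through, everything else becomes %HH. Segments
// whose encoding exceeds 240 bytes fall back to
// "<sha256-hex-prefix-32>__<truncated-original>"; for those, KEYMAP.jsonl
// is required to recover the exact original bytes (the point of the
// review comment above). The truncation length of 200 is illustrative.
func encodeSegment(key []byte) string {
	var b strings.Builder
	for _, c := range key {
		switch {
		case c >= 'A' && c <= 'Z', c >= 'a' && c <= 'z',
			c >= '0' && c <= '9', c == '.', c == '_', c == '-':
			b.WriteByte(c)
		default:
			fmt.Fprintf(&b, "%%%02X", c)
		}
	}
	enc := b.String()
	if len(enc) > 240 {
		sum := sha256.Sum256(key)
		enc = hex.EncodeToString(sum[:])[:32] + "__" + enc[:200]
	}
	return enc
}
```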
> | Class | Examples | Backup behavior |
> |---|---|---|
> | **Re-derivable from user data + config** | DynamoDB GSI rows; S3 route catalog (`!s3route|`); Redis TTL scan index (`!redis|ttl|`); SQS visibility / age / dedup / group / tombstone indexes; route catalog under default group | Excluded by default. Restore re-builds them as user data is replayed. |
> | **Per-cluster operational state** | HLC physical ceiling; Raft term/index/conf state; FSM marker files (`raft-engine`, `EKVR`/`EKVM`/`EKVW` magic files in `internal/raftengine/etcd/persistence.go`); write conflict counter | Never dumped. They belong to the cluster, not the data. A restore initializes them fresh. |
> | **In-flight transactional state** | `!txn|` intent / lock / resolver records (`kv/txn_keys.go`, `kv/txn_codec.go`); pending S3 multipart uploads | Excluded by default. Optionally dumped under `_internals/` for forensics, but never re-applied by the restore path — replaying intents from a stale read_ts can resurrect aborted transactions. |
Internal-state markdown table is malformed due to unescaped pipes.
Line 1082-Line 1084 includes | inside inline code spans (e.g., !s3route|, !txn|), which breaks column parsing (MD056). Escape the pipes or wrap examples without raw | so all three columns render correctly.
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 1082-1082: Table column count
Expected: 3; Actual: 6; Too many cells, extra data will be missing
(MD056, table-column-count)
[warning] 1084-1084: Spaces inside code span elements
(MD038, no-space-in-code)
[warning] 1084-1084: Spaces inside code span elements
(MD038, no-space-in-code)
[warning] 1084-1084: Table column count
Expected: 3; Actual: 4; Too many cells, extra data will be missing
(MD056, table-column-count)
- If the resulting segment exceeds 240 bytes, it becomes
  `<sha256-prefix-32>__<truncated-original>` and the full original
  bytes are recorded in `<adapter>/<scope>/KEYMAP.jsonl` — same file
  used for filename round-trip and S3 path-collision rename
  bookkeeping.
- **Stream messages** within an SQS queue are not represented as one file
  per message (millions of small files); they live in a single
  `messages.jsonl`. Same rationale for Redis stream entries.

`KEYMAP.jsonl` at each adapter scope root translates encoded
filenames back to original bytes when needed. JSONL (one mapping per
line) so the file scales to millions of keys without becoming a
memory bottleneck. The translation is also losslessly recoverable
from the encoded filename alone — `KEYMAP.jsonl` is a convenience,
not a correctness dependency. A consumer that does not need it can
ignore it entirely.
KEYMAP cannot be optional when SHA fallback is used.
Line 167-Line 170 says oversized segments are replaced with <sha256-prefix-32>__<truncated-original> and the full bytes go to KEYMAP.jsonl; Line 180-Line 182 then says recovery is lossless from the filename alone and KEYMAP is only a convenience. Those two statements conflict. For SHA-fallback entries, KEYMAP is required for full reverse mapping.
Suggested doc fix
-`KEYMAP.jsonl` at each adapter scope root translates encoded
-filenames back to original bytes when needed. JSONL (one mapping per
-line) so the file scales to millions of keys without becoming a
-memory bottleneck. The translation is also losslessly recoverable
-from the encoded filename alone — `KEYMAP.jsonl` is a convenience,
-not a correctness dependency. A consumer that does not need it can
-ignore it entirely.
+`KEYMAP.jsonl` at each adapter scope root translates encoded
+filenames back to original bytes when needed. JSONL (one mapping per
+line) so the file scales to millions of keys without becoming a
+memory bottleneck. For plain `%HH`-encoded segments, reverse mapping
+is recoverable from the filename alone; for SHA-fallback or collision-
+renamed paths, `KEYMAP.jsonl` is required for lossless reversal.
+A consumer that does not need original-key recovery can ignore it.
🧰 Tools
🪛 LanguageTool
[style] ~181-~181: Consider using “who” when you are referring to a person instead of an object.
Context: ...ot a correctness dependency. A consumer that does not need it can ignore it entirely...
(THAT_WHO)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/design/2026_04_29_proposed_snapshot_logical_decoder.md` around lines 167
- 182, The doc currently claims the original bytes are stored in KEYMAP.jsonl
for oversized segments encoded as "<sha256-prefix-32>__<truncated-original>" but
then says the translation is losslessly recoverable from the encoded filename
alone and that KEYMAP.jsonl is optional; update the text to explicitly state
that when the SHA-fallback encoding ("<sha256-prefix-32>__<truncated-original>")
is used, KEYMAP.jsonl is required to recover the full original bytes, and
contrast this with purely lossless encodings where KEYMAP is only a convenience;
adjust the sentences referencing KEYMAP.jsonl and the recovery claim to make
this dependency clear.
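To see why, consider a minimal Go sketch of the two encoding paths, assuming the 240-byte budget and the `%HH` / `<sha256-prefix-32>__<truncated-original>` scheme the draft describes; the function and constant names here are hypothetical, not taken from the elastickv codebase. Only the fallback branch discards bytes, and that is precisely the case where a `KEYMAP.jsonl` record stops being optional.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// maxSegment mirrors the draft's 240-byte filename-segment budget.
const maxSegment = 240

// encodeSegment sketches the draft's scheme: filename-safe bytes pass
// through, unsafe bytes become %HH, and oversized results fall back to
// "<sha256-hex-prefix-32>__<truncated-original>".
func encodeSegment(key []byte) (name string, needsKeymap bool) {
	var b []byte
	for _, c := range key {
		switch {
		case c >= 'a' && c <= 'z', c >= 'A' && c <= 'Z',
			c >= '0' && c <= '9', c == '-', c == '_', c == '.':
			b = append(b, c)
		default:
			b = append(b, fmt.Sprintf("%%%02X", c)...)
		}
	}
	if len(b) <= maxSegment {
		// Reversible: undoing %HH recovers the original key bytes.
		return string(b), false
	}
	sum := sha256.Sum256(key)
	prefix := hex.EncodeToString(sum[:])[:32]
	// The tail of the encoded key is discarded here; the filename alone
	// can no longer reproduce the original bytes, so the KEYMAP.jsonl
	// entry is the only lossless record of this key.
	truncated := string(b[:maxSegment-len(prefix)-len("__")])
	return prefix + "__" + truncated, true
}

func main() {
	name, needsKeymap := encodeSegment([]byte("user:42/profile"))
	fmt.Println(name, needsKeymap) // user%3A42%2Fprofile false
}
```

A decoder can invert the first branch mechanically by undoing `%HH`; for the fallback it can only verify a candidate key by re-hashing, never reconstruct one from the filename.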
@claude review
Claude finished @bootjp's task in 6m 55s —— View job Review:
3. "!sqs|queue|" and "!sqs|msg|" dispatch prefixes are too broad — internal SQS index records will corrupt queue and message output

| Prefix | Kind | Should be |
|---|---|---|
| `!sqs\|queue\|meta\|` | queue metadata | schema encoder |
| `!sqs\|queue\|gen\|` | generation counter | drop |
| `!sqs\|queue\|seq\|` | sequence counter | drop |
| `!sqs\|queue\|tombstone\|` | tombstone record | drop |
| `!sqs\|msg\|data\|` | message body | message encoder |
| `!sqs\|msg\|dedup\|` | dedup index | drop |
| `!sqs\|msg\|group\|` | group index | drop |
| `!sqs\|msg\|byage\|` | by-age index | drop |

"!sqs|queue|" matches all four queue prefixes, so generation, sequence, and tombstone records land in the schema encoder. "!sqs|msg|" matches all four message prefixes, so dedup, group, and byage index entries land in the message encoder and would produce malformed messages.jsonl output (likely failing to unmarshal the storedSQSMsgPrefix envelope, or silently emitting garbage JSON lines).
Fix: use the exact prefixes "!sqs|queue|meta|" and "!sqs|msg|data|" in the dispatch table, with explicit drop entries for the other sub-prefixes. Fix this →
4. "!ddb|meta|" dispatch prefix is too broad — DynamoDB generation counter records will corrupt schema output
The decoder pipeline (line 444):
"!ddb|meta|" → DynamoDB schema encoder
kv/shard_key.go:18–20:
DynamoTableMetaPrefix = "!ddb|meta|table|" // table schema
DynamoTableGenerationPrefix = "!ddb|meta|gen|" // per-table generation counter (internal)

"!ddb|meta|" matches both. The generation counter record (!ddb|meta|gen|) is an internal elastickv tracking key — its value is a generation uint64, not a DescribeTable-shaped JSON. Feeding it into the DynamoDB schema encoder would either fail to unmarshal or silently produce a malformed _schema.json entry that corrupts the table schema.
Fix: change the dispatch entry to "!ddb|meta|table|" and add a drop entry for "!ddb|meta|gen|". Fix this →
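A sketch of the exact-prefix dispatch both fixes call for, assuming a longest-prefix-wins lookup. The prefixes are the ones quoted above from kv/shard_key.go and the decoder pipeline; the action names, table layout, and route helper are hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

type action int

const (
	drop action = iota // internal record: never emitted into the dump
	sqsQueueSchema     // -> sqs/<queue>/_queue.json
	sqsMessage         // -> sqs/<queue>/messages.jsonl
	ddbSchema          // -> dynamodb/<table>/_schema.json
)

// dispatch lists the exact sub-prefixes, so generation, sequence,
// tombstone, dedup, group, and byage records are dropped explicitly
// instead of leaking into the schema or message encoders.
var dispatch = []struct {
	prefix string
	act    action
}{
	{"!sqs|queue|meta|", sqsQueueSchema},
	{"!sqs|queue|gen|", drop},
	{"!sqs|queue|seq|", drop},
	{"!sqs|queue|tombstone|", drop},
	{"!sqs|msg|data|", sqsMessage},
	{"!sqs|msg|dedup|", drop},
	{"!sqs|msg|group|", drop},
	{"!sqs|msg|byage|", drop},
	{"!ddb|meta|table|", ddbSchema},
	{"!ddb|meta|gen|", drop},
}

// route picks the longest matching prefix; an unmatched key is reported
// so a newly added internal prefix fails loudly instead of silently
// corrupting the dump.
func route(key string) (action, bool) {
	best, bestLen, found := drop, -1, false
	for _, e := range dispatch {
		if strings.HasPrefix(key, e.prefix) && len(e.prefix) > bestLen {
			best, bestLen, found = e.act, len(e.prefix), true
		}
	}
	return best, found
}

func main() {
	fmt.Println(route("!sqs|queue|gen|orders")) // 0 true (drop)
	fmt.Println(route("!ddb|meta|table|users")) // 3 true (ddbSchema)
}
```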
Minor (CodeRabbit-flagged, both docs)
5. KEYMAP.jsonl optionality claim is incorrect when SHA fallback is used
Both documents contain the statement: "The translation is also losslessly recoverable from the encoded filename alone — KEYMAP.jsonl is a convenience, not a correctness dependency."
This is false for SHA-fallback filenames. When a key segment exceeds 240 bytes, it becomes <sha256-prefix-32>__<truncated-original> — the truncated original cannot be reconstructed from the filename alone. The KEYMAP.jsonl entry is the only place the full original bytes exist in the dump. Similarly for S3 collision-renamed keys (.elastickv-leaf-data suffix).
Fix in both docs: restate the claim accurately — KEYMAP is a convenience for the common case (pure %HH-encoded segments where reverse mapping is mechanical), but is a correctness dependency for SHA-fallback and collision-renamed entries. A consumer that never needs the verbatim original key can ignore it; one that does must read it. Fix this →
6. Phase 1 internal-state table has unescaped | in code spans — broken rendering
Phase 1 doc, line 1082:
| ... | ... (`!s3route|`); ... (`!redis|ttl|`); ... | ... |
The | inside code spans (!s3route|, !redis|ttl|) break table column parsing (markdownlint MD056: "Expected: 3; Actual: 6"). The Phase 0 doc already fixes this with \| escaping (line 407: `!s3route\|`, `!redis\|ttl\|`). Apply the same escaping to line 1082 in the Phase 1 doc. Fix this →
Round-15 items — status
| # | Item | Verdict |
|---|---|---|
| 1 | `context.Background()` no deadline + `WaitForReady(true)` blocks indefinitely | Still open — line 1433 unchanged |
| 2 | `GetRaftGroups` discards context — auth propagation unimplementable | Still open — not in Phase 1 scope list |
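Item 1 in the table has a well-known shape of fix: give the call a deadline so `grpc.WaitForReady(true)` stops waiting once the context expires. A minimal sketch, where `versionClient` and the 3-second budget are placeholders rather than the real generated stub:

```go
package admin

import (
	"context"
	"time"

	"google.golang.org/grpc"
)

// versionClient stands in for the generated gRPC stub; the real method
// also takes a request message, elided here.
type versionClient interface {
	GetNodeVersion(ctx context.Context, opts ...grpc.CallOption) (string, error)
}

// getNodeVersionBounded keeps WaitForReady(true) for transient blips but
// bounds it with a per-call deadline, so one unreachable peer can delay
// the fan-out by at most three seconds instead of hanging it forever.
func getNodeVersionBounded(parent context.Context, c versionClient) (string, error) {
	ctx, cancel := context.WithTimeout(parent, 3*time.Second)
	defer cancel()
	return c.GetNodeVersion(ctx, grpc.WaitForReady(true))
}
```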
Overall: the Phase 0/Phase 1 split is clean and the format specification is solid. Items 3 and 4 above are correctness blockers for Phase 0 implementation — a decoder that dispatches on the wrong prefixes will silently corrupt its output before a single line of implementation code is written. Items 1 and 2 are the two carry-over items that make the GetNodeVersion fan-out hang on unreachable peers. After all four are addressed the design is ready to merge.
Summary
Adds `docs/design/2026_04_29_proposed_logical_backup.md`, a proposal to expose elastickv backups as a per-adapter, decomposed directory tree instead of the opaque FSM snapshot stream. Backups become recoverable without elastickv — `cat`, `jq`, `find`, `aws s3 sync`, and `redis-cli --pipe` are enough to extract any single record.

Layout (top level)
- `dynamodb/<table>/_schema.json` + `items/<pk>/<sk>.json` (one wire-format DynamoDB item per file).
- `s3/<bucket>/<original/object/path>` — the raw assembled object body at its natural path, with a `<obj>.metadata.json` sidecar (Content-Type, ETag, user metadata, …). `aws s3 sync` works on the dump directory directly.
- `redis/db_<n>/{strings,hashes,lists,sets,zsets,streams,hll}/<key>.{bin,json,jsonl}` — strings stay raw bytes (binary-safe), collections are JSON, streams are JSONL in stream order.
- `sqs/<queue>/_queue.json` + `messages.jsonl` — one record per line, ordered by `(send_ts, sequence_number, message_id)`.
- `MANIFEST.json` and `CHECKSUMS` at the root; `MANIFEST.json` is the only file a restore tool must read first (a reader sketch follows this list).
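To make the read-`MANIFEST.json`-first contract concrete, here is a small consumer-side sketch; the `format_version` and `files` field names are assumptions, since the actual manifest schema is defined by the design doc rather than anything shown here:

```go
package restore

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// manifest models only what this sketch needs; the field names
// (format_version, files) are assumed, not taken from the spec.
type manifest struct {
	FormatVersion int      `json:"format_version"`
	Files         []string `json:"files"`
}

// loadManifest reads MANIFEST.json first, per the draft's contract, and
// refuses a partial tree by checking that every listed file exists.
func loadManifest(root string) (*manifest, error) {
	raw, err := os.ReadFile(filepath.Join(root, "MANIFEST.json"))
	if err != nil {
		return nil, fmt.Errorf("read manifest: %w", err)
	}
	var m manifest
	if err := json.Unmarshal(raw, &m); err != nil {
		return nil, fmt.Errorf("parse manifest: %w", err)
	}
	for _, f := range m.Files {
		if _, err := os.Stat(filepath.Join(root, f)); err != nil {
			return nil, fmt.Errorf("dump is partial, missing %s: %w", f, err)
		}
	}
	return &m, nil
}
```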
Key design choices

- Strips the internal magic prefixes (`storedDynamoItemProtoPrefix` 0x00 'D' 'I' 0x01, `storedSQSMsgPrefix` 0x00 'S' 'M' 0x01, `EKVPBBL1`) and the wide-column key encodings (`!hs|fld|<keyLen>...`, `!s3|blob|<bucket><gen><obj>...`, `!ddb|item|...`). The dump format is decoupled from internal layouts that have changed before and will change again.
- Pins `read_ts` via the lease-read path; held open via `kv/active_timestamp_tracker.go` so the compactor cannot retire MVCC versions the dump still depends on (see the pinning sketch after this list).
- Filename-safe key encoding (`%HH`, `b64.` prefix for binary DynamoDB keys, SHA-256 fallback for >240-byte segments) with a `KEYMAP` sidecar that round-trips originals.
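The `read_ts` bullet above carries the consistency story; here is a hedged sketch of the producer sequence, assuming the `PinWithDeadline`/`Unpin` shape discussed in this thread (the signatures are guesses, not the actual `kv/active_timestamp_tracker.go` API):

```go
package backup

import (
	"context"
	"fmt"
	"time"
)

// tracker stands in for kv/active_timestamp_tracker.go; PinWithDeadline
// is the deadline-carrying Pin variant this thread converged on, and its
// signature here is a guess.
type tracker interface {
	PinWithDeadline(readTS uint64, deadline time.Time) (token string, err error)
	Unpin(token string) error
}

// dumpAt pins read_ts before the first scan and releases it after the
// last, so the compactor cannot retire MVCC versions mid-dump; the
// deadline bounds the damage if the producer crashes without unpinning.
func dumpAt(ctx context.Context, t tracker, readTS uint64,
	scan func(context.Context, uint64) error) error {
	token, err := t.PinWithDeadline(readTS, time.Now().Add(10*time.Minute))
	if err != nil {
		return fmt.Errorf("pin read_ts %d: %w", readTS, err)
	}
	defer t.Unpin(token) // best effort; the sweeper reclaims expired pins
	return scan(ctx, readTS)
}
```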
Phasing

- `cmd/elastickv-backup` producer + `BeginBackup`/`EndBackup` admin RPCs.
- `cmd/elastickv-restore` consumer driving public adapter endpoints (PutItem, PutObject, SET/HSET/…, SendMessage). Restore exercises the same code path real clients do, so format/version drift never silently bricks an old backup.
Self-review

- Builds on the snapshot disk-offload design (`2026_04_14_implemented_etcd_snapshot_disk_offload.md`) and registers `read_ts` with the active timestamp tracker so dumps cannot lose data to compaction.
- Every file is listed in `MANIFEST.json`, so restore refuses partial trees.
- One `read_ts` across all adapters; GSIs/secondary indexes are intentionally re-derived on restore so they cannot diverge from the base data they shadow.
Test plan

- `make lint` / `make test` are no-ops for this PR.