docs(design): propose per-adapter logical backup format #706

Open
bootjp wants to merge 16 commits into main from docs/proposed-logical-backup


Conversation

@bootjp (Owner) commented Apr 29, 2026

Summary

Adds docs/design/2026_04_29_proposed_logical_backup.md, a proposal to expose elastickv backups as a per-adapter, decomposed directory tree instead of the opaque FSM snapshot stream. Backups become recoverable without elastickv: cat, jq, find, aws s3 sync, and redis-cli --pipe are enough to extract any single record.

Layout (top level)

  • dynamodb/<table>/_schema.json + items/<pk>/<sk>.json (one wire-format DynamoDB item per file).
  • s3/<bucket>/<original/object/path> — the raw assembled object body at its natural path, with a <obj>.metadata.json sidecar (Content-Type, ETag, user metadata, …). aws s3 sync works on the dump directory directly.
  • redis/db_<n>/{strings,hashes,lists,sets,zsets,streams,hll}/<key>.{bin,json,jsonl} — strings stay raw bytes (binary-safe), collections are JSON, streams are JSONL in stream order.
  • sqs/<queue>/_queue.json + messages.jsonl — one record per line, ordered by (send_ts, sequence_number, message_id).
  • MANIFEST.json and CHECKSUMS at the root; MANIFEST.json is the only file a restore tool must read first.
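
For illustration, a minimal Go sketch of the restore-tool entry point this layout implies: decode MANIFEST.json, validate the version, then dispatch. The struct fields mirror manifest keys quoted later in this thread (format_version, checksum_algorithm, dynamodb_layout); the full schema lives in the design doc, so the shape here is an assumption, not the spec.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// Manifest is a hypothetical subset of the manifest schema; field names
// follow ones mentioned elsewhere in this PR, not a confirmed spec.
type Manifest struct {
	FormatVersion     int    `json:"format_version"`
	ChecksumAlgorithm string `json:"checksum_algorithm"`
	DynamoDBLayout    string `json:"dynamodb_layout"` // "per-item" | "jsonl"
}

// readManifest is the single decode a restore tool must perform before
// anything else; every other file is decoded from its path and contents.
func readManifest(root string) (*Manifest, error) {
	b, err := os.ReadFile(filepath.Join(root, "MANIFEST.json"))
	if err != nil {
		return nil, err
	}
	var m Manifest
	if err := json.Unmarshal(b, &m); err != nil {
		return nil, err
	}
	if m.FormatVersion != 1 {
		return nil, fmt.Errorf("unsupported format_version %d", m.FormatVersion)
	}
	return &m, nil
}
```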

Key design choices

  • Strips internal magic prefixes (storedDynamoItemProtoPrefix 0x00 'D' 'I' 0x01, storedSQSMsgPrefix 0x00 'S' 'M' 0x01, EKVPBBL1) and the wide-column key encodings (!hs|fld|<keyLen>..., !s3|blob|<bucket><gen><obj>..., !ddb|item|...). The dump format is decoupled from internal layouts that have changed before and will change again.
  • Cluster-wide consistent read_ts via the lease-read path; held open via kv/active_timestamp_tracker.go so the compactor cannot retire MVCC versions the dump still depends on.
  • Internal-state policy is explicit: re-derivable indexes (DynamoDB GSI, SQS visibility/age/dedup/group, Redis TTL index, S3 route catalog) are excluded by default and rebuilt by replaying user data through public adapter endpoints on restore. HLC ceiling, Raft term/index, txn intents, in-flight multipart uploads are never dumped.
  • Filename encoding is documented (RFC3986 unreserved + %HH, b64. prefix for binary DynamoDB keys, SHA-256 fallback for >240-byte segments) with a KEYMAP sidecar that round-trips originals.
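
A hedged sketch of that segment encoding, assuming the unreserved set [A-Za-z0-9._-] quoted later in the thread. The sha256. prefix for the long-segment fallback is an invented placeholder (the doc only says "SHA-256 fallback", with KEYMAP round-tripping originals), and the b64. prefix for binary DynamoDB keys is a separate rule not shown here.

```go
package dump

import (
	"crypto/sha256"
	"fmt"
	"strings"
)

// isUnreserved reports whether a byte may pass through unescaped
// (the RFC3986 unreserved set).
func isUnreserved(b byte) bool {
	return b >= 'A' && b <= 'Z' || b >= 'a' && b <= 'z' ||
		b >= '0' && b <= '9' || b == '.' || b == '_' || b == '-'
}

// encodeSegment turns an arbitrary key segment into a safe filename:
// unreserved bytes pass through, everything else becomes %HH, and
// segments longer than 240 bytes fall back to a digest name (the
// original bytes are recorded in the KEYMAP sidecar).
func encodeSegment(seg []byte) string {
	if len(seg) > 240 {
		// "sha256." is an assumed prefix for the fallback, not the spec.
		return fmt.Sprintf("sha256.%x", sha256.Sum256(seg))
	}
	var sb strings.Builder
	for _, b := range seg {
		if isUnreserved(b) {
			sb.WriteByte(b)
		} else {
			fmt.Fprintf(&sb, "%%%02X", b)
		}
	}
	return sb.String()
}
```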

Phasing

  1. Phase 1 — cmd/elastickv-backup producer + BeginBackup / EndBackup admin RPCs.
  2. Phase 2 — cmd/elastickv-restore consumer driving public adapter endpoints (PutItem, PutObject, SET/HSET/…, SendMessage). Restore exercises the same code path real clients do, so format/version drift never silently bricks an old backup.
  3. Phase 3 — incremental / scoped backups (out of scope here, sketched at the bottom).

Self-review

  • Data loss — Doc only; no code changes. The proposal explicitly leaves the FSM snapshot path untouched (see 2026_04_14_implemented_etcd_snapshot_disk_offload.md) and registers read_ts with the active timestamp tracker so dumps cannot lose data to compaction.
  • Concurrency / distributed failures — Cross-shard consistency uses the same lease-read primitive every adapter already uses; producer crash leaves no MANIFEST.json, so restore refuses partial trees.
  • Performance — Dump is 1.5×–3× larger and slower than the physical FSM snapshot; trade-off is documented. No hot-path impact.
  • Data consistency — Single read_ts across all adapters; GSIs/secondary indexes intentionally re-derived on restore so they cannot diverge from the base data they shadow.
  • Test coverage — P0/P1/P2 test plan included in the doc (filename round-trip, multipart S3 reassembly, FIFO order, cross-adapter consistency, external-tool replay against MinIO / local DynamoDB / real Redis).

Test plan

  • No code changed — make lint / make test are no-ops for this PR.
  • Reviewers: confirm the layout is decomposable enough to recover one record without elastickv, and that the internal-state exclusion list is complete (anything that would resurrect aborted state on restore must be excluded).

Summary by CodeRabbit

  • Documentation
    • Added comprehensive design specification for a logical backup system featuring admin controls, cluster-wide consistency, and automatic expiration management
    • Added specification for a backup file decoder tool enabling users to inspect and restore from logical backups

Per-adapter directory hierarchy (dynamodb/<table>/items/<pk>.json,
s3/<bucket>/<original/path>, redis/db_<n>/<type>/<key>.{bin,json,jsonl},
sqs/<queue>/messages.jsonl) so a backup is recoverable without elastickv —
cat / jq / find / aws s3 sync / redis-cli are sufficient. Strips internal
magic prefixes and wide-column key encodings; pins read_ts via the lease-
read path and the active timestamp tracker so the dump is consistent and
the compactor cannot retire its versions mid-dump.

@coderabbitai Bot commented Apr 29, 2026

📝 Walkthrough

Walkthrough

Two comprehensive design documents introduce logical backup and snapshot decoder functionality. The first specifies Phase 1 design for extracting point-in-time consistent backups from live clusters using admin RPCs and raft FSM commands. The second standardizes conversion of FSM snapshots into inspectable vendor-neutral logical backup formats with per-adapter schemas.

Changes

  • Logical Backup Design (docs/design/2026_04_29_proposed_logical_backup.md)
    Specifies the Phase 1 design for live-cluster logical backups, including new admin RPCs (BeginBackup, RenewBackup, EndBackup, ListAdaptersAndScopes, GetNodeVersion), raft FSM commands for cluster-wide read-ts pinning, deadline-based timestamp tracking, MVCC key counting, a paging scanner wrapper to avoid materialization, rolling-upgrade safety via forward-compatible FSM tags, version gating with leader caching, and a comprehensive multi-phase test coverage plan.
  • Snapshot Decoder Specification (docs/design/2026_04_29_proposed_snapshot_logical_decoder.md)
    Defines the offline converter specification for decoding elastickv FSM snapshots into vendor-independent logical backup directory trees with MANIFEST.json and CHECKSUMS, RFC3986-based filename encoding, per-adapter schemas (DynamoDB items, S3 objects with metadata sidecars, Redis data types, SQS message queues), decoding/exclusion rules, encoder/decoder CLI pipelines, operator restore workflows, and acceptance test categories covering round-trip consistency and corruption prevention.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Through snapshots hop, logical paths gleam,
Admin RPC whispers, a backup dream,
Pinned timestamps guard the data so dear,
From live clusters safe, year after year!
Decoders translate to formats so bright,
The warren's archives shine pure and white.

🚥 Pre-merge checks: ✅ 5 passed

  • Description Check ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title 'docs(design): propose per-adapter logical backup format' directly and clearly summarizes the main addition: two comprehensive design documents proposing a per-adapter, directory-tree logical backup format for elastickv. It accurately reflects the primary change in the changeset.
  • Docstring Coverage ✅ Passed: no functions found in the changed files to evaluate docstring coverage; skipping the check.
  • Linked Issues Check ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check ✅ Passed: check skipped because no linked issues were found for this pull request.


@bootjp (Owner, Author) commented Apr 29, 2026

@claude review

@claude Bot commented Apr 29, 2026

Claude finished @bootjp's task in 3m 3s.


Review: docs/design/2026_04_29_proposed_logical_backup.md

  • Gather context
  • Read design document
  • Analyze proposal thoroughly
  • Post review feedback

This is a well-structured proposal. The motivations are clear, the per-adapter layouts are practical, and the decision to restore through public adapter endpoints is the right call for long-term format stability. A few issues to resolve before implementation begins.


Critical

1. ScanRange / kv.ShardedCoordinator.ScanRange does not exist

The producer flow (BeginBackup → ListAdaptersAndScopes → ScanRange(at_ts) → ...) depends on a scan primitive that isn't in ShardedCoordinator. Looking at kv/sharded_coordinator.go, the public API is Dispatch (single-key or multi-key mutation groups) and related transaction methods — there is no ScanAt, ScanRange, or iteration primitive today. This is the most load-bearing new API in Phase 1 and needs to be designed explicitly: a key-range iterator on a shard group at a fixed read_ts, surfacing raw KV pairs so per-adapter encoders can decode them. The document should sketch the interface and mention whether it threads through ShardStore / kv.FSM or goes to store.MVCCStore directly.
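
To make the ask concrete, a hedged Go sketch of the kind of primitive the review means; every name here (KVPair, BackupScanner, NewBackupScanner) is illustrative, and where it lives is exactly the open question.

```go
package kv

// KVPair is a raw key/value as stored, before any adapter decoding.
type KVPair struct {
	Key, Value []byte
}

// BackupScanner is a hypothetical paged iterator over raw KV pairs at a
// fixed read_ts, which per-adapter encoders would then decode.
type BackupScanner interface {
	// Next returns the next page of pairs in [start, end) at the pinned
	// read_ts, and (nil, nil) once the range is exhausted.
	Next() ([]KVPair, error)
	Close() error
}

// A possible constructor shape, wherever it ends up living:
// NewBackupScanner(start, end []byte, readTS uint64, pageSize int) BackupScanner
```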

2. ActiveTimestampTracker has no TTL — the 30-minute auto-expiry needs to be built

The proposal says: "A tool-side crash leaves the tracker entry behind for at most backup_ts_ttl (default 30 minutes, configurable via the same admin RPC)". The actual ActiveTimestampTracker (kv/active_timestamp_tracker.go) only has Pin(ts) → token and token.Release(). There is no deadline, no timeout, and no sweep loop. If a producer crashes without calling EndBackup, the Pin registration is held forever and the compactor is blocked indefinitely. This must be addressed in Phase 1 — either add a deadline parameter to Pin, add a background sweeper, or specify another mechanism. The design doc should describe the chosen approach.
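
A minimal sketch of the deadline-plus-sweeper option, assuming an in-memory tracker like the existing one; PinWithDeadline, the pin layout, and the one-second sweep cadence are all proposals from this thread, not existing elastickv code.

```go
package kv

import (
	"sync"
	"time"
)

// backupPinTracker is a hypothetical stand-in for the real tracker.
type backupPinTracker struct {
	mu   sync.Mutex
	pins map[uint64]pinEntry
	next uint64
}

type pinEntry struct {
	ts       uint64
	deadline time.Time
}

// PinWithDeadline registers ts and returns a token. A crashed producer
// can now block compaction for at most ttl instead of forever.
func (t *backupPinTracker) PinWithDeadline(ts uint64, ttl time.Duration) uint64 {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.next++
	t.pins[t.next] = pinEntry{ts: ts, deadline: time.Now().Add(ttl)}
	return t.next
}

// sweep drops expired pins once per second; EndBackup still releases
// promptly via the token in the normal case.
func (t *backupPinTracker) sweep(stop <-chan struct{}) {
	tick := time.NewTicker(time.Second)
	defer tick.Stop()
	for {
		select {
		case <-stop:
			return
		case <-tick.C:
			now := time.Now()
			t.mu.Lock()
			for tok, p := range t.pins {
				if now.After(p.deadline) {
					delete(t.pins, tok)
				}
			}
			t.mu.Unlock()
		}
	}
}
```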


Important

3. hll/ directory is missing from the top-level Redis layout diagram

The schema section (lines 139–146 in the doc) shows strings/, hashes/, lists/, sets/, zsets/, streams/ — but hll/ is documented in the body text (hll/<key>.bin) and the HLL prefix !redis|hll| exists in adapter/redis_compat_types.go:83. The layout tree is incomplete.

4. S3 .metadata.json sidecar collides with object keys ending in .metadata.json

If a user stores an S3 object at 2026/04/29/img.jpg.metadata.json, the sidecar for 2026/04/29/img.jpg would collide with it on the filesystem. The design needs a collision-safe sidecar convention. Common approaches: use a separate _meta/ subdirectory mirroring the object tree, or use a double extension like .elastickv-meta.json and document that the suffix is reserved (analogous to how the internal prefix policy works for keys).

5. Cross-shard read_ts admissibility: "below each group's last-applied commit-ts" is a liveness assertion, not a wait

The proposal says BeginBackup enforces the chosen read_ts is below max(group_commit_ts) + 50ms_skew. But if shard B has not yet applied log entries up to that point, a ScanRange(at_ts=read_ts) on shard B would either return a stale view or fail visibility checks — the 50ms skew buffer does not guarantee shard B has actually applied through read_ts. An implementation note is needed: does BeginBackup wait (with a deadline) for each shard to report applied_index ≥ f(read_ts), or does ScanRange block at the store layer? The consistency guarantee in the document is correct in intent but the mechanism to enforce it is underspecified.

6. CHECKSUMS file format is not specified

The doc says CHECKSUMS contains "SHA-256 of every regular file under the root" but doesn't say the format. Specifying sha256sum(1) compatible output (<64-char hex> <relative-path>) would let users verify integrity with sha256sum -c CHECKSUMS without any elastickv tooling — which is in the spirit of the proposal.
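
Since the format suggestion is concrete, here is a hedged Go sketch that emits sha256sum(1)-compatible lines; writeChecksums is an invented helper, and skipping the CHECKSUMS file itself is an assumption about where it lives.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
)

// writeChecksums emits "<64-hex>  <relative-path>" for every regular
// file under root, so `sha256sum -c CHECKSUMS` verifies the dump with
// no elastickv tooling at all.
func writeChecksums(root string, out io.Writer) error {
	return filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil || !d.Type().IsRegular() {
			return err
		}
		rel, err := filepath.Rel(root, path)
		if err != nil {
			return err
		}
		if rel == "CHECKSUMS" { // don't checksum the checksum file itself
			return nil
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		h := sha256.New()
		if _, err := io.Copy(h, f); err != nil {
			return err
		}
		// Two spaces between hash and path, as sha256sum(1) expects.
		_, err = fmt.Fprintf(out, "%x  %s\n", h.Sum(nil), rel)
		return err
	})
}
```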

7. ListAdaptersAndScopes is undocumented

The producer flow references ListAdaptersAndScopes but it doesn't appear in Phase 1 scope, admin.proto, or anywhere in the codebase. Is it a new RPC? A client-side scan of the route catalog? This should be defined.


Minor

8. Redis TTL handling on restore: absolute epoch-ms timestamps that have already expired

TTL values are stored as absolute Unix-millis. If the backup is 2 hours old and a key had a 1-hour TTL, restoring it with the original expire_at_ms would cause the key to expire immediately. The restore tool should either skip already-expired keys (default) or let the user pass --preserve-ttl/--skip-expired. Document the default behavior.
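
A tiny Go sketch of the suggested default, under the assumptions above; shouldRestoreKey and the flag semantics are illustrative, not a defined CLI contract.

```go
package restore

import "time"

// shouldRestoreKey implements the suggested default: skip keys whose
// absolute expire_at_ms has already passed, unless the operator opts
// into restoring them anyway with the original expiry (--preserve-ttl).
func shouldRestoreKey(expireAtMs int64, preserveTTL bool) bool {
	if expireAtMs == 0 {
		return true // no TTL recorded for this key
	}
	if preserveTTL {
		return true // restore with the original absolute expiry, even if past
	}
	return expireAtMs > time.Now().UnixMilli()
}
```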

9. aws s3 sync works only with --exclude "*.metadata.json"

The doc says "aws s3 sync works on the dump directory directly". It would also upload all .metadata.json sidecars as objects. The one-liner should read aws s3 sync --exclude "*.metadata.json" s3/<bucket>/ s3://target-bucket/.

10. DynamoDB example: b64.AAECAw../single.json is ambiguous

The example table has a composite key (customer_id hash, order_ts range). The b64.AAECAw../single.json entry looks like a binary partition key with a sort key value of "single" — but it's not labelled as such. If it's for a hash-only table, the layout would be different (file directly under items/, not in a subdirectory). Clarify the example.

11. SQS FIFO deduplication window reset deserves a stronger warning

"The dedup window will reset on restore" is buried in the "in-flight side records" paragraph. For FIFO queues with ContentBasedDeduplication=true, replaying the same messages through SendMessage on restore will work (fresh window), but a caller that relied on dedup IDs from the original queue will get duplicates if they replay messages themselves concurrently. This deserves a callout in the trade-offs section, not just a passing mention.

12. Redis stream restore ordering in --mode merge

XADD with an explicit ID fails if an entry with a higher ID already exists in the stream. The restore doc should clarify that --mode merge for streams either (a) skips IDs already present, (b) requires the target stream to be empty, or (c) tails with auto-generated IDs and loses the original ID semantics.


Nit

  • backup_ts_ttl appears in the text but not in the BeginBackup RPC — add it to the proposed RPC signature in Phase 1.
  • The MANIFEST.json spec omits backup_ts_ttl / active_timestamp_tracker_ttl_ms even though it's a dump-time parameter that a restore tool might want to surface.

Overall: strong proposal. The cross-adapter consistency model, the exclusion-list rationale, and the restore-through-public-API decision are all well-reasoned. The main work before implementation is filling in the ScanRange API gap (#1), specifying the TTL mechanism for stuck registrations (#2), and choosing a safe sidecar naming convention for S3 (#4).

Critical:
- Replace nonexistent kv.ShardedCoordinator.ScanRange with the actual
  ShardStore.ScanAt (kv/shard_store.go:106) plus a thin BackupScanner
  iterator wrapper for paged scans.
- Specify ActiveTimestampTracker.PinWithDeadline plus a per-second
  sweeper goroutine. The existing Pin has no TTL, so a crashed
  producer would block compaction forever without this.

Important:
- Add hll/ to the Redis layout tree.
- Replace .metadata.json sidecar with .elastickv-meta.json reserved
  suffix; document collision handling and rename-collisions flag.
- Spell out cross-shard wait: BeginBackup polls each group applied_index
  until it covers read_ts (5s default deadline) before any scan starts.
  The 50ms skew alone was a liveness assertion, not an enforced wait.
- Specify CHECKSUMS as sha256sum(1)-compatible (verifiable with
  sha256sum -c CHECKSUMS from the dump root).
- Add ListAdaptersAndScopes RPC alongside BeginBackup/EndBackup with
  ttl_ms / pin_token.

Minor:
- Redis TTL: default skip-expired on restore; preserve-ttl opt-in.
- aws s3 sync example needs exclude pattern for sidecars.
- DynamoDB example clarified: composite-key vs hash-only layouts.
- SQS dedup-window reset moved into trade-offs as an explicit callout.
- Redis stream merge mode: refuse non-empty target; auto-id strategy
  falls back to XADD * with an _id_remap.jsonl mapping log.

Tests added: TestPinWithDeadlineExpiry,
TestBeginBackupWaitsForLaggingShard, TestBackupScannerPaging,
TestS3SidecarSuffixCollision, TestRedisTTLExpiredKeySkippedByDefault,
TestRedisStreamMergeRejectsNonEmpty.

@bootjp (Owner, Author) commented Apr 29, 2026

Claude-bot review (#706 (comment)) reflected in 5719128. Summary of responses:

Critical

  • #1 ScanRange doesn't exist — replaced with ShardStore.ScanAt (which does exist, at kv/shard_store.go:106) plus a new BackupScanner iterator wrapper for paged scans.
  • #2 ActiveTimestampTracker no TTL — added PinWithDeadline + per-second sweeper goroutine spec.

Important

Minor

Nits

  • backup_ts_ttl / s3_meta_suffix / checksum_format added to MANIFEST.json; ttl_ms is on the BeginBackup RPC sketch.

New tests: TestPinWithDeadlineExpiry, TestBeginBackupWaitsForLaggingShard, TestBackupScannerPaging, TestS3SidecarSuffixCollision, TestRedisTTLExpiredKeySkippedByDefault, TestRedisStreamMergeRejectsNonEmpty.

@claude review

@claude Bot commented Apr 29, 2026

Claude finished @bootjp's task in 4m 11s.


Review: docs/design/2026_04_29_proposed_logical_backup.md (round 2)

  • Gather context
  • Read updated design document
  • Verify previous review items were addressed
  • Check referenced source files
  • Post review feedback

All 12 items from the first review are addressed — the layout is cleaner, the RPC sketches are credible, and the trade-offs section is now the right level of detail. Three structural gaps surfaced when I checked the proposed interfaces against the actual code; two minor doc inconsistencies follow.


Important

1. ScanAt's own doc comment contradicts the design's consistency claim

kv/shard_store.go:101–105 contains this warning:

// ScanAt scans keys across shards at the given timestamp. Note: when the range
// spans multiple shards, each shard may have a different Raft apply position.
// This means the returned view is NOT a globally consistent snapshot — it is
// a best-effort point-in-time scan. Callers requiring cross-shard consistency
// should use a transaction or implement a cross-shard snapshot fence.

The design doc says BeginBackup implements exactly that fence — polling applied_index across every group until all have applied through read_ts. If that polling is correct, the ScanAt comment's warning no longer applies to the backup path. But as written, someone implementing Phase 1 will read this comment and either doubt the design or add an unnecessary extra mechanism.

The design doc should note the discrepancy and direct the implementation to update the ScanAt comment (or add a separate ScanAt variant with a precondition that the caller has already fenced the applied index). Without that, the code-doc split will persist and the next reader will face the same confusion.

2. NewBackupScanner is on ShardedCoordinator but ShardedCoordinator has no ShardStore

The proposed signature is:

func (c *ShardedCoordinator) NewBackupScanner(start, end []byte, ts uint64, pageSize int) BackupScanner

ShardStore (which holds ScanAt) and ShardedCoordinator are sibling structs, not a hierarchy. Both hold groups map[uint64]*ShardGroup but ShardedCoordinator does not embed or reference *ShardStore. The coordinator's single store store.MVCCStore field is the default group's bare MVCC store, not the multi-shard ShardStore.

An implementation today would have to either (a) add a *ShardStore field to ShardedCoordinator (non-trivial wire-up across all call sites in main.go), or (b) put NewBackupScanner on *ShardStore directly and have the backup CLI call it from there (cleaner, since ShardStore is already where multi-shard scan routing lives).

The design should commit to one of these options rather than leaving the coupling implicit.

3. BeginBackup renewal has no protocol support

The flow description (step 5) says the producer calls BeginBackup again every ttl_ms / 3 to extend the deadline. But the proposed BeginBackupRequest message only has:

message BeginBackupRequest {
  uint64 ttl_ms = 1;
}

There is no existing_token or target_read_ts field. A fresh BeginBackup call would pin a new read_ts at the current HLC time, not extend the existing one — the producer would silently drift to a newer snapshot and lose consistency. Either add a RenewBackup(pin_token, ttl_ms) → ttl_ms_effective RPC, or add an optional bytes existing_token = 2 field to BeginBackupRequest and document the semantic (if set, extend the existing pin; if unset, open a fresh one).

4. ListAdaptersAndScopes has no read_ts or pin_token to bind its scan to the backup snapshot

The producer flow is BeginBackup → ListAdaptersAndScopes → BackupScanner. But ListAdaptersAndScopesRequest is empty — the RPC has no way to know which read_ts to scan metadata at. If it scans at the current HLC time (post-BeginBackup), it might see tables/queues that were created after read_ts, leading to a scope list that doesn't match the data scan. Pass either the pin_token from BeginBackup (server looks up associated read_ts) or a read_ts field.


Minor

5. strings_ttl.json could be unbounded in size

strings_ttl.json at the db_N root is a single JSON object mapping all TTL'd string-key encoded filenames to epoch-ms. For a Redis database with millions of TTL'd strings this is a single JSON file that must be fully parsed to restore any one key's TTL — it also cannot be streamed in the same way .bin files can. Worth a sentence in the trade-offs or the Redis section acknowledging the bound (strings_ttl.json may be large; restore tool streams it once at startup per db) or an alternative (e.g., split by prefix or embed TTL in a tiny sidecar .ttl file alongside each .bin).

6. hll/ directory is missing from the Redis section's concrete example tree

The top-level layout tree (line 144) and the prose correctly include hll/<key>.bin. But the Redis section's db_0 example (lines 363–376) does not show an hll/ entry. Add one hll/pfcount%3Auniques.bin line for consistency.

7. --stream-merge-strategy=auto-id is not in the elastickv-restore apply flag list

The trade-offs section (line 819) defines --stream-merge-strategy=auto-id but the elastickv-restore apply CLI block (lines 712–718) doesn't list it. Add it alongside --mode to avoid implementor confusion.


Previous review items — all resolved ✓

  1. ScanRange doesn't exist → replaced with ShardStore.ScanAt + BackupScanner wrapper ✓
  2. ActiveTimestampTracker no TTL → PinWithDeadline + sweeper specified ✓
  3. hll/ missing from layout tree → added ✓
  4. S3 sidecar collision → .elastickv-meta.json suffix, reserved at dump time ✓
  5. Cross-shard wait underspecified → applied_index poll in BeginBackup with 5 s deadline ✓
  6. CHECKSUMS format unspecified → sha256sum(1)-compatible; verifiable with sha256sum -c CHECKSUMS ✓
  7. ListAdaptersAndScopes undocumented → RPC sketch added ✓
  8. Redis TTL on expired keys → skip-expired default, --preserve-ttl opt-in ✓
  9. aws s3 sync uploads sidecars → --exclude '*.elastickv-meta.json' ✓
  10. DynamoDB composite vs hash-only → both layouts shown side by side ✓
  11. SQS FIFO dedup warning buried → moved to trade-offs as explicit callout ✓
  12. Redis stream merge collision → merge refuses non-empty; --stream-merge-strategy=auto-id fallback ✓

Overall: the revision is a solid step forward. Items 2 and 3 above (the renewal protocol gap and the ListAdaptersAndScopes binding) are the blockers — both are observable mismatches between the described behavior and the proposed proto that would surface as bugs on day one of Phase 1 implementation. Item 1 (the ScanAt doc comment) needs a one-liner fix in the design to prevent confusion during code review. Items 4–7 are genuinely minor.

@gemini-code-assist Bot (Contributor) left a comment


Code Review

This pull request proposes a logical backup system for elastickv, transitioning from opaque FSM snapshots to a per-adapter decomposed format using standard types like JSON and JSONL. The design ensures vendor-independent recovery and point-in-time consistency. Feedback focused on scalability and correctness, specifically regarding the performance impact of individual file storage for DynamoDB items, potential POSIX path collisions for S3 keys, and the scalability of monolithic metadata files. Reviewers also suggested implementing a TTL refresh mechanism for large backups and bounding concurrent backup registrations to prevent resource exhaustion.

`MANIFEST.json` is **the only file a restore tool must read first**.
Everything else is decoded from its on-disk path and contents.

## Read-Side Consistency
Contributor


high

The default backup_ts_ttl of 30 minutes might be too short for large-scale backups, creating a significant operational risk where the compactor retires MVCC versions still in use, resulting in a corrupted backup. The design must include a mechanism to refresh the TTL and should also detail mitigation strategies for operational risks like rolling upgrades, such as a temporary 'bridge' mode.

References
  1. When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary "bridge" or "proxy" mode.

│ └── <table-name>/
│ ├── _schema.json
│ └── items/
│ └── <pk-segment>/[<sk-segment>.]json
Contributor


medium

Storing each DynamoDB item in an individual file will cause severe performance degradation and filesystem overhead for tables with millions of items. This represents a significant operational risk. Consider adopting a bundled format like JSONL (e.g., items/data-00001.jsonl). Additionally, per repository guidelines, when a design identifies such risks, it must detail mitigation strategies like implementing a temporary 'bridge' or 'proxy' mode for rolling upgrades.

Suggested change
│ └── <pk-segment>/[<sk-segment>.]json
│ └── data-<part-id>.jsonl
References
  1. When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary "bridge" or "proxy" mode.

Comment on lines +160 to +163
recorded in `KEYMAP`).

The directory **is** the index. There is no other side-table that must be
parsed before the user can find a record.
Contributor


medium

Preserving S3 object key / separators as filesystem path separators introduces potential collisions that S3 allows but POSIX filesystems do not. For example, S3 allows both an object named path/to/obj and an object named path/to. On a standard filesystem, path/to cannot be both a directory and a regular file. The design should specify how such collisions are handled (e.g., by encoding the leaf node or using a sidecar directory for nested objects).

- **All other key segments** (DynamoDB partition/sort key bytes, Redis keys,
SQS queue names and message IDs) are encoded with the following scheme,
which is unambiguous and reversible:
- Bytes in `[A-Za-z0-9._-]` pass through.
Contributor


medium

A single KEYMAP file per adapter scope will not scale well for datasets with millions of keys, becoming a bottleneck for memory and I/O. Furthermore, as this design supports system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, or logical replication) to avoid service interruption during the transition.

References
  1. When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.


```json
{
"format_version": 1,
Contributor


medium

Using a single strings_ttl.json file to store TTLs for all Redis strings in a database is not scalable. For a large Redis instance, this file will grow excessively, leading to high memory consumption. It would be more efficient to use a newline-delimited JSON (JSONL) format or to shard the TTL data.

"preserve_sqs_visibility": false,
"include_sqs_side_records": false
},
"checksum_algorithm": "sha256",
Contributor


medium

To prevent unbounded memory growth and potential Denial of Service (DoS) through excessive BeginBackup calls, the design should specify a fixed bound on the number of concurrent active backup registrations allowed in the active_timestamp_tracker. This aligns with the general rule of applying bounds to collections that grow from external requests.

References
  1. To prevent unbounded memory growth and potential OOM issues, apply a fixed bound to collections that can grow from external requests, such as pending configuration changes. Reject new requests when the bound is reached.

HIGH:
- TTL renewal made explicit. New RenewBackup RPC; producer renews
  every ttl_ms/3 and aborts the dump on renewal failure (continuing
  past TTL would silently corrupt the artifact). backup_max_ttl_ms
  caps a single window; multi-hour dumps work via repeated renewals.

MEDIUM:
- DynamoDB scalability. Default stays per-item (the user's stated
  requirement: one record per file, named <PK>.json). Added opt-in
  --dynamodb-bundle-mode jsonl for tables where inode count is the
  binding constraint; structured warning emitted when an unbundled
  scope exceeds 1M items.
- S3 file-vs-directory collisions. Producer detects keys like
  "path/to" and "path/to/obj" coexisting and renames the shorter to
  "path/to.elastickv-leaf-data" with KEYMAP.jsonl + manifest record.
- KEYMAP scaled by switching to KEYMAP.jsonl (line-streamable).
- Redis strings_ttl.json -> strings_ttl.jsonl with one TTL record per
  line so TTL count is unbounded.
- max_active_backup_pins (default 4) caps concurrent BeginBackup
  registrations; tracker returns ErrTooManyActiveBackups, RPC surfaces
  ResourceExhausted. Prevents a misbehaving caller from holding the
  compactor open across the entire MVCC retention horizon.

Manifest fields added: s3_collision_strategy, dynamodb_layout,
max_active_backup_pins. CLI flags added: --ttl-ms, --scan-page-size,
--dynamodb-bundle-mode/--dynamodb-bundle-size, --rename-collisions.

Tests added: TestS3PathFileVsDirectoryCollision,
TestBeginBackupTooManyActiveBackups, TestRenewBackupExtendsDeadline.

@bootjp (Owner, Author) commented Apr 29, 2026

Gemini review (#706 (review)) addressed in bbd1bb8b. Summary:

HIGH (line 515) — TTL refresh for large backups

  • Added RenewBackup RPC; producer calls it every ttl_ms/3 and aborts the dump on renewal failure (continuing past TTL would silently corrupt the artifact, since the compactor would already have retired versions the in-flight scan depends on). A sketch of the renewal loop follows this list.
  • backup_max_ttl_ms (default 1h) caps a single window; multi-hour dumps work via repeated renewals.
  • pin_renewals_total metric for operator visibility.
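
A hedged Go sketch of that producer loop under the stated cadence; AdminClient and RenewBackup are the proposed RPC surface, not an existing client API.

```go
package producer

import (
	"context"
	"time"
)

// AdminClient is the assumed client surface for the proposed RPCs.
type AdminClient interface {
	RenewBackup(ctx context.Context, pinToken []byte, ttl time.Duration) error
}

// keepPinned renews every ttl/3 and aborts the whole dump on failure:
// scanning past an expired pin could read versions the compactor has
// already retired, which would silently corrupt the artifact.
func keepPinned(ctx context.Context, c AdminClient, pinToken []byte, ttl time.Duration, abort context.CancelFunc) {
	tick := time.NewTicker(ttl / 3)
	defer tick.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-tick.C:
			if err := c.RenewBackup(ctx, pinToken, ttl); err != nil {
				abort() // no MANIFEST.json gets written, so restore refuses the partial tree
				return
			}
		}
	}
}
```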

MEDIUM (line 129) — DynamoDB per-item file scalability

  • Default stays per-item (items/<pk>/<sk>.json). This is the user-stated requirement that motivates the entire proposal — "for DynamoDB, each record lives under the table directory as a <PK-named>.json file" — and what makes one-line restore scripts trivial.
  • Added opt-in --dynamodb-bundle-mode jsonl[:size=64MiB] for tables where inode count is the binding constraint (millions of items). Structured warning emitted when an unbundled scope exceeds 1M items so operators know to switch.
  • MANIFEST.json records dynamodb_layout: "per-item"|"jsonl" so restore dispatches the right reader.

MEDIUM (line 163) — S3 file-vs-directory POSIX collision

  • Producer detects keys like path/to and path/to/obj coexisting in the same bucket; the shorter key is renamed to path/to.elastickv-leaf-data with the rename in KEYMAP.jsonl and s3_collision_strategy: "leaf-data-suffix" in MANIFEST.json. Sidecar follows the renamed base.

MEDIUM (line 181) — KEYMAP scalability

  • KEYMAP → KEYMAP.jsonl (line-streamable, never loaded fully into memory).

MEDIUM (line 341) — Redis TTL file scalability

  • strings_ttl.json → strings_ttl.jsonl, one record per line.

MEDIUM (line 503) — DoS bound on concurrent backup pins

  • max_active_backup_pins (default 4) caps concurrent BeginBackup registrations; tracker returns ErrTooManyActiveBackups, RPC surfaces ResourceExhausted. Prevents a misbehaving caller from holding the compactor open across the MVCC retention horizon.
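
For illustration, a minimal Go sketch of that bound; the names (boundedPins, ErrTooManyActiveBackups) follow the thread's proposals and are not existing identifiers.

```go
package kv

import (
	"errors"
	"sync"
)

var ErrTooManyActiveBackups = errors.New("too many active backup pins")

// boundedPins is a hypothetical tracker wrapper enforcing the cap.
type boundedPins struct {
	mu      sync.Mutex
	active  map[uint64]uint64 // token -> pinned read_ts
	nextTok uint64
	max     int // max_active_backup_pins, default 4
}

// pin registers readTS unless the fixed bound is already reached;
// the admin RPC layer maps the error to gRPC ResourceExhausted.
func (b *boundedPins) pin(readTS uint64) (uint64, error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.active) >= b.max {
		return 0, ErrTooManyActiveBackups
	}
	b.nextTok++
	b.active[b.nextTok] = readTS
	return b.nextTok, nil
}
```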

New CLI flags: --ttl-ms, --scan-page-size, --dynamodb-bundle-mode, --dynamodb-bundle-size, --rename-collisions.

New tests: TestS3PathFileVsDirectoryCollision, TestBeginBackupTooManyActiveBackups, TestRenewBackupExtendsDeadline.

@claude review

@claude Bot commented Apr 29, 2026

Claude finished @bootjp's task in 3m 18s.


Review: docs/design/2026_04_29_proposed_logical_backup.md (round 3)

  • Gather context
  • Read updated design document
  • Verify Gemini review items addressed
  • Check for new issues introduced by the changes
  • Post review feedback

All six Gemini review items are addressed — the RenewBackup RPC closes the TTL-refresh gap, max_active_backup_pins is a clean DoS bound, and KEYMAP.jsonl / strings_ttl.jsonl are the right scalability fix. Two issues introduced by the Gemini changes need attention; five items from round 2 are still open.


Still open from round 2

1. ScanAt doc comment still says "NOT a globally consistent snapshot"

kv/shard_store.go:101–105 still carries the warning that ScanAt is not globally consistent and that callers must implement a cross-shard snapshot fence themselves. The design's BeginBackup applied-index poll is that fence, but the doc never directs Phase 1 implementors to update the ScanAt comment. The mismatch will surface in code review as a contradiction between the comment and the backup contract. Add a note in the "Scan primitive" section saying Phase 1 must update the ScanAt comment to: "consistent when the caller has fenced applied_index ≥ f(ts) across all groups before calling, as BeginBackup does."

2. NewBackupScanner is on ShardedCoordinator but ShardedCoordinator has no ShardStore field

The proposed signature (line 626) is:

func (c *ShardedCoordinator) NewBackupScanner(start, end []byte, ts uint64, pageSize int) BackupScanner

Confirmed by reading kv/sharded_coordinator.go:124–145: ShardedCoordinator holds engine, router, groups, clock, store store.MVCCStore — it has no *ShardStore reference. ScanAt (kv/shard_store.go:106) lives on *ShardStore, a sibling type. Placing NewBackupScanner on ShardedCoordinator requires either adding a *ShardStore field to ShardedCoordinator (non-trivial wiring in main.go) or moving the method to *ShardStore (the natural home). The design needs to commit to one path.

3. ListAdaptersAndScopesRequest still has no pin_token or read_ts field

The proto block (line 640) shows rpc ListAdaptersAndScopes(ListAdaptersAndScopesRequest) returns (ListAdaptersAndScopesResponse) and the response message is defined, but no ListAdaptersAndScopesRequest message body appears. Line 676 says the scan runs at read_ts, but without pin_token or read_ts on the request, the server has no way to identify which pin (and therefore which read_ts) to use. A new metadata object created between BeginBackup and ListAdaptersAndScopes would otherwise appear in the scope list even though BackupScanner at the pinned read_ts will never surface it.

4. hll/ directory still missing from the db_0 example tree

The top-level layout (line 144) and HLL prose (line 493) correctly include hll/<key>.bin. The per-adapter Redis section example (lines 424–440) still shows only strings/, strings_ttl.jsonl, hashes/, lists/, sets/, zsets/, streams/ with no hll/ entry.

5. --stream-merge-strategy not listed in the elastickv-restore apply CLI block

Lines 830–837 define the restore CLI flags but --stream-merge-strategy=auto-id (documented in the trade-offs section at line 938) is absent. An implementor reading the CLI spec would omit it.


New issues from the Gemini changes

6. --dynamodb-bundle-mode flag syntax inconsistency

The proposal presents two incompatible forms:

  • Line 315: --dynamodb-bundle-mode jsonl[:size=64MiB] (size as an inline colon-delimited sub-option of a single flag)
  • Lines 808–810: --dynamodb-bundle-mode per-item|jsonl + separate --dynamodb-bundle-size 64MiB flag

Pick one and be consistent. The two-flag form (--dynamodb-bundle-mode jsonl + --dynamodb-bundle-size 64MiB) is more composable and easier to document; the inline form is harder to parse in shell.

7. RenewBackupRequest.ttl_ms has no range constraint documented

BeginBackupRequest.ttl_ms carries the constraint "Range: 60s–24h. Default: 30m" and backup_max_ttl_ms bounds it at server level. RenewBackupRequest (line 663) has the same ttl_ms field but no documented range. If a buggy client passes 0 or MaxUint64 in a renewal, the server needs a defined behavior. Specify that RenewBackup applies the same range constraint as BeginBackup (same 60s floor, same backup_max_ttl_ms ceiling).


Gemini review items — all resolved ✓

  • HIGH: TTL refresh for long backups → RenewBackup RPC + backup_max_ttl_ms ceiling + pin_renewals_total metric ✓
  • MEDIUM: DynamoDB per-item file scalability → --dynamodb-bundle-mode jsonl opt-in; 1M-item warning ✓
  • MEDIUM: S3 file-vs-directory POSIX collision → .elastickv-leaf-data rename + KEYMAP.jsonl + MANIFEST.s3_collision_strategy ✓
  • MEDIUM: KEYMAP scalability → KEYMAP.jsonl (line-streamable) ✓
  • MEDIUM: strings_ttl.json scalability → strings_ttl.jsonl ✓
  • MEDIUM: DoS bound on concurrent backup pins → max_active_backup_pins = 4; ErrTooManyActiveBackups; ResourceExhausted ✓

Overall: the Gemini pass was handled cleanly. The five carry-overs from round 2 are all small doc fixes (no new design required); items 6 and 7 above are new minor nits from the Gemini changes. No blockers were introduced. Once items 2 and 3 (the NewBackupScanner placement and the ListAdaptersAndScopes binding) are addressed in the doc, the design is ready for implementation.

Round-2 carry-overs:
- Move NewBackupScanner from *ShardedCoordinator to *ShardStore.
  ShardedCoordinator (kv/sharded_coordinator.go:124) has no *ShardStore
  field; ShardStore is the natural home for ScanAt-adjacent code.
- Add a Phase-1 note that ScanAt's doc comment must be updated to
  reflect that BeginBackup's applied_index fence makes cross-shard
  ScanAt globally consistent.
- Add pin_token to ListAdaptersAndScopesRequest so the scope scan is
  bounded to the pinned read_ts; concurrently-created scopes are not
  surfaced.
- Add hll/ to the Redis db_0 example tree (top-level layout already
  had it; the per-adapter example lagged).
- Add --stream-merge-strategy / --preserve-ttl / --preserve-visibility
  to the elastickv-restore apply CLI block.

New nits from the Gemini changes:
- Pick one form for DynamoDB bundle-mode flags. Two-flag form
  (--dynamodb-bundle-mode jsonl + --dynamodb-bundle-size 64MiB)
  throughout; drop the inline jsonl:size=64MiB form.
- RenewBackupRequest.ttl_ms documents the same 60s-24h range
  constraint as BeginBackupRequest, bounded above by
  backup_max_ttl_ms; out-of-range returns InvalidArgument.

Tests added: TestRenewBackupTTLRangeValidation,
TestListAdaptersAndScopesAtPinTS.

@bootjp (Owner, Author) commented Apr 29, 2026

Round-3 review (#706 (comment)) addressed in e3b29360. Summary:

Carry-overs from round 2 (5 items)

  1. ScanAt comment — Phase-1 note added: the comment must be updated to reflect that BeginBackup's applied_index fence makes cross-shard ScanAt globally consistent.
  2. NewBackupScanner placement — moved from *ShardedCoordinator to *ShardStore. Verified: ShardedCoordinator (kv/sharded_coordinator.go:124) has no *ShardStore field, so the natural home is *ShardStore next to ScanAt.
  3. ListAdaptersAndScopesRequest — pin_token field added; the scope scan is now bounded to the pinned read_ts so concurrently-created tables/buckets are not surfaced.
  4. hll/ — added to the Redis db_0 per-adapter example tree (top-level layout already had it).
  5. --stream-merge-strategy — added to the elastickv-restore apply CLI block alongside --preserve-ttl and --preserve-visibility.

New from Gemini changes (2 items)
6. --dynamodb-bundle-mode syntax — two-flag form (--dynamodb-bundle-mode jsonl + --dynamodb-bundle-size 64MiB) is now used throughout; the inline jsonl:size=64MiB form is gone.
7. RenewBackupRequest.ttl_ms — documents the same 60s–24h constraint as BeginBackupRequest, bounded above by backup_max_ttl_ms; out-of-range returns InvalidArgument.

New tests: TestRenewBackupTTLRangeValidation, TestListAdaptersAndScopesAtPinTS.

@claude review

@claude Bot commented Apr 29, 2026

Claude finished @bootjp's task in 3m 46s.


Review: docs/design/2026_04_29_proposed_logical_backup.md (round 4)

  • Gather context
  • Read updated design document
  • Verify round-3 items addressed
  • Check for new issues
  • Post review feedback

All 7 items from round 3 are addressed cleanly. Three new gaps surfaced when I checked the Redis section against the live adapter code (adapter/redis.go).


Important

1. Stream TTLs are silently dropped

Redis streams can have TTLs set by EXPIRE / PEXPIRE / PEXPIREAT. The code stores those in the !redis|ttl| scan index via buildTTLElems (adapter/redis.go:2471), the same mechanism used for hashes, lists, sets, and zsets.

Hashes / lists / sets / zsets correctly embed expire_at_ms in their JSON files. But the stream format (lines 482–492) is:

{"id": "1714400000000-0", "fields": {"event": "login", "user": "alice"}}
{"_meta": true, "length": 2, "last_ms": 1714400000001, "last_seq": 0}

No expire_at_ms anywhere. If a user runs PEXPIREAT mystream 1735689600000, that TTL is stored in the !redis|ttl| scan index but the dump format has nowhere to put it, so restore silently creates a stream with no expiry.

Fix: add expire_at_ms to the _meta line (null when no TTL, epoch-ms when set):

{"_meta": true, "length": 2, "last_ms": 1714400000001, "last_seq": 0, "expire_at_ms": null}


2. HLL TTLs are silently dropped

HLL keys (!redis|hll|<key>) are reported as redisTypeString by the adapter and their TTL is kept in the legacy !redis|ttl| scan index — confirmed by the comment at adapter/redis.go:2319–2320:

// HLL is reported as redisTypeString but stores its payload under
// !redis|hll|<key>; keep TTL in the legacy scan index via buildTTLElems.

The dump format writes hll/<key>.bin as raw bytes with no TTL mechanism. If a user sets a TTL on an HLL key (PEXPIREAT myhll 1735689600000), the dump captures the sketch bytes but discards the expiry.

Fix: add a hll_ttl.jsonl sidecar at the db_N root (same structure as strings_ttl.jsonl):

{"key":"pfcount%3Auniques","expire_at_ms":1735689600000}

Or, alternatively, document the known limitation explicitly and add TestHLLTTLDroppedWarning to the P1 test plan.

3. XADD NOMKSTREAM does not implement "skip-existing" semantics for stream entries

Line 974 describes --mode skip-existing for streams as:

XADD NOMKSTREAM only entries whose ID is not already present

NOMKSTREAM means "do not create the stream if it does not exist" — it has nothing to do with skipping entries by ID. If an entry with a matching or higher ID already exists, XADD returns ERR regardless of NOMKSTREAM. There is no Redis command that skips an explicit ID if it already exists; a correct implementation requires XRANGE <key> <id> <id> COUNT 1 before each XADD, or accepting the ERR and treating it as "already exists." The current description is semantically incorrect and would mislead a Phase 2 implementor.

Fix: replace the description with the actual semantics — e.g., "For each entry: try XADD <id>; treat ERR The ID specified in XADD is equal or smaller as 'already present' and skip." Or, acknowledge that true skip-existing for streams requires a XRANGE probe per entry and document the O(N) cost.
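
A hedged Phase 2 sketch of the corrected mechanism using go-redis (github.com/redis/go-redis/v9); matching on the server's error string is the pragmatic check, since the client exposes no typed error for this case.

```go
package restore

import (
	"context"
	"strings"

	"github.com/redis/go-redis/v9"
)

// restoreStreamEntry attempts XADD with the explicit original ID and
// treats the server's "equal or smaller" error as "already present":
// skip the entry instead of aborting the restore.
func restoreStreamEntry(ctx context.Context, rdb *redis.Client, key, id string, fields map[string]interface{}) error {
	err := rdb.XAdd(ctx, &redis.XAddArgs{
		Stream: key,
		ID:     id,
		Values: fields,
	}).Err()
	if err != nil && strings.Contains(err.Error(), "equal or smaller") {
		return nil // this entry (or a newer one) already exists in the stream
	}
	return err
}
```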


Minor

4. --begin-backup-deadline is missing from the producer CLI block

Line 778 says the applied_index poll deadline is "surfaced as --begin-backup-deadline on the CLI," but the elastickv-backup dump flag list (lines 833–849) does not include it. Add it alongside --ttl-ms and --scan-page-size.


Round-3 items — all resolved ✓

  1. ScanAt comment still says "NOT consistent" → Phase-1 note added at lines 643–654 directing the comment update ✓
  2. NewBackupScanner on wrong type → moved to *ShardStore with explicit rationale (lines 629–641) ✓
  3. ListAdaptersAndScopesRequest has no pin_token → bytes pin_token = 1 added (line 707); scope scan bounded to read_ts ✓
  4. hll/ missing from db_0 example tree → added (lines 441–443) ✓
  5. --stream-merge-strategy missing from CLI block → added to restore CLI (line 878) ✓
  6. --dynamodb-bundle-mode syntax inconsistency → two-flag form throughout (lines 315, 847–848) ✓
  7. RenewBackupRequest.ttl_ms range undocumented → 60s–24h constraint + InvalidArgument on out-of-range (lines 693–695) ✓

Overall: the design is in very good shape. Items 1 and 2 above (stream and HLL TTL gaps) are the blockers — they are confirmed by the live adapter/redis.go code and would result in silent data loss for any key with an expiry. Item 3 (the XADD NOMKSTREAM semantic error) needs correction before Phase 2 implementation begins or the restore tool will behave incorrectly for skip-existing on streams. Item 4 is a one-liner doc fix.

Important (verified against adapter/redis.go):
- Stream TTLs were silently dropped. Streams can carry TTL via
  EXPIRE/PEXPIRE/PEXPIREAT, kept in !redis|ttl| by buildTTLElems
  (adapter/redis.go:2471). Added expire_at_ms to the _meta line so
  restore re-applies the original epoch.
- HLL TTLs were silently dropped. HLL is reported as redisTypeString
  but stores TTL in the legacy scan index (adapter/redis.go:2319-2320
  comment confirms it). Added hll_ttl.jsonl sidecar at the db_<n>
  root, same shape as strings_ttl.jsonl. Sidecar is omitted when no
  HLL key has a TTL.
- Fixed XADD NOMKSTREAM semantic error in the skip-existing
  description. NOMKSTREAM only prevents stream creation; it does
  not skip entries by ID. Replaced with the correct mechanism:
  attempt XADD and treat the "ID equal or smaller" ERR as
  "already present, skip."

Minor:
- Added --begin-backup-deadline 5s to the elastickv-backup dump CLI
  block (the prose mentioned it but the flag list lagged).

Tests added: TestRedisStreamTTLRoundTrip, TestRedisHLLTTLRoundTrip,
TestRedisStreamSkipExistingHandlesERR.

@bootjp (Owner, Author) commented Apr 29, 2026

Round-4 review (#706 (comment)) addressed in 797d5711. Both data-loss gaps verified against adapter/redis.go.

Important (data-loss gaps)

  1. Stream TTLs silently dropped — verified at adapter/redis.go:2471 (buildTTLElems). Streams can have TTL via PEXPIREAT, kept in !redis|ttl|. Added expire_at_ms to the _meta line:
    {"_meta": true, "length": 2, "last_ms": 1714400000001, "last_seq": 0, "expire_at_ms": null}
    
  2. HLL TTLs silently dropped — verified at adapter/redis.go:2319-2320 comment. Added hll_ttl.jsonl sidecar at the db_<n> root with the same shape as strings_ttl.jsonl. Omitted when no HLL key has a TTL.
  3. XADD NOMKSTREAM semantic error — replaced with the correct mechanism: attempt XADD <id> and treat ERR The ID specified in XADD is equal or smaller as "already present, skip." NOMKSTREAM does not skip-by-ID — it only suppresses stream creation. Documented O(N) cost.

Minor

  1. --begin-backup-deadline 5s added to the elastickv-backup dump CLI block (the prose mentioned it but the flag list lagged).

New tests: TestRedisStreamTTLRoundTrip, TestRedisHLLTTLRoundTrip, TestRedisStreamSkipExistingHandlesERR.

@claude review

@claude Bot commented Apr 29, 2026

Claude finished @bootjp's task in 4m 11s.


Review: docs/design/2026_04_29_proposed_logical_backup.md (round 5)

  • Gather context
  • Read updated design document
  • Verify round-4 items addressed
  • Check for new issues
  • Post review feedback

All four round-4 items are addressed cleanly. One new structural gap surfaced from inspecting main.go and kv/active_timestamp_tracker.go; one minor gap from the Gemini changes is still open.


Important

1. BeginBackup pin is node-local; multi-node deployments are not protected

kv/active_timestamp_tracker.go is a plain in-memory map per process. In main.go:294–335, a single readTracker := kv.NewActiveTimestampTracker() is created per node and wired exclusively to that node's compactor (WithFSMCompactorActiveTimestampTracker(readTracker)). There is one tracker per OS process; different nodes have completely independent trackers.

When BeginBackup calls PinWithDeadline(read_ts, …), it registers the pin in the tracker of the node that received the admin RPC. That tracker gates only that node's compactor. In a multi-shard deployment, the BackupScanner reads data from the leaders of each group — which are likely on different nodes. Those nodes' compactors are entirely unaware of the pin and free to compact past read_ts. The proposal's cross-shard consistency guarantee breaks silently in this scenario.

The fix is a fan-out: BeginBackup must propagate a pin to the leader of every shard group it will scan, not only to the local node. One approach is a new internal-only AdminServer.PinBackupTs(pin_token, read_ts, ttl_ms) RPC that the BeginBackup handler calls on each group leader before returning to the producer. The design should specify how this fan-out works and what happens when a leader for one group is unreachable (fail BeginBackup entirely, or treat it as a partial backup?).
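
One way the fan-out could look, sketched in Go with errgroup; PinBackupTs / ReleaseBackupTs are the proposed internal RPCs, and the fail-everything-and-roll-back policy is one of the two options the review asks the design to choose between.

```go
package admin

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// groupLeader is the assumed per-group RPC surface; releases are
// idempotent so a rollback may safely double-release.
type groupLeader interface {
	PinBackupTs(ctx context.Context, token []byte, readTS, ttlMs uint64) error
	ReleaseBackupTs(ctx context.Context, token []byte) error
}

// pinAllGroups pins every shard-group leader before BeginBackup returns.
// If any group fails, the pins that did land are rolled back and the
// error is surfaced (e.g., as Unavailable) so the producer retries cleanly.
func pinAllGroups(ctx context.Context, leaders []groupLeader, token []byte, readTS, ttlMs uint64) error {
	g, gctx := errgroup.WithContext(ctx)
	for _, l := range leaders {
		l := l
		g.Go(func() error { return l.PinBackupTs(gctx, token, readTS, ttlMs) })
	}
	if err := g.Wait(); err != nil {
		// Best-effort rollback so a half-pinned cluster is not left behind.
		for _, l := range leaders {
			_ = l.ReleaseBackupTs(context.Background(), token)
		}
		return err
	}
	return nil
}
```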


Minor

2. _id_remap.jsonl path is unspecified

Line 1009–1010 references a _id_remap.jsonl log produced by --stream-merge-strategy=auto-id, but doesn't say where the file lands. Is it per-stream alongside streams/<key>.jsonl, at the db_<n>/ root, or in a separate restore-side artifacts directory? An implementor of Phase 2 needs to know the canonical path so the log is discoverable without elastickv.

3. TestRedisStreamSkipExistingHandlesERR test description says "aborting the dump"

Line 1120: "tolerates ERR The ID specified in XADD is equal or smaller without aborting the dump" — this is a restore-side test; it should say "without aborting the restore."

4. KEYMAP.jsonl absent from the layout trees

The encoding section (lines 196–204) says KEYMAP.jsonl lives at each adapter scope root, and the S3 path-collision section (line 219) references it — but it doesn't appear in the top-level layout tree (lines 121–150) or in any of the per-adapter format trees. Adding one representative entry per adapter (dynamodb/<table>/KEYMAP.jsonl, s3/<bucket>/KEYMAP.jsonl, etc.) would make the trees self-consistent and help implementors understand where to write the file.


Round-4 items — all resolved ✓

  1. Stream TTLs silently dropped → expire_at_ms in _meta line (line 491); prose explains !redis|ttl| origin ✓
  2. HLL TTLs silently dropped → hll_ttl.jsonl sidecar at db_<n> root (lines 507–513); top-level tree updated ✓
  3. XADD NOMKSTREAM semantic error → replaced with attempt-XADD-and-treat-error approach; NOMKSTREAM clarified (lines 992–1001) ✓
  4. --begin-backup-deadline missing from CLI block → added at line 862 ✓

Overall: this design is nearly ready. Item 1 (the node-local pin gap) is the only blocker — it is a correctness issue for any multi-node deployment and the fix requires adding a fan-out mechanism to the BeginBackup / RenewBackup / EndBackup RPC contract. Items 2–4 are one-liners.

Important (correctness gap, verified against main.go):
- BeginBackup pin was node-local. Each node owns its own
  ActiveTimestampTracker (main.go:294) and only its compactor
  consults it (main.go:335). A pin on the receiving node would not
  protect group leaders on other nodes; their compactors would be
  free to retire MVCC versions at read_ts mid-dump.

  Fix: pins propagate through each Raft group's log as
  BackupPin / BackupExtend / BackupRelease FSM commands. Every replica
  applies them to its local tracker, so the pin set is replicated and
  durable across leader changes (a new leader inherits the pin from
  the same log entry it applied). BeginBackup fans the propose out
  per group; if any group fails to commit within
  --begin-backup-deadline, BeginBackup proposes BackupRelease on
  groups that did commit and returns Unavailable so retry is clean.
  RenewBackup / EndBackup fan out the same way; releases are
  idempotent so partial state is tolerated.

Minor:
- Specified _id_remap.jsonl path: redis/db_<n>/streams/<key>._id_remap.jsonl
  (next to source) so tools can find it on disk.
- TestRedisStreamSkipExistingHandlesERR description: "the dump"
  -> "the restore" (it's a restore-side test).
- Added KEYMAP.jsonl to every per-adapter scope in the layout tree;
  omitted when empty.

Tests added: TestBeginBackupPinFanOutAllNodes,
TestBeginBackupPinSurvivesLeaderChange,
TestBeginBackupGroupUnreachable.

@bootjp (Owner, Author) commented Apr 29, 2026

Round-5 review (#706 (comment)) addressed in 2c00d560.

Important (correctness gap)

  1. BeginBackup pin was node-local — verified at main.go:294,335. Each node owns its own ActiveTimestampTracker and only its compactor consults it; a pin on the receiving node does not protect group leaders on other nodes from compacting past read_ts.

    Fix: pins propagate through each Raft group's log as BackupPin{pin_id, read_ts, deadline} / BackupExtend / BackupRelease FSM commands. Every replica applies them to its local tracker, so the pin set is replicated and durable across leader changes (a new leader inherits the pin from the same log entry it applied — no separate handoff). BeginBackup fans the propose out per group and waits for commit; if any group fails to commit within --begin-backup-deadline, BackupRelease is proposed on groups that did commit and Unavailable is returned so retry is clean. RenewBackup / EndBackup fan out the same way; BackupRelease is idempotent so partial state is tolerated.

Minor

  1. _id_remap.jsonl path — specified as redis/db_<n>/streams/<key>._id_remap.jsonl (next to the source) so tools can find it on disk.
  2. Test description — TestRedisStreamSkipExistingHandlesERR: "the dump" → "the restore".
  3. KEYMAP.jsonl — added to every per-adapter scope in the layout tree (omitted when empty).

New tests: TestBeginBackupPinFanOutAllNodes, TestBeginBackupPinSurvivesLeaderChange, TestBeginBackupGroupUnreachable.

@claude review

@claude Bot commented Apr 29, 2026

Claude finished @bootjp's task in 5m 13s.


Review: docs/design/2026_04_29_proposed_logical_backup.md (round 6)

  • Gather context
  • Read updated design document
  • Verify round-5 items addressed
  • Check Raft FSM pin mechanism against source files
  • Check for new issues
  • Post review feedback

All four round-5 items are addressed. One important structural gap surfaced from reading kv/fsm.go directly; one important correctness issue from fsm.Restore; two minor doc issues follow.


Important

1. Phase 1 scope is missing the FSM plumbing for BackupPin / BackupExtend / BackupRelease

The Raft propagation described in the "Cluster-wide propagation" section requires three things that none of the Phase 1 scope files (lines 1085–1109) mention:

  • New entry type dispatch in kv/fsm.go:Apply (kv/fsm.go:60). Today Apply dispatches on exactly two byte tags:

    if data[0] == raftEncodeHLCLease { return f.applyHLCLease(data[1:]) }  // 0x02
    // else: fall through to decodeRaftRequests

    Adding BackupPin / BackupExtend / BackupRelease requires new byte tags (raftEncodeBackupPin = 0x03, etc.) and corresponding applyBackupPin / applyBackupExtend / applyBackupRelease handlers — analogous to how raftEncodeHLCLease / applyHLCLease / hlcLeasePayloadLen were introduced.

  • New *ActiveTimestampTracker field on kvFSM (kv/fsm.go:26). Today the struct holds store, log, and hlc. applyBackupPin needs to call tracker.PinWithDeadline(read_ts, deadline) — the same pattern as hlc *HLC wired via NewKvFSMWithHLC. The design should note that Phase 1 extends the FSM constructor (e.g., NewKvFSMWithHLCAndTracker(store, hlc, tracker)) so the tracker and the FSM share one instance, exactly as the HLC sharing invariant is documented in CLAUDE.md.

  • Serialization format for the new entry types. The design describes the logical structure (BackupPin{pin_id, read_ts, deadline}) but says nothing about the wire encoding — proto message, hand-coded binary (like the 8-byte HLC entry), or something else. Without specifying this, two independent implementors will produce incompatible entry layouts.

Add kv/fsm.go, a new kv/backup_codec.go (or equivalent), and the FSM constructor extension to the Phase 1 scope list. Without calling these out explicitly, the Raft pin propagation can be implemented at the admin-RPC layer and the TTL-extension layer while the FSM dispatch is silently left as a stub — and the propagation would appear to work in single-node tests while failing in multi-node deployments.
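
For concreteness, a minimal sketch of that dispatch shape, assuming the tag values and handler names named above; the applyEntry wrapper (standing in for the body of Apply), the stub bodies, and the error text are illustrative, not the committed design:

package kv

import "errors"

const (
    raftEncodeHLCLease      byte = 0x02 // existing
    raftEncodeBackupPin     byte = 0x03 // proposed
    raftEncodeBackupExtend  byte = 0x04 // proposed
    raftEncodeBackupRelease byte = 0x05 // proposed
)

type ActiveTimestampTracker struct{ /* elided */ }

type kvFSM struct {
    tracker *ActiveTimestampTracker // new field, shared with the coordinator
}

// Stubs standing in for the real handlers.
func (f *kvFSM) applyHLCLease(b []byte) error      { return nil } // existing
func (f *kvFSM) applyBackupPin(b []byte) error     { return nil } // decode, then pin the tracker
func (f *kvFSM) applyBackupExtend(b []byte) error  { return nil } // refresh a pin's deadline
func (f *kvFSM) applyBackupRelease(b []byte) error { return nil } // idempotent: unknown pin_id is a no-op
func (f *kvFSM) applyRequests(b []byte) error      { return nil } // existing decodeRaftRequests path

// applyEntry extends the leading-byte dispatch in the same shape as the
// existing raftEncodeHLCLease guard.
func (f *kvFSM) applyEntry(data []byte) error {
    if len(data) == 0 {
        return errors.New("empty raft entry")
    }
    switch data[0] {
    case raftEncodeHLCLease:
        return f.applyHLCLease(data[1:])
    case raftEncodeBackupPin:
        return f.applyBackupPin(data[1:])
    case raftEncodeBackupExtend:
        return f.applyBackupExtend(data[1:])
    case raftEncodeBackupRelease:
        return f.applyBackupRelease(data[1:])
    }
    return f.applyRequests(data)
}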

2. kvFSM.Restore wipes in-memory pins — Raft snapshot installs during backup are unprotected

kv/fsm.go:264–287 (Restore) installs a Raft snapshot by replacing the Pebble store. It does not touch the in-memory ActiveTimestampTracker. Since the tracker is not part of the snapshot, any replica that installs a Raft snapshot during the backup window loses its BackupPin registration. That replica's compactor (Oldest() → not bounded by the lost pin) is immediately free to compact past read_ts.

This matters whenever a replica falls behind far enough to need a snapshot — which can happen during a long backup if a follower crashes and restarts, or if a network partition heals after the leader has compacted the log past the BackupPin entry index. The design currently says "a newly-elected leader applies the same entry from the log and inherits the pin" (line 773), but that is only true if the log still contains the BackupPin entry. After the leader compacts the log (triggered by a snapshot it sent to a follower), the BackupPin entry is gone from the log, and a future replica catching up will never replay it.

The safe bound is: backup_duration + compaction_lag < mvcc_retention_horizon. The design should state this invariant explicitly and note the failure mode: if a snapshot-installing replica's compactor becomes unprotected and the backup duration has already exceeded the retention window, ScanAt(at_ts=read_ts) on that replica may return versions that have been compacted away — which surfaces as ErrKeyNotFound or silently missing entries rather than a readable error.

A practical mitigation (lighter than snapshotting the tracker state): on Raft snapshot installation (Restore), the FSM could re-read active BackupPin entries from the cluster by issuing a lightweight metadata-only request. Alternatively, the design could document that snapshot installations invalidate a backup and BeginBackup should refuse to proceed if any group is far enough behind to need a snapshot in the near future. Either way, this corner case needs an explicit mention.


Minor

3. --begin-backup-deadline is used for two distinct waits with different natural timeouts

The flag is referenced for both:

  • Step 2 (lines 815–819): poll until applied_index ≥ commit_index_at_pin — can legitimately take seconds for a lagging shard
  • Step 3 (lines 838–839): wait for BackupPin Raft commit — typically one round-trip (milliseconds in a healthy cluster)

A 5 s deadline is reasonable for the applied_index poll (a lagging shard that won't catch up in 5 s probably won't catch up soon) but is excessively generous for the BackupPin commit (if Raft can't commit in 1–2 s something is structurally wrong). Sharing one flag for both waits is not incorrect, but the doc should call out which wait is the binding one in practice so operators who tune the flag know what they're adjusting.

4. RenewBackup fan-out could abort on transient leader elections

Step 5 (lines 850–857) says the producer "logs a critical alert and aborts the dump" if a RenewBackup call fails. RenewBackup proposes BackupExtend through each group's ShardGroup.Propose, which requires the current leader. A leader election (common during any rolling restart or node failure) causes Propose to fail transiently — the new leader is elected in < 1 s under etcd/raft defaults. Aborting the dump on the first such failure is overly aggressive; the implementation should retry per-group at least 2–3 times with 500 ms backoff before declaring failure. The design should include this as implementation guidance so the retry is intentional rather than an afterthought discovered during first production use.


Round-5 items — all resolved ✓

| # | Item | Verdict |
| --- | --- | --- |
| 1 | Phase 1 missing FSM scope | FSM plumbing described in prose (lines 762–864) but Phase 1 file list still incomplete — covered as item #1 above |
| 1 | ScanAt doc comment | Phase 1 note to update comment (lines 664–675) ✓ |
| 2 | NewBackupScanner on wrong type | Moved to *ShardStore with rationale (lines 639–663) ✓ |
| 3 | ListAdaptersAndScopesRequest missing pin_token | bytes pin_token = 1 present (line 728) ✓ |
| 4 | hll/ missing from db_0 example tree | Added (lines 447–448) ✓ |
| 5 | --stream-merge-strategy missing from CLI block | Listed (line 948) ✓ |
| | Node-local pin gap (Important) | BackupPin via Raft FSM commands; prose complete, FSM file list gap flagged as item #1 above |
| | _id_remap.jsonl path (Minor) | redis/db_<n>/streams/<key>._id_remap.jsonl (line 1063) ✓ |
| | Test description typo (Minor) | "the restore" (line 1177) ✓ |
| | KEYMAP.jsonl in layout trees (Minor) | Added per-scope in top-level tree (lines 128–152) ✓ |

Overall: the proposal is in excellent shape. Item #2 (the Restore-wipes-pins gap) is the one correctness issue that would bite in production: a 2-hour backup on a cluster with a 1-hour retention horizon is exactly where the hidden compaction window appears. Item #1 (FSM file list) is a one-paragraph addition to Phase 1 scope that will save the Phase 1 implementor a debugging session. Items #3 and #4 are implementation guidance nits. After these two important items are addressed the design is ready to ship.

Important (verified against kv/fsm.go):
- Phase 1 scope was missing the FSM plumbing for BackupPin /
  BackupExtend / BackupRelease entries. Added explicit items:
  - new byte tags 0x03/0x04/0x05 in kv/fsm.go alongside existing
    raftEncodeHLCLease=0x02 (kv/fsm.go:116);
  - new applyBackupPin/applyBackupExtend/applyBackupRelease
    handlers, dispatched from kvFSM.Apply (kv/fsm.go:60) in the
    same shape as applyHLCLease;
  - new *ActiveTimestampTracker field on kvFSM (kv/fsm.go:26),
    wired via NewKvFSMWithHLCAndTracker so the FSM and the
    coordinator share one tracker (analogous to the *HLC sharing
    invariant in CLAUDE.md);
  - new kv/backup_codec.go with the 33/25/17-byte fixed binary
    layouts for the three entry types.

- kvFSM.Restore (kv/fsm.go:264-287) wipes in-memory pins. A replica
  that installs a Raft snapshot during the backup window loses its
  BackupPin (snapshot only carries the Pebble store, not the
  tracker). After the leader compacts the log past the BackupPin
  entry, that entry is gone; a follower catching up via snapshot
  never replays it.

  Added explicit safe-bound invariant:
    backup_duration + worst_case_compaction_lag < mvcc_retention_horizon
  BeginBackup enforces a soft form via --snapshot-headroom-entries
  (default 10000). Producer's per-scope expected_keys heuristic
  catches mid-dump snapshot-induced shortfalls. Heavier mitigations
  (snapshotting tracker state, sibling-pin queries) are explicitly
  Phase 3, not Phase 1.

Minor:
- --begin-backup-deadline binding constraint clarified: it is the
  applied_index poll for lagging shards, not the BackupPin commit
  (which is sub-second on a healthy cluster).
- RenewBackup retries per-group BackupExtend up to 3 times with
  500ms backoff before aborting; transient leader elections (<1s
  under etcd/raft defaults) no longer abort otherwise-healthy dumps.

Tests added: TestBackupPinFSMCodecRoundTrip,
TestRestoreWipesLocalPins, TestBeginBackupRefusesNearSnapshotThreshold,
TestRenewBackupRetriesLeaderElection.
@bootjp
Owner Author

bootjp commented Apr 29, 2026

Round-6 review (#706 (comment)) addressed in 020e0c4c. Both important items verified against kv/fsm.go.

Important

  1. FSM plumbing missing from Phase 1 scope — added the four concrete items the implementor needs:

    • New byte tags 0x03/0x04/0x05 in kv/fsm.go alongside raftEncodeHLCLease = 0x02.
    • New applyBackupPin / applyBackupExtend / applyBackupRelease handlers, dispatched from kvFSM.Apply in the same shape as applyHLCLease.
    • New *ActiveTimestampTracker field on kvFSM wired via NewKvFSMWithHLCAndTracker(store, hlc, tracker) — analogous to the *HLC sharing invariant in CLAUDE.md. Coordinator and FSM share one tracker.
    • New kv/backup_codec.go with fixed 33 / 25 / 17-byte binary layouts for the three entry types (matching the HLC-lease style, not proto, to keep the apply path light); these layouts are sketched after this list.
  2. kvFSM.Restore wipes in-memory pins — added an explicit safe-bound invariant:

    backup_duration + worst_case_compaction_lag < mvcc_retention_horizon

    BeginBackup enforces a soft form via --snapshot-headroom-entries (default 10000): if any group's appliedIndex - firstIndex is below this margin, BeginBackup returns FailedPrecondition rather than starting a dump that risks the snapshot-installation path. The producer's per-scope expected_keys heuristic catches mid-dump snapshot-induced shortfalls. Heavier mitigations (snapshotting tracker state, sibling-pin queries on Restore) are explicitly Phase 3 only — Phase 1's contract is "your dump completes within the retention horizon."
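
For item 1's codec, a sketch of those fixed layouts. The field order (pin_id, read_ts, deadline), the 16-byte pin_id, and big-endian encoding are assumptions; only the tag bytes and the 33 / 25 / 17-byte totals come from the list above.

package kv

import "encoding/binary"

// encodeBackupPin builds the 33-byte entry:
// tag (1) + pin_id (16) + read_ts (8) + deadline (8).
func encodeBackupPin(pinID [16]byte, readTS, deadlineUnixMs uint64) []byte {
    b := make([]byte, 33)
    b[0] = 0x03 // raftEncodeBackupPin
    copy(b[1:17], pinID[:])
    binary.BigEndian.PutUint64(b[17:25], readTS)
    binary.BigEndian.PutUint64(b[25:33], deadlineUnixMs)
    return b
}

// encodeBackupExtend builds the 25-byte entry: tag + pin_id + new deadline.
func encodeBackupExtend(pinID [16]byte, deadlineUnixMs uint64) []byte {
    b := make([]byte, 25)
    b[0] = 0x04 // raftEncodeBackupExtend
    copy(b[1:17], pinID[:])
    binary.BigEndian.PutUint64(b[17:25], deadlineUnixMs)
    return b
}

// encodeBackupRelease builds the 17-byte entry: tag + pin_id.
func encodeBackupRelease(pinID [16]byte) []byte {
    b := make([]byte, 17)
    b[0] = 0x05 // raftEncodeBackupRelease
    copy(b[1:17], pinID[:])
    return b
}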

Minor

  1. --begin-backup-deadline binding-constraint clarified: the applied_index poll for lagging shards is the binding wait; BackupPin commit is sub-second on a healthy cluster. Operators tuning the flag are adjusting tolerance for shard lag.
  2. RenewBackup retries BackupExtend per-group up to 3 times with 500ms backoff before aborting. Transient leader elections (<1s under etcd/raft defaults) no longer kill otherwise-healthy dumps.

New tests: TestBackupPinFSMCodecRoundTrip, TestRestoreWipesLocalPins, TestBeginBackupRefusesNearSnapshotThreshold, TestRenewBackupRetriesLeaderElection.

@claude review

@claude

claude Bot commented Apr 29, 2026

Claude finished @bootjp's task in 7m 13s


Review: docs/design/2026_04_29_proposed_logical_backup.md (round 7)

  • Gather context
  • Read updated design document
  • Verify round-6 items addressed
  • Check referenced source files
  • Post review feedback

All four round-6 items are addressed cleanly. Two structural issues surfaced from reading kv/fsm.go, internal/raftengine/engine.go, and the spec against the --snapshot-headroom-entries logic; two minor doc inconsistencies follow.


Important

1. --snapshot-headroom-entries check is inverted

Lines 802–810 specify the guard as:

refuses to start if any group's appliedIndex - firstIndex is below a configurable margin (--snapshot-headroom-entries, default 10000)

appliedIndex - firstIndex is the number of log entries accumulated since the last Raft snapshot (firstIndex ≈ snapshotIndex + 1). With etcd/raft's default SnapshotCount = 10000, this value ranges from ~0 (just snapshotted) to ~10000 (snapshot imminent). The check "refuse if < 10000" therefore:

  • Refuses when appliedIndex - firstIndex = 100 — the cluster just snapshotted; the next snapshot is ~9900 entries away. This is the safest condition to start a backup.
  • Allows when appliedIndex - firstIndex = 10001 — which can only happen transiently at the instant the snapshot is triggered (the counter resets to ~0 after the snapshot). This is the most dangerous condition.

With SnapshotCount = 10000 and snapshot_headroom_entries = 10000, the check almost always refuses (the counter is almost always below 10000). The same paragraph says the check catches clusters "running close to its snapshot threshold" — but the comparison direction says the opposite: it catches clusters that are far from the threshold.

The correct metric and direction:

allow if: SnapshotCount - (appliedIndex - firstIndex) >= snapshot_headroom_entries
refuse if: SnapshotCount - (appliedIndex - firstIndex) < snapshot_headroom_entries

This refuses when fewer than headroom entries remain before the next snapshot — which is what the prose describes. It requires exposing or configuring SnapshotCount in the backup CLI (it's a Raft engine configuration parameter, not currently in the raftengine.Status struct).

Alternatively, replace the check with a time-based estimate: refuse if estimated_backup_duration > (retention_horizon - elapsed_time_since_last_snapshot). This avoids needing SnapshotCount but requires an estimated dump time, which is also uncertain.

As written, the check would produce operationally confusing FailedPrecondition errors when the cluster is in the healthiest state for a backup, and silently allow backups right before a snapshot fires.

2. expected_keys heuristic is referenced but never defined

Lines 817–820 and TestRestoreWipesLocalPins (line 1245) both rely on "the producer's per-scope expected_keys heuristic" to detect compaction-induced scan shortfall:

the producer surfaces the inconsistency as a ScanAt returning fewer keys than the applied_index baseline implied. The producer's per-scope expected_keys heuristic … catches it and fails the dump rather than emitting a corrupted artifact.

The document never specifies:

  1. How expected_keys is obtained. An applied_index value does not imply a key count — it is a log index, not a cardinality estimate. Obtaining a key count requires either a separate COUNT scan (same cost as the backup scan itself) or a shard-level cardinality estimate from the route catalog (which doesn't currently exist).
  2. What shortfall threshold triggers failure. Any key fewer than expected? 99%? A threshold that varies by scope size?
  3. How the heuristic distinguishes legitimate sparse ranges from compaction-induced gaps. An adapter scope with few keys looks identical to one that had keys compacted away.

Without this heuristic being defined, TestRestoreWipesLocalPins has no clear pass/fail criterion, and the safety claim ("fails the dump rather than emitting a corrupted artifact") is unverifiable. This is the only mechanism in Phase 1 that catches the snapshot-invalidation failure mode — if it is undefined, the corruption is silent.

At minimum, the design should specify the data source for expected_keys (even if approximate) and the comparison logic.


Minor

3. _long_keys.jsonl in the encoding section conflicts with KEYMAP.jsonl everywhere else

Line 194: "the full original bytes are recorded in <adapter>/<scope>/_long_keys.jsonl."

Everywhere else the file is called KEYMAP.jsonl: the top-level layout tree (lines 128–134), the KEYMAP description (lines 200–208), and TestLongKeySHA256Fallback (line 1266, which says "KEYMAP records the original"). An implementor of Phase 1 reading the encoding section in isolation would create a _long_keys.jsonl file; an implementor reading the layout section would create KEYMAP.jsonl. The two files would be out of sync.

Replace _long_keys.jsonl in line 194 with KEYMAP.jsonl.

4. firstIndex is not a field in raftengine.Status

Line 803: BeginBackup reads "each group's Status.AppliedIndex and firstIndex."

raftengine.Status (internal/raftengine/engine.go:53–69) has AppliedIndex, CommitIndex, LastLogIndex, and LastSnapshotIndex — but no FirstIndex or firstIndex field. The closest available proxy is LastSnapshotIndex (≈ firstIndex - 1). Change the reference to use Status.LastSnapshotIndex and update the check formula: appliedIndex - (lastSnapshotIndex + 1).

5. BackupPin fan-out parallelism not specified

Step 3 (lines 875–894) describes BeginBackup proposing a BackupPin entry "through every group involved in the dump" and waiting for each to commit. For a cluster with 100 shard groups, sequential proposal would mean 100 serial Raft round-trips before the backup starts — potentially 10–20 seconds. The design does not say whether the fan-out is sequential or parallel.

Add one sentence: "proposals are issued to all groups concurrently (one goroutine per group), and the BeginBackup handler blocks until all have committed or the deadline fires." This is analogous to how ListAdaptersAndScopes should parallelize its per-adapter prefix scans.


Round-6 items — all resolved ✓

| # | Item | Verdict |
| --- | --- | --- |
| 1 | FSM plumbing missing from Phase 1 scope | kv/fsm.go byte tags, apply handlers, *ActiveTimestampTracker field, kv/backup_codec.go — all added to Phase 1 scope (lines 1152–1181) ✓ |
| 2 | kvFSM.Restore wipes pins | Safe-bound invariant documented; --snapshot-headroom-entries guard added; expected_keys heuristic referenced (though undefined — item #2 above) ✓ |
| 3 | --begin-backup-deadline binding constraint | Clarified: applied_index poll for lagging shards is the binding wait (lines 864–869) ✓ |
| 4 | RenewBackup retry on leader election | 3 retries with 500 ms backoff before abort (lines 906–917) ✓ |

Overall: the design is in excellent shape for implementation. Items 1 and 2 above are the blockers — the snapshot-headroom check will produce operational confusion if shipped inverted, and the expected_keys heuristic is the only safety net for the snapshot-invalidation corner case. Items 3–5 are one- or two-sentence additions. After those are addressed, the proposal is ready to merge and Phase 1 implementation can begin.

Important (verified against internal/raftengine/engine.go and
internal/raftengine/etcd/engine.go):
- --snapshot-headroom-entries check was inverted. Replaced with the
  correct formula:
    entries_since_last_snapshot = AppliedIndex - LastSnapshotIndex
    remaining_headroom          = SnapshotEvery - entries_since_last_snapshot
    refuse if: remaining_headroom < --snapshot-headroom-entries
  (raftengine.Status has LastSnapshotIndex not FirstIndex, and
  defaultSnapshotEvery = 10000 is the per-engine snapshot trigger.)
  A freshly-snapshotted cluster has the largest remaining headroom
  and is now correctly allowed; the check refuses backups that would
  trigger the snapshot-installation corner case mid-dump. Default
  --snapshot-headroom-entries lowered from 10000 to 1000 (one-tenth
  of SnapshotEvery) since the formula now expresses what the prose
  intended.
- expected_keys heuristic was undefined. Replaced with a concrete
  expected-keys baseline mechanism:
    * BeginBackup runs a count-only ScanAt at read_ts before the pin
      installs, returns per-scope (key_count, applied_index) in
      BeginBackupResponse.
    * Producer compares actual count against baseline; fails with
      ErrCompactionDuringDump if below 99% of baseline minus sqrt(baseline).
    * Adds keys_only bool parameter to MVCCStore.ScanAt /
      ShardStore.ScanAt so the baseline pass skips value
      materialization (Pebble iterator with payload trimmed at the
      block layer).

Minor:
- _long_keys.jsonl renamed to KEYMAP.jsonl on line 192 to match the
  rest of the spec.
- BackupPin fan-out is now explicitly concurrent (one goroutine per
  group, blocked on all-or-deadline). 100-shard cluster does not pay
  100x serial Raft round-trips.

Tests added: TestExpectedKeysBaselineToleratesTTLExpiry,
TestExpectedKeysBaselineCountOnlyScan. Existing tests
TestRestoreWipesLocalPins and TestBeginBackupRefusesNearSnapshotThreshold
updated to match the corrected formulas.
@bootjp
Owner Author

bootjp commented Apr 29, 2026

Round-7 review (#706 (comment)) addressed in b1669803. Both important items are bugs caught by reading internal/raftengine/engine.go directly.

Important

  1. --snapshot-headroom-entries check was inverted. Real formula:
    entries_since_last_snapshot = AppliedIndex - LastSnapshotIndex
    remaining_headroom          = SnapshotEvery - entries_since_last_snapshot
    refuse if: remaining_headroom < --snapshot-headroom-entries
    
    Status has LastSnapshotIndex (not FirstIndex), and defaultSnapshotEvery = 10000 (verified at internal/raftengine/etcd/engine.go:92). A freshly-snapshotted cluster has the largest remaining headroom and is now correctly allowed; the check refuses backups that would trigger the snapshot-installation corner case mid-dump. Default lowered from 10000 to 1000 (one-tenth of SnapshotEvery). Both this guard and the baseline check in item 2 are sketched after this list.
  2. expected_keys heuristic was undefined. Replaced with a concrete expected-keys baseline:
    • BeginBackup runs a count-only ScanAt at read_ts before the pin is installed, returns per-scope (key_count, applied_index) in BeginBackupResponse.expected_keys.
    • Producer compares actual count against baseline; fails with ErrCompactionDuringDump if below 99% × baseline - sqrt(baseline) (a binomial-noise tolerance for routine TTL expiry).
    • MVCCStore.ScanAt / ShardStore.ScanAt extended with a keys_only bool parameter so the baseline pass skips value materialization (Pebble iterator with payload trimmed at the block layer; ~10× cheaper).
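
Both fixes reduce to a few lines of guard logic. A sketch using the names above; the function shapes and the FailedPrecondition wiring are assumptions, not the committed design:

package backup

import "math"

// refuseForHeadroom reports whether BeginBackup should return
// FailedPrecondition for one group: too few entries remain before the
// next Raft snapshot fires. appliedIndex and lastSnapshotIndex come from
// raftengine.Status; snapshotEvery from the engine configuration.
func refuseForHeadroom(appliedIndex, lastSnapshotIndex, snapshotEvery, headroom uint64) bool {
    entriesSinceSnapshot := appliedIndex - lastSnapshotIndex
    if entriesSinceSnapshot >= snapshotEvery {
        return true // snapshot already due
    }
    return snapshotEvery-entriesSinceSnapshot < headroom
}

// baselineShortfall reports whether the dump should fail with
// ErrCompactionDuringDump: the scope produced fewer keys than 99% of the
// count-only baseline, less a sqrt(baseline) binomial-noise allowance
// for routine TTL expiry.
func baselineShortfall(actual, baseline uint64) bool {
    threshold := 0.99*float64(baseline) - math.Sqrt(float64(baseline))
    return float64(actual) < threshold
}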

Minor

  1. _long_keys.jsonl → KEYMAP.jsonl on line 192 (consistent with the rest of the spec).
  2. firstIndex → LastSnapshotIndex (item #1 already covers it).
  3. BackupPin fan-out is now explicitly concurrent (one goroutine per group). A 100-shard cluster does not pay 100× serial Raft round-trips at the start of every backup.
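
A sketch of that fan-out, combining the one-goroutine-per-group propose with the round-5 rollback rule; the proposer interface and the function name are stand-ins for the ShardGroup.Propose path:

package backup

import (
    "context"
    "fmt"
    "sync"
)

type proposer interface {
    Propose(ctx context.Context, entry []byte) error
}

// fanOutPin proposes the BackupPin entry to every group concurrently. If
// any group fails to commit before ctx's --begin-backup-deadline expires,
// it proposes BackupRelease on the groups that did commit (releases are
// idempotent) and returns the error, surfaced to the client as Unavailable.
func fanOutPin(ctx context.Context, groups map[uint64]proposer, pin, release []byte) error {
    var (
        mu        sync.Mutex
        committed []uint64
        firstErr  error
        wg        sync.WaitGroup
    )
    for id, g := range groups { // one goroutine per group
        wg.Add(1)
        go func(id uint64, g proposer) {
            defer wg.Done()
            err := g.Propose(ctx, pin)
            mu.Lock()
            defer mu.Unlock()
            if err != nil {
                if firstErr == nil {
                    firstErr = fmt.Errorf("group %d: %w", id, err)
                }
                return
            }
            committed = append(committed, id)
        }(id, g)
    }
    wg.Wait()
    if firstErr != nil {
        for _, id := range committed { // roll back the partial pin set
            _ = groups[id].Propose(context.Background(), release)
        }
        return firstErr
    }
    return nil
}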

New tests: TestExpectedKeysBaselineToleratesTTLExpiry, TestExpectedKeysBaselineCountOnlyScan. TestRestoreWipesLocalPins and TestBeginBackupRefusesNearSnapshotThreshold updated to match the corrected formulas.

@claude review

@claude

claude Bot commented Apr 29, 2026

Claude finished @bootjp's task in 4m 30s


Review: docs/design/2026_04_29_proposed_logical_backup.md (round 8)

  • Gather context
  • Read updated design document
  • Verify round-7 items addressed
  • Check referenced source files
  • Post review feedback

All five round-7 items are addressed cleanly — the snapshot-headroom check is now correct and matches internal/raftengine/engine.go:53–69 (confirmed LastSnapshotIndex exists; AppliedIndex is at line 58), defaultSnapshotEvery = 10_000 at internal/raftengine/etcd/engine.go:92 is verified, and the concurrent fan-out is explicit. Two structural gaps surfaced from reading the expected-keys baseline section against the flow description and the proto spec.


Important

1. BeginBackupResponse proto spec is missing expected_keys and ShardApplied message definitions

The proto block (lines 702–707) shows:

message BeginBackupResponse {
  uint64 read_ts            = 1;
  bytes  pin_token          = 2;
  uint64 ttl_ms_effective   = 3;
  repeated ShardApplied shards = 4;  // group_id, applied_index at pin time
}

Two problems:

  • ShardApplied is referenced but never defined in the proto section. A Phase 1 implementor has to guess its shape — presumably { string group_id = 1; uint64 applied_index = 2; }, but that is not stated.
  • Phase 1 scope (lines 1243–1246) says "Add BeginBackupResponse.expected_keys per scope (the count produced by the baseline pass at read_ts before the pin is installed)" but there is no such field in the response message. An implementor needs the field number, type, and the shape of the per-scope record (likely repeated ScopeKeyCount expected_keys = 5; with { string scope_id; uint64 key_count; uint64 applied_index_at_count; }).

Both missing definitions would force a Phase 1 implementor to invent a wire format on the fly, creating a risk that it never matches what the producer eventually commits.

2. Count-only baseline scan runs at "step 1" but needs shards to have applied through read_ts (step 2)

The expected-keys baseline section (line 838) says:

"During step 1 of BeginBackup, before the pin is installed, the admin server runs a count-only scan of each adapter scope at read_ts…"

Step 1 is "Pick read_ts". Step 2 is "Wait for shards to catch up (applied_index ≥ commit_index_at_pin)". Step 3 is "Pin cluster-wide".

If the count-only scan runs at step 1 — before step 2 — a shard that has not yet applied through read_ts returns 0 (or a partial count) for ScanAt(at_ts=read_ts, keys_only=true). A lagging shard's baseline would be an undercount, which means:

  • The subsequent data scan (step 4, after pin is set) would see more keys than the baseline on the lagging shard.
  • ErrCompactionDuringDump (triggered when actual < 99% × baseline) can never fire for that shard's range.
  • The entire safety net against snapshot-induced pin loss becomes undetectable precisely on the shard most at risk.

The correct order is: step 2 (shard lag wait) → count-only scan → step 3 (pin). Change "During step 1" to "After step 2 and before step 3" and add a sentence confirming that by the time the count runs every shard has applied through read_ts.


Minor

3. --snapshot-headroom-entries default is inconsistent between the text and the CLI block

Line 822 (text): "--snapshot-headroom-entries = 1000 (default; one-tenth of SnapshotEvery)"

Line 1029 (CLI block): [--snapshot-headroom-entries 10000]

The CLI example shows 10000 (same as SnapshotEvery), but the text says the default is 1000. An implementor or operator reading the CLI help block would configure 10× the intended default. Change the CLI example to 1000.

4. ScanAt keys_only parameter addition is a breaking Go API change; call sites are unaccounted for

Phase 1 scope (lines 1238–1242) says: "extends MVCCStore.ScanAt / ShardStore.ScanAt with a keys_only bool parameter so the expected-keys baseline pass can iterate without materializing values."

Adding a new positional bool parameter to an existing Go function is a source-incompatible change — every existing caller must be updated. Confirmed by kv/shard_store.go:106: ScanAt(ctx, start, end, limit, ts) is called from at least 10 sites in shard_store.go alone (e.g. lines 308, 326, 855, 1344). Phase 1 scope does not list these as touch points.

The alternatives are: (a) keep the existing signature and add a new ScanAtKeysOnly(ctx, start, end, limit, ts) overload (no callers to update, simpler implementation), or (b) replace the parameter list with an options struct ScanOptions{KeysOnly bool} (also non-breaking via zero-value defaults, and open for future options). Add a note in Phase 1 scope choosing one of these and listing which call sites the chosen approach affects.


Round-7 items — all resolved ✓

| # | Item | Verdict |
| --- | --- | --- |
| 1 | --snapshot-headroom-entries check was inverted | Formula corrected: remaining_headroom = SnapshotEvery - (AppliedIndex - LastSnapshotIndex); freshly-snapshotted cluster correctly allowed ✓ |
| 2 | expected_keys heuristic undefined | Concrete count-only scan baseline: 99% × baseline - sqrt(baseline) tolerance; keys_only=true path specified ✓ |
| 3 | _long_keys.jsonl vs KEYMAP.jsonl inconsistency | Line 192 now says KEYMAP.jsonl throughout ✓ |
| 4 | firstIndex not in raftengine.Status | Replaced with LastSnapshotIndex; formula uses firstIndex = LastSnapshotIndex + 1 (verified against internal/raftengine/engine.go:60) ✓ |
| 5 | BackupPin fan-out parallelism unspecified | "One goroutine per group" explicit at lines 923–924 ✓ |

Overall: the design is very close to ready. Item 1 (missing proto definitions) and item 2 (count-ordering bug) are the blockers — item 2 is a correctness issue that would make ErrCompactionDuringDump silently ineffective on lagging shards, which is the only detection mechanism for the snapshot-induced pin loss corner case. Both are small additions to existing sections. Items 3 and 4 are one-liner fixes. After these four items, the proposal is ready to merge.

Important:
- BeginBackupResponse proto block was missing definitions. Added
  ShardApplied {group_id, applied_index} and ScopeKeyCount
  {scope_id, key_count, applied_index_at_count}; added
  "repeated ScopeKeyCount expected_keys = 5" to BeginBackupResponse.

- Expected-keys baseline ordering bug. Documented as running "during
  step 1" but step 1 is "pick read_ts" and step 2 is shard catch-up.
  A lagging shard would return 0 (or partial) for ScanAt at read_ts
  before applying through it, producing an undercount and silently
  disabling ErrCompactionDuringDump on exactly the shards most at
  risk of snapshot-induced pin loss. Moved to "after step 2 (every
  shard has applied through read_ts) and before step 3 (pin)".

Minor:
- --snapshot-headroom-entries CLI default 10000 -> 1000 (matches the
  one-tenth-of-SnapshotEvery default documented in the prose).
- ScanAt keys_only addition is a source-incompatible Go API change
  with ~10 callers in shard_store.go alone. Replaced with a new
  ScanKeysAt overload on MVCCStore / ShardStore (returns [][]byte
  not []*KVPair); existing ScanAt callers untouched.

Tests adjusted: TestExpectedKeysBaselineCountOnlyScan now verifies
both the new path and the absence of a regression on existing
ScanAt callers; new TestExpectedKeysBaselineRunsAfterShardCatchup
guards against the ordering bug.
@bootjp
Owner Author

bootjp commented Apr 29, 2026

Round-8 review (#706 (comment)) addressed in 7ca1ae51.

Important

  1. Proto definitions filled in. Added ShardApplied {group_id, applied_index} and ScopeKeyCount {scope_id, key_count, applied_index_at_count}; added repeated ScopeKeyCount expected_keys = 5 to BeginBackupResponse.

  2. Expected-keys baseline ordering bug fixed. It said "during step 1" but step 1 is "pick read_ts" and step 2 is shard catch-up. A lagging shard would return a partial count for ScanAt at read_ts before applying through it, producing an undercount and silently disabling ErrCompactionDuringDump on exactly the shards most at risk of snapshot-induced pin loss. Moved to "after step 2 (every shard has applied through read_ts) and before step 3 (pin)".

Minor

  1. --snapshot-headroom-entries CLI default 10000 → 1000 (matches the one-tenth-of-SnapshotEvery default documented in the prose).

  2. ScanAt keys_only parameter would be a source-incompatible Go change touching ~10 callers in shard_store.go alone. Replaced with a new ScanKeysAt overload on MVCCStore / ShardStore (returns [][]byte rather than []*KVPair). Existing ScanAt callers untouched; the new path delegates to a private helper that shares route resolution and merge logic.
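
In interface form, the non-breaking shape is roughly as follows; KVPair and the trimmed method set are stand-ins, with only the ScanKeysAt signature and the [][]byte return taken from the text above:

package kv

import "context"

type KVPair struct {
    Key, Value []byte
}

type MVCCStore interface {
    // Existing signature; call sites stay untouched.
    ScanAt(ctx context.Context, start, end []byte, limit int, ts uint64) ([]*KVPair, error)
    // New keys-only overload for the expected-keys baseline pass; shares
    // route resolution and merge logic with ScanAt via a private helper.
    ScanKeysAt(ctx context.Context, start, end []byte, limit int, ts uint64) ([][]byte, error)
}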

Tests: TestExpectedKeysBaselineCountOnlyScan now verifies both the new path and absence of regression on existing ScanAt callers; new TestExpectedKeysBaselineRunsAfterShardCatchup guards against the ordering bug.

@claude review

@claude

claude Bot commented Apr 29, 2026

Claude finished @bootjp's task in 4m 43s


Review: docs/design/2026_04_29_proposed_logical_backup.md (round 9)

  • Gather context
  • Read updated design document
  • Verify round-8 items addressed
  • Check referenced source files
  • Post review feedback

All four round-8 items are addressed cleanly — ShardApplied and ScopeKeyCount are defined, the baseline ordering bug is fixed, the CLI default is 1000, and ScanKeysAt replaces the breaking ScanAt extension. Two structural gaps surfaced from reading the raftengine interface and kv/fsm.go against the SnapshotEvery and FSM-tag claims in the design.


Important

1. SnapshotEvery is not accessible from the raftengine.Engine interface

Line 838–839 states:

"BeginBackup reads it from each group's engine config rather than hardcoding the default, so an operator who tuned --raftSnapshotEvery sees consistent behavior."

Two problems with this sentence:

  • No CLI flag --raftSnapshotEvery exists. The snapshot trigger is configured via the ELASTICKV_RAFT_SNAPSHOT_COUNT environment variable (internal/raftengine/etcd/engine.go:93), not a CLI flag. Any operator documentation or Phase 1 implementation that references --raftSnapshotEvery is looking for a flag that doesn't exist.

  • snapshotEvery is not accessible on the Engine interface. The Engine interface (internal/raftengine/engine.go:200) composes Proposer, LeaderView, StatusReader, and ConfigReader. StatusReader.Status() returns a Status struct with AppliedIndex, LastSnapshotIndex, CommitIndex, etc. — but no SnapshotEvery field. ConfigReader.Configuration() returns cluster membership, not engine config. snapshotEvery is a private field on the concrete etcd.Engine struct (internal/raftengine/etcd/engine.go:224), not exposed to any external caller.

A Phase 1 implementor reading the design would write code like group.Engine.SnapshotEvery() and find it doesn't compile. Fix: either (a) add SnapshotEvery() uint64 to the Engine interface and implement it on the etcd backend, or (b) acknowledge that BeginBackup always uses defaultSnapshotEvery = 10000 (and a separate admin knob sets the threshold). Clarify which. Also correct --raftSnapshotEvery to ELASTICKV_RAFT_SNAPSHOT_COUNT.

2. Old FSM nodes receiving BackupPin entries (byte tags 0x03–0x05) will fail Apply during rolling upgrades

The existing kvFSM.Apply (kv/fsm.go:60–65) dispatches only one special tag before falling into decodeRaftRequests:

if len(data) > 0 && data[0] == raftEncodeHLCLease {  // only 0x02
    return f.applyHLCLease(data[1:])
}
reqs, err := decodeRaftRequests(data)

decodeRaftRequests (kv/fsm.go:119–143) handles 0x00 and 0x01 explicitly; byte 0x03 hits default: return decodeLegacyRaftRequest(data) → tries proto-unmarshal on a hand-coded binary payload → returns an error. In etcd/raft, an error return from Apply causes the engine to treat that entry as failed — depending on the error-handling path this can stall the apply loop or diverge the FSM state on the old node.

During a rolling cluster upgrade (which is the normal deployment strategy), some nodes will have the old FSM and some the new. The first BeginBackup call after the new nodes are partially deployed would propose BackupPin through every group — and old nodes in those groups would fail to apply it.

The design must address this. Options:

  1. Require all nodes in the affected groups to be on the new version before BeginBackup can be called — and specify how this is enforced (e.g., BeginBackup checks all group members' versions via Admin.GetRaftGroups and returns FailedPrecondition if any member reports a version below min_backup_version).
  2. Make old nodes handle unknown backup tags gracefully — i.e., add a check in the old FSM's dispatch path that returns nil (no-op) for any unknown tag ≥ 0x03. This requires careful consideration of log ordering guarantees.

Without addressing this, the first BeginBackup call on a partially upgraded cluster will corrupt or stall the apply loop on old nodes.


Minor

3. ScopeKeyCount.scope_id format is only documented for two of the four adapters

Lines 720–725:

message ScopeKeyCount {
  // scope_id is "<adapter>/<scope-name>" e.g. "dynamodb/orders" or
  // "s3/photos". Producer matches it against the directory it is
  // about to write into.
  string scope_id = 1;

The examples cover DynamoDB and S3 but not Redis (whose scope_id would be "redis/db_0") or SQS ("sqs/orders-fifo.fifo"). ListAdaptersAndScopesResponse.redis_databases returns repeated uint32 not repeated string, so the producer needs a defined mapping from uint32 db_num → scope_id string. Without this, producer and restore tools may generate inconsistent scope IDs across adapters when comparing expected_keys entries.


Round-8 items — all resolved ✓

| # | Item | Verdict |
| --- | --- | --- |
| 1 | Missing ShardApplied and ScopeKeyCount proto definitions | Both defined at lines 714–726; expected_keys = 5 on BeginBackupResponse |
| 2 | Baseline count runs at step 1 before shards have caught up | Explicitly "After step 2 (every shard has applied through read_ts) and before step 3" at lines 857–868 ✓ |
| 3 | --snapshot-headroom-entries CLI default was 10000, not 1000 | CLI block now shows 1000 at line 1056 ✓ |
| 4 | ScanAt keys_only would break ~10 callers | Replaced with ScanKeysAt overload returning [][]byte; existing callers untouched (lines 1266–1276) ✓ |

Overall: the design is nearly complete. Item 1 (the SnapshotEvery interface gap) requires a one-line addition to the engine interface and one-line correction to the env-var name, but those belong in Phase 1 scope and should be captured before the implementation starts. Item 2 (rolling upgrade safety) is the last true blocker — without it, the first BeginBackup call on a partially upgraded cluster will fail in an undocumented way. Item 3 is a one-line doc fix.

Important (verified against internal/raftengine/{etcd/,}engine.go and
kv/fsm.go):
- SnapshotEvery is not on the raftengine.Engine interface today.
  Phase 1 adds StatusReader.SnapshotEvery() uint64 implemented on
  the etcd backend by returning the snapshotEveryFromEnv() value
  cached at Open. Also corrected the env-var name: there is no
  --raftSnapshotEvery flag; the override is
  ELASTICKV_RAFT_SNAPSHOT_COUNT (internal/raftengine/etcd/engine.go:93).

- Rolling-upgrade safety. An old kvFSM receiving a 0x03-tagged
  BackupPin entry falls through decodeRaftRequests ->
  decodeLegacyRaftRequest (kv/fsm.go:140-145) -> proto.Unmarshal ->
  Apply error -> apply-loop stall. A single-release introduction
  would corrupt apply on un-upgraded replicas the moment the first
  BeginBackup runs.

  Phase 1 ships in two releases:
    Release N: kvFSM.Apply treats unknown tags in [0x03, 0x0F] as
      no-ops with an unknown_fsm_tag warning (forward-compat handler,
      no other backup machinery active).
    Release N+1: BackupPin/Extend/Release handlers ship; BeginBackup
      gates on every group member reporting version >= N via
      Admin.GetRaftGroups, returning FailedPrecondition otherwise.
  Operators may treat N and N+1 as the same calendar release if they
  enforce a fleet-wide upgrade window first; the gate still
  short-circuits.

Minor:
- ScopeKeyCount.scope_id format documented for all four adapters,
  including the redis_databases uint32 -> "redis/db_<n>" mapping.

Tests added: TestForwardCompatUnknownFSMTag,
TestBeginBackupGatesOnMinVersion, TestSnapshotEveryReadsFromEngine.
@bootjp
Owner Author

bootjp commented Apr 29, 2026

Round-9 review (#706 (comment)) addressed in 5bda8a8b. Both important items verified against internal/raftengine/{etcd/,}engine.go and kv/fsm.go.

Important

  1. SnapshotEvery not on raftengine.Engine interface, and the wrong env var name in the doc. Phase 1 adds StatusReader.SnapshotEvery() uint64 implemented on the etcd backend by returning the value cached at Open from snapshotEveryFromEnv(). Also corrected: there is no --raftSnapshotEvery flag; the override is ELASTICKV_RAFT_SNAPSHOT_COUNT (internal/raftengine/etcd/engine.go:93).

  2. Rolling-upgrade safety. Confirmed: an unknown leading tag falls through decodeRaftRequests (kv/fsm.go:140) → decodeLegacyRaftRequest (kv/fsm.go:145) → proto.Unmarshal → Apply error → apply-loop stall on un-upgraded replicas. A single-release introduction of BackupPin would corrupt apply on every un-upgraded replica the moment the first BeginBackup runs.

    Two-release rollout specified in Phase 1 scope:

    • Release N: kvFSM.Apply treats unknown tags in [0x03, 0x0F] as no-ops with an unknown_fsm_tag warning (forward-compat handler only — no other backup machinery active). Reserved range bounded so future tags (CDC, etc.) inherit the path without reopening the question.
    • Release N+1: BackupPin / Extend / Release handlers ship; BeginBackup gates on every group member reporting version ≥ N via Admin.GetRaftGroups, returning FailedPrecondition otherwise.

    Operators may collapse N and N+1 into one calendar release if they enforce a fleet-wide upgrade window before any backup runs; the gate check still short-circuits.

Minor

  1. ScopeKeyCount.scope_id format now documented for all four adapters, including the redis_databases uint32 → "redis/db_<n>" mapping for Redis.

New tests: TestForwardCompatUnknownFSMTag, TestBeginBackupGatesOnMinVersion, TestSnapshotEveryReadsFromEngine.

@claude review

@claude

claude Bot commented Apr 29, 2026

Claude finished @bootjp's task in 3m 18s


Review: docs/design/2026_04_29_proposed_logical_backup.md (round 10)

  • Gather context
  • Read updated design document
  • Verify round-9 items addressed
  • Check referenced source files
  • Post review feedback

All three round-9 items are addressed cleanly — SnapshotEvery() uint64 is proposed for StatusReader, the env-var name is corrected to ELASTICKV_RAFT_SNAPSHOT_COUNT, the two-release rollout plan is solid, and ScopeKeyCount.scope_id covers all four adapters. One correctness issue surfaced from reading proto/admin.proto directly; two minor issues follow.


Important

1. GetRaftGroups carries no per-node version metadata — the min_backup_version gate cannot be implemented as described

Lines 1289–1293:

"BeginBackup itself gates on every group member reporting min_backup_version ≥ N — verified through the existing Admin.GetRaftGroups membership view, which already carries per-node version metadata."

RaftGroupState in proto/admin.proto:38–49 has exactly these fields: raft_group_id, leader_node_id, leader_term, commit_index, applied_index, last_contact_unix_ms. There is no version field. GetRaftGroupsResponse is repeated RaftGroupState groups = 1 — nothing else.

A Phase 1 implementor following the design would write group.Version in the BeginBackup handler and discover at compile time that the field does not exist. The same gap affects TestBeginBackupGatesOnMinVersion — the test has no version value to assert against.

Fix: add string node_version = 7; (semver of the binary on that member) to RaftGroupState in proto/admin.proto, add it to the Phase 1 scope list, and update line 1291 to remove the false "already carries" claim. Populating the field is straightforward — the admin handler already knows the local binary version, and follower versions are surfaced through the existing StatusReader.Status() flow.


Minor

2. MaxEntryBytes is not a constant in the codebase

Line 1322: "stay within the existing MaxEntryBytes budget."

There is no MaxEntryBytes constant anywhere in the codebase. The governing limit is MaxSizePerMsg, defaulting to 1 MiB (defaultMaxSizePerMsg = 1<<20, internal/raftengine/etcd/engine.go:55). The underlying claim is obviously valid (33 bytes ≪ 1 MiB) — but referencing a fictitious constant will confuse the Phase 1 implementor who searches for it. Change to "well within the MaxSizePerMsg limit (default 1 MiB, internal/raftengine/etcd/engine.go:55)".

3. Release N forward-compat check position in Apply is not specified

Lines 1279–1285 describe Release N adding a no-op handler for tags in [0x03, 0x0F], but don't say exactly where in kvFSM.Apply the check goes. The current Apply dispatch (kv/fsm.go:60–65) is:

if len(data) > 0 && data[0] == raftEncodeHLCLease {  // 0x02
    return f.applyHLCLease(data[1:])
}
// falls into decodeRaftRequests

The forward-compat block must sit between the raftEncodeHLCLease guard and decodeRaftRequests — adding it inside decodeRaftRequests's default: case would work identically, but splitting the two checks across two functions makes a future tag collision much harder to spot. Add one line: "The check is added in kvFSM.Apply immediately after the raftEncodeHLCLease guard and before decodeRaftRequests."


Round-9 items — all resolved ✓

| # | Item | Verdict |
| --- | --- | --- |
| 1 | SnapshotEvery not on raftengine.Engine interface; wrong env var name | StatusReader.SnapshotEvery() uint64 proposed; env var correctly identified as ELASTICKV_RAFT_SNAPSHOT_COUNT (internal/raftengine/etcd/engine.go:93) ✓ |
| 2 | Rolling-upgrade safety — unknown tag stalls apply loop | Two-release rollout plan (Release N: no-op forward-compat; Release N+1: full handlers + version gate) ✓ |
| 3 | ScopeKeyCount.scope_id undocumented for Redis/SQS | All four adapter mappings documented at lines 722–732; Redis uint32 → "redis/db_<n>" mapping explicit ✓ |

Overall: one confirmed correctness issue remains — the min_backup_version gate depends on a version field that does not exist in RaftGroupState. That's a one-field proto addition and a one-line scope-list update, but without it the gating logic that makes the rolling-upgrade plan safe is unimplementable. Items 2 and 3 are one-liners. After those three fixes, the proposal is ready to merge.

Important (verified against proto/admin.proto):
- The min_backup_version gate claimed Admin.GetRaftGroups already
  carries per-node version. It does not. RaftGroupState
  (proto/admin.proto:38-49) only has raft_group_id, leader_node_id,
  leader_term, commit_index, applied_index, last_contact_unix_ms.
  Phase 1 therefore adds string node_version = 7 to RaftGroupState,
  populated by each admin handler from the build-time version
  constant. Removed the false "already carries" claim.

Minor:
- MaxEntryBytes is not a constant in the codebase. Replaced with the
  real limit: MaxSizePerMsg (defaultMaxSizePerMsg = 1<<20,
  internal/raftengine/etcd/engine.go:55).
- Forward-compat unknown-tag handler position specified: in
  kvFSM.Apply immediately after the raftEncodeHLCLease guard and
  before decodeRaftRequests, so all leading-byte dispatch stays in
  one place rather than splitting between Apply and the default case
  of decodeRaftRequests. Tags >= 0x10 still fall through and error,
  bounding the forward-compat door.
@bootjp
Owner Author

bootjp commented Apr 29, 2026

Round-10 review (#706 (comment)) addressed in 9bf18e1c. The important item was a real bug — verified against proto/admin.proto:38-49.

Important

  1. RaftGroupState does not carry node_version. My round-9 claim that Admin.GetRaftGroups "already carries per-node version metadata" was wrong — RaftGroupState has only raft_group_id, leader_node_id, leader_term, commit_index, applied_index, last_contact_unix_ms. Phase 1 therefore explicitly adds string node_version = 7; to RaftGroupState, populated by each admin handler from the build-time version constant. Without this, the rolling-upgrade gate is unimplementable. Removed the false "already carries" claim.

Minor

  1. MaxEntryBytes doesn't exist in the codebase. Replaced with the real limit: MaxSizePerMsg (defaultMaxSizePerMsg = 1<<20, internal/raftengine/etcd/engine.go:55).

  2. Forward-compat handler position specified. In kvFSM.Apply immediately after the raftEncodeHLCLease guard and before decodeRaftRequests, so all leading-byte dispatch lives in one place. Tags ≥ 0x10 still fall through and error, bounding the forward-compat door.
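
Spelled out as a fragment in the shape of the existing dispatch (the logger call is a hypothetical stand-in):

if len(data) > 0 && data[0] == raftEncodeHLCLease { // 0x02
    return f.applyHLCLease(data[1:])
}
if len(data) > 0 && data[0] >= 0x03 && data[0] <= 0x0F {
    // Release N forward-compat: reserved tags are applied as no-ops.
    f.log.Warn("unknown_fsm_tag", "tag", data[0]) // hypothetical logger shape
    return nil
}
// Tags >= 0x10 fall through and still error inside decodeRaftRequests.
reqs, err := decodeRaftRequests(data)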

TestBeginBackupGatesOnMinVersion updated to assert against the new RaftGroupState.node_version field.

@claude review

@claude

claude Bot commented Apr 29, 2026

Claude finished @bootjp's task in 5m 25s


Review: docs/design/2026_04_29_proposed_logical_backup.md (round 11)

  • Gather context
  • Read updated design document
  • Verify round-10 items addressed
  • Check referenced source files
  • Post review feedback

All three round-10 items are addressed correctly — RaftGroupState.node_version = 7 is specified as a Phase 1 addition with the false "already carries" claim removed, MaxSizePerMsg correctly points to internal/raftengine/etcd/engine.go:55 (confirmed: defaultMaxSizePerMsg = 1 << 20), and the forward-compat handler position is correctly described. Two structural issues surfaced from reading the adapter/admin_grpc.go implementation against the design's claims.


Important

1. GetRaftGroups returns one version per group (the local node's), not per cluster member — the rolling-upgrade gate cannot be implemented as described

GetRaftGroups (adapter/admin_grpc.go:382–416) iterates over s.groups — the Raft engines registered on the local AdminServer instance — and for each group returns one RaftGroupState populated from the local engine's Status():

out = append(out, &pb.RaftGroupState{
    RaftGroupId:       id,
    LeaderNodeId:      st.Leader.ID,
    LeaderTerm:        st.Term,
    CommitIndex:       st.CommitIndex,
    AppliedIndex:      st.AppliedIndex,
    LastContactUnixMs: lastContactUnixMs,
})

Adding node_version = 7 to RaftGroupState and populating it from the local binary's version constant would give a single version per group — the admin server node's version — regardless of what followers are running. A three-node cluster where node A (the admin server) is at v2, node B is at v1, and node C is at v2 would return three RaftGroupState entries (one per group), each with node_version = "v2" — the gate would pass and BackupPin would be proposed to groups still containing an old node B that doesn't know the new tags.

The design's claim that "follower versions are surfaced via the existing membership-state propagation path used by GetRaftGroups" is incorrect — GetRaftGroups has no follower version propagation today.

Options to fix the gate:

  • Fan-out GetVersion RPC to each peer: The backup CLI (or the admin server receiving BeginBackup) calls a simple GetVersion() → string RPC on every node in AdminServer.members. Each node responds with its own binary version. This is the lightest change — no Raft log involvement, no proto restructure. Works as long as the backup CLI can reach all admin ports.
  • Per-member rows in GetRaftGroups: Redesign RaftGroupState to include a repeated MemberState members = 8; field where each member carries node_id + node_version. Requires the leader to gossip or pull versions from peers.
  • Gate only on the local node: Document that the gate checks only that the node receiving BeginBackup is at version ≥ N, and rely on operator upgrade procedures for the rest. Weaker but honest about what GetRaftGroups can actually verify.

The design must pick one and remove the "existing membership-state propagation" claim.

2. AdminServer has no ShardStore — the "existing wiring" claim is factually wrong

Line 663–664:

"lets the admin server reach it directly via the existing AdminServerShardStore wiring used by GetRaftGroups"

Confirmed by reading adapter/admin_grpc.go:62–82: AdminServer holds self, members, groups map[uint64]AdminGroup, now, and sampler — no *kv.ShardStore field. GetRaftGroups calls AdminGroup.Status() and AdminGroup.Configuration(); it never touches ShardStore. There is no existing wiring.

For BeginBackup to call ShardStore.NewBackupScanner, Phase 1 must add a *kv.ShardStore field to AdminServer and wire it in main.go. The design should state this explicitly rather than describing it as pre-existing wiring — the language "existing" would mislead a Phase 1 implementor into searching for a connection that does not exist.


Minor

3. SnapshotEvery() in StatusReader does not reach BeginBackup through AdminGroup

The design adds SnapshotEvery() uint64 to raftengine.StatusReader (engine.go:158). But AdminGroup (adapter/admin_grpc.go:41–44) is a separate interface that independently defines Status() raftengine.Status — it does not embed StatusReader. Even after adding SnapshotEvery() to StatusReader, AdminGroup would still only expose Status(), not SnapshotEvery().

A BeginBackup handler on AdminServer accessing s.groups[id] (type AdminGroup) cannot call SnapshotEvery() without a type assertion or interface extension. Add a note to Phase 1 scope: either extend AdminGroup to include SnapshotEvery() uint64 (one more method), or document that the handler falls back to defaultSnapshotEvery when the type assertion s.groups[id].(interface{ SnapshotEvery() uint64 }) fails (appropriate for backwards-compat with tests that mock AdminGroup with only Status()).


Round-10 items — all resolved ✓

| # | Item | Verdict |
| --- | --- | --- |
| 1 | RaftGroupState has no node_version field | Correctly described as Phase 1 addition; "already carries" claim removed (lines 1302–1322) ✓ |
| 2 | MaxEntryBytes doesn't exist | Replaced with MaxSizePerMsg (internal/raftengine/etcd/engine.go:55, confirmed = 1 << 20) (line 1351) ✓ |
| 3 | Forward-compat handler position unspecified | "Immediately after raftEncodeHLCLease guard, before decodeRaftRequests" (lines 1287–1290); matches actual kv/fsm.go:63–69 structure ✓ |

Overall: two confirmed code-vs-doc mismatches remain. Item 1 (the version gate) is the blocker — as written, BeginBackup would always pass the version check because GetRaftGroups only reports the local node's version. Item 2 (the wiring claim) needs a one-line correction plus a one-line Phase 1 scope addition. Item 3 is a two-sentence clarification. After those are addressed, the proposal is ready to merge.

Important (verified against adapter/admin_grpc.go):
- Rolling-upgrade gate redesigned. GetRaftGroups iterates the local
  AdminServer's registered groups and reports each row from the
  local engine's Status() (adapter/admin_grpc.go:382-416), so adding
  node_version to RaftGroupState would only ever report the local
  admin server's version. Followers on other nodes would not
  surface, and the gate would falsely pass on a partially-upgraded
  cluster.

  Phase 1 instead adds a dedicated GetNodeVersion RPC fanned out by
  the BeginBackup handler to every peer in AdminServer.members.
  Each peer responds with its compiled-in version constant; the
  handler refuses BeginBackup if any peer is below N or unreachable
  within --begin-backup-deadline. RaftGroupState gains
  leader_node_version = 7 for UI visibility only (not used for the
  gate decision).

- AdminServer has no *kv.ShardStore today (confirmed at
  adapter/admin_grpc.go:62-82). The "existing AdminServer ->
  ShardStore wiring used by GetRaftGroups" claim was wrong;
  GetRaftGroups only reaches AdminGroup.Status() and Configuration().
  Phase 1 adds the field explicitly, wired in main.go.

Minor:
- AdminGroup interface (adapter/admin_grpc.go:41-44) does not embed
  StatusReader, so adding SnapshotEvery() to StatusReader does not
  reach BeginBackup through s.groups[id]. Phase 1 explicitly extends
  AdminGroup with SnapshotEvery() uint64; mocks in
  adapter/admin_grpc_test.go gain one method that returns
  defaultSnapshotEvery for parity with production.

TestBeginBackupGatesOnMinVersion updated to match the new
GetNodeVersion fan-out path; canonical 3-node fixture documented.
@bootjp
Owner Author

bootjp commented Apr 29, 2026

Round-11 review (#706 (comment)) addressed in a0292c95. Both important items verified against adapter/admin_grpc.go.

Important

  1. Rolling-upgrade gate redesigned. GetRaftGroups (adapter/admin_grpc.go:382-416) iterates the local AdminServer's groups and reports each row from the local engine's Status(). Adding node_version to RaftGroupState would only ever surface the admin server's own version — followers on other nodes would not be reflected, and the gate would falsely pass on a partially-upgraded cluster.

    Phase 1 instead adds a dedicated GetNodeVersion RPC fanned out by the BeginBackup handler to every peer in AdminServer.members. Each peer returns its compiled-in version constant; the handler refuses BeginBackup if any peer is below N or unreachable within --begin-backup-deadline. RaftGroupState still gains leader_node_version = 7 for UI visibility, but it is not used for the gate decision. (The fan-out is sketched after this list.)

  2. AdminServer has no *kv.ShardStore field (confirmed at adapter/admin_grpc.go:62-82). The "existing AdminServerShardStore wiring used by GetRaftGroups" claim was wrong — GetRaftGroups only reaches AdminGroup.Status() and Configuration(). Phase 1 adds the field explicitly and wires it in main.go.
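
A minimal sketch of that version gate; versionClient, the peer map shape, and the plain string comparison (real code would compare semver) are stand-ins:

package backup

import (
    "context"
    "fmt"
)

type versionClient interface {
    GetNodeVersion(ctx context.Context) (string, error)
}

// gateOnMinVersion fans GetNodeVersion out to every peer concurrently and
// refuses (surfaced as FailedPrecondition at the RPC layer) if any peer is
// below the minimum version or unreachable before ctx's deadline.
func gateOnMinVersion(ctx context.Context, peers map[string]versionClient, min string) error {
    type result struct {
        node, version string
        err           error
    }
    ch := make(chan result, len(peers))
    for node, c := range peers {
        go func(node string, c versionClient) {
            v, err := c.GetNodeVersion(ctx)
            ch <- result{node, v, err}
        }(node, c)
    }
    for range peers {
        r := <-ch
        if r.err != nil {
            return fmt.Errorf("node %s unreachable: %w", r.node, r.err)
        }
        if r.version < min { // placeholder comparison
            return fmt.Errorf("node %s at %s, need >= %s", r.node, r.version, min)
        }
    }
    return nil
}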

Minor

  1. AdminGroup interface (adapter/admin_grpc.go:41-44) does not embed StatusReader, so the SnapshotEvery() addition on StatusReader does not reach BeginBackup through s.groups[id]. Phase 1 extends AdminGroup itself with SnapshotEvery() uint64; mocks in adapter/admin_grpc_test.go gain one extra method that returns defaultSnapshotEvery for parity with production.
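For illustration, a minimal sketch of the extended interface and its test mock. The Status and Configuration shapes are placeholders, not the real signatures from adapter/admin_grpc.go; only SnapshotEvery() uint64 and defaultSnapshotEvery = 10000 come from the proposal.

```go
package adapter

// Placeholder result types; the real shapes live in adapter/admin_grpc.go.
type GroupStatus struct{ LeaderID string }
type GroupConfig struct{ Voters []string }

// AdminGroup as extended in Phase 1: Status and Configuration mirror the
// existing interface, and SnapshotEvery is the addition that lets
// BeginBackup reach the snapshot cadence through s.groups[id].
type AdminGroup interface {
	Status() GroupStatus
	Configuration() GroupConfig
	SnapshotEvery() uint64
}

const defaultSnapshotEvery uint64 = 10000

// Mocks gain the same method, returning the production default for parity.
type mockGroup struct{}

func (mockGroup) Status() GroupStatus        { return GroupStatus{} }
func (mockGroup) Configuration() GroupConfig { return GroupConfig{} }
func (mockGroup) SnapshotEvery() uint64      { return defaultSnapshotEvery }
```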

TestBeginBackupGatesOnMinVersion updated to match the new GetNodeVersion fan-out path.
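As a sketch of the gate decision itself (transport, deadlines, and auth are refined in later rounds), assuming peer versions are "vX.Y.Z" strings comparable with golang.org/x/mod/semver:

```go
package adapter

import (
	"fmt"

	"golang.org/x/mod/semver"
)

// minVersionGate returns nil only if every peer reported a version at or
// above min. peerVersions maps node ID to the string each peer returned
// from GetNodeVersion; the helper name is hypothetical.
func minVersionGate(peerVersions map[string]string, min string) error {
	for node, v := range peerVersions {
		if semver.Compare(v, min) < 0 {
			return fmt.Errorf("node %s reports version %s; minimum is %s", node, v, min)
		}
	}
	return nil
}
```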

@claude review

@claude

claude Bot commented Apr 29, 2026

Claude finished @bootjp's task in 4m 8s —— View job


Review: docs/design/2026_04_29_proposed_logical_backup.md (round 12)

  • Gather context
  • Read updated design document
  • Verify round-11 items addressed
  • Check referenced source files
  • Post review feedback

All three round-11 items are addressed correctly — the GetNodeVersion fan-out replaces the broken GetRaftGroups approach, the false "existing wiring" claim is removed, and AdminGroup is extended with SnapshotEvery(). One structural gap surfaced from reading adapter/admin_grpc.go against the fan-out target; two minor issues follow.


Important

1. BeginBackup's GetNodeVersion fan-out targets AdminServer.members — dynamically-added nodes are never checked

Lines 1344–1346:

"The admin server handling BeginBackup issues GetNodeVersion to every peer in AdminServer.members (and to itself)"

AdminServer.members is a static bootstrap seed set at construction time via NewAdminServer(self NodeIdentity, members []NodeIdentity) (adapter/admin_grpc.go:89). It is never updated after startup.

GetClusterOverview works around this by calling snapshotMembers (adapter/admin_grpc.go:158), which unions the static seed with nodes dynamically discovered from each group's Configuration(). A node added via AddVoter after startup appears in Configuration() but not in AdminServer.members.

In a rolling-upgrade scenario:

  1. Cluster is running at vN-1. Static members seed = [n1, n2, n3].
  2. Operator adds node n4 (at old version) via AddVoter. n4 is in Configuration() but not in members.
  3. Operator upgrades n1, n2, n3 to vN+1.
  4. BeginBackup on n1 fans out GetNodeVersion to [n2, n3] (and self). Gate passes.
  5. BackupPin is proposed to every group — including the group where n4 is a replica. n4's FSM doesn't know BackupPin yet; the apply loop stalls.

The design's rolling-upgrade guarantee is silently incomplete for any cluster that has grown since startup.

Fix: change BeginBackup to fan out GetNodeVersion to the live member set, obtained the same way snapshotMembers does it — union of AdminServer.members and nodes discovered from each group's Configuration(). Add a note to Phase 1 scope that BeginBackup calls the equivalent of snapshotMembers before issuing GetNodeVersion fan-outs. Fix this →
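For concreteness, a sketch of the corrected fan-out target under these assumptions: NodeIdentity stands in for pb.NodeIdentity, and the snapshotMembers result is modeled as an input slice.

```go
package adapter

// NodeIdentity stands in for pb.NodeIdentity (assumed shape).
type NodeIdentity struct{ NodeID, GRPCAddress string }

// versionGateTargets models the fix: the gate fans out to the union of
// the bootstrap seed and the nodes discovered from each group's
// Configuration() (what snapshotMembers computes), deduplicated by node
// ID and always including the handling node itself.
func versionGateTargets(self NodeIdentity, snapshot []NodeIdentity) []NodeIdentity {
	seen := map[string]NodeIdentity{self.NodeID: self}
	for _, m := range snapshot {
		seen[m.NodeID] = m
	}
	out := make([]NodeIdentity, 0, len(seen))
	for _, m := range seen {
		out = append(out, m)
	}
	return out
}
```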


Minor

2. leader_node_version = 7 in GetRaftGroups adds an undocumented async dial per group leader

Lines 1353–1357:

"RaftGroupState is also extended with a string leader_node_version = 7; field populated from GetNodeVersion(leader) so the admin UI can surface mixed-version groups directly."

GetRaftGroups today (adapter/admin_grpc.go:382–416) is a synchronous read of s.groups[id].Status() — O(groups), no network calls. Populating leader_node_version requires:

  1. Look up the leader's gRPC address from Configuration() (another per-group call)
  2. Dial the leader's admin service
  3. Call GetNodeVersion

This turns GetRaftGroups from a local-state snapshot into a fan-out handler, with a latency tail tied to leader reachability. The design says it's "informational" and "not used for the gate," but a slow or unreachable leader would silently delay or empty GetRaftGroups responses without any specified timeout.

Specify: either (a) the per-leader dial is async with a short timeout (e.g. 500ms), returning empty string on failure and never blocking the response, or (b) leader_node_version is populated lazily / on a separate background poller and the field in GetRaftGroups is best-effort-cached. Without this, a Phase 1 implementor has to choose a latency contract for GetRaftGroups with no guidance. Fix this →

3. GetNodeVersion failure mode conflates unreachable node with old-version node

Lines 1348–1349:

"The handler refuses BeginBackup if any peer returns a version below N or fails to respond within --begin-backup-deadline."

A node that is partitioned or temporarily down and a node running old software both produce the same FailedPrecondition ("upgrade node X to vN"). If a node is unreachable but known to be at the right version (e.g., it was just rebooted), the operator has no way to tell which condition applies without out-of-band investigation.

In practice, a node that can't respond to GetNodeVersion within 5 seconds during a backup attempt may or may not be an upgrade problem — it could be a momentary network hiccup or a legitimate failure. Blocking all backups until every node is reachable is conservative but may be operationally painful in a large cluster.

Add a sentence acknowledging this trade-off and mention that --begin-backup-deadline applies per-peer (so a 100-node cluster with one slow peer still only delays by deadline, not 100 × deadline), and that operators can use EndBackup on stranded pins if needed. Fix this →


Round-11 items — all resolved ✓

| # | Item | Verdict |
|---|------|---------|
| 1 | Rolling-upgrade gate used GetRaftGroups, which only reports the local node's version | GetNodeVersion RPC added; BeginBackup fans out to AdminServer.members (and self); RaftGroupState.leader_node_version = 7 for UI only (not gate) ✓ |
| 2 | AdminServer has no *kv.ShardStore field; the "existing wiring" claim was wrong | Phase 1 explicitly adds the field and wires it in main.go; false "existing" language removed ✓ |
| 3 | AdminGroup interface doesn't embed StatusReader, so SnapshotEvery() is unreachable through s.groups[id] | AdminGroup extended with SnapshotEvery() uint64; mocks return defaultSnapshotEvery = 10000 ✓ |

Overall: the design is in excellent shape. Item 1 above (static seed vs. live members) is the only structural correctness gap remaining — the version gate is explicitly designed to prevent backup on partially-upgraded clusters, but it silently misses nodes added after startup. Items 2 and 3 are one-sentence clarifications that will save Phase 1 implementors from having to invent a latency contract. After those three items are addressed, the proposal is ready to merge and Phase 1 implementation can begin.

Important (verified against adapter/admin_grpc.go):
- BeginBackup gate fan-out target was AdminServer.members (the
  static bootstrap seed at line 64,89 — never updated after startup).
  Nodes added later via AddVoter would not appear, silently missing
  exactly the nodes most likely to be running an older version
  during a rolling upgrade.

  Fix: BeginBackup reuses the existing snapshotMembers helper
  (adapter/admin_grpc.go:158), the same path GetClusterOverview
  uses, which unions the seed with members discovered from each
  group's Configuration(). The fan-out target is the live cluster,
  not the startup snapshot.

Minor:
- GetRaftGroups today is a synchronous local-state read with no
  network calls. Adding leader_node_version populated via
  GetNodeVersion(leader) would turn it into a fan-out handler with
  unbounded latency tail. Specified async dial per leader with a
  500ms per-leader timeout, "" on failure, and a 10s response
  cache so rapid polls do not fan out repeatedly.

- GetNodeVersion failure mode now distinguishes "old version" from
  "unreachable" in the FailedPrecondition error string so operators
  can diagnose without out-of-band investigation. Also noted that
  --begin-backup-deadline is per-peer (fan-out is concurrent), so
  the worst-case wait is one deadline period regardless of cluster
  size.

Tests added: TestBeginBackupGatesIncludesDynamicallyAddedNodes,
TestGetRaftGroupsLeaderVersionAsync.
TestBeginBackupGatesOnMinVersion updated to assert the new error
shapes.
@bootjp
Owner Author

bootjp commented Apr 29, 2026

Round-12 review (#706 (comment)) addressed in 3efd68d2. Important item verified against adapter/admin_grpc.go.

Important

  1. AdminServer.members is a static bootstrap seed (adapter/admin_grpc.go:64,89 — set once in NewAdminServer, never updated). A node added later via AddVoter would not appear, silently missing exactly the nodes most likely to be at an older version during a rolling upgrade.

    Fix: BeginBackup reuses snapshotMembers (adapter/admin_grpc.go:158) — the same path GetClusterOverview uses — which unions the seed with members discovered from each group's Configuration(). Fan-out target is the live cluster.

Minor

  1. GetRaftGroups async-dial latency contract specified. Adding leader_node_version would otherwise turn a synchronous local read into a fan-out handler with unbounded latency. Per-leader dial is async with 500ms timeout (failures yield ""), and a 10s response cache prevents repeated fan-outs across rapid polls.

  2. GetNodeVersion failure-mode disambiguation. Error message distinguishes "old version" (FailedPrecondition: node X reports version v0.9.0; minimum is v1.0.0) from "unreachable" (FailedPrecondition: node X did not respond within 5s) so operators can diagnose without out-of-band investigation. --begin-backup-deadline is per-peer (fan-out is concurrent), so worst-case wait is one deadline period regardless of cluster size.
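Sketched error construction, assuming gRPC status codes and semver version strings (the helper name is hypothetical):

```go
package adapter

import (
	"time"

	"golang.org/x/mod/semver"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// classifyGateFailure produces the two documented error shapes so an
// old-version node and an unreachable node are distinguishable from the
// message alone. rpcErr is the error from the GetNodeVersion call, if any.
func classifyGateFailure(node, version string, rpcErr error, min string, deadline time.Duration) error {
	switch {
	case rpcErr != nil && status.Code(rpcErr) == codes.DeadlineExceeded:
		return status.Errorf(codes.FailedPrecondition,
			"node %s did not respond within %s", node, deadline)
	case rpcErr != nil:
		return status.Errorf(codes.FailedPrecondition,
			"node %s unreachable: %v", node, rpcErr)
	case semver.Compare(version, min) < 0:
		return status.Errorf(codes.FailedPrecondition,
			"node %s reports version %s; minimum is %s", node, version, min)
	}
	return nil // node passes the gate
}
```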

New tests: TestBeginBackupGatesIncludesDynamicallyAddedNodes, TestGetRaftGroupsLeaderVersionAsync. TestBeginBackupGatesOnMinVersion updated to assert the new error shapes and the snapshotMembers-based fan-out.

@claude review

@claude

claude Bot commented Apr 29, 2026

Claude finished @bootjp's task in 3m 41s —— View job


Review: docs/design/2026_04_29_proposed_logical_backup.md (round 13)

  • Gather context
  • Read updated design document
  • Verify round-12 items addressed
  • Check referenced source files for new gaps
  • Post review feedback

All three round-12 items are addressed correctly — snapshotMembers is the right choice (confirmed: adapter/admin_grpc.go:158 unions the bootstrap seed with live Configuration() so dynamically-added nodes appear), the async-dial latency contract for leader_node_version is specified, and the error message distinguishes "old version" from "unreachable". One structural gap surfaced when reading adapter/admin_grpc.go against the GetNodeVersion fan-out description; one minor issue follows.


Important

1. AdminServer has no outbound gRPC connection mechanism for GetNodeVersion — Phase 1 scope gap

Lines 1361–1365:

"Each GetNodeVersion dial uses --begin-backup-deadline as a per-peer budget (the fan-out is concurrent, one goroutine per peer) … Each peer responds with its compiled-in version constant."

snapshotMembers() returns []*pb.NodeIdentity{NodeId, GrpcAddress}, so the addresses are available. But actually dialing those addresses requires establishing gRPC connections. AdminServer today (adapter/admin_grpc.go:62–82) holds self, members, groups, now, sampler — no connection pool, no dialer, no *grpc.ClientConn factory. The only outbound connection infrastructure in the codebase lives in:

  • kv.GRPCConnCache (kv/grpc_conn_cache.go) — used by ShardStore / LeaderProxy / Coordinator for KV leader-forwarding. Phase 1 adds a *kv.ShardStore field to AdminServer; in principle the ShardStore.connCache could be reused, but that pool is keyed on data-plane addresses (Raft group leaders) and mixing admin-RPC dials into it conflates two different purposes.
  • cmd/elastickv-admin/main.go fanout — lives in the CLI tool, not in-process.

Without specifying the mechanism, a Phase 1 implementor either: (a) invents their own dialer with choices about credentials, dial options, and connection lifecycle that may not match the rest of the system, or (b) silently shares ShardStore.connCache in a way that violates the separation of concerns and may break if node addresses for admin and data plane ever diverge.

Fix: Add to Phase 1 scope — "Phase 1 also adds a *GRPCConnCache (or equivalent) to AdminServer for admin-to-admin peer dials, wired in main.go with internal.GRPCDialOptions(). The field is separate from the KV leader-forwarding cache in ShardStore because admin and data-plane addresses may differ."

This is the same category as the previously-caught "AdminServer has no ShardStore field — existing wiring doesn't exist" issue from round-11. Fix this →


Minor

2. leader_node_version 10s cache is a new AdminServer struct field not listed in Phase 1 scope

Lines 1394–1397:

"The version is also cached locally on the admin server with a short TTL (default 10 s) so repeated GetRaftGroups calls within a polling window do not fan out repeatedly."

A 10s per-leader version cache requires new state in AdminServer: some form of map[string]struct{ version string; fetchedAt time.Time } protected by a mutex (or sync.Map). This is another struct change alongside the *kv.ShardStore and dialer additions. It is not mentioned in the Phase 1 scope list (lines 1256–1461). Add it so the implementor adds the cache in the same commit that adds the async dial, rather than discovering later that repeated GetRaftGroups polls are hammering every leader. Fix this →


Round-12 items — all resolved ✓

| # | Item | Verdict |
|---|------|---------|
| 1 | Static AdminServer.members seed misses dynamically-added nodes | BeginBackup now uses snapshotMembers (adapter/admin_grpc.go:158), which unions the seed with live Configuration(); confirmed correct ✓ |
| 2 | GetRaftGroups async-dial latency contract unspecified | 500 ms timeout per leader, empty string on failure, 10 s response cache ✓ |
| 3 | GetNodeVersion conflates old-version and unreachable errors | Error message distinguishes "node X reports version vX" from "node X did not respond within Ns" ✓ |

Overall: the design is in excellent shape. Item 1 above (the connection pool gap) is the last structural issue before Phase 1 implementation can begin — without specifying it, two independent implementors will make incompatible choices about credentials and dial options. Item 2 is a one-line addition to the Phase 1 scope list. After those two fixes the proposal is ready to merge.

Important (verified against adapter/admin_grpc.go and
kv/grpc_conn_cache.go):
- AdminServer has no outbound gRPC connection mechanism today.
  GetNodeVersion fan-out (in BeginBackup) and the async
  leader_node_version probes (in GetRaftGroups) both need to dial
  peer admin servers. Without specifying the dialer, two
  implementors would make incompatible choices about credentials
  and connection lifecycle.

  Phase 1 explicitly adds a *kv.GRPCConnCache field to AdminServer
  using internal.GRPCDialOptions(), kept separate from
  ShardStore.connCache because admin and data-plane addresses
  occupy different address spaces (admin cache is keyed on
  NodeIdentity.GRPCAddress; data-plane cache is keyed on
  Raft-group-leader addresses, which migrate on leader change).

Minor:
- 10s leader_node_version cache is now an explicit AdminServer
  struct field (versionCache map[string]struct{...}) listed in
  Phase 1 scope rather than implicit prose.

Tests added: TestAdminGRPCConnCacheReuse (verifies cache reuse,
peer-shutdown eviction, independence from ShardStore.connCache).
@bootjp
Owner Author

bootjp commented Apr 29, 2026

Round-13 review (#706 (comment)) addressed in 6e62c7e7. Important item verified against adapter/admin_grpc.go and kv/grpc_conn_cache.go.

Important

  1. AdminServer has no outbound gRPC connection mechanism today. GetNodeVersion fan-out (in BeginBackup) and async leader probes (in GetRaftGroups) both need to dial peer admin servers. Without specifying the dialer, two implementors would make incompatible credential / lifecycle choices.

    Fix: Phase 1 adds a *kv.GRPCConnCache field to AdminServer, dialed with internal.GRPCDialOptions() — matching how ShardStore and the etcd-raft gRPC transport (internal/raftengine/etcd/grpc_transport.go:469) obtain peer connections today. Kept separate from ShardStore.connCache because admin and data-plane addresses occupy different address spaces (admin cache keyed on NodeIdentity.GRPCAddress; data-plane cache keyed on Raft-group-leader addresses that migrate on leader change).

Minor

  1. 10s leader_node_version cache now an explicit AdminServer field (versionCache map[string]struct{ version string; fetchedAt time.Time }) listed in Phase 1 scope rather than implicit prose.

New test: TestAdminGRPCConnCacheReuse.

@claude review

@claude

claude Bot commented Apr 29, 2026

Claude finished @bootjp's task in 5m 34s —— View job


Review: docs/design/2026_04_29_proposed_logical_backup.md (round 14)

  • Gather context
  • Read updated design document
  • Verify round-13 items addressed
  • Check referenced source files (kv/grpc_conn_cache.go, adapter/admin_grpc.go)
  • Post review feedback

Both round-13 items are addressed correctly — *kv.GRPCConnCache is now an explicit Phase 1 field on AdminServer with the rationale for keeping it separate from ShardStore.connCache, and versionCache is listed explicitly as a new struct field rather than implicit prose. One correctness gap surfaced when tracing GetNodeVersion fan-out through the actual AdminTokenAuth implementation.


Important

1. GetNodeVersion fan-out will fail with Unauthenticated on auth-enabled clusters

adapter/admin_grpc.go:455-506 (AdminTokenAuth) installs a unary interceptor that gates every Admin service method:

```go
var adminMethodPrefix = "/" + pb.Admin_ServiceDesc.ServiceName + "/"

unary := func(ctx context.Context, req any, info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler) (any, error) {
	if !strings.HasPrefix(info.FullMethod, adminMethodPrefix) {
		return handler(ctx, req) // pass through non-Admin methods
	}
	if err := check(ctx); err != nil { // check "authorization: Bearer <token>"
		return nil, err
	}
	return handler(ctx, req)
}
```

GetNodeVersion is part of the Admin service, so it is auth-gated on every peer that runs with a non-empty admin token. When BeginBackup fans out GetNodeVersion to peers via the new *kv.GRPCConnCache, the outbound gRPC calls are made with no outgoing metadata — confirmed by reading kv/grpc_conn_cache.go:77-79: ConnFor dials with internalutil.GRPCDialOptions() + WaitForReady(true), no auth metadata. No metadata.NewOutgoingContext call exists anywhere in the admin or backup code paths (confirmed: only metadata.FromIncomingContext appears at adapter/admin_grpc.go:461).

The result: on any cluster with --adminToken configured, BeginBackup fans out GetNodeVersion calls that fail with codes.Unauthenticated on every peer. The BeginBackup handler interprets this as "node unreachable" (FailedPrecondition: node X did not respond within Ns) rather than "node running wrong version", so the error message is misleading and every backup attempt fails.

The fix is straightforward: the design must specify that the BeginBackup handler forwards the inbound metadata to outbound GetNodeVersion calls — the Go gRPC pattern is:

```go
inMD, _ := metadata.FromIncomingContext(ctx)
outCtx := metadata.NewOutgoingContext(context.Background(), inMD)
// use outCtx for all GetNodeVersion fan-out calls
```

Or, alternatively, GetNodeVersion could be explicitly exempted from auth in the interceptor (a per-method allowlist), but that lowers the auth bar for a new endpoint. Document which approach is chosen.

Fix this →


Minor

2. versionCache synchronization is underspecified — (or sync.Map equivalent) is ambiguous

Phase 1 scope (line 1286) lists:

```go
versionCache map[string]struct{ version string; fetchedAt time.Time }
```

(or sync.Map equivalent)

GetRaftGroups reads versionCache synchronously while holding s.groupsMu.RLock() (line 387). The async goroutines spawned to probe GetNodeVersion(leader) write to versionCache from a goroutine that may still be running from a prior GetRaftGroups call. These two paths access the map concurrently.

If the implementor uses a plain map with:

  • groupsMu.Lock() for writes — safe but causes the write goroutine to wait until GetRaftGroups releases its RLock(), and all concurrent GetRaftGroups calls wait until the goroutine finishes writing. Correct, but adds lock contention.
  • groupsMu.RLock() for writes — incorrect; RLock allows concurrent readers but does not permit concurrent writers.
  • No lock — data race. -race will flag it.
  • Separate versionCacheMu sync.RWMutex — cleanest; reads under versionCacheMu.RLock(), writes under versionCacheMu.Lock(), independent of groupsMu.

sync.Map (the "(or equivalent)") eliminates the question entirely. The design should commit to one of: sync.Map, or a separate versionCacheMu (and specify it in the Phase 1 scope field list). The current phrasing leaves enough room for a data race.

Fix this →
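For concreteness, a sketch of the separate-mutex option with hypothetical names; sync.Map would replace all of this with Load/Store calls:

```go
package adapter

import (
	"sync"
	"time"
)

type versionEntry struct {
	version   string
	fetchedAt time.Time
}

// versionCache guarded by its own RWMutex, fully independent of groupsMu.
type versionCache struct {
	mu      sync.RWMutex
	entries map[string]versionEntry
}

func (c *versionCache) get(node string, ttl time.Duration) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	e, ok := c.entries[node]
	if !ok || time.Since(e.fetchedAt) > ttl {
		return "", false // miss or expired; caller probes the leader
	}
	return e.version, true
}

func (c *versionCache) put(node, version string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.entries == nil {
		c.entries = make(map[string]versionEntry)
	}
	c.entries[node] = versionEntry{version: version, fetchedAt: time.Now()}
}
```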


Round-13 items — all resolved ✓

| # | Item | Verdict |
|---|------|---------|
| 1 | AdminServer has no outbound gRPC mechanism for GetNodeVersion | *kv.GRPCConnCache added to Phase 1 scope (line 1274); rationale for keeping it separate from ShardStore.connCache documented ✓ |
| 2 | 10s leader_node_version cache was implicit prose, not an explicit struct field | versionCache now listed in Phase 1 scope with the field type and TTL contract (lines 1286–1292) ✓ |

Overall: the design is in excellent shape. Item 1 above (auth token propagation) is the last blocker — every BeginBackup attempt on a cluster with an admin token configured would fail with a misleading error. The fix is a one-sentence addition to the fan-out description and a one-liner code pattern. Item 2 is a concurrency-spec nit that prevents a potential data race. After those two are addressed, the proposal is ready to merge and Phase 1 implementation can begin.

Important (verified against adapter/admin_grpc.go:455-505):
- GetNodeVersion is itself an Admin method, gated by AdminTokenAuth
  on every auth-enabled cluster. The fan-out from BeginBackup
  through *kv.GRPCConnCache makes outbound calls with no metadata
  attached, so every dial would return Unauthenticated and the gate
  would misreport the failure as "node did not respond" on every
  cluster running --adminToken.

  Phase 1 explicitly propagates the inbound bearer token to
  outbound GetNodeVersion calls via metadata.NewOutgoingContext(
  context.Background(), inMD). The same pattern applies to the
  async GetNodeVersion(leader) probes in GetRaftGroups. Forwarding
  the inbound caller's token (not a server-stored value) preserves
  the operator's authorization decision and avoids privilege
  escalation by the admin server.

Minor:
- versionCache committed to sync.Map (with key/value types
  spelled out) so the implementor cannot accidentally produce a
  data race by reusing groupsMu for cache writes. The previous
  "(or sync.Map equivalent)" left enough room for a plain map +
  groupsMu approach that would either contend or race.

Tests added: TestBeginBackupPropagatesAdminAuthToken,
TestVersionCacheRaceUnderLoad (the latter runs under -race).
@bootjp
Owner Author

bootjp commented Apr 29, 2026

Round-14 review (#706 (comment)) addressed in 486675d4. Important item verified against adapter/admin_grpc.go:455-505.

Important

  1. GetNodeVersion is itself an Admin method and would be auth-gated on every cluster running --adminToken. The fan-out through *kv.GRPCConnCache makes outbound calls with no metadata, so every dial would return Unauthenticated and the gate would misreport "node did not respond" on every auth-enabled cluster.

    Fix: BeginBackup propagates the inbound bearer token to outbound GetNodeVersion calls via metadata.NewOutgoingContext(context.Background(), inMD). Same pattern for async GetNodeVersion(leader) probes in GetRaftGroups. Forwarding the inbound caller's token (not a server-stored value) preserves the operator's authorization decision — no privilege escalation by the admin server.

Minor

  1. versionCache committed to sync.Map with key/value types spelled out, removing the "(or sync.Map equivalent)" ambiguity that left room for a plain map + groupsMu approach that would either contend or race.

New tests: TestBeginBackupPropagatesAdminAuthToken, TestVersionCacheRaceUnderLoad (under -race).
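The resulting struct additions, sketched with a placeholder standing in for kv.GRPCConnCache (the real type lives in kv/grpc_conn_cache.go); pre-existing fields elided:

```go
package adapter

import (
	"sync"
	"time"
)

// GRPCConnCache stands in for kv.GRPCConnCache; details elided here.
type GRPCConnCache struct{}

// versionEntry is the value type stored in versionCache, keyed by node ID.
type versionEntry struct {
	version   string
	fetchedAt time.Time
}

// AdminServer with the Phase 1 additions.
type AdminServer struct {
	// ... self, members, groups, now, sampler ...
	adminConns   *GRPCConnCache // admin-to-admin peer dials, wired in main.go
	versionCache sync.Map       // node ID -> versionEntry, 10 s TTL applied on read
}
```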

@claude review

@claude

claude Bot commented Apr 29, 2026

Claude finished @bootjp's task in 4m 24s —— View job


Review: docs/design/2026_04_29_proposed_logical_backup.md (round 15)

  • Gather context
  • Read updated design document
  • Verify round-14 items addressed
  • Check referenced source files (kv/grpc_conn_cache.go, adapter/admin_grpc.go)
  • Post review feedback

Both round-14 items are addressed: the auth token propagation pattern is described correctly and versionCache sync.Map is explicitly specified with types. One important gap surfaced when reading kv/grpc_conn_cache.go against the code snippet added in round 14.


Important

1. metadata.NewOutgoingContext(context.Background(), inMD) has no deadline — WaitForReady(true) blocks indefinitely on unreachable peers

The code pattern introduced in round 14 (lines 1404–1407):

```go
inMD, _ := metadata.FromIncomingContext(ctx)
outCtx := metadata.NewOutgoingContext(context.Background(), inMD)
// every per-peer GetNodeVersion dial uses outCtx
```

correctly propagates the auth metadata, but context.Background() carries no deadline. GRPCConnCache.ConnFor (kv/grpc_conn_cache.go:79) sets grpc.WithDefaultCallOptions(grpc.WaitForReady(true)) as a default option on every connection it creates. WaitForReady(true) means a call made on a context with no deadline will block indefinitely until the server transitions to READY — an unreachable peer never does. A BeginBackup fan-out goroutine for an unreachable peer would hang forever, rather than returning FailedPrecondition: node X did not respond within 5s as the design promises.

The fix requires a per-goroutine deadline context (not a shared outCtx created before the loop, which would give all goroutines the same cancellable window):

```go
inMD, _ := metadata.FromIncomingContext(ctx)  // captured once, before spawning goroutines
// ── inside each goroutine ──
peerCtx, cancel := context.WithTimeout(context.Background(), perPeerDeadline)
defer cancel()
outCtx := metadata.NewOutgoingContext(peerCtx, inMD)
resp, err := stub.GetNodeVersion(outCtx, &pb.GetNodeVersionRequest{})
```

The same issue applies to GetRaftGroups's async leader probes: "500 ms timeout per leader" (line 1439) requires a per-goroutine context.WithTimeout(context.Background(), 500*time.Millisecond) before calling metadata.NewOutgoingContext.

The design should add the context.WithTimeout wrapper to the code pattern on lines 1404–1407, with a note that it applies per-goroutine. Without it, the described "per-peer budget" exists only in prose, not in the code an implementor will write.

Fix this →


Minor

2. GetRaftGroups discards its context — Phase 1 must stop doing that to capture inMD

GetRaftGroups today (adapter/admin_grpc.go:383) has the signature:

```go
func (s *AdminServer) GetRaftGroups(
    _ context.Context,
    _ *pb.GetRaftGroupsRequest,
) (*pb.GetRaftGroupsResponse, error)
```

The context is explicitly discarded (_). The auth-propagation pattern requires:

```go
inMD, _ := metadata.FromIncomingContext(ctx)  // captured during the handler
```

This can only work if Phase 1 changes the discarded parameter to ctx context.Context. The Phase 1 scope list (lines 1265–1509) does not mention this change. Add "adapter/admin_grpc.go: change the GetRaftGroups context parameter from _ to ctx" to the scope list so the implementor doesn't try to forward metadata from a discarded value. Fix this →


Round-14 items — all resolved ✓

| # | Item | Verdict |
|---|------|---------|
| 1 | GetNodeVersion fan-out would fail with Unauthenticated on auth-enabled clusters | metadata.NewOutgoingContext(context.Background(), inMD) pattern documented (lines 1404–1414); forwards the inbound token, not a server-stored value ✓ |
| 2 | versionCache synchronization left ambiguous, with room for a data race | Committed to sync.Map with key/value types spelled out (lines 1286–1297) ✓ |

Overall: the design is one small addition away from being ready. Item 1 (the missing context.WithTimeout) is the last correctness gap — as written, the BeginBackup handler would hang indefinitely on any unreachable peer rather than respecting the --begin-backup-deadline budget. Item 2 is a one-line scope-list addition. After both are addressed the proposal is ready to merge.

Split the proposal into Phase 0 (offline snapshot decoder) and Phase 1 (live PIT extraction)

The original proposal accumulated significant complexity chasing
cluster-wide point-in-time consistency: BackupPin Raft FSM commands,
GetNodeVersion fan-out, version gate, expected-keys baseline,
admin-side connection cache. That machinery is justified only when
multi-shard cross-group PIT is required.

For many use cases (single-shard clusters, one-time export off
elastickv, recovering from snapshot files when the cluster itself
is gone), a purely offline .fsm-to-directory-tree converter
delivers the original "vendor-independent recovery" goal with no
admin RPCs, no FSM tag, no Raft involvement.

Split:

Phase 0 (new doc, 2026_04_29_proposed_snapshot_logical_decoder.md):
- Offline cmd/elastickv-snapshot-decode and -encode tools
- Owns the per-adapter directory format definition
  (DynamoDB / S3 / Redis / SQS layouts, filename encoding,
  MANIFEST.json schema, KEYMAP.jsonl, the four-class
  internal-state policy).
- Restore via stop-replace-restart of the .fsm + .snap files.
- Trade-offs section explicitly maps when Phase 0 is sufficient
  vs. when Phase 1 is required.

Phase 1 (existing doc, retitled and prefaced):
- Title updated to "Live Cluster, PIT-Consistent Extraction
  (Phase 1)".
- New Phasing section near the top cross-references Phase 0 as
  the format owner.
- Body retains the live-cluster machinery: read_ts pinning,
  cross-shard catch-up, expected-keys baseline, version-gated
  rolling-upgrade safety, FSM-level pin propagation.
- MANIFEST.phase field discriminates dumps from the two phases
  ("phase0-snapshot-decode" vs. "phase1-live-pinned").

Both phases produce the same on-disk format so restorers and
external tools do not care which produced a given dump.
@bootjp
Owner Author

bootjp commented Apr 29, 2026

Proposal split into Phase 0 + Phase 1

After review feedback exposed how much complexity the live-cluster PIT machinery had accumulated (replicated BackupPin FSM commands, GetNodeVersion fan-out, version gate, expected-keys baseline, admin-side connection cache, auth-token propagation), the design has been split so the vendor-independent recovery goal — the actual user-facing requirement — can ship without dragging the entire live-PIT pipeline along.

Phase 0 — docs/design/2026_04_29_proposed_snapshot_logical_decoder.md (new)

  • Offline .fsm ↔ logical-format directory tree converter.
  • cmd/elastickv-snapshot-decode / cmd/elastickv-snapshot-encode.
  • Owns the format definition (per-adapter layout, filename encoding, MANIFEST.json, KEYMAP.jsonl, internal-state policy).
  • No admin RPCs, no FSM tag, no Raft involvement, no proto changes.
  • Restore via stop-replace-restart of the .fsm + .snap files.
  • Sufficient for single-shard clusters, one-time exports off elastickv, and any case where the latest available snapshot is a good-enough recovery point.

Phase 1 — docs/design/2026_04_29_proposed_logical_backup.md (this PR's existing doc, retitled)

  • Title updated to "Live Cluster, PIT-Consistent Extraction (Phase 1)".
  • New Phasing section near the top cross-references Phase 0 as the format owner.
  • Body retains the live-cluster machinery: read_ts pinning, cross-shard catch-up, expected-keys baseline, version-gated rolling-upgrade safety, FSM-level pin propagation.
  • MANIFEST.phase field discriminates dumps from the two phases ("phase0-snapshot-decode" vs. "phase1-live-pinned").
  • Required only when cross-shard PIT consistency or "snapshot now" cadence is needed.

Both phases produce the same on-disk format, so external recovery tools (aws s3 sync, aws dynamodb put-item, redis-cli --pipe, …) do not care which phase produced a given dump.
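For illustration, the discriminator as a Go shape; only the phase field and its two values are taken from the documents, the rest is assumption:

```go
package backup

// Manifest sketches the MANIFEST.json fields discussed in this thread.
// Only Phase (the discriminator) and its two values are grounded in the
// documents; the remaining fields are illustrative assumptions.
type Manifest struct {
	FormatVersion int    `json:"format_version"`
	Phase         string `json:"phase"` // "phase0-snapshot-decode" or "phase1-live-pinned"
	ReadTS        uint64 `json:"read_ts,omitempty"` // phase 1 only: the pinned timestamp
	CreatedAt     string `json:"created_at"`
}
```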

Commit: 4ee0f3fa


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
docs/design/2026_04_29_proposed_snapshot_logical_decoder.md (1)

13-13: Add language identifiers to fenced code blocks.

These blocks currently trigger MD040 warnings. Please tag them (text, json, go, protobuf, etc.) so docs lint passes cleanly.

Also applies to: 108-108, 209-209, 298-298, 315-315, 420-420, 438-438, 472-472, 481-481

docs/design/2026_04_29_proposed_logical_backup.md (1)

42-42: Standardize fenced code blocks with explicit languages.

These blocks are currently untyped and trigger MD040. Adding explicit languages will improve lint reliability and readability.

Also applies to: 148-148, 265-265, 354-354, 386-386, 458-458, 487-487, 518-518, 523-523, 542-542, 553-553, 578-578, 868-868, 1099-1099, 1122-1122, 1136-1136, 1490-1490

📥 Commits

Reviewing files that changed from the base of the PR and between 579d193 and 4ee0f3f.

📒 Files selected for processing (2)
  • docs/design/2026_04_29_proposed_logical_backup.md
  • docs/design/2026_04_29_proposed_snapshot_logical_decoder.md

Comment on lines +217 to +237
- If the resulting segment exceeds 240 bytes, the segment is replaced with
`<sha256-hex-prefix-32>__<truncated-original>` and the full original
bytes are recorded in `<adapter>/<scope>/KEYMAP.jsonl` (the same
KEYMAP file used for filename round-trip — long-key fallback and
encoded-name reverse-lookup share one append-only stream).
- **DynamoDB binary partition / sort keys** (the `B` attribute type) are
rendered as `b64.<base64url>` so a binary key never collides with a
string key whose hex encoding happens to look like base64.
- **Stream messages** within an SQS queue are not represented by file
per message (would be millions of small files); they live in a single
`messages.jsonl`. Same rationale for Redis stream entries.

`KEYMAP.jsonl` at each adapter scope root translates the encoded
filename back to the exact original bytes, in case the user needs to
feed the data into a system that requires the verbatim key. JSONL
(one mapping per line, never loaded fully) is used so the file scales
to millions of keys without becoming a memory bottleneck on either the
producer or a downstream consumer. The translation is also losslessly
recoverable from the encoded filename alone — `KEYMAP.jsonl` is a
convenience, not a correctness dependency. A consumer that does not
need it can ignore it entirely.


⚠️ Potential issue | 🟠 Major

Filename reversibility claim overstates correctness guarantees.

Lines 217–220 introduce the SHA+truncation fallback with KEYMAP, but lines 235–237 say KEYMAP is not a correctness dependency. For the long-key fallback and collision renames, KEYMAP is required to reconstruct the original bytes exactly.


Comment on lines +1080 to +1084
| Class | Examples | Backup behavior |
|---|---|---|
| **Re-derivable from user data + config** | DynamoDB GSI rows; S3 route catalog (`!s3route|`); Redis TTL scan index (`!redis|ttl|`); SQS visibility / age / dedup / group / tombstone indexes; route catalog under default group | Excluded by default. Restore re-builds them as user data is replayed. |
| **Per-cluster operational state** | HLC physical ceiling; Raft term/index/conf state; FSM marker files (`raft-engine`, `EKVR`/`EKVM`/`EKVW` magic files in `internal/raftengine/etcd/persistence.go`); write conflict counter | Never dumped. They belong to the cluster, not the data. A restore initializes them fresh. |
| **In-flight transactional state** | `!txn|` intent / lock / resolver records (`kv/txn_keys.go`, `kv/txn_codec.go`); pending S3 multipart uploads | Excluded by default. Optionally dumped under `_internals/` for forensics, but never re-applied by the restore path — replaying intents from a stale read_ts can resurrect aborted transactions. |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Internal-state markdown table is malformed due to unescaped pipes.

Lines 1082–1084 include | inside inline code spans (e.g., !s3route|, !txn|), which breaks column parsing (MD056). Escape the pipes as \| or rewrite the examples without a raw | so all three columns render correctly.

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 1082-1082: Table column count
Expected: 3; Actual: 6; Too many cells, extra data will be missing

(MD056, table-column-count)


[warning] 1084-1084: Spaces inside code span elements

(MD038, no-space-in-code)


[warning] 1084-1084: Spaces inside code span elements

(MD038, no-space-in-code)


[warning] 1084-1084: Table column count
Expected: 3; Actual: 4; Too many cells, extra data will be missing

(MD056, table-column-count)


Comment on lines +167 to +182
- If the resulting segment exceeds 240 bytes, it becomes
`<sha256-prefix-32>__<truncated-original>` and the full original
bytes are recorded in `<adapter>/<scope>/KEYMAP.jsonl` — same file
used for filename round-trip and S3 path-collision rename
bookkeeping.
- **Stream messages** within an SQS queue are not represented by file
per message (millions of small files); they live in a single
`messages.jsonl`. Same rationale for Redis stream entries.

`KEYMAP.jsonl` at each adapter scope root translates encoded
filenames back to original bytes when needed. JSONL (one mapping per
line) so the file scales to millions of keys without becoming a
memory bottleneck. The translation is also losslessly recoverable
from the encoded filename alone — `KEYMAP.jsonl` is a convenience,
not a correctness dependency. A consumer that does not need it can
ignore it entirely.


⚠️ Potential issue | 🟠 Major

KEYMAP cannot be optional when SHA fallback is used.

Lines 167–170 say oversized segments are replaced with <sha256-prefix-32>__<truncated-original> and the full bytes go to KEYMAP.jsonl; lines 180–182 then say recovery is lossless from the filename alone and KEYMAP is only convenience. Those two statements conflict. For SHA-fallback entries, KEYMAP is required for full reverse mapping.

Suggested doc fix
```diff
-`KEYMAP.jsonl` at each adapter scope root translates encoded
-filenames back to original bytes when needed. JSONL (one mapping per
-line) so the file scales to millions of keys without becoming a
-memory bottleneck. The translation is also losslessly recoverable
-from the encoded filename alone — `KEYMAP.jsonl` is a convenience,
-not a correctness dependency. A consumer that does not need it can
-ignore it entirely.
+`KEYMAP.jsonl` at each adapter scope root translates encoded
+filenames back to original bytes when needed. JSONL (one mapping per
+line) so the file scales to millions of keys without becoming a
+memory bottleneck. For plain `%HH`-encoded segments, reverse mapping
+is recoverable from the filename alone; for SHA-fallback or collision-
+renamed paths, `KEYMAP.jsonl` is required for lossless reversal.
+A consumer that does not need original-key recovery can ignore it.
```
🧰 Tools
🪛 LanguageTool

[style] ~181-~181: Consider using “who” when you are referring to a person instead of an object.
Context: ...ot a correctness dependency. A consumer that does not need it can ignore it entirely...

(THAT_WHO)


@bootjp
Owner Author

bootjp commented Apr 29, 2026

@claude review

@claude

claude Bot commented Apr 29, 2026

Claude finished @bootjp's task in 6m 55s —— View job


Review: docs/design/2026_04_29_proposed_logical_backup.md + docs/design/2026_04_29_proposed_snapshot_logical_decoder.md (round 16)

  • Gather context
  • Read both design documents (post Phase 0/1 split)
  • Verify round-15 items addressed
  • Check referenced source files
  • Post review feedback

The Phase 0/Phase 1 split is well-structured. Two items from round 15 are still open; two correctness issues surfaced in the new Phase 0 decoder dispatch table; two CodeRabbit-flagged items need addressing.


Still open from round 15

1. context.Background() in the auth snippet still has no deadline — WaitForReady(true) will block indefinitely

The code snippet at lines 1431–1435 still reads:

```go
inMD, _ := metadata.FromIncomingContext(ctx)
outCtx := metadata.NewOutgoingContext(context.Background(), inMD)
// every per-peer GetNodeVersion dial uses outCtx
```

kv/grpc_conn_cache.go:79 confirms WaitForReady(true) is set as a default call option on every connection obtained from ConnFor. Any GetNodeVersion goroutine hitting an unreachable peer with context.Background() will block indefinitely instead of honoring the per-peer deadline described in the prose.

The per-goroutine context.WithTimeout must be created inside each goroutine, not in a shared outCtx before the loop (a shared timeout cancels all goroutines simultaneously; each needs its own independent deadline):

```go
inMD, _ := metadata.FromIncomingContext(ctx)  // captured once, before spawning goroutines
// ── inside each goroutine ──
peerCtx, cancel := context.WithTimeout(context.Background(), perPeerDeadline)
defer cancel()
outCtx := metadata.NewOutgoingContext(peerCtx, inMD)
resp, err := stub.GetNodeVersion(outCtx, &pb.GetNodeVersionRequest{})
```

The same applies to the 500 ms timeout in the GetRaftGroups async leader probes. Update the snippet on line 1433 to show the per-goroutine pattern, with a note that peerCtx must be created inside each goroutine. Fix this →

2. GetRaftGroups discards its context — auth propagation cannot be implemented as described

adapter/admin_grpc.go:382–385:

```go
func (s *AdminServer) GetRaftGroups(
    _ context.Context,
    _ *pb.GetRaftGroupsRequest,
) (*pb.GetRaftGroupsResponse, error) {
```

Both the context and the request are discarded. The async leader-probe pattern described in the design (capturing inMD, _ := metadata.FromIncomingContext(ctx) to forward auth to GetNodeVersion(leader)) requires the context to not be discarded. Add to Phase 1 scope: "change GetRaftGroups context parameter from _ to ctx." Fix this →


Important (new — Phase 0 decoder dispatch table)

3. "!sqs|msg|" dispatch prefix is too broad — internal SQS records will corrupt the decoder output

The decoder pipeline (lines 453–454 in Phase 0 doc):

"!sqs|queue|" → SQS queue meta encoder
"!sqs|msg|"   → SQS message encoder

Checked against adapter/sqs_keys.go and adapter/sqs_messages.go:

| Prefix | Kind | Should be |
|---|---|---|
| `!sqs\|queue\|meta\|` | queue metadata record | queue meta encoder |
| `!sqs\|queue\|gen\|` | generation counter (internal) | drop |
| `!sqs\|queue\|seq\|` | sequence counter (internal) | drop |
| `!sqs\|queue\|tombstone\|` | tombstone marker (internal) | drop |
| `!sqs\|msg\|data\|` | message record | message encoder |
| `!sqs\|msg\|dedup\|` | dedup index (internal) | drop |
| `!sqs\|msg\|group\|` | group index (internal) | drop |
| `!sqs\|msg\|byage\|` | age index (internal) | drop |

"!sqs|queue|" matches all four queue prefixes, so generation, sequence, and tombstone records land in the schema encoder. "!sqs|msg|" matches all four message prefixes, so dedup, group, and byage index entries land in the message encoder and would produce malformed messages.jsonl output (likely failing to unmarshal the storedSQSMsgPrefix envelope, or silently emitting garbage JSON lines).

Fix: use the exact prefixes "!sqs|queue|meta|" and "!sqs|msg|data|" in the dispatch table, with explicit drop entries for the other sub-prefixes. Fix this →

4. "!ddb|meta|" dispatch prefix is too broad — DynamoDB generation counter records will corrupt schema output

The decoder pipeline (line 444):

"!ddb|meta|"  → DynamoDB schema encoder

kv/shard_key.go:18–20:

```go
DynamoTableMetaPrefix       = "!ddb|meta|table|" // table schema
DynamoTableGenerationPrefix = "!ddb|meta|gen|"   // per-table generation counter (internal)
```

"!ddb|meta|" matches both. The generation counter record (!ddb|meta|gen|) is an internal elastickv tracking key — its value is a generation uint64, not a DescribeTable-shaped JSON. Feeding it into the DynamoDB schema encoder would either fail to unmarshal or silently produce a malformed _schema.json entry that corrupts the table schema.

Fix: change the dispatch entry to "!ddb|meta|table|" and add a drop entry for "!ddb|meta|gen|". Fix this →
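For concreteness, a sketch of the corrected dispatch table covering items 3 and 4; the prefixes are the ones cited above from kv/shard_key.go and adapter/sqs_keys.go, and the handler names are hypothetical.

```go
package decoder

// decodeFunc consumes one stored record.
type decodeFunc func(key, value []byte) error

// drop discards internal or re-derivable records per the
// internal-state policy.
func drop(_, _ []byte) error { return nil }

// Stubs standing in for the real encoders.
func encodeDynamoSchema(_, _ []byte) error { return nil }
func encodeSQSQueueMeta(_, _ []byte) error { return nil }
func encodeSQSMessage(_, _ []byte) error   { return nil }

// Corrected dispatch: exact sub-prefixes, with explicit drop entries so
// internal records can never fall through to a data encoder.
var dispatch = []struct {
	prefix string
	handle decodeFunc
}{
	{"!ddb|meta|table|", encodeDynamoSchema},
	{"!ddb|meta|gen|", drop}, // internal generation counter
	{"!sqs|queue|meta|", encodeSQSQueueMeta},
	{"!sqs|queue|gen|", drop},       // internal
	{"!sqs|queue|seq|", drop},       // internal
	{"!sqs|queue|tombstone|", drop}, // internal
	{"!sqs|msg|data|", encodeSQSMessage},
	{"!sqs|msg|dedup|", drop}, // re-derived on restore
	{"!sqs|msg|group|", drop}, // re-derived on restore
	{"!sqs|msg|byage|", drop}, // re-derived on restore
}
```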


Minor (CodeRabbit-flagged, both docs)

5. KEYMAP.jsonl optionality claim is incorrect when SHA fallback is used

Both documents contain the statement: "The translation is also losslessly recoverable from the encoded filename alone — KEYMAP.jsonl is a convenience, not a correctness dependency."

This is false for SHA-fallback filenames. When a key segment exceeds 240 bytes, it becomes <sha256-prefix-32>__<truncated-original> — the truncated original cannot be reconstructed from the filename alone. The KEYMAP.jsonl entry is the only place the full original bytes exist in the dump. Similarly for S3 collision-renamed keys (.elastickv-leaf-data suffix).

Fix in both docs: restate the claim accurately — KEYMAP is a convenience for the common case (pure %HH-encoded segments where reverse mapping is mechanical), but is a correctness dependency for SHA-fallback and collision-renamed entries. A consumer that never needs the verbatim original key can ignore it; one that does must read it. Fix this →
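A sketch of the segment encoder makes the dependency visible; percentEncode and the 240/32 constants follow the documented scheme, and all names are hypothetical.

```go
package decoder

import (
	"crypto/sha256"
	"encoding/hex"
	"strings"
)

const maxSegment = 240 // documented per-segment byte limit

// encodeSegment returns the filename segment for key and whether the
// caller must record the original bytes in KEYMAP.jsonl. In the SHA
// branch the truncated tail is unrecoverable from the name alone, so
// KEYMAP is a correctness dependency there.
func encodeSegment(key []byte) (name string, needsKeymap bool) {
	enc := percentEncode(key)
	if len(enc) <= maxSegment {
		return enc, false // mechanically reversible from the name itself
	}
	sum := sha256.Sum256(key)
	prefix := hex.EncodeToString(sum[:16]) // 32 hex chars
	// Truncation may split a %HH triple; harmless, since reversal for
	// this branch goes through KEYMAP.jsonl, never through the name.
	return prefix + "__" + enc[:maxSegment-len(prefix)-2], true
}

// percentEncode keeps RFC3986 unreserved bytes and %HH-escapes the rest.
func percentEncode(b []byte) string {
	const unreserved = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" +
		"abcdefghijklmnopqrstuvwxyz0123456789-._~"
	var out []byte
	for _, c := range b {
		if strings.IndexByte(unreserved, c) >= 0 {
			out = append(out, c)
		} else {
			out = append(out, '%', hexDigit(c>>4), hexDigit(c&0xf))
		}
	}
	return string(out)
}

func hexDigit(n byte) byte {
	if n < 10 {
		return '0' + n
	}
	return 'A' + n - 10
}
```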

6. Phase 1 internal-state table has unescaped | in code spans — broken rendering

Phase 1 doc, line 1082:

| ... | ... (`!s3route|`); ... (`!redis|ttl|`); ... | ... |

The | inside code spans (!s3route|, !redis|ttl|) break table column parsing (markdownlint MD056: "Expected: 3; Actual: 6"). The Phase 0 doc already fixes this with \| escaping (line 407: `!s3route\|`, `!redis\|ttl\|`). Apply the same escaping to line 1082 in the Phase 1 doc. Fix this →


Round-15 items — status

| # | Item | Verdict |
|---|------|---------|
| 1 | context.Background() has no deadline, so WaitForReady(true) blocks indefinitely | Still open — line 1433 unchanged |
| 2 | GetRaftGroups discards its context, so auth propagation is unimplementable | Still open — not in the Phase 1 scope list |

Overall: the Phase 0/Phase 1 split is clean and the format specification is solid. Items 3 and 4 above are correctness blockers for Phase 0 implementation — a decoder that dispatches on these over-broad prefixes will silently corrupt its output. Items 1 and 2 are the round-15 carry-overs: without them the GetNodeVersion fan-out hangs on unreachable peers and the auth-forwarding pattern cannot be implemented as described. After all four are addressed the design is ready to merge.
