
docs(design): propose per-adapter logical backup format #705

Open · bootjp wants to merge 15 commits into main from docs/proposed-logical-backup

Conversation

@bootjp (Owner) commented Apr 29, 2026

Per-adapter directory hierarchy (dynamodb/<table>/items/<pk>.json,
s3/<bucket>/<original/path>, redis/db_<n>/<type>/<key>.{bin,json,jsonl},
sqs/<queue>/messages.jsonl) so a backup is recoverable without elastickv —
cat / jq / find / aws s3 sync / redis-cli are sufficient. Strips internal
magic prefixes and wide-column key encodings; pins read_ts via the lease-
read path and the active timestamp tracker so the dump is consistent and
the compactor cannot retire its versions mid-dump.
@coderabbitai (Bot) commented Apr 29, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 15 minutes and 24 seconds before requesting another review.

To keep reviews running without waiting, you can enable the usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 18073263-73e7-41fa-866b-4bf8c6cef5a4

📥 Commits

Reviewing files that changed from the base of the PR and between 579d193 and 486675d.

📒 Files selected for processing (1)
  • docs/design/2026_04_29_proposed_logical_backup.md

@bootjp (Owner, Author) commented Apr 29, 2026

@claude review

@claude (Bot) commented Apr 29, 2026

Claude finished @bootjp's task in 2m 47s.


Code Review: docs/design/2026_04_29_proposed_logical_backup.md

  • Gather context and understand the request
  • Read the design document
  • Analyze design for correctness, completeness, and consistency
  • Check against existing codebase patterns and conventions
  • Post review feedback

Overall this is a well-motivated, thorough design. The core insight — that the FSM snapshot is a one-way artifact and the backup format should be the logical structure of the data, recoverable with first-class OS tools — is exactly right. The restore going through public adapter endpoints rather than direct FSM apply is the correct long-term call for version resilience. A few issues need addressing before implementation begins:


Issues

1. ActiveTimestampTracker.Pin() has no TTL — the auto-expiry design assumption is missing infrastructure

The design states (§ Read-Side Consistency):

A tool-side crash leaves the tracker entry behind for at most backup_ts_ttl (default 30 minutes), after which the registration auto-expires so the compactor is not blocked indefinitely.

But kv/active_timestamp_tracker.go has no TTL or deadline mechanism — Pin(ts uint64) returns a token, and the token only releases when Release() is called explicitly. A crash leaves the entry permanently until the process restarts. Phase 1 must include a TTL extension to ActiveTimestampTracker (a goroutine-based or compactor-checked wall-clock deadline on the token). This is non-trivial and should be listed as an explicit Phase 1 deliverable, not implied by the backup_ts_ttl default.
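
A minimal sketch of what that Phase 1 deliverable could look like, assuming the tracker guards a token-to-pin map with a mutex. The type and both methods below are hypothetical illustrations, not the repo's actual API:

```go
package kv

import (
	"sync"
	"time"
)

// trackerSketch stands in for the real ActiveTimestampTracker; only the
// deadline mechanics are the point here.
type trackerSketch struct {
	mu   sync.Mutex
	next uint64
	pins map[uint64]pinEntry // token -> pinned ts + expiry
}

type pinEntry struct {
	ts       uint64
	deadline time.Time
}

// PinWithDeadline registers ts and returns a token; unlike the existing
// Pin, the registration lapses at deadline even if Release is never called.
func (t *trackerSketch) PinWithDeadline(ts uint64, deadline time.Time) uint64 {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.pins == nil {
		t.pins = map[uint64]pinEntry{}
	}
	t.next++
	t.pins[t.next] = pinEntry{ts: ts, deadline: deadline}
	return t.next
}

// sweep drops expired pins; run it once per second from a goroutine, or
// call it from the compactor's admission check.
func (t *trackerSketch) sweep(now time.Time) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for tok, p := range t.pins {
		if now.After(p.deadline) {
			delete(t.pins, tok)
		}
	}
}
```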



2. BeginBackup / EndBackup RPCs are missing from proto/admin.proto, and the service boundary is wrong

proto/admin.proto defines Admin as a read-only admin service (GetClusterOverview, GetRaftGroups, etc.). BeginBackup / EndBackup are mutations (they register/release a compaction pin). Adding mutation RPCs to a read-only service contradicts the stated intent of the service. The design should either:

  • Define a new Backup service in a new backup.proto, or
  • Explicitly acknowledge and justify that Admin will grow mutation RPCs.

This also affects the auth story — read-only operations can use a lower-privilege token; BeginBackup holds a compaction pin and could be used as a DoS vector against the compactor without appropriate rate limits or role-based guards.


3. kv.ShardedCoordinator.ScanRange does not exist

The design cites kv.ShardedCoordinator.ScanRange as an existing primitive, but ShardedCoordinator has no ScanRange method. The MVCC layer has ScanAt (store/mvcc_store.go:331), but that operates on raw keys. For the backup to reconstruct, for example, a DynamoDB item from wide-column !hs|fld|<keyLen><key><field> entries, the scan must be adapter-aware — it needs to understand which wide-column keys belong to which user key. This is non-trivial, and Phase 1 needs to explicitly design the ScanRange abstraction, including how it drives the per-adapter decoders.
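
A rough shape for that abstraction, assuming a paged ScanAt-style primitive underneath; every name below (kvPair, scanPage, BackupScanner) is illustrative, not existing code:

```go
package kv

// kvPair and scanPage stand in for the real MVCC types; only the paging
// and decode hand-off are the point of this sketch.
type kvPair struct{ Key, Value []byte }

type scanPage func(readTS uint64, startAfter []byte, limit int) ([]kvPair, error)

// BackupScanner walks one adapter's key range at a pinned read_ts in fixed
// pages, feeding raw pairs to an adapter-aware decoder that regroups
// wide-column entries under their user key.
type BackupScanner struct {
	scan     scanPage
	readTS   uint64
	pageSize int
	cursor   []byte // last key of the previous page; nil on the first call
	done     bool
}

func (s *BackupScanner) Next() ([]kvPair, error) {
	if s.done {
		return nil, nil
	}
	page, err := s.scan(s.readTS, s.cursor, s.pageSize)
	if err != nil || len(page) == 0 {
		s.done = true
		return nil, err
	}
	s.cursor = append([]byte(nil), page[len(page)-1].Key...)
	s.done = len(page) < s.pageSize
	return page, nil
}
```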


4. Redis HyperLogLog is missing from the canonical directory layout

The layout (§ Top-Level Layout) lists strings/, hashes/, lists/, sets/, zsets/, streams/, but hll/ is absent. It appears correctly later in the per-adapter section (line 395: hll/<key>.bin). Add hll/ to the top-level layout tree for completeness; implementors will use the tree as the reference.


5. single.json sentinel for no-sort-key DynamoDB items is undefined

The example shows:

b64.AAECAw../single.json     # binary partition key

single.json is used as the filename when a table has no sort key, but this sentinel is never defined or justified in the Filename Encoding section. What prevents a sort key value whose percent-encoded form is single? The collision risk needs to be closed: either reserve a different sentinel (e.g., _item.json) or document that single is a reserved name distinct from any encoded sort key (because encoded sort keys always contain at least one %-escape for the .json suffix to be unambiguous).


6. Redis LPUSH order example is misleading

Line 371:

lists/<key>.json: Order is left-to-right (LPUSH [3,2,1] produces ["3","2","1"]).

LPUSH key 3 2 1 pushes 3, then 2, then 1 to the head, so LRANGE key 0 -1 returns ["1","2","3"] (last-pushed is head). If the intent is that the dump preserves LRANGE order (head-to-tail), restate it as: "Items are in LRANGE head-to-tail order: LPUSH key 3 2 1 → ["1","2","3"]."
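
The semantics are quick to verify with a few lines of Go modeling LPUSH head insertion:

```go
package main

import "fmt"

// lpush prepends each value in turn, exactly as Redis LPUSH does: the
// last-pushed value ends up at the head.
func lpush(list []string, vals ...string) []string {
	for _, v := range vals {
		list = append([]string{v}, list...)
	}
	return list
}

func main() {
	fmt.Println(lpush(nil, "3", "2", "1")) // [1 2 3] — the LRANGE 0 -1 order
}
```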


7. The 50 ms skew buffer in cross-shard consistency is a new magic constant

§ Cross-shard consistency:

read_ts is admissible as long as it is below each group's last-applied commit-ts … which BeginBackup enforces by adding a small skew buffer (default 50 ms) matching the existing TSO buffer.

The existing TSO buffer is defined in the lease read implementation (2026_04_20_implemented_lease_read.md). If that document specifies a different value or a different mechanism, 50 ms could be wrong. Cite the exact constant / function from the implemented code rather than introducing a new literal, to avoid drift.


8. CHECKSUMS crash-safety ordering is underspecified

The manifest crash-safety guarantee is: "Producer writes MANIFEST.json last; restore refuses dumps without a manifest." But CHECKSUMS is a peer of MANIFEST.json at the root. Two questions:

  1. Is CHECKSUMS written before or after MANIFEST.json? (If before, a crash between the two leaves a CHECKSUMS with no manifest — is that detectable?)
  2. Does CHECKSUMS include a hash of itself? (It cannot.) Does MANIFEST.json include the hash of CHECKSUMS? That would tie the two together safely. Specify the write order explicitly.

9. Phase 3 dependency on keyviz_cluster_fanout design seems misplaced

§ Phase 3:

Concurrent multi-cluster fan-out … depends on the Distribution control plane being fan-out-aware (see 2026_04_27_proposed_keyviz_cluster_fanout.md).

2026_04_27_proposed_keyviz_cluster_fanout.md is the Key Visualizer fanout design. That is a read-path sampling pipeline, not the Distribution control plane's routing/shard-fan-out primitive. Either this reference is wrong, or it should cite a different proposed design doc for the distribution fan-out. Clarify.


10. docs/design/README.md is not updated

Per CLAUDE.md convention, docs/design/README.md indexes all design docs. This PR adds a new proposed doc but does not update the index. Add an entry for 2026_04_29_proposed_logical_backup.md.


Minor / Non-blocking

  • The --mode replace locking semantics ("live traffic is rejected during the replace window") in TestRestoreReplaceMode are unspecified at the design level. If there is no in-cluster fencing mechanism, the test description over-promises.
  • String TTL sidecar (strings_ttl.json) vs. inline expire_at_ms in hash/list/set/zset JSON is a deliberate asymmetry (preserves cat convenience for binary string payloads). Consider adding a one-line justification inline so implementors don't "fix" it.
  • The design document itself is well-structured and the internal-state classification table is excellent.

@gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces a design for a logical backup format that decomposes snapshots into adapter-specific, human-readable structures to ensure vendor-independent recovery. The proposal outlines a directory-based layout for DynamoDB, S3, Redis, and SQS, utilizing a cluster-wide read timestamp for point-in-time consistency. Review feedback highlighted several improvement opportunities: mitigating the operational risk of the 30-minute backup TTL with a heartbeat mechanism, preventing S3 metadata filename collisions by using a dedicated directory, and improving scalability and parsing consistency for Redis TTLs and stream metadata by adopting JSONL and sidecar files.

lease reads already use; no new code path through MVCC is introduced.
3. `Admin.EndBackup` releases the tracker registration. A
tool-side crash leaves the tracker entry behind for at most
`backup_ts_ttl` (default 30 minutes, configurable via the same admin
Severity: high

The 30-minute default TTL for the read_ts registration in the active timestamp tracker represents a significant operational risk for large-scale backups. If the dump takes longer than 30 minutes, the compactor could retire MVCC versions while the backup is still scanning, causing the backup to fail. Per repository guidelines, design documents identifying such risks must detail mitigation strategies, such as implementing a heartbeat mechanism from the producer to the tracker to keep the registration alive.

Suggested change
`backup_ts_ttl` (default 30 minutes, configurable via the same admin
backup_ts_ttl (default 30 minutes, heartbeated by the producer or configurable via the same admin
References
  1. When a design document identifies a significant operational risk, it must also detail potential mitigation strategies.

│ └── 04/
│ └── 29/
│ ├── img.jpg
│ ├── img.jpg.metadata.json
Severity: medium

Using a fixed .metadata.json suffix for sidecar files can lead to collisions with user-defined S3 object keys that happen to end with the same string. This could result in user data being overwritten or the restore tool misinterpreting metadata as user content. Using a dedicated directory for metadata is a safer approach.

Suggested change
│ ├── img.jpg.metadata.json
│ ├── _metadata/img.jpg.json

├── strings/
│ ├── session%3Aabc123.bin # SET session:abc123 …
│ └── counter%3Aglobal.bin # SET counter:global "42"
├── strings_ttl.json # { "session%3Aabc123": 1735689600000, ... }
Severity: medium

Storing all string TTLs in a single JSON file per database scales poorly for large datasets. A single JSON object requires loading the entire index into memory to parse or update. Using a .jsonl format would allow for streaming and better scalability, adhering to the decomposed nature of the backup.

Suggested change
├── strings_ttl.json # { "session%3Aabc123": 1735689600000, ... }
├── strings_ttl.jsonl # {"format_version": 1, "key": "session%3Aabc123", "expire_at_ms": 1735689600000}

Comment on lines +388 to +393
followed (optionally) by a trailing meta line:
```
{"_meta": true, "length": 2, "last_ms": 1714400000001, "last_seq": 0}
```
The meta line lets a restore tool seed `XADD` IDs without re-deriving
them from the entries.
Severity: medium

Including a metadata line with a different schema at the end of a .jsonl file makes the file heterogeneous and harder to process with standard line-oriented tools like jq. Moving this to a separate _stream.json file would be more consistent with the patterns used for other adapters.

Suggested change
followed (optionally) by a trailing meta line:
```
{"_meta": true, "length": 2, "last_ms": 1714400000001, "last_seq": 0}
```
The meta line lets a restore tool seed `XADD` IDs without re-deriving
them from the entries.
followed (optionally) by a sidecar _stream.json containing the stream's
length and last ID. This lets a restore tool seed XADD IDs without
re-deriving them from the entries.

Critical:
- Replace nonexistent kv.ShardedCoordinator.ScanRange with the actual
  ShardStore.ScanAt (kv/shard_store.go:106) plus a thin BackupScanner
  iterator wrapper for paged scans.
- Specify ActiveTimestampTracker.PinWithDeadline plus a per-second
  sweeper goroutine. The existing Pin has no TTL, so a crashed
  producer would block compaction forever without this.

Important:
- Add hll/ to the Redis layout tree.
- Replace .metadata.json sidecar with .elastickv-meta.json reserved
  suffix; document collision handling and rename-collisions flag.
- Spell out cross-shard wait: BeginBackup polls each group applied_index
  until it covers read_ts (5s default deadline) before any scan starts.
  The 50ms skew alone was a liveness assertion, not an enforced wait.
- Specify CHECKSUMS as sha256sum(1)-compatible (verifiable with
  sha256sum -c CHECKSUMS from the dump root).
- Add ListAdaptersAndScopes RPC alongside BeginBackup/EndBackup with
  ttl_ms / pin_token.

Minor:
- Redis TTL: default skip-expired on restore; preserve-ttl opt-in.
- aws s3 sync example needs exclude pattern for sidecars.
- DynamoDB example clarified: composite-key vs hash-only layouts.
- SQS dedup-window reset moved into trade-offs as an explicit callout.
- Redis stream merge mode: refuse non-empty target; auto-id strategy
  falls back to XADD * with an _id_remap.jsonl mapping log.

Tests added: TestPinWithDeadlineExpiry,
TestBeginBackupWaitsForLaggingShard, TestBackupScannerPaging,
TestS3SidecarSuffixCollision, TestRedisTTLExpiredKeySkippedByDefault,
TestRedisStreamMergeRejectsNonEmpty.
@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5719128131

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +637 to +639
5. **Renew on long dumps**: the producer calls `BeginBackup` again with
the same `read_ts` (carrying the existing token) every `ttl_ms / 3`
to extend the deadline. A multi-hour dump never relies on a single

P1: Add a true pin-renewal RPC contract

Step 5 requires renewing an existing backup pin by calling BeginBackup with the same read_ts and token, but the proposed BeginBackupRequest only contains ttl_ms and cannot carry either value. In a long-running dump, the producer therefore has no API path to extend the original pin and will either create a new read_ts (breaking point-in-time guarantees) or let the original pin expire. The request/response schema needs explicit renewal semantics (e.g., token + read_ts on request, or a dedicated RenewBackup RPC).


Comment on lines +198 to +199
also losslessly recoverable from the encoded filename alone — `KEYMAP` is a
convenience, not a correctness dependency.

P2: Correct filename reversibility claim for long keys

The document states key translation is recoverable from encoded filenames alone, but the long-segment rule replaces keys over 240 bytes with a hash prefix plus truncated original bytes. That transformation is not bijective from path text alone, so restores or external tools cannot reconstruct verbatim keys without side metadata (_long_keys.jsonl/KEYMAP). This should be specified as a required dependency for long keys, or the encoding scheme should be made fully reversible.


@bootjp (Owner, Author) commented Apr 29, 2026

@claude review

@claude (Bot) commented Apr 29, 2026

Claude finished @bootjp's task in 3m 13s.


Code Review: docs/design/2026_04_29_proposed_logical_backup.md

  • Gather context and understand the request
  • Read the design document
  • Check existing codebase for consistency
  • Analyze design for correctness, completeness, and risks
  • Post review feedback

The design is well-structured and the motivation is sound — decoupling the backup artifact from internal key encodings so recovery does not depend on the elastickv binary is the right call. The BackupScanner paging approach, the PinWithDeadline extension, and the per-adapter restore-via-public-API strategy are all solid. A few correctness issues need addressing before Phase 1 implementation starts:


P0 — Blocks implementation

1. BeginBackupRequest cannot express renewal

Step 5 of the BeginBackup → EndBackup flow (line 637–640) says:

the producer calls BeginBackup again with the same read_ts (carrying the existing token) every ttl_ms / 3 to extend the deadline

But BeginBackupRequest has only a single field, ttl_ms = 1. There are no fields for read_ts or pin_token, so the renewal described in step 5 is unimplementable — calling BeginBackup again creates a new pin at a new read_ts, not an extension of the existing one. This breaks the point-in-time guarantee for dumps longer than ttl_ms.

Fix: either add optional bytes renewal_token = 2 and optional uint64 renew_for_read_ts = 3 to BeginBackupRequest (server honors them as extend-in-place if present), or define a dedicated RenewBackup(pin_token, ttl_ms) RPC.
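
Whichever shape wins, the producer side reduces to a small renewal loop. A hedged sketch, where the renew callback stands in for whichever RPC is chosen:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// renewPin extends the backup pin every ttl/3 until ctx is cancelled.
// A failed renewal aborts the dump rather than risking a silently
// inconsistent artifact once the pin lapses.
func renewPin(ctx context.Context, renew func(context.Context, []byte) error,
	token []byte, ttl time.Duration) error {
	t := time.NewTicker(ttl / 3)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-t.C:
			if err := renew(ctx, token); err != nil {
				return fmt.Errorf("pin renewal failed, aborting dump: %w", err)
			}
		}
	}
}
```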



2. ListAdaptersAndScopesRequest message is never defined

The proto snippet (line 566–567) declares:

rpc ListAdaptersAndScopes(ListAdaptersAndScopesRequest)
    returns (ListAdaptersAndScopesResponse) {}

ListAdaptersAndScopesResponse is defined (line 586–592), but ListAdaptersAndScopesRequest is not. Even if the request is empty, an explicit message ListAdaptersAndScopesRequest {} is required for the proto to compile.



3. Redis list order example is incorrect

Line 395:

Order is left-to-right (LPUSH [3,2,1] produces ["3","2","1"]).

LPUSH key 3 2 1 pushes 3 to the head, then 2 to the head (in front of 3), then 1 to the head (in front of 2). LRANGE key 0 -1 returns ["1","2","3"], not ["3","2","1"]. The claim has the output inverted. Implementors who follow this example will produce a dump that restores lists in reverse order.

The correct statement: items are written in LRANGE head-to-tail order. LPUSH key 3 2 1 produces an LRANGE order of ["1","2","3"]. Or restate the example using RPUSH if the intent is push-order.



P1 — Design gaps / inconsistencies

4. BeginBackup/EndBackup on a read-only Admin service changes security posture without justification

proto/admin.proto currently has exclusively read RPCs (GetClusterOverview, GetRaftGroups, GetAdapterSummary, GetKeyVizMatrix, GetRouteDetail). BeginBackup and EndBackup are state-mutating: they register and release a compaction pin, and a rogue or unbounded BeginBackup call blocks the compactor indefinitely (up to ttl_ms).

Adding mutations to the Admin service makes the service's authorization requirements more complex — operators who give a token read-only Admin access today would inadvertently be granting backup-pin mutation rights. The design should either: (a) define a new Backup service in proto/backup.proto, or (b) explicitly justify why Admin is the right home and document the authorization boundary.


5. Phase 3 cross-reference to keyviz_cluster_fanout.md is wrong

Lines 891–893:

Concurrent multi-cluster fan-out … depends on the Distribution control plane being fan-out-aware (see 2026_04_27_proposed_keyviz_cluster_fanout.md).

2026_04_27_proposed_keyviz_cluster_fanout.md is the Key Visualizer admin-UI fan-out design (Phase 2-C of the heatmap SPA). It is not a Distribution control plane design. The reference is pointing at the wrong document. Either cite the correct routing/distribution fan-out proposal (if one exists), or remove the parenthetical and note that a distribution fan-out design is a prerequisite.



6. Cross-shard applied_index wait condition is underspecified

Step 2 of the BeginBackup flow says:

every group is required to report applied_index ≥ commit_index_at_pin for the default group's HLC ceiling proposal

The HLC ceiling is only proposed through the default Raft group. Non-default shard groups do not apply HLC ceiling entries — they apply user data commands. The condition applied_index ≥ commit_index_at_pin where commit_index_at_pin comes from the default group's HLC entry is mixing two independent Raft log indexes. A lagging shard's applied_index being ≥ the default group's commit_index says nothing about whether that shard has caught up to read_ts.

The correct wait condition for non-default shards is probably shard.LastAppliedTS ≥ read_ts (i.e., the highest timestamp the shard has applied is at or above the pinned read_ts), not an applied_index comparison across groups. Specify this precisely.
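
A sketch of that per-shard wait, assuming a hypothetical LastAppliedTS-style accessor (no such method exists today; the closure slice stands in for it):

```go
package main

import (
	"context"
	"time"
)

// waitForShards blocks until every shard's highest applied timestamp covers
// readTS, polling on a short interval and honoring the BeginBackup deadline
// via ctx.
func waitForShards(ctx context.Context, lastAppliedTS []func() uint64, readTS uint64) error {
	for _, applied := range lastAppliedTS {
		for applied() < readTS {
			select {
			case <-ctx.Done():
				return ctx.Err() // deadline exceeded: report the lagging shard upstream
			case <-time.After(50 * time.Millisecond):
			}
		}
	}
	return nil
}
```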


7. CHECKSUMS/MANIFEST crash-safety write order is unspecified

The "Risks and Mitigations" table says: "Producer writes MANIFEST.json last; restore refuses dumps without a manifest." But it says nothing about where CHECKSUMS falls in the write order relative to MANIFEST.json. Two scenarios are ambiguous:

  • If CHECKSUMS is written before MANIFEST.json: a crash between the two leaves a CHECKSUMS with no manifest. Restore detects and refuses — fine, but this should be stated.
  • If CHECKSUMS is written after MANIFEST.json: a crash between the two leaves a manifest with no checksums — restore can proceed but cannot verify integrity, which is a worse failure mode.

Explicitly state: CHECKSUMS is written before MANIFEST.json; the presence of MANIFEST.json is the atomicity marker.
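
Under that ordering, the producer's finalize step is two sequential writes. A sketch, with fsync and atomic-rename details elided:

```go
package main

import (
	"os"
	"path/filepath"
)

// finalizeDump writes CHECKSUMS first, then MANIFEST.json; a dump without
// a manifest is by definition incomplete, so a crash between the two writes
// is detected by the restore tool's manifest check.
func finalizeDump(root string, checksums, manifest []byte) error {
	if err := os.WriteFile(filepath.Join(root, "CHECKSUMS"), checksums, 0o644); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(root, "MANIFEST.json"), manifest, 0o644)
}
```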


Minor / Non-blocking

  • docs/design/README.md not updated. Per CLAUDE.md conventions, the design doc index must include the new file.
  • strings_ttl.json as a single JSON object requires loading the entire TTL index into memory to parse. For large Redis databases this could be megabytes to gigabytes. JSONL (strings_ttl.jsonl) would allow streaming — consistent with how messages.jsonl and streams/<key>.jsonl are handled.
  • Stream _meta trailing line creates a heterogeneous JSONL file. A file named .jsonl where the last line has a different schema ({"_meta": true, "length": ..., "last_ms": ...}) breaks tools expecting jq .fields to work on every line. Moving it to a sidecar <key>_meta.json alongside streams/<key>.jsonl is more consistent with the sidecar pattern used for S3 objects and DynamoDB schemas.
  • --mode replace live-traffic rejection is unspecified at the design level. TestRestoreReplaceMode asserts "live traffic is rejected during the replace window" but there is no proposed in-cluster fencing mechanism for this guarantee. This either needs a design note explaining the mechanism or the test description should be weakened.

HIGH:
- TTL renewal made explicit. New RenewBackup RPC; producer renews
  every ttl_ms/3 and aborts the dump on renewal failure (continuing
  past TTL would silently corrupt the artifact). backup_max_ttl_ms
  caps a single window; multi-hour dumps work via repeated renewals.

MEDIUM:
- DynamoDB scalability. Default stays per-item (the user's stated
  requirement: one record per file, named <PK>.json). Added opt-in
  --dynamodb-bundle-mode jsonl for tables where inode count is the
  binding constraint; structured warning emitted when an unbundled
  scope exceeds 1M items.
- S3 file-vs-directory collisions. Producer detects keys like
  "path/to" and "path/to/obj" coexisting and renames the shorter to
  "path/to.elastickv-leaf-data" with KEYMAP.jsonl + manifest record.
- KEYMAP scaled by switching to KEYMAP.jsonl (line-streamable).
- Redis strings_ttl.json -> strings_ttl.jsonl with one TTL record per
  line so TTL count is unbounded.
- max_active_backup_pins (default 4) caps concurrent BeginBackup
  registrations; tracker returns ErrTooManyActiveBackups, RPC surfaces
  ResourceExhausted. Prevents a misbehaving caller from holding the
  compactor open across the entire MVCC retention horizon.

Manifest fields added: s3_collision_strategy, dynamodb_layout,
max_active_backup_pins. CLI flags added: --ttl-ms, --scan-page-size,
--dynamodb-bundle-mode/--dynamodb-bundle-size, --rename-collisions.

Tests added: TestS3PathFileVsDirectoryCollision,
TestBeginBackupTooManyActiveBackups, TestRenewBackupExtendsDeadline.
@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bbd1bb8b8b


Comment on lines +853 to +854
- DynamoDB: `aws dynamodb create-table --cli-input-json _schema.json` and
then `aws dynamodb put-item --item @items/<pk>/<sk>.json` per file.

P1: Use valid CreateTable request JSON for DynamoDB replay

The documented replay step aws dynamodb create-table --cli-input-json _schema.json will not work with the _schema.json shape defined earlier in this doc (format_version, table_name, primary_key, etc.). AWS CLI create-table expects request fields like TableName, KeySchema, and AttributeDefinitions, and --cli-input-json must follow the operation input skeleton format, so users following this runbook will fail before any item restore.


every sidecar as if it were a user object.
- Redis: a 100-line shell script over `find redis/db_0/strings -name '*.bin' -exec redis-cli -x SET …`,
with similar one-liners per type.
- SQS: `jq -c . messages.jsonl | xargs -n1 aws sqs send-message --message-body …`.

P1: Include FIFO-required fields in SQS replay command

The suggested SQS replay one-liner only passes --message-body, but this design explicitly includes FIFO queues (orders-fifo.fifo) and AWS send-message fails for FIFO queues without MessageGroupId (and may also require MessageDeduplicationId when content-based dedup is off). As written, the advertised external restore path cannot replay FIFO dumps and also drops message attributes/metadata needed for faithful recovery.


Round-2 carry-overs:
- Move NewBackupScanner from *ShardedCoordinator to *ShardStore.
  ShardedCoordinator (kv/sharded_coordinator.go:124) has no *ShardStore
  field; ShardStore is the natural home for ScanAt-adjacent code.
- Add a Phase-1 note that ScanAt's doc comment must be updated to
  reflect that BeginBackup's applied_index fence makes cross-shard
  ScanAt globally consistent.
- Add pin_token to ListAdaptersAndScopesRequest so the scope scan is
  bounded to the pinned read_ts; concurrently-created scopes are not
  surfaced.
- Add hll/ to the Redis db_0 example tree (top-level layout already
  had it; the per-adapter example lagged).
- Add --stream-merge-strategy / --preserve-ttl / --preserve-visibility
  to the elastickv-restore apply CLI block.

New nits from the Gemini changes:
- Pick one form for DynamoDB bundle-mode flags. Two-flag form
  (--dynamodb-bundle-mode jsonl + --dynamodb-bundle-size 64MiB)
  throughout; drop the inline jsonl:size=64MiB form.
- RenewBackupRequest.ttl_ms documents the same 60s-24h range
  constraint as BeginBackupRequest, bounded above by
  backup_max_ttl_ms; out-of-range returns InvalidArgument.

Tests added: TestRenewBackupTTLRangeValidation,
TestListAdaptersAndScopesAtPinTS.
@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e3b2936057


Comment on lines +780 to +781
3. **Pin `read_ts`** with `PinWithDeadline(read_ts, now+ttl_ms)`; return
the resulting `pin_token` to the producer.

P1: Pin read_ts before waiting for shard catch-up

This flow pins read_ts only after the shard catch-up wait, which leaves a race where concurrent writes plus compaction can retire versions at read_ts before the scan ever starts. In a busy cluster (or with tight retention), BeginBackup can succeed but the subsequent ScanAt(read_ts) no longer has a guaranteed full snapshot. The pin should be established before the wait (and released on failure) so the fenced timestamp is protected for the entire handshake.


Comment on lines +174 to +177
- **S3 object keys preserve their `/` separators** as filesystem path
separators. `s3://photos/2026/04/29/img.jpg` becomes
`s3/photos/2026/04/29/img.jpg`. This is the whole point — `aws s3 sync`
on the dump directory works.

P1: Encode unsafe S3 path segments before writing to disk

Mapping S3 keys directly to filesystem paths with raw / separators is unsafe for keys containing period-only path segments (for example ./ or ../) and other special segments, because many tools normalize them and can change where data is written/read. That breaks the design goal of deterministic, reversible backups and can cause object collisions or path-escape behavior during dump/restore unless segment normalization is explicitly encoded/escaped.


The format is designed to be extractable without elastickv:

- DynamoDB: `aws dynamodb create-table --cli-input-json _schema.json` and
then `aws dynamodb put-item --item @items/<pk>/<sk>.json` per file.

P1: Use valid aws put-item CLI arguments in replay guidance

The DynamoDB replay command omits required put-item inputs and uses an unsupported file reference form, so operators following this runbook cannot actually restore items. AWS CLI put-item requires at least --table-name and --item, and file input is provided via file://...; without that, the external restore path fails even when item files are correct.


Important (verified against adapter/redis.go):
- Stream TTLs were silently dropped. Streams can carry TTL via
  EXPIRE/PEXPIRE/PEXPIREAT, kept in !redis|ttl| by buildTTLElems
  (adapter/redis.go:2471). Added expire_at_ms to the _meta line so
  restore re-applies the original epoch.
- HLL TTLs were silently dropped. HLL is reported as redisTypeString
  but stores TTL in the legacy scan index (adapter/redis.go:2319-2320
  comment confirms it). Added hll_ttl.jsonl sidecar at the db_<n>
  root, same shape as strings_ttl.jsonl. Sidecar is omitted when no
  HLL key has a TTL.
- Fixed XADD NOMKSTREAM semantic error in the skip-existing
  description. NOMKSTREAM only prevents stream creation; it does
  not skip entries by ID. Replaced with the correct mechanism:
  attempt XADD and treat the "ID equal or smaller" ERR as
  "already present, skip."

Minor:
- Added --begin-backup-deadline 5s to the elastickv-backup dump CLI
  block (the prose mentioned it but the flag list lagged).

Tests added: TestRedisStreamTTLRoundTrip, TestRedisHLLTTLRoundTrip,
TestRedisStreamSkipExistingHandlesERR.
@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 797d5711e5


Comment on lines +182 to +183
- Bytes in `[A-Za-z0-9._-]` pass through.
- Every other byte is rendered as `%HH` (uppercase hex), like

P1: Encode case to prevent filename collisions

The filename encoding keeps [A-Za-z0-9._-] unchanged, but this does not satisfy the stated requirement to handle case-colliding names on case-insensitive filesystems. Two distinct keys like UserA and usera will map to the same path on default Windows/macOS volumes, causing overwrite or dump failure and making the backup non-recoverable in those environments.
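
For reference, a direct Go transcription of the pass-through rule as written. It is deliberately case-preserving, which is exactly the collision this comment raises; any case-folding escape would have to be layered on top:

```go
package main

import (
	"fmt"
	"strings"
)

// encodeSegment renders a key byte-for-byte: [A-Za-z0-9._-] passes through,
// every other byte becomes %HH (uppercase hex).
func encodeSegment(key []byte) string {
	var b strings.Builder
	for _, c := range key {
		switch {
		case c >= 'a' && c <= 'z', c >= 'A' && c <= 'Z',
			c >= '0' && c <= '9', c == '.', c == '_', c == '-':
			b.WriteByte(c)
		default:
			fmt.Fprintf(&b, "%%%02X", c)
		}
	}
	return b.String()
}
```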


Comment on lines +876 to +877
→ CHECKSUMS → MANIFEST.json
→ EndBackup(pin_token)

P2: Include MANIFEST.json in checksum coverage

The flow writes CHECKSUMS before MANIFEST.json, so sha256sum -c CHECKSUMS cannot detect tampering of the manifest even though restore behavior depends on manifest fields (layout, collision strategy, options). This undermines the document’s integrity claim that tampering is caught by checksum verification.


Important (correctness gap, verified against main.go):
- BeginBackup pin was node-local. Each node owns its own
  ActiveTimestampTracker (main.go:294) and only its compactor
  consults it (main.go:335). A pin on the receiving node would not
  protect group leaders on other nodes; their compactors would be
  free to retire MVCC versions at read_ts mid-dump.

  Fix: pins propagate through each Raft group's log as
  BackupPin / BackupExtend / BackupRelease FSM commands. Every replica
  applies them to its local tracker, so the pin set is replicated and
  durable across leader changes (a new leader inherits the pin from
  the same log entry it applied). BeginBackup fans the propose out
  per group; if any group fails to commit within
  --begin-backup-deadline, BeginBackup proposes BackupRelease on
  groups that did commit and returns Unavailable so retry is clean.
  RenewBackup / EndBackup fan out the same way; releases are
  idempotent so partial state is tolerated.
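
The concurrent propose-or-undo described above is a classic all-or-rollback fan-out. A sketch using golang.org/x/sync/errgroup; the propose/release callbacks stand in for the real per-group Raft proposals:

```go
package main

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// fanOutPin proposes BackupPin on every group concurrently; on any failure
// it proposes BackupRelease on all groups (releases are idempotent, so
// rolling back a group that never committed is harmless) and returns the
// error so the caller can surface Unavailable for a clean retry.
func fanOutPin(ctx context.Context, groups []uint64,
	propose, release func(context.Context, uint64) error) error {
	g, gctx := errgroup.WithContext(ctx)
	for _, id := range groups {
		id := id
		g.Go(func() error { return propose(gctx, id) })
	}
	if err := g.Wait(); err != nil {
		for _, id := range groups {
			_ = release(context.Background(), id) // best-effort rollback
		}
		return err
	}
	return nil
}
```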

Minor:
- Specified _id_remap.jsonl path: redis/db_<n>/streams/<key>._id_remap.jsonl
  (next to source) so tools can find it on disk.
- TestRedisStreamSkipExistingHandlesERR description: "the dump"
  -> "the restore" (it's a restore-side test).
- Added KEYMAP.jsonl to every per-adapter scope in the layout tree;
  omitted when empty.

Tests added: TestBeginBackupPinFanOutAllNodes,
TestBeginBackupPinSurvivesLeaderChange,
TestBeginBackupGroupUnreachable.
@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2c00d5604d


Comment on lines +179 to +182
- **S3 object keys preserve their `/` separators** as filesystem path
separators. `s3://photos/2026/04/29/img.jpg` becomes
`s3/photos/2026/04/29/img.jpg`. This is the whole point — `aws s3 sync`
on the dump directory works.

P1: Handle trailing-slash S3 keys in path mapping

This mapping rule makes the on-disk path non-representable for valid S3 objects whose key ends with / (for example, folder-marker objects): preserving separators and writing at the “natural path” means the target path is a directory, not a file, so the object body/sidecar cannot be emitted without renaming. In datasets that contain these keys, dumps will either fail or silently omit/collide objects, breaking restore completeness; the design needs an explicit escape/rename rule for trailing-slash keys.



- DynamoDB: `aws dynamodb create-table --cli-input-json _schema.json` and
then `aws dynamodb put-item --item @items/<pk>/<sk>.json` per file.
- S3: `aws s3 sync --exclude '*.elastickv-meta.json' s3/<bucket>/ s3://target-bucket/`.

P2: Exclude control files from documented S3 replay sync

The documented replay command excludes only *.elastickv-meta.json, so it will also upload backup control artifacts that live under each bucket scope (for example _bucket.json and KEYMAP.jsonl) as regular S3 objects. That contaminates the restored bucket with non-user objects and makes the “external restore” path non-faithful unless additional excludes (or a filtered object list) are documented.


Important (verified against kv/fsm.go):
- Phase 1 scope was missing the FSM plumbing for BackupPin /
  BackupExtend / BackupRelease entries. Added explicit items:
  - new byte tags 0x03/0x04/0x05 in kv/fsm.go alongside existing
    raftEncodeHLCLease=0x02 (kv/fsm.go:116);
  - new applyBackupPin/applyBackupExtend/applyBackupRelease
    handlers, dispatched from kvFSM.Apply (kv/fsm.go:60) in the
    same shape as applyHLCLease;
  - new *ActiveTimestampTracker field on kvFSM (kv/fsm.go:26),
    wired via NewKvFSMWithHLCAndTracker so the FSM and the
    coordinator share one tracker (analogous to the *HLC sharing
    invariant in CLAUDE.md);
  - new kv/backup_codec.go with the 33/25/17-byte fixed binary
    layouts for the three entry types.

- kvFSM.Restore (kv/fsm.go:264-287) wipes in-memory pins. A replica
  that installs a Raft snapshot during the backup window loses its
  BackupPin (snapshot only carries the Pebble store, not the
  tracker). After the leader compacts the log past the BackupPin
  entry, that entry is gone; a follower catching up via snapshot
  never replays it.

  Added explicit safe-bound invariant:
    backup_duration + worst_case_compaction_lag < mvcc_retention_horizon
  BeginBackup enforces a soft form via --snapshot-headroom-entries
  (default 10000). Producer's per-scope expected_keys heuristic
  catches mid-dump snapshot-induced shortfalls. Heavier mitigations
  (snapshotting tracker state, sibling-pin queries) are explicitly
  Phase 3, not Phase 1.

Minor:
- --begin-backup-deadline binding constraint clarified: it is the
  applied_index poll for lagging shards, not the BackupPin commit
  (which is sub-second on a healthy cluster).
- RenewBackup retries per-group BackupExtend up to 3 times with
  500ms backoff before aborting; transient leader elections (<1s
  under etcd/raft defaults) no longer abort otherwise-healthy dumps.
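
The retry shape described here is a plain bounded loop; a sketch, where the propose callback stands in for the per-group BackupExtend proposal:

```go
package main

import "time"

// extendWithRetry retries a per-group BackupExtend proposal up to three
// times with 500ms backoff, so a sub-second leader election does not abort
// an otherwise-healthy dump.
func extendWithRetry(propose func() error) error {
	var err error
	for attempt := 0; attempt < 3; attempt++ {
		if err = propose(); err == nil {
			return nil
		}
		time.Sleep(500 * time.Millisecond)
	}
	return err
}
```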

Tests added: TestBackupPinFSMCodecRoundTrip,
TestRestoreWipesLocalPins, TestBeginBackupRefusesNearSnapshotThreshold,
TestRenewBackupRetriesLeaderElection.
@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 020e0c4c7f


every sidecar as if it were a user object.
- Redis: a 100-line shell script over `find redis/db_0/strings -name '*.bin' -exec redis-cli -x SET …`,
with similar one-liners per type.
- SQS: `jq -c . messages.jsonl | xargs -n1 aws sqs send-message --message-body …`.

P1: Add required queue URL to SQS replay command

The documented replay one-liner cannot run as written because aws sqs send-message requires --queue-url in addition to --message-body; the AWS CLI synopsis/options mark --queue-url as required. Operators following this command will fail before any message replay begins, so the external-restore path is blocked for all queues.


every sidecar as if it were a user object.
- Redis: a 100-line shell script over `find redis/db_0/strings -name '*.bin' -exec redis-cli -x SET …`,
with similar one-liners per type.
- SQS: `jq -c . messages.jsonl | xargs -n1 aws sqs send-message --message-body …`.

P1: Avoid xargs parsing for JSONL SQS message replay

Using jq -c . messages.jsonl | xargs -n1 ... is unsafe for common message bodies containing quotes or backslashes (for example JSON payloads), because GNU xargs treats quotes/backslashes as special delimiters by default and can error with unmatched quotes or alter arguments. That means replay can fail or send mutated payloads for valid dumped messages, so this command is not a reliable restore path.


Comment on lines +163 to +165
sidecar. Operators who hold such keys pass `--rename-collisions` to have
them emitted as `<obj>.elastickv-meta.json.user-data` (the rename is
recorded in `KEYMAP`).

P2: Make sidecar-collision rename mapping injective

The proposed rename for object keys ending in .elastickv-meta.json (<obj>.elastickv-meta.json.user-data) is not collision-safe: a real user object can already exist at that exact key, so both objects map to the same dump path and one is overwritten/omitted. The collision strategy needs a guaranteed-unique escape rule (or iterative disambiguation) to keep restores lossless for these key pairs.


bootjp added 2 commits April 29, 2026 11:08
Important (verified against internal/raftengine/engine.go and
internal/raftengine/etcd/engine.go):
- --snapshot-headroom-entries check was inverted. Replaced with the
  correct formula:
    entries_since_last_snapshot = AppliedIndex - LastSnapshotIndex
    remaining_headroom          = SnapshotEvery - entries_since_last_snapshot
    refuse if: remaining_headroom < --snapshot-headroom-entries
  (raftengine.Status has LastSnapshotIndex not FirstIndex, and
  defaultSnapshotEvery = 10000 is the per-engine snapshot trigger.)
  A freshly-snapshotted cluster has the largest remaining headroom
  and is now correctly allowed; the check refuses backups that would
  trigger the snapshot-installation corner case mid-dump. Default
  --snapshot-headroom-entries lowered from 10000 to 1000 (one-tenth
  of SnapshotEvery) since the formula now expresses what the prose
  intended.
- expected_keys heuristic was undefined. Replaced with a concrete
  expected-keys baseline mechanism:
    * BeginBackup runs a count-only ScanAt at read_ts before the pin
      installs, returns per-scope (key_count, applied_index) in
      BeginBackupResponse.
    * Producer compares actual count against baseline; fails with
      ErrCompactionDuringDump if below 99% +/- sqrt(baseline).
    * Adds keys_only bool parameter to MVCCStore.ScanAt /
      ShardStore.ScanAt so the baseline pass skips value
      materialization (Pebble iterator with payload trimmed at the
      block layer).
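
A Go transcription of the two checks above (a sketch: field names follow this commit's prose, and reading "99% +/- sqrt(baseline)" as a floor of 0.99 * baseline minus sqrt(baseline) is an interpretation of that shorthand):

```go
package main

import "math"

// refuseForHeadroom implements the corrected formula: refuse when the
// entries remaining before the next snapshot trigger fall below the
// configured headroom.
func refuseForHeadroom(appliedIndex, lastSnapshotIndex, snapshotEvery, headroom uint64) bool {
	entriesSince := appliedIndex - lastSnapshotIndex
	if entriesSince >= snapshotEvery {
		return true // a snapshot is already due; zero headroom left
	}
	return snapshotEvery-entriesSince < headroom
}

// baselineOK compares the actual per-scope key count against the
// BeginBackup baseline with the stated tolerance.
func baselineOK(actual, baseline uint64) bool {
	floor := 0.99*float64(baseline) - math.Sqrt(float64(baseline))
	return float64(actual) >= floor
}
```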

Minor:
- _long_keys.jsonl renamed to KEYMAP.jsonl on line 192 to match the
  rest of the spec.
- BackupPin fan-out is now explicitly concurrent (one goroutine per
  group, blocked on all-or-deadline). 100-shard cluster does not pay
  100x serial Raft round-trips.

Tests added: TestExpectedKeysBaselineToleratesTTLExpiry,
TestExpectedKeysBaselineCountOnlyScan. Existing tests
TestRestoreWipesLocalPins and TestBeginBackupRefusesNearSnapshotThreshold
updated to match the corrected formulas.
Important:
- BeginBackupResponse proto block was missing definitions. Added
  ShardApplied {group_id, applied_index} and ScopeKeyCount
  {scope_id, key_count, applied_index_at_count}; added
  "repeated ScopeKeyCount expected_keys = 5" to BeginBackupResponse.

- Expected-keys baseline ordering bug. Documented as running "during
  step 1" but step 1 is "pick read_ts" and step 2 is shard catch-up.
  A lagging shard would return 0 (or partial) for ScanAt at read_ts
  before applying through it, producing an undercount and silently
  disabling ErrCompactionDuringDump on exactly the shards most at
  risk of snapshot-induced pin loss. Moved to "after step 2 (every
  shard has applied through read_ts) and before step 3 (pin)".

Minor:
- --snapshot-headroom-entries CLI default 10000 -> 1000 (matches the
  one-tenth-of-SnapshotEvery default documented in the prose).
- ScanAt keys_only addition is a source-incompatible Go API change
  with ~10 callers in shard_store.go alone. Replaced with a new
  ScanKeysAt overload on MVCCStore / ShardStore (returns [][]byte
  not []*KVPair); existing ScanAt callers untouched.

Tests adjusted: TestExpectedKeysBaselineCountOnlyScan now verifies
both the new path and the absence of a regression on existing
ScanAt callers; new TestExpectedKeysBaselineRunsAfterShardCatchup
guards against the ordering bug.
@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7ca1ae5169


Comment on lines +133 to +135
│ ├── _bucket.json
│ ├── KEYMAP.jsonl # per-scope (omitted when empty)
│ ├── <object-key-path> # original object bytes

P1: Prevent S3 object keys from colliding with control filenames

This layout puts user object payloads and backup control artifacts in the same bucket directory, so valid keys like _bucket.json or KEYMAP.jsonl map to the exact paths reserved for metadata files and will overwrite or block one another. In any bucket containing those keys, the dump cannot be lossless and restore behavior becomes ambiguous unless these control names are explicitly escaped/reserved for user keys.


Comment on lines +195 to +197
- **DynamoDB binary partition / sort keys** (the `B` attribute type) are
rendered as `b64.<base64url>` so a binary key never collides with a
string key whose hex encoding happens to look like base64.

P1: Disambiguate DynamoDB B and S filename encodings

Encoding binary key attributes as b64.<base64url> is not collision-safe with string keys, because string segments in the same character set pass through unchanged and can legitimately equal b64.<...>. A table containing both S and B keys that serialize to the same segment will collide on disk, causing overwrite/omission unless the encoding carries an unambiguous type discriminator for all key kinds.


Comment on lines +224 to +225
is renamed to `path/to.elastickv-leaf-data`, with the rename
recorded in `s3/<bucket>/KEYMAP.jsonl`. The sidecar follows the

P2: Make leaf-data collision rename mapping injective

The file-vs-directory fallback renames path/to to path/to.elastickv-leaf-data, but that target can already exist as a real user object key. When a bucket has path/to, path/to/obj, and path/to.elastickv-leaf-data, two distinct objects map to one dump path, so one is lost unless the rename strategy guarantees uniqueness (for example, iterative escaping or hash-based disambiguation).


Important (verified against internal/raftengine/{etcd/,}engine.go and
kv/fsm.go):
- SnapshotEvery is not on the raftengine.Engine interface today.
  Phase 1 adds StatusReader.SnapshotEvery() uint64 implemented on
  the etcd backend by returning the snapshotEveryFromEnv() value
  cached at Open. Also corrected the env-var name: there is no
  --raftSnapshotEvery flag; the override is
  ELASTICKV_RAFT_SNAPSHOT_COUNT (internal/raftengine/etcd/engine.go:93).

- Rolling-upgrade safety. An old kvFSM receiving a 0x03-tagged
  BackupPin entry falls through decodeRaftRequests ->
  decodeLegacyRaftRequest (kv/fsm.go:140-145) -> proto.Unmarshal ->
  Apply error -> apply-loop stall. A single-release introduction
  would corrupt apply on un-upgraded replicas the moment the first
  BeginBackup runs.

  Phase 1 ships in two releases:
    Release N: kvFSM.Apply treats unknown tags in [0x03, 0x0F] as
      no-ops with an unknown_fsm_tag warning (forward-compat handler,
      no other backup machinery active).
    Release N+1: BackupPin/Extend/Release handlers ship; BeginBackup
      gates on every group member reporting version >= N via
      Admin.GetRaftGroups, returning FailedPrecondition otherwise.
  Operators may treat N and N+1 as the same calendar release if they
  enforce a fleet-wide upgrade window first; the gate still
  short-circuits.
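
A sketch of the Release N guard, with the dispatch names approximated from the file references in this commit (the real Apply signature may differ):

```go
package main

import "log"

const raftEncodeHLCLease = 0x02 // existing tag, per the kv/fsm.go:116 reference above

// applyTagged shows the forward-compat window: tags 0x03..0x0F are
// warn-and-skip no-ops in Release N, so an un-upgraded replica survives a
// BackupPin entry; tags >= 0x10 still fall through to the error path.
func applyTagged(tag byte, data []byte, applyHLC, applyLegacy func([]byte) error) error {
	switch {
	case tag == raftEncodeHLCLease:
		return applyHLC(data)
	case tag >= 0x03 && tag <= 0x0F:
		log.Printf("unknown_fsm_tag tag=%#x len=%d: skipping", tag, len(data))
		return nil
	default:
		return applyLegacy(data)
	}
}
```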

Minor:
- ScopeKeyCount.scope_id format documented for all four adapters,
  including the redis_databases uint32 -> "redis/db_<n>" mapping.

Tests added: TestForwardCompatUnknownFSMTag,
TestBeginBackupGatesOnMinVersion, TestSnapshotEveryReadsFromEngine.
@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5bda8a8b4b


Comment on lines +1288 to +1292
activated. `BeginBackup` itself **gates on every group member
reporting `min_backup_version ≥ N`** — verified through the
existing `Admin.GetRaftGroups` membership view, which already
carries per-node version metadata. If any member reports a
version older than N, `BeginBackup` returns

P1: Gate backups on N+1-capable members

The rollout gate currently allows BeginBackup when members are only at version N, but in this design version N explicitly treats backup FSM tags (0x03–0x0F) as no-ops. That means a mixed cluster with N members will accept BackupPin proposals yet silently skip pin application on those replicas, so their compactors can still retire read_ts versions during an active backup. This breaks the safety guarantee for long scans and leader movement; the gate needs to require members that actually implement applyBackupPin/Extend/Release (i.e., N+1 or an explicit capability flag).


Comment on lines +958 to +960
2. **Wait for shards to catch up**: every group is required to report
`applied_index ≥ commit_index_at_pin` for the default group's HLC
ceiling proposal that produced `read_ts`. `BeginBackup` polls each

P1: Wait on a per-group fence index, not the default group's index

Step 2 requires every shard to satisfy applied_index ≥ commit_index_at_pin from the default group's HLC-ceiling proposal, but Raft log indices are group-local and not comparable across groups. A low-traffic shard can remain healthy yet never reach another group's index, causing false FailedPrecondition, while a high-index shard that passes this check still has not proven it crossed the fence required for read_ts. The wait condition must use a fence index derived per group (or another per-group read barrier).
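
A sketch of the per-group alternative under assumed interfaces (ProposeBarrier and AppliedIndex do not exist in the codebase; they stand in for whatever per-group read barrier the design adopts):

```go
package backup

import (
	"context"
	"fmt"
	"time"
)

// group is an assumed interface: ProposeBarrier commits a no-op entry in
// this group's own log and returns its index; AppliedIndex reports local
// apply progress.
type group interface {
	ProposeBarrier(ctx context.Context) (uint64, error)
	AppliedIndex() uint64
}

// waitForFences records one fence per group, then compares indices only
// within the group that produced them.
func waitForFences(ctx context.Context, groups map[uint64]group) error {
	fences := make(map[uint64]uint64, len(groups))
	for id, g := range groups {
		idx, err := g.ProposeBarrier(ctx)
		if err != nil {
			return fmt.Errorf("group %d: barrier: %w", id, err)
		}
		fences[id] = idx
	}
	for id, g := range groups {
		for g.AppliedIndex() < fences[id] { // group-local comparison only
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(50 * time.Millisecond):
			}
		}
	}
	return nil
}
```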


Important (verified against proto/admin.proto):
- The min_backup_version gate claimed Admin.GetRaftGroups already
  carries per-node version. It does not. RaftGroupState
  (proto/admin.proto:38-49) only has raft_group_id, leader_node_id,
  leader_term, commit_index, applied_index, last_contact_unix_ms.
  Phase 1 therefore adds string node_version = 7 to RaftGroupState,
  populated by each admin handler from the build-time version
  constant. Removed the false "already carries" claim.

Minor:
- MaxEntryBytes is not a constant in the codebase. Replaced with the
  real limit: MaxSizePerMsg (defaultMaxSizePerMsg = 1<<20,
  internal/raftengine/etcd/engine.go:55).
- Forward-compat unknown-tag handler position specified: in
  kvFSM.Apply immediately after the raftEncodeHLCLease guard and
  before decodeRaftRequests, so all leading-byte dispatch stays in
  one place rather than splitting between Apply and the default case
  of decodeRaftRequests. Tags >= 0x10 still fall through and error,
  bounding the forward-compat door.

@chatgpt-codex-connector Bot left a comment

💡 Codex Review (automated suggestions; reviewed commit 9bf18e1c87)

Comment on lines +472 to +473
{"format_version": 1, "fields": {"name": "alice", "age": "30"}, "expire_at_ms": null}
```

P1: Preserve Redis collection elements as binary-safe bytes

The Redis dump format for hashes/lists/sets/streams uses plain JSON strings and objects (for example, fields in hashes/<key>.json), which is not a lossless representation of arbitrary Redis payload bytes. Redis collection members, fields, and values are binary-safe, but JSON strings require valid Unicode text and JSON objects require textual property names, so keys containing non-UTF8 bytes (or other raw byte sequences) will be mutated or will fail serialization during backup/restore. This breaks the document's cross-version, byte-faithful restore goal for valid Redis datasets unless these elements get an explicit byte-safe encoding (e.g., base64-wrapped fields and values).
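
A sketch of one possible byte-safe shape for hashes, with base64-wrapped field/value pairs; the record layout is an assumption, not the design's committed schema:

```go
package dump

import "encoding/base64"

type hashField struct {
	FieldB64 string `json:"field_b64"`
	ValueB64 string `json:"value_b64"`
}

type hashDump struct {
	FormatVersion int         `json:"format_version"`
	Fields        []hashField `json:"fields"`
	ExpireAtMs    *int64      `json:"expire_at_ms"`
}

// encodeHash base64-wraps both field names and values so arbitrary bytes
// survive JSON round-trips. A real dump would also sort Fields so the
// output is deterministic across map iteration orders.
func encodeHash(fields map[string][]byte) hashDump {
	d := hashDump{FormatVersion: 1}
	for f, v := range fields {
		d.Fields = append(d.Fields, hashField{
			FieldB64: base64.StdEncoding.EncodeToString([]byte(f)),
			ValueB64: base64.StdEncoding.EncodeToString(v),
		})
	}
	return d
}
```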


Comment on lines +486 to +487
{"member": "alice", "score": 100.0},
{"member": "bob", "score": 85.5}

P2: Represent ZSET scores as strings to cover ±inf values

The proposed zsets/<key>.json stores score as a JSON number, but Redis sorted sets allow +inf and -inf scores (the repo already exercises this in adapter/redis_lua_collection_fastpath_test.go), and those values are not valid JSON numbers. Any dataset containing infinite scores cannot be faithfully dumped in this schema; it either fails serialization or is lossily coerced. The score field should be encoded as a string (Redis wire form) with explicit handling for +inf/-inf.
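
A sketch of string-encoded scores using the Redis wire forms; helper names are illustrative:

```go
package dump

import (
	"math"
	"strconv"
)

// encodeScore renders a ZSET score in Redis wire form, covering ±inf.
func encodeScore(s float64) string {
	switch {
	case math.IsInf(s, 1):
		return "inf"
	case math.IsInf(s, -1):
		return "-inf"
	}
	return strconv.FormatFloat(s, 'g', -1, 64) // shortest round-trip form
}

// decodeScore parses the wire form back; strconv.ParseFloat already
// accepts "inf" and "-inf" (case-insensitive, optional sign).
func decodeScore(s string) (float64, error) {
	return strconv.ParseFloat(s, 64)
}
```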


Important (verified against adapter/admin_grpc.go):
- Rolling-upgrade gate redesigned. GetRaftGroups iterates the local
  AdminServer's registered groups and reports each row from the
  local engine's Status() (adapter/admin_grpc.go:382-416), so adding
  node_version to RaftGroupState would only ever report the local
  admin server's version. Followers on other nodes would not
  surface, and the gate would falsely pass on a partially-upgraded
  cluster.

  Phase 1 instead adds a dedicated GetNodeVersion RPC fanned out by
  the BeginBackup handler to every peer in AdminServer.members.
  Each peer responds with its compiled-in version constant; the
  handler refuses BeginBackup if any peer is below N or unreachable
  within --begin-backup-deadline (see the sketch after this list).
  RaftGroupState gains leader_node_version = 7 for UI visibility
  only (not used for the gate decision).

- AdminServer has no *kv.ShardStore today (confirmed at
  adapter/admin_grpc.go:62-82). The "existing AdminServer ->
  ShardStore wiring used by GetRaftGroups" claim was wrong;
  GetRaftGroups only reaches AdminGroup.Status() and Configuration().
  Phase 1 adds the field explicitly, wired in main.go.
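
A sketch of the fan-out gate under stated assumptions: versionOf stands in for the proposed GetNodeVersion RPC, and version strings are assumed to carry the leading "v" that golang.org/x/mod/semver requires:

```go
package backup

import (
	"context"

	"golang.org/x/mod/semver"
	"golang.org/x/sync/errgroup"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// gateOnNodeVersions refuses the backup unless every peer answers and is
// new enough.
func gateOnNodeVersions(ctx context.Context, peers []string, min string,
	versionOf func(context.Context, string) (string, error)) error {
	g, ctx := errgroup.WithContext(ctx)
	for _, addr := range peers {
		addr := addr
		g.Go(func() error {
			v, err := versionOf(ctx, addr)
			if err != nil {
				// Distinguish "unreachable" from "old version".
				return status.Errorf(codes.FailedPrecondition,
					"backup gate: %s unreachable: %v", addr, err)
			}
			if semver.Compare(v, min) < 0 {
				return status.Errorf(codes.FailedPrecondition,
					"backup gate: %s runs %s, need >= %s", addr, v, min)
			}
			return nil
		})
	}
	return g.Wait()
}
```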

Minor:
- AdminGroup interface (adapter/admin_grpc.go:41-44) does not embed
  StatusReader, so adding SnapshotEvery() to StatusReader does not
  reach BeginBackup through s.groups[id]. Phase 1 explicitly extends
  AdminGroup with SnapshotEvery() uint64; mocks in
  adapter/admin_grpc_test.go gain one method that returns
  defaultSnapshotEvery for parity with production.

TestBeginBackupGatesOnMinVersion updated to match the new
GetNodeVersion fan-out path; canonical 3-node fixture documented.

@chatgpt-codex-connector Bot left a comment

💡 Codex Review (automated suggestions; reviewed commit a0292c955f)

Comment on lines +1383 to +1384
BackupPin : [tag:1][pin_id:16][read_ts:8][deadline_ms:8] = 33 bytes
BackupExtend : [tag:1][pin_id:16][deadline_ms:8] = 25 bytes

P1: Replicate pin TTL as a duration, not an absolute wall-clock deadline

Encoding deadline_ms into BackupPin/BackupExtend makes expiry depend on each replica's local wall clock, so even modest clock skew can expire pins early on fast nodes while a backup is still running. In that scenario a replica can drop the pin before the next renewal and compact away read_ts versions, breaking snapshot consistency despite a successful BeginBackup. To keep pin lifetime consistent cluster-wide, propagate a lease duration (or an HLC-derived deadline evaluated consistently) instead of a coordinator wall-clock timestamp.
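
A sketch of the duration-based alternative; the type and field names are illustrative, not a wire format:

```go
package kv

import "time"

// backupPin is an illustrative in-memory record; only ttl is replicated.
type backupPin struct {
	pinID  [16]byte
	readTS uint64
	ttl    time.Duration // carried in the log entry: relative, not absolute

	expireAt time.Time // local-only: derived at apply time, never replicated
}

// applyBackupPin computes expiry from this replica's own clock. time.Now()
// carries a monotonic reading and Add preserves it, so wall-clock steps or
// skew between replicas cannot expire the pin early on any one node.
func applyBackupPin(p *backupPin) {
	p.expireAt = time.Now().Add(p.ttl)
}

func applyBackupExtend(p *backupPin, ttl time.Duration) {
	p.expireAt = time.Now().Add(ttl)
}
```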


Important (verified against adapter/admin_grpc.go):
- BeginBackup gate fan-out target was AdminServer.members (the
  static bootstrap seed at lines 64 and 89, never updated after
  startup). Nodes added later via AddVoter would not appear,
  silently missing exactly the nodes most likely to be running an
  older version during a rolling upgrade.

  Fix: BeginBackup reuses the existing snapshotMembers helper
  (adapter/admin_grpc.go:158), the same path GetClusterOverview
  uses, which unions the seed with members discovered from each
  group's Configuration(). The fan-out target is the live cluster,
  not the startup snapshot.

Minor:
- GetRaftGroups today is a synchronous local-state read with no
  network calls. Adding leader_node_version populated via
  GetNodeVersion(leader) would turn it into a fan-out handler with
  an unbounded latency tail. Specified an async dial per leader with
  a 500ms per-leader timeout, "" on failure, and a 10s response
  cache so rapid polls do not fan out repeatedly (see the sketch
  after this list).

- GetNodeVersion failure mode now distinguishes "old version" from
  "unreachable" in the FailedPrecondition error string so operators
  can diagnose without out-of-band investigation. Also noted that
  --begin-backup-deadline is per-peer (fan-out is concurrent), so
  the worst-case wait is one deadline period regardless of cluster
  size.
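
A sketch of the cached probe described in the first bullet above; versionEntry and versionOf are illustrative, and the sync.Map matches the cache the later revision commits to:

```go
package adapter

import (
	"context"
	"sync"
	"time"
)

type versionEntry struct {
	version string
	at      time.Time
}

// leaderVersion probes one leader with a 500ms timeout, reports "" on
// failure (never an error), and caches the result for 10s so rapid
// GetRaftGroups polls do not fan out repeatedly.
func leaderVersion(cache *sync.Map, addr string,
	versionOf func(context.Context, string) (string, error)) string {
	if v, ok := cache.Load(addr); ok {
		if e := v.(versionEntry); time.Since(e.at) < 10*time.Second {
			return e.version
		}
	}
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()
	ver, err := versionOf(ctx, addr)
	if err != nil {
		ver = "" // probe failure surfaces as an empty version
	}
	cache.Store(addr, versionEntry{version: ver, at: time.Now()})
	return ver
}
```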

Tests added: TestBeginBackupGatesIncludesDynamicallyAddedNodes,
TestGetRaftGroupsLeaderVersionAsync.
TestBeginBackupGatesOnMinVersion updated to assert the new error
shapes.

@chatgpt-codex-connector Bot left a comment

💡 Codex Review (automated suggestions; reviewed commit 3efd68d2c7)

Comment on lines +179 to +182
- **S3 object keys preserve their `/` separators** as filesystem path
separators. `s3://photos/2026/04/29/img.jpg` becomes
`s3/photos/2026/04/29/img.jpg`. This is the whole point — `aws s3 sync`
on the dump directory works.

P1: Handle S3 keys that exceed filesystem segment limits

The layout maps S3 keys directly to filesystem paths by preserving / separators, but it defines no fallback for oversized path segments. S3 allows keys up to 1024 UTF-8 bytes, so a valid key containing a single long segment (or a very deep path) can exceed common filesystem limits (for example NAME_MAX/PATH_MAX), making the dump output non-representable and causing backup failure or silent omission of valid objects.
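
A sketch of a per-segment fallback, assuming a 200-byte budget (a design choice, not a codebase constant); very deep paths exceeding PATH_MAX would need a similar flattening step not shown here:

```go
package dump

import (
	"crypto/sha256"
	"fmt"
	"path"
	"strings"
)

// maxSegmentBytes is an assumed conservative budget under the typical
// NAME_MAX of 255 bytes.
const maxSegmentBytes = 200

// safeSegment keeps short segments verbatim and replaces oversized ones
// with a readable prefix plus a hash of the full segment for uniqueness.
func safeSegment(seg string) string {
	if len(seg) <= maxSegmentBytes {
		return seg
	}
	sum := sha256.Sum256([]byte(seg))
	// The byte-wise cut may split a UTF-8 rune; harmless here, since the
	// hash carries the segment's identity.
	return fmt.Sprintf("%s~%x", seg[:160], sum[:8])
}

func dumpPath(bucket, key string) string {
	parts := strings.Split(key, "/")
	for i, p := range parts {
		parts[i] = safeSegment(p)
	}
	return path.Join("s3", bucket, path.Join(parts...))
}
```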



| Risk | Mitigation |
|------|-----------|
| Dump duration exceeds MVCC retention horizon | `read_ts` registered with `active_timestamp_tracker`; compactor cannot retire its versions; producer exits with a typed error if registration is rejected |

P2: Correct the retention-horizon mitigation summary

This mitigation row states that pinning read_ts prevents compaction when a dump exceeds the retention horizon, but the design earlier documents that snapshot installation can drop replica pin state (the kvFSM.Restore path), which re-allows compaction and is exactly why the safe-bound invariant is required. Left as-is, the summary is misleading: operators may infer that long-running backups are protected when the document itself says they are only safe within bounded duration and headroom conditions.


Important (verified against adapter/admin_grpc.go and
kv/grpc_conn_cache.go):
- AdminServer has no outbound gRPC connection mechanism today.
  GetNodeVersion fan-out (in BeginBackup) and the async
  leader_node_version probes (in GetRaftGroups) both need to dial
  peer admin servers. Without specifying the dialer, two
  implementors would make incompatible choices about credentials
  and connection lifecycle.

  Phase 1 explicitly adds a *kv.GRPCConnCache field to AdminServer
  using internal.GRPCDialOptions(), kept separate from
  ShardStore.connCache because admin and data-plane addresses
  occupy different address spaces (admin cache is keyed on
  NodeIdentity.GRPCAddress; data-plane cache is keyed on Raft-
  group-leader addresses, which migrate on leader change).
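
A sketch of the keying contrast only; the stub GRPCConnCache stands in for the real kv.GRPCConnCache, and both structs are abridged:

```go
package adapter

// GRPCConnCache stands in for kv.GRPCConnCache (kv/grpc_conn_cache.go);
// only the keying contrast that motivates two caches is shown.
type GRPCConnCache struct{ /* address-keyed client conns, elided */ }

type AdminServer struct {
	// adminConns is keyed on NodeIdentity.GRPCAddress: admin endpoints
	// are stable per node and never migrate.
	adminConns *GRPCConnCache
}

type ShardStore struct {
	// connCache is keyed on Raft-group-leader addresses, which migrate
	// on leader change, hence the separate cache above.
	connCache *GRPCConnCache
}
```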

Minor:
- 10s leader_node_version cache is now an explicit AdminServer
  struct field (versionCache map[string]struct{...}) listed in
  Phase 1 scope rather than implicit prose.

Tests added: TestAdminGRPCConnCacheReuse (verifies cache reuse,
peer-shutdown eviction, independence from ShardStore.connCache).

@chatgpt-codex-connector Bot left a comment

💡 Codex Review (automated suggestions; reviewed commit 6e62c7e740)

Comment on lines +186 to +187
- Bytes in `[A-Za-z0-9._-]` pass through.
- Every other byte is rendered as `%HH` (uppercase hex), like

P1: Escape dot-only segments in the filename encoding

Allowing . and .. to pass through unchanged breaks path safety for layouts that use key segments as directories (for example DynamoDB items/<pk>/<sk>.json). A valid key segment of .. can escape the intended subtree after path normalization, while . collapses away and can alias other paths. This can overwrite sibling artifacts and makes backup/restore non-deterministic unless dot-only segments are explicitly escaped or rejected.
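
A sketch of the allowlist encoder with the dot-only fix; the function name is illustrative, and the %HH form matches the documented encoding:

```go
package dump

import (
	"fmt"
	"strings"
)

// encodeSegment applies the documented allowlist but refuses to let
// dot-only segments survive: "." and ".." would otherwise pass through
// ('.' is an allowed byte) and then alias or escape the dump subtree.
func encodeSegment(seg string) string {
	if seg == "." {
		return "%2E"
	}
	if seg == ".." {
		return "%2E%2E"
	}
	var b strings.Builder
	for i := 0; i < len(seg); i++ {
		c := seg[i]
		switch {
		case c >= 'A' && c <= 'Z', c >= 'a' && c <= 'z',
			c >= '0' && c <= '9', c == '.', c == '_', c == '-':
			b.WriteByte(c)
		default:
			fmt.Fprintf(&b, "%%%02X", c) // uppercase hex, per the doc
		}
	}
	return b.String()
}
```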


Important (verified against adapter/admin_grpc.go:455-505):
- GetNodeVersion is itself an Admin method, gated by AdminTokenAuth
  on every auth-enabled cluster. The fan-out from BeginBackup
  through *kv.GRPCConnCache makes outbound calls with no metadata
  attached, so every dial would return Unauthenticated and the gate
  would misreport the failure as "node did not respond" on every
  cluster running --adminToken.

  Phase 1 explicitly propagates the inbound bearer token to
  outbound GetNodeVersion calls via metadata.NewOutgoingContext(
  context.Background(), inMD). The same pattern applies to the
  async GetNodeVersion(leader) probes in GetRaftGroups. Forwarding
  the inbound caller's token (not a server-stored value) preserves
  the operator's authorization decision and avoids privilege
  escalation by the admin server.
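
A sketch of the forwarding pattern; the metadata calls are the real grpc-go API, while outboundAdminCtx is an illustrative helper name:

```go
package adapter

import (
	"context"

	"google.golang.org/grpc/metadata"
)

// outboundAdminCtx copies the inbound caller's metadata (including the
// admin bearer token) onto a fresh outgoing context, so the fan-out call
// authenticates as the operator who invoked BeginBackup.
func outboundAdminCtx(inbound context.Context) context.Context {
	out := context.Background()
	if inMD, ok := metadata.FromIncomingContext(inbound); ok {
		// Forward the caller's token, never a server-stored value, to
		// preserve the operator's authorization decision.
		out = metadata.NewOutgoingContext(out, inMD)
	}
	return out
}
```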

Minor:
- versionCache committed to sync.Map (with key/value types
  spelled out) so the implementor cannot accidentally produce a
  data race by reusing groupsMu for cache writes. The previous
  "(or sync.Map equivalent)" left enough room for a plain map +
  groupsMu approach that would either contend or race.

Tests added: TestBeginBackupPropagatesAdminAuthToken,
TestVersionCacheRaceUnderLoad (the latter runs under -race).

@chatgpt-codex-connector Bot left a comment

💡 Codex Review (automated suggestions; reviewed commit 486675d414)

```
- `streams/<key>.jsonl`: one entry per line, in stream order:
```
{"id": "1714400000000-0", "fields": {"event": "login", "user": "alice"}}

P1: Preserve Redis stream field order and duplicates

The stream record schema models fields as a JSON object, but Redis stream entries are ordered field-value lists that can legally contain repeated field names. The proposal therefore cannot round-trip valid XADD payloads that rely on ordering or duplicate fields (the repo's stream model uses an ordered slice for fields), so restore would silently alter message contents. Use an ordered representation (e.g., a flat array or an array of {field,value} pairs) instead of an object.
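
A sketch of an order- and duplicate-preserving record, reusing the base64 wrapping proposed for the other collection types; the shape is an assumption, not the committed schema:

```go
package dump

import (
	"encoding/base64"
	"encoding/json"
)

type streamField struct {
	FieldB64 string `json:"field_b64"`
	ValueB64 string `json:"value_b64"`
}

type streamEntry struct {
	ID     string        `json:"id"`
	Fields []streamField `json:"fields"` // preserves XADD order and repeats
}

// encodeEntry keeps fields as a slice of pairs, so both ordering and
// duplicate field names survive the JSON round-trip.
func encodeEntry(id string, pairs [][2][]byte) ([]byte, error) {
	e := streamEntry{ID: id}
	for _, p := range pairs {
		e.Fields = append(e.Fields, streamField{
			FieldB64: base64.StdEncoding.EncodeToString(p[0]),
			ValueB64: base64.StdEncoding.EncodeToString(p[1]),
		})
	}
	return json.Marshal(e)
}
```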


Comment on lines +486 to +487
{"member": "alice", "score": 100.0},
{"member": "bob", "score": 85.5}

P2: Encode Redis ZSET members with a byte-safe format

The ZSET dump shape stores member as a plain JSON string, which is not byte-safe for arbitrary Redis member bytes. Valid datasets containing non-UTF8 or control-byte members will be mutated or rejected during serialization or replay, breaking the document's lossless backup goal. Define an explicit byte-safe encoding for ZSET members (as for the other binary-safe paths above) so restores stay faithful.

