
docs(design): propose per-adapter logical backup format #705

Open · bootjp wants to merge 15 commits into main from docs/proposed-logical-backup

Conversation

@bootjp (Owner) commented Apr 29, 2026

Per-adapter directory hierarchy (dynamodb/<table>/items/<pk>.json,
s3/<bucket>/<original/path>, redis/db_<n>/<type>/<key>.{bin,json,jsonl},
sqs/<queue>/messages.jsonl) so a backup is recoverable without elastickv —
cat / jq / find / aws s3 sync / redis-cli are sufficient. Strips internal
magic prefixes and wide-column key encodings; pins read_ts via the lease-
read path and the active timestamp tracker so the dump is consistent and
the compactor cannot retire its versions mid-dump.
@coderabbitai (Bot) commented Apr 29, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 15 minutes and 24 seconds before requesting another review.

To keep reviews running without waiting, you can enable the usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 18073263-73e7-41fa-866b-4bf8c6cef5a4

📥 Commits

Reviewing files that changed from the base of the PR and between 579d193 and 486675d.

📒 Files selected for processing (1)
  • docs/design/2026_04_29_proposed_logical_backup.md

@bootjp (Owner, Author) commented Apr 29, 2026

@claude review

@claude (Bot) commented Apr 29, 2026

Claude finished @bootjp's task in 2m 47s.


Code Review: docs/design/2026_04_29_proposed_logical_backup.md

  • Gather context and understand the request
  • Read the design document
  • Analyze design for correctness, completeness, and consistency
  • Check against existing codebase patterns and conventions
  • Post review feedback

Overall this is a well-motivated, thorough design. The core insight — that the FSM snapshot is a one-way artifact and the backup format should be the logical structure of the data, recoverable with first-class OS tools — is exactly right. The restore going through public adapter endpoints rather than direct FSM apply is the correct long-term call for version resilience. A few issues need addressing before implementation begins:


Issues

1. ActiveTimestampTracker.Pin() has no TTL — the auto-expiry design assumption is missing infrastructure

The design states (§ Read-Side Consistency):

A tool-side crash leaves the tracker entry behind for at most backup_ts_ttl (default 30 minutes), after which the registration auto-expires so the compactor is not blocked indefinitely.

But kv/active_timestamp_tracker.go has no TTL or deadline mechanism — Pin(ts uint64) returns a token, and the token only releases when Release() is called explicitly. A crash leaves the entry permanently until the process restarts. Phase 1 must include a TTL extension to ActiveTimestampTracker (a goroutine-based or compactor-checked wall-clock deadline on the token). This is non-trivial and should be listed as an explicit Phase 1 deliverable, not implied by the backup_ts_ttl default.
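
A minimal sketch of what that Phase 1 deliverable could look like, assuming the tracker guards a token-to-pin map with a mutex. The type and both methods below are hypothetical illustrations, not the repo's actual API:

```go
package kv

import (
	"sync"
	"time"
)

// trackerSketch stands in for the real ActiveTimestampTracker; only the
// deadline mechanics are the point here.
type trackerSketch struct {
	mu   sync.Mutex
	next uint64
	pins map[uint64]pinEntry // token -> pinned ts + expiry
}

type pinEntry struct {
	ts       uint64
	deadline time.Time
}

// PinWithDeadline registers ts and returns a token; unlike the existing
// Pin, the registration lapses at deadline even if Release is never called.
func (t *trackerSketch) PinWithDeadline(ts uint64, deadline time.Time) uint64 {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.pins == nil {
		t.pins = map[uint64]pinEntry{}
	}
	t.next++
	t.pins[t.next] = pinEntry{ts: ts, deadline: deadline}
	return t.next
}

// sweep drops expired pins; run it once per second from a goroutine, or
// call it from the compactor's admission check.
func (t *trackerSketch) sweep(now time.Time) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for tok, p := range t.pins {
		if now.After(p.deadline) {
			delete(t.pins, tok)
		}
	}
}
```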



2. BeginBackup / EndBackup RPCs are missing from proto/admin.proto, and the service boundary is wrong

proto/admin.proto defines Admin as a read-only admin service (GetClusterOverview, GetRaftGroups, etc.). BeginBackup / EndBackup are mutations (they register/release a compaction pin). Adding mutation RPCs to a read-only service contradicts the stated intent of the service. The design should either:

  • Define a new Backup service in a new backup.proto, or
  • Explicitly acknowledge and justify that Admin will grow mutation RPCs.

This also affects the auth story — read-only operations can use a lower-privilege token; BeginBackup holds a compaction pin and could be used as a DoS vector against the compactor without appropriate rate limits or role-based guards.


3. kv.ShardedCoordinator.ScanRange does not exist

The design cites kv.ShardedCoordinator.ScanRange as an existing primitive, but ShardedCoordinator has no ScanRange method. The MVCC layer has ScanAt (store/mvcc_store.go:331), but that operates on raw keys. For the backup to reconstruct, for example, a DynamoDB item from wide-column !hs|fld|<keyLen><key><field> entries, the scan must be adapter-aware — it needs to understand which wide-column keys belong to which user key. This is non-trivial, and Phase 1 needs to explicitly design the ScanRange abstraction, including how it drives the per-adapter decoders.
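
A rough shape for that abstraction, assuming a paged ScanAt-style primitive underneath; every name below (kvPair, scanPage, BackupScanner) is illustrative, not existing code:

```go
package kv

// kvPair and scanPage stand in for the real MVCC types; only the paging
// and decode hand-off are the point of this sketch.
type kvPair struct{ Key, Value []byte }

type scanPage func(readTS uint64, startAfter []byte, limit int) ([]kvPair, error)

// BackupScanner walks one adapter's key range at a pinned read_ts in fixed
// pages, feeding raw pairs to an adapter-aware decoder that regroups
// wide-column entries under their user key.
type BackupScanner struct {
	scan     scanPage
	readTS   uint64
	pageSize int
	cursor   []byte // last key of the previous page; nil on the first call
	done     bool
}

func (s *BackupScanner) Next() ([]kvPair, error) {
	if s.done {
		return nil, nil
	}
	page, err := s.scan(s.readTS, s.cursor, s.pageSize)
	if err != nil || len(page) == 0 {
		s.done = true
		return nil, err
	}
	s.cursor = append([]byte(nil), page[len(page)-1].Key...)
	s.done = len(page) < s.pageSize
	return page, nil
}
```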


4. Redis HyperLogLog is missing from the canonical directory layout

The layout (§ Top-Level Layout) lists strings/, hashes/, lists/, sets/, zsets/, streams/, but hll/ is absent. It appears correctly later in the per-adapter section (line 395: hll/<key>.bin). Add hll/ to the top-level layout tree for completeness; implementors will use the tree as the reference.


5. single.json sentinel for no-sort-key DynamoDB items is undefined

The example shows:

b64.AAECAw../single.json     # binary partition key

single.json is used as the filename when a table has no sort key, but this sentinel is never defined or justified in the Filename Encoding section. What prevents a sort key value whose percent-encoded form is single? The collision risk needs to be closed: either reserve a different sentinel (e.g., _item.json) or document that single is a reserved name distinct from any encoded sort key (because encoded sort keys always contain at least one %-escape for the .json suffix to be unambiguous).


6. Redis LPUSH order example is misleading

Line 371:

lists/<key>.json: Order is left-to-right (LPUSH [3,2,1] produces ["3","2","1"]).

LPUSH key 3 2 1 pushes 3, then 2, then 1 to the head, so LRANGE key 0 -1 returns ["1","2","3"] (last-pushed is head). If the intent is that the dump preserves LRANGE order (head-to-tail), restate it as: "Items are in LRANGE head-to-tail order: LPUSH key 3 2 1 → ["1","2","3"]."
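
The semantics are quick to verify with a few lines of Go modeling LPUSH head insertion:

```go
package main

import "fmt"

// lpush prepends each value in turn, exactly as Redis LPUSH does: the
// last-pushed value ends up at the head.
func lpush(list []string, vals ...string) []string {
	for _, v := range vals {
		list = append([]string{v}, list...)
	}
	return list
}

func main() {
	fmt.Println(lpush(nil, "3", "2", "1")) // [1 2 3] — the LRANGE 0 -1 order
}
```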


7. The 50 ms skew buffer in cross-shard consistency is a new magic constant

§ Cross-shard consistency:

read_ts is admissible as long as it is below each group's last-applied commit-ts … which BeginBackup enforces by adding a small skew buffer (default 50 ms) matching the existing TSO buffer.

The existing TSO buffer is defined in the lease read implementation (2026_04_20_implemented_lease_read.md). If that document specifies a different value or a different mechanism, 50 ms could be wrong. Cite the exact constant / function from the implemented code rather than introducing a new literal, to avoid drift.


8. CHECKSUMS crash-safety ordering is underspecified

The manifest crash-safety guarantee is: "Producer writes MANIFEST.json last; restore refuses dumps without a manifest." But CHECKSUMS is a peer of MANIFEST.json at the root. Two questions:

  1. Is CHECKSUMS written before or after MANIFEST.json? (If before, a crash between the two leaves a CHECKSUMS with no manifest — is that detectable?)
  2. Does CHECKSUMS include a hash of itself? (It cannot.) Does MANIFEST.json include the hash of CHECKSUMS? That would tie the two together safely. Specify the write order explicitly.

9. Phase 3 dependency on keyviz_cluster_fanout design seems misplaced

§ Phase 3:

Concurrent multi-cluster fan-out … depends on the Distribution control plane being fan-out-aware (see 2026_04_27_proposed_keyviz_cluster_fanout.md).

2026_04_27_proposed_keyviz_cluster_fanout.md is the Key Visualizer fanout design. That is a read-path sampling pipeline, not the Distribution control plane's routing/shard-fan-out primitive. Either this reference is wrong, or it should cite a different proposed design doc for the distribution fan-out. Clarify.


10. docs/design/README.md is not updated

Per CLAUDE.md convention, docs/design/README.md indexes all design docs. This PR adds a new proposed doc but does not update the index. Add an entry for 2026_04_29_proposed_logical_backup.md.


Minor / Non-blocking

  • The --mode replace locking semantics ("live traffic is rejected during the replace window") in TestRestoreReplaceMode are unspecified at the design level. If there is no in-cluster fencing mechanism, the test description over-promises.
  • String TTL sidecar (strings_ttl.json) vs. inline expire_at_ms in hash/list/set/zset JSON is a deliberate asymmetry (preserves cat convenience for binary string payloads). Consider adding a one-line justification inline so implementors don't "fix" it.
  • The design document itself is well-structured and the internal-state classification table is excellent.

@gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces a design for a logical backup format that decomposes snapshots into adapter-specific, human-readable structures to ensure vendor-independent recovery. The proposal outlines a directory-based layout for DynamoDB, S3, Redis, and SQS, utilizing a cluster-wide read timestamp for point-in-time consistency. Review feedback highlighted several improvement opportunities: mitigating the operational risk of the 30-minute backup TTL with a heartbeat mechanism, preventing S3 metadata filename collisions by using a dedicated directory, and improving scalability and parsing consistency for Redis TTLs and stream metadata by adopting JSONL and sidecar files.

lease reads already use; no new code path through MVCC is introduced.
3. `Admin.EndBackup` releases the tracker registration. A
tool-side crash leaves the tracker entry behind for at most
`backup_ts_ttl` (default 30 minutes, configurable via the same admin
Severity: high

The 30-minute default TTL for the read_ts registration in the active timestamp tracker represents a significant operational risk for large-scale backups. If the dump takes longer than 30 minutes, the compactor could retire MVCC versions while the backup is still scanning, causing the backup to fail. Per repository guidelines, design documents identifying such risks must detail mitigation strategies, such as implementing a heartbeat mechanism from the producer to the tracker to keep the registration alive.

Suggested change
`backup_ts_ttl` (default 30 minutes, configurable via the same admin
backup_ts_ttl (default 30 minutes, heartbeated by the producer or configurable via the same admin
References
  1. When a design document identifies a significant operational risk, it must also detail potential mitigation strategies.

│ └── 04/
│ └── 29/
│ ├── img.jpg
│ ├── img.jpg.metadata.json
Severity: medium

Using a fixed .metadata.json suffix for sidecar files can lead to collisions with user-defined S3 object keys that happen to end with the same string. This could result in user data being overwritten or the restore tool misinterpreting metadata as user content. Using a dedicated directory for metadata is a safer approach.

Suggested change
│ ├── img.jpg.metadata.json
│ ├── _metadata/img.jpg.json

├── strings/
│ ├── session%3Aabc123.bin # SET session:abc123 …
│ └── counter%3Aglobal.bin # SET counter:global "42"
├── strings_ttl.json # { "session%3Aabc123": 1735689600000, ... }
Severity: medium

Storing all string TTLs in a single JSON file per database scales poorly for large datasets. A single JSON object requires loading the entire index into memory to parse or update. Using a .jsonl format would allow for streaming and better scalability, adhering to the decomposed nature of the backup.

Suggested change
├── strings_ttl.json # { "session%3Aabc123": 1735689600000, ... }
├── strings_ttl.jsonl # {"format_version": 1, "key": "session%3Aabc123", "expire_at_ms": 1735689600000}

Comment on lines +388 to +393
followed (optionally) by a trailing meta line:
```
{"_meta": true, "length": 2, "last_ms": 1714400000001, "last_seq": 0}
```
The meta line lets a restore tool seed `XADD` IDs without re-deriving
them from the entries.
Severity: medium

Including a metadata line with a different schema at the end of a .jsonl file makes the file heterogeneous and harder to process with standard line-oriented tools like jq. Moving this to a separate _stream.json file would be more consistent with the patterns used for other adapters.

Suggested change
followed (optionally) by a trailing meta line:
```
{"_meta": true, "length": 2, "last_ms": 1714400000001, "last_seq": 0}
```
The meta line lets a restore tool seed `XADD` IDs without re-deriving
them from the entries.
followed (optionally) by a sidecar _stream.json containing the stream's
length and last ID. This lets a restore tool seed XADD IDs without
re-deriving them from the entries.

Critical:
- Replace nonexistent kv.ShardedCoordinator.ScanRange with the actual
  ShardStore.ScanAt (kv/shard_store.go:106) plus a thin BackupScanner
  iterator wrapper for paged scans.
- Specify ActiveTimestampTracker.PinWithDeadline plus a per-second
  sweeper goroutine. The existing Pin has no TTL, so a crashed
  producer would block compaction forever without this.

Important:
- Add hll/ to the Redis layout tree.
- Replace .metadata.json sidecar with .elastickv-meta.json reserved
  suffix; document collision handling and rename-collisions flag.
- Spell out cross-shard wait: BeginBackup polls each group applied_index
  until it covers read_ts (5s default deadline) before any scan starts.
  The 50ms skew alone was a liveness assertion, not an enforced wait.
- Specify CHECKSUMS as sha256sum(1)-compatible (verifiable with
  sha256sum -c CHECKSUMS from the dump root).
- Add ListAdaptersAndScopes RPC alongside BeginBackup/EndBackup with
  ttl_ms / pin_token.

Minor:
- Redis TTL: default skip-expired on restore; preserve-ttl opt-in.
- aws s3 sync example needs exclude pattern for sidecars.
- DynamoDB example clarified: composite-key vs hash-only layouts.
- SQS dedup-window reset moved into trade-offs as an explicit callout.
- Redis stream merge mode: refuse non-empty target; auto-id strategy
  falls back to XADD * with an _id_remap.jsonl mapping log.

Tests added: TestPinWithDeadlineExpiry,
TestBeginBackupWaitsForLaggingShard, TestBackupScannerPaging,
TestS3SidecarSuffixCollision, TestRedisTTLExpiredKeySkippedByDefault,
TestRedisStreamMergeRejectsNonEmpty.
@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5719128131

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +637 to +639
5. **Renew on long dumps**: the producer calls `BeginBackup` again with
the same `read_ts` (carrying the existing token) every `ttl_ms / 3`
to extend the deadline. A multi-hour dump never relies on a single

P1: Add a true pin-renewal RPC contract

Step 5 requires renewing an existing backup pin by calling BeginBackup with the same read_ts and token, but the proposed BeginBackupRequest only contains ttl_ms and cannot carry either value. In a long-running dump, the producer therefore has no API path to extend the original pin and will either create a new read_ts (breaking point-in-time guarantees) or let the original pin expire. The request/response schema needs explicit renewal semantics (e.g., token + read_ts on request, or a dedicated RenewBackup RPC).


Comment on lines +198 to +199
also losslessly recoverable from the encoded filename alone — `KEYMAP` is a
convenience, not a correctness dependency.

P2: Correct filename reversibility claim for long keys

The document states key translation is recoverable from encoded filenames alone, but the long-segment rule replaces keys over 240 bytes with a hash prefix plus truncated original bytes. That transformation is not bijective from path text alone, so restores or external tools cannot reconstruct verbatim keys without side metadata (_long_keys.jsonl/KEYMAP). This should be specified as a required dependency for long keys, or the encoding scheme should be made fully reversible.


@bootjp (Owner, Author) commented Apr 29, 2026

@claude review

@claude (Bot) commented Apr 29, 2026

Claude finished @bootjp's task in 3m 13s.


Code Review: docs/design/2026_04_29_proposed_logical_backup.md

  • Gather context and understand the request
  • Read the design document
  • Check existing codebase for consistency
  • Analyze design for correctness, completeness, and risks
  • Post review feedback

The design is well-structured and the motivation is sound — decoupling the backup artifact from internal key encodings so recovery does not depend on the elastickv binary is the right call. The BackupScanner paging approach, the PinWithDeadline extension, and the per-adapter restore-via-public-API strategy are all solid. A few correctness issues need addressing before Phase 1 implementation starts:


P0 — Blocks implementation

1. BeginBackupRequest cannot express renewal

Step 5 of the BeginBackup → EndBackup flow (line 637–640) says:

the producer calls BeginBackup again with the same read_ts (carrying the existing token) every ttl_ms / 3 to extend the deadline

But BeginBackupRequest has only a single field, ttl_ms = 1. There are no fields for read_ts or pin_token, so the renewal described in step 5 is unimplementable — calling BeginBackup again creates a new pin at a new read_ts, not an extension of the existing one. This breaks the point-in-time guarantee for dumps longer than ttl_ms.

Fix: either add optional bytes renewal_token = 2 and optional uint64 renew_for_read_ts = 3 to BeginBackupRequest (server honors them as extend-in-place if present), or define a dedicated RenewBackup(pin_token, ttl_ms) RPC.
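
Whichever shape wins, the producer side reduces to a small renewal loop. A hedged sketch, where the renew callback stands in for whichever RPC is chosen:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// renewPin extends the backup pin every ttl/3 until ctx is cancelled.
// A failed renewal aborts the dump rather than risking a silently
// inconsistent artifact once the pin lapses.
func renewPin(ctx context.Context, renew func(context.Context, []byte) error,
	token []byte, ttl time.Duration) error {
	t := time.NewTicker(ttl / 3)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-t.C:
			if err := renew(ctx, token); err != nil {
				return fmt.Errorf("pin renewal failed, aborting dump: %w", err)
			}
		}
	}
}
```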



2. ListAdaptersAndScopesRequest message is never defined

The proto snippet (line 566–567) declares:

rpc ListAdaptersAndScopes(ListAdaptersAndScopesRequest)
    returns (ListAdaptersAndScopesResponse) {}

ListAdaptersAndScopesResponse is defined (line 586–592), but ListAdaptersAndScopesRequest is not. Even if the request is empty, an explicit message ListAdaptersAndScopesRequest {} is required for the proto to compile.



3. Redis list order example is incorrect

Line 395:

Order is left-to-right (LPUSH [3,2,1] produces ["3","2","1"]).

LPUSH key 3 2 1 pushes 3 to the head, then 2 to the head (in front of 3), then 1 to the head (in front of 2). LRANGE key 0 -1 returns ["1","2","3"], not ["3","2","1"]. The claim has the output inverted. Implementors who follow this example will produce a dump that restores lists in reverse order.

The correct statement: items are written in LRANGE head-to-tail order. LPUSH key 3 2 1 produces an LRANGE order of ["1","2","3"]. Or restate the example using RPUSH if the intent is push-order.



P1 — Design gaps / inconsistencies

4. BeginBackup/EndBackup on a read-only Admin service changes security posture without justification

proto/admin.proto currently has exclusively read RPCs (GetClusterOverview, GetRaftGroups, GetAdapterSummary, GetKeyVizMatrix, GetRouteDetail). BeginBackup and EndBackup are state-mutating: they register and release a compaction pin, and a rogue or unbounded BeginBackup call blocks the compactor indefinitely (up to ttl_ms).

Adding mutations to the Admin service makes the service's authorization requirements more complex — operators who give a token read-only Admin access today would inadvertently be granting backup-pin mutation rights. The design should either: (a) define a new Backup service in proto/backup.proto, or (b) explicitly justify why Admin is the right home and document the authorization boundary.


5. Phase 3 cross-reference to keyviz_cluster_fanout.md is wrong

Lines 891–893:

Concurrent multi-cluster fan-out … depends on the Distribution control plane being fan-out-aware (see 2026_04_27_proposed_keyviz_cluster_fanout.md).

2026_04_27_proposed_keyviz_cluster_fanout.md is the Key Visualizer admin-UI fan-out design (Phase 2-C of the heatmap SPA). It is not a Distribution control plane design. The reference is pointing at the wrong document. Either cite the correct routing/distribution fan-out proposal (if one exists), or remove the parenthetical and note that a distribution fan-out design is a prerequisite.



6. Cross-shard applied_index wait condition is underspecified

Step 2 of the BeginBackup flow says:

every group is required to report applied_index ≥ commit_index_at_pin for the default group's HLC ceiling proposal

The HLC ceiling is only proposed through the default Raft group. Non-default shard groups do not apply HLC ceiling entries — they apply user data commands. The condition applied_index ≥ commit_index_at_pin where commit_index_at_pin comes from the default group's HLC entry is mixing two independent Raft log indexes. A lagging shard's applied_index being ≥ the default group's commit_index says nothing about whether that shard has caught up to read_ts.

The correct wait condition for non-default shards is probably shard.LastAppliedTS ≥ read_ts (i.e., the highest timestamp the shard has applied is at or above the pinned read_ts), not an applied_index comparison across groups. Specify this precisely.
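
A sketch of that per-shard wait, assuming a hypothetical LastAppliedTS-style accessor (no such method exists today; the closure slice stands in for it):

```go
package main

import (
	"context"
	"time"
)

// waitForShards blocks until every shard's highest applied timestamp covers
// readTS, polling on a short interval and honoring the BeginBackup deadline
// via ctx.
func waitForShards(ctx context.Context, lastAppliedTS []func() uint64, readTS uint64) error {
	for _, applied := range lastAppliedTS {
		for applied() < readTS {
			select {
			case <-ctx.Done():
				return ctx.Err() // deadline exceeded: report the lagging shard upstream
			case <-time.After(50 * time.Millisecond):
			}
		}
	}
	return nil
}
```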


7. CHECKSUMS/MANIFEST crash-safety write order is unspecified

The "Risks and Mitigations" table says: "Producer writes MANIFEST.json last; restore refuses dumps without a manifest." But it says nothing about where CHECKSUMS falls in the write order relative to MANIFEST.json. Two scenarios are ambiguous:

  • If CHECKSUMS is written before MANIFEST.json: a crash between the two leaves a CHECKSUMS with no manifest. Restore detects and refuses — fine, but this should be stated.
  • If CHECKSUMS is written after MANIFEST.json: a crash between the two leaves a manifest with no checksums — restore can proceed but cannot verify integrity, which is a worse failure mode.

Explicitly state: CHECKSUMS is written before MANIFEST.json; the presence of MANIFEST.json is the atomicity marker.
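
Under that ordering, the producer's finalize step is two sequential writes. A sketch, with fsync and atomic-rename details elided:

```go
package main

import (
	"os"
	"path/filepath"
)

// finalizeDump writes CHECKSUMS first, then MANIFEST.json; a dump without
// a manifest is by definition incomplete, so a crash between the two writes
// is detected by the restore tool's manifest check.
func finalizeDump(root string, checksums, manifest []byte) error {
	if err := os.WriteFile(filepath.Join(root, "CHECKSUMS"), checksums, 0o644); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(root, "MANIFEST.json"), manifest, 0o644)
}
```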


Minor / Non-blocking

  • docs/design/README.md not updated. Per CLAUDE.md conventions, the design doc index must include the new file.
  • strings_ttl.json as a single JSON object requires loading the entire TTL index into memory to parse. For large Redis databases this could be megabytes to gigabytes. JSONL (strings_ttl.jsonl) would allow streaming — consistent with how messages.jsonl and streams/<key>.jsonl are handled.
  • Stream _meta trailing line creates a heterogeneous JSONL file. A file named .jsonl where the last line has a different schema ({"_meta": true, "length": ..., "last_ms": ...}) breaks tools expecting jq .fields to work on every line. Moving it to a sidecar <key>_meta.json alongside streams/<key>.jsonl is more consistent with the sidecar pattern used for S3 objects and DynamoDB schemas.
  • --mode replace live-traffic rejection is unspecified at the design level. TestRestoreReplaceMode asserts "live traffic is rejected during the replace window" but there is no proposed in-cluster fencing mechanism for this guarantee. This either needs a design note explaining the mechanism or the test description should be weakened.

HIGH:
- TTL renewal made explicit. New RenewBackup RPC; producer renews
  every ttl_ms/3 and aborts the dump on renewal failure (continuing
  past TTL would silently corrupt the artifact). backup_max_ttl_ms
  caps a single window; multi-hour dumps work via repeated renewals.

MEDIUM:
- DynamoDB scalability. Default stays per-item (the user's stated
  requirement: one record per file, named <PK>.json). Added opt-in
  --dynamodb-bundle-mode jsonl for tables where inode count is the
  binding constraint; structured warning emitted when an unbundled
  scope exceeds 1M items.
- S3 file-vs-directory collisions. Producer detects keys like
  "path/to" and "path/to/obj" coexisting and renames the shorter to
  "path/to.elastickv-leaf-data" with KEYMAP.jsonl + manifest record.
- KEYMAP scaled by switching to KEYMAP.jsonl (line-streamable).
- Redis strings_ttl.json -> strings_ttl.jsonl with one TTL record per
  line so TTL count is unbounded.
- max_active_backup_pins (default 4) caps concurrent BeginBackup
  registrations; tracker returns ErrTooManyActiveBackups, RPC surfaces
  ResourceExhausted. Prevents a misbehaving caller from holding the
  compactor open across the entire MVCC retention horizon.

Manifest fields added: s3_collision_strategy, dynamodb_layout,
max_active_backup_pins. CLI flags added: --ttl-ms, --scan-page-size,
--dynamodb-bundle-mode/--dynamodb-bundle-size, --rename-collisions.

Tests added: TestS3PathFileVsDirectoryCollision,
TestBeginBackupTooManyActiveBackups, TestRenewBackupExtendsDeadline.
@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bbd1bb8b8b


Comment on lines +853 to +854
- DynamoDB: `aws dynamodb create-table --cli-input-json _schema.json` and
then `aws dynamodb put-item --item @items/<pk>/<sk>.json` per file.

P1: Use valid CreateTable request JSON for DynamoDB replay

The documented replay step aws dynamodb create-table --cli-input-json _schema.json will not work with the _schema.json shape defined earlier in this doc (format_version, table_name, primary_key, etc.). AWS CLI create-table expects request fields like TableName, KeySchema, and AttributeDefinitions, and --cli-input-json must follow the operation input skeleton format, so users following this runbook will fail before any item restore.


every sidecar as if it were a user object.
- Redis: a 100-line shell script over `find redis/db_0/strings -name '*.bin' -exec redis-cli -x SET …`,
with similar one-liners per type.
- SQS: `jq -c . messages.jsonl | xargs -n1 aws sqs send-message --message-body …`.

P1: Include FIFO-required fields in SQS replay command

The suggested SQS replay one-liner only passes --message-body, but this design explicitly includes FIFO queues (orders-fifo.fifo) and AWS send-message fails for FIFO queues without MessageGroupId (and may also require MessageDeduplicationId when content-based dedup is off). As written, the advertised external restore path cannot replay FIFO dumps and also drops message attributes/metadata needed for faithful recovery.


Round-2 carry-overs:
- Move NewBackupScanner from *ShardedCoordinator to *ShardStore.
  ShardedCoordinator (kv/sharded_coordinator.go:124) has no *ShardStore
  field; ShardStore is the natural home for ScanAt-adjacent code.
- Add a Phase-1 note that ScanAt's doc comment must be updated to
  reflect that BeginBackup's applied_index fence makes cross-shard
  ScanAt globally consistent.
- Add pin_token to ListAdaptersAndScopesRequest so the scope scan is
  bounded to the pinned read_ts; concurrently-created scopes are not
  surfaced.
- Add hll/ to the Redis db_0 example tree (top-level layout already
  had it; the per-adapter example lagged).
- Add --stream-merge-strategy / --preserve-ttl / --preserve-visibility
  to the elastickv-restore apply CLI block.

New nits from the Gemini changes:
- Pick one form for DynamoDB bundle-mode flags. Two-flag form
  (--dynamodb-bundle-mode jsonl + --dynamodb-bundle-size 64MiB)
  throughout; drop the inline jsonl:size=64MiB form.
- RenewBackupRequest.ttl_ms documents the same 60s-24h range
  constraint as BeginBackupRequest, bounded above by
  backup_max_ttl_ms; out-of-range returns InvalidArgument.

Tests added: TestRenewBackupTTLRangeValidation,
TestListAdaptersAndScopesAtPinTS.
@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e3b2936057


Comment on lines +780 to +781
3. **Pin `read_ts`** with `PinWithDeadline(read_ts, now+ttl_ms)`; return
the resulting `pin_token` to the producer.

P1: Pin read_ts before waiting for shard catch-up

This flow pins read_ts only after the shard catch-up wait, which leaves a race where concurrent writes plus compaction can retire versions at read_ts before the scan ever starts. In a busy cluster (or with tight retention), BeginBackup can succeed but the subsequent ScanAt(read_ts) no longer has a guaranteed full snapshot. The pin should be established before the wait (and released on failure) so the fenced timestamp is protected for the entire handshake.


Comment on lines +174 to +177
- **S3 object keys preserve their `/` separators** as filesystem path
separators. `s3://photos/2026/04/29/img.jpg` becomes
`s3/photos/2026/04/29/img.jpg`. This is the whole point — `aws s3 sync`
on the dump directory works.

P1: Encode unsafe S3 path segments before writing to disk

Mapping S3 keys directly to filesystem paths with raw / separators is unsafe for keys containing period-only path segments (for example ./ or ../) and other special segments, because many tools normalize them and can change where data is written/read. That breaks the design goal of deterministic, reversible backups and can cause object collisions or path-escape behavior during dump/restore unless segment normalization is explicitly encoded/escaped.


The format is designed to be extractable without elastickv:

- DynamoDB: `aws dynamodb create-table --cli-input-json _schema.json` and
then `aws dynamodb put-item --item @items/<pk>/<sk>.json` per file.

P1: Use valid aws put-item CLI arguments in replay guidance

The DynamoDB replay command omits required put-item inputs and uses an unsupported file reference form, so operators following this runbook cannot actually restore items. AWS CLI put-item requires at least --table-name and --item, and file input is provided via file://...; without that, the external restore path fails even when item files are correct.


Important (verified against adapter/redis.go):
- Stream TTLs were silently dropped. Streams can carry TTL via
  EXPIRE/PEXPIRE/PEXPIREAT, kept in !redis|ttl| by buildTTLElems
  (adapter/redis.go:2471). Added expire_at_ms to the _meta line so
  restore re-applies the original epoch.
- HLL TTLs were silently dropped. HLL is reported as redisTypeString
  but stores TTL in the legacy scan index (adapter/redis.go:2319-2320
  comment confirms it). Added hll_ttl.jsonl sidecar at the db_<n>
  root, same shape as strings_ttl.jsonl. Sidecar is omitted when no
  HLL key has a TTL.
- Fixed XADD NOMKSTREAM semantic error in the skip-existing
  description. NOMKSTREAM only prevents stream creation; it does
  not skip entries by ID. Replaced with the correct mechanism:
  attempt XADD and treat the "ID equal or smaller" ERR as
  "already present, skip."

Minor:
- Added --begin-backup-deadline 5s to the elastickv-backup dump CLI
  block (the prose mentioned it but the flag list lagged).

Tests added: TestRedisStreamTTLRoundTrip, TestRedisHLLTTLRoundTrip,
TestRedisStreamSkipExistingHandlesERR.
@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 797d5711e5


Comment on lines +182 to +183
- Bytes in `[A-Za-z0-9._-]` pass through.
- Every other byte is rendered as `%HH` (uppercase hex), like

P1: Encode case to prevent filename collisions

The filename encoding keeps [A-Za-z0-9._-] unchanged, but this does not satisfy the stated requirement to handle case-colliding names on case-insensitive filesystems. Two distinct keys like UserA and usera will map to the same path on default Windows/macOS volumes, causing overwrite or dump failure and making the backup non-recoverable in those environments.
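
For reference, a direct Go transcription of the pass-through rule as written. It is deliberately case-preserving, which is exactly the collision this comment raises; any case-folding escape would have to be layered on top:

```go
package main

import (
	"fmt"
	"strings"
)

// encodeSegment renders a key byte-for-byte: [A-Za-z0-9._-] passes through,
// every other byte becomes %HH (uppercase hex).
func encodeSegment(key []byte) string {
	var b strings.Builder
	for _, c := range key {
		switch {
		case c >= 'a' && c <= 'z', c >= 'A' && c <= 'Z',
			c >= '0' && c <= '9', c == '.', c == '_', c == '-':
			b.WriteByte(c)
		default:
			fmt.Fprintf(&b, "%%%02X", c)
		}
	}
	return b.String()
}
```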


Comment on lines +876 to +877
→ CHECKSUMS → MANIFEST.json
→ EndBackup(pin_token)

P2: Include MANIFEST.json in checksum coverage

The flow writes CHECKSUMS before MANIFEST.json, so sha256sum -c CHECKSUMS cannot detect tampering of the manifest even though restore behavior depends on manifest fields (layout, collision strategy, options). This undermines the document’s integrity claim that tampering is caught by checksum verification.


Important (correctness gap, verified against main.go):
- BeginBackup pin was node-local. Each node owns its own
  ActiveTimestampTracker (main.go:294) and only its compactor
  consults it (main.go:335). A pin on the receiving node would not
  protect group leaders on other nodes; their compactors would be
  free to retire MVCC versions at read_ts mid-dump.

  Fix: pins propagate through each Raft group's log as
  BackupPin / BackupExtend / BackupRelease FSM commands. Every replica
  applies them to its local tracker, so the pin set is replicated and
  durable across leader changes (a new leader inherits the pin from
  the same log entry it applied). BeginBackup fans the propose out
  per group; if any group fails to commit within
  --begin-backup-deadline, BeginBackup proposes BackupRelease on
  groups that did commit and returns Unavailable so retry is clean.
  RenewBackup / EndBackup fan out the same way; releases are
  idempotent so partial state is tolerated.
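
The concurrent propose-or-undo described above is a classic all-or-rollback fan-out. A sketch using golang.org/x/sync/errgroup; the propose/release callbacks stand in for the real per-group Raft proposals:

```go
package main

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// fanOutPin proposes BackupPin on every group concurrently; on any failure
// it proposes BackupRelease on all groups (releases are idempotent, so
// rolling back a group that never committed is harmless) and returns the
// error so the caller can surface Unavailable for a clean retry.
func fanOutPin(ctx context.Context, groups []uint64,
	propose, release func(context.Context, uint64) error) error {
	g, gctx := errgroup.WithContext(ctx)
	for _, id := range groups {
		id := id
		g.Go(func() error { return propose(gctx, id) })
	}
	if err := g.Wait(); err != nil {
		for _, id := range groups {
			_ = release(context.Background(), id) // best-effort rollback
		}
		return err
	}
	return nil
}
```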

Minor:
- Specified _id_remap.jsonl path: redis/db_<n>/streams/<key>._id_remap.jsonl
  (next to source) so tools can find it on disk.
- TestRedisStreamSkipExistingHandlesERR description: "the dump"
  -> "the restore" (it's a restore-side test).
- Added KEYMAP.jsonl to every per-adapter scope in the layout tree;
  omitted when empty.

Tests added: TestBeginBackupPinFanOutAllNodes,
TestBeginBackupPinSurvivesLeaderChange,
TestBeginBackupGroupUnreachable.
@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2c00d5604d


Comment on lines +179 to +182
- **S3 object keys preserve their `/` separators** as filesystem path
separators. `s3://photos/2026/04/29/img.jpg` becomes
`s3/photos/2026/04/29/img.jpg`. This is the whole point — `aws s3 sync`
on the dump directory works.

P1: Handle trailing-slash S3 keys in path mapping

This mapping rule makes the on-disk path non-representable for valid S3 objects whose key ends with / (for example, folder-marker objects): preserving separators and writing at the “natural path” means the target path is a directory, not a file, so the object body/sidecar cannot be emitted without renaming. In datasets that contain these keys, dumps will either fail or silently omit/collide objects, breaking restore completeness; the design needs an explicit escape/rename rule for trailing-slash keys.



- DynamoDB: `aws dynamodb create-table --cli-input-json _schema.json` and
then `aws dynamodb put-item --item @items/<pk>/<sk>.json` per file.
- S3: `aws s3 sync --exclude '*.elastickv-meta.json' s3/<bucket>/ s3://target-bucket/`.

P2: Exclude control files from documented S3 replay sync

The documented replay command excludes only *.elastickv-meta.json, so it will also upload backup control artifacts that live under each bucket scope (for example _bucket.json and KEYMAP.jsonl) as regular S3 objects. That contaminates the restored bucket with non-user objects and makes the “external restore” path non-faithful unless additional excludes (or a filtered object list) are documented.


Important (verified against kv/fsm.go):
- Phase 1 scope was missing the FSM plumbing for BackupPin /
  BackupExtend / BackupRelease entries. Added explicit items:
  - new byte tags 0x03/0x04/0x05 in kv/fsm.go alongside existing
    raftEncodeHLCLease=0x02 (kv/fsm.go:116);
  - new applyBackupPin/applyBackupExtend/applyBackupRelease
    handlers, dispatched from kvFSM.Apply (kv/fsm.go:60) in the
    same shape as applyHLCLease;
  - new *ActiveTimestampTracker field on kvFSM (kv/fsm.go:26),
    wired via NewKvFSMWithHLCAndTracker so the FSM and the
    coordinator share one tracker (analogous to the *HLC sharing
    invariant in CLAUDE.md);
  - new kv/backup_codec.go with the 33/25/17-byte fixed binary
    layouts for the three entry types.

- kvFSM.Restore (kv/fsm.go:264-287) wipes in-memory pins. A replica
  that installs a Raft snapshot during the backup window loses its
  BackupPin (snapshot only carries the Pebble store, not the
  tracker). After the leader compacts the log past the BackupPin
  entry, that entry is gone; a follower catching up via snapshot
  never replays it.

  Added explicit safe-bound invariant:
    backup_duration + worst_case_compaction_lag < mvcc_retention_horizon
  BeginBackup enforces a soft form via --snapshot-headroom-entries
  (default 10000). Producer's per-scope expected_keys heuristic
  catches mid-dump snapshot-induced shortfalls. Heavier mitigations
  (snapshotting tracker state, sibling-pin queries) are explicitly
  Phase 3, not Phase 1.

Minor:
- --begin-backup-deadline binding constraint clarified: it is the
  applied_index poll for lagging shards, not the BackupPin commit
  (which is sub-second on a healthy cluster).
- RenewBackup retries per-group BackupExtend up to 3 times with
  500ms backoff before aborting; transient leader elections (<1s
  under etcd/raft defaults) no longer abort otherwise-healthy dumps.
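
The retry shape described here is a plain bounded loop; a sketch, where the propose callback stands in for the per-group BackupExtend proposal:

```go
package main

import "time"

// extendWithRetry retries a per-group BackupExtend proposal up to three
// times with 500ms backoff, so a sub-second leader election does not abort
// an otherwise-healthy dump.
func extendWithRetry(propose func() error) error {
	var err error
	for attempt := 0; attempt < 3; attempt++ {
		if err = propose(); err == nil {
			return nil
		}
		time.Sleep(500 * time.Millisecond)
	}
	return err
}
```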

Tests added: TestBackupPinFSMCodecRoundTrip,
TestRestoreWipesLocalPins, TestBeginBackupRefusesNearSnapshotThreshold,
TestRenewBackupRetriesLeaderElection.
@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 020e0c4c7f


every sidecar as if it were a user object.
- Redis: a 100-line shell script over `find redis/db_0/strings -name '*.bin' -exec redis-cli -x SET …`,
with similar one-liners per type.
- SQS: `jq -c . messages.jsonl | xargs -n1 aws sqs send-message --message-body …`.

P1: Add required queue URL to SQS replay command

The documented replay one-liner cannot run as written because aws sqs send-message requires --queue-url in addition to --message-body; the AWS CLI synopsis/options mark --queue-url as required. Operators following this command will fail before any message replay begins, so the external-restore path is blocked for all queues.


every sidecar as if it were a user object.
- Redis: a 100-line shell script over `find redis/db_0/strings -name '*.bin' -exec redis-cli -x SET …`,
with similar one-liners per type.
- SQS: `jq -c . messages.jsonl | xargs -n1 aws sqs send-message --message-body …`.

P1: Avoid xargs parsing for JSONL SQS message replay

Using jq -c . messages.jsonl | xargs -n1 ... is unsafe for common message bodies containing quotes or backslashes (for example JSON payloads), because GNU xargs treats quotes/backslashes as special delimiters by default and can error with unmatched quotes or alter arguments. That means replay can fail or send mutated payloads for valid dumped messages, so this command is not a reliable restore path.


Comment on lines +163 to +165
sidecar. Operators who hold such keys pass `--rename-collisions` to have
them emitted as `<obj>.elastickv-meta.json.user-data` (the rename is
recorded in `KEYMAP`).

P2: Make sidecar-collision rename mapping injective

The proposed rename for object keys ending in .elastickv-meta.json (<obj>.elastickv-meta.json.user-data) is not collision-safe: a real user object can already exist at that exact key, so both objects map to the same dump path and one is overwritten/omitted. The collision strategy needs a guaranteed-unique escape rule (or iterative disambiguation) to keep restores lossless for these key pairs.


bootjp added 2 commits April 29, 2026 11:08
Important (verified against internal/raftengine/engine.go and
internal/raftengine/etcd/engine.go):
- --snapshot-headroom-entries check was inverted. Replaced with the
  correct formula:
    entries_since_last_snapshot = AppliedIndex - LastSnapshotIndex
    remaining_headroom          = SnapshotEvery - entries_since_last_snapshot
    refuse if: remaining_headroom < --snapshot-headroom-entries
  (raftengine.Status has LastSnapshotIndex not FirstIndex, and
  defaultSnapshotEvery = 10000 is the per-engine snapshot trigger.)
  A freshly-snapshotted cluster has the largest remaining headroom
  and is now correctly allowed; the check refuses backups that would
  trigger the snapshot-installation corner case mid-dump. Default
  --snapshot-headroom-entries lowered from 10000 to 1000 (one-tenth
  of SnapshotEvery) since the formula now expresses what the prose
  intended.
- expected_keys heuristic was undefined. Replaced with a concrete
  expected-keys baseline mechanism:
    * BeginBackup runs a count-only ScanAt at read_ts before the pin
      installs, returns per-scope (key_count, applied_index) in
      BeginBackupResponse.
    * Producer compares actual count against baseline; fails with
      ErrCompactionDuringDump if below 99% +/- sqrt(baseline).
    * Adds keys_only bool parameter to MVCCStore.ScanAt /
      ShardStore.ScanAt so the baseline pass skips value
      materialization (Pebble iterator with payload trimmed at the
      block layer).
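
A Go transcription of the two checks above (a sketch: field names follow this commit's prose, and reading "99% +/- sqrt(baseline)" as a floor of 0.99 * baseline minus sqrt(baseline) is an interpretation of that shorthand):

```go
package main

import "math"

// refuseForHeadroom implements the corrected formula: refuse when the
// entries remaining before the next snapshot trigger fall below the
// configured headroom.
func refuseForHeadroom(appliedIndex, lastSnapshotIndex, snapshotEvery, headroom uint64) bool {
	entriesSince := appliedIndex - lastSnapshotIndex
	if entriesSince >= snapshotEvery {
		return true // a snapshot is already due; zero headroom left
	}
	return snapshotEvery-entriesSince < headroom
}

// baselineOK compares the actual per-scope key count against the
// BeginBackup baseline with the stated tolerance.
func baselineOK(actual, baseline uint64) bool {
	floor := 0.99*float64(baseline) - math.Sqrt(float64(baseline))
	return float64(actual) >= floor
}
```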

Minor:
- _long_keys.jsonl renamed to KEYMAP.jsonl on line 192 to match the
  rest of the spec.
- BackupPin fan-out is now explicitly concurrent (one goroutine per
  group, blocked on all-or-deadline). 100-shard cluster does not pay
  100x serial Raft round-trips.

Tests added: TestExpectedKeysBaselineToleratesTTLExpiry,
TestExpectedKeysBaselineCountOnlyScan. Existing tests
TestRestoreWipesLocalPins and TestBeginBackupRefusesNearSnapshotThreshold
updated to match the corrected formulas.
Important:
- BeginBackupResponse proto block was missing definitions. Added
  ShardApplied {group_id, applied_index} and ScopeKeyCount
  {scope_id, key_count, applied_index_at_count}; added
  "repeated ScopeKeyCount expected_keys = 5" to BeginBackupResponse.

- Expected-keys baseline ordering bug. Documented as running "during
  step 1" but step 1 is "pick read_ts" and step 2 is shard catch-up.
  A lagging shard would return 0 (or partial) for ScanAt at read_ts
  before applying through it, producing an undercount and silently
  disabling ErrCompactionDuringDump on exactly the shards most at
  risk of snapshot-induced pin loss. Moved to "after step 2 (every
  shard has applied through read_ts) and before step 3 (pin)".

Minor:
- --snapshot-headroom-entries CLI default 10000 -> 1000 (matches the
  one-tenth-of-SnapshotEvery default documented in the prose).
- ScanAt keys_only addition is a source-incompatible Go API change
  with ~10 callers in shard_store.go alone. Replaced with a new
  ScanKeysAt overload on MVCCStore / ShardStore (returns [][]byte
  not []*KVPair); existing ScanAt callers untouched.

Tests adjusted: TestExpectedKeysBaselineCountOnlyScan now verifies
both the new path and the absence of a regression on existing
ScanAt callers; new TestExpectedKeysBaselineRunsAfterShardCatchup
guards against the ordering bug.
@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7ca1ae5169


Comment on lines +133 to +135
│ ├── _bucket.json
│ ├── KEYMAP.jsonl # per-scope (omitted when empty)
│ ├── <object-key-path> # original object bytes

P1: Prevent S3 object keys from colliding with control filenames

This layout puts user object payloads and backup control artifacts in the same bucket directory, so valid keys like _bucket.json or KEYMAP.jsonl map to the exact paths reserved for metadata files and will overwrite or block one another. In any bucket containing those keys, the dump cannot be lossless and restore behavior becomes ambiguous unless these control names are explicitly escaped/reserved for user keys.


Comment on lines +195 to +197
- **DynamoDB binary partition / sort keys** (the `B` attribute type) are
rendered as `b64.<base64url>` so a binary key never collides with a
string key whose hex encoding happens to look like base64.

P1: Disambiguate DynamoDB B and S filename encodings

Encoding binary key attributes as b64.<base64url> is not collision-safe with string keys, because string segments in the same character set pass through unchanged and can legitimately equal b64.<...>. A table containing both S and B keys that serialize to the same segment will collide on disk, causing overwrite/omission unless the encoding carries an unambiguous type discriminator for all key kinds.


Comment on lines +224 to +225
is renamed to `path/to.elastickv-leaf-data`, with the rename
recorded in `s3/<bucket>/KEYMAP.jsonl`. The sidecar follows the

P2: Make leaf-data collision rename mapping injective

The file-vs-directory fallback renames path/to to path/to.elastickv-leaf-data, but that target can already exist as a real user object key. When a bucket has path/to, path/to/obj, and path/to.elastickv-leaf-data, two distinct objects map to one dump path, so one is lost unless the rename strategy guarantees uniqueness (for example, iterative escaping or hash-based disambiguation).


Important (verified against internal/raftengine/{etcd/,}engine.go and
kv/fsm.go):
- SnapshotEvery is not on the raftengine.Engine interface today.
  Phase 1 adds StatusReader.SnapshotEvery() uint64 implemented on
  the etcd backend by returning the snapshotEveryFromEnv() value
  cached at Open. Also corrected the env-var name: there is no
  --raftSnapshotEvery flag; the override is
  ELASTICKV_RAFT_SNAPSHOT_COUNT (internal/raftengine/etcd/engine.go:93).

- Rolling-upgrade safety. An old kvFSM receiving a 0x03-tagged
  BackupPin entry falls through decodeRaftRequests ->
  decodeLegacyRaftRequest (kv/fsm.go:140-145) -> proto.Unmarshal ->
  Apply error -> apply-loop stall. A single-release introduction
  would corrupt apply on un-upgraded replicas the moment the first
  BeginBackup runs.

  Phase 1 ships in two releases:
    Release N: kvFSM.Apply treats unknown tags in [0x03, 0x0F] as
      no-ops with an unknown_fsm_tag warning (forward-compat handler,
      no other backup machinery active).
    Release N+1: BackupPin/Extend/Release handlers ship; BeginBackup
      gates on every group member reporting version >= N via
      Admin.GetRaftGroups, returning FailedPrecondition otherwise.
  Operators may treat N and N+1 as the same calendar release if they
  enforce a fleet-wide upgrade window first; the gate still
  short-circuits.
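
A sketch of the Release N guard, with the dispatch names approximated from the file references in this commit (the real Apply signature may differ):

```go
package main

import "log"

const raftEncodeHLCLease = 0x02 // existing tag, per the kv/fsm.go:116 reference above

// applyTagged shows the forward-compat window: tags 0x03..0x0F are
// warn-and-skip no-ops in Release N, so an un-upgraded replica survives a
// BackupPin entry; tags >= 0x10 still fall through to the error path.
func applyTagged(tag byte, data []byte, applyHLC, applyLegacy func([]byte) error) error {
	switch {
	case tag == raftEncodeHLCLease:
		return applyHLC(data)
	case tag >= 0x03 && tag <= 0x0F:
		log.Printf("unknown_fsm_tag tag=%#x len=%d: skipping", tag, len(data))
		return nil
	default:
		return applyLegacy(data)
	}
}
```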

Minor:
- ScopeKeyCount.scope_id format documented for all four adapters,
  including the redis_databases uint32 -> "redis/db_<n>" mapping.

Tests added: TestForwardCompatUnknownFSMTag,
TestBeginBackupGatesOnMinVersion, TestSnapshotEveryReadsFromEngine.
@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5bda8a8b4b


Comment on lines +1288 to +1292
activated. `BeginBackup` itself **gates on every group member
reporting `min_backup_version ≥ N`** — verified through the
existing `Admin.GetRaftGroups` membership view, which already
carries per-node version metadata. If any member reports a
version older than N, `BeginBackup` returns

P1: Gate backups on N+1-capable members

The rollout gate currently allows BeginBackup when members are only at version N, but in this design version N explicitly treats backup FSM tags (0x03–0x0F) as no-ops. That means a mixed cluster with N members will accept BackupPin proposals yet silently skip pin application on those replicas, so their compactors can still retire read_ts versions during an active backup. This breaks the safety guarantee for long scans and leader movement; the gate needs to require members that actually implement applyBackupPin/Extend/Release (i.e., N+1 or an explicit capability flag).


Comment on lines +958 to +960
2. **Wait for shards to catch up**: every group is required to report
`applied_index ≥ commit_index_at_pin` for the default group's HLC
ceiling proposal that produced `read_ts`. `BeginBackup` polls each

P1: Wait on a per-group fence index, not the default group's index

Step 2 requires every shard to satisfy applied_index ≥ commit_index_at_pin from the default group's HLC-ceiling proposal, but Raft log indices are group-local and not comparable across groups. A low-traffic shard can remain healthy yet never reach another group's index, causing false FailedPrecondition, while a high-index shard that passes this check still has not proven it crossed the fence required for read_ts. The wait condition must use a fence index derived per group (or another per-group read barrier).
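
A sketch of the per-group alternative under assumed interfaces (ProposeBarrier and AppliedIndex do not exist in the codebase; they stand in for whatever per-group read barrier the design adopts):

```go
package backup

import (
	"context"
	"fmt"
	"time"
)

// group is an assumed interface: ProposeBarrier commits a no-op entry in
// this group's own log and returns its index; AppliedIndex reports local
// apply progress.
type group interface {
	ProposeBarrier(ctx context.Context) (uint64, error)
	AppliedIndex() uint64
}

// waitForFences records one fence per group, then compares indices only
// within the group that produced them.
func waitForFences(ctx context.Context, groups map[uint64]group) error {
	fences := make(map[uint64]uint64, len(groups))
	for id, g := range groups {
		idx, err := g.ProposeBarrier(ctx)
		if err != nil {
			return fmt.Errorf("group %d: barrier: %w", id, err)
		}
		fences[id] = idx
	}
	for id, g := range groups {
		for g.AppliedIndex() < fences[id] { // group-local comparison only
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(50 * time.Millisecond):
			}
		}
	}
	return nil
}
```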


Important (verified against proto/admin.proto):
- The min_backup_version gate claimed Admin.GetRaftGroups already
  carries per-node version. It does not. RaftGroupState
  (proto/admin.proto:38-49) only has raft_group_id, leader_node_id,
  leader_term, commit_index, applied_index, last_contact_unix_ms.
  Phase 1 therefore adds string node_version = 7 to RaftGroupState,
  populated by each admin handler from the build-time version
  constant. Removed the false "already carries" claim.

Minor:
- MaxEntryBytes is not a constant in the codebase. Replaced with the
  real limit: MaxSizePerMsg (defaultMaxSizePerMsg = 1<<20,
  internal/raftengine/etcd/engine.go:55).
- Forward-compat unknown-tag handler position specified: in
  kvFSM.Apply immediately after the raftEncodeHLCLease guard and
  before decodeRaftRequests, so all leading-byte dispatch stays in
  one place rather than splitting between Apply and the default case
  of decodeRaftRequests. Tags >= 0x10 still fall through and error,
  bounding the forward-compat door.

@chatgpt-codex-connector Bot left a comment

💡 Codex Review (automated suggestions; reviewed commit 9bf18e1c87)

Comment on lines +472 to +473
{"format_version": 1, "fields": {"name": "alice", "age": "30"}, "expire_at_ms": null}
```

P1: Preserve Redis collection elements as binary-safe bytes

The Redis dump format for hashes/lists/sets/streams uses plain JSON strings and objects (for example, fields in hashes/<key>.json), which is not a lossless representation of arbitrary Redis payload bytes. Redis collection members, fields, and values are binary-safe, but JSON strings require valid Unicode text and JSON objects require textual property names, so keys containing non-UTF8 bytes (or other raw byte sequences) will be mutated or will fail serialization during backup/restore. This breaks the document's cross-version, byte-faithful restore goal for valid Redis datasets unless these elements get an explicit byte-safe encoding (e.g., base64-wrapped fields and values).
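
A sketch of one possible byte-safe shape for hashes, with base64-wrapped field/value pairs; the record layout is an assumption, not the design's committed schema:

```go
package dump

import "encoding/base64"

type hashField struct {
	FieldB64 string `json:"field_b64"`
	ValueB64 string `json:"value_b64"`
}

type hashDump struct {
	FormatVersion int         `json:"format_version"`
	Fields        []hashField `json:"fields"`
	ExpireAtMs    *int64      `json:"expire_at_ms"`
}

// encodeHash base64-wraps both field names and values so arbitrary bytes
// survive JSON round-trips. A real dump would also sort Fields so the
// output is deterministic across map iteration orders.
func encodeHash(fields map[string][]byte) hashDump {
	d := hashDump{FormatVersion: 1}
	for f, v := range fields {
		d.Fields = append(d.Fields, hashField{
			FieldB64: base64.StdEncoding.EncodeToString([]byte(f)),
			ValueB64: base64.StdEncoding.EncodeToString(v),
		})
	}
	return d
}
```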


Comment on lines +486 to +487
{"member": "alice", "score": 100.0},
{"member": "bob", "score": 85.5}

P2: Represent ZSET scores as strings to cover ±inf values

The proposed zsets/<key>.json stores score as a JSON number, but Redis sorted sets allow +inf and -inf scores (the repo already exercises this in adapter/redis_lua_collection_fastpath_test.go), and those values are not valid JSON numbers. Any dataset containing infinite scores cannot be faithfully dumped in this schema; it either fails serialization or is lossily coerced. The score field should be encoded as a string (Redis wire form) with explicit handling for +inf/-inf.
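
A sketch of string-encoded scores using the Redis wire forms; helper names are illustrative:

```go
package dump

import (
	"math"
	"strconv"
)

// encodeScore renders a ZSET score in Redis wire form, covering ±inf.
func encodeScore(s float64) string {
	switch {
	case math.IsInf(s, 1):
		return "inf"
	case math.IsInf(s, -1):
		return "-inf"
	}
	return strconv.FormatFloat(s, 'g', -1, 64) // shortest round-trip form
}

// decodeScore parses the wire form back; strconv.ParseFloat already
// accepts "inf" and "-inf" (case-insensitive, optional sign).
func decodeScore(s string) (float64, error) {
	return strconv.ParseFloat(s, 64)
}
```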


Important (verified against adapter/admin_grpc.go):
- Rolling-upgrade gate redesigned. GetRaftGroups iterates the local
  AdminServer's registered groups and reports each row from the
  local engine's Status() (adapter/admin_grpc.go:382-416), so adding
  node_version to RaftGroupState would only ever report the local
  admin server's version. Followers on other nodes would not
  surface, and the gate would falsely pass on a partially-upgraded
  cluster.

  Phase 1 instead adds a dedicated GetNodeVersion RPC fanned out by
  the BeginBackup handler to every peer in AdminServer.members.
  Each peer responds with its compiled-in version constant; the
  handler refuses BeginBackup if any peer is below N or unreachable
  within --begin-backup-deadline (see the sketch after this list).
  RaftGroupState gains leader_node_version = 7 for UI visibility
  only (not used for the gate decision).

- AdminServer has no *kv.ShardStore today (confirmed at
  adapter/admin_grpc.go:62-82). The "existing AdminServer ->
  ShardStore wiring used by GetRaftGroups" claim was wrong;
  GetRaftGroups only reaches AdminGroup.Status() and Configuration().
  Phase 1 adds the field explicitly, wired in main.go.
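
A sketch of the fan-out gate under stated assumptions: versionOf stands in for the proposed GetNodeVersion RPC, and version strings are assumed to carry the leading "v" that golang.org/x/mod/semver requires:

```go
package backup

import (
	"context"

	"golang.org/x/mod/semver"
	"golang.org/x/sync/errgroup"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// gateOnNodeVersions refuses the backup unless every peer answers and is
// new enough.
func gateOnNodeVersions(ctx context.Context, peers []string, min string,
	versionOf func(context.Context, string) (string, error)) error {
	g, ctx := errgroup.WithContext(ctx)
	for _, addr := range peers {
		addr := addr
		g.Go(func() error {
			v, err := versionOf(ctx, addr)
			if err != nil {
				// Distinguish "unreachable" from "old version".
				return status.Errorf(codes.FailedPrecondition,
					"backup gate: %s unreachable: %v", addr, err)
			}
			if semver.Compare(v, min) < 0 {
				return status.Errorf(codes.FailedPrecondition,
					"backup gate: %s runs %s, need >= %s", addr, v, min)
			}
			return nil
		})
	}
	return g.Wait()
}
```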

Minor:
- AdminGroup interface (adapter/admin_grpc.go:41-44) does not embed
  StatusReader, so adding SnapshotEvery() to StatusReader does not
  reach BeginBackup through s.groups[id]. Phase 1 explicitly extends
  AdminGroup with SnapshotEvery() uint64; mocks in
  adapter/admin_grpc_test.go gain one method that returns
  defaultSnapshotEvery for parity with production.

TestBeginBackupGatesOnMinVersion updated to match the new
GetNodeVersion fan-out path; canonical 3-node fixture documented.

@chatgpt-codex-connector Bot left a comment

💡 Codex Review (automated suggestions; reviewed commit a0292c955f)

Comment on lines +1383 to +1384
BackupPin : [tag:1][pin_id:16][read_ts:8][deadline_ms:8] = 33 bytes
BackupExtend : [tag:1][pin_id:16][deadline_ms:8] = 25 bytes

P1: Replicate pin TTL as a duration, not an absolute wall-clock deadline

Encoding deadline_ms into BackupPin/BackupExtend makes expiry depend on each replica's local wall clock, so even modest clock skew can expire pins early on fast nodes while a backup is still running. In that scenario a replica can drop the pin before the next renewal and compact away read_ts versions, breaking snapshot consistency despite a successful BeginBackup. To keep pin lifetime consistent cluster-wide, propagate a lease duration (or an HLC-derived deadline evaluated consistently) instead of a coordinator wall-clock timestamp.
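
A sketch of the duration-based alternative; the type and field names are illustrative, not a wire format:

```go
package kv

import "time"

// backupPin is an illustrative in-memory record; only ttl is replicated.
type backupPin struct {
	pinID  [16]byte
	readTS uint64
	ttl    time.Duration // carried in the log entry: relative, not absolute

	expireAt time.Time // local-only: derived at apply time, never replicated
}

// applyBackupPin computes expiry from this replica's own clock. time.Now()
// carries a monotonic reading and Add preserves it, so wall-clock steps or
// skew between replicas cannot expire the pin early on any one node.
func applyBackupPin(p *backupPin) {
	p.expireAt = time.Now().Add(p.ttl)
}

func applyBackupExtend(p *backupPin, ttl time.Duration) {
	p.expireAt = time.Now().Add(ttl)
}
```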


Important (verified against adapter/admin_grpc.go):
- BeginBackup gate fan-out target was AdminServer.members (the
  static bootstrap seed at lines 64 and 89, never updated after
  startup). Nodes added later via AddVoter would not appear,
  silently missing exactly the nodes most likely to be running an
  older version during a rolling upgrade.

  Fix: BeginBackup reuses the existing snapshotMembers helper
  (adapter/admin_grpc.go:158), the same path GetClusterOverview
  uses, which unions the seed with members discovered from each
  group's Configuration(). The fan-out target is the live cluster,
  not the startup snapshot.

Minor:
- GetRaftGroups today is a synchronous local-state read with no
  network calls. Adding leader_node_version populated via
  GetNodeVersion(leader) would turn it into a fan-out handler with
  an unbounded latency tail. Specified an async dial per leader with
  a 500ms per-leader timeout, "" on failure, and a 10s response
  cache so rapid polls do not fan out repeatedly (see the sketch
  after this list).

- GetNodeVersion failure mode now distinguishes "old version" from
  "unreachable" in the FailedPrecondition error string so operators
  can diagnose without out-of-band investigation. Also noted that
  --begin-backup-deadline is per-peer (fan-out is concurrent), so
  the worst-case wait is one deadline period regardless of cluster
  size.
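
A sketch of the cached probe described in the first bullet above; versionEntry and versionOf are illustrative, and the sync.Map matches the cache the later revision commits to:

```go
package adapter

import (
	"context"
	"sync"
	"time"
)

type versionEntry struct {
	version string
	at      time.Time
}

// leaderVersion probes one leader with a 500ms timeout, reports "" on
// failure (never an error), and caches the result for 10s so rapid
// GetRaftGroups polls do not fan out repeatedly.
func leaderVersion(cache *sync.Map, addr string,
	versionOf func(context.Context, string) (string, error)) string {
	if v, ok := cache.Load(addr); ok {
		if e := v.(versionEntry); time.Since(e.at) < 10*time.Second {
			return e.version
		}
	}
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()
	ver, err := versionOf(ctx, addr)
	if err != nil {
		ver = "" // probe failure surfaces as an empty version
	}
	cache.Store(addr, versionEntry{version: ver, at: time.Now()})
	return ver
}
```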

Tests added: TestBeginBackupGatesIncludesDynamicallyAddedNodes,
TestGetRaftGroupsLeaderVersionAsync.
TestBeginBackupGatesOnMinVersion updated to assert the new error
shapes.

@chatgpt-codex-connector Bot left a comment

💡 Codex Review (automated suggestions; reviewed commit 3efd68d2c7)

Comment on lines +179 to +182
- **S3 object keys preserve their `/` separators** as filesystem path
separators. `s3://photos/2026/04/29/img.jpg` becomes
`s3/photos/2026/04/29/img.jpg`. This is the whole point — `aws s3 sync`
on the dump directory works.

P1: Handle S3 keys that exceed filesystem segment limits

The layout maps S3 keys directly to filesystem paths by preserving / separators, but it defines no fallback for oversized path segments. S3 allows keys up to 1024 UTF-8 bytes, so a valid key containing a single long segment (or a very deep path) can exceed common filesystem limits (for example NAME_MAX/PATH_MAX), making the dump output non-representable and causing backup failure or silent omission of valid objects.
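
A sketch of a per-segment fallback, assuming a 200-byte budget (a design choice, not a codebase constant); very deep paths exceeding PATH_MAX would need a similar flattening step not shown here:

```go
package dump

import (
	"crypto/sha256"
	"fmt"
	"path"
	"strings"
)

// maxSegmentBytes is an assumed conservative budget under the typical
// NAME_MAX of 255 bytes.
const maxSegmentBytes = 200

// safeSegment keeps short segments verbatim and replaces oversized ones
// with a readable prefix plus a hash of the full segment for uniqueness.
func safeSegment(seg string) string {
	if len(seg) <= maxSegmentBytes {
		return seg
	}
	sum := sha256.Sum256([]byte(seg))
	// The byte-wise cut may split a UTF-8 rune; harmless here, since the
	// hash carries the segment's identity.
	return fmt.Sprintf("%s~%x", seg[:160], sum[:8])
}

func dumpPath(bucket, key string) string {
	parts := strings.Split(key, "/")
	for i, p := range parts {
		parts[i] = safeSegment(p)
	}
	return path.Join("s3", bucket, path.Join(parts...))
}
```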



| Risk | Mitigation |
|------|-----------|
| Dump duration exceeds MVCC retention horizon | `read_ts` registered with `active_timestamp_tracker`; compactor cannot retire its versions; producer exits with a typed error if registration is rejected |

P2: Correct the retention-horizon mitigation summary

This mitigation row states that pinning read_ts prevents compaction when a dump exceeds the retention horizon, but the design earlier documents that snapshot installation can drop replica pin state (the kvFSM.Restore path), which re-allows compaction and is exactly why the safe-bound invariant is required. Left as-is, the summary is misleading: operators may infer that long-running backups are protected when the document itself says they are only safe within bounded duration and headroom conditions.


Important (verified against adapter/admin_grpc.go and
kv/grpc_conn_cache.go):
- AdminServer has no outbound gRPC connection mechanism today.
  GetNodeVersion fan-out (in BeginBackup) and the async
  leader_node_version probes (in GetRaftGroups) both need to dial
  peer admin servers. Without specifying the dialer, two
  implementors would make incompatible choices about credentials
  and connection lifecycle.

  Phase 1 explicitly adds a *kv.GRPCConnCache field to AdminServer
  using internal.GRPCDialOptions(), kept separate from
  ShardStore.connCache because admin and data-plane addresses
  occupy different address spaces (admin cache is keyed on
  NodeIdentity.GRPCAddress; data-plane cache is keyed on Raft-
  group-leader addresses, which migrate on leader change).
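
A sketch of the keying contrast only; the stub GRPCConnCache stands in for the real kv.GRPCConnCache, and both structs are abridged:

```go
package adapter

// GRPCConnCache stands in for kv.GRPCConnCache (kv/grpc_conn_cache.go);
// only the keying contrast that motivates two caches is shown.
type GRPCConnCache struct{ /* address-keyed client conns, elided */ }

type AdminServer struct {
	// adminConns is keyed on NodeIdentity.GRPCAddress: admin endpoints
	// are stable per node and never migrate.
	adminConns *GRPCConnCache
}

type ShardStore struct {
	// connCache is keyed on Raft-group-leader addresses, which migrate
	// on leader change, hence the separate cache above.
	connCache *GRPCConnCache
}
```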

Minor:
- 10s leader_node_version cache is now an explicit AdminServer
  struct field (versionCache map[string]struct{...}) listed in
  Phase 1 scope rather than implicit prose.

Tests added: TestAdminGRPCConnCacheReuse (verifies cache reuse,
peer-shutdown eviction, independence from ShardStore.connCache).

@chatgpt-codex-connector Bot left a comment

💡 Codex Review (automated suggestions; reviewed commit 6e62c7e740)

Comment on lines +186 to +187
- Bytes in `[A-Za-z0-9._-]` pass through.
- Every other byte is rendered as `%HH` (uppercase hex), like

P1: Escape dot-only segments in the filename encoding

Allowing . and .. to pass through unchanged breaks path safety for layouts that use key segments as directories (for example DynamoDB items/<pk>/<sk>.json). A valid key segment of .. can escape the intended subtree after path normalization, while . collapses away and can alias other paths. This can overwrite sibling artifacts and makes backup/restore non-deterministic unless dot-only segments are explicitly escaped or rejected.
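
A sketch of the allowlist encoder with the dot-only fix; the function name is illustrative, and the %HH form matches the documented encoding:

```go
package dump

import (
	"fmt"
	"strings"
)

// encodeSegment applies the documented allowlist but refuses to let
// dot-only segments survive: "." and ".." would otherwise pass through
// ('.' is an allowed byte) and then alias or escape the dump subtree.
func encodeSegment(seg string) string {
	if seg == "." {
		return "%2E"
	}
	if seg == ".." {
		return "%2E%2E"
	}
	var b strings.Builder
	for i := 0; i < len(seg); i++ {
		c := seg[i]
		switch {
		case c >= 'A' && c <= 'Z', c >= 'a' && c <= 'z',
			c >= '0' && c <= '9', c == '.', c == '_', c == '-':
			b.WriteByte(c)
		default:
			fmt.Fprintf(&b, "%%%02X", c) // uppercase hex, per the doc
		}
	}
	return b.String()
}
```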


Important (verified against adapter/admin_grpc.go:455-505):
- GetNodeVersion is itself an Admin method, gated by AdminTokenAuth
  on every auth-enabled cluster. The fan-out from BeginBackup
  through *kv.GRPCConnCache makes outbound calls with no metadata
  attached, so every dial would return Unauthenticated and the gate
  would misreport the failure as "node did not respond" on every
  cluster running --adminToken.

  Phase 1 explicitly propagates the inbound bearer token to
  outbound GetNodeVersion calls via metadata.NewOutgoingContext(
  context.Background(), inMD). The same pattern applies to the
  async GetNodeVersion(leader) probes in GetRaftGroups. Forwarding
  the inbound caller's token (not a server-stored value) preserves
  the operator's authorization decision and avoids privilege
  escalation by the admin server.
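
A sketch of the forwarding pattern; the metadata calls are the real grpc-go API, while outboundAdminCtx is an illustrative helper name:

```go
package adapter

import (
	"context"

	"google.golang.org/grpc/metadata"
)

// outboundAdminCtx copies the inbound caller's metadata (including the
// admin bearer token) onto a fresh outgoing context, so the fan-out call
// authenticates as the operator who invoked BeginBackup.
func outboundAdminCtx(inbound context.Context) context.Context {
	out := context.Background()
	if inMD, ok := metadata.FromIncomingContext(inbound); ok {
		// Forward the caller's token, never a server-stored value, to
		// preserve the operator's authorization decision.
		out = metadata.NewOutgoingContext(out, inMD)
	}
	return out
}
```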

Minor:
- versionCache committed to sync.Map (with key/value types
  spelled out) so the implementor cannot accidentally produce a
  data race by reusing groupsMu for cache writes. The previous
  "(or sync.Map equivalent)" left enough room for a plain map +
  groupsMu approach that would either contend or race.

Tests added: TestBeginBackupPropagatesAdminAuthToken,
TestVersionCacheRaceUnderLoad (the latter runs under -race).

@chatgpt-codex-connector Bot left a comment

💡 Codex Review (automated suggestions; reviewed commit 486675d414)

```
- `streams/<key>.jsonl`: one entry per line, in stream order:
```
{"id": "1714400000000-0", "fields": {"event": "login", "user": "alice"}}

P1: Preserve Redis stream field order and duplicates

The stream record schema models fields as a JSON object, but Redis stream entries are ordered field-value lists that can legally contain repeated field names. The proposal therefore cannot round-trip valid XADD payloads that rely on ordering or duplicate fields (the repo's stream model uses an ordered slice for fields), so restore would silently alter message contents. Use an ordered representation (e.g., a flat array or an array of {field,value} pairs) instead of an object.
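
A sketch of an order- and duplicate-preserving record, reusing the base64 wrapping proposed for the other collection types; the shape is an assumption, not the committed schema:

```go
package dump

import (
	"encoding/base64"
	"encoding/json"
)

type streamField struct {
	FieldB64 string `json:"field_b64"`
	ValueB64 string `json:"value_b64"`
}

type streamEntry struct {
	ID     string        `json:"id"`
	Fields []streamField `json:"fields"` // preserves XADD order and repeats
}

// encodeEntry keeps fields as a slice of pairs, so both ordering and
// duplicate field names survive the JSON round-trip.
func encodeEntry(id string, pairs [][2][]byte) ([]byte, error) {
	e := streamEntry{ID: id}
	for _, p := range pairs {
		e.Fields = append(e.Fields, streamField{
			FieldB64: base64.StdEncoding.EncodeToString(p[0]),
			ValueB64: base64.StdEncoding.EncodeToString(p[1]),
		})
	}
	return json.Marshal(e)
}
```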


Comment on lines +486 to +487
{"member": "alice", "score": 100.0},
{"member": "bob", "score": 85.5}

P2: Encode Redis ZSET members with a byte-safe format

The ZSET dump shape stores member as a plain JSON string, which is not byte-safe for arbitrary Redis member bytes. Valid datasets containing non-UTF8 or control-byte members will be mutated or rejected during serialization or replay, breaking the document's lossless backup goal. Define an explicit byte-safe encoding for ZSET members (as for the other binary-safe paths above) so restores stay faithful.

