Conversation
Add a Proposed design doc covering value-level + Raft-proposal envelope encryption with KEK/DEK separation and KMS-backed key custody, so the cluster's persisted state is undecryptable without externally-held key material.
|
📝 Walkthrough: Proposes elastickv data-at-rest encryption: per-value AES-256-GCM storage envelopes, separate Raft-level envelopes with distinct DEKs, external KEK wrapping, a cluster-wide nonce/epoch model, crash-durable sidecar DEK persistence, a three-phase rollout, MVCC encryption-state bits, admin APIs, and refusal/resync semantics.
Sequence diagram:

```mermaid
sequenceDiagram
    participant Admin
    participant Leader
    participant KMS
    participant Raft
    participant Sidecar
    participant WAL
    participant Pebble
    Admin->>Leader: request DEK rotate / generate
    Leader->>KMS: request KEK wrap (wrap DEK)
    KMS-->>Leader: wrapped DEK
    Leader->>Raft: propose DEK-commit (wrapped DEK)
    Raft-->>Leader: commit index
    Leader->>Sidecar: persist wrapped DEK with commit index
    Sidecar-->>Leader: ack persisted
    Admin->>Leader: propose write (key + clear value)
    Leader->>Leader: create raft-envelope (op + key + clear value)
    Leader->>Raft: propose raft-envelope (encrypted with raft-DEK)
    Raft-->>WAL: append encrypted raft-envelope
    Raft-->>Pebble: apply -> create storage-envelope (encrypt value with storage-DEK)
    Pebble-->>Pebble: persist storage-envelope (value ciphertext)
    Pebble-->>Leader: ack applied
```

Estimated code review effort: 🎯 3 (Moderate), ⏱️ ~30 minutes.
🚥 Pre-merge checks: ✅ 5 passed.
|
@claude review
|
Claude finished @bootjp's task in 5m 48s — Review:
|
| Item | Status |
|---|---|
| Filename `2026_04_29_proposed_data_at_rest_encryption.md` | ✅ matches `YYYY_MM_DD_proposed_<slug>` |
| Header `Status: Proposed` | ✅ |
| Header `Author: bootjp` | ✅ |
| Header `Date: 2026-04-29` | ✅ |
| No `_design` suffix in slug | ✅ |
| Five CLAUDE.md self-review lenses in §10 | ✅ |
| Open question §11.5 (Lua batching) flagged as pre-implementation gate | ✅ |
Summary
The document's overall architecture (boundary choice, envelope format, two-tier KEK/DEK hierarchy, Raft-coordinated rotation, compress-before-encrypt) is sound and well-motivated. The three issues that need resolving before the implementation PR opens are: (1) the version-byte collision with lsm_store.go's existing tombstone byte encoding, (2) the coordinator-vs-storage-layer double-encryption contradiction, and (3) the single authoritative envelope diagram that includes the compress flag. The "same ciphertext" framing in §1/§4.2 should also be corrected to avoid misdirecting the implementor.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0a9afe72f8
> envelope-version byte (`0x00` = cleartext, `0x01` = encrypted) lets
> any node read either format.
There was a problem hiding this comment.
Stop using leading byte to detect legacy cleartext values
The migration plan assumes 0x00 = cleartext and 0x01 = encrypted is sufficient for mixed-mode reads, but pre-encryption data already on disk is arbitrary raw bytes with no version prefix. Any legacy value whose first byte is 0x01 will be misclassified as an encrypted envelope and can fail with tag/format errors, making existing data unreadable during rollout or before rewrite completes. This breaks the stated online migration safety guarantees for workloads storing binary payloads.
Code Review
This pull request proposes a design for data-at-rest encryption in elastickv, utilizing AES-256-GCM at the storage and Raft layers with a two-tier key hierarchy (KEK/DEK). Feedback focuses on correcting the safe write limits for random 96-bit nonces to align with NIST standards, ensuring the encryption rewrite job accounts for MVCC history and zero-downtime requirements, and updating the envelope format documentation to include the compression flag in both the diagram and the AAD definition.
> up to ~2⁴⁸ writes per DEK; we rotate DEKs well before that bound
> (see §5.2).
The claim that random 96-bit nonces are safe up to ~2⁴⁸ writes per DEK is incorrect: 2⁴⁸ is the birthday bound, not a safety limit, and NIST SP 800-38D §8.3 caps random-nonce invocations at 2³² per key.
> `elastickv-admin encryption rewrite` job can sweep the keyspace
> over time to retire old DEKs; until that job has completed and the
> retired DEK has zero references, the operator must keep the old
> DEK loaded.
The design for the encryption rewrite job should clarify how it interacts with MVCC history. This presents a significant operational risk regarding DEK retirement and write amplification. As this constitutes a production-grade system migration, the plan must detail mitigation strategies, such as a temporary "bridge" or "proxy" mode, and strategies for live or zero-downtime cutovers (e.g., dual-write proxy) to avoid service interruption during the rewrite process.
References
- Significant operational risks and production-grade system migrations must detail mitigation strategies (like bridge/proxy modes) and zero-downtime cutover strategies.
```
+--------+---------+----------+----------------+--------+
|  0x01  | key_id  |  nonce   |   ciphertext   |  tag   |
| 1 byte | 4 bytes | 12 bytes |    N bytes     |  16 B  |
+--------+---------+----------+----------------+--------+
```
The envelope format diagram is missing the 1-byte "compressed?" flag described in §6.4. Additionally, the AAD definition in §4.1 (line 155) should be updated to include this flag to prevent header-rewrite attacks that could toggle the compression bit and cause decryption or decompression failures. The per-value overhead estimate in §4.1 (line 163) should also be updated from 33 to 34 bytes for consistency with the disk overhead calculation in §8.2.
```diff
-+--------+---------+----------+----------------+--------+
-|  0x01  | key_id  |  nonce   |   ciphertext   |  tag   |
-| 1 byte | 4 bytes | 12 bytes |    N bytes     |  16 B  |
-+--------+---------+----------+----------------+--------+
++--------+------------+---------+----------+----------------+--------+
+|  0x01  | compressed | key_id  |  nonce   |   ciphertext   |  tag   |
+| 1 byte |   1 byte   | 4 bytes | 12 bytes |    N bytes     |  16 B  |
++--------+------------+---------+----------+----------------+--------+
```
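To make the corrected format concrete, here is a minimal Go sketch of the v1 envelope using the standard library's AES-GCM. `SealEnvelope` and `OpenEnvelope` are hypothetical names, and the nonce is passed in as a parameter rather than produced by the design's nonce scheme:

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"encoding/binary"
	"errors"
)

const (
	envVersion = 0x01
	headerLen  = 1 + 1 + 4 + 12 // version + compressed flag + key_id + nonce
	gcmTagLen  = 16             // total per-value overhead: 18 + 16 = 34 bytes
)

// SealEnvelope builds [version | compressed | key_id | nonce | ciphertext | tag].
// The first six bytes (version, flag, key_id) are bound as AAD, so a header
// rewrite such as toggling the compressed bit fails authentication.
func SealEnvelope(dek []byte, keyID uint32, nonce []byte, compressed bool, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(dek) // dek must be 32 bytes for AES-256
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block) // 12-byte nonce, 16-byte tag
	if err != nil {
		return nil, err
	}
	hdr := make([]byte, headerLen)
	hdr[0] = envVersion
	if compressed {
		hdr[1] = 0x01
	}
	binary.BigEndian.PutUint32(hdr[2:6], keyID)
	copy(hdr[6:], nonce)
	// Seal appends ciphertext+tag after the 18-byte header; AAD = hdr[:6].
	return gcm.Seal(hdr, nonce, plaintext, hdr[:6]), nil
}

// OpenEnvelope authenticates the header (AAD) and decrypts the value.
func OpenEnvelope(dek, env []byte) ([]byte, error) {
	if len(env) < headerLen+gcmTagLen || env[0] != envVersion {
		return nil, errors.New("not a v1 envelope")
	}
	block, err := aes.NewCipher(dek)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	return gcm.Open(nil, env[6:18], env[18:], env[:6])
}
```

The overhead is 18 header bytes plus the 16-byte tag, matching the 34-byte figure the review settles on.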
|
@claude review
|
Claude finished @bootjp's task in 3m 52s — Design Review: Data-at-Rest Encryption
Conventions check: filename, header (Status/Author/Date), and lifecycle marker all conform. ✓

Critical issues

1. Migration version-byte detection is broken for pre-existing data
This assumption is fundamentally incorrect. Pre-encryption values stored in Pebble are raw bytes with no version prefix at all. When the read path branches on the first byte to detect format, any existing cleartext value whose first byte happens to be `0x01` is misclassified as an envelope. This is not a corner case: for binary workloads (images, serialized protobufs, arbitrary byte payloads) first-byte `0x01` occurs routinely. The Codex review correctly flagged this as P1. Approaches worth considering include a per-version encryption-state bit in MVCC metadata rather than sniffing value bytes.
2. AES-GCM random nonce limit stated incorrectly — 90-day rotation may be unsafe
2⁴⁸ is the birthday bound for a 96-bit space — the point at which nonce collision probability reaches ~50%. This is the danger threshold, not the safety limit. NIST SP 800-38D §8.3 recommends a maximum of 2³² random invocations per key to keep collision probability cryptographically negligible (< 2⁻³²). At the 10k writes/s baseline given in §8.1, 2³² invocations is reached in approximately 5 days — well inside the stated 90-day rotation target. The design as written could allow a 90-day-old DEK to be used for ~18× more writes than NIST considers safe for random nonces. Two remedies, which can be combined: rotate DEKs before 2³² writes, or switch to counter-based nonces that eliminate the random-collision bound.
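The arithmetic behind the "approximately 5 days" figure is easy to check. A sketch with a hypothetical helper name:

```go
package main

// DaysToNonceCeiling is a hypothetical helper computing how long a DEK
// lasts at a sustained write rate before reaching the NIST SP 800-38D
// ceiling of 2^32 random-nonce invocations per key.
func DaysToNonceCeiling(writesPerSecond float64) float64 {
	const ceiling = 1 << 32 // NIST SP 800-38D §8.3 bound for random nonces
	return ceiling / writesPerSecond / 86400
}
```

At 10,000 writes/s this yields roughly 4.97 days, which is where the 5-day claim comes from.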
High-priority issues

3. Double-encryption ambiguity in §6.3 / §4.2
§4.2 states that the coordinator encrypts before building the proposal. §6.3 then states that the FSM unwraps on apply before dispatching to the existing storage handlers. But §6.2 says the storage layer encrypts on every write path. The design is ambiguous about which of the two interpretations is intended.
Either is workable; the doc needs to commit to one path and trace it explicitly through §4.2 → §6.2 → §6.3 without contradiction.

4. Rewrite job must handle all MVCC versions, not just the latest
The design says the rewrite job walks the keyspace to rewrite cleartext → ciphertext and to retire old DEKs after rotation. But MVCC retains multiple versions per logical key, and the rewrite must cover them all. The doc should specify how the rewrite job iterates MVCC history (likely a full Pebble scan of the MVCC keyspace, not just a logical key scan) and how the checkpoint cursor survives crashes.

Medium issues

5. Envelope format diagram is missing the compress flag
The §4.1 envelope diagram does not include the 1-byte "compressed?" flag added in §6.4. This has two downstream inconsistencies: the §4.1 AAD definition omits the flag, and the stated per-value overhead (33 bytes) disagrees with §8.2's disk-overhead calculation (34 bytes). The compress flag should either appear in the §4.1 format (preferred, since it's part of v1) or be clearly called out as a v2 extension (with §8.2 updated to match §4.1's 33-byte claim).
|
- P1 (codex): replace value-byte cleartext detection with a per-version encryption_state bit in MVCC metadata, so a legacy value whose first byte happens to equal the envelope version cannot be misclassified as ciphertext.
- High (gemini): retract the 2^48 random-nonce claim (birthday bound, not a safety limit). Switch to per-DEK counter-based nonces and cap each (DEK, process-load) at 2^32 writes per NIST SP 800-38D §8.3.
- High (gemini): add §5.4 detailing how the rewrite job iterates over MVCC versions, rewrites in place at the same commit_ts, and what dual-condition criterion (cursor done + minRetainedTS advanced) gates DEK retirement.
- Medium (gemini): add the compressed flag byte to the envelope diagram, include it in the AAD to defeat header-rewrite attacks against the decompressor, and bump per-value overhead from 33 to 34 bytes.
|
@claude review
|
Claude finished @bootjp's task in 5m 50s — Review:
|
| Issue | Resolution |
|---|---|
| P1 (Codex): version-byte cleartext detection | Replaced with per-version `encryption_state` bit in MVCC metadata (§7.1) — correct fix |
| High: 2⁴⁸ random-nonce claim | Retracted; now uses counter-based nonces with 2³² write ceiling per NIST SP 800-38D §8.3 (§4.1, §5.2) |
| High: rewrite job must cover MVCC history | Added §5.4 with full iteration/retirement/crash-safety protocol |
| Medium: compress flag missing from envelope | Added to diagram, AAD, and overhead count (§4.1) — 34 bytes now consistent everywhere except §8.2 (see below) |
These were the right fixes. The remaining issues below are what the second commit did not address.
Critical (still blocking implementation)
1. Double-encryption is still designed in — §4.2 contradicts §6.2/6.3
The data-flow has three statements that, taken together, produce AES-256-GCM applied twice to every value:
- §4.2: "the coordinator encrypts before building the proposal" — coordinator applies §4.1 (storage DEK) to the value before Raft.
- §6.2: "Put paths wrap before `pebble.Batch.Set`" — the storage layer always encrypts on write.
- §6.3: "FSM unwraps on apply before dispatching to the existing storage handlers" — the FSM calls the standard `lsm_store.PutAt`, which invokes the §6.2 hook.
Follow the path: the coordinator produces `enc_storage(plaintext)` → the Raft proposal carries `enc_raft(enc_storage(plaintext))` → the FSM decrypts the raft layer → gets `enc_storage(plaintext)` → calls `lsm_store.PutAt` → the storage hook encrypts again → Pebble stores `enc_storage(enc_storage(plaintext))`. That is double encryption.
§3(b) promises "no double-encryption", which is the correct design goal. The document needs to commit to one of:
- Option A (storage-layer-only, simpler): Coordinator does NOT pre-encrypt. It only wraps the full proposal with the raft DEK. On FSM apply, after unwrapping the raft DEK, the value is cleartext and the storage layer applies the storage DEK normally. §4.2 then needs to be rewritten: the coordinator wraps key + op + cleartext-value with the raft DEK, not pre-encrypted value.
- Option B (coordinator pre-encrypts): The coordinator applies §4.1 (storage DEK) before proposing. The FSM must call a "raw Pebble write" path in `lsm_store.go` that bypasses the §6.2 encryption hook — explicitly specified as a new `PutEncryptedAt` or similar. §6.2 then needs a note that this bypass path exists and must be used from the FSM.
Neither option is wrong; Option A is significantly simpler. The document must pick one and remove the text that implies the other.
High (must resolve before implementation)
2. key_id type is undefined — string in sidecar, uint32 in envelope
The envelope carries a 4-byte key_id (uint32). The sidecar keys.json uses string identifiers ("k1a2", "k7c9"). The mapping between the two is nowhere defined. The decrypt path must do a constant-time lookup of key_id → []byte DEK, so the implementation must pick one of: hash of string, sequential counter persisted in the sidecar, first/last 4 bytes of a hex string, etc.
This is a wire-format decision. An implementation that chooses wrong and then rotates will permanently fail to decrypt old values. Specify the exact encoding and round-trip behavior.
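One workable encoding, sketched in Go: draw the `key_id` from a CSPRNG (as the later revision of the doc specifies), store its decimal string form in the sidecar JSON, and parse it back losslessly. The helper names are hypothetical:

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"strconv"
)

// NewKeyID draws a random 32-bit key identifier. The sidecar JSON keys on
// its decimal string form while the envelope stores the raw uint32.
func NewKeyID() (uint32, error) {
	var b [4]byte
	if _, err := rand.Read(b[:]); err != nil {
		return 0, err
	}
	return binary.BigEndian.Uint32(b[:]), nil
}

// KeyIDToJSON and KeyIDFromJSON define the lossless round trip between the
// binary envelope field and the sidecar's string identifier.
func KeyIDToJSON(id uint32) string { return strconv.FormatUint(uint64(id), 10) }

func KeyIDFromJSON(s string) (uint32, error) {
	v, err := strconv.ParseUint(s, 10, 32)
	return uint32(v), err
}
```

Because the string form is derived deterministically from the uint32 (and rejected if it overflows 32 bits), rotation can never produce an identifier that fails to round-trip.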
3. encryption_state bit placement in the existing 9-byte header is underspecified
§7.1 says "We add a 2-bit `encryption_state` field to that header." The current `lsm_store.go` header is exactly 9 bytes: `[tombstone(1)][expireAt(8)]`, where tombstone is decoded as `data[0] != 0`.
The design does not say which bytes/bits the field occupies. The two natural choices have different compat implications:
- Pack into byte 0 (spare bits): Set bit 1 of the tombstone byte to indicate encryption. Old `decodeValue` reads `tombstone := data[0] != 0` — any non-zero encryption state in bits 1–7 is misread as a tombstone, silently losing the live value. This is the "right direction" for forward compat (old code errors rather than serving cleartext-from-ciphertext), but the error mode is a silent wrong answer, not a clear refusal to start.
- Extend the header to 10 bytes: Add a dedicated encryption byte. Old `decodeValue` mis-slices `expireAt` (reads 8 bytes starting at the new encryption byte, getting a wrong expiry) and misinterprets the value start. Equally confusing.
§9.1's startup refusal ("process refuses to start if sidecar present but encryption disabled") prevents the above in normal operation, but the design should still document the exact bit layout with a precise valueHeaderSize change (or confirmation it stays 9) and note why old code can't serve encrypted data even if somehow bypassing the startup check.
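The byte-0 packing option can be pinned down as a small sketch: bit 0 stays tombstone, bits 1–2 carry the encryption state, and any other set bit is refused rather than guessed at. The names are illustrative, not the actual lsm_store.go code:

```go
package main

import "errors"

// Byte 0 of the 9-byte value header packs tombstone (bit 0) and a 2-bit
// encryption_state (bits 1-2); valueHeaderSize stays 9 under this option.
const (
	tombstoneBit  = 0x01
	encStateShift = 1
	encStateMask  = 0x06 // bits 1-2
)

func packHeaderByte(tombstone bool, encState uint8) byte {
	var b byte
	if tombstone {
		b |= tombstoneBit
	}
	return b | ((encState << encStateShift) & encStateMask)
}

// unpackHeaderByte refuses unknown bits so a newer format is never
// silently misread; old code lacking this check would misread any
// non-zero byte 0 as a tombstone, which is the failure mode above.
func unpackHeaderByte(b byte) (tombstone bool, encState uint8, err error) {
	if b&^(tombstoneBit|encStateMask) != 0 {
		return false, 0, errors.New("unknown header bits: refuse to serve")
	}
	return b&tombstoneBit != 0, (b & encStateMask) >> encStateShift, nil
}
```

This is only a sketch of the "pack into byte 0" option; the 10-byte-header alternative would need no bit masking but changes `valueHeaderSize`.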
Medium (address before or alongside implementation)
4. Sidecar crash safety during DEK rotation is uncovered
§5.4.6 covers rewrite cursor crash safety ("cursor updated in same Pebble batch as rewritten value") — that's good. But the sidecar (keys.json) write during DEK rotation has no equivalent guarantee. The rotation Raft entry commits, the FSM applies it (updating in-memory keystore), then writes keys.json to disk. A crash between the apply and the fsync leaves the sidecar stale after restart.
On restart, §9.1 checks "wrapped DEKs unwrap under configured KEK" — a stale sidecar (missing the new DEK) would pass this check and silently boot with the old DEK as active, while Pebble already holds values encrypted under the new DEK. Those values would fail to decrypt.
Specify the recovery path: the FSM should also persist a "pending rotation" marker in Pebble under the same batch as any in-memory DEK state change, so on restart it can detect the sidecar-vs-log divergence and re-derive the sidecar from the Pebble marker before opening.
5. --encryption-rotate-on-startup follower behavior undefined
§6.5 documents the flag as "Force a DEK rotation at boot (used by ops scripts; not for normal restarts)" but doesn't say what happens when the booting node starts as a follower (the common case in a rolling restart). A follower cannot propose to Raft; if it tries, the proposal is rejected or queued indefinitely.
Add one sentence: "Rotation is attempted only after the node successfully wins a leadership election; on followers this flag is a no-op until the node becomes leader, at which point it proposes the rotation automatically."
Minor
6. §3(b) "single ciphertext object" claim is still inaccurate
§3(b) Pro: "Single ciphertext object — no double-encryption — across every persistent surface (Raft WAL, Pebble SST, FSM snapshot, snapshot streaming)."
This is not accurate even if issue #1 is resolved as Option A. The Raft WAL carries raft_DEK(key + op + value) while Pebble SSTs carry storage_DEK(value) — two different ciphertexts of different plaintexts. The true property is: no persistent surface holds cleartext user values. "Single ciphertext object" misdirects an implementor into believing the same bytes flow through all surfaces.
7. §8.2 overhead breakdown inconsistency
§8.2: "33 bytes (envelope) + up to 1 byte (compress flag) = 34 bytes per stored value"
But the flag byte is already inside the envelope (the §4.1 diagram shows it as the second field, and §4.1 says "Per-value overhead is 34 bytes (1 version + 1 flag + 4 key_id + 12 nonce + 16 tag)"). The breakdown in §8.2 should read "34-byte envelope", not "33 + 1".
8. AES-NI assumption in §8.1 is implicit
"AES-256-GCM with AES-NI runs at 3–5 GB/s per core" is correct for x86_64 with hardware acceleration. Go's crypto/aes falls back to a pure-software implementation on non-AES-NI targets (older ARM, RISC-V) where throughput is ~200–400 MB/s — a 10–15× reduction that could push encryption CPU well above the stated 10% regression budget in §8.3. Add: "The performance numbers in this section assume AES-NI. Operators targeting non-AES-NI hardware must re-run the §8.3 benchmarks against the ≤10% regression budget."
9. Verified OK: reserved-key namespace
!encryption| does not collide with existing reserved prefixes (!redis|, !sqs|, !dist| — confirmed in store/write_conflict_counter.go, kv/shard_key.go, distribution/catalog.go).
Conventions check
| Item | Status |
|---|---|
| Filename `2026_04_29_proposed_data_at_rest_encryption.md` | ✅ |
| Header Status/Author/Date | ✅ |
| Five CLAUDE.md self-review lenses in §10 | ✅ |
| §10 lens 1 updated to reference MVCC-bit approach | ✅ |
| Open question §11.5 (Lua batching) flagged | ✅ |
| §11.6 (TTL expiry visibility) cross-referenced to §2.2 | ✅ |
Summary
The second commit closed the four most-cited issues across all prior reviews. Two blockers remain before the implementation PR opens:
- §4.2 + §6.2/6.3 double-encryption — the design still implies the storage DEK is applied twice per write. Pick Option A (coordinator wraps proposal with raft DEK only; storage layer applies storage DEK on FSM apply) and rewrite §4.2 accordingly. This is also the change that makes §3(b)'s "no double-encryption" claim accurate.
- `key_id` type mismatch — the uint32 in the envelope must be precisely mapped to the string identifier in `keys.json`. Define it (recommended: sequential uint32 counter persisted in the sidecar) and update the JSON example.
The encryption_state bit placement (#3) should also be pinned before implementation begins since it determines whether valueHeaderSize changes and how backward compatibility is structured.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 79d446f338
> Restart is rolling; mixed-mode clusters are supported because
> each MVCC version carries its own `encryption_state` bit and
> reads dispatch on that bit, not on the value bytes.
Gate Raft payload encryption on full-cluster rollout
The rollout step says mixed-mode restarts are safe because reads dispatch on MVCC encryption_state, but §4.2 also encrypts the entire Raft Data[] payload; a follower that has not yet been restarted with KEK/DEK material cannot decode committed entries and can fail apply during the rolling window. This makes the documented "restart one node at a time" migration unsafe unless proposal-envelope encryption is deferred until all replicas are encryption-capable (or a compatibility mode is defined).
> The counter's high 32 bits are a per-process random prefix
> (re-drawn whenever the DEK is loaded, so two processes that share
> a DEK never collide), and the low 64 bits are an `atomic.Uint64`
> incremented per write. This makes nonce reuse impossible within
Replace random 32-bit nonce prefix with collision-free source
This claims two processes sharing a DEK "never collide," but a 32-bit random prefix can collide across nodes/restarts; if it does, both processes generate the same counter sequence and reuse (key, nonce) pairs. For AES-GCM, even a single nonce reuse under the same key is a catastrophic integrity/confidentiality failure, so this construction needs a deterministic uniqueness mechanism (or a much stronger persisted nonce allocator) instead of probabilistic 32-bit separation.
> - `sidecar.go` — atomic read/write of `keys.json` (write to
>   `keys.json.tmp` + `os.Rename`).
Make key sidecar updates crash-durable, not just atomic
The sidecar write plan uses keys.json.tmp + os.Rename, which is atomic for visibility but not sufficient for crash durability by itself; a power loss after Raft commits a rotation but before the new file/dir metadata is durably synced can roll back keys.json and strand ciphertext written under the new DEK. Because losing wrapped DEK metadata is unrecoverable for encrypted data, the design should require fsync of file and parent directory (or equivalent durable commit semantics) before acknowledging rotation success.
- Critical: Resolve the §4.2 vs §6.2 double-encryption ambiguity by committing to storage-layer-only value encryption; the coordinator only wraps the proposal Data with the raft DEK and the FSM hands cleartext to the storage layer, which is the sole encryption site.
- Critical: Tie the per-version encryption_state bit to the `lsm_store.go` encodeValue header. The leading byte now encodes tombstone (bit 0) plus encryption state (bits 1–2). Old builds reading an encrypted value see an unrecognised tombstone byte and refuse to serve, never returning ciphertext as cleartext. `valueHeaderSize` stays 9.
- High: Replace the misleading "same ciphertext flows through Raft -> WAL -> Pebble" framing in §1 and §3 with the correct invariant — each persistent surface holds ciphertext, but Raft and Pebble use different DEKs, so the bytes differ.
- High: New §5.5 sidecar / Raft-log reconciliation — `raft_applied_index` in keys.json, `ErrSidecarBehindRaftLog` refusal at startup, automatic leader rewrap plus manual follower resync-sidecar repair.
- Medium: Define key_id as a 32-bit CSPRNG-drawn integer with the sidecar JSON keying on the decimal string form; the on-disk envelope and in-memory keystore stay binary uint32.
- Medium: Specify `--encryption-rotate-on-startup` behaviour on followers — the request is queued in memory and auto-fires only if this node becomes leader within the same uptime window; safe to set across a rolling restart.
- Medium: §5.4 step 7 covers rewrite-cursor encryption bootstrapping; step 8 documents the `!encryption|` reserved-key namespace collision check; §2.2 adds explicit call-outs for commit_ts visibility and the TTL/tombstone byte being cleartext by design.
- Medium: §8.1 splits the AES-NI fast path from the software-fallback path with the 10–15× penalty so non-AES-NI deployments cannot quietly miss the §8.3 throughput gate.
- Minor: Add `encryption retire-dek`, `resync-sidecar`, and `backup verify` admin commands; new metrics `elastickv_encryption_writes_per_dek` and `elastickv_encryption_sidecar_raft_index`; flag size-leak risk on the overhead histogram. §10 test list updated for the new regression and reconciliation cases.
|
@claude review
|
Claude finished @bootjp's task in 5m 56s — Review:
|
(… deterministic nonce, sidecar fsync)

- P1 (codex line 728): the §4.2 raft-envelope rollout was unsafe in a rolling restart — a non-upgraded follower without KEK/DEK material cannot decrypt entries proposed by an upgraded leader, breaking deterministic apply. Split the rollout into Phase 1 (storage envelope only, raft Data stays cleartext) and Phase 2 (`enable-raft-envelope` admin command, gated on every voting member reporting `encryption_capable`). Phase 2 is one-way because the WAL would interleave cleartext and raft-envelope entries.
- P1 (codex line 179): the 32-bit random nonce prefix had a non-zero birthday-collision probability across process loads (~1e-7 over a multi-year DEK lifetime), and a single AES-GCM nonce reuse is catastrophic. Replace it with deterministic `node_id (16b) || local_epoch (16b) || write_count (64b)`. `local_epoch` is persisted in the sidecar and fsync-bumped before any encryption, removing collision probability entirely.
- P2 (codex line 561): `os.Rename` was atomic for visibility but not crash-durable. Specify the fsync-file → rename → fsync-parent-dir protocol; refuse to start on filesystems that cannot guarantee `dir.Sync` semantics (NFS, some FUSE).
- Self-review: §10 lens 1 and lens 2 updated to call out sidecar non-durability and nonce reuse as data-loss-class hard preconditions, and the membership-snapshot check as the load-bearing piece for Phase 2 safety. §9.1 startup refusal list extended with `ErrUnsupportedFilesystem` and missing `node_id`. §6.6 admin commands gain `enable-raft-envelope`.
|
@claude review
|
Claude finished @bootjp's task in 7m 23s — Review:
|
| Issue | Resolution |
|---|---|
| Double-encryption contradiction (§4.2 vs §6.2/6.3) | §4.2 now explicitly states: "The coordinator does NOT pre-encrypt values before building a proposal, and the FSM does NOT call a 'raw' Pebble write." ✅ |
| `key_id` uint32 vs. string mismatch | §5.1 specifies a CSPRNG-drawn uint32, decimal string in JSON, binary uint32 in envelope + `strconv.ParseUint` decode. ✅ |
| `encryption_state` bit placement unspecified | §6.2: bits 1–2 of byte 0, `valueHeaderSize` stays 9. ✅ |
| Compress flag missing from envelope / AAD | §4.1 diagram includes the flag, AAD binds `envelope_version ‖ flag ‖ key_id`, overhead is 34 B everywhere. ✅ |
| Random nonce birthday bound (2⁴⁸ was wrong) | Replaced with deterministic `node_id ‖ local_epoch ‖ write_count`; zero collision probability under stated preconditions. ✅ |
| Sidecar fsync non-durable | §5.1: write-tmp → fsync-file → rename → fsync-parent → ack; `ErrUnsupportedFilesystem` on NFS/FUSE. ✅ |
| 32-bit random nonce-prefix birthday risk | Retracted; same deterministic nonce fix as above. ✅ |
| Raft rolling-restart safety (pre-Phase-2 followers) | Two-phase rollout in §7.1: Phase 1 keeps Raft proposals cleartext; Phase 2 gated on `encryption_capable` membership check. ✅ |
Critical (must fix before implementation PR opens)
1. Phase 2 discriminator byte collides with existing FSM tag bytes
docs/design/2026_04_29_proposed_data_at_rest_encryption.md §7.1
The design says replicas "dispatch on a 1-byte format tag at the start of the Data payload — 0x00 = cleartext, 0x01 = raft envelope." But kv/fsm.go already uses that exact byte for the existing operation encoding:
```go
// kv/fsm.go lines 110–116
const (
	raftEncodeSingle   byte = 0x00 // single pb.Request
	raftEncodeBatch    byte = 0x01 // pb.RaftCommand with multiple requests
	raftEncodeHLCLease byte = 0x02 // HLC ceiling-only entry
)
```

Under the proposed scheme, every `raftEncodeBatch` proposal (`0x01`) — used by Lua commit batching, XADD, HSET, and any multi-op transaction — is misidentified as a raft envelope under Phase 2. The FSM attempts AES-GCM decryption on a protobuf-encoded `pb.RaftCommand`, gets a tag mismatch, and returns an apply error. The cluster stops making progress as soon as the first batch write arrives after `enable-raft-envelope` is applied. Every `raftEncodeHLCLease` (`0x02`) entry would similarly be misrouted.
The fix is straightforward: reserve a new unused tag byte, e.g. `raftEncodeEnveloped byte = 0x03`, for the encrypted wrapper. The Phase 2 FSM path becomes:
```
Apply(data):
    if data[0] == raftEncodeEnveloped:   // new — §4.2 envelope
        data = raftDEK.Unwrap(data[1:])  // strip envelope, recover inner bytes
    // existing dispatch (unchanged):
    if data[0] == raftEncodeHLCLease: → applyHLCLease
    else: decodeRaftRequests (0x00 / 0x01 / future tags)
```
This also correctly handles the Phase 2 ordering requirement: the raft-envelope unwrap happens before the existing HLC-lease and request-decode checks, so HLC ceiling proposals (§4.5) are also transparently decrypted.
Update §7.1 to replace "0x00 = cleartext, 0x01 = raft envelope" with the raftEncodeEnveloped = 0x03 scheme and cross-reference kv/fsm.go so the implementor knows exactly which file gains the new constant.
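In Go, that dispatch order could look like the following sketch, where `unwrap` stands in for the raft-DEK AES-GCM open and the string results stand in for the real FSM handlers (all names here are illustrative, not the actual kv/fsm.go code):

```go
package main

import "errors"

// Existing kv/fsm.go tag bytes (quoted in the review) plus the proposed
// raftEncodeEnveloped wrapper. The unwrap happens before the existing
// dispatch, so HLC-lease and batch entries are transparently decrypted.
const (
	raftEncodeSingle    byte = 0x00
	raftEncodeBatch     byte = 0x01
	raftEncodeHLCLease  byte = 0x02
	raftEncodeEnveloped byte = 0x03 // proposed: raft-DEK envelope wrapper
)

// apply is a dispatch skeleton: strip the envelope if present, then run
// the unchanged tag switch on the recovered inner bytes.
func apply(data []byte, unwrap func([]byte) ([]byte, error)) (string, error) {
	if len(data) == 0 {
		return "", errors.New("empty entry")
	}
	if data[0] == raftEncodeEnveloped {
		inner, err := unwrap(data[1:])
		if err != nil {
			return "", err
		}
		if len(inner) == 0 {
			return "", errors.New("empty inner entry")
		}
		data = inner // cleartext entry, tagged 0x00 / 0x01 / 0x02
	}
	switch data[0] {
	case raftEncodeHLCLease:
		return "hlc-lease", nil
	case raftEncodeSingle:
		return "single", nil
	case raftEncodeBatch:
		return "batch", nil
	default:
		return "", errors.New("unknown tag")
	}
}
```

Because the switch only runs on unwrapped bytes, a `0x01` batch entry is never confused with an envelope, which is exactly the collision the review describes.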
High (must resolve before implementation)
2. local_epoch overflow has no hard block — silent nonce collision risk
docs/design/2026_04_29_proposed_data_at_rest_encryption.md §4.1, §9.1
The design correctly warns at `local_epoch > 0xff00` (256 restarts before wrap) but does not specify what happens when `local_epoch` reaches `0xFFFF` and would overflow to `0x0000`. If the operator ignores all 256 warnings and the epoch wraps:
- The next process load uses `node_id ‖ 0 ‖ 0` for its first write.
- The DEK's very first ever write (at epoch 0, write_count 0 after creation) used the same nonce.
- AES-GCM nonce collision → key recovery + XOR of two plaintexts.
The deterministic nonce's safety guarantee ("zero collision probability under its preconditions") is only valid when wrapping is prevented. Add ErrLocalEpochExhausted to §9.1's startup refusal list:
"The
local_epochfield for any active DEK in the sidecar has reached0xFFFF. The process refuses to start until the operator rotates the DEK, which resets the epoch to 0 under a fresh key."
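The gate is a one-liner to specify precisely. A sketch with a hypothetical name, combining the existing warning threshold with the proposed hard refusal:

```go
package main

import "errors"

// CheckLocalEpoch implements the proposed startup gate: warn near the
// 16-bit ceiling, refuse outright at 0xFFFF so the epoch can never wrap
// back to a (node_id, 0, 0) nonce the DEK already used at creation.
func CheckLocalEpoch(epoch uint16) (warn bool, err error) {
	const warnAt = 0xff00 // the design's existing warning threshold
	if epoch == 0xffff {
		return false, errors.New("ErrLocalEpochExhausted: rotate the DEK")
	}
	return epoch >= warnAt, nil
}
```

Rotating the DEK resets the epoch to 0 under a fresh key, so the refusal never wedges a cluster permanently.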
3. 16-bit node_id truncation from DeriveNodeID (uint64) is unspecified
docs/design/2026_04_29_proposed_data_at_rest_encryption.md §4.1
The design describes node_id as "the 16-bit Raft member ID assigned at cluster bootstrap." In the actual codebase, internal/raftengine/etcd/peers.go:DeriveNodeID returns a uint64 FNV-64a hash of the --raftId string (e.g., "n1", "n2"). The nonce field allocates only 2 bytes for node_id, so the implementation must truncate the 64-bit hash.
The design must specify:
- The truncation rule (e.g., "lower 16 bits of `DeriveNodeID(--raftId)`").
- A startup collision check: on every boot, the encryption package reads the current Raft membership snapshot and asserts that no two voting members share the same lower-16-bit value of their derived `node_id`. This check is cheap (at most dozens of comparisons) and prevents a silent nonce-space collision in a cluster with unlucky ID strings.

For elastickv's current ID scheme ("n1", "n2", "n3" → well-distributed FNV hashes), collisions in the lower 16 bits are extremely unlikely, but the requirement must be stated and enforced rather than assumed.
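The truncation rule, the startup collision assertion, and the resulting 12-byte nonce layout can be sketched together. The FNV-64a hashing mirrors what the review attributes to `DeriveNodeID`; the function names and error message here are hypothetical:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// deriveNodeID mirrors the FNV-64a hashing the review attributes to
// DeriveNodeID in internal/raftengine/etcd/peers.go (a sketch, not the real code).
func deriveNodeID(raftID string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(raftID))
	return h.Sum64()
}

// truncatedNodeID applies the proposed rule: lower 16 bits of the 64-bit hash.
func truncatedNodeID(raftID string) uint16 {
	return uint16(deriveNodeID(raftID)) // Go's uint16 conversion keeps the low 16 bits
}

// checkNodeIDCollisions is the startup assertion: no two voting members may
// share a truncated node_id, or their nonce spaces would overlap.
func checkNodeIDCollisions(memberIDs []string) error {
	seen := map[uint16]string{}
	for _, id := range memberIDs {
		t := truncatedNodeID(id)
		if prev, ok := seen[t]; ok {
			return fmt.Errorf("node_id collision: %q and %q both truncate to 0x%04x", prev, id, t)
		}
		seen[t] = id
	}
	return nil
}

// buildNonce assembles the 12-byte deterministic AES-GCM nonce
// node_id(2) ‖ local_epoch(2) ‖ write_count(8) described by the review.
func buildNonce(nodeID, epoch uint16, writeCount uint64) [12]byte {
	var n [12]byte
	binary.BigEndian.PutUint16(n[0:2], nodeID)
	binary.BigEndian.PutUint16(n[2:4], epoch)
	binary.BigEndian.PutUint64(n[4:12], writeCount)
	return n
}

func main() {
	fmt.Println(checkNodeIDCollisions([]string{"n1", "n2", "n3"}))
	fmt.Printf("%x\n", buildNonce(truncatedNodeID("n1"), 0, 0))
}
```

The 2+2+8 layout exactly fills AES-GCM's standard 96-bit nonce, which is why the design can afford only 2 bytes for `node_id` in the first place.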
Medium (address before or alongside implementation)
4. raft_applied_index write timing is contradictory between §5.1 and §5.5
docs/design/2026_04_29_proposed_data_at_rest_encryption.md §5.1 step 5, §5.5
§5.1 step 5 says: "update raft_applied_index in memory; that value is then persisted on the next sidecar write."
§5.5 says: "The sidecar carries raft_applied_index, updated in the same os.WriteFile + os.Rename that persists the new wrapped DEKs."
These are contradictory. If step 5's reading is correct (index deferred to next write), then after a rotation the on-disk sidecar has the new DEK but an old raft_applied_index. On restart, §5.5's check sees the gap between the raftengine's persisted applied index and the sidecar's index, covers a rotation entry, and fires ErrSidecarBehindRaftLog — even though the sidecar is fully consistent. This would be a spurious startup failure loop for any correctly-functioning node that crashed between steps 4 and 5.
The correct behavior (matching §5.5) is that the sidecar written in steps 1–4 includes the current rotation's raft_applied_index. Step 5's "update in memory" should read: "update the in-memory raft_applied_index to the now-confirmed-durable value; the sidecar already reflects it." Reword step 5 to eliminate the ambiguity.
5. encryption_capable heartbeat field is underspecified
docs/design/2026_04_29_proposed_data_at_rest_encryption.md §7.1 Phase 2
The enable-raft-envelope gate requires "every voting member of every Raft group has reported encryption_capable = true in its periodic heartbeat metadata." This introduces a new field to the heartbeat protocol but the design doesn't specify:
- Which gRPC message or Raft protocol extension carries the field.
- How staleness is bounded — a heartbeat from 30 minutes ago is not a safe signal.
- Whether `enable-raft-envelope` forces a fresh membership poll (blocking) or reads cached heartbeat state.
Without this, an operator could issue enable-raft-envelope immediately after a rolling restart before all heartbeats have refreshed, and the check would pass on stale encryption_capable = false → true transitions. Add a paragraph to §7.1 specifying the heartbeat path (e.g., piggyback on the existing peer metadata extension in internal/raftengine/etcd/peer_metadata.go) and a maximum staleness bound (e.g., "the command polls until a fresh heartbeat is received from every member within one election timeout, or returns an error if any member is unreachable").
Minor
6. Phase 2 FSM dispatch ordering for HLC lease entries should be explicit
docs/design/2026_04_29_proposed_data_at_rest_encryption.md §4.5, §6.3
§4.5 correctly notes that HLC ceiling proposals get the §4.2 raft envelope "for free." kv/fsm.go::Apply currently checks for HLC lease entries (data[0] == raftEncodeHLCLease) before decodeRaftRequests. With the fix from issue #1 above (raftEncodeEnveloped = 0x03), the correct Phase 2 dispatch order is:
1. if data[0] == 0x03 → unwrap raft envelope → data = inner bytes
2. if data[0] == 0x02 → applyHLCLease (HLC ceiling entry)
3. else → decodeRaftRequests (0x00 single / 0x01 batch)
§6.3 says "unwrap the raft envelope on apply, recover the cleartext operation, and dispatch to the existing storage handlers" — this correctly implies the unwrap happens first. But since the HLC lease path is a special case that bypasses the storage layer, it should be called out explicitly so the implementor doesn't accidentally place the HLC check before the envelope check.
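Putting the three steps together, a Go sketch of the round-4 tag-gated dispatch; the `0x00`/`0x01`/`0x02` constant values match kv/fsm.go per this review, while `raftEncodeEnveloped`, the handler names, and the assumption that the envelope's ciphertext follows a 1-byte tag are illustrative:

```go
package main

import "fmt"

// Existing kv/fsm.go tag constants (values confirmed by the review thread).
const (
	raftEncodeSingle   byte = 0x00
	raftEncodeBatch    byte = 0x01
	raftEncodeHLCLease byte = 0x02
	// Proposed in this round; later rounds replace it with index gating.
	raftEncodeEnveloped byte = 0x03
)

// applyDispatch sketches the Phase-2 order: envelope unwrap first, then
// HLC lease, then the request decoder. unwrap stands in for raftDEK.Unwrap.
func applyDispatch(data []byte, unwrap func([]byte) []byte) string {
	if data[0] == raftEncodeEnveloped {
		data = unwrap(data[1:]) // inner bytes carry their own legacy tag
	}
	switch data[0] {
	case raftEncodeHLCLease:
		return "applyHLCLease"
	case raftEncodeSingle, raftEncodeBatch:
		return "decodeRaftRequests"
	default:
		return "unknown tag"
	}
}

func main() {
	identity := func(b []byte) []byte { return b } // stand-in for real decryption
	fmt.Println(applyDispatch([]byte{raftEncodeHLCLease}, identity))
	fmt.Println(applyDispatch([]byte{raftEncodeEnveloped, raftEncodeBatch}, identity))
}
```

The ordering matters exactly as the text says: if the HLC check ran before the envelope check, an enveloped HLC-lease entry would fall through to `decodeRaftRequests` and fail.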
Conventions check
| Item | Status |
|---|---|
| Filename `2026_04_29_proposed_data_at_rest_encryption.md` | ✅ |
| Header Status/Author/Date | ✅ |
| Five CLAUDE.md self-review lenses in §10 | ✅ |
| §10 lens 1 updated for nonce/sidecar preconditions | ✅ |
| §10 lens 2 updated for Phase-2 membership check | ✅ |
| §10 test coverage includes `local_epoch > 0xff00` warning | ✅ |
| §11.5 Lua batching open question flagged | ✅ |
| Reserved-key `!encryption\|` namespace confirmed no collision | ✅ |
Positive summary
Three rounds of review have closed every cryptographic and architectural gap identified in rounds 1–3. The deterministic nonce design (node_id ‖ local_epoch ‖ write_count) is the strongest possible fix for the original birthday-bound problem. The two-phase rollout correctly handles the asymmetry between per-version storage envelopes (safe in mixed-mode) and per-proposal raft envelopes (requires full-cluster capability). The sidecar fsync protocol now meets crash-durability requirements, and the double-encryption contradiction is cleanly resolved.
The one blocking issue before the implementation PR opens is the discriminator byte collision in §7.1 Phase 2 — discovered by cross-referencing the design against kv/fsm.go:110-116. The fix is a one-line constant addition (raftEncodeEnveloped = 0x03) and a two-paragraph update to §7.1. The local_epoch hard-stop and node_id truncation spec (#2 and #3) are also required before implementation to keep the nonce uniqueness guarantee unconditional.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1b8613f08a
> so an old build reading a value written by an encryption-enabled
> build sees an unrecognised tombstone byte and refuses to serve it
> rather than returning ciphertext as cleartext. `valueHeaderSize`
Avoid treating new header bits as tombstones on old binaries
This section assumes an older binary will "refuse to serve" values written with the new tombstone+enc_state byte, but in the current tree store/lsm_store.go decodes tombstones as data[0] != 0 (any non-zero value). In a rolling Phase-1 deployment, old nodes would therefore interpret encrypted live values as tombstones and drop them from reads, which breaks the mixed-version safety claim in §7.1. The migration needs a backward-compatible encoding strategy (or a hard gate that prevents encrypted writes until all readers understand the new bits).
> 3. Recovery is **automatic on the leader** (it re-proposes a
> "rewrap current active DEK" entry that brings every node's
> sidecar back in sync) and **manual on a stuck follower**
Resync sidecars with all unretired DEKs, not only active
The recovery plan here re-proposes only a rewrap of the current active DEK, but §5.2 explicitly allows old DEKs to remain decrypt-required for retained MVCC history. If a node misses multiple rotations, restoring just the active key can still leave intermediate key_ids absent, causing decrypt failures for historical versions until rewrite/retire completes. The resync flow should rebuild the full set of unretired DEKs (or it can strand readable history on the recovered node).
> Replicas dispatch on a 1-byte format tag at the start of the
> `Data` payload — `0x00` = cleartext (pre-flag entry, found
> only in WAL/snapshot history), `0x01` = raft envelope. The
Preserve existing Raft tag space in Phase-2 envelope framing
This assigns 0x00 = cleartext and 0x01 = raft envelope for historical/new entries, but current kv/fsm.go already uses 0x00 for single-request, 0x01 for batch, and 0x02 for HLC-lease entries. Reusing 0x01 for encrypted payloads would make pre-cutover batch entries ambiguous during WAL replay/snapshot restore, so a node can misroute old cleartext data into envelope decode and fail apply. The envelope framing needs a non-conflicting outer tag scheme (or explicit version/index gating) that remains compatible with existing persisted entries.
…EK resync, index-gated framing)

P1 (codex on 1b8613f line 645): old-binary tombstone test `data[0] != 0` reads encrypted live values (encryption_state in bits 1-2 of byte 0) as tombstones and silently drops them. Convert §7.1 from 2-phase to 3-phase: Phase 0 = capability rollout (new binary writes cleartext, advertises encryption_capable in heartbeats); Phase 1 = enable-storage-envelope cluster-flag flip after membership-snapshot capability check; Phase 2 = enable-raft-envelope under the same gate. Per-node --encryption-enabled is now a capability assertion, not a behaviour trigger.

P1 (codex on 1b8613f line 586): §5.5 recovery rewrapped only the active DEK, leaving intermediate unretired DEKs missing on a node that missed multiple rotations and silently breaking historical reads. Recovery now rewraps every unretired DEK (active + retiring) in one apply; resync-sidecar replays all rotation/rewrap-deks entries, not just the most recent.

P1 (codex on 1b8613f line 847): Phase 2 first-byte framing tag (0x00 cleartext / 0x01 envelope) collided with kv/fsm.go existing tag space (0x00 single, 0x01 batch, 0x02 HLC-lease) and would misroute pre-cutover batch entries during WAL replay. Replace with raft-log-index gating: sidecar persists raft_envelope_cutover_index; entries below it use the legacy first-byte tags, entries at-or-above unwrap through the raft envelope. Snapshot headers carry the cutover index so a follower joining mid-Phase-2 reconstructs the same boundary.

Sidecar example expanded with storage_envelope_active and raft_envelope_cutover_index. §9.1 startup refusal list adds ErrEnvelopeCutoverDivergence. §10 self-review lens 2 updated to call out the old-binary tombstone path and the full-DEK resync requirement as load-bearing. §6.6 admin gains enable-storage-envelope.
@claude review

Claude finished @bootjp's task in 4m 7s — Review:
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f34d9b73d9
> relative to the persisted `raft_envelope_cutover_index` (the
> apply index of the flag entry, recorded in the local sidecar
> on apply). Entries below the cutover index are decoded via
> the existing first-byte tag space (`0x00` single-request,
> `0x01` batch, `0x02` HLC-lease, etc.); entries at-or-above
> the cutover index are unwrapped through the raft envelope
Decode raft envelopes only after the cutover flag entry
Step 7 says the enable-raft-envelope flag entry itself is written in legacy framing, but this section defines raft_envelope_cutover_index as that flag entry’s index and then unwraps entries "at-or-above" it as envelopes. That makes WAL replay attempt to decrypt the flag entry as an envelope and fail. The boundary needs to exclude the flag entry (or record the first encrypted entry index instead), otherwise replay/apply can break exactly at cutover.
> sidecar's index AND the gap covers any rotation entries, the
> node refuses to start with `ErrSidecarBehindRaftLog`,
Fail startup when sidecar misses raft cutover updates
The startup guard only blocks when sidecar lag spans rotation entries, but Raft entry decoding also depends on raft_envelope_cutover_index. If a node crashes after Raft advances past Phase-2 cutover but before that cutover state is persisted locally, this check can pass and the node may decode post-cutover entries with the wrong format. Include cutover/format-transition entries in the lag check (not only rotations) to avoid mis-dispatch on restart.
> as applied (and update `raft_applied_index` in memory; that
> value is then persisted on the next sidecar write).
Persist raft_applied_index in the same sidecar rewrite
This step says raft_applied_index is updated only in memory and written on the next sidecar write, but §5.5 relies on that index being durable alongside the just-written DEK set. If implemented as written here, a crash after rotation can leave a fresh sidecar with a stale applied index and trigger unnecessary ErrSidecarBehindRaftLog refusals on restart. The design should require persisting raft_applied_index atomically with the DEK rewrite.
Actionable comments posted: 4
🧹 Nitpick comments (1)
docs/design/2026_04_29_proposed_data_at_rest_encryption.md (1)
1243-1248: Promote Lua batching interaction from "open question" to acceptance criteria. Given the rollout risk, this should be a concrete pre-merge check, not a deferred question. Add explicit tests/benchmarks for `lua_commit_batching` + raft envelope + inner per-value envelopes (mixed compressed/uncompressed payloads).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/design/2026_04_29_proposed_data_at_rest_encryption.md` around lines 1243 - 1248, The docs note is asking to promote the open question into concrete pre-merge checks: add unit tests and benchmarks that exercise lua_commit_batching with Raft proposal envelopes containing per-value envelopes (both encrypted and unencrypted, both compressed and uncompressed) so the batch encoder cannot assume inner payloads are cleartext or compressible; implement tests named like TestLuaCommitBatchingEnvelopeHandling and BenchmarkLuaBatchEncoderMixedPayloads that build batches via lua_commit_batching, wrap inner Puts with per-value envelopes and an outer Raft envelope, and assert decoding/round-trip integrity, correct sizing (no double-compression), and that compressed flag handling is preserved for mixed payloads.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/design/2026_04_29_proposed_data_at_rest_encryption.md`:
- Around line 865-869: Add a post-cutover membership gate: specify an invariant
that once the encryption flags are active (after Phase 0 completion where every
voting member shows encryption_capable = true) any membership-change proposal
that would add or promote a voting member must be rejected if the candidate
lacks encryption_capable = true. Update the Phase 1/2 sections and the
membership-change proposal validation text (referencing "encryption_capable",
"Phase 0", "Phase 1/2", and "membership-change proposals") to require this check
in the membership-change code path and operator/API validation (i.e.,
join/replace/add-voter requests and Raft proposal validation) and note that
non-voting members may still be permitted but cannot be promoted to voters
unless they present the flag.
- Around line 914-917: The snapshot header currently written in kv/snapshot.go
(lines ~13-46) must be versioned to carry raft_envelope_cutover_index safely:
introduce a header format with a version byte (e.g., headerVersion) and a
deterministic layout (magic, version, length, then fields including
raft_envelope_cutover_index as a fixed-size u64) and update the write path
(WriteSnapshotHeader) to emit the new version when creating Phase‑2 snapshots;
update the read path (ReadSnapshotHeader) to parse the magic and version, handle
legacy v1 (existing 16‑byte magic+ceiling) by detecting the old length/magic and
treating it as implicit v1 behavior, and parse v2+ to extract
raft_envelope_cutover_index, failing fast on unknown versions while logging
helpful errors for forward/backward compatibility checks.
- Around line 183-186: The document currently relies on a 16-bit node_id as part
of the AES-GCM nonce which is too small and risks nonce reuse; update the spec
to require a collision-resistant, persisted per-node nonce prefix (e.g., a
cryptographically-random 96–128-bit nonce_prefix or a UUIDv4/UUIDv7) generated
at bootstrap and stored persistently (and bound to the node’s identity) instead
of the 16-bit node_id, describe that the nonce construction (nonce =
nonce_prefix || per-message counter) must guarantee monotonic, non-repeating
counters per DEK, add a requirement that implementations persist and verify the
nonce_prefix on startup (similar to membership snapshot checks) and provide
explicit non-reuse guarantees across node reboots, replacements, and migrations
so AES-GCM nonce reuse under any DEK cannot occur.
- Around line 415-417: The doc contains conflicting persistence semantics for
raft_applied_index: reconcile both places so raft_applied_index is persisted
atomically in the same fsync'ed sidecar write as the wrapped DEKs (i.e., the FSM
should acknowledge rotation only after that combined fsync completes). Update
the text around the FSM acknowledgment (references to raft_applied_index,
sidecar write, and wrapped DEKs) so both the earlier statement and the later
paragraph (the lines mentioning wrapped DEKs) consistently require a single
atomic fsync'd sidecar transaction that includes the key updates and
raft_applied_index before acknowledging the rotation entry.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: aa7208b9-acf0-417f-9fe0-92b64e9ce300
📒 Files selected for processing (1)
docs/design/2026_04_29_proposed_data_at_rest_encryption.md
High (claude[bot] r4 #2): local_epoch wrap to 0 re-issues nonce node_id||0||0 of the same DEK's first write -- catastrophic AES-GCM nonce reuse. Add ErrLocalEpochExhausted to §9.1 startup refusal at local_epoch == 0xFFFF; only DEK rotation can recover.

High (claude[bot] r4 #3): node_id was described as 16-bit but DeriveNodeID() returns uint64 FNV-64a. Specify the truncation rule (lower 16 bits) and the startup ErrNodeIDCollision check that asserts no two voting members share the same uint16 truncated value.

Medium (claude[bot] r4 #4): §5.1 step 5 said raft_applied_index is persisted on the next sidecar write, contradicting §5.5 which expects the same write. Reword: the index is part of the same JSON payload as the new wrapped DEKs, eliminating any deferred-write race that would spuriously fire ErrSidecarBehindRaftLog.

Medium (claude[bot] r4 #5): encryption_capable heartbeat field path was unspecified. Pin it to the existing peer_metadata.go extension; cutover commands force a fresh poll within one election timeout rather than reading cached state.

Critical reinforcement (claude[bot] r4 #1): explicitly cross-reference kv/fsm.go raftEncodeSingle/raftEncodeBatch/raftEncodeHLCLease constants in §7.1 Phase 2 and pseudocode the exact FSM dispatch order so the implementor cannot accidentally re-use the colliding tag bytes. Index-gating remains the chosen mechanism (already in round-4 commit).

Medium (claude[bot] r3 #3): §6.1 sidecar.go entry now references the §5.1 crash-durable write protocol explicitly so an implementor reading §6.1 in isolation cannot omit the fsync.

Medium (claude[bot] r3 #4): §6.2 corrected the misleading "old build refuses to serve" wording -- old decodeValue actually returns a silent tombstone, but the 3-phase rollout makes that path unreachable on a real cluster. Bit-layout comment requirement added.

Minor (claude[bot] r3 #5): env KEK provider must os.Unsetenv("ELASTICKV_KEK_BASE64") immediately after read.
@claude review

Claude finished @bootjp's task in 4m 57s — Review:
| Issue | Resolution |
|---|---|
| Critical: FSM tag byte collision (`0x01` = raftEncodeBatch vs raft envelope discriminator) | Eliminated entirely — Phase 2 dispatch is now index-gated (`index >= raft_envelope_cutover_index`), not tag-byte-gated. No new constant needed. ✅ |
| High: `local_epoch` overflow has no hard stop | §4.1 specifies process refusal at `0xFFFF` via `ErrLocalEpochExhausted`; §9.1 includes it in the startup refusal list. ✅ |
| High: 16-bit `node_id` truncation rule unspecified | §4.1 specifies `node_id := uint16(DeriveNodeID(raftID))` with a cross-membership collision check on startup. ✅ |
| Medium: sidecar fsync missing from §6.1 | §5.1 and §6.1 both specify the four-step write-tmp → fsync → rename → fsync-dir protocol. ✅ |
| Medium: `encryption_capable` staleness / blocking | §7.1 Phase 0 step 3 now specifies a fresh blocking poll per election timeout rather than reading cached state. ✅ |
| Medium: full-DEK resync on sidecar recovery | §5.5 recovery explicitly rebuilds every unretired DEK, not just the active one. ✅ |
| CodeRabbit P2: `raft_applied_index` deferred-write ambiguity | §5.1 step 1 and step 5 now consistently state the index is written atomically in the same JSON payload as the wrapped DEKs. ✅ |
Critical (must fix before implementation PR opens)
1. raft_envelope_cutover_index off-by-one: >= should be >
docs/design/2026_04_29_proposed_data_at_rest_encryption.md §7.1 Phase 2 steps 7–8
The pseudocode reads:
```
Apply(index, data):
  if index >= raft_envelope_cutover_index:
    data = raftDEK.Unwrap(data)
```
raft_envelope_cutover_index is recorded as the Raft apply index of the enable-raft-envelope flag entry itself. Step 7 says explicitly:
This entry itself uses the legacy first-byte tag space so it remains decodable by any replica that has applied entries up to it.
With >=, when the FSM applies the flag entry (its index equals raft_envelope_cutover_index), it will attempt raftDEK.Unwrap(data) on a cleartext entry — the entry that was written in legacy framing. The unwrap call will fail with a tag mismatch, and the node will either error-out on apply (breaking the cluster) or silently skip the flag entry (never setting raft_envelope_cutover_index in the sidecar, which means the node will never recognize Phase 2 at all).
Step 8 says "From the apply index of that flag entry onward" — "onward" implies exclusive. The >= in the pseudocode is inconsistent with this prose. The correct condition is:
```
if index > raft_envelope_cutover_index:
  data = raftDEK.Unwrap(data)
```
This processes the flag entry in legacy mode (which records raft_envelope_cutover_index in the sidecar), then wraps all subsequent entries. The snapshot ingest path (step 9) also needs to record the cutover index and apply the same > boundary on replay.
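The corrected boundary can be sketched in Go; the toy `unwrap` stands in for `raftDEK.Unwrap`, and the payload bytes are illustrative:

```go
package main

import "fmt"

// decodePhase2 sketches the corrected boundary: the flag entry at
// index == cutover stays in legacy framing; only strictly later entries
// unwrap through the raft envelope.
func decodePhase2(index, cutover uint64, data []byte, unwrap func([]byte) []byte) []byte {
	if cutover != 0 && index > cutover { // '>' not '>=': excludes the flag entry itself
		return unwrap(data)
	}
	return data // legacy first-byte tag space (0x00 / 0x01 / 0x02)
}

func main() {
	unwrap := func(b []byte) []byte { return b[1:] } // toy "decrypt": strip a 1-byte header
	flagEntry := []byte{0x00, 'f'}                   // the enable-raft-envelope entry, legacy framing
	encrypted := []byte{0xEE, 0x02, 'x'}             // toy envelope around an HLC-lease entry

	fmt.Printf("%v\n", decodePhase2(100, 100, flagEntry, unwrap)) // flag entry passes through untouched
	fmt.Printf("%v\n", decodePhase2(101, 100, encrypted, unwrap)) // post-cutover entry is unwrapped
}
```

With `>=`, the first call would route the cleartext flag entry through `unwrap` and fail; with `>`, the flag entry is applied in legacy mode (recording the cutover index) and every subsequent entry is unwrapped.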
Medium (address before or alongside implementation)
2. Snapshot header format extension is unspecified
docs/design/2026_04_29_proposed_data_at_rest_encryption.md §7.1 Phase 2, step 9
Step 9 says: "Snapshots taken during Phase 2 carry the cutover index in their metadata header so a fresh follower joining mid-Phase-2 reconstructs the same dispatch boundary on ingest."
The existing store/snapshot_pebble.go snapshot header is:
[pebbleSnapshotMagic (8 bytes)] [lastCommitTS uint64 (8 bytes)] [key-value entries...]
There is no version field in this header. Adding raft_envelope_cutover_index requires a format bump. Without specifying how the format evolves, the implementation will have to make uncoordinated decisions that could produce incompatible snapshots between builds.
The design should specify:
- The versioned format: e.g., change the magic from `EKVPBBL1` to `EKVPBBL2` when Phase 2 is active, appending `raft_envelope_cutover_index uint64` after `lastCommitTS`.
- Backward compatibility: a node reading an `EKVPBBL1` snapshot treats `raft_envelope_cutover_index` as 0 (Phase 1 or earlier; no raft envelopes).
- §9.1's `ErrEnvelopeCutoverDivergence` check depends on the snapshot carrying this field — without the format spec, that check cannot be implemented.
3. peer_metadata.go reference names the wrong file
docs/design/2026_04_29_proposed_data_at_rest_encryption.md §7.1 Phase 0, step 2
"Each upgraded node advertises `encryption_capable = true` in the peer-metadata extension that today already piggy-backs on the etcd-raft heartbeat path (internal/raftengine/etcd/peer_metadata.go)."
internal/raftengine/etcd/peer_metadata.go is the peers-file persistence layer (it reads and writes etcd-raft-peers.bin, the on-disk membership snapshot). It is not a heartbeat extension and has no mechanism for advertising per-node capability flags.
The etcd raft library's heartbeat messages (raftpb.MsgHeartbeat / MsgHeartbeatResp) carry a Context []byte field that applications use to piggyback metadata — this is the correct extension point. Alternatively, a separate out-of-band RPC (e.g., on the existing Distribution service or a new EncryptionAdmin service) can poll capability before the cutover commands issue.
The design should replace the peer_metadata.go reference with a concrete mechanism:
- Option A (heartbeat context): Populate `MsgHeartbeat.Context` with a small proto carrying `encryption_capable bool`; parse it in `engine.go`'s `ReadyLoop` when `raftpb.MsgHeartbeatResp` is received. This requires adding a new field to the transport encoding but reuses the existing heartbeat cadence.
- Option B (out-of-band RPC): The `enable-storage-envelope` and `enable-raft-envelope` commands issue a `CheckEncryptionCapability` RPC to every voting peer before proposing. Simpler to implement; does not require changing the raft message format.
Either option is correct; the design must name one and remove the peer_metadata.go reference, which will send the implementor to the wrong file.
Minor
4. Post-cutover membership gate not addressed
docs/design/2026_04_29_proposed_data_at_rest_encryption.md §7.1
The Phase 0 → Phase 1 → Phase 2 gate correctly blocks cutover until every voting member is encryption_capable. But the document says nothing about membership changes after cutover. An operator adding a new voter (e.g., replacing a failed node) running an old binary without --encryption-enabled would:
- At the storage layer: receive MVCC versions with `encryption_state = 0b01` set in `data[0]`, read `data[0] != 0` as tombstone, and silently drop live values from reads.
- At the raft layer (Phase 2): fail every FSM apply on post-cutover entries it cannot decrypt.
Add one sentence: "After storage_envelope_active or raft_envelope_active has committed, the control plane rejects AddVoter / ConfChange for peers that do not advertise encryption_capable = true; learner promotion follows the same gate."
Conventions check
| Item | Status |
|---|---|
| Filename `2026_04_29_proposed_data_at_rest_encryption.md` | ✅ |
| Header Status/Author/Date | ✅ |
| Five CLAUDE.md self-review lenses in §10 | ✅ |
| §10 lens 1 updated for nonce/sidecar preconditions | ✅ |
| §10 lens 2 updated for 3-phase rollout and full-DEK resync | ✅ |
| `ErrLocalEpochExhausted` in §9.1 startup refusal list | ✅ |
| `ErrNodeIDCollision` in §9.1 startup refusal list | ✅ |
| `raft_applied_index` written atomically with DEKs (§5.1) | ✅ |
| `env.go` `os.Unsetenv` documented in §6.1 | ✅ |
| Reserved-key namespace collision confirmed in §5.4.8 | ✅ |
| §11.5 Lua batching interaction flagged as pre-implementation gate | ✅ |
Summary
Six rounds of review have closed every cryptographic, architectural, and wire-format issue identified since the initial submission. The document is in strong shape. One blocking issue remains before the implementation PR opens:
The raft_envelope_cutover_index off-by-one (#1): the pseudocode index >= cutover attempts to unwrap the flag entry itself, which is written in legacy framing. Change to index > cutover. This is a one-line edit to the pseudocode and one sentence clarification in step 8's prose.
The two medium issues (#2 snapshot format, #3 peer_metadata.go reference) should also be resolved before implementation begins — the snapshot format spec is required for §9.1's ErrEnvelopeCutoverDivergence check, and the wrong file reference will send the implementor to the wrong place for a non-trivial integration point.
P1 (codex on 2768cb3 line 621): §5.5 startup lag check only flagged rotation entries in the gap, so a missed enable-raft-envelope cutover would silently slip past and the FSM would decode post-cutover entries through the legacy first-byte tag space. Generalise the lag check to cover every encryption-relevant entry: rotate-dek, rewrap-deks, enable-storage-envelope, enable-raft-envelope, retire-dek. Implementation derives the set from a single isEncryptionRelevant predicate so future commands cannot bypass the check.

Major (coderabbitai on 2768cb3 line 1000): §7.1 Phase 2 step 9 said snapshots carry the cutover index but kv/snapshot.go currently writes a fixed 16-byte header with no version field. Specify a versioned header in §4.4: magic||ver||len||ceiling||cutover||reserved with implicit-v1 fallback for legacy snapshots, v2 only emitted after enable-raft-envelope has been observed locally, ErrSnapshotHeaderVersion on forward-unknown versions, and zero-validation on reserved bytes to prevent silent forward-compat slips.
@claude review

Claude finished @bootjp's task in 7m 37s — Review:
| Issue | Resolution |
|---|---|
| Phase 2 discriminator byte collision with kv/fsm.go tags | Replaced byte-tagged dispatch with index-gated dispatch (`if index >= raft_envelope_cutover_index: raftDEK.Unwrap(data)`) — sidesteps collision by construction ✅ |
| `local_epoch` overflow: no hard stop | `ErrLocalEpochExhausted` on `local_epoch == 0xFFFF` in §9.1 ✅ |
| 16-bit `node_id` truncation unspecified | Documented as `uint16(DeriveNodeID(raftID))` with startup `ErrNodeIDCollision` check ✅ |
| `raft_applied_index` write timing contradiction | Step 1 of §5.1 write protocol now builds `raft_applied_index` as part of the same JSON payload, eliminating the deferred-write window ✅ |
| `encryption_capable` heartbeat field underspecified | §7.1 Phase 0 step 3 now specifies: fresh poll within one election timeout; `ErrCapabilityCheckFailed` if any member is unreachable or reports false ✅ |
| Snapshot header versioning for `raft_envelope_cutover_index` | §4.4 now specifies an explicit versioned v2 header (magic + ver + len + ceiling + cutover + reserved) with read/write path rules ✅ |
| §5.5 lag check covered only rotations | Expanded to cover ALL encryption-relevant entries via `isEncryptionRelevant()` predicate ✅ |
| `node_id` 32-bit prefix birthday risk | Retracted in favour of deterministic nonce (`node_id ‖ local_epoch ‖ write_count`) ✅ |
| `raft_applied_index` consistency between §5.1 and §5.5 | Made consistent: written atomically with DEKs in same fsync ✅ |
Critical (must fix before implementation PR opens)
1. Snapshot magic bytes and variable name are wrong in §4.4
docs/design/2026_04_29_proposed_data_at_rest_encryption.md line 321, 326
The v2 header diagram states:

```
| magic(8) | ...
| EKVPBBL1 | ...
```

and the annotation says "magic — unchanged 8-byte **pebbleSnapshotMagic** (EKVPBBL1)".

Both facts are wrong. Reading kv/snapshot.go:16:

```go
var hlcSnapshotMagic = [8]byte{'E', 'K', 'V', 'T', 'H', 'L', 'C', '1'}
```

The magic is `EKVTHLC1` (not `EKVPBBL1`) and the variable is `hlcSnapshotMagic` (not `pebbleSnapshotMagic`). The restore path in kv/fsm.go:277 also uses `hlcSnapshotMagic` for the equality check. If the implementation copies the design doc's magic bytes, v1 snapshot detection will fail (the 8-byte comparison will always be false), making every existing v1 snapshot unreadable after upgrade.

Fix: Replace `pebbleSnapshotMagic` (`EKVPBBL1`) with `hlcSnapshotMagic` (`EKVTHLC1`) everywhere in §4.4.
2. >= dispatch boundary decrypts the flag entry itself — WAL replay fails at cutover
docs/design/2026_04_29_proposed_data_at_rest_encryption.md lines 1062–1082
The design says (line 1062–1064):

> "This entry itself uses the legacy first-byte tag space so it remains decodable by any replica that has applied entries up to it."

The raft_envelope_cutover_index is then defined as "the apply index of that flag entry" (line 1069). Yet the FSM pseudocode (line 1075) reads:

```
if index >= raft_envelope_cutover_index:
    data = raftDEK.Unwrap(data)
```

`>=` includes the flag entry (index == raft_envelope_cutover_index). On WAL replay, that entry is in legacy framing (raftEncodeSingle = 0x00 or raftEncodeBatch = 0x01); raftDEK.Unwrap(data) on legacy bytes produces an AES-GCM tag mismatch. The cluster halts at the cutover index on every replay — snapshot installs and follower catch-up are broken.

This is the Codex P1 "Decode raft envelopes only after the cutover flag entry" that e887e9f was meant to close but left the `>=` unchanged.

Two correct resolutions:

Option A (preferred — no format change): Keep raft_envelope_cutover_index as the flag entry's index, change the comparison to `>`:

```
if index > raft_envelope_cutover_index:
    data = raftDEK.Unwrap(data)
```

The leader wraps entries starting from index = cutover_index + 1. The flag entry decodes through the existing raftEncodeSingle/raftEncodeBatch dispatch unchanged.

Option B: Define raft_envelope_cutover_index = flag_entry_index + 1 (the index of the first encrypted entry), keep `>=`.

Pick one and update the pseudocode, the sidecar JSON example in §5.1 (where `"raft_envelope_cutover_index": 184201` is shown alongside `"raft_applied_index": 184273` — if 184201 is the flag entry, then 184202 is the first encrypted entry), and the §5.5 startup check description.
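Option A's gating rule is small enough to sketch directly. The following is an illustrative Go sketch, not the design doc's final code: `raftEnvelopeCutoverIndex`, `unwrap`, and `dispatch` are hypothetical names, and `unwrap` merely tags its input so the gating behaviour is visible without real crypto.

```go
package main

import "fmt"

// Apply index of the flag entry itself (matches the §5.1 example value).
const raftEnvelopeCutoverIndex uint64 = 184201

// unwrap stands in for raftDEK.Unwrap; here it just tags the payload.
func unwrap(data []byte) []byte { return append([]byte("dec:"), data...) }

// dispatch applies the Option A rule: only entries strictly AFTER the
// flag entry are raft-DEK-wrapped; the flag entry itself stays in the
// legacy first-byte-tag framing.
func dispatch(index uint64, data []byte) []byte {
	if index > raftEnvelopeCutoverIndex { // '>' not '>=' — excludes the flag entry
		return unwrap(data)
	}
	return data
}

func main() {
	fmt.Println(string(dispatch(184201, []byte("flag")))) // flag entry: untouched
	fmt.Println(string(dispatch(184202, []byte("v"))))    // first wrapped entry
}
```

With `>=` instead of `>`, the first call would attempt to unwrap the legacy-framed flag entry, which is exactly the replay failure described above.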
High (should resolve before implementation)
3. peer_metadata.go reference is inaccurate — no heartbeat mechanism exists there
docs/design/2026_04_29_proposed_data_at_rest_encryption.md lines 1011–1018
The design says:

> "Each upgraded node advertises `encryption_capable = true` in the peer-metadata extension that today already piggy-backs on the etcd-raft heartbeat path (internal/raftengine/etcd/peer_metadata.go). A new `encryption_capable bool` field is added to the existing `PeerMetadata` proto; old binaries omit the field, which protobuf decodes as the zero value (false)"

Two problems:

- `peer_metadata.go` manages the persistent peers binary file (etcd-raft-peers.bin) using custom binary encoding (readU64, readString, readU8). It has nothing to do with heartbeats or protobuf. The existing extension point for piggybacking data on heartbeats is `raftpb.Message.Context []byte`, which flows through grpc_transport.go / engine.go.
- "Today already piggy-backs" — there is no existing heartbeat metadata mechanism in the codebase. This is a new mechanism that needs to be built, not extended.

The design should name the correct extension point (e.g., "a new encryption_capable flag will be added to the etcd-raft message context bytes piggy-backed on MsgHeartbeat / MsgHeartbeatResp in internal/raftengine/etcd/engine.go") and clarify that this is new infrastructure. The "protobuf zero value" claim also needs replacing with the correct encoding scheme for a context-byte piggyback.
Medium (address before or alongside implementation)
4. resync-sidecar manual recovery only replays "rotation and rewrap-deks" — misses 3 entry types
docs/design/2026_04_29_proposed_data_at_rest_encryption.md lines 738–744
The §5.5 preamble correctly defines an isEncryptionRelevant() predicate covering five entry types: rotate-dek, rewrap-deks, enable-storage-envelope, enable-raft-envelope, retire-dek. The manual recovery path at step 3 then says:
> "The command replays all rotation and `rewrap-deks` entries between the sidecar's `raft_applied_index` and the FSM's applied index"

This is inconsistent — a node that missed enable-raft-envelope or enable-storage-envelope would not have those entries replayed, leaving raft_envelope_cutover_index unset and causing the same mis-dispatch the rest of §5.5 is designed to prevent. Change to "all isEncryptionRelevant() entries" throughout the recovery description to match the predicate.
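Sharing one predicate between the lag check and the recovery replay is what prevents this drift. A minimal sketch, assuming string-tagged entry kinds (the design doc presumably uses typed command tags) and the five-entry set cited above:

```go
package main

import "fmt"

// encryptionRelevant is the single source of truth shared by the §5.5
// startup lag check and the resync-sidecar replay, so neither path can
// silently cover a narrower entry set than the other.
var encryptionRelevant = map[string]bool{
	"rotate-dek":              true,
	"rewrap-deks":             true,
	"enable-storage-envelope": true,
	"enable-raft-envelope":    true,
	"retire-dek":              true,
}

func isEncryptionRelevant(kind string) bool { return encryptionRelevant[kind] }

func main() {
	fmt.Println(isEncryptionRelevant("enable-raft-envelope")) // relevant: must be replayed
	fmt.Println(isEncryptionRelevant("put"))                  // ordinary write: skipped
}
```

Adding a future command (the later rounds add `bootstrap-encryption`) then means touching exactly one map, and both consumers pick it up.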
5. Snapshot header len = 0x40 is unexplained and likely wrong
docs/design/2026_04_29_proposed_data_at_rest_encryption.md line 323
The v2 header diagram shows len = 0x40 (64 decimal). The text defines len as "total payload length after the len field." The only defined fields after len are ceiling (8B) and cutover (8B) = 16 bytes. len = 0x40 = 64 therefore implies 48 bytes of reserved padding, which is never explained. A v3+ reader that calls skip(len - 16) would skip 48 bytes of zero padding before the first extension field.
If this is intentional alignment, document why. If it's an error, fix to len = 0x10 = 16 (the two defined fields, no reserved) and update the diagram. Either way the design must specify the exact reserved byte count so implementors know the fixed header size.
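Under the `len = 0x10` fix, the fixed header is fully determined. The following Go sketch is an assumption-laden illustration (1-byte version and length fields, big-endian payload; none of these widths are confirmed by the doc) showing that the skip arithmetic leaves zero reserved bytes:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	v2Version = 0x02
	v2Len     = 0x10 // payload after the len field: ceiling(8) + cutover(8)
)

// hlcSnapshotMagic per kv/snapshot.go:16.
var magic = [8]byte{'E', 'K', 'V', 'T', 'H', 'L', 'C', '1'}

func encodeV2Header(ceiling, cutover uint64) []byte {
	h := make([]byte, 0, 8+1+1+16)
	h = append(h, magic[:]...)
	h = append(h, v2Version, v2Len)
	h = binary.BigEndian.AppendUint64(h, ceiling)
	h = binary.BigEndian.AppendUint64(h, cutover)
	return h
}

func main() {
	h := encodeV2Header(184273, 184201)
	fmt.Println(len(h)) // 8 magic + 1 ver + 1 len + 16 payload = 26
	// A v3+ reader skipping unknown extensions computes len - 16:
	fmt.Println(int(h[9]) - 16) // 0 reserved bytes under this layout
}
```

If the 48-byte reserved region is kept instead, `v2Len` becomes 0x40 and the design must say so explicitly, since every forward-version reader hardcodes this arithmetic.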
6. Post-cutover membership gate still missing
docs/design/2026_04_29_proposed_data_at_rest_encryption.md §7.1
The Phase 0 capability gate prevents the Phase 1 and Phase 2 cutover until every current voting member reports encryption_capable. However, the design contains no invariant governing AddVoter / ChangeConf proposals after Phase 1 or Phase 2 is enabled. A non-capable node added as a voter after Phase 2 would silently read all encrypted storage values as tombstones (via the data[0] != 0 tombstone path) and fail to decode raft-envelope entries.
Add one paragraph to Phase 2 specifying: "Once storage_envelope_active is true, the leader rejects ConfChangeAddVoter / ConfChangeAddLearner-then-promote proposals that do not include encryption_capable = true in the candidate's heartbeat context. Non-voting (learner) members may still join and sync, but cannot be promoted to voters until they demonstrate capability."
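The suggested gate reduces to one small check in the conf-change path. This is an illustrative sketch only: `ErrPeerNotEncryptionCapable` matches the error name used elsewhere in this review, but the conf-change type constants and capability lookup are hypothetical stand-ins.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrPeerNotEncryptionCapable = errors.New("candidate is not encryption_capable")

type confChangeType int

const (
	addVoter confChangeType = iota
	addLearner
	promoteLearner
)

// gateConfChange refuses voter additions and learner promotions once the
// storage envelope is active, unless the candidate advertised capability
// in its heartbeat context. Non-capable learners may still join and sync.
func gateConfChange(storageEnvelopeActive bool, t confChangeType, capable bool) error {
	if !storageEnvelopeActive || capable {
		return nil
	}
	if t == addVoter || t == promoteLearner {
		return ErrPeerNotEncryptionCapable
	}
	return nil
}

func main() {
	fmt.Println(gateConfChange(true, addVoter, false))   // rejected
	fmt.Println(gateConfChange(true, addLearner, false)) // allowed to join as learner
}
```

Note that the later review rounds tighten this further: learners also run the apply path, so even the learner-join branch ends up requiring a capability check plus writer registration.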
Minor
7. v1 detection heuristic lacks a precise threshold
docs/design/2026_04_29_proposed_data_at_rest_encryption.md lines 330–333
The read-path step 2 says: "if [byte 9] falls in the documented v1 ceiling high-byte range (i.e., the implicit-v1 detection)." This heuristic is sound for today (the ceiling high byte is 0x00 for any HLC timestamp within the past several decades and won't reach 0x02 until year ~2253), but the implementation needs a concrete rule. Suggest: "treat byte 9 values 0x02..0x0F as version markers (aligned with §11.3's envelope-version reservation); any other value, including 0x00 and 0x01, indicates a v1 stream."
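The suggested threshold can be written as a tiny total function. A sketch under the stated assumption that `0x02..0x0F` is the §11.3 version-marker reservation (hypothetical here, not confirmed by the excerpt):

```go
package main

import "fmt"

// snapshotVersion inspects byte 9 of the header. Values 0x02..0x0F are
// explicit version markers; anything else (including 0x00 and 0x01,
// the plausible HLC-ceiling high bytes) means an implicit v1 stream.
func snapshotVersion(b9 byte) int {
	if b9 >= 0x02 && b9 <= 0x0F {
		return int(b9)
	}
	return 1
}

func main() {
	fmt.Println(snapshotVersion(0x00)) // HLC ceiling high byte today → v1
	fmt.Println(snapshotVersion(0x02)) // versioned header → v2
}
```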
Conventions check
| Item | Status |
|---|---|
| Filename 2026_04_29_proposed_data_at_rest_encryption.md | ✅ |
| Header Status/Author/Date | ✅ |
| Five CLAUDE.md self-review lenses in §10 | ✅ |
| §10 lens 1 / lens 2 updated for new mechanisms | ✅ |
| Open question §11.5 (Lua batching) still flagged | ✅ |
| `!encryption\|` reserved-key namespace collision confirmed no conflict | ✅ |
| ErrNodeIDCollision / ErrLocalEpochExhausted in §9.1 | ✅ |
| Full-DEK resync in §5.5 (not just active key) | ✅ |
| Snapshot versioning with backward-compat v1 detection | ✅ (modulo issues 1–2) |
Summary
Rounds 1–6 of review have closed every P1 and most High issues from all prior passes. The architecture — index-gated Raft dispatch, MVCC encryption_state bit, deterministic nonces, three-phase rollout, crash-durable sidecar — is solid and well-specified.
Two blockers remain before the implementation PR:

1. §4.4 snapshot magic is wrong. The diagram names `pebbleSnapshotMagic = EKVPBBL1` but kv/snapshot.go uses `hlcSnapshotMagic = EKVTHLC1`. An implementation following the doc writes the wrong 8 bytes, breaking all v1 snapshot backward-compat.
2. The `>=` boundary in the FSM dispatch pseudocode. The flag entry at `raft_envelope_cutover_index` is in legacy framing; applying `raftDEK.Unwrap()` to it on WAL replay produces a GCM tag mismatch. Fix by changing to `>` (exclusive) and noting that the leader wraps entries strictly after the flag.
Issue 3 (incorrect peer_metadata.go reference) should also be resolved before implementation so the capability heartbeat mechanism is implemented in the right place from day one.
Claude finished @bootjp's task in 7m 39s — Review:

| Issue | Resolution |
|---|---|
| Codex P1 (r13, line 319): storage envelope AAD covered only envelope_version ‖ flag ‖ key_id, leaving the ciphertext vulnerable to cut-and-paste / blob relocation attacks | AAD now includes pebble_key (encodeKey(user_key, commit_ts)); §4.1 documents three defeated attack classes: header rewrite, cut-and-paste/blob relocation, MVCC version substitution ✅ |
| Codex P2 (r13, line 1360): cutover capability gate covered only voters, but learners also run kv/fsm.go::Apply | Gate extended to ConfState.Voters ∪ ConfState.Learners in §7.1 Phase 0 step 3, §4.1 bootstrap, §5.6 step 1a, and §10 self-review lens 2. ErrCapabilityCheckFailed now surfaces offending node IDs from either set ✅ |
Critical — must fix before implementation PR opens
1. onEntry pseudocode unwraps the wrong bytes — implementors will write broken code
docs/design/2026_04_29_proposed_data_at_rest_encryption.md §7.1 Phase 2 step 8 (lines 1490–1502) and §6.3 (lines 1158–1176)
This is the issue first raised in Round 12 that was not addressed in rounds 13 or 14.
The design's flow (§6.3) is:
kv/sharded_coordinator.go— wrap the proposalData []bytebefore submitting toraftengine
So the coordinator calls something like:

```go
raftWrapped := raftDEK.Wrap(fsmPayload)
engine.Propose(ctx, raftWrapped)
```

engine.go:1394 then calls encodeProposalEnvelope(req.id, req.payload) — reading the source:

```go
// engine.go:3612–3618
func encodeProposalEnvelope(id uint64, payload []byte) []byte {
	out := make([]byte, envelopeHeaderSize+len(payload))
	out[0] = proposalEnvelopeVersion // 0x01
	binary.BigEndian.PutUint64(out[1:envelopeHeaderSize], id)
	copy(out[envelopeHeaderSize:], payload)
	return out
}
```

So entry.Data = `[0x01 (proposalEnvelopeVersion)][uint64 id (8 bytes)][raftDEK.Wrap(fsmPayload)]`.
The pseudocode at lines 1492–1502:

```
onEntry(index, entry):
    data := entry.Data              // = [0x01][id 8B][raft_envelope(fsmPayload)]
    if index > raft_envelope_cutover_index:
        data = raftDEK.Unwrap(data) // ← WRONG: data[0]=0x01 is the proposal version,
                                    //   not the start of a raft envelope; GCM auth fails
    fsm.Apply(data)
```
raftDEK.Unwrap(entry.Data) operates on bytes that start with proposalEnvelopeVersion = 0x01 followed by an 8-byte ID — not the raft envelope the coordinator produced. AES-GCM authentication fails on every entry.
The actual applyNormalEntry in engine.go:2020–2028 already calls decodeProposalEnvelope(entry.Data) first and gives the FSM only the extracted payload:

```go
func (e *Engine) applyNormalEntry(entry raftpb.Entry) any {
	_, payload, ok := decodeProposalEnvelope(entry.Data)
	if !ok { return nil }
	return e.fsm.Apply(payload)
}
```

It is payload — not entry.Data — that holds raftDEK.Wrap(fsmPayload). The hook must fire on payload, after decodeProposalEnvelope:

```go
// Correct version of applyNormalEntry (§6.3 hook placement):
func (e *Engine) applyNormalEntry(entry raftpb.Entry) any {
	_, payload, ok := decodeProposalEnvelope(entry.Data)
	if !ok { return nil }
	if entry.Index > e.raftEnvelopeCutoverIndex() {
		var err error
		payload, err = e.raftDEK.Unwrap(payload) // payload = raft_DEK.Wrap(fsmPayload)
		if err != nil { return err }
	}
	return e.fsm.Apply(payload)
}
```

resolveProposal (line 2031) also calls decodeProposalEnvelope(entry.Data), which works correctly because entry.Data[0] = 0x01 = proposalEnvelopeVersion — that byte is never encrypted.
Fix the pseudocode and §6.3 to show the unwrap operating on payload (the output of decodeProposalEnvelope), not on entry.Data:

```
onEntry(index, entry):
    // engine.go:2020 — existing first step, unchanged:
    id, payload, ok := decodeProposalEnvelope(entry.Data)
    if !ok: return nil
    // NEW hook between decodeProposalEnvelope and fsm.Apply:
    if index > raft_envelope_cutover_index:
        payload = raftDEK.Unwrap(payload)  // payload was raft_DEK.Wrap(fsmPayload)
    return fsm.Apply(payload)              // fsmPayload, contract unchanged
```
Also update §4.2 to clarify: "the coordinator wraps fsmPayload with the raft DEK before passing it to engine.Propose; engine.Propose then wraps the result in encodeProposalEnvelope, so entry.Data = [proposalEnvelopeVersion][id][raft_DEK.Wrap(fsmPayload)]."
Medium — should resolve before or alongside implementation
2. Raft envelope wire format, nonce scheme, and AAD are unspecified
docs/design/2026_04_29_proposed_data_at_rest_encryption.md §4.2
§4.1 fully specifies the storage envelope: version byte, flag, key_id, deterministic 96-bit nonce (node_id ‖ local_epoch ‖ write_count), AAD (envelope_version ‖ flag ‖ key_id ‖ pebble_key), and overhead. §4.2 says only that the raft DEK "wraps the entire Data []byte" with no format diagram, no nonce scheme, and no AAD.
Two concrete risks this leaves open:
Nonce scheme: If an implementor uses random 12-byte nonces for the raft DEK, the NIST SP 800-38D §8.3 limit of 2³² invocations per key is reached in approximately 5 days at 10k proposals/s — the same problem the storage envelope fixed multiple rounds ago.
Cross-context replay via missing purpose discriminator in AAD: After the Round 14 fix, the storage AAD now binds to pebble_key. But a raft-envelope ciphertext sealed against AAD = envelope_version ‖ flag ‖ key_id ‖ raft_payload_context (or whatever the implementation chooses) would not be detectable as a raft ciphertext if an attacker replays it into the Pebble storage layer (or vice versa) — unless the AAD includes a purpose field ("storage" vs "raft").
Specify in §4.2:
```
Raft envelope format (same layout as §4.1, purpose-specific AAD):

| 0x01 | 0x00 (flag=0, no compress) | key_id(4) | nonce(12) | ciphertext(N) | tag(16) |

Nonce: same node_id ‖ local_epoch ‖ write_count construction as §4.1,
       drawing from the same sidecar-persisted local_epoch counter for the
       raft DEK (not shared with the storage DEK counter — each DEK has
       its own counter in the sidecar). Only the leader calls Wrap.

AAD:   purpose("raft") ‖ envelope_version ‖ key_id
       (distinct from storage-layer AAD = envelope_version ‖ flag ‖ key_id ‖ pebble_key
       so a storage-layer ciphertext replayed as a raft envelope, or vice versa,
       fails GCM authentication)
```
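The cross-context rejection this AAD buys can be demonstrated with Go's standard AES-GCM. A sketch under the assumptions stated above (purpose byte 0x52 for "raft", 2-byte node_id and local_epoch, 8-byte write_count; the all-zero key and the `raftNonce`/`raftAAD` helpers are purely illustrative):

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"encoding/binary"
	"fmt"
)

// raftNonce builds the deterministic 96-bit nonce:
// node_id(2B) ‖ local_epoch(2B) ‖ write_count(8B).
func raftNonce(nodeID, localEpoch uint16, writeCount uint64) []byte {
	n := make([]byte, 12)
	binary.BigEndian.PutUint16(n[0:2], nodeID)
	binary.BigEndian.PutUint16(n[2:4], localEpoch)
	binary.BigEndian.PutUint64(n[4:12], writeCount)
	return n
}

// raftAAD = purpose("raft") ‖ envelope_version ‖ key_id(4B).
func raftAAD(keyID uint32) []byte {
	aad := []byte{0x52, 0x01} // 0x52 = 'R', envelope_version = 0x01
	var id [4]byte
	binary.BigEndian.PutUint32(id[:], keyID)
	return append(aad, id[:]...)
}

func main() {
	key := make([]byte, 32) // AES-256 DEK (all-zero here, illustration only)
	block, _ := aes.NewCipher(key)
	gcm, _ := cipher.NewGCM(block)

	nonce := raftNonce(7, 1, 42)
	ct := gcm.Seal(nil, nonce, []byte("fsm-payload"), raftAAD(9))

	// Replaying the same ciphertext under a storage-style AAD
	// (here just a different purpose byte) fails authentication:
	_, err := gcm.Open(nil, nonce, ct, []byte{0x53})
	fmt.Println(err != nil) // cross-context replay rejected
}
```

The deterministic counter nonce also sidesteps the random-nonce invocation limit noted above, since uniqueness is guaranteed by construction rather than by probability.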
3. Learner registration gap: no path covers learners added after bootstrap
docs/design/2026_04_29_proposed_data_at_rest_encryption.md §4.1 (writer registry, registration paths)
The three current registration paths (bootstrap batch, process-start via coordinator, post-rotation via coordinator) all assume the registering node either (a) exists at bootstrap time or (b) has a client-facing coordinator that can block until the RegisterEncryptionWriter commit is acknowledged. Learners don't serve client writes and have no coordinator gate — so the "process-start via coordinator" path cannot protect a learner from writing its first encrypted entry before its registry entry commits.
Line 1013–1016 says: "Nodes added to the cluster after bootstrap fall back to the process-start trigger described in §4.1 (their first RegisterEncryptionWriter is proposed before they accept any encrypted write at the coordinator)." This is correct for a new voter starting with a coordinator, but a new learner has no such coordinator gate.
The §7.1 post-cutover membership gate (line 1576) checks EncryptionAdmin.GetCapability before AddVoter and learner promotion — but not before ConfChangeAddLearner. A new learner that passes the capability check still has no RegisterEncryptionWriter in flight; its FSM apply will encrypt values under a DEK it is not yet registered for.
Add a fourth registration path: when the leader processes ConfChangeAddLearner (checked via the post-cutover gate), it also proposes RegisterEncryptionWriter(dek_id, learner_full_node_id, learner_local_epoch) — both fields collected from the GetCapability call already made during the membership gate — in the same commit batch or immediately after. The post-cutover gate section at §7.1 should state this explicitly.
Conventions check
| Item | Status |
|---|---|
| Filename 2026_04_29_proposed_data_at_rest_encryption.md | ✅ |
| Header Status/Author/Date | ✅ |
| Five CLAUDE.md self-review lenses in §10 | ✅ |
| AAD extended to envelope_version ‖ flag ‖ key_id ‖ pebble_key in §4.1 | ✅ |
| §4.1 documents cut-and-paste / blob-relocation / MVCC-version-substitution defeats | ✅ |
| Capability gate covers ConfState.Voters ∪ ConfState.Learners | ✅ |
| raft_envelope_cutover_index `>` (exclusive) boundary with comment | ✅ |
| Snapshot format EKVTHLC1 / EKVTHLC2 distinct magics | ✅ |
| hlcSnapshotMagic / hlcSnapshotMagicV2 in §4.6 (not pebbleSnapshotMagic) | ✅ |
| bootstrap-encryption in §5.5 six-entry isEncryptionRelevant() set | ✅ |
| EncryptionAdmin.GetCapability with full_node_id and local_epoch in §6.1 | ✅ |
| Post-cutover AddVoter/learner-promotion gate (ErrPeerNotEncryptionCapable) | ✅ |
| ErrLocalEpochExhausted and ErrNodeIDCollision in §9.1 | ✅ |
| raft_applied_index written atomically with DEKs in §5.1 | ✅ |
| key_id == 0 reserved sentinel excluded from CSPRNG | ✅ |
| Phase 1 step 4 "not raft-DEK-wrapped" (not "legacy framing") | ✅ |
| onEntry pseudocode operates on correct bytes | ❌ still entry.Data — see issue #1 |
| Raft envelope format, nonce, and AAD specified | ❌ not yet specified — see issue #2 |
| Learner registration path for post-bootstrap adds | ❌ missing — see issue #3 |
| §11.5 Lua batching flagged as pre-implementation gate | ✅ |
Summary
Fourteen rounds of review have hardened every cryptographic, wire-format, rollout-safety, and key-management decision in this proposal. The Round 14 AAD fix (record-binding via pebble_key) and the learner/voter capability gate are well-done improvements.
One blocking issue remains before the implementation PR opens:
Issue #1 (Critical): The onEntry pseudocode unwraps entry.Data instead of the FSM payload extracted by decodeProposalEnvelope. Verified against engine.go:2020–2028 and engine.go:3612–3625: entry.Data = [0x01 (proposalEnvelopeVersion)][8-byte id][raft_DEK.Wrap(fsmPayload)]. Calling raftDEK.Unwrap(entry.Data) would fail with a GCM authentication error because the proposal envelope header bytes are not what the raft DEK encrypted. The correct hook placement is inside applyNormalEntry after decodeProposalEnvelope extracts payload, not before it.
Two medium issues should also be resolved before implementation:
Issue #2 (Medium): The raft envelope has no wire format diagram, nonce scheme, or AAD specification. Without these, the implementation could introduce random nonces (hitting the 2³² NIST limit in ~5 days at 10k/s) or omit a purpose discriminator (enabling cross-context replay between storage and raft envelopes).
Issue #3 (Medium): Learners added after the initial bootstrap batch have no described path to register in the writer registry. The "process-start via coordinator" path requires a client-facing coordinator that learners lack. A fourth path (leader proposes RegisterEncryptionWriter on ConfChangeAddLearner) should be documented.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d340ab528c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
> exist cluster-wide. FSM apply may run on Raft entries
> proposed by the leader during this window (those are
> decrypted using DEKs the node already has from its
> sidecar), but no self-originated encrypted entry can be
> queued by the local coordinator until registration
Gate follower apply on writer-registry registration
This window allows FSM apply to process encrypted entries before RegisterEncryptionWriter is committed, which violates the stated invariant in §4.1 that a node must be in the per-DEK writer registry before its first write. For a newly joined learner/voter, apply-path writes can happen without any coordinator traffic, so a 16-bit node_id collision with a historical writer under the same DEK can bypass the registry check and reuse (DEK, nonce) pairs. The membership/join flow should require successful writer registration (or equivalent pre-registration in the conf-change path) before encrypted apply is allowed.
> replays **every entry that satisfies the
> `isEncryptionRelevant()` predicate** (the same six-entry
> set used by the §5.5 lag check above: `rotate-dek`,
> `rewrap-deks`, `bootstrap-encryption`,
> `enable-storage-envelope`, `enable-raft-envelope`,
Add resync fallback when missed entries are compacted
The manual recovery path is defined as replaying encryption-relevant log entries between the sidecar index and FSM applied index, but this fails if the missing bootstrap/cutover entries were already compacted from the Raft log (a normal case after long downtime). In that scenario startup refusal (ErrSidecarBehindRaftLog) can become unrecoverable with the documented resync-sidecar flow, because there is no fallback that reconstructs sidecar state from a full current-state snapshot/entry.
…vs proposal envelope)

Critical (claude[bot] r12 #1): the previous design directed raftDEK.Unwrap to operate on entry.Data, but engine.go already wraps every proposal in encodeProposalEnvelope(id, payload), producing entry.Data = [0x01 proposalEnvelopeVersion][uint64 id][payload]. Unwrapping entry.Data would clobber the leading 0x01 that resolveProposal/decodeProposalEnvelope check, returning ok=false and silently dropping every coordinator response -- every write would time out forever after Phase 2. Even if the unwrap "succeeded," the result would be the inner proposal envelope itself starting with 0x01 = raftEncodeBatch, misrouting every post-Phase-2 entry.

Fix the layering: the raft envelope wraps the FSM payload INSIDE the proposal envelope, not entry.Data itself. §4.2 explains the entry.Data layout and the call sequence. §6.3 hook is now placed inside applyNormalEntry, between decodeProposalEnvelope and fsm.Apply -- decode first to recover (id, maybe_enc_payload), then unwrap if index > cutover, then Apply the cleartext. §7.1 pseudocode rewritten as applyNormalEntry showing the full sequence with explicit comments calling out the layering trap. Coordinator-side bullet in §6.3 also clarified to wrap with raft DEK BEFORE handing to engine submit path.
@claude review

Claude encountered an error — I'll analyze this and get back to you.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 36ffb904ee
> # refuses unless every
> # voting member reports
> # encryption_capable
Require learner capability before enabling envelopes
This command contract still says cutover is allowed when only voting members report encryption_capable, but later in §7.1 the rollout safety argument depends on checking voters and learners because learners run the same apply path. If implementers follow this section, Phase 1/2 could be enabled with a stale learner that then mis-handles encrypted entries (tombstone misread or apply failure), which undermines the migration safety guarantees.
…15 medium x2)

P1 (codex on 36ffb90 line 245) + claude[bot] r15 #3: writer registry §4.1 had three registration paths (bootstrap batch / process-start via coordinator / post-rotation via coordinator). Learners do not serve client writes, so they have NO coordinator gate to defer the first encrypted FSM apply behind the RegisterEncryptionWriter commit. A new learner whose uint16(node_id) collides with a historical writer would reuse (DEK, nonce) pairs on the very first apply Put. Add a fourth path: ConfChange-time. The leader processing ConfChangeAddLearner / ConfChangeAddNode (already gated by GetCapability per the §7.1 post-cutover gate) proposes paired RegisterEncryptionWriter entries for every unretired DEK before committing the conf-change; collisions reject with ErrPeerNotEncryptionRegistered. Same applies to learner→voter promotion.

P2 (codex on 36ffb90 line 966): §5.5 manual recovery via log replay fails when the missing encryption-relevant entries have been compacted from the Raft log (normal after long downtime + snapshot retention). ErrSidecarBehindRaftLog becomes unrecoverable. Add a leader-side full-state fallback: ErrLogCompactedPastSidecar triggers EncryptionAdmin.GetSidecarState, which returns the entire unretired DEK set (KEK-wrapped) plus active flags and cutover index. The recovering node overwrites its sidecar authoritatively. Safe because it was already behind and has no committed-but-unapplied state to lose.

Medium (claude[bot] r15 #2): §4.2 raft envelope had no wire format / nonce scheme / AAD specified. Add an explicit format diagram (same layout as §4.1, flag fixed at 0x00, separate per-DEK local_epoch counter); the nonce uses the same node_id ‖ local_epoch ‖ write_count construction with the raft DEK's OWN counter, and AAD = purpose("raft") ‖ envelope_version ‖ key_id with a literal byte 0x52 prefix that distinguishes raft-layer ciphertext from storage-layer ciphertext (so cross-context replay fails GCM verification).

No pebble_key in the raft AAD because the raft envelope is location-independent (identified by entry.Index, used by the engine hook for index-gating). Note: claude[bot] r15 Critical #1 (entry.Data vs payload layering) was already addressed in round-15 commit 36ffb90. §6.1 admin.go updated to include the GetSidecarState RPC and SidecarStateReport message.
@claude review

Claude encountered an error — I'll analyze this and get back to you.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2c89b62057
> 2. If an entry exists with the **same** `full_node_id` →
>    this is a re-registration (post-restart). Bump
>    `last_seen_local_epoch` and allow. The node was already
>    a legitimate writer.
Reject stale local_epoch on writer re-registration
The re-registration rule currently accepts any entry with the same full_node_id and just updates last_seen_local_epoch, which allows a node restored from stale disk/backup state to come back with a lower local_epoch under the same unretired DEK. Because the nonce is node_id || local_epoch || write_count, that rollback can reproduce previously used nonces and cause AES-GCM nonce reuse. The registration check should enforce monotonicity (e.g., require new_local_epoch > last_seen_local_epoch) and fail startup/registration on regression.
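The monotonicity rule suggested here is a one-line guard. A minimal sketch, with the caveat that `ErrLocalEpochRegression`, `writerEntry`, and `reRegister` are hypothetical names for illustration only:

```go
package main

import (
	"errors"
	"fmt"
)

var ErrLocalEpochRegression = errors.New("local_epoch regressed on re-registration")

type writerEntry struct {
	lastSeenLocalEpoch uint16
}

// reRegister enforces new_local_epoch > last_seen_local_epoch, so a node
// restored from stale disk or a backup cannot re-enter a nonce domain
// (node_id ‖ local_epoch ‖ write_count) it has already consumed.
func reRegister(e *writerEntry, newEpoch uint16) error {
	if newEpoch <= e.lastSeenLocalEpoch {
		return ErrLocalEpochRegression
	}
	e.lastSeenLocalEpoch = newEpoch
	return nil
}

func main() {
	e := &writerEntry{lastSeenLocalEpoch: 5}
	fmt.Println(reRegister(e, 6)) // normal restart: epoch bumped, allowed
	fmt.Println(reRegister(e, 4)) // restore from stale backup: rejected
}
```

Failing registration (and therefore startup) on regression is the safe default, since any write under a reused epoch risks AES-GCM nonce reuse under an unretired DEK.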
…earner consistency)

P1 (codex on 2c89b62 line 1450): §6.6 admin command contract for enable-storage-envelope / enable-raft-envelope still said "voting member" only, contradicting §7.1 Phase 0 step 3 (which I extended to ConfState.Voters ∪ ConfState.Learners in round 14). An implementer following §6.6 in isolation would build a gate that lets cutover proceed with a stale learner present, defeating the migration safety guarantee. Updated §6.6 contract comments and four other site-wide "voting member" references (§6.1 GetCapability fan-out scope, §6.3 raftEncodeEncryptionBootstrap body description, §6.2 lsm_store layout extension, §7.1 Phase 0 / Phase 2 prose) to consistently say "voters AND learners" or "every member" so the safety invariant is uniform across the design.
@claude review
🧹 Nitpick comments (1)
docs/design/2026_04_29_proposed_data_at_rest_encryption.md (1)
**1514-1520: Clarify that Phase 2 cutover does not require rewrite completion.**

Lines 1516-1519 describe Phase 1 completion criteria (zero cleartext versions + `minRetainedTS` advanced), but Phase 2 cutover (lines 1521-1610) does not list "Phase 1 complete" as a precondition. This implies the operator can run `enable-raft-envelope` while cleartext MVCC versions remain on disk.

This is safe (storage reads dispatch on `encryption_state` per lines 1373-1390), but it violates the §1 threat model goal until rewrite completes. An attacker who steals disks during the Phase 1→2 window gets both cleartext historical versions and encrypted new writes.

Add an explicit note at line 1523 (start of Phase 2 section) clarifying:

- Phase 2 can proceed before rewrite finishes (raft envelope and storage envelope are orthogonal).
- However, the §1 threat model guarantee ("persisted state cannot be decrypted without KEK") only holds cluster-wide after the §7.1 step 6 rewrite completes and `minRetainedTS` advances past the youngest cleartext `commit_ts` on every node.
- Operators with strict compliance requirements should gate Phase 2 on rewrite completion.

This makes the security posture explicit for operators who care about the timing of full coverage.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/design/2026_04_29_proposed_data_at_rest_encryption.md` around lines 1514 - 1520, Add a clear note at the start of the Phase 2 section (before or at the existing "enable-raft-envelope" discussion) stating that Phase 2 can be executed before the in-place MVCC rewrite completes because raft-envelope and storage-envelope are orthogonal (storage reads still dispatch based on encryption_state), but that the §1 threat-model guarantee ("persisted state cannot be decrypted without KEK") only holds cluster-wide after the §7.1 step 6 rewrite finishes and minRetainedTS has advanced beyond the youngest cleartext commit_ts on every node; advise operators with strict compliance to gate execution of enable-raft-envelope (Phase 2) until the rewrite and minRetainedTS advancement are verified.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 30d9da28-4ad3-43cb-ac93-fd6cb331957a
📒 Files selected for processing (1)
docs/design/2026_04_29_proposed_data_at_rest_encryption.md
Claude encountered an error — View job. I'll analyze this and get back to you.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6168382a96
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```
EncryptionAdmin.GetSidecarState(Empty) → SidecarStateReport
  { wrapped_deks_by_id (map<uint32, bytes>),
    active_storage_id (uint32), active_raft_id (uint32),
    storage_envelope_active (bool),
    raft_envelope_cutover_index (uint64),
    latest_applied_index (uint64) }
```
Preserve per-node local_epoch during sidecar resync
The compaction fallback snapshot sent by GetSidecarState includes wrapped DEKs and cutover flags but omits each key's local_epoch, yet §5.5 says the follower overwrites its sidecar from this authoritative state. If a node had previously encrypted data under a DEK and then resyncs this way, its per-DEK epoch can roll back (or reset), so after restart it can regenerate previously used node_id || local_epoch || write_count nonces under the same key, which is catastrophic for AES-GCM. The resync payload needs to carry and preserve the follower's monotonic per-node epochs (or otherwise re-establish a nonce domain) before permitting further encryption.
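The resync rule this comment asks for can be sketched in Go. This is an illustrative model only, not elastickv code: `resyncLocalEpochs` and its map-based registry shape are invented here to show the `max(existing, registry) + 1` merge the review calls for.

```go
package main

import "fmt"

// resyncLocalEpochs merges a recovering follower's on-disk per-DEK epochs with
// the leader-recorded registry values for that follower, following the rule
// new_local_epoch = max(existing_sidecar, registry_for_caller) + 1, so the
// post-resync epoch is strictly above any nonce domain this follower has ever
// used under each DEK.
func resyncLocalEpochs(sidecar, registryForCaller map[uint32]uint16) (map[uint32]uint16, error) {
	merged := make(map[uint32]uint16, len(sidecar)+len(registryForCaller))
	for keyID, epoch := range sidecar {
		merged[keyID] = epoch
	}
	for keyID, epoch := range registryForCaller {
		if epoch > merged[keyID] {
			merged[keyID] = epoch
		}
	}
	for keyID, epoch := range merged {
		if epoch == 0xFFFF {
			// Epoch space exhausted: rotating the DEK is the only safe recovery.
			return nil, fmt.Errorf("dek %d: local_epoch exhausted, rotate DEK", keyID)
		}
		merged[keyID] = epoch + 1 // the same crash-safe bump performed on every process start
	}
	return merged, nil
}

func main() {
	merged, err := resyncLocalEpochs(
		map[uint32]uint16{1: 4, 2: 9}, // follower's possibly-stale sidecar
		map[uint32]uint16{1: 7},       // leader-recorded epochs for this follower
	)
	fmt.Println(merged[1], merged[2], err) // 8 10 <nil>
}
```

The key property: even if the follower's sidecar rolled back (DEK 1 at 4 while the leader saw 7), the merged value lands strictly above both, so no `node_id || local_epoch || write_count` nonce can repeat.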
> 8. From the **next** entry after the flag entry's apply index
>    onward, every leader wraps new proposal `Data []byte` with
>    the raft DEK (§4.2). The flag entry itself is **not**
Quiesce proposals across raft-envelope cutover
This cutover defines a strict index boundary (index > raft_envelope_cutover_index) but does not require pausing or sequencing client proposals while the enable-raft-envelope entry is in flight. In a busy leader, cleartext proposals can be appended after the cutover entry is proposed but before wrapping is enabled, giving them indexes above cutover; followers will then try to unwrap those cleartext payloads and fail apply. The rollout needs an explicit barrier (e.g., block proposal intake until the cutover entry is applied, then enable wrapping) to keep format and index boundaries consistent.
… + cutover quiescence)

P1 (codex on 6168382 line 286): §4.1 re-registration rule unconditionally bumped `last_seen_local_epoch` on same-`full_node_id` matches, allowing a node restored from stale disk/backup state to re-register with a LOWER `local_epoch` and reproduce previously-used nonces under the same DEK -- catastrophic AES-GCM nonce reuse. Split case 2 in two: (2) accept only if the proposed `local_epoch` > `registry.last_seen_local_epoch`, bumping the registry; (3) reject with `ErrLocalEpochRollback` otherwise. Recovery is either a manual sidecar bump or DEK rotation (which retires the registry slice).

P1 (codex on 6168382 line 1061): §5.5 GetSidecarState compaction fallback overwrote the follower sidecar authoritatively without preserving per-DEK `local_epoch`, allowing a rollback/reset of the per-(node, DEK) nonce counter on resync. Add a `writer_registry_for_caller` field to `SidecarStateReport` carrying the leader-recorded `last_seen_local_epoch` for THIS follower under each DEK; the recovering follower computes `new_local_epoch = max(existing_sidecar, registry_for_caller) + 1` per DEK, so the post-resync value is strictly above any nonce ever issued by this follower under that DEK. The +1 increment is the same crash-safe step §5.1 performs on every process start. The re-registration check (case 3 above) closes the loop on the next round-trip.

P1 (codex on 6168382 line 1673): §7.1 Phase 2 had a propose-time / apply-time race -- a busy leader could append cleartext proposals after proposing the cutover entry but before flipping its wrap-on-propose switch (which fires at FSM apply), giving them indexes > `raft_envelope_cutover_index` without the §4.2 envelope. Followers would then unwrap them as raft envelopes and fail every apply with GCM verification errors. Add a leader-side proposal-quiescence barrier: block intake → drain in-flight → propose cutover → wait for local apply → flip switch → unblock. Single Raft commit RTT cost.

Phase 1 enable-storage-envelope does NOT need this barrier because storage encryption is decided per-Put at FSM apply time (per-replica, deterministic) rather than per-proposal at propose time (leader-only). §6.1 `SidecarStateReport` proto extended with `writer_registry_for_caller`. §9.1 startup refusal list extended with `ErrLocalEpochRollback`.
@claude review

Claude encountered an error — View job. I'll analyze this and get back to you.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7bd5539178
> `RegisterEncryptionWriter` (§4.1) **after** it has
> observed both the bootstrap entry and the
> `enable-storage-envelope` flag entry in its applied log,
> not at process start. Until both are present the storage
Do not force initial nodes to re-register bootstrap writers
Step 1a already inserts writer-registry records during bootstrap-encryption, but this step says every node performs its first RegisterEncryptionWriter right after bootstrap/flag apply. Under the registry rules in §4.1, a registration with the same full_node_id and unchanged local_epoch is rejected as ErrLocalEpochRollback, so implementing both instructions together can make the initial rollout fail even when all nodes are healthy. Clarify that the bootstrap cohort must skip the immediate per-node registration (or must increment epoch before re-registering).
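A minimal sketch of the skip-if-current check the comment asks for; the function name and parameters are invented for illustration, not part of the design doc's API.

```go
package main

import "fmt"

// needsStartupRegistration reports whether a freshly started node must propose
// RegisterEncryptionWriter under a given DEK: only when its crash-safe epoch
// bump has persisted a value strictly greater than the registry record.
// Bootstrap-cohort nodes, whose registry slot already holds the current
// local_epoch from the bootstrap entry, skip the second registration and so
// never trip ErrLocalEpochRollback.
func needsStartupRegistration(persistedEpoch, registryEpoch uint16, inRegistry bool) bool {
	if !inRegistry {
		return true // no slot yet: first-ever registration for this node
	}
	return persistedEpoch > registryEpoch
}

func main() {
	fmt.Println(needsStartupRegistration(5, 0, false)) // true: never registered
	fmt.Println(needsStartupRegistration(3, 3, true))  // false: bootstrap slot already current
	fmt.Println(needsStartupRegistration(4, 3, true))  // true: restart bumped past registry
}
```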
> 4. **DEK retirement criterion.** A DEK is safe to unload only when
>    **both** of the following are true cluster-wide:
>    - The rewrite cursor for that DEK has reached the end of the
>      keyspace AND `elastickv_encryption_values_per_dek{key_id} == 0`
>      across every node (verified by `encryption status --verify`).
Gate DEK retirement on WAL needs for raft-purpose keys
This retirement rule is MVCC-only (rewrite completion and values_per_dek), but the design also supports rotate-dek --purpose=raft. If applied to raft-purpose DEKs, operators can retire an old raft key while WAL still contains entries encrypted with that key_id; a restart or lagging follower replay would then hit unknown_key_id during apply. Raft-purpose retirement needs an additional guard tied to log/snapshot compaction state (e.g., no retained entries below last index encrypted by that DEK).
…ister + raft DEK WAL retention)

P1 (codex on 7bd5539 line 1214): §5.6 step 1a inserted the bootstrap cohort into the writer registry, but the §4.1 process-start trigger ALSO fired on a bootstrap node's first observation of enable-storage-envelope. Per the round-18 §4.1 case 3 monotonicity check (`ErrLocalEpochRollback`), the second registration with the same `local_epoch` as the bootstrap entry would be rejected -- breaking the cluster-wide rollout even when every node is healthy. Reword the §4.1 process-start trigger: it fires only on a fresh process load whose §5.1 `local_epoch` bump has just persisted a value strictly greater than the registry record. Bootstrap cohorts skip the second registration because their slot already contains the current `local_epoch`; the implementation must check the registry on coordinator startup and skip the propose if its slot is already current.

P1 (codex on 7bd5539 line 919): §5.4 retirement criterion was MVCC-only (rewrite cursor + `minRetainedTS`), but `rotate-dek --purpose=raft` retires raft DEKs whose ciphertext lives in the etcd raft WAL, not Pebble. Retiring a raft DEK while the WAL still contains entries encrypted under it would cause `unknown_key_id` apply failures on the next process restart (WAL replay) or on a lagging follower's catch-up. Split the criterion by purpose: storage-purpose keeps the MVCC-driven gate; raft-purpose adds a WAL-driven gate (every node's log start index strictly greater than the largest log index ever proposed under the retiring raft DEK, plus the last committed snapshot taken under the new raft DEK). New `ErrRaftDEKWALStillReferences` and metric `elastickv_encryption_last_proposed_index_per_raft_dek` for the operator to observe before retiring.
@claude review

Claude encountered an error — View job. I'll analyze this and get back to you.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6b343d0339
> failure, return an error from `Apply` so the entry is not
> silently skipped (consistency invariant — see CLAUDE.md self-
Propagate unwrap failure via engine error path
This section says unwrap failures can be handled by “return[ing] an error from Apply so the entry is not silently skipped,” but the current raft contract does not support that: StateMachine.Apply returns any (not error), and internal/raftengine/etcd/engine.go::applyCommitted advances setApplied(entry.Index) regardless of the Apply return value. In a post-cutover node that hits a raft-envelope decrypt/tag failure, treating the failure as a normal Apply return would still mark the entry applied and continue replay, violating the consistency guarantee this design relies on. The design should require an engine-level error channel/return path that stops apply before setApplied when unwrap fails.
…t halt apply, not silently skip)

P1 (codex on 6b343d0 line 1411): §6.3 promised "unwrap failure propagates as an Apply error so the entry is not silently skipped", but the actual StateMachine.Apply contract returns `any` (no error), and engine.go::applyCommitted advances `setApplied(entry.Index)` regardless of the Apply return value. A raft-envelope decrypt failure (sidecar/keystore divergence or on-disk corruption) would silently mark the entry applied and continue replay -- exactly the silent FSM divergence the integrity tag was added to detect. Fix: the engine hook itself returns a sentinel `ErrRaftUnwrapFailed` (or sibling `ErrEncryptionApply` for 0x03/0x04/0x05 internal errors) directly to applyCommitted, BEFORE calling fsm.Apply. applyCommitted is extended to recognise these sentinels and stop the apply loop WITHOUT advancing `setApplied`; the process exits via the existing fatal-error path so an operator-supervised restart under a corrected keystore (§5.5 resync, §9.3 KEK custody) replays the same entry. §6.3 implementation plan, §7.1 Phase 2 pseudocode, and §10 self-review lens 1 all updated to reflect this fail-fast contract.
@claude review

Claude encountered an error — View job. I'll analyze this and get back to you.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5e27f2d429
> 1. block new proposal intake at the engine.Propose entrypoint
>    (return ErrEnvelopeCutoverInProgress to coordinator)
> 2. wait for the in-flight proposal queue to drain (all
>    previously-accepted proposals committed and applied)
> 3. propose the enable-raft-envelope entry (raftEncodeEncryptionRotation
Exempt cutover entry from global propose block
This sequence blocks engine.Propose before submitting enable-raft-envelope, but internal/raftengine/etcd/engine.go currently exposes Propose as the write entrypoint; implementing this literally can reject the cutover proposal itself and leave the cluster stuck in a write-rejecting barrier. The cutover flow needs an explicit exception/bypass for the cutover proposal (or a narrower gate that only blocks user proposals) so step 3 can always be appended and committed.
```proto
string build_sha = 2;
bool sidecar_present = 3;
uint64 full_node_id = 4; // for §5.6 batch registry
uint32 local_epoch = 5;  // for §5.6 batch registry
```
Keep local_epoch wire types aligned with 16-bit nonce field
The design defines local_epoch as 16-bit and encodes it into a 2-byte nonce field (§4.1), but the RPC schema widens it to uint32; this mismatch invites silent truncation if values are copied back into sidecar/nonce state without a hard bounds check. A truncation on resync/registration can violate the monotonic epoch invariant that prevents nonce reuse, so the wire contract should be explicitly 16-bit-equivalent (or require strict <= 0xFFFF validation at decode points).
… + epoch wire type)

P1 (codex on 5e27f2d line 1804): §7.1 Phase 2 step 7 quiescence barrier blocked engine.Propose globally before submitting the cutover entry, but step 3 of the same sequence is itself an engine.Propose call -- so the literal implementation would reject its own cutover proposal and leave the cluster stuck in a write-rejecting barrier forever. Specify a per-call source tag: USER proposals from coordinators are blocked, but the encryption-admin path (cutover proposals plus RegisterEncryptionWriter triggered by ConfChangeAddLearner mid-cutover) bypasses with `source = "encryption_admin"`.

P2 (codex on 5e27f2d line 1317): §6.1 RPC schema used `uint32` for `local_epoch`, but the §4.1 nonce field reserves only 16 bits. Without explicit decode-side validation, a corrupted/malicious response carrying a value > 0xFFFF would silently truncate when copied into the nonce field, breaking the monotonicity invariant the round-18 `ErrLocalEpochRollback` was added to enforce. Added an explicit "value <= 0xFFFF or `ErrLocalEpochOutOfRange`" rule, applied symmetrically to `CapabilityReport.local_epoch` and every value in `SidecarStateReport.writer_registry_for_caller`. New `ErrLocalEpochOutOfRange` added to the §9.1 startup refusal list (and it applies at every RPC decode site).
@claude review

Claude encountered an error — View job. I'll analyze this and get back to you.
Summary
- Adds `docs/design/2026_04_29_proposed_data_at_rest_encryption.md` (Status: Proposed) covering encryption-at-rest for elastickv.
- Raft proposal `Data []byte` is wrapped in an envelope, and the same ciphertext flows through Raft → WAL → Pebble SST → FSM snapshot, so no surface holds cleartext values.
- KEK-wrapped DEKs live in `encryption/keys.json`. DEK rotation is operator-driven via Raft so every replica observes the new key at the same log index.
- Migration is marker-driven (`0x00` cleartext, `0x01` encrypted) plus a rate-limited rewrite job. Reverse migration is intentionally unsupported (dump-and-reload).
- Follows the design-doc-first workflow in CLAUDE.md; implementation PRs will land after review of this proposal.
Test plan
- Verified the new file follows `docs/design/README.md` conventions.