
docs: propose data-at-rest encryption design#707

Open
bootjp wants to merge 22 commits into main from docs/proposed-data-at-rest-encryption

Conversation

@bootjp
Owner

@bootjp bootjp commented Apr 29, 2026

Summary

  • Add docs/design/2026_04_29_proposed_data_at_rest_encryption.md (Status: Proposed) covering encryption-at-rest for elastickv.
  • Threat model: protect against disk theft, decommissioned-drive recovery, backup leak, and Raft WAL/snapshot leak. Without the externally-held key material, the persisted state cannot be decrypted.
  • Encryption boundary: per-value AES-256-GCM envelope at the storage layer + the same envelope wrapping every Raft proposal Data []byte. Keeps the same ciphertext flowing through Raft → WAL → Pebble SST → FSM snapshot, so no surface holds cleartext values.
  • Key hierarchy: external KEK (AWS KMS / GCP KMS / Vault / file) wraps DEKs; the data dir only holds wrapped DEKs in encryption/keys.json. DEK rotation is operator-driven via Raft so every replica observes the new key at the same log index.
  • Migration: rolling restart with envelope-version byte (0x00 cleartext, 0x01 encrypted) plus a rate-limited rewrite job. Reverse migration is intentionally unsupported (dump-and-reload).
  • Self-review per CLAUDE.md (data loss / concurrency / performance / consistency / test coverage) included; Jepsen Redis + DynamoDB suites against an encrypted 3-node cluster are the implementation acceptance gate.

Follows the design-doc-first workflow in CLAUDE.md — implementation PRs will land after review of this proposal.

Test plan

  • Doc-only change; no code or tests in this PR.
  • Reviewer: confirm filename / header follow docs/design/README.md conventions.
  • Reviewer: confirm threat model scope (in/out) matches the operational stance you want.
  • Reviewer: weigh open question §11.5 (interaction with Lua commit batching) before the implementation PR is opened.

Summary by CodeRabbit

  • Documentation
    • Added comprehensive design proposal for data‑at‑rest encryption: per‑value ciphertext everywhere, encrypted replication envelopes, external key management and DEK lifecycle, cluster-wide nonce/uniqueness and refusal conditions, crash‑durable key sidecar and startup validation, MVCC encryption state handling, multi‑phase rollout and admin commands, snapshot/joiner semantics, observability constraints, performance/Jepsen gates, and a full test plan.

Add a Proposed design doc covering value-level + Raft-proposal envelope encryption with KEK/DEK separation and KMS-backed key custody, so the cluster persisted state is undecryptable without externally-held key material.
@coderabbitai

coderabbitai Bot commented Apr 29, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 17 minutes and 55 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d71f01c4-a3ee-4ad5-a1d0-64db095ade7f

📥 Commits

Reviewing files that changed from the base of the PR and between 36ffb90 and 07b2fd9.

📒 Files selected for processing (1)
  • docs/design/2026_04_29_proposed_data_at_rest_encryption.md
📝 Walkthrough

Walkthrough

Proposes elastickv data-at-rest encryption: per-value AES-256-GCM storage envelopes, separate Raft-level envelopes with distinct DEKs, external KEK wrapping, cluster-wide nonce/epoch model, crash-durable sidecar DEK persistence, three-phase rollout, MVCC encryption state bits, admin APIs, and refusal/resync semantics.

Changes

Cohort: Data-at-Rest Encryption Design
File(s): docs/design/2026_04_29_proposed_data_at_rest_encryption.md
Summary: Adds a new comprehensive design document defining per-value AES-256-GCM storage envelopes (deterministic nonce, AAD binding), separate raft-purpose DEKs and raft envelopes, cluster-wide nonce/epoch writer registry, KEK-wrapped DEKs (KMS/Vault/file/env), leader-generated DEKs committed via Raft, crash-durable sidecar protocol for wrapped DEKs and resync semantics, MVCC encryption_state bits, three-phase rollout gated by Raft flags and an EncryptionAdmin.GetCapability RPC, admin commands (status/rotate/rewrap/rewrite/retire/resync), snapshot/restore cutover/versioning, refusal conditions, observability constraints, metrics/logging guidance, performance/Jepsen gates, and a detailed test plan.

Sequence Diagram(s)

sequenceDiagram
    participant Admin
    participant Leader
    participant KMS
    participant Raft
    participant Sidecar
    participant WAL
    participant Pebble

    Admin->>Leader: request DEK rotate / generate
    Leader->>KMS: request KEK wrap (wrap DEK)
    KMS-->>Leader: wrapped DEK
    Leader->>Raft: propose DEK-commit (wrapped DEK)
    Raft-->>Leader: commit index
    Leader->>Sidecar: persist wrapped DEK with commit index
    Sidecar-->>Leader: ack persisted
    Admin->>Leader: propose write (key + clear value)
    Leader->>Leader: create raft-envelope (op + key + clear value)
    Leader->>Raft: propose raft-envelope (encrypted with raft-DEK)
    Raft-->>WAL: append encrypted raft-envelope
    Raft-->>Pebble: apply -> create storage-envelope (encrypt value with storage-DEK)
    Pebble-->>Pebble: persist storage-envelope (value ciphertext)
    Pebble-->>Leader: ack applied

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Poem

🐰 a rabbit in the vault, penning lines in salt
Wrapped keys tucked snug beneath the leaves,
DEKs and KEKs in careful heaves,
Three phases hop, the cluster hums,
Encrypted fields where sunlight comes,
Securely stored — the burrow dreams. 🥕✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title clearly and concisely summarizes the main change: proposing a data-at-rest encryption design for the elastickv system.
  • Docstring Coverage ✅ Passed: no functions found in the changed files to evaluate; docstring coverage check skipped.
  • Linked Issues Check ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check ✅ Passed: check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

Review rate limit: 0/1 reviews remaining, refill in 17 minutes and 55 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.

@bootjp
Owner Author

bootjp commented Apr 29, 2026

@claude review

@claude

claude Bot commented Apr 29, 2026

Claude finished @bootjp's task in 5m 48s


Review: docs: propose data-at-rest encryption design

  • Gather context and understand request
  • Read design document
  • Check docs/design/README.md conventions
  • Review architecture and design decisions
  • Post review feedback

Overall the proposal is thorough and well-structured. The boundary choice (§3b), key hierarchy, Raft-coordinated rotation, and self-review section are all strong. There are a few places where the design has internal contradictions or will collide with the existing lsm_store.go encoding — these need resolution before the implementation PR opens.


Critical issues

1. Version-byte sentinel collides with lsm_store.go's existing tombstone byte

encodeValue in store/lsm_store.go already prepends a 9-byte header to every Pebble value:

[tombstone(1)] [expireAt(8)] [user_value...]

tombstone = false → byte 0 is 0x00. tombstone = true → byte 0 is 0x01.

The §4.1 envelope proposes 0x01 as the encrypted-version marker placed at the start of the user-value bytes (position 9 in the full Pebble-encoded value). The collision is not in the tombstone byte itself — a tombstone entry written by an encrypted build would still store nil as its value — but in the check the design implies decodeValue's caller performs on data[valueHeaderSize:]: an old, non-tombstone user value whose first byte happens to be 0x01 would be misread as an encrypted envelope.

Since the user value is arbitrary bytes, this is a 1-in-256 per-value failure rate during the mixed-mode migration window — spurious decrypt errors for roughly 0.4% of old values.

A clean fix that doesn't break the on-disk format: replace tombstone byte ∈ {0,1} with a 2-bit field — use 0x02 for "non-tombstone, encrypted" and 0x03 for "tombstone, encrypted" — so the encryption flag lives in the existing valueHeaderSize header where old builds would see an unexpected tombstone byte value (an error) rather than silently returning garbage. This keeps valueHeaderSize = 9 and is backward-compatible in the right direction (old build refuses to serve encrypted data).


2. Double-encryption ambiguity: §4.2 contradicts §6.2

§4.2 says: "the coordinator encrypts before building the proposal" (the value in the Raft proposal is already §4.1 ciphertext).

§6.2 says: "pebbleStore gains an optional *encryption.Keystore. Put paths wrap before pebble.Batch.Set" (the storage layer encrypts on write).

If both are true as written, every value is encrypted twice — once by the coordinator before proposing, then again by lsm_store.PutAt. The design needs to pick one canonical encryption site and be explicit about what the other layer does (or does not do) with already-encrypted bytes. The most coherent reading is:

  • Option A (storage-layer-only): The coordinator proposes cleartext values. The §4.2 raft envelope wraps the entire proposal (including cleartext values) with the raft DEK. On apply, FSM unwraps the proposal, gets cleartext values, calls lsm_store.PutAt which encrypts with the storage DEK. This is simpler and avoids a coordinator-visible keystore.
  • Option B (coordinator pre-encrypts): The coordinator encrypts values (§4.1) before proposing; the storage layer's Put must detect already-encrypted bytes and skip re-encryption.

The document should commit to one option and strike the conflicting language from the other section.


Important issues

3. Envelope format is inconsistent between §4.1 and §6.4

§4.1 defines the envelope as 33 bytes overhead:

| 0x01 | key_id(4) | nonce(12) | ciphertext(N) | tag(16) |

§6.4 says: "The envelope gains a 1-byte 'compressed?' flag immediately after the version byte."

That would change the format to:

| 0x01 | compressed(1) | key_id(4) | nonce(12) | ciphertext(N) | tag(16) |

But §8.2 still says "34 bytes per stored value" (33 + 1 compress flag), while the §4.1 diagram is never updated. The design needs a single authoritative envelope diagram that includes the compress flag, with updated overhead numbers and a clear statement of what the AAD covers (key_id ‖ envelope_version ‖ compressed_flag?).

4. "Same ciphertext" claim in §1 and §4.2 is inaccurate

The PR description and §1 state that "the same ciphertext flows through Raft → WAL → Pebble SST → FSM snapshot." This is not true: the Raft WAL carries the §4.2 envelope encrypted with the raft DEK, while Pebble SSTs carry the §4.1 envelope encrypted with the storage DEK. These are different ciphertexts of (partially) the same plaintext.

The correct property is: every persistent surface holds ciphertext — no surface holds cleartext user values. Clarifying this avoids misleading the implementor about what DEK is used where.

5. Per-node sidecar out-of-sync is unaddressed

Each node holds <dataDir>/encryption/keys.json. The design says DEK rotation goes through Raft so "every replica observes the new DEK at the same log index." But Raft does not atomically update the sidecar — the FSM apply writes the sidecar, which is a local file write. A node that crashes between receiving the committed Raft entry and fsyncing the sidecar could start with a stale sidecar after restart.

§9.3 says a lost KEK makes data unrecoverable; the analogous case here is a node whose sidecar missed the last rotation. The design should specify: how does a node detect sidecar-vs-log-index divergence on restart, and what does it do (refuse to start? re-apply the rotation?).


Moderate issues

6. Rewrite cursor key encryption bootstrapping

§7.1 says the rewrite job stores its checkpoint in !encryption|rewrite|cursor. If that key lives in Pebble under the same lsm_store.PutAt path, it will itself be encrypted with the storage DEK. The job reading its own cursor must therefore decrypt it — which works as long as the DEK is loaded. But if the job is interrupted across a DEK rotation, the cursor entry's key_id may reference an old DEK that is still loaded (correct) or one that has been retired and unloaded (problem). The design should explicitly state that the rewrite cursor key uses the same envelope as user values and is subject to the same old-DEK-retention requirement.

7. MVCC metadata exposure is broader than §11.6 acknowledges

§11.6 correctly notes that ExpireAt is cleartext. The encodeValue format also stores CommitTS as part of the Pebble key (via encodeKey(key, commitTS)), not the value. Version timestamps for every write are therefore always visible on disk, regardless of encryption. This is already implicit in §2.2 ("set of keys ever written"), but the design should also note that the write history (number of versions, write timestamps) is visible from SST key metadata.

8. --encryption-rotate-on-startup on a follower

§6.5 says this flag "Force[s] a DEK rotation at boot." If a node starts as a follower (which it often does during a rolling restart), it cannot propose to Raft. The flag description should specify that rotation is only attempted by the node that becomes leader, or that it is a no-op / deferred until leadership is acquired, to prevent a follower-startup deadlock.


Minor issues

9. Backup sidecar requirement needs an operator guard

§7.3 correctly says backup tooling must capture <dataDir>/encryption/keys.json. But the design has no enforcement mechanism — there is no backup validation that checks whether the sidecar is included. Consider adding an elastickv-admin backup verify --backup-dir=... command to the §6.6 admin surface, or at minimum a WARNING-level log on restore if the sidecar is absent.

10. elastickv_encryption_value_overhead_bytes leaks value sizes

§9.2 exposes a histogram of payload_size - plaintext_size. When compression is enabled this reveals plaintext sizes (since compressed+encrypted overhead reflects compressibility). This is metadata leakage and may matter for high-security deployments. Worth noting in §2.2 alongside the existing key-metadata admission, so operators can decide whether to suppress this metric.

11. 90-day rotation cadence is undocumented relative to write volume

§5.2 says rotation is "every 90 days, or on suspicion of compromise" and notes the 2⁴⁸-write nonce birthday bound. For completeness, add the back-of-envelope: at 10 k writes/s × 90 days ≈ 7.8 × 10¹⁰ writes per DEK, well under 2⁴⁸ ≈ 2.8 × 10¹⁴. The paragraph should state this bound explicitly so an operator running at higher write rates knows when to tighten the rotation cadence.


Conventions check

  • Filename 2026_04_29_proposed_data_at_rest_encryption.md ✅ (matches YYYY_MM_DD_proposed_<slug>)
  • Header Status: Proposed
  • Header Author: bootjp
  • Header Date: 2026-04-29
  • No _design suffix in slug
  • Five CLAUDE.md self-review lenses in §10
  • Open question §11.5 (Lua batching) flagged as pre-implementation gate

Summary

The document's overall architecture (boundary choice, envelope format, two-tier KEK/DEK hierarchy, Raft-coordinated rotation, compress-before-encrypt) is sound and well-motivated. The three issues that need resolving before the implementation PR opens are: (1) the version-byte collision with lsm_store.go's existing tombstone byte encoding, (2) the coordinator-vs-storage-layer double-encryption contradiction, and (3) the single authoritative envelope diagram that includes the compress flag. The "same ciphertext" framing in §1/§4.2 should also be corrected to avoid misdirecting the implementor.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0a9afe72f8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +430 to +431
envelope-version byte (`0x00` = cleartext, `0x01` = encrypted) lets
any node read either format.

P1: Stop using leading byte to detect legacy cleartext values

The migration plan assumes 0x00 = cleartext and 0x01 = encrypted is sufficient for mixed-mode reads, but pre-encryption data already on disk is arbitrary raw bytes with no version prefix. Any legacy value whose first byte is 0x01 will be misclassified as an encrypted envelope and can fail with tag/format errors, making existing data unreadable during rollout or before rewrite completes. This breaks the stated online migration safety guarantees for workloads storing binary payloads.


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request proposes a design for data-at-rest encryption in elastickv, utilizing AES-256-GCM at the storage and Raft layers with a two-tier key hierarchy (KEK/DEK). Feedback focuses on correcting the safe write limits for random 96-bit nonces to align with NIST standards, ensuring the encryption rewrite job accounts for MVCC history and zero-downtime requirements, and updating the envelope format documentation to include the compression flag in both the diagram and the AAD definition.

Comment on lines +153 to +154
up to ~2⁴⁸ writes per DEK; we rotate DEKs well before that bound
(see §5.2).
Contributor


Severity: high (security)

The claim that random 96-bit nonces are safe up to 2⁴⁸ writes per DEK is inaccurate and exceeds safe cryptographic limits. 2⁴⁸ is the birthday bound for a 96-bit space, meaning there is a ~50% probability of a nonce collision at this limit. For AES-GCM, a single nonce collision allows an attacker to recover the XOR of two plaintexts and potentially the authentication key. NIST SP 800-38D (Section 8.3) recommends a limit of 2³² invocations for a single key when using random nonces to keep the collision probability cryptographically negligible (< 2⁻³²). Given the 10k writes/s estimate in §8.1, this limit could be reached in approximately 5 days, making the 90-day rotation target potentially unsafe. Consider using a counter-based nonce or a more conservative rotation trigger based on the number of writes.

Comment on lines +311 to +315
`elastickv-admin encryption rewrite` job can sweep the keyspace
over time to retire old DEKs; until that job has completed and the
retired DEK has zero references, the operator must keep the old
DEK loaded.

Contributor


Severity: high

The design for the encryption rewrite job should clarify how it interacts with MVCC history. This presents a significant operational risk regarding DEK retirement and write amplification. As this constitutes a production-grade system migration, the plan must detail mitigation strategies, such as a temporary "bridge" or "proxy" mode, and strategies for live or zero-downtime cutovers (e.g., dual-write proxy) to avoid service interruption during the rewrite process.

References
  1. Significant operational risks and production-grade system migrations must detail mitigation strategies (like bridge/proxy modes) and zero-downtime cutover strategies.

Comment on lines +139 to +142
+--------+---------+----------+----------------+--------+
| 0x01 | key_id | nonce | ciphertext | tag |
| 1 byte | 4 bytes | 12 bytes | N bytes | 16 B |
+--------+---------+----------+----------------+--------+
Contributor


Severity: medium (security)

The envelope format diagram is missing the 1-byte "compressed?" flag described in §6.4. Additionally, the AAD definition in §4.1 (line 155) should be updated to include this flag to prevent header-rewrite attacks that could toggle the compression bit and cause decryption or decompression failures. The per-value overhead estimate in §4.1 (line 163) should also be updated from 33 to 34 bytes for consistency with the disk overhead calculation in §8.2.

Suggested change
+--------+---------+----------+----------------+--------+
| 0x01 | key_id | nonce | ciphertext | tag |
| 1 byte | 4 bytes | 12 bytes | N bytes | 16 B |
+--------+---------+----------+----------------+--------+
+--------+------------+---------+----------+----------------+--------+
| 0x01 | compressed | key_id | nonce | ciphertext | tag |
| 1 byte | 1 byte | 4 bytes | 12 bytes | N bytes | 16 B |
+--------+------------+---------+----------+----------------+--------+

@bootjp
Owner Author

bootjp commented Apr 29, 2026

@claude review

@claude

claude Bot commented Apr 29, 2026

Claude finished @bootjp's task in 3m 52s


Design Review: Data-at-Rest Encryption

  • Gather context and understand request
  • Read docs/design/README.md conventions
  • Read design document fully
  • Review cryptographic design
  • Review architecture and integration points
  • Review migration plan
  • Check open questions
  • Post comprehensive review feedback

Conventions Check ✓

Filename, header (Status/Author/Date), and lifecycle marker all conform to docs/design/README.md. The design-doc-first workflow is properly followed. No issues here.


Critical Issues

1. Migration version-byte detection is broken for pre-existing data

docs/design/2026_04_29_proposed_data_at_rest_encryption.md §7.1, §4.1 (line 145–146)

0x00 is reserved to mean "cleartext, written by a pre-encryption build"

This assumption is fundamentally incorrect. Pre-encryption values stored in Pebble are raw bytes with no version prefix at all. When the read path branches on the first byte to detect format, any existing cleartext value whose first byte happens to be 0x01 will be misclassified as an encrypted envelope and will fail to parse (wrong tag/format), making that value permanently unreadable during and after migration.

This is not a corner case: for binary workloads (images, serialized protobufs, arbitrary byte payloads) a leading 0x01 byte is common. A counter that reaches 16 777 216 (0x01000000) encoded as a big-endian int32 begins with 0x01, and Redis list elements, SQS message bodies, and S3 object bytes place no constraints on their first byte at all.

The Codex review correctly flagged this as P1. Some approaches worth considering:

  • Tag legacy values with a 0x00 prefix at upgrade time (in a single sweep before new writes begin), so all values have an explicit version byte before the ambiguous period starts. This requires the upgrade to stop writes momentarily or use a two-phase scheme.
  • Use a magic sentinel that cannot appear as the first byte of valid pre-existing values (hard to guarantee for arbitrary bytes).
  • Track migrated vs. unmigrated keys under a separate reserved Pebble key range (Pebble has no column families) instead of relying on the value's first byte.



2. AES-GCM random nonce limit stated incorrectly — 90-day rotation may be unsafe

docs/design/2026_04_29_proposed_data_at_rest_encryption.md lines 152–154, 316–319

Random 96-bit nonces are safe up to ~2⁴⁸ writes per DEK

2⁴⁸ is the birthday bound for a 96-bit space — the point at which nonce collision probability reaches ~50%. This is the danger threshold, not the safety limit. NIST SP 800-38D §8.3 recommends a maximum of 2³² random invocations per key to keep collision probability cryptographically negligible (< 2⁻³²).

At the 10k writes/s baseline given in §8.1, 2³² invocations are reached in approximately 5 days — well inside the stated 90-day rotation target. The design as written could allow a 90-day-old DEK to be used for roughly 18× more writes than NIST considers safe for random nonces.

Two remedies, pick one or combine:

  • Counter-based nonces: Use a per-DEK monotonic counter stored in the keystore (persisted through the Raft sidecar). No collision risk up to 2⁹⁶ − 1 writes per key, but the counter must be crash-safe.
  • Tighten the rotation trigger: Enforce rotation when the write counter reaches ~2³², not on a calendar interval. The keystore already needs a counter for encryption status --verify; expose it as a rotation trigger.



High-Priority Issues

3. Double-encryption ambiguity in §6.3 / §4.2

docs/design/2026_04_29_proposed_data_at_rest_encryption.md §4.2 (line 171), §6.3 (lines 363–371)

§4.2 states:

the coordinator encrypts before building the proposal [with storage DEK] … the proposal envelope itself is also wrapped using a separate raft DEK

§6.3 then states:

kv/fsm.go — unwrap on apply before dispatching to the existing storage handlers

But §6.2 says the storage layer's Put path always wraps before pebble.Batch.Set. If the FSM calls the normal Put path with the already-storage-encrypted value it recovered from the raft envelope, the storage layer will encrypt it a second time — writing enc_storage(enc_storage(plaintext)) to Pebble.

The design is ambiguous about which of these two interpretations is intended:

  • (A) Coordinator pre-encrypts with storage DEK → wraps with raft DEK. FSM decrypts only the raft layer → gets storage-ciphertext → calls a raw Pebble write that bypasses the storage-layer encryption. Requires a "pre-encrypted write" path in lsm_store.go.
  • (B) Coordinator wraps only with raft DEK (value is cleartext inside). FSM decrypts raft layer → gets cleartext → calls normal Put which applies storage DEK. Simpler, but §4.2's phrase "the coordinator encrypts before building the proposal" then needs to be removed.

Either is workable; the doc needs to commit to one path and trace it explicitly through §4.2 → §6.2 → §6.3 without contradiction.

4. Rewrite job must handle all MVCC versions, not just the latest

docs/design/2026_04_29_proposed_data_at_rest_encryption.md §5.2, §7.1

The design says the rewrite job walks the keyspace to rewrite cleartext → ciphertext and to retire old DEKs after rotation. But MVCCStore retains multiple MVCC versions of each key. The rewrite job must rewrite every version at every timestamp that still exists in the MVCC range (i.e., has not yet been compacted away by kv/compactor.go). If it rewrites only the latest version, older MVCC versions keep their old key_id, the old DEK cannot be safely retired, and snapshot reads at timestamps prior to the rewrite will still require the retired DEK.

The doc should specify how the rewrite job iterates MVCC history (likely a full Pebble scan of the MVCC keyspace, not just a logical key scan) and how the checkpoint cursor (!encryption|rewrite|cursor) encodes MVCC timestamps, not just keys.


Medium Issues

5. Envelope format diagram is missing the compress flag

docs/design/2026_04_29_proposed_data_at_rest_encryption.md lines 137–143 vs. §6.4 (lines 378–391)

The §4.1 envelope diagram:

+--------+---------+----------+----------------+--------+
| 0x01   | key_id  | nonce    | ciphertext     | tag    |
| 1 byte | 4 bytes | 12 bytes | N bytes        | 16 B   |
+--------+---------+----------+----------------+--------+

does not include the 1-byte "compressed?" flag added in §6.4. This has two downstream inconsistencies:

  • The AAD definition (AAD = key_id ‖ envelope_version) should include the compress flag to prevent a header-rewrite attack that toggles it. An attacker who flips this bit causes the decrypt path to attempt Snappy decompression on uncompressed ciphertext or vice versa.
  • The overhead estimate in §4.1 says 33 bytes; §8.2 correctly says 34 bytes. One of them is wrong.

The compress flag should either appear in the §4.1 format (preferred, since it's part of v1) or be clearly called out as a v2 extension (with §8.2 updated to match §4.1's 33-byte claim).

6. key_id mapping between sidecar and envelope is undefined

docs/design/2026_04_29_proposed_data_at_rest_encryption.md §4.1 (line 148), §5.1 (lines 277–284)

The envelope carries a 4-byte key_id (uint32). The sidecar keys.json uses string identifiers ("k1a2", "k7c9"). The mapping between the two is never defined: is key_id a hash of the string? An incrementing counter persisted in the sidecar? A fixed prefix of the string? This needs to be specified precisely, because the decrypt path must do a constant-time lookup of key_id → []byte and the sidecar's persistence format must round-trip it without collision.

7. AES-NI dependency should be explicit

docs/design/2026_04_29_proposed_data_at_rest_encryption.md §8.1

The performance claim "3–5 GB/s per core" implicitly assumes AES-NI on x86_64. Go's crypto/aes falls back to a software AES implementation on hosts without AES-NI (e.g., older ARM, some RISC-V targets) where throughput drops to ~200–400 MB/s — a 10×–15× penalty. The doc should either scope the performance section explicitly to AES-NI hosts or add a note that non-AES-NI deployment requires re-benchmarking against the ≤10% regression budget in §8.3.


Minor

8. Backup sidecar capture is advisory only

docs/design/2026_04_29_proposed_data_at_rest_encryption.md §7.3

"Backup tooling must capture the encryption/keys.json sidecar alongside the data dir" is a documentation obligation with no enforcement. A backup taken without the sidecar is silently valid (the backup completes) but leaves data permanently unrecoverable. Consider whether the elastickv-admin backup command (if it exists or will exist) should embed the sidecar contents in the backup manifest, or at minimum warn at restore time if the sidecar is missing.

9. Rewrite cursor key namespace

docs/design/2026_04_29_proposed_data_at_rest_encryption.md §7.1

The checkpoint !encryption|rewrite|cursor uses ! as a reserved-key prefix. Verify this prefix doesn't collide with any existing control keys in distribution/ or store/ (e.g., !redis|ttl|, the route catalog keys, or HLC ceiling keys). The design should cross-reference the existing reserved-key namespace.


Positive notes

  • The choice of boundary (b) over (a) (value-level vs. VFS) is well reasoned and avoids patching upstream etcd raft. The writeup in §3 is thorough.
  • Encrypting the entire Raft proposal payload (§4.2) to protect lookup keys in the WAL is a nice addition beyond the baseline threat model.
  • DEK rotation via Raft (§5.2) ensures all replicas see the new key at the same log index — no split-brain key state possible.
  • atomic.Pointer for the keystore hot path (§10.2) is exactly right.
  • The five-lens self-review (§10) is present and substantive.
  • Open question §11.5 on Lua batch encoding is the right thing to flag before implementation.
  • Startup refusal when encryption state mismatches the sidecar (§9.1) prevents silent security downgrade.

Summary

The overall architecture is sound. The two most important things to resolve before the implementation PR is opened:

  1. The version-byte migration strategy — the 0x00/0x01 first-byte branch doesn't work for pre-existing untagged data. This needs a rethink of how format detection works during the mixed-mode window.
  2. The nonce-limit claim — the stated 90-day rotation target is not safe under NIST guidance for random 96-bit nonces at the written write rate. Either use counter-based nonces or tie rotation to a write-count trigger.

The double-encryption ambiguity in §6.3 also needs to be resolved before implementation begins to avoid a correctness bug that would only surface under integration testing.

P1 (codex): replace value-byte cleartext detection with a per-version encryption_state bit in MVCC metadata, so a legacy value whose first byte happens to equal the envelope version cannot be misclassified as ciphertext.

High (gemini): retract the 2^48 random-nonce claim (birthday bound, not safe). Switch to per-DEK counter-based nonces and cap each (DEK, process-load) at 2^32 writes per NIST SP 800-38D §8.3.

High (gemini): add §5.4 detailing how the rewrite job iterates over MVCC versions, rewrites in place at the same commit_ts, and what dual-condition criterion (cursor done + minRetainedTS advanced) gates DEK retirement.

Medium (gemini): add the compressed flag byte to the envelope diagram, include it in the AAD to defeat header-rewrite attacks against the decompressor, and bump per-value overhead from 33 to 34 bytes.

bootjp commented Apr 29, 2026

@claude review


claude Bot commented Apr 29, 2026

Claude finished @bootjp's task in 5m 50s


Review: docs: propose data-at-rest encryption design

  • Gather context and understand request
  • Read design document (current state after both commits)
  • Check what commit 79d446f changed vs original
  • Review cryptographic design
  • Review architecture and integration points
  • Synthesize all feedback and post review

What commit 79d446f resolved

The follow-up commit addressed four issues raised in the prior reviews:

Issue → Resolution
  • P1 (Codex): version-byte cleartext detection → replaced with per-version encryption_state bit in MVCC metadata (§7.1) — correct fix
  • High: 2⁴⁸ random-nonce claim → retracted; now uses counter-based nonces with 2³² write ceiling per NIST SP 800-38D §8.3 (§4.1, §5.2)
  • High: rewrite job must cover MVCC history → added §5.4 with full iteration/retirement/crash-safety protocol
  • Medium: compress flag missing from envelope → added to diagram, AAD, and overhead count (§4.1) — 34 bytes now consistent everywhere except §8.2 (see below)

These were the right fixes. The remaining issues below are what the second commit did not address.


Critical (still blocking implementation)

1. Double-encryption is still designed in — §4.2 contradicts §6.2/6.3

The data-flow has three statements that, taken together, produce AES-256-GCM applied twice to every value:

  • §4.2: "the coordinator encrypts before building the proposal" — coordinator applies §4.1 (storage DEK) to the value before Raft.
  • §6.2: "Put paths wrap before pebble.Batch.Set" — the storage layer always encrypts on write.
  • §6.3: "FSM unwraps on apply before dispatching to the existing storage handlers" — FSM calls the standard lsm_store.PutAt, which invokes the §6.2 hook.

Follow the path: coordinator produces enc_storage(plaintext) → raft proposal carries enc_raft(enc_storage(plaintext)) → FSM decrypts raft layer → gets enc_storage(plaintext) → calls lsm_store.PutAt → storage hook encrypts again → Pebble stores enc_storage(enc_storage(plaintext)). That is double encryption.

§3(b) promises "no double-encryption", which is the correct design goal. The document needs to commit to one of:

  • Option A (storage-layer-only, simpler): Coordinator does NOT pre-encrypt. It only wraps the full proposal with the raft DEK. On FSM apply, after unwrapping the raft DEK, the value is cleartext and the storage layer applies the storage DEK normally. §4.2 then needs to be rewritten: the coordinator wraps key + op + cleartext-value with the raft DEK, not pre-encrypted value.
  • Option B (coordinator pre-encrypts): Coordinator applies §4.1 (storage DEK) before proposing. FSM must call a "raw Pebble write" path in lsm_store.go that bypasses the §6.2 encryption hook — explicitly specified as a new PutEncryptedAt or similar. §6.2 then needs a note that this bypass path exists and must be used from the FSM.

Neither option is wrong; Option A is significantly simpler. The document must pick one and remove the text that implies the other.



High (must resolve before implementation)

2. key_id type is undefined — string in sidecar, uint32 in envelope

The envelope carries a 4-byte key_id (uint32). The sidecar keys.json uses string identifiers ("k1a2", "k7c9"). The mapping between the two is nowhere defined. The decrypt path must do a constant-time lookup of key_id → []byte DEK, so the implementation must pick one of: hash of string, sequential counter persisted in the sidecar, first/last 4 bytes of a hex string, etc.

This is a wire-format decision. An implementation that chooses wrong and then rotates will permanently fail to decrypt old values. Specify the exact encoding and round-trip behavior.


3. encryption_state bit placement in the existing 9-byte header is underspecified

§7.1 says "We add a 2-bit encryption_state field to that header." The current lsm_store.go header is exactly 9 bytes: [tombstone(1)][expireAt(8)], where tombstone is decoded as data[0] != 0.

The design does not say which bytes/bits the field occupies. The two natural choices have different compat implications:

  • Pack into byte 0 (spare bits): Set bit 1 of the tombstone byte to indicate encryption. Old decodeValue reads tombstone := data[0] != 0 — any non-zero encryption state in bits 1–7 is misread as a tombstone, silently losing the live value. This is the "right direction" for forward compat (old code errors rather than serving cleartext-from-ciphertext), but the error mode is silent wrong-answer, not a clear refused-to-start.
  • Extend header to 10 bytes: Add a dedicated encryption byte. Old decodeValue mis-slices expireAt (reads 8 bytes starting at the new encryption byte, getting a wrong expiry) and misinterprets the value start. Equally confusing.

§9.1's startup refusal ("process refuses to start if sidecar present but encryption disabled") prevents the above in normal operation, but the design should still document the exact bit layout with a precise valueHeaderSize change (or confirmation it stays 9) and note why old code can't serve encrypted data even if somehow bypassing the startup check.
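A sketch of the byte-0 packing under the first choice (the bit assignments are an assumption pending the design pinning them; `valueHeaderSize` would stay 9):

```go
package main

import "fmt"

const (
	tombstoneBit  = 0x01 // bit 0, decoded today as data[0] != 0
	encStateShift = 1
	encStateMask  = 0x06 // bits 1-2 carry the 2-bit encryption_state
)

// encodeHeaderByte packs tombstone and encryption_state into byte 0
// of the existing 9-byte value header.
func encodeHeaderByte(tombstone bool, encState byte) byte {
	b := (encState << encStateShift) & encStateMask
	if tombstone {
		b |= tombstoneBit
	}
	return b
}

func decodeHeaderByte(b byte) (tombstone bool, encState byte) {
	return b&tombstoneBit != 0, (b & encStateMask) >> encStateShift
}

func main() {
	b := encodeHeaderByte(false, 1) // live, encrypted value
	fmt.Printf("byte0 = 0b%08b\n", b)
	// An old build's `data[0] != 0` check reads any non-zero byte as a
	// tombstone, so it never serves ciphertext as cleartext — but the
	// failure mode is key-not-found, not an explicit error.
	fmt.Println("old build sees tombstone:", b != 0)
}
```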



Medium (address before or alongside implementation)

4. Sidecar crash safety during DEK rotation is uncovered

§5.4.6 covers rewrite cursor crash safety ("cursor updated in same Pebble batch as rewritten value") — that's good. But the sidecar (keys.json) write during DEK rotation has no equivalent guarantee. The rotation Raft entry commits, the FSM applies it (updating in-memory keystore), then writes keys.json to disk. A crash between the apply and the fsync leaves the sidecar stale after restart.

On restart, §9.1 checks "wrapped DEKs unwrap under configured KEK" — a stale sidecar (missing the new DEK) would pass this check and silently boot with the old DEK as active, while Pebble already holds values encrypted under the new DEK. Those values would fail to decrypt.

Specify the recovery path: the FSM should also persist a "pending rotation" marker in Pebble under the same batch as any in-memory DEK state change, so on restart it can detect the sidecar-vs-log divergence and re-derive the sidecar from the Pebble marker before opening.

5. --encryption-rotate-on-startup follower behavior undefined

§6.5 documents the flag as "Force a DEK rotation at boot (used by ops scripts; not for normal restarts)" but doesn't say what happens when the booting node starts as a follower (the common case in a rolling restart). A follower cannot propose to Raft; if it tries, the proposal is rejected or queued indefinitely.

Add one sentence: "Rotation is attempted only after the node successfully wins a leadership election; on followers this flag is a no-op until the node becomes leader, at which point it proposes the rotation automatically."


Minor

6. §3(b) "single ciphertext object" claim is still inaccurate

§3(b) Pro: "Single ciphertext object — no double-encryption — across every persistent surface (Raft WAL, Pebble SST, FSM snapshot, snapshot streaming)."

This is not accurate even if issue #1 is resolved as Option A. The Raft WAL carries raft_DEK(key + op + value) while Pebble SSTs carry storage_DEK(value) — two different ciphertexts of different plaintexts. The true property is: no persistent surface holds cleartext user values. "Single ciphertext object" misdirects an implementor into believing the same bytes flow through all surfaces.

7. §8.2 overhead breakdown inconsistency

§8.2: "33 bytes (envelope) + up to 1 byte (compress flag) = 34 bytes per stored value"

But the flag byte is already inside the envelope (the §4.1 diagram shows it as the second field, and §4.1 says "Per-value overhead is 34 bytes (1 version + 1 flag + 4 key_id + 12 nonce + 16 tag)"). The breakdown in §8.2 should read "34-byte envelope", not "33 + 1".

8. AES-NI assumption in §8.1 is implicit

"AES-256-GCM with AES-NI runs at 3–5 GB/s per core" is correct for x86_64 with hardware acceleration. Go's crypto/aes falls back to a pure-software implementation on non-AES-NI targets (older ARM, RISC-V) where throughput is ~200–400 MB/s — a 10–15× reduction that could push encryption CPU well above the stated 10% regression budget in §8.3. Add: "The performance numbers in this section assume AES-NI. Operators targeting non-AES-NI hardware must re-run the §8.3 benchmarks against the ≤10% regression budget."

9. Verified OK: reserved-key namespace

!encryption| does not collide with existing reserved prefixes (!redis|, !sqs|, !dist| — confirmed in store/write_conflict_counter.go, kv/shard_key.go, distribution/catalog.go).


Conventions check

  • Filename: 2026_04_29_proposed_data_at_rest_encryption.md
  • Header: Status/Author/Date
  • Five CLAUDE.md self-review lenses in §10
  • §10 lens 1 updated to reference MVCC-bit approach
  • Open question §11.5 (Lua batching) flagged
  • §11.6 (TTL expiry visibility) cross-referenced to §2.2

Summary

The second commit closed the four most-cited issues across all prior reviews. Two blockers remain before the implementation PR opens:

  1. §4.2 + §6.2/6.3 double-encryption — the design still implies the storage DEK is applied twice per write. Pick Option A (coordinator wraps proposal with raft DEK only; storage layer applies storage DEK on FSM apply) and rewrite §4.2 accordingly. This is also the change that makes §3(b)'s "no double-encryption" claim accurate.
  2. key_id type mismatch — the uint32 in the envelope must be precisely mapped to the string identifier in keys.json. Define it (recommended: sequential uint32 counter persisted in the sidecar) and update the JSON example.

The encryption_state bit placement (#3) should also be pinned before implementation begins since it determines whether valueHeaderSize changes and how backward compatibility is structured.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 79d446f338


Comment on lines +559 to +561
Restart is rolling; mixed-mode clusters are supported because
each MVCC version carries its own `encryption_state` bit and
reads dispatch on that bit, not on the value bytes.


P1: Gate Raft payload encryption on full-cluster rollout

The rollout step says mixed-mode restarts are safe because reads dispatch on MVCC encryption_state, but §4.2 also encrypts the entire Raft Data[] payload; a follower that has not yet been restarted with KEK/DEK material cannot decode committed entries and can fail apply during the rolling window. This makes the documented "restart one node at a time" migration unsafe unless proposal-envelope encryption is deferred until all replicas are encryption-capable (or a compatibility mode is defined).


Comment on lines +159 to +162
The counter's high 32 bits are a per-process random prefix
(re-drawn whenever the DEK is loaded, so two processes that share
a DEK never collide), and the low 64 bits are an `atomic.Uint64`
incremented per write. This makes nonce reuse impossible within


P1: Replace random 32-bit nonce prefix with collision-free source

This claims two processes sharing a DEK "never collide," but a 32-bit random prefix can collide across nodes/restarts; if it does, both processes generate the same counter sequence and reuse (key, nonce) pairs. For AES-GCM, even a single nonce reuse under the same key is a catastrophic integrity/confidentiality failure, so this construction needs a deterministic uniqueness mechanism (or a much stronger persisted nonce allocator) instead of probabilistic 32-bit separation.


Comment on lines +437 to +438
- `sidecar.go` — atomic read/write of `keys.json` (write to
`keys.json.tmp` + `os.Rename`).


P2: Make key sidecar updates crash-durable, not just atomic

The sidecar write plan uses keys.json.tmp + os.Rename, which is atomic for visibility but not sufficient for crash durability by itself; a power loss after Raft commits a rotation but before the new file/dir metadata is durably synced can roll back keys.json and strand ciphertext written under the new DEK. Because losing wrapped DEK metadata is unrecoverable for encrypted data, the design should require fsync of file and parent directory (or equivalent durable commit semantics) before acknowledging rotation success.


Critical: Resolve §4.2 vs §6.2 double-encryption ambiguity by committing to storage-layer-only value encryption; the coordinator only wraps the proposal Data with the raft DEK and the FSM hands cleartext to the storage layer, which is the sole encryption site.

Critical: Tie the per-version encryption_state bit to lsm_store.go encodeValue header. The leading byte now encodes tombstone (bit 0) plus encryption-state (bits 1-2). Old builds reading an encrypted value see an unrecognised tombstone byte and refuse to serve, never returning ciphertext as cleartext. valueHeaderSize stays 9.

High: Replace the misleading "same ciphertext flows through Raft -> WAL -> Pebble" framing in §1 and §3 with the correct invariant -- each persistent surface holds ciphertext but Raft and Pebble use different DEKs, so the bytes differ.

High: New §5.5 sidecar / Raft-log reconciliation -- raft_applied_index in keys.json, ErrSidecarBehindRaftLog refusal at startup, automatic leader rewrap + manual follower resync-sidecar repair.

Medium: Define key_id as a 32-bit CSPRNG-drawn integer with the sidecar JSON keying on the decimal string form; on-disk envelope and in-memory keystore stay binary uint32.

Medium: Specify --encryption-rotate-on-startup behaviour on followers -- request is queued in memory and auto-fires only if this node becomes leader within the same uptime window; safe to set across a rolling restart.

Medium: §5.4 step 7 covers rewrite-cursor encryption bootstrapping; step 8 documents the !encryption| reserved-key namespace collision check; §2.2 adds explicit call-outs for commit_ts visibility and TTL/tombstone byte being cleartext by design.

Medium: §8.1 splits the AES-NI fast path from the software-fallback path with the 10-15x penalty so non-AES-NI deployments cannot quietly miss the §8.3 throughput gate.

Minor: Add encryption retire-dek, resync-sidecar, and backup verify admin commands; new metrics elastickv_encryption_writes_per_dek and elastickv_encryption_sidecar_raft_index; flag size-leak risk on the overhead histogram. §10 test list updated for the new regression and reconciliation cases.

bootjp commented Apr 29, 2026

@claude review


claude Bot commented Apr 29, 2026

Claude finished @bootjp's task in 5m 56s


Review: docs: propose data-at-rest encryption design (Round 3)

  • Gather context and understand request
  • Read current design document (post round-2 updates)
  • Check git diff and commit messages
  • Review previous feedback resolution
  • Post comprehensive round-3 review

The two rounds of fixes have resolved nearly every P1/Critical and most High/Medium items. The document is in much better shape. Two issues from prior rounds remain open: one P1 (raft proposal encryption vs. rolling restart) and one High (nonce prefix collision claim). Three medium/minor items are also noted below.


Previously raised — resolved ✅

All of the following were addressed in commits 79d446f and ff6a91a:

  • P1: version-byte migration detection — replaced with per-version MVCC encryption_state bit ✅
  • Critical: double-encryption ambiguity — storage layer is now the sole encryption site ✅
  • Critical: "same ciphertext flows through Raft→WAL→Pebble" framing — corrected in §4.2 ✅
  • High: random 96-bit nonce 2⁴⁸ claim — retracted; switched to counter-based nonces + 2³² ceiling per NIST SP 800-38D ✅
  • High: MVCC history rewrite — §5.4 fully specifies the per-(key, commit_ts) iteration ✅
  • Medium: compress flag missing from envelope diagram and AAD — §4.1 now has flag byte in the diagram and AAD covers envelope_version ‖ flag ‖ key_id
  • Medium: overhead inconsistency (33 vs 34 bytes) — §4.1 and §8.2 agree at 34 bytes ✅
  • Medium: key_id mapping undefined — §5.1 specifies a CSPRNG uint32, sidecar uses decimal string form ✅
  • Medium: AES-NI assumption — §8.1 now documents the 10–15× software-fallback penalty ✅
  • Medium: --encryption-rotate-on-startup on followers — §6.5 specifies the deferred-until-leader behaviour ✅
  • Medium: sidecar/Raft-log divergence — §5.5 covers detection + recovery ✅
  • Medium: rewrite cursor bootstrapping — §5.4 step 7 ✅
  • Minor: reserved-key namespace collision check — §5.4 step 8 ✅
  • Minor: backup sidecar enforcement — elastickv-admin backup verify in §6.6 and §7.3 ✅
  • Minor: elastickv_encryption_value_overhead_bytes size leak — §9.2 flagged for operators ✅
  • Minor: MVCC write-timestamp / TTL cleartext exposure — §2.2 ✅

Outstanding issues

1. Raft proposal encryption breaks the "rolling restart" claim (P1 — unresolved from Codex round 2)

docs/design/2026_04_29_proposed_data_at_rest_encryption.md §7.1 (line 725–728) vs. §4.2

§7.1 step 2 says:

Restart is rolling; mixed-mode clusters are supported because each MVCC version carries its own encryption_state bit and reads dispatch on that bit, not on the value bytes.

But the encryption_state bit only governs Pebble value reads. §4.2 also wraps the entire Raft Data []byte — operation tag, lookup key, and cleartext value — with the raft DEK. A follower that has not yet been restarted with --encryption-enabled has no keystore; it cannot unwrap the raft envelope, so kv/fsm.go::Apply will fail on every entry proposed after the first upgraded node becomes leader.

The raft-level failure does not have a per-version fallback — either the follower can decrypt the entry or the apply fails. The "mixed-mode" window breaks the moment the leader starts encrypting proposals, which is the first write after the leader restarts.

Two viable resolutions, pick one:

Option A (staged cutover): Disable raft proposal encryption during the rolling restart window. Run the cluster fully upgraded with value encryption enabled (encryption_state = 0b01) but with raft DEK wrapping deferred until a leader observes a quorum of peers advertising encryption capability (via a Raft membership extension or a new cluster health check). A single elastickv-admin encryption enable-raft-proposals command triggers the raft-level cutover once the cluster is fully upgraded.

Option B (envelope-version negotiation in the FSM): The raft Data []byte carries a one-byte prefix that indicates cleartext (0x00) or raft-envelope-encrypted (0x01). Old FSM builds process only 0x00; new builds process both. This lets the leader continue producing unencrypted raft entries until all followers confirm upgrade, at which point the leader shifts to 0x01 entries. More complex but truly rolling-restart-safe.

Until one option is chosen and documented, §7.1's "rolling restart is safe" claim is incorrect.



2. Nonce prefix collision probability is non-zero, not "never" (High — partially resolved from Codex round 2)

docs/design/2026_04_29_proposed_data_at_rest_encryption.md lines 177–181

The counter's high 32 bits are a per-process random prefix (re-drawn whenever the DEK is loaded, so two processes that share a DEK never collide)

"Never" is incorrect. Two process loads of the same DEK each draw an independent 32-bit prefix uniformly at random. The probability they match is 1/2³² ≈ 2.3 × 10⁻¹⁰ per pair. If they do match, both counters reset to 0 on load, so nonces 0, 1, 2, … from node A alias exactly to nonces 0, 1, 2, … from node B — every write from both nodes reuses a nonce for the lifetime of the process.

Practical exposure: a 3-node cluster with a 90-day DEK lifetime and weekly restarts accumulates ~39 independent prefix draws. C(39,2) = 741 pairs each with 1/2³² collision probability gives a cumulative tail of ~1.7 × 10⁻⁷ per DEK rotation period. For AES-256-GCM, a single nonce collision breaks confidentiality (plaintext XOR recovery) and integrity (authentication key leakage). The consequence is disproportionate to the probability.

Two fixes:

  • 64-bit prefix (minimal change): Replace the 32-bit random prefix with a 64-bit draw. Collision probability per pair drops to ~5 × 10⁻²⁰ — negligible for any realistic deployment. The 96-bit nonce then splits 64 (prefix) ‖ 32 (counter per process load).
  • Persisted counter (stronger): Persist the counter state in the sidecar, so the counter is monotonically increasing across restarts for a given DEK and no prefix is needed. Sidecar writes are already crash-safe via §5.5; the counter would increment by 1 and sync every N writes (or at shutdown). Requires a sidecar write per N encrypt operations rather than per rotation.

The document should either adopt one of these or weaken "never" to "with negligible probability (1/2³² per pair of process loads)" and document the cumulative risk.



Medium issues

3. Sidecar fsync not specified in §6.1 (partially resolved from Codex P2 round 2)

docs/design/2026_04_29_proposed_data_at_rest_encryption.md §6.1

sidecar.go is described as "atomic read/write of keys.json (write to keys.json.tmp + os.Rename)." os.Rename provides atomicity (the old or new file is visible, never a partial write) but not crash durability — without an fsync of keys.json.tmp before the rename, and an fsync of the parent directory after, a kernel crash can silently discard the new file's contents on some filesystems (ext4 with data=writeback, default XFS).

The §5.5 reconciliation mechanism detects and recovers from this (if keys.json reverts to the pre-rotation content after a crash, raft_applied_index will lag the raft engine's applied index and ErrSidecarBehindRaftLog fires). However, the implementation spec should state this explicitly so an implementor doesn't omit the fsync:

sidecar.go — write to keys.json.tmp, fsync(keys.json.tmp), os.Rename, fsync(dir). §5.5 detects the residual crash window; the fsync minimises how often it fires.

Without this, an implementor reading §6.1 in isolation may produce a sidecar writer that triggers §5.5 far more often than necessary.

4. "Refuses to serve" vs. "silent tombstone" for old builds reading encrypted values

docs/design/2026_04_29_proposed_data_at_rest_encryption.md §6.2 (lines 584–586)

an old build reading a value written by an encryption-enabled build sees an unrecognised tombstone byte and refuses to serve it rather than returning ciphertext as cleartext.

The phrasing "refuses to serve" implies an error return. The actual behaviour in decodeValue (line 395, store/lsm_store.go):

tombstone := data[0] != 0

An encrypted non-tombstone value has byte[0] = 0b00000010 (enc_state bits set, tombstone bit clear). 0b00000010 != 0 is true, so tombstone = true. The old build silently returns "key deleted / not found" — it does not return an error, and it does not log anything unexpected. This is safe (no ciphertext is returned as cleartext), but it is not an error. During a rolling restart, users hitting old nodes will see their writes disappear rather than receiving an explicit error.

The design should state: "an old build reading a value written by an encryption-enabled build will return a tombstone (key-not-found) result for that version, not an error." If the goal is to surface an explicit error instead, decodeValue would need to distinguish 0x01 (known tombstone) from other non-zero leading bytes (unknown → error) — which requires a code change beyond just bit-field remapping.
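If the explicit-error route is preferred, the stricter decode could look like the following sketch (a hypothetical helper, not existing lsm_store.go code):

```go
package main

import "fmt"

// decodeTombstoneStrict replaces the permissive `data[0] != 0` check:
// only the two known header bytes are accepted; anything else returns
// an error instead of being silently treated as a tombstone.
func decodeTombstoneStrict(b byte) (bool, error) {
	switch b {
	case 0x00:
		return false, nil // live cleartext value
	case 0x01:
		return true, nil // known tombstone
	default:
		return false, fmt.Errorf("unknown value header byte %#02x", b)
	}
}

func main() {
	_, err := decodeTombstoneStrict(0b00000010) // encrypted live value
	fmt.Println(err != nil) // true: old build errors instead of hiding the key
}
```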


Minor issues

5. ELASTICKV_KEK_BASE64: should mention os.Unsetenv after reading

docs/design/2026_04_29_proposed_data_at_rest_encryption.md §5.1

The env var is correctly flagged as "strongly discouraged" due to leakage via /proc/<pid>/environ. The implementation should call os.Unsetenv("ELASTICKV_KEK_BASE64") immediately after reading and decoding the value, to narrow the window during which the raw key material appears in the process environment. Worth one sentence in §6.1 or §5.1.

6. encryption_state bit location terminology

docs/design/2026_04_29_proposed_data_at_rest_encryption.md §7.1 ("Per-version encryption_state bit in MVCC metadata")

The design places encryption_state in "MVCC metadata" — but the actual location is byte 0 of the Pebble value (the lsm_store.go value header), not a separate MVCC header structure. commit_ts lives in the Pebble key (via encodeKey(key, commitTS)). These are distinct structures; calling them collectively "MVCC metadata" is loose but workable. The implementation PR should add a comment in encodeValue / decodeValue that pins the bit assignments (tombstone = bit 0, enc_state = bits 1–2) to prevent accidental clobbering by future bit additions.


Summary

All P1/Critical items from rounds 1 and 2 have been addressed except one: the raft proposal encryption vs. rolling restart incompatibility (#1 above). That issue should be resolved before the implementation PR opens, because the chosen resolution (staged cutover vs. FSM envelope-version negotiation) materially affects how §4.2, §6.3, and §7.1 are implemented.

The nonce prefix "never collide" claim (#2) should also be corrected — widening to a 64-bit prefix is a one-line change in the design and a trivial implementation delta.

Everything else is either minor polish or implementation-spec clarifications that can land in the implementation PR's own design update.

…erministic nonce, sidecar fsync)

P1 (codex line 728): §4.2 raft envelope rollout was unsafe in a rolling restart -- a non-upgraded follower without KEK cannot decrypt entries proposed by an upgraded leader, breaking deterministic apply. Split rollout into Phase 1 (storage envelope only, raft Data stays cleartext) and Phase 2 (enable-raft-envelope admin command, gated on every voting member reporting encryption_capable). Phase 2 is one-way because the WAL would interleave cleartext and raft-envelope entries.

P1 (codex line 179): 32-bit random nonce prefix had a non-zero birthday-collision probability across process loads (~1e-7 over multi-year DEK lifetime) and a single AES-GCM nonce reuse is catastrophic. Replace with deterministic node_id (16b) || local_epoch (16b) || write_count (64b). local_epoch is persisted in the sidecar and fsync-bumped before any encryption, removing collision probability entirely.

P2 (codex line 561): os.Rename was atomic for visibility but not crash-durable. Specify the fsync-file -> rename -> fsync-parent-dir protocol; refuse to start on filesystems that cannot guarantee dir.Sync semantics (NFS, some FUSE).

Self-review §10 lens 1 + lens 2 updated to call out sidecar non-durability and nonce reuse as data-loss-class hard preconditions, and the membership-snapshot check as the load-bearing piece for Phase 2 safety. §9.1 startup refusal list extended with ErrUnsupportedFilesystem and missing node_id. §6.6 admin commands gain enable-raft-envelope.

bootjp commented Apr 29, 2026

@claude review


claude Bot commented Apr 29, 2026

Claude finished @bootjp's task in 7m 23s


Review: docs: propose data-at-rest encryption design (Round 4)

  • Gather context and understand request
  • Read current design document (post round-3 updates)
  • Review what commit 1b8613f resolved vs. outstanding issues
  • Check remaining blockers from prior rounds
  • Review cryptographic design and new nonce strategy
  • Review 2-phase rollout plan
  • Review sidecar fsync changes
  • Post comprehensive round-4 review feedback

What commit 1b8613f resolved

Round 3 addressed every outstanding P1 and P2 from earlier rounds. The resolutions are solid:

Issue → Resolution
  • Double-encryption contradiction (§4.2 vs §6.2/6.3) → §4.2 now explicitly states: "The coordinator does NOT pre-encrypt values before building a proposal, and the FSM does NOT call a 'raw' Pebble write."
  • key_id uint32 vs. string mismatch → §5.1 specifies CSPRNG-drawn uint32, decimal string in JSON, binary uint32 in envelope + strconv.ParseUint decode. ✅
  • encryption_state bit placement unspecified → §6.2: bits 1–2 of byte 0, valueHeaderSize stays 9. ✅
  • Compress flag missing from envelope / AAD → §4.1 diagram includes flag, AAD binds envelope_version ‖ flag ‖ key_id, overhead is 34 B everywhere. ✅
  • Random nonce birthday bound (2⁴⁸ was wrong) → replaced with deterministic node_id ‖ local_epoch ‖ write_count; zero collision probability under stated preconditions. ✅
  • Sidecar fsync non-durable → §5.1: write-tmp → fsync-file → rename → fsync-parent → ack; ErrUnsupportedFilesystem on NFS/FUSE. ✅
  • 32-bit random nonce-prefix birthday risk → retracted; same deterministic nonce fix above. ✅
  • Raft rolling-restart safety (pre-Phase-2 followers) → two-phase rollout in §7.1: Phase 1 keeps Raft proposals cleartext; Phase 2 gated on encryption_capable membership check. ✅

Critical (must fix before implementation PR opens)

1. Phase 2 discriminator byte collides with existing FSM tag bytes

docs/design/2026_04_29_proposed_data_at_rest_encryption.md §7.1

The design says replicas "dispatch on a 1-byte format tag at the start of the Data payload — 0x00 = cleartext, 0x01 = raft envelope." But kv/fsm.go already uses that exact byte for the existing operation encoding:

// kv/fsm.go lines 110–116
const (
    raftEncodeSingle byte = 0x00  // single pb.Request
    raftEncodeBatch  byte = 0x01  // pb.RaftCommand with multiple requests
    raftEncodeHLCLease byte = 0x02 // HLC ceiling-only entry
)

Under the proposed scheme, every raftEncodeBatch proposal (0x01) — used by Lua commit batching, XADD, HSET, and any multi-op transaction — is misidentified as a raft envelope under Phase 2. The FSM attempts AES-GCM decryption on a protobuf-encoded pb.RaftCommand, gets a tag mismatch, and returns an apply error. The cluster stops making progress as soon as the first batch write arrives after enable-raft-envelope is applied. Every raftEncodeHLCLease (0x02) entry would similarly be misrouted.

The fix is straightforward: reserve a new unused tag byte, e.g. raftEncodeEnveloped byte = 0x03, for the encrypted wrapper. The Phase 2 FSM path becomes:

Apply(data):
  if data[0] == raftEncodeEnveloped:   // new — §4.2 envelope
    data = raftDEK.Unwrap(data[1:])    // strip envelope, recover inner bytes
  // existing dispatch (unchanged):
  if data[0] == raftEncodeHLCLease: → applyHLCLease
  else: decodeRaftRequests (0x00 / 0x01 / future tags)

This also correctly handles the Phase 2 ordering requirement: the raft-envelope unwrap happens before the existing HLC-lease and request-decode checks, so HLC ceiling proposals (§4.5) are also transparently decrypted.

Update §7.1 to replace "0x00 = cleartext, 0x01 = raft envelope" with the raftEncodeEnveloped = 0x03 scheme and cross-reference kv/fsm.go so the implementor knows exactly which file gains the new constant.



High (must resolve before implementation)

2. local_epoch overflow has no hard block — silent nonce collision risk

docs/design/2026_04_29_proposed_data_at_rest_encryption.md §4.1, §9.1

The design correctly warns at local_epoch > 0xff00 (256 restarts before wrap) but does not specify what happens when local_epoch reaches 0xFFFF and would overflow to 0x0000. If the operator ignores all 256 warnings and the epoch wraps:

  • The next process load uses node_id ‖ 0 ‖ 0 for its first write
  • The DEK's very first ever write (at epoch 0, write_count 0 after creation) used the same nonce
  • AES-GCM nonce collision → GHASH authentication-key recovery (forgery) + leak of the XOR of the two plaintexts

The deterministic nonce's safety guarantee ("zero collision probability under its preconditions") is only valid when wrapping is prevented. Add ErrLocalEpochExhausted to §9.1's startup refusal list:

"The local_epoch field for any active DEK in the sidecar has reached 0xFFFF. The process refuses to start until the operator rotates the DEK, which resets the epoch to 0 under a fresh key."


3. 16-bit node_id truncation from DeriveNodeID (uint64) is unspecified

docs/design/2026_04_29_proposed_data_at_rest_encryption.md §4.1

The design describes node_id as "the 16-bit Raft member ID assigned at cluster bootstrap." In the actual codebase, internal/raftengine/etcd/peers.go:DeriveNodeID returns a uint64 FNV-64a hash of the --raftId string (e.g., "n1", "n2"). The nonce field allocates only 2 bytes for node_id, so the implementation must truncate the 64-bit hash.

The design must specify:

  1. The truncation rule (e.g., "lower 16 bits of DeriveNodeID(--raftId)").
  2. A startup collision check: on every boot, the encryption package reads the current Raft membership snapshot and asserts that no two voting members share the same lower-16-bit value of their derived node_id. This check is cheap (at most dozens of comparisons) and prevents a silent nonce-space collision in a cluster with unlucky ID strings.

For elastickv's current ID scheme ("n1", "n2", "n3" → well-distributed FNV hashes), collisions in the lower 16 bits are extremely unlikely, but the requirement must be stated and enforced rather than assumed.



Medium (address before or alongside implementation)

4. raft_applied_index write timing is contradictory between §5.1 and §5.5

docs/design/2026_04_29_proposed_data_at_rest_encryption.md §5.1 step 5, §5.5

§5.1 step 5 says: "update raft_applied_index in memory; that value is then persisted on the next sidecar write."

§5.5 says: "The sidecar carries raft_applied_index, updated in the same os.WriteFile + os.Rename that persists the new wrapped DEKs."

These are contradictory. If step 5's reading is correct (index deferred to the next write), then after a rotation the on-disk sidecar has the new DEK but an old raft_applied_index. On restart, §5.5's check sees a gap between the raftengine's persisted applied index and the sidecar's index that covers a rotation entry, and fires ErrSidecarBehindRaftLog — even though the sidecar is fully consistent. This would be a spurious startup-failure loop for any correctly functioning node that crashed between steps 4 and 5.

The correct behavior (matching §5.5) is that the sidecar written in steps 1–4 includes the current rotation's raft_applied_index. Step 5's "update in memory" should read: "update the in-memory raft_applied_index to the now-confirmed-durable value; the sidecar already reflects it." Reword step 5 to eliminate the ambiguity.

5. encryption_capable heartbeat field is underspecified

docs/design/2026_04_29_proposed_data_at_rest_encryption.md §7.1 Phase 2

The enable-raft-envelope gate requires "every voting member of every Raft group has reported encryption_capable = true in its periodic heartbeat metadata." This introduces a new field to the heartbeat protocol but the design doesn't specify:

  • Which gRPC message or Raft protocol extension carries the field.
  • How staleness is bounded — a heartbeat from 30 minutes ago is not a safe signal.
  • Whether enable-raft-envelope forces a fresh membership poll (blocking) or reads cached heartbeat state.

Without this, an operator could issue enable-raft-envelope immediately after a rolling restart before all heartbeats have refreshed, and the check could pass on stale state recorded before each member's encryption_capable false → true transition. Add a paragraph to §7.1 specifying the heartbeat path (e.g., piggyback on the existing peer metadata extension in internal/raftengine/etcd/peer_metadata.go) and a maximum staleness bound (e.g., "the command polls until a fresh heartbeat is received from every member within one election timeout, or returns an error if any member is unreachable").


Minor

6. Phase 2 FSM dispatch ordering for HLC lease entries should be explicit

docs/design/2026_04_29_proposed_data_at_rest_encryption.md §4.5, §6.3

§4.5 correctly notes that HLC ceiling proposals get the §4.2 raft envelope "for free." kv/fsm.go::Apply currently checks for HLC lease entries (data[0] == raftEncodeHLCLease) before decodeRaftRequests. With the fix from issue #1 above (raftEncodeEnveloped = 0x03), the correct Phase 2 dispatch order is:

1. if data[0] == 0x03 → unwrap raft envelope → data = inner bytes
2. if data[0] == 0x02 → applyHLCLease (HLC ceiling entry)
3. else              → decodeRaftRequests (0x00 single / 0x01 batch)

§6.3 says "unwrap the raft envelope on apply, recover the cleartext operation, and dispatch to the existing storage handlers" — this correctly implies the unwrap happens first. But since the HLC lease path is a special case that bypasses the storage layer, it should be called out explicitly so the implementor doesn't accidentally place the HLC check before the envelope check.


Conventions check

Item Status
Filename 2026_04_29_proposed_data_at_rest_encryption.md
Header Status/Author/Date
Five CLAUDE.md self-review lenses in §10
§10 lens 1 updated for nonce/sidecar preconditions
§10 lens 2 updated for Phase-2 membership check
§10 test coverage includes local_epoch > 0xff00 warning
§11.5 Lua batching open question flagged
Reserved-key `!encryption ` namespace confirmed no collision

Positive summary

Three rounds of review have closed every cryptographic and architectural gap identified in rounds 1–3. The deterministic nonce design (node_id ‖ local_epoch ‖ write_count) is the strongest possible fix for the original birthday-bound problem. The two-phase rollout correctly handles the asymmetry between per-version storage envelopes (safe in mixed-mode) and per-proposal raft envelopes (requires full-cluster capability). The sidecar fsync protocol now meets crash-durability requirements, and the double-encryption contradiction is cleanly resolved.

The one blocking issue before the implementation PR opens is the discriminator byte collision in §7.1 Phase 2 — discovered by cross-referencing the design against kv/fsm.go:110-116. The fix is a one-line constant addition (raftEncodeEnveloped = 0x03) and a two-paragraph update to §7.1. The local_epoch hard-stop and node_id truncation spec (#2 and #3) are also required before implementation to keep the nonce uniqueness guarantee unconditional.


@chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1b8613f08a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +643 to +645
so an old build reading a value written by an encryption-enabled
build sees an unrecognised tombstone byte and refuses to serve it
rather than returning ciphertext as cleartext. `valueHeaderSize`

P1 Badge Avoid treating new header bits as tombstones on old binaries

This section assumes an older binary will "refuse to serve" values written with the new tombstone+enc_state byte, but in the current tree store/lsm_store.go decodes tombstones as data[0] != 0 (any non-zero value). In a rolling Phase-1 deployment, old nodes would therefore interpret encrypted live values as tombstones and drop them from reads, which breaks the mixed-version safety claim in §7.1. The migration needs a backward-compatible encoding strategy (or a hard gate that prevents encrypted writes until all readers understand the new bits).


Comment on lines +584 to +586
3. Recovery is **automatic on the leader** (it re-proposes a
"rewrap current active DEK" entry that brings every node's
sidecar back in sync) and **manual on a stuck follower**

P1 Badge Resync sidecars with all unretired DEKs, not only active

The recovery plan here re-proposes only a rewrap of the current active DEK, but §5.2 explicitly allows old DEKs to remain decrypt-required for retained MVCC history. If a node misses multiple rotations, restoring just the active key can still leave intermediate key_ids absent, causing decrypt failures for historical versions until rewrite/retire completes. The resync flow should rebuild the full set of unretired DEKs (or it can strand readable history on the recovered node).


Comment on lines +845 to +847
Replicas dispatch on a 1-byte format tag at the start of the
`Data` payload — `0x00` = cleartext (pre-flag entry, found
only in WAL/snapshot history), `0x01` = raft envelope. The

P1 Badge Preserve existing Raft tag space in Phase-2 envelope framing

This assigns 0x00 = cleartext and 0x01 = raft envelope for historical/new entries, but current kv/fsm.go already uses 0x00 for single-request, 0x01 for batch, and 0x02 for HLC-lease entries. Reusing 0x01 for encrypted payloads would make pre-cutover batch entries ambiguous during WAL replay/snapshot restore, so a node can misroute old cleartext data into envelope decode and fail apply. The envelope framing needs a non-conflicting outer tag scheme (or explicit version/index gating) that remains compatible with existing persisted entries.


…EK resync, index-gated framing)

P1 (codex on 1b8613f line 645): old-binary tombstone test data[0] != 0 reads encrypted live values (encryption_state in bits 1-2 of byte 0) as tombstones and silently drops them. Convert §7.1 from 2-phase to 3-phase: Phase 0 = capability rollout (new binary writes cleartext, advertises encryption_capable in heartbeats); Phase 1 = enable-storage-envelope cluster-flag flip after membership-snapshot capability check; Phase 2 = enable-raft-envelope under the same gate. Per-node --encryption-enabled is now a capability assertion, not a behaviour trigger.

P1 (codex on 1b8613f line 586): §5.5 recovery rewrapped only the active DEK, leaving intermediate unretired DEKs missing on a node that missed multiple rotations and silently breaking historical reads. Recovery now rewraps every unretired DEK (active + retiring) in one apply; resync-sidecar replays all rotation/rewrap-deks entries, not just the most recent.

P1 (codex on 1b8613f line 847): Phase 2 first-byte framing tag (0x00 cleartext / 0x01 envelope) collided with kv/fsm.go existing tag space (0x00 single, 0x01 batch, 0x02 HLC-lease) and would misroute pre-cutover batch entries during WAL replay. Replace with raft-log-index gating: sidecar persists raft_envelope_cutover_index; entries below it use the legacy first-byte tags, entries at-or-above unwrap through the raft envelope. Snapshot headers carry the cutover index so a follower joining mid-Phase-2 reconstructs the same boundary.

Sidecar example expanded with storage_envelope_active and raft_envelope_cutover_index. §9.1 startup refusal list adds ErrEnvelopeCutoverDivergence. §10 self-review lens 2 updated to call out the old-binary tombstone path and the full-DEK resync requirement as load-bearing. §6.6 admin gains enable-storage-envelope.
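For illustration only, an expanded sidecar carrying these fields might look like the following (every field name and value here is hypothetical; the authoritative schema is the design doc's §5.1):

```json
{
  "node_id": 54464,
  "local_epoch": 3,
  "raft_applied_index": 18234,
  "storage_envelope_active": true,
  "raft_envelope_cutover_index": 18101,
  "deks": [
    {"key_id": "3735928559", "state": "active",   "wrapped": "<base64 KMS-wrapped DEK>"},
    {"key_id": "2427128370", "state": "retiring", "wrapped": "<base64 KMS-wrapped DEK>"}
  ]
}
```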
@bootjp
Owner Author

bootjp commented Apr 29, 2026

@claude review

@claude

claude Bot commented Apr 29, 2026

Claude finished @bootjp's task in 4m 7s —— View job


Review: docs: propose data-at-rest encryption design (Round 5)

  • Gather context and understand request
  • Read current design document (post round-4 updates)
  • Check kv/fsm.go for tag constants and dispatch logic
  • Check internal/raftengine/etcd/peer_metadata.go and peers.go
  • Review round-4 issue resolutions and remaining gaps
  • Post comprehensive review feedback


@chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f34d9b73d9


Comment on lines +904 to +909
relative to the persisted `raft_envelope_cutover_index` (the
apply index of the flag entry, recorded in the local sidecar
on apply). Entries below the cutover index are decoded via
the existing first-byte tag space (`0x00` single-request,
`0x01` batch, `0x02` HLC-lease, etc.); entries at-or-above
the cutover index are unwrapped through the raft envelope

P1 Badge Decode raft envelopes only after the cutover flag entry

Step 7 says the enable-raft-envelope flag entry itself is written in legacy framing, but this section defines raft_envelope_cutover_index as that flag entry’s index and then unwraps entries "at-or-above" it as envelopes. That makes WAL replay attempt to decrypt the flag entry as an envelope and fail. The boundary needs to exclude the flag entry (or record the first encrypted entry index instead), otherwise replay/apply can break exactly at cutover.


Comment on lines +592 to +593
sidecar's index AND the gap covers any rotation entries, the
node refuses to start with `ErrSidecarBehindRaftLog`,

P1 Badge Fail startup when sidecar misses raft cutover updates

The startup guard only blocks when sidecar lag spans rotation entries, but Raft entry decoding also depends on raft_envelope_cutover_index. If a node crashes after Raft advances past Phase-2 cutover but before that cutover state is persisted locally, this check can pass and the node may decode post-cutover entries with the wrong format. Include cutover/format-transition entries in the lag check (not only rotations) to avoid mis-dispatch on restart.


Comment on lines +416 to +417
as applied (and update `raft_applied_index` in memory; that
value is then persisted on the next sidecar write).

P2 Badge Persist raft_applied_index in the same sidecar rewrite

This step says raft_applied_index is updated only in memory and written on the next sidecar write, but §5.5 relies on that index being durable alongside the just-written DEK set. If implemented as written here, a crash after rotation can leave a fresh sidecar with a stale applied index and trigger unnecessary ErrSidecarBehindRaftLog refusals on restart. The design should require persisting raft_applied_index atomically with the DEK rewrite.



@coderabbitai Bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (1)
docs/design/2026_04_29_proposed_data_at_rest_encryption.md (1)

1243-1248: Promote Lua batching interaction from “open question” to acceptance criteria.

Given the rollout risk, this should be a concrete pre-merge check, not a deferred question. Add explicit tests/benchmarks for lua_commit_batching + raft envelope + inner per-value envelopes (mixed compressed/uncompressed payloads).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/2026_04_29_proposed_data_at_rest_encryption.md` around lines 1243
- 1248, The docs note is asking to promote the open question into concrete
pre-merge checks: add unit tests and benchmarks that exercise
lua_commit_batching with Raft proposal envelopes containing per-value envelopes
(both encrypted and unencrypted, both compressed and uncompressed) so the batch
encoder cannot assume inner payloads are cleartext or compressible; implement
tests named like TestLuaCommitBatchingEnvelopeHandling and
BenchmarkLuaBatchEncoderMixedPayloads that build batches via
lua_commit_batching, wrap inner Puts with per-value envelopes and an outer Raft
envelope, and assert decoding/round-trip integrity, correct sizing (no
double-compression), and that compressed flag handling is preserved for mixed
payloads.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/design/2026_04_29_proposed_data_at_rest_encryption.md`:
- Around line 865-869: Add a post-cutover membership gate: specify an invariant
that once the encryption flags are active (after Phase 0 completion where every
voting member shows encryption_capable = true) any membership-change proposal
that would add or promote a voting member must be rejected if the candidate
lacks encryption_capable = true. Update the Phase 1/2 sections and the
membership-change proposal validation text (referencing "encryption_capable",
"Phase 0", "Phase 1/2", and "membership-change proposals") to require this check
in the membership-change code path and operator/API validation (i.e.,
join/replace/add-voter requests and Raft proposal validation) and note that
non-voting members may still be permitted but cannot be promoted to voters
unless they present the flag.
- Around line 914-917: The snapshot header currently written in kv/snapshot.go
(lines ~13-46) must be versioned to carry raft_envelope_cutover_index safely:
introduce a header format with a version byte (e.g., headerVersion) and a
deterministic layout (magic, version, length, then fields including
raft_envelope_cutover_index as a fixed-size u64) and update the write path
(WriteSnapshotHeader) to emit the new version when creating Phase‑2 snapshots;
update the read path (ReadSnapshotHeader) to parse the magic and version, handle
legacy v1 (existing 16‑byte magic+ceiling) by detecting the old length/magic and
treating it as implicit v1 behavior, and parse v2+ to extract
raft_envelope_cutover_index, failing fast on unknown versions while logging
helpful errors for forward/backward compatibility checks.
- Around line 183-186: The document currently relies on a 16-bit node_id as part
of the AES-GCM nonce which is too small and risks nonce reuse; update the spec
to require a collision-resistant, persisted per-node nonce prefix (e.g., a
cryptographically-random 96–128-bit nonce_prefix or a UUIDv4/UUIDv7) generated
at bootstrap and stored persistently (and bound to the node’s identity) instead
of the 16-bit node_id, describe that the nonce construction (nonce =
nonce_prefix || per-message counter) must guarantee monotonic, non-repeating
counters per DEK, add a requirement that implementations persist and verify the
nonce_prefix on startup (similar to membership snapshot checks) and provide
explicit non-reuse guarantees across node reboots, replacements, and migrations
so AES-GCM nonce reuse under any DEK cannot occur.
- Around line 415-417: The doc contains conflicting persistence semantics for
raft_applied_index: reconcile both places so raft_applied_index is persisted
atomically in the same fsync'ed sidecar write as the wrapped DEKs (i.e., the FSM
should acknowledge rotation only after that combined fsync completes). Update
the text around the FSM acknowledgment (references to raft_applied_index,
sidecar write, and wrapped DEKs) so both the earlier statement and the later
paragraph (the lines mentioning wrapped DEKs) consistently require a single
atomic fsync'd sidecar transaction that includes the key updates and
raft_applied_index before acknowledging the rotation entry.


📥 Commits

Reviewing files that changed from the base of the PR and between 579d193 and f34d9b7.

📒 Files selected for processing (1)
  • docs/design/2026_04_29_proposed_data_at_rest_encryption.md

High (claude[bot] r4 #2): local_epoch wrap to 0 re-issues the nonce node_id||0||0 used for the same DEK's first-ever write -- catastrophic AES-GCM nonce reuse. Add ErrLocalEpochExhausted to §9.1 startup refusal at local_epoch == 0xFFFF; only DEK rotation can recover.

High (claude[bot] r4 #3): node_id was described as 16-bit but DeriveNodeID() returns uint64 FNV-64a. Specify the truncation rule (lower 16 bits) and the startup ErrNodeIDCollision check that asserts no two voting members share the same uint16 truncated value.

Medium (claude[bot] r4 #4): §5.1 step 5 said raft_applied_index is persisted on the next sidecar write, contradicting §5.5 which expects the same write. Reword: the index is part of the same JSON payload as the new wrapped DEKs, eliminating any deferred-write race that would spuriously fire ErrSidecarBehindRaftLog.

Medium (claude[bot] r4 #5): encryption_capable heartbeat field path was unspecified. Pin it to the existing peer_metadata.go extension; cutover commands force a fresh poll within one election timeout rather than reading cached state.

Critical reinforcement (claude[bot] r4 #1): explicitly cross-reference kv/fsm.go raftEncodeSingle/raftEncodeBatch/raftEncodeHLCLease constants in §7.1 Phase 2 and pseudocode the exact FSM dispatch order so the implementor cannot accidentally re-use the colliding tag bytes. Index-gating remains the chosen mechanism (already in round-4 commit).

Medium (claude[bot] r3 #3): §6.1 sidecar.go entry now references the §5.1 crash-durable write protocol explicitly so an implementor reading §6.1 in isolation cannot omit the fsync.

Medium (claude[bot] r3 #4): §6.2 corrected the misleading "old build refuses to serve" wording -- old decodeValue actually returns a silent tombstone, but the 3-phase rollout makes that path unreachable on a real cluster. Bit-layout comment requirement added.

Minor (claude[bot] r3 #5): env KEK provider must os.Unsetenv("ELASTICKV_KEK_BASE64") immediately after read.
@bootjp
Owner Author

bootjp commented Apr 29, 2026

@claude review

@claude

claude Bot commented Apr 29, 2026

Claude finished @bootjp's task in 4m 57s —— View job


Review: docs: propose data-at-rest encryption design (Round 5)

  • Gather context and understand request
  • Read current design document (post round-5 commits)
  • Check kv/fsm.go for FSM tag constants and dispatch logic
  • Check kv/snapshot.go for snapshot header format
  • Check internal/raftengine/etcd/peers.go for DeriveNodeID
  • Review round-4/5 issue resolutions
  • Post comprehensive review feedback

What commits f34d9b7 and 2768cb3 resolved

Every P1 and P2 from round 4 has been addressed:

| Issue | Resolution |
| --- | --- |
| Critical: FSM tag byte collision (0x01 = raftEncodeBatch vs raft envelope discriminator) | Eliminated entirely — Phase 2 dispatch is now index-gated (index >= raft_envelope_cutover_index), not tag-byte-gated. No new constant needed. ✅ |
| High: local_epoch overflow has no hard stop | §4.1 specifies process refusal at 0xFFFF via ErrLocalEpochExhausted; §9.1 includes it in the startup refusal list. ✅ |
| High: 16-bit node_id truncation rule unspecified | §4.1 specifies node_id := uint16(DeriveNodeID(raftID)) with a cross-membership collision check on startup. ✅ |
| Medium: sidecar fsync missing from §6.1 | §5.1 and §6.1 both specify the four-step write-tmp → fsync → rename → fsync-dir protocol. ✅ |
| Medium: encryption_capable staleness / blocking | §7.1 Phase 0 step 3 now specifies a fresh blocking poll per election timeout rather than reading cached state. ✅ |
| Medium: full-DEK resync on sidecar recovery | §5.5 recovery explicitly rebuilds every unretired DEK, not just the active one. ✅ |
| CodeRabbit P2: raft_applied_index deferred-write ambiguity | §5.1 step 1 and step 5 now consistently state the index is written atomically in the same JSON payload as the wrapped DEKs. ✅ |

Critical (must fix before implementation PR opens)

1. raft_envelope_cutover_index off-by-one: >= should be >

docs/design/2026_04_29_proposed_data_at_rest_encryption.md §7.1 Phase 2 steps 7–8

The pseudocode reads:

Apply(index, data):
  if index >= raft_envelope_cutover_index:
    data = raftDEK.Unwrap(data)

raft_envelope_cutover_index is recorded as the Raft apply index of the enable-raft-envelope flag entry itself. Step 7 says explicitly:

This entry itself uses the legacy first-byte tag space so it remains decodable by any replica that has applied entries up to it.

With >=, when the FSM applies the flag entry (its index equals raft_envelope_cutover_index), it will attempt raftDEK.Unwrap(data) on a cleartext entry — the entry that was written in legacy framing. The unwrap call will fail with a tag mismatch, and the node will either error-out on apply (breaking the cluster) or silently skip the flag entry (never setting raft_envelope_cutover_index in the sidecar, which means the node will never recognize Phase 2 at all).

Step 8 says "From the apply index of that flag entry onward" — "onward" implies exclusive. The >= in the pseudocode is inconsistent with this prose. The correct condition is:

if index > raft_envelope_cutover_index:
  data = raftDEK.Unwrap(data)

This processes the flag entry in legacy mode (which records raft_envelope_cutover_index in the sidecar), then wraps all subsequent entries. The snapshot ingest path (step 9) also needs to record the cutover index and apply the same > boundary on replay.



Medium (address before or alongside implementation)

2. Snapshot header format extension is unspecified

docs/design/2026_04_29_proposed_data_at_rest_encryption.md §7.1 Phase 2, step 9

Step 9 says: "Snapshots taken during Phase 2 carry the cutover index in their metadata header so a fresh follower joining mid-Phase-2 reconstructs the same dispatch boundary on ingest."

The existing store/snapshot_pebble.go snapshot header is:

[pebbleSnapshotMagic (8 bytes)] [lastCommitTS uint64 (8 bytes)] [key-value entries...]

There is no version field in this header. Adding raft_envelope_cutover_index requires a format bump. Without specifying how the format evolves, the implementation will have to make uncoordinated decisions that could produce incompatible snapshots between builds.

The design should specify:

  • The versioned format: e.g., change the magic from EKVPBBL1 to EKVPBBL2 when Phase 2 is active, appending raft_envelope_cutover_index uint64 after lastCommitTS.
  • Backward compatibility: a node reading an EKVPBBL1 snapshot treats raft_envelope_cutover_index as 0 (Phase 1 or earlier; no raft envelopes).
  • §9.1's ErrEnvelopeCutoverDivergence check depends on the snapshot carrying this field — without the format spec, that check cannot be implemented.


3. peer_metadata.go reference names the wrong file

docs/design/2026_04_29_proposed_data_at_rest_encryption.md §7.1 Phase 0, step 2

"Each upgraded node advertises encryption_capable = true in the peer-metadata extension that today already piggy-backs on the etcd-raft heartbeat path (internal/raftengine/etcd/peer_metadata.go)."

internal/raftengine/etcd/peer_metadata.go is the peers-file persistence layer (it reads and writes etcd-raft-peers.bin, the on-disk membership snapshot). It is not a heartbeat extension and has no mechanism for advertising per-node capability flags.

The etcd raft library's heartbeat messages (raftpb.MsgHeartbeat / MsgHeartbeatResp) carry a Context []byte field that applications use to piggyback metadata — this is the correct extension point. Alternatively, a separate out-of-band RPC (e.g., on the existing Distribution service or a new EncryptionAdmin service) can poll capability before the cutover commands issue.

The design should replace the peer_metadata.go reference with a concrete mechanism:

  • Option A (heartbeat context): Populate MsgHeartbeat.Context with a small proto carrying encryption_capable bool; parse it in engine.go's ReadyLoop when raftpb.MsgHeartbeatResp is received. This requires adding a new field to the transport encoding but reuses the existing heartbeat cadence.
  • Option B (out-of-band RPC): The enable-storage-envelope and enable-raft-envelope commands issue a CheckEncryptionCapability RPC to every voting peer before proposing. Simpler to implement; does not require changing the raft message format.

Either option is correct; the design must name one and remove the peer_metadata.go reference, which will send the implementor to the wrong file.
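For Option A, the Context payload could be as small as one byte. The encoding below is an assumption for illustration (the design would more likely pick a small proto); the essential property is that an empty Context — what an old binary sends — decodes as not-capable:

```go
// encodeCapability/decodeCapability sketch a minimal heartbeat Context
// payload. An empty or unrecognized Context (old binary) decodes as false,
// which fails the cutover gate safely.
func encodeCapability(capable bool) []byte {
	if capable {
		return []byte{1}
	}
	return []byte{0}
}

func decodeCapability(ctx []byte) bool {
	return len(ctx) == 1 && ctx[0] == 1
}
```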



Minor

4. Post-cutover membership gate not addressed

docs/design/2026_04_29_proposed_data_at_rest_encryption.md §7.1

The Phase 0 → Phase 1 → Phase 2 gate correctly blocks cutover until every voting member is encryption_capable. But the document says nothing about membership changes after cutover. A new voter added by an operator (e.g., to replace a failed node) that runs an old binary without --encryption-enabled would:

  • At the storage layer: receive MVCC versions with encryption_state = 0b01 set in data[0], read data[0] != 0 as tombstone, and silently drop live values from reads.
  • At the raft layer (Phase 2): fail every FSM apply on post-cutover entries it cannot decrypt.

Add one sentence: "After storage_envelope_active or raft_envelope_active has committed, the control plane rejects AddVoter / ConfChange for peers that do not advertise encryption_capable = true; learner promotion follows the same gate."


Conventions check

Item Status
Filename 2026_04_29_proposed_data_at_rest_encryption.md
Header Status/Author/Date
Five CLAUDE.md self-review lenses in §10
§10 lens 1 updated for nonce/sidecar preconditions
§10 lens 2 updated for 3-phase rollout and full-DEK resync
ErrLocalEpochExhausted in §9.1 startup refusal list
ErrNodeIDCollision in §9.1 startup refusal list
raft_applied_index written atomically with DEKs (§5.1)
env.go os.Unsetenv documented in §6.1
Reserved-key namespace collision confirmed in §5.4.8
§11.5 Lua batching interaction flagged as pre-implementation gate

Summary

Six rounds of review have closed every cryptographic, architectural, and wire-format issue identified since the initial submission. The document is in strong shape. One blocking issue remains before the implementation PR opens:

The raft_envelope_cutover_index off-by-one (#1): the pseudocode index >= cutover attempts to unwrap the flag entry itself, which is written in legacy framing. Change to index > cutover. This is a one-line edit to the pseudocode and a one-sentence clarification in step 8's prose.

The two medium issues (#2 snapshot format, #3 peer_metadata.go reference) should also be resolved before implementation begins — the snapshot format spec is required for §9.1's ErrEnvelopeCutoverDivergence check, and the wrong file reference will send the implementor to the wrong place for a non-trivial integration point.

P1 (codex on 2768cb3 line 621): §5.5 startup lag check only flagged rotation entries in the gap, so a missed enable-raft-envelope cutover would silently slip past and the FSM would decode post-cutover entries through the legacy first-byte tag space. Generalise the lag check to cover every encryption-relevant entry: rotate-dek, rewrap-deks, enable-storage-envelope, enable-raft-envelope, retire-dek. Implementation derives the set from a single isEncryptionRelevant predicate so future commands cannot bypass the check.
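Centralising the set behind a predicate is a one-function change. A sketch, using the command names listed above (the authoritative set belongs in the design doc):

```go
// isEncryptionRelevant gates the §5.5 startup lag check: any entry of
// these types inside the missed-index gap forces startup refusal instead
// of a silent mis-dispatch.
func isEncryptionRelevant(cmd string) bool {
	switch cmd {
	case "rotate-dek", "rewrap-deks",
		"enable-storage-envelope", "enable-raft-envelope", "retire-dek":
		return true
	default:
		return false
	}
}
```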

Major (coderabbitai on 2768cb3 line 1000): §7.1 Phase 2 step 9 said snapshots carry the cutover index but kv/snapshot.go currently writes a fixed 16-byte header with no version field. Specify a versioned header in §4.4: magic||ver||len||ceiling||cutover||reserved with implicit-v1 fallback for legacy snapshots, v2 only emitted after enable-raft-envelope has been observed locally, ErrSnapshotHeaderVersion on forward-unknown versions, and zero-validation on reserved bytes to prevent silent forward-compat slips.
@bootjp
Owner Author

bootjp commented Apr 29, 2026

@claude review

@claude

claude Bot commented Apr 29, 2026

Claude finished @bootjp's task in 7m 37s —— View job


Review: docs: propose data-at-rest encryption design (Round 6)

  • Gather context and understand request
  • Read current design document (post round-5/6 commits)
  • Check what e887e9f and 2768cb3 resolved
  • Review outstanding issues from Round 5 (FSM discriminator, epoch overflow, node_id truncation, heartbeat spec, snapshot versioning)
  • Cross-reference design claims against actual source files
  • Post comprehensive round-6 review feedback

What the round-5/6 commits resolved

Issue → Resolution

  • Phase 2 discriminator byte collision with kv/fsm.go tags → replaced byte-tagged dispatch with index-gated dispatch (if index >= raft_envelope_cutover_index: raftDEK.Unwrap(data)), sidestepping the collision by construction ✅
  • local_epoch overflow had no hard stop → ErrLocalEpochExhausted on local_epoch == 0xFFFF in §9.1 ✅
  • 16-bit node_id truncation unspecified → documented as uint16(DeriveNodeID(raftID)) with a startup ErrNodeIDCollision check ✅
  • raft_applied_index write-timing contradiction → step 1 of the §5.1 write protocol now builds raft_applied_index into the same JSON payload, eliminating the deferred-write window ✅
  • encryption_capable heartbeat field underspecified → §7.1 Phase 0 step 3 now specifies a fresh poll within one election timeout; ErrCapabilityCheckFailed if any member is unreachable or reports false
  • Snapshot header versioning for raft_envelope_cutover_index → §4.4 now specifies an explicit versioned v2 header (magic + ver + len + ceiling + cutover + reserved) with read/write path rules ✅
  • §5.5 lag check covered only rotations → expanded to cover all encryption-relevant entries via an isEncryptionRelevant() predicate ✅
  • node_id 32-bit prefix birthday risk → retracted in favour of the deterministic nonce (node_id ‖ local_epoch ‖ write_count) ✅
  • raft_applied_index consistency between §5.1 and §5.5 → made consistent: written atomically with the DEKs in the same fsync ✅

Critical (must fix before implementation PR opens)

1. Snapshot magic bytes and variable name are wrong in §4.4

docs/design/2026_04_29_proposed_data_at_rest_encryption.md line 321, 326

The v2 header diagram states:

| magic(8) | ...
| EKVPBBL1 | ...

and the annotation says "magic — unchanged 8-byte **pebbleSnapshotMagic** (EKVPBBL1)".

Both facts are wrong. Reading kv/snapshot.go:16:

var hlcSnapshotMagic = [8]byte{'E', 'K', 'V', 'T', 'H', 'L', 'C', '1'}

The magic is EKVTHLC1 (not EKVPBBL1) and the variable is hlcSnapshotMagic (not pebbleSnapshotMagic). The restore path in kv/fsm.go:277 also uses hlcSnapshotMagic for the equality check. If the implementation copies the design doc's magic bytes, v1 snapshot detection will fail (the 8-byte comparison will always be false), making every existing v1 snapshot unreadable after upgrade.

Fix: Replace pebbleSnapshotMagic (EKVPBBL1) with hlcSnapshotMagic (EKVTHLC1) everywhere in §4.4.



2. >= dispatch boundary decrypts the flag entry itself — WAL replay fails at cutover

docs/design/2026_04_29_proposed_data_at_rest_encryption.md lines 1062–1082

The design says (line 1062–1064):

"This entry itself uses the legacy first-byte tag space so it remains decodable by any replica that has applied entries up to it."

The raft_envelope_cutover_index is then defined as "the apply index of that flag entry" (line 1069). Yet the FSM pseudocode (line 1075) reads:

if index >= raft_envelope_cutover_index:
  data = raftDEK.Unwrap(data)

>= includes the flag entry (index == raft_envelope_cutover_index). On WAL replay, that entry is in legacy framing (raftEncodeSingle = 0x00 or raftEncodeBatch = 0x01); raftDEK.Unwrap(data) on legacy bytes produces an AES-GCM tag mismatch. The cluster halts at the cutover index on every replay — snapshot installs and follower catch-up are broken.

This is the Codex P1 "Decode raft envelopes only after the cutover flag entry" that e887e9f was meant to close but left the >= unchanged.

Two correct resolutions:

Option A (preferred — no format change): Keep raft_envelope_cutover_index as the flag entry's index, change the comparison to >:

if index > raft_envelope_cutover_index:
  data = raftDEK.Unwrap(data)

The leader wraps entries starting from index = cutover_index + 1. The flag entry decodes through the existing raftEncodeSingle/raftEncodeBatch dispatch unchanged.

Option B: Define raft_envelope_cutover_index = flag_entry_index + 1 (the index of the first encrypted entry), keep >=.

Pick one and update the pseudocode, the sidecar JSON example in §5.1 (where "raft_envelope_cutover_index": 184201 is shown alongside "raft_applied_index": 184273 — if 184201 is the flag entry, then 184202 is the first encrypted entry), and the §5.5 startup check description.



High (should resolve before implementation)

3. peer_metadata.go reference is inaccurate — no heartbeat mechanism exists there

docs/design/2026_04_29_proposed_data_at_rest_encryption.md lines 1011–1018

The design says:

"Each upgraded node advertises encryption_capable = true in the peer-metadata extension that today already piggy-backs on the etcd-raft heartbeat path (internal/raftengine/etcd/peer_metadata.go). A new encryption_capable bool field is added to the existing PeerMetadata proto; old binaries omit the field, which protobuf decodes as the zero value (false)"

Two problems:

  1. peer_metadata.go manages the persistent peers binary file (etcd-raft-peers.bin) using custom binary encoding (readU64, readString, readU8). It has nothing to do with heartbeats or protobuf. The existing extension point for piggybacking data on heartbeats is raftpb.Message.Context []byte, which flows through grpc_transport.go / engine.go.

  2. "Today already piggy-backs" — there is no existing heartbeat metadata mechanism in the codebase. This is a new mechanism that needs to be built, not extended.

The design should name the correct extension point (e.g., "a new encryption_capable flag will be added to the etcd-raft message context bytes piggy-backed on MsgHeartbeat / MsgHeartbeatResp in internal/raftengine/etcd/engine.go") and clarify that this is new infrastructure. The "protobuf zero value" claim also needs replacing with the correct encoding scheme for a context-byte piggyback.



Medium (address before or alongside implementation)

4. resync-sidecar manual recovery only replays "rotation and rewrap-deks" — misses 3 entry types

docs/design/2026_04_29_proposed_data_at_rest_encryption.md lines 738–744

The §5.5 preamble correctly defines an isEncryptionRelevant() predicate covering five entry types: rotate-dek, rewrap-deks, enable-storage-envelope, enable-raft-envelope, retire-dek. The manual recovery path at step 3 then says:

"The command replays all rotation and rewrap-deks entries between the sidecar's raft_applied_index and the FSM's applied index"

This is inconsistent — a node that missed enable-raft-envelope or enable-storage-envelope would not have those entries replayed, leaving raft_envelope_cutover_index unset and causing the same mis-dispatch the rest of §5.5 is designed to prevent. Change to "all isEncryptionRelevant() entries" throughout the recovery description to match the predicate.

5. Snapshot header len = 0x40 is unexplained and likely wrong

docs/design/2026_04_29_proposed_data_at_rest_encryption.md line 323

The v2 header diagram shows len = 0x40 (64 decimal). The text defines len as "total payload length after the len field." The only defined fields after len are ceiling (8B) and cutover (8B) = 16 bytes. len = 0x40 = 64 therefore implies 48 bytes of reserved padding, which is never explained. A v3+ reader that calls skip(len - 16) would skip 48 bytes of zero padding before the first extension field.

If this is intentional alignment, document why. If it's an error, fix to len = 0x10 = 16 (the two defined fields, no reserved) and update the diagram. Either way the design must specify the exact reserved byte count so implementors know the fixed header size.

6. Post-cutover membership gate still missing

docs/design/2026_04_29_proposed_data_at_rest_encryption.md §7.1

The Phase 0 capability gate prevents the Phase 1 and Phase 2 cutover until every current voting member reports encryption_capable. However, the design contains no invariant governing AddVoter / ChangeConf proposals after Phase 1 or Phase 2 is enabled. A non-capable node added as a voter after Phase 2 would silently read all encrypted storage values as tombstones (via the data[0] != 0 tombstone path) and fail to decode raft-envelope entries.

Add one paragraph to Phase 2 specifying: "Once storage_envelope_active is true, the leader rejects ConfChangeAddVoter / ConfChangeAddLearner-then-promote proposals that do not include encryption_capable = true in the candidate's heartbeat context. Non-voting (learner) members may still join and sync, but cannot be promoted to voters until they demonstrate capability."


Minor

7. v1 detection heuristic lacks a precise threshold

docs/design/2026_04_29_proposed_data_at_rest_encryption.md lines 330–333

The read-path step 2 says: "if [byte 9] falls in the documented v1 ceiling high-byte range (i.e., the implicit-v1 detection)." This heuristic is sound for today (the ceiling high byte is 0x00 for any HLC timestamp within the past several decades and won't reach 0x02 until year ~2253), but the implementation needs a concrete rule. Suggest: "treat byte 9 values 0x02..0x0F as version markers (aligned with §11.3's envelope-version reservation); any other value, including 0x00 and 0x01, indicates a v1 stream."
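The suggested rule is a two-line function. A sketch of the detection logic (the 0x02..0x0F window follows the suggestion above):

```go
// streamVersion applies the suggested byte-9 rule: 0x02..0x0F are explicit
// version markers; anything else — including 0x00 and 0x01, the plausible
// high bytes of a v1 HLC ceiling — indicates an implicit v1 stream.
func streamVersion(b9 byte) int {
	if b9 >= 0x02 && b9 <= 0x0F {
		return int(b9)
	}
	return 1
}
```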


Conventions check

Item Status
Filename 2026_04_29_proposed_data_at_rest_encryption.md
Header Status/Author/Date
Five CLAUDE.md self-review lenses in §10
§10 lens 1 / lens 2 updated for new mechanisms
Open question §11.5 (Lua batching) still flagged
`!encryption ` reserved-key namespace collision confirmed no conflict
ErrNodeIDCollision / ErrLocalEpochExhausted in §9.1
Full-DEK resync in §5.5 (not just active key)
Snapshot versioning with backward-compat v1 detection ✅ (modulo issues 1–2)

Summary

Rounds 1–6 of review have closed every P1 and most High issues from all prior passes. The architecture — index-gated Raft dispatch, MVCC encryption_state bit, deterministic nonces, three-phase rollout, crash-durable sidecar — is solid and well-specified.

Two blockers remain before the implementation PR:

  1. §4.4 snapshot magic is wrong. The diagram names pebbleSnapshotMagic = EKVPBBL1 but kv/snapshot.go uses hlcSnapshotMagic = EKVTHLC1. An implementation following the doc writes the wrong 8 bytes, breaking all v1 snapshot backward-compat.

  2. >= boundary in the FSM dispatch pseudocode. The flag entry at raft_envelope_cutover_index is in legacy framing; applying raftDEK.Unwrap() to it on WAL replay produces a GCM tag mismatch. Fix by changing to > (exclusive) and noting that the leader wraps entries strictly after the flag.

Issue 3 (incorrect peer_metadata.go reference) should also be resolved before implementation so the capability heartbeat mechanism is implemented in the right place from day one.

@claude

claude Bot commented Apr 29, 2026

Claude finished @bootjp's task in 7m 39s —— View job


Review: docs: propose data-at-rest encryption design (Round 15)

  • Gather context and read current design document
  • Check what commit d340ab5 resolved
  • Cross-reference design claims against source files (engine.go:2020-2028, engine.go:3612-3625)
  • Identify remaining gaps
  • Post comprehensive review feedback

What commit d340ab5 resolved

Issue → Resolution

  • Codex P1 (r13, line 319): storage envelope AAD covered only envelope_version ‖ flag ‖ key_id, leaving the ciphertext vulnerable to cut-and-paste / blob-relocation attacks → AAD now includes pebble_key (encodeKey(user_key, commit_ts)); §4.1 documents three defeated attack classes: header rewrite, cut-and-paste/blob relocation, MVCC version substitution ✅
  • Codex P2 (r13, line 1360): cutover capability gate covered only voters, but learners also run kv/fsm.go::Apply → gate extended to ConfState.Voters ∪ ConfState.Learners in §7.1 Phase 0 step 3, §4.1 bootstrap, §5.6 step 1a, and §10 self-review lens 2; ErrCapabilityCheckFailed now surfaces offending node IDs from either set ✅

Critical — must fix before implementation PR opens

1. onEntry pseudocode unwraps the wrong bytes — implementors will write broken code

docs/design/2026_04_29_proposed_data_at_rest_encryption.md §7.1 Phase 2 step 8 (lines 1490–1502) and §6.3 (lines 1158–1176)

This is the issue first raised in Round 12 that was not addressed in rounds 13 or 14.

The design's flow (§6.3) is:

kv/sharded_coordinator.go — wrap the proposal Data []byte before submitting to raftengine

So the coordinator calls something like:

raftWrapped := raftDEK.Wrap(fsmPayload)
engine.Propose(ctx, raftWrapped)

engine.go:1394 then calls encodeProposalEnvelope(req.id, req.payload) — reading the source:

// engine.go:3612–3618
func encodeProposalEnvelope(id uint64, payload []byte) []byte {
    out := make([]byte, envelopeHeaderSize+len(payload))
    out[0] = proposalEnvelopeVersion   // 0x01
    binary.BigEndian.PutUint64(out[1:envelopeHeaderSize], id)
    copy(out[envelopeHeaderSize:], payload)
    return out
}

So entry.Data = [0x01 (proposalEnvelopeVersion)][uint64 id (8 bytes)][raftDEK.Wrap(fsmPayload)].

The pseudocode at lines 1492–1502:

onEntry(index, entry):
  data := entry.Data           // = [0x01][id 8B][raft_envelope(fsmPayload)]
  if index > raft_envelope_cutover_index:
    data = raftDEK.Unwrap(data) // ← WRONG: data[0]=0x01 is the proposal version,
                                //   not the start of a raft envelope; GCM auth fails
  fsm.Apply(data)

raftDEK.Unwrap(entry.Data) operates on bytes that start with proposalEnvelopeVersion = 0x01 followed by an 8-byte ID — not the raft envelope the coordinator produced. AES-GCM authentication fails on every entry.

The actual applyNormalEntry in engine.go:2020–2028 already calls decodeProposalEnvelope(entry.Data) first and gives the FSM only the extracted payload:

func (e *Engine) applyNormalEntry(entry raftpb.Entry) any {
    _, payload, ok := decodeProposalEnvelope(entry.Data)
    if !ok { return nil }
    return e.fsm.Apply(payload)
}

It is payload — not entry.Data — that holds raftDEK.Wrap(fsmPayload). The hook must fire on payload, after decodeProposalEnvelope:

// Correct version of applyNormalEntry (§6.3 hook placement):
func (e *Engine) applyNormalEntry(entry raftpb.Entry) any {
    _, payload, ok := decodeProposalEnvelope(entry.Data)
    if !ok { return nil }
    if entry.Index > e.raftEnvelopeCutoverIndex() {
        var err error
        payload, err = e.raftDEK.Unwrap(payload) // payload = raft_DEK.Wrap(fsmPayload)
        if err != nil { return err }
    }
    return e.fsm.Apply(payload)
}

resolveProposal (line 2031) also calls decodeProposalEnvelope(entry.Data), which works correctly because entry.Data[0] = 0x01 = proposalEnvelopeVersion — that byte is never encrypted.

Fix the pseudocode and §6.3 to show the unwrap operating on payload (the output of decodeProposalEnvelope), not on entry.Data:

onEntry(index, entry):
  // engine.go:2020 — existing first step, unchanged:
  id, payload, ok := decodeProposalEnvelope(entry.Data)
  if !ok: return nil
  // NEW hook between decodeProposalEnvelope and fsm.Apply:
  if index > raft_envelope_cutover_index:
    payload = raftDEK.Unwrap(payload)   // payload was raft_DEK.Wrap(fsmPayload)
  return fsm.Apply(payload)            // fsmPayload, contract unchanged

Also update §4.2 to clarify: "the coordinator wraps fsmPayload with the raft DEK before passing it to engine.Propose; engine.Propose then wraps the result in encodeProposalEnvelope, so entry.Data = [proposalEnvelopeVersion][id][raft_DEK.Wrap(fsmPayload)]."



Medium — should resolve before or alongside implementation

2. Raft envelope wire format, nonce scheme, and AAD are unspecified

docs/design/2026_04_29_proposed_data_at_rest_encryption.md §4.2

§4.1 fully specifies the storage envelope: version byte, flag, key_id, deterministic 96-bit nonce (node_id ‖ local_epoch ‖ write_count), AAD (envelope_version ‖ flag ‖ key_id ‖ pebble_key), and overhead. §4.2 says only that the raft DEK "wraps the entire Data []byte" with no format diagram, no nonce scheme, and no AAD.

Two concrete risks this leaves open:

Nonce scheme: If an implementor uses random 12-byte nonces for the raft DEK, the NIST SP 800-38D §8.3 limit of 2³² invocations per key is reached in approximately 5 days at 10k proposals/s — the same problem the storage envelope fixed multiple rounds ago.

Cross-context replay via missing purpose discriminator in AAD: After the Round 14 fix, the storage AAD now binds to pebble_key. But a raft-envelope ciphertext sealed against AAD = envelope_version ‖ flag ‖ key_id ‖ raft_payload_context (or whatever the implementation chooses) would not be detectable as a raft ciphertext if an attacker replays it into the Pebble storage layer (or vice versa) — unless the AAD includes a purpose field ("storage" vs "raft").

Specify in §4.2:

Raft envelope format (same layout as §4.1, purpose-specific AAD):
| 0x01 | 0x00 (flag=0, no compress) | key_id(4) | nonce(12) | ciphertext(N) | tag(16) |

Nonce: same node_id ‖ local_epoch ‖ write_count construction as §4.1,
  drawing from the same sidecar-persisted local_epoch counter for the
  raft DEK (not shared with the storage DEK counter — each DEK has
  its own counter in the sidecar). Only the leader calls Wrap.

AAD: purpose("raft") ‖ envelope_version ‖ key_id
  (distinct from storage-layer AAD = envelope_version ‖ flag ‖ key_id ‖ pebble_key
   so a storage-layer ciphertext replayed as a raft envelope, or vice versa,
   fails GCM authentication)


3. Learner registration gap: no path covers learners added after bootstrap

docs/design/2026_04_29_proposed_data_at_rest_encryption.md §4.1 (writer registry, registration paths)

The three current registration paths (bootstrap batch, process-start via coordinator, post-rotation via coordinator) all assume the registering node either (a) exists at bootstrap time or (b) has a client-facing coordinator that can block until the RegisterEncryptionWriter commit is acknowledged. Learners don't serve client writes and have no coordinator gate — so the "process-start via coordinator" path cannot protect a learner from writing its first encrypted entry before its registry entry commits.

Line 1013–1016 says: "Nodes added to the cluster after bootstrap fall back to the process-start trigger described in §4.1 (their first RegisterEncryptionWriter is proposed before they accept any encrypted write at the coordinator)." This is correct for a new voter starting with a coordinator, but a new learner has no such coordinator gate.

The §7.1 post-cutover membership gate (line 1576) checks EncryptionAdmin.GetCapability before AddVoter and learner promotion — but not before ConfChangeAddLearner. A new learner that passes the capability check still has no RegisterEncryptionWriter in flight; its FSM apply will encrypt values under a DEK it is not yet registered for.

Add a fourth registration path: when the leader processes ConfChangeAddLearner (checked via the post-cutover gate), it also proposes RegisterEncryptionWriter(dek_id, learner_full_node_id, learner_local_epoch) — both fields collected from the GetCapability call already made during the membership gate — in the same commit batch or immediately after. The post-cutover gate section at §7.1 should state this explicitly.



Conventions check

Item Status
Filename 2026_04_29_proposed_data_at_rest_encryption.md
Header Status/Author/Date
Five CLAUDE.md self-review lenses in §10
AAD extended to envelope_version ‖ flag ‖ key_id ‖ pebble_key in §4.1
§4.1 documents cut-and-paste / blob-relocation / MVCC-version-substitution defeats
Capability gate covers ConfState.Voters ∪ ConfState.Learners
raft_envelope_cutover_index > (exclusive) boundary with comment
Snapshot format EKVTHLC1 / EKVTHLC2 distinct magics
hlcSnapshotMagic / hlcSnapshotMagicV2 in §4.6 (not pebbleSnapshotMagic)
bootstrap-encryption in §5.5 six-entry isEncryptionRelevant() set
EncryptionAdmin.GetCapability with full_node_id and local_epoch in §6.1
Post-cutover AddVoter/learner-promotion gate (ErrPeerNotEncryptionCapable)
ErrLocalEpochExhausted and ErrNodeIDCollision in §9.1
raft_applied_index written atomically with DEKs in §5.1
key_id == 0 reserved sentinel excluded from CSPRNG
Phase 1 step 4 "not raft-DEK-wrapped" (not "legacy framing")
onEntry pseudocode operates on correct bytes ❌ still entry.Data — see issue #1
Raft envelope format, nonce, and AAD specified ❌ not yet specified — see issue #2
Learner registration path for post-bootstrap adds ❌ missing — see issue #3
§11.5 Lua batching flagged as pre-implementation gate

Summary

Fourteen rounds of review have hardened every cryptographic, wire-format, rollout-safety, and key-management decision in this proposal. The Round 14 AAD fix (record-binding via pebble_key) and the learner/voter capability gate are well-done improvements.

One blocking issue remains before the implementation PR opens:

Issue #1 (Critical): The onEntry pseudocode unwraps entry.Data instead of the FSM payload extracted by decodeProposalEnvelope. Verified against engine.go:2020–2028 and engine.go:3612–3625: entry.Data = [0x01 (proposalEnvelopeVersion)][8-byte id][raft_DEK.Wrap(fsmPayload)]. Calling raftDEK.Unwrap(entry.Data) would fail with a GCM authentication error because the proposal envelope header bytes are not what the raft DEK encrypted. The correct hook placement is inside applyNormalEntry after decodeProposalEnvelope extracts payload, not before it.

Two medium issues should also be resolved before implementation:

Issue #2 (Medium): The raft envelope has no wire format diagram, nonce scheme, or AAD specification. Without these, the implementation could introduce random nonces (hitting the 2³² NIST limit in ~5 days at 10k/s) or omit a purpose discriminator (enabling cross-context replay between storage and raft envelopes).

Issue #3 (Medium): Learners added after the initial bootstrap batch have no described path to register in the writer registry. The "process-start via coordinator" path requires a client-facing coordinator that learners lack. A fourth path (leader proposes RegisterEncryptionWriter on ConfChangeAddLearner) should be documented.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d340ab528c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +241 to +245
exist cluster-wide. FSM apply may run on Raft entries
proposed by the leader during this window (those are
decrypted using DEKs the node already has from its
sidecar), but no self-originated encrypted entry can be
queued by the local coordinator until registration

P1 Badge Gate follower apply on writer-registry registration

This window allows FSM apply to process encrypted entries before RegisterEncryptionWriter is committed, which violates the stated invariant in §4.1 that a node must be in the per-DEK writer registry before its first write. For a newly joined learner/voter, apply-path writes can happen without any coordinator traffic, so a 16-bit node_id collision with a historical writer under the same DEK can bypass the registry check and reuse (DEK, nonce) pairs. The membership/join flow should require successful writer registration (or equivalent pre-registration in the conf-change path) before encrypted apply is allowed.


Comment on lines +928 to +932
replays **every entry that satisfies the
`isEncryptionRelevant()` predicate** (the same six-entry
set used by the §5.5 lag check above: `rotate-dek`,
`rewrap-deks`, `bootstrap-encryption`,
`enable-storage-envelope`, `enable-raft-envelope`,

P2 Badge Add resync fallback when missed entries are compacted

The manual recovery path is defined as replaying encryption-relevant log entries between the sidecar index and FSM applied index, but this fails if the missing bootstrap/cutover entries were already compacted from the Raft log (a normal case after long downtime). In that scenario startup refusal (ErrSidecarBehindRaftLog) can become unrecoverable with the documented resync-sidecar flow, because there is no fallback that reconstructs sidecar state from a full current-state snapshot/entry.


…vs proposal envelope)

Critical (claude[bot] r12 #1): the previous design directed raftDEK.Unwrap to operate on entry.Data, but engine.go already wraps every proposal in encodeProposalEnvelope(id, payload) producing entry.Data = [0x01 proposalEnvelopeVersion][uint64 id][payload]. Unwrapping entry.Data would clobber the leading 0x01 that resolveProposal/decodeProposalEnvelope check, returning ok=false and silently dropping every coordinator response -- every write would time out forever after Phase 2. Even if the unwrap "succeeded," the result would be the inner proposal envelope itself starting with 0x01 = raftEncodeBatch, misrouting every post-Phase-2 entry.

Fix the layering: the raft envelope wraps the FSM payload INSIDE the proposal envelope, not entry.Data itself. §4.2 explains the entry.Data layout and the call sequence. §6.3 hook is now placed inside applyNormalEntry, between decodeProposalEnvelope and fsm.Apply -- decode first to recover (id, maybe_enc_payload), then unwrap if index > cutover, then Apply the cleartext. §7.1 pseudocode rewritten as applyNormalEntry showing the full sequence with explicit comments calling out the layering trap. Coordinator-side bullet in §6.3 also clarified to wrap with raft DEK BEFORE handing to engine submit path.
@bootjp
Owner Author

bootjp commented Apr 29, 2026

@claude review

@claude

claude Bot commented Apr 29, 2026

Claude encountered an error — View job


I'll analyze this and get back to you.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 36ffb904ee

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1331 to +1333
# refuses unless every
# voting member reports
# encryption_capable

P1: Require learner capability before enabling envelopes

This command contract still says cutover is allowed when only voting members report encryption_capable, but later in §7.1 the rollout safety argument depends on checking voters and learners because learners run the same apply path. If implementers follow this section, Phase 1/2 could be enabled with a stale learner that then mis-handles encrypted entries (tombstone misread or apply failure), which undermines the migration safety guarantees.


…15 medium x2)

P1 (codex on 36ffb90 line 245) + claude[bot] r15 #3: writer registry §4.1 had three registration paths (bootstrap batch / process-start via coordinator / post-rotation via coordinator). Learners do not serve client writes, so they have NO coordinator gate to defer their first encrypted FSM apply behind a RegisterEncryptionWriter commit. A new learner whose uint16(node_id) collides with a historical writer would reuse (DEK, nonce) pairs on its very first applied Put. Add a fourth path: ConfChange-time. The leader processing ConfChangeAddLearner / ConfChangeAddNode (already gated by GetCapability per §7.1 post-cutover gate) proposes paired RegisterEncryptionWriter entries for every unretired DEK before committing the conf-change; collisions reject with ErrPeerNotEncryptionRegistered. The same applies to learner→voter promotion.

P2 (codex on 36ffb90 line 966): §5.5 manual recovery via log replay fails when the missing encryption-relevant entries have been compacted from the Raft log (normal after long downtime + snapshot retention). ErrSidecarBehindRaftLog becomes unrecoverable. Add a leader-side full-state fallback: ErrLogCompactedPastSidecar triggers EncryptionAdmin.GetSidecarState which returns the entire unretired DEK set (KEK-wrapped) plus active flags and cutover index. The recovering node overwrites its sidecar authoritatively. Safe because it was already behind and has no committed-but-unapplied state to lose.

Medium (claude[bot] r15 #2): §4.2 raft envelope had no wire format / nonce scheme / AAD specified. Add explicit format diagram (same layout as §4.1, flag fixed at 0x00, separate per-DEK local_epoch counter), nonce uses the same node_id||local_epoch||write_count construction with the raft DEK's OWN counter, and AAD = purpose("raft") || envelope_version || key_id with a literal byte 0x52 prefix that distinguishes raft-layer ciphertext from storage-layer ciphertext (so cross-context replay fails GCM verification). No pebble_key in raft AAD because the raft envelope is location-independent (identified by entry.Index, used by the engine hook for index-gating).

Note: claude[bot] r15 Critical #1 (entry.Data vs payload layering) was already addressed in round-15 commit 36ffb90. §6.1 admin.go updated to include GetSidecarState RPC and SidecarStateReport message.
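The nonce and AAD construction described above can be sketched as below. The 2+2+8 byte field split (yielding a standard 12-byte GCM nonce) and the 6-byte AAD layout are assumptions consistent with the node_id||local_epoch||write_count scheme; only the 0x52 prefix and the field names come from the design text.

```go
package main

import "encoding/binary"

// raftNonce builds the 12-byte GCM nonce node_id||local_epoch||write_count,
// driven by the raft DEK's own per-key counter (2+2+8 bytes assumed).
func raftNonce(nodeID, localEpoch uint16, writeCount uint64) []byte {
	n := make([]byte, 12)
	binary.BigEndian.PutUint16(n[0:2], nodeID)
	binary.BigEndian.PutUint16(n[2:4], localEpoch)
	binary.BigEndian.PutUint64(n[4:12], writeCount)
	return n
}

// raftAAD = 0x52 ('R' for "raft") || envelope_version || key_id, so
// storage-layer ciphertext replayed into the raft layer fails GCM verification.
func raftAAD(envelopeVersion byte, keyID uint32) []byte {
	aad := make([]byte, 6)
	aad[0] = 0x52
	aad[1] = envelopeVersion
	binary.BigEndian.PutUint32(aad[2:6], keyID)
	return aad
}
```

Binding the purpose byte into the AAD means cross-context replay is rejected by the GCM tag check rather than by any ad-hoc framing logic.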
@bootjp

bootjp commented Apr 29, 2026

@claude review

@claude

claude Bot commented Apr 29, 2026

Claude encountered an error — View job


I'll analyze this and get back to you.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2c89b62057


Comment on lines +283 to +286
2. If an entry exists with the **same** `full_node_id` →
this is a re-registration (post-restart). Bump
`last_seen_local_epoch` and allow. The node was already
a legitimate writer.

P1: Reject stale local_epoch on writer re-registration

The re-registration rule currently accepts any entry with the same full_node_id and just updates last_seen_local_epoch, which allows a node restored from stale disk/backup state to come back with a lower local_epoch under the same unretired DEK. Because the nonce is node_id || local_epoch || write_count, that rollback can reproduce previously used nonces and cause AES-GCM nonce reuse. The registration check should enforce monotonicity (e.g., require new_local_epoch > last_seen_local_epoch) and fail startup/registration on regression.
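The monotonicity check this finding asks for can be sketched as below. The record type and method are hypothetical stand-ins (the real §4.1 registry is Raft-replicated state keyed by full_node_id); only the strictly-greater rule is from the finding.

```go
package main

import "errors"

var ErrLocalEpochRollback = errors.New("stale local_epoch on writer re-registration")

// writerRecord is a hypothetical in-memory stand-in for one registry slot.
type writerRecord struct {
	lastSeenLocalEpoch uint16
}

// reRegister enforces new_local_epoch > last_seen_local_epoch: an equal or
// lower value indicates restore-from-stale-state and would let the node
// reissue previously used node_id||local_epoch||write_count nonces.
func (r *writerRecord) reRegister(newEpoch uint16) error {
	if newEpoch <= r.lastSeenLocalEpoch {
		return ErrLocalEpochRollback
	}
	r.lastSeenLocalEpoch = newEpoch
	return nil
}
```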


…earner consistency)

P1 (codex on 2c89b62 line 1450): §6.6 admin command contract for enable-storage-envelope / enable-raft-envelope still said "voting member" only, contradicting §7.1 Phase 0 step 3 (which I extended to ConfState.Voters union ConfState.Learners in round-14). An implementer following §6.6 in isolation would build a gate that lets cutover proceed with a stale learner present, defeating the migration safety guarantee. Updated §6.6 contract comments and four other site-wide "voting member" references (§6.1 GetCapability fan-out scope, §6.3 raftEncodeEncryptionBootstrap body description, §6.2 lsm_store layout extension, §7.1 Phase 0 / Phase 2 prose) to consistently say "voters AND learners" or "every member" so the safety invariant is uniform across the design.
@bootjp

bootjp commented Apr 29, 2026

@claude review


@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
docs/design/2026_04_29_proposed_data_at_rest_encryption.md (1)

1514-1520: Clarify that Phase 2 cutover does not require rewrite completion.

Lines 1516-1519 describe Phase 1 completion criteria (zero cleartext versions + minRetainedTS advanced), but Phase 2 cutover (lines 1521-1610) does not list "Phase 1 complete" as a precondition. This implies the operator can run enable-raft-envelope while cleartext MVCC versions remain on disk.

This is safe (storage reads dispatch on encryption_state per lines 1373-1390), but it violates the §1 threat model goal until rewrite completes. An attacker who steals disks during the Phase 1→2 window gets both cleartext historical versions and encrypted new writes.

Add an explicit note at line 1523 (start of Phase 2 section) clarifying:

  • Phase 2 can proceed before rewrite finishes (raft envelope and storage envelope are orthogonal).
  • However, the §1 threat model guarantee ("persisted state cannot be decrypted without KEK") only holds cluster-wide after the §7.1 step 6 rewrite completes and minRetainedTS advances past the youngest cleartext commit_ts on every node.
  • Operators with strict compliance requirements should gate Phase 2 on rewrite completion.

This makes the security posture explicit for operators who care about the timing of full coverage.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/2026_04_29_proposed_data_at_rest_encryption.md` around lines 1514
- 1520, Add a clear note at the start of the Phase 2 section (before or at the
existing "enable-raft-envelope" discussion) stating that Phase 2 can be executed
before the in-place MVCC rewrite completes because raft-envelope and
storage-envelope are orthogonal (storage reads still dispatch based on
encryption_state), but that the §1 threat-model guarantee ("persisted state
cannot be decrypted without KEK") only holds cluster-wide after the §7.1 step 6
rewrite finishes and minRetainedTS has advanced beyond the youngest cleartext
commit_ts on every node; advise operators with strict compliance to gate
execution of enable-raft-envelope (Phase 2) until the rewrite and minRetainedTS
advancement are verified.


📥 Commits

Reviewing files that changed from the base of the PR and between f34d9b7 and 36ffb90.

📒 Files selected for processing (1)
  • docs/design/2026_04_29_proposed_data_at_rest_encryption.md

@claude

claude Bot commented Apr 29, 2026

Claude encountered an error — View job


I'll analyze this and get back to you.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6168382a96


Comment on lines +1056 to +1061
`EncryptionAdmin.GetSidecarState(Empty) → SidecarStateReport
{ wrapped_deks_by_id (map<uint32, bytes>),
active_storage_id (uint32), active_raft_id (uint32),
storage_envelope_active (bool),
raft_envelope_cutover_index (uint64),
latest_applied_index (uint64) }`

P1: Preserve per-node local_epoch during sidecar resync

The compaction fallback snapshot sent by GetSidecarState includes wrapped DEKs and cutover flags but omits each key's local_epoch, yet §5.5 says the follower overwrites its sidecar from this authoritative state. If a node had previously encrypted data under a DEK and then resyncs this way, its per-DEK epoch can roll back (or reset), so after restart it can regenerate previously used node_id || local_epoch || write_count nonces under the same key, which is catastrophic for AES-GCM. The resync payload needs to carry and preserve the follower's monotonic per-node epochs (or otherwise re-establish a nonce domain) before permitting further encryption.


Comment on lines +1671 to +1673
8. From the **next** entry after the flag entry's apply index
onward, every leader wraps new proposal `Data []byte` with
the raft DEK (§4.2). The flag entry itself is **not**

P1: Quiesce proposals across raft-envelope cutover

This cutover defines a strict index boundary (index > raft_envelope_cutover_index) but does not require pausing or sequencing client proposals while the enable-raft-envelope entry is in flight. In a busy leader, cleartext proposals can be appended after the cutover entry is proposed but before wrapping is enabled, giving them indexes above cutover; followers will then try to unwrap those cleartext payloads and fail apply. The rollout needs an explicit barrier (e.g., block proposal intake until the cutover entry is applied, then enable wrapping) to keep format and index boundaries consistent.


… + cutover quiescence)

P1 (codex on 6168382 line 286): §4.1 re-registration rule unconditionally bumped last_seen_local_epoch on same-full_node_id matches, allowing a node restored from stale disk/backup state to re-register with a LOWER local_epoch and reproduce previously-used nonces under the same DEK -- catastrophic AES-GCM nonce reuse. Split case 2 into two: (2) accept only if proposed local_epoch > registry.last_seen_local_epoch, bumping the registry; (3) reject with ErrLocalEpochRollback otherwise. Recovery is either manual sidecar bump or DEK rotation (which retires the registry slice).

P1 (codex on 6168382 line 1061): §5.5 GetSidecarState compaction-fallback overwrote the follower sidecar authoritatively without preserving per-DEK local_epoch, allowing a roll-back / reset of the per-(node, DEK) nonce counter on resync. Add a writer_registry_for_caller field to SidecarStateReport carrying the leader-recorded last_seen_local_epoch for THIS follower under each DEK; the recovering follower computes new_local_epoch = max(existing_sidecar, registry_for_caller) + 1 per DEK so the post-resync value is strictly above any nonce ever issued by this follower under that DEK. The +1 increment is the same crash-safe step §5.1 performs on every process start. The re-registration check (case 3 above) closes the loop on the next round-trip.
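The post-resync epoch computation is small enough to pin down directly. This is a sketch of the rule stated above; the function name is hypothetical, and the uint16 width matches the 16-bit nonce field.

```go
package main

// resyncLocalEpoch: after a full-state sidecar overwrite, the follower's
// per-DEK epoch must land strictly above anything it ever used, whether
// recorded in its own (possibly stale) sidecar or in the leader's
// writer_registry_for_caller entry. The +1 is the same crash-safe bump
// §5.1 performs on every process start.
func resyncLocalEpoch(existingSidecar, registryForCaller uint16) uint16 {
	e := existingSidecar
	if registryForCaller > e {
		e = registryForCaller
	}
	return e + 1
}
```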

P1 (codex on 6168382 line 1673): §7.1 Phase 2 had a propose-time / apply-time race -- a busy leader could append cleartext proposals after proposing the cutover entry but before flipping its wrap-on-propose switch (which fires at FSM apply), giving them indexes > raft_envelope_cutover_index without the §4.2 envelope. Followers would then unwrap them as raft envelopes and fail every apply with GCM verification errors. Add a leader-side proposal-quiescence barrier: block intake → drain in-flight → propose cutover → wait for local apply → flip switch → unblock. Single Raft commit RTT cost. Phase 1 enable-storage-envelope does NOT need this barrier because storage encryption is decided per-Put at FSM apply time (per-replica, deterministic) rather than per-proposal at propose time (leader-only).

§6.1 SidecarStateReport proto extended with writer_registry_for_caller. §9.1 startup refusal list extended with ErrLocalEpochRollback.
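The leader-side quiescence barrier described above (block → drain → propose → wait → flip → unblock) can be sketched as follows. Every hook (`drain`, `proposeCutover`, `waitApplied`) is a hypothetical stand-in for the real engine entry points, and the struct fields are illustrative flags, not the actual engine state.

```go
package main

// cutoverBarrier is a hypothetical holder for the two switches the barrier flips.
type cutoverBarrier struct {
	intakeBlocked bool // rejects USER proposals while true
	wrapEnabled   bool // wrap-on-propose switch for the raft envelope
}

func (b *cutoverBarrier) enableRaftEnvelope(
	drain, proposeCutover, waitApplied func() error) error {
	b.intakeBlocked = true                     // 1. block new USER proposal intake
	defer func() { b.intakeBlocked = false }() // 6. unblock on every exit path
	if err := drain(); err != nil {            // 2. in-flight proposals committed+applied
		return err
	}
	if err := proposeCutover(); err != nil {   // 3. admin-path proposal bypasses the block
		return err
	}
	if err := waitApplied(); err != nil {      // 4. wait for local apply of the flag entry
		return err
	}
	b.wrapEnabled = true // 5. flip wrap-on-propose BEFORE intake reopens
	return nil
}
```

Flipping `wrapEnabled` before the deferred unblock runs is what closes the propose-time/apply-time race: no USER proposal can land above the cutover index in cleartext.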
@bootjp

bootjp commented Apr 29, 2026

@claude review

@claude

claude Bot commented Apr 29, 2026

Claude encountered an error — View job


I'll analyze this and get back to you.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7bd5539178


Comment on lines +1211 to +1214
`RegisterEncryptionWriter` (§4.1) **after** it has
observed both the bootstrap entry and the
`enable-storage-envelope` flag entry in its applied log,
not at process start. Until both are present the storage

P1: Do not force initial nodes to re-register bootstrap writers

Step 1a already inserts writer-registry records during bootstrap-encryption, but this step says every node performs its first RegisterEncryptionWriter right after bootstrap/flag apply. Under the registry rules in §4.1, a registration with the same full_node_id and unchanged local_epoch is rejected as ErrLocalEpochRollback, so implementing both instructions together can make the initial rollout fail even when all nodes are healthy. Clarify that the bootstrap cohort must skip the immediate per-node registration (or must increment epoch before re-registering).


Comment on lines +915 to +919
4. **DEK retirement criterion.** A DEK is safe to unload only when
**both** of the following are true cluster-wide:
- The rewrite cursor for that DEK has reached the end of the
keyspace AND `elastickv_encryption_values_per_dek{key_id} == 0`
across every node (verified by `encryption status --verify`).

P1: Gate DEK retirement on WAL needs for raft-purpose keys

This retirement rule is MVCC-only (rewrite completion and values_per_dek), but the design also supports rotate-dek --purpose=raft. If applied to raft-purpose DEKs, operators can retire an old raft key while WAL still contains entries encrypted with that key_id; a restart or lagging follower replay would then hit unknown_key_id during apply. Raft-purpose retirement needs an additional guard tied to log/snapshot compaction state (e.g., no retained entries below last index encrypted by that DEK).


…ister + raft DEK WAL retention)

P1 (codex on 7bd5539 line 1214): §5.6 step 1a inserted the bootstrap cohort into the writer registry, but the §4.1 process-start trigger ALSO fired on a bootstrap node's first observation of enable-storage-envelope. Per the round-18 §4.1 case 3 monotonicity check (ErrLocalEpochRollback), the second registration with the same local_epoch as the bootstrap entry would be rejected -- breaking the cluster-wide rollout even when every node is healthy. Reword §4.1 process-start: trigger fires only on a fresh process load whose §5.1 local_epoch bump has just persisted a value strictly greater than the registry record. Bootstrap cohorts skip the second registration because their slot already contains the current local_epoch; the implementation must check the registry at coordinator startup and skip the propose if its slot is already current.

P1 (codex on 7bd5539 line 919): §5.4 retirement criterion was MVCC-only (rewrite cursor + minRetainedTS), but rotate-dek --purpose=raft retires raft DEKs whose ciphertext lives in the etcd raft WAL, not Pebble. Retiring a raft DEK while the WAL still contains entries encrypted under it would cause unknown_key_id apply failures on the next process restart (WAL replay) or on a lagging follower catch-up. Split the criterion by purpose: storage-purpose keeps the MVCC-driven gate; raft-purpose adds a WAL-driven gate (every node's log start index strictly greater than the largest log index ever proposed under the retiring raft DEK + last committed snapshot taken under the new raft DEK). New ErrRaftDEKWALStillReferences and metric elastickv_encryption_last_proposed_index_per_raft_dek for the operator to observe before retiring.
@bootjp

bootjp commented Apr 29, 2026

@claude review

@claude

claude Bot commented Apr 29, 2026

Claude encountered an error — View job


I'll analyze this and get back to you.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6b343d0339


Comment on lines +1410 to +1411
failure, return an error from `Apply` so the entry is not
silently skipped (consistency invariant — see CLAUDE.md self-

P1: Propagate unwrap failure via engine error path

This section says unwrap failures can be handled by “return[ing] an error from Apply so the entry is not silently skipped,” but the current raft contract does not support that: StateMachine.Apply returns any (not error), and internal/raftengine/etcd/engine.go::applyCommitted advances setApplied(entry.Index) regardless of the Apply return value. In a post-cutover node that hits a raft-envelope decrypt/tag failure, treating the failure as a normal Apply return would still mark the entry applied and continue replay, violating the consistency guarantee this design relies on. The design should require an engine-level error channel/return path that stops apply before setApplied when unwrap fails.


…t halt apply, not silently skip)

P1 (codex on 6b343d0 line 1411): §6.3 promised "unwrap failure propagates as an Apply error so the entry is not silently skipped" but the actual StateMachine.Apply contract returns "any" (no error), and engine.go::applyCommitted advances setApplied(entry.Index) regardless of the Apply return value. A raft-envelope decrypt failure (sidecar/keystore divergence or on-disk corruption) would silently mark the entry applied and continue replay -- exactly the silent-FSM-divergence the integrity tag was added to detect.

Fix: the engine hook itself returns a sentinel ErrRaftUnwrapFailed (or sibling ErrEncryptionApply for 0x03/0x04/0x05 internal errors) directly to applyCommitted, BEFORE calling fsm.Apply. applyCommitted is extended to recognise these sentinels and stop the apply loop WITHOUT advancing setApplied -- the process exits via the existing fatal-error path so an operator-supervised restart under a corrected keystore (§5.5 resync, §9.3 KEK custody) replays the same entry. §6.3 implementation plan, §7.1 Phase 2 pseudocode, and §10 self-review lens 1 all updated to reflect this fail-fast contract.
@bootjp

bootjp commented Apr 29, 2026

@claude review

@claude

claude Bot commented Apr 29, 2026

Claude encountered an error — View job


I'll analyze this and get back to you.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5e27f2d429


Comment on lines +1800 to +1804
1. block new proposal intake at the engine.Propose entrypoint
(return ErrEnvelopeCutoverInProgress to coordinator)
2. wait for the in-flight proposal queue to drain (all
previously-accepted proposals committed and applied)
3. propose the enable-raft-envelope entry (raftEncodeEncryptionRotation

P1: Exempt cutover entry from global propose block

This sequence blocks engine.Propose before submitting enable-raft-envelope, but internal/raftengine/etcd/engine.go currently exposes Propose as the write entrypoint; implementing this literally can reject the cutover proposal itself and leave the cluster stuck in a write-rejecting barrier. The cutover flow needs an explicit exception/bypass for the cutover proposal (or a narrower gate that only blocks user proposals) so step 3 can always be appended and committed.


string build_sha = 2;
bool sidecar_present = 3;
uint64 full_node_id = 4; // for §5.6 batch registry
uint32 local_epoch = 5; // for §5.6 batch registry

P2: Keep local_epoch wire types aligned with 16-bit nonce field

The design defines local_epoch as 16-bit and encodes it into a 2-byte nonce field (§4.1), but the RPC schema widens it to uint32; this mismatch invites silent truncation if values are copied back into sidecar/nonce state without a hard bounds check. A truncation on resync/registration can violate the monotonic epoch invariant that prevents nonce reuse, so the wire contract should be explicitly 16-bit-equivalent (or require strict <= 0xFFFF validation at decode points).


…+ epoch wire type)

P1 (codex on 5e27f2d line 1804): §7.1 Phase 2 step 7 quiescence barrier blocked engine.Propose globally before submitting the cutover entry, but step 3 of the same sequence is itself an engine.Propose call -- so the literal implementation would reject its own cutover proposal and leave the cluster stuck in a write-rejecting barrier forever. Specify a per-call source tag: USER proposals from coordinators are blocked, but the encryption-admin path (cutover proposals + RegisterEncryptionWriter triggered by ConfChangeAddLearner mid-cutover) bypasses with source = "encryption_admin".

P2 (codex on 5e27f2d line 1317): §6.1 RPC schema used uint32 for local_epoch but the §4.1 nonce field reserves only 16 bits. Without explicit decode-side validation, a corrupted/malicious response carrying a value > 0xFFFF would silently truncate when copied into the nonce field, breaking the monotonicity invariant the round-18 ErrLocalEpochRollback was added to enforce. Added explicit "value <= 0xFFFF or ErrLocalEpochOutOfRange" rule applied symmetrically to CapabilityReport.local_epoch and every value in SidecarStateReport.writer_registry_for_caller. New ErrLocalEpochOutOfRange added to §9.1 startup refusal list (and applies on every RPC decode site).
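The decode-side guard is a one-liner worth pinning down. A sketch, assuming a plain function at each RPC decode site; the function name is hypothetical.

```go
package main

import "errors"

var ErrLocalEpochOutOfRange = errors.New("local_epoch exceeds 16-bit nonce field")

// decodeLocalEpoch validates the uint32 wire field against the 2-byte nonce
// component: values above 0xFFFF are rejected, never silently truncated.
func decodeLocalEpoch(wire uint32) (uint16, error) {
	if wire > 0xFFFF {
		return 0, ErrLocalEpochOutOfRange
	}
	return uint16(wire), nil
}
```

Applying this symmetrically to CapabilityReport.local_epoch and every SidecarStateReport.writer_registry_for_caller value keeps the monotonicity invariant intact across the wire boundary.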
@bootjp

bootjp commented Apr 29, 2026

@claude review

@claude

claude Bot commented Apr 29, 2026

Claude encountered an error — View job


I'll analyze this and get back to you.
