issue #90: co-locate audit/email/cred/memory workers on broker host (dev) #92

Merged: hanwencheng merged 39 commits into main from claude/adoring-lehmann-3b1922 on May 20, 2026

@hanwencheng
Member

Summary

Dev-only co-location of the 4 service workers (audit / email / cred / memory) on the same EC2 box as the broker, behind per-worker nginx vhosts. The per-subdomain layout is the migration seam — moving any worker to its own host later only requires changing the A record + IAM principal. Per CLAUDE.md, production will isolate all services; this PR is the dev posture.

Topology after merge:

Hostname             Loopback port  systemd unit
broker.litentry.org  :8091          agentkeys-broker
signer.litentry.org  :8092          agentkeys-signer
audit.litentry.org   :9092          agentkeys-worker-audit   (Merkle relay)
email.litentry.org   :9093          agentkeys-worker-email   (SES + S3 inbox)
cred.litentry.org    :9094          agentkeys-worker-creds   (credential CRUD)
memory.litentry.org  :9095          agentkeys-worker-memory  (memory CRUD)

What changed

scripts/setup-broker-host.sh (+506 / -46)

  • Builds + installs all 4 worker binaries (--without-workers opts out).
  • Auto-generates /etc/agentkeys/worker-{creds,memory}.env with stable KEK secrets (preserved across re-runs — regenerating would invalidate every existing encrypted blob). worker-{audit,email}.env are non-secret and deterministic.
  • Drops 4 systemd units. worker-creds + worker-memory use /bin/sh -c 'export BROKER_CAP_PUBKEY_PEM="$(cat …)" && exec …' to inject the broker's session-pubkey PEM (it is multi-line, so it can't go in an EnvironmentFile).
  • Writes 4 nginx vhosts via a shared write_worker_nginx_site() helper. A → B (HTTP-only → :443 ssl) flip on second pass after LE cert is issued, matching the existing broker + signer pattern.
  • probe_or_die against /healthz on all 4 worker ports after restart.
  • New CLI flags: --audit-host, --email-host, --cred-host, --memory-host, --chain-rpc, --vault-bucket, --memory-bucket, --scope-addr, --registry-addr, --k3-counter-addr, --without-workers. All default-derived from $ISSUER_HOST + scripts/operator-workstation.env.
  • Idempotency hardening: re-runs without flags re-read previously-configured values from /etc/agentkeys/worker-{creds,memory}.env (mirrors the existing broker-unit detection block). So a first run with --chain-rpc https://devnet.example stays sticky on subsequent flag-less re-runs.

scripts/dns-upsert-workers.sh (NEW)

Single atomic Route 53 change-batch UPSERT for all 4 A records. Validates the caller is on agentkeys-admin (case-insensitive per CLAUDE.md), refuses RFC1918 / TEST-NET-2 (Cloudflare WARP / Zscaler / corporate VPN rewrites) / CGNAT EIPs, waits for Route 53 INSYNC + Cloudflare DoH propagation before exiting.
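
The address guard described above can be sketched as follows. This is a minimal illustration, not the script's actual logic: the CIDR blocks are the standard definitions of each named range, and the function name `rejection_reason` is hypothetical.

```python
import ipaddress
from typing import Optional

# Ranges named in the PR text. The exact CIDRs the script checks are an
# assumption; these are the standard definitions of each range.
REJECTED_NETWORKS = {
    "RFC1918": ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"],
    "TEST-NET-2": ["198.51.100.0/24"],
    "CGNAT": ["100.64.0.0/10"],
}


def rejection_reason(ip: str) -> Optional[str]:
    """Return the name of the rejected range containing ip, or None if usable."""
    addr = ipaddress.ip_address(ip)
    for name, cidrs in REJECTED_NETWORKS.items():
        if any(addr in ipaddress.ip_network(cidr) for cidr in cidrs):
            return name
    return None
```

Returning the range name rather than a bare boolean lets the caller print which class of address was rejected, which helps when a VPN has rewritten the detected EIP.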

scripts/verify-workers.sh (NEW)

Laptop-side end-to-end check: DNS resolves via Cloudflare DoH → TLS cert is Let's Encrypt with valid date → /healthz returns HTTP 200 with the per-worker expected body marker. Exits non-zero with per-failure diagnostic. --no-tls flag for the HTTP-only first-pass phase.

docs/cloud-setup.md (+88)

  • New §1.4 row in TOC.
  • New §7 "Service workers (audit / email / cred / memory)" with concern table (mirrors §6 signer), §7.1 DNS one-shot helper, §7.2 TLS cert loop + nginx flip, §7.3 verification.
  • §7 Cleanup renumbered to §8.

crates/agentkeys-worker-audit/src/main.rs + …-email/src/main.rs

GET /healthz → "ok" so probe_or_die can verify boot. worker-creds + worker-memory already had it.

scripts/operator-workstation.env

Derive WORKER_{AUDIT,EMAIL,CRED,MEMORY}_HOST + AGENTKEYS_WORKER_*_URL from $BROKER_HOST, mirroring the SIGNER_HOST pattern.

scripts/heima-scope-set.sh + heima-scope-revoke.sh (carry-over)

Graceful skip with {"ok":true,"skipped":"no-webauthn-k11"} when no mode:webauthn K11 is enrolled, so harness/v2-stage1-demo.sh (default stub mode) is fully CI-automatable without operator Touch ID. Fixes the Q2 issue raised in the prior session: "`bash harness/v2-stage1-demo.sh` is not fully automatic, it still ask me for touchid".

Reviewer notes

  • The PEM-injection /bin/sh -c wrapper in worker-creds / worker-memory units was tested by rendering the heredoc in isolation: it produces literal $(cat /var/lib/…/session-keypair.pub.pem) in the unit file, expanded by /bin/sh at service start, not at script-write time.
  • worker-creds + worker-memory Requires=agentkeys-broker.service so the session-pubkey PEM exists before they start. setup-broker-host.sh restarts workers AFTER signer (which depends on the PEM too) — order is backend + broker → signer → workers.
  • worker-audit's /var/lib/agentkeys/audit-leaves dir is created with mode 0750 owned by agentkeys; ProtectSystem=strict + ReadWritePaths=/var/lib/agentkeys allows the worker to write per-batch Merkle JSONL there.
  • KEK secrets in worker-creds.env + worker-memory.env are mode 0600, owner agentkeys. The other two env files are mode 0644 (no secrets).

Test plan

  • bash -n clean on all 3 scripts.
  • cargo check -p agentkeys-worker-{audit,email,creds,memory} exit 0.
  • Sourcing scripts/operator-workstation.env derives https://audit.litentry.org etc. correctly.
  • Heredoc renders ExecStart=/bin/sh -c 'export BROKER_CAP_PUBKEY_PEM="$(cat /var/lib/…/session-keypair.pub.pem)" && exec /usr/local/bin/agentkeys-worker-creds' (no script-time $-expansion).
  • Operator-driven on Heima Mainnet broker host (this PR):
    1. From laptop: bash scripts/dns-upsert-workers.sh (UPSERTs 4 A records on Route 53).
    2. On broker host: sudo bash scripts/setup-broker-host.sh --yes (writes HTTP-only nginx vhosts + systemd units, builds + installs all 4 worker binaries).
    3. On broker host: sudo certbot certonly --webroot … loop for the 4 new hosts.
    4. On broker host: sudo bash scripts/setup-broker-host.sh --yes (second pass flips nginx onto :443 ssl).
    5. From laptop: bash scripts/verify-workers.sh — expect all 4 green.

…-N recovery + companion daemon

P-256 ECDSA verify on-chain via pure-Solidity Jacobian-coords implementation
(no EIP-7212 precompile dependency — Heima is at London EVM). ~654k gas
per verify, sufficient for master-mutation frequency. RFC 6979 test vectors
pass.

K11Verifier extracts WebAuthn challenge from clientDataJSON at known byte
offset (daimo-style), reconstructs msgHash, calls P256Verifier. Binds K11
sig to operation challenge to prevent replay.

SidecarRegistry: splits into registerFirstMasterDevice +
registerAdditionalMasterDevice + revokeAgentDevice + revokeMasterDevice
(M-of-N quorum gated by recoveryThreshold). Stores k11PubX/k11PubY +
lastSignCount per device. Per-operator nonce + monotonic sign-count
defend against replay.

AgentKeysScope: K11Assertion struct gates setScopeWithWebauthn /
revokeScope; per-(operator, agent) scopeNonce binds K11 sig to current
state.

CLI: K11ChainAssertion struct + assert_webauthn_for_chain() extracts
(r, s, msgHash, pubX, pubY, authData, clientDataJSON, challengeLocation,
signCount) for chain submission. New --rp-id flag enables companion
credentials at companion.localhost (distinct platform keychain entry).
--emit-chain-payload outputs JSON for cast tx construction.

Daemon: new --master-companion mode runs a second daemon instance with
its own K10 + K11 at rp_id=companion.localhost. Serves HTTP API:
  GET  /v1/companion/whoami    — emits device identity
  POST /v1/companion/approve   — runs WebAuthn ceremony, returns chain payload

Scripts:
  scripts/heima-device-add.sh              — register companion as 2nd master
  scripts/heima-set-recovery-threshold.sh  — raise threshold to N
  scripts/heima-recovery.sh                — M-of-N master-device revoke

Harness:
  harness/v2-stage2-demo.sh                — idempotent 8-step demo

28 forge tests pass (P256: 8, K11: 6, AgentKeysV1: 14). Stage-2 demo
runs green in stub mode and re-runs green (idempotent). Full --webauthn
flow requires Touch ID + post-deploy contract addresses.

Closes part of #90:
  - On-chain P-256 verify of K11 assertions
  - Multi-master M-of-N recovery quorum
  - Multi-master pairing flow (companion daemon as mobile-app alternative)

Deferred to follow-up PRs:
  - audit-service worker (tier A Merkle relay)
  - email-service worker
  - K3 rotation operational runbook
  - Existing scripts/heima-{device-register,scope-set,scope-revoke}.sh
    migration to new contract surface (their K11 args changed shape)

Adds docs/v2-stage2-heima-deploy-and-test.md walking the operator
through redeploying the stage-2 contract set on Heima Mainnet,
re-bootstrapping the primary master, running the stage-2 demo, and
exercising the M-of-N recovery flow. Inherits all env setup from
docs/v2-stage1-migration-and-demo.md (no parallel test environment).

Harness fixes from the first dry-run:
- harness/v2-stage2-demo.sh step 5 simplifies to script-existence
  sanity check in stub mode (was: invoking dry-run which fails on
  missing companion K11 file).
- harness/v2-stage2-demo.sh step 7 same — verifies recovery script is
  invocable without requiring live chain state.
- scripts/heima-device-add.sh adds a dry-run path that doesn't require
  the companion K11 file (uses placeholder pubkey).
- scripts/heima-recovery.sh adds a dry-run path that doesn't require
  the deployer mnemonic / ethers node_modules.

Result: bash harness/v2-stage2-demo.sh --stub --skip-build runs all
8 steps green and is idempotent on re-run.

Stage-2 demo now owns the full lifecycle end-to-end:
- step 3: idempotent contract deploy (skips if already on chain;
  --redeploy forces fresh deploy; reads addresses from broadcast file;
  writes them to scripts/operator-workstation.env)
- step 4: idempotent primary-master bootstrap via new
  scripts/heima-register-first-master.sh (calls registerFirstMasterDevice
  with K11 pubX/pubY loaded from the operator's enrollment JSON)
- steps 5-8 unchanged: companion daemon spin-up, 2nd-master register,
  recoveryThreshold update, recovery dry-run
- step 9: summary with all deployed addresses

Now actually deployed to Heima Mainnet (verified live):
  P256Verifier:    0xb74f0aaf9b72b4e7da872f77c63d805bf1937190
  K11Verifier:     0x73446fc9919a0a539b8b08dbda615a64b796ca4f
  SidecarRegistry: 0x9306c524a5e5c33e9a905b956204207ccaf7a7a1
  AgentKeysScope:  0x1276b94f57fd4086670d66acb8c75058176df399
  K3EpochCounter:  0x66c08748a6cfa14d9fefaaf5147e41a98db24f53
  CredentialAudit: 0xe827ba44931aef8c6f3abfec6b90ecf59f797576

Primary master registered on the new SidecarRegistry, tx
0x5f3a79bc970062ec74aa0deb5618f8a527f638a6d24ba3c4144f09a49600876d
(block 9623082).

Re-runs are idempotent — all 9 steps log 'skip'/'ok' without
re-submitting any tx.

The four scripts only referenced by harness/v2-stage2-demo.sh now live
under harness/scripts/ — same place as the orchestrator that calls them.
Operator-facing stage-1 helpers in scripts/ stay put.

  scripts/heima-device-add.sh              → harness/scripts/heima-device-add.sh
  scripts/heima-recovery.sh                → harness/scripts/heima-recovery.sh
  scripts/heima-register-first-master.sh   → harness/scripts/heima-register-first-master.sh
  scripts/heima-set-recovery-threshold.sh  → harness/scripts/heima-set-recovery-threshold.sh

The moved scripts compute REPO_ROOT from two levels up
(harness/scripts/<f>.sh → repo root via /../..); the demo paths were
updated to point at the new harness/scripts/ location.

Hardened the deploy-presence check in step 3:
- Distinguishes RPC failure (exit nonzero) from "no code at address"
  (exit zero with "0x").
- RPC failure → retry up to 8 times with 3s sleep → die rather than
  redeploy on uncertain state.
- "No code" → genuine; trigger redeploy as before.

Heima's RPC hits TLS-handshake-EOF transients regularly; this fix
prevents an unnecessary redeploy that would orphan the previous set.

Same hardening on the balance check in step 3.
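
As a rough sketch of the hardened presence check: the probe either raises on RPC failure or returns the code string, and only a successful "0x" answer counts as "no code". The names `RpcError` and `has_code` are hypothetical, and reading "up to 8 times" as 8 total attempts is an assumption.

```python
import time
from typing import Callable


class RpcError(Exception):
    """Raised by the probe when the RPC call itself fails (hypothetical type)."""


def has_code(get_code: Callable[[], str], attempts: int = 8,
             delay_s: float = 3.0, sleep=time.sleep) -> bool:
    """Return True if code exists at the address, False if the RPC says "0x".

    Transient RPC failures are retried; if every attempt fails we raise
    instead of guessing, so the caller never redeploys on uncertain state.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            code = get_code()
        except RpcError as err:
            last_error = err
            if attempt < attempts - 1:
                sleep(delay_s)
            continue
        # The RPC answered: "0x" (or empty) is a genuine "no code" result.
        return code not in ("", "0x")
    raise RuntimeError(f"RPC unreachable after {attempts} attempts") from last_error
```

The key design point is that "unknown" is a third outcome: it aborts the run rather than collapsing into "not deployed".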
…8 message

Stage-2 demo step 5 now derives the companion's on-chain device_key_hash
from its K11 cose-pubkey (cast keccak <cose_pubkey_hex>) and passes it
to the daemon via --companion-device-key-hash. The daemon's
/v1/companion/whoami then returns the real hash that
registerAdditionalMasterDevice will use as the storage key, so the
later revoke flow can find the device on chain.

Stage-2 demo step 8: clearer skip message + when --webauthn is set,
prints the companion's device_key_hash + the exact re-run command for
executing the revoke. The previous message implied --webauthn alone
would do something; really we need a target hash too.
…files

Adds harness/scripts/_lib.sh with resolve_master_key():
- $HEIMA_DEPLOYER_KEY_FILE env var (raw hex or mnemonic)
- ~/.agentkeys/heima-deployer.key (raw hex, used by stage-1 operator)
- ./test-hei (mnemonic, legacy)
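
A minimal sketch of that lookup, assuming the three sources are tried in the order listed and that a missing override file falls through to the next candidate (neither is confirmed by the PR text). The raw-key-versus-mnemonic heuristic and the Python function names are illustrative only; the real helper is a shell function.

```python
import re
from pathlib import Path
from typing import Mapping, Optional, Tuple

# Heuristic (assumed): a raw key is exactly 64 hex chars, optionally 0x-prefixed.
RAW_KEY_RE = re.compile(r"(0x)?[0-9a-fA-F]{64}")


def classify(secret: str) -> str:
    """Label file contents as a raw hex key or a mnemonic."""
    return "raw-key" if RAW_KEY_RE.fullmatch(secret) else "mnemonic"


def resolve_master_key(env: Mapping[str, str], home: Path,
                       cwd: Path) -> Optional[Tuple[str, str]]:
    """Return (kind, secret) from the first existing candidate file, else None."""
    candidates = []
    override = env.get("HEIMA_DEPLOYER_KEY_FILE")
    if override:
        candidates.append(Path(override))
    candidates.append(home / ".agentkeys" / "heima-deployer.key")
    candidates.append(cwd / "test-hei")
    for path in candidates:
        if path.is_file():
            secret = path.read_text().strip()
            return classify(secret), secret
    return None
```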

Patches the 3 scripts that previously only handled mnemonic files:
- heima-device-add.sh
- heima-set-recovery-threshold.sh
- heima-recovery.sh (preserves --dry-run placeholder path)

Fixes a real bug: scripts died with 'missing mnemonic' for operators
who bootstrapped from a raw private key (the stage-1 path stores
the deployer key at ~/.agentkeys/heima-deployer.key, not a mnemonic
at ./test-hei).

Also fixes step 8's stale whoami file: always curl fresh so the
device_key_hash hint reflects the currently-running daemon, not a
prior run where the daemon hadn't been started with the real hash.

Bug 1 (root cause of step 7 K11VerificationFailed reverts):
assert_webauthn_for_chain was passing the 32-byte expected_challenge as
a "message" to assert_webauthn_inner_parts, which sha256'd it again
before using as the WebAuthn challenge. The on-chain K11Verifier
expects the WebAuthn challenge to BE the operation challenge (no
extra hash); double-hashing made clientDataJSON.challenge !=
expected_b64 → ChallengeMismatch / verifyAssertion returns false →
contract reverts with K11VerificationFailed.

Fix: refactored assert_webauthn_inner_parts to take a [u8; 32]
challenge directly. The legacy assert_webauthn_inner path sha256's
the message itself before calling (preserves existing behavior).
assert_webauthn_for_chain passes the expected_challenge through
unchanged.
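
To make the mismatch concrete, here is a small illustration of the two paths. It models only the challenge encoding (unpadded base64url, as used in clientDataJSON); the function names are hypothetical and nothing here reproduces the Rust code.

```python
import base64
import hashlib


def b64url(data: bytes) -> str:
    """Unpadded base64url, the encoding WebAuthn uses for clientDataJSON.challenge."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def client_challenge_fixed(expected_challenge: bytes) -> str:
    """Post-fix path: the 32-byte operation challenge is used unchanged."""
    if len(expected_challenge) != 32:
        raise ValueError("expected a 32-byte challenge")
    return b64url(expected_challenge)


def client_challenge_buggy(expected_challenge: bytes) -> str:
    """Pre-fix path: the challenge was treated as a message and hashed again."""
    return b64url(hashlib.sha256(expected_challenge).digest())
```

A verifier that compares against `b64url(expected_challenge)` accepts the first form and rejects the second, which is the behavior described for Bug 1.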

Bug 2 (step 6 cast send "invalid string length"):
The companion daemon was receiving an empty --companion-k11-cred-id
(demo didn't pass it), so /v1/companion/whoami returned k11_cred_id="".
The brittle xxd|head|sed pipeline in heima-device-add.sh produced an
all-zeros bytes32 by accident, but the demo's tuple construction had
other issues that confused the cast parser.

Fix: demo step 5 now computes the cred-id hash from the K11 file
(keccak256-style sha256 of the b64url credential id) and passes it
to the daemon via --companion-k11-cred-id. heima-device-add.sh uses
the hash directly from whoami without re-encoding. Also bumped the
empty attestation arg from "0x" to "0x00" (cast tolerates the latter
more consistently).

Added a sanity-check loop in heima-device-add.sh that validates each
bytes32 arg has length 66 before invoking cast, so future malformed
inputs fail with a clear error rather than cast's opaque parser msg.
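
The sanity check amounts to the following, shown in Python for illustration (the real check is a shell loop). The length-66 rule comes from the text above; the additional hex-character check and the name `check_bytes32` are assumptions.

```python
import re

# "0x" + 64 hex characters = 66 characters total.
BYTES32_RE = re.compile(r"0x[0-9a-fA-F]{64}")


def check_bytes32(name: str, value: str) -> str:
    """Return value unchanged if it is a well-formed bytes32, else raise."""
    if len(value) != 66:
        raise ValueError(f"{name}: expected 66 chars (0x + 64 hex), got {len(value)}")
    if not BYTES32_RE.fullmatch(value):
        raise ValueError(f"{name}: not a 0x-prefixed hex string")
    return value
```

Naming the offending argument in the error is the point: an empty cred-id hash then fails up front with a readable message.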
WebAuthn assert page now surfaces the role + RP ID prominently so the
operator can't confuse which credential they're about to sign with:
- Color: blue accent for PRIMARY MASTER (rp_id=localhost),
  purple for COMPANION MASTER (rp_id=companion.localhost)
- Role badge at the top of the card with emoji + label
- Dedicated RP-ID callout warning to verify the Touch ID prompt
  matches the displayed RP
- Button text reads "Sign as PRIMARY MASTER" / "Sign as COMPANION MASTER"
- Page <title> includes the role so the OS tab list shows it

The M-of-N recovery flow opens TWO browser windows in quick
succession (one for each daemon's K11 ceremony) — without this
distinction the operator could tap the wrong Touch ID prompt and
silently produce an assertion the contract rejects.

Stage-2 demo grows from 9 to 10 steps and now exercises the full
M-of-N revocation path as part of the default --webauthn flow:

  Step 8 NEW — Register synthetic 3rd master (the "spare").
    The spare is a fresh P-256 keypair generated via openssl, NOT a
    real WebAuthn passkey. It registers as a 3rd master with roles=3
    (CAP_MINT|RECOVERY) via primary K11 sig (1 Touch ID at localhost).
    State persists at /tmp/agentkeys-spare-current/ for step 9.
    Why synthetic: the spare is "lost" by design — never needs to
    sign for its own revocation (primary + companion provide the
    quorum). Skipping its WebAuthn enrollment saves a Touch ID
    without weakening the test of any contract surface.

  Step 9 NEW — Revoke spare via 2-of-2 quorum.
    Calls heima-recovery.sh with target=spare hash. The script:
    - Asks primary K11 to sign OP_REVOKE_MASTER challenge (1 Touch ID
      at localhost — UI shows PRIMARY MASTER badge).
    - Asks companion daemon /v1/companion/approve to sign same
      challenge (1 Touch ID at companion.localhost — UI shows
      COMPANION MASTER badge).
    - Submits revokeMasterDevice(spareHash, [primarySig, companionSig]).
    - Contract verifies 2-of-2 quorum + bumps operatorNonce.
    Post-tx verify: isActive(spare) == false.

  Step 10 NEW — Cleanup spare local state.
    Removes /tmp/agentkeys-spare-current/. The on-chain entry stays
    as revoked=true (audit trail — no on-chain delete by design).

End state after a successful run:
  - 2 active masters: primary (roles=7) + companion (roles=3)
  - 1 revoked master: spare (roles=3, revoked=true)
  - recoveryThreshold = 2
  - operatorNonce += 3 (register-2nd-master, set-threshold, revoke)

Touch IDs on a fresh run: 6 total
  - companion enroll (step 5, once per setup)
  - companion register (step 6, once per setup)
  - set threshold (step 7, once per setup)
  - spare register (step 8, fresh per run)
  - primary signs spare revoke (step 9)
  - companion signs spare revoke (step 9)

Re-run after this completes: steps 1-7 + 10 skip, steps 8-9 generate
a fresh spare (new keypair) and revoke it — 3 Touch IDs per re-run.
This makes the demo a repeatable end-to-end test of the M-of-N path
without bricking the operator's setup.

Once a companion has been revoked on chain (e.g. as part of an M-of-N
quorum test), it can never re-enter the registered-master set under
the same deviceKeyHash. Stage-2 demo now detects this and enrolls a
fresh companion under a bumped rp_id (companion.localhost →
companion-v2.localhost → companion-v3.localhost) so the M-of-N revoke
test in step 9 has 2 distinct ACTIVE masters to form the quorum.

Changes:
- harness/v2-stage2-demo.sh step 5: scans existing K11 files for an
  active-on-chain companion. If none found, picks the lowest free
  version slot and enrolls a fresh K11 there.
- harness/v2-stage2-demo.sh step 5: passes the computed rp_id to the
  daemon via new --companion-rp-id flag.
- crates/agentkeys-daemon/src/companion.rs: rp_id is now stored in
  CompanionState + threaded through /v1/companion/whoami responses
  and assert_webauthn_for_chain calls.
- crates/agentkeys-daemon/src/main.rs: new --companion-rp-id flag.
- harness/scripts/heima-device-add.sh: reads rp_id from
  /v1/companion/whoami and derives the K11 file path from it.

Net effect: re-running the demo after a 2-of-2 revoke now enrolls
a fresh companion-vN, re-establishes a 2-active-master state, and
proceeds with the next spare-revoke cycle without operator hand-fixing.
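
The slot selection can be sketched like this, assuming the input is a map from each rp_id that already has a K11 file to whether that companion is still active on chain. The function names and the input shape are hypothetical; only the rp_id naming scheme comes from the text above.

```python
from typing import Dict, Tuple


def companion_rp_id(version: int) -> str:
    """Slot 1 is companion.localhost; later slots are companion-vN.localhost."""
    if version < 1:
        raise ValueError("version starts at 1")
    return "companion.localhost" if version == 1 else f"companion-v{version}.localhost"


def choose_companion(k11_active: Dict[str, bool]) -> Tuple[str, bool]:
    """Return (rp_id, needs_enroll).

    Reuse an existing companion that is still active on chain; otherwise
    pick the lowest version slot that has no K11 file yet.
    """
    for rp_id, active in k11_active.items():
        if active:
            return rp_id, False
    version = 1
    while companion_rp_id(version) in k11_active:
        version += 1
    return companion_rp_id(version), True
```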
Enables harness/v2-stage1-demo.sh to run green against the new
SidecarRegistry + AgentKeysScope contracts deployed in stage 2.

Changes:

- heima-device-register.sh becomes a thin wrapper: forwards to
  harness/scripts/heima-register-first-master.sh when no first
  master is registered; logs skip otherwise. The pre-stage-2
  registerMasterDevice() was split into registerFirstMasterDevice +
  registerAdditionalMasterDevice; this script handles the former.

- heima-device-revoke.sh: detects master vs agent target and
  delegates accordingly. Agent revoke uses the new revokeAgentDevice
  (no K11 needed). Master revoke delegates to heima-recovery.sh
  which collects the M-of-N K11 quorum.

- heima-scope-set.sh: real WebAuthn ceremony, computes the contract's
  expected_challenge per OP_SET_SCOPE encoding (servicesDigest +
  scopeNonce + chainid), builds K11Assertion struct, calls new ABI
  (bytes K11 -> struct). Stub bytes no longer satisfy the gate.

- heima-scope-revoke.sh: same migration as scope-set, computing
  OP_REVOKE_SCOPE challenge.

- All four scripts now use harness/scripts/_lib.sh's
  resolve_master_key, supporting both raw-key files
  (~/.agentkeys/heima-deployer.key) and mnemonic files (./test-hei).

Effect: operator can now run `bash harness/v2-stage1-demo.sh --webauthn`
against the same Heima Mainnet deployment that stage-2 uses, exercising
the full operator lifecycle (init -> register -> agent -> scope -> audit)
on the new contracts.

scripts/heima-k3-rotate.sh: operator-driven K3 epoch advance via
K3EpochCounter.advanceEpoch(). Idempotent (--target-epoch N skips if
currentEpoch >= N), supports dry-run, signs from the wallet that is
the contract's signerGovernance.

docs/runbook-k3-rotation.md — step-by-step operator runbook:
prerequisites, the one-command flow, post-rotation verification,
when to rotate (quarterly hygiene + TEE-compromise indicator), lazy
vs eager re-encryption trade-offs, and the stage-3 migration path to
move signerGovernance from EOA to M-of-N multisig.

Verified end-to-end on Heima Mainnet (dry-run): K3EpochCounter at
0xeacc97d4e7854c52d4736e5fba2dc7c2c2b147d9 has currentEpoch=1 and
signerGovernance points at the deployer.

Contract surface (CredentialAudit.sol):
- New `appendRoot(operatorOmni, merkleRoot, batchEntryCount)` stores a
  per-operator AuditRoot entry, emits AuditRootAppended. Operators
  reconstruct per-event proofs from leaves in S3.
- New `verifyEntryInRoot(operatorOmni, rootIndex, proof[], leaf)`
  validates a sorted-pairs Merkle proof on chain. Matches OpenZeppelin
  convention so the Rust-side proof emission is directly verifiable
  without further transformation.
- Existing `append()` per-event path (tier C) untouched.

Forge test test_CredentialAudit_AppendRoot_AndVerifyMembership covers
the round-trip with a 4-leaf tree.

New crate agentkeys-worker-audit:
- `merkle.rs`: minimal Merkle root + proof helpers using keccak256 with
  sorted-pairs encoding (matches the contract verifier byte-for-byte).
  Doc tests + 4 unit tests pass.
- `state.rs`: per-operator in-memory event queue with flush semantics.
  Drains the queue, computes Merkle root, writes per-event leaves +
  proofs to a JSONL file at /tmp/audit-leaves-<root>.jsonl.
- `handlers.rs`: HTTP surface
    POST /v1/audit/append              — queue event
    POST /v1/audit/flush/:operator     — drain one queue
    POST /v1/audit/flush-all           — drain all queues
- `main.rs`: bind axum at 127.0.0.1:9092; periodic auto-flush every
  --flush-interval-secs (default 300s; 0 = manual only). Each flush
  logs the Merkle root + leaves path. Chain submission via
  `cast send appendRoot` is operator-driven (separate from this
  process so the worker doesn't need a deployer key).

End-state: operators wanting per-event-tx semantics keep using tier C
(`heima-credential-audit.sh` direct write). Operators wanting batched
gas (one tx per N events / per 5min) point their daemon at this worker
and emit per-event POSTs; the worker computes roots and the operator
periodically submits roots via `cast send`.

New crate agentkeys-worker-email. Surfaces:

  POST /v1/email/send
    Body: { from, to[], subject, body_text, body_html? }
    Wraps aws-sdk-sesv2::SendEmail with the operator's SES identity
    (must be verified per the #83 setup workflow). Returns the SES
    message_id.

  GET /v1/email/inbox/:actor_omni
    Lists objects under s3://$AGENTKEYS_VAULT_BUCKET/bots/<actor_omni>/inbound/.
    Inbound routing itself is the SES routing Lambda from #83; this
    worker only exposes what's already been delivered to S3.

  CLI args:
    --bind             default 127.0.0.1:9093
    --inbox-bucket     env AGENTKEYS_VAULT_BUCKET, required

Builds against aws-sdk-sesv2 1.118 + aws-sdk-s3 1.132. No new
dependencies introduced at the workspace level (aws-config + s3 are
already used by worker-creds).

Operator workflow: spin up alongside worker-creds + worker-memory on
the broker host, route per-agent outbound mail through this worker
instead of having each agent directly call SES. Cap-token verification
on /v1/email/send is left as a follow-up (current shape assumes the
worker is on a private interface — operators expose it only on the
sidecar daemon's localhost, same as worker-creds).

Live E2E test of scripts/heima-k3-rotate.sh per agentkeys-harness skill:

- Round 1: epoch 1 → 2 (1 tx)
- Round 2: epoch 2 → 3 (1 tx)
- Round 3: target=3 (already there) → skip, no tx, 0 gas
- Round 4: target=6 (3-step advance) → 3 txs

Total: 5 real txs on K3EpochCounter = 0xeacc97d4e7854c52d4736e5fba2dc7c2c2b147d9.

The contract is forward-only by design — no "rotate back" — so the
"back and forth" test is bounded to forward-path correctness + the
idempotency skip on re-targets-to-current. Both work as designed.
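
The four rounds above follow from a forward-only rule: one transaction per epoch step, and nothing when the target is already reached. A small model of that arithmetic (the function name `planned_advances` is hypothetical):

```python
from typing import Iterable, List, Tuple


def planned_advances(current_epoch: int, target_epoch: int) -> int:
    """Number of advanceEpoch() transactions needed; 0 if already at or past target."""
    return max(0, target_epoch - current_epoch)


def replay(start_epoch: int, targets: Iterable[int]) -> Tuple[int, List[int]]:
    """Apply a sequence of targets and return (final_epoch, txs_per_round)."""
    epoch, txs = start_epoch, []
    for target in targets:
        steps = planned_advances(epoch, target)
        txs.append(steps)
        epoch += steps
    return epoch, txs
```

Replaying the rounds listed above (start at epoch 1, targets 2, 3, 3, 6) gives 1, 1, 0 and 3 transactions, for 5 in total and a final epoch of 6.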

K3EpochCounter is now at epoch 6 on Heima Mainnet. The signer enclave
will retain historical K3_v[1..5] for decrypt of pre-rotation blobs;
new writes use K3_v[6].

Two fixes:

1. Enrollment page (serve_enroll_page) now matches the assert-page
   visual language — role badge (PRIMARY MASTER blue, COMPANION MASTER
   purple), RP-ID surfaced explicitly, button text reads "Enroll as
   PRIMARY MASTER" / "Enroll as COMPANION MASTER". Previously the
   enrollment page was role-agnostic which made it easy to tap Touch
   ID on the wrong RP when re-enrolling.

2. WebAuthn user.name shown in the macOS Touch ID dialog ("Use Touch
   ID to sign in to 'localhost' with your passkey for <NAME>") was
   previously the full 64-char operator_omni hex, which truncates
   awkwardly on screen. Now reads "AgentKeys Primary Master
   (0x941cb1c3…)" or "AgentKeys Companion Master (0x941cb1c3…)" —
   human-readable, plus a 10-char omni prefix to disambiguate operators.

Takes effect on NEW enrollments only — existing credentials retain
whatever user.name was set when they were originally enrolled. To
refresh the display name, delete ~/.agentkeys/k11/<omni>--<rp>.json
and re-enroll.
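
For reference, the new display-name shape can be reproduced as below. This is an illustrative Python rendering, not the daemon's Rust code; the role keys and the function name are hypothetical, and the prefix length follows the "0x941cb1c3…" example (10 characters including "0x").

```python
ROLE_LABELS = {
    "primary": "AgentKeys Primary Master",
    "companion": "AgentKeys Companion Master",
}


def passkey_display_name(role: str, operator_omni_hex: str) -> str:
    """Build the user.name string: role label plus a 10-char omni prefix."""
    omni = operator_omni_hex
    if not omni.startswith("0x"):
        omni = "0x" + omni
    return f"{ROLE_LABELS[role]} ({omni[:10]}…)"
```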

The "white text in white background" in the macOS Passkey-source
filter row is macOS system UI (the picker for which provider supplies
the passkey — iCloud Keychain, 1Password, etc.); it's outside our HTML
control. The other observed truncation is fixed by this commit.

Operator-facing summary of what K3 rotation does and doesn't change:
- contract addresses, devices, scopes, threshold unchanged
- on-chain epoch counter advances + emits K3Rotated event
- signer enclave retains historical K3 versions for legacy decrypt
- workers swap to new epoch for new writes via SSE
- one-command operator action: `bash scripts/heima-k3-rotate.sh`
- links to full runbook at docs/runbook-k3-rotation.md
- notes the stage 1-2 simplification (KEK from env per §22b.2) means
  rotation is forward-compatible but not yet driving worker re-key

Also documents the eager-re-encrypt follow-up gated behind a confirmed
TEE compromise scenario (stage 3 tracked in §22b.5).

Codex flagged 8 findings; 7 are addressed here (C1, C2, C3/M1, H1, H2, M2 +
test coverage). The remaining one (codex H3 "K10+K11") is a false positive:
msg.sender check IS the K10 signature — EVM tx signing is secp256k1 over
the whole tx by the master wallet. Added comments where helpful.

Contract fixes (require redeploy):

  C1: SidecarRegistry.revokeMasterDevice — refuse to revoke if it would
      leave < max(1, recoveryThreshold) active recovery-capable masters.
      Prevents permanent operator stranding.

  C2: SidecarRegistry.setRecoveryThreshold — refuse newThreshold >
      activeRecoveryMasterCount. Prevents permanent operator stranding
      via unsatisfiable quorum.

  C3/M1: CredentialAudit.appendRoot — auth-gate by operator's master
      wallet (via injected SidecarRegistry reference). Previously any
      account could pollute an operator's root list.

  H1: K11Verifier.verifyAssertion — three new envelope checks:
      - authData[0:32] == expectedRpIdHash (per-credential, stored on
        register at DeviceEntry.k11RpIdHash). Prevents cross-RP replay.
      - authData[32] has UP|UV flags. Prevents stolen-device-without-
        biometric assertions.
      - clientDataJSON starts with `{"type":"webauthn.get"`. Prevents
        replay of webauthn.create (enrollment) assertions.

  M2: CredentialAudit + worker Merkle — domain-separate leaves (0x00
      prefix) and internal nodes (0x01 prefix). Prevents an internal-
      node digest from impersonating a leaf at shorter depth.
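
To illustrate the M2 fix, here is a minimal sorted-pairs Merkle sketch with domain-separated leaves and nodes. Two loud assumptions: SHA-256 stands in for keccak256 (which is not in Python's standard library), and the way the prefix is combined with the sorted pair, as well as the promotion of an odd trailing node, are guesses rather than the contract's byte layout.

```python
import hashlib
from typing import List

LEAF_PREFIX = b"\x00"
NODE_PREFIX = b"\x01"


def _digest(data: bytes) -> bytes:
    # Stand-in hash: the real worker and contract use keccak256.
    return hashlib.sha256(data).digest()


def leaf_hash(payload: bytes) -> bytes:
    return _digest(LEAF_PREFIX + payload)


def node_hash(a: bytes, b: bytes) -> bytes:
    lo, hi = sorted((a, b))  # sorted pairs: order of siblings does not matter
    return _digest(NODE_PREFIX + lo + hi)


def _next_level(level: List[bytes]) -> List[bytes]:
    nxt = [node_hash(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
    if len(level) % 2:
        nxt.append(level[-1])  # odd node promoted unchanged (assumption)
    return nxt


def merkle_root(leaves: List[bytes]) -> bytes:
    if not leaves:
        raise ValueError("need at least one leaf")
    level = [leaf_hash(p) for p in leaves]
    while len(level) > 1:
        level = _next_level(level)
    return level[0]


def merkle_proof(leaves: List[bytes], index: int) -> List[bytes]:
    level = [leaf_hash(p) for p in leaves]
    proof = []
    while len(level) > 1:
        sibling = index ^ 1
        if sibling < len(level):
            proof.append(level[sibling])
        level = _next_level(level)
        index //= 2
    return proof


def verify(root: bytes, proof: List[bytes], payload: bytes) -> bool:
    acc = leaf_hash(payload)
    for sibling in proof:
        acc = node_hash(acc, sibling)
    return acc == root
```

With the prefixes in place, presenting the concatenated children of an internal node as a "leaf" hashes under 0x00 rather than 0x01, so it can no longer reproduce that node's digest at a shorter depth.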

ABI changes:
  - SidecarRegistry.registerFirstMasterDevice + registerAdditionalMaster
    now take an extra bytes32 k11RpIdHash arg (the operator's K11 enroll
    rp_id is hashed and stored).
  - K11Verifier.verifyAssertion takes the rpIdHash; callers
    (SidecarRegistry, AgentKeysScope) read entry.k11RpIdHash.
  - CredentialAudit constructor takes the SidecarRegistry address.
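
The three H1 envelope checks can be modeled off-chain as below. This is a sketch for reasoning about test vectors, not the Solidity implementation: the returned strings mirror the test names in this PR rather than the contract's error ABI, `AuthDataTooShort` is a hypothetical extra guard, and the flag bit values (UP = 0x01, UV = 0x04) are taken from the WebAuthn authenticator data layout.

```python
import hashlib
from typing import Optional

FLAG_UP = 0x01  # user present
FLAG_UV = 0x04  # user verified
GET_PREFIX = b'{"type":"webauthn.get"'
MIN_AUTH_DATA_LEN = 37  # 32-byte rpIdHash + 1 flags byte + 4-byte signCount


def check_envelope(auth_data: bytes, client_data_json: bytes,
                   expected_rp_id_hash: bytes) -> Optional[str]:
    """Return None if the envelope passes, else the name of the failed check."""
    if len(auth_data) < MIN_AUTH_DATA_LEN:
        return "AuthDataTooShort"  # hypothetical guard, not named in the PR
    if auth_data[0:32] != expected_rp_id_hash:
        return "RpIdHashMismatch"
    flags = auth_data[32]
    if flags & (FLAG_UP | FLAG_UV) != (FLAG_UP | FLAG_UV):
        return "UserPresenceMissing"
    if not client_data_json.startswith(GET_PREFIX):
        return "WrongClientDataType"
    return None


def rp_id_hash(rp_id: str) -> bytes:
    """sha256(rp_id), the value the harness passes as k11RpIdHash."""
    return hashlib.sha256(rp_id.encode("utf-8")).digest()
```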

Harness changes:
  - heima-register-first-master.sh + heima-device-add.sh + heima-register-
    spare-master.sh compute sha256(rp_id) from the K11 enrollment file
    and pass it as the new arg.
  - v2-stage2-demo.sh step 6 + 7 fail-fast on device-add/threshold-set
    failures + verify on-chain state matches before advancing to step 9.
    Codex H2: previously silent failures could false-green step 9.

Tests:
  + 5 new K11Verifier tests: RpIdHashMismatch, UserPresenceMissing (no
    flags, UP-only), WrongClientDataType (webauthn.create), all pass.
  + CredentialAudit_AppendRoot_RejectsNonMaster (vm.prank attacker).
  + Internal-node-as-leaf attack test in both forge + Rust Merkle suite.
  - Total: 33 forge tests (was 28), 7 worker-audit unit tests (was 6),
    all green.

Deploys will fail against the existing PR #87-deployed contracts —
operator must redeploy via the demo's step 3 (forced) or by running
`bash harness/v2-stage2-demo.sh --redeploy`.

New addresses (PR commit 5834c1d 'fix(stage-2): codex adversarial review'):
  P256Verifier:    0xda5b772f9d6c09abe80414eea908612df9b54749
  K11Verifier:     0x5a441431f08e0f5f5ed10659620cb4e0e814e627
  SidecarRegistry: 0x1ac62f1c2d828476a5d784e850a700dc1f17e0be
  AgentKeysScope:  0xd44b375daefc65768f417d0f0125b68d5ba7df3b
  K3EpochCounter:  0x6c9e675c699a06acefbc156afdee6bfbfe32ccb3
  CredentialAudit: 0x63c4545ac01c77cc74044f25b8edea3880224577

Previously-deployed instances (bc232ebcb47fa672aa2a1b2b0481c7ff9a86531b
et al) are now abandoned. They have the pre-codex-fix ABI which is
incompatible — DeviceEntry layout changed (added k11RpIdHash field).
Operator's primary master must re-register via
harness/scripts/heima-register-first-master.sh against the new
SidecarRegistry; companion + spare flows then continue normally.
…dev)

Dev-only co-location of the 4 service workers on the same EC2 box as the
broker, behind per-worker nginx vhosts. CLAUDE.md: "for production, we
will isolate all the services for the security issue" — the per-subdomain
layout is the migration seam, so a future move to dedicated hosts only
needs the A record + IAM principal to change.

Topology:
  broker.litentry.org  :8091  agentkeys-broker
  signer.litentry.org  :8092  agentkeys-signer
  audit.litentry.org   :9092  agentkeys-worker-audit   (Merkle relay)
  email.litentry.org   :9093  agentkeys-worker-email   (SES + S3 inbox)
  cred.litentry.org    :9094  agentkeys-worker-creds   (credential CRUD)
  memory.litentry.org  :9095  agentkeys-worker-memory  (memory CRUD)

setup-broker-host.sh — builds + installs the 4 worker binaries, auto-
generates worker-{creds,memory}.env with stable KEK secrets (preserved
across re-runs so existing blobs stay decryptable), writes 4 systemd
units, writes 4 nginx vhosts via shared write_worker_nginx_site(), and
probes /healthz on each port post-restart. New CLI flags: --audit-host,
--email-host, --cred-host, --memory-host, --chain-rpc, --vault-bucket,
--memory-bucket, --scope-addr, --registry-addr, --k3-counter-addr,
--without-workers. Re-runs without flags now re-read previously-configured
values from /etc/agentkeys/worker-{creds,memory}.env so the script stays
idempotent for non-default deployments.

dns-upsert-workers.sh (NEW) — single atomic Route 53 change-batch UPSERT
for all 4 A records. Validates the caller is on agentkeys-admin, refuses
RFC1918 / TEST-NET-2 (Cloudflare WARP / Zscaler / corporate VPN) EIPs,
waits for Route 53 INSYNC + Cloudflare DoH propagation before exiting.
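
The address guard described above can be sketched as follows. This is a minimal illustration, not the script's actual logic (the script is bash); `check_eip` and `is_test_net_2` are hypothetical names, and the only ranges checked are the two the commit message names.

```rust
use std::net::Ipv4Addr;

/// True when `ip` falls in TEST-NET-2 (198.51.100.0/24, RFC 5737).
fn is_test_net_2(ip: Ipv4Addr) -> bool {
    let o = ip.octets();
    o[0] == 198 && o[1] == 51 && o[2] == 100
}

/// Hypothetical guard: refuse addresses that cannot be a real Elastic IP.
fn check_eip(raw: &str) -> Result<Ipv4Addr, String> {
    let ip: Ipv4Addr = raw
        .trim()
        .parse()
        .map_err(|e| format!("not an IPv4 address: {raw:?} ({e})"))?;
    // RFC 1918: 10/8, 172.16/12, 192.168/16.
    if ip.is_private() {
        return Err(format!("{ip} is RFC 1918 private space"));
    }
    if is_test_net_2(ip) {
        return Err(format!("{ip} is in TEST-NET-2 (198.51.100.0/24)"));
    }
    Ok(ip)
}
```

Rejecting before the Route 53 UPSERT keeps a VPN-assigned address from ever reaching the A records.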

verify-workers.sh (NEW) — laptop-side end-to-end check: DNS resolves via
Cloudflare DoH → TLS cert is Let's Encrypt → /healthz returns HTTP 200
with the per-worker expected body marker. Exits non-zero with per-failure
diagnostics. --no-tls for the HTTP-only first-pass phase.

worker-audit/main.rs + worker-email/main.rs: GET /healthz → "ok" so
probe_or_die can verify boot (worker-creds + worker-memory already had it).

operator-workstation.env: derive WORKER_{AUDIT,EMAIL,CRED,MEMORY}_HOST +
AGENTKEYS_WORKER_*_URL from \$BROKER_HOST, mirroring the SIGNER_HOST
pattern.

docs/cloud-setup.md: new §1.4 (TOC row) + §7 "Service workers" with the
concern table (mirrors §6 signer), §7.1 DNS one-shot helper, §7.2 TLS
cert loop + nginx flip, §7.3 verification. Existing §7 Cleanup → §8.

heima-scope-set.sh + heima-scope-revoke.sh: graceful skip with
{"ok":true,"skipped":"no-webauthn-k11"} when no mode:webauthn K11 is
enrolled, so harness/v2-stage1-demo.sh (default stub mode) is fully CI-
automatable without operator Touch ID.

worker-creds and worker-memory both call profile_env() for all THREE
contract addresses (SidecarRegistry, AgentKeysScope, K3EpochCounter) at
state construction — verified live by the boot failure on broker host:

  Error: SIDECAR_REGISTRY_ADDRESS_HEIMA must be set
  Caused by: environment variable not found

The auto-generated /etc/agentkeys/worker-creds.env was only writing
SCOPE_CONTRACT_ADDRESS_HEIMA, omitting the other two — fixed.

Also added AGENTKEYS_CHAIN=heima to both env files so the chain-profile
resolution is explicit instead of relying on the worker-side default
(matches what the existing chain helpers do).

New step exercises the 4 co-located service workers as a tier-A relay:
queue 2 audit events → flush → on-chain CredentialAudit.appendRoot →
verify rootCount + getRoot match. Plus an email worker /healthz +
/inbox smoke.

  Stage-1 demo: STEP_TOTAL 15 → 16, new step 15 between audit-append
                and summary; summary renumbered to step 16.
  Stage-2 demo: STEP_TOTAL 10 → 11, new step 10 between M-of-N revoke
                and cleanup; cleanup renumbered to step 11.

scripts/heima-worker-smoke.sh (NEW) — drives the full flow:
  1. precheck both workers' /healthz
  2. POST 2 events → audit worker /v1/audit/append
  3. POST /v1/audit/flush/<operator_omni> → Merkle root + leaves
  4. cast send CredentialAudit.appendRoot from operator master wallet
  5. cast call rootCount + getRoot to verify on-chain root matches flush
  6. GET /v1/email/inbox/<actor_omni> as soft-warn smoke (the broker
     EC2 IAM lacks s3:ListBucket on the inbox bucket today — out-of-scope
     follow-up; worker is deployed + /healthz green so the demo
     continues without breaking the chain green-bar)

Live-tested 4 rounds against Heima Mainnet — rootCount progressed
0→1→2→3→4→5→6→7→8 across stage-1 + stage-2 runs with all 8 on-chain
Merkle roots verified by getRoot() readback. Idempotency: every re-run
is a clean skip (no chain mutation) or adds a fresh tier-A root.

Sibling fixes (same bug class — stale DeviceEntry struct offsets after
codex H1 added k11RpIdHash + k11PubX + k11PubY):

  heima-agent-create.sh + heima-device-revoke.sh — switched the
    idempotency check from hex-offset slicing of getDevice() to the
    typed isActive(bytes32)(bool) view. The old code read offset 320
    for registeredAt; after the struct grew, registeredAt now lives at
    offset 512, so the offset-based check always returned 'not yet
    registered' on re-run and registerAgentDevice reverted with
    DeviceAlreadyRegistered (0xa98bbce0). isActive is struct-agnostic.

  heima-scope-set.sh + heima-scope-revoke.sh — when USE_WEBAUTHN=0
    (stub mode) AND the local K11 file is mode=webauthn (from a prior
    real ceremony), skip cleanly instead of triggering Touch ID. Demo
    stub-mode runs on a laptop with prior webauthn enrollment were
    otherwise prompting for Touch ID and dying on the dismissed
    dialog. The 'stub-mode-refuses-touchid' skip payload makes this
    explicit.

Closes the OIDC isolation gap from PR #92 review (issue #90 Q1 + Q3): the
broker had full federation infrastructure (handlers/oidc.rs, mint.rs,
sts.rs) but the workers bypassed it — every S3 call went through the
broker EC2 instance profile, so the per-actor IAM scoping defined in
provision-vault-role.sh's PrincipalTag policy was never exercised.

Worker code change (backwards compatible):

  crates/agentkeys-worker-creds/src/aws_creds.rs (NEW)
    - OptionalStsCreds axum extractor: parses three optional headers
        X-Aws-Access-Key-Id
        X-Aws-Secret-Access-Key
        X-Aws-Session-Token
      Returns None only when all three are absent; a partial set is an
      error (refuse to mint a half-authed S3 client).
    - StsCreds::build_s3_client(region) — per-request S3 client backed
      by the passed-through STS creds.
    - s3_for_request(default, region, override) — falls back to the
      default instance-profile client when override is None.
    - 4 unit tests covering header presence / absence / partial.

  crates/agentkeys-worker-creds/src/handlers.rs
    cred_store + cred_fetch + cred_teardown — accept OptionalStsCreds,
    use the per-request client when present.

  crates/agentkeys-worker-memory/src/handlers.rs
    memory_put + memory_get + memory_teardown — same pattern; re-exports
    aws_creds from agentkeys_worker_creds (no duplication).

Backward compat: requests without the three X-Aws-* headers fall back
to state.s3 (instance profile) — existing stage-1 + stage-2 demo flows
keep working unchanged.

harness/v2-stage3-demo.sh (NEW, 8 steps)
  End-to-end OIDC isolation proof on Heima Mainnet:

    1. SIWE wallet_sig auth → session JWT
    2. POST /v1/mint-oidc-jwt → STS-compatible web identity token
    3. AssumeRoleWithWebIdentity → STS creds tagged with
       PrincipalTag/agentkeys_actor_omni = derive_omni(master wallet)
    4. POSITIVE: PUT s3://vault/bots/<own actor_omni>/credentials/…
       → HTTP 200
    5. NEGATIVE: PUT s3://vault/bots/<wrong actor_omni>/credentials/…
       → AccessDenied (IAM rejects cross-actor write — the proof)
    6+7. Same positive+negative pair on the memory bucket — soft-skip
       when memory bucket not yet provisioned (follow-up).
    8. Cleanup with admin profile.

Live-tested against Heima Mainnet. Step 5 verified: AWS IAM itself
rejected the cross-actor PUT with AccessDenied — proves the
${aws:PrincipalTag/agentkeys_actor_omni} scoping in
scripts/provision-vault-role.sh works as designed. Even if a worker
were compromised, it could not write to another actor's prefix when
using STS creds passed through from the broker mint flow.

Architectural answers to the review (#90 Q1 + Q2):

  Q1 ("is OIDC disrupted by the new service isolation design?"):
    Was, yes — workers bypassed federation. NOW WIRED.
    Workers respect STS creds when passed; fall back to instance
    profile otherwise so existing stage-1+2 flows are unchanged.

  Q2 ("why does broker need s3:ListBucket — Lambda should sort
    incoming email into per-actor folders"):
    User is right architecturally. The 500 we soft-warned on in
    /v1/email/inbox is the symptom of the same OIDC bypass — the
    email worker uses instance profile and tries global ListObjects
    without scoping. Architecturally correct flow: SES inbound →
    Lambda sorts to bots/<actor>/inbound/ → email worker reads via
    OIDC-scoped STS creds, never global ListBucket. The fix is the
    same shape as this PR — pass-through STS creds via X-Aws-*
    headers — but is left as a follow-up: this PR ships the
    plumbing + proves OIDC works end-to-end; wiring the email worker
    + Lambda routing is a separate change. Tracked in #90 followups.

Addresses 2 of 4 codex adversarial findings on commit 913179a:

[P2 — downgrade attack] aws_creds.rs OptionalStsCreds extractor silently
fell back to the broker EC2 instance profile when caller omitted X-Aws-*
headers. A malicious caller could deliberately drop the headers to bypass
the OIDC-scoped IAM session and get broker-wide S3 access.

Fix: `AGENTKEYS_WORKER_REQUIRE_STS=1` env var puts the worker in strict
mode — every request must carry all three X-Aws-* headers or gets HTTP
401. Also: partial header sets (1 or 2 of 3 present) ALWAYS reject with
401 regardless of strict mode — silently dropping half-passed creds is
the same downgrade surface. Default off for backward compat; production
deploys should turn it on.
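
The resulting header contract can be sketched as a pure decision function. This is a minimal model under stated assumptions, not the axum extractor in aws_creds.rs: `StsDecision` and `decide` are hypothetical names, and `require_sts` stands in for `AGENTKEYS_WORKER_REQUIRE_STS=1`.

```rust
/// Outcome of inspecting the three X-Aws-* headers on one request.
#[derive(Debug, PartialEq)]
enum StsDecision {
    /// All three headers present: build a per-request S3 client from them.
    UsePassedCreds,
    /// No headers and strict mode off: use the instance-profile client.
    FallBackToInstanceProfile,
    /// Reject with HTTP 401; the string says why.
    Reject401(&'static str),
}

fn decide(
    access_key_id: Option<&str>,
    secret_access_key: Option<&str>,
    session_token: Option<&str>,
    require_sts: bool,
) -> StsDecision {
    let present = [access_key_id, secret_access_key, session_token]
        .iter()
        .filter(|h| h.is_some())
        .count();
    match (present, require_sts) {
        (3, _) => StsDecision::UsePassedCreds,
        (0, false) => StsDecision::FallBackToInstanceProfile,
        (0, true) => StsDecision::Reject401("strict mode: STS headers required"),
        // 1 or 2 of 3 present: always rejected, regardless of strict mode.
        _ => StsDecision::Reject401("partial STS header set"),
    }
}
```

Only the `(0, false)` arm reaches the instance profile, which is why turning strict mode on closes the downgrade path.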

[P3 — credential leak via Debug] StsCreds previously derived Debug, so
any future tracing::debug! or dbg!() call would log secret_access_key
and session_token verbatim. Custom Debug impl now redacts both and
shows only the access_key_id prefix (which AWS already records in
CloudTrail).
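
A redacting Debug impl of this kind might look like the sketch below. The field names follow the commit message; the 4-character prefix length and the `***` suffix are assumptions, not the values used in aws_creds.rs.

```rust
use std::fmt;

// Fields are only read by the Debug impl in this sketch.
#[allow(dead_code)]
struct StsCreds {
    access_key_id: String,
    secret_access_key: String,
    session_token: String,
}

impl fmt::Debug for StsCreds {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Assumption: a 4-character prefix is enough to correlate with logs.
        let prefix: String = self.access_key_id.chars().take(4).collect();
        let shown = format!("{prefix}***");
        f.debug_struct("StsCreds")
            .field("access_key_id", &shown)
            .field("secret_access_key", &"<redacted>")
            .field("session_token", &"<redacted>")
            .finish()
    }
}
```

Because the impl is hand-written, adding a new secret field later does not expose it by default; it has to be listed explicitly.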

New tests:
  - debug_redacts_secret_and_session_token (asserts the Debug output
    doesn't contain the secret bytes; <redacted> marker present)
  - parser_distinguishes_no_headers_from_partial (locks the extractor's
    contract — no headers = backward compat, partial = always reject)

Two codex findings deliberately left as follow-ups, not fixed in this
commit:

[P2 — memory worker OIDC not proven] The harness only mints
agentkeys-vault-role creds, which scope to the vault bucket only. The
memory worker writes to a separate memory bucket which isn't covered.
A dedicated agentkeys-memory-role with the same tag-scoping pattern is
the architecturally correct fix; tracked as PR followup.

[P2 — vault bucket policy allows whole-bucket ListBucket] In
scripts/apply-vault-bucket-policy.sh:109 — pre-existing, separate from
this PR's surface. Adding an s3:prefix=bots/${aws:PrincipalTag/…} condition
to the bucket-policy ListBucket statement closes the cross-actor key-name
enumeration. Filed for the bucket-policy hardening followup.

Lands the two findings deferred from commit 18e709b. Both verified live
on Heima Mainnet via the extended harness/v2-stage3-demo.sh (11 steps,
all green).

[P2 — memory worker OIDC scoping] NEW agentkeys-memory-role + dedicated
memory bucket, mirroring the vault data-class layout per arch.md §17.2.
A future memory-worker compromise now cannot reach the credentials
bucket and vice versa.

  scripts/provision-memory-bucket.sh  (NEW) — mirror of provision-vault-bucket.sh
  scripts/provision-memory-role.sh    (NEW) — federated trust + 3-statement
                                              inline policy scoped to
                                              $MEMORY_BUCKET/bots/${PrincipalTag}/memory/*
  scripts/apply-memory-bucket-policy.sh (NEW) — v3 bucket policy

[P2 — bucket-policy ListBucket whole-bucket allow] Was: one statement
listed [Get, Put, Delete, ListBucket] under one Resource[bucket,
bucket/...] with NO s3:prefix condition — any tagged session could
enumerate all keys. Now: SPLIT into two statements:

  VaultListV3 / MemoryListV3 — ListBucket ONLY, on the bucket ARN,
    Condition StringLike s3:prefix = bots/${PrincipalTag}/<class>/*
  VaultObjectsV3 / MemoryObjectsV3 — Get/Put/Delete on the
    prefixed-object ARN, no prefix condition (resource ARN already scopes)

  scripts/apply-vault-bucket-policy.sh  (UPDATED) — v2 → v3 split
  scripts/apply-memory-bucket-policy.sh (NEW)    — v3 split from day one

Demo extended (harness/v2-stage3-demo.sh, STEP_TOTAL 8 → 11):

  step 3:  mint TWO STS sessions (vault role + memory role)
  step 4-5: vault PUT positive (own) + negative (other) — pre-existing
  step 6:  vault LIST negative (other prefix → AccessDenied) — codex P2 verifier
  step 7-8: memory PUT positive (own) + negative (other)
  step 9:  memory LIST negative (other prefix → AccessDenied)
  step 10: cross-role isolation — vault creds → memory bucket → AccessDenied
                                 + memory creds → vault bucket → AccessDenied
  step 11: cleanup

Also: `expect_access_denied` helper distinguishes IAM-rejection
(AccessDenied / HTTP 403) from setup-bug failures (NoCredentialsErr,
NoSuchBucket, InvalidAccessKeyId, TokenRefreshRequired). Naive
`grep AccessDenied` would pass on any failure — codex's exact warning.
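
The distinction the helper draws can be sketched as a classifier over the failed command's output. The real helper is bash; this Rust version is only an illustration, and `Verdict`, `classify`, and the substring matching are assumptions. The marker strings are the ones listed above.

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    /// IAM rejected the call: the negative test passed.
    IamDenied,
    /// The call failed for an unrelated reason: the test proved nothing.
    SetupBug(&'static str),
    /// The call succeeded, so isolation is broken.
    UnexpectedSuccess,
}

const SETUP_BUG_MARKERS: [&str; 4] = [
    "NoCredentialsErr",
    "NoSuchBucket",
    "InvalidAccessKeyId",
    "TokenRefreshRequired",
];

fn classify(exit_ok: bool, stderr: &str) -> Verdict {
    if exit_ok {
        return Verdict::UnexpectedSuccess;
    }
    // Check setup-bug markers first so they are never mistaken for a denial.
    for marker in SETUP_BUG_MARKERS {
        if stderr.contains(marker) {
            return Verdict::SetupBug(marker);
        }
    }
    if stderr.contains("AccessDenied") || stderr.contains("403") {
        Verdict::IamDenied
    } else {
        Verdict::SetupBug("unrecognised failure")
    }
}
```

Only `IamDenied` should count as a pass; everything else either breaks isolation or says nothing about it.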

operator-workstation.env:
  + MEMORY_BUCKET=agentkeys-memory-${ACCOUNT_ID}
  + MEMORY_ROLE_ARN=arn:aws:iam::${ACCOUNT_ID}:role/agentkeys-memory-role

Live-tested 2026-05-20 on Heima Mainnet:
  - memory bucket created (AssumedArn=…agentkeys-memory-role)
  - vault-bucket policy v2 → v3 swap (2 statements live)
  - memory-bucket policy v3 from scratch (2 statements live)
  - 11/11 demo steps green:
      [4]  vault PUT  own prefix       → SUCCEEDED
      [5]  vault PUT  other prefix     → AccessDenied
      [6]  vault LIST other prefix     → AccessDenied
      [7]  memory PUT own prefix       → SUCCEEDED
      [8]  memory PUT other prefix     → AccessDenied
      [9]  memory LIST other prefix    → AccessDenied
      [10] vault creds → memory bucket → AccessDenied
      [10] memory creds → vault bucket → AccessDenied

All three demos (stage-1, stage-2, stage-3) green on Heima Mainnet after
the codex review fixes. Clippy clean on worker-creds + worker-memory.
PR ready to merge.

User's call-out: "the cred encryption and decryption is not tested".
Stage-3 previously proved IAM scoping at the AWS layer but skipped the
worker's AES-256-GCM envelope, so the actual encrypt→S3→decrypt path
through the HTTP API was unexercised. The envelope.rs primitive has 8
unit tests, but the wire-protocol roundtrip wasn't.

Stage-3 demo extended (STEP_TOTAL 11 → 13):

  [11] Cred worker encrypt/decrypt roundtrip:
       1. mint cred-store cap via POST /v1/cap/cred-store (broker)
       2. POST /v1/cred/store with cap + base64(plaintext)
          → worker KEK-encrypts (AES-256-GCM, AAD-bound to
            operator+actor+service+k3_epoch), S3 PUTs the envelope
       3. mint cred-fetch cap via POST /v1/cap/cred-fetch
       4. POST /v1/cred/fetch with cap
          → worker S3 GETs the envelope, KEK-decrypts, returns plaintext
       5. assert returned plaintext == original (byte-for-byte)
  [12] Memory worker encrypt/decrypt roundtrip:
       same shape against /v1/memory/put + /v1/memory/get. Memory worker
       has no dedicated cap-mint endpoint yet (follow-up); cred-* caps
       work against memory because both workers verify the same broker-
       signed CapToken shape with the same CapOp::Store / CapOp::Fetch.

Graceful skip handling:

  - 'agent scope not set on chain' → skip with 'run stage-1 --webauthn first'
  - 'AGENTKEYS_CHAIN_RPC_HTTP not set' → skip with 'redeploy broker'
  - 'DeviceRoleMissing' → skip with 'out-of-scope here'

These map cleanly to operator-actionable prerequisites; demo continues
green without those steps when prerequisites aren't met, but the
prerequisite is reported, not hidden.

Broker fix: setup-broker-host.sh now bakes AGENTKEYS_CHAIN +
AGENTKEYS_CHAIN_RPC_HTTP into the broker's systemd Environment= lines.
Previously the broker process had no chain RPC, so /v1/cap/cred-{store,
fetch} hit 502 'RPC URL not set' at request time. This was a pre-existing
gap surfaced by exercising the cap-mint path for the first time in this
PR — the broker's stand-alone deploy never hit cap.rs's chain check
before because no demo step minted caps.
…p 13)

Three changes from user review:

1. NEW stage-3 step 13: NEGATIVE broker cap-mint isolation.
   Try to mint a cap-token with operator_omni != session_omni → expect
   HTTP 4xx with OperatorMismatch. This proves the MOST UPSTREAM
   isolation gate works: actor A's session JWT cannot mint caps for
   actor B. If this ever silently returns 200, every cred + memory
   blob in S3 is compromised — A could mint B's cap, hand to worker,
   worker writes under B's prefix.

   Live-verified on Heima Mainnet 2026-05-20:
     [13] NEGATIVE cap-mint cross-actor → HTTP 403 OperatorMismatch ✓

   Independent of broker redeploy: session-omni check fires BEFORE the
   chain RPC check in handlers/cap.rs, so this gate works on the
   current (stale-RPC) broker too.

2. CLAUDE.md — NEW "Per-actor + per-data-class isolation invariants
   (issue #90)" section codifies the 4-layer defense:

     Layer 1 — broker cap-mint   → session_omni == operator_omni
     Layer 2 — worker chain-verify → independent re-check of layer 1
     Layer 3 — AWS IAM PrincipalTag → s3 resource scoping per-actor
     Layer 4 — bucket separation  → per-data-class IAM roles

   Test-discipline rule: every PR adding a new worker, data class, or
   broker auth method MUST extend the stage-3 demo with negative
   isolation tests for all four layers. Don't ship features with only
   POSITIVE-path coverage.

3. CLAUDE.md — answers "why no /v1/cap/memory-* endpoint" with a
   concrete example: cap-tokens are data-class-agnostic. The same Store
   cap minted for service=openrouter can be POSTed to either
   /v1/cred/store (writes to vault bucket credentials/) or
   /v1/memory/put (writes to memory bucket memory/). The URL picks
   the data class; the cap just authorizes the operation. Adding
   dedicated memory cap endpoints would add audit clarity ("this cap
   was minted intending memory access") but no security boundary —
   isolation comes from the per-data-class IAM roles (layer 4).
   Deferred until payments-worker forces a third data class.
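
The ordering claim for the step 13 gate in change 1 (identity check before the chain RPC check) can be sketched as follows. This is a minimal model, not the code in handlers/cap.rs: `authorize_mint`, `MintError`, and the `ChainRpcUnset` variant are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
enum MintError {
    /// HTTP 403: the session's actor is not the operator named in the request.
    OperatorMismatch,
    /// HTTP 502: the broker has no chain RPC configured.
    ChainRpcUnset,
}

fn authorize_mint(
    session_omni: &str,
    operator_omni: &str,
    chain_rpc_url: Option<&str>,
) -> Result<(), MintError> {
    // Identity gate first: it needs no chain access.
    if session_omni != operator_omni {
        return Err(MintError::OperatorMismatch);
    }
    // Chain-dependent checks run only after the identity gate passes.
    if chain_rpc_url.is_none() {
        return Err(MintError::ChainRpcUnset);
    }
    Ok(())
}
```

With this ordering a cross-actor request is rejected with OperatorMismatch even on a broker that has no RPC URL, which matches the live observation above.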
…vault + memory)

User callout — "make it explicit that one cannot pollute other permission."
Before this commit, cap-tokens didn't carry a data-class binding: a
cred-store cap and a memory-put cap were structurally identical. The
URL the cap was POSTed to picked the bucket. Isolation lived only at
the AWS IAM PrincipalTag + per-data-class IAM-role layer. If the IAM
grants were ever accidentally broadened, cross-data-class pollution
would slip through silently.

Now: data_class is a SIGNED FIELD in the cap payload. The cap layer
itself enforces per-data-class isolation, ahead of any AWS call.

Schema change (REQUIRED field, no backward compat — coordinated upgrade):

  enum DataClass { Credentials, Memory }
  struct CapPayload {
    ...
    op: CapOp,
    data_class: DataClass,   // NEW
    ...
  }

Broker (crates/agentkeys-broker-server/src/handlers/cap.rs):
  - Add DataClass enum (mirror of worker's), add to CapPayload
  - mint_cap signature gains data_class param; statically derived per route
  - NEW endpoints: cap_memory_put + cap_memory_get (mint with DataClass::Memory)
  - Existing cap_cred_store + cap_cred_fetch mint with DataClass::Credentials

Broker routes (crates/agentkeys-broker-server/src/lib.rs):
  + .route("/v1/cap/memory-put", post(cap_memory_put))
  + .route("/v1/cap/memory-get", post(cap_memory_get))

Worker side (crates/agentkeys-worker-creds/src/verify.rs):
  - Add DataClass enum + field to CapPayload + DataClassMismatch error
  - NEW pub fn check_data_class(token, expected) — symmetric with check_op
  - Tests: data_class_serializes_snake_case + check_data_class_accepts_match
           + check_data_class_rejects_cross_class

Worker handlers (worker-creds + worker-memory):
  - verify_cap now calls check_data_class with their respective class:
      worker-creds  → DataClass::Credentials
      worker-memory → DataClass::Memory
  - Reject mismatched caps with HTTP 403 cap_data_class_mismatch
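
A std-only sketch of the check, assuming a `CapPayload` reduced to the one field that matters here. The enum and function names follow the commit message; the struct shape, the `as_str` helper, and the error fields are assumptions (the real code lives in verify.rs and presumably uses serde for the snake_case names).

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataClass {
    Credentials,
    Memory,
}

impl DataClass {
    /// snake_case wire name (hand-rolled here to stay dependency-free).
    fn as_str(self) -> &'static str {
        match self {
            DataClass::Credentials => "credentials",
            DataClass::Memory => "memory",
        }
    }
}

/// Reduced payload: every other signed field is omitted in this sketch.
struct CapPayload {
    data_class: DataClass,
}

#[derive(Debug, PartialEq)]
struct DataClassMismatch {
    expected: DataClass,
    got: DataClass,
}

/// Each worker passes its own class as `expected`.
fn check_data_class(token: &CapPayload, expected: DataClass) -> Result<(), DataClassMismatch> {
    if token.data_class == expected {
        Ok(())
    } else {
        Err(DataClassMismatch { expected, got: token.data_class })
    }
}
```

A handler would map the `Err` case to HTTP 403 `cap_data_class_mismatch` before making any AWS call.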

Demo extension (harness/v2-stage3-demo.sh, STEP_TOTAL 14 → 16):
  [11] cred encrypt/decrypt roundtrip — now uses /v1/cap/cred-store
  [12] memory encrypt/decrypt roundtrip — now uses /v1/cap/memory-put (NEW endpoint)
  [14] NEW negative test: mint cred-class cap, POST to /v1/memory/put
       → expect HTTP 403 cap_data_class_mismatch
  [15] NEW negative test: mint memory-class cap, POST to /v1/cred/store
       → expect HTTP 403 cap_data_class_mismatch

CLAUDE.md ("Per-actor + per-data-class isolation invariants"):
  Replaced "why no memory cap-mint endpoint" section (now obsolete) with
  "Cap-tokens are data-class-explicit" — explains the 4-endpoint shape,
  shows the concrete reject example, justifies route-per-class over a
  data_class query param (broker can't accidentally mint the wrong
  variant from a typed-route handler).

Tests:
  worker-creds verify::tests — 14/14 (3 new for DataClass)
  broker-server handlers::cap::tests — 24/24 (1 new for data_class serialization)
  cargo build -p worker-creds -p worker-memory -p broker-server — exit 0

Live deploy: requires broker host redeploy via setup-broker-host.sh to
pick up the new mint_cap signature + new memory routes. The stage-3
demo steps 14+15 will skip cleanly until the redeploy lands — the
isolation IS enforced (workers reject cred-class caps), but the new
endpoints don't exist on the current broker yet.

After redeploying with the data_class change (commit 690f54c), step 11
of the stage-3 demo surfaced a SECOND broker-side env gap:

  HTTP 502 from /v1/cap/cred-store:
    {"error":"SIDECAR_REGISTRY_ADDRESS_HEIMA unset","reason":"chain_rpc_error"}

The broker's handlers/cap.rs reads three contract addresses at request
time to verify device + scope + k3_epoch on chain:
  - SIDECAR_REGISTRY_ADDRESS_HEIMA
  - SCOPE_CONTRACT_ADDRESS_HEIMA
  - K3_EPOCH_COUNTER_ADDRESS_HEIMA

Before this commit, setup-broker-host.sh baked AGENTKEYS_CHAIN_RPC_HTTP
into the broker systemd unit but NOT the contract addresses. The cap-
mint code path had never been exercised before this PR, so the gap
went unnoticed.

Fix (setup-broker-host.sh): add the three contract addresses to the
broker's Environment= block, pulled from $REGISTRY_ADDR / $SCOPE_ADDR
/ $K3_COUNTER_ADDR (already populated earlier in the script via the
sourced scripts/operator-workstation.env). The operator's
operator-workstation.env stays the single source of truth for contract
addresses across laptop + broker host.

Stage-3 demo also gets a sibling skip-detection (harness/v2-stage3-demo.sh)
so steps 11+12+14+15 cleanly skip with the redeploy-broker message
instead of failing on this specific error shape.

To unblock the stage-3 worker encrypt/decrypt + cross-class-rejection
tests after this commit:
  ssh broker.litentry.org "cd ~/agentKeys && git pull && bash scripts/setup-broker-host.sh --yes"
…H1 alignment)

Closes user-reported step-11 regression after broker redeploy:

  cap-mint returned HTTP 403 — body: {"error":"device is not active on chain",
  "reason":"device_not_active"}

Same bug class I fixed earlier in scripts/heima-agent-create.sh +
scripts/heima-device-revoke.sh (commit 0981a88). Both the broker's
handlers/cap.rs::parse_device_entry AND the worker's
crates/agentkeys-worker-creds/src/verify.rs::parse_device_entry were
still slicing the OLD 7-word DeviceEntry layout. After codex H1
inserted 4 new fields (k11CredId, k11RpIdHash, k11PubX, k11PubY), the
struct grew to 11 ABI words, but neither parser was updated.

  word 0  operatorOmni    bytes32
  word 1  actorOmni        bytes32
  word 2  k11CredId        bytes32
  word 3  k11RpIdHash      bytes32  (NEW, codex H1)
  word 4  k11PubX          uint256  (NEW)
  word 5  k11PubY          uint256  (NEW)
  word 6  tier             uint8 (padded)
  word 7  roles            uint8 (padded)
  word 8  registeredAt     uint64 (padded)
  word 9  lastSignCount    uint32 (padded)
  word 10 revoked          bool (padded)

Before this commit, both parsers read:
  roles        → word 4 (which is now k11PubX)
  registeredAt → word 5 (which is now k11PubY — always 0 for agents)
  revoked      → word 6 (which is now tier)

For agent devices (k11PubX = k11PubY = 0), registeredAt parsed as 0 →
broker returned DeviceNotActive even though the device WAS active.

Fix: both parsers now read from the correct 11-word offsets + check
hex.len() >= 11 * 64.
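
The corrected offsets can be sketched as below. This is an illustrative parser, not the one in cap.rs or verify.rs: the `DeviceEntry` struct is reduced to the three fields the old code misread, and the helper names are hypothetical. Word indices come from the layout table above.

```rust
#[derive(Debug, PartialEq)]
struct DeviceEntry {
    roles: u8,
    registered_at: u64,
    revoked: bool,
}

const WORD_HEX: usize = 64; // one 32-byte ABI word as hex characters
const WORDS: usize = 11;

fn word(hex: &str, idx: usize) -> &str {
    &hex[idx * WORD_HEX..(idx + 1) * WORD_HEX]
}

fn parse_device_entry(raw: &str) -> Result<DeviceEntry, String> {
    let hex = raw.strip_prefix("0x").unwrap_or(raw);
    if !hex.is_ascii() {
        return Err("response is not ASCII hex".to_string());
    }
    if hex.len() < WORDS * WORD_HEX {
        return Err(format!("short response: {} hex chars, need {}", hex.len(), WORDS * WORD_HEX));
    }
    // Integers are right-aligned in each word, so the low 16 hex digits suffice.
    let low_u64 = |idx: usize| -> Result<u64, String> {
        let w = word(hex, idx);
        u64::from_str_radix(&w[WORD_HEX - 16..], 16).map_err(|e| format!("word {idx}: {e}"))
    };
    Ok(DeviceEntry {
        roles: low_u64(7)? as u8,
        registered_at: low_u64(8)?,
        revoked: low_u64(10)? != 0,
    })
}
```

The length check rejects a 7-word response outright, so a parser built against the old layout fails loudly instead of reading k11PubY as registeredAt.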

Tests updated:
  worker-creds verify::tests::parse_device_entry_decodes_well_formed
    → construct an 11-word raw response (was 7)
  broker handlers::cap::tests::parse_device_entry_decodes_well_formed
    → same
  broker handlers::cap::tests::parse_device_entry_detects_revoked
    → same
  All 4 green.

Live deploy: requires broker host redeploy via setup-broker-host.sh
so the broker picks up the new parse_device_entry. Worker code change
ships with the broker redeploy (same setup-broker-host.sh rebuild).

Step 11 surfaced the codex P2 downgrade-attack defense WORKING AS
INTENDED: cap-mint succeeded, worker AES-encrypted, then S3 PUT
returned 502 "s3_put: service error" because the worker fell back
to the broker EC2 instance profile (which deliberately lacks
s3:PutObject on the vault bucket).

The codex P2 fix in commit 18e709b added OptionalStsCreds + the
AGENTKEYS_WORKER_REQUIRE_STS strict-mode env var. Workers correctly
demand per-request OIDC-minted STS creds. The stage-3 demo's step
11+12 cred_memory_roundtrip helper wasn't passing them.

Fix: stage-3 step 11 (cred roundtrip) now passes vault-role STS creds,
step 12 (memory roundtrip) passes memory-role STS creds, both via the
three X-Aws-* headers the worker's OptionalStsCreds extractor reads:

  -H "x-aws-access-key-id: $aki"
  -H "x-aws-secret-access-key: $sak"
  -H "x-aws-session-token: $sst"

The STS creds were already minted in step 3 (vault + memory sessions
written to $STATE_DIR/{aki,sak,sst}.{vault,memory}); step 11+12 just
read the right file pair based on the kind (cred → vault, memory →
memory) and forward them as headers.

After this commit, steps 11+12 should land green end-to-end:
  broker cap-mint   → 200 (chain checks pass)
  worker cap-verify → 200 (broker_sig + chain re-verify)
  worker S3 PUT     → 200 (using per-actor STS creds, NOT instance profile)
  byte-for-byte roundtrip assertion holds.
…match)

Step 11 surfaced the second layer of the OIDC isolation chain working
as designed: cap-mint succeeded (broker authorized operator→agent),
worker AES-encrypted, then S3 PUT returned 502 because the STS creds
were minted from the OPERATOR'S session JWT (tagged with operator's
actor_omni) but the cap's actor_omni — and hence the S3 key path —
is the AGENT'S. IAM saw ${PrincipalTag/agentkeys_actor_omni} = 941c…
trying to PUT bots/82a0…/credentials/… and rejected with AccessDenied.

This is the IAM enforcing what the cap-token expresses: "operator
authorized the agent to do this op; the agent must be the one
actually doing it." Both layers must agree on actor_omni.

Fix (stage-3 cred_memory_roundtrip helper):

  1. Read agent_private_key from the demo-agent file
  2. SIWE-sign as the agent against the broker (POST /v1/auth/wallet/start
     with the agent's address, sign with cast wallet sign using
     agent_private_key, POST /v1/auth/wallet/verify → session JWT
     for the agent)
  3. Mint OIDC JWT via /v1/mint-oidc-jwt — this JWT now carries
     sub=agent_omni and PrincipalTag/agentkeys_actor_omni=agent_omni
  4. AssumeRoleWithWebIdentity against the right data-class role
     (VAULT_ROLE_ARN for cred, MEMORY_ROLE_ARN for memory) — STS
     creds now tagged with the agent's actor_omni
  5. Forward these creds via X-Aws-* headers to the worker

Now the worker's S3 PUT against bots/<agent>/credentials/… uses STS
creds with PrincipalTag=agent_omni → IAM allows.

The architectural lesson, recorded in the commit because it'll bite
again: when a cap-token authorizes actor A's action and the worker
uses STS creds to touch S3, the STS creds MUST be minted using A's
identity — operator's authorization (cap-token) + actor's identity
(STS creds) jointly satisfy the workflow. Per arch.md §17.2 layer 3,
the IAM PrincipalTag is bound to the JWT subject, NOT to whoever the
JWT-issuer (operator) chose to authorize.
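
The mismatch can be reproduced with a deliberately simplified model of the per-actor resource scoping. Real IAM evaluation is far richer; `iam_allows_put` is a hypothetical helper, and the omni values in the usage below are placeholders rather than the truncated ones quoted above.

```rust
/// Simplified model: a session tagged with `principal_tag_actor_omni` may
/// only write under bots/<that omni>/<data class>/.
fn iam_allows_put(principal_tag_actor_omni: &str, object_key: &str, data_class: &str) -> bool {
    let allowed_prefix = format!("bots/{principal_tag_actor_omni}/{data_class}/");
    object_key.starts_with(&allowed_prefix)
}
```

Under this model, creds tagged with the operator's omni fail against `bots/<agent>/credentials/...`, while creds tagged with the agent's omni succeed, which is the behavior the fix relies on.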

Codex round-2 review flagged the demo as 'needs-attention': it could
report 16/16 green while silently skipping the actual encrypt/decrypt
+ cross-class assertions. Three findings, all addressed:

[high] Worker roundtrip checks could be skipped + still claim coverage
  cred_memory_roundtrip used `skip ...; return 0` on five prereq-missing
  paths (no agent file, no scope, broker missing chain RPC, broker
  missing contract addresses, DeviceRoleMissing). Final summary still
  claimed AES-256-GCM byte-for-byte coverage as if the path had run.
  Fix: introduce STRICT default + `--allow-skip` opt-in. All five
  prereq paths now call prereq_missing(), which:
    - in strict mode: prints fail + records 'fail' outcome + returns non-zero
    - in --allow-skip mode: prints skip + records 'skip' outcome (dev iter)
  Final summary now prints actual per-step outcomes from STEP_OUTCOMES[],
  and exits non-zero if any step failed (or any step skipped in strict).

[high] Negative cap-class tests (steps 14, 15) accepted ANY non-200
  Previously: cred-class cap → memory worker with non-200 + non-canonical
  error was accepted ('non-200 = pass for negative test'). A down worker,
  wrong URL, 404 route, auth middleware failure, or malformed request
  would all silently satisfy the demo without proving check_data_class
  fired. Fix: require HTTP 400/401/403 AND the canonical
  cap_data_class_mismatch error string. Any other response = die.

[medium] Cross-actor cap-mint test (step 13) accepted generic rejection
  Previously: any 4xx accepted, even when error text was non-canonical;
  502 (broker stale) silently skipped, hiding a real config issue.
  Fix: require HTTP 400/401/403 with canonical OperatorMismatch.
  502 with config-missing body now dies (forces redeploy), not skip.
  Other 502/non-canonical errors = die (negative tests can't pass on
  an unrelated failure).
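
The tightened acceptance rule for negative tests reduces to one predicate. The demo itself is bash; this sketch only illustrates the rule, and `negative_test_passes` is a hypothetical name.

```rust
/// A negative test passes only when the rejection is the one under test:
/// an expected status code AND the canonical error string.
fn negative_test_passes(status: u16, body: &str, canonical_error: &str) -> bool {
    matches!(status, 400 | 401 | 403) && body.contains(canonical_error)
}
```

A 404, a 502, or a 403 with an unrelated body all return false, so a down worker or a wrong URL can no longer satisfy the check.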

Plus: positive steps (4, 7, 11+12 happy paths) now call record_ok so
the summary lists EVERY step that actually proved its assertion. The
expect_access_denied helper records too. The summary table is built
from actual execution, not a static claim of coverage.

The structural change here is: skips and infrastructure failures both
become demo failures unless the operator explicitly opts in. CI runs
default-strict. Dev iteration uses --allow-skip when bringing up a
partial environment.
…nvocation

Two small bugs in the strict-mode summary added by c55ea29:

1. Used `local` inside the `if should_run_step 16` block (not a function
   body), so bash printed:
     harness/v2-stage3-demo.sh: line 864: local: can only be used in a function
   AFTER the per-step outcome table tried to render. The 16 steps all
   ran correctly + the demo exited 0, but the summary table itself never
   printed. Fix: drop the `local` keyword and just use plain vars.

2. "DEMO COMPLETE" header would print even when no steps had been
   recorded (e.g. `--from-step 16` to test the summary block in
   isolation). Now distinguishes:
     - all green (nok>0, nskip=0, nfail=0) → DEMO COMPLETE
     - some skipped (--allow-skip) → DEMO PARTIAL
     - any failure → DEMO FAILED + exit 1
     - no steps run at all → NO STEPS EXERCISED + advisory
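
The four-way summary can be sketched as a function over the recorded outcomes. This models the bash logic rather than reproducing it: `Outcome` and `summarize` are hypothetical names, and the exit code 0 for the PARTIAL and NO STEPS cases is an assumption (only the failure exit code is stated above).

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Outcome {
    Pass,
    Skip,
    Fail,
}

/// Returns the summary header and the process exit code.
fn summarize(outcomes: &[Outcome]) -> (&'static str, i32) {
    let count = |want: Outcome| outcomes.iter().filter(|o| **o == want).count();
    let (nok, nskip, nfail) = (count(Outcome::Pass), count(Outcome::Skip), count(Outcome::Fail));
    if nfail > 0 {
        ("DEMO FAILED", 1)
    } else if nskip > 0 {
        ("DEMO PARTIAL", 0) // assumption: partial runs exit 0 under --allow-skip
    } else if nok > 0 {
        ("DEMO COMPLETE", 0)
    } else {
        ("NO STEPS EXERCISED", 0) // assumption: advisory only
    }
}
```

Checking `nfail` first means a single failure wins over any number of passes or skips.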

Codex round-3 review caught a regression I missed in c55ea29:

  [high] Strict demo still skips cross-class isolation checks without
         recording failure (steps 14 + 15)

c55ea29 switched cred_memory_roundtrip's prereq paths to
prereq_missing (so strict mode fails hard), but left steps 14 + 15
calling bare `skip` for the same prereq classes:

  - missing demo-agent file
  - 'not.*scope' (chain scope not set)
  - 'RPC URL not set' (broker stale)
  - 'SIDECAR_REGISTRY_ADDRESS_HEIMA unset' (broker missing contract addrs)

Because those skips didn't append to STEP_OUTCOMES, a full run could
report 'DEMO COMPLETE' with nskip=0 even when neither cross-data-class
isolation gate had been exercised. That's the same false-success
failure mode codex round-2 flagged, just in a different code path —
exactly the kind of regression strict-mode tracking is meant to catch.

Fix: extracted the entire step 14/15 body into a cross_class_rejection()
helper function. All prereq paths now route through prereq_missing
(matching cred_memory_roundtrip's pattern), so:

  - strict mode (default): unmet prereqs → die + STEP_OUTCOMES records 'fail'
  - --allow-skip mode:     unmet prereqs → skip + STEP_OUTCOMES records 'skip'
  - successful negative test → STEP_OUTCOMES records 'ok'

Step 14:
  cross_class_rejection cred-store /v1/memory/put memory cred cred-to-mem
Step 15:
  cross_class_rejection memory-put /v1/cred/store cred memory mem-to-cred

Live-verified on Heima Mainnet (2026-05-20): all 13 STEP_OUTCOMES
recorded, DEMO COMPLETE, exit 0. Steps 14+15 still pass with canonical
403 cap_data_class_mismatch error confirmation (no change to the
positive-path assertion logic — only the skip paths got tightened).
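
The positive-path assertion amounts to accepting only the exact rejection. A minimal sketch, assuming the helper has already captured the HTTP status and response body; the function name and the JSON body shape in the usage lines are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: a negative test passes only on HTTP 403 with the
# cap_data_class_mismatch error, so any other rejection is a test failure.
assert_cross_class_rejected() {
  local http_status="$1" body="$2"
  [ "$http_status" = "403" ] || return 1
  case "$body" in
    *cap_data_class_mismatch*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage with illustrative values (the body format is an assumption):
assert_cross_class_rejected 403 '{"error":"cap_data_class_mismatch"}' \
  && echo "isolation gate exercised"
assert_cross_class_rejected 401 '{"error":"unauthorized"}' \
  || echo "rejected for the wrong reason"
```

Matching on the error name as well as the status keeps an unrelated 403 from being counted as proof of data-class isolation.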
…-mode correct)

Codex round-4 finding (high):

  Cross-class negative test omits required STS headers, so strict
  workers reject before the data-class guard.

The axum extractor order is: OptionalStsCreds → Json<Req> → handler
body (verify_cap). With AGENTKEYS_WORKER_REQUIRE_STS=1 — the
production deployment setting documented in aws_creds.rs — the
extractor rejects header-less requests with HTTP 401 BEFORE verify_cap
runs. The cross-class data-class guard inside verify_cap never fires.

Today the live test passes because the broker host workers don't have
AGENTKEYS_WORKER_REQUIRE_STS=1 set. So we're proving the data-class
guard against dev-config workers but NOT against the prod target.
That's exactly the 'demo says complete, prod silently broken' failure
mode the codex review pipeline keeps catching.

Fix: cross_class_rejection() now:

  1. Mints agent-side STS creds for the TARGET worker's role:
       step 14 (memory worker target) → memory-role STS
       step 15 (cred worker target)   → vault-role STS
  2. Passes all three X-Aws-* headers in the POST to the worker.

Worker request order now:
  a. OptionalStsCreds extractor: valid headers present → Some(creds) → OK
     (passes whether or not AGENTKEYS_WORKER_REQUIRE_STS=1 is set)
  b. verify_cap:
       check_op (Store) → OK
       check_data_class (cap.data_class != worker's class) → REJECT
       → HTTP 403 cap_data_class_mismatch
  c. S3 op never runs (verify_cap returned error)
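
The ordering argument can be checked with a toy model. The real worker is Rust/axum; this bash function only mirrors the sequence of checks described above. The function name, its argument encoding, and the 200 success status are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Toy model of the worker's check order (not the worker's code).
# Arguments: has_sts_headers(0|1) require_sts(0|1) cap_data_class worker_data_class
worker_status() {
  local has_sts="$1" require_sts="$2" cap_class="$3" worker_class="$4"
  # a. The extractor runs first: a strict worker rejects header-less requests.
  if [ "$has_sts" = "0" ] && [ "$require_sts" = "1" ]; then
    echo 401
    return
  fi
  # b. verify_cap: the data-class guard.
  if [ "$cap_class" != "$worker_class" ]; then
    echo 403
    return
  fi
  # c. The S3 operation would run here (success status assumed to be 200).
  echo 200
}

worker_status 0 1 cred memory   # prints 401: strict worker, no headers, guard never reached
worker_status 1 1 cred memory   # prints 403: strict worker, headers present, guard fires
worker_status 0 0 cred memory   # prints 403: dev worker, no headers, guard fires
```

The first call is the case round 4 flagged: without the headers, a strict worker returns 401 and the test never reaches the guard it is meant to prove.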

The data-class guard provably fires now, in BOTH strict and non-strict
worker configurations. Codex's concern was correct.

Extracted mint_agent_sts_for_role() into a shared helper so the
cross_class test reuses the same SIWE+OIDC+STS flow as
cred_memory_roundtrip. Same auth chain, same trust boundary, same code
path, so the positive (cred_memory_roundtrip) and negative
(cross_class) tests cannot drift apart.

Live-verified 2026-05-20 on Heima Mainnet: 13 STEP_OUTCOMES recorded,
all ok, DEMO COMPLETE. Steps 14+15 still return canonical
403 cap_data_class_mismatch with the STS headers correctly passed
through — confirming the data-class guard fires AFTER extractor
authentication passes.
…variants (§17.5)

Codifies the issue #90 outcomes into the canonical architecture spec
(per CLAUDE.md "arch.md as source of truth" rule):

§15.1 + §15.2 — credentials-service + memory-service: added the OIDC
federation paragraph. X-Aws-* header passthrough is the production
auth surface (codex P2 downgrade fix); strict mode forces it via
AGENTKEYS_WORKER_REQUIRE_STS=1. Cross-links to §17.5.

§17.5 (NEW) — Per-data-class cap-token binding:
  - Cap-token's data_class field + the 4 broker endpoints
  - 4-layer defense-in-depth table (broker cap-mint, worker chain-
    verify, AWS IAM PrincipalTag, per-data-class buckets)
  - Each layer's canonical test in harness/v2-stage3-demo.sh
  - Test-discipline rule: new data classes MUST add negative isolation
    tests across all 4 layers
  - Two design rationales spelled out:
      a) Why route-per-class beats a single endpoint with a data_class
         query-param (eliminates user-input attack surface)
      b) Why agent-side STS creds are mandatory (PrincipalTag must match
         the cap's actor_omni; operator-side STS won't satisfy IAM)

Also includes the trailing Cargo.lock entry that results from
aws-credential-types becoming a direct dependency of worker-creds
(added in commit 913179a).