issue #90: co-locate audit/email/cred/memory workers on broker host (dev) by hanwencheng · Pull Request #92 · litentry/agentKeys

hanwencheng · 2026-05-19T16:32:58Z

Summary

Dev-only co-location of the 4 service workers (audit / email / cred / memory) on the same EC2 box as the broker, behind per-worker nginx vhosts. The per-subdomain layout is the migration seam — moving any worker to its own host later only requires changing the A record + IAM principal. Per CLAUDE.md, production will isolate all services; this PR is the dev posture.

Topology after merge:

| Hostname | Loopback port | systemd unit |
| --- | --- | --- |
| `broker.litentry.org` | `:8091` | `agentkeys-broker` |
| `signer.litentry.org` | `:8092` | `agentkeys-signer` |
| `audit.litentry.org` | `:9092` | `agentkeys-worker-audit` (Merkle relay) |
| `email.litentry.org` | `:9093` | `agentkeys-worker-email` (SES + S3 inbox) |
| `cred.litentry.org` | `:9094` | `agentkeys-worker-creds` (credential CRUD) |
| `memory.litentry.org` | `:9095` | `agentkeys-worker-memory` (memory CRUD) |

What changed

`scripts/setup-broker-host.sh` (+506 / -46)

Builds + installs all 4 worker binaries (--without-workers opts out).
Auto-generates /etc/agentkeys/worker-{creds,memory}.env with stable KEK secrets (preserved across re-runs — regenerating would invalidate every existing encrypted blob). worker-{audit,email}.env are non-secret and deterministic.
Drops 4 systemd units. worker-creds + worker-memory use /bin/sh -c 'export BROKER_CAP_PUBKEY_PEM="$(cat …)" && exec …' to inject the broker's session-pubkey PEM (multi-line, so it can't go in an EnvironmentFile).
Writes 4 nginx vhosts via a shared write_worker_nginx_site() helper. A → B (HTTP-only → :443 ssl) flip on second pass after LE cert is issued, matching the existing broker + signer pattern.
probe_or_die against /healthz on all 4 worker ports after restart.
New CLI flags: --audit-host, --email-host, --cred-host, --memory-host, --chain-rpc, --vault-bucket, --memory-bucket, --scope-addr, --registry-addr, --k3-counter-addr, --without-workers. All defaults are derived from $ISSUER_HOST + scripts/operator-workstation.env.
Idempotency hardening: re-runs without flags re-read previously-configured values from /etc/agentkeys/worker-{creds,memory}.env (mirrors the existing broker-unit detection block). So a first run with --chain-rpc https://devnet.example stays sticky on subsequent flag-less re-runs.
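The sticky-value behaviour can be pictured with a minimal sketch. It assumes a precedence of explicit flag, then previously written env file, then built-in default. The function names and the `CHAIN_RPC_URL` key are hypothetical and are not taken from setup-broker-host.sh.

```sh
# Hypothetical helpers: names and the env key are illustrative only.

# read_env_value FILE KEY: print the value of the last KEY=... line, if any.
read_env_value() {
    [ -r "$1" ] || return 0
    sed -n "s/^$2=//p" "$1" | tail -n 1
}

# resolve_setting FLAG_VALUE ENV_FILE KEY DEFAULT
# Precedence: explicit flag > value from a previous run > default.
resolve_setting() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"
        return 0
    fi
    previous=$(read_env_value "$2" "$3")
    if [ -n "$previous" ]; then
        printf '%s\n' "$previous"
    else
        printf '%s\n' "$4"
    fi
}
```

With this shape, a flag-less re-run picks up whatever the first run wrote, which is the property the idempotency note describes.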

`scripts/dns-upsert-workers.sh` (NEW)

Single atomic Route 53 change-batch UPSERT for all 4 A records. Validates the caller is on agentkeys-admin (case-insensitive per CLAUDE.md), refuses RFC1918 / TEST-NET-2 (Cloudflare WARP / Zscaler / corporate VPN rewrites) / CGNAT EIPs, waits for Route 53 INSYNC + Cloudflare DoH propagation before exiting.
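A rough sketch of the address refusal check follows, assuming the standard ranges: RFC1918 (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), TEST-NET-2 (198.51.100.0/24) and CGNAT (100.64.0.0/10). The function name `classify_ipv4` is hypothetical, and the real script may structure this differently. Octets above 255 are not rejected here.

```sh
# classify_ipv4 ADDR: print rfc1918, test-net-2, cgnat, public, or invalid.
# Hypothetical helper; only "public" would be acceptable for an A record.
classify_ipv4() {
    case "$1" in
        ''|*[!0-9.]*|*..*|.*|*.) echo "invalid"; return 0 ;;
    esac
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the dotted quad into $1..$4
    IFS=$old_ifs
    if [ "$#" -ne 4 ]; then echo "invalid"; return 0; fi
    a=$1; b=$2; c=$3
    if [ "$a" -eq 10 ]; then echo "rfc1918"
    elif [ "$a" -eq 172 ] && [ "$b" -ge 16 ] && [ "$b" -le 31 ]; then echo "rfc1918"
    elif [ "$a" -eq 192 ] && [ "$b" -eq 168 ]; then echo "rfc1918"
    elif [ "$a" -eq 198 ] && [ "$b" -eq 51 ] && [ "$c" -eq 100 ]; then echo "test-net-2"
    elif [ "$a" -eq 100 ] && [ "$b" -ge 64 ] && [ "$b" -le 127 ]; then echo "cgnat"
    else echo "public"
    fi
}
```

A caller would refuse to UPSERT unless the result is `public`.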

`scripts/verify-workers.sh` (NEW)

Laptop-side end-to-end check: DNS resolves via Cloudflare DoH → TLS cert is Let's Encrypt with valid date → /healthz returns HTTP 200 with the per-worker expected body marker. Exits non-zero with per-failure diagnostic. --no-tls flag for the HTTP-only first-pass phase.

`docs/cloud-setup.md` (+88)

New §1.4 row in TOC.
New §7 "Service workers (audit / email / cred / memory)" with concern table (mirrors §6 signer), §7.1 DNS one-shot helper, §7.2 TLS cert loop + nginx flip, §7.3 verification.
§7 Cleanup renumbered to §8.

`crates/agentkeys-worker-audit/src/main.rs` + `…-email/src/main.rs`

GET /healthz → "ok" so probe_or_die can verify boot. worker-creds + worker-memory already had it.

`scripts/operator-workstation.env`

Derive WORKER_{AUDIT,EMAIL,CRED,MEMORY}_HOST + AGENTKEYS_WORKER_*_URL from $BROKER_HOST, mirroring the SIGNER_HOST pattern.
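One way such a derivation could look is sketched below, assuming the worker hostname is formed by swapping the first DNS label of the broker host. The helper names are hypothetical. The actual env file may simply use parameter expansion inline.

```sh
# derive_worker_host LABEL BROKER_HOST: replace the first DNS label.
# Hypothetical helper for illustration.
derive_worker_host() {
    printf '%s.%s\n' "$1" "${2#*.}"
}

# derive_worker_url LABEL BROKER_HOST: same host, as an https URL.
derive_worker_url() {
    printf 'https://%s\n' "$(derive_worker_host "$1" "$2")"
}
```

For example, `derive_worker_url audit broker.litentry.org` yields `https://audit.litentry.org`.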

`scripts/heima-scope-set.sh` + `heima-scope-revoke.sh` (carry-over)

Graceful skip with {"ok":true,"skipped":"no-webauthn-k11"} when no mode:webauthn K11 is enrolled, so harness/v2-stage1-demo.sh (default stub mode) is fully CI-automatable without operator Touch ID. Fixes the Q2 issue raised in the prior session: "`bash harness/v2-stage1-demo.sh` is not fully automatic, it still ask me for touchid".

Reviewer notes

The PEM-injection /bin/sh -c wrapper in the worker-creds / worker-memory units was tested by rendering the heredoc in isolation: it produces a literal $(cat /var/lib/…/session-keypair.pub.pem) in the unit file, which /bin/sh expands at service start, not at script-write time.
worker-creds + worker-memory Requires=agentkeys-broker.service so the session-pubkey PEM exists before they start. setup-broker-host.sh restarts workers AFTER signer (which depends on the PEM too) — order is backend + broker → signer → workers.
worker-audit's /var/lib/agentkeys/audit-leaves dir is created with mode 0750 owned by agentkeys; ProtectSystem=strict + ReadWritePaths=/var/lib/agentkeys allows the worker to write per-batch Merkle JSONL there.
KEK secrets in worker-creds.env + worker-memory.env are mode 0600, owner agentkeys. The other two env files are mode 0644 (no secrets).
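The heredoc behaviour from the first reviewer note can be reproduced in isolation. The sketch below is not the script's actual heredoc. The function name and both arguments are placeholders. It only demonstrates that, in an unquoted heredoc, `\$` survives as a literal `$` while `${var}` is expanded at write time.

```sh
# render_exec_start PEM_PATH BIN_PATH: print one ExecStart= line.
# Hypothetical helper; paths are supplied by the caller.
render_exec_start() {
    pem_path=$1
    bin_path=$2
    # Unquoted heredoc: ${pem_path} and ${bin_path} expand now,
    # but \$( stays a literal $( for /bin/sh to evaluate later.
    cat <<EOF
ExecStart=/bin/sh -c 'export BROKER_CAP_PUBKEY_PEM="\$(cat ${pem_path})" && exec ${bin_path}'
EOF
}
```

Running `render_exec_start /tmp/demo.pem /bin/true` prints a line that still contains the unexpanded command substitution.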

Test plan

bash -n clean on all 3 scripts.
cargo check -p agentkeys-worker-{audit,email,creds,memory} exit 0.
Sourcing scripts/operator-workstation.env derives https://audit.litentry.org etc. correctly.
Heredoc renders ExecStart=/bin/sh -c 'export BROKER_CAP_PUBKEY_PEM="$(cat /var/lib/…/session-keypair.pub.pem)" && exec /usr/local/bin/agentkeys-worker-creds' (no script-time $-expansion).
Operator-driven on Heima Mainnet broker host (this PR):
1. From laptop: bash scripts/dns-upsert-workers.sh (UPSERTs 4 A records on Route 53).
2. On broker host: sudo bash scripts/setup-broker-host.sh --yes (writes HTTP-only nginx vhosts + systemd units, builds + installs all 4 worker binaries).
3. On broker host: sudo certbot certonly --webroot … loop for the 4 new hosts.
4. On broker host: sudo bash scripts/setup-broker-host.sh --yes (second pass flips nginx onto :443 ssl).
5. From laptop: bash scripts/verify-workers.sh — expect all 4 green.

…-N recovery + companion daemon

P-256 ECDSA verify on-chain via pure-Solidity Jacobian-coords implementation (no EIP-7212 precompile dependency; Heima is at London EVM). ~654k gas per verify, sufficient for master-mutation frequency. RFC 6979 test vectors pass.

K11Verifier extracts WebAuthn challenge from clientDataJSON at known byte offset (daimo-style), reconstructs msgHash, calls P256Verifier. Binds K11 sig to operation challenge to prevent replay.

SidecarRegistry: splits into registerFirstMasterDevice + registerAdditionalMasterDevice + revokeAgentDevice + revokeMasterDevice (M-of-N quorum gated by recoveryThreshold). Stores k11PubX/k11PubY + lastSignCount per device. Per-operator nonce + monotonic sign-count defend against replay.

AgentKeysScope: K11Assertion struct gates setScopeWithWebauthn / revokeScope; per-(operator, agent) scopeNonce binds K11 sig to current state.

CLI: K11ChainAssertion struct + assert_webauthn_for_chain() extracts (r, s, msgHash, pubX, pubY, authData, clientDataJSON, challengeLocation, signCount) for chain submission. New --rp-id flag enables companion credentials at companion.localhost (distinct platform keychain entry). --emit-chain-payload outputs JSON for cast tx construction.

Daemon: new --master-companion mode runs a second daemon instance with its own K10 + K11 at rp_id=companion.localhost. Serves HTTP API:
- GET /v1/companion/whoami: emits device identity
- POST /v1/companion/approve: runs WebAuthn ceremony, returns chain payload

Scripts:
- scripts/heima-device-add.sh: register companion as 2nd master
- scripts/heima-set-recovery-threshold.sh: raise threshold to N
- scripts/heima-recovery.sh: M-of-N master-device revoke

Harness:
- harness/v2-stage2-demo.sh: idempotent 8-step demo

28 forge tests pass (P256: 8, K11: 6, AgentKeysV1: 14). Stage-2 demo runs green in stub mode and re-runs green (idempotent). Full --webauthn flow requires Touch ID + post-deploy contract addresses.

Closes part of #90:
- On-chain P-256 verify of K11 assertions
- Multi-master M-of-N recovery quorum
- Multi-master pairing flow (companion daemon as mobile-app alternative)

Deferred to follow-up PRs:
- audit-service worker (tier A Merkle relay)
- email-service worker
- K3 rotation operational runbook
- Existing scripts/heima-{device-register,scope-set,scope-revoke}.sh migration to new contract surface (their K11 args changed shape)

Adds docs/v2-stage2-heima-deploy-and-test.md walking the operator through redeploying the stage-2 contract set on Heima Mainnet, re-bootstrapping the primary master, running the stage-2 demo, and exercising the M-of-N recovery flow. Inherits all env setup from docs/v2-stage1-migration-and-demo.md (no parallel test environment).

Harness fixes from the first dry-run:
- harness/v2-stage2-demo.sh step 5 simplifies to script-existence sanity check in stub mode (was: invoking dry-run which fails on missing companion K11 file).
- harness/v2-stage2-demo.sh step 7 same: verifies recovery script is invocable without requiring live chain state.
- scripts/heima-device-add.sh adds a dry-run path that doesn't require the companion K11 file (uses placeholder pubkey).
- scripts/heima-recovery.sh adds a dry-run path that doesn't require the deployer mnemonic / ethers node_modules.

Result: bash harness/v2-stage2-demo.sh --stub --skip-build runs all 8 steps green and is idempotent on re-run.

Stage-2 demo now owns the full lifecycle end-to-end:
- step 3: idempotent contract deploy (skips if already on chain; --redeploy forces fresh deploy; reads addresses from broadcast file; writes them to scripts/operator-workstation.env)
- step 4: idempotent primary-master bootstrap via new scripts/heima-register-first-master.sh (calls registerFirstMasterDevice with K11 pubX/pubY loaded from the operator's enrollment JSON)
- step 5-8 unchanged: companion daemon spin-up, 2nd-master register, recoveryThreshold update, recovery dry-run
- step 9: summary with all deployed addresses

Now actually deployed to Heima Mainnet (verified live):
- P256Verifier: 0xb74f0aaf9b72b4e7da872f77c63d805bf1937190
- K11Verifier: 0x73446fc9919a0a539b8b08dbda615a64b796ca4f
- SidecarRegistry: 0x9306c524a5e5c33e9a905b956204207ccaf7a7a1
- AgentKeysScope: 0x1276b94f57fd4086670d66acb8c75058176df399
- K3EpochCounter: 0x66c08748a6cfa14d9fefaaf5147e41a98db24f53
- CredentialAudit: 0xe827ba44931aef8c6f3abfec6b90ecf59f797576

Primary master registered on the new SidecarRegistry, tx 0x5f3a79bc970062ec74aa0deb5618f8a527f638a6d24ba3c4144f09a49600876d (block 9623082). Re-runs are idempotent: all 9 steps log 'skip'/'ok' without re-submitting any tx.

The four scripts only referenced by harness/v2-stage2-demo.sh now live under harness/scripts/, the same place as the orchestrator that calls them. Operator-facing stage-1 helpers in scripts/ stay put.
- scripts/heima-device-add.sh → harness/scripts/heima-device-add.sh
- scripts/heima-recovery.sh → harness/scripts/heima-recovery.sh
- scripts/heima-register-first-master.sh → harness/scripts/heima-register-first-master.sh
- scripts/heima-set-recovery-threshold.sh → harness/scripts/heima-set-recovery-threshold.sh

The moved scripts compute REPO_ROOT from two levels up (harness/scripts/<f>.sh → repo root via /../..); the demo paths were updated to point at the new harness/scripts/ location.

Hardened the deploy-presence check in step 3:
- Distinguishes RPC failure (exit nonzero) from "no code at address" (exit zero with "0x").
- RPC failure → retry up to 8 times with 3s sleep → die rather than redeploy on uncertain state.
- "No code" → genuine; trigger redeploy as before.

Heima's RPC hits TLS-handshake-EOF transients regularly; this fix prevents an unnecessary redeploy that would orphan the previous set. Same hardening on the balance check in step 3.
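The hardened presence check can be sketched as a retry loop that treats a failed probe and an empty-code answer differently. This is an illustration under assumptions, not the demo's code. `has_code_at`, `RETRY_ATTEMPTS` and `RETRY_SLEEP_SECS` are hypothetical names, and the probe command is passed in by the caller (in practice it might be something like `cast code <address> --rpc-url <url>`).

```sh
# has_code_at PROBE_CMD [ARGS...]
# Prints "present", "absent", or "rpc-failure".
# Returns 0 when the probe answered, 1 when every attempt failed.
has_code_at() {
    attempts=${RETRY_ATTEMPTS:-8}
    delay=${RETRY_SLEEP_SECS:-3}
    n=1
    while [ "$n" -le "$attempts" ]; do
        if code=$("$@"); then
            # The probe answered: "0x" means genuinely no code deployed.
            if [ "$code" = "0x" ]; then echo "absent"; else echo "present"; fi
            return 0
        fi
        n=$((n + 1))
        if [ "$n" -le "$attempts" ]; then sleep "$delay"; fi
    done
    # Uncertain state: the caller should stop rather than redeploy.
    echo "rpc-failure"
    return 1
}
```

The key design point is that only an `absent` answer may trigger a redeploy; `rpc-failure` should abort the step.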

…8 message

Stage-2 demo step 5 now derives the companion's on-chain device_key_hash from its K11 cose-pubkey (cast keccak <cose_pubkey_hex>) and passes it to the daemon via --companion-device-key-hash. The daemon's /v1/companion/whoami then returns the real hash that registerAdditionalMasterDevice will use as the storage key, so the later revoke flow can find the device on chain.

Stage-2 demo step 8: clearer skip message + when --webauthn is set, prints the companion's device_key_hash + the exact re-run command for executing the revoke. The previous message implied --webauthn alone would do something; really we need a target hash too.

…files

Adds harness/scripts/_lib.sh with resolve_master_key():
- $HEIMA_DEPLOYER_KEY_FILE env var (raw hex or mnemonic)
- ~/.agentkeys/heima-deployer.key (raw hex, used by stage-1 operator)
- ./test-hei (mnemonic, legacy)

Patches the 3 scripts that previously only handled mnemonic files:
- heima-device-add.sh
- heima-set-recovery-threshold.sh
- heima-recovery.sh (preserves --dry-run placeholder path)

Fixes a real bug: scripts died with 'missing mnemonic' on operators that bootstrapped from a raw private key (the stage-1 path stores the deployer key at ~/.agentkeys/heima-deployer.key, not a mnemonic at ./test-hei).

Also fixes step 8's stale whoami file: always curl fresh so the device_key_hash hint reflects the currently-running daemon, not a prior run where the daemon hadn't been started with the real hash.
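A simplified picture of the lookup is sketched below. It assumes the three sources are tried in the order listed and that the first readable file wins. The function names are hypothetical and this is not the body of resolve_master_key() itself. The raw-hex versus mnemonic distinction is guessed from whether the content contains spaces.

```sh
# resolve_master_key_file: print the first readable key source.
# Hypothetical helper; the search order is an assumption.
resolve_master_key_file() {
    for candidate in "${HEIMA_DEPLOYER_KEY_FILE:-}" \
                     "$HOME/.agentkeys/heima-deployer.key" \
                     "./test-hei"; do
        if [ -n "$candidate" ] && [ -r "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    echo "no deployer key or mnemonic file found" >&2
    return 1
}

# key_material_kind CONTENT: a mnemonic has spaces, a raw hex key does not.
key_material_kind() {
    case "$1" in
        *" "*) echo "mnemonic" ;;
        *)     echo "raw-hex" ;;
    esac
}
```

An operator who bootstrapped from a raw private key would resolve to the second candidate instead of failing on a missing mnemonic.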

Bug 1 (root cause of step 7 K11VerificationFailed reverts): assert_webauthn_for_chain was passing the 32-byte expected_challenge as a "message" to assert_webauthn_inner_parts, which sha256'd it again before using as the WebAuthn challenge. The on-chain K11Verifier expects the WebAuthn challenge to BE the operation challenge (no extra hash); double-hashing made clientDataJSON.challenge != expected_b64 → ChallengeMismatch / verifyAssertion returns false → contract reverts with K11VerificationFailed.

Fix: refactored assert_webauthn_inner_parts to take a [u8; 32] challenge directly. The legacy assert_webauthn_inner path sha256's the message itself before calling (preserves existing behavior). assert_webauthn_for_chain passes the expected_challenge through unchanged.

Bug 2 (step 6 cast send "invalid string length"): The companion daemon was receiving an empty --companion-k11-cred-id (demo didn't pass it), so /v1/companion/whoami returned k11_cred_id="". The brittle xxd|head|sed pipeline in heima-device-add.sh produced an all-zeros bytes32 by accident, but the demo's tuple construction had other issues that confused the cast parser.

Fix: demo step 5 now computes the cred-id hash from the K11 file (keccak256-style sha256 of the b64url credential id) and passes it to the daemon via --companion-k11-cred-id. heima-device-add.sh uses the hash directly from whoami without re-encoding. Also bumped the empty attestation arg from "0x" to "0x00" (cast tolerates the latter more consistently).

Added a sanity-check loop in heima-device-add.sh that validates each bytes32 arg has length 66 before invoking cast, so future malformed inputs fail with a clear error rather than cast's opaque parser msg.
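The bytes32 sanity check is small enough to sketch. The source only states the length-66 rule ("0x" plus 64 hex digits). The prefix and hex-digit checks below are extra assumptions added for illustration, and both function names are hypothetical.

```sh
# is_bytes32 VALUE: succeed only for "0x" followed by 64 hex digits.
is_bytes32() {
    [ "${#1}" -eq 66 ] || return 1
    case "$1" in
        0x*) ;;
        *) return 1 ;;
    esac
    case "${1#0x}" in
        *[!0-9a-fA-F]*) return 1 ;;
    esac
    return 0
}

# require_bytes32 NAME VALUE: fail with a readable error before calling cast.
require_bytes32() {
    if ! is_bytes32 "$2"; then
        echo "error: $1 is not a bytes32 value (got length ${#2})" >&2
        return 1
    fi
}
```

Calling `require_bytes32 k11_cred_id "$value"` for each argument before the transaction is built gives a named, early failure.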

WebAuthn assert page now surfaces the role + RP ID prominently so the operator can't confuse which credential they're about to sign with:
- Color: blue accent for PRIMARY MASTER (rp_id=localhost), purple for COMPANION MASTER (rp_id=companion.localhost)
- Role badge at the top of the card with emoji + label
- Dedicated RP-ID callout warning to verify the Touch ID prompt matches the displayed RP
- Button text reads "Sign as PRIMARY MASTER" / "Sign as COMPANION MASTER"
- Page <title> includes the role so the OS tab list shows it

The M-of-N recovery flow opens TWO browser windows in quick succession (one for each daemon's K11 ceremony); without this distinction the operator could tap the wrong Touch ID prompt and silently produce an assertion the contract rejects.

Stage-2 demo grows from 9 to 10 steps and now exercises the full M-of-N revocation path as part of the default --webauthn flow:

Step 8 NEW: Register synthetic 3rd master (the "spare"). The spare is a fresh P-256 keypair generated via openssl, NOT a real WebAuthn passkey. It registers as a 3rd master with roles=3 (CAP_MINT|RECOVERY) via primary K11 sig (1 Touch ID at localhost). State persists at /tmp/agentkeys-spare-current/ for step 9.

Why synthetic: the spare is "lost" by design and never needs to sign for its own revocation (primary + companion provide the quorum). Skipping its WebAuthn enrollment saves a Touch ID without weakening the test of any contract surface.

Step 9 NEW: Revoke spare via 2-of-2 quorum. Calls heima-recovery.sh with target=spare hash. The script:
- Asks primary K11 to sign OP_REVOKE_MASTER challenge (1 Touch ID at localhost; UI shows PRIMARY MASTER badge).
- Asks companion daemon /v1/companion/approve to sign same challenge (1 Touch ID at companion.localhost; UI shows COMPANION MASTER badge).
- Submits revokeMasterDevice(spareHash, [primarySig, companionSig]).
- Contract verifies 2-of-2 quorum + bumps operatorNonce.

Post-tx verify: isActive(spare) == false.

Step 10 NEW: Cleanup spare local state. Removes /tmp/agentkeys-spare-current/. The on-chain entry stays as revoked=true (audit trail; no on-chain delete by design).

End state after a successful run:
- 2 active masters: primary (roles=7) + companion (roles=3)
- 1 revoked master: spare (roles=3, revoked=true)
- recoveryThreshold = 2
- operatorNonce += 3 (register-2nd-master, set-threshold, revoke)

Touch IDs on a fresh run: 6 total
- companion enroll (step 5, once per setup)
- companion register (step 6, once per setup)
- set threshold (step 7, once per setup)
- spare register (step 8, fresh per run)
- primary sigs spare revoke (step 9)
- companion sigs spare revoke (step 9)

Re-run after this completes: steps 1-7 + 10 skip, steps 8-9 generate a fresh spare (new keypair) and revoke it, for 3 Touch IDs per re-run. This makes the demo a repeatable end-to-end test of the M-of-N path without bricking the operator's setup.

Once a companion has been revoked on chain (e.g. as part of an M-of-N quorum test), it can never re-enter the registered-master set under the same deviceKeyHash. Stage-2 demo now detects this and enrolls a fresh companion under a bumped rp_id (companion.localhost → companion-v2.localhost → companion-v3.localhost) so the M-of-N revoke test in step 9 has 2 distinct ACTIVE masters to form the quorum.

Changes:
- harness/v2-stage2-demo.sh step 5: scans existing K11 files for an active-on-chain companion. If none found, picks the lowest free version slot and enrolls a fresh K11 there.
- harness/v2-stage2-demo.sh step 5: passes the computed rp_id to the daemon via new --companion-rp-id flag.
- crates/agentkeys-daemon/src/companion.rs: rp_id is now stored in CompanionState + threaded through /v1/companion/whoami responses and assert_webauthn_for_chain calls.
- crates/agentkeys-daemon/src/main.rs: new --companion-rp-id flag.
- harness/scripts/heima-device-add.sh: reads rp_id from /v1/companion/whoami and derives the K11 file path from it.

Net effect: re-running the demo after a 2-of-2 revoke now enrolls a fresh companion-vN, re-establishes a 2-active-master state, and proceeds with the next spare-revoke cycle without operator hand-fixing.
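The "lowest free version slot" rule can be sketched as follows, assuming slot 1 maps to companion.localhost and slot N (N ≥ 2) to companion-vN.localhost, as the rp_id sequence above suggests. The function names are hypothetical, and the list of taken rp_ids is passed in as arguments rather than discovered from K11 files.

```sh
# rp_id_for_slot N: slot 1 is the unversioned name.
rp_id_for_slot() {
    if [ "$1" -eq 1 ]; then
        echo "companion.localhost"
    else
        echo "companion-v$1.localhost"
    fi
}

# lowest_free_rp_id [TAKEN_RP_ID...]: first slot not in the taken list.
lowest_free_rp_id() {
    slot=1
    while :; do
        candidate=$(rp_id_for_slot "$slot")
        taken=no
        for used in "$@"; do
            if [ "$used" = "$candidate" ]; then taken=yes; fi
        done
        if [ "$taken" = no ]; then
            echo "$candidate"
            return 0
        fi
        slot=$((slot + 1))
    done
}
```

With companion.localhost and companion-v2.localhost already used, the next enrollment would land on companion-v3.localhost.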

Enables harness/v2-stage1-demo.sh to run green against the new SidecarRegistry + AgentKeysScope contracts deployed in stage 2.

Changes:
- heima-device-register.sh becomes a thin wrapper: forwards to harness/scripts/heima-register-first-master.sh when no first master is registered; logs skip otherwise. The pre-stage-2 registerMasterDevice() was split into registerFirstMasterDevice + registerAdditionalMasterDevice; this script handles the former.
- heima-device-revoke.sh: detects master vs agent target and delegates accordingly. Agent revoke uses the new revokeAgentDevice (no K11 needed). Master revoke delegates to heima-recovery.sh which collects the M-of-N K11 quorum.
- heima-scope-set.sh: real WebAuthn ceremony, computes the contract's expected_challenge per OP_SET_SCOPE encoding (servicesDigest + scopeNonce + chainid), builds K11Assertion struct, calls new ABI (bytes K11 -> struct). Stub bytes no longer satisfy the gate.
- heima-scope-revoke.sh: same migration as scope-set, computing OP_REVOKE_SCOPE challenge.
- All four scripts now use harness/scripts/_lib.sh's resolve_master_key, supporting both raw-key files (~/.agentkeys/heima-deployer.key) and mnemonic files (./test-hei).

Effect: operator can now run `bash harness/v2-stage1-demo.sh --webauthn` against the same Heima Mainnet deployment that stage-2 uses, exercising the full operator lifecycle (init -> register -> agent -> scope -> audit) on the new contracts.

scripts/heima-k3-rotate.sh: operator-driven K3 epoch advance via K3EpochCounter.advanceEpoch(). Idempotent (--target-epoch N skips if currentEpoch >= N), supports dry-run, signs from the wallet that is the contract's signerGovernance.

docs/runbook-k3-rotation.md: step-by-step operator runbook covering prerequisites, the one-command flow, post-rotation verification, when to rotate (quarterly hygiene + TEE-compromise indicator), lazy vs eager re-encryption trade-offs, and the stage-3 migration path to move signerGovernance from EOA to M-of-N multisig.

Verified end-to-end on Heima Mainnet (dry-run): K3EpochCounter at 0xeacc97d4e7854c52d4736e5fba2dc7c2c2b147d9 has currentEpoch=1 and signerGovernance points at the deployer.

Contract surface (CredentialAudit.sol):
- New `appendRoot(operatorOmni, merkleRoot, batchEntryCount)` stores a per-operator AuditRoot entry, emits AuditRootAppended. Operators reconstruct per-event proofs from leaves in S3.
- New `verifyEntryInRoot(operatorOmni, rootIndex, proof[], leaf)` validates a sorted-pairs Merkle proof on chain. Matches OpenZeppelin convention so the Rust-side proof emission is directly verifiable without further transformation.
- Existing `append()` per-event path (tier C) untouched.

Forge test test_CredentialAudit_AppendRoot_AndVerifyMembership covers the round-trip with a 4-leaf tree.

New crate agentkeys-worker-audit:
- `merkle.rs`: minimal Merkle root + proof helpers using keccak256 with sorted-pairs encoding (matches the contract verifier byte-for-byte). Doc tests + 4 unit tests pass.
- `state.rs`: per-operator in-memory event queue with flush semantics. Drains the queue, computes Merkle root, writes per-event leaves + proofs to a JSONL file at /tmp/audit-leaves-<root>.jsonl.
- `handlers.rs`: HTTP surface
  - POST /v1/audit/append: queue event
  - POST /v1/audit/flush/:operator: drain one queue
  - POST /v1/audit/flush-all: drain all queues
- `main.rs`: bind axum at 127.0.0.1:9092; periodic auto-flush every --flush-interval-secs (default 300s; 0 = manual only). Each flush logs the Merkle root + leaves path. Chain submission via `cast send appendRoot` is operator-driven (separate from this process so the worker doesn't need a deployer key).

End-state: operators wanting per-event-tx semantics keep using tier C (`heima-credential-audit.sh` direct write). Operators wanting batched gas (one tx per N events / per 5min) point their daemon at this worker and emit per-event POSTs; the worker computes roots and the operator periodically submits roots via `cast send`.

New crate agentkeys-worker-email. Surfaces:

POST /v1/email/send
Body: { from, to[], subject, body_text, body_html? }
Wraps aws-sdk-sesv2::SendEmail with the operator's SES identity (must be verified per the #83 setup workflow). Returns the SES message_id.

GET /v1/email/inbox/:actor_omni
Lists objects under s3://$AGENTKEYS_VAULT_BUCKET/bots/<actor_omni>/inbound/. Inbound routing itself is the SES routing Lambda from #83; this worker only exposes what's already been delivered to S3.

CLI args:
- --bind: default 127.0.0.1:9093
- --inbox-bucket: env AGENTKEYS_VAULT_BUCKET, required

Builds against aws-sdk-sesv2 1.118 + aws-sdk-s3 1.132. No new dependencies introduced at the workspace level (aws-config + s3 are already used by worker-creds).

Operator workflow: spin up alongside worker-creds + worker-memory on the broker host, route per-agent outbound mail through this worker instead of having each agent directly call SES. Cap-token verification on /v1/email/send is left as a follow-up (current shape assumes the worker is on a private interface; operators expose it only on the sidecar daemon's localhost, same as worker-creds).

Live E2E test of scripts/heima-k3-rotate.sh per agentkeys-harness skill:
- Round 1: epoch 1 → 2 (1 tx)
- Round 2: epoch 2 → 3 (1 tx)
- Round 3: target=3 (already there) → skip, no tx, 0 gas
- Round 4: target=6 (3-step advance) → 3 txs

Total: 5 real txs on K3EpochCounter = 0xeacc97d4e7854c52d4736e5fba2dc7c2c2b147d9.

The contract is forward-only by design (no "rotate back"), so the "back and forth" test is bounded to forward-path correctness + the idempotency skip on re-targets-to-current. Both work as designed.

K3EpochCounter is now at epoch 6 on Heima Mainnet. The signer enclave will retain historical K3_v[1..5] for decrypt of pre-rotation blobs; new writes use K3_v[6].
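The round results follow from simple forward-only arithmetic, sketched below. `advances_needed` is a hypothetical name. It assumes one advanceEpoch() transaction per epoch step and a skip whenever the current epoch already meets the target.

```sh
# advances_needed CURRENT TARGET: number of single-step advances to submit.
# Forward-only: a target at or below the current epoch needs no tx.
advances_needed() {
    current=$1
    target=$2
    if [ "$current" -ge "$target" ]; then
        echo 0
    else
        echo $((target - current))
    fi
}
```

Applied to the four rounds above this gives 1, 1, 0 and 3 transactions, which sums to the 5 reported.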

Two fixes:

1. Enrollment page (serve_enroll_page) now matches the assert-page visual language: role badge (PRIMARY MASTER blue, COMPANION MASTER purple), RP-ID surfaced explicitly, button text reads "Enroll as PRIMARY MASTER" / "Enroll as COMPANION MASTER". Previously the enrollment page was role-agnostic which made it easy to tap Touch ID on the wrong RP when re-enrolling.

2. WebAuthn user.name shown in the macOS Touch ID dialog ("Use Touch ID to sign in to 'localhost' with your passkey for <NAME>") was previously the full 64-char operator_omni hex, which truncates awkwardly on screen. Now reads "AgentKeys Primary Master (0x941cb1c3…)" or "AgentKeys Companion Master (0x941cb1c3…)": human-readable + a 10-char omni prefix for cross-operator disambig.

Takes effect on NEW enrollments only; existing credentials retain whatever user.name was set when they were originally enrolled. To refresh the display name, delete ~/.agentkeys/k11/<omni>--<rp>.json and re-enroll.

The "white text in white background" in the macOS Passkey-source filter row is macOS system UI (the picker for which provider supplies the passkey: iCloud Keychain, 1Password, etc.); it's outside our HTML control. The other observed truncation is fixed by this commit.

Operator-facing summary of what K3 rotation does and doesn't change:
- contract addresses, devices, scopes, threshold unchanged
- on-chain epoch counter advances + emits K3Rotated event
- signer enclave retains historical K3 versions for legacy decrypt
- workers swap to new epoch for new writes via SSE
- one-command operator action: `bash scripts/heima-k3-rotate.sh`
- links to full runbook at docs/runbook-k3-rotation.md
- notes the stage 1-2 simplification (KEK from env per §22b.2) means rotation is forward-compatible but not yet driving worker re-key

Also documents the eager-re-encrypt follow-up gated behind a confirmed TEE compromise scenario (stage 3 tracked in §22b.5).

Codex flagged 8 findings; 7 are addressed here (C1, C2, C3/M1, H1, H2, M2 + test coverage). The remaining one (codex H3 "K10+K11") is a false positive: msg.sender check IS the K10 signature, since EVM tx signing is secp256k1 over the whole tx by the master wallet. Added comments where helpful.

Contract fixes (require redeploy):

C1: SidecarRegistry.revokeMasterDevice refuses to revoke if it would leave < max(1, recoveryThreshold) active recovery-capable masters. Prevents permanent operator stranding.

C2: SidecarRegistry.setRecoveryThreshold refuses newThreshold > activeRecoveryMasterCount. Prevents permanent operator stranding via unsatisfiable quorum.

C3/M1: CredentialAudit.appendRoot is auth-gated by the operator's master wallet (via injected SidecarRegistry reference). Previously any account could pollute an operator's root list.

H1: K11Verifier.verifyAssertion gains three new envelope checks:
- authData[0:32] == expectedRpIdHash (per-credential, stored on register at DeviceEntry.k11RpIdHash). Prevents cross-RP replay.
- authData[32] has UP|UV flags. Prevents stolen-device-without-biometric assertions.
- clientDataJSON starts with `{"type":"webauthn.get"`. Prevents replay of webauthn.create (enrollment) assertions.

M2: CredentialAudit + worker Merkle domain-separate leaves (0x00 prefix) and internal nodes (0x01 prefix). Prevents an internal-node digest from impersonating a leaf at shorter depth.

ABI changes:
- SidecarRegistry.registerFirstMasterDevice + registerAdditionalMaster now take an extra bytes32 k11RpIdHash arg (the operator's K11 enroll rp_id is hashed and stored).
- K11Verifier.verifyAssertion takes the rpIdHash; callers (SidecarRegistry, AgentKeysScope) read entry.k11RpIdHash.
- CredentialAudit constructor takes the SidecarRegistry address.

Harness changes:
- heima-register-first-master.sh + heima-device-add.sh + heima-register-spare-master.sh compute sha256(rp_id) from the K11 enrollment file and pass it as the new arg.
- v2-stage2-demo.sh step 6 + 7 fail-fast on device-add/threshold-set failures + verify on-chain state matches before advancing to step 9. Codex H2: previously silent failures could false-green step 9.

Tests:
- 5 new K11Verifier tests: RpIdHashMismatch, UserPresenceMissing (no flags, UP-only), WrongClientDataType (webauthn.create), all pass.
- CredentialAudit_AppendRoot_RejectsNonMaster (vm.prank attacker).
- Internal-node-as-leaf attack test in both forge + Rust Merkle suite.
- Total: 33 forge tests (was 28), 7 worker-audit unit tests (was 6), all green.

Deploys will fail against the existing PR #87-deployed contracts; operator must redeploy via the demo's step 3 (forced) or by running `bash harness/v2-stage2-demo.sh --redeploy`.
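The two stranding guards (C1 and C2) reduce to integer comparisons. The sketch below models them off-chain purely to illustrate the conditions. It is not the Solidity code, both function names are hypothetical, and it assumes the device being revoked is itself an active recovery-capable master.

```sh
# can_revoke_master ACTIVE_RECOVERY_MASTERS RECOVERY_THRESHOLD
# C1 model: after the revoke, at least max(1, threshold) must remain.
can_revoke_master() {
    active=$1
    floor=$2
    if [ "$floor" -lt 1 ]; then floor=1; fi
    [ $((active - 1)) -ge "$floor" ]
}

# can_set_threshold ACTIVE_RECOVERY_MASTERS NEW_THRESHOLD
# C2 model: the threshold may never exceed the active count.
can_set_threshold() {
    [ "$2" -le "$1" ]
}
```

With 3 active masters and a threshold of 2 (primary, companion, spare), revoking the spare is allowed, but a further revoke from the resulting 2-of-2 state would be refused.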

New addresses (PR commit 5834c1d 'fix(stage-2): codex adversarial review'):
- P256Verifier: 0xda5b772f9d6c09abe80414eea908612df9b54749
- K11Verifier: 0x5a441431f08e0f5f5ed10659620cb4e0e814e627
- SidecarRegistry: 0x1ac62f1c2d828476a5d784e850a700dc1f17e0be
- AgentKeysScope: 0xd44b375daefc65768f417d0f0125b68d5ba7df3b
- K3EpochCounter: 0x6c9e675c699a06acefbc156afdee6bfbfe32ccb3
- CredentialAudit: 0x63c4545ac01c77cc74044f25b8edea3880224577

Previously-deployed instances (bc232ebcb47fa672aa2a1b2b0481c7ff9a86531b et al) are now abandoned. They have the pre-codex-fix ABI, which is incompatible: the DeviceEntry layout changed (added k11RpIdHash field). Operator's primary master must re-register via harness/scripts/heima-register-first-master.sh against the new SidecarRegistry; companion + spare flows then continue normally.

…dev)

Dev-only co-location of the 4 service workers on the same EC2 box as the broker, behind per-worker nginx vhosts. CLAUDE.md: "for production, we will isolate all the services for the security issue". The per-subdomain layout is the migration seam, so a future move to dedicated hosts only needs the A record + IAM principal to change.

Topology:
- broker.litentry.org :8091 agentkeys-broker
- signer.litentry.org :8092 agentkeys-signer
- audit.litentry.org :9092 agentkeys-worker-audit (Merkle relay)
- email.litentry.org :9093 agentkeys-worker-email (SES + S3 inbox)
- cred.litentry.org :9094 agentkeys-worker-creds (credential CRUD)
- memory.litentry.org :9095 agentkeys-worker-memory (memory CRUD)

setup-broker-host.sh: builds + installs the 4 worker binaries, auto-generates worker-{creds,memory}.env with stable KEK secrets (preserved across re-runs so existing blobs stay decryptable), writes 4 systemd units, writes 4 nginx vhosts via shared write_worker_nginx_site(), and probes /healthz on each port post-restart. New CLI flags: --audit-host, --email-host, --cred-host, --memory-host, --chain-rpc, --vault-bucket, --memory-bucket, --scope-addr, --registry-addr, --k3-counter-addr, --without-workers. Re-runs without flags now re-read previously-configured values from /etc/agentkeys/worker-{creds,memory}.env so the script stays idempotent for non-default deployments.

dns-upsert-workers.sh (NEW): single atomic Route 53 change-batch UPSERT for all 4 A records. Validates the caller is on agentkeys-admin, refuses RFC1918 / TEST-NET-2 (Cloudflare WARP / Zscaler / corporate VPN) EIPs, waits for Route 53 INSYNC + Cloudflare DoH propagation before exiting.

verify-workers.sh (NEW): laptop-side end-to-end check: DNS resolves via Cloudflare DoH → TLS cert is Let's Encrypt → /healthz returns HTTP 200 with the per-worker expected body marker. Exits non-zero with per-failure diagnostics. --no-tls for the HTTP-only first-pass phase.

worker-audit/main.rs + worker-email/main.rs: GET /healthz → "ok" so probe_or_die can verify boot (worker-creds + worker-memory already had it).

operator-workstation.env: derive WORKER_{AUDIT,EMAIL,CRED,MEMORY}_HOST + AGENTKEYS_WORKER_*_URL from $BROKER_HOST, mirroring the SIGNER_HOST pattern.

docs/cloud-setup.md: new §1.4 (TOC row) + §7 "Service workers" with the concern table (mirrors §6 signer), §7.1 DNS one-shot helper, §7.2 TLS cert loop + nginx flip, §7.3 verification. Existing §7 Cleanup → §8.

heima-scope-set.sh + heima-scope-revoke.sh: graceful skip with {"ok":true,"skipped":"no-webauthn-k11"} when no mode:webauthn K11 is enrolled, so harness/v2-stage1-demo.sh (default stub mode) is fully CI-automatable without operator Touch ID.

worker-creds and worker-memory both call profile_env() for all THREE contract addresses (SidecarRegistry, AgentKeysScope, K3EpochCounter) at state construction, verified live by the boot failure on the broker host:

  Error: SIDECAR_REGISTRY_ADDRESS_HEIMA must be set
  Caused by: environment variable not found

The auto-generated /etc/agentkeys/worker-creds.env was only writing SCOPE_CONTRACT_ADDRESS_HEIMA and omitted the other two. Fixed. Also added AGENTKEYS_CHAIN=heima to both env files so the chain-profile resolution is explicit instead of relying on the worker-side default (matches what the existing chain helpers do).

New step exercises the 4 co-located service workers as a tier-A relay: queue 2 audit events → flush → on-chain CredentialAudit.appendRoot → verify rootCount + getRoot match. Plus an email worker /healthz + /inbox smoke. Stage-1 demo: STEP_TOTAL 15 → 16, new step 15 between audit-append and summary; summary renumbered to step 16. Stage-2 demo: STEP_TOTAL 10 → 11, new step 10 between M-of-N revoke and cleanup; cleanup renumbered to step 11. scripts/heima-worker-smoke.sh (NEW) — drives the full flow: 1. precheck both workers' /healthz 2. POST 2 events → audit worker /v1/audit/append 3. POST /v1/audit/flush/<operator_omni> → Merkle root + leaves 4. cast send CredentialAudit.appendRoot from operator master wallet 5. cast call rootCount + getRoot to verify on-chain root matches flush 6. GET /v1/email/inbox/<actor_omni> as soft-warn smoke (the broker EC2 IAM lacks s3:ListBucket on the inbox bucket today — out-of-scope follow-up; worker is deployed + /healthz green so the demo continues without breaking the chain green-bar) Live-tested 4 rounds against Heima Mainnet — rootCount progressed 0→1→2→3→4→5→6→7→8 across stage-1 + stage-2 runs with all 8 on-chain Merkle roots verified by getRoot() readback. Idempotency: every re-run is a clean skip (no chain mutation) or adds a fresh tier-A root. Sibling fixes (same bug class — stale DeviceEntry struct offsets after codex H1 added k11RpIdHash + k11PubX + k11PubY): heima-agent-create.sh + heima-device-revoke.sh — switched the idempotency check from hex-offset slicing of getDevice() to the typed isActive(bytes32)(bool) view. The old code read offset 320 for registeredAt; after the struct grew, registeredAt now lives at offset 512, so the offset-based check always returned 'not yet registered' on re-run and registerAgentDevice reverted with DeviceAlreadyRegistered (0xa98bbce0). isActive is struct-agnostic. 
heima-scope-set.sh + heima-scope-revoke.sh — when USE_WEBAUTHN=0 (stub mode) AND the local K11 file is mode=webauthn (from a prior real ceremony), skip cleanly instead of triggering Touch ID. Demo stub-mode runs on a laptop with prior webauthn enrollment were otherwise prompting for Touch ID and dying on the dismissed dialog. The 'stub-mode-refuses-touchid' skip payload makes this explicit.

Closes the OIDC isolation gap from PR #92 review (issue #90 Q1 + Q3): the broker had full federation infrastructure (handlers/oidc.rs, mint.rs, sts.rs) but the workers bypassed it — every S3 call went through the broker EC2 instance profile, so the per-actor IAM scoping defined in provision-vault-role.sh's PrincipalTag policy was never exercised. Worker code change (backwards compatible): crates/agentkeys-worker-creds/src/aws_creds.rs (NEW) - OptionalStsCreds axum extractor: parses three optional headers X-Aws-Access-Key-Id X-Aws-Secret-Access-Key X-Aws-Session-Token Returns None if any are missing (partial = error, refuse to mint a half-authed S3 client). - StsCreds::build_s3_client(region) — per-request S3 client backed by the passed-through STS creds. - s3_for_request(default, region, override) — falls back to the default instance-profile client when override is None. - 4 unit tests covering header presence / absence / partial. crates/agentkeys-worker-creds/src/handlers.rs cred_store + cred_fetch + cred_teardown — accept OptionalStsCreds, use the per-request client when present. crates/agentkeys-worker-memory/src/handlers.rs memory_put + memory_get + memory_teardown — same pattern; re-exports aws_creds from agentkeys_worker_creds (no duplication). Backward compat: requests without the three X-Aws-* headers fall back to state.s3 (instance profile) — existing stage-1 + stage-2 demo flows keep working unchanged. harness/v2-stage3-demo.sh (NEW, 8 steps) End-to-end OIDC isolation proof on Heima Mainnet: 1. SIWE wallet_sig auth → session JWT 2. POST /v1/mint-oidc-jwt → STS-compatible web identity token 3. AssumeRoleWithWebIdentity → STS creds tagged with PrincipalTag/agentkeys_actor_omni = derive_omni(master wallet) 4. POSITIVE: PUT s3://vault/bots/<own actor_omni>/credentials/… → HTTP 200 5. NEGATIVE: PUT s3://vault/bots/<wrong actor_omni>/credentials/… → AccessDenied (IAM rejects cross-actor write — the proof) 6+7. 
Same positive+negative pair on the memory bucket — soft-skip when memory bucket not yet provisioned (follow-up). 8. Cleanup with admin profile. Live-tested against Heima Mainnet. Step 5 verified: AWS IAM itself rejected the cross-actor PUT with AccessDenied — proves the ${aws:PrincipalTag/agentkeys_actor_omni} scoping in scripts/provision-vault-role.sh works as designed. Even if a worker were compromised, it could not write to another actor's prefix when using STS creds passed through from the broker mint flow. Architectural answers to the review (#90 Q1 + Q2): Q1 ("is OIDC disrupted by the new service isolation design?"): Was, yes — workers bypassed federation. NOW WIRED. Workers respect STS creds when passed; fall back to instance profile otherwise so existing stage-1+2 flows are unchanged. Q2 ("why does broker need s3:ListBucket — Lambda should sort incoming email into per-actor folders"): User is right architecturally. The 500 we soft-warned on in /v1/email/inbox is the symptom of the same OIDC bypass — the email worker uses instance profile and tries global ListObjects without scoping. Architecturally correct flow: SES inbound → Lambda sorts to bots/<actor>/inbound/ → email worker reads via OIDC-scoped STS creds, never global ListBucket. The fix is the same shape as this PR — pass-through STS creds via X-Aws-* headers — but is left as a follow-up: this PR ships the plumbing + proves OIDC works end-to-end; wiring the email worker + Lambda routing is a separate change. Tracked in #90 followups.
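The three-way header contract described above (no headers, all three, or a partial set) can be sketched in plain Rust. This is a minimal std-only illustration, not the axum extractor from aws_creds.rs: the `StsHeaderState` enum, the `classify_sts_headers` function, and the lowercase-keyed `HashMap` input are assumptions made for the example.

```rust
use std::collections::HashMap;

/// Pass-through STS credentials (field names are illustrative).
pub struct StsCreds {
    pub access_key_id: String,
    pub secret_access_key: String,
    pub session_token: String,
}

/// Hypothetical three-state result of inspecting the X-Aws-* headers.
pub enum StsHeaderState {
    /// No header present: fall back to the default (instance-profile) client.
    Absent,
    /// All three present: build a per-request client from these creds.
    Complete(StsCreds),
    /// One or two present: the caller should reject the request.
    Partial,
}

/// Classify a request's headers. Keys are assumed to be lowercased already.
pub fn classify_sts_headers(headers: &HashMap<String, String>) -> StsHeaderState {
    let get = |name: &str| headers.get(name).cloned();
    let aki = get("x-aws-access-key-id");
    let sak = get("x-aws-secret-access-key");
    let sst = get("x-aws-session-token");
    match (aki, sak, sst) {
        (None, None, None) => StsHeaderState::Absent,
        (Some(a), Some(s), Some(t)) => StsHeaderState::Complete(StsCreds {
            access_key_id: a,
            secret_access_key: s,
            session_token: t,
        }),
        // Any mix of present and missing headers is a half-authed request.
        _ => StsHeaderState::Partial,
    }
}
```

Keeping `Partial` distinct from `Absent` is the point of the sketch: a caller that collapses both into "no creds" would silently fall back to the instance profile for a half-authed request.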

Addresses 2 of 4 codex adversarial findings on commit 913179a:

[P2 - downgrade attack] The aws_creds.rs OptionalStsCreds extractor silently fell back to the broker EC2 instance profile when the caller omitted the X-Aws-* headers. A malicious caller could deliberately drop the headers to bypass the OIDC-scoped IAM session and get broker-wide S3 access. Fix: the `AGENTKEYS_WORKER_REQUIRE_STS=1` env var puts the worker in strict mode, where every request must carry all three X-Aws-* headers or it gets HTTP 401. Also: partial header sets (1 or 2 of 3 present) ALWAYS reject with 401 regardless of strict mode, because silently dropping half-passed creds is the same downgrade surface. Default off for backward compat; production deploys should turn it on.

[P3 - credential leak via Debug] StsCreds previously derived Debug, so any future tracing::debug! or dbg!() call would log secret_access_key and session_token verbatim. The custom Debug impl now redacts both and shows only the access_key_id prefix (which AWS CloudTrail records anyway).

New tests:
  - debug_redacts_secret_and_session_token (asserts the Debug output doesn't contain the secret bytes; <redacted> marker present)
  - parser_distinguishes_no_headers_from_partial (locks the extractor's contract: no headers = backward compat, partial = always reject)

Two codex findings deliberately left as follow-ups, not fixed in this commit:

[P2 - memory worker OIDC not proven] The harness only mints agentkeys-vault-role creds, which scope to the vault bucket only. The memory worker writes to a separate memory bucket, which isn't covered. A dedicated agentkeys-memory-role with the same tag-scoping pattern is the architecturally correct fix; tracked as PR followup.

[P2 - vault bucket policy allows whole-bucket ListBucket] In scripts/apply-vault-bucket-policy.sh:109. Pre-existing and separate from this PR's surface. Adding an s3:prefix=bots/${aws:PrincipalTag/…} condition to the bucket-policy ListBucket statement closes the cross-actor key-name enumeration. Filed for the bucket-policy hardening followup.
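The P3 fix amounts to replacing a derived Debug with a hand-written one. A minimal sketch of that pattern follows; the struct fields match the names mentioned above, but the 4-character prefix length and the exact output layout are assumptions, not the behavior of the real impl.

```rust
use std::fmt;

pub struct StsCreds {
    pub access_key_id: String,
    pub secret_access_key: String,
    pub session_token: String,
}

impl fmt::Debug for StsCreds {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Keep only a short prefix of the key id (4 chars is an assumed length).
        let prefix: String = self.access_key_id.chars().take(4).collect();
        f.debug_struct("StsCreds")
            .field("access_key_id", &format_args!("{}...", prefix))
            // The two secret fields are never formatted, only a fixed marker.
            .field("secret_access_key", &format_args!("<redacted>"))
            .field("session_token", &format_args!("<redacted>"))
            .finish()
    }
}
```

Because the secrets never reach the formatter, any `{:?}` of the struct, including one buried in a larger derived Debug, prints the marker instead of the value.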

Lands the two findings deferred from commit 18e709b. Both verified live on Heima Mainnet via the extended harness/v2-stage3-demo.sh (11 steps, all green). [P2 — memory worker OIDC scoping] NEW agentkeys-memory-role + dedicated memory bucket, mirroring the vault data-class layout per arch.md §17.2. A future memory-worker compromise now cannot reach the credentials bucket and vice versa. scripts/provision-memory-bucket.sh (NEW) — mirror of provision-vault-bucket.sh scripts/provision-memory-role.sh (NEW) — federated trust + 3-statement inline policy scoped to $MEMORY_BUCKET/bots/${PrincipalTag}/memory/* scripts/apply-memory-bucket-policy.sh (NEW) — v3 bucket policy [P2 — bucket-policy ListBucket whole-bucket allow] Was: one statement listed [Get, Put, Delete, ListBucket] under one Resource[bucket, bucket/...] with NO s3:prefix condition — any tagged session could enumerate all keys. Now: SPLIT into two statements: VaultListV3 / MemoryListV3 — ListBucket ONLY, on the bucket ARN, Condition StringLike s3:prefix = bots/${PrincipalTag}/<class>/* VaultObjectsV3 / MemoryObjectsV3 — Get/Put/Delete on the prefixed-object ARN, no prefix condition (resource ARN already scopes) scripts/apply-vault-bucket-policy.sh (UPDATED) — v2 → v3 split scripts/apply-memory-bucket-policy.sh (NEW) — v3 split from day one Demo extended (harness/v2-stage3-demo.sh, STEP_TOTAL 8 → 11): step 3: mint TWO STS sessions (vault role + memory role) step 4-5: vault PUT positive (own) + negative (other) — pre-existing step 6: vault LIST negative (other prefix → AccessDenied) — codex P2 verifier step 7-8: memory PUT positive (own) + negative (other) step 9: memory LIST negative (other prefix → AccessDenied) step 10: cross-role isolation — vault creds → memory bucket → AccessDenied + memory creds → vault bucket → AccessDenied step 11: cleanup Also: `expect_access_denied` helper distinguishes IAM-rejection (AccessDenied / HTTP 403) from setup-bug failures (NoCredentialsErr, NoSuchBucket, InvalidAccessKeyId, 
TokenRefreshRequired). Naive `grep AccessDenied` would pass on any failure — codex's exact warning. operator-workstation.env: + MEMORY_BUCKET=agentkeys-memory-${ACCOUNT_ID} + MEMORY_ROLE_ARN=arn:aws:iam::${ACCOUNT_ID}:role/agentkeys-memory-role Live-tested 2026-05-20 on Heima Mainnet: - memory bucket created (AssumedArn=…agentkeys-memory-role) - vault-bucket policy v2 → v3 swap (2 statements live) - memory-bucket policy v3 from scratch (2 statements live) - 11/11 demo steps green: [4] vault PUT own prefix → SUCCEEDED [5] vault PUT other prefix → AccessDenied [6] vault LIST other prefix → AccessDenied [7] memory PUT own prefix → SUCCEEDED [8] memory PUT other prefix → AccessDenied [9] memory LIST other prefix → AccessDenied [10] vault creds → memory bucket → AccessDenied [10] memory creds → vault bucket → AccessDenied
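The distinction `expect_access_denied` draws can be sketched as a small classifier. The real helper is a bash function in the harness; the Rust below is only an illustration of the decision order, and the `DenialCheck` enum, the function name, and the `(status, output)` inputs are assumptions.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum DenialCheck {
    /// IAM itself rejected the call: the negative test passes.
    IamRejected,
    /// The call failed for a setup reason: the negative test must NOT pass.
    SetupBug(&'static str),
    /// Failed for some other reason: also not proof of IAM scoping.
    Unrecognized,
}

/// Setup-bug markers listed in the commit message above.
const SETUP_BUG_MARKERS: [&str; 4] = [
    "NoCredentialsErr",
    "NoSuchBucket",
    "InvalidAccessKeyId",
    "TokenRefreshRequired",
];

/// Classify a failed S3 call from its optional HTTP status and error output.
pub fn classify_denial(status: Option<u16>, output: &str) -> DenialCheck {
    // Check setup bugs first so they can never be mistaken for a denial.
    for marker in SETUP_BUG_MARKERS.iter() {
        if output.contains(*marker) {
            return DenialCheck::SetupBug(*marker);
        }
    }
    if output.contains("AccessDenied") || status == Some(403) {
        return DenialCheck::IamRejected;
    }
    DenialCheck::Unrecognized
}
```

Only `IamRejected` counts as a pass; treating "the command failed" as success is exactly the false positive the helper exists to prevent.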

All three demos (stage-1, stage-2, stage-3) green on Heima Mainnet after the codex review fixes. Clippy clean on worker-creds + worker-memory. PR ready to merge.

User's call-out — "the cred encryption and decryption is not tested". Stage-3 previously proved IAM scoping at the AWS layer but skipped the worker's AES-256-GCM envelope, so the actual encrypt→S3→decrypt path through the HTTP API was unexercised. The envelope.rs primitive has 8 unit tests, but the wire-protocol roundtrip wasn't. Stage-3 demo extended (STEP_TOTAL 11 → 13): [11] Cred worker encrypt/decrypt roundtrip: 1. mint cred-store cap via POST /v1/cap/cred-store (broker) 2. POST /v1/cred/store with cap + base64(plaintext) → worker KEK-encrypts (AES-256-GCM, AAD-bound to operator+actor+service+k3_epoch), S3 PUTs the envelope 3. mint cred-fetch cap via POST /v1/cap/cred-fetch 4. POST /v1/cred/fetch with cap → worker S3 GETs the envelope, KEK-decrypts, returns plaintext 5. assert returned plaintext == original (byte-for-byte) [12] Memory worker encrypt/decrypt roundtrip: same shape against /v1/memory/put + /v1/memory/get. Memory worker has no dedicated cap-mint endpoint yet (follow-up); cred-* caps work against memory because both workers verify the same broker- signed CapToken shape with the same CapOp::Store / CapOp::Fetch. Graceful skip handling: - 'agent scope not set on chain' → skip with 'run stage-1 --webauthn first' - 'AGENTKEYS_CHAIN_RPC_HTTP not set' → skip with 'redeploy broker' - 'DeviceRoleMissing' → skip with 'out-of-scope here' These map cleanly to operator-actionable prerequisites; demo continues green without those steps when prerequisites aren't met, but the prerequisite is reported, not hidden. Broker fix: setup-broker-host.sh now bakes AGENTKEYS_CHAIN + AGENTKEYS_CHAIN_RPC_HTTP into the broker's systemd Environment= lines. Previously the broker process had no chain RPC, so /v1/cap/cred-{store, fetch} hit 502 'RPC URL not set' at request time. This was a pre-existing gap surfaced by exercising the cap-mint path for the first time in this PR — the broker's stand-alone deploy never hit cap.rs's chain check before because no demo step minted caps.

…p 13) Three changes from user review: 1. NEW stage-3 step 13: NEGATIVE broker cap-mint isolation. Try to mint a cap-token with operator_omni != session_omni → expect HTTP 4xx with OperatorMismatch. This proves the MOST UPSTREAM isolation gate works: actor A's session JWT cannot mint caps for actor B. If this ever silently returns 200, every cred + memory blob in S3 is compromised — A could mint B's cap, hand to worker, worker writes under B's prefix. Live-verified on Heima Mainnet 2026-05-20: [13] NEGATIVE cap-mint cross-actor → HTTP 403 OperatorMismatch ✓ Independent of broker redeploy: session-omni check fires BEFORE the chain RPC check in handlers/cap.rs, so this gate works on the current (stale-RPC) broker too. 2. CLAUDE.md — NEW "Per-actor + per-data-class isolation invariants (issue #90)" section codifies the 4-layer defense: Layer 1 — broker cap-mint → session_omni == operator_omni Layer 2 — worker chain-verify → independent re-check of layer 1 Layer 3 — AWS IAM PrincipalTag → s3 resource scoping per-actor Layer 4 — bucket separation → per-data-class IAM roles Test-discipline rule: every PR adding a new worker, data class, or broker auth method MUST extend the stage-3 demo with negative isolation tests for all four layers. Don't ship features with only POSITIVE-path coverage. 3. CLAUDE.md — answers "why no /v1/cap/memory-* endpoint" with a concrete example: cap-tokens are data-class-agnostic. The same Store cap minted for service=openrouter can be POSTed to either /v1/cred/store (writes to vault bucket credentials/) or /v1/memory/put (writes to memory bucket memory/). The URL picks the data class; the cap just authorizes the operation. Adding dedicated memory cap endpoints would add audit clarity ("this cap was minted intending memory access") but no security boundary — isolation comes from the per-data-class IAM roles (layer 4). Deferred until payments-worker forces a third data class.
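The ordering claim in item 1 (the session-omni check fires before the chain RPC check) can be sketched as follows. This is a hedged model of the gate order only, not the code in handlers/cap.rs: `MintError`, `precheck_cap_mint`, and the plain string comparison are assumptions; the 403 and 502 statuses are the ones reported in the commit messages.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum MintError {
    /// Session JWT belongs to a different actor than the requested operator.
    OperatorMismatch,
    /// Broker has no chain RPC configured ("RPC URL not set").
    ChainRpcUnset,
}

impl MintError {
    pub fn http_status(&self) -> u16 {
        match self {
            MintError::OperatorMismatch => 403,
            MintError::ChainRpcUnset => 502,
        }
    }
}

/// Run the two gates in the order described: identity first, then chain config.
pub fn precheck_cap_mint(
    session_omni: &str,
    operator_omni: &str,
    chain_rpc_http: Option<&str>,
) -> Result<(), MintError> {
    // Layer 1: actor A's session must not mint caps for actor B.
    if session_omni != operator_omni {
        return Err(MintError::OperatorMismatch);
    }
    // Only after the identity gate passes does missing chain config matter.
    if chain_rpc_http.map_or(true, |url| url.is_empty()) {
        return Err(MintError::ChainRpcUnset);
    }
    Ok(())
}
```

With this order, a cross-actor request against a broker that lacks its RPC URL still gets the 403, which is why step 13 is independent of the broker redeploy.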

…vault + memory) User callout — "make it explicit that one cannot pollute other permission." Before this commit, cap-tokens didn't carry a data-class binding: a cred-store cap and a memory-put cap were structurally identical. The URL the cap was POSTed to picked the bucket. Isolation lived only at the AWS IAM PrincipalTag + per-data-class IAM-role layer. If the IAM grants were ever accidentally broadened, cross-data-class pollution would slip through silently. Now: data_class is a SIGNED FIELD in the cap payload. The cap layer itself enforces per-data-class isolation, ahead of any AWS call. Schema change (REQUIRED field, no backward compat — coordinated upgrade): enum DataClass { Credentials, Memory } struct CapPayload { ... op: CapOp, data_class: DataClass, // NEW ... } Broker (crates/agentkeys-broker-server/src/handlers/cap.rs): - Add DataClass enum (mirror of worker's), add to CapPayload - mint_cap signature gains data_class param; statically derived per route - NEW endpoints: cap_memory_put + cap_memory_get (mint with DataClass::Memory) - Existing cap_cred_store + cap_cred_fetch mint with DataClass::Credentials Broker routes (crates/agentkeys-broker-server/src/lib.rs): + .route("/v1/cap/memory-put", post(cap_memory_put)) + .route("/v1/cap/memory-get", post(cap_memory_get)) Worker side (crates/agentkeys-worker-creds/src/verify.rs): - Add DataClass enum + field to CapPayload + DataClassMismatch error - NEW pub fn check_data_class(token, expected) — symmetric with check_op - Tests: data_class_serializes_snake_case + check_data_class_accepts_match + check_data_class_rejects_cross_class Worker handlers (worker-creds + worker-memory): - verify_cap now calls check_data_class with their respective class: worker-creds → DataClass::Credentials worker-memory → DataClass::Memory - Reject mismatched caps with HTTP 403 cap_data_class_mismatch Demo extension (harness/v2-stage3-demo.sh, STEP_TOTAL 14 → 16): [11] cred encrypt/decrypt roundtrip — now uses /v1/cap/cred-store [12] 
memory encrypt/decrypt roundtrip — now uses /v1/cap/memory-put (NEW endpoint) [14] NEW negative test: mint cred-class cap, POST to /v1/memory/put → expect HTTP 403 cap_data_class_mismatch [15] NEW negative test: mint memory-class cap, POST to /v1/cred/store → expect HTTP 403 cap_data_class_mismatch CLAUDE.md ("Per-actor + per-data-class isolation invariants"): Replaced "why no memory cap-mint endpoint" section (now obsolete) with "Cap-tokens are data-class-explicit" — explains the 4-endpoint shape, shows the concrete reject example, justifies route-per-class over a data_class query param (broker can't accidentally mint the wrong variant from a typed-route handler). Tests: worker-creds verify::tests — 14/14 (3 new for DataClass) broker-server handlers::cap::tests — 24/24 (1 new for data_class serialization) cargo build -p worker-creds -p worker-memory -p broker-server — exit 0 Live deploy: requires broker host redeploy via setup-broker-host.sh to pick up the new mint_cap signature + new memory routes. The stage-3 demo steps 14+15 will skip cleanly until the redeploy lands — the isolation IS enforced (workers reject cred-class caps), but the new endpoints don't exist on the current broker yet.
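A minimal sketch of the data-class guard, assuming a trimmed-down `CapPayload` with only the two fields relevant here. The real `check_data_class(token, expected)` takes the full token and the payload carries more fields (elided as `...` above); `wire_name` is a hypothetical stand-in for the snake_case serialization.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DataClass {
    Credentials,
    Memory,
}

impl DataClass {
    /// Hypothetical helper mirroring the snake_case wire form.
    pub fn wire_name(self) -> &'static str {
        match self {
            DataClass::Credentials => "credentials",
            DataClass::Memory => "memory",
        }
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CapOp {
    Store,
    Fetch,
}

/// Reduced payload: only the fields this sketch needs.
pub struct CapPayload {
    pub op: CapOp,
    pub data_class: DataClass,
}

#[derive(Debug, PartialEq, Eq)]
pub enum VerifyError {
    DataClassMismatch,
}

/// Each worker passes its own class as `expected`.
pub fn check_data_class(payload: &CapPayload, expected: DataClass) -> Result<(), VerifyError> {
    if payload.data_class == expected {
        Ok(())
    } else {
        Err(VerifyError::DataClassMismatch)
    }
}
```

In this model worker-creds would call `check_data_class(&cap, DataClass::Credentials)` and worker-memory would pass `DataClass::Memory`, so a cap minted for one class fails at the other worker before any S3 call.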

After redeploying with the data_class change (commit 690f54c), step 11 of the stage-3 demo surfaced a SECOND broker-side env gap:

  HTTP 502 from /v1/cap/cred-store:
  {"error":"SIDECAR_REGISTRY_ADDRESS_HEIMA unset","reason":"chain_rpc_error"}

The broker's handlers/cap.rs reads three contract addresses at request time to verify device + scope + k3_epoch on chain:

  - SIDECAR_REGISTRY_ADDRESS_HEIMA
  - SCOPE_CONTRACT_ADDRESS_HEIMA
  - K3_EPOCH_COUNTER_ADDRESS_HEIMA

Before this commit, setup-broker-host.sh baked AGENTKEYS_CHAIN_RPC_HTTP into the broker systemd unit but NOT the contract addresses. The cap-mint code path had never been exercised before this PR, so the gap went unnoticed.

Fix (setup-broker-host.sh): add the three contract addresses to the broker's Environment= block, pulled from $REGISTRY_ADDR / $SCOPE_ADDR / $K3_COUNTER_ADDR (already populated earlier in the script via the sourced scripts/operator-workstation.env). The operator's operator-workstation.env stays the single source of truth for contract addresses across laptop + broker host.

The stage-3 demo also gets a sibling skip-detection (harness/v2-stage3-demo.sh) so steps 11+12+14+15 cleanly skip with the redeploy-broker message instead of failing on this specific error shape.

To unblock the stage-3 worker encrypt/decrypt + cross-class-rejection tests after this commit:

  ssh broker.litentry.org "cd ~/agentKeys && git pull && bash scripts/setup-broker-host.sh --yes"

…H1 alignment)

Closes user-reported step-11 regression after broker redeploy: cap-mint returned HTTP 403 with body:

  {"error":"device is not active on chain", "reason":"device_not_active"}

Same bug class I fixed earlier in scripts/heima-agent-create.sh + scripts/heima-device-revoke.sh (commit 0981a88). Both the broker's handlers/cap.rs::parse_device_entry AND the worker's crates/agentkeys-worker-creds/src/verify.rs::parse_device_entry were still slicing the OLD 7-word DeviceEntry layout. After codex H1 inserted 4 new fields (k11CredId, k11RpIdHash, k11PubX, k11PubY), the struct grew to 11 ABI words, but neither parser was updated.

  word 0   operatorOmni   bytes32
  word 1   actorOmni      bytes32
  word 2   k11CredId      bytes32
  word 3   k11RpIdHash    bytes32  (NEW, codex H1)
  word 4   k11PubX        uint256  (NEW)
  word 5   k11PubY        uint256  (NEW)
  word 6   tier           uint8    (padded)
  word 7   roles          uint8    (padded)
  word 8   registeredAt   uint64   (padded)
  word 9   lastSignCount  uint32   (padded)
  word 10  revoked        bool     (padded)

Before this commit, both parsers read:

  roles        → word 4 (which is now k11PubX)
  registeredAt → word 5 (which is now k11PubY, always 0 for agents)
  revoked      → word 6 (which is now tier)

For agent devices (k11PubX = k11PubY = 0), registeredAt parsed as 0 → broker returned DeviceNotActive even though the device WAS active.

Fix: both parsers now read from the correct 11-word offsets + check hex.len() >= 11 * 64.

Tests updated:

  worker-creds verify::tests::parse_device_entry_decodes_well_formed → construct an 11-word raw response (was 7)
  broker handlers::cap::tests::parse_device_entry_decodes_well_formed → same
  broker handlers::cap::tests::parse_device_entry_detects_revoked → same

All 4 green.

Live deploy: requires broker host redeploy via setup-broker-host.sh so the broker picks up the new parse_device_entry. The worker code change ships with the broker redeploy (same setup-broker-host.sh rebuild).
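The 11-word layout above can be decoded with plain string slicing. The sketch below is an illustration of the corrected offsets, not the parser in cap.rs or verify.rs: the reduced `DeviceEntry` struct, the `Option` return type, and the `is_active` rule (registeredAt non-zero and not revoked) are assumptions inferred from the description.

```rust
/// Reduced view of the on-chain entry: only the fields the fix touches.
#[derive(Debug, PartialEq, Eq)]
pub struct DeviceEntry {
    pub roles: u8,
    pub registered_at: u64,
    pub revoked: bool,
}

const WORD_HEX: usize = 64; // one 32-byte ABI word as hex chars
const WORDS: usize = 11; // DeviceEntry size after codex H1

/// Return the hex chars of ABI word `idx`. Caller has checked length + ASCII.
fn word(hex: &str, idx: usize) -> &str {
    &hex[idx * WORD_HEX..(idx + 1) * WORD_HEX]
}

pub fn parse_device_entry(raw: &str) -> Option<DeviceEntry> {
    let hex = raw.strip_prefix("0x").unwrap_or(raw);
    // Reject short (for example old 7-word) responses instead of misreading them.
    if hex.len() < WORDS * WORD_HEX || !hex.is_ascii() {
        return None;
    }
    // ABI integers are right-aligned, so read the low-order hex chars.
    let roles = u8::from_str_radix(&word(hex, 7)[62..], 16).ok()?;
    let registered_at = u64::from_str_radix(&word(hex, 8)[48..], 16).ok()?;
    let revoked = u64::from_str_radix(&word(hex, 10)[48..], 16).ok()? != 0;
    Some(DeviceEntry { roles, registered_at, revoked })
}

/// Assumed activity rule: registered and not revoked.
pub fn is_active(entry: &DeviceEntry) -> bool {
    entry.registered_at != 0 && !entry.revoked
}
```

Reading word 5 instead of word 8 for `registered_at` on an agent device would yield 0 here as well, which reproduces the DeviceNotActive symptom described above.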

Step 11 surfaced the codex P2 downgrade-attack defense WORKING AS INTENDED: cap-mint succeeded, the worker AES-encrypted, then S3 PUT returned 502 "s3_put: service error" because the worker fell back to the broker EC2 instance profile (which deliberately lacks s3:PutObject on the vault bucket).

The codex P2 fix in commit 18e709b added OptionalStsCreds + the AGENTKEYS_WORKER_REQUIRE_STS strict-mode env var. Workers correctly demand per-request OIDC-minted STS creds. The stage-3 demo's step 11+12 cred_memory_roundtrip helper wasn't passing them.

Fix: stage-3 step 11 (cred roundtrip) now passes vault-role STS creds and step 12 (memory roundtrip) passes memory-role STS creds, both via the three X-Aws-* headers the worker's OptionalStsCreds extractor reads:

  -H 'x-aws-access-key-id: $aki'
  -H 'x-aws-secret-access-key: $sak'
  -H 'x-aws-session-token: $sst'

The STS creds were already minted in step 3 (vault + memory sessions written to $STATE_DIR/{aki,sak,sst}.{vault,memory}); steps 11+12 just read the right file pair based on the kind (cred → vault, memory → memory) and forward them as headers.

After this commit, steps 11+12 should land green end-to-end:

  broker cap-mint   → 200 (chain checks pass)
  worker cap-verify → 200 (broker_sig + chain re-verify)
  worker S3 PUT     → 200 (using per-actor STS creds, NOT instance profile)
  byte-for-byte roundtrip assertion holds.

…match) Step 11 surfaced the second layer of the OIDC isolation chain working as designed: cap-mint succeeded (broker authorized operator→agent), worker AES-encrypted, then S3 PUT returned 502 because the STS creds were minted from the OPERATOR'S session JWT (tagged with operator's actor_omni) but the cap's actor_omni — and hence the S3 key path — is the AGENT'S. IAM saw ${PrincipalTag/agentkeys_actor_omni} = 941c… trying to PUT bots/82a0…/credentials/… and rejected with AccessDenied. This is the IAM enforcing what the cap-token expresses: "operator authorized the agent to do this op; the agent must be the one actually doing it." Both layers must agree on actor_omni. Fix (stage-3 cred_memory_roundtrip helper): 1. Read agent_private_key from the demo-agent file 2. SIWE-sign as the agent against the broker (POST /v1/auth/wallet/start with the agent's address, sign with cast wallet sign using agent_private_key, POST /v1/auth/wallet/verify → session JWT for the agent) 3. Mint OIDC JWT via /v1/mint-oidc-jwt — this JWT now carries sub=agent_omni and PrincipalTag/agentkeys_actor_omni=agent_omni 4. AssumeRoleWithWebIdentity against the right data-class role (VAULT_ROLE_ARN for cred, MEMORY_ROLE_ARN for memory) — STS creds now tagged with the agent's actor_omni 5. Forward these creds via X-Aws-* headers to the worker Now the worker's S3 PUT against bots/<agent>/credentials/… uses STS creds with PrincipalTag=agent_omni → IAM allows. The architectural lesson, recorded in the commit because it'll bite again: when a cap-token authorizes actor A's action and the worker uses STS creds to touch S3, the STS creds MUST be minted using A's identity — operator's authorization (cap-token) + actor's identity (STS creds) jointly satisfy the workflow. Per arch.md §17.2 layer 3, the IAM PrincipalTag is bound to the JWT subject, NOT to whoever the JWT-issuer (operator) chose to authorize.
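The rule that the STS creds must carry the acting agent's tag can be illustrated with a toy model of the prefix scoping. This is not how IAM evaluates policy; it only mirrors the `bots/<actor_omni>/<class>/` key shape described in this PR, and the function name and the placeholder omni strings are hypothetical.

```rust
/// Toy model: a session tagged with `principal_tag` may touch only keys under
/// bots/<principal_tag>/<class_dir>/ (class_dir is "credentials" or "memory").
pub fn tag_allows_key(principal_tag: &str, class_dir: &str, key: &str) -> bool {
    if principal_tag.is_empty() {
        return false;
    }
    let prefix = format!("bots/{}/{}/", principal_tag, class_dir);
    key.starts_with(&prefix)
}
```

Under this model, creds tagged with the operator's omni fail against `bots/<agent>/credentials/…` even though the cap-token authorizes the write, which is the mismatch step 11 hit before the helper switched to agent-side STS creds.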

Codex round-2 review flagged the demo as 'needs-attention' — it could report 16/16 green while silently skipping the actual encrypt/decrypt + cross-class assertions. Three findings, all addressed: [high] Worker roundtrip checks could be skipped + still claim coverage cred_memory_roundtrip used `skip ...; return 0` on five prereq-missing paths (no agent file, no scope, broker missing chain RPC, broker missing contract addresses, DeviceRoleMissing). Final summary still claimed AES-256-GCM byte-for-byte coverage as if the path had run. Fix: introduce STRICT default + `--allow-skip` opt-in. All five prereq paths now call prereq_missing(), which: - in strict mode: prints fail + records 'fail' outcome + returns non-zero - in --allow-skip mode: prints skip + records 'skip' outcome (dev iter) Final summary now prints actual per-step outcomes from STEP_OUTCOMES[], and exits non-zero if any step failed (or any step skipped in strict). [high] Negative cap-class tests (steps 14, 15) accepted ANY non-200 Previously: cred-class cap → memory worker with non-200 + non-canonical error was accepted ('non-200 = pass for negative test'). A down worker, wrong URL, 404 route, auth middleware failure, or malformed request would all silently satisfy the demo without proving check_data_class fired. Fix: require HTTP 400/401/403 AND the canonical cap_data_class_mismatch error string. Any other response = die. [medium] Cross-actor cap-mint test (step 13) accepted generic rejection Previously: any 4xx accepted, even when error text was non-canonical; 502 (broker stale) silently skipped, hiding a real config issue. Fix: require HTTP 400/401/403 with canonical OperatorMismatch. 502 with config-missing body now dies (forces redeploy), not skip. Other 502/non-canonical errors = die (negative tests can't pass on an unrelated failure). Plus: positive steps (4, 7, 11+12 happy paths) now call record_ok so the summary lists EVERY step that actually proved its assertion. 
The expect_access_denied helper records too. The summary table is built from actual execution, not a static claim of coverage. The structural change here is: skips and infrastructure failures both become demo failures unless the operator explicitly opts in. CI runs default-strict. Dev iteration uses --allow-skip when bringing up a partial environment.

…nvocation

Two small bugs in the strict-mode summary added by c55ea29:

1. Used `local` inside the `if should_run_step 16` block (not a function body), so bash printed:

     harness/v2-stage3-demo.sh: line 864: local: can only be used in a function

   AFTER the per-step outcome table tried to render. The 16 steps all ran correctly + the demo exited 0, but the summary table itself never printed. Fix: drop the `local` keyword and just use plain vars.

2. The "DEMO COMPLETE" header would print even when no steps had been recorded (e.g. `--from-step 16` to test the summary block in isolation). Now distinguishes:

   - all green (nok>0, nskip=0, nfail=0) → DEMO COMPLETE
   - some skipped (--allow-skip) → DEMO PARTIAL
   - any failure → DEMO FAILED + exit 1
   - no steps run at all → NO STEPS EXERCISED + advisory
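The outcome bookkeeping behind the summary can be sketched as below. The harness itself is bash (STEP_OUTCOMES[] and prereq_missing are shell constructs); this Rust version only models the decision table, and the `Outcome` enum and `summary_header` name are assumptions. Exit codes other than the stated "DEMO FAILED + exit 1" are not modeled.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Ok,
    Skip,
    Fail,
}

/// Strict mode (default) turns an unmet prerequisite into a failure;
/// --allow-skip records it as a skip instead.
pub fn prereq_missing(allow_skip: bool) -> Outcome {
    if allow_skip {
        Outcome::Skip
    } else {
        Outcome::Fail
    }
}

/// Pick the summary header from the recorded per-step outcomes.
pub fn summary_header(outcomes: &[Outcome]) -> &'static str {
    let count = |wanted: Outcome| outcomes.iter().filter(|o| **o == wanted).count();
    let nok = count(Outcome::Ok);
    let nskip = count(Outcome::Skip);
    let nfail = count(Outcome::Fail);
    if nfail > 0 {
        "DEMO FAILED"
    } else if nskip > 0 {
        "DEMO PARTIAL"
    } else if nok > 0 {
        "DEMO COMPLETE"
    } else {
        // Nothing was recorded, so there is nothing to claim coverage for.
        "NO STEPS EXERCISED"
    }
}
```

Deriving the header from recorded outcomes, instead of printing a fixed banner, is what keeps an empty or partially skipped run from reporting DEMO COMPLETE.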

Codex round-3 review caught a regression I missed in c55ea29: [high] Strict demo still skips cross-class isolation checks without recording failure (steps 14 + 15) Previously fixed cred_memory_roundtrip's prereq paths to use prereq_missing (so strict mode fails-hard), but left steps 14 + 15 calling bare `skip` for the same prereq classes: - missing demo-agent file - 'not.*scope' (chain scope not set) - 'RPC URL not set' (broker stale) - 'SIDECAR_REGISTRY_ADDRESS_HEIMA unset' (broker missing contract addrs) Because those skips didn't append to STEP_OUTCOMES, a full run could report 'DEMO COMPLETE' with nskip=0 even when neither cross-data-class isolation gate had been exercised. That's the same false-success failure mode codex round-2 flagged, just in a different code path — exactly the kind of regression strict-mode tracking is meant to catch. Fix: extracted the entire step 14/15 body into a cross_class_rejection() helper function. All prereq paths now route through prereq_missing (matching cred_memory_roundtrip's pattern), so: - strict mode (default): unmet prereqs → die + STEP_OUTCOMES records 'fail' - --allow-skip mode: unmet prereqs → skip + STEP_OUTCOMES records 'skip' - successful negative test → STEP_OUTCOMES records 'ok' Step 14: cross_class_rejection cred-store /v1/memory/put memory cred cred-to-mem Step 15: cross_class_rejection memory-put /v1/cred/store cred memory mem-to-cred Live-verified on Heima Mainnet (2026-05-20): all 13 STEP_OUTCOMES recorded, DEMO COMPLETE, exit 0. Steps 14+15 still pass with canonical 403 cap_data_class_mismatch error confirmation (no change to the positive-path assertion logic — only the skip paths got tightened).

…-mode correct)

Codex round-4 finding (high): Cross-class negative test omits required STS headers, so strict workers reject before the data-class guard.

The axum extractor order is: OptionalStsCreds → Json<Req> → handler body (verify_cap). With AGENTKEYS_WORKER_REQUIRE_STS=1 (the production deployment setting documented in aws_creds.rs) the extractor rejects header-less requests with HTTP 401 BEFORE verify_cap runs. The cross-class data-class guard inside verify_cap never fires.

Today the live test passes because the broker host workers don't have AGENTKEYS_WORKER_REQUIRE_STS=1 set. So we're proving the data-class guard against dev-config workers but NOT against the prod target. That's exactly the 'demo says complete, prod silently broken' failure mode the codex review pipeline keeps catching.

Fix: cross_class_rejection() now:

1. Mints agent-side STS creds for the TARGET worker's role:
   - step 14 (memory worker target) → memory-role STS
   - step 15 (cred worker target) → vault-role STS
2. Passes all three X-Aws-* headers in the POST to the worker.

Worker request order now:

a. OptionalStsCreds extractor: valid headers present → Some(creds) → OK (passes regardless of AGENTKEYS_WORKER_REQUIRE_STS=1 setting)
b. verify_cap:
   - check_op (Store) → OK
   - check_data_class (cap.data_class != worker's class) → REJECT → HTTP 403 cap_data_class_mismatch
c. S3 op never runs (verify_cap returned error)

The data-class guard provably fires now, in BOTH strict and non-strict worker configurations. Codex's concern was correct.

Refactored mint_agent_sts_for_role() as a shared helper so the cross_class test reuses the same SIWE+OIDC+STS flow as cred_memory_roundtrip. Same auth chain, same trust boundary, same code path: no inconsistency between positive (cred_memory_roundtrip) and negative (cross_class) tests.

Live-verified 2026-05-20 on Heima Mainnet: 13 STEP_OUTCOMES recorded, all ok, DEMO COMPLETE. Steps 14+15 still return canonical 403 cap_data_class_mismatch with the STS headers correctly passed through, confirming the data-class guard fires AFTER extractor authentication passes.
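The ordering argument above can be modelled with a small shell function. This is only a toy model of the check order, written under assumptions: the real logic lives in the Rust workers (axum extractors plus verify_cap), `worker_response` is a hypothetical name, and the `200` printed for the success path is a placeholder rather than a documented response.

```sh
# Hypothetical model of the worker's check order; not the real Rust handler.
# $1 = AGENTKEYS_WORKER_REQUIRE_STS value (1 = strict)
# $2 = 1 if the request carries the X-Aws-* headers, else 0
# $3 = data_class in the cap-token
# $4 = data class served by the target worker
worker_response() {
  # a. Extractor stage: a strict worker rejects header-less requests first.
  if [ "$1" = "1" ] && [ "$2" != "1" ]; then
    echo "401"
    return
  fi
  # b. verify_cap stage: the data-class guard is only reached past stage a.
  if [ "$3" != "$4" ]; then
    echo "403 cap_data_class_mismatch"
    return
  fi
  # c. Only at this point would the S3 operation run (placeholder status).
  echo "200"
}
```

In this model, `worker_response 1 0 cred memory` yields `401`, which is the masked case codex flagged: the test sees a rejection, but not from the data-class guard. With headers present (`worker_response 1 1 cred memory`) the answer is `403 cap_data_class_mismatch` for both strict and non-strict settings.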

…variants (§17.5)

Codifies the issue #90 outcomes into the canonical architecture spec (per CLAUDE.md "arch.md as source of truth" rule):

§15.1 + §15.2, credentials-service + memory-service: added the OIDC federation paragraph. X-Aws-* header passthrough is the production auth surface (codex P2 downgrade fix); strict mode forces it via AGENTKEYS_WORKER_REQUIRE_STS=1. Cross-links to §17.5.

§17.5 (NEW), per-data-class cap-token binding:

- Cap-token's data_class field + the 4 broker endpoints
- 4-layer defense-in-depth table (broker cap-mint, worker chain-verify, AWS IAM PrincipalTag, per-data-class buckets)
- Each layer's canonical test in harness/v2-stage3-demo.sh
- Test-discipline rule: new data classes MUST add negative isolation tests across all 4 layers
- Two design rationales spelled out:
  a) Why route-per-class beats a single endpoint with a data_class query-param (eliminates user-input attack surface)
  b) Why agent-side STS creds are mandatory (PrincipalTag must match the cap's actor_omni; operator-side STS won't satisfy IAM)

Plus the trailing Cargo.lock entry from aws-credential-types being a direct dep of worker-creds (added in commit 913179a).
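Design rationale (a) can be illustrated with a short sketch. It assumes the route prefixes `/v1/cred/` and `/v1/memory/` seen in the step 14/15 invocations; the `class_for_route` function is hypothetical and is not taken from the spec or the workers.

```sh
# Hypothetical sketch: the data class is a property of the route itself,
# so nothing supplied in the request can select a different class.
class_for_route() {
  case "$1" in
    /v1/cred/*)   echo cred ;;
    /v1/memory/*) echo memory ;;
    *)            return 1 ;;   # unknown route: no class, caller rejects
  esac
}
```

A request to `/v1/cred/store?data_class=memory` still resolves to `cred` here, because the query string is never consulted. That is the attack surface a single endpoint with a data_class parameter would have to defend explicitly.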

WildmetaAgent added 30 commits May 19, 2026 13:48

harness: log phase-1 acceptance for PR #92 (3-demo verification)

a85101a

All three demos (stage-1, stage-2, stage-3) green on Heima Mainnet after the codex review fixes. Clippy clean on worker-creds + worker-memory. PR ready to merge.

WildmetaAgent added 9 commits May 20, 2026 23:05

harness: log codex round-2 fix + 13/13 stage-3 strict-mode verification

f9e19d8

hanwencheng merged commit 3408a14 into main May 20, 2026
1 check passed

hanwencheng merged 39 commits into main from claude/adoring-lehmann-3b1922

2 participants
