issue #90: co-locate audit/email/cred/memory workers on broker host (dev) #92
Merged
…-N recovery + companion daemon
P-256 ECDSA verify on-chain via pure-Solidity Jacobian-coords
implementation (no EIP-7212 precompile dependency; Heima is at London
EVM). ~654k gas per verify, sufficient for master-mutation frequency.
RFC 6979 test vectors pass.
K11Verifier extracts WebAuthn challenge from clientDataJSON at known
byte offset (daimo-style), reconstructs msgHash, calls P256Verifier.
Binds K11 sig to operation challenge to prevent replay.
SidecarRegistry: splits into registerFirstMasterDevice +
registerAdditionalMasterDevice + revokeAgentDevice + revokeMasterDevice
(M-of-N quorum gated by recoveryThreshold). Stores k11PubX/k11PubY +
lastSignCount per device. Per-operator nonce + monotonic sign-count
defend against replay.
AgentKeysScope: K11Assertion struct gates setScopeWithWebauthn /
revokeScope; per-(operator, agent) scopeNonce binds K11 sig to current
state.
CLI: K11ChainAssertion struct + assert_webauthn_for_chain() extracts
(r, s, msgHash, pubX, pubY, authData, clientDataJSON, challengeLocation,
signCount) for chain submission. New --rp-id flag enables companion
credentials at companion.localhost (distinct platform keychain entry).
--emit-chain-payload outputs JSON for cast tx construction.
Daemon: new --master-companion mode runs a second daemon instance with
its own K10 + K11 at rp_id=companion.localhost. Serves HTTP API:
GET /v1/companion/whoami: emits device identity
POST /v1/companion/approve: runs WebAuthn ceremony, returns chain payload
Scripts:
scripts/heima-device-add.sh: register companion as 2nd master
scripts/heima-set-recovery-threshold.sh: raise threshold to N
scripts/heima-recovery.sh: M-of-N master-device revoke
Harness:
harness/v2-stage2-demo.sh: idempotent 8-step demo
28 forge tests pass (P256: 8, K11: 6, AgentKeysV1: 14). Stage-2 demo
runs green in stub mode and re-runs green (idempotent). Full --webauthn
flow requires Touch ID + post-deploy contract addresses.
Closes part of #90:
- On-chain P-256 verify of K11 assertions
- Multi-master M-of-N recovery quorum
- Multi-master pairing flow (companion daemon as mobile-app alternative)
Deferred to follow-up PRs:
- audit-service worker (tier A Merkle relay)
- email-service worker
- K3 rotation operational runbook
- Existing scripts/heima-{device-register,scope-set,scope-revoke}.sh
migration to new contract surface (their K11 args changed shape)
Adds docs/v2-stage2-heima-deploy-and-test.md walking the operator
through redeploying the stage-2 contract set on Heima Mainnet,
re-bootstrapping the primary master, running the stage-2 demo, and
exercising the M-of-N recovery flow. Inherits all env setup from
docs/v2-stage1-migration-and-demo.md (no parallel test environment).
Harness fixes from the first dry-run:
- harness/v2-stage2-demo.sh step 5 simplifies to script-existence
sanity check in stub mode (was: invoking dry-run which fails on
missing companion K11 file).
- harness/v2-stage2-demo.sh step 7 same: verifies recovery script is
invocable without requiring live chain state.
- scripts/heima-device-add.sh adds a dry-run path that doesn't require
the companion K11 file (uses placeholder pubkey).
- scripts/heima-recovery.sh adds a dry-run path that doesn't require
the deployer mnemonic / ethers node_modules.
Result: bash harness/v2-stage2-demo.sh --stub --skip-build runs all 8
steps green and is idempotent on re-run.
Stage-2 demo now owns the full lifecycle end-to-end:
- step 3: idempotent contract deploy (skips if already on chain;
--redeploy forces fresh deploy; reads addresses from broadcast file;
writes them to scripts/operator-workstation.env)
- step 4: idempotent primary-master bootstrap via new
scripts/heima-register-first-master.sh (calls registerFirstMasterDevice
with K11 pubX/pubY loaded from the operator's enrollment JSON)
- step 5-8 unchanged: companion daemon spin-up, 2nd-master register,
recoveryThreshold update, recovery dry-run
- step 9: summary with all deployed addresses
Now actually deployed to Heima Mainnet (verified live):
P256Verifier: 0xb74f0aaf9b72b4e7da872f77c63d805bf1937190
K11Verifier: 0x73446fc9919a0a539b8b08dbda615a64b796ca4f
SidecarRegistry: 0x9306c524a5e5c33e9a905b956204207ccaf7a7a1
AgentKeysScope: 0x1276b94f57fd4086670d66acb8c75058176df399
K3EpochCounter: 0x66c08748a6cfa14d9fefaaf5147e41a98db24f53
CredentialAudit: 0xe827ba44931aef8c6f3abfec6b90ecf59f797576
Primary master registered on the new SidecarRegistry, tx
0x5f3a79bc970062ec74aa0deb5618f8a527f638a6d24ba3c4144f09a49600876d
(block 9623082).
Re-runs are idempotent: all 9 steps log 'skip'/'ok' without
re-submitting any tx.
The four scripts only referenced by harness/v2-stage2-demo.sh now live
under harness/scripts/, the same place as the orchestrator that calls
them. Operator-facing stage-1 helpers in scripts/ stay put.
scripts/heima-device-add.sh → harness/scripts/heima-device-add.sh
scripts/heima-recovery.sh → harness/scripts/heima-recovery.sh
scripts/heima-register-first-master.sh → harness/scripts/heima-register-first-master.sh
scripts/heima-set-recovery-threshold.sh → harness/scripts/heima-set-recovery-threshold.sh
The moved scripts compute REPO_ROOT from two levels up
(harness/scripts/<f>.sh → repo root via /../..); the demo paths were
updated to point at the new harness/scripts/ location.
Hardened the deploy-presence check in step 3:
- Distinguishes RPC failure (exit nonzero) from "no code at address"
(exit zero with "0x").
- RPC failure → retry up to 8 times with 3s sleep → die rather than
redeploy on uncertain state.
- "No code" → genuine; trigger redeploy as before.
Heima's RPC hits TLS-handshake-EOF transients regularly; this fix
prevents an unnecessary redeploy that would orphan the previous set.
Same hardening on the balance check in step 3.
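A minimal sketch of the hardened presence check, written in Python rather than the harness's bash. The `get_code` probe, the `RpcUnavailable` exception, and the reading of "retry up to 8 times" as 8 attempts in total are assumptions made for illustration; only the 3s sleep and the "0x" convention come from the message above.

```python
import time
from typing import Callable, Optional


class RpcUnavailable(Exception):
    """The RPC never answered, so the on-chain state is unknown."""


def code_present(
    get_code: Callable[[], Optional[str]],
    attempts: int = 8,
    delay_s: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if bytecode exists, False if the RPC answered "0x".

    get_code is a hypothetical probe: it returns the hex bytecode string on
    success and None when the RPC call itself failed (nonzero exit).
    """
    for attempt in range(attempts):
        code = get_code()
        if code is not None:
            # The RPC answered; "0x" is a genuine "no code at address".
            return code != "0x"
        if attempt < attempts - 1:
            sleep(delay_s)
    # Unknown state: die instead of redeploying and orphaning the old set.
    raise RpcUnavailable(f"no RPC answer after {attempts} attempts")
```

Only a definite "0x" answer leads to a redeploy; an exhausted retry budget raises instead, which mirrors the "die rather than redeploy on uncertain state" rule.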
…8 message
Stage-2 demo step 5 now derives the companion's on-chain
device_key_hash from its K11 cose-pubkey (cast keccak
<cose_pubkey_hex>) and passes it to the daemon via
--companion-device-key-hash. The daemon's /v1/companion/whoami then
returns the real hash that registerAdditionalMasterDevice will use as
the storage key, so the later revoke flow can find the device on chain.
Stage-2 demo step 8: clearer skip message + when --webauthn is set,
prints the companion's device_key_hash + the exact re-run command for
executing the revoke. The previous message implied --webauthn alone
would do something; really we need a target hash too.
…files
Adds harness/scripts/_lib.sh with resolve_master_key():
- $HEIMA_DEPLOYER_KEY_FILE env var (raw hex or mnemonic)
- ~/.agentkeys/heima-deployer.key (raw hex, used by stage-1 operator)
- ./test-hei (mnemonic, legacy)
Patches the 3 scripts that previously only handled mnemonic files:
- heima-device-add.sh
- heima-set-recovery-threshold.sh
- heima-recovery.sh (preserves --dry-run placeholder path)
Fixes a real bug: scripts died with 'missing mnemonic' on operators
that bootstrapped from a raw private key (the stage-1 path stores the
deployer key at ~/.agentkeys/heima-deployer.key, not a mnemonic at
./test-hei).
Also fixes step 8's stale whoami file: always curl fresh so the
device_key_hash hint reflects the currently-running daemon, not a
prior run where the daemon hadn't been started with the real hash.
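The lookup order above can be sketched as follows. This is a Python illustration, not the bash in _lib.sh: the `read_file` callback, the rule that a missing file falls through to the next candidate, and the "64 hex chars, optional 0x" test for a raw key are all assumptions.

```python
import re
from typing import Callable, List, Mapping, Optional, Tuple

# Fallback locations, in the order listed in the commit message.
FALLBACK_PATHS = ("~/.agentkeys/heima-deployer.key", "./test-hei")
# Assumed shape of a raw private key: 64 hex chars, optional 0x prefix.
RAW_KEY = re.compile(r"(0x)?[0-9a-fA-F]{64}")


def resolve_master_key(
    env: Mapping[str, str],
    read_file: Callable[[str], Optional[str]],
) -> Tuple[str, str]:
    """Return (kind, secret), where kind is "raw" or "mnemonic".

    read_file is a hypothetical helper returning the file content, or None
    when the file does not exist.
    """
    paths: List[str] = []
    if env.get("HEIMA_DEPLOYER_KEY_FILE"):
        paths.append(env["HEIMA_DEPLOYER_KEY_FILE"])
    paths.extend(FALLBACK_PATHS)
    for path in paths:
        content = read_file(path)
        if content is None:
            continue  # try the next candidate
        secret = content.strip()
        kind = "raw" if RAW_KEY.fullmatch(secret) else "mnemonic"
        return kind, secret
    raise FileNotFoundError("no deployer key or mnemonic file found")
```

Classifying the content rather than the path is what lets the env-var entry hold either a raw key or a mnemonic.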
Bug 1 (root cause of step 7 K11VerificationFailed reverts):
assert_webauthn_for_chain was passing the 32-byte expected_challenge
as a "message" to assert_webauthn_inner_parts, which sha256'd it again
before using as the WebAuthn challenge. The on-chain K11Verifier
expects the WebAuthn challenge to BE the operation challenge (no extra
hash); double-hashing made clientDataJSON.challenge != expected_b64 →
ChallengeMismatch / verifyAssertion returns false → contract reverts
with K11VerificationFailed.
Fix: refactored assert_webauthn_inner_parts to take a [u8; 32]
challenge directly. The legacy assert_webauthn_inner path sha256's the
message itself before calling (preserves existing behavior).
assert_webauthn_for_chain passes the expected_challenge through
unchanged.
Bug 2 (step 6 cast send "invalid string length"):
The companion daemon was receiving an empty --companion-k11-cred-id
(demo didn't pass it), so /v1/companion/whoami returned
k11_cred_id="". The brittle xxd|head|sed pipeline in
heima-device-add.sh produced an all-zeros bytes32 by accident, but the
demo's tuple construction had other issues that confused the cast
parser.
Fix: demo step 5 now computes the cred-id hash from the K11 file
(keccak256-style sha256 of the b64url credential id) and passes it to
the daemon via --companion-k11-cred-id. heima-device-add.sh uses the
hash directly from whoami without re-encoding.
Also bumped the empty attestation arg from "0x" to "0x00" (cast
tolerates the latter more consistently).
Added a sanity-check loop in heima-device-add.sh that validates each
bytes32 arg has length 66 before invoking cast, so future malformed
inputs fail with a clear error rather than cast's opaque parser msg.
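Bug 1 can be reproduced in a few lines. The sketch below is a Python illustration of the mismatch only; the function names are hypothetical and the real code paths are the Rust functions named above.

```python
import base64
import hashlib


def b64url(data: bytes) -> str:
    """Unpadded base64url, the encoding used for clientDataJSON.challenge."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def challenge_buggy(expected_challenge: bytes) -> str:
    # Pre-fix: the 32-byte challenge was treated as a message and hashed again.
    return b64url(hashlib.sha256(expected_challenge).digest())


def challenge_fixed(expected_challenge: bytes) -> str:
    # Post-fix: the operation challenge IS the WebAuthn challenge.
    if len(expected_challenge) != 32:
        raise ValueError("challenge must be exactly 32 bytes")
    return b64url(expected_challenge)


def verifier_accepts(client_challenge: str, expected_challenge: bytes) -> bool:
    # Stand-in for the on-chain comparison against expected_b64.
    return client_challenge == b64url(expected_challenge)
```

With any 32-byte operation challenge, the fixed path round-trips and the buggy path yields a different base64url string, which is the ChallengeMismatch described above.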
WebAuthn assert page now surfaces the role + RP ID prominently so the
operator can't confuse which credential they're about to sign with:
- Color: blue accent for PRIMARY MASTER (rp_id=localhost), purple for
COMPANION MASTER (rp_id=companion.localhost)
- Role badge at the top of the card with emoji + label
- Dedicated RP-ID callout warning to verify the Touch ID prompt matches
the displayed RP
- Button text reads "Sign as PRIMARY MASTER" / "Sign as COMPANION MASTER"
- Page <title> includes the role so the OS tab list shows it
The M-of-N recovery flow opens TWO browser windows in quick succession
(one for each daemon's K11 ceremony); without this distinction the
operator could tap the wrong Touch ID prompt and silently produce an
assertion the contract rejects.
Stage-2 demo grows from 9 to 10 steps and now exercises the full
M-of-N revocation path as part of the default --webauthn flow:
Step 8 NEW — Register synthetic 3rd master (the "spare").
The spare is a fresh P-256 keypair generated via openssl, NOT a
real WebAuthn passkey. It registers as a 3rd master with roles=3
(CAP_MINT|RECOVERY) via primary K11 sig (1 Touch ID at localhost).
State persists at /tmp/agentkeys-spare-current/ for step 9.
Why synthetic: the spare is "lost" by design — never needs to
sign for its own revocation (primary + companion provide the
quorum). Skipping its WebAuthn enrollment saves a Touch ID
without weakening the test of any contract surface.
Step 9 NEW — Revoke spare via 2-of-2 quorum.
Calls heima-recovery.sh with target=spare hash. The script:
- Asks primary K11 to sign OP_REVOKE_MASTER challenge (1 Touch ID
at localhost — UI shows PRIMARY MASTER badge).
- Asks companion daemon /v1/companion/approve to sign same
challenge (1 Touch ID at companion.localhost — UI shows
COMPANION MASTER badge).
- Submits revokeMasterDevice(spareHash, [primarySig, companionSig]).
- Contract verifies 2-of-2 quorum + bumps operatorNonce.
Post-tx verify: isActive(spare) == false.
Step 10 NEW — Cleanup spare local state.
Removes /tmp/agentkeys-spare-current/. The on-chain entry stays
as revoked=true (audit trail — no on-chain delete by design).
End state after a successful run:
- 2 active masters: primary (roles=7) + companion (roles=3)
- 1 revoked master: spare (roles=3, revoked=true)
- recoveryThreshold = 2
- operatorNonce += 3 (register-2nd-master, set-threshold, revoke)
Touch IDs on a fresh run: 6 total
- companion enroll (step 5, once per setup)
- companion register (step 6, once per setup)
- set threshold (step 7, once per setup)
- spare register (step 8, fresh per run)
- primary signs spare revoke (step 9)
- companion signs spare revoke (step 9)
Re-run after this completes: steps 1-7 + 10 skip, steps 8-9 generate
a fresh spare (new keypair) and revoke it — 3 Touch IDs per re-run.
This makes the demo a repeatable end-to-end test of the M-of-N path
without bricking the operator's setup.
Once a companion has been revoked on chain (e.g. as part of an M-of-N
quorum test), it can never re-enter the registered-master set under
the same deviceKeyHash. Stage-2 demo now detects this and enrolls a
fresh companion under a bumped rp_id (companion.localhost →
companion-v2.localhost → companion-v3.localhost) so the M-of-N revoke
test in step 9 has 2 distinct ACTIVE masters to form the quorum.
Changes:
- harness/v2-stage2-demo.sh step 5: scans existing K11 files for an
active-on-chain companion. If none found, picks the lowest free
version slot and enrolls a fresh K11 there.
- harness/v2-stage2-demo.sh step 5: passes the computed rp_id to the
daemon via new --companion-rp-id flag.
- crates/agentkeys-daemon/src/companion.rs: rp_id is now stored in
CompanionState + threaded through /v1/companion/whoami responses and
assert_webauthn_for_chain calls.
- crates/agentkeys-daemon/src/main.rs: new --companion-rp-id flag.
- harness/scripts/heima-device-add.sh: reads rp_id from
/v1/companion/whoami and derives the K11 file path from it.
Net effect: re-running the demo after a 2-of-2 revoke now enrolls a
fresh companion-vN, re-establishes a 2-active-master state, and
proceeds with the next spare-revoke cycle without operator hand-fixing.
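The slot selection in step 5 can be sketched as below. This is a Python illustration under stated assumptions: `enrolled` is a hypothetical map from each rp_id that already has a K11 file to whether that companion is still active on chain, and slot 1 is taken to be the unversioned companion.localhost.

```python
from typing import Dict


def rp_id_for_slot(n: int) -> str:
    """Slot 1 is the unversioned name; later slots carry a -vN suffix."""
    if n < 1:
        raise ValueError("slots start at 1")
    return "companion.localhost" if n == 1 else f"companion-v{n}.localhost"


def pick_companion_rp_id(enrolled: Dict[str, bool]) -> str:
    """Reuse an active companion if one exists, else take the lowest free slot."""
    for rp_id, active_on_chain in enrolled.items():
        if active_on_chain:
            return rp_id
    n = 1
    while rp_id_for_slot(n) in enrolled:
        n += 1  # this slot holds a revoked companion; it can never re-enter
    return rp_id_for_slot(n)
```

For example, `pick_companion_rp_id({"companion.localhost": False})` returns `companion-v2.localhost`, matching the bump sequence in the message above.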
Enables harness/v2-stage1-demo.sh to run green against the new
SidecarRegistry + AgentKeysScope contracts deployed in stage 2.
Changes:
- heima-device-register.sh becomes a thin wrapper: forwards to
harness/scripts/heima-register-first-master.sh when no first master is
registered; logs skip otherwise. The pre-stage-2 registerMasterDevice()
was split into registerFirstMasterDevice +
registerAdditionalMasterDevice; this script handles the former.
- heima-device-revoke.sh: detects master vs agent target and delegates
accordingly. Agent revoke uses the new revokeAgentDevice (no K11
needed). Master revoke delegates to heima-recovery.sh which collects
the M-of-N K11 quorum.
- heima-scope-set.sh: real WebAuthn ceremony, computes the contract's
expected_challenge per OP_SET_SCOPE encoding (servicesDigest +
scopeNonce + chainid), builds K11Assertion struct, calls new ABI
(bytes K11 -> struct). Stub bytes no longer satisfy the gate.
- heima-scope-revoke.sh: same migration as scope-set, computing
OP_REVOKE_SCOPE challenge.
- All four scripts now use harness/scripts/_lib.sh's
resolve_master_key, supporting both raw-key files
(~/.agentkeys/heima-deployer.key) and mnemonic files (./test-hei).
Effect: operator can now run
`bash harness/v2-stage1-demo.sh --webauthn`
against the same Heima Mainnet deployment that stage-2 uses,
exercising the full operator lifecycle (init -> register -> agent ->
scope -> audit) on the new contracts.
scripts/heima-k3-rotate.sh: operator-driven K3 epoch advance via
K3EpochCounter.advanceEpoch(). Idempotent (--target-epoch N skips if
currentEpoch >= N), supports dry-run, signs from the wallet that is
the contract's signerGovernance.
docs/runbook-k3-rotation.md: step-by-step operator runbook covering
prerequisites, the one-command flow, post-rotation verification, when
to rotate (quarterly hygiene + TEE-compromise indicator), lazy vs
eager re-encryption trade-offs, and the stage-3 migration path to move
signerGovernance from EOA to M-of-N multisig.
Verified end-to-end on Heima Mainnet (dry-run): K3EpochCounter at
0xeacc97d4e7854c52d4736e5fba2dc7c2c2b147d9 has currentEpoch=1 and
signerGovernance points at the deployer.
Contract surface (CredentialAudit.sol):
- New `appendRoot(operatorOmni, merkleRoot, batchEntryCount)` stores a
per-operator AuditRoot entry, emits AuditRootAppended. Operators
reconstruct per-event proofs from leaves in S3.
- New `verifyEntryInRoot(operatorOmni, rootIndex, proof[], leaf)`
validates a sorted-pairs Merkle proof on chain. Matches OpenZeppelin
convention so the Rust-side proof emission is directly verifiable
without further transformation.
- Existing `append()` per-event path (tier C) untouched.
Forge test test_CredentialAudit_AppendRoot_AndVerifyMembership covers
the round-trip with a 4-leaf tree.
New crate agentkeys-worker-audit:
- `merkle.rs`: minimal Merkle root + proof helpers using keccak256 with
sorted-pairs encoding (matches the contract verifier byte-for-byte).
Doc tests + 4 unit tests pass.
- `state.rs`: per-operator in-memory event queue with flush semantics.
Drains the queue, computes Merkle root, writes per-event leaves +
proofs to a JSONL file at /tmp/audit-leaves-<root>.jsonl.
- `handlers.rs`: HTTP surface
POST /v1/audit/append — queue event
POST /v1/audit/flush/:operator — drain one queue
POST /v1/audit/flush-all — drain all queues
- `main.rs`: bind axum at 127.0.0.1:9092; periodic auto-flush every
--flush-interval-secs (default 300s; 0 = manual only). Each flush
logs the Merkle root + leaves path. Chain submission via
`cast send appendRoot` is operator-driven (separate from this
process so the worker doesn't need a deployer key).
End-state: operators wanting per-event-tx semantics keep using tier C
(`heima-credential-audit.sh` direct write). Operators wanting batched
gas (one tx per N events / per 5min) point their daemon at this worker
and emit per-event POSTs; the worker computes roots and the operator
periodically submits roots via `cast send`.
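A minimal sketch of the sorted-pairs root, proof, and verification round-trip described for merkle.rs and verifyEntryInRoot. Two assumptions are swapped in and should be read as such: SHA-256 stands in for keccak256 (which Python's stdlib does not provide), and an unpaired node on an odd-sized level is promoted unchanged. Neither is confirmed by the commit message.

```python
import hashlib
from typing import List


def h(data: bytes) -> bytes:
    # Stand-in hash: the worker and contract use keccak256, not SHA-256.
    return hashlib.sha256(data).digest()


def hash_pair(a: bytes, b: bytes) -> bytes:
    """Sorted-pairs convention: hash the smaller value first."""
    lo, hi = (a, b) if a <= b else (b, a)
    return h(lo + hi)


def build_levels(leaves: List[bytes]) -> List[List[bytes]]:
    """Return every tree level, leaves first, root level last."""
    if not leaves:
        raise ValueError("cannot build a tree from an empty batch")
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        # Assumption: a node with no sibling is carried up unchanged.
        levels.append([
            hash_pair(cur[i], cur[i + 1]) if i + 1 < len(cur) else cur[i]
            for i in range(0, len(cur), 2)
        ])
    return levels


def proof_for(levels: List[List[bytes]], index: int) -> List[bytes]:
    """Collect the sibling at each level on the path from leaf to root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling < len(level):
            proof.append(level[sibling])
        index //= 2
    return proof


def verify(root: bytes, leaf: bytes, proof: List[bytes]) -> bool:
    node = leaf
    for sibling in proof:
        node = hash_pair(node, sibling)
    return node == root
```

Because each pair is sorted before hashing, the proof needs no left/right flags, which is why the same proof bytes can be checked on chain without further transformation.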
New crate agentkeys-worker-email. Surfaces:
POST /v1/email/send
Body: { from, to[], subject, body_text, body_html? }
Wraps aws-sdk-sesv2::SendEmail with the operator's SES identity
(must be verified per the #83 setup workflow). Returns the SES
message_id.
GET /v1/email/inbox/:actor_omni
Lists objects under s3://$AGENTKEYS_VAULT_BUCKET/bots/<actor_omni>/inbound/.
Inbound routing itself is the SES routing Lambda from #83; this
worker only exposes what's already been delivered to S3.
CLI args:
--bind default 127.0.0.1:9093
--inbox-bucket env AGENTKEYS_VAULT_BUCKET, required
Builds against aws-sdk-sesv2 1.118 + aws-sdk-s3 1.132. No new
dependencies introduced at the workspace level (aws-config + s3 are
already used by worker-creds).
Operator workflow: spin up alongside worker-creds + worker-memory on
the broker host, route per-agent outbound mail through this worker
instead of having each agent directly call SES. Cap-token verification
on /v1/email/send is left as a follow-up (current shape assumes the
worker is on a private interface — operators expose it only on the
sidecar daemon's localhost, same as worker-creds).
Live E2E test of scripts/heima-k3-rotate.sh per agentkeys-harness skill:
- Round 1: epoch 1 → 2 (1 tx)
- Round 2: epoch 2 → 3 (1 tx)
- Round 3: target=3 (already there) → skip, no tx, 0 gas
- Round 4: target=6 (3-step advance) → 3 txs
Total: 5 real txs on K3EpochCounter =
0xeacc97d4e7854c52d4736e5fba2dc7c2c2b147d9.
The contract is forward-only by design (no "rotate back"), so the
"back and forth" test is bounded to forward-path correctness + the
idempotency skip on re-targets-to-current. Both work as designed.
K3EpochCounter is now at epoch 6 on Heima Mainnet. The signer enclave
will retain historical K3_v[1..5] for decrypt of pre-rotation blobs;
new writes use K3_v[6].
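The four rounds above follow from one rule: send a tx only while the current epoch is below the target. A Python sketch of that arithmetic, assuming each advanceEpoch() call moves the counter forward by exactly one (consistent with "3-step advance → 3 txs"); the function names are illustrative.

```python
from typing import List, Tuple


def advances_needed(current_epoch: int, target_epoch: int) -> int:
    """Number of advanceEpoch() txs to send; 0 means skip."""
    return max(0, target_epoch - current_epoch)


def run_rounds(start_epoch: int, targets: List[int]) -> Tuple[int, List[int]]:
    """Replay a list of --target-epoch values and count txs per round."""
    epoch, txs = start_epoch, []
    for target in targets:
        n = advances_needed(epoch, target)
        txs.append(n)
        epoch += n  # forward-only: the epoch never decreases
    return epoch, txs
```

`run_rounds(1, [2, 3, 3, 6])` returns `(6, [1, 1, 0, 3])`, i.e. 5 txs in total and a final epoch of 6, matching the live run.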
Two fixes:
1. Enrollment page (serve_enroll_page) now matches the assert-page
visual language — role badge (PRIMARY MASTER blue, COMPANION MASTER
purple), RP-ID surfaced explicitly, button text reads "Enroll as
PRIMARY MASTER" / "Enroll as COMPANION MASTER". Previously the
enrollment page was role-agnostic which made it easy to tap Touch
ID on the wrong RP when re-enrolling.
2. WebAuthn user.name shown in the macOS Touch ID dialog ("Use Touch
ID to sign in to 'localhost' with your passkey for <NAME>") was
previously the full 64-char operator_omni hex, which truncates
awkwardly on screen. Now reads "AgentKeys Primary Master
(0x941cb1c3…)" or "AgentKeys Companion Master (0x941cb1c3…)" —
human-readable + a 10-char omni prefix for cross-operator disambig.
Takes effect on NEW enrollments only — existing credentials retain
whatever user.name was set when they were originally enrolled. To
refresh the display name, delete ~/.agentkeys/k11/<omni>--<rp>.json
and re-enroll.
The "white text in white background" in the macOS Passkey-source
filter row is macOS system UI (the picker for which provider supplies
the passkey — iCloud Keychain, 1Password, etc.); it's outside our HTML
control. The other observed truncation is fixed by this commit.
Operator-facing summary of what K3 rotation does and doesn't change:
- contract addresses, devices, scopes, threshold unchanged
- on-chain epoch counter advances + emits K3Rotated event
- signer enclave retains historical K3 versions for legacy decrypt
- workers swap to new epoch for new writes via SSE
- one-command operator action: `bash scripts/heima-k3-rotate.sh`
- links to full runbook at docs/runbook-k3-rotation.md
- notes the stage 1-2 simplification (KEK from env per §22b.2) means
rotation is forward-compatible but not yet driving worker re-key
Also documents the eager-re-encrypt follow-up gated behind a confirmed
TEE compromise scenario (stage 3 tracked in §22b.5).
Codex flagged 8 findings; 7 are addressed here (C1, C2, C3/M1, H1, H2, M2 +
test coverage). The remaining one (codex H3 "K10+K11") is a false positive:
msg.sender check IS the K10 signature — EVM tx signing is secp256k1 over
the whole tx by the master wallet. Added comments where helpful.
Contract fixes (require redeploy):
C1: SidecarRegistry.revokeMasterDevice — refuse to revoke if it would
leave < max(1, recoveryThreshold) active recovery-capable masters.
Prevents permanent operator stranding.
C2: SidecarRegistry.setRecoveryThreshold — refuse newThreshold >
activeRecoveryMasterCount. Prevents permanent operator stranding
via unsatisfiable quorum.
C3/M1: CredentialAudit.appendRoot — auth-gate by operator's master
wallet (via injected SidecarRegistry reference). Previously any
account could pollute an operator's root list.
H1: K11Verifier.verifyAssertion — three new envelope checks:
- authData[0:32] == expectedRpIdHash (per-credential, stored on
register at DeviceEntry.k11RpIdHash). Prevents cross-RP replay.
- authData[32] has UP|UV flags. Prevents stolen-device-without-
biometric assertions.
- clientDataJSON starts with `{"type":"webauthn.get"`. Prevents
replay of webauthn.create (enrollment) assertions.
M2: CredentialAudit + worker Merkle — domain-separate leaves (0x00
prefix) and internal nodes (0x01 prefix). Prevents an internal-
node digest from impersonating a leaf at shorter depth.
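The three H1 envelope checks can be sketched off chain as follows. This is a Python illustration, not the Solidity in K11Verifier: the function name is hypothetical, the returned error labels are borrowed from the test names listed in this message, and the flag bit positions (UP = 0x01, UV = 0x04) are those defined by the WebAuthn authenticator data layout.

```python
import hashlib
from typing import Optional

FLAG_UP = 0x01  # user present (bit 0 of the flags byte)
FLAG_UV = 0x04  # user verified (bit 2 of the flags byte)
GET_PREFIX = b'{"type":"webauthn.get"'


def check_envelope(
    auth_data: bytes,
    client_data_json: bytes,
    expected_rp_id_hash: bytes,
) -> Optional[str]:
    """Return None when the envelope passes, else the name of the failed check."""
    # authData layout: rpIdHash (32 bytes) | flags (1 byte) | signCount (4 bytes)
    if len(auth_data) < 37 or auth_data[:32] != expected_rp_id_hash:
        return "RpIdHashMismatch"
    required = FLAG_UP | FLAG_UV
    if (auth_data[32] & required) != required:
        return "UserPresenceMissing"
    if not client_data_json.startswith(GET_PREFIX):
        return "WrongClientDataType"
    return None
```

An assertion produced at companion.localhost fails the first check when verified against the primary's stored hash, which is the cross-RP replay the fix is meant to stop.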
ABI changes:
- SidecarRegistry.registerFirstMasterDevice +
registerAdditionalMasterDevice now take an extra bytes32 k11RpIdHash
arg (the operator's K11 enroll rp_id is hashed and stored).
- K11Verifier.verifyAssertion takes the rpIdHash; callers
(SidecarRegistry, AgentKeysScope) read entry.k11RpIdHash.
- CredentialAudit constructor takes the SidecarRegistry address.
Harness changes:
- heima-register-first-master.sh + heima-device-add.sh + heima-register-
spare-master.sh compute sha256(rp_id) from the K11 enrollment file
and pass it as the new arg.
- v2-stage2-demo.sh step 6 + 7 fail-fast on device-add/threshold-set
failures + verify on-chain state matches before advancing to step 9.
Codex H2: previously silent failures could false-green step 9.
Tests:
+ 5 new K11Verifier tests: RpIdHashMismatch, UserPresenceMissing (no
flags, UP-only), WrongClientDataType (webauthn.create), all pass.
+ CredentialAudit_AppendRoot_RejectsNonMaster (vm.prank attacker).
+ Internal-node-as-leaf attack test in both forge + Rust Merkle suite.
- Total: 33 forge tests (was 28), 7 worker-audit unit tests (was 6),
all green.
Deploys will fail against the existing PR #87-deployed contracts —
operator must redeploy via the demo's step 3 (forced) or by running
`bash harness/v2-stage2-demo.sh --redeploy`.
New addresses (PR commit 5834c1d 'fix(stage-2): codex adversarial review'):
P256Verifier: 0xda5b772f9d6c09abe80414eea908612df9b54749
K11Verifier: 0x5a441431f08e0f5f5ed10659620cb4e0e814e627
SidecarRegistry: 0x1ac62f1c2d828476a5d784e850a700dc1f17e0be
AgentKeysScope: 0xd44b375daefc65768f417d0f0125b68d5ba7df3b
K3EpochCounter: 0x6c9e675c699a06acefbc156afdee6bfbfe32ccb3
CredentialAudit: 0x63c4545ac01c77cc74044f25b8edea3880224577
Previously-deployed instances (bc232ebcb47fa672aa2a1b2b0481c7ff9a86531b
et al) are now abandoned. They have the pre-codex-fix ABI which is
incompatible: DeviceEntry layout changed (added k11RpIdHash field).
Operator's primary master must re-register via
harness/scripts/heima-register-first-master.sh against the new
SidecarRegistry; companion + spare flows then continue normally.
…dev)
Dev-only co-location of the 4 service workers on the same EC2 box as the
broker, behind per-worker nginx vhosts. CLAUDE.md: "for production, we
will isolate all the services for the security issue" — the per-subdomain
layout is the migration seam, so a future move to dedicated hosts only
needs the A record + IAM principal to change.
Topology:
broker.litentry.org :8091 agentkeys-broker
signer.litentry.org :8092 agentkeys-signer
audit.litentry.org :9092 agentkeys-worker-audit (Merkle relay)
email.litentry.org :9093 agentkeys-worker-email (SES + S3 inbox)
cred.litentry.org :9094 agentkeys-worker-creds (credential CRUD)
memory.litentry.org :9095 agentkeys-worker-memory (memory CRUD)
setup-broker-host.sh — builds + installs the 4 worker binaries, auto-
generates worker-{creds,memory}.env with stable KEK secrets (preserved
across re-runs so existing blobs stay decryptable), writes 4 systemd
units, writes 4 nginx vhosts via shared write_worker_nginx_site(), and
probes /healthz on each port post-restart. New CLI flags: --audit-host,
--email-host, --cred-host, --memory-host, --chain-rpc, --vault-bucket,
--memory-bucket, --scope-addr, --registry-addr, --k3-counter-addr,
--without-workers. Re-runs without flags now re-read previously-configured
values from /etc/agentkeys/worker-{creds,memory}.env so the script stays
idempotent for non-default deployments.
dns-upsert-workers.sh (NEW) — single atomic Route 53 change-batch UPSERT
for all 4 A records. Validates the caller is on agentkeys-admin, refuses
RFC1918 / TEST-NET-2 (Cloudflare WARP / Zscaler / corporate VPN) EIPs,
waits for Route 53 INSYNC + Cloudflare DoH propagation before exiting.
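The address refusal in dns-upsert-workers.sh can be sketched with Python's ipaddress module. The function name is hypothetical, and treating any non-IPv4 input as unpublishable (A records carry IPv4 only) is an assumption; the blocked ranges are RFC 1918 plus TEST-NET-2 (198.51.100.0/24), as named in the message.

```python
import ipaddress

RFC1918 = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]
TEST_NET_2 = ipaddress.ip_network("198.51.100.0/24")


def is_publishable(ip: str) -> bool:
    """True if the address may be written into a public A record."""
    addr = ipaddress.ip_address(ip)  # raises ValueError on malformed input
    if addr.version != 4:
        return False  # assumption: only IPv4 goes into the A records
    return not any(addr in net for net in RFC1918 + [TEST_NET_2])
```

Checking before the UPSERT keeps an address picked up through a VPN or proxy from being published for all four hosts in one atomic change batch.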
verify-workers.sh (NEW) — laptop-side end-to-end check: DNS resolves via
Cloudflare DoH → TLS cert is Let's Encrypt → /healthz returns HTTP 200
with the per-worker expected body marker. Exits non-zero with per-failure
diagnostics. --no-tls for the HTTP-only first-pass phase.
worker-audit/main.rs + worker-email/main.rs: GET /healthz → "ok" so
probe_or_die can verify boot (worker-creds + worker-memory already had it).
operator-workstation.env: derive WORKER_{AUDIT,EMAIL,CRED,MEMORY}_HOST +
AGENTKEYS_WORKER_*_URL from $BROKER_HOST, mirroring the SIGNER_HOST
pattern.
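One way the derivation could work, sketched in Python. The rule used here (swap the first DNS label of the broker host), the expansion of the `*` into per-worker variable names, and the https scheme are assumptions; the env file itself is shell and may derive the values differently.

```python
from typing import Dict

WORKERS = ("audit", "email", "cred", "memory")


def derive_worker_env(broker_host: str) -> Dict[str, str]:
    """Map e.g. broker.litentry.org to per-worker host and URL variables."""
    _label, dot, domain = broker_host.partition(".")
    if not dot or not domain:
        raise ValueError(f"expected a dotted hostname, got {broker_host!r}")
    env: Dict[str, str] = {}
    for name in WORKERS:
        host = f"{name}.{domain}"
        env[f"WORKER_{name.upper()}_HOST"] = host
        # Assumed variable name and scheme; the source only gives the glob.
        env[f"AGENTKEYS_WORKER_{name.upper()}_URL"] = f"https://{host}"
    return env
```

With `broker.litentry.org` as input this yields `audit.litentry.org`, `email.litentry.org`, `cred.litentry.org`, and `memory.litentry.org`, matching the topology table in this commit.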
docs/cloud-setup.md: new §1.4 (TOC row) + §7 "Service workers" with the
concern table (mirrors §6 signer), §7.1 DNS one-shot helper, §7.2 TLS
cert loop + nginx flip, §7.3 verification. Existing §7 Cleanup → §8.
heima-scope-set.sh + heima-scope-revoke.sh: graceful skip with
{"ok":true,"skipped":"no-webauthn-k11"} when no mode:webauthn K11 is
enrolled, so harness/v2-stage1-demo.sh (default stub mode) is fully CI-
automatable without operator Touch ID.
worker-creds and worker-memory both call profile_env() for all THREE
contract addresses (SidecarRegistry, AgentKeysScope, K3EpochCounter)
at state construction, verified live by the boot failure on broker
host:
Error: SIDECAR_REGISTRY_ADDRESS_HEIMA must be set
Caused by: environment variable not found
The auto-generated /etc/agentkeys/worker-creds.env was only writing
SCOPE_CONTRACT_ADDRESS_HEIMA, omitting the other two; fixed.
Also added AGENTKEYS_CHAIN=heima to both env files so the
chain-profile resolution is explicit instead of relying on the
worker-side default (matches what the existing chain helpers do).
New step exercises the 4 co-located service workers as a tier-A relay:
queue 2 audit events → flush → on-chain CredentialAudit.appendRoot →
verify rootCount + getRoot match. Plus an email worker /healthz +
/inbox smoke.
Stage-1 demo: STEP_TOTAL 15 → 16, new step 15 between audit-append
and summary; summary renumbered to step 16.
Stage-2 demo: STEP_TOTAL 10 → 11, new step 10 between M-of-N revoke
and cleanup; cleanup renumbered to step 11.
scripts/heima-worker-smoke.sh (NEW) — drives the full flow:
1. precheck both workers' /healthz
2. POST 2 events → audit worker /v1/audit/append
3. POST /v1/audit/flush/<operator_omni> → Merkle root + leaves
4. cast send CredentialAudit.appendRoot from operator master wallet
5. cast call rootCount + getRoot to verify on-chain root matches flush
6. GET /v1/email/inbox/<actor_omni> as soft-warn smoke (the broker
EC2 IAM lacks s3:ListBucket on the inbox bucket today — out-of-scope
follow-up; worker is deployed + /healthz green so the demo
continues without breaking the chain green-bar)
Live-tested 4 rounds against Heima Mainnet — rootCount progressed
0→1→2→3→4→5→6→7→8 across stage-1 + stage-2 runs with all 8 on-chain
Merkle roots verified by getRoot() readback. Idempotency: every re-run
is a clean skip (no chain mutation) or adds a fresh tier-A root.
Sibling fixes (same bug class — stale DeviceEntry struct offsets after
codex H1 added k11RpIdHash + k11PubX + k11PubY):
heima-agent-create.sh + heima-device-revoke.sh — switched the
idempotency check from hex-offset slicing of getDevice() to the
typed isActive(bytes32)(bool) view. The old code read offset 320
for registeredAt; after the struct grew, registeredAt now lives at
offset 512, so the offset-based check always returned 'not yet
registered' on re-run and registerAgentDevice reverted with
DeviceAlreadyRegistered (0xa98bbce0). isActive is struct-agnostic.
heima-scope-set.sh + heima-scope-revoke.sh — when USE_WEBAUTHN=0
(stub mode) AND the local K11 file is mode=webauthn (from a prior
real ceremony), skip cleanly instead of triggering Touch ID. Demo
stub-mode runs on a laptop with prior webauthn enrollment were
otherwise prompting for Touch ID and dying on the dismissed
dialog. The 'stub-mode-refuses-touchid' skip payload makes this
explicit.
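The 320 → 512 shift is consistent with three 32-byte ABI words being inserted ahead of registeredAt: 3 × 64 hex characters = 192, and 320 + 192 = 512. The Python sketch below works through that arithmetic; it assumes offsets are counted in hex characters after the 0x prefix and that each added field occupies one word, which is an inference from the numbers above rather than something the message states.

```python
WORD_HEX = 64  # one 32-byte ABI word is 64 hex characters


def shifted_offset(old_offset: int, words_inserted_before: int) -> int:
    """Where a field lands after new words are inserted ahead of it."""
    return old_offset + words_inserted_before * WORD_HEX


def word_at(hex_body: str, offset: int) -> str:
    """Slice one ABI word out of a hex string (no 0x prefix)."""
    return hex_body[offset:offset + WORD_HEX]


# Illustrative layouts: five leading words, then registeredAt.
registered_at = f"{1700000000:064x}"
head = "".join(f"{i:064x}" for i in range(5))
old_layout = head + registered_at
new_layout = head + "ff" * 32 * 3 + registered_at  # three inserted words

old_read = int(word_at(old_layout, 320), 16)    # registeredAt, as before
stale_read = int(word_at(new_layout, 320), 16)  # now lands on an inserted field
fixed_read = int(word_at(new_layout, shifted_offset(320, 3)), 16)
```

The stale read is the failure mode described above: the old offset still returns a word, just not the right one, so nothing errors until the contract reverts. A typed view such as isActive avoids the problem because the ABI decoder, not the script, tracks the layout.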
Closes the OIDC isolation gap from PR #92 review (issue #90 Q1 + Q3):
the broker had full federation infrastructure (handlers/oidc.rs,
mint.rs, sts.rs) but the workers bypassed it. Every S3 call went
through the broker EC2 instance profile, so the per-actor IAM scoping
defined in provision-vault-role.sh's PrincipalTag policy was never
exercised.
Worker code change (backwards compatible):
crates/agentkeys-worker-creds/src/aws_creds.rs (NEW)
- OptionalStsCreds axum extractor: parses three optional headers
X-Aws-Access-Key-Id
X-Aws-Secret-Access-Key
X-Aws-Session-Token
Returns None when the headers are absent; a partial set is an error
(refuse to mint a half-authed S3 client).
- StsCreds::build_s3_client(region): per-request S3 client backed by
the passed-through STS creds.
- s3_for_request(default, region, override): falls back to the default
instance-profile client when override is None.
- 4 unit tests covering header presence / absence / partial.
crates/agentkeys-worker-creds/src/handlers.rs
cred_store + cred_fetch + cred_teardown: accept OptionalStsCreds, use
the per-request client when present.
crates/agentkeys-worker-memory/src/handlers.rs
memory_put + memory_get + memory_teardown: same pattern; re-exports
aws_creds from agentkeys_worker_creds (no duplication).
Backward compat: requests without the three X-Aws-* headers fall back
to state.s3 (instance profile), so existing stage-1 + stage-2 demo
flows keep working unchanged.
harness/v2-stage3-demo.sh (NEW, 8 steps)
End-to-end OIDC isolation proof on Heima Mainnet:
1. SIWE wallet_sig auth → session JWT
2. POST /v1/mint-oidc-jwt → STS-compatible web identity token
3. AssumeRoleWithWebIdentity → STS creds tagged with
PrincipalTag/agentkeys_actor_omni = derive_omni(master wallet)
4. POSITIVE: PUT s3://vault/bots/<own actor_omni>/credentials/… → HTTP 200
5. NEGATIVE: PUT s3://vault/bots/<wrong actor_omni>/credentials/… →
AccessDenied (IAM rejects the cross-actor write; this is the proof)
6+7. Same positive+negative pair on the memory bucket; soft-skip when
memory bucket not yet provisioned (follow-up).
8. Cleanup with admin profile.
Live-tested against Heima Mainnet. Step 5 verified: AWS IAM itself
rejected the cross-actor PUT with AccessDenied, which proves the
${aws:PrincipalTag/agentkeys_actor_omni} scoping in
scripts/provision-vault-role.sh works as designed. Even if a worker
were compromised, it could not write to another actor's prefix when
using STS creds passed through from the broker mint flow.
Architectural answers to the review (#90 Q1 + Q2):
Q1 ("is OIDC disrupted by the new service isolation design?"):
Was, yes: workers bypassed federation. NOW WIRED. Workers respect STS
creds when passed; fall back to instance profile otherwise so existing
stage-1+2 flows are unchanged.
Q2 ("why does broker need s3:ListBucket — Lambda should sort incoming
email into per-actor folders"):
User is right architecturally. The 500 we soft-warned on in
/v1/email/inbox is the symptom of the same OIDC bypass: the email
worker uses instance profile and tries global ListObjects without
scoping. Architecturally correct flow: SES inbound → Lambda sorts to
bots/<actor>/inbound/ → email worker reads via OIDC-scoped STS creds,
never global ListBucket. The fix is the same shape as this PR
(pass-through STS creds via X-Aws-* headers) but is left as a
follow-up: this PR ships the plumbing + proves OIDC works end-to-end;
wiring the email worker + Lambda routing is a separate change. Tracked
in #90 followups.
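The extractor's three outcomes (no headers, all headers, partial set) can be sketched as below. This is a Python illustration of the contract only; the real extractor is the Rust OptionalStsCreds in aws_creds.rs, and the `PartialStsHeaders` exception and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

HEADER_NAMES = (
    "x-aws-access-key-id",
    "x-aws-secret-access-key",
    "x-aws-session-token",
)


class PartialStsHeaders(Exception):
    """Only some of the three X-Aws-* headers were supplied."""


@dataclass(frozen=True)
class StsCreds:
    access_key_id: str
    secret_access_key: str
    session_token: str


def parse_sts_headers(headers: Mapping[str, str]) -> Optional[StsCreds]:
    """None = no override (use the default client); StsCreds = per-request client."""
    lowered = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    values = [lowered.get(name) for name in HEADER_NAMES]
    present = [v for v in values if v]
    if not present:
        return None
    if len(present) != len(HEADER_NAMES):
        # Never build a half-authenticated S3 client.
        raise PartialStsHeaders("all three X-Aws-* headers are required together")
    return StsCreds(*values)
```

A handler would then pick the per-request client when the result is not None and fall back to the instance-profile client otherwise, which is the backward-compatible behaviour this commit describes.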
Addresses 2 of 4 codex adversarial findings on commit 913179a:
[P2 - downgrade attack]
aws_creds.rs OptionalStsCreds extractor silently fell back to the broker
EC2 instance profile when caller omitted X-Aws-* headers. A malicious
caller could deliberately drop the headers to bypass the OIDC-scoped IAM
session and get broker-wide S3 access.
Fix: `AGENTKEYS_WORKER_REQUIRE_STS=1` env var puts the worker in strict
mode: every request must carry all three X-Aws-* headers or gets HTTP 401.
Also: partial header sets (1 or 2 of 3 present) ALWAYS reject with 401
regardless of strict mode, because silently dropping half-passed creds is
the same downgrade surface. Default off for backward compat; production
deploys should turn it on.
[P3 - credential leak via Debug]
StsCreds previously derived Debug, so any future tracing::debug! or dbg!()
call would log secret_access_key and session_token verbatim. Custom Debug
impl now redacts both and shows only the access_key_id prefix (which AWS
CloudTrail logs anyway).
New tests:
- debug_redacts_secret_and_session_token (asserts the Debug output
  doesn't contain the secret bytes; <redacted> marker present)
- parser_distinguishes_no_headers_from_partial (locks the extractor's
  contract: no headers = backward compat, partial = always reject)
Two codex findings deliberately left as follow-ups, not fixed in this
commit:
[P2 - memory worker OIDC not proven]
The harness only mints agentkeys-vault-role creds, which scope to the
vault bucket only. The memory worker writes to a separate memory bucket
which isn't covered. A dedicated agentkeys-memory-role with the same
tag-scoping pattern is the architecturally correct fix; tracked as PR
followup.
[P2 - vault bucket policy allows whole-bucket ListBucket]
In scripts/apply-vault-bucket-policy.sh:109. Pre-existing, separate from
this PR's surface. Adding an s3:prefix=bots/${aws:PrincipalTag/…}
condition to the bucket-policy ListBucket statement closes the
cross-actor key-name enumeration. Filed for the bucket-policy hardening
followup.
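The header contract and the Debug redaction described above can be sketched as follows. This is a minimal std-only illustration, not the code in aws_creds.rs: the function name `admit`, the lowercase header keys in a `HashMap`, and the 4-character access-key prefix are assumptions made for the example.

```rust
use std::collections::HashMap;
use std::fmt;

// Field names follow the commit text; the layout itself is an assumption.
#[allow(dead_code)]
struct StsCreds {
    access_key_id: String,
    secret_access_key: String,
    session_token: String,
}

// Hand-written Debug: never prints the secret or the session token.
impl fmt::Debug for StsCreds {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Prefix length (4) is an illustrative choice, not the real one.
        let prefix: String = self.access_key_id.chars().take(4).collect();
        f.debug_struct("StsCreds")
            .field("access_key_id", &format_args!("{}...", prefix))
            .field("secret_access_key", &"<redacted>")
            .field("session_token", &"<redacted>")
            .finish()
    }
}

// Hypothetical stand-in for the extractor decision:
//   all three headers   -> Ok(Some(creds))
//   none, non-strict    -> Ok(None)  (instance-profile fallback)
//   none, strict        -> Err(401)
//   one or two of three -> Err(401)  (always, regardless of strict mode)
fn admit(headers: &HashMap<String, String>, strict: bool) -> Result<Option<StsCreds>, u16> {
    let get = |name: &str| headers.get(name).cloned();
    match (
        get("x-aws-access-key-id"),
        get("x-aws-secret-access-key"),
        get("x-aws-session-token"),
    ) {
        (Some(a), Some(s), Some(t)) => Ok(Some(StsCreds {
            access_key_id: a,
            secret_access_key: s,
            session_token: t,
        })),
        (None, None, None) if !strict => Ok(None),
        _ => Err(401),
    }
}
```

In this sketch `strict` would be derived from AGENTKEYS_WORKER_REQUIRE_STS at startup; a partial header set is rejected before that flag is even consulted.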
Lands the two findings deferred from commit 18e709b. Both verified live
on Heima Mainnet via the extended harness/v2-stage3-demo.sh (11 steps,
all green).
[P2 - memory worker OIDC scoping]
NEW agentkeys-memory-role + dedicated memory bucket, mirroring the vault
data-class layout per arch.md §17.2. A future memory-worker compromise
now cannot reach the credentials bucket and vice versa.
scripts/provision-memory-bucket.sh (NEW): mirror of provision-vault-bucket.sh
scripts/provision-memory-role.sh (NEW): federated trust + 3-statement
  inline policy scoped to $MEMORY_BUCKET/bots/${PrincipalTag}/memory/*
scripts/apply-memory-bucket-policy.sh (NEW): v3 bucket policy
[P2 - bucket-policy ListBucket whole-bucket allow]
Was: one statement listed [Get, Put, Delete, ListBucket] under one
Resource[bucket, bucket/...] with NO s3:prefix condition, so any tagged
session could enumerate all keys.
Now: SPLIT into two statements:
  VaultListV3 / MemoryListV3: ListBucket ONLY, on the bucket ARN,
    Condition StringLike s3:prefix = bots/${PrincipalTag}/<class>/*
  VaultObjectsV3 / MemoryObjectsV3: Get/Put/Delete on the
    prefixed-object ARN, no prefix condition (resource ARN already scopes)
scripts/apply-vault-bucket-policy.sh (UPDATED): v2 → v3 split
scripts/apply-memory-bucket-policy.sh (NEW): v3 split from day one
Demo extended (harness/v2-stage3-demo.sh, STEP_TOTAL 8 → 11):
  step 3: mint TWO STS sessions (vault role + memory role)
  step 4-5: vault PUT positive (own) + negative (other), pre-existing
  step 6: vault LIST negative (other prefix → AccessDenied), codex P2 verifier
  step 7-8: memory PUT positive (own) + negative (other)
  step 9: memory LIST negative (other prefix → AccessDenied)
  step 10: cross-role isolation: vault creds → memory bucket → AccessDenied
           + memory creds → vault bucket → AccessDenied
  step 11: cleanup
Also: `expect_access_denied` helper distinguishes IAM-rejection
(AccessDenied / HTTP 403) from setup-bug failures (NoCredentialsErr,
NoSuchBucket, InvalidAccessKeyId, TokenRefreshRequired). Naive
`grep AccessDenied` would pass on any failure, which was codex's exact
warning.
operator-workstation.env:
  + MEMORY_BUCKET=agentkeys-memory-${ACCOUNT_ID}
  + MEMORY_ROLE_ARN=arn:aws:iam::${ACCOUNT_ID}:role/agentkeys-memory-role
Live-tested 2026-05-20 on Heima Mainnet:
- memory bucket created (AssumedArn=…agentkeys-memory-role)
- vault-bucket policy v2 → v3 swap (2 statements live)
- memory-bucket policy v3 from scratch (2 statements live)
- 11/11 demo steps green:
  [4] vault PUT own prefix → SUCCEEDED
  [5] vault PUT other prefix → AccessDenied
  [6] vault LIST other prefix → AccessDenied
  [7] memory PUT own prefix → SUCCEEDED
  [8] memory PUT other prefix → AccessDenied
  [9] memory LIST other prefix → AccessDenied
  [10] vault creds → memory bucket → AccessDenied
  [10] memory creds → vault bucket → AccessDenied
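The distinction the `expect_access_denied` helper draws can be illustrated with a small classifier. The real helper is a bash function in the harness; this Rust sketch only mirrors the idea, and the names `Outcome` and `classify_failure` are hypothetical.

```rust
// Possible readings of a failed AWS call's error output.
#[derive(Debug, PartialEq)]
enum Outcome {
    Denied,                  // IAM actually rejected the call
    SetupBug(&'static str),  // the environment is broken; proves nothing
    Unexpected,              // some other failure; also proves nothing
}

// Error markers named in the commit message as setup bugs.
const SETUP_MARKERS: [&str; 4] = [
    "NoCredentialsErr",
    "NoSuchBucket",
    "InvalidAccessKeyId",
    "TokenRefreshRequired",
];

// Setup-bug markers are checked first, so an error text that mentions
// both a setup problem and AccessDenied is never counted as a pass.
fn classify_failure(stderr: &str) -> Outcome {
    for marker in SETUP_MARKERS {
        if stderr.contains(marker) {
            return Outcome::SetupBug(marker);
        }
    }
    if stderr.contains("AccessDenied") {
        Outcome::Denied
    } else {
        Outcome::Unexpected
    }
}
```

Only `Outcome::Denied` should let a negative test pass; the other two variants indicate that the IAM gate was never reached.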
All three demos (stage-1, stage-2, stage-3) green on Heima Mainnet after the codex review fixes. Clippy clean on worker-creds + worker-memory. PR ready to merge.
User's call-out — "the cred encryption and decryption is not tested".
Stage-3 previously proved IAM scoping at the AWS layer but skipped the
worker's AES-256-GCM envelope, so the actual encrypt→S3→decrypt path
through the HTTP API was unexercised. The envelope.rs primitive has 8
unit tests, but the wire-protocol roundtrip wasn't.
Stage-3 demo extended (STEP_TOTAL 11 → 13):
[11] Cred worker encrypt/decrypt roundtrip:
1. mint cred-store cap via POST /v1/cap/cred-store (broker)
2. POST /v1/cred/store with cap + base64(plaintext)
→ worker KEK-encrypts (AES-256-GCM, AAD-bound to
operator+actor+service+k3_epoch), S3 PUTs the envelope
3. mint cred-fetch cap via POST /v1/cap/cred-fetch
4. POST /v1/cred/fetch with cap
→ worker S3 GETs the envelope, KEK-decrypts, returns plaintext
5. assert returned plaintext == original (byte-for-byte)
[12] Memory worker encrypt/decrypt roundtrip:
same shape against /v1/memory/put + /v1/memory/get. Memory worker
has no dedicated cap-mint endpoint yet (follow-up); cred-* caps
work against memory because both workers verify the same broker-
signed CapToken shape with the same CapOp::Store / CapOp::Fetch.
Graceful skip handling:
- 'agent scope not set on chain' → skip with 'run stage-1 --webauthn first'
- 'AGENTKEYS_CHAIN_RPC_HTTP not set' → skip with 'redeploy broker'
- 'DeviceRoleMissing' → skip with 'out-of-scope here'
These map cleanly to operator-actionable prerequisites; demo continues
green without those steps when prerequisites aren't met, but the
prerequisite is reported, not hidden.
Broker fix: setup-broker-host.sh now bakes AGENTKEYS_CHAIN +
AGENTKEYS_CHAIN_RPC_HTTP into the broker's systemd Environment= lines.
Previously the broker process had no chain RPC, so /v1/cap/cred-{store,
fetch} hit 502 'RPC URL not set' at request time. This was a pre-existing
gap surfaced by exercising the cap-mint path for the first time in this
PR — the broker's stand-alone deploy never hit cap.rs's chain check
before because no demo step minted caps.
…p 13)
Three changes from user review:
1. NEW stage-3 step 13: NEGATIVE broker cap-mint isolation.
Try to mint a cap-token with operator_omni != session_omni → expect
HTTP 4xx with OperatorMismatch. This proves the MOST UPSTREAM
isolation gate works: actor A's session JWT cannot mint caps for
actor B. If this ever silently returns 200, every cred + memory
blob in S3 is compromised — A could mint B's cap, hand to worker,
worker writes under B's prefix.
Live-verified on Heima Mainnet 2026-05-20:
[13] NEGATIVE cap-mint cross-actor → HTTP 403 OperatorMismatch ✓
Independent of broker redeploy: session-omni check fires BEFORE the
chain RPC check in handlers/cap.rs, so this gate works on the
current (stale-RPC) broker too.
2. CLAUDE.md — NEW "Per-actor + per-data-class isolation invariants
(issue #90)" section codifies the 4-layer defense:
Layer 1 — broker cap-mint → session_omni == operator_omni
Layer 2 — worker chain-verify → independent re-check of layer 1
Layer 3 — AWS IAM PrincipalTag → s3 resource scoping per-actor
Layer 4 — bucket separation → per-data-class IAM roles
Test-discipline rule: every PR adding a new worker, data class, or
broker auth method MUST extend the stage-3 demo with negative
isolation tests for all four layers. Don't ship features with only
POSITIVE-path coverage.
3. CLAUDE.md — answers "why no /v1/cap/memory-* endpoint" with a
concrete example: cap-tokens are data-class-agnostic. The same Store
cap minted for service=openrouter can be POSTed to either
/v1/cred/store (writes to vault bucket credentials/) or
/v1/memory/put (writes to memory bucket memory/). The URL picks
the data class; the cap just authorizes the operation. Adding
dedicated memory cap endpoints would add audit clarity ("this cap
was minted intending memory access") but no security boundary —
isolation comes from the per-data-class IAM roles (layer 4).
Deferred until payments-worker forces a third data class.
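The ordering claim in item 1 (the session-omni check fires before the chain RPC check) can be sketched as below. This is an illustration under assumptions, not the code in handlers/cap.rs: `precheck_mint`, `MintError`, and the `chain_rpc` parameter are hypothetical names; the 403 and 502 statuses are the ones quoted in this PR.

```rust
#[derive(Debug, PartialEq)]
enum MintError {
    OperatorMismatch, // session belongs to a different actor
    ChainRpcUnset,    // broker was deployed without a chain RPC URL
}

impl MintError {
    // HTTP statuses as reported in the commit messages.
    fn status(&self) -> u16 {
        match self {
            MintError::OperatorMismatch => 403,
            MintError::ChainRpcUnset => 502,
        }
    }
}

// Layer 1 gate: the identity comparison runs first, so a cross-actor
// request is rejected even on a broker with no chain RPC configured.
fn precheck_mint(
    session_omni: &str,
    operator_omni: &str,
    chain_rpc: Option<&str>,
) -> Result<(), MintError> {
    if session_omni != operator_omni {
        return Err(MintError::OperatorMismatch);
    }
    if chain_rpc.is_none() {
        return Err(MintError::ChainRpcUnset);
    }
    Ok(())
}
```

With this ordering, actor A asking for actor B's cap receives OperatorMismatch whether or not the broker has been redeployed with its chain settings.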
…vault + memory)
User callout — "make it explicit that one cannot pollute other permission."
Before this commit, cap-tokens didn't carry a data-class binding: a
cred-store cap and a memory-put cap were structurally identical. The
URL the cap was POSTed to picked the bucket. Isolation lived only at
the AWS IAM PrincipalTag + per-data-class IAM-role layer. If the IAM
grants were ever accidentally broadened, cross-data-class pollution
would slip through silently.
Now: data_class is a SIGNED FIELD in the cap payload. The cap layer
itself enforces per-data-class isolation, ahead of any AWS call.
Schema change (REQUIRED field, no backward compat — coordinated upgrade):
enum DataClass { Credentials, Memory }
struct CapPayload {
...
op: CapOp,
data_class: DataClass, // NEW
...
}
Broker (crates/agentkeys-broker-server/src/handlers/cap.rs):
- Add DataClass enum (mirror of worker's), add to CapPayload
- mint_cap signature gains data_class param; statically derived per route
- NEW endpoints: cap_memory_put + cap_memory_get (mint with DataClass::Memory)
- Existing cap_cred_store + cap_cred_fetch mint with DataClass::Credentials
Broker routes (crates/agentkeys-broker-server/src/lib.rs):
+ .route("/v1/cap/memory-put", post(cap_memory_put))
+ .route("/v1/cap/memory-get", post(cap_memory_get))
Worker side (crates/agentkeys-worker-creds/src/verify.rs):
- Add DataClass enum + field to CapPayload + DataClassMismatch error
- NEW pub fn check_data_class(token, expected) — symmetric with check_op
- Tests: data_class_serializes_snake_case + check_data_class_accepts_match
+ check_data_class_rejects_cross_class
Worker handlers (worker-creds + worker-memory):
- verify_cap now calls check_data_class with their respective class:
worker-creds → DataClass::Credentials
worker-memory → DataClass::Memory
- Reject mismatched caps with HTTP 403 cap_data_class_mismatch
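A minimal sketch of the worker-side guard, assuming a `CapPayload` reduced to the one field that matters here. The `check_data_class` name and the `DataClassMismatch` error come from the commit text; the error's fields and the trimmed-down struct are assumptions for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataClass {
    Credentials,
    Memory,
}

// Reduced payload: the real struct carries more signed fields.
struct CapPayload {
    data_class: DataClass,
}

#[derive(Debug, PartialEq)]
enum VerifyError {
    // Field layout is hypothetical; the variant name is from the commit.
    DataClassMismatch { expected: DataClass, got: DataClass },
}

// Each worker passes its own class as `expected`:
// worker-creds → Credentials, worker-memory → Memory.
fn check_data_class(token: &CapPayload, expected: DataClass) -> Result<(), VerifyError> {
    if token.data_class == expected {
        Ok(())
    } else {
        Err(VerifyError::DataClassMismatch {
            expected,
            got: token.data_class,
        })
    }
}
```

A handler would map the `DataClassMismatch` error to HTTP 403 with the `cap_data_class_mismatch` reason before any S3 call is attempted.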
Demo extension (harness/v2-stage3-demo.sh, STEP_TOTAL 14 → 16):
[11] cred encrypt/decrypt roundtrip — now uses /v1/cap/cred-store
[12] memory encrypt/decrypt roundtrip — now uses /v1/cap/memory-put (NEW endpoint)
[14] NEW negative test: mint cred-class cap, POST to /v1/memory/put
→ expect HTTP 403 cap_data_class_mismatch
[15] NEW negative test: mint memory-class cap, POST to /v1/cred/store
→ expect HTTP 403 cap_data_class_mismatch
CLAUDE.md ("Per-actor + per-data-class isolation invariants"):
Replaced "why no memory cap-mint endpoint" section (now obsolete) with
"Cap-tokens are data-class-explicit" — explains the 4-endpoint shape,
shows the concrete reject example, justifies route-per-class over a
data_class query param (broker can't accidentally mint the wrong
variant from a typed-route handler).
Tests:
worker-creds verify::tests — 14/14 (3 new for DataClass)
broker-server handlers::cap::tests — 24/24 (1 new for data_class serialization)
cargo build -p worker-creds -p worker-memory -p broker-server — exit 0
Live deploy: requires broker host redeploy via setup-broker-host.sh to
pick up the new mint_cap signature + new memory routes. The stage-3
demo steps 14+15 will skip cleanly until the redeploy lands — the
isolation IS enforced (workers reject cred-class caps), but the new
endpoints don't exist on the current broker yet.
After redeploying with the data_class change (commit 690f54c), step 11 of
the stage-3 demo surfaced a SECOND broker-side env gap:
  HTTP 502 from /v1/cap/cred-store:
  {"error":"SIDECAR_REGISTRY_ADDRESS_HEIMA unset","reason":"chain_rpc_error"}
The broker's handlers/cap.rs reads three contract addresses at request
time to verify device + scope + k3_epoch on chain:
- SIDECAR_REGISTRY_ADDRESS_HEIMA
- SCOPE_CONTRACT_ADDRESS_HEIMA
- K3_EPOCH_COUNTER_ADDRESS_HEIMA
Before this commit, setup-broker-host.sh baked AGENTKEYS_CHAIN_RPC_HTTP
into the broker systemd unit but NOT the contract addresses. The cap-mint
code path had never been exercised before this PR, so the gap went
unnoticed.
Fix (setup-broker-host.sh): add the three contract addresses to the
broker's Environment= block, pulled from $REGISTRY_ADDR / $SCOPE_ADDR /
$K3_COUNTER_ADDR (already populated earlier in the script via the sourced
scripts/operator-workstation.env). The operator's operator-workstation.env
stays the single source of truth for contract addresses across laptop +
broker host.
Stage-3 demo also gets a sibling skip-detection
(harness/v2-stage3-demo.sh) so steps 11+12+14+15 cleanly skip with the
redeploy-broker message instead of failing on this specific error shape.
To unblock the stage-3 worker encrypt/decrypt + cross-class-rejection
tests after this commit:
  ssh broker.litentry.org "cd ~/agentKeys && git pull && bash scripts/setup-broker-host.sh --yes"
…H1 alignment)
Closes user-reported step-11 regression after broker redeploy:
cap-mint returned HTTP 403 — body: {"error":"device is not active on chain",
"reason":"device_not_active"}
Same bug class I fixed earlier in scripts/heima-agent-create.sh +
scripts/heima-device-revoke.sh (commit 0981a88). Both the broker's
handlers/cap.rs::parse_device_entry AND the worker's
crates/agentkeys-worker-creds/src/verify.rs::parse_device_entry were
still slicing the OLD 7-word DeviceEntry layout. After codex H1
inserted 4 new fields (k11CredId, k11RpIdHash, k11PubX, k11PubY), the
struct grew to 11 ABI words, but neither parser was updated.
word 0 operatorOmni bytes32
word 1 actorOmni bytes32
word 2 k11CredId bytes32
word 3 k11RpIdHash bytes32 (NEW, codex H1)
word 4 k11PubX uint256 (NEW)
word 5 k11PubY uint256 (NEW)
word 6 tier uint8 (padded)
word 7 roles uint8 (padded)
word 8 registeredAt uint64 (padded)
word 9 lastSignCount uint32 (padded)
word 10 revoked bool (padded)
Before this commit, both parsers read:
roles → word 4 (which is now k11PubX)
registeredAt → word 5 (which is now k11PubY — always 0 for agents)
revoked → word 6 (which is now tier)
For agent devices (k11PubX = k11PubY = 0), registeredAt parsed as 0 →
broker returned DeviceNotActive even though the device WAS active.
Fix: both parsers now read from the correct 11-word offsets + check
hex.len() >= 11 * 64.
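The corrected offsets can be sketched with a std-only parser over the hex-encoded return data. This is an illustration of the 11-word layout listed above, not the code in cap.rs or verify.rs; the reduced `DeviceEntry` struct and the `word` helper are assumptions.

```rust
const WORD_HEX: usize = 64;           // one 32-byte ABI word as hex chars
const DEVICE_ENTRY_WORDS: usize = 11; // layout after the codex H1 change

// Only the three fields the old parsers misread are decoded here.
#[derive(Debug, PartialEq)]
struct DeviceEntry {
    roles: u8,
    registered_at: u64,
    revoked: bool,
}

// Returns the idx-th 32-byte word as a 64-char hex slice.
fn word(hex: &str, idx: usize) -> &str {
    &hex[idx * WORD_HEX..(idx + 1) * WORD_HEX]
}

fn parse_device_entry(raw: &str) -> Option<DeviceEntry> {
    let hex = raw.strip_prefix("0x").unwrap_or(raw);
    // Length guard from the fix; the ASCII check keeps byte slicing safe.
    if !hex.is_ascii() || hex.len() < DEVICE_ENTRY_WORDS * WORD_HEX {
        return None;
    }
    // Small integers are right-aligned in their word, so read the tail.
    let roles = u8::from_str_radix(&word(hex, 7)[62..], 16).ok()?;
    let registered_at = u64::from_str_radix(&word(hex, 8)[48..], 16).ok()?;
    let revoked = u8::from_str_radix(&word(hex, 10)[62..], 16).ok()? != 0;
    Some(DeviceEntry { roles, registered_at, revoked })
}
```

Reading word 5 for registeredAt, as the old parsers did, would return k11PubY instead, which is why agent devices with a zero public key looked inactive.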
Tests updated:
worker-creds verify::tests::parse_device_entry_decodes_well_formed
→ construct an 11-word raw response (was 7)
broker handlers::cap::tests::parse_device_entry_decodes_well_formed
→ same
broker handlers::cap::tests::parse_device_entry_detects_revoked
→ same
All 4 green.
Live deploy: requires broker host redeploy via setup-broker-host.sh
so the broker picks up the new parse_device_entry. Worker code change
ships with the broker redeploy (same setup-broker-host.sh rebuild).
Step 11 surfaced the codex P2 downgrade-attack defense WORKING AS
INTENDED: cap-mint succeeded, worker AES-encrypted, then S3 PUT returned
502 "s3_put: service error" because the worker fell back to the broker
EC2 instance profile (which deliberately lacks s3:PutObject on the vault
bucket).
The codex P2 fix in commit 18e709b added OptionalStsCreds + the
AGENTKEYS_WORKER_REQUIRE_STS strict-mode env var. Workers correctly
demand per-request OIDC-minted STS creds. The stage-3 demo's step 11+12
cred_memory_roundtrip helper wasn't passing them.
Fix: stage-3 step 11 (cred roundtrip) now passes vault-role STS creds,
step 12 (memory roundtrip) passes memory-role STS creds, both via the
three X-Aws-* headers the worker's OptionalStsCreds extractor reads:
  -H 'x-aws-access-key-id: $aki'
  -H 'x-aws-secret-access-key: $sak'
  -H 'x-aws-session-token: $sst'
The STS creds were already minted in step 3 (vault + memory sessions
written to $STATE_DIR/{aki,sak,sst}.{vault,memory}); step 11+12 just read
the right file pair based on the kind (cred → vault, memory → memory) and
forward them as headers.
After this commit, steps 11+12 should land green end-to-end:
  broker cap-mint → 200 (chain checks pass)
  worker cap-verify → 200 (broker_sig + chain re-verify)
  worker S3 PUT → 200 (using per-actor STS creds, NOT instance profile)
  byte-for-byte roundtrip assertion holds.
…match)
Step 11 surfaced the second layer of the OIDC isolation chain working
as designed: cap-mint succeeded (broker authorized operator→agent),
worker AES-encrypted, then S3 PUT returned 502 because the STS creds
were minted from the OPERATOR'S session JWT (tagged with operator's
actor_omni) but the cap's actor_omni — and hence the S3 key path —
is the AGENT'S. IAM saw ${PrincipalTag/agentkeys_actor_omni} = 941c…
trying to PUT bots/82a0…/credentials/… and rejected with AccessDenied.
This is the IAM enforcing what the cap-token expresses: "operator
authorized the agent to do this op; the agent must be the one
actually doing it." Both layers must agree on actor_omni.
Fix (stage-3 cred_memory_roundtrip helper):
1. Read agent_private_key from the demo-agent file
2. SIWE-sign as the agent against the broker (POST /v1/auth/wallet/start
with the agent's address, sign with cast wallet sign using
agent_private_key, POST /v1/auth/wallet/verify → session JWT
for the agent)
3. Mint OIDC JWT via /v1/mint-oidc-jwt — this JWT now carries
sub=agent_omni and PrincipalTag/agentkeys_actor_omni=agent_omni
4. AssumeRoleWithWebIdentity against the right data-class role
(VAULT_ROLE_ARN for cred, MEMORY_ROLE_ARN for memory) — STS
creds now tagged with the agent's actor_omni
5. Forward these creds via X-Aws-* headers to the worker
Now the worker's S3 PUT against bots/<agent>/credentials/… uses STS
creds with PrincipalTag=agent_omni → IAM allows.
The architectural lesson, recorded in the commit because it'll bite
again: when a cap-token authorizes actor A's action and the worker
uses STS creds to touch S3, the STS creds MUST be minted using A's
identity — operator's authorization (cap-token) + actor's identity
(STS creds) jointly satisfy the workflow. Per arch.md §17.2 layer 3,
the IAM PrincipalTag is bound to the JWT subject, NOT to whoever the
JWT-issuer (operator) chose to authorize.
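The agreement the two layers must reach can be modelled in a few lines. This is a deliberately simplified stand-in for the IAM resource pattern bots/${PrincipalTag}/<class>/*, not an IAM evaluator; `iam_allows_put` and the placeholder omni values are hypothetical.

```rust
// Models only the prefix match that the PrincipalTag-scoped resource ARN
// expresses: the session's tag must equal the actor segment of the key.
fn iam_allows_put(principal_tag_omni: &str, data_class: &str, key: &str) -> bool {
    let prefix = format!("bots/{}/{}/", principal_tag_omni, data_class);
    // Require at least one character after the prefix (the "/*" part).
    key.starts_with(&prefix) && key.len() > prefix.len()
}

// The key path is derived from the cap's actor_omni, so the STS session
// must have been minted from that same actor's JWT.
fn key_for_cap(cap_actor_omni: &str, data_class: &str, name: &str) -> String {
    format!("bots/{}/{}/{}", cap_actor_omni, data_class, name)
}
```

With a cap issued for the agent, `key_for_cap("agent_omni", "credentials", "openrouter")` only passes `iam_allows_put` when the first argument is also `"agent_omni"`; a session tagged with the operator's omni fails, which matches the AccessDenied observed in step 11.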
Codex round-2 review flagged the demo as 'needs-attention' — it could
report 16/16 green while silently skipping the actual encrypt/decrypt
+ cross-class assertions. Three findings, all addressed:
[high] Worker roundtrip checks could be skipped + still claim coverage
cred_memory_roundtrip used `skip ...; return 0` on five prereq-missing
paths (no agent file, no scope, broker missing chain RPC, broker
missing contract addresses, DeviceRoleMissing). Final summary still
claimed AES-256-GCM byte-for-byte coverage as if the path had run.
Fix: introduce STRICT default + `--allow-skip` opt-in. All five
prereq paths now call prereq_missing(), which:
- in strict mode: prints fail + records 'fail' outcome + returns non-zero
- in --allow-skip mode: prints skip + records 'skip' outcome (dev iter)
Final summary now prints actual per-step outcomes from STEP_OUTCOMES[],
and exits non-zero if any step failed (or any step skipped in strict).
[high] Negative cap-class tests (steps 14, 15) accepted ANY non-200
Previously: cred-class cap → memory worker with non-200 + non-canonical
error was accepted ('non-200 = pass for negative test'). A down worker,
wrong URL, 404 route, auth middleware failure, or malformed request
would all silently satisfy the demo without proving check_data_class
fired. Fix: require HTTP 400/401/403 AND the canonical
cap_data_class_mismatch error string. Any other response = die.
[medium] Cross-actor cap-mint test (step 13) accepted generic rejection
Previously: any 4xx accepted, even when error text was non-canonical;
502 (broker stale) silently skipped, hiding a real config issue.
Fix: require HTTP 400/401/403 with canonical OperatorMismatch.
502 with config-missing body now dies (forces redeploy), not skip.
Other 502/non-canonical errors = die (negative tests can't pass on
an unrelated failure).
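The tightened acceptance rule for negative tests reduces to a two-part predicate. A minimal sketch, assuming the harness has the HTTP status and response body in hand; the function name is hypothetical and the real check lives in bash.

```rust
// A negative test passes only when BOTH hold:
//   1. the status is one of the expected rejection codes, and
//   2. the body carries the canonical error string for the gate under test.
// A 404, 500, 502, or a 403 with some other reason does not count.
fn negative_test_passes(status: u16, body: &str, canonical_error: &str) -> bool {
    matches!(status, 400 | 401 | 403) && body.contains(canonical_error)
}
```

For steps 14 and 15 the canonical string would be `cap_data_class_mismatch`; for step 13 it would be `OperatorMismatch`.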
Plus: positive steps (4, 7, 11+12 happy paths) now call record_ok so
the summary lists EVERY step that actually proved its assertion. The
expect_access_denied helper records too. The summary table is built
from actual execution, not a static claim of coverage.
The structural change here is: skips and infrastructure failures both
become demo failures unless the operator explicitly opts in. CI runs
default-strict. Dev iteration uses --allow-skip when bringing up a
partial environment.
…nvocation
Two small bugs in the strict-mode summary added by c55ea29:
1. Used `local` inside the `if should_run_step 16` block (not a function
   body), so bash printed:
     harness/v2-stage3-demo.sh: line 864: local: can only be used in a function
   AFTER the per-step outcome table tried to render. The 16 steps all ran
   correctly + the demo exited 0, but the summary table itself never
   printed. Fix: drop the `local` keyword and just use plain vars.
2. "DEMO COMPLETE" header would print even when no steps had been
   recorded (e.g. `--from-step 16` to test the summary block in
   isolation). Now distinguishes:
   - all green (nok>0, nskip=0, nfail=0) → DEMO COMPLETE
   - some skipped (--allow-skip) → DEMO PARTIAL
   - any failure → DEMO FAILED + exit 1
   - no steps run at all → NO STEPS EXERCISED + advisory
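The strict / --allow-skip bookkeeping and the four summary headers can be sketched together. The harness implements this in bash with STEP_OUTCOMES[]; the Rust below is only a model, and the precedence between the failure and skip branches is an assumption where the commit text does not spell it out.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Outcome {
    Ok,
    Skip,
    Fail,
}

// Unmet prerequisite: a failure by default, a skip only on explicit opt-in.
fn prereq_missing(allow_skip: bool) -> Outcome {
    if allow_skip {
        Outcome::Skip
    } else {
        Outcome::Fail
    }
}

// Picks the summary header from the recorded per-step outcomes.
// Assumed precedence: any failure wins, then any skip, then all-green.
fn summary_header(outcomes: &[Outcome]) -> &'static str {
    let count = |want: Outcome| outcomes.iter().filter(|o| **o == want).count();
    let (nok, nskip, nfail) = (count(Outcome::Ok), count(Outcome::Skip), count(Outcome::Fail));
    if nfail > 0 {
        "DEMO FAILED"
    } else if nskip > 0 {
        "DEMO PARTIAL"
    } else if nok > 0 {
        "DEMO COMPLETE"
    } else {
        "NO STEPS EXERCISED"
    }
}
```

Because the header is computed from recorded outcomes only, a step that exits early without recording anything cannot contribute to a "DEMO COMPLETE" result.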
Codex round-3 review caught a regression I missed in c55ea29:
[high] Strict demo still skips cross-class isolation checks without
recording failure (steps 14 + 15)
Previously fixed cred_memory_roundtrip's prereq paths to use
prereq_missing (so strict mode fails-hard), but left steps 14 + 15
calling bare `skip` for the same prereq classes:
- missing demo-agent file
- 'not.*scope' (chain scope not set)
- 'RPC URL not set' (broker stale)
- 'SIDECAR_REGISTRY_ADDRESS_HEIMA unset' (broker missing contract addrs)
Because those skips didn't append to STEP_OUTCOMES, a full run could
report 'DEMO COMPLETE' with nskip=0 even when neither cross-data-class
isolation gate had been exercised. That's the same false-success failure
mode codex round-2 flagged, just in a different code path, and exactly
the kind of regression strict-mode tracking is meant to catch.
Fix: extracted the entire step 14/15 body into a cross_class_rejection()
helper function. All prereq paths now route through prereq_missing
(matching cred_memory_roundtrip's pattern), so:
- strict mode (default): unmet prereqs → die + STEP_OUTCOMES records 'fail'
- --allow-skip mode: unmet prereqs → skip + STEP_OUTCOMES records 'skip'
- successful negative test → STEP_OUTCOMES records 'ok'
Step 14: cross_class_rejection cred-store /v1/memory/put memory cred cred-to-mem
Step 15: cross_class_rejection memory-put /v1/cred/store cred memory mem-to-cred
Live-verified on Heima Mainnet (2026-05-20): all 13 STEP_OUTCOMES
recorded, DEMO COMPLETE, exit 0. Steps 14+15 still pass with canonical
403 cap_data_class_mismatch error confirmation (no change to the
positive-path assertion logic; only the skip paths got tightened).
…-mode correct)
Codex round-4 finding (high):
Cross-class negative test omits required STS headers, so strict
workers reject before the data-class guard.
The axum extractor order is: OptionalStsCreds → Json<Req> → handler
body (verify_cap). With AGENTKEYS_WORKER_REQUIRE_STS=1 — the
production deployment setting documented in aws_creds.rs — the
extractor rejects header-less requests with HTTP 401 BEFORE verify_cap
runs. The cross-class data-class guard inside verify_cap never fires.
Today the live test passes because the broker host workers don't have
AGENTKEYS_WORKER_REQUIRE_STS=1 set. So we're proving the data-class
guard against dev-config workers but NOT against the prod target.
That's exactly the 'demo says complete, prod silently broken' failure
mode the codex review pipeline keeps catching.
Fix: cross_class_rejection() now:
1. Mints agent-side STS creds for the TARGET worker's role:
step 14 (memory worker target) → memory-role STS
step 15 (cred worker target) → vault-role STS
2. Passes all three X-Aws-* headers in the POST to the worker.
Worker request order now:
a. OptionalStsCreds extractor: valid headers present → Some(creds) → OK
(passes regardless of AGENTKEYS_WORKER_REQUIRE_STS=1 setting)
b. verify_cap:
check_op (Store) → OK
check_data_class (cap.data_class != worker's class) → REJECT
→ HTTP 403 cap_data_class_mismatch
c. S3 op never runs (verify_cap returned error)
The data-class guard provably fires now, in BOTH strict and non-strict
worker configurations. Codex's concern was correct.
Refactored mint_agent_sts_for_role() as a shared helper so cross_class
test reuses the same SIWE+OIDC+STS flow as cred_memory_roundtrip. Same
auth chain, same trust boundary, same code path — no inconsistency
between positive (cred_memory_roundtrip) and negative (cross_class)
tests.
Live-verified 2026-05-20 on Heima Mainnet: 13 STEP_OUTCOMES recorded,
all ok, DEMO COMPLETE. Steps 14+15 still return canonical
403 cap_data_class_mismatch with the STS headers correctly passed
through — confirming the data-class guard fires AFTER extractor
authentication passes.
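The request order a → b → c described in this commit can be modelled as a single function that returns the HTTP status. This is a simplified sketch, not the axum handler: `worker_status` is a hypothetical name, the header count stands in for the OptionalStsCreds extractor, and signature and chain checks inside verify_cap are omitted.

```rust
#[derive(Clone, Copy, PartialEq)]
enum DataClass {
    Credentials,
    Memory,
}

fn worker_status(
    sts_headers_present: usize, // how many of the three X-Aws-* headers arrived
    strict: bool,               // AGENTKEYS_WORKER_REQUIRE_STS=1
    cap_class: DataClass,
    worker_class: DataClass,
) -> u16 {
    // a. Extractor runs first: partial sets always fail, and an empty set
    //    fails in strict mode. verify_cap is never reached in those cases.
    let extractor_ok = match sts_headers_present {
        3 => true,
        0 => !strict,
        _ => false,
    };
    if !extractor_ok {
        return 401;
    }
    // b. verify_cap: the data-class guard (other cap checks omitted here).
    if cap_class != worker_class {
        return 403;
    }
    // c. Only now would the S3 operation run.
    200
}
```

The first test case below is the situation codex flagged: a header-less cross-class request against a strict worker yields 401, so the test would never observe the 403 from the data-class guard.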
…variants (§17.5)
Codifies the issue #90 outcomes into the canonical architecture spec
(per CLAUDE.md "arch.md as source of truth" rule):
§15.1 + §15.2 (credentials-service + memory-service): added the OIDC
federation paragraph. X-Aws-* header passthrough is the production auth
surface (codex P2 downgrade fix); strict mode forces it via
AGENTKEYS_WORKER_REQUIRE_STS=1. Cross-links to §17.5.
§17.5 (NEW): Per-data-class cap-token binding:
- Cap-token's data_class field + the 4 broker endpoints
- 4-layer defense-in-depth table (broker cap-mint, worker chain-verify,
  AWS IAM PrincipalTag, per-data-class buckets)
- Each layer's canonical test in harness/v2-stage3-demo.sh
- Test-discipline rule: new data classes MUST add negative isolation
  tests across all 4 layers
- Two design rationales spelled out:
  a) Why route-per-class beats a single endpoint with a data_class
     query-param (eliminates user-input attack surface)
  b) Why agent-side STS creds are mandatory (PrincipalTag must match
     the cap's actor_omni; operator-side STS won't satisfy IAM)
Plus the trailing Cargo.lock entry from aws-credential-types being a
direct dep of worker-creds (added in commit 913179a).
This was referenced May 20, 2026
Summary
Dev-only co-location of the 4 service workers (audit/email/cred/memory) on the same EC2 box as the broker, behind per-worker nginx vhosts. The per-subdomain layout is the migration seam: moving any worker to its own host later only requires changing the A record + IAM principal. Per CLAUDE.md, production will isolate all services; this PR is the dev posture.
Topology after merge:
broker.litentry.org   :8091   agentkeys-broker
signer.litentry.org   :8092   agentkeys-signer
audit.litentry.org    :9092   agentkeys-worker-audit (Merkle relay)
email.litentry.org    :9093   agentkeys-worker-email (SES + S3 inbox)
cred.litentry.org     :9094   agentkeys-worker-creds (credential CRUD)
memory.litentry.org   :9095   agentkeys-worker-memory (memory CRUD)
What changed
scripts/setup-broker-host.sh (+506 / -46)
- … --without-workers opts out).
- /etc/agentkeys/worker-{creds,memory}.env with stable KEK secrets (preserved across re-runs; regenerating would invalidate every existing encrypted blob). worker-{audit,email}.env are non-secret and deterministic.
- worker-creds + worker-memory use /bin/sh -c 'export BROKER_CAP_PUBKEY_PEM="$(cat …)" && exec …' to inject the broker's session-pubkey PEM (multi-line, so it can't go in an EnvironmentFile).
- write_worker_nginx_site() helper. A → B (HTTP-only → :443 ssl) flip on second pass after LE cert is issued, matching the existing broker + signer pattern.
- probe_or_die against /healthz on all 4 worker ports after restart.
- --audit-host, --email-host, --cred-host, --memory-host, --chain-rpc, --vault-bucket, --memory-bucket, --scope-addr, --registry-addr, --k3-counter-addr, --without-workers. All default-derived from $ISSUER_HOST + scripts/operator-workstation.env.
- /etc/agentkeys/worker-{creds,memory}.env (mirrors the existing broker-unit detection block). So a first run with --chain-rpc https://devnet.example stays sticky on subsequent flag-less re-runs.
scripts/dns-upsert-workers.sh (NEW)
Single atomic Route 53 change-batch UPSERT for all 4 A records. Validates the caller is on agentkeys-admin (case-insensitive per CLAUDE.md), refuses RFC1918 / TEST-NET-2 (Cloudflare WARP / Zscaler / corporate VPN rewrites) / CGNAT EIPs, waits for Route 53 INSYNC + Cloudflare DoH propagation before exiting.
scripts/verify-workers.sh (NEW)
Laptop-side end-to-end check: DNS resolves via Cloudflare DoH → TLS cert is Let's Encrypt with valid date → /healthz returns HTTP 200 with the per-worker expected body marker. Exits non-zero with per-failure diagnostic. --no-tls flag for the HTTP-only first-pass phase.
docs/cloud-setup.md (+88)
crates/agentkeys-worker-audit/src/main.rs + …-email/src/main.rs
GET /healthz → "ok" so probe_or_die can verify boot. worker-creds + worker-memory already had it.
scripts/operator-workstation.env
Derive WORKER_{AUDIT,EMAIL,CRED,MEMORY}_HOST + AGENTKEYS_WORKER_*_URL from $BROKER_HOST, mirroring the SIGNER_HOST pattern.
scripts/heima-scope-set.sh + heima-scope-revoke.sh (carry-over)
Graceful skip with {"ok":true,"skipped":"no-webauthn-k11"} when no mode:webauthn K11 is enrolled, so harness/v2-stage1-demo.sh (default stub mode) is fully CI-automatable without operator Touch ID. Fixes the Q2 issue raised in the prior session: "`bash harness/v2-stage1-demo.sh` is not fully automatic, it still ask me for touchid".
Reviewer notes
- /bin/sh -c wrapper in worker-creds / worker-memory units was tested by rendering the heredoc in isolation: it produces literal $(cat /var/lib/…/session-keypair.pub.pem) in the unit file, expanded by /bin/sh at service start, not at script-write time.
- worker-creds + worker-memory Requires=agentkeys-broker.service so the session-pubkey PEM exists before they start. setup-broker-host.sh restarts workers AFTER signer (which depends on the PEM too); order is backend + broker → signer → workers.
- worker-audit's /var/lib/agentkeys/audit-leaves dir is created with mode 0750 owned by agentkeys; ProtectSystem=strict + ReadWritePaths=/var/lib/agentkeys allows the worker to write per-batch Merkle JSONL there.
- worker-creds.env + worker-memory.env are mode 0600, owner agentkeys. The other two env files are mode 0644 (no secrets).
Test plan
- bash -n clean on all 3 scripts.
- cargo check -p agentkeys-worker-{audit,email,creds,memory} exit 0.
- scripts/operator-workstation.env derives https://audit.litentry.org etc. correctly.
- ExecStart=/bin/sh -c 'export BROKER_CAP_PUBKEY_PEM="$(cat /var/lib/…/session-keypair.pub.pem)" && exec /usr/local/bin/agentkeys-worker-creds' (no script-time $-expansion).
- bash scripts/dns-upsert-workers.sh (UPSERTs 4 A records on Route 53).
- sudo bash scripts/setup-broker-host.sh --yes (writes HTTP-only nginx vhosts + systemd units, builds + installs all 4 worker binaries).
- sudo certbot certonly --webroot … loop for the 4 new hosts.
- sudo bash scripts/setup-broker-host.sh --yes (second pass flips nginx onto :443 ssl).
- bash scripts/verify-workers.sh: expect all 4 green.
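The address classes that scripts/dns-upsert-workers.sh is described as refusing can be expressed as a small predicate. The script itself is bash; this Rust sketch only shows the ranges involved, and the function name `is_refused_eip` is hypothetical.

```rust
use std::net::Ipv4Addr;

// True when the address falls in a range that cannot be a real public EIP:
//   RFC1918    10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
//   TEST-NET-2 198.51.100.0/24
//   CGNAT      100.64.0.0/10
fn is_refused_eip(ip: Ipv4Addr) -> bool {
    let o = ip.octets();
    let rfc1918 = ip.is_private();
    let test_net_2 = o[0] == 198 && o[1] == 51 && o[2] == 100;
    // A /10 fixes the top two bits of the second octet: 64..=127.
    let cgnat = o[0] == 100 && (o[1] & 0b1100_0000) == 0b0100_0000;
    rfc1918 || test_net_2 || cgnat
}
```

An address in one of these ranges usually means the lookup was rewritten by a VPN or proxy on the operator's machine, so publishing it as an A record would point the worker hostnames at an unreachable address.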