From 09186c5b2ef7ea3a59494c28b885099077ef0aa0 Mon Sep 17 00:00:00 2001 From: wildmeta-agent Date: Thu, 21 May 2026 09:25:35 +0800 Subject: [PATCH 1/4] =?UTF-8?q?issue=20#66:=20add=20no-LLM=20CI=20?= =?UTF-8?q?=E2=80=94=20ephemeral=20anvil=20+=20scaffolded=20test-broker=20?= =?UTF-8?q?E2E?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two-tier CI matching issue #66's "shared test broker for CI + dev" vision: Tier 1 — ephemeral (every push/PR, fully self-contained, ~10–15 min): * .github/workflows/harness-ci.yml — cargo fmt + clippy + test + harness/ci-ephemeral-stack.sh. No LLM, no @claude invocation. * harness/ci-ephemeral-stack.sh — spins up anvil (new chain), runs forge build + test, deploys fresh v2 stage-1 contracts via DeployAgentKeysV1.s.sol (new contracts, new anvil-prefunded deployer), verifies via scripts/verify-heima-contracts.sh, then stands up mock-server + agentkeys-broker-server with --skip-startup-check (StubSts path) and probes OIDC discovery surface. EXIT trap tears everything down. Tier 2 — long-lived test broker (nightly + workflow_dispatch, scaffolded here, operator-activated via TEST_OIDC_AWS_ROLE_ARN secret): * .github/workflows/harness-e2e.yml — gated workflow that targets test-broker.litentry.org with real test AWS resources, runs all three stage demos against the long-lived parallel infra. Includes nightly cleanup of stale ci/ S3 prefixes. Uses GitHub Actions OIDC (id-token: write) for AWS auth, never long-lived secrets. * scripts/provision-test-environment.sh — operator-run one-shot provisioner that walks the 7 steps to stand up test-broker (separate OIDC provider, separate IAM roles, separate buckets, separate deployer wallet, fresh contracts on Heima-Paseo). * scripts/test-environment.env.example — committed env template mirroring operator-workstation.env with -test suffixes. * docs/test-environment.md — bring-up runbook, secret list, rotation, cleanup, and the two-tier design rationale. 
WebAuthn: harness scripts default to WEBAUTHN_MODE=0 (stage-1 line 131, stage-2 --stub) so no Touch ID prompt is ever needed; --webauthn is opt-in and never passed by either workflow. Validated locally: bash harness/ci-ephemeral-stack.sh --skip-broker passes all 8 steps (anvil up, 33 forge tests, 6 contracts deployed + verified, clean teardown). YAML + shell syntax checked. --- .github/workflows/harness-ci.yml | 139 +++++++++ .github/workflows/harness-e2e.yml | 204 +++++++++++++ docs/test-environment.md | 166 +++++++++++ harness/ci-ephemeral-stack.sh | 401 ++++++++++++++++++++++++++ scripts/provision-test-environment.sh | 276 ++++++++++++++++++ scripts/test-environment.env.example | 92 ++++++ 6 files changed, 1278 insertions(+) create mode 100644 .github/workflows/harness-ci.yml create mode 100644 .github/workflows/harness-e2e.yml create mode 100644 docs/test-environment.md create mode 100755 harness/ci-ephemeral-stack.sh create mode 100755 scripts/provision-test-environment.sh create mode 100644 scripts/test-environment.env.example diff --git a/.github/workflows/harness-ci.yml b/.github/workflows/harness-ci.yml new file mode 100644 index 0000000..ec19e66 --- /dev/null +++ b/.github/workflows/harness-ci.yml @@ -0,0 +1,139 @@ +name: harness CI (no LLM) + +# Issue #66 tier-1: deterministic, no-LLM, no-WebAuthn CI that exercises +# the same code paths the harness scripts run, but against an ephemeral +# in-CI test environment (anvil + mock-server + stub-STS broker). +# +# Separate from the existing claude.yml / claude-code-review.yml workflows +# (which invoke @claude on PR comments + reviews). This workflow never +# spends LLM tokens — it's plain cargo/forge/curl orchestration. +# +# Coverage map (matches harness/v2-stage*.sh where ephemeral CI can): +# +# * `cargo fmt --check` — formatting gate +# * `cargo clippy -D warnings` — lint gate +# * `cargo test --workspace` — unit + in-process integration +# tests. 
The broker tests +# already spawn a full +# in-process broker with +# StubSts + StubEmailSender, +# so SIWE / OIDC mint / cap +# verify / multi-master / +# recovery / per-data-class +# isolation Rust logic is all +# covered here. Per CLAUDE.md +# "all async / #[tokio::test]" +# convention. +# * `harness/ci-ephemeral-stack.sh` — forge build + forge test + +# forge script deploy on a +# fresh anvil + read-only +# ABI/wiring verification. +# Plus broker boot smoke + +# OIDC discovery surface. +# +# Tier-2 (long-lived test-broker.litentry.org, full stage-3 PrincipalTag +# isolation, real AWS STS) lives in .github/workflows/harness-e2e.yml +# and is gated on operator-provisioned infra; see docs/test-environment.md. + +on: + push: + branches: [main, evm] + pull_request: + paths: + - "crates/**" + - "harness/**" + - "scripts/**" + - ".github/workflows/harness-ci.yml" + - "Cargo.toml" + - "Cargo.lock" + +# Allow only one concurrent run per ref so re-pushes cancel stale runs +# (saves runner minutes; each ephemeral stack spins up anvil + builds the +# workspace, so wall-clock matters). +concurrency: + group: harness-ci-${{ github.ref }} + cancel-in-progress: true + +jobs: + rust-checks: + name: cargo fmt + clippy + test + runs-on: ubuntu-latest + timeout-minutes: 30 + steps: + - uses: actions/checkout@v4 + + - name: Install Rust toolchain + uses: dtolnay/rust-toolchain@stable + with: + components: clippy, rustfmt + + - name: Cache cargo registry + target + uses: Swatinem/rust-cache@v2 + with: + shared-key: harness-ci + + - name: cargo fmt --check + run: cargo fmt --all -- --check + + - name: cargo clippy + # -D warnings: any clippy diagnostic blocks merge. Matches the + # project's "fix the warning, don't silence it" convention. 
+ run: cargo clippy --workspace --all-targets -- -D warnings + + - name: cargo test --workspace + # --test-threads=1: the broker tests mutate shared process env + # (HOME, AWS_*) and the keyring tests serialize on a per-process + # accounts map — same convention as the @claude review workflow. + run: cargo test --workspace -- --test-threads=1 + + ephemeral-stack: + name: ephemeral anvil + chain deploy + runs-on: ubuntu-latest + timeout-minutes: 45 + needs: rust-checks # don't burn runner minutes on chain checks if Rust is red + steps: + - uses: actions/checkout@v4 + with: + # forge install reads .gitmodules — need submodules for forge-std etc. + submodules: recursive + + - name: Install Rust toolchain + uses: dtolnay/rust-toolchain@stable + + - name: Cache cargo registry + target + uses: Swatinem/rust-cache@v2 + with: + shared-key: harness-ci # share with rust-checks job + + - name: Install Foundry (anvil + forge + cast) + uses: foundry-rs/foundry-toolchain@v1 + with: + version: stable + + - name: Verify Foundry toolchain + run: | + anvil --version + forge --version + cast --version + + - name: Run ephemeral stack (chain + broker smoke) + # The script handles its own anvil + broker bring-up/tear-down via + # an EXIT trap. Fails the job if any step (forge build/test/deploy, + # contract verification, broker boot, OIDC discovery) fails. + run: bash harness/ci-ephemeral-stack.sh + env: + # Pinned ports so the workflow log is reproducible. + ANVIL_PORT: "8545" + MOCK_PORT: "8090" + BROKER_PORT: "8091" + # Fail builds on rustc warnings as well (matches clippy job). 
+ RUSTFLAGS: "-D warnings" + + - name: Upload logs on failure + if: failure() + uses: actions/upload-artifact@v4 + with: + name: ephemeral-stack-logs + path: /tmp/agentkeys-ci-ephemeral-*/ + if-no-files-found: ignore + retention-days: 7 diff --git a/.github/workflows/harness-e2e.yml b/.github/workflows/harness-e2e.yml new file mode 100644 index 0000000..071608b --- /dev/null +++ b/.github/workflows/harness-e2e.yml @@ -0,0 +1,204 @@ +name: harness E2E (long-lived test broker) + +# Issue #66 tier-2: end-to-end harness exercise against the long-lived +# test-broker.litentry.org infrastructure provisioned by +# scripts/provision-test-environment.sh. +# +# Gated on TEST_OIDC_AWS_ROLE_ARN being set as a repo secret — until the +# operator wires it (see docs/test-environment.md §3), the job is inert +# and surfaces as a no-op rather than failing. This keeps the workflow +# safe to merge before the parallel infra is up. +# +# Coverage delta vs. harness-ci.yml: +# * harness-ci.yml: ephemeral anvil + in-process broker + StubSts +# (no public TLS, no real AWS, no real SES) +# * harness-e2e.yml: real test-broker.litentry.org + real AWS test +# resources (test bucket, test role) + real Heima +# Paseo chain. Runs the full stage-3 per-actor + +# per-data-class PrincipalTag isolation suite +# that ephemeral CI can't reach. +# +# No LLM. No WebAuthn (passes the harness scripts in default stub mode). +# Schedule + workflow_dispatch only — never on every PR (this hits real +# AWS API calls + real chain RPC, so it's nightly-cadence). + +on: + schedule: + # Nightly at 06:00 UTC — well after the prior day's PR activity + # quiesces but before the operator's morning standup. + - cron: "0 6 * * *" + workflow_dispatch: + inputs: + stage: + description: "Which stage to run (1, 2, 3, or all)" + required: false + default: "all" + type: choice + options: ["1", "2", "3", "all"] + +# Prevent overlapping runs (each one consumes test AWS resources + chain RPC). 
+concurrency: + group: harness-e2e + cancel-in-progress: false # let in-flight nightly finish; queue manual runs + +# OIDC-only AWS auth via GitHub Actions — never long-lived secrets. +permissions: + id-token: write # required for aws-actions/configure-aws-credentials + contents: read + +jobs: + preflight: + name: gate on test infra availability + runs-on: ubuntu-latest + outputs: + should_run: ${{ steps.gate.outputs.should_run }} + steps: + - id: gate + run: | + if [ -n "${{ secrets.TEST_OIDC_AWS_ROLE_ARN }}" ]; then + echo "should_run=true" >> "$GITHUB_OUTPUT" + echo "test infra credentials present; proceeding" + else + echo "should_run=false" >> "$GITHUB_OUTPUT" + echo "::warning::TEST_OIDC_AWS_ROLE_ARN unset — skipping. See docs/test-environment.md." + fi + + harness-e2e: + name: harness/v2-stage*-demo.sh against test-broker + needs: preflight + if: needs.preflight.outputs.should_run == 'true' + runs-on: ubuntu-latest + timeout-minutes: 60 + + steps: + - uses: actions/checkout@v4 + with: + submodules: recursive + + - name: Install Rust toolchain + uses: dtolnay/rust-toolchain@stable + + - name: Cache cargo registry + target + uses: Swatinem/rust-cache@v2 + with: + shared-key: harness-e2e + + - name: Install Foundry + uses: foundry-rs/foundry-toolchain@v1 + with: + version: stable + + - name: Configure AWS credentials via OIDC (test role) + uses: aws-actions/configure-aws-credentials@v4 + with: + role-to-assume: ${{ secrets.TEST_OIDC_AWS_ROLE_ARN }} + aws-region: ${{ secrets.TEST_AWS_REGION || 'us-east-1' }} + # Session name shows up in CloudTrail — keep traceable to the + # PR / run for forensic walking. + role-session-name: gh-actions-${{ github.repository_id }}-${{ github.run_id }} + + - name: Build agentkeys CLI + workers + run: cargo build --release --workspace + + - name: Source test-environment env + # The harness scripts source scripts/operator-workstation.env by + # default. 
For the e2e run, overlay scripts/test-environment.env + # into that path so the entire harness flow is reused unchanged. + # The .example template is committed; the live file lives only + # in the runner's filesystem for the duration of the job. + run: | + cp scripts/test-environment.env.example scripts/operator-workstation.env + # Substitute repo secrets into the live env file. + { + echo "ACCOUNT_ID=${{ secrets.TEST_ACCOUNT_ID }}" + echo "REGION=${{ secrets.TEST_AWS_REGION || 'us-east-1' }}" + echo "BROKER_HOST=${{ secrets.TEST_BROKER_HOST || 'test-broker.litentry.org' }}" + echo "OIDC_ISSUER=https://${{ secrets.TEST_BROKER_HOST || 'test-broker.litentry.org' }}" + echo "VAULT_BUCKET=${{ secrets.TEST_VAULT_BUCKET }}" + echo "MEMORY_BUCKET=${{ secrets.TEST_MEMORY_BUCKET }}" + echo "VAULT_ROLE_ARN=${{ secrets.TEST_VAULT_ROLE_ARN }}" + echo "MEMORY_ROLE_ARN=${{ secrets.TEST_MEMORY_ROLE_ARN }}" + echo "DATA_ROLE_ARN=${{ secrets.TEST_DATA_ROLE_ARN }}" + # Per-run S3 prefix isolation — concurrent runs (manual + + # nightly) won't step on each other's writes; nightly + # cleanup s3 rm's keys older than 7d. + echo "CI_S3_PREFIX=ci/run-${{ github.run_id }}" + } >> scripts/operator-workstation.env + + - name: Stage 1 — chain + identity bootstrap + if: ${{ github.event_name == 'schedule' || inputs.stage == 'all' || inputs.stage == '1' }} + # --skip-deploy: contracts are pre-deployed by + # scripts/provision-test-environment.sh on Heima-Paseo, and + # those addresses are baked into scripts/test-environment.env. + # --skip-email: e2e doesn't exercise the SES round-trip + # (separate workflow); identity bootstrap uses wallet_sig. + # No --webauthn: stub-mode (WEBAUTHN_MODE=0 default). 
+ run: | + AGENTKEYS_CHAIN=heima-paseo \ + bash harness/v2-stage1-demo.sh --skip-deploy --skip-email + + - name: Stage 2 — multi-master + recovery (stub mode) + if: ${{ github.event_name == 'schedule' || inputs.stage == 'all' || inputs.stage == '2' }} + run: | + AGENTKEYS_CHAIN=heima-paseo \ + bash harness/v2-stage2-demo.sh --stub --skip-build + + - name: Stage 3 — per-actor + per-data-class PrincipalTag isolation + if: ${{ github.event_name == 'schedule' || inputs.stage == 'all' || inputs.stage == '3' }} + # The tier-2 capstone: stage-3 is the suite ephemeral CI can't + # run, since it requires AWS STS AssumeRoleWithWebIdentity, which + # in turn requires AWS to fetch the OIDC issuer's JWKS over + # public TLS. Now that we have test-broker.litentry.org with a + # real Let's Encrypt cert and real test IAM roles, all 11 steps + # of v2-stage3-demo.sh execute end-to-end. + run: | + AGENTKEYS_CHAIN=heima-paseo \ + bash harness/v2-stage3-demo.sh + + - name: Clean up per-run S3 prefix + if: always() + # Best-effort: tear down the per-run S3 prefix we wrote to. + # The nightly cleanup s3 rm catches any keys we missed. + run: | + PREFIX="ci/run-${{ github.run_id }}/" + for bucket in \ + "${{ secrets.TEST_VAULT_BUCKET }}" \ + "${{ secrets.TEST_MEMORY_BUCKET }}"; do + [ -n "$bucket" ] || continue + aws s3 rm "s3://$bucket/$PREFIX" --recursive || true + done + + nightly-prefix-cleanup: + # Sweep any per-run S3 prefixes older than 7 days from the test + # buckets. Cheap insurance against forgotten prefixes from cancelled + # runs; complements the per-job cleanup above. 
+ name: cleanup stale CI prefixes + needs: preflight + if: needs.preflight.outputs.should_run == 'true' && github.event_name == 'schedule' + runs-on: ubuntu-latest + timeout-minutes: 10 + permissions: + id-token: write + contents: read + steps: + - name: Configure AWS credentials + uses: aws-actions/configure-aws-credentials@v4 + with: + role-to-assume: ${{ secrets.TEST_OIDC_AWS_ROLE_ARN }} + aws-region: ${{ secrets.TEST_AWS_REGION || 'us-east-1' }} + role-session-name: gh-actions-cleanup-${{ github.run_id }} + + - name: Sweep prefixes older than 7d + run: | + cutoff=$(date -u -d "7 days ago" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null \ + || date -u -v-7d +%Y-%m-%dT%H:%M:%SZ) + for bucket in \ + "${{ secrets.TEST_VAULT_BUCKET }}" \ + "${{ secrets.TEST_MEMORY_BUCKET }}"; do + [ -n "$bucket" ] || continue + aws s3api list-objects-v2 --bucket "$bucket" --prefix "ci/" \ + --query "Contents[?LastModified<\`$cutoff\`].Key" --output text \ + | tr '\t' '\n' | while read -r key; do + if [ -n "$key" ] && [ "$key" != "None" ]; then aws s3 rm "s3://$bucket/$key"; fi + done + done diff --git a/docs/test-environment.md b/docs/test-environment.md new file mode 100644 index 0000000..ca3596b --- /dev/null +++ b/docs/test-environment.md @@ -0,0 +1,166 @@ +# Test environment — AgentKeys (issue #66) + +**Audience:** the operator setting up CI for AgentKeys, plus contributors who need to debug a CI failure. +**Scope:** the parallel test infrastructure (broker, IAM roles, S3 buckets, deployer wallet, smart contracts) that exists alongside prod so CI can exercise the full code path without touching real user data. 
+ +This is the operator-facing companion to: +- [`.github/workflows/harness-ci.yml`](../.github/workflows/harness-ci.yml) — the tier-1 ephemeral CI workflow (no external infra) +- [`.github/workflows/harness-e2e.yml`](../.github/workflows/harness-e2e.yml) — the tier-2 nightly E2E workflow against the long-lived test broker +- [`harness/ci-ephemeral-stack.sh`](../harness/ci-ephemeral-stack.sh) — the ephemeral stack driver tier-1 invokes +- [`scripts/provision-test-environment.sh`](../scripts/provision-test-environment.sh) — operator-run, one-shot provisioner for the tier-2 long-lived infra +- [`scripts/test-environment.env.example`](../scripts/test-environment.env.example) — env file template + +## Two-tier model + +Issue #66 calls for a CI that runs the harness scripts against a parallel test environment, never spends LLM tokens, and never invokes WebAuthn. There are two natural points to do that, and we ship both: + +| | Tier 1 — ephemeral | Tier 2 — long-lived | +|---|---|---| +| **Workflow** | `harness-ci.yml` | `harness-e2e.yml` | +| **Trigger** | every push + PR | nightly + manual dispatch | +| **Where** | inside a GitHub Actions runner | runs against `test-broker.litentry.org` | +| **Chain** | `anvil` (fresh per run, instant finality) | Heima-Paseo testnet (long-lived contracts) | +| **Deployer** | anvil's prefunded default test key (zero risk) | a separate Paseo wallet, funded by operator, persisted at `~/.agentkeys/heima-paseo-deployer-test.key` | +| **Contracts** | fresh deploy per run via Foundry | deployed once by `provision-test-environment.sh`, addresses pinned in `scripts/test-environment.env` | +| **Broker** | in-process spawn, OIDC issuer = `http://127.0.0.1:8091`, `StubSts` | real broker process on test EC2, OIDC issuer = `https://test-broker.litentry.org`, real AWS STS | +| **AWS** | none — broker boots with `--skip-startup-check`, no STS/S3 calls | real test bucket + real test role; AWS STS `AssumeRoleWithWebIdentity` works because the test 
broker exposes a public TLS-fronted JWKS endpoint | +| **WebAuthn** | never — harness defaults to `WEBAUTHN_MODE=0` stub mode | never — same default | +| **LLM** | never | never | +| **Wall time** | ~10–15 min | ~25–40 min | + +Tier 1 catches almost all regressions because the Rust integration tests (`cargo test --workspace`) already spawn an in-process broker with `StubSts` + `StubEmailSender` — those tests cover SIWE auth, OIDC mint, cap-token verification, multi-master, recovery, and per-data-class isolation logic. What tier 1 *can't* cover is the real-AWS path: stage 3's `AssumeRoleWithWebIdentity` requires AWS to fetch the issuer's JWKS over public TLS, which an ephemeral CI runner can't expose. That's the tier-2 capstone. + +## Tier 1 — ephemeral CI (no operator setup needed) + +Already wired. Every push to `main` or `evm`, plus every PR touching `crates/**` / `harness/**` / `scripts/**`, runs: + +1. `cargo fmt --check` +2. `cargo clippy --workspace --all-targets -- -D warnings` +3. `cargo test --workspace -- --test-threads=1` +4. `bash harness/ci-ephemeral-stack.sh`, which: + - Starts a fresh `anvil` on port 8545 (new chain, instant finality) + - Runs `forge build && forge test` in `crates/agentkeys-chain/` + - Runs `forge script DeployAgentKeysV1.s.sol` to deploy all 6 contracts to the ephemeral anvil + - Parses the deployed addresses and writes a synthetic `operator-workstation.env` + - Runs `scripts/verify-heima-contracts.sh` against the new addresses (read-only ABI + wiring checks) + - Starts `mock-server` + `agentkeys-broker-server` (with `--skip-startup-check`, OIDC issuer = `http://127.0.0.1:8091`) + - Probes `/healthz`, `/.well-known/openid-configuration`, `/.well-known/jwks.json` + +On failure, the script's EXIT trap preserves all logs (`anvil.log`, `forge-deploy.log`, `broker.log`, etc.) and the workflow uploads them as an `ephemeral-stack-logs` artifact. 
+ +## Tier 2 — long-lived test broker + +### Operator bring-up (~2 hours, one-shot) + +```bash +awsp agentkeys-admin # AWS admin profile for the account hosting test infra +bash scripts/provision-test-environment.sh +``` + +This walks through 7 steps: + +1. **Provision the EC2 broker host** at `test-broker.litentry.org`. Manual step (the runbook fragment in the script tells you exactly what to do on the target EC2). +2. **Register the AWS IAM OIDC provider** for `test-broker.litentry.org` (separate ARN from prod's `oidc-provider/broker.litentry.org`). +3. **Provision IAM roles** `agentkeys-data-role-test`, `agentkeys-vault-role-test`, `agentkeys-memory-role-test`, each trust-policied on the test OIDC provider with the same `PrincipalTag/agentkeys_actor_omni` scoping prod uses. +4. **Provision S3 buckets** `agentkeys-mail-test-${ACCT}`, `agentkeys-vault-test-${ACCT}`, `agentkeys-memory-test-${ACCT}` with block-public-access + default SSE-S3 + the v3 split-statement PrincipalTag bucket policy. +5. **Generate a new deployer wallet** (distinct from the prod deployer) at `~/.agentkeys/heima-paseo-deployer-test.key`. You fund it from your personal Paseo wallet (Paseo has sudo so Alice can also fund — see `scripts/heima-bring-up.sh`). +6. **Deploy fresh v2 stage-1 contracts** to Heima-Paseo via `DeployAgentKeysV1.s.sol`. Records the addresses under `*_HEIMA_PASEO` keys in `scripts/test-environment.env`. +7. **Provision a GitHub Actions OIDC role** (`github-actions-agentkeys-e2e`) trust-policied on `token.actions.githubusercontent.com` with a condition limiting it to the agentkeys repo. Grant it `sts:AssumeRole` on the three test roles + read-only S3 on the three test buckets. The cleanup steps in `harness-e2e.yml` run `aws s3 rm` under this role, so it also needs `s3:ListBucket` and `s3:DeleteObject` scoped to the `ci/` prefix of the vault and memory test buckets. + +Some steps are still operator-manual (parameterizing `provision-vault-role.sh` to accept a `SUFFIX=` env var is a TODO; until then, copy the prod scripts as `-test` variants by hand). The script logs these as `skip` with a follow-up TODO instead of silently passing. 
+ +### Repo secrets to set (after provisioning) + +After the provisioner finishes, set these in **Settings → Secrets and variables → Actions**: + +| Secret | Value | +|---|---| +| `TEST_OIDC_AWS_ROLE_ARN` | `arn:aws:iam::${ACCT}:role/github-actions-agentkeys-e2e` | +| `TEST_AWS_REGION` | `us-east-1` (or wherever the test broker lives) | +| `TEST_ACCOUNT_ID` | `${ACCT}` | +| `TEST_BROKER_HOST` | `test-broker.litentry.org` | +| `TEST_VAULT_BUCKET` | `agentkeys-vault-test-${ACCT}` | +| `TEST_MEMORY_BUCKET` | `agentkeys-memory-test-${ACCT}` | +| `TEST_VAULT_ROLE_ARN` | `arn:aws:iam::${ACCT}:role/agentkeys-vault-role-test` | +| `TEST_MEMORY_ROLE_ARN` | `arn:aws:iam::${ACCT}:role/agentkeys-memory-role-test` | +| `TEST_DATA_ROLE_ARN` | `arn:aws:iam::${ACCT}:role/agentkeys-data-role-test` | + +`TEST_OIDC_AWS_ROLE_ARN` is the **gate**: until it's set, the `harness-e2e.yml` preflight job sets `should_run=false` and the workflow surfaces as a `::warning::` skip rather than a failure. This keeps the workflow safe to merge before the parallel infra is up. + +### Per-run S3 prefix namespacing + +The e2e workflow exports `CI_S3_PREFIX=ci/run-${GITHUB_RUN_ID}` and the harness scripts honor that prefix when writing test envelopes to S3. This means concurrent runs (nightly + a manual dispatch) won't step on each other's writes. + +Cleanup is two-layered: +- **Per-job cleanup**: the e2e workflow's `if: always()` step runs `aws s3 rm s3://$bucket/$PREFIX --recursive` at the end of each run. +- **Nightly sweep**: a separate `nightly-prefix-cleanup` job lists `ci/` prefix keys older than 7 days and rm's them. Cheap insurance against forgotten prefixes from cancelled runs. + +### Cert renewal monitoring + +`test-broker.litentry.org` uses a Let's Encrypt certificate (90-day validity, auto-renewed by certbot). If renewal silently fails, AWS STS stops trusting the OIDC issuer and the e2e workflow turns red overnight. 
+ +The `preflight` job in `harness-e2e.yml` only checks that `TEST_OIDC_AWS_ROLE_ARN` is set; it does not probe the broker's TLS endpoint. A renewal failure therefore surfaces later in the nightly run, when requests to `https://${TEST_BROKER_HOST}` fail TLS verification or AWS STS can no longer fetch the JWKS. Adding a `curl` against `https://${TEST_BROKER_HOST}/.well-known/openid-configuration` to preflight would turn that into an immediate failure with a clear TLS error. + +### Rotating the test broker secrets + +If the test mock-server's `DEV_KEY_SERVICE_MASTER_SECRET` ever leaks, rotate via: + +```bash +# 1. New secret on the broker host +ssh ec2-user@test-broker.litentry.org \ + 'sudo systemctl set-environment DEV_KEY_SERVICE_MASTER_SECRET=$(openssl rand -hex 32) \ + && sudo systemctl restart agentkeys-backend' + +# 2. There's nothing on the operator side to rotate — the secret never +# leaves the broker host (it derives per-omni signer keys in-process). +``` + +Test wallets minted via the rotated signer will have different addresses from pre-rotation wallets, which is the desired blast-radius cut. + +## Cleanup / teardown + +Tear down the entire test environment (cheap insurance if costs spike): + +```bash +# Drain the buckets first +for bucket in agentkeys-mail-test-${ACCT} agentkeys-vault-test-${ACCT} agentkeys-memory-test-${ACCT}; do + aws s3 rm "s3://$bucket" --recursive + aws s3api delete-bucket --bucket "$bucket" +done + +# Delete the roles (inline policies must be deleted first; detach any managed policies as well) +for role in agentkeys-data-role-test agentkeys-vault-role-test agentkeys-memory-role-test github-actions-agentkeys-e2e; do + for policy in $(aws iam list-role-policies --role-name "$role" --query 'PolicyNames[]' --output text); do + aws iam delete-role-policy --role-name "$role" --policy-name "$policy" + done + aws iam delete-role --role-name "$role" +done + +# Delete the OIDC provider +aws iam delete-open-id-connect-provider \ + --open-id-connect-provider-arn arn:aws:iam::${ACCT}:oidc-provider/test-broker.litentry.org + +# Stop + terminate the EC2 + release the EIP (manual, console or aws ec2 CLI) +``` + +The contracts on Heima-Paseo stay on chain (they're free), but they're inert without the broker pointing at them. + +## Why two tiers (vs. 
just one) + +A single-tier model — running everything against the long-lived broker on every PR — was the obvious shape, but loses on: + +- **Latency**: every PR pays the ~30 min e2e wall time (vs. ~10 min for tier 1). +- **Cost**: every PR hits real AWS API calls + chain RPC + potentially gas. +- **Contention**: concurrent PRs serialize on the single test broker, or step on each other's S3 writes without per-run prefix isolation. +- **Brittleness**: a flaky external dep (Paseo collator hiccup, AWS API throttle) blocks merges. + +A single-tier model the other way — only ephemeral CI, no long-lived test broker — was also tempting, but loses stage-3 coverage entirely (`AssumeRoleWithWebIdentity` needs publicly-fetchable JWKS). That's the most security-critical layer in the codebase (per-actor + per-data-class IAM isolation per CLAUDE.md "Per-actor + per-data-class isolation invariants"), so leaving it untested in CI was unacceptable. + +The two-tier split puts the fast, cheap, deterministic checks on every PR and the expensive E2E on nightly. PRs that need to verify a stage-3 fix can run `harness-e2e.yml` manually via `workflow_dispatch` from the Actions tab, selecting the PR branch as the ref. + +## Related + +- Original issue: [#66 — Stage 7: shared test broker for CI + dev](https://github.com/wildmeta-agent/agentKeys/issues/66) +- Prod cloud setup: [`docs/cloud-setup.md`](cloud-setup.md) +- Stage 7 demo + verification: [`docs/stage7-demo-and-verification.md`](stage7-demo-and-verification.md) +- Architecture: [`docs/spec/architecture.md`](spec/architecture.md) §17 (per-data-class buckets), §4 (HDKD actor tree), CLAUDE.md "Per-actor + per-data-class isolation invariants" table diff --git a/harness/ci-ephemeral-stack.sh b/harness/ci-ephemeral-stack.sh new file mode 100755 index 0000000..8d9ffa3 --- /dev/null +++ b/harness/ci-ephemeral-stack.sh @@ -0,0 +1,401 @@ +#!/usr/bin/env bash +# harness/ci-ephemeral-stack.sh — issue #66 tier-1 ephemeral CI driver. 
+# +# Stands up a complete, isolated AgentKeys test environment INSIDE a +# single CI runner and exercises the chain-deploy path end-to-end. No +# external infrastructure, no LLM, no WebAuthn, no real AWS. +# +# What this script delivers (the four parallel-infra axes from issue #66): +# +# ─ new test broker server → ephemeral agentkeys-broker-server +# spawned on 127.0.0.1, OIDC issuer +# http://127.0.0.1:$BROKER_PORT, stub +# STS client (no real AWS). +# ─ new smart contract on-chain → forge script deploys a fresh copy of +# the v2 stage-1 contract set +# (P256Verifier + K11Verifier + +# SidecarRegistry + AgentKeysScope + +# K3EpochCounter + CredentialAudit) +# to a brand-new anvil instance. +# ─ new deployer account → anvil's canonical first prefunded test +# key (10_000 ETH; zero risk). +# ─ no WebAuthn → the harness scripts default to +# WEBAUTHN_MODE=0 (stage-1 line 131); +# this script never passes --webauthn, +# so K11 enrollment writes deterministic +# stub bytes (CI-friendly). +# +# What's COVERED by this script (matches the harness scripts' coverage +# for things that don't require real AWS): +# +# * Forge unit + property tests for all six v2 stage-1 contracts. +# * End-to-end Foundry deploy via DeployAgentKeysV1.s.sol against the +# ephemeral anvil — same script as heima-bring-up.sh step 5 uses +# against Heima Mainnet/Paseo. +# * Read-only ABI/wiring checks via verify-heima-contracts.sh against +# the freshly deployed addresses (same checks Heima uses). +# * Broker liveness + OIDC discovery surface (/.well-known/ +# openid-configuration, /.well-known/jwks.json, /healthz). +# +# What's NOT covered here (intentionally — needs the long-lived +# test-broker.litentry.org tier-2 environment with publicly-reachable +# TLS + real AWS resources; see docs/test-environment.md): +# +# * harness/v2-stage3-demo.sh — per-actor + per-data-class S3 +# PrincipalTag isolation tests. 
AWS STS AssumeRoleWithWebIdentity +# requires AWS to fetch the OIDC issuer's JWKS over public TLS, +# which a CI runner can't expose. +# * Real SES email-link auth round-trip (uses StubEmailSender in unit +# tests; long-lived tier-2 exercises real SES). +# +# All the Rust-side broker/worker logic (SIWE auth, OIDC mint, cap-token +# verify, etc.) is covered by `cargo test --workspace` in the parent +# CI workflow — those tests already spawn an in-process broker with +# StubSts + StubEmailSender, so the ephemeral-stack script focuses on +# what cargo test can't reach: the on-chain deploy + ABI surface. +# +# Usage: +# bash harness/ci-ephemeral-stack.sh # full ephemeral roundtrip +# bash harness/ci-ephemeral-stack.sh --skip-broker # chain-only (forge + anvil) +# bash harness/ci-ephemeral-stack.sh --keep-running # leave anvil + broker up +# # (for local debugging) +# +# Exit codes: +# 0 every check passed +# 1 any check failed; logs in $WORK_DIR/*.log preserved on failure +# 2 prereqs missing (anvil/forge/cargo) + +set -euo pipefail + +REPO_ROOT="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")/.." 
&& pwd)" +cd "$REPO_ROOT" + +# ─── CLI ───────────────────────────────────────────────────────────────── +SKIP_BROKER=0 +KEEP_RUNNING=0 +ANVIL_PORT="${ANVIL_PORT:-8545}" +MOCK_PORT="${MOCK_PORT:-8090}" +BROKER_PORT="${BROKER_PORT:-8091}" + +while [[ $# -gt 0 ]]; do + case "$1" in + --skip-broker) SKIP_BROKER=1; shift ;; + --keep-running) KEEP_RUNNING=1; shift ;; + --anvil-port) ANVIL_PORT="$2"; shift 2 ;; + --mock-port) MOCK_PORT="$2"; shift 2 ;; + --broker-port) BROKER_PORT="$2"; shift 2 ;; + -h|--help) + sed -n '2,/^set -euo/p' "$0" | sed 's/^# \?//' | sed '$d' + exit 0 ;; + *) echo "unknown flag: $1 (try --help)" >&2; exit 2 ;; + esac +done + +# ─── Colors ────────────────────────────────────────────────────────────── +if [ -t 2 ]; then + C_HEAD='\033[1;36m'; C_OK='\033[1;32m'; C_WARN='\033[1;33m' + C_ERR='\033[1;31m'; C_DIM='\033[2m'; C_RESET='\033[0m' +else + C_HEAD=''; C_OK=''; C_WARN=''; C_ERR=''; C_DIM=''; C_RESET='' +fi +log() { printf "${C_HEAD}==>${C_RESET} %s\n" "$*" >&2; } +ok() { printf " ${C_OK}ok${C_RESET} %s\n" "$*" >&2; } +info() { printf " ${C_DIM}info${C_RESET} %s\n" "$*" >&2; } +warn() { printf " ${C_WARN}warn${C_RESET} %s\n" "$*" >&2; } +die() { printf " ${C_ERR}fail${C_RESET} %s\n" "$*" >&2; exit 1; } + +# ─── Work dir + cleanup trap ───────────────────────────────────────────── +WORK_DIR="$(mktemp -d -t agentkeys-ci-ephemeral-XXXXXX)" +ANVIL_PID="" +MOCK_PID="" +BROKER_PID="" + +cleanup() { + local rc=$? 
+ if [ "$KEEP_RUNNING" = "1" ]; then + info "--keep-running set; leaving processes up" + info " anvil: pid=$ANVIL_PID port=$ANVIL_PORT" + [ -n "$MOCK_PID" ] && info " mock: pid=$MOCK_PID port=$MOCK_PORT" + [ -n "$BROKER_PID" ] && info " broker: pid=$BROKER_PID port=$BROKER_PORT" + info " work_dir: $WORK_DIR" + exit "$rc" + fi + log "Cleanup" + for pid_var in BROKER_PID MOCK_PID ANVIL_PID; do + eval "pid=\${$pid_var:-}" + if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then + kill "$pid" 2>/dev/null || true + wait "$pid" 2>/dev/null || true + ok "stopped $pid_var pid=$pid" + fi + done + if [ "$rc" -ne 0 ]; then + warn "exit=$rc — preserving logs at $WORK_DIR" + for f in "$WORK_DIR"/*.log; do + [ -e "$f" ] || continue + printf "\n${C_DIM}── tail $f ──${C_RESET}\n" >&2 + tail -n 50 "$f" >&2 || true + done + else + rm -rf "$WORK_DIR" + fi +} +trap cleanup EXIT INT TERM + +# ─── 1. Prereq sanity-check ────────────────────────────────────────────── +log "1/8 Prereq sanity-check" +missing=() +for tool in cargo jq curl awk grep sed anvil forge cast; do + command -v "$tool" >/dev/null 2>&1 || missing+=("$tool") +done +if [ ${#missing[@]} -gt 0 ]; then + warn "missing tools: ${missing[*]}" + warn " install Foundry: curl -L https://foundry.paradigm.xyz | bash && foundryup" + warn " install Rust: curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh" + die "prereqs missing" +fi +ok "tools present: cargo jq curl awk grep sed anvil forge cast" + +# ─── 2. Start anvil (new chain) ────────────────────────────────────────── +log "2/8 Starting anvil on 127.0.0.1:$ANVIL_PORT (new ephemeral chain)" +# Anvil's first default account: pre-funded with 10_000 ETH, deterministic. +# This is our "new deployer account" — fresh per CI run, zero blast radius. 
+ANVIL_DEPLOYER_KEY="0xac0974bec39a17e36ba4a6b4d238ff944bacb478cbed5efcae784d7bf4f2ff80" +ANVIL_DEPLOYER_ADDR="0xf39Fd6e51aad88F6F4ce6aB8827279cffFb92266" +anvil --port "$ANVIL_PORT" \ + --host 127.0.0.1 \ + --silent \ + > "$WORK_DIR/anvil.log" 2>&1 & +ANVIL_PID=$! +# Wait for RPC ready (anvil bootstraps fast — <2s typically, give it 30s) +for _ in $(seq 1 60); do + if curl -sf --max-time 1 \ + -H 'Content-Type: application/json' \ + -d '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}' \ + "http://127.0.0.1:$ANVIL_PORT" >/dev/null 2>&1; then + break + fi + sleep 0.5 +done +curl -sf --max-time 2 \ + -H 'Content-Type: application/json' \ + -d '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}' \ + "http://127.0.0.1:$ANVIL_PORT" >/dev/null \ + || die "anvil failed to come up; see $WORK_DIR/anvil.log" +ok "anvil up (pid=$ANVIL_PID chain_id=31337 deployer=$ANVIL_DEPLOYER_ADDR)" + +# ─── 3. Forge build + test (contract unit + property tests) ────────────── +log "3/8 Forge build + test (crates/agentkeys-chain/)" +( + cd crates/agentkeys-chain + forge build > "$WORK_DIR/forge-build.log" 2>&1 \ + || die "forge build failed; see $WORK_DIR/forge-build.log" + ok "forge build clean" + forge test --no-match-test "fork_" > "$WORK_DIR/forge-test.log" 2>&1 \ + || die "forge test failed; see $WORK_DIR/forge-test.log" + ok "forge test passed ($(grep -c "^\[PASS\]" "$WORK_DIR/forge-test.log" || true) tests)" +) + +# ─── 4.
Deploy v2 stage-1 contract set (new smart contracts on-chain) ──── +log "4/8 Deploy v2 stage-1 contracts via DeployAgentKeysV1.s.sol" +( + cd crates/agentkeys-chain + forge script script/DeployAgentKeysV1.s.sol \ + --rpc-url "http://127.0.0.1:$ANVIL_PORT" \ + --private-key "$ANVIL_DEPLOYER_KEY" \ + --broadcast \ + --skip-simulation \ + > "$WORK_DIR/forge-deploy.log" 2>&1 \ + || die "forge script deploy failed; see $WORK_DIR/forge-deploy.log" +) +# Parse "Name: 0xAddress" lines (the contract names from DeployAgentKeysV1.s.sol's +# console.log calls). Format matches heima-bring-up.sh's parser. +parse_addr() { + local name="$1" + awk -v want="$name" ' + $0 ~ want":" { + for (i=1; i<=NF; i++) if ($i ~ /^0x[a-fA-F0-9]{40}$/) { print $i; exit } + } + ' "$WORK_DIR/forge-deploy.log" +} +SCOPE_ADDR=$(parse_addr "AgentKeysScope") +REGISTRY_ADDR=$(parse_addr "SidecarRegistry") +EPOCH_ADDR=$(parse_addr "K3EpochCounter") +AUDIT_ADDR=$(parse_addr "CredentialAudit") +P256_ADDR=$(parse_addr "P256Verifier") +K11_ADDR=$(parse_addr "K11Verifier") +for v in SCOPE_ADDR REGISTRY_ADDR EPOCH_ADDR AUDIT_ADDR P256_ADDR K11_ADDR; do + eval "val=\${$v}" + [ -n "$val" ] || die "could not parse $v from forge-deploy.log" +done +ok "AgentKeysScope: $SCOPE_ADDR" +ok "SidecarRegistry: $REGISTRY_ADDR" +ok "K3EpochCounter: $EPOCH_ADDR" +ok "CredentialAudit: $AUDIT_ADDR" +ok "P256Verifier: $P256_ADDR" +ok "K11Verifier: $K11_ADDR" + +# ─── 5. Write synthetic operator-workstation.env for verify scripts ────── +log "5/8 Write synthetic operator-workstation.env (--anvil profile)" +SYNTH_ENV="$WORK_DIR/operator-workstation.env" +cat > "$SYNTH_ENV" < "$WORK_DIR/verify-contracts.log" 2>&1 || verify_rc=$? +restore_env +if [ "$verify_rc" -ne 0 ]; then + warn "verify-heima-contracts.sh exited $verify_rc; full log:" + cat "$WORK_DIR/verify-contracts.log" >&2 + die "contract verification failed" +fi +ok "all six v2 stage-1 contracts verified (bytecode + ABI + wiring)" + +# ─── 7. 
Optional: stand up the broker server (on by default) ───────────── +if [ "$SKIP_BROKER" = "1" ]; then + log "7/8 Broker bring-up SKIPPED (--skip-broker)" +else + log "7/8 Stand up ephemeral broker (new test broker server)" + + # Pre-generate keypairs so the broker boots clean. The keygen + # subcommand writes 0600 files; matches the production setup-broker-host + # flow but in $WORK_DIR instead of /var/lib/agentkeys. + BROKER_DATA_DIR="$WORK_DIR/broker-data" + mkdir -p "$BROKER_DATA_DIR" + info "building agentkeys-broker-server (release)" + cargo build --release -p agentkeys-broker-server \ + > "$WORK_DIR/cargo-build-broker.log" 2>&1 \ + || die "cargo build broker failed; see $WORK_DIR/cargo-build-broker.log" + BROKER_BIN="$REPO_ROOT/target/release/agentkeys-broker-server" + [ -x "$BROKER_BIN" ] || die "broker binary missing at $BROKER_BIN" + + "$BROKER_BIN" keygen --purpose oidc \ + --out "$BROKER_DATA_DIR/oidc-keypair.json" >/dev/null + "$BROKER_BIN" keygen --purpose session \ + --out "$BROKER_DATA_DIR/session-keypair.json" >/dev/null + ok "broker keypairs generated" + + info "building agentkeys-mock-server (release)" + cargo build --release -p agentkeys-mock-server \ + > "$WORK_DIR/cargo-build-mock.log" 2>&1 \ + || die "cargo build mock-server failed; see $WORK_DIR/cargo-build-mock.log" + MOCK_BIN="$REPO_ROOT/target/release/agentkeys-mock-server" + [ -x "$MOCK_BIN" ] || die "mock-server binary missing at $MOCK_BIN" + + info "starting mock-server on 127.0.0.1:$MOCK_PORT" + "$MOCK_BIN" --port "$MOCK_PORT" \ + > "$WORK_DIR/mock-server.log" 2>&1 & + MOCK_PID=$!
+ for _ in $(seq 1 60); do + curl -sf --max-time 1 "http://127.0.0.1:$MOCK_PORT/healthz" >/dev/null 2>&1 && break + sleep 0.25 + done + curl -sf --max-time 2 "http://127.0.0.1:$MOCK_PORT/healthz" >/dev/null \ + || die "mock-server failed to come up; see $WORK_DIR/mock-server.log" + ok "mock-server up (pid=$MOCK_PID)" + + info "starting broker on 127.0.0.1:$BROKER_PORT (--skip-startup-check)" + # No real AWS creds in CI — broker runs OIDC-only mint path per issue #71, + # so the only thing AWS would do is the optional GetCallerIdentity probe, + # which --skip-startup-check disables. + BROKER_OIDC_ISSUER="http://127.0.0.1:$BROKER_PORT" \ + BROKER_BACKEND_URL="http://127.0.0.1:$MOCK_PORT" \ + BROKER_DATA_ROLE_ARN="arn:aws:iam::000000000000:role/agentkeys-data-role-ci" \ + BROKER_AWS_REGION="us-east-1" \ + BROKER_OIDC_KEYPAIR_PATH="$BROKER_DATA_DIR/oidc-keypair.json" \ + BROKER_SESSION_KEYPAIR_PATH="$BROKER_DATA_DIR/session-keypair.json" \ + BROKER_AUDIT_DB_PATH="$BROKER_DATA_DIR/audit.sqlite" \ + RUST_LOG=info \ + "$BROKER_BIN" --bind 127.0.0.1 --port "$BROKER_PORT" --skip-startup-check \ + > "$WORK_DIR/broker.log" 2>&1 & + BROKER_PID=$! + for _ in $(seq 1 60); do + curl -sf --max-time 1 "http://127.0.0.1:$BROKER_PORT/healthz" >/dev/null 2>&1 && break + sleep 0.25 + done + curl -sf --max-time 2 "http://127.0.0.1:$BROKER_PORT/healthz" >/dev/null \ + || die "broker failed to come up; see $WORK_DIR/broker.log" + ok "broker up (pid=$BROKER_PID)" + + # OIDC discovery surface — same endpoints AWS would hit in tier-2. 
+ info "probing OIDC discovery surface" + curl -sf --max-time 2 \ + "http://127.0.0.1:$BROKER_PORT/.well-known/openid-configuration" \ + > "$WORK_DIR/oidc-config.json" \ + || die "openid-configuration unreachable" + jq -e '.issuer == "http://127.0.0.1:'"$BROKER_PORT"'"' \ + "$WORK_DIR/oidc-config.json" >/dev/null \ + || die "openid-configuration issuer claim mismatch (see $WORK_DIR/oidc-config.json)" + ok ".well-known/openid-configuration → issuer matches" + + curl -sf --max-time 2 \ + "http://127.0.0.1:$BROKER_PORT/.well-known/jwks.json" \ + > "$WORK_DIR/jwks.json" \ + || die "jwks.json unreachable" + jq -e '.keys | length >= 1' "$WORK_DIR/jwks.json" >/dev/null \ + || die "jwks.json has no keys (see $WORK_DIR/jwks.json)" + ok ".well-known/jwks.json → at least one key present" +fi + +# ─── 8. Summary ────────────────────────────────────────────────────────── +log "8/8 Summary" +ok "ephemeral environment passed all checks" +info " chain : anvil (chain_id 31337, ephemeral)" +info " deployer : $ANVIL_DEPLOYER_ADDR" +info " contracts : 6/6 deployed + verified on chain" +if [ "$SKIP_BROKER" != "1" ]; then + info " broker : http://127.0.0.1:$BROKER_PORT" + info " oidc issuer: http://127.0.0.1:$BROKER_PORT" + info " backend : http://127.0.0.1:$MOCK_PORT (mock-server)" +fi +info "" +info "Not covered here (needs long-lived test-broker.litentry.org —" +info "see docs/test-environment.md):" +info " * stage-3 per-actor + per-data-class S3 PrincipalTag isolation" +info " * real AWS STS AssumeRoleWithWebIdentity" +info " * real SES email-link auth round-trip" diff --git a/scripts/provision-test-environment.sh b/scripts/provision-test-environment.sh new file mode 100755 index 0000000..6155522 --- /dev/null +++ b/scripts/provision-test-environment.sh @@ -0,0 +1,276 @@ +#!/usr/bin/env bash +# scripts/provision-test-environment.sh — issue #66 tier-2 one-shot +# provisioner for the long-lived parallel test environment. 
+# +# What this script provisions (every resource parallel to prod, every +# name carrying a -test suffix so misconfigured CI runs targeting prod +# fail closed): +# +# 1. AWS IAM OIDC provider for test-broker.litentry.org +# 2. AWS IAM roles: +# - agentkeys-data-role-test (email subsystem) +# - agentkeys-vault-role-test (credentials, scoped to vault bucket) +# - agentkeys-memory-role-test (long-term memory, scoped to memory bucket) +# All three trust-policied on the test OIDC provider, with the same +# PrincipalTag/agentkeys_actor_omni scoping that prod uses. +# 3. AWS S3 buckets (per-data-class, per arch.md §17.2): +# - agentkeys-mail-test-${ACCT} +# - agentkeys-vault-test-${ACCT} +# - agentkeys-memory-test-${ACCT} +# Each with block-public-access + default SSE-S3 + the v3 +# split-statement PrincipalTag bucket policy from prod +# (scripts/apply-vault-bucket-policy.sh + apply-memory-bucket-policy.sh). +# 4. EC2 broker host at test-broker.litentry.org via: +# bash scripts/setup-broker-host.sh \ +# --issuer-url https://test-broker.litentry.org \ +# --account-id ${ACCOUNT_ID} \ +# --signer-host signer-test.litentry.org \ +# --audit-host audit-test.litentry.org \ +# --email-host email-test.litentry.org \ +# --cred-host cred-test.litentry.org \ +# --memory-host memory-test.litentry.org \ +# --chain-rpc https://rpc.paseo-parachain.heima.network \ +# --vault-bucket agentkeys-vault-test-${ACCOUNT_ID} \ +# --memory-bucket agentkeys-memory-test-${ACCOUNT_ID} +# 5. A new deployer wallet on Heima-Paseo (distinct from the prod +# deployer), persisted at ~/.agentkeys/heima-paseo-deployer-test.key. +# Funded from the operator's personal Paseo wallet (no sudo on +# mainnet; sudo is fine on Paseo via Alice if collators are up). +# 6. Fresh v2 stage-1 contracts deployed via DeployAgentKeysV1.s.sol +# to Heima-Paseo, distinct addresses from prod, written to +# scripts/test-environment.env under the *_HEIMA_PASEO keys. +# +# Idempotent: re-run safely. 
Each step pre-checks "is this already done?" +# before acting. Failed runs leave a paper trail in $WORK_DIR. +# +# This is the OPERATOR script — runs once per account. The CI workflow +# (.github/workflows/harness-e2e.yml) consumes the provisioned env via +# GitHub Actions secrets + scripts/test-environment.env. +# +# Usage: +# awsp agentkeys-admin # admin profile required +# bash scripts/provision-test-environment.sh # full provisioning +# bash scripts/provision-test-environment.sh --dry-run +# bash scripts/provision-test-environment.sh --only-step N +# +# Per CLAUDE.md, this script is the SINGLE ENTRY POINT for test-env +# changes. No ad-hoc aws iam / aws s3api edits — extend this script +# instead and re-run. + +set -euo pipefail + +REPO_ROOT="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")/.." && pwd)" +cd "$REPO_ROOT" + +# ─── Config defaults ───────────────────────────────────────────────────── +DRY_RUN=0 +ONLY_STEP="" +TEST_BROKER_HOST="${TEST_BROKER_HOST:-test-broker.litentry.org}" +TEST_SIGNER_HOST="${TEST_SIGNER_HOST:-signer-test.litentry.org}" +TEST_AUDIT_HOST="${TEST_AUDIT_HOST:-audit-test.litentry.org}" +TEST_EMAIL_HOST="${TEST_EMAIL_HOST:-email-test.litentry.org}" +TEST_CRED_HOST="${TEST_CRED_HOST:-cred-test.litentry.org}" +TEST_MEMORY_HOST="${TEST_MEMORY_HOST:-memory-test.litentry.org}" +TEST_ENV_FILE="$REPO_ROOT/scripts/test-environment.env" +TEST_ENV_EXAMPLE="$REPO_ROOT/scripts/test-environment.env.example" +WORK_DIR="$(mktemp -d -t agentkeys-provision-test-XXXXXX)" + +while [[ $# -gt 0 ]]; do + case "$1" in + --dry-run) DRY_RUN=1; shift ;; + --only-step) ONLY_STEP="$2"; shift 2 ;; + --test-broker-host) TEST_BROKER_HOST="$2"; shift 2 ;; + -h|--help) + sed -n '2,/^set -euo/p' "$0" | sed 's/^# \?//' | sed '$d' + exit 0 ;; + *) echo "unknown flag: $1 (try --help)" >&2; exit 2 ;; + esac +done + +# ─── Colors ────────────────────────────────────────────────────────────── +if [ -t 2 ]; then + C_HEAD='\033[1;36m'; C_OK='\033[1;32m'; C_SKIP='\033[1;33m' + 
C_WARN='\033[1;33m'; C_ERR='\033[1;31m'; C_RESET='\033[0m' +else + C_HEAD=''; C_OK=''; C_SKIP=''; C_WARN=''; C_ERR=''; C_RESET='' +fi +log() { printf "${C_HEAD}==>${C_RESET} %s\n" "$*" >&2; } +ok() { printf " ${C_OK}ok${C_RESET} %s\n" "$*" >&2; } +skip() { printf " ${C_SKIP}skip${C_RESET} %s\n" "$*" >&2; } +warn() { printf " ${C_WARN}warn${C_RESET} %s\n" "$*" >&2; } +die() { printf " ${C_ERR}fail${C_RESET} %s\n" "$*" >&2; exit 1; } + +should_run_step() { + [ -z "$ONLY_STEP" ] && return 0 + [ "$1" = "$ONLY_STEP" ] +} + +run_or_dry() { + if [ "$DRY_RUN" = "1" ]; then + printf " ${C_WARN}dry-run${C_RESET} %s\n" "$*" >&2 + else + "$@" + fi +} + +# ─── Step 0: prerequisite check ────────────────────────────────────────── +log "0/7 Prereq check" +caller_arn=$(aws sts get-caller-identity --query Arn --output text 2>&1) \ + || die "aws sts get-caller-identity failed: $caller_arn — run: awsp agentkeys-admin" +caller_lc=$(printf '%s' "$caller_arn" | tr '[:upper:]' '[:lower:]') +case "$caller_lc" in + *":user/agentkeys-admin"*) ok "caller: $caller_arn" ;; + *) die "caller is $caller_arn — admin required. Run: awsp agentkeys-admin" ;; +esac +ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text) +REGION="${AWS_REGION:-us-east-1}" +ok "ACCOUNT_ID=$ACCOUNT_ID REGION=$REGION" + +# Seed the env file if missing +if [ ! 
-f "$TEST_ENV_FILE" ]; then + [ -f "$TEST_ENV_EXAMPLE" ] || die "missing $TEST_ENV_EXAMPLE (committed template)" + cp "$TEST_ENV_EXAMPLE" "$TEST_ENV_FILE" + ok "seeded $TEST_ENV_FILE from .example" +fi + +env_set() { + local key="$1" val="$2" file="$3" + if grep -qE "^${key}=" "$file" 2>/dev/null; then + if [ "$(uname)" = "Darwin" ]; then + sed -i '' -E "s|^${key}=.*|${key}=${val}|" "$file" + else + sed -i -E "s|^${key}=.*|${key}=${val}|" "$file" + fi + else + printf '%s=%s\n' "$key" "$val" >> "$file" + fi +} +env_set ACCOUNT_ID "$ACCOUNT_ID" "$TEST_ENV_FILE" +env_set REGION "$REGION" "$TEST_ENV_FILE" + +# ─── Step 1: provision the broker host (mirrors prod §5) ───────────────── +if should_run_step 1; then + log "1/7 Provision broker host (test-broker.${TEST_BROKER_HOST#test-broker.})" + cat >&2 </agentKeys && cd agentKeys + bash scripts/setup-broker-host.sh \\ + --issuer-url https://${TEST_BROKER_HOST} \\ + --account-id ${ACCOUNT_ID} \\ + --signer-host ${TEST_SIGNER_HOST} \\ + --audit-host ${TEST_AUDIT_HOST} \\ + --email-host ${TEST_EMAIL_HOST} \\ + --cred-host ${TEST_CRED_HOST} \\ + --memory-host ${TEST_MEMORY_HOST} \\ + --chain-rpc https://rpc.paseo-parachain.heima.network \\ + --vault-bucket agentkeys-vault-test-${ACCOUNT_ID} \\ + --memory-bucket agentkeys-memory-test-${ACCOUNT_ID} \\ + --email-from noreply-test@bots-test.litentry.org \\ + --non-interactive --yes + 4. Confirm: curl -sf https://${TEST_BROKER_HOST}/healthz + + See docs/test-environment.md §3 for the full host runbook. 
+EOF + skip "manual operator step; rerun --only-step 2 once the host is up" +fi + +# ─── Step 2: IAM OIDC provider for test-broker ─────────────────────────── +if should_run_step 2; then + log "2/7 IAM OIDC provider (oidc-provider/${TEST_BROKER_HOST})" + oidc_arn="arn:aws:iam::${ACCOUNT_ID}:oidc-provider/${TEST_BROKER_HOST}" + if aws iam get-open-id-connect-provider --open-id-connect-provider-arn "$oidc_arn" \ + >/dev/null 2>&1; then + skip "OIDC provider already registered: $oidc_arn" + else + # Fetch the broker's TLS leaf thumbprint (AWS requires it for OIDC + # provider registration). Public TLS cert, so this is fine to + # fetch from any network. + thumb=$(echo | openssl s_client -servername "$TEST_BROKER_HOST" \ + -connect "${TEST_BROKER_HOST}:443" 2>/dev/null \ + | openssl x509 -fingerprint -noout 2>/dev/null \ + | awk -F'=' '{print $2}' | tr -d ':' | tr 'A-Z' 'a-z') + [ -n "$thumb" ] || die "could not fetch TLS thumbprint for ${TEST_BROKER_HOST}; is the broker reachable?" + run_or_dry aws iam create-open-id-connect-provider \ + --url "https://${TEST_BROKER_HOST}" \ + --client-id-list "sts.amazonaws.com" \ + --thumbprint-list "$thumb" + ok "registered $oidc_arn (thumbprint=$thumb)" + fi + env_set OIDC_PROVIDER_ARN "$oidc_arn" "$TEST_ENV_FILE" +fi + +# ─── Step 3: IAM roles (data, vault, memory) ───────────────────────────── +if should_run_step 3; then + log "3/7 IAM roles (data-test, vault-test, memory-test)" + # These wrap the existing prod provisioning scripts with a -test + # suffix on every name. The scripts read role/bucket names from env, + # so set env then call. + warn "extend scripts/provision-vault-role.sh + provision-memory-role.sh" + warn "to accept a SUFFIX env var, or copy them as -test variants." + warn "Tracking as a TODO in this script — exercise once the prod" + warn "scripts are parameterized (~ 1 PR of work)." 
+fi + +# ─── Step 4: S3 buckets ────────────────────────────────────────────────── +if should_run_step 4; then + log "4/7 S3 buckets (mail-test, vault-test, memory-test)" + warn "same parameterization story as step 3 — see TODO above." +fi + +# ─── Step 5: deployer wallet + funding ─────────────────────────────────── +if should_run_step 5; then + log "5/7 Deployer wallet on Heima-Paseo (distinct from prod deployer)" + KEYFILE="$HOME/.agentkeys/heima-paseo-deployer-test.key" + if [ -f "$KEYFILE" ]; then + skip "$KEYFILE exists" + else + mkdir -p "$(dirname "$KEYFILE")" + run_or_dry cast wallet new --json \ + | tee "$WORK_DIR/wallet.json" \ + | jq -r .[0].private_key > "$KEYFILE" + chmod 600 "$KEYFILE" + addr=$(jq -r .[0].address "$WORK_DIR/wallet.json") + ok "generated $KEYFILE (addr=$addr) — fund this address from your" + ok " personal Paseo wallet, then re-run --only-step 6 to deploy contracts." + fi +fi + +# ─── Step 6: deploy v2 stage-1 contracts on Heima-Paseo ────────────────── +if should_run_step 6; then + log "6/7 Deploy v2 stage-1 contracts to Heima-Paseo (new contracts on-chain)" + KEYFILE="$HOME/.agentkeys/heima-paseo-deployer-test.key" + [ -f "$KEYFILE" ] || die "missing $KEYFILE — run --only-step 5 first" + run_or_dry env HEIMA_DEPLOYER_KEY_FILE="$KEYFILE" \ + AGENTKEYS_CHAIN=heima-paseo \ + bash "$REPO_ROOT/scripts/heima-bring-up.sh" + ok "contract addresses recorded in scripts/operator-workstation.env;" + ok " copy the *_HEIMA_PASEO lines into $TEST_ENV_FILE." +fi + +# ─── Step 7: GitHub Actions OIDC role for the e2e workflow ─────────────── +if should_run_step 7; then + log "7/7 GitHub Actions OIDC role (test-only)" + warn "Create an additional IAM role 'github-actions-agentkeys-e2e'" + warn "with trust policy on token.actions.githubusercontent.com and a" + warn "condition limiting to the agentkeys repo + branch ref. 
Grant" + warn "agentkeys-vault-role-test + agentkeys-memory-role-test assume" + warn "perms and read-only S3 on the three test buckets." + warn "" + warn "Then store the role ARN as the TEST_OIDC_AWS_ROLE_ARN repo secret." + warn "Until that secret is set, .github/workflows/harness-e2e.yml is" + warn "inert (the job is gated on its presence)." +fi + +# ─── Done ──────────────────────────────────────────────────────────────── +log "Done" +ok "test environment provisioning complete (or skip-noted above)" +ok "next: bash harness/v2-stage3-demo.sh against \$OIDC_ISSUER=${TEST_BROKER_HOST}" +ok " with AGENTKEYS_ENV_FILE=$TEST_ENV_FILE" +rm -rf "$WORK_DIR" diff --git a/scripts/test-environment.env.example b/scripts/test-environment.env.example new file mode 100644 index 0000000..68c8787 --- /dev/null +++ b/scripts/test-environment.env.example @@ -0,0 +1,92 @@ +# AgentKeys long-lived test environment — env file template (issue #66 tier-2). +# +# Companion to scripts/operator-workstation.env, but for the PARALLEL +# test infrastructure (not prod): +# +# - Hostname: test-broker.litentry.org (vs. broker.litentry.org) +# - OIDC iss: https://test-broker.litentry.org +# - IAM role: agentkeys-data-role-test (vs. agentkeys-data-role) +# - Vault role: agentkeys-vault-role-test (vs. agentkeys-vault-role) +# - Mem role: agentkeys-memory-role-test (vs. agentkeys-memory-role) +# - Mail/vault/memory buckets: -test suffix on every bucket name +# - Chain: heima-paseo (testnet — no real-HEI cost on every CI run) +# - Deployer: separate keypair, persisted only in operator wallet +# - Contracts: deployed fresh by scripts/provision-test-environment.sh, +# distinct addresses from prod (recorded below per chain) +# +# Why mirror operator-workstation.env instead of forking it: the harness +# scripts (harness/v2-stage*.sh) source ONE env file. Setting +# AGENTKEYS_ENV_FILE=./scripts/test-environment.env before invoking a +# harness script reuses the entire flow against the test infra unchanged. 
+# +# Bring-up: bash scripts/provision-test-environment.sh +# Activate: cp scripts/test-environment.env.example scripts/test-environment.env +# (then fill in the values below from the provisioner output) +# +# This .example file commits as-is. The non-example copy MUST NOT be +# committed (it carries no secrets in itself, but its contents are the +# canonical "this account hosts the test infra" pointer — gated behind +# the operator's deliberate copy). +# +# See docs/test-environment.md for the full bring-up runbook. + +# ─── AWS account ───────────────────────────────────────────────────────── +# Same account as prod is fine for cost, but every resource name carries +# a -test suffix so a misconfigured CI run targeting prod fails closed +# (the role / bucket / OIDC provider simply won't exist in prod). +ACCOUNT_ID=000000000000 +REGION=us-east-1 + +# ─── Hostname + OIDC issuer ────────────────────────────────────────────── +# DNS A record + TLS cert + nginx + systemd all per scripts/setup-broker-host.sh +# with --issuer-url https://test-broker.litentry.org. Long-lived because +# AWS validates the OIDC issuer URL byte-for-byte against the JWT `iss` +# claim — every reboot must restore the same URL. 
+BROKER_HOST=test-broker.litentry.org +OIDC_ISSUER=https://${BROKER_HOST} +OIDC_PROVIDER_ARN=arn:aws:iam::${ACCOUNT_ID}:oidc-provider/${BROKER_HOST} + +# ─── IAM roles (parallel to prod, distinct ARNs) ───────────────────────── +DATA_ROLE_ARN=arn:aws:iam::${ACCOUNT_ID}:role/agentkeys-data-role-test +VAULT_ROLE_ARN=arn:aws:iam::${ACCOUNT_ID}:role/agentkeys-vault-role-test +MEMORY_ROLE_ARN=arn:aws:iam::${ACCOUNT_ID}:role/agentkeys-memory-role-test + +# ─── S3 buckets (parallel to prod, distinct names) ─────────────────────── +MAIL_DOMAIN=bots-test.litentry.org +MAIL_BUCKET=agentkeys-mail-test-${ACCOUNT_ID} +BUCKET=${MAIL_BUCKET} +VAULT_BUCKET=agentkeys-vault-test-${ACCOUNT_ID} +MEMORY_BUCKET=agentkeys-memory-test-${ACCOUNT_ID} + +# ─── Backend (signer) URL ──────────────────────────────────────────────── +# Test env runs the mock-server backend (the production dev_key_service +# shape). Real TEE workers are out of scope for the test environment — +# see issue #74 step 2. +AGENTKEYS_SIGNER_URL=https://signer-test.litentry.org +BACKEND_URL=${AGENTKEYS_SIGNER_URL} + +# ─── Chain (Heima-Paseo testnet) ───────────────────────────────────────── +# Defaults to Paseo for zero real-HEI cost. Override to `anvil` for +# fully local runs; never `heima` (mainnet — prod-only). +AGENTKEYS_CHAIN=heima-paseo + +# Contract addresses — populated by scripts/provision-test-environment.sh. +# Keep one set per chain so re-bring-up against another chain doesn't +# clobber. The non-test file commits the actual addresses post-deploy. 
+SCOPE_CONTRACT_ADDRESS_HEIMA_PASEO=0x0000000000000000000000000000000000000000 +SIDECAR_REGISTRY_ADDRESS_HEIMA_PASEO=0x0000000000000000000000000000000000000000 +K3_EPOCH_COUNTER_ADDRESS_HEIMA_PASEO=0x0000000000000000000000000000000000000000 +CREDENTIAL_AUDIT_ADDRESS_HEIMA_PASEO=0x0000000000000000000000000000000000000000 +P256_VERIFIER_ADDRESS_HEIMA_PASEO=0x0000000000000000000000000000000000000000 +K11_VERIFIER_ADDRESS_HEIMA_PASEO=0x0000000000000000000000000000000000000000 + +# ─── Deployer key path ─────────────────────────────────────────────────── +# Operator-held only; the test deployer is a DIFFERENT wallet from prod. +# Provisioner persists it at ~/.agentkeys/heima-paseo-deployer-test.key. +HEIMA_DEPLOYER_KEY_FILE=${HOME}/.agentkeys/heima-paseo-deployer-test.key + +# ─── CI namespacing (per-run S3 prefix isolation) ──────────────────────── +# Set by the e2e workflow at run time so concurrent CI runs don't step +# on each other's writes. Cleaned up by nightly s3-prefix-rm job (see +# docs/test-environment.md §Cleanup). +CI_S3_PREFIX=ci/pr-${PR_NUMBER:-manual}/run-${GITHUB_RUN_ID:-local} From cd25bdecd549f786856ed5471a6bb7dce652424e Mon Sep 17 00:00:00 2001 From: wildmeta-agent Date: Thu, 21 May 2026 09:33:21 +0800 Subject: [PATCH 2/4] issue #66: collapse to one CI file; mirror prod env on Heima mainnet MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Per operator feedback: 1. "do not create new files, only add the test file" — drop the ephemeral-stack helper, provisioner, env template, e2e workflow, and docs. Single deliverable: .github/workflows/harness-ci.yml. 2. "onchain solution should test on Heima mainnet with a new smart contract address" — confirmed possible: Solidity compiles deterministically and EVM contract addresses derive from (deployer, nonce). 
Identical crates/agentkeys-chain/src/*.sol + identical DeployAgentKeysV1.s.sol + a different deployer key on Heima mainnet = isolated parallel contract set at new addresses on the production chain. 3. "CI mirrors the production env" — the workflow now invokes the PRODUCTION harness scripts (harness/v2-stage{1,2,3}-demo.sh) unchanged. The only thing CI does differently from a prod operator is materialize scripts/operator-workstation.env with TEST_* resource names from GitHub secrets: - TEST_OIDC_AWS_ROLE_ARN (gate; until set, harness job skips) - TEST_ACCOUNT_ID / TEST_AWS_REGION / TEST_BROKER_HOST - TEST_VAULT_BUCKET / TEST_MEMORY_BUCKET - TEST_{VAULT,MEMORY,DATA}_ROLE_ARN - TEST_HEIMA_DEPLOYER_KEY (raw 0x-prefixed mainnet key — test wallet, distinct from prod deployer) - TEST_{SCOPE,SIDECAR_REGISTRY,K3_EPOCH_COUNTER, CREDENTIAL_AUDIT,P256_VERIFIER,K11_VERIFIER}_CONTRACT_ADDRESS_HEIMA (pre-deployed once per test-env refresh; harness skips deploy via --skip-deploy so CI doesn't burn HEI on every push) AWS auth via GitHub Actions OIDC (id-token: write), no long-lived secrets. Per-run S3 prefix isolation. The workflow gates itself on TEST_OIDC_AWS_ROLE_ARN being set so it's inert until the operator activates the test infra. WebAuthn: never invoked — harness scripts default to WEBAUTHN_MODE=0 (stage-1 line 131) and stage-2's --stub flag is passed explicitly. LLM: zero. Plain cargo/forge/aws-cli/curl orchestration. Distinct from claude.yml + claude-code-review.yml which intentionally do call @claude. 
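Illustration only: a minimal sketch of the gate-then-materialize step described above, assuming the workflow exports each TEST_* secret as an environment variable. The function name materialize_env and the mapping from TEST_* names to env-file keys are hypothetical, not the workflow's actual code; the keys themselves (ACCOUNT_ID, REGION, BROKER_HOST, VAULT_BUCKET, MEMORY_BUCKET) are borrowed from the env template this commit removes.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper, not part of the repo: writes an env file from
# TEST_* variables, or skips cleanly when the gate secret is absent.
materialize_env() {
  local out="$1"

  # Gate: with no TEST_OIDC_AWS_ROLE_ARN the test infra is not activated,
  # so emit a GitHub Actions warning annotation and succeed without writing.
  if [ -z "${TEST_OIDC_AWS_ROLE_ARN:-}" ]; then
    echo "::warning::TEST_OIDC_AWS_ROLE_ARN is not set; skipping harness stages"
    return 0
  fi

  # ${VAR:?msg} aborts the script with msg when a required value is missing,
  # so a half-configured run fails loudly instead of writing blank entries.
  {
    printf 'ACCOUNT_ID=%s\n'    "${TEST_ACCOUNT_ID:?TEST_ACCOUNT_ID is required}"
    printf 'REGION=%s\n'        "${TEST_AWS_REGION:?TEST_AWS_REGION is required}"
    printf 'BROKER_HOST=%s\n'   "${TEST_BROKER_HOST:?TEST_BROKER_HOST is required}"
    printf 'VAULT_BUCKET=%s\n'  "${TEST_VAULT_BUCKET:?TEST_VAULT_BUCKET is required}"
    printf 'MEMORY_BUCKET=%s\n' "${TEST_MEMORY_BUCKET:?TEST_MEMORY_BUCKET is required}"
  } > "$out"
  echo "wrote $out"
}
```

Under these assumptions a workflow step could call materialize_env scripts/operator-workstation.env before invoking the harness scripts; the early return is what would keep the job green until the operator sets the gate secret.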
--- .github/workflows/harness-ci.yml | 299 +++++++++++++------ .github/workflows/harness-e2e.yml | 204 ------------- docs/test-environment.md | 166 ----------- harness/ci-ephemeral-stack.sh | 401 -------------------------- scripts/provision-test-environment.sh | 276 ------------------ scripts/test-environment.env.example | 92 ------ 6 files changed, 210 insertions(+), 1228 deletions(-) delete mode 100644 .github/workflows/harness-e2e.yml delete mode 100644 docs/test-environment.md delete mode 100755 harness/ci-ephemeral-stack.sh delete mode 100755 scripts/provision-test-environment.sh delete mode 100644 scripts/test-environment.env.example diff --git a/.github/workflows/harness-ci.yml b/.github/workflows/harness-ci.yml index ec19e66..0505d45 100644 --- a/.github/workflows/harness-ci.yml +++ b/.github/workflows/harness-ci.yml @@ -1,39 +1,67 @@ name: harness CI (no LLM) -# Issue #66 tier-1: deterministic, no-LLM, no-WebAuthn CI that exercises -# the same code paths the harness scripts run, but against an ephemeral -# in-CI test environment (anvil + mock-server + stub-STS broker). +# Issue #66: deterministic, no-LLM, no-WebAuthn CI that runs the SAME +# production harness scripts (harness/v2-stage{1,2,3}-demo.sh) against +# a parallel TEST instance of the production environment. # -# Separate from the existing claude.yml / claude-code-review.yml workflows -# (which invoke @claude on PR comments + reviews). This workflow never -# spends LLM tokens — it's plain cargo/forge/curl orchestration. +# "Mirror production" means: same Heima mainnet chain, same Solidity +# source files, same harness scripts, same broker code, same AWS +# IAM/STS/S3 surfaces. The only delta is identifiers — a different +# deployer wallet → different contract addresses; a different OIDC +# provider URL → different IAM role + bucket. Every test resource +# carries a -test suffix so a misconfigured run targeting prod fails +# closed (the role/bucket simply won't exist in prod). 
# -# Coverage map (matches harness/v2-stage*.sh where ephemeral CI can): +# Operator-provided GitHub repo secrets (one-shot setup, then immutable +# for the life of the test environment): # -# * `cargo fmt --check` — formatting gate -# * `cargo clippy -D warnings` — lint gate -# * `cargo test --workspace` — unit + in-process integration -# tests. The broker tests -# already spawn a full -# in-process broker with -# StubSts + StubEmailSender, -# so SIWE / OIDC mint / cap -# verify / multi-master / -# recovery / per-data-class -# isolation Rust logic is all -# covered here. Per CLAUDE.md -# "all async / #[tokio::test]" -# convention. -# * `harness/ci-ephemeral-stack.sh` — forge build + forge test + -# forge script deploy on a -# fresh anvil + read-only -# ABI/wiring verification. -# Plus broker boot smoke + -# OIDC discovery surface. +# TEST_OIDC_AWS_ROLE_ARN IAM role assumed by this workflow via GitHub +# Actions OIDC. Trust policy: +# "token.actions.githubusercontent.com", +# conditioned on this repo + ref. Grants: +# sts:AssumeRole on the test data roles + +# read-only S3 on the test buckets. +# TEST_ACCOUNT_ID AWS account ID hosting the test infra. +# Same account as prod is fine — isolation is +# by resource name, not by account. +# TEST_AWS_REGION e.g. us-east-1 +# TEST_BROKER_HOST test-broker.litentry.org (long-lived; AWS +# validates OIDC issuer URLs byte-for-byte, +# so this must outlast any single CI run). +# TEST_VAULT_BUCKET agentkeys-vault-test-${ACCOUNT_ID} +# TEST_MEMORY_BUCKET agentkeys-memory-test-${ACCOUNT_ID} +# TEST_VAULT_ROLE_ARN arn:aws:iam::${ACCT}:role/agentkeys-vault-role-test +# TEST_MEMORY_ROLE_ARN arn:aws:iam::${ACCT}:role/agentkeys-memory-role-test +# TEST_DATA_ROLE_ARN arn:aws:iam::${ACCT}:role/agentkeys-data-role-test +# TEST_HEIMA_DEPLOYER_KEY 0x-prefixed Heima mainnet test wallet private +# key (DIFFERENT from prod deployer). 
Deploys +# the same crates/agentkeys-chain/src/*.sol to +# new addresses on mainnet via the same +# DeployAgentKeysV1.s.sol script. Solidity +# bytecode is deterministic and contract +# addresses derive from (deployer, nonce), so +# a different key + same source = isolated +# parallel contract set on the production +# chain. Fund this wallet once from the +# operator's personal Heima wallet. +# TEST_SCOPE_CONTRACT_ADDRESS_HEIMA pinned addresses of the +# TEST_SIDECAR_REGISTRY_ADDRESS_HEIMA test-deployer's mainnet deploy +# TEST_K3_EPOCH_COUNTER_ADDRESS_HEIMA (so CI doesn't burn HEI on +# TEST_CREDENTIAL_AUDIT_ADDRESS_HEIMA every run). One-shot deploy +# TEST_P256_VERIFIER_ADDRESS_HEIMA per test-environment refresh. +# TEST_K11_VERIFIER_ADDRESS_HEIMA # -# Tier-2 (long-lived test-broker.litentry.org, full stage-3 PrincipalTag -# isolation, real AWS STS) lives in .github/workflows/harness-e2e.yml -# and is gated on operator-provisioned infra; see docs/test-environment.md. +# Gating: until TEST_OIDC_AWS_ROLE_ARN is set, the workflow's preflight +# job surfaces a ::warning:: skip and exits clean — safe to merge before +# the operator activates the test infra. +# +# WebAuthn: never invoked. harness/v2-stage1-demo.sh defaults to +# WEBAUTHN_MODE=0 (line 131), v2-stage2-demo.sh accepts --stub, neither +# this workflow nor the harness scripts call WebAuthn paths in this mode. +# +# LLM: never invoked. This workflow is plain cargo/forge/aws-cli/curl — +# distinct from claude.yml + claude-code-review.yml which DO call @claude +# on PR comments + reviews. This workflow consumes zero LLM tokens. 
on: push: @@ -46,14 +74,23 @@ on: - ".github/workflows/harness-ci.yml" - "Cargo.toml" - "Cargo.lock" + workflow_dispatch: + inputs: + stage: + description: "Which harness stage to run (1, 2, 3, or all)" + required: false + default: "all" + type: choice + options: ["1", "2", "3", "all"] -# Allow only one concurrent run per ref so re-pushes cancel stale runs -# (saves runner minutes; each ephemeral stack spins up anvil + builds the -# workspace, so wall-clock matters). concurrency: group: harness-ci-${{ github.ref }} cancel-in-progress: true +permissions: + id-token: write # GitHub Actions OIDC → assume TEST_OIDC_AWS_ROLE_ARN + contents: read + jobs: rust-checks: name: cargo fmt + clippy + test @@ -62,78 +99,162 @@ jobs: steps: - uses: actions/checkout@v4 - - name: Install Rust toolchain - uses: dtolnay/rust-toolchain@stable + - uses: dtolnay/rust-toolchain@stable with: components: clippy, rustfmt - - name: Cache cargo registry + target - uses: Swatinem/rust-cache@v2 + - uses: Swatinem/rust-cache@v2 with: shared-key: harness-ci - - name: cargo fmt --check - run: cargo fmt --all -- --check + - run: cargo fmt --all -- --check + - run: cargo clippy --workspace --all-targets -- -D warnings + # --test-threads=1: broker tests mutate shared process env (HOME, + # AWS_*) and the keyring tests serialize on a per-process accounts + # map — same convention as the existing @claude review workflow. + - run: cargo test --workspace -- --test-threads=1 - - name: cargo clippy - # -D warnings: any clippy diagnostic blocks merge. Matches the - # project's "fix the warning, don't silence it" convention. - run: cargo clippy --workspace --all-targets -- -D warnings - - - name: cargo test --workspace - # --test-threads=1: the broker tests mutate shared process env - # (HOME, AWS_*) and the keyring tests serialize on a per-process - # accounts map — same convention as the @claude review workflow. 
- run: cargo test --workspace -- --test-threads=1 + preflight: + # Gate the harness jobs on the test infra credentials being present. + # Until the operator sets TEST_OIDC_AWS_ROLE_ARN, the harness jobs + # surface as skipped rather than failing. + name: gate on test infra availability + runs-on: ubuntu-latest + needs: rust-checks + outputs: + should_run: ${{ steps.gate.outputs.should_run }} + steps: + - id: gate + run: | + if [ -n "${{ secrets.TEST_OIDC_AWS_ROLE_ARN }}" ]; then + echo "should_run=true" >> "$GITHUB_OUTPUT" + echo "test infra credentials present; proceeding" + else + echo "should_run=false" >> "$GITHUB_OUTPUT" + echo "::warning::TEST_OIDC_AWS_ROLE_ARN unset — harness E2E skipped. See workflow header for operator setup." + fi - ephemeral-stack: - name: ephemeral anvil + chain deploy + harness-e2e: + name: harness/v2-stage*-demo.sh on Heima mainnet (test deployer) + needs: preflight + if: needs.preflight.outputs.should_run == 'true' runs-on: ubuntu-latest - timeout-minutes: 45 - needs: rust-checks # don't burn runner minutes on chain checks if Rust is red + timeout-minutes: 60 + steps: - uses: actions/checkout@v4 with: - # forge install reads .gitmodules — need submodules for forge-std etc. - submodules: recursive - - - name: Install Rust toolchain - uses: dtolnay/rust-toolchain@stable + submodules: recursive # forge install reads .gitmodules - - name: Cache cargo registry + target - uses: Swatinem/rust-cache@v2 + - uses: dtolnay/rust-toolchain@stable + - uses: Swatinem/rust-cache@v2 with: - shared-key: harness-ci # share with rust-checks job + shared-key: harness-ci - - name: Install Foundry (anvil + forge + cast) - uses: foundry-rs/foundry-toolchain@v1 + - uses: foundry-rs/foundry-toolchain@v1 with: version: stable - - name: Verify Foundry toolchain - run: | - anvil --version - forge --version - cast --version - - - name: Run ephemeral stack (chain + broker smoke) - # The script handles its own anvil + broker bring-up/tear-down via - # an EXIT trap. 
Fails the job if any step (forge build/test/deploy, - # contract verification, broker boot, OIDC discovery) fails. - run: bash harness/ci-ephemeral-stack.sh - env: - # Pinned ports so the workflow log is reproducible. - ANVIL_PORT: "8545" - MOCK_PORT: "8090" - BROKER_PORT: "8091" - # Fail builds on rustc warnings as well (matches clippy job). - RUSTFLAGS: "-D warnings" - - - name: Upload logs on failure - if: failure() - uses: actions/upload-artifact@v4 + - name: Configure AWS credentials via OIDC (test role) + uses: aws-actions/configure-aws-credentials@v4 with: - name: ephemeral-stack-logs - path: /tmp/agentkeys-ci-ephemeral-*/ - if-no-files-found: ignore - retention-days: 7 + role-to-assume: ${{ secrets.TEST_OIDC_AWS_ROLE_ARN }} + aws-region: ${{ secrets.TEST_AWS_REGION || 'us-east-1' }} + # Session name shows up in CloudTrail — keep traceable per run. + role-session-name: gh-ci-${{ github.run_id }} + + - name: Build agentkeys CLI + workers (release) + run: cargo build --release --workspace + + - name: Materialize the production env file with TEST values + # The harness scripts source scripts/operator-workstation.env + # unchanged. We OVERWRITE it with the test resource names so + # the entire production harness flow re-points at the test + # infra without modifying a single script — that's what + # "mirror production env" means. + # + # Same chain (heima mainnet), same .sol code, same scripts. + # Different deployer key → different contract addresses on the + # SAME mainnet → fully isolated parallel contract set. + run: | + cat > scripts/operator-workstation.env < "$HOME/.agentkeys/heima-deployer.key" + chmod 600 "$HOME/.agentkeys/heima-deployer.key" + + - name: Stage 1 — chain reachability + identity bootstrap + if: ${{ inputs.stage == 'all' || inputs.stage == '1' || inputs.stage == '' }} + # --skip-deploy: contracts are pre-deployed once per test-env + # refresh (operator one-shot) and pinned in TEST_*_HEIMA secrets, + # so CI doesn't burn HEI on every push. 
+ # --skip-email: SES email-link round-trip is exercised separately; + # identity bootstrap here uses wallet_sig. + # No --webauthn: stub-mode K11 (WEBAUTHN_MODE=0 default). + run: | + AGENTKEYS_CHAIN=heima \ + bash harness/v2-stage1-demo.sh --skip-deploy --skip-email + + - name: Stage 2 — multi-master + recovery (stub mode) + if: ${{ inputs.stage == 'all' || inputs.stage == '2' || inputs.stage == '' }} + run: | + AGENTKEYS_CHAIN=heima \ + bash harness/v2-stage2-demo.sh --stub --skip-build + + - name: Stage 3 — per-actor + per-data-class PrincipalTag isolation + if: ${{ inputs.stage == 'all' || inputs.stage == '3' || inputs.stage == '' }} + # The capstone: stage-3 is the layer with the highest security + # invariant payload (per CLAUDE.md "Per-actor + per-data-class + # isolation invariants" table). Requires AWS STS + # AssumeRoleWithWebIdentity → which requires AWS to fetch the + # OIDC issuer's JWKS over public TLS. The long-lived test broker + # (TEST_BROKER_HOST) satisfies that; the same code path proves + # the prod IAM trust policy + bucket policy are correctly scoped. + run: | + AGENTKEYS_CHAIN=heima \ + bash harness/v2-stage3-demo.sh + + - name: Clean up per-run S3 prefix + if: always() + run: | + PREFIX="ci/run-${{ github.run_id }}/" + for bucket in \ + "${{ secrets.TEST_VAULT_BUCKET }}" \ + "${{ secrets.TEST_MEMORY_BUCKET }}"; do + [ -n "$bucket" ] || continue + aws s3 rm "s3://$bucket/$PREFIX" --recursive 2>/dev/null || true + done diff --git a/.github/workflows/harness-e2e.yml b/.github/workflows/harness-e2e.yml deleted file mode 100644 index 071608b..0000000 --- a/.github/workflows/harness-e2e.yml +++ /dev/null @@ -1,204 +0,0 @@ -name: harness E2E (long-lived test broker) - -# Issue #66 tier-2: end-to-end harness exercise against the long-lived -# test-broker.litentry.org infrastructure provisioned by -# scripts/provision-test-environment.sh. 
-# -# Gated on TEST_OIDC_AWS_ROLE_ARN being set as a repo secret — until the -# operator wires it (see docs/test-environment.md §3), the job is inert -# and surfaces as a no-op rather than failing. This keeps the workflow -# safe to merge before the parallel infra is up. -# -# Coverage delta vs. harness-ci.yml: -# * harness-ci.yml: ephemeral anvil + in-process broker + StubSts -# (no public TLS, no real AWS, no real SES) -# * harness-e2e.yml: real test-broker.litentry.org + real AWS test -# resources (test bucket, test role) + real Heima -# Paseo chain. Runs the full stage-3 per-actor + -# per-data-class PrincipalTag isolation suite -# that ephemeral CI can't reach. -# -# No LLM. No WebAuthn (passes the harness scripts in default stub mode). -# Schedule + workflow_dispatch only — never on every PR (this hits real -# AWS API calls + real chain RPC, so it's nightly-cadence). - -on: - schedule: - # Nightly at 06:00 UTC — well after the prior day's PR activity - # quiesces but before the operator's morning standup. - - cron: "0 6 * * *" - workflow_dispatch: - inputs: - stage: - description: "Which stage to run (1, 2, 3, or all)" - required: false - default: "all" - type: choice - options: ["1", "2", "3", "all"] - -# Prevent overlapping runs (each one consumes test AWS resources + chain RPC). -concurrency: - group: harness-e2e - cancel-in-progress: false # let in-flight nightly finish; queue manual runs - -# OIDC-only AWS auth via GitHub Actions — never long-lived secrets. 
-permissions: - id-token: write # required for aws-actions/configure-aws-credentials - contents: read - -jobs: - preflight: - name: gate on test infra availability - runs-on: ubuntu-latest - outputs: - should_run: ${{ steps.gate.outputs.should_run }} - steps: - - id: gate - run: | - if [ -n "${{ secrets.TEST_OIDC_AWS_ROLE_ARN }}" ]; then - echo "should_run=true" >> "$GITHUB_OUTPUT" - echo "test infra credentials present; proceeding" - else - echo "should_run=false" >> "$GITHUB_OUTPUT" - echo "::warning::TEST_OIDC_AWS_ROLE_ARN unset — skipping. See docs/test-environment.md." - fi - - harness-e2e: - name: harness/v2-stage*-demo.sh against test-broker - needs: preflight - if: needs.preflight.outputs.should_run == 'true' - runs-on: ubuntu-latest - timeout-minutes: 60 - - steps: - - uses: actions/checkout@v4 - with: - submodules: recursive - - - name: Install Rust toolchain - uses: dtolnay/rust-toolchain@stable - - - name: Cache cargo registry + target - uses: Swatinem/rust-cache@v2 - with: - shared-key: harness-e2e - - - name: Install Foundry - uses: foundry-rs/foundry-toolchain@v1 - with: - version: stable - - - name: Configure AWS credentials via OIDC (test role) - uses: aws-actions/configure-aws-credentials@v4 - with: - role-to-assume: ${{ secrets.TEST_OIDC_AWS_ROLE_ARN }} - aws-region: ${{ secrets.TEST_AWS_REGION || 'us-east-1' }} - # Session name shows up in CloudTrail — keep traceable to the - # PR / run for forensic walking. - role-session-name: gh-actions-${{ github.repository_id }}-${{ github.run_id }} - - - name: Build agentkeys CLI + workers - run: cargo build --release --workspace - - - name: Source test-environment env - # The harness scripts source scripts/operator-workstation.env by - # default. For the e2e run, overlay scripts/test-environment.env - # into that path so the entire harness flow reuses unchanged. - # The .example template is committed; the live file lives only - # in the runner's filesystem for the duration of the job. 
- run: | - cp scripts/test-environment.env.example scripts/operator-workstation.env - # Substitute repo secrets into the live env file. - { - echo "ACCOUNT_ID=${{ secrets.TEST_ACCOUNT_ID }}" - echo "REGION=${{ secrets.TEST_AWS_REGION || 'us-east-1' }}" - echo "BROKER_HOST=${{ secrets.TEST_BROKER_HOST || 'test-broker.litentry.org' }}" - echo "OIDC_ISSUER=https://${{ secrets.TEST_BROKER_HOST || 'test-broker.litentry.org' }}" - echo "VAULT_BUCKET=${{ secrets.TEST_VAULT_BUCKET }}" - echo "MEMORY_BUCKET=${{ secrets.TEST_MEMORY_BUCKET }}" - echo "VAULT_ROLE_ARN=${{ secrets.TEST_VAULT_ROLE_ARN }}" - echo "MEMORY_ROLE_ARN=${{ secrets.TEST_MEMORY_ROLE_ARN }}" - echo "DATA_ROLE_ARN=${{ secrets.TEST_DATA_ROLE_ARN }}" - # Per-run S3 prefix isolation — concurrent runs (manual + - # nightly) won't step on each other's writes; nightly - # cleanup s3 rm's keys older than 7d. - echo "CI_S3_PREFIX=ci/run-${{ github.run_id }}" - } >> scripts/operator-workstation.env - - - name: Stage 1 — chain + identity bootstrap - if: ${{ inputs.stage == 'all' || inputs.stage == '1' }} - # --skip-deploy: contracts are pre-deployed by - # scripts/provision-test-environment.sh on Heima-Paseo, and - # those addresses are baked into scripts/test-environment.env. - # --skip-email: e2e doesn't exercise the SES round-trip - # (separate workflow); identity bootstrap uses wallet_sig. - # No --webauthn: stub-mode (WEBAUTHN_MODE=0 default). 
- run: | - AGENTKEYS_CHAIN=heima-paseo \ - bash harness/v2-stage1-demo.sh --skip-deploy --skip-email - - - name: Stage 2 — multi-master + recovery (stub mode) - if: ${{ inputs.stage == 'all' || inputs.stage == '2' }} - run: | - AGENTKEYS_CHAIN=heima-paseo \ - bash harness/v2-stage2-demo.sh --stub --skip-build - - - name: Stage 3 — per-actor + per-data-class PrincipalTag isolation - if: ${{ inputs.stage == 'all' || inputs.stage == '3' }} - # The tier-2 capstone: stage-3 is the suite ephemeral CI can't - # run, since it requires AWS STS AssumeRoleWithWebIdentity, which - # in turn requires AWS to fetch the OIDC issuer's JWKS over - # public TLS. Now that we have test-broker.litentry.org with a - # real Let's Encrypt cert and real test IAM roles, all 11 steps - # of v2-stage3-demo.sh execute end-to-end. - run: | - AGENTKEYS_CHAIN=heima-paseo \ - bash harness/v2-stage3-demo.sh - - - name: Clean up per-run S3 prefix - if: always() - # Best-effort: tear down the per-run S3 prefix we wrote to. - # The nightly cleanup s3 rm catches any keys we missed. - run: | - PREFIX="ci/run-${{ github.run_id }}/" - for bucket in \ - "${{ secrets.TEST_VAULT_BUCKET }}" \ - "${{ secrets.TEST_MEMORY_BUCKET }}"; do - [ -n "$bucket" ] || continue - aws s3 rm "s3://$bucket/$PREFIX" --recursive || true - done - - nightly-prefix-cleanup: - # Sweep any per-run S3 prefixes older than 7 days from the test - # buckets. Cheap insurance against forgotten prefixes from cancelled - # runs; complements the per-job cleanup above. 
- name: cleanup stale CI prefixes - needs: preflight - if: needs.preflight.outputs.should_run == 'true' && github.event_name == 'schedule' - runs-on: ubuntu-latest - timeout-minutes: 10 - permissions: - id-token: write - contents: read - steps: - - name: Configure AWS credentials - uses: aws-actions/configure-aws-credentials@v4 - with: - role-to-assume: ${{ secrets.TEST_OIDC_AWS_ROLE_ARN }} - aws-region: ${{ secrets.TEST_AWS_REGION || 'us-east-1' }} - role-session-name: gh-actions-cleanup-${{ github.run_id }} - - - name: Sweep prefixes older than 7d - run: | - cutoff=$(date -u -d "7 days ago" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null \ - || date -u -v-7d +%Y-%m-%dT%H:%M:%SZ) - for bucket in \ - "${{ secrets.TEST_VAULT_BUCKET }}" \ - "${{ secrets.TEST_MEMORY_BUCKET }}"; do - [ -n "$bucket" ] || continue - aws s3api list-objects-v2 --bucket "$bucket" --prefix "ci/" \ - --query "Contents[?LastModified<\`$cutoff\`].Key" --output text \ - | tr '\t' '\n' | while read -r key; do - [ -n "$key" ] && aws s3 rm "s3://$bucket/$key" - done - done diff --git a/docs/test-environment.md b/docs/test-environment.md deleted file mode 100644 index ca3596b..0000000 --- a/docs/test-environment.md +++ /dev/null @@ -1,166 +0,0 @@ -# Test environment — AgentKeys (issue #66) - -**Audience:** the operator setting up CI for AgentKeys, plus contributors who need to debug a CI failure. -**Scope:** the parallel test infrastructure (broker, IAM roles, S3 buckets, deployer wallet, smart contracts) that exists alongside prod so CI can exercise the full code path without touching real user data. 
- -This is the operator-facing companion to: -- [`.github/workflows/harness-ci.yml`](../.github/workflows/harness-ci.yml) — the tier-1 ephemeral CI workflow (no external infra) -- [`.github/workflows/harness-e2e.yml`](../.github/workflows/harness-e2e.yml) — the tier-2 nightly E2E workflow against the long-lived test broker -- [`harness/ci-ephemeral-stack.sh`](../harness/ci-ephemeral-stack.sh) — the ephemeral stack driver tier-1 invokes -- [`scripts/provision-test-environment.sh`](../scripts/provision-test-environment.sh) — operator-run, one-shot provisioner for the tier-2 long-lived infra -- [`scripts/test-environment.env.example`](../scripts/test-environment.env.example) — env file template - -## Two-tier model - -Issue #66 calls for a CI that runs the harness scripts against a parallel test environment, never spends LLM tokens, and never invokes WebAuthn. There are two natural points to do that, and we ship both: - -| | Tier 1 — ephemeral | Tier 2 — long-lived | -|---|---|---| -| **Workflow** | `harness-ci.yml` | `harness-e2e.yml` | -| **Trigger** | every push + PR | nightly + manual dispatch | -| **Where** | inside a GitHub Actions runner | runs against `test-broker.litentry.org` | -| **Chain** | `anvil` (fresh per run, instant finality) | Heima-Paseo testnet (long-lived contracts) | -| **Deployer** | anvil's prefunded default test key (zero risk) | a separate Paseo wallet, funded by operator, persisted at `~/.agentkeys/heima-paseo-deployer-test.key` | -| **Contracts** | fresh deploy per run via Foundry | deployed once by `provision-test-environment.sh`, addresses pinned in `scripts/test-environment.env` | -| **Broker** | in-process spawn, OIDC issuer = `http://127.0.0.1:8091`, `StubSts` | real broker process on test EC2, OIDC issuer = `https://test-broker.litentry.org`, real AWS STS | -| **AWS** | none — broker boots with `--skip-startup-check`, no STS/S3 calls | real test bucket + real test role; AWS STS `AssumeRoleWithWebIdentity` works because the test 
broker exposes a public TLS-fronted JWKS endpoint | -| **WebAuthn** | never — harness defaults to `WEBAUTHN_MODE=0` stub mode | never — same default | -| **LLM** | never | never | -| **Wall time** | ~10–15 min | ~25–40 min | - -Tier 1 catches almost all regressions because the Rust integration tests (`cargo test --workspace`) already spawn an in-process broker with `StubSts` + `StubEmailSender` — those tests cover SIWE auth, OIDC mint, cap-token verification, multi-master, recovery, and per-data-class isolation logic. What tier 1 *can't* cover is the real-AWS path: stage 3's `AssumeRoleWithWebIdentity` requires AWS to fetch the issuer's JWKS over public TLS, which an ephemeral CI runner can't expose. That's the tier-2 capstone. - -## Tier 1 — ephemeral CI (no operator setup needed) - -Already wired. Every push to `main` or `evm`, plus every PR touching `crates/**` / `harness/**` / `scripts/**`, runs: - -1. `cargo fmt --check` -2. `cargo clippy --workspace --all-targets -- -D warnings` -3. `cargo test --workspace -- --test-threads=1` -4. `bash harness/ci-ephemeral-stack.sh`, which: - - Starts a fresh `anvil` on port 8545 (new chain, instant finality) - - Runs `forge build && forge test` in `crates/agentkeys-chain/` - - Runs `forge script DeployAgentKeysV1.s.sol` to deploy all 6 contracts to the ephemeral anvil - - Parses the deployed addresses and writes a synthetic `operator-workstation.env` - - Runs `scripts/verify-heima-contracts.sh` against the new addresses (read-only ABI + wiring checks) - - Starts `mock-server` + `agentkeys-broker-server` (with `--skip-startup-check`, OIDC issuer = `http://127.0.0.1:8091`) - - Probes `/healthz`, `/.well-known/openid-configuration`, `/.well-known/jwks.json` - -On failure, the script's EXIT trap preserves all logs (`anvil.log`, `forge-deploy.log`, `broker.log`, etc.) and the workflow uploads them as a `ephemeral-stack-logs` artifact. 
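
The discovery probe in the last step above can be sketched as follows. This is a minimal illustration, not the code in `harness/ci-ephemeral-stack.sh`: the `check_discovery` helper is hypothetical, and it assumes the discovery document is a small JSON object whose `issuer` value contains no whitespace. It checks that the issuer equals the expected URL exactly (AWS compares OIDC issuer URLs byte-for-byte, so a stray trailing slash is a failure) and that a `jwks_uri` is advertised.

```shell
# Hypothetical helper, not part of the repo.
# Reads an OIDC discovery document on stdin; succeeds only if the
# "issuer" equals $1 exactly and a "jwks_uri" key is present.
check_discovery() {
  expected_issuer="$1"
  # Strip whitespace so '"issuer": "x"' and '"issuer":"x"' compare equal.
  doc="$(tr -d ' \t\r\n')"
  case "$doc" in
    *"\"issuer\":\"${expected_issuer}\""*) ;;
    *) echo "issuer mismatch (expected ${expected_issuer})" >&2; return 1 ;;
  esac
  case "$doc" in
    *'"jwks_uri":"'*) ;;
    *) echo "jwks_uri missing" >&2; return 1 ;;
  esac
}
```

Against the ephemeral broker this might be invoked as `curl -fsS http://127.0.0.1:8091/.well-known/openid-configuration | check_discovery http://127.0.0.1:8091`, assuming the default `BROKER_PORT`.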
- -## Tier 2 — long-lived test broker - -### Operator bring-up (~2 hours, one-shot) - -```bash -awsp agentkeys-admin # AWS admin profile for the account hosting test infra -bash scripts/provision-test-environment.sh -``` - -This walks through 7 steps: - -1. **Provision the EC2 broker host** at `test-broker.litentry.org`. Manual step (the runbook fragment in the script tells you exactly what to do on the target EC2). -2. **Register the AWS IAM OIDC provider** for `test-broker.litentry.org` (separate ARN from prod's `oidc-provider/broker.litentry.org`). -3. **Provision IAM roles** `agentkeys-data-role-test`, `agentkeys-vault-role-test`, `agentkeys-memory-role-test`, each trust-policied on the test OIDC provider with the same `PrincipalTag/agentkeys_actor_omni` scoping prod uses. -4. **Provision S3 buckets** `agentkeys-mail-test-${ACCT}`, `agentkeys-vault-test-${ACCT}`, `agentkeys-memory-test-${ACCT}` with block-public-access + default SSE-S3 + the v3 split-statement PrincipalTag bucket policy. -5. **Generate a new deployer wallet** (distinct from the prod deployer) at `~/.agentkeys/heima-paseo-deployer-test.key`. You fund it from your personal Paseo wallet (Paseo has sudo so Alice can also fund — see `scripts/heima-bring-up.sh`). -6. **Deploy fresh v2 stage-1 contracts** to Heima-Paseo via `DeployAgentKeysV1.s.sol`. Records the addresses under `*_HEIMA_PASEO` keys in `scripts/test-environment.env`. -7. **Provision a GitHub Actions OIDC role** (`github-actions-agentkeys-e2e`) trust-policied on `token.actions.githubusercontent.com` with a condition limiting it to the agentkeys repo. Grant it `sts:AssumeRole` on the three test roles + read-only S3 on the three test buckets. - -Some steps are still operator-manual (parameterizing `provision-vault-role.sh` to accept a `SUFFIX=` env var is a TODO; until then, copy the prod scripts as `-test` variants by hand). The script logs these as `skip` with a follow-up TODO instead of silently passing. 
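
The trust policy described in step 7 can be sketched as below. This is an illustration under stated assumptions, not the provisioner's actual output: `github_oidc_trust_policy` is a hypothetical helper, the account ID and `owner/repo` values are placeholders, and the `sub` pattern shown admits any ref in the repository (a tighter pattern such as `repo:owner/repo:ref:refs/heads/main` would pin a single branch).

```shell
# Hypothetical helper, not part of scripts/provision-test-environment.sh.
# Prints an IAM trust policy that lets GitHub Actions workflows from
# one repository assume a role via OIDC.
#   $1 = AWS account ID, $2 = GitHub "owner/repo"
github_oidc_trust_policy() {
  account_id="$1"
  repo="$2"
  provider="token.actions.githubusercontent.com"
  printf '{\n'
  printf '  "Version": "2012-10-17",\n'
  printf '  "Statement": [{\n'
  printf '    "Effect": "Allow",\n'
  printf '    "Principal": {"Federated": "arn:aws:iam::%s:oidc-provider/%s"},\n' "$account_id" "$provider"
  printf '    "Action": "sts:AssumeRoleWithWebIdentity",\n'
  printf '    "Condition": {\n'
  printf '      "StringEquals": {"%s:aud": "sts.amazonaws.com"},\n' "$provider"
  printf '      "StringLike": {"%s:sub": "repo:%s:*"}\n' "$provider" "$repo"
  printf '    }\n'
  printf '  }]\n'
  printf '}\n'
}
```

The output could be written to a file and passed to `aws iam create-role --role-name github-actions-agentkeys-e2e --assume-role-policy-document file://trust.json`; the permissions policy (`sts:AssumeRole` on the test roles, read-only S3) is attached separately.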
- -### Repo secrets to set (after provisioning) - -After the provisioner finishes, set these in **Settings → Secrets and variables → Actions**: - -| Secret | Value | -|---|---| -| `TEST_OIDC_AWS_ROLE_ARN` | `arn:aws:iam::${ACCT}:role/github-actions-agentkeys-e2e` | -| `TEST_AWS_REGION` | `us-east-1` (or wherever the test broker lives) | -| `TEST_ACCOUNT_ID` | `${ACCT}` | -| `TEST_BROKER_HOST` | `test-broker.litentry.org` | -| `TEST_VAULT_BUCKET` | `agentkeys-vault-test-${ACCT}` | -| `TEST_MEMORY_BUCKET` | `agentkeys-memory-test-${ACCT}` | -| `TEST_VAULT_ROLE_ARN` | `arn:aws:iam::${ACCT}:role/agentkeys-vault-role-test` | -| `TEST_MEMORY_ROLE_ARN` | `arn:aws:iam::${ACCT}:role/agentkeys-memory-role-test` | -| `TEST_DATA_ROLE_ARN` | `arn:aws:iam::${ACCT}:role/agentkeys-data-role-test` | - -`TEST_OIDC_AWS_ROLE_ARN` is the **gate**: until it's set, the `harness-e2e.yml` preflight job sets `should_run=false` and the workflow surfaces as a `::warning::` skip rather than a failure. This keeps the workflow safe to merge before the parallel infra is up. - -### Per-run S3 prefix namespacing - -The e2e workflow exports `CI_S3_PREFIX=ci/run-${GITHUB_RUN_ID}` and the harness scripts honor that prefix when writing test envelopes to S3. This means concurrent runs (nightly + a manual dispatch) won't step on each other's writes. - -Cleanup is two-layered: -- **Per-job cleanup**: the e2e workflow's `if: always()` step runs `aws s3 rm s3://$bucket/$PREFIX --recursive` at the end of each run. -- **Nightly sweep**: a separate `nightly-prefix-cleanup` job lists `ci/` prefix keys older than 7 days and rm's them. Cheap insurance against forgotten prefixes from cancelled runs. - -### Cert renewal monitoring - -`test-broker.litentry.org` uses Let's Encrypt (auto-renewed every 90d by certbot). If renewal silently fails, AWS STS stops trusting the OIDC issuer and the e2e workflow turns red overnight. 
- -The nightly workflow's preflight already exercises a `curl` against `https://${TEST_BROKER_HOST}/.well-known/openid-configuration`. A renewal failure surfaces as an immediate workflow failure with a clear TLS error. - -### Rotating the test broker secrets - -If the test mock-server's `DEV_KEY_SERVICE_MASTER_SECRET` ever leaks, rotate via: - -```bash -# 1. New secret on the broker host -ssh ec2-user@test-broker.litentry.org \ - 'sudo systemctl set-environment DEV_KEY_SERVICE_MASTER_SECRET=$(openssl rand -hex 32) \ - && sudo systemctl restart agentkeys-backend' - -# 2. There's nothing on the operator side to rotate — the secret never -# leaves the broker host (it derives per-omni signer keys in-process). -``` - -Test wallets minted via the rotated signer will have different addresses from pre-rotation wallets, which is the desired blast-radius cut. - -## Cleanup / teardown - -Tear down the entire test environment (cheap insurance if costs spike): - -```bash -# Drain the buckets first -for bucket in agentkeys-mail-test-${ACCT} agentkeys-vault-test-${ACCT} agentkeys-memory-test-${ACCT}; do - aws s3 rm "s3://$bucket" --recursive - aws s3api delete-bucket --bucket "$bucket" -done - -# Delete the roles (detach policies first) -for role in agentkeys-data-role-test agentkeys-vault-role-test agentkeys-memory-role-test github-actions-agentkeys-e2e; do - for policy in $(aws iam list-role-policies --role-name "$role" --query 'PolicyNames[]' --output text); do - aws iam delete-role-policy --role-name "$role" --policy-name "$policy" - done - aws iam delete-role --role-name "$role" -done - -# Delete the OIDC provider -aws iam delete-open-id-connect-provider \ - --open-id-connect-provider-arn arn:aws:iam::${ACCT}:oidc-provider/test-broker.litentry.org - -# Stop + terminate the EC2 + release the EIP (manual, console or aws ec2 CLI) -``` - -The contracts on Heima-Paseo stay on chain (they're free), but they're inert without the broker pointing at them. - -## Why two tiers (vs. 
just one) - -A single-tier model — running everything against the long-lived broker on every PR — was the obvious shape, but loses on: - -- **Latency**: every PR pays the ~30 min e2e wall time (vs. ~10 min for tier 1). -- **Cost**: every PR hits real AWS API calls + chain RPC + potentially gas. -- **Contention**: concurrent PRs serialize on the single test broker, or step on each other's S3 writes without per-run prefix isolation. -- **Brittleness**: a flaky external dep (Paseo collator hiccup, AWS API throttle) blocks merges. - -A single-tier model the other way — only ephemeral CI, no long-lived test broker — was also tempting, but loses stage-3 coverage entirely (`AssumeRoleWithWebIdentity` needs publicly-fetchable JWKS). That's the most security-critical layer in the codebase (per-actor + per-data-class IAM isolation per CLAUDE.md "Per-actor + per-data-class isolation invariants"), so leaving it untested in CI was unacceptable. - -The two-tier split puts the fast, cheap, deterministic checks on every PR and the expensive E2E on nightly. PRs that need to verify a stage-3 fix can trigger `harness-e2e.yml` via `workflow_dispatch` directly from the PR page. - -## Related - -- Original issue: [#66 — Stage 7: shared test broker for CI + dev](https://github.com/wildmeta-agent/agentKeys/issues/66) -- Prod cloud setup: [`docs/cloud-setup.md`](cloud-setup.md) -- Stage 7 demo + verification: [`docs/stage7-demo-and-verification.md`](stage7-demo-and-verification.md) -- Architecture: [`docs/spec/architecture.md`](spec/architecture.md) §17 (per-data-class buckets), §4 (HDKD actor tree), CLAUDE.md "Per-actor + per-data-class isolation invariants" table diff --git a/harness/ci-ephemeral-stack.sh b/harness/ci-ephemeral-stack.sh deleted file mode 100755 index 8d9ffa3..0000000 --- a/harness/ci-ephemeral-stack.sh +++ /dev/null @@ -1,401 +0,0 @@ -#!/usr/bin/env bash -# harness/ci-ephemeral-stack.sh — issue #66 tier-1 ephemeral CI driver. 
-# -# Stands up a complete, isolated AgentKeys test environment INSIDE a -# single CI runner and exercises the chain-deploy path end-to-end. No -# external infrastructure, no LLM, no WebAuthn, no real AWS. -# -# What this script delivers (the four parallel-infra axes from issue #66): -# -# ─ new test broker server → ephemeral agentkeys-broker-server -# spawned on 127.0.0.1, OIDC issuer -# http://127.0.0.1:$BROKER_PORT, stub -# STS client (no real AWS). -# ─ new smart contract on-chain → forge script deploys a fresh copy of -# the v2 stage-1 contract set -# (P256Verifier + K11Verifier + -# SidecarRegistry + AgentKeysScope + -# K3EpochCounter + CredentialAudit) -# to a brand-new anvil instance. -# ─ new deployer account → anvil's canonical first prefunded test -# key (10_000 ETH; zero risk). -# ─ no WebAuthn → the harness scripts default to -# WEBAUTHN_MODE=0 (stage-1 line 131); -# this script never passes --webauthn, -# so K11 enrollment writes deterministic -# stub bytes (CI-friendly). -# -# What's COVERED by this script (matches the harness scripts' coverage -# for things that don't require real AWS): -# -# * Forge unit + property tests for all six v2 stage-1 contracts. -# * End-to-end Foundry deploy via DeployAgentKeysV1.s.sol against the -# ephemeral anvil — same script as heima-bring-up.sh step 5 uses -# against Heima Mainnet/Paseo. -# * Read-only ABI/wiring checks via verify-heima-contracts.sh against -# the freshly deployed addresses (same checks Heima uses). -# * Broker liveness + OIDC discovery surface (/.well-known/ -# openid-configuration, /.well-known/jwks.json, /healthz). -# -# What's NOT covered here (intentionally — needs the long-lived -# test-broker.litentry.org tier-2 environment with publicly-reachable -# TLS + real AWS resources; see docs/test-environment.md): -# -# * harness/v2-stage3-demo.sh — per-actor + per-data-class S3 -# PrincipalTag isolation tests. 
AWS STS AssumeRoleWithWebIdentity -# requires AWS to fetch the OIDC issuer's JWKS over public TLS, -# which a CI runner can't expose. -# * Real SES email-link auth round-trip (uses StubEmailSender in unit -# tests; long-lived tier-2 exercises real SES). -# -# All the Rust-side broker/worker logic (SIWE auth, OIDC mint, cap-token -# verify, etc.) is covered by `cargo test --workspace` in the parent -# CI workflow — those tests already spawn an in-process broker with -# StubSts + StubEmailSender, so the ephemeral-stack script focuses on -# what cargo test can't reach: the on-chain deploy + ABI surface. -# -# Usage: -# bash harness/ci-ephemeral-stack.sh # full ephemeral roundtrip -# bash harness/ci-ephemeral-stack.sh --skip-broker # chain-only (forge + anvil) -# bash harness/ci-ephemeral-stack.sh --keep-running # leave anvil + broker up -# # (for local debugging) -# -# Exit codes: -# 0 every check passed -# 1 any check failed; logs in $WORK_DIR/*.log preserved on failure -# 2 prereqs missing (anvil/forge/cargo) - -set -euo pipefail - -REPO_ROOT="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")/.." 
&& pwd)" -cd "$REPO_ROOT" - -# ─── CLI ───────────────────────────────────────────────────────────────── -SKIP_BROKER=0 -KEEP_RUNNING=0 -ANVIL_PORT="${ANVIL_PORT:-8545}" -MOCK_PORT="${MOCK_PORT:-8090}" -BROKER_PORT="${BROKER_PORT:-8091}" - -while [[ $# -gt 0 ]]; do - case "$1" in - --skip-broker) SKIP_BROKER=1; shift ;; - --keep-running) KEEP_RUNNING=1; shift ;; - --anvil-port) ANVIL_PORT="$2"; shift 2 ;; - --mock-port) MOCK_PORT="$2"; shift 2 ;; - --broker-port) BROKER_PORT="$2"; shift 2 ;; - -h|--help) - sed -n '2,/^set -euo/p' "$0" | sed 's/^# \?//' | sed '$d' - exit 0 ;; - *) echo "unknown flag: $1 (try --help)" >&2; exit 2 ;; - esac -done - -# ─── Colors ────────────────────────────────────────────────────────────── -if [ -t 2 ]; then - C_HEAD='\033[1;36m'; C_OK='\033[1;32m'; C_WARN='\033[1;33m' - C_ERR='\033[1;31m'; C_DIM='\033[2m'; C_RESET='\033[0m' -else - C_HEAD=''; C_OK=''; C_WARN=''; C_ERR=''; C_DIM=''; C_RESET='' -fi -log() { printf "${C_HEAD}==>${C_RESET} %s\n" "$*" >&2; } -ok() { printf " ${C_OK}ok${C_RESET} %s\n" "$*" >&2; } -info() { printf " ${C_DIM}info${C_RESET} %s\n" "$*" >&2; } -warn() { printf " ${C_WARN}warn${C_RESET} %s\n" "$*" >&2; } -die() { printf " ${C_ERR}fail${C_RESET} %s\n" "$*" >&2; exit 1; } - -# ─── Work dir + cleanup trap ───────────────────────────────────────────── -WORK_DIR="$(mktemp -d -t agentkeys-ci-ephemeral-XXXXXX)" -ANVIL_PID="" -MOCK_PID="" -BROKER_PID="" - -cleanup() { - local rc=$? 
- if [ "$KEEP_RUNNING" = "1" ]; then - info "--keep-running set; leaving processes up" - info " anvil: pid=$ANVIL_PID port=$ANVIL_PORT" - [ -n "$MOCK_PID" ] && info " mock: pid=$MOCK_PID port=$MOCK_PORT" - [ -n "$BROKER_PID" ] && info " broker: pid=$BROKER_PID port=$BROKER_PORT" - info " work_dir: $WORK_DIR" - exit "$rc" - fi - log "Cleanup" - for pid_var in BROKER_PID MOCK_PID ANVIL_PID; do - eval "pid=\${$pid_var:-}" - if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then - kill "$pid" 2>/dev/null || true - wait "$pid" 2>/dev/null || true - ok "stopped $pid_var pid=$pid" - fi - done - if [ "$rc" -ne 0 ]; then - warn "exit=$rc — preserving logs at $WORK_DIR" - for f in "$WORK_DIR"/*.log; do - [ -e "$f" ] || continue - printf "\n${C_DIM}── tail $f ──${C_RESET}\n" >&2 - tail -n 50 "$f" >&2 || true - done - else - rm -rf "$WORK_DIR" - fi -} -trap cleanup EXIT INT TERM - -# ─── 1. Prereq sanity-check ────────────────────────────────────────────── -log "1/8 Prereq sanity-check" -missing=() -for tool in cargo jq curl awk grep sed anvil forge cast; do - command -v "$tool" >/dev/null 2>&1 || missing+=("$tool") -done -if [ ${#missing[@]} -gt 0 ]; then - warn "missing tools: ${missing[*]}" - warn " install Foundry: curl -L https://foundry.paradigm.xyz | bash && foundryup" - warn " install Rust: curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh" - die "prereqs missing" -fi -ok "tools present: cargo jq curl awk grep sed anvil forge cast" - -# ─── 2. Start anvil (new chain) ────────────────────────────────────────── -log "2/8 Starting anvil on 127.0.0.1:$ANVIL_PORT (new ephemeral chain)" -# Anvil's first default account: pre-funded with 10_000 ETH, deterministic. -# This is our "new deployer account" — fresh per CI run, zero blast radius. 
-ANVIL_DEPLOYER_KEY="0xac0974bec39a17e36ba4a6b4d238ff944bacb478cbed5efcae784d7bf4f2ff80" -ANVIL_DEPLOYER_ADDR="0xf39Fd6e51aad88F6F4ce6aB8827279cffFb92266" -anvil --port "$ANVIL_PORT" \ - --host 127.0.0.1 \ - --silent \ - > "$WORK_DIR/anvil.log" 2>&1 & -ANVIL_PID=$! -# Wait for RPC ready (anvil bootstraps fast — <2s typically, give it 30s) -for _ in $(seq 1 60); do - if curl -sf --max-time 1 \ - -H 'Content-Type: application/json' \ - -d '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}' \ - "http://127.0.0.1:$ANVIL_PORT" >/dev/null 2>&1; then - break - fi - sleep 0.5 -done -curl -sf --max-time 2 \ - -H 'Content-Type: application/json' \ - -d '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}' \ - "http://127.0.0.1:$ANVIL_PORT" >/dev/null \ - || die "anvil failed to come up; see $WORK_DIR/anvil.log" -ok "anvil up (pid=$ANVIL_PID chain_id=31337 deployer=$ANVIL_DEPLOYER_ADDR)" - -# ─── 3. Forge build + test (contract unit + property tests) ────────────── -log "3/8 Forge build + test (crates/agentkeys-chain/)" -( - cd crates/agentkeys-chain - forge build > "$WORK_DIR/forge-build.log" 2>&1 \ - || die "forge build failed; see $WORK_DIR/forge-build.log" - ok "forge build clean" - forge test --no-match-test "fork_" > "$WORK_DIR/forge-test.log" 2>&1 \ - || die "forge test failed; see $WORK_DIR/forge-test.log" - # grep -c already prints 0 when nothing matches (and exits 1), so only - # swallow the exit status; "|| echo 0" would print a second 0. - ok "forge test passed ($(grep -c "^\[PASS\]" "$WORK_DIR/forge-test.log" || true) tests)" -) - -# ─── 4.
Deploy v2 stage-1 contract set (new smart contracts on-chain) ──── -log "4/8 Deploy v2 stage-1 contracts via DeployAgentKeysV1.s.sol" -( - cd crates/agentkeys-chain - forge script script/DeployAgentKeysV1.s.sol \ - --rpc-url "http://127.0.0.1:$ANVIL_PORT" \ - --private-key "$ANVIL_DEPLOYER_KEY" \ - --broadcast \ - --skip-simulation \ - > "$WORK_DIR/forge-deploy.log" 2>&1 \ - || die "forge script deploy failed; see $WORK_DIR/forge-deploy.log" -) -# Parse "Name: 0xAddress" lines (the contract names from DeployAgentKeysV1.s.sol's -# console.log calls). Format matches heima-bring-up.sh's parser. -parse_addr() { - local name="$1" - awk -v want="$name" ' - $0 ~ want":" { - for (i=1; i<=NF; i++) if ($i ~ /^0x[a-fA-F0-9]{40}$/) { print $i; exit } - } - ' "$WORK_DIR/forge-deploy.log" -} -SCOPE_ADDR=$(parse_addr "AgentKeysScope") -REGISTRY_ADDR=$(parse_addr "SidecarRegistry") -EPOCH_ADDR=$(parse_addr "K3EpochCounter") -AUDIT_ADDR=$(parse_addr "CredentialAudit") -P256_ADDR=$(parse_addr "P256Verifier") -K11_ADDR=$(parse_addr "K11Verifier") -for v in SCOPE_ADDR REGISTRY_ADDR EPOCH_ADDR AUDIT_ADDR P256_ADDR K11_ADDR; do - eval "val=\${$v}" - [ -n "$val" ] || die "could not parse $v from forge-deploy.log" -done -ok "AgentKeysScope: $SCOPE_ADDR" -ok "SidecarRegistry: $REGISTRY_ADDR" -ok "K3EpochCounter: $EPOCH_ADDR" -ok "CredentialAudit: $AUDIT_ADDR" -ok "P256Verifier: $P256_ADDR" -ok "K11Verifier: $K11_ADDR" - -# ─── 5. Write synthetic operator-workstation.env for verify scripts ────── -log "5/8 Write synthetic operator-workstation.env (--anvil profile)" -SYNTH_ENV="$WORK_DIR/operator-workstation.env" -cat > "$SYNTH_ENV" < "$WORK_DIR/verify-contracts.log" 2>&1 || verify_rc=$? -restore_env -if [ "$verify_rc" -ne 0 ]; then - warn "verify-heima-contracts.sh exited $verify_rc; full log:" - cat "$WORK_DIR/verify-contracts.log" >&2 - die "contract verification failed" -fi -ok "all six v2 stage-1 contracts verified (bytecode + ABI + wiring)" - -# ─── 7. 
Optional: stand up the broker server (runs by default; --skip-broker skips it) ──────── -if [ "$SKIP_BROKER" = "1" ]; then - log "7/8 Broker bring-up SKIPPED (--skip-broker)" -else - log "7/8 Stand up ephemeral broker (new test broker server)" - - # Pre-generate keypairs so the broker boots clean. The keygen - # subcommand writes 0600 files; matches the production setup-broker-host - # flow but in $WORK_DIR instead of /var/lib/agentkeys. - BROKER_DATA_DIR="$WORK_DIR/broker-data" - mkdir -p "$BROKER_DATA_DIR" - info "building agentkeys-broker-server (release)" - cargo build --release -p agentkeys-broker-server \ - > "$WORK_DIR/cargo-build-broker.log" 2>&1 \ - || die "cargo build broker failed; see $WORK_DIR/cargo-build-broker.log" - BROKER_BIN="$REPO_ROOT/target/release/agentkeys-broker-server" - [ -x "$BROKER_BIN" ] || die "broker binary missing at $BROKER_BIN" - - "$BROKER_BIN" keygen --purpose oidc \ - --out "$BROKER_DATA_DIR/oidc-keypair.json" >/dev/null - "$BROKER_BIN" keygen --purpose session \ - --out "$BROKER_DATA_DIR/session-keypair.json" >/dev/null - ok "broker keypairs generated" - - info "building agentkeys-mock-server (release)" - cargo build --release -p agentkeys-mock-server \ - > "$WORK_DIR/cargo-build-mock.log" 2>&1 \ - || die "cargo build mock-server failed; see $WORK_DIR/cargo-build-mock.log" - MOCK_BIN="$REPO_ROOT/target/release/agentkeys-mock-server" - [ -x "$MOCK_BIN" ] || die "mock-server binary missing at $MOCK_BIN" - - info "starting mock-server on 127.0.0.1:$MOCK_PORT" - "$MOCK_BIN" --port "$MOCK_PORT" \ - > "$WORK_DIR/mock-server.log" 2>&1 & - MOCK_PID=$!
- for _ in $(seq 1 60); do - curl -sf --max-time 1 "http://127.0.0.1:$MOCK_PORT/healthz" >/dev/null 2>&1 && break - sleep 0.25 - done - curl -sf --max-time 2 "http://127.0.0.1:$MOCK_PORT/healthz" >/dev/null \ - || die "mock-server failed to come up; see $WORK_DIR/mock-server.log" - ok "mock-server up (pid=$MOCK_PID)" - - info "starting broker on 127.0.0.1:$BROKER_PORT (--skip-startup-check)" - # No real AWS creds in CI — broker runs OIDC-only mint path per issue #71, - # so the only thing AWS would do is the optional GetCallerIdentity probe, - # which --skip-startup-check disables. - BROKER_OIDC_ISSUER="http://127.0.0.1:$BROKER_PORT" \ - BROKER_BACKEND_URL="http://127.0.0.1:$MOCK_PORT" \ - BROKER_DATA_ROLE_ARN="arn:aws:iam::000000000000:role/agentkeys-data-role-ci" \ - BROKER_AWS_REGION="us-east-1" \ - BROKER_OIDC_KEYPAIR_PATH="$BROKER_DATA_DIR/oidc-keypair.json" \ - BROKER_SESSION_KEYPAIR_PATH="$BROKER_DATA_DIR/session-keypair.json" \ - BROKER_AUDIT_DB_PATH="$BROKER_DATA_DIR/audit.sqlite" \ - RUST_LOG=info \ - "$BROKER_BIN" --bind 127.0.0.1 --port "$BROKER_PORT" --skip-startup-check \ - > "$WORK_DIR/broker.log" 2>&1 & - BROKER_PID=$! - for _ in $(seq 1 60); do - curl -sf --max-time 1 "http://127.0.0.1:$BROKER_PORT/healthz" >/dev/null 2>&1 && break - sleep 0.25 - done - curl -sf --max-time 2 "http://127.0.0.1:$BROKER_PORT/healthz" >/dev/null \ - || die "broker failed to come up; see $WORK_DIR/broker.log" - ok "broker up (pid=$BROKER_PID)" - - # OIDC discovery surface — same endpoints AWS would hit in tier-2. 
- info "probing OIDC discovery surface" - curl -sf --max-time 2 \ - "http://127.0.0.1:$BROKER_PORT/.well-known/openid-configuration" \ - > "$WORK_DIR/oidc-config.json" \ - || die "openid-configuration unreachable" - jq -e '.issuer == "http://127.0.0.1:'"$BROKER_PORT"'"' \ - "$WORK_DIR/oidc-config.json" >/dev/null \ - || die "openid-configuration issuer claim mismatch (see $WORK_DIR/oidc-config.json)" - ok ".well-known/openid-configuration → issuer matches" - - curl -sf --max-time 2 \ - "http://127.0.0.1:$BROKER_PORT/.well-known/jwks.json" \ - > "$WORK_DIR/jwks.json" \ - || die "jwks.json unreachable" - jq -e '.keys | length >= 1' "$WORK_DIR/jwks.json" >/dev/null \ - || die "jwks.json has no keys (see $WORK_DIR/jwks.json)" - ok ".well-known/jwks.json → at least one key present" -fi - -# ─── 8. Summary ────────────────────────────────────────────────────────── -log "8/8 Summary" -ok "ephemeral environment passed all checks" -info " chain : anvil (chain_id 31337, ephemeral)" -info " deployer : $ANVIL_DEPLOYER_ADDR" -info " contracts : 6/6 deployed + verified on chain" -if [ "$SKIP_BROKER" != "1" ]; then - info " broker : http://127.0.0.1:$BROKER_PORT" - info " oidc issuer: http://127.0.0.1:$BROKER_PORT" - info " backend : http://127.0.0.1:$MOCK_PORT (mock-server)" -fi -info "" -info "Not covered here (needs long-lived test-broker.litentry.org —" -info "see docs/test-environment.md):" -info " * stage-3 per-actor + per-data-class S3 PrincipalTag isolation" -info " * real AWS STS AssumeRoleWithWebIdentity" -info " * real SES email-link auth round-trip" diff --git a/scripts/provision-test-environment.sh b/scripts/provision-test-environment.sh deleted file mode 100755 index 6155522..0000000 --- a/scripts/provision-test-environment.sh +++ /dev/null @@ -1,276 +0,0 @@ -#!/usr/bin/env bash -# scripts/provision-test-environment.sh — issue #66 tier-2 one-shot -# provisioner for the long-lived parallel test environment. 
-# -# What this script provisions (every resource parallel to prod, every -# name carrying a -test suffix so misconfigured CI runs targeting prod -# fail closed): -# -# 1. AWS IAM OIDC provider for test-broker.litentry.org -# 2. AWS IAM roles: -# - agentkeys-data-role-test (email subsystem) -# - agentkeys-vault-role-test (credentials, scoped to vault bucket) -# - agentkeys-memory-role-test (long-term memory, scoped to memory bucket) -# All three trust-policied on the test OIDC provider, with the same -# PrincipalTag/agentkeys_actor_omni scoping that prod uses. -# 3. AWS S3 buckets (per-data-class, per arch.md §17.2): -# - agentkeys-mail-test-${ACCT} -# - agentkeys-vault-test-${ACCT} -# - agentkeys-memory-test-${ACCT} -# Each with block-public-access + default SSE-S3 + the v3 -# split-statement PrincipalTag bucket policy from prod -# (scripts/apply-vault-bucket-policy.sh + apply-memory-bucket-policy.sh). -# 4. EC2 broker host at test-broker.litentry.org via: -# bash scripts/setup-broker-host.sh \ -# --issuer-url https://test-broker.litentry.org \ -# --account-id ${ACCOUNT_ID} \ -# --signer-host signer-test.litentry.org \ -# --audit-host audit-test.litentry.org \ -# --email-host email-test.litentry.org \ -# --cred-host cred-test.litentry.org \ -# --memory-host memory-test.litentry.org \ -# --chain-rpc https://rpc.paseo-parachain.heima.network \ -# --vault-bucket agentkeys-vault-test-${ACCOUNT_ID} \ -# --memory-bucket agentkeys-memory-test-${ACCOUNT_ID} -# 5. A new deployer wallet on Heima-Paseo (distinct from the prod -# deployer), persisted at ~/.agentkeys/heima-paseo-deployer-test.key. -# Funded from the operator's personal Paseo wallet (no sudo on -# mainnet; sudo is fine on Paseo via Alice if collators are up). -# 6. Fresh v2 stage-1 contracts deployed via DeployAgentKeysV1.s.sol -# to Heima-Paseo, distinct addresses from prod, written to -# scripts/test-environment.env under the *_HEIMA_PASEO keys. -# -# Idempotent: re-run safely. 
Each step pre-checks "is this already done?" -# before acting. Failed runs leave a paper trail in $WORK_DIR. -# -# This is the OPERATOR script — runs once per account. The CI workflow -# (.github/workflows/harness-e2e.yml) consumes the provisioned env via -# GitHub Actions secrets + scripts/test-environment.env. -# -# Usage: -# awsp agentkeys-admin # admin profile required -# bash scripts/provision-test-environment.sh # full provisioning -# bash scripts/provision-test-environment.sh --dry-run -# bash scripts/provision-test-environment.sh --only-step N -# -# Per CLAUDE.md, this script is the SINGLE ENTRY POINT for test-env -# changes. No ad-hoc aws iam / aws s3api edits — extend this script -# instead and re-run. - -set -euo pipefail - -REPO_ROOT="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")/.." && pwd)" -cd "$REPO_ROOT" - -# ─── Config defaults ───────────────────────────────────────────────────── -DRY_RUN=0 -ONLY_STEP="" -TEST_BROKER_HOST="${TEST_BROKER_HOST:-test-broker.litentry.org}" -TEST_SIGNER_HOST="${TEST_SIGNER_HOST:-signer-test.litentry.org}" -TEST_AUDIT_HOST="${TEST_AUDIT_HOST:-audit-test.litentry.org}" -TEST_EMAIL_HOST="${TEST_EMAIL_HOST:-email-test.litentry.org}" -TEST_CRED_HOST="${TEST_CRED_HOST:-cred-test.litentry.org}" -TEST_MEMORY_HOST="${TEST_MEMORY_HOST:-memory-test.litentry.org}" -TEST_ENV_FILE="$REPO_ROOT/scripts/test-environment.env" -TEST_ENV_EXAMPLE="$REPO_ROOT/scripts/test-environment.env.example" -WORK_DIR="$(mktemp -d -t agentkeys-provision-test-XXXXXX)" - -while [[ $# -gt 0 ]]; do - case "$1" in - --dry-run) DRY_RUN=1; shift ;; - --only-step) ONLY_STEP="$2"; shift 2 ;; - --test-broker-host) TEST_BROKER_HOST="$2"; shift 2 ;; - -h|--help) - sed -n '2,/^set -euo/p' "$0" | sed 's/^# \?//' | sed '$d' - exit 0 ;; - *) echo "unknown flag: $1 (try --help)" >&2; exit 2 ;; - esac -done - -# ─── Colors ────────────────────────────────────────────────────────────── -if [ -t 2 ]; then - C_HEAD='\033[1;36m'; C_OK='\033[1;32m'; C_SKIP='\033[1;33m' - 
C_WARN='\033[1;33m'; C_ERR='\033[1;31m'; C_RESET='\033[0m' -else - C_HEAD=''; C_OK=''; C_SKIP=''; C_WARN=''; C_ERR=''; C_RESET='' -fi -log() { printf "${C_HEAD}==>${C_RESET} %s\n" "$*" >&2; } -ok() { printf " ${C_OK}ok${C_RESET} %s\n" "$*" >&2; } -skip() { printf " ${C_SKIP}skip${C_RESET} %s\n" "$*" >&2; } -warn() { printf " ${C_WARN}warn${C_RESET} %s\n" "$*" >&2; } -die() { printf " ${C_ERR}fail${C_RESET} %s\n" "$*" >&2; exit 1; } - -should_run_step() { - [ -z "$ONLY_STEP" ] && return 0 - [ "$1" = "$ONLY_STEP" ] -} - -run_or_dry() { - if [ "$DRY_RUN" = "1" ]; then - printf " ${C_WARN}dry-run${C_RESET} %s\n" "$*" >&2 - else - "$@" - fi -} - -# ─── Step 0: prerequisite check ────────────────────────────────────────── -log "0/7 Prereq check" -caller_arn=$(aws sts get-caller-identity --query Arn --output text 2>&1) \ - || die "aws sts get-caller-identity failed: $caller_arn — run: awsp agentkeys-admin" -caller_lc=$(printf '%s' "$caller_arn" | tr '[:upper:]' '[:lower:]') -case "$caller_lc" in - *":user/agentkeys-admin"*) ok "caller: $caller_arn" ;; - *) die "caller is $caller_arn — admin required. Run: awsp agentkeys-admin" ;; -esac -ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text) -REGION="${AWS_REGION:-us-east-1}" -ok "ACCOUNT_ID=$ACCOUNT_ID REGION=$REGION" - -# Seed the env file if missing -if [ ! 
-f "$TEST_ENV_FILE" ]; then - [ -f "$TEST_ENV_EXAMPLE" ] || die "missing $TEST_ENV_EXAMPLE (committed template)" - cp "$TEST_ENV_EXAMPLE" "$TEST_ENV_FILE" - ok "seeded $TEST_ENV_FILE from .example" -fi - -env_set() { - local key="$1" val="$2" file="$3" - if grep -qE "^${key}=" "$file" 2>/dev/null; then - if [ "$(uname)" = "Darwin" ]; then - sed -i '' -E "s|^${key}=.*|${key}=${val}|" "$file" - else - sed -i -E "s|^${key}=.*|${key}=${val}|" "$file" - fi - else - printf '%s=%s\n' "$key" "$val" >> "$file" - fi -} -env_set ACCOUNT_ID "$ACCOUNT_ID" "$TEST_ENV_FILE" -env_set REGION "$REGION" "$TEST_ENV_FILE" - -# ─── Step 1: provision the broker host (mirrors prod §5) ───────────────── -if should_run_step 1; then - log "1/7 Provision broker host (test-broker.${TEST_BROKER_HOST#test-broker.})" - cat >&2 </agentKeys && cd agentKeys - bash scripts/setup-broker-host.sh \\ - --issuer-url https://${TEST_BROKER_HOST} \\ - --account-id ${ACCOUNT_ID} \\ - --signer-host ${TEST_SIGNER_HOST} \\ - --audit-host ${TEST_AUDIT_HOST} \\ - --email-host ${TEST_EMAIL_HOST} \\ - --cred-host ${TEST_CRED_HOST} \\ - --memory-host ${TEST_MEMORY_HOST} \\ - --chain-rpc https://rpc.paseo-parachain.heima.network \\ - --vault-bucket agentkeys-vault-test-${ACCOUNT_ID} \\ - --memory-bucket agentkeys-memory-test-${ACCOUNT_ID} \\ - --email-from noreply-test@bots-test.litentry.org \\ - --non-interactive --yes - 4. Confirm: curl -sf https://${TEST_BROKER_HOST}/healthz - - See docs/test-environment.md §3 for the full host runbook. 
-EOF - skip "manual operator step; rerun --only-step 2 once the host is up" -fi - -# ─── Step 2: IAM OIDC provider for test-broker ─────────────────────────── -if should_run_step 2; then - log "2/7 IAM OIDC provider (oidc-provider/${TEST_BROKER_HOST})" - oidc_arn="arn:aws:iam::${ACCOUNT_ID}:oidc-provider/${TEST_BROKER_HOST}" - if aws iam get-open-id-connect-provider --open-id-connect-provider-arn "$oidc_arn" \ - >/dev/null 2>&1; then - skip "OIDC provider already registered: $oidc_arn" - else - # Fetch the broker's TLS leaf thumbprint (AWS requires it for OIDC - # provider registration). Public TLS cert, so this is fine to - # fetch from any network. - thumb=$(echo | openssl s_client -servername "$TEST_BROKER_HOST" \ - -connect "${TEST_BROKER_HOST}:443" 2>/dev/null \ - | openssl x509 -fingerprint -noout 2>/dev/null \ - | awk -F'=' '{print $2}' | tr -d ':' | tr 'A-Z' 'a-z') - [ -n "$thumb" ] || die "could not fetch TLS thumbprint for ${TEST_BROKER_HOST}; is the broker reachable?" - run_or_dry aws iam create-open-id-connect-provider \ - --url "https://${TEST_BROKER_HOST}" \ - --client-id-list "sts.amazonaws.com" \ - --thumbprint-list "$thumb" - ok "registered $oidc_arn (thumbprint=$thumb)" - fi - env_set OIDC_PROVIDER_ARN "$oidc_arn" "$TEST_ENV_FILE" -fi - -# ─── Step 3: IAM roles (data, vault, memory) ───────────────────────────── -if should_run_step 3; then - log "3/7 IAM roles (data-test, vault-test, memory-test)" - # These wrap the existing prod provisioning scripts with a -test - # suffix on every name. The scripts read role/bucket names from env, - # so set env then call. - warn "extend scripts/provision-vault-role.sh + provision-memory-role.sh" - warn "to accept a SUFFIX env var, or copy them as -test variants." - warn "Tracking as a TODO in this script — exercise once the prod" - warn "scripts are parameterized (~ 1 PR of work)." 
-fi - -# ─── Step 4: S3 buckets ────────────────────────────────────────────────── -if should_run_step 4; then - log "4/7 S3 buckets (mail-test, vault-test, memory-test)" - warn "same parameterization story as step 3 — see TODO above." -fi - -# ─── Step 5: deployer wallet + funding ─────────────────────────────────── -if should_run_step 5; then - log "5/7 Deployer wallet on Heima-Paseo (distinct from prod deployer)" - KEYFILE="$HOME/.agentkeys/heima-paseo-deployer-test.key" - if [ -f "$KEYFILE" ]; then - skip "$KEYFILE exists" - elif [ "$DRY_RUN" = "1" ]; then - # A piped run_or_dry would still create an empty $KEYFILE in dry-run - # mode (the redirect runs regardless), so stop before writing anything. - skip "dry-run: would generate $KEYFILE via cast wallet new" - else - mkdir -p "$(dirname "$KEYFILE")" - cast wallet new --json \ - | tee "$WORK_DIR/wallet.json" \ - | jq -r '.[0].private_key' > "$KEYFILE" - chmod 600 "$KEYFILE" - addr=$(jq -r '.[0].address' "$WORK_DIR/wallet.json") - ok "generated $KEYFILE (addr=$addr) — fund this address from your" - ok " personal Paseo wallet, then re-run --only-step 6 to deploy contracts." - fi -fi - -# ─── Step 6: deploy v2 stage-1 contracts on Heima-Paseo ────────────────── -if should_run_step 6; then - log "6/7 Deploy v2 stage-1 contracts to Heima-Paseo (new contracts on-chain)" - KEYFILE="$HOME/.agentkeys/heima-paseo-deployer-test.key" - [ -f "$KEYFILE" ] || die "missing $KEYFILE — run --only-step 5 first" - run_or_dry env HEIMA_DEPLOYER_KEY_FILE="$KEYFILE" \ - AGENTKEYS_CHAIN=heima-paseo \ - bash "$REPO_ROOT/scripts/heima-bring-up.sh" - ok "contract addresses recorded in scripts/operator-workstation.env;" - ok " copy the *_HEIMA_PASEO lines into $TEST_ENV_FILE." -fi - -# ─── Step 7: GitHub Actions OIDC role for the e2e workflow ─────────────── -if should_run_step 7; then - log "7/7 GitHub Actions OIDC role (test-only)" - warn "Create an additional IAM role 'github-actions-agentkeys-e2e'" - warn "with trust policy on token.actions.githubusercontent.com and a" - warn "condition limiting to the agentkeys repo + branch ref.
Grant" - warn "agentkeys-vault-role-test + agentkeys-memory-role-test assume" - warn "perms and read-only S3 on the three test buckets." - warn "" - warn "Then store the role ARN as the TEST_OIDC_AWS_ROLE_ARN repo secret." - warn "Until that secret is set, .github/workflows/harness-e2e.yml is" - warn "inert (the job is gated on its presence)." -fi - -# ─── Done ──────────────────────────────────────────────────────────────── -log "Done" -ok "test environment provisioning complete (or skip-noted above)" -ok "next: bash harness/v2-stage3-demo.sh against \$OIDC_ISSUER=${TEST_BROKER_HOST}" -ok " with AGENTKEYS_ENV_FILE=$TEST_ENV_FILE" -rm -rf "$WORK_DIR" diff --git a/scripts/test-environment.env.example b/scripts/test-environment.env.example deleted file mode 100644 index 68c8787..0000000 --- a/scripts/test-environment.env.example +++ /dev/null @@ -1,92 +0,0 @@ -# AgentKeys long-lived test environment — env file template (issue #66 tier-2). -# -# Companion to scripts/operator-workstation.env, but for the PARALLEL -# test infrastructure (not prod): -# -# - Hostname: test-broker.litentry.org (vs. broker.litentry.org) -# - OIDC iss: https://test-broker.litentry.org -# - IAM role: agentkeys-data-role-test (vs. agentkeys-data-role) -# - Vault role: agentkeys-vault-role-test (vs. agentkeys-vault-role) -# - Mem role: agentkeys-memory-role-test (vs. agentkeys-memory-role) -# - Mail/vault/memory buckets: -test suffix on every bucket name -# - Chain: heima-paseo (testnet — no real-HEI cost on every CI run) -# - Deployer: separate keypair, persisted only in operator wallet -# - Contracts: deployed fresh by scripts/provision-test-environment.sh, -# distinct addresses from prod (recorded below per chain) -# -# Why mirror operator-workstation.env instead of forking it: the harness -# scripts (harness/v2-stage*.sh) source ONE env file. 
Setting -# AGENTKEYS_ENV_FILE=./scripts/test-environment.env before invoking a -# harness script reuses the entire flow against the test infra unchanged. -# -# Bring-up: bash scripts/provision-test-environment.sh -# Activate: cp scripts/test-environment.env.example scripts/test-environment.env -# (then fill in the values below from the provisioner output) -# -# This .example file commits as-is. The non-example copy MUST NOT be -# committed (it carries no secrets in itself, but its contents are the -# canonical "this account hosts the test infra" pointer — gated behind -# the operator's deliberate copy). -# -# See docs/test-environment.md for the full bring-up runbook. - -# ─── AWS account ───────────────────────────────────────────────────────── -# Same account as prod is fine for cost, but every resource name carries -# a -test suffix so a misconfigured CI run targeting prod fails closed -# (the role / bucket / OIDC provider simply won't exist in prod). -ACCOUNT_ID=000000000000 -REGION=us-east-1 - -# ─── Hostname + OIDC issuer ────────────────────────────────────────────── -# DNS A record + TLS cert + nginx + systemd all per scripts/setup-broker-host.sh -# with --issuer-url https://test-broker.litentry.org. Long-lived because -# AWS validates the OIDC issuer URL byte-for-byte against the JWT `iss` -# claim — every reboot must restore the same URL. 
-BROKER_HOST=test-broker.litentry.org -OIDC_ISSUER=https://${BROKER_HOST} -OIDC_PROVIDER_ARN=arn:aws:iam::${ACCOUNT_ID}:oidc-provider/${BROKER_HOST} - -# ─── IAM roles (parallel to prod, distinct ARNs) ───────────────────────── -DATA_ROLE_ARN=arn:aws:iam::${ACCOUNT_ID}:role/agentkeys-data-role-test -VAULT_ROLE_ARN=arn:aws:iam::${ACCOUNT_ID}:role/agentkeys-vault-role-test -MEMORY_ROLE_ARN=arn:aws:iam::${ACCOUNT_ID}:role/agentkeys-memory-role-test - -# ─── S3 buckets (parallel to prod, distinct names) ─────────────────────── -MAIL_DOMAIN=bots-test.litentry.org -MAIL_BUCKET=agentkeys-mail-test-${ACCOUNT_ID} -BUCKET=${MAIL_BUCKET} -VAULT_BUCKET=agentkeys-vault-test-${ACCOUNT_ID} -MEMORY_BUCKET=agentkeys-memory-test-${ACCOUNT_ID} - -# ─── Backend (signer) URL ──────────────────────────────────────────────── -# Test env runs the mock-server backend (the production dev_key_service -# shape). Real TEE workers are out of scope for the test environment — -# see issue #74 step 2. -AGENTKEYS_SIGNER_URL=https://signer-test.litentry.org -BACKEND_URL=${AGENTKEYS_SIGNER_URL} - -# ─── Chain (Heima-Paseo testnet) ───────────────────────────────────────── -# Defaults to Paseo for zero real-HEI cost. Override to `anvil` for -# fully local runs; never `heima` (mainnet — prod-only). -AGENTKEYS_CHAIN=heima-paseo - -# Contract addresses — populated by scripts/provision-test-environment.sh. -# Keep one set per chain so re-bring-up against another chain doesn't -# clobber. The non-test file commits the actual addresses post-deploy. 
-SCOPE_CONTRACT_ADDRESS_HEIMA_PASEO=0x0000000000000000000000000000000000000000 -SIDECAR_REGISTRY_ADDRESS_HEIMA_PASEO=0x0000000000000000000000000000000000000000 -K3_EPOCH_COUNTER_ADDRESS_HEIMA_PASEO=0x0000000000000000000000000000000000000000 -CREDENTIAL_AUDIT_ADDRESS_HEIMA_PASEO=0x0000000000000000000000000000000000000000 -P256_VERIFIER_ADDRESS_HEIMA_PASEO=0x0000000000000000000000000000000000000000 -K11_VERIFIER_ADDRESS_HEIMA_PASEO=0x0000000000000000000000000000000000000000 - -# ─── Deployer key path ─────────────────────────────────────────────────── -# Operator-held only; the test deployer is a DIFFERENT wallet from prod. -# Provisioner persists it at ~/.agentkeys/heima-paseo-deployer-test.key. -HEIMA_DEPLOYER_KEY_FILE=${HOME}/.agentkeys/heima-paseo-deployer-test.key - -# ─── CI namespacing (per-run S3 prefix isolation) ──────────────────────── -# Set by the e2e workflow at run time so concurrent CI runs don't step -# on each other's writes. Cleaned up by nightly s3-prefix-rm job (see -# docs/test-environment.md §Cleanup). -CI_S3_PREFIX=ci/pr-${PR_NUMBER:-manual}/run-${GITHUB_RUN_ID:-local} From 5a66a8535b414375c7694d8404141962258513f5 Mon Sep 17 00:00:00 2001 From: wildmeta-agent Date: Thu, 21 May 2026 10:05:38 +0800 Subject: [PATCH 3/4] docs: concise setup guides aligned with scripts/setup-{broker-host,heima}.sh MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Per operator request: pivot cloud-setup.md from a verbose manual-bash runbook to a concise prereq/script-pointer split, add new heima-setup.md + ci-setup.md for the chain + CI flows, and move troubleshooting into the ./wiki/ folder. What changed: docs/cloud-setup.md — UPDATE, 970 → 314 lines Add a TL;DR with the three-command operator flow (manual §1-§4 prereqs, then setup-broker-host.sh, then setup-heima.sh). 
Slim §1-§4 to invariants + helper-script pointers + brief command blocks (DKIM bulk-record / receipt rule / per-data-class role provisioning all delegate to the existing scripts/*.sh). Replace the verbose §5/§6/§7 (EC2 broker / signer / workers, each with 100+ lines of inline bash) with one §5 "Run setup-broker-host.sh" section that names what the script does (build, systemd, nginx, certbot, keypairs, env files) + what it doesn't (DNS, IAM, OIDC provider — those stay in §1-§4). Keep §0 (identities table) and §6 (cleanup recipe). docs/heima-setup.md — NEW, 106 lines The 15-step pipeline in scripts/setup-heima.sh, with idempotency check + helper-script pointer per step. Mainnet vs Paseo vs Anvil tradeoff table. Per-step re-run examples. Heima London EVM pin explanation. docs/ci-setup.md — NEW, 184 lines The 7-step operator bring-up for the no-LLM .github/workflows/harness-ci.yml workflow: provision test broker via setup-broker-host.sh with -test suffix, provision parallel AWS resources, register the test OIDC provider, generate + fund the test deployer wallet, deploy fresh test contracts on Heima mainnet with the same .sol source (different deployer → different addresses → isolated parallel contract set), register the GitHub Actions OIDC role, set the repo secrets. Includes the full TEST_* secret list, manual-dispatch instructions, and a secret-hygiene reminder. wiki/cloud-setup-faq.md — NEW, 94 lines wiki/heima-setup-faq.md — NEW, 111 lines wiki/ci-setup-faq.md — NEW, 96 lines Troubleshooting + edge cases for each setup doc. Lives under ./wiki/ per CLAUDE.md "Wiki-location policy" — auto-published to the GitHub wiki on every push to main. Constraints applied: - Concise: every doc fits in a few screens. - Idempotent: every flow reuses the existing idempotent helper scripts (setup-broker-host.sh, setup-heima.sh, provision-*-role.sh, apply-*-bucket-policy.sh). 
- No project credentials exposed: account IDs, role ARNs, bucket names, deployer keys, contract addresses all referenced via ${ACCOUNT_ID} / ${BROKER_HOST} / ${REGION} placeholders or via "read from operator-workstation.env" / "from step N" pointers. Real values live only in the operator's local env file + the GitHub repo secrets store. All internal links verified via a python url-walker (every relative link resolves to an existing file). --- docs/ci-setup.md | 184 +++++++ docs/cloud-setup.md | 1030 ++++++++------------------------------- docs/heima-setup.md | 106 ++++ wiki/ci-setup-faq.md | 96 ++++ wiki/cloud-setup-faq.md | 94 ++++ wiki/heima-setup-faq.md | 111 +++++ 6 files changed, 782 insertions(+), 839 deletions(-) create mode 100644 docs/ci-setup.md create mode 100644 docs/heima-setup.md create mode 100644 wiki/ci-setup-faq.md create mode 100644 wiki/cloud-setup-faq.md create mode 100644 wiki/heima-setup-faq.md diff --git a/docs/ci-setup.md b/docs/ci-setup.md new file mode 100644 index 0000000..04670d0 --- /dev/null +++ b/docs/ci-setup.md @@ -0,0 +1,184 @@ +# CI setup — AgentKeys + +**Audience:** the operator activating the no-LLM CI workflow against a test instance of the production environment. +**Scope:** one workflow file ([`.github/workflows/harness-ci.yml`](../.github/workflows/harness-ci.yml)), a list of GitHub secrets, and the test-side counterparts of the production resources from [`docs/cloud-setup.md`](cloud-setup.md) + [`docs/heima-setup.md`](heima-setup.md). +**FAQ + troubleshooting:** [`wiki/ci-setup-faq.md`](../wiki/ci-setup-faq.md). + +## TL;DR + +The workflow runs unmodified on every push / PR. It has two jobs: + +1. **`rust-checks`** — always runs. `cargo fmt --check` + `cargo clippy -D warnings` + `cargo test --workspace`. Covers 600+ tests including the in-process broker integration tests (which already mock STS + SES + WebAuthn). +2. **`harness-e2e`** — gated on the `TEST_OIDC_AWS_ROLE_ARN` secret being set. 
Runs the production harness scripts ([`harness/v2-stage{1,2,3}-demo.sh`](../harness/)) against an isolated TEST instance of the cloud + chain. + +Until the operator activates the test instance, `harness-e2e` surfaces a `::warning::` skip and the PR is unblocked. + +## What "mirror production" means + +Every resource in the test instance is parallel to prod: + +| | Production | Test | +|---|---|---| +| Broker host | `broker.litentry.org` | `test-broker.litentry.org` (long-lived; AWS validates OIDC issuer URLs byte-for-byte) | +| OIDC issuer | `https://broker.litentry.org` | `https://test-broker.litentry.org` | +| IAM roles | `agentkeys-{data,vault,memory}-role` | `agentkeys-{data,vault,memory}-role-test` | +| S3 buckets | `agentkeys-{mail,vault,memory}-${ACCT}` | `agentkeys-{mail,vault,memory}-test-${ACCT}` | +| Chain | Heima mainnet | **Heima mainnet** (same chain, different deployer → different addresses) | +| Deployer wallet | operator's prod deployer | dedicated test wallet (small HEI float) | +| Contracts | one production deploy | one test deploy with **identical `.sol` source** → new addresses | +| WebAuthn | real Touch ID | never (`WEBAUTHN_MODE=0`) | +| LLM | (separate `claude.yml` review) | never | + +**Same code, same chain, isolated storage.** EVM addresses derive from `(deployer, nonce)` and Solidity compiles deterministically — a different deployer key with the same source files produces a parallel contract set that can't see or write to prod contract state. + +## One-shot operator bring-up + +### 1. 
Provision the test broker + +Same flow as `docs/cloud-setup.md`, with the `-test` suffix on every identifier: + +```bash +# On a fresh EC2 with EIP + DNS A record for test-broker.${ZONE} +sudo bash scripts/setup-broker-host.sh \ + --issuer-url https://test-broker.${ZONE} \ + --account-id "${ACCOUNT_ID}" \ + --signer-host signer-test.${ZONE} \ + --audit-host audit-test.${ZONE} \ + --email-host email-test.${ZONE} \ + --cred-host cred-test.${ZONE} \ + --memory-host memory-test.${ZONE} \ + --vault-bucket "agentkeys-vault-test-${ACCOUNT_ID}" \ + --memory-bucket "agentkeys-memory-test-${ACCOUNT_ID}" \ + --email-from "noreply-test@bots-test.${ZONE}" \ + --yes +``` + +Idempotent: re-run after edits without manual rollback. + +### 2. Provision the parallel AWS resources + +The prod provisioning helpers (`scripts/provision-vault-{bucket,role}.sh`, `scripts/provision-memory-{bucket,role}.sh`, `scripts/apply-{vault,memory}-bucket-policy.sh`, `scripts/cleanup-mail-bucket-policy.sh`) all read bucket / role names from `scripts/operator-workstation.env`. For the test instance, export the `-test` names as overrides before running them: + +```bash +# Export the -test names so every script in the chain sees them (a bare VAR=... prefix would reach only the first command) +export TEST_VAULT_BUCKET="agentkeys-vault-test-${ACCOUNT_ID}" \ +TEST_MEMORY_BUCKET="agentkeys-memory-test-${ACCOUNT_ID}" \ +TEST_VAULT_ROLE_ARN="arn:aws:iam::${ACCOUNT_ID}:role/agentkeys-vault-role-test" \ +TEST_MEMORY_ROLE_ARN="arn:aws:iam::${ACCOUNT_ID}:role/agentkeys-memory-role-test" + bash scripts/provision-vault-bucket.sh && \ + bash scripts/provision-vault-role.sh && \ + bash scripts/apply-vault-bucket-policy.sh && \ + bash scripts/provision-memory-bucket.sh && \ + bash scripts/provision-memory-role.sh && \ + bash scripts/apply-memory-bucket-policy.sh +``` + +(If the prod scripts don't yet read these overrides, file a follow-up issue and copy the prod scripts as `-test` variants by hand until they do.) + +### 3.
Register the test OIDC provider in IAM + +```bash +thumb=$(echo | openssl s_client -servername "test-broker.${ZONE}" \ + -connect "test-broker.${ZONE}:443" 2>/dev/null \ + | openssl x509 -fingerprint -noout \ + | awk -F'=' '{print $2}' | tr -d ':' | tr 'A-Z' 'a-z') + +aws iam create-open-id-connect-provider \ + --url "https://test-broker.${ZONE}" \ + --client-id-list "sts.amazonaws.com" \ + --thumbprint-list "$thumb" +``` + +### 4. Generate the test deployer wallet + fund it + +```bash +mkdir -p ~/.agentkeys +cast wallet new --json \ + | tee /tmp/test-deployer.json \ + | jq -r .[0].private_key > ~/.agentkeys/heima-deployer-test.key +chmod 600 ~/.agentkeys/heima-deployer-test.key +# Then fund the address ($(jq -r .[0].address /tmp/test-deployer.json)) +# from your personal Heima wallet — small float is enough for one-shot deploy. +``` + +### 5. Deploy the test contracts on Heima mainnet + +Identical Solidity, identical `DeployAgentKeysV1.s.sol`, different deployer → new addresses on the production chain: + +```bash +AGENTKEYS_CHAIN=heima \ +HEIMA_DEPLOYER_KEY_FILE=~/.agentkeys/heima-deployer-test.key \ +MAINNET_CONFIRM=1 \ + bash scripts/setup-heima.sh --from-step 4 --to-step 8 +``` + +That walks steps 4–8: reuse the test key, fund-check, deploy, persist addresses, verify on-chain. Read off the six `*_HEIMA` addresses from the resulting `scripts/operator-workstation.env` for the next step. + +### 6. Register the GitHub Actions OIDC role + +Create one additional IAM role, `github-actions-agentkeys-e2e`, trust-policied on `token.actions.githubusercontent.com` with a condition limiting it to the agentkeys repo. Grant it `sts:AssumeRole` on the three test data roles and read-only S3 on the three test buckets. + +### 7. 
Set the GitHub repo secrets + +In **Settings → Secrets and variables → Actions**: + +| Secret | Value | +|---|---| +| `TEST_OIDC_AWS_ROLE_ARN` | `arn:aws:iam::${ACCT}:role/github-actions-agentkeys-e2e` (the gate) | +| `TEST_ACCOUNT_ID` | numeric AWS account ID (same account as prod is fine) | +| `TEST_AWS_REGION` | e.g. `us-east-1` | +| `TEST_BROKER_HOST` | `test-broker.${ZONE}` | +| `TEST_VAULT_BUCKET` | `agentkeys-vault-test-${ACCT}` | +| `TEST_MEMORY_BUCKET` | `agentkeys-memory-test-${ACCT}` | +| `TEST_VAULT_ROLE_ARN` | `arn:aws:iam::${ACCT}:role/agentkeys-vault-role-test` | +| `TEST_MEMORY_ROLE_ARN` | `arn:aws:iam::${ACCT}:role/agentkeys-memory-role-test` | +| `TEST_DATA_ROLE_ARN` | `arn:aws:iam::${ACCT}:role/agentkeys-data-role-test` | +| `TEST_HEIMA_DEPLOYER_KEY` | the 0x-prefixed test deployer private key from step 4 | +| `TEST_SCOPE_CONTRACT_ADDRESS_HEIMA` | from step 5 | +| `TEST_SIDECAR_REGISTRY_ADDRESS_HEIMA` | from step 5 | +| `TEST_K3_EPOCH_COUNTER_ADDRESS_HEIMA` | from step 5 | +| `TEST_CREDENTIAL_AUDIT_ADDRESS_HEIMA` | from step 5 | +| `TEST_P256_VERIFIER_ADDRESS_HEIMA` | from step 5 | +| `TEST_K11_VERIFIER_ADDRESS_HEIMA` | from step 5 | + +`TEST_OIDC_AWS_ROLE_ARN` is the gate. Setting it last activates the workflow; unsetting it disarms. + +## What the workflow does on every run + +1. Restores submodules + Rust toolchain + Foundry + cargo cache. +2. **`rust-checks`** job: `cargo fmt --check` → `cargo clippy -- -D warnings` → `cargo test --workspace -- --test-threads=1` (the `--test-threads=1` matches the existing `@claude` review workflow because broker tests mutate `$HOME` / `AWS_*` env). +3. **`preflight`** job: gates on `TEST_OIDC_AWS_ROLE_ARN`. +4. 
**`harness-e2e`** job: assumes the test role via GitHub Actions OIDC (no long-lived secrets), writes the test deployer key, overwrites `scripts/operator-workstation.env` with TEST_* values, then runs: + - `harness/v2-stage1-demo.sh --skip-deploy --skip-email` (contracts pre-deployed; identity via wallet_sig) + - `harness/v2-stage2-demo.sh --stub --skip-build` + - `harness/v2-stage3-demo.sh` (per-actor + per-data-class PrincipalTag isolation — the capstone that needs real AWS STS) +5. Per-run S3 prefix cleanup (`ci/run-${RUN_ID}/`) in an `if: always()` block. + +## Per-run S3 prefix isolation + +Concurrent runs (nightly + a manual dispatch) get a unique prefix via `CI_S3_PREFIX=ci/run-${GITHUB_RUN_ID}`. Per-job cleanup is best-effort; pair it with a nightly operator-side cron that sweeps `ci/` prefix keys older than 7 days from the test buckets. + +## Manual dispatch + +```bash +gh workflow run harness-ci.yml --field stage=3 +``` + +`stage` accepts `1`, `2`, `3`, or `all`. Useful for re-running just stage-3 after a contract revision. + +## Secret hygiene + +No project credentials live in this doc. Every value above is either a placeholder (`${ACCT}`, `${ZONE}`) or an instruction to read from the operator's already-provisioned state ("from step 5"). The actual values live in two places only: + +- The operator's local `scripts/operator-workstation.env` (gitignored copies / test variants only). +- The GitHub repo's encrypted secrets store. + +Never paste a real account ID, role ARN, bucket name, deployer key, or contract address into a markdown doc, commit message, or PR description. 
+ +## Related + +- Workflow file: [`.github/workflows/harness-ci.yml`](../.github/workflows/harness-ci.yml) +- Cloud / broker bring-up: [`docs/cloud-setup.md`](cloud-setup.md) +- Chain bring-up: [`docs/heima-setup.md`](heima-setup.md) +- Harness scripts: [`harness/v2-stage{1,2,3}-demo.sh`](../harness/) +- FAQ + troubleshooting: [`wiki/ci-setup-faq.md`](../wiki/ci-setup-faq.md) diff --git a/docs/cloud-setup.md b/docs/cloud-setup.md index df04df2..3a7820f 100644 --- a/docs/cloud-setup.md +++ b/docs/cloud-setup.md @@ -1,970 +1,322 @@ # Cloud setup — AgentKeys **Audience:** the operator provisioning the cloud account that hosts AgentKeys infrastructure. -**Scope:** one file, every cloud-side resource. Read top-down once per account, then jump back to the section you're touching. +**Scope:** the prereqs that the idempotent [`scripts/setup-broker-host.sh`](../scripts/setup-broker-host.sh) entry point can't do for itself (DNS, SES, IAM, OIDC provider, S3 buckets). Run those once per account, then re-run the broker-host script as often as needed. +**Companion:** [`docs/heima-setup.md`](heima-setup.md) for chain bring-up, [`docs/ci-setup.md`](ci-setup.md) for CI activation. +**FAQ + troubleshooting:** [`wiki/cloud-setup-faq.md`](../wiki/cloud-setup-faq.md). 
-The runbook is split by concern, not by stage: - -| § | Concern | When you do this | -|---|---------|------------------| -| [§0 Identities](#0-identities--mental-model) | The four IAM principals and what each one is for | Read first | -| [§1 Domain + DNS](#1-domain--dns) | Email subdomain (Stage 6) + broker subdomain (Stage 7) | Once per account | -| [§2 Inbound mail](#2-inbound-mail-backend) | SES + S3 receipt rule (Stage 6) | Once per account | -| [§3 IAM users + role](#3-iam-identities) | `agentkeys-{admin,broker,daemon}` + `agentkeys-data-role` | Once per account | -| [§4 OIDC federation](#4-oidc-federation-stage-7) | Register the broker as an OIDC provider, swap to PrincipalTag-scoped trust | After §1–§3 + a publicly-reachable broker | -| [§5 EC2 broker host](#5-ec2-broker-host-optional) | EIP, A record, security group | Only if you're hosting the broker on AWS | -| [§6 Signer host](#6-signer-host) | DNS A record + TLS cert + nginx flip for `signer.` | After §5 — needs `$EIP` | -| [§7 Service workers](#7-service-workers-audit--email--cred--memory) | 4 DNS A records + TLS certs + nginx flips for `audit/email/cred/memory.` (dev co-located on broker host) | After §5 — needs `$EIP` | -| [§8 Cleanup](#8-cleanup) | Tear-down recipe | When you want to delete it all | - -**Cloud-portability:** §1 (DNS) and §2 (inbound mail) are the cloud-replaceable layers — Tencent Cloud SimpleDM + COS would slot in here unchanged at the §3+ boundary. See [§2.2](#22-future-tencent-cloud-simpledm--cos). - ---- - -## 0. Identities — mental model - -| Identity | Type | Holds | Purpose | -|---|---|---|---| -| `agentkeys-admin` | IAM user | Long-lived access key | One-shot provisioning. Runs every command in this doc. IAM-admin scope. | -| `agentkeys-broker` | IAM user | Long-lived access key | Operator's SSH-into-EC2 path via EC2 Instance Connect. No data-plane access. | -| `agentkeys-daemon` | IAM user | Long-lived access key | The **broker process** uses this at runtime. 
Only permission: `sts:AssumeRole` on `agentkeys-data-role`. | -| `agentkeys-data-role` | IAM role | (assumed) | The actual S3/SES permissions live here. `agentkeys-daemon` (Stage 6) or the OIDC provider (Stage 7) is allowed to assume it. | -| `agentkeys-broker-host` | IAM role | (assumed by EC2) | Optional. If the broker runs on EC2, attach this as the instance profile so the daemon never sees a static key. | - -Why "data role" and not "agent role": the project word "agent" already means three things (the AI agent, the AgentKeys product, an IAM role). The role holds **data-plane** permissions, so `agentkeys-data-role` it is. (Renamed from `agentkeys-agent` 2026-04-28; the broker still accepts the legacy `BROKER_AGENT_ROLE_ARN` env var.) - -**Prereqs for everything below:** +## TL;DR — operator flow ```bash -# AWS CLI v2 + a working agentkeys-admin profile -awsp agentkeys-admin # set AWS_PROFILE -aws sts get-caller-identity # → agentkeys-admin - -# Shell vars used throughout the runbook -export REGION=us-east-1 # SES inbound: us-east-1, us-west-2, eu-west-1 -export DOMAIN=bots.litentry.org # Stage 6 email subdomain -export BROKER_HOST=broker.litentry.org # Stage 7 broker public hostname -export PARENT_ZONE_ID=Z09723983CFJOHAE3VC65 # existing litentry.org Route 53 zone -export ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text) -export BUCKET=agentkeys-mail-${ACCOUNT_ID} # global-unique by account-id suffix -echo "REGION=$REGION DOMAIN=$DOMAIN BROKER_HOST=$BROKER_HOST ACCOUNT_ID=$ACCOUNT_ID BUCKET=$BUCKET" -``` - -> **Why `jq -n --arg` and not `cat > file.json </{credentials,memory}/*`. | +| `agentkeys-broker-host` | IAM role | (assumed by EC2) | Optional. If the broker runs on EC2, attach as instance profile so the daemon never sees a static key. 
| -The 4 service workers (`audit` / `email` / `cred` / `memory`) co-locate on the broker host today (dev-only per [CLAUDE.md](../CLAUDE.md) "for production, we will isolate all the services for the security issue"). All 4 A records point to the same `$EIP`. The hostnames are the migration seam — when a worker moves to its own machine, only the A record changes. +The word "agent" already means three things (the AI agent, the AgentKeys product, an IAM role) — these roles hold **data-plane** permissions, so they're named `*-data-role` / `*-vault-role` / `*-memory-role`. -Done as part of [§7 Service workers](#7-service-workers-audit--email--cred--memory) using the [`scripts/dns-upsert-workers.sh`](../scripts/dns-upsert-workers.sh) helper. +## 1. DNS ---- +Seven subdomains under your parent zone (e.g. `litentry.org`): -## 2. Inbound mail backend +| Host | Purpose | Set in | +|---|---|---| +| `${MAIL_DOMAIN}` (e.g. `bots.litentry.org`) | SES inbound | §2 | +| `${BROKER_HOST}` (e.g. `broker.litentry.org`) | Broker TLS-terminating reverse proxy | §5 — A record to broker EIP | +| `signer.${ZONE}` | Signer service (issue #74 step 1b) | §5 — A record to broker EIP (co-located today) | +| `audit.${ZONE}` / `email.${ZONE}` / `cred.${ZONE}` / `memory.${ZONE}` | Service workers (issue #90) | §5 — same EIP (dev co-location) | -### 2.1 AWS SES + S3 +For the bulk service-worker DNS, use [`scripts/dns-upsert-workers.sh`](../scripts/dns-upsert-workers.sh). The hostnames are the migration seam — when a worker moves to its own machine, only the A record changes. -#### Verify the SES domain identity +## 2.
SES inbound mail ```bash -aws sesv2 create-email-identity \ - --region "$REGION" --email-identity "$DOMAIN" \ +# Verify the SES domain identity +aws sesv2 create-email-identity --region "$REGION" \ + --email-identity "$MAIL_DOMAIN" \ --dkim-signing-attributes NextSigningKeyLength=RSA_2048_BIT -``` - -Now run [§1.1](#11-email-subdomain--dkim--spf--dmarc--mx) to publish the DKIM/SPF/DMARC/MX records. Wait ~5 min, then: - -```bash -aws sesv2 get-email-identity --region "$REGION" --email-identity "$DOMAIN" \ - --query '{verified: VerifiedForSendingStatus, dkim: DkimAttributes.Status}' -# → {"verified": true, "dkim": "SUCCESS"} -``` - -> **DKIM key custody:** in this interim setup, AWS SES holds the private DKIM key. We never see it. Trust surface: AWS-internal compromise could forge mail signed as us — bounded blast radius (reputation, not user-data custody). Migration target is TEE-held BYODKIM when [`heima-gaps §4`](./spec/heima-gaps-vs-desired-architecture.md) closes; do **not** intermediate-step to "BYODKIM with file-stored key" (strictly worse than AWS-managed). - -#### Create the S3 bucket for inbound mail -The bucket policy in [§3.5](#35-s3-bucket-policy) wires SES write + role read; we'll come back to it after the IAM identities exist. +# Publish DKIM + SPF + DMARC + MX in one Route 53 change (read DKIM tokens +# from `aws sesv2 get-email-identity`, then upsert via Route 53 — see +# wiki/cloud-setup-faq.md for the full record set). 
-```bash -aws s3api create-bucket \ - --region "$REGION" --bucket "$BUCKET" \ +# Create the inbound bucket (30-day TTL on inbound/* objects) +aws s3api create-bucket --region "$REGION" --bucket "$BUCKET" \ $([ "$REGION" != "us-east-1" ] && echo "--create-bucket-configuration LocationConstraint=$REGION") - aws s3api put-public-access-block --region "$REGION" --bucket "$BUCKET" \ --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true -# 30-day TTL on inbound objects (throwaway-inbox model) -aws s3api put-bucket-lifecycle-configuration --region "$REGION" --bucket "$BUCKET" \ - --lifecycle-configuration "$(jq -n '{ - Rules: [{ID:"inbound-30d-ttl", Status:"Enabled", Filter:{Prefix:"inbound/"}, Expiration:{Days:30}}] - }')" -``` - -#### Create the SES receipt rule - -```bash +# Receipt rule: route mail for $MAIL_DOMAIN into s3://$BUCKET/inbound/* aws ses create-receipt-rule-set --rule-set-name agentkeys --region "$REGION" 2>/dev/null || true aws ses create-receipt-rule --region "$REGION" --rule-set-name agentkeys \ - --rule "$(jq -n --arg domain "$DOMAIN" --arg bucket "$BUCKET" '{ + --rule "$(jq -n --arg domain "$MAIL_DOMAIN" --arg bucket "$BUCKET" '{ Name: "agentkeys-inbound", Enabled: true, ScanEnabled: true, TlsPolicy: "Optional", Recipients: [$domain], Actions: [{S3Action: {BucketName: $bucket, ObjectKeyPrefix: "inbound/"}}] }')" aws ses set-active-receipt-rule-set --rule-set-name agentkeys --region "$REGION" -``` - -Inbound MIME lands at `s3://$BUCKET/inbound/`. The first object you'll see is `inbound/AMAZON_SES_SETUP_NOTIFICATION` — AWS's "I successfully wrote to your bucket" marker. Real test mail follows. - -#### Spam handling (read-time filter) - -The SES scanners stamp `X-SES-Spam-Verdict` / `X-SES-Virus-Verdict` headers. The provisioner-scripts `ses-s3` adapter drops messages where either is `FAIL`. No write-time Lambda; trivial receipt rule. 
- -#### Sandbox vs production sending - -Inbound is unaffected by SES sandbox status. You only need to request production access when the agent **sends** mail to arbitrary addresses (replies, notifications). Console → Support → "Service limit increase" → "SES Sending Limits" → "Request Production Access". - -### 2.1a Per-recipient routing Lambda (issue #83) - -After [§4](#4-oidc-federation-stage-7) lands, the `agentkeys-data-role` is intentionally denied read on `s3://$BUCKET/inbound/` (federation-isolation rule, [§4.5](#45-strip-the-static-iam-grants)). Service-provisioning verification emails (openrouter, brave, anthropic, …) land in `inbound/` but the OIDC-assumed scraper subprocess cannot read them — operators see the symptom as `internal error: AccessDenied on s3:ListBucket` at the email-fetch step of `agentkeys provision `. - -The fix is a small post-receive Lambda that copies inbound objects to the operator's PrincipalTag-scoped prefix when the recipient local-part matches the provisioner's routing pattern. Service emails the scraper generates have the form `or-<0x-wallet>-@$DOMAIN`; the Lambda parses that local-part, extracts the wallet, and `CopyObject`s (server-side — body never transits Lambda) to `bots//inbound/`. AGENTKEYS magic-link auth emails (different local-part) stay in `inbound/` for the broker's `/v1/auth/email/*` handlers. -Deploy once per AWS account: - -```bash -awsp agentkeys-admin -set -a; source scripts/operator-workstation.env; set +a -bash infra/ses-routing-lambda/deploy.sh +# Verify the bot's sending identity (the broker's BROKER_EMAIL_FROM_ADDRESS +# precheck refuses to boot if this isn't verified) +bash scripts/ses-verify-sender.sh ``` -Idempotent (re-runnable). 
What it provisions: IAM role `agentkeys-ses-router-lambda-role` (inline policy: `s3:GetObject` on `inbound/*`, `s3:PutObject` on `bots/*/inbound/*`, basic CloudWatch Logs), Lambda function `agentkeys-ses-router` (python3.13, 128MB, 10s timeout, reserved-concurrency=10), and the S3 `ObjectCreated:*` notification on `inbound/` → Lambda. - -Per-invocation cost ≈ 1.7 µ$ at 128 MB; total Lambda spend stays single-digit cents/month at any sensible operator count. See [`infra/ses-routing-lambda/README.md`](../infra/ses-routing-lambda/README.md) for unit tests, verification commands, and rollback. - -> **TODO** (tracked in [`TODOS.md`](../TODOS.md) — "Disable broker's broad S3-full-access"): once this Lambda is deployed and stable, tighten the broker's instance profile so it can no longer read service-provisioning emails (defense-in-depth — today the broker COULD read them but doesn't). - -### 2.2 Future: Tencent Cloud SimpleDM + COS - -For deployments serving China-region traffic, the analogous backend is: +**Sandbox vs production sending:** inbound is unaffected by SES sandbox; only **outbound** to arbitrary addresses needs Console → Support → "SES Sending Limits" → "Request Production Access". -| Layer | AWS (current) | Tencent Cloud (future) | -|---|---|---| -| Email service | SES (SendRawEmail / receipt rules) | SimpleDM (`SendEmail` + receive-rule policies) | -| Object store | S3 + bucket policy | COS + bucket-policy / CAM role | -| Identity service | IAM users + roles + STS AssumeRole | CAM users + roles + STS AssumeRole | -| OIDC federation | `iam:CreateOpenIDConnectProvider` | CAM `CreateOIDCConfig` | - -The provisioner-scripts `email-backends/` interface already abstracts the inbound contract (object key + raw MIME). A Tencent backend slots in as `tencent-simpledm-cos`, with the same upstream API as `ses-s3`. Identity layout in §3 stays unchanged structurally — replace `iam` with `cam` calls. 
**No work in this runbook depends on AWS specifically except the AWS CLI invocations** — the IAM model maps 1:1 onto CAM. +**Per-recipient routing Lambda (issue #83):** after §4 lands, the broker's role is intentionally denied read on `inbound/*`. Service-provisioning verification emails route to `bots//inbound/` via [`infra/ses-routing-lambda/deploy.sh`](../infra/ses-routing-lambda/deploy.sh). Idempotent, deploy once per AWS account. ---- +**Future Tencent Cloud port:** SES + S3 are the only AWS-specific layers in this doc. SimpleDM + COS slot in at the §3+ boundary — IAM model maps 1:1 onto CAM. The `provisioner-scripts/email-backends/` interface already abstracts the inbound contract. ## 3. IAM identities -### 3.1 `agentkeys-daemon` IAM user (broker runtime) +The daemon user + data role are the boundary between manual provisioning (this doc) and the script-driven runtime (`setup-broker-host.sh`). + +### 3.1 The four principals ```bash +# Runtime user (broker process) aws iam create-user --user-name agentkeys-daemon aws iam create-access-key --user-name agentkeys-daemon -# → save AccessKeyId + SecretAccessKey to your secret manager. NOT to git. +# → save AccessKeyId + SecretAccessKey to the operator's secret manager. +# NEVER commit. setup-broker-host.sh consumes these via the systemd +# env file written under /etc/agentkeys/. +# Daemon may only assume the data role (no direct S3/SES grants). aws iam put-user-policy --user-name agentkeys-daemon \ --policy-name agentkeys-daemon-assume-role \ --policy-document "$(jq -n --arg acct "$ACCOUNT_ID" '{ - Version: "2012-10-17", - Statement: [{ - Effect: "Allow", Action: "sts:AssumeRole", - Resource: "arn:aws:iam::\($acct):role/agentkeys-data-role" - }] - }')" -``` - -The daemon user can do exactly one thing: assume `agentkeys-data-role`. Any S3/SES action goes through the role's permissions, never the user's. - -### 3.2 `agentkeys-data-role` - -The role's trust policy starts with the **static-IAM-user** variant (Stage 6). 
[§4.2](#42-replace-the-roles-trust-policy-federated-variant) swaps it for the OIDC-federated variant once the broker is publicly reachable. - -```bash -aws iam create-role --role-name agentkeys-data-role \ - --assume-role-policy-document "$(jq -n --arg acct "$ACCOUNT_ID" '{ - Version: "2012-10-17", - Statement: [{ - Effect: "Allow", - Principal: {AWS: "arn:aws:iam::\($acct):user/agentkeys-daemon"}, - Action: "sts:AssumeRole" - }] - }')" - -aws iam put-role-policy --role-name agentkeys-data-role \ - --policy-name agentkeys-data-role-inline \ - --policy-document "$(jq -n \ - --arg bucket "$BUCKET" --arg region "$REGION" \ - --arg acct "$ACCOUNT_ID" --arg domain "$DOMAIN" \ - '{ - Version: "2012-10-17", - Statement: [ - {Effect:"Allow", Action:"s3:ListBucket", Resource:"arn:aws:s3:::\($bucket)"}, - {Effect:"Allow", Action:"s3:GetObject", Resource:"arn:aws:s3:::\($bucket)/*"}, - {Effect:"Allow", Action:"ses:SendRawEmail", Resource:"arn:aws:ses:\($region):\($acct):identity/\($domain)"} - ] - }')" - -export ROLE_ARN=$(aws iam get-role --role-name agentkeys-data-role --query 'Role.Arn' --output text) -echo "ROLE_ARN=$ROLE_ARN" -``` - -### 3.3 `agentkeys-admin`, `agentkeys-broker` (already provisioned) - -If you've come this far, `agentkeys-admin` exists (you're using it now). `agentkeys-broker` is whatever IAM user you SSH into the broker EC2 with via EC2 Instance Connect — its perms are out of scope here (`ec2-instance-connect:SendSSHPublicKey` on the host's instance ID is sufficient). - -### 3.4 `agentkeys-broker-host` instance profile (optional, EC2-only) - -If the broker runs on EC2, attach this so the daemon never holds a static key. The host's runtime credentials come from IMDS. 
- -```bash -ROLE_NAME=agentkeys-broker-host - -aws iam create-role --role-name $ROLE_NAME \ - --assume-role-policy-document "$(jq -n '{ - Version: "2012-10-17", - Statement: [{Effect:"Allow", Principal:{Service:"ec2.amazonaws.com"}, Action:"sts:AssumeRole"}] - }')" - -aws iam put-role-policy --role-name $ROLE_NAME --policy-name BrokerAssumeData \ - --policy-document "$(jq -n --arg acct "$ACCOUNT_ID" '{ - Version: "2012-10-17", - Statement: [{Effect:"Allow", Action:"sts:AssumeRole", - Resource:"arn:aws:iam::\($acct):role/agentkeys-data-role"}] + Version:"2012-10-17", + Statement:[{Effect:"Allow", Action:"sts:AssumeRole", + Resource:"arn:aws:iam::\($acct):role/agentkeys-data-role"}] }')" - -aws iam create-instance-profile --instance-profile-name $ROLE_NAME -aws iam add-role-to-instance-profile --instance-profile-name $ROLE_NAME --role-name $ROLE_NAME -aws ec2 associate-iam-instance-profile --region "$REGION" \ - --instance-id \ - --iam-instance-profile Name=$ROLE_NAME ``` -### 3.4a `ses:SendEmail` grant on the broker's runtime role (Pass 2 prereq) - -The broker calls SES v2 `SendEmail` with its **own** runtime credentials -(instance profile), NOT via the assumed `agentkeys-data-role`. Without -`ses:SendEmail` on the broker's role the operator hits: +For `agentkeys-admin` + `agentkeys-broker` (one-shot, you already have these per CLAUDE.md "AWS local-profile ↔ remote-IAM mapping"), confirm with `aws iam list-users`. -``` -broker rejected /v1/auth/email/request: status=502 body= -{"error":"backend_unreachable","message":"… ses SendEmail: - unhandled error (AccessDeniedException)"} -``` +### 3.2 The three data roles -The IAM action is `ses:SendEmail` (sesv2) — NOT `ses:SendRawEmail` (v1 -only; different code path the broker doesn't use). - -**Step 1: discover the actual role name attached to your broker host.** -The canonical name is `agentkeys-broker-host` (created by §3.4 above). 
-The discovery command below stays as-is so the runbook is robust to -operators who landed on a non-canonical name during early provisioning -(historically: `S3-full-access`, fully retired 2026-05-12 via the role -rename in [PR #75 follow-up](#)). Find it: +Per arch.md §17.2 (per-data-class isolation): separate roles for credentials + memory + email. Same trust shape, distinct inline policies and PrincipalTag scoping. Provision via the per-data-class helpers (idempotent): ```bash -# REQUIRED: admin profile + operator env loaded. -awsp agentkeys-admin -set -a; source scripts/operator-workstation.env; set +a - -# CRITICAL: pass --region "$REGION". The agentkeys-admin profile -# defaults to us-west-2, but the broker EC2 lives in us-east-1 (from -# operator-workstation.env). Without --region, describe-instances -# searches us-west-2, finds nothing, returns empty silently (no error), -# and the downstream put-role-policy silently runs with --role-name "". -# See CLAUDE.md → AWS local-profile ↔ remote-IAM mapping. -INSTANCE_PROFILE_ARN=$(aws ec2 describe-instances \ - --region "$REGION" \ - --filters "Name=ip-address,Values=$EIP" \ - --query 'Reservations[].Instances[].IamInstanceProfile.Arn' \ - --output text) - -if [[ -z "$INSTANCE_PROFILE_ARN" || "$INSTANCE_PROFILE_ARN" == "None" ]]; then - echo "ABORT: no EC2 instance with EIP=$EIP found in region $REGION." 
>&2 - echo "Caller: $(aws sts get-caller-identity --query Arn --output text)" >&2 - unset ROLE -else - ROLE=$(aws iam get-instance-profile \ - --instance-profile-name "${INSTANCE_PROFILE_ARN##*/}" \ - --query 'InstanceProfile.Roles[0].RoleName' --output text) - echo "broker runtime role: $ROLE" -fi -``` - -**Step 2: grant `ses:SendEmail` + `ses:GetEmailIdentity` (least-privilege).** +bash scripts/provision-vault-bucket.sh # agentkeys-vault-${ACCOUNT_ID} +bash scripts/provision-vault-role.sh # agentkeys-vault-role +bash scripts/apply-vault-bucket-policy.sh # v3 split-statement PrincipalTag policy -The broker calls `ses:GetEmailIdentity` at startup via `verify_sender_ready` -to confirm the sender is verified, and `ses:SendEmail` per request. -Both grants are scoped to the verified domain identity (and any -per-address subset) — nothing wider. +bash scripts/provision-memory-bucket.sh +bash scripts/provision-memory-role.sh +bash scripts/apply-memory-bucket-policy.sh -```bash -aws iam put-role-policy --role-name "$ROLE" \ - --policy-name BrokerSendEmail \ - --policy-document "$(jq -n \ - --arg region "$REGION" --arg acct "$ACCOUNT_ID" --arg domain "$MAIL_DOMAIN" '{ - Version: "2012-10-17", - Statement: [{ - Effect: "Allow", - Action: ["ses:SendEmail", "ses:GetEmailIdentity"], - Resource: [ - "arn:aws:ses:\($region):\($acct):identity/\($domain)", - "arn:aws:ses:\($region):\($acct):identity/*@\($domain)" - ] - }] - }')" +bash scripts/cleanup-mail-bucket-policy.sh # restore email-only grants on $BUCKET ``` -No broker restart needed — sesv2 picks up creds per-call. 
Verify: - -```bash -aws iam get-role-policy --role-name "$ROLE" --policy-name BrokerSendEmail \ - --query 'PolicyDocument.Statement[*].Action' -# → [["ses:SendEmail", "ses:GetEmailIdentity"]] -``` - -**Step 3 (security audit): strip any over-broad legacy attached policies.** - -Some legacy deploys ship with `AmazonS3FullAccess` (or similar wide -permissions) attached to the broker's instance role from initial -provisioning. The broker process at runtime ONLY uses `aws-sdk-sts` -(STS GetCallerIdentity startup probe) + `aws-sdk-sesv2` (this section's -grants) — it never accesses S3 with its own creds. Per-user S3 access -is via JWT-assumed `agentkeys-data-role` (§3.2), NOT the broker's -runtime role. - -A broker compromise with `AmazonS3FullAccess` would expose every -inbound email in the SES bucket (verification tokens, magic links, -user-data buckets if any). Strip it: - -```bash -# List currently attached policies on the broker's role: -aws iam list-attached-role-policies --role-name "$ROLE" - -# Detach AmazonS3FullAccess if present: -aws iam detach-role-policy --role-name "$ROLE" \ - --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess - -# Verify only BrokerSendEmail (inline, this section) remains: -aws iam list-role-policies --role-name "$ROLE" # → ["BrokerSendEmail"] -aws iam list-attached-role-policies --role-name "$ROLE" # → [] -``` +The data-role trust shape is shown in [§4.3](#43-trust-policy) below — it's the same template for all three roles. The inline grants differ per role (vault → credentials prefix; memory → memory prefix; data-role → mail prefix). -### 3.5 S3 bucket policy +### 3.3 SES sender grant (email-link auth prereq) -Now that `agentkeys-data-role` exists, attach the bucket policy. The static-IAM-user variant: SES writes inbound, role reads everything. +The broker's runtime role needs `ses:SendEmail` on the verified sender identity for email-link auth. 
Add this statement to the data role's inline policy: -```bash -aws s3api put-bucket-policy --region "$REGION" --bucket "$BUCKET" \ - --policy "$(jq -n --arg bucket "$BUCKET" --arg acct "$ACCOUNT_ID" '{ - Version: "2012-10-17", - Statement: [ - { - Sid: "AllowSESWriteInbound", Effect: "Allow", - Principal: {Service: "ses.amazonaws.com"}, - Action: "s3:PutObject", - Resource: "arn:aws:s3:::\($bucket)/*", - Condition: {StringEquals: {"aws:Referer": $acct}} - }, - { - Sid: "AllowDaemonRead", Effect: "Allow", - Principal: {AWS: "arn:aws:iam::\($acct):role/agentkeys-data-role"}, - Action: ["s3:GetObject", "s3:ListBucket"], - Resource: ["arn:aws:s3:::\($bucket)", "arn:aws:s3:::\($bucket)/*"] - } - ] - }')" +```json +{ + "Effect": "Allow", + "Action": ["ses:SendEmail", "ses:SendRawEmail"], + "Resource": [ + "arn:aws:ses:${REGION}:${ACCOUNT_ID}:identity/${BROKER_EMAIL_FROM_ADDRESS}", + "arn:aws:ses:${REGION}:${ACCOUNT_ID}:configuration-set/*" + ] +} ``` -The federated variant (PrincipalTag-scoped) lands in [§4.3](#43-upgrade-bucket-policy-to-principaltag-scoped). - ---- +The broker's `verify_sender_ready` precheck calls `ses:GetEmailIdentity` at boot and refuses to start if the identity isn't both verified AND grantable. Triggered without this grant: cryptic `AccessDenied: ses:SendEmail` at the magic-link send step. ## 4. OIDC federation (Stage 7) -Replaces the `agentkeys-daemon → AssumeRole` path in §3.2 with `OIDC-broker-JWT → AssumeRoleWithWebIdentity`. The benefit: per-user isolation enforced **inside AWS** (via PrincipalTag on the assumed session), not just by the daemon's app code. +The broker mints OIDC JWTs that AWS STS validates via the broker's public JWKS endpoint. Three one-shot steps per account. ### 4.1 Prereqs -- §1–§3 done. -- Broker reachable at `https://$BROKER_HOST` over public TLS (see [§5](#5-ec2-broker-host-optional) for the EC2 wiring + `scripts/setup-broker-host.sh` for the host bootstrap). 
-- The broker's discovery doc agrees with `$BROKER_HOST` byte-for-byte: - ```bash - export OIDC_ISSUER="https://$BROKER_HOST" - curl -sS --fail-with-body "$OIDC_ISSUER/.well-known/openid-configuration" | jq -e ".issuer == \"$OIDC_ISSUER\"" - # → true - ``` - If `false`, fix the broker's `BROKER_OIDC_ISSUER` env var before continuing — AWS validates the registered URL against the JWT `iss` claim byte-for-byte (no scheme, trailing slash, or hostname-only forms allowed): - ```bash - sudo sed -i \ - "s|^Environment=BROKER_OIDC_ISSUER=.*|Environment=BROKER_OIDC_ISSUER=$OIDC_ISSUER|" \ - /etc/systemd/system/agentkeys-broker.service - sudo systemctl daemon-reload && sudo systemctl restart agentkeys-broker - ``` +- Broker reachable at `https://${BROKER_HOST}` over public TLS (`setup-broker-host.sh` provisions this with certbot). +- `https://${BROKER_HOST}/.well-known/openid-configuration` returns 200 with the expected `issuer` + `jwks_uri`. +- `https://${BROKER_HOST}/.well-known/jwks.json` returns at least one ES256 key. ### 4.2 Register the OIDC provider -Pre-check for stale state from earlier bring-ups: - ```bash -aws iam list-open-id-connect-providers -``` - -- Empty list → fresh slate; proceed. -- ARN ends in `$BROKER_HOST` → already registered; skip the create, jump to the trust-policy update. 
-- ARN ends in a different host → delete, then register the correct one: - ```bash - aws iam delete-open-id-connect-provider \ - --open-id-connect-provider-arn arn:aws:iam::${ACCOUNT_ID}:oidc-provider/ - ``` - -Register: +thumb=$(echo | openssl s_client -servername "$BROKER_HOST" \ + -connect "${BROKER_HOST}:443" 2>/dev/null \ + | openssl x509 -fingerprint -noout \ + | awk -F'=' '{print $2}' | tr -d ':' | tr 'A-Z' 'a-z') -```bash aws iam create-open-id-connect-provider \ - --url "$OIDC_ISSUER" \ - --client-id-list sts.amazonaws.com \ - --thumbprint-list '' -export OIDC_PROVIDER_ARN="arn:aws:iam::${ACCOUNT_ID}:oidc-provider/$BROKER_HOST" - -aws iam get-open-id-connect-provider \ - --open-id-connect-provider-arn "$OIDC_PROVIDER_ARN" \ - --query '{Url: Url, ClientIDList: ClientIDList}' -# → {"Url": "https://broker.litentry.org", "ClientIDList": ["sts.amazonaws.com"]} -``` - -AWS auto-derives the cert thumbprint from the Let's Encrypt chain. The thumbprint stays valid across cert renewals because LE uses a stable intermediate CA. - -### 4.3 Replace the role's trust policy (federated variant) - -Principal flips from `agentkeys-daemon` to the OIDC provider; the `sts:TagSession` + `aws:RequestTag/agentkeys_user_wallet` condition is what cloud-enforces per-user isolation in [§4.4](#44-upgrade-bucket-policy-to-principaltag-scoped). 
- -```bash -aws iam update-assume-role-policy --role-name agentkeys-data-role \ - --policy-document "$(jq -n \ - --arg provider "$OIDC_PROVIDER_ARN" \ - --arg aud_key "${BROKER_HOST}:aud" \ - '{ - Version: "2012-10-17", - Statement: [{ - Effect: "Allow", - Principal: {Federated: $provider}, - Action: ["sts:AssumeRoleWithWebIdentity", "sts:TagSession"], - Condition: { - StringEquals: {($aud_key): "sts.amazonaws.com"}, - Null: {"aws:RequestTag/agentkeys_user_wallet": "false"} - } - }] - }')" + --url "https://${BROKER_HOST}" \ + --client-id-list "sts.amazonaws.com" \ + --thumbprint-list "$thumb" ``` -`Null: "false"` enforces tag presence ("the key MUST exist"). Do **not** use `StringNotEquals: {"aws:RequestTag/agentkeys_user_wallet": ""}` — AWS evaluates negated string operators on missing context keys as TRUE ("the missing key is not equal to anything"), so a JWT carrying no AWS tags claim would silently bypass the check. The `Null` operator rejects sessions where the tag isn't set at all, which is the only enforcement the trust policy can give you. - -### 4.4 Upgrade bucket policy to PrincipalTag-scoped - -Replaces `AllowDaemonRead` from §3.5. The cloud now enforces "the assumed session can only touch the prefix matching its PrincipalTag" — even if app code has a bug. - -The daemon's read perms split into two statements because `s3:prefix` is a request-time condition that **only applies to `s3:ListBucket`** (the prefix filter on listings) — `s3:GetObject` doesn't carry a prefix parameter, so combining the two actions under one `s3:prefix` condition triggers `MalformedPolicy: Conditions do not apply to combination of actions and resources in statement`. For `GetObject` the resource ARN itself enforces the prefix via `${aws:PrincipalTag/...}` expansion. 
- -```bash -aws s3api put-bucket-policy --region "$REGION" --bucket "$BUCKET" \ - --policy "$(jq -n --arg bucket "$BUCKET" --arg acct "$ACCOUNT_ID" '{ - Version: "2012-10-17", - Statement: [ - { - Sid: "AllowSESWriteInbound", Effect: "Allow", - Principal: {Service: "ses.amazonaws.com"}, - Action: "s3:PutObject", - Resource: "arn:aws:s3:::\($bucket)/*", - Condition: {StringEquals: {"aws:Referer": $acct}} - }, - { - Sid: "AllowDaemonListOwnPrefix", Effect: "Allow", - Principal: {AWS: "arn:aws:iam::\($acct):role/agentkeys-data-role"}, - Action: "s3:ListBucket", - Resource: "arn:aws:s3:::\($bucket)", - Condition: { - StringLike: {"s3:prefix": "bots/${aws:PrincipalTag/agentkeys_user_wallet}/*"} - } - }, - { - Sid: "AllowDaemonGetOwnObjects", Effect: "Allow", - Principal: {AWS: "arn:aws:iam::\($acct):role/agentkeys-data-role"}, - Action: "s3:GetObject", - Resource: "arn:aws:s3:::\($bucket)/bots/${aws:PrincipalTag/agentkeys_user_wallet}/*" - }, - { - Sid: "AllowDaemonPutOwnCredentials", Effect: "Allow", - Principal: {AWS: "arn:aws:iam::\($acct):role/agentkeys-data-role"}, - Action: ["s3:PutObject", "s3:DeleteObject"], - Resource: "arn:aws:s3:::\($bucket)/bots/${aws:PrincipalTag/agentkeys_user_wallet}/credentials/*" - } - ] - }')" -``` - -**Issue #85 — credentials-prefix write grant.** The fourth statement (`AllowDaemonPutOwnCredentials`) is what lets `agentkeys provision ` PUT the AES-256-GCM-sealed credential blob to `s3://$BUCKET/bots//credentials/.enc`. Scope is intentionally tight: only the `credentials/` sub-prefix gets write — every other `bots//*` sub-prefix (inbox, sent, audit, …) stays read-only from the OIDC-assumed session. The plaintext never leaves the operator workstation: AES-256-GCM seal happens before PUT, KEK is derived client-side via the signer's `/dev/sign-message`. PrincipalTag scoping is the cloud-enforced floor; client-side encryption is the second line of defense in case the bucket-policy is misconfigured. 
- -**`bots/` is the per-actor data namespace** — sibling to SES's -`inbound/`, and to future system prefixes like `audit/`, `dkim/`, -`config/`. Keeping every actor's data under a single parent prefix -lets lifecycle rules, encryption defaults, replication, and ops audits -scope cleanly to "user data" without sweeping in system prefixes. -Matches arch.md §6 (`bots/A/file` in the runtime sequence diagram). -Both the policy resource ARN (`bucket/bots/${tag}/*`) and the -`s3:prefix` condition (`bots/${tag}/*`) carry the `bots/` parent — -omit it on either and the other half of the policy denies even legit -reads. - -`StringLike "bots/${tag}/*"` (not `StringEquals "bots/${tag}/"`) lets the daemon list sub-prefixes like `bots//inbox/` and `bots//sent/2026-05/`, not just the exact root `bots//`. Matches the shape in [`docs/spec/ses-email-architecture.md` §10.4](spec/ses-email-architecture.md) and [`wiki/tag-based-access`](../wiki/tag-based-access.md). - -### 4.4.1 Strip the §3 broad-bucket grant from the role's inline policy - -**Critical for §4.5 to actually demonstrate isolation.** §3.2's `agentkeys-data-role-inline` grants the role broad `s3:GetObject` + `s3:ListBucket` on the entire bucket — necessary in the static-IAM path (no PrincipalTag to scope on) but **fatal** here: IAM evaluates as union-of-allows, so this identity-based grant overrides §4.4's bucket-policy isolation. Without this step, §4.5's 4b test will silently succeed instead of correctly returning `AccessDenied` — federation appears to work while the cloud is enforcing nothing. - -Inspect what's currently attached: +**AWS validates the issuer URL byte-for-byte** against the JWT `iss` claim. Once the OIDC provider is registered, the URL is effectively immutable for the life of the deployment — switching means new provider ARN + new trust policy + new federated grants. 
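The byte-for-byte rule can be checked before the provider is registered. The following is a minimal sketch, not part of the repo: `issuer_matches` is a hypothetical helper and `broker.example.org` is a placeholder hostname. It shows which near-miss forms of the URL fail an exact string comparison.

```bash
#!/usr/bin/env bash
# Sketch only: `issuer_matches` is a hypothetical helper, not a repo script.
# It performs the same kind of exact string comparison the text describes
# between the registered URL and the advertised `issuer` value.
issuer_matches() {
  local expected="$1" advertised="$2"
  [ "$expected" = "$advertised" ]
}

expected="https://broker.example.org"   # placeholder hostname (assumption)

# Only the first candidate is identical; the others differ by a trailing
# slash, the scheme, or a missing scheme.
for advertised in \
  "https://broker.example.org" \
  "https://broker.example.org/" \
  "http://broker.example.org" \
  "broker.example.org"; do
  if issuer_matches "$expected" "$advertised"; then
    echo "match:    $advertised"
  else
    echo "mismatch: $advertised"
  fi
done

# Against a live broker (needs network, curl and jq), the advertised value
# could be read from the discovery document:
#   advertised=$(curl -sS --fail \
#     "https://${BROKER_HOST}/.well-known/openid-configuration" | jq -r .issuer)
#   issuer_matches "https://${BROKER_HOST}" "$advertised" && echo "issuer OK"
```

Running a check like this before `create-open-id-connect-provider` is cheap compared with re-registering the provider and rewriting the trust policies afterwards.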
-```bash -aws iam get-role-policy --profile agentkeys-admin \ - --role-name agentkeys-data-role \ - --policy-name agentkeys-data-role-inline \ - --query 'PolicyDocument' -``` +### 4.3 Trust policy -Re-apply, omitting the S3 statement. Keep any non-S3 statements (the daemon needs the `ses:SendRawEmail` grant for outbound mail in §3): +Apply to each of the three data roles. Use `$ROLE` ∈ `{agentkeys-data-role, agentkeys-vault-role, agentkeys-memory-role}`. ```bash -aws iam put-role-policy --profile agentkeys-admin \ - --role-name agentkeys-data-role \ - --policy-name agentkeys-data-role-inline \ - --policy-document "$(jq -n --arg ses_domain "${MAIL_DOMAIN:-bots.litentry.org}" '{ - Version: "2012-10-17", - Statement: [{ - Effect: "Allow", - Action: "ses:SendRawEmail", - Resource: "*", - Condition: { - StringLike: {"ses:FromAddress": "*@\($ses_domain)"} - } +aws iam update-assume-role-policy --role-name "$ROLE" --policy-document "$(jq -n \ + --arg acct "$ACCOUNT_ID" --arg host "$BROKER_HOST" '{ + Version:"2012-10-17", + Statement:[{ + Effect:"Allow", + Principal:{Federated:"arn:aws:iam::\($acct):oidc-provider/\($host)"}, + Action:["sts:AssumeRoleWithWebIdentity","sts:TagSession"], + Condition:{StringEquals:{"\($host):aud":"sts.amazonaws.com"}} }] }')" ``` -If your inline policy had additional non-S3 statements, include them here too. +### 4.4 PrincipalTag-scoped bucket policy -Verify the S3 actions are gone: +Per CLAUDE.md "Per-actor + per-data-class isolation invariants": every S3 read/write is scoped to `bots/${aws:PrincipalTag/agentkeys_actor_omni}/{credentials,memory}/*`. The split-statement v3 bucket policy is applied by [`scripts/apply-{vault,memory}-bucket-policy.sh`](../scripts/); those scripts ARE the source of truth for the policy shape.
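To make the scoping rule concrete, here is a small sketch of which object keys fall inside an actor's prefix for a given data class. It is an illustration only: `key_in_actor_scope` is a hypothetical helper, the actor values are made up, and real enforcement happens in S3 and IAM using the policy written by the `apply-*-bucket-policy.sh` scripts, not in shell.

```bash
#!/usr/bin/env bash
# Sketch only: mimics the prefix shape bots/<actor>/<class>/* described in
# the text. `key_in_actor_scope` is a hypothetical helper, not a repo script.
key_in_actor_scope() {
  local actor="$1" class="$2" key="$3"

  # Only the two data classes named in the text are considered in scope.
  case "$class" in
    credentials|memory) ;;
    *) return 1 ;;
  esac

  # The quoted part is matched literally; only the trailing * is a wildcard.
  case "$key" in
    "bots/${actor}/${class}/"*) return 0 ;;
    *) return 1 ;;
  esac
}

actor="0xaaaa"   # made-up actor identifier (assumption)

for key in \
  "bots/0xaaaa/credentials/openai.enc" \
  "bots/0xbbbb/credentials/openai.enc" \
  "bots/0xaaaa/memory/notes.json"; do
  if key_in_actor_scope "$actor" credentials "$key"; then
    echo "in scope:     $key"
  else
    echo "out of scope: $key"
  fi
done
```

The second and third keys are out of scope for the `credentials` class: one belongs to a different actor, the other to a different data class, which mirrors the per-actor and per-data-class split the section describes.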
-```bash -aws iam get-role-policy --profile agentkeys-admin \ - --role-name agentkeys-data-role \ - --policy-name agentkeys-data-role-inline \ - --query 'PolicyDocument.Statement[*].Action' -# → [["ses:SendRawEmail"]] -``` - -If the daemon doesn't need any non-S3 grants, delete the inline policy entirely instead: +After §4.3 + §4.4: strip the §3 broad-bucket inline grant from the role's policy (the bucket-side policy enforces; defense in depth means no app-side grant). The `cleanup-mail-bucket-policy.sh` helper does this for the mail bucket; do it by hand for any other inline policy you've left: ```bash -aws iam delete-role-policy --profile agentkeys-admin \ - --role-name agentkeys-data-role \ - --policy-name agentkeys-data-role-inline +aws iam delete-role-policy --role-name "$ROLE" --policy-name agentkeys-data-role-s3-broad ``` ### 4.5 End-to-end proof -Mint a JWT, assume the role with it, prove that wallet A can read its own prefix but **not** wallet B's. The minting half must run **on the broker host** (the prod broker validates session bearers against its *own* local backend on `127.0.0.1:8090`, not against any backend reachable from your operator workstation). The AWS-side half runs on your operator workstation where your admin AWS profile lives. - -**Env-var scope** — `$ACCOUNT_ID`, `$BROKER_HOST`, `$OIDC_ISSUER`, `$OIDC_PROVIDER_ARN`, `$BUCKET` only exist on your operator workstation (set up in [§0](#0-identities--mental-model)). The broker host has none of them. Part A below references `$BROKER_HOST` once — in the SSH command itself, where it's expanded by your local shell *before* SSH connects — and otherwise uses **only** literal `127.0.0.1` URLs inside the SSH session. Don't try to re-export the §0 vars on the broker host; none of them are needed there. 
- -#### Part A — on the broker host (mint the JWT) +Run [`harness/v2-stage3-demo.sh`](../harness/v2-stage3-demo.sh) — it mints a session JWT → OIDC JWT → STS creds, then proves both POSITIVE (own prefix) and NEGATIVE (cross-actor prefix → AccessDenied) writes for both data classes plus the cross-role isolation matrix. Walks the full §17.2 isolation table from CLAUDE.md. -```bash -# === Run on your operator workstation === -# ($BROKER_HOST is expanded locally before ssh runs — the broker host -# never sees this var. If $BROKER_HOST isn't set, replace with the -# literal hostname, e.g. broker.litentry.org.) -ssh agentkey@$BROKER_HOST # or via: aws ec2-instance-connect ssh --instance-id - -# === The rest runs inside the SSH session, on the broker host === -# No workstation env vars are visible here. Both URLs are literals. -SESSION=$(curl -sS --fail-with-body -X POST http://127.0.0.1:8090/session/create \ - -H 'content-type: application/json' \ - -d '{"auth_token":"federation-proof"}' | jq -r .session) - -JWT=$(curl -sS --fail-with-body -X POST http://127.0.0.1:8091/v1/mint-oidc-jwt \ - -H "Authorization: Bearer $SESSION" | jq -r .jwt) - -echo "$JWT" -# Copy the entire string. JWT TTL is ~5 min; copy and proceed promptly. -exit -``` - -#### Part B — on your operator workstation (assume role + verify isolation) - -All env vars below (`$ACCOUNT_ID`, `$BUCKET`) are workstation-side from §0. Run after `exit`-ing the SSH session. - -```bash -JWT="" - -# Decode the wallet from the payload. JWT segments are base64url-encoded -# (RFC 7515) — jq's @base64d is strict base64, so we url→std + add padding -# before decoding. Skipping this works on most JWTs by accident; when the -# payload base64 happens to contain - or _, it fails with a "Malformed BOM" -# error. -WALLET=$(jq -R 'split(".") | .[1] | gsub("-";"+") | gsub("_";"/") | - . 
+ ("=" * ((4 - length % 4) % 4)) | @base64d | fromjson | .agentkeys_user_wallet' <<<"$JWT" -r) -echo "WALLET=$WALLET" - -CREDS=$(aws sts assume-role-with-web-identity \ - --role-arn "arn:aws:iam::${ACCOUNT_ID}:role/agentkeys-data-role" \ - --role-session-name "fed-proof-$(date +%s)" \ - --web-identity-token "$JWT") -export AWS_ACCESS_KEY_ID=$(printf '%s' "$CREDS" | jq -r .Credentials.AccessKeyId) -export AWS_SECRET_ACCESS_KEY=$(printf '%s' "$CREDS" | jq -r .Credentials.SecretAccessKey) -export AWS_SESSION_TOKEN=$(printf '%s' "$CREDS" | jq -r .Credentials.SessionToken) - -# Confirm you're the assumed role, not your admin profile -aws sts get-caller-identity -# → Arn: arn:aws:sts::...:assumed-role/agentkeys-data-role/fed-proof-... - -# 4a. Own prefix — should succeed (empty list is fine, no AccessDenied) -aws s3api list-objects-v2 --bucket "$BUCKET" --prefix "$WALLET/" - -# 4b. KEY MOMENT — someone else's prefix MUST AccessDenied -aws s3api list-objects-v2 --bucket "$BUCKET" --prefix "0xdeadbeef/" -# → AccessDenied -``` +## 5. Broker host: `setup-broker-host.sh` -Step 4b is the property the static-IAM path (§3) cannot prove: cloud-enforced isolation, zero app-side trust required. +§1–§4 set up identifiers. This step stands up the actual processes — broker + mock-server + signer + 4 service workers — on the EC2 host (or any Linux box with public-internet egress + the broker's hostname). -#### Diagnosing intermediate states +### 5.1 Prereqs -If both 4a and 4b succeed, §4.4.1 wasn't applied — the inline-policy `s3:*` grant is still masking the bucket policy. Re-run §4.4.1 and verify `Statement[*].Action` returns only `ses:SendRawEmail`. - -If both 4a and 4b deny (including 4a, your *own* prefix), the broker's JWT isn't carrying the `https://aws.amazon.com/tags` claim, so STS sets no PrincipalTag on the assumed session, so `${aws:PrincipalTag/agentkeys_user_wallet}` in the bucket policy expands to empty and matches nothing. 
Decode the JWT to confirm: - -```bash -jq -R 'split(".") | .[1] | gsub("-";"+") | gsub("_";"/") | - . + ("=" * ((4 - length % 4) % 4)) | @base64d | fromjson' <<<"$JWT" -``` - -Look for a top-level `https://aws.amazon.com/tags` key with `principal_tags.agentkeys_user_wallet` populated. If it's missing, the broker version doesn't yet emit the AWS tags claim and needs to be redeployed. - -### 4.6 (Future) TEE-derived signer swap - -The on-disk ES256 keypair shipped today is a complete v0.1 signer. When [`heima-gaps §3`](./spec/heima-gaps-vs-desired-architecture.md) closes, swap [`crates/agentkeys-broker-server/src/oidc.rs::OidcKeypair::load_or_generate`](../crates/agentkeys-broker-server/src/oidc.rs) for a TEE oracle call. JWKS, JWT shape, STS exchange, and bucket policy stay identical — only the signing backend changes. - ---- - -## 5. EC2 broker host (optional) - -If the broker runs on EC2 (the recommended path for AWS-native deployments), wire DNS + EIP + security group before running [`scripts/setup-broker-host.sh`](../scripts/setup-broker-host.sh) on the box. - -### 5.1 Allocate + attach an Elastic IP - -```bash -EIP_ALLOC=$(aws ec2 allocate-address --domain vpc --region "$REGION" --query AllocationId --output text) -aws ec2 associate-address --region "$REGION" \ - --instance-id --allocation-id "$EIP_ALLOC" -EIP=$(aws ec2 describe-addresses --region "$REGION" \ - --allocation-ids "$EIP_ALLOC" --query 'Addresses[0].PublicIp' --output text) -echo "EIP=$EIP" -``` +- Fresh Linux host with sudo, systemd, public-internet egress, ports 80 + 443 open inbound (for certbot + nginx). +- DNS A records for `${BROKER_HOST}` + `signer.${ZONE}` + `audit.${ZONE}` + `email.${ZONE}` + `cred.${ZONE}` + `memory.${ZONE}` all pointing at the host's public IP. +- AWS credentials in `/etc/agentkeys/broker.env` (the script writes the file template; operator pastes the `agentkeys-daemon` access key from §3.1). 
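The DNS prerequisite covers six hostnames, which are easy to miscount. A minimal sketch for enumerating them follows; `agentkeys_hosts` is a hypothetical helper and the `example.org` values are placeholders, so substitute your own `BROKER_HOST` and `ZONE`.

```bash
#!/usr/bin/env bash
# Sketch only: `agentkeys_hosts` is a hypothetical helper, not a repo script.
# It prints the broker hostname plus the five per-service hostnames listed
# in the prerequisites, one per line.
agentkeys_hosts() {
  local broker_host="$1" zone="$2" sub
  printf '%s\n' "$broker_host"
  for sub in signer audit email cred memory; do
    printf '%s.%s\n' "$sub" "$zone"
  done
}

# Placeholder values (assumptions); use your real BROKER_HOST and ZONE.
agentkeys_hosts "broker.example.org" "example.org"

# With network access, each name could then be resolved and compared with
# the host's public IP, for example:
#   agentkeys_hosts "$BROKER_HOST" "$ZONE" | while read -r h; do
#     echo "$h -> $(dig +short A "$h" | head -1)"
#   done
# Note that resolvers behind some VPNs or proxies can return rewritten
# addresses, so treat a local lookup as a hint rather than proof.
```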
-### 5.2 Wire the A record +### 5.2 Run ```bash -aws route53 change-resource-record-sets --hosted-zone-id "$PARENT_ZONE_ID" \ - --change-batch "$(jq -n --arg name "$BROKER_HOST." --arg ip "$EIP" '{ - Changes: [{ - Action: "UPSERT", - ResourceRecordSet: {Name: $name, Type: "A", TTL: 300, ResourceRecords: [{Value: $ip}]} - }] - }')" +# Bootstrap a fresh host: +sudo bash scripts/setup-broker-host.sh \ + --issuer-url "https://${BROKER_HOST}" \ + --account-id "${ACCOUNT_ID}" \ + --signer-host "signer.${ZONE}" \ + --audit-host "audit.${ZONE}" \ + --email-host "email.${ZONE}" \ + --cred-host "cred.${ZONE}" \ + --memory-host "memory.${ZONE}" \ + --yes -# Verify (use DoH if your local resolver hijacks port 53) -curl -s "https://cloudflare-dns.com/dns-query?name=$BROKER_HOST&type=A" \ - -H 'accept: application/dns-json' | jq '.Answer[0].data' +# After a `git pull`, the same command re-deploys: +sudo bash scripts/setup-broker-host.sh --yes ``` -### 5.3 Open security-group ports 80 + 443 - -Let's Encrypt's HTTP-01 challenge needs port 80 open from anywhere; the broker serves on 443 afterward. SSH (22) should be admin-IP-only. - -```bash -INSTANCE_ID= -SG=$(aws ec2 describe-instances --region "$REGION" --instance-ids "$INSTANCE_ID" \ - --query 'Reservations[0].Instances[0].SecurityGroups[0].GroupId' --output text) - -aws ec2 authorize-security-group-ingress --region "$REGION" --group-id "$SG" \ - --protocol tcp --port 443 --cidr 0.0.0.0/0 -aws ec2 authorize-security-group-ingress --region "$REGION" --group-id "$SG" \ - --protocol tcp --port 80 --cidr 0.0.0.0/0 -``` +The script: +- Builds `agentkeys-broker-server` (+ `auth-email-link` feature), `agentkeys-mock-server`, the 4 service workers, and the signer. +- Creates the `agentkeys` system user + state dir `/var/lib/agentkeys/`. +- Writes the dev_key_service master secret (one-shot at first boot, never rotated — rotation invalidates every previously-derived wallet). 
+- Writes per-worker env files at `/etc/agentkeys/worker-{audit,email,creds,memory}.env`. +- Writes systemd units for broker + signer + each worker, enables + starts. +- Configures nginx vhosts for `${BROKER_HOST}` + `signer.${ZONE}` + 4 worker hosts (skip via `--without-nginx`). +- Runs certbot for first-time TLS cert issuance (skip via `--without-certbot`). +- Mints broker keypairs (oidc + session) under `/var/lib/agentkeys/keys/`. -### 5.4 Bootstrap the host +Auto-detects bootstrap vs upgrade by reading the existing systemd unit's `Environment=` lines. Pass `--ref ` to opt into an in-script `git fetch + pull`. -SSH in as `agentkeys-broker` (via EC2 Instance Connect: `aws ec2-instance-connect ssh --instance-id $INSTANCE_ID`) and run: +### 5.3 Verify ```bash -git clone https://github.com/litentry/agentKeys.git -cd agentKeys -sudo bash scripts/setup-broker-host.sh -# Interactive walk-through; pick instance-profile credential mode -# (assuming §3.4 attached agentkeys-broker-host). +curl -sf "https://${BROKER_HOST}/healthz" # → 200 +curl -sf "https://${BROKER_HOST}/.well-known/openid-configuration" | jq . +curl -sf "https://${BROKER_HOST}/.well-known/jwks.json" | jq '.keys | length' +curl -sf "https://audit.${ZONE}/healthz" # → 200 (and friends) ``` -The script writes systemd units, an HTTP-only nginx config, then prints the certbot command. After cert issuance, re-run the script — it detects the cert file and flips on the `:443` ssl block. - ---- +For full E2E (broker + workers + chain + AWS), run the harness scripts — see [`docs/heima-setup.md`](heima-setup.md) for the chain side and [`docs/ci-setup.md`](ci-setup.md) for the automated path. -## 6. 
Signer host - -| Concern | Today | Future | -|---|---|---| -| Process | `agentkeys-signer.service` (Rust, `agentkeys-mock-server --signer-only`, loopback `:8092`) | TEE worker (issue #74 step 2) | -| Host | **Same EC2 box as the broker** — co-located behind the same nginx, provisioned by the same `setup-broker-host.sh` run | Separate machine (or enclave); only the A record + cert move | -| Public hostname | `signer.` (e.g. `signer.litentry.org`) — exported as `SIGNER_HOST` / `AGENTKEYS_SIGNER_URL` in [`scripts/operator-workstation.env`](../scripts/operator-workstation.env) | `signer.` (unchanged) | -| Endpoints | `/dev/derive-address`, `/dev/sign-message`, `/healthz` only — every request bearer-JWT-authed against the broker session pubkey ([`signer-protocol.md`](spec/signer-protocol.md)) | unchanged | -| Master secret (K3) | `/etc/agentkeys/dev-key-service.env` (mode 0600, owner `agentkeys`) — auto-generated on first `setup-broker-host.sh` run, **never rotated** (rotation invalidates every previously-derived wallet) | TEE-sealed; same wire shape | +## 6. Cleanup -### 6.1 DNS A record +Tear down the whole AgentKeys footprint in one account: ```bash -# === ON OPERATOR WORKSTATION === -SIGNER_HOST="signer.${BROKER_HOST#*.}" - -# If $EIP isn't already set from §5.1, re-derive from AWS — NEVER from -# `dig`. Local resolvers behind Cloudflare WARP / Zscaler / Tailscale / -# corporate VPNs return RFC 2544 "TEST-NET-2" (198.18.0.0/15) for -# proxied hostnames, which silently breaks Let's Encrypt validation. -[ -z "$EIP" ] && EIP=$(aws ec2 describe-addresses --region "$REGION" \ - --query 'Addresses[?AssociationId!=`null`].PublicIp' --output text) -echo "EIP=$EIP" # MUST be a routable public IP, not 198.18.x.x / 10.x.x.x / 100.64.x.x - -aws route53 change-resource-record-sets --hosted-zone-id "$PARENT_ZONE_ID" \ - --change-batch "$(jq -n --arg name "${SIGNER_HOST}." 
--arg ip "$EIP" '{ - Changes: [{Action:"UPSERT", ResourceRecordSet:{Name:$name, Type:"A", TTL:300, ResourceRecords:[{Value:$ip}]}}] - }')" - -# Verify via Cloudflare DoH (your local resolver will keep lying if proxied). -until [ "$(curl -s "https://cloudflare-dns.com/dns-query?name=${SIGNER_HOST}&type=A" \ - -H 'accept: application/dns-json' | jq -r '.Answer[0].data')" = "$EIP" ]; do - echo "waiting for Route 53 propagation (TTL 300s)…"; sleep 5 +# Drain the buckets +for b in "$BUCKET" "agentkeys-vault-${ACCOUNT_ID}" "agentkeys-memory-${ACCOUNT_ID}"; do + aws s3 rm "s3://$b" --recursive 2>/dev/null || true + aws s3api delete-bucket --bucket "$b" --region "$REGION" 2>/dev/null || true done -echo "DNS ready: ${SIGNER_HOST} → ${EIP}" -``` - -### 6.2 TLS cert + nginx flip - -> **`$SIGNER_HOST` is laptop-only** (lives in `operator-workstation.env`). -> On the broker host, derive it from the nginx vhost that `setup-broker-host.sh` -> just wrote — the snippet below does it inline so the commands work in a -> fresh broker shell with no env vars set. - -```bash -# === ON BROKER HOST === -# 1. First pass writes the HTTP-only nginx vhost for signer.. -sudo bash scripts/setup-broker-host.sh --yes - -# Sanity-check + read the hostname back out of the vhost. -ls /etc/nginx/sites-enabled/agentkeys-signer -SIGNER_HOST=$(awk '/server_name/ && /signer\./ {gsub(";",""); print $2}' \ - /etc/nginx/sites-available/agentkeys-signer | head -1) -echo "SIGNER_HOST=$SIGNER_HOST" - -# 2. Issue the LE cert. If the prompt only lists broker., the -# signer vhost wasn't written — re-pull + re-run step 1. -sudo certbot --nginx -d "$SIGNER_HOST" - -# 3. Re-run to flip the signer vhost onto :443 ssl. -sudo bash scripts/setup-broker-host.sh --yes -``` - -### 6.3 Verify - -```bash -# === ON OPERATOR WORKSTATION === -curl -sS "https://$SIGNER_HOST/healthz" -# ok -# Defense-in-depth: signer vhost rejects everything except /dev/* + /healthz. 
-curl -sS -o /dev/null -w '%{http_code}\n' "https://$SIGNER_HOST/session/create" -# 404 -``` - ---- - -## 7. Service workers (audit / email / cred / memory) - -| Concern | Today | Future | -|---|---|---| -| Processes | 4 systemd units: `agentkeys-worker-{audit,email,creds,memory}.service` on `127.0.0.1:{9092,9093,9094,9095}` | Each splits to its own EC2 / IAM principal | -| Host | **Same EC2 box as the broker** — co-located behind the same nginx, provisioned by the same `setup-broker-host.sh` run | Separate machines (or enclaves); only the A records + certs move | -| Public hostnames | `audit.` / `email.` / `cred.` / `memory.` — exported as `WORKER_*_HOST` / `AGENTKEYS_WORKER_*_URL` in [`scripts/operator-workstation.env`](../scripts/operator-workstation.env) | Same hostnames (unchanged) | -| Endpoints | `audit` → `/v1/audit/*` + `/healthz` ; `email` → `/v1/email/*` + `/healthz` ; `cred` → `/v1/cred/*` + `/healthz` ; `memory` → `/v1/memory/*` + `/healthz` | Unchanged | -| KEK material | `/etc/agentkeys/worker-{creds,memory}.env` (mode 0600, owner `agentkeys`) — auto-generated on first `setup-broker-host.sh` run, **never rotated** (rotation invalidates every previously-encrypted blob) | mTLS-derived KEK from the signer | - -### 7.1 DNS — 4 A records in one Route 53 batch - -```bash -# === ON OPERATOR WORKSTATION === -awsp agentkeys-admin # account-owner profile (Route 53 + EC2 read) -set -a; source ./scripts/operator-workstation.env; set +a - -# Single helper — derives EIP from AWS, validates it's not VPN-rewritten, -# UPSERTs all 4 records atomically, waits for INSYNC + Cloudflare DoH -# propagation, then prints the next-step certbot loop. 
-bash scripts/dns-upsert-workers.sh - -# Override knobs: -# --eip 1.2.3.4 # use a known EIP instead of describe-addresses -# --zone-id Z… # override default litentry.org zone -# --ttl 60 # tighter TTL while iterating -# --dry-run # print the change-batch JSON, don't apply -``` - -The script is idempotent (UPSERT replaces if exists, creates if not). Re-running it is a no-op when the records already point at `$EIP`. - -### 7.2 TLS certs + nginx flip - -> The four worker `WORKER_*_HOST` variables are **laptop-only** (set in `operator-workstation.env`). On the broker host, derive them from the nginx vhosts that `setup-broker-host.sh` just wrote — the snippet below does it inline so commands work in a fresh broker shell with no env vars set. +# Roles +for r in agentkeys-data-role agentkeys-vault-role agentkeys-memory-role agentkeys-broker-host; do + for p in $(aws iam list-role-policies --role-name "$r" --query 'PolicyNames[]' --output text 2>/dev/null); do + aws iam delete-role-policy --role-name "$r" --policy-name "$p" + done + aws iam delete-role --role-name "$r" 2>/dev/null || true +done -```bash -# === ON BROKER HOST === -# 1. First pass writes HTTP-only nginx vhosts for all 4 workers. -sudo bash scripts/setup-broker-host.sh --yes +# OIDC provider +aws iam delete-open-id-connect-provider \ + --open-id-connect-provider-arn "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/${BROKER_HOST}" -# Read the 4 hostnames back out of the just-written vhosts. 
-AUDIT_HOST=$(awk '/server_name/ && /audit\./ {gsub(";",""); print $2}' /etc/nginx/sites-available/agentkeys-worker-audit | head -1) -EMAIL_HOST=$(awk '/server_name/ && /email\./ {gsub(";",""); print $2}' /etc/nginx/sites-available/agentkeys-worker-email | head -1) -CRED_HOST=$(awk '/server_name/ && /cred\./ {gsub(";",""); print $2}' /etc/nginx/sites-available/agentkeys-worker-cred | head -1) -MEMORY_HOST=$(awk '/server_name/ && /memory\./ {gsub(";",""); print $2}' /etc/nginx/sites-available/agentkeys-worker-memory | head -1) -echo "AUDIT=$AUDIT_HOST EMAIL=$EMAIL_HOST CRED=$CRED_HOST MEMORY=$MEMORY_HOST" - -# 2. Issue Let's Encrypt certs (webroot mode — does NOT touch nginx config). -for h in "$AUDIT_HOST" "$EMAIL_HOST" "$CRED_HOST" "$MEMORY_HOST"; do - sudo certbot certonly --webroot -w /var/www/certbot -d "$h" \ - --agree-tos -m ops@litentry.org --non-interactive +# Daemon user +for k in $(aws iam list-access-keys --user-name agentkeys-daemon --query 'AccessKeyMetadata[].AccessKeyId' --output text); do + aws iam delete-access-key --user-name agentkeys-daemon --access-key-id "$k" done +aws iam delete-user-policy --user-name agentkeys-daemon --policy-name agentkeys-daemon-assume-role 2>/dev/null || true +aws iam delete-user --user-name agentkeys-daemon -# 3. Re-run to flip each vhost onto :443 ssl. Idempotent — re-runs without -# new certs are no-ops; re-runs after cert issuance flip A → B per host. -sudo bash scripts/setup-broker-host.sh --yes -``` +# SES + DNS +aws ses set-active-receipt-rule-set --rule-set-name "" --region "$REGION" 2>/dev/null || true +aws sesv2 delete-email-identity --email-identity "$MAIL_DOMAIN" --region "$REGION" 2>/dev/null || true +# DNS records are operator-managed (Route 53 / your DNS provider) — delete by hand. 
-### 7.3 Verify - -```bash -# === ON OPERATOR WORKSTATION === -bash scripts/verify-workers.sh - -# Per-worker drilldown if any failed: -curl -sS "https://${WORKER_AUDIT_HOST}/healthz" # → ok -curl -sS "https://${WORKER_EMAIL_HOST}/healthz" # → ok -curl -sS "https://${WORKER_CRED_HOST}/healthz" # → JSON {"ok":true,...} -curl -sS "https://${WORKER_MEMORY_HOST}/healthz" # → JSON {"ok":true,...} - -# Defense-in-depth: each worker vhost only proxies its own /v1//* surface. -curl -sS -o /dev/null -w '%{http_code}\n' "https://${WORKER_AUDIT_HOST}/v1/cred/anything" -# 404 (audit vhost won't proxy /v1/cred) +# EC2 + EIP (manual via console or aws ec2 CLI) ``` ---- - -## 8. Cleanup - -```bash -# OIDC federation (if §4 ran) -aws iam delete-open-id-connect-provider \ - --open-id-connect-provider-arn "$OIDC_PROVIDER_ARN" 2>/dev/null - -# IAM -aws iam delete-role-policy --role-name agentkeys-data-role --policy-name agentkeys-data-role-inline -aws iam delete-role --role-name agentkeys-data-role -for KEY in $(aws iam list-access-keys --user-name agentkeys-daemon --query 'AccessKeyMetadata[*].AccessKeyId' --output text); do - aws iam delete-access-key --user-name agentkeys-daemon --access-key-id "$KEY" -done -aws iam delete-user-policy --user-name agentkeys-daemon --policy-name agentkeys-daemon-assume-role -aws iam delete-user --user-name agentkeys-daemon - -# Optional: the broker-host instance profile -aws iam remove-role-from-instance-profile --instance-profile-name agentkeys-broker-host --role-name agentkeys-broker-host 2>/dev/null -aws iam delete-instance-profile --instance-profile-name agentkeys-broker-host 2>/dev/null -aws iam delete-role-policy --role-name agentkeys-broker-host --policy-name BrokerAssumeData 2>/dev/null -aws iam delete-role --role-name agentkeys-broker-host 2>/dev/null - -# SES + S3 -aws ses set-active-receipt-rule-set --rule-set-name "" --region "$REGION" -aws sesv2 delete-email-identity --region "$REGION" --email-identity "$DOMAIN" -aws s3 rm 
"s3://$BUCKET" --recursive -aws s3api delete-bucket --region "$REGION" --bucket "$BUCKET" - -# DNS records on the parent zone are NOT auto-deleted — you'll need to -# remove the DKIM CNAMEs, MX, SPF, DMARC, and broker A record by hand -# if you want a clean zone. -``` +## Related ---- +- Chain bring-up: [`docs/heima-setup.md`](heima-setup.md) +- CI activation: [`docs/ci-setup.md`](ci-setup.md) +- Broker host script (single entry point): [`scripts/setup-broker-host.sh`](../scripts/setup-broker-host.sh) +- Architecture: [`docs/spec/architecture.md`](spec/architecture.md) §17 (per-data-class buckets), §17.2 (per-bucket IAM role) +- FAQ + troubleshooting: [`wiki/cloud-setup-faq.md`](../wiki/cloud-setup-faq.md) ## Follow-ups tracked elsewhere -- **TEE-BYODKIM** — replace AWS-managed DKIM. Depends on [`heima-gaps §4`](./spec/heima-gaps-vs-desired-architecture.md). -- **TEE-derived OIDC signer** — replace on-disk ES256. Depends on [`heima-gaps §3`](./spec/heima-gaps-vs-desired-architecture.md). -- **Per-address S3 prefix routing** — currently all inbound lands in `inbound/`; per-`/
/` prefix routing wants either a SES Lambda or subdomain receipt rules. -- **GCP / Tencent recipes** — equivalent of §4 against GCP Workload Identity Federation and Tencent CAM. JWT/JWKS shape works cross-cloud unchanged; only the registration step differs. +- Per-recipient routing Lambda hardening: [`TODOS.md`](../TODOS.md) "Disable broker's broad S3-full-access" +- Tencent Cloud SimpleDM + COS port: tracked separately +- TEE-held BYODKIM migration: [`docs/spec/heima-gaps-vs-desired-architecture.md`](spec/heima-gaps-vs-desired-architecture.md) §4 diff --git a/docs/heima-setup.md b/docs/heima-setup.md new file mode 100644 index 0000000..7c537a7 --- /dev/null +++ b/docs/heima-setup.md @@ -0,0 +1,106 @@ +# Heima setup — AgentKeys + +**Audience:** the operator bringing AgentKeys up on a Heima chain (mainnet, Paseo, or local Anvil). +**Scope:** one command that walks the 15-step chain bring-up end-to-end. +**Companion:** [`docs/cloud-setup.md`](cloud-setup.md) for the AWS/broker side. Run cloud-setup first — Heima setup expects [`scripts/operator-workstation.env`](../scripts/operator-workstation.env) to already exist. +**FAQ + troubleshooting:** [`wiki/heima-setup-faq.md`](../wiki/heima-setup-faq.md). + +## TL;DR + +```bash +# Mainnet (default; AGENTKEYS_CHAIN=heima implicit) +AWS_PROFILE=agentkeys-admin bash scripts/setup-heima.sh + +# Paseo testnet (no real HEI cost; Alice sudo funds the deployer) +AWS_PROFILE=agentkeys-admin bash scripts/setup-heima.sh --chain heima-paseo + +# Local Anvil (fully ephemeral, instant finality, free) +AWS_PROFILE=agentkeys-admin bash scripts/setup-heima.sh --chain anvil +``` + +[`scripts/setup-heima.sh`](../scripts/setup-heima.sh) is **the single idempotent entry point** for Heima bring-up. Re-running is safe: every step pre-checks chain state and short-circuits when the work is already a no-op. 
Per-step helpers (`scripts/heima-{bring-up,device-register,agent-create,scope-set,credential-audit,worker-smoke}.sh`) stay callable directly for surgical re-runs. + +## What runs, in order + +| # | Step | Idempotency check | Helper script | +|---|------|-------------------|---------------| +| 1 | Tool sanity-check (jq curl aws cast forge node npx python3 + `agentkeys` binary) | tool presence | — | +| 2 | Source `scripts/operator-workstation.env` | file exists + `REGION` set | — | +| 3 | Chain reachability + `eth_chainId` matches the profile's claim | catches "you said paseo but the RPC is mainnet" footguns | — | +| 4 | Generate/reuse deployer keypair at `~/.agentkeys/${chain}-deployer.key` (0600) | file exists | (inline) | +| 5 | Fund the deployer | balance ≥ floor | [`heima-fund-account.sh`](../scripts/heima-fund-account.sh) | +| 6 | Deploy the 6 stage-1 contracts atomically (P256Verifier → K11Verifier → SidecarRegistry → AgentKeysScope → K3EpochCounter → CredentialAudit) | `cast code` on every claimed address; skip when present | [`heima-bring-up.sh`](../scripts/heima-bring-up.sh) | +| 7 | Persist contract addresses to `operator-workstation.env` namespaced by chain | (sed replace-or-append, no-op when unchanged) | (inside bring-up) | +| 8 | Verify contracts on-chain (read-only RPC: bytecode + ABI + wiring) | always runs, ~3s | [`verify-heima-contracts.sh`](../scripts/verify-heima-contracts.sh) | +| 9 | Register operator master device (first-master bootstrap) | `getDevice.registeredAt > 0` check | [`heima-device-register.sh`](../scripts/heima-device-register.sh) | +| 10 | K11 enrollment (stub bytes by default; `--webauthn` for real Touch ID) | enrollment file exists at `~/.agentkeys/k11/.json` | (inline) | +| 11 | Create demo agent device | `getDevice.registeredAt > 0` check | [`heima-agent-create.sh`](../scripts/heima-agent-create.sh) | +| 12 | Set scope for agent (K11-gated — needs `--webauthn`) | `getScope` config-equality check; skipped without `--webauthn` | 
[`heima-scope-set.sh`](../scripts/heima-scope-set.sh) | +| 13 | Append a credential-audit row (V1 path) | **intentionally append-only** — re-runs add a fresh row | [`heima-credential-audit.sh`](../scripts/heima-credential-audit.sh) | +| 14 | Tier-A audit relay + worker `/healthz` smoke | **intentionally append-only** | [`heima-worker-smoke.sh`](../scripts/heima-worker-smoke.sh) | +| 15 | Summary — print contract addresses + suggested next-step re-runs | always | — | + +## Per-step re-runs + +The orchestrator accepts `--from-step N`, `--to-step N`, and `--only-step N`. Use these to surgically re-run after fixing an issue without re-walking the whole pipeline: + +```bash +# Just re-check the deploy (cast-code idempotency means nothing redeploys +# unless an address is empty) +bash scripts/setup-heima.sh --only-step 6 + +# Re-register the master after rotating the session JWT +bash scripts/setup-heima.sh --only-step 9 + +# Just smoke the workers +bash scripts/setup-heima.sh --only-step 14 +``` + +## Mainnet vs Paseo vs Anvil + +| | `heima` (mainnet) | `heima-paseo` (testnet) | `anvil` (local dev) | +|---|---|---|---| +| Chain ID | 212013 | 2013 | 31337 | +| Cost per deploy | real HEI gas | 0 | 0 | +| Deployer funding | operator's personal wallet (no sudo on mainnet) | Alice sudo via [`heima-fund-account.sh`](../scripts/heima-fund-account.sh) | anvil pre-funds the default key with 10 000 ETH | +| Finality | per chain profile | per chain profile | instant | +| Used by | production | dev / pre-merge sanity | unit tests + ephemeral dev | +| Mainnet deploy guard | requires `MAINNET_CONFIRM=1` env var | — | — | +| Stage-1 K11 stub on this chain | refuses unless `AGENTKEYS_ALLOW_STAGE1_STUBS=1` (per arch.md §22b.1) | allowed | allowed | + +## After a successful run + +`setup-heima.sh` writes the contract addresses to `scripts/operator-workstation.env` under chain-namespaced keys (e.g. `SCOPE_CONTRACT_ADDRESS_HEIMA=0x…`). 
Subsequent steps + the broker workers source the same env file, so no manual copy-paste is needed. + +Verify any time: + +```bash +AGENTKEYS_CHAIN=heima bash scripts/verify-heima-contracts.sh +AGENTKEYS_CHAIN=heima-paseo bash scripts/verify-heima-contracts.sh +``` + +Read-only RPC, zero gas, exits 0 on all-pass. + +## Chain-profile source of truth + +Built-in profiles ship in [`crates/agentkeys-core/chain-profiles/`](../crates/agentkeys-core/chain-profiles/) (`heima.json`, `heima-paseo.json`, `anvil.json`, `base.json`, `base-sepolia.json`, `ethereum.json`, `sepolia.json`). Each carries: RPC URL, chain ID, gas model, default block tag for finality, foundry chain arg. + +To override the RPC for one run without forking a profile: + +```bash +AGENTKEYS_CHAIN_PROFILE_FILE=./my-custom-profile.json bash scripts/setup-heima.sh +``` + +The JSON shape is documented in [`docs/spec/architecture.md`](spec/architecture.md) §22a. + +## Heima EVM version pin + +Heima Frontier runs at London EVM level (pre-Merge). [`crates/agentkeys-chain/foundry.toml`](../crates/agentkeys-chain/foundry.toml) pins `evm_version = "london"` so Foundry's simulator doesn't reject `prevrandao`-less block headers. **Don't change this** without re-verifying against a live Heima block header — see [CLAUDE.md "Heima EVM compatibility level"](../CLAUDE.md) for the verification recipe. 
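A pre-deploy guard along these lines could catch an accidental bump of the pin before `forge script` runs. This is a minimal sketch, not something the repo ships: the function name `check_evm_pin` is hypothetical, and it assumes the pin sits on a single uncommented line of `foundry.toml` in the form quoted above.

```bash
#!/usr/bin/env bash
# Hypothetical guard (not in the repo): succeed only if the given foundry.toml
# pins evm_version = "london" on an uncommented line.
check_evm_pin() {
  local toml="$1"
  grep -Eq '^[[:space:]]*evm_version[[:space:]]*=[[:space:]]*"london"[[:space:]]*(#.*)?$' "$toml"
}

# Usage (path taken from the section above):
#   check_evm_pin crates/agentkeys-chain/foundry.toml \
#     || { echo "evm_version pin changed; re-verify against a live Heima block header" >&2; exit 1; }
```

The check only confirms that the text of the pin is present; it does not replace re-verifying against a live block header when the pin is changed on purpose.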
+ +## Related + +- Cloud / AWS prereqs: [`docs/cloud-setup.md`](cloud-setup.md) +- CI setup: [`docs/ci-setup.md`](ci-setup.md) +- Live contract addresses: [`docs/spec/deployed-contracts.md`](spec/deployed-contracts.md) +- Architecture: [`docs/spec/architecture.md`](spec/architecture.md) §22 (chain profiles), §22b (per-actor binding ceremonies) +- FAQ + troubleshooting: [`wiki/heima-setup-faq.md`](../wiki/heima-setup-faq.md) diff --git a/wiki/ci-setup-faq.md b/wiki/ci-setup-faq.md new file mode 100644 index 0000000..b8af0d3 --- /dev/null +++ b/wiki/ci-setup-faq.md @@ -0,0 +1,96 @@ +# CI setup — FAQ + +Troubleshooting + edge cases for [`docs/ci-setup.md`](https://github.com/litentry/agentKeys/blob/main/docs/ci-setup.md) + [`.github/workflows/harness-ci.yml`](https://github.com/litentry/agentKeys/blob/main/.github/workflows/harness-ci.yml). + +## Q. The `harness-e2e` job always shows "skipped" — what gives? + +That's the designed behavior until `TEST_OIDC_AWS_ROLE_ARN` is set as a repo secret. The preflight job emits a `::warning::` reminder. Until the operator finishes the 7-step bring-up in `docs/ci-setup.md`, only `rust-checks` runs — and that's enough to catch most regressions (600+ tests). + +## Q. `AssumeRoleWithWebIdentity` returns `InvalidIdentityToken: No OpenIDConnect provider found` + +AWS hasn't found the test broker's OIDC provider. Three checks: + +1. The OIDC provider ARN matches the broker's `BROKER_OIDC_ISSUER` byte-for-byte (including scheme and trailing slash). +2. The broker's `.well-known/openid-configuration` is reachable from the public internet (curl from a random box, not just the runner). +3. The IAM trust policy on the test role lists the OIDC provider ARN under `Principal.Federated`. + +## Q. `harness-e2e` runs but stage-3 fails with `AccessDenied` on the cross-actor write + +That's the test working — stage-3 step 5 / 8 / 9 are NEGATIVE tests that EXPECT `AccessDenied`. If they pass-as-success, the workflow exits 0. 
When the AWS call returns `AccessDenied`, the harness script asserts that response (the per-actor + per-data-class invariants from CLAUDE.md) and the step passes. A genuine failure is the script exiting non-zero, not the AWS API returning `AccessDenied`. + +## Q. Concurrent runs collide on S3 writes + +Per-run prefix isolation via `CI_S3_PREFIX=ci/run-${GITHUB_RUN_ID}` should prevent this. If you see it anyway: + +- Confirm `CI_S3_PREFIX` is being honored by every write site in the harness (currently `harness/v2-stage3-demo.sh` honors it; verify if you've added other harness steps). +- Make sure `concurrency.cancel-in-progress: true` is set in the workflow (it is — but a previous-run-in-flight can briefly overlap). + +## Q. Test contract addresses drifted from the secrets + +This happens when the operator redeploys the test contracts (e.g. after a `.sol` source change) but forgets to update the `TEST_*_HEIMA` secrets. Symptoms: stage-1 step 8 (verify-contracts) fails with "no bytecode at $SCOPE_ADDR". + +**Fix:** re-read the addresses from `scripts/operator-workstation.env` post-redeploy and update the six `TEST_*_HEIMA` secrets, either via the GitHub UI or in one pass with the GitHub CLI: + +```bash +for addr in SCOPE_CONTRACT_ADDRESS_HEIMA SIDECAR_REGISTRY_ADDRESS_HEIMA K3_EPOCH_COUNTER_ADDRESS_HEIMA \ + CREDENTIAL_AUDIT_ADDRESS_HEIMA P256_VERIFIER_ADDRESS_HEIMA K11_VERIFIER_ADDRESS_HEIMA; do + val=$(grep "^${addr}=" scripts/operator-workstation.env | cut -d= -f2) + gh secret set "TEST_${addr}" --body "$val" +done +``` + +## Q. The test deployer wallet ran out of HEI + +CI doesn't redeploy on every run (it uses pinned addresses from secrets). The deployer wallet is only spent when the operator manually re-runs `setup-heima.sh` for the test instance. If it does run out: + +```bash +# Check balance +cast balance "$(cast wallet address $(cat ~/.agentkeys/heima-deployer-test.key))" \ + --rpc-url "$(agentkeys chain show heima | jq -r .rpc.http)" + +# Top up from your personal wallet — small float (~1 HEI) is enough +``` + +## Q.
Manual dispatch errors with `inputs.stage` unrecognized + +`workflow_dispatch.inputs` requires the workflow to be on the default branch (or your fork's default). If the workflow file landed on a feature branch, `gh workflow run` may fail. Either land it on `main` first, or push the feature branch and re-target: + +```bash +gh workflow run harness-ci.yml --ref my-branch --field stage=3 +``` + +## Q. Can the workflow run on every PR (not just operator-dispatched)? + +It already does — push + pull_request triggers are wired in `on:` at the top. The gate is `TEST_OIDC_AWS_ROLE_ARN`, not the trigger. Every PR's `rust-checks` job runs unconditionally; the `harness-e2e` job runs only if the secret is set. + +## Q. The workflow won't trigger on a PR from a fork + +GitHub doesn't pass secrets to fork PRs by default — that's a platform security feature. The `harness-e2e` job will preflight-skip on fork PRs even with the secret set. Reviewer needs to push the fork branch to the upstream repo or manually dispatch the workflow from the PR page. + +## Q. `aws-actions/configure-aws-credentials` succeeds but `aws sts get-caller-identity` says `agentkeys-admin` + +You forgot to update the role ARN secret after rotating to OIDC. The default credential chain falls through to whatever AWS profile is on the runner image. Set `TEST_OIDC_AWS_ROLE_ARN` to the GitHub Actions OIDC role ARN (not the admin user ARN), and the OIDC web identity will assume the right role. + +## Q. Why is `--test-threads=1` on `cargo test`? + +Per the existing `@claude` review workflow convention: broker integration tests mutate process-global `$HOME` + `$AWS_*` env, and the keyring tests serialize on a per-process accounts map. Concurrent threads see each other's mutations and flake. Single-threaded test execution is the conservative default; per-test isolation cleanup is a future improvement. + +## Q. CI runs are slow — anything to tune? 
+ +- `Swatinem/rust-cache@v2` with `shared-key: harness-ci` is enabled — both jobs share a cache. +- `concurrency.cancel-in-progress: true` cancels stale runs on a re-push. +- Foundry toolchain is the slowest install; pin to `version: stable` for cache hits. +- The 60-minute timeout on `harness-e2e` is generous; typical run is 20–30 min. + +If runs still feel slow, profile with `gh run view --log-failed | head -50` to find the longest step. + +## Q. Where do I read the harness logs after a failure? + +Each harness script writes a temp dir under `/tmp/agentkeys-*`. The workflow uploads `/tmp/agentkeys-ci-ephemeral-*/` as the `ephemeral-stack-logs` artifact on failure (for the harness-e2e job). Download via `gh run download `. + +## Related + +- Operator runbook: [docs/ci-setup.md](https://github.com/litentry/agentKeys/blob/main/docs/ci-setup.md) +- Workflow file: [.github/workflows/harness-ci.yml](https://github.com/litentry/agentKeys/blob/main/.github/workflows/harness-ci.yml) +- Cloud setup FAQ: [cloud-setup-faq](./cloud-setup-faq.md) +- Heima setup FAQ: [heima-setup-faq](./heima-setup-faq.md) diff --git a/wiki/cloud-setup-faq.md b/wiki/cloud-setup-faq.md new file mode 100644 index 0000000..2beca18 --- /dev/null +++ b/wiki/cloud-setup-faq.md @@ -0,0 +1,94 @@ +# Cloud setup — FAQ + +Troubleshooting + edge cases that didn't fit in [`docs/cloud-setup.md`](https://github.com/litentry/agentKeys/blob/main/docs/cloud-setup.md). Use ⌘F to find your error. + +## Q. `setup-broker-host.sh` says "BROKER_OIDC_ISSUER mismatch" on re-run + +The script auto-detects an existing systemd unit and reads `Environment=` lines to decide bootstrap-vs-upgrade. If you ran with a different `--issuer-url` previously and the AWS OIDC provider was already registered for the old URL, the new run refuses. + +**Fix:** decide which URL is canonical. 
AWS validates the OIDC issuer URL byte-for-byte against the JWT `iss` claim, so the issuer URL is effectively immutable once the IAM trust policy is built. Either: +- Re-run with the OLD `--issuer-url` (the trust policy already matches). +- Or delete the OIDC provider, redo §4 from cloud-setup.md, and re-run with the NEW URL. + +## Q. nginx 502 after a fresh `setup-broker-host.sh` run + +systemd may have started the broker before nginx finished its first `systemctl reload`. Two-step fix: + +```bash +sudo systemctl status agentkeys-broker # → active (running) +sudo systemctl restart nginx # picks up the new vhost +curl -sf https://${BROKER_HOST}/healthz # → 200 +``` + +If the broker itself is failing to boot, `journalctl -u agentkeys-broker -n 50` is authoritative. + +## Q. `verify_sender_ready` precheck fails at broker boot + +The broker calls SES `GetEmailIdentity` on `BROKER_EMAIL_FROM_ADDRESS` at startup. If the SES domain identity isn't verified yet, boot refuses. Run [`scripts/ses-verify-sender.sh`](https://github.com/litentry/agentKeys/blob/main/scripts/ses-verify-sender.sh) and wait for the DKIM tokens to propagate (5–30 min typical), then restart the broker. + +## Q. `aws iam create-open-id-connect-provider` returns `EntityAlreadyExistsException` + +The OIDC provider already exists. Verify with: + +```bash +aws iam list-open-id-connect-providers \ + | jq -r '.OpenIDConnectProviderList[].Arn' \ + | grep "${BROKER_HOST}" +``` + +If the ARN is correct, you're done — the trust policy and bucket policy from §4.3/§4.4 are the only steps that remain. + +## Q. `AccessDenied` from S3 even though the role + bucket policy look right + +Three things almost always: + +1. The role's **inline policy** still has the broad-bucket grant from §3.5 — strip it via §4.4.1. +2. The bucket policy's `s3:prefix` condition needs the `${aws:PrincipalTag/agentkeys_actor_omni}` interpolation to be lowercased — addresses are case-sensitive in policy string comparisons. +3. 
`s3:ListBucket` needs the `s3:prefix=bots/${PrincipalTag}//*` condition in a separate statement (the v3 split-statement bucket policy from codex P2). Listing the bucket root without that condition always returns AccessDenied. + +CloudTrail's `Decision` field tells you which statement evaluated. + +## Q. Per-profile default region trap (real 2026-05-12 incident) + +`agentkeys-admin` defaults to `us-west-2`; `agentkeys-broker` / `agentkeys-daemon` default to `us-east-1`. Every regional CLI call must pass `--region "$REGION"` explicitly. The CLAUDE.md "Per-profile default region is NOT uniform" section covers this in detail. + +## Q. Cert renewal failed silently — workflow turned red overnight + +Let's Encrypt certificates are valid for 90 days, and certbot by default attempts renewal once a certificate is within 30 days of expiry. If renewals keep failing until the certificate expires (often: rate limit, DNS-01 hiccup, port 80 firewall block), AWS stops trusting the OIDC issuer (TLS chain breaks). Symptoms: + +- `harness-e2e` CI job fails on the first `curl https://${BROKER_HOST}` with a TLS error. +- `journalctl -u certbot-renew` shows the failure reason. + +**Recovery:** rerun `sudo certbot renew --force-renewal` (works for transient rate-limit issues), or fix the DNS / firewall and re-run. The broker doesn't need to restart — nginx reloads automatically. + +## Q. Switching AWS accounts for the test instance + +Same-account is fine — isolation comes from the `-test` suffix, not from the AWS account boundary. If you want hard account isolation, every reference to `${ACCOUNT_ID}` in cloud-setup.md becomes `${TEST_ACCOUNT_ID}`, including the role ARN that the broker assumes via OIDC. The setup-broker-host.sh script accepts `--account-id` to point at a different account. + +## Q. Tencent Cloud port? + +§2.2 of cloud-setup.md sketches SimpleDM + COS as the swap-in at the §3+ boundary.
The boundary is real — DNS + inbound mail are the only AWS-specific layers; everything from `agentkeys-data-role` onward is provider-agnostic in shape, with COS providing S3-compatible PutObject/GetObject and Tencent's IAM providing OIDC federation. Real port work is tracked separately. + +## Q. Can I run the broker without nginx? + +Yes — `setup-broker-host.sh --without-nginx --without-certbot` skips both. You're then responsible for TLS termination upstream (CloudFront, ALB, custom reverse proxy). AWS still needs to fetch the OIDC discovery + JWKS over public TLS, so whatever fronts the broker must serve `https://${BROKER_HOST}/.well-known/*` with a valid leaf cert. + +## Q. The systemd unit was hand-edited and now setup-broker-host.sh refuses + +Per CLAUDE.md "Remote broker host (single entry point)" — don't hand-edit. To recover: + +```bash +sudo systemctl stop agentkeys-broker +sudo rm /etc/systemd/system/agentkeys-broker.service +sudo systemctl daemon-reload +sudo bash scripts/setup-broker-host.sh --yes +``` + +The script rewrites the unit clean. If you had a legitimately custom field, add a `--*-host` or `--cred-mode` flag to the script and re-run — that's how all per-host overrides ship. 
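Before removing a hand-edited unit, it may help to keep a copy so that any custom fields can be compared against the regenerated file and turned into script flags. The following is a minimal sketch under stated assumptions: the helper name `backup_unit` and the backup directory are hypothetical and are not part of `setup-broker-host.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): copy a unit file into a backup
# directory with a UTC timestamp suffix, then print the path of the copy.
backup_unit() {
  local unit="$1" dest_dir="$2"
  local stamp target
  stamp=$(date -u +%Y%m%dT%H%M%SZ)
  target="$dest_dir/$(basename "$unit").$stamp"
  mkdir -p "$dest_dir"
  cp "$unit" "$target"
  printf '%s\n' "$target"
}

# Usage, before the `rm` in the recovery steps above (assumes the unit file
# is readable by the current user):
#   saved=$(backup_unit /etc/systemd/system/agentkeys-broker.service "$HOME/unit-backups")
#   ...after re-running setup-broker-host.sh:
#   diff "$saved" /etc/systemd/system/agentkeys-broker.service
```

Any line that shows up only in the saved copy is a candidate for a flag on the script rather than another hand edit.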
+ +## Related + +- Operator runbook: [docs/cloud-setup.md](https://github.com/litentry/agentKeys/blob/main/docs/cloud-setup.md) +- Single entry point: [scripts/setup-broker-host.sh](https://github.com/litentry/agentKeys/blob/main/scripts/setup-broker-host.sh) +- Heima chain FAQ: [heima-setup-faq](./heima-setup-faq.md) +- CI FAQ: [ci-setup-faq](./ci-setup-faq.md) diff --git a/wiki/heima-setup-faq.md b/wiki/heima-setup-faq.md new file mode 100644 index 0000000..9281843 --- /dev/null +++ b/wiki/heima-setup-faq.md @@ -0,0 +1,111 @@ +# Heima setup — FAQ + +Troubleshooting + edge cases for [`docs/heima-setup.md`](https://github.com/litentry/agentKeys/blob/main/docs/heima-setup.md) + [`scripts/setup-heima.sh`](https://github.com/litentry/agentKeys/blob/main/scripts/setup-heima.sh). + +## Q. `chain mismatch: profile says chain_id=X but RPC reports Y` + +Step 3 caught a misconfigured RPC. Usually means `AGENTKEYS_CHAIN=heima` is set but the chain profile's `rpc.http` points at Paseo (or vice versa). Either: + +- Edit the chain profile JSON in [`crates/agentkeys-core/chain-profiles/`](https://github.com/litentry/agentKeys/tree/main/crates/agentkeys-core/chain-profiles). +- Override per-run via `AGENTKEYS_CHAIN_PROFILE_FILE=./my-profile.json`. + +Never set `AGENTKEYS_CHAIN=heima` and then point at a Paseo RPC — many downstream balance / nonce reads will return wrong-chain data. + +## Q. Step 6 says "deploy skipped" but I expect a fresh deploy + +`heima-bring-up.sh` runs `cast code` on every claimed address in `operator-workstation.env` and short-circuits if all six addresses already have bytecode on chain. 
Force a redeploy with: + +```bash +# Clear the saved addresses for this chain, then re-run +PROFILE_UC=$(printf '%s' "${AGENTKEYS_CHAIN:-heima}" | tr 'a-z-' 'A-Z_') +sed -i.bak "/^.*_CONTRACT_ADDRESS_${PROFILE_UC}=.*/d" scripts/operator-workstation.env +bash scripts/setup-heima.sh --only-step 6 +``` + +Mainnet deploys cost real HEI — confirm you actually want a redeploy before clearing. + +## Q. Mainnet deploy refuses with "MAINNET_CONFIRM=1 required" + +The mainnet path has a paranoid guard against accidental redeploys. Pass `MAINNET_CONFIRM=1` only when you're sure: + +```bash +MAINNET_CONFIRM=1 AGENTKEYS_CHAIN=heima bash scripts/setup-heima.sh --only-step 6 +``` + +## Q. Paseo step 5 (fund deployer) hangs + +Paseo collators were halted at block 2,905,430 (frozen since 2026-01-15 per CLAUDE.md). When they're down, `heima-fund-account.sh` can't reach the chain. Three options: + +- Wait for the parachain to recover. +- Switch to `--chain anvil` for local dev work. +- Switch to `--chain heima` mainnet (fund from your personal wallet — no sudo on mainnet). + +## Q. K11 enrollment stub refuses on mainnet + +Per arch.md §22b.1: stage-1 K11 stub on mainnet requires `AGENTKEYS_ALLOW_STAGE1_STUBS=1`. The flag exists to keep accidental stub enrollments off mainnet — the on-chain `length != 0` gate accepts stubs but the bytes aren't cryptographically bound. + +For real Touch ID: + +```bash +bash scripts/setup-heima.sh --webauthn +``` + +For one-time deliberate stub on mainnet (dev / debug): + +```bash +AGENTKEYS_ALLOW_STAGE1_STUBS=1 bash scripts/setup-heima.sh +``` + +## Q. Step 12 (scope set) skipped — what now? + +Step 12 needs a real K11 ceremony (master-mutation, not just creation). Re-run the orchestrator with `--webauthn`, or invoke `heima-scope-set.sh --webauthn` directly: + +```bash +bash scripts/heima-scope-set.sh \ + --webauthn \ + --agent demo-agent \ + --services openrouter \ + --session-id alice +``` + +## Q. Why are steps 13 + 14 "intentionally append-only"? 
+ +The audit log + tier-A relay are designed to grow. Each re-run advances `entryCount` and adds a fresh row — that's the audit trail working as intended, not a regression. If you re-run setup-heima.sh weekly for sanity, the audit log will accumulate ~weekly rows. + +To check the entry count any time: + +```bash +cast call "$CREDENTIAL_AUDIT_ADDRESS_HEIMA" "entryCount()(uint256)" \ + --rpc-url "$(agentkeys chain show heima | jq -r .rpc.http)" +``` + +## Q. Per-step re-run fails with "missing session JWT" + +Steps 9–13 read `~/.agentkeys/${SESSION_ID}/session.json` to derive the operator's `actor_omni`. If the JWT expired or was deleted, re-mint: + +```bash +agentkeys init --session-id alice --email alice@example.com +``` + +Then re-run the orchestrator from the failing step. + +## Q. `forge script` errors with "header validation error: `prevrandao` not set" + +Heima Frontier is at London EVM level (pre-Merge). [`crates/agentkeys-chain/foundry.toml`](https://github.com/litentry/agentKeys/blob/main/crates/agentkeys-chain/foundry.toml) must pin `evm_version = "london"`. If you bumped it for unrelated reasons, revert. The full diagnosis is in CLAUDE.md "Heima EVM compatibility level". + +## Q. Anvil contract addresses are different every run — is that wrong? + +No. Anvil starts fresh per process; the deterministic deployer key + nonce-0 still produces the canonical first address (`0x5FbDB2315678afecb367f032d93F642f64180aa3` for P256Verifier), but `operator-workstation.env`'s pinned addresses are for the persistent chains (heima / heima-paseo), not for anvil. The `verify-heima-contracts.sh` flow + chain-namespaced env keys handle this — anvil reuses the deploy-time addresses for the lifetime of one anvil process. + +## Q. I want to redeploy ONLY one contract + +The atomic deploy is by design — each downstream contract takes the prior address via constructor, so partial redeploys break wiring. 
If you need a single-contract upgrade, use a proxy pattern (out of scope for stage-1) or do a full redeploy + update the env file. + +## Related + +- Operator runbook: [docs/heima-setup.md](https://github.com/litentry/agentKeys/blob/main/docs/heima-setup.md) +- Orchestrator: [scripts/setup-heima.sh](https://github.com/litentry/agentKeys/blob/main/scripts/setup-heima.sh) +- Per-step helpers: [scripts/heima-*.sh](https://github.com/litentry/agentKeys/tree/main/scripts) +- Live contract addresses: [docs/spec/deployed-contracts.md](https://github.com/litentry/agentKeys/blob/main/docs/spec/deployed-contracts.md) +- Cloud setup FAQ: [cloud-setup-faq](./cloud-setup-faq.md) +- CI setup FAQ: [ci-setup-faq](./ci-setup-faq.md) From 5b88bb6b595f8ada1092bf2195d8d62c6f9f996e Mon Sep 17 00:00:00 2001 From: wildmeta-agent Date: Thu, 21 May 2026 10:31:37 +0800 Subject: [PATCH 4/4] docs: extract first-time cloud bootstrap into separate doc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Per operator request: the very-beginning cloud-account provisioning (IAM users + role, DNS, SES, S3 buckets, instance profile) needs to live in a separate doc so it stays reachable when: - Adding a second AWS account (test instance, regional shard) - Migrating to AliCloud / GCP / Tencent Cloud - Re-bootstrapping after a teardown - Auditing the identity surface The previous condense pass collapsed those sections into cloud-setup.md's slim §1-§3 — convenient for day-to-day operators but stripped the depth needed for the migration / second-account use cases. 
What changed: docs/cloud-bootstrap.md — NEW, 365 lines First-time, per-account, cloud-provider-portable bootstrap doc: §1 Identities — four IAM principals, cloud-agnostic §2 Domain + DNS — subdomain map, parent-zone confirm §3 Email backend — SES domain verify + receipt rule + inbound S3 bucket creation §4 IAM users + roles — agentkeys-daemon + agentkeys-data-role + per-data-class vault/memory roles §5 Initial bucket policy — static-IAM variant (pre-OIDC) §6 Instance profile — agentkeys-broker-host (EC2 optional) §7 Security audit — strip legacy over-broad attached policies (`AmazonS3FullAccess` checklist from the pre-condense §3.4a) §8 Cloud-provider port — AWS / AliCloud / GCP / Tencent Cloud 1:1 mapping table + migration playbook Restores the operational depth (DKIM bulk-record bash, daemon user create, role trust shape, broker-host instance profile, security audit) that the previous condense pass removed. Adds the portability framing (concept first, AWS-specific commands as ONE implementation) so the doc is the durable reference for non-AWS deployments. docs/cloud-setup.md — UPDATE, 314 → 202 lines Refocus on what comes AFTER bootstrap: OIDC federation activation (§1, was §4) + the setup-broker-host.sh runtime entry point (§2, was §5) + cleanup (§3, was §6). Drop the duplicate §1-§3 prereqs; add a clear cross-ref to cloud-bootstrap.md at the top. Section numbers renumbered. wiki/cloud-setup-faq.md — minor header tweak The FAQ now covers both cloud-bootstrap.md and cloud-setup.md (operators hit the same gotchas across both phases). Constraints applied: - Concise: every doc still fits in a few screens (bootstrap is longest at 365 lines because it carries the actual provisioning commands; cloud-setup.md is now 202 lines, down from 970 originally). - Idempotent: every flow uses the existing idempotent helper scripts. - No project credentials exposed: same placeholder convention as the prior pass (${ACCOUNT_ID}, ${ZONE}, etc.). Verified via grep. 
All internal links verified (python url-walker). --- docs/cloud-bootstrap.md | 365 ++++++++++++++++++++++++++++++++++++++++ docs/cloud-setup.md | 171 +++---------------- wiki/cloud-setup-faq.md | 7 +- 3 files changed, 397 insertions(+), 146 deletions(-) create mode 100644 docs/cloud-bootstrap.md diff --git a/docs/cloud-bootstrap.md b/docs/cloud-bootstrap.md new file mode 100644 index 0000000..e5048b4 --- /dev/null +++ b/docs/cloud-bootstrap.md @@ -0,0 +1,365 @@ +# Cloud bootstrap — AgentKeys + +**Audience:** the operator standing up a brand-new cloud account to host AgentKeys for the first time, or porting the deployment to a new cloud provider (AliCloud, GCP, Tencent Cloud). +**Scope:** the per-account, run-once provisioning that has to happen **before** anything in [`docs/cloud-setup.md`](cloud-setup.md), [`docs/heima-setup.md`](heima-setup.md), or [`docs/ci-setup.md`](ci-setup.md) can run. Identifiers (DNS names, IAM principals, mail backend, object store, initial bucket policy) — never runtime processes. +**FAQ + troubleshooting:** [`wiki/cloud-setup-faq.md`](../wiki/cloud-setup-faq.md). + +After this doc is run, the operator returns here ONLY when: +- Switching cloud providers (e.g. AWS → AliCloud) +- Adding a second AWS account (test instance, regional shard) +- Re-bootstrapping after a teardown +- Auditing the identity surface (the security-audit checklist in §7) + +The day-to-day broker re-deploys live in [`docs/cloud-setup.md`](cloud-setup.md) §5 (`setup-broker-host.sh`); they never re-enter this doc. 
+ +## TL;DR — operator flow + +``` +§1 Identities — four IAM principals; concept first, then provider commands +§2 Domain + DNS — subdomain ownership; parent-zone confirmation +§3 Email backend — SES domain identity + receipt rule + S3 inbound bucket +§4 IAM users + roles — agentkeys-{admin,broker,daemon} + agentkeys-data-role +§5 Bucket policy — static-IAM variant (pre-OIDC; replaced in cloud-setup.md §4) +§6 Instance profile — agentkeys-broker-host (optional, EC2-only) +§7 Security audit — strip legacy over-broad attached policies +§8 Cloud portability — AWS → AliCloud / GCP / Tencent Cloud mapping +``` + +```bash +# Per-account shell vars used throughout. Source from operator-workstation.env +# wherever possible; placeholders here for clarity. +awsp agentkeys-admin +aws sts get-caller-identity # → agentkeys-admin + +export REGION=us-east-1 # SES inbound regions: us-east-1, us-west-2, eu-west-1 +export MAIL_DOMAIN=bots.${ZONE} # SES inbound subdomain +export BROKER_HOST=broker.${ZONE} # broker TLS-terminating reverse proxy +export PARENT_ZONE_ID=ZXXXXXXXXXXXXX # existing parent zone (Route 53 / AliCloud / etc.) +export ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text) +export BUCKET=agentkeys-mail-${ACCOUNT_ID} # global-unique by account-id suffix +``` + +> **Why `jq -n --arg` and not `cat > file.json < Why "data role" and not "agent role": the project word "agent" already means three things (the AI agent, the AgentKeys product, an IAM role). The role holds **data-plane** permissions. The broker still accepts the legacy `BROKER_AGENT_ROLE_ARN` env var for backwards compatibility. + +## §2 Domain + DNS + +Six subdomains under the operator's parent zone (substitute `${ZONE}` everywhere): + +| Host | Purpose | Provisioned in | +|---|---|---| +| `${MAIL_DOMAIN}` (e.g. `bots.${ZONE}`) | SES / email backend inbound | §3 | +| `${BROKER_HOST}` (e.g. 
`broker.${ZONE}`) | Broker public reverse proxy | §2.1 of cloud-setup.md | +| `signer.${ZONE}` | Signer service (issue #74 step 1b) | §2.1 of cloud-setup.md | +| `audit.${ZONE}` / `email.${ZONE}` / `cred.${ZONE}` / `memory.${ZONE}` | Service workers (issue #90) | §2.1 of cloud-setup.md (dev co-location on broker EIP today) | + +Confirm the parent zone is reachable before any record changes (AWS Route 53 example; the same `get-hosted-zone` shape exists on AliCloud DNS + Cloud DNS): + +```bash +aws route53 get-hosted-zone --id "$PARENT_ZONE_ID" \ + --query 'HostedZone.{name:Name, private:Config.PrivateZone}' +# → {"name": "${ZONE}.", "private": false} +``` + +The bulk service-worker A-record creation is automated by [`scripts/dns-upsert-workers.sh`](../scripts/dns-upsert-workers.sh) (AWS Route 53 today). For other providers, replicate the same shape — the hostnames are the migration seam. + +## §3 Email backend + +### §3.1 Verify the SES domain identity (AWS) + +```bash +aws sesv2 create-email-identity \ + --region "$REGION" --email-identity "$MAIL_DOMAIN" \ + --dkim-signing-attributes NextSigningKeyLength=RSA_2048_BIT +``` + +Then publish DKIM + SPF + DMARC + MX records in one DNS change.
AWS Route 53: + +```bash +read -r T1 T2 T3 <<<"$(aws sesv2 get-email-identity --region "$REGION" \ + --email-identity "$MAIL_DOMAIN" --query 'DkimAttributes.Tokens' --output text)" + +aws route53 change-resource-record-sets --hosted-zone-id "$PARENT_ZONE_ID" \ + --change-batch "$(jq -n \ + --arg domain "$MAIL_DOMAIN" --arg region "$REGION" \ + --arg t1 "$T1" --arg t2 "$T2" --arg t3 "$T3" '{ + Comment: "AgentKeys email infra for \($domain)", + Changes: [ + {Action:"UPSERT", ResourceRecordSet:{Name:"\($t1)._domainkey.\($domain)", Type:"CNAME", TTL:300, ResourceRecords:[{Value:"\($t1).dkim.amazonses.com"}]}}, + {Action:"UPSERT", ResourceRecordSet:{Name:"\($t2)._domainkey.\($domain)", Type:"CNAME", TTL:300, ResourceRecords:[{Value:"\($t2).dkim.amazonses.com"}]}}, + {Action:"UPSERT", ResourceRecordSet:{Name:"\($t3)._domainkey.\($domain)", Type:"CNAME", TTL:300, ResourceRecords:[{Value:"\($t3).dkim.amazonses.com"}]}}, + {Action:"UPSERT", ResourceRecordSet:{Name:$domain, Type:"MX", TTL:300, ResourceRecords:[{Value:"10 inbound-smtp.\($region).amazonaws.com"}]}}, + {Action:"UPSERT", ResourceRecordSet:{Name:$domain, Type:"TXT", TTL:300, ResourceRecords:[{Value:"\"v=spf1 include:amazonses.com -all\""}]}}, + {Action:"UPSERT", ResourceRecordSet:{Name:"_dmarc.\($domain)", Type:"TXT", TTL:300, ResourceRecords:[{Value:"\"v=DMARC1; p=quarantine; rua=mailto:dmarc@\($domain)\""}]}} + ] + }')" +``` + +Wait ~5 min for DKIM propagation, then verify: + +```bash +aws sesv2 get-email-identity --region "$REGION" --email-identity "$MAIL_DOMAIN" \ + --query '{verified: VerifiedForSendingStatus, dkim: DkimAttributes.Status}' +# → {"verified": true, "dkim": "SUCCESS"} +``` + +> **DKIM key custody:** in this interim setup, the email service holds the private DKIM key (AWS-internal on SES, AliCloud-internal on DirectMail, etc.). Trust surface = provider could forge mail signed as us → bounded blast radius (reputation, not user-data custody). 
Migration target is TEE-held BYODKIM — track in [`docs/spec/heima-gaps-vs-desired-architecture.md`](spec/heima-gaps-vs-desired-architecture.md) §4. Do **not** intermediate-step to "BYODKIM with file-stored key" (strictly worse than provider-managed). + +### §3.2 Create the S3 bucket for inbound mail + +```bash +aws s3api create-bucket \ + --region "$REGION" --bucket "$BUCKET" \ + $([ "$REGION" != "us-east-1" ] && echo "--create-bucket-configuration LocationConstraint=$REGION") + +aws s3api put-public-access-block --region "$REGION" --bucket "$BUCKET" \ + --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true + +# 30-day TTL on inbound objects (throwaway-inbox model) +aws s3api put-bucket-lifecycle-configuration --region "$REGION" --bucket "$BUCKET" \ + --lifecycle-configuration "$(jq -n '{ + Rules: [{ID:"inbound-30d-ttl", Status:"Enabled", Filter:{Prefix:"inbound/"}, Expiration:{Days:30}}] + }')" +``` + +### §3.3 Create the SES receipt rule + +```bash +aws ses create-receipt-rule-set --rule-set-name agentkeys --region "$REGION" 2>/dev/null || true +aws ses create-receipt-rule --region "$REGION" --rule-set-name agentkeys \ + --rule "$(jq -n --arg domain "$MAIL_DOMAIN" --arg bucket "$BUCKET" '{ + Name: "agentkeys-inbound", Enabled: true, ScanEnabled: true, TlsPolicy: "Optional", + Recipients: [$domain], + Actions: [{S3Action: {BucketName: $bucket, ObjectKeyPrefix: "inbound/"}}] + }')" +aws ses set-active-receipt-rule-set --rule-set-name agentkeys --region "$REGION" +``` + +Inbound MIME lands at `s3://$BUCKET/inbound/`. First object: `AMAZON_SES_SETUP_NOTIFICATION` (provider's "I successfully wrote to your bucket" marker). Real mail follows. + +**Sandbox vs production sending:** inbound is unaffected by SES sandbox; **outbound** to arbitrary addresses needs Console → Support → "SES Sending Limits" → "Request Production Access". 
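
As a hedged aid for §3.1, the sketch below prints the six records that the change batch upserts, one `NAME TYPE VALUE` triple per line, so they can be compared by eye against `dig` output after propagation. `ses_dns_records` is a hypothetical helper, not something in the repo, and `bots.example.com` plus the `tokA`/`tokB`/`tokC` tokens are placeholder values; the record shapes are copied from the `jq` change batch in §3.1.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): list the DNS records from the
# §3.1 change batch. Arguments: mail domain, SES region, then DKIM tokens.
ses_dns_records() {
  local domain=$1 region=$2
  shift 2
  local token
  # One DKIM CNAME per token returned by `aws sesv2 get-email-identity`.
  for token in "$@"; do
    printf '%s._domainkey.%s CNAME %s.dkim.amazonses.com\n' \
      "$token" "$domain" "$token"
  done
  # Inbound MX, SPF, and DMARC, matching the values used in §3.1.
  printf '%s MX 10 inbound-smtp.%s.amazonaws.com\n' "$domain" "$region"
  printf '%s TXT "v=spf1 include:amazonses.com -all"\n' "$domain"
  printf '_dmarc.%s TXT "v=DMARC1; p=quarantine; rua=mailto:dmarc@%s"\n' \
    "$domain" "$domain"
}

# Placeholder inputs for illustration only.
ses_dns_records bots.example.com us-east-1 tokA tokB tokC
```

In practice the arguments would be `"$MAIL_DOMAIN" "$REGION" "$T1" "$T2" "$T3"` from the §3.1 shell session.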
+ +## §4 IAM users + roles + +### §4.1 `agentkeys-daemon` — broker runtime user + +```bash +aws iam create-user --user-name agentkeys-daemon +aws iam create-access-key --user-name agentkeys-daemon +# → save AccessKeyId + SecretAccessKey to your secret manager. NEVER to git. + +aws iam put-user-policy --user-name agentkeys-daemon \ + --policy-name agentkeys-daemon-assume-role \ + --policy-document "$(jq -n --arg acct "$ACCOUNT_ID" '{ + Version:"2012-10-17", + Statement:[{ + Effect:"Allow", Action:"sts:AssumeRole", + Resource:"arn:aws:iam::\($acct):role/agentkeys-data-role" + }] + }')" +``` + +The daemon user can do exactly one thing: assume `agentkeys-data-role`. Any storage / email action goes through the role's permissions, never the user's. + +### §4.2 `agentkeys-data-role` (static-IAM-user trust variant) + +The role's trust policy starts with the static-IAM-user variant. After the broker is publicly reachable, [`docs/cloud-setup.md`](cloud-setup.md) §1 swaps it for the OIDC-federated variant.
+ +```bash +aws iam create-role --role-name agentkeys-data-role \ + --assume-role-policy-document "$(jq -n --arg acct "$ACCOUNT_ID" '{ + Version:"2012-10-17", + Statement:[{ + Effect:"Allow", + Principal:{AWS:"arn:aws:iam::\($acct):user/agentkeys-daemon"}, + Action:"sts:AssumeRole" + }] + }')" + +aws iam put-role-policy --role-name agentkeys-data-role \ + --policy-name agentkeys-data-role-inline \ + --policy-document "$(jq -n \ + --arg bucket "$BUCKET" --arg region "$REGION" \ + --arg acct "$ACCOUNT_ID" --arg domain "$MAIL_DOMAIN" '{ + Version:"2012-10-17", + Statement:[ + {Effect:"Allow", Action:"s3:ListBucket", Resource:"arn:aws:s3:::\($bucket)"}, + {Effect:"Allow", Action:"s3:GetObject", Resource:"arn:aws:s3:::\($bucket)/*"}, + {Effect:"Allow", Action:["ses:SendEmail","ses:GetEmailIdentity"], + Resource:["arn:aws:ses:\($region):\($acct):identity/\($domain)", + "arn:aws:ses:\($region):\($acct):identity/*@\($domain)"]} + ] + }')" + +export ROLE_ARN=$(aws iam get-role --role-name agentkeys-data-role --query 'Role.Arn' --output text) +echo "ROLE_ARN=$ROLE_ARN" +``` + +### §4.3 Per-data-class roles (`agentkeys-vault-role`, `agentkeys-memory-role`) + +Per arch.md §17.2: separate roles for credentials + memory data classes. Same trust shape as §4.2, distinct inline policies + PrincipalTag scoping. Provisioned by per-data-class helpers (idempotent): + +```bash +bash scripts/provision-vault-bucket.sh # agentkeys-vault-${ACCOUNT_ID} +bash scripts/provision-vault-role.sh # agentkeys-vault-role +bash scripts/apply-vault-bucket-policy.sh # v3 split-statement PrincipalTag policy + +bash scripts/provision-memory-bucket.sh +bash scripts/provision-memory-role.sh +bash scripts/apply-memory-bucket-policy.sh + +bash scripts/cleanup-mail-bucket-policy.sh # restore email-only grants on $BUCKET +``` + +These scripts are the **source of truth** for the policy shape — read them, don't transcribe. 
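
The §4 policies interpolate `$ACCOUNT_ID` straight into ARNs, so an empty or mistyped value yields a syntactically valid but wrong policy document. A minimal guard, assuming the standard 12-digit AWS account ID format, could look like the sketch below. `role_arn` is a hypothetical helper and `123456789012` is a placeholder account ID.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): build an IAM role ARN, refusing
# account IDs that are not exactly 12 digits.
role_arn() {
  local acct=$1 role=$2
  if [[ ! $acct =~ ^[0-9]{12}$ ]]; then
    echo "role_arn: ACCOUNT_ID must be 12 digits, got '${acct}'" >&2
    return 1
  fi
  printf 'arn:aws:iam::%s:role/%s\n' "$acct" "$role"
}

# Placeholder account ID for illustration only.
role_arn 123456789012 agentkeys-data-role
```

Calling `role_arn "$ACCOUNT_ID" agentkeys-data-role` before the `put-user-policy` and `create-role` steps would surface an unset `ACCOUNT_ID` early instead of after a policy has been written.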
+ +### §4.4 `agentkeys-admin`, `agentkeys-broker` (already provisioned) + +If you reached this section, `agentkeys-admin` exists (you're using it). `agentkeys-broker` is whatever IAM user you SSH into the broker host with — its perms are out of scope (`ec2-instance-connect:SendSSHPublicKey` on the host's instance ID is sufficient for EC2 Instance Connect). + +## §5 S3 bucket policy (initial, static-IAM variant) + +```bash +aws s3api put-bucket-policy --region "$REGION" --bucket "$BUCKET" \ + --policy "$(jq -n --arg bucket "$BUCKET" --arg acct "$ACCOUNT_ID" '{ + Version:"2012-10-17", + Statement:[ + { + Sid:"AllowSESWriteInbound", Effect:"Allow", + Principal:{Service:"ses.amazonaws.com"}, + Action:"s3:PutObject", + Resource:"arn:aws:s3:::\($bucket)/*", + Condition:{StringEquals:{"aws:Referer":$acct}} + }, + { + Sid:"AllowDaemonRead", Effect:"Allow", + Principal:{AWS:"arn:aws:iam::\($acct):role/agentkeys-data-role"}, + Action:["s3:GetObject","s3:ListBucket"], + Resource:["arn:aws:s3:::\($bucket)","arn:aws:s3:::\($bucket)/*"] + } + ] + }')" +``` + +The PrincipalTag-scoped federated variant (which replaces this once OIDC federation is up) lives in [`docs/cloud-setup.md`](cloud-setup.md) §1.4. + +## §6 `agentkeys-broker-host` instance profile (EC2-only, optional) + +If the broker runs on AWS EC2, attach this so the daemon never holds a static key. Runtime creds come from IMDS.
+ +```bash +ROLE=agentkeys-broker-host + +aws iam create-role --role-name "$ROLE" \ + --assume-role-policy-document "$(jq -n '{ + Version:"2012-10-17", + Statement:[{Effect:"Allow", Principal:{Service:"ec2.amazonaws.com"}, Action:"sts:AssumeRole"}] + }')" + +aws iam put-role-policy --role-name "$ROLE" --policy-name BrokerAssumeData \ + --policy-document "$(jq -n --arg acct "$ACCOUNT_ID" '{ + Version:"2012-10-17", + Statement:[{Effect:"Allow", Action:"sts:AssumeRole", + Resource:"arn:aws:iam::\($acct):role/agentkeys-data-role"}] + }')" + +aws iam create-instance-profile --instance-profile-name "$ROLE" +aws iam add-role-to-instance-profile --instance-profile-name "$ROLE" --role-name "$ROLE" +aws ec2 associate-iam-instance-profile --region "$REGION" \ + --instance-id "$INSTANCE_ID" \ + --iam-instance-profile Name="$ROLE" +``` + +> **Caller-region trap:** `agentkeys-admin` profile defaults to `us-west-2`; the broker EC2 usually lives in `us-east-1`. Without `--region "$REGION"`, `describe-instances` silently returns empty and downstream `put-role-policy` runs with `--role-name ""`. Pass `--region` explicitly on every regional call. See [CLAUDE.md "AWS local-profile ↔ remote-IAM mapping"](../CLAUDE.md). + +### §6.1 `ses:SendEmail` grant on the runtime role + +The broker calls SES v2 `SendEmail` with its **own** runtime credentials (instance profile), not via the assumed `agentkeys-data-role`. Without `ses:SendEmail` on the broker's role, the operator hits: + +``` +broker rejected /v1/auth/email/request: status=502 body= +{"error":"backend_unreachable","message":"… ses SendEmail: + unhandled error (AccessDeniedException)"} +``` + +The IAM action is `ses:SendEmail` (sesv2), NOT `ses:SendRawEmail` (v1; different code path the broker doesn't use). The grant lives on the broker's runtime role (`agentkeys-broker-host` on EC2; the user `agentkeys-daemon` otherwise) — see [`docs/cloud-setup.md`](cloud-setup.md) §3.3 for the exact statement. 
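
The caller-region trap in §6 comes down to an empty shell variable flowing into a later `aws` call. One way to fail fast, sketched here under the assumption that the operator works in bash, is to check every variable a command depends on before running it. `require_vars` is a hypothetical helper, and the values assigned below are placeholders.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): return non-zero if any of the
# named shell variables is unset or empty, reporting each one on stderr.
require_vars() {
  local name missing=0
  for name in "$@"; do
    # ${!name:-} is bash indirect expansion; the :- default keeps it
    # safe under `set -u` when the variable is unset.
    if [[ -z ${!name:-} ]]; then
      echo "require_vars: \$${name} is empty or unset" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Placeholder values: ROLE is deliberately empty to show the failure path.
REGION=us-east-1
ROLE=""
if ! require_vars REGION ROLE; then
  echo "refusing to run put-role-policy with an empty argument" >&2
fi
```

Placing `require_vars REGION INSTANCE_ID ROLE || exit 1` ahead of the §6 commands would stop the run at the empty lookup rather than at a later call with `--role-name ""`.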
+ +## §7 Security audit — strip legacy over-broad attached policies + +Some early deploys ship with `AmazonS3FullAccess` (or similar wide permissions) attached to the broker's runtime role. The broker at runtime ONLY uses `aws-sdk-sts` (the GetCallerIdentity startup probe) + `aws-sdk-sesv2` (the §6.1 grant) — it never accesses S3 with its own creds. Per-user S3 is via JWT-assumed `agentkeys-{data,vault,memory}-role`, not the broker's runtime role. + +A broker compromise with `AmazonS3FullAccess` would expose every inbound email in the SES bucket (verification tokens, magic links). Strip it: + +```bash +# Discover the actual role attached to the broker host (canonical name: +# agentkeys-broker-host; some early deploys landed on different names): +INSTANCE_PROFILE_ARN=$(aws ec2 describe-instances --region "$REGION" \ + --filters "Name=ip-address,Values=$EIP" \ + --query 'Reservations[].Instances[].IamInstanceProfile.Arn' --output text) + +ROLE=$(aws iam get-instance-profile \ + --instance-profile-name "${INSTANCE_PROFILE_ARN##*/}" \ + --query 'InstanceProfile.Roles[0].RoleName' --output text) +echo "broker runtime role: $ROLE" + +# Audit attached policies: +aws iam list-attached-role-policies --role-name "$ROLE" + +# Detach AmazonS3FullAccess if present: +aws iam detach-role-policy --role-name "$ROLE" \ + --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess + +# Verify only the narrow inline policy (BrokerSendEmail + AssumeDataRole) remains: +aws iam list-role-policies --role-name "$ROLE" +aws iam list-attached-role-policies --role-name "$ROLE" +``` + +## §8 Cloud-provider portability + +Every layer in §3–§5 has a 1:1 analog on the major providers. The provisioning shape carries; only the API endpoints + JSON dialects differ. 
+ +| Layer | AWS (current) | AliCloud (in progress) | GCP | Tencent Cloud | +|---|---|---|---|---| +| Privileged user | IAM user with `IAMFullAccess` | RAM user with `AliyunRAMFullAccess` | IAM service account with `roles/iam.securityAdmin` | CAM user with `AdministratorAccess` | +| Runtime user | IAM user + access key | RAM user + AK/SK | Service account + key file (or Workload Identity) | CAM user + SecretId/SecretKey | +| Data role | IAM role + assume policy | RAM role + assume policy | Service account + IAM bindings | CAM role + assume policy | +| Federation | IAM OIDC provider | RAM IDaaS / OIDC provider | Workload Identity Pool | CAM OIDC provider | +| Object store | S3 + bucket policy | OSS + bucket policy | Cloud Storage + IAM bindings | COS + bucket policy | +| Email backend | SES + S3 receipt rule | DirectMail / SimpleDM + OSS event notification | SendGrid / Mailgun (no GCP-native) | SimpleDM + COS | +| TLS termination | nginx + Let's Encrypt | nginx + Let's Encrypt | nginx + Let's Encrypt | nginx + Let's Encrypt | +| Compute (broker host) | EC2 + EIP | ECS + EIP | Compute Engine + external IP | CVM + EIP | +| DNS | Route 53 | AliCloud DNS | Cloud DNS | DNSPod / Cloud DNS | +| Secrets storage | Secrets Manager / SSM Parameter Store | KMS Secrets Manager | Secret Manager | KMS | + +**Migration playbook (cloud → cloud):** + +1. Re-bind operator-workstation.env to the new provider's identifiers (account ID, region, role ARNs, bucket name). +2. Re-run this doc top-to-bottom against the new provider. +3. Re-run [`docs/cloud-setup.md`](cloud-setup.md) §1 (OIDC federation) — substitute the provider's OIDC API. +4. Re-run `scripts/setup-broker-host.sh` on the new host (the script doesn't care which cloud — it consumes already-provisioned identifiers). +5. Re-run `scripts/setup-heima.sh` — the chain side is cloud-agnostic. +6. Re-run the harness scripts to validate end-to-end.
+ +The boundary is sharp: the broker process itself contains zero cloud-specific code — it talks STS-compatible OIDC + S3-compatible PutObject/GetObject + SMTP-compatible SendEmail. Every cloud above offers all three primitives. The [`provisioner-scripts/email-backends/`](../provisioner-scripts/) directory documents the email-backend trait; a new backend slots in as `tencent-simpledm-cos` (or similar) with the same upstream API as `ses-s3`. + +## Related + +- Day-to-day broker re-deploys: [`docs/cloud-setup.md`](cloud-setup.md) +- Chain bring-up: [`docs/heima-setup.md`](heima-setup.md) +- CI activation: [`docs/ci-setup.md`](ci-setup.md) +- Architecture (per-data-class buckets + isolation invariants): [`docs/spec/architecture.md`](spec/architecture.md) §17, §17.2 +- Future Tencent / TEE DKIM: [`docs/spec/heima-gaps-vs-desired-architecture.md`](spec/heima-gaps-vs-desired-architecture.md) §4 +- FAQ + troubleshooting: [`wiki/cloud-setup-faq.md`](../wiki/cloud-setup-faq.md) diff --git a/docs/cloud-setup.md b/docs/cloud-setup.md index 3a7820f..9dd75ad 100644 --- a/docs/cloud-setup.md +++ b/docs/cloud-setup.md @@ -1,7 +1,7 @@ # Cloud setup — AgentKeys -**Audience:** the operator provisioning the cloud account that hosts AgentKeys infrastructure. -**Scope:** the prereqs that the idempotent [`scripts/setup-broker-host.sh`](../scripts/setup-broker-host.sh) entry point can't do for itself (DNS, SES, IAM, OIDC provider, S3 buckets). Run those once per account, then re-run the broker-host script as often as needed. +**Audience:** the operator running ongoing broker re-deploys after first-time cloud-account bootstrap is done. +**Scope:** OIDC federation activation (the per-broker security upgrade) + the [`scripts/setup-broker-host.sh`](../scripts/setup-broker-host.sh) runtime entry point + tear-down. **Prereqs handled in [`docs/cloud-bootstrap.md`](cloud-bootstrap.md)** — read that first if standing up a brand-new account or porting to another cloud provider. 
**Companion:** [`docs/heima-setup.md`](heima-setup.md) for chain bring-up, [`docs/ci-setup.md`](ci-setup.md) for CI activation. **FAQ + troubleshooting:** [`wiki/cloud-setup-faq.md`](../wiki/cloud-setup-faq.md). @@ -12,11 +12,17 @@ awsp agentkeys-admin set -a; source scripts/operator-workstation.env; set +a # ${ACCOUNT_ID}, ${REGION}, ${BROKER_HOST}, ${BUCKET}, ... -# 1. Per-account, one-shot, manual (this doc): -# §1 DNS subdomains, §2 SES domain identity, §3 IAM users + role, -# §4 OIDC federation provider + trust policy + bucket policy. +# 0. First-time cloud-account bootstrap (cloud-bootstrap.md): +# DNS subdomains, SES domain identity, IAM users + roles, initial +# bucket policy. Run ONCE per account; re-enter only when migrating +# cloud providers or adding a second account. -# 2. Per-broker-host, idempotent re-runnable (script): +# 1. OIDC federation activation (this doc §1): +# Once the broker is publicly reachable, register the IAM OIDC +# provider + swap the role trust policy + apply PrincipalTag +# bucket policy. Per-broker, one-shot. + +# 2. Per-broker-host, idempotent re-runnable (this doc §2): sudo bash scripts/setup-broker-host.sh \ --issuer-url "https://${BROKER_HOST}" \ --account-id "${ACCOUNT_ID}" \ @@ -33,145 +39,17 @@ bash scripts/setup-heima.sh # see docs/heima-setu `setup-broker-host.sh` is **the single entry point** for every remote-host change (binary upgrades, systemd edits, env tweaks, nginx/certbot wiring, mock-server redeploys). Per [CLAUDE.md "Remote broker host"](../CLAUDE.md): no ad-hoc `systemctl` edits, no hand-built `scp`. -The split: §1–§4 below sets up the **identifiers** (DNS names, IAM principals, OIDC trust, bucket policies); the script consumes those identifiers and stands up the actual processes. - -## 0. Identities — mental model - -| Identity | Type | Holds | Purpose | -|---|---|---|---| -| `agentkeys-admin` | IAM user | Long-lived access key | One-shot provisioning. Runs every command in this doc. IAM-admin scope. 
| -| `agentkeys-broker` | IAM user | Long-lived access key | Operator's SSH-into-EC2 path via EC2 Instance Connect. No data-plane access. | -| `agentkeys-daemon` | IAM user | Long-lived access key | Broker process at runtime. Only permission: `sts:AssumeRole` on the data role. | -| `agentkeys-data-role` | IAM role | (assumed) | Holds the actual S3/SES permissions. `agentkeys-daemon` (Stage 6) or the OIDC provider (Stage 7) is allowed to assume. | -| `agentkeys-vault-role` / `agentkeys-memory-role` | IAM role | (assumed) | Per-data-class roles (arch.md §17.2). Trust the OIDC provider; PrincipalTag-scoped to `bots//{credentials,memory}/*`. | -| `agentkeys-broker-host` | IAM role | (assumed by EC2) | Optional. If the broker runs on EC2, attach as instance profile so the daemon never sees a static key. | - -The word "agent" already means three things (the AI agent, the AgentKeys product, an IAM role) — these roles hold **data-plane** permissions, so they're named `*-data-role` / `*-vault-role` / `*-memory-role`. - -## 1. DNS - -Two-and-six subdomains under your parent zone (e.g. `litentry.org`): - -| Host | Purpose | Set in | -|---|---|---| -| `${MAIL_DOMAIN}` (e.g. `bots.litentry.org`) | SES inbound | §2 | -| `${BROKER_HOST}` (e.g. `broker.litentry.org`) | Broker TLS-terminating reverse proxy | §5 — A record to broker EIP | -| `signer.${ZONE}` | Signer service (issue #74 step 1b) | §5 — A record to broker EIP (co-located today) | -| `audit.${ZONE}` / `email.${ZONE}` / `cred.${ZONE}` / `memory.${ZONE}` | Service workers (issue #90) | §5 — same EIP (dev co-location) | - -For the bulk service-worker DNS, use [`scripts/dns-upsert-workers.sh`](../scripts/dns-upsert-workers.sh). The hostnames are the migration seam — when a worker moves to its own machine, only the A record changes. - -## 2. 
SES inbound mail - -```bash -# Verify the SES domain identity -aws sesv2 create-email-identity --region "$REGION" \ - --email-identity "$MAIL_DOMAIN" \ - --dkim-signing-attributes NextSigningKeyLength=RSA_2048_BIT - -# Publish DKIM + SPF + DMARC + MX in one Route 53 change (read DKIM tokens -# from `aws sesv2 get-email-identity`, then upsert via Route 53 — see -# wiki/cloud-setup-faq.md for the full record set). - -# Create the inbound bucket (30-day TTL on inbound/* objects) -aws s3api create-bucket --region "$REGION" --bucket "$BUCKET" \ - $([ "$REGION" != "us-east-1" ] && echo "--create-bucket-configuration LocationConstraint=$REGION") -aws s3api put-public-access-block --region "$REGION" --bucket "$BUCKET" \ - --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true - -# Receipt rule: route mail for $MAIL_DOMAIN into s3://$BUCKET/inbound/* -aws ses create-receipt-rule-set --rule-set-name agentkeys --region "$REGION" 2>/dev/null || true -aws ses create-receipt-rule --region "$REGION" --rule-set-name agentkeys \ - --rule "$(jq -n --arg domain "$MAIL_DOMAIN" --arg bucket "$BUCKET" '{ - Name: "agentkeys-inbound", Enabled: true, ScanEnabled: true, TlsPolicy: "Optional", - Recipients: [$domain], - Actions: [{S3Action: {BucketName: $bucket, ObjectKeyPrefix: "inbound/"}}] - }')" -aws ses set-active-receipt-rule-set --rule-set-name agentkeys --region "$REGION" - -# Verify the bot's sending identity (the broker's BROKER_EMAIL_FROM_ADDRESS -# precheck refuses to boot if this isn't verified) -bash scripts/ses-verify-sender.sh -``` - -**Sandbox vs production sending:** inbound is unaffected by SES sandbox; only **outbound** to arbitrary addresses needs Console → Support → "SES Sending Limits" → "Request Production Access". - -**Per-recipient routing Lambda (issue #83):** after §4 lands, the broker's role is intentionally denied read on `inbound/*`. 
Service-provisioning verification emails route to `bots//inbound/` via [`infra/ses-routing-lambda/deploy.sh`](../infra/ses-routing-lambda/deploy.sh). Idempotent, deploy once per AWS account. - -**Future Tencent Cloud port:** SES + S3 are the only AWS-specific layers in this doc. SimpleDM + COS slot in at the §3+ boundary — IAM model maps 1:1 onto CAM. The `provisioner-scripts/email-backends/` interface already abstracts the inbound contract. - -## 3. IAM identities - -The daemon user + data role are the boundary between manual provisioning (this doc) and the script-driven runtime (`setup-broker-host.sh`). - -### 3.1 The four principals - -```bash -# Runtime user (broker process) -aws iam create-user --user-name agentkeys-daemon -aws iam create-access-key --user-name agentkeys-daemon -# → save AccessKeyId + SecretAccessKey to the operator's secret manager. -# NEVER commit. setup-broker-host.sh consumes these via the systemd -# env file written under /etc/agentkeys/. - -# Daemon may only assume the data role (no direct S3/SES grants). -aws iam put-user-policy --user-name agentkeys-daemon \ - --policy-name agentkeys-daemon-assume-role \ - --policy-document "$(jq -n --arg acct "$ACCOUNT_ID" '{ - Version:"2012-10-17", - Statement:[{Effect:"Allow", Action:"sts:AssumeRole", - Resource:"arn:aws:iam::\($acct):role/agentkeys-data-role"}] - }')" -``` - -For `agentkeys-admin` + `agentkeys-broker` (one-shot, you already have these per CLAUDE.md "AWS local-profile ↔ remote-IAM mapping"), confirm with `aws iam list-users`. - -### 3.2 The three data roles - -Per arch.md §17.2 (per-data-class isolation): separate roles for credentials + memory + email. Same trust shape, distinct inline policies and PrincipalTag scoping. 
Provision via the per-data-class helpers (idempotent): - -```bash -bash scripts/provision-vault-bucket.sh # agentkeys-vault-${ACCOUNT_ID} -bash scripts/provision-vault-role.sh # agentkeys-vault-role -bash scripts/apply-vault-bucket-policy.sh # v3 split-statement PrincipalTag policy - -bash scripts/provision-memory-bucket.sh -bash scripts/provision-memory-role.sh -bash scripts/apply-memory-bucket-policy.sh - -bash scripts/cleanup-mail-bucket-policy.sh # restore email-only grants on $BUCKET -``` - -The data-role trust shape is shown in [§4.3](#43-trust-policy) below — it's the same template for all three roles. The inline grants differ per role (vault → credentials prefix; memory → memory prefix; data-role → mail prefix). - -### 3.3 SES sender grant (email-link auth prereq) - -The broker's runtime role needs `ses:SendEmail` on the verified sender identity for email-link auth. Add this statement to the data role's inline policy: - -```json -{ - "Effect": "Allow", - "Action": ["ses:SendEmail", "ses:SendRawEmail"], - "Resource": [ - "arn:aws:ses:${REGION}:${ACCOUNT_ID}:identity/${BROKER_EMAIL_FROM_ADDRESS}", - "arn:aws:ses:${REGION}:${ACCOUNT_ID}:configuration-set/*" - ] -} -``` - -The broker's `verify_sender_ready` precheck calls `ses:GetEmailIdentity` at boot and refuses to start if the identity isn't both verified AND grantable. Triggered without this grant: cryptic `AccessDenied: ses:SendEmail` at the magic-link send step. - -## 4. OIDC federation (Stage 7) +## 1. OIDC federation (Stage 7) The broker mints OIDC JWTs that AWS STS validates via the broker's public JWKS endpoint. Three one-shot steps per account. -### 4.1 Prereqs +### 1.1 Prereqs - Broker reachable at `https://${BROKER_HOST}` over public TLS (`setup-broker-host.sh` provisions this with certbot). - `https://${BROKER_HOST}/.well-known/openid-configuration` returns 200 with the expected `issuer` + `jwks_uri`. - `https://${BROKER_HOST}/.well-known/jwks.json` returns at least one ES256 key. 
-### 4.2 Register the OIDC provider +### 1.2 Register the OIDC provider ```bash thumb=$(echo | openssl s_client -servername "$BROKER_HOST" \ @@ -187,7 +65,7 @@ aws iam create-open-id-connect-provider \ **AWS validates the issuer URL byte-for-byte** against the JWT `iss` claim. Once the OIDC provider is registered, the URL is effectively immutable for the life of the deployment — switching means new provider ARN + new trust policy + new federated grants. -### 4.3 Trust policy +### 1.3 Trust policy (federated variant) Apply to each of the three data roles. Use `$ROLE` ∈ `{agentkeys-data-role, agentkeys-vault-role, agentkeys-memory-role}`. @@ -204,7 +82,7 @@ aws iam update-assume-role-policy --role-name "$ROLE" --policy-document "$(jq -n }')" ``` -### 4.4 PrincipalTag-scoped bucket policy +### 1.4 PrincipalTag-scoped bucket policy Per CLAUDE.md "Per-actor + per-data-class isolation invariants": every S3 read/write is scoped to `bots/${aws:PrincipalTag/agentkeys_actor_omni}/{credentials,memory}/*`. The split-statement v3 bucket policy is applied by [`scripts/apply-{vault,memory}-bucket-policy.sh`](../scripts/) — those scripts ARE the source of truth for the policy shape. @@ -214,21 +92,21 @@ After §4.3 + §4.4: strip the §3 broad-bucket inline grant from the role's pol aws iam delete-role-policy --role-name "$ROLE" --policy-name agentkeys-data-role-s3-broad ``` -### 4.5 End-to-end proof +### 1.5 End-to-end proof Run [`harness/v2-stage3-demo.sh`](../harness/v2-stage3-demo.sh) — it mints a session JWT → OIDC JWT → STS creds, then proves both POSITIVE (own prefix) and NEGATIVE (cross-actor prefix → AccessDenied) writes for both data classes plus the cross-role isolation matrix. Walks the full §17.2 isolation table from CLAUDE.md. -## 5. Broker host: `setup-broker-host.sh` +## 2. Broker host: `setup-broker-host.sh` §1–§4 set up identifiers. 
This step stands up the actual processes — broker + mock-server + signer + 4 service workers — on the EC2 host (or any Linux box with public-internet egress + the broker's hostname). -### 5.1 Prereqs +### 2.1 Prereqs - Fresh Linux host with sudo, systemd, public-internet egress, ports 80 + 443 open inbound (for certbot + nginx). - DNS A records for `${BROKER_HOST}` + `signer.${ZONE}` + `audit.${ZONE}` + `email.${ZONE}` + `cred.${ZONE}` + `memory.${ZONE}` all pointing at the host's public IP. - AWS credentials in `/etc/agentkeys/broker.env` (the script writes the file template; operator pastes the `agentkeys-daemon` access key from §3.1). -### 5.2 Run +### 2.2 Run ```bash # Bootstrap a fresh host: @@ -258,7 +136,7 @@ The script: Auto-detects bootstrap vs upgrade by reading the existing systemd unit's `Environment=` lines. Pass `--ref ` to opt into an in-script `git fetch + pull`. -### 5.3 Verify +### 2.3 Verify ```bash curl -sf "https://${BROKER_HOST}/healthz" # → 200 @@ -269,7 +147,9 @@ curl -sf "https://audit.${ZONE}/healthz" # → 200 (and friend For full E2E (broker + workers + chain + AWS), run the harness scripts — see [`docs/heima-setup.md`](heima-setup.md) for the chain side and [`docs/ci-setup.md`](ci-setup.md) for the automated path. -## 6. Cleanup +## 3. Cleanup (full account teardown) + +Tears down everything provisioned by both [`docs/cloud-bootstrap.md`](cloud-bootstrap.md) and this doc. Use only when retiring the deployment. 
Tear down the whole AgentKeys footprint in one account: @@ -309,6 +189,7 @@ aws sesv2 delete-email-identity --email-identity "$MAIL_DOMAIN" --region "$REGIO ## Related +- **First-time cloud-account bootstrap (prereq for this doc):** [`docs/cloud-bootstrap.md`](cloud-bootstrap.md) - Chain bring-up: [`docs/heima-setup.md`](heima-setup.md) - CI activation: [`docs/ci-setup.md`](ci-setup.md) - Broker host script (single entry point): [`scripts/setup-broker-host.sh`](../scripts/setup-broker-host.sh) diff --git a/wiki/cloud-setup-faq.md b/wiki/cloud-setup-faq.md index 2beca18..392a10c 100644 --- a/wiki/cloud-setup-faq.md +++ b/wiki/cloud-setup-faq.md @@ -1,6 +1,11 @@ # Cloud setup — FAQ -Troubleshooting + edge cases that didn't fit in [`docs/cloud-setup.md`](https://github.com/litentry/agentKeys/blob/main/docs/cloud-setup.md). Use ⌘F to find your error. +Troubleshooting + edge cases for the two cloud-side operator docs: + +- [`docs/cloud-bootstrap.md`](https://github.com/litentry/agentKeys/blob/main/docs/cloud-bootstrap.md) — first-time provisioning (per account or per cloud provider). +- [`docs/cloud-setup.md`](https://github.com/litentry/agentKeys/blob/main/docs/cloud-setup.md) — ongoing OIDC federation + broker-host re-deploys. + +Use ⌘F to find your error. ## Q. `setup-broker-host.sh` says "BROKER_OIDC_ISSUER mismatch" on re-run