Merged
23 changes: 18 additions & 5 deletions docs/docs/architecture/bootstrap-and-lifecycle.md
@@ -16,7 +16,7 @@ would be thrown away after day one.

Before touching the real `UM760`, prove the risky assumptions locally:

- generate or download a seeded IncusOS USB/IMG image
- generate or download a seeded IncusOS USB/IMG `Operation` image
- write it to a VM's only disk
- boot that disk as the steady-state IncusOS host
- confirm Incus initialization, trusted client certificate access, network
@@ -48,9 +48,9 @@ The intended host bootstrap sequence is:

1. Start the temporary VyOS-hosted `k0s` cluster.
2. Install Tinkerbell and the CAPI providers into that cluster.
3. Generate a seeded IncusOS USB/IMG image for the `UM760`.
4. Use Tinkerbell and HookOS to write that image directly to the internal
`UM760` disk through `image2disk` or `oci2disk`.
3. Generate a seeded IncusOS USB/IMG `Operation` image for the `UM760`.
4. Use Tinkerbell and HookOS to write that final-disk image directly to the
internal `UM760` disk through `image2disk` or `oci2disk`.
5. Boot the `UM760` into IncusOS as the steady-state host OS.
6. Initialize Incus on the `UM760` with the first-node defaults needed for the
final cluster.
@@ -63,6 +63,10 @@ The same Tinkerbell path provisions the `MS-02` hosts later. Joining nodes must
use IncusOS seed settings appropriate for joining the existing Incus cluster,
not for creating independent local Incus defaults.

A normal IncusOS `Installation` image must not be treated as equivalent to the
`Operation` image for the single-disk `UM760` path. The bootstrap assumption is
that the selected image is already a bootable final-disk artifact.
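
As a sketch of the intended write step, a Tinkerbell template using the upstream `image2disk` action might look like the following. The image URL, disk device path, timeouts, and template name are illustrative assumptions, not committed values:

```yaml
apiVersion: tinkerbell.org/v1alpha1
kind: Template
metadata:
  name: incusos-operation-install   # hypothetical name
  namespace: tink-system
spec:
  data: |
    version: "0.1"
    name: incusos-operation-install
    global_timeout: 1800
    tasks:
      - name: write-incusos
        worker: "{{.device_1}}"
        actions:
          - name: write-image
            image: quay.io/tinkerbell/actions/image2disk:latest
            timeout: 900
            environment:
              # Seeded Operation image served from the bootstrap network (assumed URL)
              IMG_URL: "http://bootstrap.lab/images/incusos-operation-seeded.img"
              # The UM760's single internal disk (assumed device path)
              DEST_DISK: /dev/nvme0n1
              COMPRESSED: false
```

The key property is that the artifact written by `image2disk` is the final bootable disk image, matching the single-disk assumption above.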

## Platform Cluster Bring-Up

The platform cluster starts as one Talos VM on the `UM760`.
@@ -72,11 +76,19 @@ cluster reachable and let GitOps take over:

- bootstrap-safe Cilium
- minimal Argo CD on the platform cluster
- the `platform-bootstrap` AppProject
- an admin-owned root Application pointing at the platform cluster selection in
`gitops`

The `platform-bootstrap` AppProject is part of the day-0 handoff contract. It
must exist before the root Application, because relying on Argo CD's startup
creation of the `default` project caused a first-bootstrap race.

After Argo CD is running, the GitOps bootstrap selection installs the full
cluster-core components: full Cilium, full/self-managed Argo CD, and `kro`.
The platform cluster bootstrap selection uses three separate Applications for
Cilium, Argo CD, and `kro`, all on the `platform-bootstrap` project, so
ownership and failure domains remain visible.

The platform repo owns canonical bootstrap/core artifacts and release history.
The gitops repo owns per-cluster version selection and cluster-local desired
@@ -180,7 +192,8 @@ syncs cluster-core and platform API state to them after they exist.

The bootstrap path is not complete until these are proven:

- IncusOS image generation and seeding for first-node and joining-node modes
- IncusOS `Operation` image generation and seeding for first-node and
joining-node modes
- Tinkerbell image writing from HookOS to the selected disks
- VyOS-hosted `k0s` stability for the temporary bootstrap stack
- CAPN plus Talos providers creating Talos VMs with the desired boot mode,
6 changes: 3 additions & 3 deletions docs/docs/architecture/keycloak-runtime.md
@@ -20,9 +20,9 @@ The runtime contract is:
- services: upstream Keycloak plus upstream Postgres, both pinned in `infra`
- database: colocated Postgres with data on the instance EBS root volume
- access name: `id.glab.lol`
- TLS: ACME DNS-01 through Route 53 using the instance IAM role
- reverse proxy: Caddy or equivalent on the host, terminating TLS and proxying
to Keycloak on loopback
- TLS: Traefik ACME DNS-01 through Route 53 using the instance IAM role
- reverse proxy: Traefik on the host, terminating TLS and proxying to Keycloak
on loopback
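
A sketch of the host-level Traefik contract follows; the listen address, ACME storage path, contact email, and Keycloak loopback port are assumptions:

```yaml
# traefik.yml (static configuration)
entryPoints:
  websecure:
    address: ":443"
certificatesResolvers:
  route53:
    acme:
      email: admin@glab.lol              # assumed contact address
      storage: /var/lib/traefik/acme.json
      dnsChallenge:
        provider: route53                # credentials come from the instance IAM role
providers:
  file:
    filename: /etc/traefik/dynamic.yml
---
# /etc/traefik/dynamic.yml -- route id.glab.lol to Keycloak on loopback
http:
  routers:
    keycloak:
      rule: Host(`id.glab.lol`)
      entryPoints: [websecure]
      tls:
        certResolver: route53
      service: keycloak
  services:
    keycloak:
      loadBalancer:
        servers:
          - url: http://127.0.0.1:8080   # assumed Keycloak HTTP port
```

The DNS-01 challenge keeps port 80 closed and works even if the host is not publicly reachable, which fits the IAM-role-based Route 53 access described above.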

State recovery uses application backups. Do not depend on EBS snapshots or AMI
backups as the primary recovery path.
62 changes: 33 additions & 29 deletions docs/docs/architecture/secrets-identity-pki.md
@@ -280,40 +280,44 @@ Vault PKI roles issue short-lived certificates for internal use cases such as
service-to-service mTLS, database client authentication, internal controllers,
and future SPIRE upstream authority material.

The current implementation history matters for rebuilds: the existing
`infra/security/pki/root-ca` stack was applied against an earlier AWS account.
The lab root must be recreated in the current `lab` account with the
`pathlen:2` hierarchy, then the earlier account root key and state can be
cleaned up after consumers trust the new chain.

Router-hosted `step-ca` is limited to existing consumers during migration.
Runtime issuance uses cert-manager plus Route 53 for public TLS and Vault for
internal PKI.

The small implementation slices are:

1. Rewrap existing SOPS files with `alias/glab-sops`, add encryption context,
and remove PGP/age recipients.
2. Rotate bootstrap secrets that previously depended on PGP/age-only history.
3. Create the GitHub App plus SSM bootstrap path for `secrets` repo access.
4. Recreate the internal root CA in the current `lab` account with the new path
length.
5. Add the shared Vault unseal KMS key and `bank-vaults` storage layout.
6. Stand up Vault in one cluster and prove the SOPS-to-Vault bootstrap handoff.
7. Add cert-manager DNS-01 with Route 53 ACME delegation for one cluster.
8. Migrate internal PKI consumers from `step-ca` to the per-cluster issuers.

Open implementation threads:

- exact KMS encryption-context key names and allowed scope values
- exact GitHub App name, installation ID storage, and SSM parameter paths
The current lab-account root is the committed root in
`infra/security/pki/root-ca`. It is backed by the `lab` account KMS key
`alias/glab-pki-root-ca` and uses the `pathlen:2` hierarchy above. Do not rely
on the older pre-`lab` account root as the target trust anchor for new internal
PKI work.

RouterOS device certificates remain an existing-consumer migration thread.
Runtime issuance for the target architecture uses cert-manager plus Route 53
for public TLS and Vault for internal PKI, but RouterOS CA migration is
deprioritized until the core lab architecture is farther along.

Foundations already completed:

- current SOPS files are rewrapped with `alias/glab-sops`, KMS encryption
context, and no routine PGP/age recipients
- the GitHub App bootstrap material is stored in SSM, and
`github-token-broker` mints short-lived `contents:read` tokens for
`GilmanLab/secrets`
- the internal root CA exists in the current `lab` account with `pathlen:2`
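
The rewrapped SOPS posture can be expressed in `.sops.yaml`. The path regex, AWS region, and encryption-context key names below are assumptions (the exact context keys are still an open thread):

```yaml
creation_rules:
  - path_regex: .*\.enc\.ya?ml$    # assumed naming convention
    kms:
      - arn: arn:aws:kms:us-east-1:186067932323:alias/glab-sops   # region assumed
        context:
          repo: secrets            # hypothetical context keys/values
          scope: bootstrap
    # no pgp: or age: recipients -- KMS-only by policy
```

Because KMS enforces encryption context on decrypt, any consumer must present the same key/value pairs, which is what makes the context a meaningful scoping mechanism rather than a label.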

Remaining implementation threads:

- rotate bootstrap secrets where old PGP/age-encrypted git history should no
longer be trusted
- attach `arn:aws:iam::186067932323:policy/glab-github-token-broker-invoke` to
the real bootstrap principals that should mint GitHub installation tokens
- decide whether `/glab/bootstrap/github-app/private-key-pem` should move from
the default SSM KMS key to a customer-managed KMS key
- whether Vault unseal material uses one shared S3 bucket with prefixes or
separate per-cluster buckets
- add the shared Vault unseal KMS key and `bank-vaults` storage layout
- stand up Vault in one cluster and prove the SOPS-to-Vault bootstrap handoff
- add cert-manager DNS-01 with Route 53 ACME delegation for one cluster
- how trust bundles are distributed to workloads that need to trust internal
Vault or SPIRE issuers
- whether any future public TLS use case justifies wildcard certificates
- when remaining `step-ca` consumers should be allowed to expire naturally
versus being actively replaced
- migrate existing internal PKI consumers to the per-cluster issuers when they
become important enough to justify the work
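
When the cert-manager DNS-01 thread lands, the per-cluster public issuer is expected to look roughly like this; the issuer name, ACME email, region, and hosted zone ID are assumptions:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-route53          # hypothetical name
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@glab.lol            # assumed contact address
    privateKeySecretRef:
      name: letsencrypt-route53-account-key
    solvers:
      - dns01:
          route53:
            region: us-east-1              # assumed region
            hostedZoneID: Z0000000000000   # placeholder zone ID
```

Ambient credentials (an instance or IRSA role) can stand in for static keys here, mirroring the IAM-role pattern used elsewhere in the lab.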

## Identity
