diff --git a/docs/docs/architecture/bootstrap-and-lifecycle.md b/docs/docs/architecture/bootstrap-and-lifecycle.md
index ff34177..311a09b 100644
--- a/docs/docs/architecture/bootstrap-and-lifecycle.md
+++ b/docs/docs/architecture/bootstrap-and-lifecycle.md
@@ -16,7 +16,7 @@ would be thrown away after day one.

 Before touching the real `UM760`, prove the risky assumptions locally:

-- generate or download a seeded IncusOS USB/IMG image
+- generate or download a seeded IncusOS USB/IMG `Operation` image
 - write it to a VM's only disk
 - boot that disk as the steady-state IncusOS host
 - confirm Incus initialization, trusted client certificate access, network
@@ -48,9 +48,9 @@ The intended host bootstrap sequence is:

 1. Start the temporary VyOS-hosted `k0s` cluster.
 2. Install Tinkerbell and the CAPI providers into that cluster.
-3. Generate a seeded IncusOS USB/IMG image for the `UM760`.
-4. Use Tinkerbell and HookOS to write that image directly to the internal
-   `UM760` disk through `image2disk` or `oci2disk`.
+3. Generate a seeded IncusOS USB/IMG `Operation` image for the `UM760`.
+4. Use Tinkerbell and HookOS to write that final-disk image directly to the
+   internal `UM760` disk through `image2disk` or `oci2disk`.
 5. Boot the `UM760` into IncusOS as the steady-state host OS.
 6. Initialize Incus on the `UM760` with the first-node defaults needed for the
    final cluster.
@@ -63,6 +63,10 @@ The same Tinkerbell path provisions the `MS-02` hosts later. Joining nodes must
 use IncusOS seed settings appropriate for joining the existing Incus cluster,
 not for creating independent local Incus defaults.

+A normal IncusOS `Installation` image must not be treated as equivalent to the
+`Operation` image for the single-disk `UM760` path. The bootstrap assumption is
+that the selected image is already a bootable final-disk artifact.
+
 ## Platform Cluster Bring-Up

 The platform cluster starts as one Talos VM on the `UM760`.
@@ -72,11 +76,19 @@ cluster reachable and let GitOps take over:

 - bootstrap-safe Cilium
 - minimal Argo CD on the platform cluster
+- the `platform-bootstrap` AppProject
 - an admin-owned root Application pointing at the platform cluster selection in
   `gitops`

+The `platform-bootstrap` AppProject is part of the day-0 handoff contract. It
+must exist before the root Application, because relying on Argo CD's startup
+creation of the `default` project caused a first-bootstrap race.
+
 After Argo CD is running, the GitOps bootstrap selection installs the full
 cluster-core components: full Cilium, full/self-managed Argo CD, and `kro`.
+The platform cluster bootstrap selection uses three separate Applications for
+Cilium, Argo CD, and `kro`, all on the `platform-bootstrap` project, so
+ownership and failure domains remain visible.

 The platform repo owns canonical bootstrap/core artifacts and release history.
 The gitops repo owns per-cluster version selection and cluster-local desired
@@ -180,7 +192,8 @@ syncs cluster-core and platform API state to them after they exist.

 The bootstrap path is not complete until these are proven:

-- IncusOS image generation and seeding for first-node and joining-node modes
+- IncusOS `Operation` image generation and seeding for first-node and
+  joining-node modes
 - Tinkerbell image writing from HookOS to the selected disks
 - VyOS-hosted `k0s` stability for the temporary bootstrap stack
 - CAPN plus Talos providers creating Talos VMs with the desired boot mode,
diff --git a/docs/docs/architecture/keycloak-runtime.md b/docs/docs/architecture/keycloak-runtime.md
index 28d2fa1..52955d3 100644
--- a/docs/docs/architecture/keycloak-runtime.md
+++ b/docs/docs/architecture/keycloak-runtime.md
@@ -20,9 +20,9 @@ The runtime contract is:
 - services: upstream Keycloak plus upstream Postgres, both pinned in `infra`
 - database: colocated Postgres with data on the instance EBS root volume
 - access name: `id.glab.lol`
-- TLS: ACME DNS-01 through Route 53 using the instance IAM role
-- reverse proxy: Caddy or equivalent on the host, terminating TLS and proxying
-  to Keycloak on loopback
+- TLS: Traefik ACME DNS-01 through Route 53 using the instance IAM role
+- reverse proxy: Traefik on the host, terminating TLS and proxying to Keycloak
+  on loopback

 State recovery uses application backups. Do not depend on EBS snapshots or AMI
 backups as the primary recovery path.
diff --git a/docs/docs/architecture/secrets-identity-pki.md b/docs/docs/architecture/secrets-identity-pki.md
index 0282bcf..c2c1213 100644
--- a/docs/docs/architecture/secrets-identity-pki.md
+++ b/docs/docs/architecture/secrets-identity-pki.md
@@ -280,40 +280,44 @@ Vault PKI roles issue short-lived certificates for internal use cases such as
 service-to-service mTLS, database client authentication, internal controllers,
 and future SPIRE upstream authority material.

-The current implementation history matters for rebuilds: the existing
-`infra/security/pki/root-ca` stack was applied against an earlier AWS account.
-The lab root must be recreated in the current `lab` account with the
-`pathlen:2` hierarchy, then the earlier account root key and state can be
-cleaned up after consumers trust the new chain.
-
-Router-hosted `step-ca` is limited to existing consumers during migration.
-Runtime issuance uses cert-manager plus Route 53 for public TLS and Vault for
-internal PKI.
-
-The small implementation slices are:
-
-1. Rewrap existing SOPS files with `alias/glab-sops`, add encryption context,
-   and remove PGP/age recipients.
-2. Rotate bootstrap secrets that previously depended on PGP/age-only history.
-3. Create the GitHub App plus SSM bootstrap path for `secrets` repo access.
-4. Recreate the internal root CA in the current `lab` account with the new path
-   length.
-5. Add the shared Vault unseal KMS key and `bank-vaults` storage layout.
-6. Stand up Vault in one cluster and prove the SOPS-to-Vault bootstrap handoff.
-7. Add cert-manager DNS-01 with Route 53 ACME delegation for one cluster.
-8. Migrate internal PKI consumers from `step-ca` to the per-cluster issuers.
-
-Open implementation threads:
-
-- exact KMS encryption-context key names and allowed scope values
-- exact GitHub App name, installation ID storage, and SSM parameter paths
+The current lab-account root is the committed root in
+`infra/security/pki/root-ca`. It is backed by the `lab` account KMS key
+`alias/glab-pki-root-ca` and uses the `pathlen:2` hierarchy above. Do not rely
+on the older pre-`lab` account root as the target trust anchor for new internal
+PKI work.
+
+RouterOS device certificates remain an existing-consumer migration thread.
+Runtime issuance for the target architecture uses cert-manager plus Route 53
+for public TLS and Vault for internal PKI, but RouterOS CA migration is
+deprioritized until the core lab architecture is farther along.
+
+Foundations already completed:
+
+- current SOPS files are rewrapped with `alias/glab-sops`, KMS encryption
+  context, and no routine PGP/age recipients
+- the GitHub App bootstrap material is stored in SSM, and
+  `github-token-broker` mints short-lived `contents:read` tokens for
+  `GilmanLab/secrets`
+- the internal root CA exists in the current `lab` account with `pathlen:2`
+
+Remaining implementation threads:
+
+- rotate bootstrap secrets where old PGP/age-encrypted git history should no
+  longer be trusted
+- attach `arn:aws:iam::186067932323:policy/glab-github-token-broker-invoke` to
+  the real bootstrap principals that should mint GitHub installation tokens
+- decide whether `/glab/bootstrap/github-app/private-key-pem` should move from
+  the default SSM KMS key to a customer-managed KMS key
 - whether Vault unseal material uses one shared S3 bucket with prefixes or
   separate per-cluster buckets
+- add the shared Vault unseal KMS key and `bank-vaults` storage layout
+- stand up Vault in one cluster and prove the SOPS-to-Vault bootstrap handoff
+- add cert-manager DNS-01 with Route 53 ACME delegation for one cluster
 - how trust bundles are distributed to workloads that need to trust internal
   Vault or SPIRE issuers
 - whether any future public TLS use case justifies wildcard certificates
-- when remaining `step-ca` consumers should be allowed to expire naturally
-  versus being actively replaced
+- migrate existing internal PKI consumers to the per-cluster issuers when they
+  become important enough to justify the work

 ## Identity