Merged
23 changes: 18 additions & 5 deletions docs/docs/architecture/bootstrap-and-lifecycle.md
@@ -16,7 +16,7 @@ would be thrown away after day one.

Before touching the real `UM760`, prove the risky assumptions locally:

- generate or download a seeded IncusOS USB/IMG image
- generate or download a seeded IncusOS USB/IMG `Operation` image
- write it to a VM's only disk
- boot that disk as the steady-state IncusOS host
- confirm Incus initialization, trusted client certificate access, network
@@ -48,9 +48,9 @@ The intended host bootstrap sequence is:

1. Start the temporary VyOS-hosted `k0s` cluster.
2. Install Tinkerbell and the CAPI providers into that cluster.
3. Generate a seeded IncusOS USB/IMG image for the `UM760`.
4. Use Tinkerbell and HookOS to write that image directly to the internal
`UM760` disk through `image2disk` or `oci2disk`.
3. Generate a seeded IncusOS USB/IMG `Operation` image for the `UM760`.
4. Use Tinkerbell and HookOS to write that final-disk image directly to the
internal `UM760` disk through `image2disk` or `oci2disk`.
5. Boot the `UM760` into IncusOS as the steady-state host OS.
6. Initialize Incus on the `UM760` with the first-node defaults needed for the
final cluster.
@@ -63,6 +63,10 @@ The same Tinkerbell path provisions the `MS-02` hosts later. Joining nodes must
use IncusOS seed settings appropriate for joining the existing Incus cluster,
not for creating independent local Incus defaults.

A normal IncusOS `Installation` image must not be treated as equivalent to the
`Operation` image for the single-disk `UM760` path. The bootstrap assumption is
that the selected image is already a bootable final-disk artifact.
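
As a sketch of the intended write step, a Tinkerbell template using the upstream `image2disk` action might look like the following. The image URL, disk device path, timeouts, and template name are illustrative assumptions, not committed values:

```yaml
apiVersion: tinkerbell.org/v1alpha1
kind: Template
metadata:
  name: incusos-operation-install   # hypothetical name
  namespace: tink-system
spec:
  data: |
    version: "0.1"
    name: incusos-operation-install
    global_timeout: 1800
    tasks:
      - name: write-incusos
        worker: "{{.device_1}}"
        actions:
          - name: write-image
            image: quay.io/tinkerbell/actions/image2disk:latest
            timeout: 900
            environment:
              # Seeded Operation image served from the bootstrap network (assumed URL)
              IMG_URL: "http://bootstrap.lab/images/incusos-operation-seeded.img"
              # The UM760's single internal disk (assumed device path)
              DEST_DISK: /dev/nvme0n1
              COMPRESSED: false
```

The key property is that the artifact written by `image2disk` is the final bootable disk image, matching the single-disk assumption above.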

## Platform Cluster Bring-Up

The platform cluster starts as one Talos VM on the `UM760`.
@@ -72,11 +76,19 @@ cluster reachable and let GitOps take over:

- bootstrap-safe Cilium
- minimal Argo CD on the platform cluster
- the `platform-bootstrap` AppProject
- an admin-owned root Application pointing at the platform cluster selection in
`gitops`

The `platform-bootstrap` AppProject is part of the day-0 handoff contract. It
must exist before the root Application, because relying on Argo CD's startup
creation of the `default` project caused a first-bootstrap race.

After Argo CD is running, the GitOps bootstrap selection installs the full
cluster-core components: full Cilium, full/self-managed Argo CD, and `kro`.
The platform cluster bootstrap selection uses three separate Applications for
Cilium, Argo CD, and `kro`, all on the `platform-bootstrap` project, so
ownership and failure domains remain visible.

The platform repo owns canonical bootstrap/core artifacts and release history.
The gitops repo owns per-cluster version selection and cluster-local desired
@@ -180,7 +192,8 @@ syncs cluster-core and platform API state to them after they exist.

The bootstrap path is not complete until these are proven:

- IncusOS image generation and seeding for first-node and joining-node modes
- IncusOS `Operation` image generation and seeding for first-node and
joining-node modes
- Tinkerbell image writing from HookOS to the selected disks
- VyOS-hosted `k0s` stability for the temporary bootstrap stack
- CAPN plus Talos providers creating Talos VMs with the desired boot mode,
6 changes: 3 additions & 3 deletions docs/docs/architecture/keycloak-runtime.md
@@ -20,9 +20,9 @@ The runtime contract is:
- services: upstream Keycloak plus upstream Postgres, both pinned in `infra`
- database: colocated Postgres with data on the instance EBS root volume
- access name: `id.glab.lol`
- TLS: ACME DNS-01 through Route 53 using the instance IAM role
- reverse proxy: Caddy or equivalent on the host, terminating TLS and proxying
to Keycloak on loopback
- TLS: Traefik ACME DNS-01 through Route 53 using the instance IAM role
- reverse proxy: Traefik on the host, terminating TLS and proxying to Keycloak
on loopback
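
A sketch of the host-level Traefik contract follows; the listen address, ACME storage path, contact email, and Keycloak loopback port are assumptions:

```yaml
# traefik.yml (static configuration)
entryPoints:
  websecure:
    address: ":443"
certificatesResolvers:
  route53:
    acme:
      email: admin@glab.lol              # assumed contact address
      storage: /var/lib/traefik/acme.json
      dnsChallenge:
        provider: route53                # credentials come from the instance IAM role
providers:
  file:
    filename: /etc/traefik/dynamic.yml
---
# /etc/traefik/dynamic.yml -- route id.glab.lol to Keycloak on loopback
http:
  routers:
    keycloak:
      rule: Host(`id.glab.lol`)
      entryPoints: [websecure]
      tls:
        certResolver: route53
      service: keycloak
  services:
    keycloak:
      loadBalancer:
        servers:
          - url: http://127.0.0.1:8080   # assumed Keycloak HTTP port
```

The DNS-01 challenge keeps port 80 closed and works even if the host is not publicly reachable, which fits the IAM-role-based Route 53 access described above.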

State recovery uses application backups. Do not depend on EBS snapshots or AMI
backups as the primary recovery path.
62 changes: 33 additions & 29 deletions docs/docs/architecture/secrets-identity-pki.md
@@ -280,40 +280,44 @@ Vault PKI roles issue short-lived certificates for internal use cases such as
service-to-service mTLS, database client authentication, internal controllers,
and future SPIRE upstream authority material.

The current implementation history matters for rebuilds: the existing
`infra/security/pki/root-ca` stack was applied against an earlier AWS account.
The lab root must be recreated in the current `lab` account with the
`pathlen:2` hierarchy, then the earlier account root key and state can be
cleaned up after consumers trust the new chain.

Router-hosted `step-ca` is limited to existing consumers during migration.
Runtime issuance uses cert-manager plus Route 53 for public TLS and Vault for
internal PKI.

The small implementation slices are:

1. Rewrap existing SOPS files with `alias/glab-sops`, add encryption context,
and remove PGP/age recipients.
2. Rotate bootstrap secrets that previously depended on PGP/age-only history.
3. Create the GitHub App plus SSM bootstrap path for `secrets` repo access.
4. Recreate the internal root CA in the current `lab` account with the new path
length.
5. Add the shared Vault unseal KMS key and `bank-vaults` storage layout.
6. Stand up Vault in one cluster and prove the SOPS-to-Vault bootstrap handoff.
7. Add cert-manager DNS-01 with Route 53 ACME delegation for one cluster.
8. Migrate internal PKI consumers from `step-ca` to the per-cluster issuers.

Open implementation threads:

- exact KMS encryption-context key names and allowed scope values
- exact GitHub App name, installation ID storage, and SSM parameter paths
The current lab-account root is the committed root in
`infra/security/pki/root-ca`. It is backed by the `lab` account KMS key
`alias/glab-pki-root-ca` and uses the `pathlen:2` hierarchy above. Do not rely
on the older pre-`lab` account root as the target trust anchor for new internal
PKI work.

RouterOS device certificates remain an existing-consumer migration thread.
Runtime issuance for the target architecture uses cert-manager plus Route 53
for public TLS and Vault for internal PKI, but RouterOS CA migration is
deprioritized until the core lab architecture is farther along.

Foundations already completed:

- current SOPS files are rewrapped with `alias/glab-sops`, KMS encryption
context, and no routine PGP/age recipients
- the GitHub App bootstrap material is stored in SSM, and
`github-token-broker` mints short-lived `contents:read` tokens for
`GilmanLab/secrets`
- the internal root CA exists in the current `lab` account with `pathlen:2`
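
The rewrapped SOPS posture can be expressed in `.sops.yaml`. The path regex, AWS region, and encryption-context key names below are assumptions (the exact context keys are still an open thread):

```yaml
creation_rules:
  - path_regex: .*\.enc\.ya?ml$    # assumed naming convention
    kms:
      - arn: arn:aws:kms:us-east-1:186067932323:alias/glab-sops   # region assumed
        context:
          repo: secrets            # hypothetical context keys/values
          scope: bootstrap
    # no pgp: or age: recipients -- KMS-only by policy
```

Because KMS enforces encryption context on decrypt, any consumer must present the same key/value pairs, which is what makes the context a meaningful scoping mechanism rather than a label.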

Remaining implementation threads:

- rotate bootstrap secrets where old PGP/age-encrypted git history should no
longer be trusted
- attach `arn:aws:iam::186067932323:policy/glab-github-token-broker-invoke` to
the real bootstrap principals that should mint GitHub installation tokens
- decide whether `/glab/bootstrap/github-app/private-key-pem` should move from
the default SSM KMS key to a customer-managed KMS key
- whether Vault unseal material uses one shared S3 bucket with prefixes or
separate per-cluster buckets
- add the shared Vault unseal KMS key and `bank-vaults` storage layout
- stand up Vault in one cluster and prove the SOPS-to-Vault bootstrap handoff
- add cert-manager DNS-01 with Route 53 ACME delegation for one cluster
- how trust bundles are distributed to workloads that need to trust internal
Vault or SPIRE issuers
- whether any future public TLS use case justifies wildcard certificates
- when remaining `step-ca` consumers should be allowed to expire naturally
versus being actively replaced
- migrate existing internal PKI consumers to the per-cluster issuers when they
become important enough to justify the work
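
When the cert-manager DNS-01 thread lands, the per-cluster public issuer is expected to look roughly like this; the issuer name, ACME email, region, and hosted zone ID are assumptions:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-route53          # hypothetical name
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@glab.lol            # assumed contact address
    privateKeySecretRef:
      name: letsencrypt-route53-account-key
    solvers:
      - dns01:
          route53:
            region: us-east-1              # assumed region
            hostedZoneID: Z0000000000000   # placeholder zone ID
```

Ambient credentials (an instance or IRSA role) can stand in for static keys here, mirroring the IAM-role pattern used elsewhere in the lab.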

## Identity
