From 66cec7a55e48f8c2719d66ce88a91ad14292fa72 Mon Sep 17 00:00:00 2001 From: Joshua Gilman Date: Mon, 20 Apr 2026 13:38:57 -0700 Subject: [PATCH 1/2] docs(designs): add AWS lab account design Proposes a dedicated AWS Organization (management + lab member accounts) in us-west-2 as the lab's durable off-lab trust anchor. Covers account layout, IAM Identity Center with a local identity store and WebAuthn MFA, a single public-subnet VPC at 172.16.0.0/16 with a Tailscale subnet router using workload identity federation, a Route 53 private zone for glab.lol mirrored to an in-lab CoreDNS zonefile for cold-start and outage resilience, and a secrets bootstrap chain (KMS as a SOPS recipient plus a GitHub App key in SSM) that terminates every instance-side identity at a single IAM role. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/docs/designs/aws-lab-account.md | 455 +++++++++++++++++++++++++++ 1 file changed, 455 insertions(+) create mode 100644 docs/docs/designs/aws-lab-account.md diff --git a/docs/docs/designs/aws-lab-account.md b/docs/docs/designs/aws-lab-account.md new file mode 100644 index 0000000..228e757 --- /dev/null +++ b/docs/docs/designs/aws-lab-account.md @@ -0,0 +1,455 @@ +--- +title: AWS Lab Account +description: Proposed design for a dedicated AWS Organization and member account that anchors lab DNS, bootstrap identity, and offsite compute for identity systems. +--- + +# AWS Lab Account + +## Status + +Proposed. + +This document defines how the lab uses a dedicated AWS Organization and member +account as its durable, out-of-lab trust anchor. It covers account and identity +structure, the VPC and its Tailscale site-to-site link to the lab, the Route 53 +private zone and the in-lab mirror that consumes it, and the secrets bootstrap +path that lets AWS be the single identity required to retrieve and decrypt all +other bootstrap material. + +Detailed Keycloak design is out of scope for this document and is covered +separately. + +## Purpose + +The primary purpose of this design is to keep lab identity, lab DNS, and +lab secrets from having a circular dependency on the lab itself. + +The intended split is: + +- AWS owns the durable trust anchor — the account, the identity system, the + KMS key, the private zone of record +- the lab consumes what AWS provides without requiring continuous connectivity + to operate steady-state +- a small number of AWS-resident components (a Tailscale subnet router, a + Keycloak host) provide the minimum bridge between the two sides + +This keeps the lab's hardest-to-bootstrap layers (identity, DNS, and secret +decryption keys) outside the lab, while keeping day-to-day serving and latency +local. + +## Goals + +- Provide a durable, off-lab trust anchor that survives total lab failure. +- Break the chicken-and-egg between lab DNS, lab identity, and lab secrets. +- Keep AWS itself identity-independent from anything the lab hosts, so AWS + remains reachable even when Keycloak, the platform cluster, or the lab + network is unavailable. +- Eliminate long-lived static credentials on AWS-hosted bootstrap nodes. +- Allow the lab to continue serving DNS and operating existing workloads + during a total internet outage. +- Mirror the durable-trust-anchor shape of a future day-job architecture so the + lab exercises the same patterns at small scale. + +## Non-Goals + +- This document does not define the Keycloak deployment, its database, or its + disaster-recovery plan. Those belong to a separate Keycloak design doc. 
+- This document does not define monitoring, logging, or alerting for the AWS + account. +- This document does not define exact IAM policy JSON, SCPs, Tailscale ACL + rules, CoreDNS configuration, or OpenTofu module layout. The doc names + contracts, not implementations. +- This document does not define OIDC trust between cluster workloads and AWS. + That is a separate future design connected to ExternalDNS and similar cluster + components. +- This document does not federate IAM Identity Center to Keycloak. Doing so + would reintroduce the circular dependency this design is built to avoid. + +## Design Summary + +The lab uses a dedicated AWS Organization with two accounts: + +- a **management account** that holds the Organization, billing, and IAM + Identity Center, and runs no workloads +- a **member account** that holds all lab-owned AWS resources: the VPC, the + Tailscale subnet router, the Keycloak host, the Route 53 private zone, the + KMS key used for SOPS, and the SSM parameters used for bootstrap + +Identity into both accounts comes from **IAM Identity Center with its built-in +identity store**. There is no external identity provider. A single human user +signs in with a hardware security key and assumes time-limited permission sets +into the member account. Root credentials on both accounts are break-glass only +and stored offline. + +The member account peers with the lab via a pair of **Tailscale subnet +routers**, one in AWS and one on VyOS. Devices on either side can reach devices +on the other side by their real IPs without being Tailscale nodes themselves. +The AWS-side subnet router authenticates to the tailnet via **Tailscale +workload identity federation**, using its attached IAM role — no pre-shared +auth key. + +The lab's authoritative DNS lives in a **Route 53 private zone** for `glab.lol` +bound to the lab's VPC. A sync job on the subnet router renders that zone to a +local zonefile, serves it over the tailnet, and an in-lab fetcher pulls the +file to disk. CoreDNS in the lab serves from the on-disk zonefile. The read +path never reaches AWS at query time, so DNS serving survives internet outages +and cold starts. + +Bootstrap secrets are gated end-to-end by the AWS-resident IAM role: + +- encrypted secrets live in the existing `secrets/` repo on GitHub, encrypted + with a **KMS customer-managed key** used as a SOPS recipient +- that repo is cloned via a **GitHub App** whose private key is stored in SSM + Parameter Store (SecureString) +- both KMS decrypt and SSM read are granted to the bootstrap instance via its + IAM role + +The result is that an AWS-resident bootstrap node holds zero persistent +secrets on disk: every identity it uses — Tailscale, GitHub, the SOPS +decryption key, AWS itself — traces back to its IAM role. + +## Account Structure + +The lab uses a two-account AWS Organization: + +| Account | Purpose | +|-------------|-----------------------------------------------------------------------------| +| `lab-mgmt` | Organization management account. Holds billing, IAM Identity Center, org-level config. No workloads. | +| `lab` | Lab workload member account. Holds VPC, EC2, Route 53, KMS, SSM, and all other lab resources. | + +The split follows AWS's own recommendation that the management account should +not run workloads. It also leaves room to add additional member accounts later +(for example, a prod-mirror account that stages day-job patterns) without +restructuring. + +Region: **`us-west-2`**. 
+ +All resources in this design live in `us-west-2` in the `lab` account unless +explicitly stated otherwise. + +## Identity + +### Primary path + +IAM Identity Center is enabled in the `lab-mgmt` account and uses its +**built-in identity store**. There is no external IdP wired in. The identity +store holds one human user with **WebAuthn MFA enforced** via a hardware +security key. + +Access to the `lab` account is granted via permission sets assigned from +Identity Center. Daily operator access — console and CLI — is short-lived: + +- console access through the Identity Center access portal +- CLI access via `aws sso login`, which produces short-lived role credentials + +No long-lived IAM user access keys exist in either account for human use. + +### Break-glass + +Root user credentials exist on both accounts and are used for emergency +recovery only (loss of Identity Center access, billing-only actions not +permitted to Identity Center). Both root accounts: + +- use strong unique passwords +- have hardware-key MFA enabled +- are stored offline (outside any system whose recovery depends on AWS or + Keycloak being reachable) + +### Why the identity store is local + +Federating IAM Identity Center to Keycloak would make AWS access depend on +Keycloak. Keycloak depends on AWS for its compute, its DNS, and its secrets +bootstrap. Coupling the two defeats the entire reason for placing identity on +a durable off-lab trust anchor. + +A future addition of Keycloak SAML federation as a **secondary, convenience** +path for Identity Center is possible and explicitly deferred. The local +identity store always remains the primary admin path. + +## Network + +### VPC + +- **CIDR:** `172.16.0.0/16` +- **Subnets:** one public subnet, single AZ +- **Internet gateway:** attached; the subnet router carries outbound traffic + via an Elastic IP attached to its ENI +- **NAT gateway:** none. With a single public-subnet instance there is no + workload needing egress through a private subnet; skipping NAT removes the + largest ongoing fixed cost that would otherwise apply (~$32/mo) + +`172.16.0.0/16` is deliberately far from both the lab's `10.10.0.0/16` and +Tailscale's `100.64.0.0/10` CGNAT range, so no address-space collisions can +occur when routes are advertised across the tailnet. + +### Site-to-site with the lab + +The lab and the VPC connect via **Tailscale subnet routers on both sides**: + +- **AWS side:** the subnet router EC2 instance advertises `172.16.0.0/16` and + accepts `10.10.0.0/16`. +- **Lab side:** VyOS runs Tailscale and advertises `10.10.0.0/16` while + accepting `172.16.0.0/16`. + +Both sides run with `--snat-subnet-routes=false` so traffic preserves real +source IPs. The VPC route table directs `10.10.0.0/16` to the subnet router's +ENI, and the ENI has source/destination check disabled so it can forward. +Security groups allow `10.10.0.0/16` as a source on the ENI. + +From either side, a host can address the other side by its real IP without +being a Tailscale node itself. Lab DNS clients reach `172.16.0.0/16` +transparently; VPC workloads (Keycloak) can reach lab workloads when needed. + +MSS clamping is configured on VyOS to avoid black-holed large packets through +the WireGuard-based tunnel's smaller MTU. Tailscale ACLs permit traffic +between the two advertised CIDRs. + +### Tailscale node identity + +The AWS-side subnet router authenticates to the tailnet via **Tailscale +workload identity federation**, using its attached IAM role. No pre-shared +auth key is stored on the instance. 
Tailscale ACL tags are derived from IAM +claims (role ARN, account ID), so policy can be written against the +IAM identity rather than per-device labels. + +The VyOS node uses a traditional Tailscale auth key, because workload identity +federation only supports cloud-hosted clients. That key is managed out of band +and lives on a single on-prem device; it is not checked into any repo. + +## DNS + +### Authoritative zone + +The canonical lab domain is **`glab.lol`**. A Route 53 **private hosted zone** +for `glab.lol` lives in the `lab` account, bound to the VPC. All lab DNS +records are managed there. + +Private was chosen over public intentionally: the day-job architecture this +lab mirrors requires record names themselves to be non-public. A public zone +would be operationally simpler but would not exercise the same pattern. + +### Lab read path + +CoreDNS in the lab serves `glab.lol` from a **local zonefile on disk**. The +file is kept up to date by a sync pipeline that runs entirely outside the lab: + +1. A job on the AWS-side subnet router reads the Route 53 zone using its IAM + role and renders it to a standard zonefile. Refresh cadence is ≤1 minute. +2. The subnet router serves the rendered file over the tailnet. +3. An in-lab fetcher periodically pulls the file and writes it to the + filesystem CoreDNS reads from. + +CoreDNS never queries Route 53 at request time. The fetch path is +asynchronous and decoupled from serving. + +### Failure characteristics + +- **Steady-state AWS or internet outage:** fetches fail; CoreDNS continues to + serve from the last-fetched zonefile. The zone data becomes progressively + stale in proportion to the outage length, but queries continue to resolve. +- **Cold start during an outage:** CoreDNS loads the last zonefile from local + disk and resumes serving. The sync job is not on the critical path. +- **Full lab internet loss:** the tailnet path to the subnet router is itself + unreachable, which stops syncs but not serving. The zonefile on disk is the + resilience layer. +- **Stale-vs-unavailable tradeoff:** this design accepts staleness as the + price of availability during outages. Zone changes during an outage simply + do not propagate until connectivity returns. + +### Why the mirror exists + +Steady-state DNS resilience can be provided by CoreDNS itself — the `route53` +plugin reads zones into memory, and the `cache` plugin with `serve_stale` +enabled keeps answering through upstream outages. The mirror layer's specific +job is **cold-start and bootstrap resilience**: if CoreDNS restarts (node +reboot, container replaced) while Route 53 is unreachable, it has no in-memory +zone to fall back on. A zonefile on disk removes that failure mode. + +## Secrets Bootstrap + +### Contract + +An AWS-resident bootstrap instance must be able to, starting from only its +IAM role: + +1. Reach the tailnet. +2. Clone the private `secrets/` repo from GitHub. +3. Decrypt SOPS-encrypted files in that repo. + +At no point may the instance hold a durable, plaintext credential for any of +the three systems (Tailscale, GitHub, SOPS). All identity traces back to the +instance's attached IAM role. + +### KMS as a SOPS recipient + +The existing `secrets/` repo continues to hold SOPS-encrypted files. A single +**customer-managed KMS key** in the `lab` account is added as an additional +SOPS recipient. Any principal granted `kms:Decrypt` on that key — human +(via Identity Center permission set) or machine (via instance profile) — can +decrypt. 
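A minimal sketch of what the `secrets/` repo's `.sops.yaml` creation rule might look like once the KMS key is added as a recipient alongside the existing age key (the key ARN and age public key below are placeholders, not real values):

```yaml
# .sops.yaml — creation rules applied when encrypting new files
creation_rules:
  - path_regex: .*\.(yaml|json|env)$
    # KMS customer-managed key in the `lab` account; any principal granted
    # kms:Decrypt on it (human or instance role) can decrypt
    kms: arn:aws:kms:us-west-2:111111111111:key/00000000-0000-0000-0000-000000000000
    # existing age recipient kept so current workflows keep working
    age: age1qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
```

Re-encrypting already-committed files so they pick up the new recipient is a `sops updatekeys` pass across the repo.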
+ +This is non-breaking for existing workflows: SOPS supports multiple +recipients, so the KMS key can be added alongside the existing age key. +Retiring the age key is possible later but not required. + +The details of the SOPS-over-KMS workflow, key rotation, and human vs. +automation paths live in a separate secrets design doc. This document only +establishes that the KMS key lives in the `lab` account and is the anchor +for machine decryption. + +### GitHub App for repo access + +Cloning private repos from an AWS-resident bootstrap instance uses a +**GitHub App** owned by the `GilmanLab` organization and installed on the +`secrets` repo (plus any other private repos bootstrap needs to reach). + +- The App's **private signing key** is stored in an SSM Parameter Store + `SecureString` in the `lab` account. +- The instance's IAM role grants `ssm:GetParameter` + `kms:Decrypt` on that + specific parameter path only. +- On bootstrap, the instance fetches the key, generates a JWT, exchanges it + for a short-lived installation token (1-hour TTL), and clones. + +The only durable non-AWS secret anywhere in the chain is the App's private +signing key itself, and that key is at rest in AWS, gated by IAM. Installation +tokens are never stored on disk. + +### The single-anchor property + +Taken together, the chain on a single EC2 bootstrap instance is: + +| Step | Identity used | +|------|----------------------------------------------------------------| +| Join tailnet | IAM role (via workload identity federation) | +| Read GitHub App key | IAM role (via instance profile → SSM + KMS) | +| Mint installation token | App private key (short-lived, in memory) | +| Clone `secrets/` repo | Installation token (short-lived, in memory) | +| Decrypt SOPS files | IAM role (via instance profile → KMS) | + +Every persistent identity is the IAM role. Lose AWS, lose bootstrap. Gain AWS, +everything else unlocks in order. This is the design outcome the account +structure is in service of. + +## Compute and Cost Model + +### Instances + +| Name | Type | Purpose | +|----------------|------------|-------------------------------------------------------------| +| subnet router | `t4g.nano` | Tailscale site-to-site, Route 53 zonefile rendering | +| Keycloak host | `t4g.small`| Keycloak + colocated Postgres. Detailed design in Keycloak doc. | + +Both run Amazon Linux 2023 on ARM (`t4g` / Graviton). Tailscale and Keycloak +both ship native ARM builds. + +The two are kept **as separate instances** rather than colocated. A colocated +box would save ~$1.75/mo but would collapse the subnet router and identity +failure domains into one. The premium for separation is cheap insurance, and +separation is also more faithful to the day-job architecture this design +mirrors. + +EC2 instances do **not** run EBS snapshot or AMI backup jobs. The lab's +philosophy is rebuild-over-restore: every instance's durable state is either +in an external store (Route 53, KMS, SSM, S3, GitHub) or is designed to be +reconstructed from those sources. + +### Savings Plan commitment + +Both instances are long-lived infrastructure and not expected to change +instance family over their lifetime. The commitment shape is: + +- **3-year EC2 Instance Savings Plans, all-upfront**, covering the `t4g` + family in `us-west-2` +- expected effective discount ~72% vs. 
on-demand +- one purchase sized to cover both instances; additional commitment can be + layered later + +### Cost envelope + +Approximate, all-upfront amortized: + +| Item | Monthly | 3-year | +|-------------------------------|---------|--------| +| Subnet router (t4g.nano) | ~$1.75 | ~$63 | +| Keycloak host (t4g.small) | ~$3.89 | ~$140 | +| KMS customer-managed key | ~$1.00 | ~$36 | +| SSM Parameter Store (standard)| ~$0 | ~$0 | +| Route 53 private zone + queries | ~$0.50 | ~$18 | +| Data transfer | negligible | negligible | +| **Total (approximate)** | **~$7.15** | **~$260** | + +EBS, Elastic IPs attached to running instances, and Route 53 API calls for the +1-minute zonefile sync all fall into noise-level cost at lab scale. + +## Infrastructure as Code + +- All AWS resources are managed with **OpenTofu** from the `infra/` repo + under `infra/aws/`. +- OpenTofu state is stored in an **S3 bucket in the `lab` account**, using + S3's native locking. +- The OpenTofu entrypoint assumes a permission set role via Identity Center + for human-operator runs. Future CI-triggered runs will use a separate + identity (out of scope for this document). + +### Manual bootstrap surface + +A small amount of setup exists outside of OpenTofu, because it must exist +before OpenTofu can run: + +1. Creation of the AWS Organization and the two accounts. +2. Enablement of IAM Identity Center and the single operator user. +3. Creation of the S3 state bucket and the minimum IAM role OpenTofu will + assume. + +Everything downstream of that — the VPC, the subnet router, the private zone, +the KMS key, the SSM parameters, the instance profiles — is declared in +OpenTofu. + +## Failure Domains + +What fails together and what does not: + +| Failure | Lab DNS | Lab serving | AWS console access | Bootstrap of new lab instances | +|--------------------------------|---------|-------------|--------------------|------------------------------| +| Lab internet outage | ✓ (cached zonefile) | ✓ | ✗ (can't reach AWS) | ✗ | +| Subnet router EC2 down | ✓ (cached zonefile) | ✓ | ✓ | ✗ (tailnet → AWS bridge down) | +| `lab` account compromise | ✓ (cached zonefile, short-term) | ✓ | partial | ✗ | +| `lab-mgmt` account lost | ✓ (cached zonefile, short-term) | ✓ | ✗ | ✗ | +| Keycloak host down | ✓ | ✓ (except OIDC-gated services) | ✓ | ✓ | +| AWS region outage | ✓ (cached zonefile) | ✓ | ✗ | ✗ | +| Full lab power/hardware loss | ✗ | ✗ | ✓ | depends on external rebuild | + +The dominant pattern: **lab-side serving is robust to any offsite failure** +thanks to the zonefile-on-disk DNS path and locally-resident workloads. +Offsite failure primarily costs the ability to make changes, not the ability +to keep running. + +## Future Work + +The following are known next steps that are intentionally out of scope here: + +- **Keycloak design doc.** Deployment shape, Postgres colocation, backup to + object storage with Synology sync, rebuild-over-restore DR procedure, and + GitOps via `keycloak-config-cli`. +- **Cluster workload OIDC to AWS.** ExternalDNS-style workloads on the Talos + cluster will need AWS credentials; the cluster's own OIDC issuer (IRSA-style + federation) is the expected mechanism, not Tailscale-based federation. +- **GitHub Actions OIDC.** Trusting GitHub Actions as an OIDC identity + provider in the `lab` account so CI can apply OpenTofu without long-lived + keys. +- **Keycloak SAML federation to IAM Identity Center** as a secondary, + convenience access path alongside the local identity store. 
+- **Additional member accounts** under the same Organization as the lab's + day-job mirroring grows (prod-mirror, dev, etc.). +- **Secrets design doc** covering the SOPS-over-KMS workflow, rotation, + Vault relationship, and promotion model across bootstrap vs. per-cluster + secrets. + +## References + +- [Keycloak](./keycloak.md) +- [Bootstrap and Core Delivery Model](./bootstrap-core-delivery.md) +- [Multi-Cluster GitOps Model](./gitops-multi-cluster.md) +- [Service Exposure and Control Plane Endpoints](./service-exposure-and-control-plane-endpoints.md) +- [Tailscale Workload Identity Federation](https://tailscale.com/kb/1581/workload-identity-federation) +- [AWS IAM Identity Center external IdP options](https://docs.aws.amazon.com/singlesignon/latest/userguide/manage-your-identity-source-idp.html) From ba23fbc1b08269314e45d07db5fdacf1cb5a0e40 Mon Sep 17 00:00:00 2001 From: Joshua Gilman Date: Mon, 20 Apr 2026 13:39:12 -0700 Subject: [PATCH 2/2] docs(designs): add Keycloak design Proposes Keycloak as the lab's central identity provider, deployed via Docker Compose on a dedicated t4g.small in the lab AWS account, colocated with Postgres, served at id.glab.lol. A single lab realm federates GitHub as its only OIDC upstream. Declarative configuration is reconciled from git via keycloak-config-cli; runtime state is backed up nightly to S3 with a rolling retention window and pulled locally to the Synology NAS. Disaster recovery is rebuild-first, with a documented same-hostname restore fallback. Includes a per-service break-glass matrix (Talos API, Argo CD admin, Vault recovery keys, local Identity Center user, etc.) so identity outages do not cascade into service outages. Also updates the designs index to list both the AWS Lab Account and Keycloak documents. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/docs/designs/index.md | 2 + docs/docs/designs/keycloak.md | 441 ++++++++++++++++++++++++++++++++++ 2 files changed, 443 insertions(+) create mode 100644 docs/docs/designs/keycloak.md diff --git a/docs/docs/designs/index.md b/docs/docs/designs/index.md index e22fc70..f022b90 100644 --- a/docs/docs/designs/index.md +++ b/docs/docs/designs/index.md @@ -23,6 +23,8 @@ Current designs: - [kro Consumption Model](./kro-consumption-model.md) - [Platform RGD Delivery Model](./platform-rgd-delivery.md) - [App RGD Design](./app-rgd.md) +- [AWS Lab Account](./aws-lab-account.md) +- [Keycloak](./keycloak.md) Once a design is implemented and considered durable, its steady-state shape should be folded back into the architecture overview and any relevant runbooks. diff --git a/docs/docs/designs/keycloak.md b/docs/docs/designs/keycloak.md new file mode 100644 index 0000000..7ecf9fd --- /dev/null +++ b/docs/docs/designs/keycloak.md @@ -0,0 +1,441 @@ +--- +title: Keycloak +description: Proposed design for the lab's central identity system — deployment shape, federation, configuration, backups, and the rebuild-over-restore disaster-recovery model. +--- + +# Keycloak + +## Status + +Proposed. + +This document defines how the lab runs Keycloak as its central identity +provider. It covers deployment shape, identity federation, declarative +configuration, TLS, backups, the rebuild-over-restore disaster-recovery +model, and the per-service break-glass matrix that lets the lab continue to +operate while Keycloak is down. + +This document assumes the [AWS Lab Account](./aws-lab-account.md) design. It +does not re-establish shared context about the AWS Organization, networking, +IAM, or secrets bootstrap. 
+ +## Purpose + +The primary purpose of this design is to keep the lab's identity system on a +durable off-lab trust anchor, while keeping the lab itself operable when that +trust anchor is unreachable. + +The intended split is: + +- Keycloak holds the authoritative, human-facing identity of record +- cluster-level and service-level **break-glass paths** exist for every OIDC + consumer, so identity outages do not cascade into cluster-access outages +- configuration is declaratively sourced from git, so **rebuild is the default + recovery mode** and restore is a fallback used only for runtime state a + single-user lab can recreate in seconds + +This mirrors a day-job architecture at small scale without overpaying for +HA features a single-user lab cannot justify. + +## Goals + +- Provide one place to manage human identity across the lab. +- Keep Keycloak outside the lab's physical failure domain while keeping its + blast radius understood. +- Make Keycloak's configuration surface fully declarative via git, so rebuild + is a first-class recovery path. +- Ensure every Keycloak-dependent service has a documented break-glass path + that does not require Keycloak. +- Make the disaster-recovery procedure short enough to execute without a + runbook open on a phone. + +## Non-Goals + +- This document does not run Keycloak in a highly-available configuration. + Single-node is a deliberate choice for a single-user lab and is not a gap. +- This document does not define the per-service Keycloak client configuration + (redirect URIs, scopes, role mappings, token TTLs). Those live in the + realm repository. +- This document does not define monitoring, logging, or alerting. +- This document does not define the Keycloak → Identity Center SAML + federation path. That remains future work. +- This document does not define cluster-level OIDC federation to AWS (used + by ExternalDNS and similar controllers). That is a separate concern handled + by the cluster's own OIDC issuer, not by Keycloak. + +## Design Summary + +Keycloak runs on a single dedicated EC2 instance in the `lab` member account, +colocated with its Postgres database. Access is at **`id.glab.lol`** via a +Route 53 private-zone record and a TLS certificate issued automatically via +**ACME DNS-01**. The instance and database are deployed via **Docker +Compose**. + +The only upstream identity source is **GitHub, federated via OIDC**. A single +realm named `lab` holds all users and all OIDC/SAML clients. Sign-in to any +Keycloak-fronted service is: user → service → Keycloak → GitHub. + +Keycloak's declarative surface — realms, clients, roles, identity provider +settings, scopes, authentication flows — is reconciled from a git repository +by **`keycloak-config-cli`** running as a scheduled job on the Keycloak host. +Runtime state (user credentials, sessions, TOTP enrollment) is not in git and +is the only part of the system that needs backup-based recovery. + +Database dumps and the current TLS cert bundle are backed up nightly to an +S3 bucket in the `lab` account. The lab's Synology NAS pulls those backups +locally on a schedule, so a recent copy of Keycloak's runtime state exists on +a second continent and a second provider. + +**Disaster recovery is rebuild-first.** For a single-user lab, a rebuild from +the git-tracked realm + a fresh Postgres is faster than a restore, enforces +discipline that every config-surface change actually lives in git, and +produces a clean outcome. A restore path exists as fallback but is not the +primary recovery mode. 
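A hedged sketch of what a git-tracked realm import file for `keycloak-config-cli` might look like — the file name, client IDs, hostnames, and GitHub client placeholders are illustrative, not the real realm repository layout:

```yaml
# lab-realm.yaml — reconciled into Keycloak by keycloak-config-cli
realm: lab
enabled: true
# single upstream identity provider: GitHub
identityProviders:
  - alias: github
    providerId: github
    enabled: true
    config:
      clientId: "<github-oauth-app-client-id>"      # placeholder
      clientSecret: "<resolved-at-reconcile-time>"  # never stored in git
# OIDC clients fronted by Keycloak; the authoritative list lives in the realm repo
clients:
  - clientId: argocd
    protocol: openid-connect
    redirectUris:
      - "https://argocd.glab.lol/auth/callback"     # illustrative URI
```

Because everything above is re-applied from git on every reconcile, the rebuild path needs only a fresh Keycloak plus this file to recover the full declarative surface.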
+ +When Keycloak is entirely unavailable, every Keycloak-fronted service has a +local break-glass path documented in this doc. The lab continues to operate; +only the "unified identity" experience degrades. + +## Deployment Shape + +### Host and Runtime + +- **Instance:** `t4g.small` (2 vCPU, 2 GB RAM), Amazon Linux 2023 on ARM, in + the `lab` account and `172.16.0.0/16` VPC from the AWS design. +- **Runtime:** Docker Compose manages two services: + - `keycloak` — official upstream Keycloak image, tagged to a specific + version pinned in `infra/`. + - `postgres` — official Postgres image, tagged to a specific version pinned + in `infra/`. Data volume on the instance's EBS root volume. +- **Reverse proxy:** Caddy (or equivalent) runs alongside and terminates TLS, + proxying to Keycloak on loopback. Caddy performs ACME DNS-01 renewals using + the instance's IAM role. This is the recommended shape; the exact proxy is + an implementation detail that does not need to appear in this doc. +- **No EBS snapshots or AMI backups.** State recovery is via the application + backup path below, not via block-level snapshots. + +### Identity Profile for the Host + +The EC2 instance's IAM role grants, at minimum: + +- Route 53 write access scoped to the `_acme-challenge.id.glab.lol` record + for DNS-01 validation. +- S3 write access to the Keycloak backup bucket's prefix. +- SSM Parameter Store read for any bootstrap-time secrets held there + (following the pattern established in the AWS design). + +The role carries no other permissions. All lab cluster-access, +secret-decryption, and tailnet-identity paths that this instance depends on +are established in the AWS Lab Account design and are not restated here. + +### Sizing and Tuning + +2 GB of RAM is tight but workable for a single-user lab because Keycloak 26.x +(Quarkus-based) has a much smaller footprint than earlier WildFly-based +versions. The required tuning is: + +- explicit Keycloak JVM max heap (e.g. ~768 MB) +- conservative Postgres `shared_buffers` (~128 MB) +- a swap file on the EBS volume as a safety margin + +CPU is burstable but effectively idle for single-user workloads; unlimited +mode is enabled to tolerate rare login bursts at negligible cost. + +## Identity Federation + +### Realm Structure + +A single realm named `lab` holds all lab users and all OIDC/SAML clients. No +separate realms for services vs. humans; a single-user lab does not benefit +from the separation, and multi-realm setups make GitOps reconciliation more +fragile. + +### Upstream IdP + +The realm has **exactly one identity provider configured: GitHub, via +OIDC**. There is no local username-password fallback. A lab user's identity +is their GitHub identity, federated through Keycloak, presented downstream to +each OIDC client. + +The Keycloak admin bootstrap user exists only briefly during initial realm +creation and is disabled once `keycloak-config-cli` has reconciled the +realm from git. + +### Intended Early OIDC Clients + +Keycloak is expected to front these services first: + +- **kubectl → Talos clusters** via each cluster's Kubernetes API-server OIDC + configuration. +- **Argo CD** (web UI and CLI). +- **Grafana** (when deployed). + +Additional clients will be added over time. The authoritative list lives in +the realm repository, not in this document. + +## TLS + +### Issuance + +The cert for `id.glab.lol` is issued via **ACME DNS-01 against Route 53** +using the instance's IAM role. 
There is no internet-exposed HTTP-01 path, +because the host sits in a private VPC with the subnet router as its only +tailnet attachment. + +No wildcard cert is used. Each service in the lab gets its own host-scoped +cert: + +- this host: `id.glab.lol` +- future clusters: `nonprod.k8s.glab.lol`, `prod.k8s.glab.lol` +- future services: their own host names + +Cluster-fronted services will issue their own certs via cert-manager and +each cluster's OIDC trust relationship with AWS (a separate future design). + +### Renewal + +Renewal is automatic and handled by whichever TLS-terminating proxy is used +on the host. No human intervention is expected between renewals. + +### Restore-time behavior + +For same-hostname restore to work without waiting on ACME, the current TLS +cert bundle (cert + private key) is included in the nightly backup payload +alongside the Postgres dump. A restored host can serve HTTPS immediately +with the backed-up cert and let the proxy handle its own renewal on the +normal schedule. + +## Configuration as Code + +### Source of Truth + +Keycloak's declarative surface is reconciled from a git repository. The +**repository is the source of truth**; the running Keycloak's admin surface +is a read-through cache of what's in git. + +In scope for git reconciliation: + +- realms +- clients +- client scopes +- roles and role mappings +- identity-provider configuration +- authentication flows and required actions +- realm-level settings + +Out of scope for git (intentionally runtime state): + +- user credentials (password hashes, WebAuthn registrations, TOTP secrets) +- sessions and refresh tokens +- audit and event logs +- ephemeral tokens and one-time codes + +### Reconciliation Tool + +Reconciliation uses **`keycloak-config-cli`** (adorsys). The tool is mature, +works against the admin API, handles partial updates, and does not require +Kubernetes CRDs or a separate operator. It is the best available GitOps +option as of this writing given the upstream Keycloak Operator still does +not provide first-class CRDs for clients, users, roles, or identity +providers. + +### Reconciliation Location + +`keycloak-config-cli` runs **on the Keycloak host itself** as a scheduled +job. Pull cadence is a small number of minutes; the exact cadence is an +implementation detail. The job authenticates to Keycloak using a +reconciliation service account stored in SSM Parameter Store. + +This location is intentionally simple for now. Pushing reconciliation into +GitHub Actions (so every git push triggers a reconcile) is named as future +work — it would enforce "git push is the only way config changes" more +strictly — but it requires a reachable admin endpoint and an appropriate +trust path, which are better designed once the realm repo has concrete +shape. + +### Schema Versioning + +Keycloak migrations run forward only. The realm repository pins the +Keycloak version it expects. Upgrades are driven by bumping the pin in +`infra/` and allowing the next reconcile cycle to re-apply cleanly against +the upgraded Keycloak. + +## Backups + +### What is backed up + +- Postgres dump (the full database, including all runtime state). +- Keycloak configuration files that live outside the database: `keycloak.conf`, + environment overrides, any custom themes or providers. +- The current TLS cert bundle (cert + private key). + +Configuration in git is **not** part of the backup — git is already the +durable store for it. + +### Where they go + +- **Primary destination:** an S3 bucket in the `lab` account. 
The bucket + uses server-side encryption with a KMS key; object-lock/versioning is on + so corruptions cannot silently overwrite known-good backups. +- **Secondary destination:** the lab's Synology NAS, which pulls from the S3 + bucket on its own schedule. + +The host writes backups to S3 using the instance's IAM role (no long-lived +credentials). The Synology pulls from S3 using a scoped, read-only access +mechanism chosen when Synology-side automation is implemented — out of scope +for this document. + +### Retention + +Retention is a rolling window, implemented via S3 lifecycle policies. The +contract: + +- **daily** backups retained for **30 days** +- **weekly** backups retained for **12 weeks** +- **monthly** backups retained for **12 months** + +"Last backup only" is explicitly rejected. A corruption-style incident +(realm data mangled by a bad change, not a hardware failure) requires +point-in-time restore from days ago. + +### Encryption + +Backups contain password hashes, signing keys, TOTP secrets, and session +state. They are encrypted twice: + +- client-side: the Postgres dump and cert bundle are encrypted before upload + using a recipient key managed alongside the rest of the lab's bootstrap + secrets +- server-side: the S3 bucket uses SSE-KMS with a customer-managed key + +This ensures that neither a leaked S3 object ACL nor a Synology +compromise yields usable plaintext. + +## Disaster Recovery + +The lab uses a **rebuild-first** recovery model. Restore exists as a +fallback, not as the primary path. + +### Rebuild path (primary) + +When Keycloak is unrecoverable or is being moved: + +1. Provision a fresh EC2 instance from the `infra/` OpenTofu modules. +2. Run `docker compose up` to start Keycloak and a **fresh** Postgres. +3. Run `keycloak-config-cli` against the new Keycloak, pointing at the realm + repository. All realms, clients, roles, and GitHub federation come back. +4. Sign in via GitHub. A new user entry is created on first login per the + identity-provider mapper configuration. +5. Re-enroll WebAuthn / TOTP (30 seconds). +6. Keycloak is operational. + +**Target RTO: 15 minutes.** This path requires only the git repository and +AWS access; it does not require any backup store. + +### Restore path (fallback) + +When a rebuild is unacceptable — for example, if you need to preserve the +exact user-state including federated-identity linkages and audit history — +the restore path is: + +1. Provision a fresh EC2 instance. +2. Pull the most recent, or a chosen point-in-time, backup from S3 or + Synology. +3. Restore the Postgres dump into a fresh Postgres. +4. Place the TLS cert bundle and any config files. +5. Run `docker compose up`. Keycloak boots against the restored database. +6. `keycloak-config-cli` runs its normal reconcile cycle; any configuration + drift between the backup time and `git HEAD` is corrected forward. + +For the single-user lab, the restore path is rarely worth it. For the +day-job mirror architecture, it is the default. + +### Hostname preservation + +Both paths require that the restored instance serve at **`id.glab.lol`**. +The `issuer` claim on every JWT Keycloak has ever signed is tied to that +URL. Changing the hostname invalidates every existing token and every +client's cached OIDC discovery. + +This is normally handled by DNS: the new instance comes up in the same VPC +with the same Route 53 A record pointing to it. 
During a full internet-loss +DR scenario where Route 53 is unreachable, a local override path exists: +the lab's CoreDNS zonefile can be manually edited to point `id.glab.lol` at +the restored instance's Tailscale address. + +### What rebuild does not recover + +- Stored user credentials (password hashes, WebAuthn, TOTP). These must be + re-enrolled by the user on first login. For a single-user lab, 30 seconds. +- Active sessions. All users are forced to re-authenticate. +- Federated-identity linkages established at previous logins. GitHub OIDC + users are re-linked on their next successful sign-in. +- Audit / event history. + +For the lab, none of these matter. For a production deployment of this +design, they would matter, and the restore path becomes primary. + +## Break-Glass Matrix + +When Keycloak is down, the lab continues to operate via +per-service break-glass paths. These are not emergency workarounds; they are +durable, documented alternate authentication paths that are kept active for +exactly this reason. + +| Service | Break-glass path | Notes | +|--------------------|------------------------------------------------------|-------| +| Talos API | mTLS via `talosconfig` and machine secrets | Talos's PKI is independent of Keycloak; `talosconfig` is the ultimate root anchor for the lab. | +| `kubectl` to clusters | Talos-generated admin kubeconfig via `talosctl kubeconfig` | Produced on demand against the cluster's own signing CA; not federated. | +| Argo CD | Built-in `admin` account and initial admin secret in the cluster | Retained and rotated but never disabled. | +| Vault (per cluster)| Unseal keys and root/recovery keys | Kept outside the lab per the Vault design (separate doc). | +| AWS | IAM Identity Center local user + WebAuthn hardware key | Does not federate to Keycloak by design — see the AWS doc. | +| Grafana | Local admin account | Kept active alongside OIDC client configuration. | +| GitHub (upstream IdP) | Personal GitHub account, hardware-key MFA | Keycloak is downstream of GitHub; if GitHub is down, federated login fails, and the above service-local paths are how the lab keeps moving. | + +All of these anchors' credentials live outside any Keycloak-dependent store. +`talosconfig`, Argo CD admin secrets, Vault recovery keys, and AWS root +credentials are kept in offline storage (password manager + hardware backup) +whose access does not depend on Keycloak, AWS, or internet connectivity. + +## Cost + +Monthly, all-upfront amortized over the 3-year EC2 Instance Savings Plan +commitment established in the AWS design: + +| Item | Monthly | +|--------------------------------|---------| +| t4g.small compute (3-yr SP) | ~$3.89 | +| S3 backup storage + lifecycle | ~$0.20 | +| Route 53 record queries | negligible | +| Data transfer | negligible | +| **Approximate total** | **~$4.10** | + +This is additive on top of the baseline AWS footprint laid out in the +AWS Lab Account design. + +## Future Work + +- **CI-driven reconciliation.** Move `keycloak-config-cli` from an + on-host cron into a GitHub Actions workflow so `git push` is the only + trigger that mutates Keycloak's declarative state. +- **Keycloak → IAM Identity Center SAML federation** as a secondary, + convenience path for AWS console access. The local IAM Identity Center + user remains primary. +- **Promotion to HA** if / when this design is reused at day-job scale. 
Two + Keycloak replicas behind a load balancer with clustered cache and a shared + database is the standard upgrade path; none of the decisions in this doc + block it. +- **Automated DR drill.** A periodic exercise — quarterly is appropriate — in + which the rebuild path is executed against a scratch instance to prove + the RTO target and keep the muscle alive. +- **Richer per-service break-glass.** Codify the break-glass secrets + themselves (their storage location, rotation cadence, recovery order) in + a separate operational runbook. + +## References + +- [AWS Lab Account](./aws-lab-account.md) +- [Bootstrap and Core Delivery Model](./bootstrap-core-delivery.md) +- [Multi-Cluster GitOps Model](./gitops-multi-cluster.md) +- [keycloak-config-cli (adorsys)](https://github.com/adorsys/keycloak-config-cli) +- [Keycloak Operator: status of first-class CRDs](https://www.keycloak.org/2022/09/operator-crs)