diff --git a/docs/docs/designs/aws-lab-account.md b/docs/docs/designs/aws-lab-account.md new file mode 100644 index 0000000..228e757 --- /dev/null +++ b/docs/docs/designs/aws-lab-account.md @@ -0,0 +1,455 @@ +--- +title: AWS Lab Account +description: Proposed design for a dedicated AWS Organization and member account that anchors lab DNS, bootstrap identity, and offsite compute for identity systems. +--- + +# AWS Lab Account + +## Status + +Proposed. + +This document defines how the lab uses a dedicated AWS Organization and member +account as its durable, out-of-lab trust anchor. It covers account and identity +structure, the VPC and its Tailscale site-to-site link to the lab, the Route 53 +private zone and the in-lab mirror that consumes it, and the secrets bootstrap +path that lets AWS be the single identity required to retrieve and decrypt all +other bootstrap material. + +Detailed Keycloak design is out of scope for this document and is covered +separately. + +## Purpose + +The primary purpose of this design is to keep lab identity, lab DNS, and +lab secrets from having a circular dependency on the lab itself. + +The intended split is: + +- AWS owns the durable trust anchor — the account, the identity system, the + KMS key, the private zone of record +- the lab consumes what AWS provides without requiring continuous connectivity + to operate steady-state +- a small number of AWS-resident components (a Tailscale subnet router, a + Keycloak host) provide the minimum bridge between the two sides + +This keeps the lab's hardest-to-bootstrap layers (identity, DNS, and secret +decryption keys) outside the lab, while keeping day-to-day serving and latency +local. + +## Goals + +- Provide a durable, off-lab trust anchor that survives total lab failure. +- Break the chicken-and-egg between lab DNS, lab identity, and lab secrets. +- Keep AWS itself identity-independent from anything the lab hosts, so AWS + remains reachable even when Keycloak, the platform cluster, or the lab + network is unavailable. +- Eliminate long-lived static credentials on AWS-hosted bootstrap nodes. +- Allow the lab to continue serving DNS and operating existing workloads + during a total internet outage. +- Mirror the durable-trust-anchor shape of a future day-job architecture so the + lab exercises the same patterns at small scale. + +## Non-Goals + +- This document does not define the Keycloak deployment, its database, or its + disaster-recovery plan. Those belong to a separate Keycloak design doc. +- This document does not define monitoring, logging, or alerting for the AWS + account. +- This document does not define exact IAM policy JSON, SCPs, Tailscale ACL + rules, CoreDNS configuration, or OpenTofu module layout. The doc names + contracts, not implementations. +- This document does not define OIDC trust between cluster workloads and AWS. + That is a separate future design connected to ExternalDNS and similar cluster + components. +- This document does not federate IAM Identity Center to Keycloak. Doing so + would reintroduce the circular dependency this design is built to avoid. 
+ +## Design Summary + +The lab uses a dedicated AWS Organization with two accounts: + +- a **management account** that holds the Organization, billing, and IAM + Identity Center, and runs no workloads +- a **member account** that holds all lab-owned AWS resources: the VPC, the + Tailscale subnet router, the Keycloak host, the Route 53 private zone, the + KMS key used for SOPS, and the SSM parameters used for bootstrap + +Identity into both accounts comes from **IAM Identity Center with its built-in +identity store**. There is no external identity provider. A single human user +signs in with a hardware security key and assumes time-limited permission sets +into the member account. Root credentials on both accounts are break-glass only +and stored offline. + +The member account peers with the lab via a pair of **Tailscale subnet +routers**, one in AWS and one on VyOS. Devices on either side can reach devices +on the other side by their real IPs without being Tailscale nodes themselves. +The AWS-side subnet router authenticates to the tailnet via **Tailscale +workload identity federation**, using its attached IAM role — no pre-shared +auth key. + +The lab's authoritative DNS lives in a **Route 53 private zone** for `glab.lol` +bound to the lab's VPC. A sync job on the subnet router renders that zone to a +local zonefile, serves it over the tailnet, and an in-lab fetcher pulls the +file to disk. CoreDNS in the lab serves from the on-disk zonefile. The read +path never reaches AWS at query time, so DNS serving survives internet outages +and cold starts. + +Bootstrap secrets are gated end-to-end by the AWS-resident IAM role: + +- encrypted secrets live in the existing `secrets/` repo on GitHub, encrypted + with a **KMS customer-managed key** used as a SOPS recipient +- that repo is cloned via a **GitHub App** whose private key is stored in SSM + Parameter Store (SecureString) +- both KMS decrypt and SSM read are granted to the bootstrap instance via its + IAM role + +The result is that an AWS-resident bootstrap node holds zero persistent +secrets on disk: every identity it uses — Tailscale, GitHub, the SOPS +decryption key, AWS itself — traces back to its IAM role. + +## Account Structure + +The lab uses a two-account AWS Organization: + +| Account | Purpose | +|-------------|-----------------------------------------------------------------------------| +| `lab-mgmt` | Organization management account. Holds billing, IAM Identity Center, org-level config. No workloads. | +| `lab` | Lab workload member account. Holds VPC, EC2, Route 53, KMS, SSM, and all other lab resources. | + +The split follows AWS's own recommendation that the management account should +not run workloads. It also leaves room to add additional member accounts later +(for example, a prod-mirror account that stages day-job patterns) without +restructuring. + +Region: **`us-west-2`**. + +All resources in this design live in `us-west-2` in the `lab` account unless +explicitly stated otherwise. + +## Identity + +### Primary path + +IAM Identity Center is enabled in the `lab-mgmt` account and uses its +**built-in identity store**. There is no external IdP wired in. The identity +store holds one human user with **WebAuthn MFA enforced** via a hardware +security key. + +Access to the `lab` account is granted via permission sets assigned from +Identity Center. 
Daily operator access — console and CLI — is short-lived: + +- console access through the Identity Center access portal +- CLI access via `aws sso login`, which produces short-lived role credentials + +No long-lived IAM user access keys exist in either account for human use. + +### Break-glass + +Root user credentials exist on both accounts and are used for emergency +recovery only (loss of Identity Center access, billing-only actions not +permitted to Identity Center). Both root accounts: + +- use strong unique passwords +- have hardware-key MFA enabled +- are stored offline (outside any system whose recovery depends on AWS or + Keycloak being reachable) + +### Why the identity store is local + +Federating IAM Identity Center to Keycloak would make AWS access depend on +Keycloak. Keycloak depends on AWS for its compute, its DNS, and its secrets +bootstrap. Coupling the two defeats the entire reason for placing identity on +a durable off-lab trust anchor. + +A future addition of Keycloak SAML federation as a **secondary, convenience** +path for Identity Center is possible and explicitly deferred. The local +identity store always remains the primary admin path. + +## Network + +### VPC + +- **CIDR:** `172.16.0.0/16` +- **Subnets:** one public subnet, single AZ +- **Internet gateway:** attached; the subnet router carries outbound traffic + via an Elastic IP attached to its ENI +- **NAT gateway:** none. With a single public-subnet instance there is no + workload needing egress through a private subnet; skipping NAT removes the + largest ongoing fixed cost that would otherwise apply (~$32/mo) + +`172.16.0.0/16` is deliberately far from both the lab's `10.10.0.0/16` and +Tailscale's `100.64.0.0/10` CGNAT range, so no address-space collisions can +occur when routes are advertised across the tailnet. + +### Site-to-site with the lab + +The lab and the VPC connect via **Tailscale subnet routers on both sides**: + +- **AWS side:** the subnet router EC2 instance advertises `172.16.0.0/16` and + accepts `10.10.0.0/16`. +- **Lab side:** VyOS runs Tailscale and advertises `10.10.0.0/16` while + accepting `172.16.0.0/16`. + +Both sides run with `--snat-subnet-routes=false` so traffic preserves real +source IPs. The VPC route table directs `10.10.0.0/16` to the subnet router's +ENI, and the ENI has source/destination check disabled so it can forward. +Security groups allow `10.10.0.0/16` as a source on the ENI. + +From either side, a host can address the other side by its real IP without +being a Tailscale node itself. Lab DNS clients reach `172.16.0.0/16` +transparently; VPC workloads (Keycloak) can reach lab workloads when needed. + +MSS clamping is configured on VyOS to avoid black-holed large packets through +the WireGuard-based tunnel's smaller MTU. Tailscale ACLs permit traffic +between the two advertised CIDRs. + +### Tailscale node identity + +The AWS-side subnet router authenticates to the tailnet via **Tailscale +workload identity federation**, using its attached IAM role. No pre-shared +auth key is stored on the instance. Tailscale ACL tags are derived from IAM +claims (role ARN, account ID), so policy can be written against the +IAM identity rather than per-device labels. + +The VyOS node uses a traditional Tailscale auth key, because workload identity +federation only supports cloud-hosted clients. That key is managed out of band +and lives on a single on-prem device; it is not checked into any repo. + +## DNS + +### Authoritative zone + +The canonical lab domain is **`glab.lol`**. 
A Route 53 **private hosted zone** +for `glab.lol` lives in the `lab` account, bound to the VPC. All lab DNS +records are managed there. + +Private was chosen over public intentionally: the day-job architecture this +lab mirrors requires record names themselves to be non-public. A public zone +would be operationally simpler but would not exercise the same pattern. + +### Lab read path + +CoreDNS in the lab serves `glab.lol` from a **local zonefile on disk**. The +file is kept up to date by a sync pipeline that runs entirely outside the lab: + +1. A job on the AWS-side subnet router reads the Route 53 zone using its IAM + role and renders it to a standard zonefile. Refresh cadence is ≤1 minute. +2. The subnet router serves the rendered file over the tailnet. +3. An in-lab fetcher periodically pulls the file and writes it to the + filesystem CoreDNS reads from. + +CoreDNS never queries Route 53 at request time. The fetch path is +asynchronous and decoupled from serving. + +### Failure characteristics + +- **Steady-state AWS or internet outage:** fetches fail; CoreDNS continues to + serve from the last-fetched zonefile. The zone data becomes progressively + stale in proportion to the outage length, but queries continue to resolve. +- **Cold start during an outage:** CoreDNS loads the last zonefile from local + disk and resumes serving. The sync job is not on the critical path. +- **Full lab internet loss:** the tailnet path to the subnet router is itself + unreachable, which stops syncs but not serving. The zonefile on disk is the + resilience layer. +- **Stale-vs-unavailable tradeoff:** this design accepts staleness as the + price of availability during outages. Zone changes during an outage simply + do not propagate until connectivity returns. + +### Why the mirror exists + +Steady-state DNS resilience can be provided by CoreDNS itself — the `route53` +plugin reads zones into memory, and the `cache` plugin with `serve_stale` +enabled keeps answering through upstream outages. The mirror layer's specific +job is **cold-start and bootstrap resilience**: if CoreDNS restarts (node +reboot, container replaced) while Route 53 is unreachable, it has no in-memory +zone to fall back on. A zonefile on disk removes that failure mode. + +## Secrets Bootstrap + +### Contract + +An AWS-resident bootstrap instance must be able to, starting from only its +IAM role: + +1. Reach the tailnet. +2. Clone the private `secrets/` repo from GitHub. +3. Decrypt SOPS-encrypted files in that repo. + +At no point may the instance hold a durable, plaintext credential for any of +the three systems (Tailscale, GitHub, SOPS). All identity traces back to the +instance's attached IAM role. + +### KMS as a SOPS recipient + +The existing `secrets/` repo continues to hold SOPS-encrypted files. A single +**customer-managed KMS key** in the `lab` account is added as an additional +SOPS recipient. Any principal granted `kms:Decrypt` on that key — human +(via Identity Center permission set) or machine (via instance profile) — can +decrypt. + +This is non-breaking for existing workflows: SOPS supports multiple +recipients, so the KMS key can be added alongside the existing age key. +Retiring the age key is possible later but not required. + +The details of the SOPS-over-KMS workflow, key rotation, and human vs. +automation paths live in a separate secrets design doc. This document only +establishes that the KMS key lives in the `lab` account and is the anchor +for machine decryption. 
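As a rough sketch of what that machine-decryption path looks like in practice — not a prescribed implementation — a bootstrap script on the instance can shell out to `sops` and let it satisfy the KMS recipient with the instance role's temporary credentials. The file path below is hypothetical.

```python
import subprocess


def decrypt_bootstrap_secret(path: str) -> str:
    """Decrypt a SOPS-encrypted file from the cloned secrets/ repo.

    sops tries the recipients declared in the file; on the bootstrap
    instance the KMS recipient succeeds using the instance profile's
    temporary credentials, so no age key or other durable secret is
    ever written to disk.
    """
    result = subprocess.run(
        ["sops", "--decrypt", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hypothetical usage during bootstrap:
# config = decrypt_bootstrap_secret("secrets/bootstrap/lab.yaml")
```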
+ +### GitHub App for repo access + +Cloning private repos from an AWS-resident bootstrap instance uses a +**GitHub App** owned by the `GilmanLab` organization and installed on the +`secrets` repo (plus any other private repos bootstrap needs to reach). + +- The App's **private signing key** is stored in an SSM Parameter Store + `SecureString` in the `lab` account. +- The instance's IAM role grants `ssm:GetParameter` + `kms:Decrypt` on that + specific parameter path only. +- On bootstrap, the instance fetches the key, generates a JWT, exchanges it + for a short-lived installation token (1-hour TTL), and clones. + +The only durable non-AWS secret anywhere in the chain is the App's private +signing key itself, and that key is at rest in AWS, gated by IAM. Installation +tokens are never stored on disk. + +### The single-anchor property + +Taken together, the chain on a single EC2 bootstrap instance is: + +| Step | Identity used | +|------|----------------------------------------------------------------| +| Join tailnet | IAM role (via workload identity federation) | +| Read GitHub App key | IAM role (via instance profile → SSM + KMS) | +| Mint installation token | App private key (short-lived, in memory) | +| Clone `secrets/` repo | Installation token (short-lived, in memory) | +| Decrypt SOPS files | IAM role (via instance profile → KMS) | + +Every persistent identity is the IAM role. Lose AWS, lose bootstrap. Gain AWS, +everything else unlocks in order. This is the design outcome the account +structure is in service of. + +## Compute and Cost Model + +### Instances + +| Name | Type | Purpose | +|----------------|------------|-------------------------------------------------------------| +| subnet router | `t4g.nano` | Tailscale site-to-site, Route 53 zonefile rendering | +| Keycloak host | `t4g.small`| Keycloak + colocated Postgres. Detailed design in Keycloak doc. | + +Both run Amazon Linux 2023 on ARM (`t4g` / Graviton). Tailscale and Keycloak +both ship native ARM builds. + +The two are kept **as separate instances** rather than colocated. A colocated +box would save ~$1.75/mo but would collapse the subnet router and identity +failure domains into one. The premium for separation is cheap insurance, and +separation is also more faithful to the day-job architecture this design +mirrors. + +EC2 instances do **not** run EBS snapshot or AMI backup jobs. The lab's +philosophy is rebuild-over-restore: every instance's durable state is either +in an external store (Route 53, KMS, SSM, S3, GitHub) or is designed to be +reconstructed from those sources. + +### Savings Plan commitment + +Both instances are long-lived infrastructure and not expected to change +instance family over their lifetime. The commitment shape is: + +- **3-year EC2 Instance Savings Plans, all-upfront**, covering the `t4g` + family in `us-west-2` +- expected effective discount ~72% vs. 
on-demand +- one purchase sized to cover both instances; additional commitment can be + layered later + +### Cost envelope + +Approximate, all-upfront amortized: + +| Item | Monthly | 3-year | +|-------------------------------|---------|--------| +| Subnet router (t4g.nano) | ~$1.75 | ~$63 | +| Keycloak host (t4g.small) | ~$3.89 | ~$140 | +| KMS customer-managed key | ~$1.00 | ~$36 | +| SSM Parameter Store (standard)| ~$0 | ~$0 | +| Route 53 private zone + queries | ~$0.50 | ~$18 | +| Data transfer | negligible | negligible | +| **Total (approximate)** | **~$7.15** | **~$260** | + +EBS, Elastic IPs attached to running instances, and Route 53 API calls for the +1-minute zonefile sync all fall into noise-level cost at lab scale. + +## Infrastructure as Code + +- All AWS resources are managed with **OpenTofu** from the `infra/` repo + under `infra/aws/`. +- OpenTofu state is stored in an **S3 bucket in the `lab` account**, using + S3's native locking. +- The OpenTofu entrypoint assumes a permission set role via Identity Center + for human-operator runs. Future CI-triggered runs will use a separate + identity (out of scope for this document). + +### Manual bootstrap surface + +A small amount of setup exists outside of OpenTofu, because it must exist +before OpenTofu can run: + +1. Creation of the AWS Organization and the two accounts. +2. Enablement of IAM Identity Center and the single operator user. +3. Creation of the S3 state bucket and the minimum IAM role OpenTofu will + assume. + +Everything downstream of that — the VPC, the subnet router, the private zone, +the KMS key, the SSM parameters, the instance profiles — is declared in +OpenTofu. + +## Failure Domains + +What fails together and what does not: + +| Failure | Lab DNS | Lab serving | AWS console access | Bootstrap of new lab instances | +|--------------------------------|---------|-------------|--------------------|------------------------------| +| Lab internet outage | ✓ (cached zonefile) | ✓ | ✗ (can't reach AWS) | ✗ | +| Subnet router EC2 down | ✓ (cached zonefile) | ✓ | ✓ | ✗ (tailnet → AWS bridge down) | +| `lab` account compromise | ✓ (cached zonefile, short-term) | ✓ | partial | ✗ | +| `lab-mgmt` account lost | ✓ (cached zonefile, short-term) | ✓ | ✗ | ✗ | +| Keycloak host down | ✓ | ✓ (except OIDC-gated services) | ✓ | ✓ | +| AWS region outage | ✓ (cached zonefile) | ✓ | ✗ | ✗ | +| Full lab power/hardware loss | ✗ | ✗ | ✓ | depends on external rebuild | + +The dominant pattern: **lab-side serving is robust to any offsite failure** +thanks to the zonefile-on-disk DNS path and locally-resident workloads. +Offsite failure primarily costs the ability to make changes, not the ability +to keep running. + +## Future Work + +The following are known next steps that are intentionally out of scope here: + +- **Keycloak design doc.** Deployment shape, Postgres colocation, backup to + object storage with Synology sync, rebuild-over-restore DR procedure, and + GitOps via `keycloak-config-cli`. +- **Cluster workload OIDC to AWS.** ExternalDNS-style workloads on the Talos + cluster will need AWS credentials; the cluster's own OIDC issuer (IRSA-style + federation) is the expected mechanism, not Tailscale-based federation. +- **GitHub Actions OIDC.** Trusting GitHub Actions as an OIDC identity + provider in the `lab` account so CI can apply OpenTofu without long-lived + keys. +- **Keycloak SAML federation to IAM Identity Center** as a secondary, + convenience access path alongside the local identity store. 
+- **Additional member accounts** under the same Organization as the lab's + day-job mirroring grows (prod-mirror, dev, etc.). +- **Secrets design doc** covering the SOPS-over-KMS workflow, rotation, + Vault relationship, and promotion model across bootstrap vs. per-cluster + secrets. + +## References + +- [Keycloak](./keycloak.md) +- [Bootstrap and Core Delivery Model](./bootstrap-core-delivery.md) +- [Multi-Cluster GitOps Model](./gitops-multi-cluster.md) +- [Service Exposure and Control Plane Endpoints](./service-exposure-and-control-plane-endpoints.md) +- [Tailscale Workload Identity Federation](https://tailscale.com/kb/1581/workload-identity-federation) +- [AWS IAM Identity Center external IdP options](https://docs.aws.amazon.com/singlesignon/latest/userguide/manage-your-identity-source-idp.html) diff --git a/docs/docs/designs/index.md b/docs/docs/designs/index.md index e22fc70..f022b90 100644 --- a/docs/docs/designs/index.md +++ b/docs/docs/designs/index.md @@ -23,6 +23,8 @@ Current designs: - [kro Consumption Model](./kro-consumption-model.md) - [Platform RGD Delivery Model](./platform-rgd-delivery.md) - [App RGD Design](./app-rgd.md) +- [AWS Lab Account](./aws-lab-account.md) +- [Keycloak](./keycloak.md) Once a design is implemented and considered durable, its steady-state shape should be folded back into the architecture overview and any relevant runbooks. diff --git a/docs/docs/designs/keycloak.md b/docs/docs/designs/keycloak.md new file mode 100644 index 0000000..7ecf9fd --- /dev/null +++ b/docs/docs/designs/keycloak.md @@ -0,0 +1,441 @@ +--- +title: Keycloak +description: Proposed design for the lab's central identity system — deployment shape, federation, configuration, backups, and the rebuild-over-restore disaster-recovery model. +--- + +# Keycloak + +## Status + +Proposed. + +This document defines how the lab runs Keycloak as its central identity +provider. It covers deployment shape, identity federation, declarative +configuration, TLS, backups, the rebuild-over-restore disaster-recovery +model, and the per-service break-glass matrix that lets the lab continue to +operate while Keycloak is down. + +This document assumes the [AWS Lab Account](./aws-lab-account.md) design. It +does not re-establish shared context about the AWS Organization, networking, +IAM, or secrets bootstrap. + +## Purpose + +The primary purpose of this design is to keep the lab's identity system on a +durable off-lab trust anchor, while keeping the lab itself operable when that +trust anchor is unreachable. + +The intended split is: + +- Keycloak holds the authoritative, human-facing identity of record +- cluster-level and service-level **break-glass paths** exist for every OIDC + consumer, so identity outages do not cascade into cluster-access outages +- configuration is declaratively sourced from git, so **rebuild is the default + recovery mode** and restore is a fallback used only for runtime state a + single-user lab can recreate in seconds + +This mirrors a day-job architecture at small scale without overpaying for +HA features a single-user lab cannot justify. + +## Goals + +- Provide one place to manage human identity across the lab. +- Keep Keycloak outside the lab's physical failure domain while keeping its + blast radius understood. +- Make Keycloak's configuration surface fully declarative via git, so rebuild + is a first-class recovery path. +- Ensure every Keycloak-dependent service has a documented break-glass path + that does not require Keycloak. 
+- Make the disaster-recovery procedure short enough to execute without a + runbook open on a phone. + +## Non-Goals + +- This document does not run Keycloak in a highly-available configuration. + Single-node is a deliberate choice for a single-user lab and is not a gap. +- This document does not define the per-service Keycloak client configuration + (redirect URIs, scopes, role mappings, token TTLs). Those live in the + realm repository. +- This document does not define monitoring, logging, or alerting. +- This document does not define the Keycloak → Identity Center SAML + federation path. That remains future work. +- This document does not define cluster-level OIDC federation to AWS (used + by ExternalDNS and similar controllers). That is a separate concern handled + by the cluster's own OIDC issuer, not by Keycloak. + +## Design Summary + +Keycloak runs on a single dedicated EC2 instance in the `lab` member account, +colocated with its Postgres database. Access is at **`id.glab.lol`** via a +Route 53 private-zone record and a TLS certificate issued automatically via +**ACME DNS-01**. The instance and database are deployed via **Docker +Compose**. + +The only upstream identity source is **GitHub, federated via OIDC**. A single +realm named `lab` holds all users and all OIDC/SAML clients. Sign-in to any +Keycloak-fronted service is: user → service → Keycloak → GitHub. + +Keycloak's declarative surface — realms, clients, roles, identity provider +settings, scopes, authentication flows — is reconciled from a git repository +by **`keycloak-config-cli`** running as a scheduled job on the Keycloak host. +Runtime state (user credentials, sessions, TOTP enrollment) is not in git and +is the only part of the system that needs backup-based recovery. + +Database dumps and the current TLS cert bundle are backed up nightly to an +S3 bucket in the `lab` account. The lab's Synology NAS pulls those backups +locally on a schedule, so a recent copy of Keycloak's runtime state exists on +a second continent and a second provider. + +**Disaster recovery is rebuild-first.** For a single-user lab, a rebuild from +the git-tracked realm + a fresh Postgres is faster than a restore, enforces +discipline that every config-surface change actually lives in git, and +produces a clean outcome. A restore path exists as fallback but is not the +primary recovery mode. + +When Keycloak is entirely unavailable, every Keycloak-fronted service has a +local break-glass path documented in this doc. The lab continues to operate; +only the "unified identity" experience degrades. + +## Deployment Shape + +### Host and Runtime + +- **Instance:** `t4g.small` (2 vCPU, 2 GB RAM), Amazon Linux 2023 on ARM, in + the `lab` account and `172.16.0.0/16` VPC from the AWS design. +- **Runtime:** Docker Compose manages two services: + - `keycloak` — official upstream Keycloak image, tagged to a specific + version pinned in `infra/`. + - `postgres` — official Postgres image, tagged to a specific version pinned + in `infra/`. Data volume on the instance's EBS root volume. +- **Reverse proxy:** Caddy (or equivalent) runs alongside and terminates TLS, + proxying to Keycloak on loopback. Caddy performs ACME DNS-01 renewals using + the instance's IAM role. This is the recommended shape; the exact proxy is + an implementation detail that does not need to appear in this doc. +- **No EBS snapshots or AMI backups.** State recovery is via the application + backup path below, not via block-level snapshots. 
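As a rough sketch of what that application backup path could look like — the details live in the Backups section below — a nightly job on the host might dump Postgres through Compose and push the result to S3 using the instance role. The bucket, prefix, and database names here are hypothetical, and the client-side encryption step described under Backups is elided for brevity.

```python
import datetime
import subprocess

import boto3  # credentials come from the instance's IAM role


BUCKET = "glab-keycloak-backups"  # hypothetical bucket name
PREFIX = "keycloak/postgres"      # hypothetical key prefix


def nightly_backup() -> None:
    """Dump the Keycloak database and upload it to S3."""
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y-%m-%dT%H%MZ")
    dump_path = f"/var/backups/keycloak-{stamp}.sql.gz"

    # pg_dump runs inside the postgres container managed by Docker Compose;
    # the database and role names are illustrative.
    with open(dump_path, "wb") as out:
        subprocess.run(
            ["docker", "compose", "exec", "-T", "postgres",
             "sh", "-c", "pg_dump -U keycloak keycloak | gzip"],
            stdout=out,
            check=True,
        )

    boto3.client("s3").upload_file(dump_path, BUCKET, f"{PREFIX}/keycloak-{stamp}.sql.gz")


if __name__ == "__main__":
    nightly_backup()
```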
+ +### Identity Profile for the Host + +The EC2 instance's IAM role grants, at minimum: + +- Route 53 write access scoped to the `_acme-challenge.id.glab.lol` record + for DNS-01 validation. +- S3 write access to the Keycloak backup bucket's prefix. +- SSM Parameter Store read for any bootstrap-time secrets held there + (following the pattern established in the AWS design). + +The role carries no other permissions. All lab cluster-access, +secret-decryption, and tailnet-identity paths that this instance depends on +are established in the AWS Lab Account design and are not restated here. + +### Sizing and Tuning + +2 GB of RAM is tight but workable for a single-user lab because Keycloak 26.x +(Quarkus-based) has a much smaller footprint than earlier WildFly-based +versions. The required tuning is: + +- explicit Keycloak JVM max heap (e.g. ~768 MB) +- conservative Postgres `shared_buffers` (~128 MB) +- a swap file on the EBS volume as a safety margin + +CPU is burstable but effectively idle for single-user workloads; unlimited +mode is enabled to tolerate rare login bursts at negligible cost. + +## Identity Federation + +### Realm Structure + +A single realm named `lab` holds all lab users and all OIDC/SAML clients. No +separate realms for services vs. humans; a single-user lab does not benefit +from the separation, and multi-realm setups make GitOps reconciliation more +fragile. + +### Upstream IdP + +The realm has **exactly one identity provider configured: GitHub, via +OIDC**. There is no local username-password fallback. A lab user's identity +is their GitHub identity, federated through Keycloak, presented downstream to +each OIDC client. + +The Keycloak admin bootstrap user exists only briefly during initial realm +creation and is disabled once `keycloak-config-cli` has reconciled the +realm from git. + +### Intended Early OIDC Clients + +Keycloak is expected to front these services first: + +- **kubectl → Talos clusters** via each cluster's Kubernetes API-server OIDC + configuration. +- **Argo CD** (web UI and CLI). +- **Grafana** (when deployed). + +Additional clients will be added over time. The authoritative list lives in +the realm repository, not in this document. + +## TLS + +### Issuance + +The cert for `id.glab.lol` is issued via **ACME DNS-01 against Route 53** +using the instance's IAM role. There is no internet-exposed HTTP-01 path, +because the host sits in a private VPC with the subnet router as its only +tailnet attachment. + +No wildcard cert is used. Each service in the lab gets its own host-scoped +cert: + +- this host: `id.glab.lol` +- future clusters: `nonprod.k8s.glab.lol`, `prod.k8s.glab.lol` +- future services: their own host names + +Cluster-fronted services will issue their own certs via cert-manager and +each cluster's OIDC trust relationship with AWS (a separate future design). + +### Renewal + +Renewal is automatic and handled by whichever TLS-terminating proxy is used +on the host. No human intervention is expected between renewals. + +### Restore-time behavior + +For same-hostname restore to work without waiting on ACME, the current TLS +cert bundle (cert + private key) is included in the nightly backup payload +alongside the Postgres dump. A restored host can serve HTTPS immediately +with the backed-up cert and let the proxy handle its own renewal on the +normal schedule. + +## Configuration as Code + +### Source of Truth + +Keycloak's declarative surface is reconciled from a git repository. 
The +**repository is the source of truth**; the running Keycloak's admin surface +is a read-through cache of what's in git. + +In scope for git reconciliation: + +- realms +- clients +- client scopes +- roles and role mappings +- identity-provider configuration +- authentication flows and required actions +- realm-level settings + +Out of scope for git (intentionally runtime state): + +- user credentials (password hashes, WebAuthn registrations, TOTP secrets) +- sessions and refresh tokens +- audit and event logs +- ephemeral tokens and one-time codes + +### Reconciliation Tool + +Reconciliation uses **`keycloak-config-cli`** (adorsys). The tool is mature, +works against the admin API, handles partial updates, and does not require +Kubernetes CRDs or a separate operator. It is the best available GitOps +option as of this writing given the upstream Keycloak Operator still does +not provide first-class CRDs for clients, users, roles, or identity +providers. + +### Reconciliation Location + +`keycloak-config-cli` runs **on the Keycloak host itself** as a scheduled +job. Pull cadence is a small number of minutes; the exact cadence is an +implementation detail. The job authenticates to Keycloak using a +reconciliation service account stored in SSM Parameter Store. + +This location is intentionally simple for now. Pushing reconciliation into +GitHub Actions (so every git push triggers a reconcile) is named as future +work — it would enforce "git push is the only way config changes" more +strictly — but it requires a reachable admin endpoint and an appropriate +trust path, which are better designed once the realm repo has concrete +shape. + +### Schema Versioning + +Keycloak migrations run forward only. The realm repository pins the +Keycloak version it expects. Upgrades are driven by bumping the pin in +`infra/` and allowing the next reconcile cycle to re-apply cleanly against +the upgraded Keycloak. + +## Backups + +### What is backed up + +- Postgres dump (the full database, including all runtime state). +- Keycloak configuration files that live outside the database: `keycloak.conf`, + environment overrides, any custom themes or providers. +- The current TLS cert bundle (cert + private key). + +Configuration in git is **not** part of the backup — git is already the +durable store for it. + +### Where they go + +- **Primary destination:** an S3 bucket in the `lab` account. The bucket + uses server-side encryption with a KMS key; object-lock/versioning is on + so corruptions cannot silently overwrite known-good backups. +- **Secondary destination:** the lab's Synology NAS, which pulls from the S3 + bucket on its own schedule. + +The host writes backups to S3 using the instance's IAM role (no long-lived +credentials). The Synology pulls from S3 using a scoped, read-only access +mechanism chosen when Synology-side automation is implemented — out of scope +for this document. + +### Retention + +Retention is a rolling window, implemented via S3 lifecycle policies. The +contract: + +- **daily** backups retained for **30 days** +- **weekly** backups retained for **12 weeks** +- **monthly** backups retained for **12 months** + +"Last backup only" is explicitly rejected. A corruption-style incident +(realm data mangled by a bad change, not a hardware failure) requires +point-in-time restore from days ago. + +### Encryption + +Backups contain password hashes, signing keys, TOTP secrets, and session +state. 
They are encrypted twice: + +- client-side: the Postgres dump and cert bundle are encrypted before upload + using a recipient key managed alongside the rest of the lab's bootstrap + secrets +- server-side: the S3 bucket uses SSE-KMS with a customer-managed key + +This ensures that neither a leaked S3 object ACL nor a Synology +compromise yields usable plaintext. + +## Disaster Recovery + +The lab uses a **rebuild-first** recovery model. Restore exists as a +fallback, not as the primary path. + +### Rebuild path (primary) + +When Keycloak is unrecoverable or is being moved: + +1. Provision a fresh EC2 instance from the `infra/` OpenTofu modules. +2. Run `docker compose up` to start Keycloak and a **fresh** Postgres. +3. Run `keycloak-config-cli` against the new Keycloak, pointing at the realm + repository. All realms, clients, roles, and GitHub federation come back. +4. Sign in via GitHub. A new user entry is created on first login per the + identity-provider mapper configuration. +5. Re-enroll WebAuthn / TOTP (30 seconds). +6. Keycloak is operational. + +**Target RTO: 15 minutes.** This path requires only the git repository and +AWS access; it does not require any backup store. + +### Restore path (fallback) + +When a rebuild is unacceptable — for example, if you need to preserve the +exact user-state including federated-identity linkages and audit history — +the restore path is: + +1. Provision a fresh EC2 instance. +2. Pull the most recent, or a chosen point-in-time, backup from S3 or + Synology. +3. Restore the Postgres dump into a fresh Postgres. +4. Place the TLS cert bundle and any config files. +5. Run `docker compose up`. Keycloak boots against the restored database. +6. `keycloak-config-cli` runs its normal reconcile cycle; any configuration + drift between the backup time and `git HEAD` is corrected forward. + +For the single-user lab, the restore path is rarely worth it. For the +day-job mirror architecture, it is the default. + +### Hostname preservation + +Both paths require that the restored instance serve at **`id.glab.lol`**. +The `issuer` claim on every JWT Keycloak has ever signed is tied to that +URL. Changing the hostname invalidates every existing token and every +client's cached OIDC discovery. + +This is normally handled by DNS: the new instance comes up in the same VPC +with the same Route 53 A record pointing to it. During a full internet-loss +DR scenario where Route 53 is unreachable, a local override path exists: +the lab's CoreDNS zonefile can be manually edited to point `id.glab.lol` at +the restored instance's Tailscale address. + +### What rebuild does not recover + +- Stored user credentials (password hashes, WebAuthn, TOTP). These must be + re-enrolled by the user on first login. For a single-user lab, 30 seconds. +- Active sessions. All users are forced to re-authenticate. +- Federated-identity linkages established at previous logins. GitHub OIDC + users are re-linked on their next successful sign-in. +- Audit / event history. + +For the lab, none of these matter. For a production deployment of this +design, they would matter, and the restore path becomes primary. + +## Break-Glass Matrix + +When Keycloak is down, the lab continues to operate via +per-service break-glass paths. These are not emergency workarounds; they are +durable, documented alternate authentication paths that are kept active for +exactly this reason. 
+ +| Service | Break-glass path | Notes | +|--------------------|------------------------------------------------------|-------| +| Talos API | mTLS via `talosconfig` and machine secrets | Talos's PKI is independent of Keycloak; `talosconfig` is the ultimate root anchor for the lab. | +| `kubectl` to clusters | Talos-generated admin kubeconfig via `talosctl kubeconfig` | Produced on demand against the cluster's own signing CA; not federated. | +| Argo CD | Built-in `admin` account and initial admin secret in the cluster | Retained and rotated but never disabled. | +| Vault (per cluster)| Unseal keys and root/recovery keys | Kept outside the lab per the Vault design (separate doc). | +| AWS | IAM Identity Center local user + WebAuthn hardware key | Does not federate to Keycloak by design — see the AWS doc. | +| Grafana | Local admin account | Kept active alongside OIDC client configuration. | +| GitHub (upstream IdP) | Personal GitHub account, hardware-key MFA | Keycloak is downstream of GitHub; if GitHub is down, federated login fails, and the above service-local paths are how the lab keeps moving. | + +All of these anchors' credentials live outside any Keycloak-dependent store. +`talosconfig`, Argo CD admin secrets, Vault recovery keys, and AWS root +credentials are kept in offline storage (password manager + hardware backup) +whose access does not depend on Keycloak, AWS, or internet connectivity. + +## Cost + +Monthly, all-upfront amortized over the 3-year EC2 Instance Savings Plan +commitment established in the AWS design: + +| Item | Monthly | +|--------------------------------|---------| +| t4g.small compute (3-yr SP) | ~$3.89 | +| S3 backup storage + lifecycle | ~$0.20 | +| Route 53 record queries | negligible | +| Data transfer | negligible | +| **Approximate total** | **~$4.10** | + +This is additive on top of the baseline AWS footprint laid out in the +AWS Lab Account design. + +## Future Work + +- **CI-driven reconciliation.** Move `keycloak-config-cli` from an + on-host cron into a GitHub Actions workflow so `git push` is the only + trigger that mutates Keycloak's declarative state. +- **Keycloak → IAM Identity Center SAML federation** as a secondary, + convenience path for AWS console access. The local IAM Identity Center + user remains primary. +- **Promotion to HA** if / when this design is reused at day-job scale. Two + Keycloak replicas behind a load balancer with clustered cache and a shared + database is the standard upgrade path; none of the decisions in this doc + block it. +- **Automated DR drill.** A periodic exercise — quarterly is appropriate — in + which the rebuild path is executed against a scratch instance to prove + the RTO target and keep the muscle alive. +- **Richer per-service break-glass.** Codify the break-glass secrets + themselves (their storage location, rotation cadence, recovery order) in + a separate operational runbook. + +## References + +- [AWS Lab Account](./aws-lab-account.md) +- [Bootstrap and Core Delivery Model](./bootstrap-core-delivery.md) +- [Multi-Cluster GitOps Model](./gitops-multi-cluster.md) +- [keycloak-config-cli (adorsys)](https://github.com/adorsys/keycloak-config-cli) +- [Keycloak Operator: status of first-class CRDs](https://www.keycloak.org/2022/09/operator-crs)