diff --git a/docs/docs/designs/aws-lab-account.md b/docs/docs/designs/aws-lab-account.md new file mode 100644 index 0000000..228e757 --- /dev/null +++ b/docs/docs/designs/aws-lab-account.md @@ -0,0 +1,455 @@ +--- +title: AWS Lab Account +description: Proposed design for a dedicated AWS Organization and member account that anchors lab DNS, bootstrap identity, and offsite compute for identity systems. +--- + +# AWS Lab Account + +## Status + +Proposed. + +This document defines how the lab uses a dedicated AWS Organization and member +account as its durable, out-of-lab trust anchor. It covers account and identity +structure, the VPC and its Tailscale site-to-site link to the lab, the Route 53 +private zone and the in-lab mirror that consumes it, and the secrets bootstrap +path that lets AWS be the single identity required to retrieve and decrypt all +other bootstrap material. + +Detailed Keycloak design is out of scope for this document and is covered +separately. + +## Purpose + +The primary purpose of this design is to keep lab identity, lab DNS, and +lab secrets from having a circular dependency on the lab itself. + +The intended split is: + +- AWS owns the durable trust anchor — the account, the identity system, the + KMS key, the private zone of record +- the lab consumes what AWS provides without requiring continuous connectivity + to operate steady-state +- a small number of AWS-resident components (a Tailscale subnet router, a + Keycloak host) provide the minimum bridge between the two sides + +This keeps the lab's hardest-to-bootstrap layers (identity, DNS, and secret +decryption keys) outside the lab, while keeping day-to-day serving and latency +local. + +## Goals + +- Provide a durable, off-lab trust anchor that survives total lab failure. +- Break the chicken-and-egg between lab DNS, lab identity, and lab secrets. +- Keep AWS itself identity-independent from anything the lab hosts, so AWS + remains reachable even when Keycloak, the platform cluster, or the lab + network is unavailable. +- Eliminate long-lived static credentials on AWS-hosted bootstrap nodes. +- Allow the lab to continue serving DNS and operating existing workloads + during a total internet outage. +- Mirror the durable-trust-anchor shape of a future day-job architecture so the + lab exercises the same patterns at small scale. + +## Non-Goals + +- This document does not define the Keycloak deployment, its database, or its + disaster-recovery plan. Those belong to a separate Keycloak design doc. +- This document does not define monitoring, logging, or alerting for the AWS + account. +- This document does not define exact IAM policy JSON, SCPs, Tailscale ACL + rules, CoreDNS configuration, or OpenTofu module layout. The doc names + contracts, not implementations. +- This document does not define OIDC trust between cluster workloads and AWS. + That is a separate future design connected to ExternalDNS and similar cluster + components. +- This document does not federate IAM Identity Center to Keycloak. Doing so + would reintroduce the circular dependency this design is built to avoid. 
+ +## Design Summary + +The lab uses a dedicated AWS Organization with two accounts: + +- a **management account** that holds the Organization, billing, and IAM + Identity Center, and runs no workloads +- a **member account** that holds all lab-owned AWS resources: the VPC, the + Tailscale subnet router, the Keycloak host, the Route 53 private zone, the + KMS key used for SOPS, and the SSM parameters used for bootstrap + +Identity into both accounts comes from **IAM Identity Center with its built-in +identity store**. There is no external identity provider. A single human user +signs in with a hardware security key and assumes time-limited permission sets +into the member account. Root credentials on both accounts are break-glass only +and stored offline. + +The member account peers with the lab via a pair of **Tailscale subnet +routers**, one in AWS and one on VyOS. Devices on either side can reach devices +on the other side by their real IPs without being Tailscale nodes themselves. +The AWS-side subnet router authenticates to the tailnet via **Tailscale +workload identity federation**, using its attached IAM role — no pre-shared +auth key. + +The lab's authoritative DNS lives in a **Route 53 private zone** for `glab.lol` +bound to the lab's VPC. A sync job on the subnet router renders that zone to a +local zonefile, serves it over the tailnet, and an in-lab fetcher pulls the +file to disk. CoreDNS in the lab serves from the on-disk zonefile. The read +path never reaches AWS at query time, so DNS serving survives internet outages +and cold starts. + +Bootstrap secrets are gated end-to-end by the AWS-resident IAM role: + +- encrypted secrets live in the existing `secrets/` repo on GitHub, encrypted + with a **KMS customer-managed key** used as a SOPS recipient +- that repo is cloned via a **GitHub App** whose private key is stored in SSM + Parameter Store (SecureString) +- both KMS decrypt and SSM read are granted to the bootstrap instance via its + IAM role + +The result is that an AWS-resident bootstrap node holds zero persistent +secrets on disk: every identity it uses — Tailscale, GitHub, the SOPS +decryption key, AWS itself — traces back to its IAM role. + +## Account Structure + +The lab uses a two-account AWS Organization: + +| Account | Purpose | +|-------------|-----------------------------------------------------------------------------| +| `lab-mgmt` | Organization management account. Holds billing, IAM Identity Center, org-level config. No workloads. | +| `lab` | Lab workload member account. Holds VPC, EC2, Route 53, KMS, SSM, and all other lab resources. | + +The split follows AWS's own recommendation that the management account should +not run workloads. It also leaves room to add additional member accounts later +(for example, a prod-mirror account that stages day-job patterns) without +restructuring. + +Region: **`us-west-2`**. + +All resources in this design live in `us-west-2` in the `lab` account unless +explicitly stated otherwise. + +## Identity + +### Primary path + +IAM Identity Center is enabled in the `lab-mgmt` account and uses its +**built-in identity store**. There is no external IdP wired in. The identity +store holds one human user with **WebAuthn MFA enforced** via a hardware +security key. + +Access to the `lab` account is granted via permission sets assigned from +Identity Center. 
Daily operator access — console and CLI — is short-lived: + +- console access through the Identity Center access portal +- CLI access via `aws sso login`, which produces short-lived role credentials + +No long-lived IAM user access keys exist in either account for human use. + +### Break-glass + +Root user credentials exist on both accounts and are used for emergency +recovery only (loss of Identity Center access, billing-only actions not +permitted to Identity Center). Both root accounts: + +- use strong unique passwords +- have hardware-key MFA enabled +- are stored offline (outside any system whose recovery depends on AWS or + Keycloak being reachable) + +### Why the identity store is local + +Federating IAM Identity Center to Keycloak would make AWS access depend on +Keycloak. Keycloak depends on AWS for its compute, its DNS, and its secrets +bootstrap. Coupling the two defeats the entire reason for placing identity on +a durable off-lab trust anchor. + +A future addition of Keycloak SAML federation as a **secondary, convenience** +path for Identity Center is possible and explicitly deferred. The local +identity store always remains the primary admin path. + +## Network + +### VPC + +- **CIDR:** `172.16.0.0/16` +- **Subnets:** one public subnet, single AZ +- **Internet gateway:** attached; the subnet router carries outbound traffic + via an Elastic IP attached to its ENI +- **NAT gateway:** none. With a single public-subnet instance there is no + workload needing egress through a private subnet; skipping NAT removes the + largest ongoing fixed cost that would otherwise apply (~$32/mo) + +`172.16.0.0/16` is deliberately far from both the lab's `10.10.0.0/16` and +Tailscale's `100.64.0.0/10` CGNAT range, so no address-space collisions can +occur when routes are advertised across the tailnet. + +### Site-to-site with the lab + +The lab and the VPC connect via **Tailscale subnet routers on both sides**: + +- **AWS side:** the subnet router EC2 instance advertises `172.16.0.0/16` and + accepts `10.10.0.0/16`. +- **Lab side:** VyOS runs Tailscale and advertises `10.10.0.0/16` while + accepting `172.16.0.0/16`. + +Both sides run with `--snat-subnet-routes=false` so traffic preserves real +source IPs. The VPC route table directs `10.10.0.0/16` to the subnet router's +ENI, and the ENI has source/destination check disabled so it can forward. +Security groups allow `10.10.0.0/16` as a source on the ENI. + +From either side, a host can address the other side by its real IP without +being a Tailscale node itself. Lab DNS clients reach `172.16.0.0/16` +transparently; VPC workloads (Keycloak) can reach lab workloads when needed. + +MSS clamping is configured on VyOS to avoid black-holed large packets through +the WireGuard-based tunnel's smaller MTU. Tailscale ACLs permit traffic +between the two advertised CIDRs. + +### Tailscale node identity + +The AWS-side subnet router authenticates to the tailnet via **Tailscale +workload identity federation**, using its attached IAM role. No pre-shared +auth key is stored on the instance. Tailscale ACL tags are derived from IAM +claims (role ARN, account ID), so policy can be written against the +IAM identity rather than per-device labels. + +The VyOS node uses a traditional Tailscale auth key, because workload identity +federation only supports cloud-hosted clients. That key is managed out of band +and lives on a single on-prem device; it is not checked into any repo. + +## DNS + +### Authoritative zone + +The canonical lab domain is **`glab.lol`**. 
A Route 53 **private hosted zone** +for `glab.lol` lives in the `lab` account, bound to the VPC. All lab DNS +records are managed there. + +Private was chosen over public intentionally: the day-job architecture this +lab mirrors requires record names themselves to be non-public. A public zone +would be operationally simpler but would not exercise the same pattern. + +### Lab read path + +CoreDNS in the lab serves `glab.lol` from a **local zonefile on disk**. The +file is kept up to date by a sync pipeline that runs entirely outside the lab: + +1. A job on the AWS-side subnet router reads the Route 53 zone using its IAM + role and renders it to a standard zonefile. Refresh cadence is ≤1 minute. +2. The subnet router serves the rendered file over the tailnet. +3. An in-lab fetcher periodically pulls the file and writes it to the + filesystem CoreDNS reads from. + +CoreDNS never queries Route 53 at request time. The fetch path is +asynchronous and decoupled from serving. + +### Failure characteristics + +- **Steady-state AWS or internet outage:** fetches fail; CoreDNS continues to + serve from the last-fetched zonefile. The zone data becomes progressively + stale in proportion to the outage length, but queries continue to resolve. +- **Cold start during an outage:** CoreDNS loads the last zonefile from local + disk and resumes serving. The sync job is not on the critical path. +- **Full lab internet loss:** the tailnet path to the subnet router is itself + unreachable, which stops syncs but not serving. The zonefile on disk is the + resilience layer. +- **Stale-vs-unavailable tradeoff:** this design accepts staleness as the + price of availability during outages. Zone changes during an outage simply + do not propagate until connectivity returns. + +### Why the mirror exists + +Steady-state DNS resilience can be provided by CoreDNS itself — the `route53` +plugin reads zones into memory, and the `cache` plugin with `serve_stale` +enabled keeps answering through upstream outages. The mirror layer's specific +job is **cold-start and bootstrap resilience**: if CoreDNS restarts (node +reboot, container replaced) while Route 53 is unreachable, it has no in-memory +zone to fall back on. A zonefile on disk removes that failure mode. + +## Secrets Bootstrap + +### Contract + +An AWS-resident bootstrap instance must be able to, starting from only its +IAM role: + +1. Reach the tailnet. +2. Clone the private `secrets/` repo from GitHub. +3. Decrypt SOPS-encrypted files in that repo. + +At no point may the instance hold a durable, plaintext credential for any of +the three systems (Tailscale, GitHub, SOPS). All identity traces back to the +instance's attached IAM role. + +### KMS as a SOPS recipient + +The existing `secrets/` repo continues to hold SOPS-encrypted files. A single +**customer-managed KMS key** in the `lab` account is added as an additional +SOPS recipient. Any principal granted `kms:Decrypt` on that key — human +(via Identity Center permission set) or machine (via instance profile) — can +decrypt. + +This is non-breaking for existing workflows: SOPS supports multiple +recipients, so the KMS key can be added alongside the existing age key. +Retiring the age key is possible later but not required. + +The details of the SOPS-over-KMS workflow, key rotation, and human vs. +automation paths live in a separate secrets design doc. This document only +establishes that the KMS key lives in the `lab` account and is the anchor +for machine decryption. 
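As a rough sketch of what that machine-decryption path looks like in practice — not a prescribed implementation — a bootstrap script on the instance can shell out to `sops` and let it satisfy the KMS recipient with the instance role's temporary credentials. The file path below is hypothetical.

```python
import subprocess


def decrypt_bootstrap_secret(path: str) -> str:
    """Decrypt a SOPS-encrypted file from the cloned secrets/ repo.

    sops tries the recipients declared in the file; on the bootstrap
    instance the KMS recipient succeeds using the instance profile's
    temporary credentials, so no age key or other durable secret is
    ever written to disk.
    """
    result = subprocess.run(
        ["sops", "--decrypt", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hypothetical usage during bootstrap:
# config = decrypt_bootstrap_secret("secrets/bootstrap/lab.yaml")
```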
+ +### GitHub App for repo access + +Cloning private repos from an AWS-resident bootstrap instance uses a +**GitHub App** owned by the `GilmanLab` organization and installed on the +`secrets` repo (plus any other private repos bootstrap needs to reach). + +- The App's **private signing key** is stored in an SSM Parameter Store + `SecureString` in the `lab` account. +- The instance's IAM role grants `ssm:GetParameter` + `kms:Decrypt` on that + specific parameter path only. +- On bootstrap, the instance fetches the key, generates a JWT, exchanges it + for a short-lived installation token (1-hour TTL), and clones. + +The only durable non-AWS secret anywhere in the chain is the App's private +signing key itself, and that key is at rest in AWS, gated by IAM. Installation +tokens are never stored on disk. + +### The single-anchor property + +Taken together, the chain on a single EC2 bootstrap instance is: + +| Step | Identity used | +|------|----------------------------------------------------------------| +| Join tailnet | IAM role (via workload identity federation) | +| Read GitHub App key | IAM role (via instance profile → SSM + KMS) | +| Mint installation token | App private key (short-lived, in memory) | +| Clone `secrets/` repo | Installation token (short-lived, in memory) | +| Decrypt SOPS files | IAM role (via instance profile → KMS) | + +Every persistent identity is the IAM role. Lose AWS, lose bootstrap. Gain AWS, +everything else unlocks in order. This is the design outcome the account +structure is in service of. + +## Compute and Cost Model + +### Instances + +| Name | Type | Purpose | +|----------------|------------|-------------------------------------------------------------| +| subnet router | `t4g.nano` | Tailscale site-to-site, Route 53 zonefile rendering | +| Keycloak host | `t4g.small`| Keycloak + colocated Postgres. Detailed design in Keycloak doc. | + +Both run Amazon Linux 2023 on ARM (`t4g` / Graviton). Tailscale and Keycloak +both ship native ARM builds. + +The two are kept **as separate instances** rather than colocated. A colocated +box would save ~$1.75/mo but would collapse the subnet router and identity +failure domains into one. The premium for separation is cheap insurance, and +separation is also more faithful to the day-job architecture this design +mirrors. + +EC2 instances do **not** run EBS snapshot or AMI backup jobs. The lab's +philosophy is rebuild-over-restore: every instance's durable state is either +in an external store (Route 53, KMS, SSM, S3, GitHub) or is designed to be +reconstructed from those sources. + +### Savings Plan commitment + +Both instances are long-lived infrastructure and not expected to change +instance family over their lifetime. The commitment shape is: + +- **3-year EC2 Instance Savings Plans, all-upfront**, covering the `t4g` + family in `us-west-2` +- expected effective discount ~72% vs. 
on-demand +- one purchase sized to cover both instances; additional commitment can be + layered later + +### Cost envelope + +Approximate, all-upfront amortized: + +| Item | Monthly | 3-year | +|-------------------------------|---------|--------| +| Subnet router (t4g.nano) | ~$1.75 | ~$63 | +| Keycloak host (t4g.small) | ~$3.89 | ~$140 | +| KMS customer-managed key | ~$1.00 | ~$36 | +| SSM Parameter Store (standard)| ~$0 | ~$0 | +| Route 53 private zone + queries | ~$0.50 | ~$18 | +| Data transfer | negligible | negligible | +| **Total (approximate)** | **~$7.15** | **~$260** | + +EBS, Elastic IPs attached to running instances, and Route 53 API calls for the +1-minute zonefile sync all fall into noise-level cost at lab scale. + +## Infrastructure as Code + +- All AWS resources are managed with **OpenTofu** from the `infra/` repo + under `infra/aws/`. +- OpenTofu state is stored in an **S3 bucket in the `lab` account**, using + S3's native locking. +- The OpenTofu entrypoint assumes a permission set role via Identity Center + for human-operator runs. Future CI-triggered runs will use a separate + identity (out of scope for this document). + +### Manual bootstrap surface + +A small amount of setup exists outside of OpenTofu, because it must exist +before OpenTofu can run: + +1. Creation of the AWS Organization and the two accounts. +2. Enablement of IAM Identity Center and the single operator user. +3. Creation of the S3 state bucket and the minimum IAM role OpenTofu will + assume. + +Everything downstream of that — the VPC, the subnet router, the private zone, +the KMS key, the SSM parameters, the instance profiles — is declared in +OpenTofu. + +## Failure Domains + +What fails together and what does not: + +| Failure | Lab DNS | Lab serving | AWS console access | Bootstrap of new lab instances | +|--------------------------------|---------|-------------|--------------------|------------------------------| +| Lab internet outage | ✓ (cached zonefile) | ✓ | ✗ (can't reach AWS) | ✗ | +| Subnet router EC2 down | ✓ (cached zonefile) | ✓ | ✓ | ✗ (tailnet → AWS bridge down) | +| `lab` account compromise | ✓ (cached zonefile, short-term) | ✓ | partial | ✗ | +| `lab-mgmt` account lost | ✓ (cached zonefile, short-term) | ✓ | ✗ | ✗ | +| Keycloak host down | ✓ | ✓ (except OIDC-gated services) | ✓ | ✓ | +| AWS region outage | ✓ (cached zonefile) | ✓ | ✗ | ✗ | +| Full lab power/hardware loss | ✗ | ✗ | ✓ | depends on external rebuild | + +The dominant pattern: **lab-side serving is robust to any offsite failure** +thanks to the zonefile-on-disk DNS path and locally-resident workloads. +Offsite failure primarily costs the ability to make changes, not the ability +to keep running. + +## Future Work + +The following are known next steps that are intentionally out of scope here: + +- **Keycloak design doc.** Deployment shape, Postgres colocation, backup to + object storage with Synology sync, rebuild-over-restore DR procedure, and + GitOps via `keycloak-config-cli`. +- **Cluster workload OIDC to AWS.** ExternalDNS-style workloads on the Talos + cluster will need AWS credentials; the cluster's own OIDC issuer (IRSA-style + federation) is the expected mechanism, not Tailscale-based federation. +- **GitHub Actions OIDC.** Trusting GitHub Actions as an OIDC identity + provider in the `lab` account so CI can apply OpenTofu without long-lived + keys. +- **Keycloak SAML federation to IAM Identity Center** as a secondary, + convenience access path alongside the local identity store. 
+- **Additional member accounts** under the same Organization as the lab's + day-job mirroring grows (prod-mirror, dev, etc.). +- **Secrets design doc** covering the SOPS-over-KMS workflow, rotation, + Vault relationship, and promotion model across bootstrap vs. per-cluster + secrets. + +## References + +- [Keycloak](./keycloak.md) +- [Bootstrap and Core Delivery Model](./bootstrap-core-delivery.md) +- [Multi-Cluster GitOps Model](./gitops-multi-cluster.md) +- [Service Exposure and Control Plane Endpoints](./service-exposure-and-control-plane-endpoints.md) +- [Tailscale Workload Identity Federation](https://tailscale.com/kb/1581/workload-identity-federation) +- [AWS IAM Identity Center external IdP options](https://docs.aws.amazon.com/singlesignon/latest/userguide/manage-your-identity-source-idp.html) diff --git a/docs/docs/designs/index.md b/docs/docs/designs/index.md index e22fc70..f022b90 100644 --- a/docs/docs/designs/index.md +++ b/docs/docs/designs/index.md @@ -23,6 +23,8 @@ Current designs: - [kro Consumption Model](./kro-consumption-model.md) - [Platform RGD Delivery Model](./platform-rgd-delivery.md) - [App RGD Design](./app-rgd.md) +- [AWS Lab Account](./aws-lab-account.md) +- [Keycloak](./keycloak.md) Once a design is implemented and considered durable, its steady-state shape should be folded back into the architecture overview and any relevant runbooks. diff --git a/docs/docs/designs/keycloak.md b/docs/docs/designs/keycloak.md new file mode 100644 index 0000000..7ecf9fd --- /dev/null +++ b/docs/docs/designs/keycloak.md @@ -0,0 +1,441 @@ +--- +title: Keycloak +description: Proposed design for the lab's central identity system — deployment shape, federation, configuration, backups, and the rebuild-over-restore disaster-recovery model. +--- + +# Keycloak + +## Status + +Proposed. + +This document defines how the lab runs Keycloak as its central identity +provider. It covers deployment shape, identity federation, declarative +configuration, TLS, backups, the rebuild-over-restore disaster-recovery +model, and the per-service break-glass matrix that lets the lab continue to +operate while Keycloak is down. + +This document assumes the [AWS Lab Account](./aws-lab-account.md) design. It +does not re-establish shared context about the AWS Organization, networking, +IAM, or secrets bootstrap. + +## Purpose + +The primary purpose of this design is to keep the lab's identity system on a +durable off-lab trust anchor, while keeping the lab itself operable when that +trust anchor is unreachable. + +The intended split is: + +- Keycloak holds the authoritative, human-facing identity of record +- cluster-level and service-level **break-glass paths** exist for every OIDC + consumer, so identity outages do not cascade into cluster-access outages +- configuration is declaratively sourced from git, so **rebuild is the default + recovery mode** and restore is a fallback used only for runtime state a + single-user lab can recreate in seconds + +This mirrors a day-job architecture at small scale without overpaying for +HA features a single-user lab cannot justify. + +## Goals + +- Provide one place to manage human identity across the lab. +- Keep Keycloak outside the lab's physical failure domain while keeping its + blast radius understood. +- Make Keycloak's configuration surface fully declarative via git, so rebuild + is a first-class recovery path. +- Ensure every Keycloak-dependent service has a documented break-glass path + that does not require Keycloak. 
+- Make the disaster-recovery procedure short enough to execute without a + runbook open on a phone. + +## Non-Goals + +- This document does not run Keycloak in a highly-available configuration. + Single-node is a deliberate choice for a single-user lab and is not a gap. +- This document does not define the per-service Keycloak client configuration + (redirect URIs, scopes, role mappings, token TTLs). Those live in the + realm repository. +- This document does not define monitoring, logging, or alerting. +- This document does not define the Keycloak → Identity Center SAML + federation path. That remains future work. +- This document does not define cluster-level OIDC federation to AWS (used + by ExternalDNS and similar controllers). That is a separate concern handled + by the cluster's own OIDC issuer, not by Keycloak. + +## Design Summary + +Keycloak runs on a single dedicated EC2 instance in the `lab` member account, +colocated with its Postgres database. Access is at **`id.glab.lol`** via a +Route 53 private-zone record and a TLS certificate issued automatically via +**ACME DNS-01**. The instance and database are deployed via **Docker +Compose**. + +The only upstream identity source is **GitHub, federated via OIDC**. A single +realm named `lab` holds all users and all OIDC/SAML clients. Sign-in to any +Keycloak-fronted service is: user → service → Keycloak → GitHub. + +Keycloak's declarative surface — realms, clients, roles, identity provider +settings, scopes, authentication flows — is reconciled from a git repository +by **`keycloak-config-cli`** running as a scheduled job on the Keycloak host. +Runtime state (user credentials, sessions, TOTP enrollment) is not in git and +is the only part of the system that needs backup-based recovery. + +Database dumps and the current TLS cert bundle are backed up nightly to an +S3 bucket in the `lab` account. The lab's Synology NAS pulls those backups +locally on a schedule, so a recent copy of Keycloak's runtime state exists on +a second continent and a second provider. + +**Disaster recovery is rebuild-first.** For a single-user lab, a rebuild from +the git-tracked realm + a fresh Postgres is faster than a restore, enforces +discipline that every config-surface change actually lives in git, and +produces a clean outcome. A restore path exists as fallback but is not the +primary recovery mode. + +When Keycloak is entirely unavailable, every Keycloak-fronted service has a +local break-glass path documented in this doc. The lab continues to operate; +only the "unified identity" experience degrades. + +## Deployment Shape + +### Host and Runtime + +- **Instance:** `t4g.small` (2 vCPU, 2 GB RAM), Amazon Linux 2023 on ARM, in + the `lab` account and `172.16.0.0/16` VPC from the AWS design. +- **Runtime:** Docker Compose manages two services: + - `keycloak` — official upstream Keycloak image, tagged to a specific + version pinned in `infra/`. + - `postgres` — official Postgres image, tagged to a specific version pinned + in `infra/`. Data volume on the instance's EBS root volume. +- **Reverse proxy:** Caddy (or equivalent) runs alongside and terminates TLS, + proxying to Keycloak on loopback. Caddy performs ACME DNS-01 renewals using + the instance's IAM role. This is the recommended shape; the exact proxy is + an implementation detail that does not need to appear in this doc. +- **No EBS snapshots or AMI backups.** State recovery is via the application + backup path below, not via block-level snapshots. 
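As a rough sketch of what that application backup path could look like — the details live in the Backups section below — a nightly job on the host might dump Postgres through Compose and push the result to S3 using the instance role. The bucket, prefix, and database names here are hypothetical, and the client-side encryption step described under Backups is elided for brevity.

```python
import datetime
import subprocess

import boto3  # credentials come from the instance's IAM role


BUCKET = "glab-keycloak-backups"  # hypothetical bucket name
PREFIX = "keycloak/postgres"      # hypothetical key prefix


def nightly_backup() -> None:
    """Dump the Keycloak database and upload it to S3."""
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y-%m-%dT%H%MZ")
    dump_path = f"/var/backups/keycloak-{stamp}.sql.gz"

    # pg_dump runs inside the postgres container managed by Docker Compose;
    # the database and role names are illustrative.
    with open(dump_path, "wb") as out:
        subprocess.run(
            ["docker", "compose", "exec", "-T", "postgres",
             "sh", "-c", "pg_dump -U keycloak keycloak | gzip"],
            stdout=out,
            check=True,
        )

    boto3.client("s3").upload_file(dump_path, BUCKET, f"{PREFIX}/keycloak-{stamp}.sql.gz")


if __name__ == "__main__":
    nightly_backup()
```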
+ +### Identity Profile for the Host + +The EC2 instance's IAM role grants, at minimum: + +- Route 53 write access scoped to the `_acme-challenge.id.glab.lol` record + for DNS-01 validation. +- S3 write access to the Keycloak backup bucket's prefix. +- SSM Parameter Store read for any bootstrap-time secrets held there + (following the pattern established in the AWS design). + +The role carries no other permissions. All lab cluster-access, +secret-decryption, and tailnet-identity paths that this instance depends on +are established in the AWS Lab Account design and are not restated here. + +### Sizing and Tuning + +2 GB of RAM is tight but workable for a single-user lab because Keycloak 26.x +(Quarkus-based) has a much smaller footprint than earlier WildFly-based +versions. The required tuning is: + +- explicit Keycloak JVM max heap (e.g. ~768 MB) +- conservative Postgres `shared_buffers` (~128 MB) +- a swap file on the EBS volume as a safety margin + +CPU is burstable but effectively idle for single-user workloads; unlimited +mode is enabled to tolerate rare login bursts at negligible cost. + +## Identity Federation + +### Realm Structure + +A single realm named `lab` holds all lab users and all OIDC/SAML clients. No +separate realms for services vs. humans; a single-user lab does not benefit +from the separation, and multi-realm setups make GitOps reconciliation more +fragile. + +### Upstream IdP + +The realm has **exactly one identity provider configured: GitHub, via +OIDC**. There is no local username-password fallback. A lab user's identity +is their GitHub identity, federated through Keycloak, presented downstream to +each OIDC client. + +The Keycloak admin bootstrap user exists only briefly during initial realm +creation and is disabled once `keycloak-config-cli` has reconciled the +realm from git. + +### Intended Early OIDC Clients + +Keycloak is expected to front these services first: + +- **kubectl → Talos clusters** via each cluster's Kubernetes API-server OIDC + configuration. +- **Argo CD** (web UI and CLI). +- **Grafana** (when deployed). + +Additional clients will be added over time. The authoritative list lives in +the realm repository, not in this document. + +## TLS + +### Issuance + +The cert for `id.glab.lol` is issued via **ACME DNS-01 against Route 53** +using the instance's IAM role. There is no internet-exposed HTTP-01 path, +because the host sits in a private VPC with the subnet router as its only +tailnet attachment. + +No wildcard cert is used. Each service in the lab gets its own host-scoped +cert: + +- this host: `id.glab.lol` +- future clusters: `nonprod.k8s.glab.lol`, `prod.k8s.glab.lol` +- future services: their own host names + +Cluster-fronted services will issue their own certs via cert-manager and +each cluster's OIDC trust relationship with AWS (a separate future design). + +### Renewal + +Renewal is automatic and handled by whichever TLS-terminating proxy is used +on the host. No human intervention is expected between renewals. + +### Restore-time behavior + +For same-hostname restore to work without waiting on ACME, the current TLS +cert bundle (cert + private key) is included in the nightly backup payload +alongside the Postgres dump. A restored host can serve HTTPS immediately +with the backed-up cert and let the proxy handle its own renewal on the +normal schedule. + +## Configuration as Code + +### Source of Truth + +Keycloak's declarative surface is reconciled from a git repository. 
The +**repository is the source of truth**; the running Keycloak's admin surface +is a read-through cache of what's in git. + +In scope for git reconciliation: + +- realms +- clients +- client scopes +- roles and role mappings +- identity-provider configuration +- authentication flows and required actions +- realm-level settings + +Out of scope for git (intentionally runtime state): + +- user credentials (password hashes, WebAuthn registrations, TOTP secrets) +- sessions and refresh tokens +- audit and event logs +- ephemeral tokens and one-time codes + +### Reconciliation Tool + +Reconciliation uses **`keycloak-config-cli`** (adorsys). The tool is mature, +works against the admin API, handles partial updates, and does not require +Kubernetes CRDs or a separate operator. It is the best available GitOps +option as of this writing given the upstream Keycloak Operator still does +not provide first-class CRDs for clients, users, roles, or identity +providers. + +### Reconciliation Location + +`keycloak-config-cli` runs **on the Keycloak host itself** as a scheduled +job. Pull cadence is a small number of minutes; the exact cadence is an +implementation detail. The job authenticates to Keycloak using a +reconciliation service account stored in SSM Parameter Store. + +This location is intentionally simple for now. Pushing reconciliation into +GitHub Actions (so every git push triggers a reconcile) is named as future +work — it would enforce "git push is the only way config changes" more +strictly — but it requires a reachable admin endpoint and an appropriate +trust path, which are better designed once the realm repo has concrete +shape. + +### Schema Versioning + +Keycloak migrations run forward only. The realm repository pins the +Keycloak version it expects. Upgrades are driven by bumping the pin in +`infra/` and allowing the next reconcile cycle to re-apply cleanly against +the upgraded Keycloak. + +## Backups + +### What is backed up + +- Postgres dump (the full database, including all runtime state). +- Keycloak configuration files that live outside the database: `keycloak.conf`, + environment overrides, any custom themes or providers. +- The current TLS cert bundle (cert + private key). + +Configuration in git is **not** part of the backup — git is already the +durable store for it. + +### Where they go + +- **Primary destination:** an S3 bucket in the `lab` account. The bucket + uses server-side encryption with a KMS key; object-lock/versioning is on + so corruptions cannot silently overwrite known-good backups. +- **Secondary destination:** the lab's Synology NAS, which pulls from the S3 + bucket on its own schedule. + +The host writes backups to S3 using the instance's IAM role (no long-lived +credentials). The Synology pulls from S3 using a scoped, read-only access +mechanism chosen when Synology-side automation is implemented — out of scope +for this document. + +### Retention + +Retention is a rolling window, implemented via S3 lifecycle policies. The +contract: + +- **daily** backups retained for **30 days** +- **weekly** backups retained for **12 weeks** +- **monthly** backups retained for **12 months** + +"Last backup only" is explicitly rejected. A corruption-style incident +(realm data mangled by a bad change, not a hardware failure) requires +point-in-time restore from days ago. + +### Encryption + +Backups contain password hashes, signing keys, TOTP secrets, and session +state. 
They are encrypted twice: + +- client-side: the Postgres dump and cert bundle are encrypted before upload + using a recipient key managed alongside the rest of the lab's bootstrap + secrets +- server-side: the S3 bucket uses SSE-KMS with a customer-managed key + +This ensures that neither a leaked S3 object ACL nor a Synology +compromise yields usable plaintext. + +## Disaster Recovery + +The lab uses a **rebuild-first** recovery model. Restore exists as a +fallback, not as the primary path. + +### Rebuild path (primary) + +When Keycloak is unrecoverable or is being moved: + +1. Provision a fresh EC2 instance from the `infra/` OpenTofu modules. +2. Run `docker compose up` to start Keycloak and a **fresh** Postgres. +3. Run `keycloak-config-cli` against the new Keycloak, pointing at the realm + repository. All realms, clients, roles, and GitHub federation come back. +4. Sign in via GitHub. A new user entry is created on first login per the + identity-provider mapper configuration. +5. Re-enroll WebAuthn / TOTP (30 seconds). +6. Keycloak is operational. + +**Target RTO: 15 minutes.** This path requires only the git repository and +AWS access; it does not require any backup store. + +### Restore path (fallback) + +When a rebuild is unacceptable — for example, if you need to preserve the +exact user-state including federated-identity linkages and audit history — +the restore path is: + +1. Provision a fresh EC2 instance. +2. Pull the most recent, or a chosen point-in-time, backup from S3 or + Synology. +3. Restore the Postgres dump into a fresh Postgres. +4. Place the TLS cert bundle and any config files. +5. Run `docker compose up`. Keycloak boots against the restored database. +6. `keycloak-config-cli` runs its normal reconcile cycle; any configuration + drift between the backup time and `git HEAD` is corrected forward. + +For the single-user lab, the restore path is rarely worth it. For the +day-job mirror architecture, it is the default. + +### Hostname preservation + +Both paths require that the restored instance serve at **`id.glab.lol`**. +The `issuer` claim on every JWT Keycloak has ever signed is tied to that +URL. Changing the hostname invalidates every existing token and every +client's cached OIDC discovery. + +This is normally handled by DNS: the new instance comes up in the same VPC +with the same Route 53 A record pointing to it. During a full internet-loss +DR scenario where Route 53 is unreachable, a local override path exists: +the lab's CoreDNS zonefile can be manually edited to point `id.glab.lol` at +the restored instance's Tailscale address. + +### What rebuild does not recover + +- Stored user credentials (password hashes, WebAuthn, TOTP). These must be + re-enrolled by the user on first login. For a single-user lab, 30 seconds. +- Active sessions. All users are forced to re-authenticate. +- Federated-identity linkages established at previous logins. GitHub OIDC + users are re-linked on their next successful sign-in. +- Audit / event history. + +For the lab, none of these matter. For a production deployment of this +design, they would matter, and the restore path becomes primary. + +## Break-Glass Matrix + +When Keycloak is down, the lab continues to operate via +per-service break-glass paths. These are not emergency workarounds; they are +durable, documented alternate authentication paths that are kept active for +exactly this reason. 
+ +| Service | Break-glass path | Notes | +|--------------------|------------------------------------------------------|-------| +| Talos API | mTLS via `talosconfig` and machine secrets | Talos's PKI is independent of Keycloak; `talosconfig` is the ultimate root anchor for the lab. | +| `kubectl` to clusters | Talos-generated admin kubeconfig via `talosctl kubeconfig` | Produced on demand against the cluster's own signing CA; not federated. | +| Argo CD | Built-in `admin` account and initial admin secret in the cluster | Retained and rotated but never disabled. | +| Vault (per cluster)| Unseal keys and root/recovery keys | Kept outside the lab per the Vault design (separate doc). | +| AWS | IAM Identity Center local user + WebAuthn hardware key | Does not federate to Keycloak by design — see the AWS doc. | +| Grafana | Local admin account | Kept active alongside OIDC client configuration. | +| GitHub (upstream IdP) | Personal GitHub account, hardware-key MFA | Keycloak is downstream of GitHub; if GitHub is down, federated login fails, and the above service-local paths are how the lab keeps moving. | + +All of these anchors' credentials live outside any Keycloak-dependent store. +`talosconfig`, Argo CD admin secrets, Vault recovery keys, and AWS root +credentials are kept in offline storage (password manager + hardware backup) +whose access does not depend on Keycloak, AWS, or internet connectivity. + +## Cost + +Monthly, all-upfront amortized over the 3-year EC2 Instance Savings Plan +commitment established in the AWS design: + +| Item | Monthly | +|--------------------------------|---------| +| t4g.small compute (3-yr SP) | ~$3.89 | +| S3 backup storage + lifecycle | ~$0.20 | +| Route 53 record queries | negligible | +| Data transfer | negligible | +| **Approximate total** | **~$4.10** | + +This is additive on top of the baseline AWS footprint laid out in the +AWS Lab Account design. + +## Future Work + +- **CI-driven reconciliation.** Move `keycloak-config-cli` from an + on-host cron into a GitHub Actions workflow so `git push` is the only + trigger that mutates Keycloak's declarative state. +- **Keycloak → IAM Identity Center SAML federation** as a secondary, + convenience path for AWS console access. The local IAM Identity Center + user remains primary. +- **Promotion to HA** if / when this design is reused at day-job scale. Two + Keycloak replicas behind a load balancer with clustered cache and a shared + database is the standard upgrade path; none of the decisions in this doc + block it. +- **Automated DR drill.** A periodic exercise — quarterly is appropriate — in + which the rebuild path is executed against a scratch instance to prove + the RTO target and keep the muscle alive. +- **Richer per-service break-glass.** Codify the break-glass secrets + themselves (their storage location, rotation cadence, recovery order) in + a separate operational runbook. + +## References + +- [AWS Lab Account](./aws-lab-account.md) +- [Bootstrap and Core Delivery Model](./bootstrap-core-delivery.md) +- [Multi-Cluster GitOps Model](./gitops-multi-cluster.md) +- [keycloak-config-cli (adorsys)](https://github.com/adorsys/keycloak-config-cli) +- [Keycloak Operator: status of first-class CRDs](https://www.keycloak.org/2022/09/operator-crs)