From fdcff2ceeec56801a25038c2d2aa83bb5f315ebc Mon Sep 17 00:00:00 2001 From: Joshua Gilman Date: Mon, 27 Apr 2026 15:32:22 -0700 Subject: [PATCH] docs(architecture): consolidate lab architecture docs --- ARCHITECTURE.md | 6 +- README.md | 15 +- docs/docs/architecture.md | 481 +++--------------- .../architecture/bootstrap-and-lifecycle.md | 196 +++++++ .../architecture/gitops-and-platform-apis.md | 457 +++++++++++++++++ docs/docs/architecture/hosts-and-substrate.md | 119 +++++ docs/docs/architecture/keycloak-runtime.md | 175 +++++++ .../architecture/networking-and-endpoints.md | 188 +++++++ .../docs/architecture/secrets-identity-pki.md | 358 +++++++++++++ docs/docs/architecture/state-and-recovery.md | 114 +++++ docs/docs/designs/app-rgd.md | 249 --------- docs/docs/designs/aws-lab-account.md | 459 ----------------- docs/docs/designs/bootstrap-core-delivery.md | 354 ------------- docs/docs/designs/gitops-multi-cluster.md | 470 ----------------- docs/docs/designs/index.md | 31 -- docs/docs/designs/keycloak.md | 441 ---------------- docs/docs/designs/kro-consumption-model.md | 365 ------------- docs/docs/designs/platform-rgd-delivery.md | 182 ------- docs/docs/designs/secrets-and-pki.md | 426 ---------------- ...ce-exposure-and-control-plane-endpoints.md | 164 ------ docs/docs/index.md | 16 +- docs/docs/routeros-acme.md | 10 + 22 files changed, 1700 insertions(+), 3576 deletions(-) create mode 100644 docs/docs/architecture/bootstrap-and-lifecycle.md create mode 100644 docs/docs/architecture/gitops-and-platform-apis.md create mode 100644 docs/docs/architecture/hosts-and-substrate.md create mode 100644 docs/docs/architecture/keycloak-runtime.md create mode 100644 docs/docs/architecture/networking-and-endpoints.md create mode 100644 docs/docs/architecture/secrets-identity-pki.md create mode 100644 docs/docs/architecture/state-and-recovery.md delete mode 100644 docs/docs/designs/app-rgd.md delete mode 100644 docs/docs/designs/aws-lab-account.md delete mode 100644 docs/docs/designs/bootstrap-core-delivery.md delete mode 100644 docs/docs/designs/gitops-multi-cluster.md delete mode 100644 docs/docs/designs/index.md delete mode 100644 docs/docs/designs/keycloak.md delete mode 100644 docs/docs/designs/kro-consumption-model.md delete mode 100644 docs/docs/designs/platform-rgd-delivery.md delete mode 100644 docs/docs/designs/secrets-and-pki.md delete mode 100644 docs/docs/designs/service-exposure-and-control-plane-endpoints.md diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md index 3e8f8c5..adbf1be 100644 --- a/ARCHITECTURE.md +++ b/ARCHITECTURE.md @@ -1,6 +1,8 @@ # Architecture Overview -The canonical version of this document now lives in +The canonical architecture entrypoint lives in [`docs/docs/architecture.md`](docs/docs/architecture.md). -Use the Docusaurus source under `docs/` for ongoing updates. The canonical page now includes the current DNS and naming model for `lab.gilman.io` alongside the broader platform architecture. +Start there for the current IncusOS, Incus, Talos, CAPI, GitOps, networking, +secrets, and recovery architecture. Focused architecture documents live under +[`docs/docs/architecture/`](docs/docs/architecture/). diff --git a/README.md b/README.md index 1ce5007..ffc4a53 100644 --- a/README.md +++ b/README.md @@ -1,11 +1,10 @@ # GilmanLab Docs -This repository is the dedicated documentation home for the GilmanLab -homelab. +This repository is the dedicated documentation home for the GilmanLab homelab. 
-The root documents point into the Docusaurus site source under `docs/`, and -the site is the primary surface for architecture notes, hardware references, -runbooks, and future how-to material. +The Docusaurus site source lives under `docs/`. The architecture set is the +canonical starting point for the current lab design, with runbooks and +implementation details added separately as prototypes become real workflows. ## Quick Start @@ -30,11 +29,11 @@ moon run docs:start ## Current Content - [`docs/docs/index.md`](docs/docs/index.md): docs landing page -- [`docs/docs/architecture.md`](docs/docs/architecture.md): architecture overview -- [`docs/docs/designs/`](docs/docs/designs): proposed design documents that are - not yet part of the settled architecture baseline +- [`docs/docs/architecture.md`](docs/docs/architecture.md): architecture entrypoint and reading order +- [`docs/docs/architecture/`](docs/docs/architecture): focused architecture documents, including Keycloak runtime and recovery - [`docs/docs/hardware.md`](docs/docs/hardware.md): hardware inventory - [`docs/docs/network-device-backups.md`](docs/docs/network-device-backups.md): RouterOS backup design for the future platform cluster +- [`docs/docs/routeros-acme.md`](docs/docs/routeros-acme.md): RouterOS ACME certificate notes ## Support diff --git a/docs/docs/architecture.md b/docs/docs/architecture.md index fb7fc35..c7ff9b1 100644 --- a/docs/docs/architecture.md +++ b/docs/docs/architecture.md @@ -1,446 +1,83 @@ --- title: Architecture Overview -description: High-level architecture for the GilmanLab homelab rework. +description: Start here for the current GilmanLab architecture baseline. --- # Architecture Overview -This document captures the current high-level architecture for the lab rework. +This is the current architecture baseline for the GilmanLab homelab. -It is intentionally centered on control flow and system boundaries rather than detailed implementation. Low-level configuration, exact IP plans, and per-service manifests belong elsewhere. +The architecture centers on IncusOS hosts, an Incus cluster, Talos Linux +virtual machines, and a GitOps/CAPI management model. It is intentionally +practical rather than exhaustive: it records the direction and the important +boundaries, while leaving detailed manifests, exact addresses, and prototype +findings to implementation work. -For the physical inventory, see [the hardware reference](./hardware.md). +For the physical inventory, see [Hardware Reference](./hardware.md). -## Overview +## Reading Order -The architecture is built around a dedicated platform cluster running on the `UM760`, with the `MS-02 Ultra` systems acting as Proxmox compute hosts. The platform cluster is the control plane for provisioning, cluster creation, GitOps, and infrastructure automation. +Read these documents as one architecture set: -The main design goal is to keep the platform responsibilities concentrated in one isolated Talos-based Kubernetes cluster while treating the Proxmox layer as a unified VM substrate for downstream clusters. +1. [Hosts and Substrate](./architecture/hosts-and-substrate.md) +2. [Bootstrap and Cluster Lifecycle](./architecture/bootstrap-and-lifecycle.md) +3. [Networking and Endpoints](./architecture/networking-and-endpoints.md) +4. [GitOps and Platform APIs](./architecture/gitops-and-platform-apis.md) +5. [Secrets, Identity, DNS, and PKI](./architecture/secrets-identity-pki.md) +6. [Keycloak Runtime](./architecture/keycloak-runtime.md) +7. 
[State and Recovery](./architecture/state-and-recovery.md) -At a high level: +Runbook-style pages remain separate: -- The `UM760` runs the platform cluster on Talos Linux. -- The platform cluster runs Tinkerbell, CAPI, Argo CD, AWX, and optional TerraKube. -- The `MS-02 Ultra` nodes are provisioned over PXE into Proxmox and then clustered into a single Proxmox control plane. -- AWX and/or TerraKube handle node-level post-provisioning work. -- CAPI creates downstream Talos-based clusters as Proxmox VMs through the clustered Proxmox API surface. -- The `DS923+` provides shared storage to the Proxmox layer. -- The `VP6630` remains the lab router, DNS entrypoint, and network boundary to the home network. +- [Network Device Backups](./network-device-backups.md) +- [RouterOS ACME Certificates](./routeros-acme.md) -## Design Intent +## Architecture Baseline -The current design is driven by a few core ideas: +The lab shape is: -- Keep the platform control plane separate from the Proxmox compute layer. -- Use Tinkerbell for bare-metal provisioning of the Proxmox nodes. -- Form a Proxmox cluster early so CAPI has one control surface to target. -- Use CAPI as the main cluster lifecycle engine for downstream Kubernetes clusters. -- Prefer reusable Proxmox VM templates over highly customized VM-by-VM definitions. -- Avoid coupling initial Proxmox clustering to shared storage; storage is a later concern. +- The `UM760` and all three `MS-02 Ultra` systems run IncusOS directly on bare + metal. +- Those four hosts form one Incus cluster. +- Talos Linux runs as Incus VMs and provides the Kubernetes nodes. +- The first platform Talos VM starts on the `UM760`; the final platform cluster + expands to three dual-role Talos nodes once the `MS-02` hosts join. +- Cluster API Provider Incus, the Talos CAPI providers, and GitOps own normal + Kubernetes cluster lifecycle. +- VyOS remains the lab network boundary and provides DHCP, DNS, PXE support, + the real platform Kubernetes API TCP frontend, and BGP peering for Kubernetes + service VIPs. +- Cilium provides the Kubernetes datapath, LoadBalancer IP allocation, and BGP + advertisements for service VIPs. +- AWS anchors bootstrap identity, SOPS/KMS access, selected DNS material, and + GitHub token broker access to the private `secrets` repo. +- Recovery is rebuild-first. Talos VMs and Incus hosts should be recreated from + declarative inputs when possible; backups protect the state that cannot be + rebuilt from Git, CAPI, and GitOps. -## Core Components +## Current Status -### Platform Cluster (`UM760`) +This is the desired architecture, not a claim that every piece is already live. -The `UM760` hosts the platform cluster as an isolated Talos Linux Kubernetes cluster. +The following items are deliberately still prototype-validation work: -The current design treats this as a single-node cluster. +- IncusOS `Operation` image generation and seeding for a final disk image. +- Writing the chosen IncusOS image through Tinkerbell `image2disk` or + `oci2disk`. +- Running the temporary bootstrap `k0s` cluster as a host-networked VyOS + container. +- CAPN plus the Talos providers creating the desired Talos VM shape. +- Exact VLANs, static addresses, ASNs, DNS records, and service VIP pools. -This cluster is intended to own the following responsibilities: +Those details should be proven in small prototypes before they become runbooks +or exact implementation references. 
-- `Tinkerbell`: full provisioning stack for PXE-based installation of the Proxmox nodes -- `CAPI`: downstream cluster lifecycle management using Proxmox as infrastructure -- `Argo CD`: GitOps for the platform cluster itself, and potentially for downstream cluster registration and sync -- `AWX`: Ansible orchestration for infrastructure tasks that are still better handled through playbooks -- `TerraKube`: optional Terraform-based automation for future non-node-bootstrap workflows -- network-device backup services for RouterOS configuration history and encrypted recovery artifacts +## Explicit Non-Goals -This machine is not being used as a general-purpose compute node. Its purpose is to act as the lab control plane. +The v1 architecture does not include: -At present there is no separate physical capacity available to make the platform cluster highly available. A fallback idea would be to turn the `UM760` into a Proxmox node and run three platform-cluster VMs on it, but that does not materially improve failure tolerance because all three nodes would still share the same physical host, storage path, and power domain. The added virtualization and bootstrap complexity would outweigh the benefit. - -### Proxmox Nodes (`MS-02 Ultra`) - -Each `MS-02 Ultra` is intended to be a dedicated Proxmox node. - -The current design assumes: - -- Each node is provisioned through Tinkerbell over PXE using unattended installation -- Each node receives follow-up configuration through `AWX` -- The three nodes are clustered into a single Proxmox control plane early -- Shared storage is not required for initial Proxmox cluster formation -- Each node should become a "complete" unit that can host VM templates and receive CAPI-created VMs cleanly through the clustered Proxmox API - -This means the architecture uses clustered Proxmox management from the start, while deferring shared storage and any dedicated storage backplane work to a later phase. - -### Shared Storage (`DS923+`) - -The `DS923+` is the shared storage layer for the Proxmox environment. - -The current intended uses are: - -- `NFS` storage for templates, images, and backups -- Optional NAS-backed disks for workloads that need a more redundant storage path - -The default storage model is intentionally local-first: - -- active VM disks run from node-local NVMe on the `MS-02 Ultra` hosts -- the `DS923+` is used for templates, images, backups, and optional NAS-backed VM disks exposed over NFS -- `iSCSI` is not part of the default VM execution path - -This design leaves room for a mixed storage model, where some storage remains node-local while some is NAS-backed. Shared storage improves migration, HA behavior, and storage flexibility, but it is not a prerequisite for initial Proxmox cluster formation. - -The main tradeoff is deliberate: although the NAS has `10GbE` connectivity to the Proxmox hosts, its underlying media is HDD-based. The design therefore prioritizes local NVMe performance for running VMs and uses the NAS primarily for backup, image distribution, and selective redundancy rather than as the primary block storage for all guests. - -The architecture does not define a strict workload placement policy for NAS-backed disks. NFS-backed storage is available to VMs that need it, but whether any given node or workload uses that storage is left to the workload's own requirements rather than being fixed globally in advance. - -The `DS923+` is also the primary durable backup boundary in the system. 
Platform-cluster state and Proxmox host configuration are intended to be rebuilt from source-controlled configuration and automation, while NAS-backed artifacts and data are the main state that should be explicitly protected. - -### Network Boundary and Switching - -The physical network roles remain: - -- `CCR2004`: home router -- `VP6630`: lab router, DNS entrypoint, and DMZ boundary to the home network -- `CRS309-1G-8S+IN`: lab switch -- `TL-SG105`: dedicated Intel AMT switch for the `MS-02 Ultra` management links - -The current cabling intent is: - -- Each `MS-02 Ultra` uses one `25GbE` link to the `CRS309` -- A future second link per node may be dedicated to storage and/or Proxmox clustering traffic -- Intel AMT links terminate on the `TL-SG105` -- The `VP6630` remains the routing boundary and is the intended upstream BGP - peer for Cilium-advertised service VIPs on future multi-node clusters - -The baseline network model is intentionally smaller than the previous lab design, but it still preserves a dedicated Layer 2 provisioning domain for Tinkerbell: - -- platform / management network -- provisioning network for PXE and DHCP -- workload / VM network -- AMT / OOB network -- optional storage / replication network later - -The provisioning network exists because Tinkerbell's DHCP and PXE flow requires Layer 2 access or DHCP relay. The current design assumes a dedicated provisioning segment rather than folding PXE traffic into the general workload path. - -### DNS and Naming - -The lab uses `lab.gilman.io` as its internal naming root rather than a private-only top-level domain. - -This keeps internal naming under a real domain the lab controls while still allowing private DNS views, selective public exposure later, and a clean future path for public certificates on intentionally exposed endpoints. - -The current intended DNS design is: - -- `VyOS` remains the client-facing resolver for the lab networks. -- A `PowerDNS Authoritative` service runs as a container on the `VP6630`. -- `VyOS` forwards the lab's internal zones to that local authoritative service. -- Internal names are private by default; public DNS is reserved for explicitly exposed entrypoints. -- Internal certificates are expected to use a private CA by default; public CA issuance is reserved for endpoints that benefit from public trust. - -Running the authoritative DNS service on the router boundary instead of inside the platform cluster avoids a bootstrap dependency where the platform control plane would need to be healthy before the lab can resolve the names used to reach it. - -The namespace is intentionally split by ownership boundary instead of using one flat dynamic zone: - -| Zone | Writers | Purpose | -| --- | --- | --- | -| `lab.gilman.io` | manual or GitOps-managed only | Parent zone, delegations, and a small set of static anchor records | -| `mgmt.lab.gilman.io` | manual or GitOps-managed only | Stable management and platform service names | -| `dhcp.lab.gilman.io` | `VyOS` DHCP via `RFC2136` | Dynamic lease-driven hostnames | -| `.k8s.lab.gilman.io` | `ExternalDNS` via `RFC2136` | Cluster-scoped workload and ingress names | - -This design keeps the management namespace stable while still allowing dynamic DNS for both DHCP clients and Kubernetes workloads. It also keeps update rights narrow: `VyOS` DHCP cannot mutate management records, and each Kubernetes cluster can be constrained to only its own delegated subzone. 
- -### Internal PKI and Trust - -The lab's internal PKI is designed around the same bootstrap constraint as internal DNS: the trust anchor for internal services cannot depend on the platform cluster being healthy before it can issue or rotate certificates. - -The current intended PKI design is: - -- The internal root CA key lives in `AWS KMS`. -- The root CA is treated as operationally offline: no always-on lab service has standing permission to use it for routine issuance. -- A `Smallstep step-ca` service runs as a container on the `VP6630` as the online intermediate CA. -- Internal ACME is provided by that `step-ca` instance for automated certificate issuance and renewal. -- `Vault` remains the expected long-term home for most secret management inside the platform cluster, but it is not the bootstrap owner of the internal CA hierarchy. - -This keeps naming and trust in the same edge-adjacent failure domain without forcing the platform cluster to come up first. If the platform cluster is down, the lab can still resolve internal names and issue or renew the certificates needed to restore that control plane. - -The intended trust boundary is deliberately split: - -| Component | Role | Notes | -| --- | --- | --- | -| Root CA | trust anchor | Stored in `AWS KMS`; used only for intermediate issuance and rotation | -| `step-ca` on `VP6630` | online issuing intermediate | Handles day-to-day certificate issuance for internal services | -| ACME clients | automated consumers | Used by `cert-manager` and other internal services that can rotate through ACME | -| `Vault` | secret management consumer | May issue or store service-specific material later, but does not own bootstrap PKI | - -This design accepts that routing, internal DNS, and the online intermediate CA share the `VP6630` failure domain. That is an intentional trade for the homelab: a single edge host keeps the bootstrap path simple, while the root CA remains outside that host's routine operating privileges. - -### Network Device Backups - -Network-device backup collection belongs in the platform cluster once that -cluster is online. - -The first target devices are the MikroTik `CRS309` lab switch and `CCR2004` home -router. The durable flow should use `Oxidized` for RouterOS collection and a -small SOPS-aware writer that commits only encrypted backup artifacts into the -private `secrets` repo. - -This is intentionally not a `VP6630` container responsibility. RouterOS backups -are operational recovery support, not a bootstrap dependency like DNS or PKI. -Keeping the backup stack in the platform cluster keeps the router focused on -routing, internal DNS, and certificate issuance while the platform cluster owns -automation and Git-backed operational services. - -The design is documented in [Network device backups](./network-device-backups.md). - -## Control Flow - -### 1. Platform Bootstrap - -The first step is bringing up the Talos-based platform cluster on the `UM760`. - -The bootstrap path is intentionally independent of `Tinkerbell`, `CAPI`, `AWX`, and `TerraKube`. - -The current intended flow is: - -1. Generate the Talos machine configuration from source-controlled configuration and scripts. -2. Generate a reproducible Talos ISO for the `UM760`. -3. Write that ISO to a USB installer using standard host tooling. -4. Boot the `UM760` from USB using the normal Talos ISO install path. -5. Bring the node up as the initial platform cluster node. -6. Install `Argo CD`. -7. Let GitOps reconcile the rest of the platform stack. 
- -Talos supports this model directly: - -- Talos supports embedding machine configuration directly into the bootable image. -- Talos supports generating customized ISO boot assets through Image Factory or the `imager` tool. -- Talos documents ISO boot on bare metal as the standard installation path. - -The important boundary is that Talos provides the image-generation and boot-asset customization tooling, but the act of writing the ISO to a USB stick is still a normal host-side operation rather than a Talos-specific burn command. - -Once available, that cluster becomes the system of control for the rest of the lab: - -- Argo CD reconciles platform configuration -- Tinkerbell provisions bare metal -- AWX performs node-level infrastructure configuration -- TerraKube remains available for future infrastructure workflows where Terraform is a better fit -- CAPI creates downstream clusters on Proxmox - -### 2. Bare-Metal Proxmox Provisioning - -The `MS-02 Ultra` nodes are provisioned through Tinkerbell. - -The intended path is: - -1. Tinkerbell PXE boots a target node -2. The node installs Proxmox using unattended configuration -3. The node comes up with enough baseline config to be remotely managed - -This is the primary Tinkerbell use case in the architecture. VM network installs may exist later, but they are not the main design driver. - -### 3. Proxmox Cluster Formation - -After the nodes are installed, they are joined into a Proxmox cluster. - -This happens before any expectation that `CAPI` will treat the Proxmox layer as a shared VM substrate. - -The key point is that: - -- clustered Proxmox management is required early -- shared storage is not required for this first clustering step -- storage enhancements and dedicated storage traffic can be layered in later - -This gives the lab a single Proxmox control surface for scheduling and lifecycle operations without requiring the full storage design to be complete on day one. - -### 4. Post-Provision Node Configuration - -After a node is installed, additional configuration is applied. - -This post-provisioning stage includes items such as: - -- networking -- storage configuration -- Proxmox-specific options -- image and template publication -- delivery of custom images such as Packer-built golden images - -Current intent: - -- Use `AWX` as the primary post-bootstrap configuration plane for Proxmox nodes -- Keep `TerraKube` optional for future workflows that are not part of node bootstrap itself -- Likely TerraKube use cases are complementary infrastructure resources that are better suited to Terraform than Ansible -- Exact TerraKube-managed resource classes are intentionally undecided for now -- Use `AWX` to publish golden images and VM templates into the Proxmox cluster - -### 5. Golden Images and Templates - -Golden image ownership is split across four layers: - -- `Packer` owns image creation -- the `DS923+` is the durable repository for built images and template artifacts -- `AWX` owns publishing or registering those images as Proxmox templates -- `CAPI` consumes only templates that are already present in Proxmox - -This keeps responsibilities narrow: - -- image creation stays separate from cluster lifecycle -- template publication is an explicit infrastructure operation -- `CAPI` stays focused on machine orchestration rather than image distribution - -### 6. Downstream Cluster Creation - -Once the Proxmox cluster is ready, CAPI uses Proxmox support to create downstream Kubernetes clusters as VMs. 
- -The downstream cluster assumptions are: - -- downstream clusters are Talos-based -- VM creation should rely primarily on reusable templates -- the default storage path for downstream cluster VMs is node-local NVMe -- cluster scaling should be handled through CAPI rather than ad hoc Proxmox operations -- CAPI references templates that have already been published into Proxmox - -The desired outcome is that the clustered Proxmox layer exposes a stable substrate, while CAPI owns the actual Kubernetes cluster lifecycle. - -After creation, downstream clusters are treated as strongly isolated environments rather than extensions of the platform cluster. - -The intended multi-node cluster endpoint model is: - -- Cilium LB IPAM plus Cilium BGP peering with the `VP6630` for service and - ingress VIPs -- Talos VIP for the canonical Kubernetes API endpoint on shared Layer 2 -- direct control-plane endpoints for the Talos API by default - -This is the intended standard for downstream clusters and for any future -multi-node platform cluster, but it is not live on the current platform -cluster while that cluster remains single-node. - -## Role of GitOps - -`Argo CD` runs on the platform cluster and manages the desired state of the control plane. - -At minimum, that includes: - -- platform cluster applications -- platform cluster infrastructure controllers -- provisioning stack configuration -- platform-owned operational services such as network-device backups - -The current design keeps `Argo CD` scoped to the platform cluster itself. - -This means: - -- the platform cluster's `Argo CD` manages only platform services and platform-owned infrastructure components -- downstream clusters are not assumed to be centrally registered into platform `Argo CD` -- downstream clusters are expected to manage their own services independently -- whether a downstream cluster uses `Argo CD` or some other delivery model is left to that cluster's own design - -## Disaster Recovery - -The lab's recovery model is rebuild-first. Most components are reconstructed from source-controlled configuration, GitOps state, and automation rather than restored from a point-in-time backup. This section defines the backup and restore substrate for the state that cannot be reasonably rebuilt that way. - -### Scope - -The lab accumulates irreducible state in four tiers: - -- **Proxmox VMs** — downstream cluster nodes, utility VMs, and any workload VMs whose guest state cannot be cheaply reconstructed -- **Platform and downstream Kubernetes clusters** — cluster objects (manifests, CRDs), etcd, and persistent volumes holding workload data -- **Arbitrary Linux hosts and one-off container volumes** — the `VP6630` VyOS router's container volumes (`pdns-auth` LMDB, `step-ca` BadgerDB, Tailscale machine key), Talos etcd snapshots pushed out of cluster, and similar filesystem-shaped state that does not live inside a Kubernetes cluster -- **RouterOS configuration history** — covered separately in [Network Device Backups](./network-device-backups.md). 
Its requirement is reviewable plaintext change history rather than block or filesystem recovery, so it uses a different artifact model - -### Tooling - -The first three tiers are covered by two complementary tools: - -| Tool | Scope | -| --- | --- | -| `Proxmox Backup Server` (PBS) | Proxmox VMs (native block-level, dedup, dirty-bitmap incrementals); arbitrary Linux hosts and container volumes via the standalone `proxmox-backup-client` | -| `Velero` | Kubernetes-native backup for the platform cluster and all CAPI-provisioned downstream clusters: manifests, CRDs, etcd, and persistent volumes via CSI snapshots or file-system backup | - -Two tools rather than one is a deliberate choice. PBS is best-in-class for VM-level and filesystem backup but has no native understanding of Kubernetes objects. Velero understands Kubernetes semantics — namespace-scoped restore, CRDs, application-consistent hooks, CSI integration — but does not replace a VM backup system. Each tool covers its slice well, and they do not overlap. - -### PBS Placement - -PBS runs as a Proxmox VM on the `MS-02 Ultra` tier, not inside the platform cluster and not on the NAS itself. - -This placement is driven by several constraints: - -- **Upstream install path.** PBS is shipped as a Debian-based appliance. There is no official container image, and the project's design assumes systemd and a local filesystem for the datastore. Running PBS in Kubernetes, including via KubeVirt, forces an off-path install and a container story PBS was not designed for. -- **No circular dependency on the platform cluster.** PBS exists to recover from failures, including platform-cluster failures. Running it inside the cluster it is meant to help restore is a trap. -- **GitOps automation.** Synology VMM on the `DS923+` is functional but has effectively no automation ecosystem. No official Terraform provider, no first-class Ansible coverage, and a semi-documented DSM API. Managing the PBS VM declaratively on Synology would regress against the GitOps posture the rest of the lab holds. Proxmox, by contrast, has a mature Terraform provider, an actively maintained Ansible collection, Packer builders, and a stable API — the PBS VM fits the same declarative pipeline as other infrastructure VMs. -- **Datastore locality.** The PBS datastore lives on NAS-backed storage via NFS from the `DS923+`, consistent with the NAS being the primary durable-data boundary. PBS reads and writes its dedup chunks against that mount; the VM itself stays lightweight and stateless enough to rebuild. - -The main tradeoff is timing. PBS depends on the Proxmox layer being up. Until that tier exists, backup for pre-existing state — notably the `VP6630` container volumes — is handled as an interim stopgap rather than through PBS. - -### Velero Placement - -Velero runs per Kubernetes cluster. The platform cluster gets its own Velero install, and each CAPI-provisioned downstream cluster gets its own. - -Every Velero instance writes to a shared S3-compatible object store exposed on the NAS. This keeps backup artifacts consolidated on the same durable substrate as PBS and lets restore operations pull from a single location, while still honoring the per-cluster operational boundary that Velero itself requires. 
- -Velero's backup scope per cluster includes: - -- Kubernetes manifests and CRDs -- etcd snapshots (on clusters where Velero's etcd integration applies; Talos clusters additionally retain native `talosctl etcd snapshot` as a direct path, pushed into PBS) -- persistent volumes via CSI snapshots where the cluster's CSI driver supports them, or via Velero's file-system backup path otherwise - -Downstream clusters are treated as independent recovery domains. They are not centrally registered into the platform cluster's Argo CD, and their Velero backups are self-contained so that a lost cluster can be reconstituted without standing up the platform cluster first. - -### Restore Drills - -No backup mechanism is considered complete until its restore path has been exercised at least once. For each of PBS-backed VMs, Velero-backed cluster state, and PBS-backed host-level volumes, the first production use of that backup must be paired with a documented restore drill against a lab-safe target. - -Restore expectations differ by tier: - -- **VMs** — PBS restore reconstitutes the full VM disk image. Tested via restore to a scratch VM on the Proxmox cluster. -- **Kubernetes clusters** — Velero restore reconstitutes cluster objects and PVC data into an empty cluster. Tested via restore into a throwaway CAPI cluster. -- **Host volumes** — `proxmox-backup-client` restore reconstitutes file trees on a target host. Tested via restore into a scratch directory on a lab VM. - -This applies equally to downstream clusters. A downstream cluster whose Velero backup has never been successfully restored is not considered protected. - -### Off-Site - -A single PBS datastore on the NAS leaves the lab exposed to NAS-level failure. Off-site replication is a planned addition rather than a day-one requirement. PBS supports pull-mode sync between PBS instances; the likely path is a second PBS target either on a different physical location or backed by S3-compatible object storage. Velero's object store can be mirrored or cross-region-replicated through whatever backend is chosen. The exact off-site design is deferred until the primary PBS is in place and the NAS-level failure scenarios have been characterized. -The intended future multi-cluster delivery model is being tracked separately in -[Multi-Cluster GitOps Model](./designs/gitops-multi-cluster.md). Until that -design is implemented, this architecture overview remains intentionally -conservative about downstream-cluster GitOps behavior. 
- -## Why This Layout - -This layout separates concerns in a way that matches the intended operating model: - -- Talos on the `UM760` keeps the control plane narrow and appliance-like -- Proxmox on the `MS-02 Ultra` nodes provides a clustered VM substrate for downstream clusters -- Tinkerbell handles bare-metal installation -- Early Proxmox clustering gives CAPI a single control surface without waiting for the full shared-storage design -- AWX fills the gap between bare-metal install and a fully configured Proxmox node -- Packer, the NAS, and AWX together provide a clear image pipeline without pushing image management into CAPI -- TerraKube remains available as a future addition for complementary infrastructure automation when that need becomes concrete -- CAPI becomes the main abstraction for downstream cluster creation and scaling -- Argo CD keeps the platform cluster declarative -- multi-node clusters are intended to use Cilium+BGP for service VIPs and - Talos VIP for the Kubernetes API endpoint, while the current platform cluster - remains single-node - -The design also avoids forcing too much day-one complexity into the Proxmox layer. The nodes can start as individually useful machines before later being combined into a more integrated Proxmox topology. - -For the same reason, the platform cluster remains single-node on the `UM760`. This accepts that `UM760` failure is a platform outage, but avoids introducing a false form of HA where multiple control-plane VMs still depend on the same underlying machine. - -The same rebuild-first logic applies to recovery boundaries: - -- the platform cluster is primarily rebuilt from Talos configuration, GitOps state, and automation -- Proxmox hosts are primarily rebuilt through Tinkerbell and AWX -- downstream clusters are primarily recreated through CAPI -- NAS-backed artifacts, backups, and selected workload data form the primary durable state boundary - -## Likely Next Sections - -As the design firms up, the next useful additions to this document are likely: - -- bootstrap path for the `UM760` -- Proxmox node lifecycle in more detail -- storage model -- network model -- downstream cluster lifecycle -- restore drills and disaster recovery procedures +- Proxmox on the `MS-02` hosts. +- A Proxmox cluster or Proxmox CAPI provider. +- Proxmox Backup Server as the default VM backup system. +- Shared Incus VM storage, Ceph, LINSTOR, or Incus OVN in v1. +- A manual USB path as the preferred host bootstrap workflow. diff --git a/docs/docs/architecture/bootstrap-and-lifecycle.md b/docs/docs/architecture/bootstrap-and-lifecycle.md new file mode 100644 index 0000000..ff34177 --- /dev/null +++ b/docs/docs/architecture/bootstrap-and-lifecycle.md @@ -0,0 +1,196 @@ +--- +title: Bootstrap and Cluster Lifecycle +description: Bootstrap flow, CAPI pivot, and cluster lifecycle ownership. +--- + +# Bootstrap and Cluster Lifecycle + +Bootstrap uses the same core tools that should own long-term lifecycle: +Tinkerbell for bare-metal provisioning, CAPI for cluster lifecycle, CAPN for +Incus infrastructure, and the Talos providers for Talos Kubernetes nodes. + +The design deliberately avoids a separate hand-built host install path that +would be thrown away after day one. 
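+
+The host bootstrap flow later in this document leans on Tinkerbell's
+`image2disk` action to stream the seeded IncusOS image onto the target disk. A
+minimal sketch of the kind of Tinkerbell `Template` that implies, with the
+action registry path, artifact URL, and destination disk all assumptions to be
+settled during prototyping:
+
+```yaml
+apiVersion: tinkerbell.org/v1alpha1
+kind: Template
+metadata:
+  name: incusos-install
+  namespace: tink-system        # assumption: wherever the Tinkerbell stack runs
+spec:
+  data: |
+    version: "0.1"
+    name: incusos-install
+    global_timeout: 1800
+    tasks:
+      - name: write-incusos
+        worker: "{{.device_1}}"
+        volumes:
+          - /dev:/dev
+        actions:
+          - name: stream-image-to-disk
+            image: quay.io/tinkerbell/actions/image2disk:latest  # assumption: use the image published from tinkerbell/actions
+            timeout: 900
+            environment:
+              IMG_URL: http://artifact-server.example/incusos-um760-seeded.img.gz  # hypothetical artifact location
+              DEST_DISK: /dev/nvme0n1  # assumption: the UM760's single internal disk
+              COMPRESSED: true
+```
+
+A matching `Workflow` object would bind this template to the `Hardware` entry
+for the specific host being provisioned.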
+ +## Prototype First + +Before touching the real `UM760`, prove the risky assumptions locally: + +- generate or download a seeded IncusOS USB/IMG image +- write it to a VM's only disk +- boot that disk as the steady-state IncusOS host +- confirm Incus initialization, trusted client certificate access, network + reachability, and API access +- exercise CAPN plus the Talos providers against Incus + +This prototype should be disposable. Its purpose is to learn which parts of the +new path are real before producing exact runbooks. + +## Temporary Bootstrap Cluster + +The first real bootstrap cluster is a disposable single-node `k0s` cluster on +VyOS, likely as a host-networked container. + +It exists only to run: + +- Tinkerbell +- Cluster API +- Cluster API Provider Incus +- Talos bootstrap provider +- Talos control-plane provider + +VyOS container support must be validated with the required privileges, mounts, +cgroups, and stability before this becomes an operator recipe. + +## Host Bootstrap Flow + +The intended host bootstrap sequence is: + +1. Start the temporary VyOS-hosted `k0s` cluster. +2. Install Tinkerbell and the CAPI providers into that cluster. +3. Generate a seeded IncusOS USB/IMG image for the `UM760`. +4. Use Tinkerbell and HookOS to write that image directly to the internal + `UM760` disk through `image2disk` or `oci2disk`. +5. Boot the `UM760` into IncusOS as the steady-state host OS. +6. Initialize Incus on the `UM760` with the first-node defaults needed for the + final cluster. +7. Enable Incus clustering. +8. Import or publish the Talos nocloud image needed by CAPN. +9. Use CAPN and the Talos providers to create the first platform Talos VM on + the `UM760`. + +The same Tinkerbell path provisions the `MS-02` hosts later. Joining nodes must +use IncusOS seed settings appropriate for joining the existing Incus cluster, +not for creating independent local Incus defaults. + +## Platform Cluster Bring-Up + +The platform cluster starts as one Talos VM on the `UM760`. + +Day-0 Talos configuration installs only the substrate needed to make the +cluster reachable and let GitOps take over: + +- bootstrap-safe Cilium +- minimal Argo CD on the platform cluster +- an admin-owned root Application pointing at the platform cluster selection in + `gitops` + +After Argo CD is running, the GitOps bootstrap selection installs the full +cluster-core components: full Cilium, full/self-managed Argo CD, and `kro`. + +The platform repo owns canonical bootstrap/core artifacts and release history. +The gitops repo owns per-cluster version selection and cluster-local desired +state. Infra and CAPI templates own only immutable day-0 references needed for +fresh installs and reinstalls. + +## Bootstrap/Core Artifact Contract + +The `platform/bootstrap/` subtree carries both Talos/CAPI day-0 substrate +artifacts and reusable day-1 cluster-core components. The name does not imply +that every component there is consumed directly by Talos. 
+ +```text +platform/ +└── bootstrap/ + ├── cilium/ + │ ├── Chart.yaml + │ ├── Chart.lock + │ ├── values.yaml + │ ├── bootstrap-values.yaml + │ ├── templates/ + │ └── render/ + │ ├── bootstrap.yaml + │ └── full.yaml + ├── argocd/ + │ ├── Chart.yaml + │ ├── Chart.lock + │ ├── values.yaml + │ ├── bootstrap-values.yaml + │ └── render/ + │ ├── bootstrap.yaml + │ └── full.yaml + └── kro/ + ├── Chart.yaml + ├── Chart.lock + ├── values.yaml + └── render/ + └── full.yaml +``` + +The contract for each component is: + +- `Chart.yaml`: wrapper chart metadata and pinned upstream chart dependency. +- `Chart.lock`: locked dependency resolution for local render parity and chart + publication. +- `values.yaml`: steady-state defaults for the GitOps-managed install. +- `bootstrap-values.yaml`: day-0 overrides for Talos/CAPI-safe bootstrap + rendering. +- `templates/`: platform-owned manifests layered on the upstream chart. +- `render/bootstrap.yaml`: immutable raw manifest consumed by Talos/CAPI day-0 + bootstrap. +- `render/full.yaml`: fully rendered steady-state manifest for review and + validation parity. + +`kro` has no Talos/CAPI bootstrap variant, so it does not need +`bootstrap-values.yaml` or `render/bootstrap.yaml`. + +The release and selection rules are: + +- Change canonical inputs in `platform`. +- Re-render `render/bootstrap.yaml` and `render/full.yaml`. +- Publish the wrapper chart as an OCI artifact under a component-scoped release + tag. +- Select versions per cluster through `gitops/clusters//bootstrap.yaml`. +- Reference raw Talos/CAPI artifacts by immutable commit SHA, not floating tags. +- Keep the SHA referenced by Talos/CAPI aligned with the released artifact + selected for that version. + +Bootstrap-safe Cilium must preserve the intended steady-state datapath behavior +while disabling secret-producing features that make public immutable raw +manifests unsafe. Argo CD has a minimal bootstrap render for the platform +cluster and a full self-managed render for GitOps. Day-2 changes happen by +updating `platform` inputs and `gitops` selections, not by editing bootstrap +URLs by hand. + +## Expanding And Pivoting + +After the `MS-02` hosts join the Incus cluster: + +1. Add two more Talos VMs on the `MS-02` tier. +2. Run the platform cluster as three dual-role control-plane/worker nodes. +3. Install the CAPI providers into the platform cluster. +4. Use `clusterctl move` to transfer ownership from the temporary bootstrap + cluster to the platform cluster. +5. Remove the temporary VyOS bootstrap cluster and any temporary PXE behavior + once the platform cluster and Incus cluster are healthy. + +`clusterctl move` is a bootstrap pivot mechanism. It is not a backup or +disaster recovery model. + +## Downstream Clusters + +The platform cluster is the management cluster for downstream Kubernetes +clusters. + +Downstream clusters are Talos-based and created by CAPI through CAPN. They do +not run their own Argo CD instance by default. The platform Argo CD instance +syncs cluster-core and platform API state to them after they exist. 
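+
+Downstream cluster creation ultimately reduces to a small set of CAPI objects
+owned by the platform cluster. A minimal sketch of that shape, assuming the
+Talos control-plane provider's `v1alpha3` API; names and versions are
+illustrative, and the exact CAPN infrastructure kinds and versions must be
+taken from the published CAPN Talos template rather than from this sketch:
+
+```yaml
+apiVersion: cluster.x-k8s.io/v1beta1
+kind: Cluster
+metadata:
+  name: nonprod
+spec:
+  controlPlaneRef:
+    apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
+    kind: TalosControlPlane
+    name: nonprod-control-plane
+  infrastructureRef:
+    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha2  # assumption: CAPN group/version
+    kind: LXCCluster                                       # assumption: CAPN cluster kind
+    name: nonprod
+---
+apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
+kind: TalosControlPlane
+metadata:
+  name: nonprod-control-plane
+spec:
+  replicas: 3
+  version: v1.32.0            # illustrative Kubernetes version
+  infrastructureTemplate:
+    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha2  # assumption, as above
+    kind: LXCMachineTemplate                               # assumption, as above
+    name: nonprod-control-plane
+  controlPlaneConfig:
+    controlplane:
+      generateType: controlplane
+```
+
+Once the VM shape is proven, these objects are more likely to be stamped from a
+reusable cluster class or template than written by hand per cluster.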
+ +## Prototype Validation Needed + +The bootstrap path is not complete until these are proven: + +- IncusOS image generation and seeding for first-node and joining-node modes +- Tinkerbell image writing from HookOS to the selected disks +- VyOS-hosted `k0s` stability for the temporary bootstrap stack +- CAPN plus Talos providers creating Talos VMs with the desired boot mode, + network attachment, storage pool, and endpoint model +- `clusterctl move` from the temporary cluster to the platform cluster + +## References + +- [IncusOS installation seed](https://linuxcontainers.org/incus-os/docs/main/reference/seed/) +- [Tinkerbell image2disk](https://github.com/tinkerbell/actions/tree/main/image2disk) +- [Tinkerbell oci2disk](https://github.com/tinkerbell/actions/tree/main/oci2disk) +- [CAPN Talos template](https://capn.linuxcontainers.org/reference/templates/talos.html) +- [VyOS containers](https://docs.vyos.io/en/latest/configuration/container/index.html) diff --git a/docs/docs/architecture/gitops-and-platform-apis.md b/docs/docs/architecture/gitops-and-platform-apis.md new file mode 100644 index 0000000..fb74ee9 --- /dev/null +++ b/docs/docs/architecture/gitops-and-platform-apis.md @@ -0,0 +1,457 @@ +--- +title: GitOps and Platform APIs +description: Argo CD, CAPI, kro, Kargo, cluster roles, and repository ownership. +--- + +# GitOps and Platform APIs + +The platform cluster is the management plane for the lab. + +It runs the controllers that create clusters, reconcile platform state, publish +reusable platform APIs, and drive application promotion. The workload clusters +run applications and cluster-local platform services, but they do not become +independent management planes by default. + +## Cluster Roles + +### Platform + +The platform cluster owns: + +- Argo CD +- Cluster API and providers +- Kargo +- shared `kro` APIs +- platform-only controllers and operational services + +It should not host general application workloads by default. + +### Nonprod + +The `nonprod` cluster hosts: + +- `dev` environments +- `staging` environments +- ephemeral environments such as pull requests and load tests +- nonprod shared services that belong in the workload plane + +### Prod + +The `prod` cluster hosts: + +- production application environments +- production shared services +- production policy + +## Cluster Lifecycle + +CAPI owns workload-cluster lifecycle. CAPN is the infrastructure provider for +Incus, and the Talos providers own Talos bootstrap and control-plane behavior. + +The intended responsibilities are: + +- install and manage cluster API providers on the platform cluster +- define reusable cluster classes or templates once the prototype proves the + VM shape +- create and scale `nonprod` and `prod` +- keep cluster lifecycle separate from application promotion + +Cluster lifecycle belongs in CAPI. Desired-state reconciliation belongs in Argo +CD. Application promotion belongs in Kargo. + +## Argo CD + +One Argo CD instance runs on the platform cluster. 
+ +It syncs: + +- platform control-plane state to the platform cluster +- per-cluster bootstrap/core selections +- cluster-local platform API bundles and `Platform` resources +- workload-cluster shared services and policy +- team/application environment resources + +The default shape is: + +- dedicated AppProjects +- ApplicationSets for generated fleets of applications +- one admin-owned bootstrap Application per cluster +- Application resources kept in the `argocd` namespace + +Avoid production parameter overrides and mutable promotion state inside Argo CD. +Promotion should be represented by Git changes. + +One implementation gap remains explicit: the exact flow from CAPI outputs into +Argo CD cluster destinations still needs prototype validation. The platform +cluster will create downstream Talos clusters with CAPI/CAPN, but the mechanism +that turns the resulting kubeconfig, destination identity, and AppProject scope +into Argo CD cluster registration is not yet settled. + +## GitOps Repository Layout + +The `gitops` repository shape is: + +```text +gitops/ +├── platform/ +│ ├── argocd/ +│ │ ├── bootstrap.yaml +│ │ ├── projects/ +│ │ │ ├── platform.yaml +│ │ │ ├── teama.yaml +│ │ │ └── teamb.yaml +│ │ └── applicationsets/ +│ │ ├── platform.yaml +│ │ ├── clusters-platform.yaml +│ │ ├── clusters-nonprod.yaml +│ │ ├── clusters-prod.yaml +│ │ ├── teams-nonprod.yaml +│ │ └── teams-prod.yaml +│ ├── capi/ +│ │ ├── providers/ +│ │ ├── clusterclasses/ +│ │ └── clusters/ +│ │ ├── nonprod/ +│ │ └── prod/ +│ ├── kargo/ +│ │ └── projects/ +│ │ ├── teama-appa1/ +│ │ ├── teama-appa2/ +│ │ ├── teama-appa3/ +│ │ ├── teamb-appb1/ +│ │ └── teamb-appb2/ +├── clusters/ +│ ├── platform/ +│ │ ├── bootstrap.yaml +│ │ ├── platform/ +│ │ │ ├── rgds-platform.yaml +│ │ │ ├── rgds-apps.yaml +│ │ │ └── platform.yaml +│ │ ├── policies/ +│ │ └── shared/ +│ ├── nonprod/ +│ │ ├── bootstrap.yaml +│ │ ├── platform/ +│ │ │ ├── rgds-platform.yaml +│ │ │ ├── rgds-apps.yaml +│ │ │ └── platform.yaml +│ │ ├── capsule/ +│ │ │ ├── teama.yaml +│ │ │ └── teamb.yaml +│ │ ├── policies/ +│ │ └── shared/ +│ └── prod/ +│ ├── bootstrap.yaml +│ ├── platform/ +│ │ ├── rgds-platform.yaml +│ │ ├── rgds-apps.yaml +│ │ └── platform.yaml +│ ├── capsule/ +│ │ ├── teama.yaml +│ │ └── teamb.yaml +│ ├── policies/ +│ └── shared/ +└── teams/ + ├── teama/ + │ ├── appa1/ + │ │ ├── envs/dev/app.yaml + │ │ ├── envs/staging/app.yaml + │ │ ├── envs/prod/app.yaml + │ │ └── ephemeral/pr-123/app.yaml + │ ├── appa2/ + │ └── appa3/ + └── teamb/ + ├── appb1/ + └── appb2/ +``` + +Ownership follows the path: + +- `platform/`: platform-cluster control-plane state. +- `clusters/*/bootstrap.yaml`: per-cluster version selection for released + bootstrap/core OCI Helm charts. +- `clusters/*/platform/`: released RGD bundle installation and cluster-local + `Platform` instances. +- `clusters/*/capsule`, `clusters/*/policies`, and `clusters/*/shared`: + workload-cluster shared state. +- `teams/`: team-owned application instances. + +Argo CD syncs `platform/argocd`, `platform/capi`, and `platform/kargo` to the +platform cluster; cluster bootstrap and platform folders to their destination +clusters; nonprod team environments and ephemeral environments to `nonprod`; +and prod team environments to `prod`. + +## Bootstrap And Core Components + +Reusable bootstrap/core components are owned in two layers: + +- `platform` owns canonical inputs, wrapper charts, rendered day-0 artifacts, + OCI chart publication, and release history. 
+- `gitops` owns per-cluster version selection and cluster-local desired state. + +Day-0 Talos/CAPI references are narrow and reinstall-focused. They exist to +make a fresh cluster boot. They do not become the long-term day-2 control plane +for Cilium, Argo CD, or `kro`. + +The reusable cluster-core layers are: + +1. bootstrap-safe Cilium on every cluster +2. minimal Argo CD only on the platform cluster +3. GitOps-managed full Cilium, platform Argo CD self-management, and `kro` +4. released RGD bundles and cluster-local `Platform` resources + +## kro APIs + +`kro` is the abstraction layer for reusable platform and application APIs. + +The split is: + +- shared RGD source and release lifecycle live in `platform` +- cluster-local RGD bundle installation lives under each cluster's platform + state in `gitops` +- application instances live under team-owned environment resources +- Argo CD syncs YAML +- `kro` expands the custom resources into owned Kubernetes objects + +Do not model Cilium, Argo CD, or `kro` itself as `kro` APIs. They are +cluster-core primitives, not consumer-facing platform APIs. + +## RGD Release And Authoring Contract + +The `platform` repo owns shared RGD source and release history. The `gitops` +repo owns which released bundles are installed in each cluster and the +cluster-local resources that consume them. + +Release trains: + +- `platform-rgds` and `apps-rgds` are independently versioned in `platform`. +- `release-please` manages release PRs, version bumps, tags, and changelog + updates. +- Publish workflows render final YAML artifacts and push them to OCI registries + with ORAS. +- Clusters adopt bundle versions by updating the corresponding Argo CD + Applications in `gitops`. + +Authoring model: + +- CUE is the build-time authoring and validation language. +- `platform-rgds` starts with one public `Platform` RGD. +- CUE subpackages may model internal capability blocks such as core defaults, + secrets, networking, and bare-metal integration. +- Those subpackages are authoring boundaries, not separate operator-facing + APIs. +- CI can import CRDs or equivalent schemas into CUE for structural validation; + cluster-side `kro` validation remains the final semantic check. + +Cluster-local consumption stays intentionally small: + +```text +clusters//platform/ +├── rgds-platform.yaml +├── rgds-apps.yaml +└── platform.yaml +``` + +- `rgds-platform.yaml`: install the selected released `platform-rgds` OCI + artifact. +- `rgds-apps.yaml`: install the selected released `apps-rgds` OCI artifact. +- `platform.yaml`: instantiate the cluster-local `Platform` custom resource. + +## Application And Team Model + +Workload namespaces use: + +```text +team-app-env +``` + +Examples: + +- `teama-appa1-dev` +- `teama-appa1-staging` +- `teama-appa1-prod` +- `teama-appa1-pr-123` + +Each workload cluster can use Capsule to enforce team boundaries. The intended +shape is one Capsule tenant per team per workload cluster, while keeping +applications isolated in separate namespaces. + +Application environments are concrete instances of shared APIs, not Helm values +or Kustomize overlays. Environment-specific resources should be small and +explicit. + +The first developer-facing application API is `App`. 
Its first pass covers +containers, secrets, configs, and volumes: + +- developers author an `App` instance alongside application source +- Kargo materializes the final environment-specific `App` instance into the + `gitops` repo on `main` +- Argo CD reconciles that materialized instance +- secret references align with External Secrets and environment-owned + SecretStores or ClusterSecretStores +- plaintext secret values are out of scope +- non-secret config is distinct from secret material +- volumes are declared at the point of use by default +- the API should stay application-centric rather than exposing raw Pod templates + +The first schema is still prototype-dependent. The intended shape is concrete +enough to guide implementation: + +```yaml +apiVersion: apps.platform.gilman.io/v1alpha1 +kind: App +metadata: + name: orders-api +spec: + team: teama + app: orders-api + + containers: + - name: api + image: + repository: ghcr.io/gilmanlab/teama/orders-api + digest: sha256:2222222222222222222222222222222222222222222222222222222222222222 + ports: + - name: http + port: 8080 + env: + - name: LOG_LEVEL + value: info + - name: DB_USERNAME + secret: + remoteKey: kv/orders-api + property: username + - name: DB_PASSWORD + secret: + remoteKey: kv/orders-api + property: password + mounts: + - path: /var/lib/orders + volume: + persistent: + size: 10Gi + - path: /etc/orders + volume: + config: + files: + application.yaml: | + http: + port: 8080 + logging: + format: json + - path: /var/run/secrets/orders + volume: + secret: + files: + db-username: + remoteKey: kv/orders-api + property: username + db-password: + remoteKey: kv/orders-api + property: password + + - name: worker + image: + repository: ghcr.io/gilmanlab/teama/orders-worker + digest: sha256:4444444444444444444444444444444444444444444444444444444444444444 + env: + - name: LOG_LEVEL + value: info + mounts: + - path: /var/lib/orders + volume: + persistent: + size: 10Gi +``` + +This example captures the contract, not the final CRD schema: + +- containers are the main unit of declaration +- config values are attached where they are consumed +- secret references are attached where they are consumed +- volume definitions are attached where they are mounted +- External Secrets-backed needs are described without hand-authored Kubernetes + `Secret` objects +- secret store selection defaults from the target environment and is not a + normal developer-facing input + +Generated workload resources stamp stable ownership labels for policy, +selection, and observability: + +```text +glab.gilman.io/team +glab.gilman.io/app +glab.gilman.io/env +``` + +Composition boundaries stay narrow: + +- embed related configuration when lifecycle and ownership naturally belong + with the application instance +- use explicit peer contracts for independently managed or shared capabilities, + such as a future `Database` resource +- keep environment substitution close to Kargo and the materialized `gitops` + output +- keep policy and tenancy guardrails in governance layers such as Kyverno and + Capsule rather than forcing them into `kro` + +## Promotion + +Kargo runs on the platform cluster. + +There should be one promotion project per application pipeline. Durable stages +are `dev`, `staging`, and `prod`; ephemeral environments are created and +destroyed by automation that adds or removes the matching YAML. 
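+
+The stage topology itself is small; a minimal sketch for `AppA1`, assuming
+Kargo's `v1alpha1` API and an illustrative Warehouse name, with the promotion
+steps that materialize the final `App` into `gitops` omitted because that
+mechanism still needs prototype validation:
+
+```yaml
+apiVersion: kargo.akuity.io/v1alpha1
+kind: Stage
+metadata:
+  name: dev
+  namespace: teama-appa1      # Kargo project namespace
+spec:
+  requestedFreight:
+    - origin:
+        kind: Warehouse
+        name: appa1           # illustrative Warehouse watching image and Git sources
+      sources:
+        direct: true          # dev takes new Freight straight from the Warehouse
+---
+apiVersion: kargo.akuity.io/v1alpha1
+kind: Stage
+metadata:
+  name: prod
+  namespace: teama-appa1
+spec:
+  requestedFreight:
+    - origin:
+        kind: Warehouse
+        name: appa1
+      sources:
+        stages:
+          - staging           # prod only accepts Freight already verified in staging
+```
+
+Auto-promotion for `dev` and `staging` and the manual gate for `prod` would
+then be expressed as project-level promotion policy rather than in the stages
+themselves.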
+ +The promotion policy is: + +- `dev`: automatic promotion is acceptable +- `staging`: automatic promotion is acceptable +- `prod`: promotion requires an explicit approval step + +Promotion means editing Git, usually a narrow field such as an image digest. + +The working application lifecycle is: + +1. A developer authors an `App` instance alongside application source. +2. CI produces an image and the corresponding Git commit. +3. Kargo bundles the image and Git commit into Freight. +4. Kargo promotes that Freight into an environment. +5. During promotion, Kargo combines the source `App` instance with + environment-specific inputs. +6. Kargo writes the final `App` instance into the destination environment path + in the `gitops` repo on `main`. +7. Argo CD reconciles that final `App` instance. + +For `AppA1`, the long-lived promotion targets are: + +- `teams/teama/appa1/envs/dev/app.yaml` +- `teams/teama/appa1/envs/staging/app.yaml` +- `teams/teama/appa1/envs/prod/app.yaml` + +Promotion-time composition is intentionally bounded. It may set or combine: + +- image repository and digest from Freight +- source Git commit from Freight +- environment-specific non-secret config +- routing, namespace, and cluster placement +- resource sizing or replica count when those values differ by environment +- target-environment secret store defaults + +It must not: + +- mutate clusters directly +- generate plaintext Kubernetes `Secret` manifests +- change platform-owned defaults that belong in the RGD implementation +- treat mutable Argo CD parameters as the promotion mechanism + +The exact composition mechanism needs prototype validation, but the output is +settled: concrete environment-specific YAML in `gitops` on `main`. + +## References + +- [Argo CD documentation](https://argo-cd.readthedocs.io/en/stable/) +- [Cluster API](https://cluster-api.sigs.k8s.io/) +- [Cluster API Provider Incus](https://capn.linuxcontainers.org/) +- [kro](https://kro.run/) +- [Kargo](https://docs.kargo.io/) diff --git a/docs/docs/architecture/hosts-and-substrate.md b/docs/docs/architecture/hosts-and-substrate.md new file mode 100644 index 0000000..ec2ed15 --- /dev/null +++ b/docs/docs/architecture/hosts-and-substrate.md @@ -0,0 +1,119 @@ +--- +title: Hosts and Substrate +description: Physical host roles, IncusOS, Incus clustering, and VM substrate boundaries. +--- + +# Hosts and Substrate + +The bare-metal substrate is IncusOS plus Incus. + +IncusOS is the host operating system for every compute node. Incus is the VM +control surface. Talos is the guest operating system for Kubernetes nodes. This +keeps both layers immutable and API-driven without carrying a general-purpose +Linux host management layer. + +## Host Roles + +### VyOS Router + +The `VP6630` runs VyOS and remains the lab network appliance. It owns routing, +DHCP, DNS entrypoints, PXE coordination, bootstrap support, the real platform +Kubernetes API TCP frontend, and BGP peering with Cilium. + +VyOS is intentionally part of the bootstrap path. The lab already depends on it +for network reachability, so using it for API fronting and temporary bootstrap +coordination keeps the early system small. + +### UM760 + +The `UM760` runs IncusOS as the first permanent host and the first Incus +cluster member. + +During bootstrap it hosts the first platform Talos VM. After the `MS-02` hosts +join, it remains useful as bootstrap, recovery, and light-duty capacity. + +### MS-02 Ultra Hosts + +Each `MS-02 Ultra` runs IncusOS directly on bare metal. 
+ +The three `MS-02` systems form the main compute tier. After they join the Incus +cluster, the platform Talos cluster expands by placing additional Talos VMs on +this tier. + +## Incus Cluster + +The final Incus cluster spans: + +- `um760` +- `ms02-1` +- `ms02-2` +- `ms02-3` + +The cluster is intentionally heterogeneous. Incus cluster groups should be used +only as lightweight placement and CPU-boundary labels, for example `amd-um760` +and `intel-ms02`. Kubernetes remains the main workload scheduler. + +The `UM760` is not a disposable bootstrap host. It is the first durable Incus +cluster member. + +## VM Substrate + +Talos nodes run as Incus VMs. Those VMs are infrastructure cattle: + +- created by CAPI/CAPN where possible +- configured through Talos machine configuration +- reconciled by GitOps after bootstrap +- recreated rather than manually repaired when practical + +Non-Talos Incus VMs are allowed later, but they are not a v1 design driver. Any +non-Talos VM that holds unique state must bring its own backup story. + +## Storage + +Use local ZFS-backed Incus storage on each IncusOS host in v1. + +An Incus cluster is a management cluster, not automatically a replicated +storage system. For most Incus storage drivers, volumes remain on the member +where they are created. That is acceptable for Talos VM disks because the +primary recovery model is rebuild-first. + +Do not introduce shared VM storage in v1: + +- no Ceph +- no LINSTOR +- no Incus OVN/storage architecture just for VM mobility +- no NAS-backed default VM disks + +The NAS remains a durable backup and artifact boundary, not the default block +storage path for every VM. + +## Network Attachment + +Physical switch ports to IncusOS hosts can be trunks. + +By default, terminate VLAN handling at the switch/IncusOS/Incus layer and +present Talos VMs with untagged, access-style vNICs on the selected lab VLAN. +Talos guest VLAN tagging is reserved for concrete multi-VLAN guest +requirements. + +This keeps DHCP reservations, CAPI templates, and recovery simpler while still +allowing the underlay to remain visible to VyOS. + +## Prototype Validation Needed + +Before treating the host substrate as implementation reference material, prove: + +- the selected IncusOS image mode is a correct final-disk artifact for the + single-disk `UM760` +- first-node and joining-node IncusOS seeds apply the right default Incus + settings +- joining nodes do not create local networks or storage pools that block cluster + join +- CAPN can place Talos VMs against the intended Incus profiles and storage pools + +## References + +- [IncusOS image download](https://linuxcontainers.org/incus-os/docs/main/getting-started/download/) +- [IncusOS installation seed](https://linuxcontainers.org/incus-os/docs/main/reference/seed/) +- [Incus clustering](https://linuxcontainers.org/incus/docs/main/explanation/clustering/) +- [Incus cluster storage](https://linuxcontainers.org/incus/docs/main/howto/cluster_config_storage/) diff --git a/docs/docs/architecture/keycloak-runtime.md b/docs/docs/architecture/keycloak-runtime.md new file mode 100644 index 0000000..28d2fa1 --- /dev/null +++ b/docs/docs/architecture/keycloak-runtime.md @@ -0,0 +1,175 @@ +--- +title: Keycloak Runtime +description: Keycloak deployment, configuration, backups, recovery, and break-glass paths. +--- + +# Keycloak Runtime + +Keycloak is the central human-facing identity system for lab services. It is +outside the lab's physical failure domain, but it is not a bootstrap dependency +for raw recovery. 
+ +## Deployment Shape + +Keycloak runs on one dedicated EC2 instance in the `lab` account. + +The runtime contract is: + +- instance: `t4g.small`, Amazon Linux 2023 on ARM, in the `172.16.0.0/16` VPC +- runtime: Docker Compose +- services: upstream Keycloak plus upstream Postgres, both pinned in `infra` +- database: colocated Postgres with data on the instance EBS root volume +- access name: `id.glab.lol` +- TLS: ACME DNS-01 through Route 53 using the instance IAM role +- reverse proxy: Caddy or equivalent on the host, terminating TLS and proxying + to Keycloak on loopback + +State recovery uses application backups. Do not depend on EBS snapshots or AMI +backups as the primary recovery path. + +The `t4g.small` shape is tight but workable for a single-user lab. The runtime +must be tuned for the 2 GB memory budget: + +- set an explicit Keycloak JVM max heap around 768 MB +- keep Postgres `shared_buffers` conservative, around 128 MB +- provision an EBS-backed swap file as a safety margin +- enable burstable CPU unlimited mode so occasional login bursts do not throttle + the instance + +## IAM Contract + +The instance IAM role grants only what the runtime needs: + +- Route 53 writes scoped to `_acme-challenge.id.glab.lol` +- S3 write access to the Keycloak backup bucket prefix +- SSM Parameter Store reads for bootstrap-time values such as reconciliation + credentials + +Cluster access, secret decryption, and tailnet access follow the AWS bootstrap +and secrets contracts. Keycloak does not receive broad AWS administrative +permissions. + +## Realm And Federation + +One realm named `lab` holds lab users and OIDC/SAML clients. + +GitHub is the only upstream identity provider and is federated through OIDC. +There is no standing local username-password fallback for normal users. The +bootstrap admin user exists only during initial realm creation and is disabled +after `keycloak-config-cli` reconciles the realm from git. + +The first expected clients are: + +- Kubernetes API OIDC for Talos clusters +- Argo CD web UI and CLI +- Grafana + +The authoritative client list lives in the realm repository. + +## Configuration As Code + +The realm repository is the source of truth for Keycloak's declarative surface: + +- realms +- clients +- client scopes +- roles and role mappings +- identity-provider configuration +- authentication flows and required actions +- realm-level settings + +Runtime state is intentionally out of git: + +- user credentials +- WebAuthn registrations +- TOTP secrets +- sessions and refresh tokens +- audit and event logs +- ephemeral tokens and one-time codes + +`keycloak-config-cli` reconciles from the realm repository on a short schedule +from the Keycloak host. It authenticates with a service account whose secret is +stored in SSM Parameter Store. Keycloak version upgrades are driven by bumping +the pinned runtime version in `infra` and reconciling forward. + +## Backups + +Backups contain: + +- a Postgres dump +- host-local Keycloak configuration files such as `keycloak.conf` +- environment overrides +- custom themes or providers +- the current TLS cert bundle and private key + +Git-tracked realm configuration is not part of the backup payload because git +is already the durable store for it. + +Backups are written nightly to an S3 bucket in the `lab` account. The bucket +uses SSE-KMS and object lock or versioning so corruptions cannot silently +overwrite known-good backups. The Synology NAS pulls a secondary copy on its +own schedule. 
+ +Retention contract: + +- daily backups for 30 days +- weekly backups for 12 weeks +- monthly backups for 12 months + +Backup payloads are encrypted before upload with a recipient key managed with +the lab's bootstrap secrets, and the S3 bucket also uses SSE-KMS. + +## Recovery + +The primary path is rebuild-first: + +1. Provision a fresh EC2 instance from `infra`. +2. Start Keycloak and fresh Postgres through Docker Compose. +3. Run `keycloak-config-cli` against the new instance using the realm repo. +4. Sign in through GitHub. +5. Re-enroll WebAuthn or TOTP. + +Target RTO for the single-user lab is 15 minutes. This path requires AWS access +and git; it does not require a backup store. + +The restore fallback is: + +1. Provision a fresh EC2 instance. +2. Pull a selected point-in-time backup from S3 or the NAS. +3. Restore the Postgres dump. +4. Place the TLS cert bundle and config files. +5. Start Docker Compose. +6. Let `keycloak-config-cli` reconcile the restored runtime forward to git + `HEAD`. + +Restores are for cases where preserving exact runtime state matters, such as +federated identity linkages, event history, and active user state. + +## Hostname Constraint + +Recovered Keycloak instances must serve at `id.glab.lol`. + +The issuer claim in signed JWTs and client OIDC discovery is tied to that URL. +Changing it invalidates existing tokens and client configuration. + +Normally the Route 53 A record points `id.glab.lol` at the replacement +instance. During a full internet-loss recovery where Route 53 is unreachable, +the local CoreDNS zonefile can be edited to point `id.glab.lol` at the restored +instance's reachable address. + +## Break-Glass Matrix + +When Keycloak is down, the lab keeps operating through service-local +authentication paths. + +| Service | Break-glass path | Notes | +| --- | --- | --- | +| Talos API | mTLS via `talosconfig` and machine secrets | Talos PKI is independent of Keycloak. | +| Kubernetes | Talos-generated admin kubeconfig via `talosctl kubeconfig` | Produced on demand from each cluster's signing CA. | +| Argo CD | Built-in `admin` account and initial admin secret | Retained and rotated, not disabled. | +| Vault | Unseal keys and root/recovery keys | Stored outside any Keycloak-dependent path. | +| AWS | IAM Identity Center local user with hardware key | AWS does not federate to Keycloak. | +| Grafana | Local admin account | Kept active alongside OIDC. | +| GitHub | Personal GitHub account with hardware-key MFA | GitHub is upstream of Keycloak. | + +These anchors live outside Keycloak-dependent storage. diff --git a/docs/docs/architecture/networking-and-endpoints.md b/docs/docs/architecture/networking-and-endpoints.md new file mode 100644 index 0000000..b78cc34 --- /dev/null +++ b/docs/docs/architecture/networking-and-endpoints.md @@ -0,0 +1,188 @@ +--- +title: Networking and Endpoints +description: Lab network ownership, Kubernetes API endpoints, Cilium service VIPs, DNS, and VLAN boundaries. +--- + +# Networking and Endpoints + +The network design keeps the underlay visible to VyOS and avoids adding an +Incus SDN layer in v1. + +VyOS remains the routing and naming boundary. Kubernetes service exposure is +handled by Cilium. The platform Kubernetes API endpoint is handled by VyOS +HAProxy, not by the same Cilium service VIP path that depends on the cluster +already being healthy. 
+ +## Ownership Split + +| Layer | Owner | Purpose | +| --- | --- | --- | +| Physical network | Switches and VyOS | VLANs, trunks, routing, DHCP/PXE reachability | +| Host substrate | IncusOS and Incus | VM attachment to lab VLANs | +| Kubernetes nodes | Talos | Kubernetes control plane and worker runtime | +| Service VIPs | Cilium plus VyOS | LoadBalancer IP allocation and BGP advertisement | +| Platform API frontend | VyOS HAProxy | TCP passthrough to Talos control-plane VMs | + +## Kubernetes API Endpoint + +The real platform cluster uses a VyOS-owned TCP frontend for the Kubernetes API: + +- configure CAPN with an external load balancer model +- create a stable DNS name such as `kube.platform.` +- point that name at a VyOS-owned listener +- run VyOS HAProxy in TCP mode on `:6443` +- configure backends as the reserved IPs of the platform Talos control-plane + VMs on `:6443` +- do not terminate Kubernetes API TLS on VyOS +- start with basic TCP health checks + +This makes the platform API reachable without depending on Cilium, Talos VIP +leadership, or a CAPN-managed development HAProxy container. + +CAPN's built-in HAProxy load balancer remains acceptable for local prototypes. +Its own Talos template documentation describes that container as a development +or evaluation path and warns that the template is not currently tested in CI. + +## Talos API Endpoint + +`talosctl` should target individual Talos control-plane node IPs on the Talos +API port, `50000`. + +Do not use the Kubernetes API DNS name, the VyOS HAProxy frontend, or a virtual +IP as the Talos API recovery endpoint. Talos API access must still work when +the Kubernetes API or etcd is unhealthy. + +## KubePrism + +Keep KubePrism enabled on Talos clusters. + +KubePrism is the internal highly available Kubernetes API endpoint for +host-network consumers such as control-plane components and CNI components. It +does not replace the external platform API frontend. + +## Talos VIP + +Talos VIP is not the default platform API endpoint. + +Talos VIP can simplify access for some clusters, but it depends on etcd +election state and comes alive only after bootstrap. That makes it a weak +recovery endpoint for the platform cluster. Future workload clusters may use it +only after the tradeoff is explicit for that cluster. + +## Kubernetes Service Exposure + +Use Cilium LB IPAM and Cilium BGP Control Plane for Kubernetes +`LoadBalancer` services. + +Expected behavior: + +- Cilium LB IPAM allocates service IPs from configured pools. +- Cilium BGP Control Plane advertises selected service VIPs to VyOS. +- Those advertisements are service routes, not PodCIDRs. With the current + `ipam.mode=kubernetes` assumption, PodCIDR advertisement is not part of the + design. +- VyOS learns exact routes to those VIPs and forwards traffic toward eligible + cluster nodes. +- `externalTrafficPolicy`, node selectors, and advertisement policy can be + refined after real workloads exist. + +Cilium is not responsible for the same cluster's initial Kubernetes API +bootstrap endpoint because it depends on the Kubernetes API to reconcile. + +## VLAN Boundary + +The default VM attachment model is: + +```text +IncusOS bridge/VLAN = VM attachment and node underlay +VyOS DHCP/DNS/HAProxy = addressing and platform API frontend +Talos/Kubernetes = compute and control plane +Cilium+BGP to VyOS = service LoadBalancer VIP advertisement +``` + +Physical switch ports can be trunks. Talos VMs should normally receive +untagged, access-style vNICs on their intended VLAN. 
Guest-side VLAN tagging is +reserved for cases where the VM genuinely needs multiple VLANs. + +Avoid in v1: + +- Incus NAT networks for real Talos nodes +- `macvlan` as the default VM attachment mode +- Incus OVN +- service VIP VLAN trunking into Talos guests + +## DNS + +The lab domain is `glab.lol`. + +The authoritative private zone lives in Route 53 in the `lab` AWS account. A +sync path renders that zone to a local zonefile served inside the lab, so query +serving does not depend on reaching AWS at request time. + +## AWS Link And DNS Mirror + +The AWS `lab` VPC uses `172.16.0.0/16`, intentionally separate from the lab +`10.10.0.0/16` space and Tailscale's `100.64.0.0/10` range. + +The AWS network shape is intentionally small: + +- one public subnet in a single AZ +- an attached internet gateway +- no NAT gateway +- a `t4g.nano` Amazon Linux 2023 subnet-router instance +- an Elastic IP attached to the subnet router ENI for outbound traffic +- a separate Keycloak host, not colocated with the subnet router + +The lab and AWS connect through Tailscale subnet routers on both sides: + +- the AWS-side subnet router advertises `172.16.0.0/16` and accepts + `10.10.0.0/16` +- VyOS advertises `10.10.0.0/16` and accepts `172.16.0.0/16` +- both sides preserve source IPs with `--snat-subnet-routes=false` +- the VPC route table sends `10.10.0.0/16` to the subnet router ENI +- source/destination check is disabled on that ENI +- security groups allow `10.10.0.0/16` as a source +- VyOS clamps MSS to avoid black-holed large packets through the tailnet path + +The AWS-side subnet router authenticates to Tailscale through workload identity +federation using its IAM role. The VyOS node uses a traditional Tailscale auth +key because workload identity federation is cloud-client only. + +Tailscale ACL tags for the AWS subnet router should derive from IAM claims such +as the role ARN and AWS account ID, so tailnet policy follows the AWS identity +rather than a hand-maintained device label. + +DNS serving uses a local zonefile for cold-start resilience: + +1. A job on the AWS-side subnet router reads the Route 53 private zone using + its IAM role and renders a standard zonefile at least once per minute. +2. The subnet router serves that rendered file over the tailnet. +3. An in-lab fetcher periodically writes the file to the path CoreDNS serves. + +CoreDNS does not query Route 53 at request time. The mirror exists because +CoreDNS cache and `serve_stale` behavior help during steady-state upstream +outages, but they do not solve a cold start where CoreDNS has no in-memory +zone. The zonefile on disk makes restart and bootstrap behavior deterministic. + +The intended split is: + +| Zone | Writers | Purpose | +| --- | --- | --- | +| `glab.lol` | Route 53 private zone automation | parent zone and static anchors | +| `mgmt.glab.lol` | manual or GitOps-managed | stable management and platform names | +| `dhcp.glab.lol` | DHCP/DNS automation | lease-driven hostnames | +| `.k8s.glab.lol` | ExternalDNS | cluster-scoped workload names | +| `acme.glab.lol` | Route 53 DNS-01 automation | public certificate validation targets | + +Exact records, forwarding rules, local zonefile mechanics, and address +assignments are implementation details and should be filled in after prototype +validation. 
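+
+As one concrete sketch of the service exposure model above, a Cilium
+LoadBalancer IP pool could be declared roughly like this. The pool range is a
+placeholder, and depending on the Cilium release the field is `blocks` or the
+older `cidrs`; the matching BGP peering and advertisement resources are
+version-specific and left to implementation:
+
+```yaml
+apiVersion: cilium.io/v2alpha1
+kind: CiliumLoadBalancerIPPool
+metadata:
+  name: lab-default
+spec:
+  blocks:
+    # Placeholder range; the real VIP allocation is an implementation detail.
+    - cidr: 10.10.250.0/24
+```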
+ +## References + +- [VyOS HAProxy](https://docs.vyos.io/en/latest/configuration/loadbalancing/haproxy.html) +- [VyOS containers](https://docs.vyos.io/en/latest/configuration/container/index.html) +- [Cilium LB IPAM](https://docs.cilium.io/en/stable/network/lb-ipam/) +- [Cilium BGP Control Plane resources](https://docs.cilium.io/en/stable/network/bgp-control-plane/bgp-control-plane-configuration/) +- [Talos KubePrism](https://docs.siderolabs.com/kubernetes-guides/advanced-guides/kubeprism) +- [Talos virtual shared IP](https://docs.siderolabs.com/talos/v1.12/networking/advanced/vip) diff --git a/docs/docs/architecture/secrets-identity-pki.md b/docs/docs/architecture/secrets-identity-pki.md new file mode 100644 index 0000000..0282bcf --- /dev/null +++ b/docs/docs/architecture/secrets-identity-pki.md @@ -0,0 +1,358 @@ +--- +title: Secrets, Identity, DNS, and PKI +description: Bootstrap secrets, runtime secret ownership, AWS anchors, Keycloak, DNS, and internal PKI. +--- + +# Secrets, Identity, DNS, and PKI + +Bootstrap identity must work before any lab cluster is healthy. Runtime secrets +belong inside clusters once they exist. DNS and PKI must also avoid circular +dependencies on the platform cluster they help recover. + +## AWS Bootstrap Authority + +AWS is the external bootstrap authority. + +The AWS structure is a two-account Organization: + +- a management account for billing, IAM Identity Center, and organization-level + configuration +- a `lab` member account for workloads and bootstrap resources + +The `lab` account owns the VPC, Route 53 private zone, Tailscale subnet router, +Keycloak host, SOPS KMS key, SSM parameters, and GitHub token broker Lambda. + +AWS access uses IAM Identity Center with a local identity-store user and a +hardware security key. Do not make AWS access depend on Keycloak; Keycloak +depends on AWS. + +AWS resources are managed with OpenTofu from the `infra` repo under +`infra/aws/`. OpenTofu state lives in an S3 bucket in the `lab` account using +S3 native locking. + +The manual AWS bootstrap surface is intentionally small: + +1. Create the AWS Organization and the management/member accounts. +2. Enable IAM Identity Center and create the operator user. +3. Create the S3 state bucket. +4. Create the initial IAM role that OpenTofu assumes. + +Everything downstream of that, including the VPC, subnet router, private zone, +KMS keys, SSM parameters, instance profiles, and GitHub token broker, belongs in +OpenTofu. + +The SOPS KMS key is the customer-managed key in the current `lab` account: + +```text +alias/glab-sops +arn:aws:kms:us-west-2:186067932323:key/2aba1d94-6eaf-4d80-8d26-2077f32fd7c5 +``` + +## Secrets Bootstrap + +Bootstrap secrets are the minimum material needed to stand up or recover the +lab before Vault and cluster-local controllers are available. + +The bootstrap chain is intentionally rooted in an AWS IAM role: + +- encrypted files live in the private `secrets` repo +- those files are encrypted with a customer-managed AWS KMS key used as a SOPS + recipient +- the GitHub App private key is stored in SSM Parameter Store +- bootstrap callers invoke `github-token-broker` to receive a short-lived + installation token for `GilmanLab/secrets` +- the caller fetches encrypted files with that token and decrypts locally + through KMS + +GitHub access fetches encrypted files. AWS KMS access decrypts them. The GitHub +token broker is only a token broker; it is not a file broker and not a KMS +decryptor. 
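+
+A minimal sketch of the KMS-recipient side, assuming the `secrets` repo keeps
+a standard `.sops.yaml` with path-scoped creation rules (the real rule layout
+is an implementation detail, and the per-scope encryption context described
+below is attached when files are encrypted):
+
+```yaml
+# .sops.yaml in the secrets repo (sketch)
+creation_rules:
+  - path_regex: network/tailscale/.*
+    kms: arn:aws:kms:us-west-2:186067932323:key/2aba1d94-6eaf-4d80-8d26-2077f32fd7c5
+  - path_regex: compute/talos/platform/.*
+    kms: arn:aws:kms:us-west-2:186067932323:key/2aba1d94-6eaf-4d80-8d26-2077f32fd7c5
+```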
+ +The GitHub App token path is: + +- the GitHub App private signing key is stored in SSM Parameter Store as a + `SecureString` +- the `github-token-broker` Lambda execution role can read that SSM parameter +- bootstrap principals can invoke the broker, but do not read the App private key +- the broker returns a short-lived installation token for `GilmanLab/secrets` + with `contents:read` +- callers use that token with `git` or the GitHub Contents API, and do not store + it on disk + +SOPS files use AWS KMS encryption context so IAM can grant decrypt access by +path-oriented scope. SOPS is not a field-level authorization system. + +Example scope layout: + +```text +network/tailscale/* Scope=network-tailscale +network/vyos/* Scope=network-vyos +compute/talos/platform/* Scope=talos-platform +vault/platform/* Scope=vault-platform +vault/nonprod/* Scope=vault-nonprod +vault/prod/* Scope=vault-prod +``` + +Each encrypted file includes context similar to: + +```yaml +Repo: GilmanLab/secrets +Scope: network-tailscale +``` + +A caller scoped to `network/tailscale/*` receives short-lived AWS credentials +for a role that can call `kms:Decrypt` only when the request has both +`Repo=GilmanLab/secrets` and `Scope=network-tailscale`. + +PGP and age recipients are removed from current SOPS metadata after the KMS +cutover. That does not remove their ability to decrypt old git revisions, so +any secret that must become AWS-authoritative retroactively is rotated rather +than relying on history rewrite. + +Examples of bootstrap material include: + +- Talos and cluster bootstrap material +- DNS and PKI bootstrap material +- GitHub App bootstrap material +- initial Vault or Keycloak material where no runtime owner exists yet + +SOPS-encrypted bootstrap material lives in the private `secrets` repo. Public +repos may reference paths, identities, and workflows, but must not contain +plaintext secret payloads. + +## Runtime Secrets + +Each Kubernetes cluster runs its own HashiCorp Vault instance managed by +`bank-vaults`. Vault is the runtime source of truth for secrets in that +cluster, not a shared platform service. + +The intended cluster split is: + +| Cluster | Vault scope | +| --- | --- | +| `platform` | Platform-cluster services and platform control-plane needs | +| `nonprod` | Non-production workloads | +| `prod` | Production workloads | + +Vault instances do not read each other's storage, policies, tokens, or secret +paths. `prod` does not depend on `nonprod`; `nonprod` does not depend on +`prod`; and the platform cluster does not become a universal secret broker for +downstream clusters. + +Clusters that host multiple environments segregate secrets by path and policy. +For `nonprod`, the baseline shape is: + +```text +dev/* +staging/* +``` + +Path naming is not the security boundary by itself. Vault auth roles and +policies enforce which workloads, namespaces, and service accounts can read or +write each prefix. + +Runtime ownership starts after bootstrap: + +```text +SOPS bootstrap material -> initialize/configure Vault -> Vault owns runtime secret distribution +``` + +Do not create a long-term two-source-of-truth model where SOPS and Vault both +own mutable runtime secret state. If runtime material must be recovered from +bootstrap inputs, document it as a recovery flow. + +Vault unseal and recovery material is scoped per cluster even if the lab uses +one customer-managed AWS KMS key for cost control. The key may wrap distinct +per-cluster Vault material, but it is not a shared Vault unseal key. 
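+
+As an illustration only, the `nonprod` Vault instance might scope its unseal
+material like this. This assumes the `bank-vaults` operator's `Vault` custom
+resource and its AWS unseal fields; the exact field names should be confirmed
+against the installed operator version:
+
+```yaml
+apiVersion: vault.banzaicloud.com/v1alpha1
+kind: Vault
+metadata:
+  name: vault
+  namespace: vault
+spec:
+  size: 1
+  unsealConfig:
+    aws:
+      # Shared KMS key, but a per-cluster S3 prefix keeps unseal material
+      # scoped to this cluster.
+      kmsKeyId: alias/glab-vault-unseal
+      kmsRegion: us-west-2
+      s3Bucket: glab-vault-unseal
+      s3Prefix: nonprod/
+      s3Region: us-west-2
+```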
+ +Isolation is enforced with: + +- per-cluster S3 prefixes or buckets for `bank-vaults` storage +- per-cluster IAM roles +- KMS encryption context such as `Purpose=vault-unseal` and `Cluster=nonprod` + +Example: + +```text +KMS key: alias/glab-vault-unseal + +S3: + s3://glab-vault-unseal/platform/* + s3://glab-vault-unseal/nonprod/* + s3://glab-vault-unseal/prod/* + +KMS context: + Purpose = vault-unseal + Cluster = platform | nonprod | prod +``` + +One KMS key per cluster would provide cleaner blast-radius isolation, but the +fixed monthly KMS cost is not worth it at this lab scale unless prototype work +shows that the shared-key policy model is too awkward. + +## DNS + +The lab domain is `glab.lol`. + +The authoritative private zone lives in Route 53 in the `lab` account. A sync +job on the AWS-side subnet router renders the zone to a local zonefile, and the +in-lab DNS path serves that local copy. The read path should not query Route 53 +at request time. + +Cluster-local DNS automation writes only the delegated cluster zones it owns. +This keeps name resolution available during internet or platform-cluster +outages and prevents workload controllers from mutating management records. + +## TLS And PKI + +Public HTTP TLS uses Let's Encrypt with Route 53 DNS-01 validation. The +dedicated ACME validation zone is under `acme.glab.lol`; cluster issuers should +write only their scoped validation records. + +Cluster responsibilities are split: + +- ExternalDNS manages service DNS records. +- cert-manager manages ACME orders, challenges, and certificate renewal. + +ExternalDNS does not manage `_acme-challenge` TXT records. Those belong to +cert-manager. + +`acme.glab.lol` is a public Route 53 validation zone delegated from Cloudflare. +Clusters use that zone so workload controllers do not need broad Cloudflare DNS +access. + +Challenge records use CNAME delegation: + +```text +_acme-challenge..glab.lol + CNAME _acme-challenge...acme.glab.lol +``` + +cert-manager follows the CNAME and writes TXT records into Route 53 using +short-lived AWS credentials. Each cluster role is scoped to its own challenge +names. No wildcard certificate is assumed; individual services receive +individual certificates unless a future workload justifies the broader blast +radius. + +Cluster workloads that need AWS credentials, including ExternalDNS and +cert-manager, use the cluster's Kubernetes OIDC issuer to assume IAM roles in an +IRSA-style flow. They do not use Tailscale workload identity federation; that +mechanism is reserved for AWS-hosted clients such as the subnet router. + +Internal runtime PKI uses per-cluster Vault intermediates signed by an +AWS-KMS-backed root CA. The hierarchy leaves room for future SPIRE: + +```text +AWS KMS root CA pathlen:2 + -> cluster Vault intermediate pathlen:1 + -> SPIRE intermediate pathlen:0 + -> workload SVID leaves +``` + +Root signing is operationally offline. No always-on lab workload has standing +permission to use the root key. Root signing is used only to mint or rotate +cluster subordinate CAs. + +For clusters where Vault directly issues mTLS leaves, the same hierarchy still +works without SPIRE: + +```text +AWS KMS root CA pathlen:2 + -> cluster Vault intermediate pathlen:1 + -> workload mTLS leaves +``` + +Each cluster gets its own subordinate CA generated and held by that cluster's +Vault instance. Vault generates the intermediate private key and CSR; the AWS +KMS root signs the CSR; and the signed intermediate is imported back into +Vault. 
+ +The cluster subordinate CA identity includes the cluster name. Example common +names: + +```text +glab platform Vault CA +glab nonprod Vault CA +glab prod Vault CA +``` + +Vault PKI roles issue short-lived certificates for internal use cases such as +service-to-service mTLS, database client authentication, internal controllers, +and future SPIRE upstream authority material. + +The current implementation history matters for rebuilds: the existing +`infra/security/pki/root-ca` stack was applied against an earlier AWS account. +The lab root must be recreated in the current `lab` account with the +`pathlen:2` hierarchy, then the earlier account root key and state can be +cleaned up after consumers trust the new chain. + +Router-hosted `step-ca` is limited to existing consumers during migration. +Runtime issuance uses cert-manager plus Route 53 for public TLS and Vault for +internal PKI. + +The small implementation slices are: + +1. Rewrap existing SOPS files with `alias/glab-sops`, add encryption context, + and remove PGP/age recipients. +2. Rotate bootstrap secrets that previously depended on PGP/age-only history. +3. Create the GitHub App plus SSM bootstrap path for `secrets` repo access. +4. Recreate the internal root CA in the current `lab` account with the new path + length. +5. Add the shared Vault unseal KMS key and `bank-vaults` storage layout. +6. Stand up Vault in one cluster and prove the SOPS-to-Vault bootstrap handoff. +7. Add cert-manager DNS-01 with Route 53 ACME delegation for one cluster. +8. Migrate internal PKI consumers from `step-ca` to the per-cluster issuers. + +Open implementation threads: + +- exact KMS encryption-context key names and allowed scope values +- exact GitHub App name, installation ID storage, and SSM parameter paths +- whether Vault unseal material uses one shared S3 bucket with prefixes or + separate per-cluster buckets +- how trust bundles are distributed to workloads that need to trust internal + Vault or SPIRE issuers +- whether any future public TLS use case justifies wildcard certificates +- when remaining `step-ca` consumers should be allowed to expire naturally + versus being actively replaced + +## Identity + +Keycloak is the central human-facing identity system for lab services, but it +is not a bootstrap dependency for the raw recovery path. + +Keycloak runs on a dedicated EC2 instance in the `lab` account, colocated with +Postgres and managed with Docker Compose. It is reached at `id.glab.lol`. +GitHub is the upstream identity provider through OIDC. + +Keycloak configuration should be reconciled from git. Runtime state such as +sessions, user credentials, and TOTP enrollment is backed up separately. + +See [Keycloak Runtime](./keycloak-runtime.md) for the EC2 runtime shape, +backup contract, rebuild/restore paths, hostname constraint, and break-glass +matrix. + +Break-glass paths must still exist for: + +- AWS bootstrap access +- Talos API access +- Kubernetes admin kubeconfig generation +- Vault recovery +- router access + +The identity system itself should be rebuilt from declarative configuration and +backups rather than treated as an irreplaceable pet. 
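+
+To make the public TLS path described above concrete, a per-cluster issuer
+might look roughly like this. The sketch assumes cert-manager's
+`ClusterIssuer` API with the Route 53 DNS-01 solver, ambient IRSA-style
+credentials on the cert-manager service account, and placeholder names:
+
+```yaml
+apiVersion: cert-manager.io/v1
+kind: ClusterIssuer
+metadata:
+  name: letsencrypt-route53
+spec:
+  acme:
+    server: https://acme-v02.api.letsencrypt.org/directory
+    privateKeySecretRef:
+      name: letsencrypt-route53-account-key
+    solvers:
+      - selector:
+          dnsZones:
+            - glab.lol
+        dns01:
+          # Follow the _acme-challenge CNAME into the delegated acme.glab.lol
+          # zone before writing the TXT record.
+          cnameStrategy: Follow
+          route53:
+            region: us-west-2
+```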
+ +## References + +- [AWS IAM Identity Center](https://docs.aws.amazon.com/singlesignon/latest/userguide/what-is.html) +- [AWS KMS](https://docs.aws.amazon.com/kms/latest/developerguide/overview.html) +- [cert-manager Route 53 DNS-01](https://cert-manager.io/docs/configuration/acme/dns01/route53/) +- [cert-manager delegated DNS-01](https://cert-manager.io/docs/configuration/acme/dns01/#delegated-domains-for-dns01) +- [SOPS](https://github.com/getsops/sops) +- [Vault](https://developer.hashicorp.com/vault/docs) +- [Vault PKI secrets engine](https://developer.hashicorp.com/vault/docs/secrets/pki) +- [Vault PKI intermediate guidance](https://developer.hashicorp.com/vault/docs/secrets/pki/considerations) +- [Bank-Vaults unseal keys](https://bank-vaults.dev/docs/concepts/unseal-keys/) +- [SPIRE configuration](https://spiffe.io/docs/latest/deploying/configuring/) +- [Keycloak](https://www.keycloak.org/documentation) +- [Smallstep step-ca](https://smallstep.com/docs/step-ca/) diff --git a/docs/docs/architecture/state-and-recovery.md b/docs/docs/architecture/state-and-recovery.md new file mode 100644 index 0000000..415e44b --- /dev/null +++ b/docs/docs/architecture/state-and-recovery.md @@ -0,0 +1,114 @@ +--- +title: State and Recovery +description: Rebuild-first recovery, backup boundaries, and restore expectations. +--- + +# State and Recovery + +The lab recovery model is rebuild-first. + +Most infrastructure should be recreated from declarative inputs: IncusOS seeds, +Incus configuration, Talos machine configuration, CAPI objects, GitOps state, +and platform release artifacts. Backups protect the state that cannot be +recreated cleanly from those sources. + +## Durable State Boundaries + +The lab accumulates state in these tiers: + +- IncusOS host configuration, Incus state, and host storage encryption material. +- Talos and Kubernetes control-plane state. +- Kubernetes persistent volumes and application data. +- Runtime identity and secret manager data. +- Router boundary service data such as local DNS zonefile state and tailnet + identity. +- RouterOS configuration history, covered by + [Network Device Backups](../network-device-backups.md). + +The NAS is the main in-lab durable backup and artifact boundary for state that +must survive host rebuilds. + +## IncusOS Hosts + +Use IncusOS system backups to protect: + +- host configuration +- IncusOS state +- storage-pool encryption keys + +These backups are sensitive because they contain encryption key material. Store +them as secret recovery artifacts, not as ordinary logs. + +IncusOS system backups do not protect installed application data or Incus +instance data. Those must be covered separately if they matter. + +## Talos VMs And Kubernetes + +Talos VM disks are not the primary backup object. + +Protect cluster state through: + +- Talos etcd snapshots +- Kubernetes-native backups such as Velero +- CSI snapshots or file-system backup for persistent volumes where appropriate +- application-level backups for stateful apps + +CAPI and GitOps should recreate Talos VM infrastructure when possible. + +## Non-Talos Incus VMs + +Non-Talos Incus VMs are optional in v1. + +If a non-Talos VM holds unique state, use an explicit VM-level backup path such +as Incus snapshots, Incus exports, or copying to another backup Incus server. +Do not build a VM backup platform before a real non-Talos VM requirement exists. 
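+
+The Kubernetes-native backup path above could start as simply as a nightly
+Velero schedule. This is a sketch assuming Velero's `Schedule` API; the backup
+storage location and what is included or excluded are implementation details:
+
+```yaml
+apiVersion: velero.io/v1
+kind: Schedule
+metadata:
+  name: nightly-cluster
+  namespace: velero
+spec:
+  schedule: "0 3 * * *"
+  template:
+    includedNamespaces:
+      - "*"
+    # Keep 30 days of nightly backups; retention should match the restore
+    # drill expectations for each backup class.
+    ttl: 720h0m0s
+```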
+ +## Router Boundary Data + +VyOS-hosted services that are bootstrap dependencies must have direct backup +paths because the platform cluster may be unavailable during recovery. + +Examples include: + +- local DNS zonefile state +- Tailscale machine identity where applicable +- temporary bootstrap artifacts during an active bootstrap + +RouterOS device configuration history is operational evidence and reviewable +change history. It is intentionally handled by the network-device backup flow +rather than by block-level backup. + +## No Default VM Backup Platform + +There is no default VM backup appliance in v1. + +That is intentional. Talos VMs should be recreated through CAPI and Talos +configuration. Non-Talos VMs should justify their own backup requirements when +they appear. + +## Restore Drills + +A backup mechanism is not complete until its restore path has been exercised. + +The first production use of each backup class should include a restore drill +against a lab-safe target: + +- IncusOS system backup restored enough to prove host recovery assumptions. +- Talos etcd snapshot used to prove cluster recovery. +- Kubernetes backup restored into a throwaway cluster. +- Application backup restored into a disposable environment. +- Router boundary data restored into a scratch service or equivalent safe + target. + +Exact drill commands belong in justfiles and runbooks once the implementation +exists. + +Keycloak's runtime state, backup retention, and rebuild/restore paths are +defined separately in [Keycloak Runtime](./keycloak-runtime.md). + +## References + +- [IncusOS backup and restore](https://linuxcontainers.org/incus-os/docs/main/reference/system/backup/) +- [Incus instance backups](https://linuxcontainers.org/incus/docs/main/howto/instances_backup/) +- [Talos disaster recovery](https://docs.siderolabs.com/talos/v1.12/build-and-extend-talos/cluster-operations-and-maintenance/disaster-recovery) +- [Velero](https://velero.io/docs/) diff --git a/docs/docs/designs/app-rgd.md b/docs/docs/designs/app-rgd.md deleted file mode 100644 index e83056c..0000000 --- a/docs/docs/designs/app-rgd.md +++ /dev/null @@ -1,249 +0,0 @@ ---- -title: App RGD Design -description: Proposed first-pass design for the application-facing kro API. ---- - -# App RGD Design - -## Status - -Proposed. - -This document captures the intended shape of the first application-facing `kro` -API, referred to here as `App`. - -This is not the final schema. The goal of this pass is to make the API shape -clear enough that the next pass can define the actual `ResourceGraphDefinition` -without reopening the high-level model. - -## Summary - -`App` should be the primary developer-facing API for deployable application -workloads. - -The first pass should focus on four concerns: - -- containers -- secrets -- configs -- volumes - -The current design direction is: - -- developers author an `App` instance alongside application source code -- the public API stays small and application-centric -- secrets, configs, and volumes are defined at the point where they are used or - mounted -- secret store selection defaults from the target environment and is not a - normal developer-facing input -- Kargo materializes the final environment-specific `App` instance into the - `gitops` repo on `main` -- Argo CD reconciles that final `App` instance - -## Illustrative Example - -The example below is illustrative only. It exists to make the intended shape -concrete before the actual schema is written. 
- -```yaml -apiVersion: apps.platform.gilman.io/v1alpha1 -kind: App -metadata: - name: orders-api -spec: - team: teama - app: orders-api - - containers: - - name: api - image: - repository: ghcr.io/gilmanlab/teama/orders-api - digest: sha256:2222222222222222222222222222222222222222222222222222222222222222 - ports: - - name: http - port: 8080 - env: - - name: MAX_PRICE - value: "20" - - name: LOG_LEVEL - value: info - - name: DB_USERNAME - secret: - remoteKey: kv/orders-api - property: username - - name: DB_PASSWORD - secret: - remoteKey: kv/orders-api - property: password - - name: THIRD_PARTY_API_KEY - secret: - remoteKey: kv/shared/third-party - property: apiKey - mounts: - - path: /var/lib/orders - volume: - persistent: - size: 10Gi - - path: /etc/orders - volume: - config: - files: - application.yaml: | - http: - port: 8080 - logging: - format: json - features.yaml: | - enableDiscountsV2: true - - path: /var/run/secrets/orders - volume: - secret: - files: - db-username: - remoteKey: kv/orders-api - property: username - db-password: - remoteKey: kv/orders-api - property: password - third-party-api-key: - remoteKey: kv/shared/third-party - property: apiKey - - - name: worker - image: - repository: ghcr.io/gilmanlab/teama/orders-worker - digest: sha256:4444444444444444444444444444444444444444444444444444444444444444 - env: - - name: LOG_LEVEL - value: info - - name: DB_USERNAME - secret: - remoteKey: kv/orders-api - property: username - - name: DB_PASSWORD - secret: - remoteKey: kv/orders-api - property: password - mounts: - - path: /var/lib/orders - volume: - persistent: - size: 10Gi - - - name: metrics-proxy - image: - repository: ghcr.io/gilmanlab/platform/metrics-proxy - digest: sha256:3333333333333333333333333333333333333333333333333333333333333333 - ports: - - name: metrics - port: 9090 -``` - -This example shows the intended first-pass shape: - -- containers are the main unit of declaration -- config values are attached where they are consumed -- secret references are attached where they are consumed -- volume definitions are attached where they are mounted -- the API describes External Secrets-backed needs without exposing handwritten - Kubernetes `Secret` objects - -It also intentionally leaves some things open: - -- the exact inline shape for config files versus simple env vars -- the exact secret reference shape -- how much container surface is exposed in v1 -- how environment-specific values such as `THIRD_PARTY_URL` are merged during - promotion/materialization - -## Design Rules - -### Containers - -- `App` should support more than one container. -- The single-main-container case should still be the easiest path. -- The public API should model deployable application containers, not raw Pod - templates. -- Security and runtime defaults should be platform-owned wherever practical. - -### Secrets - -- Secrets must align with `ExternalSecret` plus `SecretStore` or - `ClusterSecretStore`. -- Plaintext secret values are out of scope. -- Hand-authored Kubernetes `Secret` manifests are not the primary path. -- Secret store selection should default from the target environment rather than - being a normal developer-facing field. - -### Configs - -- Non-secret config should be distinct from secrets. -- Release-coupled config lives with the developer-authored `App` instance. -- Environment-specific config is added during promotion/materialization. -- The public API should not force developers to author raw `ConfigMap` - manifests. 
- -### Volumes - -- The public API should use the term `volume`, not `PersistentVolumeClaim`. -- Volume definitions should default to point-of-use declaration at the mount - site. -- True shared-lifecycle named volumes may exist later, but they should not - shape the v1 API around a less common reuse case. - -## Why This Shape - -This design optimizes for local reasoning. - -The expected common case is: - -- one container needs one config value -- one container needs one secret value -- one mount path needs one volume definition - -That means the public API should default to point-of-use declarations instead of -top-level registries and references. - -For the same reason, the API should not expose more of Kubernetes than it needs -to. The goal is a small, opinionated application API, not a renamed PodSpec. - -## Relationship to GitOps and Promotion - -The current working lifecycle is: - -1. A developer authors an `App` instance alongside application source code. -2. CI produces an image and the corresponding Git commit. -3. Kargo bundles the image and commit into Freight. -4. Kargo promotes that Freight into an environment. -5. During promotion, Kargo combines the source `App` instance with any - environment-specific inputs or overrides. -6. Kargo writes the final `App` instance into the destination environment path - in the `gitops` repo on `main`. -7. Argo CD reconciles that final `App` instance. - -This means the `App` API is developer-facing, but the final reconciled instance -is still environment-specific GitOps state. - -## Out of Scope for This Pass - -This document does not define: - -- the final `App` schema -- the exact `ExternalSecret` generation model -- the exact promotion-time composition mechanism -- policy and governance rules that are better enforced by `Kyverno`, - `Capsule`, or similar systems - -## Open Questions - -- What is the smallest useful first schema for the common backend API case? -- How much container surface should v1 expose for commands, args, probes, - resources, and ports? -- What is the minimum honest secret-reference shape for the External Secrets - model? -- Should config files and simple env values use one unified shape or distinct - ones? - -## Next Step - -The next pass should define the actual first schema draft for `App`. diff --git a/docs/docs/designs/aws-lab-account.md b/docs/docs/designs/aws-lab-account.md deleted file mode 100644 index 6feb1e2..0000000 --- a/docs/docs/designs/aws-lab-account.md +++ /dev/null @@ -1,459 +0,0 @@ ---- -title: AWS Lab Account -description: Proposed design for a dedicated AWS Organization and member account that anchors lab DNS, bootstrap identity, and offsite compute for identity systems. ---- - -# AWS Lab Account - -## Status - -Proposed. - -This document defines how the lab uses a dedicated AWS Organization and member -account as its durable, out-of-lab trust anchor. It covers account and identity -structure, the VPC and its Tailscale site-to-site link to the lab, the Route 53 -private zone and the in-lab mirror that consumes it, and the secrets bootstrap -path that lets AWS be the single identity required to retrieve and decrypt all -other bootstrap material. - -Detailed Keycloak design is out of scope for this document and is covered -separately. - -## Purpose - -The primary purpose of this design is to keep lab identity, lab DNS, and -lab secrets from having a circular dependency on the lab itself. 
- -The intended split is: - -- AWS owns the durable trust anchor — the account, the identity system, the - KMS key, the private zone of record -- the lab consumes what AWS provides without requiring continuous connectivity - to operate steady-state -- a small number of AWS-resident components (a Tailscale subnet router, a - Keycloak host) provide the minimum bridge between the two sides - -This keeps the lab's hardest-to-bootstrap layers (identity, DNS, and secret -decryption keys) outside the lab, while keeping day-to-day serving and latency -local. - -## Goals - -- Provide a durable, off-lab trust anchor that survives total lab failure. -- Break the chicken-and-egg between lab DNS, lab identity, and lab secrets. -- Keep AWS itself identity-independent from anything the lab hosts, so AWS - remains reachable even when Keycloak, the platform cluster, or the lab - network is unavailable. -- Eliminate long-lived static credentials on AWS-hosted bootstrap nodes. -- Allow the lab to continue serving DNS and operating existing workloads - during a total internet outage. -- Mirror the durable-trust-anchor shape of a future day-job architecture so the - lab exercises the same patterns at small scale. - -## Non-Goals - -- This document does not define the Keycloak deployment, its database, or its - disaster-recovery plan. Those belong to a separate Keycloak design doc. -- This document does not define monitoring, logging, or alerting for the AWS - account. -- This document does not define exact IAM policy JSON, SCPs, Tailscale ACL - rules, CoreDNS configuration, or OpenTofu module layout. The doc names - contracts, not implementations. -- This document does not define OIDC trust between cluster workloads and AWS. - That is a separate future design connected to ExternalDNS and similar cluster - components. -- This document does not federate IAM Identity Center to Keycloak. Doing so - would reintroduce the circular dependency this design is built to avoid. - -## Design Summary - -The lab uses a dedicated AWS Organization with two accounts: - -- a **management account** that holds the Organization, billing, and IAM - Identity Center, and runs no workloads -- a **member account** that holds all lab-owned AWS resources: the VPC, the - Tailscale subnet router, the Keycloak host, the Route 53 private zone, the - KMS key used for SOPS, and the SSM parameters used for bootstrap - -Identity into both accounts comes from **IAM Identity Center with its built-in -identity store**. There is no external identity provider. A single human user -signs in with a hardware security key and assumes time-limited permission sets -into the member account. Root credentials on both accounts are break-glass only -and stored offline. - -The member account peers with the lab via a pair of **Tailscale subnet -routers**, one in AWS and one on VyOS. Devices on either side can reach devices -on the other side by their real IPs without being Tailscale nodes themselves. -The AWS-side subnet router authenticates to the tailnet via **Tailscale -workload identity federation**, using its attached IAM role — no pre-shared -auth key. - -The lab's authoritative DNS lives in a **Route 53 private zone** for `glab.lol` -bound to the lab's VPC. A sync job on the subnet router renders that zone to a -local zonefile, serves it over the tailnet, and an in-lab fetcher pulls the -file to disk. CoreDNS in the lab serves from the on-disk zonefile. 
The read -path never reaches AWS at query time, so DNS serving survives internet outages -and cold starts. - -Bootstrap secrets are gated end-to-end by the AWS-resident IAM role: - -- encrypted secrets live in the existing `secrets/` repo on GitHub, encrypted - with a **KMS customer-managed key** used as a SOPS recipient -- that repo is cloned via a **GitHub App** whose private key is stored in SSM - Parameter Store (SecureString) -- both KMS decrypt and SSM read are granted to the bootstrap instance via its - IAM role - -The result is that an AWS-resident bootstrap node holds zero persistent -secrets on disk: every identity it uses — Tailscale, GitHub, the SOPS -decryption key, AWS itself — traces back to its IAM role. - -## Account Structure - -The lab uses a two-account AWS Organization: - -| Account | Purpose | -|-------------|-----------------------------------------------------------------------------| -| `lab-mgmt` | Organization management account. Holds billing, IAM Identity Center, org-level config. No workloads. | -| `lab` | Lab workload member account. Holds VPC, EC2, Route 53, KMS, SSM, and all other lab resources. | - -The split follows AWS's own recommendation that the management account should -not run workloads. It also leaves room to add additional member accounts later -(for example, a prod-mirror account that stages day-job patterns) without -restructuring. - -Region: **`us-west-2`**. - -All resources in this design live in `us-west-2` in the `lab` account unless -explicitly stated otherwise. - -## Identity - -### Primary path - -IAM Identity Center is enabled in the `lab-mgmt` account and uses its -**built-in identity store**. There is no external IdP wired in. The identity -store holds one human user with **WebAuthn MFA enforced** via a hardware -security key. - -Access to the `lab` account is granted via permission sets assigned from -Identity Center. Daily operator access — console and CLI — is short-lived: - -- console access through the Identity Center access portal -- CLI access via `aws sso login`, which produces short-lived role credentials - -No long-lived IAM user access keys exist in either account for human use. - -### Break-glass - -Root user credentials exist on both accounts and are used for emergency -recovery only (loss of Identity Center access, billing-only actions not -permitted to Identity Center). Both root accounts: - -- use strong unique passwords -- have hardware-key MFA enabled -- are stored offline (outside any system whose recovery depends on AWS or - Keycloak being reachable) - -### Why the identity store is local - -Federating IAM Identity Center to Keycloak would make AWS access depend on -Keycloak. Keycloak depends on AWS for its compute, its DNS, and its secrets -bootstrap. Coupling the two defeats the entire reason for placing identity on -a durable off-lab trust anchor. - -A future addition of Keycloak SAML federation as a **secondary, convenience** -path for Identity Center is possible and explicitly deferred. The local -identity store always remains the primary admin path. - -## Network - -### VPC - -- **CIDR:** `172.16.0.0/16` -- **Subnets:** one public subnet, single AZ -- **Internet gateway:** attached; the subnet router carries outbound traffic - via an Elastic IP attached to its ENI -- **NAT gateway:** none. 
With a single public-subnet instance there is no - workload needing egress through a private subnet; skipping NAT removes the - largest ongoing fixed cost that would otherwise apply (~$32/mo) - -`172.16.0.0/16` is deliberately far from both the lab's `10.10.0.0/16` and -Tailscale's `100.64.0.0/10` CGNAT range, so no address-space collisions can -occur when routes are advertised across the tailnet. - -### Site-to-site with the lab - -The lab and the VPC connect via **Tailscale subnet routers on both sides**: - -- **AWS side:** the subnet router EC2 instance advertises `172.16.0.0/16` and - accepts `10.10.0.0/16`. -- **Lab side:** VyOS runs Tailscale and advertises `10.10.0.0/16` while - accepting `172.16.0.0/16`. - -Both sides run with `--snat-subnet-routes=false` so traffic preserves real -source IPs. The VPC route table directs `10.10.0.0/16` to the subnet router's -ENI, and the ENI has source/destination check disabled so it can forward. -Security groups allow `10.10.0.0/16` as a source on the ENI. - -From either side, a host can address the other side by its real IP without -being a Tailscale node itself. Lab DNS clients reach `172.16.0.0/16` -transparently; VPC workloads (Keycloak) can reach lab workloads when needed. - -MSS clamping is configured on VyOS to avoid black-holed large packets through -the WireGuard-based tunnel's smaller MTU. Tailscale ACLs permit traffic -between the two advertised CIDRs. - -### Tailscale node identity - -The AWS-side subnet router authenticates to the tailnet via **Tailscale -workload identity federation**, using its attached IAM role. No pre-shared -auth key is stored on the instance. Tailscale ACL tags are derived from IAM -claims (role ARN, account ID), so policy can be written against the -IAM identity rather than per-device labels. - -The VyOS node uses a traditional Tailscale auth key, because workload identity -federation only supports cloud-hosted clients. That key is managed out of band -and lives on a single on-prem device; it is not checked into any repo. - -## DNS - -### Authoritative zone - -The canonical lab domain is **`glab.lol`**. A Route 53 **private hosted zone** -for `glab.lol` lives in the `lab` account, bound to the VPC. All lab DNS -records are managed there. - -Private was chosen over public intentionally: the day-job architecture this -lab mirrors requires record names themselves to be non-public. A public zone -would be operationally simpler but would not exercise the same pattern. - -### Lab read path - -CoreDNS in the lab serves `glab.lol` from a **local zonefile on disk**. The -file is kept up to date by a sync pipeline that runs entirely outside the lab: - -1. A job on the AWS-side subnet router reads the Route 53 zone using its IAM - role and renders it to a standard zonefile. Refresh cadence is ≤1 minute. -2. The subnet router serves the rendered file over the tailnet. -3. An in-lab fetcher periodically pulls the file and writes it to the - filesystem CoreDNS reads from. - -CoreDNS never queries Route 53 at request time. The fetch path is -asynchronous and decoupled from serving. - -### Failure characteristics - -- **Steady-state AWS or internet outage:** fetches fail; CoreDNS continues to - serve from the last-fetched zonefile. The zone data becomes progressively - stale in proportion to the outage length, but queries continue to resolve. -- **Cold start during an outage:** CoreDNS loads the last zonefile from local - disk and resumes serving. The sync job is not on the critical path. 
-- **Full lab internet loss:** the tailnet path to the subnet router is itself - unreachable, which stops syncs but not serving. The zonefile on disk is the - resilience layer. -- **Stale-vs-unavailable tradeoff:** this design accepts staleness as the - price of availability during outages. Zone changes during an outage simply - do not propagate until connectivity returns. - -### Why the mirror exists - -Steady-state DNS resilience can be provided by CoreDNS itself — the `route53` -plugin reads zones into memory, and the `cache` plugin with `serve_stale` -enabled keeps answering through upstream outages. The mirror layer's specific -job is **cold-start and bootstrap resilience**: if CoreDNS restarts (node -reboot, container replaced) while Route 53 is unreachable, it has no in-memory -zone to fall back on. A zonefile on disk removes that failure mode. - -## Secrets Bootstrap - -### Contract - -An AWS-resident bootstrap instance must be able to, starting from only its -IAM role: - -1. Reach the tailnet. -2. Fetch encrypted files from the private `secrets/` repo on GitHub. -3. Decrypt those SOPS-encrypted files locally. - -At no point may the instance hold a durable, plaintext credential for any of -the three systems (Tailscale, GitHub, SOPS). All identity traces back to the -instance's attached IAM role. - -### KMS as a SOPS recipient - -The existing `secrets/` repo continues to hold SOPS-encrypted files. A single -**customer-managed KMS key** in the `lab` account is added as an additional -SOPS recipient. Any principal granted `kms:Decrypt` on that key — human -(via Identity Center permission set) or machine (via instance profile) — can -decrypt. - -This is non-breaking for existing workflows: SOPS supports multiple -recipients, so the KMS key can be added alongside the existing age key. -Retiring the age key is possible later but not required. - -The details of the SOPS-over-KMS workflow, key rotation, and human vs. -automation paths live in a separate secrets design doc. This document only -establishes that the KMS key lives in the `lab` account and is the anchor -for machine decryption. - -### GitHub App token broker - -Fetching private repo contents from an AWS-resident bootstrap instance uses a -**GitHub App** owned by the `GilmanLab` organization and installed on the -`secrets` repo. - -- The App's **private signing key** is stored in an SSM Parameter Store - `SecureString` in the `lab` account. -- The `github-token-broker` Lambda execution role grants `ssm:GetParameter` on - the GitHub App parameter path. -- Bootstrap principals get `lambda:InvokeFunction` on the broker, not direct - access to the App private key. -- On bootstrap, the caller invokes the broker, receives a short-lived - installation token for `GilmanLab/secrets` with `contents:read`, and then - fetches encrypted files with `git` or the GitHub Contents API. - -The only durable non-AWS secret anywhere in the chain is the App's private -signing key itself, and that key is at rest in AWS behind the broker execution -role. Installation tokens are never stored on disk. 
- -### The single-anchor property - -Taken together, the chain on a single EC2 bootstrap instance is: - -| Step | Identity used | -|------|----------------------------------------------------------------| -| Join tailnet | IAM role (via workload identity federation) | -| Invoke GitHub token broker | IAM role (via instance profile → Lambda) | -| Read GitHub App key | Lambda execution role (via SSM, optionally KMS) | -| Mint installation token | App private key (short-lived, in broker memory) | -| Fetch encrypted `secrets/` files | Installation token (short-lived, in caller memory) | -| Decrypt SOPS files | IAM role (via instance profile → KMS) | - -Every persistent identity is the IAM role. Lose AWS, lose bootstrap. Gain AWS, -everything else unlocks in order. This is the design outcome the account -structure is in service of. - -## Compute and Cost Model - -### Instances - -| Name | Type | Purpose | -|----------------|------------|-------------------------------------------------------------| -| subnet router | `t4g.nano` | Tailscale site-to-site, Route 53 zonefile rendering | -| Keycloak host | `t4g.small`| Keycloak + colocated Postgres. Detailed design in Keycloak doc. | - -Both run Amazon Linux 2023 on ARM (`t4g` / Graviton). Tailscale and Keycloak -both ship native ARM builds. - -The two are kept **as separate instances** rather than colocated. A colocated -box would save ~$1.75/mo but would collapse the subnet router and identity -failure domains into one. The premium for separation is cheap insurance, and -separation is also more faithful to the day-job architecture this design -mirrors. - -EC2 instances do **not** run EBS snapshot or AMI backup jobs. The lab's -philosophy is rebuild-over-restore: every instance's durable state is either -in an external store (Route 53, KMS, SSM, S3, GitHub) or is designed to be -reconstructed from those sources. - -### Savings Plan commitment - -Both instances are long-lived infrastructure and not expected to change -instance family over their lifetime. The commitment shape is: - -- **3-year EC2 Instance Savings Plans, all-upfront**, covering the `t4g` - family in `us-west-2` -- expected effective discount ~72% vs. on-demand -- one purchase sized to cover both instances; additional commitment can be - layered later - -### Cost envelope - -Approximate, all-upfront amortized: - -| Item | Monthly | 3-year | -|-------------------------------|---------|--------| -| Subnet router (t4g.nano) | ~$1.75 | ~$63 | -| Keycloak host (t4g.small) | ~$3.89 | ~$140 | -| KMS customer-managed key | ~$1.00 | ~$36 | -| SSM Parameter Store (standard)| ~$0 | ~$0 | -| Route 53 private zone + queries | ~$0.50 | ~$18 | -| Data transfer | negligible | negligible | -| **Total (approximate)** | **~$7.15** | **~$260** | - -EBS, Elastic IPs attached to running instances, and Route 53 API calls for the -1-minute zonefile sync all fall into noise-level cost at lab scale. - -## Infrastructure as Code - -- All AWS resources are managed with **OpenTofu** from the `infra/` repo - under `infra/aws/`. -- OpenTofu state is stored in an **S3 bucket in the `lab` account**, using - S3's native locking. -- The OpenTofu entrypoint assumes a permission set role via Identity Center - for human-operator runs. Future CI-triggered runs will use a separate - identity (out of scope for this document). - -### Manual bootstrap surface - -A small amount of setup exists outside of OpenTofu, because it must exist -before OpenTofu can run: - -1. Creation of the AWS Organization and the two accounts. -2. 
Enablement of IAM Identity Center and the single operator user. -3. Creation of the S3 state bucket and the minimum IAM role OpenTofu will - assume. - -Everything downstream of that — the VPC, the subnet router, the private zone, -the KMS key, the SSM parameters, the instance profiles — is declared in -OpenTofu. - -## Failure Domains - -What fails together and what does not: - -| Failure | Lab DNS | Lab serving | AWS console access | Bootstrap of new lab instances | -|--------------------------------|---------|-------------|--------------------|------------------------------| -| Lab internet outage | ✓ (cached zonefile) | ✓ | ✗ (can't reach AWS) | ✗ | -| Subnet router EC2 down | ✓ (cached zonefile) | ✓ | ✓ | ✗ (tailnet → AWS bridge down) | -| `lab` account compromise | ✓ (cached zonefile, short-term) | ✓ | partial | ✗ | -| `lab-mgmt` account lost | ✓ (cached zonefile, short-term) | ✓ | ✗ | ✗ | -| Keycloak host down | ✓ | ✓ (except OIDC-gated services) | ✓ | ✓ | -| AWS region outage | ✓ (cached zonefile) | ✓ | ✗ | ✗ | -| Full lab power/hardware loss | ✗ | ✗ | ✓ | depends on external rebuild | - -The dominant pattern: **lab-side serving is robust to any offsite failure** -thanks to the zonefile-on-disk DNS path and locally-resident workloads. -Offsite failure primarily costs the ability to make changes, not the ability -to keep running. - -## Future Work - -The following are known next steps that are intentionally out of scope here: - -- **Keycloak design doc.** Deployment shape, Postgres colocation, backup to - object storage with Synology sync, rebuild-over-restore DR procedure, and - GitOps via `keycloak-config-cli`. -- **Cluster workload OIDC to AWS.** ExternalDNS-style workloads on the Talos - cluster will need AWS credentials; the cluster's own OIDC issuer (IRSA-style - federation) is the expected mechanism, not Tailscale-based federation. -- **GitHub Actions OIDC.** Trusting GitHub Actions as an OIDC identity - provider in the `lab` account so CI can apply OpenTofu without long-lived - keys. -- **Keycloak SAML federation to IAM Identity Center** as a secondary, - convenience access path alongside the local identity store. -- **Additional member accounts** under the same Organization as the lab's - day-job mirroring grows (prod-mirror, dev, etc.). -- **Secrets design doc** covering the SOPS-over-KMS workflow, rotation, - Vault relationship, and promotion model across bootstrap vs. per-cluster - secrets. - -## References - -- [Keycloak](./keycloak.md) -- [Bootstrap and Core Delivery Model](./bootstrap-core-delivery.md) -- [Multi-Cluster GitOps Model](./gitops-multi-cluster.md) -- [Service Exposure and Control Plane Endpoints](./service-exposure-and-control-plane-endpoints.md) -- [Tailscale Workload Identity Federation](https://tailscale.com/kb/1581/workload-identity-federation) -- [AWS IAM Identity Center external IdP options](https://docs.aws.amazon.com/singlesignon/latest/userguide/manage-your-identity-source-idp.html) diff --git a/docs/docs/designs/bootstrap-core-delivery.md b/docs/docs/designs/bootstrap-core-delivery.md deleted file mode 100644 index 733baa9..0000000 --- a/docs/docs/designs/bootstrap-core-delivery.md +++ /dev/null @@ -1,354 +0,0 @@ ---- -title: Bootstrap and Core Delivery Model -description: Proposed design for day-0 substrate and day-1 cluster-core delivery across the platform, nonprod, and prod clusters. ---- - -# Bootstrap and Core Delivery Model - -## Status - -Proposed. 
- -This document defines how the lab brings clusters to life before the reusable -`kro` API layer becomes active. It covers the narrow Talos/CAPI bootstrap path, -the reusable cluster-core components that GitOps manages afterward, and the -handoff from bootstrap artifacts to steady-state Argo CD ownership. - -## Purpose - -The primary purpose of this design is to keep bootstrap delivery, -cluster-core reuse, and platform API ownership clearly separated. - -The intended split is: - -- the `platform` repo owns canonical bootstrap/core component inputs and - rendered bootstrap artifacts -- the `gitops` repo owns per-cluster version selection and cluster-local - desired state after bootstrap -- the `infra` repo and CAPI templates own immutable Talos day-0 references for - fresh installs and reinstalls - -This lets the lab reuse Cilium, Argo CD, and `kro` consistently across the -platform, `nonprod`, and `prod` clusters without copying their canonical install -artifacts into `gitops`. - -## Goals - -- Keep canonical bootstrap/core component source out of the `gitops` repo. -- Make platform-cluster bootstrap and downstream CAPI cluster creation - reproducible from versioned artifacts. -- Keep day-0 substrate narrow and explicit. -- Make day-2 ownership belong to Argo CD rather than to Talos/CAPI bootstrap - references. -- Start `kro` only at the first real platform API boundary. - -## Non-Goals - -- This document does not define the exact Helm values for Cilium, Argo CD, or - `kro`. -- This document does not define the long-term service exposure or control-plane - endpoint strategy for clusters after bootstrap. -- This document does not define the CI workflow implementation for rendering, - validation, or publishing. -- This document does not define the exact `Platform` schema. -- This document does not change the current architecture assumption that only - the platform cluster runs Argo CD. - -## Design Summary - -The intended cluster bring-up model is: - -- every cluster boots with a day-0 substrate -- reusable day-1 cluster-core components are then installed by GitOps -- the reusable `kro` platform API begins only after those prerequisites exist - -The three layers are: - -1. **Day-0 substrate** - - components required before GitOps or higher-level APIs can act - - includes Cilium on every cluster - - includes minimal Argo CD and root-app seeding only on the platform - cluster -2. **Day-1 cluster-core** - - reusable cluster components managed by GitOps but not exposed as - consumer-facing platform APIs - - includes the full Cilium install and `kro` - - includes Argo CD self-management on the platform cluster -3. **Platform API** - - released RGD bundles and the cluster-local `Platform` custom resource - - starts only after the day-1 cluster-core layer is present - -The following are intentionally **not** modeled as `kro` APIs: - -- Cilium -- Argo CD -- `kro` itself - -They are reusable installable cluster primitives, not consumer-facing platform -APIs. - -The long-term service exposure and control-plane endpoint model for those -clusters is defined in -[Service Exposure and Control Plane Endpoints](./service-exposure-and-control-plane-endpoints.md). - -## Cluster Flows - -### Platform Cluster - -The platform cluster bootstrap flow is: - -1. Talos installs bootstrap-safe Cilium from an immutable artifact reference. -2. Talos installs minimal Argo CD from an immutable artifact reference. -3. Talos seeds the admin-owned root `Application`. -4. 
The root app syncs `clusters/platform/bootstrap.yaml` from `gitops`. -5. That per-cluster bootstrap selection installs: - - full/self-managed Argo CD - - full Cilium - - `kro` -6. After `kro` is present, `clusters/platform/platform/` installs the selected - released RGD bundles and the cluster-local `Platform` custom resource. - -### Downstream Clusters - -The downstream `nonprod` and `prod` cluster flow is: - -1. CAPI/Talos installs bootstrap-safe Cilium from an immutable artifact - reference. -2. The platform-cluster Argo CD instance registers or reaches the new cluster. -3. Argo CD syncs `clusters//bootstrap.yaml` from `gitops`. -4. That per-cluster bootstrap selection installs: - - full Cilium - - `kro` -5. After `kro` is present, `clusters//platform/` installs the selected - released RGD bundles and the cluster-local `Platform` custom resource. - -Downstream clusters do **not** bootstrap their own Argo CD instances under the -current design. - -## Ownership and Repository Boundaries - -### Platform Repo - -The `platform` repo is the source of truth for reusable bootstrap/core -artifacts. - -It owns: - -- wrapper Helm charts for reusable cluster primitives -- bootstrap-safe rendered manifests for Talos/CAPI day-0 consumption -- released OCI chart artifacts for the GitOps-managed steady-state install -- version tags and release history for those artifacts - -It does not own per-cluster version selection or cluster-local desired state. - -### GitOps Repo - -The `gitops` repo is the source of truth for which version of each reusable -bootstrap/core component a cluster should run after GitOps takes over. - -It owns: - -- `clusters//bootstrap.yaml` for per-cluster bootstrap/core version - selection -- `clusters//platform/` for released RGD bundle installation and the - cluster-local `Platform` custom resource -- any other cluster-local desired state after bootstrap - -It does not own the canonical rendered source for reusable bootstrap/core -components. - -### Infra Repo and CAPI Templates - -The `infra` repo and CAPI templates own only the immutable day-0 references -needed to create or reinstall clusters. - -They own: - -- Talos machine-config references for platform-cluster day-0 artifacts -- CAPI/Talos template references for downstream-cluster day-0 artifacts - -They do not own day-2 change management for those components. - -## Canonical Artifact Layout - -The `platform/bootstrap/` subtree carries both Talos/CAPI day-0 substrate -artifacts and reusable day-1 cluster-core components. The name does not imply -that every component there is consumed directly by Talos. 
- -The intended `platform` repo layout is: - -```text -platform/ -└── bootstrap/ - ├── cilium/ - │ ├── Chart.yaml - │ ├── Chart.lock - │ ├── values.yaml - │ ├── bootstrap-values.yaml - │ ├── templates/ - │ ├── render/ - │ │ ├── bootstrap.yaml - │ │ └── full.yaml - ├── argocd/ - │ ├── Chart.yaml - │ ├── Chart.lock - │ ├── values.yaml - │ ├── bootstrap-values.yaml - │ ├── render/ - │ │ ├── bootstrap.yaml - │ │ └── full.yaml - └── kro/ - ├── Chart.yaml - ├── Chart.lock - ├── values.yaml - ├── render/ - │ └── full.yaml -``` - -The intended semantics are: - -- `Chart.yaml`: wrapper chart metadata and the pinned upstream chart dependency -- `Chart.lock`: the locked dependency resolution used for local render parity - and chart publication -- `values.yaml`: steady-state defaults for the GitOps-managed install -- `bootstrap-values.yaml`: bootstrap-only overrides needed for Talos/CAPI-safe - day-0 delivery when a bootstrap lane differs from the steady-state install -- `templates/`: platform-owned manifests layered on top of the upstream chart -- `render/bootstrap.yaml`: the immutable raw manifest Talos/CAPI consumes for - day-0 bring-up -- `render/full.yaml`: the fully rendered steady-state manifest for review and - validation parity with the Helm-driven install - -The per-cluster `clusters//bootstrap.yaml` resources in `gitops` -remain the only cluster-specific version-selection surface. They pin -destination and chart `targetRevision` while pointing directly at the released -OCI wrapper charts published from `platform`. - -`kro` has no Talos/CAPI bootstrap variant in the current design, so it does not -need `bootstrap-values.yaml` or `render/bootstrap.yaml`. - -The intended `gitops` repo surface is: - -```text -gitops/ -└── clusters/ - ├── platform/ - │ ├── bootstrap.yaml - │ └── platform/ - │ ├── rgds-platform.yaml - │ ├── rgds-apps.yaml - │ └── platform.yaml - ├── nonprod/ - │ ├── bootstrap.yaml - │ └── platform/ - │ ├── rgds-platform.yaml - │ ├── rgds-apps.yaml - │ └── platform.yaml - └── prod/ - ├── bootstrap.yaml - └── platform/ - ├── rgds-platform.yaml - ├── rgds-apps.yaml - └── platform.yaml -``` - -Each `bootstrap.yaml` is admin-owned and selects which released version of the -reusable bootstrap/core components a cluster should adopt. - -## Versioning and Promotion Rules - -The intended versioning model is: - -1. Change the canonical values or source inputs in `platform`. -2. Re-render `render/bootstrap.yaml` and `render/full.yaml` from those pinned - inputs. -3. Cut a component-scoped release tag and publish the wrapper chart as an OCI - artifact. -4. Bump each cluster's `clusters//bootstrap.yaml` in `gitops` to the - selected chart version. -5. If a day-0 artifact changed, also bump the immutable bootstrap artifact - references in: - - platform-cluster Talos config in `infra` - - downstream-cluster CAPI templates - -The versioning rules are: - -- cluster selections happen in `gitops` -- Talos/CAPI raw artifact URLs use immutable commit SHAs -- human-facing GitOps release selection happens by OCI chart version -- tags must be treated as immutable once published - -The SHA referenced by Talos/CAPI must correspond to the released artifact -selected for that version, even if GitOps later advances clusters at different -cadences. 
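To make that version-selection surface concrete, a hedged sketch of one admin-owned entry a cluster's `bootstrap.yaml` might carry is shown below. The registry path, chart version, project name, and destination namespace are illustrative placeholders, not settled values, and the registry is assumed to be registered in Argo CD as an OCI-enabled Helm repository:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: nonprod-cilium
  namespace: argocd
spec:
  project: platform
  source:
    # Released wrapper chart published from the `platform` repo (placeholder path).
    repoURL: ghcr.io/gilmanlab/platform
    chart: cilium
    # The value this file exists to change: the released chart version selected
    # for this cluster.
    targetRevision: 1.2.3
  destination:
    name: nonprod
    namespace: kube-system
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

Bumping `targetRevision` here is the whole day-2 change for this component on this cluster; the Talos/CAPI day-0 references move only when the underlying bootstrap artifact itself changed.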
- -## Bootstrap-Safe Versus Full Installs - -### Cilium - -Cilium has two delivery shapes: - -- **bootstrap-safe** - - used by Talos/CAPI day-0 bootstrap - - must preserve the intended steady-state core datapath behavior - - must disable secret-producing features so the rendered manifest is safe to - host at a public immutable URL -- **full** - - used by the GitOps-managed day-1/day-2 install - - may enable observability and TLS features that create or depend on secret - material - -Bootstrap Cilium is intentionally not a separate product. It is the steady-state -core datapath intent plus a small, explicit set of bootstrap-only exceptions. - -### Argo CD - -Argo CD also has two delivery shapes on the platform cluster: - -- **bootstrap** - - minimal install sufficient to run the root app -- **full** - - self-managed steady-state Argo CD installed by GitOps - -Downstream clusters do not use an Argo CD bootstrap variant under the current -design. - -### kro - -`kro` has only a full GitOps-managed install in this design. It is not Talos -day-0 substrate. - -## Ownership Handoff - -Talos/CAPI bootstrap and Argo CD do not share day-2 ownership equally. - -The intended ownership handoff is: - -- Talos/CAPI bootstrap gets the cluster alive -- Argo CD becomes the steady-state owner of full Cilium, Argo CD, and `kro` -- day-2 changes are made by updating `platform` inputs and the per-cluster - selections in `gitops`, not by editing Talos/CAPI day-0 URLs - -The Talos/CAPI references remain narrow and reinstall-focused. They exist so a -fresh cluster can boot, not so Talos/CAPI become the long-term control plane -for those components or define the cluster's steady-state external service and -API endpoint model. - -## Relationship to Other Designs - -This design builds on: - -- [Multi-Cluster GitOps Model](./gitops-multi-cluster.md) for cluster topology, - Argo scope, and application flow -- [Service Exposure and Control Plane Endpoints](./service-exposure-and-control-plane-endpoints.md) - for the steady-state external service and API endpoint model once bootstrap is - complete -- [Platform RGD Delivery Model](./platform-rgd-delivery.md) for released RGD - bundle delivery after `kro` is already present -- [kro Consumption Model](./kro-consumption-model.md) for the ownership split - between platform-owned APIs, developer release intent, and GitOps - materialization - -This document starts before those other designs. It ends at the point where a -cluster already has its reusable cluster-core components and is ready to consume -released RGD bundles and cluster-local platform APIs. diff --git a/docs/docs/designs/gitops-multi-cluster.md b/docs/docs/designs/gitops-multi-cluster.md deleted file mode 100644 index 78117d0..0000000 --- a/docs/docs/designs/gitops-multi-cluster.md +++ /dev/null @@ -1,470 +0,0 @@ ---- -title: Multi-Cluster GitOps Model -description: Proposed GitOps design for the platform, nonprod, and prod clusters using CAPI, Argo CD, Kargo, kro, and Capsule. ---- - -# Multi-Cluster GitOps Model - -## Status - -Proposed. - -This document describes the intended GitOps model for the lab once the platform -cluster, workload clusters, and application delivery flows are built out. It is -more specific than the architecture overview, but it is still a design rather -than a description of live state. - -Until this design is implemented, the architecture overview remains the source -of truth for the current baseline. 
- -## Scope - -This design covers four related concerns: - -- platform cluster responsibilities -- downstream cluster creation and management -- long-lived and ephemeral environments -- team and application isolation - -It uses the current planning baseline: - -- `1` platform cluster on the `UM760` -- `1` nonprod cluster -- `1` prod cluster -- long-lived environments: `dev`, `staging`, `prod` -- ephemeral environments for pull requests, load tests, and similar short-lived - work - -The example team layout used throughout this document is: - -- `TeamA` - - `AppA1` - - `AppA2` - - `AppA3` -- `TeamB` - - `AppB1` - - `AppB2` - -## Goals - -- Keep one management cluster responsible for cluster lifecycle and GitOps - control. -- Keep application delivery Git-native: promotion means changing Git, not - mutating clusters directly. -- Reuse `kro` APIs for application delivery instead of Helm or Kustomize - overlays. -- Keep downstream workload clusters strongly isolated from each other. -- Give each team a stable governance boundary without collapsing all of that - team's applications into a single namespace. - -## Non-Goals - -- This document does not define the exact `kro` `ResourceGraphDefinition` schema - for every platform API. -- This document does not define CI pipelines, image signing, or registry - hardening in detail. -- This document does not define every shared service that should run in - `nonprod` or `prod`. -- This document does not treat the current design as implemented reality. - -## Design Summary - -The intended control-plane model is: - -- the platform cluster runs `Argo CD`, `CAPI`, and `Kargo` -- `CAPI` creates the `nonprod` and `prod` workload clusters -- `Argo CD` runs only on the platform cluster and syncs plain YAML to all three - clusters -- `kro` provides the reusable application and platform APIs -- `Capsule` provides the team governance layer in each workload cluster -- `Kargo` promotes applications by editing environment-specific YAML in Git - -The high-level split is: - -- cluster boundary: `platform`, `nonprod`, `prod` -- team boundary: Capsule tenant per team per workload cluster -- workload boundary: namespace per `team-app-env` - -For future multi-node clusters, service and ingress VIPs are intended to use -Cilium LB IPAM plus Cilium BGP peering with the `VP6630`, while the canonical -Kubernetes API endpoint is intended to use Talos VIP. The control-plane -endpoint model is defined in -[Service Exposure and Control Plane Endpoints](./service-exposure-and-control-plane-endpoints.md). - -## Cluster Roles - -### Platform Cluster - -The platform cluster is the only management cluster. - -It owns: - -- `Argo CD` -- `CAPI` -- `Kargo` -- shared `kro` APIs -- platform-only controllers and operational services - -It does not host general application workloads by default. - -### Nonprod Cluster - -The `nonprod` cluster hosts: - -- `dev` environments -- `staging` environments -- ephemeral environments such as `pr-123` or `loadtest-001` -- any nonprod shared services that belong in the workload plane instead of the - platform plane - -### Prod Cluster - -The `prod` cluster hosts: - -- `prod` application environments -- prod shared services and policy - -## Namespace and Team Model - -Namespaces use the template: - -```text -team-app-env -``` - -Examples: - -- `teama-appa1-dev` -- `teama-appa1-staging` -- `teama-appa1-prod` -- `teama-appa1-pr-123` - -This keeps each application instance isolated at the namespace boundary. 
- -Do not use one namespace per team for all of that team's applications. That -would couple unrelated apps at the secrets, RBAC, quota, and blast-radius -layers. - -### Capsule - -Each workload cluster runs Capsule. - -Capsule tenants are cluster-local, so the intended shape is: - -- `nonprod`: - - `Tenant/teama` - - `Tenant/teamb` -- `prod`: - - `Tenant/teama` - - `Tenant/teamb` - -This means there is one logical team boundary across the lab, implemented as -one tenant object per team per workload cluster. - -`dev` and `staging` do not currently need separate team governance layers. -They share the same team-level Capsule tenant in `nonprod`, while remaining -separate per-app namespaces. - -## Environment Model - -Environments are not modeled as Helm values files or Kustomize overlays. - -Instead, each environment is a concrete instance of the same `kro` API. -Reuse lives in shared `ResourceGraphDefinition`s, while environment-specific -differences live in environment-specific custom resources. - -For example, `AppA1` should have separate resources for: - -- `teams/teama/appa1/envs/dev/app.yaml` -- `teams/teama/appa1/envs/staging/app.yaml` -- `teams/teama/appa1/envs/prod/app.yaml` - -Each file is small because the heavy lifting lives in the shared `kro` API. - -Ephemeral environments follow the same pattern under `ephemeral/`, for example: - -- `teams/teama/appa1/ephemeral/pr-123/app.yaml` - -## GitOps Repository Layout - -The intended `gitops` repository shape is: - -```text -gitops/ -├── platform/ -│ ├── argocd/ -│ │ ├── bootstrap.yaml -│ │ ├── projects/ -│ │ │ ├── platform.yaml -│ │ │ ├── teama.yaml -│ │ │ └── teamb.yaml -│ │ └── applicationsets/ -│ │ ├── platform.yaml -│ │ ├── clusters-platform.yaml -│ │ ├── clusters-nonprod.yaml -│ │ ├── clusters-prod.yaml -│ │ ├── teams-nonprod.yaml -│ │ └── teams-prod.yaml -│ ├── capi/ -│ │ ├── providers/ -│ │ ├── clusterclasses/ -│ │ └── clusters/ -│ │ ├── nonprod/ -│ │ └── prod/ -│ ├── kargo/ -│ │ └── projects/ -│ │ ├── teama-appa1/ -│ │ ├── teama-appa2/ -│ │ ├── teama-appa3/ -│ │ ├── teamb-appb1/ -│ │ └── teamb-appb2/ -├── clusters/ -│ ├── platform/ -│ │ ├── bootstrap.yaml -│ │ ├── platform/ -│ │ │ ├── rgds-platform.yaml -│ │ │ ├── rgds-apps.yaml -│ │ │ └── platform.yaml -│ │ ├── policies/ -│ │ └── shared/ -│ ├── nonprod/ -│ │ ├── bootstrap.yaml -│ │ ├── platform/ -│ │ │ ├── rgds-platform.yaml -│ │ │ ├── rgds-apps.yaml -│ │ │ └── platform.yaml -│ │ ├── capsule/ -│ │ │ ├── teama.yaml -│ │ │ └── teamb.yaml -│ │ ├── policies/ -│ │ └── shared/ -│ └── prod/ -│ ├── bootstrap.yaml -│ ├── platform/ -│ │ ├── rgds-platform.yaml -│ │ ├── rgds-apps.yaml -│ │ └── platform.yaml -│ ├── capsule/ -│ │ ├── teama.yaml -│ │ └── teamb.yaml -│ ├── policies/ -│ └── shared/ -└── teams/ - ├── teama/ - │ ├── appa1/ - │ │ ├── envs/dev/app.yaml - │ │ ├── envs/staging/app.yaml - │ │ ├── envs/prod/app.yaml - │ │ └── ephemeral/pr-123/app.yaml - │ ├── appa2/ - │ └── appa3/ - └── teamb/ - ├── appb1/ - └── appb2/ -``` - -The ownership model is: - -- `platform/`: platform-cluster control-plane state -- `clusters/*/bootstrap.yaml`: per-cluster version selection for reusable - bootstrap/core OCI Helm charts released from the `platform` repo -- `clusters/*/platform/`: released RGD bundle installation and cluster-local - `Platform` instances after the bootstrap/core layer is present -- `clusters/*/capsule`, `clusters/*/policies`, and `clusters/*/shared`: - workload-cluster shared state -- `teams/`: team-owned application instances - -## Argo CD Model - -One `Argo CD` instance runs 
on the platform cluster. - -It syncs: - -- `platform/argocd`, `platform/capi`, and `platform/kargo` to the platform - cluster -- `clusters/platform/bootstrap.yaml` to the platform cluster -- `clusters/platform/platform` to the platform cluster -- `clusters/nonprod/bootstrap.yaml` to the `nonprod` cluster -- `clusters/nonprod/platform`, `clusters/nonprod/capsule`, - `clusters/nonprod/policies`, and `clusters/nonprod/shared` to the `nonprod` - cluster -- `clusters/prod/bootstrap.yaml` to the `prod` cluster -- `clusters/prod/platform`, `clusters/prod/capsule`, - `clusters/prod/policies`, and `clusters/prod/shared` to the `prod` cluster -- `teams/*/*/envs/dev`, `teams/*/*/envs/staging`, and - `teams/*/*/ephemeral/*` to the `nonprod` cluster -- `teams/*/*/envs/prod` to the `prod` cluster - -The intended Argo shape is: - -- one `AppProject` per team -- `ApplicationSet` for platform-owned fleet generation -- one admin-owned bootstrap `Application` per cluster for - `clusters//bootstrap.yaml` -- `Application` resources kept in the `argocd` namespace - -Each `clusters//bootstrap.yaml` selects the version of the reusable -bootstrap/core components for that cluster by pinning the released OCI chart -versions for the admin-owned Cilium, Argo CD, and `kro` Applications. The full -bootstrap/core delivery sequence is defined in -[Bootstrap and Core Delivery Model](./bootstrap-core-delivery.md). - -Once the bootstrap/core layer is in place, `clusters//platform/` -holds the released RGD bundle installation and the cluster-local `Platform` -instance. - -Do not rely on application CRs scattered across arbitrary namespaces as the -default control model. The central `argocd` namespace is simpler unless a later -team self-service requirement makes that extra complexity worth it. - -## kro Model - -`kro` is the abstraction layer for reusable platform and application APIs. - -The intended pattern is: - -- shared RGD source and release lifecycle live in the `platform` repo -- cluster-local RGD bundle installation and cluster-local platform instances - live under `clusters//platform/` after the bootstrap/core layer has - already installed `kro` -- environment-specific application custom resources live under `teams/` -- Argo CD syncs the YAML -- versioned RGD bundles are installed from OCI artifacts -- `kro` expands the custom resources into the Kubernetes objects they own - -The platform-side release, CUE authoring, and OCI publication model is defined -in [Platform RGD Delivery Model](./platform-rgd-delivery.md). The preceding -bootstrap/core delivery layer is defined in -[Bootstrap and Core Delivery Model](./bootstrap-core-delivery.md). - -An environment-specific application resource should be narrow and explicit. For -example: - -```yaml -apiVersion: apps.platform.gilman.io/v1alpha1 -kind: AppDeployment -metadata: - name: appa1 -spec: - team: teama - app: appa1 - env: dev - namespace: teama-appa1-dev - image: - repository: ghcr.io/gilmanlab/teama/appa1 - digest: sha256:... - routing: - host: appa1.dev.apps.lab.gilman.io -``` - -The shared `AppDeployment` API should stamp stable labels such as: - -- `glab.gilman.io/team` -- `glab.gilman.io/app` -- `glab.gilman.io/env` - -## CAPI Model - -`CAPI` owns workload-cluster lifecycle. 
- -The intended CAPI responsibilities are: - -- install and manage cluster API providers -- define reusable cluster classes -- create and scale `nonprod` and `prod` -- keep workload-cluster creation separate from application promotion - -This keeps: - -- cluster lifecycle in `CAPI` -- desired-state reconciliation in `Argo CD` -- application promotion in `Kargo` - -## Kargo Model - -`Kargo` runs on the platform cluster. - -There should be one Kargo project per application pipeline: - -- `teama-appa1` -- `teama-appa2` -- `teama-appa3` -- `teamb-appb1` -- `teamb-appb2` - -The intended durable stages are: - -- `dev` -- `staging` -- `prod` - -Ephemeral environments are intentionally outside the long-lived promotion graph. -They should be created and destroyed by automation that adds or removes the -corresponding YAML from `teams/.../ephemeral/...`. - -Promotion means editing the environment-specific resource in Git. A narrow field -such as `spec.image.digest` is the preferred promotion target. - -For `AppA1`, the promotion targets are: - -- `teams/teama/appa1/envs/dev/app.yaml` -- `teams/teama/appa1/envs/staging/app.yaml` -- `teams/teama/appa1/envs/prod/app.yaml` - -The intended policy is: - -- `dev`: automatic promotion is acceptable -- `staging`: automatic promotion is acceptable -- `prod`: promotion should require an explicit approval step - -## Worked Example: TeamA / AppA1 - -1. `CAPI` creates the `nonprod` and `prod` workload clusters. -2. `Argo CD` syncs shared control-plane state to the platform cluster. -3. `Argo CD` syncs `clusters/platform/platform/`, - `clusters/nonprod/platform/`, and `clusters/prod/platform/`, installing - `kro`, the selected released `platform-rgds` and `apps-rgds` bundles, and - the cluster-local `Platform` instances. -4. `Argo CD` syncs Capsule tenants `teama` and `teamb` to `nonprod` and `prod`. -5. `teams/teama/appa1/envs/dev/app.yaml` defines an `AppDeployment` with - namespace `teama-appa1-dev`. -6. CI publishes a new image for `AppA1`. -7. `Kargo` detects the new artifact and updates - `teams/teama/appa1/envs/dev/app.yaml`. -8. `Argo CD` syncs that file to `nonprod`. -9. `kro` expands the `AppDeployment` into the namespace and workload resources - for `teama-appa1-dev`. -10. After validation, `Kargo` updates - `teams/teama/appa1/envs/staging/app.yaml`. -11. `Argo CD` syncs the staging instance to `nonprod` namespace - `teama-appa1-staging`. -12. After approval, `Kargo` updates - `teams/teama/appa1/envs/prod/app.yaml`. -13. `Argo CD` syncs the prod instance to the `prod` cluster namespace - `teama-appa1-prod`. - -## Open Questions - -- Which `kro` APIs should exist first beyond `AppDeployment` and - team-namespace bootstrap? -- Should `Argo CD` generate one application per environment directory, one per - app, or one per team/app/cluster boundary? -- Which shared services belong under `clusters/nonprod/shared` and - `clusters/prod/shared` versus platform-wide control-plane management? -- What is the exact cluster-registration flow from `CAPI` outputs into Argo CD - destinations? 
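Tying the worked example together, a hedged sketch of the prod promotion target follows. The digest and routing host are placeholders, and the field layout mirrors the earlier `AppDeployment` example rather than a settled schema:

```yaml
# teams/teama/appa1/envs/prod/app.yaml
apiVersion: apps.platform.gilman.io/v1alpha1
kind: AppDeployment
metadata:
  name: appa1
spec:
  team: teama
  app: appa1
  env: prod
  namespace: teama-appa1-prod
  image:
    repository: ghcr.io/gilmanlab/teama/appa1
    # The narrow promotion target: the only line a prod promotion commit changes.
    digest: sha256:0000000000000000000000000000000000000000000000000000000000000000
  routing:
    # Placeholder host; the prod routing scheme is not settled in this document.
    host: appa1.apps.lab.gilman.io
```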
- -## Migration into Architecture Docs - -This document should be folded into the architecture overview after the -following are true: - -- the `gitops` repo structure exists in a stable form -- the platform cluster is actually running `Argo CD`, `CAPI`, and `Kargo` -- the `nonprod` and `prod` clusters exist under `CAPI` -- at least one real application has exercised the `dev` -> `staging` -> `prod` - flow - -At that point, the architecture overview should be updated so it describes the -steady-state GitOps model directly, and this design document can either be -trimmed or kept as historical design context. diff --git a/docs/docs/designs/index.md b/docs/docs/designs/index.md deleted file mode 100644 index 9fa7248..0000000 --- a/docs/docs/designs/index.md +++ /dev/null @@ -1,31 +0,0 @@ ---- -title: Design Documents -description: Proposed designs that are not yet part of the settled GilmanLab architecture baseline. -slug: /designs/ ---- - -# Design Documents - -This section holds proposed designs that are specific enough to guide -implementation, but are not yet part of the settled architecture baseline. - -Use these documents when: - -- a design is clear enough to review formally -- the target implementation does not exist yet -- the architecture overview should stay conservative until the design is proven - -Current designs: - -- [Bootstrap and Core Delivery Model](./bootstrap-core-delivery.md) -- [Service Exposure and Control Plane Endpoints](./service-exposure-and-control-plane-endpoints.md) -- [Multi-Cluster GitOps Model](./gitops-multi-cluster.md) -- [kro Consumption Model](./kro-consumption-model.md) -- [Platform RGD Delivery Model](./platform-rgd-delivery.md) -- [App RGD Design](./app-rgd.md) -- [AWS Lab Account](./aws-lab-account.md) -- [Keycloak](./keycloak.md) -- [Secrets and PKI](./secrets-and-pki.md) - -Once a design is implemented and considered durable, its steady-state shape -should be folded back into the architecture overview and any relevant runbooks. diff --git a/docs/docs/designs/keycloak.md b/docs/docs/designs/keycloak.md deleted file mode 100644 index 7ecf9fd..0000000 --- a/docs/docs/designs/keycloak.md +++ /dev/null @@ -1,441 +0,0 @@ ---- -title: Keycloak -description: Proposed design for the lab's central identity system — deployment shape, federation, configuration, backups, and the rebuild-over-restore disaster-recovery model. ---- - -# Keycloak - -## Status - -Proposed. - -This document defines how the lab runs Keycloak as its central identity -provider. It covers deployment shape, identity federation, declarative -configuration, TLS, backups, the rebuild-over-restore disaster-recovery -model, and the per-service break-glass matrix that lets the lab continue to -operate while Keycloak is down. - -This document assumes the [AWS Lab Account](./aws-lab-account.md) design. It -does not re-establish shared context about the AWS Organization, networking, -IAM, or secrets bootstrap. - -## Purpose - -The primary purpose of this design is to keep the lab's identity system on a -durable off-lab trust anchor, while keeping the lab itself operable when that -trust anchor is unreachable. 
- -The intended split is: - -- Keycloak holds the authoritative, human-facing identity of record -- cluster-level and service-level **break-glass paths** exist for every OIDC - consumer, so identity outages do not cascade into cluster-access outages -- configuration is declaratively sourced from git, so **rebuild is the default - recovery mode** and restore is a fallback used only for runtime state a - single-user lab can recreate in seconds - -This mirrors a day-job architecture at small scale without overpaying for -HA features a single-user lab cannot justify. - -## Goals - -- Provide one place to manage human identity across the lab. -- Keep Keycloak outside the lab's physical failure domain while keeping its - blast radius understood. -- Make Keycloak's configuration surface fully declarative via git, so rebuild - is a first-class recovery path. -- Ensure every Keycloak-dependent service has a documented break-glass path - that does not require Keycloak. -- Make the disaster-recovery procedure short enough to execute without a - runbook open on a phone. - -## Non-Goals - -- This document does not run Keycloak in a highly-available configuration. - Single-node is a deliberate choice for a single-user lab and is not a gap. -- This document does not define the per-service Keycloak client configuration - (redirect URIs, scopes, role mappings, token TTLs). Those live in the - realm repository. -- This document does not define monitoring, logging, or alerting. -- This document does not define the Keycloak → Identity Center SAML - federation path. That remains future work. -- This document does not define cluster-level OIDC federation to AWS (used - by ExternalDNS and similar controllers). That is a separate concern handled - by the cluster's own OIDC issuer, not by Keycloak. - -## Design Summary - -Keycloak runs on a single dedicated EC2 instance in the `lab` member account, -colocated with its Postgres database. Access is at **`id.glab.lol`** via a -Route 53 private-zone record and a TLS certificate issued automatically via -**ACME DNS-01**. The instance and database are deployed via **Docker -Compose**. - -The only upstream identity source is **GitHub, federated via OIDC**. A single -realm named `lab` holds all users and all OIDC/SAML clients. Sign-in to any -Keycloak-fronted service is: user → service → Keycloak → GitHub. - -Keycloak's declarative surface — realms, clients, roles, identity provider -settings, scopes, authentication flows — is reconciled from a git repository -by **`keycloak-config-cli`** running as a scheduled job on the Keycloak host. -Runtime state (user credentials, sessions, TOTP enrollment) is not in git and -is the only part of the system that needs backup-based recovery. - -Database dumps and the current TLS cert bundle are backed up nightly to an -S3 bucket in the `lab` account. The lab's Synology NAS pulls those backups -locally on a schedule, so a recent copy of Keycloak's runtime state exists on -a second continent and a second provider. - -**Disaster recovery is rebuild-first.** For a single-user lab, a rebuild from -the git-tracked realm + a fresh Postgres is faster than a restore, enforces -discipline that every config-surface change actually lives in git, and -produces a clean outcome. A restore path exists as fallback but is not the -primary recovery mode. - -When Keycloak is entirely unavailable, every Keycloak-fronted service has a -local break-glass path documented in this doc. 
The lab continues to operate; -only the "unified identity" experience degrades. - -## Deployment Shape - -### Host and Runtime - -- **Instance:** `t4g.small` (2 vCPU, 2 GB RAM), Amazon Linux 2023 on ARM, in - the `lab` account and `172.16.0.0/16` VPC from the AWS design. -- **Runtime:** Docker Compose manages two services: - - `keycloak` — official upstream Keycloak image, tagged to a specific - version pinned in `infra/`. - - `postgres` — official Postgres image, tagged to a specific version pinned - in `infra/`. Data volume on the instance's EBS root volume. -- **Reverse proxy:** Caddy (or equivalent) runs alongside and terminates TLS, - proxying to Keycloak on loopback. Caddy performs ACME DNS-01 renewals using - the instance's IAM role. This is the recommended shape; the exact proxy is - an implementation detail that does not need to appear in this doc. -- **No EBS snapshots or AMI backups.** State recovery is via the application - backup path below, not via block-level snapshots. - -### Identity Profile for the Host - -The EC2 instance's IAM role grants, at minimum: - -- Route 53 write access scoped to the `_acme-challenge.id.glab.lol` record - for DNS-01 validation. -- S3 write access to the Keycloak backup bucket's prefix. -- SSM Parameter Store read for any bootstrap-time secrets held there - (following the pattern established in the AWS design). - -The role carries no other permissions. All lab cluster-access, -secret-decryption, and tailnet-identity paths that this instance depends on -are established in the AWS Lab Account design and are not restated here. - -### Sizing and Tuning - -2 GB of RAM is tight but workable for a single-user lab because Keycloak 26.x -(Quarkus-based) has a much smaller footprint than earlier WildFly-based -versions. The required tuning is: - -- explicit Keycloak JVM max heap (e.g. ~768 MB) -- conservative Postgres `shared_buffers` (~128 MB) -- a swap file on the EBS volume as a safety margin - -CPU is burstable but effectively idle for single-user workloads; unlimited -mode is enabled to tolerate rare login bursts at negligible cost. - -## Identity Federation - -### Realm Structure - -A single realm named `lab` holds all lab users and all OIDC/SAML clients. No -separate realms for services vs. humans; a single-user lab does not benefit -from the separation, and multi-realm setups make GitOps reconciliation more -fragile. - -### Upstream IdP - -The realm has **exactly one identity provider configured: GitHub, via -OIDC**. There is no local username-password fallback. A lab user's identity -is their GitHub identity, federated through Keycloak, presented downstream to -each OIDC client. - -The Keycloak admin bootstrap user exists only briefly during initial realm -creation and is disabled once `keycloak-config-cli` has reconciled the -realm from git. - -### Intended Early OIDC Clients - -Keycloak is expected to front these services first: - -- **kubectl → Talos clusters** via each cluster's Kubernetes API-server OIDC - configuration. -- **Argo CD** (web UI and CLI). -- **Grafana** (when deployed). - -Additional clients will be added over time. The authoritative list lives in -the realm repository, not in this document. - -## TLS - -### Issuance - -The cert for `id.glab.lol` is issued via **ACME DNS-01 against Route 53** -using the instance's IAM role. There is no internet-exposed HTTP-01 path, -because the host sits in a private VPC with the subnet router as its only -tailnet attachment. - -No wildcard cert is used. 
Each service in the lab gets its own host-scoped -cert: - -- this host: `id.glab.lol` -- future clusters: `nonprod.k8s.glab.lol`, `prod.k8s.glab.lol` -- future services: their own host names - -Cluster-fronted services will issue their own certs via cert-manager and -each cluster's OIDC trust relationship with AWS (a separate future design). - -### Renewal - -Renewal is automatic and handled by whichever TLS-terminating proxy is used -on the host. No human intervention is expected between renewals. - -### Restore-time behavior - -For same-hostname restore to work without waiting on ACME, the current TLS -cert bundle (cert + private key) is included in the nightly backup payload -alongside the Postgres dump. A restored host can serve HTTPS immediately -with the backed-up cert and let the proxy handle its own renewal on the -normal schedule. - -## Configuration as Code - -### Source of Truth - -Keycloak's declarative surface is reconciled from a git repository. The -**repository is the source of truth**; the running Keycloak's admin surface -is a read-through cache of what's in git. - -In scope for git reconciliation: - -- realms -- clients -- client scopes -- roles and role mappings -- identity-provider configuration -- authentication flows and required actions -- realm-level settings - -Out of scope for git (intentionally runtime state): - -- user credentials (password hashes, WebAuthn registrations, TOTP secrets) -- sessions and refresh tokens -- audit and event logs -- ephemeral tokens and one-time codes - -### Reconciliation Tool - -Reconciliation uses **`keycloak-config-cli`** (adorsys). The tool is mature, -works against the admin API, handles partial updates, and does not require -Kubernetes CRDs or a separate operator. It is the best available GitOps -option as of this writing given the upstream Keycloak Operator still does -not provide first-class CRDs for clients, users, roles, or identity -providers. - -### Reconciliation Location - -`keycloak-config-cli` runs **on the Keycloak host itself** as a scheduled -job. Pull cadence is a small number of minutes; the exact cadence is an -implementation detail. The job authenticates to Keycloak using a -reconciliation service account stored in SSM Parameter Store. - -This location is intentionally simple for now. Pushing reconciliation into -GitHub Actions (so every git push triggers a reconcile) is named as future -work — it would enforce "git push is the only way config changes" more -strictly — but it requires a reachable admin endpoint and an appropriate -trust path, which are better designed once the realm repo has concrete -shape. - -### Schema Versioning - -Keycloak migrations run forward only. The realm repository pins the -Keycloak version it expects. Upgrades are driven by bumping the pin in -`infra/` and allowing the next reconcile cycle to re-apply cleanly against -the upgraded Keycloak. - -## Backups - -### What is backed up - -- Postgres dump (the full database, including all runtime state). -- Keycloak configuration files that live outside the database: `keycloak.conf`, - environment overrides, any custom themes or providers. -- The current TLS cert bundle (cert + private key). - -Configuration in git is **not** part of the backup — git is already the -durable store for it. - -### Where they go - -- **Primary destination:** an S3 bucket in the `lab` account. The bucket - uses server-side encryption with a KMS key; object-lock/versioning is on - so corruptions cannot silently overwrite known-good backups. 
-- **Secondary destination:** the lab's Synology NAS, which pulls from the S3 - bucket on its own schedule. - -The host writes backups to S3 using the instance's IAM role (no long-lived -credentials). The Synology pulls from S3 using a scoped, read-only access -mechanism chosen when Synology-side automation is implemented — out of scope -for this document. - -### Retention - -Retention is a rolling window, implemented via S3 lifecycle policies. The -contract: - -- **daily** backups retained for **30 days** -- **weekly** backups retained for **12 weeks** -- **monthly** backups retained for **12 months** - -"Last backup only" is explicitly rejected. A corruption-style incident -(realm data mangled by a bad change, not a hardware failure) requires -point-in-time restore from days ago. - -### Encryption - -Backups contain password hashes, signing keys, TOTP secrets, and session -state. They are encrypted twice: - -- client-side: the Postgres dump and cert bundle are encrypted before upload - using a recipient key managed alongside the rest of the lab's bootstrap - secrets -- server-side: the S3 bucket uses SSE-KMS with a customer-managed key - -This ensures that neither a leaked S3 object ACL nor a Synology -compromise yields usable plaintext. - -## Disaster Recovery - -The lab uses a **rebuild-first** recovery model. Restore exists as a -fallback, not as the primary path. - -### Rebuild path (primary) - -When Keycloak is unrecoverable or is being moved: - -1. Provision a fresh EC2 instance from the `infra/` OpenTofu modules. -2. Run `docker compose up` to start Keycloak and a **fresh** Postgres. -3. Run `keycloak-config-cli` against the new Keycloak, pointing at the realm - repository. All realms, clients, roles, and GitHub federation come back. -4. Sign in via GitHub. A new user entry is created on first login per the - identity-provider mapper configuration. -5. Re-enroll WebAuthn / TOTP (30 seconds). -6. Keycloak is operational. - -**Target RTO: 15 minutes.** This path requires only the git repository and -AWS access; it does not require any backup store. - -### Restore path (fallback) - -When a rebuild is unacceptable — for example, if you need to preserve the -exact user-state including federated-identity linkages and audit history — -the restore path is: - -1. Provision a fresh EC2 instance. -2. Pull the most recent, or a chosen point-in-time, backup from S3 or - Synology. -3. Restore the Postgres dump into a fresh Postgres. -4. Place the TLS cert bundle and any config files. -5. Run `docker compose up`. Keycloak boots against the restored database. -6. `keycloak-config-cli` runs its normal reconcile cycle; any configuration - drift between the backup time and `git HEAD` is corrected forward. - -For the single-user lab, the restore path is rarely worth it. For the -day-job mirror architecture, it is the default. - -### Hostname preservation - -Both paths require that the restored instance serve at **`id.glab.lol`**. -The `issuer` claim on every JWT Keycloak has ever signed is tied to that -URL. Changing the hostname invalidates every existing token and every -client's cached OIDC discovery. - -This is normally handled by DNS: the new instance comes up in the same VPC -with the same Route 53 A record pointing to it. During a full internet-loss -DR scenario where Route 53 is unreachable, a local override path exists: -the lab's CoreDNS zonefile can be manually edited to point `id.glab.lol` at -the restored instance's Tailscale address. 
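To make the rebuild path tangible, a minimal Compose sketch of the two services it brings up is shown below. Image tags, the data path, and the database password injection are placeholders, and the exact Keycloak option names should be checked against the version pinned in `infra/`:

```yaml
services:
  postgres:
    image: postgres:17                       # placeholder; the real tag is pinned in infra/
    command: ["postgres", "-c", "shared_buffers=128MB"]
    environment:
      POSTGRES_DB: keycloak
      POSTGRES_USER: keycloak
      POSTGRES_PASSWORD: ${KC_DB_PASSWORD}   # injected at bootstrap, never committed
    volumes:
      - /var/lib/keycloak/postgres:/var/lib/postgresql/data
  keycloak:
    image: quay.io/keycloak/keycloak:26.1    # placeholder; the real tag is pinned in infra/
    command: ["start"]
    environment:
      KC_DB: postgres
      KC_DB_URL: jdbc:postgresql://postgres:5432/keycloak
      KC_DB_USERNAME: keycloak
      KC_DB_PASSWORD: ${KC_DB_PASSWORD}
      KC_HOSTNAME: id.glab.lol
      KC_HTTP_ENABLED: "true"                # TLS terminates at the local reverse proxy
      KC_PROXY_HEADERS: xforwarded
      JAVA_OPTS_KC_HEAP: "-Xmx768m"          # explicit max heap per the sizing notes
    ports:
      - "127.0.0.1:8080:8080"                # reverse proxy forwards to loopback only
    depends_on:
      - postgres
```

On a rebuild, this pair comes up against an empty data directory and `keycloak-config-cli` re-applies the realm from git; on a restore, the same pair comes up after the Postgres dump has been loaded.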
- -### What rebuild does not recover - -- Stored user credentials (password hashes, WebAuthn, TOTP). These must be - re-enrolled by the user on first login. For a single-user lab, 30 seconds. -- Active sessions. All users are forced to re-authenticate. -- Federated-identity linkages established at previous logins. GitHub OIDC - users are re-linked on their next successful sign-in. -- Audit / event history. - -For the lab, none of these matter. For a production deployment of this -design, they would matter, and the restore path becomes primary. - -## Break-Glass Matrix - -When Keycloak is down, the lab continues to operate via -per-service break-glass paths. These are not emergency workarounds; they are -durable, documented alternate authentication paths that are kept active for -exactly this reason. - -| Service | Break-glass path | Notes | -|--------------------|------------------------------------------------------|-------| -| Talos API | mTLS via `talosconfig` and machine secrets | Talos's PKI is independent of Keycloak; `talosconfig` is the ultimate root anchor for the lab. | -| `kubectl` to clusters | Talos-generated admin kubeconfig via `talosctl kubeconfig` | Produced on demand against the cluster's own signing CA; not federated. | -| Argo CD | Built-in `admin` account and initial admin secret in the cluster | Retained and rotated but never disabled. | -| Vault (per cluster)| Unseal keys and root/recovery keys | Kept outside the lab per the Vault design (separate doc). | -| AWS | IAM Identity Center local user + WebAuthn hardware key | Does not federate to Keycloak by design — see the AWS doc. | -| Grafana | Local admin account | Kept active alongside OIDC client configuration. | -| GitHub (upstream IdP) | Personal GitHub account, hardware-key MFA | Keycloak is downstream of GitHub; if GitHub is down, federated login fails, and the above service-local paths are how the lab keeps moving. | - -All of these anchors' credentials live outside any Keycloak-dependent store. -`talosconfig`, Argo CD admin secrets, Vault recovery keys, and AWS root -credentials are kept in offline storage (password manager + hardware backup) -whose access does not depend on Keycloak, AWS, or internet connectivity. - -## Cost - -Monthly, all-upfront amortized over the 3-year EC2 Instance Savings Plan -commitment established in the AWS design: - -| Item | Monthly | -|--------------------------------|---------| -| t4g.small compute (3-yr SP) | ~$3.89 | -| S3 backup storage + lifecycle | ~$0.20 | -| Route 53 record queries | negligible | -| Data transfer | negligible | -| **Approximate total** | **~$4.10** | - -This is additive on top of the baseline AWS footprint laid out in the -AWS Lab Account design. - -## Future Work - -- **CI-driven reconciliation.** Move `keycloak-config-cli` from an - on-host cron into a GitHub Actions workflow so `git push` is the only - trigger that mutates Keycloak's declarative state. -- **Keycloak → IAM Identity Center SAML federation** as a secondary, - convenience path for AWS console access. The local IAM Identity Center - user remains primary. -- **Promotion to HA** if / when this design is reused at day-job scale. Two - Keycloak replicas behind a load balancer with clustered cache and a shared - database is the standard upgrade path; none of the decisions in this doc - block it. -- **Automated DR drill.** A periodic exercise — quarterly is appropriate — in - which the rebuild path is executed against a scratch instance to prove - the RTO target and keep the muscle alive. 
-- **Richer per-service break-glass.** Codify the break-glass secrets - themselves (their storage location, rotation cadence, recovery order) in - a separate operational runbook. - -## References - -- [AWS Lab Account](./aws-lab-account.md) -- [Bootstrap and Core Delivery Model](./bootstrap-core-delivery.md) -- [Multi-Cluster GitOps Model](./gitops-multi-cluster.md) -- [keycloak-config-cli (adorsys)](https://github.com/adorsys/keycloak-config-cli) -- [Keycloak Operator: status of first-class CRDs](https://www.keycloak.org/2022/09/operator-crs) diff --git a/docs/docs/designs/kro-consumption-model.md b/docs/docs/designs/kro-consumption-model.md deleted file mode 100644 index 1c6a04e..0000000 --- a/docs/docs/designs/kro-consumption-model.md +++ /dev/null @@ -1,365 +0,0 @@ ---- -title: kro Consumption Model -description: Proposed design for how GilmanLab publishes, consumes, and promotes kro-based application and platform APIs. ---- - -# kro Consumption Model - -## Status - -Proposed. - -This document captures the intended role of `kro` in the lab and the current -working model for how shared platform-owned APIs and developer-owned release -intent flow into environment-specific desired state in the `gitops` repo. - -This is an initial design draft. It intentionally does not define concrete -`ResourceGraphDefinition` schemas yet. The next design pass should define the -first public APIs in detail. - -## Purpose - -The primary purpose of `kro` in this lab is two-fold: - -- give software engineers a common abstraction for deploying applications -- give platform engineers a common abstraction for maintaining platform - capabilities - -In practice, these are closely related. Platform engineers and software -engineers are both describing and operating software systems through the same -API layer. The difference is usually ownership and blast radius, not the -fundamental shape of the task. - -`kro` is therefore the public API layer for both application delivery and -platform delivery. - -## Design Principles - -The `kro` API layer should be standardized and opinionated. - -The goal is not to recreate the full input surface of the underlying Kubernetes -objects. The goal is to publish a smaller, clearer API that is safer and easier -to reason about. - -The design principles are: - -- prefer a minimal public interface over a mechanically complete one -- hardcode platform-controlled values when that improves security, - consistency, or operational correctness -- prefer defaults that "do the right thing" and reduce cognitive load -- keep required inputs minimal -- constrain mutable inputs wherever reasonable -- expose escape hatches only when there is a real need, not as a default design - posture - -The expected result is that a developer should be able to use a platform API -without needing to understand the full Kubernetes object model underneath it. - -## Terminology - -Public `kro` APIs should prefer common software engineering terms over -specialized Kubernetes terms where doing so improves clarity. - -Examples: - -- prefer `volume` over `PersistentVolumeClaim` in a public API -- prefer application-level language like `service`, `route`, `database`, or - `worker` when those terms are understandable without Kubernetes-specific - context - -This is not a strict ban on Kubernetes terminology. Some Kubernetes concepts, -such as `Deployment`, are already self-explanatory to many users. 
The important -point is that public APIs should be shaped for the intended consumers, not as -thin wrappers around upstream object names. - -## API Ownership Model - -There are three important ownership layers: - -- platform-owned API definitions -- developer-owned application release intent -- environment-owned deployment materialization - -### Platform-owned API definitions - -The platform layer owns the `ResourceGraphDefinition`s and any shared conventions -that define what public APIs exist and how they behave. - -Examples include future APIs such as: - -- application delivery APIs -- team namespace bootstrap APIs -- platform capability APIs - -The source of truth for those shared APIs lives in the `platform` repo, not the -`gitops` repo. The platform team authors and releases those APIs there, then -publishes versioned bundle artifacts that `gitops` can install per cluster. - -The concrete platform-side release and delivery model is described in -[Platform RGD Delivery Model](./platform-rgd-delivery.md). - -### Developer-owned application release intent - -Developers should define instances of platform APIs alongside the application -source code that they belong to. - -For example: - -```text -project/ -├── src/ -├── Dockerfile -└── deployment.yaml -``` - -In this model, `deployment.yaml` is not the `RGD`. It is an instance of a -platform-owned API. - -This file should be treated as part of the application's release intent and -versioned alongside the application source and image build inputs. - -Important: `App` is only one example of a future public API. This model does -not imply that developers will only ever instantiate a single kind of `kro` -resource. - -### Environment-owned deployment materialization - -Argo CD still needs a Git source of truth to reconcile from. - -In this design, that source of truth remains the `gitops` repo on its mainline -branch. Argo CD does not read directly from developer application repositories. - -Therefore, the final environment-specific `kro` resource instances must be -materialized into the `gitops` repo under environment-specific paths on -`main`/`master`. - -This means the environment-specific folders in the `gitops` repo are still the -final desired state that Argo CD reconciles. - -For platform-side APIs, the `gitops` repo also holds cluster-local installation -of released RGD bundles and cluster-local instances such as `Platform`. - -## Relationship to the GitOps Model - -This design builds on the multi-cluster GitOps model described in -[Multi-Cluster GitOps Model](./gitops-multi-cluster.md). - -The important clarification is: - -- `kro` API definitions are platform-owned and released from the `platform` - repo -- developer-authored API instances are release inputs -- `gitops` selects released API bundle versions per cluster and carries - cluster-local platform instances -- Kargo materializes environment-specific outputs into the `gitops` repo -- Argo CD reconciles those outputs from `main` - -Environment-specific paths remain the expected layout, for example: - -```text -gitops/ -└── envs/ - ├── dev/ - │ └── orders-api/ - ├── staging/ - │ └── orders-api/ - └── prod/ - └── orders-api/ -``` - -This document does not settle the exact `gitops` folder layout beyond that -principle. 
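As an illustrative sketch only — the `App` kind, its API group/version, and every
field below are hypothetical placeholders, since this document deliberately defines
no concrete schemas yet — a materialized instance under `envs/prod/orders-api/`
might look something like:

```yaml
# Hypothetical materialized output written by Kargo during promotion; Argo CD
# would reconcile this file from the gitops repo on main. Registry path,
# namespace, and all values are placeholders.
apiVersion: apps.gilmanlab.example/v1alpha1   # hypothetical API group/version
kind: App                                     # example public API, not a settled schema
metadata:
  name: orders-api
  namespace: orders                           # placeholder namespace
spec:
  image: ghcr.io/gilmanlab/orders-api@sha256:...   # digest pinned at promotion time
  replicas: 2                                      # placeholder value
```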
- -The intended platform-side cluster bootstrap shape in `gitops` is: - -```text -gitops/ -└── clusters/ - └── / - └── platform/ - ├── kro.yaml - ├── rgds-platform.yaml - ├── rgds-apps.yaml - └── platform.yaml -``` - -This cluster-local layout installs `kro`, installs selected released RGD -bundles, and declares the cluster-local `Platform` instance. The detailed -release, build, and OCI publication model is described in -[Platform RGD Delivery Model](./platform-rgd-delivery.md). - -## Promotion Model - -`Kargo` should treat a release as more than an image update. - -The intended model is: - -- CI produces a new image -- CI also produces a Git commit containing the developer-owned application - release intent -- Kargo bundles the image revision and Git commit into one piece of Freight -- Kargo promotes that Freight through environments -- during promotion, Kargo materializes the final environment-specific desired - state into the `gitops` repo -- Argo CD reconciles that materialized desired state from `main` - -This is different from treating promotion as nothing more than a digest bump. -Behavior-defining configuration that belongs to the application release should -travel with the release artifact. - -## Environment-specific Inputs - -Environment-specific concerns still exist and must influence the final deployed -resource instances. - -A realistic example is an application input such as `THIRD_PARTY_URL`: - -- in `dev`, the application may need to call - `sandbox.api.thirdparty.com` -- in `prod`, the application may need to call `api.thirdparty.com` - -This kind of input is not necessarily part of the application release itself. -It may instead be a property of the target environment. - -The current design direction is: - -- developer-owned release inputs live with the application source -- environment-specific inputs live in the `gitops` repo -- Kargo combines the two during promotion and writes the resulting `kro` - resource instance into the destination environment folder on `main` - -This lets release-coupled configuration move through the promotion pipeline -while still letting environment-specific concerns influence the final deployed -resource. - -## Promotion-time Composition - -The current working assumption is that any composition of: - -- developer-authored release input -- environment-specific overrides or bindings -- final Argo-reconciled output - -happens during promotion, inside Kargo. - -In other words: - -- developers do not manually write environment-specific final manifests into the - `gitops` repo -- Argo CD should reconcile the final materialized YAML -- Kargo is the place where environment-specific shaping occurs before that YAML - lands in Git - -One pragmatic option is to use Kustomize at promotion time only. - -If used, the important boundary is: - -- Kustomize is a promotion-time composition tool -- Kustomize is not the public API -- Kustomize is not the developer-facing abstraction -- Argo CD should still reconcile the final materialized YAML written by Kargo - -This keeps `kro` as the abstraction layer while allowing a familiar merge and -patch mechanism to help materialize final environment-specific manifests. - -The exact composition mechanism remains an open question. Kustomize is one -candidate, not a final commitment in this document. - -## Composition Patterns - -Cross-resource relationships need explicit design. - -The preferred default is to embed related configuration where lifecycle and -ownership naturally belong together. 
- -For example, if an application needs a small amount of secret material or a -simple runtime capability that is specific to that application instance, the -public API should prefer embedding that relationship rather than forcing the -developer to construct multiple loosely related peer resources. - -However, embedding will not work in every case. - -Some capabilities have independent lifecycle or sharing boundaries. A future -`Database` capability is an example: - -- if an application declares that it needs a database, how does the application - receive connection details? -- if the database is managed independently, what is the contract between the - application-facing resource and the database-facing resource? -- if multiple consumers share a capability, where should the shared contract - live? - -`kro` does not remove the need to design those contracts carefully. This is an -area that still needs explicit design guidance for the lab. - -## Guardrails and Policy - -Environment substitution and policy guardrails are related but distinct -problems. - -### Environment substitution - -This is about supplying environment-sensitive values during promotion or -materialization. - -Examples: - -- environment-specific endpoints -- environment-specific hostnames -- environment-local service references - -This concern belongs close to the promotion/materialization flow and therefore -belongs close to Kargo and the `gitops` repo. - -### Policy guardrails - -This is about constraining what a final deployed resource is allowed to do. - -Examples: - -- forbidding privileged or root workloads in prod -- enforcing tenancy boundaries -- constraining namespaces, quotas, or network access - -This concern is not necessarily a `kro` concern. It may be better handled by -policy and governance layers such as `Kyverno` or `Capsule`. - -This design intentionally does not force those responsibilities into `kro`. - -## Current Design Direction - -The current working model is: - -1. The platform team publishes shared `kro` APIs. -2. Developers instantiate those APIs alongside the application source code. -3. CI produces an image and a corresponding Git commit. -4. Kargo bundles those artifacts into Freight. -5. Kargo promotes Freight between environments. -6. The `gitops` repo installs released shared API bundles per cluster and - carries cluster-local platform instances such as `Platform`. -7. During promotion, Kargo combines release input with environment-specific - inputs and writes the final resource instances into the environment-specific - area of the `gitops` repo on `main`. -8. Argo CD reconciles those final resource instances from the `gitops` repo. - -This is the design baseline for the next pass. - -## Open Questions - -- What are the first public `kro` APIs the platform should publish? -- Which relationships should be embedded by default versus modeled as explicit - peer resources? -- What is the standard contract for peer-style relationships such as an - application consuming a separately managed database? -- Should Kustomize be the default promotion-time composition mechanism, or only - one optional implementation technique? -- What exact files live in the environment-specific `gitops` folders during - promotion and after promotion? -- How should environment-specific inputs be authored, owned, and reviewed in the - `gitops` repo? 
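To keep the promotion-time composition idea concrete without prejudging the
mechanism, one possible shape — using Kustomize purely as an illustration of the
candidate named above, with every file name and field hypothetical — is a small
overlay that Kargo renders during promotion and never exposes to developers:

```yaml
# kustomization.yaml rendered only inside the promotion step (illustrative).
# app.yaml is the developer-owned release intent carried in the Freight commit;
# prod-bindings.yaml holds environment-owned values kept in the gitops repo.
resources:
  - app.yaml
patches:
  - path: prod-bindings.yaml
---
# prod-bindings.yaml (illustrative): the environment-specific THIRD_PARTY_URL
# binding for prod, merged into the release intent before the final manifest
# lands in the envs/prod/ folder on main.
apiVersion: apps.gilmanlab.example/v1alpha1   # same hypothetical group as above
kind: App
metadata:
  name: orders-api
spec:
  env:
    THIRD_PARTY_URL: https://api.thirdparty.com
```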
- -## Next Step - -The next design pass should define one or more concrete `ResourceGraphDefinition` -schemas that embody these principles, starting with a minimal application-facing -API and its expected lifecycle. diff --git a/docs/docs/designs/platform-rgd-delivery.md b/docs/docs/designs/platform-rgd-delivery.md deleted file mode 100644 index e348d7b..0000000 --- a/docs/docs/designs/platform-rgd-delivery.md +++ /dev/null @@ -1,182 +0,0 @@ ---- -title: Platform RGD Delivery Model -description: Proposed design for authoring, releasing, publishing, and consuming platform-side kro APIs. ---- - -# Platform RGD Delivery Model - -## Status - -Proposed. - -This document defines the platform-side delivery model for shared `kro` -`ResourceGraphDefinition`s. It complements the broader -[kro Consumption Model](./kro-consumption-model.md) by making the platform-owned -API lifecycle concrete without turning the `gitops` repo into the source of -truth for reusable API definitions. - -## Purpose - -The primary purpose of this design is to keep platform API ownership, release -lifecycle, and cluster consumption clearly separated. - -The intended split is: - -- the `platform` repo owns authoring, validation, release notes, and - publication for shared platform APIs -- the `gitops` repo owns cluster-local installation of released API bundles and - cluster-local instances of those APIs - -This lets the lab version and promote platform APIs deliberately, while keeping -cluster bootstrap in Git simple enough to reason about at a glance. - -## Goals - -- Keep reusable platform API source out of the `gitops` repo. -- Give `platform-rgds` and `apps-rgds` independent release trains. -- Publish released RGD bundles as OCI artifacts that Argo CD can install - declaratively. -- Keep the cluster-local bootstrap surface small and explicit. -- Use CUE for build-time authoring and validation without exposing CUE as an - operator-facing runtime interface. - -## Non-Goals - -- This document does not define the exact `Platform` schema. -- This document does not define bootstrap/core component delivery for Cilium, - Argo CD, or `kro`. -- This document does not define the full CI workflow YAML for release or - publication. -- This document does not define every future platform capability block. - -## Design Summary - -The intended model is: - -- `platform` owns the source for shared `kro` APIs -- `release-please` orchestrates release PRs, tags, and changelog updates -- `platform-rgds` and `apps-rgds` are released independently -- publish workflows render final YAML artifacts from CUE and push them to OCI - registries with ORAS -- `gitops` installs those released OCI artifacts through Argo CD -- `gitops` also holds the cluster-local `Platform` custom resource that carries - cluster-specific inputs - -## Ownership and Repository Boundaries - -### Platform Repo - -The `platform` repo is the source of truth for shared RGD definitions. - -It owns: - -- CUE authoring input for `platform-rgds` and `apps-rgds` -- validation of rendered RGD artifacts before publication -- release configuration and changelog management -- OCI publication of rendered artifacts - -It does not own cluster-local desired state. - -### GitOps Repo - -The `gitops` repo is the source of truth for which released API bundles a -cluster installs and which cluster-local custom resources should exist there. 
- -It owns: - -- Argo CD `Application` resources that install `kro` and released RGD bundles -- the cluster-local `Platform` custom resource -- the ordering and composition of those objects during cluster bootstrap - -It does not own raw shared RGD source. - -## Release Model - -The `platform` repo should manage `platform-rgds` and `apps-rgds` as separate -release trains in the same repository. - -The intended flow is: - -1. `release-please` manages release PRs, version bumps, tags, and changelog - updates. -2. `platform-rgds` and `apps-rgds` each advance only when their own changes - require a release. -3. A publish workflow runs after a release is created, renders the final YAML - artifacts, and pushes them as OCI artifacts via ORAS. -4. Cluster operators choose which released version to install by updating the - corresponding Argo CD `Application` in `gitops`. - -This keeps API release history explicit and lets lower environments validate new -bundle versions before higher environments adopt them. - -## Authoring and Build Model - -The intended authoring model for `platform-rgds` is: - -- one public `Platform` RGD -- CUE as the build-time authoring language -- a root package that defines the final public shape -- CUE subpackages for logically ordered platform capability blocks - -The subpackages are intended to correspond to stable blocks of cluster -configuration, such as: - -- core platform defaults -- secrets integration -- networking integration -- bare-metal integration such as `tinkerbell` - -These subpackages are an authoring and validation boundary, not a separate -operator-facing API surface. The published product remains the rendered RGD -YAML artifact. - -CI may import CRDs or equivalent schemas into CUE so the rendered artifact can -be validated structurally before publication. Cluster-side `kro` validation is -still responsible for the final semantic checks when the RGD is created. - -## Cluster-local Platform API Consumption Model - -This document starts after the bootstrap/core layer has already installed -`kro`. The preceding day-0/day-1 delivery sequence is defined in -[Bootstrap and Core Delivery Model](./bootstrap-core-delivery.md). - -The intended cluster-local platform API surface in `gitops` is: - -```text -clusters//platform/ -├── rgds-platform.yaml -├── rgds-apps.yaml -└── platform.yaml -``` - -Each file has one job: - -- `rgds-platform.yaml`: install the selected released `platform-rgds` OCI - artifact -- `rgds-apps.yaml`: install the selected released `apps-rgds` OCI artifact -- `platform.yaml`: instantiate the cluster-local `Platform` custom resource - -This keeps the cluster-local platform API surface intentionally small and makes the -chosen bundle versions obvious in Git. - -## Relationship to Other Designs - -This design builds on: - -- [Bootstrap and Core Delivery Model](./bootstrap-core-delivery.md) for day-0 - and day-1 component delivery before `kro` is present -- [Multi-Cluster GitOps Model](./gitops-multi-cluster.md) for cluster topology, - tenancy, and application flow -- [kro Consumption Model](./kro-consumption-model.md) for the ownership split - between platform-owned APIs, developer release intent, and GitOps - materialization - -It does not change the application-side model where developers author `App` -instances alongside application source code and Kargo materializes final -environment-specific resources into `gitops`. 
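As a shape-only illustration — the `Platform` schema is explicitly not defined by
this document, so the API group, version, and every field below are hypothetical —
the cluster-local `platform.yaml` might eventually look like:

```yaml
# clusters/<cluster>/platform/platform.yaml (illustrative only).
# rgds-platform.yaml and rgds-apps.yaml would pin specific released OCI bundle
# versions; this file is the single cluster-local instance those bundles expect.
apiVersion: platform.gilmanlab.example/v1alpha1   # hypothetical group/version
kind: Platform
metadata:
  name: platform                                  # placeholder instance name
spec:
  cluster: nonprod                                # placeholder cluster identity
  tinkerbell:                                     # example capability block named in this doc
    enabled: false                                # placeholder input
```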
- -## Next Step - -The next implementation-oriented design pass should define the initial -`Platform` schema and the first concrete capability block, starting with the -platform-side `tinkerbell` inputs. diff --git a/docs/docs/designs/secrets-and-pki.md b/docs/docs/designs/secrets-and-pki.md deleted file mode 100644 index cf34ee3..0000000 --- a/docs/docs/designs/secrets-and-pki.md +++ /dev/null @@ -1,426 +0,0 @@ ---- -title: Secrets and PKI -description: Proposed design for bootstrap secrets, per-cluster Vault, and the lab PKI split. ---- - -# Secrets and PKI - -## Status - -Proposed. - -This document defines the target model for secrets and PKI across the lab. It -connects the AWS bootstrap foundation from [AWS Lab Account](./aws-lab-account.md) -to the future per-cluster Vault model and replaces the older router-hosted -`step-ca` internal PKI design. - -## Purpose - -The lab needs two different secret systems: - -- a **bootstrap** system that works before any cluster or Vault instance exists -- a **runtime** system that belongs to each cluster after that cluster exists - -The bootstrap system must be small, externally anchored, and able to recover -the lab from a cold start. The runtime system must be local to the cluster it -serves, because `platform`, `nonprod`, and `prod` are independent recovery and -security domains. - -PKI follows the same split. Public HTTP TLS uses public trust through Let's -Encrypt. Internal workload identity and mTLS use lab-managed authorities rooted -in an AWS KMS-held root CA and delegated to cluster-local Vault or future SPIRE -issuers. - -## Goals - -- Make AWS the authoritative control surface for bootstrap secret access. -- Remove standing PGP and age decryptors from SOPS-encrypted bootstrap - material. -- Allow short-lived, scope-limited access to selected bootstrap secrets. -- Keep runtime secrets in each cluster's own Vault instance. -- Prevent cross-cluster secret reads between `platform`, `nonprod`, and `prod`. -- Use Let's Encrypt for browser-facing HTTP TLS. -- Use cluster-local subordinate CAs for internal mTLS and future SPIFFE/SPIRE. -- Retire the router-hosted `step-ca` internal issuer from the target design. - -## Non-Goals - -- This document does not define exact IAM policy JSON, Vault policy HCL, or - Kubernetes manifests. -- This document does not define a general secret synchronization mechanism - between SOPS and Vault. -- This document does not make the platform cluster a central secret broker for - downstream clusters. -- This document does not federate SPIRE trust domains between clusters. -- This document does not cover Keycloak runtime backups or realm - configuration; those remain in [Keycloak](./keycloak.md). - -## Design Summary - -Secrets are split into two layers: - -| Layer | Authority | Scope | Purpose | -| --- | --- | --- | --- | -| Bootstrap | AWS + GitHub App + SOPS | Pre-cluster and recovery | Retrieve enough material to create or recover clusters, Vault, and base platform services. | -| Runtime | Per-cluster Vault | One cluster | Store and issue secrets for workloads in that cluster. | - -PKI is split into two layers: - -| Use case | Authority | Issuer | -| --- | --- | --- | -| HTTP TLS | Public WebPKI | Let's Encrypt through cert-manager DNS-01 and Route 53 | -| Internal mTLS / workload identity | Lab internal PKI | Per-cluster Vault subordinate CAs, with future SPIRE authorities | - -The root invariant is: - -```text -AWS grants the workload identity. -GitHub App grants temporary repository read access. 
-AWS KMS grants temporary scoped SOPS decrypt access. -Cluster-local Vault grants runtime secret and certificate access. -``` - -## Bootstrap Secrets - -### Source of truth - -Bootstrap secret payloads live in the private `GilmanLab/secrets` repository. -Public repositories keep only templates, variable contracts, and lookup logic. - -The `secrets` repository holds only material that is needed before a cluster's -Vault is ready, or material needed to recover that Vault. Examples include: - -- Talos and cluster bootstrap material -- initial Vault unseal/recovery storage configuration -- GitHub App bootstrap material -- credentials needed to create the first runtime secret sources -- emergency recovery material that cannot live inside the thing it recovers - -Runtime application secrets do not belong in SOPS once Vault exists. - -### SOPS over AWS KMS - -All existing SOPS files are rewrapped with the customer-managed KMS key in the -current `lab` AWS account: - -```text -alias/glab-sops -arn:aws:kms:us-west-2:186067932323:key/2aba1d94-6eaf-4d80-8d26-2077f32fd7c5 -``` - -The existing PGP and age recipients are removed from SOPS metadata. After the -cutover, AWS is the only routine decrypt control surface. - -### Scoped decryption - -SOPS files are encrypted with AWS KMS encryption context so IAM can grant -decrypt access by scope. Scopes are file/path oriented; SOPS is not a -field-level authorization system. - -Example scope layout: - -```text -network/tailscale/* Scope=network-tailscale -network/vyos/* Scope=network-vyos -compute/talos/platform/* Scope=talos-platform -vault/platform/* Scope=vault-platform -vault/nonprod/* Scope=vault-nonprod -vault/prod/* Scope=vault-prod -``` - -Each encrypted file includes KMS context similar to: - -```yaml -Repo: GilmanLab/secrets -Scope: network-tailscale -``` - -A workload that needs only `network/tailscale/*` receives short-lived AWS -credentials for a role that can call `kms:Decrypt` only when the request's -encryption context has `Repo=GilmanLab/secrets` and -`Scope=network-tailscale`. - -Because KMS encryption context is authenticated data bound to the encrypted -data key, changing the SOPS file metadata to a different scope does not allow -the ciphertext to decrypt. - -### Repository access - -Private repository access uses a GitHub App owned by `GilmanLab` and installed -on `GilmanLab/secrets`. - -The App private signing key is stored in SSM Parameter Store as a SecureString -in the `lab` account. Bootstrap workloads do not read that key directly. They -invoke the `github-token-broker` Lambda with their AWS identity. The broker is -the only normal principal that can read the App private key, generate a GitHub -App JWT, and exchange it for a short-lived installation token. - -Installation tokens are requested with the narrowest useful shape: - -```text -repositories = ["secrets"] -permissions = {"contents": "read"} -ttl = 1 hour -``` - -The GitHub token grants access to encrypted files. AWS KMS grants access to -plaintext. A workload may be able to clone the whole private repository while -still being unable to decrypt files outside its KMS context scope. - -Callers may use either `git` or the GitHub Contents API with `curl` to fetch -encrypted files. The no-`git` path still uses the GitHub token only for -encrypted repository access; the Lambda does not return SOPS files and does -not decrypt them. - -### Historical exposure - -Removing PGP and age recipients from current SOPS files does not remove their -ability to decrypt old git revisions. 
Any bootstrap secret that must become -AWS-authoritative retroactively is rotated after the KMS cutover. - -History rewrite is possible but is not the default. For this lab, rotating -affected secrets is the simpler and more auditable path. - -## Runtime Secrets - -### Per-cluster Vault - -Each cluster runs its own HashiCorp Vault instance managed by `bank-vaults`. -Vault is the runtime source of truth for secrets in that cluster. - -The intended cluster split is: - -| Cluster | Vault scope | -| --- | --- | -| `platform` | Platform-cluster services and platform control-plane needs | -| `nonprod` | Non-production workloads | -| `prod` | Production workloads | - -Vault instances do not read each other's storage, policies, tokens, or secret -paths. `prod` does not depend on `nonprod`; `nonprod` does not depend on -`prod`; `platform` does not become a universal secret broker. - -### Environment separation - -Clusters that host multiple environments segregate secrets by path and policy. -For `nonprod`, the baseline shape is: - -```text -dev/* -staging/* -``` - -Path naming is not the security boundary by itself. Vault auth roles and -policies enforce which workloads, namespaces, and service accounts can read or -write each prefix. - -### Bootstrap-to-runtime handoff - -SOPS may seed initial Vault configuration and initial secret material during -cluster bootstrap. Once the cluster is operating, runtime mutation belongs in -Vault. - -There is no bidirectional SOPS-to-Vault synchronization loop. That would create -two sources of truth. The direction is: - -```text -SOPS bootstrap material -> initialize/configure Vault -> Vault owns runtime -``` - -If a runtime secret must be recovered from bootstrap material, the recovery -process is explicit and documented for that secret class. - -### Vault unseal material - -For cost management, Vault unseal and root/recovery material may share one -customer-managed AWS KMS key across clusters. The KMS key wraps distinct -per-cluster Vault material; it is not a shared Vault unseal key. - -Isolation is enforced with: - -- per-cluster S3 prefixes or buckets for bank-vaults storage -- per-cluster IAM roles -- KMS encryption context such as `Purpose=vault-unseal` and - `Cluster=nonprod` - -Example: - -```text -KMS key: alias/glab-vault-unseal - -S3: - s3://glab-vault-unseal/platform/* - s3://glab-vault-unseal/nonprod/* - s3://glab-vault-unseal/prod/* - -KMS context: - Purpose = vault-unseal - Cluster = platform | nonprod | prod -``` - -One KMS key per cluster would provide cleaner blast-radius isolation, but the -fixed monthly KMS cost is not worth it at this lab scale. - -## Public HTTP TLS - -HTTP TLS certificates are always issued by Let's Encrypt through ACME DNS-01 -against Route 53. - -Cluster responsibilities: - -- ExternalDNS manages service DNS records. -- cert-manager manages ACME orders, challenges, and certificate renewal. - -ExternalDNS does not manage `_acme-challenge` TXT records. Those belong to -cert-manager. - -The AWS account already contains a public Route 53 ACME validation zone, -`acme.glab.lol`, delegated from Cloudflare. The cluster TLS design uses that -zone rather than granting cluster workloads broad Cloudflare DNS access. - -The challenge delegation convention is: - -```text -_acme-challenge..glab.lol - CNAME _acme-challenge...acme.glab.lol -``` - -cert-manager is configured to follow CNAMEs and write TXT records into the -Route 53 ACME zone using short-lived AWS credentials. 
Each cluster's AWS role -is scoped to its own challenge names. - -No wildcard certificate is assumed. Individual services receive individual -certificates unless a future workload proves a wildcard is worth the broader -blast radius. - -## Internal PKI - -### Root CA - -The internal PKI root is an AWS KMS asymmetric signing key. The private key -never leaves KMS. - -The existing `infra/security/pki/root-ca` stack was applied against an old AWS -account and is not the target root. The target implementation recreates the -root CA in the current `lab` account, then cleans up the old-account root key -and state. - -During recreation, the root certificate's path length is increased from the -current `pathlen:1` model. The recommended target is `pathlen:2`: - -```text -Root CA pathlen:2 - -> cluster Vault intermediate pathlen:1 - -> SPIRE intermediate pathlen:0 - -> workload SVID leaves -``` - -This keeps the future SPIRE path open without requiring another root rotation. -For clusters where Vault directly issues mTLS leaves, the same hierarchy still -works: - -```text -Root CA pathlen:2 - -> cluster Vault intermediate pathlen:1 - -> workload mTLS leaves -``` - -Root signing is operationally offline. No always-on lab workload has standing -permission to use the root key. Root signing is used only to mint or rotate -cluster subordinate CAs. - -### Cluster subordinate CAs - -Each cluster gets its own subordinate CA, generated and held by that cluster's -Vault instance. Vault generates the intermediate private key and CSR; the AWS -KMS root signs the CSR; the signed intermediate is imported back into Vault. - -The cluster subordinate CA identity includes the cluster name. Example common -names: - -```text -glab platform Vault CA -glab nonprod Vault CA -glab prod Vault CA -``` - -Vault PKI roles issue short-lived certificates for internal use cases such as: - -- service-to-service mTLS -- database client authentication -- internal controllers that need X.509 credentials -- future SPIRE upstream authority material - -### SPIFFE and SPIRE - -SPIRE is a future addition, not a baseline dependency. - -The first SPIRE deployment in each cluster uses an independent trust domain and -does not federate with other clusters. That keeps `platform`, `nonprod`, and -`prod` aligned with the Vault isolation model. - -When SPIRE is introduced, the preferred shape is: - -```text -cluster Vault CA -> SPIRE intermediate -> workload SVIDs -``` - -The SPIRE intermediate is local to the cluster. Federation is deferred until a -real cross-cluster workload requires it. - -## Retiring step-ca - -The older architecture placed `Smallstep step-ca` on `VP6630` as the online -internal intermediate CA. That was the right bootstrap-oriented first shape, -but it is not the target model for this design. - -In this design: - -- public HTTP TLS moves to Let's Encrypt and Route 53 DNS-01 -- internal runtime issuance moves to per-cluster Vault -- future workload identity moves to per-cluster SPIRE -- `step-ca` is removed once its remaining consumers have migrated - -The root CA migration is the natural time to make this break. Old chains can -expire or be replaced as consumers move to the new issuers. - -## Implementation Slices - -This design should be implemented in small slices: - -1. Rewrap existing SOPS files with `alias/glab-sops`, add encryption context, - and remove PGP/age recipients. -2. Rotate bootstrap secrets that previously depended on PGP/age-only history. -3. 
Create the GitHub App + SSM bootstrap path for `secrets` repo access. -4. Recreate the internal root CA in the current `lab` AWS account with the new - path length. -5. Clean up the old-account PKI root after the new root is usable. -6. Add the shared Vault unseal KMS key and bank-vaults storage layout. -7. Stand up Vault in one cluster and prove SOPS -> Vault bootstrap handoff. -8. Add cert-manager DNS-01 with Route 53 ACME delegation for one cluster. -9. Migrate internal PKI consumers away from `step-ca`. - -## Open Threads - -- Exact KMS encryption context keys and scope names. -- Exact GitHub App name, installation ID storage, and SSM parameter path. -- Whether Vault unseal material uses one shared S3 bucket with prefixes or - separate per-cluster buckets. -- Whether the new root CA should use `pathlen:2` exactly or a larger value. -- How trust bundles are distributed to workloads that need to trust internal - Vault or SPIRE issuers. -- Whether public TLS certificates should ever use wildcards. -- When old `step-ca` certificates are allowed to expire versus actively - replaced. - -## References - -- [AWS Lab Account](./aws-lab-account.md) -- [Keycloak](./keycloak.md) -- [Multi-Cluster GitOps Model](./gitops-multi-cluster.md) -- [cert-manager Route 53 DNS-01](https://cert-manager.io/docs/configuration/acme/dns01/route53/) -- [cert-manager delegated DNS-01](https://cert-manager.io/docs/configuration/acme/dns01/#delegated-domains-for-dns01) -- [Vault PKI secrets engine](https://developer.hashicorp.com/vault/docs/secrets/pki) -- [Vault PKI intermediate guidance](https://developer.hashicorp.com/vault/docs/secrets/pki/considerations) -- [Bank-Vaults unseal keys](https://bank-vaults.dev/docs/concepts/unseal-keys/) -- [SPIRE configuration](https://spiffe.io/docs/latest/deploying/configuring/) diff --git a/docs/docs/designs/service-exposure-and-control-plane-endpoints.md b/docs/docs/designs/service-exposure-and-control-plane-endpoints.md deleted file mode 100644 index 85d56c2..0000000 --- a/docs/docs/designs/service-exposure-and-control-plane-endpoints.md +++ /dev/null @@ -1,164 +0,0 @@ ---- -title: Service Exposure and Control Plane Endpoints -description: Proposed design for service VIP exposure and API endpoint HA on future multi-node Talos clusters in the lab. ---- - -# Service Exposure and Control Plane Endpoints - -## Status - -Proposed. - -This document defines how future multi-node clusters in the lab should expose -application traffic and how their external API endpoints should be formed. It -separates service and ingress VIPs from the Kubernetes API endpoint and from -the Talos API endpoint so those concerns do not blur together in later design -or implementation work. - -## Purpose - -The primary purpose of this design is to make the cluster entrypoint model -explicit and consistent across the lab. - -The intended split is: - -- Cilium provides service and ingress VIPs -- the `VP6630` acts as the upstream BGP peer for those VIPs -- Talos provides the external Kubernetes API endpoint via a shared VIP for - multi-node clusters on shared Layer 2 -- Talos API access continues to use direct control-plane endpoints by default - -This keeps service exposure, Kubernetes API HA, and Talos API access on -separate mechanisms that match what each layer is actually good at. - -## Goals - -- Standardize service exposure for future multi-node clusters. -- Standardize the external Kubernetes API endpoint shape for future multi-node - Talos clusters. 
-- Keep KubePrism enabled as the internal HA endpoint for host-network - components. -- Avoid mixing application VIPs with control-plane API endpoint design. - -## Non-Goals - -- This document does not define concrete Helm values for Cilium. -- This document does not define exact VyOS CLI or HAProxy configuration. -- This document does not define concrete Talos or CAPI manifests. -- This document does not change the current single-node platform cluster into - an HA cluster today. - -## Problem Split - -There are three separate networking problems here: - -1. **Service and ingress exposure** - - how traffic from outside a cluster reaches workloads inside it -2. **Kubernetes API endpoint** - - the canonical `https://...:6443` endpoint for a cluster -3. **Talos API endpoint** - - how operators reach the Talos API on port `50000` - -The lab should not try to solve all three with one mechanism. - -KubePrism is related, but it is a fourth, internal concern: - -- **Internal API consumers** - - host-network components such as Cilium or control-plane processes needing a - resilient in-cluster API path - -## Chosen Model - -### Service and Ingress VIPs - -Future multi-node clusters use: - -- Cilium LB IPAM for allocating service VIPs -- Cilium BGP Control Plane for advertising those VIPs -- the `VP6630` as the upstream BGP peer - -This applies to stable external entrypoints such as: - -- `LoadBalancer` Services -- ingress controller VIPs -- Gateway API data-plane entrypoints - -These VIPs are advertised as **service routes**, not PodCIDRs. With the -current Cilium `ipam.mode=kubernetes` assumption, PodCIDR advertisement is not -part of the design. - -### Internal API Consumers - -Talos clusters keep KubePrism enabled. - -KubePrism is the internal HA endpoint for host-network consumers of the -Kubernetes API, including Cilium. It is not the external cluster endpoint used -by operators or external clients. - -### Kubernetes API Endpoint - -Future multi-node Talos clusters use: - -- Talos VIP for the canonical Kubernetes API endpoint - -This means each multi-node cluster gets one canonical external endpoint of the -form: - -- `https://:6443` - -The endpoint is backed by a Talos virtual IP shared by the control-plane nodes. -This is the default cluster API HA model as long as the control-plane nodes -share a Layer 2 domain. - -### Talos API Endpoint - -The default Talos API model remains: - -- direct control-plane node endpoints - -An optional future enhancement is: - -- VyOS TCP load balancing for the Talos API - -That is intentionally not part of the baseline design for now. - -## Supporting Assumptions - -The chosen model assumes: - -- multi-node Talos control planes share a Layer 2 domain when Talos VIP is - used -- Cilium peers with the `VP6630` over BGP -- service and ingress VIPs are external traffic entrypoints, not the mechanism - for Kubernetes API HA -- Talos VIP is never used as the Talos API endpoint -- KubePrism stays enabled and Cilium is configured to use it for internal API - access - -If a future multi-node cluster does not have shared Layer 2 for its control -planes, the Kubernetes API endpoint strategy must be revisited explicitly. That -case is outside this baseline decision. 
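Because this document intentionally stops short of concrete manifests, the
following is only a sketch of the two separate mechanisms — a Talos shared VIP for
the Kubernetes API and a Cilium-managed pool for service VIPs. The interface name,
addresses, and pool range are placeholders, and the exact Cilium CRD field names
depend on the Cilium release in use:

```yaml
# Talos machine config fragment (control-plane nodes): the shared VIP that
# backs the canonical https://<cluster-vip>:6443 endpoint on shared Layer 2.
machine:
  network:
    interfaces:
      - interface: eth0        # placeholder interface name
        dhcp: true
        vip:
          ip: 192.0.2.10       # placeholder cluster API VIP
---
# Cilium LB IPAM pool (illustrative): the range Cilium allocates service and
# ingress VIPs from before advertising them to the VP6630 over BGP.
apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: service-vips
spec:
  blocks:
    - cidr: 192.0.2.64/27      # placeholder service VIP range
```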
- -## Cluster Scope - -This decision applies to: - -- future multi-node downstream clusters such as `nonprod` and `prod` -- any future multi-node platform cluster, if the platform cluster ever stops - being single-node - -This decision does **not** describe the current live platform cluster, which -remains single-node on the `UM760` and therefore does not yet exercise the HA -endpoint pattern. - -## Relationship to Other Designs - -This design builds on: - -- [Bootstrap and Core Delivery Model](./bootstrap-core-delivery.md) for day-0 - and day-1 cluster bring-up -- [Multi-Cluster GitOps Model](./gitops-multi-cluster.md) for cluster roles and - GitOps scope - -This document defines the intended endpoint model once a cluster is beyond -bootstrap and has become a real multi-node control plane. diff --git a/docs/docs/index.md b/docs/docs/index.md index b77794f..7bbb1e2 100644 --- a/docs/docs/index.md +++ b/docs/docs/index.md @@ -8,12 +8,22 @@ description: Architecture, hardware, and operating notes for the GilmanLab homel This site is the primary documentation surface for the GilmanLab homelab. -Start with: +Start with the architecture set: - [Architecture overview](./architecture.md) -- [Design documents](./designs/) +- [Hosts and substrate](./architecture/hosts-and-substrate.md) +- [Bootstrap and cluster lifecycle](./architecture/bootstrap-and-lifecycle.md) +- [Networking and endpoints](./architecture/networking-and-endpoints.md) +- [GitOps and platform APIs](./architecture/gitops-and-platform-apis.md) +- [Secrets, identity, DNS, and PKI](./architecture/secrets-identity-pki.md) +- [Keycloak runtime](./architecture/keycloak-runtime.md) +- [State and recovery](./architecture/state-and-recovery.md) + +Supporting references: + - [Hardware reference](./hardware.md) - [Network device backups](./network-device-backups.md) - [RouterOS ACME certificates](./routeros-acme.md) -More runbooks, decisions, and operating guides will live here as the lab grows. +Runbooks, exact commands, and implementation details will be added as the lab +is prototyped and built. diff --git a/docs/docs/routeros-acme.md b/docs/docs/routeros-acme.md index 94a1e63..dfb7a8e 100644 --- a/docs/docs/routeros-acme.md +++ b/docs/docs/routeros-acme.md @@ -5,6 +5,16 @@ description: How the MikroTik router and switch get WebFig HTTPS certificates fr # RouterOS ACME Certificates +:::note + +This page documents RouterOS certificates that still use `step-ca`. The +architecture docs define the steady-state PKI path: public HTTP TLS uses Let's +Encrypt with Route 53 DNS-01, and internal runtime PKI uses Vault +intermediates. Keep this page as an operational reference until these RouterOS +consumers are migrated. + +::: + The `CCR2004` home router and `CRS309` lab switch use the lab `step-ca` intermediate for WebFig HTTPS certificates.