diff --git a/docs/docs/architecture.md b/docs/docs/architecture.md index 4e10a8a..fb7fc35 100644 --- a/docs/docs/architecture.md +++ b/docs/docs/architecture.md @@ -110,7 +110,8 @@ The current cabling intent is: - Each `MS-02 Ultra` uses one `25GbE` link to the `CRS309` - A future second link per node may be dedicated to storage and/or Proxmox clustering traffic - Intel AMT links terminate on the `TL-SG105` -- The `VP6630` remains the routing boundary and provides optional BGP-based floating IP advertisement for downstream clusters +- The `VP6630` remains the routing boundary and is the intended upstream BGP + peer for Cilium-advertised service VIPs on future multi-node clusters The baseline network model is intentionally smaller than the previous lab design, but it still preserves a dedicated Layer 2 provisioning domain for Tinkerbell: @@ -303,7 +304,16 @@ The desired outcome is that the clustered Proxmox layer exposes a stable substra After creation, downstream clusters are treated as strongly isolated environments rather than extensions of the platform cluster. -Downstream clusters may make use of BGP-advertised floating IPs through the `VP6630` when they need stable network entrypoints, but this is an available capability rather than a default requirement. +The intended multi-node cluster endpoint model is: + +- Cilium LB IPAM plus Cilium BGP peering with the `VP6630` for service and + ingress VIPs +- Talos VIP for the canonical Kubernetes API endpoint on shared Layer 2 +- direct control-plane endpoints for the Talos API by default + +This is the intended standard for downstream clusters and for any future +multi-node platform cluster, but it is not live on the current platform +cluster while that cluster remains single-node. ## Role of GitOps @@ -409,7 +419,9 @@ This layout separates concerns in a way that matches the intended operating mode - TerraKube remains available as a future addition for complementary infrastructure automation when that need becomes concrete - CAPI becomes the main abstraction for downstream cluster creation and scaling - Argo CD keeps the platform cluster declarative -- BGP floating IPs remain a downstream-cluster capability rather than part of the platform cluster's default exposure model +- multi-node clusters are intended to use Cilium+BGP for service VIPs and + Talos VIP for the Kubernetes API endpoint, while the current platform cluster + remains single-node The design also avoids forcing too much day-one complexity into the Proxmox layer. The nodes can start as individually useful machines before later being combined into a more integrated Proxmox topology. diff --git a/docs/docs/designs/bootstrap-core-delivery.md b/docs/docs/designs/bootstrap-core-delivery.md new file mode 100644 index 0000000..2d6e7ba --- /dev/null +++ b/docs/docs/designs/bootstrap-core-delivery.md @@ -0,0 +1,356 @@ +--- +title: Bootstrap and Core Delivery Model +description: Proposed design for day-0 substrate and day-1 cluster-core delivery across the platform, nonprod, and prod clusters. +--- + +# Bootstrap and Core Delivery Model + +## Status + +Proposed. + +This document defines how the lab brings clusters to life before the reusable +`kro` API layer becomes active. It covers the narrow Talos/CAPI bootstrap path, +the reusable cluster-core components that GitOps manages afterward, and the +handoff from bootstrap artifacts to steady-state Argo CD ownership. 
+ +## Purpose + +The primary purpose of this design is to keep bootstrap delivery, +cluster-core reuse, and platform API ownership clearly separated. + +The intended split is: + +- the `platform` repo owns canonical bootstrap/core component inputs and + rendered bootstrap artifacts +- the `gitops` repo owns per-cluster version selection and cluster-local + desired state after bootstrap +- the `infra` repo and CAPI templates own immutable Talos day-0 references for + fresh installs and reinstalls + +This lets the lab reuse Cilium, Argo CD, and `kro` consistently across the +platform, `nonprod`, and `prod` clusters without copying their canonical install +artifacts into `gitops`. + +## Goals + +- Keep canonical bootstrap/core component source out of the `gitops` repo. +- Make platform-cluster bootstrap and downstream CAPI cluster creation + reproducible from versioned artifacts. +- Keep day-0 substrate narrow and explicit. +- Make day-2 ownership belong to Argo CD rather than to Talos/CAPI bootstrap + references. +- Start `kro` only at the first real platform API boundary. + +## Non-Goals + +- This document does not define the exact Helm values for Cilium, Argo CD, or + `kro`. +- This document does not define the long-term service exposure or control-plane + endpoint strategy for clusters after bootstrap. +- This document does not define the CI workflow implementation for rendering, + validation, or publishing. +- This document does not define the exact `Platform` schema. +- This document does not change the current architecture assumption that only + the platform cluster runs Argo CD. + +## Design Summary + +The intended cluster bring-up model is: + +- every cluster boots with a day-0 substrate +- reusable day-1 cluster-core components are then installed by GitOps +- the reusable `kro` platform API begins only after those prerequisites exist + +The three layers are: + +1. **Day-0 substrate** + - components required before GitOps or higher-level APIs can act + - includes Cilium on every cluster + - includes minimal Argo CD and root-app seeding only on the platform + cluster +2. **Day-1 cluster-core** + - reusable cluster components managed by GitOps but not exposed as + consumer-facing platform APIs + - includes the full Cilium install and `kro` + - includes Argo CD self-management on the platform cluster +3. **Platform API** + - released RGD bundles and the cluster-local `Platform` custom resource + - starts only after the day-1 cluster-core layer is present + +The following are intentionally **not** modeled as `kro` APIs: + +- Cilium +- Argo CD +- `kro` itself + +They are reusable installable cluster primitives, not consumer-facing platform +APIs. + +The long-term service exposure and control-plane endpoint model for those +clusters is defined in +[Service Exposure and Control Plane Endpoints](./service-exposure-and-control-plane-endpoints.md). + +## Cluster Flows + +### Platform Cluster + +The platform cluster bootstrap flow is: + +1. Talos installs bootstrap-safe Cilium from an immutable artifact reference. +2. Talos installs minimal Argo CD from an immutable artifact reference. +3. Talos seeds the admin-owned root `Application`. +4. The root app syncs `clusters/platform/bootstrap.yaml` from `gitops`. +5. That per-cluster bootstrap selection installs: + - full/self-managed Argo CD + - full Cilium + - `kro` +6. After `kro` is present, `clusters/platform/platform/` installs the selected + released RGD bundles and the cluster-local `Platform` custom resource. 
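+
+As a rough sketch of steps 3 and 4 above, the Talos-seeded root `Application`
+might look like the example below. The repository URL, tracking revision,
+project, and sync policy are illustrative assumptions, as is the use of a
+`directory.include` filter; only the pointer at
+`clusters/platform/bootstrap.yaml` in `gitops` is part of this design.
+
+```yaml
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: root                     # admin-owned root app seeded by Talos
+  namespace: argocd
+spec:
+  project: default
+  source:
+    repoURL: https://example.invalid/lab/gitops.git  # hypothetical gitops repo URL
+    targetRevision: main                             # assumed tracking branch
+    path: clusters/platform
+    directory:
+      include: bootstrap.yaml    # sync only the per-cluster bootstrap selection
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: argocd
+  syncPolicy:
+    automated:
+      prune: true
+      selfHeal: true
+```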
+ +### Downstream Clusters + +The downstream `nonprod` and `prod` cluster flow is: + +1. CAPI/Talos installs bootstrap-safe Cilium from an immutable artifact + reference. +2. The platform-cluster Argo CD instance registers or reaches the new cluster. +3. Argo CD syncs `clusters//bootstrap.yaml` from `gitops`. +4. That per-cluster bootstrap selection installs: + - full Cilium + - `kro` +5. After `kro` is present, `clusters//platform/` installs the selected + released RGD bundles and the cluster-local `Platform` custom resource. + +Downstream clusters do **not** bootstrap their own Argo CD instances under the +current design. + +## Ownership and Repository Boundaries + +### Platform Repo + +The `platform` repo is the source of truth for reusable bootstrap/core +artifacts. + +It owns: + +- canonical Helm values for reusable cluster primitives +- bootstrap-safe rendered manifests for Talos/CAPI day-0 consumption +- component-scoped Argo application definitions that use the same canonical + source inputs +- version tags and release history for those artifacts + +It does not own per-cluster version selection or cluster-local desired state. + +### GitOps Repo + +The `gitops` repo is the source of truth for which version of each reusable +bootstrap/core component a cluster should run after GitOps takes over. + +It owns: + +- `clusters//bootstrap.yaml` for per-cluster bootstrap/core version + selection +- `clusters//platform/` for released RGD bundle installation and the + cluster-local `Platform` custom resource +- any other cluster-local desired state after bootstrap + +It does not own the canonical rendered source for reusable bootstrap/core +components. + +### Infra Repo and CAPI Templates + +The `infra` repo and CAPI templates own only the immutable day-0 references +needed to create or reinstall clusters. + +They own: + +- Talos machine-config references for platform-cluster day-0 artifacts +- CAPI/Talos template references for downstream-cluster day-0 artifacts + +They do not own day-2 change management for those components. + +## Canonical Artifact Layout + +The `platform/bootstrap/` subtree carries both Talos/CAPI day-0 substrate +artifacts and reusable day-1 cluster-core components. The name does not imply +that every component there is consumed directly by Talos. 
+ +The intended `platform` repo layout is: + +```text +platform/ +└── bootstrap/ + ├── cilium/ + │ ├── values/ + │ │ ├── base.yaml + │ │ ├── bootstrap-overrides.yaml + │ │ └── full-overrides.yaml + │ ├── render/ + │ │ ├── bootstrap.yaml + │ │ └── full.yaml + │ └── app.yaml + ├── argocd/ + │ ├── values/ + │ │ ├── base.yaml + │ │ ├── bootstrap-overrides.yaml + │ │ └── full-overrides.yaml + │ ├── render/ + │ │ ├── bootstrap.yaml + │ │ └── full.yaml + │ └── app.yaml + └── kro/ + ├── values/ + │ ├── base.yaml + │ └── full-overrides.yaml + ├── render/ + │ └── full.yaml + └── app.yaml +``` + +The intended semantics are: + +- `values/base.yaml`: shared component baseline +- `values/bootstrap-overrides.yaml`: bootstrap-only overrides needed for + Talos/CAPI-safe day-0 delivery +- `values/full-overrides.yaml`: steady-state overrides for the GitOps-managed + install +- `render/bootstrap.yaml`: the immutable raw manifest Talos/CAPI consumes for + day-0 bring-up +- `render/full.yaml`: the fully rendered steady-state manifest for review and + validation parity with the Helm-driven install +- `app.yaml`: the component-scoped Argo CD application definition using the + canonical chart, version, and full values inputs + +The per-cluster `clusters//bootstrap.yaml` resources in `gitops` +remain the only cluster-specific version-selection surface. They pin +destination and `targetRevision` while reusing the canonical component source +shape defined in the matching `platform/bootstrap//app.yaml`. + +`kro` has no Talos/CAPI bootstrap variant in the current design, so it does not +need `bootstrap-overrides.yaml` or `render/bootstrap.yaml`. + +The intended `gitops` repo surface is: + +```text +gitops/ +└── clusters/ + ├── platform/ + │ ├── bootstrap.yaml + │ └── platform/ + │ ├── rgds-platform.yaml + │ ├── rgds-apps.yaml + │ └── platform.yaml + ├── nonprod/ + │ ├── bootstrap.yaml + │ └── platform/ + │ ├── rgds-platform.yaml + │ ├── rgds-apps.yaml + │ └── platform.yaml + └── prod/ + ├── bootstrap.yaml + └── platform/ + ├── rgds-platform.yaml + ├── rgds-apps.yaml + └── platform.yaml +``` + +Each `bootstrap.yaml` is admin-owned and selects which released version of the +reusable bootstrap/core components a cluster should adopt. + +## Versioning and Promotion Rules + +The intended versioning model is: + +1. Change the canonical values or source inputs in `platform`. +2. Re-render `render/bootstrap.yaml` and `render/full.yaml` from those pinned + inputs. +3. Cut a versioned `platform` release tag. +4. Bump each cluster's `clusters//bootstrap.yaml` in `gitops` to the + selected tag. +5. If a day-0 artifact changed, also bump the immutable bootstrap artifact + references in: + - platform-cluster Talos config in `infra` + - downstream-cluster CAPI templates + +The versioning rules are: + +- cluster selections happen in `gitops` +- Talos/CAPI raw artifact URLs use immutable commit SHAs +- human-facing release selection happens by tag +- tags must be treated as immutable once published + +The SHA referenced by Talos/CAPI must correspond to the released artifact +selected for that version, even if GitOps later advances clusters at different +cadences. 
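+
+As a sketch of how the SHA-pinning rule above might surface in the
+platform-cluster Talos machine config, assuming the `platform` repo is served
+from a raw-file host addressable by commit SHA (the host, path root, and
+`<commit-sha>` below are placeholders, and `cluster.inlineManifests` would be
+an equally valid delivery mechanism):
+
+```yaml
+cluster:
+  extraManifests:
+    # day-0 artifacts pinned to an immutable commit SHA, never a branch or tag
+    - https://raw.example.invalid/lab/platform/<commit-sha>/bootstrap/cilium/render/bootstrap.yaml
+    - https://raw.example.invalid/lab/platform/<commit-sha>/bootstrap/argocd/render/bootstrap.yaml
+```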
+ +## Bootstrap-Safe Versus Full Installs + +### Cilium + +Cilium has two delivery shapes: + +- **bootstrap-safe** + - used by Talos/CAPI day-0 bootstrap + - must preserve the intended steady-state core datapath behavior + - must disable secret-producing features so the rendered manifest is safe to + host at a public immutable URL +- **full** + - used by the GitOps-managed day-1/day-2 install + - may enable observability and TLS features that create or depend on secret + material + +Bootstrap Cilium is intentionally not a separate product. It is the steady-state +core datapath intent plus a small, explicit set of bootstrap-only exceptions. + +### Argo CD + +Argo CD also has two delivery shapes on the platform cluster: + +- **bootstrap** + - minimal install sufficient to run the root app +- **full** + - self-managed steady-state Argo CD installed by GitOps + +Downstream clusters do not use an Argo CD bootstrap variant under the current +design. + +### kro + +`kro` has only a full GitOps-managed install in this design. It is not Talos +day-0 substrate. + +## Ownership Handoff + +Talos/CAPI bootstrap and Argo CD do not share day-2 ownership equally. + +The intended ownership handoff is: + +- Talos/CAPI bootstrap gets the cluster alive +- Argo CD becomes the steady-state owner of full Cilium, Argo CD, and `kro` +- day-2 changes are made by updating `platform` inputs and the per-cluster + selections in `gitops`, not by editing Talos/CAPI day-0 URLs + +The Talos/CAPI references remain narrow and reinstall-focused. They exist so a +fresh cluster can boot, not so Talos/CAPI become the long-term control plane +for those components or define the cluster's steady-state external service and +API endpoint model. + +## Relationship to Other Designs + +This design builds on: + +- [Multi-Cluster GitOps Model](./gitops-multi-cluster.md) for cluster topology, + Argo scope, and application flow +- [Service Exposure and Control Plane Endpoints](./service-exposure-and-control-plane-endpoints.md) + for the steady-state external service and API endpoint model once bootstrap is + complete +- [Platform RGD Delivery Model](./platform-rgd-delivery.md) for released RGD + bundle delivery after `kro` is already present +- [kro Consumption Model](./kro-consumption-model.md) for the ownership split + between platform-owned APIs, developer release intent, and GitOps + materialization + +This document starts before those other designs. It ends at the point where a +cluster already has its reusable cluster-core components and is ready to consume +released RGD bundles and cluster-local platform APIs. diff --git a/docs/docs/designs/gitops-multi-cluster.md b/docs/docs/designs/gitops-multi-cluster.md index 949fc83..36cda04 100644 --- a/docs/docs/designs/gitops-multi-cluster.md +++ b/docs/docs/designs/gitops-multi-cluster.md @@ -85,6 +85,12 @@ The high-level split is: - team boundary: Capsule tenant per team per workload cluster - workload boundary: namespace per `team-app-env` +For future multi-node clusters, service and ingress VIPs are intended to use +Cilium LB IPAM plus Cilium BGP peering with the `VP6630`, while the canonical +Kubernetes API endpoint is intended to use Talos VIP. The control-plane +endpoint model is defined in +[Service Exposure and Control Plane Endpoints](./service-exposure-and-control-plane-endpoints.md). 
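+
+As an illustration only, the Cilium side of that intent might look like the
+sketch below. The CIDRs, ASNs, and peer address are placeholders, the exact
+CRDs and field names vary across Cilium releases (newer releases replace
+`CiliumBGPPeeringPolicy` with `CiliumBGPClusterConfig`), and the authoritative
+shape belongs to the linked design rather than to this document.
+
+```yaml
+apiVersion: cilium.io/v2alpha1
+kind: CiliumLoadBalancerIPPool
+metadata:
+  name: default-pool
+spec:
+  blocks:
+    - cidr: 192.0.2.0/24              # placeholder service VIP range
+---
+apiVersion: cilium.io/v2alpha1
+kind: CiliumBGPPeeringPolicy
+metadata:
+  name: vp6630-peering
+spec:
+  nodeSelector:
+    matchLabels:
+      kubernetes.io/os: linux
+  virtualRouters:
+    - localASN: 64513                 # placeholder private ASN for the cluster
+      exportPodCIDR: false            # advertise service VIPs only, not PodCIDRs
+      serviceSelector:                # documented "match everything" expression
+        matchExpressions:
+          - {key: somekey, operator: NotIn, values: ["never-used-value"]}
+      neighbors:
+        - peerAddress: 192.0.2.1/32   # placeholder VP6630 peer address
+          peerASN: 64512              # placeholder private ASN for the VP6630
+```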
+ ## Cluster Roles ### Platform Cluster @@ -214,16 +220,16 @@ gitops/ │ │ └── teamb-appb2/ ├── clusters/ │ ├── platform/ +│ │ ├── bootstrap.yaml │ │ ├── platform/ -│ │ │ ├── kro.yaml │ │ │ ├── rgds-platform.yaml │ │ │ ├── rgds-apps.yaml │ │ │ └── platform.yaml │ │ ├── policies/ │ │ └── shared/ │ ├── nonprod/ +│ │ ├── bootstrap.yaml │ │ ├── platform/ -│ │ │ ├── kro.yaml │ │ │ ├── rgds-platform.yaml │ │ │ ├── rgds-apps.yaml │ │ │ └── platform.yaml @@ -233,8 +239,8 @@ gitops/ │ │ ├── policies/ │ │ └── shared/ │ └── prod/ +│ ├── bootstrap.yaml │ ├── platform/ -│ │ ├── kro.yaml │ │ ├── rgds-platform.yaml │ │ ├── rgds-apps.yaml │ │ └── platform.yaml @@ -260,8 +266,10 @@ gitops/ The ownership model is: - `platform/`: platform-cluster control-plane state -- `clusters/*/platform/`: cluster-local `kro` bootstrap, released RGD bundle - installation, and cluster-local `Platform` instances +- `clusters/*/bootstrap.yaml`: per-cluster version selection for reusable + bootstrap/core components delivered from the `platform` repo +- `clusters/*/platform/`: released RGD bundle installation and cluster-local + `Platform` instances after the bootstrap/core layer is present - `clusters/*/capsule`, `clusters/*/policies`, and `clusters/*/shared`: workload-cluster shared state - `teams/`: team-owned application instances @@ -274,10 +282,13 @@ It syncs: - `platform/argocd`, `platform/capi`, and `platform/kargo` to the platform cluster +- `clusters/platform/bootstrap.yaml` to the platform cluster - `clusters/platform/platform` to the platform cluster +- `clusters/nonprod/bootstrap.yaml` to the `nonprod` cluster - `clusters/nonprod/platform`, `clusters/nonprod/capsule`, `clusters/nonprod/policies`, and `clusters/nonprod/shared` to the `nonprod` cluster +- `clusters/prod/bootstrap.yaml` to the `prod` cluster - `clusters/prod/platform`, `clusters/prod/capsule`, `clusters/prod/policies`, and `clusters/prod/shared` to the `prod` cluster - `teams/*/*/envs/dev`, `teams/*/*/envs/staging`, and @@ -289,12 +300,17 @@ The intended Argo shape is: - one `AppProject` per team - `ApplicationSet` for platform-owned fleet generation - one admin-owned bootstrap `Application` per cluster for - `clusters//platform/` + `clusters//bootstrap.yaml` - `Application` resources kept in the `argocd` namespace -Within each `clusters//platform/` directory, sync waves should order -objects so `kro` installs first, the released RGD bundles install second, and -the cluster-local `Platform` instance is created last. +Each `clusters//bootstrap.yaml` selects the version of the reusable +bootstrap/core components for that cluster. The full bootstrap/core delivery +sequence is defined in +[Bootstrap and Core Delivery Model](./bootstrap-core-delivery.md). + +Once the bootstrap/core layer is in place, `clusters//platform/` +holds the released RGD bundle installation and the cluster-local `Platform` +instance. Do not rely on application CRs scattered across arbitrary namespaces as the default control model. 
The central `argocd` namespace is simpler unless a later @@ -308,14 +324,17 @@ The intended pattern is: - shared RGD source and release lifecycle live in the `platform` repo - cluster-local RGD bundle installation and cluster-local platform instances - live under `clusters//platform/` + live under `clusters//platform/` after the bootstrap/core layer has + already installed `kro` - environment-specific application custom resources live under `teams/` - Argo CD syncs the YAML - versioned RGD bundles are installed from OCI artifacts - `kro` expands the custom resources into the Kubernetes objects they own The platform-side release, CUE authoring, and OCI publication model is defined -in [Platform RGD Delivery Model](./platform-rgd-delivery.md). +in [Platform RGD Delivery Model](./platform-rgd-delivery.md). The preceding +bootstrap/core delivery layer is defined in +[Bootstrap and Core Delivery Model](./bootstrap-core-delivery.md). An environment-specific application resource should be narrow and explicit. For example: diff --git a/docs/docs/designs/index.md b/docs/docs/designs/index.md index 277f872..e22fc70 100644 --- a/docs/docs/designs/index.md +++ b/docs/docs/designs/index.md @@ -17,6 +17,8 @@ Use these documents when: Current designs: +- [Bootstrap and Core Delivery Model](./bootstrap-core-delivery.md) +- [Service Exposure and Control Plane Endpoints](./service-exposure-and-control-plane-endpoints.md) - [Multi-Cluster GitOps Model](./gitops-multi-cluster.md) - [kro Consumption Model](./kro-consumption-model.md) - [Platform RGD Delivery Model](./platform-rgd-delivery.md) diff --git a/docs/docs/designs/platform-rgd-delivery.md b/docs/docs/designs/platform-rgd-delivery.md index 1ac9cd2..e348d7b 100644 --- a/docs/docs/designs/platform-rgd-delivery.md +++ b/docs/docs/designs/platform-rgd-delivery.md @@ -43,6 +43,8 @@ cluster bootstrap in Git simple enough to reason about at a glance. ## Non-Goals - This document does not define the exact `Platform` schema. +- This document does not define bootstrap/core component delivery for Cilium, + Argo CD, or `kro`. - This document does not define the full CI workflow YAML for release or publication. - This document does not define every future platform capability block. @@ -132,13 +134,16 @@ CI may import CRDs or equivalent schemas into CUE so the rendered artifact can be validated structurally before publication. Cluster-side `kro` validation is still responsible for the final semantic checks when the RGD is created. -## Cluster Consumption Model +## Cluster-local Platform API Consumption Model -The intended cluster-local bootstrap surface in `gitops` is: +This document starts after the bootstrap/core layer has already installed +`kro`. The preceding day-0/day-1 delivery sequence is defined in +[Bootstrap and Core Delivery Model](./bootstrap-core-delivery.md). + +The intended cluster-local platform API surface in `gitops` is: ```text clusters//platform/ -├── kro.yaml ├── rgds-platform.yaml ├── rgds-apps.yaml └── platform.yaml @@ -146,26 +151,20 @@ clusters//platform/ Each file has one job: -- `kro.yaml`: install `kro` itself - `rgds-platform.yaml`: install the selected released `platform-rgds` OCI artifact - `rgds-apps.yaml`: install the selected released `apps-rgds` OCI artifact - `platform.yaml`: instantiate the cluster-local `Platform` custom resource -An admin-owned Argo CD bootstrap `Application` should point at -`clusters//platform/` and use sync waves so the order is explicit: - -1. install `kro` -2. install the released RGD bundles -3. 
create the cluster-local `Platform` instance - -This keeps the cluster bootstrap surface intentionally small and makes the +This keeps the cluster-local platform API surface intentionally small and makes the chosen bundle versions obvious in Git. ## Relationship to Other Designs This design builds on: +- [Bootstrap and Core Delivery Model](./bootstrap-core-delivery.md) for day-0 + and day-1 component delivery before `kro` is present - [Multi-Cluster GitOps Model](./gitops-multi-cluster.md) for cluster topology, tenancy, and application flow - [kro Consumption Model](./kro-consumption-model.md) for the ownership split diff --git a/docs/docs/designs/service-exposure-and-control-plane-endpoints.md b/docs/docs/designs/service-exposure-and-control-plane-endpoints.md new file mode 100644 index 0000000..85d56c2 --- /dev/null +++ b/docs/docs/designs/service-exposure-and-control-plane-endpoints.md @@ -0,0 +1,164 @@ +--- +title: Service Exposure and Control Plane Endpoints +description: Proposed design for service VIP exposure and API endpoint HA on future multi-node Talos clusters in the lab. +--- + +# Service Exposure and Control Plane Endpoints + +## Status + +Proposed. + +This document defines how future multi-node clusters in the lab should expose +application traffic and how their external API endpoints should be formed. It +separates service and ingress VIPs from the Kubernetes API endpoint and from +the Talos API endpoint so those concerns do not blur together in later design +or implementation work. + +## Purpose + +The primary purpose of this design is to make the cluster entrypoint model +explicit and consistent across the lab. + +The intended split is: + +- Cilium provides service and ingress VIPs +- the `VP6630` acts as the upstream BGP peer for those VIPs +- Talos provides the external Kubernetes API endpoint via a shared VIP for + multi-node clusters on shared Layer 2 +- Talos API access continues to use direct control-plane endpoints by default + +This keeps service exposure, Kubernetes API HA, and Talos API access on +separate mechanisms that match what each layer is actually good at. + +## Goals + +- Standardize service exposure for future multi-node clusters. +- Standardize the external Kubernetes API endpoint shape for future multi-node + Talos clusters. +- Keep KubePrism enabled as the internal HA endpoint for host-network + components. +- Avoid mixing application VIPs with control-plane API endpoint design. + +## Non-Goals + +- This document does not define concrete Helm values for Cilium. +- This document does not define exact VyOS CLI or HAProxy configuration. +- This document does not define concrete Talos or CAPI manifests. +- This document does not change the current single-node platform cluster into + an HA cluster today. + +## Problem Split + +There are three separate networking problems here: + +1. **Service and ingress exposure** + - how traffic from outside a cluster reaches workloads inside it +2. **Kubernetes API endpoint** + - the canonical `https://...:6443` endpoint for a cluster +3. **Talos API endpoint** + - how operators reach the Talos API on port `50000` + +The lab should not try to solve all three with one mechanism. 
+ +KubePrism is related, but it is a fourth, internal concern: + +- **Internal API consumers** + - host-network components such as Cilium or control-plane processes needing a + resilient in-cluster API path + +## Chosen Model + +### Service and Ingress VIPs + +Future multi-node clusters use: + +- Cilium LB IPAM for allocating service VIPs +- Cilium BGP Control Plane for advertising those VIPs +- the `VP6630` as the upstream BGP peer + +This applies to stable external entrypoints such as: + +- `LoadBalancer` Services +- ingress controller VIPs +- Gateway API data-plane entrypoints + +These VIPs are advertised as **service routes**, not PodCIDRs. With the +current Cilium `ipam.mode=kubernetes` assumption, PodCIDR advertisement is not +part of the design. + +### Internal API Consumers + +Talos clusters keep KubePrism enabled. + +KubePrism is the internal HA endpoint for host-network consumers of the +Kubernetes API, including Cilium. It is not the external cluster endpoint used +by operators or external clients. + +### Kubernetes API Endpoint + +Future multi-node Talos clusters use: + +- Talos VIP for the canonical Kubernetes API endpoint + +This means each multi-node cluster gets one canonical external endpoint of the +form: + +- `https://:6443` + +The endpoint is backed by a Talos virtual IP shared by the control-plane nodes. +This is the default cluster API HA model as long as the control-plane nodes +share a Layer 2 domain. + +### Talos API Endpoint + +The default Talos API model remains: + +- direct control-plane node endpoints + +An optional future enhancement is: + +- VyOS TCP load balancing for the Talos API + +That is intentionally not part of the baseline design for now. + +## Supporting Assumptions + +The chosen model assumes: + +- multi-node Talos control planes share a Layer 2 domain when Talos VIP is + used +- Cilium peers with the `VP6630` over BGP +- service and ingress VIPs are external traffic entrypoints, not the mechanism + for Kubernetes API HA +- Talos VIP is never used as the Talos API endpoint +- KubePrism stays enabled and Cilium is configured to use it for internal API + access + +If a future multi-node cluster does not have shared Layer 2 for its control +planes, the Kubernetes API endpoint strategy must be revisited explicitly. That +case is outside this baseline decision. + +## Cluster Scope + +This decision applies to: + +- future multi-node downstream clusters such as `nonprod` and `prod` +- any future multi-node platform cluster, if the platform cluster ever stops + being single-node + +This decision does **not** describe the current live platform cluster, which +remains single-node on the `UM760` and therefore does not yet exercise the HA +endpoint pattern. + +## Relationship to Other Designs + +This design builds on: + +- [Bootstrap and Core Delivery Model](./bootstrap-core-delivery.md) for day-0 + and day-1 cluster bring-up +- [Multi-Cluster GitOps Model](./gitops-multi-cluster.md) for cluster roles and + GitOps scope + +This document defines the intended endpoint model once a cluster is beyond +bootstrap and has become a real multi-node control plane.
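+
+## Appendix: Illustrative Talos VIP Sketch
+
+The fragment below is a sketch of the Kubernetes API endpoint decision only.
+The interface name and addresses are placeholders, and it is not a definition
+of the real Talos or CAPI manifests, which remain out of scope for this
+document.
+
+```yaml
+# control-plane machine config fragment only
+machine:
+  network:
+    interfaces:
+      - interface: eth0                    # placeholder control-plane NIC
+        dhcp: true
+        vip:
+          ip: 192.0.2.40                   # placeholder shared API VIP
+cluster:
+  controlPlane:
+    endpoint: https://192.0.2.40:6443      # canonical endpoint backed by the VIP
+```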