From 7d98c7e1939e0165c20d2dcb8572bac3070b50b4 Mon Sep 17 00:00:00 2001 From: Joshua Gilman Date: Wed, 15 Apr 2026 19:46:52 -0700 Subject: [PATCH 1/2] docs(hardware): reflect four-subnet VyOS layout Update the documented gateway list and UM760 addressing to match the reduced VyOS subnet plan. --- docs/docs/hardware.md | 11 ++++------- 1 file changed, 4 insertions(+), 7 deletions(-) diff --git a/docs/docs/hardware.md b/docs/docs/hardware.md index 01e7dc7..a1c1076 100644 --- a/docs/docs/hardware.md +++ b/docs/docs/hardware.md @@ -30,12 +30,9 @@ It is intentionally descriptive rather than prescriptive: - Transit address: `10.0.0.2/30` - Hostname: `gateway` - VLAN gateways: - - `10.10.10.1` (`LAB_MGMT`) + - `10.10.10.1` (`LAB_MGMT`, management + platform) - `10.10.20.1` (`LAB_PROV`) - - `10.10.30.1` (`LAB_PLATFORM`) - - `10.10.40.1` (`LAB_CLUSTER`) - - `10.10.50.1` (`LAB_SERVICE`) - - `10.10.60.1` (`LAB_STORAGE`) + - `10.10.40.1` (`LAB_WORKLOAD`) - `10.10.70.1` (`LAB_OOB`) ### Minisforum UM760 @@ -49,9 +46,9 @@ It is intentionally descriptive rather than prescriptive: - NVMe install target noted as `>= 256GB` - Observed NIC MAC in Talos config: `38:05:25:34:25:d0` - Network details found in repo: - - Node address: `10.10.30.10/24` + - Node address: `10.10.10.10/24` - API endpoint references: - - `https://10.10.30.10:6443` + - `https://10.10.10.10:6443` - `https://platform.lab.local:6443` ### Minisforum MS-02 Ultra From 452e2c5e21bf5ad5ceed85401cf05c88eb2a951e Mon Sep 17 00:00:00 2001 From: Joshua Gilman Date: Sat, 18 Apr 2026 15:00:28 -0700 Subject: [PATCH 2/2] docs: adds disaster recovery section (#7) --- ARCHITECTURE.md | 68 +--------- README.md | 1 + docs/docs/architecture.md | 144 +++++++++++++++++++- docs/docs/index.md | 2 + docs/docs/network-device-backups.md | 201 ++++++++++++++++++++++++++++ docs/docs/routeros-acme.md | 123 +++++++++++++++++ 6 files changed, 469 insertions(+), 70 deletions(-) create mode 100644 docs/docs/network-device-backups.md create mode 100644 docs/docs/routeros-acme.md diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md index 352973c..3e8f8c5 100644 --- a/ARCHITECTURE.md +++ b/ARCHITECTURE.md @@ -3,70 +3,4 @@ The canonical version of this document now lives in [`docs/docs/architecture.md`](docs/docs/architecture.md). -Use the Docusaurus site source under `docs/` for ongoing updates. -- VM creation should rely primarily on reusable templates -- the default storage path for downstream cluster VMs is node-local NVMe -- cluster scaling should be handled through CAPI rather than ad hoc Proxmox operations -- CAPI references templates that have already been published into Proxmox - -The desired outcome is that the clustered Proxmox layer exposes a stable substrate, while CAPI owns the actual Kubernetes cluster lifecycle. - -After creation, downstream clusters are treated as strongly isolated environments rather than extensions of the platform cluster. - -Downstream clusters may make use of BGP-advertised floating IPs through the `VP6630` when they need stable network entrypoints, but this is an available capability rather than a default requirement. - -## Role of GitOps - -`Argo CD` runs on the platform cluster and manages the desired state of the control plane. - -At minimum, that includes: - -- platform cluster applications -- platform cluster infrastructure controllers -- provisioning stack configuration - -The current design keeps `Argo CD` scoped to the platform cluster itself. 
- -This means: - -- the platform cluster's `Argo CD` manages only platform services and platform-owned infrastructure components -- downstream clusters are not assumed to be centrally registered into platform `Argo CD` -- downstream clusters are expected to manage their own services independently -- whether a downstream cluster uses `Argo CD` or some other delivery model is left to that cluster's own design - -## Why This Layout - -This layout separates concerns in a way that matches the intended operating model: - -- Talos on the `UM760` keeps the control plane narrow and appliance-like -- Proxmox on the `MS-02 Ultra` nodes provides a clustered VM substrate for downstream clusters -- Tinkerbell handles bare-metal installation -- Early Proxmox clustering gives CAPI a single control surface without waiting for the full shared-storage design -- AWX fills the gap between bare-metal install and a fully configured Proxmox node -- Packer, the NAS, and AWX together provide a clear image pipeline without pushing image management into CAPI -- TerraKube remains available as a future addition for complementary infrastructure automation when that need becomes concrete -- CAPI becomes the main abstraction for downstream cluster creation and scaling -- Argo CD keeps the platform cluster declarative -- BGP floating IPs remain a downstream-cluster capability rather than part of the platform cluster's default exposure model - -The design also avoids forcing too much day-one complexity into the Proxmox layer. The nodes can start as individually useful machines before later being combined into a more integrated Proxmox topology. - -For the same reason, the platform cluster remains single-node on the `UM760`. This accepts that `UM760` failure is a platform outage, but avoids introducing a false form of HA where multiple control-plane VMs still depend on the same underlying machine. - -The same rebuild-first logic applies to recovery boundaries: - -- the platform cluster is primarily rebuilt from Talos configuration, GitOps state, and automation -- Proxmox hosts are primarily rebuilt through Tinkerbell and AWX -- downstream clusters are primarily recreated through CAPI -- NAS-backed artifacts, backups, and selected workload data form the primary durable state boundary - -## Likely Next Sections - -As the design firms up, the next useful additions to this document are likely: - -- bootstrap path for the `UM760` -- Proxmox node lifecycle in more detail -- storage model -- network model -- downstream cluster lifecycle -- backup and disaster recovery boundaries +Use the Docusaurus source under `docs/` for ongoing updates. The canonical page now includes the current DNS and naming model for `lab.gilman.io` alongside the broader platform architecture. diff --git a/README.md b/README.md index 0fe04fa..957352e 100644 --- a/README.md +++ b/README.md @@ -32,6 +32,7 @@ moon run docs:start - [`docs/docs/index.md`](docs/docs/index.md): docs landing page - [`docs/docs/architecture.md`](docs/docs/architecture.md): architecture overview - [`docs/docs/hardware.md`](docs/docs/hardware.md): hardware inventory +- [`docs/docs/network-device-backups.md`](docs/docs/network-device-backups.md): RouterOS backup design for the future platform cluster ## Support diff --git a/docs/docs/architecture.md b/docs/docs/architecture.md index f9dd499..0aa75ff 100644 --- a/docs/docs/architecture.md +++ b/docs/docs/architecture.md @@ -25,7 +25,7 @@ At a high level: - AWX and/or TerraKube handle node-level post-provisioning work. 
 - CAPI creates downstream Talos-based clusters as Proxmox VMs through the clustered Proxmox API surface.
 - The `DS923+` provides shared storage to the Proxmox layer.
-- The `VP6630` remains the lab router and network boundary to the home network.
+- The `VP6630` remains the lab router, DNS entrypoint, and network boundary to the home network.
 
 ## Design Intent
 
@@ -53,6 +53,7 @@ This cluster is intended to own the following responsibilities:
 - `Argo CD`: GitOps for the platform cluster itself, and potentially for downstream cluster registration and sync
 - `AWX`: Ansible orchestration for infrastructure tasks that are still better handled through playbooks
 - `TerraKube`: optional Terraform-based automation for future non-node-bootstrap workflows
+- network-device backup services for RouterOS configuration history and encrypted recovery artifacts
 
 This machine is not being used as a general-purpose compute node. Its purpose is to act as the lab control plane.
 
@@ -100,7 +101,7 @@ The `DS923+` is also the primary durable backup boundary in the system. Platform
 The physical network roles remain:
 
 - `CCR2004`: home router
-- `VP6630`: lab router and DMZ boundary to the home network
+- `VP6630`: lab router, DNS entrypoint, and DMZ boundary to the home network
 - `CRS309-1G-8S+IN`: lab switch
 - `TL-SG105`: dedicated Intel AMT switch for the `MS-02 Ultra` management links
 
@@ -121,6 +122,76 @@ The baseline network model is intentionally smaller than the previous lab design
 
 The provisioning network exists because Tinkerbell's DHCP and PXE flow requires Layer 2 access or DHCP relay. The current design assumes a dedicated provisioning segment rather than folding PXE traffic into the general workload path.
 
+### DNS and Naming
+
+The lab uses `lab.gilman.io` as its internal naming root rather than a private-only top-level domain.
+
+This keeps internal naming under a real domain the lab controls while still allowing private DNS views, selective public exposure later, and a clean future path for public certificates on intentionally exposed endpoints.
+
+The current intended DNS design is:
+
+- `VyOS` remains the client-facing resolver for the lab networks.
+- A `PowerDNS Authoritative` service runs as a container on the `VP6630`.
+- `VyOS` forwards the lab's internal zones to that local authoritative service.
+- Internal names are private by default; public DNS is reserved for explicitly exposed entrypoints.
+- Internal certificates are expected to use a private CA by default; public CA issuance is reserved for endpoints that benefit from public trust.
+
+Running the authoritative DNS service on the router boundary instead of inside the platform cluster avoids a bootstrap dependency where the platform control plane would need to be healthy before the lab can resolve the names used to reach it.
+
+The namespace is intentionally split by ownership boundary instead of using one flat dynamic zone:
+
+| Zone | Writers | Purpose |
+| --- | --- | --- |
+| `lab.gilman.io` | manual or GitOps-managed only | Parent zone, delegations, and a small set of static anchor records |
+| `mgmt.lab.gilman.io` | manual or GitOps-managed only | Stable management and platform service names |
+| `dhcp.lab.gilman.io` | `VyOS` DHCP via `RFC2136` | Dynamic lease-driven hostnames |
+| `<cluster>.k8s.lab.gilman.io` | `ExternalDNS` via `RFC2136` | Cluster-scoped workload and ingress names |
+
+This design keeps the management namespace stable while still allowing dynamic DNS for both DHCP clients and Kubernetes workloads.
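+
+The write split can be exercised directly with `nsupdate` once the zones exist. A minimal check, assuming a hypothetical `externaldns-alpha` TSIG key that the authoritative service only authorizes for an `alpha` cluster subzone (the server address, port, and test records are illustrative):
+
+```bash
+# Push a test record through RFC2136 with the cluster-scoped key.
+nsupdate -y "hmac-sha256:externaldns-alpha:${TSIG_SECRET}" <<'EOF'
+server 10.10.10.1 5300
+zone alpha.k8s.lab.gilman.io
+update add demo.alpha.k8s.lab.gilman.io 300 A 10.10.40.50
+send
+EOF
+
+# Confirm the record resolves through the client-facing VyOS resolver.
+dig +short demo.alpha.k8s.lab.gilman.io @10.10.10.1
+
+# The same key must be refused when pointed at the management zone.
+nsupdate -y "hmac-sha256:externaldns-alpha:${TSIG_SECRET}" <<'EOF'
+server 10.10.10.1 5300
+zone mgmt.lab.gilman.io
+update add rogue.mgmt.lab.gilman.io 300 A 10.10.40.51
+send
+EOF
+```
+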
+The split also keeps update rights narrow: `VyOS` DHCP cannot mutate management records, and each Kubernetes cluster can be constrained to only its own delegated subzone.
+
+### Internal PKI and Trust
+
+The lab's internal PKI is designed around the same bootstrap constraint as internal DNS: the trust anchor for internal services cannot depend on the platform cluster being healthy before it can issue or rotate certificates.
+
+The current intended PKI design is:
+
+- The internal root CA key lives in `AWS KMS`.
+- The root CA is treated as operationally offline: no always-on lab service has standing permission to use it for routine issuance.
+- A `Smallstep step-ca` service runs as a container on the `VP6630` as the online intermediate CA.
+- Internal ACME is provided by that `step-ca` instance for automated certificate issuance and renewal.
+- `Vault` remains the expected long-term home for most secret management inside the platform cluster, but it is not the bootstrap owner of the internal CA hierarchy.
+
+This keeps naming and trust in the same edge-adjacent failure domain without forcing the platform cluster to come up first. If the platform cluster is down, the lab can still resolve internal names and issue or renew the certificates needed to restore that control plane.
+
+The intended trust boundary is deliberately split:
+
+| Component | Role | Notes |
+| --- | --- | --- |
+| Root CA | trust anchor | Stored in `AWS KMS`; used only for intermediate issuance and rotation |
+| `step-ca` on `VP6630` | online issuing intermediate | Handles day-to-day certificate issuance for internal services |
+| ACME clients | automated consumers | Used by `cert-manager` and other internal services that can rotate through ACME |
+| `Vault` | secret management consumer | May issue or store service-specific material later, but does not own bootstrap PKI |
+
+This design accepts that routing, internal DNS, and the online intermediate CA share the `VP6630` failure domain. That is an intentional trade for the homelab: a single edge host keeps the bootstrap path simple, while the root CA remains outside that host's routine operating privileges.
+
+### Network Device Backups
+
+Network-device backup collection belongs in the platform cluster once that
+cluster is online.
+
+The first target devices are the MikroTik `CRS309` lab switch and `CCR2004` home
+router. The durable flow should use `Oxidized` for RouterOS collection and a
+small SOPS-aware writer that commits only encrypted backup artifacts into the
+private `secrets` repo.
+
+This is intentionally not a `VP6630` container responsibility. RouterOS backups
+are operational recovery support, not a bootstrap dependency like DNS or PKI.
+Keeping the backup stack in the platform cluster keeps the router focused on
+routing, internal DNS, and certificate issuance while the platform cluster owns
+automation and Git-backed operational services.
+
+The design is documented in [Network device backups](./network-device-backups.md).
+
 ## Control Flow
 
 ### 1. Platform Bootstrap
@@ -243,6 +314,7 @@ At minimum, that includes:
 - platform cluster applications
 - platform cluster infrastructure controllers
 - provisioning stack configuration
+- platform-owned operational services such as network-device backups
 
 The current design keeps `Argo CD` scoped to the platform cluster itself.
@@ -253,6 +325,73 @@ This means: - downstream clusters are expected to manage their own services independently - whether a downstream cluster uses `Argo CD` or some other delivery model is left to that cluster's own design +## Disaster Recovery + +The lab's recovery model is rebuild-first. Most components are reconstructed from source-controlled configuration, GitOps state, and automation rather than restored from a point-in-time backup. This section defines the backup and restore substrate for the state that cannot be reasonably rebuilt that way. + +### Scope + +The lab accumulates irreducible state in four tiers: + +- **Proxmox VMs** — downstream cluster nodes, utility VMs, and any workload VMs whose guest state cannot be cheaply reconstructed +- **Platform and downstream Kubernetes clusters** — cluster objects (manifests, CRDs), etcd, and persistent volumes holding workload data +- **Arbitrary Linux hosts and one-off container volumes** — the `VP6630` VyOS router's container volumes (`pdns-auth` LMDB, `step-ca` BadgerDB, Tailscale machine key), Talos etcd snapshots pushed out of cluster, and similar filesystem-shaped state that does not live inside a Kubernetes cluster +- **RouterOS configuration history** — covered separately in [Network Device Backups](./network-device-backups.md). Its requirement is reviewable plaintext change history rather than block or filesystem recovery, so it uses a different artifact model + +### Tooling + +The first three tiers are covered by two complementary tools: + +| Tool | Scope | +| --- | --- | +| `Proxmox Backup Server` (PBS) | Proxmox VMs (native block-level, dedup, dirty-bitmap incrementals); arbitrary Linux hosts and container volumes via the standalone `proxmox-backup-client` | +| `Velero` | Kubernetes-native backup for the platform cluster and all CAPI-provisioned downstream clusters: manifests, CRDs, etcd, and persistent volumes via CSI snapshots or file-system backup | + +Two tools rather than one is a deliberate choice. PBS is best-in-class for VM-level and filesystem backup but has no native understanding of Kubernetes objects. Velero understands Kubernetes semantics — namespace-scoped restore, CRDs, application-consistent hooks, CSI integration — but does not replace a VM backup system. Each tool covers its slice well, and they do not overlap. + +### PBS Placement + +PBS runs as a Proxmox VM on the `MS-02 Ultra` tier, not inside the platform cluster and not on the NAS itself. + +This placement is driven by several constraints: + +- **Upstream install path.** PBS is shipped as a Debian-based appliance. There is no official container image, and the project's design assumes systemd and a local filesystem for the datastore. Running PBS in Kubernetes, including via KubeVirt, forces an off-path install and a container story PBS was not designed for. +- **No circular dependency on the platform cluster.** PBS exists to recover from failures, including platform-cluster failures. Running it inside the cluster it is meant to help restore is a trap. +- **GitOps automation.** Synology VMM on the `DS923+` is functional but has effectively no automation ecosystem. No official Terraform provider, no first-class Ansible coverage, and a semi-documented DSM API. Managing the PBS VM declaratively on Synology would regress against the GitOps posture the rest of the lab holds. 
+Proxmox, by contrast, has a mature Terraform provider, an actively maintained Ansible collection, Packer builders, and a stable API — the PBS VM fits the same declarative pipeline as other infrastructure VMs.
+- **Datastore locality.** The PBS datastore lives on NAS-backed storage via NFS from the `DS923+`, consistent with the NAS being the primary durable-data boundary. PBS reads and writes its dedup chunks against that mount; the VM itself stays lightweight and stateless enough to rebuild.
+
+The main tradeoff is timing. PBS depends on the Proxmox layer being up. Until that tier exists, backup for pre-existing state — notably the `VP6630` container volumes — is handled as an interim stopgap rather than through PBS.
+
+### Velero Placement
+
+Velero runs per Kubernetes cluster. The platform cluster gets its own Velero install, and each CAPI-provisioned downstream cluster gets its own.
+
+Every Velero instance writes to a shared S3-compatible object store exposed on the NAS. This keeps backup artifacts consolidated on the same durable substrate as PBS and lets restore operations pull from a single location, while still honoring the per-cluster operational boundary that Velero itself requires.
+
+Velero's backup scope per cluster includes:
+
+- Kubernetes manifests and CRDs
+- etcd snapshots (on clusters where Velero's etcd integration applies; Talos clusters additionally retain native `talosctl etcd snapshot` as a direct path, pushed into PBS)
+- persistent volumes via CSI snapshots where the cluster's CSI driver supports them, or via Velero's file-system backup path otherwise
+
+Downstream clusters are treated as independent recovery domains. They are not centrally registered into the platform cluster's Argo CD, and their Velero backups are self-contained so that a lost cluster can be reconstituted without standing up the platform cluster first.
+
+### Restore Drills
+
+No backup mechanism is considered complete until its restore path has been exercised at least once. For each of PBS-backed VMs, Velero-backed cluster state, and PBS-backed host-level volumes, the first production use of that backup must be paired with a documented restore drill against a lab-safe target.
+
+Restore expectations differ by tier:
+
+- **VMs** — PBS restore reconstitutes the full VM disk image. Tested via restore to a scratch VM on the Proxmox cluster.
+- **Kubernetes clusters** — Velero restore reconstitutes cluster objects and PVC data into an empty cluster. Tested via restore into a throwaway CAPI cluster.
+- **Host volumes** — `proxmox-backup-client` restore reconstitutes file trees on a target host. Tested via restore into a scratch directory on a lab VM.
+
+This applies equally to downstream clusters. A downstream cluster whose Velero backup has never been successfully restored is not considered protected.
+
+### Off-Site
+
+A single PBS datastore on the NAS leaves the lab exposed to NAS-level failure. Off-site replication is a planned addition rather than a day-one requirement. PBS supports pull-mode sync between PBS instances; the likely path is a second PBS target either at a different physical location or backed by S3-compatible object storage. Velero's object store can be mirrored or cross-region-replicated through whatever backend is chosen. The exact off-site design is deferred until the primary PBS is in place and the NAS-level failure scenarios have been characterized.
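+
+PBS already ships the primitives for that pull-mode shape. A sketch of what the future off-site instance would run, where the remote name, host, auth material, and datastore names are illustrative rather than decided:
+
+```bash
+# On the off-site PBS: register the lab PBS as a remote.
+proxmox-backup-manager remote create lab-pbs \
+  --host pbs.mgmt.lab.gilman.io \
+  --auth-id sync@pbs \
+  --password "${SYNC_TOKEN}" \
+  --fingerprint "${LAB_PBS_FINGERPRINT}"
+
+# Pull the lab datastore into the local off-site datastore nightly.
+proxmox-backup-manager sync-job create offsite-pull \
+  --store offsite \
+  --remote lab-pbs \
+  --remote-store lab \
+  --schedule daily
+```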
+ ## Why This Layout This layout separates concerns in a way that matches the intended operating model: @@ -288,4 +427,3 @@ As the design firms up, the next useful additions to this document are likely: - storage model - network model - downstream cluster lifecycle -- backup and disaster recovery boundaries diff --git a/docs/docs/index.md b/docs/docs/index.md index 9f59ce1..2921df0 100644 --- a/docs/docs/index.md +++ b/docs/docs/index.md @@ -12,5 +12,7 @@ Start with: - [Architecture overview](./architecture.md) - [Hardware reference](./hardware.md) +- [Network device backups](./network-device-backups.md) +- [RouterOS ACME certificates](./routeros-acme.md) More runbooks, decisions, and operating guides will live here as the lab grows. diff --git a/docs/docs/network-device-backups.md b/docs/docs/network-device-backups.md new file mode 100644 index 0000000..f2fc526 --- /dev/null +++ b/docs/docs/network-device-backups.md @@ -0,0 +1,201 @@ +--- +title: Network Device Backups +description: Future design for RouterOS configuration backup and encrypted Git storage. +--- + +# Network Device Backups + +This document defines the intended backup process for managed network devices. + +The first scope is the MikroTik RouterOS devices: + +- `CRS309-1G-8S+IN`: lab switch +- `CCR2004`: home router + +The `VP6630` runs VyOS and is managed separately through the `infra` repo's +VyOS configuration and Ansible flow. It may be added to the same visibility +surface later, but the first backup process should stay focused on RouterOS. + +## Placement + +The durable implementation belongs in the platform cluster, not on the `VP6630`. + +The backup service is operational plumbing rather than a bootstrap dependency. +Running it in the platform cluster keeps the router focused on routing, DNS, and +PKI duties, while the platform cluster owns automation, GitOps-managed services, +and recovery helpers. + +Until the platform cluster exists, ad hoc manual exports are acceptable. Do not +add a long-lived RouterOS backup container to the VyOS router as the default +design. + +## Desired Flow + +The target flow is: + +1. `Oxidized` polls each RouterOS device on a schedule. +2. Oxidized writes the latest fetched config to a private staging volume. +3. A small `backup-sync` job or sidecar reads the staged export. +4. `backup-sync` compares the plaintext export with the decrypted current SOPS + backup. +5. If the plaintext changed, `backup-sync` writes a structured SOPS-encrypted + backup into the private `secrets` repo. +6. `backup-sync` commits and pushes only encrypted files. +7. Health checks report the last successful backup time per device. + +Oxidized should use file output for the handoff to `backup-sync`, not its native +Git output. Oxidized's Git backend commits plaintext configs, and its encrypted +Git option uses `git-crypt`, not SOPS. + +The encrypted Git writer should be deliberately small. It only needs to compare, +encrypt, commit, and push. Oxidized should remain responsible for device polling, +connection handling, and RouterOS model support. + +## Secret Boundary + +Encrypted backup payloads belong in the private `secrets` repo. 
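+
+As a sketch of how steps 4 through 6 of the flow above keep ciphertext churn out of that repo, assuming age-based SOPS with a `.sops.yaml` creation rule already covering `network/mikrotik/**` (paths and names are illustrative):
+
+```bash
+#!/usr/bin/env bash
+set -euo pipefail
+
+staged="staging/crs309.yaml"                       # plaintext envelope staged by Oxidized
+target="network/mikrotik/backups/crs309.sops.yaml" # encrypted artifact in the secrets repo
+
+# Compare plaintext, not ciphertext: SOPS output differs on every run,
+# so diffing encrypted files would produce a commit on every poll.
+if sops -d "$target" 2>/dev/null | diff -q - "$staged" >/dev/null; then
+  exit 0 # no configuration change; leave the repo untouched
+fi
+
+sops -e "$staged" > "$target" # commit only the encrypted form
+git add "$target"
+git commit -m "backup(crs309): configuration change"
+git push
+```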
+ +Public repos may contain: + +- the platform workload manifests +- the list of expected devices +- config templates +- references to secret names and paths +- the `backup-sync` source code, if a custom tool is needed + +Public repos must not contain: + +- RouterOS credentials +- Git deploy keys +- SOPS age identities +- encrypted backup payloads +- raw RouterOS exports + +The expected private repo layout is: + +```text +network/ + mikrotik/ + backup-credentials.sops.yaml + backup-writer.sops.yaml + backups/ + ccr2004.sops.yaml + crs309.sops.yaml +``` + +`backup-credentials.sops.yaml` should hold the RouterOS backup user material. +`backup-writer.sops.yaml` should hold the Git and SOPS identity material needed +by the platform-cluster writer. + +## Backup Format + +The reviewable backup artifact should be a SOPS-encrypted YAML envelope, not a +bare `.rsc` file. + +The plaintext form before encryption should look like: + +```yaml +device: crs309 +kind: routeros-export +captured_at: "2026-04-16T00:00:00Z" +routeros_version: "7.16.2" +source: oxidized +export: | + /interface ethernet + set [ find default-name=sfp-sfpplus1 ] name=to-vyos +``` + +This keeps metadata available to tooling while still letting SOPS encrypt the +actual configuration content. The exact schema can grow, but it should remain +simple enough to inspect with `sops -d`. + +Do not blindly re-encrypt and commit every poll. SOPS encryption output can +change even when the plaintext has not changed, so `backup-sync` must compare +plaintext before writing a new encrypted file. + +## Export Policy + +Start with plain text RouterOS exports. + +The initial command should prefer a terse export rather than verbose output. +Verbose exports include more default and built-in state, which makes them harder +to review and more fragile as restore input. + +Treat text exports as the primary review and change-history artifact. They are +not automatically a complete bare-metal recovery guarantee. + +RouterOS text exports do not include every sensitive or device-local artifact, +including system user passwords, installed certificates, SSH keys, Dude data, or +User Manager databases. Future implementation should explicitly decide whether +to add same-device binary backups for disaster recovery. If binary backups are +added, they must also be SOPS-encrypted before commit and their restore path must +be tested. + +## Access Model + +Create a dedicated RouterOS backup identity per device or per backup domain. + +The backup user should have only the policies needed for the selected export +method. Do not reuse the day-to-day administrator identity. If the export process +eventually needs sensitive values, grant that deliberately and document why. + +The platform cluster needs network reachability from the backup namespace to: + +- `crs309.mgmt.lab.gilman.io` or `10.10.10.2` +- the home `CCR2004` management address + +The implementation session must add the minimum firewall and service-access +rules needed for those connections. + +## GitOps Shape + +The backup stack should be deployed by Argo CD as a platform-owned application. + +The public desired state should define: + +- namespace +- Oxidized deployment +- `backup-sync` job or sidecar +- persistent or ephemeral staging volume +- network policy, if the cluster network plugin supports it +- health checks and alerting hooks +- secret references, not secret payloads + +The private `secrets` repo supplies credentials and stores the encrypted backup +artifacts. 
The backup writer therefore needs both read access to its deployment +secrets and write access to the backup destination path. + +## Restore Expectations + +Every backup mechanism must be paired with a restore drill. + +The first implementation is not complete until it proves: + +- a current export can be decrypted from the `secrets` repo +- the export can be inspected by an operator +- a non-destructive import dry run or lab-device restore test has been performed +- the known gaps in text exports are documented + +Do not assume a RouterOS `.rsc` export can be applied blindly to a wiped or +replacement device. RouterOS imports are sensitive to version, hardware, default +objects, interface naming, certificates, keys, and users. + +## Future Implementation Checklist + +- Choose whether to deploy stock Oxidized plus a custom `backup-sync` container + or a single custom collector for the first two devices. +- Add RouterOS backup users for the `CRS309` and `CCR2004`. +- Add encrypted RouterOS credentials under `secrets/network/mikrotik/`. +- Add the platform-cluster Git writer credentials under + `secrets/network/mikrotik/`. +- Create the Argo CD application and manifests for the backup stack. +- Confirm platform-cluster network reachability to both devices. +- Run the first backup and verify only SOPS-encrypted files are committed. +- Test restore behavior against a lab-safe target before relying on the backups + for disaster recovery. + +## References + +- [Oxidized](https://github.com/ytti/oxidized) +- [RouterOS Configuration Management](https://help.mikrotik.com/docs/spaces/ROS/pages/328155/Configuration%2BManagement) +- [SOPS](https://github.com/getsops/sops) diff --git a/docs/docs/routeros-acme.md b/docs/docs/routeros-acme.md new file mode 100644 index 0000000..94a1e63 --- /dev/null +++ b/docs/docs/routeros-acme.md @@ -0,0 +1,123 @@ +--- +title: RouterOS ACME Certificates +description: How the MikroTik router and switch get WebFig HTTPS certificates from the lab CA. +--- + +# RouterOS ACME Certificates + +The `CCR2004` home router and `CRS309` lab switch use the lab `step-ca` +intermediate for WebFig HTTPS certificates. + +The live names are: + +| Device | RouterOS identity | HTTPS name | Address | +| --- | --- | --- | --- | +| `CCR2004-16G-2S+` | `Core Router` | `ccr2004.mgmt.lab.gilman.io` | `192.168.1.1` | +| `CRS309-1G-8S+` | `lab-10g-switch` | `crs309.mgmt.lab.gilman.io` | `10.10.10.2` | + +Both names are served from the `mgmt.lab.gilman.io` PowerDNS zone. The +`CRS309` address also has a PTR in `10.10.10.in-addr.arpa`. + +## CA and DNS Path + +`step-ca` runs on the `VP6630` as the online intermediate CA at: + +```text +https://ca.mgmt.lab.gilman.io:9000/acme/acme/directory +``` + +The `stepca` container is configured with `name-server 10.10.10.1` so ACME +HTTP-01 validation resolves internal names through the VyOS recursor. 
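+
+Before any device issuance, the directory can be checked from a lab host that already has the root certificate on disk (the local certificate path is illustrative):
+
+```bash
+# Expect a JSON directory document over a verified TLS connection.
+curl --cacert root_ca.crt \
+  https://ca.mgmt.lab.gilman.io:9000/acme/acme/directory
+```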
+ +The MikroTik devices import and trust the lab root CA before using the private +ACME directory: + +```routeros +/certificate/import file-name=glab-root-ca.crt name=glab-root-ca trusted=yes +``` + +The `CRS309` also needs the intermediate imported explicitly so WebFig serves a +complete chain: + +```routeros +/certificate/import file-name=glab-intermediate-ca.crt name=glab-intermediate-ca trusted=yes +``` + +On the `CCR2004`, RouterOS imported the intermediate during ACME issuance, but +it still had to be marked trusted so WebFig would serve the full chain: + +```routeros +/certificate/set [find name="glab Intermediate CA"] trusted=yes +``` + +## Issuance + +RouterOS `7.16` and `7.18` use the older ACME command: + +```routeros +/certificate/enable-ssl-certificate \ + directory-url=https://ca.mgmt.lab.gilman.io:9000/acme/acme/directory \ + dns-name=ccr2004.mgmt.lab.gilman.io \ + reset-private-key=yes +``` + +For the switch, replace the DNS name with `crs309.mgmt.lab.gilman.io`. + +Current RouterOS documentation describes the newer `/certificate/add-acme` +command. Use that when these devices are upgraded to a release that exposes it. + +## WebFig Services + +The ACME command assigns the issued certificate to `www-ssl`, but on these +devices it did not enable the service. HTTPS is enabled explicitly: + +```routeros +/ip/service/set [find name=www-ssl] \ + disabled=no \ + address=192.168.1.0/24,10.10.0.0/16 \ + certificate=ccr2004.mgmt.lab.gilman.io +``` + +Use `certificate=crs309.mgmt.lab.gilman.io` on the switch. + +Plain HTTP remains enabled only for ACME HTTP-01 validation. It is restricted +to the source address that `step-ca` uses to reach each device: + +| Device | `www` allowed source | Reason | +| --- | --- | --- | +| `CCR2004` | `10.0.0.2/32` | VyOS source address toward `192.168.1.1` | +| `CRS309` | `10.10.10.1/32` | VyOS source address toward `10.10.10.2` | + +```routeros +/ip/service/set [find name=www] address=10.0.0.2/32 +/ip/service/set [find name=www] address=10.10.10.1/32 +``` + +Run the first command on the `CCR2004` and the second on the `CRS309`. + +## Verification + +From a client that trusts `glab Root CA` and uses the lab resolver or Tailscale +split DNS, these should return HTTP `200` with TLS verification result `0`: + +```bash +curl --cacert infra/security/pki/root-ca/root_ca.crt \ + https://ccr2004.mgmt.lab.gilman.io/ + +curl --cacert infra/security/pki/root-ca/root_ca.crt \ + https://crs309.mgmt.lab.gilman.io/ +``` + +For a client that does not resolve lab DNS locally, use `--resolve` while +testing. Treat this as a fallback; normal access should resolve through the +home router, VyOS, or Tailscale split DNS. + +```bash +curl --cacert infra/security/pki/root-ca/root_ca.crt \ + --resolve ccr2004.mgmt.lab.gilman.io:443:192.168.1.1 \ + https://ccr2004.mgmt.lab.gilman.io/ + +curl --cacert infra/security/pki/root-ca/root_ca.crt \ + --resolve crs309.mgmt.lab.gilman.io:443:10.10.10.2 \ + https://crs309.mgmt.lab.gilman.io/ +```
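+
+When DNS or `curl` behavior is in doubt, `openssl` gives a lower-level view of the served chain. Expect `Verify return code: 0 (ok)` and both the leaf and the `glab Intermediate CA` in the chain:
+
+```bash
+openssl s_client -connect 10.10.10.2:443 \
+  -servername crs309.mgmt.lab.gilman.io \
+  -CAfile infra/security/pki/root-ca/root_ca.crt </dev/null
+```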