docs(specs): deployment infrastructure design (Hetzner + NixOS)#461
Open
intendednull wants to merge 5 commits intomainfrom
Open
docs(specs): deployment infrastructure design (Hetzner + NixOS)#461intendednull wants to merge 5 commits intomainfrom
intendednull wants to merge 5 commits intomainfrom
Conversation
Draft spec for migrating production from a hand-configured Linode VM to a reproducible Hetzner Cloud + NixOS stack. Each layer (host, IaC, config push, secrets, TLS, CI, cache, backups, firewalls) is documented with its role, rationale, and rejected alternatives so the tradeoffs are auditable. Observability is intentionally deferred to issue #460 so this spec can land independently. https://claude.ai/code/session_01PPGTBSaKd4c6iPCDmFNtoM
Incorporates decisions from review: - Migrate relay Ed25519 key for peer-ID continuity; workers get fresh identities and a one-event SyncProvider re-grant. - Cloudflare manages DNS for whichever domain is in use; registrar stays put. - Identity keys are NOT committed to agenix; restic snapshots are the recovery path. - CI deploy SSH key uses forced-command restriction; interactive SSH is per-developer break-glass only. https://claude.ai/code/session_01PPGTBSaKd4c6iPCDmFNtoM
Synthesises findings from three parallel review agents (technical, factual research, security audit). Major changes: Factual corrections: - NixOS 25.11 (not 25.05); pin nixpkgs by hash, not channel name. - Hetzner pricing updated for 2026-04-01 hike: CAX21 ARM €7.99/mo preferred over CPX21 AMD €11.99/mo (better $/perf, project already targets aarch64). Volume €0.57, Storage Box €3.20, Primary IP €0.50 (not Floating IP, which is €3.00 and not in Phase-1 budget). - Cost total €12.26 vs €20 cap. - Cloudflare WSS-proxy claim corrected (proxy works fine for WSS; real reasons to keep relay direct are 100s idle timeout + binary TCP/9090). Security additions: - Phase 0 reframed as compromise response: leaked password is permanently public; auth-log audit gates whether the relay key is migrated at all. - CI = root on prod stated honestly; forced-command is no defence against malicious closures. Required mitigations: manual approval gate, closure signing pinned in trusted-public-keys, action SHA pinning, scoped permissions. - agenix host-key threat model documented; rekey procedure; bootstrap chicken-and-egg sequence; sops-nix as upgrade path for high-value secrets needing online revocation. - LUKS encryption on Hetzner Volume. - restic in append-only mode; offline DR secrets (restic password, Storage Box key, LUKS recovery key) listed; backup-failure email path. - New "Security baseline" section: hardware-token 2FA on Hetzner + Cloudflare + GitHub; CAA records; OpenTofu state encryption; systemd hardening defaults (`systemd-analyze security` ≤ 3.0); break-glass key separate from CI key. Operational additions: - New "Host configuration baseline" section: time sync, swap, Nix GC, journald limits, sshd hardening, mount ordering, NixOS version-bump procedure. - Caddy footguns explicit: WASM MIME directive, Brotli requires xcaddy build, admin API disabled. Phase reordering: - Phase 1 now includes backups + a verified restore drill before cutover (closes regression vs goal #6). - Phase 2 cutover gates on the Phase-0 audit conclusion + atomicity invariant (one host holds the live key at any moment). New runbooks/appendices: - Appendix C: relay key migration runbook (stream through age, never write to laptop disk, integrity-check via SHA-256, shred source). Tooling research updates: - Cachix risk note: attic upstream stalled; celler is the actively-maintained fork. - nh and clan-core added to rejected alternatives with current rationale. - §10 commits to Caddy-on-VM as long-term answer (CDN via Cloudflare proxy toggle), drops the speculative "Phase 2 move to Pages" stepping-stone shortcut. Firewall fix: - Port 9091 dropped from edge firewall (loopback never traverses edge); rate limit added on TCP/9090; IPv6 rules stated. https://claude.ai/code/session_01PPGTBSaKd4c6iPCDmFNtoM
Synthesises findings from second-round technical + security re-review agents. Top issues addressed: Blockers / cross-reference drift: - §13 no longer claims the LUKS volume key lives in /etc/willow (which is itself on the LUKS volume — circular). Volume key is in agenix only; backup set excludes it. - Stale Phase 3/4 references migrated to Phase 1 after the iter-1 reorder. - Cost table no longer lists "Pages" after §10's Pages rejection. - Stack-at-a-glance row removes the (Phase 1) qualifier on Caddy static hosting and corrects the "replaces" cell to match §"Context". - Repo layout disko file no longer named after the rejected CPX21. CI threat-model honesty (closure signing custody): - §11 names the closure-signing key location (GitHub Secrets, in a separate signing job) and bounds the threat-model claim to what signing actually defends: Cachix supply-chain compromise. A full GHA runner compromise of the signing job still owns prod; closure signing is blast-radius reduction, not a defence against runner compromise itself. DR completeness: - §13 DR triplet → quadruplet: adds an offline operator age private key (e.g. age-plugin-yubikey on hardware token). Without it, full-loss recovery cannot decrypt the OpenTofu API tokens to drive `tofu apply` on the new VM. - §13 explicit on identity-key behaviour: relay + worker keys are restored byte-identical from restic; SyncProvider re-grants only happen at the one-shot Phase-2 cutover, not on every DR exercise. Phase 2 cutover honesty: - "Atomicity invariant" renamed to "single-key invariant". Acknowledged sub-2-minute reachability gap during DNS TTL expiry is now documented as accepted; client retry covers it. Appendix C runbook hardening: - Operator preamble: ssh ForwardAgent disabled, set -euo pipefail, HISTFILE=/dev/null, recipient-file existence check, portability notes for zsh/fish. - Integrity verification now keys on **peer-ID continuity** (security-relevant) plus SHA-256 (transport correctness), instead of SHA-only (which couldn't catch source-side substitution). - `shred` removed: unreliable on virtualised network volumes; honest framing that VM destruction is the only real disposal guarantee. - Every remote command sets HISTFILE=/dev/null on the remote side. Firewall column clarity: - §14 table now has separate "Hetzner edge" and "in-host nftables" columns so rate-limiting is unambiguously in nftables, not at the edge (which Hetzner Cloud Firewall doesn't support). Security baseline operationalised: - Ownership + cadence + evidence table for every "required" control (2FA, token rotation, journald review, audit logs, restore drill, systemd hardening verification). - Multi-approver future-work item explicitly named. - Action supply-chain SHA-bump checklist (provenance attestation cross-reference, diff review, prefer first-party publishers). - ACME / CAA accounturi ordering: bootstrap with permissive CAA, tighten after first issuance. - Compromise response order specified (10 numbered steps, designed so each step doesn't break the operator's ability to do later steps). - Hetzner abuse-desk runbook (response SLA, escalation window). Phase 0 reframing: - "(Optional) git filter-repo" demoted to step 6 hygiene-only; step 5 is now MANDATORY security advisory under SECURITY.md / docs/reports/. Removal does not equal remediation. Secrets layout completeness: - infra/secrets/ enumeration includes smtp-credential.age, willow-volume-luks.age, tfstate-encryption.age, nix-signing-pubkey.txt. - secrets.nix recipient matrix sketched with host/operator/CI roles and per-secret access — the authority document for prod access. Future-work additions: - DDoS scrubbing item now has trigger threshold + fallback runbook. - Off-Hetzner restic mirror to B2 to remove single-point-of-recovery on Hetzner cooperation. Misc: - §6 drops the "WSS over proxy is fine" aside (justified a path spec doesn't take). - Cost arithmetic for Phase-2 multi-region corrected to ~€9/mo. https://claude.ai/code/session_01PPGTBSaKd4c6iPCDmFNtoM
Iter-3 surfaced one MAJOR finding: the spec described a 2-port relay topology (TCP/9090 binary iroh + TCP/9091 WebSocket via Caddy), but willow-relay's actual binary multiplexes both onto a single --relay-port (default 3340) HTTP listener. The 9090/9091 flags referenced in docker-compose.yml and the legacy deploy skill no longer exist on the binary. Phase-1 deploy would have mis-configured both Caddy (proxying a port nothing listens on) and the firewall (opening a port externally that no service binds). Spec now matches the binary: - Topology diagram: single :443 → 127.0.0.1:3340 ingress. - §9 Caddy: proxies all relay traffic to one loopback HTTP port. - §14 firewall: edge exposes 22/80/443 only; relay's listen port is not exposed externally because the binary binds 127.0.0.1 via the NixOS module. - Historical note added documenting the dev-stack/CLAUDE.md staleness for separate cleanup. Iter-3 minor cleanups: - "DR triplet" wording corrected to "DR offline-secrets list" (now four items after iter-2's operator-age-key addition). - §6 names Hetzner DNS as the explicit runner-up + reasons rejected (CLAUDE.md tradeoff-disclosure requirement). - §10 adds a concrete trigger for flipping Cloudflare proxy on the web UI (95p egress ≥ 5 TB/mo, US TTFB ≥ 1s, or active DDoS). - secrets matrix adds nix-signing-privkey.age (CI + operator decryptable, never host) and clarifies nix-signing-pubkey.txt is intentionally plaintext. https://claude.ai/code/session_01PPGTBSaKd4c6iPCDmFNtoM
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
docs/specs/2026-04-28-deployment-infrastructure-design.md: a Hetzner Cloud + NixOS production stack to replace today's hand-configured Linode VM + scp-based deploys.Decisions resolved this round
agenix;resticsnapshots are the recovery path.forced-commandrestricted; interactive SSH is per-developer break-glass.Out of scope
infra/layout is documented but not yet created.Test plan
Spec only; no code. Once accepted, follow-up work creates
infra/, runsnixos-anywhereagainst a staging hostname, and verifies before cutover (per Phase 1 / Phase 3).https://claude.ai/code/session_01PPGTBSaKd4c6iPCDmFNtoM
Generated by Claude Code