Skip to content

docs(specs): deployment infrastructure design (Hetzner + NixOS)#461

Open
intendednull wants to merge 5 commits intomainfrom
claude/research-deployment-infrastructure-V4ilv
Open

docs(specs): deployment infrastructure design (Hetzner + NixOS)#461
intendednull wants to merge 5 commits intomainfrom
claude/research-deployment-infrastructure-V4ilv

Conversation

@intendednull
Copy link
Copy Markdown
Owner

Summary

  • Drafts docs/specs/2026-04-28-deployment-infrastructure-design.md: a Hetzner Cloud + NixOS production stack to replace today's hand-configured Linode VM + scp-based deploys.
  • 14 layered decisions (host, IaC, OS install, config push, secrets, TLS, CI, cache, backups, firewalls, etc.), each with role, rationale, and rejected alternatives.
  • 5-phase migration plan with explicit rollback points, ~€11.44/mo steady-state cost, and a documented restore drill in Phase 3.

Decisions resolved this round

  • Relay Ed25519 key migrates to Hetzner (peer-ID continuity); worker keys are fresh.
  • Cloudflare manages DNS only — registrar stays put.
  • Identity keys are not committed to agenix; restic snapshots are the recovery path.
  • CI deploy SSH key is forced-command restricted; interactive SSH is per-developer break-glass.

Out of scope

Test plan

Spec only; no code. Once accepted, follow-up work creates infra/, runs nixos-anywhere against a staging hostname, and verifies before cutover (per Phase 1 / Phase 3).

https://claude.ai/code/session_01PPGTBSaKd4c6iPCDmFNtoM


Generated by Claude Code

claude added 5 commits April 28, 2026 06:36
Draft spec for migrating production from a hand-configured Linode VM to
a reproducible Hetzner Cloud + NixOS stack. Each layer (host, IaC,
config push, secrets, TLS, CI, cache, backups, firewalls) is documented
with its role, rationale, and rejected alternatives so the tradeoffs
are auditable.

Observability is intentionally deferred to issue #460 so this spec can
land independently.

https://claude.ai/code/session_01PPGTBSaKd4c6iPCDmFNtoM
Incorporates decisions from review:
- Migrate relay Ed25519 key for peer-ID continuity; workers get fresh
  identities and a one-event SyncProvider re-grant.
- Cloudflare manages DNS for whichever domain is in use; registrar
  stays put.
- Identity keys are NOT committed to agenix; restic snapshots are the
  recovery path.
- CI deploy SSH key uses forced-command restriction; interactive SSH
  is per-developer break-glass only.

https://claude.ai/code/session_01PPGTBSaKd4c6iPCDmFNtoM
Synthesises findings from three parallel review agents (technical,
factual research, security audit). Major changes:

Factual corrections:
- NixOS 25.11 (not 25.05); pin nixpkgs by hash, not channel name.
- Hetzner pricing updated for 2026-04-01 hike: CAX21 ARM €7.99/mo
  preferred over CPX21 AMD €11.99/mo (better $/perf, project already
  targets aarch64). Volume €0.57, Storage Box €3.20, Primary IP €0.50
  (not Floating IP, which is €3.00 and not in Phase-1 budget).
- Cost total €12.26 vs €20 cap.
- Cloudflare WSS-proxy claim corrected (proxy works fine for WSS;
  real reasons to keep relay direct are 100s idle timeout + binary
  TCP/9090).

Security additions:
- Phase 0 reframed as compromise response: leaked password is
  permanently public; auth-log audit gates whether the relay key is
  migrated at all.
- CI = root on prod stated honestly; forced-command is no defence
  against malicious closures. Required mitigations: manual approval
  gate, closure signing pinned in trusted-public-keys, action SHA
  pinning, scoped permissions.
- agenix host-key threat model documented; rekey procedure;
  bootstrap chicken-and-egg sequence; sops-nix as upgrade path for
  high-value secrets needing online revocation.
- LUKS encryption on Hetzner Volume.
- restic in append-only mode; offline DR secrets (restic password,
  Storage Box key, LUKS recovery key) listed; backup-failure
  email path.
- New "Security baseline" section: hardware-token 2FA on Hetzner +
  Cloudflare + GitHub; CAA records; OpenTofu state encryption;
  systemd hardening defaults (`systemd-analyze security` ≤ 3.0);
  break-glass key separate from CI key.

Operational additions:
- New "Host configuration baseline" section: time sync, swap, Nix
  GC, journald limits, sshd hardening, mount ordering, NixOS
  version-bump procedure.
- Caddy footguns explicit: WASM MIME directive, Brotli requires
  xcaddy build, admin API disabled.

Phase reordering:
- Phase 1 now includes backups + a verified restore drill before
  cutover (closes regression vs goal #6).
- Phase 2 cutover gates on the Phase-0 audit conclusion + atomicity
  invariant (one host holds the live key at any moment).

New runbooks/appendices:
- Appendix C: relay key migration runbook (stream through age,
  never write to laptop disk, integrity-check via SHA-256, shred
  source).

Tooling research updates:
- Cachix risk note: attic upstream stalled; celler is the
  actively-maintained fork.
- nh and clan-core added to rejected alternatives with current
  rationale.
- §10 commits to Caddy-on-VM as long-term answer (CDN via
  Cloudflare proxy toggle), drops the speculative "Phase 2 move
  to Pages" stepping-stone shortcut.

Firewall fix:
- Port 9091 dropped from edge firewall (loopback never traverses
  edge); rate limit added on TCP/9090; IPv6 rules stated.

https://claude.ai/code/session_01PPGTBSaKd4c6iPCDmFNtoM
Synthesises findings from second-round technical + security re-review
agents. Top issues addressed:

Blockers / cross-reference drift:
- §13 no longer claims the LUKS volume key lives in /etc/willow
  (which is itself on the LUKS volume — circular). Volume key is in
  agenix only; backup set excludes it.
- Stale Phase 3/4 references migrated to Phase 1 after the iter-1
  reorder.
- Cost table no longer lists "Pages" after §10's Pages rejection.
- Stack-at-a-glance row removes the (Phase 1) qualifier on Caddy
  static hosting and corrects the "replaces" cell to match §"Context".
- Repo layout disko file no longer named after the rejected CPX21.

CI threat-model honesty (closure signing custody):
- §11 names the closure-signing key location (GitHub Secrets, in a
  separate signing job) and bounds the threat-model claim to what
  signing actually defends: Cachix supply-chain compromise. A full
  GHA runner compromise of the signing job still owns prod;
  closure signing is blast-radius reduction, not a defence against
  runner compromise itself.

DR completeness:
- §13 DR triplet → quadruplet: adds an offline operator age private
  key (e.g. age-plugin-yubikey on hardware token). Without it,
  full-loss recovery cannot decrypt the OpenTofu API tokens to
  drive `tofu apply` on the new VM.
- §13 explicit on identity-key behaviour: relay + worker keys are
  restored byte-identical from restic; SyncProvider re-grants only
  happen at the one-shot Phase-2 cutover, not on every DR exercise.

Phase 2 cutover honesty:
- "Atomicity invariant" renamed to "single-key invariant".
  Acknowledged sub-2-minute reachability gap during DNS TTL expiry
  is now documented as accepted; client retry covers it.

Appendix C runbook hardening:
- Operator preamble: ssh ForwardAgent disabled, set -euo pipefail,
  HISTFILE=/dev/null, recipient-file existence check, portability
  notes for zsh/fish.
- Integrity verification now keys on **peer-ID continuity**
  (security-relevant) plus SHA-256 (transport correctness), instead
  of SHA-only (which couldn't catch source-side substitution).
- `shred` removed: unreliable on virtualised network volumes; honest
  framing that VM destruction is the only real disposal guarantee.
- Every remote command sets HISTFILE=/dev/null on the remote side.

Firewall column clarity:
- §14 table now has separate "Hetzner edge" and "in-host nftables"
  columns so rate-limiting is unambiguously in nftables, not at the
  edge (which Hetzner Cloud Firewall doesn't support).

Security baseline operationalised:
- Ownership + cadence + evidence table for every "required" control
  (2FA, token rotation, journald review, audit logs, restore drill,
  systemd hardening verification).
- Multi-approver future-work item explicitly named.
- Action supply-chain SHA-bump checklist (provenance attestation
  cross-reference, diff review, prefer first-party publishers).
- ACME / CAA accounturi ordering: bootstrap with permissive CAA,
  tighten after first issuance.
- Compromise response order specified (10 numbered steps, designed
  so each step doesn't break the operator's ability to do later
  steps).
- Hetzner abuse-desk runbook (response SLA, escalation window).

Phase 0 reframing:
- "(Optional) git filter-repo" demoted to step 6 hygiene-only;
  step 5 is now MANDATORY security advisory under
  SECURITY.md / docs/reports/. Removal does not equal remediation.

Secrets layout completeness:
- infra/secrets/ enumeration includes smtp-credential.age,
  willow-volume-luks.age, tfstate-encryption.age,
  nix-signing-pubkey.txt.
- secrets.nix recipient matrix sketched with host/operator/CI roles
  and per-secret access — the authority document for prod access.

Future-work additions:
- DDoS scrubbing item now has trigger threshold + fallback runbook.
- Off-Hetzner restic mirror to B2 to remove single-point-of-recovery
  on Hetzner cooperation.

Misc:
- §6 drops the "WSS over proxy is fine" aside (justified a path
  spec doesn't take).
- Cost arithmetic for Phase-2 multi-region corrected to ~€9/mo.

https://claude.ai/code/session_01PPGTBSaKd4c6iPCDmFNtoM
Iter-3 surfaced one MAJOR finding: the spec described a 2-port relay
topology (TCP/9090 binary iroh + TCP/9091 WebSocket via Caddy), but
willow-relay's actual binary multiplexes both onto a single
--relay-port (default 3340) HTTP listener. The 9090/9091 flags
referenced in docker-compose.yml and the legacy deploy skill no
longer exist on the binary. Phase-1 deploy would have mis-configured
both Caddy (proxying a port nothing listens on) and the firewall
(opening a port externally that no service binds).

Spec now matches the binary:
- Topology diagram: single :443 → 127.0.0.1:3340 ingress.
- §9 Caddy: proxies all relay traffic to one loopback HTTP port.
- §14 firewall: edge exposes 22/80/443 only; relay's listen port is
  not exposed externally because the binary binds 127.0.0.1 via the
  NixOS module.
- Historical note added documenting the dev-stack/CLAUDE.md staleness
  for separate cleanup.

Iter-3 minor cleanups:
- "DR triplet" wording corrected to "DR offline-secrets list" (now
  four items after iter-2's operator-age-key addition).
- §6 names Hetzner DNS as the explicit runner-up + reasons rejected
  (CLAUDE.md tradeoff-disclosure requirement).
- §10 adds a concrete trigger for flipping Cloudflare proxy on the
  web UI (95p egress ≥ 5 TB/mo, US TTFB ≥ 1s, or active DDoS).
- secrets matrix adds nix-signing-privkey.age (CI + operator
  decryptable, never host) and clarifies nix-signing-pubkey.txt is
  intentionally plaintext.

https://claude.ai/code/session_01PPGTBSaKd4c6iPCDmFNtoM
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants