ci(health): tolerate deploy-window restarts (no false alarms)#327

Merged: mrviduus merged 1 commit into main from fix-healthcheck-deploy-window on Jun 15, 2026

@mrviduus

Owner

Symptom

The Health Check scheduled run at 00:10 UTC failed on Smoke — book listing (curl -sf exit 22 = HTTP ≥400 on /api/books). The runs before (23:42) and after were green.

Root cause — not a prod outage

The failure landed inside the deploy window of #326 (23:52 → ~00:18): a deploy restarts the API container and rebuilds SSG, so the API is briefly connection-refused / 5xx. The /health checks already retried for ~30s (--retry 3 --retry-delay 10), but the smoke checks used only --retry 2 with no delay (~3s of retry) and couldn't ride out the restart. Prod was verified healthy at fix time (/api/books 200 with data, /health 200, explain 200).
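The retry windows quoted above can be checked with a little arithmetic. The sketch below assumes curl's documented behaviour: with `--retry-delay` each retry sleeps that fixed number of seconds, and without it curl starts at 1s and doubles the wait on each retry. The helper name `retry_window` is hypothetical, and the figure it prints counts only the sleep between attempts, not the time each request takes.

```shell
# Approximate total seconds curl sleeps between attempts.
# $1 = --retry count, $2 = --retry-delay in seconds (omit or 0 for curl's default backoff)
retry_window() {
  retries=$1 delay=${2:-0} total=0 backoff=1 i=0
  while [ "$i" -lt "$retries" ]; do
    if [ "$delay" -gt 0 ]; then
      total=$((total + delay))      # fixed delay per retry
    else
      total=$((total + backoff))    # default backoff: 1s, 2s, 4s, ...
      backoff=$((backoff * 2))
    fi
    i=$((i + 1))
  done
  echo "$total"
}

retry_window 2      # smoke checks before this PR
retry_window 3 10   # /health checks before this PR
retry_window 5 15   # shared policy after this PR
```

Note also that plain `--retry` only retries errors curl considers transient; a refused connection is not retried unless `--retry-connrefused` is given, so the old smoke checks may have had even less cover than ~3s during a container restart.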

Fix

One shared, tolerant retry policy for every check, set once in a CURL_RETRY env variable: --retry 5 --retry-delay 15 --retry-all-errors --retry-connrefused (~75s window). A deploy-window restart is now ridden out instead of paging. A genuine outage still fails the check: either the downtime outlasts the ~75s retry window, or the next scheduled run catches it.
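A minimal sketch of how a shared policy like this might be wired into the check steps. The workflow file itself is not shown here, so only the `CURL_RETRY` flag list is taken from this PR; the `check` helper, the `BASE_URL` placeholder, and the `CURL` override are assumptions for illustration.

```shell
# Flag list from this PR; one definition shared by every check.
CURL_RETRY="--retry 5 --retry-delay 15 --retry-all-errors --retry-connrefused"

# Hypothetical placeholder host; a real workflow would supply its own.
BASE_URL="${BASE_URL:-https://example.invalid}"

# check PATH: succeed only if the endpoint eventually answers with HTTP < 400.
# -s silences progress output, -f makes curl exit 22 on HTTP >= 400.
check() {
  # $CURL_RETRY is deliberately unquoted so it splits into separate flags.
  # CURL can be overridden (assumption, for dry runs); it defaults to curl.
  ${CURL:-curl} -sf -o /dev/null $CURL_RETRY "$BASE_URL$1"
}

# Example calls (commented out so sourcing this file makes no requests):
# check /health
# check /api/books
```

Keeping the flags in one variable means the smoke checks and the /health checks cannot drift apart again. `--retry-all-errors` requires curl 7.71.0 or newer, so the runner's curl version is worth confirming.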

No application change.

A scheduled health check landing inside the ~25-min deploy window saw the API
container restarting (connection-refused / brief 5xx) and false-failed: the
smoke checks used only '--retry 2' with no delay (~3s of retry). Unified all
checks onto one retry policy (--retry 5 --retry-delay 15 --retry-all-errors
--retry-connrefused, ~75s) so a deploy restart is ridden out instead of paging.
Prod was verified healthy at fix time; the 00:10 failure was mid-deploy of #326.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@mrviduus mrviduus merged commit 75e2c8d into main Jun 15, 2026
5 checks passed
@mrviduus mrviduus deleted the fix-healthcheck-deploy-window branch June 15, 2026 01:20