ci(health): tolerate deploy-window restarts (no false alarms)#327
Merged
Conversation
A scheduled health check landing inside the ~25-min deploy window saw the API container restarting (connection-refused / brief 5xx) and false-failed: the smoke checks used only '--retry 2' with no delay (~3s of retry). Unified all checks onto one retry policy (--retry 5 --retry-delay 15 --retry-all-errors --retry-connrefused, ~75s) so a deploy restart is ridden out instead of paging. Prod was verified healthy at fix time; the 00:10 failure was mid-deploy of #326. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Symptom
The Health Check scheduled run at 00:10 UTC failed on
Smoke — book listing(curl -sfexit 22 = HTTP ≥400 on/api/books). The runs before (23:42) and after were green.Root cause — not a prod outage
The failure landed inside the deploy window of #326 (23:52 → ~00:18): a deploy restarts the API container and rebuilds SSG, so the API is briefly connection-refused / 5xx. The
/healthchecks already retried for ~30s (--retry 3 --retry-delay 10), but the smoke checks used only--retry 2with no delay (~3s of retry) and couldn't ride out the restart. Prod was verified healthy at fix time (/api/books200 with data,/health200, explain 200).Fix
One shared, tolerant retry policy for every check via a
CURL_RETRYenv:--retry 5 --retry-delay 15 --retry-all-errors --retry-connrefused(~75s window). A deploy-window restart is now ridden out instead of paging; a genuine multi-minute outage still fails (the next scheduled run, or sustained downtime beyond the window, trips it).No application change.