fix(weekday-sf): raise MorningEnrich timeout + poll-cap (L4552b) and force lib-pin re-resolve (L4591) #426
Merged
…force lib-pin re-resolve (L4591)

Two weekday-pipeline reliability defects (scoping cipher813/alpha-engine-config#970, #1059).

FIX 1 - L4552b MorningEnrich SSM timeout (#970, #1059): The 6/11 weekday RunMorningEnrich timed out against the EXACT executionTimeout=1800 ceiling (SF state TimeoutSeconds=1860): daily_closes+intraday collection is legitimately >30 min on the t3.small, not hung (the slow ArcticDB append was already split to MorningArcticAppend, #983). Raise the SSM executionTimeout 1800->3000s, the SSM-param TimeoutSeconds 1800->3000, and the SF state TimeoutSeconds 1860->3060 (60s margin). Add a bounded poll-iteration cap on the CheckMorningEnrichStatus loop (InitMorningEnrichPoll + IncrementMorningEnrichPoll + a >=210-attempt branch, ~57 min, just past the 3000s ceiling) that fails fast into a new MorningEnrichPollTimeout state with a CLEAR cause (routed through HandleFailure so the SNS alert carries it) instead of spinning to the state TimeoutSeconds on a stuck-InProgress SSM agent. Mirrors the existing InitSSMPollCounter/SSMReadyChoice/IncrementSSMPoll cap pattern.

FIX 2 - L4591 weekday lib-pin auto-heal re-resolve gap (#1059 blocker 3): ensure_lib_pin.sh is already wired into the weekday MorningEnrich entrypoint (landed in data#386, the same 2026-06-10 fix), so the SF wiring is present and structurally guarded by test_sf_lib_pin_self_heal_wiring.py. The remaining gap is the heal mechanism itself: a plain `pip install` of a git-URL pin treats the version as already-satisfied and SKIPS the reinstall, so a new symbol (guard_entrypoint) or newly declared extras never re-resolve onto the long-lived box. Harden the heal-path install to `pip install --force-reinstall --no-cache-dir` so the pinned ref's actual contents (symbols + extras) always land.

Validation: step_function_daily.json is valid JSON, no dangling/orphan states; ensure_lib_pin.sh passes bash -n. Full tests/ suite green: 1981 passed, 2 skipped (pre-existing FutureWarnings only). SF wiring tests (weekday skipgate, lib-pin self-heal, morning-enrich split, poll resultselector) all pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
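
The bounded poll cap described in FIX 1 lives in Step Function states whose JSON is not shown here. As a rough illustration of the control flow only, the following Python sketch models it: a counter is initialized, each status check that is still in flight increments it, and reaching the cap raises an error with an explicit cause. The names `poll_with_cap` and `PollTimeout`, the set of in-flight statuses, and the ordering of the check relative to the increment are assumptions made for this sketch. The wait between checks is omitted because the interval is not stated.

```python
from typing import Callable


class PollTimeout(RuntimeError):
    """Raised when the poll cap is reached before a terminal status is seen."""


def poll_with_cap(get_status: Callable[[], str], max_attempts: int = 210) -> str:
    """Poll until a terminal status, or fail fast after max_attempts checks.

    Hypothetical model of the capped loop; not the actual state machine.
    """
    attempts = 0  # plays the role of InitMorningEnrichPoll
    while True:
        status = get_status()  # plays the role of CheckMorningEnrichStatus
        if status not in ("Pending", "InProgress"):
            return status  # terminal status: hand it back to the caller
        attempts += 1  # plays the role of IncrementMorningEnrichPoll
        if attempts >= max_attempts:
            # plays the role of MorningEnrichPollTimeout: fail with a clear cause
            raise PollTimeout(
                f"status still {status!r} after {attempts} poll attempts"
            )
```

The point of the cap is that a stuck in-flight status ends in a named failure with a readable cause rather than running into the outer state timeout.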
Summary
Fixes two weekday-pipeline reliability defects in `infrastructure/step_function_daily.json` + `scripts/ensure_lib_pin.sh`. Scopes cipher813/alpha-engine-config#970 (L4552b) and cipher813/alpha-engine-config#1059 (blockers 2 + 3).

NOTE: merging this PR auto-deploys the weekday Step Function via `deploy-infrastructure.yml` (push-to-main). The PR is the safe artifact; do not self-merge.

FIX 1 - L4552b MorningEnrich SSM timeout (#970, #1059 blocker 2)
The 6/11 weekday `RunMorningEnrich` timed out against the exact `executionTimeout=1800` ceiling (SF state `TimeoutSeconds=1860`). Per the #970/#1059 diagnosis, `daily_closes`+intraday collection is legitimately >30 min on the t3.small, not hung: the slow ArcticDB append was already split to `MorningArcticAppend` (#983), so this is the remaining fetch tail.

- Raise the timeouts: SSM `executionTimeout` 1800→3000s, SSM-param `TimeoutSeconds` 1800→3000, SF state `TimeoutSeconds` 1860→3060 (60s margin).
- Add a bounded poll-iteration cap on the `CheckMorningEnrichStatus` loop (`InitMorningEnrichPoll` + `IncrementMorningEnrichPoll` + a ≥210-attempt branch, ~57 min, just past the 3000s ceiling) that fails fast into a new `MorningEnrichPollTimeout` state with a clear cause (routed through `HandleFailure` so the SNS failure alert carries it) instead of spinning to the state `TimeoutSeconds` on a stuck-`InProgress` SSM agent (the #970 symptom). Mirrors the existing `InitSSMPollCounter`/`SSMReadyChoice`/`IncrementSSMPoll` cap pattern already in the SF.

FIX 2 - L4591 weekday lib-pin auto-heal re-resolve gap (#1059 blocker 3)
The 6/10 weekday run crashed with `ImportError: cannot import name 'guard_entrypoint'` (box ran new code against a stale installed lib).

Finding (reality vs the #1059 hypothesis): `ensure_lib_pin.sh` is already wired into the weekday `MorningEnrich` entrypoint between git-pull and the collector invocation. It landed in `data#386` (the same 2026-06-10 fix) and is structurally guarded by `test_sf_lib_pin_self_heal_wiring.py`. So the wiring is present, not missing. The genuine remaining gap is the heal mechanism itself: a plain `pip install` of a git-URL pin treats the version as already-satisfied and skips the reinstall, so a new symbol (`guard_entrypoint`) or newly declared extras never re-resolve onto the long-lived box. Hardened the heal-path install to `pip install --force-reinstall --no-cache-dir` so the pinned ref's actual contents (symbols + extras) always land. (The Saturday data path is unaffected: it runs on ephemeral spot boxes that pip-install fresh each launch, which is why drift only bites the long-lived weekday EC2.)

Validation
- `step_function_daily.json` is valid JSON; no dangling/orphan states (graph checked).
- `ensure_lib_pin.sh` passes `bash -n`.
- `tests/` suite green: 1981 passed, 2 skipped (pre-existing FutureWarnings only). SF wiring tests (weekday skipgate, lib-pin self-heal, morning-enrich split, poll resultselector) all pass.

References: L4552b, L4591, #970, #1059.
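
The validation note mentions a graph check for dangling and orphan states without showing how it was done. One way such a check could look is sketched below in Python. It is an assumption-laden illustration, not the check actually used: `check_graph` and `check_file` are hypothetical names, and the sketch only walks the top-level `States` map via `Next`, `Default`, `Choices[].Next`, and `Catch[].Next`, without descending into `Parallel` or `Map` branches or nested boolean choice rules.

```python
import json
from typing import List, Set, Tuple


def _targets(state: dict) -> List[str]:
    """Collect every state name this state can transition to."""
    out: List[str] = []
    if "Next" in state:
        out.append(state["Next"])
    if "Default" in state:  # fallback transition of a Choice state
        out.append(state["Default"])
    for rule in state.get("Choices", []):  # top-level choice rules carry Next
        if "Next" in rule:
            out.append(rule["Next"])
    for catcher in state.get("Catch", []):  # error-handler transitions
        if "Next" in catcher:
            out.append(catcher["Next"])
    return out


def check_graph(definition: dict) -> Tuple[Set[str], Set[str]]:
    """Return (dangling targets, unreachable states) for the top-level graph."""
    states = definition["States"]
    # Dangling: a transition names a state that is not defined.
    dangling = {t for s in states.values() for t in _targets(s) if t not in states}
    # Orphan: a defined state that cannot be reached from StartAt.
    seen: Set[str] = set()
    stack = [definition["StartAt"]]
    while stack:
        name = stack.pop()
        if name in seen or name not in states:
            continue
        seen.add(name)
        stack.extend(_targets(states[name]))
    return dangling, set(states) - seen


def check_file(path: str) -> Tuple[Set[str], Set[str]]:
    """Parse a definition file (json.load also proves it is valid JSON)."""
    with open(path, encoding="utf-8") as fh:
        return check_graph(json.load(fh))
```

Both returned sets being empty would correspond to the "no dangling/orphan states" claim for the states this sketch can see.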
🤖 Generated with Claude Code
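
For reference, the hardened heal-path install from FIX 2 boils down to adding two pip flags. The actual change is in `ensure_lib_pin.sh`, whose contents are not shown here, so the Python sketch below is only an equivalent illustration. The function names and the example requirement string are hypothetical; the pip flags themselves are the ones named in the description.

```python
import subprocess
import sys
from typing import List


def heal_install_cmd(pin_spec: str) -> List[str]:
    """Build a pip command that reinstalls the pinned requirement from scratch."""
    return [
        sys.executable, "-m", "pip", "install",
        "--force-reinstall",  # reinstall even if pip considers the version satisfied
        "--no-cache-dir",     # do not reuse a cached build from an earlier install
        pin_spec,
    ]


def heal(pin_spec: str, dry_run: bool = True) -> List[str]:
    """Run the heal install, or only return the command when dry_run is set."""
    cmd = heal_install_cmd(pin_spec)
    if not dry_run:
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return cmd


if __name__ == "__main__":
    # Hypothetical pin; the real requirement string is not given in the source.
    print(heal("example-lib @ git+https://example.invalid/org/example-lib.git@v1.2.3"))
```

Note that `--force-reinstall` reinstalls the requested package's dependencies as well, which costs time on each heal but matches the goal of making the box reflect the pinned ref.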