fix(weekday-sf): raise MorningEnrich timeout + poll-cap (L4552b) and force lib-pin re-resolve (L4591)#426

Merged: cipher813 merged 1 commit into main from fix/weekday-morningenrich-timeout-libpin on Jun 14, 2026
@cipher813

Owner

Summary

Fixes two weekday-pipeline reliability defects in infrastructure/step_function_daily.json + scripts/ensure_lib_pin.sh. Scopes cipher813/alpha-engine-config#970 (L4552b) and cipher813/alpha-engine-config#1059 (blockers 2 + 3).

NOTE: merging this PR auto-deploys the weekday Step Function via deploy-infrastructure.yml (push-to-main) — the PR is the safe artifact; do not self-merge.

FIX 1 — L4552b MorningEnrich SSM timeout (#970, #1059 blocker 2)

The 6/11 weekday RunMorningEnrich timed out against the exact executionTimeout=1800 ceiling (SF state TimeoutSeconds=1860). Per the #970/#1059 diagnosis, daily_closes+intraday collection is legitimately >30 min on the t3.small, not hung — the slow ArcticDB append was already split to MorningArcticAppend (#983), so this is the remaining fetch tail.

  • Raised SSM executionTimeout 1800→3000s, SSM-param TimeoutSeconds 1800→3000, SF state TimeoutSeconds 1860→3060 (60s margin).
  • Added a bounded poll-iteration cap on the CheckMorningEnrichStatus loop: InitMorningEnrichPoll + IncrementMorningEnrichPoll + a ≥210-attempt branch (~57 min, just past the 3000s ceiling). When the cap is hit, the loop fails fast into a new MorningEnrichPollTimeout state with a clear cause, routed through HandleFailure so the SNS failure alert carries it, instead of spinning to the state TimeoutSeconds on a stuck-InProgress SSM agent (the #970 symptom). This mirrors the existing InitSSMPollCounter/SSMReadyChoice/IncrementSSMPoll cap pattern already in the SF.
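
For reviewers, the cap logic can be modelled in a few lines of Python. This is only a sketch of the counter/branch behaviour described above, not the ASL in step_function_daily.json: the function and class names are hypothetical, the order of check and increment is an assumption, and the per-attempt interval is merely what the quoted figures (~57 min over 210 attempts) imply, not a confirmed Wait value.

```python
from dataclasses import dataclass
from typing import Callable

MAX_POLL_ATTEMPTS = 210          # cap quoted in this PR
SSM_EXECUTION_TIMEOUT_S = 3000   # raised from 1800
SF_STATE_TIMEOUT_S = 3060        # raised from 1860 (60 s margin)
TERMINAL_STATUSES = {"Success", "Failed"}  # assumed terminal SSM statuses


@dataclass
class PollResult:
    outcome: str   # a terminal status, or "MorningEnrichPollTimeout"
    attempts: int  # number of non-terminal polls observed


def poll_with_cap(get_status: Callable[[], str],
                  max_attempts: int = MAX_POLL_ATTEMPTS) -> PollResult:
    """Model of the loop: init counter, check status, increment, branch on cap."""
    attempts = 0                               # InitMorningEnrichPoll
    while attempts < max_attempts:             # the >= cap branch, inverted
        status = get_status()                  # CheckMorningEnrichStatus
        if status in TERMINAL_STATUSES:
            return PollResult(status, attempts)
        attempts += 1                          # IncrementMorningEnrichPoll
    # Cap reached: fail fast with a named cause instead of waiting for
    # the state-level TimeoutSeconds.
    return PollResult("MorningEnrichPollTimeout", attempts)


def implied_seconds_per_attempt(total_seconds: float, attempts: int) -> float:
    """Per-iteration time implied by a total budget and an attempt cap."""
    return total_seconds / attempts


# A stuck-InProgress agent hits the cap rather than spinning forever.
stuck = poll_with_cap(lambda: "InProgress")

# A run that finishes after three in-progress polls exits normally.
finishing = poll_with_cap(iter(["InProgress"] * 3 + ["Success"]).__next__)
```

Under these assumptions, `implied_seconds_per_attempt(57 * 60, MAX_POLL_ATTEMPTS)` comes out at roughly 16 s per loop iteration, which is a quick sanity check when changing either the cap or the wait interval.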

FIX 2 — L4591 weekday lib-pin auto-heal re-resolve gap (#1059 blocker 3)

The 6/10 weekday run crashed with ImportError: cannot import name 'guard_entrypoint' (the box ran new code against a stale installed lib).

Finding (reality vs the #1059 hypothesis): ensure_lib_pin.sh is already wired into the weekday MorningEnrich entrypoint between git-pull and the collector invocation — it landed in data#386 (the same 2026-06-10 fix) and is structurally guarded by test_sf_lib_pin_self_heal_wiring.py. So the wiring is present, not missing. The genuine remaining gap is the heal mechanism itself: a plain pip install of a git-URL pin treats the version as already-satisfied and skips the reinstall, so a new symbol (guard_entrypoint) or newly declared extras never re-resolve onto the long-lived box. Hardened the heal-path install to pip install --force-reinstall --no-cache-dir so the pinned ref's actual contents (symbols + extras) always land. (The Saturday data path is unaffected: it runs on ephemeral spot boxes that pip-install fresh each launch, which is why drift only bites the long-lived weekday EC2.)
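
The heal path can be illustrated with a Python rendering (the real script is bash, and its drift-detection logic is not shown in this PR). Everything here is a sketch under assumptions: the helper names are hypothetical, and probing for a missing symbol is only one plausible way to decide that a heal is needed. The two pip flags are the ones named above.

```python
import importlib
import subprocess
import sys
from typing import Callable, List


def symbol_missing(module_name: str, symbol: str) -> bool:
    """True if the module cannot be imported or lacks the given attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return True
    return not hasattr(module, symbol)


def heal_command(pin_spec: str) -> List[str]:
    """Build the hardened install command for a pinned requirement.

    --force-reinstall makes pip reinstall even when it considers the
    requirement satisfied; --no-cache-dir keeps pip from reusing cached
    wheels for the reinstall.
    """
    return [
        sys.executable, "-m", "pip", "install",
        "--force-reinstall", "--no-cache-dir",
        pin_spec,
    ]


def heal_if_needed(module_name: str, symbol: str, pin_spec: str,
                   run: Callable = subprocess.run) -> bool:
    """Reinstall the pin only when the expected symbol is absent."""
    if not symbol_missing(module_name, symbol):
        return False
    run(heal_command(pin_spec), check=True)
    return True
```

A call might look like `heal_if_needed("examplelib", "guard_entrypoint", "examplelib @ git+https://example.invalid/org/examplelib.git@v1.2.3")`, where the module name and URL are placeholders. Passing `run` as a parameter keeps the function testable without invoking pip.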

Validation

  • step_function_daily.json is valid JSON; no dangling/orphan states (graph checked).
  • ensure_lib_pin.sh passes bash -n.
  • Full tests/ suite green: 1981 passed, 2 skipped (pre-existing FutureWarnings only). SF wiring tests (weekday skipgate, lib-pin self-heal, morning-enrich split, poll resultselector) all pass.
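
A graph check of the kind mentioned in the first bullet could look roughly like the following. This is not the check that was actually run: it is a minimal sketch that follows only top-level `Next`, `Default`, `Choices[].Next`, and `Catch[].Next` transitions and ignores nested Parallel/Map branches. The example definition reuses state names from this PR, but its layout and the `MorningEnrichPollChoice` name are assumptions.

```python
from collections import deque


def _targets(state: dict) -> list:
    """All state names this state can transition to (top level only)."""
    out = []
    if "Next" in state:
        out.append(state["Next"])
    if "Default" in state:
        out.append(state["Default"])
    for rule in state.get("Choices", []):
        if "Next" in rule:
            out.append(rule["Next"])
    for catcher in state.get("Catch", []):
        if "Next" in catcher:
            out.append(catcher["Next"])
    return out


def check_graph(definition: dict):
    """Return (dangling, orphans) for a state machine definition.

    dangling: referenced names that are not defined in States.
    orphans:  defined states that are unreachable from StartAt.
    """
    states = definition["States"]
    start = definition["StartAt"]
    dangling = {t for s in states.values() for t in _targets(s)
                if t not in states}
    if start not in states:
        dangling.add(start)
    reachable, queue = set(), deque([start])
    while queue:  # breadth-first walk from StartAt
        name = queue.popleft()
        if name in reachable or name not in states:
            continue
        reachable.add(name)
        queue.extend(_targets(states[name]))
    return dangling, set(states) - reachable


# Simplified, hypothetical slice of the poll loop (not the real file).
EXAMPLE = {
    "StartAt": "InitMorningEnrichPoll",
    "States": {
        "InitMorningEnrichPoll": {"Type": "Pass",
                                  "Next": "CheckMorningEnrichStatus"},
        "CheckMorningEnrichStatus": {
            "Type": "Task", "Next": "MorningEnrichPollChoice",
            "Catch": [{"ErrorEquals": ["States.ALL"],
                       "Next": "HandleFailure"}]},
        "MorningEnrichPollChoice": {
            "Type": "Choice",
            "Choices": [{"Variable": "$.poll.count",
                         "NumericGreaterThanEquals": 210,
                         "Next": "MorningEnrichPollTimeout"}],
            "Default": "IncrementMorningEnrichPoll"},
        "IncrementMorningEnrichPoll": {"Type": "Pass",
                                       "Next": "CheckMorningEnrichStatus"},
        "MorningEnrichPollTimeout": {"Type": "Pass",
                                     "Next": "HandleFailure"},
        "HandleFailure": {"Type": "Fail"},
    },
}

dangling, orphans = check_graph(EXAMPLE)
```

For the example above both sets come back empty. Deleting `HandleFailure` from it would surface that name as dangling, and adding an unreferenced state would surface it as an orphan.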

References: L4552b, L4591, #970, #1059.

🤖 Generated with Claude Code

…force lib-pin re-resolve (L4591)

Two weekday-pipeline reliability defects (scoping cipher813/alpha-engine-config#970, #1059).

FIX 1 — L4552b MorningEnrich SSM timeout (#970, #1059):
The 6/11 weekday RunMorningEnrich timed out against the EXACT executionTimeout=1800
ceiling (SF state TimeoutSeconds=1860) — daily_closes+intraday collection is
legitimately >30 min on the t3.small, not hung (the slow ArcticDB append was already
split to MorningArcticAppend, #983). Raise the SSM executionTimeout 1800->3000s, the
SSM-param TimeoutSeconds 1800->3000, and the SF state TimeoutSeconds 1860->3060 (60s
margin). Add a bounded poll-iteration cap on the CheckMorningEnrichStatus loop
(InitMorningEnrichPoll + IncrementMorningEnrichPoll + a >=210-attempt branch, ~57 min,
just past the 3000s ceiling) that fails fast into a new MorningEnrichPollTimeout state
with a CLEAR cause (routed through HandleFailure so the SNS alert carries it) instead of
spinning to the state TimeoutSeconds on a stuck-InProgress SSM agent. Mirrors the
existing InitSSMPollCounter/SSMReadyChoice/IncrementSSMPoll cap pattern.

FIX 2 — L4591 weekday lib-pin auto-heal re-resolve gap (#1059 blocker 3):
ensure_lib_pin.sh is already wired into the weekday MorningEnrich entrypoint
(landed in data#386, the same 2026-06-10 fix), so the SF wiring is present and
structurally guarded by test_sf_lib_pin_self_heal_wiring.py. The remaining gap is the
heal mechanism itself: a plain `pip install` of a git-URL pin treats the version as
already-satisfied and SKIPS the reinstall, so a new symbol (guard_entrypoint) or newly
declared extras never re-resolve onto the long-lived box. Harden the heal-path install
to `pip install --force-reinstall --no-cache-dir` so the pinned ref's actual contents
(symbols + extras) always land.

Validation: step_function_daily.json is valid JSON, no dangling/orphan states;
ensure_lib_pin.sh passes bash -n. Full tests/ suite green: 1981 passed, 2 skipped
(pre-existing FutureWarnings only). SF wiring tests (weekday skipgate, lib-pin
self-heal, morning-enrich split, poll resultselector) all pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit 8fad54a into main Jun 14, 2026
1 check passed
@cipher813 cipher813 deleted the fix/weekday-morningenrich-timeout-libpin branch June 14, 2026 21:46