Restore timeseries benchmark (corrected, consolidated): point duel + interval study (#57) #58
Merged
…#57)
Replaces the withdrawn gists/timeseries/conformal benchmark, which (per #57) leaked each evaluation chunk into NNS.ARMA.optim and compared methods across mismatched forecasting protocols.
gists/timeseries/conformal/: corrected interval study. The NNS block point forecast is held FIXED (model selected on a strictly historical validation tail; the scored block is never shown to the optimizer) and only the interval construction varies: NNS native PI vs. split-conformal (flat) vs. split-conformal (per-lead-time) vs. Gaussian, all on identical residuals. Framed explicitly as an adaptation to discern coverage guarantees on a heteroskedastic process.
Finding: native PI is the efficiency winner and is essentially a flat split-conformal band; every flat band under-covers the volatile regime (exchangeability failure); only the horizon-adaptive per-lead wrapper recovers near-nominal coverage, at a width cost.
gists/timeseries/point_duel/: fair point-forecast comparison under one block protocol (no online updating): NNS block vs. recursive-block ridge vs. persistence. NNS wins decisively (MAE 1.51 vs. 2.66 vs. 2.49); recursive ridge degrades below persistence once the h=1 true-lag crutch is removed.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01TX6vFofX1vnanE147tJFWF
Merge the point-model duel and the interval study into a single walk-forward in
gists/timeseries/conformal/run_conformal.py. One leak-free NNS block forecast per
origin now feeds both analyses (previously two scripts each recomputed the
expensive NNS forecast), and the conformal calibration is unified so the per-lead
split-CP shares one well-seeded fallback across point models -- removing the
cold-start that depressed worst-window coverage in the standalone duel.
Outputs results/{point,interval}{,_all}.csv and one README covering both tables.
Removes the separate gists/timeseries/point_duel/ directory.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01TX6vFofX1vnanE147tJFWF
Correct `variable` specification at the call site (variable=train_slice, never y[:end_i]) prevents the evaluation-chunk leak, so there is no separate upstream guard to flag as outstanding.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01TX6vFofX1vnanE147tJFWF
Summary
Restores the gists/timeseries/conformal benchmark withdrawn under #57, as a single consolidated routine (run_conformal.py). The original (a) leaked each evaluation chunk into NNS.ARMA.optim's validation tail and (b) compared methods across mismatched protocols (online h=1 baselines vs. an NNS multi-step block). Both are fixed: one leak-free block walk-forward emits two coherent analyses from the same NNS forecast. The leak is prevented by correct `variable` specification at the call site (variable=train_slice, never y[:end_i]).
One protocol
NNS's model is selected on a strictly historical validation tail; the scored block is never shown to the optimizer. DGP: heteroskedastic AR(1) with trend, two seasonals, piecewise σ-regimes. 10 seeds, T=3500.
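A minimal sketch of the split this protocol implies, not the code in run_conformal.py. The names `Fold`, `walk_forward`, `first_origin`, `h`, and `n_valid` are hypothetical, and the explicit validation tail is an assumption made here for illustration. The point it shows is that everything the optimizer may see ends at the forecast origin, so the scored block can never land in a validation tail.

```python
from typing import Iterator, NamedTuple

import numpy as np


class Fold(NamedTuple):
    train: np.ndarray  # history available for fitting
    valid: np.ndarray  # strictly historical tail used for model selection
    block: np.ndarray  # scored block, never passed to the optimizer


def walk_forward(y: np.ndarray, first_origin: int, h: int, n_valid: int) -> Iterator[Fold]:
    """Yield non-overlapping block folds whose training data ends at the origin."""
    if first_origin <= n_valid:
        raise ValueError("first_origin must exceed n_valid")
    for origin in range(first_origin, len(y) - h + 1, h):
        # Leak-free: slice ends at the origin, not at origin + h.
        train_slice = y[:origin]
        yield Fold(
            train=train_slice[:-n_valid],
            valid=train_slice[-n_valid:],
            block=y[origin:origin + h],
        )
```

Slicing to `y[:origin + h]` instead would place the scored block inside the validation tail, which is the kind of leak described in the summary.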
Point duel — does NNS forecast better?
NNS wins decisively (~43% lower MAE than recursive ridge, all 10 seeds). Recursive ridge scores worse than persistence once the h=1 true-lag crutch is removed: with no observed lags to re-anchor each step, it feeds its own predictions back in and its error compounds over the block.
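To make the two baselines concrete, here is a rough sketch of a recursive-block ridge forecast and a persistence forecast. It is an illustration under assumptions, not the benchmark's implementation: the lag order `p`, the penalty `lam`, and all function names are hypothetical, and the ridge is a plain closed-form autoregression with an unpenalised intercept.

```python
import numpy as np


def fit_ridge_ar(y: np.ndarray, p: int, lam: float) -> tuple[np.ndarray, float]:
    """Closed-form ridge on p lags; column k of X holds lag k + 1."""
    n = len(y)
    X = np.column_stack([y[p - k - 1 : n - k - 1] for k in range(p)])
    t = y[p:]
    x_mean, t_mean = X.mean(axis=0), t.mean()
    Xc = X - x_mean
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ (t - t_mean))
    return beta, float(t_mean - x_mean @ beta)


def recursive_block(history: np.ndarray, beta: np.ndarray, intercept: float, h: int) -> np.ndarray:
    """Forecast h steps, feeding each prediction back in as the newest lag."""
    p = len(beta)
    window = list(history[-p:])  # oldest to newest
    out = []
    for _ in range(h):
        lags = np.array(window[::-1])  # newest first, so lags[0] is lag 1
        yhat = float(intercept + lags @ beta)
        out.append(yhat)
        window = window[1:] + [yhat]  # no true values are ever appended
    return np.array(out)


def persistence_block(history: np.ndarray, h: int) -> np.ndarray:
    """Repeat the last observed value across the whole block."""
    return np.full(h, history[-1])
```

Because `recursive_block` only ever appends its own predictions, any one-step error is carried into every later lead time, which is one plausible reading of why the error compounds over the block.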
Interval study — native PI vs. conformalizing the same residuals
Point forecast held fixed; only the band varies. Framed explicitly as an adaptation to discern coverage guarantees on a heteroskedastic process.
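The difference between a flat and a per-lead split-conformal band can be sketched as follows. This assumes absolute residuals as the conformity score and a calibration matrix `cal_abs_res` of shape (calibration blocks, horizon); both the names and the layout are assumptions for illustration, not the script's actual interface.

```python
import numpy as np


def conformal_quantile(abs_res: np.ndarray, alpha: float) -> float:
    """Return the ceil((n + 1)(1 - alpha))-th smallest score, or inf if n is too small."""
    scores = np.sort(np.asarray(abs_res, dtype=float).ravel())
    n = scores.size
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return float("inf") if k > n else float(scores[k - 1])


def flat_band(point: np.ndarray, cal_abs_res: np.ndarray, alpha: float):
    """One half-width for every lead time, from residuals pooled over all leads."""
    q = conformal_quantile(cal_abs_res, alpha)
    return point - q, point + q


def per_lead_band(point: np.ndarray, cal_abs_res: np.ndarray, alpha: float):
    """A separate half-width per lead time, from that lead's residuals only."""
    q = np.array([
        conformal_quantile(cal_abs_res[:, j], alpha)
        for j in range(cal_abs_res.shape[1])
    ])
    return point - q, point + q
```

With few calibration blocks the per-lead quantile is infinite, so a per-lead wrapper needs some starting rule before enough residuals accumulate. The sketch leaves that case as `inf` rather than guessing at the fallback used in the PR.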
Layout
Note: the interval CRPS/log-score columns were dropped from the consolidated table; lean on interval score + the coverage columns.
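For readers unfamiliar with the remaining metrics, the sketch below computes an interval score and empirical coverage. It assumes "interval score" means the standard Winkler-style score for a central (1 - alpha) interval, which is width plus a 2/alpha penalty for each unit by which the observation falls outside the band; the PR does not spell out the definition, so treat this as an assumption. The function names are hypothetical.

```python
import numpy as np


def interval_score(y, lo, hi, alpha: float) -> np.ndarray:
    """Per-observation interval score: lower is better."""
    y, lo, hi = (np.asarray(a, dtype=float) for a in (y, lo, hi))
    below = np.clip(lo - y, 0.0, None)  # distance under the lower bound
    above = np.clip(y - hi, 0.0, None)  # distance over the upper bound
    return (hi - lo) + (2.0 / alpha) * (below + above)


def coverage(y, lo, hi) -> float:
    """Fraction of observations that fall inside their interval."""
    y, lo, hi = (np.asarray(a, dtype=float) for a in (y, lo, hi))
    return float(np.mean((lo <= y) & (y <= hi)))
```

Read together, the two columns separate efficiency from validity: a narrow band can post a good interval score on calm stretches while its coverage in a volatile regime sits well below nominal.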
🤖 Generated with Claude Code
https://claude.ai/code/session_01TX6vFofX1vnanE147tJFWF