
Restore timeseries benchmark (corrected, consolidated) — point duel + interval study (#57)#58

Merged
OVVO-Financial merged 3 commits into main from claude/restore-conformal-benchmark
Jun 23, 2026

@OVVO-Financial OVVO-Financial commented Jun 23, 2026

Owner

Summary

Restores the gists/timeseries/conformal benchmark withdrawn under #57, as a single consolidated routine (run_conformal.py). The original (a) leaked each evaluation chunk into NNS.ARMA.optim's validation tail and (b) compared methods across mismatched protocols (online h=1 baselines vs. an NNS multi-step block). Both are fixed: one leak-free block walk-forward emits two coherent analyses from the same NNS forecast. The leak is prevented by correct variable specification at the call site (variable=train_slice, never y[:end_i]).
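The slicing rule can be sketched as follows. This is a minimal illustration, not the code in run_conformal.py: `forecast_block`, `fit_and_forecast`, and `persistence` are hypothetical names, and reading `end_i` as the end of the scored block is an assumption based on the description above.

```python
from typing import Callable, List, Sequence


def forecast_block(
    y: Sequence[float],
    origin: int,
    h: int,
    fit_and_forecast: Callable[[Sequence[float], int], List[float]],
) -> List[float]:
    """Forecast y[origin:origin + h] using only y[:origin]."""
    end_i = origin + h        # assumed meaning: end of the block to be scored
    train_slice = y[:origin]  # strictly historical data
    # Passing y[:end_i] instead would hand the scored block to model
    # selection; train_slice cannot contain it by construction.
    assert len(train_slice) == origin < end_i <= len(y)
    return fit_and_forecast(train_slice, h)


def persistence(train: Sequence[float], h: int) -> List[float]:
    """Stand-in forecaster (hypothetical): repeat the last observed value."""
    return [train[-1]] * h


y = [float(v) for v in range(20)]
preds = forecast_block(y, origin=15, h=5, fit_and_forecast=persistence)
print(preds)  # [14.0, 14.0, 14.0, 14.0, 14.0]
```

Any forecaster with the same two-argument shape could be dropped in for `persistence`; the point is only that the function never receives more than `train_slice`.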

One protocol

At each origin t, given only data through t, every method forecasts an h-step block (h = implied_h = t·(1−0.9)/0.9) with no online updating — no peeking at a realized value to predict the next one.
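As a worked reading of the horizon formula: h = t·(1−0.9)/0.9 is the block length for which the t training points make up 90% of the window t + h. The sketch below assumes the result is rounded to an integer and that each origin steps past its own block; neither detail is stated in this description, and `implied_h` and `origins` are illustrative names.

```python
from typing import Iterator, Tuple


def implied_h(t: int, train_frac: float = 0.9) -> int:
    """Block length such that t is train_frac of the window t + h."""
    return round(t * (1.0 - train_frac) / train_frac)


def origins(t0: int, T: int, train_frac: float = 0.9) -> Iterator[Tuple[int, int]]:
    """Yield (origin, h) pairs until a full block no longer fits in T.

    Stepping each origin by its own h is an assumption for illustration.
    """
    t = t0
    while True:
        h = implied_h(t, train_frac)
        if h < 1 or t + h > T:
            break
        yield t, h
        t += h


print(implied_h(900))  # 100
schedule = list(origins(900, 1300))
print(schedule)
```

Because h grows with t, later blocks are longer than earlier ones under this reading.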

NNS's model is selected on a strictly historical validation tail; the scored block is never shown to the optimizer. DGP: heteroskedastic AR(1) with trend, two seasonals, piecewise σ-regimes. 10 seeds, T=3500.

Point duel — does NNS forecast better?

           method   MAE  RMSE  median_AE
        NNS block 1.514 2.056      1.122
Ridge (recursive) 2.664 3.230      2.403
      Persistence 2.493 3.241      1.991

NNS wins decisively (~43% lower MAE than recursive ridge, in all 10 seeds). Recursive ridge falls below persistence on MAE and median AE (it stays marginally ahead on RMSE) once the h=1 true-lag crutch is removed, because its error compounds over the block.
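To make the compounding concrete, here is a deliberately simplified sketch of recursive block forecasting and the three point metrics. It assumes a single-lag ridge model with an unpenalized intercept, which is far smaller than whatever feature set the benchmark's ridge baseline uses; all function names are hypothetical.

```python
import math
import statistics
from typing import Dict, List, Sequence, Tuple


def fit_ridge_ar1(y: Sequence[float], lam: float = 1.0) -> Tuple[float, float]:
    """Ridge fit of y[t] on y[t-1]; the intercept is not penalized."""
    x, z = y[:-1], y[1:]
    mx, mz = statistics.fmean(x), statistics.fmean(z)
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    sxx = sum((a - mx) ** 2 for a in x)
    beta = sxz / (sxx + lam)
    return mz - beta * mx, beta


def recursive_block(train: Sequence[float], h: int, lam: float = 1.0) -> List[float]:
    """h-step forecast that feeds each prediction back in as the next lag."""
    c, beta = fit_ridge_ar1(train, lam)
    preds, last = [], train[-1]
    for _ in range(h):
        last = c + beta * last  # no realized value is ever consulted
        preds.append(last)
    return preds


def point_metrics(actual: Sequence[float], pred: Sequence[float]) -> Dict[str, float]:
    """MAE, RMSE and median absolute error over one scored block."""
    err = [abs(a - p) for a, p in zip(actual, pred)]
    return {
        "MAE": statistics.fmean(err),
        "RMSE": math.sqrt(statistics.fmean(e * e for e in err)),
        "median_AE": statistics.median(err),
    }


block = recursive_block([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], h=3, lam=0.0)
print(block)  # [7.0, 8.0, 9.0]
scores = point_metrics([7.0, 8.0, 10.0], block)
```

In an online h=1 protocol the loop would instead read the realized value at each step, which is the "true-lag crutch" the block protocol removes.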

Interval study — native PI vs. conformalizing the same residuals

Point forecast held fixed; only the band varies. Framed explicitly as an adaptation to discern coverage guarantees on a heteroskedastic process.

                                 method  marg_cov  worst_win_cov  cov_lowvol  cov_hivol  cond_cov_gap  width  interval_score
                          NNS native PI     0.845          0.561       0.927      0.853         0.149  5.658           8.674
                  NNS + split-CP (flat)     0.824          0.515       0.899      0.850         0.169  5.398           8.735
                  NNS + Gaussian (flat)     0.821          0.516       0.894      0.842         0.170  5.332           8.773
              NNS + split-CP (per-lead)     0.918          0.649       0.996      0.858         0.097  8.595          10.505
Ridge (recursive) + split-CP (per-lead)     0.865          0.553       0.955      0.826         0.149 10.169          13.853
      Persistence + split-CP (per-lead)     0.887          0.333       0.981      0.807         0.162 12.107          16.440
  • NNS native PI is the efficiency winner and ≈ a flat split-conformal band on the same residuals.
  • Every flat band under-covers the volatile regime — exchangeability failure under heteroskedasticity.
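  • Only the horizon-adaptive per-lead wrapper recovers near-nominal marginal coverage (0.918), at ~52% more width than the native PI (8.595 vs. 5.658).
  • Interval quality follows point quality: under the same per-lead wrapper, interval score (lower is better) is NNS 10.5 ≪ ridge 13.9 ≪ persistence 16.4.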
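The difference between the flat and per-lead bands can be sketched as below. This assumes symmetric bands built from absolute residuals, equal-length calibration blocks, and the standard Winkler form of the interval score; the calibration details in run_conformal.py (including its seeded fallback) are not reproduced, and every name here is illustrative.

```python
import math
from typing import List, Sequence


def conformal_quantile(scores: Sequence[float], alpha: float) -> float:
    """Finite-sample split-conformal quantile of nonconformity scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank with the +1 correction
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(scores)[k - 1]


def flat_halfwidths(resid_blocks: Sequence[Sequence[float]], alpha: float) -> List[float]:
    """One half-width shared by every lead time, pooling all residuals."""
    pooled = [abs(r) for block in resid_blocks for r in block]
    return [conformal_quantile(pooled, alpha)] * len(resid_blocks[0])


def per_lead_halfwidths(resid_blocks: Sequence[Sequence[float]], alpha: float) -> List[float]:
    """A separate half-width for each lead time (column of the blocks)."""
    h = len(resid_blocks[0])
    return [
        conformal_quantile([abs(b[j]) for b in resid_blocks], alpha)
        for j in range(h)
    ]


def interval_score(y: float, lo: float, hi: float, alpha: float) -> float:
    """Width plus a 2/alpha penalty per unit of miss on either side."""
    return (hi - lo) + (2 / alpha) * max(lo - y, 0.0) + (2 / alpha) * max(y - hi, 0.0)


# Toy calibration set: 9 blocks of length 2, with errors 10x larger at lead 2.
resid_blocks = [[float(i), 10.0 * i] for i in range(1, 10)]
flat = flat_halfwidths(resid_blocks, alpha=0.2)
lead = per_lead_halfwidths(resid_blocks, alpha=0.2)
print(flat)  # [70.0, 70.0]
print(lead)  # [8.0, 80.0]
```

In the toy data the pooled band is far too wide at the first lead and too narrow at the second, which is the mechanism by which a flat band can miss coverage where errors are largest while a per-lead band pays for it in width.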

Layout

gists/timeseries/conformal/
  run_conformal.py            # one routine -> both tables
  README.md                   # framing + both tables + findings
  results/{point,interval}{,_all}.csv

Note: the interval CRPS/log-score columns were dropped from the consolidated table; lean on interval score + the coverage columns.

🤖 Generated with Claude Code

https://claude.ai/code/session_01TX6vFofX1vnanE147tJFWF

claude added 2 commits June 23, 2026 18:38
…#57)

Replaces the withdrawn gists/timeseries/conformal benchmark, which (per #57)
leaked each evaluation chunk into NNS.ARMA.optim and compared methods across
mismatched forecasting protocols.

gists/timeseries/conformal/ — corrected interval study. The NNS block point
forecast is held FIXED (model selected on a strictly historical validation tail;
the scored block is never shown to the optimizer) and only the interval
construction varies: NNS native PI vs. split-conformal (flat) vs. split-conformal
(per-lead-time) vs. Gaussian, all on identical residuals. Framed explicitly as an
adaptation to discern coverage guarantees on a heteroskedastic process. Finding:
native PI is the efficiency winner and is essentially a flat split-conformal band;
every flat band under-covers the volatile regime (exchangeability failure); only
the horizon-adaptive per-lead wrapper recovers near-nominal coverage, at a width
cost.

gists/timeseries/point_duel/ — fair point-forecast comparison under one block
protocol (no online updating): NNS block vs recursive-block ridge vs persistence.
NNS wins decisively (MAE 1.51 vs 2.66 vs 2.49); recursive ridge degrades below
persistence once the h=1 true-lag crutch is removed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01TX6vFofX1vnanE147tJFWF

Merge the point-model duel and the interval study into a single walk-forward in
gists/timeseries/conformal/run_conformal.py. One leak-free NNS block forecast per
origin now feeds both analyses (previously two scripts each recomputed the
expensive NNS forecast), and the conformal calibration is unified so the per-lead
split-CP shares one well-seeded fallback across point models -- removing the
cold-start that depressed worst-window coverage in the standalone duel.

Outputs results/{point,interval}{,_all}.csv and one README covering both tables.
Removes the separate gists/timeseries/point_duel/ directory.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01TX6vFofX1vnanE147tJFWF

@OVVO-Financial OVVO-Financial changed the title Restore timeseries conformal benchmark (corrected) + point-model duel (#57) Restore timeseries benchmark (corrected, consolidated) — point duel + interval study (#57) Jun 23, 2026
Correct `variable` specification at the call site (variable=train_slice, never
y[:end_i]) prevents the evaluation-chunk leak, so there is no separate upstream
guard to flag as outstanding.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01TX6vFofX1vnanE147tJFWF
@OVVO-Financial OVVO-Financial merged commit 4135c18 into main Jun 23, 2026
8 checks passed
@OVVO-Financial OVVO-Financial deleted the claude/restore-conformal-benchmark branch June 23, 2026 19:30