Restore timeseries benchmark (corrected, consolidated) — point duel + interval study (#57) by OVVO-Financial · Pull Request #58 · OVVO-Financial/NNS-python

OVVO-Financial · 2026-06-23T18:39:27Z

Summary

Restores the gists/timeseries/conformal benchmark withdrawn under #57, as a single consolidated routine (run_conformal.py). The original (a) leaked each evaluation chunk into NNS.ARMA.optim's validation tail and (b) compared methods across mismatched protocols (online h=1 baselines vs. an NNS multi-step block). Both are fixed: one leak-free block walk-forward emits two coherent analyses from the same NNS forecast. The leak is prevented by correct variable specification at the call site (variable=train_slice, never y[:end_i]).
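A minimal sketch of the slicing discipline described above, not the code in run_conformal.py. The names `split_for_origin`, `origin`, and `horizon` are hypothetical, and `end_i` is assumed to mark the end of the scored chunk. Only `train_slice` would ever be handed to the optimizer as `variable`.

```python
import numpy as np

def split_for_origin(y, origin, horizon):
    """Split a series into what the optimizer may see and what is only scored.

    Hypothetical helper: `origin` is the forecast origin and `horizon` the
    block length, so end_i = origin + horizon is the end of the scored chunk.
    """
    end_i = origin + horizon
    if end_i > len(y):
        raise ValueError("evaluation block runs past the end of the series")
    train_slice = y[:origin]       # safe to pass as variable=train_slice
    eval_block = y[origin:end_i]   # scored afterwards, never shown to the optimizer
    return train_slice, eval_block

y = np.arange(20.0)
train_slice, eval_block = split_for_origin(y, origin=15, horizon=5)

# The leaky pattern: y[:end_i] already contains the block being scored.
leaky_slice = y[:20]
leaked_points = np.intersect1d(leaky_slice, eval_block)
```

In this toy case `leaked_points` contains all five evaluation values, while `train_slice` and `eval_block` share none.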

One protocol

At each origin t, given only data through t, every method forecasts an h-step block (h = implied_h = t·(1−0.9)/0.9) with no online updating — no peeking at a realized value to predict the next one.
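The horizon rule above can be sketched as follows. The rounding, the starting origin, and stepping each origin forward by the previous block length are assumptions made for illustration; the source only gives the formula h = t·(1−0.9)/0.9.

```python
def implied_h(t, train_frac=0.9):
    """Block length so that the first t points are train_frac of t + h.

    round() is an assumption; the source does not say how h is made integral.
    """
    return max(1, round(t * (1.0 - train_frac) / train_frac))

def block_origins(T, first_origin, train_frac=0.9):
    """Yield (origin, horizon) pairs for non-overlapping blocks inside a series of length T.

    `first_origin` is a hypothetical parameter, and advancing by h is an assumption.
    """
    t = first_origin
    while True:
        h = implied_h(t, train_frac)
        if t + h > T:
            break
        yield t, h
        t += h  # the next origin sits at the end of the block just scored

schedule = list(block_origins(T=3500, first_origin=900))
```

With these assumptions the first block starts at t=900 with h=100, and each later block is longer because h grows in proportion to t.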

NNS's model is selected on a strictly historical validation tail; the scored block is never shown to the optimizer. DGP: heteroskedastic AR(1) with trend, two seasonals, piecewise σ-regimes. 10 seeds, T=3500.

Point duel — does NNS forecast better?

           method   MAE  RMSE  median_AE
        NNS block 1.514 2.056      1.122
Ridge (recursive) 2.664 3.230      2.403
      Persistence 2.493 3.241      1.991

NNS wins decisively (~43% lower MAE than recursive ridge, all 10 seeds). Recursive ridge does no better than persistence (worse MAE and median AE, near-identical RMSE) once the h=1 true-lag crutch is removed: its error compounds over the block.
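To make "error compounds over the block" concrete, here is a hedged sketch of the two baselines under the block protocol. It assumes the ridge baseline is an AR(p) regression on lagged values, which the source does not confirm; all function names are hypothetical.

```python
import numpy as np

def fit_ridge_ar(y, p, lam=1.0):
    """Closed-form ridge fit of y[t] on (y[t-1], ..., y[t-p], 1); intercept unpenalized."""
    y = np.asarray(y, dtype=float)
    lags = [y[p - k - 1 : len(y) - k - 1] for k in range(p)]   # lag k+1 column
    X = np.column_stack(lags + [np.ones(len(y) - p)])
    penalty = lam * np.eye(p + 1)
    penalty[-1, -1] = 0.0
    return np.linalg.solve(X.T @ X + penalty, X.T @ y[p:])

def recursive_block(y, coef, p, h):
    """Forecast h steps, feeding each prediction back in place of the realized value."""
    window = [float(v) for v in y[-p:]][::-1]   # most recent value first
    preds = []
    for _ in range(h):
        pred = float(np.dot(coef[:p], window) + coef[p])
        preds.append(pred)
        window = [pred] + window[:-1]           # no peeking at the true next value
    return np.array(preds)

def persistence_block(y, h):
    """Repeat the last observed value across the whole block."""
    return np.full(h, float(y[-1]))

def point_metrics(actual, forecast):
    err = np.abs(np.asarray(actual, float) - np.asarray(forecast, float))
    return {"MAE": float(err.mean()),
            "RMSE": float(np.sqrt((err ** 2).mean())),
            "median_AE": float(np.median(err))}

# Noiseless toy series y[t] = 0.5 * y[t-1] + 1, so an unpenalized fit is exact.
series = [10.0]
for _ in range(24):
    series.append(0.5 * series[-1] + 1.0)
series = np.array(series)
train, actual = series[:20], series[20:]

coef = fit_ridge_ar(train, p=1, lam=0.0)   # lam=0.0 only so the toy fit is exact
ridge_fc = recursive_block(train, coef, p=1, h=5)
naive_fc = persistence_block(train, h=5)
toy = point_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

On noisy data the fed-back predictions carry their own error into every later step, which is the mechanism the paragraph above refers to.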

Interval study — native PI vs. conformalizing the same residuals

Point forecast held fixed; only the band varies. Framed explicitly as an adaptation to discern coverage guarantees on a heteroskedastic process.

                                 method  marg_cov  worst_win_cov  cov_lowvol  cov_hivol  cond_cov_gap  width  interval_score
                          NNS native PI     0.845          0.561       0.927      0.853         0.149  5.658           8.674
                  NNS + split-CP (flat)     0.824          0.515       0.899      0.850         0.169  5.398           8.735
                  NNS + Gaussian (flat)     0.821          0.516       0.894      0.842         0.170  5.332           8.773
              NNS + split-CP (per-lead)     0.918          0.649       0.996      0.858         0.097  8.595          10.505
Ridge (recursive) + split-CP (per-lead)     0.865          0.553       0.955      0.826         0.149 10.169          13.853
      Persistence + split-CP (per-lead)     0.887          0.333       0.981      0.807         0.162 12.107          16.440
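For readers reproducing the marg_cov, width, and interval_score columns, a sketch under two assumptions: that interval_score is the standard Winkler interval score, and that the nominal miscoverage is alpha=0.1. Neither is stated in this PR, and `interval_metrics` is a hypothetical name.

```python
import numpy as np

def interval_metrics(y, lo, hi, alpha=0.1):
    """Marginal coverage, mean width, and mean Winkler interval score.

    alpha=0.1 is an assumed nominal level, not a value taken from the benchmark.
    """
    y, lo, hi = (np.asarray(a, dtype=float) for a in (y, lo, hi))
    covered = (y >= lo) & (y <= hi)
    width = hi - lo
    # Misses are charged 2/alpha times the distance outside the band.
    miss = np.maximum(lo - y, 0.0) + np.maximum(y - hi, 0.0)
    score = width + (2.0 / alpha) * miss
    return {"marg_cov": float(covered.mean()),
            "width": float(width.mean()),
            "interval_score": float(score.mean())}

demo = interval_metrics(y=[0.0, 1.0, 5.0],
                        lo=[-1.0, -1.0, -1.0],
                        hi=[1.0, 1.0, 1.0])
```

Lower interval score is better: it rewards narrow bands but penalizes each miss in proportion to how far the realized value falls outside.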

NNS native PI is the efficiency winner and ≈ a flat split-conformal band on the same residuals.
Every flat band under-covers the volatile regime — exchangeability failure under heteroskedasticity.
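Only the horizon-adaptive per-lead wrapper recovers near-nominal marginal coverage (0.918), at ~52% more width than the native PI (8.595 vs. 5.658).
Interval quality follows point quality: under the same per-lead wrapper, the interval score is 10.5 for NNS, 13.9 for ridge, and 16.4 for persistence.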
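The difference between the flat and per-lead wrappers can be sketched as below. This is an illustration under assumptions, not the calibration in run_conformal.py: "flat" is taken to mean one quantile pooled over all lead times, calibration residuals are assumed to be stored as a (blocks × leads) matrix, and alpha=0.1 is an assumed nominal level. The shared fallback mentioned in the commits is not modeled.

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """Finite-sample split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest score."""
    scores = np.sort(np.asarray(scores, dtype=float))
    n = len(scores)
    k = int(np.ceil((n + 1) * (1.0 - alpha)))
    return float("inf") if k > n else float(scores[k - 1])

def flat_halfwidth(abs_resid, alpha):
    """One half-width for every lead time, from residuals pooled across leads."""
    return conformal_quantile(np.asarray(abs_resid).ravel(), alpha)

def per_lead_halfwidths(abs_resid, alpha):
    """One half-width per lead time, from that lead's own calibration residuals."""
    abs_resid = np.asarray(abs_resid, dtype=float)
    return np.array([conformal_quantile(abs_resid[:, j], alpha)
                     for j in range(abs_resid.shape[1])])

# Toy calibration set: 19 past blocks, 3 lead times, error scale growing with lead.
abs_resid = np.outer(np.arange(1, 20), [1.0, 2.0, 3.0])
alpha = 0.1
flat_q = flat_halfwidth(abs_resid, alpha)
lead_q = per_lead_halfwidths(abs_resid, alpha)

point_fc = np.array([10.0, 10.5, 11.0])          # hypothetical 3-step point forecast
flat_band = (point_fc - flat_q, point_fc + flat_q)
lead_band = (point_fc - lead_q, point_fc + lead_q)
```

In the toy data the pooled half-width lands between the shortest-lead and longest-lead values, so a flat band is too wide early in the block and too narrow late in it, while the per-lead band tracks the growth.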

Layout

gists/timeseries/conformal/
  run_conformal.py            # one routine -> both tables
  README.md                   # framing + both tables + findings
  results/{point,interval}{,_all}.csv

Note: the interval CRPS/log-score columns were dropped from the consolidated table; lean on interval score + the coverage columns.

🤖 Generated with Claude Code

https://claude.ai/code/session_01TX6vFofX1vnanE147tJFWF

…#57)

Replaces the withdrawn gists/timeseries/conformal benchmark, which (per #57) leaked each evaluation chunk into NNS.ARMA.optim and compared methods across mismatched forecasting protocols.

gists/timeseries/conformal/: corrected interval study. The NNS block point forecast is held FIXED (model selected on a strictly historical validation tail; the scored block is never shown to the optimizer) and only the interval construction varies: NNS native PI vs. split-conformal (flat) vs. split-conformal (per-lead-time) vs. Gaussian, all on identical residuals. Framed explicitly as an adaptation to discern coverage guarantees on a heteroskedastic process. Finding: native PI is the efficiency winner and is essentially a flat split-conformal band; every flat band under-covers the volatile regime (exchangeability failure); only the horizon-adaptive per-lead wrapper recovers near-nominal coverage, at a width cost.

gists/timeseries/point_duel/: fair point-forecast comparison under one block protocol (no online updating): NNS block vs recursive-block ridge vs persistence. NNS wins decisively (MAE 1.51 vs 2.66 vs 2.49); recursive ridge degrades below persistence once the h=1 true-lag crutch is removed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01TX6vFofX1vnanE147tJFWF

Merge the point-model duel and the interval study into a single walk-forward in gists/timeseries/conformal/run_conformal.py.

One leak-free NNS block forecast per origin now feeds both analyses (previously two scripts each recomputed the expensive NNS forecast), and the conformal calibration is unified so the per-lead split-CP shares one well-seeded fallback across point models -- removing the cold-start that depressed worst-window coverage in the standalone duel.

Outputs results/{point,interval}{,_all}.csv and one README covering both tables. Removes the separate gists/timeseries/point_duel/ directory.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01TX6vFofX1vnanE147tJFWF

Correct `variable` specification at the call site (variable=train_slice, never y[:end_i]) prevents the evaluation-chunk leak, so there is no separate upstream guard to flag as outstanding.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01TX6vFofX1vnanE147tJFWF

claude added 2 commits June 23, 2026 18:38

OVVO-Financial changed the title ~~Restore timeseries conformal benchmark (corrected) + point-model duel (#57)~~ Restore timeseries benchmark (corrected, consolidated) — point duel + interval study (#57) Jun 23, 2026

OVVO-Financial merged commit 4135c18 into main Jun 23, 2026
8 checks passed

OVVO-Financial deleted the claude/restore-conformal-benchmark branch June 23, 2026 19:30

OVVO-Financial mentioned this pull request Jun 23, 2026

Conformal benchmark leaks evaluation chunks into NNS optimizer #57

Closed

OVVO-Financial merged 3 commits into main from claude/restore-conformal-benchmark


2 participants
