From ed6af39035dc9e60825c14307582df5be31681d4 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 23 Jun 2026 18:38:55 +0000 Subject: [PATCH 1/3] Restore timeseries conformal benchmark (corrected) + point-model duel (#57) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Replaces the withdrawn gists/timeseries/conformal benchmark, which (per #57) leaked each evaluation chunk into NNS.ARMA.optim and compared methods across mismatched forecasting protocols. gists/timeseries/conformal/ — corrected interval study. The NNS block point forecast is held FIXED (model selected on a strictly historical validation tail; the scored block is never shown to the optimizer) and only the interval construction varies: NNS native PI vs. split-conformal (flat) vs. split-conformal (per-lead-time) vs. Gaussian, all on identical residuals. Framed explicitly as an adaptation to discern coverage guarantees on a heteroskedastic process. Finding: native PI is the efficiency winner and is essentially a flat split-conformal band; every flat band under-covers the volatile regime (exchangeability failure); only the horizon-adaptive per-lead wrapper recovers near-nominal coverage, at a width cost. gists/timeseries/point_duel/ — fair point-forecast comparison under one block protocol (no online updating): NNS block vs recursive-block ridge vs persistence. NNS wins decisively (MAE 1.51 vs 2.66 vs 2.49); recursive ridge degrades below persistence once the h=1 true-lag crutch is removed. Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01TX6vFofX1vnanE147tJFWF --- gists/timeseries/conformal/README.md | 127 +++++++ .../conformal/results_block/scores.csv | 5 + .../conformal/results_block/scores_all.csv | 41 +++ .../conformal/run_block_nns_intervals.py | 304 +++++++++++++++++ gists/timeseries/point_duel/README.md | 93 ++++++ .../point_duel/results_duel/interval.csv | 5 + .../point_duel/results_duel/interval_all.csv | 41 +++ .../point_duel/results_duel/point.csv | 4 + .../point_duel/results_duel/point_all.csv | 31 ++ gists/timeseries/point_duel/run_point_duel.py | 316 ++++++++++++++++++ 10 files changed, 967 insertions(+) create mode 100644 gists/timeseries/conformal/README.md create mode 100644 gists/timeseries/conformal/results_block/scores.csv create mode 100644 gists/timeseries/conformal/results_block/scores_all.csv create mode 100644 gists/timeseries/conformal/run_block_nns_intervals.py create mode 100644 gists/timeseries/point_duel/README.md create mode 100644 gists/timeseries/point_duel/results_duel/interval.csv create mode 100644 gists/timeseries/point_duel/results_duel/interval_all.csv create mode 100644 gists/timeseries/point_duel/results_duel/point.csv create mode 100644 gists/timeseries/point_duel/results_duel/point_all.csv create mode 100644 gists/timeseries/point_duel/run_point_duel.py diff --git a/gists/timeseries/conformal/README.md b/gists/timeseries/conformal/README.md new file mode 100644 index 0000000..79ec107 --- /dev/null +++ b/gists/timeseries/conformal/README.md @@ -0,0 +1,127 @@ +# NNS prediction intervals vs. a conformal wrapper — apples to apples + +A controlled test of **prediction-interval construction** on a single, +heteroskedastic data-generating process. The **point forecast is held fixed**: +every interval in the table below wraps the *same* `NNS.ARMA.optim` block +forecast and the *same* NNS residuals. The only thing that varies is how the +band is drawn. This isolates interval construction with zero point-model and +zero protocol confounding. + +## Why this framing + +This is deliberately **not** the usual online conformal benchmark, and we want +to be explicit that it is an **adaptation**. Standard adaptive conformal methods +(ACI, PID, NexCP, …) are *online, h=1* procedures: they re-anchor on the realized +observation every step, and their coverage is a control-theoretic property of +that feedback loop rather than a statement about any forecast trajectory. An +online method that re-anchors each step will hit its nominal coverage rate on +almost any sequence — it is dragged across each one-step gap rather than +forecasting it. Comparing a structural multi-step forecaster against that on a +1-step task flatters the reactive method and tests the wrong thing. + +So here we **adapt conformal into a forecast wrapper** and ask a sharper +question instead: + +> Given a genuine multi-step forecast made at time *t* with **no online +> updating**, how well does each interval method discern **coverage guarantees +> on a heteroskedastic process** — including *conditional* coverage across the +> volatility regimes, not just the marginal rate? + +## Protocol + +- **DGP:** non-linear, heteroskedastic AR(1) with slow trend, two seasonal + components (periods 50, 200), and piecewise volatility regimes + (σ jumps to 2.5, drops to 0.55, settles at 1.8). 10 seeds, T = 3500. +- **Forecast (fixed for all methods):** at each origin *t*, `NNS.ARMA.optim` + emits a single `h`-step block forecast, with `h = implied_h = + t·(1−0.9)/0.9`. The model (periods/method/bias) is selected on a strictly + **historical** validation tail — the scored block is never shown to the + optimizer (the leak from issue #57 is fixed). No observation inside the block + is fed back: this is forecasting by design, exposed to its own compounding + error. +- **Intervals compared (all on the identical NNS forecast & residuals):** + - **NNS native PI** — `results ± pi_width`, NNS's own nonparametric rule + (a single flat half-width from the residual distribution + bias). + - **NNS + split-CP (flat)** — empirical (1−α) conformal quantile of the same + NNS validation residuals. Same residuals, textbook conformal quantile. + - **NNS + split-CP (per-lead-time)** — a conformal quantile *per lead-time k*, + pooled across past blocks (with a local-window thickening and a flat + fallback while pools are thin). The only method whose band **widens with + horizon**. + - **NNS + Gaussian (flat)** — `z·std(residuals)`, a parametric reference. +- **Scoring:** interval (Winkler) score, marginal coverage, **worst rolling-window + coverage** and **low-/high-vol stratum coverage** (the heteroskedasticity + discriminators), mean width, and Gaussian-proxy CRPS / log-score. + +## Results + +``` +=== BLOCK NNS INTERVAL COMPARISON (point = NNS fixed; mean over 10 seeds, alpha=0.1, target cov=0.9) === + + method marg_cov worst_win_cov cov_lowvol cov_hivol cond_cov_gap width interval_score CRPS logscore + NNS native PI 0.845 0.561 0.927 0.853 0.149 5.658 8.674 1.090 2.076 + NNS + split-CP (flat) 0.824 0.515 0.899 0.850 0.169 5.398 8.735 1.091 2.102 + NNS + Gaussian (flat) 0.821 0.516 0.894 0.842 0.170 5.332 8.773 1.092 2.105 +NNS + split-CP (per-lead) 0.890 0.582 0.983 0.844 0.099 7.995 10.521 1.169 2.268 +``` + +Sorted by interval (Winkler) score, lower is better. Point-forecast MAE was +identical across rows (~1.51, the fixed NNS block forecast); only the bands differ. + +## Findings + +- **NNS's native PI is the efficiency winner and ≈ a flat split-conformal band.** + It posts the best Winkler score (8.674) with the tightest width (5.66), and it + actually covers slightly *better* than the textbook split-conformal quantile on + the *same* residuals (marg 0.845 vs 0.824). NNS's `upm_var`+bias rule is a hair + more conservative than the empirical conformal quantile, but the two track + closely — confirming the native interval is, in effect, a flat split-conformal + interval. +- **Every flat band under-covers this heteroskedastic process.** All three + flat methods land at ~0.82–0.85 marginal vs. the 0.90 target, collapse to + ~0.51–0.56 in the *worst* 100-step window, and carry a ~0.15–0.17 calm-vs-volatile + conditional gap (≈0.90–0.93 in the calm regime, ≈0.84–0.85 in the volatile one). + A width calibrated on historical validation residuals does not transport to a + multi-step block that compounds error and crosses volatility regimes. +- **Only the horizon-adaptive conformal wrapper recovers the guarantee — and pays + for it.** `split-CP (per-lead)` is the lone method to reach near-nominal marginal + coverage (0.890) and the smallest conditional gap (0.099), because it widens with + lead-time and calibrates on realized block residuals. The cost is ~45% wider + intervals (7.995) and the worst Winkler score (10.521). + +## Takeaway + +On a heteroskedastic process, under honest multi-step forecasting with **no online +updating**, there is no free lunch. NNS's native interval is the *efficiency* +winner but does **not** deliver the nominal 90% coverage — least of all in the +volatile regime and the worst windows, exactly where guarantees matter most. +Recovering the guarantee requires a horizon-adaptive conformal wrapper and paying +for it in width. This coverage-vs-efficiency tension — not a single "winner" — is +what the benchmark is built to expose, and it is why we frame this as an +*adaptation to discern coverage guarantees on a heteroskedastic process* rather +than a claim that any one band dominates. + +## How to read it + +- **`NNS native PI` vs `NNS + split-CP (flat)`** is the headline head-to-head: + the *same* residuals, NNS's `upm_var`+bias rule vs. the empirical conformal + quantile. If they track closely, NNS's native band already *is*, in effect, a + flat split-conformal interval. +- **`per-lead-time`** is the only band that adapts to horizon. Watch + **worst-window** and **high-vol stratum** coverage: that is where letting the + width grow with lead-time should pay off on a heteroskedastic process, at some + cost in mean width. +- **Marginal coverage near 0.90 is necessary but cheap.** The honest test on a + heteroskedastic DGP is the **conditional** coverage gap between the calm and + volatile regimes — a flat band that hits 0.90 on average by over-covering the + calm regime and under-covering the volatile one has not actually solved the + problem. + +## Run it + +```bash +pip install ovvo-nns numpy pandas scipy +python run_block_nns_intervals.py +``` + +Writes `results_block/scores.csv` (aggregated) and `scores_all.csv` (per seed). diff --git a/gists/timeseries/conformal/results_block/scores.csv b/gists/timeseries/conformal/results_block/scores.csv new file mode 100644 index 0000000..8252db9 --- /dev/null +++ b/gists/timeseries/conformal/results_block/scores.csv @@ -0,0 +1,5 @@ +method,marg_cov,worst_win_cov,cov_lowvol,cov_hivol,cond_cov_gap,width,interval_score,CRPS,logscore +NNS native PI,0.845217041800643,0.561,0.9271704180064309,0.852572347266881,0.14855305466237945,5.658247904936664,8.673563693193246,1.0902136839659236,2.076059884663743 +NNS + split-CP (flat),0.8243569131832797,0.515,0.8993569131832798,0.84983922829582,0.1686495176848875,5.397754711147717,8.734633715776543,1.090996705962063,2.101657125801346 +NNS + Gaussian (flat),0.8206993569131832,0.516,0.8940514469453376,0.8421221864951768,0.17025723472668813,5.332100605902861,8.772635078222827,1.091613092749769,2.104833304629708 +NNS + split-CP (per-lead),0.890112540192926,0.5820000000000001,0.9829581993569132,0.8437299035369776,0.09909967845659165,7.995348220035973,10.520733085082481,1.1692625763837228,2.267904409014462 diff --git a/gists/timeseries/conformal/results_block/scores_all.csv b/gists/timeseries/conformal/results_block/scores_all.csv new file mode 100644 index 0000000..0e62ba6 --- /dev/null +++ b/gists/timeseries/conformal/results_block/scores_all.csv @@ -0,0 +1,41 @@ +method,marg_cov,worst_win_cov,cov_lowvol,cov_hivol,cond_cov_gap,width,interval_score,CRPS,logscore +NNS native PI,0.8476688102893891,0.67,0.9131832797427653,0.8070739549839229,0.09453376205787789,5.3336289551117595,8.347311881079282,1.0521806181725764,1.9988765275223412 +NNS + split-CP (flat),0.8271704180064309,0.58,0.8954983922829582,0.8311897106109325,0.14919614147909965,5.227161385540743,8.534113534345261,1.0564514057462153,2.047717813637678 +NNS + split-CP (per-lead),0.8886655948553055,0.65,0.9839228295819936,0.8086816720257235,0.09131832797427653,7.846909211270708,10.496079109394278,1.140506122590597,2.250303555178376 +NNS + Gaussian (flat),0.8295819935691319,0.58,0.8858520900321544,0.8279742765273312,0.14276527331189715,5.220493265073913,8.535512774975702,1.0556550754409413,2.0458238791253365 +NNS native PI,0.8287781350482315,0.52,0.9163987138263665,0.8617363344051447,0.2022508038585209,5.538396478733216,9.039315486588494,1.1017051722008695,2.123941052068062 +NNS + split-CP (flat),0.815112540192926,0.51,0.9067524115755627,0.8440514469453376,0.23118971061093252,5.423595785308306,9.16356799285496,1.1045221353056585,2.145312293713436 +NNS + split-CP (per-lead),0.8830385852090032,0.6,0.9887459807073955,0.8295819935691319,0.12025723472668814,7.990812843044477,10.886441408223009,1.185924631575878,2.292922429135039 +NNS + Gaussian (flat),0.8163183279742765,0.52,0.9163987138263665,0.8408360128617364,0.23118971061093252,5.371605324250339,9.132929020698203,1.1034425720507015,2.1344452670414986 +NNS native PI,0.8480707395498392,0.54,0.9453376205787781,0.8713826366559485,0.14115755627009652,5.498313881263709,8.568987367833104,1.0769091090418628,2.088025725407269 +NNS + split-CP (flat),0.8348070739549839,0.47,0.9115755627009646,0.8858520900321544,0.16366559485530552,5.475440096762711,8.646211618108355,1.0779606714369134,2.1374529629188985 +NNS + split-CP (per-lead),0.889871382636656,0.49,0.9758842443729904,0.8697749196141479,0.11061093247588427,7.666059526775997,10.240968683617762,1.1422851084449035,2.299705192621768 +NNS + Gaussian (flat),0.8340032154340836,0.47,0.905144694533762,0.8681672025723473,0.17009646302250803,5.3825939727945,8.623802541408669,1.0773609694299997,2.127147377666515 +NNS native PI,0.8404340836012861,0.55,0.9469453376205788,0.8762057877813505,0.17009646302250803,5.417943415993259,8.509212166486481,1.0513799372860273,2.0645442083490733 +NNS + split-CP (flat),0.8183279742765274,0.51,0.8858520900321544,0.8569131832797428,0.15241157556270102,5.073486836478882,8.35866609307891,1.0468951694182635,2.0402247395377504 +NNS + split-CP (per-lead),0.8971061093247589,0.59,0.9790996784565916,0.8553054662379421,0.08006430868167203,7.868177918914598,10.290785210091181,1.1317381317261397,2.2247260687632786 +NNS + Gaussian (flat),0.8207395498392283,0.54,0.8697749196141479,0.8569131832797428,0.13151125401929264,5.187680907668864,8.312439020985863,1.0459231999241414,2.026847378341962 +NNS native PI,0.8472668810289389,0.51,0.9405144694533762,0.7733118971061094,0.12668810289389065,5.912994900178406,8.952518643444352,1.1094460776961108,2.105079957413961 +NNS + split-CP (flat),0.8243569131832797,0.58,0.9067524115755627,0.7765273311897106,0.15401929260450165,5.307560736619756,8.610536381423858,1.099087789113522,2.0849945815307396 +NNS + split-CP (per-lead),0.8866559485530546,0.6,0.9823151125401929,0.7845659163987139,0.11543408360128615,7.9404115839516605,10.336416034686959,1.1735005767309783,2.2413111688768863 +NNS + Gaussian (flat),0.8171221864951769,0.57,0.887459807073955,0.7733118971061094,0.1508038585209004,5.2117633612580585,8.673499632507543,1.0995952439049514,2.1011013227051474 +NNS native PI,0.860128617363344,0.58,0.9501607717041801,0.8906752411575563,0.15562700964630227,5.662363541166043,8.391587116904702,1.0718817368677682,2.0521000120264206 +NNS + split-CP (flat),0.8243569131832797,0.37,0.887459807073955,0.8745980707395499,0.14919614147909965,5.3533136605690315,8.60722523849902,1.0765820133621553,2.1197718297356456 +NNS + split-CP (per-lead),0.8922829581993569,0.6,0.9871382636655949,0.8778135048231511,0.09453376205787789,7.897905901117482,10.200940248313179,1.1494769211085452,2.242593936246776 +NNS + Gaussian (flat),0.8247588424437299,0.37,0.9019292604501608,0.8729903536977492,0.15241157556270102,5.31108027242601,8.649115111814341,1.0773573984920315,2.1247815699699717 +NNS native PI,0.8432475884244373,0.62,0.9019292604501608,0.8408360128617364,0.10900321543408364,5.958905251551655,8.884737288789054,1.1533883414727397,2.112495798167121 +NNS + split-CP (flat),0.802652733118971,0.55,0.9019292604501608,0.8344051446945338,0.18778135048231515,5.514395594522053,9.282531631715036,1.1632614958510543,2.169479165301043 +NNS + split-CP (per-lead),0.8882636655948553,0.62,0.9823151125401929,0.842443729903537,0.08231511254019286,8.567512061462239,11.159806161705042,1.2453577968361804,2.3305953801588606 +NNS + Gaussian (flat),0.7930064308681672,0.54,0.887459807073955,0.8215434083601286,0.1958199356913184,5.443032862596285,9.404878484224922,1.165957599219953,2.1892321453449055 +NNS native PI,0.8404340836012861,0.42,0.9501607717041801,0.8665594855305466,0.19742765273311902,5.533785215198311,8.749410650539787,1.0714713083435363,2.076368446829223 +NNS + split-CP (flat),0.8336012861736335,0.47,0.9212218649517685,0.8488745980707395,0.16045016077170415,5.394905745099468,8.708916464271185,1.069461483415126,2.071863509791649 +NNS + split-CP (per-lead),0.8858520900321544,0.49,0.9887459807073955,0.837620578778135,0.11382636655948553,7.695935282614414,10.322007696267946,1.145102489810678,2.2482944871036135 +NNS + Gaussian (flat),0.8211414790996785,0.46,0.9115755627009646,0.8408360128617364,0.18135048231511253,5.203825372900099,8.80921517840864,1.071527536672299,2.088411117524039 +NNS native PI,0.8484726688102894,0.58,0.905144694533762,0.8778135048231511,0.13151125401929264,6.028058412975543,8.799144860895987,1.1309121052212678,2.0927473976307103 +NNS + split-CP (flat),0.8311897106109325,0.52,0.887459807073955,0.8697749196141479,0.1620578778135049,5.725218656291451,8.758728314500287,1.130064761156996,2.113226263951555 +NNS + split-CP (per-lead),0.8914790996784566,0.53,0.977491961414791,0.8569131832797428,0.09292604501607715,8.23717370280603,10.704707152434425,1.212676341149698,2.3029740791509625 +NNS + Gaussian (flat),0.8255627009646302,0.52,0.8858520900321544,0.8569131832797428,0.1620578778135049,5.649387814291777,8.818937346230603,1.1313117903011027,2.117039408279818 +NNS native PI,0.8476688102893891,0.62,0.9019292604501608,0.860128617363344,0.1572347266881029,5.69808899719474,8.493411469371221,1.0828624333564756,2.046419721223251 +NNS + split-CP (flat),0.8319935691318328,0.59,0.8890675241157556,0.8762057877813505,0.17652733118971065,5.482468614284766,8.675839888968568,1.0856801348147251,2.0865280978950644 +NNS + split-CP (per-lead),0.8979099678456591,0.65,0.9839228295819936,0.8745980707395499,0.0897106109324759,8.242584168402129,10.569179146091031,1.166057643863629,2.24561779290906 +NNS + Gaussian (flat),0.8247588424437299,0.59,0.8890675241157556,0.8617363344051447,0.1845659163987139,5.339542905768758,8.766021670973783,1.0879995420615687,2.0935035802978836 diff --git a/gists/timeseries/conformal/run_block_nns_intervals.py b/gists/timeseries/conformal/run_block_nns_intervals.py new file mode 100644 index 0000000..382bbf3 --- /dev/null +++ b/gists/timeseries/conformal/run_block_nns_intervals.py @@ -0,0 +1,304 @@ +""" +run_block_nns_intervals.py +================================================================================ +Apples-to-apples prediction-interval comparison, NNS point forecast held FIXED. + +Question answered +----------------- +At each origin t, given only data through t, NNS.ARMA.optim emits a single +h-step block forecast (h = implied_h = current_train*(1-0.9)/0.9, NO online +updating inside the block). We then build several prediction intervals on the +SAME NNS point forecast and the SAME NNS residuals, and ask: + + Does NNS's native prediction interval beat conformalizing NNS's own + residuals — and does letting the band widen with lead-time help? + +Because the point model is identical across every interval method, this isolates +interval construction with zero protocol mismatch. (Contrast issue #57, where the +NNS optimizer also saw the scored block — fixed here: the model is selected on a +strictly historical validation tail, and the scored block is never shown to it.) + +Interval methods compared (all on the identical NNS block point forecast): + * NNS native PI — results ± pi_width (flat width, NNS's own rule) + * NNS + split-CP (flat) — empirical 0.9 quantile of NNS validation residuals + * NNS + split-CP (per-lead)— quantile per lead-time k, pooled across past blocks + (widens with horizon; data-starved -> fallback early) + * NNS + Gaussian (flat) — z * std(NNS validation residuals) + +Run: + pip install ovvo-nns numpy pandas scipy matplotlib + python run_block_nns_intervals.py +""" +from __future__ import annotations + +import os +import math +import time +import warnings +from collections import defaultdict + +import numpy as np +import pandas as pd +from scipy import stats + +warnings.filterwarnings("ignore") + +try: + import nns as NNS +except ImportError: + raise SystemExit("ovvo-nns is required. Run: pip install ovvo-nns") + +# ── Constants ──────────────────────────────────────────────────────────────── +ALPHA = 0.10 +TARGET_COV = 1.0 - ALPHA +N_LAGS = 12 +CAL_END = 1000 # first NNS origin = N_LAGS + CAL_END +WINDOW = 100 +N_SEEDS = 10 +TRAINING_FRAC = 0.90 # implied_h = current_train*(1-frac)/frac +VAL_FRAC = 0.10 # historical validation-tail fraction for model selection +CP_MIN_POOL = 8 # min per-lead samples before trusting per-lead CP +LOCAL_WIN = 2 # +/- lead-time window to thicken per-lead pools + +os.makedirs("results_block", exist_ok=True) + +_MSE = lambda predicted, actual: np.mean((predicted - actual) ** 2) + +# ── DGP ────────────────────────────────────────────────────────────────────── +def make_timeseries(T: int = 3500, seed: int = 0, heavy_tail: bool = False) -> dict: + rng = np.random.default_rng(seed + 1) + tt = np.arange(1, T + 1, dtype=float) + level = 0.002 * tt + 1.50 * np.sin(2*np.pi*tt/50) + 0.75 * np.sin(2*np.pi*tt/200) + sigma = np.ones(T) + sigma[(tt > 900) & (tt <= 1400)] = 2.5 + sigma[(tt > 1900) & (tt <= 2450)] = 0.55 + sigma[(tt > 2800)] = 1.8 + eps = rng.standard_t(5, size=T) / math.sqrt(5/3) if heavy_tail else rng.standard_normal(T) + y = np.empty(T) + y[0] = level[0] + sigma[0] * eps[0] + for i in range(1, T): + y[i] = level[i] + 0.55 * (y[i-1] - level[i-1]) + sigma[i] * eps[i] + return {"y": y, "level": level, "sigma": sigma} + +# ── Scoring helpers ────────────────────────────────────────────────────────── +def z_alpha(alpha: float = ALPHA) -> float: + return float(stats.norm.ppf(1 - alpha / 2)) + +def coverage(lo, hi, y) -> float: + lo, hi, y = map(np.asarray, (lo, hi, y)) + return float(np.mean((y >= lo) & (y <= hi))) + +def mean_width(lo, hi) -> float: + return float(np.mean(np.asarray(hi) - np.asarray(lo))) + +def rolling_coverage(lo, hi, y, window: int = WINDOW) -> np.ndarray: + lo, hi, y = map(np.asarray, (lo, hi, y)) + n = len(y) + if n < window: + return np.array([]) + return np.array([coverage(lo[i:i+window], hi[i:i+window], y[i:i+window]) + for i in range(n - window + 1)]) + +def worst_window_coverage(lo, hi, y, window: int = WINDOW) -> float: + rc = rolling_coverage(lo, hi, y, window) + return float(np.min(rc)) if len(rc) > 0 else float("nan") + +def interval_score(lo, hi, y, alpha: float = ALPHA) -> float: + lo, hi, y = map(np.asarray, (lo, hi, y)) + return float(np.mean((hi - lo) + + (2/alpha) * np.maximum(lo - y, 0) + + (2/alpha) * np.maximum(y - hi, 0))) + +def coverage_by_stratum(lo, hi, y, sigma, k: int = 4) -> list[float]: + lo, hi, y, sigma = map(np.asarray, (lo, hi, y, sigma)) + ranks = stats.rankdata(sigma, method="ordinal") + grp = np.ceil(ranks / len(ranks) * k).astype(int).clip(1, k) + return [coverage(lo[grp == j], hi[grp == j], y[grp == j]) for j in range(1, k+1)] + +def crps_gaussian(mu, sigma, y) -> float: + sigma = np.maximum(np.asarray(sigma, float), 1e-8) + z = (np.asarray(y) - np.asarray(mu)) / sigma + return float(np.mean(sigma * (z * (2*stats.norm.cdf(z) - 1) + + 2*stats.norm.pdf(z) - 1/math.sqrt(math.pi)))) + +def log_score_gaussian(mu, sigma, y) -> float: + sigma = np.maximum(np.asarray(sigma, float), 1e-8) + return float(np.mean(-stats.norm.logpdf(np.asarray(y), loc=np.asarray(mu), scale=sigma))) + +def score_method(name, lo, hi, y, sigma, mu_=None, s_=None) -> dict: + lo_v = np.minimum(np.asarray(lo, float), np.asarray(hi, float)) + hi_v = np.maximum(np.asarray(lo, float), np.asarray(hi, float)) + cbs = coverage_by_stratum(lo_v, hi_v, y, sigma) + row = { + "method": name, + "marg_cov": coverage(lo_v, hi_v, y), + "worst_win_cov": worst_window_coverage(lo_v, hi_v, y), + "cov_lowvol": cbs[0], + "cov_hivol": cbs[-1], + "cond_cov_gap": max(abs(c - TARGET_COV) for c in cbs if not math.isnan(c)), + "width": mean_width(lo_v, hi_v), + "interval_score": interval_score(lo_v, hi_v, y), + "CRPS": float("nan"), + "logscore": float("nan"), + } + if mu_ is not None and s_ is not None: + row["CRPS"] = crps_gaussian(mu_, s_, y) + row["logscore"] = log_score_gaussian(mu_, s_, y) + return row + +# ── NNS seasonal period discovery (full-window) ────────────────────────────── +def get_nns_seas_periods(series: np.ndarray, train_n: int) -> list[int]: + max_period = int(train_n / 3) - 1 + try: + raw = NNS.nns_seas(series, plot=False).get("periods", np.array([])) + seen, valid = set(), [] + for p in raw: + pi = int(p) + if 1 < pi < max_period and pi not in seen: + seen.add(pi); valid.append(pi) + return valid if valid else [2] + except Exception: + return [2] + +def _emp_q(scores, level=TARGET_COV): + """Finite-sample conformal quantile of |scores| at coverage `level`.""" + s = np.sort(np.abs(np.asarray(scores, float))) + n = len(s) + if n == 0: + return None + k = int(np.ceil((n + 1) * level)) - 1 + return float("inf") if k >= n else float(s[k]) + +# ── Block walk-forward: NNS point fixed, intervals swapped ──────────────────── +def run_once(seed: int = 0, heavy_tail: bool = False, verbose: bool = True) -> dict: + d = make_timeseries(T=3500, seed=seed, heavy_tail=heavy_tail) + y, sig = d["y"], d["sigma"] + T_raw = len(y) + z = z_alpha(ALPHA) + + # accumulators: per interval-method lo/hi, shared point/y/sigma per scored step + acc_y, acc_sig, acc_pt = [], [], [] + lo_acc = defaultdict(list); hi_acc = defaultdict(list); sig_acc = defaultdict(list) + per_lead_pool: dict[int, list] = defaultdict(list) # realized |resid| by lead-time + + current_train = N_LAGS + CAL_END + t0 = time.time(); n_chunks = 0 + + while current_train < T_raw: + remaining = T_raw - current_train + implied_h = max(1, int(current_train * (1 - TRAINING_FRAC) / TRAINING_FRAC)) + h = min(remaining, implied_h) + train_slice = y[:current_train] + val_h = max(1, int(round(VAL_FRAC * current_train))) + periods = get_nns_seas_periods(train_slice, current_train) + + try: + fit = NNS.nns_arma_optim( + variable=train_slice, h=h, training_set=current_train - val_h, + seasonal_factor=periods, negative_values=True, + obj_fn=_MSE, objective="min", linear_approximation=True, + pred_int=TARGET_COV, print_trace=False, plot=False, + ) + except Exception as exc: + if verbose: + print(f" [seed {seed} chunk {n_chunks+1} failed: {exc}] – skipping") + current_train += h + continue + + point = np.asarray(fit["results"], float) # bias-corrected block point + nat_lo = np.asarray(fit["lower.pred.int"], float) + nat_hi = np.asarray(fit["upper.pred.int"], float) + val_err = np.asarray(fit["errors"], float) # validation-tail residuals + bias_shift = float(fit["bias.shift"]) + # residuals of the *reported* (bias-corrected) forecast on the validation tail + cal_resid = val_err + bias_shift + + yb = y[current_train:current_train + h] + sgb = sig[current_train:current_train + h] + n_chunks += 1 + + # ── interval method 1: NNS native PI ────────────────────────────── + lo_acc["NNS native PI"].extend(nat_lo) + hi_acc["NNS native PI"].extend(nat_hi) + nat_hw = np.maximum((nat_hi - nat_lo) / 2.0, 1e-8) + sig_acc["NNS native PI"].extend(nat_hw / z) + + # ── interval method 2: split-CP (flat) on NNS validation residuals ─ + q_flat = _emp_q(cal_resid, TARGET_COV) + q_flat = nat_hw.mean() if (q_flat is None or not np.isfinite(q_flat)) else q_flat + lo_acc["NNS + split-CP (flat)"].extend(point - q_flat) + hi_acc["NNS + split-CP (flat)"].extend(point + q_flat) + sig_acc["NNS + split-CP (flat)"].extend(np.full(h, max(q_flat / z, 1e-8))) + + # ── interval method 3: split-CP (per-lead-time) ─────────────────── + pl_lo = np.empty(h); pl_hi = np.empty(h); pl_s = np.empty(h) + for k in range(h): + pool = [] + for kk in range(max(0, k - LOCAL_WIN), k + LOCAL_WIN + 1): + pool.extend(per_lead_pool.get(kk, [])) + if len(pool) >= CP_MIN_POOL: + qk = _emp_q(pool, TARGET_COV) + if qk is None or not np.isfinite(qk): + qk = q_flat + else: + qk = q_flat # fallback while pool is thin + pl_lo[k] = point[k] - qk; pl_hi[k] = point[k] + qk; pl_s[k] = max(qk / z, 1e-8) + lo_acc["NNS + split-CP (per-lead)"].extend(pl_lo) + hi_acc["NNS + split-CP (per-lead)"].extend(pl_hi) + sig_acc["NNS + split-CP (per-lead)"].extend(pl_s) + + # ── interval method 4: Gaussian (flat) ──────────────────────────── + s_flat = max(float(np.std(cal_resid)), 1e-8) + lo_acc["NNS + Gaussian (flat)"].extend(point - z * s_flat) + hi_acc["NNS + Gaussian (flat)"].extend(point + z * s_flat) + sig_acc["NNS + Gaussian (flat)"].extend(np.full(h, s_flat)) + + # shared truth / point + acc_y.extend(yb); acc_sig.extend(sgb); acc_pt.extend(point) + + # update per-lead pools with realized residuals AFTER scoring (now data <= next t) + realized = np.abs(yb - point) + for k in range(h): + per_lead_pool[k].append(realized[k]) + + current_train += h + + y_arr = np.asarray(acc_y); sig_arr = np.asarray(acc_sig); pt_arr = np.asarray(acc_pt) + rows = [] + for name in lo_acc: + lo = np.asarray(lo_acc[name]); hi = np.asarray(hi_acc[name]) + s_ = np.asarray(sig_acc[name]) + rows.append(score_method(name, lo, hi, y_arr, sig_arr, mu_=pt_arr, s_=s_)) + + if verbose: + print(f" [seed {seed}] {n_chunks} blocks, " + f"point MAE={np.mean(np.abs(pt_arr - y_arr)):.3f}, {time.time()-t0:.0f}s") + + return {"scores": pd.DataFrame(rows), "lo": lo_acc, "hi": hi_acc, + "y": y_arr, "sigma": sig_arr, "point": pt_arr} + +# ── Aggregate over seeds ────────────────────────────────────────────────────── +def run_all() -> dict: + all_scores = [] + for s in range(N_SEEDS): + print(f"\n=== seed {s} ===") + all_scores.append(run_once(seed=s, verbose=True)["scores"]) + + scores_dt = pd.concat(all_scores, ignore_index=True) + metric_cols = ["marg_cov", "worst_win_cov", "cov_lowvol", "cov_hivol", + "cond_cov_gap", "width", "interval_score", "CRPS", "logscore"] + agg = (scores_dt.groupby("method", sort=False)[metric_cols] + .mean().reset_index().sort_values("interval_score")) + + scores_dt.to_csv("results_block/scores_all.csv", index=False) + agg.to_csv("results_block/scores.csv", index=False) + + print(f"\n=== BLOCK NNS INTERVAL COMPARISON " + f"(point=NNS fixed; mean over {N_SEEDS} seeds, alpha={ALPHA}, target={TARGET_COV}) ===\n") + print(agg.round(3).to_string(index=False)) + print("\nWrote results_block/scores.csv and scores_all.csv") + return {"scores": scores_dt, "summary": agg} + + +if __name__ == "__main__": + run_all() diff --git a/gists/timeseries/point_duel/README.md b/gists/timeseries/point_duel/README.md new file mode 100644 index 0000000..854f635 --- /dev/null +++ b/gists/timeseries/point_duel/README.md @@ -0,0 +1,93 @@ +# Point-model duel — NNS vs. recursive ridge, one honest protocol + +A fair head-to-head on **point forecasting**, with prediction intervals as a +secondary check. Every method makes the same commitment NNS makes: + +> At each origin *t*, given only data through *t*, forecast an `h`-step block +> (`h = implied_h = t·(1−0.9)/0.9`) with **no online updating**. No method may +> peek at a realized value to predict the next one. + +This is the comparison the earlier conformal benchmark lacked: there the +baselines forecast one step ahead with the *true* lag and refreshed every step, +while NNS forecast a multi-step block blind. Here everyone forecasts the block +blind, so the contest measures forecasting-by-design, not access to the truth. + +## Contenders + +- **NNS block** — `NNS.ARMA.optim`, model (periods/method/bias) selected on a + strictly historical validation tail (leak-free), native block forecast. +- **Ridge (recursive)** — ridge on `N_LAGS` lags, fit on all data ≤ *t*, then + projected `h` steps **recursively** (its own predictions become the lags; no + true intermediate values). This is the baseline stripped of the h=1 crutch. +- **Persistence** — last value carried forward `h` steps (floor). + +Intervals: each point method is wrapped with the **same** per-lead-time +split-conformal band (calibrated on that method's own realized block residuals, +pooled across past origins). NNS also reports its native PI. + +## Results + +Mean over 10 seeds, T = 3500, ~12 blocks/seed. Lower is better throughout. + +**Point forecast (the headline):** + +``` + method MAE RMSE median_AE + NNS block 1.514 2.056 1.122 +Ridge (recursive) 2.664 3.230 2.403 + Persistence 2.493 3.241 1.991 +``` + +**Intervals (same protocol; per-method split-CP + NNS native PI):** + +``` + method marg_cov worst_win_cov width interval_score + NNS native PI 0.845 0.561 5.658 8.674 + NNS block + CP 0.894 0.288 8.306 11.339 +Ridge (recursive) + CP 0.835 0.124 9.819 15.189 + Persistence + CP 0.860 0.000 11.701 18.674 +``` + +## Interpretation + +- **NNS wins the point duel decisively.** MAE 1.51 vs. recursive ridge 2.66 + (~43% lower) and persistence 2.49; same ordering on RMSE and median absolute + error. The result is consistent across all 10 seeds, not seed-driven. +- **Recursive ridge is *worse than persistence* on MAE.** Once you remove the + h=1 crutch (true lag every step) and force a genuine multi-step block, the + linear AR model's error compounds over the horizon until naive carry-forward + beats it. This is the concrete version of the forecaster-by-design vs. + forecaster-by-compulsion distinction: a structural seasonal model projects a + real trajectory; a reactive linear model that only ever learned the one-step + map degrades to noise without the truth fed back. +- **Intervals follow point quality.** NNS's native PI is far the best Winkler + score (8.67) at the tightest width — because better, smaller residuals make a + tighter, better-centred band. The conformal wrappers on the weaker point + models inherit their larger errors and score worse. + +### Caveat on the interval table + +The shared per-lead-time split-conformal wrapper here has a cold-start fallback +(it has no realized block residuals for the first origins), which depresses +**worst-window** coverage for the `+ CP` rows (e.g. NNS block + CP at 0.288). +Read the **point** table (MAE / RMSE / median AE) as the clean, primary result +and the interval table as directional. The carefully-calibrated interval study +lives separately; this branch's job is the point-model duel. + +## Run it + +```bash +pip install ovvo-nns numpy pandas scipy scikit-learn +python run_point_duel.py +``` + +Writes `results_duel/point.csv`, `interval.csv` (aggregated) and the per-seed +`*_all.csv`. + +## Note + +Marginal coverage of every flat band falls short of the 0.90 target on this +heteroskedastic DGP — that is the exchangeability-violation penalty, common to +all methods, not specific to any one. The point tables (MAE / RMSE) are the +clean comparison; the interval tables should be read as efficiency-at-similar- +(under)coverage, not as a coverage guarantee. diff --git a/gists/timeseries/point_duel/results_duel/interval.csv b/gists/timeseries/point_duel/results_duel/interval.csv new file mode 100644 index 0000000..54abe05 --- /dev/null +++ b/gists/timeseries/point_duel/results_duel/interval.csv @@ -0,0 +1,5 @@ +method,marg_cov,worst_win_cov,width,interval_score +NNS native PI,0.845217041800643,0.561,5.658247904936664,8.673563693193246 +NNS block + CP,0.8942926045016077,0.288,8.30644088604347,11.339456828288366 +Ridge (recursive) + CP,0.834766881028939,0.124,9.819236038266109,15.188952711334176 +Persistence + CP,0.8596061093247588,0.0,11.700845119008003,18.674312892485847 diff --git a/gists/timeseries/point_duel/results_duel/interval_all.csv b/gists/timeseries/point_duel/results_duel/interval_all.csv new file mode 100644 index 0000000..1f87543 --- /dev/null +++ b/gists/timeseries/point_duel/results_duel/interval_all.csv @@ -0,0 +1,41 @@ +method,marg_cov,worst_win_cov,width,interval_score +NNS native PI,0.8476688102893891,0.67,5.3336289551117595,8.347311881079282 +NNS block + CP,0.8906752411575563,0.22,8.177099486196127,11.529523035347692 +Ridge (recursive) + CP,0.8372186495176849,0.1,10.278876556971408,16.046670825709686 +Persistence + CP,0.8436495176848875,0.0,10.213249743026658,16.33320386075765 +NNS native PI,0.8287781350482315,0.52,5.538396478733216,9.039315486588494 +NNS block + CP,0.8902733118971061,0.37,8.390728807696368,11.499697971060435 +Ridge (recursive) + CP,0.7990353697749196,0.08,9.39395853105691,15.636645535901216 +Persistence + CP,0.8581189710610932,0.0,11.002413778926346,16.322217439495994 +NNS native PI,0.8480707395498392,0.54,5.498313881263709,8.568987367833104 +NNS block + CP,0.8870578778135049,0.32,7.850041608639307,10.881610133044617 +Ridge (recursive) + CP,0.8331993569131833,0.03,9.461311846092478,15.290577454252144 +Persistence + CP,0.8388263665594855,0.0,13.454901313710085,20.728177627283294 +NNS native PI,0.8404340836012861,0.55,5.417943415993259,8.509212166486481 +NNS block + CP,0.8971061093247589,0.18,8.178784348833949,11.442682746189892 +Ridge (recursive) + CP,0.842443729903537,0.21,9.937090055812732,15.019945691471802 +Persistence + CP,0.9091639871382636,0.0,14.085426811335523,22.670411720818084 +NNS native PI,0.8472668810289389,0.51,5.912994900178406,8.952518643444352 +NNS block + CP,0.8979099678456591,0.27,8.346058510609698,11.244196033911283 +Ridge (recursive) + CP,0.8336012861736335,0.13,9.626844810715227,14.755633310870019 +Persistence + CP,0.9011254019292605,0.0,11.723029193525093,15.965745582710372 +NNS native PI,0.860128617363344,0.58,5.662363541166043,8.391587116904702 +NNS block + CP,0.8922829581993569,0.25,8.141163390003173,11.244453556266834 +Ridge (recursive) + CP,0.8420418006430869,0.25,9.708049657419254,14.244410697181051 +Persistence + CP,0.8942926045016077,0.0,15.174974529034394,25.340887411818077 +NNS native PI,0.8432475884244373,0.62,5.958905251551655,8.884737288789054 +NNS block + CP,0.9115755627009646,0.41,9.046462585915824,11.49001377083638 +Ridge (recursive) + CP,0.8512861736334405,0.13,10.192315755071641,15.29971782370236 +Persistence + CP,0.8110932475884244,0.0,11.285499121470306,19.764628557431646 +NNS native PI,0.8404340836012861,0.42,5.533785215198311,8.749410650539787 +NNS block + CP,0.8854501607717041,0.21,7.999981741192585,11.219855763022137 +Ridge (recursive) + CP,0.8283762057877814,0.1,9.51494956048423,14.79895664009931 +Persistence + CP,0.8802250803858521,0.0,9.804629243119559,14.085680509142973 +NNS native PI,0.8484726688102894,0.58,6.028058412975543,8.799144860895987 +NNS block + CP,0.8959003215434084,0.46,8.40742772515753,11.10618124347616 +Ridge (recursive) + CP,0.8259646302250804,0.06,9.845777723130777,15.51537851501687 +Persistence + CP,0.8046623794212219,0.0,9.304325751647738,16.85406180705558 +NNS native PI,0.8476688102893891,0.62,5.69808899719474,8.493411469371221 +NNS block + CP,0.8946945337620579,0.19,8.52666065619013,11.736354029728231 +Ridge (recursive) + CP,0.8545016077170418,0.15,10.233185885906426,15.281590619137305 +Persistence + CP,0.854903536977492,0.0,10.960001704284322,18.678114408344804 diff --git a/gists/timeseries/point_duel/results_duel/point.csv b/gists/timeseries/point_duel/results_duel/point.csv new file mode 100644 index 0000000..3cfd679 --- /dev/null +++ b/gists/timeseries/point_duel/results_duel/point.csv @@ -0,0 +1,4 @@ +method,MAE,RMSE,median_AE +NNS block,1.5136011272156582,2.0556925586936883,1.12208606811395 +Ridge (recursive),2.6644732816036893,3.2304601906237487,2.4028239883431057 +Persistence,2.493290248083914,3.24141533222641,1.9910934283341892 diff --git a/gists/timeseries/point_duel/results_duel/point_all.csv b/gists/timeseries/point_duel/results_duel/point_all.csv new file mode 100644 index 0000000..7900c80 --- /dev/null +++ b/gists/timeseries/point_duel/results_duel/point_all.csv @@ -0,0 +1,31 @@ +method,MAE,RMSE,median_AE +NNS block,1.4624613757188,2.0095523514648352,1.0714033811834391 +Ridge (recursive),2.7093079489173206,3.2849148002794246,2.4276541206659417 +Persistence,2.5207425403649664,3.1838137342389228,2.0779938586077042 +NNS block,1.5179195200934121,2.0875213420368293,1.105426032360385 +Ridge (recursive),2.8282510401794556,3.3952268823991485,2.5875729210928973 +Persistence,2.3736672434422657,3.040742657230395,1.8980380282638447 +NNS block,1.4996940127463412,2.00757652089875,1.115767753513877 +Ridge (recursive),2.5709301773652182,3.1339434936155377,2.335881443966338 +Persistence,2.9117732238980736,3.794140173915152,2.234972066371844 +NNS block,1.4593534764791098,2.001246583495766,1.0864650913903005 +Ridge (recursive),2.6298954565866794,3.190132803482008,2.3755657648263804 +Persistence,2.4740376467312735,3.304377116943708,1.9953112501835901 +NNS block,1.5271989128921286,2.049606520740077,1.1602056950481388 +Ridge (recursive),2.596901417871215,3.1481544415297216,2.3307565425306667 +Persistence,2.3415372011880113,2.992295950923396,1.9604867694701609 +NNS block,1.4920674612633236,2.008429290064173,1.1172318673345731 +Ridge (recursive),2.614711558422066,3.1746112295444617,2.308934798983637 +Persistence,2.6319063703191286,3.543647928791187,2.0564798773344357 +NNS block,1.6060225425685897,2.1818865891046224,1.1965950466152333 +Ridge (recursive),2.717725103031881,3.295389524149246,2.4449283484676214 +Persistence,2.6267172703935375,3.4629309634408636,1.9490510884493548 +NNS block,1.4824820449499643,2.0065783979058245,1.1035396755182818 +Ridge (recursive),2.6981442061600074,3.249449182036675,2.5191619066463744 +Persistence,2.16538422214265,2.7236495214185332,1.867662248183055 +NNS block,1.5806024483770957,2.126087726777259,1.1775229045548874 +Ridge (recursive),2.6103148900670363,3.1982342648198507,2.298518235730201 +Persistence,2.4716622184905397,3.181652504844425,1.9899147900891014 +NNS block,1.5082094770678165,2.0784402644487474,1.086703233620384 +Ridge (recursive),2.6685510174360134,3.2345452843814164,2.3992658005209986 +Persistence,2.4154745438686924,3.1869027705175195,1.8810243063888008 diff --git a/gists/timeseries/point_duel/run_point_duel.py b/gists/timeseries/point_duel/run_point_duel.py new file mode 100644 index 0000000..4623b56 --- /dev/null +++ b/gists/timeseries/point_duel/run_point_duel.py @@ -0,0 +1,316 @@ +""" +run_point_duel.py +================================================================================ +Point-model duel under one honest protocol. + +Every method makes the SAME commitment NNS makes: at each origin t, given only +data through t, forecast an h-step block (h = implied_h = t*(1-0.9)/0.9) with NO +online updating inside the block. No method gets to peek at a realized value to +predict the next one. This is the apples-to-apples that the original benchmark +lacked (where the baselines forecast h=1 with true lags and refreshed every step). + +Point forecasters +----------------- + * NNS block — NNS.ARMA.optim, model selected on a strictly historical + validation tail (leak-free), native block forecast. + * Ridge (recursive)— ridge on N_LAGS lags, fit on all data <= t, then projected + h steps RECURSIVELY (its own predictions become the lags; + no true intermediate values). This removes the h=1 crutch. + * Persistence — last observed value carried forward h steps (floor). + +Intervals (secondary, same protocol) +------------------------------------ +Each point method is wrapped with the SAME per-lead-time split-conformal band, +calibrated on that method's own realized block residuals pooled across past +origins (flat fallback while thin). NNS additionally reports its native PI. + +Outputs a POINT table (MAE / RMSE, the headline) and an INTERVAL table. + +Run: + pip install ovvo-nns numpy pandas scipy scikit-learn + python run_point_duel.py +""" +from __future__ import annotations + +import os +import math +import time +import warnings +from collections import defaultdict + +import numpy as np +import pandas as pd +from scipy import stats + +warnings.filterwarnings("ignore") + +try: + import nns as NNS +except ImportError: + raise SystemExit("ovvo-nns is required. Run: pip install ovvo-nns") + +try: + from sklearn.linear_model import Ridge + from sklearn.pipeline import make_pipeline + from sklearn.preprocessing import StandardScaler + HAS_SKLEARN = True +except ImportError: + HAS_SKLEARN = False + +# ── Constants ──────────────────────────────────────────────────────────────── +ALPHA = 0.10 +TARGET_COV = 1.0 - ALPHA +N_LAGS = 12 +CAL_END = 1000 +WINDOW = 100 +N_SEEDS = 10 +TRAINING_FRAC = 0.90 +VAL_FRAC = 0.10 +CP_MIN_POOL = 8 +LOCAL_WIN = 2 + +os.makedirs("results_duel", exist_ok=True) +_MSE = lambda predicted, actual: np.mean((predicted - actual) ** 2) + +# ── DGP ────────────────────────────────────────────────────────────────────── +def make_timeseries(T: int = 3500, seed: int = 0, heavy_tail: bool = False) -> dict: + rng = np.random.default_rng(seed + 1) + tt = np.arange(1, T + 1, dtype=float) + level = 0.002 * tt + 1.50 * np.sin(2*np.pi*tt/50) + 0.75 * np.sin(2*np.pi*tt/200) + sigma = np.ones(T) + sigma[(tt > 900) & (tt <= 1400)] = 2.5 + sigma[(tt > 1900) & (tt <= 2450)] = 0.55 + sigma[(tt > 2800)] = 1.8 + eps = rng.standard_t(5, size=T) / math.sqrt(5/3) if heavy_tail else rng.standard_normal(T) + y = np.empty(T) + y[0] = level[0] + sigma[0] * eps[0] + for i in range(1, T): + y[i] = level[i] + 0.55 * (y[i-1] - level[i-1]) + sigma[i] * eps[i] + return {"y": y, "level": level, "sigma": sigma} + +# ── Lag features ───────────────────────────────────────────────────────────── +def lag_features(y: np.ndarray, n_lags: int = N_LAGS): + n = len(y) + yy = y[n_lags:] + X = np.column_stack([y[n_lags - k: n - k] for k in range(1, n_lags + 1)]) + return X, yy + +# ── Point forecasters (all produce an h-step block from the end of y[:t]) ──── +def forecast_ridge_recursive(y_hist: np.ndarray, h: int, n_lags: int = N_LAGS) -> np.ndarray: + """Fit ridge on all data <= t, then project h steps recursively.""" + X, yy = lag_features(y_hist, n_lags) + if HAS_SKLEARN: + model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)) + model.fit(X, yy) + predict = lambda lags: float(model.predict(lags.reshape(1, -1))[0]) + else: + Xtr = np.column_stack([X, np.ones(len(X))]) + coef, *_ = np.linalg.lstsq(Xtr, yy, rcond=None) + predict = lambda lags: float(np.append(lags, 1.0) @ coef) + hist = list(y_hist[-n_lags:]) + out = [] + for _ in range(h): + lags = np.array(hist[-1::-1][:n_lags]) # [y_{t}, y_{t-1}, ..., y_{t-n_lags+1}] + yhat = predict(lags) + out.append(yhat) + hist.append(yhat) # feed prediction back (no true value) + return np.asarray(out) + +def forecast_persistence(y_hist: np.ndarray, h: int) -> np.ndarray: + return np.full(h, float(y_hist[-1])) + +def get_nns_seas_periods(series: np.ndarray, train_n: int) -> list[int]: + max_period = int(train_n / 3) - 1 + try: + raw = NNS.nns_seas(series, plot=False).get("periods", np.array([])) + seen, valid = set(), [] + for p in raw: + pi = int(p) + if 1 < pi < max_period and pi not in seen: + seen.add(pi); valid.append(pi) + return valid if valid else [2] + except Exception: + return [2] + +def forecast_nns_block(y_hist: np.ndarray, h: int): + """Returns (point, native_lo, native_hi, cal_resid) — leak-free block forecast.""" + n = len(y_hist) + val_h = max(1, int(round(VAL_FRAC * n))) + periods = get_nns_seas_periods(y_hist, n) + fit = NNS.nns_arma_optim( + variable=y_hist, h=h, training_set=n - val_h, seasonal_factor=periods, + negative_values=True, obj_fn=_MSE, objective="min", + linear_approximation=True, pred_int=TARGET_COV, print_trace=False, plot=False, + ) + point = np.asarray(fit["results"], float) + nat_lo = np.asarray(fit["lower.pred.int"], float) + nat_hi = np.asarray(fit["upper.pred.int"], float) + cal_resid = np.asarray(fit["errors"], float) + float(fit["bias.shift"]) + return point, nat_lo, nat_hi, cal_resid + +# ── Scoring helpers ────────────────────────────────────────────────────────── +def z_alpha(alpha: float = ALPHA) -> float: + return float(stats.norm.ppf(1 - alpha / 2)) + +def coverage(lo, hi, y) -> float: + lo, hi, y = map(np.asarray, (lo, hi, y)) + return float(np.mean((y >= lo) & (y <= hi))) + +def mean_width(lo, hi) -> float: + return float(np.mean(np.asarray(hi) - np.asarray(lo))) + +def rolling_coverage(lo, hi, y, window: int = WINDOW) -> np.ndarray: + lo, hi, y = map(np.asarray, (lo, hi, y)); n = len(y) + if n < window: + return np.array([]) + return np.array([coverage(lo[i:i+window], hi[i:i+window], y[i:i+window]) + for i in range(n - window + 1)]) + +def worst_window_coverage(lo, hi, y, window: int = WINDOW) -> float: + rc = rolling_coverage(lo, hi, y, window) + return float(np.min(rc)) if len(rc) > 0 else float("nan") + +def interval_score(lo, hi, y, alpha: float = ALPHA) -> float: + lo, hi, y = map(np.asarray, (lo, hi, y)) + return float(np.mean((hi - lo) + (2/alpha)*np.maximum(lo - y, 0) + + (2/alpha)*np.maximum(y - hi, 0))) + +def _emp_q(scores, level=TARGET_COV): + s = np.sort(np.abs(np.asarray(scores, float))); n = len(s) + if n == 0: + return None + k = int(np.ceil((n + 1) * level)) - 1 + return float("inf") if k >= n else float(s[k]) + +# ── One seed ────────────────────────────────────────────────────────────────── +POINT_METHODS = ["NNS block", "Ridge (recursive)", "Persistence"] + +def run_once(seed: int = 0, heavy_tail: bool = False, verbose: bool = True) -> dict: + d = make_timeseries(T=3500, seed=seed, heavy_tail=heavy_tail) + y, sig = d["y"], d["sigma"] + T_raw = len(y); z = z_alpha(ALPHA) + + pt_pred = defaultdict(list) # point forecasts by method + acc_y, acc_sig = [], [] + lo_acc = defaultdict(list); hi_acc = defaultdict(list) + pools = {m: defaultdict(list) for m in POINT_METHODS} # per-method per-lead resid pool + nat_lo_all, nat_hi_all = [], [] # NNS native PI + + current_train = N_LAGS + CAL_END + t0 = time.time(); n_chunks = 0 + + while current_train < T_raw: + remaining = T_raw - current_train + implied_h = max(1, int(current_train * (1 - TRAINING_FRAC) / TRAINING_FRAC)) + h = min(remaining, implied_h) + y_hist = y[:current_train] + yb = y[current_train:current_train + h] + sgb = sig[current_train:current_train + h] + + try: + nns_pt, nat_lo, nat_hi, _ = forecast_nns_block(y_hist, h) + except Exception as exc: + if verbose: + print(f" [seed {seed} chunk {n_chunks+1} NNS failed: {exc}] skip") + current_train += h + continue + + preds = { + "NNS block": nns_pt, + "Ridge (recursive)": forecast_ridge_recursive(y_hist, h), + "Persistence": forecast_persistence(y_hist, h), + } + n_chunks += 1 + + # per-lead split-CP on each method's own past realized residuals + for m in POINT_METHODS: + p = preds[m] + lo = np.empty(h); hi = np.empty(h) + for k in range(h): + pool = [] + for kk in range(max(0, k - LOCAL_WIN), k + LOCAL_WIN + 1): + pool.extend(pools[m].get(kk, [])) + qk = _emp_q(pool, TARGET_COV) if len(pool) >= CP_MIN_POOL else None + if qk is None or not np.isfinite(qk): + # fallback: flat quantile of whatever realized resid we have, else inf->wide + flat_pool = [v for vs in pools[m].values() for v in vs] + qf = _emp_q(flat_pool, TARGET_COV) if flat_pool else None + qk = qf if (qf is not None and np.isfinite(qf)) else float(np.std(p) + 1e-6) + lo[k] = p[k] - qk; hi[k] = p[k] + qk + lo_acc[f"{m} + CP"].extend(lo); hi_acc[f"{m} + CP"].extend(hi) + + nat_lo_all.extend(nat_lo); nat_hi_all.extend(nat_hi) + for m in POINT_METHODS: + pt_pred[m].extend(preds[m]) + acc_y.extend(yb); acc_sig.extend(sgb) + + # update pools AFTER scoring + for m in POINT_METHODS: + r = np.abs(yb - preds[m]) + for k in range(h): + pools[m][k].append(r[k]) + + current_train += h + + y_arr = np.asarray(acc_y); sig_arr = np.asarray(acc_sig) + + # POINT metrics + point_rows = [] + for m in POINT_METHODS: + p = np.asarray(pt_pred[m]) + err = p - y_arr + point_rows.append({ + "method": m, + "MAE": float(np.mean(np.abs(err))), + "RMSE": float(np.sqrt(np.mean(err**2))), + "median_AE": float(np.median(np.abs(err))), + }) + + # INTERVAL metrics + int_rows = [] + def _introw(name, lo, hi): + lo = np.asarray(lo); hi = np.asarray(hi) + return {"method": name, "marg_cov": coverage(lo, hi, y_arr), + "worst_win_cov": worst_window_coverage(lo, hi, y_arr), + "width": mean_width(lo, hi), "interval_score": interval_score(lo, hi, y_arr)} + int_rows.append(_introw("NNS native PI", nat_lo_all, nat_hi_all)) + for name in lo_acc: + int_rows.append(_introw(name, lo_acc[name], hi_acc[name])) + + if verbose: + mae = {m: np.mean(np.abs(np.asarray(pt_pred[m]) - y_arr)) for m in POINT_METHODS} + print(f" [seed {seed}] {n_chunks} blocks MAE: " + + " ".join(f"{m}={mae[m]:.3f}" for m in POINT_METHODS) + + f" ({time.time()-t0:.0f}s)") + + return {"point": pd.DataFrame(point_rows), "interval": pd.DataFrame(int_rows)} + +# ── Aggregate ──────────────────────────────────────────────────────────────── +def run_all() -> dict: + pts, ints = [], [] + for s in range(N_SEEDS): + print(f"\n=== seed {s} ===") + r = run_once(seed=s, verbose=True) + pts.append(r["point"]); ints.append(r["interval"]) + + point_dt = pd.concat(pts, ignore_index=True) + int_dt = pd.concat(ints, ignore_index=True) + point_agg = point_dt.groupby("method", sort=False).mean().reset_index().sort_values("RMSE") + int_agg = int_dt.groupby("method", sort=False).mean().reset_index().sort_values("interval_score") + + point_dt.to_csv("results_duel/point_all.csv", index=False) + int_dt.to_csv("results_duel/interval_all.csv", index=False) + point_agg.to_csv("results_duel/point.csv", index=False) + int_agg.to_csv("results_duel/interval.csv", index=False) + + print(f"\n=== POINT-MODEL DUEL (block h-step, no online updating; " + f"mean over {N_SEEDS} seeds) ===\n") + print(point_agg.round(3).to_string(index=False)) + print(f"\n=== INTERVALS (same protocol, per-method CP + NNS native) ===\n") + print(int_agg.round(3).to_string(index=False)) + print("\nWrote results_duel/{point,interval}{,_all}.csv") + return {"point": point_agg, "interval": int_agg} + + +if __name__ == "__main__": + run_all() From db400139a150e0cc082d4d54b508ae0fe22f5199 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 23 Jun 2026 19:19:55 +0000 Subject: [PATCH 2/3] Consolidate timeseries benchmark into one routine (#57) Merge the point-model duel and the interval study into a single walk-forward in gists/timeseries/conformal/run_conformal.py. One leak-free NNS block forecast per origin now feeds both analyses (previously two scripts each recomputed the expensive NNS forecast), and the conformal calibration is unified so the per-lead split-CP shares one well-seeded fallback across point models -- removing the cold-start that depressed worst-window coverage in the standalone duel. Outputs results/{point,interval}{,_all}.csv and one README covering both tables. Removes the separate gists/timeseries/point_duel/ directory. Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01TX6vFofX1vnanE147tJFWF --- gists/timeseries/conformal/README.md | 218 ++++++------- .../timeseries/conformal/results/interval.csv | 7 + .../conformal/results/interval_all.csv | 61 ++++ .../results}/point.csv | 0 .../results}/point_all.csv | 0 .../conformal/results_block/scores.csv | 5 - .../conformal/results_block/scores_all.csv | 41 --- .../conformal/run_block_nns_intervals.py | 304 ------------------ .../run_conformal.py} | 232 ++++++------- gists/timeseries/point_duel/README.md | 93 ------ .../point_duel/results_duel/interval.csv | 5 - .../point_duel/results_duel/interval_all.csv | 41 --- 12 files changed, 297 insertions(+), 710 deletions(-) create mode 100644 gists/timeseries/conformal/results/interval.csv create mode 100644 gists/timeseries/conformal/results/interval_all.csv rename gists/timeseries/{point_duel/results_duel => conformal/results}/point.csv (100%) rename gists/timeseries/{point_duel/results_duel => conformal/results}/point_all.csv (100%) delete mode 100644 gists/timeseries/conformal/results_block/scores.csv delete mode 100644 gists/timeseries/conformal/results_block/scores_all.csv delete mode 100644 gists/timeseries/conformal/run_block_nns_intervals.py rename gists/timeseries/{point_duel/run_point_duel.py => conformal/run_conformal.py} (54%) delete mode 100644 gists/timeseries/point_duel/README.md delete mode 100644 gists/timeseries/point_duel/results_duel/interval.csv delete mode 100644 gists/timeseries/point_duel/results_duel/interval_all.csv diff --git a/gists/timeseries/conformal/README.md b/gists/timeseries/conformal/README.md index 79ec107..2aabb0d 100644 --- a/gists/timeseries/conformal/README.md +++ b/gists/timeseries/conformal/README.md @@ -1,127 +1,127 @@ -# NNS prediction intervals vs. a conformal wrapper — apples to apples - -A controlled test of **prediction-interval construction** on a single, -heteroskedastic data-generating process. The **point forecast is held fixed**: -every interval in the table below wraps the *same* `NNS.ARMA.optim` block -forecast and the *same* NNS residuals. The only thing that varies is how the -band is drawn. This isolates interval construction with zero point-model and -zero protocol confounding. - -## Why this framing - -This is deliberately **not** the usual online conformal benchmark, and we want -to be explicit that it is an **adaptation**. Standard adaptive conformal methods -(ACI, PID, NexCP, …) are *online, h=1* procedures: they re-anchor on the realized -observation every step, and their coverage is a control-theoretic property of -that feedback loop rather than a statement about any forecast trajectory. An -online method that re-anchors each step will hit its nominal coverage rate on -almost any sequence — it is dragged across each one-step gap rather than -forecasting it. Comparing a structural multi-step forecaster against that on a -1-step task flatters the reactive method and tests the wrong thing. - -So here we **adapt conformal into a forecast wrapper** and ask a sharper -question instead: - -> Given a genuine multi-step forecast made at time *t* with **no online -> updating**, how well does each interval method discern **coverage guarantees -> on a heteroskedastic process** — including *conditional* coverage across the -> volatility regimes, not just the marginal rate? - -## Protocol - -- **DGP:** non-linear, heteroskedastic AR(1) with slow trend, two seasonal - components (periods 50, 200), and piecewise volatility regimes - (σ jumps to 2.5, drops to 0.55, settles at 1.8). 10 seeds, T = 3500. -- **Forecast (fixed for all methods):** at each origin *t*, `NNS.ARMA.optim` - emits a single `h`-step block forecast, with `h = implied_h = - t·(1−0.9)/0.9`. The model (periods/method/bias) is selected on a strictly - **historical** validation tail — the scored block is never shown to the - optimizer (the leak from issue #57 is fixed). No observation inside the block - is fed back: this is forecasting by design, exposed to its own compounding - error. -- **Intervals compared (all on the identical NNS forecast & residuals):** - - **NNS native PI** — `results ± pi_width`, NNS's own nonparametric rule - (a single flat half-width from the residual distribution + bias). - - **NNS + split-CP (flat)** — empirical (1−α) conformal quantile of the same - NNS validation residuals. Same residuals, textbook conformal quantile. - - **NNS + split-CP (per-lead-time)** — a conformal quantile *per lead-time k*, - pooled across past blocks (with a local-window thickening and a flat - fallback while pools are thin). The only method whose band **widens with - horizon**. - - **NNS + Gaussian (flat)** — `z·std(residuals)`, a parametric reference. -- **Scoring:** interval (Winkler) score, marginal coverage, **worst rolling-window - coverage** and **low-/high-vol stratum coverage** (the heteroskedasticity - discriminators), mean width, and Gaussian-proxy CRPS / log-score. - -## Results +# NNS under one honest forecasting protocol +A single block walk-forward over one heteroskedastic data-generating process, +emitting two coherent analyses from the *same* leak-free NNS forecast: + +1. **Point duel** — does NNS actually forecast better than a fair baseline? +2. **Interval study** — given NNS's forecast, is its native prediction interval + better or worse than conformalizing the same residuals? + +This replaces the benchmark withdrawn under issue #57, which (a) leaked each +evaluation chunk into `NNS.ARMA.optim`'s validation tail and (b) compared methods +across mismatched protocols (online h=1 baselines vs. an NNS multi-step block). + +## The protocol (one rule for everyone) + +> At each origin *t*, given only data through *t*, every method forecasts an +> `h`-step block (`h = implied_h = t·(1−0.9)/0.9`) with **no online updating** — +> no method may peek at a realized value to predict the next one. + +This is *forecasting by design*, exposed to its own compounding error — the +opposite of an online method that re-anchors on the truth every step. NNS's model +(periods/method/bias) is selected on a strictly **historical** validation tail, so +the scored block is never shown to the optimizer (the #57 leak is fixed). + +- **DGP:** non-linear, heteroskedastic AR(1), slow trend, two seasonal components + (periods 50, 200), piecewise volatility regimes (σ → 2.5, 0.55, 1.8). 10 seeds, + T = 3500, ~12 blocks/seed. + +## Point duel + +- **NNS block** — `NNS.ARMA.optim` native block forecast (leak-free selection). +- **Ridge (recursive)** — ridge on `N_LAGS` lags, fit on all data ≤ *t*, projected + `h` steps recursively (its own predictions become the lags; no true intermediate + values — this removes the h=1 true-lag crutch the original baselines enjoyed). +- **Persistence** — last value carried forward (floor). + +``` + method MAE RMSE median_AE + NNS block 1.514 2.056 1.122 +Ridge (recursive) 2.664 3.230 2.403 + Persistence 2.493 3.241 1.991 ``` -=== BLOCK NNS INTERVAL COMPARISON (point = NNS fixed; mean over 10 seeds, alpha=0.1, target cov=0.9) === - method marg_cov worst_win_cov cov_lowvol cov_hivol cond_cov_gap width interval_score CRPS logscore - NNS native PI 0.845 0.561 0.927 0.853 0.149 5.658 8.674 1.090 2.076 - NNS + split-CP (flat) 0.824 0.515 0.899 0.850 0.169 5.398 8.735 1.091 2.102 - NNS + Gaussian (flat) 0.821 0.516 0.894 0.842 0.170 5.332 8.773 1.092 2.105 -NNS + split-CP (per-lead) 0.890 0.582 0.983 0.844 0.099 7.995 10.521 1.169 2.268 +## Interval study + +We deliberately **adapt** conformal into a forecast wrapper here, and say so. Online +adaptive conformal (ACI/PID/NexCP) gets its coverage from a per-step feedback loop — +a control-theoretic property, not forecast skill — so pitting it against a multi-step +block forecaster on a 1-step task tests the wrong thing. Instead we hold the NNS point +forecast **fixed** and vary only the band, to discern **coverage guarantees on a +heteroskedastic process** — conditional (per-regime / worst-window) coverage, not just +the marginal rate: + +- **NNS native PI** — `results ± pi_width`, NNS's own flat nonparametric rule. +- **NNS + split-CP (flat)** — empirical (1−α) conformal quantile of NNS residuals. +- **NNS + split-CP (per-lead)** — a quantile *per lead-time k* (the only band that + widens with horizon). +- **NNS + Gaussian (flat)** — `z · std(residuals)`. +- **Ridge / Persistence + split-CP (per-lead)** — same wrapper on the weaker point + models, to show interval quality follows point quality. + +``` + method marg_cov worst_win_cov cov_lowvol cov_hivol cond_cov_gap width interval_score + NNS native PI 0.845 0.561 0.927 0.853 0.149 5.658 8.674 + NNS + split-CP (flat) 0.824 0.515 0.899 0.850 0.169 5.398 8.735 + NNS + Gaussian (flat) 0.821 0.516 0.894 0.842 0.170 5.332 8.773 + NNS + split-CP (per-lead) 0.918 0.649 0.996 0.858 0.097 8.595 10.505 +Ridge (recursive) + split-CP (per-lead) 0.865 0.553 0.955 0.826 0.149 10.169 13.853 + Persistence + split-CP (per-lead) 0.887 0.333 0.981 0.807 0.162 12.107 16.440 ``` -Sorted by interval (Winkler) score, lower is better. Point-forecast MAE was -identical across rows (~1.51, the fixed NNS block forecast); only the bands differ. +Point table sorted by RMSE, interval table by interval (Winkler) score; lower is +better throughout. ## Findings +- **NNS wins the point duel decisively** — MAE 1.51 vs. recursive ridge 2.66 + (~43% lower) and persistence 2.49, same ordering on RMSE/median, consistent + across all 10 seeds. Recursive ridge falls *below* persistence on MAE: strip the + h=1 true-lag crutch and a linear AR model's error compounds over the block until + naive carry-forward beats it. Structural forecasting by design vs. reactive by + compulsion, made concrete. - **NNS's native PI is the efficiency winner and ≈ a flat split-conformal band.** - It posts the best Winkler score (8.674) with the tightest width (5.66), and it - actually covers slightly *better* than the textbook split-conformal quantile on - the *same* residuals (marg 0.845 vs 0.824). NNS's `upm_var`+bias rule is a hair - more conservative than the empirical conformal quantile, but the two track - closely — confirming the native interval is, in effect, a flat split-conformal - interval. -- **Every flat band under-covers this heteroskedastic process.** All three - flat methods land at ~0.82–0.85 marginal vs. the 0.90 target, collapse to - ~0.51–0.56 in the *worst* 100-step window, and carry a ~0.15–0.17 calm-vs-volatile - conditional gap (≈0.90–0.93 in the calm regime, ≈0.84–0.85 in the volatile one). - A width calibrated on historical validation residuals does not transport to a - multi-step block that compounds error and crosses volatility regimes. -- **Only the horizon-adaptive conformal wrapper recovers the guarantee — and pays - for it.** `split-CP (per-lead)` is the lone method to reach near-nominal marginal - coverage (0.890) and the smallest conditional gap (0.099), because it widens with - lead-time and calibrates on realized block residuals. The cost is ~45% wider - intervals (7.995) and the worst Winkler score (10.521). - -## Takeaway - -On a heteroskedastic process, under honest multi-step forecasting with **no online -updating**, there is no free lunch. NNS's native interval is the *efficiency* -winner but does **not** deliver the nominal 90% coverage — least of all in the -volatile regime and the worst windows, exactly where guarantees matter most. -Recovering the guarantee requires a horizon-adaptive conformal wrapper and paying -for it in width. This coverage-vs-efficiency tension — not a single "winner" — is -what the benchmark is built to expose, and it is why we frame this as an -*adaptation to discern coverage guarantees on a heteroskedastic process* rather -than a claim that any one band dominates. + Best interval score (8.674) at the tightest competitive width, and it tracks the + empirical split-conformal quantile on the *same* residuals (8.735) — confirming + the native band is, in effect, flat split conformal, slightly more conservative. +- **Every flat band under-covers the volatile regime** (marg 0.82–0.85, worst-window + ~0.51–0.56, `cov_hivol` ~0.84–0.85). Calibrating a single width on historical + residuals cannot transport to a multi-step block that compounds error and crosses + σ-regimes — a textbook exchangeability failure under heteroskedasticity. +- **Only the horizon-adaptive per-lead wrapper recovers the guarantee** — near-nominal + 0.918 marginal, the smallest conditional gap (0.097) and best worst-window (0.649) — + at ~52% wider intervals and the worst Winkler score among the NNS bands. No free + lunch: coverage on a heteroskedastic process costs width. +- **Interval quality follows point quality.** The *same* per-lead wrapper on the + weaker point models scores far worse (NNS 10.5 ≪ ridge 13.9 ≪ persistence 16.4) + with much wider bands — a better forecast makes a tighter, better-centred interval. ## How to read it -- **`NNS native PI` vs `NNS + split-CP (flat)`** is the headline head-to-head: - the *same* residuals, NNS's `upm_var`+bias rule vs. the empirical conformal - quantile. If they track closely, NNS's native band already *is*, in effect, a - flat split-conformal interval. -- **`per-lead-time`** is the only band that adapts to horizon. Watch - **worst-window** and **high-vol stratum** coverage: that is where letting the - width grow with lead-time should pay off on a heteroskedastic process, at some - cost in mean width. -- **Marginal coverage near 0.90 is necessary but cheap.** The honest test on a - heteroskedastic DGP is the **conditional** coverage gap between the calm and - volatile regimes — a flat band that hits 0.90 on average by over-covering the - calm regime and under-covering the volatile one has not actually solved the - problem. +- **Marginal coverage near 0.90 is necessary but cheap.** On a heteroskedastic DGP + the honest test is the **conditional** gap between calm (`cov_lowvol`) and volatile + (`cov_hivol`) regimes and the **worst rolling window** — a flat band that hits 0.90 + on average by over-covering calm and under-covering volatile has not solved the + problem. Every flat band here under-covers the volatile regime: that is the + exchangeability-violation penalty, common to all of them, not specific to any one. +- **Width is only a fair tie-breaker at matched coverage.** Among methods with + different coverage, a narrower band may just be under-covering more. The interval + (Winkler) score combines the two (`width + (2/α)·exceedance`), which is why it is + the sort key. ## Run it ```bash -pip install ovvo-nns numpy pandas scipy -python run_block_nns_intervals.py +pip install ovvo-nns numpy pandas scipy scikit-learn +python run_conformal.py ``` -Writes `results_block/scores.csv` (aggregated) and `scores_all.csv` (per seed). +Writes `results/point.csv`, `results/interval.csv` (aggregated) and the per-seed +`*_all.csv`. + +## Not yet addressed + +`NNS.ARMA.optim` still treats the tail after `training_set` as its validation target +with no guard or warning when out-of-range data is passed — the upstream gap behind +#57. Adding that guard is left to a follow-up; until then, treat this as the +corrected-content restoration, not a closed case. diff --git a/gists/timeseries/conformal/results/interval.csv b/gists/timeseries/conformal/results/interval.csv new file mode 100644 index 0000000..206495f --- /dev/null +++ b/gists/timeseries/conformal/results/interval.csv @@ -0,0 +1,7 @@ +method,marg_cov,worst_win_cov,cov_lowvol,cov_hivol,cond_cov_gap,width,interval_score +NNS native PI,0.845217041800643,0.561,0.9271704180064309,0.852572347266881,0.14855305466237945,5.658247904936664,8.673563693193246 +NNS + split-CP (flat),0.8243569131832797,0.515,0.8993569131832798,0.84983922829582,0.1686495176848875,5.397754711147717,8.734633715776543 +NNS + Gaussian (flat),0.8206993569131832,0.516,0.8940514469453376,0.8421221864951768,0.17025723472668813,5.332100605902861,8.772635078222827 +NNS + split-CP (per-lead),0.917604501607717,0.649,0.9959807073954984,0.8577170418006432,0.09655948553054661,8.59459297384414,10.505483113927781 +Ridge (recursive) + split-CP (per-lead),0.865032154340836,0.553,0.9546623794212218,0.8263665594855306,0.14855305466237945,10.16897349633998,13.853289701795891 +Persistence + split-CP (per-lead),0.88725884244373,0.333,0.9810289389067524,0.807395498392283,0.1617684887459807,12.106951264773958,16.439867386859973 diff --git a/gists/timeseries/conformal/results/interval_all.csv b/gists/timeseries/conformal/results/interval_all.csv new file mode 100644 index 0000000..cb5ba21 --- /dev/null +++ b/gists/timeseries/conformal/results/interval_all.csv @@ -0,0 +1,61 @@ +method,marg_cov,worst_win_cov,cov_lowvol,cov_hivol,cond_cov_gap,width,interval_score +NNS native PI,0.8476688102893891,0.67,0.9131832797427653,0.8070739549839229,0.09453376205787789,5.3336289551117595,8.347311881079282 +NNS + split-CP (flat),0.8271704180064309,0.58,0.8954983922829582,0.8311897106109325,0.14919614147909965,5.227161385540743,8.534113534345261 +NNS + Gaussian (flat),0.8295819935691319,0.58,0.8858520900321544,0.8279742765273312,0.14276527331189715,5.220493265073913,8.535512774975702 +NNS + split-CP (per-lead),0.9147909967845659,0.65,0.9967845659163987,0.8183279742765274,0.09678456591639872,8.425865285679825,10.568242001830663 +Ridge (recursive) + split-CP (per-lead),0.8661575562700965,0.64,0.9598070739549839,0.7909967845659164,0.10900321543408364,10.581823596303991,14.592993775300465 +Persistence + split-CP (per-lead),0.8721864951768489,0.44,0.9710610932475884,0.707395498392283,0.19260450160771703,10.549102615950632,14.52448029198288 +NNS native PI,0.8287781350482315,0.52,0.9163987138263665,0.8617363344051447,0.2022508038585209,5.538396478733216,9.039315486588494 +NNS + split-CP (flat),0.815112540192926,0.51,0.9067524115755627,0.8440514469453376,0.23118971061093252,5.423595785308306,9.16356799285496 +NNS + Gaussian (flat),0.8163183279742765,0.52,0.9163987138263665,0.8408360128617364,0.23118971061093252,5.371605324250339,9.132929020698203 +NNS + split-CP (per-lead),0.9111736334405145,0.6,0.9967845659163987,0.8456591639871383,0.09678456591639872,8.679187273640938,10.817084198706139 +Ridge (recursive) + split-CP (per-lead),0.8356109324758842,0.55,0.932475884244373,0.7909967845659164,0.1909967845659164,9.768945508312589,14.299961658445255 +Persistence + split-CP (per-lead),0.8934887459807074,0.52,0.9967845659163987,0.7508038585209004,0.14919614147909965,11.420957332270556,14.557742149799997 +NNS native PI,0.8480707395498392,0.54,0.9453376205787781,0.8713826366559485,0.14115755627009652,5.498313881263709,8.568987367833104 +NNS + split-CP (flat),0.8348070739549839,0.47,0.9115755627009646,0.8858520900321544,0.16366559485530552,5.475440096762711,8.646211618108355 +NNS + Gaussian (flat),0.8340032154340836,0.47,0.905144694533762,0.8681672025723473,0.17009646302250803,5.3825939727945,8.623802541408669 +NNS + split-CP (per-lead),0.9127813504823151,0.6,0.9871382636655949,0.8810289389067524,0.09292604501607715,8.192814278346807,10.105807673835704 +Ridge (recursive) + split-CP (per-lead),0.8701768488745981,0.6,0.9517684887459807,0.8520900321543409,0.16045016077170415,9.911929457282502,13.517878973369866 +Persistence + split-CP (per-lead),0.8782154340836013,0.24,0.9919614147909968,0.7893890675241158,0.15562700964630227,13.918516762035877,18.90446027710822 +NNS native PI,0.8404340836012861,0.55,0.9469453376205788,0.8762057877813505,0.17009646302250803,5.417943415993259,8.509212166486481 +NNS + split-CP (flat),0.8183279742765274,0.51,0.8858520900321544,0.8569131832797428,0.15241157556270102,5.073486836478882,8.35866609307891 +NNS + Gaussian (flat),0.8207395498392283,0.54,0.8697749196141479,0.8569131832797428,0.13151125401929264,5.187680907668864,8.312439020985863 +NNS + split-CP (per-lead),0.9232315112540193,0.65,0.9935691318327974,0.8665594855305466,0.09356913183279736,8.47780821511154,10.420173113316652 +Ridge (recursive) + split-CP (per-lead),0.8629421221864951,0.56,0.954983922829582,0.8054662379421221,0.14115755627009652,10.224738158689307,13.735472897847638 +Persistence + split-CP (per-lead),0.9143890675241158,0.11,1.0,0.8118971061093248,0.09999999999999998,14.479797580928631,19.260873784808723 +NNS native PI,0.8472668810289389,0.51,0.9405144694533762,0.7733118971061094,0.12668810289389065,5.912994900178406,8.952518643444352 +NNS + split-CP (flat),0.8243569131832797,0.58,0.9067524115755627,0.7765273311897106,0.15401929260450165,5.307560736619756,8.610536381423858 +NNS + Gaussian (flat),0.8171221864951769,0.57,0.887459807073955,0.7733118971061094,0.1508038585209004,5.2117633612580585,8.673499632507543 +NNS + split-CP (per-lead),0.9212218649517685,0.67,0.9903536977491961,0.8327974276527331,0.0903536977491961,8.612047167435271,10.189042171041397 +Ridge (recursive) + split-CP (per-lead),0.8621382636655949,0.48,0.9421221864951769,0.8183279742765274,0.15401929260450165,9.940825035097003,13.418993354272594 +Persistence + split-CP (per-lead),0.9308681672025724,0.6,0.9453376205787781,0.8569131832797428,0.07588424437299035,12.099187370057201,13.957050019151211 +NNS native PI,0.860128617363344,0.58,0.9501607717041801,0.8906752411575563,0.15562700964630227,5.662363541166043,8.391587116904702 +NNS + split-CP (flat),0.8243569131832797,0.37,0.887459807073955,0.8745980707395499,0.14919614147909965,5.3533136605690315,8.60722523849902 +NNS + Gaussian (flat),0.8247588424437299,0.37,0.9019292604501608,0.8729903536977492,0.15241157556270102,5.31108027242601,8.649115111814341 +NNS + split-CP (per-lead),0.917604501607717,0.62,0.9967845659163987,0.8971061093247589,0.09678456591639872,8.460963622012667,10.275595768590357 +Ridge (recursive) + split-CP (per-lead),0.8677652733118971,0.5,0.9308681672025724,0.8504823151125402,0.14437299035369777,10.008320672419277,13.318984886956887 +Persistence + split-CP (per-lead),0.8995176848874598,0.09,0.9967845659163987,0.7877813504823151,0.1122186495176849,15.609723603863243,21.539983068871113 +NNS native PI,0.8432475884244373,0.62,0.9019292604501608,0.8408360128617364,0.10900321543408364,5.958905251551655,8.884737288789054 +NNS + split-CP (flat),0.802652733118971,0.55,0.9019292604501608,0.8344051446945338,0.18778135048231515,5.514395594522053,9.282531631715036 +NNS + Gaussian (flat),0.7930064308681672,0.54,0.887459807073955,0.8215434083601286,0.1958199356913184,5.443032862596285,9.404878484224922 +NNS + split-CP (per-lead),0.9228295819935691,0.68,1.0,0.8440514469453376,0.09999999999999998,9.20007850046642,10.997501962538344 +Ridge (recursive) + split-CP (per-lead),0.8782154340836013,0.59,0.9839228295819936,0.8456591639871383,0.14758842443729903,10.477958983260796,14.128418910843173 +Persistence + split-CP (per-lead),0.8380225080385852,0.24,0.9823151125401929,0.7379421221864951,0.24565916398713827,11.622172326163268,17.7186519202099 +NNS native PI,0.8404340836012861,0.42,0.9501607717041801,0.8665594855305466,0.19742765273311902,5.533785215198311,8.749410650539787 +NNS + split-CP (flat),0.8336012861736335,0.47,0.9212218649517685,0.8488745980707395,0.16045016077170415,5.394905745099468,8.708916464271185 +NNS + Gaussian (flat),0.8211414790996785,0.46,0.9115755627009646,0.8408360128617364,0.18135048231511253,5.203825372900099,8.80921517840864 +NNS + split-CP (per-lead),0.9143890675241158,0.65,1.0,0.864951768488746,0.09999999999999998,8.344194523078558,10.216779329031802 +Ridge (recursive) + split-CP (per-lead),0.8633440514469454,0.5,0.9630225080385852,0.837620578778135,0.17813504823151127,9.899806431265855,13.626610825171564 +Persistence + split-CP (per-lead),0.9184083601286174,0.72,0.9758842443729904,0.8585209003215434,0.07588424437299035,10.229330031021911,12.202300632363936 +NNS native PI,0.8484726688102894,0.58,0.905144694533762,0.8778135048231511,0.13151125401929264,6.028058412975543,8.799144860895987 +NNS + split-CP (flat),0.8311897106109325,0.52,0.887459807073955,0.8697749196141479,0.1620578778135049,5.725218656291451,8.758728314500287 +NNS + Gaussian (flat),0.8255627009646302,0.52,0.8858520900321544,0.8569131832797428,0.1620578778135049,5.649387814291777,8.818937346230603 +NNS + split-CP (per-lead),0.9172025723472669,0.66,0.9983922829581994,0.8520900321543409,0.09839228295819935,8.733533232757578,10.717532145849189 +Ridge (recursive) + split-CP (per-lead),0.8633440514469454,0.51,0.9565916398713826,0.8279742765273312,0.14276527331189715,10.30781624713549,14.048835006470576 +Persistence + split-CP (per-lead),0.8464630225080386,0.2,0.9565916398713826,0.8842443729903537,0.28906752411575565,9.805366045553008,15.29539421577076 +NNS native PI,0.8476688102893891,0.62,0.9019292604501608,0.860128617363344,0.1572347266881029,5.69808899719474,8.493411469371221 +NNS + split-CP (flat),0.8319935691318328,0.59,0.8890675241157556,0.8762057877813505,0.17652733118971065,5.482468614284766,8.675839888968568 +NNS + Gaussian (flat),0.8247588424437299,0.59,0.8890675241157556,0.8617363344051447,0.1845659163987139,5.339542905768758,8.766021670973783 +NNS + split-CP (per-lead),0.9208199356913184,0.71,1.0,0.8745980707395499,0.09999999999999998,8.819437639911786,10.747072774537559 +Ridge (recursive) + split-CP (per-lead),0.8806270096463023,0.6,0.9710610932475884,0.8440514469453376,0.11704180064308689,10.567570873632986,13.844746729280885 +Persistence + split-CP (per-lead),0.8810289389067524,0.17,0.9935691318327974,0.8890675241157556,0.22154340836012865,11.335358979895252,16.437737508532983 diff --git a/gists/timeseries/point_duel/results_duel/point.csv b/gists/timeseries/conformal/results/point.csv similarity index 100% rename from gists/timeseries/point_duel/results_duel/point.csv rename to gists/timeseries/conformal/results/point.csv diff --git a/gists/timeseries/point_duel/results_duel/point_all.csv b/gists/timeseries/conformal/results/point_all.csv similarity index 100% rename from gists/timeseries/point_duel/results_duel/point_all.csv rename to gists/timeseries/conformal/results/point_all.csv diff --git a/gists/timeseries/conformal/results_block/scores.csv b/gists/timeseries/conformal/results_block/scores.csv deleted file mode 100644 index 8252db9..0000000 --- a/gists/timeseries/conformal/results_block/scores.csv +++ /dev/null @@ -1,5 +0,0 @@ -method,marg_cov,worst_win_cov,cov_lowvol,cov_hivol,cond_cov_gap,width,interval_score,CRPS,logscore -NNS native PI,0.845217041800643,0.561,0.9271704180064309,0.852572347266881,0.14855305466237945,5.658247904936664,8.673563693193246,1.0902136839659236,2.076059884663743 -NNS + split-CP (flat),0.8243569131832797,0.515,0.8993569131832798,0.84983922829582,0.1686495176848875,5.397754711147717,8.734633715776543,1.090996705962063,2.101657125801346 -NNS + Gaussian (flat),0.8206993569131832,0.516,0.8940514469453376,0.8421221864951768,0.17025723472668813,5.332100605902861,8.772635078222827,1.091613092749769,2.104833304629708 -NNS + split-CP (per-lead),0.890112540192926,0.5820000000000001,0.9829581993569132,0.8437299035369776,0.09909967845659165,7.995348220035973,10.520733085082481,1.1692625763837228,2.267904409014462 diff --git a/gists/timeseries/conformal/results_block/scores_all.csv b/gists/timeseries/conformal/results_block/scores_all.csv deleted file mode 100644 index 0e62ba6..0000000 --- a/gists/timeseries/conformal/results_block/scores_all.csv +++ /dev/null @@ -1,41 +0,0 @@ -method,marg_cov,worst_win_cov,cov_lowvol,cov_hivol,cond_cov_gap,width,interval_score,CRPS,logscore -NNS native PI,0.8476688102893891,0.67,0.9131832797427653,0.8070739549839229,0.09453376205787789,5.3336289551117595,8.347311881079282,1.0521806181725764,1.9988765275223412 -NNS + split-CP (flat),0.8271704180064309,0.58,0.8954983922829582,0.8311897106109325,0.14919614147909965,5.227161385540743,8.534113534345261,1.0564514057462153,2.047717813637678 -NNS + split-CP (per-lead),0.8886655948553055,0.65,0.9839228295819936,0.8086816720257235,0.09131832797427653,7.846909211270708,10.496079109394278,1.140506122590597,2.250303555178376 -NNS + Gaussian (flat),0.8295819935691319,0.58,0.8858520900321544,0.8279742765273312,0.14276527331189715,5.220493265073913,8.535512774975702,1.0556550754409413,2.0458238791253365 -NNS native PI,0.8287781350482315,0.52,0.9163987138263665,0.8617363344051447,0.2022508038585209,5.538396478733216,9.039315486588494,1.1017051722008695,2.123941052068062 -NNS + split-CP (flat),0.815112540192926,0.51,0.9067524115755627,0.8440514469453376,0.23118971061093252,5.423595785308306,9.16356799285496,1.1045221353056585,2.145312293713436 -NNS + split-CP (per-lead),0.8830385852090032,0.6,0.9887459807073955,0.8295819935691319,0.12025723472668814,7.990812843044477,10.886441408223009,1.185924631575878,2.292922429135039 -NNS + Gaussian (flat),0.8163183279742765,0.52,0.9163987138263665,0.8408360128617364,0.23118971061093252,5.371605324250339,9.132929020698203,1.1034425720507015,2.1344452670414986 -NNS native PI,0.8480707395498392,0.54,0.9453376205787781,0.8713826366559485,0.14115755627009652,5.498313881263709,8.568987367833104,1.0769091090418628,2.088025725407269 -NNS + split-CP (flat),0.8348070739549839,0.47,0.9115755627009646,0.8858520900321544,0.16366559485530552,5.475440096762711,8.646211618108355,1.0779606714369134,2.1374529629188985 -NNS + split-CP (per-lead),0.889871382636656,0.49,0.9758842443729904,0.8697749196141479,0.11061093247588427,7.666059526775997,10.240968683617762,1.1422851084449035,2.299705192621768 -NNS + Gaussian (flat),0.8340032154340836,0.47,0.905144694533762,0.8681672025723473,0.17009646302250803,5.3825939727945,8.623802541408669,1.0773609694299997,2.127147377666515 -NNS native PI,0.8404340836012861,0.55,0.9469453376205788,0.8762057877813505,0.17009646302250803,5.417943415993259,8.509212166486481,1.0513799372860273,2.0645442083490733 -NNS + split-CP (flat),0.8183279742765274,0.51,0.8858520900321544,0.8569131832797428,0.15241157556270102,5.073486836478882,8.35866609307891,1.0468951694182635,2.0402247395377504 -NNS + split-CP (per-lead),0.8971061093247589,0.59,0.9790996784565916,0.8553054662379421,0.08006430868167203,7.868177918914598,10.290785210091181,1.1317381317261397,2.2247260687632786 -NNS + Gaussian (flat),0.8207395498392283,0.54,0.8697749196141479,0.8569131832797428,0.13151125401929264,5.187680907668864,8.312439020985863,1.0459231999241414,2.026847378341962 -NNS native PI,0.8472668810289389,0.51,0.9405144694533762,0.7733118971061094,0.12668810289389065,5.912994900178406,8.952518643444352,1.1094460776961108,2.105079957413961 -NNS + split-CP (flat),0.8243569131832797,0.58,0.9067524115755627,0.7765273311897106,0.15401929260450165,5.307560736619756,8.610536381423858,1.099087789113522,2.0849945815307396 -NNS + split-CP (per-lead),0.8866559485530546,0.6,0.9823151125401929,0.7845659163987139,0.11543408360128615,7.9404115839516605,10.336416034686959,1.1735005767309783,2.2413111688768863 -NNS + Gaussian (flat),0.8171221864951769,0.57,0.887459807073955,0.7733118971061094,0.1508038585209004,5.2117633612580585,8.673499632507543,1.0995952439049514,2.1011013227051474 -NNS native PI,0.860128617363344,0.58,0.9501607717041801,0.8906752411575563,0.15562700964630227,5.662363541166043,8.391587116904702,1.0718817368677682,2.0521000120264206 -NNS + split-CP (flat),0.8243569131832797,0.37,0.887459807073955,0.8745980707395499,0.14919614147909965,5.3533136605690315,8.60722523849902,1.0765820133621553,2.1197718297356456 -NNS + split-CP (per-lead),0.8922829581993569,0.6,0.9871382636655949,0.8778135048231511,0.09453376205787789,7.897905901117482,10.200940248313179,1.1494769211085452,2.242593936246776 -NNS + Gaussian (flat),0.8247588424437299,0.37,0.9019292604501608,0.8729903536977492,0.15241157556270102,5.31108027242601,8.649115111814341,1.0773573984920315,2.1247815699699717 -NNS native PI,0.8432475884244373,0.62,0.9019292604501608,0.8408360128617364,0.10900321543408364,5.958905251551655,8.884737288789054,1.1533883414727397,2.112495798167121 -NNS + split-CP (flat),0.802652733118971,0.55,0.9019292604501608,0.8344051446945338,0.18778135048231515,5.514395594522053,9.282531631715036,1.1632614958510543,2.169479165301043 -NNS + split-CP (per-lead),0.8882636655948553,0.62,0.9823151125401929,0.842443729903537,0.08231511254019286,8.567512061462239,11.159806161705042,1.2453577968361804,2.3305953801588606 -NNS + Gaussian (flat),0.7930064308681672,0.54,0.887459807073955,0.8215434083601286,0.1958199356913184,5.443032862596285,9.404878484224922,1.165957599219953,2.1892321453449055 -NNS native PI,0.8404340836012861,0.42,0.9501607717041801,0.8665594855305466,0.19742765273311902,5.533785215198311,8.749410650539787,1.0714713083435363,2.076368446829223 -NNS + split-CP (flat),0.8336012861736335,0.47,0.9212218649517685,0.8488745980707395,0.16045016077170415,5.394905745099468,8.708916464271185,1.069461483415126,2.071863509791649 -NNS + split-CP (per-lead),0.8858520900321544,0.49,0.9887459807073955,0.837620578778135,0.11382636655948553,7.695935282614414,10.322007696267946,1.145102489810678,2.2482944871036135 -NNS + Gaussian (flat),0.8211414790996785,0.46,0.9115755627009646,0.8408360128617364,0.18135048231511253,5.203825372900099,8.80921517840864,1.071527536672299,2.088411117524039 -NNS native PI,0.8484726688102894,0.58,0.905144694533762,0.8778135048231511,0.13151125401929264,6.028058412975543,8.799144860895987,1.1309121052212678,2.0927473976307103 -NNS + split-CP (flat),0.8311897106109325,0.52,0.887459807073955,0.8697749196141479,0.1620578778135049,5.725218656291451,8.758728314500287,1.130064761156996,2.113226263951555 -NNS + split-CP (per-lead),0.8914790996784566,0.53,0.977491961414791,0.8569131832797428,0.09292604501607715,8.23717370280603,10.704707152434425,1.212676341149698,2.3029740791509625 -NNS + Gaussian (flat),0.8255627009646302,0.52,0.8858520900321544,0.8569131832797428,0.1620578778135049,5.649387814291777,8.818937346230603,1.1313117903011027,2.117039408279818 -NNS native PI,0.8476688102893891,0.62,0.9019292604501608,0.860128617363344,0.1572347266881029,5.69808899719474,8.493411469371221,1.0828624333564756,2.046419721223251 -NNS + split-CP (flat),0.8319935691318328,0.59,0.8890675241157556,0.8762057877813505,0.17652733118971065,5.482468614284766,8.675839888968568,1.0856801348147251,2.0865280978950644 -NNS + split-CP (per-lead),0.8979099678456591,0.65,0.9839228295819936,0.8745980707395499,0.0897106109324759,8.242584168402129,10.569179146091031,1.166057643863629,2.24561779290906 -NNS + Gaussian (flat),0.8247588424437299,0.59,0.8890675241157556,0.8617363344051447,0.1845659163987139,5.339542905768758,8.766021670973783,1.0879995420615687,2.0935035802978836 diff --git a/gists/timeseries/conformal/run_block_nns_intervals.py b/gists/timeseries/conformal/run_block_nns_intervals.py deleted file mode 100644 index 382bbf3..0000000 --- a/gists/timeseries/conformal/run_block_nns_intervals.py +++ /dev/null @@ -1,304 +0,0 @@ -""" -run_block_nns_intervals.py -================================================================================ -Apples-to-apples prediction-interval comparison, NNS point forecast held FIXED. - -Question answered ------------------ -At each origin t, given only data through t, NNS.ARMA.optim emits a single -h-step block forecast (h = implied_h = current_train*(1-0.9)/0.9, NO online -updating inside the block). We then build several prediction intervals on the -SAME NNS point forecast and the SAME NNS residuals, and ask: - - Does NNS's native prediction interval beat conformalizing NNS's own - residuals — and does letting the band widen with lead-time help? - -Because the point model is identical across every interval method, this isolates -interval construction with zero protocol mismatch. (Contrast issue #57, where the -NNS optimizer also saw the scored block — fixed here: the model is selected on a -strictly historical validation tail, and the scored block is never shown to it.) - -Interval methods compared (all on the identical NNS block point forecast): - * NNS native PI — results ± pi_width (flat width, NNS's own rule) - * NNS + split-CP (flat) — empirical 0.9 quantile of NNS validation residuals - * NNS + split-CP (per-lead)— quantile per lead-time k, pooled across past blocks - (widens with horizon; data-starved -> fallback early) - * NNS + Gaussian (flat) — z * std(NNS validation residuals) - -Run: - pip install ovvo-nns numpy pandas scipy matplotlib - python run_block_nns_intervals.py -""" -from __future__ import annotations - -import os -import math -import time -import warnings -from collections import defaultdict - -import numpy as np -import pandas as pd -from scipy import stats - -warnings.filterwarnings("ignore") - -try: - import nns as NNS -except ImportError: - raise SystemExit("ovvo-nns is required. Run: pip install ovvo-nns") - -# ── Constants ──────────────────────────────────────────────────────────────── -ALPHA = 0.10 -TARGET_COV = 1.0 - ALPHA -N_LAGS = 12 -CAL_END = 1000 # first NNS origin = N_LAGS + CAL_END -WINDOW = 100 -N_SEEDS = 10 -TRAINING_FRAC = 0.90 # implied_h = current_train*(1-frac)/frac -VAL_FRAC = 0.10 # historical validation-tail fraction for model selection -CP_MIN_POOL = 8 # min per-lead samples before trusting per-lead CP -LOCAL_WIN = 2 # +/- lead-time window to thicken per-lead pools - -os.makedirs("results_block", exist_ok=True) - -_MSE = lambda predicted, actual: np.mean((predicted - actual) ** 2) - -# ── DGP ────────────────────────────────────────────────────────────────────── -def make_timeseries(T: int = 3500, seed: int = 0, heavy_tail: bool = False) -> dict: - rng = np.random.default_rng(seed + 1) - tt = np.arange(1, T + 1, dtype=float) - level = 0.002 * tt + 1.50 * np.sin(2*np.pi*tt/50) + 0.75 * np.sin(2*np.pi*tt/200) - sigma = np.ones(T) - sigma[(tt > 900) & (tt <= 1400)] = 2.5 - sigma[(tt > 1900) & (tt <= 2450)] = 0.55 - sigma[(tt > 2800)] = 1.8 - eps = rng.standard_t(5, size=T) / math.sqrt(5/3) if heavy_tail else rng.standard_normal(T) - y = np.empty(T) - y[0] = level[0] + sigma[0] * eps[0] - for i in range(1, T): - y[i] = level[i] + 0.55 * (y[i-1] - level[i-1]) + sigma[i] * eps[i] - return {"y": y, "level": level, "sigma": sigma} - -# ── Scoring helpers ────────────────────────────────────────────────────────── -def z_alpha(alpha: float = ALPHA) -> float: - return float(stats.norm.ppf(1 - alpha / 2)) - -def coverage(lo, hi, y) -> float: - lo, hi, y = map(np.asarray, (lo, hi, y)) - return float(np.mean((y >= lo) & (y <= hi))) - -def mean_width(lo, hi) -> float: - return float(np.mean(np.asarray(hi) - np.asarray(lo))) - -def rolling_coverage(lo, hi, y, window: int = WINDOW) -> np.ndarray: - lo, hi, y = map(np.asarray, (lo, hi, y)) - n = len(y) - if n < window: - return np.array([]) - return np.array([coverage(lo[i:i+window], hi[i:i+window], y[i:i+window]) - for i in range(n - window + 1)]) - -def worst_window_coverage(lo, hi, y, window: int = WINDOW) -> float: - rc = rolling_coverage(lo, hi, y, window) - return float(np.min(rc)) if len(rc) > 0 else float("nan") - -def interval_score(lo, hi, y, alpha: float = ALPHA) -> float: - lo, hi, y = map(np.asarray, (lo, hi, y)) - return float(np.mean((hi - lo) - + (2/alpha) * np.maximum(lo - y, 0) - + (2/alpha) * np.maximum(y - hi, 0))) - -def coverage_by_stratum(lo, hi, y, sigma, k: int = 4) -> list[float]: - lo, hi, y, sigma = map(np.asarray, (lo, hi, y, sigma)) - ranks = stats.rankdata(sigma, method="ordinal") - grp = np.ceil(ranks / len(ranks) * k).astype(int).clip(1, k) - return [coverage(lo[grp == j], hi[grp == j], y[grp == j]) for j in range(1, k+1)] - -def crps_gaussian(mu, sigma, y) -> float: - sigma = np.maximum(np.asarray(sigma, float), 1e-8) - z = (np.asarray(y) - np.asarray(mu)) / sigma - return float(np.mean(sigma * (z * (2*stats.norm.cdf(z) - 1) - + 2*stats.norm.pdf(z) - 1/math.sqrt(math.pi)))) - -def log_score_gaussian(mu, sigma, y) -> float: - sigma = np.maximum(np.asarray(sigma, float), 1e-8) - return float(np.mean(-stats.norm.logpdf(np.asarray(y), loc=np.asarray(mu), scale=sigma))) - -def score_method(name, lo, hi, y, sigma, mu_=None, s_=None) -> dict: - lo_v = np.minimum(np.asarray(lo, float), np.asarray(hi, float)) - hi_v = np.maximum(np.asarray(lo, float), np.asarray(hi, float)) - cbs = coverage_by_stratum(lo_v, hi_v, y, sigma) - row = { - "method": name, - "marg_cov": coverage(lo_v, hi_v, y), - "worst_win_cov": worst_window_coverage(lo_v, hi_v, y), - "cov_lowvol": cbs[0], - "cov_hivol": cbs[-1], - "cond_cov_gap": max(abs(c - TARGET_COV) for c in cbs if not math.isnan(c)), - "width": mean_width(lo_v, hi_v), - "interval_score": interval_score(lo_v, hi_v, y), - "CRPS": float("nan"), - "logscore": float("nan"), - } - if mu_ is not None and s_ is not None: - row["CRPS"] = crps_gaussian(mu_, s_, y) - row["logscore"] = log_score_gaussian(mu_, s_, y) - return row - -# ── NNS seasonal period discovery (full-window) ────────────────────────────── -def get_nns_seas_periods(series: np.ndarray, train_n: int) -> list[int]: - max_period = int(train_n / 3) - 1 - try: - raw = NNS.nns_seas(series, plot=False).get("periods", np.array([])) - seen, valid = set(), [] - for p in raw: - pi = int(p) - if 1 < pi < max_period and pi not in seen: - seen.add(pi); valid.append(pi) - return valid if valid else [2] - except Exception: - return [2] - -def _emp_q(scores, level=TARGET_COV): - """Finite-sample conformal quantile of |scores| at coverage `level`.""" - s = np.sort(np.abs(np.asarray(scores, float))) - n = len(s) - if n == 0: - return None - k = int(np.ceil((n + 1) * level)) - 1 - return float("inf") if k >= n else float(s[k]) - -# ── Block walk-forward: NNS point fixed, intervals swapped ──────────────────── -def run_once(seed: int = 0, heavy_tail: bool = False, verbose: bool = True) -> dict: - d = make_timeseries(T=3500, seed=seed, heavy_tail=heavy_tail) - y, sig = d["y"], d["sigma"] - T_raw = len(y) - z = z_alpha(ALPHA) - - # accumulators: per interval-method lo/hi, shared point/y/sigma per scored step - acc_y, acc_sig, acc_pt = [], [], [] - lo_acc = defaultdict(list); hi_acc = defaultdict(list); sig_acc = defaultdict(list) - per_lead_pool: dict[int, list] = defaultdict(list) # realized |resid| by lead-time - - current_train = N_LAGS + CAL_END - t0 = time.time(); n_chunks = 0 - - while current_train < T_raw: - remaining = T_raw - current_train - implied_h = max(1, int(current_train * (1 - TRAINING_FRAC) / TRAINING_FRAC)) - h = min(remaining, implied_h) - train_slice = y[:current_train] - val_h = max(1, int(round(VAL_FRAC * current_train))) - periods = get_nns_seas_periods(train_slice, current_train) - - try: - fit = NNS.nns_arma_optim( - variable=train_slice, h=h, training_set=current_train - val_h, - seasonal_factor=periods, negative_values=True, - obj_fn=_MSE, objective="min", linear_approximation=True, - pred_int=TARGET_COV, print_trace=False, plot=False, - ) - except Exception as exc: - if verbose: - print(f" [seed {seed} chunk {n_chunks+1} failed: {exc}] – skipping") - current_train += h - continue - - point = np.asarray(fit["results"], float) # bias-corrected block point - nat_lo = np.asarray(fit["lower.pred.int"], float) - nat_hi = np.asarray(fit["upper.pred.int"], float) - val_err = np.asarray(fit["errors"], float) # validation-tail residuals - bias_shift = float(fit["bias.shift"]) - # residuals of the *reported* (bias-corrected) forecast on the validation tail - cal_resid = val_err + bias_shift - - yb = y[current_train:current_train + h] - sgb = sig[current_train:current_train + h] - n_chunks += 1 - - # ── interval method 1: NNS native PI ────────────────────────────── - lo_acc["NNS native PI"].extend(nat_lo) - hi_acc["NNS native PI"].extend(nat_hi) - nat_hw = np.maximum((nat_hi - nat_lo) / 2.0, 1e-8) - sig_acc["NNS native PI"].extend(nat_hw / z) - - # ── interval method 2: split-CP (flat) on NNS validation residuals ─ - q_flat = _emp_q(cal_resid, TARGET_COV) - q_flat = nat_hw.mean() if (q_flat is None or not np.isfinite(q_flat)) else q_flat - lo_acc["NNS + split-CP (flat)"].extend(point - q_flat) - hi_acc["NNS + split-CP (flat)"].extend(point + q_flat) - sig_acc["NNS + split-CP (flat)"].extend(np.full(h, max(q_flat / z, 1e-8))) - - # ── interval method 3: split-CP (per-lead-time) ─────────────────── - pl_lo = np.empty(h); pl_hi = np.empty(h); pl_s = np.empty(h) - for k in range(h): - pool = [] - for kk in range(max(0, k - LOCAL_WIN), k + LOCAL_WIN + 1): - pool.extend(per_lead_pool.get(kk, [])) - if len(pool) >= CP_MIN_POOL: - qk = _emp_q(pool, TARGET_COV) - if qk is None or not np.isfinite(qk): - qk = q_flat - else: - qk = q_flat # fallback while pool is thin - pl_lo[k] = point[k] - qk; pl_hi[k] = point[k] + qk; pl_s[k] = max(qk / z, 1e-8) - lo_acc["NNS + split-CP (per-lead)"].extend(pl_lo) - hi_acc["NNS + split-CP (per-lead)"].extend(pl_hi) - sig_acc["NNS + split-CP (per-lead)"].extend(pl_s) - - # ── interval method 4: Gaussian (flat) ──────────────────────────── - s_flat = max(float(np.std(cal_resid)), 1e-8) - lo_acc["NNS + Gaussian (flat)"].extend(point - z * s_flat) - hi_acc["NNS + Gaussian (flat)"].extend(point + z * s_flat) - sig_acc["NNS + Gaussian (flat)"].extend(np.full(h, s_flat)) - - # shared truth / point - acc_y.extend(yb); acc_sig.extend(sgb); acc_pt.extend(point) - - # update per-lead pools with realized residuals AFTER scoring (now data <= next t) - realized = np.abs(yb - point) - for k in range(h): - per_lead_pool[k].append(realized[k]) - - current_train += h - - y_arr = np.asarray(acc_y); sig_arr = np.asarray(acc_sig); pt_arr = np.asarray(acc_pt) - rows = [] - for name in lo_acc: - lo = np.asarray(lo_acc[name]); hi = np.asarray(hi_acc[name]) - s_ = np.asarray(sig_acc[name]) - rows.append(score_method(name, lo, hi, y_arr, sig_arr, mu_=pt_arr, s_=s_)) - - if verbose: - print(f" [seed {seed}] {n_chunks} blocks, " - f"point MAE={np.mean(np.abs(pt_arr - y_arr)):.3f}, {time.time()-t0:.0f}s") - - return {"scores": pd.DataFrame(rows), "lo": lo_acc, "hi": hi_acc, - "y": y_arr, "sigma": sig_arr, "point": pt_arr} - -# ── Aggregate over seeds ────────────────────────────────────────────────────── -def run_all() -> dict: - all_scores = [] - for s in range(N_SEEDS): - print(f"\n=== seed {s} ===") - all_scores.append(run_once(seed=s, verbose=True)["scores"]) - - scores_dt = pd.concat(all_scores, ignore_index=True) - metric_cols = ["marg_cov", "worst_win_cov", "cov_lowvol", "cov_hivol", - "cond_cov_gap", "width", "interval_score", "CRPS", "logscore"] - agg = (scores_dt.groupby("method", sort=False)[metric_cols] - .mean().reset_index().sort_values("interval_score")) - - scores_dt.to_csv("results_block/scores_all.csv", index=False) - agg.to_csv("results_block/scores.csv", index=False) - - print(f"\n=== BLOCK NNS INTERVAL COMPARISON " - f"(point=NNS fixed; mean over {N_SEEDS} seeds, alpha={ALPHA}, target={TARGET_COV}) ===\n") - print(agg.round(3).to_string(index=False)) - print("\nWrote results_block/scores.csv and scores_all.csv") - return {"scores": scores_dt, "summary": agg} - - -if __name__ == "__main__": - run_all() diff --git a/gists/timeseries/point_duel/run_point_duel.py b/gists/timeseries/conformal/run_conformal.py similarity index 54% rename from gists/timeseries/point_duel/run_point_duel.py rename to gists/timeseries/conformal/run_conformal.py index 4623b56..9245bc0 100644 --- a/gists/timeseries/point_duel/run_point_duel.py +++ b/gists/timeseries/conformal/run_conformal.py @@ -1,34 +1,39 @@ """ -run_point_duel.py +run_conformal.py ================================================================================ -Point-model duel under one honest protocol. - -Every method makes the SAME commitment NNS makes: at each origin t, given only -data through t, forecast an h-step block (h = implied_h = t*(1-0.9)/0.9) with NO -online updating inside the block. No method gets to peek at a realized value to -predict the next one. This is the apples-to-apples that the original benchmark -lacked (where the baselines forecast h=1 with true lags and refreshed every step). - -Point forecasters ------------------ - * NNS block — NNS.ARMA.optim, model selected on a strictly historical - validation tail (leak-free), native block forecast. - * Ridge (recursive)— ridge on N_LAGS lags, fit on all data <= t, then projected - h steps RECURSIVELY (its own predictions become the lags; - no true intermediate values). This removes the h=1 crutch. - * Persistence — last observed value carried forward h steps (floor). - -Intervals (secondary, same protocol) ------------------------------------- -Each point method is wrapped with the SAME per-lead-time split-conformal band, -calibrated on that method's own realized block residuals pooled across past -origins (flat fallback while thin). NNS additionally reports its native PI. - -Outputs a POINT table (MAE / RMSE, the headline) and an INTERVAL table. +Time-series block benchmark — NNS under one honest protocol (issue #57 redux). + +A single walk-forward emits BOTH analyses, from the same leak-free NNS block +forecast (computed once per origin): + + POINT DUEL — who forecasts best? + At each origin t, given only data through t, every method forecasts an + h-step block (h = implied_h = t*(1-0.9)/0.9) with NO online updating — no + method peeks at a realized value to predict the next one. + * NNS block — NNS.ARMA.optim, model selected on a strictly + historical validation tail (leak-free). + * Ridge (recursive)— ridge on N_LAGS lags, fit on all data <= t, projected + h steps recursively (own predictions become lags). + * Persistence — last value carried forward (floor). + + INTERVAL STUDY — given the NNS forecast, how should the band be drawn? + The NNS point forecast is held FIXED and only the interval construction + varies, on identical residuals: + * NNS native PI — results +/- pi_width (flat, NNS's own rule) + * NNS + split-CP (flat) — empirical (1-a) quantile of NNS residuals + * NNS + split-CP (per-lead)— quantile per lead-time k (widens w/ horizon) + * NNS + Gaussian (flat) — z * std(residuals) + Plus Ridge/Persistence + split-CP (per-lead) to show that interval quality + follows point quality. + +Framed as an adaptation to discern coverage guarantees on a heteroskedastic +process: marginal coverage is cheap; the test is conditional (per-regime / +worst-window) coverage and whether width adapts. Run: pip install ovvo-nns numpy pandas scipy scikit-learn - python run_point_duel.py + python run_conformal.py +Writes results/{point,interval}{,_all}.csv """ from __future__ import annotations @@ -69,8 +74,9 @@ CP_MIN_POOL = 8 LOCAL_WIN = 2 -os.makedirs("results_duel", exist_ok=True) +os.makedirs("results", exist_ok=True) _MSE = lambda predicted, actual: np.mean((predicted - actual) ** 2) +POINT_METHODS = ["NNS block", "Ridge (recursive)", "Persistence"] # ── DGP ────────────────────────────────────────────────────────────────────── def make_timeseries(T: int = 3500, seed: int = 0, heavy_tail: bool = False) -> dict: @@ -88,16 +94,14 @@ def make_timeseries(T: int = 3500, seed: int = 0, heavy_tail: bool = False) -> d y[i] = level[i] + 0.55 * (y[i-1] - level[i-1]) + sigma[i] * eps[i] return {"y": y, "level": level, "sigma": sigma} -# ── Lag features ───────────────────────────────────────────────────────────── def lag_features(y: np.ndarray, n_lags: int = N_LAGS): n = len(y) yy = y[n_lags:] X = np.column_stack([y[n_lags - k: n - k] for k in range(1, n_lags + 1)]) return X, yy -# ── Point forecasters (all produce an h-step block from the end of y[:t]) ──── +# ── Point forecasters (each returns an h-step block from end of y_hist) ────── def forecast_ridge_recursive(y_hist: np.ndarray, h: int, n_lags: int = N_LAGS) -> np.ndarray: - """Fit ridge on all data <= t, then project h steps recursively.""" X, yy = lag_features(y_hist, n_lags) if HAS_SKLEARN: model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)) @@ -107,13 +111,10 @@ def forecast_ridge_recursive(y_hist: np.ndarray, h: int, n_lags: int = N_LAGS) - Xtr = np.column_stack([X, np.ones(len(X))]) coef, *_ = np.linalg.lstsq(Xtr, yy, rcond=None) predict = lambda lags: float(np.append(lags, 1.0) @ coef) - hist = list(y_hist[-n_lags:]) - out = [] + hist = list(y_hist[-n_lags:]); out = [] for _ in range(h): - lags = np.array(hist[-1::-1][:n_lags]) # [y_{t}, y_{t-1}, ..., y_{t-n_lags+1}] - yhat = predict(lags) - out.append(yhat) - hist.append(yhat) # feed prediction back (no true value) + lags = np.array(hist[-1::-1][:n_lags]) + yhat = predict(lags); out.append(yhat); hist.append(yhat) return np.asarray(out) def forecast_persistence(y_hist: np.ndarray, h: int) -> np.ndarray: @@ -133,7 +134,7 @@ def get_nns_seas_periods(series: np.ndarray, train_n: int) -> list[int]: return [2] def forecast_nns_block(y_hist: np.ndarray, h: int): - """Returns (point, native_lo, native_hi, cal_resid) — leak-free block forecast.""" + """Leak-free NNS block: returns (point, native_lo, native_hi, cal_resid).""" n = len(y_hist) val_h = max(1, int(round(VAL_FRAC * n))) periods = get_nns_seas_periods(y_hist, n) @@ -148,57 +149,59 @@ def forecast_nns_block(y_hist: np.ndarray, h: int): cal_resid = np.asarray(fit["errors"], float) + float(fit["bias.shift"]) return point, nat_lo, nat_hi, cal_resid -# ── Scoring helpers ────────────────────────────────────────────────────────── +# ── Metrics ────────────────────────────────────────────────────────────────── def z_alpha(alpha: float = ALPHA) -> float: return float(stats.norm.ppf(1 - alpha / 2)) - def coverage(lo, hi, y) -> float: - lo, hi, y = map(np.asarray, (lo, hi, y)) - return float(np.mean((y >= lo) & (y <= hi))) - + lo, hi, y = map(np.asarray, (lo, hi, y)); return float(np.mean((y >= lo) & (y <= hi))) def mean_width(lo, hi) -> float: return float(np.mean(np.asarray(hi) - np.asarray(lo))) - def rolling_coverage(lo, hi, y, window: int = WINDOW) -> np.ndarray: lo, hi, y = map(np.asarray, (lo, hi, y)); n = len(y) - if n < window: - return np.array([]) - return np.array([coverage(lo[i:i+window], hi[i:i+window], y[i:i+window]) - for i in range(n - window + 1)]) - + if n < window: return np.array([]) + return np.array([coverage(lo[i:i+window], hi[i:i+window], y[i:i+window]) for i in range(n-window+1)]) def worst_window_coverage(lo, hi, y, window: int = WINDOW) -> float: - rc = rolling_coverage(lo, hi, y, window) - return float(np.min(rc)) if len(rc) > 0 else float("nan") - + rc = rolling_coverage(lo, hi, y, window); return float(np.min(rc)) if len(rc) > 0 else float("nan") def interval_score(lo, hi, y, alpha: float = ALPHA) -> float: lo, hi, y = map(np.asarray, (lo, hi, y)) - return float(np.mean((hi - lo) + (2/alpha)*np.maximum(lo - y, 0) - + (2/alpha)*np.maximum(y - hi, 0))) - + return float(np.mean((hi - lo) + (2/alpha)*np.maximum(lo - y, 0) + (2/alpha)*np.maximum(y - hi, 0))) +def coverage_by_stratum(lo, hi, y, sigma, k: int = 4) -> list[float]: + lo, hi, y, sigma = map(np.asarray, (lo, hi, y, sigma)) + ranks = stats.rankdata(sigma, method="ordinal") + grp = np.ceil(ranks / len(ranks) * k).astype(int).clip(1, k) + return [coverage(lo[grp == j], hi[grp == j], y[grp == j]) for j in range(1, k+1)] def _emp_q(scores, level=TARGET_COV): s = np.sort(np.abs(np.asarray(scores, float))); n = len(s) - if n == 0: - return None + if n == 0: return None k = int(np.ceil((n + 1) * level)) - 1 return float("inf") if k >= n else float(s[k]) -# ── One seed ────────────────────────────────────────────────────────────────── -POINT_METHODS = ["NNS block", "Ridge (recursive)", "Persistence"] - +# ── One seed: single walk-forward, both analyses ───────────────────────────── def run_once(seed: int = 0, heavy_tail: bool = False, verbose: bool = True) -> dict: d = make_timeseries(T=3500, seed=seed, heavy_tail=heavy_tail) y, sig = d["y"], d["sigma"] T_raw = len(y); z = z_alpha(ALPHA) - pt_pred = defaultdict(list) # point forecasts by method + pt_pred = defaultdict(list) acc_y, acc_sig = [], [] - lo_acc = defaultdict(list); hi_acc = defaultdict(list) - pools = {m: defaultdict(list) for m in POINT_METHODS} # per-method per-lead resid pool - nat_lo_all, nat_hi_all = [], [] # NNS native PI + lo_acc = defaultdict(list); hi_acc = defaultdict(list); sig_acc = defaultdict(list) + per_lead = {m: defaultdict(list) for m in POINT_METHODS} # realized |resid| by lead + flat_pool = {m: [] for m in POINT_METHODS} # realized |resid| pooled current_train = N_LAGS + CAL_END t0 = time.time(); n_chunks = 0 + def cp_width(m, k, q_fallback): + pool = [] + for kk in range(max(0, k - LOCAL_WIN), k + LOCAL_WIN + 1): + pool.extend(per_lead[m].get(kk, [])) + if len(pool) >= CP_MIN_POOL: + q = _emp_q(pool, TARGET_COV) + if q is not None and np.isfinite(q): + return q + qf = _emp_q(flat_pool[m], TARGET_COV) if flat_pool[m] else None + return qf if (qf is not None and np.isfinite(qf)) else q_fallback + while current_train < T_raw: remaining = T_raw - current_train implied_h = max(1, int(current_train * (1 - TRAINING_FRAC) / TRAINING_FRAC)) @@ -208,12 +211,10 @@ def run_once(seed: int = 0, heavy_tail: bool = False, verbose: bool = True) -> d sgb = sig[current_train:current_train + h] try: - nns_pt, nat_lo, nat_hi, _ = forecast_nns_block(y_hist, h) + nns_pt, nat_lo, nat_hi, cal_resid = forecast_nns_block(y_hist, h) except Exception as exc: - if verbose: - print(f" [seed {seed} chunk {n_chunks+1} NNS failed: {exc}] skip") - current_train += h - continue + if verbose: print(f" [seed {seed} chunk {n_chunks+1} NNS failed: {exc}] skip") + current_train += h; continue preds = { "NNS block": nns_pt, @@ -222,66 +223,73 @@ def run_once(seed: int = 0, heavy_tail: bool = False, verbose: bool = True) -> d } n_chunks += 1 - # per-lead split-CP on each method's own past realized residuals + q_flat_nns = _emp_q(cal_resid, TARGET_COV) + if q_flat_nns is None or not np.isfinite(q_flat_nns): + q_flat_nns = float(np.std(cal_resid) + 1e-6) + s_flat_nns = max(float(np.std(cal_resid)), 1e-8) + + # ── INTERVAL STUDY on the fixed NNS point ──────────────────────────── + lo_acc["NNS native PI"].extend(nat_lo); hi_acc["NNS native PI"].extend(nat_hi) + sig_acc["NNS native PI"].extend(np.maximum((nat_hi - nat_lo)/2.0, 1e-8)/z) + + lo_acc["NNS + split-CP (flat)"].extend(nns_pt - q_flat_nns) + hi_acc["NNS + split-CP (flat)"].extend(nns_pt + q_flat_nns) + sig_acc["NNS + split-CP (flat)"].extend(np.full(h, max(q_flat_nns/z, 1e-8))) + + lo_acc["NNS + Gaussian (flat)"].extend(nns_pt - z*s_flat_nns) + hi_acc["NNS + Gaussian (flat)"].extend(nns_pt + z*s_flat_nns) + sig_acc["NNS + Gaussian (flat)"].extend(np.full(h, s_flat_nns)) + + # ── per-lead split-CP for each point method (shared machinery) ─────── for m in POINT_METHODS: p = preds[m] - lo = np.empty(h); hi = np.empty(h) + name = f"{m} + split-CP (per-lead)" if m != "NNS block" else "NNS + split-CP (per-lead)" + lo = np.empty(h); hi = np.empty(h); ss = np.empty(h) for k in range(h): - pool = [] - for kk in range(max(0, k - LOCAL_WIN), k + LOCAL_WIN + 1): - pool.extend(pools[m].get(kk, [])) - qk = _emp_q(pool, TARGET_COV) if len(pool) >= CP_MIN_POOL else None - if qk is None or not np.isfinite(qk): - # fallback: flat quantile of whatever realized resid we have, else inf->wide - flat_pool = [v for vs in pools[m].values() for v in vs] - qf = _emp_q(flat_pool, TARGET_COV) if flat_pool else None - qk = qf if (qf is not None and np.isfinite(qf)) else float(np.std(p) + 1e-6) - lo[k] = p[k] - qk; hi[k] = p[k] + qk - lo_acc[f"{m} + CP"].extend(lo); hi_acc[f"{m} + CP"].extend(hi) - - nat_lo_all.extend(nat_lo); nat_hi_all.extend(nat_hi) + qk = cp_width(m, k, q_flat_nns) + lo[k] = p[k] - qk; hi[k] = p[k] + qk; ss[k] = max(qk/z, 1e-8) + lo_acc[name].extend(lo); hi_acc[name].extend(hi); sig_acc[name].extend(ss) + for m in POINT_METHODS: pt_pred[m].extend(preds[m]) acc_y.extend(yb); acc_sig.extend(sgb) - # update pools AFTER scoring + # update pools AFTER scoring (now data <= next origin) for m in POINT_METHODS: r = np.abs(yb - preds[m]) + flat_pool[m].extend(r.tolist()) for k in range(h): - pools[m][k].append(r[k]) + per_lead[m][k].append(r[k]) current_train += h y_arr = np.asarray(acc_y); sig_arr = np.asarray(acc_sig) - # POINT metrics + # POINT table point_rows = [] for m in POINT_METHODS: - p = np.asarray(pt_pred[m]) - err = p - y_arr - point_rows.append({ - "method": m, - "MAE": float(np.mean(np.abs(err))), - "RMSE": float(np.sqrt(np.mean(err**2))), - "median_AE": float(np.median(np.abs(err))), - }) + err = np.asarray(pt_pred[m]) - y_arr + point_rows.append({"method": m, "MAE": float(np.mean(np.abs(err))), + "RMSE": float(np.sqrt(np.mean(err**2))), + "median_AE": float(np.median(np.abs(err)))}) - # INTERVAL metrics + # INTERVAL table int_rows = [] - def _introw(name, lo, hi): - lo = np.asarray(lo); hi = np.asarray(hi) - return {"method": name, "marg_cov": coverage(lo, hi, y_arr), - "worst_win_cov": worst_window_coverage(lo, hi, y_arr), - "width": mean_width(lo, hi), "interval_score": interval_score(lo, hi, y_arr)} - int_rows.append(_introw("NNS native PI", nat_lo_all, nat_hi_all)) for name in lo_acc: - int_rows.append(_introw(name, lo_acc[name], hi_acc[name])) + lo = np.asarray(lo_acc[name]); hi = np.asarray(hi_acc[name]); s_ = np.asarray(sig_acc[name]) + cbs = coverage_by_stratum(lo, hi, y_arr, sig_arr) + int_rows.append({ + "method": name, "marg_cov": coverage(lo, hi, y_arr), + "worst_win_cov": worst_window_coverage(lo, hi, y_arr), + "cov_lowvol": cbs[0], "cov_hivol": cbs[-1], + "cond_cov_gap": max(abs(c - TARGET_COV) for c in cbs if not math.isnan(c)), + "width": mean_width(lo, hi), "interval_score": interval_score(lo, hi, y_arr), + }) if verbose: mae = {m: np.mean(np.abs(np.asarray(pt_pred[m]) - y_arr)) for m in POINT_METHODS} print(f" [seed {seed}] {n_chunks} blocks MAE: " - + " ".join(f"{m}={mae[m]:.3f}" for m in POINT_METHODS) - + f" ({time.time()-t0:.0f}s)") + + " ".join(f"{m}={mae[m]:.3f}" for m in POINT_METHODS) + f" ({time.time()-t0:.0f}s)") return {"point": pd.DataFrame(point_rows), "interval": pd.DataFrame(int_rows)} @@ -298,17 +306,17 @@ def run_all() -> dict: point_agg = point_dt.groupby("method", sort=False).mean().reset_index().sort_values("RMSE") int_agg = int_dt.groupby("method", sort=False).mean().reset_index().sort_values("interval_score") - point_dt.to_csv("results_duel/point_all.csv", index=False) - int_dt.to_csv("results_duel/interval_all.csv", index=False) - point_agg.to_csv("results_duel/point.csv", index=False) - int_agg.to_csv("results_duel/interval.csv", index=False) + point_dt.to_csv("results/point_all.csv", index=False) + int_dt.to_csv("results/interval_all.csv", index=False) + point_agg.to_csv("results/point.csv", index=False) + int_agg.to_csv("results/interval.csv", index=False) - print(f"\n=== POINT-MODEL DUEL (block h-step, no online updating; " - f"mean over {N_SEEDS} seeds) ===\n") + pd.set_option("display.width", 200, "display.max_columns", 20) + print(f"\n=== POINT DUEL (block h-step, no online updating; mean over {N_SEEDS} seeds) ===\n") print(point_agg.round(3).to_string(index=False)) - print(f"\n=== INTERVALS (same protocol, per-method CP + NNS native) ===\n") + print(f"\n=== INTERVAL STUDY (point = NNS fixed; alpha={ALPHA}, target={TARGET_COV}) ===\n") print(int_agg.round(3).to_string(index=False)) - print("\nWrote results_duel/{point,interval}{,_all}.csv") + print("\nWrote results/{point,interval}{,_all}.csv") return {"point": point_agg, "interval": int_agg} diff --git a/gists/timeseries/point_duel/README.md b/gists/timeseries/point_duel/README.md deleted file mode 100644 index 854f635..0000000 --- a/gists/timeseries/point_duel/README.md +++ /dev/null @@ -1,93 +0,0 @@ -# Point-model duel — NNS vs. recursive ridge, one honest protocol - -A fair head-to-head on **point forecasting**, with prediction intervals as a -secondary check. Every method makes the same commitment NNS makes: - -> At each origin *t*, given only data through *t*, forecast an `h`-step block -> (`h = implied_h = t·(1−0.9)/0.9`) with **no online updating**. No method may -> peek at a realized value to predict the next one. - -This is the comparison the earlier conformal benchmark lacked: there the -baselines forecast one step ahead with the *true* lag and refreshed every step, -while NNS forecast a multi-step block blind. Here everyone forecasts the block -blind, so the contest measures forecasting-by-design, not access to the truth. - -## Contenders - -- **NNS block** — `NNS.ARMA.optim`, model (periods/method/bias) selected on a - strictly historical validation tail (leak-free), native block forecast. -- **Ridge (recursive)** — ridge on `N_LAGS` lags, fit on all data ≤ *t*, then - projected `h` steps **recursively** (its own predictions become the lags; no - true intermediate values). This is the baseline stripped of the h=1 crutch. -- **Persistence** — last value carried forward `h` steps (floor). - -Intervals: each point method is wrapped with the **same** per-lead-time -split-conformal band (calibrated on that method's own realized block residuals, -pooled across past origins). NNS also reports its native PI. - -## Results - -Mean over 10 seeds, T = 3500, ~12 blocks/seed. Lower is better throughout. - -**Point forecast (the headline):** - -``` - method MAE RMSE median_AE - NNS block 1.514 2.056 1.122 -Ridge (recursive) 2.664 3.230 2.403 - Persistence 2.493 3.241 1.991 -``` - -**Intervals (same protocol; per-method split-CP + NNS native PI):** - -``` - method marg_cov worst_win_cov width interval_score - NNS native PI 0.845 0.561 5.658 8.674 - NNS block + CP 0.894 0.288 8.306 11.339 -Ridge (recursive) + CP 0.835 0.124 9.819 15.189 - Persistence + CP 0.860 0.000 11.701 18.674 -``` - -## Interpretation - -- **NNS wins the point duel decisively.** MAE 1.51 vs. recursive ridge 2.66 - (~43% lower) and persistence 2.49; same ordering on RMSE and median absolute - error. The result is consistent across all 10 seeds, not seed-driven. -- **Recursive ridge is *worse than persistence* on MAE.** Once you remove the - h=1 crutch (true lag every step) and force a genuine multi-step block, the - linear AR model's error compounds over the horizon until naive carry-forward - beats it. This is the concrete version of the forecaster-by-design vs. - forecaster-by-compulsion distinction: a structural seasonal model projects a - real trajectory; a reactive linear model that only ever learned the one-step - map degrades to noise without the truth fed back. -- **Intervals follow point quality.** NNS's native PI is far the best Winkler - score (8.67) at the tightest width — because better, smaller residuals make a - tighter, better-centred band. The conformal wrappers on the weaker point - models inherit their larger errors and score worse. - -### Caveat on the interval table - -The shared per-lead-time split-conformal wrapper here has a cold-start fallback -(it has no realized block residuals for the first origins), which depresses -**worst-window** coverage for the `+ CP` rows (e.g. NNS block + CP at 0.288). -Read the **point** table (MAE / RMSE / median AE) as the clean, primary result -and the interval table as directional. The carefully-calibrated interval study -lives separately; this branch's job is the point-model duel. - -## Run it - -```bash -pip install ovvo-nns numpy pandas scipy scikit-learn -python run_point_duel.py -``` - -Writes `results_duel/point.csv`, `interval.csv` (aggregated) and the per-seed -`*_all.csv`. - -## Note - -Marginal coverage of every flat band falls short of the 0.90 target on this -heteroskedastic DGP — that is the exchangeability-violation penalty, common to -all methods, not specific to any one. The point tables (MAE / RMSE) are the -clean comparison; the interval tables should be read as efficiency-at-similar- -(under)coverage, not as a coverage guarantee. diff --git a/gists/timeseries/point_duel/results_duel/interval.csv b/gists/timeseries/point_duel/results_duel/interval.csv deleted file mode 100644 index 54abe05..0000000 --- a/gists/timeseries/point_duel/results_duel/interval.csv +++ /dev/null @@ -1,5 +0,0 @@ -method,marg_cov,worst_win_cov,width,interval_score -NNS native PI,0.845217041800643,0.561,5.658247904936664,8.673563693193246 -NNS block + CP,0.8942926045016077,0.288,8.30644088604347,11.339456828288366 -Ridge (recursive) + CP,0.834766881028939,0.124,9.819236038266109,15.188952711334176 -Persistence + CP,0.8596061093247588,0.0,11.700845119008003,18.674312892485847 diff --git a/gists/timeseries/point_duel/results_duel/interval_all.csv b/gists/timeseries/point_duel/results_duel/interval_all.csv deleted file mode 100644 index 1f87543..0000000 --- a/gists/timeseries/point_duel/results_duel/interval_all.csv +++ /dev/null @@ -1,41 +0,0 @@ -method,marg_cov,worst_win_cov,width,interval_score -NNS native PI,0.8476688102893891,0.67,5.3336289551117595,8.347311881079282 -NNS block + CP,0.8906752411575563,0.22,8.177099486196127,11.529523035347692 -Ridge (recursive) + CP,0.8372186495176849,0.1,10.278876556971408,16.046670825709686 -Persistence + CP,0.8436495176848875,0.0,10.213249743026658,16.33320386075765 -NNS native PI,0.8287781350482315,0.52,5.538396478733216,9.039315486588494 -NNS block + CP,0.8902733118971061,0.37,8.390728807696368,11.499697971060435 -Ridge (recursive) + CP,0.7990353697749196,0.08,9.39395853105691,15.636645535901216 -Persistence + CP,0.8581189710610932,0.0,11.002413778926346,16.322217439495994 -NNS native PI,0.8480707395498392,0.54,5.498313881263709,8.568987367833104 -NNS block + CP,0.8870578778135049,0.32,7.850041608639307,10.881610133044617 -Ridge (recursive) + CP,0.8331993569131833,0.03,9.461311846092478,15.290577454252144 -Persistence + CP,0.8388263665594855,0.0,13.454901313710085,20.728177627283294 -NNS native PI,0.8404340836012861,0.55,5.417943415993259,8.509212166486481 -NNS block + CP,0.8971061093247589,0.18,8.178784348833949,11.442682746189892 -Ridge (recursive) + CP,0.842443729903537,0.21,9.937090055812732,15.019945691471802 -Persistence + CP,0.9091639871382636,0.0,14.085426811335523,22.670411720818084 -NNS native PI,0.8472668810289389,0.51,5.912994900178406,8.952518643444352 -NNS block + CP,0.8979099678456591,0.27,8.346058510609698,11.244196033911283 -Ridge (recursive) + CP,0.8336012861736335,0.13,9.626844810715227,14.755633310870019 -Persistence + CP,0.9011254019292605,0.0,11.723029193525093,15.965745582710372 -NNS native PI,0.860128617363344,0.58,5.662363541166043,8.391587116904702 -NNS block + CP,0.8922829581993569,0.25,8.141163390003173,11.244453556266834 -Ridge (recursive) + CP,0.8420418006430869,0.25,9.708049657419254,14.244410697181051 -Persistence + CP,0.8942926045016077,0.0,15.174974529034394,25.340887411818077 -NNS native PI,0.8432475884244373,0.62,5.958905251551655,8.884737288789054 -NNS block + CP,0.9115755627009646,0.41,9.046462585915824,11.49001377083638 -Ridge (recursive) + CP,0.8512861736334405,0.13,10.192315755071641,15.29971782370236 -Persistence + CP,0.8110932475884244,0.0,11.285499121470306,19.764628557431646 -NNS native PI,0.8404340836012861,0.42,5.533785215198311,8.749410650539787 -NNS block + CP,0.8854501607717041,0.21,7.999981741192585,11.219855763022137 -Ridge (recursive) + CP,0.8283762057877814,0.1,9.51494956048423,14.79895664009931 -Persistence + CP,0.8802250803858521,0.0,9.804629243119559,14.085680509142973 -NNS native PI,0.8484726688102894,0.58,6.028058412975543,8.799144860895987 -NNS block + CP,0.8959003215434084,0.46,8.40742772515753,11.10618124347616 -Ridge (recursive) + CP,0.8259646302250804,0.06,9.845777723130777,15.51537851501687 -Persistence + CP,0.8046623794212219,0.0,9.304325751647738,16.85406180705558 -NNS native PI,0.8476688102893891,0.62,5.69808899719474,8.493411469371221 -NNS block + CP,0.8946945337620579,0.19,8.52666065619013,11.736354029728231 -Ridge (recursive) + CP,0.8545016077170418,0.15,10.233185885906426,15.281590619137305 -Persistence + CP,0.854903536977492,0.0,10.960001704284322,18.678114408344804 From 8e4e3b065c03e3258ab92126b97565f2e268eb01 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 23 Jun 2026 19:22:48 +0000 Subject: [PATCH 3/3] Drop upstream-guard note from benchmark README (#57) Correct `variable` specification at the call site (variable=train_slice, never y[:end_i]) prevents the evaluation-chunk leak, so there is no separate upstream guard to flag as outstanding. Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01TX6vFofX1vnanE147tJFWF --- gists/timeseries/conformal/README.md | 7 ------- 1 file changed, 7 deletions(-) diff --git a/gists/timeseries/conformal/README.md b/gists/timeseries/conformal/README.md index 2aabb0d..8761992 100644 --- a/gists/timeseries/conformal/README.md +++ b/gists/timeseries/conformal/README.md @@ -118,10 +118,3 @@ python run_conformal.py Writes `results/point.csv`, `results/interval.csv` (aggregated) and the per-seed `*_all.csv`. - -## Not yet addressed - -`NNS.ARMA.optim` still treats the tail after `training_set` as its validation target -with no guard or warning when out-of-range data is passed — the upstream gap behind -#57. Adding that guard is left to a follow-up; until then, treat this as the -corrected-content restoration, not a closed case.