feat(llm): add shared model and run registry by houtanb · Pull Request #26 · forecastingresearch/utils

houtanb · 2026-05-29T07:25:47Z

Move canonical LLM model metadata and benchmarkable model-run declarations into utils so downstream repos can select shared runs by stable model_run_key.

Add Models.dev and Artificial Analysis metadata snapshots and loaders. Resolve release dates from Models.dev with manual fallbacks, separate canonical model_key values from provider_model_id routing strings, and validate model declarations during registry construction.

Require every ModelRun to declare an explicit, filename-safe model_run_key. Keep build_model_run_key as a naming helper for option coverage, validate duplicate keys, and expose MODEL_RUNS_BY_KEY/select_model_runs for benchmark selection.

Add Model.active and ACTIVE_MODEL_RUNS so historical runs remain in MODEL_RUNS while runs depending on inactive provider routes are excluded from current live-callable benchmark sweeps. Mark the Together deepseek-v3.1 route inactive and replace live smoke tests with the active MiniMax M2.7 route.

Add Artificial Analysis-backed model-run declarations as benchmark-selectable runs that are automatically included in MODEL_RUNS, with display names resolved from a minimized checked-in AA snapshot containing only stable IDs and display names.

Add third-party notices for Models.dev's MIT license and Artificial Analysis attribution, and include those notices in built wheel license metadata.

Move shared LLM provider dependencies into pyproject metadata, make requirements.txt delegate to .[dev], configure the package for Python 3.14, and preserve pytest-xdist for parallel integration tests.

Document registry conventions, local dev setup, validation commands, and Claude/agent handoff files. Add unit and integration coverage for metadata snapshots, registry validation, provider routing, explicit model-run keys, active model-run filtering, third-party notices, and selectable shared model runs.

As a byproduct of using Models.dev, the following model release dates have changed:

mistral-large-2411: 2024-11-18 -> 2024-11-01
deepseek-r1: 2025-01-20 -> 2024-12-26
deepseek-v3: 2024-12-25 -> 2025-01-20
glm-4.6: 2025-11-13 -> 2025-09-30
kimi-k2-thinking: 2025-11-05 -> 2025-11-06
kimi-k2.5: 2026-01-30 -> 2026-01-27
glm-5: 2026-02-12 -> 2026-02-11
glm-5.1: 2026-04-07 -> 2026-03-27
kimi-k2.6: 2026-04-20 -> 2026-04-21
claude-3-7-sonnet-20250219: 2025-02-24 -> 2025-02-19
claude-haiku-4-5-20251001: 2025-10-01 -> 2025-10-15
claude-opus-4-5-20251101: 2025-11-24 -> 2025-11-01
grok-4.3: 2026-05-01 -> 2026-04-17
gemini-2.5-flash: 2025-06-17 -> 2025-03-20
gemini-2.5-pro: 2025-06-17 -> 2025-03-20
gemini-3.1-flash-lite: 2026-05-08 -> 2026-05-07


elsehow

This is a great refactor/PR.

My biggest high-level comment isn't about the code at all - I'm concerned about Models.dev's internal consistency. As I flagged in the comments, it seems like Models.dev gives inconsistent release dates for DeepSeek R1 and V3 across different model providers. It serves togetherai/DeepSeek-R1 = 2024-12-26 and togetherai/DeepSeek-V3 = 2025-01-20, but other providers in the same snapshot agree R1 = 2025-01-20, and ~8 (DigitalOcean, Vercel, SiliconFlow, …) agree V3 = 2024-12-26 — which matches reality (V3 shipped Dec 2024, R1 Jan 2025). Our registry references the togetherai route, so it inherited the one bad pair.

I haven't rigorously verified every Models.dev release, but it's probably safest to assume this type of issue might happen now and again.
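One way to surface this kind of issue at snapshot-regeneration time is a cross-provider consistency check. The following is a minimal sketch, not code from this PR: it assumes the snapshot can be flattened into hypothetical `(provider_id, model_name, release_date)` rows in which `model_name` is already normalized across providers. The `provider-b` rows are made-up placeholders; only the `togetherai` dates come from the snapshot discussed above.

```python
from collections import defaultdict
from datetime import date

# Hypothetical flattened snapshot row: (provider_id, normalized model name, release date).
SnapshotRow = tuple[str, str, date]


def find_inconsistent_release_dates(
    rows: list[SnapshotRow],
) -> dict[str, dict[date, list[str]]]:
    """Return models whose release date differs across providers.

    The result maps model name -> release date -> sorted provider IDs reporting it.
    """
    by_model: dict[str, dict[date, list[str]]] = defaultdict(lambda: defaultdict(list))
    for provider_id, model_name, release_date in rows:
        by_model[model_name][release_date].append(provider_id)
    # Keep only models for which more than one distinct date was reported.
    return {
        model: {day: sorted(providers) for day, providers in dates.items()}
        for model, dates in by_model.items()
        if len(dates) > 1
    }


# Illustrative rows: "provider-b" is a placeholder, not a real snapshot entry.
rows = [
    ("togetherai", "deepseek-r1", date(2024, 12, 26)),
    ("provider-b", "deepseek-r1", date(2025, 1, 20)),
    ("togetherai", "deepseek-v3", date(2025, 1, 20)),
    ("provider-b", "deepseek-v3", date(2024, 12, 26)),
    ("provider-b", "some-consistent-model", date(2025, 9, 30)),
]
conflicts = find_inconsistent_release_dates(rows)
for model, dates in sorted(conflicts.items()):
    print(model, dates)
```

A check like this only flags disagreement; deciding which provider is right would still need a human or an authoritative source.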

There are a few concrete things we can do here:

1. Let manual_release_date override models.dev.
2. Raise on disagreement between manual_release_date and models.dev - maybe in a refresh script, which runs at snapshot-regeneration time. That's an early warning system, and a trigger to do (3), below.
3. Contribute to models.dev to fix any discrepancies we find.

I feel most confident that we should be doing (3). We can and should be checking the upstream provenance. Also, Models.dev is an amazing public good, and I think we can justify spending some of our time keeping it high-quality. At the same time, we don't yet know how fast their PR review cycles are or how actively the project is maintained, and either could change over time, so we should have a 'manual override' as a backup in these cases. So, I'm leaning toward (1) too. (2) seems like a good CI practice. That's just my opinion - I'd be curious to hear your thoughts.

I left a few minor-ish comments in the code. Perhaps the biggest among them is the Python 3.14 dep, which could force consumers to upgrade their Python. I'm not sure we need to do that yet (but we might).

elsehow · 2026-06-03T16:18:17Z

 description = "Utilities for the Forecasting Research Institute codebase."
 readme = "README.md"
-requires-python = ">=3.10"
+requires-python = ">=3.14"


This may force every downstream consumer (like forecastbench-sim, anything importing utils) onto 3.14. It may be worth a line in the PR description. If the registry doesn't need 3.14 features, a lower floor may keep utils consumers free to choose their Python version.

elsehow · 2026-06-03T16:20:12Z

                api_key = get_secret(secret_name)
                _PROVIDER_API_KEYS[provider_cls] = api_key
-            except (RuntimeError, exceptions.NotFound):
+            except RuntimeError, exceptions.NotFound:


I believe this no-parens version will only run on 3.14+. Since it's not related to the PR's purpose, perhaps we could restore the parens and then restore earlier Python compatibility.

elsehow · 2026-06-03T16:21:18Z

-    full_name: str
-    token_limit: int
-    provider_cls: Type[BaseLLMProvider]
+    model_key: str


This is a breaking API change - I assume forecastbench is the only consumer right now, right? If not, perhaps worth notifying any other consumers.

ForecastBench and Timeseries bench are the only users at the moment, so I'm changing things more freely

elsehow · 2026-06-03T16:22:50Z

+    def __post_init__(self) -> None:
+        """Validate model-run metadata."""
+        _validate_model_run_key(self.model_run_key)
+        build_model_run_key(self.model.model_key, self.options)


It looks like this call exists only for its raise-on-unknown-option-path side effect. That's sort of an unexpected use for a function named build_*. What if there were a thin _validate_option_paths(options) wrapper (or a comment)? Something that makes it clear that the intent is to validate.

elsehow · 2026-06-03T16:23:38Z

+        return self.model_run_key
+
+    @property
+    def display_name(self) -> str:


This returns the raw model_run_key (like claude-opus-4-6...) as the leaderboard name. I assume that's intended, and that the leaderboard maps these run keys to human-readable names?

I want to punt on the display name for the moment. model_run_key is unique. I imagine I'll come up with a consistent naming schema for FB which I'll use now, but that might change in the future, e.g., I can see myself using claude-opus-4-8 as the model name, but we may want to change that to Claude Opus 4.8 at some point down the line, and I imagine we'd do that with a separate naming module, which would also handle if/how we display tools, the ordering of those tools, etc. I'll remove the returned AA key and just return model_run_key for now. The idea is that display_name is the "pretty printed" version of the model + options.

elsehow · 2026-06-03T16:24:32Z

+def _parse_date(value: str | None) -> date | None:
+    """Parse an ISO date value from the snapshot."""
+    if value is None:
+        return None
+    if len(value) != len("YYYY-MM-DD"):
+        return None
+    try:
+        return date.fromisoformat(value)
+    except ValueError:
+        return None


Malformed dates silently become None. That could be expected, but may be worth logging (or erroring if we want to be strict)?

Their docs say they allow YYYY-MM as a release date. Not sure we should keep those, nor should we modify them to YYYY-MM-01 for example. Discarding forces us to override the date if we want to use it. I can log it as an error to be more transparent however
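A minimal sketch of the "discard but log" variant described above. `parse_date_logged` is a hypothetical name and the log message is illustrative; the parsing rule is the same as in the diff (only full `YYYY-MM-DD` values are kept):

```python
import logging
from datetime import date
from typing import Optional

logger = logging.getLogger(__name__)


def parse_date_logged(value: Optional[str]) -> Optional[date]:
    """Parse a YYYY-MM-DD value; log and discard anything else."""
    if value is None:
        return None
    try:
        if len(value) != len("YYYY-MM-DD"):
            # Covers partial dates such as YYYY-MM, which are discarded, not padded.
            raise ValueError("expected YYYY-MM-DD")
        return date.fromisoformat(value)
    except ValueError as exc:
        logger.error("Discarding release date %r: %s", value, exc)
        return None


parsed_full = parse_date_logged("2025-01-20")
parsed_partial = parse_date_logged("2025-01")  # logged as an error, returns None
```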

elsehow · 2026-06-03T16:35:24Z

+        """Return this model's release date from Models.dev or a manual fallback."""
+        metadata = self.models_dev_metadata
+        if metadata is not None and metadata.release_date is not None:
+            return metadata.release_date


So my understanding is that Models.dev always wins here. The manual_release_date is fallback-only and can never correct a date.

This is totally fine as long as we have 100% confidence in Models.dev's data quality. I bring this up because I'm seeing some inconsistencies there, at least in the DeepSeek models (I'll describe this more in my top-level comment).

Currently, we're only using Models.dev for the release date and manual_release_date was envisaged to be used if we don't have a Models.dev reference. Given your comment above about inconsistent dates, it makes sense to allow it to override the Models.dev provided date and allow them both to exist.
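The override precedence agreed on here could be sketched as below. This is an illustration under assumptions, not the PR's implementation: `resolve_release_date` is a hypothetical free function, whereas the real logic lives in a property on the model class.

```python
from datetime import date
from typing import Optional


def resolve_release_date(
    manual_release_date: Optional[date],
    models_dev_release_date: Optional[date],
) -> Optional[date]:
    """Prefer a declared manual date; otherwise fall back to the Models.dev date."""
    if manual_release_date is not None:
        return manual_release_date
    return models_dev_release_date


# deepseek-v3: previously declared date vs. the date served via the togetherai route.
overridden = resolve_release_date(date(2024, 12, 25), date(2025, 1, 20))
# No manual date declared: the Models.dev date is used as-is.
upstream_only = resolve_release_date(None, date(2026, 2, 11))
```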

elsehow · 2026-06-03T16:36:26Z

+                f"{reference.provider_id}/{reference.model_id}: {exc}"
+            ) from exc
+
+        if metadata.release_date is not None:


__post_init__ accepts the Models.dev date without checking it against a provided manual_release_date. We could raise when a declared manual date disagrees with Models.dev. That way, a discrepancy would fail at construction (before inference happens). Up to you whether you think that's good behavior - it's an insurance policy in any case.

Agreed given your comment above
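A sketch of the fail-at-construction idea, assuming a simplified, hypothetical `ModelSketch` dataclass in place of the registry's real model class (whose other fields and Models.dev lookup are omitted):

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ModelSketch:
    """Hypothetical, simplified stand-in for the registry's model declaration."""

    model_key: str
    manual_release_date: Optional[date] = None
    models_dev_release_date: Optional[date] = None

    def __post_init__(self) -> None:
        manual, upstream = self.manual_release_date, self.models_dev_release_date
        # Only a declared manual date that differs from an available upstream date is an error.
        if manual is not None and upstream is not None and manual != upstream:
            raise ValueError(
                f"{self.model_key}: manual_release_date {manual} disagrees with "
                f"Models.dev release date {upstream}"
            )


ok = ModelSketch("glm-5", models_dev_release_date=date(2026, 2, 11))
try:
    ModelSketch("deepseek-v3", date(2024, 12, 25), date(2025, 1, 20))
except ValueError as exc:
    error_message = str(exc)
```

Whether a mismatch should raise, or only warn and let the manual date win, is a policy choice; the stricter form shown here surfaces discrepancies before any inference runs.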

elsehow · 2026-06-03T16:37:26Z

+        model_key="deepseek-r1",
+        lab_key="DeepSeek",
+        models_dev_reference=ModelsDevReference(
+            provider_id="togetherai", model_id="deepseek-ai/DeepSeek-R1"
+        ),
+    ),
+    together_model(
+        model_key="deepseek-v3",
+        lab_key="DeepSeek",
+        models_dev_reference=ModelsDevReference(
+            provider_id="togetherai", model_id="deepseek-ai/DeepSeek-V3"
+        ),


I'll discuss this more in my top-level comment, but I'm seeing some inconsistent dates in Models.dev. Some DeepSeek R1 entries say the model was released 2025-01-20; Together's says 2024-12-26. (I think Together may have actually switched the release dates for R1 and V3).

houtanb · 2026-06-04T10:43:09Z

@elsehow I fully agree that Models.dev release dates differ by provider and that's the biggest risk here. We currently only want to use them because it outsources the need to find release dates ourselves.

~~I'll make~~ I made a PR there to see how things work.

EDIT: PR merged in less than 12 hours.

houtanb · 2026-06-04T10:43:41Z

@elsehow thank you so much for this review, it was very valuable!

houtanb requested a review from elsehow May 29, 2026 07:25

houtanb force-pushed the llm-model-runs branch from 88695b4 to c52d561 Compare June 3, 2026 14:28

houtanb force-pushed the llm-model-runs branch from c52d561 to dceb6b1 Compare June 3, 2026 15:55

elsehow suggested changes Jun 3, 2026
