feat(llm): add shared model and run registry #26

Open
houtanb wants to merge 1 commit into main from llm-model-runs

Conversation

@houtanb
Member

@houtanb houtanb commented May 29, 2026

Move canonical LLM model metadata and benchmarkable model-run declarations into utils so downstream repos can select shared runs by stable model_run_key.

Add Models.dev and Artificial Analysis metadata snapshots and loaders. Resolve release dates from Models.dev with manual fallbacks, separate canonical model_key values from provider_model_id routing strings, and validate model declarations during registry construction.

Require every ModelRun to declare an explicit, filename-safe model_run_key. Keep build_model_run_key as a naming helper for option coverage, validate duplicate keys, and expose MODEL_RUNS_BY_KEY/select_model_runs for benchmark selection.
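
A minimal sketch of the key rules described above: check each key against a filename-safe pattern and build a by-key index that rejects duplicates. The regex, the `ModelRun` stand-in, and the helper names `validate_model_run_key` and `index_model_runs` are assumptions for illustration, not the code in this PR.

```python
import re
from dataclasses import dataclass

# Assumed definition of "filename-safe": lowercase letters, digits, dots,
# underscores, and hyphens. The real registry may use a different rule.
_KEY_PATTERN = re.compile(r"[a-z0-9][a-z0-9._-]*")


@dataclass(frozen=True)
class ModelRun:
    """Hypothetical stand-in for the registry's ModelRun."""

    model_run_key: str


def validate_model_run_key(key: str) -> None:
    """Raise ValueError if the key does not match the assumed pattern."""
    if not _KEY_PATTERN.fullmatch(key):
        raise ValueError(f"model_run_key is not filename-safe: {key!r}")


def index_model_runs(runs: list[ModelRun]) -> dict[str, ModelRun]:
    """Build a key -> run mapping, rejecting invalid and duplicate keys."""
    by_key: dict[str, ModelRun] = {}
    for run in runs:
        validate_model_run_key(run.model_run_key)
        if run.model_run_key in by_key:
            raise ValueError(f"duplicate model_run_key: {run.model_run_key!r}")
        by_key[run.model_run_key] = run
    return by_key
```

Failing at index-construction time means a duplicate or unsafe key surfaces on import rather than when a benchmark writes its output files.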

Add Model.active and ACTIVE_MODEL_RUNS so historical runs remain in MODEL_RUNS while runs depending on inactive provider routes are excluded from current live-callable benchmark sweeps. Mark the Together deepseek-v3.1 route inactive and replace live smoke tests with the active MiniMax M2.7 route.
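
The active/historical split can be sketched roughly as follows. The `Model` and `ModelRun` classes here are simplified stand-ins, and the two run keys are illustrative only; the real declarations carry more fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    """Hypothetical stand-in: a provider route that may no longer be callable."""

    model_key: str
    active: bool = True


@dataclass(frozen=True)
class ModelRun:
    """Hypothetical stand-in pairing a run key with its model."""

    model_run_key: str
    model: Model


def active_model_runs(runs: list[ModelRun]) -> list[ModelRun]:
    """Keep only runs whose model route is still marked active."""
    return [run for run in runs if run.model.active]


# Historical runs stay in MODEL_RUNS; only live-callable ones are swept.
MODEL_RUNS = [
    ModelRun("deepseek-v3.1", Model("deepseek-v3.1", active=False)),
    ModelRun("minimax-m2.7", Model("minimax-m2.7")),
]
ACTIVE_MODEL_RUNS = active_model_runs(MODEL_RUNS)
```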

Add Artificial Analysis-backed model-run declarations as benchmark-selectable runs that are automatically included in MODEL_RUNS, with display names resolved from a minimized checked-in AA snapshot containing only stable IDs and display names.

Add third-party notices for Models.dev's MIT license and Artificial Analysis attribution, and include those notices in built wheel license metadata.

Move shared LLM provider dependencies into pyproject metadata, make requirements.txt delegate to .[dev], configure the package for Python 3.14, and preserve pytest-xdist for parallel integration tests.

Document registry conventions, local dev setup, validation commands, and Claude/agent handoff files. Add unit and integration coverage for metadata snapshots, registry validation, provider routing, explicit model-run keys, active model-run filtering, third-party notices, and selectable shared model runs.

As a byproduct of using Models.dev, the following model release dates have changed:

mistral-large-2411: 2024-11-18 -> 2024-11-01
deepseek-r1: 2025-01-20 -> 2024-12-26
deepseek-v3: 2024-12-25 -> 2025-01-20
glm-4.6: 2025-11-13 -> 2025-09-30
kimi-k2-thinking: 2025-11-05 -> 2025-11-06
kimi-k2.5: 2026-01-30 -> 2026-01-27
glm-5: 2026-02-12 -> 2026-02-11
glm-5.1: 2026-04-07 -> 2026-03-27
kimi-k2.6: 2026-04-20 -> 2026-04-21
claude-3-7-sonnet-20250219: 2025-02-24 -> 2025-02-19
claude-haiku-4-5-20251001: 2025-10-01 -> 2025-10-15
claude-opus-4-5-20251101: 2025-11-24 -> 2025-11-01
grok-4.3: 2026-05-01 -> 2026-04-17
gemini-2.5-flash: 2025-06-17 -> 2025-03-20
gemini-2.5-pro: 2025-06-17 -> 2025-03-20
gemini-3.1-flash-lite: 2026-05-08 -> 2026-05-07

@houtanb houtanb requested a review from elsehow May 29, 2026 07:25
Contributor

@elsehow elsehow left a comment

This is a great refactor/PR.

My biggest high-level comment isn't about the code at all - I'm concerned about Models.dev's internal consistency. As I flagged in the comments, it seems like Models.dev gives inconsistent release dates for DeepSeek R1 and V3 across different model providers. It serves togetherai/DeepSeek-R1 = 2024-12-26 and togetherai/DeepSeek-V3 = 2025-01-20, but other providers in the same snapshot agree R1 = 2025-01-20, and ~8 (DigitalOcean, Vercel, SiliconFlow, …) agree V3 = 2024-12-26 — which matches reality (V3 shipped Dec 2024, R1 Jan 2025). Our registry references the togetherai route, so it inherited the one bad pair.

I haven't rigorously verified every Models.dev release, but it's probably safest to assume this type of issue might happen now and again.

There are a few concrete things we can do here:

  1. Let manual_release_date override models.dev.
  2. Raise on disagreement between manual_release_date and models.dev - maybe in a refresh script, which runs at snapshot-regeneration time. That's an early warning system, and a trigger to do (3), below.
  3. Contribute to models.dev to fix any discrepancies we find.

I feel most confident that we should be doing (3). We can and should be checking the upstream provenance. Also, Models.dev is an amazing public good, and I think we can justify spending some of our time keeping it high-quality. At the same time, we don't know (yet) how fast their PR review cycles are or how actively it is maintained, and their maintenance may change over time, so we should have a 'manual override' as a backup in these cases. So, I'm leaning toward (1) too. (2) seems like a good CI practice. That's just my opinion - I'd be curious to hear your thoughts.
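
For (2), a refresh-time check could look roughly like the sketch below. The function name `find_release_date_disagreements` and the plain-dict inputs are assumptions for illustration; the real snapshot loader has its own types.

```python
from datetime import date


def find_release_date_disagreements(
    manual_dates: dict[str, date],
    snapshot_dates: dict[str, date],
) -> dict[str, tuple[date, date]]:
    """Return {model_key: (manual, snapshot)} where both dates exist and differ."""
    return {
        key: (manual, snapshot_dates[key])
        for key, manual in manual_dates.items()
        if key in snapshot_dates and snapshot_dates[key] != manual
    }


# Example inputs using dates quoted in this thread.
manual = {"deepseek-r1": date(2025, 1, 20)}
snapshot = {"deepseek-r1": date(2024, 12, 26)}
disagreements = find_release_date_disagreements(manual, snapshot)
```

A refresh script could raise if the returned mapping is non-empty, which would give the early warning and the prompt to file an upstream fix.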

I left a few minor-ish comments in the code. Perhaps the biggest among them is the Python 3.14 dep, which could force consumers to upgrade their Python. I'm not sure we need to do that yet (but we might).

Comment thread pyproject.toml
description = "Utilities for the Forecasting Research Institute codebase."
readme = "README.md"
-requires-python = ">=3.10"
+requires-python = ">=3.14"

Contributor

This may force every downstream consumer (like forecastbench-sim, anything importing utils) onto 3.14. It may be worth a line in the PR description. If the registry doesn't need 3.14 features, a lower floor may keep utils consumers free to choose their Python version.

     api_key = get_secret(secret_name)
     _PROVIDER_API_KEYS[provider_cls] = api_key
-except (RuntimeError, exceptions.NotFound):
+except RuntimeError, exceptions.NotFound:

Contributor

I believe this no-parens version will only run on 3.14+. Since it's not related to the PR's purpose, perhaps we could restore the parens and then restore earlier Python compatibility.
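
For reference, a small self-contained sketch of the parenthesized form, which parses on every Python 3 version. The function `lookup_port` and its inputs are made up for illustration and are unrelated to the provider code.

```python
from typing import Optional


def lookup_port(settings: dict[str, str], name: str) -> Optional[int]:
    """Return the named setting as an int, or None if missing or malformed."""
    try:
        return int(settings[name])
    except (KeyError, ValueError):  # parenthesized tuple of exception types
        return None
```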

full_name: str
token_limit: int
provider_cls: Type[BaseLLMProvider]
model_key: str

Contributor

This is a breaking API change - I assume forecastbench is the only consumer right now, right? If not, perhaps worth notifying any other consumers.

Member Author

ForecastBench and Timeseries bench are the only users at the moment, so I'm changing things more freely

Comment thread utils/llm/model_runs.py
    def __post_init__(self) -> None:
        """Validate model-run metadata."""
        _validate_model_run_key(self.model_run_key)
        build_model_run_key(self.model.model_key, self.options)

Contributor

It looks like this call exists only for its raise-on-unknown-option-path side effect. That's sort of an unexpected use for a function named build_*. What if there were a thin _validate_option_paths(options) wrapper (or a comment)? Something that makes it clear that the intent is to validate.

Comment thread utils/llm/model_runs.py
return self.model_run_key

@property
def display_name(self) -> str:

Contributor

This returns the raw model_run_key (like claude-opus-4-6...) as the leaderboard name. I assume that's intended, and that the leaderboard maps these run keys to human-readable names?

Member Author

I want to punt on the display name for the moment. model_run_key is unique. I imagine I'll come up with a consistent naming schema for FB, which I'll use now, but that might change in the future. For example, I can see myself using claude-opus-4-8 as the model name, but we may want to change that to Claude Opus 4.8 at some point down the line. I imagine we'd do that with a separate naming module, which would also handle if/how we display tools, the ordering of those tools, etc. I'll remove the returned AA key and just return model_run_key for now. The idea is that display_name is the "pretty-printed" version of the model + options.

Comment on lines +52 to +61
def _parse_date(value: str | None) -> date | None:
    """Parse an ISO date value from the snapshot."""
    if value is None:
        return None
    if len(value) != len("YYYY-MM-DD"):
        return None
    try:
        return date.fromisoformat(value)
    except ValueError:
        return None

Contributor

Malformed dates silently become None. That could be expected, but may be worth logging (or erroring if we want to be strict)?

Member Author

Their docs say they allow YYYY-MM as a release date. I'm not sure we should keep those, and I don't think we should rewrite them as YYYY-MM-01, for example. Discarding them forces us to override the date if we want to use it. I can log the discard as an error to be more transparent, however.
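
A sketch of the logging variant, assuming the standard `logging` module and a hypothetical function name `parse_snapshot_date`. The behavior is otherwise the same as the snippet under review: only full YYYY-MM-DD values are kept.

```python
import logging
from datetime import date
from typing import Optional

logger = logging.getLogger(__name__)


def parse_snapshot_date(value: Optional[str]) -> Optional[date]:
    """Parse a full YYYY-MM-DD date; log and discard anything else."""
    if value is None:
        return None
    if len(value) != len("YYYY-MM-DD"):
        # Covers month-only values such as "2025-01".
        logger.error("Discarding release date that is not YYYY-MM-DD: %r", value)
        return None
    try:
        return date.fromisoformat(value)
    except ValueError:
        logger.error("Discarding unparseable release date: %r", value)
        return None
```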

"""Return this model's release date from Models.dev or a manual fallback."""
metadata = self.models_dev_metadata
if metadata is not None and metadata.release_date is not None:
return metadata.release_date

Contributor

So my understanding is that Models.dev always wins here. The manual_release_date is fallback-only and can never correct a date.

This is totally fine as long as we have 100% confidence in Models.dev's data quality. I bring this up because I'm seeing some inconsistencies there, at least in the DeepSeek models (I'll describe this more in my top-level comment).

Member Author

Currently, we're only using Models.dev for the release date and manual_release_date was envisaged to be used if we don't have a Models.dev reference. Given your comment above about inconsistent dates, it makes sense to allow it to override the Models.dev provided date and allow them both to exist.
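
One possible shape for the override, sketched with a simplified `Model` that stores both dates as plain fields. The field name `models_dev_release_date` is an assumption; the real class resolves the Models.dev date through its metadata object.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Model:
    """Hypothetical stand-in with both date sources as plain fields."""

    model_key: str
    models_dev_release_date: Optional[date] = None
    manual_release_date: Optional[date] = None

    @property
    def release_date(self) -> Optional[date]:
        """Prefer the manual date when declared; otherwise use Models.dev."""
        if self.manual_release_date is not None:
            return self.manual_release_date
        return self.models_dev_release_date


# Both dates coexist; the manual one wins when present.
overridden = Model(
    "deepseek-v3",
    models_dev_release_date=date(2025, 1, 20),
    manual_release_date=date(2024, 12, 26),
)
```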

f"{reference.provider_id}/{reference.model_id}: {exc}"
) from exc

if metadata.release_date is not None:

Contributor

__post_init__ accepts the Models.dev date without checking it against a provided manual_release_date. We could raise when a declared manual date disagrees with Models.dev. That way, a discrepancy would fail at construction (before inference happens). Up to you whether you think that's good behavior - it's an insurance policy in any case.

Member Author

Agreed given your comment above

Comment on lines +600 to +611
    model_key="deepseek-r1",
    lab_key="DeepSeek",
    models_dev_reference=ModelsDevReference(
        provider_id="togetherai", model_id="deepseek-ai/DeepSeek-R1"
    ),
),
together_model(
    model_key="deepseek-v3",
    lab_key="DeepSeek",
    models_dev_reference=ModelsDevReference(
        provider_id="togetherai", model_id="deepseek-ai/DeepSeek-V3"
    ),

Contributor

I'll discuss this more in my top-level comment, but I'm seeing some inconsistent dates in Models.dev. Some DeepSeek R1 entries say the model was released 2025-01-20; Together's says 2024-12-26. (I think Together may have actually switched the release dates for R1 and V3).

@houtanb
Member Author

houtanb commented Jun 4, 2026

@elsehow I fully agree that Models.dev release dates differ by provider and that's the biggest risk here. We currently only want to use them because it outsources the need to find release dates ourselves.

I made a PR there to see how things work.

EDIT: The PR was merged in less than 12 hours.

@houtanb
Member Author

houtanb commented Jun 4, 2026

@elsehow thank you so much for this review, it was very valuable!
