diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 0000000..6879854 --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,138 @@ +# Repository Instructions + +## Shared LLM Registry + +This package targets Python 3.14. Black is configured with +`target-version = ["py314"]`; do not broaden `requires-python` without first +checking that formatted code remains valid for the older target. + +## Local Development Setup + +Use Python 3.14 for local development: + +```bash +python3.14 -m venv .venv +source .venv/bin/activate +python -m pip install --upgrade pip +python -m pip install -r requirements.txt +``` + +`requirements.txt` delegates to `.[dev]`; it installs this package and the dev +tools from `pyproject.toml` without editable mode. + +When another repo needs local utils changes during development, use that repo's +virtual environment and install utils explicitly in editable mode, for example: + +```bash +python -m pip install -e ../utils +``` + +Do not add local relative paths to another repo's requirements files. Those +files should use the deployed git pin when ready to deploy. + +The shared LLM registry has two layers: + +- `utils.llm.model_registry.MODELS` contains canonical provider-callable base models. +- `utils.llm.model_runs.MODEL_RUNS` contains exact benchmarkable model-plus-options runs. + +Benchmarks should choose from `MODEL_RUNS` by `model_run_key`; forecast files should store that exact key. + +When adding a base model: + +- Add provider/lab registry entries first only if the provider or lab is missing. +- Look up the model in Models.dev. Prefer a `ModelsDevReference` when Models.dev + has the provider/model entry. +- In Models.dev source paths, `provider_id` is the folder under `providers/`, + and `model_id` is the TOML filename stem under `models/`, for example + `providers/anthropic/models/claude-opus-4-8.toml` maps to `anthropic` / + `claude-opus-4-8`. +- The checked-in Models.dev snapshot is not a catalog; it contains only + registry-referenced models and only `id`, `name`, and `release_date`. +- Use exact Models.dev `provider_id`/`model_id` values. If a reference is wrong, + refreshing the snapshot should fail and suggest nearby Models.dev entries. +- Use `manual_release_date` when the model is missing from Models.dev, when the + Models.dev entry lacks a usable full release date, or for deliberate + historical/manual entries. +- Put the model in the provider-specific list in `utils/llm/model_registry.py` (`OPENAI_MODELS`, `TOGETHER_MODELS`, `ANTHROPIC_MODELS`, `XAI_MODELS`, or `GOOGLE_MODELS`). +- Insert the model where `(release_date, model_key)` stays ascending within its + provider-specific list. +- Use `provider_model_id` for the exact string sent to the provider API. It may differ from `model_key`, especially for routed providers like Together. +- Set `active=False` only when a provider route should remain in registry history + but should be excluded from current live-callable benchmark runs. +- Do not add duplicate `model_key`s. `MODELS = create_models_list(...)` validates uniqueness. + +After changing `ModelsDevReference` values, refresh the Models.dev snapshot from the utils repo: +```bash +python - <<'PY' +from scripts.refresh_models_dev_metadata import write_models_dev_snapshot + +write_models_dev_snapshot() +PY +``` + +When adding a model run: + +- Add it to `utils/llm/model_runs.py` with + `_model_run(model_run_key=..., model_key=..., options=...)`. +- Write `model_run_key` explicitly as the stable benchmark identifier. Do not + rely on implicit generation from model/options. +- Put every runtime call option in the `ModelRun` declaration; do not add hidden defaults elsewhere. +- Use exact provider option names and values as they are passed to `get_response`. +- If an option affects performance and should appear in filenames/forecast keys, add or update a naming rule in `NAME_COMPONENT_RULES`. +- If an option is intentionally name-neutral, add it to `NAME_NEUTRAL_OPTION_PATHS`. +- Unknown option paths should fail loudly rather than silently producing ambiguous model-run keys. +- `build_model_run_key(...)` is a suggested-key helper for consistency checks and + new naming rules; the declared `model_run_key` remains the durable identity. +- Do not add duplicate `model_run_key`s. `MODEL_RUNS = create_model_runs_list(...)` validates uniqueness. +- `MODEL_RUNS` is the historical registry. `ACTIVE_MODEL_RUNS` is derived from + it by dropping runs whose base `Model` has `active=False`. +- Add unit tests for new naming behavior, registry inclusion, and routed provider options when relevant. + +## Artificial Analysis Model Runs + +When adding an Artificial Analysis-backed model run: + +- Use the checked-in Artificial Analysis snapshot as the source for the stable AA model ID and displayed AA name. +- Refresh the snapshot from the AA endpoint; do not hand-edit individual AA models into the JSON file. +- The official AA API key is `API_KEY_ARTIFICIAL_ANALYSIS` in GCP Secret Manager. +- Do not hard-code an AA display name in a `ModelRun`; set `artificial_analysis_id` and let the run read the display name from the snapshot. +- Do not add an `artificial_analysis_model` flag. A non-null `artificial_analysis_id` is the marker that a run is AA-backed. +- Add or update the canonical base `Model` only if the provider-callable model is missing from `utils.llm.model_registry`. +- Add the callable model-plus-options declaration to + `ARTIFICIAL_ANALYSIS_MODEL_RUN_DECLARATIONS` in + `utils/llm/artificial_analysis_model_runs.py`. Every declaration there is + automatically included in `utils.llm.model_runs.MODEL_RUNS`; do not add the + same AA run manually to `MODEL_RUNS`. +- Use the exact provider option names that are passed at runtime. Token suffixes in model-run keys must reflect the actual token cap option used for the call. + +Artificial Analysis token caps should be encoded in the run options this way: + +- Non-reasoning models: use `16_384` output tokens, adjusted downward if the model has a smaller context window or a lower maximum output-token cap. +- Reasoning models: use the maximum output tokens allowed by the model creator for that reasoning configuration. +- If the correct cap is not clear from provider/model documentation or the AA metadata, stop and confirm rather than guessing. + +After adding an AA model run: + +- Add or update unit tests that prove the AA ID resolves from the snapshot and that `display_name` matches the AA leaderboard name. +- Add or update shared registry coverage tests for the new selectable model-run key. +- Run the focused model-run and AA metadata tests, then run the full lint/test suite before committing. + +## Validation + +- Run `make lint` before committing. It runs `isort .`, `black .`, `flake8 .`, + and `pydocstyle .`. +- Run `make test` before committing code changes. Use `PYTEST_ARGS=...` for a + focused test pass while iterating. +- Run `make test-integration` or `make test-integration-parallel` only when the + relevant provider/GCP credentials are available. + +## Live Model-Run Smoke Tests + +Integration tests that hit real LLM APIs require provider API keys. + +- `tests/conftest.py` loads `.env`, then `configure_api_keys(from_gcp=True)` when pytest is run with `--integration`. +- `configure_api_keys(from_gcp=True)` reads provider keys from GCP Secret Manager using the secret names in `utils/helpers/constants.py`. +- The standard LLM secret names are `API_KEY_OPENAI`, `API_KEY_ANTHROPIC`, `API_KEY_GEMINI`, `API_KEY_XAI`, and `API_KEY_TOGETHERAI`. +- To test a specific shared model run, set `LLM_MODEL_RUN_KEYS` to one or more comma-separated `model_run_key`s and run `pytest --integration tests/integration/llm/test_model_runs.py`. +- The model-run integration test calls `model_run.get_response`, so it uses the run's declared provider route, provider model ID, and options. +- For a newly added model run, prefer running its exact smoke test before assuming the provider accepts the declared options. diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000..43c994c --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1 @@ +@AGENTS.md diff --git a/Makefile b/Makefile index 47870e8..683afe5 100644 --- a/Makefile +++ b/Makefile @@ -1,3 +1,5 @@ +PYTEST_ARGS ?= + lint: pyproject.toml setup.cfg isort . black . @@ -8,13 +10,13 @@ clean: find . -type f -name "*~" -exec rm -f {} + test: - pytest + pytest $(PYTEST_ARGS) test-integration: - pytest --integration + pytest --integration $(PYTEST_ARGS) test-integration-parallel: - pytest --integration -n auto + pytest --integration -n auto $(PYTEST_ARGS) coverage: - pytest --cov=utils --cov-report=term-missing --cov-report=html \ No newline at end of file + pytest --cov=utils --cov-report=term-missing --cov-report=html $(PYTEST_ARGS) diff --git a/README.md b/README.md index 13a8594..c5b3475 100644 --- a/README.md +++ b/README.md @@ -26,7 +26,7 @@ uv add fri-utils ``` -from utils.llm.model_registry import configure_api_keys, MODELS +from utils.llm.model_registry import configure_api_keys, MODELS_BY_KEY # Input the API key for any model provider you like! configure_api_keys( @@ -40,7 +40,7 @@ configure_api_keys( # Call any model we support! # See the full list of supported models in `utils/llm/model_registry.py` -model = next(m for m in MODELS if m.id == "gemini-2.5-flash") +model = MODELS_BY_KEY["gemini-2.5-pro"] model.get_response("Hello") # > "Hello! How can I help you?" ``` @@ -62,6 +62,12 @@ Use option names supported by the respective provider (`utils/llm/providers`). If you don’t see an option you need, feel free to open a GitHub issue! +### Third-party metadata + +The shared LLM registry includes normalized metadata from Models.dev and +Artificial Analysis. See `THIRD_PARTY_NOTICES.md` for Models.dev license terms +and Artificial Analysis attribution. + ### Configuring keys from GCP Secret Manager @@ -71,7 +77,7 @@ If so, you can use the `from_gcp=True` shortcut to set your keys for all model p ``` configure_api_keys(from_gcp=True) # Configure all provider keys from GCP. -model = next(m for m in MODELS if m.id == "gpt-4.1-mini") +model = MODELS_BY_KEY["gpt-5-mini-2025-08-07"] response = model.get_response("Hello") ``` @@ -82,6 +88,7 @@ If you're setting up a Google Cloud Project, the API keys must be stored in Secr - `API_KEY_OPENAI` for OpenAI - `API_KEY_XAI` for xAI - `API_KEY_TOGETHERAI` for Together AI +- `API_KEY_ARTIFICIAL_ANALYSIS` for refreshing the Artificial Analysis metadata snapshot You can also check `utils/helpers/constants.py` for the complete list of secret names. diff --git a/THIRD_PARTY_NOTICES.md b/THIRD_PARTY_NOTICES.md new file mode 100644 index 0000000..23e6c4a --- /dev/null +++ b/THIRD_PARTY_NOTICES.md @@ -0,0 +1,42 @@ +# Third-Party Notices + +This repository includes normalized metadata derived from third-party sources. + +## Models.dev + +The checked-in Models.dev snapshot is derived from https://models.dev/api.json +and the upstream repository https://github.com/anomalyco/models.dev + +Models.dev is licensed under the MIT License: + +```text +MIT License + +Copyright (c) 2025 models.dev + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. +``` + +## Artificial Analysis + +The checked-in Artificial Analysis snapshot is derived from the Artificial +Analysis free API and is minimized to the stable model IDs and display names +used by this package. + +Attribution: Artificial Analysis, https://artificialanalysis.ai/. diff --git a/pyproject.toml b/pyproject.toml index 89f69ba..9f4e069 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -4,17 +4,17 @@ build-backend = "setuptools.build_meta" [project] name = "fri-utils" -version = "0.1.0" +version = "0.2.0" description = "Utilities for the Forecasting Research Institute codebase." readme = "README.md" -requires-python = ">=3.10" +requires-python = ">=3.14" license = { file = "LICENSE" } authors = [{ name = "Forecasting Research Institute" }] dependencies = [ - "google-genai==1.73.1", - "anthropic==0.97.0", - "together==2.11.0", - "openai==2.33.0", + "google-genai==2.7.0", + "anthropic==0.105.2", + "together==2.16.0", + "openai==2.40.0", "google-cloud-secret-manager>=2.20.0", "google-cloud-storage>=2.14.0", "python-dotenv>=1.0.0", @@ -22,13 +22,14 @@ dependencies = [ [project.optional-dependencies] dev = [ - "black", - "flake8", - "flake8-bugbear", - "isort", - "pydocstyle", - "pytest", - "pytest-cov", + "black==26.5.1", + "flake8==7.3.0", + "flake8-bugbear==25.11.29", + "isort==8.0.1", + "pydocstyle==6.3.0", + "pytest==9.0.3", + "pytest-cov==7.1.0", + "pytest-xdist==3.8.0", ] [tool.setuptools.packages.find] @@ -38,8 +39,15 @@ include = [ ] exclude = ["tests*", "htmlcov*", "venv*"] +[tool.setuptools] +license-files = ["LICENSE", "THIRD_PARTY_NOTICES.md"] + +[tool.setuptools.package-data] +"utils.llm.metadata" = ["*.json"] + [tool.black] line-length = 100 +target-version = ["py314"] [tool.pytest.ini_options] markers = [ diff --git a/requirements.txt b/requirements.txt index 3bb98ee..e7ab444 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,15 +1 @@ -google-genai==1.73.1 -anthropic==0.97.0 -together==2.11.0 -openai==2.33.0 -google-cloud-secret-manager>=2.20.0 -google-cloud-storage>=2.14.0 -python-dotenv>=1.0.0 -isort -black -flake8 -flake8-bugbear -pydocstyle -pytest -pytest-cov -pytest-xdist +.[dev] diff --git a/scripts/refresh_models_dev_metadata.py b/scripts/refresh_models_dev_metadata.py new file mode 100644 index 0000000..eed914c --- /dev/null +++ b/scripts/refresh_models_dev_metadata.py @@ -0,0 +1,299 @@ +"""Refresh the checked-in LLM metadata snapshots.""" + +import ast +import difflib +import json +import urllib.request +from dataclasses import dataclass +from pathlib import Path +from typing import Any + +from google.api_core import exceptions + +from utils.gcp.secret_manager import get_secret +from utils.helpers.constants import ARTIFICIAL_ANALYSIS_API_KEY_SECRET_NAME + +MODELS_DEV_URL = "https://models.dev/api.json" +ARTIFICIAL_ANALYSIS_URL = "https://artificialanalysis.ai/api/v2/data/llms/models" +DEFAULT_MODELS_DEV_OUTPUT_PATH = ( + Path(__file__).resolve().parents[1] / "utils" / "llm" / "metadata" / "models_dev_snapshot.json" +) +DEFAULT_MODEL_REGISTRY_PATH = ( + Path(__file__).resolve().parents[1] / "utils" / "llm" / "model_registry.py" +) +DEFAULT_ARTIFICIAL_ANALYSIS_OUTPUT_PATH = ( + Path(__file__).resolve().parents[1] + / "utils" + / "llm" + / "metadata" + / "artificial_analysis_snapshot.json" +) + +MODEL_FIELDS = ( + "id", + "name", + "release_date", +) + + +@dataclass(frozen=True, slots=True, order=True) +class ModelsDevReference: + """A provider/model reference into Models.dev.""" + + provider_id: str + model_id: str + + +def _sorted_dict(data: dict[str, Any]) -> dict[str, Any]: + """Return a copy of a dictionary with keys sorted recursively.""" + return {key: _sort_json_value(value) for key, value in sorted(data.items())} + + +def _sort_json_value(value: Any) -> Any: + """Return JSON-like data with dictionaries sorted recursively.""" + if isinstance(value, dict): + return _sorted_dict(value) + if isinstance(value, list): + return [_sort_json_value(item) for item in value] + return value + + +def _literal_string_keyword(call: ast.Call, keyword_name: str) -> str | None: + """Return a string literal keyword argument from an AST call, if present.""" + for keyword in call.keywords: + if keyword.arg == keyword_name and isinstance(keyword.value, ast.Constant): + value = keyword.value.value + if isinstance(value, str): + return value + return None + + +def read_models_dev_references_from_model_registry( + model_registry_path: Path = DEFAULT_MODEL_REGISTRY_PATH, +) -> frozenset[ModelsDevReference]: + """Read Models.dev references from the model registry without importing it.""" + tree = ast.parse(model_registry_path.read_text(), filename=str(model_registry_path)) + references = set() + for node in ast.walk(tree): + if not isinstance(node, ast.Call): + continue + if isinstance(node.func, ast.Name): + function_name = node.func.id + elif isinstance(node.func, ast.Attribute): + function_name = node.func.attr + else: + continue + if function_name != "ModelsDevReference": + continue + + provider_id = _literal_string_keyword(node, "provider_id") + model_id = _literal_string_keyword(node, "model_id") + if provider_id is None or model_id is None: + raise ValueError( + "ModelsDevReference calls in model_registry.py must use literal " + "provider_id and model_id keyword arguments." + ) + references.add(ModelsDevReference(provider_id=provider_id, model_id=model_id)) + return frozenset(references) + + +def _format_model_suggestions( + *, + provider_id: str, + provider_data: dict[str, Any], + missing_model_id: str, +) -> str: + """Format nearby Models.dev model suggestions for an incorrect reference.""" + models = provider_data.get("models", {}) + candidates_by_key = { + model_id: f'{provider_id}/{model_id} name="{model_data.get("name", "")}"' + for model_id, model_data in models.items() + } + search_space = list(candidates_by_key) + search_space.extend( + model_data.get("name", "") for model_data in models.values() if model_data.get("name") + ) + close_values = difflib.get_close_matches( + missing_model_id, + search_space, + n=5, + cutoff=0.35, + ) + suggestions = [] + for value in close_values: + if value in candidates_by_key: + suggestions.append(candidates_by_key[value]) + continue + for model_id, model_data in models.items(): + if model_data.get("name") == value: + suggestions.append(candidates_by_key[model_id]) + break + + # Preserve order while de-duplicating suggestions found by ID and display name. + suggestions = list(dict.fromkeys(suggestions)) + if not suggestions: + return f"No nearby model IDs found for provider {provider_id}." + return "Possible matches:\n " + "\n ".join(suggestions) + + +def _raise_missing_models_dev_reference( + *, + api_response: dict[str, Any], + reference: ModelsDevReference, +) -> None: + """Raise a targeted error for a missing Models.dev reference.""" + provider_data = api_response.get(reference.provider_id) + if provider_data is None: + provider_suggestions = difflib.get_close_matches( + reference.provider_id, + list(api_response), + n=5, + cutoff=0.35, + ) + suffix = ( + "Possible provider IDs:\n " + "\n ".join(provider_suggestions) + if provider_suggestions + else "No nearby provider IDs found." + ) + raise ValueError( + f"Missing Models.dev provider reference: {reference.provider_id}\n{suffix}" + ) + + suggestions = _format_model_suggestions( + provider_id=reference.provider_id, + provider_data=provider_data, + missing_model_id=reference.model_id, + ) + raise ValueError( + f"Missing Models.dev reference: {reference.provider_id}/{reference.model_id}\n" + f"{suggestions}" + ) + + +def normalize_models_dev_api_response( + api_response: dict[str, Any], + *, + models_dev_references: frozenset[ModelsDevReference], +) -> dict[str, Any]: + """Normalize a Models.dev API response into the checked-in snapshot shape.""" + providers = {} + references_by_provider: dict[str, list[ModelsDevReference]] = {} + for reference in sorted(models_dev_references): + references_by_provider.setdefault(reference.provider_id, []).append(reference) + + for provider_id, references in sorted(references_by_provider.items()): + provider_data = api_response.get(provider_id) + if provider_data is None: + _raise_missing_models_dev_reference( + api_response=api_response, + reference=references[0], + ) + + models = {} + provider_models = provider_data.get("models", {}) + for reference in sorted(references): + model_data = provider_models.get(reference.model_id) + if model_data is None: + _raise_missing_models_dev_reference( + api_response=api_response, + reference=reference, + ) + + normalized_model = {} + for field in MODEL_FIELDS: + if field in model_data: + value = model_data[field] + normalized_model[field] = ( + _sorted_dict(value) if isinstance(value, dict) else value + ) + models[reference.model_id] = normalized_model + providers[provider_id] = { + "id": provider_data["id"], + "name": provider_data["name"], + "models": models, + } + return { + "source": MODELS_DEV_URL, + "providers": providers, + } + + +def normalize_artificial_analysis_api_response(api_response: dict[str, Any]) -> dict[str, Any]: + """Normalize Artificial Analysis response fields needed at runtime.""" + return { + "source": ARTIFICIAL_ANALYSIS_URL, + "prompt_options": _sorted_dict(api_response.get("prompt_options") or {}), + "data": [ + {"id": model_data["id"], "name": model_data["name"]} + for model_data in sorted(api_response.get("data", []), key=lambda item: item["id"]) + ], + } + + +def fetch_models_dev_api_response() -> dict[str, Any]: + """Fetch the current Models.dev API response.""" + request = urllib.request.Request(MODELS_DEV_URL, headers={"User-Agent": "fri-utils"}) + with urllib.request.urlopen(request, timeout=30) as response: + return json.load(response) + + +def fetch_artificial_analysis_api_response() -> dict[str, Any]: + """Fetch the current Artificial Analysis LLM models API response.""" + try: + api_key = get_secret(ARTIFICIAL_ANALYSIS_API_KEY_SECRET_NAME) + except RuntimeError, exceptions.NotFound: + api_key = None + if not api_key: + raise RuntimeError( + f"Configure {ARTIFICIAL_ANALYSIS_API_KEY_SECRET_NAME} in GCP Secret Manager " + "to refresh the Artificial Analysis snapshot." + ) + request = urllib.request.Request( + ARTIFICIAL_ANALYSIS_URL, + headers={ + "User-Agent": "fri-utils", + "x-api-key": api_key, + }, + ) + with urllib.request.urlopen(request, timeout=30) as response: + return json.load(response) + + +def write_json_snapshot(snapshot: dict[str, Any], output_path: Path) -> None: + """Write a deterministic JSON snapshot.""" + output_path.parent.mkdir(parents=True, exist_ok=True) + output_path.write_text(json.dumps(snapshot, indent=2, sort_keys=True) + "\n") + + +def write_models_dev_snapshot(output_path: Path = DEFAULT_MODELS_DEV_OUTPUT_PATH) -> None: + """Fetch, normalize, and write the Models.dev snapshot.""" + snapshot = normalize_models_dev_api_response( + fetch_models_dev_api_response(), + models_dev_references=read_models_dev_references_from_model_registry(), + ) + write_json_snapshot(snapshot, output_path) + + +def write_artificial_analysis_snapshot( + output_path: Path = DEFAULT_ARTIFICIAL_ANALYSIS_OUTPUT_PATH, +) -> None: + """Fetch, normalize, and write the Artificial Analysis snapshot.""" + snapshot = normalize_artificial_analysis_api_response(fetch_artificial_analysis_api_response()) + write_json_snapshot(snapshot, output_path) + + +def write_snapshots() -> None: + """Fetch, normalize, and write all checked-in LLM metadata snapshots.""" + models_dev_snapshot = normalize_models_dev_api_response( + fetch_models_dev_api_response(), + models_dev_references=read_models_dev_references_from_model_registry(), + ) + artificial_analysis_snapshot = normalize_artificial_analysis_api_response( + fetch_artificial_analysis_api_response() + ) + write_json_snapshot(models_dev_snapshot, DEFAULT_MODELS_DEV_OUTPUT_PATH) + write_json_snapshot(artificial_analysis_snapshot, DEFAULT_ARTIFICIAL_ANALYSIS_OUTPUT_PATH) + + +if __name__ == "__main__": + write_snapshots() diff --git a/tests/integration/llm/providers/test_anthropic.py b/tests/integration/llm/providers/test_anthropic.py index 4b62795..7510368 100644 --- a/tests/integration/llm/providers/test_anthropic.py +++ b/tests/integration/llm/providers/test_anthropic.py @@ -1,17 +1,13 @@ """Integration tests for Anthropic model helpers.""" -from __future__ import annotations - import pytest import utils.llm.providers.anthropic as anthropic_module # type: ignore[import] -from tests.integration.helpers import ( - assert_capital_of_france, -) +from tests.integration.helpers import assert_capital_of_france from utils.llm.model_registry import MODELS, Model # type: ignore[import] ANTHROPIC_MODEL: Model | None = next( - (model for model in MODELS if model.id == "claude-sonnet-4-6"), None + (model for model in MODELS if model.model_key == "claude-sonnet-4-6"), None ) assert ANTHROPIC_MODEL is not None @@ -30,7 +26,7 @@ def test_anthropic_provider_get_response_live_call(): provider = anthropic_module.AnthropicProvider(api_key=api_key) assert_capital_of_france( lambda prompt: provider.get_response( - model_id=ANTHROPIC_MODEL.full_name, + model_id=ANTHROPIC_MODEL.provider_model_id, prompt=prompt, options={"temperature": 0, "max_tokens": 16}, ) diff --git a/tests/integration/llm/providers/test_google.py b/tests/integration/llm/providers/test_google.py index fe87c7d..23b2f90 100644 --- a/tests/integration/llm/providers/test_google.py +++ b/tests/integration/llm/providers/test_google.py @@ -1,7 +1,5 @@ """Integration tests for Google Gemini model helpers.""" -from __future__ import annotations - import pytest import utils.llm.providers.google as google_module # type: ignore[import] @@ -9,7 +7,7 @@ from utils.llm.model_registry import MODELS, Model # type: ignore[import] GOOGLE_MODEL: Model | None = next( - (model for model in MODELS if model.id == "gemini-2.5-flash"), None + (model for model in MODELS if model.model_key == "gemini-2.5-pro"), None ) assert GOOGLE_MODEL is not None @@ -28,7 +26,7 @@ def test_google_provider_get_response_live_call(): provider = google_module.GoogleProvider(api_key=api_key) assert_capital_of_france( lambda prompt: provider.get_response( - model_id=GOOGLE_MODEL.full_name, + model_id=GOOGLE_MODEL.provider_model_id, prompt=prompt, options={"temperature": 0}, ) diff --git a/tests/integration/llm/providers/test_openai.py b/tests/integration/llm/providers/test_openai.py index c72059f..8cec41b 100644 --- a/tests/integration/llm/providers/test_openai.py +++ b/tests/integration/llm/providers/test_openai.py @@ -1,7 +1,5 @@ """Integration tests for OpenAI model helpers.""" -from __future__ import annotations - import pytest import utils.llm.providers.openai as openai_module # type: ignore[import] @@ -9,7 +7,7 @@ from utils.llm.model_registry import MODELS, Model # type: ignore[import] OPENAI_MODEL: Model | None = next( - (model for model in MODELS if model.id == "gpt-5-2025-08-07"), None + (model for model in MODELS if model.model_key == "gpt-5-mini-2025-08-07"), None ) assert OPENAI_MODEL is not None @@ -28,7 +26,7 @@ def test_openai_provider_get_response_live_call(): provider = openai_module.OpenAIProvider(api_key=api_key) assert_capital_of_france( lambda prompt: provider.get_response( - model_id=OPENAI_MODEL.full_name, + model_id=OPENAI_MODEL.provider_model_id, prompt=prompt, options={"max_output_tokens": 256}, ) diff --git a/tests/integration/llm/providers/test_together.py b/tests/integration/llm/providers/test_together.py index 092284a..49eb69a 100644 --- a/tests/integration/llm/providers/test_together.py +++ b/tests/integration/llm/providers/test_together.py @@ -1,7 +1,5 @@ """Integration tests for Together AI model helpers.""" -from __future__ import annotations - import pytest import utils.llm.providers.together as together_module # type: ignore[import] @@ -9,9 +7,10 @@ from utils.llm.model_registry import MODELS, Model # type: ignore[import] TOGETHER_MODEL: Model | None = next( - (model for model in MODELS if model.id == "GLM-4.5-Air-FP8"), None + (model for model in MODELS if model.model_key == "minimax-m2.7"), None ) assert TOGETHER_MODEL is not None +assert TOGETHER_MODEL.active is True @pytest.mark.integration @@ -28,7 +27,7 @@ def test_together_provider_get_response_live_call(): provider = together_module.TogetherProvider(api_key=api_key) assert_capital_of_france( lambda prompt: provider.get_response( - model_id=TOGETHER_MODEL.full_name, + model_id=TOGETHER_MODEL.provider_model_id, prompt=prompt, options={"temperature": 0, "max_tokens": 256}, ) diff --git a/tests/integration/llm/providers/test_xai.py b/tests/integration/llm/providers/test_xai.py index cbb1f06..393c45a 100644 --- a/tests/integration/llm/providers/test_xai.py +++ b/tests/integration/llm/providers/test_xai.py @@ -1,14 +1,12 @@ """Integration tests for xAI model helpers.""" -from __future__ import annotations - import pytest import utils.llm.providers.xai as xai_module # type: ignore[import] from tests.integration.helpers import assert_capital_of_france # type: ignore[import] from utils.llm.model_registry import MODELS, Model # type: ignore[import] -XAI_MODEL: Model | None = next((model for model in MODELS if model.id == "grok-4-0709"), None) +XAI_MODEL: Model | None = next((model for model in MODELS if model.model_key == "grok-4.3"), None) assert XAI_MODEL is not None @@ -26,7 +24,7 @@ def test_xai_provider_get_response_live_call(): provider = xai_module.XAIProvider(api_key=api_key) assert_capital_of_france( lambda prompt: provider.get_response( - model_id=XAI_MODEL.full_name, + model_id=XAI_MODEL.provider_model_id, prompt=prompt, options={"temperature": 0}, ) diff --git a/tests/integration/llm/test_model_registry.py b/tests/integration/llm/test_model_registry.py index 53e5f80..491a186 100644 --- a/tests/integration/llm/test_model_registry.py +++ b/tests/integration/llm/test_model_registry.py @@ -1,35 +1,51 @@ -"""Integration tests that validate every registry model can be invoked.""" - -from __future__ import annotations +"""Integration tests that validate representative registry models can be invoked.""" import pytest -from utils.llm.model_registry import MODELS, Model # type: ignore[import] -from utils.llm.providers.anthropic import AnthropicProvider # type: ignore[import] -from utils.llm.providers.google import GoogleProvider # type: ignore[import] -from utils.llm.providers.openai import OpenAIProvider # type: ignore[import] -from utils.llm.providers.together import TogetherProvider # type: ignore[import] -from utils.llm.providers.xai import XAIProvider # type: ignore[import] +from utils.llm.model_registry import MODELS_BY_KEY, Model # type: ignore[import] +from utils.llm.provider_registry import PROVIDERS # type: ignore[import] from ..helpers import assert_capital_of_france +SMOKE_TEST_MODEL_KEYS = [ + "gpt-5-mini-2025-08-07", + "claude-sonnet-4-6", + "minimax-m2.7", + "grok-4.3", + "gemini-2.5-pro", +] + def _minimal_options_for_model(model: Model) -> dict: - if model.provider_cls is AnthropicProvider: - return {"max_tokens": 16} - if model.provider_cls is OpenAIProvider: - return {"max_output_tokens": 16} - if model.provider_cls in {TogetherProvider, XAIProvider}: + if model.provider == PROVIDERS["Anthropic"]: return {"max_tokens": 16} - if model.provider_cls is GoogleProvider: + if model.provider == PROVIDERS["OpenAI"]: + return {"max_output_tokens": 256} + if model.provider == PROVIDERS["Together"]: + return {"temperature": 0, "max_tokens": 256} + if model.provider == PROVIDERS["xAI"]: + return {"temperature": 0} + if model.provider == PROVIDERS["Google"]: return {} return {} +def test_together_smoke_options_leave_room_for_answer_text(): + """Keep routed Together smoke calls aligned with the provider smoke path.""" + model = MODELS_BY_KEY["minimax-m2.7"] + + assert _minimal_options_for_model(model) == {"temperature": 0, "max_tokens": 256} + + @pytest.mark.integration -@pytest.mark.parametrize("model", MODELS, ids=lambda item: item.id) +@pytest.mark.parametrize( + "model", + [MODELS_BY_KEY[model_key] for model_key in SMOKE_TEST_MODEL_KEYS], + ids=lambda item: item.model_key, +) def test_registered_model_live_call(model: Model): - """Each model entry should be callable via its registered provider.""" + """Representative active model entries should be callable via their providers.""" + assert model.active is True assert_capital_of_france( lambda prompt: model.get_response( prompt, diff --git a/tests/integration/llm/test_model_runs.py b/tests/integration/llm/test_model_runs.py new file mode 100644 index 0000000..98f11b8 --- /dev/null +++ b/tests/integration/llm/test_model_runs.py @@ -0,0 +1,30 @@ +"""Integration smoke tests for shared LLM model runs.""" + +import os + +import pytest + +from utils.llm import model_runs + +from .helpers import assert_capital_of_france + +DEFAULT_SMOKE_MODEL_RUN_KEYS = ("gpt-5-mini-2025-08-07-1024",) + + +def _selected_model_run_keys() -> tuple[str, ...]: + """Return model-run keys selected for live smoke testing.""" + raw_keys = os.getenv("LLM_MODEL_RUN_KEYS") + if not raw_keys: + return DEFAULT_SMOKE_MODEL_RUN_KEYS + return tuple(key.strip() for key in raw_keys.split(",") if key.strip()) + + +@pytest.mark.integration +@pytest.mark.parametrize( + "model_run", + model_runs.select_model_runs(_selected_model_run_keys()), + ids=lambda run: run.model_run_key, +) +def test_model_run_live_call(model_run: model_runs.ModelRun): + """A shared model run should be callable with its declared provider options.""" + assert_capital_of_france(model_run.get_response) diff --git a/tests/unit/test_artificial_analysis_metadata.py b/tests/unit/test_artificial_analysis_metadata.py new file mode 100644 index 0000000..6d71195 --- /dev/null +++ b/tests/unit/test_artificial_analysis_metadata.py @@ -0,0 +1,226 @@ +"""Tests for the checked-in Artificial Analysis metadata snapshot.""" + +import tomllib +from pathlib import Path + +import pytest + +from scripts import refresh_models_dev_metadata +from utils.llm.metadata import artificial_analysis + +ROOT_DIR = Path(__file__).resolve().parents[2] + + +def test_load_artificial_analysis_snapshot_exposes_model_metadata(): + """Load the snapshot and expose normalized model fields.""" + snapshot = artificial_analysis.load_artificial_analysis_snapshot() + + model = snapshot.get_model("2dad8957-4c16-4e74-bf2d-8b21514e0ae9") + opus_adaptive = snapshot.get_model("e9a09db3-8fd6-41dd-ba2f-20e0a2bff7f2") + opus_non_reasoning = snapshot.get_model("2fa8e143-77a8-4d05-bfa8-d3b54634c00f") + + assert snapshot.source == refresh_models_dev_metadata.ARTIFICIAL_ANALYSIS_URL + assert snapshot.prompt_options == {"parallel_queries": 1, "prompt_length": 1000} + assert model.id == "2dad8957-4c16-4e74-bf2d-8b21514e0ae9" + assert model.name == "o3-mini" + assert opus_adaptive.name == "Claude Opus 4.7 (Adaptive Reasoning, Max Effort)" + assert opus_non_reasoning.name == "Claude Opus 4.7 (Non-reasoning, High Effort)" + + +def test_artificial_analysis_snapshot_rejects_unknown_model(): + """Raise clear lookup errors for unknown Artificial Analysis model IDs.""" + snapshot = artificial_analysis.load_artificial_analysis_snapshot() + + with pytest.raises(KeyError, match="Unknown Artificial Analysis model_id missing-model"): + snapshot.get_model("missing-model") + + +def test_load_artificial_analysis_snapshot_supports_endpoint_dump_shape(tmp_path): + """Load AA model names from a generated full endpoint snapshot.""" + snapshot_path = tmp_path / "artificial_analysis_snapshot.json" + snapshot_path.write_text(""" +{ + "data": [ + { + "id": "opus-aa-id", + "name": "Claude Opus 4.7 (Adaptive Reasoning, Max Effort)", + "slug": "claude-opus-4-7" + } + ], + "prompt_options": { + "parallel_queries": 1, + "prompt_length": "medium" + }, + "source": "https://artificialanalysis.ai/api/v2/data/llms/models", + "status": 200 +} +""".strip()) + + snapshot = artificial_analysis.load_artificial_analysis_snapshot(snapshot_path) + + assert snapshot.get_model("opus-aa-id").name == ( + "Claude Opus 4.7 (Adaptive Reasoning, Max Effort)" + ) + + +def test_normalize_artificial_analysis_api_response_keeps_only_runtime_fields(): + """Keep only AA fields needed for stable IDs, display names, and attribution.""" + api_response = { + "status": 200, + "prompt_options": {"prompt_length": "medium", "parallel_queries": 1}, + "data": [ + { + "id": "z-model", + "name": "Z Model", + "slug": "z-model", + "model_creator": {"slug": "z-lab", "name": "Z Lab", "id": "z"}, + "evaluations": {"score_b": 2, "score_a": 1}, + "pricing": {"output": 2.5, "input": 1.5}, + "median_output_tokens_per_second": 12.5, + "median_time_to_first_token_seconds": 3.5, + "median_time_to_first_answer_token": 3.5, + "ignored": "drop me", + }, + { + "id": "a-model", + "name": "A Model", + "slug": "a-model", + "model_creator": {"id": "a", "name": "A Lab", "slug": "a-lab"}, + "evaluations": {}, + "pricing": {}, + }, + ], + } + + normalized = refresh_models_dev_metadata.normalize_artificial_analysis_api_response( + api_response + ) + + assert normalized["source"] == refresh_models_dev_metadata.ARTIFICIAL_ANALYSIS_URL + assert normalized["prompt_options"] == {"parallel_queries": 1, "prompt_length": "medium"} + assert [model["id"] for model in normalized["data"]] == ["a-model", "z-model"] + assert normalized["data"][1] == {"id": "z-model", "name": "Z Model"} + + +def test_checked_in_artificial_analysis_snapshot_is_minimal(): + """Do not redistribute the full AA endpoint response in package data.""" + snapshot = refresh_models_dev_metadata.json.loads(artificial_analysis.SNAPSHOT_PATH.read_text()) + + assert set(snapshot) == {"data", "prompt_options", "source"} + assert all(set(model) == {"id", "name"} for model in snapshot["data"]) + + +def test_artificial_analysis_snapshot_refresh_uses_gcp_secret(monkeypatch): + """Use the official GCP Secret Manager key for AA metadata refreshes.""" + monkeypatch.setattr( + refresh_models_dev_metadata, + "get_secret", + lambda secret_name: "gcp-aa-key", + ) + + request_headers = {} + + class FakeResponse: + def __enter__(self): + return self + + def __exit__(self, *args): + return None + + def fake_urlopen(request, timeout): + request_headers.update(request.headers) + assert timeout == 30 + return FakeResponse() + + monkeypatch.setattr(refresh_models_dev_metadata.urllib.request, "urlopen", fake_urlopen) + monkeypatch.setattr(refresh_models_dev_metadata.json, "load", lambda response: {"data": []}) + + assert refresh_models_dev_metadata.fetch_artificial_analysis_api_response() == {"data": []} + assert request_headers["X-api-key"] == "gcp-aa-key" + + +def test_artificial_analysis_snapshot_refresh_requires_api_key(monkeypatch): + """Require the GCP Secret Manager key when it is unavailable.""" + monkeypatch.setattr( + refresh_models_dev_metadata, + "get_secret", + lambda secret_name: (_ for _ in ()).throw(RuntimeError("GCP unavailable")), + ) + + with pytest.raises(RuntimeError, match="API_KEY_ARTIFICIAL_ANALYSIS"): + refresh_models_dev_metadata.fetch_artificial_analysis_api_response() + + +def test_write_snapshots_updates_models_dev_and_artificial_analysis(monkeypatch, tmp_path): + """Write both LLM metadata snapshots from the shared refresh entrypoint.""" + models_dev_output = tmp_path / "models_dev_snapshot.json" + artificial_analysis_output = tmp_path / "artificial_analysis_snapshot.json" + monkeypatch.setattr( + refresh_models_dev_metadata, + "DEFAULT_MODELS_DEV_OUTPUT_PATH", + models_dev_output, + ) + monkeypatch.setattr( + refresh_models_dev_metadata, + "DEFAULT_ARTIFICIAL_ANALYSIS_OUTPUT_PATH", + artificial_analysis_output, + ) + monkeypatch.setattr( + refresh_models_dev_metadata, + "fetch_models_dev_api_response", + lambda: { + "openai": { + "id": "openai", + "name": "OpenAI", + "models": { + "gpt-test": { + "id": "gpt-test", + "name": "GPT Test", + } + }, + } + }, + ) + monkeypatch.setattr( + refresh_models_dev_metadata, + "read_models_dev_references_from_model_registry", + lambda: frozenset( + { + refresh_models_dev_metadata.ModelsDevReference( + provider_id="openai", + model_id="gpt-test", + ) + } + ), + ) + monkeypatch.setattr( + refresh_models_dev_metadata, + "fetch_artificial_analysis_api_response", + lambda: { + "status": 200, + "data": [ + { + "id": "aa-test", + "name": "AA Test", + "slug": "aa-test", + "model_creator": {"id": "creator", "name": "Creator"}, + } + ], + }, + ) + + refresh_models_dev_metadata.write_snapshots() + + assert models_dev_output.exists() + assert artificial_analysis_output.exists() + assert "gpt-test" in models_dev_output.read_text() + assert "aa-test" in artificial_analysis_output.read_text() + + +def test_artificial_analysis_snapshot_is_included_as_package_data(): + """Include the JSON snapshot when utils is installed as a package.""" + pyproject = tomllib.loads((ROOT_DIR / "pyproject.toml").read_text()) + + package_data = pyproject["tool"]["setuptools"]["package-data"] + + assert package_data["utils.llm.metadata"] == ["*.json"] diff --git a/tests/unit/test_llm_model_runs.py b/tests/unit/test_llm_model_runs.py new file mode 100644 index 0000000..a247fb3 --- /dev/null +++ b/tests/unit/test_llm_model_runs.py @@ -0,0 +1,691 @@ +"""Unit tests for shared LLM model-run declarations.""" + +import ast +from datetime import date +from pathlib import Path +from unittest.mock import patch + +import pytest + +from utils.llm.lab_registry import LABS +from utils.llm.provider_registry import PROVIDERS + + +def test_model_keys_are_unique_and_file_safe(): + """Keep base model keys unique and safe for downstream identifiers.""" + from utils.llm import model_registry + + model_keys = [model.model_key for model in model_registry.MODELS] + + assert len(model_keys) == len(set(model_keys)) + assert all(key == key.lower() for key in model_keys) + assert all(" " not in key and "/" not in key and "_" not in key for key in model_keys) + + +def test_model_key_and_provider_model_id_are_distinct_for_together_models(): + """Keep canonical model keys separate from routed provider model IDs.""" + from utils.llm import model_registry + + model = model_registry.MODELS_BY_KEY["deepseek-v3.1"] + + assert model.model_key == "deepseek-v3.1" + assert model.provider_model_id == "deepseek-ai/DeepSeek-V3.1" + assert model.lab == LABS["DeepSeek"] + assert model.provider == PROVIDERS["Together"] + assert model.release_date == date(2025, 8, 21) + assert model.active is False + assert not hasattr(model, "token_limit") + assert not hasattr(model, "provider_cls") + + +def test_forecastbench_origin_main_models_are_in_canonical_registry(): + """Include recent ForecastBench origin/main models in the shared registry.""" + from utils.llm import model_registry + + expected_models = { + "deepseek-v4-pro": ( + "deepseek-ai/DeepSeek-V4-Pro", + LABS["DeepSeek"], + PROVIDERS["Together"], + date(2026, 4, 24), + ), + "gemini-3.5-flash": ( + "gemini-3.5-flash", + LABS["Google DeepMind"], + PROVIDERS["Google"], + date(2026, 5, 19), + ), + } + + for model_key, ( + provider_model_id, + lab, + provider, + release_date, + ) in expected_models.items(): + model = model_registry.MODELS_BY_KEY[model_key] + assert model.provider_model_id == provider_model_id + assert model.lab == lab + assert model.provider == provider + assert model.release_date == release_date + + +def test_model_release_date_resolves_from_models_dev_metadata(): + """Resolve model release dates from configured Models.dev metadata.""" + from utils.llm import model_registry + + model = model_registry.MODELS_BY_KEY["gpt-4o-2024-11-20"] + + assert "release_date" not in model_registry.Model.__dataclass_fields__ + assert "models_dev_provider_id" not in model_registry.Model.__dataclass_fields__ + assert "models_dev_model_id" not in model_registry.Model.__dataclass_fields__ + assert model.models_dev_reference == model_registry.ModelsDevReference( + provider_id="openai", + model_id="gpt-4o-2024-11-20", + ) + assert model.models_dev_metadata is not None + assert model.release_date == model.models_dev_metadata.release_date + assert model.release_date == date(2024, 11, 20) + + +def test_model_api_provider_route_is_independent_from_models_dev_provider(): + """Keep API routing separate from the Models.dev metadata provider.""" + from utils.llm import model_registry + + model = model_registry.MODELS_BY_KEY["glm-5.1"] + + assert model.provider == PROVIDERS["Together"] + assert model.provider_model_id == "zai-org/GLM-5.1" + assert model.models_dev_reference == model_registry.ModelsDevReference( + provider_id="zai", + model_id="glm-5.1", + ) + assert model.release_date == date(2026, 3, 27) + + +def test_models_without_models_dev_metadata_use_manual_release_dates(): + """Use manual release dates only when Models.dev metadata is unavailable.""" + from utils.llm import model_registry + + model = model_registry.MODELS_BY_KEY["gpt-4-0613"] + + assert model.models_dev_metadata is None + assert model.manual_release_date == date(2023, 6, 13) + assert model.release_date == date(2023, 6, 13) + + +def test_provider_specific_model_helpers_default_lab_provider_and_provider_model_id(): + """Use helper constructors to avoid repeated provider and lab boilerplate.""" + from utils.llm import model_registry + + model = model_registry.openai_model( + model_key="gpt-test", + models_dev_reference=model_registry.ModelsDevReference( + provider_id="openai", + model_id="gpt-4o", + ), + ) + + assert model.model_key == "gpt-test" + assert model.provider_model_id == "gpt-test" + assert model.lab == LABS["OpenAI"] + assert model.provider == PROVIDERS["OpenAI"] + assert model.models_dev_reference == model_registry.ModelsDevReference( + provider_id="openai", + model_id="gpt-4o", + ) + assert model.active is True + + +def test_together_model_helper_keeps_lab_and_route_explicit(): + """Keep Together creator lab and provider route explicit while reducing noise.""" + from utils.llm import model_registry + + model = model_registry.together_model( + model_key="glm-5.1", + provider_model_id="zai-org/GLM-5.1", + lab_key="Z.ai", + models_dev_reference=model_registry.ModelsDevReference( + provider_id="zai", + model_id="glm-5.1", + ), + ) + + assert model.lab == LABS["Z.ai"] + assert model.provider == PROVIDERS["Together"] + assert model.provider_model_id == "zai-org/GLM-5.1" + assert model.release_date == date(2026, 3, 27) + + +def test_model_without_models_dev_or_manual_release_date_fails_on_initialization(): + """Fail on construction when a model lacks a release date source.""" + from utils.llm import model_registry + + with pytest.raises(ValueError, match="missing-date-model"): + model_registry.Model( + model_key="missing-date-model", + provider_model_id="missing-date-model", + lab=LABS["OpenAI"], + provider=PROVIDERS["OpenAI"], + ) + + +def test_model_rejects_missing_models_dev_reference_on_initialization(): + """Reject model declarations whose Models.dev reference is not in the snapshot.""" + from utils.llm import model_registry + + with pytest.raises(ValueError, match="bad-reference-model"): + model_registry.openai_model( + model_key="bad-reference-model", + models_dev_reference=model_registry.ModelsDevReference( + provider_id="openai", + model_id="missing-model", + ), + ) + + +def test_openai_dated_model_uses_dated_provider_model_id(): + """Use fixed dated OpenAI provider model IDs instead of moving aliases.""" + from utils.llm import model_registry + + model = model_registry.MODELS_BY_KEY["gpt-4o-mini-2024-07-18"] + + assert model.provider_model_id == "gpt-4o-mini-2024-07-18" + + +def test_model_registry_models_are_grouped_by_provider(): + """Build the shared model registry from provider-specific groups.""" + from utils.llm import model_registry + + assert model_registry.MODELS == [ + *model_registry.OPENAI_MODELS, + *model_registry.TOGETHER_MODELS, + *model_registry.ANTHROPIC_MODELS, + *model_registry.XAI_MODELS, + *model_registry.GOOGLE_MODELS, + ] + assert {model.provider.name for model in model_registry.OPENAI_MODELS} == {"OpenAI"} + assert {model.provider.name for model in model_registry.TOGETHER_MODELS} == {"Together"} + assert {model.provider.name for model in model_registry.ANTHROPIC_MODELS} == {"Anthropic"} + assert {model.provider.name for model in model_registry.XAI_MODELS} == {"xAI"} + assert {model.provider.name for model in model_registry.GOOGLE_MODELS} == {"Google"} + + +def test_model_registry_provider_groups_are_sorted_by_release_date(): + """Keep provider-specific model groups sorted by release date.""" + from utils.llm import model_registry + + provider_groups = [ + model_registry.OPENAI_MODELS, + model_registry.TOGETHER_MODELS, + model_registry.ANTHROPIC_MODELS, + model_registry.XAI_MODELS, + model_registry.GOOGLE_MODELS, + ] + + for models in provider_groups: + release_order = [(model.release_date, model.model_key) for model in models] + assert release_order == sorted(release_order) + + +def test_create_models_list_rejects_duplicate_model_keys(): + """Reject duplicate model keys when creating the full model registry list.""" + from utils.llm import model_registry + + model = model_registry.MODELS_BY_KEY["gpt-4-0613"] + + with pytest.raises(ValueError, match="Duplicate LLM model_key: gpt-4-0613"): + model_registry.create_models_list([model, model]) + + +def test_model_runs_use_canonical_model_registry_objects(): + """Store canonical model registry objects on shared model runs.""" + from utils.llm import model_registry, model_runs + + run = model_runs.MODEL_RUNS_BY_KEY["deepseek-v3.1"] + + assert run.model is model_registry.MODELS_BY_KEY["deepseek-v3.1"] + + +def test_o3_mini_model_run_is_not_artificial_analysis_backed(): + """Keep o3-mini selectable without treating it as an AA-backed run.""" + from utils.llm import model_runs + + run = model_runs.MODEL_RUNS_BY_KEY["o3-mini-2025-01-31"] + + assert run.model_run_key == "o3-mini-2025-01-31" + assert run.artificial_analysis_id is None + assert run.display_name == "o3-mini-2025-01-31" + + +def test_model_run_constructor_requires_explicit_model_run_key(): + """Do not allow ModelRun keys to be generated implicitly.""" + from utils.llm import model_registry, model_runs + + with pytest.raises(TypeError): + model_runs.ModelRun( + model=model_registry.MODELS_BY_KEY["gpt-5.5-2026-04-23"], + ) + + +def test_model_run_declarations_use_literal_model_run_keys(): + """Keep shared model-run keys handwritten at declaration sites.""" + from utils.llm import model_runs + + source = Path(model_runs.__file__).read_text() + tree = ast.parse(source) + + calls = [ + node + for node in ast.walk(tree) + if isinstance(node, ast.Call) + and isinstance(node.func, ast.Name) + and node.func.id == "_model_run" + ] + + assert calls + for call in calls: + keyword = next( + (keyword for keyword in call.keywords if keyword.arg == "model_run_key"), + None, + ) + assert keyword is not None + assert isinstance(keyword.value, ast.Constant) + assert isinstance(keyword.value.value, str) + + +def test_artificial_analysis_model_runs_are_declared_in_dedicated_module(): + """Keep AA-backed model runs separate from the main registry list.""" + from utils.llm import artificial_analysis_model_runs, model_runs + + declaration_keys = [ + declaration["model_run_key"] + for declaration in artificial_analysis_model_runs.ARTIFICIAL_ANALYSIS_MODEL_RUN_DECLARATIONS + ] + registry_keys = [run.model_run_key for run in model_runs.ARTIFICIAL_ANALYSIS_MODEL_RUNS] + aa_keys_in_model_runs = [ + run.model_run_key for run in model_runs.MODEL_RUNS if run.artificial_analysis_id is not None + ] + + assert registry_keys == declaration_keys + assert aa_keys_in_model_runs == declaration_keys + assert "o3-mini-2025-01-31" not in declaration_keys + assert all(isinstance(key, str) for key in declaration_keys) + assert all( + declaration["artificial_analysis_id"] + for declaration in artificial_analysis_model_runs.ARTIFICIAL_ANALYSIS_MODEL_RUN_DECLARATIONS + ) + + +def test_artificial_analysis_opus_runs_use_aa_display_names_and_token_caps(): + """Align Opus AA model runs with AA display names and token cap conventions.""" + from utils.llm import model_runs + + non_reasoning = model_runs.MODEL_RUNS_BY_KEY["claude-opus-4-7-high-16384"] + adaptive = model_runs.MODEL_RUNS_BY_KEY["claude-opus-4-7-adaptive-thinking-max-128000"] + + assert non_reasoning.artificial_analysis_id == "2fa8e143-77a8-4d05-bfa8-d3b54634c00f" + assert non_reasoning.display_name == "Claude Opus 4.7 (Non-reasoning, High Effort)" + assert non_reasoning.options == { + "max_tokens": 16384, + "output_config": {"effort": "high"}, + } + assert adaptive.artificial_analysis_id == "e9a09db3-8fd6-41dd-ba2f-20e0a2bff7f2" + assert adaptive.display_name == "Claude Opus 4.7 (Adaptive Reasoning, Max Effort)" + assert adaptive.options == { + "max_tokens": 128000, + "output_config": {"effort": "max"}, + "thinking": {"type": "adaptive"}, + } + + +def test_artificial_analysis_model_runs_require_snapshot_ids(): + """Reject AA-backed model runs that reference missing snapshot IDs.""" + from utils.llm import model_registry, model_runs + + with pytest.raises(ValueError, match="Artificial Analysis"): + model_runs.ModelRun( + model_run_key="o3-mini-2025-01-31", + model=model_registry.MODELS_BY_KEY["o3-mini-2025-01-31"], + artificial_analysis_id="missing-aa-model", + ) + + +@pytest.mark.parametrize( + ("model_key", "options", "expected_key"), + [ + ("gpt-5.5-2026-04-23", {}, "gpt-5.5-2026-04-23"), + ( + "gpt-5.5-2026-04-23", + {"reasoning": {"effort": "high"}}, + "gpt-5.5-2026-04-23-high", + ), + ( + "gpt-5.5-2026-04-23", + {"reasoning": {"effort": "high"}, "tools": [{"type": "web_search"}]}, + "gpt-5.5-2026-04-23-high-web-search", + ), + ( + "claude-opus-4-7", + { + "max_tokens": 64000, + "output_config": {"effort": "high"}, + "thinking": {"type": "adaptive"}, + "tools": [ + { + "type": "web_search_20260209", + "name": "web_search", + "max_uses": 5, + } + ], + }, + "claude-opus-4-7-adaptive-thinking-high-web-search-64000", + ), + ( + "grok-4.20-0309-reasoning", + { + "tools": [{"type": "web_search"}, {"type": "x_search"}], + "max_tokens": 10000, + }, + "grok-4.20-0309-reasoning-web-search-x-search-10000", + ), + ("deepseek-v3.1", {"max_tokens": 10000}, "deepseek-v3.1-10000"), + ], +) +def test_model_run_key_is_generated_from_name_relevant_options(model_key, options, expected_key): + """Generate stable model-run keys from name-relevant options.""" + from utils.llm import model_runs + + assert model_runs.build_model_run_key(model_key, options) == expected_key + + +def test_name_neutral_options_do_not_appear_in_model_run_key(): + """Exclude name-neutral options from generated model-run keys.""" + from utils.llm import model_runs + + assert ( + model_runs.build_model_run_key( + "gemini-3.1-pro-preview", + { + "candidate_count": 1, + "temperature": 0, + "automatic_function_calling": {"disable": True}, + }, + ) + == "gemini-3.1-pro-preview" + ) + + +def test_model_run_exposes_explicit_model_run_key_as_name(): + """Use the handwritten model-run key as the compatibility name.""" + from utils.llm import model_registry, model_runs + + run = model_runs.ModelRun( + model_run_key="gemini-3.1-pro-preview", + model=model_registry.MODELS_BY_KEY["gemini-3.1-pro-preview"], + options={ + "candidate_count": 1, + "temperature": 0, + "automatic_function_calling": {"disable": True}, + }, + ) + + assert run.model_run_key == "gemini-3.1-pro-preview" + assert run.name == "gemini-3.1-pro-preview" + + +def test_unknown_option_paths_raise_in_model_run_validation(): + """Reject undeclared model-run options instead of silently naming them.""" + from utils.llm import model_registry, model_runs + + with pytest.raises(ValueError, match="name-relevant or name-neutral"): + model_runs.ModelRun( + model_run_key="gpt-5.5-2026-04-23", + model=model_registry.MODELS_BY_KEY["gpt-5.5-2026-04-23"], + options={"new_performance_option": True}, + ) + + +def test_model_run_routes_provider_model_id_to_get_response(): + """Route model-run calls through the provider model ID and merged options.""" + from utils.llm import model_registry, model_runs + + run = model_runs.ModelRun( + model_run_key="deepseek-v3.1", + model=model_registry.MODELS_BY_KEY["deepseek-v3.1"], + options={"temperature": 0}, + ) + + with patch("utils.llm.model_registry.get_response", return_value="forecast") as get_response: + response = run.get_response("prompt", max_tokens=10000) + + assert response == "forecast" + get_response.assert_called_once_with( + provider=PROVIDERS["Together"], + model_id="deepseek-ai/DeepSeek-V3.1", + prompt="prompt", + options={"temperature": 0, "max_tokens": 10000}, + ) + + +def test_shared_model_run_registry_contains_forecastbench_and_timeseriesbench_runs(): + """Expose model runs needed by ForecastBench and TimeSeriesBench.""" + from utils.llm import model_runs + + expected_keys = { + "gpt-4o-mini-2024-07-18", + "gpt-5-nano-2025-08-07", + "gpt-5-mini-2025-08-07", + "gpt-5-mini-2025-08-07-1024", + "gpt-5.2-2025-12-11", + "gpt-5.4-2026-03-05", + "gpt-5.4-2026-03-05-high", + "gpt-5.4-2026-03-05-high-web-search", + "gpt-5.4-mini-2026-03-17", + "gpt-5.4-nano-2026-03-17", + "gpt-5.5-2026-04-23", + "gpt-5.5-2026-04-23-medium", + "gpt-5.5-2026-04-23-high", + "gpt-5.5-2026-04-23-high-web-search", + "deepseek-v3.1", + "deepseek-v4-pro", + "minimax-m2.5", + "minimax-m2.7", + "kimi-k2.5", + "kimi-k2.6", + "glm-5.1", + "gemma-4-31b", + "claude-haiku-4-5-20251001-1024", + "claude-haiku-4-5-20251001-4096", + "claude-sonnet-4-5-20250929-1024", + "claude-sonnet-4-5-20250929-4096", + "claude-sonnet-4-6-1024", + "claude-sonnet-4-6-4096", + "claude-sonnet-4-6-adaptive-thinking-16000", + "claude-opus-4-6-4096", + "claude-opus-4-7-1024", + "claude-opus-4-7-4096", + "claude-opus-4-7-high-16384", + "claude-opus-4-7-adaptive-thinking-high-24000", + "claude-opus-4-7-adaptive-thinking-high-web-search-64000", + "claude-opus-4-7-adaptive-thinking-max-128000", + "claude-opus-4-8-1024", + "claude-opus-4-8-4096", + "claude-opus-4-8-adaptive-thinking-high-24000", + "claude-opus-4-8-adaptive-thinking-high-web-search-64000", + "grok-4-1-fast-reasoning", + "grok-4-1-fast-non-reasoning", + "grok-4.20-0309-reasoning", + "grok-4.20-0309-reasoning-web-search-x-search", + "grok-4.20-0309-non-reasoning", + "grok-4.3", + "gemini-2.5-pro", + "gemini-2.5-pro-web-search", + "gemini-3-flash-preview", + "gemini-3.1-flash-lite-preview", + "gemini-3.1-flash-lite", + "gemini-3.1-pro-preview", + "gemini-3.5-flash", + } + + assert expected_keys <= set(model_runs.MODEL_RUNS_BY_KEY) + + +def test_shared_model_run_keys_are_unique_and_file_safe(): + """Keep shared model-run keys unique and safe for filenames.""" + from utils.llm import model_runs + + keys = [run.model_run_key for run in model_runs.MODEL_RUNS] + + assert len(keys) == len(set(keys)) + assert all(key == key.lower() for key in keys) + assert all(" " not in key and "/" not in key and "_" not in key for key in keys) + + +def test_active_model_runs_exclude_runs_for_inactive_models(): + """Keep inactive provider routes in history while excluding them from live runs.""" + from utils.llm import model_runs + + all_keys = {run.model_run_key for run in model_runs.MODEL_RUNS} + active_keys = {run.model_run_key for run in model_runs.ACTIVE_MODEL_RUNS} + + assert "deepseek-v3.1" in all_keys + assert "deepseek-v3.1" not in active_keys + assert all(run.model.active for run in model_runs.ACTIVE_MODEL_RUNS) + assert model_runs.ACTIVE_MODEL_RUNS_BY_KEY == { + run.model_run_key: run for run in model_runs.ACTIVE_MODEL_RUNS + } + + +def test_create_model_runs_list_rejects_duplicate_model_run_keys(): + """Reject duplicate model-run keys when creating the shared run registry list.""" + from utils.llm import model_runs + + run = model_runs.MODEL_RUNS_BY_KEY["gpt-5.5-2026-04-23"] + + with pytest.raises(ValueError, match="Duplicate LLM model_run_key"): + model_runs.create_model_runs_list([run, run]) + + +def test_declared_model_run_options_do_not_share_mutable_objects(): + """Avoid sharing mutable nested option objects across declared runs.""" + from utils.llm import model_runs + + preview_run = model_runs.MODEL_RUNS_BY_KEY["gemini-3-flash-preview"] + lite_run = model_runs.MODEL_RUNS_BY_KEY["gemini-3.1-flash-lite-preview"] + + assert preview_run.options is not lite_run.options + assert ( + preview_run.options["automatic_function_calling"] + is not lite_run.options["automatic_function_calling"] + ) + + +def test_release_dates_exist_for_all_shared_models(): + """Expose release dates for every shared canonical model.""" + from utils.llm import model_registry + + release_dates = model_registry.model_release_dates_by_key() + + assert release_dates["gpt-5.5-2026-04-23"] == date(2026, 4, 23) + assert release_dates["deepseek-v3.1"] == date(2025, 8, 21) + assert release_dates["gemini-3.1-flash-lite"] == date(2026, 5, 7) + for model in model_registry.MODELS: + assert release_dates[model.model_key] == model.release_date + + +def test_historical_forecastbench_llm_release_dates_are_available(): + """Keep historical ForecastBench LLM release dates in the model registry.""" + from utils.llm import model_registry + + historical_model_keys = { + "claude-2.1", + "claude-3-5-sonnet-20240620", + "claude-3-5-sonnet-20241022", + "claude-3-7-sonnet-20250219", + "claude-3-haiku-20240307", + "claude-3-opus-20240229", + "claude-opus-4-1-20250805", + "claude-opus-4-20250514", + "claude-opus-4-5-20251101", + "claude-sonnet-4-20250514", + "deepseek-r1", + "deepseek-v3", + "gemini-1.5-flash", + "gemini-1.5-pro", + "gemini-2.0-flash-lite-001", + "gemini-2.5-flash", + "gemini-2.5-flash-preview-04-17", + "gemini-2.5-pro-exp-03-25", + "gemini-2.5-pro-preview-03-25", + "gemini-3-pro-preview", + "glm-4.5-air-fp8", + "glm-4.6", + "glm-4.7", + "glm-5", + "gpt-3.5-turbo-0125", + "gpt-4-0613", + "gpt-4-turbo-2024-04-09", + "gpt-4.1-2025-04-14", + "gpt-4.5-preview-2025-02-27", + "gpt-4o", + "gpt-4o-2024-05-13", + "gpt-4o-2024-11-20", + "gpt-5-2025-08-07", + "gpt-5.1-2025-11-13", + "grok-4-0709", + "grok-4-fast-non-reasoning", + "grok-4-fast-reasoning", + "grok-beta", + "kimi-k2-instruct", + "kimi-k2-instruct-0905", + "kimi-k2-thinking", + "llama-2-70b-chat-hf", + "llama-3-70b-chat-hf", + "llama-3-8b-chat-hf", + "llama-3.2-3b-instruct-turbo", + "llama-3.3-70b-instruct-turbo", + "llama-4-maverick-17b-128e-instruct-fp8", + "llama-4-scout-17b-16e-instruct", + "magistral-medium-2506", + "meta-llama-3.1-405b-instruct-turbo", + "mistral-large-2407", + "mistral-large-2411", + "mistral-large-latest", + "mixtral-8x22b-instruct-v0.1", + "mixtral-8x7b-instruct-v0.1", + "o3-2025-04-16", + "o3-mini-2025-01-31", + "o4-mini-2025-04-16", + "qwen1.5-110b-chat", + "qwen2.5-72b-instruct-turbo", + "qwen3-235b-a22b-fp8-tput", + "qwen3-235b-a22b-thinking-2507", + "qwq-32b-preview", + } + release_dates = model_registry.model_release_dates_by_key() + + assert historical_model_keys <= set(model_registry.MODELS_BY_KEY) + assert not hasattr(model_registry, "HISTORICAL_MODEL_RELEASE_DATES_BY_KEY") + assert release_dates["gpt-4-0613"] == date(2023, 6, 13) + assert release_dates["claude-2.1"] == date(2023, 11, 21) + assert release_dates["kimi-k2-instruct"] == date(2025, 7, 12) + assert release_dates["qwen3-235b-a22b-thinking-2507"] == date(2025, 7, 25) + assert all(key in release_dates for key in historical_model_keys) + assert not any(key.startswith("unusedgrok") for key in model_registry.MODELS_BY_KEY) + assert "Always 0" not in release_dates + assert "Naive Forecaster" not in release_dates + + +def test_select_model_runs_preserves_order_and_rejects_unknown_keys(): + """Select shared model runs in requested order and fail on unknown keys.""" + from utils.llm import model_runs + + selected = model_runs.select_model_runs(["gpt-5.4-2026-03-05", "deepseek-v3.1"]) + + assert [run.model_run_key for run in selected] == [ + "gpt-5.4-2026-03-05", + "deepseek-v3.1", + ] + with pytest.raises(KeyError, match="missing-model"): + model_runs.select_model_runs(["missing-model"]) diff --git a/tests/unit/test_llm_routing.py b/tests/unit/test_llm_routing.py index eefb7ac..3d0c11e 100644 --- a/tests/unit/test_llm_routing.py +++ b/tests/unit/test_llm_routing.py @@ -1,7 +1,6 @@ """Unit tests for LLM provider routing.""" -from __future__ import annotations - +from datetime import date from types import SimpleNamespace from typing import Any from unittest.mock import MagicMock, patch @@ -18,6 +17,8 @@ def test_labs_have_leaderboard_names(): assert LABS["OpenAI"].leaderboard_name == "OpenAI" assert LABS["Moonshot"].name == "Moonshot" assert LABS["Moonshot"].leaderboard_name == "Moonshot AI" + assert LABS["MiniMax"].name == "MiniMax" + assert LABS["MiniMax"].leaderboard_name == "MiniMax" assert "Google" not in LABS assert LABS["Google DeepMind"].name == "Google DeepMind" assert LABS["Google DeepMind"].leaderboard_name == "Google DeepMind" @@ -73,20 +74,19 @@ def get_response( } -def test_model_get_response_routes_full_name_and_options(): +def test_model_get_response_routes_provider_model_id_and_options(): """Model routing should call providers through the public final interface.""" from utils.llm import model_registry - from utils.llm.model_registry import Model, OpenAIProvider + from utils.llm.model_registry import Model observed: dict[str, Any] = {} options = {"temperature": 0, "max_tokens": 128} model = Model( - id="reasoning-model", - full_name="reasoning-model", - token_limit=128_000, - provider_cls=OpenAIProvider, + model_key="reasoning-model", + provider_model_id="reasoning-model", + provider=PROVIDERS["OpenAI"], lab=LABS["OpenAI"], - reasoning_model=True, + manual_release_date=date(2026, 1, 1), ) class FakeProvider: @@ -107,7 +107,7 @@ def get_response( assert response == "reasoning text" assert observed == { - "model_id": model.full_name, + "model_id": model.provider_model_id, "prompt": "forecast prompt", "options": options, } @@ -248,6 +248,7 @@ def test_anthropic_provider_forwards_options_without_asserting_max_tokens(): prompt="forecast", options={ "max_tokens": 16000, + "output_config": {"effort": "max"}, "thinking": {"type": "adaptive"}, "tools": [{"type": "web_search_20250305", "name": "web_search"}], }, @@ -258,6 +259,7 @@ def test_anthropic_provider_forwards_options_without_asserting_max_tokens(): model="claude-opus-4-6", messages=[{"role": "user", "content": "forecast"}], max_tokens=16000, + output_config={"effort": "max"}, thinking={"type": "adaptive"}, tools=[{"type": "web_search_20250305", "name": "web_search"}], ) diff --git a/tests/unit/test_models_dev_metadata.py b/tests/unit/test_models_dev_metadata.py new file mode 100644 index 0000000..236622d --- /dev/null +++ b/tests/unit/test_models_dev_metadata.py @@ -0,0 +1,282 @@ +"""Tests for the checked-in Models.dev metadata snapshot.""" + +import json +import tomllib +from datetime import date +from pathlib import Path + +import pytest + +from scripts import refresh_models_dev_metadata +from utils.llm.metadata import models_dev + +ROOT_DIR = Path(__file__).resolve().parents[2] + + +def test_load_models_dev_snapshot_exposes_provider_and_model_metadata(): + """Load the snapshot and expose normalized provider and model fields.""" + snapshot = models_dev.load_models_dev_snapshot() + + openai = snapshot.providers["openai"] + assert openai.name == "OpenAI" + + gpt_4o = openai.models["gpt-4o"] + assert gpt_4o.id == "gpt-4o" + assert gpt_4o.name == "GPT-4o" + assert gpt_4o.release_date == date(2024, 5, 13) + assert set(gpt_4o.raw) == {"id", "name", "release_date"} + + +def test_models_dev_snapshot_can_lookup_model_by_provider_and_model_id(): + """Look up a normalized model by Models.dev provider and model IDs.""" + snapshot = models_dev.load_models_dev_snapshot() + + model = snapshot.get_model(provider_id="anthropic", model_id="claude-3-haiku-20240307") + + assert model.name == "Claude Haiku 3" + assert model.release_date == date(2024, 3, 13) + + +def test_models_dev_snapshot_rejects_unknown_provider_or_model(): + """Raise clear lookup errors for unknown provider or model IDs.""" + snapshot = models_dev.load_models_dev_snapshot() + + try: + snapshot.get_model(provider_id="missing", model_id="gpt-4o") + except KeyError as exc: + assert "Unknown Models.dev provider_id missing" in str(exc) + else: + raise AssertionError("Expected missing provider lookup to fail") + + try: + snapshot.get_model(provider_id="openai", model_id="missing") + except KeyError as exc: + assert "Unknown Models.dev model_id missing for provider_id openai" in str(exc) + else: + raise AssertionError("Expected missing model lookup to fail") + + +def test_models_dev_snapshot_preserves_raw_partial_release_dates(tmp_path): + """Preserve month-only source dates while omitting typed date values.""" + snapshot_path = tmp_path / "models_dev_snapshot.json" + snapshot_path.write_text( + json.dumps( + { + "source": refresh_models_dev_metadata.MODELS_DEV_URL, + "providers": { + "abacus": { + "id": "abacus", + "name": "Abacus", + "models": { + "kimi-k2.5": { + "id": "kimi-k2.5", + "name": "Kimi K2.5", + "release_date": "2026-01", + } + }, + } + }, + } + ) + ) + snapshot = models_dev.load_models_dev_snapshot(snapshot_path) + + model = snapshot.get_model(provider_id="abacus", model_id="kimi-k2.5") + + assert model.release_date is None + assert model.raw["release_date"] == "2026-01" + + +def test_models_dev_snapshot_preserves_raw_invalid_release_dates(tmp_path): + """Preserve invalid source dates while omitting typed date values.""" + snapshot_path = tmp_path / "models_dev_snapshot.json" + snapshot_path.write_text( + json.dumps( + { + "source": refresh_models_dev_metadata.MODELS_DEV_URL, + "providers": { + "scaleway": { + "id": "scaleway", + "name": "Scaleway", + "models": { + "qwen3-embedding-8b": { + "id": "qwen3-embedding-8b", + "name": "Qwen3 Embedding 8B", + "release_date": "2025-25-11", + } + }, + } + }, + } + ) + ) + snapshot = models_dev.load_models_dev_snapshot(snapshot_path) + + model = snapshot.get_model(provider_id="scaleway", model_id="qwen3-embedding-8b") + + assert model.release_date is None + assert model.raw["release_date"] == "2025-25-11" + + +def test_read_models_dev_references_from_model_registry_uses_ast(tmp_path): + """Discover Models.dev references from declarations without importing the registry.""" + registry_path = tmp_path / "model_registry.py" + registry_path.write_text(""" +openai_model( + model_key="gpt-test", + models_dev_reference=ModelsDevReference( + provider_id="openai", + model_id="gpt-test", + ), +) +together_model( + model_key="manual-only", + manual_release_date=date(2026, 1, 1), +) +anthropic_model( + model_key="claude-test", + models_dev_reference=ModelsDevReference(provider_id="anthropic", model_id="claude-test"), +) +""") + + references = refresh_models_dev_metadata.read_models_dev_references_from_model_registry( + registry_path + ) + + assert references == frozenset( + { + refresh_models_dev_metadata.ModelsDevReference("anthropic", "claude-test"), + refresh_models_dev_metadata.ModelsDevReference("openai", "gpt-test"), + } + ) + + +def test_normalize_models_dev_api_response_keeps_expected_fields_sorted(): + """Normalize only referenced model release-date fields and sort providers and models.""" + api_response = { + "openai": { + "id": "openai", + "name": "OpenAI", + "models": { + "z-model": { + "id": "z-model", + "name": "Z Model", + "release_date": "2026-01-02", + "last_updated": "2026-01-03", + "limit": {"output": 2, "context": 1}, + "cost": {"input": 1.5}, + "reasoning": True, + "temperature": False, + "tool_call": True, + "ignored": "drop me", + }, + "a-model": { + "id": "a-model", + "name": "A Model", + "release_date": None, + "last_updated": None, + "limit": {}, + "cost": None, + "reasoning": False, + "temperature": True, + "tool_call": False, + }, + "unused-model": { + "id": "unused-model", + "name": "Unused Model", + "release_date": "2026-01-04", + }, + }, + }, + "unused-provider": { + "id": "unused-provider", + "name": "Unused Provider", + "models": { + "unused": { + "id": "unused", + "name": "Unused", + "release_date": "2026-01-05", + } + }, + }, + } + + normalized = refresh_models_dev_metadata.normalize_models_dev_api_response( + api_response, + models_dev_references=frozenset( + { + refresh_models_dev_metadata.ModelsDevReference("openai", "z-model"), + refresh_models_dev_metadata.ModelsDevReference("openai", "a-model"), + } + ), + ) + + assert list(normalized["providers"]) == ["openai"] + assert list(normalized["providers"]["openai"]["models"]) == ["a-model", "z-model"] + assert normalized["providers"]["openai"]["models"]["z-model"] == { + "id": "z-model", + "name": "Z Model", + "release_date": "2026-01-02", + } + + +def test_normalize_models_dev_api_response_errors_with_reference_suggestions(): + """Reject incorrect exact references and suggest nearby provider-local candidates.""" + api_response = { + "openai": { + "id": "openai", + "name": "OpenAI", + "models": { + "gpt-5.6-2026-06-01": { + "id": "gpt-5.6-2026-06-01", + "name": "GPT-5.6", + "release_date": "2026-06-01", + }, + "gpt-5.6-chat": { + "id": "gpt-5.6-chat", + "name": "GPT-5.6 Chat", + "release_date": "2026-06-01", + }, + }, + } + } + + with pytest.raises(ValueError) as excinfo: + refresh_models_dev_metadata.normalize_models_dev_api_response( + api_response, + models_dev_references=frozenset( + {refresh_models_dev_metadata.ModelsDevReference("openai", "gpt-5.6")} + ), + ) + + message = str(excinfo.value) + assert "Missing Models.dev reference: openai/gpt-5.6" in message + assert 'openai/gpt-5.6-2026-06-01 name="GPT-5.6"' in message + assert 'openai/gpt-5.6-chat name="GPT-5.6 Chat"' in message + + +def test_checked_in_models_dev_snapshot_contains_only_registry_references(): + """Keep the Models.dev snapshot scoped to the registry references that use it.""" + references = refresh_models_dev_metadata.read_models_dev_references_from_model_registry() + snapshot = json.loads(models_dev.SNAPSHOT_PATH.read_text()) + snapshot_references = frozenset( + refresh_models_dev_metadata.ModelsDevReference(provider_id, model_id) + for provider_id, provider_data in snapshot["providers"].items() + for model_id in provider_data["models"] + ) + + assert snapshot_references == references + assert all( + set(model_data) <= {"id", "name", "release_date"} + for provider_data in snapshot["providers"].values() + for model_data in provider_data["models"].values() + ) + + +def test_models_dev_snapshot_is_included_as_package_data(): + """Include the JSON snapshot when utils is installed as a package.""" + pyproject = tomllib.loads((ROOT_DIR / "pyproject.toml").read_text()) + + package_data = pyproject["tool"]["setuptools"]["package-data"] + + assert package_data["utils.llm.metadata"] == ["*.json"] diff --git a/tests/unit/test_project_dependencies.py b/tests/unit/test_project_dependencies.py new file mode 100644 index 0000000..0487de2 --- /dev/null +++ b/tests/unit/test_project_dependencies.py @@ -0,0 +1,66 @@ +"""Tests for package dependency ownership.""" + +import tomllib +from pathlib import Path + +ROOT = Path(__file__).resolve().parents[2] + +SHARED_RUNTIME_DEPENDENCIES = { + "anthropic", + "google-cloud-secret-manager", + "google-cloud-storage", + "google-genai", + "openai", + "python-dotenv", + "together", +} + + +def _requirement_name(requirement: str) -> str: + return ( + requirement.split("==", maxsplit=1)[0] + .split(">=", maxsplit=1)[0] + .split("<", maxsplit=1)[0] + .strip() + ) + + +def test_shared_runtime_dependencies_are_declared_in_pyproject(): + """Keep shared LLM runtime dependencies owned by pyproject metadata.""" + pyproject = tomllib.loads((ROOT / "pyproject.toml").read_text()) + dependency_names = { + _requirement_name(dependency) for dependency in pyproject["project"]["dependencies"] + } + + assert SHARED_RUNTIME_DEPENDENCIES <= dependency_names + + +def test_requirements_txt_delegates_to_pyproject_dev_extra(): + """Keep local requirements install behavior delegated to the dev extra.""" + requirements = (ROOT / "requirements.txt").read_text().splitlines() + + assert requirements == [".[dev]"] + + +def test_dev_extra_preserves_requirements_txt_dev_tooling(): + """Keep previous requirements.txt test tooling in the dev extra.""" + pyproject = tomllib.loads((ROOT / "pyproject.toml").read_text()) + dev_dependencies = pyproject["project"]["optional-dependencies"]["dev"] + dev_dependency_names = {_requirement_name(dependency) for dependency in dev_dependencies} + + assert "pytest-xdist" in dev_dependency_names + assert all("==" in dependency for dependency in dev_dependencies) + + +def test_third_party_notices_cover_metadata_sources(): + """Preserve license and attribution notices for checked-in metadata snapshots.""" + notices = (ROOT / "THIRD_PARTY_NOTICES.md").read_text() + pyproject = tomllib.loads((ROOT / "pyproject.toml").read_text()) + + assert "Models.dev" in notices + assert "https://github.com/anomalyco/models.dev" in notices + assert "MIT License" in notices + assert "Copyright (c) 2025 models.dev" in notices + assert "Artificial Analysis" in notices + assert "https://artificialanalysis.ai/" in notices + assert "THIRD_PARTY_NOTICES.md" in pyproject["tool"]["setuptools"]["license-files"] diff --git a/utils/helpers/constants.py b/utils/helpers/constants.py index 5265339..643837f 100644 --- a/utils/helpers/constants.py +++ b/utils/helpers/constants.py @@ -11,6 +11,7 @@ OPENAI_API_KEY_SECRET_NAME: str = "API_KEY_OPENAI" XAI_API_KEY_SECRET_NAME: str = "API_KEY_XAI" TOGETHER_API_KEY_SECRET_NAME: str = "API_KEY_TOGETHERAI" +ARTIFICIAL_ANALYSIS_API_KEY_SECRET_NAME: str = "API_KEY_ARTIFICIAL_ANALYSIS" __all__ = [ "GOOGLE_CLOUD_PROJECT_ENV_VAR", @@ -20,4 +21,5 @@ "OPENAI_API_KEY_SECRET_NAME", "XAI_API_KEY_SECRET_NAME", "TOGETHER_API_KEY_SECRET_NAME", + "ARTIFICIAL_ANALYSIS_API_KEY_SECRET_NAME", ] diff --git a/utils/llm/__init__.py b/utils/llm/__init__.py index 65474d6..e8af157 100644 --- a/utils/llm/__init__.py +++ b/utils/llm/__init__.py @@ -2,7 +2,13 @@ from importlib import import_module -__all__ = ["lab_registry", "model_registry", "provider_registry", "providers"] +__all__ = [ + "lab_registry", + "model_registry", + "model_runs", + "provider_registry", + "providers", +] def __getattr__(name: str): diff --git a/utils/llm/artificial_analysis_model_runs.py b/utils/llm/artificial_analysis_model_runs.py new file mode 100644 index 0000000..2dfd09a --- /dev/null +++ b/utils/llm/artificial_analysis_model_runs.py @@ -0,0 +1,38 @@ +"""Artificial Analysis-backed model-run declarations.""" + +from collections.abc import Callable +from typing import Any + +# Every declaration here is intended to be part of MODEL_RUNS. Add a run here +# only after its provider-callable options are ready for benchmark selection. +ARTIFICIAL_ANALYSIS_MODEL_RUN_DECLARATIONS: tuple[dict[str, Any], ...] = ( + { + "model_run_key": "claude-opus-4-7-high-16384", + "model_key": "claude-opus-4-7", + "options": { + "max_tokens": 16384, + "output_config": {"effort": "high"}, + }, + "artificial_analysis_id": "2fa8e143-77a8-4d05-bfa8-d3b54634c00f", + }, + { + "model_run_key": "claude-opus-4-7-adaptive-thinking-max-128000", + "model_key": "claude-opus-4-7", + "options": { + "max_tokens": 128000, + "output_config": {"effort": "max"}, + "thinking": {"type": "adaptive"}, + }, + "artificial_analysis_id": "e9a09db3-8fd6-41dd-ba2f-20e0a2bff7f2", + }, +) + + +def create_artificial_analysis_model_runs( + model_run_factory: Callable[..., Any], +) -> list[Any]: + """Build AA-backed model runs using the main registry's factory.""" + return [ + model_run_factory(**declaration) + for declaration in ARTIFICIAL_ANALYSIS_MODEL_RUN_DECLARATIONS + ] diff --git a/utils/llm/lab_registry.py b/utils/llm/lab_registry.py index dc7bd69..e74c5aa 100644 --- a/utils/llm/lab_registry.py +++ b/utils/llm/lab_registry.py @@ -23,8 +23,10 @@ def leaderboard_name(self) -> str: "Anthropic": Lab(name="Anthropic"), "DeepSeek": Lab(name="DeepSeek"), "Moonshot": Lab(name="Moonshot", display_name="Moonshot AI"), + "MiniMax": Lab(name="MiniMax"), "Google DeepMind": Lab(name="Google DeepMind"), "Meta": Lab(name="Meta"), + "Mistral AI": Lab(name="Mistral AI"), "OpenAI": Lab(name="OpenAI"), "Qwen": Lab(name="Qwen"), "xAI": Lab(name="xAI"), diff --git a/utils/llm/metadata/__init__.py b/utils/llm/metadata/__init__.py new file mode 100644 index 0000000..bb5319d --- /dev/null +++ b/utils/llm/metadata/__init__.py @@ -0,0 +1 @@ +"""LLM metadata snapshots and loaders.""" diff --git a/utils/llm/metadata/artificial_analysis.py b/utils/llm/metadata/artificial_analysis.py new file mode 100644 index 0000000..0cdbbcf --- /dev/null +++ b/utils/llm/metadata/artificial_analysis.py @@ -0,0 +1,63 @@ +"""Loader for the checked-in Artificial Analysis metadata snapshot.""" + +import json +from dataclasses import dataclass +from functools import lru_cache +from pathlib import Path +from typing import Any + +SNAPSHOT_PATH = Path(__file__).with_name("artificial_analysis_snapshot.json") + + +@dataclass(frozen=True, slots=True) +class ArtificialAnalysisModel: + """Normalized metadata for one Artificial Analysis LLM model entry.""" + + id: str + name: str + + +@dataclass(frozen=True, slots=True) +class ArtificialAnalysisSnapshot: + """Loaded Artificial Analysis metadata indexed by stable model ID.""" + + models: dict[str, ArtificialAnalysisModel] + source: str + prompt_options: dict[str, Any] + + def get_model(self, model_id: str) -> ArtificialAnalysisModel: + """Return a model by Artificial Analysis stable model ID.""" + try: + return self.models[model_id] + except KeyError as exc: + raise KeyError(f"Unknown Artificial Analysis model_id {model_id}") from exc + + +def _model_from_json(data: dict[str, Any]) -> ArtificialAnalysisModel: + """Build normalized model metadata from snapshot JSON.""" + return ArtificialAnalysisModel( + id=data["id"], + name=data["name"], + ) + + +def _models_from_snapshot_json(data: dict[str, Any]) -> dict[str, ArtificialAnalysisModel]: + """Build model metadata from current or legacy snapshot JSON.""" + if "data" in data: + return {model_data["id"]: _model_from_json(model_data) for model_data in data["data"]} + return { + model_id: _model_from_json(model_data) for model_id, model_data in data["models"].items() + } + + +@lru_cache(maxsize=1) +def load_artificial_analysis_snapshot( + path: Path = SNAPSHOT_PATH, +) -> ArtificialAnalysisSnapshot: + """Load the checked-in Artificial Analysis metadata snapshot.""" + data = json.loads(path.read_text()) + return ArtificialAnalysisSnapshot( + models=_models_from_snapshot_json(data), + source=data.get("source", ""), + prompt_options=data.get("prompt_options") or {}, + ) diff --git a/utils/llm/metadata/artificial_analysis_snapshot.json b/utils/llm/metadata/artificial_analysis_snapshot.json new file mode 100644 index 0000000..71b2a1d --- /dev/null +++ b/utils/llm/metadata/artificial_analysis_snapshot.json @@ -0,0 +1,2109 @@ +{ + "data": [ + { + "id": "0081ab31-d10a-44a0-a10d-eee5533fec65", + "name": "GLM-4.5V (Non-reasoning)" + }, + { + "id": "0097ebf5-124f-42f6-9463-33b00e711f03", + "name": "Gemini 3.5 Flash (high)" + }, + { + "id": "00f1248e-78e3-4230-8dc8-5e13ba8645e2", + "name": "MiMo-V2.5-Pro" + }, + { + "id": "0121d27b-5b8a-4901-8684-80589cc6d40d", + "name": "JT-35B-Flash" + }, + { + "id": "016d330a-2141-4afa-b2fc-62b314423dc1", + "name": "Mercury 2" + }, + { + "id": "0179b427-93dc-415c-bb4c-f980ddf8d088", + "name": "Qwen3.5 Omni Plus" + }, + { + "id": "018c60e8-e908-431a-ba57-c840b1df3987", + "name": "Nova 2.0 Omni (medium)" + }, + { + "id": "019e86f6-e66b-42d8-8a50-235a06b53003", + "name": "GPT-5.2 Codex (xhigh)" + }, + { + "id": "021b1b31-d2fc-4653-ab74-c10bd2f41c8e", + "name": "Qwen3.5 122B A10B (Non-reasoning)" + }, + { + "id": "033ade17-d9ec-44e0-b792-b5f1fcd5ab4c", + "name": "Gemini 3.5 Flash (minimal)" + }, + { + "id": "033e4aa9-a556-4224-87b0-341ed1070257", + "name": "Claude 3.5 Haiku" + }, + { + "id": "037dec2f-51e8-4127-a1f1-85155dae7a1d", + "name": "GPT-3.5 Turbo" + }, + { + "id": "0399e614-5d46-484f-9183-e4f32d74e1c6", + "name": "Gemini 2.0 Flash Thinking Experimental (Jan '25)" + }, + { + "id": "04586102-6a28-48f8-a82e-85775d7ed779", + "name": "Qwen2.5 Coder Instruct 32B" + }, + { + "id": "04751bd4-0c5d-416b-a5f2-83727c5bfcda", + "name": "Sonar Reasoning" + }, + { + "id": "04781a0e-40f0-4e2a-a4e5-18e389364a79", + "name": "Gemma 3 270M" + }, + { + "id": "04787c2b-0751-4269-8029-075b727d7aed", + "name": "Grok 4.20 0309 (Reasoning)" + }, + { + "id": "04d023f3-025c-4d78-9571-53edda3eaf2a", + "name": "GPT-5.1 Codex (high)" + }, + { + "id": "05a32e26-e609-4377-951b-8fa23d329926", + "name": "Mistral Medium 3.1" + }, + { + "id": "05e45a36-b5c6-47a1-8adb-9ddc19add5b3", + "name": "GPT-5 nano (minimal)" + }, + { + "id": "073b5329-c4b3-4f1f-8f97-4753aadf4398", + "name": "Gemini 2.5 Pro Preview (May' 25)" + }, + { + "id": "076f2674-bc4b-4925-be59-50832eb8c090", + "name": "o3-mini (high)" + }, + { + "id": "078f4dc8-5350-40a2-a5ea-e8359f795b70", + "name": "o1-preview" + }, + { + "id": "07c35e00-2b12-44c8-91cc-408629cd569e", + "name": "DeepSeek V3.2 Exp (Non-reasoning)" + }, + { + "id": "093883ed-f5fc-443b-8e18-afbfb166699e", + "name": "Qwen3 Coder 480B A35B Instruct" + }, + { + "id": "0985ada8-2ed8-404d-bd8b-7357666ce40f", + "name": "Qwen3.5 2B (Non-reasoning)" + }, + { + "id": "09f43999-b67b-4c1b-b050-44df41ed7e62", + "name": "Devstral 2" + }, + { + "id": "0a603978-03b9-4f47-a273-2f7fd969be85", + "name": "Claude 3.5 Sonnet (Oct '24)" + }, + { + "id": "0a7dda4d-cc9c-4a90-abc1-abb5772c901b", + "name": "DeepSeek V3.1 Terminus (Reasoning)" + }, + { + "id": "0b226b82-1462-4860-bf1a-f8aed7024791", + "name": "Qwen3 Omni 30B A3B Instruct" + }, + { + "id": "0d94dc87-12c8-4d4a-8d99-804ce3f17bc2", + "name": "Nova 2.0 Pro Preview (medium)" + }, + { + "id": "0de67206-4d36-4d10-b8f6-cf37fa747a03", + "name": "Kimi K2.6" + }, + { + "id": "0e34f05c-387e-4968-be15-ccec4a55d8c1", + "name": "DeepSeek R1 (Jan '25)" + }, + { + "id": "0e49fe2d-dd3c-4ae5-b56f-a1c89e14b89e", + "name": "DeepSeek R1 Distill Llama 8B" + }, + { + "id": "0e5f6140-1154-4583-a3e0-8c032a338892", + "name": "Qwen3 0.6B (Non-reasoning)" + }, + { + "id": "0e66bae9-41f1-42fc-9276-ce8cb6f72919", + "name": "Qwen3.5 397B A17B (Reasoning)" + }, + { + "id": "0faadeeb-320c-45cf-9c76-5f8768f342e6", + "name": "LFM2 1.2B" + }, + { + "id": "0fc6308e-fbd2-42d3-a216-06da3c43e34e", + "name": "Kimi K2.6 (Non-reasoning)" + }, + { + "id": "0fec07d5-a9b2-407a-b5f8-5bf10bd86b59", + "name": "Magistral Small 1" + }, + { + "id": "1299b9a8-af50-4742-a58b-24ff7eb48f9f", + "name": "LFM 40B" + }, + { + "id": "12adec16-19fe-4d92-aeff-5ef3eb7e780a", + "name": "MiniMax-M2.5" + }, + { + "id": "12f6a061-0ab3-4c76-b225-49abee253651", + "name": "Mistral Small 4 (Non-reasoning)" + }, + { + "id": "13358187-4584-479c-ab43-5bcdf8f297a4", + "name": "Claude 3.7 Sonnet (Reasoning)" + }, + { + "id": "1479f50b-d37f-4b55-bb8b-4212a15042eb", + "name": "MiMo-V2-Flash (Feb 2026)" + }, + { + "id": "149096f3-57b7-4413-80c2-a2c010a2995a", + "name": "GLM-5-Turbo" + }, + { + "id": "15b56b8e-7b93-4ed9-ac06-75c922e3b86e", + "name": "Ling-1T" + }, + { + "id": "16149b9c-a1e9-4669-a5cb-ff3c00d78f89", + "name": "gpt-oss-20B (low)" + }, + { + "id": "169e47f5-3d4d-4ad4-8f8b-ab46f0c73f67", + "name": "Qwen3.5 27B (Reasoning)" + }, + { + "id": "16c5b637-8bce-4252-81f2-1b87a36a4e4c", + "name": "o3" + }, + { + "id": "16f2578b-1b28-4be3-b371-700c2677bcd6", + "name": "Gemini 1.5 Pro (May '24)" + }, + { + "id": "191a2097-cce3-49cf-881e-0c790892059f", + "name": "Qwen3 4B (Reasoning)" + }, + { + "id": "198b717f-42c8-4ab7-a699-ae9373d669d3", + "name": "DeepSeek V3.1 (Reasoning)" + }, + { + "id": "1a8ba535-df18-459b-ad40-3199191296d7", + "name": "Llama 3.3 Nemotron Super 49B v1 (Reasoning)" + }, + { + "id": "1aa3694e-b656-4dbe-8f84-0c65d8897abb", + "name": "Step 3.5 Flash 2603" + }, + { + "id": "1b05e346-e86a-4a20-8feb-7da8c65a99aa", + "name": "Mistral Large 2 (Jul '24)" + }, + { + "id": "1cb96708-f6c1-4668-9804-5db6ecac01ed", + "name": "DBRX Instruct" + }, + { + "id": "1cf439b8-0cfd-47b2-9de2-9a2157e6762b", + "name": "GLM-4.5 (Reasoning)" + }, + { + "id": "1d0db5a3-3132-4213-a94b-c2e395d08283", + "name": "Qwen2.5 Instruct 32B" + }, + { + "id": "1d81aa1c-64c8-442a-9c41-81b37e407b91", + "name": "Gemini 2.5 Flash-Lite (Non-reasoning)" + }, + { + "id": "1dcea4f7-7e8b-49f8-abe2-5860ff9f349e", + "name": "Hermes 3 - Llama-3.1 70B" + }, + { + "id": "1edc272c-d799-44d6-909a-bf3c1909a3a0", + "name": "Sonar Reasoning Pro" + }, + { + "id": "1f054429-397e-4fdb-9e71-67bc92c1735e", + "name": "GPT-5.5 (xhigh)" + }, + { + "id": "1f05af98-1ec6-4506-a0b8-57a8c9b63878", + "name": "Mistral Medium" + }, + { + "id": "1f6478c9-3e22-4586-adbe-841782859677", + "name": "Nova 2.0 Omni (Non-reasoning)" + }, + { + "id": "1fc32894-1060-493b-af94-62bb1068555e", + "name": "Granite 4.0 Micro" + }, + { + "id": "1fc54cef-d179-48b1-a27d-046874e9b208", + "name": "Claude 3 Haiku" + }, + { + "id": "20da3b31-fc0a-4359-abec-d59367bf1d9f", + "name": "Qwen3.5 9B (Non-reasoning)" + }, + { + "id": "217b34ec-5920-4fc1-8886-6a70a324837d", + "name": "Mistral 7B Instruct" + }, + { + "id": "219ed587-60c5-4a48-9517-8480e08d0ca1", + "name": "Gemini 2.5 Flash (Reasoning)" + }, + { + "id": "222fb320-6e55-4672-846a-b6d5a24a45f4", + "name": "Gemma 3 4B Instruct" + }, + { + "id": "2236df45-0699-40d1-b5cc-69ee345d2257", + "name": "Qwen3.5 122B A10B (Reasoning)" + }, + { + "id": "22d09131-343b-4adf-8760-533e20a2155f", + "name": "MiMo-V2.5" + }, + { + "id": "23149f9b-c904-43e2-9ec4-afa2bf843941", + "name": "Grok 4.1 Fast (Reasoning)" + }, + { + "id": "235060f4-057d-4bd1-8b8e-4a92908c770e", + "name": "Hermes 4 - Llama-3.1 70B (Non-reasoning)" + }, + { + "id": "23b379f7-18df-492a-9fc1-a56c5a5b9cfc", + "name": "NVIDIA Nemotron 3 Nano 30B A3B (Non-reasoning)" + }, + { + "id": "2443ac9e-a3db-423d-accb-8963f6fb0a53", + "name": "Grok 3" + }, + { + "id": "248deb0d-426c-4fa9-86fa-bc60aa9c3719", + "name": "GLM 5V Turbo (Reasoning)" + }, + { + "id": "24ac5b00-5f03-4c47-8e37-522d1195383e", + "name": "Mistral Small 3.1" + }, + { + "id": "2660d74f-ce79-48a8-8b53-6e668e2071a2", + "name": "Claude Opus 4.5 (Reasoning)" + }, + { + "id": "2698f6c6-e436-47ce-a583-dbc25596c571", + "name": "Qwen3 Next 80B A3B Instruct" + }, + { + "id": "26c0b5df-efa7-470f-a65e-2d883329e493", + "name": "Llama Nemotron Super 49B v1.5 (Non-reasoning)" + }, + { + "id": "27202e5f-c82d-4710-92e9-4317877d4883", + "name": "Gemini 2.5 Pro" + }, + { + "id": "272ff333-442f-4169-a804-ac9177bc99d7", + "name": "MiniMax-M2.1" + }, + { + "id": "291a510a-dcc0-40df-8a80-b3aa31900a6c", + "name": "Grok 2 (Dec '24)" + }, + { + "id": "296ace9b-0815-43b2-bafa-fd6cec5cce36", + "name": "MiMo-V2-Omni-0327" + }, + { + "id": "29855680-7469-43eb-8b88-cd3fb1d99da3", + "name": "GPT-5 mini (high)" + }, + { + "id": "2aacdc07-5f4e-4ab9-8ea5-5f7ab93f9eeb", + "name": "Qwen3 4B 2507 (Reasoning)" + }, + { + "id": "2ac96b67-f4f8-4c8c-ac08-c7510faa7bb9", + "name": "Gemini 1.5 Pro (Sep '24)" + }, + { + "id": "2ae624ca-25b4-4cc8-8970-cdfdd3320691", + "name": "Gemini 1.0 Ultra" + }, + { + "id": "2bb84433-f38e-4edc-9b65-4d7b1f473db9", + "name": "Qwen3 1.7B (Non-reasoning)" + }, + { + "id": "2bfdd17a-e027-4068-a54e-b0e90a6df118", + "name": "Gemma 3 27B Instruct" + }, + { + "id": "2c4394a2-b443-470a-908e-5c4a271b780c", + "name": "GLM-4.7-Flash (Reasoning)" + }, + { + "id": "2cd04201-2b6e-47ef-853e-7601f705f2a8", + "name": "Phi-4 Multimodal Instruct" + }, + { + "id": "2cff73da-4855-403c-afc9-5540feadcc15", + "name": "Granite 3.3 8B (Non-reasoning)" + }, + { + "id": "2d19c2d1-062d-436e-b2c2-3d3ecad34acc", + "name": "DeepSeek-V2-Chat" + }, + { + "id": "2d28a13a-096e-475a-beb8-26bbd1c7d51c", + "name": "Qwen3.5 9B (Reasoning)" + }, + { + "id": "2dad8957-4c16-4e74-bf2d-8b21514e0ae9", + "name": "o3-mini" + }, + { + "id": "2dbb6dc7-8c40-4b6d-af9c-cf805f83b79a", + "name": "Grok 4 Fast (Non-reasoning)" + }, + { + "id": "2e40e695-3cec-43da-83f9-615af30b8e91", + "name": "Claude Sonnet 4.6 (Non-reasoning, High Effort)" + }, + { + "id": "2e46d2fd-eb2b-42b9-9fe4-be50630fe870", + "name": "Ling-2.6-1T" + }, + { + "id": "2e6400f5-85ca-4ebc-ba8f-c2811a631138", + "name": "Gemma 3 12B Instruct" + }, + { + "id": "2e8694f9-7782-47a6-a6ba-fdce89d939c8", + "name": "NVIDIA Nemotron Nano 9B V2 (Non-reasoning)" + }, + { + "id": "2e9ff877-fd2c-4ce7-b631-7ca1bdb6d13e", + "name": "Command A+" + }, + { + "id": "2f60d80d-c3d3-4a43-bded-0557898c4618", + "name": "EXAONE 4.0 32B (Non-reasoning)" + }, + { + "id": "2fa8e143-77a8-4d05-bfa8-d3b54634c00f", + "name": "Claude Opus 4.7 (Non-reasoning, High Effort)" + }, + { + "id": "3068def4-7270-4c06-a320-6f6a5623d564", + "name": "GLM-4.5V (Reasoning)" + }, + { + "id": "30c9ba61-d0a1-4794-938e-35865f379d15", + "name": "Qwen3.5 4B (Non-reasoning)" + }, + { + "id": "30ef2a79-e800-4165-9f13-2a338f120db7", + "name": "Qwen3.5 397B A17B (Non-reasoning)" + }, + { + "id": "3373245b-e6dc-4b66-a7b0-3f06f9b7bd46", + "name": "Qwen3 235B A22B 2507 Instruct" + }, + { + "id": "338216fb-62c2-48f1-898a-9166d12fb35e", + "name": "Olmo 3 7B Think" + }, + { + "id": "339a92c1-8a42-417f-8d1f-cdbc605acd9e", + "name": "HyperCLOVA X SEED Think (32B)" + }, + { + "id": "3435db18-9227-45a5-8e79-9546b14b5aaa", + "name": "Sonar Pro" + }, + { + "id": "344c6718-c573-41d4-9556-10287a3fa1fc", + "name": "Nova Micro" + }, + { + "id": "34ef2b5c-3df3-437e-b9c5-81346f8c14a8", + "name": "Sarvam M (Reasoning)" + }, + { + "id": "352f834f-a03c-4117-8a29-c3ccd8a568ce", + "name": "Qwen2.5 Turbo" + }, + { + "id": "3538d399-1b3f-455d-9b13-1d8f9fee26c8", + "name": "EXAONE 4.5 33B (Non-reasoning)" + }, + { + "id": "35d602fc-b8b8-4698-9f4d-f2ce11ca50e4", + "name": "Mistral Small (Feb '24)" + }, + { + "id": "369329e4-629f-425d-975a-e8980aec2965", + "name": "INTELLECT-3" + }, + { + "id": "36f73aaf-d38a-4b56-a2b3-d04d17186910", + "name": "gpt-oss-20B (high)" + }, + { + "id": "385376b1-9815-47dd-83cc-85aac34f247d", + "name": "MiniMax M1 40k" + }, + { + "id": "392063ba-c3b5-47e8-ba67-a7b0b34f6824", + "name": "GPT-5.4 mini (medium)" + }, + { + "id": "39b64e04-7a69-4aa2-9e2e-fe38c24681ec", + "name": "Olmo 3.1 32B Think" + }, + { + "id": "3b156101-b0d7-4438-b350-2d1f1168f40a", + "name": "Qwen3.6 27B (Non-reasoning)" + }, + { + "id": "3b608b70-6434-4baa-99ad-45d499703c67", + "name": "GPT-4.1" + }, + { + "id": "3bc32f13-5afa-4e28-bce1-10e57376686b", + "name": "Command A" + }, + { + "id": "3c5289e5-1c62-434c-bc44-c51c39f640a1", + "name": "LFM2.5-VL-1.6B" + }, + { + "id": "3cf875b8-b6b5-42c0-ad70-617d5be59d00", + "name": "Qwen3 VL 8B Instruct" + }, + { + "id": "3d4e7366-928c-4eff-a8b0-2919c7d334c9", + "name": "JT-MINI" + }, + { + "id": "3d64bf83-232e-427e-8590-26b478bae4a8", + "name": "Ling-mini-2.0" + }, + { + "id": "3de55b83-e02b-412e-8211-315bbebe3e94", + "name": "Gemini 2.0 Flash-Lite (Preview)" + }, + { + "id": "3e6cf518-a1f4-42d3-8fcf-827c9bd8e6d5", + "name": "Qwen3 30B A3B (Reasoning)" + }, + { + "id": "3edcb2ed-6981-4f88-a556-563f7f8f00aa", + "name": "Mixtral 8x7B Instruct" + }, + { + "id": "3ef6db79-1dfa-4780-8c4b-affe2740d9ac", + "name": "Molmo2-8B" + }, + { + "id": "3fd96175-4ef1-434c-8795-f873aec2abc1", + "name": "Mistral Small 4 (Reasoning)" + }, + { + "id": "405e2235-0925-4634-a3c7-fbd5f6394bc0", + "name": "Qwen3.5 0.8B (Non-reasoning)" + }, + { + "id": "40663ad2-b218-471e-bdd4-a1e0c2360e2b", + "name": "GLM-5 (Reasoning)" + }, + { + "id": "4077490a-bbfb-404e-979a-a97a20e3b5de", + "name": "Claude Opus 4.5 (Non-reasoning)" + }, + { + "id": "41f73c27-880c-4f30-8b07-9999ce89a4ae", + "name": "Qwen Chat 72B" + }, + { + "id": "41faf421-118b-465b-b170-d200776580d1", + "name": "Gemini 1.5 Flash (Sep '24)" + }, + { + "id": "43098bd0-77ca-408b-b698-9d60b1d1c3b8", + "name": "GLM-4.6V (Non-reasoning)" + }, + { + "id": "432d6c36-8825-47f3-b4eb-58529cea346b", + "name": "Solar Pro 2 (Preview) (Non-reasoning)" + }, + { + "id": "433e410e-5170-4f03-b92f-7927c220b2fe", + "name": "MiniCPM-V 4.6 1.3B" + }, + { + "id": "4343afb1-c928-44c9-92e2-68fa1195b6f5", + "name": "GPT-4o mini Realtime (Dec '24)" + }, + { + "id": "43573c57-2403-46fb-af4b-a93de9a0c3f5", + "name": "Qwen3 235B A22B (Non-reasoning)" + }, + { + "id": "4386585e-71b4-4a0c-8a63-afb333419cd6", + "name": "Claude Opus 4.6 (Non-reasoning, High Effort)" + }, + { + "id": "43da3718-3d6e-40dd-901a-05664179ff7f", + "name": "Mistral Small 3.2" + }, + { + "id": "43fc5506-c5ed-4dee-9b85-962bf7ae3986", + "name": "DeepSeek V3 (Dec '24)" + }, + { + "id": "441734a9-8901-4850-9bae-b474c370291f", + "name": "Kimi K2" + }, + { + "id": "444cdb1e-bab8-42cd-938c-b2d7a93e2da1", + "name": "DeepSeek R1 Distill Qwen 1.5B" + }, + { + "id": "44b19b51-5367-4ef9-a2ff-2f90b89a0867", + "name": "EXAONE 4.0 32B (Reasoning)" + }, + { + "id": "44db6283-aa82-4799-af4a-679fe0530845", + "name": "Solar Pro 2 (Non-reasoning)" + }, + { + "id": "4559e9f0-8aad-4681-89fb-68cb915e0f16", + "name": "Qwen3 14B (Reasoning)" + }, + { + "id": "45790612-02e3-4c42-b5bd-cd7ed2ea1f2f", + "name": "Sarvam 105B (high)" + }, + { + "id": "45c87531-2d57-48e0-8012-202cd636189e", + "name": "Llama 3.1 Instruct 405B" + }, + { + "id": "466aecdb-3d96-4191-bc52-b3366db38851", + "name": "Llama 3.1 Instruct 70B" + }, + { + "id": "46d8315e-1630-463f-ab62-84185fa0faab", + "name": "Qwen3.5 35B A3B (Reasoning)" + }, + { + "id": "4764d31d-f4af-4297-8bd1-e993f26bcb64", + "name": "MiMo-V2.5-Pro (Non-reasoning)" + }, + { + "id": "47b7df55-5804-40de-ba11-317de786710a", + "name": "Ring-flash-2.0" + }, + { + "id": "48194e0f-8226-4c57-8cb2-2a0fb68a84c9", + "name": "LFM2.5-1.2B-Instruct" + }, + { + "id": "48e50f00-1fd1-4acc-b337-61078aa341e6", + "name": "GPT-5 (high)" + }, + { + "id": "4928e950-7f37-4475-b0dc-c5bad781a321", + "name": "Mistral Large 3" + }, + { + "id": "493f6a1e-7717-4e98-9d6f-548b92c4702d", + "name": "GPT-5.4 (Non-reasoning)" + }, + { + "id": "498862c3-f9ac-49d2-852f-16a02bb0c38f", + "name": "GPT-5.2 (xhigh)" + }, + { + "id": "49e70a38-4ac1-4659-b490-09b2c7ff21d6", + "name": "Apertus 70B Instruct" + }, + { + "id": "49fd01f9-887d-4479-b8ce-771a81ecef4e", + "name": "Grok 4.1 Fast (Non-reasoning)" + }, + { + "id": "4a845d7b-a52d-43bb-80b7-b58c7a0c155e", + "name": "DeepSeek R1 Distill Llama 70B" + }, + { + "id": "4ae6c88d-9e4a-4850-89fe-18a1c04a66cc", + "name": "Qwen3 0.6B (Reasoning)" + }, + { + "id": "4bbceacb-cf47-464b-b60f-e1d1fe016d67", + "name": "MiniMax-M2.7" + }, + { + "id": "4c111fbc-d13a-42b4-858c-1dc17fe3c1d1", + "name": "Grok-1" + }, + { + "id": "4d6dd5ce-08cb-4e87-9288-1dd2f022aa35", + "name": "Doubao Seed Code" + }, + { + "id": "4dc12a38-b18f-4c43-8e1b-678f8434b5b1", + "name": "GPT-5.1 (high)" + }, + { + "id": "5016ea75-7b0e-4737-a7e6-1062c6d90fd4", + "name": "Gemini 3.5 Flash (medium)" + }, + { + "id": "504412c2-2ada-499b-aebf-7e0a35c9d286", + "name": "Claude 4 Opus (Non-reasoning)" + }, + { + "id": "508943e4-e9e1-4d10-9c0e-a7650e9d7315", + "name": "EXAONE 4.5 33B" + }, + { + "id": "509e94e3-f1cb-43fb-98ff-e0e9872cfd1f", + "name": "Qwen3.5 27B (Non-reasoning)" + }, + { + "id": "50f92d5f-f413-4c97-8dab-331101622a28", + "name": "Mistral Large 2 (Nov '24)" + }, + { + "id": "515852e7-ba9c-4571-8cf9-82ad6b45f22f", + "name": "PALM-2" + }, + { + "id": "51d0b717-953d-4b44-af61-406c6b7dff39", + "name": "Qwen3 VL 30B A3B Instruct" + }, + { + "id": "523125f4-a1da-4990-9abd-dd08a069100e", + "name": "Grok 4.20 0309 (Non-reasoning)" + }, + { + "id": "527e943a-adc6-4e69-93af-d1608e1b5fed", + "name": "DeepSeek V3.2 Speciale" + }, + { + "id": "5303601c-8133-4f52-bc4e-5241ee6b3c10", + "name": "Gemma 4 E4B (Non-reasoning)" + }, + { + "id": "538e945c-6c27-4fd3-995d-ded80a36cd10", + "name": "GPT-5.4 (low)" + }, + { + "id": "53c98840-47af-49aa-94e6-469fb17e9a1b", + "name": "Claude Opus 4.6 (Adaptive Reasoning, Max Effort)" + }, + { + "id": "540ebc58-a2d8-4dc9-ba6b-973efa52fab1", + "name": "Jamba 1.5 Mini" + }, + { + "id": "54442579-2a4d-40cc-b264-bc4ff29e311a", + "name": "OLMo 2 32B" + }, + { + "id": "546ec53f-273c-4af7-b13f-b88c41f45905", + "name": "Nova Pro" + }, + { + "id": "54c7f3fc-7078-442a-b472-e8691257a88c", + "name": "Granite 4.0 350M" + }, + { + "id": "55a3ebf6-6117-4cc1-8596-c6de6e552fd4", + "name": "Gemini 2.5 Flash Preview (Non-reasoning)" + }, + { + "id": "573bbd93-114c-4b71-9ede-a73a7d4bdf84", + "name": "Grok 4 Fast (Reasoning)" + }, + { + "id": "575498d6-60ec-466b-9372-fea19911fd07", + "name": "GPT-4o (March 2025, chatgpt-4o-latest)" + }, + { + "id": "5962d643-0a6f-4630-bb08-ab5720d80056", + "name": "Qwen3 1.7B (Reasoning)" + }, + { + "id": "598190f8-dc9c-4fea-a7ea-4b81c402ab18", + "name": "Gemini 3.1 Flash-Lite" + }, + { + "id": "598de97d-029e-47b6-96ec-dbc1e0f9045a", + "name": "Kimi Linear 48B A3B Instruct" + }, + { + "id": "599da8e0-bd9c-4b38-a127-b50e371fbcf8", + "name": "Llama 2 Chat 70B" + }, + { + "id": "59a1bb20-9170-4dc2-ba9c-12d326cf068e", + "name": "Solar Open 100B (Reasoning)" + }, + { + "id": "59b5b14b-5365-4ee7-824a-18a8e6309644", + "name": "GPT-5.3 Codex (xhigh)" + }, + { + "id": "59e22326-1bca-4432-a5fa-147fbe8854e7", + "name": "Mistral Medium 3" + }, + { + "id": "5a088cde-18e2-4dfa-98dd-d283e1c19654", + "name": "LFM2.5-1.2B-Thinking" + }, + { + "id": "5a49ef80-3af5-404b-8ac0-e1b230ae95de", + "name": "K2 Think V2" + }, + { + "id": "5aa1c578-af76-4b91-8699-cdd43582b3af", + "name": "GLM-5.1 (Reasoning)" + }, + { + "id": "5ad2f60f-ee05-49fd-85a0-cef69aa7cb7b", + "name": "o1" + }, + { + "id": "5b2beb12-81a9-47a1-8a2a-d0a727185b50", + "name": "Qwen3 Max (Preview)" + }, + { + "id": "5bb1f426-2d64-4d03-99fb-8041ee85c33b", + "name": "GPT-5.4 Pro (xhigh)" + }, + { + "id": "5c3dd927-48a3-4f3c-8045-9b135f62dfbb", + "name": "Nova Lite" + }, + { + "id": "5c6533f3-75a2-4109-b9a9-3623afc6b86a", + "name": "Seed-OSS-36B-Instruct" + }, + { + "id": "5ce30d25-5353-45bb-bef9-6b87480ba3a2", + "name": "GPT-4.5 (Preview)" + }, + { + "id": "5d11e7a1-4f70-4e5a-9364-e193761d6757", + "name": "GPT-5 Codex (high)" + }, + { + "id": "5d303dc9-c027-401f-9803-4e9aa3331007", + "name": "GLM-4.5-Air" + }, + { + "id": "5d4acc80-7a88-4e84-bfe7-99071b84e6a4", + "name": "Qwen3.5 2B (Reasoning)" + }, + { + "id": "5d8183dc-24f4-46c5-a1d0-d937de149364", + "name": "MiMo-V2-Pro" + }, + { + "id": "5d891609-fe1c-4e8d-b9f0-5b0ff6d9f439", + "name": "Tri-21B-Think" + }, + { + "id": "5da3c0e2-65d2-4bff-a410-cb2132ddafb6", + "name": "DeepSeek V4 Pro (Reasoning, High Effort)" + }, + { + "id": "5dba8d07-9992-483c-81db-dac97cb15ba8", + "name": "Granite 4.0 H Small" + }, + { + "id": "5e0164b3-d902-4bcb-a1b2-83b4f4cd6143", + "name": "Qwen3 30B A3B 2507 (Reasoning)" + }, + { + "id": "5e4e4590-a77e-4b66-95f8-f3960a1a7c68", + "name": "Mistral Large (Feb '24)" + }, + { + "id": "5e8b0d98-a3b4-42b5-93d8-ecb748788754", + "name": "Grok 4.3 (medium)" + }, + { + "id": "5e965af0-ca5c-4f47-9ba9-06000508b84a", + "name": "GPT-5 (medium)" + }, + { + "id": "5ea94a4a-55ac-4ea1-8898-2b3971e94af6", + "name": "Grok 4" + }, + { + "id": "5ec4a0db-da66-4e46-9682-fceeed755ef8", + "name": "Sonar" + }, + { + "id": "5fb47ff6-a30e-4c2c-96f2-55e95a13390f", + "name": "Llama 3.2 Instruct 11B (Vision)" + }, + { + "id": "6000145b-0e3d-4fef-a55f-bcaac84803b2", + "name": "DeepSeek R1 0528 Qwen3 8B" + }, + { + "id": "6000692c-f9a6-47f8-a5c0-e0874ac488bb", + "name": "Qwen3.6 Max Preview" + }, + { + "id": "6056731b-c705-455b-aa0d-43cbf29b1054", + "name": "LongCat Flash Lite" + }, + { + "id": "60cff809-05fb-4439-afe7-8b1476439f49", + "name": "Tri-21B-think Preview" + }, + { + "id": "61bd9367-e520-4ec8-989e-5fbf50e61610", + "name": "Nova 2.0 Lite (high)" + }, + { + "id": "62de31e8-a1a3-429c-b634-a2afccfd9363", + "name": "Gemini 2.5 Pro Preview (Mar' 25)" + }, + { + "id": "63872e9c-3377-4a6b-b477-7bba244c38e9", + "name": "NVIDIA Nemotron 3 Super 120B A12B (Reasoning)" + }, + { + "id": "64312f37-3701-4243-a4c9-7c07a58cd6b9", + "name": "Grok 4.20 0309 v2 (Non-reasoning)" + }, + { + "id": "651ef7ae-9a8f-477e-9c8e-460aa156ba02", + "name": "Qwen3.5 4B (Reasoning)" + }, + { + "id": "660965b2-66d2-49ee-a6b9-79a6ac47d3c0", + "name": "Magistral Medium 1" + }, + { + "id": "66445f84-b2e3-4202-afdc-92ba0f0e5f36", + "name": "Kimi K2 0905" + }, + { + "id": "666eb13f-0d22-4438-8eb0-01876e1a8604", + "name": "Motif-2-12.7B-Reasoning" + }, + { + "id": "66938aab-78fa-49d7-8461-b48b6833e837", + "name": "Qwen3.6 35B A3B (Non-reasoning)" + }, + { + "id": "66f4ce73-9a9b-4b49-9c6e-bedb9bfdc720", + "name": "Ministral 3 3B" + }, + { + "id": "686ab020-ee58-4a70-a9ac-24d675a73506", + "name": "LFM2 8B A1B" + }, + { + "id": "68c89ebf-779c-4445-9241-de964cd17355", + "name": "Gemini 2.5 Flash Preview (Reasoning)" + }, + { + "id": "69534bed-2ffd-4235-832b-e20a810333ab", + "name": "Qwen3.7 Max" + }, + { + "id": "6a5d56e1-bb68-4205-8d9b-26b97888bc84", + "name": "GLM-4.6 (Reasoning)" + }, + { + "id": "6a7c0e25-1dcb-4b15-8495-a8536a9da051", + "name": "GPT-4" + }, + { + "id": "6afbfb62-27e4-435e-9c85-d9fe1b92519e", + "name": "Gemini 2.5 Flash (Non-reasoning)" + }, + { + "id": "6b08a75a-19ee-40b4-be33-133b8ef42f92", + "name": "Llama 2 Chat 13B" + }, + { + "id": "6b79f899-e3c0-45f6-923c-243faccdb2fc", + "name": "GPT-5.5 (medium)" + }, + { + "id": "6ba9e8eb-8124-436d-842f-dbe36df80c27", + "name": "Hermes 4 - Llama-3.1 70B (Reasoning)" + }, + { + "id": "6d9a176d-feb8-4dac-8872-afe32b31897f", + "name": "DeepSeek V3.2 (Non-reasoning)" + }, + { + "id": "6da314d3-a984-4734-8f31-47dd32fb4699", + "name": "Qwen3 VL 32B Instruct" + }, + { + "id": "6dd8ba55-5680-44a9-b309-82928165d5f0", + "name": "GPT-5.2 (Non-reasoning)" + }, + { + "id": "6e1b44ff-c227-496b-aef4-19b70cd18c76", + "name": "Gemma 4 26B A4B (Reasoning)" + }, + { + "id": "6e2a2572-ed8d-4616-8d7d-68d125fc8ee7", + "name": "DeepSeek V4 Pro (Non-reasoning)" + }, + { + "id": "6e6e02fd-9cbd-417f-9bfc-673df89c313d", + "name": "NVIDIA Nemotron Nano 12B v2 VL (Reasoning)" + }, + { + "id": "6f1a7562-6e96-46ac-af4f-6ba5a7a3da96", + "name": "GPT-5.5 (Non-reasoning)" + }, + { + "id": "6f3534b1-1168-472e-b3e3-23ab521504f5", + "name": "Qwen3.5 Omni Flash" + }, + { + "id": "6fc35842-0165-44cf-8570-c484a92b3d8c", + "name": "GLM-4.7 (Reasoning)" + }, + { + "id": "6fd796d3-f346-4f66-97df-5da81714fc73", + "name": "Nova 2.0 Lite (low)" + }, + { + "id": "70152cb0-fb36-4732-a925-89ef40994be1", + "name": "Magistral Small 1.2" + }, + { + "id": "70882ec6-914c-41f0-9754-5e8f75005f77", + "name": "Gemma 4 26B A4B (Non-reasoning)" + }, + { + "id": "713fae11-c75c-4f10-ae2c-8e4074cd58af", + "name": "Ministral 3 14B" + }, + { + "id": "715e05fb-1313-441c-bf1d-8651c752a841", + "name": "K-EXAONE (Reasoning)" + }, + { + "id": "71e8d48c-1920-4f27-8ea9-1f10becc615a", + "name": "Llama 3.2 Instruct 3B" + }, + { + "id": "71f51ea9-94fe-4635-a80d-4cfffbb685f4", + "name": "Gemini 2.5 Flash-Lite Preview (Sep '25) (Non-reasoning)" + }, + { + "id": "72c358fd-7d45-4d68-89aa-699743710924", + "name": "GPT-4.1 nano" + }, + { + "id": "7393c56a-ec31-48e9-b804-c04f2d2cb641", + "name": "Llama 3.1 Nemotron Instruct 70B" + }, + { + "id": "739684ba-0f63-4e2a-b4ee-30741c9e9320", + "name": "Llama 3.1 Instruct 8B" + }, + { + "id": "739e531a-eb0a-478f-bb67-5845b79ce65d", + "name": "LFM2 2.6B" + }, + { + "id": "755d7281-2ed8-48c7-808f-709ec4cbfb71", + "name": "Gemma 4 E2B (Non-reasoning)" + }, + { + "id": "75e1c197-f239-4361-a9d6-66dccfead236", + "name": "DeepSeek V3 0324" + }, + { + "id": "76361085-f5dc-49ec-b069-fe56ca885933", + "name": "Command-R+ (Apr '24)" + }, + { + "id": "764dd095-6ebd-4760-8eb3-cbc40964db2e", + "name": "Sarvam 30B (high)" + }, + { + "id": "7656c62b-5345-435b-bf12-b6ce2ca0d58d", + "name": "Qwen2 Instruct 72B" + }, + { + "id": "76aa6af5-fdc6-4739-a300-983f14e74a67", + "name": "GPT-4 Turbo" + }, + { + "id": "76bce7fb-3a3f-4b66-a78d-35ccf3edf5d2", + "name": "Nova 2.0 Lite (Non-reasoning)" + }, + { + "id": "76dcf6ef-39ea-4be0-b693-b88da25b4caf", + "name": "NVIDIA Nemotron 3 Nano 30B A3B (Reasoning)" + }, + { + "id": "7764d514-694f-444c-8d60-bdc6e24e223f", + "name": "DeepSeek LLM 67B Chat (V1)" + }, + { + "id": "7829427f-f0e3-4f6d-a228-5fbf70dacc02", + "name": "Claude 3 Opus" + }, + { + "id": "783a0ea2-1eef-422a-8c3d-f6d40d943f54", + "name": "Gemini 3 Flash Preview (Non-reasoning)" + }, + { + "id": "78452c64-7303-4192-bca5-2d9ec5c623d4", + "name": "Jamba 1.7 Large" + }, + { + "id": "7a7b52f6-fdef-4dae-9203-58c710ccc81d", + "name": "Mistral Saba" + }, + { + "id": "7ae943a9-9310-4472-a834-c61f0ab68485", + "name": "Qwen3 Max" + }, + { + "id": "7b269763-ecc0-41ef-aa29-47ef632ac065", + "name": "Gemini 2.0 Flash-Lite (Feb '25)" + }, + { + "id": "7c045ca0-b331-488d-af31-df0fd331dfd1", + "name": "Mistral Small (Sep '24)" + }, + { + "id": "7c73c3be-7f51-4d14-bec8-d5789488df25", + "name": "Gemini 3 Flash Preview (Reasoning)" + }, + { + "id": "7d73161f-002f-4c9c-b4f8-6c4d91f2ba8e", + "name": "Nanbeige4.1-3B" + }, + { + "id": "7ec1065a-c90e-41e4-bd17-abb7042eed76", + "name": "Qwen3 30B A3B 2507 Instruct" + }, + { + "id": "7f3c9423-3ee3-4369-a6d9-3f2a40aff00e", + "name": "GPT-5 (low)" + }, + { + "id": "806032ff-6252-4c22-ba99-a126e411b7a4", + "name": "Qwen3 Max Thinking" + }, + { + "id": "80f7860a-7665-4658-9f05-15bccf5f832f", + "name": "Llama 3.2 Instruct 1B" + }, + { + "id": "81444bc8-72f9-4a2d-ad43-27e3f0d2f461", + "name": "Tiny Aya Global" + }, + { + "id": "81b6ddfc-111e-4422-bd44-42ee6165b699", + "name": "GLM-4.7 (Non-reasoning)" + }, + { + "id": "8273650d-40e5-45ee-aeda-df71de784164", + "name": "Exaone 4.0 1.2B (Non-reasoning)" + }, + { + "id": "82879bb8-89fb-4adc-b519-315b8ef30b77", + "name": "Llama 3 Instruct 8B" + }, + { + "id": "82b207dd-d285-4a52-b2fc-2cbd27543899", + "name": "Hermes 4 - Llama-3.1 405B (Reasoning)" + }, + { + "id": "82b36b4d-84dd-4bc0-ad32-e3aee9442789", + "name": "MiMo-V2-Flash (Non-reasoning)" + }, + { + "id": "82ed9bd2-c97b-4c35-9312-94bb72001e36", + "name": "Granite 4.1 8B" + }, + { + "id": "83cb898e-05d9-4e4b-9de3-2d305014d923", + "name": "Claude Instant" + }, + { + "id": "84922739-425f-46e1-87ac-bb4268dcacbb", + "name": "Gemini 2.5 Flash-Lite Preview (Sep '25) (Reasoning)" + }, + { + "id": "84b49308-6b93-47aa-a4f6-776ee1a1e8cd", + "name": "o4-mini (high)" + }, + { + "id": "84e3f11e-d659-4941-8988-1dbfabbaf538", + "name": "GPT-5.2 (medium)" + }, + { + "id": "864da2a5-156c-45fd-873c-8923be91914f", + "name": "Magistral Medium 1.2" + }, + { + "id": "8665ca00-c687-44c7-875c-22618cb31c4f", + "name": "Qwen3.6 35B A3B (Reasoning)" + }, + { + "id": "877fdfc9-2026-477a-af96-e4fd602c0131", + "name": "Gemini 2.5 Flash Preview (Sep '25) (Non-reasoning)" + }, + { + "id": "8823351e-8232-4c9c-8a1d-cd2c1d2c1196", + "name": "QwQ 32B-Preview" + }, + { + "id": "882a5da3-94ca-4602-8693-c45970df17e2", + "name": "Ling-flash-2.0" + }, + { + "id": "8869f28a-a6ff-487f-8d32-93fe335fdda5", + "name": "GPT-5.4 nano (medium)" + }, + { + "id": "891bcdf2-8dd2-4dc3-829b-d963fde25876", + "name": "Apertus 8B Instruct" + }, + { + "id": "89a2c945-1fab-4ee4-9f45-83a9f46cb221", + "name": "DeepSeek V4 Flash (Reasoning, High Effort)" + }, + { + "id": "8a24865b-90d9-4e2b-a2fd-6851c2e9d627", + "name": "DeepSeek-V2.5" + }, + { + "id": "8a4a5ead-7789-4389-8400-30e9d20370b7", + "name": "Claude 4 Opus (Reasoning)" + }, + { + "id": "8b1a70d1-e05f-426b-9122-023d4629ab47", + "name": "GPT-3.5 Turbo (0613)" + }, + { + "id": "8c1be908-67b6-4cf4-ba08-83ddbe44fde3", + "name": "GPT-4o (Aug '24)" + }, + { + "id": "8c29d66d-bf98-4ea3-8572-5409353ecc66", + "name": "Qwen3.6 27B (Reasoning)" + }, + { + "id": "8c748e53-61ae-48b8-af8d-eb8298b1e9db", + "name": "Nemotron 3 Nano Omni 30B A3B Reasoning" + }, + { + "id": "8ca48626-ff5e-48b3-8401-38081376d706", + "name": "Hy3-preview (Reasoning)" + }, + { + "id": "8ddacd41-bf43-411b-aa30-43ebf0567dd8", + "name": "Gemini 1.0 Pro" + }, + { + "id": "8e78cf7a-5b76-4beb-beba-b99c6233b208", + "name": "Solar Pro 2 (Preview) (Reasoning)" + }, + { + "id": "8eb02396-f231-4189-ae15-05f7facebd9b", + "name": "GPT-5 nano (medium)" + }, + { + "id": "8f0a75d6-8d00-4c2e-bcd4-8e88a570a93c", + "name": "DeepSeek-Coder-V2" + }, + { + "id": "8f74a6ed-f82f-4a2f-a96b-4914993e47da", + "name": "Olmo 3 7B Instruct" + }, + { + "id": "90c2a9cf-ad7e-4332-9be2-2fd1309833e2", + "name": "Grok 4.3 (Non-reasoning)" + }, + { + "id": "90e078f2-051b-4c63-8919-76618971cb3f", + "name": "Claude 4.5 Sonnet (Reasoning)" + }, + { + "id": "91cb6144-4937-4e4e-aeda-b4341d355c10", + "name": "Claude 4.5 Sonnet (Non-reasoning)" + }, + { + "id": "91e3b45f-3f52-4511-8c15-8948854bebc5", + "name": "Granite 4.1 3B" + }, + { + "id": "91f3a4c8-b000-4513-942c-bfe283375c35", + "name": "Ling 2.6 Flash" + }, + { + "id": "922c69c7-9037-43c6-8bcf-a1c555e7f3eb", + "name": "Llama 4 Maverick" + }, + { + "id": "92b19c88-fa87-4595-957e-fe9aa5fa5ad4", + "name": "Jamba 1.6 Large" + }, + { + "id": "92f245a7-43b4-4ffd-8bfb-866746bf824d", + "name": "GLM-5.1 (Non-reasoning)" + }, + { + "id": "94229066-9381-4ee1-bf70-a16d63756a6e", + "name": "Gemini 1.5 Flash-8B" + }, + { + "id": "946e7aab-db1c-4c3f-b0b3-7720d0cff187", + "name": "GLM-4.6 (Non-reasoning)" + }, + { + "id": "948892b5-db03-4118-a4a8-ccd51ed871ea", + "name": "Grok 4.3 (high)" + }, + { + "id": "94a6d26e-a903-47f3-8323-ae422d237bb9", + "name": "Olmo 3.1 32B Instruct" + }, + { + "id": "94d09368-9035-47cf-963a-b4310b433a16", + "name": "MiMo-V2-Omni" + }, + { + "id": "9741f3c2-cbb1-4a3f-99ee-7bd7384d9038", + "name": "Ministral 3 8B" + }, + { + "id": "976cc8ad-7904-4056-83c5-960181f47d5f", + "name": "Llama 3.3 Instruct 70B" + }, + { + "id": "9815da7d-70f4-44d6-b539-9ffef0faa152", + "name": "Nemotron Cascade 2 30B A3B" + }, + { + "id": "98e3230e-cee1-4c19-b9f8-b6b6a826ca93", + "name": "R1 1776" + }, + { + "id": "9ca246a7-cf13-42c9-9182-5b5ad6b79026", + "name": "MiniMax M1 80k" + }, + { + "id": "9ca71ac4-41c8-42c0-87dd-5704a9e5b94d", + "name": "Llama 3.2 Instruct 90B (Vision)" + }, + { + "id": "9cc377dc-67ae-4042-bafc-0466b5f05089", + "name": "K-EXAONE (Non-reasoning)" + }, + { + "id": "9dba61f5-78ee-4190-8d1d-8e7063ffd386", + "name": "Qwen3 8B (Reasoning)" + }, + { + "id": "9e141c0d-fc82-4e07-bb2e-fe0003bc030b", + "name": "Claude 2.1" + }, + { + "id": "9eae4ec4-61b8-48bc-9843-3edd506ae933", + "name": "Devstral Small (Jul '25)" + }, + { + "id": "9ee13921-62a4-425a-a22b-df3302198d93", + "name": "NVIDIA Nemotron 3 Nano 4B" + }, + { + "id": "9f7c7566-a704-49a2-a383-cb3181da33a4", + "name": "GPT-4.1 mini" + }, + { + "id": "9f873c2f-2c2d-4ccb-9e1b-71bf61b052be", + "name": "Phi-4 Mini Instruct" + }, + { + "id": "a04f5b78-f397-4fd8-a2b1-00dcab50324c", + "name": "Grok Beta" + }, + { + "id": "a06bd3fc-86db-4a8e-ae6d-7459444d08c9", + "name": "Grok Code Fast 1" + }, + { + "id": "a20ae33a-46e1-41e6-81a0-fe8b00d2e538", + "name": "Nova 2.0 Pro Preview (Non-reasoning)" + }, + { + "id": "a29e66d6-1c3c-456a-8770-59ee3845b35d", + "name": "Ring-1T" + }, + { + "id": "a2c8e7b2-57bf-4d1e-96ea-7944d786d94d", + "name": "Granite 4.0 1B" + }, + { + "id": "a38d719a-709c-4983-b3e7-7090389ae9a6", + "name": "K2-V2 (high)" + }, + { + "id": "a5092ece-d5a7-461f-b036-3faef262423f", + "name": "Gemini 3 Deep Think" + }, + { + "id": "a518a64b-e337-48f3-85a1-ba7dc0e8f961", + "name": "ERNIE 5.0 Thinking Preview" + }, + { + "id": "a550ffca-f89e-4381-ade6-a85dc6a1fb4c", + "name": "Kimi K2.5 (Reasoning)" + }, + { + "id": "a6340098-d7ae-462d-b372-0a0a67fc44b4", + "name": "Claude 4.5 Haiku (Reasoning)" + }, + { + "id": "a68afa0b-7fe2-4e9d-bf3e-741cce3c6aeb", + "name": "Granite 4.0 H 350M" + }, + { + "id": "a6ea7ec0-0aca-4442-98fb-4296c6d18b31", + "name": "Qwen3.5 0.8B (Reasoning)" + }, + { + "id": "a71c1a35-ccc8-43f0-a5a2-070a690b9a00", + "name": "Apriel-v1.6-15B-Thinker" + }, + { + "id": "a7564055-f8ba-4c4b-9e2d-060f61263645", + "name": "Claude 4 Sonnet (Reasoning)" + }, + { + "id": "a797eaf3-6d75-4f29-86a8-e1243ce52d43", + "name": "Gemma 3n E4B Instruct" + }, + { + "id": "a803d3d0-d22e-49a0-ac2c-b9c6f1141065", + "name": "Qwen3 VL 235B A22B (Reasoning)" + }, + { + "id": "a83f84b3-473a-4276-9ae1-8909da723159", + "name": "DeepSeek R1 0528 (May '25)" + }, + { + "id": "a89c4b28-2d8c-456e-88ea-255fb51fd2b6", + "name": "GPT-5.4 (xhigh)" + }, + { + "id": "a8c67863-9d66-44dd-8d27-f58654ecde03", + "name": "Gemma 3n E2B Instruct" + }, + { + "id": "a8efb564-9d17-4d7f-8f43-e9110657ce21", + "name": "DeepHermes 3 - Mistral 24B Preview (Non-reasoning)" + }, + { + "id": "a971b0c0-4c0f-484a-b018-e36b5be3409e", + "name": "Reka Flash (Sep '24)" + }, + { + "id": "aa83359a-d804-4f0b-b5bf-dc637711c26f", + "name": "Llama 3 Instruct 70B" + }, + { + "id": "ab7f016c-a29b-4710-bdf6-6a5cd96aacca", + "name": "NVIDIA Nemotron Nano 12B v2 VL (Non-reasoning)" + }, + { + "id": "aba82268-2bb7-4a0f-80be-9b7722e2145b", + "name": "Devstral Medium" + }, + { + "id": "abe9f0c7-f4f6-430d-ba42-f45afdd4841b", + "name": "Command-R (Mar '24)" + }, + { + "id": "ac1031bc-c53e-4af7-9c6e-2005e0ff44fa", + "name": "Apriel-v1.5-15B-Thinker" + }, + { + "id": "ac48c49d-9e77-4394-ac4e-d1ee51fd5fee", + "name": "Gemini 1.5 Flash (May '24)" + }, + { + "id": "aca9c1ad-fc86-49f3-a312-b1e517ea100c", + "name": "Claude 3.5 Sonnet (June '24)" + }, + { + "id": "acad0665-9457-4531-abd5-b59efd7a89ea", + "name": "Step3 VL 10B" + }, + { + "id": "ad173c8d-f14f-4230-90e9-60979b7720e7", + "name": "DeepSeek V4 Flash (Non-reasoning)" + }, + { + "id": "adf9a85e-abc3-4f28-937b-db6655cc5238", + "name": "Llama 4 Scout" + }, + { + "id": "ae447455-940d-4d30-9139-a664fa896eaf", + "name": "GPT-5.4 mini (Non-Reasoning)" + }, + { + "id": "ae4fe623-80ab-4ea3-8921-70a18ea0fc7e", + "name": "ERNIE 4.5 300B A47B" + }, + { + "id": "af134350-8ba3-4629-b56b-00bd6dcf60c4", + "name": "DeepSeek V3.2 Exp (Reasoning)" + }, + { + "id": "af74f222-05c7-422d-a653-b9c0707c9c72", + "name": "Molmo 7B-D" + }, + { + "id": "b00ecd62-a53f-4aed-b833-3e9d6b0170ba", + "name": "Qwen3 32B (Reasoning)" + }, + { + "id": "b0249961-b8b2-479d-8325-a29ea17c7b89", + "name": "Qwen3 4B 2507 Instruct" + }, + { + "id": "b07aef0a-b192-46a1-b1c9-40b06d1b9061", + "name": "Gemma 4 E2B (Reasoning)" + }, + { + "id": "b13c1257-d746-4027-8fc8-4892dc14701c", + "name": "GPT-5.5 (high)" + }, + { + "id": "b1fa84f8-1ed3-4124-b403-4655dafa4267", + "name": "Llama 3.1 Nemotron Nano 4B v1.1 (Reasoning)" + }, + { + "id": "b26ff709-1773-4595-ae44-78e0a5bac29c", + "name": "DeepSeek R1 Distill Qwen 14B" + }, + { + "id": "b2dd592a-fbc5-458a-b26d-f3964cbab82f", + "name": "Qwen3 8B (Non-reasoning)" + }, + { + "id": "b2e68f0a-8f66-4e4c-9821-2b786cea601b", + "name": "Claude 3 Sonnet" + }, + { + "id": "b2f3191f-77d6-4155-8be6-330f0baa1ae5", + "name": "Gemini 3 Pro Preview (low)" + }, + { + "id": "b36ff8f3-0323-49d1-a063-ab09704fdb0c", + "name": "Nova 2.0 Omni (low)" + }, + { + "id": "b3735511-c6ff-4928-8d72-2181444a4eb3", + "name": "Arctic Instruct" + }, + { + "id": "b4784397-aa28-411b-b011-9c4331bfa9c8", + "name": "GPT-4o (ChatGPT)" + }, + { + "id": "b4ddb4c8-1400-44ab-8c2b-e2472088e7ff", + "name": "Jamba 1.5 Large" + }, + { + "id": "b4f14013-37dd-4c75-bd8a-378365d9ed77", + "name": "Jamba 1.7 Mini" + }, + { + "id": "b4f7d7a4-869a-4ee7-b17a-4046cd1e79fd", + "name": "Qwen2.5 Instruct 72B" + }, + { + "id": "b515503d-4d65-4a3f-8a4a-6c731e2b079f", + "name": "o1-mini" + }, + { + "id": "b58b8272-cd3f-44b9-9b68-612f40779ce2", + "name": "Hy3-preview (Non-reasoning)" + }, + { + "id": "b5c1c91a-7474-4409-9a9c-9c2ac45d9eb6", + "name": "GPT-4o mini" + }, + { + "id": "b6d2e43d-3082-43f5-9318-0f4dbcb54163", + "name": "Trinity Large Thinking" + }, + { + "id": "b7726745-9c77-40c3-8452-974cb53d6fbc", + "name": "DeepHermes 3 - Llama-3.1 8B Preview (Non-reasoning)" + }, + { + "id": "b89c4faf-219e-4171-a1aa-e3bd2fd0a924", + "name": "Solar Pro 3" + }, + { + "id": "b97ef678-2d31-4375-9416-67ea97f87204", + "name": "Qwen3 Omni 30B A3B (Reasoning)" + }, + { + "id": "b9dc72c6-7bea-4936-a55a-4b0c835fc755", + "name": "Qwen2.5 Max" + }, + { + "id": "ba04694d-326a-4a6a-8f1b-46316f872a7f", + "name": "Kimi K2.5 (Non-reasoning)" + }, + { + "id": "ba242e40-83b7-4cd3-a0e0-b56237984914", + "name": "GPT-5.4 mini (xhigh)" + }, + { + "id": "bbd93ebe-80da-4594-bb19-61e69d0331df", + "name": "Gemini 3.1 Pro Preview" + }, + { + "id": "bbe6d782-e630-48d5-b11c-3ce37f373f1e", + "name": "Qwen3 235B A22B (Reasoning)" + }, + { + "id": "bc26bfdb-4923-4442-a6ca-e77392923581", + "name": "GPT-5 mini (minimal)" + }, + { + "id": "bc4579d2-9c46-46c3-ace0-454039bf21bb", + "name": "DeepSeek-V2.5 (Dec '24)" + }, + { + "id": "bcca0e70-7e80-4c07-b1fa-b33bcfb19e51", + "name": "Gemini 2.0 Flash (Feb '25)" + }, + { + "id": "bd2c3517-00d8-4ba5-a989-1f1e52f3ffab", + "name": "Gemma 3n E4B Instruct Preview (May '25)" + }, + { + "id": "bddebfd3-0a8d-47f5-b722-bc4c2ca5a5dc", + "name": "Kimi K2 Thinking" + }, + { + "id": "be185709-ddb4-4268-9597-856464359b25", + "name": "MiMo-V2-Flash (Reasoning)" + }, + { + "id": "bf220674-68bd-43cc-a1b8-ce5ed4d2f18d", + "name": "DeepSeek V4 Pro (Reasoning, Max Effort)" + }, + { + "id": "bf60740e-6aa5-422f-ba49-ef6e9d171205", + "name": "Qwen3 32B (Non-reasoning)" + }, + { + "id": "c1045dc0-4fd3-4adb-9548-18763e0d051f", + "name": "GPT-4o (Nov '24)" + }, + { + "id": "c298d1a8-606c-4971-8613-ccdaaf941043", + "name": "GPT-5.4 nano (Non-Reasoning)" + }, + { + "id": "c2b1e769-7aee-4669-8076-73918bdebf6c", + "name": "Claude 4.5 Haiku (Non-reasoning)" + }, + { + "id": "c3274a19-6d3c-4d01-ab9b-5055a0a40429", + "name": "GPT-5 mini (medium)" + }, + { + "id": "c3738fb0-3408-4430-a699-760ae4b70c93", + "name": "GPT-5 (minimal)" + }, + { + "id": "c43aa1f9-31bd-4a99-be70-84c5e6bd2e75", + "name": "Qwen Chat 14B" + }, + { + "id": "c4c3b42f-e0f0-48ca-b6f9-b296e7697806", + "name": "Llama 3.3 Nemotron Super 49B v1 (Non-reasoning)" + }, + { + "id": "c6a47d8a-7517-46e2-8383-329fe7241725", + "name": "Gemini 2.0 Pro Experimental (Feb '25)" + }, + { + "id": "c72cb85a-18a4-4235-b455-77dff2f16c50", + "name": "Grok 4.20 0309 v2 (Reasoning)" + }, + { + "id": "c7327e6e-b27f-4b1b-859d-159a34e0ba1c", + "name": "DeepSeek Coder V2 Lite Instruct" + }, + { + "id": "c7667559-d9b6-43f1-8cd8-8bdbc78d190b", + "name": "Gemini 2.5 Flash Preview (Sep '25) (Reasoning)" + }, + { + "id": "c76e0ae8-0fd2-45c0-a39d-d398fce9b128", + "name": "Falcon-H1R-7B" + }, + { + "id": "c77cfe51-f4a0-4692-9dee-5061ef667f23", + "name": "GPT-5.5 (low)" + }, + { + "id": "c8158c23-6fff-4c31-911d-954c32d80c28", + "name": "Step 3.5 Flash" + }, + { + "id": "c8673741-5e1a-46a1-9e4f-710a5c920982", + "name": "GLM-4.7-Flash (Non-reasoning)" + }, + { + "id": "c8a3fa87-735e-49a9-afb1-270c5e9f53f7", + "name": "Olmo 3 32B Think" + }, + { + "id": "c8a79180-7d16-4474-8701-9a77c0baa56a", + "name": "Qwen3 Next 80B A3B (Reasoning)" + }, + { + "id": "c99f3bde-7c08-4de8-bd5c-8ee9123ebffa", + "name": "gpt-oss-120b (low)" + }, + { + "id": "ca04852c-eaae-4881-a208-f9b2ca3b7cd6", + "name": "o3-pro" + }, + { + "id": "ca6c1412-f3c1-4391-9231-f83a702aa7af", + "name": "Mixtral 8x22B Instruct" + }, + { + "id": "cbac8c35-e069-4c73-823e-0953e6ed0e85", + "name": "Qwen3 Max Thinking (Preview)" + }, + { + "id": "cc1fa238-1a76-486d-a997-22309275eadd", + "name": "Devstral Small (May '25)" + }, + { + "id": "ccbfa8c3-a762-480b-aade-34fb9697f98c", + "name": "Claude 4.1 Opus (Reasoning)" + }, + { + "id": "cd26a386-4873-46ff-b853-d239050025a2", + "name": "Gemma 4 31B (Reasoning)" + }, + { + "id": "ce3d286e-093d-413d-a81a-0270309f039e", + "name": "Qwen3 VL 30B A3B (Reasoning)" + }, + { + "id": "ce819310-af7c-49d3-9a02-6845111e1788", + "name": "Devstral Small 2" + }, + { + "id": "ceb4d610-d0a4-48c1-bea0-80ed76f1e5ca", + "name": "QwQ 32B" + }, + { + "id": "cf095603-72b6-47f8-8ee1-09a42890f92a", + "name": "Llama 3.1 Nemotron Ultra 253B v1 (Reasoning)" + }, + { + "id": "d034dafe-463d-4c50-956f-84fca657b26f", + "name": "Claude 4 Sonnet (Non-reasoning)" + }, + { + "id": "d0aa27aa-4705-4184-9a1d-483b78c9331c", + "name": "Gemma 4 31B (Non-reasoning)" + }, + { + "id": "d0b3d47e-aec6-425e-9de7-168dcc6d1e28", + "name": "GPT-5.1 (Non-reasoning)" + }, + { + "id": "d1122eff-ee85-4fdc-8a9f-23bee6590667", + "name": "Gemini 3 Pro Preview (high)" + }, + { + "id": "d1720545-d0a8-4c15-a53e-ef5ca99ac7ea", + "name": "Gemma 3 1B Instruct" + }, + { + "id": "d1768b3a-0a21-4e08-b3f6-56a9ab6cfbf3", + "name": "Qwen1.5 Chat 110B" + }, + { + "id": "d2d7dd95-770f-4cb0-9bbc-d275ac19c265", + "name": "GLM-4.6V (Reasoning)" + }, + { + "id": "d306cccd-0085-4b2f-8aa0-ffcdbb434695", + "name": "MiniCPM5-1B (Non-reasoning)" + }, + { + "id": "d370fcbf-c4a1-41a2-abc4-d204fcc3fcbf", + "name": "Qwen3 VL 32B (Reasoning)" + }, + { + "id": "d3968fd3-97d8-4693-8d26-19cefc6f5d5f", + "name": "Hermes 4 - Llama-3.1 405B (Non-reasoning)" + }, + { + "id": "d4be6393-8915-436c-a3a8-4e59bd5c89a9", + "name": "Granite 4.1 30B" + }, + { + "id": "d4fc3f33-f2b0-4da1-88ee-f1f82bd4de31", + "name": "GPT-5.4 nano (xhigh)" + }, + { + "id": "d58cf573-1bd3-4d1f-9182-5482a460f570", + "name": "Qwen3 VL 235B A22B Instruct" + }, + { + "id": "d621247c-d47e-458c-82cb-a166bc3b37e5", + "name": "DeepSeek V3.2 (Reasoning)" + }, + { + "id": "d734e2ce-5cf8-467f-8148-586d02671333", + "name": "Exaone 4.0 1.2B (Reasoning)" + }, + { + "id": "d80eb0f1-f62e-4d31-99d2-7a925eb126b0", + "name": "Gemma 4 E4B (Reasoning)" + }, + { + "id": "d8ddb241-b3e4-4c25-a6a3-72eb1b30c541", + "name": "DeepSeek V4 Flash (Reasoning, Max Effort)" + }, + { + "id": "d925845d-39ad-4de3-8495-f176b79828c0", + "name": "Claude 3.7 Sonnet (Non-reasoning)" + }, + { + "id": "d97713f2-afa6-4f8d-b2f3-ac89a24c4d6c", + "name": "Solar Mini" + }, + { + "id": "da9fe224-8af3-46d7-a8c4-6220779c3f35", + "name": "Qwen3 Coder 30B A3B Instruct" + }, + { + "id": "dae31abc-0587-44d0-ba53-f78e96b6e486", + "name": "K2-V2 (low)" + }, + { + "id": "dafbb6d2-4825-43d1-a927-feedcfd2e998", + "name": "Granite 4.0 H 1B" + }, + { + "id": "dbbcf240-a69e-4078-8f16-7c94c7a8c514", + "name": "Reka Flash 3" + }, + { + "id": "dd059b25-d82a-4ead-82a4-4adceaaec48b", + "name": "Mistral Medium 3.5" + }, + { + "id": "dd738be7-2b69-4775-91a5-8851d3341c2d", + "name": "K2-V2 (medium)" + }, + { + "id": "ddc748d0-6a9b-466b-8d6c-68417980d56d", + "name": "Claude 2.0" + }, + { + "id": "de0beaf0-c951-487a-8eb4-3dd12e74122c", + "name": "Pixtral Large" + }, + { + "id": "dec8073c-57e2-41c0-b1aa-7a62960f103f", + "name": "Qwen3 VL 8B (Reasoning)" + }, + { + "id": "ded8d96e-835f-4359-947a-a4c3bb78e983", + "name": "Phi-3 Mini Instruct 3.8B" + }, + { + "id": "df4c5a29-4b5c-4fef-9f7f-5e24e118ab65", + "name": "Mi:dm K 2.5 Pro" + }, + { + "id": "df8d14e0-3997-4e4d-b4ad-9c047acc9c69", + "name": "Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)" + }, + { + "id": "df95f83f-5ebb-466a-9d2d-b95efc8c012c", + "name": "DeepSeek R1 Distill Qwen 32B" + }, + { + "id": "dfb9292d-bc7c-4425-a260-4256217e709f", + "name": "DeepSeek V3.1 Terminus (Non-reasoning)" + }, + { + "id": "dfeeb904-e784-4d5c-ad66-9400146b150b", + "name": "Gemini 2.0 Flash Thinking Experimental (Dec '24)" + }, + { + "id": "e0099b99-d368-4562-b0de-4016ea58af54", + "name": "Mi:dm K 2.5 Pro Preview" + }, + { + "id": "e18e5e6a-5a31-4c0b-b80b-ac401392f446", + "name": "GPT-5 nano (high)" + }, + { + "id": "e1cfa926-9e2b-4a0d-8c31-48366a5041c5", + "name": "Llama Nemotron Super 49B v1.5 (Reasoning)" + }, + { + "id": "e2e9ddc3-8c2d-4bf5-a60a-83a1afe61034", + "name": "GPT-4o Realtime (Dec '24)" + }, + { + "id": "e3396f8f-7994-4df5-bdab-43745681ef0a", + "name": "GPT-5.5 Pro (xhigh)" + }, + { + "id": "e410e854-104d-4b35-a171-899ff9d974bb", + "name": "Qwen2.5 Coder Instruct 7B " + }, + { + "id": "e46198a7-cd29-4afd-933d-cdf180f0f305", + "name": "Qwen3 4B (Non-reasoning)" + }, + { + "id": "e58bbffd-fdc2-412a-b6d7-ca0e3f5d611a", + "name": "Nova Premier" + }, + { + "id": "e5dd499f-c330-45ec-9ff0-a99209c82af7", + "name": "Mistral Small 3" + }, + { + "id": "e8d4100e-165b-4c5d-ac11-ac553590a334", + "name": "o1-pro" + }, + { + "id": "e8ffd75b-766f-4551-8c52-6e54706220eb", + "name": "Jamba 1.6 Mini" + }, + { + "id": "e98e911e-9fb2-4a9a-826e-3d681d0cdca8", + "name": "GPT-4o (May '24)" + }, + { + "id": "e9a09db3-8fd6-41dd-ba2f-20e0a2bff7f2", + "name": "Claude Opus 4.7 (Adaptive Reasoning, Max Effort)" + }, + { + "id": "ea5d2c10-1051-437d-95c2-18d5e4d14ff3", + "name": "Cogito v2.1 (Reasoning)" + }, + { + "id": "eab1492c-b853-4852-aa71-06b0ec2481c1", + "name": "GPT-5 (ChatGPT)" + }, + { + "id": "eb4ba465-3fcd-4065-9fe2-e8225e7b2c6c", + "name": "Ring-2.6-1T" + }, + { + "id": "eb689f7a-f210-4a87-b407-f249897f2764", + "name": "Solar Pro 2 (Reasoning)" + }, + { + "id": "ebf3b39f-0be6-43a1-a37a-f9b2978c9916", + "name": "Muse Spark" + }, + { + "id": "ec3b22e6-48ac-416a-b4ae-55565a4f3046", + "name": "Grok 3 Reasoning Beta" + }, + { + "id": "ec60e57e-76d3-42e3-a0e3-80662225a639", + "name": "OpenChat 3.5 (1210)" + }, + { + "id": "ecc6524a-d521-458a-8327-5009e8ce6549", + "name": "Qwen3 14B (Non-reasoning)" + }, + { + "id": "ee708f92-374e-4123-b900-e22d7b2afc19", + "name": "Phi-4" + }, + { + "id": "eebfef01-709e-4ffe-b72f-0db75ef2434b", + "name": "Qwen3.5 35B A3B (Non-reasoning)" + }, + { + "id": "f0083258-8646-45b8-8082-7aaf6c2ea82a", + "name": "gpt-oss-120b (high)" + }, + { + "id": "f164b41f-44c5-4675-bca3-fea1db4bd9ae", + "name": "GLM-5 (Non-reasoning)" + }, + { + "id": "f1d52583-9d20-4099-99ac-b5df9430c3b6", + "name": "NVIDIA Nemotron Nano 9B V2 (Reasoning)" + }, + { + "id": "f2e21112-192e-4aed-ae82-68ca3b38e667", + "name": "Claude Sonnet 4.6 (Non-reasoning, Low Effort)" + }, + { + "id": "f2f60e3a-e5f5-4471-acd2-9f2f29c76007", + "name": "Claude 4.1 Opus (Non-reasoning)" + }, + { + "id": "f3169f25-8c6f-48e4-ae87-0cf872dc0ec1", + "name": "Qwen3 30B A3B (Non-reasoning)" + }, + { + "id": "f371ad68-6947-4767-a78f-1f6c81f96b93", + "name": "Qwen3.6 Plus" + }, + { + "id": "f4274721-ef28-4121-aa88-8e97267a5a82", + "name": "Nova 2.0 Pro Preview (low)" + }, + { + "id": "f4e8194a-d0e6-48eb-92be-4307de5aeeec", + "name": "Gemini 2.5 Flash-Lite (Reasoning)" + }, + { + "id": "f5d83128-047f-496d-ba49-8a428abe8345", + "name": "Qwen3 VL 4B Instruct" + }, + { + "id": "f6ccbe1d-bd7e-484b-9795-18cc9f91552d", + "name": "Qwen3 235B A22B 2507 (Reasoning)" + }, + { + "id": "f73f4711-9c61-40a7-a258-b71c14727f53", + "name": "Llama 3.1 Tulu3 405B" + }, + { + "id": "f74ea286-cd29-4eb4-af14-1389b19c21e5", + "name": "MiniMax-M2" + }, + { + "id": "f78138d6-2e04-4a84-919a-20d177cb6ff1", + "name": "Jamba Reasoning 3B" + }, + { + "id": "f818a7bb-6f23-4b24-8d52-6b9c1a5ca628", + "name": "OLMo 2 7B" + }, + { + "id": "f93d0750-b659-4ceb-a123-7e657904ef2b", + "name": "Qwen3 VL 4B (Reasoning)" + }, + { + "id": "fb112343-c82c-4b43-afea-996bd5101d62", + "name": "KAT-Coder-Pro V1" + }, + { + "id": "fb65266f-5a7d-403c-85d5-ccdf0d1ca838", + "name": "DeepSeek V3.1 (Non-reasoning)" + }, + { + "id": "fbc58677-e324-4b45-a979-7fd8eec555cd", + "name": "LFM2 24B A2B" + }, + { + "id": "fbdf8da1-b341-448c-b3cb-8aff1d8f70b9", + "name": "Nova 2.0 Lite (medium)" + }, + { + "id": "fc4223e8-4586-4ca1-97ca-bb55ff586947", + "name": "KAT Coder Pro V2" + }, + { + "id": "fc92f822-04b7-420d-9c07-a21af5e9aac7", + "name": "Qwen3 Coder Next" + }, + { + "id": "fd4454ff-e703-46c0-a7f5-fa69af09486d", + "name": "GPT-5.1 Codex mini (high)" + }, + { + "id": "fddb72bd-60d3-41af-acc5-3df9a290eb8e", + "name": "Gemini 2.0 Flash (experimental)" + }, + { + "id": "fe11ab6c-a4dd-4c28-9fef-07da76d5ed14", + "name": "Llama 65B" + }, + { + "id": "fe2c2289-d261-4433-8681-46448372c1f6", + "name": "Grok 4.3 (low)" + }, + { + "id": "ff9bc5e5-a02f-4270-983e-4b3f834f3363", + "name": "Grok 3 mini Reasoning (high)" + }, + { + "id": "ffd65ef7-fbdb-4145-98ae-b5d01cda770b", + "name": "Llama 2 Chat 7B" + } + ], + "prompt_options": { + "parallel_queries": 1, + "prompt_length": 1000 + }, + "source": "https://artificialanalysis.ai/api/v2/data/llms/models" +} diff --git a/utils/llm/metadata/models_dev.py b/utils/llm/metadata/models_dev.py new file mode 100644 index 0000000..275c849 --- /dev/null +++ b/utils/llm/metadata/models_dev.py @@ -0,0 +1,89 @@ +"""Loader for the checked-in Models.dev metadata snapshot.""" + +import json +from dataclasses import dataclass +from datetime import date +from functools import lru_cache +from pathlib import Path +from typing import Any + +SNAPSHOT_PATH = Path(__file__).with_name("models_dev_snapshot.json") + + +@dataclass(frozen=True, slots=True) +class ModelsDevModel: + """Normalized metadata for one Models.dev model.""" + + id: str + name: str + release_date: date | None + raw: dict[str, Any] + + +@dataclass(frozen=True, slots=True) +class ModelsDevProvider: + """Normalized metadata for one Models.dev provider.""" + + id: str + name: str + models: dict[str, ModelsDevModel] + + +@dataclass(frozen=True, slots=True) +class ModelsDevSnapshot: + """Loaded Models.dev metadata indexed by provider and model ID.""" + + providers: dict[str, ModelsDevProvider] + + def get_model(self, *, provider_id: str, model_id: str) -> ModelsDevModel: + """Return a model by Models.dev provider and model IDs.""" + try: + provider = self.providers[provider_id] + except KeyError as exc: + raise KeyError(f"Unknown Models.dev provider_id {provider_id}") from exc + try: + return provider.models[model_id] + except KeyError as exc: + raise KeyError( + f"Unknown Models.dev model_id {model_id} for provider_id {provider_id}" + ) from exc + + +def _parse_date(value: str | None) -> date | None: + """Parse an ISO date value from the snapshot.""" + if value is None: + return None + if len(value) != len("YYYY-MM-DD"): + return None + try: + return date.fromisoformat(value) + except ValueError: + return None + + +def _model_from_json(data: dict[str, Any]) -> ModelsDevModel: + """Build normalized model metadata from snapshot JSON.""" + return ModelsDevModel( + id=data["id"], + name=data["name"], + release_date=_parse_date(data.get("release_date")), + raw=data, + ) + + +@lru_cache(maxsize=1) +def load_models_dev_snapshot(path: Path = SNAPSHOT_PATH) -> ModelsDevSnapshot: + """Load the checked-in Models.dev metadata snapshot.""" + data = json.loads(path.read_text()) + providers = {} + for provider_id, provider_data in data["providers"].items(): + models = { + model_id: _model_from_json(model_data) + for model_id, model_data in provider_data["models"].items() + } + providers[provider_id] = ModelsDevProvider( + id=provider_data["id"], + name=provider_data["name"], + models=models, + ) + return ModelsDevSnapshot(providers=providers) diff --git a/utils/llm/metadata/models_dev_snapshot.json b/utils/llm/metadata/models_dev_snapshot.json new file mode 100644 index 0000000..f490257 --- /dev/null +++ b/utils/llm/metadata/models_dev_snapshot.json @@ -0,0 +1,355 @@ +{ + "providers": { + "anthropic": { + "id": "anthropic", + "models": { + "claude-3-5-sonnet-20240620": { + "id": "claude-3-5-sonnet-20240620", + "name": "Claude Sonnet 3.5", + "release_date": "2024-06-20" + }, + "claude-3-5-sonnet-20241022": { + "id": "claude-3-5-sonnet-20241022", + "name": "Claude Sonnet 3.5 v2", + "release_date": "2024-10-22" + }, + "claude-3-7-sonnet-20250219": { + "id": "claude-3-7-sonnet-20250219", + "name": "Claude Sonnet 3.7", + "release_date": "2025-02-19" + }, + "claude-3-haiku-20240307": { + "id": "claude-3-haiku-20240307", + "name": "Claude Haiku 3", + "release_date": "2024-03-13" + }, + "claude-3-opus-20240229": { + "id": "claude-3-opus-20240229", + "name": "Claude Opus 3", + "release_date": "2024-02-29" + }, + "claude-haiku-4-5-20251001": { + "id": "claude-haiku-4-5-20251001", + "name": "Claude Haiku 4.5", + "release_date": "2025-10-15" + }, + "claude-opus-4-1-20250805": { + "id": "claude-opus-4-1-20250805", + "name": "Claude Opus 4.1", + "release_date": "2025-08-05" + }, + "claude-opus-4-20250514": { + "id": "claude-opus-4-20250514", + "name": "Claude Opus 4", + "release_date": "2025-05-22" + }, + "claude-opus-4-5-20251101": { + "id": "claude-opus-4-5-20251101", + "name": "Claude Opus 4.5", + "release_date": "2025-11-01" + }, + "claude-opus-4-6": { + "id": "claude-opus-4-6", + "name": "Claude Opus 4.6", + "release_date": "2026-02-05" + }, + "claude-opus-4-7": { + "id": "claude-opus-4-7", + "name": "Claude Opus 4.7", + "release_date": "2026-04-16" + }, + "claude-opus-4-8": { + "id": "claude-opus-4-8", + "name": "Claude Opus 4.8", + "release_date": "2026-05-28" + }, + "claude-sonnet-4-20250514": { + "id": "claude-sonnet-4-20250514", + "name": "Claude Sonnet 4", + "release_date": "2025-05-22" + }, + "claude-sonnet-4-5-20250929": { + "id": "claude-sonnet-4-5-20250929", + "name": "Claude Sonnet 4.5", + "release_date": "2025-09-29" + }, + "claude-sonnet-4-6": { + "id": "claude-sonnet-4-6", + "name": "Claude Sonnet 4.6", + "release_date": "2026-02-17" + } + }, + "name": "Anthropic" + }, + "deepseek": { + "id": "deepseek", + "models": { + "deepseek-v4-pro": { + "id": "deepseek-v4-pro", + "name": "DeepSeek V4 Pro", + "release_date": "2026-04-24" + } + }, + "name": "DeepSeek" + }, + "google": { + "id": "google", + "models": { + "gemini-2.5-flash": { + "id": "gemini-2.5-flash", + "name": "Gemini 2.5 Flash", + "release_date": "2025-03-20" + }, + "gemini-2.5-pro": { + "id": "gemini-2.5-pro", + "name": "Gemini 2.5 Pro", + "release_date": "2025-03-20" + }, + "gemini-3-flash-preview": { + "id": "gemini-3-flash-preview", + "name": "Gemini 3 Flash Preview", + "release_date": "2025-12-17" + }, + "gemini-3-pro-preview": { + "id": "gemini-3-pro-preview", + "name": "Gemini 3 Pro Preview", + "release_date": "2025-11-18" + }, + "gemini-3.1-flash-lite": { + "id": "gemini-3.1-flash-lite", + "name": "Gemini 3.1 Flash Lite", + "release_date": "2026-05-07" + }, + "gemini-3.1-flash-lite-preview": { + "id": "gemini-3.1-flash-lite-preview", + "name": "Gemini 3.1 Flash Lite Preview", + "release_date": "2026-03-03" + }, + "gemini-3.1-pro-preview": { + "id": "gemini-3.1-pro-preview", + "name": "Gemini 3.1 Pro Preview", + "release_date": "2026-02-19" + }, + "gemini-3.5-flash": { + "id": "gemini-3.5-flash", + "name": "Gemini 3.5 Flash", + "release_date": "2026-05-19" + }, + "gemma-4-31b-it": { + "id": "gemma-4-31b-it", + "name": "Gemma 4 31B IT", + "release_date": "2026-04-02" + } + }, + "name": "Google" + }, + "minimax": { + "id": "minimax", + "models": { + "MiniMax-M2.5": { + "id": "MiniMax-M2.5", + "name": "MiniMax-M2.5", + "release_date": "2026-02-12" + }, + "MiniMax-M2.7": { + "id": "MiniMax-M2.7", + "name": "MiniMax-M2.7", + "release_date": "2026-03-18" + } + }, + "name": "MiniMax (minimax.io)" + }, + "mistral": { + "id": "mistral", + "models": { + "mistral-large-2411": { + "id": "mistral-large-2411", + "name": "Mistral Large 2.1", + "release_date": "2024-11-01" + } + }, + "name": "Mistral" + }, + "moonshotai": { + "id": "moonshotai", + "models": { + "kimi-k2-thinking": { + "id": "kimi-k2-thinking", + "name": "Kimi K2 Thinking", + "release_date": "2025-11-06" + }, + "kimi-k2.6": { + "id": "kimi-k2.6", + "name": "Kimi K2.6", + "release_date": "2026-04-21" + } + }, + "name": "Moonshot AI" + }, + "openai": { + "id": "openai", + "models": { + "gpt-4.1": { + "id": "gpt-4.1", + "name": "GPT-4.1", + "release_date": "2025-04-14" + }, + "gpt-4o": { + "id": "gpt-4o", + "name": "GPT-4o", + "release_date": "2024-05-13" + }, + "gpt-4o-2024-05-13": { + "id": "gpt-4o-2024-05-13", + "name": "GPT-4o (2024-05-13)", + "release_date": "2024-05-13" + }, + "gpt-4o-2024-11-20": { + "id": "gpt-4o-2024-11-20", + "name": "GPT-4o (2024-11-20)", + "release_date": "2024-11-20" + }, + "gpt-4o-mini": { + "id": "gpt-4o-mini", + "name": "GPT-4o mini", + "release_date": "2024-07-18" + }, + "gpt-5": { + "id": "gpt-5", + "name": "GPT-5", + "release_date": "2025-08-07" + }, + "gpt-5-mini": { + "id": "gpt-5-mini", + "name": "GPT-5 Mini", + "release_date": "2025-08-07" + }, + "gpt-5-nano": { + "id": "gpt-5-nano", + "name": "GPT-5 Nano", + "release_date": "2025-08-07" + }, + "gpt-5.1": { + "id": "gpt-5.1", + "name": "GPT-5.1", + "release_date": "2025-11-13" + }, + "gpt-5.2": { + "id": "gpt-5.2", + "name": "GPT-5.2", + "release_date": "2025-12-11" + }, + "gpt-5.4": { + "id": "gpt-5.4", + "name": "GPT-5.4", + "release_date": "2026-03-05" + }, + "gpt-5.4-mini": { + "id": "gpt-5.4-mini", + "name": "GPT-5.4 mini", + "release_date": "2026-03-17" + }, + "gpt-5.4-nano": { + "id": "gpt-5.4-nano", + "name": "GPT-5.4 nano", + "release_date": "2026-03-17" + }, + "gpt-5.5": { + "id": "gpt-5.5", + "name": "GPT-5.5", + "release_date": "2026-04-23" + }, + "o3": { + "id": "o3", + "name": "o3", + "release_date": "2025-04-16" + }, + "o4-mini": { + "id": "o4-mini", + "name": "o4-mini", + "release_date": "2025-04-16" + } + }, + "name": "OpenAI" + }, + "togetherai": { + "id": "togetherai", + "models": { + "deepseek-ai/DeepSeek-R1": { + "id": "deepseek-ai/DeepSeek-R1", + "name": "DeepSeek R1", + "release_date": "2024-12-26" + }, + "deepseek-ai/DeepSeek-V3": { + "id": "deepseek-ai/DeepSeek-V3", + "name": "DeepSeek V3", + "release_date": "2025-01-20" + }, + "deepseek-ai/DeepSeek-V3-1": { + "id": "deepseek-ai/DeepSeek-V3-1", + "name": "DeepSeek V3.1", + "release_date": "2025-08-21" + }, + "meta-llama/Llama-3.3-70B-Instruct-Turbo": { + "id": "meta-llama/Llama-3.3-70B-Instruct-Turbo", + "name": "Llama 3.3 70B", + "release_date": "2024-12-06" + }, + "moonshotai/Kimi-K2.5": { + "id": "moonshotai/Kimi-K2.5", + "name": "Kimi K2.5", + "release_date": "2026-01-27" + } + }, + "name": "Together AI" + }, + "xai": { + "id": "xai", + "models": { + "grok-4.20-0309-non-reasoning": { + "id": "grok-4.20-0309-non-reasoning", + "name": "Grok 4.20 (Non-Reasoning)", + "release_date": "2026-03-09" + }, + "grok-4.20-0309-reasoning": { + "id": "grok-4.20-0309-reasoning", + "name": "Grok 4.20 (Reasoning)", + "release_date": "2026-03-09" + }, + "grok-4.3": { + "id": "grok-4.3", + "name": "Grok 4.3", + "release_date": "2026-04-17" + } + }, + "name": "xAI" + }, + "zai": { + "id": "zai", + "models": { + "glm-4.6": { + "id": "glm-4.6", + "name": "GLM-4.6", + "release_date": "2025-09-30" + }, + "glm-4.7": { + "id": "glm-4.7", + "name": "GLM-4.7", + "release_date": "2025-12-22" + }, + "glm-5": { + "id": "glm-5", + "name": "GLM-5", + "release_date": "2026-02-11" + }, + "glm-5.1": { + "id": "glm-5.1", + "name": "GLM-5.1", + "release_date": "2026-03-27" + } + }, + "name": "Z.AI" + } + }, + "source": "https://models.dev/api.json" +} diff --git a/utils/llm/model_registry.py b/utils/llm/model_registry.py index e28725b..5734e47 100644 --- a/utils/llm/model_registry.py +++ b/utils/llm/model_registry.py @@ -1,8 +1,53 @@ -"""Central model registry for LLM providers.""" - -from __future__ import annotations - +"""Central model registry for LLM providers. + +Adding a base model: + +1. Look up the model in Models.dev. If present, copy its exact `provider_id` & `model_id` into + `ModelsDevReference`; the checked-in snapshot is only a generated subset, not the catalog. + In Models.dev source paths, `provider_id` is the folder under `providers/`, and + `model_id` is the TOML filename stem under `models/`, e.g. + `providers/anthropic/models/claude-opus-4-8.toml` -> `anthropic` / `claude-opus-4-8`. + +2. Add the model to the provider-specific list below with the provider helper. `model_key` is our + stable key; set `provider_model_id` only when the provider API ID differs. For routed + providers like Together, set `lab_key`. + + Example: + ``` + openai_model( + model_key="gpt-5.5-2026-04-23", + models_dev_reference=ModelsDevReference( + provider_id="openai", + model_id="gpt-5.5", + ), + ) + ``` + +3. If Models.dev is missing the model or a full release date, set `manual_release_date`. + It can also be a fallback alongside `ModelsDevReference` when Models.dev has the + entry but not the date. + +4. Insert the entry where `(release_date, model_key)` stays ascending within that provider + list. Use `active=False` only for historical routes that should stay registered but + leave `ACTIVE_MODEL_RUNS`. + +5. Add benchmark call configs in `model_runs.py` with explicit `model_run_key` values. + +After changing `ModelsDevReference` values, refresh the Models.dev snapshot from the repo's root +directory: +``` +python - <<'PY' +from scripts.refresh_models_dev_metadata import write_models_dev_snapshot + +write_models_dev_snapshot() +PY +``` +Incorrect exact references fail with nearby Models.dev suggestions. +""" + +from collections.abc import Sequence from dataclasses import dataclass +from datetime import date from functools import lru_cache from typing import Any, Final, Type @@ -17,6 +62,7 @@ XAI_API_KEY_SECRET_NAME, ) from .lab_registry import LABS, Lab +from .metadata.models_dev import ModelsDevModel, load_models_dev_snapshot from .provider_registry import PROVIDERS, Provider from .providers.anthropic import AnthropicProvider from .providers.base import BaseLLMProvider @@ -37,10 +83,6 @@ PROVIDERS["Together"]: TogetherProvider, } -_PROVIDER_CLASS_TO_PROVIDER: dict[Type[BaseLLMProvider], Provider] = { - provider_cls: provider for provider, provider_cls in _PROVIDER_TO_CLASS.items() -} - # Mapping from provider classes to GCP secret names _PROVIDER_CLASS_TO_SECRET_NAME: dict[Type[BaseLLMProvider], str] = { OpenAIProvider: OPENAI_API_KEY_SECRET_NAME, @@ -51,16 +93,80 @@ } +@dataclass(frozen=True, slots=True) +class ModelsDevReference: + """Reference to an underlying model entry in Models.dev.""" + + provider_id: str + model_id: str + + @dataclass(frozen=True, slots=True) class Model: - """Registered LLM model metadata.""" + """Canonical LLM model metadata.""" - id: str - full_name: str - token_limit: int - provider_cls: Type[BaseLLMProvider] + model_key: str + provider_model_id: str lab: Lab - reasoning_model: bool = False + provider: Provider + models_dev_reference: ModelsDevReference | None = None + manual_release_date: date | None = None + active: bool = True + + def __post_init__(self) -> None: + """Validate the model declaration against configured metadata.""" + if self.models_dev_reference is None: + if self.manual_release_date is None: + raise ValueError(f"Model {self.model_key} is missing release date source") + return + + try: + metadata = self.models_dev_metadata + except KeyError as exc: + reference = self.models_dev_reference + raise ValueError( + f"Model {self.model_key} has invalid Models.dev reference " + f"{reference.provider_id}/{reference.model_id}: {exc}" + ) from exc + + if metadata.release_date is not None: + return + if self.manual_release_date is None: + raise ValueError(f"Model {self.model_key} is missing release date metadata") + + @property + def models_dev_provider_id(self) -> str | None: + """Return the Models.dev provider ID for compatibility and debugging.""" + if self.models_dev_reference is None: + return None + return self.models_dev_reference.provider_id + + @property + def models_dev_model_id(self) -> str | None: + """Return the Models.dev model ID for compatibility and debugging.""" + if self.models_dev_reference is None: + return None + return self.models_dev_reference.model_id + + @property + def models_dev_metadata(self) -> ModelsDevModel | None: + """Return Models.dev metadata for this model when a lookup is configured.""" + if self.models_dev_reference is None: + return None + return load_models_dev_snapshot().get_model( + provider_id=self.models_dev_reference.provider_id, + model_id=self.models_dev_reference.model_id, + ) + + @property + def release_date(self) -> date: + """Return this model's release date from Models.dev or a manual fallback.""" + metadata = self.models_dev_metadata + if metadata is not None and metadata.release_date is not None: + return metadata.release_date + if self.manual_release_date is not None: + return self.manual_release_date + raise ValueError(f"Model {self.model_key} is missing release date metadata") def get_response( self, @@ -68,15 +174,137 @@ def get_response( options: dict[str, Any] | None = None, ) -> str: """Request a response from the model's provider.""" - provider = _PROVIDER_CLASS_TO_PROVIDER[self.provider_cls] return get_response( - provider, - self.full_name, + self.provider, + self.provider_model_id, prompt=prompt, options=options, ) +def provider_model( + *, + model_key: str, + lab_key: str, + provider_key: str, + provider_model_id: str | None = None, + models_dev_reference: ModelsDevReference | None = None, + manual_release_date: date | None = None, + active: bool = True, +) -> Model: + """Create a model declaration for a provider route.""" + return Model( + model_key=model_key, + provider_model_id=provider_model_id or model_key, + lab=LABS[lab_key], + provider=PROVIDERS[provider_key], + models_dev_reference=models_dev_reference, + manual_release_date=manual_release_date, + active=active, + ) + + +def openai_model( + *, + model_key: str, + provider_model_id: str | None = None, + models_dev_reference: ModelsDevReference | None = None, + manual_release_date: date | None = None, + active: bool = True, +) -> Model: + """Create an OpenAI model declaration.""" + return provider_model( + model_key=model_key, + provider_model_id=provider_model_id, + lab_key="OpenAI", + provider_key="OpenAI", + models_dev_reference=models_dev_reference, + manual_release_date=manual_release_date, + active=active, + ) + + +def anthropic_model( + *, + model_key: str, + provider_model_id: str | None = None, + models_dev_reference: ModelsDevReference | None = None, + manual_release_date: date | None = None, + active: bool = True, +) -> Model: + """Create an Anthropic model declaration.""" + return provider_model( + model_key=model_key, + provider_model_id=provider_model_id, + lab_key="Anthropic", + provider_key="Anthropic", + models_dev_reference=models_dev_reference, + manual_release_date=manual_release_date, + active=active, + ) + + +def xai_model( + *, + model_key: str, + provider_model_id: str | None = None, + models_dev_reference: ModelsDevReference | None = None, + manual_release_date: date | None = None, + active: bool = True, +) -> Model: + """Create an xAI model declaration.""" + return provider_model( + model_key=model_key, + provider_model_id=provider_model_id, + lab_key="xAI", + provider_key="xAI", + models_dev_reference=models_dev_reference, + manual_release_date=manual_release_date, + active=active, + ) + + +def google_model( + *, + model_key: str, + provider_model_id: str | None = None, + models_dev_reference: ModelsDevReference | None = None, + manual_release_date: date | None = None, + active: bool = True, +) -> Model: + """Create a Google model declaration.""" + return provider_model( + model_key=model_key, + provider_model_id=provider_model_id, + lab_key="Google DeepMind", + provider_key="Google", + models_dev_reference=models_dev_reference, + manual_release_date=manual_release_date, + active=active, + ) + + +def together_model( + *, + model_key: str, + lab_key: str, + provider_model_id: str | None = None, + models_dev_reference: ModelsDevReference | None = None, + manual_release_date: date | None = None, + active: bool = True, +) -> Model: + """Create a Together-routed model declaration.""" + return provider_model( + model_key=model_key, + provider_model_id=provider_model_id, + lab_key=lab_key, + provider_key="Together", + models_dev_reference=models_dev_reference, + manual_release_date=manual_release_date, + active=active, + ) + + def _get_api_key_for_provider(provider_cls: Type[BaseLLMProvider]) -> str | None: """Look up API key for a provider from the registry configuration. @@ -135,7 +363,7 @@ def configure_api_keys( try: api_key = get_secret(secret_name) _PROVIDER_API_KEYS[provider_cls] = api_key - except (RuntimeError, exceptions.NotFound): + except RuntimeError, exceptions.NotFound: # GCP not configured or secret doesn't exist, skip this provider pass @@ -204,204 +432,542 @@ def validate_provider_keys(providers: list[Provider]) -> None: ) -MODELS: Final[list[Model]] = [ - Model( - id="gpt-4.1-mini", - full_name="gpt-4.1-mini", - token_limit=128_000, - provider_cls=OpenAIProvider, - lab=LABS["OpenAI"], - ), - Model( - id="gpt-4o-mini", - full_name="gpt-4o-mini", - token_limit=128_000, - provider_cls=OpenAIProvider, - lab=LABS["OpenAI"], - ), - Model( - id="gpt-5-2025-08-07", - full_name="gpt-5-2025-08-07", - token_limit=128_000, - provider_cls=OpenAIProvider, - lab=LABS["OpenAI"], - reasoning_model=True, - ), - Model( - id="gpt-5-mini-2025-08-07", - full_name="gpt-5-mini-2025-08-07", - token_limit=128_000, - provider_cls=OpenAIProvider, - lab=LABS["OpenAI"], - reasoning_model=True, - ), - Model( - id="gpt-5-nano-2025-08-07", - full_name="gpt-5-nano-2025-08-07", - token_limit=128_000, - provider_cls=OpenAIProvider, - lab=LABS["OpenAI"], - reasoning_model=True, - ), - Model( - id="gpt-5.1-2025-11-13", - full_name="gpt-5.1-2025-11-13", - token_limit=128_000, - provider_cls=OpenAIProvider, - lab=LABS["OpenAI"], - reasoning_model=True, - ), - Model( - id="gpt-5.2-2025-12-11", - full_name="gpt-5.2-2025-12-11", - token_limit=128_000, - provider_cls=OpenAIProvider, - lab=LABS["OpenAI"], - reasoning_model=True, - ), - Model( - id="o3-2025-04-16", - full_name="o3-2025-04-16", - token_limit=200_000, - provider_cls=OpenAIProvider, - lab=LABS["OpenAI"], - reasoning_model=True, - ), - Model( - id="gpt-4.1-2025-04-14", - full_name="gpt-4.1-2025-04-14", - token_limit=128_000, - provider_cls=OpenAIProvider, - lab=LABS["OpenAI"], - ), - Model( - id="DeepSeek-V3.1", - full_name="deepseek-ai/DeepSeek-V3.1", - token_limit=128_000, - provider_cls=TogetherProvider, - lab=LABS["DeepSeek"], - ), - Model( - id="Qwen3-235B-A22B-Thinking-2507", - full_name="Qwen/Qwen3-235B-A22B-Thinking-2507", - token_limit=262_144, - provider_cls=TogetherProvider, - lab=LABS["Qwen"], - ), - Model( - id="GLM-4.5-Air-FP8", - full_name="zai-org/GLM-4.5-Air-FP8", - token_limit=131_072, - provider_cls=TogetherProvider, - lab=LABS["Z.ai"], - ), - Model( - id="GLM-4.6", - full_name="zai-org/GLM-4.6", - token_limit=202_752, - provider_cls=TogetherProvider, - lab=LABS["Z.ai"], - reasoning_model=False, - ), - Model( - id="claude-sonnet-4-5-20250929", - full_name="claude-sonnet-4-5-20250929", - token_limit=200_000, - provider_cls=AnthropicProvider, - lab=LABS["Anthropic"], - ), - Model( - id="claude-haiku-4-5-20251001", - full_name="claude-haiku-4-5-20251001", - token_limit=200_000, - provider_cls=AnthropicProvider, - lab=LABS["Anthropic"], - ), - Model( - id="claude-opus-4-1-20250805", - full_name="claude-opus-4-1-20250805", - token_limit=200_000, - provider_cls=AnthropicProvider, - lab=LABS["Anthropic"], - ), - Model( - id="claude-opus-4-5-20251101", - full_name="claude-opus-4-5-20251101", - token_limit=200_000, - provider_cls=AnthropicProvider, - lab=LABS["Anthropic"], - ), - Model( - id="claude-sonnet-4-6", - full_name="claude-sonnet-4-6", - token_limit=200_000, - provider_cls=AnthropicProvider, - lab=LABS["Anthropic"], - ), - Model( - id="claude-sonnet-4-20250514", - full_name="claude-sonnet-4-20250514", - token_limit=200_000, - provider_cls=AnthropicProvider, - lab=LABS["Anthropic"], - ), - Model( - id="grok-4-fast-reasoning", - full_name="grok-4-fast-reasoning", - token_limit=2_000_000, - provider_cls=XAIProvider, - lab=LABS["xAI"], - ), - Model( - id="grok-4-fast-non-reasoning", - full_name="grok-4-fast-non-reasoning", - token_limit=2_000_000, - provider_cls=XAIProvider, - lab=LABS["xAI"], - ), - Model( - id="grok-4-0709", - full_name="grok-4-0709", - token_limit=256_000, - provider_cls=XAIProvider, - lab=LABS["xAI"], - ), - Model( - id="grok-4-1-fast-reasoning", - full_name="grok-4-1-fast-reasoning", - token_limit=2_000_000, - provider_cls=XAIProvider, - lab=LABS["xAI"], - reasoning_model=True, - ), - Model( - id="grok-4-1-fast-non-reasoning", - full_name="grok-4-1-fast-non-reasoning", - token_limit=2_000_000, - provider_cls=XAIProvider, - lab=LABS["xAI"], - reasoning_model=False, - ), - Model( - id="gemini-2.5-pro", - full_name="gemini-2.5-pro", - token_limit=1_048_576, - provider_cls=GoogleProvider, - lab=LABS["Google DeepMind"], - ), - Model( - id="gemini-2.5-flash", - full_name="models/gemini-2.5-flash", - token_limit=1_048_576, - provider_cls=GoogleProvider, - lab=LABS["Google DeepMind"], - ), - Model( - id="gemini-3-pro-preview", - full_name="gemini-3-pro-preview", - token_limit=1_048_576, - provider_cls=GoogleProvider, - lab=LABS["Google DeepMind"], - reasoning_model=False, +# OpenAI models: https://developers.openai.com/api/docs/models +OPENAI_MODELS: Final[list[Model]] = [ + openai_model( + model_key="gpt-4-0613", + manual_release_date=date(2023, 6, 13), + ), + openai_model( + model_key="gpt-3.5-turbo-0125", + manual_release_date=date(2024, 1, 25), + ), + openai_model( + model_key="gpt-4-turbo-2024-04-09", + manual_release_date=date(2024, 4, 9), + ), + openai_model( + model_key="gpt-4o", + models_dev_reference=ModelsDevReference(provider_id="openai", model_id="gpt-4o"), + ), + openai_model( + model_key="gpt-4o-2024-05-13", + models_dev_reference=ModelsDevReference(provider_id="openai", model_id="gpt-4o-2024-05-13"), + ), + openai_model( + model_key="gpt-4o-mini-2024-07-18", + models_dev_reference=ModelsDevReference(provider_id="openai", model_id="gpt-4o-mini"), + ), + openai_model( + model_key="gpt-4o-2024-11-20", + models_dev_reference=ModelsDevReference(provider_id="openai", model_id="gpt-4o-2024-11-20"), + ), + openai_model( + model_key="o3-mini-2025-01-31", + manual_release_date=date(2025, 1, 31), + ), + openai_model( + model_key="gpt-4.5-preview-2025-02-27", + manual_release_date=date(2025, 2, 27), + ), + openai_model( + model_key="gpt-4.1-2025-04-14", + models_dev_reference=ModelsDevReference(provider_id="openai", model_id="gpt-4.1"), + ), + openai_model( + model_key="o3-2025-04-16", + models_dev_reference=ModelsDevReference(provider_id="openai", model_id="o3"), + ), + openai_model( + model_key="o4-mini-2025-04-16", + models_dev_reference=ModelsDevReference(provider_id="openai", model_id="o4-mini"), + ), + openai_model( + model_key="gpt-5-2025-08-07", + models_dev_reference=ModelsDevReference(provider_id="openai", model_id="gpt-5"), + ), + openai_model( + model_key="gpt-5-mini-2025-08-07", + models_dev_reference=ModelsDevReference(provider_id="openai", model_id="gpt-5-mini"), + ), + openai_model( + model_key="gpt-5-nano-2025-08-07", + models_dev_reference=ModelsDevReference(provider_id="openai", model_id="gpt-5-nano"), + ), + openai_model( + model_key="gpt-5.1-2025-11-13", + models_dev_reference=ModelsDevReference(provider_id="openai", model_id="gpt-5.1"), + ), + openai_model( + model_key="gpt-5.2-2025-12-11", + models_dev_reference=ModelsDevReference(provider_id="openai", model_id="gpt-5.2"), + ), + openai_model( + model_key="gpt-5.4-2026-03-05", + models_dev_reference=ModelsDevReference(provider_id="openai", model_id="gpt-5.4"), + ), + openai_model( + model_key="gpt-5.4-mini-2026-03-17", + models_dev_reference=ModelsDevReference(provider_id="openai", model_id="gpt-5.4-mini"), + ), + openai_model( + model_key="gpt-5.4-nano-2026-03-17", + models_dev_reference=ModelsDevReference(provider_id="openai", model_id="gpt-5.4-nano"), + ), + openai_model( + model_key="gpt-5.5-2026-04-23", + models_dev_reference=ModelsDevReference(provider_id="openai", model_id="gpt-5.5"), + ), +] + +# Together models: https://docs.together.ai/docs/serverless-models +TOGETHER_MODELS: Final[list[Model]] = [ + together_model( + model_key="llama-2-70b-chat-hf", + lab_key="Meta", + manual_release_date=date(2023, 7, 18), + ), + together_model( + model_key="mixtral-8x7b-instruct-v0.1", + lab_key="Mistral AI", + manual_release_date=date(2023, 12, 11), + ), + together_model( + model_key="mistral-large-latest", + lab_key="Mistral AI", + manual_release_date=date(2024, 2, 26), + ), + together_model( + model_key="mixtral-8x22b-instruct-v0.1", + lab_key="Mistral AI", + manual_release_date=date(2024, 4, 17), + ), + together_model( + model_key="llama-3-70b-chat-hf", + lab_key="Meta", + manual_release_date=date(2024, 4, 18), + ), + together_model( + model_key="llama-3-8b-chat-hf", + lab_key="Meta", + manual_release_date=date(2024, 4, 18), + ), + together_model( + model_key="qwen1.5-110b-chat", + lab_key="Qwen", + manual_release_date=date(2024, 4, 25), + ), + together_model( + model_key="meta-llama-3.1-405b-instruct-turbo", + lab_key="Meta", + manual_release_date=date(2024, 7, 23), + ), + together_model( + model_key="mistral-large-2407", + lab_key="Mistral AI", + manual_release_date=date(2024, 7, 24), + ), + together_model( + model_key="qwen2.5-72b-instruct-turbo", + lab_key="Qwen", + manual_release_date=date(2024, 9, 19), + ), + together_model( + model_key="llama-3.2-3b-instruct-turbo", + lab_key="Meta", + manual_release_date=date(2024, 9, 25), + ), + together_model( + model_key="mistral-large-2411", + lab_key="Mistral AI", + models_dev_reference=ModelsDevReference( + provider_id="mistral", model_id="mistral-large-2411" + ), + ), + together_model( + model_key="qwq-32b-preview", + lab_key="Qwen", + manual_release_date=date(2024, 11, 28), + ), + together_model( + model_key="llama-3.3-70b-instruct-turbo", + lab_key="Meta", + models_dev_reference=ModelsDevReference( + provider_id="togetherai", model_id="meta-llama/Llama-3.3-70B-Instruct-Turbo" + ), + ), + together_model( + model_key="deepseek-r1", + lab_key="DeepSeek", + models_dev_reference=ModelsDevReference( + provider_id="togetherai", model_id="deepseek-ai/DeepSeek-R1" + ), + ), + together_model( + model_key="deepseek-v3", + lab_key="DeepSeek", + models_dev_reference=ModelsDevReference( + provider_id="togetherai", model_id="deepseek-ai/DeepSeek-V3" + ), + ), + together_model( + model_key="llama-4-maverick-17b-128e-instruct-fp8", + lab_key="Meta", + manual_release_date=date(2025, 4, 5), + ), + together_model( + model_key="llama-4-scout-17b-16e-instruct", + lab_key="Meta", + manual_release_date=date(2025, 4, 5), + ), + together_model( + model_key="qwen3-235b-a22b-fp8-tput", + lab_key="Qwen", + manual_release_date=date(2025, 4, 29), + ), + together_model( + model_key="magistral-medium-2506", + lab_key="Mistral AI", + manual_release_date=date(2025, 5, 28), + ), + together_model( + model_key="kimi-k2-instruct", + lab_key="Moonshot", + manual_release_date=date(2025, 7, 12), + ), + together_model( + model_key="qwen3-235b-a22b-thinking-2507", + lab_key="Qwen", + manual_release_date=date(2025, 7, 25), + ), + together_model( + model_key="glm-4.5-air-fp8", + lab_key="Z.ai", + manual_release_date=date(2025, 7, 28), + ), + together_model( + model_key="deepseek-v3.1", + provider_model_id="deepseek-ai/DeepSeek-V3.1", + lab_key="DeepSeek", + active=False, + models_dev_reference=ModelsDevReference( + provider_id="togetherai", model_id="deepseek-ai/DeepSeek-V3-1" + ), + ), + together_model( + model_key="kimi-k2-instruct-0905", + lab_key="Moonshot", + manual_release_date=date(2025, 9, 5), + ), + together_model( + model_key="glm-4.6", + lab_key="Z.ai", + models_dev_reference=ModelsDevReference(provider_id="zai", model_id="glm-4.6"), + ), + together_model( + model_key="kimi-k2-thinking", + lab_key="Moonshot", + models_dev_reference=ModelsDevReference( + provider_id="moonshotai", model_id="kimi-k2-thinking" + ), + ), + together_model( + model_key="glm-4.7", + lab_key="Z.ai", + models_dev_reference=ModelsDevReference(provider_id="zai", model_id="glm-4.7"), + ), + together_model( + model_key="kimi-k2.5", + provider_model_id="moonshotai/Kimi-K2.5", + lab_key="Moonshot", + models_dev_reference=ModelsDevReference( + provider_id="togetherai", model_id="moonshotai/Kimi-K2.5" + ), + ), + together_model( + model_key="glm-5", + lab_key="Z.ai", + models_dev_reference=ModelsDevReference(provider_id="zai", model_id="glm-5"), + ), + together_model( + model_key="minimax-m2.5", + provider_model_id="MiniMaxAI/MiniMax-M2.5", + lab_key="MiniMax", + models_dev_reference=ModelsDevReference(provider_id="minimax", model_id="MiniMax-M2.5"), + ), + together_model( + model_key="minimax-m2.7", + provider_model_id="MiniMaxAI/MiniMax-M2.7", + lab_key="MiniMax", + models_dev_reference=ModelsDevReference(provider_id="minimax", model_id="MiniMax-M2.7"), + ), + together_model( + model_key="glm-5.1", + provider_model_id="zai-org/GLM-5.1", + lab_key="Z.ai", + models_dev_reference=ModelsDevReference(provider_id="zai", model_id="glm-5.1"), + ), + together_model( + model_key="gemma-4-31b", + provider_model_id="google/gemma-4-31B-it", + lab_key="Google DeepMind", + models_dev_reference=ModelsDevReference(provider_id="google", model_id="gemma-4-31b-it"), + ), + together_model( + model_key="kimi-k2.6", + provider_model_id="moonshotai/Kimi-K2.6", + lab_key="Moonshot", + models_dev_reference=ModelsDevReference(provider_id="moonshotai", model_id="kimi-k2.6"), + ), + together_model( + model_key="deepseek-v4-pro", + provider_model_id="deepseek-ai/DeepSeek-V4-Pro", + lab_key="DeepSeek", + models_dev_reference=ModelsDevReference(provider_id="deepseek", model_id="deepseek-v4-pro"), + ), +] + +# Anthropic models: https://platform.claude.com/docs/en/about-claude/models/overview +ANTHROPIC_MODELS: Final[list[Model]] = [ + anthropic_model( + model_key="claude-2.1", + manual_release_date=date(2023, 11, 21), + ), + anthropic_model( + model_key="claude-3-opus-20240229", + models_dev_reference=ModelsDevReference( + provider_id="anthropic", model_id="claude-3-opus-20240229" + ), + ), + anthropic_model( + model_key="claude-3-haiku-20240307", + models_dev_reference=ModelsDevReference( + provider_id="anthropic", model_id="claude-3-haiku-20240307" + ), + ), + anthropic_model( + model_key="claude-3-5-sonnet-20240620", + models_dev_reference=ModelsDevReference( + provider_id="anthropic", model_id="claude-3-5-sonnet-20240620" + ), + ), + anthropic_model( + model_key="claude-3-5-sonnet-20241022", + models_dev_reference=ModelsDevReference( + provider_id="anthropic", model_id="claude-3-5-sonnet-20241022" + ), + ), + anthropic_model( + model_key="claude-3-7-sonnet-20250219", + models_dev_reference=ModelsDevReference( + provider_id="anthropic", model_id="claude-3-7-sonnet-20250219" + ), + ), + anthropic_model( + model_key="claude-opus-4-20250514", + models_dev_reference=ModelsDevReference( + provider_id="anthropic", model_id="claude-opus-4-20250514" + ), + ), + anthropic_model( + model_key="claude-sonnet-4-20250514", + models_dev_reference=ModelsDevReference( + provider_id="anthropic", model_id="claude-sonnet-4-20250514" + ), + ), + anthropic_model( + model_key="claude-opus-4-1-20250805", + models_dev_reference=ModelsDevReference( + provider_id="anthropic", model_id="claude-opus-4-1-20250805" + ), + ), + anthropic_model( + model_key="claude-sonnet-4-5-20250929", + models_dev_reference=ModelsDevReference( + provider_id="anthropic", model_id="claude-sonnet-4-5-20250929" + ), + ), + anthropic_model( + model_key="claude-haiku-4-5-20251001", + models_dev_reference=ModelsDevReference( + provider_id="anthropic", model_id="claude-haiku-4-5-20251001" + ), + ), + anthropic_model( + model_key="claude-opus-4-5-20251101", + models_dev_reference=ModelsDevReference( + provider_id="anthropic", model_id="claude-opus-4-5-20251101" + ), + ), + anthropic_model( + model_key="claude-opus-4-6", + models_dev_reference=ModelsDevReference( + provider_id="anthropic", model_id="claude-opus-4-6" + ), + ), + anthropic_model( + model_key="claude-sonnet-4-6", + models_dev_reference=ModelsDevReference( + provider_id="anthropic", model_id="claude-sonnet-4-6" + ), + ), + anthropic_model( + model_key="claude-opus-4-7", + models_dev_reference=ModelsDevReference( + provider_id="anthropic", model_id="claude-opus-4-7" + ), + ), + anthropic_model( + model_key="claude-opus-4-8", + models_dev_reference=ModelsDevReference( + provider_id="anthropic", + model_id="claude-opus-4-8", + ), + ), +] + +# xAI models: https://console.x.ai/ -> API Models +XAI_MODELS: Final[list[Model]] = [ + xai_model( + model_key="grok-beta", + manual_release_date=date(2024, 11, 4), + ), + xai_model( + model_key="grok-4-0709", + manual_release_date=date(2025, 7, 9), + ), + xai_model( + model_key="grok-4-fast-non-reasoning", + manual_release_date=date(2025, 9, 19), + ), + xai_model( + model_key="grok-4-fast-reasoning", + manual_release_date=date(2025, 9, 19), + ), + xai_model( + model_key="grok-4-1-fast-non-reasoning", + manual_release_date=date(2025, 11, 17), + ), + xai_model( + model_key="grok-4-1-fast-reasoning", + manual_release_date=date(2025, 11, 17), + ), + xai_model( + model_key="grok-4.20-0309-non-reasoning", + models_dev_reference=ModelsDevReference( + provider_id="xai", model_id="grok-4.20-0309-non-reasoning" + ), + ), + xai_model( + model_key="grok-4.20-0309-reasoning", + models_dev_reference=ModelsDevReference( + provider_id="xai", model_id="grok-4.20-0309-reasoning" + ), + ), + xai_model( + model_key="grok-4.3", + models_dev_reference=ModelsDevReference(provider_id="xai", model_id="grok-4.3"), + ), +] + +# Google models: https://ai.google.dev/gemini-api/docs/models +GOOGLE_MODELS: Final[list[Model]] = [ + google_model( + model_key="gemini-1.5-flash", + manual_release_date=date(2024, 5, 1), + ), + google_model( + model_key="gemini-1.5-pro", + manual_release_date=date(2024, 5, 1), + ), + google_model( + model_key="gemini-2.0-flash-lite-001", + manual_release_date=date(2025, 2, 5), + ), + google_model( + model_key="gemini-2.5-flash", + models_dev_reference=ModelsDevReference(provider_id="google", model_id="gemini-2.5-flash"), + ), + google_model( + model_key="gemini-2.5-pro", + models_dev_reference=ModelsDevReference(provider_id="google", model_id="gemini-2.5-pro"), + ), + google_model( + model_key="gemini-2.5-pro-exp-03-25", + manual_release_date=date(2025, 3, 25), + ), + google_model( + model_key="gemini-2.5-pro-preview-03-25", + manual_release_date=date(2025, 4, 4), + ), + google_model( + model_key="gemini-2.5-flash-preview-04-17", + manual_release_date=date(2025, 4, 17), + ), + google_model( + model_key="gemini-3-pro-preview", + models_dev_reference=ModelsDevReference( + provider_id="google", model_id="gemini-3-pro-preview" + ), + ), + google_model( + model_key="gemini-3-flash-preview", + models_dev_reference=ModelsDevReference( + provider_id="google", model_id="gemini-3-flash-preview" + ), + ), + google_model( + model_key="gemini-3.1-pro-preview", + models_dev_reference=ModelsDevReference( + provider_id="google", model_id="gemini-3.1-pro-preview" + ), + ), + google_model( + model_key="gemini-3.1-flash-lite-preview", + models_dev_reference=ModelsDevReference( + provider_id="google", model_id="gemini-3.1-flash-lite-preview" + ), + ), + google_model( + model_key="gemini-3.1-flash-lite", + models_dev_reference=ModelsDevReference( + provider_id="google", model_id="gemini-3.1-flash-lite" + ), + ), + google_model( + model_key="gemini-3.5-flash", + models_dev_reference=ModelsDevReference(provider_id="google", model_id="gemini-3.5-flash"), ), ] + + +def _validate_unique_model_keys(models: Sequence[Model]) -> None: + """Reject duplicate model keys in a model registry list.""" + seen_model_keys = set() + for model in models: + if model.model_key in seen_model_keys: + raise ValueError(f"Duplicate LLM model_key: {model.model_key}") + seen_model_keys.add(model.model_key) + + +def create_models_list(models: Sequence[Model]) -> list[Model]: + """Create a validated model registry list.""" + _validate_unique_model_keys(models) + return list(models) + + +MODELS: Final[list[Model]] = create_models_list( + [ + *OPENAI_MODELS, + *TOGETHER_MODELS, + *ANTHROPIC_MODELS, + *XAI_MODELS, + *GOOGLE_MODELS, + ] +) +MODELS_BY_KEY: Final[dict[str, Model]] = {model.model_key: model for model in MODELS} + + +def model_release_dates_by_key() -> dict[str, date]: + """Return release dates keyed by canonical model_key.""" + return {model.model_key: model.release_date for model in MODELS} diff --git a/utils/llm/model_runs.py b/utils/llm/model_runs.py new file mode 100644 index 0000000..ce9af43 --- /dev/null +++ b/utils/llm/model_runs.py @@ -0,0 +1,637 @@ +"""Shared LLM model-run registry. + +``model_run_key`` is handwritten at each declaration site and is the stable +identifier used by benchmark files. ``build_model_run_key`` remains a helper for +checking naming conventions and adding option-name rules, but it must not be +used to silently derive a run key. +""" + +import logging +from collections.abc import Iterable, Sequence +from copy import deepcopy +from dataclasses import dataclass, field +from datetime import date +from typing import Any + +from .artificial_analysis_model_runs import create_artificial_analysis_model_runs +from .lab_registry import Lab +from .metadata.artificial_analysis import load_artificial_analysis_snapshot +from .model_registry import MODELS_BY_KEY, Model +from .provider_registry import Provider + +logger = logging.getLogger(__name__) + + +@dataclass(frozen=True, slots=True) +class ModelRun: + """Concrete LLM run with provider options.""" + + model_run_key: str + model: Model + options: dict[str, Any] = field(default_factory=dict) + artificial_analysis_id: str | None = None + + def __post_init__(self) -> None: + """Validate model-run metadata.""" + _validate_model_run_key(self.model_run_key) + build_model_run_key(self.model.model_key, self.options) + + if self.artificial_analysis_id is None: + return + + try: + load_artificial_analysis_snapshot().get_model(self.artificial_analysis_id) + except KeyError as exc: + raise ValueError( + "Artificial Analysis model runs must reference a valid " + f"artificial_analysis_id: {self.artificial_analysis_id}" + ) from exc + + @property + def name(self) -> str: + """Return the model-run key for compatibility with benchmark code.""" + return self.model_run_key + + @property + def display_name(self) -> str: + """Return the display name for leaderboards and reports.""" + if self.artificial_analysis_id is None: + return self.model_run_key + + return load_artificial_analysis_snapshot().get_model(self.artificial_analysis_id).name + + @property + def id(self) -> str: + """Return the model-run identifier.""" + return self.model_run_key + + @property + def model_key(self) -> str: + """Return the canonical base model key.""" + return self.model.model_key + + @property + def provider_model_id(self) -> str: + """Return the provider API model identifier.""" + return self.model.provider_model_id + + @property + def lab(self) -> Lab: + """Return the model-making lab.""" + return self.model.lab + + @property + def provider(self) -> Provider: + """Return the API provider route.""" + return self.model.provider + + @property + def model_organization(self) -> str: + """Return the model lab display name.""" + return self.model.lab.leaderboard_name + + @property + def release_date(self) -> date: + """Return the underlying model release date.""" + return self.model.release_date + + def __repr__(self) -> str: + """Return a concise model-run representation.""" + if self.options: + return f"" + return f"" + + def get_response(self, prompt: str, **kwargs: Any) -> str: + """Request a response from the configured provider and model.""" + from utils.llm.model_registry import get_response + + merged_options = {**self.options, **kwargs} + logger.info( + "Requesting LLM response provider=%s provider_model_id=%s options=%s", + self.provider.name, + self.provider_model_id, + merged_options, + ) + return get_response( + provider=self.provider, + model_id=self.provider_model_id, + prompt=prompt, + options=merged_options, + ) + + +NAME_NEUTRAL_OPTION_PATHS = { + ("temperature",), + ("candidate_count",), + ("automatic_function_calling",), +} + + +def _iter_leaf_paths(value: Any, prefix: tuple[str, ...] = ()) -> Iterable[tuple[str, ...]]: + """Yield leaf paths for nested option data.""" + if isinstance(value, dict): + for key, nested in value.items(): + yield from _iter_leaf_paths(nested, (*prefix, str(key))) + elif isinstance(value, list): + yield prefix + else: + yield prefix + + +def _is_path_covered(path: tuple[str, ...], covered_prefixes: set[tuple[str, ...]]) -> bool: + """Return whether path is covered by a consumed or neutral prefix.""" + return any(path[: len(prefix)] == prefix for prefix in covered_prefixes) + + +def _thinking_suffixes( + options: dict[str, Any], +) -> tuple[list[str], set[tuple[str, ...]]]: + """Return suffixes and consumed paths for model thinking options.""" + thinking = options.get("thinking") + if not isinstance(thinking, dict): + return [], set() + if thinking.get("type") == "adaptive": + return ["adaptive-thinking"], {("thinking",)} + raise ValueError(f"Unsupported thinking option for model-run naming: {thinking}") + + +def _effort_suffixes(options: dict[str, Any]) -> tuple[list[str], set[tuple[str, ...]]]: + """Return suffixes and consumed paths for effort options.""" + suffixes = [] + consumed: set[tuple[str, ...]] = set() + + reasoning = options.get("reasoning") + if isinstance(reasoning, dict) and "effort" in reasoning: + suffixes.append(str(reasoning["effort"]).replace("_", "-").lower()) + consumed.add(("reasoning",)) + + output_config = options.get("output_config") + if isinstance(output_config, dict) and "effort" in output_config: + suffixes.append(str(output_config["effort"]).replace("_", "-").lower()) + consumed.add(("output_config",)) + + return suffixes, consumed + + +def _tool_suffixes(options: dict[str, Any]) -> tuple[list[str], set[tuple[str, ...]]]: + """Return suffixes and consumed paths for tool options.""" + tools = options.get("tools") + if not tools: + return [], set() + if not isinstance(tools, list): + raise ValueError("tools option must be a list for model-run naming") + + suffixes = [] + for tool in tools: + if not isinstance(tool, dict): + raise ValueError(f"Unsupported tool option for model-run naming: {tool}") + tool_type = tool.get("type") + if tool_type in {"web_search", "web_search_20260209"} or "googleSearch" in tool: + suffix = "web-search" + elif tool_type == "x_search": + suffix = "x-search" + else: + raise ValueError(f"Unsupported tool option for model-run naming: {tool}") + if suffix not in suffixes: + suffixes.append(suffix) + + tool_order = {"web-search": 0, "x-search": 1} + return sorted(suffixes, key=tool_order.__getitem__), {("tools",)} + + +def _token_suffixes(options: dict[str, Any]) -> tuple[list[str], set[tuple[str, ...]]]: + """Return suffixes and consumed paths for token cap options.""" + suffixes = [] + consumed: set[tuple[str, ...]] = set() + + for key in ("max_tokens", "max_output_tokens"): + if key in options: + suffixes.append(str(options[key])) + consumed.add((key,)) + + return suffixes, consumed + + +NAME_COMPONENT_RULES = ( + _thinking_suffixes, + _effort_suffixes, + _tool_suffixes, + _token_suffixes, +) + + +def build_model_run_key(model_key: str, options: dict[str, Any]) -> str: + """Build a suggested model-run key from a base model key and options.""" + suffixes = [] + consumed_prefixes = set(NAME_NEUTRAL_OPTION_PATHS) + + for rule in NAME_COMPONENT_RULES: + rule_suffixes, rule_consumed = rule(options) + suffixes.extend(rule_suffixes) + consumed_prefixes.update(rule_consumed) + + unknown_paths = sorted( + path for path in _iter_leaf_paths(options) if not _is_path_covered(path, consumed_prefixes) + ) + if unknown_paths: + raise ValueError( + "ModelRun options must be name-relevant or name-neutral. " + f"Unknown option paths: {unknown_paths}" + ) + + if suffixes: + return "-".join([model_key, *suffixes]) + return model_key + + +def _validate_model_run_key(model_run_key: str) -> None: + """Reject model-run keys that are unsafe for downstream filenames.""" + if not isinstance(model_run_key, str): + raise TypeError("ModelRun model_run_key must be a string") + if not model_run_key: + raise ValueError("ModelRun model_run_key must be non-empty") + if model_run_key != model_run_key.lower(): + raise ValueError(f"ModelRun model_run_key must be lowercase: {model_run_key}") + if any(char in model_run_key for char in (" ", "/", "_")): + raise ValueError(f"ModelRun model_run_key is not filename-safe: {model_run_key}") + + +def _model_run( + *, + model_run_key: str, + model_key: str, + options: dict[str, Any] | None = None, + artificial_analysis_id: str | None = None, +) -> ModelRun: + """Create a model run from a canonical model key.""" + return ModelRun( + model_run_key=model_run_key, + model=MODELS_BY_KEY[model_key], + options=deepcopy(options) if options is not None else {}, + artificial_analysis_id=artificial_analysis_id, + ) + + +def _validate_unique_model_run_keys(runs: Sequence[ModelRun]) -> None: + """Reject duplicate model-run keys in a model-run registry list.""" + seen_model_run_keys = set() + for run in runs: + if run.model_run_key in seen_model_run_keys: + raise ValueError(f"Duplicate LLM model_run_key: {run.model_run_key}") + seen_model_run_keys.add(run.model_run_key) + + +def create_model_runs_list(runs: Sequence[ModelRun]) -> list[ModelRun]: + """Create a validated model-run registry list.""" + _validate_unique_model_run_keys(runs) + return list(runs) + + +ARTIFICIAL_ANALYSIS_MODEL_RUNS = create_artificial_analysis_model_runs(_model_run) + + +MODEL_RUNS: list[ModelRun] = create_model_runs_list( + [ + # AA declarations are benchmark-selectable runs, not metadata-only + # records, so declaring them in the AA module adds them here. + *ARTIFICIAL_ANALYSIS_MODEL_RUNS, + _model_run( + model_run_key="gpt-4o-mini-2024-07-18", + model_key="gpt-4o-mini-2024-07-18", + options={"temperature": 0}, + ), + _model_run( + model_run_key="o3-mini-2025-01-31", + model_key="o3-mini-2025-01-31", + ), + _model_run( + model_run_key="gpt-5-nano-2025-08-07", + model_key="gpt-5-nano-2025-08-07", + ), + _model_run( + model_run_key="gpt-5-mini-2025-08-07", + model_key="gpt-5-mini-2025-08-07", + ), + _model_run( + model_run_key="gpt-5-mini-2025-08-07-1024", + model_key="gpt-5-mini-2025-08-07", + options={"max_output_tokens": 1024}, + ), + _model_run( + model_run_key="gpt-5.2-2025-12-11", + model_key="gpt-5.2-2025-12-11", + ), + _model_run( + model_run_key="gpt-5.4-2026-03-05", + model_key="gpt-5.4-2026-03-05", + ), + _model_run( + model_run_key="gpt-5.4-2026-03-05-high", + model_key="gpt-5.4-2026-03-05", + options={"reasoning": {"effort": "high"}}, + ), + _model_run( + model_run_key="gpt-5.4-2026-03-05-high-web-search", + model_key="gpt-5.4-2026-03-05", + options={ + "reasoning": {"effort": "high"}, + "tools": [{"type": "web_search"}], + }, + ), + _model_run( + model_run_key="gpt-5.4-mini-2026-03-17", + model_key="gpt-5.4-mini-2026-03-17", + ), + _model_run( + model_run_key="gpt-5.4-nano-2026-03-17", + model_key="gpt-5.4-nano-2026-03-17", + ), + _model_run( + model_run_key="gpt-5.5-2026-04-23", + model_key="gpt-5.5-2026-04-23", + ), + _model_run( + model_run_key="gpt-5.5-2026-04-23-medium", + model_key="gpt-5.5-2026-04-23", + options={"reasoning": {"effort": "medium"}}, + ), + _model_run( + model_run_key="gpt-5.5-2026-04-23-high", + model_key="gpt-5.5-2026-04-23", + options={"reasoning": {"effort": "high"}}, + ), + _model_run( + model_run_key="gpt-5.5-2026-04-23-high-web-search", + model_key="gpt-5.5-2026-04-23", + options={ + "reasoning": {"effort": "high"}, + "tools": [{"type": "web_search"}], + }, + ), + _model_run( + model_run_key="deepseek-v3.1", + model_key="deepseek-v3.1", + options={"temperature": 0}, + ), + _model_run( + model_run_key="deepseek-v4-pro", + model_key="deepseek-v4-pro", + options={"temperature": 0}, + ), + _model_run( + model_run_key="minimax-m2.5", + model_key="minimax-m2.5", + options={"temperature": 0}, + ), + _model_run( + model_run_key="minimax-m2.7", + model_key="minimax-m2.7", + options={"temperature": 0}, + ), + _model_run( + model_run_key="kimi-k2.5", + model_key="kimi-k2.5", + options={"temperature": 0}, + ), + _model_run( + model_run_key="kimi-k2.6", + model_key="kimi-k2.6", + options={"temperature": 0}, + ), + _model_run( + model_run_key="glm-5.1", + model_key="glm-5.1", + options={"temperature": 0}, + ), + _model_run( + model_run_key="gemma-4-31b", + model_key="gemma-4-31b", + options={"temperature": 0}, + ), + _model_run( + model_run_key="claude-haiku-4-5-20251001-1024", + model_key="claude-haiku-4-5-20251001", + options={"max_tokens": 1024, "temperature": 0}, + ), + _model_run( + model_run_key="claude-haiku-4-5-20251001-4096", + model_key="claude-haiku-4-5-20251001", + options={"max_tokens": 4096}, + ), + _model_run( + model_run_key="claude-sonnet-4-5-20250929-1024", + model_key="claude-sonnet-4-5-20250929", + options={"max_tokens": 1024, "temperature": 0}, + ), + _model_run( + model_run_key="claude-sonnet-4-5-20250929-4096", + model_key="claude-sonnet-4-5-20250929", + options={"max_tokens": 4096}, + ), + _model_run( + model_run_key="claude-sonnet-4-6-1024", + model_key="claude-sonnet-4-6", + options={"max_tokens": 1024, "temperature": 0}, + ), + _model_run( + model_run_key="claude-sonnet-4-6-4096", + model_key="claude-sonnet-4-6", + options={"max_tokens": 4096}, + ), + _model_run( + model_run_key="claude-sonnet-4-6-adaptive-thinking-16000", + model_key="claude-sonnet-4-6", + options={ + "max_tokens": 16000, + "thinking": {"type": "adaptive"}, + }, + ), + _model_run( + model_run_key="claude-opus-4-6-4096", + model_key="claude-opus-4-6", + options={"max_tokens": 4096}, + ), + _model_run( + model_run_key="claude-opus-4-7-1024", + model_key="claude-opus-4-7", + options={"max_tokens": 1024}, + ), + _model_run( + model_run_key="claude-opus-4-7-4096", + model_key="claude-opus-4-7", + options={"max_tokens": 4096}, + ), + _model_run( + model_run_key="claude-opus-4-7-adaptive-thinking-high-24000", + model_key="claude-opus-4-7", + options={ + "max_tokens": 24000, + "output_config": {"effort": "high"}, + "thinking": {"type": "adaptive"}, + }, + ), + _model_run( + model_run_key="claude-opus-4-7-adaptive-thinking-high-web-search-64000", + model_key="claude-opus-4-7", + options={ + "max_tokens": 64000, + "output_config": {"effort": "high"}, + "thinking": {"type": "adaptive"}, + "tools": [ + { + "type": "web_search_20260209", + "name": "web_search", + "max_uses": 5, + } + ], + }, + ), + _model_run( + model_run_key="claude-opus-4-8-1024", + model_key="claude-opus-4-8", + options={"max_tokens": 1024}, + ), + _model_run( + model_run_key="claude-opus-4-8-4096", + model_key="claude-opus-4-8", + options={"max_tokens": 4096}, + ), + _model_run( + model_run_key="claude-opus-4-8-adaptive-thinking-high-24000", + model_key="claude-opus-4-8", + options={ + "max_tokens": 24000, + "output_config": {"effort": "high"}, + "thinking": {"type": "adaptive"}, + }, + ), + _model_run( + model_run_key="claude-opus-4-8-adaptive-thinking-high-web-search-64000", + model_key="claude-opus-4-8", + options={ + "max_tokens": 64000, + "output_config": {"effort": "high"}, + "thinking": {"type": "adaptive"}, + "tools": [ + { + "type": "web_search_20260209", + "name": "web_search", + "max_uses": 5, + } + ], + }, + ), + _model_run( + model_run_key="grok-4-1-fast-reasoning", + model_key="grok-4-1-fast-reasoning", + ), + _model_run( + model_run_key="grok-4-1-fast-non-reasoning", + model_key="grok-4-1-fast-non-reasoning", + ), + _model_run( + model_run_key="grok-4.20-0309-reasoning", + model_key="grok-4.20-0309-reasoning", + options={"temperature": 0}, + ), + _model_run( + model_run_key="grok-4.20-0309-reasoning-web-search-x-search", + model_key="grok-4.20-0309-reasoning", + options={ + "tools": [{"type": "web_search"}, {"type": "x_search"}], + }, + ), + _model_run( + model_run_key="grok-4.20-0309-non-reasoning", + model_key="grok-4.20-0309-non-reasoning", + options={"temperature": 0}, + ), + _model_run( + model_run_key="grok-4.3", + model_key="grok-4.3", + options={"temperature": 0}, + ), + _model_run( + model_run_key="gemini-2.5-pro", + model_key="gemini-2.5-pro", + options={"temperature": 0}, + ), + _model_run( + model_run_key="gemini-2.5-pro-web-search", + model_key="gemini-2.5-pro", + options={ + "temperature": 0, + "tools": [{"googleSearch": {}}], + }, + ), + _model_run( + model_run_key="gemini-3-flash-preview", + model_key="gemini-3-flash-preview", + options={ + "candidate_count": 1, + "temperature": 0, + "automatic_function_calling": {"disable": True}, + }, + ), + _model_run( + model_run_key="gemini-3.1-flash-lite-preview", + model_key="gemini-3.1-flash-lite-preview", + options={ + "candidate_count": 1, + "temperature": 0, + "automatic_function_calling": {"disable": True}, + }, + ), + _model_run( + model_run_key="gemini-3.1-flash-lite", + model_key="gemini-3.1-flash-lite", + options={ + "candidate_count": 1, + "temperature": 0, + "automatic_function_calling": {"disable": True}, + }, + ), + _model_run( + model_run_key="gemini-3.1-pro-preview", + model_key="gemini-3.1-pro-preview", + options={ + "candidate_count": 1, + "temperature": 0, + "automatic_function_calling": {"disable": True}, + }, + ), + _model_run( + model_run_key="gemini-3.5-flash", + model_key="gemini-3.5-flash", + options={ + "candidate_count": 1, + "temperature": 0, + "automatic_function_calling": {"disable": True}, + }, + ), + ] +) +MODEL_RUNS_BY_KEY: dict[str, ModelRun] = {run.model_run_key: run for run in MODEL_RUNS} + +# MODEL_RUNS is historical. ACTIVE_MODEL_RUNS is the current live-callable +# subset for benchmarks and integration sweeps. +ACTIVE_MODEL_RUNS: list[ModelRun] = [run for run in MODEL_RUNS if run.model.active] +ACTIVE_MODEL_RUNS_BY_KEY: dict[str, ModelRun] = { + run.model_run_key: run for run in ACTIVE_MODEL_RUNS +} + + +def get_model_run(model_run_key: str) -> ModelRun: + """Return a shared model run by key.""" + try: + return MODEL_RUNS_BY_KEY[model_run_key] + except KeyError as exc: + available = ", ".join(sorted(MODEL_RUNS_BY_KEY)) + raise KeyError( + f"Unknown LLM model_run_key {model_run_key}. Available: {available}" + ) from exc + + +def select_model_runs(model_run_keys: Sequence[str]) -> list[ModelRun]: + """Return model runs in the requested order.""" + return [get_model_run(model_run_key) for model_run_key in model_run_keys]