Evaluation Infrastructure for AI Agents
Updated Feb 25, 2026 - TypeScript
Benchmarking Open-Ended Inference Optimization by AI Agents
Learn to evaluate AI products for production — 21 hands-on lessons on evals, metrics, fairness, agents, red teaming, and release decisions for working PMs.
Eval-first AI agent that triages property maintenance emails. The real work is the eval system around it: trace-driven error analysis, code graders and validated LLM-as-judge (TPR/TNR), component and end-to-end evals, a failure taxonomy, and a CI regression gate. LangGraph, FastAPI, Langfuse.
🔍 Benchmark jailbreak resilience in LLMs with JailBench for clear insights and improved model defenses against jailbreak attempts.
Playbook for PMs shipping AI products with PRDs, evals, HITL, launch gates, cost, and observability.
Benchmark LLM jailbreak resilience across providers with standardized tests, adversarial mode, rich analytics, and a clean Web UI.
Collection of frameworks and tools for AI evaluations, including tool-use, agentic AI, MCP, and multimodal
Free, local Langfuse OSS setup with Ollama for LLM evaluation, scoring, and datasets.
End-to-end AI evals orchestration platform for comparing LLM outputs across providers with transcription, structured logging, human review, and Supabase-backed decision tracking.
Unofficial TypeScript starter for deterministic local contract testing around Foundry-oriented workflows with Themis.
Dali is an open infrastructure project focused on citation integrity, evidentiary lineage, and reproducibility for legal AI systems. It evaluates whether AI-generated legal citations and workflows remain attributable, reconstructable, and verifiable across modern AI environments.
Offline, auditable benchmark for one-shot LLM market decisions.
Experimentation framework for LLM systems using simulated users, conversational behavioral metrics, and causal inference to evaluate prompt strategies, temperature, and model scaling.
AI-powered remote job aggregator with true-remote filtering, fuzzy skill match, and a working AI Evals Engineer portfolio (Hamel/Shreya methodology, calibrated judges, failure taxonomy).
A lightweight workbench for dataset-driven agent and LLM evaluation.
Mercor — AI-powered talent marketplace and labeling workforce
Harbor-format AI evaluation tasks for synthetic adtech revenue operations workflows
Lightweight eval framework for LLMs & AI apps combining deterministic scoring, LLM-as-judge, and regression testing.
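Several of the entries above mention validating an LLM-as-judge by its true positive rate (TPR) and true negative rate (TNR) and gating CI on the result. As a rough sketch of what that check might look like, the snippet below compares judge verdicts with human labels. It is not taken from any of the listed repositories: the names `judge_agreement` and `passes_gate`, the convention that `True` means "this output is a failure", and the 0.9 thresholds are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class JudgeAgreement:
    tpr: float  # share of human-labelled failures the judge also flags
    tnr: float  # share of human-labelled passes the judge also passes


def judge_agreement(human: list[bool], judge: list[bool]) -> JudgeAgreement:
    """Compare judge verdicts with human labels (True = output is a failure)."""
    if len(human) != len(judge):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(human, judge))
    true_pos = sum(1 for h, j in pairs if h and j)
    true_neg = sum(1 for h, j in pairs if not h and not j)
    positives = sum(1 for h in human if h)
    negatives = len(human) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one failing and one passing human label")
    return JudgeAgreement(tpr=true_pos / positives, tnr=true_neg / negatives)


def passes_gate(result: JudgeAgreement, min_tpr: float = 0.9, min_tnr: float = 0.9) -> bool:
    """Hypothetical CI gate: both rates must meet their (assumed) thresholds."""
    return result.tpr >= min_tpr and result.tnr >= min_tnr


# Hypothetical labels for eight traces: four human-labelled failures, four passes.
human_labels = [True, True, True, True, False, False, False, False]
judge_labels = [True, True, True, False, False, False, False, True]

agreement = judge_agreement(human_labels, judge_labels)
print(agreement, passes_gate(agreement))
```

Reporting TPR and TNR separately, rather than a single accuracy figure, keeps a judge that misses failures from hiding behind a large number of easy passes when the labelled set is imbalanced.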