ai-evals

Here are 39 public repositories matching this topic...

solana8800 / langeval

Evaluation Infrastructure for AI Agents

ai-evaluation agent-evaluation ai-evals

Updated Feb 25, 2026
TypeScript

aisa-group / InferenceBench

Benchmarking Open-Ended Inference Optimization by AI Agents

benchmarks ai-safety vllm sglang claude-code codex-cli ai-evals ai-research-automation

Updated May 16, 2026
Python

productfoundry101 / ai-evals-bootcamp

Learn to evaluate AI products for production — 21 hands-on lessons on evals, metrics, fairness, agents, red teaming, and release decisions for working PMs.

bootcamp red-teaming rag prompt-engineering llmops ai-product-management llm-evaluation claude-code ai-pm ai-evals

Updated May 6, 2026

yiouli / pixie-qa

Agent skill for AI agent development

skill dev eval llm agent-skills ai-evals

Updated Apr 22, 2026
HTML

mohsinsheikhani / property-maintenance-agent

Eval-first AI agent that triages property maintenance emails. The real work is the eval system around it: trace-driven error analysis, code graders and validated LLM-as-judge (TPR/TNR), component and end-to-end evals, a failure taxonomy, and a CI regression gate. LangGraph, FastAPI, Langfuse.

python evaluation openai ai-agents pydantic fastapi ai-engineering prompt-engineering llmops langfuse llm-evaluation langgraph llm-as-a-judge llm-observability agentic-ai context-engineering ai-evals

Updated May 26, 2026
Python

RafaelParonis / jailbench

🔍 Benchmark jailbreak resilience in LLMs with JailBench for clear insights and improved model defenses against jailbreak attempts.

python flask analytics openai alignment model-evaluation ai-safety security-testing red-teaming model-robustness anthropic litellm content-safety llm-jailbreaks tool-calling llm-benchmark ai-evals textual-tui

Updated May 30, 2026
Python

zaidazmi / AI-PM-PLAYBOOK

Playbook for PMs shipping AI products with PRDs, evals, HITL, launch gates, cost, and observability.

prd human-in-the-loop ai-agents evals ai-product-management llm-evals vibe-coding ai-pm ai-evals

Updated May 26, 2026
TypeScript

vibheksoni / jailbench

Benchmark LLM jailbreak resilience across providers with standardized tests, adversarial mode, rich analytics, and a clean Web UI.

Updated Aug 12, 2025
Python

danielrosehill / Awesome-AI-Evaluations-Tools

Collection of frameworks and tools for AI evaluations, including tool-use, agentic AI, MCP, and multimodal

evaluations evals ai-evals

Updated May 25, 2026
Python

MohsinCreed / LangfuseOllama

Free, local Langfuse OSS setup with Ollama for LLM evaluation, scoring, and datasets.

docker open-source self-hosted free no-cost local-llm ollama langfuse llm-evaluation prompt-evaluation offline-ai llm-as-judge llm-observability ai-evals

Updated Apr 13, 2026
TypeScript

SuperfiedStudd / ai-evals-orchestration

End-to-end AI evals orchestration platform for comparing LLM outputs across providers with transcription, structured logging, human review, and Supabase-backed decision tracking.

gemini openai multi-model transcription human-in-the-loop model-comparison supabase anthropic llm-evaluation ai-evals evaluation-pipeline

Updated Mar 10, 2026
TypeScript

vitron-ai / aip-foundry-themis-starter

Unofficial TypeScript starter for deterministic local contract testing around Foundry-oriented workflows with Themis.

react typescript schema-validation themis contract-testing osdk developer-tooling agentic-workflows ai-evals foundry-workflows

Updated Mar 28, 2026
TypeScript

yenk / Dali

Dali is an open infrastructure project focused on citation integrity, evidentiary lineage, and reproducibility for legal AI systems. It evaluates whether AI generated legal citations and workflows remain attributable, reconstructable, and verifiable across modern AI environments.

open-source benchmark oss mcp provenance reproducibility legaltech evidence legal-ai ai-evaluation open-infrastructure ai-evals deterministic-ai legal-citations defensible-ai legal-infrastructure citation-integrity evidentiary-infrastructure

Updated May 28, 2026
Python

ishtiaqrahman / capitalbench

Offline, auditable benchmark for one-shot LLM market decisions.

finance benchmark reproducibility llm-evaluation ai-evals capitalbench

Updated May 30, 2026
Python

vishal-labade / llm_exp_platform_v2

Experimentation framework for LLM systems using simulated users, conversational behavioral metrics, and causal inference to evaluate prompt strategies, temperature, and model scaling.

experimentation causal-inference product-analytics llm-evaluation llm-benchmarking ai-evals

Updated Mar 8, 2026
Python

majdukovic / job-radar

AI-powered remote job aggregator with true-remote filtering, fuzzy skill match, and a working AI Evals Engineer portfolio (Hamel/Shreya methodology, calibrated judges, failure taxonomy).

typescript nextjs job-search posthog supabase ai-evaluation llm inngest prompt-engineering anthropic ai-evals

Updated May 14, 2026
HTML

IsaacCavallaro / agent-evals-workbench

A lightweight workbench for dataset-driven agent and LLM evaluation.

python cli regression-testing llm-evals agent-evals openai-compatible ai-evals eval-harness

Updated May 1, 2026
Python

api-evangelist / mercor

Mercor — AI-powered talent marketplace and labeling workforce

human-intelligence data-pipelines sft rlhf ai-evals talent-marketplace apex-benchmarks

Updated May 24, 2026

davidspiegs / adtech-eval-lab

Harbor-format AI evaluation tasks for synthetic adtech revenue operations workflows

benchmarking adtech harbor synthetic-data revenue-operations ai-evals

Updated May 25, 2026
Python

vineethcv / eval-engine

Lightweight eval framework for LLMs & AI apps combining deterministic scoring, LLM-as-judge, and regression testing.

python testing evaluation openai llm ai-quality evals ai-evals ai-quality-assurance

Updated Apr 8, 2026
Python
