Production-grade RAG evaluation toolkit with LLM-as-judge, cost accounting, and CI/CD regression gates.
This monorepo provides a composable suite of packages for evaluating Retrieval-Augmented Generation (RAG) systems across four core metrics — faithfulness, relevance, context precision, and context recall — with heuristic scoring, LLM-based judging, budget enforcement, and automated quality gating for CI pipelines.
- Generation metrics: faithfulness, relevance, and answer correctness with fast lexical scorers; supply an `EmbeddingProvider` for paraphrase-aware semantic scoring
- Retrieval metrics: context precision/recall plus ranking metrics (MRR, nDCG, precision/recall/hit@k) computed from retrieved vs. relevant chunk IDs
- LLM-as-judge: multi-provider judging (Anthropic, OpenAI, Google, and any OpenAI-compatible gateway or local model) with calibration, consensus voting, and agreement-based confidence
- Cost accounting: per-sample and per-run token tracking with budget enforcement and alert thresholds
- Quality gates: threshold and baseline-comparison gates with noise-tolerance bands, `warn`/`fail` severity, and formatted CI output with exit codes
- MCP server: three-layer tool API (`judge.*`, `suite.*`, `gate.*`) for agent-driven evaluation
- Dataset management: multi-format loading, Zod validation, synthetic generation, and version tracking
- Observability: structured Pino logging, OpenTelemetry tracing, and Prometheus-compatible metrics
- Dual ESM/CJS: every package ships `cjs` and `esm` output for maximum compatibility
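To make the retrieval ranking metrics concrete, here is a minimal, self-contained sketch of how reciprocal rank, precision/recall/hit@k, and binary-relevance nDCG can be computed from a ranked list of retrieved chunk IDs and a set of relevant chunk IDs. The function names and signatures are illustrative assumptions, not the `@reaatech/rag-eval-metrics` API, and the package's own scorers may differ in detail (for example, in how they handle graded relevance or duplicate IDs). MRR is the mean of the per-sample reciprocal rank.

```typescript
// Illustrative helpers only; names and signatures are assumptions,
// not the @reaatech/rag-eval-metrics API. Chunk IDs are assumed unique.

/** Reciprocal rank of the first relevant chunk; 0 if none was retrieved. */
function reciprocalRank(retrieved: string[], relevant: Set<string>): number {
  const idx = retrieved.findIndex((id) => relevant.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

/** Number of relevant chunks among the top k retrieved. */
function hitsAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  return retrieved.slice(0, Math.max(0, k)).filter((id) => relevant.has(id)).length;
}

/** Fraction of the top k slots filled by relevant chunks. */
function precisionAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  return k > 0 ? hitsAtK(retrieved, relevant, k) / k : 0;
}

/** Fraction of all relevant chunks that appear in the top k. */
function recallAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  return relevant.size > 0 ? hitsAtK(retrieved, relevant, k) / relevant.size : 0;
}

/** 1 if at least one relevant chunk appears in the top k, else 0. */
function hitAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  return hitsAtK(retrieved, relevant, k) > 0 ? 1 : 0;
}

/** nDCG@k with binary relevance: DCG of the ranking divided by DCG of an ideal ranking. */
function ndcgAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const discount = (rank: number) => 1 / Math.log2(rank + 2); // rank is 0-based
  let dcg = 0;
  retrieved.slice(0, Math.max(0, k)).forEach((id, rank) => {
    if (relevant.has(id)) dcg += discount(rank);
  });
  let ideal = 0;
  for (let rank = 0; rank < Math.min(k, relevant.size); rank++) ideal += discount(rank);
  return ideal === 0 ? 0 : dcg / ideal;
}

// Example: two relevant chunks, found at ranks 2 and 4.
const retrieved = ["c3", "c1", "c7", "c2"];
const relevant = new Set(["c1", "c2"]);
console.log("RR:", reciprocalRank(retrieved, relevant)); // 0.5
console.log("P@3:", precisionAtK(retrieved, relevant, 3)); // one of the top three is relevant
console.log("R@3:", recallAtK(retrieved, relevant, 3)); // 0.5
console.log("hit@3:", hitAtK(retrieved, relevant, 3)); // 1
console.log("nDCG@4:", ndcgAtK(retrieved, relevant, 4).toFixed(3)); // 0.651
```

Because these metrics only need chunk IDs, they run without any model call and are cheap enough to compute on every sample.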
Packages are published under the `@reaatech` scope and can be installed individually:
```bash
# Core types and schemas
pnpm add @reaatech/rag-eval-core

# Metric scorers
pnpm add @reaatech/rag-eval-metrics

# LLM judge
pnpm add @reaatech/rag-eval-judge

# Cost tracking
pnpm add @reaatech/rag-eval-cost

# Quality gates
pnpm add @reaatech/rag-eval-gate

# Dataset management
pnpm add @reaatech/rag-eval-dataset

# Central orchestrator
pnpm add @reaatech/rag-eval-suite

# MCP server
pnpm add @reaatech/rag-eval-mcp-server @modelcontextprotocol/sdk

# CLI tool
pnpm add @reaatech/rag-eval-cli

# Observability utilities
pnpm add @reaatech/rag-eval-observability
```

To work on the monorepo itself, clone and build it from source:

```bash
# Clone the repository
git clone https://github.com/reaatech/rag-eval-pack.git
cd rag-eval-pack

# Install dependencies
pnpm install

# Build all packages
pnpm build

# Run the test suite
pnpm test

# Run linting
pnpm lint
```

Evaluate a RAG system's output in a few lines:
```typescript
import { EvaluationSuite } from "@reaatech/rag-eval-suite";

const suite = new EvaluationSuite({
  metrics: ["faithfulness", "relevance", "context_precision", "context_recall"],
  judge: { model: "claude-opus-4-7" },
  gates: [
    { name: "min-faithfulness", type: "threshold", metric: "avg_faithfulness", operator: ">=", threshold: 0.85 },
  ],
  cost: { budget_limit: 10.00 },
});

const result = await suite.runFromFile("datasets/eval-samples.jsonl");

console.log("Overall score:", result.results.metrics.overall_score);
console.log("Faithfulness:", result.results.metrics.avg_faithfulness);
console.log("Total cost:", result.results.total_cost);
console.log("Gates passed:", result.gate_result?.passed);
```

Or use the CLI:
```bash
rag-eval-pack evaluate --dataset dataset.jsonl --output results.json
rag-eval-pack gate --results results.json --gates gates.yaml
rag-eval-pack report --results results.json --output report.md
```

See `datasets/examples/` for sample datasets and configuration files.
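As a rough illustration of what the gate step decides, the sketch below evaluates a threshold gate and a baseline-comparison gate with a noise-tolerance band, then maps the outcomes to a CI exit code. The threshold gate mirrors the object shown in the quick-start config. Everything else is an assumption made for this example and not the `@reaatech/rag-eval-gate` schema: the `"baseline"` type name, the `tolerance` and `severity` fields, the `avg_relevance` metric name, and the helpers `evaluateGate` and `exitCode`.

```typescript
// Sketch only. The threshold gate mirrors the quick-start config; the
// "baseline" gate type and the `tolerance` and `severity` fields are
// assumptions for illustration, not the @reaatech/rag-eval-gate schema.
type Severity = "warn" | "fail";
type Metrics = Record<string, number | undefined>;

interface ThresholdGate {
  name: string;
  type: "threshold";
  metric: string;
  operator: ">=" | "<=";
  threshold: number;
  severity?: Severity;
}

interface BaselineGate {
  name: string;
  type: "baseline";
  metric: string;
  tolerance: number; // drop below the baseline that is still treated as noise
  severity?: Severity;
}

type Gate = ThresholdGate | BaselineGate;

interface GateOutcome {
  name: string;
  passed: boolean;
  severity: Severity;
}

function evaluateGate(gate: Gate, current: Metrics, baseline: Metrics = {}): GateOutcome {
  const severity: Severity = gate.severity ?? "fail";
  const value = current[gate.metric];
  let passed = false; // a missing metric or missing baseline never passes
  if (value !== undefined) {
    if (gate.type === "threshold") {
      passed = gate.operator === ">=" ? value >= gate.threshold : value <= gate.threshold;
    } else {
      const base = baseline[gate.metric];
      passed = base !== undefined && value >= base - gate.tolerance;
    }
  }
  return { name: gate.name, passed, severity };
}

/** Non-zero only when a gate with "fail" severity did not pass. */
function exitCode(outcomes: GateOutcome[]): number {
  return outcomes.some((o) => !o.passed && o.severity === "fail") ? 1 : 0;
}

const gates: Gate[] = [
  { name: "min-faithfulness", type: "threshold", metric: "avg_faithfulness", operator: ">=", threshold: 0.85 },
  { name: "no-relevance-regression", type: "baseline", metric: "avg_relevance", tolerance: 0.02, severity: "warn" },
];

const outcomes = gates.map((g) =>
  evaluateGate(g, { avg_faithfulness: 0.9, avg_relevance: 0.8 }, { avg_relevance: 0.9 }),
);
console.log(outcomes.map((o) => `${o.name}: ${o.passed ? "pass" : o.severity}`));
console.log("exit code:", exitCode(outcomes)); // 0, because the only failing gate is a warning
```

Separating severity from pass/fail lets a noisy metric surface in the CI log without blocking the merge, while hard quality floors still fail the build.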
Each generation metric can run at three levels of fidelity and cost — pick per use case:
| Mode | How it works | Cost | Catches paraphrase? |
|---|---|---|---|
| Lexical (default) | Word/character overlap + intent heuristics. Deterministic, no network, no key. | Free | No — surface form only |
| Semantic | Cosine similarity of embeddings via an `EmbeddingProvider` you supply. | Embedding API cost | Yes |
| LLM judge | An LLM rates the sample with calibration and optional consensus. | Token cost | Yes, with reasoning |
Lexical scoring is reported as `lexical_similarity`; `semantic_similarity` is populated only when an embedding provider is configured. Use lexical for fast pre-commit smoke checks, semantic for paraphrase-sensitive metrics, and the judge for the final quality bar.
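The difference between the lexical and semantic modes can be sketched with two small functions: a token-overlap score (Jaccard similarity, used here as a stand-in for the toolkit's lexical heuristics, which are not specified in this document) and cosine similarity over embedding vectors. Both helpers are hypothetical and exist only for illustration.

```typescript
// Illustrative only: Jaccard overlap is a stand-in for lexical scoring and
// is not claimed to be what @reaatech/rag-eval-metrics computes internally.

/** Lowercased alphanumeric tokens of a text, deduplicated. */
function tokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().match(/[a-z0-9]+/g) ?? []);
}

/** Shared tokens divided by all distinct tokens across both texts. */
function jaccard(a: string, b: string): number {
  const ta = tokenize(a);
  const tb = tokenize(b);
  if (ta.size === 0 && tb.size === 0) return 1;
  let shared = 0;
  ta.forEach((t) => {
    if (tb.has(t)) shared++;
  });
  return shared / (ta.size + tb.size - shared);
}

/** Cosine of the angle between two equal-length vectors; 0 if either is all zeros. */
function cosineSimilarity(u: number[], v: number[]): number {
  if (u.length !== v.length) throw new Error("vector lengths differ");
  let dot = 0;
  let normU = 0;
  let normV = 0;
  u.forEach((x, i) => {
    const y = v[i] ?? 0;
    dot += x * y;
    normU += x * x;
    normV += y * y;
  });
  const denom = Math.sqrt(normU) * Math.sqrt(normV);
  return denom === 0 ? 0 : dot / denom;
}

// A paraphrase shares few surface tokens, so the lexical score stays low.
const lexical = jaccard("The cat sat on the mat", "A feline rested on the rug");
console.log("lexical overlap:", lexical.toFixed(3)); // 0.222

// Vectors pointing in the same direction score close to 1 regardless of magnitude.
console.log("cosine:", cosineSimilarity([1, 2, 3], [2, 4, 6]).toFixed(3)); // 1.000
```

With real embeddings, the two paraphrased sentences would typically land close together in vector space, which is why the semantic mode can recognize them as similar when token overlap cannot.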
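Supply an embedding provider to enable semantic scoring:

```typescript
import { RelevanceScorer } from "@reaatech/rag-eval-metrics";

const scorer = new RelevanceScorer({
  // myEmbedAPI is a placeholder for your own embedding call
  embeddingProvider: { embed: async (texts) => myEmbedAPI(texts) },
});
```

Point the judge at any OpenAI-compatible gateway or local model via explicit config: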
```typescript
const suite = new EvaluationSuite({
  metrics: ["faithfulness", "relevance"],
  judge: { provider: "openai", base_url: "http://localhost:11434/v1", model: "llama3.1" },
});
```

| Package | Description |
|---|---|
| `@reaatech/rag-eval-core` | Canonical types, Zod schemas, and domain models |
| `@reaatech/rag-eval-metrics` | Metric scorers: faithfulness, relevance, context precision/recall, retrieval ranking, and answer correctness |
| `@reaatech/rag-eval-judge` | LLM-as-judge with calibration, consensus, and cost tracking |
| `@reaatech/rag-eval-cost` | Pricing, budgeting, and cost reporting |
| `@reaatech/rag-eval-gate` | Quality gates and CI regression checks |
| `@reaatech/rag-eval-dataset` | Dataset loading, validation, generation, and versioning |
| `@reaatech/rag-eval-suite` | Central orchestration engine |
| `@reaatech/rag-eval-mcp-server` | MCP server for agent-driven evaluation |
| `@reaatech/rag-eval-cli` | CLI entry point and commands |
| `@reaatech/rag-eval-observability` | Structured logging, tracing, and metrics |
- `ARCHITECTURE.md`: System design, package relationships, and data flows
- `AGENTS.md`: Coding conventions, tool architecture, and development guidelines
- `CONTRIBUTING.md`: Contribution workflow and release process