rag-eval-pack

Production-grade RAG evaluation toolkit with LLM-as-judge, cost accounting, and CI/CD regression gates.

This monorepo provides a composable suite of packages for evaluating Retrieval-Augmented Generation (RAG) systems across four core metrics — faithfulness, relevance, context precision, and context recall — with heuristic scoring, LLM-based judging, budget enforcement, and automated quality gating for CI pipelines.

Features

Generation metrics — faithfulness, relevance, and answer-correctness with fast lexical scorers; supply an EmbeddingProvider for paraphrase-aware semantic scoring
Retrieval metrics — context precision/recall plus ranking metrics (MRR, nDCG, precision/recall/hit@k) from retrieved vs. relevant chunk IDs
LLM-as-judge — multi-provider judging (Anthropic, OpenAI, Google, and any OpenAI-compatible gateway or local model) with calibration, consensus voting, and agreement-based confidence
Cost accounting — per-sample and per-run token tracking with budget enforcement and alert thresholds
Quality gates — threshold and baseline-comparison gates with noise-tolerance bands, warn/fail severity, and formatted CI output with exit codes
MCP server — three-layer tool API (judge.*, suite.*, gate.*) for agent-driven evaluation
Dataset management — multi-format loading, Zod validation, synthetic generation, and version tracking
Observability — structured Pino logging, OpenTelemetry tracing, and Prometheus-compatible metrics
Dual ESM/CJS — every package ships cjs and esm output for maximum compatibility
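
To make the retrieval ranking metrics concrete, here is a minimal sketch of how MRR, hit@k, precision/recall@k, and binary-relevance nDCG can be computed from retrieved vs. relevant chunk IDs. The function names are hypothetical and are not the `@reaatech/rag-eval-metrics` API. The sketch assumes each retrieved list contains unique IDs.

```typescript
// Illustrative only: hypothetical helpers, not the package's actual exports.

/** Reciprocal rank: 1 / position of the first relevant chunk, or 0 if none is retrieved. */
function reciprocalRank(retrieved: string[], relevant: Set<string>): number {
  const idx = retrieved.findIndex((id) => relevant.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

/** MRR: the mean of the reciprocal ranks over a set of queries. */
function meanReciprocalRank(runs: { retrieved: string[]; relevant: Set<string> }[]): number {
  if (runs.length === 0) return 0;
  return runs.reduce((sum, r) => sum + reciprocalRank(r.retrieved, r.relevant), 0) / runs.length;
}

/** hit@k: 1 if any of the top-k retrieved chunks is relevant, else 0. */
function hitAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  return retrieved.slice(0, k).some((id) => relevant.has(id)) ? 1 : 0;
}

/** precision@k: fraction of the top-k slots that hold a relevant chunk. */
function precisionAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  if (k <= 0) return 0;
  return retrieved.slice(0, k).filter((id) => relevant.has(id)).length / k;
}

/** recall@k: fraction of all relevant chunks that appear in the top k. */
function recallAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0;
  return retrieved.slice(0, k).filter((id) => relevant.has(id)).length / relevant.size;
}

/** Binary-relevance nDCG@k: DCG of this ranking divided by the DCG of an ideal ranking. */
function ndcgAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const dcg = retrieved
    .slice(0, k)
    .reduce((sum, id, i) => sum + (relevant.has(id) ? 1 / Math.log2(i + 2) : 0), 0);
  let idcg = 0;
  for (let i = 0; i < Math.min(k, relevant.size); i++) idcg += 1 / Math.log2(i + 2);
  return idcg === 0 ? 0 : dcg / idcg;
}

const retrievedIds = ["c3", "c1", "c7", "c2"];
const relevantIds = new Set(["c1", "c2"]);

console.log(reciprocalRank(retrievedIds, relevantIds));  // 0.5 (first relevant chunk at rank 2)
console.log(hitAtK(retrievedIds, relevantIds, 1));       // 0
console.log(precisionAtK(retrievedIds, relevantIds, 4)); // 0.5
console.log(recallAtK(retrievedIds, relevantIds, 2));    // 0.5
console.log(ndcgAtK(retrievedIds, relevantIds, 4));      // about 0.651
```

All of these need only chunk IDs, so they can be computed without any model call.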

Installation

Using the packages

Packages are published under the @reaatech scope and can be installed individually:

# Core types and schemas
pnpm add @reaatech/rag-eval-core

# Metric scorers
pnpm add @reaatech/rag-eval-metrics

# LLM judge
pnpm add @reaatech/rag-eval-judge

# Cost tracking
pnpm add @reaatech/rag-eval-cost

# Quality gates
pnpm add @reaatech/rag-eval-gate

# Dataset management
pnpm add @reaatech/rag-eval-dataset

# Central orchestrator
pnpm add @reaatech/rag-eval-suite

# MCP server
pnpm add @reaatech/rag-eval-mcp-server @modelcontextprotocol/sdk

# CLI tool
pnpm add @reaatech/rag-eval-cli

# Observability utilities
pnpm add @reaatech/rag-eval-observability

Contributing

# Clone the repository
git clone https://github.com/reaatech/rag-eval-pack.git
cd rag-eval-pack

# Install dependencies
pnpm install

# Build all packages
pnpm build

# Run the test suite
pnpm test

# Run linting
pnpm lint

Quick Start

Evaluate a RAG system's output in a few lines:

import { EvaluationSuite } from "@reaatech/rag-eval-suite";

const suite = new EvaluationSuite({
  metrics: ["faithfulness", "relevance", "context_precision", "context_recall"],
  judge: { model: "claude-opus-4-7" },
  gates: [
    { name: "min-faithfulness", type: "threshold", metric: "avg_faithfulness", operator: ">=", threshold: 0.85 },
  ],
  cost: { budget_limit: 10.00 },
});

const result = await suite.runFromFile("datasets/eval-samples.jsonl");

console.log("Overall score:", result.results.metrics.overall_score);
console.log("Faithfulness:", result.results.metrics.avg_faithfulness);
console.log("Total cost:", result.results.total_cost);
console.log("Gates passed:", result.gate_result?.passed);

Or use the CLI:

rag-eval-pack evaluate --dataset dataset.jsonl --output results.json
rag-eval-pack gate --results results.json --gates gates.yaml
rag-eval-pack report --results results.json --output report.md

See datasets/examples/ for sample datasets and configuration files.
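
The threshold gate in the Quick Start config can be read as a plain comparison between an aggregate metric and a fixed bound. The sketch below illustrates that logic, plus a noise-tolerance band for baseline comparison and a CI exit code. The types and function names are hypothetical stand-ins that mirror the config shape shown above. The real schema and behavior live in `@reaatech/rag-eval-gate` and may differ.

```typescript
// Illustrative only: hypothetical types, not the package's actual API.
type Operator = ">=" | ">" | "<=" | "<";

interface ThresholdGate {
  name: string;
  metric: string;
  operator: Operator;
  threshold: number;
  severity?: "warn" | "fail"; // assumed to default to "fail" in this sketch
}

interface GateOutcome {
  gate: string;
  passed: boolean;
  severity: "warn" | "fail";
  actual: number | undefined;
}

function compare(actual: number, op: Operator, threshold: number): boolean {
  switch (op) {
    case ">=": return actual >= threshold;
    case ">": return actual > threshold;
    case "<=": return actual <= threshold;
    case "<": return actual < threshold;
  }
}

/** Evaluate each gate against the aggregate metrics; a missing metric counts as a failure. */
function evaluateThresholdGates(metrics: Record<string, number>, gates: ThresholdGate[]): GateOutcome[] {
  return gates.map((g) => {
    const actual: number | undefined = metrics[g.metric];
    const passed = actual !== undefined && compare(actual, g.operator, g.threshold);
    return { gate: g.name, passed, severity: g.severity ?? "fail", actual };
  });
}

/** Baseline comparison: only a drop larger than `tolerance` counts as a regression. */
function regressedBeyondTolerance(current: number, baseline: number, tolerance: number): boolean {
  return baseline - current > tolerance;
}

/** CI exit code: non-zero only when a gate with "fail" severity did not pass. */
function exitCodeFor(outcomes: GateOutcome[]): number {
  return outcomes.some((o) => !o.passed && o.severity === "fail") ? 1 : 0;
}

const gateOutcomes = evaluateThresholdGates(
  { avg_faithfulness: 0.82, avg_relevance: 0.9 },
  [
    { name: "min-faithfulness", metric: "avg_faithfulness", operator: ">=", threshold: 0.85 },
    { name: "min-relevance", metric: "avg_relevance", operator: ">=", threshold: 0.95, severity: "warn" },
  ],
);

console.log(gateOutcomes.map((o) => `${o.gate}: ${o.passed ? "pass" : o.severity}`));
// [ 'min-faithfulness: fail', 'min-relevance: warn' ]
console.log("exit code:", exitCodeFor(gateOutcomes));      // exit code: 1
console.log(regressedBeyondTolerance(0.84, 0.85, 0.02));   // false: the drop is inside the band
```

The tolerance band keeps small run-to-run score fluctuations from failing a build, while a warn-severity gate reports a miss without changing the exit code.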

Scoring modes

Each generation metric can run at three levels of fidelity and cost — pick per use case:

Mode	How it works	Cost	Catches paraphrase?
Lexical (default)	Word/character overlap + intent heuristics. Deterministic, no network, no key.	Free	No — surface form only
Semantic	Cosine similarity of embeddings via an `EmbeddingProvider` you supply.	Embedding API cost	Yes
LLM judge	An LLM rates the sample with calibration and optional consensus.	Token cost	Yes, with reasoning

Lexical scoring is reported as lexical_similarity; semantic_similarity is populated only when an embedding provider is configured. Use lexical for fast pre-commit smoke checks, semantic for paraphrase-sensitive metrics, and the judge for the final quality bar.

import { RelevanceScorer } from "@reaatech/rag-eval-metrics";

// myEmbedAPI is a placeholder for your own embedding client.
const scorer = new RelevanceScorer({
  embeddingProvider: { embed: async (texts) => myEmbedAPI(texts) },
});
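
To illustrate why the lexical mode misses paraphrase while the semantic mode catches it, here is a minimal sketch that contrasts word-set overlap with cosine similarity. Both functions are simplified stand-ins, not the scorers shipped in `@reaatech/rag-eval-metrics`, and the vectors are made-up values standing in for real embedding output.

```typescript
// Illustrative only: simplified stand-ins for lexical and semantic scoring.

/** Lowercase alphanumeric word tokens; punctuation is dropped. */
function tokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().match(/[a-z0-9]+/g) ?? []);
}

/** Jaccard overlap of the two word sets: shared words / all distinct words. */
function lexicalOverlap(a: string, b: string): number {
  const ta = tokenize(a);
  const tb = tokenize(b);
  if (ta.size === 0 && tb.size === 0) return 1;
  const shared = Array.from(ta).filter((t) => tb.has(t)).length;
  return shared / (ta.size + tb.size - shared);
}

/** Cosine similarity of two equal-length vectors; 0 if either vector is all zeros. */
function cosineSimilarity(u: number[], v: number[]): number {
  if (u.length !== v.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let normU = 0;
  let normV = 0;
  for (let i = 0; i < u.length; i++) {
    dot += u[i] * v[i];
    normU += u[i] * u[i];
    normV += v[i] * v[i];
  }
  return normU === 0 || normV === 0 ? 0 : dot / Math.sqrt(normU * normV);
}

const expectedAnswer = "The refund takes five days.";
const paraphrasedAnswer = "Customers get their money back within a week.";

// The two sentences share no words, so surface-form overlap is zero.
console.log(lexicalOverlap(expectedAnswer, paraphrasedAnswer)); // 0

// Made-up vectors: an embedding model would place the two sentences close together.
console.log(cosineSimilarity([0.9, 0.1, 0.3], [0.8, 0.2, 0.4])); // about 0.98
```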

Point the judge at any OpenAI-compatible gateway or local model via explicit config:

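import { EvaluationSuite } from "@reaatech/rag-eval-suite";

// base_url can be any OpenAI-compatible endpoint, for example a local Ollama server.
const suite = new EvaluationSuite({
  metrics: ["faithfulness", "relevance"],
  judge: { provider: "openai", base_url: "http://localhost:11434/v1", model: "llama3.1" },
});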
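
Whichever model backs the judge, consensus voting combines several judge scores into one. This README does not describe the algorithm used by `@reaatech/rag-eval-judge`, so the sketch below shows one common scheme under stated assumptions: scores lie in [0, 1], the median is the consensus score, and confidence is 1 minus the spread between the highest and lowest vote. All names are hypothetical.

```typescript
// Illustrative only: one possible consensus scheme, not the package's implementation.
interface ConsensusResult {
  score: number;      // median of the votes
  confidence: number; // 1 - (max - min), clamped to [0, 1]
  votes: number[];    // votes in ascending order
}

/** Combine judge scores in [0, 1] into a consensus score with agreement-based confidence. */
function consensusOf(votes: number[]): ConsensusResult {
  if (votes.length === 0) throw new Error("at least one vote is required");
  const sorted = [...votes].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const score = sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  const spread = sorted[sorted.length - 1] - sorted[0];
  const confidence = Math.min(1, Math.max(0, 1 - spread));
  return { score, confidence, votes: sorted };
}

// Three judges that mostly agree: high confidence.
const agreeing = consensusOf([0.9, 0.8, 0.85]);
console.log(agreeing.score);                 // 0.85
console.log(agreeing.confidence.toFixed(2)); // "0.90"

// Two judges that disagree sharply: low confidence flags the sample for review.
const split = consensusOf([0.2, 0.9]);
console.log(split.score.toFixed(2));      // "0.55"
console.log(split.confidence.toFixed(2)); // "0.30"
```

The median is less sensitive than the mean to a single outlier judge, which is why it is used here.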

Packages

Package	Description
`@reaatech/rag-eval-core`	Canonical types, Zod schemas, and domain models
`@reaatech/rag-eval-metrics`	Metric scorers: faithfulness, relevance, context precision/recall, retrieval ranking, and answer correctness
`@reaatech/rag-eval-judge`	LLM-as-judge with calibration, consensus, and cost tracking
`@reaatech/rag-eval-cost`	Pricing, budgeting, and cost reporting
`@reaatech/rag-eval-gate`	Quality gates and CI regression checks
`@reaatech/rag-eval-dataset`	Dataset loading, validation, generation, and versioning
`@reaatech/rag-eval-suite`	Central orchestration engine
`@reaatech/rag-eval-mcp-server`	MCP server for agent-driven evaluation
`@reaatech/rag-eval-cli`	CLI entry point and commands
`@reaatech/rag-eval-observability`	Structured logging, tracing, and metrics

Documentation

ARCHITECTURE.md — System design, package relationships, and data flows
AGENTS.md — Coding conventions, tool architecture, and development guidelines
CONTRIBUTING.md — Contribution workflow and release process

License

MIT
