rag-eval-pack

Production-grade RAG evaluation toolkit with LLM-as-judge, cost accounting, and CI/CD regression gates.

This monorepo provides a composable suite of packages for evaluating Retrieval-Augmented Generation (RAG) systems across four core metrics — faithfulness, relevance, context precision, and context recall — with heuristic scoring, LLM-based judging, budget enforcement, and automated quality gating for CI pipelines.

Features

  • Generation metrics — faithfulness, relevance, and answer-correctness with fast lexical scorers; supply an EmbeddingProvider for paraphrase-aware semantic scoring
  • Retrieval metrics — context precision/recall plus ranking metrics (MRR, nDCG, precision/recall/hit@k) from retrieved vs. relevant chunk IDs
  • LLM-as-judge — multi-provider judging (Anthropic, OpenAI, Google, and any OpenAI-compatible gateway or local model) with calibration, consensus voting, and agreement-based confidence
  • Cost accounting — per-sample and per-run token tracking with budget enforcement and alert thresholds
  • Quality gates — threshold and baseline-comparison gates with noise-tolerance bands, warn/fail severity, and formatted CI output with exit codes
  • MCP server — three-layer tool API (judge.*, suite.*, gate.*) for agent-driven evaluation
  • Dataset management — multi-format loading, Zod validation, synthetic generation, and version tracking
  • Observability — structured Pino logging, OpenTelemetry tracing, and Prometheus-compatible metrics
  • Dual ESM/CJS — every package ships cjs and esm output for maximum compatibility
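As a rough illustration of the retrieval ranking metrics listed above, the sketch below computes reciprocal rank, precision/recall@k, and binary-relevance nDCG@k from retrieved vs. relevant chunk IDs. The function names are hypothetical and are not taken from @reaatech/rag-eval-metrics; the sketch also assumes retrieved IDs are unique and that relevance is binary.

```typescript
// Illustrative helpers only; not the @reaatech/rag-eval-metrics API.
// Assumes `retrieved` is ordered best-first and contains no duplicate IDs.

/** Reciprocal rank of the first relevant chunk (0 if none was retrieved). */
function reciprocalRank(retrieved: string[], relevant: Set<string>): number {
  const idx = retrieved.findIndex((id) => relevant.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

/** MRR: the mean of reciprocal ranks over a batch of queries. */
function meanReciprocalRank(
  queries: { retrieved: string[]; relevant: Set<string> }[],
): number {
  if (queries.length === 0) return 0;
  const total = queries.reduce(
    (sum, q) => sum + reciprocalRank(q.retrieved, q.relevant),
    0,
  );
  return total / queries.length;
}

/** Fraction of the top-k retrieved chunks that are relevant. */
function precisionAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  if (k <= 0) return 0;
  const hits = retrieved.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / k;
}

/** Fraction of all relevant chunks that appear in the top k. */
function recallAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0;
  const hits = retrieved.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / relevant.size;
}

/** Binary-relevance nDCG@k: discounted gain divided by the ideal ordering's gain. */
function ndcgAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const dcg = retrieved
    .slice(0, k)
    .reduce((sum, id, i) => sum + (relevant.has(id) ? 1 / Math.log2(i + 2) : 0), 0);
  let idcg = 0;
  for (let i = 0; i < Math.min(relevant.size, k); i++) {
    idcg += 1 / Math.log2(i + 2);
  }
  return idcg === 0 ? 0 : dcg / idcg;
}
```

Hit@k follows directly from the same inputs: it is 1 when `recallAtK` is greater than 0 and 0 otherwise.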

Installation

Using the packages

Packages are published under the @reaatech scope and can be installed individually:

# Core types and schemas
pnpm add @reaatech/rag-eval-core

# Metric scorers
pnpm add @reaatech/rag-eval-metrics

# LLM judge
pnpm add @reaatech/rag-eval-judge

# Cost tracking
pnpm add @reaatech/rag-eval-cost

# Quality gates
pnpm add @reaatech/rag-eval-gate

# Dataset management
pnpm add @reaatech/rag-eval-dataset

# Central orchestrator
pnpm add @reaatech/rag-eval-suite

# MCP server
pnpm add @reaatech/rag-eval-mcp-server @modelcontextprotocol/sdk

# CLI tool
pnpm add @reaatech/rag-eval-cli

# Observability utilities
pnpm add @reaatech/rag-eval-observability

Contributing

# Clone the repository
git clone https://github.com/reaatech/rag-eval-pack.git
cd rag-eval-pack

# Install dependencies
pnpm install

# Build all packages
pnpm build

# Run the test suite
pnpm test

# Run linting
pnpm lint

Quick Start

Evaluate a RAG system's output in a few lines:

import { EvaluationSuite } from "@reaatech/rag-eval-suite";

const suite = new EvaluationSuite({
  metrics: ["faithfulness", "relevance", "context_precision", "context_recall"],
  judge: { model: "claude-opus-4-7" },
  gates: [
    { name: "min-faithfulness", type: "threshold", metric: "avg_faithfulness", operator: ">=", threshold: 0.85 },
  ],
  cost: { budget_limit: 10.00 },
});

const result = await suite.runFromFile("datasets/eval-samples.jsonl");

console.log("Overall score:", result.results.metrics.overall_score);
console.log("Faithfulness:", result.results.metrics.avg_faithfulness);
console.log("Total cost:", result.results.total_cost);
console.log("Gates passed:", result.gate_result?.passed);
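To make the `cost: { budget_limit }` option more concrete, here is a minimal sketch of per-sample token tracking with a hard budget and an alert threshold. Everything in it is an assumption for illustration: the `createBudgetTracker` helper, the `alertRatio` parameter, and the prices are made up, and the real @reaatech/rag-eval-cost package may work differently.

```typescript
// Illustrative sketch only; not the @reaatech/rag-eval-cost API.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

interface Pricing {
  inputPerMTok: number; // price per million input tokens (made-up unit)
  outputPerMTok: number; // price per million output tokens
}

function createBudgetTracker(pricing: Pricing, budgetLimit: number, alertRatio = 0.8) {
  let spent = 0;
  return {
    /** Adds one sample's cost, or throws if it would push the run over budget. */
    record(usage: TokenUsage): { cost: number; total: number; alert: boolean } {
      const cost =
        (usage.inputTokens * pricing.inputPerMTok +
          usage.outputTokens * pricing.outputPerMTok) /
        1_000_000;
      if (spent + cost > budgetLimit) {
        throw new Error(
          `Budget exceeded: ${(spent + cost).toFixed(4)} > ${budgetLimit.toFixed(4)}`,
        );
      }
      spent += cost;
      // Alert once spend crosses the configured fraction of the budget.
      return { cost, total: spent, alert: spent >= budgetLimit * alertRatio };
    },
    /** Running total for the whole run. */
    total: (): number => spent,
  };
}
```

Checking the budget before adding the cost means a rejected sample never counts against the run total, which keeps the reported total consistent with the samples that were actually evaluated.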

Or use the CLI:

rag-eval-pack evaluate --dataset dataset.jsonl --output results.json
rag-eval-pack gate --results results.json --gates gates.yaml
rag-eval-pack report --results results.json --output report.md

See datasets/examples/ for sample datasets and configuration files.
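The gate step can be pictured as a pure function from metrics to an exit code. In the sketch below, the threshold gate shape is copied from the Quick Start config; the baseline gate shape, the `tolerance` and `severity` fields, and all function names are assumptions for illustration rather than the @reaatech/rag-eval-gate API.

```typescript
// Illustrative sketch only; not the @reaatech/rag-eval-gate API.
type Operator = ">=" | ">" | "<=" | "<";
type Severity = "warn" | "fail";

interface ThresholdGate {
  name: string;
  type: "threshold";
  metric: string;
  operator: Operator;
  threshold: number;
  severity?: Severity; // assumed field; treated as "fail" when omitted
}

interface BaselineGate {
  name: string;
  type: "baseline"; // assumed shape for a baseline-comparison gate
  metric: string;
  tolerance: number; // noise band: allowed drop below the baseline
  severity?: Severity;
}

type Gate = ThresholdGate | BaselineGate;
type Metrics = Record<string, number>;

function compare(value: number, op: Operator, target: number): boolean {
  switch (op) {
    case ">=": return value >= target;
    case ">": return value > target;
    case "<=": return value <= target;
    case "<": return value < target;
  }
}

function gatePasses(gate: Gate, current: Metrics, baseline: Metrics): boolean {
  const value = current[gate.metric];
  if (value === undefined) return false; // a missing metric fails the gate
  if (gate.type === "threshold") {
    return compare(value, gate.operator, gate.threshold);
  }
  const reference = baseline[gate.metric];
  if (reference === undefined) return true; // no baseline yet, nothing to regress from
  return value >= reference - gate.tolerance;
}

/** Returns 0 when no "fail"-severity gate is violated, 1 otherwise. */
function exitCodeFor(gates: Gate[], current: Metrics, baseline: Metrics = {}): number {
  const blocking = gates.filter(
    (g) => (g.severity ?? "fail") === "fail" && !gatePasses(g, current, baseline),
  );
  return blocking.length === 0 ? 0 : 1;
}
```

In a CI job, a wrapper could pass the result to `process.exit(...)` so that a violated "fail" gate breaks the build while "warn" gates only show up in the report.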

Scoring modes

Each generation metric can run at three levels of fidelity and cost — pick per use case:

| Mode | How it works | Cost | Catches paraphrase? |
| --- | --- | --- | --- |
| Lexical (default) | Word/character overlap + intent heuristics. Deterministic, no network, no key. | Free | No, surface form only |
| Semantic | Cosine similarity of embeddings via an EmbeddingProvider you supply. | Embedding API cost | Yes |
| LLM judge | An LLM rates the sample with calibration and optional consensus. | Token cost | Yes, with reasoning |

Lexical scoring is reported as lexical_similarity; semantic_similarity is populated only when an embedding provider is configured. Use lexical for fast pre-commit smoke checks, semantic for paraphrase-sensitive metrics, and the judge for the final quality bar.

import { RelevanceScorer } from "@reaatech/rag-eval-metrics";

// myEmbedAPI is a placeholder for your own embedding call.
const scorer = new RelevanceScorer({
  embeddingProvider: { embed: async (texts) => myEmbedAPI(texts) },
});
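To show why lexical scoring misses paraphrases while embedding-based scoring can catch them, the sketch below contrasts a simple token-overlap score with cosine similarity. Both functions are stand-ins written for this example: Jaccard overlap is only one possible form of "word overlap", and neither function is taken from @reaatech/rag-eval-metrics.

```typescript
// Illustrative stand-ins only; not the scorers shipped in @reaatech/rag-eval-metrics.

/** Lowercased alphanumeric tokens, de-duplicated. */
function tokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().match(/[a-z0-9]+/g) ?? []);
}

/** Jaccard overlap of token sets: shared tokens divided by all distinct tokens. */
function jaccard(a: string, b: string): number {
  const ta = tokenize(a);
  const tb = tokenize(b);
  const union = new Set([...ta, ...tb]);
  if (union.size === 0) return 0;
  let shared = 0;
  for (const token of ta) {
    if (tb.has(token)) shared++;
  }
  return shared / union.size;
}

/** Cosine similarity of two equal-length embedding vectors. */
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Two sentences such as "The car is fast" and "This automobile is quick" share few tokens, so the overlap score is low, whereas a reasonable embedding model would typically place their vectors close together and yield a high cosine similarity.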

Point the judge at any OpenAI-compatible gateway or local model via explicit config:

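import { EvaluationSuite } from "@reaatech/rag-eval-suite";

// base_url targets a local OpenAI-compatible endpoint.
const suite = new EvaluationSuite({
  metrics: ["faithfulness", "relevance"],
  judge: { provider: "openai", base_url: "http://localhost:11434/v1", model: "llama3.1" },
});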
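When several judges score the same sample, consensus and agreement-based confidence can be computed in many ways. The sketch below shows one plausible reading: take the median score as the consensus and report the fraction of judges within a tolerance of it as the confidence. The `JudgeVote` type, the `consensus` function, and the default tolerance are assumptions for illustration and may not match how @reaatech/rag-eval-judge does it.

```typescript
// Illustrative sketch only; not the @reaatech/rag-eval-judge API.
interface JudgeVote {
  judge: string; // label for the model that produced the score
  score: number; // assumed to be normalized to [0, 1]
}

interface ConsensusResult {
  score: number; // median of the judges' scores
  confidence: number; // share of judges within `tolerance` of the median
}

function consensus(votes: JudgeVote[], tolerance = 0.1): ConsensusResult {
  if (votes.length === 0) {
    throw new Error("At least one judge vote is required");
  }
  const sorted = votes.map((v) => v.score).sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median =
    sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  const agreeing = sorted.filter((s) => Math.abs(s - median) <= tolerance).length;
  return { score: median, confidence: agreeing / sorted.length };
}
```

Using the median rather than the mean keeps a single outlier judge from dragging the consensus score, while the confidence value still records that the judges disagreed.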

Packages

| Package | Description |
| --- | --- |
| @reaatech/rag-eval-core | Canonical types, Zod schemas, and domain models |
| @reaatech/rag-eval-metrics | Metric scorers: faithfulness, relevance, context precision/recall, retrieval ranking, and answer correctness |
| @reaatech/rag-eval-judge | LLM-as-judge with calibration, consensus, and cost tracking |
| @reaatech/rag-eval-cost | Pricing, budgeting, and cost reporting |
| @reaatech/rag-eval-gate | Quality gates and CI regression checks |
| @reaatech/rag-eval-dataset | Dataset loading, validation, generation, and versioning |
| @reaatech/rag-eval-suite | Central orchestration engine |
| @reaatech/rag-eval-mcp-server | MCP server for agent-driven evaluation |
| @reaatech/rag-eval-cli | CLI entry point and commands |
| @reaatech/rag-eval-observability | Structured logging, tracing, and metrics |

Documentation

  • ARCHITECTURE.md — System design, package relationships, and data flows
  • AGENTS.md — Coding conventions, tool architecture, and development guidelines
  • CONTRIBUTING.md — Contribution workflow and release process

License

MIT

About

RAG evaluation toolkit — faithfulness, answer relevance, context precision/recall, cost accounting, CI gates. Pairs with hybrid-rag-qdrant and agent-eval-harness.
