Skip to content

Corbell-AI/evalmonkey

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

12 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

EvalMonkey Logo

EvalMonkey

Agent Benchmarking & Chaos Engineering Framework
"Don't just trust your agent. Prove it works. Then break it."

EvalMonkey Demo

Overview

Agents are fundamentally non-deterministic. They rely on external APIs, tool loops, and massive context windows. EvalMonkey is the ultimate, strictly local, open-source execution harness that enables developers to:

  1. ๐ŸŽฏ Benchmark Capabilities: Run standard Agent benchmark datasets against your agent endpoints natively!
  2. ๐Ÿ”ฅ Inject Chaos: Mutate headers, spike latency, and corrupt schemas dynamically to prove true resilience.
  3. ๐Ÿ“ˆ Track Production Reliability: Locally store all scores to visualize a single Production Reliability metric that aggregates capability plus chaos-resilience over time!

EvalMonkey natively supports evaluating ANY LLM: AWS Bedrock, Azure, GCP, OpenAI, and Ollama.

Note on API Keys: If you have special setups that generate long-lived, static API keys for Bedrock, Azure, or GCP, simply supply them in the .env! EvalMonkey seamlessly supports both standard IAM / Service Account credential flows and long-term stateless authentication strings.

โšก๏ธ Quick Start

git clone https://github.com/Corbell-AI/evalmonkey
cd evalmonkey
pip install -e .

Step 1 โ€” Run this once inside your agent's project folder:

cd /your/crewai-project       # wherever your agent lives
evalmonkey init --framework crewai --name "My Research Crew" --port 8000

This auto-generates a pre-filled evalmonkey.yaml with the correct request/response format for your framework. Supported: crewai, langchain, openai, bedrock, autogen, ollama, strands, custom.

Step 2 โ€” Edit the two settings that matter:

# evalmonkey.yaml โ€” generated for CrewAI
agent:
  name: "My Research Crew"
  framework: crewai
  url: http://localhost:8000/chat       # โ† where your agent listens
  request_key: message
  response_path: reply

  # โ† EvalMonkey will start this for you automatically!
  # It spawns the process, waits for it to turn on, benchmarks, then stops it.
  agent_command: "python src/agent.py"  # or: uvicorn src.agent:app --port 8000
  agent_startup_wait: 3                 # seconds to wait after launch

eval_model: "gpt-4o"   # โ† the LLM used as benchmark judge

Step 3 โ€” Run everything. EvalMonkey starts your agent, benchmarks it, then stops it:

evalmonkey run-benchmark --scenario mmlu
evalmonkey run-chaos --scenario mmlu --chaos-profile client_prompt_injection
evalmonkey history --scenario mmlu

EvalMonkey discovers evalmonkey.yaml from the current working directory โ€” the same convention used by pytest, promptfoo, and docker-compose. Run all commands from your agent's project folder.

๐Ÿค Works With Any Agent โ€” No Code Changes Required

EvalMonkey talks to your agent over plain HTTP. As long as your agent is running and has an endpoint URL, you're done. That's it.

# Point EvalMonkey at your existing running agent
evalmonkey run-benchmark --scenario mmlu --target-url http://localhost:8000/chat

Your agent returns a different JSON format? Use two flags to map any request/response shape:

Flag What it does Example
--request-key Which key to send the question under message, prompt, input
--response-path Dot-path to extract the answer from output.text, choices.0.message.content, result
# CrewAI agent that takes {"message":""} and returns {"reply":""}
evalmonkey run-benchmark --scenario mmlu \
  --target-url http://localhost:8000/chat \
  --request-key message \
  --response-path reply

# OpenAI-compatible endpoint returning {"choices":[{"message":{"content":""}}]}
evalmonkey run-benchmark --scenario arc \
  --target-url http://localhost:8000/v1/chat/completions \
  --request-key content \
  --response-path choices.0.message.content

Supported Frameworks

Framework Notes
๐Ÿฆœ LangChain Any Chain, LCEL pipe, or AgentExecutor behind FastAPI
๐Ÿค– CrewAI Any Crew behind a /chat or custom endpoint
โœจ OpenAI Agents SDK Native OpenAI Chat Completions format supported via --response-path
โ˜๏ธ AWS Bedrock / Agent Core Any Bedrock endpoint, IAM or long-lived key
๐Ÿงฉ Microsoft AutoGen Any ConversableAgent behind HTTP
๐Ÿฆ™ Ollama Running locally at http://localhost:11434
๐Ÿงต Strands SDK Built-in sample apps included
๐ŸŒ Any HTTP Agent Flask, Express.js, Go โ€” if it accepts POST it works
๐Ÿ“ฆ Don't have an HTTP endpoint yet? Use our ready-made thin adapters (click to expand)

Copy the relevant file from apps/framework_adapters/ next to your agent code, swap in your Crew/Chain/Agent, and run it. No changes needed to EvalMonkey.

  • langchain_adapter.py โ€” wraps any LangChain chain
  • crewai_adapter.py โ€” wraps any CrewAI Crew
  • openai_agents_adapter.py โ€” wraps OpenAI Agents SDK
  • bedrock_agentcore_adapter.py โ€” wraps AWS Bedrock Converse API
  • autogen_adapter.py โ€” wraps Microsoft AutoGen Crew

Each adapter is ~40 lines and exposes a /solve endpoint on localhost.


๐ŸŒ Supported Standard Benchmarks

EvalMonkey natively supports 10 off-the-shelf benchmark datasets pulled directly from HuggingFace. List them anytime via the CLI:

evalmonkey list-benchmarks
Scenario ID Description
gsm8k Grade School Math word problems focusing on multi-step reasoning capabilities.
xlam XLAM Function Calling 60k: Tests agent tool execution logic and parameter structuring.
swe-bench SWE-Bench: Resolving real-world GitHub issues for coding agents.
gaia-benchmark GAIA: General AI Assistants testing on real-world web/tool multi-step tasks.
webarena WebArena: Highly interactive computer usage and browser manipulation.
human-eval HumanEval: Fundamental Python code generation from docstrings.
mmlu Massive Multitask Language Understanding: Broad generalized knowledge across 57 subjects.
arc AI2 Reasoning Challenge: Complex grade-school science questions.
truthfulqa TruthfulQA: Tests whether an agent mimics human falsehoods or hallucination.
hella-swag HellaSwag: Commonsense natural language inferences.
๐Ÿ› ๏ธ Build Your Own Custom Benchmarks (click to expand)

Yes, people absolutely bring their own datasets! The most powerful way to test an agent is to grab 10-50 real questions from your production logs, dump them into a CSV, and evaluate your agent against them.

EvalMonkey natively supports auto-parsing .yaml, .json, and .csv files!

You don't need any complex ETL pipelines. Just drop a file (e.g. evals.csv, evals.json, or custom_evals.yaml) in your execution directory and pass it to EvalMonkey!

1. CSV Example (evals.csv)

If using a CSV, just make sure you have the columns id and expected_behavior_rubric. Any other column you add (like question, topic, image_url) will be automatically gathered and sent in the JSON payload directly to your agent!

id expected_behavior_rubric question
get_benefits Must return the URL linking to the company hr portal Where do I sign up for medical benefits?
time_off Provide the exact number of standard vacation days (15) How many days of PTO do I get?
evalmonkey run-benchmark --scenario get_benefits --eval-file evals.csv

2. JSON / YAML Example (evals.json)

If you use JSON or YAML, you must nest the agent payload keys explicitly under an input_payload dict object:

[
  {
    "id": "onboarding_query",
    "description": "Test HR agent's ability to return the onboarding link.",
    "expected_behavior_rubric": "Must contain exactly the URL https://hr.example.com/benefits",
    "input_payload": {
      "question": "Where do I sign up for benefits?"
    }
  }
]
evalmonkey run-benchmark --scenario onboarding_query --eval-file evals.json

๐Ÿ› ๏ธ Experiences

Experience 1: Local Sample Agents (Single Command Start)

Easiest Experience: Test our built-in sample agents with a single command! EvalMonkey will spawn the sample agent in the background automatically and run the benchmark.

# Run against just the first 5 records
evalmonkey run-benchmark --scenario gsm8k --sample-agent rag_app

# Run a statistically robust test against 50 different records!
evalmonkey run-benchmark --scenario gsm8k --sample-agent rag_app --limit 50

Metrics Output:

โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ Benchmark Results                                        โ”‚
โ”‚ โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ โ”‚
โ”‚ Scenario  gsm8k                                          โ”‚
โ”‚ Score     90/100 (Diff: +5)                              โ”‚
โ”‚ Previous  85/100                                         โ”‚
โ”‚ Reasoning Agent correctly utilized calculator for ...    โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ

Experience 2: Benchmarking Your Custom Local Agents

Provide your own API target!

evalmonkey run-benchmark --scenario mmlu --target-url http://localhost:8000/my-custom-agent

๐Ÿ’ก Why Chaos Benchmark Your Agents?

Resiliency and Reliability are arguably the most crucial components of any highly distributed system. Multi-agent workflowsโ€”with their isolated contexts, recursive tool calls, and cascading API dependenciesโ€”behave fundamentally identically to microservice architectures! As your agents push logic out to the real world, you must securely benchmark against brutal realities, dropped schemas, and malicious payload injections.


Experience 3: Injecting AI-Specific Chaos Engineering (Next-Gen)

EvalMonkey goes far beyond standard network testing by deeply assessing your agent's Production Resilience! We support two distinct classes of Chaos injections depending on how deeply you wish to test:

Class A: Client-Side Injections (Zero Code Changes Required)

You don't need to change a single line of your target agent's code for these tests! EvalMonkey intercepts the benchmark dataset payload before transmission and maliciously damages the HTTP body so you can measure your agent's LLM fallbacks against bad actors!

Profile Description
client_prompt_injection Appends adversarial "IGNORE PREVIOUS INSTRUCTIONS" jailbreaks to test system-message robustness.
client_typo_injection Heavily obfuscates spelled words to test your LLM's semantic inference flexibility.
client_schema_mutation Alters incoming JSON schema keys (e.g. question -> query) to verify robust API strictness handling without crashing.
client_language_shift Radically changes request instructions to attempt safety bypasses.
client_payload_bloat Floods the payload with thousands of characters to natively test token limits and prompt truncation crash safety.
client_empty_payload Sends entirely blank strings to verify graceful rejection handling.
client_context_truncation Maliciously slices the request text exactly in half.
# Testing a single prompt injection against your agent without modifying your code!
evalmonkey run-chaos --scenario arc --chaos-profile client_prompt_injection

# ๐ŸŒช๏ธ INJECT ALL 7 CLIENT MUTATIONS SEQUENTIALLY
evalmonkey run-chaos-suite --scenario gsm8k --limit 3

Class B: Agent-Side Injections (Middleware Catch Required)

To deeply verify context truncation, multi-step LLM hallucination recovery, and tool back-offs, EvalMonkey attaches the X-Chaos-Profile header over HTTP. You write 3 lines of logic in your FastAPI/Flask proxy to trigger the exact system breakage! (See our Sample Apps for reference!)

Profile Description
schema_error Simulates internal tools crashing and returning completely malformed strings mid-generation.
latency_spike Simulates violent HTTP lag, letting you verify recursive timeouts.
rate_limit_429 Simulates your core LLM provider suddenly hitting API Request Limits mid-workflow.
context_overflow Safely floods context sizes natively to test intelligent prompt truncation.
hallucinated_tool Maliciously injects fake data into tool memory to test your agent's logic verification steps.
empty_response Completely drops state parameters abruptly.
# Testing a server-side framework context overflow!
evalmonkey run-chaos --scenario mmlu --sample-agent research_agent --chaos-profile context_overflow

Metrics Output:

โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ ๐Ÿ”ฅ Chaos Engineering Report ๐Ÿ”ฅ                             โ”‚
โ”‚ โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ โ”‚
โ”‚ Scenario:                  xlam                          โ”‚
โ”‚ Chaos Profile:             schema_error                  โ”‚
โ”‚ Baseline Capability Score: 90                            โ”‚
โ”‚ Post-Chaos Resilience:     30                            โ”‚
โ”‚ Status:                    DEGRADED CAPABILITY           โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ

๐Ÿค– MCP Server (Cursor & Claude Integration)

EvalMonkey natively ships with a Model Context Protocol (MCP) server! This allows AI IDEs (like Cursor) or external agents (like Claude Desktop) to invoke EvalMonkey tools automatically while they build your agent.

Setting Up in Claude Desktop / Cursor

Add the following to your MCP configuration file (e.g. claude_desktop_config.json):

{
  "mcpServers": {
    "evalmonkey": {
      "command": "evalmonkey",
      "args": ["serve-mcp"]
    }
  }
}

Once connected, your AI assistant will gain the ability to list benchmarks, trigger full evaluation runs, inject chaos payload mutators, and pull historical trends entirely autonomously while helping you write your agent!


Experience 4: Historical Production Reliability

Check your agent's reliability trends over time!

evalmonkey history --scenario gsm8k

Metrics Output:

๐Ÿ“ˆ Historical Trend for: gsm8k ๐Ÿ“ˆ
โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ Date             โ”‚ Run Type โ”‚ Score โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 2026-04-16 18:32 โ”‚ BASELINE โ”‚    85 โ”‚
โ”‚ 2026-04-16 18:33 โ”‚ BASELINE โ”‚    90 โ”‚
โ”‚ 2026-04-16 18:35 โ”‚ CHAOS    โ”‚    30 โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ

๐Ÿš€ Production Reliability Metric: 66.0 / 100.0
(Calculated as 60% of most recent baseline capability + 40% most recent chaos resilience)

๐Ÿ“„ License

This project is licensed under Apache 2.0. See the LICENSE file for details.

About

Agent Benchmarking & Chaos Engineering Framework

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages