RL-based software engineering evaluation environment for benchmarking coding agents across debugging, concurrency, API integration, and hidden test scenarios.
CodeArena is a modular evaluation platform designed to assess the performance of LLM-generated software solutions on realistic software engineering tasks. The system simulates how frontier coding agents are benchmarked using automated grading pipelines, hidden test execution, runtime validation, and failure-pattern analysis.
The environment evaluates generated code across multiple dimensions including:
- correctness
- execution reliability
- concurrency handling
- retry logic
- hidden edge-case performance
- runtime behavior
CodeArena is inspired by modern AI evaluation systems used to measure the reasoning and coding capabilities of advanced software agents. Its key features include:
- Automated grading for software engineering tasks
- Hidden test evaluation and edge-case validation
- Runtime monitoring and execution analysis
- Concurrency and state-management evaluation
- API retry and failure-handling scenarios
- Failure-pattern analysis across generated solutions
- Modular benchmark architecture for extensibility
Submissions move through the following evaluation pipeline:

Coding Task
↓
Submission Runner
↓
Automated Grader
↓
Hidden Test Evaluation
↓
Failure Analysis + Metrics
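This flow can be expressed as a small driver loop. The sketch below is illustrative only; the imported helper names (load_tasks, run_submission, grade, analyze_failures) are assumptions made for this example rather than the exact CodeArena API.

```python
# Illustrative driver loop tying the stages together. The imported function
# names are assumptions for this sketch, not necessarily the real CodeArena API.
from tasks import load_tasks            # assumed: returns the benchmark task definitions
from runner import run_submission       # assumed: executes a submission in isolation
from grader import grade                # assumed: scores one submission against a task
from analyzer import analyze_failures   # assumed: aggregates recurring weaknesses

def evaluate(submission_path: str) -> dict:
    results = []
    for task in load_tasks():                               # Coding Task
        execution = run_submission(submission_path, task)   # Submission Runner
        results.append(grade(task, execution))              # Automated Grader + hidden tests
    return {
        "scores": results,
        "failures": analyze_failures(results),              # Failure Analysis + Metrics
    }
```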
tasks.py defines realistic software engineering problems including:
- algorithmic debugging
- retry policies
- concurrent execution
- state consistency validation
Each benchmark includes (see the sketch after this list):
- public tests
- hidden validation cases
- expected outputs
- runtime constraints
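As a rough illustration, a single task entry might be shaped like the dictionary below. The field names and values are assumptions made for this sketch, not the exact schema used by tasks.py.

```python
# Illustrative shape of one benchmark task definition; field names are
# assumptions for this sketch rather than tasks.py's actual schema.
TWO_SUM_TASK = {
    "name": "two_sum",
    "description": "Return indices of two numbers that add up to target.",
    "public_tests": [                      # visible to the agent
        {"input": ([2, 7, 11, 15], 9), "expected": [0, 1]},
    ],
    "hidden_tests": [                      # only used by the grader
        {"input": ([3, 3], 6), "expected": [0, 1]},        # duplicate-value edge case
        {"input": ([3, 2, 4], 6), "expected": [1, 2]},
    ],
    "timeout_seconds": 2.0,                # runtime constraint enforced by the runner
}
```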
runner.py executes generated code submissions with:
- timeout protection
- isolated execution
- runtime measurement
- execution-state validation
The runner captures the following (sketched after the list):
- stdout/stderr
- runtime behavior
- timeout failures
- execution exceptions
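A minimal sketch of this kind of isolated, time-limited execution, using only the Python standard library, is shown below; the actual runner.py may be structured differently.

```python
# Minimal sketch of isolated, time-limited execution with the standard library;
# the real runner.py may differ in structure and naming.
import subprocess
import sys
import time

def run_isolated(path: str, timeout: float = 5.0) -> dict:
    """Run a submission in a separate Python process, capturing output and runtime."""
    start = time.perf_counter()
    try:
        proc = subprocess.run(
            [sys.executable, path],   # isolated execution in a child interpreter
            capture_output=True,      # capture stdout/stderr
            text=True,
            timeout=timeout,          # timeout protection
        )
        return {
            "stdout": proc.stdout,
            "stderr": proc.stderr,
            "returncode": proc.returncode,
            "runtime_ms": (time.perf_counter() - start) * 1000,
            "timed_out": False,
        }
    except subprocess.TimeoutExpired:
        return {"stdout": "", "stderr": "", "returncode": None,
                "runtime_ms": timeout * 1000, "timed_out": True}
```

Running each submission in a child process keeps crashes, infinite loops, and misbehaving code from taking down the evaluation harness itself.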
grader.py evaluates submissions using:
- correctness scoring
- hidden test performance
- runtime validation
- execution success/failure tracking
The grading pipeline is extensible and supports additional reward metrics for RL-style evaluation workflows; a simplified scoring sketch is shown below.
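This sketch assumes the illustrative task schema from the tasks.py example above, grades a solution callable directly for brevity, and borrows the 30-point scale from the sample output; none of these details are guaranteed to match grader.py exactly.

```python
# Simplified correctness-scoring pass over public and hidden tests. The task
# schema, the 30-point scale, and grading a callable directly are assumptions.
def grade(task: dict, solution_fn) -> dict:
    """Score a callable solution against a task's public and hidden tests."""
    passed = 0
    cases = task["public_tests"] + task["hidden_tests"]
    for case in cases:
        try:
            if solution_fn(*case["input"]) == case["expected"]:
                passed += 1
        except Exception:
            pass  # execution exceptions simply count as failed cases
    score = round(30 * passed / len(cases)) if cases else 0
    return {"task": task["name"], "score": score, "passed": passed, "total": len(cases)}
```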
analyzer.py identifies recurring weaknesses in generated solutions such as:
- runtime failures
- dependency issues
- timeout errors
- concurrency bugs
- incorrect edge-case handling
This enables systematic analysis of coding-agent behavior and failure modes; a minimal aggregation sketch follows.
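One simple way to aggregate such patterns is with collections.Counter, consistent with the Counter() line in the sample output further below; the record fields and failure-category names here are assumptions made for this sketch.

```python
# One way to tally recurring failure categories with collections.Counter.
# The record fields and category names are illustrative assumptions.
from collections import Counter

def analyze_failures(results: list[dict]) -> Counter:
    """Aggregate failure categories across graded submission results."""
    patterns = Counter()
    for record in results:
        if record.get("timed_out"):
            patterns["timeout"] += 1
        if record.get("stderr"):
            patterns["runtime_error"] += 1
        if record.get("passed", 0) < record.get("total", 0):
            patterns["hidden_test_failure"] += 1
    return patterns  # an empty Counter() means no recurring weaknesses were found
```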
A full evaluation run proceeds as follows:
1. A benchmark task is selected.
2. A generated solution is executed.
3. Public and hidden test cases are evaluated.
4. Runtime and execution behavior are measured.
5. The grading engine computes performance scores.
6. Failure analysis aggregates recurring weaknesses.
codearena/
├── main.py
├── tasks.py
├── runner.py
├── grader.py
├── analyzer.py
├── submissions/
│ └── sample_solution.py
└── README.md
Example run:

python3 main.py

=== CodeArena Evaluation Started ===
Task: two_sum
Score: 30/30
Passed: 3/3
Average runtime: 0.22 ms
Task: retry_policy
Score: 30/30
Passed: 3/3
Average runtime: 0.12 ms
Task: concurrent_counter
Score: 30/30
Passed: 3/3
Average runtime: 1.34 ms
=== Failure Analysis ===
Counter()
=== CodeArena Evaluation Complete ===
Technologies and concepts used:
- Python
- Automated Graders
- Hidden Test Evaluation
- Concurrency Testing
- Runtime Analysis
- RL-style Benchmarking Concepts
Building this project involved:
- Designing scalable software evaluation environments
- Building automated grading and validation pipelines
- Measuring runtime and execution correctness
- Analyzing LLM-generated code failures
- Structuring extensible benchmark systems for AI agents
Possible future enhancements:
- Docker-based sandbox execution
- Multi-language benchmark support
- RL reward signal integration
- Distributed evaluation workers
- Real LLM API integration for agent benchmarking
- Web dashboard for submission analytics and evaluation metrics
CodeArena reflects core ideas used in modern AI coding-agent evaluation systems:
- hidden test benchmarking
- execution validation
- software reliability scoring
- failure-pattern analysis
- runtime-aware evaluation
The project demonstrates how automated evaluation environments can be designed to assess coding agents beyond simple correctness metrics.
Author: Ancy Patel