RL-based software engineering evaluation environment for benchmarking coding agents across debugging, concurrency, API integration, and hidden test scenarios.
CodeArena is a modular evaluation platform designed to assess the performance of LLM-generated software solutions on realistic software engineering tasks. The system simulates how frontier coding agents are benchmarked using automated grading pipelines, hidden test execution, runtime validation, and failure-pattern analysis.
The environment evaluates generated code across multiple dimensions including:
- correctness
- execution reliability
- concurrency handling
- retry logic
- hidden edge-case performance
- runtime behavior
CodeArena is inspired by modern AI evaluation systems used to measure the reasoning and coding capabilities of advanced software agents. Its key features include:
- Automated grading for software engineering tasks
- Hidden test evaluation and edge-case validation
- Runtime monitoring and execution analysis
- Concurrency and state-management evaluation
- API retry and failure-handling scenarios
- Failure-pattern analysis across generated solutions
- Modular benchmark architecture for extensibility
Submissions move through the following evaluation pipeline:

Coding Task
↓
Submission Runner
↓
Automated Grader
↓
Hidden Test Evaluation
↓
Failure Analysis + Metrics
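This flow can be expressed as a small driver loop. The sketch below is illustrative only; the imported helper names (load_tasks, run_submission, grade, analyze_failures) are assumptions made for this example rather than the exact CodeArena API.

```python
# Illustrative driver loop tying the stages together. The imported function
# names are assumptions for this sketch, not necessarily the real CodeArena API.
from tasks import load_tasks            # assumed: returns the benchmark task definitions
from runner import run_submission       # assumed: executes a submission in isolation
from grader import grade                # assumed: scores one submission against a task
from analyzer import analyze_failures   # assumed: aggregates recurring weaknesses

def evaluate(submission_path: str) -> dict:
    results = []
    for task in load_tasks():                               # Coding Task
        execution = run_submission(submission_path, task)   # Submission Runner
        results.append(grade(task, execution))              # Automated Grader + hidden tests
    return {
        "scores": results,
        "failures": analyze_failures(results),              # Failure Analysis + Metrics
    }
```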
tasks.py defines realistic software engineering problems including:
- algorithmic debugging
- retry policies
- concurrent execution
- state consistency validation
Each benchmark includes (see the sketch after this list):
- public tests
- hidden validation cases
- expected outputs
- runtime constraints
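As a rough illustration, a single task entry might be shaped like the dictionary below. The field names and values are assumptions made for this sketch, not the exact schema used by tasks.py.

```python
# Illustrative shape of one benchmark task definition; field names are
# assumptions for this sketch rather than tasks.py's actual schema.
TWO_SUM_TASK = {
    "name": "two_sum",
    "description": "Return indices of two numbers that add up to target.",
    "public_tests": [                      # visible to the agent
        {"input": ([2, 7, 11, 15], 9), "expected": [0, 1]},
    ],
    "hidden_tests": [                      # only used by the grader
        {"input": ([3, 3], 6), "expected": [0, 1]},        # duplicate-value edge case
        {"input": ([3, 2, 4], 6), "expected": [1, 2]},
    ],
    "timeout_seconds": 2.0,                # runtime constraint enforced by the runner
}
```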
runner.py executes generated code submissions with:
- timeout protection
- isolated execution
- runtime measurement
- execution-state validation
The runner captures the following (sketched after the list):
- stdout/stderr
- runtime behavior
- timeout failures
- execution exceptions
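A minimal sketch of this kind of isolated, time-limited execution, using only the Python standard library, is shown below; the actual runner.py may be structured differently.

```python
# Minimal sketch of isolated, time-limited execution with the standard library;
# the real runner.py may differ in structure and naming.
import subprocess
import sys
import time

def run_isolated(path: str, timeout: float = 5.0) -> dict:
    """Run a submission in a separate Python process, capturing output and runtime."""
    start = time.perf_counter()
    try:
        proc = subprocess.run(
            [sys.executable, path],   # isolated execution in a child interpreter
            capture_output=True,      # capture stdout/stderr
            text=True,
            timeout=timeout,          # timeout protection
        )
        return {
            "stdout": proc.stdout,
            "stderr": proc.stderr,
            "returncode": proc.returncode,
            "runtime_ms": (time.perf_counter() - start) * 1000,
            "timed_out": False,
        }
    except subprocess.TimeoutExpired:
        return {"stdout": "", "stderr": "", "returncode": None,
                "runtime_ms": timeout * 1000, "timed_out": True}
```

Running each submission in a child process keeps crashes, infinite loops, and misbehaving code from taking down the evaluation harness itself.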
grader.py evaluates submissions using:
- correctness scoring
- hidden test performance
- runtime validation
- execution success/failure tracking
The grading pipeline is extensible and supports additional reward metrics for RL-style evaluation workflows; a simplified scoring sketch is shown below.
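This sketch assumes the illustrative task schema from the tasks.py example above, grades a solution callable directly for brevity, and borrows the 30-point scale from the sample output; none of these details are guaranteed to match grader.py exactly.

```python
# Simplified correctness-scoring pass over public and hidden tests. The task
# schema, the 30-point scale, and grading a callable directly are assumptions.
def grade(task: dict, solution_fn) -> dict:
    """Score a callable solution against a task's public and hidden tests."""
    passed = 0
    cases = task["public_tests"] + task["hidden_tests"]
    for case in cases:
        try:
            if solution_fn(*case["input"]) == case["expected"]:
                passed += 1
        except Exception:
            pass  # execution exceptions simply count as failed cases
    score = round(30 * passed / len(cases)) if cases else 0
    return {"task": task["name"], "score": score, "passed": passed, "total": len(cases)}
```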
analyzer.py identifies recurring weaknesses in generated solutions such as:
- runtime failures
- dependency issues
- timeout errors
- concurrency bugs
- incorrect edge-case handling
This enables systematic analysis of coding-agent behavior and failure modes; a minimal aggregation sketch follows.
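One simple way to aggregate such patterns is with collections.Counter, consistent with the Counter() line in the sample output further below; the record fields and failure-category names here are assumptions made for this sketch.

```python
# One way to tally recurring failure categories with collections.Counter.
# The record fields and category names are illustrative assumptions.
from collections import Counter

def analyze_failures(results: list[dict]) -> Counter:
    """Aggregate failure categories across graded submission results."""
    patterns = Counter()
    for record in results:
        if record.get("timed_out"):
            patterns["timeout"] += 1
        if record.get("stderr"):
            patterns["runtime_error"] += 1
        if record.get("passed", 0) < record.get("total", 0):
            patterns["hidden_test_failure"] += 1
    return patterns  # an empty Counter() means no recurring weaknesses were found
```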
A full evaluation run proceeds as follows:
1. A benchmark task is selected.
2. A generated solution is executed.
3. Public and hidden test cases are evaluated.
4. Runtime and execution behavior are measured.
5. The grading engine computes performance scores.
6. Failure analysis aggregates recurring weaknesses.
codearena/
├── main.py
├── tasks.py
├── runner.py
├── grader.py
├── analyzer.py
├── submissions/
│ └── sample_solution.py
└── README.md
Example run:

python3 main.py

=== CodeArena Evaluation Started ===
Task: two_sum
Score: 30/30
Passed: 3/3
Average runtime: 0.22 ms
Task: retry_policy
Score: 30/30
Passed: 3/3
Average runtime: 0.12 ms
Task: concurrent_counter
Score: 30/30
Passed: 3/3
Average runtime: 1.34 ms
=== Failure Analysis ===
Counter()
=== CodeArena Evaluation Complete ===
Technologies and concepts used:
- Python
- Automated Graders
- Hidden Test Evaluation
- Concurrency Testing
- Runtime Analysis
- RL-style Benchmarking Concepts
Building this project involved:
- Designing scalable software evaluation environments
- Building automated grading and validation pipelines
- Measuring runtime and execution correctness
- Analyzing LLM-generated code failures
- Structuring extensible benchmark systems for AI agents
Possible future enhancements:
- Docker-based sandbox execution
- Multi-language benchmark support
- RL reward signal integration
- Distributed evaluation workers
- Real LLM API integration for agent benchmarking
- Web dashboard for submission analytics and evaluation metrics
CodeArena reflects core ideas used in modern AI coding-agent evaluation systems:
- hidden test benchmarking
- execution validation
- software reliability scoring
- failure-pattern analysis
- runtime-aware evaluation
The project demonstrates how automated evaluation environments can be designed to assess coding agents beyond simple correctness metrics.
Author: Ancy Patel