CodeArena

RL-based software engineering evaluation environment for benchmarking coding agents across debugging, concurrency, API integration, and hidden test scenarios.


Overview

CodeArena is a modular evaluation platform for assessing LLM-generated solutions to realistic software engineering tasks. It mirrors how frontier coding agents are benchmarked: automated grading pipelines, hidden test execution, runtime validation, and failure-pattern analysis.

The environment evaluates generated code across multiple dimensions, including:

  • correctness
  • execution reliability
  • concurrency handling
  • retry logic
  • hidden edge-case performance
  • runtime behavior

CodeArena is inspired by modern AI evaluation systems used for measuring the reasoning and coding capabilities of advanced software agents.


Key Capabilities

  • Automated grading for software engineering tasks
  • Hidden test evaluation and edge-case validation
  • Runtime monitoring and execution analysis
  • Concurrency and state-management evaluation
  • API retry and failure-handling scenarios
  • Failure-pattern analysis across generated solutions
  • Modular benchmark architecture for extensibility

System Architecture

Coding Task
     ↓
Submission Runner
     ↓
Automated Grader
     ↓
Hidden Test Evaluation
     ↓
Failure Analysis + Metrics

Core Components

1. Benchmark Task Environment

tasks.py defines realistic software engineering problems, including:

  • algorithmic debugging
  • retry policies
  • concurrent execution
  • state consistency validation

Each benchmark includes:

  • public tests
  • hidden validation cases
  • expected outputs
  • runtime constraints
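
A minimal sketch of how a benchmark task might be declared in tasks.py. The field names and the example data here are illustrative assumptions, not the actual schema:

from dataclasses import dataclass

@dataclass
class BenchmarkTask:
    # Illustrative task schema; the real tasks.py may differ.
    name: str
    description: str
    public_tests: list          # (input, expected_output) pairs visible to the agent
    hidden_tests: list          # extra edge cases evaluated but never revealed
    timeout_s: float = 2.0      # runtime constraint enforced by the runner

two_sum = BenchmarkTask(
    name="two_sum",
    description="Return indices of the two numbers that add up to target.",
    public_tests=[(([2, 7, 11, 15], 9), [0, 1])],
    hidden_tests=[(([3, 3], 6), [0, 1]), (([1, 2, 4], 8), None)],
)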

2. Submission Runner

runner.py executes generated code submissions with:

  • timeout protection
  • isolated execution
  • runtime measurement
  • execution-state validation

The runner captures:

  • stdout/stderr
  • runtime behavior
  • timeout failures
  • execution exceptions
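
A hedged sketch of the execution step, assuming submissions are run as separate Python processes; the real runner.py may use a different isolation strategy:

import subprocess
import time

def run_submission(path, stdin_data="", timeout_s=2.0):
    # Execute the submission in a separate process so a crash or hang
    # cannot take down the evaluation harness.
    start = time.perf_counter()
    try:
        proc = subprocess.run(
            ["python3", path],
            input=stdin_data,
            capture_output=True,
            text=True,
            timeout=timeout_s,   # timeout protection
        )
        return {
            "stdout": proc.stdout,
            "stderr": proc.stderr,
            "returncode": proc.returncode,
            "runtime_ms": (time.perf_counter() - start) * 1000,
            "timed_out": False,
        }
    except subprocess.TimeoutExpired:
        return {"stdout": "", "stderr": "", "returncode": None,
                "runtime_ms": timeout_s * 1000, "timed_out": True}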

3. Automated Grading Engine

grader.py evaluates submissions using:

  • correctness scoring
  • hidden test performance
  • runtime validation
  • execution success/failure tracking

The grading pipeline is extensible and supports additional reward metrics for RL-style evaluation workflows.
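
A minimal sketch of the scoring logic, assuming each passed test (public or hidden) contributes equally to the per-task score; the actual weighting in grader.py may differ:

def grade(task, run_results, max_score=30):
    # run_results: list of (expected, actual, timed_out) triples covering
    # both public and hidden tests for one submission.
    passed = sum(
        1 for expected, actual, timed_out in run_results
        if not timed_out and actual == expected
    )
    total = len(run_results)
    score = round(max_score * passed / total) if total else 0
    return {"passed": passed, "total": total, "score": score}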


4. Failure Analysis Engine

analyzer.py identifies recurring weaknesses in generated solutions such as:

  • runtime failures
  • dependency issues
  • timeout errors
  • concurrency bugs
  • incorrect edge-case handling

This enables systematic analysis of coding-agent behavior and failure modes.
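
The Counter() line in the sample output suggests failures are tallied with collections.Counter; a hedged sketch of how that aggregation might look, with illustrative failure categories:

from collections import Counter

def analyze_failures(results):
    # Tally failure categories across all graded test runs.
    # An empty Counter() means every test passed, as in the sample output.
    failures = Counter()
    for r in results:
        if r.get("timed_out"):
            failures["timeout"] += 1
        elif r.get("exception"):
            failures["runtime_error"] += 1
        elif not r.get("passed"):
            failures["wrong_output"] += 1
    return failures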


Evaluation Workflow

  1. A benchmark task is selected.
  2. A generated solution is executed.
  3. Public and hidden test cases are evaluated.
  4. Runtime and execution behavior are measured.
  5. The grading engine computes performance scores.
  6. Failure analysis aggregates recurring weaknesses.
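
Putting these steps together, a hedged sketch of the top-level loop in main.py; the helper names mirror the sketches above and are assumptions, not the actual API:

def evaluate(tasks, submission_dir="submissions"):
    all_results = []
    print("=== CodeArena Evaluation Started ===")
    for task in tasks:
        results = run_task(task, submission_dir)      # execute public + hidden tests
        report = grade(task, results)                 # correctness + runtime scoring
        all_results.extend(results)
        print(f"Task: {task.name}")
        print(f"Score: {report['score']}/30")
        print(f"Passed: {report['passed']}/{report['total']}")
    print("=== Failure Analysis ===")
    print(analyze_failures(all_results))              # aggregated weakness counts
    print("=== CodeArena Evaluation Complete ===")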

Project Structure

codearena/
├── main.py
├── tasks.py
├── runner.py
├── grader.py
├── analyzer.py
├── submissions/
│   └── sample_solution.py
└── README.md

Run the Project

python3 main.py

Sample Output

=== CodeArena Evaluation Started ===

Task: two_sum
Score: 30/30
Passed: 3/3
Average runtime: 0.22 ms

Task: retry_policy
Score: 30/30
Passed: 3/3
Average runtime: 0.12 ms

Task: concurrent_counter
Score: 30/30
Passed: 3/3
Average runtime: 1.34 ms

=== Failure Analysis ===
Counter()

=== CodeArena Evaluation Complete ===

Tech Stack

  • Python
  • Automated Graders
  • Hidden Test Evaluation
  • Concurrency Testing
  • Runtime Analysis
  • RL-style Benchmarking Concepts

Key Learnings

  • Designing scalable software evaluation environments
  • Building automated grading and validation pipelines
  • Measuring runtime and execution correctness
  • Analyzing LLM-generated code failures
  • Structuring extensible benchmark systems for AI agents

Future Improvements

  • Docker-based sandbox execution
  • Multi-language benchmark support
  • RL reward signal integration
  • Distributed evaluation workers
  • Real LLM API integration for agent benchmarking
  • Web dashboard for submission analytics and evaluation metrics

Why This Project Matters

CodeArena reflects core ideas used in modern AI coding-agent evaluation systems:

  • hidden test benchmarking
  • execution validation
  • software reliability scoring
  • failure-pattern analysis
  • runtime-aware evaluation

The project demonstrates how automated evaluation environments can be designed to assess coding agents beyond simple correctness metrics.


Author

Ancy Patel
