A calibrated cost-quality configuration framework for single-session coding agents. You declare a policy once; the runtime runs a calibrated solver configuration for the workload, keeps the patch an execution verifier confirms, and stops spending on instances a deployable signal marks unsolvable at this tier.
From the paper? This codebase is the artifact for The Economics of Coding Agents: Calibrating the Cost-Quality Frontier. The exact code state cited in the paper is tagged v1.0-arxiv; the main branch may have evolved since. Every load-bearing number reproduces offline - see Reproducing the paper's claims.
- Solvers - a catalog of execution variants (signaled budget, plan-then-execute, relevance ranker, signal injection, per-tool-class hint, phase-staged, two-pass critique, and more) that generate candidate patches.
- Repro-verifier (the quality lever) - keep the patch a generated reproduction confirms, scored without consulting held-out tests.
- Governor (the cost lever) - stop spending on instances a deployable signal (a failed reproduction together with a broken regression suite) marks doomed at this tier.
Two things ship, both calibrated and checked on a held-out split: (1) a per-workload menu of single configurations on the cost-quality Pareto frontier, and (2) a regression-suite governor that recovers a majority of wrongly-abandoned resolves, narrowing the gap to the unbounded floor. The single-tier run-both-and-abandon composition is reported as a first-class negative: it adds zero resolves over its best single member at roughly twice the cost, and an escalation gate loses its selectivity once the strong tier runs.
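How the two levers interact can be pictured with a minimal sketch. Every name here (`Candidate`, `select_patch`, `should_abandon`, and the boolean fields) is hypothetical and chosen for illustration; the shipped controller's interfaces may differ. Requiring the doomed signal on every candidate is also an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one solver's output on one instance."""
    solver: str
    patch: str
    repro_confirmed: bool  # did the generated reproduction confirm the patch?
    regression_ok: bool    # does the existing regression suite still pass?


def select_patch(candidates: Sequence[Candidate]) -> Optional[Candidate]:
    """Quality lever: keep the first patch that the reproduction confirms."""
    for cand in candidates:
        if cand.repro_confirmed:
            return cand
    return None


def should_abandon(candidates: Sequence[Candidate]) -> bool:
    """Cost lever: stop spending only when every candidate shows both
    a failed reproduction and a broken regression suite."""
    return bool(candidates) and all(
        not c.repro_confirmed and not c.regression_ok for c in candidates
    )


candidates = [
    Candidate("plan-then-execute", "diff A", repro_confirmed=False, regression_ok=False),
    Candidate("two-pass critique", "diff B", repro_confirmed=True, regression_ok=True),
]
kept = select_patch(candidates)
abandon = should_abandon(candidates)
```

Neither function consults held-out tests: both rely only on signals available at deployment time, which is the property the components above depend on.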
Same models, same data split, same scorer; cost as one modeled per-instance surface.
| Comparison | Result |
|---|---|
| vs. Agentless (head-to-head, SWE-bench Lite, 300 instances) | 172/300 resolved (57.3%) @ $0.445/inst vs. Agentless 152/300 (50.7%) @ $0.456, more resolved at lower cost (Pareto; McNemar p=0.0055, 13.8% lower cost per resolve) |
| Calibrated cross-tier menu (resolved/150) | cost-min 84 @ $0.17 · balanced 98 @ $0.41 · quality 111 @ $0.47 · quality-max 128 @ $0.48 · ceiling 138 @ $0.60 |
| Regression-suite governor | recovers a majority of wrongly-abandoned resolves (narrowing the gap to the unbounded floor) |
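The arithmetic behind the table can be checked in a few lines. This sketch assumes that cost per resolve means total modeled spend divided by the number of resolved instances, and that a menu entry is Pareto-dominated when another entry resolves at least as many instances at no higher cost; the function names are illustrative, not part of the package.

```python
def cost_per_resolve(resolved: int, total: int, cost_per_instance: float) -> float:
    """Total modeled spend divided by the number of resolved instances."""
    return cost_per_instance * total / resolved


# Head-to-head row: 300 SWE-bench Lite instances.
ours = cost_per_resolve(172, 300, 0.445)
agentless = cost_per_resolve(152, 300, 0.456)
reduction_pct = 100 * (agentless - ours) / agentless

# Cross-tier menu row: (label, resolved out of 150, cost per instance in USD).
menu = [
    ("cost-min", 84, 0.17),
    ("balanced", 98, 0.41),
    ("quality", 111, 0.47),
    ("quality-max", 128, 0.48),
    ("ceiling", 138, 0.60),
]


def is_dominated(point, others) -> bool:
    """True if some other entry is at least as good on both axes and differs."""
    _, resolved, cost = point
    return any(
        r >= resolved and c <= cost and (r, c) != (resolved, cost)
        for _, r, c in others
    )


frontier = [p[0] for p in menu if not is_dominated(p, menu)]
print(f"cost per resolve: {ours:.3f} vs {agentless:.3f} ({reduction_pct:.1f}% lower)")
print("non-dominated menu entries:", frontier)
```

Under these assumptions the reduction rounds to the 13.8% quoted in the table, and none of the five menu entries dominates another.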
Full methodology, per-claim evidence, and the negative results are in
paper/paper.pdf and CLAIMS.md.
```bash
pip install -e ".[dev]"
```

```python
from code_capsules import policy_for, CodeCapsulesRunner

# 1. Pick a calibrated preset from the shipped cross-tier menu.
policy = policy_for(workload="hard_workload", tier="sonnet", knee="balanced")
print(policy.variant, policy.turn_budget, f"${policy.cost_per_task:.2f}/attempt")

# 2. Or load the deployable lever (diverse solver pair + repro-verifier + governor).
lever = CodeCapsulesRunner.from_policy_file("policy.yaml")
# lever.run(sampler, grade_fn)  # wire your model sampler + grader; see examples/deployable_lever.py
```

The configuration layer is fully offline (no API keys). Running the
lever against real models requires a model sampler and a grader - see
examples/deployable_lever.py.
The headline claims reproduce offline (no Docker, no API keys, no
model calls) from the committed evaluation data in evals/,
including the deployment-governor analysis (the diverse-sample
agreement signal, the regression-suite gate, the escalation gate, and
the value-of-resolve rule):
```bash
python3 benchmarks/verify_criteria.py   # PASS/FAIL per claim; every claim reproduces offline
```

Or browse them interactively in the evidence explorer - a dependency-free static page that shows each claim with its paper value, status, and reproduce-command, and lets you drill into any evaluation run (per-instance rows, resolved / cost, aggregates):

```bash
bash benchmarks/explore.sh   # builds the data, serves localhost, opens the browser
```

Each claim maps to a scorer and its evidence files in
CLAIMS.md; the scorers live in
benchmarks/.
- examples/quickstart_policy.py - the configuration DSL (presets + custom configs)
- examples/deployable_lever.py - run the 3-component controller end to end
- examples/workload_routing.py - route by workload class
- examples/advanced/custom_quality_gate.py - plug in your own quality gate
- examples/calibrate_deployment_defaults.py - how the shipped presets were calibrated
The cross-vendor HumanEval/MBPP cost study (claim C8) is reproduced offline by
benchmarks/cross_vendor/score.py over the
committed CSVs - see CLAIMS.md.
The framework's reusable contribution is its extension surface:
eight Protocols you implement against your own domain (the SWE-bench /
HumanEval / MBPP numbers calibrate the shipped defaults; you derive your
own for your workload). The primitives - WorkloadClassifier,
RoutingStrategy, Variant, Signal, QualityGate, CascadeTrigger,
CostModel, ModelClient - are defined in
src/code_capsules/api/protocols.py
and registered by name. Concrete classes need not inherit from anything
(PEP 544 structural typing). See
examples/advanced/custom_quality_gate.py and
examples/workload_routing.py for
worked extensions.
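As a rough illustration of what structural typing buys here, the sketch below defines a stand-in protocol and a class that satisfies it without inheriting from it. `QualityGateLike`, its `accept` method, and the `register` helper are all hypothetical; the real signatures are the ones in src/code_capsules/api/protocols.py and may look quite different.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class QualityGateLike(Protocol):
    """Hypothetical stand-in for a gate protocol (method name is an assumption)."""

    def accept(self, patch: str) -> bool: ...


class NonEmptyDiffGate:
    """Satisfies the protocol by shape alone; note there is no base class."""

    def accept(self, patch: str) -> bool:
        return bool(patch.strip())


# Hypothetical name-based registry, mirroring "registered by name" in spirit only.
REGISTRY: dict[str, QualityGateLike] = {}


def register(name: str, gate: QualityGateLike) -> None:
    # runtime_checkable lets isinstance() verify that the method is present.
    if not isinstance(gate, QualityGateLike):
        raise TypeError(f"{name!r} does not implement accept()")
    REGISTRY[name] = gate


register("non_empty_diff", NonEmptyDiffGate())
```

The `isinstance` check only confirms that an `accept` attribute exists; it does not validate the signature, so a static type checker remains the stronger guarantee.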
```bash
pytest -m "not integration and not slow and not benchmark"
```

The offline suite uses scripted adapters - no API keys required. Live evaluation and the SWE-bench Docker path are reserved for separate benchmarking and are not part of CI.
```bibtex
@article{ray2026codecapsules,
  title = {The Economics of Coding Agents: Calibrating the Cost-Quality Frontier},
  author = {Ray, Aninda},
  year = {2026},
  note = {arXiv preprint, forthcoming.},
  url = {https://github.com/aray-17/code-capsules}
}
```

The arXiv URL will be added here once the preprint is live. See also
CITATION.cff.
This is a single-maintainer research project. Bug reports are triaged within ~2 weeks; pull requests are reviewed within ~3 weeks. This is best effort, not an SLA. See CONTRIBUTING.md for what falls in scope. Forks are welcome - Apache 2.0 explicitly permits forking and divergence.
Apache License 2.0. See LICENSE and NOTICE.
Independent author. Correspondence: research@anindaray.com. ORCID: 0009-0007-3029-8265.