Polish: audit follow-ups (CLAIMS wording, evidence hygiene, cwd-robust scorers) by aray-17 · Pull Request #1 · aray-17/code-capsules

aray-17 · 2026-06-15T04:15:09Z

Follow-ups from the artifact stress-test audit. No reproduced numbers change: every paper claim was already data-correct (independently recomputed from raw evals/ and cross-checked against the PDF). These changes fix wording, evidence-explorer hygiene, and scorer portability.

Changes

CLAIMS.md C11 — reword escalation economics as three distinct, non-comparable options over different doomed sets (Haiku-doomed → Opus ~$2.20; Sonnet-doomed → Opus ~$3.07–$7.51), not a single sequential cascade. The paper's Table 11 already presented these correctly; only the CLAIMS.md gloss was loose.
CLAIMS.md C12 — baseline is the signaled-budget cell (not "Floor"); state the precise +6.7pp lifts (Sonnet 37.3→44.0%, codex 20.7→27.3%).
CLAIMS.md menu row — name the source JSONL for every cell, including the leak-free "quality 77" file (exp4_lever_floor100_siginject.jsonl) that was previously unnamed in the evidence column.
build_evidence_index.py — add ratelimited to SKIP so the quarantined *.RATELIMITED-MIXED.jsonl held-out file no longer surfaces in the public evidence explorer; regenerate evidence_index.json (that one entry removed). Note: mixed_workload_* is intentionally kept (it's a labeled experiment, not a corrupted file).
test_runtime_e2e_replay.py — fix a stale comment naming a nonexistent *.CONTAMINATED-MIXED.jsonl (the real quarantined file is *.RATELIMITED-MIXED.jsonl). Test logic was already correct (exact paths, not globs).
agreement_signal.py / value_of_resolve.py — anchor evals/ globs on the repo root so these standalone-invokable scorers reproduce from any working directory (previously CWD-relative → empty results if run from elsewhere; verify_criteria.py was unaffected since it runs with cwd=ROOT).
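The SKIP change in build_evidence_index.py can be pictured with a minimal sketch. This is not the script's actual code: it assumes SKIP is a collection of lowercase substrings matched case-insensitively against file names, and both the `is_public` helper and the `heldout.RATELIMITED-MIXED.jsonl` name are hypothetical.

```python
from pathlib import Path

# Hypothetical skip markers; only "ratelimited" is named in this PR.
SKIP = ("ratelimited",)


def is_public(path: Path, skip=SKIP) -> bool:
    """Return True if no skip marker occurs in the file name (case-insensitive)."""
    name = path.name.lower()
    return not any(marker in name for marker in skip)


candidates = [
    Path("evals/heldout.RATELIMITED-MIXED.jsonl"),  # hypothetical quarantined file name
    Path("evals/mixed_workload_a.jsonl"),           # hypothetical labeled experiment
    Path("evals/exp4_lever_floor100_siginject.jsonl"),
]

# Keep only files that should surface in the evidence explorer.
public = [p.name for p in candidates if is_public(p)]
print(public)
```

Matching on "ratelimited" rather than "mixed" is what would keep mixed_workload_* files in the index while dropping the quarantined one.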

Verification

benchmarks/verify_criteria.py → 12/12 claims reproduce.
Both scorers now produce correct numbers when run from /tmp (foreign CWD).
Offline suite: 483 passed, 1 skipped, 3 deselected.
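The foreign-CWD check can be illustrated with a small sketch of root-anchored globbing. It assumes a hypothetical `eval_files` helper and a temporary directory standing in for the repo; the real scorers may derive their root differently.

```python
import os
import tempfile
from pathlib import Path


def eval_files(root: Path, pattern: str = "*.jsonl") -> list:
    """Glob evals/ under an explicit root instead of the current directory."""
    return sorted((root / "evals").glob(pattern))


# In a scorer script the root would typically come from the script's own
# location, e.g. Path(__file__).resolve().parents[1] (the layout is an assumption).
with tempfile.TemporaryDirectory() as repo, tempfile.TemporaryDirectory() as elsewhere:
    root = Path(repo)
    (root / "evals").mkdir()
    (root / "evals" / "run.jsonl").write_text("{}\n")

    old_cwd = os.getcwd()
    os.chdir(elsewhere)  # simulate running from a foreign working directory
    try:
        # CWD-relative glob: finds nothing because ./evals does not exist here.
        cwd_relative = sorted(Path("evals").glob("*.jsonl"))
        # Root-anchored glob: independent of the working directory.
        anchored = [p.name for p in eval_files(root)]
    finally:
        os.chdir(old_cwd)

print(cwd_relative, anchored)
```

The CWD-relative glob fails silently with an empty result rather than raising, which is why the earlier behavior produced empty scores instead of an error.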

🤖 Generated with Claude Code

…t scorers)

From the artifact stress-test audit. No reproduced number changed; all data was already correct. These fix wording, evidence-explorer hygiene, and scorer portability.

- CLAIMS.md C11: reword escalation economics as three distinct options over different doomed sets (Haiku-doomed ~$2.20; Sonnet-doomed ~$3.07-$7.51), not a single sequential cascade (the paper's Table 11 was already correct).
- CLAIMS.md C12: baseline is the signaled-budget cell, not "Floor"; state the precise +6.7pp lifts (Sonnet 37.3->44.0%, codex 20.7->27.3%).
- CLAIMS.md menu row: name the source JSONL for every cell, including the leak-free "quality 77" file (exp4_lever_floor100_siginject.jsonl) that was unnamed.
- build_evidence_index.py: add "ratelimited" to SKIP so the quarantined RATELIMITED-MIXED held-out file no longer surfaces in the evidence explorer; regenerate evidence_index.json.
- test_runtime_e2e_replay.py: fix stale comment that named a nonexistent *.CONTAMINATED-MIXED.jsonl (actual quarantined file is *.RATELIMITED-MIXED.jsonl).
- agreement_signal.py, value_of_resolve.py: anchor evals/ globs on the repo root so these standalone-invokable scorers reproduce from any working directory.

Verified: 12/12 claims reproduce; both scorers produce correct numbers from a foreign CWD; offline suite 483 passed, 1 skipped, 3 deselected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

aray-17 merged commit ff5d296 into main Jun 15, 2026
1 check passed

aray-17 deleted the fix/audit-polish branch June 15, 2026 04:53

aray-17 mentioned this pull request Jun 15, 2026

docs: document signed-merge policy (squash/merge, not rebase) #2

Merged

aray-17 merged 1 commit into main from fix/audit-polish
