Polish: audit follow-ups (CLAIMS wording, evidence hygiene, cwd-robust scorers)#1

Merged: aray-17 merged 1 commit into main from fix/audit-polish on Jun 15, 2026

Conversation

aray-17 (Owner) commented Jun 15, 2026

Follow-ups from the artifact stress-test audit. No reproduced number changes — every paper claim was already data-correct (independently recomputed from raw evals/ and cross-checked against the PDF). These fix wording, evidence-explorer hygiene, and scorer portability.

Changes

  • CLAIMS.md C11 — reword escalation economics as three distinct, non-comparable options over different doomed sets (Haiku-doomed → Opus ~$2.20; Sonnet-doomed → Opus ~$3.07–$7.51), not a single sequential cascade. The paper's Table 11 already presented these correctly; only the CLAIMS.md gloss was loose.
  • CLAIMS.md C12 — baseline is the signaled-budget cell (not "Floor"); state the precise +6.7pp lifts (Sonnet 37.3→44.0%, codex 20.7→27.3%).
  • CLAIMS.md menu row — name the source JSONL for every cell, including the leak-free "quality 77" file (exp4_lever_floor100_siginject.jsonl) that was previously unnamed in the evidence column.
  • build_evidence_index.py — add ratelimited to SKIP so the quarantined *.RATELIMITED-MIXED.jsonl held-out file no longer surfaces in the public evidence explorer; regenerate evidence_index.json (that one entry removed). Note: mixed_workload_* is intentionally kept (it's a labeled experiment, not a corrupted file).
  • test_runtime_e2e_replay.py — fix a stale comment naming a nonexistent *.CONTAMINATED-MIXED.jsonl (the real quarantined file is *.RATELIMITED-MIXED.jsonl). Test logic was already correct (exact paths, not globs).
  • agreement_signal.py / value_of_resolve.py — anchor evals/ globs on the repo root so these standalone-invokable scorers reproduce from any working directory (previously CWD-relative → empty results if run from elsewhere; verify_criteria.py was unaffected since it runs with cwd=ROOT).
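As a rough illustration of the `SKIP` change to build_evidence_index.py, here is a minimal sketch of a token-based filter. It is not the script's actual code: the function name `is_public`, the case-insensitive substring match, and the file-name prefixes in the usage note are all assumptions. Only the `ratelimited` token, the `*.RATELIMITED-MIXED.jsonl` suffix, and the `mixed_workload_*` prefix come from this PR.

```python
from pathlib import Path

# Hypothetical skip list; this PR only states that "ratelimited" was added.
SKIP = ("ratelimited",)


def is_public(path: Path, skip=SKIP) -> bool:
    """Return True when no skip token occurs in the lowercased file name.

    Assumption: matching is a case-insensitive substring test on the name.
    """
    name = path.name.lower()
    return not any(token in name for token in skip)


def public_evidence(paths):
    """Keep only the paths that may appear in the evidence index."""
    return [Path(p) for p in paths if is_public(Path(p))]
```

Under these assumptions, a file such as `evals/heldout.RATELIMITED-MIXED.jsonl` (hypothetical prefix) is dropped, while `evals/mixed_workload_a.jsonl` (hypothetical suffix) is kept, because `mixed` alone is not a skip token.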

Verification

  • benchmarks/verify_criteria.py: 12/12 claims reproduce.
  • Both scorers now produce correct numbers when run from /tmp (foreign CWD).
  • Offline suite: 483 passed, 1 skipped, 3 deselected.
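The foreign-CWD check can be pictured with a small self-contained sketch that compares a CWD-relative glob with one anchored on an absolute root. This is an illustration under stated assumptions, not the scorers' code: `find_evals`, `demo`, and the `sample.jsonl` file are hypothetical, and a real script would more likely derive the root from `Path(__file__).resolve()` than take it as an argument.

```python
import os
import tempfile
from pathlib import Path


def find_evals(root: Path, pattern: str = "*.jsonl"):
    """Glob under <root>/evals from an absolute base, so the CWD does not matter."""
    return sorted((root.resolve() / "evals").glob(pattern))


def demo() -> tuple[int, int]:
    """Count matches for a CWD-relative glob vs. a root-anchored glob."""
    with tempfile.TemporaryDirectory() as repo, tempfile.TemporaryDirectory() as elsewhere:
        root = Path(repo)
        (root / "evals").mkdir()
        (root / "evals" / "sample.jsonl").write_text("{}\n")  # hypothetical file

        old_cwd = os.getcwd()
        os.chdir(elsewhere)  # simulate running from a foreign working directory
        try:
            relative = len(list(Path("evals").glob("*.jsonl")))  # resolved against CWD
            anchored = len(find_evals(root))  # resolved against the repo root
        finally:
            os.chdir(old_cwd)
    return relative, anchored
```

Calling `demo()` returns `(0, 1)`: the relative glob finds nothing from the foreign directory, which mirrors the "empty results" failure mode, while the anchored glob still finds the file.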

🤖 Generated with Claude Code

…t scorers)

From the artifact stress-test audit. No reproduced number changed; all data was
already correct — these fix wording, evidence-explorer hygiene, and scorer portability.

- CLAIMS.md C11: reword escalation economics as three distinct options over
  different doomed sets (Haiku-doomed ~$2.20; Sonnet-doomed ~$3.07-$7.51), not a
  single sequential cascade (the paper's Table 11 was already correct).
- CLAIMS.md C12: baseline is the signaled-budget cell, not "Floor"; state the
  precise +6.7pp lifts (Sonnet 37.3->44.0%, codex 20.7->27.3%).
- CLAIMS.md menu row: name the source JSONL for every cell, including the
  leak-free "quality 77" file (exp4_lever_floor100_siginject.jsonl) that was unnamed.
- build_evidence_index.py: add "ratelimited" to SKIP so the quarantined
  RATELIMITED-MIXED held-out file no longer surfaces in the evidence explorer;
  regenerate evidence_index.json.
- test_runtime_e2e_replay.py: fix stale comment that named a nonexistent
  *.CONTAMINATED-MIXED.jsonl (actual quarantined file is *.RATELIMITED-MIXED.jsonl).
- agreement_signal.py, value_of_resolve.py: anchor evals/ globs on the repo root
  so these standalone-invokable scorers reproduce from any working directory.

Verified: 12/12 claims reproduce; both scorers produce correct numbers from a
foreign CWD; offline suite 483 passed, 1 skipped, 3 deselected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
aray-17 merged commit ff5d296 into main on Jun 15, 2026
1 check passed
aray-17 deleted the fix/audit-polish branch on June 15, 2026 04:53