Objective
pipeline input writes "threshold": 0.5 into <test-id>/llm_graders/<name>.json — hardcoded at apps/cli/src/commands/pipeline/input.ts:264. plugins/agentv-dev/skills/agentv-bench/agents/grader.md doesn't tell the grader subagent whether/how to apply this threshold when producing the per-assertion passed boolean.
Two reasonable interpretations of passed:
- Threshold-based:
passed = score >= threshold.
- Independent qualitative judgment: the grader decides
passed based on assertion semantics, regardless of score.
Different subagents will pick different interpretations silently. grading.json consumers (Studio dashboards, results validate, any external tooling reading index.jsonl) can't rely on a consistent passed contract across runs.
Observed during PR #1151 e2e: the dispatched subagent chose interpretation (2) — emitted raw score, passed reflects qualitative judgment, threshold was never read.
Design latitude
Pick one semantic and make it consistent across subagent mode and CLI mode:
- CLI mode's LLM grader lives in
packages/core/src/evaluation/graders/llm-grader.ts. Whatever it does for per-assertion passed should be the canonical behavior; document it in grader.md Step 2 (LLM-graded assertions table) and Step 9 output spec.
- Note:
llm-grader.ts:1331-1383 uses a required_min_score threshold for test-level gating (force fail the whole test if below threshold) — that is a different mechanism from per-assertion passed and should not be conflated.
- If
threshold is not used anywhere for per-assertion pass/fail in CLI mode, the hardcoded 0.5 write in input.ts:264 is dead config. Consider removing it or clearly documenting that it's for aggregation/display only.
Acceptance signals
grader.md has an unambiguous statement of what passed means for llm-grader assertions, explicitly naming the role of threshold (used / ignored / display-only).
- Subagent mode and CLI mode produce the same
passed verdict for the same (score, threshold) pair and response.
- If
threshold is determined to be unused for per-assertion pass/fail: either input.ts:264 stops writing it, or grader.md documents the remaining purpose.
Non-goals
- Weighted pass_rate / aggregate scoring rules.
required_min_score test-level gating at llm-grader.ts:1331 (separate feature, different semantic).
Related
Objective
pipeline inputwrites"threshold": 0.5into<test-id>/llm_graders/<name>.json— hardcoded atapps/cli/src/commands/pipeline/input.ts:264.plugins/agentv-dev/skills/agentv-bench/agents/grader.mddoesn't tell the grader subagent whether/how to apply this threshold when producing the per-assertionpassedboolean.Two reasonable interpretations of
passed:passed = score >= threshold.passedbased on assertion semantics, regardless of score.Different subagents will pick different interpretations silently.
grading.jsonconsumers (Studio dashboards,results validate, any external tooling readingindex.jsonl) can't rely on a consistentpassedcontract across runs.Observed during PR #1151 e2e: the dispatched subagent chose interpretation (2) — emitted raw score,
passedreflects qualitative judgment,thresholdwas never read.Design latitude
Pick one semantic and make it consistent across subagent mode and CLI mode:
packages/core/src/evaluation/graders/llm-grader.ts. Whatever it does for per-assertionpassedshould be the canonical behavior; document it ingrader.mdStep 2 (LLM-graded assertions table) and Step 9 output spec.llm-grader.ts:1331-1383uses arequired_min_scorethreshold for test-level gating (force fail the whole test if below threshold) — that is a different mechanism from per-assertionpassedand should not be conflated.thresholdis not used anywhere for per-assertion pass/fail in CLI mode, the hardcoded0.5write ininput.ts:264is dead config. Consider removing it or clearly documenting that it's for aggregation/display only.Acceptance signals
grader.mdhas an unambiguous statement of whatpassedmeans forllm-graderassertions, explicitly naming the role ofthreshold(used / ignored / display-only).passedverdict for the same(score, threshold)pair and response.thresholdis determined to be unused for per-assertion pass/fail: eitherinput.ts:264stops writing it, orgrader.mddocuments the remaining purpose.Non-goals
required_min_scoretest-level gating atllm-grader.ts:1331(separate feature, different semantic).Related