docs(grader): clarify how threshold in llm_graders config should affect the passed boolean

## Objective

`pipeline input` writes `"threshold": 0.5` into `<test-id>/llm_graders/<name>.json` — hardcoded at `apps/cli/src/commands/pipeline/input.ts:264`. `plugins/agentv-dev/skills/agentv-bench/agents/grader.md` doesn't tell the grader subagent whether/how to apply this threshold when producing the per-assertion `passed` boolean.

Two reasonable interpretations of `passed`:

1. **Threshold-based:** `passed = score >= threshold`.
2. **Independent qualitative judgment:** the grader decides `passed` based on assertion semantics, regardless of score.

Different subagents will pick different interpretations silently. `grading.json` consumers (Studio dashboards, `results validate`, any external tooling reading `index.jsonl`) can't rely on a consistent `passed` contract across runs.

Observed during PR #1151 e2e: the dispatched subagent chose interpretation (2) — emitted raw score, `passed` reflects qualitative judgment, `threshold` was never read.

## Design latitude

Pick one semantic and make it consistent across subagent mode and CLI mode:

- CLI mode's LLM grader lives in `packages/core/src/evaluation/graders/llm-grader.ts`. Whatever it does for per-assertion `passed` should be the canonical behavior; document it in `grader.md` Step 2 (LLM-graded assertions table) and Step 9 output spec.
- Note: `llm-grader.ts:1331-1383` uses a `required_min_score` threshold for **test-level gating** (force fail the whole test if below threshold) — that is a different mechanism from per-assertion `passed` and should not be conflated.
- If `threshold` is not used anywhere for per-assertion pass/fail in CLI mode, the hardcoded `0.5` write in `input.ts:264` is dead config. Consider removing it or clearly documenting that it's for aggregation/display only.

## Acceptance signals

- `grader.md` has an unambiguous statement of what `passed` means for `llm-grader` assertions, explicitly naming the role of `threshold` (used / ignored / display-only).
- Subagent mode and CLI mode produce the same `passed` verdict for the same `(score, threshold)` pair and response.
- If `threshold` is determined to be unused for per-assertion pass/fail: either `input.ts:264` stops writing it, or `grader.md` documents the remaining purpose.

## Non-goals

- Weighted pass_rate / aggregate scoring rules.
- `required_min_score` test-level gating at `llm-grader.ts:1331` (separate feature, different semantic).

## Related

- #1148, #1151

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs(grader): clarify how threshold in llm_graders config should affect the passed boolean #1153

Objective

Design latitude

Acceptance signals

Non-goals

Related

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

docs(grader): clarify how threshold in llm_graders config should affect the passed boolean #1153

Description

Objective

Design latitude

Acceptance signals

Non-goals

Related

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions