Skip to content

docs(grader): clarify how threshold in llm_graders config should affect the passed boolean #1153

@christso

Description

@christso

Objective

pipeline input writes "threshold": 0.5 into <test-id>/llm_graders/<name>.json — hardcoded at apps/cli/src/commands/pipeline/input.ts:264. plugins/agentv-dev/skills/agentv-bench/agents/grader.md doesn't tell the grader subagent whether/how to apply this threshold when producing the per-assertion passed boolean.

Two reasonable interpretations of passed:

  1. Threshold-based: passed = score >= threshold.
  2. Independent qualitative judgment: the grader decides passed based on assertion semantics, regardless of score.

Different subagents will pick different interpretations silently. grading.json consumers (Studio dashboards, results validate, any external tooling reading index.jsonl) can't rely on a consistent passed contract across runs.

Observed during PR #1151 e2e: the dispatched subagent chose interpretation (2) — emitted raw score, passed reflects qualitative judgment, threshold was never read.

Design latitude

Pick one semantic and make it consistent across subagent mode and CLI mode:

  • CLI mode's LLM grader lives in packages/core/src/evaluation/graders/llm-grader.ts. Whatever it does for per-assertion passed should be the canonical behavior; document it in grader.md Step 2 (LLM-graded assertions table) and Step 9 output spec.
  • Note: llm-grader.ts:1331-1383 uses a required_min_score threshold for test-level gating (force fail the whole test if below threshold) — that is a different mechanism from per-assertion passed and should not be conflated.
  • If threshold is not used anywhere for per-assertion pass/fail in CLI mode, the hardcoded 0.5 write in input.ts:264 is dead config. Consider removing it or clearly documenting that it's for aggregation/display only.

Acceptance signals

  • grader.md has an unambiguous statement of what passed means for llm-grader assertions, explicitly naming the role of threshold (used / ignored / display-only).
  • Subagent mode and CLI mode produce the same passed verdict for the same (score, threshold) pair and response.
  • If threshold is determined to be unused for per-assertion pass/fail: either input.ts:264 stops writing it, or grader.md documents the remaining purpose.

Non-goals

  • Weighted pass_rate / aggregate scoring rules.
  • required_min_score test-level gating at llm-grader.ts:1331 (separate feature, different semantic).

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    Status

    Backlog

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions