Add source-backed evaluator CLI flows#919

Merged: ahgpt merged 41 commits into main from fix/goal-steering on Jun 11, 2026

@wuman001

Collaborator

Summary

  • Add source-backed evaluator CLI flows for task, answer, and trajectory evaluation.
  • Run task inputs through the default CLI main agent when --agent is omitted, then judge the produced state.
  • Support trajectory replay, generated-trajectory evaluation, and answer veto gating; keep local eval artifacts out of Git.

Test Plan

  • pytest tests/docs/test_evaluator_report_docs.py tests/evaluations/test_evaluation_input_sources.py tests/core/test_evaluator_runtime.py tests/core/test_evaluator_top_level_command.py tests/test_slash_commands.py -q
  • openspec validate aworld-cli-evaluator-source-run-2026-06-10 --strict
  • python -m py_compile aworld/evaluations/sources.py aworld/evaluations/report.py aworld-cli/src/aworld_cli/evaluator_runtime.py aworld-cli/src/aworld_cli/commands/evaluation_cmd.py aworld-cli/src/aworld_cli/top_level_commands/evaluator_cmd.py
  • conda run -n aworld_env aworld-cli evaluator --input ~/Documents/task_input.jsonl --kind task --judge-agent eval/trajectory_evaluator/answer_quality_agent.md --out-dir eval/trajectory_evaluator/reports

wuman001 added 30 commits June 1, 2026 21:47

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a comprehensive framework-owned evaluation input source layer, state adapters, and a replay runtime harness to support replaying existing outputs (such as task+answer files and AWorld trajectory logs) without re-executing targets. It also adds rollout-owning runtime composition, user simulators, step-level rewards, retry wrappers, environment isolation, and independent repeated trials with pass@k/pass^k metrics. On the CLI side, a source-backed evaluator mode is added along with report schema export, validation, and lifecycle hooks. The review feedback highlights a potential crash in StateCheckGrader.grade due to unhandled exceptions during non-numeric comparisons, a local import that should be moved to the top of the file per PEP 8, and several broken documentation links containing absolute local file paths.
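
For readers unfamiliar with the repeated-trial metrics mentioned above, the sketch below shows the commonly used combinatorial estimators for pass@k (at least one of k sampled trials succeeds) and pass^k (all k sampled trials succeed). It is not taken from this PR; the function names and the `n`, `c`, `k` parameters (total trials, successful trials, sample size) are assumptions for illustration, and the PR's own implementation may differ.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # n = trials run, c = trials that passed, k = sample size being scored.
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k trials passes."""
    _check(n, c, k)
    # 1 minus the probability that all k sampled trials are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that all k trials pass (pass^k)."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)
```

With 5 trials of which 3 passed, `pass_at_k(5, 3, 2)` is 0.9 and `pass_hat_k(5, 3, 2)` is 0.3, which illustrates how pass^k penalizes inconsistency much more strongly than pass@k.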

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +160 to +167
        try:
            actual = _resolve_path(sources[self.source], self.path)
            passed = _compare_values(actual, self.op, self.expected)
            reason = "matched" if passed else f"expected {self.expected!r}, got {actual!r}"
        except KeyError:
            actual = None
            passed = False
            reason = f"missing path: {'.'.join(self.path)}"

Contributor

high

The _compare_values function attempts to convert values to float for comparison operators like >, <, etc. If the actual value from the state is a non-numeric string, this will raise a ValueError which is not handled in this try...except block. This could crash the evaluation for certain state check graders. Consider wrapping the call to _compare_values in its own try...except block to handle potential ValueError or TypeError exceptions gracefully.

        try:
            actual = _resolve_path(sources[self.source], self.path)
            try:
                passed = _compare_values(actual, self.op, self.expected)
                reason = "matched" if passed else f"expected {self.expected!r}, got {actual!r}"
            except (ValueError, TypeError):
                passed = False
                reason = f"incompatible types for comparison: expected comparable to {self.expected!r}, got {actual!r}"
        except KeyError:
            actual = None
            passed = False
            reason = f"missing path: {'.'.join(self.path)}"
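
The behavior of the suggestion can be checked in isolation with a self-contained sketch. `compare_values` and `resolve_path` below are hypothetical stand-ins for the private helpers `_compare_values` and `_resolve_path`, whose real implementations are not shown in this thread; the only assumption carried over is that ordering operators convert both sides to `float`.

```python
import operator

# Hypothetical operator table; the real helper may support more operators.
_ORDERING_OPS = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}


def compare_values(actual, op, expected):
    """Stand-in comparison: equality as-is, ordering via float conversion."""
    if op == "==":
        return actual == expected
    return _ORDERING_OPS[op](float(actual), float(expected))


def resolve_path(source, path):
    """Walk nested dicts; a missing key raises KeyError."""
    current = source
    for key in path:
        current = current[key]
    return current


def grade(source, path, op, expected):
    """Return (passed, reason) without raising on missing or non-numeric values."""
    try:
        actual = resolve_path(source, path)
    except KeyError:
        return False, f"missing path: {'.'.join(path)}"
    try:
        passed = compare_values(actual, op, expected)
    except (ValueError, TypeError):
        return False, (
            "incompatible types for comparison: "
            f"expected comparable to {expected!r}, got {actual!r}"
        )
    return passed, ("matched" if passed else f"expected {expected!r}, got {actual!r}")
```

Separating the two `try` blocks keeps the "missing path" and "incompatible types" cases distinct, so a non-numeric string such as `"abc"` yields a failed check with a readable reason instead of aborting the run.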

    target_path = resolve_cli_target_path(target)
    workspace_path = str(target_path.parent if target_path.is_file() else target_path)
    suite_selection = resolve_workspace_suite_selection(target=target, suite=suite)
    from aworld.evaluations.substrate import resolve_eval_suite_selection

Contributor

medium

This from ... import ... statement is inside the run_evaluator_cli function. According to PEP 8, imports should usually be at the top of the file for clarity and to avoid potential issues with repeated imports. If this local import is intended to avoid a circular dependency or for a significant performance optimization, it would be good to add a comment explaining why. Otherwise, please move it to the top of the file for better code organization and readability.
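
If the local import is deliberate, a short comment at the import site is enough to document it. The sketch below uses the standard-library `json` module purely as a stand-in for the real dependency, and `export_report` is a hypothetical function name; it does not reflect the actual structure of `run_evaluator_cli`.

```python
def export_report(report: dict) -> str:
    """Serialize a report, importing the serializer only when needed."""
    # Deferred import, kept local on purpose: for example to break a circular
    # import, or to skip the import cost on code paths that never export.
    # `json` is only a stand-in for the real module here.
    import json

    return json.dumps(report, sort_keys=True)
```

Python caches imported modules in `sys.modules`, so repeating the statement on every call is cheap after the first one; the main cost of a function-local import is readability, which is what the comment addresses.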

Comment on lines +110 to +170
See [declared_evaluator_suite.example.json](/Users/wuman/Documents/workspace/aworld-mas/aworld/examples/aworld_quick_start/cli/declared_evaluator_suite.example.json) for a complete example. The current manifest schema is exported by `aworld_cli.evaluator_runtime.get_declared_evaluator_suite_schema()`.

Resolution rules:

- builtin suites are always available
- declared suites are discovered relative to the evaluation target workspace, not just the current shell cwd
- declared manifests currently extend `app-evaluator`; they are not yet a generic user-defined suite authoring API
- `--list-suites --target ...` and actual evaluator execution use the same target-relative discovery path

## Plugin Hooks

`aworld-cli evaluator` is a builtin plugin-backed command with narrow lifecycle hook points intended for CLI assembly concerns, not framework scoring semantics.

Available hook points:

- `evaluator.pre_discover`: inspect or annotate target/workspace inputs before suite discovery
- `evaluator.post_discover`: react to resolved suite candidates
- `evaluator.pre_run`: add lightweight CLI metadata before evaluation starts
- `evaluator.post_run`: upload or post-process the completed report
- `evaluator.render_summary`: augment rendered terminal summary text

Current event payloads:

- `evaluator.pre_discover`: `target`, `workspace_path`
- `evaluator.post_discover`: `target`, `workspace_path`, `suite_names`
- `evaluator.pre_run` for target mode: `mode`, `target`, `suite`, `workspace_path`
- `evaluator.pre_run` for source mode: `mode`, `input`, `kind`, `task_id`, `judge_agent`, `agent`, `workspace_path`, `output_path`
- `evaluator.post_run` for target mode: `mode`, `report`, `target`, `suite`, `workspace_path`
- `evaluator.post_run` for source mode: `mode`, `report`, `input`, `kind`, `task_id`, `judge_agent`, `agent`, `workspace_path`, `output_path`
- `evaluator.render_summary`: `report`, `workspace_path`

Hook boundaries:

- mutable hook state is limited to lightweight CLI assembly metadata
- hooks should not replace suite logic, judge logic, or gate calculation
- suitable side effects include report upload, notifications, and summary augmentation
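
As an illustration of a hook that stays within these boundaries, the sketch below builds a one-line notification from an `evaluator.post_run` payload without touching suite, judge, or gate logic. The handler signature (a single mapping argument) and the function name are assumptions; this document lists the payload keys but not how handlers are registered or invoked.

```python
from typing import Any, Mapping


def summarize_post_run(payload: Mapping[str, Any]) -> str:
    """Hypothetical post_run handler: read-only summary for a notification."""
    mode = payload.get("mode", "unknown")
    # Target mode carries `target`; source mode carries `input`.
    subject = payload.get("target") or payload.get("input") or "<unspecified>"
    report = payload.get("report") or {}
    marker = report.get("report_format", {})
    return (
        f"[{mode}] {subject}: "
        f"report {marker.get('id', '?')} v{marker.get('version', '?')}"
    )
```

The handler only reads from the payload and returns text, which matches the intended side effects (upload, notification, summary augmentation) rather than altering evaluation results.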

## Report Contract

Evaluator reports are JSON documents with a stable top-level format marker:

```json
{
"report_format": {
"id": "aworld.evaluator.report",
"version": 1
}
}
```

Key report sections:

- `metrics`: normalized aggregate metrics for the resolved suite
- `results`: per-case judge output plus normalized per-case metrics
- `gate`: structured `pass` / `fail` / `needs_approval` decision
- `automation`: exit-code-oriented summary fields for scripts and CI
- `suite_selection`: resolved/defaulted suite selection diagnostics
- `source_selection`: source input diagnostics for source-backed `aworld-cli evaluator --input ...`
- `approval`: approval decision metadata when the gate requires human confirmation

See [evaluator_report.example.json](/Users/wuman/Documents/workspace/aworld-mas/aworld/examples/aworld_quick_start/cli/evaluator_report.example.json) for a minimal example.
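
A consumer script can check the format marker before reading any other section. The following is a minimal sketch of such a check, assuming only the `report_format` structure shown above; `check_report_format` and `SUPPORTED_VERSIONS` are hypothetical names and are not part of `aworld-cli`.

```python
import json

EXPECTED_ID = "aworld.evaluator.report"
SUPPORTED_VERSIONS = {1}  # assumption: this consumer only understands version 1


def check_report_format(report_text: str) -> dict:
    """Parse a report and verify its top-level format marker."""
    report = json.loads(report_text)
    if not isinstance(report, dict):
        raise ValueError("report must be a JSON object")
    marker = report.get("report_format")
    if not isinstance(marker, dict):
        raise ValueError("missing report_format marker")
    if marker.get("id") != EXPECTED_ID:
        raise ValueError(f"unexpected report id: {marker.get('id')!r}")
    if marker.get("version") not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported report version: {marker.get('version')!r}")
    return report
```

Failing early on an unknown `id` or `version` keeps CI scripts from silently misreading a report whose layout has changed.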

Contributor

medium

The links on lines 110 and 170 use absolute file paths from your local machine (e.g., /Users/wuman/...). These will result in broken links for other users and in the rendered documentation. Please replace them with relative paths.
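
Links like these can be caught before review with a small check. The sketch below scans Markdown text for inline links whose target is an absolute filesystem path and shows how a relative replacement could be computed; the function names are hypothetical, and the regular expression only handles simple `[text](target)` links.

```python
import posixpath
import re

# Matches simple inline links: [text](target) with no spaces in the target.
LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")
WINDOWS_DRIVE_RE = re.compile(r"^[A-Za-z]:[\\/]")


def find_absolute_links(markdown: str) -> list[tuple[int, str]]:
    """Return (line_number, target) for links pointing at absolute paths."""
    hits = []
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        for match in LINK_RE.finditer(line):
            target = match.group(2)
            if target.startswith("/") or WINDOWS_DRIVE_RE.match(target):
                hits.append((lineno, target))
    return hits


def to_relative(target: str, doc_dir: str) -> str:
    """Rewrite an absolute POSIX path relative to the document's directory."""
    return posixpath.relpath(target, start=doc_dir)
```

Note that a leading `/` is also used for site-root links in some documentation setups, so hits from this check should be reviewed rather than rewritten automatically.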

Comment on lines +7 to +8
- [Evaluator](/Users/wuman/Documents/workspace/aworld-mas/aworld/docs/AWorld%20CLI/Commands/Evaluator.md): suite-backed evaluation, schema export, and report validation
- [Gateway](/Users/wuman/Documents/workspace/aworld-mas/aworld/docs/AWorld%20CLI/Commands/Gateway.md): multi-channel gateway lifecycle and command bridge behavior

Contributor

medium

The links in this section use absolute file paths from your local machine. This will cause them to be broken for other users and in the rendered documentation. Please update them to use relative paths.

@wuman001 wuman001 requested review from ahgpt and rainsonGain June 11, 2026 02:03
@ahgpt ahgpt merged commit c193218 into main Jun 11, 2026
1 check passed