Add source-backed evaluator CLI flows#919
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a comprehensive framework-owned evaluation input source layer, state adapters, and a replay runtime harness to support replaying existing outputs (such as task+answer files and AWorld trajectory logs) without re-executing targets. It also adds rollout-owning runtime composition, user simulators, step-level rewards, retry wrappers, environment isolation, and independent repeated trials with pass@k/pass^k metrics. On the CLI side, a source-backed evaluator mode is added along with report schema export, validation, and lifecycle hooks. The review feedback highlights a potential crash in StateCheckGrader.grade due to unhandled exceptions during non-numeric comparisons, a local import that should be moved to the top of the file per PEP 8, and several broken documentation links containing absolute local file paths.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| try: | ||
| actual = _resolve_path(sources[self.source], self.path) | ||
| passed = _compare_values(actual, self.op, self.expected) | ||
| reason = "matched" if passed else f"expected {self.expected!r}, got {actual!r}" | ||
| except KeyError: | ||
| actual = None | ||
| passed = False | ||
| reason = f"missing path: {'.'.join(self.path)}" |
There was a problem hiding this comment.
The _compare_values function attempts to convert values to float for comparison operators like >, <, etc. If the actual value from the state is a non-numeric string, this will raise a ValueError which is not handled in this try...except block. This could crash the evaluation for certain state check graders. Consider wrapping the call to _compare_values in its own try...except block to handle potential ValueError or TypeError exceptions gracefully.
try:
actual = _resolve_path(sources[self.source], self.path)
try:
passed = _compare_values(actual, self.op, self.expected)
reason = "matched" if passed else f"expected {self.expected!r}, got {actual!r}"
except (ValueError, TypeError):
passed = False
reason = f"incompatible types for comparison: expected comparable to {self.expected!r}, got {actual!r}"
except KeyError:
actual = None
passed = False
reason = f"missing path: {'.'.join(self.path)}"| target_path = resolve_cli_target_path(target) | ||
| workspace_path = str(target_path.parent if target_path.is_file() else target_path) | ||
| suite_selection = resolve_workspace_suite_selection(target=target, suite=suite) | ||
| from aworld.evaluations.substrate import resolve_eval_suite_selection |
There was a problem hiding this comment.
This from ... import ... statement is inside the run_evaluator_cli function. According to PEP 8, imports should usually be at the top of the file for clarity and to avoid potential issues with repeated imports. If this local import is intended to avoid a circular dependency or for a significant performance optimization, it would be good to add a comment explaining why. Otherwise, please move it to the top of the file for better code organization and readability.
| See [declared_evaluator_suite.example.json](/Users/wuman/Documents/workspace/aworld-mas/aworld/examples/aworld_quick_start/cli/declared_evaluator_suite.example.json) for a complete example. The current manifest schema is exported by `aworld_cli.evaluator_runtime.get_declared_evaluator_suite_schema()`. | ||
|
|
||
| Resolution rules: | ||
|
|
||
| - builtin suites are always available | ||
| - declared suites are discovered relative to the evaluation target workspace, not just the current shell cwd | ||
| - declared manifests currently extend `app-evaluator`; they are not yet a generic user-defined suite authoring API | ||
| - `--list-suites --target ...` and actual evaluator execution use the same target-relative discovery path | ||
|
|
||
| ## Plugin Hooks | ||
|
|
||
| `aworld-cli evaluator` is a builtin plugin-backed command with narrow lifecycle hook points intended for CLI assembly concerns, not framework scoring semantics. | ||
|
|
||
| Available hook points: | ||
|
|
||
| - `evaluator.pre_discover`: inspect or annotate target/workspace inputs before suite discovery | ||
| - `evaluator.post_discover`: react to resolved suite candidates | ||
| - `evaluator.pre_run`: add lightweight CLI metadata before evaluation starts | ||
| - `evaluator.post_run`: upload or post-process the completed report | ||
| - `evaluator.render_summary`: augment rendered terminal summary text | ||
|
|
||
| Current event payloads: | ||
|
|
||
| - `evaluator.pre_discover`: `target`, `workspace_path` | ||
| - `evaluator.post_discover`: `target`, `workspace_path`, `suite_names` | ||
| - `evaluator.pre_run` for target mode: `mode`, `target`, `suite`, `workspace_path` | ||
| - `evaluator.pre_run` for source mode: `mode`, `input`, `kind`, `task_id`, `judge_agent`, `agent`, `workspace_path`, `output_path` | ||
| - `evaluator.post_run` for target mode: `mode`, `report`, `target`, `suite`, `workspace_path` | ||
| - `evaluator.post_run` for source mode: `mode`, `report`, `input`, `kind`, `task_id`, `judge_agent`, `agent`, `workspace_path`, `output_path` | ||
| - `evaluator.render_summary`: `report`, `workspace_path` | ||
|
|
||
| Hook boundaries: | ||
|
|
||
| - mutable hook state is limited to lightweight CLI assembly metadata | ||
| - hooks should not replace suite logic, judge logic, or gate calculation | ||
| - suitable side effects include report upload, notifications, and summary augmentation | ||
|
|
||
| ## Report Contract | ||
|
|
||
| Evaluator reports are JSON documents with a stable top-level format marker: | ||
|
|
||
| ```json | ||
| { | ||
| "report_format": { | ||
| "id": "aworld.evaluator.report", | ||
| "version": 1 | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| Key report sections: | ||
|
|
||
| - `metrics`: normalized aggregate metrics for the resolved suite | ||
| - `results`: per-case judge output plus normalized per-case metrics | ||
| - `gate`: structured `pass` / `fail` / `needs_approval` decision | ||
| - `automation`: exit-code-oriented summary fields for scripts and CI | ||
| - `suite_selection`: resolved/defaulted suite selection diagnostics | ||
| - `source_selection`: source input diagnostics for source-backed `aworld-cli evaluator --input ...` | ||
| - `approval`: approval decision metadata when the gate requires human confirmation | ||
|
|
||
| See [evaluator_report.example.json](/Users/wuman/Documents/workspace/aworld-mas/aworld/examples/aworld_quick_start/cli/evaluator_report.example.json) for a minimal example. |
| - [Evaluator](/Users/wuman/Documents/workspace/aworld-mas/aworld/docs/AWorld%20CLI/Commands/Evaluator.md): suite-backed evaluation, schema export, and report validation | ||
| - [Gateway](/Users/wuman/Documents/workspace/aworld-mas/aworld/docs/AWorld%20CLI/Commands/Gateway.md): multi-channel gateway lifecycle and command bridge behavior |
Summary
Test Plan