Add source-backed evaluator CLI flows by wuman001 · Pull Request #919 · inclusionAI/AWorld

wuman001 · 2026-06-10T14:15:41Z

Summary

Add source-backed evaluator CLI flows for task, answer, and trajectory evaluation.
Run task inputs through the default CLI main agent when --agent is omitted, then judge the produced state.
Support trajectory replay, generated trajectory evaluation, answer veto gating, and keep local eval artifacts out of Git.

Test Plan

pytest tests/docs/test_evaluator_report_docs.py tests/evaluations/test_evaluation_input_sources.py tests/core/test_evaluator_runtime.py tests/core/test_evaluator_top_level_command.py tests/test_slash_commands.py -q
openspec validate aworld-cli-evaluator-source-run-2026-06-10 --strict
python -m py_compile aworld/evaluations/sources.py aworld/evaluations/report.py aworld-cli/src/aworld_cli/evaluator_runtime.py aworld-cli/src/aworld_cli/commands/evaluation_cmd.py aworld-cli/src/aworld_cli/top_level_commands/evaluator_cmd.py
conda run -n aworld_env aworld-cli evaluator --input ~/Documents/task_input.jsonl --kind task --judge-agent eval/trajectory_evaluator/answer_quality_agent.md --out-dir eval/trajectory_evaluator/reports

gemini-code-assist

Code Review

This pull request introduces a comprehensive framework-owned evaluation input source layer, state adapters, and a replay runtime harness to support replaying existing outputs (such as task+answer files and AWorld trajectory logs) without re-executing targets. It also adds rollout-owning runtime composition, user simulators, step-level rewards, retry wrappers, environment isolation, and independent repeated trials with pass@k/pass^k metrics. On the CLI side, a source-backed evaluator mode is added along with report schema export, validation, and lifecycle hooks. The review feedback highlights a potential crash in StateCheckGrader.grade due to unhandled exceptions during non-numeric comparisons, a local import that should be moved to the top of the file per PEP 8, and several broken documentation links containing absolute local file paths.
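For readers unfamiliar with the two trial metrics named above, here is a minimal sketch of how pass@k and pass^k are commonly estimated from `n` independent trials of which `c` passed. The PR page does not show how AWorld computes them, so the function names and the combinatorial estimators below are assumptions used for illustration only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k trials, drawn without replacement from n, passed."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the draws that contain only failing trials.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Chance that all k trials, drawn without replacement from n, passed (pass^k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(c, k) counts the draws that contain only passing trials.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # 5 trials, 2 passed, k = 2
    print(pass_at_k(5, 2, 2), pass_hat_k(5, 2, 2))
```

pass@k rewards a target that succeeds at least once, while pass^k rewards one that succeeds every time, so the two diverge sharply for flaky targets.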

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-10T14:20:03Z

+        try:
+            actual = _resolve_path(sources[self.source], self.path)
+            passed = _compare_values(actual, self.op, self.expected)
+            reason = "matched" if passed else f"expected {self.expected!r}, got {actual!r}"
+        except KeyError:
+            actual = None
+            passed = False
+            reason = f"missing path: {'.'.join(self.path)}"


The _compare_values function attempts to convert values to float for comparison operators like >, <, etc. If the actual value from the state is a non-numeric string, this will raise a ValueError which is not handled in this try...except block. This could crash the evaluation for certain state check graders. Consider wrapping the call to _compare_values in its own try...except block to handle potential ValueError or TypeError exceptions gracefully.

        try:
            actual = _resolve_path(sources[self.source], self.path)
            try:
                passed = _compare_values(actual, self.op, self.expected)
                reason = "matched" if passed else f"expected {self.expected!r}, got {actual!r}"
            except (ValueError, TypeError):
                passed = False
                reason = f"incompatible types for comparison: expected comparable to {self.expected!r}, got {actual!r}"
        except KeyError:
            actual = None
            passed = False
            reason = f"missing path: {'.'.join(self.path)}"
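The failure mode can be reproduced in isolation with a small, self-contained sketch. The names `resolve_path`, `compare_values`, and `grade_state_check` are hypothetical stand-ins written for this example. The PR's private `_resolve_path` and `_compare_values` are not shown on this page and may behave differently.

```python
import operator
from typing import Any, Mapping, Sequence

# Hypothetical operator table; the real grader's supported operators are not shown.
_NUMERIC_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def resolve_path(source: Any, path: Sequence[str]) -> Any:
    """Walk nested mappings, raising KeyError when a segment cannot be resolved."""
    current = source
    for key in path:
        if not isinstance(current, Mapping):
            raise KeyError(key)
        current = current[key]
    return current


def compare_values(actual: Any, op: str, expected: Any) -> bool:
    if op == "==":
        return actual == expected
    if op == "!=":
        return actual != expected
    if op not in _NUMERIC_OPS:
        raise ValueError(f"unsupported operator: {op!r}")
    # float() raises ValueError for "abc" and TypeError for None.
    return _NUMERIC_OPS[op](float(actual), float(expected))


def grade_state_check(sources: Mapping[str, Any], source: str,
                      path: Sequence[str], op: str, expected: Any) -> dict:
    try:
        actual = resolve_path(sources[source], path)
    except KeyError:
        return {"passed": False, "actual": None,
                "reason": f"missing path: {'.'.join(path)}"}
    try:
        passed = compare_values(actual, op, expected)
    except (ValueError, TypeError) as exc:
        # A non-comparable value becomes a failed check instead of a crash.
        return {"passed": False, "actual": actual,
                "reason": f"cannot compare {actual!r} {op} {expected!r}: {exc}"}
    reason = "matched" if passed else f"expected {expected!r}, got {actual!r}"
    return {"passed": passed, "actual": actual, "reason": reason}


if __name__ == "__main__":
    state = {"state": {"order": {"total": "abc", "count": "3"}}}
    print(grade_state_check(state, "state", ["order", "total"], ">", 1))
    print(grade_state_check(state, "state", ["order", "count"], ">", 1))
```

Splitting path resolution and comparison into two `try` blocks keeps a `KeyError` from the lookup distinct from a type problem in the comparison, so each failure gets its own reason string.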

gemini-code-assist · 2026-06-10T14:20:03Z

+    target_path = resolve_cli_target_path(target)
+    workspace_path = str(target_path.parent if target_path.is_file() else target_path)
+    suite_selection = resolve_workspace_suite_selection(target=target, suite=suite)
+    from aworld.evaluations.substrate import resolve_eval_suite_selection


This from ... import ... statement is inside the run_evaluator_cli function. According to PEP 8, imports should usually be at the top of the file for clarity and to avoid potential issues with repeated imports. If this local import is intended to avoid a circular dependency or for a significant performance optimization, it would be good to add a comment explaining why. Otherwise, please move it to the top of the file for better code organization and readability.
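If the local import does turn out to be intentional, a documented deferred import might look like the sketch below. `export_report` is a hypothetical function, and the standard-library `json` module merely stands in for a module that would otherwise cause a circular import or slow CLI startup.

```python
def export_report(report: dict) -> str:
    """Serialize a report, importing the serializer only when it is needed."""
    # Deferred import, kept local on purpose: in the real code this comment would
    # state the concrete reason (for example, avoiding a circular import).
    import json

    return json.dumps(report, sort_keys=True)


if __name__ == "__main__":
    print(export_report({"b": 1, "a": 2}))
```

Python caches imported modules in `sys.modules`, so repeating the import on every call costs little. The main price is readability, which is why the explanatory comment matters.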

gemini-code-assist · 2026-06-10T14:20:03Z

+See [declared_evaluator_suite.example.json](/Users/wuman/Documents/workspace/aworld-mas/aworld/examples/aworld_quick_start/cli/declared_evaluator_suite.example.json) for a complete example. The current manifest schema is exported by `aworld_cli.evaluator_runtime.get_declared_evaluator_suite_schema()`.
+
+Resolution rules:
+
+- builtin suites are always available
+- declared suites are discovered relative to the evaluation target workspace, not just the current shell cwd
+- declared manifests currently extend `app-evaluator`; they are not yet a generic user-defined suite authoring API
+- `--list-suites --target ...` and actual evaluator execution use the same target-relative discovery path
+
+## Plugin Hooks
+
+`aworld-cli evaluator` is a builtin plugin-backed command with narrow lifecycle hook points intended for CLI assembly concerns, not framework scoring semantics.
+
+Available hook points:
+
+- `evaluator.pre_discover`: inspect or annotate target/workspace inputs before suite discovery
+- `evaluator.post_discover`: react to resolved suite candidates
+- `evaluator.pre_run`: add lightweight CLI metadata before evaluation starts
+- `evaluator.post_run`: upload or post-process the completed report
+- `evaluator.render_summary`: augment rendered terminal summary text
+
+Current event payloads:
+
+- `evaluator.pre_discover`: `target`, `workspace_path`
+- `evaluator.post_discover`: `target`, `workspace_path`, `suite_names`
+- `evaluator.pre_run` for target mode: `mode`, `target`, `suite`, `workspace_path`
+- `evaluator.pre_run` for source mode: `mode`, `input`, `kind`, `task_id`, `judge_agent`, `agent`, `workspace_path`, `output_path`
+- `evaluator.post_run` for target mode: `mode`, `report`, `target`, `suite`, `workspace_path`
+- `evaluator.post_run` for source mode: `mode`, `report`, `input`, `kind`, `task_id`, `judge_agent`, `agent`, `workspace_path`, `output_path`
+- `evaluator.render_summary`: `report`, `workspace_path`
+
+Hook boundaries:
+
+- mutable hook state is limited to lightweight CLI assembly metadata
+- hooks should not replace suite logic, judge logic, or gate calculation
+- suitable side effects include report upload, notifications, and summary augmentation
+
+## Report Contract
+
+Evaluator reports are JSON documents with a stable top-level format marker:
+
+```json
+{
+  "report_format": {
+    "id": "aworld.evaluator.report",
+    "version": 1
+  }
+}
+```
+
+Key report sections:
+
+- `metrics`: normalized aggregate metrics for the resolved suite
+- `results`: per-case judge output plus normalized per-case metrics
+- `gate`: structured `pass` / `fail` / `needs_approval` decision
+- `automation`: exit-code-oriented summary fields for scripts and CI
+- `suite_selection`: resolved/defaulted suite selection diagnostics
+- `source_selection`: source input diagnostics for source-backed `aworld-cli evaluator --input ...`
+- `approval`: approval decision metadata when the gate requires human confirmation
+
+See [evaluator_report.example.json](/Users/wuman/Documents/workspace/aworld-mas/aworld/examples/aworld_quick_start/cli/evaluator_report.example.json) for a minimal example.


The links on lines 110 and 170 use absolute file paths from your local machine (e.g., /Users/wuman/...). These will result in broken links for other users and in the rendered documentation. Please replace them with relative paths.
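One way to derive the relative link is to compute it from the directory of the Markdown file that contains it. The sketch below assumes the repository root is `/Users/wuman/Documents/workspace/aworld-mas/aworld` and that the linking file is `docs/AWorld CLI/Commands/Evaluator.md`. Both are guesses made for illustration, and `relative_link` is a hypothetical helper.

```python
import posixpath


def relative_link(doc_path: str, target_path: str) -> str:
    """Return target_path relative to the directory containing doc_path."""
    return posixpath.relpath(target_path, start=posixpath.dirname(doc_path))


if __name__ == "__main__":
    root = "/Users/wuman/Documents/workspace/aworld-mas/aworld"  # assumed repo root
    doc = f"{root}/docs/AWorld CLI/Commands/Evaluator.md"  # assumed location of the doc
    target = f"{root}/examples/aworld_quick_start/cli/evaluator_report.example.json"
    print(relative_link(doc, target))
```

`posixpath` is used instead of `os.path` so the result always uses forward slashes, which is what Markdown links expect regardless of the platform running the script.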

gemini-code-assist · 2026-06-10T14:20:03Z

+- [Evaluator](/Users/wuman/Documents/workspace/aworld-mas/aworld/docs/AWorld%20CLI/Commands/Evaluator.md): suite-backed evaluation, schema export, and report validation
+- [Gateway](/Users/wuman/Documents/workspace/aworld-mas/aworld/docs/AWorld%20CLI/Commands/Gateway.md): multi-channel gateway lifecycle and command bridge behavior


The links in this section use absolute file paths from your local machine. This will cause them to be broken for other users and in the rendered documentation. Please update them to use relative paths.

wuman001 added 30 commits June 1, 2026 21:47

feat: add suite-backed evaluator flow

6f0d7c4

feat: add evaluator judge timeout fallback

b4edd91

feat: improve evaluator cli flow

45cc8e5

feat: add visual-aware evaluator suite resolution

0e7cffc

feat: expose evaluator suite selection diagnostics

d0bbade

feat: add structured evaluator report metadata

5eb15dc

feat: export evaluator report schema

dd421c7

feat: tighten evaluator report schema

850d7fa

feat: add evaluator report validation command

57e9d6a

docs: add evaluator report contract guides

67fbf7c

feat: load declared evaluator suites

262188b

docs: add declared evaluator suite contract

1f94675

fix: refresh declared evaluator suite discovery

339a89e

fix: isolate declared evaluator suites by workspace

859c77e

feat: add execution-backed evaluator substrate

834efc1

docs: align evaluator substrate docs

2cc7c30

feat: extend evaluator v2 substrate

8dea37d

fix: harden evaluator v2 contracts

99c9201

docs: add evaluator v2 openspec change

fb7a2c4

fix: harden evaluator trajectory configuration

9feb9b6

docs: add evaluator runtime composition change

34202d4

docs: align runtime composition outcomes and trials

444a705

docs: specify runtime outcome and trial boundaries

c2fb3c0

feat: add evaluator runtime composition

9f48600

docs: add evaluator trials pass metrics change

ef66197

feat: add evaluator trial pass metrics

ff2c1ac

feat: add evaluator environment isolation

e4ebc7c

feat: add adaptive evaluator user simulator

caee5e5

test: add manual trajectory evaluator replay case

de008e9

fix: trim evaluator report metadata and replay metrics

56225de

wuman001 added 9 commits June 10, 2026 19:47

feat: add evaluator input source framework

5f8bc4f

feat: add source-backed evaluator cli run

2429022

test: add source cli trajectory manual replay

cb955f0

feat: simplify evaluator source commands

e4bb56a

fix: avoid nested loop in evaluation slash command

cfaaaf5

feat: add answer quality evaluator agent

3aad470

Add source-backed evaluator task flows

a098b64

Ignore local evaluator artifacts

1502cef

Gate answer evaluations on veto signal

512c9b4

gemini-code-assist Bot reviewed Jun 10, 2026

wuman001 added 2 commits June 11, 2026 09:54

Handle non-numeric state check comparisons

f9c40d5

Remove openspec changes from PR

84998aa

wuman001 requested review from ahgpt and rainsonGain June 11, 2026 02:03

ahgpt approved these changes Jun 11, 2026

ahgpt merged commit c193218 into main Jun 11, 2026
1 check passed

ahgpt merged 41 commits into main from fix/goal-steering
