Skip to content

Support named evaluator judge backends#920

Merged
ahgpt merged 1 commit into
mainfrom
fix/goal-steering
Jun 11, 2026
Merged

Support named evaluator judge backends#920
ahgpt merged 1 commit into
mainfrom
fix/goal-steering

Conversation

@wuman001

Copy link
Copy Markdown
Collaborator

Summary

  • Add source evaluator judge selectors for CLI agent/team names and process-local backend refs.
  • Add answer-quality judge fixtures for --judge-agent-name and --judge-backend-ref validation.
  • Update evaluator CLI docs for selector usage, hook payloads, and agent search path behavior.

Test Plan

  • pytest tests/docs/test_evaluator_report_docs.py tests/core/test_evaluator_runtime.py tests/core/test_evaluator_top_level_command.py tests/test_slash_commands.py tests/evaluations/test_answer_quality_judge_fixtures.py -q
  • python -m py_compile aworld-cli/src/aworld_cli/evaluator_runtime.py aworld-cli/src/aworld_cli/top_level_commands/evaluator_cmd.py aworld-cli/src/aworld_cli/commands/evaluation_cmd.py tests/fixtures/evaluator_judges/agents/answer_quality_judge.py tests/fixtures/evaluator_judges/answer_quality_backend.py

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces two new judge selectors, --judge-agent-name and --judge-backend-ref, to the aworld-cli evaluator command, making them mutually exclusive with the existing --judge-agent option. This allows running evaluations using a registered local agent/team by name or importing a custom Python factory/callable as a judge backend. The review feedback highlights several improvement opportunities in evaluator_runtime.py, including catching and wrapping import/attribute errors during dynamic module loading, preventing unbound method errors when a class implements execute, and resolving a concurrency race condition when caching the CLI agent executor.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +287 to +297
def _load_ref(ref: str) -> Any:
module_name, separator, attr_path = ref.partition(":")
if not separator or not module_name or not attr_path:
raise ValueError(f"judge backend ref must use module:callable format: {ref}")
module = importlib.import_module(module_name)
value: Any = module
for attr in attr_path.split("."):
if not attr:
raise ValueError(f"judge backend ref has an empty attribute segment: {ref}")
value = getattr(value, attr)
return value

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The importlib.import_module and getattr calls can raise ModuleNotFoundError, ImportError, or AttributeError if the user provides an invalid module or callable name via --judge-backend-ref. Since these exceptions are not caught by the except (FileNotFoundError, ValueError, KeyError) blocks in the CLI command handlers, they will cause the CLI to crash with a traceback.

We should catch these exceptions inside _load_ref and wrap them in a ValueError to ensure a clean error message is displayed to the user.

def _load_ref(ref: str) -> Any:
    module_name, separator, attr_path = ref.partition(":")
    if not separator or not module_name or not attr_path:
        raise ValueError(f"judge backend ref must use module:callable format: {ref}")
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        raise ValueError(f"Failed to import module '{module_name}': {exc}") from exc
    value: Any = module
    for attr in attr_path.split("."):
        if not attr:
            raise ValueError(f"judge backend ref has an empty attribute segment: {ref}")
        try:
            value = getattr(value, attr)
        except AttributeError as exc:
            raise ValueError(f"Attribute '{attr}' not found in '{module_name}': {exc}") from exc
    return value

Comment on lines +321 to +330
def _load_source_judge_backend_ref(ref: str) -> JudgeBackend:
value = _load_ref(ref)
if hasattr(value, "execute"):
return value
if callable(value) and _can_call_without_arguments(value):
produced = value()
if inspect.isawaitable(produced):
raise ValueError("judge backend ref factory must be synchronous")
return _coerce_source_judge_backend(produced, backend_id=f"judge-backend-ref:{ref}")
return _coerce_source_judge_backend(value, backend_id=f"judge-backend-ref:{ref}")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

If value is a class that implements execute (e.g., a custom subclass of JudgeBackend), hasattr(value, "execute") will evaluate to True on the class itself. Returning the class directly causes a runtime error when the framework attempts to call execute as an unbound method.

We should ensure we do not return the class directly, allowing it to fall through to the callable check where it can be instantiated.

Suggested change
def _load_source_judge_backend_ref(ref: str) -> JudgeBackend:
value = _load_ref(ref)
if hasattr(value, "execute"):
return value
if callable(value) and _can_call_without_arguments(value):
produced = value()
if inspect.isawaitable(produced):
raise ValueError("judge backend ref factory must be synchronous")
return _coerce_source_judge_backend(produced, backend_id=f"judge-backend-ref:{ref}")
return _coerce_source_judge_backend(value, backend_id=f"judge-backend-ref:{ref}")
def _load_source_judge_backend_ref(ref: str) -> JudgeBackend:
value = _load_ref(ref)
if hasattr(value, "execute") and not inspect.isclass(value):
return value
if callable(value) and _can_call_without_arguments(value):
produced = value()
if inspect.isawaitable(produced):
raise ValueError("judge backend ref factory must be synchronous")
return _coerce_source_judge_backend(produced, backend_id=f"judge-backend-ref:{ref}")
return _coerce_source_judge_backend(value, backend_id=f"judge-backend-ref:{ref}")

Comment on lines +333 to +349
def _build_cli_agent_judge_backend(*, agent_name: str, backend_id: str, prompt_builder):
executor_cache: dict[str, Any] = {}

async def _executor(prompt, system_prompt):
if isinstance(prompt, tuple):
raise ValueError("CLI agent judge backend only supports text prompts")
executor = executor_cache.get("executor")
if executor is None:
executor = await _load_cli_agent_executor(agent_name)
executor_cache["executor"] = executor
swarm = getattr(executor, "swarm", None)
if swarm is not None:
response = await Runners.run(input=str(prompt), swarm=swarm)
else:
response = await executor.chat(str(prompt))
return str(getattr(response, "answer", response))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Since _executor is an asynchronous function, concurrent evaluation tasks can trigger _load_cli_agent_executor simultaneously before the first call finishes caching the executor. This leads to redundant loading of agents and creation of multiple CLI runtimes/executors.

Using an asyncio.Lock with a double-checked locking pattern avoids this race condition efficiently.

def _build_cli_agent_judge_backend(*, agent_name: str, backend_id: str, prompt_builder):
    executor_cache: dict[str, Any] = {}
    lock = asyncio.Lock()

    async def _executor(prompt, system_prompt):
        if isinstance(prompt, tuple):
            raise ValueError("CLI agent judge backend only supports text prompts")
        executor = executor_cache.get("executor")
        if executor is None:
            async with lock:
                executor = executor_cache.get("executor")
                if executor is None:
                    executor = await _load_cli_agent_executor(agent_name)
                    executor_cache["executor"] = executor
        swarm = getattr(executor, "swarm", None)
        if swarm is not None:
            response = await Runners.run(input=str(prompt), swarm=swarm)
        else:
            response = await executor.chat(str(prompt))
        return str(getattr(response, "answer", response))

@wuman001 wuman001 requested a review from ahgpt June 11, 2026 02:49
@ahgpt ahgpt merged commit 98ad653 into main Jun 11, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants