chore: sync public mirror from internal by haasonsaas · Pull Request #395 · evalops/maestro

haasonsaas · 2026-05-12T17:25:34Z

Summary

sync the sanitized public tree from evalops/maestro-internal
keep evalops/maestro as a generated public mirror of the private source of truth
preserve public-owned CI and trusted-publishing workflows from the public checkout
internal source SHA: bf0d22c180540dad2fd97f83400b14628300b625
last generated public sync base: 6aa0e4d7456d3c430baac1f2d6d7488a60fdbb92
previewed public-tree drift: 8 file(s) to copy/update and 0 stale file(s) to delete
public-only commits since last generated sync: 0

Source-of-truth status

Public Mirror Drift Audit

package: @evalops/maestro
private source: https://github.com/evalops/maestro-internal@main (bf0d22c18054)
public projection: https://github.com/evalops/maestro@main (6aa0e4d7456d)
files to copy or update: 8
stale files to delete: 0
result: drift detected
invariant: public_projection_has_drift
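
The audit above boils down to comparing two file trees. A minimal sketch of that comparison follows, assuming each tree is indexed as a map from path to content hash. Every name here (`TreeIndex`, `auditDrift`, `isPublicOwned`, and the `public_projection_in_sync` value) is hypothetical; only `public_projection_has_drift` appears in the audit output, and the real workflow may compute drift differently.

```typescript
// Hypothetical sketch: none of these names come from the sync workflow itself.
type TreeIndex = Map<string, string>; // repo-relative path -> content hash

interface DriftReport {
  copyOrUpdate: string[]; // missing from, or different in, the public projection
  staleDelete: string[]; // present publicly but absent from the sanitized source
  invariant: "public_projection_in_sync" | "public_projection_has_drift";
}

function auditDrift(
  sanitizedSource: TreeIndex,
  publicProjection: TreeIndex,
  // Paths the public repo owns (e.g. its CI workflows) are skipped entirely.
  isPublicOwned: (path: string) => boolean = () => false,
): DriftReport {
  const copyOrUpdate: string[] = [];
  sanitizedSource.forEach((hash, path) => {
    if (isPublicOwned(path)) return;
    if (publicProjection.get(path) !== hash) copyOrUpdate.push(path);
  });

  const staleDelete: string[] = [];
  publicProjection.forEach((_hash, path) => {
    if (!isPublicOwned(path) && !sanitizedSource.has(path)) staleDelete.push(path);
  });

  copyOrUpdate.sort();
  staleDelete.sort();
  const hasDrift = copyOrUpdate.length > 0 || staleDelete.length > 0;
  return {
    copyOrUpdate,
    staleDelete,
    invariant: hasDrift ? "public_projection_has_drift" : "public_projection_in_sync",
  };
}
```

Under these assumptions, a run matching this PR would report eight entries in `copyOrUpdate` and none in `staleDelete`.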

Sample Changed Paths

copy/update packages/tui-rs/src/headless/generated_protocol.rs
copy/update scripts/evals/run-platform-fermata-scenario-suite.ts
copy/update src/agent/providers/scripted.ts
copy/update src/platform/fermata-eval-client.ts
copy/update src/platform/fermata-scenario-suite.ts
copy/update test/agent/scripted-provider.test.ts
copy/update test/platform/fermata-eval-client.test.ts
copy/update test/platform/fermata-scenario-suite.test.ts

Guidance

Let internal main generate and merge the public sync PR before relying on public main.


Public-only commits since last generated sync

none detected since last generated sync

Validation

generated by the sync-public-release-mirror workflow in public-tree mode

Test Plan

generated by the sync-public-release-mirror workflow in public-tree mode
public-source-provenance require-internal-pr check confirms internal source PR lineage
CI, integration, rust-hosted-conformance, coverage, Socket, and Cursor checks must pass before merge

Staged Rollout

Staging is unnecessary for this generated mirror PR: it does not independently promote user-visible behavior. It mirrors already-reviewed internal source from evalops/maestro-internal@bf0d22c180540dad2fd97f83400b14628300b625, including existing hidden/evaluation surfaces, and keeps public package parity behind the established public-source-provenance gate.

cursor · 2026-05-12T17:25:42Z

PR Summary

Medium Risk
Expands the Fermata scenario-suite payload/schema and CLI surface (new assertion kinds and judge options), which can change evaluation behavior and CI outcomes if misconfigured.

Overview
Adds native Fermata trajectory guarding to scenario suites by emitting an ASSERTION_KIND_AGENT_TRAJECTORY case derived from recorded scenario provenance, assertion statuses, budgets, and trace join keys.

Extends Fermata LLM judging with pairwise rubric support (ASSERTION_KIND_LLM_PAIRWISE_RUBRIC) plus new judge fields (advisoryOnly, rubricVersion, calibrationCohort) and corresponding CLI/env parsing in run-platform-fermata-scenario-suite.ts.
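
Since the risk note warns that misconfiguration can change CI outcomes, strict parsing of the new judge fields is one way to fail early. The following is a rough sketch only: the field names `advisoryOnly`, `rubricVersion`, and `calibrationCohort` come from the summary above, but the environment variable names and helper functions are invented for illustration and are not taken from run-platform-fermata-scenario-suite.ts.

```typescript
// Field names come from the PR summary; everything else is hypothetical.
interface JudgeOptions {
  advisoryOnly: boolean;
  rubricVersion: string | undefined;
  calibrationCohort: string | undefined;
}

// Accept only an explicit set of spellings so a typo fails loudly
// instead of silently turning the judge into (or out of) advisory mode.
function parseBooleanFlag(name: string, raw: string | undefined): boolean {
  const value = (raw ?? "").trim().toLowerCase();
  if (value === "" || value === "0" || value === "false") return false;
  if (value === "1" || value === "true") return true;
  throw new Error(`${name} must be one of 1, 0, true, false (got "${raw}")`);
}

// Treat unset and whitespace-only values as "not provided".
function optionalText(raw: string | undefined): string | undefined {
  const value = (raw ?? "").trim();
  return value === "" ? undefined : value;
}

// Hypothetical variable names; the real script may use different ones.
function parseJudgeOptions(env: Record<string, string | undefined>): JudgeOptions {
  return {
    advisoryOnly: parseBooleanFlag(
      "FERMATA_JUDGE_ADVISORY_ONLY",
      env["FERMATA_JUDGE_ADVISORY_ONLY"],
    ),
    rubricVersion: optionalText(env["FERMATA_JUDGE_RUBRIC_VERSION"]),
    calibrationCohort: optionalText(env["FERMATA_JUDGE_CALIBRATION_COHORT"]),
  };
}
```

Passing the environment in as a plain record, rather than reading a global, keeps the parser easy to exercise in unit tests.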

Tweaks scripted replay validation/error messaging (tool-call id type check and transient error wording) and updates/expands tests to assert the new Connect payload shapes and additional generated cases.

Reviewed by Cursor Bugbot for commit 8132ac6. Bugbot is set up for automated code reviews on this repo.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d703f2fdc6


chatgpt-codex-connector · 2026-05-12T17:33:08Z

 			if (statement.kind === "error") {
 				partial.stopReason = "error";
-				partial.errorMessage = scriptedErrorMessage(statement);
+				partial.errorMessage = statement.message;
 				yield { type: "error", reason: "error", error: partial };


Keep transient scripted errors retryable

When a scripted frame emits kind: "error" with type: "transient", we now pass through statement.message verbatim, which drops the explicit retry hint that isRetryableError previously matched (e.g., try again). In practice this turns transient replay failures into non-retryable terminal errors unless the fixture author manually includes retry keywords, so scripted eval runs can fail immediately instead of exercising retry behavior.
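
One way to address this is sketched below, assuming the retry classifier keys on the phrase "try again" as the comment suggests. The statement shape, the `"fatal"` type, and this stand-in `isRetryableError` are assumptions; the real `scriptedErrorMessage` and `isRetryableError` are not shown in the PR and may behave differently.

```typescript
// Assumed statement shape: only kind "error", type "transient", and message
// are mentioned in the review. The "fatal" variant is hypothetical.
interface ScriptedErrorStatement {
  kind: "error";
  type?: "transient" | "fatal";
  message: string;
}

// Stand-in for the real classifier, whose matching rules are not shown here.
function isRetryableError(message: string): boolean {
  return /try again/i.test(message);
}

// Keep the author's message, but guarantee that transient errors carry a
// retry hint so the retry path is still exercised during scripted replays.
function scriptedErrorMessage(statement: ScriptedErrorStatement): string {
  if (statement.type !== "transient") return statement.message;
  return isRetryableError(statement.message)
    ? statement.message
    : `${statement.message} (transient; try again)`;
}
```

With this shape, fixture authors do not need to remember retry keywords: marking a statement `type: "transient"` is enough, and non-transient messages pass through unchanged.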


chatgpt-codex-connector · 2026-05-12T17:33:08Z

 					`Replay scenario ${label} frame ${frame.index} statement ${statementOffset} tool_call tool must be a non-empty string`,
 				);


Reinstate validation for tool_call id type

The parser no longer rejects non-string tool_call.id values even though the scenario contract expects id?: string. That allows invalid fixtures (for example numeric IDs) to pass parsing and flow into tool-call emission/matching paths, where IDs are treated as strings, creating hard-to-debug mismatches in tool result correlation and violating the expected wire shape.
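
The missing check could look roughly like the following. The helper name `parseToolCallId` and its `where` parameter are hypothetical; the error wording only imitates the style of the existing "tool must be a non-empty string" message and is not copied from the parser.

```typescript
// Hypothetical helper: validates the optional tool_call id (`id?: string`).
// `where` is a caller-supplied location prefix such as
// "Replay scenario demo frame 0 statement 1".
function parseToolCallId(id: unknown, where: string): string | undefined {
  // An absent id is allowed because the contract marks it optional.
  if (id === undefined) return undefined;
  if (typeof id !== "string") {
    throw new Error(`${where} tool_call id must be a string when provided`);
  }
  return id;
}
```

Rejecting a numeric id at parse time, with the frame and statement location in the message, points the failure at the fixture rather than at a later tool-result correlation mismatch. Whether an empty string should also be rejected is left open here, since the review does not say.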


chatgpt-codex-connector Bot reviewed May 12, 2026


github-actions Bot and others added 2 commits May 12, 2026 20:58

chore: sync public mirror from internal

878cdd8

fix: preserve scripted replay validation

8132ac6

haasonsaas force-pushed the sync/public-release-mirror branch from d703f2f to 8132ac6 · May 13, 2026 03:59

haasonsaas merged commit 1490f0b into main May 13, 2026
10 checks passed

haasonsaas deleted the sync/public-release-mirror branch May 13, 2026 04:05
