chore: sync public mirror from internal#395

Merged
haasonsaas merged 2 commits into main from sync/public-release-mirror
May 13, 2026
Conversation

@haasonsaas
Contributor

Summary

  • sync the sanitized public tree from evalops/maestro-internal
  • keep evalops/maestro as a generated public mirror of the private source of truth
  • preserve public-owned CI and trusted-publishing workflows from the public checkout
  • internal source SHA: bf0d22c180540dad2fd97f83400b14628300b625
  • last generated public sync base: 6aa0e4d7456d3c430baac1f2d6d7488a60fdbb92
  • previewed public-tree drift: 8 file(s) to copy/update and 0 stale file(s) to delete
  • public-only commits since last generated sync: 0

Source-of-truth status

Public Mirror Drift Audit

  • package: @evalops/maestro
  • private source: https://github.com/evalops/maestro-internal@main (bf0d22c18054)
  • public projection: https://github.com/evalops/maestro@main (6aa0e4d7456d)
  • files to copy or update: 8
  • stale files to delete: 0
  • result: drift detected
  • invariant: public_projection_has_drift
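
The audit counts above can be read as a two-way comparison between the private source tree and the public projection. The following is a minimal sketch of that comparison, not the workflow's actual implementation: the `Tree` shape (path mapped to a content hash), the `auditDrift` name, and the `"in sync"` label are all assumptions made for illustration.

```typescript
// Hypothetical model: a tree maps a repo-relative path to a content hash.
type Tree = Record<string, string>;

interface DriftReport {
  copyOrUpdate: string[]; // present in source, missing or different in the projection
  staleToDelete: string[]; // present in the projection, absent from source
  result: "drift detected" | "in sync";
}

function has(tree: Tree, path: string): boolean {
  return Object.prototype.hasOwnProperty.call(tree, path);
}

function auditDrift(source: Tree, projection: Tree): DriftReport {
  // A file needs copying when the projection lacks it or holds different content.
  const copyOrUpdate = Object.keys(source)
    .filter((path) => !has(projection, path) || projection[path] !== source[path])
    .sort();
  // A file is stale when it only exists on the projection side.
  const staleToDelete = Object.keys(projection)
    .filter((path) => !has(source, path))
    .sort();
  const drifted = copyOrUpdate.length > 0 || staleToDelete.length > 0;
  return { copyOrUpdate, staleToDelete, result: drifted ? "drift detected" : "in sync" };
}

// Usage with made-up paths and hashes.
const report = auditDrift(
  { "src/a.ts": "h1", "src/b.ts": "h2" },
  { "src/a.ts": "h1", "src/b.ts": "old", "src/c.ts": "h3" },
);
```

Under this model, the PR's "8 file(s) to copy/update and 0 stale file(s) to delete" would correspond to eight entries in `copyOrUpdate` and an empty `staleToDelete`.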

Sample Changed Paths

  • copy/update packages/tui-rs/src/headless/generated_protocol.rs
  • copy/update scripts/evals/run-platform-fermata-scenario-suite.ts
  • copy/update src/agent/providers/scripted.ts
  • copy/update src/platform/fermata-eval-client.ts
  • copy/update src/platform/fermata-scenario-suite.ts
  • copy/update test/agent/scripted-provider.test.ts
  • copy/update test/platform/fermata-eval-client.test.ts
  • copy/update test/platform/fermata-scenario-suite.test.ts

Guidance

Let internal main generate and merge the public sync PR before relying on public main.

Public-only commits since last generated sync

  • none detected since last generated sync

Validation

  • generated by the sync-public-release-mirror workflow in public-tree mode

Test Plan

  • generated by the sync-public-release-mirror workflow in public-tree mode
  • public-source-provenance require-internal-pr check confirms internal source PR lineage
  • CI, integration, rust-hosted-conformance, coverage, Socket, and Cursor checks must pass before merge

Staged Rollout

  • Staging is unnecessary for this generated mirror PR: it does not independently promote user-visible behavior. It mirrors already-reviewed internal source from evalops/maestro-internal@bf0d22c180540dad2fd97f83400b14628300b625, including existing hidden/evaluation surfaces, and keeps public package parity behind the established public-source-provenance gate.

@cursor

cursor Bot commented May 12, 2026

PR Summary

Medium Risk
Expands the Fermata scenario-suite payload/schema and CLI surface (new assertion kinds and judge options), which can change evaluation behavior and CI outcomes if misconfigured.

Overview
Adds native Fermata trajectory guarding to scenario suites by emitting an ASSERTION_KIND_AGENT_TRAJECTORY case derived from recorded scenario provenance, assertion statuses, budgets, and trace join keys.

Extends Fermata LLM judging with pairwise rubric support (ASSERTION_KIND_LLM_PAIRWISE_RUBRIC) plus new judge fields (advisoryOnly, rubricVersion, calibrationCohort) and corresponding CLI/env parsing in run-platform-fermata-scenario-suite.ts.
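
As a rough illustration of the env parsing mentioned above, the sketch below reads the three new judge fields from an environment-like object. Only the field names `advisoryOnly`, `rubricVersion`, and `calibrationCohort` come from the summary; the `EXAMPLE_JUDGE_*` variable names, the helper functions, and the defaults are placeholders and are not the names used in run-platform-fermata-scenario-suite.ts.

```typescript
// Field names are from the PR summary; everything else here is assumed.
interface JudgeOptions {
  advisoryOnly: boolean;
  rubricVersion: string | undefined;
  calibrationCohort: string | undefined;
}

type Env = Record<string, string | undefined>;

// Accepts "1"/"true" and "0"/"false"; an unset or blank value yields the fallback.
function parseBoolean(raw: string | undefined, fallback: boolean): boolean {
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = raw.trim().toLowerCase();
  if (value === "1" || value === "true") return true;
  if (value === "0" || value === "false") return false;
  throw new Error(`Expected a boolean, got "${raw}"`);
}

// Treats a blank string the same as an unset variable.
function nonEmpty(raw: string | undefined): string | undefined {
  const value = raw === undefined ? "" : raw.trim();
  return value === "" ? undefined : value;
}

// Placeholder variable names, chosen only for this example.
function parseJudgeOptions(env: Env): JudgeOptions {
  return {
    advisoryOnly: parseBoolean(env["EXAMPLE_JUDGE_ADVISORY_ONLY"], false),
    rubricVersion: nonEmpty(env["EXAMPLE_JUDGE_RUBRIC_VERSION"]),
    calibrationCohort: nonEmpty(env["EXAMPLE_JUDGE_CALIBRATION_COHORT"]),
  };
}

const judge = parseJudgeOptions({
  EXAMPLE_JUDGE_ADVISORY_ONLY: "true",
  EXAMPLE_JUDGE_RUBRIC_VERSION: " v2 ",
});
```

Rejecting unrecognized boolean strings instead of silently defaulting is one way to address the misconfiguration risk the summary flags.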

Tweaks scripted replay validation/error messaging (tool-call id type check and transient error wording) and updates/expands tests to assert the new Connect payload shapes and additional generated cases.

Reviewed by Cursor Bugbot for commit 8132ac6.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d703f2fdc6

Comment on lines 520 to 523
  if (statement.kind === "error") {
    partial.stopReason = "error";
-   partial.errorMessage = scriptedErrorMessage(statement);
+   partial.errorMessage = statement.message;
    yield { type: "error", reason: "error", error: partial };
P1: Keep transient scripted errors retryable

When a scripted frame emits kind: "error" with type: "transient", we now pass through statement.message verbatim, which drops the explicit retry hint that isRetryableError previously matched (e.g., try again). In practice this turns transient replay failures into non-retryable terminal errors unless the fixture author manually includes retry keywords, so scripted eval runs can fail immediately instead of exercising retry behavior.
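
One way to keep transient errors retryable is to append a retry hint only when the fixture's message lacks one. The sketch below is hedged: `scriptedErrorMessage` and `isRetryableError` are named in this thread, but their real signatures and matching rules are not shown here, so both bodies, the statement shape, and the hint wording are assumptions.

```typescript
// Hypothetical statement shape; the real type in scripted.ts may differ.
interface ScriptedErrorStatement {
  kind: "error";
  type?: string;
  message: string;
}

// Assumed stand-in for isRetryableError: matches retry keywords in the message.
const RETRY_HINT = /try again|retry/i;

function isRetryableError(message: string): boolean {
  return RETRY_HINT.test(message);
}

// Assumed stand-in for scriptedErrorMessage: keeps the author's wording and
// appends a hint to transient errors that do not already carry one.
function scriptedErrorMessage(statement: ScriptedErrorStatement): string {
  if (statement.type === "transient" && !isRetryableError(statement.message)) {
    return `${statement.message} (transient, try again)`;
  }
  return statement.message;
}

const transientMessage = scriptedErrorMessage({
  kind: "error",
  type: "transient",
  message: "upstream overloaded",
});
```

With this shape, fixture authors would not need to remember retry keywords, and non-transient messages still pass through verbatim.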

Comment on lines 183 to 184
`Replay scenario ${label} frame ${frame.index} statement ${statementOffset} tool_call tool must be a non-empty string`,
);
P2: Reinstate validation for tool_call id type

The parser no longer rejects non-string tool_call.id values even though the scenario contract expects id?: string. That allows invalid fixtures (for example numeric IDs) to pass parsing and flow into tool-call emission/matching paths, where IDs are treated as strings, creating hard-to-debug mismatches in tool result correlation and violating the expected wire shape.
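
A type check of the kind described could look like the sketch below. It assumes a simplified statement shape based on the `id?: string` contract quoted above; `parseToolCall`, its `where` label parameter, and the wording of the id error are hypothetical and not taken from the repository's parser.

```typescript
// Hypothetical statement shape derived from the `id?: string` contract.
interface ToolCallStatement {
  tool: string;
  id?: string;
}

// `where` is an illustrative location label, e.g. "Replay scenario s1 frame 0 statement 2".
function parseToolCall(raw: Record<string, unknown>, where: string): ToolCallStatement {
  const tool = raw["tool"];
  if (typeof tool !== "string" || tool.length === 0) {
    throw new Error(`${where} tool_call tool must be a non-empty string`);
  }
  const id = raw["id"];
  // Reject numeric or other non-string ids before they reach tool-result matching.
  if (id !== undefined && typeof id !== "string") {
    throw new Error(`${where} tool_call id must be a string when provided`);
  }
  return id === undefined ? { tool } : { tool, id };
}

const parsed = parseToolCall({ tool: "read_file", id: "call_1" }, "frame 0 statement 0");
```

Failing at parse time keeps the error next to the offending fixture instead of surfacing later as a mismatched tool result.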

haasonsaas force-pushed the sync/public-release-mirror branch from d703f2f to 8132ac6 on May 13, 2026 03:59
haasonsaas merged commit 1490f0b into main on May 13, 2026
10 checks passed
haasonsaas deleted the sync/public-release-mirror branch on May 13, 2026 04:05