Stabilize benchmark signal with variance-aware sampling and rolling baselines by Copilot · Pull Request #4622 · Azure/typespec-azure

Copilot · 2026-06-12T00:35:24Z

Benchmark results were overly sensitive to GitHub runner noise, causing high run-to-run deviation and unreliable PR deltas. This change makes benchmark comparisons more stable by increasing sample quality, gating on variance, and comparing against a rolling mainline baseline instead of a single latest run.

Sampling and runner stability
- Benchmark workflow now uses higher-fidelity defaults (warmup=3, iterations=25).
- Added configurable benchmark runner selection (workflow input + repo variable) to support dedicated/stable runners.
- Pinned Node version in benchmark workflow setup to reduce environment drift.

Variance-aware execution (noise gate)
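- Added per-spec runtime variability statistics (mean/median/stddev/CV/min/max/sample count), where CV is the coefficient of variation (stddev divided by mean).
- Introduced optional noise-gating reruns, triggered when the total runtime CV exceeds a threshold and controlled by: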
  - --noise-cv-threshold
  - --max-reruns
  - --rerun-iterations
- Runner now records whether reruns were triggered and how many were performed.
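
The noise gate described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `summarize`, `runWithNoiseGate`, the `measure` callback, and all field names are hypothetical, and two choices are assumptions (sample standard deviation with an n - 1 denominator, and pooling rerun samples with the original ones instead of replacing them).

```typescript
// Summary statistics for a set of runtime samples (milliseconds).
interface RuntimeStats {
  mean: number;
  median: number;
  stddev: number;
  cv: number; // coefficient of variation = stddev / mean
  min: number;
  max: number;
  sampleCount: number;
}

function summarize(samples: number[]): RuntimeStats {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const n = sorted.length;
  const mean = sorted.reduce((sum, x) => sum + x, 0) / n;
  const mid = Math.floor(n / 2);
  const median = n % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  // Assumption: sample standard deviation (n - 1); the PR may use the population form.
  const variance = n > 1 ? sorted.reduce((sum, x) => sum + (x - mean) ** 2, 0) / (n - 1) : 0;
  const stddev = Math.sqrt(variance);
  const cv = mean === 0 ? 0 : stddev / mean;
  return { mean, median, stddev, cv, min: sorted[0], max: sorted[n - 1], sampleCount: n };
}

// Hypothetical option names mirroring --noise-cv-threshold, --max-reruns, --rerun-iterations.
interface NoiseGateOptions {
  noiseCvThreshold: number;
  maxReruns: number;
  rerunIterations: number;
}

interface GatedResult {
  stats: RuntimeStats;
  rerunsTriggered: boolean;
  rerunCount: number;
}

// `measure(iterations)` is a hypothetical callback returning one total runtime per iteration.
function runWithNoiseGate(
  measure: (iterations: number) => number[],
  iterations: number,
  options: NoiseGateOptions,
): GatedResult {
  let samples = measure(iterations);
  let stats = summarize(samples);
  let rerunCount = 0;
  // Rerun only while the run is noisier than the threshold and the rerun budget allows.
  while (stats.cv > options.noiseCvThreshold && rerunCount < options.maxReruns) {
    // Assumption: rerun samples are pooled with the earlier ones.
    samples = samples.concat(measure(options.rerunIterations));
    stats = summarize(samples);
    rerunCount++;
  }
  return { stats, rerunsTriggered: rerunCount > 0, rerunCount };
}
```

With a threshold of 0.08, a run whose total runtime has a standard deviation above 8% of its mean would trigger a rerun, up to the configured maximum.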

Rolling baseline for PR comparisons
- PR comment baseline now prefers a rolling aggregate over recent main history (results/history.json) with fallback to results/latest.json.
- Added --baseline-window to control rolling window size.
- Baseline labeling now distinguishes synthetic rolling baselines from commit SHAs.
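
As a rough sketch of the baseline selection above: prefer an aggregate over the most recent main runs and fall back to the latest single result. The PR does not show the schema of results/history.json or the aggregation function, so `HistoryEntry`, `pickBaseline`, the `rolling(N)` label format, and the use of a median are all assumptions made for illustration. The history is assumed to be ordered oldest to newest.

```typescript
// Hypothetical shape of one entry in results/history.json (real schema not shown in the PR).
interface HistoryEntry {
  commit: string;
  totalMs: number;
}

interface Baseline {
  label: string; // a synthetic rolling label or a short commit SHA
  totalMs: number;
}

function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function pickBaseline(
  history: HistoryEntry[],
  latest: HistoryEntry | undefined,
  baselineWindow: number,
): Baseline | undefined {
  // Take the last `baselineWindow` entries; guard against slice(-0), which returns everything.
  const recent = baselineWindow > 0 ? history.slice(-baselineWindow) : [];
  if (recent.length > 0) {
    // Assumption: the median is used so a single noisy main run does not skew the baseline.
    return { label: `rolling(${recent.length})`, totalMs: median(recent.map((e) => e.totalMs)) };
  }
  // Fallback: compare against the single latest result, labeled by its short SHA.
  if (latest) {
    return { label: latest.commit.slice(0, 7), totalMs: latest.totalMs };
  }
  return undefined;
}
```

Labeling the rolling case differently from a commit SHA keeps the PR comment honest about what the delta is measured against.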

Benchmark output and docs updates
- Added shared statistics utilities for variability calculations.
- Updated benchmark summaries/comments to surface variability context.
- Updated benchmark README and tests to cover new CLI/options and formatting behavior.

node packages/benchmark/dist/src/cli.js run \
  --iterations 25 \
  --warmup 3 \
  --noise-cv-threshold 0.08 \
  --max-reruns 1 \
  --rerun-iterations 10 \
  --output /tmp/benchmark-results.json

Co-authored-by: timotheeguerin <1031227+timotheeguerin@users.noreply.github.com>

azure-sdk · 2026-06-12T00:43:37Z

No changes needing a change description found.

azure-sdk · 2026-06-12T00:52:35Z

You can try these changes here

🛝 Playground	🌐 Website

Copilot AI and others added 2 commits June 11, 2026 21:43

Implement benchmark variance and rolling baseline

468b5e6

Co-authored-by: timotheeguerin <1031227+timotheeguerin@users.noreply.github.com>

Address benchmark review follow-ups

e6f1059

Co-authored-by: timotheeguerin <1031227+timotheeguerin@users.noreply.github.com>

Copilot AI assigned Copilot and timotheeguerin Jun 12, 2026

Copilot created this pull request from a session on behalf of timotheeguerin June 12, 2026 00:35 View session

microsoft-github-policy-service Bot added the eng label Jun 12, 2026

timotheeguerin marked this pull request as ready for review June 12, 2026 00:41

timotheeguerin requested review from bterlson, markcowl and timotheeguerin as code owners June 12, 2026 00:42

Copilot wants to merge 2 commits into main from copilot/improve-benchmark-accuracy
