Stabilize benchmark signal with variance-aware sampling and rolling baselines #4622

Open
Copilot wants to merge 2 commits into main from copilot/improve-benchmark-accuracy

Copilot AI commented Jun 12, 2026

Contributor

Benchmark results were overly sensitive to GitHub runner noise, causing high run-to-run deviation and unreliable PR deltas. This change makes benchmark comparisons more stable by increasing sample quality, gating on variance, and comparing against a rolling mainline baseline instead of a single latest run.

  • Sampling and runner stability

    • Benchmark workflow now uses higher-fidelity defaults (warmup=3, iterations=25).
    • Added configurable benchmark runner selection (workflow input + repo variable) to support dedicated/stable runners.
    • Pinned Node version in benchmark workflow setup to reduce environment drift.
  • Variance-aware execution (noise gate)

    • Added per-spec runtime variability statistics (mean, median, stddev, coefficient of variation (CV), min, max, sample count).
    • Introduced optional noise-gating reruns when total runtime CV exceeds threshold:
      • --noise-cv-threshold
      • --max-reruns
      • --rerun-iterations
    • Runner now records whether reruns were triggered and how many were performed.
  • Rolling baseline for PR comparisons

    • PR comment baseline now prefers a rolling aggregate over recent main history (results/history.json) with fallback to results/latest.json.
    • Added --baseline-window to control rolling window size.
    • Baseline labeling now distinguishes synthetic rolling baselines from commit SHAs.
  • Benchmark output and docs updates

    • Added shared statistics utilities for variability calculations.
    • Updated benchmark summaries/comments to surface variability context.
    • Updated benchmark README and tests to cover new CLI/options and formatting behavior.

```shell
node packages/benchmark/dist/src/cli.js run \
  --iterations 25 \
  --warmup 3 \
  --noise-cv-threshold 0.08 \
  --max-reruns 1 \
  --rerun-iterations 10 \
  --output /tmp/benchmark-results.json
```
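
To make the noise gate concrete, here is a minimal sketch of how the variability statistics and the rerun decision could fit together. It is not the code from this PR: the names `summarize`, `runWithNoiseGate`, `VariabilityStats`, and `NoiseGateOptions` are hypothetical, the option fields only mirror the CLI flags above, and the sketch assumes a sample standard deviation and that rerun samples are pooled with the original ones.

```typescript
// Hypothetical shape: the PR lists these statistics but not their field names.
interface VariabilityStats {
  mean: number;
  median: number;
  stddev: number;
  cv: number; // coefficient of variation = stddev / mean
  min: number;
  max: number;
  sampleCount: number;
}

// Hypothetical options mirroring --noise-cv-threshold, --max-reruns, --rerun-iterations.
interface NoiseGateOptions {
  noiseCvThreshold: number;
  maxReruns: number;
  rerunIterations: number;
}

interface GatedResult {
  stats: VariabilityStats;
  rerunTriggered: boolean;
  rerunsPerformed: number;
}

function summarize(samples: number[]): VariabilityStats {
  if (samples.length === 0) {
    throw new Error("summarize requires at least one sample");
  }
  const sorted = [...samples].sort((a, b) => a - b);
  const n = sorted.length;
  const mean = sorted.reduce((sum, x) => sum + x, 0) / n;
  const mid = Math.floor(n / 2);
  const median = n % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  // Assumption: sample standard deviation (n - 1 denominator); 0 for a single sample.
  const variance =
    n > 1 ? sorted.reduce((sum, x) => sum + (x - mean) ** 2, 0) / (n - 1) : 0;
  const stddev = Math.sqrt(variance);
  const cv = mean === 0 ? 0 : stddev / mean;
  return { mean, median, stddev, cv, min: sorted[0], max: sorted[n - 1], sampleCount: n };
}

// `measure` stands in for whatever actually times the benchmark; it returns
// one total-runtime sample per iteration.
function runWithNoiseGate(
  initialSamples: number[],
  measure: (iterations: number) => number[],
  options: NoiseGateOptions,
): GatedResult {
  let samples = [...initialSamples];
  let stats = summarize(samples);
  let rerunsPerformed = 0;
  // Rerun only while the run is noisier than the threshold and budget remains.
  while (stats.cv > options.noiseCvThreshold && rerunsPerformed < options.maxReruns) {
    // Assumption: rerun samples are pooled with the earlier ones.
    samples = samples.concat(measure(options.rerunIterations));
    stats = summarize(samples);
    rerunsPerformed++;
  }
  return { stats, rerunTriggered: rerunsPerformed > 0, rerunsPerformed };
}
```

With `noiseCvThreshold: 0.08` and `maxReruns: 1`, a run whose total-runtime CV is 0.01 is accepted as-is, while a run with a CV of 0.17 gets exactly one extra batch of samples, and the result records that a rerun happened even if the pooled CV is still above the threshold.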

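The rolling-baseline selection described above can be sketched in the same spirit. This is an illustration under stated assumptions, not the PR's implementation: `pickBaseline`, `BenchmarkRun`, and `Baseline` are hypothetical names, the history is assumed to be ordered oldest to newest, the aggregate is assumed to be a median (the description does not say which aggregate is used), and the label format is invented. Reading `results/history.json` and `results/latest.json` is left out so the logic stays self-contained.

```typescript
// Hypothetical record for one main-branch benchmark run.
interface BenchmarkRun {
  commit: string;
  totalMs: number;
}

// Hypothetical baseline shape; `synthetic` marks a rolling aggregate
// rather than a single commit.
interface Baseline {
  label: string;
  totalMs: number;
  synthetic: boolean;
}

function medianOf(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Prefer a rolling aggregate over the last `baselineWindow` history entries;
// fall back to the single latest run when no usable history exists.
function pickBaseline(
  history: BenchmarkRun[] | undefined,
  latest: BenchmarkRun | undefined,
  baselineWindow: number,
): Baseline | undefined {
  // Guard against a non-positive window: slice(-0) would return the whole array.
  const recent =
    history && baselineWindow > 0 ? history.slice(-baselineWindow) : [];
  if (recent.length > 0) {
    return {
      // Invented label format, chosen so it cannot be mistaken for a commit SHA.
      label: `rolling median of last ${recent.length} main runs`,
      totalMs: medianOf(recent.map((run) => run.totalMs)),
      synthetic: true,
    };
  }
  if (latest) {
    return { label: latest.commit.slice(0, 7), totalMs: latest.totalMs, synthetic: false };
  }
  return undefined;
}
```

A median is used here because a single slow runner in the window then shifts the baseline very little; a mean or trimmed mean would slot into the same place.
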
Copilot AI and others added 2 commits June 11, 2026 21:43
Co-authored-by: timotheeguerin <1031227+timotheeguerin@users.noreply.github.com>
Co-authored-by: timotheeguerin <1031227+timotheeguerin@users.noreply.github.com>
@azure-sdk

Collaborator

No changes needing a change description found.

@azure-sdk

Collaborator

You can try these changes here

🛝 Playground 🌐 Website