Cancel in-flight Run Sweep on PR merge#1368
Conversation
Trigger on pull_request.closed (merged=true) and gh run cancel any in_progress/queued/waiting run-sweep.yml runs for the PR's head branch, so a sweep that's still running when the PR is merged is reclaimed automatically instead of needing a manual cancel.
| ids=$(gh api "repos/${REPO}/actions/workflows/run-sweep.yml/runs" \ | ||
| -f "event=pull_request" \ | ||
| -f "branch=${BRANCH}" \ | ||
| -f "status=${status}" \ | ||
| --jq '.workflow_runs[].id') |
There was a problem hiding this comment.
🔴 gh api defaults to POST as soon as any -f/-F/--input flag is present (it's GET only when no parameters are added). The list-runs endpoint /repos/{owner}/{repo}/actions/workflows/{workflow_id}/runs is GET-only, so this call will get a 4xx, ids ends up empty, and no in-flight sweep runs are ever cancelled — defeating the entire purpose of this workflow. Fix by adding --method GET, inlining the query string in the URL (matching the pattern in runner-offline-digest.yml:33), or switching to gh run list -w run-sweep.yml -b "$BRANCH" -e pull_request -s "$status" --json databaseId --jq '.[].databaseId'.
Extended reasoning...
What the bug is
Per the gh api manual: "The default HTTP request method is 'GET' normally and 'POST' if any parameters were added." Parameters are added via -f/--raw-field, -F/--field, or --input. This PR's new step passes three -f flags (event, branch, status) without an explicit --method, so gh api silently switches to POST and sends the fields as a JSON request body.
Why the call fails
The GitHub REST endpoint /repos/{owner}/{repo}/actions/workflows/{workflow_id}/runs only supports GET — there is no "create workflow run" verb at this path (workflow dispatch lives at .../dispatches). POSTing returns a 4xx (typically 404/405), gh api exits non-zero, and --jq runs against an error body, producing no IDs.
The silent-failure mode
Under set -euo pipefail, a failed command substitution inside an assignment does not abort bash — this is the well-known POSIX/bash quirk where the assignment is the simple command, not the substitution. So ids=$(gh api ...) ends up empty, the inner for run_id in $ids loop iterates zero times for each of the three statuses, and the step prints Cancelled 0 sweep run(s) with a green check. The workflow appears to work but accomplishes nothing — exactly the scenario the PR description says it's trying to fix.
Cross-references in this repo
The contrast with existing callers makes the convention clear:
.github/workflows/runner-offline-digest.yml:33correctly inlines query params in the URL:gh api --paginate "/repos/$REPO/actions/runners?per_page=100"— no-fflags, stays GET..github/workflows/docker-tag-monitor.yml:186,192use--method POSTexplicitly when POSTing.
This new file is the only place mixing -f flags with a GET-only endpoint and no explicit --method.
Step-by-step proof
- CI fires
pull_request: closedwithmerged == trueon a PR touchingperf-changelog.yaml. - Loop enters
status=in_progress. gh api repos/OWNER/REPO/actions/workflows/run-sweep.yml/runs -f event=pull_request -f branch=feature-x -f status=in_progress --jq '.workflow_runs[].id'runs.- Because three
-fflags are present,gh apisendsPOSTwith body{"event":"pull_request","branch":"feature-x","status":"in_progress"}. - GitHub responds with 4xx (no POST verb on this URL).
ghexits non-zero;jqon the error JSON yields nothing. ids=$()— command substitution failure does not trip-einside an assignment, so the script continues withidsempty.for run_id in $idsiterates 0 times.- Same outcome for
queuedandwaiting. - Final
echoprintsCancelled 0 sweep run(s) for branch feature-x.— green check, zero cancellations.
Fix
Any of:
gh api --method GET repos/.../runs -f event=pull_request -f branch="$BRANCH" -f status="$status" --jq ...- Inline the query string:
gh api "repos/$REPO/actions/workflows/run-sweep.yml/runs?event=pull_request&branch=$BRANCH&status=$status" --jq ...(matchesrunner-offline-digest.yml:33) gh run list -w run-sweep.yml -b "$BRANCH" -e pull_request -s "$status" --json databaseId --jq '.[].databaseId'
| run: | | ||
| set -euo pipefail | ||
| cancelled=0 | ||
| for status in in_progress queued waiting; do |
There was a problem hiding this comment.
🟡 The status loop only covers in_progress queued waiting, but the GitHub list-workflow-runs API also accepts requested and pending as pre-completion states. A sweep dispatched seconds before merge could briefly be in requested (runner being provisioned) and be missed by the cancellation pass. Adding requested pending to the loop closes this narrow race.
Extended reasoning...
What the bug is
The cancellation loop at .github/workflows/cancel-sweep-on-merge.yml:24 iterates only in_progress queued waiting:
for status in in_progress queued waiting; do
ids=$(gh api "repos/${REPO}/actions/workflows/run-sweep.yml/runs" \
-f "event=pull_request" \
-f "branch=${BRANCH}" \
-f "status=${status}" \
--jq '.workflow_runs[].id')
...
doneThe status filter on GET /repos/{owner}/{repo}/actions/workflows/{workflow_id}/runs accepts a broader set of pre-completion values, including requested and pending. A run that has just been dispatched can briefly sit in requested while a runner is being provisioned before it transitions to queued.
Step-by-step proof
- A reviewer pushes a commit to a
perf-changelog.yamlPR atT=0;run-sweep.ymlis dispatched on thepull_request: synchronizeevent. - At
T=0.1sthe workflow_run is created and entersrequestedwhile GitHub locates a runner. - The reviewer merges the PR at
T=2s(very plausibly the same kind of pre-merge window the PR description targets — the 25781240028 example was 8s). cancel-sweep-on-merge.ymlfires on thepull_request: closed (merged=true)event and queries/runs?status=in_progress|queued|waiting. The run is inrequested, so it is not listed in any of the three queries.- The loop exits with
cancelled=0; the sweep continues to a runner and starts the actual benchmark work, exactly the wasted-runner scenario the workflow was added to prevent.
Addressing the refutation
One reviewer argued that requested/pending are not real states for a pull_request-triggered workflow without environment gates — that pending is set only for first-time-contributor approval and requested only for workflow_run trigger chains. That description is largely correct for the steady-state meaning of these statuses, but the status filter values are not perfectly symmetric with the lifecycle phases a workflow run can transit. In practice the requested state is observable for very brief windows during initial run creation, and confirming verifiers note this is the exact window the PR is trying to win. Even if rare, the cost of adding two strings to the loop is negligible and the cost of missing the cancellation is a wasted set of perf runners — the same problem the PR is trying to solve.
Impact and fix
Impact is low because the window is narrow and the most likely state during the merge race (in_progress once a runner has picked up the job, or queued while waiting) is already covered — that's why this is filed as nit rather than blocking. The PR author should still consider this since the workflow is a single-pass best-effort design with no retry, so any state not in the loop is permanently missed for that invocation.
Fix is a one-line change:
for status in in_progress queued waiting requested pending; do| echo "Cancelling ${status} run ${run_id}" | ||
| gh run cancel "${run_id}" --repo "${REPO}" || true | ||
| cancelled=$((cancelled + 1)) |
There was a problem hiding this comment.
🟡 The cancelled counter at line 32 is incremented unconditionally even though gh run cancel ... || true swallows failures, so the final log "Cancelled ${cancelled} sweep run(s)" can overstate the actual count when a cancel attempt fails (e.g., the run completed between the list and cancel calls, transient API error). Purely an observability issue — fix is to gate the increment: if gh run cancel "${run_id}" --repo "${REPO}"; then cancelled=$((cancelled + 1)); fi, or rename the log to "Attempted to cancel N sweep run(s)".
Extended reasoning...
The bug
At lines 31-33 of .github/workflows/cancel-sweep-on-merge.yml:
echo "Cancelling ${status} run ${run_id}"
gh run cancel "${run_id}" --repo "${REPO}" || true
cancelled=$((cancelled + 1))The || true is there for a legitimate reason — a cancel can fail benignly if the run completed between the API list call and the cancel call (a normal race), or if there's a transient API blip. But because the || true short-circuits the exit code to 0, the very next line cancelled=$((cancelled + 1)) runs unconditionally. The counter therefore records attempts, not successes.
How it manifests
The final line echo "Cancelled ${cancelled} sweep run(s) for branch ${BRANCH}." will overstate reality whenever gh run cancel fails. The most common failure mode is the very race this workflow is trying to win: between gh api .../runs?status=in_progress returning a run id and the subsequent gh run cancel call, the run can naturally complete. In that case gh run cancel exits non-zero (the run is no longer cancellable), but the script logs as if it had successfully cancelled.
Step-by-step proof
- Workflow lists in-progress runs: returns run id
12345. - Run
12345finishes naturally (it was about to complete anyway). - Workflow calls
gh run cancel 12345 --repo ... || true—ghexits non-zero (cannot cancel a completed run), but|| truemasks it. cancelled=$((cancelled + 1))runs anyway → counter is now 1.- Final log:
Cancelled 1 sweep run(s) for branch foo.— but zero runs were actually cancelled by this workflow.
Impact
Purely an observability/log-accuracy issue. The workflow's functional purpose (cancelling in-flight sweeps when a PR merges) is unaffected — the cancel attempt either lands or the run was already done, which is functionally equivalent. Nothing downstream consumes cancelled. An operator debugging "why is a sweep still running after merge?" might see "Cancelled 2" and conclude the workflow did its job, when in reality the cancels failed and the runs are still chugging — but in practice the failure modes that would matter (permissions error, run id stale) would be visible in the prior Cancelling ${status} run ${run_id} echo plus any gh stderr that surfaces.
Suggested fix
Either gate the increment on cancel success:
if gh run cancel "${run_id}" --repo "${REPO}"; then
cancelled=$((cancelled + 1))
fi…or rename the log to "Attempted to cancel ${cancelled} sweep run(s)" to keep the loose-counting semantics honest. Either is fine; this is polish, not a blocker.
Summary
.github/workflows/cancel-sweep-on-merge.ymlthat fires onpull_request: closed(merged=true) for PRs touchingperf-changelog.yaml, and usesgh run cancelto terminate any in-flight (in_progress,queued,waiting)run-sweep.ymlruns on the PR's head branch.run-sweep.yml.Reproduction
Run 25781240028: PR #1330 merged at 05:56:01Z; the
pull_requestsweep that started 8 seconds earlier kept running on PR-allocated runners until cancelled manually. This workflow auto-cancels that case.Notes
paths: [perf-changelog.yaml]to matchrun-sweep.yml's own trigger filter.github.tokenwithpermissions: actions: write; no PAT needed.