Skip to content

Cancel in-flight Run Sweep on PR merge#1368

Open
Oseltamivir wants to merge 1 commit into
mainfrom
workflow/cancel-sweep-on-merge
Open

Cancel in-flight Run Sweep on PR merge#1368
Oseltamivir wants to merge 1 commit into
mainfrom
workflow/cancel-sweep-on-merge

Conversation

@Oseltamivir
Copy link
Copy Markdown
Collaborator

Summary

  • Add .github/workflows/cancel-sweep-on-merge.yml that fires on pull_request: closed (merged=true) for PRs touching perf-changelog.yaml, and uses gh run cancel to terminate any in-flight (in_progress, queued, waiting) run-sweep.yml runs on the PR's head branch.
  • Self-contained workflow, no edits to the existing concurrency expression on run-sweep.yml.

Reproduction

Run 25781240028: PR #1330 merged at 05:56:01Z; the pull_request sweep that started 8 seconds earlier kept running on PR-allocated runners until cancelled manually. This workflow auto-cancels that case.

Notes

  • Scoped to paths: [perf-changelog.yaml] to match run-sweep.yml's own trigger filter.
  • Uses github.token with permissions: actions: write; no PAT needed.
  • Only fires on merged closes — abandoning a PR without merging leaves the sweep alone (it'll finish or hit its own timeout).

Trigger on pull_request.closed (merged=true) and gh run cancel any
in_progress/queued/waiting run-sweep.yml runs for the PR's head branch,
so a sweep that's still running when the PR is merged is reclaimed
automatically instead of needing a manual cancel.
Comment on lines +25 to +29
ids=$(gh api "repos/${REPO}/actions/workflows/run-sweep.yml/runs" \
-f "event=pull_request" \
-f "branch=${BRANCH}" \
-f "status=${status}" \
--jq '.workflow_runs[].id')
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 gh api defaults to POST as soon as any -f/-F/--input flag is present (it's GET only when no parameters are added). The list-runs endpoint /repos/{owner}/{repo}/actions/workflows/{workflow_id}/runs is GET-only, so this call will get a 4xx, ids ends up empty, and no in-flight sweep runs are ever cancelled — defeating the entire purpose of this workflow. Fix by adding --method GET, inlining the query string in the URL (matching the pattern in runner-offline-digest.yml:33), or switching to gh run list -w run-sweep.yml -b "$BRANCH" -e pull_request -s "$status" --json databaseId --jq '.[].databaseId'.

Extended reasoning...

What the bug is

Per the gh api manual: "The default HTTP request method is 'GET' normally and 'POST' if any parameters were added." Parameters are added via -f/--raw-field, -F/--field, or --input. This PR's new step passes three -f flags (event, branch, status) without an explicit --method, so gh api silently switches to POST and sends the fields as a JSON request body.

Why the call fails

The GitHub REST endpoint /repos/{owner}/{repo}/actions/workflows/{workflow_id}/runs only supports GET — there is no "create workflow run" verb at this path (workflow dispatch lives at .../dispatches). POSTing returns a 4xx (typically 404/405), gh api exits non-zero, and --jq runs against an error body, producing no IDs.

The silent-failure mode

Under set -euo pipefail, a failed command substitution inside an assignment does not abort bash — this is the well-known POSIX/bash quirk where the assignment is the simple command, not the substitution. So ids=$(gh api ...) ends up empty, the inner for run_id in $ids loop iterates zero times for each of the three statuses, and the step prints Cancelled 0 sweep run(s) with a green check. The workflow appears to work but accomplishes nothing — exactly the scenario the PR description says it's trying to fix.

Cross-references in this repo

The contrast with existing callers makes the convention clear:

  • .github/workflows/runner-offline-digest.yml:33 correctly inlines query params in the URL: gh api --paginate "/repos/$REPO/actions/runners?per_page=100" — no -f flags, stays GET.
  • .github/workflows/docker-tag-monitor.yml:186,192 use --method POST explicitly when POSTing.

This new file is the only place mixing -f flags with a GET-only endpoint and no explicit --method.

Step-by-step proof

  1. CI fires pull_request: closed with merged == true on a PR touching perf-changelog.yaml.
  2. Loop enters status=in_progress.
  3. gh api repos/OWNER/REPO/actions/workflows/run-sweep.yml/runs -f event=pull_request -f branch=feature-x -f status=in_progress --jq '.workflow_runs[].id' runs.
  4. Because three -f flags are present, gh api sends POST with body {"event":"pull_request","branch":"feature-x","status":"in_progress"}.
  5. GitHub responds with 4xx (no POST verb on this URL). gh exits non-zero; jq on the error JSON yields nothing.
  6. ids=$() — command substitution failure does not trip -e inside an assignment, so the script continues with ids empty.
  7. for run_id in $ids iterates 0 times.
  8. Same outcome for queued and waiting.
  9. Final echo prints Cancelled 0 sweep run(s) for branch feature-x. — green check, zero cancellations.

Fix

Any of:

  • gh api --method GET repos/.../runs -f event=pull_request -f branch="$BRANCH" -f status="$status" --jq ...
  • Inline the query string: gh api "repos/$REPO/actions/workflows/run-sweep.yml/runs?event=pull_request&branch=$BRANCH&status=$status" --jq ... (matches runner-offline-digest.yml:33)
  • gh run list -w run-sweep.yml -b "$BRANCH" -e pull_request -s "$status" --json databaseId --jq '.[].databaseId'

run: |
set -euo pipefail
cancelled=0
for status in in_progress queued waiting; do
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The status loop only covers in_progress queued waiting, but the GitHub list-workflow-runs API also accepts requested and pending as pre-completion states. A sweep dispatched seconds before merge could briefly be in requested (runner being provisioned) and be missed by the cancellation pass. Adding requested pending to the loop closes this narrow race.

Extended reasoning...

What the bug is

The cancellation loop at .github/workflows/cancel-sweep-on-merge.yml:24 iterates only in_progress queued waiting:

for status in in_progress queued waiting; do
    ids=$(gh api "repos/${REPO}/actions/workflows/run-sweep.yml/runs" \
        -f "event=pull_request" \
        -f "branch=${BRANCH}" \
        -f "status=${status}" \
        --jq '.workflow_runs[].id')
    ...
done

The status filter on GET /repos/{owner}/{repo}/actions/workflows/{workflow_id}/runs accepts a broader set of pre-completion values, including requested and pending. A run that has just been dispatched can briefly sit in requested while a runner is being provisioned before it transitions to queued.

Step-by-step proof

  1. A reviewer pushes a commit to a perf-changelog.yaml PR at T=0; run-sweep.yml is dispatched on the pull_request: synchronize event.
  2. At T=0.1s the workflow_run is created and enters requested while GitHub locates a runner.
  3. The reviewer merges the PR at T=2s (very plausibly the same kind of pre-merge window the PR description targets — the 25781240028 example was 8s).
  4. cancel-sweep-on-merge.yml fires on the pull_request: closed (merged=true) event and queries /runs?status=in_progress|queued|waiting. The run is in requested, so it is not listed in any of the three queries.
  5. The loop exits with cancelled=0; the sweep continues to a runner and starts the actual benchmark work, exactly the wasted-runner scenario the workflow was added to prevent.

Addressing the refutation

One reviewer argued that requested/pending are not real states for a pull_request-triggered workflow without environment gates — that pending is set only for first-time-contributor approval and requested only for workflow_run trigger chains. That description is largely correct for the steady-state meaning of these statuses, but the status filter values are not perfectly symmetric with the lifecycle phases a workflow run can transit. In practice the requested state is observable for very brief windows during initial run creation, and confirming verifiers note this is the exact window the PR is trying to win. Even if rare, the cost of adding two strings to the loop is negligible and the cost of missing the cancellation is a wasted set of perf runners — the same problem the PR is trying to solve.

Impact and fix

Impact is low because the window is narrow and the most likely state during the merge race (in_progress once a runner has picked up the job, or queued while waiting) is already covered — that's why this is filed as nit rather than blocking. The PR author should still consider this since the workflow is a single-pass best-effort design with no retry, so any state not in the loop is permanently missed for that invocation.

Fix is a one-line change:

for status in in_progress queued waiting requested pending; do

Comment on lines +31 to +33
echo "Cancelling ${status} run ${run_id}"
gh run cancel "${run_id}" --repo "${REPO}" || true
cancelled=$((cancelled + 1))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The cancelled counter at line 32 is incremented unconditionally even though gh run cancel ... || true swallows failures, so the final log "Cancelled ${cancelled} sweep run(s)" can overstate the actual count when a cancel attempt fails (e.g., the run completed between the list and cancel calls, transient API error). Purely an observability issue — fix is to gate the increment: if gh run cancel "${run_id}" --repo "${REPO}"; then cancelled=$((cancelled + 1)); fi, or rename the log to "Attempted to cancel N sweep run(s)".

Extended reasoning...

The bug

At lines 31-33 of .github/workflows/cancel-sweep-on-merge.yml:

echo "Cancelling ${status} run ${run_id}"
gh run cancel "${run_id}" --repo "${REPO}" || true
cancelled=$((cancelled + 1))

The || true is there for a legitimate reason — a cancel can fail benignly if the run completed between the API list call and the cancel call (a normal race), or if there's a transient API blip. But because the || true short-circuits the exit code to 0, the very next line cancelled=$((cancelled + 1)) runs unconditionally. The counter therefore records attempts, not successes.

How it manifests

The final line echo "Cancelled ${cancelled} sweep run(s) for branch ${BRANCH}." will overstate reality whenever gh run cancel fails. The most common failure mode is the very race this workflow is trying to win: between gh api .../runs?status=in_progress returning a run id and the subsequent gh run cancel call, the run can naturally complete. In that case gh run cancel exits non-zero (the run is no longer cancellable), but the script logs as if it had successfully cancelled.

Step-by-step proof

  1. Workflow lists in-progress runs: returns run id 12345.
  2. Run 12345 finishes naturally (it was about to complete anyway).
  3. Workflow calls gh run cancel 12345 --repo ... || truegh exits non-zero (cannot cancel a completed run), but || true masks it.
  4. cancelled=$((cancelled + 1)) runs anyway → counter is now 1.
  5. Final log: Cancelled 1 sweep run(s) for branch foo. — but zero runs were actually cancelled by this workflow.

Impact

Purely an observability/log-accuracy issue. The workflow's functional purpose (cancelling in-flight sweeps when a PR merges) is unaffected — the cancel attempt either lands or the run was already done, which is functionally equivalent. Nothing downstream consumes cancelled. An operator debugging "why is a sweep still running after merge?" might see "Cancelled 2" and conclude the workflow did its job, when in reality the cancels failed and the runs are still chugging — but in practice the failure modes that would matter (permissions error, run id stale) would be visible in the prior Cancelling ${status} run ${run_id} echo plus any gh stderr that surfaces.

Suggested fix

Either gate the increment on cancel success:

if gh run cancel "${run_id}" --repo "${REPO}"; then
    cancelled=$((cancelled + 1))
fi

…or rename the log to "Attempted to cancel ${cancelled} sweep run(s)" to keep the loose-counting semantics honest. Either is fine; this is polish, not a blocker.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant