
ci: cap cargo build parallelism to fix cachekit-lean MSRV OOM #26

Open
27Bslash6 wants to merge 1 commit into main from ci/fix-lean-runner-oom

Conversation

@27Bslash6

Contributor

Summary

Fixes the 6+ day red 1.85 (MSRV) CI job on main. It is not a code or MSRV bug — the crate compiles and tests clean on rustc 1.85.1 locally, and the committed Cargo.lock builds on 1.85.

The runner is being OOM-killed. The cachekit-lean ARC pod has a hard 6Gi memory cgroup (lab ADR-0001) and no build cache, so the 1.85 job runs a fully cold cargo test build of the 246-crate async/TLS/crypto dependency graph. cargo defaults -j to the visible core count (~24 via the pod CPU limit), and at that fan-out the cold build's peak RSS, dominated by full test-profile debuginfo at link time, exceeds 6Gi. The kernel OOM-kills the linker, the runner "loses communication with the server," and the job fails at the ~10-minute heartbeat timeout with no logs uploaded (the BlobNotFound symptom).

stable/beta pass because their preceding clippy step pre-builds the dependency graph, lowering the test step's peak. The 1.85 job skips clippy, concentrating the whole cold build into one step.

Evidence

  • Job annotation (survives the lost log blob): "The self-hosted runner lost communication with the server… terminates the runner process, starves it for CPU/Memory, or blocks its network access."
  • Step timeline: Run tests frozen in_progress, no Complete job, job failed at a clean ~10-min mark → killed mid-step, not a test failure.
  • Empirical repro: a cold cargo test --no-run -j24 inside a 6Gi no-swap cgroup is OOM-killed during the cachekit-rs test-binary link —
    kernel: Memory cgroup out of memory: Killed process … (ld) / scope: Failed with result 'oom-kill'.
  • Non-deterministic across re-runs (passed once, failed twice): the signature of a build running at the edge of a resource limit, not of a deterministic code bug.
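
One way to approximate the repro without access to the ARC pod is a memory-capped container. The Compose file below is a sketch under stated assumptions, not the setup that produced the kernel lines quoted above (those come from a cgroup scope). The service name, the `rust:1.85` image tag, the mount path, and the target directory are all illustrative.

```yaml
# Hypothetical local repro. Service name, image tag, and paths are assumptions.
services:
  msrv-oom-repro:
    image: rust:1.85                 # assumed Rust image tag for the MSRV toolchain
    working_dir: /src
    volumes:
      - .:/src                       # repository checkout
    environment:
      CARGO_TARGET_DIR: /tmp/target  # fresh directory inside the container, so the build is cold
    mem_limit: 6g                    # mirrors the 6Gi cgroup on cachekit-lean
    memswap_limit: 6g                # equal to mem_limit, so no swap is available
    command: cargo test --no-run -j24
```

Swapping `-j24` for `-j4` in `command` exercises the capped configuration under the same memory limit.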

Changes

  • ci.yml: CARGO_BUILD_JOBS=4 caps concurrent compile/link jobs so peak RSS stays well under 6Gi on the cache-less lean pod (the ~6× reduction in fan-out, from ~24 to 4, is the fix).
  • ci.yml: timeout-minutes on test (20) and wasm (15) so a wedged runner fails fast instead of hanging to the heartbeat timeout.
  • release.yml: same CARGO_BUILD_JOBS cap on the publish job (also runs on cachekit-lean), plus a 1.85 MSRV compile check before publish so a broken floor can't ship silently. Release previously built stable only, which is how 0.3.0 published while the MSRV job was red.
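
For orientation, the changes listed above could look roughly like the fragments below. This is a sketch, not the actual diff. The job and step layout, the `runs-on` labels, the checkout action version, the placement of `env` at workflow or job level, and the exact MSRV check command are assumptions. Only `CARGO_BUILD_JOBS=4`, the two `timeout-minutes` values, and the existence of a 1.85 check before publish come from the description.

```yaml
# .github/workflows/ci.yml (sketch; layout is assumed, toolchain setup omitted)
env:
  CARGO_BUILD_JOBS: 4          # cap cargo's parallel jobs for every cargo step

jobs:
  test:
    runs-on: cachekit-lean     # assumed runner label
    timeout-minutes: 20
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        run: cargo test
  wasm:
    runs-on: cachekit-lean     # assumed; the description does not say where wasm runs
    timeout-minutes: 15
    steps:
      - uses: actions/checkout@v4
      # wasm build steps omitted
---
# .github/workflows/release.yml (sketch; layout is assumed)
jobs:
  publish:
    runs-on: cachekit-lean
    env:
      CARGO_BUILD_JOBS: 4
    steps:
      - uses: actions/checkout@v4
      - name: MSRV compile check (1.85)
        # Assumes rustup is available on the runner.
        run: |
          rustup toolchain install 1.85 --profile minimal
          cargo +1.85 check --locked
      - name: Publish
        run: cargo publish       # registry credentials omitted
```

Setting the cap through the `CARGO_BUILD_JOBS` environment variable rather than a per-command `-j` flag means every cargo invocation in the job inherits it, including any steps added later.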

MSRV floor stays at 1.85 (deliberate, required for edition 2024). The toolchain is not bumped; lowering the -j value, not raising the MSRV, is the fix.

Validation

The 1.85 job on this PR runs at -j4 on cachekit-lean — green here proves the fix in the real constrained environment. If it proves marginal at the link spike, the CARGO_BUILD_JOBS value will be lowered.

Note: the broken ARC log persistence (BlobNotFound) is a separate runner-side observability defect tracked outside this repo; preventing the OOM is what makes the job pass.

Closes #25

The 1.85 (MSRV) matrix job has been red on main for 6+ days. This is not
a code or MSRV regression: the crate compiles and tests clean on rustc
1.85.1 locally, and the committed Cargo.lock builds on 1.85.

Root cause is an out-of-memory kill on the self-hosted runner. The
cachekit-lean ARC pod has a hard 6Gi memory cgroup (lab ADR-0001) and no
build cache, so the 1.85 job runs a fully cold `cargo test` build of the
246-crate async/TLS/crypto dependency graph. cargo defaults -j to the
visible core count (~24 via the pod CPU limit), and at that fan-out the
cold build's peak RSS — dominated by full test-profile debuginfo at link
time — exceeds 6Gi. The kernel OOM-kills the linker, the runner loses
communication with the server, and the job fails at the ~10-minute
heartbeat reckoning with no logs uploaded (BlobNotFound). stable/beta
pass because the preceding clippy step pre-builds the dependency graph,
lowering the test step's peak; the 1.85 job skips clippy and concentrates
the whole cold build into one step.

Confirmed empirically: a cold `cargo test --no-run -j24` inside a 6Gi
no-swap cgroup is OOM-killed during the cachekit-rs test-binary link
(kernel: "Memory cgroup out of memory: Killed process ... (ld)").

Changes:
- ci.yml: CARGO_BUILD_JOBS=4 caps concurrent compile/link jobs so peak
  RSS stays well under 6Gi on the cache-less lean pod.
- ci.yml: timeout-minutes on the test (20) and wasm (15) jobs so a
  wedged runner fails fast instead of hanging to the heartbeat timeout.
- release.yml: same CARGO_BUILD_JOBS cap on the publish job (also runs on
  cachekit-lean), plus a 1.85 MSRV compile check before publish so a
  broken floor cannot ship silently — release previously built stable
  only, which is how 0.3.0 published while the MSRV job was red.

Refs #25

Development

Successfully merging this pull request may close these issues.

CI red on main: 1.85 (MSRV) job fails reproducibly — code passes locally, ARC runner logs not persisted
