ci: cap cargo build parallelism to fix cachekit-lean MSRV OOM #26
27Bslash6 wants to merge 1 commit
The 1.85 (MSRV) matrix job has been red on main for 6+ days. This is not a code or MSRV regression: the crate compiles and tests clean on rustc 1.85.1 locally, and the committed Cargo.lock builds on 1.85.

Root cause is an out-of-memory kill on the self-hosted runner. The cachekit-lean ARC pod has a hard 6Gi memory cgroup (lab ADR-0001) and no build cache, so the 1.85 job runs a fully cold `cargo test` build of the 246-crate async/TLS/crypto dependency graph. cargo defaults -j to the visible core count (~24 via the pod CPU limit), and at that fan-out the cold build's peak RSS, dominated by full test-profile debuginfo at link time, exceeds 6Gi. The kernel OOM-kills the linker, the runner loses communication with the server, and the job fails at the ~10-minute heartbeat reckoning with no logs uploaded (BlobNotFound).

stable/beta pass because the preceding clippy step pre-builds the dependency graph, lowering the test step's peak; the 1.85 job skips clippy and concentrates the whole cold build into one step.

Confirmed empirically: a cold `cargo test --no-run -j24` inside a 6Gi no-swap cgroup is OOM-killed during the cachekit-rs test-binary link (kernel: "Memory cgroup out of memory: Killed process ... (ld)").

Changes:
- ci.yml: CARGO_BUILD_JOBS=4 caps concurrent compile/link jobs so peak RSS stays well under 6Gi on the cache-less lean pod.
- ci.yml: timeout-minutes on the test (20) and wasm (15) jobs so a wedged runner fails fast instead of hanging to the heartbeat timeout.
- release.yml: same CARGO_BUILD_JOBS cap on the publish job (also runs on cachekit-lean), plus a 1.85 MSRV compile check before publish so a broken floor cannot ship silently. Release previously built stable only, which is how 0.3.0 published while the MSRV job was red.

Refs #25
Summary
Fixes the 6+ day red `1.85` (MSRV) CI job on `main`. It is not a code or MSRV bug: the crate compiles and tests clean on rustc 1.85.1 locally, and the committed `Cargo.lock` builds on 1.85.

The runner is being OOM-killed. The `cachekit-lean` ARC pod has a hard 6Gi memory cgroup (lab ADR-0001) and no build cache, so the `1.85` job runs a fully cold `cargo test` build of the 246-crate async/TLS/crypto graph. `cargo` defaults `-j` to the visible core count (~24 via the pod CPU limit), and at that fan-out the cold build's peak RSS, dominated by full test-profile debuginfo at link time, exceeds 6Gi. The kernel OOM-kills the linker, the runner "loses communication with the server," and the job fails at the ~10-minute heartbeat reckoning with no logs uploaded (the `BlobNotFound` symptom).

`stable`/`beta` pass because their preceding clippy step pre-builds the dependency graph, lowering the `test` step's peak. The `1.85` job skips clippy, concentrating the whole cold build into one step.

Evidence
- `Run tests` frozen `in_progress`, no `Complete job`, job failed at a clean ~10-min mark → killed mid-step, not a test failure.
- `cargo test --no-run -j24` inside a 6Gi no-swap cgroup is OOM-killed during the `cachekit-rs` test-binary link: `kernel: Memory cgroup out of memory: Killed process … (ld)` / `scope: Failed with result 'oom-kill'`.

Changes
- `ci.yml`: `CARGO_BUILD_JOBS=4` caps concurrent compile/link jobs so peak RSS stays well under 6Gi on the cache-less lean pod (the ~6× reduction in the fan-out, from ~24 jobs to 4, is the fix).
- `ci.yml`: `timeout-minutes` on `test` (20) and `wasm` (15) so a wedged runner fails fast instead of hanging to the heartbeat timeout.
- `release.yml`: same `CARGO_BUILD_JOBS` cap on the `publish` job (also runs on `cachekit-lean`), plus a 1.85 MSRV compile check before publish so a broken floor can't ship silently. `release` previously built `stable` only, which is how 0.3.0 published while the MSRV job was red.

MSRV floor stays at 1.85 (deliberate, edition2024). It is not bumped: adjusting the `-j` value, not the toolchain, is the fix.

Validation
The `1.85` job on this PR runs at `-j4` on `cachekit-lean`, so a green run here proves the fix in the real constrained environment. If it proves marginal at the link spike, the `CARGO_BUILD_JOBS` value will be lowered.

Closes #25
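
For readers without the diff open, the `ci.yml` side of the change could look roughly like the fragment below. This is a minimal sketch, not the actual diff. The job name `test`, the matrix toolchains, the `cachekit-lean` runner, `CARGO_BUILD_JOBS=4` and the 20-minute timeout come from the description above. The placement of `env` at workflow level, the `runs-on` label spelling, the step layout, and the choice of the `actions/checkout` and `dtolnay/rust-toolchain` actions are assumptions made for illustration.

```yaml
# Sketch only: layout and action choices are assumptions, not the real ci.yml.
env:
  # Cargo reads this as build.jobs; caps concurrent rustc/linker processes
  # (the default is the visible core count, ~24 on this pod).
  CARGO_BUILD_JOBS: 4

jobs:
  test:
    runs-on: cachekit-lean      # assumed runner label for the 6Gi ARC pod
    timeout-minutes: 20         # fail fast instead of waiting on the heartbeat
    strategy:
      matrix:
        toolchain: [stable, beta, "1.85"]   # quoted so YAML keeps it a string
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@master
        with:
          toolchain: ${{ matrix.toolchain }}
      - name: Run tests
        run: cargo test
```

Setting the variable once at workflow level would apply the cap to every job in the file. A job-level `env` block would do the same for a single job, such as `publish` in `release.yml`.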