Skip to content

fix: add CUDA error handling for cleanup paths in nccl_ep#2144

Open
randomizedcoder wants to merge 2 commits intoNVIDIA:masterfrom
randomizedcoder:sa/04-cuda-error-v2
Open

fix: add CUDA error handling for cleanup paths in nccl_ep#2144
randomizedcoder wants to merge 2 commits intoNVIDIA:masterfrom
randomizedcoder:sa/04-cuda-error-v2

Conversation

@randomizedcoder
Copy link
Copy Markdown

Issue

7 CUDA cleanup calls in destroy_hybridep_intranode() discard their return values. A failing cudaFree, cudaFreeHost, or cudaIpcCloseMemHandle during teardown goes unnoticed, making resource leaks and driver-state corruption difficult to diagnose.

Additionally, cudaDeviceEnablePeerAccess in init_hybridep_intranode() logs the failure but doesn't clear the sticky CUDA error state, which can cause subsequent CUDA calls to fail unexpectedly.

Found via static analysis (semgrep cuda-unchecked-malloc, cuda-unchecked-memcpy, cuda-missing-free rules; clang-tidy cert-err33-c).

Changes

Commit 1 — test: add CUDA error handling unit tests with source verification
Tests first (TDD ordering). Unit tests with CUDA type stubs (no GPU required):

  • CUDACHECK/CUDACHECKGOTO/CUDACHECKIGNORE macro behavior verification
  • Sticky error state clearing test
  • Source verification that greps for CUDA_CHECK_WARN in nccl_ep.cc (intentionally FAILS before fix)

Commit 2 — Add CUDA error handling for cleanup paths in nccl_ep

  • contrib/nccl_ep/device/macros.cuh: Add CUDA_CHECK_WARN macro — logs CUDA errors via fprintf(stderr, ...) and clears sticky error state, but does not throw (unlike CUDA_CHECK). Guarded with #ifndef.
  • contrib/nccl_ep/nccl_ep.cc: Wrap 7 cleanup calls with CUDA_CHECK_WARN: 2× cudaIpcCloseMemHandle, 2× cudaFree, 3× cudaFreeHost. Clear sticky error after cudaDeviceEnablePeerAccess failure.

Risk

Low — cleanup paths only. CUDA_CHECK_WARN logs but does not throw or return error, so existing teardown behavior is preserved.

Test plan

  • cd tests && make test — CUDA error handling tests pass
  • Build contrib/nccl_ep — no new warnings
  • Verify CUDA_CHECK_WARN does not alter teardown control flow (warn-only, no throw)

🤖 Generated with Claude Code

randomizedcoder and others added 2 commits April 28, 2026 08:22
Add tests verifying CUDACHECK/CUDACHECKGOTO/CUDACHECKIGNORE macro behavior
using CUDA type stubs (no GPU required). Source verification test intentionally
FAILS to demonstrate that contrib/nccl_ep/ cleanup code silently ignores
CUDA errors (cudaFree, cudaFreeHost, cudaIpcCloseMemHandle called without
error checking).

Test output (before fix):
  FAIL: test_source_verified - nccl_ep.cc: no CUDA_CHECK_WARN found
  FAIL: test_source_verified - macros.cuh: missing CUDA_CHECK_WARN macro
  PASS: test_cudacheck_cases
  PASS: test_cudacheckignore
  PASS: test_cudacheckgoto
  PASS: test_cuda_error_state_cleared
  Results: 4/5 tests passed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: randomizedcoder dave.seddon.ca@gmail.com <dave.seddon.ca@gmail.com>
Add CUDA_CHECK_WARN macro that logs CUDA errors and clears error
state without throwing, for use in cleanup/teardown paths. Wrap
unchecked cudaFree, cudaFreeHost, and cudaIpcCloseMemHandle calls
in destroy_hybridep_intranode(). Also clear stale CUDA error state
after cudaDeviceEnablePeerAccess failure.

Found via static analysis (semgrep: cuda-unchecked-malloc,
cuda-unchecked-memcpy, cuda-missing-free rules).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: randomizedcoder dave.seddon.ca@gmail.com <dave.seddon.ca@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant