fix: add CUDA error handling for cleanup paths in nccl_ep#2144
Open
randomizedcoder wants to merge 2 commits intoNVIDIA:masterfrom
Open
fix: add CUDA error handling for cleanup paths in nccl_ep#2144randomizedcoder wants to merge 2 commits intoNVIDIA:masterfrom
randomizedcoder wants to merge 2 commits intoNVIDIA:masterfrom
Conversation
Add tests verifying CUDACHECK/CUDACHECKGOTO/CUDACHECKIGNORE macro behavior using CUDA type stubs (no GPU required). Source verification test intentionally FAILS to demonstrate that contrib/nccl_ep/ cleanup code silently ignores CUDA errors (cudaFree, cudaFreeHost, cudaIpcCloseMemHandle called without error checking). Test output (before fix): FAIL: test_source_verified - nccl_ep.cc: no CUDA_CHECK_WARN found FAIL: test_source_verified - macros.cuh: missing CUDA_CHECK_WARN macro PASS: test_cudacheck_cases PASS: test_cudacheckignore PASS: test_cudacheckgoto PASS: test_cuda_error_state_cleared Results: 4/5 tests passed Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: randomizedcoder dave.seddon.ca@gmail.com <dave.seddon.ca@gmail.com>
Add CUDA_CHECK_WARN macro that logs CUDA errors and clears error state without throwing, for use in cleanup/teardown paths. Wrap unchecked cudaFree, cudaFreeHost, and cudaIpcCloseMemHandle calls in destroy_hybridep_intranode(). Also clear stale CUDA error state after cudaDeviceEnablePeerAccess failure. Found via static analysis (semgrep: cuda-unchecked-malloc, cuda-unchecked-memcpy, cuda-missing-free rules). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: randomizedcoder dave.seddon.ca@gmail.com <dave.seddon.ca@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Issue
7 CUDA cleanup calls in
destroy_hybridep_intranode()discard their return values. A failingcudaFree,cudaFreeHost, orcudaIpcCloseMemHandleduring teardown goes unnoticed, making resource leaks and driver-state corruption difficult to diagnose.Additionally,
cudaDeviceEnablePeerAccessininit_hybridep_intranode()logs the failure but doesn't clear the sticky CUDA error state, which can cause subsequent CUDA calls to fail unexpectedly.Found via static analysis (semgrep
cuda-unchecked-malloc,cuda-unchecked-memcpy,cuda-missing-freerules; clang-tidycert-err33-c).Changes
Commit 1 —
test: add CUDA error handling unit tests with source verificationTests first (TDD ordering). Unit tests with CUDA type stubs (no GPU required):
CUDA_CHECK_WARNinnccl_ep.cc(intentionally FAILS before fix)Commit 2 —
Add CUDA error handling for cleanup paths in nccl_epcontrib/nccl_ep/device/macros.cuh: AddCUDA_CHECK_WARNmacro — logs CUDA errors viafprintf(stderr, ...)and clears sticky error state, but does not throw (unlikeCUDA_CHECK). Guarded with#ifndef.contrib/nccl_ep/nccl_ep.cc: Wrap 7 cleanup calls withCUDA_CHECK_WARN: 2×cudaIpcCloseMemHandle, 2×cudaFree, 3×cudaFreeHost. Clear sticky error aftercudaDeviceEnablePeerAccessfailure.Risk
Low — cleanup paths only.
CUDA_CHECK_WARNlogs but does not throw or return error, so existing teardown behavior is preserved.Test plan
cd tests && make test— CUDA error handling tests passcontrib/nccl_ep— no new warningsCUDA_CHECK_WARNdoes not alter teardown control flow (warn-only, no throw)🤖 Generated with Claude Code