fix: FP16 crash on CUDA 12.x + kernel optimizations#14
Open
zaki699-blip wants to merge 1 commit into
Open
Conversation
Fix critical FP16 (half-precision) crash (CUDA error 700: illegal memory
access) when running on CUDA 12.x, and apply kernel performance optimizations.
Bug Fixes
---------
1. Remove conflicting __half operator overloads (grid_sample_3d.cuh)
CUDA 12.x ships native __half arithmetic operators. The 12 custom
operator overloads caused ODR violations at link time, leading to
undefined behavior. One overload was inherently broken:
__half operator+=(const float&, const half&)
attempts to modify a const reference.
All custom overloads removed. A minimal to_float<scalar_t>() helper
handles __half to float conversion.
2. Fix compute_index() precision loss for __half (grid_sample_3d.cuh)
compute_index() was templated in scalar_t, so when instantiated with
__half (~3.3 decimal digits), unnormalization and coordinate math
produced garbage indices, causing out-of-bounds memory access.
compute_index() now works entirely in float32 and returns float.
reflect_coordinates() rewritten as reflect_coordinates_f() in pure
float32.
3. Fix nearest kernel channel stride bug (grid_sample_3d.cu)
The nearest interpolation kernel never advanced input/output channel
pointers inside the channel loop (missing += stride_C). This caused
channel-0 data to be read for all channels.
Performance Optimizations
-------------------------
- __restrict__ on all kernel and launcher pointer parameters
- Hoist boundary checks outside channel loops (constant across C)
- Precompute spatial offsets outside channel loops
- Use __ldg() for read-only global memory loads (texture cache path)
- Compute trilinear weights in float32 (avoids repeated conversions)
- Accumulate bilinear interpolation in float32, cast on output
Tested: CUDA 12.6 / TensorRT 10.5 / SM80+ (Ampere)
FP32: max_diff vs PyTorch = 0.0 (exact match)
FP16: max_diff vs PyTorch = 9.76e-4 (expected for half)
FP16 latency: 93.9us vs FP32 117.5us = 1.25x speedup
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fix critical FP16 (half-precision) crash (CUDA error 700: illegal memory access) when running on CUDA 12.x, and apply kernel performance optimizations.
Bug Fixes
1. Remove conflicting
__halfoperator overloads (grid_sample_3d.cuh)CUDA 12.x ships native
__halfarithmetic operators. The 12 custom operator overloads in this file caused ODR (One Definition Rule) violations at link time, leading to undefined behavior.One overload was inherently broken:
Fix: All custom overloads removed. A minimal
to_float<scalar_t>()helper handles__half→floatconversion.2. Fix
compute_index()precision loss for__half(grid_sample_3d.cuh)compute_index()was templated inscalar_t, so when instantiated with__half(~3.3 decimal digits of precision), unnormalization and coordinate math produced garbage indices → out-of-bounds memory access → CUDA error 700.Fix:
compute_index()now works entirely infloat32internally and returnsfloat.reflect_coordinates()is similarly rewritten asreflect_coordinates_f()in purefloat32.3. Fix nearest kernel channel stride bug (
grid_sample_3d.cu)The nearest interpolation kernel never advanced its input/output channel pointers inside the channel loop:
// MISSING: input_NC_offset += input_stride_C; output_NCDHW_offset += output_stride_C;This caused channel-0 data to be read for all channels.
Fix: Added the missing stride advancement.
Performance Optimizations
__restrict__on all pointer parametersx * stride_W + y * stride_H + z * stride_Dper channel__ldg()for read-only loadsfloat32trilinear weightsscalar_tconversions in the hot loopfloat32accumulation in bilinear kernelTesting
Tested with CUDA 12.6 / TensorRT 10.5 / SM80+ (Ampere):
Performance (input
[1,32,16,64,64]NCDHW, grid[1,16,64,64,3]):Notes
The original FP16 test was commented out in
main()with a TODO comment — this PR fixes the underlying issues and makes FP16 fully functional on modern CUDA toolkits.