Add a HIP/ROCm build path for AMD GPUs #35

Merged: iwald-nvidia merged 1 commit into NVIDIA:main from jeffdaily:moat-port on Jun 17, 2026
@jeffdaily

Contributor

cuBQL already carries first-party HIP scaffolding (the __HIPCC__ includes of <hip/hip_runtime.h> and the cub->hipcub alias from #34), but nothing invoked hipcc and a few HIP gaps left that path uncompilable. This finishes and wires up the HIP path so the library, the instantiate_builders translation unit, and all GPU samples build, link, and run on AMD GPUs -- while keeping the CUDA build byte-identical.

Suggested review order:

  • CMake -- the root CMakeLists.txt gains a CUBQL_USE_HIP option that enables the HIP language, defaults CMAKE_HIP_ARCHITECTURES without hardcoding an arch, and sets a distinct CUBQL_HAVE_HIP switch; the per-type instantiation targets and GPU samples build when either CUBQL_HAVE_CUDA or CUBQL_HAVE_HIP is set, with the .cu sources marked LANGUAGE HIP.
  • Runtime shims:
      • math/common.h maps the cudaXxx runtime symbols cuBQL uses onto their hipXxx equivalents inside the existing __HIPCC__ block (every call funnels through the CUBQL_CUDA_CALL macro).
      • math/constants.h provides the CUDART_INF_F/CUDART_NAN device float constants (no HIP analogue).
      • builder/cuda.h widens the cudaMallocAsync version guard to fire on HIP.
      • bvh.h includes the GPU builder and declares the bvh_floatN typedefs under HIP.
      • shrinkingRadiusQuery.h keys a host/device fallback off __HIP_DEVICE_COMPILE__ (HIP's per-pass macro) as well as __CUDA_ARCH__.
      • findClosest.h aligns three forward declarations to their device-only definitions (clang rejects the decl/def attribute mismatch nvcc tolerated).
      • vec.h widens the dim3 conversion operators.
  • README -- documents the CUBQL_USE_HIP option in the Building section.
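
To make the math/constants.h item above concrete, here is a minimal sketch of how float infinity and NaN constants could be supplied when the CUDA header that normally defines them is absent. The names SKETCH_INF_F, SKETCH_NAN_F, and closestOf are hypothetical, and the definitions the PR actually uses are not shown on this page, so treat this as one plausible form rather than the merged code.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical stand-ins for CUDART_INF_F / a float NaN constant.
// Assumption: std::numeric_limits is an acceptable source for these values;
// the PR's real definitions may use a different mechanism.
#ifndef SKETCH_INF_F
#  define SKETCH_INF_F (std::numeric_limits<float>::infinity())
#endif
#ifndef SKETCH_NAN_F
#  define SKETCH_NAN_F (std::numeric_limits<float>::quiet_NaN())
#endif

// Typical use in a closest-point style query: seed the running minimum with
// +infinity so that any real distance replaces it.
inline float closestOf(const float* dist, int n) {
  float best = SKETCH_INF_F;
  for (int i = 0; i < n; ++i)
    if (dist[i] < best) best = dist[i];
  return best;
}
```

With an empty input the function returns the seed value, so callers can detect "nothing found" with std::isinf.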

Every source change is guarded by __HIPCC__ / __HIP_DEVICE_COMPILE__, and every CMake change by CUBQL_USE_HIP (default off), so a build without CUBQL_USE_HIP is an unchanged CUDA configuration.
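
The per-pass guard mentioned for shrinkingRadiusQuery.h can be sketched roughly as follows. __CUDA_ARCH__ and __HIP_DEVICE_COMPILE__ are the real compiler-provided macros; SKETCH_DEVICE_PASS and sketchOnDevicePass are hypothetical names used only for illustration, and the actual code in the PR may be structured differently.

```cpp
#include <cassert>

// nvcc defines __CUDA_ARCH__ only while compiling device code; hipcc defines
// __HIP_DEVICE_COMPILE__ in the same situation. Checking only the first one
// would send HIP device code down the host fallback path.
#if defined(__CUDA_ARCH__) || defined(__HIP_DEVICE_COMPILE__)
#  define SKETCH_DEVICE_PASS 1
#else
#  define SKETCH_DEVICE_PASS 0
#endif

// Hypothetical helper: reports which branch this translation pass selected.
inline bool sketchOnDevicePass() {
  return SKETCH_DEVICE_PASS != 0;
}
```

Compiled with an ordinary host compiler, neither macro is defined, so the host fallback branch is selected.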

Validation

Built and linked the library, the instantiate_builders TU, and all eight GPU samples with hipcc (ROCm 7.2.1) on AMD GPUs across Linux and Windows -- CDNA2 (gfx90a), RDNA3 (gfx1100), and RDNA4 (gfx1201). Samples 01-07 run on GPU and return correct results (the closest-point output is bit-identical to the CPU reference). Building without CUBQL_USE_HIP keeps CMAKE_CUDA_ARCHITECTURES at native and never enables HIP.

cuBQL already carried first-party HIP source scaffolding (the __HIPCC__
includes of <hip/hip_runtime.h> and the cub->hipcub alias), but nothing
ever invoked hipcc and several HIP gaps left that path uncompilable. This
finishes and wires up the HIP path so the library, the
instantiate_builders translation unit, and all GPU samples build and link
for AMD GPUs, while keeping the CUDA build byte-identical.

Review order: the CMake change first (root CMakeLists.txt adds a
CUBQL_USE_HIP option that enables the HIP language, defaults
CMAKE_HIP_ARCHITECTURES without hardcoding an arch, and sets a distinct
CUBQL_HAVE_HIP switch alongside CUBQL_HAVE_CUDA; the GPU instantiation
targets and samples build when either is set, and cuBQL/CMakeLists.txt and
the sample CMake mark the .cu sources LANGUAGE HIP), then the source shims.
math/common.h maps the cudaXxx runtime symbols cuBQL uses onto their hipXxx
equivalents inside the
existing __HIPCC__ block (every call funnels through the
CUBQL_CUDA_CALL(call) -> cuda##call macro). math/constants.h provides the
CUDART_INF_F/CUDART_NAN device float constants, which have no HIP analogue.
builder/cuda.h widens the cudaMallocAsync version guard to fire on HIP
(CUDART_VERSION is 0 there). bvh.h includes the GPU builder and declares the
bvh_floatN typedefs under HIP too. shrinkingRadiusQuery.h fixes a host/device
fallback that was gated on __CUDA_ARCH__ to also key off __HIP_DEVICE_COMPILE__
(HIP's per-pass macro), and findClosest.h aligns three forward declarations
to their device-only definitions on the HIP path (clang rejects the
decl/def host/device-attribute mismatch that nvcc tolerated). vec.h widens
the dim3 conversion operators to HIP. The README's Building section
documents the new CUBQL_USE_HIP option.
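
The cudaXxx-to-hipXxx mapping combined with a cuda##call funnel macro, as described above, can be illustrated with a host-only sketch. Everything here is an assumption made for illustration: the hipMalloc/hipFree functions are mock stand-ins so the example compiles without ROCm, SKETCH_CUDA_CALL is a hypothetical analogue of CUBQL_CUDA_CALL, and the set of symbols the real math/common.h maps is not listed on this page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <stdexcept>

// Mock stand-ins for the HIP runtime (assumption: a real build would get
// these from <hip/hip_runtime.h> instead).
using hipError_t = int;
constexpr hipError_t hipSuccess = 0;
static int sketchHipCalls = 0;  // counts calls that reached the "hip" side

inline hipError_t hipMalloc(void** ptr, std::size_t bytes) {
  ++sketchHipCalls;
  *ptr = std::malloc(bytes);
  return *ptr ? hipSuccess : 1;
}
inline hipError_t hipFree(void* ptr) {
  ++sketchHipCalls;
  std::free(ptr);
  return hipSuccess;
}

// The shim: CUDA spellings become aliases for their HIP equivalents.
#define cudaError_t hipError_t
#define cudaSuccess hipSuccess
#define cudaMalloc  hipMalloc
#define cudaFree    hipFree

// Funnel macro: pastes "cuda" onto the call name, then checks the result.
// The pasted token (e.g. cudaMalloc) is rescanned, so the shim above
// redirects it to hipMalloc.
#define SKETCH_CUDA_CALL(call)                                  \
  do {                                                          \
    cudaError_t rc = cuda##call;                                \
    if (rc != cudaSuccess)                                      \
      throw std::runtime_error("runtime call failed: " #call);  \
  } while (0)

// Caller code is written once, in CUDA spelling minus the prefix.
inline bool sketchRoundTrip(std::size_t bytes) {
  void* p = nullptr;
  SKETCH_CUDA_CALL(Malloc(&p, bytes));
  const bool ok = (p != nullptr);
  SKETCH_CUDA_CALL(Free(p));
  return ok;
}
```

Because every call site goes through one macro, only the handful of alias lines has to change per backend, which is consistent with the commit's claim that the CUDA build is left untouched.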

Every source change is guarded by __HIPCC__ / __HIP_DEVICE_COMPILE__, and
every CMake change by CUBQL_USE_HIP (default OFF) / LANGUAGE HIP, so a build
without CUBQL_USE_HIP produces an unchanged CUDA configuration.
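
The remark that CUDART_VERSION is 0 under HIP follows from a preprocessor rule: an identifier that is not defined evaluates to 0 inside #if. A version test alone therefore silently disables the guarded code on HIP, and the guard has to be widened explicitly. The sketch below assumes a threshold of 11020 (CUDA 11.2, the release that introduced cudaMallocAsync); the threshold used in builder/cuda.h is not shown on this page, and the SKETCH_ names are hypothetical.

```cpp
#include <cassert>

// Version-only guard: false whenever CUDART_VERSION is undefined, which
// includes every HIP build.
#if CUDART_VERSION >= 11020
#  define SKETCH_VERSION_ONLY 1
#else
#  define SKETCH_VERSION_ONLY 0
#endif

// Widened guard: also true when compiling with hipcc.
#if (CUDART_VERSION >= 11020) || defined(__HIPCC__)
#  define SKETCH_HAVE_MALLOC_ASYNC 1
#else
#  define SKETCH_HAVE_MALLOC_ASYNC 0
#endif

// Hypothetical helper: tells a caller whether the async allocation path
// was compiled in.
inline bool sketchUseAsyncAlloc() {
  return SKETCH_HAVE_MALLOC_ASYNC != 0;
}
```

On a plain host compiler both macros come out as 0; under hipcc only the widened one would become 1.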

This work was authored with the assistance of Claude, an AI assistant.

Test Plan:

    cmake -S . -B build-hip -DCUBQL_USE_HIP=ON \
      -DCMAKE_HIP_ARCHITECTURES=gfx90a -DCMAKE_BUILD_TYPE=Release
    cmake --build build-hip -j --target \
      cuBQL_cuda_float3 \
      cuBQL_sample01_points_closestPoint_cuda \
      cuBQL_sample01_points_closestPoint_wideBVH_cuda \
      sample02_distanceToTriangleMesh sample03_insideOutside \
      sample04_boxOverlapsOrInsideSurfaceMesh sample05_lineOfSight \
      sample06_anyTriangleWithinRadius sample07_aggregateNBody
    HIP_VISIBLE_DEVICES=0 ./build-hip/cuBQL_sample01_points_closestPoint_cuda

Built and linked the library, the instantiate_builders TU, and all eight GPU
samples with hipcc (ROCm 7.2.1) for gfx90a, gfx1100, and gfx1201; the device
code objects are amdgcn. sample01 closest-point and samples 02-07 run on GPU
and return correct results. Configuring without CUBQL_USE_HIP keeps
CMAKE_CUDA_ARCHITECTURES at native and never enables HIP; the CPU host
targets build clean in both modes.
jeffdaily added a commit to jeffdaily/barney that referenced this pull request Jun 17, 2026
…oat-port)

The submodules/cuBQL gitlink is moved from NVIDIA/cuBQL to the HIP-enabled
fork jeffdaily/cuBQL, branch moat-port, pinned at commit b0ea6a1 (the
squashed cuBQL HIP port -- adds CUBQL_USE_HIP build path for AMD GPUs,
identical to what was exercised during the barney gfx90a and gfx1201
validations via BARNEY_USE_EXTERNAL_CUBQL).

This makes the barney fork self-contained: `git clone --recursive --branch
moat-port https://github.com/jeffdaily/barney` resolves the cuBQL submodule
to the HIP build without any extra flags, enabling AMD builds from a clean
checkout before the upstream cuBQL PR (NVIDIA/cuBQL#35) lands.

TEMPORARY PIN: jeffdaily/cuBQL@moat-port will be re-pinned to the upstream
NVIDIA/cuBQL once cuBQL PR NVIDIA#35 merges (see data/deferred.json for tracking).

Test Plan:
  Configure barney with BARNEY_USE_EXTERNAL_CUBQL=OFF (uses submodule):
    cmake -S . -B build-sub -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a
  Confirm cuBQL at b0ea6a1 satisfies CUBQL_USE_HIP=ON.

This work was authored with an AI assistant (Claude by Anthropic).
@iwald-nvidia

Collaborator

looks good on first glance; i'll need to run CI first, and will also run a few manual tests just in case, but looks pretty good so far.

@iwald-nvidia iwald-nvidia merged commit d1bfc3c into NVIDIA:main Jun 17, 2026
6 checks passed