Migrate samples from cuda-samples #9697
📝 Walkthrough
Summary by CodeRabbit
Walkthrough
Adds a new top-level "samples" directory with a standalone CMake-based CCCL samples project, shared Common helper headers, CPM/install/toolchain CMake infrastructure, multiple CUDA/C++ and Python sample programs, CI matrix/script wiring for building and testing samples, and test-runner scripts (run_tests.py, test_samples.py). Minor repo config updates included.
Changes
CI and Infrastructure
Common Helper Headers
CUDA/C++ Samples
Python Samples and Test Runners
Sequence Diagram(s)
sequenceDiagram
participant CI
participant build_samples.sh
participant CMake
participant test_samples.sh
participant run_tests.py
CI->>build_samples.sh: invoke build
build_samples.sh->>CMake: configure/build/install
CI->>test_samples.sh: invoke test
test_samples.sh->>run_tests.py: run installed executables
run_tests.py-->>CI: pass/fail summary
Related Issues: None referenced in the provided data.
Related PRs: None referenced in the provided data.
Suggested labels: ci, samples, cmake, documentation
Suggested reviewers: None determinable from provided data.
important: multiple sample CMakeLists fetch CCCL via CPM from GitHub pinned to
Actionable comments posted: 16
🧹 Nitpick comments (20)
samples/Common/helper_image.h (2)
224-224: 🩺 Stability & Availability | 🔵 Trivial | 💤 Low value
suggestion: several malloc(sizeof(T) * n)-style allocations here (e.g. Line 224, Line 256, Line 279, Line 352, Line 369, Line 442) compute sizes without overflow checks, flagged by static analysis (CWE-190). Practical risk is low for typical sample image dimensions and this mirrors the original cuda-samples implementation, so this is treated as optional hardening rather than a blocker.
Also applies to: 256-256, 279-279, 352-352, 369-369, 442-442
Source: Linters/SAST tools
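If the optional hardening is wanted, the malloc(sizeof(T) * n)-style size computations flagged above could go through a small checked helper. This is a minimal sketch only: `checked_array_bytes` and `checked_malloc_array` are hypothetical names and are not part of helper_image.h.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper (not part of helper_image.h): computes count * elem_size
// and reports failure instead of silently wrapping around on overflow.
inline bool checked_array_bytes(std::size_t count, std::size_t elem_size, std::size_t* out_bytes)
{
  if (elem_size != 0 && count > SIZE_MAX / elem_size)
  {
    return false; // count * elem_size would not fit in size_t
  }
  *out_bytes = count * elem_size;
  return true;
}

// Hypothetical wrapper: returns nullptr when the size computation overflows,
// otherwise forwards to malloc with the validated byte count.
template <typename T>
T* checked_malloc_array(std::size_t count)
{
  std::size_t bytes = 0;
  if (!checked_array_bytes(count, sizeof(T), &bytes))
  {
    return nullptr;
  }
  return static_cast<T*>(std::malloc(bytes));
}
```

Call sites would still need to handle a nullptr result, exactly as they should for a failed malloc.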
415-425: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win
suggestion: sdkReadFile uses while (!feof(fh)) with an unchecked fscanf, relying on the trailing pop_back() to drop one bogus element. This is a known anti-pattern: if fscanf fails for a non-EOF reason (malformed token), a stale/garbage token value is retained and the compensating pop_back() doesn't guarantee correctness. Checking the fscanf return value directly would be more robust.
♻️ Proposed fix
-  while (!feof(fh))
-  {
-    fscanf(fh, "%f", &token);
-    data_read.push_back(token);
-  }
-
-  // the last element is read twice
-  data_read.pop_back();
-  fclose(fh);
+  while (fscanf(fh, "%f", &token) == 1)
+  {
+    data_read.push_back(token);
+  }
+
+  fclose(fh);
samples/Common/helper_string.h (1)
114-124: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win
suggestion: getFileExtension can read out of bounds for an empty filename. filename[string_length--] accesses index 0 (the null terminator) then decrements to -1 before the string_length == 0 guard is checked, so the next iteration reads filename[-1]. Not currently exercised by the provided call sites, but worth a bounds check for robustness.
♻️ Proposed fix
 inline int getFileExtension(char* filename, char** extension)
 {
   int string_length = static_cast<int>(strlen(filename));
+
+  if (string_length == 0)
+  {
+    *extension = NULL;
+    return 0;
+  }
samples/cubDeviceFind/CMakeLists.txt (1)
16-27: 🗄️ Data Integrity & Integration | 🔵 Trivial | ⚡ Quick win
suggestion: CMAKE_CUDA_ARCHITECTURES doesn't include SM 7.0 (70), but the README's "Supported SM Architectures" section lists SM 7.0 as supported. Either drop the SM 7.0 badge from the README or add 70 to this list.
samples/cubDeviceSegmentedScan/cubDeviceSegmentedScan.cu (1)
152-152: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
suggestion: Pass cuda::maximum<>{} directly as the binary op; the lambda wrapper adds no behavior.
samples/libcuxxRandom/README.md (1)
13-13: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
suggestion: fix markdownlint findings: trailing spaces inside link text on line 13 ([SM 7.0 ] → [SM 7.0], repeated for each SM badge) and an empty link [Dependencies]() on line 40 (should anchor to the Dependencies section).
Proposed fix
-[SM 7.0 ](https://developer.nvidia.com/cuda-gpus) [SM 7.5 ](https://developer.nvidia.com/cuda-gpus) [SM 8.0 ](https://developer.nvidia.com/cuda-gpus) [SM 8.6 ](https://developer.nvidia.com/cuda-gpus) [SM 8.9 ](https://developer.nvidia.com/cuda-gpus) [SM 9.0 ](https://developer.nvidia.com/cuda-gpus) [SM 10.0 ](https://developer.nvidia.com/cuda-gpus) [SM 11.0 ](https://developer.nvidia.com/cuda-gpus) [SM 12.0 ](https://developer.nvidia.com/cuda-gpus)
+[SM 7.0](https://developer.nvidia.com/cuda-gpus) [SM 7.5](https://developer.nvidia.com/cuda-gpus) [SM 8.0](https://developer.nvidia.com/cuda-gpus) [SM 8.6](https://developer.nvidia.com/cuda-gpus) [SM 8.9](https://developer.nvidia.com/cuda-gpus) [SM 9.0](https://developer.nvidia.com/cuda-gpus) [SM 10.0](https://developer.nvidia.com/cuda-gpus) [SM 11.0](https://developer.nvidia.com/cuda-gpus) [SM 12.0](https://developer.nvidia.com/cuda-gpus)
-Make sure the dependencies mentioned in [Dependencies]() section above are installed.
+Make sure the dependencies mentioned in the [Dependencies](`#dependencies-needed-to-buildrun`) section above are installed.
Also applies to: 40-40
Source: Linters/SAST tools
samples/libcuxxRandom/libcuxxRandom.cu (1)
123-138: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value
suggestion: --blocks uses a different CLI convention (--blocks <value>, space-separated) than --device=<value> used internally by findCudaDevice a few lines below. Consider using getCmdLineArgumentInt from helper_string.h for --blocks=N to stay consistent with the rest of the sample's CLI style.
samples/python/Utilities/__init__.py (1)
37-47: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value
suggestion: print_gpu_info is a public helper used by every sample (binarySearch, parallelReduction, prefixSum) but isn't re-exported here. Anyone importing via from Utilities import ... instead of the direct module-path pattern won't get it.
 from .cuda_samples_utils import (
     check_cuda_requirements,
+    print_gpu_info,
     verify_array_result,
 )

 __version__ = "1.0.0"
 __all__ = [
     "check_cuda_requirements",
+    "print_gpu_info",
     "verify_array_result",
 ]
samples/python/cuda.compute/parallelReduction/README.md (1)
21-26: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
suggestion: Add language identifiers to both fenced code blocks.
Markdownlint is already flagging these as MD040, and tagging the fences will also improve syntax highlighting in rendered docs.
Also applies to: 74-76
Source: Linters/SAST tools
samples/run_tests.py (1)
331-342: 🚀 Performance & Scalability | 🔵 Trivial
suggestion: CI runs this with --parallel 4, but scheduling is GPU-agnostic. GPU-heavy samples (e.g. dmmaTensorCoreGemm, globalToShmemAsyncCopy) launched concurrently on a single device can OOM/contend and fail nondeterministically. Consider bounding parallelism by device memory, pinning runs to distinct devices, or marking heavy samples for serial execution in the config.
samples/test_args.json (1)
1-1: 📐 Maintainability & Code Quality | 🔵 Trivial
suggestion: Empty config means every discovered sample runs with no args, min_gpus=1, and a 300s timeout. Migrated cuda-samples that need specific CLI args, multiple GPUs, longer timeouts, or should be skipped/waived will silently fall back to defaults and may fail in CI. Want me to scaffold per-sample entries (args/runs/min_gpus/skip) for the samples in this PR?
ci/windows/build_samples.ps1 (1)
26-45: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
suggestion: this script imports build_common.psm1 but never applies $script:GLOBAL_CMAKE_OPTIONS (which carries CCCL_ENABLE_WERROR/CCCL_ENABLE_PRAGMA_SYSTEM_HEADER in CI, per build_common.psm1) to the samples configure command; only the raw $CMAKE_OPTIONS parameter is appended. Since the comment at line 15-18 says presets are intentionally bypassed, this is likely by design, but it means Windows CI builds samples without warnings-as-errors enforcement that the rest of CCCL relies on for catching regressions.
ci/test_cuda_cccl_examples_python.sh (1)
31-37: 🩺 Stability & Availability | 🔵 Trivial | 💤 Low value
suggestion: silently skipping the samples pytest run when test_samples.py is missing means a broken migration (e.g. file accidentally not shipped/renamed) fails silently instead of failing CI. Consider an explicit echo/warning when the file is absent, or making its presence mandatory if samples are always expected in-tree from this point on.
samples/globalToShmemAsyncCopy/CMakeLists.txt (1)
34-38: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win
important: QNX exclusion here contradicts README.md line 17, which lists QNX as a supported OS. See the README comment for the cross-file fix.
samples/globalToShmemAsyncCopy/README.md (3)
13-13: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
suggestion: strip the trailing spaces inside [SM X.X ] link text (flagged by markdownlint MD039), and reconcile the listed SM range with CMakeLists.txt, which only builds for SM 75+ (75 80 86 87 89 90 100 110 120). SM 7.0/7.2 in the README won't actually be compiled.
doc fix
-[SM 7.0 ](https://developer.nvidia.com/cuda-gpus) [SM 7.2 ](https://developer.nvidia.com/cuda-gpus) [SM 7.5 ](https://developer.nvidia.com/cuda-gpus) [SM 8.0 ](https://developer.nvidia.com/cuda-gpus) [SM 8.6 ](https://developer.nvidia.com/cuda-gpus) [SM 8.7 ](https://developer.nvidia.com/cuda-gpus) [SM 8.9 ](https://developer.nvidia.com/cuda-gpus) [SM 9.0 ](https://developer.nvidia.com/cuda-gpus)
+[SM 7.5](https://developer.nvidia.com/cuda-gpus) [SM 8.0](https://developer.nvidia.com/cuda-gpus) [SM 8.6](https://developer.nvidia.com/cuda-gpus) [SM 8.7](https://developer.nvidia.com/cuda-gpus) [SM 8.9](https://developer.nvidia.com/cuda-gpus) [SM 9.0](https://developer.nvidia.com/cuda-gpus)
Source: Linters/SAST tools
17-17: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win
important: README lists QNX as a supported OS, but CMakeLists.txt (lines 35-38) explicitly refuses to build the sample on QNX. Update one side to match.
38-38: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
suggestion: empty link [Dependencies]() (MD042). Point it at the "Dependencies needed to build/run" section above.
doc fix
-Make sure the dependencies mentioned in [Dependencies]() section above are installed.
+Make sure the dependencies mentioned in the [Dependencies](`#dependencies-needed-to-buildrun`) section above are installed.
Source: Linters/SAST tools
samples/globalToShmemAsyncCopy/globalToShmemAsyncCopy.cu (1)
50-52: 🩺 Stability & Availability | 🔵 Trivial | 💤 Low value
suggestion: conditional #include <cuda/barrier> under __CUDA_ARCH__ >= 700 only takes effect during device-code compilation passes; since CMAKE_CUDA_ARCHITECTURES in this sample's CMakeLists.txt starts at 75, this resolves correctly today, but it's a fragile pattern if the minimum architecture is ever lowered. No action needed given current build config.
samples/radixSortThrust/radixSortThrust.cu (2)
227-233: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value
suggestion: This conditional is a no-op: both branches return bTestResult's value (the early return false only triggers when bTestResult is already false). Safe to simplify.
Proposed fix
-  if (!bTestResult && !quiet)
-  {
-    return false;
-  }
-
   return bTestResult;
 }
29-29: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value
suggestion: <thrust/detail/type_traits.h> is a Thrust internal implementation header (not part of the public API) and no symbols from it appear used in this file. Consider dropping it, or replacing with the specific public header if something from it is actually needed.
Source: Coding guidelines
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1ed8f7fb-6e50-4632-be65-99695c3b3df1
📒 Files selected for processing (82)
.codespell-ignore.txt
.gitignore
.pre-commit-config.yaml
ci/build_samples.sh
ci/matrix.yaml
ci/project_files_and_dependencies.yaml
ci/test_cuda_cccl_examples_python.sh
ci/test_samples.sh
ci/windows/build_samples.ps1
samples/CMakeLists.txt
samples/Common/exception.h
samples/Common/helper_cuda.h
samples/Common/helper_functions.h
samples/Common/helper_image.h
samples/Common/helper_math.h
samples/Common/helper_string.h
samples/Common/helper_timer.h
samples/cmake/CPM.cmake
samples/cmake/InstallSamples.cmake
samples/cmake/Modules/FindEGL.cmake
samples/cmake/Modules/FindFreeImage.cmake
samples/cmake/Modules/FindNVSCI.cmake
samples/cmake/toolchains/toolchain-aarch64-linux.cmake
samples/cmake/toolchains/toolchain-aarch64-qnx.cmake
samples/cubDeviceFind/CMakeLists.txt
samples/cubDeviceFind/README.md
samples/cubDeviceFind/cubDeviceFind.cu
samples/cubDeviceSegmentedScan/CMakeLists.txt
samples/cubDeviceSegmentedScan/README.md
samples/cubDeviceSegmentedScan/cubDeviceSegmentedScan.cu
samples/cubDeviceTransform/CMakeLists.txt
samples/cubDeviceTransform/README.md
samples/cubDeviceTransform/cubDeviceTransform.cu
samples/dmmaTensorCoreGemm/CMakeLists.txt
samples/dmmaTensorCoreGemm/README.md
samples/dmmaTensorCoreGemm/dmmaTensorCoreGemm.cu
samples/globalToShmemAsyncCopy/CMakeLists.txt
samples/globalToShmemAsyncCopy/README.md
samples/globalToShmemAsyncCopy/globalToShmemAsyncCopy.cu
samples/libcuxxMdspan/CMakeLists.txt
samples/libcuxxMdspan/README.md
samples/libcuxxMdspan/libcuxxMdspan.cu
samples/libcuxxRandom/CMakeLists.txt
samples/libcuxxRandom/README.md
samples/libcuxxRandom/libcuxxRandom.cu
samples/lineOfSight/CMakeLists.txt
samples/lineOfSight/README.md
samples/lineOfSight/lineOfSight.cu
samples/python/Utilities/README.md
samples/python/Utilities/__init__.py
samples/python/Utilities/cuda_samples_utils.py
samples/python/cuda.compute/binarySearch/README.md
samples/python/cuda.compute/binarySearch/binarySearch.py
samples/python/cuda.compute/binarySearch/requirements.txt
samples/python/cuda.compute/cudaComputeLambdas/README.md
samples/python/cuda.compute/cudaComputeLambdas/cudaComputeLambdas.py
samples/python/cuda.compute/cudaComputeLambdas/requirements.txt
samples/python/cuda.compute/parallelReduction/README.md
samples/python/cuda.compute/parallelReduction/parallelReduction.py
samples/python/cuda.compute/parallelReduction/requirements.txt
samples/python/cuda.compute/prefixSum/README.md
samples/python/cuda.compute/prefixSum/prefixSum.py
samples/python/cuda.compute/prefixSum/requirements.txt
samples/python/requirements.txt
samples/python/test_samples.py
samples/radixSortThrust/CMakeLists.txt
samples/radixSortThrust/README.md
samples/radixSortThrust/doc/readme.txt
samples/radixSortThrust/radixSortThrust.cu
samples/run_tests.py
samples/segmentationTreeThrust/CMakeLists.txt
samples/segmentationTreeThrust/README.md
samples/segmentationTreeThrust/common.cuh
samples/segmentationTreeThrust/data/ref_00.ppm
samples/segmentationTreeThrust/data/ref_09.ppm
samples/segmentationTreeThrust/data/test.ppm
samples/segmentationTreeThrust/kernels.cuh
samples/segmentationTreeThrust/segmentationTree.cu
samples/simpleAWBarrier/CMakeLists.txt
samples/simpleAWBarrier/README.md
samples/simpleAWBarrier/simpleAWBarrier.cu
samples/test_args.json
set(CMAKE_C_FLAGS " \"-V${__qnx_gcc_ver},gcc_ntoaarch64le\"")
set(CMAKE_CXX_FLAGS " \"-V${__qnx_gcc_ver},gcc_ntoaarch64le\"")
set(CMAKE_CUDA_FLAGS " --qpp-config=${__qnx_gcc_ver},gcc_ntoaarch64le")
set(
  AUTOMAGIC_NVCC_FLAGS
  --qpp-config=${__qnx_gcc_ver},gcc_ntoaarch64le
  CACHE STRING
  "automagic feature detection flags for cross build"
)
add_link_options("-V${__qnx_gcc_ver},gcc_ntoaarch64le")
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Confirm __qnx_gcc_ver is defined somewhere the toolchain can see it
rg -nP '__qnx_gcc_ver' -g '!**/toolchain-aarch64-qnx.cmake'
Repository: NVIDIA/cccl
Length of output: 149
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== toolchain file =="
nl -ba samples/cmake/toolchains/toolchain-aarch64-qnx.cmake | sed -n '1,220p'
echo
echo "== references to qnx toolchains / gcc_ver =="
rg -n "__qnx_gcc_ver|toolchain-aarch64-qnx|qnx" samples cmake .github ci docs -g '!**/build/**' -g '!**/dist/**' -g '!**/node_modules/**' || true
Repository: NVIDIA/cccl
Length of output: 210
🏁 Script executed:
#!/bin/bash
set -euo pipefail
python3 - <<'PY'
from pathlib import Path
p = Path('samples/cmake/toolchains/toolchain-aarch64-qnx.cmake')
print(p.exists(), p.stat().st_size if p.exists() else 0)
PY
Repository: NVIDIA/cccl
Length of output: 159
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== samples/cmake/toolchains/toolchain-aarch64-qnx.cmake =="
sed -n '1,220p' samples/cmake/toolchains/toolchain-aarch64-qnx.cmake
echo
echo "== similar toolchain patterns =="
rg -n "CACHE STRING|CACHE PATH|if\\(NOT DEFINED|message\\(FATAL_ERROR|set\\(__.*gcc_ver|qnx" samples/cmake/toolchains -g '*.cmake' || true
Repository: NVIDIA/cccl
Length of output: 2614
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== usages of toolchain-aarch64-qnx =="
rg -n "toolchain-aarch64-qnx|QNX_HOST|QNX_TARGET|__qnx_gcc_ver" . -g '!**/build/**' -g '!**/dist/**' -g '!**/node_modules/**' || true
echo
echo "== toolchain-aarch64-linux.cmake =="
sed -n '1,120p' samples/cmake/toolchains/toolchain-aarch64-linux.cmake
echo
echo "== docs mentioning QNX toolchain =="
rg -n "QNX|qnx" docs samples README.md .github ci -g '!**/build/**' -g '!**/dist/**' -g '!**/node_modules/**' || true
Repository: NVIDIA/cccl
Length of output: 8966
important: Define or require __qnx_gcc_ver in this toolchain; it is only referenced here, so if it is unset these flags become malformed (-V,gcc_ntoaarch64le, --qpp-config=,gcc_ntoaarch64le) and QNX builds will fail. Add a cache declaration with a default or a fatal guard.
inline bool getCmdLineArgumentString(const int argc, const char** argv, const char* string_ref, char** string_retval)
{
  bool bFound = false;

  if (argc >= 1)
  {
    for (int i = 1; i < argc; i++)
    {
      int string_start = stringRemoveDelimiter('-', argv[i]);
      char* string_argv = const_cast<char*>(&argv[i][string_start]);
      int length = static_cast<int>(strlen(string_ref));

      if (!STRNCASECMP(string_argv, string_ref, length))
      {
        *string_retval = &string_argv[length + 1];
        bFound = true;
        continue;
      }
    }
  }

  if (!bFound)
  {
    *string_retval = NULL;
  }

  return bFound;
}
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
important: out-of-bounds read when the matched flag has no =value suffix.
Unlike getCmdLineArgumentValue/getCmdLineArgumentInt/getCmdLineArgumentFloat in this same file, which check length + 1 <= strlen(string_argv) and detect the = before offsetting, getCmdLineArgumentString unconditionally does &string_argv[length + 1]. If the CLI flag is passed without =value (e.g. -file alone), string_argv[length] is the null terminator and string_argv[length + 1] reads one byte past the string buffer.
🐛 Proposed fix
       if (!STRNCASECMP(string_argv, string_ref, length))
       {
-        *string_retval = &string_argv[length + 1];
+        int auto_inc = (static_cast<int>(strlen(string_argv)) > length && string_argv[length] == '=') ? 1 : 0;
+        *string_retval = &string_argv[length + auto_inc];
         bFound = true;
         continue;
       }
📝 Committable suggestion
inline bool getCmdLineArgumentString(const int argc, const char** argv, const char* string_ref, char** string_retval)
{
  bool bFound = false;
  if (argc >= 1)
  {
    for (int i = 1; i < argc; i++)
    {
      int string_start = stringRemoveDelimiter('-', argv[i]);
      char* string_argv = const_cast<char*>(&argv[i][string_start]);
      int length = static_cast<int>(strlen(string_ref));
      if (!STRNCASECMP(string_argv, string_ref, length))
      {
        int auto_inc = (static_cast<int>(strlen(string_argv)) > length && string_argv[length] == '=') ? 1 : 0;
        *string_retval = &string_argv[length + auto_inc];
        bFound = true;
        continue;
      }
    }
  }
  if (!bFound)
  {
    *string_retval = NULL;
  }
  return bFound;
}
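The guard in the getCmdLineArgumentString fix can be exercised in isolation. The following is a minimal sketch under simplifying assumptions: the match is case-sensitive (the real helper uses STRNCASECMP), the argument's leading dashes are assumed to be stripped already, and `flag_value_or_empty` is a hypothetical name that does not exist in helper_string.h.

```cpp
#include <cstring>

// Hypothetical stand-alone version of the value lookup (illustrative only).
// Returns a pointer to the text after "flag=", or to the empty string at the
// terminator when the flag was passed without "=value". Returns nullptr when
// arg does not start with flag. The result never points past the buffer.
inline const char* flag_value_or_empty(const char* arg, const char* flag)
{
  const std::size_t flag_len = std::strlen(flag);
  if (std::strncmp(arg, flag, flag_len) != 0)
  {
    return nullptr;
  }
  // After a successful prefix match, arg[flag_len] is always readable:
  // it is '=', some other character, or the null terminator.
  const std::size_t skip = (arg[flag_len] == '=') ? 1 : 0;
  return arg + flag_len + skip;
}
```

With this shape, a bare flag yields an empty string rather than a pointer one element beyond the terminator, which is the behavior the proposed `auto_inc` change aims for.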
## Supported SM Architectures

[SM 7.0 ](https://developer.nvidia.com/cuda-gpus) [SM 7.5 ](https://developer.nvidia.com/cuda-gpus) [SM 8.0 ](https://developer.nvidia.com/cuda-gpus) [SM 8.6 ](https://developer.nvidia.com/cuda-gpus) [SM 8.9 ](https://developer.nvidia.com/cuda-gpus) [SM 9.0 ](https://developer.nvidia.com/cuda-gpus) [SM 10.0 ](https://developer.nvidia.com/cuda-gpus) [SM 11.0 ](https://developer.nvidia.com/cuda-gpus) [SM 12.0 ](https://developer.nvidia.com/cuda-gpus)
🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick win
suggestion: SM 7.0 is listed as supported here, but samples/cubDeviceFind/CMakeLists.txt doesn't include architecture 70 in CMAKE_CUDA_ARCHITECTURES (only 75/80/86/87/89/90/100/110/120). Reconcile the two. Also each link label has a trailing space (SM 7.0 ) flagged by markdownlint MD039.
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 13-13: Spaces inside link text
(MD039, no-space-in-links)
[warning] 13-13: Spaces inside link text
(MD039, no-space-in-links)
[warning] 13-13: Spaces inside link text
(MD039, no-space-in-links)
[warning] 13-13: Spaces inside link text
(MD039, no-space-in-links)
[warning] 13-13: Spaces inside link text
(MD039, no-space-in-links)
[warning] 13-13: Spaces inside link text
(MD039, no-space-in-links)
[warning] 13-13: Spaces inside link text
(MD039, no-space-in-links)
[warning] 13-13: Spaces inside link text
(MD039, no-space-in-links)
[warning] 13-13: Spaces inside link text
(MD039, no-space-in-links)
Source: Linters/SAST tools
## Prerequisites

Download and install the [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads) for your corresponding platform.
Make sure the dependencies mentioned in [Dependencies]() section above are installed.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
suggestion: [Dependencies]() is an empty link (markdownlint MD042). Either remove the markdown link syntax or point it at the actual "Dependencies needed to build/run" anchor.
Proposed fix
-Make sure the dependencies mentioned in [Dependencies]() section above are installed.
+Make sure the dependencies mentioned in the "Dependencies needed to build/run" section above are installed.
📝 Committable suggestion
Make sure the dependencies mentioned in the "Dependencies needed to build/run" section above are installed.
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 48-48: No empty links
(MD042, no-empty-links)
Source: Linters/SAST tools
set(
  CMAKE_CUDA_ARCHITECTURES
  75
  80
  86
  87
  89
  90
  100
  110
  120
)
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
🧩 Analysis chain
🌐 Web query:
Does NVIDIA cuda-samples CMakeLists still include architecture 110 in its default CMAKE_CUDA_ARCHITECTURES list, and was this issue resolved?
💡 Result:
As of July 2, 2026, the NVIDIA cuda-samples repository continues to include 110 in its default CMAKE_CUDA_ARCHITECTURES list within its CMakeLists.txt files [1][2][3]. This has been an ongoing issue for users, as documented in multiple GitHub issues [4][5]. The inclusion of 110 (compute_110/sm_110) in this list often leads to build failures because the CUDA Toolkit versions used by many users do not support this specific architecture, causing the nvcc compiler to return an "Unsupported gpu architecture" fatal error [5]. The issue remains unresolved in the sense that the default configuration in the repository still explicitly sets 110 as a target [1][2][3]. Users encountering this problem typically resolve it locally by modifying the CMAKE_CUDA_ARCHITECTURES list in the relevant CMakeLists.txt files to remove 110 or by overriding the variable during the CMake configuration process [4][5].
Citations:
- 1: https://github.com/NVIDIA/cuda-samples/blob/master/CMakeLists.txt
- 2: https://github.com/NVIDIA/cuda-samples/blob/master/cpp/0_Introduction/template/CMakeLists.txt
- 3: https://github.com/NVIDIA/cuda-samples/blob/4f735616/Samples/0_Introduction/vectorAddMMAP/CMakeLists.txt
- 4: Unsupported gpu architecture 'compute_110' cuda-samples#400
- 5: nvcc fatal error: Unsupported gpu architecture 'compute_110' (but compute_120 is supported) cuda-samples#388
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the relevant file and surrounding sample build context.
git ls-files samples/cubDeviceSegmentedScan/CMakeLists.txt
wc -l samples/cubDeviceSegmentedScan/CMakeLists.txt
cat -n samples/cubDeviceSegmentedScan/CMakeLists.txt | sed -n '1,120p'
# Look for any toolkit/version gating or architecture overrides nearby.
rg -n "CMAKE_CUDA_ARCHITECTURES|CUDA_TOOLKIT|CUDAToolkit|110|sm_110|compute_110" samples/cubDeviceSegmentedScan -S
# Look for repo-wide guidance on supported CUDA versions/architectures in samples.
rg -n "110|sm_110|compute_110|CUDA 13|CUDA 12\.8|CUDA 12\.9|CMAKE_CUDA_ARCHITECTURES" -S .github ci samples CMakeLists.txt docs | sed -n '1,200p'
Repository: NVIDIA/cccl
Length of output: 18740
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find the repository's stated CUDA toolkit support and any sample-specific version gating.
rg -n "CUDA 12|CUDA 13|12\.8|12\.9|13\.0|CUDAToolkit .*REQUIRED|find_package\(CUDAToolkit|CMAKE_CUDA_ARCHITECTURES" -S \
CMakeLists.txt README.md docs .devcontainer ci samples | sed -n '1,240p'
# Inspect the top-level CMake logic that mentions CMAKE_CUDA_ARCHITECTURES special handling.
cat -n CMakeLists.txt | sed -n '1,120p'
Repository: NVIDIA/cccl
Length of output: 27118
important: CMAKE_CUDA_ARCHITECTURES should not unconditionally include 110 here. CCCL still supports CUDA 12.9, and compute_110 is rejected by nvcc before CUDA 13.0, so this sample fails to build on the older toolchain. Gate 110 on the toolkit version or remove it from the default list.
unsigned int totalMem = (keysOnly ? 2 : 4) * numElements * sizeof(T);

if (devprop.totalGlobalMem < totalMem)
{
  printf("Error: insufficient amount of memory to sort %d elements.\n", numElements);
  printf("%d bytes needed, %d bytes available\n", (int) totalMem, (int) devprop.totalGlobalMem);
  exit(EXIT_SUCCESS);
}
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
suggestion: totalMem is unsigned int and computed as (keysOnly ? 2 : 4) * numElements * sizeof(T), which can overflow 32 bits for large -n values, silently defeating the memory-sufficiency check. The subsequent (int) casts of totalMem/devprop.totalGlobalMem in the printf will also print negative numbers on GPUs with more than 2GB of memory.
Proposed fix
-  unsigned int totalMem = (keysOnly ? 2 : 4) * numElements * sizeof(T);
+  size_t totalMem = (keysOnly ? size_t{2} : size_t{4}) * static_cast<size_t>(numElements) * sizeof(T);
   if (devprop.totalGlobalMem < totalMem)
   {
     printf("Error: insufficient amount of memory to sort %d elements.\n", numElements);
-    printf("%d bytes needed, %d bytes available\n", (int) totalMem, (int) devprop.totalGlobalMem);
+    printf("%zu bytes needed, %zu bytes available\n", totalMem, devprop.totalGlobalMem);
     exit(EXIT_SUCCESS);
   }
📝 Committable suggestion
size_t totalMem = (keysOnly ? size_t{2} : size_t{4}) * static_cast<size_t>(numElements) * sizeof(T);
if (devprop.totalGlobalMem < totalMem)
{
  printf("Error: insufficient amount of memory to sort %d elements.\n", numElements);
  printf("%zu bytes needed, %zu bytes available\n", totalMem, devprop.totalGlobalMem);
  exit(EXIT_SUCCESS);
}
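To make the wrap-around in the totalMem computation concrete, here is a small host-only sketch that contrasts the two expressions. It assumes a 32-bit unsigned int and a 64-bit size_t, which is typical for CUDA host platforms but not guaranteed; `bytes_needed_32` and `bytes_needed` are hypothetical names introduced only for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative only: mirrors the shape of the original expression, with the
// first multiplication done in 32-bit unsigned arithmetic and the final
// result narrowed to 32 bits. Either step can drop the high bits.
inline std::uint32_t bytes_needed_32(unsigned int num_elements, std::size_t elem_size, bool keys_only)
{
  return static_cast<std::uint32_t>((keys_only ? 2 : 4) * num_elements * elem_size);
}

// Widened version along the lines of the proposed fix: every factor is a
// size_t before multiplying, so the product does not wrap on 64-bit hosts.
inline std::size_t bytes_needed(unsigned int num_elements, std::size_t elem_size, bool keys_only)
{
  return (keys_only ? std::size_t{2} : std::size_t{4}) * static_cast<std::size_t>(num_elements) * elem_size;
}
```

For 2^30 four-byte elements in key/value mode, the 32-bit variant evaluates to 0 under these assumptions, so a check of the form `totalGlobalMem < totalMem` would pass no matter how little memory the device has.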
int main(int argc, char** argv)
{
  // Start logs
  printf("%s Starting...\n\n", argv[0]);

  findCudaDevice(argc, (const char**) argv);

  bool bTestResult = false;

  if (checkCmdLineFlag(argc, (const char**) argv, "float"))
  {
    bTestResult = testSort<float, true>(argc, argv);
  }
  else
  {
    bTestResult = testSort<unsigned int, false>(argc, argv);
  }

  printf(bTestResult ? "Test passed\n" : "Test failed!\n");
}
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
file='samples/radixSortThrust/radixSortThrust.cu'
echo '--- file outline ---'
ast-grep outline "$file" --view expanded || true
echo
echo '--- relevant ranges ---'
sed -n '1,80p' "$file"
echo
sed -n '110,260p' "$file"
echo
echo '--- search for run_tests reference ---'
rg -n "run_tests\.py|Test passed|Test failed|EXIT_SUCCESS|EXIT_FAILURE|return false;|totalMem" samples/radixSortThrust -n || true
Repository: NVIDIA/cccl
Length of output: 7538
important: main() prints the result but never returns it, so the process exits with success even when the sort fails. Add return bTestResult ? 0 : 1; at the end.
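One possible shape for propagating the result, sketched with a hypothetical helper (`report_and_exit_code` is not part of the sample or of the Common headers):

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: prints the verdict and maps it to a process exit code,
// so a test runner that only inspects the exit status sees failures.
inline int report_and_exit_code(bool test_passed)
{
  std::printf("%s\n", test_passed ? "Test passed" : "Test failed!");
  return test_passed ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

The last statement of main() would then be `return report_and_exit_code(bTestResult);`, which has the same effect as the suggested `return bTestResult ? 0 : 1;` on platforms where EXIT_FAILURE is 1.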
## Supported SM Architectures

[SM 5.0 ](https://developer.nvidia.com/cuda-gpus) [SM 5.2 ](https://developer.nvidia.com/cuda-gpus) [SM 5.3 ](https://developer.nvidia.com/cuda-gpus) [SM 6.0 ](https://developer.nvidia.com/cuda-gpus) [SM 6.1 ](https://developer.nvidia.com/cuda-gpus) [SM 7.0 ](https://developer.nvidia.com/cuda-gpus) [SM 7.2 ](https://developer.nvidia.com/cuda-gpus) [SM 7.5 ](https://developer.nvidia.com/cuda-gpus) [SM 8.0 ](https://developer.nvidia.com/cuda-gpus) [SM 8.6 ](https://developer.nvidia.com/cuda-gpus) [SM 8.7 ](https://developer.nvidia.com/cuda-gpus) [SM 8.9 ](https://developer.nvidia.com/cuda-gpus) [SM 9.0 ](https://developer.nvidia.com/cuda-gpus)
🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick win
important: Supported-architecture list doesn't match the build config. README lists SM 5.0–9.0, but samples/radixSortThrust/CMakeLists.txt only builds for SM 75/80/86/87/89/90/100/110/120. Also, markdownlint flags MD039 (spaces inside link text, e.g. [SM 5.0 ]) on this line.
Proposed fix
-[SM 5.0 ](https://developer.nvidia.com/cuda-gpus) [SM 5.2 ](https://developer.nvidia.com/cuda-gpus) [SM 5.3 ](https://developer.nvidia.com/cuda-gpus) [SM 6.0 ](https://developer.nvidia.com/cuda-gpus) [SM 6.1 ](https://developer.nvidia.com/cuda-gpus) [SM 7.0 ](https://developer.nvidia.com/cuda-gpus) [SM 7.2 ](https://developer.nvidia.com/cuda-gpus) [SM 7.5 ](https://developer.nvidia.com/cuda-gpus) [SM 8.0 ](https://developer.nvidia.com/cuda-gpus) [SM 8.6 ](https://developer.nvidia.com/cuda-gpus) [SM 8.7 ](https://developer.nvidia.com/cuda-gpus) [SM 8.9 ](https://developer.nvidia.com/cuda-gpus) [SM 9.0 ](https://developer.nvidia.com/cuda-gpus)
+[SM 7.5](https://developer.nvidia.com/cuda-gpus) [SM 8.0](https://developer.nvidia.com/cuda-gpus) [SM 8.6](https://developer.nvidia.com/cuda-gpus) [SM 8.7](https://developer.nvidia.com/cuda-gpus) [SM 8.9](https://developer.nvidia.com/cuda-gpus) [SM 9.0](https://developer.nvidia.com/cuda-gpus) [SM 10.0](https://developer.nvidia.com/cuda-gpus) [SM 11.0](https://developer.nvidia.com/cuda-gpus) [SM 12.0](https://developer.nvidia.com/cuda-gpus)
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 13-13: Spaces inside link text
(MD039, no-space-in-links; reported 13 times)
Source: Linters/SAST tools
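One way to keep the README list from drifting away from the build configuration is to generate it from the architecture list. The sketch below is a hedged illustration: the helper names `sm_label` and `readme_line` are hypothetical, the architecture values are the ones quoted in the comment above, and it assumes plain integer entries (suffixed values such as `90a` or `75-real` would need extra handling).

```python
def sm_label(arch: str) -> str:
    """Turn a CMake-style architecture number such as "75" or "100" into "SM 7.5" or "SM 10.0"."""
    major, minor = divmod(int(arch), 10)
    return f"SM {major}.{minor}"


def readme_line(archs: list[str], url: str) -> str:
    """Build the README link list with no spaces inside the link text (avoids MD039)."""
    return " ".join(f"[{sm_label(a)}]({url})" for a in archs)


# Architectures quoted in the review comment for samples/radixSortThrust/CMakeLists.txt.
ARCHS = ["75", "80", "86", "87", "89", "90", "100", "110", "120"]
URL = "https://developer.nvidia.com/cuda-gpus"

line = readme_line(ARCHS, URL)
print(line)
```

Because the link text is built from a format string, the trailing space that triggers MD039 cannot reappear.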
| if os.access(path, os.X_OK): | ||
| # Skip if it's a library file | ||
| if path.suffix.lower() in (".dll", ".so", ".dylib"): | ||
| continue | ||
| executables.append(path) |
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
important: Versioned shared libraries slip through this filter. path.suffix for libcudart.so.12 is .12, not .so, so it won't be skipped; .so files usually carry the exec bit and will be launched as "tests", producing spurious failures. The install tree copies shared libs (InstallSamples.cmake), so this is reachable in CI.
-    # Check if file is executable
-    if os.access(path, os.X_OK):
-        # Skip if it's a library file
-        if path.suffix.lower() in (".dll", ".so", ".dylib"):
-            continue
-        executables.append(path)
+    # Check if file is executable
+    if os.access(path, os.X_OK):
+        # Skip shared/dynamic libraries, including versioned names like libfoo.so.1.2
+        suffixes = {s.lower() for s in path.suffixes}
+        if suffixes & {".dll", ".so", ".dylib"}:
+            continue
+        executables.append(path)
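The suffix behaviour described in this comment can be checked with `pathlib` alone. This is a small sketch, not the project's run_tests.py: the helper `is_shared_library` and the file names are illustrative assumptions, and `PurePosixPath` is used so the result does not depend on the host OS or on files existing.

```python
from pathlib import PurePosixPath

LIBRARY_SUFFIXES = {".dll", ".so", ".dylib"}


def is_shared_library(path: PurePosixPath) -> bool:
    """Return True if any of the path's suffixes marks it as a shared library."""
    return bool({s.lower() for s in path.suffixes} & LIBRARY_SUFFIXES)


# Illustrative names only; none of these files needs to exist.
for name in ("libcudart.so.12", "libfoo.so.1.2", "libbar.dylib", "radixSortThrust"):
    p = PurePosixPath(name)
    print(f"{name}: suffix={p.suffix!r} suffixes={p.suffixes} library={is_shared_library(p)}")
```

Matching on any suffix is deliberately broad: an executable named something like `demo.so.old` would also be skipped, which seems an acceptable trade-off for a sample test runner.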
| uint newVerticesCount; | ||
| cudaMemcpy(&newVerticesCount, (dNewVerticesIDs_ + verticesCount_ - 1).get(), sizeof(uint), cudaMemcpyDeviceToHost); | ||
| ++newVerticesCount; |
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
important: this cudaMemcpy is unchecked, unlike the other cudaMemcpy calls in this file (e.g. lines 256-263, 476-480) which use checkCudaErrors. If it silently fails, newVerticesCount stays uninitialized and directly drives the convergence branch below (ALGORITHM_FINISHED vs continuing), which can corrupt results or hang the loop. Several other CUDA calls in this file (cudaEventCreate, cudaMemGetInfo, cudaMemset, cudaEventElapsedTime) share this same unchecked pattern, but this one is the most consequential since it gates control flow.
🛠️ Suggested fix
uint newVerticesCount;
- cudaMemcpy(&newVerticesCount, (dNewVerticesIDs_ + verticesCount_ - 1).get(), sizeof(uint), cudaMemcpyDeviceToHost);
+ checkCudaErrors(cudaMemcpy(
+ &newVerticesCount, (dNewVerticesIDs_ + verticesCount_ - 1).get(), sizeof(uint), cudaMemcpyDeviceToHost));
++newVerticesCount;
Description
closes
Migrating the existing CCCL (C++ and Python) samples from cuda-samples to this repository.
These will be mirrored to the cuda-samples repo on each release, so we need to keep them in the same format they are in today.
Also added a CI run for these samples.
In general, everything in /samples doesn't need to be reviewed right now. Those are simply migrated here and formatted. We can evaluate adding/removing/cleaning in the future.
Checklist