[cuda.compute] Expose .serialize() and .deserialize() methods in Python #9644
shwina wants to merge 14 commits
@coderabbitai full review
📝 Walkthrough
This PR renames the C parallel AoT serialization API to use serialization naming, adds blob validation and diagnostics APIs, and introduces a Python schema-driven serialization system with Cython bindings, per-algorithm serialize/deserialize support, public loaders, and round-trip tests.
Changes
- C parallel serialization rename
- Python AoT serialization feature
Actionable comments posted: 7
🧹 Nitpick comments (1)
python/cuda_cccl/tests/compute/test_binary_search_aot.py (1)
25-72: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
suggestion: add a negative test for loading a `lower_bound` blob through `load_upper_bound` and vice versa. The current tests cover only valid round trips, so the mode-mismatch validation could regress unnoticed.
`with pytest.raises(ValueError, match="mode mismatch"): load_upper_bound(lower_bound_blob)`
@coderabbitai review
⚠️ Outside diff range comments (1)
python/cuda_cccl/cuda/compute/algorithms/_scan.py (1)
137-150: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
important: Reject non-binary `force_inclusive` values from the AoT blob.
Line 138 converts any non-zero byte to `True`, so a malformed blob can silently deserialize an exclusive scan as inclusive and write incorrect results. Read the raw byte, require `0` or `1`, then convert to `bool`.
As per path instructions, `python/cuda_cccl/**/*`: Focus on Python API stability, CUDA array interoperability, memory ownership, JIT/NVRTC/nvJitLink behavior, package boundaries, user-defined operator correctness, tests, and examples.
Source: Path instructions
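The strict decoding suggested above could look roughly like the following. This is a minimal sketch, not the PR's code: `decode_strict_bool` and the one-byte payload layout are hypothetical names and assumptions chosen for illustration.

```python
def decode_strict_bool(raw: int, field: str = "force_inclusive") -> bool:
    """Decode a one-byte boolean flag, rejecting anything other than 0 or 1."""
    if raw not in (0, 1):
        raise ValueError(f"corrupt blob: {field} byte is {raw!r}, expected 0 or 1")
    return raw == 1


# Hypothetical payload: a single flag byte followed by opaque data.
payload = bytes([1]) + b"rest-of-blob"
force_inclusive = decode_strict_bool(payload[0])
```

Compared with `bool(raw)`, a flipped or garbage byte such as `0x02` fails loudly at load time instead of silently selecting the inclusive code path.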
🧹 Nitpick comments (1)
python/cuda_cccl/cuda/compute/_aot/dispatch.py (1)
68-74: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
suggestion: Avoid swallowing internal `AttributeError`s from real serializers.
This catches any `AttributeError` raised inside `algorithm.serialize()`, masking implementation defects as “not AoT-serializable”. Check `getattr(algorithm, "serialize", None)` first, validate it is callable, then invoke it outside the `AttributeError` handler.
@coderabbitai full review
/ok to test b0f4db3
Actionable comments posted: 2
🧹 Nitpick comments (3)
c/parallel/src/aot.cpp (1)
18-18: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value
suggestion: Switch this include to angle brackets.
`c/parallel/src/aot.cpp:18` still uses quotes; the repo rule for headers is angle-bracket includes.
Source: Coding guidelines
python/cuda_cccl/cuda/compute/algorithms/_scan.py (1)
136-152: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win
suggestion: the "exclusive scan with no init value" error now only surfaces on first `__call__` (property access), whereas previously (per the summary, "relying on a pre-assigned instance attribute") this was presumably validated eagerly at construction. A caller building an invalid scan via `make_exclusive_scan(..., init_value=None)` will no longer fail immediately; the error only appears when the scan actually executes. Worth forcing eager evaluation once at the end of `__init__` to keep the fail-fast contract:
🔧 Proposed fix
         self.force_inclusive = force_inclusive
         # Compile the op with value types
         self.op_cccl = op.compile((value_type, value_type), value_type)
         self.build_result = call_build(
             _bindings.DeviceScanBuildResult,
             self.d_in_cccl,
             self.d_out_cccl,
             self.op_cccl,
             init_value_type_info,
             force_inclusive,
             self.init_kind,
         )
+
+        # Trigger validation eagerly so invalid combinations fail at build time
+        # rather than on first execution.
+        _ = self.device_scan_fn
Source: Path instructions
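To illustrate the fail-fast pattern in isolation, here is a small self-contained sketch. `ToyScan` and `scan_fn` are hypothetical stand-ins, not the real `cuda.compute` scan class or its `device_scan_fn` property; the point is only that touching a lazily validated property at the end of `__init__` moves the error to construction time.

```python
class ToyScan:
    """Toy stand-in for a scan object; not the real cuda.compute class."""

    def __init__(self, force_inclusive: bool, init_value=None):
        self.force_inclusive = force_inclusive
        self.init_value = init_value
        # Touch the lazily validated property once so that invalid
        # combinations fail here rather than on the first call.
        _ = self.scan_fn

    @property
    def scan_fn(self) -> str:
        # Validation lives in the property, as in the reviewed code path.
        if not self.force_inclusive and self.init_value is None:
            raise ValueError("exclusive scan requires an init value")
        return "inclusive" if self.force_inclusive else "exclusive"
```

With this shape, `ToyScan(False)` raises `ValueError` immediately, while `ToyScan(False, 0)` and `ToyScan(True)` construct normally.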
python/cuda_cccl/tests/compute/test_aot_diagnostics.py (1)
27-34: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
suggestion: dedupe the `USING_V2` skip boilerplate.
The `try/except ImportError` + `pytestmark = pytest.mark.skipif(USING_V2, ...)` block is repeated verbatim in every AOT test file in this cohort (`test_binary_search_aot.py`, `test_histogram_aot.py`, `test_merge_sort_aot.py`, `test_transform_aot.py`, etc.). Move it to a shared `conftest.py` fixture/marker to avoid maintaining N copies.
    try:
        return algorithm.serialize()
    except AttributeError as e:
        raise TypeError(
            f"{type(algorithm).__name__} is not an AoT-serializable algorithm "
            "(expected an object from a make_* factory)."
        ) from e
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
important: the `except AttributeError` wraps the whole call, so an `AttributeError` raised inside a valid `serialize()` (a real bug, such as a missing attribute on a partially-built object) is swallowed and reported as "not an AoT-serializable algorithm", masking the true failure. Gate on the attribute's presence instead of catching from the call.
-    try:
-        return algorithm.serialize()
-    except AttributeError as e:
-        raise TypeError(
-            f"{type(algorithm).__name__} is not an AoT-serializable algorithm "
-            "(expected an object from a make_* factory)."
-        ) from e
+    serialize_method = getattr(type(algorithm), "serialize", None)
+    if not callable(serialize_method):
+        raise TypeError(
+            f"{type(algorithm).__name__} is not an AoT-serializable algorithm "
+            "(expected an object from a make_* factory)."
+        )
+    return algorithm.serialize()
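The behavioral difference is easiest to see with toy classes. The sketch below is an illustration under assumptions: the free function `serialize` and the classes `Good`, `Buggy`, and `NotAnAlgorithm` are hypothetical and only mirror the shape of the suggested dispatch code.

```python
def serialize(algorithm) -> bytes:
    """Dispatch to algorithm.serialize() without masking its internal errors."""
    method = getattr(type(algorithm), "serialize", None)
    if not callable(method):
        raise TypeError(
            f"{type(algorithm).__name__} is not a serializable algorithm"
        )
    # Called outside any AttributeError handler, so bugs propagate unchanged.
    return algorithm.serialize()


class Good:
    def serialize(self) -> bytes:
        return b"blob"


class Buggy:
    def serialize(self) -> bytes:
        return self.missing_attribute  # simulated implementation defect


class NotAnAlgorithm:
    pass
```

Here `serialize(Buggy())` surfaces the original `AttributeError` with its traceback, while `serialize(NotAnAlgorithm())` still produces the friendly `TypeError`.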
    __serde_schema__ = (
        ("d_in_keys_cccl", ITER),
        ("d_in_values_cccl", ITER),
        ("d_out_keys_cccl", ITER),
        ("d_out_values_cccl", ITER),
        ("op_cccl", OP),
        ("build_result", BUILD_RESULT(_bindings.DeviceMergeSortBuildResult)),
    )
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
important: `python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py:35-43` uses plain `ITER` for `d_in_values_cccl`/`d_out_values_cccl`, but `_MergeSort.__init__` allows both to be `None`. That makes keys-only merge sort fail to serialize; wrap the value iterators in a conditional schema or another optional wrapper.
Source: Path instructions
This one is a false positive — keys-only merge sort serializes and round-trips correctly. When d_in_values / d_out_values are None, to_cccl_input_iter(None) / to_cccl_output_iter(None) return a valid "none" POINTER iterator (uint8, state=None), which the plain ITER schema member writes and reads back without issue — so no conditional/optional wrapper is needed.
Verified end-to-end (build → serialize → deserialize → run → correctly sorted) and added a test_serialize_deserialize_merge_sort_keys_only regression test (0a8e15b).
@shwina Understood — confirmed. Since to_cccl_input_iter(None)/to_cccl_output_iter(None) produce a real "none" POINTER iterator object (not a Python None), the plain ITER schema member serializes/deserializes it like any other iterator, so no conditional wrapper is needed. Thanks for adding the regression test.
✏️ Learnings added
Learnt from: shwina
Repo: NVIDIA/cccl PR: 9644
File: python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py:35-43
Timestamp: 2026-07-02T22:18:14.092Z
Learning: In `python/cuda_cccl/cuda/compute` AoT serialization (`_aot/serializable.py`, `_aot/serde.py`), optional iterator arguments like `d_in_values`/`d_out_values` in algorithms such as `_MergeSort` (`python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py`) do not need `CONDITIONAL`/optional schema wrappers. `to_cccl_input_iter(None)`/`to_cccl_output_iter(None)` (in `_cccl_interop.py`) return a valid "none" `POINTER` iterator (uint8 type, `state=None`) rather than Python `None`, so the plain `ITER` schema kind already serializes/deserializes these "absent" values correctly.
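The reasoning in this thread can be reproduced with a toy model. Everything below is a hypothetical sketch: `ToyIter`, `to_toy_iter`, `ToyMergeSort`, and the JSON encoding are illustrative assumptions, not the real bindings or blob format. It shows why a schema that lists plain iterator members needs no optional wrapper once `None` has already been mapped to a real "none" descriptor.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyIter:
    """Toy iterator descriptor; the real one lives in the Cython bindings."""
    kind: str
    value_type: str
    state: Optional[str]


def to_toy_iter(obj: Optional[str]) -> ToyIter:
    # Mirrors the behaviour described above: None maps to a real
    # "none" pointer descriptor instead of staying a Python None.
    if obj is None:
        return ToyIter("POINTER", "uint8", None)
    return ToyIter("POINTER", "int32", obj)


class ToyMergeSort:
    # (attribute name, schema kind) pairs, in the spirit of __serde_schema__.
    schema = (("d_in_keys", "ITER"), ("d_in_values", "ITER"))

    def __init__(self, keys, values=None):
        self.d_in_keys = to_toy_iter(keys)
        self.d_in_values = to_toy_iter(values)

    def serialize(self) -> bytes:
        fields = {name: vars(getattr(self, name)) for name, _ in self.schema}
        return json.dumps(fields).encode()

    @classmethod
    def deserialize(cls, blob: bytes) -> "ToyMergeSort":
        obj = cls.__new__(cls)
        for name, fields in json.loads(blob).items():
            setattr(obj, name, ToyIter(**fields))
        return obj
```

A keys-only instance such as `ToyMergeSort("keys")` round-trips because its `d_in_values` member is an ordinary descriptor whose `state` happens to be `None`.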
/ok to test 0a8e15b
🥳 CI Workflow Results
🟩 Finished in 4h 07m: Pass: 100%/62 | Total: 17h 37m | Max: 1h 08m | Hits: 100%/1467
…gorithms
Exposes the C serialize/deserialize layer through the Python bindings so
a built algorithm can be persisted and reloaded without any JIT step:
* algo.serialize() -> bytes — return the blob directly
* AlgoClass.deserialize(blob, ...) -> AlgoClass — reconstruct from bytes
* _bindings_impl.pyx / _bindings.pyi - serialize()/deserialize() on each
Device*BuildResult, plus cccl_aot_buffer_free and the new externs.
* algorithms/_*.py - serialize()/deserialize() on each algorithm class.
Adds tests/compute/test_<algo>_aot.py round-trip coverage for every
algorithm and an AoT-vs-JIT benchmark (bench_aot.py).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add `_OpAdapter.compile_for_load()` that returns a minimal Op stub (empty LTOIR, correct type/state_alignment) without triggering numba-cuda JIT compilation. The compiled CUBIN is already embedded in the AoT blob; only op.type and op.state are read at execute time, so LTOIR is not needed on the load path.
_StatelessOp and _StatefulOp override to skip JIT entirely. Well-known ops and RawOp fall through to compile() since they are already JIT-free.
All eight deserialize() methods in the algorithm layer are updated to call compile_for_load() instead of compile(), reducing deserialize latency from ~1s to ~0ms for Python-callable operators.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move all AoT C extern declarations and serialize/deserialize implementations into backend-conditional .pxi files. v1 gets the real implementation; v2 gets stubs that raise NotImplementedError. CMake selects the right pxi the same way it handles the other backend-conditional files (segmented_reduce_backend, etc.).
AoT tests skip automatically on v2 via pytestmark + USING_V2.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The serialize/deserialize path had no guard against loading a blob that was saved for the opposite binary_search mode: an upper_bound artifact could silently be deserialized through load_lower_bound and vice versa.
Fix:
- _BinarySearch stores its mode in a new slot.
- serialize() prepends a 4-byte mode tag (b"LBND" / b"UBND") before the C-level blob.
- _BinarySearch._deserialize() reads and validates the tag; raises ValueError if the blob mode doesn't match the expected_mode.
- Add module-level load_lower_bound() / load_upper_bound() that call _deserialize with the appropriate expected_mode, making the API symmetrical with make_lower_bound / make_upper_bound.
- Export load_lower_bound / load_upper_bound from algorithms/__init__.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
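The tagging scheme in this commit message can be sketched on plain bytes. The tag values `b"LBND"` and `b"UBND"` come from the message itself; the helper names `tag_blob` and `untag_blob` and the opaque payload are hypothetical, and the real code operates on the C-level blob rather than on an arbitrary byte string.

```python
MODE_TAGS = {"lower_bound": b"LBND", "upper_bound": b"UBND"}


def tag_blob(mode: str, payload: bytes) -> bytes:
    """Prepend the 4-byte mode tag to an opaque payload."""
    return MODE_TAGS[mode] + payload


def untag_blob(blob: bytes, expected_mode: str) -> bytes:
    """Validate the leading tag and return the payload that follows it."""
    tag, payload = blob[:4], blob[4:]
    if tag != MODE_TAGS[expected_mode]:
        raise ValueError(
            f"mode mismatch: blob tag {tag!r} does not match {expected_mode}"
        )
    return payload


lower_blob = tag_blob("lower_bound", b"c-level-bytes")
```

Loading `lower_blob` with `expected_mode="upper_bound"` raises `ValueError` instead of silently running the wrong search.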
…ization
`Device<algo>` `deserialize()` now takes only the blob: the iterator, operator, and value descriptors are written into a self-describing sidecar (`cuda/compute/_aot_serde.py`) prepended to the C `build_result`, and rebuilt on load with no objects supplied. Custom iterators are fully supported (their device LTOIR round-trips), and no JIT runs on the load path.
The operator's real device code is serialized (exactly what a normal `__call__` passes to execute), which supersedes the earlier placeholder-op approach; the now-unused `compile_for_load()` helper is removed. Adds `Op.operator_type` and `Iterator.alignment` getters used by the sidecar.
Blob format is versioned (`CCAOTPY1`) and tagged per algorithm; binary_search folds its search mode into the sidecar so `load_lower_bound`/`load_upper_bound` stay object-free and reject opposite-mode blobs.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
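A self-describing blob of this kind might be framed as follows. Only the `CCAOTPY1` marker is taken from the commit message; the layout (magic, little-endian length, JSON header, payload) and the names `pack` and `unpack` are assumptions made for this sketch and are not the actual format.

```python
import json
import struct

MAGIC = b"CCAOTPY1"  # version marker named in the commit message


def pack(algorithm: str, sidecar: dict, build_result: bytes) -> bytes:
    """Assumed layout: magic | u32 header length | JSON header | build_result."""
    header = json.dumps({"algorithm": algorithm, "sidecar": sidecar}).encode()
    return MAGIC + struct.pack("<I", len(header)) + header + build_result


def unpack(blob: bytes, expected_algorithm: str):
    """Check magic and algorithm tag, then split sidecar from payload."""
    if blob[: len(MAGIC)] != MAGIC:
        raise ValueError("unrecognized blob magic or version")
    start = len(MAGIC) + 4
    (header_len,) = struct.unpack("<I", blob[len(MAGIC):start])
    header = json.loads(blob[start:start + header_len])
    if header["algorithm"] != expected_algorithm:
        raise ValueError(
            f"blob is for {header['algorithm']}, not {expected_algorithm}"
        )
    return header["sidecar"], blob[start + header_len:]


blob = pack("binary_search", {"mode": "lower_bound"}, b"cubin-bytes")
```

Because the sidecar travels inside the blob, the loader needs no iterator or operator objects from the caller, which is the property the commit describes.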
Deserializing an AoT blob could fail with only an opaque CUDA error code while the actual reason was printed to stdout by the C layer and discarded. The two worst cases were an ABI/format mismatch and loading a CUBIN built for a different GPU architecture, both of which surfaced with no actionable message.
Add an AoT blob validator to the C layer and route the deserialize path through it:
* New cccl_aot_validate_blob() checks the blob magic, format version, and CCCL C parallel ABI version, and, for CUBIN payloads, that the target compute-capability major matches the device the blob would load on (falling back to the default device when there is no current context, via idempotent cuInit, so a bare deserialize is still checked). It runs before cuLibraryLoadData, turning a deep opaque failure into an early, clear one.
* New cccl_aot_last_error() exposes the descriptive message for callers to surface instead of a bare CUDA error code.
Both are declared in a standalone header (cccl/c/aot_diagnostics.h) rather than aot.h so they don't trigger a rebuild of every algorithm translation unit.
The v1 AoT bindings now validate through cccl_aot_validate_blob() in every cccl_device_<algo>_deserialize and report cccl_aot_last_error() on failure.
Adds tests/compute/test_aot_diagnostics.py (ABI mismatch, wrong compute-capability major, corrupt magic, and a valid-blob round trip). The CC case is exercised single-GPU by patching the blob's cc field; a real cross-GPU load is the same code path.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
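The order of checks described in this commit can be modeled in a few lines. This is a Python sketch of the idea only: the real validator is C code, and the header layout, the `TOYB` magic, the version numbers, and the function `validate_blob` below are all hypothetical.

```python
import struct

# Hypothetical header: magic, format version, ABI version, CC major.
HEADER = struct.Struct("<4sIII")
MAGIC, FORMAT_VERSION, ABI_VERSION = b"TOYB", 1, 7


def validate_blob(blob: bytes, device_cc_major: int) -> None:
    """Raise ValueError with a descriptive message before any load is attempted."""
    if len(blob) < HEADER.size:
        raise ValueError("blob is too short to contain a header")
    magic, fmt, abi, cc_major = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise ValueError(f"bad magic {magic!r}")
    if fmt != FORMAT_VERSION:
        raise ValueError(f"format version {fmt} != {FORMAT_VERSION}")
    if abi != ABI_VERSION:
        raise ValueError(f"ABI version {abi} != {ABI_VERSION}")
    if cc_major != device_cc_major:
        raise ValueError(
            f"blob targets compute capability {cc_major}.x, "
            f"device is {device_cc_major}.x"
        )


good = HEADER.pack(MAGIC, FORMAT_VERSION, ABI_VERSION, 9) + b"cubin"
validate_blob(good, device_cc_major=9)  # passes silently
```

Patching the compute-capability field of `good`, as the commit's single-GPU test does with the real blob, makes the same call fail with a message that names both architectures.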
Actionable comments posted: 1
Description
This PR introduces two Python functions, `cuda.compute.serialize` and `cuda.compute.deserialize`. These are relatively low-level functions that can be used to implement things like:
An example of compiling an algorithm object and writing it to disk, then reading it and executing the algorithm from a subsequent Python process:
Implementation
Each algorithm class now inherits from `Serializable`. This inheritance requires the algorithm class to define a `__serialization_schema__` member, which defines the members that will be serialized and their schema kinds (e.g., `ITER`, `OP`, `VALUE`). `Serializable` knows how to serialize each of those kinds of objects into bytes.
As a note, 25298cb also does a huge rename of "aot" to "serialization", as ahead-of-time compilation is only one application of the serialization/deserialization capability.
Follow-up work
- `make_<algo>` API to accept a (list of) `compute_capability` so that AoT compilation can be done for a different GPU (or with no GPU present at all).
Checklist