[cuda.compute] Expose `.serialize()` and `.deserialize()` methods in Python by shwina · Pull Request #9644 · NVIDIA/cccl

shwina · 2026-06-30T15:28:37Z

Description

This PR introduces two Python functions, cuda.compute.serialize and cuda.compute.deserialize. These are relatively low-level functions that can be used to implement things like:

Ahead-of-time compilation workflows
Disk cache
Cross-node communication of algorithm objects

An example of compiling an algorithm object and writing it to disk, then reading it and executing the algorithm from a subsequent Python process:

import cuda.compute as cc, cupy as cp, numpy as np

d_in = cp.empty(1)
d_out = cp.empty(1)
op = lambda x: 2 * x
transformer = cc.make_unary_transform(d_in=d_in, d_out=d_out, op=op)
with open("transform.cclb", "wb") as f:
    f.write(cc.serialize(transformer))

# Later, in a separate Python process:
import cuda.compute as cc, cupy as cp, numpy as np

with open("transform.cclb", "rb") as f:
    transformer = cc.deserialize(f.read())

d_in = cp.asarray([1., 2, 3])
d_out = cp.empty_like(d_in)
op = lambda x: 2 * x  # doesn't matter what this unary op actually is
transformer(d_in=d_in, d_out=d_out, op=op, num_items=len(d_in))
cp.testing.assert_allclose(d_out, cp.asarray([2., 4, 6]))
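
The disk-cache use case listed above could be layered on top of any serialize/deserialize pair. The following is a minimal sketch under stated assumptions, not part of this PR: `load_or_build`, its parameters, and the key-hashing scheme are hypothetical, and the serializer and deserializer are passed in as plain callables so the sketch runs without a GPU.

```python
import hashlib
from pathlib import Path
from typing import Any, Callable


def load_or_build(
    cache_dir: Path,
    key: str,
    build: Callable[[], Any],
    serialize: Callable[[Any], bytes],
    deserialize: Callable[[bytes], Any],
) -> Any:
    """Return the cached object if its blob is on disk, else build and store it.

    Every name here is hypothetical; this is an illustration only.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the key so arbitrary strings map to a safe file name.
    path = cache_dir / (hashlib.sha256(key.encode()).hexdigest() + ".cclb")
    if path.exists():
        return deserialize(path.read_bytes())
    obj = build()
    # Write to a temporary file and rename it, so a crash mid-write never
    # leaves a truncated blob under the final name.
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_bytes(serialize(obj))
    tmp.replace(path)
    return obj
```

With cuda.compute, one would presumably pass `cc.serialize` and `cc.deserialize` as the two callables and a closure around a `make_*` factory as `build`. The key would then need to encode everything that affects the compiled code (dtypes, operator, target GPU), which this sketch leaves to the caller.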

Implementation

Each algorithm class now inherits from Serializable. This inheritance requires the algorithm class to define a __serialization_schema__ member, which lists the members to serialize and their schema kinds (e.g., ITER, OP, VALUE). Serializable knows how to serialize each of those kinds of objects into bytes.

class _Reduce(Serializable):
    __serialization_schema__ = (
        ("d_in_cccl", ITER),
        ("d_out_cccl", ITER),
        ("op_cccl", OP),
        ("h_init_cccl", VALUE),
        ("build_result", BUILD_RESULT(_bindings.DeviceReduceBuildResult)),
    )
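
To illustrate the general idea of a schema-driven mixin, here is a small pure-Python sketch. It is not the PR's implementation: the `Kind` class, the `INT64` and `TEXT` kinds, the length-prefixed framing, and the `Scale` class are all assumptions made for the example. Only the `__serialization_schema__` attribute name and the `Serializable` class name come from the description above.

```python
import struct


class Kind:
    """A hypothetical schema kind: knows how to encode/decode one member."""

    def __init__(self, encode, decode):
        self.encode = encode
        self.decode = decode


# Two toy kinds for illustration only (not the PR's ITER/OP/VALUE kinds).
INT64 = Kind(lambda v: struct.pack("<q", v), lambda b: struct.unpack("<q", b)[0])
TEXT = Kind(lambda v: v.encode("utf-8"), lambda b: b.decode("utf-8"))


class Serializable:
    """Mixin that serializes the members named in __serialization_schema__."""

    __serialization_schema__ = ()

    def serialize(self) -> bytes:
        out = bytearray()
        for name, kind in self.__serialization_schema__:
            payload = kind.encode(getattr(self, name))
            # Length-prefix each field so the reader knows where it ends.
            out += struct.pack("<I", len(payload)) + payload
        return bytes(out)

    @classmethod
    def deserialize(cls, blob: bytes):
        obj = cls.__new__(cls)  # bypass __init__; fields come from the blob
        offset = 0
        for name, kind in cls.__serialization_schema__:
            (size,) = struct.unpack_from("<I", blob, offset)
            offset += 4
            setattr(obj, name, kind.decode(blob[offset : offset + size]))
            offset += size
        if offset != len(blob):
            raise ValueError("trailing bytes after last schema member")
        return obj


class Scale(Serializable):
    __serialization_schema__ = (("label", TEXT), ("factor", INT64))

    def __init__(self, label, factor):
        self.label = label
        self.factor = factor
```

The appeal of this shape is that each class declares only what to serialize, while the mixin owns the framing, so adding a new algorithm class does not require new serialization code.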

Note that 25298cb also renames "aot" to "serialization" throughout, since ahead-of-time compilation is only one application of the serialization/deserialization capability.

Follow-up work

Extend each make_<algo> API to accept a (list of) compute_capability so that AoT compilation can be done for a different GPU (or with no GPU present at all).
Add documentation for the new APIs

Checklist

New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2026-06-30T15:28:40Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

copy-pr-bot · 2026-06-30T18:12:05Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

shwina · 2026-06-30T22:48:45Z

@coderabbitai full review

coderabbitai · 2026-06-30T22:48:52Z

✅ Action performed

Full review finished.

coderabbitai · 2026-06-30T23:04:16Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9dd4e8d1-e007-4ac2-a856-4f4b06f7e2a2

📥 Commits

Reviewing files that changed from the base of the PR and between 25298cb and c6c2820.

📒 Files selected for processing (2)

c/parallel/src/serialization.cpp
c/parallel/src/util/serialization.h

🚧 Files skipped from review as they are similar to previous changes (2)

c/parallel/src/serialization.cpp
c/parallel/src/util/serialization.h

📝 Walkthrough

Summary by CodeRabbit

New Features
- Added ahead-of-time (AoT) serialize/deserialize support across multiple CUDA compute algorithms, with a public CUDA compute dispatch API.
- Added helpers to load lower-bound/upper-bound from serialized binary search blobs.
- Exposed additional build/iterator metadata (e.g., determinism, alignment, operator kind) to the Python API.
Bug Fixes
- Improved serialization blob validation behavior and error reporting.
Documentation
- Updated serialization buffer free guidance to the serialization API naming.
Tests
- Added new Python round-trip serialization tests for binary search, histogram, merge sort, and radix sort (and related workflows).

Walkthrough

This PR renames the C parallel AoT serialization API to serialization naming, adds blob validation and diagnostics APIs, and introduces a Python schema-driven serialization system with Cython bindings, per-algorithm serialize/deserialize support, public loaders, and round-trip tests.

Changes

C parallel serialization rename

| Layer / File(s) | Summary |
| --- | --- |
| Public header renames and diagnostics API `c/parallel/include/cccl/c/serialization.h`, `serialization_diagnostics.h`, `binary_search.h`, `for.h`, `histogram.h`, `merge_sort.h`, `radix_sort.h`, `reduce.h`, `scan.h`, `segmented_reduce.h`, `segmented_sort.h`, `three_way_partition.h`, `transform.h`, `unique_by_key.h` | Renames `cccl_aot_algo_t`/`cccl_aot_buffer_free` to `cccl_serialization_algo_t`/`cccl_serialization_buffer_free`, updates the documentation comments, and adds `cccl_serialization_last_error` and `cccl_serialization_validate_blob`. |
| Serialization utility and diagnostics implementation `c/parallel/src/serialization.cpp`, `src/util/serialization.h`, `src/util/nvjitlink.h` | Implements buffer-free, last-error, and blob-validation APIs, renames blob magic and header helpers, updates error wording, and changes the nvJitLink input label. |
| Per-algorithm serialize/deserialize source updates `c/parallel/src/*.cu` | Switches each algorithm implementation from `cccl::aot` to `cccl::serialization`, updates blob tags and error strings, and rewrites segmented-sort selector-op reconstruction. |
| C++ test retagging `c/parallel/test/test_*.cpp` | Retags C2H tests from `[aot]` to `[serialization]` and updates reduce test buffer freeing to `cccl_serialization_buffer_free`. |

Python AoT serialization feature

| Layer / File(s) | Summary |
| --- | --- |
| Serde wire format and dispatch `python/cuda_cccl/cuda/compute/_serialization/*` | Defines the blob framing, descriptor codecs, `AlgoTag`, the `Serializable` mixin, and public `serialize`/`deserialize` dispatch entry points. |
| Cython serialization backends and CMake wiring `python/cuda_cccl/CMakeLists.txt`, `_bindings_serialization_v1.pxi`, `_bindings_serialization_v2.pxi` | Generates the serialization pxi, implements v1 blob validation/load wrappers, and provides v2 HostJIT stubs. |
| Typed bindings surface `_bindings.pyi` | Adds `serialize`/`deserialize` typings for build-result classes and `Op.operator_type`/`Iterator.alignment`/`Iterator.dereference_or_assign_op`. |
| Cython build-result wiring `_bindings_impl.pyx`, `__init__.py` | Adds zero-init constructors, new properties, and per-class serialize/deserialize methods; exports `serialize` and `deserialize`. |
| Reduce/Scan/SegmentedReduce serde `_reduce.py`, `_scan.py`, `_segmented_reduce.py` | Adds serialization schemas and computed `device_reduce_fn`/`device_scan_fn` selection. |
| MergeSort/RadixSort/SegmentedSort serde `_sort/*.py` | Adds serialization schemas including radix decomposer state handling. |
| Partition/transform/unique/search/select serde `_three_way_partition.py`, `_transform.py`, `_unique_by_key.py`, `_histogram.py`, `_binary_search.py`, `_select.py`, `algorithms/__init__.py` | Adds serialization schemas and public `load_lower_bound`/`load_upper_bound` loaders. |
| Round-trip tests and benchmark `tests/compute/test_*_serialization.py`, `bench_serialization.py` | Adds round-trip tests for binary search, histogram, merge sort, and radix sort, plus a cold-start benchmark comparing JIT and serialization paths. |

Suggested reviewers: bernhardmgruber

coderabbitai

Actionable comments posted: 7

🧹 Nitpick comments (1)

python/cuda_cccl/tests/compute/test_binary_search_aot.py (1)
25-72: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

suggestion: add a negative test for loading a lower_bound blob through load_upper_bound and vice versa. The current tests cover only valid round trips, so the mode-mismatch validation could regress unnoticed.
with pytest.raises(ValueError, match="mode mismatch"):
    load_upper_bound(lower_bound_blob)

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: aa78048f-6597-48de-a0b2-9ce3b50a930e

📥 Commits

Reviewing files that changed from the base of the PR and between 3d5f235 and 21d8901.

📒 Files selected for processing (30)

python/cuda_cccl/CMakeLists.txt
python/cuda_cccl/cuda/compute/_aot_serde.py
python/cuda_cccl/cuda/compute/_bindings.pyi
python/cuda_cccl/cuda/compute/_bindings_aot_v1.pxi
python/cuda_cccl/cuda/compute/_bindings_aot_v2.pxi
python/cuda_cccl/cuda/compute/_bindings_impl.pyx
python/cuda_cccl/cuda/compute/algorithms/__init__.py
python/cuda_cccl/cuda/compute/algorithms/_binary_search.py
python/cuda_cccl/cuda/compute/algorithms/_histogram.py
python/cuda_cccl/cuda/compute/algorithms/_reduce.py
python/cuda_cccl/cuda/compute/algorithms/_scan.py
python/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.py
python/cuda_cccl/cuda/compute/algorithms/_three_way_partition.py
python/cuda_cccl/cuda/compute/algorithms/_transform.py
python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py
python/cuda_cccl/tests/compute/bench_aot.py
python/cuda_cccl/tests/compute/test_binary_search_aot.py
python/cuda_cccl/tests/compute/test_histogram_aot.py
python/cuda_cccl/tests/compute/test_merge_sort_aot.py
python/cuda_cccl/tests/compute/test_radix_sort_aot.py
python/cuda_cccl/tests/compute/test_reduce_aot.py
python/cuda_cccl/tests/compute/test_scan_aot.py
python/cuda_cccl/tests/compute/test_segmented_reduce_aot.py
python/cuda_cccl/tests/compute/test_segmented_sort_aot.py
python/cuda_cccl/tests/compute/test_three_way_partition_aot.py
python/cuda_cccl/tests/compute/test_transform_aot.py
python/cuda_cccl/tests/compute/test_unique_by_key_aot.py

shwina · 2026-07-01T11:26:11Z

@coderabbitai review

coderabbitai · 2026-07-01T11:26:16Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

python/cuda_cccl/cuda/compute/algorithms/_scan.py (1)

137-150: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

important: Reject non-binary force_inclusive values from the AoT blob.

Line 138 converts any non-zero byte to True, so a malformed blob can silently deserialize an exclusive scan as inclusive and write incorrect results. Read the raw byte, require 0 or 1, then convert to bool.

As per path instructions, python/cuda_cccl/**/*: Focus on Python API stability, CUDA array interoperability, memory ownership, JIT/NVRTC/nvJitLink behavior, package boundaries, user-defined operator correctness, tests, and examples.

Source: Path instructions
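
The strict check suggested for force_inclusive could look roughly like the sketch below. The function name `read_strict_bool` and its signature are hypothetical; the actual field offset and surrounding reader in `_scan.py` are not shown in this thread.

```python
def read_strict_bool(blob: bytes, offset: int) -> bool:
    """Decode one byte as a bool, rejecting anything other than 0 or 1."""
    if offset < 0 or offset >= len(blob):
        raise ValueError(f"offset {offset} is outside the blob")
    raw = blob[offset]  # indexing bytes yields an int in 0..255
    if raw not in (0, 1):
        raise ValueError(f"expected 0 or 1 for a boolean field, got {raw}")
    return raw == 1
```

Compared with `bool(blob[offset])`, this turns a corrupted flag into an explicit error instead of silently selecting the inclusive variant.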

🧹 Nitpick comments (1)

python/cuda_cccl/cuda/compute/_aot/dispatch.py (1)

68-74: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

suggestion: Avoid swallowing internal AttributeErrors from real serializers.

This catches any AttributeError raised inside algorithm.serialize(), masking implementation defects as “not AoT-serializable”. Check getattr(algorithm, "serialize", None) first, validate it is callable, then invoke it outside the AttributeError handler.
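
The difference between the two patterns can be reproduced in isolation. In this sketch, `Broken`, `serialize_catching`, and `serialize_gated` are hypothetical stand-ins, not the code in `dispatch.py`:

```python
class Broken:
    """Has serialize(), but it hits an internal AttributeError (a real bug)."""

    def serialize(self) -> bytes:
        return self.missing_field  # raises AttributeError


def serialize_catching(obj) -> bytes:
    # Pattern under review: the handler also catches errors raised *inside*
    # serialize(), so the real bug is reported as "not serializable".
    try:
        return obj.serialize()
    except AttributeError as e:
        raise TypeError(f"{type(obj).__name__} is not serializable") from e


def serialize_gated(obj) -> bytes:
    # Suggested pattern: check for the method first, then call it outside
    # any handler so internal errors propagate unchanged.
    method = getattr(type(obj), "serialize", None)
    if not callable(method):
        raise TypeError(f"{type(obj).__name__} is not serializable")
    return obj.serialize()
```

Calling `serialize_catching(Broken())` raises TypeError and hides the defect, while `serialize_gated(Broken())` lets the original AttributeError through; both raise TypeError for an object with no `serialize` method at all.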

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e30eaece-e301-4ed7-9f7e-656daf4d2786

📥 Commits

Reviewing files that changed from the base of the PR and between 21d8901 and 6d67e53.

📒 Files selected for processing (21)

python/cuda_cccl/cuda/compute/__init__.py
python/cuda_cccl/cuda/compute/_aot/__init__.py
python/cuda_cccl/cuda/compute/_aot/dispatch.py
python/cuda_cccl/cuda/compute/_aot/serde.py
python/cuda_cccl/cuda/compute/_bindings_aot_v1.pxi
python/cuda_cccl/cuda/compute/algorithms/__init__.py
python/cuda_cccl/cuda/compute/algorithms/_binary_search.py
python/cuda_cccl/cuda/compute/algorithms/_histogram.py
python/cuda_cccl/cuda/compute/algorithms/_reduce.py
python/cuda_cccl/cuda/compute/algorithms/_scan.py
python/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.py
python/cuda_cccl/cuda/compute/algorithms/_three_way_partition.py
python/cuda_cccl/cuda/compute/algorithms/_transform.py
python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py
python/cuda_cccl/tests/compute/bench_aot.py
python/cuda_cccl/tests/compute/test_merge_sort_aot.py
python/cuda_cccl/tests/compute/test_reduce_aot.py
python/cuda_cccl/tests/compute/test_scan_aot.py

✅ Files skipped from review due to trivial changes (1)

python/cuda_cccl/cuda/compute/_aot/__init__.py

🚧 Files skipped from review as they are similar to previous changes (16)

python/cuda_cccl/tests/compute/test_merge_sort_aot.py
python/cuda_cccl/cuda/compute/algorithms/__init__.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py
python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py
python/cuda_cccl/cuda/compute/algorithms/_transform.py
python/cuda_cccl/cuda/compute/algorithms/_three_way_partition.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.py
python/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.py
python/cuda_cccl/cuda/compute/algorithms/_binary_search.py
python/cuda_cccl/cuda/compute/algorithms/_histogram.py
python/cuda_cccl/tests/compute/bench_aot.py
python/cuda_cccl/cuda/compute/algorithms/_reduce.py
python/cuda_cccl/tests/compute/test_reduce_aot.py
python/cuda_cccl/tests/compute/test_scan_aot.py
python/cuda_cccl/cuda/compute/_bindings_aot_v1.pxi

shwina · 2026-07-02T19:42:26Z

@coderabbitai full review

coderabbitai · 2026-07-02T19:42:32Z

✅ Action performed

Full review finished.

shwina · 2026-07-02T19:42:43Z

/ok to test b0f4db3

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (3)

c/parallel/src/aot.cpp (1)

18-18: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

suggestion: Switch this include to angle brackets
c/parallel/src/aot.cpp:18 still uses quotes; the repo rule for headers is angle-bracket includes.

Source: Coding guidelines

python/cuda_cccl/cuda/compute/algorithms/_scan.py (1)

136-152: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

suggestion: the "exclusive scan with no init value" error now only surfaces on first __call__ (property access), whereas previously (per the summary, "relying on a pre-assigned instance attribute") this was presumably validated eagerly at construction. A caller building an invalid scan via make_exclusive_scan(..., init_value=None) will no longer fail immediately — the error only appears when the scan actually executes. Worth forcing eager evaluation once at the end of __init__ to keep the fail-fast contract:
🔧 Proposed fix
         self.force_inclusive = force_inclusive
 
         # Compile the op with value types
         self.op_cccl = op.compile((value_type, value_type), value_type)
 
         self.build_result = call_build(
             _bindings.DeviceScanBuildResult,
             self.d_in_cccl,
             self.d_out_cccl,
             self.op_cccl,
             init_value_type_info,
             force_inclusive,
             self.init_kind,
         )
+
+        # Trigger validation eagerly so invalid combinations fail at build time
+        # rather than on first execution.
+        _ = self.device_scan_fn
Source: Path instructions

python/cuda_cccl/tests/compute/test_aot_diagnostics.py (1)

27-34: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

suggestion: dedupe the USING_V2 skip boilerplate.

The try/except ImportError + pytestmark = pytest.mark.skipif(USING_V2, ...) block is repeated verbatim in every AOT test file in this cohort (test_binary_search_aot.py, test_histogram_aot.py, test_merge_sort_aot.py, test_transform_aot.py, etc.). Move it to a shared conftest.py fixture/marker to avoid maintaining N copies.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7ad3d8ea-e521-48fc-8724-056811a2cc09

📥 Commits

Reviewing files that changed from the base of the PR and between 64b6b0e and b0f4db3.

📒 Files selected for processing (39)

c/parallel/include/cccl/c/aot_diagnostics.h
c/parallel/src/aot.cpp
python/cuda_cccl/CMakeLists.txt
python/cuda_cccl/cuda/compute/__init__.py
python/cuda_cccl/cuda/compute/_aot/__init__.py
python/cuda_cccl/cuda/compute/_aot/dispatch.py
python/cuda_cccl/cuda/compute/_aot/serde.py
python/cuda_cccl/cuda/compute/_aot/serializable.py
python/cuda_cccl/cuda/compute/_bindings.pyi
python/cuda_cccl/cuda/compute/_bindings_aot_v1.pxi
python/cuda_cccl/cuda/compute/_bindings_aot_v2.pxi
python/cuda_cccl/cuda/compute/_bindings_impl.pyx
python/cuda_cccl/cuda/compute/algorithms/__init__.py
python/cuda_cccl/cuda/compute/algorithms/_binary_search.py
python/cuda_cccl/cuda/compute/algorithms/_histogram.py
python/cuda_cccl/cuda/compute/algorithms/_reduce.py
python/cuda_cccl/cuda/compute/algorithms/_scan.py
python/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.py
python/cuda_cccl/cuda/compute/algorithms/_select.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.py
python/cuda_cccl/cuda/compute/algorithms/_three_way_partition.py
python/cuda_cccl/cuda/compute/algorithms/_transform.py
python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py
python/cuda_cccl/tests/compute/bench_aot.py
python/cuda_cccl/tests/compute/test_aot_diagnostics.py
python/cuda_cccl/tests/compute/test_binary_search_aot.py
python/cuda_cccl/tests/compute/test_histogram_aot.py
python/cuda_cccl/tests/compute/test_merge_sort_aot.py
python/cuda_cccl/tests/compute/test_radix_sort_aot.py
python/cuda_cccl/tests/compute/test_reduce_aot.py
python/cuda_cccl/tests/compute/test_scan_aot.py
python/cuda_cccl/tests/compute/test_segmented_reduce_aot.py
python/cuda_cccl/tests/compute/test_segmented_sort_aot.py
python/cuda_cccl/tests/compute/test_select_aot.py
python/cuda_cccl/tests/compute/test_three_way_partition_aot.py
python/cuda_cccl/tests/compute/test_transform_aot.py
python/cuda_cccl/tests/compute/test_unique_by_key_aot.py

coderabbitai · 2026-07-02T20:01:11Z

+    try:
+        return algorithm.serialize()
+    except AttributeError as e:
+        raise TypeError(
+            f"{type(algorithm).__name__} is not an AoT-serializable algorithm "
+            "(expected an object from a make_* factory)."
+        ) from e


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

important: the except AttributeError wraps the whole call, so an AttributeError raised inside a valid serialize() (real bug, missing attr on a partially-built object) is swallowed and reported as "not an AoT-serializable algorithm", masking the true failure. Gate on the attribute's presence instead of catching from the call.

-    try:
-        return algorithm.serialize()
-    except AttributeError as e:
-        raise TypeError(
-            f"{type(algorithm).__name__} is not an AoT-serializable algorithm "
-            "(expected an object from a make_* factory)."
-        ) from e
+    serialize_method = getattr(type(algorithm), "serialize", None)
+    if not callable(serialize_method):
+        raise TypeError(
+            f"{type(algorithm).__name__} is not an AoT-serializable algorithm "
+            "(expected an object from a make_* factory)."
+        )
+    return algorithm.serialize()

coderabbitai · 2026-07-02T20:01:11Z

+    __serde_schema__ = (
+        ("d_in_keys_cccl", ITER),
+        ("d_in_values_cccl", ITER),
+        ("d_out_keys_cccl", ITER),
+        ("d_out_values_cccl", ITER),
+        ("op_cccl", OP),
+        ("build_result", BUILD_RESULT(_bindings.DeviceMergeSortBuildResult)),
+    )
+


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check ITER's handling of None iterators in the AoT serde schema.
rg -n -A15 'def ITER' python/cuda_cccl/cuda/compute/_aot/serializable.py
rg -n 'CONDITIONAL' python/cuda_cccl/cuda/compute/_aot/serializable.py python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py python/cuda_cccl/cuda/compute/algorithms/*.py

Repository: NVIDIA/cccl

Length of output: 680

🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "== serializable.py outline =="
ast-grep outline python/cuda_cccl/cuda/compute/_aot/serializable.py --view expanded || true
echo "== merge_sort.py outline =="
ast-grep outline python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py --view expanded || true
echo "== relevant serializable.py lines =="
sed -n '1,340p' python/cuda_cccl/cuda/compute/_aot/serializable.py | cat -n | sed -n '1,340p'
echo "== relevant merge_sort.py lines =="
sed -n '1,220p' python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py | cat -n | sed -n '1,220p'
echo "== search for None/optional handling in compute algorithms =="
rg -n "None|OPTIONAL|CONDITIONAL|ITER\(" python/cuda_cccl/cuda/compute -g '*.py'

Repository: NVIDIA/cccl

Length of output: 50367

🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "== serializable.py: ITER, CONDITIONAL, and serde core =="
sed -n '1,280p' python/cuda_cccl/cuda/compute/_aot/serializable.py | cat -n | sed -n '1,280p'
echo "== merge_sort.py: schema and init =="
sed -n '1,180p' python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py | cat -n | sed -n '1,180p'

Repository: NVIDIA/cccl

Length of output: 17471

🏁 Script executed:

#!/bin/bash
set -euo pipefail
sed -n '1,280p' python/cuda_cccl/cuda/compute/_aot/serde.py | cat -n | sed -n '1,280p'

Repository: NVIDIA/cccl

Length of output: 8936

important: python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py:35-43 uses plain ITER for d_in_values_cccl/d_out_values_cccl, but _MergeSort.__init__ allows both to be None. That makes keys-only merge sort fail to serialize; wrap the value iterators in a conditional schema or another optional wrapper.

Source: Path instructions

This one is a false positive — keys-only merge sort serializes and round-trips correctly. When d_in_values / d_out_values are None, to_cccl_input_iter(None) / to_cccl_output_iter(None) return a valid "none" POINTER iterator (uint8, state=None), which the plain ITER schema member writes and reads back without issue — so no conditional/optional wrapper is needed.

Verified end-to-end (build → serialize → deserialize → run → correctly sorted) and added a test_serialize_deserialize_merge_sort_keys_only regression test (0a8e15b).

@shwina Understood — confirmed. Since to_cccl_input_iter(None)/to_cccl_output_iter(None) produce a real "none" POINTER iterator object (not a Python None), the plain ITER schema member serializes/deserializes it like any other iterator, so no conditional wrapper is needed. Thanks for adding the regression test.

✏️ Learnings added

Learnt from: shwina
Repo: NVIDIA/cccl
PR: 9644
File: python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py:35-43
Timestamp: 2026-07-02T22:18:14.092Z
Learning: In `python/cuda_cccl/cuda/compute` AoT serialization (`_aot/serializable.py`, `_aot/serde.py`), optional iterator arguments like `d_in_values`/`d_out_values` in algorithms such as `_MergeSort` (`python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py`) do not need `CONDITIONAL`/optional schema wrappers. `to_cccl_input_iter(None)`/`to_cccl_output_iter(None)` (in `_cccl_interop.py`) return a valid "none" `POINTER` iterator (uint8 type, `state=None`) rather than Python `None`, so the plain `ITER` schema kind already serializes/deserializes these "absent" values correctly.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

shwina · 2026-07-03T11:33:53Z

/ok to test 0a8e15b

github-actions · 2026-07-03T15:43:29Z

🥳 CI Workflow Results

🟩 Finished in 4h 07m: Pass: 100%/62 | Total: 17h 37m | Max: 1h 08m | Hits: 100%/1467

See results here.

…gorithms

Exposes the C serialize/deserialize layer through the Python bindings so a built algorithm can be persisted and reloaded without any JIT step:

* algo.serialize() -> bytes - return the blob directly
* AlgoClass.deserialize(blob, ...) -> AlgoClass - reconstruct from bytes
* _bindings_impl.pyx / _bindings.pyi - serialize()/deserialize() on each Device*BuildResult, plus cccl_aot_buffer_free and the new externs.
* algorithms/_*.py - serialize()/deserialize() on each algorithm class.

Adds tests/compute/test_<algo>_aot.py round-trip coverage for every algorithm and an AoT-vs-JIT benchmark (bench_aot.py).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add `_OpAdapter.compile_for_load()` that returns a minimal Op stub (empty LTOIR, correct type/state_alignment) without triggering numba-cuda JIT compilation. The compiled CUBIN is already embedded in the AoT blob; only op.type and op.state are read at execute time, so LTOIR is not needed on the load path. _StatelessOp and _StatefulOp override to skip JIT entirely. Well-known ops and RawOp fall through to compile() since they are already JIT-free. All eight deserialize() methods in the algorithm layer are updated to call compile_for_load() instead of compile(), reducing deserialize latency from ~1s to ~0ms for Python-callable operators. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Move all AoT C extern declarations and serialize/deserialize implementations into backend-conditional .pxi files. v1 gets the real implementation; v2 gets stubs that raise NotImplementedError. CMake selects the right pxi the same way it handles the other backend-conditional files (segmented_reduce_backend, etc.). AoT tests skip automatically on v2 via pytestmark + USING_V2. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The serialize/deserialize path had no guard against loading a blob that was saved for the opposite binary_search mode: an upper_bound artifact could silently be deserialized through load_lower_bound and vice versa.

Fix:
- _BinarySearch stores its mode in a new slot.
- serialize() prepends a 4-byte mode tag (b"LBND" / b"UBND") before the C-level blob.
- _BinarySearch._deserialize() reads and validates the tag; raises ValueError if the blob mode doesn't match the expected_mode.
- Add module-level load_lower_bound() / load_upper_bound() that call _deserialize with the appropriate expected_mode, making the API symmetrical with make_lower_bound / make_upper_bound.
- Export load_lower_bound / load_upper_bound from algorithms/__init__.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
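
The mode-tag guard described in this commit message can be sketched in a few lines. The tag values b"LBND" and b"UBND" come from the message; the helper names `tag_blob` and `untag_blob` are hypothetical, and the payload is treated as opaque bytes rather than a real C-level blob.

```python
LOWER_BOUND_TAG = b"LBND"
UPPER_BOUND_TAG = b"UBND"


def tag_blob(mode_tag: bytes, payload: bytes) -> bytes:
    """Prepend the 4-byte mode tag to an opaque payload."""
    if mode_tag not in (LOWER_BOUND_TAG, UPPER_BOUND_TAG):
        raise ValueError(f"unknown mode tag {mode_tag!r}")
    return mode_tag + payload


def untag_blob(blob: bytes, expected_tag: bytes) -> bytes:
    """Validate the leading tag and return the payload that follows it."""
    if len(blob) < 4:
        raise ValueError("blob too short to contain a mode tag")
    found = blob[:4]
    if found != expected_tag:
        raise ValueError(
            f"mode mismatch: expected {expected_tag!r}, found {found!r}"
        )
    return blob[4:]
```

A loader for one mode would call `untag_blob` with its own tag before handing the payload to the lower layer, so an opposite-mode blob fails early with a ValueError instead of producing wrong search results.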

…ization `Device<algo>` `deserialize()` now takes only the blob: the iterator, operator, and value descriptors are written into a self-describing sidecar (`cuda/compute/_aot_serde.py`) prepended to the C `build_result`, and rebuilt on load with no objects supplied. Custom iterators are fully supported — their device LTOIR round-trips — and no JIT runs on the load path. The operator's real device code is serialized (exactly what a normal `__call__` passes to execute), which supersedes the earlier placeholder-op approach; the now-unused `compile_for_load()` helper is removed. Adds `Op.operator_type` and `Iterator.alignment` getters used by the sidecar. Blob format is versioned (`CCAOTPY1`) and tagged per algorithm; binary_search folds its search mode into the sidecar so `load_lower_bound`/`load_upper_bound` stay object-free and reject opposite-mode blobs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Deserializing an AoT blob could fail with only an opaque CUDA error code while the actual reason was printed to stdout by the C layer and discarded. The two worst cases were an ABI/format mismatch and loading a CUBIN built for a different GPU architecture, both of which surfaced with no actionable message.

Add an AoT blob validator to the C layer and route the deserialize path through it:

* New cccl_aot_validate_blob() checks the blob magic, format version, and CCCL C parallel ABI version, and, for CUBIN payloads, that the target compute-capability major matches the device the blob would load on (falling back to the default device when there is no current context, via idempotent cuInit, so a bare deserialize is still checked). It runs before cuLibraryLoadData, turning a deep opaque failure into an early, clear one.
* New cccl_aot_last_error() exposes the descriptive message for callers to surface instead of a bare CUDA error code.

Both are declared in a standalone header (cccl/c/aot_diagnostics.h) rather than aot.h so they don't trigger a rebuild of every algorithm translation unit. The v1 AoT bindings now validate through cccl_aot_validate_blob() in every cccl_device_<algo>_deserialize and report cccl_aot_last_error() on failure.

Adds tests/compute/test_aot_diagnostics.py (ABI mismatch, wrong compute-capability major, corrupt magic, and a valid-blob round trip). The CC case is exercised single-GPU by patching the blob's cc field; a real cross-GPU load is the same code path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
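
The order of checks described in this commit message (magic, format version, ABI version, compute-capability major) can be mirrored in a small Python sketch. The header layout, the `DEMOBLOB` magic, and the `validate_blob` signature below are assumptions for illustration; the real validator lives in the C layer and its blob layout is not given in this thread.

```python
import struct

# Assumed layout for illustration: 8-byte magic, then format version, ABI
# version, and compute-capability major as little-endian uint32 fields.
HEADER = struct.Struct("<8sIII")
MAGIC = b"DEMOBLOB"  # placeholder value, not the real blob magic


def validate_blob(blob, *, format_version, abi_version, device_cc_major):
    """Raise ValueError naming the first header field that does not match."""
    if len(blob) < HEADER.size:
        raise ValueError(f"blob has {len(blob)} bytes, header needs {HEADER.size}")
    magic, fmt, abi, cc_major = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise ValueError(f"corrupt magic: {magic!r}")
    if fmt != format_version:
        raise ValueError(f"format version {fmt} != supported {format_version}")
    if abi != abi_version:
        raise ValueError(f"ABI version {abi} != library ABI {abi_version}")
    if cc_major != device_cc_major:
        raise ValueError(
            f"blob targets compute capability {cc_major}.x, "
            f"device is {device_cc_major}.x"
        )
```

Running a check like this before attempting to load the payload is what turns a late, opaque loader failure into an early error that names the mismatched field. Patching a single header field, as the diagnostics tests reportedly do for the cc value, is enough to exercise each branch on one machine.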

coderabbitai

Actionable comments posted: 1

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2b88c99f-4dc5-4ef6-9105-533e43c91be7

📥 Commits

Reviewing files that changed from the base of the PR and between b0f4db3 and 25298cb.

📒 Files selected for processing (70)

c/parallel/include/cccl/c/binary_search.h
c/parallel/include/cccl/c/for.h
c/parallel/include/cccl/c/histogram.h
c/parallel/include/cccl/c/merge_sort.h
c/parallel/include/cccl/c/radix_sort.h
c/parallel/include/cccl/c/reduce.h
c/parallel/include/cccl/c/scan.h
c/parallel/include/cccl/c/segmented_reduce.h
c/parallel/include/cccl/c/segmented_sort.h
c/parallel/include/cccl/c/serialization.h
c/parallel/include/cccl/c/serialization_diagnostics.h
c/parallel/include/cccl/c/three_way_partition.h
c/parallel/include/cccl/c/transform.h
c/parallel/include/cccl/c/unique_by_key.h
c/parallel/src/aot.cpp
c/parallel/src/binary_search.cu
c/parallel/src/for.cu
c/parallel/src/histogram.cu
c/parallel/src/merge_sort.cu
c/parallel/src/radix_sort.cu
c/parallel/src/reduce.cu
c/parallel/src/scan.cu
c/parallel/src/segmented_reduce.cu
c/parallel/src/segmented_sort.cu
c/parallel/src/serialization.cpp
c/parallel/src/three_way_partition.cu
c/parallel/src/transform.cu
c/parallel/src/unique_by_key.cu
c/parallel/src/util/nvjitlink.h
c/parallel/src/util/serialization.h
c/parallel/test/test_binary_search.cpp
c/parallel/test/test_for.cpp
c/parallel/test/test_histogram.cpp
c/parallel/test/test_merge_sort.cpp
c/parallel/test/test_radix_sort.cpp
c/parallel/test/test_reduce.cpp
c/parallel/test/test_scan.cpp
c/parallel/test/test_segmented_reduce.cpp
c/parallel/test/test_segmented_sort.cpp
c/parallel/test/test_three_way_partition.cpp
c/parallel/test/test_transform.cpp
c/parallel/test/test_unique_by_key.cpp
python/cuda_cccl/CMakeLists.txt
python/cuda_cccl/cuda/compute/__init__.py
python/cuda_cccl/cuda/compute/_bindings.pyi
python/cuda_cccl/cuda/compute/_bindings_impl.pyx
python/cuda_cccl/cuda/compute/_bindings_serialization_v1.pxi
python/cuda_cccl/cuda/compute/_bindings_serialization_v2.pxi
python/cuda_cccl/cuda/compute/_serialization/__init__.py
python/cuda_cccl/cuda/compute/_serialization/codec.py
python/cuda_cccl/cuda/compute/_serialization/dispatch.py
python/cuda_cccl/cuda/compute/_serialization/serializable.py
python/cuda_cccl/cuda/compute/algorithms/__init__.py
python/cuda_cccl/cuda/compute/algorithms/_binary_search.py
python/cuda_cccl/cuda/compute/algorithms/_histogram.py
python/cuda_cccl/cuda/compute/algorithms/_reduce.py
python/cuda_cccl/cuda/compute/algorithms/_scan.py
python/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.py
python/cuda_cccl/cuda/compute/algorithms/_select.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.py
python/cuda_cccl/cuda/compute/algorithms/_three_way_partition.py
python/cuda_cccl/cuda/compute/algorithms/_transform.py
python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py
python/cuda_cccl/tests/compute/bench_serialization.py
python/cuda_cccl/tests/compute/test_binary_search_serialization.py
python/cuda_cccl/tests/compute/test_histogram_serialization.py
python/cuda_cccl/tests/compute/test_merge_sort_serialization.py
python/cuda_cccl/tests/compute/test_radix_sort_serialization.py

💤 Files with no reviewable changes (29)

python/cuda_cccl/CMakeLists.txt
python/cuda_cccl/cuda/compute/_serialization/dispatch.py
python/cuda_cccl/cuda/compute/_serialization/__init__.py
python/cuda_cccl/tests/compute/test_radix_sort_serialization.py
python/cuda_cccl/cuda/compute/_bindings_serialization_v2.pxi
c/parallel/src/aot.cpp
python/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.py
python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.py
python/cuda_cccl/cuda/compute/algorithms/_three_way_partition.py
python/cuda_cccl/cuda/compute/algorithms/_histogram.py
python/cuda_cccl/cuda/compute/algorithms/_reduce.py
python/cuda_cccl/cuda/compute/__init__.py
python/cuda_cccl/cuda/compute/algorithms/__init__.py
python/cuda_cccl/tests/compute/bench_serialization.py
python/cuda_cccl/tests/compute/test_histogram_serialization.py
python/cuda_cccl/cuda/compute/algorithms/_select.py
python/cuda_cccl/cuda/compute/algorithms/_transform.py
python/cuda_cccl/tests/compute/test_merge_sort_serialization.py
python/cuda_cccl/tests/compute/test_binary_search_serialization.py
python/cuda_cccl/cuda/compute/_serialization/serializable.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py
python/cuda_cccl/cuda/compute/_serialization/codec.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.py
python/cuda_cccl/cuda/compute/algorithms/_binary_search.py
python/cuda_cccl/cuda/compute/algorithms/_scan.py
python/cuda_cccl/cuda/compute/_bindings.pyi
python/cuda_cccl/cuda/compute/_bindings_serialization_v1.pxi
python/cuda_cccl/cuda/compute/_bindings_impl.pyx

✅ Files skipped from review due to trivial changes (24)

c/parallel/include/cccl/c/three_way_partition.h
c/parallel/src/util/nvjitlink.h
c/parallel/include/cccl/c/transform.h
c/parallel/include/cccl/c/reduce.h
c/parallel/include/cccl/c/scan.h
c/parallel/include/cccl/c/segmented_sort.h
c/parallel/include/cccl/c/radix_sort.h
c/parallel/test/test_histogram.cpp
c/parallel/test/test_segmented_sort.cpp
c/parallel/include/cccl/c/histogram.h
c/parallel/include/cccl/c/merge_sort.h
c/parallel/test/test_binary_search.cpp
c/parallel/include/cccl/c/segmented_reduce.h
c/parallel/include/cccl/c/for.h
c/parallel/test/test_transform.cpp
c/parallel/test/test_segmented_reduce.cpp
c/parallel/include/cccl/c/binary_search.h
c/parallel/test/test_radix_sort.cpp
c/parallel/test/test_three_way_partition.cpp
c/parallel/test/test_unique_by_key.cpp
c/parallel/include/cccl/c/unique_by_key.h
c/parallel/test/test_merge_sort.cpp
c/parallel/test/test_scan.cpp
c/parallel/test/test_for.cpp

github-project-automation Bot added this to CCCL Jun 30, 2026

github-project-automation Bot moved this to Todo in CCCL Jun 30, 2026

cccl-authenticator-app Bot moved this from Todo to In Progress in CCCL Jun 30, 2026

shwina force-pushed the pr/aot-cuda-compute branch from cb811c3 to 18f9776 Compare June 30, 2026 18:12

shwina force-pushed the pr/aot-cuda-compute branch from 18f9776 to 21d8901 Compare June 30, 2026 18:17

coderabbitai Bot reviewed Jun 30, 2026

coderabbitai Bot reviewed Jul 1, 2026

shwina force-pushed the pr/aot-cuda-compute branch 2 times, most recently from 1f601fe to 1336f31 Compare July 2, 2026 11:16

coderabbitai Bot reviewed Jul 2, 2026

shwina force-pushed the pr/aot-cuda-compute branch from b0f4db3 to 0a8e15b Compare July 2, 2026 22:17

shwina and others added 4 commits July 3, 2026 11:55

shwina and others added 9 commits July 3, 2026 11:55

Update

57c8c21

Addressing coderabbit review

526fef4

Change tests to use public APIs

14ab428

Simplify with Serializable base class

50ba166

Don't reference post-load

2a55b9f

Address coderabbit review

e72cb07

Rename aot -> serialization

25298cb

shwina force-pushed the pr/aot-cuda-compute branch from 0a8e15b to 25298cb Compare July 3, 2026 15:55

shwina marked this pull request as ready for review July 3, 2026 16:08

shwina requested review from a team as code owners July 3, 2026 16:08

shwina requested review from bernhardmgruber and oleksandr-pavlyk July 3, 2026 16:08

cccl-authenticator-app Bot moved this from In Progress to In Review in CCCL Jul 3, 2026

coderabbitai Bot reviewed Jul 3, 2026

Comment thread c/parallel/src/serialization.cpp

Address coderabbit review

c6c2820

Conversation

shwina commented Jun 30, 2026

Description

Implementation

Follow-up work

Checklist

copy-pr-bot Bot commented Jun 30, 2026

copy-pr-bot Bot commented Jun 30, 2026

shwina commented Jun 30, 2026

coderabbitai Bot commented Jun 30, 2026

coderabbitai Bot commented Jun 30, 2026

Reviews paused

Summary by CodeRabbit

Walkthrough

Changes

coderabbitai Bot left a comment

shwina commented Jul 1, 2026

coderabbitai Bot commented Jul 1, 2026

coderabbitai Bot left a comment

shwina commented Jul 2, 2026

coderabbitai Bot commented Jul 2, 2026

shwina commented Jul 2, 2026

coderabbitai Bot left a comment

coderabbitai Bot Jul 2, 2026

coderabbitai Bot Jul 2, 2026

shwina Jul 2, 2026

coderabbitai Bot Jul 2, 2026

shwina commented Jul 3, 2026

github-actions Bot commented Jul 3, 2026

🥳 CI Workflow Results

🟩 Finished in 4h 07m: Pass: 100%/62 | Total: 17h 37m | Max: 1h 08m | Hits: 100%/1467

coderabbitai Bot left a comment
