[cuda.compute] Expose .serialize() and .deserialize() methods in Python#9644

Open
shwina wants to merge 14 commits into
NVIDIA:mainfrom
shwina:pr/aot-cuda-compute

Conversation

@shwina

@shwina shwina commented Jun 30, 2026

Contributor

Description

This PR introduces two Python functions, cuda.compute.serialize and cuda.compute.deserialize. These are relatively low-level functions that can be used to implement things like:

  • Ahead-of-time compilation workflows
  • Disk cache
  • Cross-node communication of algorithm objects

The following example compiles an algorithm object and writes it to disk, then reads it back and executes the algorithm in a subsequent Python process:

import cuda.compute as cc, cupy as cp, numpy as np

d_in = cp.empty(1)
d_out = cp.empty(1)
op = lambda x: 2 * x
transformer = cc.make_unary_transform(d_in=d_in, d_out=d_out, op=op)
with open("transform.cclb", "wb") as f:
    f.write(cc.serialize(transformer))

# In a subsequent Python process:
import cuda.compute as cc, cupy as cp, numpy as np

with open("transform.cclb", "rb") as f:
    transformer = cc.deserialize(f.read())

d_in = cp.asarray([1., 2, 3])
d_out = cp.empty_like(d_in)
op = lambda x: 2 * x  # doesn't matter what this unary op actually is
transformer(d_in=d_in, d_out=d_out, op=op, num_items=len(d_in))
cp.testing.assert_allclose(d_out, cp.asarray([2., 4, 6]))
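
The disk-cache use case listed above could be layered on top of these two functions roughly as follows. This is a minimal sketch, not part of the PR: `load_or_build`, its parameters, and the cache layout are hypothetical, and the serialize/deserialize callables are passed in so the helper does not depend on cuda.compute directly.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_build(
    path: Path,
    build: Callable[[], T],
    serialize: Callable[[T], bytes],
    deserialize: Callable[[bytes], T],
) -> T:
    """Return the cached object at `path`, building and caching it on a miss.

    Hypothetical helper: with this PR one would presumably pass
    cc.serialize and cc.deserialize as the last two arguments.
    """
    if path.exists():
        return deserialize(path.read_bytes())

    obj = build()
    blob = serialize(obj)

    # Write to a temporary file in the same directory, then rename it into
    # place, so a concurrent reader never sees a partially written blob.
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(blob)
        os.replace(tmp_name, path)
    except BaseException:
        os.unlink(tmp_name)
        raise
    return obj
```

Under these assumptions, a call might look like `load_or_build(Path("transform.cclb"), lambda: cc.make_unary_transform(d_in=d_in, d_out=d_out, op=op), cc.serialize, cc.deserialize)`. A real cache would also need a key that captures the dtypes, the operator, and the target GPU, which this sketch leaves out.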

Implementation

Each algorithm class now inherits from Serializable. This inheritance requires the algorithm class to define a __serialization_schema__ member, which lists the members to serialize and their schema kinds (e.g., ITER, OP, VALUE). Serializable knows how to serialize each of those kinds of objects into bytes.

class _Reduce(Serializable):
    __serialization_schema__ = (
        ("d_in_cccl", ITER),
        ("d_out_cccl", ITER),
        ("op_cccl", OP),
        ("h_init_cccl", VALUE),
        ("build_result", BUILD_RESULT(_bindings.DeviceReduceBuildResult)),
    )
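
To illustrate the schema-driven idea, here is a toy sketch of how a mixin can walk such a schema. It is not the PR's Serializable: `Kind`, `INT`, `TEXT`, `ToySerializable`, and the length-prefixed framing are all assumptions made for illustration, and the real kinds encode CCCL descriptors rather than Python ints and strings.

```python
import struct
from typing import Any, Callable, NamedTuple


class Kind(NamedTuple):
    """Hypothetical schema kind: a pair of encode/decode callables."""

    encode: Callable[[Any], bytes]
    decode: Callable[[bytes], Any]


# Stand-ins for kinds such as ITER, OP, or VALUE.
INT = Kind(lambda v: struct.pack("<q", v), lambda b: struct.unpack("<q", b)[0])
TEXT = Kind(lambda v: v.encode("utf-8"), lambda b: b.decode("utf-8"))


class ToySerializable:
    """Toy mixin that serializes the members named in the class schema."""

    __serialization_schema__: tuple = ()

    def serialize(self) -> bytes:
        out = bytearray()
        for name, kind in self.__serialization_schema__:
            payload = kind.encode(getattr(self, name))
            # Length-prefix each field so the reader knows where it ends.
            out += struct.pack("<I", len(payload)) + payload
        return bytes(out)

    @classmethod
    def deserialize(cls, blob: bytes):
        obj = cls.__new__(cls)  # bypass __init__, so nothing is rebuilt
        offset = 0
        for name, kind in cls.__serialization_schema__:
            (size,) = struct.unpack_from("<I", blob, offset)
            offset += 4
            field = blob[offset : offset + size]
            if len(field) != size:
                raise ValueError(f"truncated blob while reading {name!r}")
            setattr(obj, name, kind.decode(field))
            offset += size
        if offset != len(blob):
            raise ValueError("trailing bytes after last schema member")
        return obj


class ToyReduce(ToySerializable):
    __serialization_schema__ = (("num_items", INT), ("op_name", TEXT))

    def __init__(self, num_items: int, op_name: str):
        self.num_items = num_items
        self.op_name = op_name
```

The point of the design is that each algorithm class only declares what to persist; the framing and per-kind encoding live in one place.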

Note that 25298cb also renames "aot" to "serialization" throughout, since ahead-of-time compilation is only one application of the serialization/deserialization capability.

Follow-up work

  • Extend each make_<algo> API to accept a (list of) compute_capability so that AoT compilation can be done for a different GPU (or with no GPU present at all).
  • Add documentation for the new APIs

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot

copy-pr-bot Bot commented Jun 30, 2026

Contributor

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 30, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Progress in CCCL Jun 30, 2026
@shwina shwina force-pushed the pr/aot-cuda-compute branch from cb811c3 to 18f9776 Compare June 30, 2026 18:12
@copy-pr-bot

copy-pr-bot Bot commented Jun 30, 2026

Contributor

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@shwina shwina force-pushed the pr/aot-cuda-compute branch from 18f9776 to 21d8901 Compare June 30, 2026 18:17
@shwina

shwina commented Jun 30, 2026

Contributor Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Contributor
✅ Action performed

Full review finished.

@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9dd4e8d1-e007-4ac2-a856-4f4b06f7e2a2

📥 Commits

Reviewing files that changed from the base of the PR and between 25298cb and c6c2820.

📒 Files selected for processing (2)
  • c/parallel/src/serialization.cpp
  • c/parallel/src/util/serialization.h
🚧 Files skipped from review as they are similar to previous changes (2)
  • c/parallel/src/serialization.cpp
  • c/parallel/src/util/serialization.h

📝 Walkthrough

Summary by CodeRabbit

  • New Features
    • Added AOT-style ahead-of-time serialize/deserialize support across multiple CUDA compute algorithms, with a public CUDA compute dispatch API.
    • Added helpers to load lower-bound/upper-bound from serialized binary search blobs.
    • Exposed additional build/iterator metadata (e.g., determinism, alignment, operator kind) to the Python API.
  • Bug Fixes
    • Improved serialization blob validation behavior and error reporting.
  • Documentation
    • Updated serialization buffer free guidance to the serialization API naming.
  • Tests
    • Added new Python round-trip serialization tests for binary search, histogram, merge sort, and radix sort (and related workflows).

Walkthrough

This PR renames the C parallel AoT serialization API to serialization naming, adds blob validation and diagnostics APIs, and introduces a Python schema-driven serialization system with Cython bindings, per-algorithm serialize/deserialize support, public loaders, and round-trip tests.

Changes

C parallel serialization rename

| Layer / File(s) | Summary |
|---|---|
| Public header renames and diagnostics API: c/parallel/include/cccl/c/serialization.h, serialization_diagnostics.h, binary_search.h, for.h, histogram.h, merge_sort.h, radix_sort.h, reduce.h, scan.h, segmented_reduce.h, segmented_sort.h, three_way_partition.h, transform.h, unique_by_key.h | Renames cccl_aot_algo_t/cccl_aot_buffer_free to cccl_serialization_algo_t/cccl_serialization_buffer_free, updates the documentation comments, and adds cccl_serialization_last_error and cccl_serialization_validate_blob. |
| Serialization utility and diagnostics implementation: c/parallel/src/serialization.cpp, src/util/serialization.h, src/util/nvjitlink.h | Implements buffer-free, last-error, and blob-validation APIs, renames blob magic and header helpers, updates error wording, and changes the nvJitLink input label. |
| Per-algorithm serialize/deserialize source updates: c/parallel/src/*.cu | Switches each algorithm implementation from cccl::aot to cccl::serialization, updates blob tags and error strings, and rewrites segmented-sort selector-op reconstruction. |
| C++ test retagging: c/parallel/test/test_*.cpp | Retags C2H tests from [aot] to [serialization] and updates reduce test buffer freeing to cccl_serialization_buffer_free. |

Python AoT serialization feature

| Layer / File(s) | Summary |
|---|---|
| Serde wire format and dispatch: python/cuda_cccl/cuda/compute/_serialization/* | Defines the blob framing, descriptor codecs, AlgoTag, the Serializable mixin, and public serialize/deserialize dispatch entry points. |
| Cython serialization backends and CMake wiring: python/cuda_cccl/CMakeLists.txt, _bindings_serialization_v1.pxi, _bindings_serialization_v2.pxi | Generates the serialization pxi, implements v1 blob validation/load wrappers, and provides v2 HostJIT stubs. |
| Typed bindings surface: _bindings.pyi | Adds serialize/deserialize typings for build-result classes and Op.operator_type/Iterator.alignment/Iterator.dereference_or_assign_op. |
| Cython build-result wiring: _bindings_impl.pyx, __init__.py | Adds zero-init constructors, new properties, and per-class serialize/deserialize methods; exports serialize and deserialize. |
| Reduce/Scan/SegmentedReduce serde: _reduce.py, _scan.py, _segmented_reduce.py | Adds serialization schemas and computed device_reduce_fn/device_scan_fn selection. |
| MergeSort/RadixSort/SegmentedSort serde: _sort/*.py | Adds serialization schemas including radix decomposer state handling. |
| Partition/transform/unique/search/select serde: _three_way_partition.py, _transform.py, _unique_by_key.py, _histogram.py, _binary_search.py, _select.py, algorithms/__init__.py | Adds serialization schemas and public load_lower_bound/load_upper_bound loaders. |
| Round-trip tests and benchmark: tests/compute/test_*_serialization.py, bench_serialization.py | Adds round-trip tests for binary search, histogram, merge sort, and radix sort, plus a cold-start benchmark comparing JIT and serialization paths. |

Suggested reviewers: bernhardmgruber


Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 7

🧹 Nitpick comments (1)
python/cuda_cccl/tests/compute/test_binary_search_aot.py (1)

25-72: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

suggestion: add a negative test for loading a lower_bound blob through load_upper_bound and vice versa. The current tests cover only valid round trips, so the mode-mismatch validation could regress unnoticed.

with pytest.raises(ValueError, match="mode mismatch"):
    load_upper_bound(lower_bound_blob)

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: aa78048f-6597-48de-a0b2-9ce3b50a930e

📥 Commits

Reviewing files that changed from the base of the PR and between 3d5f235 and 21d8901.

📒 Files selected for processing (30)
  • python/cuda_cccl/CMakeLists.txt
  • python/cuda_cccl/cuda/compute/_aot_serde.py
  • python/cuda_cccl/cuda/compute/_bindings.pyi
  • python/cuda_cccl/cuda/compute/_bindings_aot_v1.pxi
  • python/cuda_cccl/cuda/compute/_bindings_aot_v2.pxi
  • python/cuda_cccl/cuda/compute/_bindings_impl.pyx
  • python/cuda_cccl/cuda/compute/algorithms/__init__.py
  • python/cuda_cccl/cuda/compute/algorithms/_binary_search.py
  • python/cuda_cccl/cuda/compute/algorithms/_histogram.py
  • python/cuda_cccl/cuda/compute/algorithms/_reduce.py
  • python/cuda_cccl/cuda/compute/algorithms/_scan.py
  • python/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_three_way_partition.py
  • python/cuda_cccl/cuda/compute/algorithms/_transform.py
  • python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py
  • python/cuda_cccl/tests/compute/bench_aot.py
  • python/cuda_cccl/tests/compute/test_binary_search_aot.py
  • python/cuda_cccl/tests/compute/test_histogram_aot.py
  • python/cuda_cccl/tests/compute/test_merge_sort_aot.py
  • python/cuda_cccl/tests/compute/test_radix_sort_aot.py
  • python/cuda_cccl/tests/compute/test_reduce_aot.py
  • python/cuda_cccl/tests/compute/test_scan_aot.py
  • python/cuda_cccl/tests/compute/test_segmented_reduce_aot.py
  • python/cuda_cccl/tests/compute/test_segmented_sort_aot.py
  • python/cuda_cccl/tests/compute/test_three_way_partition_aot.py
  • python/cuda_cccl/tests/compute/test_transform_aot.py
  • python/cuda_cccl/tests/compute/test_unique_by_key_aot.py

Comment thread python/cuda_cccl/cuda/compute/_serialization/codec.py
Comment thread python/cuda_cccl/cuda/compute/_bindings_aot_v1.pxi Outdated
Comment thread python/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.py Outdated
Comment thread python/cuda_cccl/tests/compute/bench_aot.py Outdated
Comment thread python/cuda_cccl/tests/compute/test_merge_sort_aot.py
Comment thread python/cuda_cccl/tests/compute/test_reduce_serialization.py
Comment thread python/cuda_cccl/tests/compute/test_scan_aot.py Outdated
@shwina

shwina commented Jul 1, 2026

Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Contributor

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python/cuda_cccl/cuda/compute/algorithms/_scan.py (1)

137-150: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

important: Reject non-binary force_inclusive values from the AoT blob.

Line 138 converts any non-zero byte to True, so a malformed blob can silently deserialize an exclusive scan as inclusive and write incorrect results. Read the raw byte, require 0 or 1, then convert to bool.
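
A strict boolean read along the lines suggested here might look like the following. This is only a sketch: `read_strict_bool` and its offset-returning convention are hypothetical and are not taken from _scan.py.

```python
import struct
from typing import Tuple


def read_strict_bool(blob: bytes, offset: int) -> Tuple[bool, int]:
    """Read one byte at `offset` as a bool, accepting only 0 or 1.

    Returns the decoded value and the offset of the next field.
    """
    (raw,) = struct.unpack_from("<B", blob, offset)
    if raw not in (0, 1):
        raise ValueError(
            f"invalid boolean byte {raw:#04x} at offset {offset}; expected 0 or 1"
        )
    return bool(raw), offset + 1
```

With this check, a corrupted force_inclusive byte such as 0x02 raises instead of silently selecting the inclusive scan.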

As per path instructions, python/cuda_cccl/**/*: Focus on Python API stability, CUDA array interoperability, memory ownership, JIT/NVRTC/nvJitLink behavior, package boundaries, user-defined operator correctness, tests, and examples.

Source: Path instructions

🧹 Nitpick comments (1)
python/cuda_cccl/cuda/compute/_aot/dispatch.py (1)

68-74: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

suggestion: Avoid swallowing internal AttributeErrors from real serializers.

This catches any AttributeError raised inside algorithm.serialize(), masking implementation defects as “not AoT-serializable”. Check getattr(algorithm, "serialize", None) first, validate it is callable, then invoke it outside the AttributeError handler.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e30eaece-e301-4ed7-9f7e-656daf4d2786

📥 Commits

Reviewing files that changed from the base of the PR and between 21d8901 and 6d67e53.

📒 Files selected for processing (21)
  • python/cuda_cccl/cuda/compute/__init__.py
  • python/cuda_cccl/cuda/compute/_aot/__init__.py
  • python/cuda_cccl/cuda/compute/_aot/dispatch.py
  • python/cuda_cccl/cuda/compute/_aot/serde.py
  • python/cuda_cccl/cuda/compute/_bindings_aot_v1.pxi
  • python/cuda_cccl/cuda/compute/algorithms/__init__.py
  • python/cuda_cccl/cuda/compute/algorithms/_binary_search.py
  • python/cuda_cccl/cuda/compute/algorithms/_histogram.py
  • python/cuda_cccl/cuda/compute/algorithms/_reduce.py
  • python/cuda_cccl/cuda/compute/algorithms/_scan.py
  • python/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_three_way_partition.py
  • python/cuda_cccl/cuda/compute/algorithms/_transform.py
  • python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py
  • python/cuda_cccl/tests/compute/bench_aot.py
  • python/cuda_cccl/tests/compute/test_merge_sort_aot.py
  • python/cuda_cccl/tests/compute/test_reduce_aot.py
  • python/cuda_cccl/tests/compute/test_scan_aot.py
✅ Files skipped from review due to trivial changes (1)
  • python/cuda_cccl/cuda/compute/_aot/__init__.py
🚧 Files skipped from review as they are similar to previous changes (16)
  • python/cuda_cccl/tests/compute/test_merge_sort_aot.py
  • python/cuda_cccl/cuda/compute/algorithms/__init__.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py
  • python/cuda_cccl/cuda/compute/algorithms/_transform.py
  • python/cuda_cccl/cuda/compute/algorithms/_three_way_partition.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.py
  • python/cuda_cccl/cuda/compute/algorithms/_binary_search.py
  • python/cuda_cccl/cuda/compute/algorithms/_histogram.py
  • python/cuda_cccl/tests/compute/bench_aot.py
  • python/cuda_cccl/cuda/compute/algorithms/_reduce.py
  • python/cuda_cccl/tests/compute/test_reduce_aot.py
  • python/cuda_cccl/tests/compute/test_scan_aot.py
  • python/cuda_cccl/cuda/compute/_bindings_aot_v1.pxi

@shwina shwina force-pushed the pr/aot-cuda-compute branch 2 times, most recently from 1f601fe to 1336f31 Compare July 2, 2026 11:16
@shwina

shwina commented Jul 2, 2026

Contributor Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Contributor
✅ Action performed

Full review finished.

@shwina

shwina commented Jul 2, 2026

Contributor Author

/ok to test b0f4db3

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 2

🧹 Nitpick comments (3)
c/parallel/src/aot.cpp (1)

18-18: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

suggestion: Switch this include to angle brackets
c/parallel/src/aot.cpp:18 still uses quotes; the repo rule for headers is angle-bracket includes.

Source: Coding guidelines

python/cuda_cccl/cuda/compute/algorithms/_scan.py (1)

136-152: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

suggestion: the "exclusive scan with no init value" error now only surfaces on first __call__ (property access), whereas previously (per the summary, "relying on a pre-assigned instance attribute") this was presumably validated eagerly at construction. A caller building an invalid scan via make_exclusive_scan(..., init_value=None) will no longer fail immediately — the error only appears when the scan actually executes. Worth forcing eager evaluation once at the end of __init__ to keep the fail-fast contract:

🔧 Proposed fix
         self.force_inclusive = force_inclusive
 
         # Compile the op with value types
         self.op_cccl = op.compile((value_type, value_type), value_type)
 
         self.build_result = call_build(
             _bindings.DeviceScanBuildResult,
             self.d_in_cccl,
             self.d_out_cccl,
             self.op_cccl,
             init_value_type_info,
             force_inclusive,
             self.init_kind,
         )
+
+        # Trigger validation eagerly so invalid combinations fail at build time
+        # rather than on first execution.
+        _ = self.device_scan_fn

Source: Path instructions

python/cuda_cccl/tests/compute/test_aot_diagnostics.py (1)

27-34: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

suggestion: dedupe the USING_V2 skip boilerplate.

The try/except ImportError + pytestmark = pytest.mark.skipif(USING_V2, ...) block is repeated verbatim in every AOT test file in this cohort (test_binary_search_aot.py, test_histogram_aot.py, test_merge_sort_aot.py, test_transform_aot.py, etc.). Move it to a shared conftest.py fixture/marker to avoid maintaining N copies.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7ad3d8ea-e521-48fc-8724-056811a2cc09

📥 Commits

Reviewing files that changed from the base of the PR and between 64b6b0e and b0f4db3.

📒 Files selected for processing (39)
  • c/parallel/include/cccl/c/aot_diagnostics.h
  • c/parallel/src/aot.cpp
  • python/cuda_cccl/CMakeLists.txt
  • python/cuda_cccl/cuda/compute/__init__.py
  • python/cuda_cccl/cuda/compute/_aot/__init__.py
  • python/cuda_cccl/cuda/compute/_aot/dispatch.py
  • python/cuda_cccl/cuda/compute/_aot/serde.py
  • python/cuda_cccl/cuda/compute/_aot/serializable.py
  • python/cuda_cccl/cuda/compute/_bindings.pyi
  • python/cuda_cccl/cuda/compute/_bindings_aot_v1.pxi
  • python/cuda_cccl/cuda/compute/_bindings_aot_v2.pxi
  • python/cuda_cccl/cuda/compute/_bindings_impl.pyx
  • python/cuda_cccl/cuda/compute/algorithms/__init__.py
  • python/cuda_cccl/cuda/compute/algorithms/_binary_search.py
  • python/cuda_cccl/cuda/compute/algorithms/_histogram.py
  • python/cuda_cccl/cuda/compute/algorithms/_reduce.py
  • python/cuda_cccl/cuda/compute/algorithms/_scan.py
  • python/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.py
  • python/cuda_cccl/cuda/compute/algorithms/_select.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_three_way_partition.py
  • python/cuda_cccl/cuda/compute/algorithms/_transform.py
  • python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py
  • python/cuda_cccl/tests/compute/bench_aot.py
  • python/cuda_cccl/tests/compute/test_aot_diagnostics.py
  • python/cuda_cccl/tests/compute/test_binary_search_aot.py
  • python/cuda_cccl/tests/compute/test_histogram_aot.py
  • python/cuda_cccl/tests/compute/test_merge_sort_aot.py
  • python/cuda_cccl/tests/compute/test_radix_sort_aot.py
  • python/cuda_cccl/tests/compute/test_reduce_aot.py
  • python/cuda_cccl/tests/compute/test_scan_aot.py
  • python/cuda_cccl/tests/compute/test_segmented_reduce_aot.py
  • python/cuda_cccl/tests/compute/test_segmented_sort_aot.py
  • python/cuda_cccl/tests/compute/test_select_aot.py
  • python/cuda_cccl/tests/compute/test_three_way_partition_aot.py
  • python/cuda_cccl/tests/compute/test_transform_aot.py
  • python/cuda_cccl/tests/compute/test_unique_by_key_aot.py

Comment on lines +32 to +38
    try:
        return algorithm.serialize()
    except AttributeError as e:
        raise TypeError(
            f"{type(algorithm).__name__} is not an AoT-serializable algorithm "
            "(expected an object from a make_* factory)."
        ) from e

Contributor

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

important: the except AttributeError wraps the whole call, so an AttributeError raised inside a valid serialize() (real bug, missing attr on a partially-built object) is swallowed and reported as "not an AoT-serializable algorithm", masking the true failure. Gate on the attribute's presence instead of catching from the call.

-    try:
-        return algorithm.serialize()
-    except AttributeError as e:
-        raise TypeError(
-            f"{type(algorithm).__name__} is not an AoT-serializable algorithm "
-            "(expected an object from a make_* factory)."
-        ) from e
+    serialize_method = getattr(type(algorithm), "serialize", None)
+    if not callable(serialize_method):
+        raise TypeError(
+            f"{type(algorithm).__name__} is not an AoT-serializable algorithm "
+            "(expected an object from a make_* factory)."
+        )
+    return algorithm.serialize()

Comment on lines +35 to +43
    __serde_schema__ = (
        ("d_in_keys_cccl", ITER),
        ("d_in_values_cccl", ITER),
        ("d_out_keys_cccl", ITER),
        ("d_out_values_cccl", ITER),
        ("op_cccl", OP),
        ("build_result", BUILD_RESULT(_bindings.DeviceMergeSortBuildResult)),
    )

@coderabbitai coderabbitai Bot Jul 2, 2026

Contributor

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check ITER's handling of None iterators in the AoT serde schema.
rg -n -A15 'def ITER' python/cuda_cccl/cuda/compute/_aot/serializable.py
rg -n 'CONDITIONAL' python/cuda_cccl/cuda/compute/_aot/serializable.py python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py python/cuda_cccl/cuda/compute/algorithms/*.py

Repository: NVIDIA/cccl

Length of output: 680


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== serializable.py outline =="
ast-grep outline python/cuda_cccl/cuda/compute/_aot/serializable.py --view expanded || true

echo "== merge_sort.py outline =="
ast-grep outline python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py --view expanded || true

echo "== relevant serializable.py lines =="
sed -n '1,340p' python/cuda_cccl/cuda/compute/_aot/serializable.py | cat -n | sed -n '1,340p'

echo "== relevant merge_sort.py lines =="
sed -n '1,220p' python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py | cat -n | sed -n '1,220p'

echo "== search for None/optional handling in compute algorithms =="
rg -n "None|OPTIONAL|CONDITIONAL|ITER\(" python/cuda_cccl/cuda/compute -g '*.py'

Repository: NVIDIA/cccl

Length of output: 50367


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== serializable.py: ITER, CONDITIONAL, and serde core =="
sed -n '1,280p' python/cuda_cccl/cuda/compute/_aot/serializable.py | cat -n | sed -n '1,280p'

echo "== merge_sort.py: schema and init =="
sed -n '1,180p' python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py | cat -n | sed -n '1,180p'

Repository: NVIDIA/cccl

Length of output: 17471


🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '1,280p' python/cuda_cccl/cuda/compute/_aot/serde.py | cat -n | sed -n '1,280p'

Repository: NVIDIA/cccl

Length of output: 8936


important: python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py:35-43 uses plain ITER for d_in_values_cccl/d_out_values_cccl, but _MergeSort.__init__ allows both to be None. That makes keys-only merge sort fail to serialize; wrap the value iterators in a conditional schema or another optional wrapper.

Source: Path instructions

Contributor Author

This one is a false positive — keys-only merge sort serializes and round-trips correctly. When d_in_values / d_out_values are None, to_cccl_input_iter(None) / to_cccl_output_iter(None) return a valid "none" POINTER iterator (uint8, state=None), which the plain ITER schema member writes and reads back without issue — so no conditional/optional wrapper is needed.

Verified end-to-end (build → serialize → deserialize → run → correctly sorted) and added a test_serialize_deserialize_merge_sort_keys_only regression test (0a8e15b).
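
The reasoning above is an instance of the null-object pattern: because None is mapped to a sentinel descriptor before serialization, the encoder needs no optional branch in the schema. A toy sketch of that idea follows; `ToyIterator`, `to_toy_iter`, and the byte layout are hypothetical stand-ins, not the real to_cccl_input_iter or ITER codec.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyIterator:
    """Hypothetical stand-in for a CCCL iterator descriptor."""

    kind: str
    value_type: str
    state: Optional[bytes]


def to_toy_iter(array: Optional[bytes]) -> ToyIterator:
    """Map None to a sentinel descriptor instead of propagating None."""
    if array is None:
        return ToyIterator(kind="POINTER", value_type="uint8", state=None)
    return ToyIterator(kind="POINTER", value_type="uint8", state=bytes(array))


def encode_iter(it: ToyIterator) -> bytes:
    """Encode any descriptor, sentinel included, through one code path."""
    has_state = b"\x00" if it.state is None else b"\x01"
    state = b"" if it.state is None else it.state
    return b"|".join([it.kind.encode(), it.value_type.encode(), has_state, state])


def decode_iter(blob: bytes) -> ToyIterator:
    """Inverse of encode_iter; the state may itself contain b'|'."""
    kind, value_type, has_state, state = blob.split(b"|", 3)
    return ToyIterator(
        kind.decode(), value_type.decode(), state if has_state == b"\x01" else None
    )
```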

Contributor

@shwina Understood — confirmed. Since to_cccl_input_iter(None)/to_cccl_output_iter(None) produce a real "none" POINTER iterator object (not a Python None), the plain ITER schema member serializes/deserializes it like any other iterator, so no conditional wrapper is needed. Thanks for adding the regression test.


✏️ Learnings added
Learnt from: shwina
Repo: NVIDIA/cccl PR: 9644
File: python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py:35-43
Timestamp: 2026-07-02T22:18:14.092Z
Learning: In `python/cuda_cccl/cuda/compute` AoT serialization (`_aot/serializable.py`, `_aot/serde.py`), optional iterator arguments like `d_in_values`/`d_out_values` in algorithms such as `_MergeSort` (`python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py`) do not need `CONDITIONAL`/optional schema wrappers. `to_cccl_input_iter(None)`/`to_cccl_output_iter(None)` (in `_cccl_interop.py`) return a valid "none" `POINTER` iterator (uint8 type, `state=None`) rather than Python `None`, so the plain `ITER` schema kind already serializes/deserializes these "absent" values correctly.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


@shwina shwina force-pushed the pr/aot-cuda-compute branch from b0f4db3 to 0a8e15b Compare July 2, 2026 22:17
@shwina

shwina commented Jul 3, 2026

Contributor Author

/ok to test 0a8e15b


@github-actions

github-actions Bot commented Jul 3, 2026

Contributor

🥳 CI Workflow Results

🟩 Finished in 4h 07m: Pass: 100%/62 | Total: 17h 37m | Max: 1h 08m | Hits: 100%/1467

See results here.

shwina and others added 4 commits July 3, 2026 11:55
…gorithms

Exposes the C serialize/deserialize layer through the Python bindings so
a built algorithm can be persisted and reloaded without any JIT step:

  * algo.serialize() -> bytes  — return the blob directly
  * AlgoClass.deserialize(blob, ...) -> AlgoClass  — reconstruct from bytes

  * _bindings_impl.pyx / _bindings.pyi - serialize()/deserialize() on each
    Device*BuildResult, plus cccl_aot_buffer_free and the new externs.
  * algorithms/_*.py - serialize()/deserialize() on each algorithm class.

Adds tests/compute/test_<algo>_aot.py round-trip coverage for every
algorithm and an AoT-vs-JIT benchmark (bench_aot.py).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add `_OpAdapter.compile_for_load()` that returns a minimal Op stub
(empty LTOIR, correct type/state_alignment) without triggering numba-cuda
JIT compilation.  The compiled CUBIN is already embedded in the AoT blob;
only op.type and op.state are read at execute time, so LTOIR is not needed
on the load path.

_StatelessOp and _StatefulOp override to skip JIT entirely.  Well-known
ops and RawOp fall through to compile() since they are already JIT-free.

All eight deserialize() methods in the algorithm layer are updated to call
compile_for_load() instead of compile(), reducing deserialize latency from
~1s to ~0ms for Python-callable operators.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move all AoT C extern declarations and serialize/deserialize
implementations into backend-conditional .pxi files. v1 gets the
real implementation; v2 gets stubs that raise NotImplementedError.
CMake selects the right pxi the same way it handles the other
backend-conditional files (segmented_reduce_backend, etc.).

AoT tests skip automatically on v2 via pytestmark + USING_V2.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The serialize/deserialize path had no guard against loading a blob that
was saved for the opposite binary_search mode: an upper_bound artifact
could silently be deserialized through load_lower_bound and vice versa.

Fix:
- _BinarySearch stores its mode in a new slot.
- serialize() prepends a 4-byte mode tag (b"LBND" / b"UBND") before
  the C-level blob.
- _BinarySearch._deserialize() reads and validates the tag; raises
  ValueError if the blob mode doesn't match the expected_mode.
- Add module-level load_lower_bound() / load_upper_bound() that call
  _deserialize with the appropriate expected_mode, making the API
  symmetrical with make_lower_bound / make_upper_bound.
- Export load_lower_bound / load_upper_bound from algorithms/__init__.
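
A minimal sketch of the tag scheme described in this commit message. Only the two 4-byte tags (`b"LBND"`, `b"UBND"`) come from the text; the helper names `tag_blob` and `untag_blob` and the mode strings are hypothetical and are not the actual `_BinarySearch` code.

```python
# Hypothetical helpers illustrating the 4-byte mode tag; not the real implementation.
_TAGS = {"lower_bound": b"LBND", "upper_bound": b"UBND"}


def tag_blob(mode: str, c_blob: bytes) -> bytes:
    """Prepend the 4-byte mode tag to the C-level blob."""
    return _TAGS[mode] + c_blob


def untag_blob(blob: bytes, expected_mode: str) -> bytes:
    """Validate the leading tag and return the C-level blob without it."""
    tag, payload = blob[:4], blob[4:]
    if tag not in _TAGS.values():
        raise ValueError(f"unrecognized binary_search mode tag: {tag!r}")
    if tag != _TAGS[expected_mode]:
        raise ValueError(
            f"blob was saved with tag {tag!r}, expected {_TAGS[expected_mode]!r}"
        )
    return payload
```

Under these assumptions, a `load_lower_bound`-style wrapper would call `untag_blob(blob, "lower_bound")` before handing the payload to the C layer, so an upper_bound artifact fails early with a `ValueError`.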

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
shwina and others added 9 commits July 3, 2026 11:55
…ization

`Device<algo>` `deserialize()` now takes only the blob: the iterator, operator,
and value descriptors are written into a self-describing sidecar
(`cuda/compute/_aot_serde.py`) prepended to the C `build_result`, and rebuilt on
load with no objects supplied. Custom iterators are fully supported — their
device LTOIR round-trips — and no JIT runs on the load path.

The operator's real device code is serialized (exactly what a normal `__call__`
passes to execute), which supersedes the earlier placeholder-op approach; the
now-unused `compile_for_load()` helper is removed. Adds `Op.operator_type` and
`Iterator.alignment` getters used by the sidecar.

Blob format is versioned (`CCAOTPY1`) and tagged per algorithm; binary_search
folds its search mode into the sidecar so `load_lower_bound`/`load_upper_bound`
stay object-free and reject opposite-mode blobs.
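
To make the "versioned and tagged per algorithm" idea concrete, here is a rough sketch of how such a blob could be framed. Only the magic `CCAOTPY1` and the ordering (sidecar before the C `build_result`) come from the commit message; the length fields, their widths, and the function names `pack_blob` / `unpack_blob` are assumptions for illustration and may not match `_aot_serde.py`.

```python
import struct

MAGIC = b"CCAOTPY1"              # from the commit message
_LENGTHS = struct.Struct("<HI")  # assumed: algorithm-tag length, sidecar length


def pack_blob(algo: str, sidecar: bytes, build_result: bytes) -> bytes:
    """Frame a blob as: magic, lengths, algorithm tag, sidecar, C build_result."""
    algo_bytes = algo.encode("ascii")
    header = MAGIC + _LENGTHS.pack(len(algo_bytes), len(sidecar))
    return header + algo_bytes + sidecar + build_result


def unpack_blob(blob: bytes, expected_algo: str) -> tuple[bytes, bytes]:
    """Check magic and algorithm tag, then return (sidecar, build_result)."""
    if blob[: len(MAGIC)] != MAGIC:
        raise ValueError("not a recognized blob (bad magic or format version)")
    algo_len, sidecar_len = _LENGTHS.unpack_from(blob, len(MAGIC))
    offset = len(MAGIC) + _LENGTHS.size
    algo = blob[offset : offset + algo_len].decode("ascii")
    offset += algo_len
    if algo != expected_algo:
        raise ValueError(f"blob is tagged {algo!r}, expected {expected_algo!r}")
    sidecar = blob[offset : offset + sidecar_len]
    return sidecar, blob[offset + sidecar_len :]
```

Because the version is folded into the magic in this sketch, a future format change would bump the trailing digit and old readers would reject the blob at the first check.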

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Deserializing an AoT blob could fail with only an opaque CUDA error code
while the actual reason was printed to stdout by the C layer and discarded.
The two worst cases were an ABI/format mismatch and loading a CUBIN built
for a different GPU architecture, both of which surfaced with no actionable
message.

Add an AoT blob validator to the C layer and route the deserialize path
through it:

* New cccl_aot_validate_blob() checks the blob magic, format version, and
  CCCL C parallel ABI version, and — for CUBIN payloads — that the target
  compute-capability major matches the device the blob would load on
  (falling back to the default device when there is no current context, via
  idempotent cuInit, so a bare deserialize is still checked). It runs before
  cuLibraryLoadData, turning a deep opaque failure into an early, clear one.
* New cccl_aot_last_error() exposes the descriptive message for callers to
  surface instead of a bare CUDA error code.
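
The order of checks can be sketched in Python as follows. The real validator lives in the C layer; the header layout, the constant values, and the name `validate_blob` below are assumptions made for illustration, and the function returns the message directly rather than storing it for a separate `cccl_aot_last_error()`-style call.

```python
import struct
from typing import Optional

# Assumed header: magic, format version, ABI version, payload kind, CC major.
HEADER = struct.Struct("<4sIIBB")
MAGIC = b"AOTB"      # hypothetical magic value
FORMAT_VERSION = 1   # hypothetical
ABI_VERSION = 3      # hypothetical
KIND_CUBIN = 1       # hypothetical payload-kind code


def validate_blob(blob: bytes, device_cc_major: int) -> Optional[str]:
    """Return None if the blob looks loadable, else a descriptive message."""
    if len(blob) < HEADER.size:
        return "blob is too short to contain a header"
    magic, fmt, abi, kind, cc_major = HEADER.unpack_from(blob)
    if magic != MAGIC:
        return f"bad magic {magic!r}"
    if fmt != FORMAT_VERSION:
        return f"format version {fmt} is not supported (expected {FORMAT_VERSION})"
    if abi != ABI_VERSION:
        return f"ABI version {abi} does not match this build ({ABI_VERSION})"
    if kind == KIND_CUBIN and cc_major != device_cc_major:
        return (
            f"CUBIN targets compute capability major {cc_major}, "
            f"but the device reports major {device_cc_major}"
        )
    return None
```

Patching the last header byte of an otherwise valid blob is enough to exercise the compute-capability branch on a single GPU, which mirrors the test strategy this commit describes.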

Both are declared in a standalone header (cccl/c/aot_diagnostics.h) rather
than aot.h so they don't trigger a rebuild of every algorithm translation
unit.

The v1 AoT bindings now validate through cccl_aot_validate_blob() in every
cccl_device_<algo>_deserialize and report cccl_aot_last_error() on failure.
Adds tests/compute/test_aot_diagnostics.py (ABI mismatch, wrong
compute-capability major, corrupt magic, and a valid-blob round trip). The
CC case is exercised single-GPU by patching the blob's cc field; a real
cross-GPU load is the same code path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@shwina shwina force-pushed the pr/aot-cuda-compute branch from 0a8e15b to 25298cb Compare July 3, 2026 15:55
@shwina shwina marked this pull request as ready for review July 3, 2026 16:08
@shwina shwina requested review from a team as code owners July 3, 2026 16:08
@cccl-authenticator-app cccl-authenticator-app Bot moved this from In Progress to In Review in CCCL Jul 3, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2b88c99f-4dc5-4ef6-9105-533e43c91be7

📥 Commits

Reviewing files that changed from the base of the PR and between b0f4db3 and 25298cb.

📒 Files selected for processing (70)
  • c/parallel/include/cccl/c/binary_search.h
  • c/parallel/include/cccl/c/for.h
  • c/parallel/include/cccl/c/histogram.h
  • c/parallel/include/cccl/c/merge_sort.h
  • c/parallel/include/cccl/c/radix_sort.h
  • c/parallel/include/cccl/c/reduce.h
  • c/parallel/include/cccl/c/scan.h
  • c/parallel/include/cccl/c/segmented_reduce.h
  • c/parallel/include/cccl/c/segmented_sort.h
  • c/parallel/include/cccl/c/serialization.h
  • c/parallel/include/cccl/c/serialization_diagnostics.h
  • c/parallel/include/cccl/c/three_way_partition.h
  • c/parallel/include/cccl/c/transform.h
  • c/parallel/include/cccl/c/unique_by_key.h
  • c/parallel/src/aot.cpp
  • c/parallel/src/binary_search.cu
  • c/parallel/src/for.cu
  • c/parallel/src/histogram.cu
  • c/parallel/src/merge_sort.cu
  • c/parallel/src/radix_sort.cu
  • c/parallel/src/reduce.cu
  • c/parallel/src/scan.cu
  • c/parallel/src/segmented_reduce.cu
  • c/parallel/src/segmented_sort.cu
  • c/parallel/src/serialization.cpp
  • c/parallel/src/three_way_partition.cu
  • c/parallel/src/transform.cu
  • c/parallel/src/unique_by_key.cu
  • c/parallel/src/util/nvjitlink.h
  • c/parallel/src/util/serialization.h
  • c/parallel/test/test_binary_search.cpp
  • c/parallel/test/test_for.cpp
  • c/parallel/test/test_histogram.cpp
  • c/parallel/test/test_merge_sort.cpp
  • c/parallel/test/test_radix_sort.cpp
  • c/parallel/test/test_reduce.cpp
  • c/parallel/test/test_scan.cpp
  • c/parallel/test/test_segmented_reduce.cpp
  • c/parallel/test/test_segmented_sort.cpp
  • c/parallel/test/test_three_way_partition.cpp
  • c/parallel/test/test_transform.cpp
  • c/parallel/test/test_unique_by_key.cpp
  • python/cuda_cccl/CMakeLists.txt
  • python/cuda_cccl/cuda/compute/__init__.py
  • python/cuda_cccl/cuda/compute/_bindings.pyi
  • python/cuda_cccl/cuda/compute/_bindings_impl.pyx
  • python/cuda_cccl/cuda/compute/_bindings_serialization_v1.pxi
  • python/cuda_cccl/cuda/compute/_bindings_serialization_v2.pxi
  • python/cuda_cccl/cuda/compute/_serialization/__init__.py
  • python/cuda_cccl/cuda/compute/_serialization/codec.py
  • python/cuda_cccl/cuda/compute/_serialization/dispatch.py
  • python/cuda_cccl/cuda/compute/_serialization/serializable.py
  • python/cuda_cccl/cuda/compute/algorithms/__init__.py
  • python/cuda_cccl/cuda/compute/algorithms/_binary_search.py
  • python/cuda_cccl/cuda/compute/algorithms/_histogram.py
  • python/cuda_cccl/cuda/compute/algorithms/_reduce.py
  • python/cuda_cccl/cuda/compute/algorithms/_scan.py
  • python/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.py
  • python/cuda_cccl/cuda/compute/algorithms/_select.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_three_way_partition.py
  • python/cuda_cccl/cuda/compute/algorithms/_transform.py
  • python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py
  • python/cuda_cccl/tests/compute/bench_serialization.py
  • python/cuda_cccl/tests/compute/test_binary_search_serialization.py
  • python/cuda_cccl/tests/compute/test_histogram_serialization.py
  • python/cuda_cccl/tests/compute/test_merge_sort_serialization.py
  • python/cuda_cccl/tests/compute/test_radix_sort_serialization.py
💤 Files with no reviewable changes (29)
  • python/cuda_cccl/CMakeLists.txt
  • python/cuda_cccl/cuda/compute/_serialization/dispatch.py
  • python/cuda_cccl/cuda/compute/_serialization/__init__.py
  • python/cuda_cccl/tests/compute/test_radix_sort_serialization.py
  • python/cuda_cccl/cuda/compute/_bindings_serialization_v2.pxi
  • c/parallel/src/aot.cpp
  • python/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.py
  • python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_three_way_partition.py
  • python/cuda_cccl/cuda/compute/algorithms/_histogram.py
  • python/cuda_cccl/cuda/compute/algorithms/_reduce.py
  • python/cuda_cccl/cuda/compute/__init__.py
  • python/cuda_cccl/cuda/compute/algorithms/__init__.py
  • python/cuda_cccl/tests/compute/bench_serialization.py
  • python/cuda_cccl/tests/compute/test_histogram_serialization.py
  • python/cuda_cccl/cuda/compute/algorithms/_select.py
  • python/cuda_cccl/cuda/compute/algorithms/_transform.py
  • python/cuda_cccl/tests/compute/test_merge_sort_serialization.py
  • python/cuda_cccl/tests/compute/test_binary_search_serialization.py
  • python/cuda_cccl/cuda/compute/_serialization/serializable.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py
  • python/cuda_cccl/cuda/compute/_serialization/codec.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_binary_search.py
  • python/cuda_cccl/cuda/compute/algorithms/_scan.py
  • python/cuda_cccl/cuda/compute/_bindings.pyi
  • python/cuda_cccl/cuda/compute/_bindings_serialization_v1.pxi
  • python/cuda_cccl/cuda/compute/_bindings_impl.pyx
✅ Files skipped from review due to trivial changes (24)
  • c/parallel/include/cccl/c/three_way_partition.h
  • c/parallel/src/util/nvjitlink.h
  • c/parallel/include/cccl/c/transform.h
  • c/parallel/include/cccl/c/reduce.h
  • c/parallel/include/cccl/c/scan.h
  • c/parallel/include/cccl/c/segmented_sort.h
  • c/parallel/include/cccl/c/radix_sort.h
  • c/parallel/test/test_histogram.cpp
  • c/parallel/test/test_segmented_sort.cpp
  • c/parallel/include/cccl/c/histogram.h
  • c/parallel/include/cccl/c/merge_sort.h
  • c/parallel/test/test_binary_search.cpp
  • c/parallel/include/cccl/c/segmented_reduce.h
  • c/parallel/include/cccl/c/for.h
  • c/parallel/test/test_transform.cpp
  • c/parallel/test/test_segmented_reduce.cpp
  • c/parallel/include/cccl/c/binary_search.h
  • c/parallel/test/test_radix_sort.cpp
  • c/parallel/test/test_three_way_partition.cpp
  • c/parallel/test/test_unique_by_key.cpp
  • c/parallel/include/cccl/c/unique_by_key.h
  • c/parallel/test/test_merge_sort.cpp
  • c/parallel/test/test_scan.cpp
  • c/parallel/test/test_for.cpp

Comment thread c/parallel/src/serialization.cpp