Improve reducer API for memory allocation & compute by acosmicflamingo · Pull Request #9290 · NVIDIA/cccl

acosmicflamingo · 2026-06-08T02:13:19Z

Description

Resolves #5638

Add 'get_temp_storage_bytes' and 'compute' functions to _Reduce class to improve API usability (as suggested by @kkraus14 and @shwina).

Checklist

New or existing tests cover these changes.
The documentation is up to date with these changes.

@kkraus14

* Start migrating Reduce.__call__ dunder methods to more descriptive API (as suggested by @kkraus14 and @shwina) * Add 'get_temp_storage_bytes' and 'compute' method stubs * Display deprecation warnings when users use utilize Reduce.__call__ dunder method (and add respective tests)

copy-pr-bot · 2026-06-08T02:13:22Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

acosmicflamingo · 2026-06-08T22:14:07Z

@shwina I imagine that we want to avoid something like this, correct?

diff a/python/cuda_cccl/cuda/compute/algorithms/_reduce.py 
     b/python/cuda_cccl/cuda/compute/algorithms/_reduce.py
@@ -35,6 +35,8 @@ from ..typing import (
 
 class _Reduce:
     __slots__ = [
+        "d_in",
+        "d_out",
         "d_in_cccl",
         "d_out_cccl",
         "h_init_cccl",
@@ -52,6 +54,8 @@ class _Reduce:
         h_init: np.ndarray | GpuStruct,
         determinism: Determinism,
     ):
+        self.d_in = d_in
+        self.d_out = d_out
         self.d_in_cccl = cccl.to_cccl_input_iter(d_in)
         self.d_out_cccl = cccl.to_cccl_output_iter(d_out)
         self.h_init_cccl = cccl.to_cccl_value(h_init)
@@ -152,8 +156,8 @@ class _Reduce:
         h_init: np.ndarray | GpuStruct,
         stream=None,
     ):
-        set_cccl_iterator_state(self.d_in_cccl, d_in)
-        set_cccl_iterator_state(self.d_out_cccl, d_out)
+        set_cccl_iterator_state(self.d_in_cccl, self.d_in)
+        set_cccl_iterator_state(self.d_out_cccl, self.d_out)
 
         # Update op state for stateful ops
         op_adapter = make_op_adapter(op)

On the surface, it would pave the way for implementing the API we seek. However, make_reduce_into utilizes the cache_with_registered_key_functions decorator, and caching can pose an issue if we're now saving self.d_in and self.d_out attributes in our reducers. It explains why in my test suite, I could have two identical tests (or three, four, etc) and only the first would pass an assert statement. In the subsequent tests when I'm creating a new reducer, it fetches the reducer created from the first test, and mutates those self.d_in and self.d_out attributes and doesn't actually touch the d_in and d_out arrays that were passed to make_reduce_into in the current test.

acosmicflamingo · 2026-06-25T16:37:24Z

@shwina would it be okay to remove @cache_with_registered_key_functions from make_reduce_into function?

shwina · 2026-06-25T17:08:50Z

@shwina would it be okay to remove @cache_with_registered_key_functions from make_reduce_into function?

Hmm - can you explain a bit more?

it fetches the reducer created from the first test, and mutates those self.d_in and self.d_out attributes and doesn't actually touch the d_in and d_out arrays that were passed to make_reduce_into in the current test.

It's intended behaviour that make_reduce_into caches the built object and retrieves it for future calls (if identical dtypes etc., are passed).

Curious - what specific issue are you running into due to this caching?

acosmicflamingo · 2026-06-25T17:20:09Z

@shwina the issue I came across could be just how I'm using the API in my tests, where I copied and pasted the inputs but used them in slightly different ways. Because all tests have the same input structure, they all have the same cache key in make_reduce_into (e.g. _make_cache_key_from_args(*args, **kwargs) returns ((('d_in', ('cupy.ndarray', dtype('bool'))), ('d_out', ('cupy.ndarray', dtype('bool'))), ('h_init', ('numpy.ndarray', dtype('bool'))), ('num_items', 3), ('op', ('MAXIMUM', 23))),), which you can see doesn't take input data into account). This results in always modifying the initial output array coming from the cached reduce object, so any assertion beyond the first test will fail. Because of this, it looks like I have to clear the cache before every test run. Does that sound reasonable of a solution? Or should cache_with_registered_key_functions be refactored to take input values into account?

acosmicflamingo · 2026-06-25T17:35:14Z

I'll put it another way. If you look at this code, the issue is that the arrays in the second-half will never be touched because of caching (and this is essentially what happens if I write tests that all use cuda.compute.make_reduce_into with the same input values:

h_init = np.array([False])
d_input = cp.array([True, False, True])
d_output = cp.empty_like(d_input, shape=(1,))
reducer = cuda.compute.make_reduce_into(
    d_in=d_input,
    d_out=d_output,
    num_items=len(d_input),
    op=OpKind.MAXIMUM,
    h_init=h_init,
)
reducer.compute()
expected = True
assert d_output.get()[0] == expected # Passes

h_init2 = np.array([False])
d_input2 = cp.array([True, False, True])
d_output2 = cp.empty_like(d_input2, shape=(1,))
# Fetches reducer instead of make a new reducer2
reducer2 = cuda.compute.make_reduce_into(
    d_in=d_input2,
    d_out=d_output2,
    num_items=len(d_input2),
    op=OpKind.MAXIMUM,
    h_init=h_init2,
)
reducer2.compute()
expected2 = True
assert d_output2.get()[0] == expected2  # Fails

shwina · 2026-06-26T11:58:49Z

OK, I see it now - thanks. Yes -- .compute() will (unfortunately) have to take in the same arguments d_in, d_out, etc., This kind of matches the semantics of the "two-phase" CUB API.

The awkwardness/inconvenience of this API is exactly why we have the single-phase reduce_into(...) and friends.

acosmicflamingo · 2026-06-26T16:42:09Z

@shwina good point! I had seen Leo make this comment in this draft PR of copilot (although I don't know if it's as vital given the update you had made here):

Points for consideration:

If we want the multi-phase API to use this pattern (which I like a lot and is aligned with nvmath):
...
2. I don't like having to pass the same arguments to all 3 APIs. This is too much typing and makes the API intent unclear. Ideally we want to pass them to only make_ to define the problem. The rest two APIs are for execution only. The only exception is when we want to allow changing the input arrays (of the same dtype/shape/etc to keep the problem definition intact) or the output arrays (ex: out=). But these would incur extra overheads (for consistency checks). So in the nvmath case we added another API .reset_operands() for this purpose, and reduce the number of arguments in plan/execute.
3. Nit: .compute() or .execute()? nvmath uses the latter. (For the record I am fine either way, just wanna ensure this is called out.)

Perhaps I can go ahead with adding function calls that execute the __call__ dunder functions under the hood. Then, in the future I can analyze what nvmath does to get this kind of behavior and then we could revisit simplifying compute and others. Does that sound good?

shwina · 2026-06-29T11:50:58Z

I do like the approach proposed by Leo. The only downside is that we lose consistency with the C++ API. Although, it can be argued that by providing separate, named methods for memory allocation and compute, we are already losing consistency.

@NaderAlAwar any thoughts? Would you be OK with .reset_operands(...), .get_temp_storage_bytes(...) and .compute(...)?

NaderAlAwar · 2026-06-29T13:08:03Z

I like adding the get_temp_storage_bytes() and compute()/execute() (either is fine) methods.

Regarding reset_operands(), will that be needed because the reducer object has to hold pointers to the actual buffers? I would like to avoid the wrappers holding any state beyond what they need. I prefer the API being verbose and explicit to saving the pointers and having to call reset_operands().

The current state where you have to pass the same arrays twice is not great, but we should think carefully about we we want to evolve that API.

acosmicflamingo · 2026-06-29T15:50:39Z

@shwina @NaderAlAwar thanks for the feedback! Yes, the pointers were being cached via @cache_with_registered_key_functions.

Your input got me thinking...what does nvmath do under-the-hood anyway? It turns out, their API doesn't magically handle storing inputs and cache algorithm output without having to call reset_operands as it seemed at first glance: https://github.com/NVIDIA/nvmath-python/blob/main/nvmath/linalg/advanced/matmulmod.py#L1075). Notice in the docstring that while compute() is not being passed anything, they are mutating a and b in-place, so the algorithm will still be pointing to those arrays. If they were to create new arrays before calling compute/execute, they would also have to run reset_operands beforehand.

In summary, I think the real question here is whether CCCL Python API users would find it more annoying to be passing d_in, d_out over and over again to get_temp_storage_bytes() and compute()/execute(), or having to slap on reset_operands anytime d_in or d_out change.

acosmicflamingo · 2026-06-29T15:53:32Z

Based on @NaderAlAwar's latest comment, I'll go ahead with implementing the functions that explicitly pass the arguments that users are currently familiar with 😄

acosmicflamingo · 2026-07-02T16:11:05Z

In light of latest feedback, is this the way we expect the API to look? Only change users expect from the front-end is that they don't have to pass temp_storage=None to get_temp_storage_bytes (and perhaps we want to make it optional for compute as well?):

reducer = cuda.compute.make_reduce_into(
    d_in=d_input,
    d_out=d_output,
    num_items=len(d_input),
    op=OpKind.MAXIMUM,
    h_init=h_init,
)
temp_storage_size = reducer.get_temp_storage_bytes(
    d_in=d_input,
    d_out=d_output,
    num_items=len(d_input),
    op=OpKind.MAXIMUM,
    h_init=h_init,
)
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)
reducer.compute(
    temp_storage=d_temp_storage,
    d_in=d_input,
    d_out=d_output,
    num_items=len(d_input),
    op=OpKind.MAXIMUM,
    h_init=h_init,
)

NaderAlAwar · 2026-07-02T18:59:15Z

Yes that looks right. For now I would also reimplement the __call__ method to call get_temp_storage_bytes and compute internally depending on whether or not temp_storage is None

acosmicflamingo · 2026-07-02T21:46:35Z

Great! Assuming that you literally meant just calling the functions within __call__ and leaving _execute_call_legacy intact, I think the PR is now ready for review.

Edit: only thing I need to do is update documentation so it mentions this new API

coderabbitai · 2026-07-02T21:52:14Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0169b145-a1d9-488e-b048-22e1fab15d8a

📥 Commits

Reviewing files that changed from the base of the PR and between a467f17 and 3c06641.

📒 Files selected for processing (1)

python/cuda_cccl/tests/compute/test_reduce.py

🚧 Files skipped from review as they are similar to previous changes (1)

python/cuda_cccl/tests/compute/test_reduce.py

Note: CodeRabbit is enabled on this repository as a convenience for maintainers
and contributors. Use your best judgment when considering its review comments and
suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what
ultimately matter for merging.

Summary

Added explicit get_temp_storage_bytes(...) and compute(...) flow for reduce reducers.
Kept __call__ working as a legacy path, routing it to the new methods and emitting a DeprecationWarning.
Updated reducer argument plumbing to consistently forward inputs and stream.
Adjusted documentation link in python/cuda_cccl/README.md to point at the unstable docs URL.
Added tests covering the new two-phase API and the deprecated callable behavior.

Tests

Added reducer API tests for temp-storage sizing, execution, and deprecation warnings.

Walkthrough

Adds named reducer methods for temporary-storage sizing and compute, keeps direct callable use working with deprecation warnings, expands reduce tests, and updates the README documentation link.

Changes

Reduce API and test updates

Layer / File(s)	Summary
Named reducer methods and legacy routing `python/cuda_cccl/cuda/compute/algorithms/_reduce.py`	Adds `DeprecationWarning` emission for direct calls and routes them to `get_temp_storage_bytes(...)` or `compute(...)` through a shared legacy execution path that forwards `stream`.
Reduce API and stream tests `python/cuda_cccl/tests/compute/test_reduce.py`	Updates stream validation assertions and adds coverage for `make_reduce_into` named-method usage plus deprecated callable invocation.
README documentation link `python/cuda_cccl/README.md`	Updates the full documentation link target to the unstable Python docs URL.

Assessment against linked issues

Objective	Addressed	Explanation
Add named APIs for temporary storage allocation and compute in the object-based API [`#5638`]	✅
Preserve direct callable behavior while transitioning to named APIs [`#5638`]	✅
Update existing tests/examples/benchmarks to reflect the new usage [`#5638`]	❓	Tests are updated here, but no examples or benchmarks are shown in the provided changes.

Out-of-scope changes

Code Change	Explanation
README link target changed to unstable docs (`python/cuda_cccl/README.md:47-47`)	The linked issue is about reducer APIs and usage migration; the docs URL update is unrelated to that behavior change.

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 3

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2b516dbe-275c-4275-9700-cef47638a5a5

📥 Commits

Reviewing files that changed from the base of the PR and between 4be2ca2 and a467f17.

📒 Files selected for processing (3)

python/cuda_cccl/README.md
python/cuda_cccl/cuda/compute/algorithms/_reduce.py
python/cuda_cccl/tests/compute/test_reduce.py

coderabbitai · 2026-07-02T21:52:18Z

 For complete documentation, examples, and API reference, visit:

- **Full Documentation**: [nvidia.github.io/cccl/python](https://nvidia.github.io/cccl/python)
+- **Full Documentation**: [nvidia.github.io/cccl/python](https://nvidia.github.io/cccl/unstable/python)


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

suggestion: The visible label still says nvidia.github.io/cccl/python while the target now points to .../unstable/python. Please update the text or keep the stable target so the docs link doesn’t advertise the wrong destination.

@shwina should https://nvidia.github.io/cccl/python exist? I added the unstable one because at least it exists, but maybe the issue is with the hyperlink itself, where it may have existed at one point

🧩 Analysis chain

🤖 web_search executed:

Does https://nvidia.github.io/cccl/python exist and return a valid page? Also check if https://nvidia.github.io/cccl/unstable/python exists.

Length of output: 4381

@acosmicflamingo confirmed: nvidia.github.io/cccl/unstable/python resolves to real pages (index, API reference, setup docs), while I found no evidence that nvidia.github.io/cccl/python currently resolves to anything — it only shows up as historical README source text in search results, not as a live page. So the stable path is likely broken/removed (versioning scheme probably changed to require /unstable/ or a version tag), and pointing to unstable is the correct fix for now.

suggestion: since unstable is intentional, just update the visible label text to match, e.g.:

- **Full Documentation**: [nvidia.github.io/cccl/unstable/python](https://nvidia.github.io/cccl/unstable/python)

That avoids advertising a dead/misleading link text.

acosmicflamingo added 3 commits June 7, 2026 16:59

Fix broken link in python/cuda_cccl README

f4d7242

Move deprecated _Reduce.__call__ functionality to _execute_call

4022970

github-project-automation Bot added this to CCCL Jun 8, 2026

github-project-automation Bot moved this to Todo in CCCL Jun 8, 2026

cccl-authenticator-app Bot moved this from Todo to In Progress in CCCL Jun 8, 2026

Polish new API kwargs

b17dadc

acosmicflamingo force-pushed the improve-mem-compute-api branch from 3d56a30 to b17dadc Compare June 8, 2026 14:44

Incorporate suggestions from @NaderAlAwar and @shwina

a467f17

acosmicflamingo changed the title ~~[WIP] Improve reducer API for memory allocation & compute~~ Improve reducer API for memory allocation & compute Jul 2, 2026

acosmicflamingo marked this pull request as ready for review July 2, 2026 21:46

acosmicflamingo requested review from a team as code owners July 2, 2026 21:46

acosmicflamingo requested review from gevtushenko and gonidelis July 2, 2026 21:46

cccl-authenticator-app Bot moved this from In Progress to In Review in CCCL Jul 2, 2026

coderabbitai Bot reviewed Jul 2, 2026

View reviewed changes

Fix test_reduce_with_stream not catching missing streams

3c06641

Uh oh!

Conversation

acosmicflamingo commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Uh oh!

copy-pr-bot Bot commented Jun 8, 2026

Uh oh!

acosmicflamingo commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

acosmicflamingo commented Jun 25, 2026

Uh oh!

shwina commented Jun 25, 2026

Uh oh!

acosmicflamingo commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

acosmicflamingo commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

shwina commented Jun 26, 2026

Uh oh!

acosmicflamingo commented Jun 26, 2026

Uh oh!

shwina commented Jun 29, 2026

Uh oh!

NaderAlAwar commented Jun 29, 2026

Uh oh!

acosmicflamingo commented Jun 29, 2026

Uh oh!

acosmicflamingo commented Jun 29, 2026

Uh oh!

acosmicflamingo commented Jul 2, 2026

Uh oh!

NaderAlAwar commented Jul 2, 2026

Uh oh!

acosmicflamingo commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Tests

Walkthrough

Changes

Assessment against linked issues

Out-of-scope changes

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

acosmicflamingo Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

acosmicflamingo commented Jun 8, 2026 •

edited

Loading

acosmicflamingo commented Jun 8, 2026 •

edited

Loading

acosmicflamingo commented Jun 25, 2026 •

edited

Loading

acosmicflamingo commented Jun 25, 2026 •

edited

Loading

acosmicflamingo commented Jul 2, 2026 •

edited

Loading

coderabbitai Bot commented Jul 2, 2026 •

edited

Loading

coderabbitai Bot Jul 2, 2026 •

edited

Loading