Skip to content

Improve reducer API for memory allocation & compute#9290

Open
acosmicflamingo wants to merge 6 commits into
NVIDIA:mainfrom
acosmicflamingo:improve-mem-compute-api
Open

Improve reducer API for memory allocation & compute#9290
acosmicflamingo wants to merge 6 commits into
NVIDIA:mainfrom
acosmicflamingo:improve-mem-compute-api

Conversation

@acosmicflamingo

@acosmicflamingo acosmicflamingo commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Description

Resolves #5638

Add 'get_temp_storage_bytes' and 'compute' functions to _Reduce class to improve API usability (as suggested by @kkraus14 and @shwina).

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

* Start migrating Reduce.__call__ dunder methods to more descriptive
  API (as suggested by @kkraus14 and @shwina)

* Add 'get_temp_storage_bytes' and 'compute' method stubs

* Display deprecation warnings when users use utilize Reduce.__call__
  dunder method (and add respective tests)
@copy-pr-bot

copy-pr-bot Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 8, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Progress in CCCL Jun 8, 2026
@acosmicflamingo acosmicflamingo force-pushed the improve-mem-compute-api branch from 3d56a30 to b17dadc Compare June 8, 2026 14:44
@acosmicflamingo

acosmicflamingo commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

@shwina I imagine that we want to avoid something like this, correct?

diff a/python/cuda_cccl/cuda/compute/algorithms/_reduce.py 
     b/python/cuda_cccl/cuda/compute/algorithms/_reduce.py
@@ -35,6 +35,8 @@ from ..typing import (
 
 class _Reduce:
     __slots__ = [
+        "d_in",
+        "d_out",
         "d_in_cccl",
         "d_out_cccl",
         "h_init_cccl",
@@ -52,6 +54,8 @@ class _Reduce:
         h_init: np.ndarray | GpuStruct,
         determinism: Determinism,
     ):
+        self.d_in = d_in
+        self.d_out = d_out
         self.d_in_cccl = cccl.to_cccl_input_iter(d_in)
         self.d_out_cccl = cccl.to_cccl_output_iter(d_out)
         self.h_init_cccl = cccl.to_cccl_value(h_init)
@@ -152,8 +156,8 @@ class _Reduce:
         h_init: np.ndarray | GpuStruct,
         stream=None,
     ):
-        set_cccl_iterator_state(self.d_in_cccl, d_in)
-        set_cccl_iterator_state(self.d_out_cccl, d_out)
+        set_cccl_iterator_state(self.d_in_cccl, self.d_in)
+        set_cccl_iterator_state(self.d_out_cccl, self.d_out)
 
         # Update op state for stateful ops
         op_adapter = make_op_adapter(op)

On the surface, it would pave the way for implementing the API we seek. However, make_reduce_into utilizes the cache_with_registered_key_functions decorator, and caching can pose an issue if we're now saving self.d_in and self.d_out attributes in our reducers. It explains why in my test suite, I could have two identical tests (or three, four, etc) and only the first would pass an assert statement. In the subsequent tests when I'm creating a new reducer, it fetches the reducer created from the first test, and mutates those self.d_in and self.d_out attributes and doesn't actually touch the d_in and d_out arrays that were passed to make_reduce_into in the current test.

@acosmicflamingo

Copy link
Copy Markdown
Contributor Author

@shwina would it be okay to remove @cache_with_registered_key_functions from make_reduce_into function?

@shwina

shwina commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

@shwina would it be okay to remove @cache_with_registered_key_functions from make_reduce_into function?

Hmm - can you explain a bit more?

it fetches the reducer created from the first test, and mutates those self.d_in and self.d_out attributes and doesn't actually touch the d_in and d_out arrays that were passed to make_reduce_into in the current test.

It's intended behaviour that make_reduce_into caches the built object and retrieves it for future calls (if identical dtypes etc., are passed).

Curious - what specific issue are you running into due to this caching?

@acosmicflamingo

acosmicflamingo commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

@shwina the issue I came across could be just how I'm using the API in my tests, where I copied and pasted the inputs but used them in slightly different ways. Because all tests have the same input structure, they all have the same cache key in make_reduce_into (e.g. _make_cache_key_from_args(*args, **kwargs) returns ((('d_in', ('cupy.ndarray', dtype('bool'))), ('d_out', ('cupy.ndarray', dtype('bool'))), ('h_init', ('numpy.ndarray', dtype('bool'))), ('num_items', 3), ('op', ('MAXIMUM', 23))),), which you can see doesn't take input data into account). This results in always modifying the initial output array coming from the cached reduce object, so any assertion beyond the first test will fail. Because of this, it looks like I have to clear the cache before every test run. Does that sound reasonable of a solution? Or should cache_with_registered_key_functions be refactored to take input values into account?

@acosmicflamingo

acosmicflamingo commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

I'll put it another way. If you look at this code, the issue is that the arrays in the second-half will never be touched because of caching (and this is essentially what happens if I write tests that all use cuda.compute.make_reduce_into with the same input values:

h_init = np.array([False])
d_input = cp.array([True, False, True])
d_output = cp.empty_like(d_input, shape=(1,))
reducer = cuda.compute.make_reduce_into(
    d_in=d_input,
    d_out=d_output,
    num_items=len(d_input),
    op=OpKind.MAXIMUM,
    h_init=h_init,
)
reducer.compute()
expected = True
assert d_output.get()[0] == expected # Passes

h_init2 = np.array([False])
d_input2 = cp.array([True, False, True])
d_output2 = cp.empty_like(d_input2, shape=(1,))
# Fetches reducer instead of make a new reducer2
reducer2 = cuda.compute.make_reduce_into(
    d_in=d_input2,
    d_out=d_output2,
    num_items=len(d_input2),
    op=OpKind.MAXIMUM,
    h_init=h_init2,
)
reducer2.compute()
expected2 = True
assert d_output2.get()[0] == expected2  # Fails

@shwina

shwina commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

OK, I see it now - thanks. Yes -- .compute() will (unfortunately) have to take in the same arguments d_in, d_out, etc., This kind of matches the semantics of the "two-phase" CUB API.

The awkwardness/inconvenience of this API is exactly why we have the single-phase reduce_into(...) and friends.

@acosmicflamingo

Copy link
Copy Markdown
Contributor Author

@shwina good point! I had seen Leo make this comment in this draft PR of copilot (although I don't know if it's as vital given the update you had made here):

Points for consideration:

If we want the multi-phase API to use this pattern (which I like a lot and is aligned with nvmath):
...
2. I don't like having to pass the same arguments to all 3 APIs. This is too much typing and makes the API intent unclear. Ideally we want to pass them to only make_ to define the problem. The rest two APIs are for execution only. The only exception is when we want to allow changing the input arrays (of the same dtype/shape/etc to keep the problem definition intact) or the output arrays (ex: out=). But these would incur extra overheads (for consistency checks). So in the nvmath case we added another API .reset_operands() for this purpose, and reduce the number of arguments in plan/execute.
3. Nit: .compute() or .execute()? nvmath uses the latter. (For the record I am fine either way, just wanna ensure this is called out.)

Perhaps I can go ahead with adding function calls that execute the __call__ dunder functions under the hood. Then, in the future I can analyze what nvmath does to get this kind of behavior and then we could revisit simplifying compute and others. Does that sound good?

@shwina

shwina commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

I do like the approach proposed by Leo. The only downside is that we lose consistency with the C++ API. Although, it can be argued that by providing separate, named methods for memory allocation and compute, we are already losing consistency.

@NaderAlAwar any thoughts? Would you be OK with .reset_operands(...), .get_temp_storage_bytes(...) and .compute(...)?

@NaderAlAwar

Copy link
Copy Markdown
Contributor

I like adding the get_temp_storage_bytes() and compute()/execute() (either is fine) methods.

Regarding reset_operands(), will that be needed because the reducer object has to hold pointers to the actual buffers? I would like to avoid the wrappers holding any state beyond what they need. I prefer the API being verbose and explicit to saving the pointers and having to call reset_operands().

The current state where you have to pass the same arrays twice is not great, but we should think carefully about we we want to evolve that API.

@acosmicflamingo

Copy link
Copy Markdown
Contributor Author

@shwina @NaderAlAwar thanks for the feedback! Yes, the pointers were being cached via @cache_with_registered_key_functions.

Your input got me thinking...what does nvmath do under-the-hood anyway? It turns out, their API doesn't magically handle storing inputs and cache algorithm output without having to call reset_operands as it seemed at first glance: https://github.com/NVIDIA/nvmath-python/blob/main/nvmath/linalg/advanced/matmulmod.py#L1075). Notice in the docstring that while compute() is not being passed anything, they are mutating a and b in-place, so the algorithm will still be pointing to those arrays. If they were to create new arrays before calling compute/execute, they would also have to run reset_operands beforehand.

In summary, I think the real question here is whether CCCL Python API users would find it more annoying to be passing d_in, d_out over and over again to get_temp_storage_bytes() and compute()/execute(), or having to slap on reset_operands anytime d_in or d_out change.

@acosmicflamingo

Copy link
Copy Markdown
Contributor Author

Based on @NaderAlAwar's latest comment, I'll go ahead with implementing the functions that explicitly pass the arguments that users are currently familiar with 😄

@acosmicflamingo

Copy link
Copy Markdown
Contributor Author

In light of latest feedback, is this the way we expect the API to look? Only change users expect from the front-end is that they don't have to pass temp_storage=None to get_temp_storage_bytes (and perhaps we want to make it optional for compute as well?):

reducer = cuda.compute.make_reduce_into(
    d_in=d_input,
    d_out=d_output,
    num_items=len(d_input),
    op=OpKind.MAXIMUM,
    h_init=h_init,
)
temp_storage_size = reducer.get_temp_storage_bytes(
    d_in=d_input,
    d_out=d_output,
    num_items=len(d_input),
    op=OpKind.MAXIMUM,
    h_init=h_init,
)
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)
reducer.compute(
    temp_storage=d_temp_storage,
    d_in=d_input,
    d_out=d_output,
    num_items=len(d_input),
    op=OpKind.MAXIMUM,
    h_init=h_init,
)

@NaderAlAwar

Copy link
Copy Markdown
Contributor

Yes that looks right. For now I would also reimplement the __call__ method to call get_temp_storage_bytes and compute internally depending on whether or not temp_storage is None

@acosmicflamingo acosmicflamingo changed the title [WIP] Improve reducer API for memory allocation & compute Improve reducer API for memory allocation & compute Jul 2, 2026
@acosmicflamingo

acosmicflamingo commented Jul 2, 2026

Copy link
Copy Markdown
Contributor Author

Great! Assuming that you literally meant just calling the functions within __call__ and leaving _execute_call_legacy intact, I think the PR is now ready for review.

Edit: only thing I need to do is update documentation so it mentions this new API

@acosmicflamingo acosmicflamingo marked this pull request as ready for review July 2, 2026 21:46
@acosmicflamingo acosmicflamingo requested review from a team as code owners July 2, 2026 21:46
@cccl-authenticator-app cccl-authenticator-app Bot moved this from In Progress to In Review in CCCL Jul 2, 2026
@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0169b145-a1d9-488e-b048-22e1fab15d8a

📥 Commits

Reviewing files that changed from the base of the PR and between a467f17 and 3c06641.

📒 Files selected for processing (1)
  • python/cuda_cccl/tests/compute/test_reduce.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • python/cuda_cccl/tests/compute/test_reduce.py

Note: CodeRabbit is enabled on this repository as a convenience for maintainers
and contributors. Use your best judgment when considering its review comments and
suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what
ultimately matter for merging.

Summary

  • Added explicit get_temp_storage_bytes(...) and compute(...) flow for reduce reducers.
  • Kept __call__ working as a legacy path, routing it to the new methods and emitting a DeprecationWarning.
  • Updated reducer argument plumbing to consistently forward inputs and stream.
  • Adjusted documentation link in python/cuda_cccl/README.md to point at the unstable docs URL.
  • Added tests covering the new two-phase API and the deprecated callable behavior.

Tests

  • Added reducer API tests for temp-storage sizing, execution, and deprecation warnings.

Walkthrough

Adds named reducer methods for temporary-storage sizing and compute, keeps direct callable use working with deprecation warnings, expands reduce tests, and updates the README documentation link.

Changes

Reduce API and test updates

Layer / File(s) Summary
Named reducer methods and legacy routing
python/cuda_cccl/cuda/compute/algorithms/_reduce.py
Adds DeprecationWarning emission for direct calls and routes them to get_temp_storage_bytes(...) or compute(...) through a shared legacy execution path that forwards stream.
Reduce API and stream tests
python/cuda_cccl/tests/compute/test_reduce.py
Updates stream validation assertions and adds coverage for make_reduce_into named-method usage plus deprecated callable invocation.
README documentation link
python/cuda_cccl/README.md
Updates the full documentation link target to the unstable Python docs URL.

Assessment against linked issues

Objective Addressed Explanation
Add named APIs for temporary storage allocation and compute in the object-based API [#5638]
Preserve direct callable behavior while transitioning to named APIs [#5638]
Update existing tests/examples/benchmarks to reflect the new usage [#5638] Tests are updated here, but no examples or benchmarks are shown in the provided changes.

Out-of-scope changes

Code Change Explanation
README link target changed to unstable docs (python/cuda_cccl/README.md:47-47) The linked issue is about reducer APIs and usage migration; the docs URL update is unrelated to that behavior change.

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2b516dbe-275c-4275-9700-cef47638a5a5

📥 Commits

Reviewing files that changed from the base of the PR and between 4be2ca2 and a467f17.

📒 Files selected for processing (3)
  • python/cuda_cccl/README.md
  • python/cuda_cccl/cuda/compute/algorithms/_reduce.py
  • python/cuda_cccl/tests/compute/test_reduce.py

Comment thread python/cuda_cccl/cuda/compute/algorithms/_reduce.py
For complete documentation, examples, and API reference, visit:

- **Full Documentation**: [nvidia.github.io/cccl/python](https://nvidia.github.io/cccl/python)
- **Full Documentation**: [nvidia.github.io/cccl/python](https://nvidia.github.io/cccl/unstable/python)

@coderabbitai coderabbitai Bot Jul 2, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

suggestion: The visible label still says nvidia.github.io/cccl/python while the target now points to .../unstable/python. Please update the text or keep the stable target so the docs link doesn’t advertise the wrong destination.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shwina should https://nvidia.github.io/cccl/python exist? I added the unstable one because at least it exists, but maybe the issue is with the hyperlink itself, where it may have existed at one point

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🤖 web_search executed:

Does https://nvidia.github.io/cccl/python exist and return a valid page? Also check if https://nvidia.github.io/cccl/unstable/python exists.

Length of output: 4381


@acosmicflamingo confirmed: nvidia.github.io/cccl/unstable/python resolves to real pages (index, API reference, setup docs), while I found no evidence that nvidia.github.io/cccl/python currently resolves to anything — it only shows up as historical README source text in search results, not as a live page. So the stable path is likely broken/removed (versioning scheme probably changed to require /unstable/ or a version tag), and pointing to unstable is the correct fix for now.

suggestion: since unstable is intentional, just update the visible label text to match, e.g.:

- **Full Documentation**: [nvidia.github.io/cccl/unstable/python](https://nvidia.github.io/cccl/unstable/python)

That avoids advertising a dead/misleading link text.

Comment thread python/cuda_cccl/tests/compute/test_reduce.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

cuda.cccl.parallel: Add named APIs to do temporary memory allocation and compute respectively,

3 participants