Improve reducer API for memory allocation & compute#9290
Improve reducer API for memory allocation & compute#9290acosmicflamingo wants to merge 6 commits into
Conversation
3d56a30 to
b17dadc
Compare
|
@shwina I imagine that we want to avoid something like this, correct? diff a/python/cuda_cccl/cuda/compute/algorithms/_reduce.py
b/python/cuda_cccl/cuda/compute/algorithms/_reduce.py
@@ -35,6 +35,8 @@ from ..typing import (
class _Reduce:
__slots__ = [
+ "d_in",
+ "d_out",
"d_in_cccl",
"d_out_cccl",
"h_init_cccl",
@@ -52,6 +54,8 @@ class _Reduce:
h_init: np.ndarray | GpuStruct,
determinism: Determinism,
):
+ self.d_in = d_in
+ self.d_out = d_out
self.d_in_cccl = cccl.to_cccl_input_iter(d_in)
self.d_out_cccl = cccl.to_cccl_output_iter(d_out)
self.h_init_cccl = cccl.to_cccl_value(h_init)
@@ -152,8 +156,8 @@ class _Reduce:
h_init: np.ndarray | GpuStruct,
stream=None,
):
- set_cccl_iterator_state(self.d_in_cccl, d_in)
- set_cccl_iterator_state(self.d_out_cccl, d_out)
+ set_cccl_iterator_state(self.d_in_cccl, self.d_in)
+ set_cccl_iterator_state(self.d_out_cccl, self.d_out)
# Update op state for stateful ops
op_adapter = make_op_adapter(op)On the surface, it would pave the way for implementing the API we seek. However, |
|
@shwina would it be okay to remove |
Hmm - can you explain a bit more?
It's intended behaviour that Curious - what specific issue are you running into due to this caching? |
|
@shwina the issue I came across could be just how I'm using the API in my tests, where I copied and pasted the inputs but used them in slightly different ways. Because all tests have the same input structure, they all have the same cache key in |
|
I'll put it another way. If you look at this code, the issue is that the arrays in the second-half will never be touched because of caching (and this is essentially what happens if I write tests that all use h_init = np.array([False])
d_input = cp.array([True, False, True])
d_output = cp.empty_like(d_input, shape=(1,))
reducer = cuda.compute.make_reduce_into(
d_in=d_input,
d_out=d_output,
num_items=len(d_input),
op=OpKind.MAXIMUM,
h_init=h_init,
)
reducer.compute()
expected = True
assert d_output.get()[0] == expected # Passes
h_init2 = np.array([False])
d_input2 = cp.array([True, False, True])
d_output2 = cp.empty_like(d_input2, shape=(1,))
# Fetches reducer instead of make a new reducer2
reducer2 = cuda.compute.make_reduce_into(
d_in=d_input2,
d_out=d_output2,
num_items=len(d_input2),
op=OpKind.MAXIMUM,
h_init=h_init2,
)
reducer2.compute()
expected2 = True
assert d_output2.get()[0] == expected2 # Fails |
|
OK, I see it now - thanks. Yes -- The awkwardness/inconvenience of this API is exactly why we have the single-phase |
|
@shwina good point! I had seen Leo make this comment in this draft PR of copilot (although I don't know if it's as vital given the update you had made here):
Perhaps I can go ahead with adding function calls that execute the |
|
I do like the approach proposed by Leo. The only downside is that we lose consistency with the C++ API. Although, it can be argued that by providing separate, named methods for memory allocation and compute, we are already losing consistency. @NaderAlAwar any thoughts? Would you be OK with |
|
I like adding the Regarding The current state where you have to pass the same arrays twice is not great, but we should think carefully about we we want to evolve that API. |
|
@shwina @NaderAlAwar thanks for the feedback! Yes, the pointers were being cached via Your input got me thinking...what does In summary, I think the real question here is whether CCCL Python API users would find it more annoying to be passing |
|
Based on @NaderAlAwar's latest comment, I'll go ahead with implementing the functions that explicitly pass the arguments that users are currently familiar with 😄 |
|
In light of latest feedback, is this the way we expect the API to look? Only change users expect from the front-end is that they don't have to pass reducer = cuda.compute.make_reduce_into(
d_in=d_input,
d_out=d_output,
num_items=len(d_input),
op=OpKind.MAXIMUM,
h_init=h_init,
)
temp_storage_size = reducer.get_temp_storage_bytes(
d_in=d_input,
d_out=d_output,
num_items=len(d_input),
op=OpKind.MAXIMUM,
h_init=h_init,
)
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)
reducer.compute(
temp_storage=d_temp_storage,
d_in=d_input,
d_out=d_output,
num_items=len(d_input),
op=OpKind.MAXIMUM,
h_init=h_init,
) |
|
Yes that looks right. For now I would also reimplement the |
|
Great! Assuming that you literally meant just calling the functions within Edit: only thing I need to do is update documentation so it mentions this new API |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Summary
Tests
WalkthroughAdds named reducer methods for temporary-storage sizing and compute, keeps direct callable use working with deprecation warnings, expands reduce tests, and updates the README documentation link. ChangesReduce API and test updates
Assessment against linked issues
Out-of-scope changes
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2b516dbe-275c-4275-9700-cef47638a5a5
📒 Files selected for processing (3)
python/cuda_cccl/README.mdpython/cuda_cccl/cuda/compute/algorithms/_reduce.pypython/cuda_cccl/tests/compute/test_reduce.py
| For complete documentation, examples, and API reference, visit: | ||
|
|
||
| - **Full Documentation**: [nvidia.github.io/cccl/python](https://nvidia.github.io/cccl/python) | ||
| - **Full Documentation**: [nvidia.github.io/cccl/python](https://nvidia.github.io/cccl/unstable/python) |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
suggestion: The visible label still says nvidia.github.io/cccl/python while the target now points to .../unstable/python. Please update the text or keep the stable target so the docs link doesn’t advertise the wrong destination.
There was a problem hiding this comment.
@shwina should https://nvidia.github.io/cccl/python exist? I added the unstable one because at least it exists, but maybe the issue is with the hyperlink itself, where it may have existed at one point
There was a problem hiding this comment.
🧩 Analysis chain
🤖 web_search executed:
Does https://nvidia.github.io/cccl/python exist and return a valid page? Also check if https://nvidia.github.io/cccl/unstable/python exists.
Length of output: 4381
@acosmicflamingo confirmed: nvidia.github.io/cccl/unstable/python resolves to real pages (index, API reference, setup docs), while I found no evidence that nvidia.github.io/cccl/python currently resolves to anything — it only shows up as historical README source text in search results, not as a live page. So the stable path is likely broken/removed (versioning scheme probably changed to require /unstable/ or a version tag), and pointing to unstable is the correct fix for now.
suggestion: since unstable is intentional, just update the visible label text to match, e.g.:
- **Full Documentation**: [nvidia.github.io/cccl/unstable/python](https://nvidia.github.io/cccl/unstable/python)That avoids advertising a dead/misleading link text.
Description
Resolves #5638
Add 'get_temp_storage_bytes' and 'compute' functions to _Reduce class to improve API usability (as suggested by @kkraus14 and @shwina).
Checklist