Skip to content

[cudax] Implement cudax::coop::any_of algorithm for <= warp groups#9665

Open
davebayer wants to merge 1 commit into
NVIDIA:mainfrom
davebayer:coop_any
Open

[cudax] Implement cudax::coop::any_of algorithm for <= warp groups#9665
davebayer wants to merge 1 commit into
NVIDIA:mainfrom
davebayer:coop_any

Conversation

@davebayer

@davebayer davebayer commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Fixes #9669, #9670 and #9672.

@davebayer davebayer requested a review from a team as a code owner July 1, 2026 12:45
@davebayer davebayer requested a review from caugonnet July 1, 2026 12:45
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jul 1, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jul 1, 2026
@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Summary by CodeRabbit

  • New Features
    • Added coop::any support for cooperative groups, including warp-level and broadcasted variants.
    • New behavior returns a result for the group root and supports direct boolean output in broadcasted mode.
  • Bug Fixes
    • Added safeguards to prevent any from being used with non-boolean thread arrays.
  • Tests
    • Added CUDA test coverage for thread, warp, and within-warp group scenarios across multiple input patterns.

Walkthrough

Adds a new cuda::experimental::coop::any header implementing OR-reduction across per-thread boolean arrays and warp-level groups, with regular and broadcasted overloads. Wires it into the coop.cuh umbrella header and adds three new CMake-registered CUDA test executables covering this_thread, this_warp, and threads_within_warp scenarios.

Changes

coop::any feature

Layer / File(s) Summary
any.cuh core implementation
cudax/include/cuda/experimental/__coop/any.cuh
Implements internal __any_impl for per-thread boolean arrays (OR-reduce) and warp-level groups (lane-mask combine via __any_sync), plus public any overloads returning optional<bool> or bool (broadcasted), with non-bool overloads deleted.
Umbrella header wiring and test registration
cudax/include/cuda/experimental/coop.cuh, cudax/test/CMakeLists.txt
Includes the new header from coop.cuh and registers three new Catch2 test executables for coop.any.this_thread, coop.any.this_warp, coop.any.threads_within_warp.
this_thread any test
cudax/test/coop/any/this_thread.cu
Adds test_group, CustomBinaryPartition, TestKernel, and a C2H_TEST validating any over per-thread boolean arrays for all-false, all-true, and single-difference cases.
this_warp any test
cudax/test/coop/any/this_warp.cu
Adds test_group, CustomBinaryPartition, TestKernel, and a C2H_TEST validating any and broadcasted any across warp lanes with rank-based patterns.
threads_within_warp any test
cudax/test/coop/any/threads_within_warp.cu
Adds test_group filtering non-member threads, CustomBinaryPartition, TestKernel exercising group_by and binary_partition configurations, and a C2H_TEST harness.

Suggested reviewers: griwes, jrhemstad, fbusato


Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
cudax/include/cuda/experimental/__coop/any.cuh (1)

49-50: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

suggestion: Accept const bool arrays for __thread_data.

The algorithm only reads the array, but every overload takes _Tp (&)[_Np] / bool (&)[_Np], so const bool[N] inputs cannot call coop::any. Consider changing the internal, public, and deleted overloads to take const bool (&)[_Np] / const _Tp (&)[_Np]. As per path instructions, focus on experimental API clarity. As per coding guidelines, all variables that are not modified must be declared const.

Also applies to: 70-70, 89-90, 97-98, 105-109

Sources: Coding guidelines, Path instructions


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 15730b6b-01fe-4ca2-85ab-918364345690

📥 Commits

Reviewing files that changed from the base of the PR and between 96e4c40 and 3c30139.

📒 Files selected for processing (6)
  • cudax/include/cuda/experimental/__coop/any.cuh
  • cudax/include/cuda/experimental/coop.cuh
  • cudax/test/CMakeLists.txt
  • cudax/test/coop/any/this_thread.cu
  • cudax/test/coop/any/this_warp.cu
  • cudax/test/coop/any/threads_within_warp.cu

@github-actions

This comment has been minimized.

@davebayer davebayer changed the title [cudax] Implement cudax::coop::any algorithm [cudax] Implement cudax::coop::any algorithm for <= warp groups Jul 1, 2026
@davebayer davebayer changed the title [cudax] Implement cudax::coop::any algorithm for <= warp groups [cudax] Implement cudax::coop::any_of algorithm for <= warp groups Jul 1, 2026
@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Caution

Failed to replace (edit) comment. This is likely due to insufficient permissions or the comment being deleted.

Error details
{}

@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

🥳 CI Workflow Results

🟩 Finished in 1h 09m: Pass: 100%/55 | Total: 9h 49m | Max: 1h 09m | Hits: 79%/43333

See results here.

@pciolkosz pciolkosz left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a variant that just takes a single bool instead of an array?

}
}

struct CustomBinaryPartition

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is not needed in this file

}
}

struct CustomBinaryPartition

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

}
else
{
return (gpu_thread.is_root_rank(__group)) ? ::cuda::std::optional{__result} : ::cuda::std::nullopt;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something to consider in general, not just for this review: I just thought in cases where broadcast (or partial broadcast) is free, would it make sense for all threads to return the result? We could document that without cudax::broadcasted some subset of threads that includes the leader returns the result, while with broadcasted all threads return.

Not sure if you can use it meaningfully, but I guess you could opportunistically use the result in a larger set of threads

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the answer is that I'm not sure. Because for some algorithm + group combinations the broadcast is free, but for others it's not.

And then it's hard to come up with some rules when the broadcast is implicit. We would have to document each case and it will be confusing and less generic. So, to be consistent with other cases I thought I would return the result only by the root rank.

Wouldn't documenting that using the broadcasted version is free when used with this_thread/this_warp/threads within warp sound better?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is one more option to not document it, but just have the user opportunistically check the returned optional if it has a value. But I don't know if it's even useful for anything. Not sure what's the best here

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also considering this, but I don't think it's a good idea, because the user might rely on has_value() => is_root_rank(), which wouldn't be true for this case.

I'm wondering whether there is actually something wrong about the current approach. You should have used the broadcasted tag, if you wanted the result to be known to all members of the group 🙂

Is there a problem with this logic? Isn't the problem just that you see the implementation and feel like we are wasting infotmation in the non-broadcasted case? I think the behavior makes sense from the design perspective

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I don't think there is anything wrong with the current logic, just wanted to bring this up as an option, but we don't have to change anything

Comment thread cudax/include/cuda/experimental/__coop/any_of.cuh
Comment thread cudax/test/coop/any_of/threads_within_warp.cu
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Review

2 participants