Add custom multi_tensor_apply kernels (L2norm, Adam) by matthiasdiener · Pull Request #585 · ROCm/TransformerEngine

matthiasdiener · 2026-05-13T15:53:32Z

Description

Fixes https://github.com/ROCm/frameworks-internal/issues/16529

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Change A
Change B

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

alextmagro · 2026-05-16T01:20:58Z

+template <int N, typename T>
+__device__ __forceinline__ void load_store_n(T *dst, const T *src,
+                                             int dst_offset, int src_offset) {
+  typedef typename std::aligned_storage<N * sizeof(T), N * alignof(T)>::type LT;


We already have load/store functions that are optimized for ROCm here. Can we reuse them?

https://github.com/ROCm/TransformerEngine/blob/dev/transformer_engine/common/util/rocm_device_utils.cuh#L68-L115

P.S. That file also has some other utilities that come in handy for us.
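
For context, a minimal host-side sketch of the pattern that the quoted load_store_n helper appears to follow: copying N consecutive elements as one chunk, with offsets counted in chunks. The body of the real helper is not shown in the diff, so the chunk-offset convention, the names load_store_n_sketch and is_chunk_aligned, and the alignment check are assumptions made for illustration. This is not the API of rocm_device_utils.cuh.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: true if p is aligned to a full chunk of N elements.
template <int N, typename T>
inline bool is_chunk_aligned(const T *p) {
  return reinterpret_cast<std::uintptr_t>(p) % (N * sizeof(T)) == 0;
}

// Hypothetical host-side stand-in for load_store_n.
// Copies chunk number src_offset of src into chunk number dst_offset of dst,
// where one chunk is N consecutive elements of T (assumed convention).
template <int N, typename T>
inline void load_store_n_sketch(T *dst, const T *src, int dst_offset, int src_offset) {
  T *d = dst + static_cast<std::ptrdiff_t>(dst_offset) * N;
  const T *s = src + static_cast<std::ptrdiff_t>(src_offset) * N;
  // A device kernel would typically rely on this alignment to issue one wide access.
  assert(is_chunk_aligned<N>(d) && is_chunk_aligned<N>(s));
  std::memcpy(d, s, N * sizeof(T));  // memcpy avoids type-punning through a cast
}
```

On the device side the same idea is usually expressed as a single wide load and store, which is why reusing an existing, tuned helper may be preferable to adding a second implementation.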

alextmagro · 2026-05-16T01:25:14Z

+
+  TRANSFORMER_ENGINE_TYPE_SWITCH_NON_FP8ONLY(
+      grad_dtype, grad_type,
+      if (mode == ADAM_MODE_0) {


I think we can use TRANSFORMER_ENGINE_SWITCH_CONDITION here.
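
A rough sketch of what this suggestion could look like. It assumes the macro turns a runtime condition into a constexpr bool that is visible inside the body, so the body is written once instead of in an if/else pair. SWITCH_CONDITION_SKETCH, AdamModeSketch, launch_sketch, and dispatch_sketch are hypothetical stand-ins, and the expansion of the real macro may differ.

```cpp
// Hypothetical stand-in for TRANSFORMER_ENGINE_SWITCH_CONDITION (assumed shape):
// evaluates CONDITION at runtime and exposes it as a constexpr bool named FLAG.
#define SWITCH_CONDITION_SKETCH(CONDITION, FLAG, ...) \
  if (CONDITION) {                                    \
    constexpr bool FLAG = true;                       \
    { __VA_ARGS__ }                                   \
  } else {                                            \
    constexpr bool FLAG = false;                      \
    { __VA_ARGS__ }                                   \
  }

// The mode names come from the diff; the enum type name is made up.
enum AdamModeSketch { ADAM_MODE_0 = 0, ADAM_MODE_1 = 1 };

// Hypothetical stand-in for a kernel launch that is templated on the mode.
template <AdamModeSketch MODE>
inline int launch_sketch() {
  return MODE == ADAM_MODE_0 ? 0 : 1;
}

// The launch appears once; the macro instantiates it for both values of kIsL2.
inline int dispatch_sketch(int mode) {
  int launched = -1;
  SWITCH_CONDITION_SKETCH(mode == ADAM_MODE_0, kIsL2,
    launched = launch_sketch<kIsL2 ? ADAM_MODE_0 : ADAM_MODE_1>();
  )
  return launched;
}
```

With this shape, dispatch_sketch(0) takes the ADAM_MODE_0 instantiation and any other value takes ADAM_MODE_1, without repeating the launch line.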

alextmagro · 2026-05-16T01:26:34Z

+#pragma unroll
+    for (int ii = 0; ii < CILP; ii++) {
+      if (MODE == ADAM_MODE_0) {  // L2
+        r_g[ii] = r_g[ii] + (decay * r_p[ii]);


Use += here for readability.
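
To show the suggested form in context, here is a scalar sketch of one Adam step with the L2 term written using +=. Only the L2 line is taken from the diff. The rest is the textbook Adam update, and treating ADAM_MODE_1 as decoupled weight decay is an assumption. AdamStateSketch and adam_step_sketch are hypothetical names, and bias_c1 and bias_c2 stand for the bias-correction factors 1 - beta1^t and 1 - beta2^t.

```cpp
#include <cmath>

// The mode names come from the diff; the enum type name is made up.
enum AdamModeSketch { ADAM_MODE_0 = 0, ADAM_MODE_1 = 1 };

// Hypothetical per-element state: parameter, first moment, second moment.
struct AdamStateSketch {
  float p, m, v;
};

// One scalar Adam step (sketch, not the kernel from this PR).
template <AdamModeSketch MODE>
inline void adam_step_sketch(AdamStateSketch &s, float g, float lr, float beta1,
                             float beta2, float eps, float decay, float bias_c1,
                             float bias_c2) {
  if (MODE == ADAM_MODE_0) {  // L2: fold the decay term into the gradient
    g += decay * s.p;
  }
  s.m = beta1 * s.m + (1.0f - beta1) * g;
  s.v = beta2 * s.v + (1.0f - beta2) * g * g;
  float update = (s.m / bias_c1) / (std::sqrt(s.v / bias_c2) + eps);
  if (MODE == ADAM_MODE_1) {  // assumed: decoupled weight decay
    update += decay * s.p;
  }
  s.p -= lr * update;
}
```

In the kernel the same line would read r_g[ii] += decay * r_p[ii]; which says the same thing as the original with less repetition.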

alextmagro · 2026-05-16T01:27:34Z

+        param_dtype, p_type,
+        TRANSFORMER_ENGINE_TYPE_SWITCH_NON_FP8ONLY(
+            grad_dtype, g_type,
+            if (mode == ADAM_MODE_0) {


Same here: we can use TRANSFORMER_ENGINE_SWITCH_CONDITION.

alextmagro · 2026-05-16T01:28:38Z

+              LAUNCH_CUSTOM_ADAM(g_type, p_type, ADAM_MODE_0, true);
+            } else {
+              LAUNCH_CUSTOM_ADAM(g_type, p_type, ADAM_MODE_1, true);
+            };););


Add NOLINT here and at all of our macro switches
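
A small sketch of where such a marker could go, assuming the goal is to keep the linter from flagging the closing line of a multi-line macro invocation. TYPE_SWITCH_SKETCH and element_size_sketch are hypothetical, and the NOLINT(*) spelling (suppress every category on that line, as cpplint accepts) is an assumption about the project's lint setup.

```cpp
// Hypothetical stand-in for a type-switch macro; only the NOLINT placement matters.
#define TYPE_SWITCH_SKETCH(is_double, type, ...) \
  if (is_double) {                               \
    using type = double;                         \
    { __VA_ARGS__ }                              \
  } else {                                       \
    using type = float;                          \
    { __VA_ARGS__ }                              \
  }

// Returns the size in bytes of the element type selected at runtime.
inline int element_size_sketch(bool is_double) {
  int size = 0;
  TYPE_SWITCH_SKETCH(
      is_double, T,
      size = static_cast<int>(sizeof(T));
  );  // NOLINT(*)
  return size;
}
```

For nested switches like the one in the diff, the marker would sit on the line that closes the macro calls.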

alextmagro · 2026-05-16T01:42:04Z

@@ -1 +1 @@
 /*************************************************************************


alextmagro · 2026-05-16T01:43:07Z

@@ -1,2 +1,2 @@
 /*************************************************************************
 * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.


alextmagro · 2026-05-16T01:43:59Z

@@ -1,2 +1,2 @@
 /*************************************************************************
 * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.


alextmagro · 2026-05-16T01:44:10Z

@@ -1 +1 @@
 # Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.


alextmagro · 2026-05-16T01:44:36Z

-    multi_tensor_scale,
-    multi_tensor_l2norm,
-    multi_tensor_unscale_l2norm,
+    multi_tensor_scale as _multi_tensor_scale,


These changes should be HIP-guarded, I think.

initial fused implementation

cc136ac

matthiasdiener self-assigned this May 13, 2026

matthiasdiener added the ci-level 1 CI test level 1 label May 13, 2026

broaden Adam support

c0f60da

matthiasdiener changed the title ~~Add a custom multi_tensor_l2norm_kernel~~ Add a custom multi_tensor_apply kernels (L2norm, Adam) May 15, 2026

matthiasdiener changed the title ~~Add a custom multi_tensor_apply kernels (L2norm, Adam)~~ Add custom multi_tensor_apply kernels (L2norm, Adam) May 15, 2026

alextmagro reviewed May 16, 2026

View reviewed changes

alextmagro requested changes May 16, 2026

View reviewed changes

matthiasdiener wants to merge 2 commits into dev from mdiener/multi_tensor_apply_kernel
