RMS Norm Optimization by aris134 · Pull Request #583 · ROCm/TransformerEngine

aris134 · 2026-05-12T12:13:34Z

Description

Fixes # (16527)

RMSNorm falls back to the general kernel implementation on several DeepSeek and Qwen shapes, which causes poor performance. This PR registers those shapes with the tuned kernel cache and adds a performance benchmark for RMSNorm.

Additionally, a fallback warning is printed the first time a tuned config is not found for a requested kernel. For example:

in function getKernel: Falling back to general normalization kernel because no tuned kernel is available for this config. hidden_size=128, wtype=bf16, itype=bf16, otype=bf16, ctype=fp32
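
A minimal sketch of how a warn-once fallback message could be implemented, assuming the kernel config is identified by a packed `uint64_t` key and that the warning is emitted once per distinct key. `should_warn_fallback` and `warn_fallback_once` are hypothetical names for illustration, not the functions added in this PR.

```cpp
#include <cstdint>
#include <iostream>
#include <mutex>
#include <string>
#include <unordered_set>

// Hypothetical helper: returns true only the first time a given key is seen.
// The mutex guards the set in case kernels are looked up from several threads.
inline bool should_warn_fallback(uint64_t key) {
  static std::mutex mu;
  static std::unordered_set<uint64_t> warned;
  std::lock_guard<std::mutex> lock(mu);
  return warned.insert(key).second;  // .second is true when the key was newly inserted
}

// Hypothetical wrapper: prints the fallback message once per key.
inline void warn_fallback_once(uint64_t key, const std::string& config) {
  if (should_warn_fallback(key)) {
    std::cerr << "Falling back to general normalization kernel because no tuned "
                 "kernel is available for this config. "
              << config << '\n';
  }
}
```

Keying the set on the packed config means repeated launches of the same untuned shape stay quiet, while a different untuned shape still produces its own warning.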

E2E TFLOP/s/GPU for proxy models (previous -> current with RMSNorm tuning):

Qwen:
bf16: 369.4 -> 374.7
fp8: 352.1 -> 358.2

DeepSeek:
bf16: 501.4 -> 529.4
fp8: 463.9 -> 511.4

Also added matching tuned configs for LayerNorm.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Change A
Change B

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

ipanfilo · 2026-05-15T01:24:06Z

+                     prop.multiProcessorCount, zero_centered_gamma, stream);
+  }
+
+  HIP_CHECK(hipStreamSynchronize(stream));


Is synchronization needed before warmup?

Good point. These are in fact redundant since the warmup already calls a device-wide sync anyway. Removed in 4256e3c

ipanfilo · 2026-05-15T01:25:43Z

 #include <typeindex>
 #include <unordered_map>
 #include <vector>
+#include <unordered_set>


nit: move it after unordered_map

Done in 2f9ff47

ipanfilo · 2026-05-15T01:41:04Z

                     bool is_tuned, NVTEScalingMode mode = NVTE_DELAYED_TENSOR_SCALING,
                     bool training = true, bool gamma_in_weight_dtype = false);

+inline DType decode_itype(uint64_t general_key) {


This code is fragile because the encoding could change. At least add comments here and at the encoding block stating that they must match.

Good point. I updated this in d548d54 to make the coupling between encoding/decoding explicit by introducing shared norm_key bit-layout constants and using them in both get_key() and the decode helpers. I also added comments documenting that the layouts must remain in sync, so future changes to the packed key format are less likely to silently diverge.
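
As an illustration of the approach described in this reply, here is a minimal sketch in which the encoder and the decode helpers read the same bit-layout constants. The namespace, the `Field` struct, and the field positions for `itype` and `otype` are assumptions made for this example. Only the stage offset of 22 is taken from the `get_key()` lines quoted in this review, and its 2-bit width is inferred from the next field starting at bit 24. The actual layout in d548d54 may differ.

```cpp
#include <cstdint>

namespace norm_key_sketch {  // hypothetical namespace, not the PR's code

// One description per field: bit offset and width. Both the encoder and the
// decoders read these constants, so the layout is defined in exactly one place.
struct Field {
  unsigned shift;
  unsigned bits;
};

// Assumed positions, for illustration only.
constexpr Field kIType{0, 4};
constexpr Field kOType{4, 4};
constexpr Field kStage{22, 2};

constexpr uint64_t mask(Field f) { return (uint64_t{1} << f.bits) - 1; }

constexpr uint64_t pack(Field f, uint64_t value) { return (value & mask(f)) << f.shift; }

constexpr uint64_t unpack(Field f, uint64_t key) { return (key >> f.shift) & mask(f); }

// Encoder: builds the key from the shared field descriptions.
constexpr uint64_t make_key(uint64_t itype, uint64_t otype, uint64_t stage) {
  return pack(kIType, itype) | pack(kOType, otype) | pack(kStage, stage);
}

// Decoders: use the same field descriptions as the encoder.
constexpr uint64_t decode_itype(uint64_t key) { return unpack(kIType, key); }
constexpr uint64_t decode_otype(uint64_t key) { return unpack(kOType, key); }

// Compile-time round-trip checks: the build fails if encode and decode diverge.
static_assert(decode_itype(make_key(3, 5, 1)) == 3, "itype round-trip");
static_assert(decode_otype(make_key(3, 5, 1)) == 5, "otype round-trip");

}  // namespace norm_key_sketch
```

The `static_assert` round-trips are one way to turn the "must remain in sync" comment into something the compiler enforces.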

alextmagro · 2026-05-16T00:48:29Z

+  HIP_CHECK(hipEventDestroy(stop));
+  HIP_CHECK(hipStreamDestroy(stream));
+
+  size_t bytes_read =


nit: line splits aren't needed here

alextmagro · 2026-05-16T00:49:20Z

+static void BM_NormBackward(benchmark::State& state) {
+  const size_t N = state.range(0);
+  const size_t H = state.range(1);
+  const float epsilon = 1e-5f;


epsilon is the same in forward and backward, so it could probably be a global constant.

alextmagro · 2026-05-16T00:53:05Z

 REGISTER_NORM_LAUNCHER(LayerNorm, Backward, tuned, 6144, bf16, bf16, bf16, fp32, 1, 1, 8, 16, 4);
 REGISTER_NORM_LAUNCHER(LayerNorm, Backward, tuned, 6144, bf16, fp32, bf16, fp32, 1, 1, 8, 16, 4);

+REGISTER_NORM_LAUNCHER(LayerNorm, Backward, tuned, 7168, fp32, fp32, fp32, fp32, 1, 1, 7, 8, 4);


Does BYTES_PER_LDG=8 outperform 16 for this config? If so, I wonder if the configs around it would perform better that way too.
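
For context on this question, a rough sketch of the arithmetic behind these parameters. It assumes the trailing macro arguments are read as CTAS_PER_ROW, WARPS_M, WARPS_N, BYTES_PER_LDG (which is how the review comments read them), that a row must tile evenly across `CTAS_PER_ROW * WARPS_N * threads_per_warp` threads each loading `BYTES_PER_LDG` bytes, and that a warp has 64 threads. `loads_per_thread` is a hypothetical helper, not part of the library.

```cpp
#include <cstdint>

// Hypothetical helper mirroring the divisibility rule assumed above.
// Returns the number of vectorized loads each thread issues per row,
// or 0 if the row does not tile evenly across the participating threads.
inline uint32_t loads_per_thread(uint32_t hidden_size, uint32_t elem_bytes,
                                 uint32_t ctas_per_row, uint32_t warps_n,
                                 uint32_t threads_per_warp, uint32_t bytes_per_ldg) {
  // A vectorized load must hold a whole number of elements.
  if (elem_bytes == 0 || bytes_per_ldg % elem_bytes != 0) return 0;

  const uint32_t row_bytes = hidden_size * elem_bytes;
  // Bytes covered when every participating thread issues one load.
  const uint32_t bytes_per_pass = ctas_per_row * warps_n * threads_per_warp * bytes_per_ldg;
  if (bytes_per_pass == 0 || row_bytes % bytes_per_pass != 0) return 0;

  return row_bytes / bytes_per_pass;
}
```

Under these assumptions, `loads_per_thread(7168, 4, 1, 7, 64, 8)` gives 8 and `loads_per_thread(7168, 4, 1, 7, 64, 16)` gives 4, so both load widths tile a 7168-wide fp32 row with 7 warps. Which one is faster is a tuning question that only a benchmark can answer.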

alextmagro · 2026-05-16T00:54:14Z

 REGISTER_NORM_LAUNCHER(LayerNorm, Forward, tuned, 6144, bf16, bf16, bf16, fp32, 1, 1, 4, 16);
 REGISTER_NORM_LAUNCHER(LayerNorm, Forward, tuned, 6144, fp32, fp32, bf16, fp32, 1, 1, 4, 16);

+REGISTER_NORM_LAUNCHER(LayerNorm, Forward, tuned, 7168, fp32, fp32, fp32, fp32, 1, 1, 4, 16);


In BWD you set 7 warps, but here you have 4. Is this optimal?

alextmagro · 2026-05-16T00:55:21Z

 REGISTER_NORM_LAUNCHER(RMSNorm, Backward, tuned, 4096, fp16, fp16, fp16, fp32, 1, 1, 4, 16, 4);
 REGISTER_NORM_LAUNCHER(RMSNorm, Backward, tuned, 4096, bf16, bf16, bf16, fp32, 1, 1, 4, 16, 4);

+REGISTER_NORM_LAUNCHER(RMSNorm, Backward, tuned, 7168, fp32, fp32, fp32, fp32, 1, 1, 4, 16, 4);


Same here, is 4 better than 7?

alextmagro · 2026-05-16T00:57:27Z

-                         (uint64_t(NormStage)) << 22 | (uint64_t(NormBackend) << 24) |
-                         (uint64_t(zero_centered_gamma) << 26) | (uint64_t(mode) << 27) |
-                         (uint64_t(training) << 37) | (uint64_t(gamma_in_weight_dtype) << 38);
+  uint64_t general_key =


I get the motivation behind this change, but this affects upstream code. I feel like we're more likely to miss a key change from upstream if we have diverged here.

aris134 added 2 commits May 11, 2026 14:53

add rmsnorm perf benchmark

f639c6e

add rmsnorm perf benchmark and missing tuned DS configs

b5720e9

aris134 requested a review from alextmagro May 12, 2026 12:13

aris134 self-assigned this May 12, 2026

aris134 added 2 commits May 12, 2026 12:38

add missing tuned shape for Qwen and update benchmark

6c2cd28

move benchmark to benchmarks folder and rewrite in google benchmark style

912c62b

aris134 marked this pull request as ready for review May 12, 2026 19:15

aris134 requested review from ipanfilo, wangye805 and wenchenvincent as code owners May 12, 2026 19:15

aris134 added 5 commits May 12, 2026 19:44

add matching tuned configs for layernorm

c24a091

add fallback warning print if tuned config not found for normalization

3d8e1de

add fallback warning message when tuned normalization kernel config not found

856346d

uncomment qwen configs

d5293cb

generalization rms norm benchmark to also include layer norm, and add missing configs for layer norm

78f9aa5

ipanfilo reviewed May 15, 2026

aris134 added 3 commits May 15, 2026 16:00

remove redundant synchronization before gpu warmup in bench_normalization.cpp

4256e3c

address nit: move unordered_set to after unordered_map

2f9ff47

share normalization key bit layout constants

d548d54

aris134 requested a review from ipanfilo May 15, 2026 16:40

alextmagro requested changes May 16, 2026

aris134 wants to merge 12 commits into dev from amartin/rmsnorm

3 participants
