add MXFP8 pre-swizzling for gfx1250 GEMM by matthiasdiener · Pull Request #568 · ROCm/TransformerEngine

matthiasdiener · 2026-04-29T16:59:38Z

Description

Fixes https://github.com/ROCm/frameworks-internal/issues/16428

This was lightly tested on gfx1250.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

alextmagro

Hi Matthias, a few comments. I also assume you are still planning on adding in the hooks to scale swizzle when we're on gfx1250? I believe there were hooks in all of common, pytorch and jax. These PRs removed them, so would be a partial revert.

#420
#424
#442

alextmagro · 2026-05-05T14:29:00Z

-    asm volatile("ds_swizzle_b32 %0, %1 offset:0x041F\n\t"
-                 "s_waitcnt lgkmcnt(0)" : "=v"(r) : "v"(v));
-    return r;
+    return __shfl_xor(v, 1);


Do we still need these helper functions now that we're just doing a __shfl_xor?

This change is only inadvertently part of this PR; it is already part of #571. Will revert here.
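For readers wondering why the two forms in the diff are interchangeable, here is a host-side sketch. It assumes the bit-mask mode of `ds_swizzle_b32` as described in AMD's ISA documentation (offset bit 15 clear, AND mask in bits 4:0, OR mask in bits 9:5, XOR mask in bits 14:10, applied to the lane index within a group of 32 lanes). The helper names are hypothetical and are not part of this PR.

```cpp
#include <cstdint>

// Host-side model only; these helpers are illustrative, not TransformerEngine code.
// Source lane read by ds_swizzle in bit-mask mode, within one 32-lane group.
inline int swizzle_src_lane(int lane, uint16_t offset) {
  const int and_mask = offset & 0x1F;          // bits 4:0
  const int or_mask = (offset >> 5) & 0x1F;    // bits 9:5
  const int xor_mask = (offset >> 10) & 0x1F;  // bits 14:10
  return ((lane & and_mask) | or_mask) ^ xor_mask;
}

// Source lane read by an XOR shuffle with the given lane mask.
inline int shfl_xor_src_lane(int lane, int lane_mask) { return lane ^ lane_mask; }

// True when both forms read from the same source lane for all 32 lanes.
inline bool same_permutation(uint16_t offset, int lane_mask) {
  for (int lane = 0; lane < 32; ++lane) {
    if (swizzle_src_lane(lane, offset) != shfl_xor_src_lane(lane, lane_mask)) {
      return false;
    }
  }
  return true;
}
```

Under that assumed encoding, offset `0x041F` decodes to AND mask `0x1F`, OR mask `0`, and XOR mask `1`, so each lane reads from its neighbour (`lane ^ 1`), which is the permutation `__shfl_xor(v, 1)` requests.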

alextmagro · 2026-05-05T14:34:48Z

+  const int k = idx % K_scale;
+
+  uint8_t val = 127;
+  if (m < original_M && k < original_K) {


Could we move this check to the host side, or remove it completely?

Moved to the host side in b55a538
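One way a per-element bounds check can leave a kernel is to hand the kernel a scale buffer that is already padded on the host. The thread does not show how b55a538 does it, so the following is only a sketch under that assumption. `pad_scales` and `kE8M0One` are hypothetical names. The fill value 127 matches the default in the quoted diff: E8M0 stores a biased exponent with value 2^(e - 127), so 127 encodes 1.0.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// E8M0 holds only a biased exponent: value = 2^(e - 127), so 127 encodes 1.0.
constexpr uint8_t kE8M0One = 127;

// Hypothetical host-side helper: copy an [orig_m, orig_k] row-major scale
// matrix into a padded [pad_m, pad_k] buffer. Padding entries are set to 1.0
// so a kernel can read the whole padded buffer without a bounds check.
// Assumes pad_m >= orig_m and pad_k >= orig_k.
inline std::vector<uint8_t> pad_scales(const std::vector<uint8_t>& src,
                                       size_t orig_m, size_t orig_k,
                                       size_t pad_m, size_t pad_k) {
  std::vector<uint8_t> dst(pad_m * pad_k, kE8M0One);
  for (size_t m = 0; m < orig_m; ++m) {
    for (size_t k = 0; k < orig_k; ++k) {
      dst[m * pad_k + k] = src[m * orig_k + k];
    }
  }
  return dst;
}
```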

alextmagro · 2026-05-05T14:44:19Z

 #include <cstdint>

 #include "../common.h"
+#include "../util/cuda_runtime.h"


Why is this include needed?

Removed in b55a538

alextmagro · 2026-05-05T14:52:28Z

             " (got shape=", shape, ")");
 #ifdef USE_ROCM
+  // gfx1250 MX pre-swizzle (Tensile 3D) layout requires M padded to multiple of 4.
+  // Other ROCm architectures use 128x4 tiles but currently skip padding


I'm not sure this is true regarding us using 128x4 tiles. 128x4 scaling is an upstream requirement. We also have padding expectations in pytorch, jax, and all 3 test dirs have padding that will probably need fixing.

Removed the comment in b55a538
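The padding discussed in this comment comes down to rounding a dimension up to a multiple: 4 for M in the quoted code comment, and 128 or 4 for the tile shape mentioned in the reply. A minimal sketch of that arithmetic follows; `round_up` is a hypothetical helper, not a function from this PR.

```cpp
#include <cstddef>

// Round `value` up to the next multiple of `multiple` (requires multiple > 0).
constexpr size_t round_up(size_t value, size_t multiple) {
  return (value + multiple - 1) / multiple * multiple;
}
```

For example, `round_up(5, 4)` gives 8 and `round_up(130, 128)` gives 256, while values that are already aligned are returned unchanged.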

matthiasdiener · 2026-05-13T21:55:13Z

> I also assume you are still planning on adding in the hooks to scale swizzle when we're on gfx1250? I believe there were hooks in all of common, pytorch and jax. These PRs removed them, so would be a partial revert.
>
> #420 #424 #442

The hooks should be re-added in 384d590.

alextmagro · 2026-05-16T01:03:56Z

+// Simple GPU reference kernel for MXFP8 GEMM: D = A * B^T  (TN layout)
+// A is [M, K] row-major, B is [N, K] row-major, D is [M, N] column-major
+// Scales are E8M0, one per group of 32 elements along K.
+__global__ void mxfp8_gemm_ref_kernel(


Why do we need a second mxfp8 reference kernel?
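To make the layout in the quoted kernel comment concrete, here is a CPU sketch of the same computation. It is not the kernel from this PR. It assumes the FP8 operands have already been decoded to `float`, that K is a multiple of 32, and that scales are stored unswizzled as `[M, K/32]` and `[N, K/32]` row-major. The names `e8m0_to_float` and `mx_gemm_ref` are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Decode an E8M0 scale: value = 2^(e - 127). The NaN encoding (255) is not
// handled in this sketch.
inline float e8m0_to_float(uint8_t e) {
  return std::ldexp(1.0f, static_cast<int>(e) - 127);
}

// Hypothetical CPU reference for D = A * B^T (TN layout).
// a: [M, K] row-major, b: [N, K] row-major, both already decoded to float.
// a_scale: [M, K/32], b_scale: [N, K/32], one E8M0 scale per 32 elements of K.
// Returns d as [M, N] column-major. Assumes K % 32 == 0.
inline std::vector<float> mx_gemm_ref(const std::vector<float>& a,
                                      const std::vector<uint8_t>& a_scale,
                                      const std::vector<float>& b,
                                      const std::vector<uint8_t>& b_scale,
                                      size_t M, size_t N, size_t K) {
  constexpr size_t kBlock = 32;
  const size_t k_blocks = K / kBlock;
  std::vector<float> d(M * N, 0.0f);
  for (size_t m = 0; m < M; ++m) {
    for (size_t n = 0; n < N; ++n) {
      float acc = 0.0f;
      for (size_t kb = 0; kb < k_blocks; ++kb) {
        // Accumulate one 32-element block, then apply both block scales.
        float partial = 0.0f;
        for (size_t k = kb * kBlock; k < (kb + 1) * kBlock; ++k) {
          partial += a[m * K + k] * b[n * K + k];
        }
        acc += partial * e8m0_to_float(a_scale[m * k_blocks + kb]) *
               e8m0_to_float(b_scale[n * k_blocks + kb]);
      }
      d[n * M + m] = acc;  // column-major store
    }
  }
  return d;
}
```

Because the reference reads scales in plain row-major order, comparing it against a GEMM that consumes pre-swizzled scales exercises the swizzle itself rather than assuming it is correct.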

alextmagro · 2026-05-16T01:05:01Z

+class MxGemmSwizzleGfx1250TestSuite
+    : public ::testing::TestWithParam<MxGemmParams> {};
+
+TEST_P(MxGemmSwizzleGfx1250TestSuite, TestMxfp8GemmE2E) {


My understanding is we must swizzle scales for gfx1250. I think ideally we would fuse this with the existing mxfp8 GEMM tests -- pre-1250 we don't swizzle, 1250+ we do.

alextmagro · 2026-05-16T01:11:01Z


+#ifdef USE_ROCM
+  // On ROCm, only MXFP8 on gfx1250 needs scale pre-swizzling
+  if (scaling_mode != NVTE_MXFP8_1D_SCALING || transformer_engine::cuda::sm_arch() != 125) {


Sometimes we use == 125, sometimes >= 125. We should probably be consistent and pick one or the other.
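One way to keep the comparison consistent is to route every call site through a single predicate. The sketch below is only an illustration: `needs_mx_scale_preswizzle` and `kGfx1250Arch` are hypothetical names, and `arch` stands for whatever an architecture query such as `sm_arch()` returns, with 125 denoting gfx1250 as in the quoted diff.

```cpp
// Hypothetical constant: the arch value that denotes gfx1250 in the quoted diff.
constexpr int kGfx1250Arch = 125;

// Single place that decides whether MX scales need pre-swizzling.
// Whether the right comparison is `==` or `>=` is the open question in this
// review; `==` is used here only to match the quoted line.
constexpr bool needs_mx_scale_preswizzle(bool is_mxfp8_1d_scaling, int arch) {
  return is_mxfp8_1d_scaling && arch == kGfx1250Arch;
}
```

With a helper like this, switching between `==` and `>=` later is a one-line change instead of a search across common, pytorch, and jax.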

matthiasdiener added 3 commits April 27, 2026 15:36

add MX scale pre-swizzling for gfx1250

bc363fa

switch to mxfp4

a6ca3af

tensile-like implementation

d1ee5bd

matthiasdiener self-assigned this Apr 29, 2026

Merge remote-tracking branch 'upstream/dev' into mdiener/mxfp8-swizzle

d1647ee

matthiasdiener added the ci-level 1 (CI test level 1) label Apr 29, 2026

matthiasdiener added 9 commits May 1, 2026 18:41

Merge remote-tracking branch 'origin/dev' into mdiener/mxfp8-swizzle

1fff6d9

gfx1250 swizzle_xor changes for FP4

d714038

change line endings to unix, trim trailing whitespace

76ca4b1

Merge branch 'mdiener/swizzle_xor-1250' into mdiener/mxfp8-swizzle

81a0a27

fix arch

2991bcf

[WIP] e2e gemm test, not working yet

8ceb89c

fix for gfx1250

167d2eb

k-tile

5d46537

extend tests

313a6b7

matthiasdiener force-pushed the mdiener/mxfp8-swizzle branch from ddf19da to 313a6b7 Compare May 3, 2026 22:06

remove ifdef

2a8eeb5

matthiasdiener requested a review from alextmagro May 4, 2026 16:33

undo BLK32_UE8M0_32_8_EXT

c37a781

alextmagro requested changes May 5, 2026

matthiasdiener added 5 commits May 5, 2026 10:16

Merge remote-tracking branch 'upstream/dev' into mdiener/mxfp8-swizzle

5d2d38f

Revert "change line endings to unix, trim trailing whitespace"

f093f64

This reverts commit 76ca4b1.

Revert "gfx1250 swizzle_xor changes for FP4"

ecbffea

This reverts commit d714038.

Merge remote-tracking branch 'origin/dev' into mdiener/mxfp8-swizzle

33fca6e

address review comments

b55a538

matthiasdiener changed the title from "[proof-of-concept] add MXFP8 pre-swizzling for gfx1250" to "add MXFP8 pre-swizzling for gfx1250 GEMM" May 13, 2026

matthiasdiener added 2 commits May 13, 2026 21:16

cleanups

398cc3c

re-add scale swizzle hooks in GEMM paths for gfx1250

384d590

cleanups

5c5a902

arch fixes

2c05ec5

matthiasdiener requested a review from alextmagro May 14, 2026 20:20

matthiasdiener marked this pull request as ready for review May 14, 2026 20:21

matthiasdiener requested review from ipanfilo, wangye805 and wenchenvincent as code owners May 14, 2026 20:21

alextmagro reviewed May 16, 2026
