
Adds GEMM Profiling Guide to TE #2863

Open
jomitchellnv wants to merge 3 commits into NVIDIA:main from jomitchellnv:jm/gemm-blog

Conversation

@jomitchellnv
Contributor

Description

Adds a GEMM profiling guide to the Transformer Engine documentation and a companion benchmark tool. The guide
explains how to derive all 12 per-layer GEMM shapes (Fprop, Dgrad, Wgrad) from transformer model
hyperparameters, benchmark them across precisions (BF16, FP8 Block, MXFP8, NVFP4), and interpret the resulting
speedup estimates.
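
As a rough illustration of the shape derivation described above, here is a minimal sketch. It assumes the 12 shapes come from four linear layers per transformer block (fused QKV projection, attention output projection, MLP up-projection, MLP down-projection), each with Fprop, Dgrad, and Wgrad passes. The layer names, the `GemmShape` class, and the `derive_gemm_shapes` function are hypothetical and are not taken from `benchmark_gemm.py`; gated MLPs or grouped-query attention would change the output dimensions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GemmShape:
    op: str         # hypothetical layer name
    gemm_pass: str  # "fprop", "dgrad", or "wgrad"
    m: int
    k: int
    n: int

    @property
    def flops(self) -> int:
        # One multiply and one add per (m, k, n) triple.
        return 2 * self.m * self.k * self.n


def derive_gemm_shapes(tokens: int, hidden_size: int, intermediate_size: int):
    """Return 12 (M, K, N) shapes: 4 assumed linear layers x 3 passes."""
    # (name, in_features, out_features) -- an assumed layer structure.
    linears = [
        ("qkv_proj", hidden_size, 3 * hidden_size),
        ("attn_out_proj", hidden_size, hidden_size),
        ("mlp_up", hidden_size, intermediate_size),
        ("mlp_down", intermediate_size, hidden_size),
    ]
    shapes = []
    for name, k_in, n_out in linears:
        # Fprop: Y[tokens, out] = X[tokens, in] @ W^T
        shapes.append(GemmShape(name, "fprop", tokens, k_in, n_out))
        # Dgrad: dX[tokens, in] = dY[tokens, out] @ W
        shapes.append(GemmShape(name, "dgrad", tokens, n_out, k_in))
        # Wgrad: dW[out, in] = dY^T[out, tokens] @ X[tokens, in]
        shapes.append(GemmShape(name, "wgrad", n_out, tokens, k_in))
    return shapes


shapes = derive_gemm_shapes(tokens=8192, hidden_size=4096, intermediate_size=16384)
```

Under these assumptions, the three passes of a given layer permute the same three dimensions, so they have identical FLOP counts even though their (M, K, N) layouts differ.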

The benchmark tool supports two modes: model config mode (derives shapes automatically from hidden_size,
intermediate_size, etc.) and manual shape mode (explicit MxKxN triplets). It measures both autocast performance
(realistic end-to-end with quantization overhead) and pre-quantized kernel-only throughput, using CUDA events
or torch.profiler timing backends.
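
To make the manual shape mode concrete, the sketch below parses an MxKxN triplet and converts a measured time into TFLOPS using the usual 2·M·K·N FLOP count for a GEMM. The helper names `parse_shape` and `tflops` are hypothetical; the actual parsing and reporting code in the tool may differ.

```python
def parse_shape(spec: str) -> tuple[int, int, int]:
    """Parse a string such as '8192x4096x4096' into (M, K, N)."""
    parts = spec.lower().split("x")
    if len(parts) != 3:
        raise ValueError(f"expected MxKxN, got {spec!r}")
    m, k, n = (int(p) for p in parts)
    if min(m, k, n) <= 0:
        raise ValueError(f"dimensions must be positive, got {spec!r}")
    return m, k, n


def tflops(m: int, k: int, n: int, elapsed_ms: float) -> float:
    """Throughput of one GEMM given its elapsed time in milliseconds."""
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    return flops / (elapsed_ms * 1e-3) / 1e12


m, k, n = parse_shape("8192x4096x4096")
throughput = tflops(m, k, n, elapsed_ms=0.5)
```

The same formula applies to both timing backends; only the source of `elapsed_ms` changes.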

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Add benchmarks/gemm/benchmark_gemm.py — standalone GEMM benchmark tool supporting BF16, FP8 Block, MXFP8, and
    NVFP4 precisions with autocast and pre-quantized modes, CUDA event and torch.profiler timing, Nsight Systems
    integration, and bar-chart output

  • Add docs/features/low_precision_training/gemm_profiling/gemm_profiling.rst — documentation covering GEMM
    shape derivation from model configs, forward/backward pass shape conventions, precision mapping per GEMM pass,
    speedup calculation methodology, and a worked example on B300

  • Add benchmark result plots (img/model_config_speedup.png, img/model_config_speedup_prequant.png)

  • Update docs/features/low_precision_training/index.rst toctree to include the new guide

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>
@jomitchellnv jomitchellnv changed the title from "adds blog post" to "Adds GEMM Profiling Guide to TE" on Apr 9, 2026
@greptile-apps
Contributor

greptile-apps Bot commented Apr 9, 2026

Greptile Summary

This PR adds a standalone GEMM benchmark tool (benchmarks/gemm/benchmark_gemm.py) and an accompanying RST documentation guide covering GEMM shape derivation, precision mapping (BF16, FP8 Block, MXFP8, NVFP4), and speedup methodology for transformer models. The documentation is clear and accurate; the benchmark code is well-structured with two distinct operating modes (model-config and manual-shape).

Three issues were flagged in earlier review rounds (FP8Block silently absent from shape mode, a dead or True condition in the chart loop, and a docstring/kernel-pattern mismatch). One additional finding in this pass: the saved bar chart in create_model_config_plot() always uses the Fprop×2 Dgrad approximation even when --verify-dgrad supplies measured Dgrad times, creating a silent inconsistency between the printed table and the output figure.
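
One way to keep the chart and the printed table consistent is to route both through a single helper that prefers the measured Dgrad time when it exists. The sketch below is hypothetical (the function name and signature are not from the PR) and reads "Fprop x2" as Fprop + Dgrad ≈ 2 × Fprop, which is an assumption about the tool's approximation.

```python
from typing import Optional


def fprop_plus_dgrad_ms(fprop_ms: float,
                        measured_dgrad_ms: Optional[float] = None) -> float:
    """Combined Fprop + Dgrad time for one GEMM, in milliseconds."""
    if measured_dgrad_ms is not None:
        # A measured Dgrad time is available: use it directly.
        return fprop_ms + measured_dgrad_ms
    # No measurement: assume Dgrad costs about the same as Fprop.
    return 2.0 * fprop_ms


approx = fprop_plus_dgrad_ms(1.5)
measured = fprop_plus_dgrad_ms(1.5, measured_dgrad_ms=1.0)
```

If both the summary printer and the plotting function called a helper like this, a measured Dgrad value could not be silently replaced by the approximation in only one of them.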

Confidence Score: 4/5

Safe to merge after addressing the open issues from prior review threads; the new chart-inconsistency finding is P2 but worth fixing before finalising the guide.

Three P2 issues from prior review threads remain open (FP8Block absent from shape mode is the most user-visible). The new finding — the plot ignoring measured Dgrad when --verify-dgrad is used — is also P2 but contributes to incorrect figures in the guide. No P0/P1 blocking issues. Score is 4 rather than 5 because of the cumulative unresolved P2s, one of which (FP8Block omission) is advertised functionality that silently doesn't work.

benchmarks/gemm/benchmark_gemm.py — FP8Block in shape mode, dead condition, docstring mismatch (prior threads), and chart/dgrad inconsistency (this pass).

Important Files Changed

| Filename | Overview |
| --- | --- |
| benchmarks/gemm/benchmark_gemm.py | New 1609-line benchmark tool; three issues were flagged in previous threads (FP8Block absent from shape mode, dead or True condition, docstring/pattern mismatch). One additional P2 issue found: the saved chart always approximates Dgrad=Fprop*2 even when --verify-dgrad provides measured results. |
| docs/features/low_precision_training/gemm_profiling/gemm_profiling.rst | New 579-line documentation guide covering GEMM shape derivation, precision mapping, speedup methodology, and a worked B300 example; accurate and well-structured. |
| docs/features/low_precision_training/index.rst | Adds the new gemm_profiling toctree entry and fixes a missing newline at end of file. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[python benchmark_gemm.py] --> B{Mode?}
    B -- "--hidden_size etc." --> C[Model Config Mode\nrun_model_config_benchmarks]
    B -- "--shapes or default" --> D[Shape Mode\nrun_benchmarks]
    B -- "--profile" --> E[Nsight Profile Mode\nrun_benchmarks with single shape]
    C --> F[compute_gemm_shapes\nFprop / Dgrad / Wgrad]
    F --> G[_benchmark_single_shape\nBF16 + FP8Block + MXFP8 + NVFP4]
    G --> H{--verify-dgrad?}
    H -- Yes --> I[Benchmark Dgrad shapes\ndgrad_results measured]
    H -- No --> J[Dgrad approximated as Fprop x2]
    I --> K[Print per-layer summary\nuses measured times]
    J --> K
    K --> L[create_model_config_plot\nalways uses Fprop x2 for chart]
    D --> M[benchmark_bf16\nbenchmark_fp8 MXFP8\nbenchmark_fp4 NVFP4]
    M --> N[create_plot\nTFLOPS bar chart]
    E --> D

Reviews (3): Last reviewed commit: "fixes tests"

Comment on lines +794 to +799
results: dict[str, list[float]] = {"BF16": [], "MXFP8": [], "NVFP4": []}
time_results: dict[str, list[float]] = {"BF16": [], "MXFP8": [], "NVFP4": []}

has_blackwell = is_blackwell_available()
run_fp8 = include_fp8 and TE_AVAILABLE
run_fp4 = include_fp4 and TE_AVAILABLE and has_blackwell

P1 FP8Block silently omitted in shape mode

run_benchmarks() (used for both default square-shape benchmarks and explicit --shapes invocations) never calls benchmark_fp8_block / benchmark_fp8_block_prequantized. The results dict is initialized with only "BF16", "MXFP8", and "NVFP4", and the function has no include_fp8_block parameter — so the --no-fp8-block flag parsed in main() is only forwarded to run_model_config_benchmarks (line 1579) and has no effect here.

Users who run the tool in shape mode (no model-config flags) will silently receive BF16/MXFP8/NVFP4 data only, even though the module docstring advertises "BF16, FP8 Block, MXFP8, and NVFP4 precisions."

To fix, add include_fp8_block: bool = True to run_benchmarks, initialise results["FP8Block"] = [], select fp8_block_fn the same way model-config mode does, and forward the flag from main().
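
A minimal sketch of the idea behind this fix: build the results dict from the set of enabled precisions so that no precision is hard-coded out. The function `init_results` and its parameters are hypothetical, and gating FP8Block only on Transformer Engine availability is an assumption; the real hardware requirements should follow whatever model-config mode already does.

```python
def init_results(include_fp8: bool = True,
                 include_fp8_block: bool = True,
                 include_fp4: bool = True,
                 te_available: bool = True,
                 has_blackwell: bool = True) -> dict[str, list[float]]:
    """Return an empty results dict keyed by each enabled precision."""
    enabled = {
        "BF16": True,
        # Assumption: FP8Block needs only Transformer Engine to be present.
        "FP8Block": include_fp8_block and te_available,
        "MXFP8": include_fp8 and te_available,
        "NVFP4": include_fp4 and te_available and has_blackwell,
    }
    return {name: [] for name, on in enabled.items() if on}


results = init_results()
without_block = init_results(include_fp8_block=False)
```

Deriving the keys from flags also makes a `--no-fp8-block` style option visibly take effect in shape mode, since the key disappears from the output rather than never having existed.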

Comment on lines +1355 to +1367
color=op_color,
alpha=0.9,
label=f"{op_label} (Fprop+Dgrad)" if i == 0 or True else "",
)
ax.bar(
x,
wgrad_ms,
bar_width,
bottom=all_fprop_total + total_wgrad_bottom,
color=op_color,
alpha=0.5,
label=f"{op_label} (Wgrad)" if i == 0 or True else "",
)

P2 Dead condition if i == 0 or True always evaluates to True

Both label= expressions use if i == 0 or True, which unconditionally takes the True branch. This is dead code — or True makes the condition tautological. The intent was likely either True (always label, which is fine for a stacked bar chart) or if i == 0 (label only the first series). Clean it up to express intent clearly:

Suggested change
- label=f"{op_label} (Fprop+Dgrad)" if i == 0 or True else "",
+ label=f"{op_label} (Fprop+Dgrad)",

and

Suggested change
- label=f"{op_label} (Wgrad)" if i == 0 or True else "",
+ label=f"{op_label} (Wgrad)",
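
For readers unfamiliar with the precedence involved, the short demonstration below shows why the condition is tautological. A conditional expression binds more loosely than `or`, so the test is `(i == 0) or True`. The function names are illustrative only.

```python
def label_as_written(i: int, op_label: str) -> str:
    # Parsed as: value if ((i == 0) or True) else ""
    return f"{op_label} (Wgrad)" if i == 0 or True else ""


def label_first_only(i: int, op_label: str) -> str:
    # Alternative intent: label only the first series.
    return f"{op_label} (Wgrad)" if i == 0 else ""


as_written = [label_as_written(i, "QKV") for i in range(3)]
first_only = [label_first_only(i, "QKV") for i in range(3)]
```

`as_written` carries the label for every index, which is the same result the simplified `label=f"{op_label} (Wgrad)"` produces.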

Comment on lines +18 to +21
* **profiler** -- ``torch.profiler`` (CUPTI) kernel timestamps.
Only the matched GEMM compute kernels (nvjet, xmma, cutlass, cublas)
are summed, giving a kernel-only measurement.


P2 Docstring lists "cublas" but the pattern tuple uses "gemm" instead

The module docstring (line 19) lists the matched kernel patterns as (nvjet, xmma, cutlass, cublas), but GEMM_KERNEL_PATTERNS at line 70 is ("gemm", "nvjet", "xmma", "cutlass"); "cublas" is absent and "gemm" was added in its place. In practice "gemm" does catch cuBLAS kernels (their names contain gemm), so the behaviour is correct, but the docstring is inaccurate and may confuse users auditing kernel coverage.

Suggested change
- * **profiler** -- ``torch.profiler`` (CUPTI) kernel timestamps.
- Only the matched GEMM compute kernels (nvjet, xmma, cutlass, cublas)
- are summed, giving a kernel-only measurement.
+ * **profiler** -- ``torch.profiler`` (CUPTI) kernel timestamps.
+ Only the matched GEMM compute kernels (gemm, nvjet, xmma, cutlass)
+ are summed, giving a kernel-only measurement.
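
The substring matching described in this comment can be sketched as follows. The pattern tuple is the one quoted above; the `is_gemm_kernel` helper, the case-insensitive comparison, and the example kernel names are assumptions for illustration and may not match how the tool filters profiler events.

```python
GEMM_KERNEL_PATTERNS = ("gemm", "nvjet", "xmma", "cutlass")


def is_gemm_kernel(kernel_name: str) -> bool:
    """True if the kernel name contains any of the GEMM patterns."""
    lowered = kernel_name.lower()  # assumption: match case-insensitively
    return any(pattern in lowered for pattern in GEMM_KERNEL_PATTERNS)


# Illustrative kernel names, not captured from a real profile.
names = [
    "ampere_sgemm_128x64_nn",
    "cutlass_80_tensorop_bf16_kernel",
    "vectorized_elementwise_kernel",
]
matched = [name for name in names if is_gemm_kernel(name)]
```

A kernel whose name contains "gemm" matches without a separate "cublas" pattern, which is consistent with the observation that only the docstring needs to change.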

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>
@pggPL pggPL self-requested a review April 10, 2026 14:00
@pggPL
Collaborator

pggPL commented Apr 13, 2026

Hi @jomitchellnv, I see that this PR is open, but "Documentation" job is failing. If you fix it, please ping me and I'll review it.

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>
2 participants