
Add DSV4 FP4 GB200 dynamo-sglang MTP disagg benchmarks #1370

Open

Fridge003 wants to merge 6 commits into main from sglang-dsagg-gb200-0513

Conversation

@Fridge003 (Collaborator)

No description provided.

@github-actions (Contributor)

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that, after merging, all GitHub Actions jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring the re-run passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Comment on lines +38 to +44

dynamo:
  hash: "81d0555ee23519cea80a42b4fe824e30368b7300"
  install: true

slurm:
  time_limit: "03:00:00"
Contributor

🔴 The recipe file pins a different container (lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev) and dynamo hash (81d0555ee23519cea80a42b4fe824e30368b7300) than the 7 other new GB200 recipes in this PR, which all use lmsysorg/sglang:nightly-dev-cu13-20260513-4fb40bff and 34d55a596fb8d3d44daefe425ec1e303131f4d2c — matching the single image: declared by the master yaml entry for dsv4-fp4-gb200-dynamo-sglang. The divergent values match the gb300 sibling recipes in the same directory, suggesting a copy-paste leftover from gb300. Within a single config-key sweep, the concurrency=64 datapoint will run on a fundamentally different SGLang build and dynamo binary than the rest of the search space — please either homogenize to match the other 7 GB200 recipes, or document why this point intentionally diverges.

Extended reasoning...

What the bug is

disagg-gb200-1p4d-dep8-tp8-10-c64.yaml is the lone exception in the new GB200 sweep:

# lines 38-44 of the file
model:
  path: "deepseek-v4-pro"
  container: "lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev"
  precision: "fp4"

dynamo:
  hash: "81d0555ee23519cea80a42b4fe824e30368b7300"

All 7 other new GB200 recipes added in this PR (1p1d-dep8-dep16, 1p1d-tp8-tp8, 1p2d-dep8-dep16, 2p1d-dep8-dep16, 4p1d-dep8-dep16, 5p1d-dep8-dep16, 6p1d-dep8-dep12) pin:

  • container: lmsysorg/sglang:nightly-dev-cu13-20260513-4fb40bff
  • dynamo hash: 34d55a596fb8d3d44daefe425ec1e303131f4d2c

These match the single image: declared at the config-key level in .github/configs/nvidia-master.yaml (line 8181) for dsv4-fp4-gb200-dynamo-sglang.

Why this looks like a paste leftover

The divergent values in the 1p4d file (lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev / hash 81d0555ee23519cea80a42b4fe824e30368b7300) are exactly what the pre-existing disagg-gb300-*.yaml recipes in the same directory use. The 1p4d-dep8-tp8 file's decode block is also structurally aligned with the gb300 narrow-EP shape (model-path: /model/, disable-radix-cache, moe-runner-backend: flashinfer_mxfp4, TP=8/DP=1/EP=1 — no deepep / dp-attn), while every other new GB200 recipe uses deepep + dp-attn with TP=16/DP=16/EP=16. The simplest explanation is that this file was copy-pasted from a gb300 narrow-EP recipe and the model/dynamo block wasn't updated.

Step-by-step proof

  1. Open .github/configs/nvidia-master.yaml line 8181: dsv4-fp4-gb200-dynamo-sglang block declares one image: lmsysorg/sglang:nightly-dev-cu13-20260513-4fb40bff and references all 8 recipe files in its search-space.
  2. grep -A2 '^model:' benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/disagg-gb200-*.yaml — 7 of 8 files print container: lmsysorg/sglang:nightly-dev-cu13-20260513-4fb40bff; only disagg-gb200-1p4d-dep8-tp8-10-c64.yaml prints container: lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev.
  3. grep -A1 '^dynamo:' …/disagg-gb200-*.yaml — same 7-vs-1 split for the dynamo hash.
  4. grep -A2 '^model:' …/disagg-gb300-*.yaml shows the gb300 recipes use the -staging:deepseek-v4-grace-blackwell-dev container and hash 81d0555ee23519cea80a42b4fe824e30368b7300 — the exact values that leaked into the 1p4d GB200 file.
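
Steps 2-4 can also be automated in one pass. A minimal sketch, assuming PyYAML is installed and the recipe directory path above is correct relative to the repo root:

# Hypothetical consistency check (not part of this PR): print
# model.container and dynamo.hash for every GB200 recipe so the
# 7-vs-1 split described above is visible at a glance.
from pathlib import Path

import yaml  # PyYAML, assumed available

RECIPE_DIR = Path("benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k")

for path in sorted(RECIPE_DIR.glob("disagg-gb200-*.yaml")):
    cfg = yaml.safe_load(path.read_text())
    container = (cfg.get("model") or {}).get("container")
    dynamo_hash = (cfg.get("dynamo") or {}).get("hash")
    print(path.name)
    print(f"  container:   {container}")
    print(f"  dynamo.hash: {dynamo_hash}")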

Impact

The concurrency=64 datapoint of the gb200 sweep will run on a fundamentally different SGLang build (staging branch vs. nightly main) and a different dynamo binary (different hash → different rust source build via dynamo.install: true) than the rest of the sweep. Per-concurrency throughput/TTFT numbers will therefore not be directly comparable to the c=1 / c=256 / c=1024 / c=4096 / c=8192 points, undermining the value of running this concurrency sweep under a single config-key. The master yaml's single image: field is also misleading because it implies one build for the whole config-key. If the per-recipe model.container is not honored at the master level and the master image: overrides, then the recipe file silently lies about what's running — and the dynamo hash divergence is still real and still produces a different runtime.

How to fix

Update lines 38-44 of disagg-gb200-1p4d-dep8-tp8-10-c64.yaml to:

model:
  path: "deepseek-v4-pro"
  container: "lmsysorg/sglang:nightly-dev-cu13-20260513-4fb40bff"
  precision: "fp4"

dynamo:
  hash: "34d55a596fb8d3d44daefe425ec1e303131f4d2c"
  install: true

Or, if the staging container + alternate hash are intentional (e.g. needed to support this recipe's narrow-EP flashinfer_mxfp4 decode backend), add a comment explaining why this point diverges and acknowledge in the PR description that c=64 results are not directly comparable to the other concurrency points.

    ep: 8
    dp-attn: true

dsv4-fp4-gb200-dynamo-sglang:
Contributor

🔴 PR title says "Add DSV4 FP4 GB200 dynamo-sglang MTP disagg benchmarks" but nothing in the diff wires up MTP/speculative decoding: the new config-key is dsv4-fp4-gb200-dynamo-sglang (no -mtp suffix), none of the 8 search-space scenarios carry a spec-decoding: "mtp" marker (unlike the sibling dsv4-fp4-gb300-dynamo-sglang-mtp at line 8557 which sets it on every scenario), none of the 7 new recipe yamls contain any speculative-algo/speculative-num-steps/speculative-eagle-topk/speculative-num-draft-tokens keys, and the perf-changelog entry added by this PR also makes no mention of MTP. Either drop "MTP" from the title, or add the MTP wiring — please clarify before merge so the commit history matches what shipped.

Extended reasoning...

What the bug is

The PR is titled "Add DSV4 FP4 GB200 dynamo-sglang MTP disagg benchmarks", which sets the expectation that this PR adds Multi-Token-Prediction / speculative-decoding benchmark coverage. But every artifact in the actual diff says this is the non-MTP variant of dsv4-fp4-gb200-dynamo-sglang. The title is the only place in the PR that mentions MTP.

Step-by-step proof

  1. Config-key naming convention. Existing MTP configs in .github/configs/nvidia-master.yaml follow the pattern <base>-mtp / <base>-mtp2 — e.g. dsv4-fp4-gb300-dynamo-sglang-mtp (line 8557) and dsv4-fp4-gb200-dynamo-vllm-mtp2 (line 8178, immediately above this addition). The new key added in this PR is dsv4-fp4-gb200-dynamo-sglang with no -mtp suffix.

  2. spec-decoding per-scenario marker. The convention used by sibling MTP entries is to mark every scenario with spec-decoding: "mtp". The dsv4-fp4-gb300-dynamo-sglang-mtp block carries this marker on each of its scenarios; the dsv4-fp4-gb200-dynamo-vllm-mtp2 block immediately above this addition also does. All 8 scenarios in the new dsv4-fp4-gb200-dynamo-sglang block omit it — consistent with non-MTP coverage.

3. Speculative-decoding keys in recipe yamls. Established MTP recipes (e.g. disagg-mid-curve-1p1d-dep4-dep8-mtp.yaml lines 137–140, and the various *-mtp.yaml siblings) set speculative-algo, speculative-num-steps, speculative-eagle-topk, and speculative-num-draft-tokens in their sglang_config. Grep across the 7 new GB200 recipe yamls returns zero matches for any of those keys (a mechanical version of this check is sketched after this list).

4. Recipe filenames. Existing MTP recipes are suffixed -mtp.yaml. The 7 new recipes here are named disagg-gb200-{N}p{M}d-{topo}-{nodes}-c{conc}.yaml, with no -mtp suffix on any of them.

  5. Perf-changelog description. The changelog entry added by this same PR (perf-changelog.yaml line ~2280) says "Add DeepSeek-V4-Pro FP4 GB200 disaggregated SGLang coverage" — no mention of MTP. The author's own changelog text agrees with the code: this is the non-MTP variant.

  6. The SGLANG_OPT_FIX_NEXTN_MEGA_MOE env var present in the recipes is a red herring. It's an internal fix flag for the MEGA_MOE codepath when NEXTN is in use; it does not enable MTP/speculative decoding on its own — that requires the speculative-* keys, which are absent.
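
Step 3 of the list above can be reproduced mechanically. A minimal sketch, assuming the same recipe directory as elsewhere in this thread:

# Hypothetical scan (not part of this PR): flag any speculative-decoding
# keys in the new GB200 recipes; zero hits means non-MTP, as argued above.
from pathlib import Path

SPEC_KEYS = (
    "speculative-algo",
    "speculative-num-steps",
    "speculative-eagle-topk",
    "speculative-num-draft-tokens",
)
RECIPE_DIR = Path("benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k")

hits = [
    (path.name, key)
    for path in sorted(RECIPE_DIR.glob("disagg-gb200-*.yaml"))
    for key in SPEC_KEYS
    if key in path.read_text()
]
print(hits if hits else "no speculative-* keys found; recipes are non-MTP")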

Impact

Every piece of evidence inside the PR points to "non-MTP variant," while the title says MTP. Two failure modes:

  • (a) The title is just wrong — minor metadata cleanup before merge so the commit history doesn't claim functionality that wasn't added.
  • (b) The author intended an MTP variant but forgot the wiring — in which case the benchmark will silently run non-MTP numbers and be reported as MTP results in dashboards, mis-labeling shipped performance data.

The changelog text agreeing with the code strongly suggests (a), but the author should confirm.

How to fix

If (a): rename the PR title to drop "MTP" (e.g. "Add DSV4 FP4 GB200 dynamo-sglang disagg benchmarks") so it matches the changelog and config-key.

If (b): rename the config-key to dsv4-fp4-gb200-dynamo-sglang-mtp, add spec-decoding: "mtp" to each scenario, suffix the recipe filenames -mtp.yaml, add the speculative-algo / speculative-num-steps / speculative-eagle-topk / speculative-num-draft-tokens keys to each recipe's sglang_config, and update the changelog description to mention MTP.

Comment thread on perf-changelog.yaml (Outdated)
- "Add DeepSeek-V4-Pro FP4 GB200 disaggregated SGLang coverage using lmsysorg/sglang:nightly-dev-cu13-20260513-4fb40bff"
- "Rename and consolidate the per-concurrency recipe files to `disagg-gb200-{N}p1d-{topo}-{nodes}-c{conc}.yaml`"
- "Recipes adapted from https://github.com/shyeh25/srt-slurm/commit/b54ef27bbe6c278895f739519014f8d7d136ccc5#diff-bd1c9e424528b0d04035fbc630454d1e6746665d1c91dbefa7e03296c0fcc27f"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/xxxx # TODO
Contributor

🔴 The new dsv4-fp4-gb200-dynamo-sglang changelog entry has pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/xxxx # TODO on line 2286 — a leftover placeholder. Please replace xxxx with the actual PR number (1370) before merging so the link resolves and the # TODO marker doesn't ship in the changelog.

Extended reasoning...

The bug

perf-changelog.yaml line 2286 contains an unresolved placeholder for the new entry added by this PR:

- config-keys:
    - dsv4-fp4-gb200-dynamo-sglang
  description:
    - "Add DeepSeek-V4-Pro FP4 GB200 disaggregated SGLang coverage ..."
    - ...
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/xxxx # TODO

The literal string xxxx and the trailing # TODO comment make the author's intent explicit: this value was meant to be filled in once the PR number was known.

Why this matters

Every other adjacent entry in the file uses a real PR number — e.g. pull/1295 immediately above this entry and pull/1297 immediately below. Merging the placeholder leaves the changelog with:

  1. A broken/invalid URL (/pull/xxxx does not resolve to any PR).
  2. An unresolved # TODO comment shipped in a documentation file.
  3. Inconsistency with the rest of the changelog, which downstream consumers (release notes, dashboards, anything parsing pr-link) may rely on.

Step-by-step proof

  1. Open perf-changelog.yaml at line 2280–2287 — the new entry added by this PR.
  2. Observe line 2286: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/xxxx # TODO.
  3. Compare with line 2279 (immediately above): pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1295 — valid integer PR number.
  4. The PR introducing this entry is Add DSV4 FP4 GB200 dynamo-sglang MTP disagg benchmarks #1370, so the correct value is https://github.com/SemiAnalysisAI/InferenceX/pull/1370.

Fix

Replace line 2286 with:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1370

Trivial one-line change; no other entries are affected.
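
To keep this class of leftover from recurring, a placeholder lint could run before merge. A hedged sketch (the changelog path and any CI wiring are assumptions, not existing tooling):

# Hypothetical pre-merge lint (not part of this PR): fail if the
# changelog still contains a pull/xxxx placeholder or a # TODO marker.
import re
import sys
from pathlib import Path

lines = Path("perf-changelog.yaml").read_text().splitlines()
bad = [
    (lineno, line.strip())
    for lineno, line in enumerate(lines, start=1)
    if re.search(r"pull/xxxx|#\s*TODO", line)
]
for lineno, line in bad:
    print(f"perf-changelog.yaml:{lineno}: unresolved placeholder: {line}")
sys.exit(1 if bad else 0)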

Comment on lines +39 to +62
# See ../1k1k/disagg-gb200-1p1d-dep8-tep8.yaml for the dynamo pin
# rationale. Hash bumped from PR #1213 to track the dynamo-sglang dsv4
# dev branch.
dynamo:
  hash: "34d55a596fb8d3d44daefe425ec1e303131f4d2c"
  install: true

slurm:
  time_limit: "03:00:00"

# Match yangminl's working all-dynamo.yaml on the same gb300-cw cluster:
# cpus-per-task=144 — without this slurm hands out 1 CPU/task, which
# turns the dynamo `hash:` cold source build (~500 rust crates,
# ravif/exr/zip/pyo3 stack) into a 30+ min serial compile. With 144
# cargo finishes in ~5 min.
# mem=0 — slurm's "give the whole node's memory"; needed
# for sglang loading 671B FP4 weights + dynamo build at the same
# time without OOM.
sbatch_directives:
  cpus-per-task: "144"
  mem: "0"

# Topology: 7 prefill (TP=4 / DP=4 / EP=4 / 1 node each) + 1 decode
# (TP=8 / DP=8 / EP=8 / 2 nodes). 9 nodes total.
Contributor

🟡 Three stale copy-paste comments in disagg-gb200-1p1d-tp8-tp8-4-c1.yaml contradict the actual configuration: line 39 points at a non-existent ../1k1k/disagg-gb200-1p1d-dep8-tep8.yaml (no 1k1k directory exists and the typo'd filename is nowhere in the repo); line 49 says "the same gb300-cw cluster" but this file is for gb200 (filename, gpu_type, and master yaml all agree); lines 61-62 describe a "7 prefill + 1 decode, 9 nodes total" topology with TP=4/DP=4/EP=4 even though the config below is 1p1d with TP=8/DP=1/EP=1 on 4 nodes (matching the filename). Documentation-only — no runtime impact — but actively misleads readers about what this recipe does. Suggest fixing the dynamo-pin pointer (or dropping the dangling ref), changing "gb300-cw" to the actual target cluster, and rewriting the topology comment to match the real 1p1d/TP=8 layout.

Extended reasoning...

This recipe was clearly adapted from a gb300 source, and three comment blocks were not updated to match the new gb200 configuration. They are independent issues that all share the same root cause (copy-paste from a different recipe), which is why the synthesis agent grouped them.

1. Dead pointer at line 39: # See ../1k1k/disagg-gb200-1p1d-dep8-tep8.yaml for the dynamo pin rationale. There is no 1k1k directory under benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/ — only 8k1k/ exists. And the referenced filename contains tep8 (a typo of tp8); searching the repo for either *dep8-tep8* or *dep8-tp8* finds nothing. So readers chasing the dynamo-pin justification land on a non-existent file at a non-existent path.

2. Wrong cluster name at line 49: # Match yangminl's working all-dynamo.yaml on the same gb300-cw cluster: But this file is unambiguously a gb200 recipe: the filename is disagg-gb200-..., resources.gpu_type: "gb200", and the master yaml entry is dsv4-fp4-gb200-dynamo-sglang with runner: gb200. The word "same" implies the file targets gb300-cw, which contradicts everything else. The same comment block exists verbatim in disagg-gb300-1p1d-tp4-tp4-2-c1.yaml where it is accurate — this is a copy-paste residue. The cpus-per-task=144 / mem=0 rationale itself remains operationally sound for gb200 Grace-Blackwell nodes; only the cluster attribution is wrong.

3. Wrong topology at lines 61-62 — Comment says # Topology: 7 prefill (TP=4 / DP=4 / EP=4 / 1 node each) + 1 decode (TP=8 / DP=8 / EP=8 / 2 nodes). 9 nodes total. But the resources: block directly below it (lines 64-71) declares:

prefill_nodes: 2
prefill_workers: 1
gpus_per_prefill: 8
decode_nodes: 2
decode_workers: 1
gpus_per_decode: 8

And the sglang_config for both prefill and decode has tensor-parallel-size: 8, data-parallel-size: 1, expert-parallel-size: 1 (with no enable-dp-attention).

Step-by-step proof of the topology mismatch:

1. Filename disagg-gb200-1p1d-tp8-tp8-4-c1.yaml decomposes as 1p1d (1 prefill / 1 decode), tp8-tp8 (TP=8 on both sides), -4- (4 nodes total), and c1 (concurrency 1).
  2. Master yaml entry (lines 8190-8202 of .github/configs/nvidia-master.yaml) for conc-list: [1]: prefill num-worker: 1, tp: 8, ep: 1, dp-attn: false; decode num-worker: 1, tp: 8, ep: 1, dp-attn: false. Matches the filename.
  3. Recipe resources block: 2 prefill nodes + 2 decode nodes = 4 nodes, with 1 worker on each side and 8 GPUs per worker (TP=8).
  4. Recipe sglang_config.prefill / sglang_config.decode: TP=8, DP=1, EP=1 on both sides. No dp-attention, no expert parallelism.
  5. The comment claims 7 prefill workers (TP=4/DP=4/EP=4, 1 node each = 7 nodes for prefill) + 1 decode (TP=8/DP=8/EP=8, 2 nodes) = 9 nodes total, with dp-attention/EP enabled. That topology resembles one of the other new recipes (e.g. the wideep family with enable-dp-attention: true and tensor-parallel-size: 8 / data-parallel-size: 8 / expert-parallel-size: 8 in prefill). It does not describe this file.

Impact — All three are comment-only; the runtime configuration is internally consistent and the recipe should execute correctly. But this is a brand-new recipe whose stated purpose includes documenting the topology and the rationale for sbatch directives and dynamo pinning. Future maintainers reading the comments will be actively misled: they'll chase a dead path for the pin rationale, believe the file targets a different cluster, and form a completely wrong mental model of the topology.

Fix — Either delete the stale comments or rewrite them to match: (1) drop the ../1k1k/... reference or repoint it at a file that actually exists in the tree; (2) change "the same gb300-cw cluster" to whichever cluster this recipe actually targets (or drop the cluster name entirely — the rationale stands on its own); (3) rewrite the topology comment as something like # Topology: 1 prefill (TP=8 / DP=1 / EP=1 / 2 nodes) + 1 decode (TP=8 / DP=1 / EP=1 / 2 nodes). 4 nodes total. so it matches the filename and resources block.
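
The comments themselves can't be linted, but filename-vs-resources drift (the usual source of this kind of copy-paste residue) can be checked mechanically. A minimal sketch, assuming PyYAML, the filename convention from step 1 above, and the resources keys quoted earlier:

# Hypothetical check (not part of this PR): the filename encodes
# {prefill_workers}p{decode_workers}d and the total node count; verify
# that the resources block agrees, so copy-paste drift gets flagged.
import re
from pathlib import Path

import yaml  # PyYAML, assumed available

def check_topology(path: Path) -> None:
    m = re.match(r"disagg-\w+-(\d+)p(\d+)d-.+-(\d+)-c\d+\.yaml$", path.name)
    if m is None:
        return  # filename doesn't follow the convention; nothing to check
    want_p, want_d, want_nodes = map(int, m.groups())
    res = yaml.safe_load(path.read_text())["resources"]
    assert res["prefill_workers"] == want_p, path.name
    assert res["decode_workers"] == want_d, path.name
    assert res["prefill_nodes"] + res["decode_nodes"] == want_nodes, path.name

# Example: run from the 8k1k recipe directory.
check_topology(Path("disagg-gb200-1p1d-tp8-tp8-4-c1.yaml"))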

@yhyang201 (Collaborator)

/sweep

@github-actions (Contributor)

@yhyang201 Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25791594293
Command: ``
Pinned ref: fc724b8
Approval: not required (trusted collaborator).


@yhyang201 force-pushed the sglang-dsagg-gb200-0513 branch from 441e8e8 to bba0e06 on May 13, 2026 at 15:55

@yhyang201 force-pushed the sglang-dsagg-gb200-0513 branch from bba0e06 to 597203d on May 13, 2026 at 16:29

Fridge003 and others added 6 commits May 15, 2026 00:57
The sa-submission-q2-2026 branch's SGLang container path in
get_install_commands() calls maturin directly without checking if it
exists. The vllm container ships cargo/maturin so dynamo-vllm runs
succeed, but nightly-dev SGLang images lack both tools, causing all
workers to fail with exit code 127.

Backport the fix from NVIDIA/srt-slurm#81 (commit 5193837) onto the
cloned sa-submission-q2-2026 checkout: install cargo via rustup when
missing, force-reinstall maturin, and add protobuf-compiler.
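
For reference, the guarded install path could look roughly like this. A hedged sketch only: the actual fix is NVIDIA/srt-slurm#81 (commit 5193837) and may differ in command details.

# Hypothetical reconstruction of the backported guard; every command
# below is an assumption about the real NVIDIA/srt-slurm#81 change.
def get_install_commands() -> list[str]:
    return [
        # Install cargo via rustup only when the container lacks it; the
        # nightly-dev SGLang images ship neither cargo nor maturin.
        "command -v cargo >/dev/null 2>&1 || "
        "(curl -sSf https://sh.rustup.rs | sh -s -- -y)",
        'export PATH="$HOME/.cargo/bin:$PATH"',
        # Force-reinstall maturin so workers can't fail with exit code 127.
        "pip install --force-reinstall maturin",
        # The dynamo rust source build needs protoc.
        "apt-get update && apt-get install -y protobuf-compiler",
    ]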
The escaped `\$HOME` in the Python heredoc produced a literal `\$HOME` in
schema.py, which bash could not resolve. Use bare `$HOME` to match
srt-slurm main.

The nightly SGLang container's transformers version does not recognize
the deepseek_v4 model type, causing sa-bench to crash on tokenizer
init. Use the same SGLangDeepseekV4Tokenizer that the GB300 MTP
recipes already use.

sa-submission-q2-2026 predates the sa_bench_tokenizers module
(NVIDIA/srt-slurm#73), causing benchmark runs to fail with
ModuleNotFoundError. Cherry-pick the module from main, same as
the aflowers/vllm-gb200-v0.20.0 branch uses.
@yhyang201 force-pushed the sglang-dsagg-gb200-0513 branch from 597e9b1 to 5899bf4 on May 14, 2026 at 17:01
