Modify DSV4 GB300 SGLang MTP disagg concurrency #1316
Conversation
Reduce concurrency for mid-curve MTP configs based on server-log analysis of run #25642986639. Current values are 1.6-1.8x above system capacity, causing prefill queue buildup and high TTFT.

- 1P 8D: 256 -> 160
- 1P 16D: 256 -> 160
- 2P 8D: 512 -> 320
- 4P 8D: 1024 -> 640

Set num_prompts_mult=16 to keep the total prompt count unchanged (original: conc*10, new: conc*16).
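As a sketch, one of the retuned recipe stanzas would look roughly like this (the stanza layout is illustrative; the `conc-list` and `num_prompts_mult` fields and values come from the PR, and the prior multiplier of 10 is inferred from the "original: conc*10" note above):

```yaml
# Mid-curve 1P 8D stanza (illustrative layout)
- spec-decoding: "mtp"
  conc-list: [160]      # was [256]; 256 measured ~1.6-1.8x above system capacity
  num_prompts_mult: 16  # was 10; 160 * 16 = 2560 = 256 * 10, so total prompts unchanged
```

The same offset works for every retuned point, since each concurrency was scaled by 10/16 (320 * 16 = 512 * 10, 640 * 16 = 1024 * 10).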
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR there first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that all GitHub Action jobs fully pass after merging. Much of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
|
/sweep |
|
@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25744314802 |
LGTM — mechanical concurrency tuning for 4 mid-curve MTP disagg configs with clear rationale and offsetting num_prompts_mult to preserve total prompt count.
Extended reasoning...
Overview
The PR tunes 4 mid-curve MTP disagg concurrency points for the dsv4-fp4-gb300-dynamo-sglang-mtp config family across nvidia-master.yaml and 4 recipe YAMLs (1P 8D: 256→160, 1P 16D: 256→160, 2P 8D: 512→320, 4P 8D: 1024→640), and adds num_prompts_mult: 16 in each recipe so total prompt count is preserved. Low-latency configs are untouched.
Security risks
None. This is benchmark-recipe configuration data — YAML scalars consumed by the sweep harness. No code paths, auth, crypto, permissions, or runtime logic are affected.
Level of scrutiny
Low. The edits are purely numeric concurrency knobs with a clear empirical justification in the description (server-log analysis of run #25642986639 showing 1.6–1.8× over-subscription and P50 prefill queue of 130–230). The values change consistently across nvidia-master.yaml and each recipe, and num_prompts_mult: 16 is applied uniformly to keep workload size constant.
Other factors
The bug hunter flagged a nit about not appending a perf-changelog.yaml entry, which means the post-merge sweep won't auto-trigger to verify the TTFT improvement claim — this is already conveyed as an inline comment for the author to act on before merging, and doesn't impair correctness of the YAML edits themselves. Given the small, self-contained, mechanical nature of the change and the precedent set by PR #1305 / #1310, this can land without further human review.
```diff
 # Mid curve 1p1d-dep4-dep8. 3 nodes.
 - spec-decoding: "mtp"
-  conc-list: [256]
+  conc-list: [160]
```
🟡 This PR retunes concurrency for four dsv4-fp4-gb300-dynamo-sglang-mtp mid-curve configs (in nvidia-master.yaml and the four recipe YAMLs) but does not append a corresponding entry to perf-changelog.yaml. Because .github/workflows/run-sweep.yml gates both its push and pull_request triggers on a path filter for perf-changelog.yaml, the post-merge sweep won't fire for this change, and the TTFT improvement vs. run #25642986639 cited in the description can't be auto-recorded. Appending a single entry with config-keys: dsv4-fp4-gb300-dynamo-sglang-mtp (matching the pattern of the analogous PR #1305) before merging will fix it.
Extended reasoning...
What the bug is. The PR changes concurrency points for dsv4-fp4-gb300-dynamo-sglang-mtp in .github/configs/nvidia-master.yaml (4 conc-list entries at lines 8477/8492/8507/8522: 256→160, 256→160, 512→320, 1024→640) and the matching benchmark stanzas of the four mid-curve recipe YAMLs under benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/. The PR description's stated goal is to verify a TTFT/E2E improvement against benchmark run #25642986639. However, perf-changelog.yaml is not touched (the PR file list is 5 files: nvidia-master.yaml plus the four recipes).

Why this matters: the sweep workflow gating. .github/workflows/run-sweep.yml declares its triggers as:

```yaml
on:
  push:
    branches: [main]
    paths:
      - "perf-changelog.yaml"
  pull_request:
    branches: [main]
    types: [ready_for_review, synchronize, labeled, unlabeled]
    paths:
      - "perf-changelog.yaml"
```

Both the push-to-main path and the pull_request path are gated on a paths filter that only matches perf-changelog.yaml. With no edit to that file, the workflow is not eligible to run at all on this PR or on the merge commit.

Project convention reinforces this. AGENTS.md states explicitly: "Changes to perf-changelog.yaml trigger benchmark runs" (Git section) and "perf-changelog.yaml triggers which configs to benchmark" (YAML section). The "Adding a New Benchmark Configuration" checklist (step 2) directs maintainers to "Add corresponding entry to perf-changelog.yaml to trigger benchmark". Although that step is written for new-config flows, the same gating applies to recipe retunes, which is why earlier tuning PRs added entries too.

Prior precedent: PR #1305. The directly analogous PR "Fix B200 Dynamo vLLM recipe benchmark concurrencies" performed exactly this kind of operation (concurrency tuning in nvidia-master.yaml + recipe YAMLs) and appended a 7-line entry to perf-changelog.yaml with config-keys: dsv4-fp4-b200-dynamo-vllm to trigger the post-merge sweep. That entry is visible at the tail of the current perf-changelog.yaml. The current PR is the same pattern minus the changelog entry, so it falls outside the established workflow.

Step-by-step proof that the sweep will not fire after merge:
1. The merge commit lands on main. Modified paths: .github/configs/nvidia-master.yaml and four files under benchmarks/multi_node/srt-slurm-recipes/.../8k1k/.
2. GitHub evaluates run-sweep.yml's on.push block. The paths filter requires at least one changed path to match perf-changelog.yaml.
3. None of the five changed paths match. The workflow run is not created (no Run Sweep entry will appear in the Actions tab for this SHA).
4. As a result, the TTFT/TPOT comparison against run #25642986639 that motivates the PR is never produced by the standard pipeline, and no changelog row links this PR for future bisection.

Impact. This is a process/workflow omission, not a code defect; the YAML edits themselves are internally consistent. The consequence is loss of automated verification and changelog provenance for a perf-targeted change.

How to fix. Append a new entry to the END of perf-changelog.yaml (per AGENTS.md's chronological-append rule), mirroring PR #1305's shape, e.g.:

```yaml
- config-keys:
    - dsv4-fp4-gb300-dynamo-sglang-mtp
  description:
    - "Tune mid-curve MTP disagg concurrencies (1p1d/2p1d/4p1d) to match measured system capacity"
    - "Reduce conc 256→160 (dep4-dep8, dep4-dep16), 512→320 (2p1d), 1024→640 (4p1d); keep num_prompts_mult=16 to preserve total prompt count"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1316
```

Make sure the file still ends with a trailing newline (the check-newline job in run-sweep.yml enforces this on PRs).
|
/sweep |
|
@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25745147292 |
Required to trigger the sweep workflow on merge.
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25745868416 |
Enable DP-rank-aware routing to improve prefill queue distribution across DP ranks. Matches non-MTP disagg configs.
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25753311297 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25775064759 |
Replace default follow_bootstrap_room with round_robin to improve queue distribution across prefill DP ranks.
Force-pushed from a1ed80c to 2b265fa
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25775153188 |
Explore the throughput-latency tradeoff with 3 concurrency points each:
- 1P 8D/16D: 64, 128, 160
- 2P 8D: 128, 256, 320
- 4P 8D: 256, 512, 640
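The three-point curves above would land in the recipes roughly as follows (an illustrative sketch; only the conc-list values come from the commit message, and the stanza grouping is assumed):

```yaml
# 1P 8D and 1P 16D mid-curve stanzas (illustrative)
conc-list: [64, 128, 160]

# 2P 8D
conc-list: [128, 256, 320]

# 4P 8D
conc-list: [256, 512, 640]
```

Each list keeps the previously tuned ceiling (160/320/640) as its top point and adds two lower points to trace the latency side of the curve.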
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25775190340 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25775454558 |
|
/sweep |
|
@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25793093112 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25793153134 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25793192375 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25807984297 |
- 1P8D: 160, 1P16D: 192, 2P8D: 384, 4P8D: 640
- Resolve perf-changelog.yaml merge conflict
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25816160589 |
Summary
Set num_prompts_mult: 16 to keep the total prompt count unchanged.