feat: deepseek-v4 model support by wenxie-amd · Pull Request #698 · AMD-AGI/Primus

wenxie-amd · 2026-04-28T11:01:30Z

Summary

This PR brings DeepSeek-V4 training support into Primus on the Megatron backend.

It covers the full bring-up arc (P0 – P10) plus the plan-2 lockdown (P12), which closes out plan-0 / plan-1 and sets out an architecture-faithful rewrite plan for the remaining work (P13 – P21).

Plan timeline

Plan	Phases	Window	Status
plan-0 (`develop/plan-0/`)	P0 – P7	2026-04-28	done — initial bring-up, configs, dispatch, layer specs, HC + Hybrid Attn, MoE / activation / RoPE / MTP, single-node smoke (`PP=2 EP=4`)
plan-1 (`develop/plan-1/`)	P8 – P11	2026-04-29 → 2026-04-30	partial — P8 / P9 / P10 done; P11 paused by the architecture review
plan-2 (`develop/plan-2/`)	P12 – P19 (+ deferred P20 / P21 / P22+)	2026-05-01 (lockdown) → 2026-05-07 (P19 close-out)	wrapping up — P12 / P13 / P14 / P15 / P16 / P17 / P18 / P19 done; pre-training-first scope means P20 (perf / convergence gates), P21 (docs / handover), and P22+ (HF state-dict adapter) are all deferred follow-ups, gated by the next campaign that needs them

Plan-2 reshuffle — 2026-05-01 (commit `f548d8b2`, docs-only)

Pre-training is the release path; HF-weight loading is not required for the release. Plan-2 phase shape after this reshuffle:

Phase	New scope	Notes
P17	Code cleanup (was: state-dict adapter)	retire `_RMSNorm` duplicates / `dual_rope.py` / `csa_attention.py` / `hca_attention.py` / legacy `DeepseekV4MTPBlock` / EP `all_reduce` fallback gate / `_v4_token_ids` residue / yaml comment fixes. New gate G14 (static dead-code audit).
P18	Spec audit (unchanged; `_v4_token_ids` removal moved to P17)
P19	Distributed re-validation (unchanged; G6 / G7 still here)
P20	Convergence + perf gates (HF numerical-alignment row removed; convergence baseline switched to Megatron-bridge)
P21	Docs + handover (slimmed; cleanup tasks moved to P17)	techblog / progress HTML / PPT / `develop_deepseek-v4-in-primus.md` only
P22+	HF state-dict adapter + V4-Flash checkpoint load (deferred)	Activate when SFT / evaluation needs HF weights. Design notes preserved in `02-target-architecture.md` §7 + `03-phase-details.md` (P22+ section). G8 / G9 deferred from P17; HF-numerical-alignment portion of G12 also deferred here.

Why plan-2

A code review of dev/wenx/deepseek-v4 against real DeepSeek-V4 (HF reference, NeMo port, official inference) and Megatron's spec + config + provider + submodule + build_module pattern surfaced 28 findings (10 CRIT / 11 HIGH / 6 MED / 5 LOW). Highlights:

- Attention uses separate linear_k_proj / linear_v_proj; real V4 has a single-latent wkv (K = V = kv).
- q_norm / kv_norm per-head RMSNorms are missing.
- HashRouter outputs uniform 1/topk weights with no learnable gate.
- clamped_swiglu clamps post-mul; real V4 clamps pre-mul on silu(gate) and up.
- No state-dict adapter: official V4-Flash / V4-Pro HF safetensors cannot be loaded.
- DeepseekV4Attention / DeepseekV4TransformerBlock / DeepseekV4HybridLayer / DeepseekV4MoE reinvent rather than subclass MLASelfAttention / TransformerBlock / TransformerLayer / MoELayer.

Plan-2 (develop/plan-2/) is the architecture-faithful rewrite. Full review in develop/plan-2/00-review-findings.md; rewrite map in 02-target-architecture.md; phase-by-phase plan in 03-phase-details.md; gates in 04-test-strategy.md.

Commit map

commit	phase	scope
`e194e039`	docs	architecture deep-dive + plan docs
`d3383c02`	P1	configs / yaml + tokenizer
`8ae10000`	P2	`model_type=deepseek_v4` dispatch
`a5d2a561`	P3	layer spec + block scaffolding
`3b7ad8c8`	P4	HC + Hybrid Attention + dual-RoPE
`5e4008dc`	P5	V4 MoE + clamped SwiGLU + V4 MTP
`97b9720d`	P6-P7	PP/EP integration fixes + single-node run script + progress docs
`df273a45`	P8(v2)	LanguageModule migration + DeepSeek runtime spec-tree main path
`e5fec968`	P9(v2)	provider reuse integration + TE CUDA runtime validation/report
`b38e83cf`	P10(v2)	enforce MoE provider path and add V4 config schema
`752b7534`	P10(v2)	stabilize smoke runtime and add phase report
`636ab3de`	P12(v3)	plan-2 lockdown + as-built techblog + roadmap visuals
`cad0fb38`	P13(v3)	rebase V4 attention on `MLASelfAttention` (faithful dense path)
`aa9929a0`	P13(v3)	fold Compressor / Indexer + TP-shard projections; retire CSA / HCA legacy
`1a8bf32e`	P14(v3) phase-1	faithful pre-mul clamped SwiGLU + V4 routers (learnable gate weight; HF-aligned scoring) + G3 + G4 unit tests
`5fe8bc3c`	P14(v3) phase-2	`DeepseekV4MoE` -> `MegatronModule` + CPU local-experts path; `v4_grouped_mlp_spec` / `v4_router_spec` providers; G5 (1L MoE forward <= 1e-3 vs HF reference)
`25ccdb5e`	P15(v3)	`DeepseekV4HybridLayer` -> `TransformerLayer`; `DeepseekV4TransformerBlock` -> `TransformerBlock`; HC × PP K-stream packing helpers; `HyperHead` only on post_process; `token_ids` forward kwarg replaces `decoder._v4_token_ids` stash; 16 unit tests
`6c5875d4`	P16(v3)	spec-based MTP via upstream `MultiTokenPredictionBlock` + `process_mtp_loss`; `get_v4_mtp_block_spec` helper; layer forward returns `(hidden_states, None)` for MTP-call compatibility; legacy `DeepseekV4MTPBlock` deprecated; 17 unit tests
`f548d8b2`	docs	plan-2 reshuffle — defer HF state-dict adapter to P22+; repurpose P17 for code cleanup; add G14 gate; update roadmap / phase-details / test-strategy / status / README
`e591b893`	P17(v3)	dead-code retirement (G14): delete legacy `DeepseekV4MTPBlock` + `v4_use_custom_mtp_block` / `mtp_compress_ratios` config fields; introduce shared `LocalRMSNorm` helper and dedup three `_RMSNorm` shadows (`block.py` / `attention.py` / `compressor.py`); fix inverted yaml comment (4=CSA / 128=HCA); refresh package `__init__` surface; add `tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_p17_dead_code.py` (G14 audit). `dual_rope.py` is intentionally kept — load-bearing for V4's CSA / HCA dual-base RoPE; no Megatron equivalent.
`b5832672`	P18(v3)	spec-system audit (D1 / D2 / D4 / G1): `build_context.resolve_v4_provider(config)` caches the V4 provider on the config object (replaces three direct `DeepSeekV4SpecProvider(...)` call sites); new `provider.v4_mlp_activation_func()` returns `None` when `use_te_activation_func=False` (V4 default — clamped-SwiGLU eager path) and `TEActivationOp` otherwise; `compress_ratios` normalized to `tuple[int, ...]` in `__post_init__` (so runtime never re-runs `ast.literal_eval`); new `tests/unit_tests/configs/test_deepseek_v4_yaml.py` (G1 schema gate) + `tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_p18_spec_audit.py` (D1 / D2 / package-surface AST audits).
`83c33ad0`	P19(v3)	distributed re-validation (G10): two primus-patches that close PP > 1 + VPP under V4: `megatron.deepseek_v4.pp_tensor_shape` (wraps both `schedules.get_tensor_shapes` for 1F1B and `forward_backward_pipelining_with_interleaving` for VPP, multiplies the seq dim by `hc_mult` so the PP wire carries V4's mHC `[S*K, B, D]` packing) and `megatron.deepseek_v4.pp_token_pre_broadcast` (pre-broadcasts all microbatch / chunk `input_ids` from PP rank 0 across the PP group upfront in a wrapper around `get_forward_backward_func`, so middle PP stages owning hash-routed MoE layers see real token IDs without deadlocking the interleaved-1F1B / VPP schedule). Drops the in-`forward` PP broadcast + VPP fail-fast assert from `DeepseekV4Model`, and stops pre-assigning `self.mtp = None` so Megatron's `set_current_microbatch` only iterates `model.mtp.layers` when MTP is live (matches upstream `GPTModel`).
`dba27163`	plan-2 close-out	docs-only — mark the `c10d::allreduce_` autograd warning as gone (verified absent in P19 smokes A/B/C/D + EP=8 / PP=2 EP=4 profile runs on `mi355-gpu-12`); mark G11 (routing-snapshot diff = 0 across PP / EP changes) as deferred (snapshot dump tooling never landed; not on the pre-training release path); drop Phase 20 / 21 / 22+ sections from `status.md` (kept as documented intent in `plan-2/03-phase-details.md`); add `deepseek-v4/develop/progress/plan-2-summary.md` (stand-alone summary of the architecture-faithful rewrite from P12 → P19, including a per-phase outcome table, a P19 deep-dive, the test-gate ledger, the plan-1 → plan-2 architectural-shift table, and pointers to logs / profile traces); add P19 profile launchers (`run_profile_ep8.sh` for TP=1 PP=1 EP=8 and `run_profile_pp2_ep4.sh` for TP=1 PP=2 EP=4) plus `deepseek-v4/download_ref.sh` (idempotent helper that ensures git-lfs and clones the V4 reference assets — HF transformers, ROCm/TransformerEngine, AMD-AGI/Primus-Turbo, NVIDIA-NeMo/Automodel, and the four DeepSeek-V4 model repos — at pinned commits with `GIT_LFS_SKIP_SMUDGE=1` so weights are not downloaded by default).
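
The P18 row above mentions normalizing `compress_ratios` to `tuple[int, ...]` in `__post_init__` so that runtime code never re-runs `ast.literal_eval`. A minimal sketch of that idea follows. `ToyV4Config` is a hypothetical stand-in, not the real `DeepSeekV4TransformerConfig`, and both the accepted input forms and the validation against `{0, 4, 128}` are assumptions made for illustration.

```python
import ast
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass
class ToyV4Config:
    """Hypothetical stand-in for the real config; models a single field."""

    # May arrive as a yaml / CLI string such as "[0, 4, 128]" or as a sequence.
    compress_ratios: Union[str, int, Sequence[int], None] = None

    def __post_init__(self) -> None:
        raw = self.compress_ratios
        if raw is None:
            self.compress_ratios = ()
            return
        if isinstance(raw, str):
            # Parse exactly once, here, so runtime code never calls literal_eval.
            raw = ast.literal_eval(raw)
        if isinstance(raw, int):
            raw = (raw,)
        ratios = tuple(int(r) for r in raw)
        allowed = {0, 4, 128}  # dense / CSA / HCA (validation here is an assumption)
        bad = [r for r in ratios if r not in allowed]
        if bad:
            raise ValueError(f"unsupported compress_ratios entries: {bad}")
        self.compress_ratios = ratios


cfg = ToyV4Config(compress_ratios="[0, 4, 128, 4]")
print(cfg.compress_ratios)  # (0, 4, 128, 4)
```

Doing the parse in `__post_init__` means every downstream consumer can index a plain tuple of ints without caring how the value was spelled in yaml.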

What landed in `97b9720d` (P6/P7)

P6 integration

deepseek_v4_builders.py
- Align model_provider with upstream Megatron signature (config, pg_collection).
deepseek_v4_block.py
- Build only local PP layers via get_num_layers_to_build + get_transformer_layer_offset.
- Add set_input_tensor support for non-first PP stages.
- Normalize/parse compress_ratios more robustly.
- Return viewless output via make_viewless_tensor for PP schedule compatibility.
v4_moe.py
- Add EP-aware local expert sharding and EP all-reduce merge path.
deepseek_v4_model.py
- Keep custom V4 MTP block behind v4_use_custom_mtp_block; default to native GPTModel MTP path for stable bring-up.
dual_rope.py, deepseek_v4_attention.py, attn_sink.py
- Rename DualRoPE.apply -> apply_rope (avoid nn.Module.apply conflict).
- Cast attention probs to value dtype before matmul to avoid bf16 mismatch.

P7 bring-up

Add run_deepseek_v4.sh (based on run_qwen.bak.sh) with fixed knobs:
- MBS=1, GBS=16, TP=1, PP=2, EP=4
- lightweight smoke overrides (num_layers=8, num_experts=8, mtp_num_layers=0)
Single-node run passed on:
- host: uswslocpm2m-106-2371
- container: dev_primus_wenx_691
- command: TRAIN_ITERS=3 ./run_deepseek_v4.sh
- result: reached iteration 3/3, torchrun exit code 0
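
For readers checking the knobs, the sketch below derives the data-parallel size and the number of microbatches per step. It assumes 8 GPUs on the node (the PR does not state the GPU count) and the usual decomposition `world_size = TP * PP * DP` with expert parallelism laid out inside the data-parallel dimension; `smoke_layout` is a hypothetical helper, not part of the run script.

```python
def smoke_layout(world_size: int, tp: int, pp: int, ep: int, mbs: int, gbs: int) -> dict:
    """Derive DP size and microbatch count from the run-script knobs.

    Assumes world_size = TP * PP * DP, with EP carved out of DP.
    """
    if world_size % (tp * pp):
        raise ValueError("world_size must be divisible by TP * PP")
    dp = world_size // (tp * pp)
    if dp % ep:
        raise ValueError("EP must divide the data-parallel size")
    if gbs % (mbs * dp):
        raise ValueError("GBS must be divisible by MBS * DP")
    return {"dp": dp, "num_microbatches": gbs // (mbs * dp)}


# world_size=8 is an assumption; the other values are the script's fixed knobs.
layout = smoke_layout(world_size=8, tp=1, pp=2, ep=4, mbs=1, gbs=16)
print(layout)  # {'dp': 4, 'num_microbatches': 4}
```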

What landed in `df273a45` (P8 v2)

deepseek_v4_model.py
- DeepseekV4Model now inherits from LanguageModule (no longer GPTModel).
- Remove super_init_transformer_layer_spec path.
- Build decoder directly from externally supplied DeepSeek runtime transformer_layer_spec.
deepseek_v4_layer_specs.py
- Remove GPT placeholder-spec helpers.
- Keep DeepSeek-native runtime spec tree only, with full layer/submodules topology.
deepseek_v4_builders.py
- Resolve/pass runtime decoder spec only; remove GPT placeholder/super-init dependence.
deepseek-v4/develop/progress/status.md
- Mark Phase 8(v2) tasks completed and sync notes with the finalized implementation.

Runtime verification:

On host uswslocpm2m-106-2371, container dev_primus_wenx_691:
- Instantiate DeepseekV4Model (LanguageModule-based) with runtime spec tree.
- Forward pass succeeds with output shape (128, 2, 256).

What landed in `e5fec968` (P9 v2)

core/extensions/transformer_engine_spec_provider.py
- Add DeepSeekV4SpecProvider(PrimusTurboSpecProvider) as the V4 provider entry point.
- Resolve runtime mode (local / te / turbo) and expose V4-specific provider helpers for norm/grouped-MLP selection.
deepseek_v4_layer_specs.py
- Resolve provider once at spec-build time and route norm, attention projection specs, dense projection specs, and MoE grouped path payload through provider-aware ModuleSpec construction.
deepseek_v4_attention.py
- Refactor attention projections to submodules + build_module via DeepseekV4AttentionSubmodules (q_a, q_b, k_proj, v_proj, o_proj) with local fallback.
deepseek_v4_block.py
- Align dense MLP projection initialization with provider-selected linear modules.
- Add explicit fail-fast guard: TE/Turbo provider mode requires CUDA hidden_states.
v4_moe.py
- Integrate provider grouped-GEMM expert path with safe fallback to local clamped SwiGLU experts.
docs/status updates
- Add deepseek-v4/develop/plan-1/03-phase9-provider-ab-report.md.
- Update deepseek-v4/develop/progress/status.md with completed Phase 9(v2) items and English-only notes.

Runtime verification:

On host uswslocpm2m-106-2371, container dev_primus_wenx_691:
- local mode forward passes (Linear projections).
- TE mode module-map build resolves to TELinear projections.
- TE mode CUDA forward passes (decoder.cuda() + CUDA inputs).
- TE/Turbo host-input path now fails fast with explicit runtime error instead of low-level GPU fault.

What landed in `b38e83cf` (P10)

core/transformer/moe/v4_moe.py
- enforce SharedExpertMLP-only shared-expert path (remove local ClampedSwiGLUMLP fallback for shared experts).
- wire clamped-SwiGLU behavior through SharedExpertMLP config path.
core/models/deepseek_v4/deepseek_v4_transformer_config.py
- add DeepSeekV4TransformerConfig (inherits MLATransformerConfig) with DeepSeek-V4 specific fields used by V4 runtime modules.
- align aliases/compat in __post_init__ (norm_epsilon, moe_intermediate_size, clamp sync, vocab/padded vocab sync).
deepseek_v4_builders.py
- explicitly build V4 model config via core_transformer_config_from_args(..., config_class=DeepSeekV4TransformerConfig).
V4 modules/specs type wiring
- update V4 builder/spec/model/attention/MoE module signatures and type hints to consume DeepSeekV4TransformerConfig.
model yaml
- add activation_func_clamp_value to primus/configs/models/megatron/deepseek_v4_base.yaml with clamped-SwiGLU comment.
docs/progress
- refresh deepseek-v4/develop/plan-1/* and deepseek-v4/develop/progress/status.md for Phase10 implementation notes.

Validation in this commit:

pre-commit hooks passed (isort/autoflake/black/yaml checks).
Python syntax compile checks passed for all touched DeepSeek-V4 runtime files.

What landed in `752b7534` (P10 runtime stabilization + report)

run_deepseek_v4.sh
- add smoke-safe overrides for Phase 10 validation (seq_length/max_position_embeddings=128, index_topk=8).
- set v4_grouped_experts_support_clamped_swiglu=True for grouped-expert clamped-SwiGLU runtime guard compliance.
- disable overlap_grad_reduce and overlap_param_gather in smoke mode to avoid DDP bucket reset assertion between iterations.
primus/backends/megatron/core/transformer/hyper_connection.py
- align F.linear weight dtype to activation dtype in HyperMixer and HyperHead to fix BF16 runtime mismatch.
primus/backends/megatron/core/transformer/deepseek_v4_attention.py
- cast attention output back to activation dtype before TE output projection to satisfy TE dtype assertions.
deepseek-v4/develop/plan-1/04-phase10-moe-distributed-convergence-report.md
- add formal Phase 10 report covering delivered architecture, runtime blocker/fix chain, and remaining tracked items.

Runtime verification in this update:

host: uswslocpm2m-106-2371
container: dev_primus_wenx_691
command: ./run_deepseek_v4.sh
result: reached iteration 10/10, and torchrun finished successfully (code 0).

What landed in `636ab3de` (P12 — plan-2 lockdown)

Documentation-only commit; no runtime code changes.

Architecture review

Walked the branch e194e039..HEAD against:
- deepseek-v4/deepseek-ai/DeepSeek-V4-Flash/{config.json, inference/model.py}
- HF Transformers PR 45616 / 45643 (deepseek-v4/transformers/.../deepseek_v4/)
- NVIDIA NeMo AutoModel V4 port (deepseek-v4/NVIDIA-NeMo/Automodel/...)
Surfaced 28 findings (10 CRIT / 11 HIGH / 6 MED / 5 LOW), spanning architecture faithfulness, Megatron reuse / spec violations, distributed correctness, spec-system hygiene, code quality, and testing gaps.

Plan-2 documents (active plan of record)

deepseek-v4/develop/plan-2/README.md
deepseek-v4/develop/plan-2/00-review-findings.md — full severity-ranked findings ledger
deepseek-v4/develop/plan-2/01-roadmap.md — phases P12 → P21, dependency graph, milestones, top risks
deepseek-v4/develop/plan-2/02-target-architecture.md — module-by-module rewrite map (rebases on MLASelfAttention, TransformerLayer, TransformerBlock, MoELayer, MultiTokenPredictionBlock, (Yarn)RotaryEmbedding)
deepseek-v4/develop/plan-2/03-phase-details.md — granular tasks / exit criteria / risks per phase
deepseek-v4/develop/plan-2/04-test-strategy.md — L0..L3 test pyramid and release gates G1..G14 (G8 / G9 marked deferred → P22+ since the 2026-05-01 reshuffle)

Plan-1 phases 9 / 10 / 11 are paused — their tracking rows in status.md remain for history.

Tech blog closure

Added deepseek-v4/develop/techblog/02-plan-1-as-built-and-plan-2-pointer.md: closes plan-0 / plan-1 with an as-built note (what shipped, what fell short) and points readers at plan-2.
Updated deepseek-v4/develop/techblog/README.md with a banner declaring plan-2 the active plan of record.

Layout cleanup + visuals

Renamed develop/plan/ → develop/plan-0/ (the original bring-up plan; tracked as a rename).
Added develop/progress/timeline.html: standard system-fonts version of the project timeline; daily-column Gantt with a May 02 – 05 Holiday band; remaining nine phases (P13 – P21) packed into the May 06 – 09 working window.
Added develop/progress/build_roadmap_pptx.py (generator) + develop/progress/deepseek_v4_roadmap_v1.pptx (13-slide tech-style deck on a black background, 16:9). Slide 7 ("07 · 开发计划 · DEVELOPMENT SCHEDULE"; 开发计划 means "development plan") is the day-by-day plan with a 3-row layout (date chip / P0~P7-style phase chip / work-content card) plus a directional arrow with the holiday-gap marker.

Status tracker

develop/progress/status.md now has explicit Phase 12 → Phase 21 (v3) sections.
All P12 engineering items are checked off; only the stakeholder sign-off on plan-2 scope remains open.
The blockers/risks log carries one row per CRIT finding, each pointing at the plan-2 phase that resolves it.

Schedule

Block A (landed): 2026-04-28 → 2026-05-01 — plan-0 P0 – P7 + plan-1 P8 – P10 + plan-2 P12 lockdown.
Holiday: 2026-05-02 → 2026-05-05.
Block B (planned): 2026-05-06 → 2026-05-09 — plan-2 P13 – P21 across 4 working days (P13 + P14 / P15 + P16 / P17 + P18 / P19 + P20 + P21). Note: P17 scope changed to code cleanup per the 2026-05-01 reshuffle; HF state-dict adapter + V4-Flash numerical alignment is deferred to P22+ and not in this Block B window.

What landed in `cad0fb38` + `aa9929a0` (P13 — faithful attention)

Plan-2 P13 lands in two commits inside the May 06 budget. Both are scoped strictly to the dense / CSA / HCA attention path; faithful MoE / router / MTP are tracked in P14 / P15 / P16. (HF state-dict adapter — originally planned for P17 — has since been deferred to P22+ by the 2026-05-01 reshuffle; pre-training does not need it.)

`cad0fb38` — V4-faithful attention rooted on `MLASelfAttention` (dense path)

Rewrite the dense (compress_ratio == 0) path of DeepSeek-V4 attention to be faithful to the released DeepSeek-V4-Flash checkpoint and rooted on Megatron's MLASelfAttention.

primus/backends/megatron/core/transformer/deepseek_v4_attention.py
- New DeepseekV4Attention(MLASelfAttention) subclasses MLA for type identity but bypasses the parent __init__ chain because V4's KV layout differs from MLA's compressed-KV form.
- Single-latent KV: one linear_kv projection (hidden -> head_dim) feeds both K and V, broadcast across all query heads.
- Per-head q_rms: parameter-less RMS on head_dim after linear_q_up_proj and before partial RoPE (no q_rms.weight in the released checkpoint).
- Grouped low-rank O: einsum-based linear_o_a per group + linear_o_b when o_lora_rank > 0. Falls back to MLA-style flat linear_proj when o_lora_rank == 0.
- Learnable attn_sink: direct nn.Parameter on the attention (matches the released key layers.{i}.attn.attn_sink exactly), with inline softmax-with-sink in _attention_forward.
- New DeepseekV4AttentionSubmodules dataclass with MLA-canonical names (linear_q_down_proj, linear_q_up_proj, q_layernorm, kv_layernorm) plus V4 extras (linear_kv, linear_o_a, linear_o_b, attn_sink).
- _LegacyDeepseekV4Attention retained temporarily as the parent for CSAAttention / HCAAttention until the P13 follow-up commit folds the compressor / indexer into the new class.
primus/backends/megatron/core/extensions/transformer_engine_spec_provider.py
- Added v4_q_layernorm(), v4_kv_layernorm(), v4_attention_sink() factory methods.
primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_layer_specs.py
- Routes compress_ratio == 0 to the new class with V4-canonical submodules; legacy path retained for {4, 128}.
primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_transformer_config.py
- Added o_groups: int = 8 and o_lora_rank: int = 0.
tests/unit_tests/megatron/transformer/deepseek_v4/test_deepseek_v4_attention.py
- State-dict-key contract; forward shape + finiteness; numerical equivalence vs an inline V4 reference (single-latent KV, partial interleaved RoPE, attn-sink as virtual key column, grouped low-rank O), with attn_sink enabled and disabled (≤ 1e-3); per-head q_rms is parameter-less; o_lora_rank == 0 fallback path; rejection paths.
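
The "attn-sink as virtual key column" wording above can be illustrated with a small, dependency-free sketch. It is a plain-Python illustration for a single query row, not the code in `_attention_forward`; the function name and the scalar-sink signature are assumptions.

```python
import math
from typing import List


def softmax_with_sink(scores: List[float], sink: float) -> List[float]:
    """Softmax over the real keys plus one virtual 'sink' logit.

    The sink takes part in the normalization but has no value vector, so its
    probability mass is dropped; the returned weights sum to at most 1.
    """
    m = max(max(scores), sink)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps) + math.exp(sink - m)
    return [e / denom for e in exps]


probs = softmax_with_sink([1.0, 2.0, 3.0], sink=0.0)
print(sum(probs) < 1.0)  # True: part of the mass went to the sink
```

Setting the sink to negative infinity recovers an ordinary softmax, which is one way to read the "attn_sink enabled and disabled" test pair.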

`aa9929a0` — Fold Compressor / Indexer + TP-shard projections; retire CSA / HCA legacy

Closes P13 by folding the compressed-branch attention into the V4-faithful class as spec submodules, switching the TP-sensitive projections to ColumnParallel / RowParallel, and retiring the plan-1 legacy attention classes.

primus/backends/megatron/core/transformer/deepseek_v4_attention.py
- DeepseekV4Attention.__init__ accepts compress_ratio in {0, 4, 128}. When compress_ratio > 0 it builds self.compressor from submodules.compressor; when compress_ratio == 4 it also builds self.indexer from submodules.indexer.
- DeepseekV4AttentionSubmodules extended with compressor and indexer fields.
- DeepseekV4Attention.forward now dispatches on self.compress_ratio:
  - 0 — dense / SWA over local KV.
  - 128 — HCA: compressed pool with compress-base partial RoPE on indices [0..P), broadcast to H heads, concat to local KV with a compressed-causal mask, joint softmax-with-sink shared across local + compressed branches.
  - 4 — CSA: per-query top-K from compressed pool via Indexer + overlap-mode Compressor, joint softmax-with-sink across local + sparse keys.
- _LegacyDeepseekV4Attention and _LegacyDeepseekV4AttentionSubmodules removed.
primus/backends/megatron/core/transformer/{csa,hca}_attention.py deleted.
primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_layer_specs.py
- _build_v4_attention_submodules now also builds compressor / indexer ModuleSpecs for compressed branches.
- linear_q_up_proj switched to provider.column_parallel_linear() (gather_output=True); linear_o_b (grouped) and linear_proj (flat-O fallback) switched to provider.row_parallel_linear() (input_is_parallel=False). At tp > 1 the projection weights are sharded across TP ranks; at tp = 1 the result is bit-identical to the previous duplicated path. linear_q_down_proj, linear_kv, linear_o_a stay duplicated; full grouped-O TP plan is tracked in P14.
primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py
- _build_attention (no-spec fallback) now constructs DeepseekV4Attention for all branches; the new class builds its own Compressor / Indexer locally when no spec is provided.
tests/unit_tests/megatron/transformer/deepseek_v4/test_deepseek_v4_attention.py
- HCA forward shape + finiteness + numerical equivalence vs an inline reference (≤ 1e-3); CSA forward shape + finiteness; spec wiring contract tests for ColumnParallel / RowParallel and Compressor / Indexer presence; torchrun --nproc_per_node=2 parity scaffold (skipif single-rank).

Status

deepseek-v4/develop/progress/status.md: P13 fully checked off (including the items previously deferred to the follow-up commit). Items routed to P14 (full grouped-O TP plan), P19 (full TP=2 sharding-parity bit-equality check), or the deferred P22+ (HF-reference numerical alignment via the state-dict adapter, originally P17) are noted as such on each row.

Schedule

Block A (extended): 2026-05-01 — plan-2 P13 first commit cad0fb38 (early start; the May 02 – 05 holiday remains).
Holiday: 2026-05-02 → 2026-05-05.
Block B (planned): 2026-05-06 → 2026-05-09 — P14 – P21 across 4 working days. P13 follow-up aa9929a0 is recorded under May 06 in the daily plan.

What landed in `1a8bf32e` (P14 phase-1 — faithful pre-mul clamped SwiGLU + V4 routers)

P14 ships in two commits. This one lands the math + parameter-layout faithfulness so V4-Flash checkpoints will load through the future state-dict adapter (originally P17, now deferred to P22+ by the 2026-05-01 reshuffle) without remapping. The structural refactor (DeepseekV4MoE(MoELayer) subclassing, provider helpers, G5 1L MoE forward) is the P14 phase-2 follow-up.

Activation (G3)

primus/backends/megatron/core/transformer/clamped_swiglu.py
- Replace post-multiplication clamp with V4 pre-multiplication semantics: SiLU(clamp(gate, max=alpha)) * clamp(up, +/- alpha). New helpers clamped_swiglu_pre_mul(gate, up, alpha) (split inputs) and clamped_swiglu_pre_mul_fused(x, alpha) ([gate | up] last-dim concat for grouped-gemm experts).
- ClampedSwiGLUMLP now uses separate w1 / w2 / w3 Linears so the released checkpoint (Expert(w1, w2, w3, swiglu_limit)) loads without remapping. Optional fused_gate_up=True fuses the gate / up GEMMs at forward time only; the saved / loaded state_dict keys remain w1.weight / w2.weight / w3.weight.
primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py
- _DenseSwiGLUMLP now applies the same pre-mul clamp on its dense head/tail layers; previously it computed vanilla SiLU(gate) * up and ignored swiglu_limit.
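
A plain-Python sketch of the pre-multiplication clamp described above, written element-wise on lists so it runs without torch. The helper names mirror the ones in the PR, but the bodies are an illustration, not the shipped code. The post-mul function shows one plausible form of the earlier behavior and is included only for contrast; its exact form is an assumption.

```python
import math
from typing import List


def silu(x: float) -> float:
    # Numerically stable sigmoid, then x * sigmoid(x).
    if x >= 0:
        sig = 1.0 / (1.0 + math.exp(-x))
    else:
        e = math.exp(x)
        sig = e / (1.0 + e)
    return x * sig


def clamped_swiglu_pre_mul(gate: List[float], up: List[float], alpha: float) -> List[float]:
    """SiLU(clamp(gate, max=alpha)) * clamp(up, -alpha, +alpha), element-wise."""
    if alpha <= 0:  # alpha = 0 disables the clamp
        return [silu(g) * u for g, u in zip(gate, up)]
    return [silu(min(g, alpha)) * max(-alpha, min(u, alpha)) for g, u in zip(gate, up)]


def clamped_swiglu_pre_mul_fused(x: List[float], alpha: float) -> List[float]:
    """Same math on a fused [gate | up] vector: first half gate, second half up."""
    half = len(x) // 2
    return clamped_swiglu_pre_mul(x[:half], x[half:], alpha)


def clamped_swiglu_post_mul(gate: List[float], up: List[float], alpha: float) -> List[float]:
    """Assumed form of the earlier variant: clamp after the multiply."""
    return [max(-alpha, min(silu(g) * u, alpha)) for g, u in zip(gate, up)]


pre = clamped_swiglu_pre_mul([10.0], [10.0], alpha=7.0)[0]
post = clamped_swiglu_post_mul([10.0], [10.0], alpha=7.0)[0]
print(pre > post)  # True: the two orderings give different activations
```

Note that the gate clamp is one-sided (only an upper bound), while the up branch is clamped on both sides.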

Learned router (G4)

primus/backends/megatron/core/transformer/moe/v4_topk_router.py
- Rename V4TopKRouter → DeepseekV4LearnedRouter (back-compat alias retained).
- Gate exposed as weight Parameter of shape [num_experts, hidden_size] — matches Megatron's TopKRouter.weight AND HF reference Gate.weight exactly (no gate.weight indirection).
- expert_bias is selection-only: routing weights are gathered from the un-biased scores, so the probs gradient flows to weight and never to expert_bias.
- Renormalization gated on score_function != "softmax" (HF parity; softmax probs already sum to 1).
- topk_scaling_factor honors moe_router_topk_scaling_factor (HF route_scale).
- Score functions: v4_score_fn covers softmax, sigmoid, sqrtsoftplus.
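
The routing rules listed above can be sketched for a single token as follows. This is a plain-Python illustration that takes the gate logits (hidden times `weight` transposed) as input; `route_one_token` is a hypothetical name, and the definition of `sqrtsoftplus` as `sqrt(softplus(z))` is an assumption inferred from its name.

```python
import math
from typing import List, Optional, Tuple


def v4_score_fn(logits: List[float], score_function: str) -> List[float]:
    if score_function == "softmax":
        m = max(logits)
        exps = [math.exp(z - m) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]
    if score_function == "sigmoid":
        return [1.0 / (1.0 + math.exp(-z)) for z in logits]
    if score_function == "sqrtsoftplus":
        # Assumed definition: sqrt(log(1 + e^z)).
        return [math.sqrt(math.log1p(math.exp(z))) for z in logits]
    raise ValueError(f"unknown score_function: {score_function}")


def route_one_token(
    logits: List[float],
    topk: int,
    score_function: str,
    expert_bias: Optional[List[float]] = None,
    scaling_factor: float = 1.0,
) -> Tuple[List[int], List[float]]:
    scores = v4_score_fn(logits, score_function)
    biased = scores if expert_bias is None else [s + b for s, b in zip(scores, expert_bias)]
    # Selection uses the biased scores ...
    experts = sorted(range(len(scores)), key=lambda e: biased[e], reverse=True)[:topk]
    # ... but the routing weights are gathered from the un-biased scores.
    weights = [scores[e] for e in experts]
    if score_function != "softmax":  # softmax probs are not renormalized
        total = sum(weights)
        weights = [w / total for w in weights]
    return experts, [w * scaling_factor for w in weights]


experts, weights = route_one_token(
    [0.0, 1.0, 2.0, 3.0], topk=2, score_function="sigmoid",
    expert_bias=[10.0, 0.0, 0.0, 0.0],
)
print(experts)  # [0, 3]: the bias changes which experts are picked, not their weights
```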

Hash router (G4)

primus/backends/megatron/core/transformer/moe/v4_hash_router.py
- Rename HashRouter → DeepseekV4HashRouter (back-compat alias retained).
- Add a learnable weight Parameter with the same shape as the learned router's; previously the hash router emitted uniform 1/topk weights, which broke gradient flow into the gate weights and silently differed from the released checkpoint.
- tid2eid is now a frozen nn.Parameter(requires_grad=False, dtype=torch.int32) (matches HF reference layout — released checkpoint stores it as a parameter so state-dict round-trips preserve it without polluting the optimizer state).
- forward(hidden, token_ids) gathers learned scores at the static expert ids prescribed by tid2eid[token_ids]; renorm + scale parity with the learned router.
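
A minimal single-token sketch of the hash-routing step described above. A list of lists stands in for the frozen int32 `tid2eid` table, sigmoid scoring is assumed for brevity, and `hash_route_one_token` is a hypothetical name rather than the router's actual method.

```python
import math
from typing import List, Sequence, Tuple


def hash_route_one_token(
    logits: Sequence[float],
    token_id: int,
    tid2eid: Sequence[Sequence[int]],
    scaling_factor: float = 1.0,
) -> Tuple[List[int], List[float]]:
    """Expert ids come from the static table; weights come from the learned gate."""
    if not 0 <= token_id < len(tid2eid):
        raise IndexError(f"token_id {token_id} is outside the tid2eid table")
    experts = list(tid2eid[token_id])  # fixed per token id, never learned
    scores = [1.0 / (1.0 + math.exp(-logits[e])) for e in experts]  # sigmoid assumed
    total = sum(scores)
    return experts, [scaling_factor * s / total for s in scores]


table = [[0, 1], [2, 3]]  # toy table: 2 token ids, topk = 2
experts, weights = hash_route_one_token([0.0, 0.0, 1.0, 3.0], token_id=1, tid2eid=table)
print(experts)                  # [2, 3]
print(weights[0] < weights[1])  # True: weights follow the learned scores, not 1/topk
```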

MoE wiring

primus/backends/megatron/core/transformer/moe/v4_moe.py
- _route now passes (hidden, token_ids) to the hash router; both routers receive hidden_size / score_function / topk_scaling_factor at init.

Tests

tests/unit_tests/megatron/transformer/deepseek_v4/test_clamped_swiglu.py — 7 tests cover pre-mul activation vs HF reference (≤ 1e-6 fp32, four alpha values), alpha = 0 disables clamp, fused-vs-split agreement, one-sided gate clamp behavior, w1 / w2 / w3 state-dict keys (no gate_up.weight leak), fused_gate_up forward equivalence, end-to-end ClampedSwiGLUMLP vs HF Expert.forward.
tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_routers.py — 13 tests:
- Score function: parity vs inline reference for all three functions.
- Learned router: HF agreement across (softmax, sigmoid, sqrtsoftplus) × (with / without expert_bias) ≤ 1e-6; back-compat alias; gradient flows to gate weight; expert_bias detached from probs graph; softmax skips renorm.
- Hash router: HF agreement across the three score functions ≤ 1e-6; tid2eid is a frozen Parameter (requires_grad=False, dtype int32); state-dict keys; deterministic table across seeds; OOB / shape-mismatch error paths; gradient flows to weight while tid2eid.grad is None.

Status

deepseek-v4/develop/progress/status.md: P14 phase-1 tasks checked off with this commit hash (1a8bf32e); deferred items listed for the phase-2 follow-up; the "HashRouter has no learnable gate weight / clamped SwiGLU clamps post-mul" blocker is marked resolved.

Schedule

Block A (extended): 2026-05-01 — plan-2 P14 phase-1 commit 1a8bf32e (continuing the early start; May 02 – 05 holiday remains).
Block B (planned): 2026-05-06 → 2026-05-09 — P14 phase-2 + P15 – P21 across 4 working days. P13 follow-up aa9929a0 and P14 phase-1 1a8bf32e are recorded under May 01 / 06 in the daily plan.

What landed in `5fe8bc3c` (P14 phase-2 — V4 MoE structural bring-up + G5)

Closes plan-2 P14 by bringing DeepseekV4MoE into Megatron's spec lifecycle, exposing a CPU-testable forward path so the MoE math is pinned against the released HF reference, and adding the V4 provider helpers that plan-2 §5 / §6 call for.

`DeepseekV4MoE` → `MegatronModule`

primus/backends/megatron/core/transformer/moe/v4_moe.py
- Parent class switched from nn.Module to MegatronModule so it inherits the standard config plumbing and integrates with TransformerLayer.mlp via the spec lifecycle.
- BaseMoELayer-compatible public surface: set_layer_number(layer_number) mirrors BaseMoELayer.set_layer_number; local_expert_indices is exposed as a list attribute.

CPU local-experts path

primus/backends/megatron/core/transformer/moe/v4_moe.py
- When pg_collection is None, __init__ skips the dispatcher / grouped-experts construction and instead builds:
  - local_experts: nn.ModuleList[ClampedSwiGLUMLP] — one ClampedSwiGLUMLP per local expert (mirrors HF reference Expert exactly: separate w1 / w2 / w3 Linears + V4 pre-multiplication clamp).
  - shared_expert: ClampedSwiGLUMLP — a single shared expert with the same activation.
- _local_experts_forward runs a per-expert dispatch loop matching DeepSeek-V4-Flash/inference/model.py:MoE.forward exactly (for each routed expert, gather routed tokens, multiply by per-token routing weight, accumulate). Production path (pg_collection provided) continues to use the Megatron dispatcher + grouped experts unchanged.
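
The per-expert dispatch loop described above (gather routed tokens, scale by the per-token routing weight, accumulate) can be sketched in plain Python. Experts are modeled as callables on float vectors, and adding the optional shared expert to every token is an assumption about how the shared branch is combined; none of this is the actual `_local_experts_forward` body.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]
Expert = Callable[[Vector], Vector]


def local_experts_forward(
    tokens: Sequence[Vector],
    topk_ids: Sequence[Sequence[int]],
    topk_weights: Sequence[Sequence[float]],
    experts: Sequence[Expert],
    shared_expert: Optional[Expert] = None,
) -> List[Vector]:
    """tokens: [T][D]; topk_ids / topk_weights: [T][K]."""
    out = [[0.0] * len(tok) for tok in tokens]
    for eid, expert in enumerate(experts):
        for t, (ids, weights) in enumerate(zip(topk_ids, topk_weights)):
            for slot, routed in enumerate(ids):
                if routed != eid:
                    continue  # token t was not routed to this expert
                y = expert(tokens[t])
                for d, value in enumerate(y):
                    out[t][d] += weights[slot] * value
    if shared_expert is not None:  # assumed: shared branch is added for every token
        for t, tok in enumerate(tokens):
            for d, value in enumerate(shared_expert(tok)):
                out[t][d] += value
    return out


toy_experts = [lambda v: [1.0 * x for x in v], lambda v: [2.0 * x for x in v]]
print(local_experts_forward([[1.0, 2.0]], [[0, 1]], [[0.25, 0.75]], toy_experts))
# [[1.75, 3.5]]
```

The loop is quadratic in the worst case and is meant for CPU parity testing, which matches the PR's framing of this path as test-only.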

Provider helpers (plan-2 P14 §5 / §6)

primus/backends/megatron/core/extensions/transformer_engine_spec_provider.py
- DeepSeekV4SpecProvider.v4_grouped_mlp_spec(swiglu_limit, moe_use_grouped_gemm=True, ...) returns a ready-to-use ModuleSpec(grouped_module, MLPSubmodules) for the V4 MoE expert path. The pre-mul clamp itself is applied via config.activation_func_clamp_value — Megatron's eager glu() (mlp.py:312-321) already implements SiLU(clamp(gate, max=alpha)) * clamp(up, +/- alpha), which is bit-equal to the HF reference math; the spec only commits to the right grouped module + the column / row-parallel linears.
- DeepSeekV4SpecProvider.v4_router_spec(learned=True/False) returns a bare ModuleSpec for either DeepseekV4LearnedRouter or DeepseekV4HashRouter.

G5 numerical alignment

tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_moe.py — 11 tests:
- Construction sanity: parent class is MegatronModule; CPU path builds local_experts (ClampedSwiGLUMLP) + shared_expert; the token_dispatcher / grouped_experts attributes stay None; set_layer_number propagates.
- Learned-router MoE forward vs inline HF reference on a 1L toy across (sqrtsoftplus, sigmoid, softmax) × (shared expert on / off) — ≤ 1e-3 fp32 CPU.
- Hash-router MoE forward vs HF across the three score functions, with token_ids feeding tid2eid — ≤ 1e-3 fp32 CPU.
- moe_router_topk_scaling_factor (HF route_scale) propagates to the output.
- Backward populates grads on router.weight, on the shared expert, and on at least one routed expert's w1 / w2 / w3.
- Hash layer raises a clear error when token_ids is missing.

Status

deepseek-v4/develop/progress/status.md — P14 phase-2 tasks ticked with this commit; the structural row records the MegatronModule-via-CPU-path approach and explicitly defers the TopKRouter-rooted aux-loss / z-loss path to P19 alongside the distributed re-validation matrix (rationale: upstream TopKRouter.__init__ registers CUDA buffers unconditionally, which is impractical for CPU-clean V4 routers; gating that on a device check is out-of-scope for this commit).

Schedule

Block A (extended): 2026-05-01 — plan-2 P14 phase-2 commit 5fe8bc3c (continuing the early start; May 02 – 05 holiday remains).
Block B (planned): 2026-05-06 → 2026-05-09 — P15 – P21 across 4 working days. P13 follow-up aa9929a0, P14 phase-1 1a8bf32e, and P14 phase-2 5fe8bc3c are recorded under May 01 in the daily plan.

What landed in `25ccdb5e` (P15 — V4 layer / block subclass refactor + token-ids forward kwarg + HC × PP packing)

Closes plan-2 P15 except the distributed PP-equivalence gate (G6) which is tracked into P19. This commit brings V4's layer / block onto Megatron's TransformerLayer / TransformerBlock parents, drops the decoder._v4_token_ids attribute stash in favor of a real forward kwarg, gates HyperHead to the post_process stage, and extracts HC × PP K-stream packing helpers.

`DeepseekV4HybridLayer` → `TransformerLayer`

primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py
- Parent class switched from GraphableMegatronModule to TransformerLayer. TransformerLayer.__init__ is bypassed (V4's submodule contract differs — no cross-attention, no BDA, V4-specific attention signature); MegatronModule.__init__ is called directly.
- DeepseekV4HybridLayerSubmodules now extends TransformerLayerSubmodules and uses upstream-canonical field names: input_layernorm / self_attention / pre_mlp_layernorm / mlp. The two V4-specific HC mixer hooks attn_hc / ffn_hc remain, both default to None for hc_mult == 1.
- The layer's forward signature is now upstream-compatible: (hidden_states, attention_mask=None, *, position_ids=None, token_ids=None, **kwargs). attention_mask is accepted and ignored (V4 manages SWA / sink mask internally); position_ids is consumed from the caller (fallback to arange(S) for tiny smokes); **kwargs lets the layer plug into MultiTokenPredictionLayer (P16) without bespoke adapters.

`DeepseekV4TransformerBlock` → `TransformerBlock`

primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py
- Parent class switched from nn.Module to TransformerBlock (init bypass via MegatronModule for CPU instantiability; V4 has its own layer-spec / lift-lower pipeline). Type identity unlocks Megatron isinstance checks + sharded-state-dict integration.
- HyperHead is built only on the post_process stage. Earlier PP stages forward the K-stream tensor via _lower_streams_out (no per-stage HyperHead), saving memory and removing a correctness drift risk.

HC × PP K-stream packing helpers

_lift_streams_in(hidden_states, pre_process, hc_mult) / _lower_streams_out(x, post_process, hc_mult) extracted as module-level helpers in deepseek_v4_block.py.
- First PP stage: [S, B, D] -> [B, S, K, D] (broadcast across K).
- Non-first PP stage: [S*K, B, D] -> [B, S, K, D] (unfold packed K).
- Final stage: [B, S, D] -> [S, B, D] (post-HyperHead transpose).
- Non-final stage: [B, S, K, D] -> [S*K, B, D] (pack K into seq for PP P2P).
- Both helpers raise clear errors on shape mismatches.
The packing math is intentionally K-folded-into-seq (not the batch axis) so sequence-parallel chunking lines up cleanly; PP P2P doesn't need to know about K.
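
The shape contract of the two helpers can be illustrated with a framework-free sketch on nested lists. The function names are hypothetical, and the row ordering `s * K + k` inside the packed sequence is an assumption; the source only states that K is folded into the sequence axis.

```python
def lower_pack(x, hc_mult):
    """[B, S, K, D] -> [S*K, B, D]; packed row index is s*K + k (assumed)."""
    B, S = len(x), len(x[0])
    if any(len(x[b][s]) != hc_mult for b in range(B) for s in range(S)):
        raise ValueError("expected K == hc_mult streams per position")
    return [
        [x[b][s][k] for b in range(B)]
        for s in range(S)
        for k in range(hc_mult)
    ]


def lift_unpack(packed, hc_mult):
    """[S*K, B, D] -> [B, S, K, D]; inverse of lower_pack."""
    if len(packed) % hc_mult != 0:
        raise ValueError(
            f"packed seq {len(packed)} is not a multiple of K={hc_mult}"
        )
    S, B = len(packed) // hc_mult, len(packed[0])
    return [
        [[packed[s * hc_mult + k][b] for k in range(hc_mult)] for s in range(S)]
        for b in range(B)
    ]


def lift_broadcast(hidden, hc_mult):
    """First stage: [S, B, D] -> [B, S, K, D], copying each vector across K."""
    S, B = len(hidden), len(hidden[0])
    return [
        [[list(hidden[s][b]) for _ in range(hc_mult)] for s in range(S)]
        for b in range(B)
    ]
```

Since both directions are pure index permutations, a lower followed by a lift reproduces the input exactly, which is the property the CPU roundtrip tests rely on.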

Token-ids forward kwarg

primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_model.py
- DeepseekV4Model.forward no longer assigns decoder._v4_token_ids (and removes the try/finally cleanup). It now passes token_ids=input_ids and position_ids=position_ids directly to self.decoder(...).
- The decoder block + each layer consume them as standard forward kwargs and propagate to mlp.forward -> hash_router.forward.
- An AST-level audit (test_v4_block_pp.py::test_model_forward_does_not_set_decoder_v4_token_ids_attribute) prevents the attribute stash from regressing.
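
An AST-level audit of this kind can be sketched with the standard-library `ast` module. This is an illustration of the approach, not the test in `test_v4_block_pp.py`: the helper name and the two sample sources are hypothetical.

```python
import ast


def find_attribute_uses(source: str, attr_name: str) -> list[int]:
    """Return line numbers where `<expr>.<attr_name>` is read or written."""
    tree = ast.parse(source)
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Attribute) and node.attr == attr_name
    )


# Hypothetical "before" and "after" shapes of a model forward.
OLD = (
    "def forward(self, input_ids):\n"
    "    self.decoder._v4_token_ids = input_ids\n"
    "    return self.decoder(None)\n"
)
NEW = (
    "def forward(self, input_ids):\n"
    "    return self.decoder(None, token_ids=input_ids)\n"
)
```

Working on the parsed tree rather than on raw text means that comments and docstrings mentioning the old attribute do not trigger the audit.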

Spec wiring + MTP block update

primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_layer_specs.py renames the four core fields when constructing DeepseekV4HybridLayerSubmodules: attn_norm → input_layernorm, attention → self_attention, ffn_norm → pre_mlp_layernorm, ffn → mlp.
primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_mtp.py switches the per-MTP-layer call to layer(stream, position_ids=..., token_ids=...) (kwarg, not positional) to match the new layer forward signature.

Tests (`tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_block_pp.py`, 16 tests)

Subclass identity: DeepseekV4HybridLayer is a TransformerLayer; DeepseekV4TransformerBlock is a TransformerBlock; DeepseekV4HybridLayerSubmodules extends TransformerLayerSubmodules and exposes attn_hc / ffn_hc.
Lift / lower roundtrip: bit-exact across the four PP-stage permutations (pre_process × post_process), for both single-stream (hc_mult=1) and multi-stream (K=3, K=4).
Error paths: misaligned S*K on non-first stage; collapsed input on non-final lower; uncollapsed input on final lower.
Token-ids stash: AST audit confirms decoder._v4_token_ids is gone from the model source; token_ids=input_ids kwarg is present.
Forward signatures: block.forward exposes position_ids + token_ids kwargs; layer.forward accepts (hidden_states, attention_mask=None, position_ids, token_ids).

Status / blockers

deepseek-v4/develop/progress/status.md — Phase 15 tasks ticked except G6 (PP=1 vs PP=2 vs PP=4 equivalence on a 4L toy), which requires distributed init and is tracked into P19 distributed re-validation. The CPU-only sub-gate — _lift_streams_in after _lower_streams_out is bit-exact — is covered by the lift/lower roundtrip tests, which is the math contract a real PP run depends on.
Two blocker rows resolved:
- "Custom V4 block / layer / MoE bypass TransformerBlock / TransformerLayer / MoELayer" — closed by P14 phase-2 + P15.
- "Token-IDs propagation via decoder._v4_token_ids attribute" — closed by P15.

Schedule

Block A (extended): 2026-05-01 — plan-2 P15 commit 25ccdb5e (continuing the early start; May 02 – 05 holiday remains).
Block B (planned): 2026-05-06 → 2026-05-09 — P16 – P21 across 4 working days. P14 phase-1 1a8bf32e, P14 phase-2 5fe8bc3c, and P15 25ccdb5e are recorded under May 01 in the daily plan.

What landed in `6c5875d4` (P16 — spec-based MTP via `MultiTokenPredictionBlock` + `process_mtp_loss`)

Closes plan-2 P16 except the distributed MTP-loss ablation gate (G7), which is tracked into P19 alongside G6. This commit wires V4 onto Megatron's upstream MTP pipeline so the auxiliary multi-token-prediction loss flows through process_mtp_loss (per-depth shifted logits + MTPLossAutoScaler) instead of the standalone primus-owned MTP block. The legacy DeepseekV4MTPBlock remains behind the v4_use_custom_mtp_block config flag for back-compat with research checkpoints (planned removal: P17 — moved up from P21 by the 2026-05-01 reshuffle) and now emits a DeprecationWarning on construction.

Spec helper (`primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_mtp_specs.py`, new)

get_v4_mtp_block_spec(config, *, transformer_layer_spec, vp_stage) returns
ModuleSpec(MultiTokenPredictionBlock, submodules=MultiTokenPredictionBlockSubmodules(layer_specs=[...]*mtp_num_layers)).
Each per-depth MultiTokenPredictionLayer spec pulls
- enorm / hnorm / layer_norm from DeepSeekV4SpecProvider.v4_norm_module()
- eh_proj from provider.column_parallel_linear()
- mtp_model_layer from the V4 hybrid-layer spec passed in by the model — so each MTP depth shares HC, hash routing, and clamped-SwiGLU with the main decoder exactly.
Rejects mtp_num_layers < 1 with a clear ValueError.
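
The shape of the returned spec can be sketched with plain dataclasses. Everything here is a stand-in: `Spec` only imitates the module-plus-submodules idea of Megatron's `ModuleSpec`, the string module names replace real classes, and `build_mtp_block_spec` is a hypothetical name rather than the actual helper.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Spec:
    """Stand-in for a module spec: a module reference plus submodules."""
    module: str
    submodules: Any = None


def build_mtp_block_spec(mtp_num_layers: int, transformer_layer_spec: Spec) -> Spec:
    """Build one per-depth layer spec and repeat it mtp_num_layers times."""
    if mtp_num_layers < 1:
        raise ValueError(f"mtp_num_layers must be >= 1, got {mtp_num_layers}")
    depth_spec = Spec(
        module="MultiTokenPredictionLayer",
        submodules={
            "enorm": "v4_norm",
            "hnorm": "v4_norm",
            "layer_norm": "v4_norm",
            "eh_proj": "column_parallel_linear",
            # The inner layer is passed through unchanged, so every MTP
            # depth reuses the main decoder's layer definition.
            "mtp_model_layer": transformer_layer_spec,
        },
    )
    return Spec(
        module="MultiTokenPredictionBlock",
        submodules={"layer_specs": [depth_spec] * mtp_num_layers},
    )
```

Repeating a spec is cheap because a spec is only a description; separate module instances are created later when each entry is built.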

`DeepseekV4Model` updates (`deepseek_v4_model.py`)

New default path: when mtp_num_layers > 0 and not v4_use_custom_mtp_block, __init__ builds self.mtp = MultiTokenPredictionBlock(spec=get_v4_mtp_block_spec(...)) on stages where mtp_on_this_rank() is True. mtp_on_this_rank is wrapped in try/except so CPU smokes (no parallel_state) do not crash; self.mtp_process is False and self.mtp is None on those paths.
Legacy DeepseekV4MTPBlock path stays available behind v4_use_custom_mtp_block; self.mtp_block is the legacy slot, self.mtp is the new spec-based slot. Both are None when MTP is disabled.
forward now mirrors GPTModel.forward: runs self.mtp(...) on stages with MTP layers (passing input_ids / position_ids / hidden_states / attention_mask / embedding / packed_seq_params), then on post_process with mtp_num_layers > 0 calls process_mtp_loss(...) which chunks the concatenated hidden states, computes the per-depth shifted MTP loss, and folds it into the gradient via MTPLossAutoScaler.
New forward kwargs: loss_mask (forwarded to process_mtp_loss) and packed_seq_params.

Layer / block forward contract

DeepseekV4HybridLayer.forward now returns (hidden_states, None) instead of just hidden_states. This matches upstream TransformerLayer (which returns (hidden_states, context)) and is required by MultiTokenPredictionLayer._proj_and_transformer_layer which unpacks hidden_states, _ = self.mtp_model_layer(...).
DeepseekV4TransformerBlock's per-layer iteration updates to x, _ = layer(...).
Legacy DeepseekV4MTPBlock likewise updates to unpack the tuple.

V4 attention spec advertises `attn_mask_type`

The V4 attention spec now declares params={"compress_ratio": ..., "attn_mask_type": AttnMaskType.causal}. MultiTokenPredictionLayer.__init__ validates the inner layer's self_attention.params['attn_mask_type'] against {padding, causal, no_mask, padding_causal}; without this the MTP block fails to construct. The value is functionally inert for V4 (which manages its own SWA / sink mask).
DeepseekV4Attention.__init__ accepts and ignores attn_mask_type plus a **kwargs catch-all so the spec lifecycle keeps working.
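
Why the declaration matters even though V4 ignores it can be shown with a small validation sketch. `MaskType` is a stand-in enum whose members are the four names listed above (its numeric values are arbitrary), and `check_attention_params` is a hypothetical function, not the upstream check.

```python
from enum import Enum


class MaskType(Enum):
    """Stand-in for the mask-type enum; values are illustrative only."""
    padding = 1
    causal = 2
    no_mask = 3
    padding_causal = 4


def check_attention_params(params: dict) -> MaskType:
    """Fail early if the attention spec does not advertise a mask type."""
    mask = params.get("attn_mask_type")
    if not isinstance(mask, MaskType):
        raise ValueError(
            f"attention spec must declare attn_mask_type, got {mask!r}"
        )
    return mask
```

A spec that carries only `compress_ratio` would be rejected by a check like this, which is the construction failure the added `attn_mask_type` entry avoids.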

Legacy `DeepseekV4MTPBlock` (`deepseek_v4_mtp.py`)

Module docstring annotated as deprecated (planned removal: P17 — moved up from P21 by the 2026-05-01 reshuffle).
Construction emits a DeprecationWarning pointing users at get_v4_mtp_block_spec. Code path unchanged otherwise.
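
The warn-on-construction pattern is standard-library only and can be sketched as follows. `LegacyBlock` is a hypothetical stand-in for the deprecated class, and the message text is illustrative.

```python
import warnings


class LegacyBlock:
    """Stand-in for a deprecated module that still works but warns."""

    def __init__(self) -> None:
        warnings.warn(
            "LegacyBlock is deprecated; build the spec-based block instead.",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
```

A test can assert the behaviour by recording warnings with `warnings.catch_warnings(record=True)` and checking the category of the captured entry.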

Tests (`tests/.../test_v4_mtp.py`, ~17 tests)

get_v4_mtp_block_spec structural assertions: outer module is MultiTokenPredictionBlock; layer_specs length matches mtp_num_layers (parametrised 1/2/3); each per-depth spec is a MultiTokenPredictionLayer; the V4 inner layer is threaded through unchanged; norm + linear come from the V4 provider.
Rejects mtp_num_layers=0 with a clear ValueError.
DeepseekV4HybridLayerSubmodules extends TransformerLayerSubmodules so MTP picks up the GPT path (not Mamba) in its inner-layer-submodules isinstance check.
DeepseekV4HybridLayer.forward returns (hidden_states, None) (source-level assertion on return x, None).
V4 attention spec advertises AttnMaskType.causal (source-level assertion).
Legacy DeepseekV4MTPBlock emits DeprecationWarning on construction.
AST audits on deepseek_v4_model.py: process_mtp_loss is called; upstream MTP machinery is imported; spec helper is invoked; v4_use_custom_mtp_block flag is preserved; the mtp_num_layers > 0 guard keeps the no-MTP path inert.

Status / blockers

deepseek-v4/develop/progress/status.md — Phase 16 tasks ticked except G7 (MTP loss appears in train log; mtp_num_layers=0 vs mtp_num_layers=1 ablation matches LM loss to 1e-6), which requires distributed init + MultiTokenPredictionBlock runtime (CP / SP plumbing); tracked into P19 distributed re-validation alongside G6.
Two new follow-on rows recorded for the cross-cutting layer-tuple return + attention attn_mask_type declarations (both required by upstream MTP wiring).

Schedule

Block A (extended): 2026-05-01 — plan-2 P16 commit 6c5875d4 (continuing the early start; May 02 – 05 holiday remains).
Block B (planned): 2026-05-06 → 2026-05-09 — P17 – P21 across 4 working days. P14 phase-1 1a8bf32e, P14 phase-2 5fe8bc3c, P15 25ccdb5e, and P16 6c5875d4 are recorded under May 01 in the daily plan.

What landed in `e591b893` (P17 — code cleanup, gate G14)

P17 ships the dead-code retirement that was front-loaded from P21 in the 2026-05-01 reshuffle (f548d8b2). With pre-training as the release path, the HF state-dict adapter slot moved out (deferred to P22+) and the cleanup work moved up so P18's spec audit walks a clean tree.

Retired in this commit

primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_mtp.py — the legacy primus-owned DeepseekV4MTPBlock was deprecation-warned since P16 (6c5875d4); the spec-based path (get_v4_mtp_block_spec + upstream MultiTokenPredictionBlock + process_mtp_loss) is the only MTP route now.
DeepSeekV4TransformerConfig.v4_use_custom_mtp_block (legacy MTP gate) — removed.
DeepSeekV4TransformerConfig.mtp_compress_ratios (legacy-only field) — removed.
DeepseekV4Model.__init__ — single MTP branch on the spec path; the if v4_use_custom_mtp_block arm + self.mtp_block field are gone.

Dedup'd in this commit

primus/backends/megatron/core/transformer/local_rmsnorm.py (new) — one canonical LocalRMSNorm consumed by deepseek_v4_block.py (input_layernorm / pre_mlp_layernorm / final_layernorm fallback), deepseek_v4_attention.py (q_norm / kv_norm fallback closure), and compressor.py (kv_norm). The three pre-existing _RMSNorm definitions are deleted.
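
For reference, the RMSNorm computation that a single canonical class would own is sketched below on a plain vector. This is the textbook formula, not the contents of `local_rmsnorm.py`; the function name and the default `eps` are assumptions.

```python
import math
from typing import Sequence


def rms_norm(
    x: Sequence[float], weight: Sequence[float], eps: float = 1e-6
) -> list[float]:
    """y_i = x_i / sqrt(mean(x^2) + eps) * weight_i over the last dimension."""
    if len(x) != len(weight):
        raise ValueError("x and weight must have the same length")
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]
```

With unit weights and `eps=0`, the output has a mean square of exactly 1, which is a quick sanity check for any of the call sites.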

YAML cleanup

deepseek_v4_flash.yaml — inverted comment fixed: 4 = CSA (overlap) and 128 = HCA (non-overlap) match DeepseekV4Attention.forward dispatch.
deepseek_v4_pro.yaml + deepseek_v4_base.yaml — same canonical comment block added so all three V4 yamls are self-documenting.

Audit gate G14

tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_p17_dead_code.py (new):
- retired files gone (deepseek_v4_mtp.py, csa_attention.py, hca_attention.py).
- legacy import path raises ImportError; package __all__ no longer exposes DeepseekV4MTPBlock.
- DeepSeekV4TransformerConfig no longer carries v4_use_custom_mtp_block / mtp_compress_ratios.
- AST scan over every V4 source for runtime _v4_token_ids access (Attribute / Assign / Name) — docstring mentions are exempt.
- AST scan over every V4 source for class _RMSNorm shadow definitions — none allowed.
- parameterised yaml check that the canonical 4 = CSA / 128 = HCA mapping is documented.

Out of scope (kept, with notes in `status.md`)

primus/backends/megatron/core/transformer/dual_rope.py — load-bearing for V4's CSA / HCA dual-base partial RoPE; Megatron's RotaryEmbedding only supports a single base. Plan-2 was over-eager listing this for retirement; it stays.

What landed in `b5832672` (P18 — spec-system audit, gate G1 + D1 / D2 / D4)

P18 closes the spec-system audit findings D1 / D2 / D4 from 00-review-findings.md. Walking a clean tree (after P17) makes the audits crisp.

Provider singleton (D1)

primus/backends/megatron/core/models/deepseek_v4/build_context.py (new): resolve_v4_provider(config) caches a single DeepSeekV4SpecProvider on the config object via a private attribute. Different configs get different providers; the cache is GC'd when the config is released.
All three direct DeepSeekV4SpecProvider(config=config) call sites migrated to the helper:
- deepseek_v4_block.py (_build_projection + DeepseekV4MoE shared-expert wiring)
- deepseek_v4_layer_specs.py
- deepseek_v4_mtp_specs.py
AST audit (test_v4_p18_spec_audit.py::test_no_direct_DeepSeekV4SpecProvider_construction_outside_build_context) rejects future regressions; build_context.py is the only allowed instantiation site.
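
Caching a provider on the config object through a private attribute can be sketched as below. `Provider`, `resolve_provider`, and the attribute name `_v4_spec_provider` are hypothetical; only the caching idea comes from the description above.

```python
from types import SimpleNamespace

_CACHE_ATTR = "_v4_spec_provider"  # hypothetical private attribute name


class Provider:
    """Stand-in for the spec provider; it only remembers its config."""

    def __init__(self, config) -> None:
        self.config = config


def resolve_provider(config) -> Provider:
    """Return the provider cached on `config`, creating it on first use."""
    provider = getattr(config, _CACHE_ATTR, None)
    if provider is None:
        provider = Provider(config)
        setattr(config, _CACHE_ATTR, provider)
    return provider


# Two independent configs (SimpleNamespace used as a toy config object).
cfg_a, cfg_b = SimpleNamespace(), SimpleNamespace()
```

Storing the cache on the config, rather than in a module-level dict, ties the provider's lifetime to the config and avoids any global registry to clean up.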

Activation-func consistency (D2)

New helper DeepSeekV4SpecProvider.v4_mlp_activation_func() returns:
- None when config.use_te_activation_func is False — the V4 default; needed so Megatron MLP keeps the eager clamped-SwiGLU path (which applies activation_func_clamp_value).
- TEActivationOp (the TE class, instantiated by Megatron MLP at build) when the user opts into TE.
Layer specs + DeepseekV4MoE shared-expert spec switched to the V4 helper. The base provider's activation_func() is unchanged (BackendSpecProvider contract still says "returns a type").

`compress_ratios` normalization (D4)

DeepSeekV4TransformerConfig.__post_init__ calls _normalize_compress_ratios_field on the raw value once, so downstream consumers see tuple[int, ...] (or None). The helper handles strings ("[0, 0, 4, 128, ...]") and real lists.
Runtime helpers (_parse_int_sequence / _normalize_compress_ratios in deepseek_v4_block.py) keep accepting both forms for back-compat, but always receive the normalized form on the live path.
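
A normalization step of this kind might look like the sketch below. The function name is hypothetical and the string form is parsed with `ast.literal_eval` as an assumption; the actual helper may parse differently.

```python
import ast
from typing import Optional, Sequence, Union


def normalize_compress_ratios(
    raw: Union[None, str, Sequence[int]],
) -> Optional[tuple[int, ...]]:
    """Accept None, a string such as "[0, 0, 4, 128]", or a real sequence."""
    if raw is None:
        return None
    if isinstance(raw, str):
        raw = ast.literal_eval(raw)  # evaluates literals only, no code
    return tuple(int(v) for v in raw)
```

Running this once at config construction means downstream code can compare and hash the schedule as a tuple without re-parsing.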

Schema gate G1

tests/unit_tests/configs/test_deepseek_v4_yaml.py (new): parameterises over deepseek_v4_{base,flash,pro}.yaml:
- parse_yaml() succeeds; required fields present.
- DeepSeekV4TransformerConfig builds from the parsed dict.
- compress_ratios normalized to tuple[int, ...] with no value drift vs the raw schedule.
- every compress_ratios entry is in {0, 4, 128} (canonical V4 branches).
- retired P17 fields (v4_use_custom_mtp_block / mtp_compress_ratios) are gone from the dataclass and from each YAML.
- V4-specific runtime fields (HC, sliding-window, sink, o_groups / o_lora_rank, MoE extras, swiglu_limit) all declared on the dataclass.
- provider singleton: resolve_v4_provider(cfg_a) returns the same instance on repeated calls; different configs get different providers.
- v4_mlp_activation_func contract verified for both branches of use_te_activation_func.

Spec audit (light-weight, AST-only)

tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_p18_spec_audit.py (new):
- D1 / D2 audits described above.
- package surface __init__.py __all__ does not re-export DeepseekV4MTPBlock (P17 cross-check).
- spec builders do not eagerly construct TENorm / TE{Column,Row}ParallelLinear / TELinear / TEActivationOp inside __init__ — they emit ModuleSpec(module=...) references that runtime build_module resolves.

Schedule

Block A (extended): 2026-05-01 — plan-2 P17 + P18 commits e591b893 + b5832672 (continuing the early start; May 02 – 05 holiday remains).
Block B (planned): 2026-05-06 → 2026-05-09 — P19 – P21 across 4 working days. P17 + P18 are recorded under May 01 in the daily plan; P19 (distributed re-validation) is the first item in Block B.

What landed in `83c33ad0` (P19 — distributed re-validation) + `dba27163` (plan-2 close-out)

P19 closes the distributed re-validation gate (G10) for the architecture-faithful V4 stack landed across P13 → P18. All four target smokes pass 10/10 iterations on mi355-gpu-12 (BF16, MBS=1 GBS=16, seq=128, 8 layers / 3 hash layers / hc_mult=4); two torch.profiler chrome-trace JSONs (EP=8 and PP=2 EP=4) are captured for the perf-baseline reference.

Smokes (10 iters each)

smoke	parallelism	result	gating patch	log
A	TP=1 PP=1 EP=1	10/10	none (HC stays in-stage; PP > 1 patches are no-ops)	`deepseek-v4/develop/progress/p19/smokeA*.log`
B	TP=1 PP=2 EP=4	10/10	`pp_tensor_shape`	`p19/smokeB*.log`
C	TP=1 PP=4 EP=2	10/10	`pp_tensor_shape` + `pp_token_pre_broadcast`	`p19/smokeC_pp4_ep2_v2.log`
D	TP=1 PP=2 EP=4 VPP=2	10/10	`pp_tensor_shape` (also wraps the interleaved schedule) + `pp_token_pre_broadcast` (upfront)	`p19/smokeD_pp2_ep4_vpp2_v2_run3.log`

Profile traces

torch.profiler chrome-trace JSONs (single active step, iter 6 → 7) under the same V4 smoke config:

output/amd/tas-mi355x-20260507/p19_profile_pp1_ep8/tensorboard/...rank[0].*.pt.trace.json — TP=1 PP=1 EP=8 (~99 MB).
output/amd/tas-mi355x-20260507/p19_profile_pp2_ep4/tensorboard/...rank[0].*.pt.trace.json — TP=1 PP=2 EP=4 (~105 MB).

Launchers: deepseek-v4/develop/progress/p19/run_profile_ep8.sh and run_profile_pp2_ep4.sh.

`megatron.deepseek_v4.pp_tensor_shape` (`primus/backends/megatron/patches/deepseek_v4_pp_shape_patches.py`)

Wraps two Megatron entry points in megatron.core.pipeline_parallel.schedules so V4's mHC K = hc_mult packing is reflected on the PP wire:

get_tensor_shapes (used by 1F1B): seq dim multiplied by hc_mult so the receive buffer matches [S * K, B, D] instead of the stock [S, B, D].
forward_backward_pipelining_with_interleaving (used by VPP): seq_length kwarg multiplied by hc_mult before the schedule's inline tensor_shape = [seq_length, mbs, hidden] runs.

Both wrappers gate on model_type == "deepseek_v4" + hc_mult > 1 + PP > 1 and are strict no-ops otherwise. Without the second wrapper (the interleaved-schedule one), VPP allocates [S, B, D] recv buffers while the sender emits [S * K, B, D], and _lift_streams_in reshapes the truncated copy, which surfaces as DeepseekV4HashRouter: hidden=32 vs token_ids=128.
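
The gating and the sequence-dimension scaling can be sketched with a plain function wrapper. This is a simplified illustration: `scale_seq_dim` and `stock_shapes` are hypothetical names, the stand-in shape function does not reproduce Megatron's signature, and the gate is evaluated once at wrap time for brevity.

```python
import functools
from typing import Callable


def scale_seq_dim(
    get_shapes: Callable[..., list],
    *,
    model_type: str,
    hc_mult: int,
    pp_size: int,
) -> Callable[..., list]:
    """Wrap a shape function so each [S, B, D] becomes [S * hc_mult, B, D]."""
    active = model_type == "deepseek_v4" and hc_mult > 1 and pp_size > 1
    if not active:
        return get_shapes  # strict no-op: hand back the original callable

    @functools.wraps(get_shapes)
    def wrapper(*args, **kwargs):
        return [(s * hc_mult, b, d) for (s, b, d) in get_shapes(*args, **kwargs)]

    return wrapper


def stock_shapes(seq_length: int, micro_batch_size: int, hidden_size: int) -> list:
    """Stand-in for the stock shape function: one [S, B, D] buffer."""
    return [(seq_length, micro_batch_size, hidden_size)]
```

For the smoke shape (seq=128, hc_mult=4), an unscaled 128-row buffer unfolds to 128 / 4 = 32 positions, which is consistent with the hidden=32 vs token_ids=128 mismatch quoted above.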

`megatron.deepseek_v4.pp_token_pre_broadcast` (`primus/backends/megatron/patches/deepseek_v4_get_batch_patches.py`)

V4's hash-routed MoE layers (the first num_hash_layers) need raw input_ids on every PP stage that owns one, but pretrain_gpt.get_batch returns None on middle PP stages. Two earlier in-loop hooks both deadlocked under VPP — an in-DeepseekV4Model.forward broadcast and a per-call get_batch broadcast each raced the interleaved schedule's pre-warmup recv_forward.wait().

This patch wraps pp_module.get_forward_backward_func so each train_step first runs all num_microbatches × num_chunks PP dist.broadcast collectives upfront, before the schedule's first send / recv, and caches the resulting (tokens, labels, loss_mask, attention_mask, position_ids, packed_seq_params) tuples per (vp_stage, microbatch). A companion wrapper around pretrain_gpt.get_batch consumes the cache when active and falls back to the original implementation otherwise. Cache is reset in a finally after each schedule call. Cost ≈ mbs * seq * 8B per microbatch (~32 KiB / step on the smoke), dwarfed by the activation P2P.
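
The control flow (fetch everything upfront, serve from a cache keyed by `(vp_stage, microbatch)`, reset in a `finally`) can be sketched without any distributed calls. All names here are hypothetical, and `fetch` stands in for the broadcast; the sketch shows only the caching pattern, not the patch itself.

```python
from typing import Any, Callable, Dict, Optional, Tuple

Key = Tuple[int, int]  # (vp_stage, microbatch)


class BatchCache:
    """Holds pre-fetched batches for one schedule call."""

    def __init__(self) -> None:
        self._data: Optional[Dict[Key, Any]] = None

    def fill(self, fetch: Callable[[int, int], Any],
             num_chunks: int, num_microbatches: int) -> None:
        # Every fetch happens here, before the schedule starts.
        self._data = {
            (vp, mb): fetch(vp, mb)
            for vp in range(num_chunks)
            for mb in range(num_microbatches)
        }

    def get(self, vp_stage: int, microbatch: int,
            fallback: Callable[[int, int], Any]) -> Any:
        if self._data is None:  # cache inactive: use the original path
            return fallback(vp_stage, microbatch)
        return self._data[(vp_stage, microbatch)]

    def reset(self) -> None:
        self._data = None


def run_step(cache: BatchCache, fetch: Callable[[int, int], Any],
             schedule: Callable[[], Any],
             num_chunks: int, num_microbatches: int) -> Any:
    cache.fill(fetch, num_chunks, num_microbatches)
    try:
        return schedule()
    finally:
        cache.reset()  # never leak batches into the next step
```

Moving the fetches ahead of the schedule is what removes the ordering hazard: once the schedule is running, batch lookups are local dictionary reads and cannot interleave with its sends and receives.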

Model-side cleanup (`deepseek_v4_model.py`, `deepseek_v4_layer_specs.py`)

Drop the in-forward input_ids PP broadcast + VPP fail-fast assert from DeepseekV4Model; the pre-broadcast patch handles both 1F1B and VPP cleanly.
Stop pre-assigning self.mtp = None in __init__; Megatron's set_current_microbatch (in cuda_graphs.py) only iterates model.mtp.layers when MTP is actually live, which matches upstream GPTModel. Downstream MTP guards use getattr(self, "mtp", None).
Import DeepSeekV4SpecProvider in deepseek_v4_layer_specs.py so the type annotation resolves at module load (NameError surfaced once turbo path was off).

`c10d::allreduce_` autograd warning gone

The historical UserWarning: An operator was called with autograd not registered for c10d::allreduce_ came from the early bring-up's "local shard + torch.distributed.all_reduce" path for MoE routed-output aggregation in v4_moe.py. P14 phase-2 migrated MoE to Megatron's token dispatchers (MoEAlltoAllTokenDispatcher / MoEFlexTokenDispatcher); P17 deleted the v4_enable_ep_allreduce_fallback debug gate; and P19 confirms zero c10d::allreduce hits in stderr across all four smokes + the EP=8 / PP=2 EP=4 profile runs.

`dba27163` plan-2 close-out (docs-only)

status.md — mark c10d::allreduce_ warning as gone (with the verification log paths); mark G11 as [-] deferred (snapshot dump tooling never landed); drop Phase 20 / 21 / 22+ sections (kept as documented intent in plan-2/03-phase-details.md); refresh the Blockers / Risks log entry for c10d to reference the actual P19 verification rather than "still tracked into P19".
deepseek-v4/develop/progress/plan-2-summary.md (new) — stand-alone summary of the plan-2 architecture-faithful rewrite (P12 → P19): per-phase outcome with key commits; P19 deep-dive (smokes / profile traces / patches / c10d verification); test-gate ledger (G1 / G3 / G4 / G5 / G6 / G7 / G11 / G14 + smokes); plan-1 → plan-2 architectural-shift table (attention, MoE, layer / block, MTP, token-IDs path, HC × PP, TP, spec hygiene); explicit deferred / out-of-scope list (G6 distributed, G7 MTP, G11, P20, P21, P22+).
P19 profile launchers — run_profile_ep8.sh (TP=1 PP=1 EP=8) and run_profile_pp2_ep4.sh (TP=1 PP=2 EP=4); both set PROFILE=True + disable_tensorboard=False so the existing torch_profiler_patches.py hook captures iter 6 → 7.
deepseek-v4/download_ref.sh — idempotent helper that ensures git-lfs and clones the V4 reference assets at pinned commits (HF transformers, ROCm TransformerEngine, AMD-AGI/Primus-Turbo, NVIDIA-NeMo/Automodel, plus DeepSeek-V4-Pro / Flash / Flash-Base / Pro-Base) with GIT_LFS_SKIP_SMUDGE=1 so weights are not downloaded by default.

Schedule

Block A (extended): 2026-05-01 — plan-2 P12 → P18 commits (636ab3de → b5832672).
Block B (delivered): 2026-05-07 — plan-2 P19 (83c33ad0) + plan-2 close-out (dba27163).
Deferred follow-ups: P20 (200-step Megatron-bridge convergence + TE on / off perf report + FP8 follow-up plan), P21 (techblog / progress timeline / PPT refresh), P22+ (HF state-dict adapter + V4-Flash safetensors round-trip / token-0 logits ≤ 1e-2 vs HF reference). All three are documented in plan-2/03-phase-details.md; they re-enter active work when the next campaign (release, downstream integration ask, SFT / eval) needs them.

Test plan

Known risk / follow-up

~~EP routed-output path currently uses all_reduce and emits a PyTorch autograd warning (c10d::allreduce_ kernel registration). Functional for bring-up; gated behind the v4_enable_ep_allreduce_fallback debug toggle on the active path.~~ resolved (P14 phase-2 / P17 audit / P19 runtime verification): P14 phase-2 migrated the routed-output path to Megatron's token dispatchers; the v4_enable_ep_allreduce_fallback debug gate was deleted in P17 (e591b893); P19 smokes (A/B/C/D + EP=8 / PP=2 EP=4 profile runs on mi355-gpu-12) confirm zero c10d::allreduce warnings in stderr, so the EP routed-output reduction now flows entirely through Megatron's MoEAlltoAllTokenDispatcher / MoEFlexTokenDispatcher.
~~HC × PP — HyperHead per-stage application destroys K-stream context.~~ resolved (P15 + P19): DeepseekV4TransformerBlock packs [B, S, K, D] → [S*K, B, D] for PP P2P via _lower_streams_out and only applies HyperHead on the post_process stage. CPU-side bit-exact roundtrip covered by test_v4_block_pp.py. Runtime stability across PP > 1 verified by P19 smokes B / C / D with the pp_tensor_shape patch; distributed bit-equality across PP = 1 / 2 / 4 (G6) is a separate audit and is not on the pre-training release path (deferred follow-up).
~~decoder._v4_token_ids attribute stash — leaks state across PP and microbatches.~~ resolved (P15): DeepseekV4Model.forward now passes token_ids=input_ids directly to the decoder; AST audit prevents regressions.
No state-dict adapter — V4-Flash safetensors cannot be loaded. Deferred to P22+ by the 2026-05-01 reshuffle; pre-training does not need HF weights. Plan-2 P22+ (when activated by an SFT / evaluation campaign) lands the adapter and adds the HF numerical-alignment gate (G8 / G9). Design notes preserved in 02-target-architecture.md §7 + 03-phase-details.md (P22+ section).

Initial design / planning materials for integrating DeepSeek-V4 training support into Primus. Documentation only; no production code changes.
- techblog/: architecture deep dive (CSA / HCA / mHC / Hash routing / sqrtsoftplus / clamped SwiGLU / dual RoPE / Muon / MTP) plus 4 PNG diagrams rendered via Pillow (see render_diagrams.py).
- plan/: 8-phase roadmap, full code-landing list, per-phase task breakdown, and testing strategy.
- progress/status.md: 64-task checklist tracking phase progress.
- develop_deepseek-v4-in-primus.md: top-level goal and development cadence.

Made-with: Cursor

Phase 1 of the V4 development plan. Pure config; no Python code paths exercised yet. Subsequent phases (P2..P4) wire dispatch and modules.
* primus/configs/models/megatron/deepseek_v4_base.yaml
  Extends llama_base, sets model_type=deepseek_v4 and registers V4-specific defaults (hc_mult, hybrid_attention_*, q_lora_rank, attn_sink, hash routing, swiglu_limit, dual-RoPE knobs, etc.).
* primus/configs/models/megatron/deepseek_v4_flash.yaml
  Hyperparams from DeepSeek-V4-Flash/config.json.
* primus/configs/models/megatron/deepseek_v4_pro.yaml
  Hyperparams from DeepSeek-V4-Pro/config.json.
* examples/megatron/configs/MI355X/deepseek_v4_flash-BF16-pretrain.yaml
  Training scaffold; parallelism / perf knobs are conservative and will be retuned during the perf phase.
* primus/backends/megatron/training/tokenizer/tokenizer.py
  Add DeepSeekV4Tokenizer to CUSTOM_TOKENIZER_TYPES so _add_tokenizer_args accepts it.

Note: V4 fields do not need to be registered in Megatron's argparse: Primus's merge_namespace mechanism (train_runtime.py:_initialize_trainer) copies yaml-only fields onto backend_args after MegatronArgBuilder.update.

Made-with: Cursor

Phase 2 of the V4 development plan. Wires the end-to-end dispatch from yaml.model_type=deepseek_v4 to a primus-owned model_provider + builder, without changing model behaviour yet. The model class is still a thin GPTModel subclass; Phase 3 swaps the decoder for the V4 transformer block.
* primus/core/utils/import_utils.py
  Add a deepseek_v4 branch to get_model_provider() that imports primus.backends.megatron.core.models.deepseek_v4.deepseek_v4_builders and returns partial(model_provider, deepseek_v4_builder).
* primus/backends/megatron/megatron_pretrain_trainer.py
  Add a model_type == "deepseek_v4" branch alongside gpt / mamba. V4 is a causal-LM with the same data shape as GPT, so we reuse pretrain_gpt's forward_step + train_valid_test_datasets_provider; only the model_provider itself is V4-specific.
* primus/backends/megatron/core/models/deepseek_v4/__init__.py (new)
  Re-export DeepseekV4Model + deepseek_v4_builder + model_provider.
* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_model.py (new)
  DeepseekV4Model: thin subclass of GPTModel. P3 will replace self.decoder with DeepseekV4TransformerBlock.
* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_builders.py (new)
  deepseek_v4_builder + model_provider. Uses GPT layer specs in P2; P3 will swap them for V4 specs.

Made-with: Cursor

Phase 3 of the V4 development plan. Lands the V4 layer-spec helpers and a transparent V4 transformer-block subclass; attention / MLP behaviour still matches GPT. Phase 4 will plug HC + hybrid attention into the block, and Phase 5 will swap in V4 MoE / clamped SwiGLU through the spec-resolution hooks added here.
* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_layer_specs.py (new)
  Four V4 layer-spec helpers (layer / decoder_block / decoder_layer_specs / mtp_block) that delegate to the GPT helpers in P3, plus two resolution hooks (_resolve_attention_module_spec / _resolve_mlp_module_spec) that return None for now -- P4 / P5 fill these in.
* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py (new)
  DeepseekV4TransformerBlock: subclasses TransformerBlock and stashes V4 config fields (hc_mult, compress_ratios, attn_sliding_window, attn_sink, q_lora_rank, index_*) onto self so P4 patches don't have to re-walk the config. Forward behaviour unchanged in P3.
* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_model.py
  Override __init__: after super().__init__() builds the stock decoder, swap self.decoder for DeepseekV4TransformerBlock (same call signature so GPTModel.forward keeps working).
* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_builders.py
  _resolve_layer_spec / _resolve_mtp_block_spec now route through the V4 layer-spec helpers instead of the GPT helpers directly.
* primus/backends/megatron/core/models/deepseek_v4/__init__.py
  Re-export DeepseekV4TransformerBlock alongside the existing surface.

Made-with: Cursor

…dual-RoPE)

Phase 4 of the V4 development plan. Lands the full V4 transformer block: mHC multi-stream residual, per-layer hybrid attention dispatch (Dense / HCA / CSA), sliding-window mask, attention sink, dual-RoPE with YaRN. The V4 block becomes a standalone nn.Module that bypasses Megatron's TransformerBlock + ModuleSpec mechanism so the multi-stream HC loop is expressed cleanly. P5 will swap the placeholder SwiGLU MLP for V4's MoE.

New modules under primus/backends/megatron/core/transformer/ ::

* hyper_connection.py
  HyperMixer (per-layer mHC mixer), HyperHead (final K->1 collapse), sinkhorn_normalize (doubly-stochastic projection). Linear weights / scales / biases held in fp32 for stability; fp32 sinkhorn iterates. Unit-tested: row/col errors ~1e-6, hc_mult=1 degenerate path exact.
* compressor.py
  V4 compressor for KV downsampling. ratio=4 overlap mode (CSA, coff=2), ratio=128 non-overlap mode (HCA, coff=1). Internal RMSNorm + learnable APE; RoPE applied externally.
* indexer.py
  Sparse top-K position selector for CSA. Internal mini-Compressor builds the score grid; causal mask + top-K (-1 fill for invalid positions); backward propagates to the indexer params.
* sliding_window_kv.py
  Causal SWA mask + per-query KV index helpers.
* attn_sink.py
  Per-head learnable sink scalar; softmax_with_sink ensures probs.sum() <= 1 with the sink absorbing the residual mass. Backward propagates to the sink params.
* dual_rope.py
  Two RoPE bases (main + compress) with optional YaRN scaling. Partial interleaved RoPE: only ``rotary_dim`` of each head's channels rotated; remaining channels passed through unchanged.
* deepseek_v4_attention.py
  Shared base for V4 attention: QKV projection (optional Q LoRA), partial dual-RoPE, SWA mask, attention sink, output projection. ``_extra_kv`` hook lets HCA / CSA augment KV (full pool or sparse top-K).
* hca_attention.py
  Heavily-Compressed Attention. Subclasses DeepseekV4Attention; adds a non-overlap Compressor and concatenates the full compressed pool to the local KV (always visible).
* csa_attention.py
  Compressed-Sparse Attention. Subclasses DeepseekV4Attention; adds an overlap Compressor + Indexer; per-query attention is computed over the local SWA + the indexer's top-K compressed positions.

Updated:

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py
  Rewritten as a standalone nn.Module. Holds the dual-RoPE for the whole stack, builds DeepseekV4HybridLayer per layer (Dense/HCA/CSA picked from compress_ratios), and runs the K-stream HC loop. Forward shape: [S, B, D] -> [B, S, D] -> [B, S, K, D] -> ... -> [B, S, D] -> [S, B, D]. Smoke-tested: 8-layer mixed dense/CSA/HCA + hc_mult=4 forward / backward / causality OK.
* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_layer_specs.py
  Cleaned up to a placeholder spec. The V4 block is standalone and bypasses Megatron's spec mechanism; we still hand a valid GPT-shaped spec to GPTModel.__init__ until P6 refactors that allocation away.
* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_model.py
  Docstring rewritten for the P4 standalone-block layout; pg_collection switched to getattr(self, "pg_collection", None) for safety.
* deepseek-v4/develop/progress/status.md, plan/02-phase-details.md
  Track P1..P4 completion; add the argparse-not-needed note (Primus's merge_namespace covers V4 fields).

Made-with: Cursor
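The attention-sink behaviour described for attn_sink.py (probs.sum() <= 1, with the sink absorbing the residual mass) can be sketched in plain Python. This is only an illustration of the idea, not the PR's tensor implementation: the function below works on a single query's score list, and its signature is an assumption.

```python
import math
from typing import List


def softmax_with_sink(scores: List[float], sink: float) -> List[float]:
    """Softmax over ``scores`` where a sink logit joins the denominator only.

    Hypothetical scalar sketch: the sink takes probability mass but has no
    value vector, so the returned probabilities sum to less than 1.
    """
    m = max(max(scores), sink)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps) + math.exp(sink - m)
    return [e / denom for e in exps]


probs = softmax_with_sink([1.0, 2.0, 3.0], sink=0.0)
residual = 1.0 - sum(probs)  # mass absorbed by the sink
```

A very negative sink logit makes the residual vanish, which recovers an ordinary softmax.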

Copilot

Pull request overview

Adds a new model_type=deepseek_v4 to Primus’ Megatron backend, including V4 configs, model/provider dispatch, and an initial DeepSeek-V4 block implementation with HC + hybrid attention building blocks.

Changes:

Add DeepSeek-V4 model dispatch + builders and a Primus-owned V4 model package.
Introduce V4 config yamls (base/flash/pro) and a MI355X pretrain scaffold yaml.
Implement core V4 transformer components (HC, dual-RoPE, compressor, indexer, CSA/HCA attention, sliding-window helpers, attention sink).

Reviewed changes

Copilot reviewed 33 out of 37 changed files in this pull request and generated 10 comments.

File	Description
`primus/core/utils/import_utils.py`	Adds `deepseek_v4` branch to resolve the V4 model provider/builder.
`primus/backends/megatron/megatron_pretrain_trainer.py`	Dispatches `model_type=deepseek_v4` while reusing GPT data/forward_step plumbing.
`primus/backends/megatron/training/tokenizer/tokenizer.py`	Allows selecting `DeepSeekV4Tokenizer` via HF tokenizer wrapper.
`primus/configs/models/megatron/deepseek_v4_{base,flash,pro}.yaml`	Adds V4 model configs and defaults.
`examples/megatron/configs/MI355X/deepseek_v4_flash-BF16-pretrain.yaml`	Adds a training scaffold yaml for MI355X.
`primus/backends/megatron/core/models/deepseek_v4/*`	Adds V4 model/builders/spec placeholders and a standalone V4 block implementation.
`primus/backends/megatron/core/transformer/*`	Implements HC, dual-RoPE, compressor/indexer, CSA/HCA attention, SWA helpers, and attention sink.
`deepseek-v4/develop/**`	Adds development docs/diagrams and planning materials for the V4 integration.

Copilot · 2026-04-28T11:08:56Z

+# Per-layer compression schedule (from config.json:compress_ratios)
+# 0   = uncompressed dense layer (full attention with SWA)
+# 4   = HCA branch (compress ratio 4)
+# 128 = CSA branch (compress ratio 128)
+compress_ratios: "[0, 0, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 0]"


compress_ratios is currently a quoted string, so YAML will parse it as str rather than a list of ints. DeepseekV4TransformerBlock.__init__ does list(compress_ratios) and checks len(...) == num_layers, so this will either explode into a list of characters or fail the length check at runtime. Define this as a real YAML list (no quotes) or normalize the string to List[int] before the block consumes it; also ensure the list length matches num_layers (43).
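One way to do the normalization step mentioned above is sketched here. The helper name and the `expected_len` parameter are hypothetical; whether the expected length equals `num_layers` exactly is for the config owner to decide.

```python
import ast
from typing import List, Sequence, Union


def normalize_compress_ratios(
    value: Union[str, Sequence[int]], expected_len: int
) -> List[int]:
    """Coerce a YAML-loaded schedule into a list of ints and check its length."""
    if isinstance(value, str):
        # A quoted YAML scalar arrives as text such as "[0, 0, 4, 128]";
        # literal_eval parses it without executing arbitrary code.
        value = ast.literal_eval(value)
    ratios = [int(r) for r in value]
    if len(ratios) != expected_len:
        raise ValueError(
            f"compress_ratios has {len(ratios)} entries, expected {expected_len}"
        )
    return ratios


# list("[0, 4]") would instead yield characters: ['[', '0', ',', ' ', '4', ']']
schedule = normalize_compress_ratios("[0, 0, 4, 128]", expected_len=4)
```

Writing the value as an unquoted YAML list avoids the need for this step.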

Copilot · 2026-04-28T11:08:57Z

+# Per-layer compression schedule (from config.json:compress_ratios)
+# 0   = uncompressed dense layer (full attention with SWA)
+# 4   = HCA branch (compress ratio 4)
+# 128 = CSA branch (compress ratio 128)


The per-layer schedule comments invert CSA vs HCA: per the V4 design and the rest of this PR, compress_ratio == 4 is CSA and compress_ratio == 128 is HCA. Please fix the comment mapping so it matches the implementation.

Copilot · 2026-04-28T11:08:57Z

+#
+# Source: deepseek-v4/deepseek-ai/DeeSeek-v4-Pro/config.json
+###############################################################################


Typo in the referenced source path (DeeSeek-v4-Pro). If this path is meant to mirror the repo directory (DeepSeek-V4-Pro), please correct it to avoid confusion when cross-referencing configs.

Copilot · 2026-04-28T11:08:58Z

+        # Sliding-window mask.
+        window = self.attn_sliding_window
+        local_mask = sliding_window_causal_mask(S, window, device=device, dtype=dtype)  # [S, S]
+
+        # Subclass hook: extra K/V (compressed pool, sparse top-K, etc.).
+        # Subclass should return tensors already broadcast to [B, S_extra, H, head_dim]
+        # so they can be cat'd along the Sk axis.
+        extra_k, extra_v, extra_mask = self._extra_kv(hidden, position_ids, q)
+
+        # Concatenate sliding-window KV with extra KV (if any).
+        if extra_k is not None:
+            k_full = torch.cat([k_local_h, extra_k], dim=1)  # [B, Sk_total, H, head_dim]
+            v_full = torch.cat([v_local_h, extra_v], dim=1)
+            full_mask = torch.cat([local_mask, extra_mask], dim=-1)  # [Sq, Sk_total]
+        else:
+            k_full = k_local_h
+            v_full = v_local_h
+            full_mask = local_mask
+
+        # Move heads dim before sequence: [B, S, H, head_dim] -> [B, H, S, head_dim]
+        q_bh = q.transpose(1, 2)
+        k_bh = k_full.transpose(1, 2)
+        v_bh = v_full.transpose(1, 2)
+
+        out_bh = self._compute_attention_output(q_bh, k_bh, v_bh, full_mask)


sliding_window_causal_mask creates a [S, S] mask, but the attention still computes q @ k^T over all S keys (k_local_h is length S). For realistic training lengths (e.g. 4096), this becomes quadratic memory/compute and is very likely to OOM, even though the model is conceptually sliding-window. Consider actually restricting K/V to the window (e.g. gather with sliding_window_kv_indices, unfold, or use a kernel/backend that supports causal sliding-window attention) so Sk_local is window rather than S.

Suggested change

-        # Sliding-window mask.
-        window = self.attn_sliding_window
-        local_mask = sliding_window_causal_mask(S, window, device=device, dtype=dtype)  # [S, S]
-
-        # Subclass hook: extra K/V (compressed pool, sparse top-K, etc.).
-        # Subclass should return tensors already broadcast to [B, S_extra, H, head_dim]
-        # so they can be cat'd along the Sk axis.
-        extra_k, extra_v, extra_mask = self._extra_kv(hidden, position_ids, q)
-
-        # Concatenate sliding-window KV with extra KV (if any).
-        if extra_k is not None:
-            k_full = torch.cat([k_local_h, extra_k], dim=1)  # [B, Sk_total, H, head_dim]
-            v_full = torch.cat([v_local_h, extra_v], dim=1)
-            full_mask = torch.cat([local_mask, extra_mask], dim=-1)  # [Sq, Sk_total]
-        else:
-            k_full = k_local_h
-            v_full = v_local_h
-            full_mask = local_mask
-
-        # Move heads dim before sequence: [B, S, H, head_dim] -> [B, H, S, head_dim]
-        q_bh = q.transpose(1, 2)
-        k_bh = k_full.transpose(1, 2)
-        v_bh = v_full.transpose(1, 2)
-
-        out_bh = self._compute_attention_output(q_bh, k_bh, v_bh, full_mask)
+        # Materialize only the causal sliding-window K/V for each query position
+        # so local attention scales with `window` rather than the full sequence `S`.
+        window = self.attn_sliding_window
+        window = min(window, S)
+
+        # Build per-query local indices: for query i attend to [i - window + 1, ..., i].
+        query_positions = torch.arange(S, device=device)
+        window_offsets = torch.arange(window, device=device)
+        local_indices = query_positions.unsqueeze(1) - (window - 1) + window_offsets.unsqueeze(0)  # [S, window]
+        local_valid = local_indices >= 0
+        local_indices = local_indices.clamp_(min=0, max=S - 1)
+
+        # Gather local K/V windows: [B, S, H, D] -> [B, S, window, H, D].
+        gather_index = local_indices.view(1, S, window, 1, 1).expand(
+            B, S, window, self.num_heads, self.head_dim
+        )
+        k_local = torch.gather(
+            k_local_h.unsqueeze(2).expand(B, S, window, self.num_heads, self.head_dim),
+            1,
+            gather_index,
+        )
+        v_local = torch.gather(
+            v_local_h.unsqueeze(2).expand(B, S, window, self.num_heads, self.head_dim),
+            1,
+            gather_index,
+        )
+
+        # Subclass hook: extra K/V (compressed pool, sparse top-K, etc.).
+        # Subclass should return tensors already broadcast to [B, S_extra, H, head_dim].
+        extra_k, extra_v, extra_mask = self._extra_kv(hidden, position_ids, q)
+
+        # Move heads dim before sequence for local attention:
+        # q: [B, S, H, D] -> [B, H, S, D]
+        # local k/v: [B, S, window, H, D] -> [B, H, S, window, D]
+        q_bh = q.transpose(1, 2)
+        k_local_bh = k_local.permute(0, 3, 1, 2, 4)
+        v_local_bh = v_local.permute(0, 3, 1, 2, 4)
+
+        scale = self.head_dim ** -0.5
+        local_scores = (q_bh.unsqueeze(-2) * k_local_bh).sum(dim=-1) * scale  # [B, H, S, window]
+        local_scores = local_scores.masked_fill(
+            ~local_valid.view(1, 1, S, window), torch.finfo(local_scores.dtype).min
+        )
+
+        if extra_k is not None:
+            extra_k_bh = extra_k.transpose(1, 2)  # [B, H, S_extra, D]
+            extra_v_bh = extra_v.transpose(1, 2)  # [B, H, S_extra, D]
+            extra_scores = torch.einsum("bhsd,bhkd->bhsk", q_bh, extra_k_bh) * scale
+            if extra_mask is not None:
+                if extra_mask.dtype == torch.bool:
+                    extra_scores = extra_scores.masked_fill(
+                        ~extra_mask.view(1, 1, S, -1), torch.finfo(extra_scores.dtype).min
+                    )
+                else:
+                    extra_scores = extra_scores + extra_mask.view(1, 1, S, -1).to(extra_scores.dtype)
+            attn_scores = torch.cat([local_scores, extra_scores], dim=-1)
+            attn_probs = torch.softmax(attn_scores.float(), dim=-1).to(q_bh.dtype)
+            local_probs = attn_probs[..., :window]
+            extra_probs = attn_probs[..., window:]
+            out_local = (local_probs.unsqueeze(-1) * v_local_bh).sum(dim=-2)
+            out_extra = torch.einsum("bhsk,bhkd->bhsd", extra_probs, extra_v_bh)
+            out_bh = out_local + out_extra
+        else:
+            attn_probs = torch.softmax(local_scores.float(), dim=-1).to(q_bh.dtype)
+            out_bh = (attn_probs.unsqueeze(-1) * v_local_bh).sum(dim=-2)
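To make the scaling argument concrete, here is a small pure-Python sketch of the per-query window and the number of (query, key) pairs it scores. The function names are hypothetical, and the window size of 128 in the usage line is an assumed value for illustration; only the sequence length 4096 comes from the comment above.

```python
from typing import List


def sliding_window_key_indices(seq_len: int, window: int) -> List[List[int]]:
    """For each query i, list the existing key positions in [i - window + 1, i]."""
    return [list(range(max(0, i - window + 1), i + 1)) for i in range(seq_len)]


def scored_pairs(seq_len: int, window: int) -> int:
    """Count the (query, key) pairs that a windowed attention actually scores."""
    return sum(len(keys) for keys in sliding_window_key_indices(seq_len, window))


# Dense masked attention scores seq_len * seq_len pairs regardless of the mask.
dense = 4096 * 4096
windowed = scored_pairs(4096, 128)  # window=128 is an assumption, not from the PR
```

Under those assumed sizes the windowed count is roughly 32 times smaller than the dense one.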

Copilot · 2026-04-28T11:08:58Z

+    ) -> torch.Tensor:
+        """Gather ``[B, P, head_dim]`` along ``P`` per query → ``[B, S, K, head_dim]``.
+


_gather_topk_kv is annotated as returning torch.Tensor, but it actually returns (gathered, valid). This will confuse type-checkers and readers; update the return annotation (and docstring if needed) to reflect the tuple return type.

Suggested change

-    ) -> torch.Tensor:
-        """Gather ``[B, P, head_dim]`` along ``P`` per query → ``[B, S, K, head_dim]``.
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        """Gather ``[B, P, head_dim]`` along ``P`` per query.
+
+        Returns:
+            A tuple ``(gathered, valid)`` where:
+            - ``gathered`` has shape ``[B, S, K, head_dim]``.
+            - ``valid`` has shape ``[B, S, K]`` and marks non-masked indices.

Copilot · 2026-04-28T11:08:58Z

+        gathered, valid = self._gather_topk_kv(pool_kv, topk_idxs)  # [B, S, K, head_dim]
+
+        # 5) Stash for ``_compute_attention_output`` to consume.
+        gathered.shape[2]


This statement has no effect (gathered.shape[2] is computed and discarded). It looks like a leftover debug line; please remove it to keep the CSA path clean.

Suggested change

-        gathered.shape[2]

Copilot · 2026-04-28T11:08:59Z

+num_layers: 61
+hidden_size: 7168
+num_attention_heads: 128
+num_query_groups: 1
+kv_channels: 512
+qk_pos_emb_head_dim: 64
+ffn_hidden_size: 18432
+moe_ffn_hidden_size: 3072
+moe_shared_expert_intermediate_size: 3072
+
+q_lora_rank: 1536
+o_lora_rank: 1024
+o_groups: 16
+
+num_experts: 384
+moe_router_topk: 6
+moe_router_topk_scaling_factor: 2.5
+
+index_topk: 1024
+
+compress_ratios: "[128, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 0]"


Same issue as Flash: compress_ratios is a quoted string, which will not deserialize to Sequence[int] and will break DeepseekV4TransformerBlock's len(compress_ratios) == num_layers check. Please make this a real YAML list (or add a normalization step) and verify the schedule length matches num_layers (61).

Copilot · 2026-04-28T11:08:59Z

+# Reference:
+#   - deepseek-v4/deepseek-ai/DeepSeek-V4-Flash/config.json
+#   - deepseek-v4/deepseek-ai/DeeSeek-v4-Pro/config.json
+#   - deepseek-v4/develop/techblog/01-deepseek-v4-architecture-deep-dive.md


Typo in the reference path (DeeSeek-v4-Pro). Please correct the spelling/casing so the comment points at the actual directory name and is searchable.

Copilot · 2026-04-28T11:08:59Z

+        # local v we have [B, H, Sk_local, head_dim] (independent of S),
+        # while sparse v depends on S. Build a "value tensor" with the
+        # same shape on both paths by broadcasting local v:
+        v.shape[2]


This statement has no effect (v.shape[2] is computed and discarded). Please remove it; it reads like a debug remnant and makes the attention path harder to audit.

Suggested change

-        v.shape[2]

Copilot · 2026-04-28T11:08:59Z

+FONT_REG = "/home/xiewen12/.local/share/fonts/NotoSansSC-Regular.otf"
+FONT_BOLD = FONT_REG  # we only have Regular; use it for both
+
+OUT_DIR = os.path.join(os.path.dirname(__file__), "diagrams")
+os.makedirs(OUT_DIR, exist_ok=True)
+
+
+def font(sz: int, bold: bool = False) -> ImageFont.FreeTypeFont:
+    return ImageFont.truetype(FONT_BOLD if bold else FONT_REG, sz)


FONT_REG is hard-coded to an absolute path under a specific user's home directory, which will fail for other developers/CI. Consider using a repo-relative font path, allowing an environment variable override, and/or falling back to a default font when the file isn't present.

Suggested change

-FONT_REG = "/home/xiewen12/.local/share/fonts/NotoSansSC-Regular.otf"
-FONT_BOLD = FONT_REG  # we only have Regular; use it for both
-
-OUT_DIR = os.path.join(os.path.dirname(__file__), "diagrams")
-os.makedirs(OUT_DIR, exist_ok=True)
-
-
-def font(sz: int, bold: bool = False) -> ImageFont.FreeTypeFont:
-    return ImageFont.truetype(FONT_BOLD if bold else FONT_REG, sz)
+BASE_DIR = os.path.dirname(__file__)
+
+FONT_CANDIDATES = (
+    os.environ.get("DIAGRAM_FONT"),
+    os.environ.get("FONT_REG"),
+    os.path.join(BASE_DIR, "NotoSansSC-Regular.otf"),
+    os.path.join(BASE_DIR, "fonts", "NotoSansSC-Regular.otf"),
+)
+
+
+def _resolve_font_path() -> str | None:
+    for path in FONT_CANDIDATES:
+        if path and os.path.isfile(path):
+            return path
+    return None
+
+
+FONT_REG = _resolve_font_path()
+FONT_BOLD = FONT_REG  # we only have Regular; use it for both when available
+
+OUT_DIR = os.path.join(BASE_DIR, "diagrams")
+os.makedirs(OUT_DIR, exist_ok=True)
+
+
+def font(sz: int, bold: bool = False) -> ImageFont.ImageFont | ImageFont.FreeTypeFont:
+    font_path = FONT_BOLD if bold else FONT_REG
+    if font_path:
+        return ImageFont.truetype(font_path, sz)
+    return ImageFont.load_default()

…+ MTP

Phase 5 of the V4 development plan. Lands the FFN side of the V4 stack: hash-routed and learned top-K MoE, clamped SwiGLU experts, and the V4 MTP head. The V4 block now plugs the V4 MoE in place of P4's placeholder SwiGLU FFN; the V4 model instantiates a separate-HyperHead MTP block when mtp_num_layers > 0. Layer-aware YaRN was already done in P4 (DualRoPE.get_rope picks main_rope vs compress_rope by compress_ratio).

New modules:

* primus/backends/megatron/core/transformer/clamped_swiglu.py
  clamped_swiglu(x, alpha=7.0): silu(gate)*up clamped to [-alpha, alpha]. ClampedSwiGLUMLP wraps it as a fused gate_up + down two-linear MLP. Eager (Python) for v1; perf phase will register a fused kernel.
* primus/backends/megatron/core/transformer/moe/v4_hash_router.py
  HashRouter: static [vocab_size, topk] tid2eid table from a fixed seed. Active for the first num_hash_layers V4 layers; gives each token a permanent expert assignment with uniform weight 1/topk. No learnable parameters; deterministic across PP / TP / EP ranks.
* primus/backends/megatron/core/transformer/moe/v4_topk_router.py
  V4TopKRouter: learned gate with score_function in {"sqrtsoftplus", "sigmoid", "softmax"}. Top-K with optional renorm and optional noaux_tc per-expert bias (selection-only; probs are read from the un-biased score).
* primus/backends/megatron/core/transformer/moe/v4_moe.py
  DeepseekV4MoE: per-layer router pick (hash vs learned) + N ClampedSwiGLUMLP routed experts + 1 shared expert. Pure-PyTorch per-expert dispatch; P6 swaps in Megatron's token-dispatcher / grouped-GEMM / EP path.
* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_mtp.py
  DeepseekV4MTPBlock: mtp_num_layers V4 layers, each owning its own HyperHead (separate from the main decoder's). Shares the dual-RoPE with the main decoder. Loss-side wiring is deferred to P6; P5 just stands the module up so it can be unit-tested standalone.

Updated:

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py
  DeepseekV4HybridLayer now picks MoE vs dense FFN based on num_routed_experts. forward() threads token_ids through to the MoE for hash-routed layers. The block-level forward picks token_ids up from a model-side stash (_v4_token_ids) so callers don't have to thread it explicitly through every layer of the call stack.
* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_model.py
  Builds DeepseekV4MTPBlock when mtp_num_layers > 0 (post-process rank only). forward() overridden to stash input_ids onto self.decoder before delegating to GPTModel.forward, so hash-routed MoE layers can consume them. Cross-PP propagation of input_ids is a P6 concern.
* primus/backends/megatron/core/models/deepseek_v4/__init__.py
  Re-export DeepseekV4MTPBlock alongside the existing surface.

Smoke-tested on dev-box PyTorch container (CPU, 7-test suite):

* clamped_swiglu: clamp tight; MLP forward+backward OK.
* HashRouter: per-token top-K distinct, deterministic across re-runs and re-instantiations w/ same seed, probs sum to 1.
* V4TopKRouter: top-K honored, renorm OK, backward OK for all three score functions (sqrtsoftplus, sigmoid, softmax).
* DeepseekV4MoE (learned & hash modes): forward + backward; same-token determinism for hash routing.
* DeepseekV4TransformerBlock with MoE FFN (4 layers, hc_mult=2, mixed dense + CSA): forward + backward; deterministic in eval mode.
* DeepseekV4MTPBlock (mtp_num_layers=2, hc_mult=2): forward + backward; per-MTP HyperHead state_dict separation verified.

Deferred to P6 (already noted in progress doc):

* Real Megatron-MoE / token-dispatcher / EP integration -- replaces the pure-PyTorch dispatch loop in DeepseekV4MoE.forward.
* MTP loss path wiring -- DeepseekV4Model.forward currently builds the MTP block but does not yet feed its outputs through lm_head + the auxiliary loss term.
* Numerical alignment vs reference inference/model.py (token-0 logits within 1e-2) -- needs reference checkpoint loading.

Made-with: Cursor
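As a reading aid for the clamped_swiglu entry above, here is a scalar sketch that takes the commit text literally: the product silu(gate)*up is clamped to [-alpha, alpha]. It assumes the fused gate_up projection has already been split into separate `gate` and `up` lists; the actual module operates on tensors and may place the clamp differently.

```python
import math
from typing import List


def silu(x: float) -> float:
    """SiLU (x * sigmoid(x)), written to avoid overflow for large negative x."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def clamped_swiglu(gate: List[float], up: List[float], alpha: float = 7.0) -> List[float]:
    """Elementwise silu(gate) * up with the product clamped to [-alpha, alpha]."""
    return [max(-alpha, min(alpha, silu(g) * u)) for g, u in zip(gate, up)]


# Large activations saturate at +/- alpha; small ones pass through unchanged.
out = clamped_swiglu([10.0, 10.0, 1.0], [10.0, -10.0, 1.0])
```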

Copilot

Pull request overview

Copilot reviewed 38 out of 42 changed files in this pull request and generated no new comments.

Wire DeepSeek-V4 through Megatron P6 integration (PP local-layer build, EP expert sharding, and compatibility fixes) and add the P7 single-node launcher plus progress docs after passing PP=2/EP=4 smoke run. Made-with: Cursor

Add the plan-1 roadmap/detail/test documentation plus progress tracker entries, and update the development target doc with TransformerEngine and Primus-Turbo reference pointers. Made-with: Cursor

Copilot

Pull request overview

Copilot reviewed 44 out of 48 changed files in this pull request and generated 8 comments.

Copilot · 2026-04-29T12:33:29Z

+        gen = torch.Generator(device="cpu").manual_seed(int(seed))
+        # For each token id, pick ``topk`` distinct expert ids deterministically.
+        # randperm(num_experts) is a stable, dense permutation; slicing the
+        # first ``topk`` rows gives uniform-without-replacement routing.
+        rows = []
+        for _ in range(vocab_size):
+            perm = torch.randperm(num_experts, generator=gen)[:topk]
+            rows.append(perm)
+        tid2eid = torch.stack(rows, dim=0).long()  # [vocab_size, topk]


HashRouter.__init__ builds tid2eid by looping over every vocab_size entry and calling torch.randperm(num_experts) each time. For real V4 sizes (e.g., vocab≈129k, experts≈384), this will add significant startup time and CPU memory churn at model construction. Consider replacing this with a deterministic hash-based mapping (no table), or generating the table in larger vectorized blocks (and/or only for the subset of vocab used), so model init remains scalable.
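A table-free variant of the mapping could look like the sketch below. It is not the PR's HashRouter: the function name, the SHA-256 keying scheme, and the seed handling are all assumptions. It shows only that a per-token expert set can be derived on demand, deterministically on every rank, without materializing a [vocab_size, topk] table.

```python
import hashlib
from typing import List


def hash_route(token_id: int, num_experts: int, topk: int, seed: int = 0) -> List[int]:
    """Pick ``topk`` distinct experts for a token by ranking experts on a keyed hash."""

    def key(expert: int) -> bytes:
        # One independent digest per (seed, token, expert) triple.
        return hashlib.sha256(f"{seed}:{token_id}:{expert}".encode()).digest()

    # Sorting experts by their digest gives a pseudo-random permutation;
    # its first ``topk`` entries are distinct by construction.
    return sorted(range(num_experts), key=key)[:topk]


experts = hash_route(token_id=42, num_experts=384, topk=6, seed=1234)
```

This costs `num_experts` hashes per lookup, so a real implementation would still want a vectorized or cached form; the point is only that init time need not scale with vocabulary size.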

Copilot · 2026-04-29T12:33:30Z

+    def _gather_topk_kv(
+        self,
+        pool: torch.Tensor,  # [B, P, head_dim]
+        topk_idxs: torch.Tensor,  # [B, S, K]  (-1 for masked)
+    ) -> torch.Tensor:
+        """Gather ``[B, P, head_dim]`` along ``P`` per query → ``[B, S, K, head_dim]``.
+
+        Out-of-range / masked indices (``-1``) are clamped to ``0`` for the
+        gather, then *zero-masked* afterwards.
+        """
+        B, S, K = topk_idxs.shape
+        P, Hd = pool.shape[1], pool.shape[2]
+        valid = topk_idxs >= 0  # [B, S, K]
+        safe_idx = topk_idxs.clamp(min=0)
+        # Expand idx to gather along P for each (B, S, K, Hd).
+        idx_expand = safe_idx.unsqueeze(-1).expand(B, S, K, Hd)
+        pool_expand = pool.unsqueeze(1).expand(B, S, P, Hd)  # [B, S, P, Hd]
+        gathered = torch.gather(pool_expand, dim=2, index=idx_expand)  # [B, S, K, Hd]
+        gathered = gathered * valid.unsqueeze(-1).to(gathered.dtype)
+        return gathered, valid


_gather_topk_kv is annotated as returning only a torch.Tensor, but it actually returns (gathered, valid). This mismatch can break type checking and mislead callers; update the return annotation (and docstring if desired) to Tuple[torch.Tensor, torch.Tensor].

Copilot · 2026-04-29T12:33:30Z

+        in_dtype = x.dtype
+        x32 = x.float()
+        rsqrt = torch.rsqrt(x32.pow(2).mean(dim=-1, keepdim=True) + self.eps)
+        return (x32 * rsqrt).to(in_dtype) * self.weight


The standalone RMSNorm implementation returns (…to(in_dtype) * self.weight). If self.weight remains fp32 (common in mixed-precision training), this multiplication will upcast the output back to fp32, potentially defeating BF16 activation flow and increasing memory/compute. Consider multiplying by self.weight.to(in_dtype) (or casting the final result back to in_dtype) so the output dtype stays consistent with the input activation dtype.

Suggested change

return (x32 * rsqrt).to(in_dtype) * self.weight

return (x32 * rsqrt).to(in_dtype) * self.weight.to(in_dtype)

Copilot · 2026-04-29T12:33:31Z

+        flat = hidden.reshape(-1, D)  # [N, D]
+        flat.shape[0]
+


This flat.shape[0] statement is a no-op and appears to be leftover debug code. Please remove it to keep the forward path minimal and lint-clean.

Copilot · 2026-04-29T12:33:31Z

+# DeepSeek-V4 Pro (large MoE variant).
+#
+# Source: deepseek-v4/deepseek-ai/DeeSeek-v4-Pro/config.json
+###############################################################################


Typo in the source comment path: DeeSeek-v4-Pro should be DeepSeek-V4-Pro (consistent with the model naming elsewhere).

Copilot · 2026-04-29T12:33:31Z

+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        in_dtype = x.dtype
+        x32 = x.float()
+        rms = torch.rsqrt(x32.pow(2).mean(dim=-1, keepdim=True) + self.eps)
+        return (x32 * rms).to(in_dtype) * self.weight
+


Same RMSNorm dtype issue here: (…to(in_dtype) * self.weight) can upcast the output back to fp32 if self.weight is fp32, which is likely under mixed precision. To keep the compressor output in the activation dtype, multiply by self.weight.to(in_dtype) or cast the final output back to in_dtype.

Copilot · 2026-04-29T12:33:31Z

+        v.shape[2]
+        v_local_per_q = v.unsqueeze(2).expand(-1, -1, q.shape[2], -1, -1)  # [B, H, S, Sk_local, head_dim]


This v.shape[2] line is a no-op (likely leftover from debugging) and should be removed to avoid confusing readers and linters.

Copilot · 2026-04-29T12:33:32Z

+class HashRouter(nn.Module):
+    """Static hash-based MoE router.
+
+    Args:
+        num_experts: total number of routed experts.
+        topk: number of experts each token is routed to.
+        vocab_size: tokenizer vocabulary size; controls the table length.
+        seed: deterministic seed for the hash; same across all ranks.
+        dtype: dtype of the returned ``probs`` tensor; defaults to
+            ``torch.float32``.
+


This PR introduces substantial new DeepSeek-V4 core modules (attention variants, compressor/indexer, routers, MoE, HC) but does not add unit tests covering their key invariants (e.g., HashRouter determinism, CSA/HCA causality masks, compressor/indexer shape/validity). The repo already has a Python unit test suite under tests/unit_tests/ (including Megatron transformer tests), so please add focused unit tests for these new modules to prevent regressions.

Remove GPT placeholder/super-init spec coupling so DeepSeek-V4 builds decoder directly from DeepSeek ModuleSpec submodule trees, and update Phase 8 progress records to match the finalized implementation and validation status. Made-with: Cursor

Unify DeepSeek-V4 runtime module selection under DeepSeekV4SpecProvider and migrate attention/MLP/MoE construction to provider-driven ModuleSpec flows with safe local fallbacks. Document and validate the TE CUDA runtime contract, including an explicit fail-fast guard for non-CUDA TE/Turbo inputs and updated Phase 9 progress records in English. Made-with: Cursor

Copilot

Pull request overview

Copilot reviewed 46 out of 50 changed files in this pull request and generated 4 comments.

Copilot · 2026-04-30T03:14:29Z

+
+        # 5) Stash for ``_compute_attention_output`` to consume.
+        gathered.shape[2]
+        # Build mask for the compressed branch: ``-inf`` where invalid.


There are a couple of no-op statements (e.g., gathered.shape[2]) that have no effect and appear to be leftover debugging. Please remove them to keep the CSA path easier to read/maintain.

Copilot · 2026-04-30T03:14:30Z

+                    batch, seq = input_ids.shape
+                    position_ids = (
+                        input_ids.new_arange(seq, dtype=input_ids.dtype).unsqueeze(0).expand(batch, -1)
+                    )


input_ids.new_arange(...) is not a valid PyTorch Tensor API (and there is no local helper/monkeypatch in the repo), so this will raise AttributeError when position_ids is omitted. Use torch.arange(seq, device=input_ids.device, dtype=...) (or the existing Megatron helper used elsewhere) to build position ids.

Copilot · 2026-04-30T03:14:30Z

+export PRECISION_TYPE=${PRECISION_TYPE:-BF16}
+export FP8=null
+export FP8_RECIPE=null
+


FP8/FP8_RECIPE default to the literal string null, but the script still passes them via --fp8/--fp8_recipe. That makes args.fp8 truthy and can trigger FP8 validation paths (and failures) even when FP8 is intended to be disabled. Only include these CLI flags when PRECISION_TYPE=FP8, or ensure the disabled state is represented in a way the arg parser treats as false/None.

Copilot · 2026-04-30T03:14:30Z

+        B, S, D = hidden.shape
+        flat = hidden.reshape(-1, D)  # [N, D]
+        flat.shape[0]
+


There are a few no-op statements left in forward (e.g., flat.shape[0]) that don't affect execution and look like leftover debugging. Please remove them to avoid confusion and keep the forward path clean.

…chema Align phase10 DeepSeek-V4 modules on explicit spec/provider contracts by enforcing SharedExpertMLP-only shared experts and introducing a dedicated DeepSeekV4TransformerConfig for V4-only runtime fields. Update builder/spec/docs so training resolves the new config type and tracks activation clamp through model config. Made-with: Cursor

Fix HC/attention dtype mismatches and tune the DeepSeek-V4 smoke script defaults so the Phase 10 MI355X run completes reliably end-to-end. Add a dedicated Phase 10 convergence report documenting delivered scope, runtime blockers, and remaining tracked items. Made-with: Cursor

Copilot

Pull request overview

Copilot reviewed 48 out of 52 changed files in this pull request and generated 5 comments.

Copilot · 2026-04-30T12:53:47Z

+export PRECISION_TYPE=${PRECISION_TYPE:-BF16}
+export FP8=null
+export FP8_RECIPE=null
+
+if [ "$PRECISION_TYPE" = "FP8" ]; then
+  export FP8=${FP8:-hybrid}
+  export FP8_RECIPE=${FP8_RECIPE:-delayed}
+fi
+
+export EXP=${EXP:-examples/megatron/configs/MI355X/deepseek_v4_flash-BF16-pretrain.yaml}
+export BACKEND_PATH=${BACKEND_PATH:-"$(pwd)/third_party/Megatron-LM"}
+export PRIMUS_TEAM=${PRIMUS_TEAM:-amd}
+export PRIMUS_USER=${PRIMUS_USER:-tas-mi355x-$(date +%Y%m%d)}
+export PRIMUS_EXP_NAME=${PRIMUS_EXP_NAME:-deepseek_v4_smoke_${PRECISION_TYPE}_MBS${MBS}_GBS${GBS}_PP${PRIMUS_PP}_EP${PRIMUS_EP}}
+
+if [ ! -d "$BACKEND_PATH" ] || [ -z "$(ls -A "$BACKEND_PATH" 2>/dev/null)" ]; then
+  echo "[ERROR] BACKEND_PATH does not exist or is empty: $BACKEND_PATH"
+  echo "Run: git submodule update --init --recursive"
+  exit 1
+fi
+
+mkdir -p "output/$PRIMUS_TEAM/$PRIMUS_USER/$PRIMUS_EXP_NAME"
+
+./primus-cli direct \
+  -- train pretrain --config "$EXP" \
+  --backend_path "$BACKEND_PATH" \
+  --num_layers "$PRIMUS_TOTAL_LAYERS" \
+  --train_iters "$TRAIN_ITERS" \
+  --lr_warmup_iters 0 \
+  --lr_decay_iters "$TRAIN_ITERS" \
+  --micro_batch_size "$MBS" \
+  --global_batch_size "$GBS" \
+  --seq_length "$PRIMUS_SEQ_LENGTH" \
+  --max_position_embeddings "$PRIMUS_MAX_POSITION_EMBEDDINGS" \
+  --rope_type rope \
+  --tensor_model_parallel_size "$PRIMUS_TP" \
+  --pipeline_model_parallel_size "$PRIMUS_PP" \
+  --expert_model_parallel_size "$PRIMUS_EP" \
+  --num_experts "$PRIMUS_NUM_EXPERTS" \
+  --moe_router_topk "$PRIMUS_MOE_TOPK" \
+  --moe_router_enable_expert_bias "$PRIMUS_MOE_ENABLE_EXPERT_BIAS" \
+  --moe_ffn_hidden_size "$PRIMUS_MOE_FFN_HIDDEN_SIZE" \
+  --index_topk "$PRIMUS_INDEX_TOPK" \
+  --v4_grouped_experts_support_clamped_swiglu "$PRIMUS_V4_GROUPED_EXPERTS_SUPPORT_CLAMPED_SWIGLU" \
+  --compress_ratios "$PRIMUS_COMPRESS_RATIOS" \
+  --mtp_num_layers 0 \
+  --mock_data True \
+  --use_turbo_attention "$USE_TURBO_ATTENTION" \
+  --use_turbo_grouped_mlp "$TURBO_USE_GROUPED_MLP" \
+  --moe_use_legacy_grouped_gemm "$LEGACY_GG" \
+  --fp8 "$FP8" \
+  --fp8_recipe "$FP8_RECIPE" \


FP8/FP8_RECIPE are always passed to primus-cli (defaulting to the literal string null). Other run scripts in this repo gate --fp8 ... args behind an explicit FP8 enable flag; passing null may be rejected by argument parsing or select an unintended FP8 mode. Consider only adding --fp8/--fp8_recipe when PRECISION_TYPE=FP8 (or when a dedicated FP8=True flag is set), and omit them entirely otherwise.
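The gating described above can be sketched as follows. The launcher is a shell script, so this Python helper is only an illustration of the logic; `fp8_cli_args` is a hypothetical name, and the `hybrid` / `delayed` defaults are taken from the hunk.

```python
from typing import Dict, List


def _value(env: Dict[str, str], name: str, default: str) -> str:
    """Treat a missing, empty, or literal "null" variable as unset."""
    raw = env.get(name, "")
    return default if raw in ("", "null") else raw


def fp8_cli_args(env: Dict[str, str]) -> List[str]:
    """Return FP8 flags only when PRECISION_TYPE=FP8; otherwise return nothing."""
    if env.get("PRECISION_TYPE", "BF16") != "FP8":
        return []
    return [
        "--fp8", _value(env, "FP8", "hybrid"),
        "--fp8_recipe", _value(env, "FP8_RECIPE", "delayed"),
    ]


bf16_args = fp8_cli_args({"PRECISION_TYPE": "BF16", "FP8": "null"})
fp8_args = fp8_cli_args({"PRECISION_TYPE": "FP8", "FP8": "null"})
```

The sketch treats `null` as unset because, in the hunk as quoted, `FP8` is exported as `null` before `${FP8:-hybrid}` is evaluated, and that expansion only substitutes the default for an unset or empty variable.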

Copilot · 2026-04-30T12:53:48Z

+    # Primus-owned: DeepSeek-V4 (Phase 2 stub; full V4 wiring lands in Phase 3+)
+    if model_type == "deepseek_v4":
+        deepseek_v4_module = importlib.import_module(
+            "primus.backends.megatron.core.models.deepseek_v4.deepseek_v4_builders"
+        )


The comment "Phase 2 stub; full V4 wiring lands in Phase 3+" is now misleading since this PR imports the full DeepSeek-V4 builders/specs. Updating/removing it will avoid confusion when debugging model-type dispatch.

Copilot · 2026-04-30T12:53:48Z

+        # 5) Stash for ``_compute_attention_output`` to consume.
+        gathered.shape[2]
+        # Build mask for the compressed branch: ``-inf`` where invalid.
+        # This is per-query, shape [S, K]; we keep it on the module as a
+        # full [B, S, K] additive mask.
+        sparse_mask = torch.where(valid, 0.0, float("-inf")).to(dtype)  # [B, S, K]
+        self._csa_state = {
+            "gathered": gathered,  # [B, S, K, head_dim]
+            "sparse_mask": sparse_mask,  # [B, S, K]
+        }
+
+        # Tell the parent: no cat-extension; we handle CSA inside
+        # ``_compute_attention_output``.
+        return None, None, None


CSAAttention stores per-forward tensors in self._csa_state and then reads them in _compute_attention_output. This is not safe under pipeline parallel schedules (multiple microbatches in flight) or activation checkpoint recomputation, because the module attribute can be overwritten before earlier microbatches/backward recomputes run, leading to wrong outputs/gradients. Refactor CSA to avoid mutable module-level forward state (e.g., compute the joint local+sparse attention fully inside forward, or thread the gathered KV/mask through the call stack without storing on self).
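A toy model of the hazard this comment describes, assuming a module whose first stage stashes data on `self` and whose second stage reads it back. The class and method names are hypothetical and strings stand in for tensors; this is not the `CSAAttention` code.

```python
class StatefulAttention:
    """Anti-pattern: per-forward data is stored on the module instance."""

    def prepare(self, microbatch):
        self._state = {"gathered": microbatch}

    def compute(self):
        return self._state["gathered"]


class StatelessAttention:
    """Alternative: the per-forward data is returned and passed explicitly."""

    def prepare(self, microbatch):
        return {"gathered": microbatch}

    def compute(self, state):
        return state["gathered"]


def run_interleaved_stateful():
    module = StatefulAttention()
    module.prepare("mb0")
    module.prepare("mb1")  # overwrites the state before mb0 is consumed
    return module.compute()  # mb0 now silently sees mb1's data


def run_interleaved_stateless():
    module = StatelessAttention()
    state0 = module.prepare("mb0")
    state1 = module.prepare("mb1")
    return module.compute(state0), module.compute(state1)


if __name__ == "__main__":
    print(run_interleaved_stateful())
    print(run_interleaved_stateless())
```

The stateless variant keeps each microbatch's data in a local value, so the order in which stages run no longer matters.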

Copilot · 2026-04-30T12:53:48Z

+        # 5) Stash for ``_compute_attention_output`` to consume.
+        gathered.shape[2]
+        # Build mask for the compressed branch: ``-inf`` where invalid.


There are two no-op expression statements (gathered.shape[2] and later v.shape[2]) that have no effect and look like leftover debug code. They should be removed to avoid confusion (and to keep linters/type checkers from flagging them).

Copilot · 2026-04-30T12:53:49Z

+        decoder = getattr(self, "decoder", None)
+        if decoder is not None:
+            decoder._v4_token_ids = input_ids
+        try:
+            hidden_states = self.decoder(
+                hidden_states=decoder_input,
+                attention_mask=attention_mask,
+                **kwargs,
+            )
+        finally:
+            if decoder is not None:
+                decoder._v4_token_ids = None


DeepseekV4Model.forward stashes input_ids onto decoder._v4_token_ids and clears it immediately after the forward. This breaks any activation checkpoint/recompute that re-invokes decoder/layer forwards during backward (token_ids will be missing) and is also unsafe with pipeline schedules that can have multiple microbatches using the same module instance. Prefer passing token_ids=input_ids explicitly into self.decoder(...) (the decoder already accepts a token_ids kwarg) instead of relying on mutable module state.
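The recompute failure mode can be illustrated with a small sketch, assuming a decoder that needs token ids on every forward call. `ToyDecoder` and both wrapper functions are hypothetical; the "recompute" closure stands in for activation checkpointing re-running the forward during backward.

```python
class ToyDecoder:
    """Hypothetical decoder that requires token ids in forward."""

    def __init__(self):
        self._v4_token_ids = None

    def forward(self, hidden, token_ids=None):
        ids = token_ids if token_ids is not None else self._v4_token_ids
        if ids is None:
            raise RuntimeError("token_ids missing")
        return [h + i for h, i in zip(hidden, ids)]


def forward_with_stash(decoder, hidden, input_ids):
    """Stash-and-clear pattern from the hunk above."""
    decoder._v4_token_ids = input_ids
    try:
        out = decoder.forward(hidden)
    finally:
        decoder._v4_token_ids = None

    def recompute():
        # Runs later (e.g. during backward); the stash is already cleared.
        return decoder.forward(hidden)

    return out, recompute


def forward_with_kwarg(decoder, hidden, input_ids):
    """Explicit-argument pattern: the ids travel with the call."""
    out = decoder.forward(hidden, token_ids=input_ids)

    def recompute():
        return decoder.forward(hidden, token_ids=input_ids)

    return out, recompute
```

In the explicit-argument version the recompute closure captures `input_ids` itself, so a later re-invocation sees the same inputs as the original forward.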

Add run_deepseek_v4_pro_muon.sh: runs DeepSeek-V4-Pro (deepseek_v4_pro.yaml, 61L/d7168/384 experts) with the Muon optimizer on one 8x288GB node, following the paper §4.2.1 architecture + §4.2.2 training setup as closely as a single node allows.

Key points (validated on gfx950):
- Muon: optimizer=muon, momentum 0.95, update-RMS 0.18 (muon_extra_scale_factor; spectral mode => update RMS ~= extra_scale_factor; Megatron default 1.0 is ~5.5x too large), use_distributed_optimizer=False + fp32 opt-state dtypes (Megatron asserts), AdamW eps 1.0e-20 (decimal point required or Primus parses it as a string and multi_tensor_adam crashes).
- Batch via gradient accumulation toward the paper's 94.4M tokens/step. This amortizes Muon's fixed Newton-Schulz cost: at GBS=8 (accum 1) NS looks like ~97% of GEMM (a starved-batch artifact); at GBS=256 (accum 32)/seq4096 throughput goes 19.8 -> ~890 TFLOP/s/GPU and Muon NS falls to ~1% of GPU time.
- Single-node concessions documented in-script: reduced depth/seq to fit; emerging_optimizers must be pip-installed for Muon; expert-bias off (Megatron needs sigmoid, V4 uses sqrtsoftplus); MTP depth 1 unsupported by Primus V4.

Co-authored-by: yanyuqin <qin.yanyuan@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
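The batch figures in this commit message can be checked with a little arithmetic. The sketch below assumes micro-batch size 1 and data-parallel size 8; neither is stated in the message, but both are consistent with its pairs GBS=8 / accum 1 and GBS=256 / accum 32 on an 8-GPU node. The function names are illustrative.

```python
def accumulation_steps(global_batch_size, micro_batch_size, data_parallel_size):
    """Gradient-accumulation steps needed to reach the global batch size."""
    per_pass = micro_batch_size * data_parallel_size
    if global_batch_size % per_pass != 0:
        raise ValueError("global batch must be a multiple of mbs * DP")
    return global_batch_size // per_pass


def tokens_per_step(global_batch_size, seq_len):
    """Tokens consumed by one optimizer step."""
    return global_batch_size * seq_len


def sequences_for_token_budget(token_budget, seq_len):
    """Global batch (in sequences) that a token budget implies."""
    return token_budget / seq_len


if __name__ == "__main__":
    # Assumed: mbs=1, DP=8 (inferred from the numbers, not stated in the PR).
    print(accumulation_steps(8, 1, 8), accumulation_steps(256, 1, 8))
    print(tokens_per_step(256, 4096))
    print(sequences_for_token_budget(94.4e6, 4096))
```

Under these assumptions GBS=256 at sequence length 4096 is about 1.05M tokens per step, still well below the 94.4M-token target, which is why the message says "toward" rather than "at" the paper's batch size.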

… solve the shape-error issue when mbs>1

…rHead

Wire DeepSeek-V4 multi-token prediction end-to-end (mtp_num_layers>0):
- New DeepseekV4MTPLayer (subclasses upstream MultiTokenPredictionLayer): lifts the eh_proj output to the K-stream (mHC) form, runs the V4 hybrid inner layer, and collapses with a per-depth HyperHead (hc_head_fn) per the released V4 checkpoint (config.mtp_use_separate_hc_head). Threads pg_collection into the inner layer so its MoE uses the same expert-parallel dispatcher as the main decoder (the upstream MTP build omits pg_collection, whose local-experts fallback breaks the DDP grad-bucket invariant under EP).
- get_v4_mtp_block_spec extracts a single hybrid-layer spec from the decoder block spec (mirrors upstream GPT spec.layer_specs[-1]) instead of passing the whole block spec (which trips MTP TransformerLayerSubmodules validation).
- DeepseekV4HybridLayer.__init__ accepts rope=None (builds + registers a private DualRoPE so its buffers move to device) and tolerates upstream MTP build kwargs (is_mtp_layer / vp_stage).
- Add config.mtp_use_separate_hc_head (default True); fix/extend test_v4_mtp.

Validated: EP=8 V4-Flash proxy, MTP_NUM_LAYERS=1, 10 iters clean (exit 0), mtp_1 loss in train log, lm loss 11.90->10.52, grad norm stable, 0 NaN.

Optional FP8 quantization-aware QK path on the CSA Indexer (config.use_v4_fp8_indexer, default False):
- fake_quantize_fp8_e4m3: per-tensor dynamic E4M3 fake-quant (scale to the 448 finite range, round through torch.float8_e4m3fn, dequantize).
- The Indexer fake-quantizes the per-head query (q_i) and compressed-key (k_icomp) activations before the QK scoring einsum for every dispatch path (full-fuse / post-einsum-tail Triton / eager). The ReLU + per-head weight + sum + causal mask + top-k stay in BF16 (the report's BF16 index-score path). The indexer is a frozen non-differentiable top-k selector, so no straight-through estimator is needed; one-time rank-0 log when active.
- Plumb config.use_v4_fp8_indexer -> Indexer(use_fp8_qk=...) via the V4 attention builder; add base/flash yaml defaults (PRIMUS_USE_V4_FP8_INDEXER).
- Unit tests: fake-quant bounded-error/dtype/zero-safe; FP8 indexer top-k overlaps the BF16 reference; flag-off is bit-identical.

Validated: EP=8 V4-Flash proxy, USE_V4_FP8_INDEXER=true, 10 iters clean (exit 0), '[V4-Indexer] FP8 ... ENABLED' in log, lm loss 11.90->10.53 (matches BF16 baseline 10.52), grad norm stable, 0 NaN.

…param split

Make the Muon optimizer path work for DeepSeek-V4 with the report's recipe:
- Install hook (runner/.../01_install_emerging_optimizers.sh, gated by PRIMUS_INSTALL_EMERGING_OPTIMIZERS, idempotent) provisions NVIDIA-NeMo/Emerging-Optimizers from a pinned commit (0.4.0a0) inside the container; the public PyPI name is a stub, so it must come from GitHub source.
- moun.py: adapt to the emerging_optimizers>=0.4.0a0 API (use_nesterov->nesterov, mode->tp_mode); add _resolve_muon_coefficient_type (auto-selects the built-in 'deepseekv4' coefficient set -- 8 aggressive (3.4445,-4.7750,2.0315) + 2 stable (2.0,-1.5,0.5) -- for V4 configs, raising num_ns_steps 5->10); extract _param_goes_to_muon (report S9.5.1 split: 2-D weight matrices incl. mHC fn -> Muon; embedding/output + all <2-D params incl. RMSNorm, mHC base/scale, biases -> AdamW; fixes a 0-dim param reaching Newton-Schulz).
- muon_optimizer_patches.py: redirect megatron.training.training.get_megatron_muon_optimizer (called directly by setup_model_and_optimizer) to the Primus moun builder so the new-API + deepseekv4 path drives Muon without editing third_party; force use_gloo_process_groups=False (incompatible with the provided pg_collection).
- moun_optimizer_config.py: add muon_coefficient_type field.
- Unit test for the param split (incl. scalar/1-D->AdamW, 2-D->Muon) and coefficient resolution (V4 autoselect deepseekv4 / explicit override / non-V4 quintic).

Validated: EP=8 V4-Flash proxy, OPTIMIZER=muon, 10 iters clean (exit 0); log shows coefficient_type='deepseekv4' num_ns_steps 5->10, momentum 0.95 / extra_scale 0.18 / wd 0.1; lm loss 11.90->11.60 monotonic, grad norm ~3.0 stable, 0 NaN; emerging_optimizers auto-installed by the hook.
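The parameter split described in this commit message can be sketched as a small predicate. This is a minimal illustration, not the PR's `_param_goes_to_muon`: the `ParamInfo` type, the name markers used to spot embedding/output weights, the example parameter names, and the choice to send tensors with more than two dimensions to AdamW are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParamInfo:
    """Hypothetical stand-in for a named parameter; only name and rank matter."""
    name: str
    ndim: int


# Assumed substrings identifying embedding / output-projection weights.
ADAMW_NAME_MARKERS = ("embedding", "output_layer")


def goes_to_muon_sketch(param):
    """True if the parameter would be optimized by Muon in this sketch."""
    # Scalars and 1-D tensors (norm weights, biases, scales) must never reach
    # Newton-Schulz. Assumption: only exactly 2-D matrices go to Muon.
    if param.ndim != 2:
        return False
    # Embedding and output matrices stay on AdamW even though they are 2-D.
    return not any(marker in param.name for marker in ADAMW_NAME_MARKERS)


def split_params(params):
    """Partition parameter names into (muon, adamw) lists."""
    muon = [p.name for p in params if goes_to_muon_sketch(p)]
    adamw = [p.name for p in params if not goes_to_muon_sketch(p)]
    return muon, adamw


if __name__ == "__main__":
    example = [
        ParamInfo("embedding.word_embeddings.weight", 2),
        ParamInfo("decoder.layers.0.attn.linear_q.weight", 2),
        ParamInfo("decoder.layers.0.norm.weight", 1),
        ParamInfo("decoder.layers.0.hc.scale", 0),
        ParamInfo("output_layer.weight", 2),
    ]
    print(split_params(example))
```

Checking the rank first is what guards against the 0-dim case the message mentions, since Newton-Schulz orthogonalization is only defined for matrices.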

…param-gather)

Wire the V4 optimizer/precision recipes into the shared launcher (run_deepseek_v4.sh):
- OPTIMIZER={adam(default)|muon|dist_muon}: the muon paths set the Megatron muon CLI (momentum 0.95, extra_scale 0.18, use_distributed_optimizer/precision_aware False, fp32 states), force DDP grad/param overlap OFF (plain muon: Megatron asserts; dist_muon: LayerWiseDistributedOptimizer manages its own param all-gather and otherwise double-drives DDP start_param_sync), and set PRIMUS_INSTALL_EMERGING_OPTIMIZERS so the install hook provisions emerging_optimizers.
- FP8 / FP8_RECIPE now honor the incoming env (e.g. FP8_RECIPE=mxfp8) instead of being hard-clobbered to null.
- FP8_PARAM_GATHER=True path: enables MXFP8 (NVTE_ROCM_ENABLE_MXFP8=1) + --fp8_param_gather (+ --reuse_grad_buf_for_mxfp8_param_ag for mxfp8). Megatron #4987 analogue.
- Also surfaces MTP_NUM_LAYERS and USE_V4_FP8_INDEXER launch knobs.

Validated on the EP=8 V4-Flash proxy (10 iters each, exit 0, loss decreasing, 0 NaN): OPTIMIZER=muon and OPTIMIZER=dist_muon (loss 11.90->11.60, deepseekv4 hybrid NS); OPTIMIZER=dist_muon + --fp8 hybrid --fp8_recipe mxfp8 (Muon + MXFP8 forward).

KNOWN LIMITATION: --fp8-param-gather together with Muon is blocked at Megatron arg-validation ('--fp8-param-gather only supported with distributed optimizer...'), since Muon requires use_distributed_optimizer=False; full Muon+fp8-param-gather needs the unmerged upstream LayerWise integration (Megatron #4987) and cannot be done without a third_party edit. Plumbing is in place for when the container's Megatron gains it.

+        "plan6_triton": [
+            {
+                "phase": phase,
+                "name": name,
+                "count": count,
+                "total_ms": total / 1000.0,
+                "avg_ms": avg / 1000.0,
+                "pct_window": (total / win_us * 100.0) if win_us else 0.0,
+            }
+            for phase, names in plan6_families.items()
+            for name in names
+            for cand_name, count, total, avg in rows
+            if cand_name == name
+            for _name in [cand_name]
+        ],


+        head_y + Inches(0.32),
+        Inches(12.75),
+        head_y + Inches(0.32),
+        color=LINE_strong if False else LINE,


…on logs

- run_deepseek_v4_pro_muon.sh: set PRIMUS_INSTALL_EMERGING_OPTIMIZERS for Muon/dist_muon so the in-container hook provisions the pinned package.
- add megatron patch to silence Triton autotuner print spam.
- add megatron patch to raise the emerging_optimizers 'absl' logger to INFO (drops per-step Newton-Schulz coefficient DEBUG lines).

Co-authored-by: Cursor <cursoragent@cursor.com>

Sync garbage collection across ranks every 100 steps (matches run_deepseek_v4.sh) to reduce step-time jitter on the large grad-accum pro Muon run. Co-authored-by: Cursor <cursoragent@cursor.com>

- script/: per-cr single-layer profiling launcher (seq4096, adam+dist-opt, GA=2, no recompute) for MI355X trace capture
- tools/: chrome-trace -> breakdown JSON parser (External-id linking, nn.Module attribution, fwd/bwd split, min-grouping) + kernel/module map
- site/: static MI355X-measured + MI455X-scaled projection website with step-by-step iter-time / tokens-per-s / TFLOP/s derivation
- design/: methodology, assumptions, JSON schema, projection math, deploy
- publish via backend-gap Pages bundle subpath + standalone dev-branch Pages workflow (deploy-projection.yml)

Co-authored-by: Cursor <cursoragent@cursor.com>

…site data

- Root cause: distributed optimizer (zero1) makes ROCm Kineto drop the compute GPU kernels for pure dense(cr=0)/HCA(cr=128) layers; CSA(cr=4) is unaffected. Profiling launcher now defaults use_distributed_optimizer=False (+ fp32 states), so all three cr capture their compute. dist-opt doesn't affect the fwd/bwd compute the projection needs (optimizer modeled analytically).
- Add CR=mix and DISABLE_PROFILER_CPU options used while diagnosing.
- Un-ignore deepseek-v4/projection/site/data so the breakdown JSON (site/Pages input) is tracked.
- pro.json now real measured data for all three cr (moe cross-checks across cr).

Co-authored-by: Cursor <cursoragent@cursor.com>

- parse_trace: kernel-name _fwd_/_bwd_ now authoritative for phase (dense attention re-runs its _fwd_ kernel in backward with a "Fwd thread id", which previously misfiled dense/HCA attn.core into backward). attn.core now appears in forward for all cr.
- flash.json: real measured data for all three cr (was mock); moe cross-checks across cr (fwd ~43ms / bwd ~27ms).
- write breakdown JSON with trailing newline.

Co-authored-by: Cursor <cursoragent@cursor.com>

… single-node

- parse_trace: aggregate kernel time as total/num_microbatches (num ProfilerStep x GA) instead of min-grouping, fixing 3x moe over-count (grouped-gemm dims vary per routing step so launches never dedup).
- v4_flops.py: port Megatron's V4 closed-form analytic FLOPs (self-test 34112 vs measured 34093 TFLOP, 0.05%); breakdown JSON now carries analytic_flops.
- app.js: TFLOP/s from analytic model FLOPs (Megatron convention, recompute excluded); add calibFactor (0.93) for the single-layer->full-model bias.
- Calibrated vs measured flash 16L single node (PP1/EP8/DP8, GBS64, recompute full): iter 6681 vs 6665 ms (+0.2%), 630 vs 636 TFLOP/s/GPU (-1%), 4905 vs 4917 tokens/s/GPU. See design/06-calibration.md.

Co-authored-by: Cursor <cursoragent@cursor.com>
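The calibration percentages quoted in this commit message follow from a plain relative-deviation calculation, which can be reproduced as below. The function and dictionary names are illustrative; the numbers are the ones from the message (projected value first, measured value second).

```python
def relative_deviation_pct(projected, measured):
    """Signed deviation of the projection from the measurement, in percent."""
    return (projected - measured) / measured * 100.0


# (projected, measured) pairs quoted for the flash 16L single-node run.
CALIBRATION_PAIRS = {
    "iter_ms": (6681.0, 6665.0),
    "tflops_per_gpu": (630.0, 636.0),
    "tokens_per_s_per_gpu": (4905.0, 4917.0),
}


def calibration_report(pairs):
    """Map each metric name to its signed deviation in percent."""
    return {
        name: relative_deviation_pct(projected, measured)
        for name, (projected, measured) in pairs.items()
    }


if __name__ == "__main__":
    for name, dev in calibration_report(CALIBRATION_PAIRS).items():
        print(f"{name}: {dev:+.2f}%")
```

All three metrics land within about one percent of the measurement, which matches the rounded +0.2% and -1% figures in the message.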

wenxie-amd added 5 commits April 28, 2026 08:40

Copilot AI review requested due to automatic review settings April 28, 2026 11:01

Copilot started reviewing on behalf of wenxie-amd April 28, 2026 11:03

Copilot AI reviewed Apr 28, 2026

Copilot AI review requested due to automatic review settings April 28, 2026 11:30

wenxie-amd force-pushed the dev/wenx/deepseek-v4 branch from b8e47a3 to 5e4008d Compare April 28, 2026 11:30

Copilot started reviewing on behalf of wenxie-amd April 28, 2026 11:32

Copilot AI reviewed Apr 28, 2026

Copilot AI review requested due to automatic review settings April 29, 2026 12:25

Copilot started reviewing on behalf of wenxie-amd April 29, 2026 12:26

docs(deepseek-v4): add phase 8+ replan docs and reference notes

1030293

Add the plan-1 roadmap/detail/test documentation plus progress tracker entries, and update the development target doc with TransformerEngine and Primus-Turbo reference pointers. Made-with: Cursor

wenxie-amd force-pushed the dev/wenx/deepseek-v4 branch from ecf8169 to 1030293 Compare April 29, 2026 12:28

Copilot AI reviewed Apr 29, 2026

wenxie-amd added 2 commits April 29, 2026 13:38

Copilot AI review requested due to automatic review settings April 30, 2026 03:08

Copilot started reviewing on behalf of wenxie-amd April 30, 2026 03:10

Copilot AI reviewed Apr 30, 2026

github-code-quality Bot found potential problems Apr 30, 2026

Comment thread primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_layer_specs.py Fixed

Copilot AI review requested due to automatic review settings April 30, 2026 12:46

Copilot started reviewing on behalf of wenxie-amd April 30, 2026 12:47

Copilot AI reviewed Apr 30, 2026

wenxie-amd and others added 4 commits May 21, 2026 09:33

fix(deepseek-v4): correct NUM_STAGES_FWD to solve the NaN issue

f3587ca

fix(deepseek-v4): add the batch axis in sin/cos in compressed pool to…

d897076

… solve the shape-error issue when mbs>1

feat(deepseek-v4): recompte compatibility

6d353b4

github-code-quality Bot found potential problems Jun 8, 2026

Comment thread deepseek-v4/develop/progress/p29/_forensics3.py Fixed

Comment thread deepseek-v4/develop/progress/p29/_forensics2.py Fixed

Comment thread deepseek-v4/develop/progress/p29/_forensics.py Fixed

lhzhang333 added 7 commits June 9, 2026 10:12

update script

38f3bdc

freeze indexer param to enable overlap_grad_reduce/param_gather

aa7aa65

update script: enable overlap_grad_reduce/param_gather and manual_gc

3841e36

github-code-quality Bot found potential problems Jun 11, 2026

Comment thread deepseek-v4/develop/progress/build_roadmap_pptx.py Fixed

JohnQinAMD force-pushed the dev/wenx/deepseek-v4 branch from c84be32 to 3841e36 Compare June 11, 2026 23:23

lhzhang333 added 2 commits June 12, 2026 14:09

update script: add pp-warmup

e7c70a7

update script: remove extra comment

689a260

JohnQinAMD force-pushed the dev/wenx/deepseek-v4 branch from 8283be2 to 689a260 Compare June 12, 2026 15:35

add dsv4 flash script

756b316

github-code-quality Bot found potential problems Jun 17, 2026

lhzhang333 and others added 4 commits June 17, 2026 21:36

enable mtp layer and fp8 indexer

80ac892

update script: add manual_gc to pro muon run

dac0a60


wenxie-amd had a problem deploying to github-pages June 18, 2026 13:15 — with GitHub Actions Failure

wenxie-amd had a problem deploying to github-pages June 18, 2026 14:41 — with GitHub Actions Failure

wenxie-amd had a problem deploying to github-pages June 18, 2026 14:55 — with GitHub Actions Failure

wenxie-amd had a problem deploying to github-pages June 18, 2026 22:47 — with GitHub Actions Failure

		) -> torch.Tensor:
		"""Gather ``[B, P, head_dim]`` along ``P`` per query → ``[B, S, K, head_dim]``.

-    ) -> torch.Tensor:
-        """Gather ``[B, P, head_dim]`` along ``P`` per query → ``[B, S, K, head_dim]``.
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        """Gather ``[B, P, head_dim]`` along ``P`` per query.
+        Returns:
+            A tuple ``(gathered, valid)`` where:
+            - ``gathered`` has shape ``[B, S, K, head_dim]``.
+            - ``valid`` has shape ``[B, S, K]`` and marks non-masked indices.

-FONT_REG = "/home/xiewen12/.local/share/fonts/NotoSansSC-Regular.otf"
-FONT_BOLD = FONT_REG  # we only have Regular; use it for both
-OUT_DIR = os.path.join(os.path.dirname(__file__), "diagrams")
-os.makedirs(OUT_DIR, exist_ok=True)
-def font(sz: int, bold: bool = False) -> ImageFont.FreeTypeFont:
-    return ImageFont.truetype(FONT_BOLD if bold else FONT_REG, sz)
+BASE_DIR = os.path.dirname(__file__)
+FONT_CANDIDATES = (
+    os.environ.get("DIAGRAM_FONT"),
+    os.environ.get("FONT_REG"),
+    os.path.join(BASE_DIR, "NotoSansSC-Regular.otf"),
+    os.path.join(BASE_DIR, "fonts", "NotoSansSC-Regular.otf"),
+)
+def _resolve_font_path() -> str | None:
+    for path in FONT_CANDIDATES:
+        if path and os.path.isfile(path):
+            return path
+    return None
+FONT_REG = _resolve_font_path()
+FONT_BOLD = FONT_REG  # we only have Regular; use it for both when available
+OUT_DIR = os.path.join(BASE_DIR, "diagrams")
+os.makedirs(OUT_DIR, exist_ok=True)
+def font(sz: int, bold: bool = False) -> ImageFont.ImageFont | ImageFont.FreeTypeFont:
+    font_path = FONT_BOLD if bold else FONT_REG
+    if font_path:
+        return ImageFont.truetype(font_path, sz)
+    return ImageFont.load_default()

	return (x32 * rsqrt).to(in_dtype) * self.weight
	return (x32 * rsqrt).to(in_dtype) * self.weight.to(in_dtype)

		v.shape[2]
		v_local_per_q = v.unsqueeze(2).expand(-1, -1, q.shape[2], -1, -1) # [B, H, S, Sk_local, head_dim]

wenxie-amd commented Apr 28, 2026 • edited

Summary

Plan timeline

Plan-2 reshuffle — 2026-05-01 (commit f548d8b2, docs-only)

Why plan-2

Commit map

What landed in 97b9720d (P6/P7)

P6 integration

P7 bring-up

What landed in df273a45 (P8 v2)

What landed in e5fec968 (P9 v2)

What landed in b38e83cf (P10)

What landed in 752b7534 (P10 runtime stabilization + report)

What landed in 636ab3de (P12 — plan-2 lockdown)

Architecture review

Plan-2 documents (active plan of record)

Tech blog closure

Layout cleanup + visuals

Status tracker

Schedule

What landed in cad0fb38 + aa9929a0 (P13 — faithful attention)

cad0fb38 — V4-faithful attention rooted on MLASelfAttention (dense path)

aa9929a0 — Fold Compressor / Indexer + TP-shard projections; retire CSA / HCA legacy

Status

Schedule

What landed in 1a8bf32e (P14 phase-1 — faithful pre-mul clamped SwiGLU + V4 routers)

Activation (G3)

Learned router (G4)

Hash router (G4)

MoE wiring

Tests

Status

Schedule

What landed in 5fe8bc3c (P14 phase-2 — V4 MoE structural bring-up + G5)

DeepseekV4MoE → MegatronModule

CPU local-experts path

Provider helpers (plan-2 P14 §5 / §6)

G5 numerical alignment

Status

Schedule

What landed in 25ccdb5e (P15 — V4 layer / block subclass refactor + token-ids forward kwarg + HC × PP packing)

DeepseekV4HybridLayer → TransformerLayer

DeepseekV4TransformerBlock → TransformerBlock

HC × PP K-stream packing helpers

Token-ids forward kwarg

Spec wiring + MTP block update

Tests (tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_block_pp.py, 16 tests)

Status / blockers

Schedule

What landed in 6c5875d4 (P16 — spec-based MTP via MultiTokenPredictionBlock + process_mtp_loss)

Spec helper (primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_mtp_specs.py, new)

DeepseekV4Model updates (deepseek_v4_model.py)

Layer / block forward contract

V4 attention spec advertises attn_mask_type

Legacy DeepseekV4MTPBlock (deepseek_v4_mtp.py)

Tests (tests/.../test_v4_mtp.py, ~17 tests)

Status / blockers

Schedule

What landed in e591b893 (P17 — code cleanup, gate G14)

Retired in this commit

Dedup'd in this commit

YAML cleanup

Audit gate G14

Out of scope (kept, with notes in status.md)

What landed in b5832672 (P18 — spec-system audit, gate G1 + D1 / D2 / D4)

Provider singleton (D1)

Activation-func consistency (D2)

compress_ratios normalization (D4)

Schema gate G1

Spec audit (light-weight, AST-only)

Schedule

What landed in 83c33ad0 (P19 — distributed re-validation) + dba27163 (plan-2 close-out)

Smokes (10 iters each)

Profile traces

megatron.deepseek_v4.pp_tensor_shape (primus/backends/megatron/patches/deepseek_v4_pp_shape_patches.py)

megatron.deepseek_v4.pp_token_pre_broadcast (primus/backends/megatron/patches/deepseek_v4_get_batch_patches.py)

Model-side cleanup (deepseek_v4_model.py, deepseek_v4_layer_specs.py)

c10d::allreduce_ autograd warning gone

`dba27163` plan-2 close-out (docs-only)