
feat: deepseek-v4 model support #698

Draft
wenxie-amd wants to merge 126 commits into main from dev/wenx/deepseek-v4

wenxie-amd commented Apr 28, 2026

Collaborator

Summary

This PR brings DeepSeek-V4 training support into Primus on the Megatron backend.

It now spans the full bring-up arc (P0 – P10) and the plan-2 lockdown (P12) that closes out plan-0 / plan-1 with an architecture-faithful rewrite plan for the remaining work (P13 – P21).

Plan timeline

| Plan | Phases | Window | Status |
| --- | --- | --- | --- |
| plan-0 (develop/plan-0/) | P0 – P7 | 2026-04-28 | done: initial bring-up, configs, dispatch, layer specs, HC + Hybrid Attn, MoE / activation / RoPE / MTP, single-node smoke (PP=2 EP=4) |
| plan-1 (develop/plan-1/) | P8 – P11 | 2026-04-29 → 2026-04-30 | partial: P8 / P9 / P10 done; P11 paused by the architecture review |
| plan-2 (develop/plan-2/) | P12 – P19 (+ deferred P20 / P21 / P22+) | 2026-05-01 (lockdown) → 2026-05-07 (P19 close-out) | wrapping up: P12 / P13 / P14 / P15 / P16 / P17 / P18 / P19 done; pre-training-first scope means P20 (perf / convergence gates), P21 (docs / handover), and P22+ (HF state-dict adapter) are all deferred follow-ups, gated by the next campaign that needs them |

Plan-2 reshuffle — 2026-05-01 (commit f548d8b2, docs-only)

Pre-training is the release path; HF-weight loading is not required for the release. Plan-2 phase shape after this reshuffle:

| Phase | New scope | Notes |
| --- | --- | --- |
| P17 | Code cleanup (was: state-dict adapter) | retire _RMSNorm duplicates / dual_rope.py / csa_attention.py / hca_attention.py / legacy DeepseekV4MTPBlock / EP all_reduce fallback gate / _v4_token_ids residue / yaml comment fixes. New gate G14 (static dead-code audit). |
| P18 | Spec audit (unchanged; _v4_token_ids removal moved to P17) | |
| P19 | Distributed re-validation (unchanged; G6 / G7 still here) | |
| P20 | Convergence + perf gates (HF numerical-alignment row removed; convergence baseline switched to Megatron-bridge) | |
| P21 | Docs + handover (slimmed; cleanup tasks moved to P17) | techblog / progress HTML / PPT / develop_deepseek-v4-in-primus.md only |
| P22+ | HF state-dict adapter + V4-Flash checkpoint load (deferred) | Activate when SFT / evaluation needs HF weights. Design notes preserved in 02-target-architecture.md §7 + 03-phase-details.md (P22+ section). G8 / G9 deferred from P17; HF-numerical-alignment portion of G12 also deferred here. |

Why plan-2

A code review of dev/wenx/deepseek-v4 against real DeepSeek-V4 (HF reference, NeMo port, official inference) and Megatron's spec + config + provider + submodule + build_module pattern surfaced 28 findings (10 CRIT / 11 HIGH / 6 MED / 5 LOW). Highlights:

  • Attention uses separate linear_k_proj / linear_v_proj; real V4 has a single-latent wkv (K = V = kv).
  • q_norm / kv_norm per-head RMSNorms are missing.
  • HashRouter outputs uniform 1/topk weights with no learnable gate.
  • clamped_swiglu clamps post-mul; real V4 clamps pre-mul: SiLU(clamp(gate, max=alpha)) * clamp(up, +/- alpha).
  • No state-dict adapter: official V4-Flash / V4-Pro HF safetensors cannot be loaded.
  • DeepseekV4Attention / DeepseekV4TransformerBlock / DeepseekV4HybridLayer / DeepseekV4MoE reinvent rather than subclass MLASelfAttention / TransformerBlock / TransformerLayer / MoELayer.

Plan-2 (develop/plan-2/) is the architecture-faithful rewrite. Full review in develop/plan-2/00-review-findings.md; rewrite map in 02-target-architecture.md; phase-by-phase plan in 03-phase-details.md; gates in 04-test-strategy.md.

Commit map

| commit | phase | scope |
| --- | --- | --- |
| e194e039 | docs | architecture deep-dive + plan docs |
| d3383c02 | P1 | configs / yaml + tokenizer |
| 8ae10000 | P2 | model_type=deepseek_v4 dispatch |
| a5d2a561 | P3 | layer spec + block scaffolding |
| 3b7ad8c8 | P4 | HC + Hybrid Attention + dual-RoPE |
| 5e4008dc | P5 | V4 MoE + clamped SwiGLU + V4 MTP |
| 97b9720d | P6-P7 | PP/EP integration fixes + single-node run script + progress docs |
| df273a45 | P8(v2) | LanguageModule migration + DeepSeek runtime spec-tree main path |
| e5fec968 | P9(v2) | provider reuse integration + TE CUDA runtime validation/report |
| b38e83cf | P10(v2) | enforce MoE provider path and add V4 config schema |
| 752b7534 | P10(v2) | stabilize smoke runtime and add phase report |
| 636ab3de | P12(v3) | plan-2 lockdown + as-built techblog + roadmap visuals |
| cad0fb38 | P13(v3) | rebase V4 attention on MLASelfAttention (faithful dense path) |
| aa9929a0 | P13(v3) | fold Compressor / Indexer + TP-shard projections; retire CSA / HCA legacy |
| 1a8bf32e | P14(v3) phase-1 | faithful pre-mul clamped SwiGLU + V4 routers (learnable gate weight; HF-aligned scoring) + G3 + G4 unit tests |
| 5fe8bc3c | P14(v3) phase-2 | DeepseekV4MoE -> MegatronModule + CPU local-experts path; v4_grouped_mlp_spec / v4_router_spec providers; G5 (1L MoE forward <= 1e-3 vs HF reference) |
| 25ccdb5e | P15(v3) | DeepseekV4HybridLayer -> TransformerLayer; DeepseekV4TransformerBlock -> TransformerBlock; HC x PP K-stream packing helpers; HyperHead only on post_process; token_ids forward kwarg replaces decoder._v4_token_ids stash; 16 unit tests |
| 6c5875d4 | P16(v3) | spec-based MTP via upstream MultiTokenPredictionBlock + process_mtp_loss; get_v4_mtp_block_spec helper; layer forward returns (hidden_states, None) for MTP-call compatibility; legacy DeepseekV4MTPBlock deprecated; 17 unit tests |
| f548d8b2 | docs | plan-2 reshuffle: defer HF state-dict adapter to P22+; repurpose P17 for code cleanup; add G14 gate; update roadmap / phase-details / test-strategy / status / README |
| e591b893 | P17(v3) | dead-code retirement (G14): delete legacy DeepseekV4MTPBlock + v4_use_custom_mtp_block / mtp_compress_ratios config fields; introduce shared LocalRMSNorm helper and dedup three _RMSNorm shadows (block.py / attention.py / compressor.py); fix inverted yaml comment (4=CSA / 128=HCA); refresh package __init__ surface; add tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_p17_dead_code.py (G14 audit). dual_rope.py is intentionally kept: it is load-bearing for V4's CSA / HCA dual-base RoPE and has no Megatron equivalent. |
| b5832672 | P18(v3) | spec-system audit (D1 / D2 / D4 / G1): build_context.resolve_v4_provider(config) caches the V4 provider on the config object (replaces three direct DeepSeekV4SpecProvider(...) call sites); new provider.v4_mlp_activation_func() returns None when use_te_activation_func=False (the V4 default, i.e. the clamped-SwiGLU eager path) and TEActivationOp otherwise; compress_ratios normalized to tuple[int, ...] in __post_init__ (so runtime never re-runs ast.literal_eval); new tests/unit_tests/configs/test_deepseek_v4_yaml.py (G1 schema gate) + tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_p18_spec_audit.py (D1 / D2 / package-surface AST audits). |
| 83c33ad0 | P19(v3) | distributed re-validation (G10): two primus-patches that close PP > 1 + VPP under V4: megatron.deepseek_v4.pp_tensor_shape (wraps both schedules.get_tensor_shapes for 1F1B and forward_backward_pipelining_with_interleaving for VPP, multiplies the seq dim by hc_mult so the PP wire carries V4's mHC [S*K, B, D] packing) and megatron.deepseek_v4.pp_token_pre_broadcast (pre-broadcasts all microbatch / chunk input_ids from PP rank 0 across the PP group upfront in a wrapper around get_forward_backward_func, so middle PP stages owning hash-routed MoE layers see real token IDs without deadlocking the interleaved-1F1B / VPP schedule). Drops the in-forward PP broadcast + VPP fail-fast assert from DeepseekV4Model, and stops pre-assigning self.mtp = None so Megatron's set_current_microbatch only iterates model.mtp.layers when MTP is live (matches upstream GPTModel). |
| dba27163 | plan-2 close-out | docs-only: mark the c10d::allreduce_ autograd warning as gone (verified absent in P19 smokes A/B/C/D + EP=8 / PP=2 EP=4 profile runs on mi355-gpu-12); mark G11 (routing-snapshot diff = 0 across PP / EP changes) as deferred (snapshot dump tooling never landed; not on the pre-training release path); drop Phase 20 / 21 / 22+ sections from status.md (kept as documented intent in plan-2/03-phase-details.md); add deepseek-v4/develop/progress/plan-2-summary.md (stand-alone summary of the architecture-faithful rewrite from P12 → P19, including a per-phase outcome table, a P19 deep-dive, the test-gate ledger, the plan-1 → plan-2 architectural-shift table, and pointers to logs / profile traces); add P19 profile launchers (run_profile_ep8.sh for TP=1 PP=1 EP=8 and run_profile_pp2_ep4.sh for TP=1 PP=2 EP=4) plus deepseek-v4/download_ref.sh (idempotent helper that ensures git-lfs and clones, at pinned commits, the V4 reference assets: HF transformers, ROCm/TransformerEngine, AMD-AGI/Primus-Turbo, NVIDIA-NeMo/Automodel, and the four DeepSeek-V4 model repos; it sets GIT_LFS_SKIP_SMUDGE=1 so weights are not downloaded by default). |
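
The shape change described for megatron.deepseek_v4.pp_tensor_shape (83c33ad0) can be sketched in isolation. The helper below is hypothetical: its name and signature are illustrative and are not the patch's API. It only assumes what the commit row states, namely that the seq dim is multiplied by hc_mult so the PP wire carries [S*K, B, D].

```python
from typing import Tuple


def v4_pp_wire_shape(seq_len: int, micro_batch: int, hidden: int, hc_mult: int) -> Tuple[int, int, int]:
    """Hypothetical helper: shape of the activation sent between PP stages.

    Assumes the K = hc_mult hyper-connection streams are packed along the
    sequence dimension, giving [S*K, B, D] instead of the usual [S, B, D].
    """
    if hc_mult < 1:
        raise ValueError("hc_mult must be >= 1")
    return (seq_len * hc_mult, micro_batch, hidden)


# With hc_mult == 1 the shape is the ordinary [S, B, D].
print(v4_pp_wire_shape(128, 2, 256, hc_mult=1))
print(v4_pp_wire_shape(128, 2, 256, hc_mult=4))
```

Both PP peers have to agree on this shape before the first send / recv, which is presumably why the patch wraps the schedule's shape helper rather than reshaping inside the model.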

What landed in 97b9720d (P6/P7)

P6 integration

  • deepseek_v4_builders.py
    • Align model_provider with upstream Megatron signature (config, pg_collection).
  • deepseek_v4_block.py
    • Build only local PP layers via get_num_layers_to_build + get_transformer_layer_offset.
    • Add set_input_tensor support for non-first PP stages.
    • Normalize/parse compress_ratios more robustly.
    • Return viewless output via make_viewless_tensor for PP schedule compatibility.
  • v4_moe.py
    • Add EP-aware local expert sharding and EP all-reduce merge path.
  • deepseek_v4_model.py
    • Keep custom V4 MTP block behind v4_use_custom_mtp_block; default to native GPTModel MTP path for stable bring-up.
  • dual_rope.py, deepseek_v4_attention.py, attn_sink.py
    • Rename DualRoPE.apply -> apply_rope (avoid nn.Module.apply conflict).
    • Cast attention probs to value dtype before matmul to avoid bf16 mismatch.
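
As an illustration of the compress_ratios normalization mentioned above, here is a minimal sketch that parses the value once and stores a tuple of ints. ToyV4Config and normalize_compress_ratios are hypothetical names, not the PR's classes, and the allowed set {0, 4, 128} is taken from the later P13 description rather than from the P6 code.

```python
import ast
from dataclasses import dataclass
from typing import Any, Tuple


def normalize_compress_ratios(value: Any) -> Tuple[int, ...]:
    """Hypothetical helper: coerce a YAML / CLI value into a tuple of ints."""
    if value is None:
        return ()
    if isinstance(value, str):
        # Parse strings such as "[0, 4, 128]" exactly once, at config time.
        value = ast.literal_eval(value)
    if isinstance(value, int):
        value = (value,)
    ratios = tuple(int(r) for r in value)
    allowed = {0, 4, 128}  # assumption: dense / CSA / HCA, per the P13 notes
    bad = [r for r in ratios if r not in allowed]
    if bad:
        raise ValueError(f"unsupported compress_ratio values: {bad}")
    return ratios


@dataclass
class ToyV4Config:
    """Illustrative stand-in, not DeepSeekV4TransformerConfig."""

    compress_ratios: Any = None

    def __post_init__(self) -> None:
        self.compress_ratios = normalize_compress_ratios(self.compress_ratios)


print(ToyV4Config("[0, 4, 128, 4]").compress_ratios)
```

Normalizing in one place means every later consumer can index the tuple directly instead of re-parsing a string.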

P7 bring-up

  • Add run_deepseek_v4.sh (based on run_qwen.bak.sh) with fixed knobs:
    • MBS=1, GBS=16, TP=1, PP=2, EP=4
    • lightweight smoke overrides (num_layers=8, num_experts=8, mtp_num_layers=0)
  • Single-node run passed on:
    • host: uswslocpm2m-106-2371
    • container: dev_primus_wenx_691
    • command: TRAIN_ITERS=3 ./run_deepseek_v4.sh
    • result: reached iteration 3/3, torchrun exit code 0

What landed in df273a45 (P8 v2)

  • deepseek_v4_model.py
    • DeepseekV4Model now inherits from LanguageModule (no longer GPTModel).
    • Remove super_init_transformer_layer_spec path.
    • Build decoder directly from externally supplied DeepSeek runtime transformer_layer_spec.
  • deepseek_v4_layer_specs.py
    • Remove GPT placeholder-spec helpers.
    • Keep DeepSeek-native runtime spec tree only, with full layer/submodules topology.
  • deepseek_v4_builders.py
    • Resolve/pass runtime decoder spec only; remove GPT placeholder/super-init dependence.
  • deepseek-v4/develop/progress/status.md
    • Mark Phase 8(v2) tasks completed and sync notes with the finalized implementation.

Runtime verification:

  • On host uswslocpm2m-106-2371, container dev_primus_wenx_691:
    • Instantiate DeepseekV4Model (LanguageModule-based) with runtime spec tree.
    • Forward pass succeeds with output shape (128, 2, 256).

What landed in e5fec968 (P9 v2)

  • core/extensions/transformer_engine_spec_provider.py
    • Add DeepSeekV4SpecProvider(PrimusTurboSpecProvider) as the V4 provider entry point.
    • Resolve runtime mode (local / te / turbo) and expose V4-specific provider helpers for norm/grouped-MLP selection.
  • deepseek_v4_layer_specs.py
    • Resolve provider once at spec-build time and route norm, attention projection specs, dense projection specs, and MoE grouped path payload through provider-aware ModuleSpec construction.
  • deepseek_v4_attention.py
    • Refactor attention projections to submodules + build_module via DeepseekV4AttentionSubmodules (q_a, q_b, k_proj, v_proj, o_proj) with local fallback.
  • deepseek_v4_block.py
    • Align dense MLP projection initialization with provider-selected linear modules.
    • Add explicit fail-fast guard: TE/Turbo provider mode requires CUDA hidden_states.
  • v4_moe.py
    • Integrate provider grouped-GEMM expert path with safe fallback to local clamped SwiGLU experts.
  • docs/status updates
    • Add deepseek-v4/develop/plan-1/03-phase9-provider-ab-report.md.
    • Update deepseek-v4/develop/progress/status.md with completed Phase 9(v2) items and English-only notes.

Runtime verification:

  • On host uswslocpm2m-106-2371, container dev_primus_wenx_691:
    • local mode forward passes (Linear projections).
    • TE mode module-map build resolves to TELinear projections.
    • TE mode CUDA forward passes (decoder.cuda() + CUDA inputs).
    • TE/Turbo host-input path now fails fast with explicit runtime error instead of low-level GPU fault.

What landed in b38e83cf (P10)

  • core/transformer/moe/v4_moe.py
    • enforce SharedExpertMLP-only shared-expert path (remove local ClampedSwiGLUMLP fallback for shared experts).
    • wire clamped-SwiGLU behavior through SharedExpertMLP config path.
  • core/models/deepseek_v4/deepseek_v4_transformer_config.py
    • add DeepSeekV4TransformerConfig (inherits MLATransformerConfig) with DeepSeek-V4 specific fields used by V4 runtime modules.
    • align aliases/compat in __post_init__ (norm_epsilon, moe_intermediate_size, clamp sync, vocab/padded vocab sync).
  • deepseek_v4_builders.py
    • explicitly build V4 model config via core_transformer_config_from_args(..., config_class=DeepSeekV4TransformerConfig).
  • V4 modules/specs type wiring
    • update V4 builder/spec/model/attention/MoE module signatures and type hints to consume DeepSeekV4TransformerConfig.
  • model yaml
    • add activation_func_clamp_value to primus/configs/models/megatron/deepseek_v4_base.yaml with clamped-SwiGLU comment.
  • docs/progress
    • refresh deepseek-v4/develop/plan-1/* and deepseek-v4/develop/progress/status.md for Phase 10 implementation notes.

Validation in this commit:

  • pre-commit hooks passed (isort/autoflake/black/yaml checks).
  • Python syntax compile checks passed for all touched DeepSeek-V4 runtime files.

What landed in 752b7534 (P10 runtime stabilization + report)

  • run_deepseek_v4.sh
    • add smoke-safe overrides for Phase 10 validation (seq_length/max_position_embeddings=128, index_topk=8).
    • set v4_grouped_experts_support_clamped_swiglu=True for grouped-expert clamped-SwiGLU runtime guard compliance.
    • disable overlap_grad_reduce and overlap_param_gather in smoke mode to avoid DDP bucket reset assertion between iterations.
  • primus/backends/megatron/core/transformer/hyper_connection.py
    • align F.linear weight dtype to activation dtype in HyperMixer and HyperHead to fix BF16 runtime mismatch.
  • primus/backends/megatron/core/transformer/deepseek_v4_attention.py
    • cast attention output back to activation dtype before TE output projection to satisfy TE dtype assertions.
  • deepseek-v4/develop/plan-1/04-phase10-moe-distributed-convergence-report.md
    • add formal Phase 10 report covering delivered architecture, runtime blocker/fix chain, and remaining tracked items.

Runtime verification in this update:

  • host: uswslocpm2m-106-2371
  • container: dev_primus_wenx_691
  • command: ./run_deepseek_v4.sh
  • result: reached iteration 10/10, and torchrun finished successfully (code 0).

What landed in 636ab3de (P12 — plan-2 lockdown)

Documentation-only commit; no runtime code changes.

Architecture review

  • Walked the branch e194e039..HEAD against:
    • deepseek-v4/deepseek-ai/DeepSeek-V4-Flash/{config.json, inference/model.py}
    • HF Transformers PR 45616 / 45643 (deepseek-v4/transformers/.../deepseek_v4/)
    • NVIDIA NeMo AutoModel V4 port (deepseek-v4/NVIDIA-NeMo/Automodel/...)
  • Surfaced 28 findings (10 CRIT / 11 HIGH / 6 MED / 5 LOW), spanning architecture faithfulness, Megatron reuse / spec violations, distributed correctness, spec-system hygiene, code quality, and testing gaps.

Plan-2 documents (active plan of record)

  • deepseek-v4/develop/plan-2/README.md
  • deepseek-v4/develop/plan-2/00-review-findings.md — full severity-ranked findings ledger
  • deepseek-v4/develop/plan-2/01-roadmap.md — phases P12 → P21, dependency graph, milestones, top risks
  • deepseek-v4/develop/plan-2/02-target-architecture.md — module-by-module rewrite map (rebases on MLASelfAttention, TransformerLayer, TransformerBlock, MoELayer, MultiTokenPredictionBlock, (Yarn)RotaryEmbedding)
  • deepseek-v4/develop/plan-2/03-phase-details.md — granular tasks / exit criteria / risks per phase
  • deepseek-v4/develop/plan-2/04-test-strategy.md — L0..L3 test pyramid and release gates G1..G14 (G8 / G9 marked deferred → P22+ since the 2026-05-01 reshuffle)

Plan-1 phase 11 is paused (P8 / P9 / P10 are done); the plan-1 tracking rows in status.md remain for history.

Tech blog closure

  • Added deepseek-v4/develop/techblog/02-plan-1-as-built-and-plan-2-pointer.md: closes plan-0 / plan-1 with an as-built note (what shipped, what fell short) and points readers at plan-2.
  • Updated deepseek-v4/develop/techblog/README.md with a banner declaring plan-2 the active plan of record.

Layout cleanup + visuals

  • Renamed develop/plan/ → develop/plan-0/ (the original bring-up plan; tracked as a rename).
  • Added develop/progress/timeline.html: standard system-fonts version of the project timeline; daily-column Gantt with a May 02 – 05 Holiday band; remaining nine phases (P13 – P21) packed into the May 06 – 09 working window.
  • Added develop/progress/build_roadmap_pptx.py (generator) + develop/progress/deepseek_v4_roadmap_v1.pptx (13-slide tech-style deck on a black background, 16:9). Slide 7 — 07 · 开发计划 · DEVELOPMENT SCHEDULE — is the day-by-day plan with a 3-row layout (date chip / P0~P7-style phase chip / work-content card) plus a directional arrow with the holiday-gap marker.

Status tracker

  • develop/progress/status.md now has explicit Phase 12 → Phase 21 (v3) sections.
  • All P12 engineering items are checked off; only the stakeholder sign-off on plan-2 scope remains open.
  • The blockers/risks log carries one row per CRIT finding, each pointing at the plan-2 phase that resolves it.

Schedule

  • Block A (landed): 2026-04-28 → 2026-05-01 — plan-0 P0 – P7 + plan-1 P8 – P10 + plan-2 P12 lockdown.
  • Holiday: 2026-05-02 → 2026-05-05.
  • Block B (planned): 2026-05-06 → 2026-05-09 — plan-2 P13 – P21 across 4 working days (P13 + P14 / P15 + P16 / P17 + P18 / P19 + P20 + P21). Note: P17 scope changed to code cleanup per the 2026-05-01 reshuffle; HF state-dict adapter + V4-Flash numerical alignment is deferred to P22+ and not in this Block B window.

What landed in cad0fb38 + aa9929a0 (P13 — faithful attention)

Plan-2 P13 lands in two commits inside the May 06 budget. Both are scoped strictly to the dense / CSA / HCA attention path; faithful MoE / router / MTP are tracked in P14 / P15 / P16. (HF state-dict adapter — originally planned for P17 — has since been deferred to P22+ by the 2026-05-01 reshuffle; pre-training does not need it.)

cad0fb38 — V4-faithful attention rooted on MLASelfAttention (dense path)

Rewrite the dense (compress_ratio == 0) path of DeepSeek-V4 attention to be faithful to the released DeepSeek-V4-Flash checkpoint and rooted on Megatron's MLASelfAttention.

  • primus/backends/megatron/core/transformer/deepseek_v4_attention.py
    • New DeepseekV4Attention(MLASelfAttention) subclasses MLA for type identity but bypasses the parent __init__ chain because V4's KV layout differs from MLA's compressed-KV form.
    • Single-latent KV: one linear_kv projection (hidden -> head_dim) feeds both K and V, broadcast across all query heads.
    • Per-head q_rms: parameter-less RMS on head_dim after linear_q_up_proj and before partial RoPE (no q_rms.weight in the released checkpoint).
    • Grouped low-rank O: einsum-based linear_o_a per group + linear_o_b when o_lora_rank > 0. Falls back to MLA-style flat linear_proj when o_lora_rank == 0.
    • Learnable attn_sink: direct nn.Parameter on the attention (matches the released key layers.{i}.attn.attn_sink exactly), with inline softmax-with-sink in _attention_forward.
    • New DeepseekV4AttentionSubmodules dataclass with MLA-canonical names (linear_q_down_proj, linear_q_up_proj, q_layernorm, kv_layernorm) plus V4 extras (linear_kv, linear_o_a, linear_o_b, attn_sink).
    • _LegacyDeepseekV4Attention retained temporarily as the parent for CSAAttention / HCAAttention until the P13 follow-up commit folds the compressor / indexer into the new class.
  • primus/backends/megatron/core/extensions/transformer_engine_spec_provider.py
    • Added v4_q_layernorm(), v4_kv_layernorm(), v4_attention_sink() factory methods.
  • primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_layer_specs.py
    • Routes compress_ratio == 0 to the new class with V4-canonical submodules; legacy path retained for {4, 128}.
  • primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_transformer_config.py
    • Added o_groups: int = 8 and o_lora_rank: int = 0.
  • tests/unit_tests/megatron/transformer/deepseek_v4/test_deepseek_v4_attention.py
    • State-dict-key contract; forward shape + finiteness; numerical equivalence vs an inline V4 reference (single-latent KV, partial interleaved RoPE, attn-sink as virtual key column, grouped low-rank O), with attn_sink enabled and disabled (≤ 1e-3); per-head q_rms is parameter-less; o_lora_rank == 0 fallback path; rejection paths.
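
The "attn-sink as virtual key column" idea used in the reference test can be sketched on a single row of attention scores. This is a scalar, pure-Python illustration under the assumption that the sink contributes one extra logit to the softmax denominator and is then dropped; softmax_with_sink is an illustrative name, and the real code works on tensors inside _attention_forward.

```python
import math
from typing import List


def softmax_with_sink(scores: List[float], sink: float) -> List[float]:
    """Softmax over scores plus one virtual sink logit; the sink column is dropped."""
    m = max(max(scores), sink)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps) + math.exp(sink - m)  # the sink only enlarges the denominator
    return [e / denom for e in exps]


probs = softmax_with_sink([1.0, 2.0, 3.0], sink=0.0)
print(sum(probs))  # below 1: the sink absorbs the remaining probability mass
```

A very negative sink logit recovers the ordinary softmax, which is one way to compare the sink-enabled and sink-disabled paths against each other.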

aa9929a0 — Fold Compressor / Indexer + TP-shard projections; retire CSA / HCA legacy

Closes P13 by folding the compressed-branch attention into the V4-faithful class as spec submodules, switching the TP-sensitive projections to ColumnParallel / RowParallel, and retiring the plan-1 legacy attention classes.

  • primus/backends/megatron/core/transformer/deepseek_v4_attention.py
    • DeepseekV4Attention.__init__ accepts compress_ratio in {0, 4, 128}. When compress_ratio > 0 it builds self.compressor from submodules.compressor; when compress_ratio == 4 it also builds self.indexer from submodules.indexer.
    • DeepseekV4AttentionSubmodules extended with compressor and indexer fields.
    • DeepseekV4Attention.forward now dispatches on self.compress_ratio:
      • 0 — dense / SWA over local KV.
      • 128 — HCA: compressed pool with compress-base partial RoPE on indices [0..P), broadcast to H heads, concat to local KV with a compressed-causal mask, joint softmax-with-sink shared across local + compressed branches.
      • 4 — CSA: per-query top-K from compressed pool via Indexer + overlap-mode Compressor, joint softmax-with-sink across local + sparse keys.
    • _LegacyDeepseekV4Attention and _LegacyDeepseekV4AttentionSubmodules removed.
  • primus/backends/megatron/core/transformer/{csa,hca}_attention.py deleted.
  • primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_layer_specs.py
    • _build_v4_attention_submodules now also builds compressor / indexer ModuleSpecs for compressed branches.
    • linear_q_up_proj switched to provider.column_parallel_linear() (gather_output=True); linear_o_b (grouped) and linear_proj (flat-O fallback) switched to provider.row_parallel_linear() (input_is_parallel=False). At tp > 1 the projection weights are sharded across TP ranks; at tp = 1 the result is bit-identical to the previous duplicated path. linear_q_down_proj, linear_kv, linear_o_a stay duplicated; full grouped-O TP plan is tracked in P14.
  • primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py
    • _build_attention (no-spec fallback) now constructs DeepseekV4Attention for all branches; the new class builds its own Compressor / Indexer locally when no spec is provided.
  • tests/unit_tests/megatron/transformer/deepseek_v4/test_deepseek_v4_attention.py
    • HCA forward shape + finiteness + numerical equivalence vs an inline reference (≤ 1e-3); CSA forward shape + finiteness; spec wiring contract tests for ColumnParallel / RowParallel and Compressor / Indexer presence; torchrun --nproc_per_node=2 parity scaffold (skipif single-rank).
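
The branch selection described above (compressor whenever compress_ratio > 0, indexer only for compress_ratio == 4) can be summarized as a small lookup. BranchPlan and plan_attention_branch are hypothetical names used only for this sketch; the actual dispatch lives in DeepseekV4Attention.__init__ and forward.

```python
from typing import NamedTuple


class BranchPlan(NamedTuple):
    mode: str
    needs_compressor: bool
    needs_indexer: bool


def plan_attention_branch(compress_ratio: int) -> BranchPlan:
    """Map a per-layer compress_ratio to the attention branch it selects."""
    if compress_ratio == 0:
        return BranchPlan("dense", needs_compressor=False, needs_indexer=False)
    if compress_ratio == 4:
        # CSA: per-query top-K over the compressed pool, so it needs the Indexer too.
        return BranchPlan("csa", needs_compressor=True, needs_indexer=True)
    if compress_ratio == 128:
        # HCA: the whole compressed pool is attended to, no Indexer.
        return BranchPlan("hca", needs_compressor=True, needs_indexer=False)
    raise ValueError(f"compress_ratio must be one of 0, 4, 128; got {compress_ratio}")


for ratio in (0, 4, 128):
    print(ratio, plan_attention_branch(ratio))
```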

Status

  • deepseek-v4/develop/progress/status.md: P13 fully checked off (including the items previously deferred to the follow-up commit). Items routed elsewhere are noted as such on each row: P14 (full grouped-O TP plan), P19 (full TP=2 sharding-parity bit-equality check), and the deferred P22+ (HF-reference numerical alignment via the state-dict adapter, originally P17).

Schedule

  • Block A (extended): 2026-05-01 — plan-2 P13 first commit cad0fb38 (early start; the May 02 – 05 holiday remains).
  • Holiday: 2026-05-02 → 2026-05-05.
  • Block B (planned): 2026-05-06 → 2026-05-09 — P14 – P21 across 4 working days. P13 follow-up aa9929a0 is recorded under May 06 in the daily plan.

What landed in 1a8bf32e (P14 phase-1 — faithful pre-mul clamped SwiGLU + V4 routers)

P14 ships in two commits. This one lands the math + parameter-layout faithfulness so V4-Flash checkpoints will load through the future state-dict adapter (originally P17, now deferred to P22+ by the 2026-05-01 reshuffle) without remapping. The structural refactor (DeepseekV4MoE(MoELayer) subclassing, provider helpers, G5 1L MoE forward) is the P14 phase-2 follow-up.

Activation (G3)

  • primus/backends/megatron/core/transformer/clamped_swiglu.py
    • Replace post-multiplication clamp with V4 pre-multiplication semantics: SiLU(clamp(gate, max=alpha)) * clamp(up, +/- alpha). New helpers clamped_swiglu_pre_mul(gate, up, alpha) (split inputs) and clamped_swiglu_pre_mul_fused(x, alpha) ([gate | up] last-dim concat for grouped-gemm experts).
    • ClampedSwiGLUMLP now uses separate w1 / w2 / w3 Linears so the released checkpoint (Expert(w1, w2, w3, swiglu_limit)) loads without remapping. Optional fused_gate_up=True fuses the gate / up GEMMs at forward time only; the saved / loaded state_dict keys remain w1.weight / w2.weight / w3.weight.
  • primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py
    • _DenseSwiGLUMLP now applies the same pre-mul clamp on its dense head/tail layers; previously it computed vanilla SiLU(gate) * up and ignored swiglu_limit.
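
To make the pre-mul versus post-mul difference concrete, here is a scalar sketch. The pre-mul form follows the formula quoted above; the post-mul form is an assumption about the old behavior (clamping the product), since the PR only names it. The function names are illustrative and these are not the tensor helpers in clamped_swiglu.py.

```python
import math


def silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))


def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def clamped_swiglu_pre_mul_scalar(gate: float, up: float, alpha: float) -> float:
    """SiLU(clamp(gate, max=alpha)) * clamp(up, -alpha, +alpha); alpha == 0 disables clamping."""
    if alpha > 0:
        gate = min(gate, alpha)        # one-sided clamp on the gate
        up = clamp(up, -alpha, alpha)  # two-sided clamp on the up branch
    return silu(gate) * up


def clamped_swiglu_post_mul_scalar(gate: float, up: float, alpha: float) -> float:
    """Assumed shape of the old behavior: clamp the product after the multiply."""
    out = silu(gate) * up
    return clamp(out, -alpha, alpha) if alpha > 0 else out


# The two variants disagree as soon as an operand exceeds alpha.
print(clamped_swiglu_pre_mul_scalar(10.0, 10.0, 7.0))
print(clamped_swiglu_post_mul_scalar(10.0, 10.0, 7.0))
```

With gate = up = 10 and alpha = 7, the pre-mul result can exceed alpha in magnitude (each operand is bounded, not the product), while the post-mul result saturates at alpha.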

Learned router (G4)

  • primus/backends/megatron/core/transformer/moe/v4_topk_router.py
    • Rename V4TopKRouter → DeepseekV4LearnedRouter (back-compat alias retained).
    • Gate exposed as weight Parameter of shape [num_experts, hidden_size] — matches Megatron's TopKRouter.weight AND HF reference Gate.weight exactly (no gate.weight indirection).
    • expert_bias is selection-only: routing weights gather from the un-biased scores so probs gradient flows to weight, never to expert_bias.
    • Renormalization gated on score_function != "softmax" (HF parity; softmax probs already sum to 1).
    • topk_scaling_factor honors moe_router_topk_scaling_factor (HF route_scale).
    • Score functions: v4_score_fn covers softmax, sigmoid, sqrtsoftplus.
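
The routing rules listed above can be sketched for a single token in pure Python. This is a simplified illustration, not DeepseekV4LearnedRouter: route_one_token and v4_score are hypothetical names, the logits are taken as given (in the real router they come from hidden @ weight.T), and sqrtsoftplus is assumed to mean sqrt(softplus(x)).

```python
import math
from typing import List, Optional, Tuple


def v4_score(logits: List[float], score_function: str) -> List[float]:
    if score_function == "softmax":
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        return [e / total for e in exps]
    if score_function == "sigmoid":
        return [1.0 / (1.0 + math.exp(-x)) for x in logits]
    if score_function == "sqrtsoftplus":
        # Assumption: sqrt(softplus(x)) = sqrt(log(1 + exp(x))).
        return [math.sqrt(math.log1p(math.exp(x))) for x in logits]
    raise ValueError(f"unknown score_function: {score_function}")


def route_one_token(
    logits: List[float],
    topk: int,
    score_function: str,
    expert_bias: Optional[List[float]] = None,
    scaling_factor: float = 1.0,
) -> Tuple[List[int], List[float]]:
    scores = v4_score(logits, score_function)
    biased = scores if expert_bias is None else [s + b for s, b in zip(scores, expert_bias)]
    # Selection uses the biased scores ...
    chosen = sorted(range(len(scores)), key=lambda e: biased[e], reverse=True)[:topk]
    # ... but the routing weights are gathered from the un-biased scores.
    weights = [scores[e] for e in chosen]
    if score_function != "softmax":
        total = sum(weights)
        weights = [w / total for w in weights]
    return chosen, [w * scaling_factor for w in weights]


print(route_one_token([2.0, 1.0, 0.0, -1.0], topk=2, score_function="sigmoid",
                      expert_bias=[0.0, 0.0, 5.0, 0.0]))
```

In the call above the bias promotes expert 2 into the top-k, yet its routing weight still comes from its un-biased sigmoid score, which is the "selection-only" behavior the bullet describes.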

Hash router (G4)

  • primus/backends/megatron/core/transformer/moe/v4_hash_router.py
    • Rename HashRouter → DeepseekV4HashRouter (back-compat alias retained).
    • Add learnable weight Parameter same shape as the learned router; previously the hash router emitted uniform 1/topk weights, which broke gradient flow into the gate weights and silently differed from the released checkpoint.
    • tid2eid is now a frozen int32 nn.Parameter (requires_grad=False), matching the HF reference layout: the released checkpoint stores it as a parameter, so state-dict round-trips preserve it without polluting the optimizer state.
    • forward(hidden, token_ids) gathers learned scores at the static expert ids prescribed by tid2eid[token_ids]; renorm + scale parity with the learned router.
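
A single-token sketch of the hash-routing step follows: the expert ids come from a static table indexed by token id, and the weights are gathered from the learned scores at those ids. The function name, the list-based table, and the renormalize flag are illustrative assumptions; the real DeepseekV4HashRouter operates on tensors and computes the scores from hidden states.

```python
from typing import List, Sequence, Tuple


def hash_route_one_token(
    token_id: int,
    scores: Sequence[float],
    tid2eid: Sequence[Sequence[int]],
    scaling_factor: float = 1.0,
    renormalize: bool = True,
) -> Tuple[List[int], List[float]]:
    """Pick experts from the static table, weight them with learned scores."""
    if not 0 <= token_id < len(tid2eid):
        raise IndexError(f"token_id {token_id} is outside the tid2eid table")
    expert_ids = list(tid2eid[token_id])          # static, not learned
    weights = [scores[e] for e in expert_ids]     # learned, carries gradient in the real router
    if renormalize:
        total = sum(weights)
        weights = [w / total for w in weights]
    return expert_ids, [w * scaling_factor for w in weights]


# Toy table: 2 token ids, top-2 experts each, 4 experts in total.
table = [[0, 1], [2, 3]]
print(hash_route_one_token(1, scores=[0.1, 0.3, 0.2, 0.6], tid2eid=table))
```

Because the weights depend on the scores rather than being a constant 1/topk, a gradient can reach the gate weight even though the expert choice itself is fixed.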

MoE wiring

  • primus/backends/megatron/core/transformer/moe/v4_moe.py
    • _route now passes (hidden, token_ids) to the hash router; both routers receive hidden_size / score_function / topk_scaling_factor at init.

Tests

  • tests/unit_tests/megatron/transformer/deepseek_v4/test_clamped_swiglu.py — 7 tests cover pre-mul activation vs HF reference (≤ 1e-6 fp32, four alpha values), alpha = 0 disables clamp, fused-vs-split agreement, one-sided gate clamp behavior, w1 / w2 / w3 state-dict keys (no gate_up.weight leak), fused_gate_up forward equivalence, end-to-end ClampedSwiGLUMLP vs HF Expert.forward.
  • tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_routers.py — 13 tests:
    • Score function: parity vs inline reference for all three functions.
    • Learned router: HF agreement across (softmax, sigmoid, sqrtsoftplus) × (with / without expert_bias) ≤ 1e-6; back-compat alias; gradient flows to gate weight; expert_bias detached from probs graph; softmax skips renorm.
    • Hash router: HF agreement across the three score functions ≤ 1e-6; tid2eid is a frozen Parameter (requires_grad=False, dtype int32); state-dict keys; deterministic table across seeds; OOB / shape-mismatch error paths; gradient flows to weight while tid2eid.grad is None.

Status

  • deepseek-v4/develop/progress/status.md: P14 phase-1 tasks checked off with this commit hash (1a8bf32e); deferred items listed for the phase-2 follow-up; the "HashRouter has no learnable gate weight / clamped SwiGLU clamps post-mul" blocker is marked resolved.

Schedule

  • Block A (extended): 2026-05-01 — plan-2 P14 phase-1 commit 1a8bf32e (continuing the early start; May 02 – 05 holiday remains).
  • Block B (planned): 2026-05-06 → 2026-05-09 — P14 phase-2 + P15 – P21 across 4 working days. P13 follow-up aa9929a0 and P14 phase-1 1a8bf32e are recorded under May 01 / 06 in the daily plan.

What landed in 5fe8bc3c (P14 phase-2 — V4 MoE structural bring-up + G5)

Closes plan-2 P14 by bringing DeepseekV4MoE into Megatron's spec lifecycle, exposing a CPU-testable forward path so the MoE math is pinned against the released HF reference, and adding the V4 provider helpers that plan-2 §5 / §6 call for.

DeepseekV4MoE → MegatronModule

  • primus/backends/megatron/core/transformer/moe/v4_moe.py
    • Parent class switched from nn.Module to MegatronModule so it inherits the standard config plumbing and integrates with TransformerLayer.mlp via the spec lifecycle.
    • BaseMoELayer-compatible public surface: set_layer_number(layer_number) mirrors BaseMoELayer.set_layer_number; local_expert_indices is exposed as a list attribute.

CPU local-experts path

  • primus/backends/megatron/core/transformer/moe/v4_moe.py
    • When pg_collection is None, __init__ skips the dispatcher / grouped-experts construction and instead builds:
      • local_experts: nn.ModuleList[ClampedSwiGLUMLP] — one ClampedSwiGLUMLP per local expert (mirrors HF reference Expert exactly: separate w1 / w2 / w3 Linears + V4 pre-multiplication clamp).
      • shared_expert: ClampedSwiGLUMLP — a single shared expert with the same activation.
    • _local_experts_forward runs a per-expert dispatch loop matching DeepSeek-V4-Flash/inference/model.py:MoE.forward exactly (for each routed expert, gather routed tokens, multiply by per-token routing weight, accumulate). Production path (pg_collection provided) continues to use the Megatron dispatcher + grouped experts unchanged.
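
The per-expert dispatch loop described above can be sketched with scalar "tokens" and plain callables as experts. This is a toy under stated assumptions, not _local_experts_forward: the names are illustrative, and adding the shared expert's output to the routed sum is an assumption about how the two paths are combined.

```python
from typing import Callable, List, Optional, Sequence

Expert = Callable[[float], float]


def local_experts_forward(
    tokens: List[float],
    expert_ids: List[List[int]],
    weights: List[List[float]],
    experts: Sequence[Expert],
    shared_expert: Optional[Expert] = None,
) -> List[float]:
    out = [0.0] * len(tokens)
    for e, expert in enumerate(experts):
        # Gather the (token, top-k slot) pairs routed to expert e.
        routed = [(t, k) for t, ids in enumerate(expert_ids) for k, eid in enumerate(ids) if eid == e]
        for t, k in routed:
            # Scale by the per-token routing weight and accumulate.
            out[t] += weights[t][k] * expert(tokens[t])
    if shared_expert is not None:
        out = [o + shared_expert(x) for o, x in zip(out, tokens)]
    return out


experts = [lambda x: 2.0 * x, lambda x: x + 1.0]
print(local_experts_forward([1.0, 3.0], [[0], [0, 1]], [[1.0], [0.5, 0.5]], experts))
```

A loop of this shape is slow but easy to compare against a reference implementation, which fits its use here as a CPU-testable path rather than the production dispatcher.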

Provider helpers (plan-2 P14 §5 / §6)

  • primus/backends/megatron/core/extensions/transformer_engine_spec_provider.py
    • DeepSeekV4SpecProvider.v4_grouped_mlp_spec(swiglu_limit, moe_use_grouped_gemm=True, ...) returns a ready-to-use ModuleSpec(grouped_module, MLPSubmodules) for the V4 MoE expert path. The pre-mul clamp itself is applied via config.activation_func_clamp_value — Megatron's eager glu() (mlp.py:312-321) already implements SiLU(clamp(gate, max=alpha)) * clamp(up, +/- alpha), which is bit-equal to the HF reference math; the spec only commits to the right grouped module + the column / row-parallel linears.
    • DeepSeekV4SpecProvider.v4_router_spec(learned=True/False) returns a bare ModuleSpec for either DeepseekV4LearnedRouter or DeepseekV4HashRouter.

G5 numerical alignment

  • tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_moe.py — 11 tests:
    • Construction sanity: parent class is MegatronModule; CPU path builds local_experts (ClampedSwiGLUMLP) + shared_expert; the token_dispatcher / grouped_experts attributes stay None; set_layer_number propagates.
    • Learned-router MoE forward vs inline HF reference on a 1L toy across (sqrtsoftplus, sigmoid, softmax) × (shared expert on / off) — ≤ 1e-3 fp32 CPU.
    • Hash-router MoE forward vs HF across the three score functions, with token_ids feeding tid2eid — ≤ 1e-3 fp32 CPU.
    • moe_router_topk_scaling_factor (HF route_scale) propagates to the output.
    • Backward populates grads on router.weight, on the shared expert, and on at least one routed expert's w1 / w2 / w3.
    • Hash layer raises a clear error when token_ids is missing.

Status

  • deepseek-v4/develop/progress/status.md — P14 phase-2 tasks ticked with this commit; the structural row records the MegatronModule-via-CPU-path approach and explicitly defers the TopKRouter-rooted aux-loss / z-loss path to P19 alongside the distributed re-validation matrix (rationale: upstream TopKRouter.__init__ registers CUDA buffers unconditionally, which is impractical for CPU-clean V4 routers; gating that on a device check is out-of-scope for this commit).

Schedule

  • Block A (extended): 2026-05-01 — plan-2 P14 phase-2 commit 5fe8bc3c (continuing the early start; May 02 – 05 holiday remains).
  • Block B (planned): 2026-05-06 → 2026-05-09 — P15 – P21 across 4 working days. P13 follow-up aa9929a0, P14 phase-1 1a8bf32e, and P14 phase-2 5fe8bc3c are recorded under May 01 in the daily plan.

What landed in 25ccdb5e (P15 — V4 layer / block subclass refactor + token-ids forward kwarg + HC × PP packing)

Closes plan-2 P15 except the distributed PP-equivalence gate (G6) which is tracked into P19. This commit brings V4's layer / block onto Megatron's TransformerLayer / TransformerBlock parents, drops the decoder._v4_token_ids attribute stash in favor of a real forward kwarg, gates HyperHead to the post_process stage, and extracts HC × PP K-stream packing helpers.

DeepseekV4HybridLayer → TransformerLayer

  • primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py
    • Parent class switched from GraphableMegatronModule to TransformerLayer. TransformerLayer.__init__ is bypassed (V4's submodule contract differs — no cross-attention, no BDA, V4-specific attention signature); MegatronModule.__init__ is called directly.
    • DeepseekV4HybridLayerSubmodules now extends TransformerLayerSubmodules and uses upstream-canonical field names: input_layernorm / self_attention / pre_mlp_layernorm / mlp. The two V4-specific HC mixer hooks attn_hc / ffn_hc remain, both default to None for hc_mult == 1.
    • The layer's forward signature is now upstream-compatible: (hidden_states, attention_mask=None, *, position_ids=None, token_ids=None, **kwargs). attention_mask is accepted and ignored (V4 manages SWA / sink mask internally); position_ids is consumed from the caller (fallback to arange(S) for tiny smokes); **kwargs lets the layer plug into MultiTokenPredictionLayer (P16) without bespoke adapters.

DeepseekV4TransformerBlock → TransformerBlock

  • primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py
    • Parent class switched from nn.Module to TransformerBlock (init bypass via MegatronModule for CPU instantiability; V4 has its own layer-spec / lift-lower pipeline). Type identity unlocks Megatron isinstance checks + sharded-state-dict integration.
    • HyperHead is built only on the post_process stage. Earlier PP stages forward the K-stream tensor via _lower_streams_out (no per-stage HyperHead), saving memory and removing a correctness drift risk.

HC × PP K-stream packing helpers

  • _lift_streams_in(hidden_states, pre_process, hc_mult) / _lower_streams_out(x, post_process, hc_mult) extracted as module-level helpers in deepseek_v4_block.py.
    • First PP stage: [S, B, D] -> [B, S, K, D] (broadcast across K).
    • Non-first PP stage: [S*K, B, D] -> [B, S, K, D] (unfold packed K).
    • Final stage: [B, S, D] -> [S, B, D] (post-HyperHead transpose).
    • Non-final stage: [B, S, K, D] -> [S*K, B, D] (pack K into seq for PP P2P).
    • Both helpers raise clear errors on shape mismatches.
  • The packing math is intentionally K-folded-into-seq (not the batch axis) so sequence-parallel chunking lines up cleanly; PP P2P doesn't need to know about K.
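
The four shape transitions listed above can be illustrated with NumPy. This is a sketch under assumptions: the function names drop the leading underscore, NumPy stands in for torch, and the packed axis is assumed to be laid out seq-major (index `s * K + k`), which the description does not specify.

```python
import numpy as np


def lift_streams_in(hidden: np.ndarray, pre_process: bool, hc_mult: int) -> np.ndarray:
    """Sketch: bring a PP-stage input into [B, S, K, D]."""
    if pre_process:
        # First stage: [S, B, D] -> [B, S, K, D], broadcast across K.
        s, b, d = hidden.shape
        x = hidden.transpose(1, 0, 2)[:, :, None, :]
        return np.broadcast_to(x, (b, s, hc_mult, d)).copy()
    # Later stages: [S*K, B, D] -> [B, S, K, D], unfolding the packed K.
    sk, b, d = hidden.shape
    if sk % hc_mult != 0:
        raise ValueError(f"packed seq dim {sk} is not a multiple of hc_mult={hc_mult}")
    return hidden.reshape(sk // hc_mult, hc_mult, b, d).transpose(2, 0, 1, 3)


def lower_streams_out(x: np.ndarray, post_process: bool, hc_mult: int) -> np.ndarray:
    """Sketch: prepare a PP-stage output for the next stage or the head."""
    if post_process:
        # Final stage: [B, S, D] -> [S, B, D] (streams already collapsed).
        if x.ndim != 3:
            raise ValueError("final stage expects a collapsed [B, S, D] tensor")
        return x.transpose(1, 0, 2)
    # Non-final stage: [B, S, K, D] -> [S*K, B, D], packing K into seq.
    if x.ndim != 4 or x.shape[2] != hc_mult:
        raise ValueError("non-final stage expects [B, S, K, D] with K == hc_mult")
    b, s, k, d = x.shape
    return x.transpose(1, 2, 0, 3).reshape(s * k, b, d)
```

Because the lower and lift steps are pure permutations and reshapes, a lower followed by a lift reproduces the input exactly, which is the property the roundtrip tests rely on.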

Token-ids forward kwarg

  • primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_model.py
    • DeepseekV4Model.forward no longer assigns decoder._v4_token_ids (and removes the try/finally cleanup). It now passes token_ids=input_ids and position_ids=position_ids directly to self.decoder(...).
    • The decoder block + each layer consume them as standard forward kwargs and propagate to mlp.forward -> hash_router.forward.
    • An AST-level audit (test_v4_block_pp.py::test_model_forward_does_not_set_decoder_v4_token_ids_attribute) prevents the attribute stash from regressing.
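
An AST-level audit of this kind can be sketched with the standard `ast` module. The helper name `find_attribute_uses` and the two source snippets are hypothetical; the actual test in test_v4_block_pp.py may be structured differently.

```python
import ast
from typing import List


def find_attribute_uses(source: str, attr_name: str) -> List[int]:
    """Return line numbers where `<expr>.<attr_name>` is read or written."""
    tree = ast.parse(source)
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Attribute) and node.attr == attr_name
    )


# Hypothetical "before" and "after" forward bodies.
STASH_SOURCE = (
    "def forward(self, input_ids):\n"
    "    self.decoder._v4_token_ids = input_ids\n"
    "    return self.decoder()\n"
)
KWARG_SOURCE = (
    "def forward(self, input_ids):\n"
    '    """Replaces the old _v4_token_ids stash."""\n'
    "    return self.decoder(token_ids=input_ids)\n"
)
```

Scanning the syntax tree rather than grepping the text means a docstring or comment that mentions the old attribute does not trip the audit.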

Spec wiring + MTP block update

  • primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_layer_specs.py renames the four core fields when constructing DeepseekV4HybridLayerSubmodules: attn_norm → input_layernorm, attention → self_attention, ffn_norm → pre_mlp_layernorm, ffn → mlp.
  • primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_mtp.py switches the per-MTP-layer call to layer(stream, position_ids=..., token_ids=...) (kwarg, not positional) to match the new layer forward signature.

Tests (tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_block_pp.py, 16 tests)

  • Subclass identity: DeepseekV4HybridLayer is a TransformerLayer; DeepseekV4TransformerBlock is a TransformerBlock; DeepseekV4HybridLayerSubmodules extends TransformerLayerSubmodules and exposes attn_hc / ffn_hc.
  • Lift / lower roundtrip: bit-exact across the four PP-stage permutations (pre_process × post_process), for both single-stream (hc_mult=1) and multi-stream (K=3, K=4).
  • Error paths: misaligned S*K on non-first stage; collapsed input on non-final lower; uncollapsed input on final lower.
  • Token-ids stash: AST audit confirms decoder._v4_token_ids is gone from the model source; token_ids=input_ids kwarg is present.
  • Forward signatures: block.forward exposes position_ids + token_ids kwargs; layer.forward accepts (hidden_states, attention_mask=None, position_ids, token_ids).

Status / blockers

  • deepseek-v4/develop/progress/status.md — Phase 15 tasks ticked except G6 (PP=1 vs PP=2 vs PP=4 equivalence on a 4L toy), which requires distributed init and is tracked into P19 distributed re-validation. The CPU-only sub-gate — _lift_streams_in after _lower_streams_out is bit-exact — is covered by the lift/lower roundtrip tests, which is the math contract a real PP run depends on.
  • Two blocker rows resolved:
    • "Custom V4 block / layer / MoE bypass TransformerBlock / TransformerLayer / MoELayer" — closed by P14 phase-2 + P15.
    • "Token-IDs propagation via decoder._v4_token_ids attribute" — closed by P15.

Schedule

  • Block A (extended): 2026-05-01 — plan-2 P15 commit 25ccdb5e (continuing the early start; May 02 – 05 holiday remains).
  • Block B (planned): 2026-05-06 → 2026-05-09 — P16 – P21 across 4 working days. P14 phase-1 1a8bf32e, P14 phase-2 5fe8bc3c, and P15 25ccdb5e are recorded under May 01 in the daily plan.

What landed in 6c5875d4 (P16 — spec-based MTP via MultiTokenPredictionBlock + process_mtp_loss)

Closes plan-2 P16 except the distributed MTP-loss ablation gate (G7), which is tracked into P19 alongside G6. This commit wires V4 onto Megatron's upstream MTP pipeline so the auxiliary multi-token-prediction loss flows through process_mtp_loss (per-depth shifted logits + MTPLossAutoScaler) instead of the standalone primus-owned MTP block. The legacy DeepseekV4MTPBlock remains behind the v4_use_custom_mtp_block config flag for back-compat with research checkpoints (planned removal: P17 — moved up from P21 by the 2026-05-01 reshuffle) and now emits a DeprecationWarning on construction.

Spec helper (primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_mtp_specs.py, new)

  • get_v4_mtp_block_spec(config, *, transformer_layer_spec, vp_stage) returns
    ModuleSpec(MultiTokenPredictionBlock, submodules=MultiTokenPredictionBlockSubmodules(layer_specs=[...]*mtp_num_layers)).
  • Each per-depth MultiTokenPredictionLayer spec pulls
    • enorm / hnorm / layer_norm from DeepSeekV4SpecProvider.v4_norm_module()
    • eh_proj from provider.column_parallel_linear()
    • mtp_model_layer from the V4 hybrid-layer spec passed in by the model — so each MTP depth shares HC, hash routing, and clamped-SwiGLU with the main decoder exactly.
  • Rejects mtp_num_layers < 1 with a clear ValueError.

DeepseekV4Model updates (deepseek_v4_model.py)

  • New default path: when mtp_num_layers > 0 and not v4_use_custom_mtp_block, __init__ builds self.mtp = MultiTokenPredictionBlock(spec=get_v4_mtp_block_spec(...)) on stages where mtp_on_this_rank() is True. mtp_on_this_rank is wrapped in try/except so CPU smokes (no parallel_state) do not crash; self.mtp_process is False and self.mtp is None on those paths.
  • Legacy DeepseekV4MTPBlock path stays available behind v4_use_custom_mtp_block; self.mtp_block is the legacy slot, self.mtp is the new spec-based slot. Both are None when MTP is disabled.
  • forward now mirrors GPTModel.forward: runs self.mtp(...) on stages with MTP layers (passing input_ids / position_ids / hidden_states / attention_mask / embedding / packed_seq_params), then on post_process with mtp_num_layers > 0 calls process_mtp_loss(...) which chunks the concatenated hidden states, computes the per-depth shifted MTP loss, and folds it into the gradient via MTPLossAutoScaler.
  • New forward kwargs: loss_mask (forwarded to process_mtp_loss) and packed_seq_params.

Layer / block forward contract

  • DeepseekV4HybridLayer.forward now returns (hidden_states, None) instead of just hidden_states. This matches upstream TransformerLayer (which returns (hidden_states, context)) and is required by MultiTokenPredictionLayer._proj_and_transformer_layer which unpacks hidden_states, _ = self.mtp_model_layer(...).
  • DeepseekV4TransformerBlock's per-layer iteration updates to x, _ = layer(...).
  • Legacy DeepseekV4MTPBlock likewise updates to unpack the tuple.

V4 attention spec advertises attn_mask_type

  • The V4 attention spec now declares params={"compress_ratio": ..., "attn_mask_type": AttnMaskType.causal}. MultiTokenPredictionLayer.__init__ validates the inner layer's self_attention.params['attn_mask_type'] against {padding, causal, no_mask, padding_causal}; without this the MTP block fails to construct. The value is functionally inert for V4 (which manages its own SWA / sink mask).
  • DeepseekV4Attention.__init__ accepts and ignores attn_mask_type plus a **kwargs catch-all so the spec lifecycle keeps working.

Legacy DeepseekV4MTPBlock (deepseek_v4_mtp.py)

  • Module docstring annotated as deprecated (planned removal: P17 — moved up from P21 by the 2026-05-01 reshuffle).
  • Construction emits a DeprecationWarning pointing users at get_v4_mtp_block_spec. Code path unchanged otherwise.
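
The construction-time warning follows the usual `warnings` pattern. A minimal sketch, assuming a hypothetical stand-in class `LegacyMTPBlock` and an illustrative message (the real wording is not shown in this description):

```python
import warnings


class LegacyMTPBlock:
    """Hypothetical stand-in for the deprecated block."""

    def __init__(self) -> None:
        warnings.warn(
            "LegacyMTPBlock is deprecated; build MTP via get_v4_mtp_block_spec instead.",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller, not to __init__
        )


def construct_and_collect() -> list:
    """Build the block once and return the warnings it raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        LegacyMTPBlock()
    return caught
```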

Tests (tests/.../test_v4_mtp.py, ~17 tests)

  • get_v4_mtp_block_spec structural assertions: outer module is MultiTokenPredictionBlock; layer_specs length matches mtp_num_layers (parametrised 1/2/3); each per-depth spec is a MultiTokenPredictionLayer; the V4 inner layer is threaded through unchanged; norm + linear come from the V4 provider.
  • Rejects mtp_num_layers=0 with a clear ValueError.
  • DeepseekV4HybridLayerSubmodules extends TransformerLayerSubmodules so MTP picks up the GPT path (not Mamba) in its inner-layer-submodules isinstance check.
  • DeepseekV4HybridLayer.forward returns (hidden_states, None) (source-level assertion on return x, None).
  • V4 attention spec advertises AttnMaskType.causal (source-level assertion).
  • Legacy DeepseekV4MTPBlock emits DeprecationWarning on construction.
  • AST audits on deepseek_v4_model.py: process_mtp_loss is called; upstream MTP machinery is imported; spec helper is invoked; v4_use_custom_mtp_block flag is preserved; the mtp_num_layers > 0 guard keeps the no-MTP path inert.

Status / blockers

  • deepseek-v4/develop/progress/status.md — Phase 16 tasks ticked except G7 (MTP loss appears in train log; mtp_num_layers=0 vs mtp_num_layers=1 ablation matches LM loss to 1e-6), which requires distributed init + MultiTokenPredictionBlock runtime (CP / SP plumbing); tracked into P19 distributed re-validation alongside G6.
  • Two new follow-on rows recorded for the cross-cutting layer-tuple return + attention attn_mask_type declarations (both required by upstream MTP wiring).

Schedule

  • Block A (extended): 2026-05-01 — plan-2 P16 commit 6c5875d4 (continuing the early start; May 02 – 05 holiday remains).
  • Block B (planned): 2026-05-06 → 2026-05-09 — P17 – P21 across 4 working days. P14 phase-1 1a8bf32e, P14 phase-2 5fe8bc3c, P15 25ccdb5e, and P16 6c5875d4 are recorded under May 01 in the daily plan.

What landed in e591b893 (P17 — code cleanup, gate G14)

P17 ships the dead-code retirement that was front-loaded from P21 in the 2026-05-01 reshuffle (f548d8b2). With pre-training as the release path, the HF state-dict adapter slot moved out (deferred to P22+) and the cleanup work moved up so P18's spec audit walks a clean tree.

Retired in this commit

  • primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_mtp.py — the legacy primus-owned DeepseekV4MTPBlock was deprecation-warned since P16 (6c5875d4); the spec-based path (get_v4_mtp_block_spec + upstream MultiTokenPredictionBlock + process_mtp_loss) is the only MTP route now.
  • DeepSeekV4TransformerConfig.v4_use_custom_mtp_block (legacy MTP gate) — removed.
  • DeepSeekV4TransformerConfig.mtp_compress_ratios (legacy-only field) — removed.
  • DeepseekV4Model.__init__ — single MTP branch on the spec path; the if v4_use_custom_mtp_block arm + self.mtp_block field are gone.

Dedup'd in this commit

  • primus/backends/megatron/core/transformer/local_rmsnorm.py (new) — one canonical LocalRMSNorm consumed by deepseek_v4_block.py (input_layernorm / pre_mlp_layernorm / final_layernorm fallback), deepseek_v4_attention.py (q_norm / kv_norm fallback closure), and compressor.py (kv_norm). The three pre-existing _RMSNorm definitions are deleted.

YAML cleanup

  • deepseek_v4_flash.yaml — inverted comment fixed: 4 = CSA (overlap) and 128 = HCA (non-overlap) match DeepseekV4Attention.forward dispatch.
  • deepseek_v4_pro.yaml + deepseek_v4_base.yaml — same canonical comment block added so all three V4 yamls are self-documenting.

Audit gate G14

  • tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_p17_dead_code.py (new):
    • retired files gone (deepseek_v4_mtp.py, csa_attention.py, hca_attention.py).
    • legacy import path raises ImportError; package __all__ no longer exposes DeepseekV4MTPBlock.
    • DeepSeekV4TransformerConfig no longer carries v4_use_custom_mtp_block / mtp_compress_ratios.
    • AST scan over every V4 source for runtime _v4_token_ids access (Attribute / Assign / Name) — docstring mentions are exempt.
    • AST scan over every V4 source for class _RMSNorm shadow definitions — none allowed.
    • parameterised yaml check that the canonical 4 = CSA / 128 = HCA mapping is documented.

Out of scope (kept, with notes in status.md)

  • primus/backends/megatron/core/transformer/dual_rope.py — load-bearing for V4's CSA / HCA dual-base partial RoPE; Megatron's RotaryEmbedding only supports a single base. Plan-2 was over-eager listing this for retirement; it stays.

What landed in b5832672 (P18 — spec-system audit, gate G1 + D1 / D2 / D4)

P18 closes the spec-system audit findings D1 / D2 / D4 from 00-review-findings.md. Walking a clean tree (after P17) makes the audits crisp.

Provider singleton (D1)

  • primus/backends/megatron/core/models/deepseek_v4/build_context.py (new): resolve_v4_provider(config) caches a single DeepSeekV4SpecProvider on the config object via a private attribute. Different configs get different providers; the cache is GC'd when the config is released.
  • All three direct DeepSeekV4SpecProvider(config=config) call sites migrated to the helper:
    • deepseek_v4_block.py (_build_projection + DeepseekV4MoE shared-expert wiring)
    • deepseek_v4_layer_specs.py
    • deepseek_v4_mtp_specs.py
  • AST audit (test_v4_p18_spec_audit.py::test_no_direct_DeepSeekV4SpecProvider_construction_outside_build_context) rejects future regressions; build_context.py is the only allowed instantiation site.
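
The per-config caching described for resolve_v4_provider can be sketched as follows. The class `SpecProvider`, the function `resolve_provider`, and the attribute name `_v4_spec_provider` are hypothetical stand-ins; the real private attribute name is not given here.

```python
from types import SimpleNamespace

_CACHE_ATTR = "_v4_spec_provider"  # assumed name of the private cache attribute


class SpecProvider:
    """Hypothetical stand-in for DeepSeekV4SpecProvider."""

    def __init__(self, config) -> None:
        self.config = config


def resolve_provider(config) -> SpecProvider:
    """Return the provider cached on `config`, creating it on first use."""
    provider = getattr(config, _CACHE_ATTR, None)
    if provider is None:
        provider = SpecProvider(config=config)
        setattr(config, _CACHE_ATTR, provider)
    return provider


cfg_a = SimpleNamespace(hc_mult=4)
cfg_b = SimpleNamespace(hc_mult=1)
```

Storing the instance on the config object ties the cache lifetime to the config, so no module-level registry has to be cleared between tests.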

Activation-func consistency (D2)

  • New helper DeepSeekV4SpecProvider.v4_mlp_activation_func() returns:
    • None when config.use_te_activation_func is False — the V4 default; needed so Megatron MLP keeps the eager clamped-SwiGLU path (which applies activation_func_clamp_value).
    • TEActivationOp (the TE class, instantiated by Megatron MLP at build) when the user opts into TE.
  • Layer specs + DeepseekV4MoE shared-expert spec switched to the V4 helper. The base provider's activation_func() is unchanged (BackendSpecProvider contract still says "returns a type").

compress_ratios normalization (D4)

  • DeepSeekV4TransformerConfig.__post_init__ calls _normalize_compress_ratios_field on the raw value once, so downstream consumers see tuple[int, ...] (or None). The helper handles strings ("[0, 0, 4, 128, ...]") and real lists.
  • Runtime helpers (_parse_int_sequence / _normalize_compress_ratios in deepseek_v4_block.py) keep accepting both forms for back-compat, but always receive the normalized form on the live path.
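
A normalization step of this shape might look like the sketch below. The function name `normalize_compress_ratios` and the use of `ast.literal_eval` for the string form are assumptions, not the actual `_normalize_compress_ratios_field` implementation.

```python
import ast
from typing import Optional, Tuple


def normalize_compress_ratios(raw) -> Optional[Tuple[int, ...]]:
    """Sketch: accept None, a list-like, or its string form; return a tuple of ints."""
    if raw is None:
        return None
    if isinstance(raw, str):
        # Assumed parsing strategy: the YAML value arrives as a Python-style list literal.
        raw = ast.literal_eval(raw)
    return tuple(int(v) for v in raw)


from_string = normalize_compress_ratios("[0, 0, 4, 128]")
from_list = normalize_compress_ratios([0, 0, 4, 128])
```

Normalizing once at config construction means every downstream consumer can rely on a single hashable, immutable type.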

Schema gate G1

  • tests/unit_tests/configs/test_deepseek_v4_yaml.py (new): parameterises over deepseek_v4_{base,flash,pro}.yaml:
    • parse_yaml() succeeds; required fields present.
    • DeepSeekV4TransformerConfig builds from the parsed dict.
    • compress_ratios normalized to tuple[int, ...] with no value drift vs the raw schedule.
    • every compress_ratios entry is in {0, 4, 128} (canonical V4 branches).
    • retired P17 fields (v4_use_custom_mtp_block / mtp_compress_ratios) are gone from the dataclass and from each YAML.
    • V4-specific runtime fields (HC, sliding-window, sink, o_groups / o_lora_rank, MoE extras, swiglu_limit) all declared on the dataclass.
    • provider singleton: resolve_v4_provider(cfg_a) returns the same instance on repeated calls; different configs get different providers.
    • v4_mlp_activation_func contract verified for both branches of use_te_activation_func.

Spec audit (light-weight, AST-only)

  • tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_p18_spec_audit.py (new):
    • D1 / D2 audits described above.
    • package surface __init__.py __all__ does not re-export DeepseekV4MTPBlock (P17 cross-check).
    • spec builders do not eagerly construct TENorm / TE{Column,Row}ParallelLinear / TELinear / TEActivationOp inside __init__ — they emit ModuleSpec(module=...) references that runtime build_module resolves.

Schedule

  • Block A (extended): 2026-05-01 — plan-2 P17 + P18 commits e591b893 + b5832672 (continuing the early start; May 02 – 05 holiday remains).
  • Block B (planned): 2026-05-06 → 2026-05-09 — P19 – P21 across 4 working days. P17 + P18 are recorded under May 01 in the daily plan; P19 (distributed re-validation) is the first item in Block B.

What landed in 83c33ad0 (P19 — distributed re-validation) + dba27163 (plan-2 close-out)

P19 closes the distributed re-validation gate (G10) for the architecture-faithful V4 stack landed across P13 → P18. All four target smokes pass 10/10 iterations on mi355-gpu-12 (BF16, MBS=1 GBS=16, seq=128, 8 layers / 3 hash layers / hc_mult=4); two torch.profiler chrome-trace JSONs (EP=8 and PP=2 EP=4) are captured for the perf-baseline reference.

Smokes (10 iters each)

| smoke | parallelism | result | gating patch | log |
| --- | --- | --- | --- | --- |
| A | TP=1 PP=1 EP=1 | 10/10 | none (HC stays in-stage; PP > 1 patches are no-ops) | deepseek-v4/develop/progress/p19/smokeA*.log |
| B | TP=1 PP=2 EP=4 | 10/10 | pp_tensor_shape | p19/smokeB*.log |
| C | TP=1 PP=4 EP=2 | 10/10 | pp_tensor_shape + pp_token_pre_broadcast | p19/smokeC_pp4_ep2_v2.log |
| D | TP=1 PP=2 EP=4 VPP=2 | 10/10 | pp_tensor_shape (also wraps the interleaved schedule) + pp_token_pre_broadcast (upfront) | p19/smokeD_pp2_ep4_vpp2_v2_run3.log |

Profile traces

torch.profiler chrome-trace JSONs (single active step, iter 6 → 7) under the same V4 smoke config:

  • output/amd/tas-mi355x-20260507/p19_profile_pp1_ep8/tensorboard/...rank[0].*.pt.trace.json — TP=1 PP=1 EP=8 (~99 MB).
  • output/amd/tas-mi355x-20260507/p19_profile_pp2_ep4/tensorboard/...rank[0].*.pt.trace.json — TP=1 PP=2 EP=4 (~105 MB).

Launchers: deepseek-v4/develop/progress/p19/run_profile_ep8.sh and run_profile_pp2_ep4.sh.

megatron.deepseek_v4.pp_tensor_shape (primus/backends/megatron/patches/deepseek_v4_pp_shape_patches.py)

Wraps two Megatron entry points in megatron.core.pipeline_parallel.schedules so V4's mHC K = hc_mult packing is reflected on the PP wire:

  1. get_tensor_shapes (used by 1F1B): seq dim multiplied by hc_mult so the receive buffer matches [S * K, B, D] instead of the stock [S, B, D].
  2. forward_backward_pipelining_with_interleaving (used by VPP): seq_length kwarg multiplied by hc_mult before the schedule's inline tensor_shape = [seq_length, mbs, hidden] runs.

Both wrappers gate on model_type == "deepseek_v4" + hc_mult > 1 + PP > 1 and are strict no-ops otherwise. Without (2) VPP allocates [S, B, D] recv buffers while the sender emits [S * K, B, D], and _lift_streams_in reshapes the truncated copy — surfaces as DeepseekV4HashRouter: hidden=32 vs token_ids=128.
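
The gating and seq-dim scaling for the first wrapper can be sketched as a plain decorator-style function. Everything here is illustrative: `wrap_get_tensor_shapes`, `stock_get_tensor_shapes`, and the assumption that the stock function returns a list of `(seq, mbs, hidden)` tuples are stand-ins, not the real Megatron signature or the actual patch.

```python
import functools
from typing import Callable, List, Tuple

Shape = Tuple[int, ...]


def wrap_get_tensor_shapes(
    original: Callable[..., List[Shape]],
    *,
    model_type: str,
    hc_mult: int,
    pp_size: int,
) -> Callable[..., List[Shape]]:
    """Sketch: multiply the seq dim by hc_mult only when all three gates hold."""
    active = model_type == "deepseek_v4" and hc_mult > 1 and pp_size > 1

    @functools.wraps(original)
    def wrapper(*args, **kwargs) -> List[Shape]:
        shapes = original(*args, **kwargs)
        if not active:
            return shapes  # strict no-op outside the gated configuration
        return [(seq * hc_mult, *rest) for (seq, *rest) in shapes]

    return wrapper


def stock_get_tensor_shapes() -> List[Shape]:
    """Hypothetical stand-in returning one [S, B, D] shape."""
    return [(128, 1, 64)]
```

With seq=128 and hc_mult=4, the wrapped shape has a leading dimension of 512, matching the `[S * K, B, D]` tensor the sender emits.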

megatron.deepseek_v4.pp_token_pre_broadcast (primus/backends/megatron/patches/deepseek_v4_get_batch_patches.py)

V4's hash-routed MoE layers (the first num_hash_layers) need raw input_ids on every PP stage that owns one, but pretrain_gpt.get_batch returns None on middle PP stages. Two earlier in-loop hooks both deadlocked under VPP — an in-DeepseekV4Model.forward broadcast and a per-call get_batch broadcast each raced the interleaved schedule's pre-warmup recv_forward.wait().

This patch wraps pp_module.get_forward_backward_func so each train_step first runs all num_microbatches × num_chunks PP dist.broadcast collectives upfront, before the schedule's first send / recv, and caches the resulting (tokens, labels, loss_mask, attention_mask, position_ids, packed_seq_params) tuples per (vp_stage, microbatch). A companion wrapper around pretrain_gpt.get_batch consumes the cache when active and falls back to the original implementation otherwise. Cache is reset in a finally after each schedule call. Cost ≈ mbs * seq * 8B per microbatch (~32 KiB / step on the smoke), dwarfed by the activation P2P.
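
The prefetch-then-consume pattern can be sketched without any distributed calls. The names `run_step_with_prefetch` and `cached_get_batch` are hypothetical, and `fetch_batch` stands in for the broadcast; the real patch wraps Megatron entry points rather than taking callables.

```python
from typing import Any, Callable, Dict, Tuple

Key = Tuple[int, int]  # (vp_stage, microbatch)


def run_step_with_prefetch(
    schedule: Callable[[], Any],
    fetch_batch: Callable[[int, int], Any],
    num_microbatches: int,
    num_chunks: int,
    cache: Dict[Key, Any],
) -> Any:
    """Sketch: fill the cache for every (vp_stage, microbatch) before the schedule runs."""
    try:
        for vp_stage in range(num_chunks):
            for microbatch in range(num_microbatches):
                cache[(vp_stage, microbatch)] = fetch_batch(vp_stage, microbatch)
        return schedule()
    finally:
        cache.clear()  # reset after each schedule call, even on error


def cached_get_batch(
    cache: Dict[Key, Any],
    original_get_batch: Callable[[int, int], Any],
    vp_stage: int,
    microbatch: int,
) -> Any:
    """Sketch: serve from the cache when active, else fall back to the original."""
    key = (vp_stage, microbatch)
    if key in cache:
        return cache[key]
    return original_get_batch(vp_stage, microbatch)
```

Issuing every fetch before the schedule starts keeps the collectives out of the window where the interleaved schedule is already waiting on its first receive.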

Model-side cleanup (deepseek_v4_model.py, deepseek_v4_layer_specs.py)

  • Drop the in-forward input_ids PP broadcast + VPP fail-fast assert from DeepseekV4Model; the pre-broadcast patch handles both 1F1B and VPP cleanly.
  • Stop pre-assigning self.mtp = None in __init__; Megatron's set_current_microbatch (in cuda_graphs.py) only iterates model.mtp.layers when MTP is actually live, which matches upstream GPTModel. Downstream MTP guards use getattr(self, "mtp", None).
  • Import DeepSeekV4SpecProvider in deepseek_v4_layer_specs.py so the type annotation resolves at module load (NameError surfaced once turbo path was off).

c10d::allreduce_ autograd warning gone

The historical UserWarning: An operator was called with autograd not registered for c10d::allreduce_ came from the early bring-up's "local shard + torch.distributed.all_reduce" path for MoE routed-output aggregation in v4_moe.py. P14 phase-2 migrated MoE to Megatron's token dispatchers (MoEAlltoAllTokenDispatcher / MoEFlexTokenDispatcher); P17 deleted the v4_enable_ep_allreduce_fallback debug gate; and P19 confirms zero c10d::allreduce hits in stderr across all four smokes + the EP=8 / PP=2 EP=4 profile runs.

dba27163 plan-2 close-out (docs-only)

  • status.md — mark c10d::allreduce_ warning as gone (with the verification log paths); mark G11 as [-] deferred (snapshot dump tooling never landed); drop Phase 20 / 21 / 22+ sections (kept as documented intent in plan-2/03-phase-details.md); refresh the Blockers / Risks log entry for c10d to reference the actual P19 verification rather than "still tracked into P19".
  • deepseek-v4/develop/progress/plan-2-summary.md (new) — stand-alone summary of the plan-2 architecture-faithful rewrite (P12 → P19): per-phase outcome with key commits; P19 deep-dive (smokes / profile traces / patches / c10d verification); test-gate ledger (G1 / G3 / G4 / G5 / G6 / G7 / G11 / G14 + smokes); plan-1 → plan-2 architectural-shift table (attention, MoE, layer / block, MTP, token-IDs path, HC × PP, TP, spec hygiene); explicit deferred / out-of-scope list (G6 distributed, G7 MTP, G11, P20, P21, P22+).
  • P19 profile launchers: run_profile_ep8.sh (TP=1 PP=1 EP=8) and run_profile_pp2_ep4.sh (TP=1 PP=2 EP=4); both set PROFILE=True + disable_tensorboard=False so the existing torch_profiler_patches.py hook captures iter 6 → 7.
  • deepseek-v4/download_ref.sh — idempotent helper that ensures git-lfs and clones the V4 reference assets at pinned commits (HF transformers, ROCm TransformerEngine, AMD-AGI/Primus-Turbo, NVIDIA-NeMo/Automodel, plus DeepSeek-V4-Pro / Flash / Flash-Base / Pro-Base) with GIT_LFS_SKIP_SMUDGE=1 so weights are not downloaded by default.

Schedule

  • Block A (extended): 2026-05-01 — plan-2 P12 → P18 commits (636ab3de → b5832672).
  • Block B (delivered): 2026-05-07 — plan-2 P19 (83c33ad0) + plan-2 close-out (dba27163).
  • Deferred follow-ups: P20 (200-step Megatron-bridge convergence + TE on / off perf report + FP8 follow-up plan), P21 (techblog / progress timeline / PPT refresh), P22+ (HF state-dict adapter + V4-Flash safetensors round-trip / token-0 logits ≤ 1e-2 vs HF reference). All three are documented in plan-2/03-phase-details.md; they re-enter active work when the next campaign (release, downstream integration ask, SFT / eval) needs them.

Test plan

  • P1 – P5 unit / smoke coverage from previous commits.
  • P6 / P7 functional smoke: 1 node, 8 GPUs, PP=2, EP=4, BF16, 3 iters.
  • P8 (v2) runtime instantiate / forward validation in dev_primus_wenx_691.
  • P9 (v2) provider-mode A/B validation (local forward, TE build, TE CUDA forward, TE host-input guard).
  • P10 (v2) iteration 10/10 smoke with grouped-expert clamped-SwiGLU guard.
  • P12 plan-2 lockdown: docs-only commit; pre-commit hooks (isort / autoflake / black) pass.
  • P13 — V4-faithful attention class (dense + HCA + CSA in one MLASelfAttention-rooted module); inline-reference numerical alignment for dense (≤ 1e-3) and HCA (≤ 1e-3); CSA shape / finiteness; linear_q_up_proj / linear_o_b Column / Row parallel; pre-commit hooks (isort / autoflake / black) pass.
  • P13 — HF-reference numerical alignment within 1e-3 (CPU fp32) — release gate G2 / G3 — deferred to P22+ alongside the state-dict adapter (originally tracked into P17; reshuffled 2026-05-01).
  • P13 — TP=2 sharding-parity bit-equality vs duplicated baseline — scaffold landed (skipif single-rank); execution deferred to P19.
  • P14 phase-1 — pre-mul clamped SwiGLU activation + V4 routers (learned + hash); G3 (≤ 1e-6 fp32 vs HF reference) + G4 (identical (probs, indices) + gradient flow on gate weight) covered by test_clamped_swiglu.py + test_v4_routers.py; pre-commit hooks (isort / autoflake / black) pass.
  • P14 phase-2 — DeepseekV4MoE -> MegatronModule + provider v4_grouped_mlp_spec(swiglu_limit) / v4_router_spec(learned) + 1L MoE forward within 1e-3 of HF reference (gate G5) — covered by test_v4_moe.py; pre-commit hooks pass.
  • P15 — DeepseekV4HybridLayer -> TransformerLayer + DeepseekV4TransformerBlock -> TransformerBlock; HyperHead only on post_process; _lift_streams_in / _lower_streams_out packing helpers (CPU-only G6 sub-gate covered by test_v4_block_pp.py); token_ids forward-kwarg threading + decoder._v4_token_ids AST audit; pre-commit hooks pass.
  • P15 — distributed PP=1 / 2 / 4 equivalence on 4L V4 toy — gate G6 — deferred to P19.
  • P16 — spec-based MTP via upstream MultiTokenPredictionBlock + process_mtp_loss; get_v4_mtp_block_spec helper; layer forward returns (hidden_states, None); legacy DeepseekV4MTPBlock deprecated; pre-commit hooks pass.
  • P16 — distributed MTP loss appears in train log; mtp_num_layers=0 vs mtp_num_layers=1 ablation matches LM loss to 1e-6 — gate G7 — deferred to P19.
  • P17 — code cleanup: legacy DeepseekV4MTPBlock deleted; v4_use_custom_mtp_block / mtp_compress_ratios config fields removed; three _RMSNorm shadows replaced by shared LocalRMSNorm; yaml comment inversion fixed (4 = CSA / 128 = HCA); package surface refreshed; AST gate G14 green via test_v4_p17_dead_code.py (retired-files check, retired-config-fields check, _v4_token_ids AST scan, _RMSNorm shadow scan, yaml-comment dispatch). dual_rope.py intentionally kept (load-bearing for V4's CSA / HCA dual-base RoPE; no Megatron equivalent — documented in status.md). Pre-commit hooks pass.
  • P18 — spec-system audit: provider singleton via build_context.resolve_v4_provider(config) (D1); provider.v4_mlp_activation_func() returns None when use_te_activation_func=False and TEActivationOp otherwise (D2); compress_ratios normalized to tuple[int, ...] in __post_init__ (D4); new tests/unit_tests/configs/test_deepseek_v4_yaml.py (G1 schema gate) + test_v4_p18_spec_audit.py (D1 / D2 / package surface / TE eager-construction AST audits). Pre-commit hooks pass.
  • P19 — distributed re-validation (G10): smokes A 1×8 PP=1 EP=1, B 1×8 PP=2 EP=4, C 1×8 PP=4 EP=2, D 1×8 PP=2 EP=4 VPP=2 all 10/10 iters on mi355-gpu-12 (BF16, MBS=1 GBS=16, seq=128, 8 layers / 3 hash layers / hc_mult=4); two torch.profiler chrome-trace JSONs (EP=8 and PP=2 EP=4) captured. Two primus-patches landed (pp_tensor_shape + pp_token_pre_broadcast); c10d::allreduce_ autograd warning verified absent in stderr across all smokes + profile runs.
  • [-] P19 — routing-snapshot diff = 0 across PP / EP changes — gate G11. deferred: snapshot dump tooling never landed; not on the pre-training release path. Runtime stability of the P15 / P19 patches is covered by the smokes above.
  • P20 — 200-step Megatron-bridge convergence (±0.05 loss) + TE on/off perf report + FP8 follow-up plan — gates G12 / G13. Deferred follow-up as of 2026-05-07; not on the pre-training release path. Re-enters active work when a release / perf campaign needs it.
  • 50-iter stability run + TP partitioning end-to-end coverage — superseded by the P19 smoke matrix (10 iters × 4 parallelism configurations); a longer stability sweep is bundled into the deferred P20 perf campaign.
  • P22+ (deferred follow-up) — V4-Flash safetensors round-trip + token-0 logits ≤ 1e-2 vs HF reference — gate G8 / G9. Not on the pre-training release path; activate when SFT / evaluation needs HF weights.

Known risk / follow-up

  • EP routed-output path: during bring-up this path used all_reduce and emitted a PyTorch autograd warning (c10d::allreduce_ kernel registration); it was functional for bring-up and gated behind the v4_enable_ep_allreduce_fallback debug toggle on the active path. resolved (P14 phase-2 / P17 audit / P19 runtime verification): P14 phase-2 migrated the routed-output reduction to Megatron's token dispatchers; P17 (e591b893) deleted the v4_enable_ep_allreduce_fallback debug gate; P19 smokes (A/B/C/D + EP=8 / PP=2 EP=4 profile runs on mi355-gpu-12) confirm zero c10d::allreduce warnings in stderr, so the EP routed-output reduction now flows entirely through Megatron's MoEAlltoAllTokenDispatcher / MoEFlexTokenDispatcher.
  • HC × PP — HyperHead per-stage application destroys K-stream context. resolved (P15 + P19): DeepseekV4TransformerBlock packs [B, S, K, D] → [S*K, B, D] for PP P2P via _lower_streams_out and only applies HyperHead on the post_process stage. CPU-side bit-exact roundtrip covered by test_v4_block_pp.py. Runtime stability across PP > 1 verified by P19 smokes B / C / D with the pp_tensor_shape patch; distributed bit-equality across PP = 1 / 2 / 4 (G6) is a separate audit and is not on the pre-training release path (deferred follow-up).
  • decoder._v4_token_ids attribute stash — leaks state across PP and microbatches. resolved (P15): DeepseekV4Model.forward now passes token_ids=input_ids directly to the decoder; AST audit prevents regressions.
  • No state-dict adapter — V4-Flash safetensors cannot be loaded. Deferred to P22+ by the 2026-05-01 reshuffle; pre-training does not need HF weights. Plan-2 P22+ (when activated by an SFT / evaluation campaign) lands the adapter and adds the HF numerical-alignment gate (G8 / G9). Design notes preserved in 02-target-architecture.md §7 + 03-phase-details.md (P22+ section).

Initial design / planning materials for integrating DeepSeek-V4 training
support into Primus. Documentation only; no production code changes.

- techblog/: architecture deep dive (CSA / HCA / mHC / Hash routing /
  sqrtsoftplus / clamped SwiGLU / dual RoPE / Muon / MTP) plus 4 PNG
  diagrams rendered via Pillow (see render_diagrams.py).
- plan/: 8-phase roadmap, full code-landing list, per-phase task
  breakdown, and testing strategy.
- progress/status.md: 64-task checklist tracking phase progress.
- develop_deepseek-v4-in-primus.md: top-level goal and development
  cadence.

Made-with: Cursor
Phase 1 of the V4 development plan. Pure config; no Python code paths
exercised yet. Subsequent phases (P2..P4) wire dispatch and modules.

* primus/configs/models/megatron/deepseek_v4_base.yaml
  Extends llama_base, sets model_type=deepseek_v4 and registers V4-specific
  defaults (hc_mult, hybrid_attention_*, q_lora_rank, attn_sink, hash routing,
  swiglu_limit, dual-RoPE knobs, etc.).
* primus/configs/models/megatron/deepseek_v4_flash.yaml
  Hyperparams from DeepSeek-V4-Flash/config.json.
* primus/configs/models/megatron/deepseek_v4_pro.yaml
  Hyperparams from DeepSeek-V4-Pro/config.json.
* examples/megatron/configs/MI355X/deepseek_v4_flash-BF16-pretrain.yaml
  Training scaffold; parallelism / perf knobs are conservative and will be
  retuned during the perf phase.
* primus/backends/megatron/training/tokenizer/tokenizer.py
  Add DeepSeekV4Tokenizer to CUSTOM_TOKENIZER_TYPES so _add_tokenizer_args
  accepts it.

Note: V4 fields do not need to be registered in Megatron's argparse —
Primus's merge_namespace mechanism (train_runtime.py:_initialize_trainer)
copies yaml-only fields onto backend_args after MegatronArgBuilder.update.
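
As a rough illustration of the note above, copying yaml-only fields onto an already-built argument namespace could look like the following. The function name merge_yaml_only_fields and the toy values are assumptions made for this sketch; it does not reproduce Primus's actual merge_namespace implementation.

```python
from argparse import Namespace


def merge_yaml_only_fields(backend_args: Namespace, yaml_cfg: dict) -> Namespace:
    """Copy keys that exist only in the yaml config onto backend_args.

    Hypothetical stand-in for the behaviour described in the note; fields
    already known to the backend parser are left untouched.
    """
    for key, value in yaml_cfg.items():
        if not hasattr(backend_args, key):
            setattr(backend_args, key, value)
    return backend_args


# Toy values for illustration only.
backend_args = Namespace(num_layers=2, hidden_size=16)
yaml_cfg = {"num_layers": 2, "hc_mult": 4, "attn_sink": True}
merged = merge_yaml_only_fields(backend_args, yaml_cfg)
```

With this shape of merge, V4-specific keys such as hc_mult become readable as attributes without any argparse registration.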

Made-with: Cursor
Phase 2 of the V4 development plan. Wires the end-to-end dispatch from
yaml.model_type=deepseek_v4 to a primus-owned model_provider + builder,
without changing model behaviour yet. The model class is still a thin
GPTModel subclass; Phase 3 swaps the decoder for the V4 transformer block.

* primus/core/utils/import_utils.py
  Add a deepseek_v4 branch to get_model_provider() that imports
  primus.backends.megatron.core.models.deepseek_v4.deepseek_v4_builders
  and returns partial(model_provider, deepseek_v4_builder).

* primus/backends/megatron/megatron_pretrain_trainer.py
  Add a model_type == "deepseek_v4" branch alongside gpt / mamba.
  V4 is a causal-LM with the same data shape as GPT, so we reuse
  pretrain_gpt's forward_step + train_valid_test_datasets_provider;
  only the model_provider itself is V4-specific.

* primus/backends/megatron/core/models/deepseek_v4/__init__.py (new)
  Re-export DeepseekV4Model + deepseek_v4_builder + model_provider.

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_model.py (new)
  DeepseekV4Model: thin subclass of GPTModel. P3 will replace
  self.decoder with DeepseekV4TransformerBlock.

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_builders.py (new)
  deepseek_v4_builder + model_provider. Uses GPT layer specs in P2;
  P3 will swap them for V4 specs.

Made-with: Cursor
Phase 3 of the V4 development plan. Lands the V4 layer-spec helpers and a
transparent V4 transformer-block subclass; attention / MLP behaviour still
matches GPT. Phase 4 will plug HC + hybrid attention into the block, and
Phase 5 will swap in V4 MoE / clamped SwiGLU through the spec-resolution
hooks added here.

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_layer_specs.py (new)
  Four V4 layer-spec helpers (layer / decoder_block / decoder_layer_specs /
  mtp_block) that delegate to the GPT helpers in P3, plus two resolution
  hooks (_resolve_attention_module_spec / _resolve_mlp_module_spec) that
  return None for now -- P4 / P5 fill these in.

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py (new)
  DeepseekV4TransformerBlock: subclasses TransformerBlock and stashes V4
  config fields (hc_mult, compress_ratios, attn_sliding_window, attn_sink,
  q_lora_rank, index_*) onto self so P4 patches don't have to re-walk the
  config. Forward behaviour unchanged in P3.

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_model.py
  Override __init__: after super().__init__() builds the stock decoder,
  swap self.decoder for DeepseekV4TransformerBlock (same call signature
  so GPTModel.forward keeps working).

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_builders.py
  _resolve_layer_spec / _resolve_mtp_block_spec now route through the
  V4 layer-spec helpers instead of the GPT helpers directly.

* primus/backends/megatron/core/models/deepseek_v4/__init__.py
  Re-export DeepseekV4TransformerBlock alongside the existing surface.

Made-with: Cursor
…dual-RoPE)

Phase 4 of the V4 development plan. Lands the full V4 transformer block:
mHC multi-stream residual, per-layer hybrid attention dispatch (Dense /
HCA / CSA), sliding-window mask, attention sink, dual-RoPE with YaRN. The
V4 block becomes a standalone nn.Module that bypasses Megatron's
TransformerBlock + ModuleSpec mechanism so the multi-stream HC loop is
expressed cleanly. P5 will swap the placeholder SwiGLU MLP for V4's MoE.

New modules under primus/backends/megatron/core/transformer/ ::

* hyper_connection.py
  HyperMixer (per-layer mHC mixer), HyperHead (final K->1 collapse),
  sinkhorn_normalize (doubly-stochastic projection). Linear weights /
  scales / biases held in fp32 for stability; fp32 sinkhorn iterates.
  Unit-tested: row/col errors ~1e-6, hc_mult=1 degenerate path exact.

* compressor.py
  V4 compressor for KV downsampling. ratio=4 overlap mode (CSA, coff=2),
  ratio=128 non-overlap mode (HCA, coff=1). Internal RMSNorm + learnable
  APE; RoPE applied externally.

* indexer.py
  Sparse top-K position selector for CSA. Internal mini-Compressor builds
  the score grid; causal mask + top-K (-1 fill for invalid positions);
  backward propagates to the indexer params.

* sliding_window_kv.py
  Causal SWA mask + per-query KV index helpers.

* attn_sink.py
  Per-head learnable sink scalar; softmax_with_sink ensures probs.sum() <=
  1 with the sink absorbing the residual mass. Backward propagates to the
  sink params.

* dual_rope.py
  Two RoPE bases (main + compress) with optional YaRN scaling. Partial
  interleaved RoPE: only ``rotary_dim`` of each head's channels rotated;
  remaining channels passed through unchanged.

* deepseek_v4_attention.py
  Shared base for V4 attention: QKV projection (optional Q LoRA),
  partial dual-RoPE, SWA mask, attention sink, output projection.
  ``_extra_kv`` hook lets HCA / CSA augment KV (full pool or sparse top-K).

* hca_attention.py
  Heavily-Compressed Attention. Subclasses DeepseekV4Attention; adds a
  non-overlap Compressor and concatenates the full compressed pool to
  the local KV (always visible).

* csa_attention.py
  Compressed-Sparse Attention. Subclasses DeepseekV4Attention; adds an
  overlap Compressor + Indexer; per-query attention is computed over the
  local SWA + the indexer's top-K compressed positions.
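
The attention-sink behaviour listed for attn_sink.py (probabilities summing to at most 1, with the sink absorbing the rest) can be sketched for a single query row in plain Python. This is a simplified assumption of how such a softmax works, with the sink treated as one extra logit that is dropped after normalisation; the real module operates on per-head tensors.

```python
import math
from typing import List


def softmax_with_sink(scores: List[float], sink_logit: float) -> List[float]:
    """Softmax over scores plus one sink logit; the sink column is discarded."""
    peak = max(max(scores), sink_logit)          # subtract the max for stability
    exp_scores = [math.exp(s - peak) for s in scores]
    exp_sink = math.exp(sink_logit - peak)
    denom = sum(exp_scores) + exp_sink           # sink takes part in the denominator
    return [e / denom for e in exp_scores]       # ...but contributes no output mass


probs = softmax_with_sink([2.0, 1.0, 0.5], sink_logit=0.0)
sink_mass = 1.0 - sum(probs)                     # mass absorbed by the sink
```

A very negative sink logit recovers the ordinary softmax, so the sink can only remove probability mass, never add it.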

Updated:

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py
  Rewritten as a standalone nn.Module. Holds the dual-RoPE for the whole
  stack, builds DeepseekV4HybridLayer per layer (Dense/HCA/CSA picked
  from compress_ratios), and runs the K-stream HC loop. Forward shape:
  [S, B, D] -> [B, S, D] -> [B, S, K, D] -> ... -> [B, S, D] -> [S, B, D].
  Smoke-tested: 8-layer mixed dense/CSA/HCA + hc_mult=4 forward / backward
  / causality OK.

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_layer_specs.py
  Cleaned up to a placeholder spec. The V4 block is standalone and
  bypasses Megatron's spec mechanism; we still hand a valid GPT-shaped
  spec to GPTModel.__init__ until P6 refactors that allocation away.

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_model.py
  Docstring rewritten for the P4 standalone-block layout; pg_collection
  switched to getattr(self, "pg_collection", None) for safety.

* deepseek-v4/develop/progress/status.md, plan/02-phase-details.md
  Track P1..P4 completion; add the argparse-not-needed note (Primus's
  merge_namespace covers V4 fields).
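
For readers unfamiliar with the doubly-stochastic projection mentioned for hyper_connection.py, a minimal Sinkhorn iteration is sketched below in plain Python. It assumes a strictly positive square input and a fixed iteration count; the PR's sinkhorn_normalize runs on fp32 tensors and may differ in parameterisation.

```python
from typing import List


def sinkhorn_normalize(matrix: List[List[float]], n_iters: int = 50,
                       eps: float = 1e-12) -> List[List[float]]:
    """Alternately normalise rows and columns of a positive square matrix."""
    m = [row[:] for row in matrix]               # work on a copy
    n = len(m)
    for _ in range(n_iters):
        for i in range(n):                       # row step: each row sums to 1
            s = sum(m[i]) + eps
            m[i] = [v / s for v in m[i]]
        for j in range(n):                       # column step: each column sums to 1
            s = sum(m[i][j] for i in range(n)) + eps
            for i in range(n):
                m[i][j] /= s
    return m


mixed = sinkhorn_normalize([[1.0, 2.0, 3.0], [4.0, 1.0, 1.0], [2.0, 2.0, 5.0]])
row_err = max(abs(sum(r) - 1.0) for r in mixed)
col_err = max(abs(sum(mixed[i][j] for i in range(3)) - 1.0) for j in range(3))
```

Both error terms shrink geometrically with the number of iterations, which is consistent with the small row/column errors reported by the unit test.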

Made-with: Cursor
Copilot AI review requested due to automatic review settings April 28, 2026 11:01

Copilot AI left a comment

Pull request overview

Adds a new model_type=deepseek_v4 to Primus’ Megatron backend, including V4 configs, model/provider dispatch, and an initial DeepSeek-V4 block implementation with HC + hybrid attention building blocks.

Changes:

  • Add DeepSeek-V4 model dispatch + builders and a Primus-owned V4 model package.
  • Introduce V4 config yamls (base/flash/pro) and a MI355X pretrain scaffold yaml.
  • Implement core V4 transformer components (HC, dual-RoPE, compressor, indexer, CSA/HCA attention, sliding-window helpers, attention sink).

Reviewed changes

Copilot reviewed 33 out of 37 changed files in this pull request and generated 10 comments.

File | Description
primus/core/utils/import_utils.py | Adds deepseek_v4 branch to resolve the V4 model provider/builder.
primus/backends/megatron/megatron_pretrain_trainer.py | Dispatches model_type=deepseek_v4 while reusing GPT data/forward_step plumbing.
primus/backends/megatron/training/tokenizer/tokenizer.py | Allows selecting DeepSeekV4Tokenizer via HF tokenizer wrapper.
primus/configs/models/megatron/deepseek_v4_{base,flash,pro}.yaml | Adds V4 model configs and defaults.
examples/megatron/configs/MI355X/deepseek_v4_flash-BF16-pretrain.yaml | Adds a training scaffold yaml for MI355X.
primus/backends/megatron/core/models/deepseek_v4/* | Adds V4 model/builders/spec placeholders and a standalone V4 block implementation.
primus/backends/megatron/core/transformer/* | Implements HC, dual-RoPE, compressor/indexer, CSA/HCA attention, SWA helpers, and attention sink.
deepseek-v4/develop/** | Adds development docs/diagrams and planning materials for the V4 integration.

Comment on lines +34 to +38
# Per-layer compression schedule (from config.json:compress_ratios)
# 0 = uncompressed dense layer (full attention with SWA)
# 4 = HCA branch (compress ratio 4)
# 128 = CSA branch (compress ratio 128)
compress_ratios: "[0, 0, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 0]"

Copilot AI Apr 28, 2026

compress_ratios is currently a quoted string, so YAML will parse it as str rather than a list of ints. DeepseekV4TransformerBlock.__init__ does list(compress_ratios) and checks len(...) == num_layers, so this will either explode into a list of characters or fail the length check at runtime. Define this as a real YAML list (no quotes) or normalize the string to List[int] before the block consumes it; also ensure the list length matches num_layers (43).
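
One way to implement the normalization step suggested in this comment is sketched here. The helper name normalize_compress_ratios and the short six-layer schedule are hypothetical; the ratio-to-branch mapping (0 = dense, 4 = CSA, 128 = HCA) follows the corrected mapping given in the review comments.

```python
import ast
from typing import List, Sequence, Union

# Mapping taken from the review discussion: 4 -> CSA, 128 -> HCA, 0 -> dense.
LAYER_KIND = {0: "dense", 4: "csa", 128: "hca"}


def normalize_compress_ratios(value: Union[str, Sequence[int]],
                              num_layers: int) -> List[int]:
    """Accept a YAML list or a quoted "[...]" string and return List[int]."""
    if isinstance(value, str):
        value = ast.literal_eval(value)          # safe parse of the quoted list
    ratios = [int(r) for r in value]
    if len(ratios) != num_layers:
        raise ValueError(
            f"compress_ratios has {len(ratios)} entries, expected {num_layers}"
        )
    return ratios


# What list() does to the quoted form: one element per character.
char_list_len = len(list("[0, 4]"))

ratios = normalize_compress_ratios("[0, 0, 4, 128, 4, 0]", num_layers=6)
kinds = [LAYER_KIND[r] for r in ratios]
```

Writing the schedule as a real YAML list avoids the parse step entirely; the helper is only useful if quoted strings must remain supported.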

Copilot uses AI. Check for mistakes.
Comment on lines +34 to +37
# Per-layer compression schedule (from config.json:compress_ratios)
# 0 = uncompressed dense layer (full attention with SWA)
# 4 = HCA branch (compress ratio 4)
# 128 = CSA branch (compress ratio 128)

Copilot AI Apr 28, 2026

The per-layer schedule comments invert CSA vs HCA: per the V4 design and the rest of this PR, compress_ratio == 4 is CSA and compress_ratio == 128 is HCA. Please fix the comment mapping so it matches the implementation.

Comment on lines +3 to +5
#
# Source: deepseek-v4/deepseek-ai/DeeSeek-v4-Pro/config.json
###############################################################################

Copilot AI Apr 28, 2026

Typo in the referenced source path (DeeSeek-v4-Pro). If this path is meant to mirror the repo directory (DeepSeek-V4-Pro), please correct it to avoid confusion when cross-referencing configs.

Comment on lines +237 to +261
# Sliding-window mask.
window = self.attn_sliding_window
local_mask = sliding_window_causal_mask(S, window, device=device, dtype=dtype) # [S, S]

# Subclass hook: extra K/V (compressed pool, sparse top-K, etc.).
# Subclass should return tensors already broadcast to [B, S_extra, H, head_dim]
# so they can be cat'd along the Sk axis.
extra_k, extra_v, extra_mask = self._extra_kv(hidden, position_ids, q)

# Concatenate sliding-window KV with extra KV (if any).
if extra_k is not None:
    k_full = torch.cat([k_local_h, extra_k], dim=1) # [B, Sk_total, H, head_dim]
    v_full = torch.cat([v_local_h, extra_v], dim=1)
    full_mask = torch.cat([local_mask, extra_mask], dim=-1) # [Sq, Sk_total]
else:
    k_full = k_local_h
    v_full = v_local_h
    full_mask = local_mask

# Move heads dim before sequence: [B, S, H, head_dim] -> [B, H, S, head_dim]
q_bh = q.transpose(1, 2)
k_bh = k_full.transpose(1, 2)
v_bh = v_full.transpose(1, 2)

out_bh = self._compute_attention_output(q_bh, k_bh, v_bh, full_mask)

Copilot AI Apr 28, 2026

sliding_window_causal_mask creates a [S, S] mask, but the attention still computes q @ k^T over all S keys (k_local_h is length S). For realistic training lengths (e.g. 4096), this becomes quadratic memory/compute and is very likely to OOM, even though the model is conceptually sliding-window. Consider actually restricting K/V to the window (e.g. gather with sliding_window_kv_indices, unfold, or use a kernel/backend that supports causal sliding-window attention) so Sk_local is window rather than S.

Suggested change
- # Sliding-window mask.
- window = self.attn_sliding_window
- local_mask = sliding_window_causal_mask(S, window, device=device, dtype=dtype) # [S, S]
- # Subclass hook: extra K/V (compressed pool, sparse top-K, etc.).
- # Subclass should return tensors already broadcast to [B, S_extra, H, head_dim]
- # so they can be cat'd along the Sk axis.
- extra_k, extra_v, extra_mask = self._extra_kv(hidden, position_ids, q)
- # Concatenate sliding-window KV with extra KV (if any).
- if extra_k is not None:
-     k_full = torch.cat([k_local_h, extra_k], dim=1) # [B, Sk_total, H, head_dim]
-     v_full = torch.cat([v_local_h, extra_v], dim=1)
-     full_mask = torch.cat([local_mask, extra_mask], dim=-1) # [Sq, Sk_total]
- else:
-     k_full = k_local_h
-     v_full = v_local_h
-     full_mask = local_mask
- # Move heads dim before sequence: [B, S, H, head_dim] -> [B, H, S, head_dim]
- q_bh = q.transpose(1, 2)
- k_bh = k_full.transpose(1, 2)
- v_bh = v_full.transpose(1, 2)
- out_bh = self._compute_attention_output(q_bh, k_bh, v_bh, full_mask)
+ # Materialize only the causal sliding-window K/V for each query position
+ # so local attention scales with `window` rather than the full sequence `S`.
+ window = self.attn_sliding_window
+ window = min(window, S)
+ # Build per-query local indices: for query i attend to [i - window + 1, ..., i].
+ query_positions = torch.arange(S, device=device)
+ window_offsets = torch.arange(window, device=device)
+ local_indices = query_positions.unsqueeze(1) - (window - 1) + window_offsets.unsqueeze(0) # [S, window]
+ local_valid = local_indices >= 0
+ local_indices = local_indices.clamp_(min=0, max=S - 1)
+ # Gather local K/V windows: [B, S, H, D] -> [B, S, window, H, D].
+ gather_index = local_indices.view(1, S, window, 1, 1).expand(
+     B, S, window, self.num_heads, self.head_dim
+ )
+ k_local = torch.gather(
+     k_local_h.unsqueeze(2).expand(B, S, window, self.num_heads, self.head_dim),
+     1,
+     gather_index,
+ )
+ v_local = torch.gather(
+     v_local_h.unsqueeze(2).expand(B, S, window, self.num_heads, self.head_dim),
+     1,
+     gather_index,
+ )
+ # Subclass hook: extra K/V (compressed pool, sparse top-K, etc.).
+ # Subclass should return tensors already broadcast to [B, S_extra, H, head_dim].
+ extra_k, extra_v, extra_mask = self._extra_kv(hidden, position_ids, q)
+ # Move heads dim before sequence for local attention:
+ # q: [B, S, H, D] -> [B, H, S, D]
+ # local k/v: [B, S, window, H, D] -> [B, H, S, window, D]
+ q_bh = q.transpose(1, 2)
+ k_local_bh = k_local.permute(0, 3, 1, 2, 4)
+ v_local_bh = v_local.permute(0, 3, 1, 2, 4)
+ scale = self.head_dim ** -0.5
+ local_scores = (q_bh.unsqueeze(-2) * k_local_bh).sum(dim=-1) * scale # [B, H, S, window]
+ local_scores = local_scores.masked_fill(
+     ~local_valid.view(1, 1, S, window), torch.finfo(local_scores.dtype).min
+ )
+ if extra_k is not None:
+     extra_k_bh = extra_k.transpose(1, 2) # [B, H, S_extra, D]
+     extra_v_bh = extra_v.transpose(1, 2) # [B, H, S_extra, D]
+     extra_scores = torch.einsum("bhsd,bhkd->bhsk", q_bh, extra_k_bh) * scale
+     if extra_mask is not None:
+         if extra_mask.dtype == torch.bool:
+             extra_scores = extra_scores.masked_fill(
+                 ~extra_mask.view(1, 1, S, -1), torch.finfo(extra_scores.dtype).min
+             )
+         else:
+             extra_scores = extra_scores + extra_mask.view(1, 1, S, -1).to(extra_scores.dtype)
+     attn_scores = torch.cat([local_scores, extra_scores], dim=-1)
+     attn_probs = torch.softmax(attn_scores.float(), dim=-1).to(q_bh.dtype)
+     local_probs = attn_probs[..., :window]
+     extra_probs = attn_probs[..., window:]
+     out_local = (local_probs.unsqueeze(-1) * v_local_bh).sum(dim=-2)
+     out_extra = torch.einsum("bhsk,bhkd->bhsd", extra_probs, extra_v_bh)
+     out_bh = out_local + out_extra
+ else:
+     attn_probs = torch.softmax(local_scores.float(), dim=-1).to(q_bh.dtype)
+     out_bh = (attn_probs.unsqueeze(-1) * v_local_bh).sum(dim=-2)

Comment on lines +92 to +94
) -> torch.Tensor:
"""Gather ``[B, P, head_dim]`` along ``P`` per query → ``[B, S, K, head_dim]``.

Copilot AI Apr 28, 2026

_gather_topk_kv is annotated as returning torch.Tensor, but it actually returns (gathered, valid). This will confuse type-checkers and readers; update the return annotation (and docstring if needed) to reflect the tuple return type.

Suggested change
- ) -> torch.Tensor:
-     """Gather ``[B, P, head_dim]`` along ``P`` per query → ``[B, S, K, head_dim]``.
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
+     """Gather ``[B, P, head_dim]`` along ``P`` per query.
+     Returns:
+         A tuple ``(gathered, valid)`` where:
+         - ``gathered`` has shape ``[B, S, K, head_dim]``.
+         - ``valid`` has shape ``[B, S, K]`` and marks non-masked indices.

gathered, valid = self._gather_topk_kv(pool_kv, topk_idxs) # [B, S, K, head_dim]

# 5) Stash for ``_compute_attention_output`` to consume.
gathered.shape[2]

Copilot AI Apr 28, 2026

This statement has no effect (gathered.shape[2] is computed and discarded). It looks like a leftover debug line; please remove it to keep the CSA path clean.

Suggested change
- gathered.shape[2]

Comment on lines +10 to +30
num_layers: 61
hidden_size: 7168
num_attention_heads: 128
num_query_groups: 1
kv_channels: 512
qk_pos_emb_head_dim: 64
ffn_hidden_size: 18432
moe_ffn_hidden_size: 3072
moe_shared_expert_intermediate_size: 3072

q_lora_rank: 1536
o_lora_rank: 1024
o_groups: 16

num_experts: 384
moe_router_topk: 6
moe_router_topk_scaling_factor: 2.5

index_topk: 1024

compress_ratios: "[128, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 0]"

Copilot AI Apr 28, 2026

Same issue as Flash: compress_ratios is a quoted string, which will not deserialize to Sequence[int] and will break DeepseekV4TransformerBlock's len(compress_ratios) == num_layers check. Please make this a real YAML list (or add a normalization step) and verify the schedule length matches num_layers (61).

Comment on lines +4 to +7
# Reference:
# - deepseek-v4/deepseek-ai/DeepSeek-V4-Flash/config.json
# - deepseek-v4/deepseek-ai/DeeSeek-v4-Pro/config.json
# - deepseek-v4/develop/techblog/01-deepseek-v4-architecture-deep-dive.md

Copilot AI Apr 28, 2026

Typo in the reference path (DeeSeek-v4-Pro). Please correct the spelling/casing so the comment points at the actual directory name and is searchable.

# local v we have [B, H, Sk_local, head_dim] (independent of S),
# while sparse v depends on S. Build a "value tensor" with the
# same shape on both paths by broadcasting local v:
v.shape[2]

Copilot AI Apr 28, 2026

This statement has no effect (v.shape[2] is computed and discarded). Please remove it; it reads like a debug remnant and makes the attention path harder to audit.

Suggested change
- v.shape[2]

Comment on lines +16 to +24
FONT_REG = "/home/xiewen12/.local/share/fonts/NotoSansSC-Regular.otf"
FONT_BOLD = FONT_REG # we only have Regular; use it for both

OUT_DIR = os.path.join(os.path.dirname(__file__), "diagrams")
os.makedirs(OUT_DIR, exist_ok=True)


def font(sz: int, bold: bool = False) -> ImageFont.FreeTypeFont:
    return ImageFont.truetype(FONT_BOLD if bold else FONT_REG, sz)

Copilot AI Apr 28, 2026

FONT_REG is hard-coded to an absolute path under a specific user's home directory, which will fail for other developers/CI. Consider using a repo-relative font path, allowing an environment variable override, and/or falling back to a default font when the file isn't present.

Suggested change
- FONT_REG = "/home/xiewen12/.local/share/fonts/NotoSansSC-Regular.otf"
- FONT_BOLD = FONT_REG # we only have Regular; use it for both
- OUT_DIR = os.path.join(os.path.dirname(__file__), "diagrams")
- os.makedirs(OUT_DIR, exist_ok=True)
- def font(sz: int, bold: bool = False) -> ImageFont.FreeTypeFont:
-     return ImageFont.truetype(FONT_BOLD if bold else FONT_REG, sz)
+ BASE_DIR = os.path.dirname(__file__)
+ FONT_CANDIDATES = (
+     os.environ.get("DIAGRAM_FONT"),
+     os.environ.get("FONT_REG"),
+     os.path.join(BASE_DIR, "NotoSansSC-Regular.otf"),
+     os.path.join(BASE_DIR, "fonts", "NotoSansSC-Regular.otf"),
+ )
+ def _resolve_font_path() -> str | None:
+     for path in FONT_CANDIDATES:
+         if path and os.path.isfile(path):
+             return path
+     return None
+ FONT_REG = _resolve_font_path()
+ FONT_BOLD = FONT_REG # we only have Regular; use it for both when available
+ OUT_DIR = os.path.join(BASE_DIR, "diagrams")
+ os.makedirs(OUT_DIR, exist_ok=True)
+ def font(sz: int, bold: bool = False) -> ImageFont.ImageFont | ImageFont.FreeTypeFont:
+     font_path = FONT_BOLD if bold else FONT_REG
+     if font_path:
+         return ImageFont.truetype(font_path, sz)
+     return ImageFont.load_default()

…+ MTP

Phase 5 of the V4 development plan. Lands the FFN side of the V4 stack:
hash-routed and learned top-K MoE, clamped SwiGLU experts, and the V4
MTP head. The V4 block now plugs the V4 MoE in place of P4's placeholder
SwiGLU FFN; the V4 model instantiates a separate-HyperHead MTP block when
mtp_num_layers > 0. Layer-aware YaRN was already done in P4
(DualRoPE.get_rope picks main_rope vs compress_rope by compress_ratio).

New modules:

* primus/backends/megatron/core/transformer/clamped_swiglu.py
  clamped_swiglu(x, alpha=7.0): silu(gate)*up clamped to [-alpha, alpha].
  ClampedSwiGLUMLP wraps it as a fused gate_up + down two-linear MLP.
  Eager (Python) for v1; perf phase will register a fused kernel.

* primus/backends/megatron/core/transformer/moe/v4_hash_router.py
  HashRouter: static [vocab_size, topk] tid2eid table from a fixed seed.
  Active for the first num_hash_layers V4 layers; gives each token a
  permanent expert assignment with uniform weight 1/topk. No learnable
  parameters; deterministic across PP / TP / EP ranks.

* primus/backends/megatron/core/transformer/moe/v4_topk_router.py
  V4TopKRouter: learned gate with score_function in
  {"sqrtsoftplus", "sigmoid", "softmax"}. Top-K with optional renorm
  and optional noaux_tc per-expert bias (selection-only; probs are
  read from the un-biased score).

* primus/backends/megatron/core/transformer/moe/v4_moe.py
  DeepseekV4MoE: per-layer router pick (hash vs learned) + N
  ClampedSwiGLUMLP routed experts + 1 shared expert. Pure-PyTorch
  per-expert dispatch; P6 swaps in Megatron's token-dispatcher /
  grouped-GEMM / EP path.

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_mtp.py
  DeepseekV4MTPBlock: mtp_num_layers V4 layers, each owning its own
  HyperHead (separate from the main decoder's). Shares the dual-RoPE
  with the main decoder. Loss-side wiring is deferred to P6; P5 just
  stands the module up so it can be unit-tested standalone.
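
To make the clamped SwiGLU description above concrete, here is a scalar sketch in plain Python. It takes the commit message literally (the product silu(gate) * up is clamped to [-alpha, alpha]); whether the real kernel clamps the product or the gate and up inputs separately is not stated here, so treat that choice as an assumption.

```python
import math


def silu(x: float) -> float:
    """SiLU / swish activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def clamped_swiglu(gate: float, up: float, alpha: float = 7.0) -> float:
    """silu(gate) * up, clamped to [-alpha, alpha] (assumed reading of the commit)."""
    out = silu(gate) * up
    return max(-alpha, min(alpha, out))


small = clamped_swiglu(1.0, 2.0)      # well inside the clamp range, unchanged
high = clamped_swiglu(10.0, 10.0)     # saturates at +alpha
low = clamped_swiglu(10.0, -10.0)     # saturates at -alpha
```

The clamp bounds the expert activation magnitude, which is presumably why the smoke test checks that the clamp is "tight".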

Updated:

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py
  DeepseekV4HybridLayer now picks MoE vs dense FFN based on
  num_routed_experts. forward() threads token_ids through to the MoE
  for hash-routed layers. The block-level forward picks token_ids up
  from a model-side stash (_v4_token_ids) so callers don't have to
  thread it explicitly through every layer of the call stack.

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_model.py
  Builds DeepseekV4MTPBlock when mtp_num_layers > 0 (post-process
  rank only). forward() overridden to stash input_ids onto self.decoder
  before delegating to GPTModel.forward, so hash-routed MoE layers can
  consume them. Cross-PP propagation of input_ids is a P6 concern.

* primus/backends/megatron/core/models/deepseek_v4/__init__.py
  Re-export DeepseekV4MTPBlock alongside the existing surface.
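
The selection-only bias described for V4TopKRouter (bias affects which experts are picked, while the returned probabilities come from the un-biased score) can be sketched as follows. The function names are hypothetical, and sqrtsoftplus is assumed from its name to mean sqrt(softplus(x)).

```python
import math
from typing import List, Tuple


def sqrtsoftplus(x: float) -> float:
    """Assumed definition: square root of softplus(x) = log(1 + exp(x))."""
    return math.sqrt(math.log1p(math.exp(x)))


def route(logits: List[float], bias: List[float], topk: int,
          renorm: bool = True) -> Tuple[List[int], List[float]]:
    scores = [sqrtsoftplus(v) for v in logits]
    # The per-expert bias only influences the ranking used for selection.
    biased = [s + b for s, b in zip(scores, bias)]
    chosen = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:topk]
    # Routing weights are read from the un-biased scores.
    weights = [scores[i] for i in chosen]
    if renorm:
        total = sum(weights)
        weights = [w / total for w in weights]
    return chosen, weights


experts, weights = route([2.0, 0.1, 1.0, -1.0], bias=[0.0, 5.0, 0.0, 0.0], topk=2)
```

In this toy call the large bias pulls expert 1 into the top-2, yet its routing weight stays below expert 0's because the weight ignores the bias.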

Smoke-tested on dev-box PyTorch container (CPU, 7-test suite):
* clamped_swiglu: clamp tight; MLP forward+backward OK.
* HashRouter: per-token top-K distinct, deterministic across re-runs and
  re-instantiations w/ same seed, probs sum to 1.
* V4TopKRouter: top-K honored, renorm OK, backward OK for all three
  score functions (sqrtsoftplus, sigmoid, softmax).
* DeepseekV4MoE (learned & hash modes): forward + backward; same-token
  determinism for hash routing.
* DeepseekV4TransformerBlock with MoE FFN (4 layers, hc_mult=2, mixed
  dense + CSA): forward + backward; deterministic in eval mode.
* DeepseekV4MTPBlock (mtp_num_layers=2, hc_mult=2): forward + backward;
  per-MTP HyperHead state_dict separation verified.

Deferred to P6 (already noted in progress doc):
* Real Megatron-MoE / token-dispatcher / EP integration -- replaces the
  pure-PyTorch dispatch loop in DeepseekV4MoE.forward.
* MTP loss path wiring -- DeepseekV4Model.forward currently builds the
  MTP block but does not yet feed its outputs through lm_head + the
  auxiliary loss term.
* Numerical alignment vs reference inference/model.py (token-0 logits
  within 1e-2) -- needs reference checkpoint loading.

Made-with: Cursor
Copilot AI review requested due to automatic review settings April 28, 2026 11:30
@wenxie-amd wenxie-amd force-pushed the dev/wenx/deepseek-v4 branch from b8e47a3 to 5e4008d Compare April 28, 2026 11:30

Copilot AI left a comment

Pull request overview

Copilot reviewed 38 out of 42 changed files in this pull request and generated no new comments.

Wire DeepSeek-V4 through Megatron P6 integration (PP local-layer build, EP expert sharding, and compatibility fixes) and add the P7 single-node launcher plus progress docs after passing PP=2/EP=4 smoke run.

Made-with: Cursor
Copilot AI review requested due to automatic review settings April 29, 2026 12:25
Add the plan-1 roadmap/detail/test documentation plus progress tracker entries, and update the development target doc with TransformerEngine and Primus-Turbo reference pointers.

Made-with: Cursor
@wenxie-amd wenxie-amd force-pushed the dev/wenx/deepseek-v4 branch from ecf8169 to 1030293 Compare April 29, 2026 12:28

Copilot AI left a comment

Pull request overview

Copilot reviewed 44 out of 48 changed files in this pull request and generated 8 comments.

Comment on lines +85 to +93
gen = torch.Generator(device="cpu").manual_seed(int(seed))
# For each token id, pick ``topk`` distinct expert ids deterministically.
# randperm(num_experts) is a stable, dense permutation; slicing the
# first ``topk`` rows gives uniform-without-replacement routing.
rows = []
for _ in range(vocab_size):
    perm = torch.randperm(num_experts, generator=gen)[:topk]
    rows.append(perm)
tid2eid = torch.stack(rows, dim=0).long() # [vocab_size, topk]

Copilot AI Apr 29, 2026

HashRouter.__init__ builds tid2eid by looping over every vocab_size entry and calling torch.randperm(num_experts) each time. For real V4 sizes (e.g., vocab≈129k, experts≈384), this will add significant startup time and CPU memory churn at model construction. Consider replacing this with a deterministic hash-based mapping (no table), or generating the table in larger vectorized blocks (and/or only for the subset of vocab used), so model init remains scalable.
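
A block-wise vectorized construction along the lines suggested in this comment might look like the sketch below. build_tid2eid and its block parameter are hypothetical names; the idea is that the top-k indices of i.i.d. uniform keys form a uniform sample without replacement. Note that this draws different random numbers than the randperm loop, so it would produce a different table for the same seed.

```python
import torch


def build_tid2eid(vocab_size: int, num_experts: int, topk: int,
                  seed: int, block: int = 8192) -> torch.Tensor:
    """Return a [vocab_size, topk] table of distinct expert ids per token id."""
    gen = torch.Generator(device="cpu").manual_seed(int(seed))
    blocks = []
    for start in range(0, vocab_size, block):
        rows = min(block, vocab_size - start)
        # One random key per (token, expert); top-k positions are distinct by construction.
        keys = torch.rand(rows, num_experts, generator=gen)
        blocks.append(keys.topk(topk, dim=-1).indices)
    return torch.cat(blocks, dim=0).long()


table = build_tid2eid(vocab_size=1000, num_experts=16, topk=4, seed=0)
table_again = build_tid2eid(vocab_size=1000, num_experts=16, topk=4, seed=0)
```

Processing the vocabulary in blocks keeps the temporary key matrix bounded while replacing vocab_size Python-level randperm calls with a handful of tensor operations.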

Comment on lines +88 to +107
def _gather_topk_kv(
    self,
    pool: torch.Tensor, # [B, P, head_dim]
    topk_idxs: torch.Tensor, # [B, S, K] (-1 for masked)
) -> torch.Tensor:
    """Gather ``[B, P, head_dim]`` along ``P`` per query → ``[B, S, K, head_dim]``.

    Out-of-range / masked indices (``-1``) are clamped to ``0`` for the
    gather, then *zero-masked* afterwards.
    """
    B, S, K = topk_idxs.shape
    P, Hd = pool.shape[1], pool.shape[2]
    valid = topk_idxs >= 0 # [B, S, K]
    safe_idx = topk_idxs.clamp(min=0)
    # Expand idx to gather along P for each (B, S, K, Hd).
    idx_expand = safe_idx.unsqueeze(-1).expand(B, S, K, Hd)
    pool_expand = pool.unsqueeze(1).expand(B, S, P, Hd) # [B, S, P, Hd]
    gathered = torch.gather(pool_expand, dim=2, index=idx_expand) # [B, S, K, Hd]
    gathered = gathered * valid.unsqueeze(-1).to(gathered.dtype)
    return gathered, valid

Copilot AI Apr 29, 2026

_gather_topk_kv is annotated as returning only a torch.Tensor, but it actually returns (gathered, valid). This mismatch can break type checking and mislead callers; update the return annotation (and docstring if desired) to Tuple[torch.Tensor, torch.Tensor].

in_dtype = x.dtype
x32 = x.float()
rsqrt = torch.rsqrt(x32.pow(2).mean(dim=-1, keepdim=True) + self.eps)
return (x32 * rsqrt).to(in_dtype) * self.weight

Copilot AI Apr 29, 2026

The standalone RMSNorm implementation returns (…to(in_dtype) * self.weight). If self.weight remains fp32 (common in mixed-precision training), this multiplication will upcast the output back to fp32, potentially defeating BF16 activation flow and increasing memory/compute. Consider multiplying by self.weight.to(in_dtype) (or casting the final result back to in_dtype) so the output dtype stays consistent with the input activation dtype.

Suggested change
return (x32 * rsqrt).to(in_dtype) * self.weight
return (x32 * rsqrt).to(in_dtype) * self.weight.to(in_dtype)
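
The promotion described in this comment can be reproduced in isolation. The following is a minimal sketch, assuming PyTorch is installed; the class name `RMSNormSketch` and the `cast_weight` switch are hypothetical and exist only to compare the two variants side by side, they are not part of the PR.

```python
import torch
from torch import nn


class RMSNormSketch(nn.Module):
    """Illustrative RMSNorm; the class name and cast_weight flag are hypothetical."""

    def __init__(self, dim: int, eps: float = 1e-6, cast_weight: bool = True):
        super().__init__()
        self.eps = eps
        self.cast_weight = cast_weight
        # The weight is created in fp32, as is common in mixed-precision training.
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        in_dtype = x.dtype
        x32 = x.float()  # compute the mean of squares in fp32
        rsqrt = torch.rsqrt(x32.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        normed = (x32 * rsqrt).to(in_dtype)
        # Without the cast, bf16 * fp32 promotes the product to fp32.
        weight = self.weight.to(in_dtype) if self.cast_weight else self.weight
        return normed * weight


x = torch.randn(2, 4, 8, dtype=torch.bfloat16)
out_uncast = RMSNormSketch(8, cast_weight=False)(x)  # fp32 output
out_cast = RMSNormSketch(8, cast_weight=True)(x)     # bf16 output
```

With an fp32 weight, only the variant that casts the weight keeps the output in the activation dtype.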

Comment on lines +219 to +221
flat = hidden.reshape(-1, D) # [N, D]
flat.shape[0]

Copilot AI Apr 29, 2026

This flat.shape[0] statement is a no-op and appears to be leftover debug code. Please remove it to keep the forward path minimal and lint-clean.
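
No-op expression statements like this one can be found mechanically. The sketch below uses only the standard-library `ast` module; the helper name `find_noop_expressions` and the sample snippet are made up for illustration, and the check is deliberately simpler than what a real linter does.

```python
import ast


def find_noop_expressions(source: str) -> list[tuple[int, str]]:
    """Return (line, code) for expression statements that have no effect."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Expr):
            continue
        value = node.value
        # Calls, awaits and yields may have side effects, so they are kept.
        if isinstance(value, (ast.Call, ast.Await, ast.Yield, ast.YieldFrom)):
            continue
        # Bare string constants are docstrings.
        if isinstance(value, ast.Constant) and isinstance(value.value, str):
            continue
        hits.append((node.lineno, ast.unparse(value)))
    return sorted(hits)


SNIPPET = '''
def forward(hidden, D):
    """Docstring is fine."""
    flat = hidden.reshape(-1, D)
    flat.shape[0]
    return flat
'''

noops = find_noop_expressions(SNIPPET)
```

Here `noops` contains a single entry pointing at the `flat.shape[0]` line of the sample.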

Comment on lines +2 to +5
# DeepSeek-V4 Pro (large MoE variant).
#
# Source: deepseek-v4/deepseek-ai/DeeSeek-v4-Pro/config.json
###############################################################################

Copilot AI Apr 29, 2026

Typo in the source comment path: DeeSeek-v4-Pro should be DeepSeek-v4-Pro (consistent with the model naming elsewhere).

Comment on lines +47 to +52
def forward(self, x: torch.Tensor) -> torch.Tensor:
    in_dtype = x.dtype
    x32 = x.float()
    rms = torch.rsqrt(x32.pow(2).mean(dim=-1, keepdim=True) + self.eps)
    return (x32 * rms).to(in_dtype) * self.weight

Copilot AI Apr 29, 2026

Same RMSNorm dtype issue here: (…to(in_dtype) * self.weight) can upcast the output back to fp32 if self.weight is fp32, which is likely under mixed precision. To keep the compressor output in the activation dtype, multiply by self.weight.to(in_dtype) or cast the final output back to in_dtype.

Comment on lines +223 to +224
v.shape[2]
v_local_per_q = v.unsqueeze(2).expand(-1, -1, q.shape[2], -1, -1) # [B, H, S, Sk_local, head_dim]

Copilot AI Apr 29, 2026

This v.shape[2] line is a no-op (likely leftover from debugging) and should be removed to avoid confusing readers and linters.

Comment on lines +46 to +56
class HashRouter(nn.Module):
    """Static hash-based MoE router.

    Args:
        num_experts: total number of routed experts.
        topk: number of experts each token is routed to.
        vocab_size: tokenizer vocabulary size; controls the table length.
        seed: deterministic seed for the hash; same across all ranks.
        dtype: dtype of the returned ``probs`` tensor; defaults to
            ``torch.float32``.

Copilot AI Apr 29, 2026

This PR introduces substantial new DeepSeek-V4 core modules (attention variants, compressor/indexer, routers, MoE, HC) but does not add unit tests covering their key invariants (e.g., HashRouter determinism, CSA/HCA causality masks, compressor/indexer shape/validity). The repo already has a Python unit test suite under tests/unit_tests/ (including Megatron transformer tests), so please add focused unit tests for these new modules to prevent regressions.
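
As an illustration of what a determinism test could assert, here is a minimal sketch. It does not use the PR's `HashRouter`; `build_hash_table` is a hypothetical pure-Python stand-in that only shares the documented arguments (`num_experts`, `topk`, `vocab_size`, `seed`), and the real module's construction may differ.

```python
import random


def build_hash_table(num_experts: int, topk: int, vocab_size: int, seed: int) -> list[tuple[int, ...]]:
    """Hypothetical stand-in: map every token id to `topk` distinct experts."""
    rng = random.Random(seed)  # private generator, independent of global RNG state
    return [tuple(rng.sample(range(num_experts), topk)) for _ in range(vocab_size)]


def route(table: list[tuple[int, ...]], token_ids: list[int]) -> list[tuple[int, ...]]:
    """Look up the static expert assignment for each token id."""
    return [table[token] for token in token_ids]


def check_router_invariants(num_experts: int = 16, topk: int = 2,
                            vocab_size: int = 128, seed: int = 1234) -> bool:
    table_a = build_hash_table(num_experts, topk, vocab_size, seed)
    random.seed(999)  # perturbing the global RNG must not change the table
    table_b = build_hash_table(num_experts, topk, vocab_size, seed)
    assert table_a == table_b, "same seed must give the same table on every rank"
    assert len(table_a) == vocab_size
    for experts in table_a:
        assert len(set(experts)) == topk
        assert all(0 <= expert < num_experts for expert in experts)
    # The same token id routes identically regardless of its position.
    routed = route(table_a, [5, 7, 5])
    assert routed[0] == routed[2]
    return True
```

A real unit test would build the actual module twice with the same seed and compare its routing outputs in the same way.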

Remove GPT placeholder/super-init spec coupling so DeepSeek-V4 builds the decoder directly from DeepSeek ModuleSpec submodule trees, and update Phase 8 progress records to match the finalized implementation and validation status.

Made-with: Cursor
Unify DeepSeek-V4 runtime module selection under DeepSeekV4SpecProvider and migrate attention/MLP/MoE construction to provider-driven ModuleSpec flows with safe local fallbacks. Document and validate the TE CUDA runtime contract, including an explicit fail-fast guard for non-CUDA TE/Turbo inputs and updated Phase 9 progress records in English.

Made-with: Cursor
Copilot AI review requested due to automatic review settings April 30, 2026 03:08

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 46 out of 50 changed files in this pull request and generated 4 comments.

Comment on lines +164 to +167

# 5) Stash for ``_compute_attention_output`` to consume.
gathered.shape[2]
# Build mask for the compressed branch: ``-inf`` where invalid.

Copilot AI Apr 30, 2026

There are a couple of no-op statements (e.g., gathered.shape[2]) that have no effect and appear to be leftover debugging. Please remove them to keep the CSA path easier to read/maintain.

Comment on lines +176 to +179
batch, seq = input_ids.shape
position_ids = (
    input_ids.new_arange(seq, dtype=input_ids.dtype).unsqueeze(0).expand(batch, -1)
)

Copilot AI Apr 30, 2026

input_ids.new_arange(...) is not a valid PyTorch Tensor API (and there is no local helper/monkeypatch in the repo), so this will raise AttributeError when position_ids is omitted. Use torch.arange(seq, device=input_ids.device, dtype=...) (or the existing Megatron helper used elsewhere) to build position ids.
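
The `torch.arange` replacement suggested here can be sketched as follows, assuming PyTorch is installed. The helper name `default_position_ids` is hypothetical; the PR may prefer an existing Megatron helper instead.

```python
import torch


def default_position_ids(input_ids: torch.Tensor) -> torch.Tensor:
    """Build [batch, seq] position ids 0..seq-1 on the same device as input_ids."""
    batch, seq = input_ids.shape
    positions = torch.arange(seq, device=input_ids.device, dtype=torch.long)
    # expand() returns a broadcast view, so no per-batch copy is made.
    return positions.unsqueeze(0).expand(batch, -1)


input_ids = torch.zeros(2, 5, dtype=torch.long)
position_ids = default_position_ids(input_ids)
```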

Comment thread run_deepseek_v4.sh
Comment on lines +37 to +40
export PRECISION_TYPE=${PRECISION_TYPE:-BF16}
export FP8=null
export FP8_RECIPE=null

Copilot AI Apr 30, 2026

FP8/FP8_RECIPE default to the literal string null, but the script still passes them via --fp8/--fp8_recipe. That makes args.fp8 truthy and can trigger FP8 validation paths (and failures) even when FP8 is intended to be disabled. Only include these CLI flags when PRECISION_TYPE=FP8, or ensure the disabled state is represented in a way the arg parser treats as false/None.
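
The gating suggested here can be written down as a small function. This is a sketch in Python rather than shell, and `build_fp8_args` is a hypothetical helper; the `hybrid` and `delayed` defaults mirror the values the script assigns when `PRECISION_TYPE=FP8`.

```python
def build_fp8_args(env: dict[str, str]) -> list[str]:
    """Return FP8 CLI flags only when FP8 is requested (hypothetical helper)."""
    if env.get("PRECISION_TYPE", "BF16") != "FP8":
        # Omit the flags entirely so the parser keeps its own disabled default.
        return []
    return [
        "--fp8", env.get("FP8", "hybrid"),
        "--fp8_recipe", env.get("FP8_RECIPE", "delayed"),
    ]


# The literal string "null" is truthy in Python, which is why passing it as a
# flag value can look like "FP8 enabled" to code that only tests truthiness.
null_is_truthy = bool("null")

bf16_args = build_fp8_args({"PRECISION_TYPE": "BF16"})
fp8_args = build_fp8_args({"PRECISION_TYPE": "FP8", "FP8_RECIPE": "mxfp8"})
```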

Comment on lines +321 to +324
B, S, D = hidden.shape
flat = hidden.reshape(-1, D) # [N, D]
flat.shape[0]

Copilot AI Apr 30, 2026

There are a few no-op statements left in forward (e.g., flat.shape[0]) that don't affect execution and look like leftover debugging. Please remove them to avoid confusion and keep the forward path clean.

…chema

Align phase10 DeepSeek-V4 modules on explicit spec/provider contracts by enforcing SharedExpertMLP-only shared experts and introducing a dedicated DeepSeekV4TransformerConfig for V4-only runtime fields. Update builder/spec/docs so training resolves the new config type and tracks activation clamp through model config.

Made-with: Cursor
Fix HC/attention dtype mismatches and tune the DeepSeek-V4 smoke script defaults so the Phase 10 MI355X run completes reliably end-to-end. Add a dedicated Phase 10 convergence report documenting delivered scope, runtime blockers, and remaining tracked items.

Made-with: Cursor
Copilot AI review requested due to automatic review settings April 30, 2026 12:46

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 48 out of 52 changed files in this pull request and generated 5 comments.

Comment thread run_deepseek_v4.sh
Comment on lines +39 to +90
export PRECISION_TYPE=${PRECISION_TYPE:-BF16}
export FP8=null
export FP8_RECIPE=null

if [ "$PRECISION_TYPE" = "FP8" ]; then
    export FP8=${FP8:-hybrid}
    export FP8_RECIPE=${FP8_RECIPE:-delayed}
fi

export EXP=${EXP:-examples/megatron/configs/MI355X/deepseek_v4_flash-BF16-pretrain.yaml}
export BACKEND_PATH=${BACKEND_PATH:-"$(pwd)/third_party/Megatron-LM"}
export PRIMUS_TEAM=${PRIMUS_TEAM:-amd}
export PRIMUS_USER=${PRIMUS_USER:-tas-mi355x-$(date +%Y%m%d)}
export PRIMUS_EXP_NAME=${PRIMUS_EXP_NAME:-deepseek_v4_smoke_${PRECISION_TYPE}_MBS${MBS}_GBS${GBS}_PP${PRIMUS_PP}_EP${PRIMUS_EP}}

if [ ! -d "$BACKEND_PATH" ] || [ -z "$(ls -A "$BACKEND_PATH" 2>/dev/null)" ]; then
    echo "[ERROR] BACKEND_PATH does not exist or is empty: $BACKEND_PATH"
    echo "Run: git submodule update --init --recursive"
    exit 1
fi

mkdir -p "output/$PRIMUS_TEAM/$PRIMUS_USER/$PRIMUS_EXP_NAME"

./primus-cli direct \
-- train pretrain --config "$EXP" \
--backend_path "$BACKEND_PATH" \
--num_layers "$PRIMUS_TOTAL_LAYERS" \
--train_iters "$TRAIN_ITERS" \
--lr_warmup_iters 0 \
--lr_decay_iters "$TRAIN_ITERS" \
--micro_batch_size "$MBS" \
--global_batch_size "$GBS" \
--seq_length "$PRIMUS_SEQ_LENGTH" \
--max_position_embeddings "$PRIMUS_MAX_POSITION_EMBEDDINGS" \
--rope_type rope \
--tensor_model_parallel_size "$PRIMUS_TP" \
--pipeline_model_parallel_size "$PRIMUS_PP" \
--expert_model_parallel_size "$PRIMUS_EP" \
--num_experts "$PRIMUS_NUM_EXPERTS" \
--moe_router_topk "$PRIMUS_MOE_TOPK" \
--moe_router_enable_expert_bias "$PRIMUS_MOE_ENABLE_EXPERT_BIAS" \
--moe_ffn_hidden_size "$PRIMUS_MOE_FFN_HIDDEN_SIZE" \
--index_topk "$PRIMUS_INDEX_TOPK" \
--v4_grouped_experts_support_clamped_swiglu "$PRIMUS_V4_GROUPED_EXPERTS_SUPPORT_CLAMPED_SWIGLU" \
--compress_ratios "$PRIMUS_COMPRESS_RATIOS" \
--mtp_num_layers 0 \
--mock_data True \
--use_turbo_attention "$USE_TURBO_ATTENTION" \
--use_turbo_grouped_mlp "$TURBO_USE_GROUPED_MLP" \
--moe_use_legacy_grouped_gemm "$LEGACY_GG" \
--fp8 "$FP8" \
--fp8_recipe "$FP8_RECIPE" \

Copilot AI Apr 30, 2026

FP8/FP8_RECIPE are always passed to primus-cli (defaulting to the literal string null). Other run scripts in this repo gate --fp8 ... args behind an explicit FP8 enable flag; passing null may be rejected by argument parsing or select an unintended FP8 mode. Consider only adding --fp8/--fp8_recipe when PRECISION_TYPE=FP8 (or when a dedicated FP8=True flag is set), and omit them entirely otherwise.

Comment on lines +48 to +52
# Primus-owned: DeepSeek-V4 (Phase 2 stub; full V4 wiring lands in Phase 3+)
if model_type == "deepseek_v4":
    deepseek_v4_module = importlib.import_module(
        "primus.backends.megatron.core.models.deepseek_v4.deepseek_v4_builders"
    )

Copilot AI Apr 30, 2026

The comment "Phase 2 stub; full V4 wiring lands in Phase 3+" is now misleading since this PR imports the full DeepSeek-V4 builders/specs. Updating/removing it will avoid confusion when debugging model-type dispatch.

Comment on lines +172 to +185
# 5) Stash for ``_compute_attention_output`` to consume.
gathered.shape[2]
# Build mask for the compressed branch: ``-inf`` where invalid.
# This is per-query, shape [S, K]; we keep it on the module as a
# full [B, S, K] additive mask.
sparse_mask = torch.where(valid, 0.0, float("-inf")).to(dtype)  # [B, S, K]
self._csa_state = {
    "gathered": gathered,  # [B, S, K, head_dim]
    "sparse_mask": sparse_mask,  # [B, S, K]
}

# Tell the parent: no cat-extension; we handle CSA inside
# ``_compute_attention_output``.
return None, None, None

Copilot AI Apr 30, 2026

CSAAttention stores per-forward tensors in self._csa_state and then reads them in _compute_attention_output. This is not safe under pipeline parallel schedules (multiple microbatches in flight) or activation checkpoint recomputation, because the module attribute can be overwritten before earlier microbatches/backward recomputes run, leading to wrong outputs/gradients. Refactor CSA to avoid mutable module-level forward state (e.g., compute the joint local+sparse attention fully inside forward, or thread the gathered KV/mask through the call stack without storing on self).
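
The hazard can be shown without any framework. In this toy sketch the classes `StatefulLayer` and `StatelessLayer` are hypothetical and integers stand in for tensors; two microbatches are interleaved the way a pipeline schedule might interleave them.

```python
class StatefulLayer:
    """Anti-pattern: stashes per-forward data on the instance."""

    def prepare(self, microbatch: int) -> None:
        self._state = microbatch * 10  # overwritten by the next microbatch

    def compute(self) -> int:
        return self._state + 1


class StatelessLayer:
    """Threads the per-forward data through the call stack instead."""

    def prepare(self, microbatch: int) -> int:
        return microbatch * 10

    def compute(self, state: int) -> int:
        return state + 1


def interleaved_stateful() -> list[int]:
    layer = StatefulLayer()
    layer.prepare(1)
    layer.prepare(2)  # second microbatch arrives before the first is consumed
    return [layer.compute(), layer.compute()]


def interleaved_stateless() -> list[int]:
    layer = StatelessLayer()
    state_1 = layer.prepare(1)
    state_2 = layer.prepare(2)
    return [layer.compute(state_1), layer.compute(state_2)]
```

The stateful variant returns the second microbatch's result for both calls, while the stateless variant keeps them apart. The same reasoning applies to recomputation during backward, where the stashed value may already have been replaced or cleared.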

Comment on lines +172 to +174
# 5) Stash for ``_compute_attention_output`` to consume.
gathered.shape[2]
# Build mask for the compressed branch: ``-inf`` where invalid.

Copilot AI Apr 30, 2026

There are two no-op expression statements (gathered.shape[2] and later v.shape[2]) that have no effect and look like leftover debug code. They should be removed to avoid confusion (and to keep linters/type checkers from flagging them).

Comment on lines +187 to +198
decoder = getattr(self, "decoder", None)
if decoder is not None:
    decoder._v4_token_ids = input_ids
try:
    hidden_states = self.decoder(
        hidden_states=decoder_input,
        attention_mask=attention_mask,
        **kwargs,
    )
finally:
    if decoder is not None:
        decoder._v4_token_ids = None

Copilot AI Apr 30, 2026

DeepseekV4Model.forward stashes input_ids onto decoder._v4_token_ids and clears it immediately after the forward. This breaks any activation checkpoint/recompute that re-invokes decoder/layer forwards during backward (token_ids will be missing) and is also unsafe with pipeline schedules that can have multiple microbatches using the same module instance. Prefer passing token_ids=input_ids explicitly into self.decoder(...) (the decoder already accepts a token_ids kwarg) instead of relying on mutable module state.

wenxie-amd and others added 4 commits May 21, 2026 09:33
Add run_deepseek_v4_pro_muon.sh: runs DeepSeek-V4-Pro
(deepseek_v4_pro.yaml, 61L/d7168/384 experts) with the Muon optimizer on
one 8x288GB node, following the paper §4.2.1 architecture + §4.2.2
training setup as closely as a single node allows.

Key points (validated on gfx950):
- Muon: optimizer=muon, momentum 0.95, update-RMS 0.18
(muon_extra_scale_factor; spectral mode => update RMS ~=
extra_scale_factor; Megatron default 1.0 is ~5.5x too large),
use_distributed_optimizer=False + fp32 opt-state dtypes (Megatron
asserts), AdamW eps 1.0e-20 (decimal point required or Primus parses it
as a string and multi_tensor_adam crashes).
- Batch via gradient accumulation toward the paper's 94.4M tokens/step.
This amortizes Muon's fixed Newton-Schulz cost: at GBS=8 (accum 1) NS
looks like ~97% of GEMM (a starved-batch artifact); at GBS=256 (accum
32)/seq4096 throughput goes 19.8 -> ~890 TFLOP/s/GPU and Muon NS falls
to ~1% of GPU time.
- Single-node concessions documented in-script: reduced depth/seq to
fit; emerging_optimizers must be pip-installed for Muon; expert-bias off
(Megatron needs sigmoid, V4 uses sqrtsoftplus); MTP depth 1 unsupported
by Primus V4.

Co-authored-by: yanyuqin <qin.yanyuan@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
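
The batch arithmetic in the commit message above can be checked with a few lines. This is a sketch under stated assumptions: a micro-batch of 1 and a data-parallel size of 8 are not stated in the message, they are inferred from "GBS=8 (accum 1)" and "GBS=256 (accum 32)"; the helper names are hypothetical.

```python
def accumulation_steps(global_batch: int, micro_batch: int, data_parallel: int) -> int:
    """Gradient-accumulation steps needed to reach the global batch size."""
    per_pass = micro_batch * data_parallel
    if global_batch % per_pass:
        raise ValueError("global batch must be a multiple of micro_batch * data_parallel")
    return global_batch // per_pass


def tokens_per_step(global_batch: int, seq_length: int) -> int:
    return global_batch * seq_length


PAPER_TOKENS_PER_STEP = 94.4e6  # figure quoted in the commit message

accum_small = accumulation_steps(8, micro_batch=1, data_parallel=8)    # assumed MBS and DP
accum_large = accumulation_steps(256, micro_batch=1, data_parallel=8)  # assumed MBS and DP
tokens_large = tokens_per_step(256, 4096)
fraction_of_paper = tokens_large / PAPER_TOKENS_PER_STEP
throughput_gain = 890 / 19.8  # TFLOP/s/GPU figures from the commit message
```

Under these assumptions the GBS=256 run processes about 1.05M tokens per step, roughly 1% of the paper's step size, so the single-node run moves toward the paper's batch without reaching it.
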
Comment thread deepseek-v4/develop/progress/p29/_forensics3.py Fixed
Comment thread deepseek-v4/develop/progress/p29/_forensics2.py Fixed
Comment thread deepseek-v4/develop/progress/p29/_forensics.py Fixed
…rHead

Wire DeepSeek-V4 multi-token prediction end-to-end (mtp_num_layers>0):

- New DeepseekV4MTPLayer (subclasses upstream MultiTokenPredictionLayer): lifts the eh_proj output to the K-stream (mHC) form, runs the V4 hybrid inner layer, and collapses with a per-depth HyperHead (hc_head_fn) per the released V4 checkpoint (config.mtp_use_separate_hc_head). Threads pg_collection into the inner layer so its MoE uses the same expert-parallel dispatcher as the main decoder (the upstream MTP build omits pg_collection, whose local-experts fallback breaks the DDP grad-bucket invariant under EP).

- get_v4_mtp_block_spec extracts a single hybrid-layer spec from the decoder block spec (mirrors upstream GPT spec.layer_specs[-1]) instead of passing the whole block spec (which trips MTP TransformerLayerSubmodules validation).

- DeepseekV4HybridLayer.__init__ accepts rope=None (builds + registers a private DualRoPE so its buffers move to device) and tolerates upstream MTP build kwargs (is_mtp_layer / vp_stage).

- Add config.mtp_use_separate_hc_head (default True); fix/extend test_v4_mtp.

Validated: EP=8 V4-Flash proxy, MTP_NUM_LAYERS=1, 10 iters clean (exit 0), mtp_1 loss in train log, lm loss 11.90->10.52, grad norm stable, 0 NaN.
Optional FP8 quantization-aware QK path on the CSA Indexer (config.use_v4_fp8_indexer, default False):

- fake_quantize_fp8_e4m3: per-tensor dynamic E4M3 fake-quant (scale to the 448 finite range, round through torch.float8_e4m3fn, dequantize).

- The Indexer fake-quantizes the per-head query (q_i) and compressed-key (k_icomp) activations before the QK scoring einsum for every dispatch path (full-fuse / post-einsum-tail Triton / eager). The ReLU + per-head weight + sum + causal mask + top-k stay in BF16 (the report's BF16 index-score path). The indexer is a frozen non-differentiable top-k selector, so no straight-through estimator is needed; one-time rank-0 log when active.

- Plumb config.use_v4_fp8_indexer -> Indexer(use_fp8_qk=...) via the V4 attention builder; add base/flash yaml defaults (PRIMUS_USE_V4_FP8_INDEXER).

- Unit tests: fake-quant bounded-error/dtype/zero-safe; FP8 indexer top-k overlaps the BF16 reference; flag-off is bit-identical.

Validated: EP=8 V4-Flash proxy, USE_V4_FP8_INDEXER=true, 10 iters clean (exit 0), '[V4-Indexer] FP8 ... ENABLED' in log, lm loss 11.90->10.53 (matches BF16 baseline 10.52), grad norm stable, 0 NaN.
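
To make the fake-quant step concrete, here is a pure-Python emulation of per-tensor dynamic E4M3 fake-quantization. It is only a sketch: the commit rounds through `torch.float8_e4m3fn`, whereas this version emulates the rounding with `math.frexp`, saturates at 448 by assumption, and uses hypothetical function names.

```python
import math

E4M3_MAX = 448.0    # largest finite E4M3 value
E4M3_MIN_EXP = -6   # smallest normal exponent; below it the grid is subnormal
E4M3_MANT_BITS = 3


def round_to_e4m3(value: float) -> float:
    """Round one float to the nearest E4M3 value (ties to even), saturating at +-448."""
    if value == 0.0:
        return 0.0
    sign = -1.0 if value < 0 else 1.0
    mag = min(abs(value), E4M3_MAX)
    # floor(log2(mag)), clamped so tiny values land on the subnormal grid.
    exponent = max(math.frexp(mag)[1] - 1, E4M3_MIN_EXP)
    step = 2.0 ** (exponent - E4M3_MANT_BITS)
    return sign * min(round(mag / step) * step, E4M3_MAX)


def fake_quantize_e4m3(values: list[float]) -> list[float]:
    """Scale to the 448 range, round to E4M3, then dequantize."""
    amax = max((abs(v) for v in values), default=0.0)
    if amax == 0.0:
        return [0.0 for _ in values]  # zero-safe: avoid dividing by zero
    scale = E4M3_MAX / amax
    return [round_to_e4m3(v * scale) / scale for v in values]


quantized = fake_quantize_e4m3([1.0, -3.0, 0.5])
```

With three mantissa bits, values in the normal range carry a relative rounding error of at most 2^-4 (6.25%), which is the kind of bound a bounded-error unit test can assert.
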
…param split

Make the Muon optimizer path work for DeepSeek-V4 with the report's recipe:

- Install hook (runner/.../01_install_emerging_optimizers.sh, gated by PRIMUS_INSTALL_EMERGING_OPTIMIZERS, idempotent) provisions NVIDIA-NeMo/Emerging-Optimizers from a pinned commit (0.4.0a0) inside the container; the public PyPI name is a stub, so it must come from GitHub source.

- moun.py: adapt to the emerging_optimizers>=0.4.0a0 API (use_nesterov->nesterov, mode->tp_mode); add _resolve_muon_coefficient_type (auto-selects the built-in 'deepseekv4' coefficient set -- 8 aggressive (3.4445,-4.7750,2.0315) + 2 stable (2.0,-1.5,0.5) -- for V4 configs, raising num_ns_steps 5->10); extract _param_goes_to_muon (report S9.5.1 split: 2-D weight matrices incl. mHC fn -> Muon; embedding/output + all <2-D params incl. RMSNorm, mHC base/scale, biases -> AdamW; fixes a 0-dim param reaching Newton-Schulz).

- muon_optimizer_patches.py: redirect megatron.training.training.get_megatron_muon_optimizer (called directly by setup_model_and_optimizer) to the Primus moun builder so the new-API + deepseekv4 path drives Muon without editing third_party; force use_gloo_process_groups=False (incompatible with the provided pg_collection).

- moun_optimizer_config.py: add muon_coefficient_type field.

- Unit test for the param split (incl. scalar/1-D->AdamW, 2-D->Muon) and coefficient resolution (V4 autoselect deepseekv4 / explicit override / non-V4 quintic).
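
The split rule described in the list above can be restated as a short predicate. This is a sketch, not the PR's `_param_goes_to_muon`: the signature, the name fragments used to spot embedding and output weights, and every parameter name in `PARAMS` are hypothetical, and parameters with more than two dimensions (not covered by the commit message) simply fall through to AdamW here.

```python
def param_goes_to_muon(name: str, ndim: int) -> bool:
    """Hypothetical restatement of the split; the real helper may differ."""
    # Assumed name fragments for embedding / output weights.
    if any(tag in name for tag in ("embedding", "output_layer")):
        return False
    # Only 2-D weight matrices go to Muon; scalars and 1-D params go to AdamW.
    return ndim == 2


# Made-up parameter names mapped to their number of dimensions.
PARAMS = {
    "decoder.layers.0.attn.linear_q.weight": 2,
    "decoder.layers.0.hc.fn.weight": 2,
    "embedding.word_embeddings.weight": 2,
    "output_layer.weight": 2,
    "decoder.layers.0.norm.weight": 1,
    "decoder.layers.0.hc.scale": 0,
    "decoder.layers.0.mlp.bias": 1,
}

muon_params = sorted(n for n, d in PARAMS.items() if param_goes_to_muon(n, d))
adamw_params = sorted(n for n in PARAMS if n not in muon_params)
```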

Validated: EP=8 V4-Flash proxy, OPTIMIZER=muon, 10 iters clean (exit 0); log shows coefficient_type='deepseekv4' num_ns_steps 5->10, momentum 0.95 / extra_scale 0.18 / wd 0.1; lm loss 11.90->11.60 monotonic, grad norm ~3.0 stable, 0 NaN; emerging_optimizers auto-installed by the hook.
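
The effect of the two coefficient sets can be seen on a single singular value, because the Newton-Schulz update X <- aX + b(XX^T)X + c(XX^T)^2 X acts on each singular value as the scalar polynomial a*x + b*x^3 + c*x^5. The sketch below assumes the 8 aggressive steps run before the 2 stable steps, which the commit message does not state explicitly; the function names are hypothetical.

```python
AGGRESSIVE = (3.4445, -4.7750, 2.0315)  # coefficients quoted in the commit message
STABLE = (2.0, -1.5, 0.5)


def ns_step(x: float, coeffs: tuple[float, float, float]) -> float:
    """One Newton-Schulz step applied to a single singular value."""
    a, b, c = coeffs
    return a * x + b * x**3 + c * x**5


def ns_schedule(x: float, aggressive_steps: int = 8, stable_steps: int = 2) -> float:
    # Assumed ordering: aggressive steps first, stable steps last.
    for _ in range(aggressive_steps):
        x = ns_step(x, AGGRESSIVE)
    for _ in range(stable_steps):
        x = ns_step(x, STABLE)
    return x


results = {start: ns_schedule(start) for start in (0.01, 0.1, 0.5, 1.0)}
```

The aggressive polynomial inflates small singular values quickly but only brings them into a band around 1; the stable polynomial has 1 as a fixed point with zero slope there, so the last two steps pull the values close to 1.
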
…param-gather)

Wire the V4 optimizer/precision recipes into the shared launcher (run_deepseek_v4.sh):

- OPTIMIZER={adam(default)|muon|dist_muon}: the muon paths set the Megatron muon CLI (momentum 0.95, extra_scale 0.18, use_distributed_optimizer/precision_aware False, fp32 states), force DDP grad/param overlap OFF (plain muon: Megatron asserts; dist_muon: LayerWiseDistributedOptimizer manages its own param all-gather and otherwise double-drives DDP start_param_sync), and set PRIMUS_INSTALL_EMERGING_OPTIMIZERS so the install hook provisions emerging_optimizers.

- FP8 / FP8_RECIPE now honor the incoming env (e.g. FP8_RECIPE=mxfp8) instead of being hard-clobbered to null.

- FP8_PARAM_GATHER=True path: enables MXFP8 (NVTE_ROCM_ENABLE_MXFP8=1) + --fp8_param_gather (+ --reuse_grad_buf_for_mxfp8_param_ag for mxfp8). Megatron #4987 analogue.

- Also surfaces MTP_NUM_LAYERS and USE_V4_FP8_INDEXER launch knobs.

Validated on the EP=8 V4-Flash proxy (10 iters each, exit 0, loss decreasing, 0 NaN): OPTIMIZER=muon and OPTIMIZER=dist_muon (loss 11.90->11.60, deepseekv4 hybrid NS); OPTIMIZER=dist_muon + --fp8 hybrid --fp8_recipe mxfp8 (Muon + MXFP8 forward). KNOWN LIMITATION: --fp8-param-gather together with Muon is blocked at Megatron arg-validation ('--fp8-param-gather only supported with distributed optimizer...'), since Muon requires use_distributed_optimizer=False; full Muon+fp8-param-gather needs the unmerged upstream LayerWise integration (Megatron #4987) and cannot be done without a third_party edit. Plumbing is in place for when the container's Megatron gains it.
Comment thread deepseek-v4/develop/progress/build_roadmap_pptx.py Fixed
@JohnQinAMD JohnQinAMD force-pushed the dev/wenx/deepseek-v4 branch from c84be32 to 3841e36 Compare June 11, 2026 23:23
@JohnQinAMD JohnQinAMD force-pushed the dev/wenx/deepseek-v4 branch from 8283be2 to 689a260 Compare June 12, 2026 15:35
Comment on lines +190 to +204
"plan6_triton": [
{
"phase": phase,
"name": name,
"count": count,
"total_ms": total / 1000.0,
"avg_ms": avg / 1000.0,
"pct_window": (total / win_us * 100.0) if win_us else 0.0,
}
for phase, names in plan6_families.items()
for name in names
for cand_name, count, total, avg in rows
if cand_name == name
for _name in [cand_name]
],
head_y + Inches(0.32),
Inches(12.75),
head_y + Inches(0.32),
color=LINE_strong if False else LINE,
lhzhang333 and others added 4 commits June 17, 2026 21:36
…on logs

- run_deepseek_v4_pro_muon.sh: set PRIMUS_INSTALL_EMERGING_OPTIMIZERS for
  Muon/dist_muon so the in-container hook provisions the pinned package.
- add megatron patch to silence Triton autotuner print spam.
- add megatron patch to raise the emerging_optimizers 'absl' logger to INFO
  (drops per-step Newton-Schulz coefficient DEBUG lines).

Co-authored-by: Cursor <cursoragent@cursor.com>
Sync garbage collection across ranks every 100 steps (matches
run_deepseek_v4.sh) to reduce step-time jitter on the large
grad-accum pro Muon run.

Co-authored-by: Cursor <cursoragent@cursor.com>
- script/: per-cr single-layer profiling launcher (seq4096, adam+dist-opt,
  GA=2, no recompute) for MI355X trace capture
- tools/: chrome-trace -> breakdown JSON parser (External-id linking,
  nn.Module attribution, fwd/bwd split, min-grouping) + kernel/module map
- site/: static MI355X-measured + MI455X-scaled projection website with
  step-by-step iter-time / tokens-per-s / TFLOP/s derivation
- design/: methodology, assumptions, JSON schema, projection math, deploy
- publish via backend-gap Pages bundle subpath + standalone dev-branch
  Pages workflow (deploy-projection.yml)

Co-authored-by: Cursor <cursoragent@cursor.com>
…site data

- Root cause: distributed optimizer (zero1) makes ROCm Kineto drop the compute
  GPU kernels for pure dense(cr=0)/HCA(cr=128) layers; CSA(cr=4) is unaffected.
  Profiling launcher now defaults use_distributed_optimizer=False (+ fp32
  states), so all three cr capture their compute. dist-opt doesn't affect the
  fwd/bwd compute the projection needs (optimizer modeled analytically).
- Add CR=mix and DISABLE_PROFILER_CPU options used while diagnosing.
- Un-ignore deepseek-v4/projection/site/data so the breakdown JSON (site/Pages
  input) is tracked.
- pro.json now real measured data for all three cr (moe cross-checks across cr).

Co-authored-by: Cursor <cursoragent@cursor.com>
- parse_trace: kernel-name _fwd_/_bwd_ now authoritative for phase (dense
  attention re-runs its _fwd_ kernel in backward with a "Fwd thread id", which
  previously misfiled dense/HCA attn.core into backward). attn.core now appears
  in forward for all cr.
- flash.json: real measured data for all three cr (was mock); moe cross-checks
  across cr (fwd ~43ms / bwd ~27ms).
- write breakdown JSON with trailing newline.

Co-authored-by: Cursor <cursoragent@cursor.com>
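
The phase rule from the first bullet above can be sketched as a small function. `kernel_phase`, its arguments, and the kernel names in the usage lines are hypothetical, and giving `_bwd_` precedence when both markers appear is an assumption.

```python
def kernel_phase(kernel_name: str, thread_phase: str) -> str:
    """Kernel-name markers win; the thread-derived phase is only a fallback."""
    if "_bwd_" in kernel_name:
        return "bwd"
    if "_fwd_" in kernel_name:
        # Covers a forward kernel that is re-run during the backward pass.
        return "fwd"
    return thread_phase


rerun_in_backward = kernel_phase("attn_fwd_kernel", thread_phase="bwd")
plain_backward = kernel_phase("attn_bwd_dq", thread_phase="fwd")
unmarked = kernel_phase("grouped_gemm", thread_phase="bwd")
```
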
… single-node

- parse_trace: aggregate kernel time as total/num_microbatches (num ProfilerStep
  x GA) instead of min-grouping, fixing 3x moe over-count (grouped-gemm dims vary
  per routing step so launches never dedup).
- v4_flops.py: port Megatron's V4 closed-form analytic FLOPs (self-test 34112 vs
  measured 34093 TFLOP, 0.05%); breakdown JSON now carries analytic_flops.
- app.js: TFLOP/s from analytic model FLOPs (Megatron convention, recompute
  excluded); add calibFactor (0.93) for the single-layer->full-model bias.
- Calibrated vs measured flash 16L single node (PP1/EP8/DP8, GBS64, recompute
  full): iter 6681 vs 6665 ms (+0.2%), 630 vs 636 TFLOP/s/GPU (-1%),
  4905 vs 4917 tokens/s/GPU. See design/06-calibration.md.

Co-authored-by: Cursor <cursoragent@cursor.com>
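
The calibration figures above can be cross-checked with simple arithmetic. This sketch assumes a sequence length of 4096 (the profiling launcher's setting, not restated in this commit) and 8 GPUs for the single node; the helper names are hypothetical.

```python
def tokens_per_sec_per_gpu(global_batch: int, seq_length: int,
                           iter_ms: float, num_gpus: int) -> float:
    """Tokens processed per second per GPU for one training iteration."""
    return global_batch * seq_length / (iter_ms / 1000.0) / num_gpus


def rel_error_pct(projected: float, measured: float) -> float:
    return (projected - measured) / measured * 100.0


# GBS=64 on one 8-GPU node; seq_length=4096 is an assumption.
projected_tps = tokens_per_sec_per_gpu(64, 4096, 6681, 8)
measured_tps = tokens_per_sec_per_gpu(64, 4096, 6665, 8)

iter_err = rel_error_pct(6681, 6665)     # projected vs measured iteration time
tflops_err = rel_error_pct(630, 636)     # projected vs measured TFLOP/s/GPU
flops_err = rel_error_pct(34112, 34093)  # analytic vs measured TFLOP
```

Under these assumptions the projected iteration time reproduces the quoted 4905 tokens/s/GPU, and the measured time gives a value within one token per second of the quoted 4917.
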