[Ascend] support qwen35 mtp on Ascend-A3 by wanfengcxz · Pull Request #337 · DeepLink-org/dlinfer

wanfengcxz · 2026-07-03T09:32:49Z

Support qwen35 mtp on Ascend-A3

- ascend_cudagraph.py: multi-token decode graph mode support (4-tuple graph key with query_len, actual_seq_lengths_q buffers)
- device/__init__.py: add patch_attention_is_tp (draft model TP), patch_ray_init (NPU Ray resource), MTP multi-token paths in GatedDelta conv1d and sigmoid_gating update kernels

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Move the Ascend-specific graph alignment, state replay, and sampling fallback into dlinfer so multi-token speculative decode stays stable without expanding lmdeploy core runtime changes.

Made-with: Cursor

Snapshot only the active state-cache rows during speculative replay so Ascend no longer clones the full state pool for rejection recovery.

Made-with: Cursor
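
The row-level snapshot described in this commit message can be illustrated with a minimal sketch. Plain Python lists stand in for the NPU state-cache tensor, and the helper names `snapshot_rows` and `restore_rows` are hypothetical, not functions from dlinfer or lmdeploy. The only point is that the copy scales with the number of active rows rather than with the size of the pool.

```python
from typing import Dict, List


def snapshot_rows(state_pool: List[List[float]], active_rows: List[int]) -> Dict[int, List[float]]:
    """Copy only the rows that the current speculative step may overwrite."""
    return {row: list(state_pool[row]) for row in active_rows}


def restore_rows(state_pool: List[List[float]], snapshot: Dict[int, List[float]], rejected_rows: List[int]) -> None:
    """Roll back, in place, the rows whose draft tokens were rejected."""
    for row in rejected_rows:
        state_pool[row][:] = snapshot[row]


# Hypothetical pool of four state rows; only rows 1 and 3 belong to running requests.
pool = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
saved = snapshot_rows(pool, active_rows=[1, 3])

# The speculative step mutates both active rows.
pool[1][0] = 9.0
pool[3][1] = 7.0

# Suppose only the request in row 3 had its draft tokens rejected.
restore_rows(pool, saved, rejected_rows=[3])
```

After the restore, row 1 keeps its speculative update and row 3 is back to its pre-step contents; rows 0 and 2 were never copied.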

Copilot

Pull request overview

This PR adds Ascend-A3 support for Qwen3.5 MTP (speculative decoding) by introducing Ascend-specific rejection sampling and ring-buffer state handling, and by extending the gated-delta / conv-state execution paths (including cudagraph buffer management) to support multi-token decode.

Changes:

- Add Ascend rejection sampling implementation (Triton + PyTorch fallback) and patch lmdeploy to use it.
- Add fused recurrent gated-delta-rule kernel and integrate ring-buffer state/conv updates for multi-token decoding.
- Update Ascend cudagraph buffer logic and MoE comm buffers to handle MTP’s increased per-step token counts.
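
For readers unfamiliar with the rejection step in speculative decoding, here is a sketch of the textbook accept/resample rule in plain Python. It is not the code in `reject_sample.py`: the function name, the argument layout, and the use of `random.Random` are assumptions made for illustration.

```python
import random
from typing import List, Sequence


def rejection_sample(
    draft_tokens: Sequence[int],
    draft_probs: Sequence[Sequence[float]],
    target_probs: Sequence[Sequence[float]],
    rng: random.Random,
) -> List[int]:
    """Textbook speculative-decoding rejection sampling for one request.

    draft_probs[i] and target_probs[i] are vocabulary distributions at draft
    position i; target_probs carries one extra row used for the bonus token.
    """
    out: List[int] = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i][tok], draft_probs[i][tok]
        # Accept the draft token with probability min(1, p / q).
        if q > 0 and rng.random() < min(1.0, p / q):
            out.append(tok)
            continue
        # Rejected: resample from the normalized residual max(0, p - q) and stop.
        residual = [max(0.0, pt - qd) for pt, qd in zip(target_probs[i], draft_probs[i])]
        if sum(residual) == 0.0:
            residual = list(target_probs[i])
        out.append(rng.choices(range(len(residual)), weights=residual)[0])
        return out
    # Every draft token was accepted: sample one bonus token from the target model.
    bonus = target_probs[len(draft_tokens)]
    out.append(rng.choices(range(len(bonus)), weights=bonus)[0])
    return out


rng = random.Random(0)
same = [[0.5, 0.5], [0.25, 0.75]]
all_accepted = rejection_sample([0, 1], same, same + [[1.0, 0.0]], rng)
first_rejected = rejection_sample(
    [0, 1], [[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]], rng
)
```

When draft and target distributions match, every draft token is accepted and a bonus token is appended; when the target assigns zero probability to the first draft token, the output is a single resampled token.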

Reviewed changes

Copilot reviewed 3 out of 10 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `dlinfer/vendor/ascend/triton_ops/reject_sample.py` | New Ascend rejection sampling implementation with Triton backend and PyTorch reference fallback. |
| `dlinfer/vendor/ascend/triton_ops/fla/fused_recurrent.py` | New fused recurrent gated-delta-rule Triton implementation for decode/prefill state updates. |
| `dlinfer/vendor/ascend/triton_ops/fla/__init__.py` | Expose `fused_recurrent_gated_delta_rule` from the FLA submodule. |
| `dlinfer/vendor/ascend/triton_ops/causal_conv1d.py` | Add ring-buffer mode support for conv state read/write and kernel update path. |
| `dlinfer/vendor/ascend/triton_ops/__init__.py` | Export new Ascend Triton ops (`rejection_sample`, `fused_recurrent_gated_delta_rule`). |
| `dlinfer/vendor/ascend/torch_npu_ops.py` | Remove an unused import. |
| `dlinfer/vendor/ascend/moe.py` | Fix topk padding behavior to align with padded hidden_states handling. |
| `dlinfer/framework/lmdeploy_ext/device/ascend.py` | Adjust bad-words processing to avoid negative indices on Ascend gather/scatter. |
| `dlinfer/framework/lmdeploy_ext/device/__init__.py` | Patch lmdeploy rejection sampler and extend gated-delta net / Qwen3.5 builder for speculative decoding. |
| `dlinfer/framework/lmdeploy_ext/cudagraph/ascend_cudagraph.py` | Extend cudagraph buffers and keying to support multi-token decode and DP-global gating. |
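
The ring-buffer mode mentioned for `causal_conv1d.py` can be pictured with a single-channel toy. This is a sketch under stated assumptions: `RingConvState` and `causal_conv1d_step` are hypothetical names, and the real kernel works on batched NPU tensors in Triton rather than on Python lists. The idea shown is only that a new input overwrites the oldest slot instead of shifting the whole window.

```python
from typing import List


class RingConvState:
    """Keeps the last `width` inputs of one channel without shifting memory."""

    def __init__(self, width: int) -> None:
        self.buf: List[float] = [0.0] * width
        self.head = 0  # slot holding the oldest value, i.e. the next write position

    def push(self, x: float) -> None:
        """Overwrite the oldest slot and advance the write position."""
        self.buf[self.head] = x
        self.head = (self.head + 1) % len(self.buf)

    def window(self) -> List[float]:
        """Return the stored inputs in logical order, oldest to newest."""
        return self.buf[self.head:] + self.buf[: self.head]


def causal_conv1d_step(state: RingConvState, weights: List[float], x: float) -> float:
    """One decode step: store x, then take the dot product with the window."""
    state.push(x)
    return sum(w * v for w, v in zip(weights, state.window()))


state = RingConvState(width=3)
weights = [1.0, 2.0, 3.0]  # ordered oldest to newest
outputs = [causal_conv1d_step(state, weights, x) for x in [1.0, 2.0, 3.0, 4.0]]
```

Each step writes exactly one slot regardless of the window width, which is the property a ring buffer trades for the extra index arithmetic on reads.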

-        max_batches, dtype=torch.int32, device=device
-    )
-
    input_buffers["kv_seqlens"] = torch.ones(max_batches, dtype=torch.int32)


+        input_buffers["q_seqlens"] = (
+            torch.arange(1, max_batches + 1, dtype=torch.int32) * max_q_seq_len
+        )


+        )
+
+    else:
+        input_buffers["q_seqlens"] = torch.arange(1, max_batches + 1, dtype=torch.int32)
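
One way to read the two `q_seqlens` defaults in these excerpts is as cumulative end offsets of each request's query tokens, which would fit the `actual_seq_lengths_q` buffers named in the PR description. That reading is an assumption, since the excerpts do not state it. The sketch below uses plain lists and hypothetical helper names to show that multiples of `max_q_seq_len` equal the running sum when every request carries the same number of query tokens.

```python
from itertools import accumulate
from typing import List


def default_q_seqlens(max_batches: int, max_q_seq_len: int = 1) -> List[int]:
    """List equivalent of arange(1, max_batches + 1) * max_q_seq_len."""
    return [(i + 1) * max_q_seq_len for i in range(max_batches)]


def cumulative_q_seqlens(per_request_q_lens: List[int]) -> List[int]:
    """Running sum of query lengths, i.e. the end offset of each request."""
    return list(accumulate(per_request_q_lens))


single_token = default_q_seqlens(3)                   # ordinary decode, one token per request
multi_token = default_q_seqlens(3, max_q_seq_len=4)   # e.g. 1 verified token + 3 draft tokens
```

With one token per request the default degenerates to 1, 2, 3, ..., matching the `else` branch in the excerpt.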


+    # (which are negative padding values) with 0 before gather/scatter.
+    valid_bad_words = bad_words.where(mask, 0)
+    filtered_scores = scores.gather(1, valid_bad_words)
    filtered_scores = mask.to(filtered_scores.dtype) * filter_value + filtered_scores
-    scores.scatter_(1, bad_words, filtered_scores)
+    scores.scatter_(1, valid_bad_words, filtered_scores)


tangzhiyi11 and others added 30 commits July 3, 2026 06:04

[Ascend] Patch MTP graph and sampling runtime

6e04b7e

Move the Ascend-specific graph alignment, state replay, and sampling fallback into dlinfer so multi-token speculative decode stays stable without expanding lmdeploy core runtime changes.

Made-with: Cursor

[Ascend] Reduce MTP replay state snapshot peak memory

f3aa217

Snapshot only the active state-cache rows during speculative replay so Ascend no longer clones the full state pool for rejection recovery.

Made-with: Cursor

[Ascend] fix gdn kernel in speculative decoding

4661c9a

fix graph

5e44121

fix: ensure state is contiguous

80a8171

refactor fill buffers

9615f98

impl ring buffer for gdn state

aaee81f

impl ring buffer for conv1d state

1549019

Refactor GDN and conv1d computation flow

33ec8ad

refactor gdn buffer

8ce27fc

fix ascend graph

55b332f

fix graph

02e9ef9

remove unused patch

9fa7c7a

impl reject_sample triton function on ascend

cfe7dd3

fix mismatch shape error in chunked prefill

a56a6ac

refactor: lazy load mtp ops

6a2b036

update qwen35 patch func

569a17c

fix no mtp

5729141

refactor graph buffer and kernel

0550c58

fix DPTP comm buffer size and padding_batch_size

f630d72

update qwen35 patch

d39a030

fix aclnnGather error 161002 by sanitizing negative indices

63e5113

remove comma

9fba6fa

fix padding

013c61c

fix global decoding

a335214

fallback reject_sample to torch

dee74a1

refactor reject sample

cc845d5

refactor: remove redundant if branches

1ac406e

format code

24c4656

wanfengcxz requested a review from jinminxi104 as a code owner July 3, 2026 09:32

jinminxi104 requested a review from Copilot July 3, 2026 09:35

Copilot started reviewing on behalf of jinminxi104 July 3, 2026 09:36

format code

cd6d96d

Copilot AI reviewed Jul 3, 2026

wanfengcxz wants to merge 31 commits into DeepLink-org:main from wanfengcxz:wq/support_qwen35_mtp


3 participants
