[feat] Enable openYuanrong RDMA support by KaisennHu · Pull Request #108 · Ascend/TransferQueue

KaisennHu · 2026-05-28T07:13:55Z

Description

For 910B nodes with an additional RoCE NIC (besides the NPU-side RoCE), the openYuanrong datasystem supports host RDMA (H2H) transport via UCX. TQ routes each tensor by its location: CPU tensors go through the KV client and NPU tensors go through the tensor client. H2H RDMA and RH2D therefore serve different paths and can be enabled simultaneously; they are not mutually exclusive.
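
To illustrate why the two transports do not conflict, here is a minimal sketch of location-based routing. `FakeTensor`, `pick_client`, and the client labels are hypothetical names for illustration only, not TQ's actual classes or API, and the mapping of each client to a transport is inferred from the paragraph above.

```python
from dataclasses import dataclass


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a tensor; only its location matters here."""
    device_type: str  # "cpu" or "npu"


def pick_client(tensor: FakeTensor) -> str:
    """Return a label for the client that would carry this tensor."""
    if tensor.device_type == "cpu":
        # Host-memory path: the one host RDMA (H2H) would accelerate.
        return "kv_client"
    if tensor.device_type == "npu":
        # Device-memory path: the one RH2D would accelerate.
        return "tensor_client"
    raise ValueError(f"unsupported tensor location: {tensor.device_type}")


print(pick_client(FakeTensor("cpu")))
print(pick_client(FakeTensor("npu")))
```

Because every tensor takes exactly one of the two branches, enabling a faster transport on one branch leaves the other untouched.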

Previously, enabling RDMA required manually adding --enable_rdma true to worker_args and setting UCX_TLS=rc_x in the environment. This PR introduces dedicated config options for one-click RDMA enablement.

Changes

config.yaml: Added enable_rdma (default false) and ucx_env_vars (default {}). When enable_rdma=true, TQ automatically adds --enable_rdma true to the dscli command and defaults UCX_TLS to rc_x. ucx_env_vars lets users set UCX environment variables (UCX_TLS, UCX_LOG_FILE, UCX_LOG_LEVEL, UCX_NET_DEVICES, UCX_TCP_CM_ROUTE); these take priority over the parent environment.
yuanrong_bootstrap.py: Wired enable_rdma and ucx_env_vars through config → actor → start_datasystem_worker. Environment priority: ucx_env_vars > parent env > default UCX_TLS=rc_x.
openyuanrong_datasystem.md: Added RDMA Options section, updated config examples, added manual RDMA startup instructions, and added RDMA FAQ (endpoint timeout, verification, container memlock).
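
The flag injection and environment priority listed above can be sketched as follows. This is an illustration under assumptions, not the code in yuanrong_bootstrap.py: the function names `build_worker_args` and `build_worker_env` are hypothetical, and whether ucx_env_vars is also applied when enable_rdma is false is a guess made for this sketch.

```python
import os
from typing import Dict, List, Mapping, Optional

# Default from the PR description; used only if nothing else sets UCX_TLS.
DEFAULT_UCX_ENV = {"UCX_TLS": "rc_x"}


def build_worker_args(worker_args: List[str], enable_rdma: bool) -> List[str]:
    """Append '--enable_rdma true' unless the user already passed the flag."""
    args = list(worker_args)
    if enable_rdma and "--enable_rdma" not in args:
        args += ["--enable_rdma", "true"]
    return args


def build_worker_env(
    enable_rdma: bool,
    ucx_env_vars: Optional[Mapping[str, object]] = None,
    parent_env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Merge env with priority: ucx_env_vars > parent env > default."""
    env = dict(os.environ if parent_env is None else parent_env)
    if enable_rdma:
        # Lowest priority: fill in the default only where the parent env is silent.
        for key, value in DEFAULT_UCX_ENV.items():
            env.setdefault(key, value)
    # Highest priority: explicit config values overwrite everything else.
    # Assumption: applied regardless of enable_rdma.
    for key, value in (ucx_env_vars or {}).items():
        env[key] = str(value)
    return env


env = build_worker_env(True, {"UCX_NET_DEVICES": "mlx5_0:1"}, {"UCX_TLS": "tcp"})
print(env["UCX_TLS"], env["UCX_NET_DEVICES"])
```

In the usage line, the parent environment's UCX_TLS wins over the default because ucx_env_vars does not mention it. The device name `mlx5_0:1` is only a placeholder value.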

Related Issues

Closes #98

Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>

ascend-robot · 2026-05-28T07:14:06Z

CLA Signature Pass

KaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍

Copilot

Pull request overview

This PR adds configurable openYuanrong host RDMA support so TransferQueue can start Yuanrong datasystem workers with RDMA enabled and UCX environment overrides.

Changes:

Added enable_rdma and ucx_env_vars configuration options.
Wired RDMA flags and UCX environment handling into Yuanrong worker bootstrap.
Updated openYuanrong documentation with RDMA setup and troubleshooting guidance.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `transfer_queue/config.yaml` | Adds default RDMA and UCX environment configuration fields. |
| `transfer_queue/storage/bootstrap/yuanrong_bootstrap.py` | Passes RDMA options through actor startup and applies UCX env precedence for `dscli`. |
| `docs/storage_backends/openyuanrong_datasystem.md` | Documents RDMA configuration, manual startup, and troubleshooting. |

tianyi-ge · 2026-05-29T02:28:13Z

    # Additional config for yuanrong worker.
    # Recommended options for NPU environments:
    #   --remote_h2d_device_ids       Enable RH2D for efficient cross-node data transfer. Specify NPU device IDs (comma-separated).
    #   --enable_huge_tlb             Enable huge page memory to improve performance. Required for >21GB shared memory on 910B.


Add some useful comments here:

If you want to use RDMA or NPU transport with >20GB shared memory, please enable huge pages to accelerate startup and transfer. Before setting enable_huge_tlb, the following OS configuration is required (needs root privileges):

    # Each huge page is 2MB. For example, to allocate 128GB, reserve 65536 pages.
    sysctl -w vm.nr_hugepages=65536
    # This allows the current user to pin enough memory pages so that RDMA/Ascend can work.
    ulimit -l unlimited
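
The page count in the example above is the requested size divided by the huge page size. A small sketch of that arithmetic, assuming the 2MB page size stated in the comment (the helper name `hugepages_needed` is hypothetical; the actual page size on a host can be checked via Hugepagesize in /proc/meminfo):

```python
import math

HUGE_PAGE_MB = 2  # page size assumed from the comment above


def hugepages_needed(size_gb: float, page_mb: int = HUGE_PAGE_MB) -> int:
    """Number of huge pages needed to back size_gb of memory, rounded up."""
    return math.ceil(size_gb * 1024 / page_mb)


print(hugepages_needed(128))  # 128 * 1024 / 2 = 65536
```

The result is the value to pass to vm.nr_hugepages.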

0oshowero0 · 2026-05-29T02:52:02Z

+    --enable_rdma true \
+    --arena_per_tenant 1 \
+    --enable_worker_worker_batch_get true \
+    --shared_memory_size_mb 8192


We may need more explanation of this parameter. For instance, is this per-node shared memory or per-client? Or is it the total memory size across nodes?

Sure. Added. Per-node shared memory size in MB. All clients on the same node share this shared memory space.

ascend-robot · 2026-05-29T07:08:41Z

CLA Signature Pass

KaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍

ascend-robot · 2026-05-29T09:45:25Z

CLA Signature Pass

KaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍

[feat] Enable openYuanrong RDMA support

cf4480a

Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>

ascend-robot added the ascend-cla/yes label May 28, 2026

0oshowero0 requested a review from Copilot May 28, 2026 08:12

Copilot started reviewing on behalf of 0oshowero0 May 28, 2026 08:12

Copilot AI reviewed May 28, 2026

Comment thread transfer_queue/config.yaml Outdated

Comment thread docs/storage_backends/openyuanrong_datasystem.md Outdated

tianyi-ge reviewed May 29, 2026

0oshowero0 reviewed May 29, 2026

[feat] Enable openYuanrong RDMA support

83f68ed

Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>

[feat] fix gratefully stopping

22903ab

Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>

0oshowero0 approved these changes May 29, 2026

tianyi-ge approved these changes May 29, 2026

0oshowero0 merged commit dc7d203 into Ascend:main May 29, 2026
8 checks passed

0oshowero0 merged 3 commits into Ascend:main from KaisennHu:feat/enable-yr-rdma
