[feat] Enable openYuanrong RDMA support #108
Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>
CLA Signature Pass
KaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍
Pull request overview
This PR adds configurable openYuanrong host RDMA support so TransferQueue can start Yuanrong datasystem workers with RDMA enabled and UCX environment overrides.
Changes:
- Added `enable_rdma` and `ucx_env_vars` configuration options.
- Wired RDMA flags and UCX environment handling into Yuanrong worker bootstrap.
- Updated openYuanrong documentation with RDMA setup and troubleshooting guidance.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `transfer_queue/config.yaml` | Adds default RDMA and UCX environment configuration fields. |
| `transfer_queue/storage/bootstrap/yuanrong_bootstrap.py` | Passes RDMA options through actor startup and applies UCX env precedence for dscli. |
| `docs/storage_backends/openyuanrong_datasystem.md` | Documents RDMA configuration, manual startup, and troubleshooting. |
    # Additional config for yuanrong worker.
    # Recommended options for NPU environments:
    #   --remote_h2d_device_ids    Enable RH2D for efficient cross-node data transfer. Specify NPU device IDs (comma-separated).
    #   --enable_huge_tlb          Enable huge page memory to improve performance. Required for >21GB shared memory on 910B.
Add some useful comments here:
If you want to use RDMA or NPU transport with more than 20GB of shared memory, please enable huge pages to accelerate startup and transfer. Before setting enable_huge_tlb, the following OS configuration is required (root privilege needed):

    # Each huge page is 2MB. For example, to allocate 128GB, allocate 65536 pages.
    sysctl -w vm.nr_hugepages=65536
    # This allows the current user to pin enough memory pages so that RDMA/Ascend can work.
    ulimit -l unlimited

    --enable_rdma true \
    --arena_per_tenant 1 \
    --enable_worker_worker_batch_get true \
    --shared_memory_size_mb 8192
We may need more explanation of this param. For instance, is this per-node shared memory or per-client? Or even the total memory size across nodes?
Sure. Added. Per-node shared memory size in MB. All clients on the same node share this shared memory space.
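The huge page arithmetic from the review comment above (2MB per page, 128GB needs 65536 pages) can be checked with a small sketch. The helper name `hugepages_for` is hypothetical and not part of this PR, and sizing `vm.nr_hugepages` directly from `--shared_memory_size_mb` is an assumption; the worker may need headroom beyond the shared memory size.

```python
import math

# From the review comment: each huge page is 2MB.
HUGE_PAGE_SIZE_MB = 2


def hugepages_for(size_mb, page_size_mb=HUGE_PAGE_SIZE_MB):
    """Hypothetical helper: number of huge pages needed to back size_mb of memory.

    Rounds up so that a size that is not a multiple of the page size is still covered.
    """
    return math.ceil(size_mb / page_size_mb)


# 128GB of shared memory, as in the review comment.
print(hugepages_for(128 * 1024))
# The 8192MB value used in the manual startup example.
print(hugepages_for(8192))
```

The result would be passed to `sysctl -w vm.nr_hugepages=<count>` on each node before starting the worker with `--enable_huge_tlb`.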
Description
For 910B nodes with an additional RoCE NIC (besides NPU-side RoCE), openYuanrong datasystem supports host RDMA (H2H) transport via UCX. Since TQ routes CPU tensors through the KV client and NPU tensors through the tensor client based on tensor location, H2H RDMA and RH2D can be enabled simultaneously; they are not mutually exclusive.
Previously, enabling RDMA required manually adding `--enable_rdma true` to `worker_args` and setting `UCX_TLS=rc_x` in the environment. This PR introduces dedicated config options for one-click RDMA enablement.
Changes
- `config.yaml`: Added `enable_rdma` (default `false`) and `ucx_env_vars` (default `{}`). When `enable_rdma=true`, TQ auto-adds `--enable_rdma true` to the dscli cmd and defaults `UCX_TLS=rc_x`. `ucx_env_vars` lets users specify UCX env vars (UCX_TLS, UCX_LOG_FILE, UCX_LOG_LEVEL, UCX_NET_DEVICES, UCX_TCP_CM_ROUTE) with highest priority over the parent env.
- `yuanrong_bootstrap.py`: Wired `enable_rdma` and `ucx_env_vars` through config → actor → `start_datasystem_worker`. Env priority: `ucx_env_vars` > parent env > default `UCX_TLS=rc_x`.
- `openyuanrong_datasystem.md`: Added RDMA Options section, updated config examples, added manual RDMA startup instructions, and added RDMA FAQ (endpoint timeout, verification, container memlock).
Related Issues
Closes #98
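For readers skimming the description, the env priority stated under Changes (`ucx_env_vars` > parent env > default `UCX_TLS=rc_x`) can be illustrated with a minimal sketch. This is not the code in `yuanrong_bootstrap.py`; the function name `build_worker_env` and its signature are hypothetical, and applying `ucx_env_vars` even when `enable_rdma` is false is an assumption.

```python
import os

# Default applied only when RDMA is enabled (lowest priority), per the PR description.
DEFAULT_UCX_ENV = {"UCX_TLS": "rc_x"}


def build_worker_env(enable_rdma, ucx_env_vars=None, parent_env=None):
    """Hypothetical helper: build the environment for the dscli subprocess.

    Later updates win, so the merge order encodes the precedence:
    defaults < parent env < ucx_env_vars.
    """
    parent = dict(os.environ if parent_env is None else parent_env)
    env = {}
    if enable_rdma:
        env.update(DEFAULT_UCX_ENV)  # lowest priority
    env.update(parent)               # parent env overrides the default
    # Assumption: user-specified UCX vars are applied regardless of enable_rdma.
    env.update({k: str(v) for k, v in (ucx_env_vars or {}).items()})
    return env


# ucx_env_vars wins over both the parent env and the default.
env = build_worker_env(
    enable_rdma=True,
    ucx_env_vars={"UCX_TLS": "tcp"},
    parent_env={"UCX_TLS": "ud", "PATH": "/bin"},
)
print(env["UCX_TLS"])
```

A dict like this could be passed as `env=` to `subprocess.Popen` when launching the worker, which keeps the parent process environment unmodified.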