feat(miles): per-engine process-resident GPU residual gate + forward MILES_MAX_RESIDUAL_GPU_MEM_GB by howard989 · Pull Request #17 · rlops/rlix

howard989 · 2026-05-25T08:38:13Z

What

Two coordinated changes for the post-offload residual check. This is the sender (MILES) side of the paired RLix PR:

- Forward MILES_MAX_RESIDUAL_GPU_MEM_GB from both rlix-mode drivers (run_miles_rlix.py, run_miles_dual.py) into the Ray runtime_env.
- Log SGLang post-sleep residual diagnostics from RolloutManager.shrink_engines after release_memory_occupation.
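
A minimal sketch of the env forwarding described above, not the code from the drivers. The helper name `build_runtime_env` is hypothetical; the only things taken from this PR are the variable name and the fact that it ends up in the Ray runtime_env. `env_vars` is the standard Ray runtime_env key for environment variables.

```python
import os

ENV_NAME = "MILES_MAX_RESIDUAL_GPU_MEM_GB"


def build_runtime_env(base=None, environ=None):
    """Return a copy of a runtime_env dict with ENV_NAME forwarded when set.

    Hypothetical helper for illustration; the real drivers may differ.
    """
    environ = os.environ if environ is None else environ
    runtime_env = dict(base or {})
    env_vars = dict(runtime_env.get("env_vars", {}))
    value = environ.get(ENV_NAME)
    if value is not None:
        # Forward the value as a string; leave it unset otherwise so the
        # receiver can apply its own default threshold.
        env_vars[ENV_NAME] = str(value)
    runtime_env["env_vars"] = env_vars
    return runtime_env


# Typical use (not executed here): ray.init(runtime_env=build_runtime_env())
```

Leaving the variable out when it is unset, rather than forwarding an empty string, keeps the default threshold decision on the RLix side.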

Receiver side: rlops/rlix PR #17.

Why

Per @taoluo review (R02-01): "free memory is gpu-model dependent ... it would be more robust to check the residual memory allocation."

The paired RLix PR now gates wake-up on whole-GPU residual used memory. That intentionally counts all GPU memory on the overlap GPUs, including non-SGLang co-tenants, so it can catch residual memory left by Megatron / Miles / vLLM / orphan processes rather than only SGLang child processes.

MILES keeps SGLang-specific metrics as attribution diagnostics:

- SGLang process-tree resident GPU memory, from nvidia-smi --query-compute-apps
- SGLang /server_info accounting (weight + kvcache + graph)

We investigated SGLang /server_info as a possible hard-gate signal. It is not suitable as a gate. In a smoke test on Vast, a slept Qwen2.5-0.5B SGLang engine reported ~9.32 GiB in /server_info, while its real resident process memory was much lower. /server_info reflects accounting/static-pool size, not physical resident memory after a torch_memory_saver pause.

Evidence from that investigation:

active engine:
  server_info kvcache = 7.06 GiB
  nvidia-smi process = 10686 MiB

slept/offloaded engine:
  server_info kvcache = 8.16 GiB
  nvidia-smi process = 1852 MiB (~1.81 GiB)

So /server_info is diagnostic only. The hard gate lives in RLix and uses whole-GPU memory.used.
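
The arithmetic behind that conclusion can be checked directly. The snippet below is only a worked comparison of the four numbers quoted above; the dictionary layout and function names are illustrative, not part of the PR.

```python
MIB_PER_GIB = 1024

# Numbers copied from the evidence above.
SAMPLES = {
    "active": {"server_info_kvcache_gib": 7.06, "nvidia_smi_process_mib": 10686},
    "slept": {"server_info_kvcache_gib": 8.16, "nvidia_smi_process_mib": 1852},
}


def mib_to_gib(mib):
    return mib / MIB_PER_GIB


def accounting_exceeds_resident(sample):
    """True when /server_info kvcache alone is larger than the whole process."""
    return sample["server_info_kvcache_gib"] > mib_to_gib(sample["nvidia_smi_process_mib"])


for name, sample in SAMPLES.items():
    resident = mib_to_gib(sample["nvidia_smi_process_mib"])
    print(f"{name}: kvcache={sample['server_info_kvcache_gib']:.2f} GiB, "
          f"process={resident:.2f} GiB, "
          f"accounting_exceeds_resident={accounting_exceeds_resident(sample)}")
```

For the slept engine the reported kvcache (8.16 GiB) is larger than the entire process footprint (~1.81 GiB), which cannot be true of physically resident memory.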

How Diagnostics Work

miles/utils/gpu_probe.py provides the SGLang process-resident diagnostic:

- walks the engine process tree
  - self.process.pid is the multiprocessing spawn parent
  - the GPU-resident process is the sglang::scheduler child
- queries:

  nvidia-smi --query-compute-apps=gpu_bus_id,pid,used_memory --format=csv,noheader,nounits

- filters to PIDs in the engine process tree
- sums matched usage within each GPU
- takes the max across GPUs

This gives the SGLang engine's max per-GPU process-resident memory for attribution. It is not the hard gate.
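
The process-tree walk can be sketched without third-party dependencies on Linux by reading /proc. This is an assumption about how such a walk could look, not the code in miles/utils/gpu_probe.py; `read_parent_map` and `descendants` are hypothetical names.

```python
from collections import defaultdict, deque
from pathlib import Path


def read_parent_map(proc_root="/proc"):
    """Map pid -> ppid by reading /proc/<pid>/stat (Linux only)."""
    parents = {}
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue
        try:
            stat = (entry / "stat").read_text()
            # The comm field is parenthesized and may contain spaces or ')',
            # so split after the last ')'. Remaining fields: state, ppid, ...
            fields = stat.rsplit(")", 1)[1].split()
            parents[int(entry.name)] = int(fields[1])
        except (OSError, IndexError, ValueError):
            continue  # process exited mid-scan or stat was unreadable
    return parents


def descendants(root_pid, parents):
    """Return root_pid plus every transitive child found in the pid -> ppid map."""
    children = defaultdict(list)
    for pid, ppid in parents.items():
        children[ppid].append(pid)
    seen = {root_pid}
    queue = deque([root_pid])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

Starting from the spawn parent PID, `descendants(pid, read_parent_map())` would include the scheduler child, which is the PID that nvidia-smi reports as GPU-resident.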

If nvidia-smi is unavailable or compute-app PIDs cannot be matched to the engine process tree, the diagnostic returns None and logs a warning. The RLix whole-GPU hard gate remains the actual enforcement point.
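
The filter, per-GPU sum, max, and fail-open behavior described above can be sketched as follows. The function names are hypothetical and the real probe may differ; the nvidia-smi query is the one quoted earlier. Returning None rather than 0 on failure keeps "could not measure" distinct from "nothing resident".

```python
import subprocess
from collections import defaultdict

QUERY = [
    "nvidia-smi",
    "--query-compute-apps=gpu_bus_id,pid,used_memory",
    "--format=csv,noheader,nounits",
]


def max_per_gpu_resident_mib(csv_text, engine_pids):
    """Sum used_memory (MiB) per GPU for engine PIDs, then take the max.

    Returns None when no row matches an engine PID (fail-open, not 0).
    """
    per_gpu = defaultdict(int)
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue
        bus_id, pid_text, used_text = parts
        try:
            pid, used = int(pid_text), int(used_text)
        except ValueError:
            continue  # skip rows with non-numeric fields
        if pid in engine_pids:
            per_gpu[bus_id] += used
    return max(per_gpu.values()) if per_gpu else None


def probe_resident_mib(engine_pids):
    """Run nvidia-smi and aggregate; None if the tool is missing or fails."""
    try:
        result = subprocess.run(
            QUERY, capture_output=True, text=True, check=True, timeout=10
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return max_per_gpu_resident_mib(result.stdout, engine_pids)
```

Keeping the parsing in a pure function makes the per-GPU max, same-GPU sum, and None-not-0 cases testable without a GPU.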

Changes

examples/rlix/run_miles_dual.py
- forwards MILES_MAX_RESIDUAL_GPU_MEM_GB
examples/rlix/run_miles_rlix.py
- forwards MILES_MAX_RESIDUAL_GPU_MEM_GB
miles/ray/rollout.py
- logs post-sleep SGLang residual diagnostics from shrink_engines
- no longer raises from the SGLang per-process diagnostic
miles/backends/sglang_utils/sglang_engine.py
- logs process-resident SGLang residual and /server_info accounting
miles/utils/gpu_probe.py
- adds dependency-free process-tree GPU residual probe
tests/test_gpu_probe.py
- covers per-GPU max, same-GPU sum, fail-open None-not-0, and process-tree walking
tests/test_residual_gpu_mem_wiring.py
- updated for diagnostic-only SGLang residual logging

Tests

python3 -m py_compile \
  miles/utils/gpu_probe.py \
  miles/backends/sglang_utils/sglang_engine.py \
  miles/ray/rollout.py

python3 -m pytest -q tests/test_gpu_probe.py tests/test_residual_gpu_mem_wiring.py

Results:

tests/test_gpu_probe.py: 11 passed
tests/test_residual_gpu_mem_wiring.py: 2 passed

E2E Verification

Vast Qwen2.5-0.5B dual smoke with paired RLix branch:

SGLang diagnostic:
process_resident_max=2.516-2.535 GiB
whole_gpu_threshold=13.000 GiB

RLix whole-GPU hard gate:
whole-GPU mem used max=6.25 / 6.27 GiB across overlap GPUs [0]/[3]
whole-GPU mem used max=11.95 / 11.97 GiB across overlap GPUs [0]/[3]
threshold=13.00 GiB

mp2 training loop complete
mp1 training loop complete
shutdown_hard complete for both pipelines
EXIT_CODE=0

The SGLang process-resident diagnostic stayed low while the whole-GPU residual reached ~11.97 GiB. That is expected: the hard gate intentionally includes non-SGLang co-tenants.
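
To make the difference between the two signals concrete, here is a hedged sketch of a whole-GPU check of the kind the RLix gate is described as performing. The actual gate lives in the paired RLix PR and is not shown here; `whole_gpu_gate` and its inputs are hypothetical, and the readings are illustrative values borrowed from the log above. Such readings would typically come from nvidia-smi --query-gpu=index,memory.used.

```python
def whole_gpu_gate(used_gib_by_gpu, overlap_gpus, threshold_gib):
    """Return (ok, worst_used_gib) for the GPUs shared between pipelines.

    used_gib_by_gpu counts all memory on each GPU, including co-tenants,
    not just the SGLang process tree. Hypothetical helper for illustration.
    """
    worst = max(used_gib_by_gpu[gpu] for gpu in overlap_gpus)
    return worst <= threshold_gib, worst


# Illustrative readings: whole-GPU usage near 12 GiB against a 13 GiB threshold.
ok, worst = whole_gpu_gate({0: 11.95, 3: 11.97}, overlap_gpus=[0, 3], threshold_gib=13.0)
print(f"ok={ok} worst={worst:.2f} GiB")
```

With the same threshold, the SGLang process-resident figure (~2.5 GiB) would pass with a wide margin, which is why it is only useful for attributing where the whole-GPU residual comes from.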

Known "SharedStorage actor unavailable" warnings and shutdown-time RolloutManager 500 / RemoteProtocolError teardown noise may appear in the logs. They did not affect the run: training completed, both pipelines reached shutdown_hard, and EXIT_CODE=0.

Scope

Env forwarding + SGLang residual diagnostics only. The hard gate and default threshold live on the RLix side.

Megatron train-offload coverage is a separate follow-up. Current smokes show whole-GPU residual near 12 GiB while SGLang's own process-resident residual is ~2.5 GiB, indicating the large residual is likely train/co-tenant memory, not SGLang itself.

Refs: plans/m11-review.review-report/R02.md (R02-01, MEDIUM).

howard989 added 2 commits May 25, 2026 00:12

fix(rlix): gate MILES wake on residual SGLang memory (3407747)

fix(rlix): use 3GB per-process residual threshold default (ac9312d)

howard989 mentioned this pull request May 25, 2026

fix(rlix): gate MILES wake on per-process residual GPU memory (R02-01) rlops/miles#5

Open

fix(rlix): gate wake on whole-GPU residual memory (4af2fc4)

howard989 wants to merge 3 commits into rlops:zhenyu/miles-mvp-e2e from howard989:howard/m11-residual-gpu-threshold-v2
