feat(miles): per-engine process-resident GPU residual gate + forward MILES_MAX_RESIDUAL_GPU_MEM_GB #17
Open
howard989 wants to merge 3 commits into
What
Two coordinated changes for the post-offload residual check, sender / MILES side of the paired RLix PR:
- Forward `MILES_MAX_RESIDUAL_GPU_MEM_GB` from both rlix-mode drivers (`run_miles_rlix.py`, `run_miles_dual.py`) into Ray `runtime_env`.
- Report SGLang residual diagnostics from `RolloutManager.shrink_engines` after `release_memory_occupation`.
Receiver side: rlops/rlix PR #17.
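The env forwarding can be pictured with a minimal sketch. It assumes the drivers build a Ray `runtime_env` dict with an `env_vars` entry; the helper name `build_runtime_env` and the fail-fast `float()` check are illustrative assumptions, not code from this PR.

```python
import os
from typing import Mapping, Optional

ENV_NAME = "MILES_MAX_RESIDUAL_GPU_MEM_GB"


def build_runtime_env(
    environ: Optional[Mapping[str, str]] = None,
    base_env_vars: Optional[Mapping[str, str]] = None,
) -> dict:
    """Return a Ray runtime_env dict that forwards ENV_NAME if the driver has it set.

    Hypothetical helper: the real drivers may assemble runtime_env differently.
    """
    environ = os.environ if environ is None else environ
    env_vars = dict(base_env_vars or {})
    value = environ.get(ENV_NAME)
    if value is not None and value.strip():
        # Assumption: reject a non-numeric threshold in the driver rather than
        # letting it surface later inside a Ray actor.
        float(value)
        env_vars[ENV_NAME] = value.strip()
    # When the variable is unset nothing is forwarded, so the receiver's own
    # default threshold applies.
    return {"env_vars": env_vars}
```

A driver could then pass the result as `ray.init(runtime_env=build_runtime_env())`, so that actors started under that runtime environment see the same threshold as the driver process.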
Why
Per @taoluo review (R02-01): "free memory is gpu-model dependent ... it would be more robust to check the residual memory allocation."
The paired RLix PR now gates wake-up on whole-GPU residual used memory. That intentionally counts all GPU memory on the overlap GPUs, including non-SGLang co-tenants, so it can catch residual memory left by Megatron / Miles / vLLM / orphan processes rather than only SGLang child processes.
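The gate itself lives in the RLix PR and is not reproduced here. As a rough sketch of what a whole-GPU residual check could look like, the snippet below parses `nvidia-smi --query-gpu=index,memory.used` CSV output and flags overlap GPUs above a threshold. The function names, the interpretation of the threshold as GiB, and the fail-closed handling of missing readings are assumptions for illustration.

```python
from typing import Dict, Iterable, List

MIB_PER_GIB = 1024.0

# Expected input: stdout of
#   nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits
# where memory.used is reported in MiB.


def parse_used_gib(csv_text: str) -> Dict[int, float]:
    """Map GPU index -> used memory in GiB, skipping malformed lines."""
    used: Dict[int, float] = {}
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2:
            continue
        try:
            used[int(parts[0])] = float(parts[1]) / MIB_PER_GIB
        except ValueError:
            continue  # e.g. "[N/A]" on unsupported devices
    return used


def gpus_over_threshold(
    used_gib: Dict[int, float],
    overlap_gpus: Iterable[int],
    max_residual_gib: float,
) -> List[int]:
    """Return overlap GPUs whose whole-GPU used memory exceeds the threshold.

    Assumption: a GPU with no reading is treated as over the threshold
    (fail closed), since its residual cannot be verified.
    """
    return sorted(
        g for g in overlap_gpus
        if used_gib.get(g, float("inf")) > max_residual_gib
    )
```

Because the check reads whole-GPU `memory.used`, it counts every process on the device, which is what lets it catch co-tenant residuals that a per-process SGLang metric would miss.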
MILES keeps SGLang-specific metrics as attribution diagnostics:
- `nvidia-smi --query-compute-apps`
- `/server_info` `weight+kvcache+graph` accounting
We investigated SGLang `/server_info` as a possible hard-gate signal. It is not suitable as a gate. A Vast smoke showed a slept Qwen2.5-0.5B SGLang engine reporting ~9.32 GiB in `/server_info`, while the real resident process memory was much lower. `/server_info` reflects accounting/static-pool size, not physical resident memory after `torch_memory_saver` pause.
Evidence from that investigation:
So `/server_info` is diagnostic only. The hard gate lives in RLix and uses whole-GPU `memory.used`.
How Diagnostics Work
`miles/utils/gpu_probe.py` provides the SGLang process-resident diagnostic:
- `self.process.pid` is the multiprocessing spawn parent
- `sglang::scheduler` child
This gives the SGLang engine's max per-GPU process-resident memory for attribution. It is not the hard gate.
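The attribution step can be sketched as follows. This is not the code in `miles/utils/gpu_probe.py`; it is a minimal illustration that assumes the engine's process-tree PIDs are already known and that the input is `nvidia-smi --query-compute-apps` CSV output. The function name `max_resident_gib` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional

MIB_PER_GIB = 1024.0

# Expected input: stdout of
#   nvidia-smi --query-compute-apps=pid,gpu_uuid,used_memory --format=csv,noheader,nounits
# where used_memory is reported in MiB.


def max_resident_gib(csv_text: str, engine_pids: Iterable[int]) -> Optional[float]:
    """Max over GPUs of the memory held by engine_pids, in GiB.

    Returns None when no compute-app row matches the engine's process tree,
    so the caller can log a warning instead of reporting a misleading 0.
    """
    wanted = set(engine_pids)
    per_gpu_mib: Dict[str, float] = defaultdict(float)
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue
        pid_text, gpu_uuid, mem_text = parts
        try:
            pid, mem_mib = int(pid_text), float(mem_text)
        except ValueError:
            continue  # e.g. "[N/A]" memory for some processes
        if pid in wanted:
            # Sum all matching processes that share one GPU.
            per_gpu_mib[gpu_uuid] += mem_mib
    if not per_gpu_mib:
        return None
    return max(per_gpu_mib.values()) / MIB_PER_GIB
```

The PID set would come from walking the process tree rooted at the spawn parent, for example with `psutil.Process(pid).children(recursive=True)` if `psutil` is available. Summing per GPU before taking the max keeps multi-process engines from being under-counted on a shared device.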
If `nvidia-smi` is unavailable or compute-app PIDs cannot be matched to the engine process tree, the diagnostic returns `None` and logs a warning. The RLix whole-GPU hard gate remains the actual enforcement point.
Changes
- `examples/rlix/run_miles_dual.py`: `MILES_MAX_RESIDUAL_GPU_MEM_GB`
- `examples/rlix/run_miles_rlix.py`: `MILES_MAX_RESIDUAL_GPU_MEM_GB`
- `miles/ray/rollout.py`: `shrink_engines`
- `miles/backends/sglang_utils/sglang_engine.py`: `/server_info` accounting
- `miles/utils/gpu_probe.py`
- `tests/test_gpu_probe.py`
- `tests/test_residual_gpu_mem_wiring.py`
Tests
Results:
E2E Verification
Vast Qwen2.5-0.5B dual smoke with paired RLix branch:
The SGLang process-resident diagnostic stayed low while the whole-GPU residual reached ~11.97 GiB. That is expected: the hard gate intentionally includes non-SGLang co-tenants.
Known `SharedStorage actor unavailable` warnings and shutdown-time `RolloutManager` 500 / `RemoteProtocolError` teardown noise may appear. Training completed, both pipelines reached `shutdown_hard`, and `EXIT_CODE=0`.
Scope
Env forwarding + SGLang residual diagnostics only. The hard gate and default threshold live on the RLix side.
Megatron train-offload coverage is a separate follow-up. Current smokes show whole-GPU residual near 12 GiB while SGLang's own process-resident residual is ~2.5 GiB, indicating the large residual is likely train/co-tenant memory, not SGLang itself.
Refs:
`plans/m11-review.review-report/R02.md` (R02-01, MEDIUM).