feat(m11): fail-fast tms hook-mode arch guard + force offload_rollout under RLix #16
Open
zhenyulincs wants to merge 1 commit into
… under RLix

Two adjacent F1 sleep/wake footguns that previously failed silently:

1. torch_memory_saver "preload" hook segfaults on Blackwell (CC major >= 10) under CUDA 12.9: a tracebackless SIGSEGV during build_cpu_bucket_cache. Add assert_tms_hook_mode_matches_arch() (new torch-only tms_utils module) that resolves the effective mode (unset -> preload), fires only on major >= 10, and raises a clear RuntimeError telling the operator to set MILES_TMS_HOOK_MODE=torch. Escape hatch: MILES_TMS_ALLOW_PRELOAD_ON_BLACKWELL=1.
2. RLix mode requires enable_memory_saver (gated by offload_rollout) or the first shrink_engines OOMs. Add apply_rlix_offload_defaults() and call it at the single normalization point in miles_validate_args so RLix forces offload_rollout=True without the operator remembering --offload-rollout. offload_train is left untouched.

Tests: tests/fast/backends/test_tms_utils.py (19) + tests/fast/utils/test_rlix_offload_defaults.py (5). 24 passed on vast L4.
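The guard logic described in item 1 can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `sketch_` function names, their signatures, the `env` parameter, and the error text are assumptions made for the example. The compute capability major is passed in explicitly so the sketch runs without a GPU; real code would presumably read it from `torch.cuda.get_device_capability()`.

```python
import os
from typing import Mapping, Optional

BLACKWELL_MIN_CC_MAJOR = 10  # from the PR text: compute capability major >= 10


def sketch_resolve_hook_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Effective hook mode; unset or blank falls back to tms' default, 'preload'."""
    env = os.environ if env is None else env
    raw = env.get("MILES_TMS_HOOK_MODE", "").strip().lower()
    return raw or "preload"


def sketch_assert_hook_mode_matches_arch(
    cc_major: int, env: Optional[Mapping[str, str]] = None
) -> None:
    """Raise a clear error before the crash-prone combination is reached.

    Hypothetical signature: cc_major is a parameter here only so the
    example is testable without CUDA hardware.
    """
    env = os.environ if env is None else env
    if cc_major < BLACKWELL_MIN_CC_MAJOR:
        return  # pre-Blackwell: any hook mode is accepted
    if sketch_resolve_hook_mode(env) != "preload":
        return  # e.g. MILES_TMS_HOOK_MODE=torch
    if env.get("MILES_TMS_ALLOW_PRELOAD_ON_BLACKWELL") == "1":
        return  # explicit escape hatch
    # RuntimeError rather than `assert`, so the check is not stripped by python -O.
    raise RuntimeError(
        f"torch_memory_saver 'preload' hook is unsafe on compute capability "
        f"{cc_major}.x; set MILES_TMS_HOOK_MODE=torch "
        f"(or MILES_TMS_ALLOW_PRELOAD_ON_BLACKWELL=1 to override)."
    )
```

Resolving the mode first, instead of checking only whether the variable is set, is what lets the "unset" case be treated the same as an explicit `preload`.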
Summary
Two adjacent F1 sleep/wake footguns that previously failed silently or with a tracebackless crash:
1. `torch_memory_saver` (tms) `preload` hook segfaults on Blackwell. tms' default `preload` hook (an `LD_PRELOAD` libc-malloc interposer) SIGSEGVs on Blackwell-class GPUs (compute capability major ≥ 10: RTX 50xx `sm_120`, B100/B200 `sm_100`) under CUDA 12.9, during `build_cpu_bucket_cache`, with no Python traceback. Forgetting `MILES_TMS_HOOK_MODE=torch` silently falls back to `preload` → crash.
2. … `--offload-rollout` but didn't enforce it. `enable_memory_saver` (which lets SGLang return VRAM on `release_memory_occupation`) is gated by `args.offload_rollout`. Under RLix this is mandatory: without it the first `shrink_engines` OOMs (M11.1 attempt-5 bug). Nothing forced it on.

Changes
- `miles/backends/megatron_utils/tms_utils.py`: `resolve_tms_hook_mode()` + `assert_tms_hook_mode_matches_arch()`. torch-only so it's unit-testable without the Megatron actor stack.
  - Resolves the effective mode (unset → `preload`), so the "forgot the export" case is caught.
  - Fires only on CC major ≥ 10; older architectures keep `preload` (the more complete release path, tms' default).
  - Raises `RuntimeError` (survives `python -O`) with an actionable message.
  - Escape hatch: `MILES_TMS_ALLOW_PRELOAD_ON_BLACKWELL=1`.
- `actor.py`: call the guard right before `hook_mode` is applied.
- `apply_rlix_offload_defaults()` in `rlix_validation.py` + call in `miles_validate_args` (single normalization point): forces `offload_rollout=True` under RLix. Idempotent; `offload_train` left untouched.
- `tests/fast/backends/test_tms_utils.py` (19) + `tests/fast/utils/test_rlix_offload_defaults.py` (5).

Test plan / results
Run on vast.ai 4× L4 (Ada `sm_89`, torch 2.11.0+cu129, CUDA 12.9): all 24 tests passed (19 + 5).

Live non-mocked check on the real L4 (`sm_89`, major 8 < 10): `actor.py` imports after the refactor, and the guard correctly stays silent for `preload`/unset/`torch`. The Blackwell raise path is covered by mocked unit tests (no Blackwell GPU on this box).

Full changelog + results: `dev-notes/2026-06-06-tms-arch-guard-and-rlix-offload.md`.

Codex sign-off
APPROVE: no CRITICAL / HIGH / MEDIUM. One LOW (in-place `args` mutation) is intentional and matches the surrounding `miles_validate_args` style.

Follow-ups (not in this PR)
- … `torch` when CC ≥ 10 / CUDA ≥ 12.9) so the export isn't required at all.
- … `offload_train` forced on.
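For reference, the RLix normalization listed under Changes (force `offload_rollout=True`, leave `offload_train` alone, stay idempotent) might look roughly like the sketch below. The function name, the `rlix_enabled` parameter, and the use of `argparse.Namespace` are assumptions for illustration; how the real code detects RLix mode is not shown in this PR description.

```python
from argparse import Namespace


def sketch_apply_rlix_offload_defaults(args: Namespace, rlix_enabled: bool) -> Namespace:
    """Force rollout offload under RLix; do nothing otherwise.

    `rlix_enabled` is a hypothetical stand-in for however the real
    validation code decides that RLix mode is active.
    """
    if rlix_enabled:
        # Mutates args in place, matching the in-place style noted in the review.
        args.offload_rollout = True
    # offload_train is deliberately not touched.
    return args


# Usage: calling it twice gives the same result as calling it once.
args = Namespace(offload_rollout=False, offload_train=False)
sketch_apply_rlix_offload_defaults(args, rlix_enabled=True)
sketch_apply_rlix_offload_defaults(args, rlix_enabled=True)
```

Because the function only ever sets the flag to `True`, repeated calls cannot flip it back, which is what makes it safe to call from a shared normalization point.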