feat(m11): fail-fast tms hook-mode arch guard + force offload_rollout under RLix #16

Open
zhenyulincs wants to merge 1 commit into zhenyu/m11-mvp-test from zhenyu/tms-arch-guard-rlix-offload

@zhenyulincs

Summary

Two adjacent F1 sleep/wake footguns that previously failed silently or with a tracebackless crash:

  1. torch_memory_saver (tms) preload hook segfaults on Blackwell. tms' default preload hook (LD_PRELOAD libc-malloc interposer) SIGSEGVs on Blackwell-class GPUs (compute capability major ≥ 10 — RTX 50xx sm_120, B100/B200 sm_100) under CUDA 12.9, during build_cpu_bucket_cache, with no Python traceback. Forgetting MILES_TMS_HOOK_MODE=torch silently falls back to preload → crash.
  2. RLix mode needs --offload-rollout but didn't enforce it. enable_memory_saver (lets SGLang return VRAM on release_memory_occupation) is gated by args.offload_rollout. Under RLix this is mandatory — without it the first shrink_engines OOMs (M11.1 attempt-5 bug). Nothing forced it on.

Changes

  • New miles/backends/megatron_utils/tms_utils.py: resolve_tms_hook_mode() + assert_tms_hook_mode_matches_arch(). torch-only so it's unit-testable without the Megatron actor stack.
    • Checks the resolved mode (unset → preload), so the "forgot the export" case is caught.
    • Fires only on major ≥ 10; pre-Blackwell keeps preload (the more complete release path, tms' default).
    • Raises RuntimeError (survives python -O) with an actionable message.
    • Escape hatch MILES_TMS_ALLOW_PRELOAD_ON_BLACKWELL=1.
  • actor.py — call the guard right before hook_mode is applied.
  • New apply_rlix_offload_defaults() in rlix_validation.py + call in miles_validate_args (single normalization point) — forces offload_rollout=True under RLix. Idempotent; offload_train left untouched.
  • Tests: tests/fast/backends/test_tms_utils.py (19) + tests/fast/utils/test_rlix_offload_defaults.py (5).
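
For reviewers who want the guard's decision logic in one place, here is a minimal sketch of how the two helpers could fit together. It is not the code in tms_utils.py: the function and environment variable names come from this description, but the signatures, the injectable get_capability parameter, the handling of an empty variable, the early return when no CUDA device is present, and the error text are all assumptions made for illustration.

```python
import os
from typing import Callable, Mapping, Optional, Tuple

HOOK_MODE_ENV = "MILES_TMS_HOOK_MODE"
ALLOW_PRELOAD_ENV = "MILES_TMS_ALLOW_PRELOAD_ON_BLACKWELL"
BLACKWELL_MIN_MAJOR = 10  # compute capability major >= 10, per the summary above


def resolve_tms_hook_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the effective hook mode; unset resolves to tms' default, "preload"."""
    env = os.environ if env is None else env
    # Assumption: an empty value is treated the same as an unset variable.
    return env.get(HOOK_MODE_ENV) or "preload"


def assert_tms_hook_mode_matches_arch(
    env: Optional[Mapping[str, str]] = None,
    get_capability: Optional[Callable[[], Tuple[int, int]]] = None,
) -> None:
    """Raise RuntimeError if preload mode would run on a CC major >= 10 GPU."""
    env = os.environ if env is None else env
    if get_capability is None:
        import torch  # lazy import so the sketch can be exercised without torch

        if not torch.cuda.is_available():
            return  # assumption: nothing to check without a CUDA device
        get_capability = torch.cuda.get_device_capability

    major, _minor = get_capability()
    if major < BLACKWELL_MIN_MAJOR:
        return  # pre-Blackwell keeps preload
    if resolve_tms_hook_mode(env) != "preload":
        return  # e.g. MILES_TMS_HOOK_MODE=torch
    if env.get(ALLOW_PRELOAD_ENV) == "1":
        return  # explicit escape hatch
    # RuntimeError rather than assert, so the check survives `python -O`.
    raise RuntimeError(
        f"tms hook mode resolved to 'preload' on a GPU with compute capability "
        f"major {major} (>= {BLACKWELL_MIN_MAJOR}). Set {HOOK_MODE_ENV}=torch, "
        f"or set {ALLOW_PRELOAD_ENV}=1 to override."
    )
```

Passing the environment and the capability lookup as parameters is what lets the Blackwell raise path be exercised on a non-Blackwell box, for example assert_tms_hook_mode_matches_arch({}, lambda: (12, 0)) raises while the same call with lambda: (8, 9) stays silent.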

Test plan / results

Run on vast.ai 4× L4 (Ada sm_89, torch 2.11.0+cu129, CUDA 12.9):

$ python -m pytest tests/fast/utils/test_rlix_offload_defaults.py \
                   tests/fast/backends/test_tms_utils.py -q
======================= 24 passed, 29 warnings in 8.42s ========================

Live non-mocked check on the real L4 (sm_89, major 8 < 10): actor.py imports after the refactor, and the guard correctly stays silent for preload/unset/torch. The Blackwell raise path is covered by mocked unit tests (no Blackwell GPU on this box).

Full changelog + results: dev-notes/2026-06-06-tms-arch-guard-and-rlix-offload.md.

Codex sign-off

APPROVE — no CRITICAL / HIGH / MEDIUM. One LOW (in-place args mutation) is intentional and matches the surrounding miles_validate_args style.
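
To make the LOW finding concrete, the in-place normalization can be sketched roughly as below. This is an illustrative assumption, not the code in rlix_validation.py: the PR does not show how RLix mode is detected, so the sketch takes it as a hypothetical rlix_mode boolean, and returning args is also an assumption.

```python
import argparse


def apply_rlix_offload_defaults(
    args: argparse.Namespace, rlix_mode: bool
) -> argparse.Namespace:
    """Force offload_rollout=True under RLix; mutates args in place.

    `rlix_mode` is a hypothetical parameter standing in for however the real
    code decides that RLix is active.
    """
    if rlix_mode and not getattr(args, "offload_rollout", False):
        # In-place mutation, matching the style described for miles_validate_args.
        args.offload_rollout = True
    # offload_train is deliberately left untouched.
    return args


# Usage: calling the function twice gives the same result (idempotent).
args = argparse.Namespace(offload_rollout=False, offload_train=False)
apply_rlix_offload_defaults(args, rlix_mode=True)
apply_rlix_offload_defaults(args, rlix_mode=True)
```

Because the function only ever sets the flag to True when RLix is active, repeated calls and calls with the flag already set are no-ops, which is the idempotence property listed under Changes.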

Follow-ups (not in this PR)

  • Auto-detect hook mode (default torch when CC ≥ 10 / CUDA ≥ 12.9) so the export isn't required at all.
  • Decide whether RLix partial-overlap also needs offload_train forced on.

… under RLix

Two adjacent F1 sleep/wake footguns that previously failed silently:

1. torch_memory_saver "preload" hook segfaults on Blackwell (CC major >= 10)
   under CUDA 12.9 — a tracebackless SIGSEGV during build_cpu_bucket_cache.
   Add assert_tms_hook_mode_matches_arch() (new torch-only tms_utils module)
   that resolves the effective mode (unset -> preload), fires only on
   major >= 10, and raises a clear RuntimeError telling the operator to set
   MILES_TMS_HOOK_MODE=torch. Escape hatch: MILES_TMS_ALLOW_PRELOAD_ON_BLACKWELL=1.

2. RLix mode requires enable_memory_saver (gated by offload_rollout) or the
   first shrink_engines OOMs. Add apply_rlix_offload_defaults() and call it at
   the single normalization point in miles_validate_args so RLix forces
   offload_rollout=True without the operator remembering --offload-rollout.
   offload_train is left untouched.

Tests: tests/fast/backends/test_tms_utils.py (19) +
tests/fast/utils/test_rlix_offload_defaults.py (5). 24 passed on vast L4.