feat(m11): fail-fast tms hook-mode arch guard + force offload_rollout under RLix by zhenyulincs · Pull Request #16 · rlops/miles

zhenyulincs · 2026-06-07T06:52:58Z

Summary

Two adjacent F1 sleep/wake footguns that previously failed silently or with a tracebackless crash:

1. torch_memory_saver (tms) preload hook segfaults on Blackwell. tms' default preload hook (an LD_PRELOAD libc-malloc interposer) SIGSEGVs on Blackwell-class GPUs (compute capability major ≥ 10: RTX 50xx sm_120, B100/B200 sm_100) under CUDA 12.9, during build_cpu_bucket_cache, with no Python traceback. Forgetting MILES_TMS_HOOK_MODE=torch silently falls back to preload → crash.
2. RLix mode needs --offload-rollout but didn't enforce it. enable_memory_saver (which lets SGLang return VRAM on release_memory_occupation) is gated by args.offload_rollout. Under RLix this is mandatory: without it the first shrink_engines OOMs (M11.1 attempt-5 bug). Nothing forced it on.

Changes

- New miles/backends/megatron_utils/tms_utils.py: resolve_tms_hook_mode() + assert_tms_hook_mode_matches_arch(). torch-only so it's unit-testable without the Megatron actor stack.
  - Checks the resolved mode (unset → preload), so the "forgot the export" case is caught.
  - Fires only on major ≥ 10; pre-Blackwell keeps preload (the more complete release path, tms' default).
  - Raises RuntimeError (survives python -O) with an actionable message.
  - Escape hatch MILES_TMS_ALLOW_PRELOAD_ON_BLACKWELL=1.
- actor.py: call the guard right before hook_mode is applied.
- New apply_rlix_offload_defaults() in rlix_validation.py + call in miles_validate_args (single normalization point): forces offload_rollout=True under RLix. Idempotent; offload_train left untouched.
- Tests: tests/fast/backends/test_tms_utils.py (19) + tests/fast/utils/test_rlix_offload_defaults.py (5).
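
For readers who have not opened the diff, the guard's decision logic as described in the list above can be sketched roughly as follows. This is an illustration, not the code in tms_utils.py: the function names and the two environment variables come from this PR, but the signatures, the cc_major and env parameters, and the message text are assumptions made so the sketch runs without torch or a GPU.

```python
import os

PRELOAD = "preload"


def resolve_tms_hook_mode(env=None):
    """Return the effective hook mode; an unset or empty variable means preload."""
    env = os.environ if env is None else env
    return env.get("MILES_TMS_HOOK_MODE") or PRELOAD


def assert_tms_hook_mode_matches_arch(cc_major, env=None):
    """Raise if the resolved mode is preload on a Blackwell-class GPU.

    cc_major is the compute-capability major version. Assumption: a real
    caller would obtain it from torch.cuda.get_device_capability()[0].
    """
    env = os.environ if env is None else env
    mode = resolve_tms_hook_mode(env)
    if cc_major < 10 or mode != PRELOAD:
        return mode  # pre-Blackwell, or a non-preload mode: stay silent
    if env.get("MILES_TMS_ALLOW_PRELOAD_ON_BLACKWELL") == "1":
        return mode  # explicit escape hatch
    # RuntimeError rather than assert, so the check survives python -O.
    raise RuntimeError(
        f"tms hook mode '{mode}' is known to segfault on compute capability "
        f"{cc_major}.x; set MILES_TMS_HOOK_MODE=torch "
        "(or MILES_TMS_ALLOW_PRELOAD_ON_BLACKWELL=1 to override)."
    )
```

Passing the environment as a mapping is a choice made for this sketch only; it keeps the unset case (an empty dict) trivial to exercise.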

Test plan / results

Run on vast.ai 4× L4 (Ada sm_89, torch 2.11.0+cu129, CUDA 12.9):

$ python -m pytest tests/fast/utils/test_rlix_offload_defaults.py \
                   tests/fast/backends/test_tms_utils.py -q
======================= 24 passed, 29 warnings in 8.42s ========================

Live non-mocked check on the real L4 (sm_89, major 8 < 10): actor.py imports after the refactor, and the guard correctly stays silent for preload/unset/torch. The Blackwell raise path is covered by mocked unit tests (no Blackwell GPU on this box).
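
As a rough illustration of how a Blackwell raise path can be exercised on a non-Blackwell box, the sketch below patches the capability query and the environment with unittest.mock. It is not taken from tests/fast/backends/test_tms_utils.py: fake_cuda, guard, and guard_raises are hypothetical stand-ins (fake_cuda plays the role of torch.cuda) so the example runs without torch.

```python
import os
from types import SimpleNamespace
from unittest import mock

# Stand-in for torch.cuda on the L4 box: reports Ada, compute capability (8, 9).
fake_cuda = SimpleNamespace(get_device_capability=lambda: (8, 9))


def guard(cuda=fake_cuda):
    """Simplified, hypothetical guard: reject resolved 'preload' on major >= 10."""
    mode = os.environ.get("MILES_TMS_HOOK_MODE") or "preload"
    major, _minor = cuda.get_device_capability()
    if major >= 10 and mode == "preload":
        raise RuntimeError("preload hook on Blackwell; set MILES_TMS_HOOK_MODE=torch")
    return mode


def guard_raises(capability, env):
    """Run the guard under a mocked capability and an isolated environment."""
    with mock.patch.object(fake_cuda, "get_device_capability", return_value=capability), \
         mock.patch.dict(os.environ, env, clear=True):
        try:
            guard()
        except RuntimeError:
            return True
        return False
```

Both patches are undone when the with block exits, so the real environment and the (8, 9) capability are restored after each call.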

Full changelog + results: dev-notes/2026-06-06-tms-arch-guard-and-rlix-offload.md.

Codex sign-off

APPROVE — no CRITICAL / HIGH / MEDIUM. One LOW (in-place args mutation) is intentional and matches the surrounding miles_validate_args style.

Follow-ups (not in this PR)

- Auto-detect hook mode (default torch when CC ≥ 10 / CUDA ≥ 12.9) so the export isn't required at all.
- Decide whether RLix partial-overlap also needs offload_train forced on.

… under RLix

Two adjacent F1 sleep/wake footguns that previously failed silently:

1. torch_memory_saver "preload" hook segfaults on Blackwell (CC major >= 10) under CUDA 12.9: a tracebackless SIGSEGV during build_cpu_bucket_cache. Add assert_tms_hook_mode_matches_arch() (new torch-only tms_utils module) that resolves the effective mode (unset -> preload), fires only on major >= 10, and raises a clear RuntimeError telling the operator to set MILES_TMS_HOOK_MODE=torch. Escape hatch: MILES_TMS_ALLOW_PRELOAD_ON_BLACKWELL=1.
2. RLix mode requires enable_memory_saver (gated by offload_rollout) or the first shrink_engines OOMs. Add apply_rlix_offload_defaults() and call it at the single normalization point in miles_validate_args so RLix forces offload_rollout=True without the operator remembering --offload-rollout. offload_train is left untouched.

Tests: tests/fast/backends/test_tms_utils.py (19) + tests/fast/utils/test_rlix_offload_defaults.py (5). 24 passed on vast L4.
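
The normalization in item 2 of the commit message can be pictured with a minimal sketch like the one below. It assumes args is an argparse-style namespace; the rlix_enabled parameter is hypothetical, since how RLix mode is detected is not stated here, and the body is an illustration of the described behavior rather than the code in rlix_validation.py.

```python
from argparse import Namespace


def apply_rlix_offload_defaults(args, rlix_enabled):
    """Force offload_rollout=True under RLix; leave every other field alone.

    rlix_enabled is a hypothetical flag standing in for however the real
    code decides that RLix mode is active.
    """
    if rlix_enabled and not getattr(args, "offload_rollout", False):
        # In-place mutation, applied once at the normalization point.
        args.offload_rollout = True
    return args


# Usage: the operator forgot --offload-rollout but RLix is active.
args = Namespace(offload_rollout=False, offload_train=False)
apply_rlix_offload_defaults(args, rlix_enabled=True)
apply_rlix_offload_defaults(args, rlix_enabled=True)  # second call is a no-op
```

Because the function only ever sets the flag to True, calling it repeatedly is harmless, which matches the "idempotent" note in the PR description.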

zhenyulincs wants to merge 1 commit into zhenyu/m11-mvp-test from zhenyu/tms-arch-guard-rlix-offload
