
Add play (inference) benchmark with PlayBundle (benchmark refactor, Part 4/4) #6201

Open
AntoineRichard wants to merge 15 commits into isaac-sim:develop from AntoineRichard:antoiner/benchmark-play

Conversation

@AntoineRichard

Collaborator

Description

Part 4 of 4 of the benchmark refactor series — a checkpoint-driven play (inference) benchmark, the inference counterpart to the training benchmark.

Stacked on Part 3 (#6199). The diff against develop below also includes Parts 1–3 until they merge. For the incremental Part 4 changes only, view:
AntoineRichard/IsaacLab@antoiner/benchmark-training...antoiner/benchmark-play

Series: Part 1/4 core (#6197) → Part 2/4 runtime + startup (#6198) → Part 3/4 training (#6199) → Part 4/4 play (this PR).

Loads a trained checkpoint, runs the policy-driven rollout, and emits a new typed PlayBundle capturing inference/step performance plus the played policy's reward / episode-length / success.

Adds:

  • PlayBundle schema type (mirrors RuntimeBundle with run.framework set, plus typed success_rate / reward / ep_length (MeanStd) / checkpoint_path / video_path; no learning curve). Additive — Odin gains a play.json shape; existing bundles unchanged.
  • Core helpers: build_play_bundle, run_play_loop (policy-driven rollout, aggregates per-episode return/length/success; handles 4- and 5-tuple step signatures + numpy returns), and resolve_play_checkpoint (chain: --checkpoint <path or Nucleus URI> → else the published Nucleus checkpoint with a warning → else a clear error).
  • scripts/benchmarks/play.py dispatcher over --rl_library {rsl_rl, rl_games, skrl, sb3} + per-backend bench_play_<backend>.py adapters (each mirrors its reinforcement_learning/<backend>/play.py checkpoint-load + inference policy; develop launch API).
  • Four gated generate-then-play smokes (train a tiny checkpoint, then play it).
  • Docs (benchmarks.rst play section + arg table) and a 3.0 migration-guide entry.
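The typed PlayBundle fields listed above can be pictured as a frozen dataclass. This is a minimal sketch, not the schema.py definition: the class names MeanStdSketch and PlayBundleSketch, the field types, and the flat framework field (standing in for run.framework) are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class MeanStdSketch:
    """Hypothetical mean/std pair for a per-episode quantity."""

    mean: float
    std: float


@dataclass(frozen=True)
class PlayBundleSketch:
    """Hypothetical stand-in for the play-specific part of PlayBundle."""

    framework: str  # stands in for run.framework, e.g. "rsl_rl"
    checkpoint_path: str
    success_rate: Optional[float] = None  # None when the task logs no success metric
    reward: Optional[MeanStdSketch] = None  # None when no episode completed
    ep_length: Optional[MeanStdSketch] = None
    video_path: Optional[str] = None


def to_play_json(bundle: PlayBundleSketch) -> str:
    """Serialise the bundle; asdict() recurses into the nested MeanStd fields."""
    return json.dumps(asdict(bundle), sort_keys=True)


bundle = PlayBundleSketch(
    framework="rsl_rl",
    checkpoint_path="/tmp/model.pt",
    reward=MeanStdSketch(mean=4.2, std=0.3),
    ep_length=MeanStdSketch(mean=250.0, std=0.0),
)
print(to_play_json(bundle))
```

Freezing the dataclass keeps a finished bundle immutable between assembly and serialisation, which is consistent with the "frozen dataclass" description of PlayBundle in the review below.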

Validated on develop (Newton/MJWarp): all four backends generate-then-play and emit a valid PlayBundle (rsl_rl ≈7.5k inference FPS, rl_games ≈209k @512 envs, skrl ≈5.4k, sb3 ≈4.6k; reward/ep_length populated). Note: reward/ep_length/success_rate aggregate only completed episodes, so --num_frames must exceed the task's episode length (documented).

Fixes # (n/a)

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist

  • I have read and understood the contribution guidelines
  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have added a changelog fragment under source/<pkg>/changelog.d/ for every touched package
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

@github-actions github-actions Bot added the documentation (Improvements or additions to documentation) and isaac-lab (Related to Isaac Lab team) labels Jun 16, 2026

Introduce the capture, metrics, builders, stepping, profiling, and
backend_descriptor submodules for assembling the schema-v1 benchmark
bundles, add a schema output backend, and let BaseIsaacLabBenchmark emit
several backends in one run via a new attach_bundle hook. Unit tests
cover each submodule plus the schema backend and multi-backend finalize.

Part 1 of a series splitting the oversized benchmark refactor
(core -> runtime/startup -> training -> play).
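The Part 1 idea of emitting several output backends from one run, with a typed bundle attached through an attach_bundle hook, can be sketched as follows. This is a hypothetical illustration, not BaseIsaacLabBenchmark: apart from the name attach_bundle, the class, the method names, the "json"/"schema" backend labels, and the rule of suffixing filenames only when several backends are requested are all assumptions.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Optional


class MultiBackendSketch:
    """Hypothetical benchmark that writes one output file per requested backend."""

    def __init__(self, backend_type: list[str], output_dir: Path, stem: str = "benchmark"):
        self.backend_type = backend_type
        self.output_dir = output_dir
        self.stem = stem
        self.measurements: dict[str, float] = {}
        self._bundle: Optional[dict[str, Any]] = None

    def attach_bundle(self, bundle: dict[str, Any]) -> None:
        """Store a typed bundle for the schema backend to serialise at finalize time."""
        self._bundle = bundle

    def _path_for(self, backend: str) -> Path:
        # Assumed rule: suffix the filename only when several backends share one run.
        suffix = f"_{backend}" if len(self.backend_type) > 1 else ""
        return self.output_dir / f"{self.stem}{suffix}.json"

    def finalize(self) -> list[Path]:
        written = []
        for backend in self.backend_type:
            if backend == "schema":
                if self._bundle is None:
                    continue  # no bundle attached: skip instead of failing
                payload = self._bundle
            else:
                payload = self.measurements  # flat measurements for other backends
            path = self._path_for(backend)
            path.write_text(json.dumps(payload))
            written.append(path)
        return written


with tempfile.TemporaryDirectory() as tmp:
    bench = MultiBackendSketch(["json", "schema"], Path(tmp))
    bench.measurements["fps"] = 7500.0
    bench.attach_bundle({"kind": "play", "reward": {"mean": 1.0, "std": 0.0}})
    print(sorted(p.name for p in bench.finalize()))  # ['benchmark_json.json', 'benchmark_schema.json']
```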

Add backend-agnostic runtime.py (random-action stepping, emits a
RuntimeBundle) and startup.py (cProfile startup-phase profiling, emits a
StartupBundle), wired to develop's launch API (launch_simulation and
add_launcher_args from isaaclab.app; preset tokens forwarded to Hydra
without folding). Remove the legacy benchmark_non_rl.py and
benchmark_startup.py scripts plus the run_non_rl_benchmarks.sh and
run_physx_benchmarks.sh runner shells; repoint benchmark_hydra_resolve
at _common.get_backend_type.
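Per-phase startup profiling with cProfile can look roughly like the sketch below. It relies only on the standard-library cProfile and pstats modules. The profile_phase helper and the phase names are hypothetical and do not reflect how startup.py structures its phases or fills its StartupBundle.

```python
import cProfile
import pstats
import time
from contextlib import contextmanager

phase_profiles: dict[str, cProfile.Profile] = {}
phase_wall_s: dict[str, float] = {}


@contextmanager
def profile_phase(name: str):
    """Profile one startup phase, recording its wall time and a cProfile snapshot."""
    profiler = cProfile.Profile()
    start = time.perf_counter()
    profiler.enable()
    try:
        yield
    finally:
        profiler.disable()
        phase_wall_s[name] = time.perf_counter() - start
        phase_profiles[name] = profiler


def busy_work() -> int:
    """Stand-in for real startup work such as launching the app or building the env."""
    return sum(i * i for i in range(50_000))


with profile_phase("app_launch"):
    busy_work()
with profile_phase("env_creation"):
    busy_work()

for name, profiler in phase_profiles.items():
    stats = pstats.Stats(profiler).sort_stats("cumulative")
    print(f"{name}: {phase_wall_s[name] * 1e3:.1f} ms wall, {stats.total_calls} calls")
```

Keeping one profiler per phase, rather than one for the whole startup, makes it easy to attribute hot spots to a specific phase.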

Part 2 of the benchmark refactor series (core -> runtime/startup ->
training -> play); stacked on Part 1 (isaac-sim#6197).

Add training.py dispatching over --rl_library {rsl_rl, rl_games, skrl,
sb3}; each adapter runs real training under BenchmarkMonitor and emits a
TrainingBundle via the shared core, with an optional success-metric early
stop. Scripts use develop's launch API (launch_simulation from
isaaclab.app; preset tokens forwarded without folding). Remove the legacy
benchmark_rsl_rl.py / benchmark_rlgames.py scripts, the
run_training_benchmarks.sh runner shell, and the obsolete utils.py helper.
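The optional success-metric early stop can be sketched as a small stateful check that is fed once per training iteration. The class name, the threshold, and the moving-window rule are illustrative assumptions. The commit message does not specify the adapters' actual flags or stopping criterion.

```python
from collections import deque
from typing import Optional


class SuccessEarlyStop:
    """Hypothetical rule: stop once the mean of the last `window` success-rate
    readings reaches `threshold`."""

    def __init__(self, threshold: float, window: int = 3):
        self.threshold = threshold
        self.recent = deque(maxlen=window)  # most recent success-rate readings

    def update(self, success_rate: Optional[float]) -> bool:
        if success_rate is None:  # the task does not log a success metric
            return False
        self.recent.append(success_rate)
        full = len(self.recent) == self.recent.maxlen
        return full and sum(self.recent) / len(self.recent) >= self.threshold


stopper = SuccessEarlyStop(threshold=0.9, window=3)
history = [0.5, 0.8, 0.92, 0.95, 0.97, 0.99]
stop_at = next(i for i, s in enumerate(history) if stopper.update(s))
print(stop_at)  # 4
```

Averaging over a short window avoids stopping on a single lucky iteration: at index 3 the window mean is only 0.89, so the run continues one more iteration.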

Part 3 of the benchmark refactor series (core -> runtime/startup ->
training -> play); stacked on Parts 1-2 (isaac-sim#6197, isaac-sim#6198).

Introduce scripts/benchmarks/play.py, a --rl_library dispatcher mirroring
training.py, plus the rsl_rl inference adapter
scripts/benchmarks/rsl_rl/bench_play_rsl_rl.py. The adapter resolves a
checkpoint via resolve_play_checkpoint, loads the policy the way the
rsl_rl play script does, rolls it out under a BenchmarkMonitor using
run_play_loop, and emits a PlayBundle.

Roll out a checkpointed skrl policy under a BenchmarkMonitor and emit a
PlayBundle. The skrl env wrapper returns reward and done tensors shaped
(num_envs, 1); reshape them to (num_envs,) in run_play_loop so the
per-environment return accumulator broadcasts correctly across backends.
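The (num_envs, 1) shape matters because an out-of-place add of a column vector to a flat per-environment accumulator silently broadcasts to a (num_envs, num_envs) matrix instead of failing. A minimal NumPy sketch follows. The adapter itself works on torch tensors, which follow the same broadcasting rules.

```python
import numpy as np

num_envs = 4
returns = np.zeros(num_envs)  # per-env episode-return accumulator, shape (4,)
rewards = np.ones((num_envs, 1))  # skrl-style reward column, shape (4, 1)

wrong = returns + rewards  # (4,) with (4, 1) broadcasts to (4, 4)
right = returns + rewards.reshape(num_envs)  # flatten first, giving shape (4,)

print(wrong.shape, right.shape)  # (4, 4) (4,)
```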

Roll out a checkpointed Stable-Baselines3 policy under a BenchmarkMonitor
and emit a PlayBundle. The sb3 vec env returns NumPy reward/done arrays and
a per-environment list of info dicts; coerce reward and dones onto the env
device in run_play_loop so CPU NumPy returns do not clash with the on-device
accumulators, and skip success extraction when the info value is not a dict.
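The SB3-specific normalisation can be sketched without torch. The idea is to coerce whatever the vec env returns (a NumPy array, a list, or a tuple) into a flat list of floats, and to read a success value only when the info object is a dict. The helper names and the "success_rate" key are assumptions. According to the commit message, run_play_loop instead moves the values onto the env's torch device.

```python
from typing import Any, Iterable, Optional


def to_float_list(values: Iterable[Any]) -> list[float]:
    """Flatten per-env rewards/dones from a NumPy array, list, or tuple into floats."""
    flat = []
    for v in values:
        if hasattr(v, "__len__"):  # a length-1 row, e.g. from a (num_envs, 1) array
            (v,) = v
        flat.append(float(v))
    return flat


def extract_success(info: Any, key: str = "success_rate") -> Optional[float]:
    """Read a success value only when `info` is a dict; anything else yields None."""
    if not isinstance(info, dict):
        return None  # e.g. SB3's per-env list of info dicts
    value = info.get(key)
    return None if value is None else float(value)


print(to_float_list([[1.0], [0.5]]), to_float_list((True, False)))  # [1.0, 0.5] [1.0, 0.0]
print(extract_success([{"success_rate": 1.0}]), extract_success({"success_rate": 0.25}))  # None 0.25
```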

- rl_games adapter: read obs_groups/concate_obs_groups from the agent cfg and pass
  them to RlGamesVecEnvWrapper, so tasks with asymmetric/non-default observation
  layouts feed the policy the same observation it was trained on.
- _common: key the published-checkpoint lookup on the bare training-task name (drop
  any namespace prefix and the -Play suffix), matching the reinforcement_learning
  play scripts; add a unit-testable _published_task_name helper.
- rl_games adapter: drop the inaccurate RNN-state-reset claim from the policy docstring.
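The published-checkpoint key normalisation can be sketched as a string transform. This is a hypothetical re-implementation, not _published_task_name. It assumes task IDs shaped like [namespace/]Name[-Play][-vN], and both the "/" namespace separator and that ID layout are assumptions.

```python
import re


def published_task_name_sketch(task: str) -> str:
    """Strip a gym-style namespace prefix and a -Play variant marker so that a
    play task maps onto its training task's published checkpoint."""
    bare = task.rsplit("/", 1)[-1]  # "ns/Isaac-Foo-Play-v0" -> "Isaac-Foo-Play-v0"
    # Drop "-Play" only when it sits right before the version tag or at the very end,
    # so names that merely contain "Play" (e.g. "-Playground") are left alone.
    return re.sub(r"-Play(?=-v\d+$|$)", "", bare)


print(published_task_name_sketch("Isaac-Velocity-Flat-Anymal-C-Play-v0"))  # Isaac-Velocity-Flat-Anymal-C-v0
```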

@AntoineRichard AntoineRichard force-pushed the antoiner/benchmark-play branch from 53df4a3 to c1f234f June 16, 2026 12:22
@AntoineRichard AntoineRichard marked this pull request as ready for review June 16, 2026 12:33
@greptile-apps

greptile-apps Bot commented Jun 16, 2026

Contributor

Greptile Summary

This PR is Part 4/4 of a benchmark refactor series, adding a checkpoint-driven play (inference) benchmark with a new PlayBundle schema type, four backend adapters (rsl_rl, rl_games, skrl, sb3), and a unified play.py dispatcher. It also lands the shared scaffolding from Parts 1–3 (schema, capture, metrics, builders, stepping, and SchemaBundleFile backend).

  • New PlayBundle schema mirrors RuntimeBundle with framework set and adds typed success_rate / reward / ep_length (MeanStd) / checkpoint_path / video_path fields; serialised to play.json via the existing SchemaBundleFile backend.
  • run_play_loop in stepping.py runs a policy-driven rollout, accumulates per-episode returns/lengths, and handles both 4- and 5-tuple step signatures plus NumPy returns from SB3.
  • resolve_play_checkpoint in _common.py chains explicit --checkpoint → published Nucleus fallback → FileNotFoundError; the fallback branch is missing a _retrieve_file_path call, leaving adapters with a raw Nucleus URI.

Confidence Score: 3/5

The new play benchmark scaffolding (schema, builders, capture, metrics) is clean and well-tested. The single actionable defect is in resolve_play_checkpoint: when no --checkpoint is supplied the published Nucleus URI is returned as-is to the adapters, which cannot load a remote URI and will crash at runtime.

The core schema types, builder functions, and run_play_loop are solid and covered by unit tests. However, the checkpoint-resolution fallback path in _common.py skips the download step that is applied to every user-supplied path, so running the play benchmark without an explicit --checkpoint argument will fail for all four backends. This is the primary code path the PR adds value for and it is broken at the point where the resolved path is returned.

scripts/benchmarks/_common.py (resolve_play_checkpoint fallback), source/isaaclab/isaaclab/test/benchmark/stepping.py (success_rate aggregation methodology), scripts/benchmarks/skrl/bench_play_skrl.py (unnecessary env.state() call in policy closure)

Important Files Changed

| Filename | Overview |
|---|---|
| scripts/benchmarks/_common.py | New shared helper module; contains a bug where the published-checkpoint fallback in resolve_play_checkpoint skips _retrieve_file_path, returning a raw Nucleus URI that adapters cannot load. |
| source/isaaclab/isaaclab/test/benchmark/stepping.py | New stepping helpers including run_play_loop; logic is mostly sound but success_rate aggregation broadcasts a batch-mean scalar to every done environment in a step, which may misrepresent per-episode outcomes in large vectorized settings. |
| source/isaaclab/isaaclab/test/benchmark/schema.py | Adds PlayBundle frozen dataclass; mirrors RuntimeBundle structure and adds success_rate, reward, ep_length, checkpoint_path, video_path; clean and consistent with existing bundle types. |
| source/isaaclab/isaaclab/test/benchmark/builders.py | New pure-assembly builders for all bundle types including build_play_bundle; well-structured, Isaac-Sim-free, and correctly delegates to schema and metrics modules. |
| scripts/benchmarks/rsl_rl/bench_play_rsl_rl.py | New RSL-RL play adapter; mirrors the official play.py inference path and correctly emits PlayBundle; timing, checkpoint loading, and env wrapping look consistent with the training adapter. |
| scripts/benchmarks/rl_games/bench_play_rl_games.py | New RL-Games play adapter; checkpoint loading follows the original play.py double-restore pattern; FPS computed correctly per-step. |
| scripts/benchmarks/skrl/bench_play_skrl.py | New SKRL play adapter; policy closure calls env.state() on every inference step, fetching privileged critic observations unnecessarily during rollout. |
| scripts/benchmarks/sb3/bench_play_sb3.py | New SB3 play adapter; VecNormalize loading logic correctly handles the saved .pkl case; fallback training=True path is inherited from the original play.py. |
| source/isaaclab/isaaclab/test/benchmark/benchmark_core.py | Refactored to accept list[str] for backend_type, adds attach_bundle for schema serialisation, and routes bundle kwarg through finalize; multi-backend filename suffix logic is correct. |
| source/isaaclab/isaaclab/test/benchmark/backends.py | Adds SchemaBundleFile backend that serialises the attached typed bundle; correctly ignores flat measurement phases and handles missing bundle gracefully. |

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant CLI as play.py dispatcher
    participant Adapter as bench_play adapter
    participant Common as _common.py
    participant StepFn as stepping.run_play_loop
    participant BM as BenchmarkMonitor
    participant Build as builders
    participant Out as SchemaBundleFile

    CLI->>Adapter: dispatch_library_entrypoint()
    Adapter->>Common: resolve_play_checkpoint(checkpoint, framework, task)
    alt checkpoint provided by user
        Common-->>Adapter: _retrieve_file_path(checkpoint) - local path
    else fallback published Nucleus checkpoint
        Common-->>Adapter: raw Nucleus URI (missing _retrieve_file_path)
    end
    Adapter->>BM: enter BenchmarkMonitor context
    Adapter->>StepFn: run_play_loop(env, policy, num_frames)
    StepFn-->>Adapter: step_times, reward, ep_length, success_rate
    BM-->>Adapter: exit + update_manual_recorders
    Adapter->>Build: build_play_bundle(run, versions, hardware, runtime, ...)
    Build-->>Adapter: PlayBundle
    Adapter->>Out: attach_bundle then _finalize_impl writes play.json

Comments Outside Diff (2)

  1. source/isaaclab/isaaclab/test/benchmark/stepping.py, lines 1246-1258

    P2 Step-level success metric broadcast to each done environment

    _extract_success(extras) returns one scalar for the whole step — typically the batch-mean Metrics/success_rate logged across all num_envs environments. When multiple environments finish in the same step the same scalar is appended once per done environment. For large vectorised setups (e.g., 512 envs) several environments can terminate in a single step, and if the logged success metric represents the fraction of all num_envs environments that succeeded (not just the ones that finished), each done environment receives an inaccurate credit. The result is a success_rate that may not reflect actual per-episode success outcomes.

  2. scripts/benchmarks/skrl/bench_play_skrl.py, lines 1037-1040

    P2 env.state() called every inference step inside policy

    env.state() is called on each call to policy(obs) inside run_play_loop. This method returns the privileged critic observation used only during training; the policy network itself needs only obs. Calling it at every inference step incurs unnecessary computation, and on environments that distinguish actor/critic inputs it may also fetch a privileged buffer that the policy ignores. The other three backend adapters (rsl_rl, rl_games, sb3) do not call an equivalent state method in their policy closures.
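The first outside-diff comment (success_rate in stepping.py) can be made concrete with a tiny worked example. Suppose 8 environments, of which envs 0 and 1 finish in the same step, with env 0 succeeding and env 1 failing. The step-level logged value is a hypothetical 0.75 that does not describe those two episodes. The per_episode variant assumes per-environment success flags are available, which is an assumption rather than something the PR exposes.

```python
def broadcast_scalar(done: list[bool], step_metric: float) -> list[float]:
    """Behaviour described in the review: one step-level scalar appended per done env."""
    return [step_metric for d in done if d]


def per_episode(done: list[bool], succeeded: list[bool]) -> list[float]:
    """Alternative: record each finished env's own outcome (needs per-env flags)."""
    return [float(s) for d, s in zip(done, succeeded) if d]


done = [True, True] + [False] * 6  # envs 0 and 1 finish this step
succeeded = [True, False] + [False] * 6  # env 0 succeeded, env 1 failed
step_metric = 0.75  # hypothetical batch-level value logged for this step

print(broadcast_scalar(done, step_metric))  # [0.75, 0.75]
print(per_episode(done, succeeded))  # [1.0, 0.0]
```

The broadcast variant credits these two episodes with a mean success of 0.75, while their actual outcomes average 0.5.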

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Reviews (1): Last reviewed commit: "Address play-benchmark review feedback"

Comment on lines +183 to +201
    Raises:
        FileNotFoundError: When no *checkpoint* is given and no published Nucleus
            checkpoint exists for *framework*/*task*.
    """
    if checkpoint:
        return _retrieve_file_path(checkpoint)

    logger.warning(
        "No --checkpoint given; falling back to the published Nucleus checkpoint for %s / %s.",
        framework,
        task,
    )
    path = _published_checkpoint(framework, task)
    if not path:
        raise FileNotFoundError(
            f"No checkpoint available for framework {framework!r} and task {task!r}: pass --checkpoint"
            " with a local or Nucleus path, or publish a Nucleus checkpoint for this task."
        )
    return path

Contributor


P1 Missing _retrieve_file_path call on published checkpoint fallback

When the user omits --checkpoint, resolve_play_checkpoint calls _published_checkpoint and returns its result directly — skipping the _retrieve_file_path download step that is applied to user-supplied paths. get_published_pretrained_checkpoint typically returns a Nucleus URI (omniverse://…), so every downstream adapter (runner.load(resume_path), agent.restore(resume_path), PPO.load(resume_path), etc.) will receive a raw URI string and fail with a file-not-found / invalid-path error. The explicit --checkpoint branch is correctly guarded; the fallback is not.
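Injecting the lookup and retrieval steps as callables makes this fix easy to cover with a unit test: the fallback result must come back as a local path, never as the raw URI. This is a minimal sketch. The published_lookup and retrieve parameters stand in for _published_checkpoint and _retrieve_file_path, and their signatures here, like the fake omniverse:// URI and cache path, are assumptions.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def resolve_checkpoint_sketch(
    checkpoint: Optional[str],
    framework: str,
    task: str,
    published_lookup: Callable[[str, str], Optional[str]],
    retrieve: Callable[[str], str],
) -> str:
    """Hypothetical dependency-injected version of the checkpoint fallback chain."""
    if checkpoint:
        return retrieve(checkpoint)
    logger.warning("No --checkpoint given; falling back to the published checkpoint for %s / %s.", framework, task)
    path = published_lookup(framework, task)
    if not path:
        raise FileNotFoundError(f"No checkpoint available for framework {framework!r} and task {task!r}")
    # The fix: resolve the published URI to a local file, exactly like the user-supplied branch.
    return retrieve(path)


def fake_retrieve(uri: str) -> str:
    """Fake download: map any URI to a file name in a local cache directory."""
    return "/tmp/cache/" + uri.rsplit("/", 1)[-1]


def fake_lookup(framework: str, task: str) -> Optional[str]:
    """Fake published-checkpoint table with a single known task."""
    return f"omniverse://example/{framework}/{task}/checkpoint.pt" if task == "Known-v0" else None


print(resolve_checkpoint_sketch(None, "rsl_rl", "Known-v0", fake_lookup, fake_retrieve))  # /tmp/cache/checkpoint.pt
```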

Comment on lines +195 to +201
path = _published_checkpoint(framework, task)
if not path:
raise FileNotFoundError(
f"No checkpoint available for framework {framework!r} and task {task!r}: pass --checkpoint"
" with a local or Nucleus path, or publish a Nucleus checkpoint for this task."
)
return path

Contributor


P1 Resolve the published Nucleus URI through _retrieve_file_path so it is downloaded to a local path before being returned, matching the behaviour for user-supplied paths.

Suggested change
    path = _published_checkpoint(framework, task)
    if not path:
        raise FileNotFoundError(
            f"No checkpoint available for framework {framework!r} and task {task!r}: pass --checkpoint"
            " with a local or Nucleus path, or publish a Nucleus checkpoint for this task."
        )
-   return path
+   return _retrieve_file_path(path)

