Add play (inference) benchmark with PlayBundle (benchmark refactor, Part 4/4) by AntoineRichard · Pull Request #6201 · isaac-sim/IsaacLab

AntoineRichard · 2026-06-16T10:04:42Z

Description

Part 4 of 4 of the benchmark refactor series — a checkpoint-driven play (inference) benchmark, the inference counterpart to the training benchmark.

Stacked on Part 3 (#6199). The diff against develop below also includes Parts 1–3 until they merge. For the incremental Part 4 changes only, view:
AntoineRichard/IsaacLab@antoiner/benchmark-training...antoiner/benchmark-play

Series: Part 1/4 core (#6197) → Part 2/4 runtime + startup (#6198) → Part 3/4 training (#6199) → Part 4/4 play (this PR).

Loads a trained checkpoint, runs the policy-driven rollout, and emits a new typed PlayBundle capturing inference/step performance plus the played policy's reward / episode-length / success.

Adds:

- PlayBundle schema type (mirrors RuntimeBundle with run.framework set, plus typed success_rate / reward / ep_length (MeanStd) / checkpoint_path / video_path; no learning curve). Additive: Odin gains a play.json shape; existing bundles are unchanged.
- Core helpers: build_play_bundle, run_play_loop (policy-driven rollout that aggregates per-episode return/length/success and handles 4- and 5-tuple step signatures plus NumPy returns), and resolve_play_checkpoint (chain: --checkpoint <path or Nucleus URI> → else the published Nucleus checkpoint with a warning → else a clear error).
- scripts/benchmarks/play.py dispatcher over --rl_library {rsl_rl, rl_games, skrl, sb3}, plus per-backend bench_play_<backend>.py adapters (each mirrors the checkpoint-load and inference policy of its reinforcement_learning/<backend>/play.py and uses develop's launch API).
- Four gated generate-then-play smoke tests (train a tiny checkpoint, then play it).
- Docs (benchmarks.rst play section + arg table) and a 3.0 migration-guide entry.
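
To make the play-specific part of the bundle easier to picture, here is a minimal sketch of how such a typed record could look. Only the field names (success_rate, reward, ep_length, checkpoint_path, video_path) and the MeanStd type name come from this description; the field types, the `mean`/`std` attribute names, the defaults, the `PlayBundleSketch` class, and the `to_json` helper are assumptions, and the fields shared with RuntimeBundle are left out.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class MeanStd:
    """Mean and standard deviation of a metric (attribute names are assumed)."""

    mean: float
    std: float


@dataclass(frozen=True)
class PlayBundleSketch:
    """Hypothetical subset of the play-specific fields; not the real schema."""

    framework: str
    checkpoint_path: str
    success_rate: Optional[float] = None
    reward: Optional[MeanStd] = None
    ep_length: Optional[MeanStd] = None
    video_path: Optional[str] = None

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses such as MeanStd.
        return json.dumps(asdict(self), sort_keys=True)


bundle = PlayBundleSketch(
    framework="rsl_rl",
    checkpoint_path="/tmp/model.pt",  # illustrative path
    reward=MeanStd(mean=12.5, std=1.5),
    ep_length=MeanStd(mean=300.0, std=0.0),
)
play_json = bundle.to_json()
```

Keeping the metrics optional in the sketch reflects that a rollout may finish without any completed episode, in which case there is nothing to report for them.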

Validated on develop (Newton/MJWarp): all four backends generate-then-play and emit a valid PlayBundle (rsl_rl ≈7.5k inference FPS, rl_games ≈209k @512 envs, skrl ≈5.4k, sb3 ≈4.6k; reward/ep_length populated). Note: reward/ep_length/success_rate aggregate only completed episodes, so --num_frames must exceed the task's episode length (documented).
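
The completed-episodes caveat can be illustrated with a small stand-in for the rollout accumulator. This is a sketch, not the code of run_play_loop: the function name, the plain-list inputs, and returning None when no episode finished are all assumptions made for the example.

```python
from statistics import mean, pstdev


def aggregate_completed_episodes(rewards, dones):
    """Aggregate return and length over episodes that finished in the rollout.

    rewards[t][i] and dones[t][i] are the reward and done flag of env i at step t.
    """
    num_envs = len(rewards[0]) if rewards else 0
    running_return = [0.0] * num_envs
    running_length = [0] * num_envs
    returns, lengths = [], []

    for step_rewards, step_dones in zip(rewards, dones):
        for i in range(num_envs):
            running_return[i] += step_rewards[i]
            running_length[i] += 1
            if step_dones[i]:
                # Only a finished episode contributes to the statistics.
                returns.append(running_return[i])
                lengths.append(running_length[i])
                running_return[i], running_length[i] = 0.0, 0

    if not returns:
        # No episode completed, so there is nothing to report.
        return None
    return {
        "reward": (mean(returns), pstdev(returns)),
        "ep_length": (mean(lengths), pstdev(lengths)),
        "episodes": len(returns),
    }


# Two envs whose episodes last 3 steps: a 2-step rollout completes nothing.
short = aggregate_completed_episodes([[1.0, 1.0]] * 2, [[False, False]] * 2)
# A 3-step rollout completes one episode per env.
full = aggregate_completed_episodes(
    [[1.0, 2.0]] * 3, [[False, False], [False, False], [True, True]]
)
```

In the sketch the short rollout yields no statistics at all, which is the situation the --num_frames note is warning about.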

Fixes # (n/a)

Type of change

New feature (non-breaking change which adds functionality)

Checklist

I have read and understood the contribution guidelines
I have run the pre-commit checks with ./isaaclab.sh --format
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
I have added a changelog fragment under source/<pkg>/changelog.d/ for every touched package
I have added my name to the CONTRIBUTORS.md or my name already exists there

Introduce the capture, metrics, builders, stepping, profiling, and backend_descriptor submodules for assembling the schema-v1 benchmark bundles, add a schema output backend, and let BaseIsaacLabBenchmark emit several backends in one run via a new attach_bundle hook. Unit tests cover each submodule plus the schema backend and multi-backend finalize. Part 1 of a series splitting the oversized benchmark refactor (core -> runtime/startup -> training -> play).

Add backend-agnostic runtime.py (random-action stepping, emits a RuntimeBundle) and startup.py (cProfile startup-phase profiling, emits a StartupBundle), wired to develop's launch API (launch_simulation and add_launcher_args from isaaclab.app; preset tokens forwarded to Hydra without folding). Remove the legacy benchmark_non_rl.py and benchmark_startup.py scripts plus the run_non_rl_benchmarks.sh and run_physx_benchmarks.sh runner shells; repoint benchmark_hydra_resolve at _common.get_backend_type. Part 2 of the benchmark refactor series (core -> runtime/startup -> training -> play); stacked on Part 1 (isaac-sim#6197).

Add training.py dispatching over --rl_library {rsl_rl, rl_games, skrl, sb3}; each adapter runs real training under BenchmarkMonitor and emits a TrainingBundle via the shared core, with an optional success-metric early stop. Scripts use develop's launch API (launch_simulation from isaaclab.app; preset tokens forwarded without folding). Remove the legacy benchmark_rsl_rl.py / benchmark_rlgames.py scripts, the run_training_benchmarks.sh runner shell, and the obsolete utils.py helper. Part 3 of the benchmark refactor series (core -> runtime/startup -> training -> play); stacked on Parts 1-2 (isaac-sim#6197, isaac-sim#6198).

Introduce scripts/benchmarks/play.py, a --rl_library dispatcher mirroring training.py, plus the rsl_rl inference adapter scripts/benchmarks/rsl_rl/bench_play_rsl_rl.py. The adapter resolves a checkpoint via resolve_play_checkpoint, loads the policy the way the rsl_rl play script does, rolls it out under a BenchmarkMonitor using run_play_loop, and emits a PlayBundle.

Roll out a checkpointed skrl policy under a BenchmarkMonitor and emit a PlayBundle. The skrl env wrapper returns reward and done tensors shaped (num_envs, 1); reshape them to (num_envs,) in run_play_loop so the per-environment return accumulator broadcasts correctly across backends.

Roll out a checkpointed Stable-Baselines3 policy under a BenchmarkMonitor and emit a PlayBundle. The sb3 vec env returns NumPy reward/done arrays and a per-environment list of info dicts; coerce reward and dones onto the env device in run_play_loop so CPU NumPy returns do not clash with the on-device accumulators, and skip success extraction when the info value is not a dict.

- rl_games adapter: read obs_groups/concate_obs_groups from the agent cfg and pass them to RlGamesVecEnvWrapper, so tasks with asymmetric/non-default observation layouts feed the policy the same observation it was trained on.
- _common: key the published-checkpoint lookup on the bare training-task name (drop any namespace prefix and the -Play suffix), matching the reinforcement_learning play scripts; add a unit-testable _published_task_name helper.
- rl_games adapter: drop the inaccurate RNN-state-reset claim from the policy docstring.
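
The task-name normalisation described for _common could look roughly like the sketch below. The body of the real _published_task_name helper is not shown in this PR text, so the function here is a hypothetical stand-in: it assumes a namespace prefix is separated by ":" and that play variants carry a "-Play" marker, and the task string is only an illustrative input.

```python
def published_task_name(task: str) -> str:
    """Return the bare training-task name used as the checkpoint lookup key.

    Assumptions of this sketch: a namespace prefix is separated by ":" and
    play variants carry a "-Play" marker somewhere in the name.
    """
    bare = task.split(":")[-1]  # drop any namespace prefix
    return bare.replace("-Play", "")  # drop the play-variant marker


# "my_pkg" is a made-up namespace used only for illustration.
name = published_task_name("my_pkg:Isaac-Velocity-Flat-Anymal-C-Play-v0")
plain = published_task_name("Isaac-Velocity-Flat-Anymal-C-v0")
```

A pure string helper like this needs no simulator to exercise, which is what makes it convenient to unit test.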

greptile-apps · 2026-06-16T12:39:48Z

Greptile Summary

This PR is Part 4/4 of a benchmark refactor series, adding a checkpoint-driven play (inference) benchmark with a new PlayBundle schema type, four backend adapters (rsl_rl, rl_games, skrl, sb3), and a unified play.py dispatcher. It also lands the shared scaffolding from Parts 1–3 (schema, capture, metrics, builders, stepping, and SchemaBundleFile backend).

New PlayBundle schema mirrors RuntimeBundle with framework set and adds typed success_rate / reward / ep_length (MeanStd) / checkpoint_path / video_path fields; serialised to play.json via the existing SchemaBundleFile backend.
run_play_loop in stepping.py runs a policy-driven rollout, accumulates per-episode returns/lengths, and handles both 4- and 5-tuple step signatures plus NumPy returns from SB3.
resolve_play_checkpoint in _common.py chains explicit --checkpoint → published Nucleus fallback → FileNotFoundError; the fallback branch is missing a _retrieve_file_path call, leaving adapters with a raw Nucleus URI.
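
As a rough picture of what handling both step signatures involves, the sketch below normalises a 4- or 5-tuple step result to a common `(obs, reward, done, info)` form. It is not taken from stepping.py: the `unpack_step` name is hypothetical, and plain Python lists stand in for the tensors or NumPy arrays a real environment would return.

```python
def unpack_step(result):
    """Normalise a 4- or 5-tuple step result to (obs, reward, done, info)."""
    if len(result) == 5:
        # Gymnasium-style: (obs, reward, terminated, truncated, info)
        obs, reward, terminated, truncated, info = result
        # An episode ends when it is either terminated or truncated.
        done = [bool(t) or bool(u) for t, u in zip(terminated, truncated)]
    elif len(result) == 4:
        # Wrapper-style: (obs, reward, done, info)
        obs, reward, done, info = result
    else:
        raise ValueError(f"Unsupported step result of length {len(result)}")
    return obs, reward, done, info


five = unpack_step(("obs", [1.0, 2.0], [True, False], [False, True], {}))
four = unpack_step(("obs", [1.0, 2.0], [False, True], {}))
```

With tensors the per-element loop would normally be a single element-wise OR; the list comprehension only keeps the example dependency-free.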

Confidence Score: 3/5

The new play benchmark scaffolding (schema, builders, capture, metrics) is clean and well-tested. The single actionable defect is in resolve_play_checkpoint: when no --checkpoint is supplied the published Nucleus URI is returned as-is to the adapters, which cannot load a remote URI and will crash at runtime.

The core schema types, builder functions, and run_play_loop are solid and covered by unit tests. However, the checkpoint-resolution fallback path in _common.py skips the download step that is applied to every user-supplied path, so running the play benchmark without an explicit --checkpoint argument will fail for all four backends. This is the primary code path the PR adds value for and it is broken at the point where the resolved path is returned.

scripts/benchmarks/_common.py (resolve_play_checkpoint fallback), source/isaaclab/isaaclab/test/benchmark/stepping.py (success_rate aggregation methodology), scripts/benchmarks/skrl/bench_play_skrl.py (unnecessary env.state() call in policy closure)

Important Files Changed

| Filename | Overview |
| --- | --- |
| scripts/benchmarks/_common.py | New shared helper module; contains a bug where the published-checkpoint fallback in `resolve_play_checkpoint` skips `_retrieve_file_path`, returning a raw Nucleus URI that adapters cannot load. |
| source/isaaclab/isaaclab/test/benchmark/stepping.py | New stepping helpers including `run_play_loop`; logic is mostly sound but success_rate aggregation broadcasts a batch-mean scalar to every done environment in a step, which may misrepresent per-episode outcomes in large vectorized settings. |
| source/isaaclab/isaaclab/test/benchmark/schema.py | Adds `PlayBundle` frozen dataclass; mirrors `RuntimeBundle` structure and adds `success_rate`, `reward`, `ep_length`, `checkpoint_path`, `video_path`; clean and consistent with existing bundle types. |
| source/isaaclab/isaaclab/test/benchmark/builders.py | New pure-assembly builders for all bundle types including `build_play_bundle`; well-structured, Isaac-Sim-free, and correctly delegates to schema and metrics modules. |
| scripts/benchmarks/rsl_rl/bench_play_rsl_rl.py | New RSL-RL play adapter; mirrors the official play.py inference path and correctly emits PlayBundle; timing, checkpoint loading, and env wrapping look consistent with training adapter. |
| scripts/benchmarks/rl_games/bench_play_rl_games.py | New RL-Games play adapter; checkpoint loading follows the original play.py double-restore pattern; FPS computed correctly per-step. |
| scripts/benchmarks/skrl/bench_play_skrl.py | New SKRL play adapter; `policy` closure calls `env.state()` on every inference step, fetching privileged critic observations unnecessarily during rollout. |
| scripts/benchmarks/sb3/bench_play_sb3.py | New SB3 play adapter; VecNormalize loading logic correctly handles the saved `.pkl` case; fallback `training=True` path is inherited from the original play.py. |
| source/isaaclab/isaaclab/test/benchmark/benchmark_core.py | Refactored to accept `list[str]` for `backend_type`, adds `attach_bundle` for schema serialisation, and routes `bundle` kwarg through `finalize`; multi-backend filename suffix logic is correct. |
| source/isaaclab/isaaclab/test/benchmark/backends.py | Adds `SchemaBundleFile` backend that serialises the attached typed bundle; correctly ignores flat measurement phases and handles missing bundle gracefully. |

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant CLI as play.py dispatcher
    participant Adapter as bench_play adapter
    participant Common as _common.py
    participant StepFn as stepping.run_play_loop
    participant BM as BenchmarkMonitor
    participant Build as builders
    participant Out as SchemaBundleFile

    CLI->>Adapter: dispatch_library_entrypoint()
    Adapter->>Common: resolve_play_checkpoint(checkpoint, framework, task)
    alt checkpoint provided by user
        Common-->>Adapter: _retrieve_file_path(checkpoint) - local path
    else fallback published Nucleus checkpoint
        Common-->>Adapter: raw Nucleus URI (missing _retrieve_file_path)
    end
    Adapter->>BM: enter BenchmarkMonitor context
    Adapter->>StepFn: run_play_loop(env, policy, num_frames)
    StepFn-->>Adapter: step_times, reward, ep_length, success_rate
    BM-->>Adapter: exit + update_manual_recorders
    Adapter->>Build: build_play_bundle(run, versions, hardware, runtime, ...)
    Build-->>Adapter: PlayBundle
    Adapter->>Out: attach_bundle then _finalize_impl writes play.json

Comments Outside Diff (2)

source/isaaclab/isaaclab/test/benchmark/stepping.py, line 1246-1258 (link)

Step-level success metric broadcast to each done environment

_extract_success(extras) returns one scalar for the whole step — typically the batch-mean Metrics/success_rate logged across all num_envs environments. When multiple environments finish in the same step the same scalar is appended once per done environment. For large vectorised setups (e.g., 512 envs) several environments can terminate in a single step, and if the logged success metric represents the fraction of all num_envs environments that succeeded (not just the ones that finished), each done environment receives an inaccurate credit. The result is a success_rate that may not reflect actual per-episode success outcomes.
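
The effect described in this comment can be reproduced with a toy example. The sketch assumes the step-level scalar is the batch mean over all environments and that per-environment success flags would be available as an alternative; neither is confirmed by the diff, and both function names are hypothetical.

```python
def broadcast_scalar_success(step_success, done_flags):
    """Append the step-level scalar once per finished env (pattern under review)."""
    return [step_success for done in done_flags if done]


def per_env_success(success_flags, done_flags):
    """Record each finished env's own outcome (assumes per-env flags exist)."""
    return [float(s) for s, done in zip(success_flags, done_flags) if done]


# Four envs; envs 0 and 1 finish this step, and only env 0 succeeded.
done = [True, True, False, False]
succeeded = [True, False, False, False]
batch_mean = sum(succeeded) / len(succeeded)  # 0.25 over all four envs

broadcast = broadcast_scalar_success(batch_mean, done)
exact = per_env_success(succeeded, done)

broadcast_rate = sum(broadcast) / len(broadcast)
exact_rate = sum(exact) / len(exact)
```

Under these assumptions the broadcast variant reports 0.25 while half of the finished episodes actually succeeded, which is the kind of gap the comment points at.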

scripts/benchmarks/skrl/bench_play_skrl.py, line 1037-1040 (link)

env.state() called every inference step inside policy

env.state() is called on each call to policy(obs) inside run_play_loop. This method returns the privileged critic observation used only during training; the policy network itself needs only obs. Calling it at every inference step incurs unnecessary computation, and on environments that distinguish actor/critic inputs it may also fetch a privileged buffer that the policy ignores. The other three backend adapters (rsl_rl, rl_games, sb3) do not call an equivalent state method in their policy closures.

Reviews (1): Last reviewed commit: "Address play-benchmark review feedback"

greptile-apps · 2026-06-16T12:39:52Z

+    Raises:
+        FileNotFoundError: When no *checkpoint* is given and no published Nucleus
+            checkpoint exists for *framework*/*task*.
+    """
+    if checkpoint:
+        return _retrieve_file_path(checkpoint)
+
+    logger.warning(
+        "No --checkpoint given; falling back to the published Nucleus checkpoint for %s / %s.",
+        framework,
+        task,
+    )
+    path = _published_checkpoint(framework, task)
+    if not path:
+        raise FileNotFoundError(
+            f"No checkpoint available for framework {framework!r} and task {task!r}: pass --checkpoint"
+            " with a local or Nucleus path, or publish a Nucleus checkpoint for this task."
+        )
+    return path


Missing _retrieve_file_path call on published checkpoint fallback

When the user omits --checkpoint, resolve_play_checkpoint calls _published_checkpoint and returns its result directly — skipping the _retrieve_file_path download step that is applied to user-supplied paths. get_published_pretrained_checkpoint typically returns a Nucleus URI (omniverse://…), so every downstream adapter (runner.load(resume_path), agent.restore(resume_path), PPO.load(resume_path), etc.) will receive a raw URI string and fail with a file-not-found / invalid-path error. The explicit --checkpoint branch is correctly guarded; the fallback is not.

greptile-apps · 2026-06-16T12:39:53Z

+    path = _published_checkpoint(framework, task)
+    if not path:
+        raise FileNotFoundError(
+            f"No checkpoint available for framework {framework!r} and task {task!r}: pass --checkpoint"
+            " with a local or Nucleus path, or publish a Nucleus checkpoint for this task."
+        )
+    return path


Resolve the published Nucleus URI through _retrieve_file_path so it is downloaded to a local path before being returned, matching the behaviour for user-supplied paths.

Suggested change

     path = _published_checkpoint(framework, task)
     if not path:
         raise FileNotFoundError(
             f"No checkpoint available for framework {framework!r} and task {task!r}: pass --checkpoint"
             " with a local or Nucleus path, or publish a Nucleus checkpoint for this task."
         )
-    return path
+    return _retrieve_file_path(path)
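
One way to guard this fallback against regressions is to exercise the resolution chain with stubbed lookups. The sketch below is a self-contained approximation rather than the code in _common.py: the `resolve_checkpoint` name, the injected `lookup_published` and `retrieve` callables, and both stub functions are assumptions introduced so the example runs without Nucleus access.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def resolve_checkpoint(
    checkpoint: Optional[str],
    framework: str,
    task: str,
    lookup_published: Callable[[str, str], Optional[str]],
    retrieve: Callable[[str], str],
) -> str:
    """Resolve a checkpoint to a local path: explicit, then published, else error."""
    if checkpoint:
        return retrieve(checkpoint)

    logger.warning("No checkpoint given; falling back for %s / %s.", framework, task)
    path = lookup_published(framework, task)
    if not path:
        raise FileNotFoundError(f"No checkpoint for {framework!r} / {task!r}.")
    # The fallback goes through the same retrieval step as the explicit branch.
    return retrieve(path)


def fake_lookup(framework: str, task: str) -> Optional[str]:
    """Stub: pretend one task has a published checkpoint (URI is made up)."""
    return "omniverse://host/ckpt.pt" if task == "Known-Task" else None


def fake_retrieve(path: str) -> str:
    """Stub: pretend any path or URI is downloaded into /tmp."""
    return "/tmp/" + path.rsplit("/", 1)[-1]


explicit = resolve_checkpoint("/data/a.pt", "rsl_rl", "Known-Task", fake_lookup, fake_retrieve)
fallback = resolve_checkpoint(None, "rsl_rl", "Known-Task", fake_lookup, fake_retrieve)
```

Passing the lookups in as parameters is only a convenience for the example; patching the module-level functions in a test would serve the same purpose.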

github-actions Bot added the documentation and isaac-lab labels Jun 16, 2026

AntoineRichard added 15 commits June 16, 2026 13:57

Add PlayBundle schema type for the play benchmark

cf9a576

Add build_play_bundle builder

8319599

Add run_play_loop policy-rollout stepping helper

3d6c42e

Add resolve_play_checkpoint helper to _common

6b06558

Annotate run_play_loop return type

c1717e8

Add rl_games play benchmark adapter

e868112

Add gated generate-then-play smoke tests for the play benchmark

554329c

Document the play benchmark and add changelog

af0bfa4

AntoineRichard force-pushed the antoiner/benchmark-play branch from 53df4a3 to c1f234f Compare June 16, 2026 12:22

AntoineRichard marked this pull request as ready for review June 16, 2026 12:33

AntoineRichard requested review from Mayankm96, jtigue-bdai, kellyguo11 and ooctipus as code owners June 16, 2026 12:33

greptile-apps Bot reviewed Jun 16, 2026

AntoineRichard wants to merge 15 commits into isaac-sim:develop from AntoineRichard:antoiner/benchmark-play
