Added support for Lingbot-VA by hyzhou404 · Pull Request #312 · NVIDIA/flashdreams

hyzhou404 · 2026-06-09T13:59:58Z

Summary

Add LingBot-VA Robotwin I2AV integration as a FlashDreams plugin
Achieves 2.3× speedup over the original repo (1.48× vs FSDP-removed baseline)
Self-contained package

copy-pr-bot · 2026-06-09T14:00:02Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-06-09T14:12:20Z

Greptile Summary

This PR adds a self-contained LingBot-VA Robotwin I2AV integration as a FlashDreams plugin, adapting the upstream video-action diffusion transformer into the standard runner/pipeline interface with torch.compile acceleration. Several issues flagged in earlier review rounds have been resolved (stub-module injection, diffusers private imports, duplicate VAE CPU offload calls, uncond KV sync).

_last_video_kv shared across cond/uncond forward passes — WanVADiTNetwork stores the video KV list in self._last_video_kv after each forward_video(persist=True). predict_flow calls forward_video for the cond cache first, then for the uncond cache; the uncond call overwrites _last_video_kv. When predict_action_flow subsequently calls forward_action(persist=True) for the cond cache, it reads the now-stale uncond KV and writes [uncond_video | cond_action] into the cond slot — corrupting CFG quality from chunk 1 onward in the default configuration.
Rolling-window cache, dual denoising loop, and AR cache lifecycle look correct; the trailing padded zero step is intentional (final KV write without latent update), and the circular buffer ordering fix from the previous round is properly reflected in the new VAKVCache implementation.

Confidence Score: 4/5

The cond KV cache is written with uncond video keys on every persist step, silently corrupting CFG quality from chunk 1 onward in the default configuration.

The CFG sync fix from the prior round moved both forward passes inside the network but left _last_video_kv as a single slot on the network object. The uncond forward_video call always runs second and always overwrites it, so every action write_action for the cond cache uses the wrong video KV. This affects every multi-chunk run with guidance_scale > 1.0, which is the default. The rest of the integration — scheduler, kvcache ordering, memory cleanup, model loading — is clean.

integrations/lingbot_va/lingbot_va/transformer/impl/network.py and integrations/lingbot_va/lingbot_va/transformer/init.py — the _last_video_kv write/read pair needs to be scoped per cache object rather than on the shared network.

Important Files Changed

Filename	Overview
integrations/lingbot_va/lingbot_va/transformer/impl/network.py	Core DiT network; contains a bug where `_last_video_kv` is shared network-level state overwritten by the uncond `forward_video` call, causing the cond cache to commit wrong (uncond) video KV from chunk 1 onward under the default guidance_scale=5.0 config.
integrations/lingbot_va/lingbot_va/transformer/init.py	Transformer wrapper with predict_flow/predict_action_flow; the sequencing of cond-then-uncond forward_video calls with persist=True is the proximate cause of the _last_video_kv corruption bug.
integrations/lingbot_va/lingbot_va/transformer/impl/kvcache.py	VAKVCache rolling-window implementation with write_video/write_action separation; looks correct after the circular-buffer ordering fix mentioned in prior threads.
integrations/lingbot_va/lingbot_va/pipeline.py	Dual denoising loop (video then action) with padded timestep for final KV commit; design is intentional and correct — the trailing t=0 step writes the KV cache without updating latents.
integrations/lingbot_va/lingbot_va/runner.py	Main inference runner; previously flagged issues (stub modules, diffusers internal imports, duplicate vae.to(cpu)) are resolved; memory cleanup before video decode looks correct.
integrations/lingbot_va/lingbot_va/scheduler.py	Flow-match scheduler matching upstream LingBot-VA; padded_timesteps correctly appends zero for the final cache-write step.
integrations/lingbot_va/lingbot_va/action.py	Action normalization/denormalization; preprocess correctly handles 16-channel input, postprocess correctly denormalizes and selects robotwin channels.
integrations/lingbot_va/lingbot_va/_loaders.py	Model loading utilities; WanVAEStreamingWrapper correctly counts WanCausalConv3d modules with a fallback when _cached_conv_counts is absent.
integrations/lingbot_va/lingbot_va/transformer/impl/modules.py	VA transformer blocks with compile-friendly tensor-only forward pass; cross-attention correctly receives pre-extracted text KV as plain tensor arguments.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant PL as pipeline.generate()
    participant TR as LingbotVATransformer
    participant NET as WanVADiTNetwork
    participant CC as network_cache (cond)
    participant UC as network_cache_uncond

    PL->>TR: "predict_flow(persist=True)"
    TR->>NET: "forward_video(network_cache, persist=True)"
    NET->>CC: write_video(cond_kv)
    NET-->>NET: "_last_video_kv = cond_kv ✓"

    TR->>NET: "forward_video(network_cache_uncond, persist=True)"
    NET->>UC: write_video(uncond_kv)
    NET-->>NET: "_last_video_kv = uncond_kv ← overwrites!"

    PL->>TR: "predict_action_flow(persist=True)"
    TR->>NET: "forward_action(network_cache, persist=True)"
    Note over NET: reads _last_video_kv = uncond_kv ❌
    NET->>CC: "write_action(cond_action_kv, video_kv=uncond_kv)"
    Note over CC: slot = [uncond_video | cond_action] wrong!

    TR->>NET: "forward_action(network_cache_uncond, persist=True)"
    NET->>UC: "write_action(uncond_action_kv, video_kv=uncond_kv)"
    Note over UC: slot = [uncond_video | uncond_action] ✓

%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant PL as pipeline.generate()
    participant TR as LingbotVATransformer
    participant NET as WanVADiTNetwork
    participant CC as network_cache (cond)
    participant UC as network_cache_uncond

    PL->>TR: "predict_flow(persist=True)"
    TR->>NET: "forward_video(network_cache, persist=True)"
    NET->>CC: write_video(cond_kv)
    NET-->>NET: "_last_video_kv = cond_kv ✓"

    TR->>NET: "forward_video(network_cache_uncond, persist=True)"
    NET->>UC: write_video(uncond_kv)
    NET-->>NET: "_last_video_kv = uncond_kv ← overwrites!"

    PL->>TR: "predict_action_flow(persist=True)"
    TR->>NET: "forward_action(network_cache, persist=True)"
    Note over NET: reads _last_video_kv = uncond_kv ❌
    NET->>CC: "write_action(cond_action_kv, video_kv=uncond_kv)"
    Note over CC: slot = [uncond_video | cond_action] wrong!

    TR->>NET: "forward_action(network_cache_uncond, persist=True)"
    NET->>UC: "write_action(uncond_action_kv, video_kv=uncond_kv)"
    Note over UC: slot = [uncond_video | uncond_action] ✓

_{Reviews (7): Last reviewed commit: "Refactor LingBot-VA: move inference logi..." | Re-trigger Greptile}

hyzhou404 · 2026-06-09T14:33:22Z

I previously mistakenly uploaded the pyproject.toml file for version 12.8; this has been fixed. Please ignore Greptile's description regarding the CUDA version change.

jmccaffrey-nv · 2026-06-10T06:08:54Z

/ok to test 35ae585

jmccaffrey-nv · 2026-06-10T06:11:46Z

Please add Apache-2.0 headers on new files which you authored and are contributing . Attribute any 3rd-party OSS files .
Please see CONTRIBUTING.md and https://github.com/NVIDIA/flashdreams/blob/main/skills/maintaining-oss-state/SKILL.md

reuse-lint / OSRB collateral sanity check (pull_request) Failing after 9s
Error: Add 'SPDX-License-Identifier: Apache-2.0' (and matching SPDX-FileCopyrightText) to the first 20 lines of this file.
Error: 15 source file(s) missing inline SPDX header

hyzhou404 · 2026-06-11T05:20:19Z

Please add Apache-2.0 headers on new files which you authored and are contributing . Attribute any 3rd-party OSS files . Please see CONTRIBUTING.md and https://github.com/NVIDIA/flashdreams/blob/main/skills/maintaining-oss-state/SKILL.md

reuse-lint / OSRB collateral sanity check (pull_request) Failing after 9s Error: Add 'SPDX-License-Identifier: Apache-2.0' (and matching SPDX-FileCopyrightText) to the first 20 lines of this file. Error: 15 source file(s) missing inline SPDX header

Done

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Remove flash_attn stub that polluted sys.modules globally - Inline prompt_clean to avoid depending on diffusers private API - Fix KV cache to use left-shift eviction (aligned with official BlockKVCache) - Keep _n_committed capped at window_slots in steady-state - Remove duplicate _streaming_vae_half.vae.to("cpu") call Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

wilsonCernWq · 2026-06-12T21:17:38Z

+            grid_id, self.config.network.dim // self.config.network.num_heads
+        ).to(noisy_latent.device)
+
+        flow_cond = self.network.forward_video(


compile_module(net) returns an OptimizedModule that only compiles forward, but WanVADiTNetwork has no forward — the hot path calls net.forward_video / net.forward_action (transformer/init.py:200,228), which OptimizedModule delegates straight to the eager original module. So --compile-network True (default) does nothing, and the README's torch.compile-attributed 2.3× is not correct likely...

Once compile actually engages, this will likely cause torch compile to be triggered multiple times

def n_cached_tokens(self) -> int: """Number of committed tokens visible to attention.""" return min(self._n_committed, self.window_slots) * self.slot_size

Please follow the existing kvcache.py design. We have a good solution for that already.

wilsonCernWq · 2026-06-12T21:22:51Z

+            nn.SiLU(),
+            nn.Linear(self.dim, self.dim * 6),
+        )
+        self.action_text_embedding = nn.Sequential(


this action_text_embedding seems never used? is this intentional?

This is inherited from the released Lingbot-VA checkpoint. My code faithfully loads these weights to stay compatible with the pretrained checkpoint, but they don't participate in any computation. Leaving it as-is for now.

wilsonCernWq · 2026-06-12T21:30:42Z

+        raise NotImplementedError(
+            "LingBot-VA Robotwin cache initialization awaits the native DiT/VAE "
+            "port. Use --no-instantiate or CPU tests for the scaffold stage."
+        )


This does not respect the overall pipeline design

The DiffusionModel(transformer+scheduler) are built, moved to GPU, and never used. Could we still follow the intended StreamInferencePipeline.generate/finalize design (as the sibling integrations do) and drop the dead pipeline/scheduler? That restores CP/profiling/streaming-decoder and leaves a single source of truth.

- Fix uncond KV cache desync: always run uncond forward when the cache exists, regardless of guidance scale. Previously the early-return skipped write_secondary on uncond cache under default config (action_guidance_scale=1.0), leaving zeroed action KV that silently corrupted CFG from chunk 1 onward. - Share single VAE between streaming wrappers instead of loading a duplicate, halving VAE memory. Also fixes a pre-existing device mismatch bug under --enable-offload. - Remove unused sink_size config field and constant. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

wilsonCernWq

Thanks for quick response. This is just the second part of my review that I didn't finish yesterday!

wilsonCernWq · 2026-06-13T16:31:43Z

+        q01 = self.q01_tensor()
+        q99 = self.q99_tensor()
+        denorm = (action_cpu + 1.0) / 2.0 * (q99 - q01 + 1e-6) + q01
+        return denorm[list(self.config.used_action_channel_ids)]


I think this will return something like (16, 320) as I printed?

But in the readme it says

| `actions.npy` | Predicted actions array, shape `(num_chunks × action_per_frame × frame_chunk_size, action_dim)`. |

wilsonCernWq · 2026-06-13T16:37:41Z

+
+
+def load_vae(vae_path: str, torch_dtype: torch.dtype, torch_device):
+    vae = AutoencoderKLWan.from_pretrained(vae_path, torch_dtype=torch_dtype)


I think this assumes user has checkpoint pre-downloaded on disk. I think this is useful for dev, but a bit hard to use for users who just want to run the test? Can we default to download from HF url? and then if user provides a --checkpoint-root, we overwrite it with a local path?

To download from URL, it needs kwargs like subfolder="text_encoder", I believe

wilsonCernWq · 2026-06-13T16:40:04Z

+| `--checkpoint-root` | str | `robbyant/lingbot-va-posttrain-robotwin` | Local path or HuggingFace repo ID for model weights. Must contain `transformer/`, `vae/`, `text_encoder/`, `tokenizer/` subdirs. |
+| `--input-image-dir` | path | `assets/example_data/lingbot-va/robotwin` | Directory containing three observation camera PNGs (see below). |
+| `--output-dir` | path | `outputs/lingbot_va/robotwin_i2av` | Where to write `demo.mp4`, `actions.npy`, `latents.pt`, and timing JSON. |
+| `--prompt` | str | `"Grab the medium-sized white mug, rotate it, place it on the table, and hook it onto the smooth dark gray rack."` | Text prompt describing the manipulation task. Can also be a path to a `.txt` file. |


this does not seem to match the actual implementation, it seems --prompt will actually read strings? could you double check?

wilsonCernWq · 2026-06-13T16:42:35Z

+    runner_name=PIPELINE_LINGBOT_VA_ROBOTWIN_I2AV.name,
+    description=(
+        "LingBot-VA Robotwin I2AV inference scaffold "
+        "(three-camera Robotwin config; native DiT port pending)."


Is "native DiT port pending" still correct? Maybe we can do a full pass of docstrings so that they are all consistent?

wilsonCernWq · 2026-06-13T16:46:34Z

+    --output-dir outputs/lingbot_va/robotwin_i2av \
+    --checkpoint-root /path/to/lingbot-va-posttrain-robotwin \
+    --num-chunks 10 \
+    --benchmark True


Will this integration support multi-GPU?

wilsonCernWq · 2026-06-13T16:47:18Z

+    return torch.cat([grid_id, torch.full_like(grid_id[:1], t)], dim=0)
+
+
+def data_seq_to_patch(


unused code?

wilsonCernWq · 2026-06-13T16:48:29Z

+        """Return a copy with all non-Robotwin-used channels set to zero."""
+        masked = action.clone()
+        masked[:, ~self.action_mask(device=masked.device)] = 0
+        return masked


nit: this also seems unused?

wilsonCernWq · 2026-06-13T16:49:52Z

+        q01 = self.q01_tensor(device=expanded.device)
+        q99 = self.q99_tensor(device=expanded.device)
+        expanded = (expanded - q01) / (q99 - q01 + 1e-6) * 2.0 - 1.0
+        return expanded.unsqueeze(0).unsqueeze(-1)


nit: also unused seems?

liruilong940607 · 2026-06-15T05:21:25Z

Thanks for bringing lingbot-va into flashdreams.

A general question: how to verify generation result is on-par with the official implementation?

For other integrations we have the parity check code like this https://github.com/NVIDIA/flashdreams/tree/main/integrations/lingbot/tests/parity_check

Which will not only produce generation results with the original code, but also dumps logs for runtime, so that anyone can reproduce the runtime speedup reported by the developer. It seems that you have done the comparison -- is it possible to organize your comparison into reproduceable code like other integrations? And share some visuals in this PR as the evidence of quality parity? (not sure what is the best way to verify the action output though)

hyzhou404 · 2026-06-15T07:46:59Z

Thanks for bringing lingbot-va into flashdreams.

A general question: how to verify generation result is on-par with the official implementation?

For other integrations we have the parity check code like this https://github.com/NVIDIA/flashdreams/tree/main/integrations/lingbot/tests/parity_check

Which will not only produce generation results with the original code, but also dumps logs for runtime, so that anyone can reproduce the runtime speedup reported by the developer. It seems that you have done the comparison -- is it possible to organize your comparison into reproduceable code like other integrations? And share some visuals in this PR as the evidence of quality parity? (not sure what is the best way to verify the action output though)

Both the original code and my implementation exhibited some randomness that has been difficult to align. I am currently working on debugging the sources of this randomness and refactoring the code based on suggestions from Qi Wu and you. Thank you both for your careful and thorough review of my submission.

- Add LingbotVAInferencePipeline.generate() with dual video+action denoising - Add action_scheduler config for separate action generation steps - Move inference logic from runner to pipeline for better encapsulation - Optimize transformer/kvcache implementations - Update dependencies in pyproject.toml and uv.lock Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

greptile-apps · 2026-06-27T12:58:09Z

+        if persist:
+            for block_idx, (k, v) in enumerate(zip(k_list, v_list)):
+                cache[block_idx].self_attn.write_video(k, v)
+            self._last_video_kv = list(zip(k_list, v_list))
+
+        return output
+
+    def forward_action(
+        self,
+        x: Tensor,
+        timesteps: Tensor,
+        cache: WanVADiTNetworkCache,
+        rope_freqs: Tensor,
+        persist: bool = False,
+    ) -> Tensor:
+        """Action-mode forward.
+
+        Cache extraction and writes happen outside the compiled block loop.
+
+        Returns:
+            Action flow ``[batch, L, action_dim]``.
+        """
+        assert self._parameters_updated_after_loading_checkpoint
+
+        # Extract cache tensors (outside compile boundary)
+        committed_k, committed_v, cross_k, cross_v = self._extract_cache_tensors(cache)
+
+        # Compiled block loop (pure tensors, no cache access)
+        output, k_list, v_list = self._forward_blocks_action(
+            x, timesteps, committed_k, committed_v, cross_k, cross_v, rope_freqs,
+        )
+
+        # Cache write (outside compile boundary)
+        if persist:
+            for block_idx, (k, v) in enumerate(zip(k_list, v_list)):
+                vk, vv = self._last_video_kv[block_idx]
+                cache[block_idx].self_attn.write_action(k, v, vk, vv)
+


_last_video_kv overwritten by uncond pass — cond cache written with wrong video KV

forward_video saves its fresh KV list into self._last_video_kv on the network instance (line 387). predict_flow calls forward_video first for the cond cache, then immediately for the uncond cache (both with persist=True), so _last_video_kv is overwritten with the uncond video KV before predict_action_flow is ever called.

When predict_action_flow then calls forward_action(cache.network_cache, persist=True), it reads self._last_video_kv (now holding uncond KV) and calls write_action(cond_action_k, cond_action_v, uncond_video_k, uncond_video_v), writing [uncond_video | cond_action] into the cond cache slot. From chunk 1 onward the cond cross-chunk attention attends over uncond video keys for all committed prior slots, silently corrupting CFG. This affects every run with the default guidance_scale=5.0 (which creates network_cache_uncond).

The fix is to store the last video KV on the cache object (per-cache, not per-network), so cond and uncond are never conflated. WanVADiTNetworkCache can hold a last_video_kv field (initialized to None) that each forward_video / forward_action call reads and writes on the relevant cache object.

greptile-apps Bot reviewed Jun 9, 2026

View reviewed changes

hyzhou404 force-pushed the feat/lingbot-va branch 3 times, most recently from 523ba92 to 35ae585 Compare June 9, 2026 14:20

liruilong940607 requested review from liruilong940607 and wilsonCernWq June 9, 2026 17:54

hyzhou404 force-pushed the feat/lingbot-va branch from 35ae585 to d05ddeb Compare June 11, 2026 05:08

greptile-apps Bot reviewed Jun 11, 2026

View reviewed changes

Comment thread integrations/lingbot_va/lingbot_va/transformer/impl/kvcache.py Outdated

greptile-apps Bot reviewed Jun 11, 2026

View reviewed changes

Comment thread integrations/lingbot_va/lingbot_va/transformer/__init__.py Outdated

影青 and others added 2 commits June 11, 2026 15:13

Added support for Lingbot-VA

92a8017

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

hyzhou404 force-pushed the feat/lingbot-va branch from beb1238 to 755b105 Compare June 11, 2026 07:14

wilsonCernWq requested changes Jun 12, 2026

View reviewed changes

hyzhou404 requested a review from wilsonCernWq June 13, 2026 06:33

wilsonCernWq reviewed Jun 13, 2026

View reviewed changes

greptile-apps Bot reviewed Jun 27, 2026

View reviewed changes



		def load_vae(vae_path: str, torch_dtype: torch.dtype, torch_device):
		vae = AutoencoderKLWan.from_pretrained(vae_path, torch_dtype=torch_dtype)

		return torch.cat([grid_id, torch.full_like(grid_id[:1], t)], dim=0)


		def data_seq_to_patch(

Uh oh!

Conversation

hyzhou404 commented Jun 9, 2026

Summary

Uh oh!

copy-pr-bot Bot commented Jun 9, 2026

Uh oh!

greptile-apps Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

hyzhou404 commented Jun 9, 2026

Uh oh!

jmccaffrey-nv commented Jun 10, 2026

Uh oh!

jmccaffrey-nv commented Jun 10, 2026

Uh oh!

Uh oh!

hyzhou404 commented Jun 11, 2026

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

wilsonCernWq left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

liruilong940607 commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hyzhou404 commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

greptile-apps Bot Jun 27, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

greptile-apps Bot commented Jun 9, 2026 •

edited

Loading

liruilong940607 commented Jun 15, 2026 •

edited

Loading

hyzhou404 commented Jun 15, 2026 •

edited

Loading