Skip to content

Added support for Lingbot-VA#312

Open
hyzhou404 wants to merge 4 commits into
NVIDIA:mainfrom
hyzhou404:feat/lingbot-va
Open

Added support for Lingbot-VA#312
hyzhou404 wants to merge 4 commits into
NVIDIA:mainfrom
hyzhou404:feat/lingbot-va

Conversation

@hyzhou404

Copy link
Copy Markdown

Summary

  • Add LingBot-VA Robotwin I2AV integration as a FlashDreams plugin
  • Achieves 2.3× speedup over the original repo (1.48× vs FSDP-removed baseline)
  • Self-contained package

@copy-pr-bot

copy-pr-bot Bot commented Jun 9, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps

greptile-apps Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds a self-contained LingBot-VA Robotwin I2AV integration as a FlashDreams plugin, adapting the upstream video-action diffusion transformer into the standard runner/pipeline interface with torch.compile acceleration. Several issues flagged in earlier review rounds have been resolved (stub-module injection, diffusers private imports, duplicate VAE CPU offload calls, uncond KV sync).

  • _last_video_kv shared across cond/uncond forward passesWanVADiTNetwork stores the video KV list in self._last_video_kv after each forward_video(persist=True). predict_flow calls forward_video for the cond cache first, then for the uncond cache; the uncond call overwrites _last_video_kv. When predict_action_flow subsequently calls forward_action(persist=True) for the cond cache, it reads the now-stale uncond KV and writes [uncond_video | cond_action] into the cond slot — corrupting CFG quality from chunk 1 onward in the default configuration.
  • Rolling-window cache, dual denoising loop, and AR cache lifecycle look correct; the trailing padded zero step is intentional (final KV write without latent update), and the circular buffer ordering fix from the previous round is properly reflected in the new VAKVCache implementation.

Confidence Score: 4/5

The cond KV cache is written with uncond video keys on every persist step, silently corrupting CFG quality from chunk 1 onward in the default configuration.

The CFG sync fix from the prior round moved both forward passes inside the network but left _last_video_kv as a single slot on the network object. The uncond forward_video call always runs second and always overwrites it, so every action write_action for the cond cache uses the wrong video KV. This affects every multi-chunk run with guidance_scale > 1.0, which is the default. The rest of the integration — scheduler, kvcache ordering, memory cleanup, model loading — is clean.

integrations/lingbot_va/lingbot_va/transformer/impl/network.py and integrations/lingbot_va/lingbot_va/transformer/init.py — the _last_video_kv write/read pair needs to be scoped per cache object rather than on the shared network.

Important Files Changed

Filename Overview
integrations/lingbot_va/lingbot_va/transformer/impl/network.py Core DiT network; contains a bug where _last_video_kv is shared network-level state overwritten by the uncond forward_video call, causing the cond cache to commit wrong (uncond) video KV from chunk 1 onward under the default guidance_scale=5.0 config.
integrations/lingbot_va/lingbot_va/transformer/init.py Transformer wrapper with predict_flow/predict_action_flow; the sequencing of cond-then-uncond forward_video calls with persist=True is the proximate cause of the _last_video_kv corruption bug.
integrations/lingbot_va/lingbot_va/transformer/impl/kvcache.py VAKVCache rolling-window implementation with write_video/write_action separation; looks correct after the circular-buffer ordering fix mentioned in prior threads.
integrations/lingbot_va/lingbot_va/pipeline.py Dual denoising loop (video then action) with padded timestep for final KV commit; design is intentional and correct — the trailing t=0 step writes the KV cache without updating latents.
integrations/lingbot_va/lingbot_va/runner.py Main inference runner; previously flagged issues (stub modules, diffusers internal imports, duplicate vae.to(cpu)) are resolved; memory cleanup before video decode looks correct.
integrations/lingbot_va/lingbot_va/scheduler.py Flow-match scheduler matching upstream LingBot-VA; padded_timesteps correctly appends zero for the final cache-write step.
integrations/lingbot_va/lingbot_va/action.py Action normalization/denormalization; preprocess correctly handles 16-channel input, postprocess correctly denormalizes and selects robotwin channels.
integrations/lingbot_va/lingbot_va/_loaders.py Model loading utilities; WanVAEStreamingWrapper correctly counts WanCausalConv3d modules with a fallback when _cached_conv_counts is absent.
integrations/lingbot_va/lingbot_va/transformer/impl/modules.py VA transformer blocks with compile-friendly tensor-only forward pass; cross-attention correctly receives pre-extracted text KV as plain tensor arguments.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant PL as pipeline.generate()
    participant TR as LingbotVATransformer
    participant NET as WanVADiTNetwork
    participant CC as network_cache (cond)
    participant UC as network_cache_uncond

    PL->>TR: "predict_flow(persist=True)"
    TR->>NET: "forward_video(network_cache, persist=True)"
    NET->>CC: write_video(cond_kv)
    NET-->>NET: "_last_video_kv = cond_kv ✓"

    TR->>NET: "forward_video(network_cache_uncond, persist=True)"
    NET->>UC: write_video(uncond_kv)
    NET-->>NET: "_last_video_kv = uncond_kv ← overwrites!"

    PL->>TR: "predict_action_flow(persist=True)"
    TR->>NET: "forward_action(network_cache, persist=True)"
    Note over NET: reads _last_video_kv = uncond_kv ❌
    NET->>CC: "write_action(cond_action_kv, video_kv=uncond_kv)"
    Note over CC: slot = [uncond_video | cond_action] wrong!

    TR->>NET: "forward_action(network_cache_uncond, persist=True)"
    NET->>UC: "write_action(uncond_action_kv, video_kv=uncond_kv)"
    Note over UC: slot = [uncond_video | uncond_action] ✓
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant PL as pipeline.generate()
    participant TR as LingbotVATransformer
    participant NET as WanVADiTNetwork
    participant CC as network_cache (cond)
    participant UC as network_cache_uncond

    PL->>TR: "predict_flow(persist=True)"
    TR->>NET: "forward_video(network_cache, persist=True)"
    NET->>CC: write_video(cond_kv)
    NET-->>NET: "_last_video_kv = cond_kv ✓"

    TR->>NET: "forward_video(network_cache_uncond, persist=True)"
    NET->>UC: write_video(uncond_kv)
    NET-->>NET: "_last_video_kv = uncond_kv ← overwrites!"

    PL->>TR: "predict_action_flow(persist=True)"
    TR->>NET: "forward_action(network_cache, persist=True)"
    Note over NET: reads _last_video_kv = uncond_kv ❌
    NET->>CC: "write_action(cond_action_kv, video_kv=uncond_kv)"
    Note over CC: slot = [uncond_video | cond_action] wrong!

    TR->>NET: "forward_action(network_cache_uncond, persist=True)"
    NET->>UC: "write_action(uncond_action_kv, video_kv=uncond_kv)"
    Note over UC: slot = [uncond_video | uncond_action] ✓
Loading

Reviews (7): Last reviewed commit: "Refactor LingBot-VA: move inference logi..." | Re-trigger Greptile

Comment thread integrations/lingbot_va/lingbot_va/runner.py Outdated
Comment thread integrations/lingbot_va/lingbot_va/runner.py Outdated
Comment thread integrations/lingbot_va/lingbot_va/runner.py Outdated
Comment thread integrations/lingbot_va/lingbot_va/runner.py
Comment thread integrations/lingbot_va/lingbot_va/runner.py Outdated
Comment thread flashdreams/pyproject.toml Outdated
@hyzhou404 hyzhou404 force-pushed the feat/lingbot-va branch 3 times, most recently from 523ba92 to 35ae585 Compare June 9, 2026 14:20
@hyzhou404

Copy link
Copy Markdown
Author

I previously mistakenly uploaded the pyproject.toml file for version 12.8; this has been fixed. Please ignore Greptile's description regarding the CUDA version change.

@jmccaffrey-nv

Copy link
Copy Markdown
Collaborator

/ok to test 35ae585

@jmccaffrey-nv

Copy link
Copy Markdown
Collaborator

Please add Apache-2.0 headers on new files which you authored and are contributing . Attribute any 3rd-party OSS files .
Please see CONTRIBUTING.md and https://github.com/NVIDIA/flashdreams/blob/main/skills/maintaining-oss-state/SKILL.md

reuse-lint / OSRB collateral sanity check (pull_request) Failing after 9s
Error: Add 'SPDX-License-Identifier: Apache-2.0' (and matching SPDX-FileCopyrightText) to the first 20 lines of this file.
Error: 15 source file(s) missing inline SPDX header

Comment thread integrations/lingbot_va/lingbot_va/transformer/impl/kvcache.py Outdated
@hyzhou404

Copy link
Copy Markdown
Author

Please add Apache-2.0 headers on new files which you authored and are contributing . Attribute any 3rd-party OSS files . Please see CONTRIBUTING.md and https://github.com/NVIDIA/flashdreams/blob/main/skills/maintaining-oss-state/SKILL.md

reuse-lint / OSRB collateral sanity check (pull_request) Failing after 9s Error: Add 'SPDX-License-Identifier: Apache-2.0' (and matching SPDX-FileCopyrightText) to the first 20 lines of this file. Error: 15 source file(s) missing inline SPDX header

Done

Comment thread integrations/lingbot_va/lingbot_va/transformer/__init__.py Outdated
影青 and others added 2 commits June 11, 2026 15:13
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove flash_attn stub that polluted sys.modules globally
- Inline prompt_clean to avoid depending on diffusers private API
- Fix KV cache to use left-shift eviction (aligned with official BlockKVCache)
- Keep _n_committed capped at window_slots in steady-state
- Remove duplicate _streaming_vae_half.vae.to("cpu") call

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
grid_id, self.config.network.dim // self.config.network.num_heads
).to(noisy_latent.device)

flow_cond = self.network.forward_video(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compile_module(net) returns an OptimizedModule that only compiles forward, but WanVADiTNetwork has no forward — the hot path calls net.forward_video / net.forward_action (transformer/init.py:200,228), which OptimizedModule delegates straight to the eager original module. So --compile-network True (default) does nothing, and the README's torch.compile-attributed 2.3× is not correct likely...

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once compile actually engages, this will likely cause torch compile to be triggered multiple times

    def n_cached_tokens(self) -> int:
        """Number of committed tokens visible to attention."""
        return min(self._n_committed, self.window_slots) * self.slot_size

Please follow the existing kvcache.py design. We have a good solution for that already.

nn.SiLU(),
nn.Linear(self.dim, self.dim * 6),
)
self.action_text_embedding = nn.Sequential(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this action_text_embedding seems never used? is this intentional?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is inherited from the released Lingbot-VA checkpoint. My code faithfully loads these weights to stay compatible with the pretrained checkpoint, but they don't participate in any computation. Leaving it as-is for now.

raise NotImplementedError(
"LingBot-VA Robotwin cache initialization awaits the native DiT/VAE "
"port. Use --no-instantiate or CPU tests for the scaffold stage."
)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not respect the overall pipeline design

The DiffusionModel(transformer+scheduler) are built, moved to GPU, and never used. Could we still follow the intended StreamInferencePipeline.generate/finalize design (as the sibling integrations do) and drop the dead pipeline/scheduler? That restores CP/profiling/streaming-decoder and leaves a single source of truth.

Comment thread integrations/lingbot_va/lingbot_va/runner.py
Comment thread integrations/lingbot_va/lingbot_va/pipeline.py Outdated
Comment thread integrations/lingbot_va/lingbot_va/runner.py Outdated
@hyzhou404 hyzhou404 requested a review from wilsonCernWq June 13, 2026 06:33
- Fix uncond KV cache desync: always run uncond forward when the cache
  exists, regardless of guidance scale. Previously the early-return
  skipped write_secondary on uncond cache under default config
  (action_guidance_scale=1.0), leaving zeroed action KV that silently
  corrupted CFG from chunk 1 onward.
- Share single VAE between streaming wrappers instead of loading a
  duplicate, halving VAE memory. Also fixes a pre-existing device
  mismatch bug under --enable-offload.
- Remove unused sink_size config field and constant.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@wilsonCernWq wilsonCernWq left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for quick response. This is just the second part of my review that I didn't finish yesterday!

q01 = self.q01_tensor()
q99 = self.q99_tensor()
denorm = (action_cpu + 1.0) / 2.0 * (q99 - q01 + 1e-6) + q01
return denorm[list(self.config.used_action_channel_ids)]

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will return something like (16, 320) as I printed?

But in the readme it says

| `actions.npy` | Predicted actions array, shape `(num_chunks × action_per_frame × frame_chunk_size, action_dim)`. |



def load_vae(vae_path: str, torch_dtype: torch.dtype, torch_device):
vae = AutoencoderKLWan.from_pretrained(vae_path, torch_dtype=torch_dtype)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this assumes user has checkpoint pre-downloaded on disk. I think this is useful for dev, but a bit hard to use for users who just want to run the test? Can we default to download from HF url? and then if user provides a --checkpoint-root, we overwrite it with a local path?

To download from URL, it needs kwargs like subfolder="text_encoder", I believe

| `--checkpoint-root` | str | `robbyant/lingbot-va-posttrain-robotwin` | Local path or HuggingFace repo ID for model weights. Must contain `transformer/`, `vae/`, `text_encoder/`, `tokenizer/` subdirs. |
| `--input-image-dir` | path | `assets/example_data/lingbot-va/robotwin` | Directory containing three observation camera PNGs (see below). |
| `--output-dir` | path | `outputs/lingbot_va/robotwin_i2av` | Where to write `demo.mp4`, `actions.npy`, `latents.pt`, and timing JSON. |
| `--prompt` | str | `"Grab the medium-sized white mug, rotate it, place it on the table, and hook it onto the smooth dark gray rack."` | Text prompt describing the manipulation task. Can also be a path to a `.txt` file. |

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this does not seem to match the actual implementation, it seems --prompt will actually read strings? could you double check?

runner_name=PIPELINE_LINGBOT_VA_ROBOTWIN_I2AV.name,
description=(
"LingBot-VA Robotwin I2AV inference scaffold "
"(three-camera Robotwin config; native DiT port pending)."

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "native DiT port pending" still correct? Maybe we can do a full pass of docstrings so that they are all consistent?

--output-dir outputs/lingbot_va/robotwin_i2av \
--checkpoint-root /path/to/lingbot-va-posttrain-robotwin \
--num-chunks 10 \
--benchmark True

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this integration support multi-GPU?

return torch.cat([grid_id, torch.full_like(grid_id[:1], t)], dim=0)


def data_seq_to_patch(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused code?

"""Return a copy with all non-Robotwin-used channels set to zero."""
masked = action.clone()
masked[:, ~self.action_mask(device=masked.device)] = 0
return masked

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this also seems unused?

q01 = self.q01_tensor(device=expanded.device)
q99 = self.q99_tensor(device=expanded.device)
expanded = (expanded - q01) / (q99 - q01 + 1e-6) * 2.0 - 1.0
return expanded.unsqueeze(0).unsqueeze(-1)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: also unused seems?

@liruilong940607

liruilong940607 commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

Thanks for bringing lingbot-va into flashdreams.

A general question: how to verify generation result is on-par with the official implementation?

For other integrations we have the parity check code like this https://github.com/NVIDIA/flashdreams/tree/main/integrations/lingbot/tests/parity_check

Which will not only produce generation results with the original code, but also dumps logs for runtime, so that anyone can reproduce the runtime speedup reported by the developer. It seems that you have done the comparison -- is it possible to organize your comparison into reproduceable code like other integrations? And share some visuals in this PR as the evidence of quality parity? (not sure what is the best way to verify the action output though)

@hyzhou404

hyzhou404 commented Jun 15, 2026

Copy link
Copy Markdown
Author

Thanks for bringing lingbot-va into flashdreams.

A general question: how to verify generation result is on-par with the official implementation?

For other integrations we have the parity check code like this https://github.com/NVIDIA/flashdreams/tree/main/integrations/lingbot/tests/parity_check

Which will not only produce generation results with the original code, but also dumps logs for runtime, so that anyone can reproduce the runtime speedup reported by the developer. It seems that you have done the comparison -- is it possible to organize your comparison into reproduceable code like other integrations? And share some visuals in this PR as the evidence of quality parity? (not sure what is the best way to verify the action output though)

Both the original code and my implementation exhibited some randomness that has been difficult to align. I am currently working on debugging the sources of this randomness and refactoring the code based on suggestions from Qi Wu and you. Thank you both for your careful and thorough review of my submission.

- Add LingbotVAInferencePipeline.generate() with dual video+action denoising
- Add action_scheduler config for separate action generation steps
- Move inference logic from runner to pipeline for better encapsulation
- Optimize transformer/kvcache implementations
- Update dependencies in pyproject.toml and uv.lock

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment on lines +384 to +421
if persist:
for block_idx, (k, v) in enumerate(zip(k_list, v_list)):
cache[block_idx].self_attn.write_video(k, v)
self._last_video_kv = list(zip(k_list, v_list))

return output

def forward_action(
self,
x: Tensor,
timesteps: Tensor,
cache: WanVADiTNetworkCache,
rope_freqs: Tensor,
persist: bool = False,
) -> Tensor:
"""Action-mode forward.

Cache extraction and writes happen outside the compiled block loop.

Returns:
Action flow ``[batch, L, action_dim]``.
"""
assert self._parameters_updated_after_loading_checkpoint

# Extract cache tensors (outside compile boundary)
committed_k, committed_v, cross_k, cross_v = self._extract_cache_tensors(cache)

# Compiled block loop (pure tensors, no cache access)
output, k_list, v_list = self._forward_blocks_action(
x, timesteps, committed_k, committed_v, cross_k, cross_v, rope_freqs,
)

# Cache write (outside compile boundary)
if persist:
for block_idx, (k, v) in enumerate(zip(k_list, v_list)):
vk, vv = self._last_video_kv[block_idx]
cache[block_idx].self_attn.write_action(k, v, vk, vv)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 _last_video_kv overwritten by uncond pass — cond cache written with wrong video KV

forward_video saves its fresh KV list into self._last_video_kv on the network instance (line 387). predict_flow calls forward_video first for the cond cache, then immediately for the uncond cache (both with persist=True), so _last_video_kv is overwritten with the uncond video KV before predict_action_flow is ever called.

When predict_action_flow then calls forward_action(cache.network_cache, persist=True), it reads self._last_video_kv (now holding uncond KV) and calls write_action(cond_action_k, cond_action_v, uncond_video_k, uncond_video_v), writing [uncond_video | cond_action] into the cond cache slot. From chunk 1 onward the cond cross-chunk attention attends over uncond video keys for all committed prior slots, silently corrupting CFG. This affects every run with the default guidance_scale=5.0 (which creates network_cache_uncond).

The fix is to store the last video KV on the cache object (per-cache, not per-network), so cond and uncond are never conflated. WanVADiTNetworkCache can hold a last_video_kv field (initialized to None) that each forward_video / forward_action call reads and writes on the relevant cache object.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants