Added support for Lingbot-VA#312
Conversation
Greptile SummaryThis PR adds a self-contained LingBot-VA Robotwin I2AV integration as a FlashDreams plugin, adapting the upstream video-action diffusion transformer into the standard runner/pipeline interface with
Confidence Score: 4/5The cond KV cache is written with uncond video keys on every persist step, silently corrupting CFG quality from chunk 1 onward in the default configuration. The CFG sync fix from the prior round moved both forward passes inside the network but left _last_video_kv as a single slot on the network object. The uncond forward_video call always runs second and always overwrites it, so every action write_action for the cond cache uses the wrong video KV. This affects every multi-chunk run with guidance_scale > 1.0, which is the default. The rest of the integration — scheduler, kvcache ordering, memory cleanup, model loading — is clean. integrations/lingbot_va/lingbot_va/transformer/impl/network.py and integrations/lingbot_va/lingbot_va/transformer/init.py — the _last_video_kv write/read pair needs to be scoped per cache object rather than on the shared network. Important Files Changed
Sequence Diagram%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant PL as pipeline.generate()
participant TR as LingbotVATransformer
participant NET as WanVADiTNetwork
participant CC as network_cache (cond)
participant UC as network_cache_uncond
PL->>TR: "predict_flow(persist=True)"
TR->>NET: "forward_video(network_cache, persist=True)"
NET->>CC: write_video(cond_kv)
NET-->>NET: "_last_video_kv = cond_kv ✓"
TR->>NET: "forward_video(network_cache_uncond, persist=True)"
NET->>UC: write_video(uncond_kv)
NET-->>NET: "_last_video_kv = uncond_kv ← overwrites!"
PL->>TR: "predict_action_flow(persist=True)"
TR->>NET: "forward_action(network_cache, persist=True)"
Note over NET: reads _last_video_kv = uncond_kv ❌
NET->>CC: "write_action(cond_action_kv, video_kv=uncond_kv)"
Note over CC: slot = [uncond_video | cond_action] wrong!
TR->>NET: "forward_action(network_cache_uncond, persist=True)"
NET->>UC: "write_action(uncond_action_kv, video_kv=uncond_kv)"
Note over UC: slot = [uncond_video | uncond_action] ✓
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
participant PL as pipeline.generate()
participant TR as LingbotVATransformer
participant NET as WanVADiTNetwork
participant CC as network_cache (cond)
participant UC as network_cache_uncond
PL->>TR: "predict_flow(persist=True)"
TR->>NET: "forward_video(network_cache, persist=True)"
NET->>CC: write_video(cond_kv)
NET-->>NET: "_last_video_kv = cond_kv ✓"
TR->>NET: "forward_video(network_cache_uncond, persist=True)"
NET->>UC: write_video(uncond_kv)
NET-->>NET: "_last_video_kv = uncond_kv ← overwrites!"
PL->>TR: "predict_action_flow(persist=True)"
TR->>NET: "forward_action(network_cache, persist=True)"
Note over NET: reads _last_video_kv = uncond_kv ❌
NET->>CC: "write_action(cond_action_kv, video_kv=uncond_kv)"
Note over CC: slot = [uncond_video | cond_action] wrong!
TR->>NET: "forward_action(network_cache_uncond, persist=True)"
NET->>UC: "write_action(uncond_action_kv, video_kv=uncond_kv)"
Note over UC: slot = [uncond_video | uncond_action] ✓
Reviews (7): Last reviewed commit: "Refactor LingBot-VA: move inference logi..." | Re-trigger Greptile |
523ba92 to
35ae585
Compare
|
I previously mistakenly uploaded the pyproject.toml file for version 12.8; this has been fixed. Please ignore Greptile's description regarding the CUDA version change. |
|
/ok to test 35ae585 |
|
Please add Apache-2.0 headers on new files which you authored and are contributing . Attribute any 3rd-party OSS files . reuse-lint / OSRB collateral sanity check (pull_request) Failing after 9s |
35ae585 to
d05ddeb
Compare
Done |
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove flash_attn stub that polluted sys.modules globally
- Inline prompt_clean to avoid depending on diffusers private API
- Fix KV cache to use left-shift eviction (aligned with official BlockKVCache)
- Keep _n_committed capped at window_slots in steady-state
- Remove duplicate _streaming_vae_half.vae.to("cpu") call
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
beb1238 to
755b105
Compare
| grid_id, self.config.network.dim // self.config.network.num_heads | ||
| ).to(noisy_latent.device) | ||
|
|
||
| flow_cond = self.network.forward_video( |
There was a problem hiding this comment.
compile_module(net) returns an OptimizedModule that only compiles forward, but WanVADiTNetwork has no forward — the hot path calls net.forward_video / net.forward_action (transformer/init.py:200,228), which OptimizedModule delegates straight to the eager original module. So --compile-network True (default) does nothing, and the README's torch.compile-attributed 2.3× is not correct likely...
There was a problem hiding this comment.
Once compile actually engages, this will likely cause torch compile to be triggered multiple times
def n_cached_tokens(self) -> int:
"""Number of committed tokens visible to attention."""
return min(self._n_committed, self.window_slots) * self.slot_size
Please follow the existing kvcache.py design. We have a good solution for that already.
| nn.SiLU(), | ||
| nn.Linear(self.dim, self.dim * 6), | ||
| ) | ||
| self.action_text_embedding = nn.Sequential( |
There was a problem hiding this comment.
this action_text_embedding seems never used? is this intentional?
There was a problem hiding this comment.
This is inherited from the released Lingbot-VA checkpoint. My code faithfully loads these weights to stay compatible with the pretrained checkpoint, but they don't participate in any computation. Leaving it as-is for now.
| raise NotImplementedError( | ||
| "LingBot-VA Robotwin cache initialization awaits the native DiT/VAE " | ||
| "port. Use --no-instantiate or CPU tests for the scaffold stage." | ||
| ) |
There was a problem hiding this comment.
This does not respect the overall pipeline design
The DiffusionModel(transformer+scheduler) are built, moved to GPU, and never used. Could we still follow the intended StreamInferencePipeline.generate/finalize design (as the sibling integrations do) and drop the dead pipeline/scheduler? That restores CP/profiling/streaming-decoder and leaves a single source of truth.
- Fix uncond KV cache desync: always run uncond forward when the cache exists, regardless of guidance scale. Previously the early-return skipped write_secondary on uncond cache under default config (action_guidance_scale=1.0), leaving zeroed action KV that silently corrupted CFG from chunk 1 onward. - Share single VAE between streaming wrappers instead of loading a duplicate, halving VAE memory. Also fixes a pre-existing device mismatch bug under --enable-offload. - Remove unused sink_size config field and constant. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
wilsonCernWq
left a comment
There was a problem hiding this comment.
Thanks for quick response. This is just the second part of my review that I didn't finish yesterday!
| q01 = self.q01_tensor() | ||
| q99 = self.q99_tensor() | ||
| denorm = (action_cpu + 1.0) / 2.0 * (q99 - q01 + 1e-6) + q01 | ||
| return denorm[list(self.config.used_action_channel_ids)] |
There was a problem hiding this comment.
I think this will return something like (16, 320) as I printed?
But in the readme it says
| `actions.npy` | Predicted actions array, shape `(num_chunks × action_per_frame × frame_chunk_size, action_dim)`. |
|
|
||
|
|
||
| def load_vae(vae_path: str, torch_dtype: torch.dtype, torch_device): | ||
| vae = AutoencoderKLWan.from_pretrained(vae_path, torch_dtype=torch_dtype) |
There was a problem hiding this comment.
I think this assumes user has checkpoint pre-downloaded on disk. I think this is useful for dev, but a bit hard to use for users who just want to run the test? Can we default to download from HF url? and then if user provides a --checkpoint-root, we overwrite it with a local path?
To download from URL, it needs kwargs like subfolder="text_encoder", I believe
| | `--checkpoint-root` | str | `robbyant/lingbot-va-posttrain-robotwin` | Local path or HuggingFace repo ID for model weights. Must contain `transformer/`, `vae/`, `text_encoder/`, `tokenizer/` subdirs. | | ||
| | `--input-image-dir` | path | `assets/example_data/lingbot-va/robotwin` | Directory containing three observation camera PNGs (see below). | | ||
| | `--output-dir` | path | `outputs/lingbot_va/robotwin_i2av` | Where to write `demo.mp4`, `actions.npy`, `latents.pt`, and timing JSON. | | ||
| | `--prompt` | str | `"Grab the medium-sized white mug, rotate it, place it on the table, and hook it onto the smooth dark gray rack."` | Text prompt describing the manipulation task. Can also be a path to a `.txt` file. | |
There was a problem hiding this comment.
this does not seem to match the actual implementation, it seems --prompt will actually read strings? could you double check?
| runner_name=PIPELINE_LINGBOT_VA_ROBOTWIN_I2AV.name, | ||
| description=( | ||
| "LingBot-VA Robotwin I2AV inference scaffold " | ||
| "(three-camera Robotwin config; native DiT port pending)." |
There was a problem hiding this comment.
Is "native DiT port pending" still correct? Maybe we can do a full pass of docstrings so that they are all consistent?
| --output-dir outputs/lingbot_va/robotwin_i2av \ | ||
| --checkpoint-root /path/to/lingbot-va-posttrain-robotwin \ | ||
| --num-chunks 10 \ | ||
| --benchmark True |
There was a problem hiding this comment.
Will this integration support multi-GPU?
| return torch.cat([grid_id, torch.full_like(grid_id[:1], t)], dim=0) | ||
|
|
||
|
|
||
| def data_seq_to_patch( |
| """Return a copy with all non-Robotwin-used channels set to zero.""" | ||
| masked = action.clone() | ||
| masked[:, ~self.action_mask(device=masked.device)] = 0 | ||
| return masked |
There was a problem hiding this comment.
nit: this also seems unused?
| q01 = self.q01_tensor(device=expanded.device) | ||
| q99 = self.q99_tensor(device=expanded.device) | ||
| expanded = (expanded - q01) / (q99 - q01 + 1e-6) * 2.0 - 1.0 | ||
| return expanded.unsqueeze(0).unsqueeze(-1) |
There was a problem hiding this comment.
nit: also unused seems?
|
Thanks for bringing lingbot-va into flashdreams. A general question: how to verify generation result is on-par with the official implementation? For other integrations we have the parity check code like this https://github.com/NVIDIA/flashdreams/tree/main/integrations/lingbot/tests/parity_check Which will not only produce generation results with the original code, but also dumps logs for runtime, so that anyone can reproduce the runtime speedup reported by the developer. It seems that you have done the comparison -- is it possible to organize your comparison into reproduceable code like other integrations? And share some visuals in this PR as the evidence of quality parity? (not sure what is the best way to verify the action output though) |
Both the original code and my implementation exhibited some randomness that has been difficult to align. I am currently working on debugging the sources of this randomness and refactoring the code based on suggestions from Qi Wu and you. Thank you both for your careful and thorough review of my submission. |
- Add LingbotVAInferencePipeline.generate() with dual video+action denoising - Add action_scheduler config for separate action generation steps - Move inference logic from runner to pipeline for better encapsulation - Optimize transformer/kvcache implementations - Update dependencies in pyproject.toml and uv.lock Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
| if persist: | ||
| for block_idx, (k, v) in enumerate(zip(k_list, v_list)): | ||
| cache[block_idx].self_attn.write_video(k, v) | ||
| self._last_video_kv = list(zip(k_list, v_list)) | ||
|
|
||
| return output | ||
|
|
||
| def forward_action( | ||
| self, | ||
| x: Tensor, | ||
| timesteps: Tensor, | ||
| cache: WanVADiTNetworkCache, | ||
| rope_freqs: Tensor, | ||
| persist: bool = False, | ||
| ) -> Tensor: | ||
| """Action-mode forward. | ||
|
|
||
| Cache extraction and writes happen outside the compiled block loop. | ||
|
|
||
| Returns: | ||
| Action flow ``[batch, L, action_dim]``. | ||
| """ | ||
| assert self._parameters_updated_after_loading_checkpoint | ||
|
|
||
| # Extract cache tensors (outside compile boundary) | ||
| committed_k, committed_v, cross_k, cross_v = self._extract_cache_tensors(cache) | ||
|
|
||
| # Compiled block loop (pure tensors, no cache access) | ||
| output, k_list, v_list = self._forward_blocks_action( | ||
| x, timesteps, committed_k, committed_v, cross_k, cross_v, rope_freqs, | ||
| ) | ||
|
|
||
| # Cache write (outside compile boundary) | ||
| if persist: | ||
| for block_idx, (k, v) in enumerate(zip(k_list, v_list)): | ||
| vk, vv = self._last_video_kv[block_idx] | ||
| cache[block_idx].self_attn.write_action(k, v, vk, vv) | ||
|
|
There was a problem hiding this comment.
_last_video_kv overwritten by uncond pass — cond cache written with wrong video KV
forward_video saves its fresh KV list into self._last_video_kv on the network instance (line 387). predict_flow calls forward_video first for the cond cache, then immediately for the uncond cache (both with persist=True), so _last_video_kv is overwritten with the uncond video KV before predict_action_flow is ever called.
When predict_action_flow then calls forward_action(cache.network_cache, persist=True), it reads self._last_video_kv (now holding uncond KV) and calls write_action(cond_action_k, cond_action_v, uncond_video_k, uncond_video_v), writing [uncond_video | cond_action] into the cond cache slot. From chunk 1 onward the cond cross-chunk attention attends over uncond video keys for all committed prior slots, silently corrupting CFG. This affects every run with the default guidance_scale=5.0 (which creates network_cache_uncond).
The fix is to store the last video KV on the cache object (per-cache, not per-network), so cond and uncond are never conflated. WanVADiTNetworkCache can hold a last_video_kv field (initialized to None) that each forward_video / forward_action call reads and writes on the relevant cache object.
Summary