Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
95 changes: 95 additions & 0 deletions .github/workflows/ltx2.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,95 @@
name: LTX-2 CI

# M2 deliverable: end-to-end T2V/I2V CPU inference verified on both
# Linux x86-64 (AVX) and macOS ARM64 (NEON) via a tiny synthetic GGUF stack
# (video DiT + Gemma-3 encoder + text projection), with no large weight
# download. The synthetic generate exercises the exact pipeline used for the
# real ~46 GB checkpoint.

on:
workflow_dispatch:
push:
branches:
- master
- ci
paths:
- ".github/workflows/ltx2.yml"
- "src/ltx2.hpp"
- "src/gemma3.hpp"
- "src/conditioner.hpp"
- "src/denoiser.hpp"
- "src/vae.hpp"
- "src/diffusion_model.hpp"
- "src/model.*"
- "src/stable-diffusion.cpp"
- "include/stable-diffusion.h"
- "script/convert_ltx2_to_gguf.py"
- "script/make_synthetic_ltx2_gguf.py"
- "script/ci_ltx2_load_smoke.sh"
pull_request:
types: [opened, synchronize, reopened]
paths:
- ".github/workflows/ltx2.yml"
- "src/ltx2.hpp"
- "src/gemma3.hpp"
- "src/conditioner.hpp"
- "src/denoiser.hpp"
- "src/vae.hpp"
- "src/diffusion_model.hpp"
- "src/model.*"
- "src/stable-diffusion.cpp"
- "include/stable-diffusion.h"
- "script/convert_ltx2_to_gguf.py"
- "script/make_synthetic_ltx2_gguf.py"
- "script/ci_ltx2_load_smoke.sh"

concurrency:
group: ltx2-${{ github.head_ref && github.ref || github.run_id }}
cancel-in-progress: true

jobs:
generate-smoke:
name: ${{ matrix.name }}
strategy:
fail-fast: false
matrix:
include:
- name: linux-x64-avx-generate-smoke
runs-on: ubuntu-latest
- name: macos-arm64-neon-generate-smoke
runs-on: macos-14
runs-on: ${{ matrix.runs-on }}
steps:
- name: Clone
uses: actions/checkout@v4
with:
submodules: recursive

- name: Dependencies (Linux)
if: runner.os == 'Linux'
run: |
sudo apt-get update
sudo apt-get install -y build-essential cmake

- name: Dependencies (macOS)
if: runner.os == 'macOS'
run: brew install cmake || true

- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.11"

- name: Install conversion deps
run: pip install numpy gguf

- name: Validate conversion tooling (filter self-test)
run: python script/convert_ltx2_to_gguf.py --self-test

- name: Build sd-cli
run: |
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j3 --target sd-cli

- name: LTX-2 end-to-end generate smoke test
run: bash script/ci_ltx2_load_smoke.sh
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -13,3 +13,5 @@ output*.png
models*
*.log
preview.png
__pycache__/
*.pyc
2 changes: 2 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -59,6 +59,7 @@ API and command-line option may change frequently.***
- [Qwen Image Edit series](./docs/qwen_image_edit.md)
- Video Models
- [Wan2.1/Wan2.2](./docs/wan.md)
- [LTX-2 (T2V/I2V)](./docs/ltx2.md) (work in progress)
- [PhotoMaker](https://github.com/TencentARC/PhotoMaker) support.
- Control Net support with SD 1.5
- LoRA support, same as [stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features#lora)
Expand Down Expand Up @@ -138,6 +139,7 @@ If you want to improve performance or reduce VRAM/RAM usage, please refer to [pe
- [🔥Qwen Image](./docs/qwen_image.md)
- [🔥Qwen Image Edit series](./docs/qwen_image_edit.md)
- [🔥Wan2.1/Wan2.2](./docs/wan.md)
- [LTX-2 (T2V/I2V)](./docs/ltx2.md) (work in progress)
- [🔥Z-Image](./docs/z_image.md)
- [Ovis-Image](./docs/ovis_image.md)
- [Anima](./docs/anima.md)
Expand Down
151 changes: 151 additions & 0 deletions docs/ltx2.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,151 @@
# LTX-2 (LTX-2.3) video generation

Support for [Lightricks LTX-2](https://huggingface.co/Lightricks) text-to-video
(T2V) and image-to-video (I2V) generation. Only the **video** stream is in
scope; the audio stream (audio DiT, audio VAE, vocoder) is intentionally not
converted or loaded.

> Status: **work in progress.** Milestone M1 (conversion + scaffolding + "loads
> on CPU") and Milestone M2 (core CPU inference) are implemented: the full
> T2V/I2V pipeline — Gemma-3 text encoder, video DiT denoising loop, the
> LinearQuadratic scheduler and the Video-VAE — runs end-to-end on CPU and
> writes a video file. See **Validation status** below for which pieces are
> numerically validated vs. structurally implemented pending reference parity.

## What works today (M1 + M2)

- Safetensors -> GGUF conversion tooling for the video-only DiT (+ VAE), at
`f16`, `q8_0`, `q5_1`, `q4_0`.
- LTX-2 architecture auto-detection in the model loader.
- The 14B video DiT (`AVTransformer3DModel`, video half) loads on CPU and runs a
full forward pass: `patchify_proj`, 3D RoPE, AdaLN-single modulation, gated
self/cross attention with RMS qk-norm, the gelu FFN, the video-embeddings
connector and `proj_out`. Geometry is inferred from the checkpoint shapes.
- A native **Gemma-3** text encoder (`src/gemma3.hpp`): GeGLU MLP, q/k RMSNorm,
`(1 + weight)` RMSNorm, scaled token embeddings, and sliding-window vs global
attention layers. The multi-layer feature extractor + projection feed the DiT
cross-attention.
- A **LinearQuadratic** flow-matching scheduler (`--scheduler linear_quadratic`).
- The LTX **CausalVideoAutoencoder** geometry (32x spatial / 8x temporal, 128
latent channels) and the T2V + I2V `generate_video` wiring, producing MJPEG
AVI / raw-frame output from the CLI.
- A CI smoke test that runs a full synthetic end-to-end generate on both
Linux x86-64 (AVX) and macOS ARM64 (NEON) without any large download.

## Usage

LTX-2 needs the converted LTX checkpoint (DiT + VAE + text projection) plus a
Gemma-3 GGUF for the text encoder:

```bash
# Text-to-video
./build/bin/sd-cli -M vid_gen \
--diffusion-model ltx-2.3-video-q8_0.gguf \
--llm gemma-3-12b.gguf \
-p "a corgi running on the beach at sunset" \
--scheduler linear_quadratic --steps 8 \
-W 512 -H 768 --video-frames 49 -o out

# Image-to-video (animate / bootstrap from a reference frame)
./build/bin/sd-cli -M vid_gen \
--diffusion-model ltx-2.3-video-q8_0.gguf --llm gemma-3-12b.gguf \
-i first_frame.png -p "the camera slowly pans right" \
--scheduler linear_quadratic --steps 8 \
-W 512 -H 768 --video-frames 49 -o out
```

LTX-2 targets the **distilled** checkpoint first (8 steps, CFG=1), so no
negative-prompt pass is needed. Frame counts are aligned to `(F-1) % 8 == 0` and
spatial dimensions to multiples of 32. Memory is staged per component (encode
text, then run the DiT, then decode) — use `--diffusion-fa` / quantized weights
to fit consumer RAM.

## Validation status (read before trusting output)

M2's goal is end-to-end CPU inference that runs and produces video of the right
shape; numerical parity (PSNR/SSIM) is an M3 acceptance criterion. The following
pieces are implemented structurally and **must be validated against the
Diffusers LTX-2 reference** before claiming quality parity:

- the exact assignment of the 9 per-block + 2 prompt AdaLN modulation channels,
the 3D RoPE axis split and `theta`, and the non-affine norm placement
(`src/ltx2.hpp`);
- the Gemma-3 RoPE theta / sliding-window pattern / query scaling, and a real
Gemma SentencePiece tokenizer (the current tokenizer is a byte-level
placeholder; `src/gemma3.hpp`, `Ltx2Conditioner`);
- the LTX multi-layer feature-extractor aggregation and the
`text_embedding_projection.video_*` tensor layout (`Ltx2TextProjection`);
- the **Video-VAE** is currently a shape-correct geometric placeholder
(`Ltx2VAERunner` in `src/vae.hpp`); the learned causal-conv encoder/decoder,
PixelNorm and per-channel statistics replace it next.

## Model conversion

The conversion script reads an LTX-2.3 `.safetensors` checkpoint, drops every
audio/vocoder tensor, and writes a video-only GGUF.

```bash
pip install -r script/requirements-ltx2.txt

# Inspect the keep/drop plan without writing anything (stdlib only):
python script/convert_ltx2_to_gguf.py --src ltx-2.3-22b-dev.safetensors --dry-run

# Convert to F16 (M1 deliverable):
python script/convert_ltx2_to_gguf.py \
--src ltx-2.3-22b-dev.safetensors \
--dst ltx-2.3-22b-video-f16.gguf --type f16

# Quantized variants:
python script/convert_ltx2_to_gguf.py --src ... --dst ...-q8_0.gguf --type q8_0
python script/convert_ltx2_to_gguf.py --src ... --dst ...-q4_0.gguf --type q4_0
```

Quantization notes:

- `f16` keeps every tensor in half precision (highest quality, largest file).
- `q8_0` / `q5_1` / `q4_0` quantize only the large 2D DiT linear weights; norms,
biases, modulation tables and the VAE stay in higher precision.

## Building

Follow the standard [build guide](./build.md). The CLI binary is `sd-cli`:

```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j --target sd-cli
```

## Verifying the full pipeline without the real weights

A tiny synthetic checkpoint generator exercises the entire pipeline in seconds.
It emits the exact DiT + Gemma-3 + projection tensor names at drastically
reduced dimensions (and a separate Gemma file), so the same code path that loads
the real weights runs a full generate.

```bash
# requires numpy + gguf (see script/requirements-ltx2.txt)
python script/make_synthetic_ltx2_gguf.py --out /tmp/ltx2_tiny.gguf
# writes /tmp/ltx2_tiny.gguf (DiT + projection) and /tmp/ltx2_tiny.gguf.gemma.gguf

bash script/ci_ltx2_load_smoke.sh # build + synthetic end-to-end generate
```

Expected: the log reports `Version: LTX-2`, the inferred Gemma-3 and DiT
geometry, `get_sigmas with LinearQuadratic scheduler`, `sampling completed`,
`decode_first_stage completed`, and writes an `.avi` video.

## CI

`.github/workflows/ltx2.yml` runs on Linux x86-64 (AVX) and macOS ARM64 (NEON):

1. validates the conversion filter (`--self-test`),
2. builds `sd-cli`,
3. runs the synthetic end-to-end generate smoke test.

## Scope

In scope: video DiT, Video-VAE encoder+decoder, Gemma-3 text encoder, a noise
scheduler + CFG, T2V and I2V, GGUF conversion, CLI, C API.

Out of scope: the audio stream (audio DiT, Audio-VAE, vocoder), training /
fine-tuning, the spatial upscaler, and video-to-video.
Loading