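Wujian Peng<sup>1,2,*,#</sup>, Lingchen Meng<sup>3,*,‡</sup>, Yuxuan Cai<sup>3</sup>, Xianwei Zhuang<sup>3</sup>, Yuhuan Yang<sup>3</sup>, Rongyao Fang<sup>3</sup>, Chenfei Wu<sup>3</sup>, Junyang Lin<sup>3</sup>, Zuxuan Wu<sup>1,2,†</sup>, Shuai Bai<sup>3,†</sup>
<sup>1</sup>Fudan University, <sup>2</sup>Shanghai Innovation Institute, <sup>3</sup>Qwen Team, Alibaba Inc.
<sup>*</sup>Equal contributions. <sup>#</sup>Work done during an internship at Qwen Team, Alibaba Inc. <sup>‡</sup>Project lead. <sup>†</sup>Corresponding authors.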
UniAR is a unified autoregressive multimodal model that handles image understanding, image generation, and image editing in a single Transformer. Unlike prior unified models that rely on two separate visual tokenizers (splitting the representation space), UniAR uses a single discrete visual tokenizer as the key bridge between understanding and generation, enabling a shared context in which the model can directly interpret its own generated visual tokens without additional re-encoding.
Key design choices:
- Multi-level BSQ tokenizer: fuses shallow (low-level detail) and deep (high-level semantic) visual features via lookup-free Binary Spherical Quantization, scaling the effective vocabulary to 2^64 codes with minimal overhead.
- Parallel bitwise prediction: jointly predicts spatially grouped, multi-level visual codes per AR step, achieving a 32x visual compression ratio (a 1024x1024 image needs only 256 AR tokens).
- DiT-based visual decoder: an SD3-medium transformer with semantic visual feature injection that reconstructs high-fidelity images from discrete visual tokens, with resolution upsampling support.
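To make the lookup-free quantization mentioned above more concrete, here is a minimal sketch of Binary Spherical Quantization applied to a single feature vector. It is an illustration only, not UniAR's tokenizer code: the function names `bsq_quantize` and `bits_to_index` are hypothetical, the learned projection that normally precedes quantization is omitted, and the multi-level feature fusion is not modeled.

```python
import math

def bsq_quantize(features):
    """Quantize one feature vector with Binary Spherical Quantization.

    The vector is L2-normalized onto the unit sphere, then each coordinate
    is snapped to +1/sqrt(d) or -1/sqrt(d). No codebook lookup is needed:
    the sign pattern itself is the code.
    """
    d = len(features)
    norm = math.sqrt(sum(x * x for x in features)) or 1.0  # guard against a zero vector
    unit = [x / norm for x in features]
    bits = [1 if x >= 0 else 0 for x in unit]
    scale = 1.0 / math.sqrt(d)
    quantized = [scale if b else -scale for b in bits]  # still a unit-norm vector
    return bits, quantized

def bits_to_index(bits):
    """Pack the bit pattern into the integer id of the implicit codebook entry."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return index

# Toy 4-dimensional example; a d-bit code has 2**d possible values.
bits, quantized = bsq_quantize([0.3, -1.2, 0.7, -0.1])
print(bits, bits_to_index(bits))
print(2 ** 64)  # size of the implicit codebook for a 64-bit code
```

Because the codebook is implicit, growing the vocabulary only means adding bits to the code rather than storing and searching a larger embedding table.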
- [2026/06] Code and model weights released.
- [2026/05] UniAR has been accepted to ICML 2026 🎉
- [TODO] Release visual decoder training code.
conda create -n uniar python=3.12 -y
conda activate uniar
git clone https://github.com/ShareLab-SII/UniAR.git
cd UniAR
pip install -e . # inference dependencies
pip install -e ".[train]" # additional training dependencies (deepspeed, datasets, etc.)
pip install flash-attn --no-build-isolation # highly recommended for faster attention
Requirements: Python 3.12, CUDA 12.1+, and a GPU with >= 24 GB of VRAM for inference.
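The stated requirements can be checked before running inference with a short script. The sketch below is an assumption-laden helper, not part of the UniAR repository: `check_requirements` and `detect_environment` are hypothetical names, the CUDA version read is the one PyTorch was built against (not the driver version), and the VRAM threshold is relaxed slightly because cards sold as 24 GB typically report a little less.

```python
import sys

def check_requirements(python_version, cuda_version, vram_gb):
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems = []
    if tuple(python_version[:2]) != (3, 12):
        problems.append(f"Python 3.12 expected, found {python_version[0]}.{python_version[1]}")
    if cuda_version is None or tuple(cuda_version) < (12, 1):
        problems.append(f"CUDA 12.1+ expected, found {cuda_version}")
    # Assumed tolerance: nominal 24 GB cards usually report slightly under 24 GiB.
    if vram_gb is None or vram_gb < 23.5:
        problems.append(f"GPU with >= 24 GB VRAM expected, found {vram_gb}")
    return problems

def detect_environment():
    """Collect versions from the running interpreter; GPU info needs PyTorch."""
    cuda_version, vram_gb = None, None
    try:
        import torch
        if torch.cuda.is_available() and torch.version.cuda:
            # torch.version.cuda is a string such as "12.1"
            cuda_version = tuple(int(p) for p in torch.version.cuda.split(".")[:2])
            vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024 ** 3
    except ImportError:
        pass  # PyTorch not installed yet; GPU checks will be reported as failing
    return sys.version_info[:2], cuda_version, vram_gb

if __name__ == "__main__":
    for problem in check_requirements(*detect_environment()):
        print("WARNING:", problem)
```

Keeping the comparison logic in a pure function makes it easy to test without a GPU present.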
Download the UniAR checkpoint (contains AR model + visual decoder components):
| Component | Role |
|---|---|
| ar_model | Unified autoregressive model |
| bsq_encoder | BSQ quantized image tokenizer |
| sd3_transformer | SD3 transformer with visual feature injection |
| sd3_pipeline | SD3 pipeline (VAE + text encoders) |
huggingface-cli download ShareLab-SII/UniAR-RL --local-dir checkpoints/UniAR-RL
huggingface-cli download ShareLab-SII/UniAR-SFT --local-dir checkpoints/UniAR-SFT
conda activate uniar
python inference/chat.py \
--model_path checkpoints/UniAR-RL \
--image https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg \
--prompt "Describe this image in detail."
conda activate uniar
python inference/generate.py \
--model_path checkpoints/UniAR-RL \
--prompt "A cute anime girl." \
--output_path output.png
See docs/inference.md for the full parameter reference and advanced usage.
We provide a unified batch inference script (inference/generate_batch.py) that supports multi-node, multi-GPU distributed generation via accelerate. All outputs are organized in a standard directory structure:
<output_path>/<run_name>/
├── 00000/
│   ├── metadata.json
│   └── samples/
│       ├── 0000.png
│       ├── 0001.png
│       └── ...
├── 00001/
│   └── ...
To evaluate on specific benchmarks, we provide conversion scripts under eval/convert_structure/ that transform our unified output layout into each benchmark's expected format.
See docs/evaluation.md for the full evaluation guide.
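For custom post-processing, the output layout described above can be traversed with a few lines of Python. This is a rough sketch rather than one of the provided conversion scripts: `iter_prompt_dirs` is a hypothetical helper, and the contents of `metadata.json` are not assumed (the `{"prompt": "demo"}` payload in the demo is made up for illustration).

```python
import json
import tempfile
from pathlib import Path

def iter_prompt_dirs(run_dir):
    """Yield (prompt_dir, metadata, sample_paths) for each numbered sub-directory."""
    for prompt_dir in sorted(Path(run_dir).iterdir()):
        meta_file = prompt_dir / "metadata.json"
        if not prompt_dir.is_dir() or not meta_file.is_file():
            continue  # skip stray files and incomplete directories
        metadata = json.loads(meta_file.read_text(encoding="utf-8"))
        samples = sorted((prompt_dir / "samples").glob("*.png"))
        yield prompt_dir, metadata, samples

# Demo on a throwaway directory that mimics <output_path>/<run_name>/.
with tempfile.TemporaryDirectory() as tmp:
    samples_dir = Path(tmp) / "00000" / "samples"
    samples_dir.mkdir(parents=True)
    # Hypothetical metadata payload; the real schema is defined by the repository.
    (samples_dir.parent / "metadata.json").write_text(
        json.dumps({"prompt": "demo"}), encoding="utf-8"
    )
    (samples_dir / "0000.png").touch()
    for prompt_dir, metadata, samples in iter_prompt_dirs(tmp):
        print(prompt_dir.name, metadata, [p.name for p in samples])
```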
UniAR uses GRPO with a multi-reward stack for reinforcement learning on image generation. The training system runs across multiple nodes: decode servers (BSQ visual codes → images), reward servers (scoring), and training nodes (AR rollout + GRPO updates).
See docs/training_rl.md for the complete setup guide.
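As a rough illustration of the GRPO step, the sketch below computes group-relative advantages for several rollouts of the same prompt. It is not UniAR's training code: the function names are hypothetical, the reward names and weights are invented, and combining the reward stack as a weighted sum is an assumption made here for simplicity.

```python
import statistics

def combine_rewards(reward_scores, weights):
    """Weighted sum of named reward scores for one sample (assumed combination rule)."""
    return sum(weights[name] * score for name, score in reward_scores.items())

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of rollouts.

    Each advantage is (r - mean) / (std + eps), so a rollout is judged
    relative to its siblings rather than by a learned value function.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical reward names and weights, for illustration only.
weights = {"aesthetic": 0.5, "alignment": 0.5}
group = [
    {"aesthetic": 0.2, "alignment": 0.2},
    {"aesthetic": 0.9, "alignment": 0.7},
    {"aesthetic": 0.5, "alignment": 0.5},
    {"aesthetic": 0.4, "alignment": 0.6},
]
rewards = [combine_rewards(scores, weights) for scores in group]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Rollouts scoring above the group mean receive positive advantages and are reinforced; if every rollout in a group scores the same, all advantages are zero and that group contributes no gradient signal.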
UniAR builds upon several excellent open-source projects:
We also thank the benchmark authors for their evaluation tools: GenEval, OneIG-Bench, LongText-Bench, ImgEdit.
If you find UniAR useful in your research, please consider citing:
@inproceedings{peng2026uniar,
title={Unified Multimodal Autoregressive Modeling with Shared Context --- Visual Tokenizer is Key to Unification},
author={Peng, Wujian and Meng, Lingchen and Cai, Yuxuan and Zhuang, Xianwei and Yang, Yuhuan and Fang, Rongyao and Wu, Chenfei and Lin, Junyang and Wu, Zuxuan and Bai, Shuai},
booktitle={ICML},
year={2026}
}