UniAR: Unified Multimodal Autoregressive Modeling with Shared Context

--Visual Tokenizer is Key to Unification

Wujian Peng1,2,*,#, Lingchen Meng3,*,‡, Yuxuan Cai3, Xianwei Zhuang3, Yuhuan Yang3, Rongyao Fang3,
Chenfei Wu3, Junyang Lin3, Zuxuan Wu1,2,†, Shuai Bai3,†

1Fudan University   2Shanghai Innovation Institute   3Qwen Team, Alibaba Inc.

*Equal Contributions   #Work done during internship at Qwen Team, Alibaba Inc.   ‡Project Lead   †Corresponding Authors

Introduction

UniAR is a unified autoregressive multimodal model that handles image understanding, image generation, and image editing in a single Transformer. Unlike prior unified models that rely on two separate visual tokenizers (splitting the representation space), UniAR uses a single discrete visual tokenizer as the key bridge between understanding and generation, enabling a shared context in which the model can directly interpret its own generated visual tokens without additional re-encoding.

Key design choices:

  • Multi-level BSQ tokenizer: fuses shallow (low-level detail) and deep (high-level semantic) visual features via lookup-free Binary Spherical Quantization, scaling the effective vocabulary to 2^64 codes with minimal overhead.
  • Parallel bitwise prediction: jointly predicts spatially grouped, multi-level visual codes per AR step, achieving a 32x visual compression ratio (a 1024x1024 image needs only 256 AR tokens).
  • DiT-based visual decoder: an SD3-medium transformer with semantic visual feature injection that reconstructs high-fidelity images from discrete visual tokens, with resolution upsampling support.
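
To make the tokenizer idea concrete, here is a minimal sketch of plain Binary Spherical Quantization on a single feature vector, plus the token arithmetic from the example above. It is not the UniAR tokenizer: the function names are hypothetical, the multi-level fusion and learned projections are left out, and how the 32x figure decomposes into downsampling and spatial grouping is not modeled. The last helper only does the arithmetic that follows from "1024x1024 image, 256 AR tokens".

```python
import math


def bsq_quantize(z):
    """Quantize one feature vector with Binary Spherical Quantization.

    Illustrative only: project z onto the unit sphere, then snap every
    coordinate to +1/sqrt(d) or -1/sqrt(d). No codebook lookup is needed,
    the code is fully described by d sign bits.
    """
    norm = math.sqrt(sum(x * x for x in z))
    if norm == 0.0:
        raise ValueError("cannot project the zero vector onto the unit sphere")
    unit = [x / norm for x in z]
    # One bit per dimension; a coordinate of exactly 0 is treated as negative.
    bits = [1 if x > 0 else 0 for x in unit]
    scale = 1.0 / math.sqrt(len(z))
    code = [scale if b else -scale for b in bits]
    # Pack the bits (most significant first) into a single integer index.
    index = 0
    for b in bits:
        index = (index << 1) | b
    return code, index


def vocabulary_size(num_bits):
    """Number of distinct codes addressable with num_bits binary dimensions."""
    return 2 ** num_bits


def pixels_per_token_side(height, width, num_ar_tokens):
    """Side length of the square pixel region covered by one AR token."""
    return math.isqrt((height * width) // num_ar_tokens)


if __name__ == "__main__":
    code, index = bsq_quantize([0.5, -2.0, 3.0, -0.1])
    print(code, index)
    print(vocabulary_size(64))
    print(pixels_per_token_side(1024, 1024, 256))
```

Because the code is implicit in the sign bits, growing the vocabulary only means adding bits, which is why a 64-bit code space stays cheap compared with a lookup table of the same size.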

News

  • [2026/06] Code and model weights released.
  • [2026/05] UniAR has been accepted to ICML 2026 🎉!

TODO

  • Release visual decoder training code.

Getting Started

Installation

conda create -n uniar python=3.12 -y
conda activate uniar

git clone https://github.com/ShareLab-SII/UniAR.git
cd UniAR
pip install -e .            # inference dependencies
pip install -e ".[train]"   # additional training dependencies (deepspeed, datasets, etc.)
pip install flash-attn --no-build-isolation  # highly recommended for faster attention

Requirements: Python 3.12, CUDA 12.1+, GPU with >= 24 GB VRAM for inference.
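
Before downloading checkpoints it can help to confirm the machine meets these requirements. The script below is a sketch, not part of the repository: `check_requirements` and `parse_version` are hypothetical helpers, and the 1 GiB tolerance is an assumption made because a nominal 24 GB card typically reports slightly less usable memory.

```python
import sys


def parse_version(text):
    """Turn a string such as '12.1' into (12, 1); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def check_requirements(python_version, cuda_version, vram_gb,
                       min_vram_gb=24.0, tolerance_gb=1.0):
    """Return a list of human-readable problems (empty if everything looks fine)."""
    problems = []
    if tuple(python_version[:2]) != (3, 12):
        problems.append("Python 3.12 is required")
    if cuda_version is None or parse_version(cuda_version) < (12, 1):
        problems.append("CUDA 12.1 or newer is required")
    if vram_gb is None or vram_gb < min_vram_gb - tolerance_gb:
        problems.append("a GPU with at least 24 GB of VRAM is required for inference")
    return problems


if __name__ == "__main__":
    cuda_version, vram_gb = None, None
    try:
        import torch  # only available once the dependencies are installed

        if torch.cuda.is_available():
            cuda_version = torch.version.cuda
            vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024 ** 3
    except ImportError:
        pass
    for problem in check_requirements(sys.version_info, cuda_version, vram_gb):
        print("WARNING:", problem)
```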

Checkpoints

Download the UniAR checkpoint (contains AR model + visual decoder components):

Component        Role
ar_model         Unified autoregressive model
bsq_encoder      BSQ quantized image tokenizer
sd3_transformer  SD3 transformer with visual feature injection
sd3_pipeline     SD3 pipeline (VAE + text encoders)

huggingface-cli download ShareLab-SII/UniAR-RL --local-dir checkpoints/UniAR-RL
huggingface-cli download ShareLab-SII/UniAR-SFT --local-dir checkpoints/UniAR-SFT
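
After the download finishes, a quick sanity check can catch an incomplete checkpoint early. The sketch below assumes that the four components listed above appear as top-level entries with those exact names inside the checkpoint directory; that layout is an assumption, so adjust `EXPECTED_COMPONENTS` if the actual checkpoint is organized differently.

```python
from pathlib import Path

# Assumed layout: one top-level entry per component listed in the table.
EXPECTED_COMPONENTS = ("ar_model", "bsq_encoder", "sd3_transformer", "sd3_pipeline")


def missing_components(checkpoint_dir):
    """Return the expected component names that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_COMPONENTS if not (root / name).exists()]


if __name__ == "__main__":
    for checkpoint in ("checkpoints/UniAR-RL", "checkpoints/UniAR-SFT"):
        missing = missing_components(checkpoint)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{checkpoint}: {status}")
```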

Inference

Image Understanding

conda activate uniar
python inference/chat.py \
    --model_path checkpoints/UniAR-RL \
    --image https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg \
    --prompt "Describe this image in detail."

Image Generation

conda activate uniar
python inference/generate.py \
    --model_path checkpoints/UniAR-RL \
    --prompt "A cute anime girl." \
    --output_path output.png

See docs/inference.md for the full parameter reference and advanced usage.
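
For a handful of prompts, the generation command above can be driven from a small wrapper. This is a sketch that uses only the three flags shown in this README (`--model_path`, `--prompt`, `--output_path`); the helper names are hypothetical, and it must be run from the repository root inside the `uniar` environment. Each call reloads the model, so for larger prompt sets the batch script described under Evaluation is the better fit.

```python
import subprocess
import sys
from pathlib import Path


def build_generate_cmd(model_path, prompt, output_path):
    """Build the argument list for one inference/generate.py invocation."""
    return [
        sys.executable, "inference/generate.py",
        "--model_path", str(model_path),
        "--prompt", prompt,
        "--output_path", str(output_path),
    ]


def generate_all(model_path, prompts, output_dir, dry_run=False):
    """Run one generation per prompt, writing 0000.png, 0001.png, ... to output_dir.

    With dry_run=True the commands are only built and returned, not executed.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    commands = []
    for i, prompt in enumerate(prompts):
        cmd = build_generate_cmd(model_path, prompt, output_dir / f"{i:04d}.png")
        commands.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)  # raises if generate.py exits non-zero
    return commands


if __name__ == "__main__":
    for cmd in generate_all("checkpoints/UniAR-RL",
                            ["A cute anime girl."], "outputs", dry_run=True):
        print(" ".join(cmd))
```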

Evaluation

We provide a unified batch inference script (inference/generate_batch.py) that supports multi-node, multi-GPU distributed generation via accelerate. All outputs are organized in a standard directory structure:

<output_path>/<run_name>/
├── 00000/
│   ├── metadata.json
│   └── samples/
│       ├── 0000.png
│       ├── 0001.png
│       └── ...
├── 00001/
│   └── ...

To evaluate on specific benchmarks, we provide conversion scripts under eval/convert_structure/ that transform our unified output layout into each benchmark's expected format.

See docs/evaluation.md for the full evaluation guide.
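
The output layout described above is easy to traverse when writing a custom converter. The sketch below walks one run directory and collects, per item, the parsed `metadata.json` and the sample images. The function name and the returned record shape are hypothetical, and the contents of `metadata.json` are treated as opaque JSON since its schema is not documented here.

```python
import json
from pathlib import Path


def collect_samples(run_dir):
    """Return one record per item directory under <output_path>/<run_name>/."""
    records = []
    for item_dir in sorted(p for p in Path(run_dir).iterdir() if p.is_dir()):
        meta_file = item_dir / "metadata.json"
        metadata = None
        if meta_file.is_file():
            metadata = json.loads(meta_file.read_text(encoding="utf-8"))
        samples_dir = item_dir / "samples"
        images = sorted(samples_dir.glob("*.png")) if samples_dir.is_dir() else []
        records.append({"id": item_dir.name, "metadata": metadata, "images": images})
    return records


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        for record in collect_samples(sys.argv[1]):
            print(record["id"], len(record["images"]), "image(s)")
```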

Training

Reinforcement Learning

UniAR uses GRPO with a multi-reward stack for reinforcement learning on image generation. The training system runs across multiple nodes: decode servers (BSQ visual codes → images), reward servers (scoring), and training nodes (AR rollout + GRPO updates).

See docs/training_rl.md for the complete setup guide.
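
For readers new to GRPO, its core step is scoring each sampled image relative to the other samples drawn for the same prompt. The sketch below shows that group-relative normalization in isolation. It is not the UniAR training code: combining the multi-reward stack as a weighted sum is an assumption, the function names are hypothetical, and implementations differ on details such as population versus sample standard deviation.

```python
import statistics


def combine_rewards(reward_vectors, weights):
    """Collapse per-sample reward vectors into scalars via a weighted sum.

    Assumption: the weighted sum stands in for however the real reward
    stack aggregates its individual scores.
    """
    return [sum(w * r for w, r in zip(weights, vector)) for vector in reward_vectors]


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group (all rollouts for the same prompt).

    Each advantage is (reward - group mean) / (group std + eps), so samples
    are rewarded for beating their siblings rather than for absolute score.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four rollouts for one prompt, each scored by two hypothetical reward models.
    scores = [[0.9, 0.2], [0.4, 0.6], [0.7, 0.7], [0.1, 0.3]]
    rewards = combine_rewards(scores, weights=[0.5, 0.5])
    print(group_relative_advantages(rewards))
```

When every rollout in a group receives the same reward, all advantages are zero and that group contributes no policy gradient signal.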

Acknowledgements

UniAR builds upon several excellent open-source projects:

We also thank the benchmark authors for their evaluation tools: GenEval, OneIG-Bench, LongText-Bench, ImgEdit.

Citation

If you find UniAR useful in your research, please consider citing:

@inproceedings{peng2026uniar,
  title={Unified Multimodal Autoregressive Modeling with Shared Context --- Visual Tokenizer is Key to Unification},
  author={Peng, Wujian and Meng, Lingchen and Cai, Yuxuan and Zhuang, Xianwei and Yang, Yuhuan and Fang, Rongyao and Wu, Chenfei and Lin, Junyang and Wu, Zuxuan and Bai, Shuai},
  booktitle={ICML},
  year={2026}
}

About

[ICML 2026] The official implementation of paper "Unified Multimodal Autoregressive Modeling with Shared Context—Visual Tokenizer is Key to Unification"
