GitHub - ShareLab-SII/UniAR: [ICML 2026] The official implementation of paper "Unified Multimodal Autoregressive Modeling with Shared Context—Visual Tokenizer is Key to Unification"

UniAR: Unified Multimodal Autoregressive Modeling with Shared Context

--Visual Tokenizer is Key to Unification

Wujian Peng^1,2,*,#, Lingchen Meng^3,*,‡, Yuxuan Cai^3, Xianwei Zhuang^3, Yuhuan Yang^3, Rongyao Fang^3,
Chenfei Wu^3, Junyang Lin^3, Zuxuan Wu^1,2,†, Shuai Bai^3,†

^1Fudan University ^2Shanghai Innovation Institute ^3Qwen Team, Alibaba Inc.

^*Equal Contributions ^#Work done during internship at Qwen Team, Alibaba Inc. ^‡Project Lead ^†Corresponding Authors

Introduction

UniAR is a unified autoregressive multimodal model that handles image understanding, image generation, and image editing in a single Transformer. Unlike prior unified models that rely on two separate visual tokenizers (splitting the representation space), UniAR uses a single discrete visual tokenizer as the key bridge between understanding and generation, enabling a shared context in which the model can directly interpret its own generated visual tokens without additional re-encoding.

Key design choices:

Multi-level BSQ tokenizer — fuses shallow (low-level detail) and deep (high-level semantic) visual features via lookup-free Binary Spherical Quantization, scaling the effective vocabulary to 2⁶⁴ codes with minimal overhead.
Parallel bitwise prediction — jointly predicts spatially grouped, multi-level visual codes per AR step, achieving a 32x visual compression ratio (a 1024x1024 image needs only 256 AR tokens).
DiT-based visual decoder — an SD3-medium transformer with semantic visual feature injection that reconstructs high-fidelity images from discrete visual tokens, with resolution upsampling support.
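
To make the 2⁶⁴ figure concrete: Binary Spherical Quantization projects a feature vector onto the unit sphere and keeps only the sign of each coordinate, so a d-dimensional code needs d bits and no stored codebook. The sketch below is a minimal pure-Python illustration of that idea, not the repository's tokenizer. The function name `bsq_quantize`, the single 64-dimensional latent, and the bit-packing order are assumptions made for illustration; how UniAR splits bits across feature levels is not stated here.

```python
import math
from typing import List, Tuple


def bsq_quantize(features: List[float]) -> Tuple[List[int], List[float], int]:
    """Binary Spherical Quantization of one feature vector (illustrative).

    Returns the bit pattern, the quantized unit vector, and the implicit
    integer code obtained by packing the bits (most significant bit first).
    """
    d = len(features)
    norm = math.sqrt(sum(x * x for x in features))
    if norm == 0.0:
        raise ValueError("cannot project the zero vector onto the sphere")
    # Step 1: project onto the unit hypersphere.
    unit = [x / norm for x in features]
    # Step 2: binarize each coordinate by its sign (zero maps to +1 here).
    bits = [1 if u >= 0 else 0 for u in unit]
    # Step 3: rescale so the quantized vector also lies on the unit sphere.
    scale = 1.0 / math.sqrt(d)
    quantized = [scale if b else -scale for b in bits]
    # The code index is implicit in the bits, so no codebook is stored.
    index = 0
    for b in bits:
        index = (index << 1) | b
    return bits, quantized, index


if __name__ == "__main__":
    # Hypothetical 64-dimensional latent: 64 bits address 2**64 distinct codes.
    latent = [math.sin(0.37 * i + 0.1) for i in range(64)]
    bits, quantized, index = bsq_quantize(latent)
    print(len(bits), 2 ** len(bits))
    print(round(sum(q * q for q in quantized), 6))  # squared norm of the code
```

Because the code index is just the packed bits, a model can emit 64 binary decisions per code instead of a distribution over 2⁶⁴ entries, which is what makes bitwise prediction a natural fit for this kind of tokenizer.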

News

[2026/06] Code and model weights released.
[2026/05] UniAR has been accepted to ICML 2026 🎉

TODO

Release visual decoder training code.

Getting Started

Installation

conda create -n uniar python=3.12 -y
conda activate uniar

git clone https://github.com/ShareLab-SII/UniAR.git
cd UniAR
pip install -e .            # inference dependencies
pip install -e ".[train]"   # additional training dependencies (deepspeed, datasets, etc.)
pip install flash-attn --no-build-isolation  # highly recommended for faster attention

Requirements: Python 3.12, CUDA 12.1+, GPU with >= 24 GB VRAM for inference.
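
A quick way to compare a machine against these requirements is a small check script. The sketch below is an illustration, not part of the repository. It assumes the stack is PyTorch-based (the install step lists deepspeed and flash-attn but does not name the framework), and the function name `check_environment` is hypothetical. The CUDA version is only reported, not compared against 12.1.

```python
import importlib.util
import sys
from typing import Dict, Optional

MIN_VRAM_GB = 24  # from the stated inference requirement


def check_environment() -> Dict[str, Optional[object]]:
    """Collect the facts needed to compare against the stated requirements."""
    report: Dict[str, Optional[object]] = {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "python_ok": sys.version_info[:2] == (3, 12),
        "cuda": None,
        "vram_gb": None,
        "vram_ok": None,
    }
    # Assumption: the stack is PyTorch-based; skip GPU checks if it is absent.
    if importlib.util.find_spec("torch") is None:
        return report
    import torch

    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        # Decimal gigabytes, so nominal 24 GB cards that report slightly
        # less than 24 GiB of usable memory still pass the check.
        vram_gb = props.total_memory / 1e9
        report["cuda"] = torch.version.cuda
        report["vram_gb"] = round(vram_gb, 1)
        report["vram_ok"] = vram_gb >= MIN_VRAM_GB
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

Only the first GPU is inspected; a multi-GPU machine would need a loop over `torch.cuda.device_count()`.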

Checkpoints

Download the UniAR checkpoint (contains AR model + visual decoder components):

Component	Role
`ar_model`	Unified autoregressive model
`bsq_encoder`	BSQ quantized image tokenizer
`sd3_transformer`	SD3 transformer with visual feature injection
`sd3_pipeline`	SD3 pipeline (VAE + text encoders)

huggingface-cli download ShareLab-SII/UniAR-RL --local-dir checkpoints/UniAR-RL
huggingface-cli download ShareLab-SII/UniAR-SFT --local-dir checkpoints/UniAR-SFT
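
The same download can be scripted from Python with `huggingface_hub.snapshot_download`, which takes a repo id of the form `namespace/name` rather than a full URL. The sketch below is a minimal example under that assumption. The helper names `repo_id_from_url` and `download_checkpoint` are hypothetical, and the `checkpoints/<name>` target simply mirrors the `--local-dir` values used here.

```python
from pathlib import Path
from urllib.parse import urlparse

HF_HOST = "huggingface.co"


def repo_id_from_url(url_or_id: str) -> str:
    """Turn a Hub model URL into the `namespace/name` repo id the Hub APIs expect."""
    parsed = urlparse(url_or_id)
    if parsed.scheme and parsed.netloc:
        if parsed.netloc != HF_HOST:
            raise ValueError(f"not a Hugging Face URL: {url_or_id}")
        parts = [p for p in parsed.path.split("/") if p]
    else:
        # Already a bare repo id such as "ShareLab-SII/UniAR-RL".
        parts = [p for p in url_or_id.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"expected namespace/name, got: {url_or_id}")
    return "/".join(parts[:2])


def download_checkpoint(url_or_id: str, root: str = "checkpoints") -> Path:
    """Download a full model snapshot into <root>/<name> and return that path."""
    # Imported lazily so the helper above works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    repo_id = repo_id_from_url(url_or_id)
    local_dir = Path(root) / repo_id.split("/")[1]
    snapshot_download(repo_id=repo_id, local_dir=str(local_dir))
    return local_dir
```

Calling `download_checkpoint("https://huggingface.co/ShareLab-SII/UniAR-RL")` would place the files under `checkpoints/UniAR-RL`, the path the inference commands expect.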

Inference

Image Understanding

conda activate uniar
python inference/chat.py \
    --model_path checkpoints/UniAR-RL \
    --image https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg \
    --prompt "Describe this image in detail."

Image Generation

conda activate uniar
python inference/generate.py \
    --model_path checkpoints/UniAR-RL \
    --prompt "A cute anime girl." \
    --output_path output.png

See docs/inference.md for the full parameter reference and advanced usage.
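
For a handful of prompts, the generation command above can be driven from a short script. The sketch below only uses the three flags shown in this section (`--model_path`, `--prompt`, `--output_path`); the helper names and the `output_NNN.png` naming scheme are assumptions, and any further options would need to be checked against docs/inference.md.

```python
import shlex
import sys
from typing import List


def build_generate_command(prompt: str, output_path: str,
                           model_path: str = "checkpoints/UniAR-RL") -> List[str]:
    """Build the argv for one inference/generate.py call (flags from the README)."""
    return [
        sys.executable, "inference/generate.py",
        "--model_path", model_path,
        "--prompt", prompt,
        "--output_path", output_path,
    ]


def build_batch(prompts: List[str], prefix: str = "output") -> List[List[str]]:
    """One command per prompt, writing to <prefix>_000.png, <prefix>_001.png, ..."""
    return [
        build_generate_command(p, f"{prefix}_{i:03d}.png")
        for i, p in enumerate(prompts)
    ]


if __name__ == "__main__":
    for cmd in build_batch(["A cute anime girl.", "A red bicycle in the rain."]):
        # Print a copy-pasteable shell line; swap in subprocess.run(cmd, check=True)
        # from the repository root to actually execute it.
        print(shlex.join(cmd))
```

Each command starts a fresh process and reloads the model, so this pattern only suits small jobs; larger runs are better served by the batch script described under Evaluation.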

Evaluation

We provide a unified batch inference script (inference/generate_batch.py) that supports multi-node, multi-GPU distributed generation via accelerate. All outputs are organized in a standard directory structure:

<output_path>/<run_name>/
├── 00000/
│   ├── metadata.json
│   └── samples/
│       ├── 0000.png
│       ├── 0001.png
│       └── ...
├── 00001/
│   └── ...

To evaluate on specific benchmarks, we provide conversion scripts under eval/convert_structure/ that transform our unified output layout into each benchmark's expected format.
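
The directory layout above is regular enough to traverse with a few lines of Python, which is roughly the first step any converter has to take. The sketch below is not one of the provided conversion scripts: the function name `iter_run` is hypothetical, and `metadata.json` is loaded without assuming any particular keys, since its schema is not described here.

```python
import json
from pathlib import Path
from typing import Any, Iterator, List, Tuple


def iter_run(run_dir: Path) -> Iterator[Tuple[str, Any, List[Path]]]:
    """Yield (prompt_id, metadata, sorted sample paths) for each prompt folder."""
    for prompt_dir in sorted(p for p in run_dir.iterdir() if p.is_dir()):
        meta_file = prompt_dir / "metadata.json"
        # metadata.json is loaded as-is; its keys are not assumed here.
        metadata = json.loads(meta_file.read_text()) if meta_file.exists() else {}
        # Sample images are named 0000.png, 0001.png, ... so sorting by
        # name keeps them in generation order.
        samples = sorted((prompt_dir / "samples").glob("*.png"))
        yield prompt_dir.name, metadata, samples
```

A call such as `for prompt_id, metadata, samples in iter_run(Path(output_path) / run_name): ...` then gives a converter everything it needs to copy or rename files into a benchmark's expected format.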

See docs/evaluation.md for the full evaluation guide.

Training

Reinforcement Learning

UniAR uses GRPO with a multi-reward stack for reinforcement learning on image generation. The training system runs across multiple nodes: decode servers (BSQ visual codes → images), reward servers (scoring), and training nodes (AR rollout + GRPO updates).
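
The core of a GRPO update is the group-relative advantage: several images are sampled for the same prompt, each is scored, and the scores are standardized within that group. The sketch below illustrates that step only. It is not the repository's training code; the reward names, the weighted-sum combination, and the use of the population standard deviation are assumptions made for illustration.

```python
import statistics
from typing import Dict, List


def combine_rewards(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-reward scores (the weighting scheme is an assumption)."""
    return sum(weights[name] * scores[name] for name in weights)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    mean = statistics.fmean(rewards)
    # Population standard deviation; some implementations use the sample one.
    std = statistics.pstdev(rewards)
    # eps keeps the division finite when every sample gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical reward names and weights for one prompt with four rollouts.
    weights = {"aesthetic": 0.5, "alignment": 0.5}
    group = [
        {"aesthetic": 0.9, "alignment": 0.7},
        {"aesthetic": 0.4, "alignment": 0.8},
        {"aesthetic": 0.6, "alignment": 0.3},
        {"aesthetic": 0.7, "alignment": 0.9},
    ]
    rewards = [combine_rewards(s, weights) for s in group]
    print(group_relative_advantages(rewards))
```

Samples that score above their group's mean get a positive advantage and are reinforced, while those below are pushed down, so no separate value network is needed.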

See docs/training_rl.md for the complete setup guide.

Acknowledgements

UniAR builds upon several excellent open-source projects.

We also thank the benchmark authors for their evaluation tools: GenEval, OneIG-Bench, LongText-Bench, ImgEdit.

Citation

If you find UniAR useful in your research, please consider citing:

@inproceedings{peng2026uniar,
  title={Unified Multimodal Autoregressive Modeling with Shared Context --- Visual Tokenizer is Key to Unification},
  author={Peng, Wujian and Meng, Lingchen and Cai, Yuxuan and Zhuang, Xianwei and Yang, Yuhuan and Fang, Rongyao and Wu, Chenfei and Lin, Junyang and Wu, Zuxuan and Bai, Shuai},
  booktitle={ICML},
  year={2026}
}
