SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer
A SOTA quantization algorithm for high-accuracy low-bit LLM inference, seamlessly optimized for CPU/XPU/CUDA, with multi-datatype support and full compatibility with vLLM, SGLang, and Transformers.
AdaLLM is an NVFP4-first inference runtime for Ada Lovelace (RTX 4090) with an FP8 KV cache and custom decode kernels. This repo targets NVFP4 weights and keeps the entire decode path in FP8.
A production-ready Docker setup for ComfyUI that unlocks the full potential of NVIDIA Blackwell GPUs (RTX 50 series) through 4-bit quantization with NVFP4.
[ACL 2026 Main] Code for the paper "ARCQuant: Boosting NVFP4 Quantization with Augmented Residual Channels for LLMs"
LLM fine-tuning with LoRA + NVFP4/MXFP8 on NVIDIA DGX Spark (Blackwell GB10)
Blackwell-optimized llama.cpp Docker image – works on all NVIDIA GPUs, but tuned for RTX 50 series. Built from scratch with CUDA 12.8, sm_120, NVFP4-ready. 250+ tok/s on 4B F16. Includes llama-chat script.
(Experimental) A high-throughput and memory-efficient inference and serving engine for LLMs optimized for GB10 homelabs
Guide for serving fine-tuned Qwen3.5-27B (dense, NVFP4) on DGX Spark via native vLLM. Includes critical config fixes for modelopt export_hf_checkpoint() that prevent silent FP32 dequantization.
🔧 Fine-tune large language models efficiently on NVIDIA DGX Spark with LoRA adapters and optimized quantization for high performance.
NVFP4 inference on Blackwell GeForce (RTX 5090/5080/5070 Ti/RTX PRO 6000) — SM120 patches for vLLM + FlashInfer + CUTLASS. 175 tok/s on Qwen3.6-35B MoE.