From 69d0b641e3dd2fda70e807b8181571e30da4e69d Mon Sep 17 00:00:00 2001
From: Yue Sun

Windows ML CLI has validated a set of models for compatibility across all Execution Providers (EPs); see the full model compatibility report.

Supported Models

winml-cli supports a wide range of model architectures and tasks. This page lists what's validated and how to discover model support.

Each result cell is split into Compatibility (left) and Accuracy (right). The compatibility half records whether the model ran winml perf benchmark without errors or timeout (unquantized model). A — in the accuracy half means that pair was not accuracy-tested.

| Model | Task | DML GPU | MIGraph GPU | MLAS CPU | OV NPU | OV GPU | OV CPU | QNN NPU | QNN GPU | TRTRTX GPU | VitisAI NPU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | | Comp|Acc | Comp|Acc | Comp|Acc | Comp|Acc | Comp|Acc | Comp|Acc | Comp|Acc | Comp|Acc | Comp|Acc | Comp|Acc |
| ▼sentence-transformers | 5 models · 12 pairs | 12/12|9/9 | 12/12|9/9 | 12/12|9/9 | 12/12|9/9 | 12/12|9/9 | 12/12|9/9 | 11/12|9/9 | 12/12|9/9 | 12/12|9/9 | 12/12|9/9 |
| all-MiniLM-L6-v2 | feature-extraction | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| all-MiniLM-L6-v2 | sentence-similarity | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| all-mpnet-base-v2 | feature-extraction | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| all-mpnet-base-v2 | sentence-similarity | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| multi-qa-mpnet-base-dot-v1 | feature-extraction | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| multi-qa-mpnet-base-dot-v1 | sentence-similarity | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| paraphrase-multilingual-MiniLM-L12-v2 | feature-extraction | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| paraphrase-multilingual-MiniLM-L12-v2 | sentence-similarity | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| paraphrase-multilingual-mpnet-base-v2 | sentence-similarity | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| all-mpnet-base-v2 | fill-mask | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| multi-qa-mpnet-base-dot-v1 | fill-mask | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| paraphrase-multilingual-mpnet-base-v2 | feature-extraction | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✓|— |
| ▼BAAI | 6 models · 10 pairs | 10/10|7/7 | 10/10|7/7 | 10/10|7/7 | 10/10|7/7 | 10/10|7/7 | 10/10|7/7 | 10/10|7/7 | 10/10|7/7 | 10/10|7/7 | 10/10|7/7 |
| bge-base-en-v1.5 | feature-extraction | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| bge-base-en-v1.5 | sentence-similarity | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| bge-large-en-v1.5 | sentence-similarity | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| bge-m3 | feature-extraction | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| bge-m3 | sentence-similarity | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| bge-small-en-v1.5 | feature-extraction | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| bge-small-en-v1.5 | sentence-similarity | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| bge-large-en-v1.5 | feature-extraction | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| bge-reranker-base | text-classification | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| bge-reranker-v2-m3 | text-classification | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| ▼google-bert | 4 models · 6 pairs | 6/6|5/5 | 6/6|5/5 | 6/6|5/5 | 6/6|5/5 | 6/6|5/5 | 6/6|5/5 | 6/6|5/5 | 6/6|5/5 | 6/6|5/5 | 6/6|5/5 |
| bert-base-multilingual-cased | feature-extraction | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| bert-base-multilingual-cased | fill-mask | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| bert-base-multilingual-uncased | fill-mask | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| bert-base-uncased | fill-mask | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| bert-large-uncased-whole-word-masking-finetuned-squad | question-answering | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| bert-base-multilingual-cased | masked-lm | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| ▼facebook | 19 models · 21 pairs | 17/21|6/6 | 18/21|6/6 | 18/21|6/6 | 16/21|6/6 | 9/21|6/6 | 18/21|6/6 | 15/21|6/6 | 9/21|4/6 | 17/21|6/6 | 18/21|6/6 |
| convnext-tiny-224 | image-classification | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| dino-vitb16 | image-feature-extraction | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| dino-vits16 | image-feature-extraction | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| dinov2-small | image-feature-extraction | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| dinov2-base | image-feature-extraction | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|▼ | ✓|✓ | ✓|✓ |
| dinov2-large | image-feature-extraction | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|▼ | ✓|✓ | ✓|✓ |
| bart-large-cnn | summarization | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✗|— | ✓|— |
| bart-large-mnli | text-classification | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| bart-large-mnli | zero-shot-classification | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| detr-resnet-50 | feature-extraction | ✓|— | ✓|— | ✓|— | ✗|— | ✗|— | ✓|— | ✗|— | ✓|— | ✓|— | ✓|— |
| detr-resnet-50 | object-detection | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| dinov2-giant | image-feature-extraction | ✗|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| nllb-200-distilled-600M | translation | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| nougat-base | image-to-text | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— |
| sam-vit-base | mask-generation | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— |
| sam-vit-huge | mask-generation | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— |
| sam-vit-large | mask-generation | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— |
| sam2-hiera-large | mask-generation | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— |
| sam2.1-hiera-large | mask-generation | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— |
| sam2.1-hiera-small | mask-generation | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✗|— | ✗|— | ✓|— | ✓|— |
| sam2.1-hiera-tiny | mask-generation | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✗|— | ✗|— | ✓|— | ✓|— |
| ▼distilbert | 4 models | 4/4|4/4 | 4/4|4/4 | 4/4|4/4 | 4/4|4/4 | 4/4|4/4 | 4/4|4/4 | 4/4|4/4 | 4/4|4/4 | 4/4|4/4 | 4/4|4/4 |
| distilbert-base-cased-distilled-squad | question-answering | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| distilbert-base-uncased | fill-mask | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| distilbert-base-uncased-distilled-squad | question-answering | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| distilbert-base-uncased-finetuned-sst-2-english | text-classification | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| ▼microsoft | 28 models | 23/28|10/10 | 23/28|9/10 | 23/28|10/10 | 15/28|10/10 | 20/28|9/10 | 23/28|10/10 | 14/28|10/10 | 11/28|7/10 | 23/28|9/10 | 22/28|10/10 |
| resnet-18 | image-classification | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| resnet-50 | image-classification | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| swin-large-patch4-window7-224 | image-classification | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| beit-base-patch16-224-pt22k-ft22k | image-classification | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| llmlingua-2-xlm-roberta-large-meetingbank | token-classification | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| rad-dino | image-feature-extraction | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|▼ | ✓|✓ | ✓|✓ |
| swinv2-tiny-patch4-window16-256 | image-classification | ✓|✓ | ✓|▼ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| trocr-base-handwritten | image-to-text | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✗|✓ | ✓|✓ | ✓|✓ |
| trocr-base-printed | image-to-text | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✗|✓ | ✓|✓ | ✓|✓ |
| trocr-large-handwritten | image-to-text | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✗|✓ | ✓|✓ | ✓|✓ |
| deberta-xlarge-mnli | text-classification | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— |
| Phi-4-multimodal-instruct | visual-question-answering | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| speecht5_tts | text-to-speech | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| table-transformer-detection | object-detection | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✗|✗ | ✓|✓ | ✓|✓ | ✓|✗ | ✓|✗ | ✓|✓ |
| table-transformer-structure-recognition | object-detection | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| table-transformer-structure-recognition-v1.1-all | object-detection | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| tapex-base | table-question-answering | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✗|— | ✗|— | ✓|— | ✓|— |
| tapex-base-finetuned-wikisql | table-question-answering | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✗|— | ✗|— | ✓|— | ✓|— |
| tapex-base-finetuned-wtq | table-question-answering | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✗|— | ✗|— | ✓|— | ✓|— |
| tapex-large | table-question-answering | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✗|— | ✗|— | ✓|— | ✓|— |
| tapex-large-finetuned-tabfact | table-question-answering | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✗|— | ✗|— | ✓|— | ✓|— |
| tapex-large-finetuned-wikisql | table-question-answering | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✗|— | ✗|— | ✓|— | ✓|— |
| tapex-large-finetuned-wtq | table-question-answering | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✗|— | ✗|— | ✓|— | ✓|— |
| tapex-large-sql-execution | table-question-answering | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✗|— | ✗|— | ✓|— | ✓|— |
| trocr-large-printed | image-to-text | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✗|✓ | ✗|▼ | ✓|✓ | ✓|✓ |
| VibeVoice-ASR-HF | audio-text-to-text | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| VibeVoice-Realtime-0.5B | text-to-speech | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| xclip-base-patch32 | video-classification | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| ▼deepset | 3 models | 3/3|3/3 | 3/3|3/3 | 3/3|3/3 | 3/3|3/3 | 3/3|3/3 | 3/3|3/3 | 3/3|3/3 | 3/3|3/3 | 3/3|3/3 | 3/3|3/3 |
| bert-large-uncased-whole-word-masking-squad2 | question-answering | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| roberta-base-squad2 | question-answering | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| tinyroberta-squad2 | question-answering | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| ▼FacebookAI | 4 models | 3/4|3/4 | 4/4|4/4 | 4/4|4/4 | 4/4|4/4 | 4/4|4/4 | 4/4|3/4 | 4/4|4/4 | 4/4|3/4 | 4/4|4/4 | 4/4|4/4 |
| roberta-base | fill-mask | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| roberta-large | fill-mask | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| xlm-roberta-base | fill-mask | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| xlm-roberta-large | fill-mask | ✗|◷ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|◷ | ✓|✓ | ✓|▼ | ✓|✓ | ✓|✓ |
| ▼Intel | 4 models · 5 pairs | 5/5|2/2 | 5/5|2/2 | 5/5|2/2 | 5/5|2/2 | 4/5|2/2 | 4/5|2/2 | 4/5|2/2 | 5/5|2/2 | 5/5|2/2 | 4/5|2/2 |
| bert-base-uncased-mrpc | feature-extraction | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| bert-base-uncased-mrpc | text-classification | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| dpt-hybrid-midas | depth-estimation | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| dpt-large | depth-estimation | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| zoedepth-nyu-kitti | depth-estimation | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✗|— | ✗|— | ✓|— | ✓|— | ✗|— |
| ▼openai | 5 models · 9 pairs | 6/9|6/6 | 6/9|5/6 | 6/9|6/6 | 6/9|6/6 | 6/9|6/6 | 6/9|6/6 | 3/9|6/6 | 6/9|4/6 | 6/9|6/6 | 6/9|3/6 |
| clip-vit-base-patch16 | feature-extraction | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| clip-vit-base-patch32 | feature-extraction | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| clip-vit-base-patch32 | zero-shot-image-classification | ✓|✓ | ✓|— | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|▼ |
| clip-vit-base-patch16 | zero-shot-image-classification | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✗|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| clip-vit-base-patch16 | zero-shot-classification | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| clip-vit-base-patch32 | zero-shot-classification | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| clip-vit-large-patch14 | zero-shot-image-classification | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✗|✓ | ✓|▼ | ✓|✓ | ✓|▼ |
| clip-vit-large-patch14-336 | zero-shot-image-classification | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✗|✓ | ✓|▼ | ✓|✓ | ✓|▼ |
| gpt-oss-20b | text-generation | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| ▼google | 10 models · 12 pairs | 7/12|2/2 | 9/12|2/2 | 7/12|2/2 | 7/12|2/2 | 5/12|2/2 | 7/12|2/2 | 7/12|2/2 | 2/12|2/2 | 5/12|2/2 | 8/12|2/2 |
| vit-base-patch16-224 | image-classification | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| vit-base-patch16-224-in21k | image-feature-extraction | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| deplot | visual-question-answering | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| flan-t5-base | | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✗|— | ✓|— |
| flan-t5-base | summarization | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✗|— | ✓|— |
| flan-t5-base | translation | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✗|— | ✓|— |
| madlad400-3b-mt | translation | ✗|— | ✓|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✓|— |
| pegasus-xsum | summarization | ✗|— | ✓|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✓|— | ✗|— |
| pix2struct-ai2d-base | visual-question-answering | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| pix2struct-docvqa-base | visual-question-answering | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| siglip-base-patch16-224 | zero-shot-image-classification | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— |
| siglip-so400m-patch14-384 | zero-shot-image-classification | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— |
| ▼laion | 2 models · 4 pairs | 3/4|3/3 | 3/4|2/3 | 3/4|2/3 | 3/4|3/3 | 3/4|3/3 | 3/4|2/3 | 3/4|3/3 | 3/4|3/3 | 3/4|3/3 | 3/4|2/3 |
| CLIP-ViT-B-32-laion2B-s34B-b79K | feature-extraction | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| CLIP-ViT-B-32-laion2B-s34B-b79K | zero-shot-image-classification | ✓|✓ | ✓|— | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|▼ |
| CLIP-ViT-H-14-laion2B-s32B-b79K | zero-shot-image-classification | ✓|✓ | ✓|✓ | ✓|◷ | ✓|✓ | ✓|✓ | ✓|◷ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| CLIP-ViT-B-32-laion2B-s34B-b79K | zero-shot-classification | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| ▼apple | 2 models | 1/2|1/1 | 1/2|1/1 | 1/2|1/1 | 1/2|1/1 | 1/2|1/1 | 1/2|1/1 | 1/2|1/1 | 1/2|1/1 | 1/2|1/1 | 1/2|1/1 |
| mobilevit-small | image-classification | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| DepthPro-hf | depth-estimation | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| ▼Babelscape | 1 model | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 |
| wikineural-multilingual-ner | token-classification | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| ▼dbmdz | 1 model | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 |
| bert-large-cased-finetuned-conll03-english | token-classification | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| ▼dima806 | 1 model | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 |
| fairface_age_image_detection | image-classification | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| ▼dslim | 1 model | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 |
| bert-base-NER | token-classification | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| ▼Isotonic | 1 model | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 |
| distilbert_finetuned_ai4privacy_v2 | token-classification | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| ▼joeddav | 2 models | 1/2|1/1 | 1/2|1/1 | 1/2|1/1 | 1/2|1/1 | 1/2|1/1 | 1/2|1/1 | 1/2|1/1 | 1/2|1/1 | 1/2|1/1 | 1/2|1/1 |
| xlm-roberta-large-xnli | zero-shot-classification | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| bart-large-mnli-yahoo-answers | zero-shot-classification | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| ▼ProsusAI | 1 model | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 |
| finbert | text-classification | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| ▼rizvandwiki | 1 model | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 |
| gender-classification | image-classification | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| ▼w11wo | 1 model | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 |
| indonesian-roberta-base-posp-tagger | token-classification | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| ▼cross-encoder | 3 models | 3/3|— | 3/3|— | 3/3|— | 3/3|— | 2/3|— | 3/3|— | 3/3|— | 3/3|— | 3/3|— | 3/3|— |
| ms-marco-MiniLM-L4-v2 | text-classification | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| ms-marco-MiniLM-L6-v2 | text-classification | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| nli-deberta-v3-small | zero-shot-classification | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| ▼StanfordAIMI | 2 models | 2/2|1/1 | 2/2|1/1 | 2/2|1/1 | 2/2|1/1 | 2/2|1/1 | 2/2|1/1 | 2/2|1/1 | 2/2|0/1 | 2/2|1/1 | 2/2|1/1 |
| dinov2-base-xray-224 | image-feature-extraction | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|▼ | ✓|✓ | ✓|✓ |
| stanford-deidentifier-base | token-classification | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| ▼AdamCodd | 1 model | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— |
| vit-base-nsfw-detector | image-classification | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| ▼ahotrod | 1 model | 1/1|1/1 | 1/1|0/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 |
| electra_large_discriminator_squad2_512 | question-answering | ✓|✓ | ✓|▼ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| ▼amunchet | 1 model | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— |
| rorshark-vit-base | image-classification | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| ▼Falconsai | 1 model | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— |
| nsfw_image_detection | image-classification | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| ▼hustvl | 1 model | 1/1|0/1 | 1/1|0/1 | 1/1|0/1 | 1/1|0/1 | 1/1|0/1 | 1/1|0/1 | 1/1|0/1 | 1/1|0/1 | 1/1|0/1 | 1/1|0/1 |
| yolos-small | object-detection | ✓|▼ | ✓|▼ | ✓|▼ | ✓|▼ | ✓|▼ | ✓|▼ | ✓|▼ | ✓|▼ | ✓|▼ | ✓|▼ |
| ▼Jean-Baptiste | 1 model | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— |
| camembert-ner-with-dates | token-classification | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| ▼kredor | 1 model | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— |
| punctuate-all | token-classification | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| ▼lxyuan | 1 model | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— |
| distilbert-base-multilingual-cased-sentiments-student | zero-shot-classification | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| ▼monologg | 1 model | 1/1|1/1 | 1/1|0/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 |
| koelectra-small-v2-distilled-korquad-384 | question-answering | ✓|✓ | ✓|▼ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| ▼patrickjohncyh | 1 model | 1/1|1/1 | 1/1|0/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|0/1 |
| fashion-clip | zero-shot-image-classification | ✓|✓ | ✓|— | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|▼ |
| ▼tau | 1 model | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— |
| splinter-base | question-answering | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| ▼valentinafeve | 1 model | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|0/1 | 1/1|1/1 | 1/1|1/1 | 1/1|0/1 |
| yolos-fashionpedia | object-detection | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|▼ | ✓|✓ | ✓|✓ | ✓|▼ |
| ▼nvidia | 6 models | 6/6|3/3 | 6/6|3/3 | 6/6|3/3 | 5/6|3/3 | 0/6|3/3 | 6/6|3/3 | 4/6|3/3 | 6/6|3/3 | 6/6|3/3 | 6/6|3/3 |
| segformer-b1-finetuned-ade-512-512 | image-segmentation | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✗|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| segformer-b2-finetuned-ade-512-512 | image-segmentation | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✗|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| segformer-b5-finetuned-ade-640-640 | image-segmentation | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✗|✓ | ✓|✓ | ✗|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| segformer-b0-finetuned-ade-512-512 | image-segmentation | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| segformer-b0-finetuned-cityscapes-1024-1024 | image-segmentation | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| segformer-b5-finetuned-cityscapes-1024-1024 | image-segmentation | ✓|— | ✓|— | ✓|— | ✗|— | ✗|— | ✓|— | ✗|— | ✓|— | ✓|— | ✓|— |
| ▼cardiffnlp | 1 model | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 0/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 |
| twitter-roberta-base-sentiment-latest | text-classification | ✓|✓ | ✓|✓ | ✓|✓ | ✗|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| ▼mattmdjaga | 1 model | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 0/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 | 1/1|1/1 |
| segformer_b2_clothes | image-segmentation | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✗|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ |
| ▼ai-forever | 1 model | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— |
| Real-ESRGAN | | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| ▼alibaba-damo | 1 model | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— |
| mgp-str-base | image-to-text | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| ▼breezedeus | 1 model | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— |
| pix2text-mfr | image-to-text | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| ▼buildborderless | 1 model | 1/1|— | 1/1|— | 1/1|— | 0/1|— | 1/1|— | 1/1|— | 0/1|— | 1/1|— | 1/1|— | 1/1|— |
| CommunityForensics-DeepfakeDet-ViT | image-classification | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✓|— |
| ▼dandelin | 1 model | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— |
| vilt-b32-finetuned-vqa | visual-question-answering | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| ▼depth-anything | 3 models | 3/3|— | 3/3|— | 3/3|— | 3/3|— | 0/3|— | 3/3|— | 3/3|— | 3/3|— | 3/3|— | 3/3|— |
| Depth-Anything-V2-Base-hf | depth-estimation | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| Depth-Anything-V2-Large-hf | depth-estimation | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| Depth-Anything-V2-Small-hf | depth-estimation | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| ▼fashn-ai | 1 model | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 0/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— |
| fashn-human-parser | image-segmentation | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| ▼flaviagiammarino | 1 model | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 0/1|— | 1/1|— | 1/1|— | 0/1|— | 1/1|— | 1/1|— |
| medsam-vit-base | mask-generation | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— |
| ▼google-t5 | 4 models · 7 pairs | 5/7|— | 7/7|— | 5/7|— | 5/7|— | 5/7|— | 5/7|— | 5/7|— | 0/7|— | 0/7|— | 7/7|— |
| t5-3b | summarization | ✗|— | ✓|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✓|— |
| t5-3b | translation | ✗|— | ✓|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✓|— |
| t5-base | summarization | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✗|— | ✓|— |
| t5-base | translation | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✗|— | ✓|— |
| t5-large | summarization | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✗|— | ✓|— |
| t5-small | summarization | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✗|— | ✓|— |
| t5-small | translation | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✗|— | ✓|— |
| ▼Helsinki-NLP | 5 models | 3/5|— | 3/5|— | 3/5|— | 3/5|— | 3/5|— | 3/5|— | 3/5|— | 0/5|— | 3/5|— | 3/5|— |
| opus-mt-en-ru | translation | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| opus-mt-es-en | translation | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— |
| opus-mt-fr-en | translation | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| opus-mt-nl-en | translation | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— |
| opus-mt-tr-en | translation | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— |
| ▼hi-wesley | 1 model | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— |
| gemma3-vision-encoder | image-feature-extraction | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| ▼internlm | 1 model | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— |
| internlm-xcomposer2d5-7b | visual-question-answering | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| ▼intfloat | 1 model | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 0/1|— | 1/1|— | 1/1|— |
| multilingual-e5-large | sentence-similarity | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— |
| ▼jonathandinu | 1 model | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 0/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— |
| face-parsing | image-segmentation | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| ▼kha-white | 1 model | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 0/1|— | 0/1|— | 1/1|— | 1/1|— |
| manga-ocr-base | image-to-text | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✗|— | ✓|— | ✓|— |
| ▼knkarthick | 1 model | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— |
| MEETING_SUMMARY | summarization | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| ▼LiheYoung | 3 models | 2/3|— | 2/3|— | 2/3|— | 2/3|— | 0/3|— | 2/3|— | 2/3|— | 2/3|— | 2/3|— | 2/3|— |
| depth-anything-base-hf | depth-estimation | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| depth-anything-large-hf | depth-estimation | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| depth-anything-small-hf | depth-estimation | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| ▼Marqo | 1 model | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— |
| marqo-fashionSigLIP | zero-shot-image-classification | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| ▼mixedbread-ai | 1 model | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 0/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— |
| mxbai-rerank-xsmall-v1 | text-classification | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| ▼MoritzLaurer | 4 models | 4/4|— | 4/4|— | 4/4|— | 4/4|— | 0/4|— | 4/4|— | 4/4|— | 4/4|— | 4/4|— | 2/4|— |
| DeBERTa-v3-large-mnli-fever-anli-ling-wanli | zero-shot-classification | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— |
| deberta-v3-large-zeroshot-v2.0 | zero-shot-classification | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— |
| mDeBERTa-v3-base-mnli-xnli | zero-shot-classification | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| mDeBERTa-v3-base-xnli-multilingual-nli-2mil7 | zero-shot-classification | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| ▼moussaKam | 1 model | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— |
| mbarthez | summarization | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| ▼naver-clova-ix | 2 models | 2/2|— | 2/2|— | 2/2|— | 1/2|— | 2/2|— | 2/2|— | 0/2|— | 0/2|— | 2/2|— | 2/2|— |
| donut-base | image-to-text | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✗|— | ✗|— | ✓|— | ✓|— |
| donut-base-finetuned-cord-v2 | image-to-text | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✗|— | ✓|— | ✓|— |
| ▼nlpconnect | 1 model | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 0/1|— | 0/1|— | 1/1|— | 0/1|— |
| vit-gpt2-image-captioning | image-to-text | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✗|— | ✓|— | ✗|— |
| ▼obi | 1 model | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 0/1|— | 1/1|— | 1/1|— | 1/1|— |
| deid_roberta_i2b2 | token-classification | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✓|— |
| ▼oliverguhr | 1 model | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 0/1|— | 1/1|— | 1/1|— | 1/1|— |
| fullstop-punctuation-multilang-large | token-classification | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✓|— |
| ▼openai-community | 1 model | 1/1|— | 1/1|— | 1/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 1/1|— | 0/1|— |
| gpt2 | text-generation | ✓|— | ✓|— | ✓|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✓|— | ✗|— |
| ▼PekingU | 3 models | 3/3|— | 2/3|— | 3/3|— | 3/3|— | 0/3|— | 3/3|— | 3/3|— | 0/3|— | 3/3|— | 3/3|— |
| rtdetr_r101vd_coco_o365 | object-detection | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— |
| rtdetr_r50vd_coco_o365 | object-detection | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— |
| rtdetr_v2_r18vd | object-detection | ✓|— | ✗|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— |
| ▼philschmid | 1 model | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— |
| bart-large-cnn-samsum | summarization | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| ▼Qwen | 7 models | 1/7|— | 3/7|— | 1/7|— | 0/7|— | 0/7|— | 0/7|— | 0/7|— | 0/7|— | 2/7|— | 0/7|— |
| Qwen2.5-0.5B-Instruct | text-generation | ✓|— | ✓|— | ✓|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✓|— | ✗|— |
| Qwen2.5-1.5B-Instruct | text-generation | ✗|— | ✓|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✓|— | ✗|— |
| Qwen2.5-3B-Instruct | text-generation | ✗|— | ✓|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| Qwen2.5-7B-Instruct | text-generation | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| Qwen3-0.6B | text-generation | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| Qwen3-1.7B | text-generation | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| Qwen3-8B | text-generation | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| ▼Salesforce | 5 models | 1/5|1/1 | 1/5|0/1 | 1/5|1/1 | 1/5|1/1 | 1/5|1/1 | 1/5|1/1 | 0/5|1/1 | 0/5|0/1 | 1/5|1/1 | 1/5|1/1 |
| blip-image-captioning-base | image-to-text | ✓|✓ | ✓|▼ | ✓|✓ | ✓|✓ | ✓|✓ | ✓|✓ | ✗|✓ | ✗|▼ | ✓|✓ | ✓|✓ |
| blip-vqa-base | visual-question-answering | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| blip2-flan-t5-xl | visual-question-answering | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| blip2-opt-2.7b | visual-question-answering | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| blip2-opt-2.7b-coco | visual-question-answering | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| ▼sshleifer | 1 model | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 0/1|— | 0/1|— | 0/1|— | 1/1|— |
| distilbart-cnn-12-6 | summarization | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✗|— | ✗|— | ✓|— |
| ▼TahaDouaji | 1 model | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 0/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 1/1|— |
| detr-doc-table-detection | object-detection | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✓|— | ✓|— | ✓|— |
| ▼timm | 2 models | —|— | —|— | —|— | —|— | —|— | —|— | —|— | —|— | 2/2|— | —|— |
| mobilenetv3_small_100.lamb_in1k | image-classification | ∅|— | ∅|— | ∅|— | ∅|— | ∅|— | ∅|— | ∅|— | ∅|— | ✓|— | ∅|— |
| repghostnet_200.in1k | image-classification | ∅|— | ∅|— | ∅|— | ∅|— | ∅|— | ∅|— | ∅|— | ∅|— | ✓|— | ∅|— |
| ▼timpal0l | 1 model | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 0/1|— | 1/1|— | 0/1|— | 1/1|— | 1/1|— | 1/1|— |
| mdeberta-v3-base-squad2 | question-answering | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✗|— | ✓|— | ✓|— | ✓|— |
| ▼trl-internal-testing | 1 model | 1/1|— | 1/1|— | 1/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 1/1|— | 0/1|— |
| tiny-Qwen2ForCausalLM-2.5 | text-generation | ✓|— | ✓|— | ✓|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✓|— | ✗|— |
| ▼valhalla | 1 model | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— | 0/1|— |
| distilbart-mnli-12-3 | zero-shot-classification | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| ▼wanglab | 1 model | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 0/1|— | 1/1|— | 1/1|— | 0/1|— | 1/1|— | 1/1|— |
| medsam-vit-base | mask-generation | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— |
| ▼Xenova | 2 models | 0/2|— | 0/2|— | 0/2|— | 0/2|— | 0/2|— | 0/2|— | 0/2|— | 0/2|— | 0/2|— | 0/2|— |
| paraphrase-multilingual-MiniLM-L12-v2 | feature-extraction | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| segformer-b0-finetuned-ade-512-512 | image-segmentation | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— | ✗|— |
| ▼Zigeng | 1 model | 1/1|— | 1/1|— | 1/1|— | 1/1|— | 0/1|— | 1/1|— | 1/1|— | 0/1|— | 1/1|— | 1/1|— |
| SlimSAM-uniform-77 | mask-generation | ✓|— | ✓|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— | ✗|— | ✓|— | ✓|— |
winml perf) without errors or timeout, using the unquantized model.
-
- ✓ pass
- ✗ fail
- — not available
-
- An accuracy table is coming soon.
-Windows ML CLI has validated a set of models for compatibility across all Execution Providers (EPs)—see the full -models accuracy report.
+Model Accuracy Report.winml-cli supports a wide range of model architectures and tasks. This page lists what's validated and how to discover model support.
Windows ML CLI is a command line tool for building portable, performant, and high-quality AI models for Windows ML. It takes you from a source model \u2014 whether from Hugging Face or your own pipeline \u2014 to a hardware-optimized artifact in a reproducible workflow.
Purpose-built for Windows hardware diversity, the CLI handles conversion, graph optimization, and compilation across AMD, Intel, NVIDIA, and Qualcomm targets. The CLI fits naturally into CI/CD pipelines so teams can validate and ship models easily.
"},{"location":"#what-you-can-do","title":"What you can do","text":"export, analyze, optimize, quantize, compile), or use an auto-generated config with winml build \u2014 both produce portable models that run across hardware.winml CLI running locally.winml subcommands.To request access to the Windows ML CLI repository, visit aka.ms/winml-cli.
"},{"location":"#license","title":"License","text":"MIT. See LICENSE.
"},{"location":"Privacy/","title":"WinML CLI Privacy Statement","text":"WinML CLI collects limited, unlinked pseudonymized telemetry to help improve the product. This page describes exactly what is collected, what is not, and how to control it.
"},{"location":"Privacy/#data-category","title":"Data category","text":"All WinML CLI telemetry is classified as Optional under Microsoft's data categorization model. None of it is required to run any feature; it exists solely to support product improvement.
A first-run interactive prompt asks for consent before any event is sent. The prompt defaults to accept \u2014 pressing Enter enables telemetry. You can decline explicitly at the prompt, or change your answer later by editing %USERPROFILE%\\.winml\\config.json. Telemetry is automatically disabled in non-interactive contexts (non-TTY stdin, CI/CD pipelines) regardless of stored consent; those contexts do not see the prompt and default to off.
When telemetry is enabled, WinML CLI emits three event types:
"},{"location":"Privacy/#winmlcliheartbeat","title":"WinMLCLIHeartbeat","text":"Sent once per CLI invocation, just before the requested command runs. Carries only context attributes (OS, architecture, app version, device ID) \u2014 no per-event payload.
"},{"location":"Privacy/#winmlcliaction","title":"WinMLCLIAction","text":"Sent once per command completion.
| Attribute | Description |
|---|---|
| invoked_from | Script or Interactive, based on whether stdin is a TTY. |
| action_name | Click subcommand name (e.g., build, analyze). |
| device | Target device type, if the subcommand accepts --device (e.g., NPU, GPU). |
| ep | Execution provider, if the subcommand accepts --ep (e.g., QNNExecutionProvider). |
| duration_ms | Wall-clock execution time in milliseconds. |
| success | Whether the command completed without raising. |
"},{"location":"Privacy/#winmlclierror","title":"WinMLCLIError","text":"Sent only when a command raises an unhandled exception.
| Attribute | Description |
|---|---|
| exception_type | Exception class name (e.g., ValueError). |
| exception_message | The exception message, with absolute paths trimmed to package-relative, truncated to 200 characters, and with emails, GUIDs, IPv4/IPv6 addresses, and long opaque tokens replaced by <scrubbed>. |
| exception_stack | A list of frames, each {file, line, function}. File paths are package-relative. No source line text, no local variable values. |
"},{"location":"Privacy/#common-context-attributes","title":"Common context attributes","text":"Every event carries these attributes (populated by the telemetry module, not by the command code):
| Attribute | Description |
|---|---|
| device_id | SHA256 hash of a randomly generated UUID, persisted per machine. Enables counting distinct users without identifying them. |
| id_status | EXISTING, NEW, or FAILED \u2014 how the device ID was obtained on this run. |
| os.name, os.version, os.release, os.arch | Operating system and architecture (e.g., Windows, 10.0.26200, 11, AMD64). |
| app_version | WinML CLI package version. |
| app_instance_id | A random UUID generated for this process only; not persisted. |
| initTs | Epoch timestamp when telemetry was initialized. |
"},{"location":"Privacy/#data-never-collected","title":"Data never collected","text":"--model path/to/file.onnx)
On the first run of any command, WinML CLI prompts:
Enable telemetry? [Y/n]\n The default is Y (telemetry enabled) \u2014 pressing Enter accepts. Your answer is persisted to %USERPROFILE%\\.winml\\config.json under telemetry.consent and the prompt is not shown again.
Edit %USERPROFILE%\\.winml\\config.json directly:
{\n \"telemetry\": {\n \"consent\": \"disabled\"\n }\n}\n Goal Edit Opt out Set telemetry.consent to \"disabled\". Opt in Set telemetry.consent to \"enabled\". Re-show the prompt on next run Delete the file, or remove the telemetry.consent field. There are no CLI subcommands, per-invocation flags, or environment variables for consent \u2014 the config file is the single source of truth.
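The config edit described above can also be scripted. The following is a minimal sketch, assuming only what this page states: the file lives at %USERPROFILE%\.winml\config.json and the consent value is stored under telemetry.consent as "enabled" or "disabled". The helper names `default_config_path` and `set_telemetry_consent` are hypothetical and are not part of WinML CLI.

```python
import json
from pathlib import Path


def default_config_path() -> Path:
    # On Windows, Path.home() resolves to %USERPROFILE%.
    return Path.home() / ".winml" / "config.json"


def set_telemetry_consent(config_path: Path, consent: str) -> dict:
    """Write telemetry.consent while preserving any other keys in the file.

    Hypothetical helper: WinML CLI itself only documents editing the file by hand.
    """
    if consent not in ("enabled", "disabled"):
        raise ValueError('consent must be "enabled" or "disabled"')

    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))

    # Create the "telemetry" object if it is missing, then set the field.
    config.setdefault("telemetry", {})["consent"] = consent

    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

Calling `set_telemetry_consent(default_config_path(), "disabled")` would opt out. Deleting the file remains the documented way to see the prompt again.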
"},{"location":"Privacy/#ci-cd","title":"CI / CD","text":"Telemetry is automatically disabled when any of these environment variables are set, and no prompt is shown:
CI, TF_BUILD, GITHUB_ACTIONS, JENKINS_URL, CODEBUILD_BUILD_ID, BUILDKITE, SYSTEM_TEAMFOUNDATIONCOLLECTIONURI.
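A script that wraps the CLI can use the same list to predict whether telemetry will be off. This is a rough sketch, not WinML CLI's own detection code. It assumes that "set" means the variable is present in the environment, whatever its value, and the name `running_in_ci` is hypothetical.

```python
import os
from typing import Mapping, Optional

# Variable names copied from the list above.
CI_ENV_VARS = (
    "CI",
    "TF_BUILD",
    "GITHUB_ACTIONS",
    "JENKINS_URL",
    "CODEBUILD_BUILD_ID",
    "BUILDKITE",
    "SYSTEM_TEAMFOUNDATIONCOLLECTIONURI",
)


def running_in_ci(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when any of the listed CI variables is present.

    Assumption: presence alone counts as "set"; the value is not inspected.
    """
    env = os.environ if env is None else env
    return any(name in env for name in CI_ENV_VARS)
```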
Events that fail to send (e.g., transient network errors) are cached locally and retried on the next run. The cache file lives at:
%USERPROFILE%\\.winml\\telemetry\\winmlcli.cache
The cache is append-only on failure and drain-then-resend on recovery. When telemetry is disabled, the cache is cleared so a disabled session never resends events the user has since opted out of.
"},{"location":"Privacy/#dev-installs","title":"Dev installs","text":"WinML CLI installed from source (pip install -e .) or run directly from a checkout never sends telemetry. The InstrumentationKey is blank in source and is only populated by the official build pipeline. Only official binary releases are capable of sending telemetry, and only after the user has seen the first-run prompt.
For the full contributing guide \u2014 development setup, coding conventions, testing, PR checklist, and CLA \u2014 see CONTRIBUTING.md in the repository root.
# Clone and set up\ngit clone https://github.com/microsoft/winml-cli.git\ncd winml-cli\nuv sync --extra dev\nuv run pre-commit install\n\n# Download runtime check rules (required for `winml analyze`)\ngh release download <tag> --repo microsoft/winml-cli --pattern 'rules-v*.zip' --dir .\nExpand-Archive -Path .\\rules-v*.zip -DestinationPath src\\winml\\modelkit\\analyze\\rules\\runtime_check_rules -Force\n\n# Run tests\nuv run pytest tests/ -m \"not e2e and not npu and not gpu\"\n\n# Lint and format\nuv run ruff check src/ tests/ --fix\nuv run ruff format src/ tests/\n\n# Docs preview\nuv run mkdocs serve\n"},{"location":"contributing/#see-also","title":"See also","text":"Common issues and solutions when working with winml-cli.
"},{"location":"troubleshooting/#compile","title":"Compile","text":""},{"location":"troubleshooting/#cannot-enable-compilation-no-compile-section","title":"Cannot enable compilation: no compile section","text":"UsageError: Cannot enable compilation: no compile section found in the config file\n Cause: Compilation is off by default in winml build. You passed --compile to explicitly enable it, but the config JSON has no \"compile\" section (it's null). This happens when the config was generated without a device target that supports EPContext (e.g., --device cpu or --device auto on a machine without NPU).
Solution: Regenerate the config targeting a device that supports compilation (NPU or GPU with an EP that produces EPContext):
uv run winml config -m <model> -d npu --compile -o output/\n Note
By default winml build skips the compile stage unless --compile is passed or the config contains a non-null \"compile\" section. To include compilation in the generated config, specify a device that maps to an EPContext-capable EP (e.g., -d npu).
ClickException: model_ctx.onnx is already a compiled EPContext model and cannot be re-compiled\n Cause: You're trying to compile a model that is already an EPContext artifact (the _ctx.onnx output).
Solution: Run compilation on the original (pre-compiled) ONNX file instead:
uv run winml compile -m model.onnx -d npu -o output/\n"},{"location":"troubleshooting/#provider-does-not-support-epcontext-compilation","title":"Provider does not support EPContext compilation","text":"ClickException: Provider 'DmlExecutionProvider' does not support EPContext compilation\n Cause: Not all EPs produce EPContext format. DML and CPU do not support pre-compilation.
Solution: EPContext is supported by QNN, OpenVINO, TensorRT, and Vitis AI. For DML/CPU, skip the compile step \u2014 the runtime compiles on first load automatically:
uv run winml build -c config.json -m model -o output/ --no-compile\n"},{"location":"troubleshooting/#analyze","title":"Analyze","text":""},{"location":"troubleshooting/#unsupported-nodes-persist-after-analysis","title":"Unsupported nodes persist after analysis","text":"RuntimeError: Unsupported nodes persist after analysis\n Cause: The model contains operators that the selected EP cannot dispatch natively.
Solution: Run winml analyze with --optim-config to identify problematic operators and get recommended graph optimizations:
# Analyze and output optimization recommendations\nuv run winml analyze -m model.onnx --ep qnn --optim-config optim_config.json\n This produces optim_config.json with the auto-discovered optimization flags. Apply them with winml optimize, then re-analyze:
# Apply recommended optimizations\nuv run winml optimize -m model.onnx -o model_optimized.onnx -c optim_config.json\n\n# Re-analyze to check if unsupported nodes are resolved\nuv run winml analyze -m model_optimized.onnx --ep qnn\n If unsupported nodes still remain after optimization, consider:
onnx-graphsurgeon to replace or remove operators the EP cannot handle
--ep dml or --ep cpu) that supports the operators in question
--opset-version 18)
When winml analyze reports a large number of nodes as \"unknown\", the model likely hasn't been normalized \u2014 it contains raw constant-folding subgraphs, missing shape annotations, or redundant initializer nodes that the analyzer cannot classify.
Solution: Run winml optimize with no optimization flags to normalize the model (constant folding, shape inference, dead-node elimination), then re-analyze:
# Normalize only (no fusion flags)\nuv run winml optimize -m model.onnx -o model_normalized.onnx\n\n# Re-analyze \u2014 constant nodes are now folded, shapes are inferred\nuv run winml analyze -m model_normalized.onnx --ep qnn\n This baseline pass collapses constant subgraphs into initializers and propagates tensor shapes throughout the graph, giving the analyzer enough information to classify nodes correctly.
"},{"location":"troubleshooting/#build-cache","title":"Build / Cache","text":""},{"location":"troubleshooting/#disk-full-out-of-space","title":"Disk full / out of space","text":"Build artifacts (exported ONNX, optimized graphs, quantized models, compiled EPContext files) are cached under:
C:\\Users\\<user>\\.cache\\winml\n This directory can grow significantly after multiple builds with large models. If you encounter disk-full errors or want to reclaim space, it is safe to delete the entire folder:
Remove-Item -Recurse -Force \"$env:USERPROFILE\\.cache\\winml\"\n The next winml build will re-create the cache as needed. Use --rebuild to force a full rebuild without relying on cached intermediates.
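Before deleting the cache it can help to see how much space it occupies. The sketch below only assumes the cache location given above (the .cache\winml folder under the user profile). `directory_size_bytes` is a hypothetical helper, not a winml-cli function.

```python
from pathlib import Path


def directory_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root; 0 if root does not exist."""
    if not root.exists():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    # On Windows, Path.home() resolves to %USERPROFILE%.
    cache_dir = Path.home() / ".cache" / "winml"
    size_mib = directory_size_bytes(cache_dir) / 1024**2
    print(f"{cache_dir}: {size_mib:.1f} MiB")
```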
uv run winml sys Check EP compatibility uv run winml analyze -m model.onnx --ep <ep> Verbose output Add -v or --verbose to any command Skip a pipeline stage --no-quant, --no-compile, --no-optimize Force rebuild (ignore cache) uv run winml build -c config.json -m <model> -o output/ --rebuild Regenerate config uv run winml config -m <model> -d <device> -o dir/ Free disk space Delete C:\\Users\\<user>\\.cache\\winml"},{"location":"troubleshooting/#see-also","title":"See also","text":"Verify an ONNX model is compatible with a target execution provider before deployment.
"},{"location":"commands/analyze/#when-to-use-this","title":"When to use this","text":"Use winml analyze before running the full build pipeline to confirm that your ONNX model's operators are supported by the intended execution provider and device. It surfaces operator gaps and actionable recommendations early, saving time that would otherwise be spent on a failed compile or quantize run.
$ winml analyze [options]\n"},{"location":"commands/analyze/#flags","title":"Flags","text":"
| Flag | Short | Type | Default | Description |
|---|---|---|---|---|
| --model | -m | PATH | (required) | Path to the ONNX model file to analyze. |
| --ep | | choice | auto | Target execution provider. Accepts full names (e.g., QNNExecutionProvider) or short aliases (qnn, openvino, vitisai, cpu, cuda, dml, nvtensorrtrtx, migraphx). Use all for every rule-data-backed EP, or auto to infer from local availability. |
| --device | | cpu\\|gpu\\|npu\\|all\\|auto | auto | Target device type. auto infers from local availability; all evaluates all rule-data-backed devices. |
| --verbose | -v | flag | off | Enable verbose output. |
| --quiet | -q | flag | off | Suppress non-essential output. |
| --config | -c | PATH | (none) | Build configuration file (YAML/JSON). |
| --output | | PATH | (none) | Save the full JSON result to a file in addition to printing the console summary. |
| --information / --no-information | | flag | enabled | Include detailed per-operator recommendations and remediation hints in the output. Pass --no-information for a compact pass/fail summary. |
| --htp-metadata | | PATH | (none) | Path to an HTP metadata JSON file (produced by winml export). Enriches subgraph pattern extraction by mapping nodes back to their source module hierarchy. Benefits all target EPs. |
| --run-unknown-op / --no-run-unknown-op | | flag | disabled | For operators not in the rule database, build a minimal ONNX graph and run it on the target EP locally to determine support. Enable when local EP libraries are available. |
| --save-node | | partial\\|unsupported | (none) | Save partial or unsupported node subgraphs to disk for further investigation. Can be specified multiple times: --save-node partial --save-node unsupported. |
| --optim-config | | PATH | (none) | Save the auto-discovered optimization config (merged across all analyzed EPs) to a JSON file. |
"},{"location":"commands/analyze/#how-it-works","title":"How it works","text":"winml analyze loads the ONNX model and runs a static analysis pass via ONNXStaticAnalyzer.
For each operator (and recognized subgraph pattern), the analyzer consults the target EP's rule database. For operators not in the database, it can optionally probe them locally when --run-unknown-op is enabled. The combined answer classifies each node as supported, partial, unsupported, or unknown (see Analyze and optimize for definitions).
The analysis always produces a lint result \u2014 the pass/fail verdict. When --information is enabled (the default), it additionally produces an autoconf result: a set of fusion-flag suggestions that, if applied in the optimize stage, would resolve partial or unsupported patterns. Pass --no-information to skip autoconf and get just the lint verdict.
| Exit code | Meaning |
|---|---|
| 0 | All operators are fully supported on the target EP. |
| 1 | At least one operator is unsupported, partially supported, or unknown. |
| 2 | Input or configuration error (bad path, unknown EP, etc.). |
Exit codes make winml analyze safe to use as a CI gate with set -e or $? checks.
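For pipelines driven from Python rather than a shell, the same exit codes can be interpreted in a small wrapper. This is a sketch under stated assumptions: the meanings of 0, 1, and 2 come from this page, while the function names and the `allow_issues` switch are hypothetical conveniences. `run_analyze` needs the winml executable on PATH.

```python
import subprocess
from typing import Sequence

# Meanings taken from the exit codes documented for `winml analyze`.
EXIT_MEANINGS = {
    0: "all operators fully supported on the target EP",
    1: "at least one operator is unsupported, partially supported, or unknown",
    2: "input or configuration error",
}


def describe_exit_code(code: int) -> str:
    """Map an exit code to its documented meaning."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def gate_passes(code: int, allow_issues: bool = False) -> bool:
    """Decide whether a CI gate should pass.

    allow_issues is a hypothetical switch for treating code 1 as a warning;
    code 2 and any undocumented code always fail the gate.
    """
    if code == 0:
        return True
    if code == 1:
        return allow_issues
    return False


def run_analyze(args: Sequence[str]) -> int:
    """Run `winml analyze` with the given arguments and return its exit code."""
    return subprocess.run(["winml", "analyze", *args], check=False).returncode
```

A gate could then call `gate_passes(run_analyze(["--model", "model.onnx", "--ep", "qnn"]))` and fail the job when it returns False.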
Analyze using auto-detected EP and device:
$ winml analyze --model microsoft/resnet-50.onnx\n The output shows a live progress table per EP followed by an ANALYSIS SUMMARY section. Each EP line displays support counts in S/P/U/Unk format (Supported / Partial / Unsupported / Unknown) with color-coded indicators.
Check QNN NPU support using the short alias:
$ winml analyze --model bert-base-uncased.onnx --ep qnn --device NPU\n Check Intel OpenVINO GPU support and print operator-level recommendations:
$ winml analyze --model bert-base-uncased.onnx --ep openvino --device GPU --information\n Save the full JSON result for offline inspection while still printing the console summary:
$ winml analyze --model facebook/convnext-tiny-224.onnx --output results.json\n Use HTP metadata for enhanced subgraph pattern extraction:
$ winml analyze --model bert-base-uncased.onnx \\\n --ep qnn --device NPU \\\n --htp-metadata bert-base-uncased_htp_metadata.json\n Run a lint-only pass (no recommendations) for a CI gate:
$ winml analyze --model model.onnx --ep qnn --device NPU --no-information\necho \"Exit code: $?\" # 0 = clean, 1 = issues, 2 = input error\n Dump partial and unsupported subgraphs to disk for debugging:
$ winml analyze --model model.onnx --ep qnn \\\n --save-node partial --save-node unsupported \\\n --output result.json\n Enable local execution for operators not in the rule database:
$ winml analyze --model model.onnx --ep qnn --device NPU --run-unknown-op\n"},{"location":"commands/analyze/#common-pitfalls","title":"Common pitfalls","text":"--ep uses auto (inferred from local availability) \u2014 to analyze every EP regardless of what is installed, pass --ep all. Specify --ep <name> when you know your target hardware.--htp-metadata is EP-agnostic \u2014 HTP metadata enriches pattern extraction before any EP-specific checks, so it benefits all target EPs equally. You do not need separate metadata files per EP.--run-unknown-op is disabled by default \u2014 operators not covered by the rule database are classified as UNKNOWN (not unsupported) unless you explicitly pass --run-unknown-op to probe them locally. Enable it only when the target EP's libraries are available on the local machine..onnx file \u2014 symbolic HuggingFace model IDs are not accepted; export the model first with winml export.Run the entire winml-cli pipeline (export \u2192 optimize \u2192 quantize \u2192 compile) in one command.
"},{"location":"commands/build/#when-to-use-this","title":"When to use this","text":"Use winml build when you want to go from a Hugging Face model ID (or an existing .onnx file) to a deployment-ready artifact in a single invocation, without manually chaining winml export, winml optimize, winml quantize, and winml compile. A build config file \u2014 generated by winml config \u2014 controls every stage of the pipeline.
$ winml build [options]\n"},{"location":"commands/build/#flags","title":"Flags","text":"Flag Short Type Default Description --config -c path None WinMLBuildConfig JSON file, generated by winml config. If omitted, config is auto-generated from -m. --model -m string None Hugging Face model ID or path to an existing .onnx file. --output-dir -o path None Directory for all build artifacts. Mutually exclusive with --use-cache. --use-cache/--no-use-cache flag false Store artifacts in the winml-cli global cache (~/.cache/winml/). Mutually exclusive with --output-dir. --rebuild/--no-rebuild flag false Overwrite existing artifacts and re-run the full pipeline. --quant/--no-quant flag true Run the quantization stage (use --no-quant to skip), overriding the config. --no-compile / --compile flag None Override compilation. --compile forces enable (config must have a compile section). --no-compile forces skip. Default: inherit from config. --optimize/--no-optimize flag true Run the optimization stage (use --no-optimize to skip). --ep string None Target execution provider for the analyzer (e.g., qnn). Falls back to the compile config EP if not set. --device -d string auto Target device for the analyzer (e.g., npu, gpu). Default: auto (auto-detect). --analyze/--no-analyze flag true Run the analyzer loop during build (use --no-analyze to skip). --max-optim-iterations integer None Maximum autoconf re-optimization rounds (3 enforced internally when not set). --no-analyze implicitly sets this to 0. --trust-remote-code/--no-trust-remote-code flag false Allow executing custom code from model repositories. Use only with trusted sources. --allow-unsupported-nodes/--no-allow-unsupported-nodes flag false Allow unsupported nodes to remain in the graph instead of failing the build. 
--help -h flag Show this message and exit."},{"location":"commands/build/#how-it-works","title":"How it works","text":"winml build reads a WinMLBuildConfig JSON file (from winml config) that encodes device, precision, export, quantization, and compilation settings. When -m is a Hugging Face model ID, the full pipeline runs: export \u2192 optimize \u2192 quantize \u2192 compile. When -m points to an existing .onnx file, the export stage is skipped and the pipeline starts at optimization. After compilation, an optional analyzer loop (--max-optim-iterations) re-evaluates graph quality and applies further passes; --no-analyze disables it for a deterministic single-pass build. Individual stages can be suppressed with --no-quant, --no-compile, and --no-optimize without touching the config file.
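The stage selection described above can be pictured as a small function. This is an illustrative sketch only, not the CLI's implementation. It assumes an ONNX input is recognized by its .onnx extension, and `compile_enabled` stands in for whatever the --compile/--no-compile flags and the config resolve to. The name `planned_stages` is hypothetical.

```python
from typing import List


def planned_stages(
    model: str,
    optimize: bool = True,
    quant: bool = True,
    compile_enabled: bool = True,
) -> List[str]:
    """Return the pipeline stages that would run for a given -m value.

    Assumption: a value ending in ".onnx" is an existing ONNX file, so the
    export stage is skipped; anything else is treated as a model ID.
    """
    stages: List[str] = []
    if not model.lower().endswith(".onnx"):
        stages.append("export")
    if optimize:
        stages.append("optimize")
    if quant:
        stages.append("quantize")
    if compile_enabled:
        stages.append("compile")
    return stages
```

For example, `planned_stages("bert-base-uncased", quant=False, compile_enabled=False)` mirrors a build run with --no-quant --no-compile and yields only the export and optimize stages.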
Reproducible CI/CD builds
The config file is a portable, self-contained pipeline specification. Check it into source control and invoke winml build -c config.json in CI to produce identical artifacts without manual flag management. Set \"auto\": false in the config to disable the autoconf discovery loop for fully deterministic output.
# Full pipeline: HF model \u2192 export \u2192 optimize \u2192 quantize \u2192 compile\nwinml build -c config.json -m microsoft/resnet-50 -o output/\n winml build\n Config: config.json\n Model: microsoft/resnet-50\n Output: output/\n\n export done (28.3s)\n optimize done (4.1s)\n quantize done (6.8s)\n compile done (14.2s)\n\n Build complete in 53.4s\n Final artifact: output/resnet50_ctx.onnx\n # Start from a pre-exported ONNX file (skips export stage)\nwinml build -c config.json -m resnet50.onnx -o output/\n # Export and optimize only \u2014 skip quantization and compilation for quick testing\nwinml build -c config.json -m bert-base-uncased -o output/ \\\n --no-quant --no-compile\n # Force a clean rebuild, overwriting any cached artifacts\nwinml build -c config.json -m facebook/convnext-tiny-224 -o output/ --rebuild\n # Use the global cache and cap optimizer iterations for faster turnaround\nwinml build -c config.json -m microsoft/resnet-50 \\\n --use-cache --max-optim-iterations 1\n"},{"location":"commands/build/#common-pitfalls","title":"Common pitfalls","text":"--output-dir or --use-cache is required; they are mutually exclusive. Omitting both raises an error immediately.--use-cache is not supported in module mode. When the config is a JSON array (module mode), only --output-dir is accepted.winml config. The schema is strict; unknown keys are rejected.--rebuild to force a fresh run after changing the config.Browse the curated winml-cli catalog of validated models and benchmarks.
"},{"location":"commands/catalog/#when-to-use-this","title":"When to use this","text":"Use winml catalog to discover which HuggingFace models have been validated end-to-end by the winml-cli team \u2014 exported, quantized, compiled, and benchmarked on real Windows ML devices. It is the starting point when you want a model that is known to work before investing time in a custom build.
$ winml catalog [options]\n"},{"location":"commands/catalog/#flags","title":"Flags","text":"
| Flag | Short | Type | Default | Description |
|---|---|---|---|---|
| --model-type | | string | null | Filter the catalog by model architecture (case-insensitive). Examples: bert, roberta, vit. |
| --task | -t | string | null | Filter by HuggingFace task (case-insensitive). Examples: text-classification, image-segmentation. |
| --ep/--execution-provider | | string | null | Filter by execution provider (e.g., qnn, dml). If not specified, shows all EPs. |
| --device | -d | string | null | Filter by target device (e.g., npu, gpu). If not specified, shows all devices. |
| --output | -o | path | null | Save the displayed results to a JSON file. |
| --help | -h | flag | \u2014 | Show help and exit. |
winml catalog reads a local catalog bundled with the package \u2014 no network access is required.
The catalog is stored in winml/modelkit/data/hub_models.json and is loaded directly from the installed package data without any network call. Each catalog entry records the model ID, task, architecture type, and model size. Use --model-type, --task, --ep, or --device to narrow the displayed list. When --output is provided, the filtered results are written as indented JSON to the specified path.
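The filtering behaviour can be sketched in a few lines. The schema of hub_models.json is not shown on this page, so the field names `model_id`, `task`, and `model_type` below are assumptions, as is the choice of exact (rather than substring) case-insensitive matching. The three sample rows are taken from the example catalog output in this document.

```python
import json
from typing import Dict, List, Optional

# Sample rows from the documented example output; field names are hypothetical.
SAMPLE_CATALOG: List[Dict[str, str]] = [
    {"model_id": "microsoft/resnet-50", "task": "image-classification", "model_type": "resnet"},
    {"model_id": "bert-base-uncased", "task": "fill-mask", "model_type": "bert"},
    {"model_id": "ProsusAI/finbert", "task": "text-classification", "model_type": "bert"},
]


def filter_catalog(
    entries: List[Dict[str, str]],
    model_type: Optional[str] = None,
    task: Optional[str] = None,
) -> List[Dict[str, str]]:
    """Keep entries matching every filter that was supplied (case-insensitive)."""

    def matches(value: str, wanted: Optional[str]) -> bool:
        return wanted is None or value.lower() == wanted.lower()

    return [
        e
        for e in entries
        if matches(e.get("model_type", ""), model_type)
        and matches(e.get("task", ""), task)
    ]


if __name__ == "__main__":
    # Indented JSON, in the spirit of what --output is described as writing.
    print(json.dumps(filter_catalog(SAMPLE_CATALOG, model_type="bert"), indent=2))
```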
# List all validated models in the catalog\n$ winml catalog\n +--- winml-cli Catalog | 12 validated model(s) --------------------------+\n| Model Task Model Type |\n| microsoft/resnet-50 image-classification resnet |\n| bert-base-uncased fill-mask bert |\n| ProsusAI/finbert text-classification bert |\n| ... |\n+---------------------------------------------------------------------------+\nUse --ep or --device to filter by execution provider or target device.\n # Filter to BERT-family models only\n$ winml catalog --model-type bert\n # Filter by task \u2014 show only text-classification models\n$ winml catalog --task text-classification\n # Combine filters \u2014 BERT models for text classification\n$ winml catalog --model-type bert --task text-classification\n # Save filtered results to JSON for offline review\n$ winml catalog --task image-classification --output results/image_catalog.json\n"},{"location":"commands/catalog/#common-pitfalls","title":"Common pitfalls","text":"--output only saves what was displayed. Combining a filter with --output saves the filtered list. There is no dedicated flag for dumping the entire catalog; to do so, omit all filters and pass --output.
winml inspect and winml export work with any HuggingFace model that has a supported architecture, whether or not it appears in the catalog.
Compile an ONNX model to an EP-specific format for fast runtime loading.
"},{"location":"commands/compile/#when-to-use-this","title":"When to use this","text":"Use winml compile as the final pipeline stage after winml quantize to produce an execution-provider-native artifact (for example, a QNN EPContext model) that loads faster and avoids online graph compilation at inference time.
$ winml compile [options]\n"},{"location":"commands/compile/#flags","title":"Flags","text":"Flag Short Type Default Description --model -m path (required unless --list) Input ONNX model file. --output -o path \u2014 Output file path (e.g., model_compiled.onnx). Takes precedence over --output-dir. --output-dir path same dir as input Directory to write compiled output artifacts. --device -d choice auto Target device: auto, npu, gpu, or cpu. --ep TEXT \u2014 Force a specific execution provider, overriding device-to-provider mapping. Accepts full names (e.g., QNNExecutionProvider) or aliases (qnn, dml, openvino, vitisai, migraphx, cpu, nvtensorrtrtx). --validate / --no-validate flag --validate Run a post-compilation validation pass on the target hardware. Enabled by default; pass --no-validate to skip when the target hardware or driver is unavailable. --compiler choice ort Compiler backend: ort (ONNX Runtime) or qairt (Qualcomm AI Runtime Tools). --qnn-sdk-root path None Path to the QNN SDK root directory. --embed/--no-embed flag false Embed the EP context blob inside the ONNX file instead of writing a separate .bin file. --list flag false List available compiler backends for the selected device and exit without compiling. --help -h flag Show this message and exit."},{"location":"commands/compile/#how-it-works","title":"How it works","text":"winml compile resolves the target execution provider from --device and --ep, then calls the winml-cli compiler API to hand the ONNX graph to the EP's offline compilation toolchain. When --device auto (the default), the target EP is determined by auto-detecting available hardware. For NPU targets, ONNX Runtime's QNN EP generates a binary .bin context file (or embeds it inline with --embed) that encodes the hardware-optimized execution plan, eliminating graph partitioning at load time. An optional post-compilation validation pass runs a forward pass through the target EP; skip it with --no-validate when the target hardware is absent.
# Compile with auto device detection (default compiler)\nwinml compile -m resnet50_qdq.onnx\n Input: resnet50_qdq.onnx\nDevice: npu\nProvider: qnn\nCompiler: ort\n\nCompiling model...\n\nSuccess! Model compiled\nOutput: resnet50_qdq_ctx.onnx\nCompile time: 12.40s\nTotal time: 13.05s\n # List available compiler backends for NPU before committing to a run\nwinml compile --list --device npu\n # Compile a pre-quantized BERT model for NPU with context embedded inline\nwinml compile -m bert-base-uncased_qdq.onnx --embed\n # Compile for GPU using the OpenVINO execution provider\nwinml compile -m microsoft_resnet50.onnx --device gpu --ep openvino\n"},{"location":"commands/compile/#common-pitfalls","title":"Common pitfalls","text":"--embed inflates the .onnx file significantly. Embedding the EP context produces a single portable file but can make it impractical to open or inspect the ONNX graph with standard tooling.--no-validate.--device auto auto-detects the best available hardware. Pass --device npu, --device gpu, or --device cpu explicitly when targeting specific hardware regardless of what is auto-detected.Generate a reusable build configuration for a Hugging Face model or ONNX file.
"},{"location":"commands/config/#when-to-use-this","title":"When to use this","text":"Use winml config at the start of a new model project to produce a WinMLBuildConfig JSON file. The config captures the model identity, task, precision, and per-stage settings in one shareable artifact that you can edit, version-control, and repeatedly pass to winml build. Running config first lets you review and adjust pipeline settings before committing to a full build.
$ winml config [options]\n"},{"location":"commands/config/#flags","title":"Flags","text":"Flag Short Type Default Description --model -m TEXT (none) HuggingFace model ID (e.g., microsoft/resnet-50) or path to an existing .onnx file. Optional when --model-type or --model-class is provided. --task -t TEXT (auto) Override the auto-detected task (e.g., image-classification, text-classification). When omitted, the first supported task for the model is selected automatically. --model-class TEXT (auto) Override the auto-detected model class (e.g., CLIPTextModelWithProjection). Useful for multi-component models. --model-type TEXT (auto) Override the auto-detected model type (e.g., bert, resnet). Can be used without -m to generate a config from HuggingFace default settings. --module TEXT (none) Generate configs for every submodule whose class name matches the given string (e.g., ResNetConvLayer). The output is a JSON array instead of a single object. --config -c PATH (none) JSON override file in WinMLBuildConfig format. Fields present in this file take precedence over auto-detected values. --shape-config PATH (none) JSON file with input shape overrides for dummy input generation. Valid keys by modality \u2014 text: sequence_length; vision: height, width, num_channels; audio: feature_size, nb_max_frames, audio_sequence_length. --device -d auto\\|npu\\|gpu\\|cpu auto Target device. Affects the generated quantization and compilation sub-configs. auto leaves those sections unchanged from the kit defaults. --ep TEXT (none) Force a specific execution provider (qnn, dml, migraphx, tensorrt, vitisai, openvino, cpu). Overrides the device-to-provider mapping. When used without --device, the device is inferred from the EP. --precision -p TEXT auto Target precision: auto, fp32, fp16, int8, int16, or a mixed format such as w8a16. auto selects the precision based on the chosen device. --output -o PATH (stdout) Write the generated JSON to this file instead of printing to stdout. 
--library TEXT transformers Source library for TasksManager task lookup. Defaults to transformers; set to diffusers or another Optimum-supported library when needed. --quant/--no-quant flag true Include quantization in the generated config (use --no-quant to omit it and set quant to null). --no-compile / --compile flag --no-compile (compile excluded by default) Controls whether compilation is included in the generated config. By default compilation is excluded (compile: null). Pass --compile to include a compile section. --trust-remote-code/--no-trust-remote-code flag false Allow execution of custom model code from the HuggingFace repository. Required for some community models. Only enable for repositories you trust."},{"location":"commands/config/#how-it-works","title":"How it works","text":"winml config queries the HuggingFace TasksManager to auto-detect the model's task, class, and ONNX export specification. For known model types it looks up a per-model kit in MODEL_BUILD_CONFIGS and uses that as a starting point, layering in your device, precision, and override file on top. When -m points to an existing .onnx file, the export stage is skipped by setting export to null in the output. The result is a complete WinMLBuildConfig JSON printed to stdout or written to a file, ready to be passed to winml build.
Generate a config for ResNet-50 with all auto-detected settings:
$ winml config -m microsoft/resnet-50\n Generating config for microsoft/resnet-50...\nAuto-selected task: image-classification (from 'microsoft/resnet-50')\nGenerated config for task 'image-classification'\n{\n \"loader\": { \"task\": \"image-classification\", ... },\n \"export\": { \"opset_version\": 17, ... },\n \"optim\": { ... },\n \"quant\": null,\n \"compile\": null\n}\n Target NPU with int8 quantization and save to a file:
$ winml config -m microsoft/resnet-50 --device npu --precision int8 -o resnet_npu.json\n Generate a config for BERT and override the task:
$ winml config -m bert-base-uncased --task text-classification -o bert_cls.json\n Generate from a model type alone (no HuggingFace download required at config time):
$ winml config --model-type bert --task fill-mask\n Generate a config from an already-exported ONNX file, skipping quantization (compilation is already excluded by default):
$ winml config -m facebook/convnext-tiny-224.onnx --no-quant -o convnext_optim_only.json\n"},{"location":"commands/config/#common-pitfalls","title":"Common pitfalls","text":"-m, --model-type, or --model-class is required \u2014 calling winml config with none of these three flags raises a usage error immediately.
auto precision does not always map to a lower-bit type \u2014 when --device is also auto, precision stays at the kit default (usually fp32). Explicitly pass --device npu or --device gpu for auto precision to resolve to int8 or fp16.
--module changes the output shape \u2014 with --module the JSON output is an array of configs, not a single object. Scripts that expect a single object will fail to parse this output.
--trust-remote-code has security implications \u2014 only use this flag with model repositories you own or explicitly trust; it allows arbitrary Python code from the remote model repository to execute.
--shape-config keys are modality-specific \u2014 passing a sequence_length key for a vision model has no effect. Check the --help description for valid keys per modality.
WinMLBuildConfig and how stages interact
Evaluate ONNX model accuracy on a standard dataset.
"},{"location":"commands/eval/#when-to-use-this","title":"When to use this","text":"Use winml eval to measure how accurately a model performs on real data \u2014 especially after quantization, where comparing the quantized model against the floating-point baseline reveals any accuracy regression introduced by precision reduction.
$ winml eval [options]\n"},{"location":"commands/eval/#flags","title":"Flags","text":"Flag Short Type Default Description --model -m TEXT \u2014 HuggingFace model ID, or path to a local .onnx file. Required (unless --model-id is provided directly). --model-id TEXT \u2014 HuggingFace model ID used for preprocessor and config resolution when -m points to an .onnx file. Required when -m is an ONNX file. --task TEXT auto-detected Task name (e.g., image-classification). Auto-detected from --model-id when not provided. Required when -m is an ONNX file and the task cannot be inferred. --precision TEXT auto Precision used when building the model from a HuggingFace ID. One of auto, fp32, fp16, int8, int16, or a mixed w{x}a{y} spec (e.g., w8a16). fp16/fp32 skip quantization. Ignored when -m is a pre-built .onnx file \u2014 the precision is already baked in. --device choice auto Target device. Choices: auto, npu, gpu, cpu. auto selects the best available device. Combined with --precision, this drives the build when -m is a HuggingFace ID. --ep / --execution-provider TEXT \u2014 Target ONNX Runtime execution provider when finer control than --device is needed. Full names (e.g., QNNExecutionProvider, OpenVINOExecutionProvider, VitisAIExecutionProvider) and aliases (qnn, ov/openvino, vitis/vitisai) are accepted. --dataset TEXT task default HuggingFace dataset path (e.g., imagenet-1k, nyu-mll/glue). If omitted, a default dataset is selected based on the task. --dataset-name TEXT \u2014 Dataset configuration name for multi-config datasets. --dataset-revision TEXT \u2014 Git revision (branch, tag, or commit) of the dataset to load. Use refs/convert/parquet for HF datasets that are only served via the parquet mirror. --dataset-script TEXT \u2014 Path to a Python script that builds the evaluation dataset locally. Requires --trust-remote-code. --trust-remote-code / --no-trust-remote-code flag false Allow executing custom code from model repositories or dataset scripts. 
Required with --dataset-script. Use only with trusted sources. --samples INTEGER 100 Number of dataset samples to evaluate. --split TEXT validation Dataset split to use (e.g., validation, test, train). --shuffle / --no-shuffle flag shuffle Shuffle the dataset before sampling. Disable with --no-shuffle for reproducible sample ordering. --streaming / --no-streaming flag false Stream the dataset from the Hub instead of downloading the full split. Useful for large datasets. --column TEXT (multiple) \u2014 Column mapping as key=value pairs (e.g., --column input_column=image). Can be specified multiple times. --label-mapping PATH \u2014 Path to a JSON file mapping dataset label names to the integer class IDs the model emits: {\"label_name\": id}. --output -o PATH \u2014 Output JSON file path for the evaluation results. --schema flag false Print the expected dataset schema for the given --task and exit. Does not run evaluation. --mode onnx\\|compare onnx Evaluation mode. onnx evaluates the ONNX candidate on a dataset. compare runs the ONNX candidate and the HuggingFace reference on identical random inputs and reports per-tensor similarity metrics \u2014 no dataset required."},{"location":"commands/eval/#how-it-works","title":"How it works","text":"winml eval loads the model and runs the evaluation pipeline via the internal evaluate function (supporting both HuggingFace IDs and local ONNX files), then pulls the requested number of samples from a HuggingFace dataset. Each sample is preprocessed using the tokenizer or image processor associated with the model ID, passed through the ONNX Runtime session, and the output is compared against the ground-truth label. Aggregated metrics (accuracy, F1, etc.) are printed to the console and optionally written to a JSON file. When -m is an ONNX file, --model-id must be provided so the command knows which preprocessor and label vocabulary to use.
Evaluate a HuggingFace model using the task-default dataset:
$ winml eval -m microsoft/resnet-50\n Task: image-classification\nDataset: timm/mini-imagenet (test, 100 samples)\nDevice: auto\n\nAccuracy: 76.00%\n\nResults saved to: microsoft_resnet-50_eval.json\n Evaluate a pre-exported ONNX file, providing the source model ID for preprocessing:
$ winml eval -m model.onnx --model-id microsoft/resnet-50 --dataset timm/mini-imagenet\n Evaluate a BERT model on the MRPC paraphrase task with column remapping:
$ winml eval -m Intel/bert-base-uncased-mrpc --dataset nyu-mll/glue --dataset-name mrpc --column input_column=sentence1 --column second_input_column=sentence2 --samples 500\n Check what dataset columns are expected before running, then remap them to match your dataset:
$ winml eval --schema --task text-classification\n Input schema for text-classification models\n==================================================\n\n--column option schema\n\nEvaluating needs a dataset with the following columns:\n input_column\n input text (default: text)\n label_column\n class label (ClassLabel or integer) (default: label)\n second_input_column\n second text for sentence-pair tasks (optional) (default: None)\n\nOverride any default with --column:\n --column input_column=<your_text_column>\n --column label_column=<your_label_column>\n --column second_input_column=<your_pair_column>\n The GLUE SST-2 dataset uses sentence instead of the default text column, so remap it with a single --column override:
$ winml eval -m distilbert/distilbert-base-uncased-finetuned-sst-2-english --dataset nyu-mll/glue --dataset-name sst2 --column input_column=sentence --samples 500\n Evaluate against a custom dataset whose label names differ from the model's class IDs. The --label-mapping flag points to a JSON file whose keys are the label name strings as they appear in the dataset and whose values are the integer class IDs the model emits. For example, ResNet-50 outputs ImageNet-1k class IDs (0\u2013999), so if your custom dataset uses readable strings like \"tabby cat\" or \"golden retriever\", labels.json translates each dataset label to the corresponding ImageNet ID the model predicts:
{\n \"tabby cat\": 281,\n \"Egyptian cat\": 285,\n \"golden retriever\": 207\n}\n $ winml eval -m microsoft/resnet-50 --dataset my-org/my-pets-dataset --label-mapping labels.json -o results/resnet_eval.json\n Evaluate a composite model from pre-exported ONNX files. Some tasks (e.g., image-to-text, encoder-decoder, dual-encoder) split the model across multiple ONNX files, one per role. Pass -m once per role as <role>=<path>.onnx and supply --model-id so the preprocessor and tokenizer can be resolved. Run winml eval --schema --task image-to-text to see the expected roles for a task:
$ winml eval -m encoder=encoder.onnx -m decoder=decoder.onnx --model-id microsoft/trocr-base-printed\n"},{"location":"commands/eval/#common-pitfalls","title":"Common pitfalls","text":"--model-id fails. When -m is a .onnx path, --model-id is mandatory. Without it the command cannot resolve the preprocessor or label vocabulary and will exit with a usage error.
--dataset (and --label-mapping if needed) when evaluating a model whose label space or domain differs from the task default.
imagenet-1k) require a HuggingFace account with accepted terms of use. Log in with huggingface-cli login before running eval on gated data.
--shuffle is on by default. The random 100-sample slice changes between runs unless you pass --no-shuffle. Use --no-shuffle when comparing two model variants to ensure they see identical samples.
--streaming skips the local cache. Streaming mode avoids downloading the full split but prevents random shuffling on large datasets. For reproducible evaluation, download the split once and omit --streaming.
winml eval --schema --task <task> to inspect the expected schema and use --column to remap dataset field names to the expected names.
--device option
Convert a PyTorch / Hugging Face model to ONNX, preserving module hierarchy.
"},{"location":"commands/export/#when-to-use-this","title":"When to use this","text":"Use winml export when you have a Hugging Face model ID or a local PyTorch checkpoint and need an ONNX file as the first step of the optimization pipeline. This is the entry point before winml quantize or winml compile.
$ winml export [options]\n"},{"location":"commands/export/#flags","title":"Flags","text":"Flag Short Type Default Description --model -m string (required) Hugging Face model name or local path (e.g., prajjwal1/bert-tiny). --output -o path (required) Output ONNX file path (e.g., model.onnx). --with-report/--no-with-report flag false Generate full export reports: Markdown, JSON, and a console tree. --hierarchy/--no-hierarchy flag true Preserve hierarchy_tag metadata in ONNX nodes (use --no-hierarchy for a clean ONNX file). --dynamo/--no-dynamo flag false Enable PyTorch 2.9+ dynamo export for richer node metadata. (Experimental \u2014 currently logs a warning.) --torch-module string None Comma-separated list of torch.nn module types to include in hierarchy (e.g., LayerNorm,Embedding). (Experimental \u2014 currently logs a warning.) --input-specs path None JSON file with explicit input tensor specifications. Auto-generated when omitted. --task -t string None Override auto-detected Hugging Face task (e.g., image-feature-extraction). --export-config path None JSON file with ONNX export parameters such as opset_version and do_constant_folding. --shape-config path None JSON object mapping symbolic dimension names to concrete sizes (e.g., {\"sequence_length\": 2048}). Ignored when --input-specs is provided. --trust-remote-code/--no-trust-remote-code flag false Allow executing custom code from model repositories during export. Use only with trusted sources. --allow-unsupported-nodes/--no-allow-unsupported-nodes flag false Allow unsupported nodes to remain in the exported graph instead of failing export. 
--help -h flag Show this message and exit."},{"location":"commands/export/#how-it-works","title":"How it works","text":"winml export loads the model via Hugging Face transformers, then runs the eight-step Hierarchy-preserving Tags Protocol (HTP): model preparation, input generation, module-hierarchy tracing, TorchScript ONNX export, node-tagger creation, per-node tagging, tag injection into ONNX metadata_props, and optional report generation. The hierarchy metadata allows downstream tools to reason about operators grouped by their originating module rather than flat graph position. When --no-hierarchy is specified, hierarchy steps are bypassed and a bare ONNX file is written, useful for third-party tools that do not understand custom metadata.
# Minimal export: Hugging Face model ID to ONNX file\nwinml export -m microsoft/resnet-50 -o resnet50.onnx\n Model: microsoft/resnet-50\nOutput: resnet50.onnx\n\nStarting HTP export...\n Detected task: image-classification\n\nSuccess! Model exported to: resnet50.onnx\n # Export with verbose output and full Markdown + JSON reports\nwinml export -m facebook/convnext-tiny-224 -o convnext.onnx -v --with-report\n # Export a BERT model, overriding input shapes for longer sequences\nwinml export -m bert-base-uncased -o bert.onnx \\\n --shape-config shape.json\n# shape.json: {\"sequence_length\": 512}\n # Export with a hand-crafted input-spec file (skips auto-detection)\nwinml export -m bert-base-uncased -o bert.onnx --input-specs inputs.json\n # Produce clean ONNX without hierarchy metadata (for third-party optimizers)\nwinml export -m microsoft/resnet-50 -o resnet50_clean.onnx --no-hierarchy\n"},{"location":"commands/export/#see-also","title":"See also","text":"-t with the correct task string, for example -t image-feature-extraction.--shape-config is silently ignored when --input-specs is set. --input-specs takes full priority; remove it if you only want to override individual dimensions.--dynamo and --torch-module are experimental. Both flags emit a warning and have no effect in the current release. Do not rely on them in automated pipelines yet.HF_HOME or HF_HUB_CACHE to control the download location.Inspect a model's tasks, classes, and hierarchy before committing to an export.
"},{"location":"commands/inspect/#when-to-use-this","title":"When to use this","text":"Use winml inspect to understand how winml-cli will treat a HuggingFace model before running winml export or winml build. It answers questions like \"which task will be auto-detected?\", \"which HF model class will be loaded?\", and \"does this model have a supported exporter?\" without downloading weights or writing any files.
$ winml inspect -m <model_id> [options]\n"},{"location":"commands/inspect/#flags","title":"Flags","text":"Flag Short Type Default Description --model -m string required HuggingFace model ID (e.g. openai/clip-vit-base-patch32). Required unless --list-tasks or --help is used. --format -f table | json table Output format. table renders rich panels; json emits a machine-readable object. --task -t string null Override the auto-detected task (e.g. image-classification, feature-extraction). --hierarchy/--no-hierarchy -H flag false Print the PyTorch module tree. Instantiates the model with random weights \u2014 no weight download required. --verbose -v flag false Show full configuration details. --list-tasks flag false List all known tasks and exit. Does not require --model. --model-type string null Override model type (e.g. bert, resnet). Can be used without --model. --model-class string null Override model class (e.g. BertForMaskedLM). Can be used without --model. --help -h flag \u2014 Show help and exit. winml inspect does not accept --device, --ep, --precision, or --output. It is a read-only discovery command that does not produce any artifacts.
winml inspect calls into the winml-cli registry to resolve the model ID against the known loader and exporter configurations. It fetches only the model's config.json from HuggingFace Hub (no weights), uses the architecture field to look up the matching HF model class and WinML inference class, and then renders the result. When --hierarchy is supplied, the model is instantiated locally with random weights using AutoModel.from_config(), and a forward-pass trace records the full PyTorch module tree. Because no real weights are downloaded, hierarchy inspection is fast even for large models.
# Basic inspection \u2014 check task detection and loader/exporter classes\n$ winml inspect -m microsoft/resnet-50\n +--------------------------- microsoft/resnet-50 ---------------------------+\n| Task image-classification |\n| Model Class ResNetForImageClassification |\n| Exporter OptimumExporter |\n| WinML Class WinMLImageClassificationModel |\n| Status Supported |\n+---------------------------------------------------------------------------+\n # JSON output \u2014 useful for scripting or CI pre-flight checks\n$ winml inspect -m bert-base-uncased --format json\n # Override task when auto-detection picks the wrong one\n$ winml inspect -m bert-base-uncased --task feature-extraction\n # Print the full PyTorch module hierarchy (no weight download)\n$ winml inspect -m openai/clip-vit-base-patch32 --hierarchy\n # Combine verbose logging with hierarchy for deep diagnostics\n$ winml inspect -m facebook/convnext-tiny-224 -v -H\n"},{"location":"commands/inspect/#common-pitfalls","title":"Common pitfalls","text":"--model is required for model inspection. The flag is marked required for model-specific lookups; omitting it returns an error. The only exception is --list-tasks, which lists all known tasks and exits without needing a model.transformers installation, --hierarchy will fail with an import error. Update transformers or omit the flag.--task changes which exporter and WinML class are reported, not just the task field. If the override is incompatible with the model architecture, the status will show as unsupported.--format json is silent on unsupported models. When the model is not found in the winml-cli registry, the command raises a ClickException. Wrap the call in winml inspect ... && ... or check the exit code when scripting.config.json is always fetched from HuggingFace Hub. 
Set HF_HUB_OFFLINE=1 if you need fully offline inspection of a locally cached model.
winml.hierarchy.tag metadata is written and what you can do with the module tree
Apply graph optimizations and fusions to an ONNX model to reduce node count and improve inference speed.
"},{"location":"commands/optimize/#when-to-use-this","title":"When to use this","text":"Use winml optimize after exporting an ONNX model and before quantization or compilation. Graph fusions reduce operator count, improve memory locality, and can make downstream quantization more accurate by presenting cleaner subgraphs to the calibration pass. It is also useful as a standalone step when you want to optimize a pre-exported ONNX file without running the full build pipeline.
$ winml optimize [options]\n"},{"location":"commands/optimize/#flags","title":"Flags","text":"Flag Short Type Default Description --model -m PATH (required unless listing) Input ONNX model file. Not required when --list-capabilities or --list-rewrites is used. --output -o PATH {input}_opt.onnx Output path for the optimized model. Defaults to the input filename with _opt inserted before the extension. --config -c PATH (none) YAML or JSON configuration file. Fields in the file override capability defaults; CLI flags override the file. --verbose -v flag off Enable verbose output. --list-capabilities -l flag off Print all registered optimization capabilities grouped by category and exit. Add --verbose for descriptions and ORT names. --list-rewrites flag off Print all available pattern-rewrite families with their source-to-target mappings and exit. (dynamic) flag (per capability) Each registered capability generates a --enable-<name> / --disable-<name> pair. Run --list-capabilities to see the full current list. Examples: --enable-gelu-fusion, --disable-constant-folding. Pattern-rewrite flags follow the form --enable-<source-slug>-<target-slug>; run --list-rewrites to discover all names."},{"location":"commands/optimize/#configuration-precedence","title":"Configuration precedence","text":"When multiple sources are provided, settings are resolved in this order (highest wins):
1. CLI flags (--enable-X / --disable-X)
2. Config file (-c)
3. Capability defaults
winml optimize loads the ONNX model, builds a final capability configuration by merging capability defaults, an optional config file, and any explicit CLI flags, then runs all enabled passes through the Optimizer. Each capability maps to a named optimization or fusion pipe in the winml.modelkit.optim registry. The capability flags are auto-generated at startup from that registry \u2014 adding a new optimization to the registry automatically makes it available as a CLI flag without any change to this command's source. After optimization, the command prints the before-and-after node count and percentage reduction so you can quantify the effect.
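The layering described above (capability defaults, then the config file, then explicit CLI flags) can be sketched in a few lines of Python. This is only an illustration of the merge order, not winml-cli's actual implementation: the function name resolve_capabilities is hypothetical, and the default values in the usage example are assumptions chosen for the demonstration.

```python
def resolve_capabilities(defaults, config_file=None, cli_flags=None):
    '''Layer capability settings so that later sources win.

    Priority, lowest to highest: defaults, config file, CLI flags.
    Illustrative only; not taken from the winml-cli source.
    '''
    resolved = dict(defaults)            # start from capability defaults
    resolved.update(config_file or {})   # the -c file overrides defaults
    resolved.update(cli_flags or {})     # --enable-X / --disable-X override the file
    return resolved


# Assumed default values, for illustration only.
defaults = {'gelu-fusion': False, 'constant-folding': True}
config_file = {'gelu-fusion': True}    # enabled in the -c file
cli_flags = {'gelu-fusion': False}     # --disable-gelu-fusion on the command line

print(resolve_capabilities(defaults, config_file, cli_flags))
```

A capability that is absent from a higher-priority source keeps the value set by the source below it, which is why omitting a CLI flag leaves the config file value in effect.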
Optimize a model with all capability defaults:
$ winml optimize -m microsoft/resnet-50.onnx\n Input: microsoft/resnet-50.onnx\nOutput: microsoft/resnet-50_opt.onnx\n\nLoading model...\nRunning optimizer...\nSaving optimized model...\n\nSuccess! Model optimized: microsoft/resnet-50_opt.onnx\nNodes: 312 -> 289 (7.4% reduction)\n Enable specific fusions for a BERT model:
$ winml optimize -m bert-base-uncased.onnx \\\n --enable-layer-norm-fusion \\\n --enable-attention-fusion \\\n -o bert_layernorm_attn.onnx\n Use a config file to set capabilities and save the result for downstream compilation:
$ winml optimize -m facebook/convnext-tiny-224.onnx \\\n -c optimize_config.yaml \\\n -o convnext_opt.onnx\n List all available optimization capabilities:
$ winml optimize --list-capabilities\n Discover pattern-rewrite families and their flag names:
$ winml optimize --list-rewrites\n"},{"location":"commands/optimize/#common-pitfalls","title":"Common pitfalls","text":"--model is required for actual optimization \u2014 it can be omitted only when using --list-capabilities or --list-rewrites. Missing --model in any other case raises a usage error.
--disable-X CLI flag always wins over a config file value that enables the same capability, but omitting the flag leaves the config file value in effect. To turn off a capability set by a config file, pass the explicit --disable-X flag.
--list-capabilities to confirm the current set of flags rather than relying on a cached list.
-o, the second run silently overwrites {input}_opt.onnx. Specify an explicit output path in scripts.
WinMLBuildConfig that includes optimization settings
winml-cli exposes a CLI named winml with 12 subcommands covering the full journey from model discovery to a deployment-ready artifact. Every subcommand shares a consistent invocation style \u2014 winml <command> [flags] \u2014 and the same global flags are available on the root winml group.
The commands group by user intent. Discover (sys, inspect, catalog, analyze) helps you understand your hardware and model before writing any artifacts. Configure (config, optimize) produces a reusable build configuration and tunes the ONNX graph. Build (export, quantize, compile, build) runs the pipeline stages that produce deployment artifacts. Measure (perf, eval) benchmarks and validates the result.
The typical workflow follows that order: run winml sys to confirm hardware and EPs, then winml inspect or winml catalog to verify model support. Use winml config to generate a build configuration, then winml build to execute the full pipeline \u2014 or chain export \u2192 analyze \u2192 optimize \u2192 quantize \u2192 compile individually for finer control. Close with winml perf and winml eval to measure speed and accuracy.
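As a rough sketch of the stage-by-stage route, the Python below builds the command lines for part of that chain using only the -m and -o flags documented on the individual command pages. The helper name stage_commands and the intermediate file names are hypothetical. The analyze, compile, and eval stages are left out because their flags depend on the target execution provider and model.

```python
import shlex


def stage_commands(model_id, workdir='out'):
    '''Return per-stage winml commands as argument lists, in pipeline order.

    File names under workdir are illustrative choices, not CLI defaults.
    '''
    onnx = f'{workdir}/model.onnx'
    opt = f'{workdir}/model_opt.onnx'
    qdq = f'{workdir}/model_qdq.onnx'
    return [
        ['winml', 'sys'],
        ['winml', 'inspect', '-m', model_id],
        ['winml', 'export', '-m', model_id, '-o', onnx],
        ['winml', 'optimize', '-m', onnx, '-o', opt],
        ['winml', 'quantize', '-m', opt, '-o', qdq],
        ['winml', 'perf', '-m', qdq],
    ]


# Dry run: print each command instead of executing it.
for cmd in stage_commands('microsoft/resnet-50'):
    print(shlex.join(cmd))
```

Each stage consumes the file written by the previous one, so passing explicit -o paths keeps the chain independent of per-command default output names.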
| Command | Category | Description |
|---|---|---|
| sys | Discover | Inspect your machine \u2014 devices, EPs, and runtime versions at a glance. |
| inspect | Discover | Inspect a model's tasks, classes, and hierarchy before committing to an export. |
| catalog | Discover | Browse the curated winml-cli catalog of validated models and benchmarks. |
| config | Configure | Generate a reusable build configuration for a Hugging Face model or ONNX file. |
| export | Build | Convert a PyTorch / Hugging Face model to ONNX, preserving module hierarchy. |
| analyze | Build | Verify an ONNX model is compatible with a target execution provider before deployment. |
| optimize | Build | Apply graph optimizations and fusions to an ONNX model to reduce node count and improve inference speed. |
| quantize | Build | Quantize an ONNX model with QDQ insertion and calibration-based scaling. |
| compile | Build | Compile an ONNX model to an EP-specific format for fast runtime loading. |
| build | Build | Run the entire winml-cli pipeline (export \u2192 optimize \u2192 quantize \u2192 compile) in one command. |
| perf | Measure | Benchmark an ONNX model's latency and throughput on a target device. |
| eval | Measure | Evaluate ONNX model accuracy on a standard dataset. |"},{"location":"commands/overview/#choosing-a-command","title":"Choosing a command","text":"winml sys, winml inspect, winml catalog, winml analyze, winml export, winml build, winml perf, winml eval
-v / --verbose, -q / --quiet, --version, and -h / --help live on the root winml group only. Subcommands access them through ctx.obj and do not redefine them. See src/winml/modelkit/cli.py for the canonical contract.
Several flags share semantics across the commands that accept them: -m / --model, -d / --device, --ep, -o / --output, -t / --task, and --precision. Defaults and accepted values can differ per command (e.g., -p is a short form for --precision only on config and quantize); check the Flags section of each command page rather than assuming they transfer.
WinMLBuildConfig and how stages interact
--device / --ep interact
Benchmark an ONNX model's latency and throughput on a target device.
"},{"location":"commands/perf/#when-to-use-this","title":"When to use this","text":"Use winml perf when you want a quantitative latency and throughput baseline for a model on a specific device, or when you need to compare the performance impact of different precision settings, execution providers, or batch sizes.
$ winml perf [options]\n"},{"location":"commands/perf/#flags","title":"Flags","text":"Flag Short Type Default Description --model -m TEXT \u2014 HuggingFace model ID or path to a local .onnx file. Required. --task TEXT auto-detected Explicit task override (e.g., image-classification). Inferred from the model if omitted. --iterations INTEGER 100 Number of timed inference iterations used to compute statistics. --warmup INTEGER 10 Number of warm-up iterations run before timing begins; excluded from statistics. --device -d auto\\|cpu\\|gpu\\|npu auto Device to run the benchmark on. auto selects the highest-priority available device. --precision TEXT auto Precision mode applied during model build: auto, fp32, fp16, int8, int16, or compound forms such as w8a16. --ep TEXT \u2014 Force a specific execution provider (e.g., qnn, dml, vitisai, openvino, cpu). Overrides the device-to-provider mapping. --ep-options KEY=VALUE (multiple) \u2014 Runtime EP provider option forwarded to the inference session (e.g., --ep-options htp_performance_mode=burst). Repeatable. Applies to both HuggingFace model IDs and ONNX file inputs. Unlike build-time options set via --config, these tune the runtime session, not the compiled graph. --output -o PATH ~/.cache/winml/perf/<slug>/<timestamp>.json Output JSON file path for the benchmark report. --batch-size INTEGER 1 Batch size used when generating synthetic input tensors. --shape-config PATH \u2014 Path to a JSON file containing shape overrides (e.g., {\"height\": 480, \"width\": 480}). Ignored for pre-exported ONNX files and in --module mode. --quantize/--no-quantize flag true Run quantization during model build (use --no-quantize to skip it). Useful for measuring the fp32 baseline. --rebuild/--no-rebuild flag false Force model rebuild even if a cached artifact already exists. --ignore-cache/--no-ignore-cache flag false Build from scratch in a temporary folder and discard the artifact after benchmarking. Implies --rebuild. 
--module TEXT \u2014 PyTorch module class name for per-module benchmarking (e.g., BertAttention). Builds and times each matching instance separately. See Load and export. --monitor/--no-monitor flag false Show a live NPU/CPU utilization chart while the benchmark runs and include hardware metrics in the JSON report."},{"location":"commands/perf/#how-it-works","title":"How it works","text":"winml perf loads the model through WinMLAutoModel \u2014 accepting both HuggingFace IDs and local ONNX files \u2014 then generates random input tensors from the model's I/O configuration. It runs the specified number of warm-up iterations (excluded from statistics) followed by the timed iterations, collecting per-sample latency. The final report includes mean, min, max, P50, P90, P95, P99, standard deviation, and throughput in samples per second. When --monitor is active, a hardware polling loop runs in parallel and records NPU / GPU utilization, CPU usage, and device memory alongside the timing data.
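To make the reported statistics concrete, here is a minimal Python sketch that derives the same set of figures from a list of per-iteration latencies. It is not the winml perf implementation: the function name summarize is hypothetical, the percentile method (linear interpolation) and the sample standard deviation are assumptions, and throughput is assumed to be batch size divided by mean latency.

```python
import statistics


def summarize(latencies_ms, batch_size=1):
    '''Summarize timed-iteration latencies given in milliseconds.

    Needs at least two samples. Percentiles use linear interpolation;
    the real tool may use a different method.
    '''
    cuts = statistics.quantiles(latencies_ms, n=100, method='inclusive')
    mean = statistics.fmean(latencies_ms)
    return {
        'mean': mean,
        'min': min(latencies_ms),
        'max': max(latencies_ms),
        'p50': cuts[49],
        'p90': cuts[89],
        'p95': cuts[94],
        'p99': cuts[98],
        'std': statistics.stdev(latencies_ms),
        # samples per second, assuming throughput = batch_size / mean latency
        'throughput': batch_size * 1000.0 / mean,
    }


# Made-up latencies; warm-up iterations would be dropped before this step.
report = summarize([2.0, 2.1, 2.2, 2.3, 2.9])
for name, value in report.items():
    print(f'{name}: {value:.2f}')
```

Under the throughput assumption, a 2.14 ms mean at batch size 1 works out to about 467.29 samples per second.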
Basic benchmark on the best available device:
$ winml perf -m microsoft/resnet-50\n Device: npu\nPrecision: auto\nTask: image-classification\nIterations: 100 (+ 10 warmup)\nBatch Size: 1\n\nLatency (ms)\n Avg P50 P90 P95 P99 Min Max Std\n 2.14 2.11 2.38 2.51 2.79 1.97 3.04 0.12\n\nThroughput: 467.29 samples/sec\n\nResults saved to: ~/.cache/winml/perf/microsoft_resnet-50/2026-05-27T120000.json\n Benchmark a pre-exported ONNX file on CPU with more iterations:
$ winml perf -m model.onnx --device cpu --iterations 500\n Benchmark a text model with an explicit task, targeting the NPU:
$ winml perf -m bert-base-uncased --task text-classification --device npu --precision w8a16\n Benchmark with live hardware monitoring enabled:
$ winml perf -m microsoft/resnet-50 --device npu --monitor\n Pass runtime EP provider options to tune the session (repeatable):
$ winml perf -m model.onnx --device npu \\\n --ep-options htp_performance_mode=burst \\\n --ep-options htp_graph_finalization_optimization_mode=3\n Per-module benchmarking to find latency hot-spots across all attention blocks:
$ winml perf -m bert-base-uncased --module BertAttention --iterations 200\n"},{"location":"commands/perf/#common-pitfalls","title":"Common pitfalls","text":"--warmup 30 or higher to reach steady-state latency.
--shape-config is ignored in two cases. It has no effect on pre-exported ONNX files (shapes are baked into the graph) and is ignored in --module mode. The command prints a warning in both situations.
winml perf separately with different --device values and compare the resulting JSON reports.
perf benchmarks
--module per-instance benchmarking works
--device vs --ep
Quantize an ONNX model with QDQ insertion and calibration-based scaling.
"},{"location":"commands/quantize/#when-to-use-this","title":"When to use this","text":"Use winml quantize after winml export to insert QuantizeLinear/DequantizeLinear (QDQ) node pairs into an ONNX graph. The resulting model is ready for winml compile targeting an NPU or other quantization-aware execution provider.
$ winml quantize [options]\n"},{"location":"commands/quantize/#flags","title":"Flags","text":"Flag Short Type Default Description --model -m path (required) Input ONNX model file. --output -o path {input}_qdq.onnx Output path for the quantized model. --task string \u2014 Task name (e.g., image-classification, text-classification) used to select a task-appropriate calibration dataset. Pair with --model-name so the dataset is preprocessed exactly the way the model expects. Without --task, calibration falls back to synthetic random data. --model-name string \u2014 HuggingFace model ID (e.g., microsoft/resnet-50) used to load the matching preprocessor/tokenizer for calibration. Only used when --task is provided. --precision -p string None Precision shorthand: int8, int16, or mixed-precision like w8a16. Overridden by explicit --weight-type / --activation-type. --samples integer 10 Number of calibration samples used to compute quantization ranges. --method choice minmax Calibration algorithm: minmax, entropy, or percentile. --weight-type choice \u2014 Per-tensor type for weights: uint8, int8, uint16, or int16. Overrides --precision. When unset, defaults to uint8 (or the type implied by --precision). --activation-type choice \u2014 Per-tensor type for activations: uint8, int8, uint16, or int16. Overrides --precision. When unset, defaults to uint8 (or the type implied by --precision). --per-channel/--no-per-channel flag false Apply per-channel (rather than per-tensor) quantization to weight tensors. --symmetric/--no-symmetric flag false Use symmetric quantization (zero-point fixed at 0). --help -h flag Show this message and exit."},{"location":"commands/quantize/#how-it-works","title":"How it works","text":"winml quantize applies static post-training quantization (PTQ) using the ONNX Runtime quantization API. 
Calibration passes collect activation range statistics, which are used to compute scale and zero-point values baked into QuantizeLinear / DequantizeLinear node pairs around each eligible operator. The --method flag controls range estimation: minmax uses global observed extremes, entropy minimizes KL-divergence, and percentile clips outliers. Precision can be set at a coarse level with --precision or tuned per tensor type with --weight-type and --activation-type; explicit type flags always override --precision.
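To make the scale and zero-point computation concrete, the sketch below applies the textbook affine (asymmetric) quantization formulas to a min/max range, as the minmax method would observe it. It is an illustration under stated assumptions: the helper names are hypothetical, and the ONNX Runtime quantizer may differ in rounding and edge-case handling.

```python
def affine_qparams(rmin, rmax, qmin=0, qmax=255):
    """Scale and zero-point for an observed float range (uint8 by default)."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    if rmax == rmin:
        return 1.0, qmin  # degenerate all-zero range
    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = round(qmin - rmin / scale)
    return scale, max(qmin, min(qmax, zero_point))


def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Map a float to the integer grid, clamping to the representable range."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))


def dequantize(q, scale, zero_point):
    """Map a quantized integer back to an approximate float."""
    return (q - zero_point) * scale


# minmax calibration: the range is the global extremes of the observed values
observed = [-1.0, 0.25, 3.0]
scale, zero_point = affine_qparams(min(observed), max(observed))
```

Values outside the calibrated range are clamped, which is why a calibration set that does not resemble real inputs can hurt accuracy.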
Calibration data is selected based on --task and --model-name. For a supported task, a built-in default calibration dataset is loaded and preprocessed through the model's own tokenizer or image processor, so the calibration tensors match what the model will see at inference time. For an unsupported task, or when --task is omitted entirely, calibration falls back to random data generated from the ONNX input specification. Random-data calibration is fast and always works, but the resulting scales are typically less accurate than dataset-driven calibration, so provide --task and --model-name whenever the model's task is supported.
# Minimal quantization: defaults (10 samples, uint8 weights and activations)\nwinml quantize -m resnet50.onnx\n Input: resnet50.onnx\nOutput: resnet50_qdq.onnx\nWeight type: uint8\nActivation type: uint8\nSamples: 10\nMethod: minmax\n\nRunning quantization...\n\nSuccess! Model quantized\nOutput: resnet50_qdq.onnx\nQDQ nodes inserted: 53\nTotal time: 4.31s\n # Task-aware calibration: real samples preprocessed through the model's own image processor\nwinml quantize -m resnet50.onnx --task image-classification --model-name microsoft/resnet-50 --samples 128\n # int8 precision shorthand (equivalent to --weight-type int8 --activation-type int8)\nwinml quantize -m resnet50.onnx -p int8\n # Mixed-precision: int8 weights, uint16 activations with entropy calibration\nwinml quantize -m bert-base-uncased.onnx --weight-type int8 --activation-type uint16 --method entropy --samples 64\n # Per-channel symmetric quantization to a specific output path\nwinml quantize -m facebook_convnext.onnx -o facebook_convnext_qdq.onnx --per-channel --symmetric --samples 32\n # int16 precision (suitable for models sensitive to int8 accuracy loss)\nwinml quantize -m bert-base-uncased.onnx --precision int16\n"},{"location":"commands/quantize/#common-pitfalls","title":"Common pitfalls","text":"--task and --model-name, scales and zero-points are computed from random tensors synthesized from the ONNX input specification \u2014 the model never sees realistic activations, so accuracy after quantization can degrade noticeably. Always pass --task and --model-name for supported tasks (e.g., --task image-classification --model-name microsoft/resnet-50) so calibration runs on real samples preprocessed through the model's own tokenizer or image processor.--weight-type / --activation-type silently override --precision. If you pass both, the explicit type flags win. Omit --precision when setting types explicitly to avoid confusion.--per-channel increases model size. 
Per-channel quantization stores a separate scale and zero-point per output channel; this can noticeably inflate the model file size compared to per-tensor mode.{stem}_qdq.onnx in the same directory as input. Always pass -o when writing to a specific location to avoid accidentally overwriting or cluttering the source directory.winml compile --no-quant instead if the model already contains QDQ nodes.Inspect your machine \u2014 devices, EPs, and runtime versions at a glance.
"},{"location":"commands/sys/#when-to-use-this","title":"When to use this","text":"Run winml sys before starting any export or build workflow to confirm that the required ML libraries are installed and that the target hardware is visible. It is also the first command to run when diagnosing an unexpected export failure.
$ winml sys [options]\n"},{"location":"commands/sys/#flags","title":"Flags","text":"Flag Short Type Default Description --format -f text | json | compact text Output format. text renders rich tables, json emits machine-readable JSON, compact prints a single-line summary. --list-device \u2014 flag false List available compute devices (NPU, GPU, CPU) in priority order instead of showing the full system report. --list-ep \u2014 flag false List available ONNX Runtime execution providers instead of showing the full system report. Can be combined with --list-device. --verbose -v flag false Surface additional diagnostic sections: backend availability and Export Readiness. --help -h flag \u2014 Show help and exit. winml sys takes no --model, --device, --ep, --task, or --precision arguments. It describes the host environment, not a specific model.
winml sys queries Python's platform and importlib.metadata modules to report library versions, then probes PyTorch for CUDA availability and GPU device names. Backend availability checks use the installed runtime environment, while device enumeration queries hardware directly in NPU > GPU > CPU priority order, and EP enumeration merges the WinML EP registry with ONNX Runtime's get_available_providers(). When --format json is used the full report \u2014 including devices and EPs \u2014 is emitted as a single JSON object, making it easy to capture in CI pipelines.
# Full human-readable system report\n$ winml sys\n +------------------------------------+\n| winml-cli System Information |\n+------------------------------------+\n\nEnvironment\n Python Version 3.11.9\n Python Executable C:\\...\\python.exe\n OS Windows 11\n Machine AMD64\n\nML Libraries\n Library Version Status\n torch 2.4.0 OK\n transformers 4.44.0 OK\n onnx 1.16.1 OK\n ...\n\nAvailable Devices (priority order)\n #1 NPU Qualcomm(R) Hexagon NPU\n #2 GPU Qualcomm(R) Adreno GPU\n #3 CPU Snapdragon(R) X Elite\n\nAvailable Execution Providers\n QNNExecutionProvider -> NPU/GPU\n DmlExecutionProvider -> GPU\n CPUExecutionProvider -> CPU\n # Compact one-liner \u2014 useful for CI logs\n$ winml sys --format compact\n # Machine-readable JSON \u2014 pipe to jq or save for later comparison\n$ winml sys --format json > env.json\n # Only list devices \u2014 skip everything else\n$ winml sys --list-device\n # List EPs as JSON \u2014 useful for scripting EP selection\n$ winml sys --list-ep --format json\n"},{"location":"commands/sys/#common-pitfalls","title":"Common pitfalls","text":"--list-device and --list-ep suppress the full report. When either flag is present, only the requested section is printed. Omit both flags to see the complete system report.--format compact omits device and EP tables. The compact format is designed for single-line log entries and does not include device or EP details. Use text or json when you need the full picture.torch+cuXXX). A CPU-only torch wheel will always report cuda_available: false.--device / --ep flags interactNot every ONNX graph runs efficiently on every execution provider. An operator that compiles cleanly on CPU may be unsupported on an NPU, and a correct graph may still leave performance on the table because adjacent operations were not fused. winml-cli separates the concern into two commands \u2014 winml analyze and winml optimize \u2014 that together form a graph-quality loop driven automatically by winml build.
winml analyze performs static analysis on an ONNX graph to answer one question: will this model run end-to-end on my target execution provider, and if not, what needs to change?
Unlike profiling, static analysis does not require executing the full model on the target device. It inspects each operator (and recognized subgraph pattern) against a rule database of known EP capabilities, classifies every node, and emits actionable recommendations. The same analyzer also drives the autoconf feedback loop inside winml build, so understanding how it works is useful even when you never invoke winml analyze directly.
Specify a target EP with --ep (e.g., --ep qnn or --ep openvino) and a device with --device (CPU, GPU, or NPU). The default --ep auto infers from locally available EPs; pass --ep all to evaluate every rule-data-backed EP regardless of local availability. Results print to the console by default; add --output results.json to save the report as JSON for scripting or archiving.
For each operator (and matched subgraph pattern) the analyzer follows a two-step process:
--run-unknown-op is enabled, the analyzer builds a minimal ONNX graph for the op and runs it on the target EP locally to determine support (see Local op execution below).The combined answer is recorded as a SupportLevel:
| SupportLevel | Target EP | CPU fallback | Label | Exit code |
|---|---|---|---|---|
| SUPPORTED | yes | yes | Fully Supported | 0 |
| PARTIAL | no | yes | Partial Support | 1 (warning) |
| UNSUPPORTED | no | no | Not Supported | 1 (error) |
| UNKNOWN | n/a | n/a | Unknown Support | 1 |

A PARTIAL classification means the operator cannot be dispatched to the requested EP, but ONNX Runtime can still execute the model by falling back to CPU. This is technically a working model, but the latency and power-efficiency goals of NPU deployment are not met. UNSUPPORTED means even the CPU fallback path fails, so the model will not run at all. UNKNOWN appears only when the analyzer lacks both rule-database data and the ability to test locally.
Every analysis produces a lint result; the default (full) mode additionally produces an autoconf result. Understanding these two outputs separately is the easiest way to understand what winml analyze is for and how to consume it.
Lint is the analyzer's verdict on the model as it stands today. It classifies every operator and recognized pattern against the target EP and rolls the classifications up into:
- errors \u2014 count of UNSUPPORTED patterns. The model will not run.
- warnings \u2014 count of PARTIAL patterns. The model runs, but these nodes fall back to CPU.
- passed \u2014 True iff errors == 0 and warnings == 0.

Lint always runs. It is deterministic and sufficient for a yes/no CI gate \u2014 the CLI's exit code is derived from it.
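The roll-up described above can be sketched as follows. The `SupportLevel` enum and the `lint` function here are hypothetical stand-ins written for illustration. They mirror the names in the text but are not the analyzer's actual types or API.

```python
from collections import Counter
from enum import Enum


class SupportLevel(Enum):
    # Hypothetical stand-in for the analyzer's classification values.
    SUPPORTED = "supported"
    PARTIAL = "partial"
    UNSUPPORTED = "unsupported"
    UNKNOWN = "unknown"


def lint(levels):
    """Roll per-pattern classifications up into a lint summary."""
    counts = Counter(levels)
    errors = counts[SupportLevel.UNSUPPORTED]  # the model will not run
    warnings = counts[SupportLevel.PARTIAL]    # these nodes fall back to CPU
    return {
        "errors": errors,
        "warnings": warnings,
        "passed": errors == 0 and warnings == 0,
    }
```

Because the summary depends only on the classifications, the same model and target EP always yield the same verdict, which is what makes it usable as a CI gate.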
Autoconf is the analyzer's suggestion for how to fix the current model. It lists the fusion flags which, if enabled in the optimize stage, would convert one or more PARTIAL/UNSUPPORTED patterns into SUPPORTED ones.
Autoconf is what powers the build pipeline's re-optimization loop: when the analyzer says \"gelu_fusion would resolve these warnings\", the build re-runs optimize with that flag and re-analyzes \u2014 until no further suggestions remain or the iteration limit is hit. Autoconf is advisory; nothing else in the system flips fusion flags automatically.
winml analyze can run in two modes which differ only in whether autoconf is computed:
| Mode | How to select | Output | Typical use |
|---|---|---|---|
| Lint-only | --no-information (CLI) or autoconf=False (Python) | Lint only. optimization_config is None. | CI gate; pass/fail only |
| Full (default) | --information (CLI, default) or autoconf=True (Python) | Lint plus autoconf and recommendations | Local debugging; build pipeline's autoconf loop |

The only difference between the two modes is whether autoconf and the human-readable recommendations are computed. Skipping them gives a faster, leaner run. The lint result is identical either way.
"},{"location":"concepts/analyze-and-optimize/#three-classes-of-finding","title":"Three classes of finding","text":"Every analysis emits findings in three buckets. Each bucket maps to a different remediation pattern.
Errors (UNSUPPORTED patterns) block deployment. Either the operator does not exist on the target EP at all, or it does not handle the specific input shape/dtype the model uses. Typical remediations:
Each error pattern includes a recommendation that identifies the current pattern and the target pattern the EP does support, so the optimizer (or a manual rewrite) can apply the fix.
Warnings (PARTIAL patterns) mean the model will run, but the target EP cannot dispatch this pattern. Inference falls back to the CPU EP, breaking the deployment goal (e.g., NPU offload) without breaking correctness. Warnings are usually fusion opportunities \u2014 the analyzer recognized a sub-pattern that, if fused, would become a single EP-native op. The fix is to enable the relevant fusion flag in the optimize stage \u2014 this is exactly what the autoconf loop does automatically.
Info (Information items) are lower-priority insights: a hint that an alternative pattern exists, a QDQ-equivalent that could be used after quantization, or a description of why a node was classified as it was. Info entries never affect exit code.
The static rule database does not cover every operator and every shape/dtype combination. When --run-unknown-op is enabled and the analyzer encounters a pattern not present in the database, it builds a tiny ONNX graph containing just that op (with the model's actual input metadata) and runs it on the target EP locally. The compile/run result becomes the classification. Without --run-unknown-op (the default), such patterns are classified as UNKNOWN.
Leave --run-unknown-op disabled when:
When a pattern is unsupported and the recommendation does not immediately tell you what is wrong, use --save-node to dump the offending subgraph to disk as a self-contained, runnable .onnx file. You can then open it in Netron, re-analyze it in isolation, or attach it to a bug report as a minimal reproducer. See the analyze command reference for usage examples.
When a model is exported with hierarchy-preserving tags (HTP), the export produces a sidecar _htp_metadata.json that maps each ONNX node back to its source module (e.g., encoder.layer.0.attention.self.GELUActivation). Passing this file via --htp-metadata lets the PatternExtractor use the module hierarchy to match subgraph patterns more accurately than operator-level heuristics alone.
HTP metadata is consumed at the pattern extraction stage \u2014 before any EP-specific runtime checking \u2014 so the enriched patterns benefit all target EPs equally (QNN, OpenVINO, VitisAI, etc.). Without HTP metadata, the analyzer falls back to attribute-based tag matching and then the general-purpose PatternMatcher; with it, the analyzer can correctly identify fused patterns (GELU, LayerNorm, Attention) that are difficult to detect from the raw operator graph. See the analyze command reference for usage examples.
The analyzer is composed of five stages that run in order. You normally do not need to think about them, but they are worth knowing when reading recommendations or extending the analyzer:
| Stage | Job |
|---|---|
| ONNXLoader | Load the ONNX file (or ModelProto), record metadata. |
| PatternExtractor | Walk the graph, match operator and subgraph patterns from the rule catalog. Optionally consume HTP metadata. |
| RuntimeChecker | For each pattern, consult the rule database; if no rule applies, run the op locally (when allowed). |
| InformationEngine | Turn classifications into human-readable Information items; also runs model validators (constant folding, dynamic input, pattern matching, QDQ validation, shape inference). |
| OutputAggregator | Assemble the final AnalysisOutput (the JSON you get from --output). |

The model validators run regardless of whether there are runtime check results \u2014 they are model-level sanity checks (e.g., is shape inference complete? are QDQ pairs well-formed?) and can surface issues even when every operator looks fine in isolation.
"},{"location":"concepts/analyze-and-optimize/#what-optimize-does","title":"What optimize does","text":"winml optimize rewrites the ONNX graph by applying fusions and structural simplifications. Internally the optimizer runs four pipes in sequence:
Every optimization is a named capability toggled via --enable-<name> and --disable-<name> flags. Run --list-capabilities to see all registered optimizations and their defaults. The optimizer currently ships 57 static capabilities across 13 categories:
This granularity matters when a specific fusion breaks a downstream step or when you need an exact optimization profile for a given EP. Some capabilities declare dependencies (e.g., bias-gelu-fusion requires gelu-fusion); the optimizer resolves these automatically when you enable a flag.
Pattern rewrites are a complementary mechanism: instead of folding nodes, rewrites replace one subgraph pattern with a structurally equivalent alternative. Rules are defined in JSON files (default.json for general rewrites, qnn.json for QNN-specific rewrites). The optimizer currently ships 5 rewrite groups containing 12 individual rules \u2014 for example, four GELU source variants can each be rewritten to a single Gelu op, and a MatMul+Add pattern can be rewritten to a GEMM or to a Conv2D for Qualcomm NPU targets. Run --list-rewrites to discover available families and their flag names. Flags follow the form --enable-<source-slug>-<target-slug>.
Commit a specific combination of flags to a --config file for reproducible builds.
A single optimize pass may create fusion opportunities that were not present before, and a freshly fused graph may surface new operator compatibility issues. This is why winml build runs analyze and optimize in an alternating loop rather than once each.
The flow inside winml build (implemented in run_optimize_analyze_loop) is:
The initial optimize pass applies the flags from config.optim. The analyzer then inspects the result; if autoconf discovers fusion flags that were not yet enabled, the optimizer re-runs with those flags and the analyzer re-checks. This repeats up to --max-optim-iterations rounds (default: three). The loop exits early when autoconf suggests no further changes. After the loop, a final analysis validates the result \u2014 if unsupported patterns still exist, the build raises a RuntimeError.
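The control flow just described can be sketched as below. This is not the code of run_optimize_analyze_loop: the function name, the `optimize` and `analyze` callables, and the report keys `suggested_flags` and `errors` are hypothetical. Whether each round re-optimizes from the original graph, as assumed here, is not stated in the text.

```python
def optimize_analyze_sketch(source_model, initial_flags, optimize, analyze,
                            max_iterations=3):
    """Alternate optimize and analyze until autoconf has nothing new to suggest.

    `optimize(model, flags)` returns a rewritten model; `analyze(model)` returns
    a dict with "suggested_flags" and "errors". Both are caller-supplied stubs.
    """
    enabled = set(initial_flags)
    model = optimize(source_model, enabled)  # initial pass with config flags
    for _ in range(max_iterations):
        report = analyze(model)
        new_flags = set(report["suggested_flags"]) - enabled
        if not new_flags:
            break  # autoconf suggests no further changes: exit early
        enabled |= new_flags
        # Assumption: re-run from the original graph with the widened flag set.
        model = optimize(source_model, enabled)
    final = analyze(model)  # final validation after the loop
    if final["errors"] > 0:
        raise RuntimeError("unsupported patterns remain after optimization")
    return model, enabled
```

Capping the number of rounds keeps the build bounded even if each pass keeps exposing new fusion opportunities.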
Use --no-analyze to skip the loop and run a single optimization pass \u2014 useful for deterministic rebuilds from a fixed ONNX checkpoint where the graph is already known good.
winml analyze (CLI) \u2014 exit code is the contract Embed analysis in a build script or notebook analyze_onnx(model, ep=...) (flat Python API) Post-process the full result programmatically ONNXStaticAnalyzer().analyze(...) (class API) Analyze an in-memory ModelProto ONNXStaticAnalyzer().analyze_from_proto(...) Optimize with full control over fusions winml optimize (CLI) with --enable- / --disable- flags Reproducible build from a config file winml build -c config.json (pipeline wrapper) The CLI and the flat Python API are sufficient for the vast majority of cases. The class-based API is only needed when you want to call is_fully_supported(ep), get_unsupported_operators(ep), or get_optimization_opportunities(ep) on the full result.
When you run winml compile, you are not simply copying an ONNX file to a new location. You are asking an execution provider (EP) to transform the model into a form it can load and run directly, without repeating that transformation at every startup. Understanding what the compiler produces \u2014 and why \u2014 helps you decide when to compile, what output format to choose, and how to balance file size against runtime performance.
Compilation is an offline, one-time step. The artifact it creates is what you ship with your application and what winml-cli uses for benchmarking and evaluation.
For EPs that are fully integrated into ONNX Runtime \u2014 CPU, DirectML, and similar providers \u2014 the compile step writes a new .onnx file that the runtime loads directly. The ONNX graph has been prepared and, in some cases, partitioned so that the EP's session initializer has less work to do when the application starts.
For EPs that support ahead-of-time compilation (e.g. --ep qnn for Qualcomm NPUs and --ep vitisai for AMD NPUs), the compiler goes further. It takes the ONNX graph and produces a binary artifact \u2014 the EP context blob \u2014 that encodes the fully compiled, hardware-ready version of the network. This blob is then associated with the ONNX model file. On subsequent loads, the EP reads the blob rather than re-compiling the graph, which makes session creation dramatically faster.
The default compiler backend is ort (ONNX Runtime).
For QNN compilation, winml-cli gives you a choice of where the EP context blob lives. By default the blob is written as a sidecar .bin file alongside the .onnx. Passing --embed instead inlines the blob directly into the ONNX file.
External (default): The .onnx is small and human-inspectable; the heavy binary data lives in a separate file. You must keep the two files together \u2014 the ONNX stores a relative path back to the .bin. This layout is preferable for version control and for scenarios where you want to inspect or diff the model graph.
Embedded (--embed): Everything ships in a single .onnx file. Deployment is simpler because there is only one artifact to track. The trade-off is file size: the .onnx grows by the full size of the compiled context, and the file is no longer human-readable in the usual sense. Choose embedded when your deployment tooling expects a single model file, or when you want to minimize the chance of the sidecar being misplaced.
The first time an ONNX Runtime session is created for a model on a hardware EP, the runtime must partition the graph, allocate buffers, and JIT-compile the operators. On an NPU this process can take several seconds. For applications with tight startup budgets \u2014 on-device inference in a UI flow, for example \u2014 that cold-start cost is often unacceptable.
A model produced by winml compile has already paid that cost. The EP context blob is the result of compilation, not its input. When the application loads the compiled model the EP reads the pre-built binary and the session is ready almost immediately. Shipping a compiled model is therefore the standard pattern for production deployments on QNN hardware.
If you are iterating on quantization settings or ONNX graphs and want to check whether the model compiles at all, pass an already-quantized (QDQ) model directly \u2014 winml compile compiles whatever ONNX file you supply and does not have a separate quantization pass to skip.
By default winml compile runs a validation pass after compilation finishes \u2014 it loads the compiled model into an inference session, feeds it dummy inputs (all-ones tensors), and checks that the outputs do not contain NaN or Inf values. This catches basic compilation failures early (e.g., the EP rejecting the graph or producing garbage outputs).
The --no-validate flag skips that pass. It is useful during rapid iteration when you only want to confirm that compilation succeeds without the overhead of a trial inference run.
--ep / --device flags
winml config and winml build are a producer/consumer pair. winml config inspects a Hugging Face model (or an existing ONNX file), auto-detects the task, model class, and I/O specifications, and writes a WinMLBuildConfig JSON file. winml build reads that file and runs the full pipeline \u2014 export, optimize, quantize, compile \u2014 producing a Windows ML-ready ONNX artifact.
Keeping these two responsibilities separate is intentional. The config file is a stable, human-readable description of exactly what the build will do. You can generate it once, review or edit it, commit it to source control, and replay the same build at any time without re-running model introspection. CI pipelines and team workflows both benefit from treating the config file as a versioned artifact rather than a transient intermediate.
"},{"location":"concepts/config-and-build/#generating-a-config","title":"Generating a config","text":"winml config produces a WinMLBuildConfig JSON with sensible defaults for the detected model type. At minimum, provide a model identifier:
winml config -m microsoft/resnet-50 -o resnet50.json\n Several flags shape what ends up in the config:
- --task overrides the auto-detected Hugging Face task when detection is ambiguous or when you want a specific variant (for example, text-classification vs feature-extraction).
- --no-quant sets the quant section to null, so the quantize stage is omitted when winml build consumes the config. Use this for GPU workflows where float16 is preferred over QDQ quantization.
- --no-compile sets the compile section to null, producing a portable ONNX that the runtime compiles on first load instead of embedding a pre-compiled binary.
- --trust-remote-code allows model repositories that ship custom modeling code \u2014 required for some community models that define non-standard architectures outside the standard transformers library.

If -o is omitted, the config is printed to stdout, which is convenient for piping or quick inspection. The generated JSON is plain text and can be edited directly before being passed to winml build.
A WinMLBuildConfig is a dataclass defined in src/winml/modelkit/config/build.py. It holds five nested sub-configs for the pipeline stages, plus an evaluation config and an auto flag:
| Field | Type | Description |
|---|---|---|
| loader | WinMLLoaderConfig | Task, model type, and model class used to load the Hugging Face model. |
| export | WinMLExportConfig | Input/output tensor specs, opset version, dynamic axes (null for pre-exported ONNX). |
| optim | WinMLOptimizationConfig | Graph fusion flags (GeLU, LayerNorm, MatMul+Add). |
| quant | WinMLQuantizationConfig | Precision types (weight_type, activation_type), calibration samples and method (null to skip). |
| compile | WinMLCompileConfig | Target EP provider, EPContext options, compiler backend (null to skip). |
| eval | WinMLEvaluationConfig or null | Evaluation settings run after the build (null to skip). |
| auto | bool | When true (default), auto-fills missing fields from model introspection. |

Setting quant or compile to null tells the pipeline to skip that stage entirely, equivalent to passing --no-quant or --no-compile on the command line.
A generated config looks similar to:
{\n \"loader\": {\n \"task\": \"image-classification\"\n },\n \"export\": {\n \"opset_version\": 17,\n \"batch_size\": 1\n },\n \"optim\": {\n \"gelu_fusion\": false,\n \"layer_norm_fusion\": false,\n \"matmul_add_fusion\": false\n },\n \"quant\": {\n \"mode\": \"qdq\",\n \"weight_type\": \"uint8\",\n \"activation_type\": \"uint8\",\n \"samples\": 10\n },\n \"compile\": {\n \"execution_provider\": \"qnn\",\n \"enable_ep_context\": true\n }\n}\n The file is plain JSON. You can hand-edit any field before passing it to winml build \u2014 adjust the calibration sample count, change the compile provider, or remove a fusion flag.
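Hand edits like these can also be scripted. The sketch below is one way to do it with the standard library only. The `with_overrides` helper is hypothetical, and the field names (`quant`, `compile`, `samples`) are taken from the example config above rather than from a published schema.

```python
import copy
import json


def with_overrides(config, samples=None, skip_quant=False, skip_compile=False):
    """Return an edited copy of a build config dict; the input is left untouched."""
    updated = copy.deepcopy(config)
    if samples is not None and updated.get("quant") is not None:
        updated["quant"]["samples"] = samples  # adjust calibration sample count
    if skip_quant:
        updated["quant"] = None    # serialized as JSON null: skip quantize
    if skip_compile:
        updated["compile"] = None  # serialized as JSON null: skip compile
    return updated


# A trimmed-down version of the generated config shown above.
config = json.loads("""
{
  "loader": {"task": "image-classification"},
  "quant": {"mode": "qdq", "weight_type": "uint8", "activation_type": "uint8", "samples": 10},
  "compile": {"execution_provider": "qnn", "enable_ep_context": true}
}
""")

edited = with_overrides(config, samples=128, skip_compile=True)
edited_json = json.dumps(edited, indent=2)  # ready to write to a new config file
```

Working on a copy keeps the original config intact, which fits the practice of committing one reviewed config and deriving variations from it.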
Pass the config file to winml build with either an output directory or the global cache flag:
# Write artifacts to a local directory\nwinml build -c resnet50.json -m microsoft/resnet-50 --output-dir output/\n\n# Write to the global cache (~/.cache/winml/)\nwinml build -c resnet50.json -m microsoft/resnet-50 --use-cache\n --output-dir and --use-cache are mutually exclusive; you must supply one of the two when running winml build (enforced at runtime, not parse time). Within the output directory, winml build writes one ONNX file per completed stage so that intermediate artifacts are available for inspection, and it writes a copy of the resolved config so the full build parameters are recorded alongside the outputs.
CLI flags passed directly to winml build override the corresponding config sections for that run only, without modifying the JSON file on disk. This makes it straightforward to experiment with a variation without creating a new config:
# Skip quantization and compilation for this run only\nwinml build -c resnet50.json -m microsoft/resnet-50 --output-dir output/ --no-quant --no-compile\n\n# Skip optimization (for a pre-quantized input ONNX)\nwinml build -c resnet50.json -m model_qdq.onnx --output-dir output/ --no-optimize\n --no-quant, --no-compile, and --no-optimize each suppress the corresponding stage regardless of what the config file specifies. Because the config file is unchanged, re-running without the override flag reverts to the full pipeline described in the config.
Storing the WinMLBuildConfig JSON in source control brings three concrete benefits:
Reproducibility. A config file pins every build decision \u2014 task, precision, quantization method, calibration sample count, target EP, fusion flags \u2014 in a single file. Running winml build -c config.json six months later produces the same artifact as it does today, regardless of how the tool's defaults evolve.
CI integration. A CI job can run winml build -c config.json -m <model-id> --output-dir artifacts/ with no human intervention. Because all settings live in the config file, the CI script requires no per-model flag knowledge, and updating build parameters is a pull request to the config file, not a change to the pipeline script.
Team sharing. Handing a colleague a config file is enough for them to reproduce the exact build on their machine. There is no need to document the sequence of primitive commands, precision arguments, or calibration settings separately \u2014 the file is the documentation.
winml build vs individual primitive commands
An Execution Provider (EP) is a pluggable backend in ONNX Runtime that claims and runs a subset of graph nodes on a specific hardware target. When ONNX Runtime loads a model it partitions the graph among the registered EPs: operators that an EP claims are dispatched to it, and the remainder fall back to the CPU EP. This design lets a single ONNX model exploit an NPU, GPU, or CPU without any change to the graph itself.
A device is the hardware category that an EP targets \u2014 one of npu, gpu, or cpu. winml-cli exposes both levels of control: the high-level --device flag selects a hardware category, while the low-level --ep flag pins a specific ONNX Runtime provider name. In most workflows you set --device and let winml-cli resolve the best available EP; you reach for --ep when you need to compare or force a specific provider.
The table below lists every Execution Provider that winml-cli has explicit support for. EP names are the canonical ONNX Runtime strings accepted by --ep. You can also use the short alias (case-insensitive) anywhere the full name is accepted.
| EP name | Alias | Device | Hardware | Notes |
|---|---|---|---|---|
| QNNExecutionProvider | qnn | npu / gpu | Qualcomm NPU (Hexagon DSP) / Qualcomm GPU (Adreno) | Snapdragon-based Copilot+ PCs; best latency and power efficiency on Qualcomm silicon |
| VitisAIExecutionProvider | vitisai | npu | AMD NPU (XDNA) | AMD Ryzen AI platforms; targets the AMD AI Engine via the Vitis AI stack |
| OpenVINOExecutionProvider | openvino | npu / gpu / cpu | Intel CPU / GPU / NPU | Intel Core Ultra platforms; flexible device targeting across all three Intel compute types |
| DmlExecutionProvider | dml | gpu | GPU (DirectML) | Any DirectX 12 GPU on Windows; broad compatibility across AMD, Intel, and NVIDIA discrete/integrated graphics |
| NvTensorRTRTXExecutionProvider | nv_tensorrt_rtx | gpu | NVIDIA GPU (TensorRT RTX) | NVIDIA RTX GPUs; maximum throughput via TensorRT graph optimization |
| MIGraphXExecutionProvider | migraphx | gpu | AMD GPU (MIGraphX) | AMD discrete GPUs; hardware-accelerated inference via the MIGraphX graph engine |
| CPUExecutionProvider | cpu | cpu | CPU | Universal fallback; always available regardless of hardware |
To see which EPs are available on the current machine, run:
winml sys --list-ep\n"},{"location":"concepts/eps-and-devices/#device-vs-ep-on-the-cli","title":"Device vs. EP on the CLI","text":"winml-cli exposes two overlapping flags for targeting hardware. Understanding their relationship prevents confusion when using winml analyze, winml compile, or winml build.
--device (high-level)
Accepts one of four values: auto, cpu, gpu, or npu. When set to auto (the default), winml-cli inspects the machine and selects the highest-priority device class that has a compatible EP available, in the order NPU > GPU > CPU. Setting an explicit value such as --device npu requests a device category without naming the EP.
For winml analyze, --device also accepts all \u2014 this evaluates the model against every device that has rule data, producing a side-by-side compatibility report.
# Let winml-cli pick the best available device\nwinml analyze --model model.onnx --device auto\n\n# Target the NPU device class\nwinml analyze --model model.onnx --device npu\n\n# Analyze against all devices at once (analyze only)\nwinml analyze --model model.onnx --device all\n
--ep (low-level override)
Accepts a valid EP name or alias (for example qnn, vitisai, dml, openvino), or auto to let winml-cli resolve the EP from the device. When --ep is provided with a specific value it takes precedence over --device and bypasses device-class resolution entirely. Use --ep when you need to pin a specific provider \u2014 for instance to compare QNNExecutionProvider against DmlExecutionProvider on the same machine.
For winml analyze, --ep also accepts all \u2014 this evaluates the model against every registered EP simultaneously.
# Force Qualcomm QNN regardless of device selection\nwinml analyze --model model.onnx --ep QNNExecutionProvider --device npu\n\n# Use the short alias; winml-cli normalizes it to the full name\nwinml analyze --model model.onnx --ep qnn\n\n# Analyze against all EPs at once (analyze only)\nwinml analyze --model model.onnx --ep all\n The --ep flag accepts a free-form string and is not restricted to the choices listed above. This allows forward compatibility with EP names that winml-cli does not yet enumerate.
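The precedence rules described above can be sketched in a few lines of Python. This is an illustrative model only, not winml-cli's implementation: the `EP_DEVICES` and `ALIASES` tables are transcribed from the EP list on this page, while `resolve_ep`, its `available` argument, and the rule of taking the first available EP that targets a device class are assumptions made for the example.

```python
# Illustrative sketch only; not winml-cli source code.

# EP name -> device classes it can target (transcribed from the EP table).
EP_DEVICES = {
    "QNNExecutionProvider": ("npu", "gpu"),
    "VitisAIExecutionProvider": ("npu",),
    "OpenVINOExecutionProvider": ("npu", "gpu", "cpu"),
    "DmlExecutionProvider": ("gpu",),
    "NvTensorRTRTXExecutionProvider": ("gpu",),
    "MIGraphXExecutionProvider": ("gpu",),
    "CPUExecutionProvider": ("cpu",),
}

# Short alias -> canonical name (aliases are matched case-insensitively).
ALIASES = {
    "qnn": "QNNExecutionProvider",
    "vitisai": "VitisAIExecutionProvider",
    "openvino": "OpenVINOExecutionProvider",
    "dml": "DmlExecutionProvider",
    "nv_tensorrt_rtx": "NvTensorRTRTXExecutionProvider",
    "migraphx": "MIGraphXExecutionProvider",
    "cpu": "CPUExecutionProvider",
}

DEVICE_PRIORITY = ("npu", "gpu", "cpu")  # order applied by --device auto


def normalize_ep(name):
    """Map an alias to its canonical name; pass unknown strings through unchanged."""
    return ALIASES.get(name.lower(), name)


def resolve_ep(device="auto", ep="auto", available=("CPUExecutionProvider",)):
    """Return the EP name a run would use under the rules described above."""
    if ep != "auto":
        # An explicit --ep wins and skips device-class resolution entirely.
        return normalize_ep(ep)
    candidates = DEVICE_PRIORITY if device == "auto" else (device,)
    for device_class in candidates:
        for ep_name in available:
            # Assumption: the first available EP that targets the class is chosen.
            if device_class in EP_DEVICES.get(ep_name, ()):
                return ep_name
    raise LookupError(f"no available EP targets device {device!r}")
```

On a machine where only DirectML and the CPU provider are present, `resolve_ep(available=("DmlExecutionProvider", "CPUExecutionProvider"))` falls through the NPU class and settles on the GPU provider, whereas `resolve_ep(ep="qnn", device="gpu")` returns the QNN provider without consulting the device at all.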
winml eval answers one question: does this model produce correct results? It measures accuracy \u2014 how well outputs match ground truth \u2014 rather than latency or throughput. You give it a model, point it at a labeled dataset, and get back a JSON report of metric scores. Everything else in the pipeline (compilation, quantization, device selection) is about making the model fast; eval is about knowing whether it is still right.
The dataset is the source of truth. Eval iterates over dataset rows, runs each sample through the model, and compares the prediction to the label recorded in the dataset. This means the dataset must have both input features and ground-truth labels, and the columns carrying those values must be wired to the model's inputs and outputs. winml-cli handles standard tasks automatically, but the column-mapping flags let you override the defaults for non-standard datasets.
"},{"location":"concepts/eval-and-datasets/#what-eval-reports","title":"What eval reports","text":"The metric reported depends on the task. Classification tasks produce accuracy (top-1 and optionally top-5). Object detection tasks produce mean average precision (mAP). The exact set of metrics is printed to stdout and saved to the file specified by --output. The --output flag accepts any .json path; if omitted, results are printed but not persisted. Use --schema to print the expected dataset schema for a given task without running eval, which is useful when you are preparing a custom dataset.
--dataset takes a Hugging Face dataset path \u2014 for example imagenet-1k or glue. If you omit it, winml-cli selects a default dataset based on the detected task. For datasets that have multiple configurations, --dataset-name picks the specific config (e.g. --dataset-name mrpc when using the glue dataset).
By default eval runs on the validation split; --split overrides this. Full validation sets can be large. During development, --samples 200 caps the run to 200 rows so you get quick feedback. For very large datasets that you prefer not to download fully, --streaming fetches rows on demand instead of materialising the whole dataset locally. --shuffle (on by default) randomises sampling order so a capped run is representative rather than biased toward the first rows.
winml-cli must know which dataset column feeds which model input and which column holds the ground-truth label. For well-known task/dataset combinations this mapping is built in. When it is not, use --column key=value to declare it. The key is the name the task pipeline expects (e.g. input_column) and value is the actual column name in the dataset (e.g. image). You can repeat --column as many times as needed.
When the integer label IDs in the dataset do not match the class indices the model was trained against, --label-mapping accepts a JSON file of the form {\"class_name\": id} that translates between the two spaces. This is common with models fine-tuned on a relabelled subset of a public dataset.
Quantization is a lossy transformation. Converting weights from float32 to int8, or activations to a narrow range, introduces rounding error that accumulates differently across architectures and calibration data. The impact on accuracy cannot be predicted analytically; it must be measured. Running winml eval before and after quantization gives you a concrete accuracy delta. A drop within your acceptable threshold confirms the quantized model is ready; a larger drop means you should revisit calibration settings or switch to a less aggressive quantization scheme.
Make this a habit: quantize, then eval. Comparing two --output JSON files is a reliable, reproducible record that the trade-off between performance and accuracy was explicitly checked. See Quantization for the full quantization workflow.
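A before/after comparison of two eval reports might look like the sketch below. The layout of the report file is not specified on this page, so the code assumes a flat JSON object that maps metric names to numbers and that higher values are better (as with accuracy or mAP); `load_metrics`, `metric_deltas`, and `regressions` are hypothetical helper names, not part of winml-cli.

```python
import json
from pathlib import Path


def load_metrics(path):
    """Load a report and keep its top-level numeric entries (assumed layout)."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return {
        name: float(value)
        for name, value in data.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }


def metric_deltas(baseline, candidate):
    """Return candidate minus baseline for every metric present in both reports."""
    shared = baseline.keys() & candidate.keys()
    return {name: candidate[name] - baseline[name] for name in sorted(shared)}


def regressions(deltas, max_drop):
    """Keep only the metrics that dropped by more than the accepted budget."""
    return {name: delta for name, delta in deltas.items() if delta < -max_drop}


# Example with in-memory values; real runs would call load_metrics() on the
# two files written via --output.
fp32 = {"accuracy": 0.80, "top5": 0.95}
int8 = {"accuracy": 0.78, "top5": 0.95}
print(regressions(metric_deltas(fp32, int8), max_drop=0.01))
```

With a budget of one percentage point the example flags `accuracy` and leaves `top5` alone; widening the budget to five points would report nothing.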
winml eval command reference: all flags with examples
A .onnx file is, at rest, a binary-serialized Protocol Buffer. Open it in a hex editor and you will not find a dedicated magic number, because Protocol Buffers do not define one; what you see is a dense encoding of every weight the model has learned, plus the structural description of how those weights are combined to produce a prediction. The file is self-contained: weights and computation recipe live together, making the artifact portable without any accompanying framework installation.
That computation recipe is a graph \u2014 a directed acyclic structure of operators wired together by named data edges. The graph is what the ONNX Intermediate Representation (IR) actually defines. When winml-cli loads or transforms a model, every operation works against this graph structure, not against framework-specific objects.
"},{"location":"concepts/graphs-and-ir/#what-is-in-a-onnx-file","title":"What is in a .onnx file","text":"An ONNX ModelProto wraps a single GraphProto. Inside the graph you will find:
pixel_values: float32[1, 3, 224, 224]).
winml.io.inputs (serialized tensor specs) and winml.hierarchy.tag attributes on individual nodes.
ONNX functions as an Intermediate Representation: a portable, framework-neutral description of a computation that can be loaded by any conforming runtime. Unlike a Python object graph or a compiled binary, the ONNX IR makes data flow completely explicit. Every node declares the exact names of its input and output edges; those names form a namespace shared across the whole graph, so any consumer can trace a tensor from the model inputs through every transformation to the final output.
This explicit wiring unlocks two capabilities that winml-cli relies on heavily. First, shape inference can propagate concrete or symbolic dimensions through the graph without running it \u2014 a prerequisite for correct quantization and for generating input specs automatically. Second, EP-targeted compilation can partition the graph by examining which nodes an Execution Provider supports, fuse eligible sub-graphs into accelerated kernels, and serialize the result back into a valid ONNX file using the EPContext convention. Neither of these would be tractable on an opaque binary or a dynamic execution trace.
Because the IR is static \u2014 describing the full computation at load time rather than at call time \u2014 winml-cli can inspect, validate, and transform a model without a GPU, a framework, or sample data.
"},{"location":"concepts/graphs-and-ir/#opsets-and-versioning","title":"Opsets and versioning","text":"Every operator in ONNX belongs to a domain, and every domain advances through numbered opset versions. An opset is a snapshot of the operator catalog: it defines which operators exist, what their inputs and outputs mean, and how edge cases are handled. When a model declares opset_import { domain: \"\" version: 17 }, it is saying \"all unnamed-domain operators in this file must be interpreted according to the rules published in opset 17.\"
winml-cli defaults to opset 17 when exporting a PyTorch model to ONNX. This is the value of opset_version: int = 17 in WinMLExportConfig (src/winml/modelkit/export/config.py, line 75). Opset 17 introduced LayerNormalization as a native operator, eliminating the multi-node decomposition required by earlier opsets, which is why it is the recommended baseline for modern transformer and vision architectures. GroupNormalization, by contrast, only became a native operator in opset 18.
Higher opsets unlock additional operators and fix known edge-case behavior, but not every Execution Provider supports the latest opset. QNN, for instance, may lag behind the ONNX standard by one or two versions. If you need to target an older EP, pass a custom export configuration:
# Write a config override\necho '{\"opset_version\": 16}' > export_cfg.json\n\n# Export with the override\nwinml export -m prajjwal1/bert-tiny -o bert.onnx --export-config export_cfg.json\n You can also check the opset a saved model declares:
winml inspect -m bert.onnx\n Opset: ai.onnx == 17\n When winml-cli's optimization and quantization pipelines transform a model, they preserve the declared opset unless explicitly instructed otherwise, so the model you receive after winml quantize will carry the same opset version as the model you supplied.
winml-cli is a toolkit for converting PyTorch and Hugging Face models into ONNX artifacts that are optimized and compiled for Windows ML execution providers (EPs). Starting from a model identifier or a pre-exported ONNX file, winml-cli runs a staged pipeline \u2014 export, optimize, quantize, compile \u2014 and produces a final model.onnx ready for inference via a Windows ML session.
Each stage is independently controllable. Quantization and compilation are optional and can be bypassed with a flag or by leaving the corresponding section of the build configuration empty. The same pipeline API that powers winml build is also the programmatic entry point for WinMLAutoModel.from_pretrained().
The stages run in order, and each one writes an intermediate ONNX file to the output directory. All intermediate artifacts are preserved so you can inspect any stage's output or feed a pre-processed file into a later stage directly.
"},{"location":"concepts/how-it-works/#pipeline-stages","title":"Pipeline Stages","text":""},{"location":"concepts/how-it-works/#export-winml-export","title":"Export \u2014winml export","text":"winml export loads a Hugging Face model (pretrained or random-weight), traces it with torch.export or an Optimum-based exporter, and writes a portable, device-agnostic ONNX file. The output at this stage is a plain ONNX graph with float32 weights and no EP-specific nodes.
winml analyze","text":"winml analyze performs static compatibility analysis on an ONNX graph against a target execution provider. It classifies every node as Supported, Partial, Unsupported, or Unknown \u2014 without running the model on the device. Use it before building to check if your model (or an intermediate artifact from any pipeline stage) will run cleanly on the target EP:
winml analyze -m model.onnx --ep qnn --device npu\n Add --optim-config optim.json to output auto-discovered optimization recommendations that can be fed directly into winml optimize. The same analyzer also drives the autoconf feedback loop inside winml build.
winml optimize","text":"winml optimize runs graph-level transformations on the exported ONNX: operator fusion (attention, layer norm, GeLU), constant folding, and graph pruning. The optimize stage also contains an autoconf loop: a static analyzer inspects the graph for nodes that the target EP cannot dispatch natively, and re-runs optimization with adjusted fusion flags until no further improvements are found (up to a configurable iteration limit).
winml quantize","text":"winml quantize inserts Quantize-Dequantize (QDQ) nodes into the optimized graph to reduce weights and activations to lower-precision types (for example, int8 weights with uint8 activations). Calibration data is used to compute quantization parameters per tensor. If the input model already contains QDQ nodes, this stage is skipped automatically.
winml compile","text":"winml compile invokes an EP-specific compiler (for example, the QNN compiler for NPU targets) to embed a pre-compiled binary cache inside the ONNX graph as an EPContext node. At inference time, the EP loads the cached binary directly, bypassing per-session compilation. Compilation is optional; omitting it produces a portable ONNX that is compiled on first load by the runtime.
winml perf / winml eval","text":"After the model is built, winml perf benchmarks inference latency and throughput using a Windows ML session, and winml eval runs task-specific accuracy evaluation. Neither command modifies the model; they consume the final model.onnx produced by the pipeline.
winml build as the One-Shot Wrapper","text":"Running each stage individually is useful when iterating on a specific step, but the normal workflow is winml build, which orchestrates the full pipeline in a single command:
winml build -m microsoft/resnet-50 -o output/\n The -c config.json flag is optional. If omitted, winml build auto-generates a default config internally. To customize pipeline settings, generate a config first with winml config and then pass it:
winml config -m microsoft/resnet-50 -o config.json\nwinml build -c config.json -m microsoft/resnet-50 -o output/\n winml build auto-detects whether the input is a Hugging Face model ID or an existing ONNX file and calls the appropriate internal API (build_hf_model or build_onnx_model). When given an ONNX file directly, the export stage is skipped and the pipeline starts at optimize.
Individual stages can be bypassed from the command line without editing the config file:
# Skip quantization and compilation\nwinml build -m bert-base-uncased -o output/ --no-quant --no-compile\n\n# Skip optimization (for pre-quantized input)\nwinml build -m model_qdq.onnx -o output/ --no-optimize\n"},{"location":"concepts/how-it-works/#configuration-winmlbuildconfig-vs-cli-flags","title":"Configuration: WinMLBuildConfig vs CLI Flags","text":"Pipeline behavior is primarily governed by a WinMLBuildConfig JSON file generated by winml config. The config is a hierarchical structure with one section per stage:
WinMLBuildConfig\n\u251c\u2500\u2500 loader \u2014 model type, task, input constraints\n\u251c\u2500\u2500 export \u2014 input tensor specs, opset, backend\n\u251c\u2500\u2500 optim \u2014 fusion flags, optimization level\n\u251c\u2500\u2500 quant \u2014 precision, calibration settings (null = skip stage)\n\u251c\u2500\u2500 compile \u2014 target EP, device (null = skip stage)\n\u2514\u2500\u2500 eval \u2014 evaluation settings\n Setting quant or compile to null in the JSON file is equivalent to passing --no-quant or --no-compile on the command line; both result in the corresponding stage being skipped. CLI flags override the config at runtime without modifying the file, which is convenient for one-off experiments.
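The interaction between null sections and CLI flags can be illustrated with a small sketch. It is a simplified model rather than the actual winml build logic: `planned_stages` is a hypothetical helper, a missing section is treated the same as null, and the field inside the `compile` section of the example is a placeholder, since the real schema lives in the config reference.

```python
import json

# Config section -> pipeline stage it controls (section names from the tree above).
SECTION_TO_STAGE = {
    "export": "export",
    "optim": "optimize",
    "quant": "quantize",
    "compile": "compile",
}


def planned_stages(config, no_quant=False, no_compile=False, input_is_onnx=False):
    """List the stages that would run for a config plus CLI overrides."""
    stages = []
    for section, stage in SECTION_TO_STAGE.items():
        if stage == "export" and input_is_onnx:
            continue  # an ONNX input starts the pipeline at optimize
        if stage == "quantize" and (no_quant or config.get(section) is None):
            continue  # null section and --no-quant are equivalent
        if stage == "compile" and (no_compile or config.get(section) is None):
            continue  # null section and --no-compile are equivalent
        stages.append(stage)
    return stages


# "ep" is a placeholder field name used only for this illustration.
config = json.loads('{"quant": null, "compile": {"ep": "qnn"}}')
print(planned_stages(config))  # ['export', 'optimize', 'compile']
```

Passing `no_compile=True` to the same call drops the compile stage without touching the config dictionary, which mirrors how a CLI flag overrides the file for a single run.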
The config file is written (or updated) to the output directory after the optimize stage completes, capturing any autoconf-adjusted fusion flags so the build is reproducible. This persisted winml_build_config.json is a self-contained pipeline specification that you can check into version control and run in CI/CD (winml build -c winml_build_config.json -m <model> -o output/) for repeatable, unattended builds across environments.
For the full field-by-field schema, see Reference \u2014 Config Schema.
"},{"location":"concepts/how-it-works/#see-also","title":"See Also","text":"The first stage of the winml-cli pipeline is the most deterministic: bring a model into memory and convert it to ONNX. Everything that follows \u2014 optimization, quantization, compilation \u2014 operates on that ONNX artifact. A well-exported graph with accurate metadata travels cleanly through the rest of the pipeline without requiring patching or re-export.
Loading is an internal operation: the loader module resolves model provenance, selects the right HuggingFace model class, and prepares the weights for tracing. The winml export command is the surface users interact with directly.
When you point winml-cli at a model identifier, the internal loader resolves it in one of two ways. If the identifier looks like a HuggingFace Hub path (e.g., prajjwal1/bert-tiny), the loader downloads the model weights and configuration to the standard HuggingFace cache at ~/.cache/huggingface. Subsequent runs are served from that cache without re-downloading. If the identifier is a path to a local PyTorch checkpoint directory, the loader reads it directly without network access.
In both cases the loader auto-detects the task \u2014 image classification, text feature extraction, and so on \u2014 and selects a corresponding HuggingFace model class. The result is a PyTorch model object ready for tracing.
Before committing to a full export you can verify that the loader resolved everything correctly with winml inspect. It prints the detected task, the HuggingFace model class, the export configuration, and the WinML inference class \u2014 all without downloading weights. Add --hierarchy to reconstruct the PyTorch module tree from random-weight tracing.
Some community models host custom Python code in their repositories. The loader refuses to execute it by default. Pass --trust-remote-code to winml config when generating a build configuration for such a model.
winml export converts the loaded model to ONNX. The conversion uses TorchScript tracing by default, which follows actual execution paths and tends to produce compact, inference-oriented graphs. Note: the --dynamo flag is reserved for the PyTorch 2.x dynamo exporter but is not yet functional in the current release; passing it logs a warning and the flag is ignored.
By default the exporter runs an eight-step process that includes hierarchy tracing and tag injection. The result is an ONNX file enriched with structural metadata that powers downstream features such as per-module benchmarking, inspector views, and optimizer scoping.
"},{"location":"concepts/load-and-export/#hierarchy-tagging-in-detail","title":"Hierarchy tagging in detail","text":"During export the HTP (Hierarchy-preserving Tags Protocol) exporter attaches two pieces of information to every ONNX graph node via node.metadata_props:
winml.hierarchy.tag Full module path the node originated from /BertModel/BertEncoder/BertLayer.0/BertAttention winml.hierarchy.depth Number of path segments (integer as string) 4"},{"location":"concepts/load-and-export/#how-tags-are-built","title":"How tags are built","text":"The exporter registers PyTorch forward hooks on each module. When a module executes, a pre-hook pushes its class name onto a tag stack; the post-hook pops it. This produces hierarchical paths that mirror the PyTorch module tree:
flowchart LR\n A[Register hooks] --> B[Run forward pass]\n B --> C[Pre-hook pushes tag]\n C --> D[Child modules execute]\n D --> E[Post-hook pops tag]\n E --> F[Tag stack \u2192 path] Only modules that are actually executed during tracing receive tags \u2014 unused modules are excluded. For example, prajjwal1/bert-tiny has 48 registered modules but only 18 are reached during a forward pass.
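The push/pop mechanism is easy to see in miniature. The sketch below replaces PyTorch with a hypothetical `FakeModule` class so that it runs without any dependencies; `TagStack` and `UnusedHead` are likewise invented for the illustration. It also omits the instance suffix that real tags carry for repeated layers (such as `BertLayer.0`).

```python
from contextlib import contextmanager


class TagStack:
    """Collects one hierarchical path per executed module."""

    def __init__(self):
        self._stack = []
        self.paths = []

    @contextmanager
    def scope(self, class_name):
        self._stack.append(class_name)                  # pre-hook: push
        self.paths.append("/" + "/".join(self._stack))  # current stack -> path
        try:
            yield
        finally:
            self._stack.pop()                           # post-hook: pop


class FakeModule:
    """Hypothetical stand-in for torch.nn.Module, so the sketch needs no PyTorch."""

    def __init__(self, class_name, children=(), executed=True):
        self.class_name = class_name
        self.children = list(children)
        self.executed = executed

    def forward(self, tags):
        with tags.scope(self.class_name):
            for child in self.children:
                if child.executed:  # modules skipped at run time never get a tag
                    child.forward(tags)


model = FakeModule("BertModel", [
    FakeModule("BertEmbeddings"),
    FakeModule("BertEncoder", [
        FakeModule("BertLayer", [FakeModule("BertAttention"), FakeModule("BertOutput")]),
    ]),
    FakeModule("UnusedHead", executed=False),  # hypothetical module the forward pass skips
])

tags = TagStack()
model.forward(tags)
for path in tags.paths:
    print(path.count("/"), path)  # depth = number of path segments
```

The `UnusedHead` module never appears in `tags.paths`, which is the same effect that leaves unreached modules untagged in a real export.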
Running winml export -m prajjwal1/bert-tiny -o model.onnx -v produces the following hierarchy tree (18 traced modules, 132 ONNX nodes, 100 % coverage):
BertModel (132 nodes)\n\u251c\u2500\u2500 BertEmbeddings: embeddings (7 nodes)\n\u251c\u2500\u2500 BertEncoder: encoder (106 nodes)\n\u2502 \u251c\u2500\u2500 BertLayer: encoder.layer.0 (53 nodes)\n\u2502 \u2502 \u251c\u2500\u2500 BertAttention: encoder.layer.0.attention (39 nodes)\n\u2502 \u2502 \u2502 \u251c\u2500\u2500 BertSelfOutput: encoder.layer.0.attention.output (4 nodes)\n\u2502 \u2502 \u2502 \u2514\u2500\u2500 BertSdpaSelfAttention: encoder.layer.0.attention.self (35 nodes)\n\u2502 \u2502 \u251c\u2500\u2500 BertIntermediate: encoder.layer.0.intermediate (10 nodes)\n\u2502 \u2502 \u2502 \u2514\u2500\u2500 GELUActivation: encoder.layer.0.intermediate.intermediate_act_fn (8 nodes)\n\u2502 \u2502 \u2514\u2500\u2500 BertOutput: encoder.layer.0.output (4 nodes)\n\u2502 \u2514\u2500\u2500 BertLayer: encoder.layer.1 (53 nodes)\n\u2502 \u2514\u2500\u2500 ... (same structure)\n\u2514\u2500\u2500 BertPooler: pooler (0 nodes)\n Each ONNX node gets its tag from the module it belongs to. Here are a few examples from the actual exported model:
ONNX node name Assigned tag/embeddings/word_embeddings/Gather /BertModel/BertEmbeddings /encoder/layer.0/attention/self/query/MatMul /BertModel/BertEncoder/BertLayer.0/BertAttention/BertSdpaSelfAttention /encoder/layer.0/intermediate/intermediate_act_fn/Mul /BertModel/BertEncoder/BertLayer.0/BertIntermediate/GELUActivation /Unsqueeze (no scope) /BertModel (root fallback)"},{"location":"concepts/load-and-export/#node-to-module-mapping","title":"Node-to-module mapping","text":"After the ONNX graph is produced by torch.onnx.export, a 4-priority system assigns each ONNX node to the closest matching module:
/BertModel).
This guarantees 100 % tag coverage: every node in the graph carries a non-empty tag.
"},{"location":"concepts/load-and-export/#graph-level-metadata","title":"Graph-level metadata","text":"Beyond per-node tags, the exporter also writes model-level metadata properties:
| Key | Content |
|---|---|
| winml.io.inputs | JSON array of InputTensorSpec: name, shape, dtype, and optional value_range |
| winml.io.outputs | JSON array of OutputTensorSpec: name, shape, dtype |
These I/O specs enable tools like winml perf to generate correct dummy inputs for benchmarking and winml inspect to display tensor shapes without loading the model into a runtime.
Alongside the .onnx file, the exporter writes a *_htp_metadata.json sidecar containing:
- nodes: complete mapping of every ONNX node name to its hierarchy tag
- modules: traced module information (class name, tag, execution order)
- statistics: export time, node counts, coverage percentage
- outputs: I/O tensor specifications
Use --with-report to additionally generate a human-readable markdown report (*_htp_export_report.md).
- winml inspect --hierarchy: traces the model with random weights and displays the resulting module tree in the terminal. This is a lightweight preview of what tags will look like after a full export.
- winml perf --module <ClassName>: isolates a submodule (e.g. BertAttention) and benchmarks it independently.
If you need a clean, standard-compliant ONNX without custom metadata (to hand off to a third-party tool, for example), pass --no-hierarchy. (The old --clean-onnx spelling remains as a deprecated hidden alias.) The graph behaviour is unchanged, but hierarchy-dependent features will not work against that file.
Most export failures fall into three categories.
Task mismatch. The loader auto-detects task from the model card and configuration, but some models are registered under multiple tasks or have ambiguous metadata. If the wrong task is selected the exporter generates incorrect dummy inputs and the trace fails or produces wrong output shapes. Override it explicitly with --task, for example --task image-feature-extraction.
Shape issues. Transformer models often have symbolic sequence-length dimensions; vision models may expect a fixed spatial resolution. If the default dummy inputs do not match what the model accepts, shape inference will fail or produce dynamic shapes that downstream tools cannot handle. Provide a --shape-config JSON file with explicit overrides, or use --input-specs to supply a fully specified input manifest.
Custom modules. Some models contain torch.nn.Module subclasses the tracer cannot automatically decompose. A --torch-module option (comma-separated class names) is intended to include them as distinct hierarchy nodes rather than inlining them, which is most often needed for custom normalization or attention implementations defined in the model repository. Note: this flag is reserved and not yet functional in the current release; passing it logs a warning and the flag is ignored.
Knowing that a model produces correct outputs is necessary but not sufficient for a production deployment. You also need to know how fast it runs, how consistently it runs, and where the time goes when it does not run fast enough. winml perf is the primary tool in winml-cli for answering those questions. It synthesises end-to-end latency numbers and live hardware utilisation into a single benchmarking workflow.
Because winml perf accepts both HuggingFace model IDs and local .onnx files, you can benchmark at any stage of the development cycle \u2014 from a freshly exported float model through to a compiled, quantized production artifact.
At its core, winml perf runs a configurable number of inference iterations and reports latency statistics. Here is a real example benchmarking bert-tiny on CPU:
$ winml perf -m bert-tiny.onnx --device cpu --iterations 50 --warmup 5\n\nDevice: cpu / CPUExecutionProvider\nTask: auto (auto-detected)\nModel Precision: fp32\nInputs: input_ids [1, 512] int32\n attention_mask [1, 512] int32\n token_type_ids [1, 512] int32\nOutputs: last_hidden_state [1, 512, 128]\n Output latency table:
| Avg | P50 | P90 | P95 | P99 | Min | Max | Std |
|---|---|---|---|---|---|---|---|
| 5.53 | 5.40 | 6.55 | 6.87 | 7.65 | 4.89 | 7.65 | 0.58 |
Warmup: 14.14 ms avg (first 5 iterations)\nThroughput: 180.72 samples/sec\n
Key parameters:
| Flag | Purpose | Default |
|---|---|---|
| --iterations | Number of benchmark iterations | 100 |
| --warmup | Warmup iterations excluded from statistics | 10 |
| --batch-size | Batch size for input generation | 1 |
| -d, --device | Target device: auto, cpu, gpu, npu | auto |
| --ep | Specific execution provider (e.g. qnn, dml, openvino) | auto-resolved from device |
| --precision | Precision mode: auto, fp32, fp16, int8, int16, or w{x}a{y} | auto |
| --quantize/--no-quantize | Include quantization during model build | --quantize |
| --skip-build/--no-skip-build | Skip the build pipeline for ONNX inputs | --skip-build |
"},{"location":"concepts/perf-and-monitoring/#output-format","title":"Output format","text":"Add -f json to emit structured JSON to stdout, suitable for CI pipelines or automated comparisons:
{\n \"benchmark_info\": {\n \"model_id\": \"bert-tiny.onnx\",\n \"task\": \"auto-detected\",\n \"device\": \"cpu\",\n \"ep\": \"CPUExecutionProvider\",\n \"precision\": \"auto\",\n \"iterations\": 50,\n \"warmup\": 5,\n \"batch_size\": 1,\n \"timestamp\": \"2026-06-11T03:27:24+00:00\"\n },\n \"model_info\": {\n \"input_names\": [\"input_ids\", \"attention_mask\", \"token_type_ids\"],\n \"input_shapes\": [[1, 512], [1, 512], [1, 512]],\n \"input_types\": [\"int32\", \"int32\", \"int32\"],\n \"output_names\": [\"last_hidden_state\"],\n \"output_shapes\": [[1, 512, 128]]\n },\n \"latency_ms\": {\n \"mean\": 5.53, \"p50\": 5.40, \"p90\": 6.55,\n \"p95\": 6.87, \"p99\": 7.65, \"min\": 4.89, \"max\": 7.65,\n \"std\": 0.58, \"warmup_mean\": 14.14\n },\n \"throughput\": { \"samples_per_sec\": 180.72, \"batches_per_sec\": 180.72 },\n \"raw_samples_ms\": [5.12, 5.40, ...]\n}\n Results are also saved automatically to ~/.cache/winml/perf/<model_slug>/<timestamp>.json for later comparison. Override the path with --output.
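Because the JSON report includes `raw_samples_ms`, the summary statistics can be recomputed or extended offline. The sketch below is one way to do that under stated assumptions: percentiles use linear interpolation and the standard deviation is the population form, neither of which is confirmed to match what winml perf uses internally, and `percentile` and `summarize` are hypothetical helper names. The throughput formula (batch size times 1000 divided by mean latency in milliseconds) is inferred from the sample report, where a mean of about 5.53 ms corresponds to about 180.7 samples per second.

```python
import math
import statistics


def percentile(sorted_samples, pct):
    """Linearly interpolated percentile of an ascending list (pct in 0..100)."""
    rank = (len(sorted_samples) - 1) * pct / 100.0
    lower = math.floor(rank)
    upper = min(lower + 1, len(sorted_samples) - 1)
    fraction = rank - lower
    return sorted_samples[lower] + (sorted_samples[upper] - sorted_samples[lower]) * fraction


def summarize(raw_samples_ms, batch_size=1):
    """Recompute latency statistics from the raw per-iteration samples."""
    if not raw_samples_ms:
        raise ValueError("raw_samples_ms is empty")
    ordered = sorted(raw_samples_ms)
    mean = statistics.fmean(ordered)
    return {
        "mean": mean,
        "p50": percentile(ordered, 50),
        "p90": percentile(ordered, 90),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
        "min": ordered[0],
        "max": ordered[-1],
        "std": statistics.pstdev(ordered),  # assumption: population std
        "samples_per_sec": batch_size * 1000.0 / mean,
    }


# Made-up samples for illustration; real input would be report["raw_samples_ms"].
for name, value in summarize([5.0, 4.0, 8.0, 6.0, 7.0]).items():
    print(f"{name}: {value:.2f}")
```

Small differences from the numbers printed by the tool are to be expected if its percentile or standard deviation definitions differ from the ones assumed here.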
Latency numbers alone do not tell you whether the hardware is actually being used. A slow NPU inference could mean the model is running on the NPU and hitting a memory bottleneck, or it could mean the EP silently fell back to CPU and is not using the NPU at all.
The --monitor flag adds a live terminal chart (powered by plotext + Rich Live) that streams hardware utilisation for whichever device is being benchmarked. The chart updates once per iteration so you can see whether utilisation is sustained, bursty, or absent. This is particularly useful when commissioning a new model on QNN or DirectML hardware, where EP fallback can be hard to detect from latency numbers alone. If the chart stays near zero while the benchmark runs, it is a strong signal that the model may not be executing on the expected device \u2014 investigate further with EP-specific tools.
winml perf -m model.onnx --device npu --monitor\n Display updates are not included in the timed inference call, but monitoring may introduce small system overhead from background PDH polling.
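The same fallback check can be automated against the `hw_monitor` object that the JSON output reports (its `device_kind` field and the per-device `mean_pct` value). The following is a heuristic sketch, not a feature of winml-cli: `likely_cpu_fallback` is a hypothetical helper and the 5 % utilisation threshold is an arbitrary assumption that would need tuning per device.

```python
def likely_cpu_fallback(hw_monitor, requested_device, min_mean_pct=5.0):
    """Heuristically flag runs where the requested accelerator looks idle."""
    if requested_device == "cpu":
        return False  # nothing to fall back from
    if hw_monitor.get("device_kind") != requested_device:
        return True  # the monitor never attached to the requested device kind
    utilisation = hw_monitor.get(requested_device) or {}
    # Assumption: sustained utilisation below the threshold means "not really used".
    return utilisation.get("mean_pct", 0.0) < min_mean_pct


# Trimmed-down examples shaped like the hw_monitor objects in the JSON output.
cpu_only = {"device_kind": None, "cpu": {"mean_pct": 15.8}}
npu_busy = {"device_kind": "npu", "npu": {"mean_pct": 87.3, "peak_pct": 100.0}}

print(likely_cpu_fallback(cpu_only, "npu"))
print(likely_cpu_fallback(npu_busy, "npu"))
```

A positive result is only a prompt to investigate further with EP-specific tools; a short benchmark or a very light model can legitimately show low utilisation.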
"},{"location":"concepts/perf-and-monitoring/#memory-and-resource-metrics","title":"Memory and resource metrics","text":"When --monitor is active, hardware metrics are sampled throughout the benchmark and reported at the end. These metrics help answer questions like \"how much device memory does this model need?\" and \"is the model memory-bound?\".
The metrics collected depend on the target device:
| Metric | CPU | GPU | NPU |
|---|---|---|---|
| CPU utilisation (mean/peak %) | \u2713 | \u2713 | \u2713 |
| RAM (used MB, peak MB) | \u2713 | \u2713 | \u2713 |
| Device utilisation (mean/peak %) | \u2014 | \u2713 | \u2713 |
| Device memory local (peak MB) | \u2014 | \u2713 | \u2713 |
| Device memory shared (peak MB) | \u2014 | \u2713 | \u2713 |
| Engine running time (ns) | \u2014 | \u2713 | \u2713 |
device_memory and running_time_ns are still present but will be zero.
local_peak_mb) and shared system memory (shared_peak_mb) allocated by the GPU driver.
local_peak_mb represents dedicated adapter memory; shared_peak_mb is system memory shared with the NPU.
CPU device:
Hardware (during benchmark)\n CPU: 8.3% avg | Mem: 644 MB\n NPU or GPU device:
Hardware (during benchmark)\n NPU: 87.3% avg, 100.0% peak | CPU: 12.1% avg | Mem: 1842 MB\n Device Mem: 245/0 MB (local/shared)\n"},{"location":"concepts/perf-and-monitoring/#json-structure","title":"JSON structure","text":"In JSON output (-f json), these metrics appear under the hw_monitor key:
\"hw_monitor\": {\n \"monitor\": \"HWMonitor\",\n \"device_kind\": null,\n \"adapter_luid\": null,\n \"cpu\": { \"mean_pct\": 15.8, \"peak_pct\": 16.71, \"sample_count\": 2 },\n \"ram\": { \"used_mb\": 640.21, \"peak_mb\": 640.21 },\n \"device_memory\": { \"local_peak_mb\": 0.0, \"shared_peak_mb\": 0.0 },\n \"running_time_ns\": 0\n}\n When a hardware accelerator is active, device_kind will be \"npu\" or \"gpu\", and an additional key (e.g. \"npu\") appears with device utilisation:
\"hw_monitor\": {\n \"monitor\": \"HWMonitor\",\n \"device_kind\": \"npu\",\n \"adapter_luid\": \"0x0000abcd12340000\",\n \"cpu\": { \"mean_pct\": 12.1, \"peak_pct\": 34.5, \"sample_count\": 50 },\n \"ram\": { \"used_mb\": 1842.0, \"peak_mb\": 1910.0 },\n \"device_memory\": { \"local_peak_mb\": 245.0, \"shared_peak_mb\": 0.0 },\n \"npu\": { \"mean_pct\": 87.3, \"peak_pct\": 100.0, \"sample_count\": 50 },\n \"running_time_ns\": 4820000000\n}\n This makes it straightforward to track memory consumption across model revisions or compare devices programmatically.
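As a rough illustration of that kind of programmatic comparison, the sketch below flattens an hw_monitor block into a small summary. It relies only on the keys shown in the samples above; the summarize_hw_monitor helper and the inline sample dicts are hypothetical and not part of winml-cli.

```python
def summarize_hw_monitor(hw):
    '''Flatten an hw_monitor block (layout as in the samples above).'''
    kind = hw.get('device_kind')  # None when no accelerator was active
    summary = {
        'device': kind or 'cpu',
        'cpu_mean_pct': hw['cpu']['mean_pct'],
        'ram_peak_mb': hw['ram']['peak_mb'],
        'device_local_peak_mb': hw['device_memory']['local_peak_mb'],
        'device_shared_peak_mb': hw['device_memory']['shared_peak_mb'],
    }
    # The per-device utilisation key (for example 'npu') only appears
    # when a hardware accelerator was used for the run.
    if kind and kind in hw:
        summary['device_mean_pct'] = hw[kind]['mean_pct']
        summary['device_peak_pct'] = hw[kind]['peak_pct']
    return summary


# Values copied from the two sample outputs above.
cpu_run = {
    'device_kind': None,
    'cpu': {'mean_pct': 15.8, 'peak_pct': 16.71, 'sample_count': 2},
    'ram': {'used_mb': 640.21, 'peak_mb': 640.21},
    'device_memory': {'local_peak_mb': 0.0, 'shared_peak_mb': 0.0},
    'running_time_ns': 0,
}
npu_run = {
    'device_kind': 'npu',
    'cpu': {'mean_pct': 12.1, 'peak_pct': 34.5, 'sample_count': 50},
    'ram': {'used_mb': 1842.0, 'peak_mb': 1910.0},
    'device_memory': {'local_peak_mb': 245.0, 'shared_peak_mb': 0.0},
    'npu': {'mean_pct': 87.3, 'peak_pct': 100.0, 'sample_count': 50},
    'running_time_ns': 4820000000,
}

cpu_summary = summarize_hw_monitor(cpu_run)
npu_summary = summarize_hw_monitor(npu_run)
print(cpu_summary)
print(npu_summary)
```

In practice the dict would come from parsing the JSON report with json.load; the summary can then be diffed across model revisions or devices.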
"},{"location":"concepts/perf-and-monitoring/#per-module-benchmarking","title":"Per-module benchmarking","text":"Large Transformer-family models contain many repeated module instances \u2014 attention blocks, feed-forward layers, encoder stages. When you want to understand the cost of one type of block rather than the full network, --module <ClassName> isolates and benchmarks matching modules from the HuggingFace model hierarchy.
winml perf -m bert-base-uncased --module BertAttention\n This builds and benchmarks each BertAttention instance separately and reports per-instance statistics. The --module argument must be a class name (e.g. BertAttention), not a dotted module path (e.g. not encoder.layer.0.attention).
Internally, --module uses torchinfo to discover all submodule instances matching the given class name in the HuggingFace model. For each match it generates a separate build config, exports an isolated ONNX file, and benchmarks it independently. This requires a HuggingFace model ID (not a local .onnx file) because it needs access to the PyTorch module tree.
--module targets gets written
winml-cli exposes two ways to turn a Hugging Face model or ONNX file into a Windows ML-ready artifact. You can invoke each stage of the pipeline as an individual primitive command \u2014 winml export, winml analyze, winml optimize, winml quantize, winml compile, winml perf, winml eval \u2014 running one step at a time with full control over inputs and outputs. Alternatively, winml build wraps all of those stages into a single command driven by a WinMLBuildConfig JSON file.
Understanding when to reach for a primitive versus the pipeline wrapper is the central workflow decision in winml-cli. Both paths produce the same artifacts; the difference is in repeatability, convenience, and how much you need to inspect or vary individual stages.
"},{"location":"concepts/primitives-and-pipeline/#the-primitive-commands","title":"The primitive commands","text":"Each primitive command corresponds to one stage of the pipeline described in How winml-cli works. They run in order, each producing an ONNX file that the next stage consumes:
winml export \u2014 loads a Hugging Face model, traces it with PyTorch and the Optimum exporter, and writes a portable float32 ONNX file with no EP-specific nodes.
winml analyze \u2014 runs compatibility and runtime checks on the exported ONNX graph, detecting unsupported operators, QDQ issues, and device-specific constraints before further pipeline stages.
winml optimize \u2014 applies graph transformations (operator fusion, constant folding, graph pruning) and runs an autoconf loop to maximize EP-compatible coverage.
winml quantize \u2014 inserts QDQ nodes using calibration data, reducing weight and activation types to lower precision (for example, int8) for efficient inference.
winml compile \u2014 invokes an EP-specific compiler (for example, QNN for NPU targets) to embed a pre-compiled binary cache in the ONNX graph as an EPContext node.
winml perf \u2014 benchmarks latency and throughput against a Windows ML session; does not modify the model.
winml eval \u2014 evaluates task-specific accuracy on a dataset; does not modify the model.
You can enter the pipeline at any stage. If you already have an optimized ONNX file, pass it directly to winml quantize without re-exporting. Each command writes its output to a path you specify, so all intermediate artifacts are preserved for inspection.
winml build orchestrates all of the above stages in order from a single WinMLBuildConfig JSON file:
winml build -c config.json -m microsoft/resnet-50 -o output/\n The config file tells winml build which stages to run and how to configure them. Setting the quant or compile section to null in the JSON skips that stage; passing --no-quant, --no-compile, or --no-optimize on the command line achieves the same effect at runtime without editing the file.
When the model argument points to an existing ONNX file instead of a Hugging Face ID, winml build detects this and skips the export stage, running analyze \u2192 optimize \u2192 quantize \u2192 compile directly. This mirrors how each primitive command handles the same case.
winml build also accepts --use-cache in place of -o/--output-dir, routing artifacts to the winml-cli global cache at ~/.cache/winml/ instead of a local directory. Use --rebuild to force a clean re-run even when cached artifacts already exist.
Use primitive commands when:
Use winml build when:
winml build coordinates end-to-end.
quant: null in the config) rather than remembered flag-by-flag across invocations.
The two approaches are not exclusive. A common pattern is to prototype with primitives \u2014 iterating on winml optimize and winml quantize individually to tune fusion flags and calibration \u2014 and then encode the final settings into a WinMLBuildConfig for repeatable production builds via winml build.
WinMLBuildConfig
Every ONNX tensor carries data in a specific numeric type \u2014 float32, float16, int8, int16 \u2014 and every winml-cli pipeline makes deliberate choices about which type to use where. This page covers both halves of that decision: the datatype family winml-cli understands, and the quantization workflow that converts a model from one datatype to another to shrink it and run it faster on integer-native hardware.
Quantization is the headline use of datatypes in winml-cli. By replacing float32 weights and activations with int8 or mixed precisions, you typically get a 2\u20134\u00d7 smaller model artifact and a 2\u20138\u00d7 latency speedup on NPU hardware. The trade-off is a potential reduction in model accuracy, the degree of which depends on the precision chosen and the sensitivity of the model.
winml-cli exposes a precision shorthand on the --precision flag that encodes the weight/activation dtype pair as a single string. The table below lists every precision from _NAMED_PRECISIONS in config/precision.py, together with the resolved quantization types. Float precisions (fp32, fp16) carry no quantization types because weights and activations remain in floating point throughout.
| Precision | Weight type | Activation type | Notes |
|---|---|---|---|
| auto | device-dependent | device-dependent | Resolves to w8a16 (NPU), fp16 (GPU/CPU) at runtime |
| fp32 | float32 | float32 | No quantization; baseline accuracy |
| fp16 | float16 | float16 | Half-precision float; no QDQ nodes inserted |
| int8 | uint8 | uint8 | Static quantization; valid for QNN EP |
| int16 | int16 | uint16 | Higher-accuracy quantization; larger model than int8 |
| w8a8 | uint8 | uint8 | Equivalent to int8; explicit mixed-precision notation |
| w8a16 | uint8 | uint16 | Mixed: compact weights, wider activations for accuracy |
| w4a16 | n/a | n/a | Not supported. Rejected at validation \u2014 is_quantized_precision(\"w4a16\") returns False because 4-bit weight types are absent from _BITS_TO_WEIGHT_TYPE in precision.py. The string is not a recognized precision. |
The --weight-type and --activation-type flags on winml quantize accept uint8, int8, uint16, or int16 and override whatever the --precision shorthand would have resolved. This is useful when you need an unsigned weight type for QNN compatibility but a signed activation type for a specific operator constraint. See Weight and Activation for why the two need separate flags in the first place.
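The mapping in the table can be pictured as a small lookup with optional overrides. The sketch below is illustrative only: resolve_precision and its NAMED table are hypothetical names that mirror the table above, not the actual contents of config/precision.py.

```python
# Weight/activation pairs copied from the precision table above.
NAMED = {
    'int8': ('uint8', 'uint8'),
    'int16': ('int16', 'uint16'),
    'w8a8': ('uint8', 'uint8'),
    'w8a16': ('uint8', 'uint16'),
}
FLOAT = {'fp32', 'fp16'}
QUANT_TYPES = {'uint8', 'int8', 'uint16', 'int16'}


def resolve_precision(precision, device='npu',
                      weight_type=None, activation_type=None):
    '''Return (weight_type, activation_type), or None for float precisions.'''
    if precision == 'auto':
        # Per the table: w8a16 on NPU, fp16 on GPU/CPU.
        precision = 'w8a16' if device == 'npu' else 'fp16'
    if precision in FLOAT:
        return None  # no quantization types for fp32/fp16
    if precision not in NAMED:
        raise ValueError('unrecognized precision: ' + precision)
    weight, activation = NAMED[precision]
    # Explicit type flags take priority over the shorthand.
    weight = weight_type or weight
    activation = activation_type or activation
    for dtype in (weight, activation):
        if dtype not in QUANT_TYPES:
            raise ValueError('unsupported quantization type: ' + dtype)
    return weight, activation


print(resolve_precision('w8a16'))
print(resolve_precision('int8', activation_type='int8'))
print(resolve_precision('auto', device='gpu'))
```

Keeping the shorthand and the explicit overrides in one function makes the precedence visible: the shorthand supplies defaults, and each explicit type replaces only its own half of the pair.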
winml-cli applies quantization by inserting QDQ (Quantize/Dequantize) nodes into the ONNX graph. The resulting file is a standard ONNX model that any ONNX Runtime execution provider can consume and optimize for its target hardware \u2014 the EP reads the QDQ pattern and fuses adjacent operations into true integer kernels.
"},{"location":"concepts/quantization/#calibration","title":"Calibration","text":"Static quantization \u2014 the kind winml-cli applies \u2014 requires a calibration pass before inserting QDQ nodes. During calibration, a small set of representative inputs runs through the original floating-point model so that winml-cli can observe the actual range of values each tensor takes at runtime. Those observed ranges are then used to choose the scale and zero-point constants baked into the QDQ nodes.
The --samples flag controls how many calibration inputs are used (default: 10). More samples generally produce better range estimates but take longer. The --method flag selects the algorithm used to summarize the observed ranges:
minmax (default) \u2014 uses the absolute minimum and maximum observed values. Fast and predictable; can be sensitive to outliers.
entropy \u2014 minimizes the KL-divergence between the original and quantized distribution. Often yields better accuracy on models with heavy-tailed activation distributions.
percentile \u2014 clips a small fraction of extreme values before computing the range. A practical middle ground when outliers are present but entropy calibration is slow.
Example using entropy calibration with more samples:
winml quantize -m model.onnx --precision int8 --samples 128 --method entropy\n"},{"location":"concepts/quantization/#the-qdq-pattern","title":"The QDQ pattern","text":"The QDQ pattern is the standard ONNX representation for static quantization. winml-cli wraps the inputs and outputs of quantizable operators with pairs of QuantizeLinear and DequantizeLinear nodes. At the graph level the model still operates in floating-point; the QDQ nodes encode the scale and zero-point metadata that a runtime needs to fuse adjacent operations into true integer kernels.
When the model runs under ONNX Runtime, the execution provider \u2014 whether CPU, DirectML, or a dedicated NPU EP \u2014 reads those QDQ patterns and performs its own graph fusion. This means the EP is free to apply hardware-specific optimizations without winml-cli needing to know anything about the target device's internal ISA or operator library. The QDQ model produced by winml quantize is a single portable artifact that can be deployed to any EP that supports integer execution.
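To make scale and zero-point concrete, here is a minimal pure-Python sketch of min/max calibration followed by a quantize/dequantize round trip for uint8. It uses the common asymmetric formula; whether winml-cli computes its parameters exactly this way is an assumption, and all function names are illustrative.

```python
def minmax_range(samples):
    '''Observed range across all calibration samples, widened to include 0.'''
    lo = min(min(s) for s in samples)
    hi = max(max(s) for s in samples)
    # Including zero keeps 0.0 exactly representable after quantization.
    return min(lo, 0.0), max(hi, 0.0)


def qparams(rmin, rmax, qmin=0, qmax=255):
    '''Scale and zero-point for an asymmetric integer range (uint8 default).'''
    scale = (rmax - rmin) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero tensor
    zero_point = round(qmin - rmin / scale)
    return scale, max(qmin, min(qmax, zero_point))


def quantize(x, scale, zero_point, qmin=0, qmax=255):
    '''What a QuantizeLinear node computes for one value.'''
    return max(qmin, min(qmax, round(x / scale) + zero_point))


def dequantize(q, scale, zero_point):
    '''What a DequantizeLinear node computes for one value.'''
    return (q - zero_point) * scale


# Two made-up calibration samples for a single tensor.
samples = [[-1.0, 0.5, 2.0], [0.25, 3.0, -0.5]]
scale, zero_point = qparams(*minmax_range(samples))
for x in (0.0, 0.5, 3.0, 10.0):
    q = quantize(x, scale, zero_point)
    print(x, q, dequantize(q, scale, zero_point))
```

Values inside the calibrated range come back within half a quantization step, while a value outside it (10.0 here) saturates at the top of the integer range. That saturation is why unrepresentative calibration data costs accuracy.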
Not all precision choices carry equal accuracy risk:
fp16 is usually lossless in practice. Rounding errors relative to fp32 are small enough that most models show no measurable accuracy difference.
int8 and int16 are inherently lossy. Compressing a 32-bit float into 8 or 16 bits discards information, and the magnitude of accuracy degradation depends on how well the calibration data represents the deployment distribution.
Mixed precisions such as w8a16 reduce the risk compared to full int8 by preserving more precision in activations, but they are still lossy relative to fp32.
Always validate accuracy after quantizing a model to an integer precision. Run winml eval on a representative dataset and compare the metrics against the original floating-point baseline before shipping the quantized artifact.
Every neural network model stores two kinds of numeric tensors that matter for deployment: weights, the static parameters baked in at training time, and activations, the intermediate values that flow through the graph at every inference call. Understanding the distinction is the key to reading winml-cli's precision flags, deciding when quantization is safe, and knowing why a model that runs fine on one execution provider may stall or degrade on another.
"},{"location":"concepts/weight-and-activation/#weights-are-static","title":"Weights are static","text":"Weights are the trained parameters of the model: convolution kernels, linear projection matrices, attention weights, embedding tables, bias vectors. They are fixed at the moment the model is exported and stay constant for every inference call. Because they are static, their quantization parameters \u2014 the scale and zero-point used to compress them from fp32 to int8 \u2014 can be computed once, offline, using calibration data. winml quantize does exactly that: it observes the weight distributions in your exported ONNX and bakes the per-tensor scale/zero-point into the QDQ nodes that wrap the weights.
In ONNX terms, weights are stored as initializers inside the graph. The runtime treats them as graph inputs that are always pre-supplied; you do not pass weights to a session at inference time, the way you pass an image tensor or a text prompt.
"},{"location":"concepts/weight-and-activation/#activations-are-dynamic","title":"Activations are dynamic","text":"Activations are the intermediate results that flow through the graph during inference: the output of every matrix multiply, every layer norm, every attention softmax. Unlike weights, activations are regenerated on every forward pass and depend entirely on the input data. winml-cli cannot pre-compute their quantization parameters offline \u2014 instead, calibration runs a small set of representative inputs through the model and observes the actual ranges each activation tensor takes. Those observed ranges become the scale/zero-point baked into QDQ nodes around each activation.
This is why calibration data matters. If the calibration set fails to represent the inputs you will see in production, the per-activation ranges will be wrong and the quantized model will lose more accuracy than necessary on real traffic.
"},{"location":"concepts/weight-and-activation/#why-they-need-separate-flags","title":"Why they need separate flags","text":"The --weight-type and --activation-type flags on winml quantize exist because the optimal bit-width for weights is not necessarily the optimal bit-width for activations:
The compound precision shorthand w8a16 (8-bit weights, 16-bit activations) reflects this asymmetry directly: weights and activations get different bit-widths in one config string. For the full precision family and how each maps to weight/activation dtypes, see Datatype and Quantization.
winml-cli ships a Copilot Skill (use-winml-cli) that lets AI coding agents drive the entire model-building pipeline on your behalf. When a coding agent has this skill attached, it can inspect models, generate configs, run builds, and interpret results \u2014 without you having to remember exact flags or stage ordering.
The skill teaches the agent:
| Capability | What the agent learns |
|---|---|
| Pipeline shape | The stage order (inspect \u2192 export \u2192 analyze \u2192 optimize \u2192 quantize \u2192 compile \u2192 perf) and when to enter mid-pipeline |
| Flag discovery | Always run winml <command> --help before quoting a command \u2014 never fabricate flags |
| Output mapping | Which command's -o produces the artifact the user actually needs |
| Scope awareness | Which model architectures are supported (classic DL) vs. out-of-scope (LLMs, diffusion) |
| Hardware detection | Use winml sys --list-ep to confirm what's available before targeting an EP |
| Two paths | When to use primitives (debugging, exploring) vs. config + build (production, CI) |"},{"location":"getting-started/agent-skill/#how-to-use-it","title":"How to use it","text":""},{"location":"getting-started/agent-skill/#with-github-copilot-coding-agent","title":"With GitHub Copilot Coding Agent","text":"To make the Copilot Coding Agent (the cloud agent that creates PRs) follow the skill's guidance, reference it in .github/copilot-instructions.md. The Coding Agent reads that file automatically when working on this repository.
For agents that support custom instructions (e.g., Copilot Extensions, Claude, ChatGPT with file uploads, or custom MCP tool servers), attach the skill file as context:
skills/use-winml-cli/SKILL.md\n You can copy the file contents into your agent's system prompt, upload it as a reference document, or include it in a .github/copilot-instructions.md for VS Code Copilot Chat. The skill uses standard markdown with YAML front-matter \u2014 any agent that accepts text context can benefit from it.
winml-cli/\n\u2514\u2500\u2500 skills/\n \u2514\u2500\u2500 use-winml-cli/\n \u2514\u2500\u2500 SKILL.md \u2190 the skill definition\n"},{"location":"getting-started/agent-skill/#example-agent-interaction","title":"Example agent interaction","text":"User: Can I run ConvNeXt on my Snapdragon X Elite NPU?\n\nAgent (with skill):\n1. Runs `winml sys --list-ep` \u2192 confirms QNNExecutionProvider is registered\n2. Runs `winml inspect -m microsoft/convnext-tiny-224` \u2192 confirms supported\n3. Runs `winml config --onnx ... -d npu -o config.json`\n4. Runs `winml build -c config.json -m microsoft/convnext-tiny-224 -o output/`\n5. Runs `winml perf -m output/model.onnx -d npu --monitor`\n6. Reports latency + NPU utilization to user\n"},{"location":"getting-started/installation/","title":"Installation","text":""},{"location":"getting-started/installation/#prerequisites","title":"Prerequisites","text":"| Component | Details |
|---|---|
| Windows | Windows 11 24H2 or later (required for NPU support) |
| Hardware | Device with CPU, GPU, or NPU |
| Python | 3.11 |
| Package manager | uv |
| Version control | git |
No NPU?
You can follow most of these docs without NPU hardware. All winml-cli commands accept --device auto and fall back to CPU or DirectML automatically. The tutorials document explicit CPU fallback paths.
uv python install 3.11\nuv pip install winml-cli\n uv python install 3.11 downloads and pins the exact Python version the project requires. uv pip install winml-cli installs the latest release from PyPI into a managed environment. No separate venv activation is needed.
Install from source (for development)
If you want to contribute or run the latest unreleased code:
git clone https://github.com/microsoft/winml-cli.git\ncd winml-cli\nuv sync\n"},{"location":"getting-started/installation/#verify","title":"Verify","text":"winml sys\n Expected output (abbreviated):
+------------------------------------+\n| winml-cli System Information |\n+------------------------------------+\n\nEnvironment\n Python Version 3.11.x\n OS Windows 11\n Machine AMD64\n\nML Libraries\n Library Version Status\n torch 2.x.x OK\n onnx 1.x.x OK\n\nAvailable Devices (priority order)\n #1 NPU ...\n #2 GPU ...\n #3 CPU ...\n\nAvailable Execution Providers\n QNNExecutionProvider -> NPU\n DmlExecutionProvider -> GPU\n CPUExecutionProvider -> CPU\n This command enumerates available compute devices and execution providers on your machine. If an expected device or execution provider is missing, winml sys is the right place to diagnose it. See winml sys for the full flag reference and troubleshooting tips.
Run the following command to enumerate available devices and execution providers on your machine:
uv run winml sys --list-device --list-ep\n --list-device and --list-ep print only the hardware and EP inventory. If the command exits without error, your winml-cli install is ready. See winml sys for the full flag reference.
Before downloading any models, confirm that winml-cli recognises the model:
uv run winml inspect -m microsoft/resnet-50\n +--------------------------- microsoft/resnet-50 ---------------------------+\n| Task image-classification |\n| Model Class ResNetForImageClassification |\n| Exporter OptimumExporter |\n| WinML Class WinMLImageClassificationModel |\n| Status Supported |\n+---------------------------------------------------------------------------+\n Tip
Always inspect before build to catch unsupported architectures early.
"},{"location":"getting-started/quickstart/#build-the-model","title":"Build the model","text":"uv run winml build -m microsoft/resnet-50 -o resnet_out/ --no-quant\n winml build runs all pipeline steps in sequence \u2014 export, optimize, quantize. You can start a model build without a config file, or provide one to configure each step in the sequence (see winml config to customize). All intermediate artifacts land in resnet_out/, so you can reuse any stage independently.
After a successful build, you will find the following outputs in resnet_out/:
analyze_result.json \u2014 detailed model compatibility insights for each Windows ML EP, including supported, partially supported, and unsupported operators, detected optimization patterns, and recommended optimization workflows.
winml_build_config file \u2014 automatically generated after the build step to capture the full workflow end-to-end.
uv run winml perf -m resnet_out/model.onnx --device auto --iterations 50 --monitor\n --device auto lets the CLI resolve the best available device on your machine \u2014 NPU first, then GPU, then CPU.
winml build
winml inspect
winml perf
winml sys
If you prefer a graphical interface, you can use the Foundry Toolkit extension for VS Code to run Windows ML CLI model conversion without typing commands.
"},{"location":"getting-started/ui-quickstart/#quick-reference","title":"Quick reference","text":"Foundry Toolkit in the VS Code Extensions viewFor a full walkthrough, see Build with Windows ML CLI (Preview) in the VS Code documentation.
"},{"location":"reference/","title":"Reference \u2014 Config Schema","text":"This page documents the full schema for WinMLBuildConfig, the JSON configuration file that drives the winml-cli pipeline. Generate a config with winml config, then pass it to any command with -c config.json.
The config is accepted by all pipeline commands \u2014 not just winml build. For example, winml export -c config.json, winml quantize -c config.json, and winml compile -c config.json each read the relevant section of the same config file. This lets you use a single config as the source of truth across all stages.
{\n \"loader\": { ... },\n \"export\": { ... },\n \"optim\": { ... },\n \"quant\": { ... },\n \"compile\": { ... },\n \"eval\": { ... },\n \"auto\": true\n}\n Setting quant or compile to null skips that pipeline stage entirely. Setting auto to true (default) lets winml-cli auto-configure downstream stages based on the target device and precision.
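Because a skipped stage is just a JSON null, a config variant can be derived with a few lines of Python. The sketch below is a hypothetical helper, not a winml-cli API; the field names come from the schema on this page, and whether winml build accepts a config this sparse is an assumption.

```python
import json

# A trimmed config using field names documented on this page.
config = {
    'loader': {'task': 'image-classification'},
    'export': {'opset_version': 17, 'batch_size': 1},
    'optim': {'gelu_fusion': True},
    'quant': {'mode': 'qdq', 'weight_type': 'uint8', 'activation_type': 'uint8'},
    'compile': {'ep_config': {'provider': 'qnn', 'device': 'npu'}},
    'auto': True,
}

# Sections this page documents as accepting null to skip the stage.
NULLABLE = {'quant', 'compile', 'eval'}


def skip_stages(cfg, *stages):
    '''Return a copy of cfg with the given stages set to None (JSON null).'''
    out = dict(cfg)
    for stage in stages:
        if stage not in NULLABLE:
            raise ValueError('stage cannot be skipped via null: ' + stage)
        out[stage] = None
    return out


no_quant = skip_stages(config, 'quant')
text = json.dumps(no_quant, indent=2)
print(text)
```

Python None serializes to JSON null, so the printed text can be saved as a config file; the original dict is left untouched, which makes it easy to keep several variants side by side.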
loader \u2014 Model Loading","text":"| Field | Type | Default | Description |
|---|---|---|---|
| task | str \\| null | null | HuggingFace task (e.g., image-classification). Auto-detected if omitted. |
| model_class | str \\| null | null | Override model class (e.g., AutoModelForCTC). |
| model_type | str \\| null | null | HuggingFace model type (e.g., bert, resnet). |
| module_path | str \\| null | null | Dotted path to a submodule for targeted export. |
| user_script | str \\| null | null | Path to custom model class script. |
| trust_remote_code | bool | false | Trust remote code from HuggingFace. |"},{"location":"reference/#export-onnx-export","title":"export \u2014 ONNX Export","text":"| Field | Type | Default | Description |
|---|---|---|---|
| opset_version | int | 17 | ONNX opset version. |
| batch_size | int | 1 | Static batch size. Use 1 for QNN compatibility. |
| input_tensors | list[InputTensorSpec] \\| null | null | Input tensor specifications. Auto-inferred if omitted. |
| output_tensors | list[OutputTensorSpec] \\| null | null | Output tensor specifications. |
| dynamic_axes | dict \\| null | null | Dynamic axes mapping. \u26a0\ufe0f Breaks MatMulAddFusion on QNN. |
| export_params | bool | true | Include model parameters in ONNX. |
| do_constant_folding | bool | true | Fold constants during export. |
| verbose | bool | false | Verbose export logging. |
| dynamo | bool | false | Use PyTorch 2.x Dynamo exporter. |
| enable_hierarchy_tags | bool | true | Add module hierarchy tags to ONNX nodes. |
| clean_onnx | bool | false | Strip hierarchy tags after export. |
| hierarchy_tag_format | \"full\" \\| \"module_only\" | \"full\" | Tag detail level. |
InputTensorSpec:
| Field | Type | Description |
|---|---|---|
| name | str \\| null | Tensor name (e.g., pixel_values). |
| dtype | str \\| null | Data type (e.g., float32, int64). |
| shape | list[int] \\| null | Tensor shape (e.g., [1, 3, 224, 224]). |
| value_range | [float, float] \\| null | Min/max for dummy tensor generation. |"},{"location":"reference/#optim-graph-optimization","title":"optim \u2014 Graph Optimization","text":"A dictionary of boolean fusion flags. All default to false unless auto-configured.
| Field | Type | Description |
|---|---|---|
| gelu_fusion | bool | Fuse GeLU activation patterns. |
| layer_norm_fusion | bool | Fuse LayerNorm patterns. |
| matmul_add_fusion | bool | Fuse MatMul + Add (enables BiasGelu). |
Additional fusion flags can be added as key-value pairs.
"},{"location":"reference/#quant-quantization","title":"quant \u2014 Quantization","text":"Set to null to skip quantization.
| Field | Type | Default | Description |
|---|---|---|---|
| mode | \"qdq\" \\| \"static\" \\| \"dynamic\" | \"qdq\" | Quantization mode. |
| weight_type | \"uint8\" \\| \"int8\" \\| \"uint16\" \\| \"int16\" | \"uint8\" | Weight data type. |
| activation_type | \"uint8\" \\| \"int8\" \\| \"uint16\" \\| \"int16\" | \"uint8\" | Activation data type. |
| calibration_method | \"minmax\" \\| \"entropy\" \\| \"percentile\" | \"minmax\" | Scale computation method. |
| samples | int | 10 | Number of calibration samples. |
| per_channel | bool | false | Per-channel quantization. |
| symmetric | bool | false | Symmetric quantization. |
| task | str \\| null | null | Task for dataset-aware calibration. |
| model_name | str \\| null | null | Model ID for calibration dataset resolution. |
| dataset_name | str \\| null | null | Override calibration dataset. |
| distribution | str | \"uniform\" | Random distribution for dummy data. |
| seed | int \\| null | null | Random seed for reproducibility. |
| calibration_load_path | str \\| null | null | Load pre-computed calibration scales. |
| calibration_save_path | str \\| null | null | Save calibration scales. |
| op_types_to_quantize | list[str] \\| null | null | Operator types to quantize (all if null). |
| nodes_to_exclude | list[str] \\| null | null | Node names to skip. |"},{"location":"reference/#compile-ep-compilation","title":"compile \u2014 EP Compilation","text":"Set to null to skip compilation.
| Field | Type | Default | Description |
|---|---|---|---|
| ep_config.provider | str | \"qnn\" | EP alias: qnn, cpu, dml, openvino, tensorrt, vitisai, migraphx. |
| ep_config.device | str | \"auto\" | Target device: npu, gpu, cpu, auto. |
| ep_config.enable_ep_context | bool | true | Generate EPContext model. |
| ep_config.embed_context | bool | false | Embed binary in ONNX (true) or external .bin (false). |
| ep_config.compiler | str | \"ort\" | Compiler backend: ort or qairt. |
| ep_config.provider_options | dict | {} | EP-specific options. |
| ep_config.qnn_sdk_root | str \\| null | null | QNN SDK path for QAIRT compiler backend. |
| validate | bool | true | Validate compiled model. |
| verbose | bool | false | Verbose compilation logging. |"},{"location":"reference/#eval-evaluation","title":"eval \u2014 Evaluation","text":"Set to null (default) to skip evaluation.
| Field | Type | Default | Description |
|---|---|---|---|
| model_id | str \\| null | null | HuggingFace model ID for config resolution. |
| model_path | str \\| dict[str, str] \\| null | null | Path to .onnx file, or a {role: path} dict for composite models. |
| task | str \\| null | null | Task type. |
| device | str | \"auto\" | Inference device. |
| precision | str | \"auto\" | Precision (fp32, fp16, w8a16, etc.). |
| ep | str \\| null | null | EP override. |
| dataset.path | str \\| null | null | HuggingFace dataset path. |
| dataset.name | str \\| null | null | Dataset config name. |
| dataset.split | str | \"validation\" | Dataset split. |
| dataset.samples | int | 100 | Evaluation sample count. |
| dataset.shuffle | bool | true | Shuffle before sampling. |
| dataset.seed | int | 42 | Random seed. |
| output_path | str \\| null | null | Path for JSON results output. |"},{"location":"reference/#example-full-config","title":"Example: Full Config","text":"{\n \"loader\": {\n \"task\": \"image-classification\",\n \"model_type\": \"resnet\"\n },\n \"export\": {\n \"opset_version\": 17,\n \"batch_size\": 1\n },\n \"optim\": {\n \"gelu_fusion\": true,\n \"layer_norm_fusion\": true,\n \"matmul_add_fusion\": true\n },\n \"quant\": {\n \"mode\": \"qdq\",\n \"weight_type\": \"uint8\",\n \"activation_type\": \"uint8\",\n \"samples\": 10,\n \"calibration_method\": \"minmax\"\n },\n \"compile\": {\n \"ep_config\": {\n \"provider\": \"qnn\",\n \"device\": \"npu\",\n \"enable_ep_context\": true,\n \"embed_context\": false\n },\n \"validate\": true\n },\n \"auto\": true\n}\n"},{"location":"reference/#the-auto-field","title":"The auto field","text":"The top-level \"auto\" field (default: true) controls whether the build pipeline runs the autoconf loop \u2014 an iterative analyze \u2192 discover \u2192 re-optimize cycle that automatically detects which additional graph optimizations the model needs for the target EP.
true (default): After initial optimization, the analyzer inspects the graph for unsupported or sub-optimal nodes and proposes additional optimization flags. The pipeline re-optimizes using the discovered flags and repeats (up to --max-optim-iterations, default 3). The final optimization result depends on what the analyzer discovers at runtime, so outputs may vary if the model or EP support changes between runs.
false: The pipeline applies only the explicit optim flags from the config \u2014 no autoconf discovery, no re-optimization loop. Builds are fully deterministic given the same config and input model. Use this for reproducible CI builds or when you have already tuned the optimization flags manually.
When auto is true and the autoconf loop discovers additional flags, the final persisted config (written to the output directory) includes the merged result so you can inspect what was discovered.
When you run winml build, the tool writes all artifacts to the output directory. This page documents what each file is and which ones you need for deployment.
After a full pipeline run (export \u2192 optimize \u2192 quantize \u2192 compile):
output/\n\u251c\u2500\u2500 model.onnx \u2190 FINAL artifact (deploy this)\n\u251c\u2500\u2500 model.onnx.data \u2190 External weights (if model \u2265 100 MiB)\n\u251c\u2500\u2500 winml_build_config.json \u2190 Persisted build config\n\u251c\u2500\u2500 analyze_result.json \u2190 Static analysis (EP compatibility)\n\u251c\u2500\u2500 build_manifest.json \u2190 Build provenance (Python API only)\n\u251c\u2500\u2500 export_htp_metadata.json \u2190 HTP export metadata (hierarchy info)\n\u251c\u2500\u2500 export.onnx \u2190 Intermediate: raw ONNX export\n\u251c\u2500\u2500 export.onnx.data\n\u251c\u2500\u2500 optimized.onnx \u2190 Intermediate: after graph optimization\n\u251c\u2500\u2500 optimized.onnx.data\n\u251c\u2500\u2500 quantized.onnx \u2190 Intermediate: after QDQ insertion\n\u251c\u2500\u2500 quantized.onnx.data\n\u251c\u2500\u2500 compiled.onnx \u2190 Intermediate: after EP compilation\n\u2514\u2500\u2500 compiled.onnx.data\n"},{"location":"reference/output-layout/#file-categories","title":"File Categories","text":""},{"location":"reference/output-layout/#final-artifacts-keep-for-deployment","title":"Final Artifacts (Keep for Deployment)","text":"| File | Purpose |
|---|---|
| model.onnx | The deployment-ready model. Always present. |
| model.onnx.data | External weight data (only if model \u2265 100 MiB). Must stay alongside model.onnx. |
| winml_build_config.json | The complete pipeline config used for this build (includes auto-discovered optimization flags). This file is a reproducible pipeline specification \u2014 check it into version control or feed it directly to winml build -c in a CI/CD pipeline to guarantee identical model processing across machines and runs (set \"auto\": false for fully deterministic builds). |
| analyze_result.json | Static analysis output: EP compatibility, operator classification, detected patterns. |
| build_manifest.json | Build provenance with stage timings. Only generated via the Python API (build_hf_model/build_onnx_model). |
| export_htp_metadata.json | HTP export metadata: module hierarchy, tracing info, tagging coverage. |"},{"location":"reference/output-layout/#intermediate-files-can-delete-after-build","title":"Intermediate Files (Can Delete After Build)","text":"| File | Stage | Contents |
|---|---|---|
| export.onnx | Export | Raw PyTorch \u2192 ONNX conversion (float32) |
| optimized.onnx | Optimize | Graph with fused operators, shape inference applied |
| quantized.onnx | Quantize | QDQ nodes inserted, calibrated scales |
| compiled.onnx | Compile | EPContext binary embedded or sidecar |
Each intermediate has a corresponding .onnx.data file if the model exceeds 100 MiB.
winml export)","text":"output/\n\u251c\u2500\u2500 export.onnx\n\u2514\u2500\u2500 export.onnx.data (if \u2265 100 MiB)\n"},{"location":"reference/output-layout/#optimize-only-winml-optimize","title":"Optimize only (winml optimize)","text":"output/\n\u251c\u2500\u2500 optimized.onnx\n\u2514\u2500\u2500 optimized.onnx.data\n"},{"location":"reference/output-layout/#full-build-winml-build","title":"Full build (winml build)","text":"All stages write their intermediate, and model.onnx is a copy of the last successful stage output. If you skip quantization (--no-quant), the final model is a copy of optimized.onnx. If you skip compilation too, it's still a copy of optimized.onnx.
Models of 100 MiB or larger store weights in a separate .onnx.data file. Both files must be kept together because the .onnx file contains a reference to the data file by name.
< 100 MiB: model.onnx only (weights embedded)
\u2265 100 MiB: model.onnx + model.onnx.data
Warning
If you move model.onnx, always move model.onnx.data alongside it. The ONNX file references the data file by relative path.
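As a minimal sketch of the warning above, the following standard-library helper moves a model and its weights sidecar together. The function name `move_model` is hypothetical and not part of winml-cli; it only assumes the `<name>.onnx.data` naming shown in the layout above.

```python
import shutil
from pathlib import Path


def move_model(model_path: Path, dest_dir: Path) -> list[Path]:
    """Move an ONNX file and, if present, its .onnx.data sidecar into dest_dir.

    Hypothetical helper for illustration; not part of winml-cli.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    # The sidecar shares the model's file name plus a ".data" suffix.
    sidecar = model_path.with_name(model_path.name + ".data")
    moved = []
    for src in (model_path, sidecar):
        if src.exists():
            target = dest_dir / src.name
            shutil.move(str(src), str(target))
            moved.append(target)
    return moved
```

Because both files keep their names and land in the same directory, the relative reference stored in the .onnx file should keep resolving after the move.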
analyze_result.json contains the static analysis output from the build pipeline's analyze stage. It reports EP compatibility and operator classification:
{\n \"analysis_timestamp\": \"2026-06-04T19:45:17.496169\",\n \"metadata\": {\n \"model_path\": \"iter.onnx\",\n \"opset_version\": 17,\n \"producer_name\": \"pytorch\",\n \"producer_version\": \"2.12.0\",\n \"total_operators\": 122,\n \"operator_counts\": {\n \"Conv\": 53,\n \"Relu\": 49,\n \"MaxPool\": 1,\n \"Add\": 16,\n \"GlobalAveragePool\": 1,\n \"Flatten\": 1,\n \"Gemm\": 1\n },\n \"unique_operator_types\": 7,\n \"detected_pattern_count\": {}\n },\n \"results\": [\n {\n \"ihv_type\": \"Microsoft\",\n \"ep_type\": \"CPUExecutionProvider\",\n \"device_type\": \"cpu\",\n \"runtime_support\": false,\n \"has_errors\": false,\n \"has_warnings\": false,\n \"classification\": {\n \"supported\": [],\n \"partial\": [],\n \"unsupported\": [],\n \"unknown\": [\n \"OP/ai.onnx/Conv\",\n \"OP/ai.onnx/Relu\",\n \"OP/ai.onnx/MaxPool\",\n \"OP/ai.onnx/Add\",\n \"OP/ai.onnx/GlobalAveragePool\",\n \"OP/ai.onnx/Flatten\",\n \"OP/ai.onnx/Gemm\"\n ]\n },\n \"information\": []\n }\n ]\n}\n Key fields:
Field Description metadata.total_operators Total ONNX operator nodes in the model graph metadata.operator_counts Frequency of each operator type metadata.detected_pattern_count Fused subgraph patterns (GeLU, LayerNorm, etc.) results[].ihv_type Hardware vendor (\"Microsoft\", \"QC\", \"Intel\", etc.) results[].runtime_support true if the EP can run all operators results[].classification Operators grouped by support level: supported, partial, unsupported, unknown results[].has_errors true if unsupported ops exist (model won't run on that EP)"},{"location":"reference/output-layout/#build-manifest","title":"Build Manifest","text":"build_manifest.json records provenance for builds run through the Python API (build_hf_model/build_onnx_model):
{\n \"schema_version\": 1,\n \"model_id\": \"microsoft/resnet-50\",\n \"task\": \"image-classification\",\n \"cache_key\": \"a1b2c3d4e5f6\",\n \"config_hash\": \"f7e8d9c0b1a2\",\n \"timestamp\": \"2026-01-15T10:30:00.000000+00:00\",\n \"elapsed_seconds\": 45.1,\n \"final_artifact\": \"model.onnx\",\n \"analyze_iterations\": 2,\n \"analyze_unsupported_node_count\": 0,\n \"analyze_details\": { \"lint\": {}, \"autoconf\": {} },\n \"stages\": [\n {\n \"name\": \"export\",\n \"status\": \"completed\",\n \"filename\": \"export.onnx\",\n \"elapsed_seconds\": 12.5\n },\n {\n \"name\": \"optimize\",\n \"status\": \"completed\",\n \"filename\": \"optimized.onnx\",\n \"elapsed_seconds\": 8.2\n },\n {\n \"name\": \"quantize\",\n \"status\": \"completed\",\n \"filename\": \"quantized.onnx\",\n \"elapsed_seconds\": 15.3,\n \"nodes_quantized\": 150,\n \"nodes_skipped\": 12\n },\n {\n \"name\": \"compile\",\n \"status\": \"completed\",\n \"filename\": \"compiled.onnx\",\n \"elapsed_seconds\": 9.1\n }\n ]\n}\n"},{"location":"reference/output-layout/#rebuild-behavior","title":"Rebuild Behavior","text":"If model.onnx already exists and rebuild=False (default), the build is skipped entirely.
Use --rebuild (CLI) or force_rebuild=True (Python API) to force a fresh build.
.onnx and .onnx.data files are deleted before the pipeline runs.
winml-cli can be used as a Python library for programmatic model building and inference. This page documents the public API surface.
"},{"location":"reference/python-api/#quick-example","title":"Quick Example","text":"from winml.modelkit import WinMLAutoModel\n\n# Build and load in one call\nmodel = WinMLAutoModel.from_pretrained(\"microsoft/resnet-50\", device=\"npu\")\noutput = model(pixel_values=images)\n\n# From a local ONNX file\nmodel = WinMLAutoModel.from_onnx(\"model.onnx\", task=\"image-classification\")\n"},{"location":"reference/python-api/#winmlautomodel","title":"WinMLAutoModel","text":"Factory class for automatic model building and loading. Not instantiable directly \u2014 use the class methods.
"},{"location":"reference/python-api/#from_pretrained","title":"from_pretrained()","text":"Build and load a model from a HuggingFace ID or local path. Runs the full pipeline: config \u2192 export \u2192 optimize \u2192 quantize \u2192 compile \u2192 load.
WinMLAutoModel.from_pretrained(\n model_id_or_path: str | Path,\n *,\n task: str | None = None,\n config: WinMLBuildConfig | None = None,\n device: str = \"auto\",\n precision: str = \"auto\",\n cache_dir: str | Path | None = None,\n use_cache: bool = True,\n force_rebuild: bool = False,\n trust_remote_code: bool = False,\n shape_config: dict | None = None,\n no_compile: bool = False,\n) -> WinMLPreTrainedModel\n Parameter Type Default Description model_id_or_path str \\| Path required HuggingFace model ID or path to local model. task str \\| None None Task name. Auto-detected if omitted. config WinMLBuildConfig \\| None None Custom build config. Auto-generated if omitted. device str \"auto\" Target device: \"auto\", \"npu\", \"gpu\", \"cpu\". precision str \"auto\" Precision: \"auto\", \"fp32\", \"fp16\", \"w8a8\", etc. cache_dir str \\| Path \\| None None Cache directory for built artifacts. use_cache bool True Reuse cached build if available. force_rebuild bool False Force rebuild even if cache exists. trust_remote_code bool False Trust remote code from HuggingFace. no_compile bool False Skip the compilation stage. Returns: A task-specific WinMLPreTrainedModel subclass.
from_onnx()","text":"Build from a pre-exported ONNX file. Runs: optimize \u2192 quantize \u2192 compile \u2192 load.
WinMLAutoModel.from_onnx(\n onnx_path: str | Path | dict[str, str | Path],\n *,\n task: str | None = None,\n config: WinMLBuildConfig | None = None,\n device: str = \"auto\",\n precision: str = \"auto\",\n ep: str | None = None,\n cache_dir: str | Path | None = None,\n use_cache: bool = True,\n force_rebuild: bool = False,\n skip_build: bool = False,\n session_options: Any | None = None,\n hf_config: PretrainedConfig | None = None,\n **kwargs: Any,\n) -> WinMLPreTrainedModel | WinMLCompositeModel\n Parameter Type Default Description onnx_path str \\| Path \\| dict required ONNX file path, or dict of submodel paths for composite models. skip_build bool False Load ONNX directly without running optimize/quantize/compile. hf_config PretrainedConfig \\| None None Required for composite models (dict inputs)."},{"location":"reference/python-api/#supported_tasks","title":"supported_tasks()","text":"WinMLAutoModel.supported_tasks() -> list[str]\n Returns all task strings with dedicated inference classes (16 tasks).
"},{"location":"reference/python-api/#build-pipeline-functions","title":"Build Pipeline Functions","text":"Lower-level functions for fine-grained control over the pipeline.
"},{"location":"reference/python-api/#build_hf_model","title":"build_hf_model()","text":"from winml.modelkit.build import build_hf_model\n\nresult = build_hf_model(\n config: WinMLBuildConfig,\n output_dir: Path,\n *,\n model_id: str | None = None,\n pytorch_model: nn.Module | None = None,\n rebuild: bool = False,\n trust_remote_code: bool = False,\n random_init: bool = False,\n cache_key: str | None = None,\n ep: str | None = None,\n device: str | None = None,\n **kwargs: Any,\n) -> BuildResult\n Runs the full pipeline (export \u2192 optimize \u2192 analyze \u2192 quantize \u2192 compile) and writes all artifacts to output_dir.
build_onnx_model()","text":"from winml.modelkit.build import build_onnx_model\n\nresult = build_onnx_model(\n onnx_path: Path | str,\n *,\n config: WinMLBuildConfig,\n output_dir: Path | str,\n rebuild: bool = False,\n ep: str | None = None,\n device: str | None = None,\n **kwargs: Any,\n) -> BuildResult\n Builds from an existing ONNX file (skips export).
"},{"location":"reference/python-api/#buildresult","title":"BuildResult","text":"@dataclass\nclass BuildResult:\n output_dir: Path # Directory containing all artifacts\n final_onnx_path: Path # Path to final model.onnx\n config_path: Path # Path to winml_build_config.json\n stages_completed: list[str] # e.g., [\"export\", \"optimize\", \"quantize\"]\n stages_skipped: list[str]\n stage_timings: dict[str, float] # Per-stage seconds\n elapsed: float # Total build time (seconds)\n reused: bool # True if cache hit, no build ran\n manifest_path: Path | None # Path to build_manifest.json\n"},{"location":"reference/python-api/#config-generation","title":"Config Generation","text":""},{"location":"reference/python-api/#generate_build_config","title":"generate_build_config()","text":"from winml.modelkit.config import generate_build_config\n\nconfig = generate_build_config(\n model_id: str | None = None,\n *,\n task: str | None = None,\n model_class: str | None = None,\n model_type: str | None = None,\n module: str | None = None,\n override: WinMLBuildConfig | None = None,\n shape_config: dict | None = None,\n library_name: str = \"transformers\",\n device: str = \"auto\",\n precision: str = \"auto\",\n trust_remote_code: bool = False,\n ep: str | None = None,\n onnx_path: str | Path | None = None,\n) -> WinMLBuildConfig | list[WinMLBuildConfig]\n Auto-generates a complete build config by probing the model's config.json (does not download weights). Equivalent to what winml config produces. Returns a list when module is specified (one config per submodule).
All inference models inherit from WinMLPreTrainedModel and are HuggingFace pipeline-compatible.
WinMLPreTrainedModel (Base)","text":"class WinMLPreTrainedModel:\n def __call__(self, **kwargs) -> Any: ...\n def perf(self, warmup: int = 0) -> ContextManager: ...\n\n @property\n def device(self) -> str: ...\n @property\n def ep_name(self) -> str | None: ...\n @property\n def io_config(self) -> dict: ...\n @property\n def task(self) -> str | None: ...\n"},{"location":"reference/python-api/#task-specific-classes","title":"Task-Specific Classes","text":"Class Task WinMLModelForImageClassification image-classification WinMLModelForSequenceClassification text-classification WinMLModelForImageSegmentation image-segmentation WinMLModelForSemanticSegmentation semantic-segmentation WinMLModelForObjectDetection object-detection WinMLModelForFeatureExtraction feature-extraction WinMLModelForQuestionAnswering question-answering WinMLModelForZeroShotImageClassification zero-shot-image-classification WinMLModelForGenericTask fallback (raw outputs)"},{"location":"reference/python-api/#performance-tracking","title":"Performance Tracking","text":"model = WinMLAutoModel.from_pretrained(\"microsoft/resnet-50\", device=\"npu\")\n\nwith model.perf(warmup=5) as stats:\n for img in test_images:\n model(pixel_values=img)\n\nprint(f\"P99 latency: {stats.p99_ms:.2f} ms\")\n"},{"location":"reference/python-api/#see-also","title":"See also","text":"Windows ML CLI has validated a set of models for compatibility across all Execution Providers (EPs)\u2014see the full model compatibility report.
winml-cli supports a wide range of model architectures and tasks. This page lists what's validated and how to discover model support.
"},{"location":"reference/supported-models/#discovery-commands","title":"Discovery Commands","text":"# Browse the curated catalog (64 validated models)\nuv run winml catalog\n\n# Filter by task\nuv run winml catalog -t image-classification\n\n# Check if a specific model is supported\nuv run winml inspect -m microsoft/resnet-50\n\n# List all known tasks\nuv run winml inspect --list-tasks\n"},{"location":"reference/supported-models/#supported-tasks","title":"Supported Tasks","text":"winml-cli recognizes 35 task types across vision, NLP, audio, and multimodal domains. Of these, 16 have dedicated inference classes; the remainder are supported via the generic task fallback.
"},{"location":"reference/supported-models/#vision","title":"Vision","text":"Task Example Modelsimage-classification ResNet, ConvNeXt, ViT, Swin image-segmentation Segformer, Mask2Former semantic-segmentation Segformer object-detection DETR, YOLOS, Table-Transformer depth-estimation Depth Anything, ZoeDepth image-feature-extraction DINOv2, ViT zero-shot-image-classification CLIP, SigLIP"},{"location":"reference/supported-models/#nlp","title":"NLP","text":"Task Example Models text-classification BERT, RoBERTa, XLM-RoBERTa token-classification BERT, RoBERTa (NER) question-answering BERT, RoBERTa fill-mask BERT, RoBERTa feature-extraction BGE, BERT, all-MiniLM text-generation Qwen3 (composite) text2text-generation T5, BART, Marian"},{"location":"reference/supported-models/#audio","title":"Audio","text":"Task Example Models automatic-speech-recognition Whisper audio-classification Wav2Vec2"},{"location":"reference/supported-models/#multimodal","title":"Multimodal","text":"Task Example Models zero-shot-image-classification CLIP (text + vision) image-to-text VisionEncoderDecoder visual-question-answering BLIP"},{"location":"reference/supported-models/#validated-model-catalog","title":"Validated Model Catalog","text":"The following models have been validated end-to-end with EP compatibility testing. Use winml catalog to browse the full list interactively.
apple/mobilevit-small MobileViT dima806/fairface_age_image_detection ViT facebook/convnext-tiny-224 ConvNeXt google/vit-base-patch16-224 ViT microsoft/resnet-18 ResNet microsoft/resnet-50 ResNet microsoft/swin-large-patch4-window7-224 Swin rizvandwiki/gender-classification ViT"},{"location":"reference/supported-models/#image-feature-extraction","title":"Image Feature Extraction","text":"Model Architecture facebook/dino-vitb16 ViT facebook/dino-vits16 ViT facebook/dinov2-small DINOv2 google/vit-base-patch16-224-in21k ViT"},{"location":"reference/supported-models/#feature-extraction-text","title":"Feature Extraction (Text)","text":"Model Architecture BAAI/bge-base-en-v1.5 BERT BAAI/bge-m3 XLM-RoBERTa BAAI/bge-small-en-v1.5 BERT google-bert/bert-base-multilingual-cased BERT Intel/bert-base-uncased-mrpc BERT laion/CLIP-ViT-B-32-laion2B-s34B-b79K CLIP openai/clip-vit-base-patch16 CLIP openai/clip-vit-base-patch32 CLIP sentence-transformers/all-MiniLM-L6-v2 BERT sentence-transformers/all-mpnet-base-v2 MPNet sentence-transformers/multi-qa-mpnet-base-dot-v1 MPNet sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 BERT"},{"location":"reference/supported-models/#sentence-similarity","title":"Sentence Similarity","text":"Model Architecture BAAI/bge-base-en-v1.5 BERT BAAI/bge-large-en-v1.5 BERT BAAI/bge-m3 XLM-RoBERTa BAAI/bge-small-en-v1.5 BERT sentence-transformers/all-MiniLM-L6-v2 BERT sentence-transformers/all-mpnet-base-v2 MPNet sentence-transformers/multi-qa-mpnet-base-dot-v1 MPNet sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 BERT sentence-transformers/paraphrase-multilingual-mpnet-base-v2 XLM-RoBERTa"},{"location":"reference/supported-models/#fill-mask","title":"Fill-Mask","text":"Model Architecture distilbert/distilbert-base-uncased DistilBERT FacebookAI/roberta-base RoBERTa FacebookAI/roberta-large RoBERTa FacebookAI/xlm-roberta-base XLM-RoBERTa google-bert/bert-base-multilingual-cased BERT google-bert/bert-base-multilingual-uncased BERT 
google-bert/bert-base-uncased BERT"},{"location":"reference/supported-models/#text-classification","title":"Text Classification","text":"Model Architecture cardiffnlp/twitter-roberta-base-sentiment-latest RoBERTa distilbert/distilbert-base-uncased-finetuned-sst-2-english DistilBERT Intel/bert-base-uncased-mrpc BERT ProsusAI/finbert BERT"},{"location":"reference/supported-models/#token-classification","title":"Token Classification","text":"Model Architecture Babelscape/wikineural-multilingual-ner BERT dbmdz/bert-large-cased-finetuned-conll03-english BERT dslim/bert-base-NER BERT Isotonic/distilbert_finetuned_ai4privacy_v2 DistilBERT w11wo/indonesian-roberta-base-posp-tagger RoBERTa"},{"location":"reference/supported-models/#question-answering","title":"Question Answering","text":"Model Architecture deepset/bert-large-uncased-whole-word-masking-squad2 BERT deepset/roberta-base-squad2 RoBERTa deepset/tinyroberta-squad2 RoBERTa distilbert/distilbert-base-cased-distilled-squad DistilBERT distilbert/distilbert-base-uncased-distilled-squad DistilBERT google-bert/bert-large-uncased-whole-word-masking-finetuned-squad BERT"},{"location":"reference/supported-models/#zero-shot-classification","title":"Zero-Shot Classification","text":"Model Architecture joeddav/xlm-roberta-large-xnli XLM-RoBERTa"},{"location":"reference/supported-models/#zero-shot-image-classification","title":"Zero-Shot Image Classification","text":"Model Architecture openai/clip-vit-base-patch16 CLIP"},{"location":"reference/supported-models/#image-segmentation","title":"Image Segmentation","text":"Model Architecture mattmdjaga/segformer_b2_clothes Segformer nvidia/segformer-b1-finetuned-ade-512-512 Segformer nvidia/segformer-b2-finetuned-ade-512-512 Segformer nvidia/segformer-b5-finetuned-ade-640-640 Segformer"},{"location":"reference/supported-models/#image-to-text","title":"Image-to-Text","text":"Model Architecture microsoft/trocr-base-handwritten VisionEncoderDecoder microsoft/trocr-base-printed 
VisionEncoderDecoder microsoft/trocr-large-handwritten VisionEncoderDecoder"},{"location":"reference/supported-models/#execution-provider-compatibility","title":"Execution Provider Compatibility","text":"Each validated model is tested against available EPs:
EP Alias Devices Notes NvTensorRTRTXExecutionProvider nvtensorrtrtx, nv_tensorrt_rtx GPU NVIDIA TensorRT-RTX; NVIDIA GPU with TensorRT runtime CUDAExecutionProvider cuda GPU NVIDIA CUDA; any CUDA-capable GPU MIGraphXExecutionProvider migraphx GPU AMD ROCm MIGraphX QNNExecutionProvider qnn NPU, GPU Qualcomm Snapdragon; bundled in ORT OpenVINOExecutionProvider openvino NPU, GPU, CPU Intel hardware DmlExecutionProvider dml GPU DirectML; any DirectX 12 GPU CPUExecutionProvider cpu CPU Always available VitisAIExecutionProvider vitisai NPU AMD/Xilinx"},{"location":"reference/supported-models/#adding-unsupported-models","title":"Adding Unsupported Models","text":"If your model architecture isn't in the catalog, winml-cli may still support it through auto-detection:
# Try inspecting first\nuv run winml inspect -m your-org/your-model\n\n# If \"Status: Supported\", proceed normally\nuv run winml build -m your-org/your-model -d auto -o output/\n For truly custom architectures, use --trust-remote-code to allow execution of model code from the Hugging Face Hub.
BERT (bert-base-uncased) is a canonical text model that exercises every stage of the winml-cli pipeline: it has multiple input tensors, benefits from graph fusion (GeLU, LayerNorm, MatMul+Add), and produces quantizable activations that run well on NPU. That combination makes it a useful reference point for teams deploying transformer encoders on Windows.
This sample walks through the production-style workflow: generate a reusable WinMLBuildConfig JSON file with winml config, run the full export \u2192 optimize \u2192 quantize \u2192 compile pipeline in one shot with winml build, and measure the result with winml perf. If you want to understand each pipeline stage individually before running the all-in-one command, read the Hugging Face Model to NPU tutorial first.
winml on your PATH.
winml config -m bert-base-uncased -t text-classification -o bert_config.json\n This writes a WinMLBuildConfig JSON file to bert_config.json. The file captures every pipeline setting in a single artifact that you can version-control and share. A representative excerpt looks like this:
{\n \"loader\": {\n \"task\": \"text-classification\",\n \"model_class\": \"AutoModelForSequenceClassification\",\n \"model_type\": \"bert\"\n },\n \"export\": {\n \"opset_version\": 17,\n \"batch_size\": 1\n ... // truncated: input_tensors, output_tensors\n },\n \"optim\": {\n \"clamp_constant_values\": true\n },\n \"quant\": {\n \"mode\": \"qdq\",\n \"weight_type\": \"uint8\",\n \"activation_type\": \"uint16\",\n \"samples\": 10,\n \"calibration_method\": \"minmax\",\n \"task\": \"text-classification\",\n \"model_name\": \"bert-base-uncased\"\n ... // truncated: per_channel, symmetric, distribution, ...\n },\n \"compile\": null\n}\n Note
The five top-level keys \u2014 loader, export, optim, quant, and compile \u2014 map directly to the five pipeline stages. Setting quant or compile to null skips that stage entirely. See Config and build for a field-by-field description of every option.
winml build -c bert_config.json -m bert-base-uncased --output-dir bert_out/\n winml-cli reads the config, downloads the model weights once, and runs the pipeline in sequence. Terminal output shows each stage as it completes:
winml build\n Config: bert_config.json\n Model: bert-base-uncased\n Output: bert_out/\n\n export done (42.1s)\n optimize done (6.3s)\n quantize done (18.7s)\n compile done (21.4s)\n\n Build complete in 88.5s\n Final artifact: bert_out/model.onnx\n Note
After the optimize stage, winml-cli runs an analyzer loop that inspects the graph for nodes the target EP cannot dispatch natively and re-runs optimization with adjusted fusion flags. The loop repeats up to --max-optim-iterations times (default: 3). Pass --no-optimize to skip this stage entirely when starting from a pre-optimized ONNX file. See How winml-cli Works for a full description of the autoconf loop.
winml perf -m bert_out/model.onnx --iterations 50\n After a short warm-up, winml perf reports latency percentiles and throughput:
Device: npu\nTask: text-classification\nIterations: 50 (+ 10 warmup)\nBatch Size: 1\n\nLatency (ms)\n Avg P50 P90 P95 P99 Min Max Std\n 4.83 4.79 5.12 5.31 5.68 4.51 6.04 0.21\n\nThroughput: 206.99 samples/sec\n\nResults saved to: model_perf.json\n"},{"location":"samples/bert-config-build/#customizing-the-config","title":"Customizing the config","text":"The JSON file is plain text and can be edited before running winml build. Two common adjustments:
Change precision. To target fp16 instead of the default uint8 QDQ quantization, regenerate the config with an explicit precision flag:
winml config -m bert-base-uncased -t text-classification --precision fp16 -o bert_config.json\n Alternatively, edit bert_config.json directly: set quant.weight_type and quant.activation_type to \"int8\" or \"uint16\", or set quant to null to skip quantization entirely.
Disable a stage at build time. You can suppress a stage for a single run, without touching the config file, by passing the --no-quant flag:
winml build -c bert_config.json -m bert-base-uncased --output-dir bert_out/ --no-quant \n This is useful for measuring the fp32 baseline before committing to a quantized build. The quant section in bert_config.json is unchanged; the flag only affects this invocation. See Config and build for the full list of configurable fields.
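To make the skip persistent rather than per-invocation, the config file itself can be edited. The sketch below does this with the standard library, relying only on the statement above that setting quant or compile to null skips that stage. The helper name `disable_stage` is hypothetical and not part of winml-cli.

```python
import json
from pathlib import Path

# Only these two sections are described above as skippable via null.
SKIPPABLE = ("quant", "compile")


def disable_stage(config_path: Path, stage: str) -> dict:
    """Set a stage section to null in a build config file and save it.

    Hypothetical helper for illustration; not part of winml-cli.
    """
    if stage not in SKIPPABLE:
        raise ValueError(f"expected one of {SKIPPABLE}, got {stage!r}")
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config[stage] = None  # serialized as JSON null
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

Calling `disable_stage(Path("bert_config.json"), "quant")` before `winml build -c bert_config.json` would then have the same effect on every run, which may suit a CI pipeline better than a command-line flag.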
winml config generates a complete, version-controllable WinMLBuildConfig JSON from a HuggingFace model ID in one command.
winml build orchestrates the full export \u2192 optimize \u2192 quantize \u2192 compile pipeline from a single config file and model ID.
winml perf gives a latency and throughput baseline on the built artifact in seconds.
CLIP (openai/clip-vit-base-patch32) is a dual-encoder vision-language model: one tower encodes images, the other encodes text, and both project into a shared embedding space. winml-cli treats it as a composite model: a model that is split into multiple ONNX sub-models that run together at inference time. For CLIP, the two sub-models are:
image-encoder Encodes images into embeddings pixel_values [1, 3, 224, 224] image_embeds [1, 512] text-encoder Encodes text labels into embeddings input_ids [1, 77] text_embeds [1, 512] Zero-shot classification is achieved by embedding the image and the candidate text labels, then ranking the labels by the cosine similarity between their embeddings. Splitting the towers into two ONNX graphs lets each encoder have fully static shapes (required for efficient NPU compilation) and lets you build, cache, and benchmark them independently.
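The ranking step described above can be sketched in plain Python. The three-dimensional vectors in the usage note are made-up toy values (real CLIP embeddings here are 512-dimensional), and the function names are illustrative rather than part of winml-cli.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_labels(image_embed, text_embeds, labels):
    """Return (label, score) pairs sorted from most to least similar."""
    scores = [cosine_similarity(image_embed, t) for t in text_embeds]
    return sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
```

For example, `rank_labels([1.0, 0.0, 0.0], [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0]], ["cat", "dog"])` places "cat" first. The full model additionally scales the similarities before applying a softmax, which this sketch leaves out.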
"},{"location":"samples/clip-composite/#prerequisites","title":"Prerequisites","text":"winml on your PATH.The composite model architecture for CLIP:
graph LR\n A[winml config] -->|\"(clip, zero-shot-image-classification)\"| B[Composite Registry]\n B --> C[image-encoder config]\n B --> D[text-encoder config]\n C --> E[winml build \u2192 image-encoder.onnx]\n D --> F[winml build \u2192 text-encoder.onnx]\n E --> G[WinMLAutoModel]\n F --> G\n G -->|logits_per_image| H[Classification scores]"},{"location":"samples/clip-composite/#step-1-generate-build-configs","title":"Step 1: Generate build configs","text":"winml config -m openai/clip-vit-base-patch32 --task zero-shot-image-classification -o clip.json\n Because (clip, zero-shot-image-classification) is registered as a composite model, this command produces two config files \u2014 one per sub-model:
clip_image-encoder.json: export config using the image-feature-extraction task
clip_text-encoder.json: export config using the feature-extraction task
Each config includes CLIP-specific optimizations (GELU fusion, LayerNorm fusion, MatMul+Add fusion, and clamp constant values).
"},{"location":"samples/clip-composite/#step-2-build-each-sub-model","title":"Step 2: Build each sub-model","text":"Build both sub-models individually using their config files:
# Build the image encoder\nwinml build -c clip_image-encoder.json -m openai/clip-vit-base-patch32 -o output/image-encoder\n\n# Build the text encoder\nwinml build -c clip_text-encoder.json -m openai/clip-vit-base-patch32 -o output/text-encoder\n Each winml build runs the full pipeline: export \u2192 optimize \u2192 quantize \u2192 compile. The output directories contain the final ONNX files ready for inference.
To target a specific execution provider (e.g., QNN for NPU):
winml build -c clip_image-encoder.json -m openai/clip-vit-base-patch32 -o output/image-encoder --ep qnn\nwinml build -c clip_text-encoder.json -m openai/clip-vit-base-patch32 -o output/text-encoder --ep qnn\n"},{"location":"samples/clip-composite/#step-3-benchmark-each-sub-model","title":"Step 3: Benchmark each sub-model","text":"winml perf output/image-encoder -d npu\nwinml perf output/text-encoder -d npu\n This lets you identify whether the image or text encoder is the bottleneck on your target hardware.
"},{"location":"samples/clip-composite/#step-4-run-inference-python-api","title":"Step 4: Run inference (Python API)","text":"There are two ways to get a ready-to-run model. Both return the same WinMLModelForZeroShotImageClassification \u2014 a single object that orchestrates the two encoders and combines their projected embeddings into similarity scores \u2014 so the inference code afterward is identical.
Option 1 \u2014 Load the ONNX files built in Step 2 (skips re-export/optimization). Pass a dict mapping each component name to its built model.onnx, plus the HF config so the composite registry can resolve (clip, zero-shot-image-classification):
from transformers import AutoConfig\n\nfrom winml.modelkit.models import WinMLAutoModel\n\nmodel = WinMLAutoModel.from_onnx(\n {\n \"image-encoder\": \"output/image-encoder/model.onnx\",\n \"text-encoder\": \"output/text-encoder/model.onnx\",\n },\n task=\"zero-shot-image-classification\",\n hf_config=AutoConfig.from_pretrained(\"openai/clip-vit-base-patch32\"),\n skip_build=True,\n)\n Option 2 \u2014 Build both encoders from the HuggingFace model in one call. WinMLAutoModel.from_pretrained detects the composite task and runs the full pipeline for each sub-model:
from winml.modelkit.models import WinMLAutoModel\n\nmodel = WinMLAutoModel.from_pretrained(\n \"openai/clip-vit-base-patch32\",\n task=\"zero-shot-image-classification\",\n)\n Either way, run inference the same way \u2014 prepare an image plus candidate labels with the HF processor, then call the model:
from PIL import Image\nfrom transformers import CLIPProcessor\n\nprocessor = CLIPProcessor.from_pretrained(\"openai/clip-vit-base-patch32\")\nimage = Image.open(\"cat.jpg\")\nlabels = [\"a photo of a cat\", \"a photo of a dog\", \"a photo of a car\"]\ninputs = processor(text=labels, images=image, return_tensors=\"pt\", padding=True)\n\n# Run both encoders and combine into per-label similarity scores\noutputs = model(**inputs)\nprobs = outputs.logits_per_image.softmax(dim=-1)\nfor label, p in zip(labels, probs[0].tolist()):\n print(f\"{label}: {p:.4f}\")\n The text encoder's fixed sequence length (77) is handled for you \u2014 the processor's tokens are padded or truncated to match the ONNX graph before each run.
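The pad-or-truncate behavior mentioned above can be illustrated with a short sketch. This is not winml-cli's internal implementation, which the source does not show; the function name `fit_to_length` and the `pad_id=0` default are assumptions for illustration, and a real tokenizer defines its own padding token.

```python
def fit_to_length(token_ids, length=77, pad_id=0):
    """Truncate or right-pad token_ids to exactly `length` entries.

    Returns (ids, attention_mask); the mask is 1 for real tokens, 0 for padding.
    pad_id=0 is a placeholder assumption, not CLIP's actual padding token.
    """
    clipped = list(token_ids[:length])
    padding = length - len(clipped)
    ids = clipped + [pad_id] * padding
    mask = [1] * len(clipped) + [0] * padding
    return ids, mask
```

A static-shape ONNX graph accepts only inputs of the exact compiled length, which is why both the short and the long case have to be normalized before each run.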
"},{"location":"samples/clip-composite/#customizing-shape-config-per-sub-model","title":"Customizing shape config per sub-model","text":"Each encoder takes its own shape_config, passed through sub_model_kwargs. The image encoder accepts vision keys (height, width); the text encoder accepts text keys (sequence_length):
model = WinMLAutoModel.from_pretrained(\n \"openai/clip-vit-base-patch32\",\n task=\"zero-shot-image-classification\",\n sub_model_kwargs={\n \"image-encoder\": {\"shape_config\": {\"height\": 224, \"width\": 224}},\n \"text-encoder\": {\"shape_config\": {\"sequence_length\": 77}},\n },\n)\n"},{"location":"samples/clip-composite/#other-composite-models","title":"Other composite models","text":"The same composite model pattern is used for:
google/siglip-base-patch16-224: dual-encoder zero-shot image classification; shares the same composite wrapper as CLIP
google-t5/t5-small: encoder + decoder for translation/summarization
facebook/bart-large-cnn: encoder + decoder for summarization and table-question-answering (TAPEX)
Helsinki-NLP/opus-mt-en-de: encoder + decoder for translation
Qwen/Qwen3-0.6B: prefill + generation decoders for text generation
Salesforce/blip-image-captioning-base: vision encoder + text decoder for image-to-text captioning
microsoft/trocr-base-handwritten: vision encoder + text decoder for image-to-text (TrOCR, Donut)
Tutorials are linear, prescriptive, end-to-end walkthroughs that guide you through building something concrete with winml-cli. Each tutorial moves in one direction, start to finish, so you can follow along without making decisions. If you need to understand the reasoning behind a feature, see the Concepts section (the why and when). If you need a quick reference for a specific command, see Commands (the what). Tutorials sit alongside Samples, which are reference-style demos that compare multiple approaches side by side rather than walking through a single path.
More tutorials are coming, covering additional model families, execution providers, and deployment scenarios. Check back as the winml-cli documentation expands.
This tutorial walks you through the complete workflow for optimizing, analyzing, and deploying an ONNX model you already have \u2014 whether you exported it yourself (torch.onnx.export, ONNX Runtime tools), received it from a teammate, or downloaded it from the ONNX Model Zoo.
Unlike the Hugging Face Model to NPU tutorial, which starts from a HuggingFace model ID, this tutorial assumes you already have a .onnx file on disk and want to make it run faster on your target hardware.
The tutorial is split into two sections. Section A walks through the analyze \u2192 optimize \u2192 re-analyze loop using primitive commands, teaching you how the optimization feedback cycle works. Section B shows how winml build automates that same loop in a single command, optionally targeting NPU with quantization.
pip install uv or follow astral.sh/uv)
my_model.onnx as a placeholder; substitute your own file
No NPU? Set --device cpu wherever you see --device npu. Every other flag stays the same.
Working through the primitive commands one at a time reveals how the analyze\u2013optimize feedback cycle works. Each command accepts the output of the previous step as input, and every intermediate artifact is available for inspection.
"},{"location":"tutorials/build-from-onnx/#step-1-analyze-the-original-model","title":"Step 1: Analyze the original model","text":"Before any optimization, run the static analyzer to understand your model's EP compatibility and get optimization recommendations:
uv run winml analyze --model my_model.onnx --optim-config optim_config.json\n The analyzer classifies every operator in the graph as supported, partial, unsupported, or unknown for each available EP. It also detects fusible subgraph patterns and writes the recommended optimization flags to optim_config.json.
To target a specific EP:
uv run winml analyze --model my_model.onnx --ep qnn --device npu --optim-config optim_config.json\n The output shows per-EP compatibility results:
\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\n ANALYSIS SUMMARY\n\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\n QNNExecutionProvider (NPU): 122/0/0/0\n Ready to deploy\n If the analyzer detects fusible patterns (GeLU, LayerNorm, etc.), they will appear in the output and the optim_config.json will contain the recommended fusion settings. If no patterns are detected (as with simple architectures like ResNet), the config will be empty {}.
What we just did
The analyzer performs static analysis, so no runtime or hardware is required. It tells you two things: (1) whether the model can run on your target EP at all, and (2) whether the graph contains patterns that the optimizer can fuse to improve performance. The --optim-config flag writes a JSON file with the exact optimization settings the optimizer needs. The four counts in the summary line (122/0/0/0 in the sample output) are S/P/U/Unk: Supported/Partial/Unsupported/Unknown.
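If you want to act on the summary from a script, the counts can be pulled out of the console line. The following is a minimal sketch, assuming the line keeps the shape shown in the sample output (`QNNExecutionProvider (NPU): 122/0/0/0`); the names `EpCounts`, `parse_summary_line`, and `is_fully_supported` are hypothetical helpers, not part of winml-cli.

```python
import re
from typing import NamedTuple


class EpCounts(NamedTuple):
    """Operator counts for one EP (hypothetical container, not a winml-cli type)."""
    ep: str
    device: str
    supported: int
    partial: int
    unsupported: int
    unknown: int


# Assumed line shape, copied from the sample output:
#   "QNNExecutionProvider (NPU): 122/0/0/0"
_SUMMARY = re.compile(r"^\s*(\w+)\s+\((\w+)\):\s+(\d+)/(\d+)/(\d+)/(\d+)\s*$")


def parse_summary_line(line: str) -> EpCounts:
    """Split one summary line into EP name, device, and S/P/U/Unk counts."""
    match = _SUMMARY.match(line)
    if match is None:
        raise ValueError(f"not a summary line: {line!r}")
    ep, device, *counts = match.groups()
    return EpCounts(ep, device, *(int(c) for c in counts))


def is_fully_supported(counts: EpCounts) -> bool:
    """True when nothing is partial, unsupported, or unknown."""
    return counts.partial == 0 and counts.unsupported == 0 and counts.unknown == 0


counts = parse_summary_line(" QNNExecutionProvider (NPU): 122/0/0/0")
print(counts.supported, is_fully_supported(counts))
```

Parsing console text is fragile; if the analyzer's JSON output is available, reading that is likely the more robust choice.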
Pass the analyzer's output config directly to the optimizer:
uv run winml optimize -m my_model.onnx -c optim_config.json -o my_model_optimized.onnx\n The optimizer applies the fusions specified in the config and reports how many nodes it reduced:
Input: my_model.onnx\nOutput: my_model_optimized.onnx\n\nSuccess! Model optimized: my_model_optimized.onnx\nNodes: 122 -> 122 (0.0% reduction)\n Tip
The node reduction depends on your model's architecture. Simple models like ResNet (only Conv, Relu, Add) have no fusible patterns. Transformer-based models (BERT, ViT) typically see 10\u201330% node reduction from GeLU, LayerNorm, and Attention fusions.
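The percentage in the optimizer's report is plain arithmetic on the two node counts. As a minimal sketch (the helper name `node_reduction` and the 200-to-150 figures are illustrative assumptions, not winml-cli output):

```python
def node_reduction(before: int, after: int) -> str:
    """Format a node-count comparison in the same style as the optimizer report."""
    if before <= 0:
        raise ValueError("before must be a positive node count")
    percent = (before - after) / before * 100.0
    return f"Nodes: {before} -> {after} ({percent:.1f}% reduction)"


# The ResNet-style case from the sample output: nothing to fuse.
print(node_reduction(122, 122))
# A made-up transformer-style case inside the 10-30% range mentioned above.
print(node_reduction(200, 150))
```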
What we just did
Graph optimization fuses multi-node patterns (like the 5-node GeLU/Erf sequence) into single high-level operators that EPs can execute more efficiently. The optimizer is purely a graph transformation \u2014 it doesn't change the model's numerical behavior or require calibration data. Running it before quantization is important: calibration should be performed on the already-fused topology, not the verbose original graph.
"},{"location":"tutorials/build-from-onnx/#step-3-re-analyze-the-optimized-model","title":"Step 3: Re-analyze the optimized model","text":"Run the analyzer again on the optimized output to confirm that the fusions resolved and no new issues appeared:
uv run winml analyze --model my_model_optimized.onnx --ep qnn --device npu\n If the original analysis found fusible patterns that were optimized away, this run should show zero detected patterns and the same or better EP compatibility score.
What we just did
The analyze \u2192 optimize \u2192 re-analyze cycle is the fundamental feedback loop in winml-cli. In Section B you'll see that winml build automates this loop \u2014 it calls the analyzer, applies recommendations, re-analyzes, and repeats until convergence (typically 1\u20133 iterations). Doing it manually here teaches you what the automation is actually doing under the hood.
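The feedback loop described here can be sketched as a small driver function. This is an illustration under stated assumptions, not the implementation inside winml build: `analyze` and `optimize` are injected callables standing in for the real commands, and the cap of 3 iterations mirrors the default mentioned in the troubleshooting table.

```python
from typing import Callable, List, Tuple


def optimize_until_converged(
    model: str,
    analyze: Callable[[str], List[str]],
    optimize: Callable[[str, List[str]], str],
    max_iterations: int = 3,
) -> Tuple[str, int, List[str]]:
    """Repeat analyze -> optimize until no fusible patterns remain.

    Returns (final_model, optimize_passes_run, patterns_still_detected).
    """
    iterations = 0
    patterns = analyze(model)
    while patterns and iterations < max_iterations:
        model = optimize(model, patterns)
        iterations += 1
        patterns = analyze(model)  # re-analyze the optimized output
    return model, iterations, patterns


# Stand-in data: which patterns a fake analyzer "detects" for each model version.
remaining = {"m0": ["GeLU", "LayerNorm"], "m1": ["LayerNorm"], "m2": []}


def fake_analyze(model: str) -> List[str]:
    return remaining[model]


def fake_optimize(model: str, patterns: List[str]) -> str:
    return f"m{int(model[1:]) + 1}"  # pretend each pass yields the next version


final_model, passes, leftover = optimize_until_converged("m0", fake_analyze, fake_optimize)
print(final_model, passes, leftover)
```

A non-empty third element means the cap was reached before the patterns were resolved, which corresponds to the "doesn't converge" case.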
Insert QDQ (Quantize-Dequantize) nodes into the optimized graph using static calibration:
uv run winml quantize -m my_model_optimized.onnx -o my_model_int8.onnx --precision int8 --samples 32\n The quantizer generates 32 random calibration samples, runs them through the model to collect activation statistics, and uses those statistics to set the quantization scale and zero-point for each tensor.
What we just did
--precision int8 sets both weights and activations to 8-bit integers, which is the precision most NPU compilers expect. The output model still contains standard QuantizeLinear and DequantizeLinear ONNX nodes, so it is portable and can run on any ONNX Runtime backend. See Concepts \u2192 Quantization and QDQ for calibration methods and per-channel options.
Compilation converts the portable quantized ONNX into an EP-specific binary format that the execution provider can load directly, skipping JIT compilation at inference time:
Qualcomm NPU: uv run winml compile -m my_model_int8.onnx --device npu --ep qnn\n Intel NPU: uv run winml compile -m my_model_int8.onnx --device npu --ep openvino\n AMD NPU: uv run winml compile -m my_model_int8.onnx --device npu --ep vitisai\n CPU: uv run winml compile -m my_model_int8.onnx --device cpu\n What we just did
Compilation embeds EP context \u2014 the compiled binary \u2014 inside or alongside the ONNX file using the EPContext node convention. At inference time the runtime loads the pre-compiled binary directly rather than re-compiling from the ONNX graph. See Concepts \u2192 Compile and EPContext for details.
Measure the performance of your model:
Optimized (CPU): uv run winml perf -m my_model_optimized.onnx --device cpu --warmup 5 --iterations 50\n Compiled (NPU): uv run winml perf -m my_model_int8_npu_ctx.onnx --device npu --iterations 50 --monitor\n What we just did
winml perf generates random inputs matching the model's I/O spec, runs warmup iterations (excluded from statistics), then the benchmark iterations, and reports full latency percentiles alongside throughput. The --monitor flag activates live hardware utilization polling. See Concepts \u2192 Perf and monitoring for details.
winml build","text":"Once you understand the analyze \u2192 optimize \u2192 re-analyze loop (which you now do), you can let winml build handle everything in one command. When you pass a .onnx file, winml-cli auto-detects it and skips the export stage \u2014 running the optimization loop, quantization, and compilation automatically.
uv run winml build -m my_model.onnx -o output/ --device npu --precision int8\n Config file is optional
The -c config.json flag is optional. Without it, winml build auto-generates an internal config from the flags you pass (like --device and --precision). If you need a reusable config, generate one with winml config:
uv run winml config --onnx my_model.onnx -d npu --precision int8 -o config.json\nuv run winml build -m my_model.onnx -c config.json -o output/\n The pipeline runs: analyze \u2192 optimize \u2192 (re-analyze \u2192 re-optimize if needed) \u2192 quantize \u2192 compile \u2192 model.onnx. The output directory looks like:
output/\n\u251c\u2500\u2500 model.onnx \u2190 FINAL: deploy this\n\u251c\u2500\u2500 my_model.onnx \u2190 Copy of your input\n\u251c\u2500\u2500 my_model_optimized.onnx \u2190 After optimization loop converged\n\u251c\u2500\u2500 my_model_quantized.onnx \u2190 After INT8 quantization\n\u251c\u2500\u2500 my_model_compiled.onnx \u2190 After EP compilation\n\u251c\u2500\u2500 winml_build_config.json \u2190 Config used (including auto-detected options)\n\u2514\u2500\u2500 analyze_result.json \u2190 Analysis from optimize stage\n You can selectively skip stages using the override flags:
--no-optimize: skip graph optimization (rarely needed; useful if you have a pre-optimized ONNX)
--no-quant: skip quantization (produces a floating-point compiled model)
--no-compile: skip compilation (produces a quantized but not device-locked ONNX)
For example, to produce an optimized model without quantization or compilation:
uv run winml build -m my_model.onnx -o output/ --device cpu --no-quant --no-compile\n What we just did
winml build is the production workflow. It guarantees that stages run in the correct order, passes intermediate artifacts through the pipeline automatically, and records which stages completed or were skipped in the result summary.
Once the build completes, benchmark the final artifact:
uv run winml perf -m output/model.onnx --device npu --iterations 50 --monitor\n"},{"location":"tutorials/build-from-onnx/#using-the-python-api","title":"Using the Python API","text":"from winml.modelkit import WinMLAutoModel\n\n# Load from a pre-built ONNX (skips the build pipeline)\nmodel = WinMLAutoModel.from_onnx(\n \"output/model.onnx\",\n task=\"image-classification\", # set your task\n skip_build=True,\n)\n\noutput = model(pixel_values=your_input_tensor)\n Or trigger the full build programmatically:
from winml.modelkit.build import build_onnx_model\nfrom winml.modelkit.config import generate_build_config\n\nconfig = generate_build_config(onnx_path=\"my_model.onnx\", device=\"npu\", precision=\"int8\")\nresult = build_onnx_model(\"my_model.onnx\", config=config, output_dir=\"output/\")\nprint(f\"Final model: {result.final_onnx_path}\")\n"},{"location":"tutorials/build-from-onnx/#troubleshooting","title":"Troubleshooting","text":"
| Problem | Solution |
|---|---|
| \"ONNX file not found\" | Use an absolute path or ensure the file is in the current directory |
| Analyzer reports unsupported ops | Check if an optimization fusion resolves them; if not, the model needs modification for that EP |
| Optimization loop doesn't converge | The default max is 3 iterations; if patterns persist, they may not be fusible; use --no-quant --no-compile and inspect |
| Quantization accuracy regression | Try --precision int16, --per-channel, or increase --samples for better calibration |
| EP compilation fails | Check the selected EP, model compatibility, and target device availability |
| Model too large for memory | Use --no-compile and compile on the target device |
"},{"location":"tutorials/build-from-onnx/#where-to-go-next","title":"Where to go next","text":"analyze_result.json schemaPick the right ConvNeXt page
Two pages use ConvNeXt as their vehicle:
winml build one-shot. Start here if you want to ship to NPU.
This tutorial walks you through the complete journey from a pretrained Hugging Face model, facebook/convnext-tiny-224, to a quantized, compiled artifact running on an NPU. By the end you will have benchmarked the model on your device and measured real inference latency. Nothing is skipped, and every command produces a file you can inspect or reuse.
The primary hardware target is a Copilot+PC with a Snapdragon X-class NPU (40+ TOPS). If you do not have an NPU, every step works on CPU or DirectML as a fallback \u2014 the only thing that changes is the --device and --ep flags on the compile and perf commands. Those variations are shown explicitly in the tabbed blocks below.
The tutorial is split into two sections. Section A runs through the primitive commands one at a time, one per pipeline stage, so you understand what each stage does, what artifact it produces, and why it matters. Section B shows you that winml build runs the same pipeline in a single command. Most production workflows live in Section B; Section A is how you learn to trust it.
pip install uv or follow astral.sh/uv)
No NPU? Set --device cpu wherever you see --device npu and drop --monitor from perf commands. Every other flag stays the same.
Working through the primitive commands one at a time is the best way to understand what the winml build wrapper does under the hood. Each step accepts the output of the previous step as its input, so the chain is explicit and every intermediate artifact is available for inspection.
Before downloading any weights, confirm that winml-cli knows how to handle facebook/convnext-tiny-224.
uv run winml inspect -m facebook/convnext-tiny-224\n You should see output similar to the following:
Model facebook/convnext-tiny-224\nTask image-classification\nModel class ConvNextForImageClassification\nExporter optimum/onnx\nInput pixel_values: float32 [1, 3, 224, 224]\nOutput logits: float32 [1, 1000]\nSupport status supported\n What we just did
winml inspect queries the Hugging Face model card and winml-cli's internal registry without downloading weights. It confirms three things: the auto-detected task (image-classification), the model class that will be used for loading, and the exporter that will handle the ONNX conversion. If this command fails, stop here \u2014 something about the model is unsupported and proceeding would waste time. A successful inspect is the green light for every stage that follows.
Generate a WinMLBuildConfig JSON file for the model. This file is optional for both the primitive workflow and winml build, because you can drive each stage entirely through CLI flags, but generating it now gives you a versioned record of every auto-detected setting that you can reuse in Section B.
uv run winml config -m facebook/convnext-tiny-224 --device npu --precision int8 -o convnext_config.json\n Open convnext_config.json to see what was auto-detected: the task, I/O tensor shapes, quantization parameters, and the compile target. The --device npu --precision int8 flags tell the config generator to pre-populate the quantization and compile sections for NPU deployment rather than leaving them at defaults.
What we just did
winml config auto-resolves every setting that would otherwise require you to look up flags manually. The resulting JSON is the single source of truth for a reproducible build. You can commit it to version control, share it with teammates, edit a single field to try a different precision, and replay the exact same build on any machine. See Concepts \u2192 Config and build for a deeper look at the config schema and how the stages interact.
Download the pretrained weights and convert the PyTorch model to ONNX format.
uv run winml export -m facebook/convnext-tiny-224 -o convnext.onnx\n This runs a multi-stage export pipeline that includes model preparation, input generation, hierarchy building, ONNX conversion, node tagging, tag injection, and metadata generation. The result is a standards-compliant ONNX file with winml-cli's Hierarchy-preserving Tags Protocol (HTP) metadata embedded in node metadata_props. That metadata is what lets downstream tools make architecture-aware optimization decisions without hardcoded model knowledge.
What we just did
The default export embeds hierarchy tags \u2014 a tree of source module names mapped onto ONNX nodes \u2014 so that the optimizer and analyzer can reason about the graph in terms of the original model structure rather than flat node lists. If you need a clean ONNX without that metadata (for compatibility with other tools), add --no-hierarchy. See Concepts \u2192 Load and export for what hierarchy preservation adds and when it matters.
Before spending time on optimization and quantization, check that the model's operators are supported by your target execution provider.
uv run winml analyze -m convnext.onnx --ep qnn --device npu\n The analyzer performs static analysis \u2014 no runtime required \u2014 and classifies every operator in the graph as supported, partial, or unsupported for the target EP. It reports a coverage summary, flags any operators that may fall back to CPU, and exits with code 0 for full support or 1 for partial support.
For CPU fallback, run:
uv run winml analyze -m convnext.onnx --ep cpu --device cpu\n What we just did
Knowing your operator coverage before you quantize or compile saves you from discovering EP incompatibilities at the very last step of a long pipeline. ConvNeXt's operators (Conv, GELU, LayerNorm, Add) have broad support across QNN and OpenVINO, so this command should exit 0. If it exits 1, the output tells you which operators are problematic and includes recommendations for resolving them \u2014 typically by enabling a graph rewrite in the optimizer that fuses the unsupported pattern into a supported one. See Concepts \u2192 Analyze and optimize for details on the analyzer's recommendation engine.
"},{"location":"tutorials/npu-convnext/#step-5-optimize-the-graph","title":"Step 5: Optimize the graph","text":"Apply graph-level optimizations: operator fusion, constant folding, shape inference, and EP-specific graph rewrites.
uv run winml optimize -m convnext.onnx -o convnext_optim.onnx\n The optimizer reports how many nodes it reduced. A typical ConvNeXt-tiny optimization fuses several element-wise sequences and removes redundant reshape operations, cutting the node count noticeably without changing model semantics. If you want to apply a specific preset suited to the Snapdragon NPU, add --preset qnn-compatible to disable fusions that QNN does not benefit from.
What we just did
Graph optimization is a separate stage from quantization so that you can inspect the intermediate graph, compare node counts, and selectively enable or disable individual fusion passes using the --enable-* / --disable-* flags. Run uv run winml optimize --list-capabilities to see every registered optimization flag and its default state. Optimization always happens on the floating-point graph; quantization is applied after so that calibration statistics are computed on the already-fused topology.
Insert QDQ (Quantize-Dequantize) nodes into the optimized graph using static calibration. This reduces model size and speeds up inference on hardware with integer execution units, which includes Snapdragon NPUs and Intel NPUs.
uv run winml quantize -m convnext_optim.onnx -o convnext_int8.onnx --precision int8 --samples 32\n The quantizer generates 32 random calibration samples, runs them through the model to collect activation statistics, and uses those statistics (with the default minmax method) to set the quantization scale and zero-point for each tensor. Thirty-two samples is sufficient for a vision model with fixed-size inputs like ConvNeXt. For models with variable-length inputs or complex activation distributions, increase --samples to 64 or 128.
What we just did
--precision int8 sets both weights and activations to 8-bit integers, which is the precision most NPU compilers expect. The output model still contains standard QuantizeLinear and DequantizeLinear ONNX nodes, so it is portable and can run on any ONNX Runtime backend \u2014 you do not need special tooling to inspect it. See Concepts \u2192 Quantization and QDQ for a detailed explanation of the QDQ node pattern, calibration methods, and how to choose between per-tensor and per-channel quantization.
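To make "scale and zero-point from minmax statistics" concrete, here is a minimal sketch of the textbook asymmetric 8-bit affine scheme. It is an assumption that this matches what the quantizer does internally; the function names and the unsigned 0..255 range are illustrative choices, not winml-cli API.

```python
from typing import Tuple


def minmax_qparams(rmin: float, rmax: float, qmin: int = 0, qmax: int = 255) -> Tuple[float, int]:
    """Derive (scale, zero_point) from the observed min/max of a tensor."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    rmin = min(rmin, 0.0)
    rmax = max(rmax, 0.0)
    if rmax == rmin:
        return 1.0, qmin  # degenerate all-zero tensor
    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = int(round(qmin - rmin / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(x: float, scale: float, zero_point: int, qmin: int = 0, qmax: int = 255) -> int:
    """What a QuantizeLinear node computes for one value (with saturation)."""
    return max(qmin, min(qmax, int(round(x / scale)) + zero_point))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """What a DequantizeLinear node computes for one value."""
    return (q - zero_point) * scale


# Hypothetical calibration statistics for one activation tensor.
scale, zero_point = minmax_qparams(-1.0, 3.0)
q = quantize(0.5, scale, zero_point)
print(scale, zero_point, q, dequantize(q, scale, zero_point))
```

The round trip is lossy by at most about half a quantization step for values inside the calibrated range, which is why a wider observed range (larger scale) costs precision.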
Compilation converts the portable quantized ONNX into an EP-specific binary format that the execution provider can load directly, skipping JIT compilation at inference time. This is the step that produces a device-locked artifact tied to the selected EP.
The examples below use the default compiler backend (--compiler ort), which uses ONNX Runtime's built-in EP context compiler:
uv run winml compile -m convnext_int8.onnx --device npu --ep qnn\n uv run winml compile -m convnext_int8.onnx --device npu --ep openvino\n uv run winml compile -m convnext_int8.onnx --device npu --ep vitisai\n uv run winml compile -m convnext_int8.onnx --device cpu\n The compiled output file appears in the same directory as the input model. The file name follows the pattern convnext_int8_npu_ctx.onnx (using the resolved device string npu, not the EP name) and an accompanying .bin context binary is written alongside it (unless --embed is passed, which embeds the binary inside the ONNX file). CPU builds do not produce a new artifact \u2014 the compile step validates EP compatibility but writes no output file; use convnext_int8.onnx directly for CPU inference.
What we just did
Compilation embeds EP context \u2014 the compiled binary \u2014 inside or alongside the ONNX file using the EPContext node convention. At inference time the runtime loads the pre-compiled binary directly rather than re-compiling from the ONNX graph, eliminating the 15\u201360 second JIT penalty on first load. The default --compiler ort backend bundles compilation within ONNX Runtime itself. See Concepts \u2192 Compile and EPContext for the full picture of what gets embedded and how the context is consumed at runtime.
Measure inference latency and throughput with the --monitor flag to see live NPU utilization alongside the timing numbers.
uv run winml perf -m convnext_int8_npu_ctx.onnx --device npu --iterations 50 --monitor\n uv run winml perf -m convnext_int8_npu_ctx.onnx --device npu --ep openvino --iterations 50 --monitor\n uv run winml perf -m convnext_int8.onnx --device cpu --iterations 50\n A representative run on a Snapdragon X Elite NPU produces output like the following:
Device: npu\nTask: image-classification\nIterations: 50 (+ 10 warmup)\nBatch Size: 1\n\nLatency (ms)\n Avg P50 P90 P95 P99 Min Max Std\n 2.14 2.11 2.31 2.38 2.59 1.98 2.71 0.14\n\nThroughput: 467.29 samples/sec\n\nHardware (during benchmark)\n NPU: 72.4% avg, 89.1% peak | CPU: 3.2% avg\n Sys Mem: 1842 MB | Device Mem: 48/12 MB (local/shared)\n The CPU fallback (same model, --device cpu) will typically show latencies 8\u201315x higher and near-zero NPU utilization. The contrast between those two runs is the best proof that your NPU path is actually being used.
What we just did
winml perf generates random inputs matching the model's I/O spec, runs the configured number of warmup iterations (excluded from statistics), then the benchmark iterations, and reports full latency percentiles alongside throughput. The --monitor flag activates live hardware utilization polling at 200 ms intervals, displaying an in-terminal chart and attaching the hardware metrics to the JSON report saved alongside the console output. See Concepts \u2192 Perf and monitoring for how to interpret the utilization numbers and what hw_monitor fields look like in the JSON report.
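The relationship between the latency columns and the throughput line can be reproduced with a few lines of arithmetic. This is a sketch, assuming throughput is batch size divided by mean latency and using nearest-rank percentiles; the percentile method winml perf actually uses is not stated here, and `summarize` is a hypothetical helper.

```python
import math
import statistics
from typing import Dict, List


def summarize(latencies_ms: List[float], warmup: int, batch_size: int = 1) -> Dict[str, float]:
    """Drop warmup runs, then report latency statistics and throughput."""
    measured = sorted(latencies_ms[warmup:])  # warmup iterations are excluded
    if not measured:
        raise ValueError("no benchmark iterations left after warmup")

    def percentile(p: float) -> float:
        # Nearest-rank percentile (an assumption, see lead-in).
        rank = max(1, math.ceil(p / 100.0 * len(measured)))
        return measured[rank - 1]

    avg = statistics.fmean(measured)
    return {
        "avg": avg,
        "p50": percentile(50),
        "p90": percentile(90),
        "p99": percentile(99),
        "min": measured[0],
        "max": measured[-1],
        "std": statistics.pstdev(measured),
        "throughput": batch_size * 1000.0 / avg,  # samples per second
    }


# Two slow warmup runs followed by four steady runs (made-up numbers).
stats = summarize([100.0, 90.0, 2.0, 2.0, 2.0, 2.0], warmup=2)
print(stats["avg"], stats["throughput"])

# Cross-check against the sample report: 2.14 ms average at batch size 1.
print(round(1000.0 / 2.14, 2))
```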
After quantization it is good practice to verify that INT8 accuracy is close to the FP32 baseline. The winml eval command runs the model against a held-out dataset slice and reports task-relevant metrics.
uv run winml eval -m convnext_int8.onnx --model-id facebook/convnext-tiny-224 --dataset imagenet-1k --split validation --samples 100 --device npu\n The --model-id flag is required when passing an ONNX file, because the evaluator needs it to locate the preprocessor and label mappings. The command downloads 100 shuffled validation samples, runs inference, and reports top-1 and top-5 accuracy. A well-quantized ConvNeXt-tiny should lose less than 0.5 percentage points of top-1 accuracy compared to the floating-point checkpoint.
What we just did
Accuracy evaluation gives you a principled stopping criterion for quantization decisions. If the accuracy drop is larger than acceptable, return to Step 6 and try --precision int16 or per-channel quantization (--per-channel) instead of the default per-tensor int8. See Concepts \u2192 Eval and datasets for the full list of supported datasets, tasks, and column mapping options.
winml build","text":"Once you understand what each primitive stage does (which you now do), you can collapse the entire pipeline into a single command. winml build orchestrates export, optimize, quantize, and compile in sequence.
uv run winml build -m facebook/convnext-tiny-224 -o convnext_out/ --device npu --precision int8\n Config file is optional
The -c config.json flag is optional. Without it, winml build auto-generates an internal config from the flags you pass (like --device and --precision). If you need a reusable config, generate one with winml config.
The command downloads the pretrained weights, runs all four pipeline stages, and writes every intermediate and final artifact into convnext_out/. The stage timing is printed as each stage completes, and the final line tells you the path of the compiled model.
You can selectively skip stages using the override flags:
--no-optimize: skip graph optimization (rarely needed; useful if you have a pre-optimized ONNX)
--no-quant: skip quantization (produces a floating-point compiled model)
--no-compile: skip compilation (produces a quantized but not device-locked ONNX)
For example, to produce an optimized and quantized model without the compile step:
uv run winml build -m facebook/convnext-tiny-224 -o convnext_out/ --device npu --precision int8 --no-compile\n What we just did
winml build is the production workflow. It guarantees that stages run in the correct order, passes intermediate artifacts through the pipeline automatically, and records which stages completed or were skipped in the result summary.
Once the build completes, benchmark the final artifact from convnext_out/:
uv run winml perf -m convnext_out/model.onnx --device npu --iterations 50 --monitor\n The result should be close to what you saw in Step 8, which indicates that the winml build pipeline produces an artifact equivalent to the manual primitive chain.
Windows ML CLI is a command line tool for building portable, performant, and high-quality AI models for Windows ML. It takes you from a source model \u2014 whether from Hugging Face or your own pipeline \u2014 to a hardware-optimized artifact in a reproducible workflow.
Purpose-built for Windows hardware diversity, the CLI handles conversion, graph optimization, and compilation across AMD, Intel, NVIDIA, and Qualcomm targets. The CLI fits naturally into CI/CD pipelines so teams can validate and ship models easily.
"},{"location":"#what-you-can-do","title":"What you can do","text":"export, analyze, optimize, quantize, compile), or use an auto-generated config with winml build \u2014 both produce portable models that run across hardware.winml CLI running locally.winml subcommands.To request access to the Windows ML CLI repository, visit aka.ms/winml-cli.
"},{"location":"#license","title":"License","text":"MIT. See LICENSE.
"},{"location":"Privacy/","title":"WinML CLI Privacy Statement","text":"WinML CLI collects limited, unlinked pseudonymized telemetry to help improve the product. This page describes exactly what is collected, what is not, and how to control it.
"},{"location":"Privacy/#data-category","title":"Data category","text":"All WinML CLI telemetry is classified as Optional under Microsoft's data categorization model. None of it is required to run any feature; it exists solely to support product improvement.
A first-run interactive prompt asks for consent before any event is sent. The prompt defaults to accept \u2014 pressing Enter enables telemetry. You can decline explicitly at the prompt, or change your answer later by editing %USERPROFILE%\\.winml\\config.json. Telemetry is automatically disabled in non-interactive contexts (non-TTY stdin, CI/CD pipelines) regardless of stored consent; those contexts do not see the prompt and default to off.
When telemetry is enabled, WinML CLI emits three event types:
"},{"location":"Privacy/#winmlcliheartbeat","title":"WinMLCLIHeartbeat","text":"Sent once per CLI invocation, just before the requested command runs. Carries only context attributes (OS, architecture, app version, device ID) \u2014 no per-event payload.
"},{"location":"Privacy/#winmlcliaction","title":"WinMLCLIAction","text":"Sent once per command completion.
| Attribute | Description |
|---|---|
| invoked_from | Script or Interactive, based on whether stdin is a TTY. |
| action_name | Click subcommand name (e.g., build, analyze). |
| device | Target device type, if the subcommand accepts --device (e.g., NPU, GPU). |
| ep | Execution provider, if the subcommand accepts --ep (e.g., QNNExecutionProvider). |
| duration_ms | Wall-clock execution time in milliseconds. |
| success | Whether the command completed without raising. |
"},{"location":"Privacy/#winmlclierror","title":"WinMLCLIError","text":"Sent only when a command raises an unhandled exception.
| Attribute | Description |
|---|---|
| exception_type | Exception class name (e.g., ValueError). |
| exception_message | The exception message, with absolute paths trimmed to package-relative, truncated to 200 characters, and with emails, GUIDs, IPv4/IPv6 addresses, and long opaque tokens replaced by <scrubbed>. |
| exception_stack | A list of frames, each {file, line, function}. File paths are package-relative. No source line text, no local variable values. |
"},{"location":"Privacy/#common-context-attributes","title":"Common context attributes","text":"Every event carries these attributes (populated by the telemetry module, not by the command code):
| Attribute | Description |
|---|---|
| device_id | SHA256 hash of a randomly generated UUID, persisted per machine. Enables counting distinct users without identifying them. |
| id_status | EXISTING, NEW, or FAILED: how the device ID was obtained on this run. |
| os.name, os.version, os.release, os.arch | Operating system and architecture (e.g., Windows, 10.0.26200, 11, AMD64). |
| app_version | WinML CLI package version. |
| app_instance_id | A random UUID generated for this process only; not persisted. |
| initTs | Epoch timestamp when telemetry was initialized. |
"},{"location":"Privacy/#data-never-collected","title":"Data never collected","text":"--model path/to/file.onnx)On the first run of any command, WinML CLI prompts:
Enable telemetry? [Y/n]\n The default is Y (telemetry enabled) \u2014 pressing Enter accepts. Your answer is persisted to %USERPROFILE%\\.winml\\config.json under telemetry.consent and the prompt is not shown again.
Edit %USERPROFILE%\\.winml\\config.json directly:
{\n \"telemetry\": {\n \"consent\": \"disabled\"\n }\n}\n
| Goal | Edit |
|---|---|
| Opt out | Set telemetry.consent to \"disabled\". |
| Opt in | Set telemetry.consent to \"enabled\". |
| Re-show the prompt on next run | Delete the file, or remove the telemetry.consent field. |
There are no CLI subcommands, per-invocation flags, or environment variables for consent; the config file is the single source of truth.
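Because the config file is plain JSON, the edit can also be scripted. The following is a minimal sketch, assuming the file holds a single JSON object with the `telemetry.consent` key described above; `set_telemetry_consent` is a hypothetical helper, and the demo writes to a temporary directory rather than the real `%USERPROFILE%\.winml\config.json`.

```python
import json
import tempfile
from pathlib import Path


def set_telemetry_consent(config_path: Path, consent: str) -> dict:
    """Set telemetry.consent while preserving any other keys in the file."""
    if consent not in ("enabled", "disabled"):
        raise ValueError("consent must be 'enabled' or 'disabled'")
    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    config.setdefault("telemetry", {})["consent"] = consent
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


# Demo against a throwaway location. On a real machine the path would be
# Path.home() / ".winml" / "config.json" (Path.home() follows USERPROFILE on Windows).
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / ".winml" / "config.json"
    set_telemetry_consent(path, "disabled")
    saved = json.loads(path.read_text(encoding="utf-8"))

print(saved)
```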
"},{"location":"Privacy/#ci-cd","title":"CI / CD","text":"Telemetry is automatically disabled when any of these environment variables are set, and no prompt is shown:
CI, TF_BUILD, GITHUB_ACTIONS, JENKINS_URL, CODEBUILD_BUILD_ID, BUILDKITE, SYSTEM_TEAMFOUNDATIONCOLLECTIONURI.
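The gating rules on this page (CI variables, non-TTY stdin, stored consent) combine into a simple decision. Here is a sketch of that logic under two assumptions: a variable counts as "set" whenever it is present, even with an empty value, and missing consent means telemetry stays off. The function name `telemetry_allowed` is hypothetical and not taken from the winml-cli source.

```python
import os
import sys
from typing import Mapping, Optional

# Variable names listed in the CI / CD section above.
CI_ENV_VARS = (
    "CI",
    "TF_BUILD",
    "GITHUB_ACTIONS",
    "JENKINS_URL",
    "CODEBUILD_BUILD_ID",
    "BUILDKITE",
    "SYSTEM_TEAMFOUNDATIONCOLLECTIONURI",
)


def telemetry_allowed(consent: Optional[str], env: Mapping[str, str], stdin_is_tty: bool) -> bool:
    """Return True only when no automatic opt-out applies and consent is 'enabled'."""
    if any(name in env for name in CI_ENV_VARS):
        return False  # CI/CD context: disabled regardless of stored consent
    if not stdin_is_tty:
        return False  # non-interactive context: disabled, no prompt shown
    return consent == "enabled"


# Evaluate against the current process, pretending consent was already granted.
print(telemetry_allowed("enabled", os.environ, sys.stdin.isatty()))
```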
Events that fail to send (e.g., transient network errors) are cached locally and retried on the next run. The cache file lives at:
%USERPROFILE%\\.winml\\telemetry\\winmlcli.cache
The cache is append-only on failure and drain-then-resend on recovery. When telemetry is disabled, the cache is cleared so a disabled session never resends events the user has since opted out of.
"},{"location":"Privacy/#dev-installs","title":"Dev installs","text":"WinML CLI installed from source (pip install -e .) or run directly from a checkout never sends telemetry. The InstrumentationKey is blank in source and is only populated by the official build pipeline. Only official binary releases are capable of sending telemetry, and only after the user has seen the first-run prompt.
For the full contributing guide \u2014 development setup, coding conventions, testing, PR checklist, and CLA \u2014 see CONTRIBUTING.md in the repository root.
# Clone and set up\ngit clone https://github.com/microsoft/winml-cli.git\ncd winml-cli\nuv sync --extra dev\nuv run pre-commit install\n\n# Download runtime check rules (required for `winml analyze`)\ngh release download <tag> --repo microsoft/winml-cli --pattern 'rules-v*.zip' --dir .\nExpand-Archive -Path .\\rules-v*.zip -DestinationPath src\\winml\\modelkit\\analyze\\rules\\runtime_check_rules -Force\n\n# Run tests\nuv run pytest tests/ -m \"not e2e and not npu and not gpu\"\n\n# Lint and format\nuv run ruff check src/ tests/ --fix\nuv run ruff format src/ tests/\n\n# Docs preview\nuv run mkdocs serve\n"},{"location":"contributing/#see-also","title":"See also","text":"Common issues and solutions when working with winml-cli.
"},{"location":"troubleshooting/#compile","title":"Compile","text":""},{"location":"troubleshooting/#cannot-enable-compilation-no-compile-section","title":"Cannot enable compilation: no compile section","text":"UsageError: Cannot enable compilation: no compile section found in the config file\n Cause: Compilation is off by default in winml build. You passed --compile to explicitly enable it, but the config JSON has no \"compile\" section (it's null). This happens when the config was generated without a device target that supports EPContext (e.g., --device cpu or --device auto on a machine without NPU).
Solution: Regenerate the config targeting a device that supports compilation (NPU or GPU with an EP that produces EPContext):
uv run winml config -m <model> -d npu --compile -o output/\n Note
By default winml build skips the compile stage unless --compile is passed or the config contains a non-null \"compile\" section. To include compilation in the generated config, specify a device that maps to an EPContext-capable EP (e.g., -d npu).
ClickException: model_ctx.onnx is already a compiled EPContext model and cannot be re-compiled\n Cause: You're trying to compile a model that is already an EPContext artifact (the _ctx.onnx output).
Solution: Run compilation on the original (pre-compiled) ONNX file instead:
uv run winml compile -m model.onnx -d npu -o output/\n"},{"location":"troubleshooting/#provider-does-not-support-epcontext-compilation","title":"Provider does not support EPContext compilation","text":"ClickException: Provider 'DmlExecutionProvider' does not support EPContext compilation\n Cause: Not all EPs produce EPContext format. DML and CPU do not support pre-compilation.
Solution: EPContext is supported by QNN, OpenVINO, TensorRT, and Vitis AI. For DML/CPU, skip the compile step \u2014 the runtime compiles on first load automatically:
uv run winml build -c config.json -m model -o output/ --no-compile\n"},{"location":"troubleshooting/#analyze","title":"Analyze","text":""},{"location":"troubleshooting/#unsupported-nodes-persist-after-analysis","title":"Unsupported nodes persist after analysis","text":"RuntimeError: Unsupported nodes persist after analysis\n Cause: The model contains operators that the selected EP cannot dispatch natively.
Solution: Run winml analyze with --optim-config to identify problematic operators and get recommended graph optimizations:
# Analyze and output optimization recommendations\nuv run winml analyze -m model.onnx --ep qnn --optim-config optim_config.json\n This produces optim_config.json with the auto-discovered optimization flags. Apply them with winml optimize, then re-analyze:
# Apply recommended optimizations\nuv run winml optimize -m model.onnx -o model_optimized.onnx -c optim_config.json\n\n# Re-analyze to check if unsupported nodes are resolved\nuv run winml analyze -m model_optimized.onnx --ep qnn\n If unsupported nodes still remain after optimization, consider:
onnx-graphsurgeon to replace or remove operators the EP cannot handle; a different EP (--ep dml or --ep cpu) that supports the operators in question; a different opset version (--opset-version 18). When winml analyze reports a large number of nodes as \"unknown\", the model likely hasn't been normalized \u2014 it contains raw constant-folding subgraphs, missing shape annotations, or redundant initializer nodes that the analyzer cannot classify.
Solution: Run winml optimize with no optimization flags to normalize the model (constant folding, shape inference, dead-node elimination), then re-analyze:
# Normalize only (no fusion flags)\nuv run winml optimize -m model.onnx -o model_normalized.onnx\n\n# Re-analyze \u2014 constant nodes are now folded, shapes are inferred\nuv run winml analyze -m model_normalized.onnx --ep qnn\n This baseline pass collapses constant subgraphs into initializers and propagates tensor shapes throughout the graph, giving the analyzer enough information to classify nodes correctly.
"},{"location":"troubleshooting/#build-cache","title":"Build / Cache","text":""},{"location":"troubleshooting/#disk-full-out-of-space","title":"Disk full / out of space","text":"Build artifacts (exported ONNX, optimized graphs, quantized models, compiled EPContext files) are cached under:
C:\\Users\\<user>\\.cache\\winml\n This directory can grow significantly after multiple builds with large models. If you encounter disk-full errors or want to reclaim space, it is safe to delete the entire folder:
Remove-Item -Recurse -Force \"$env:USERPROFILE\\.cache\\winml\"\n The next winml build will re-create the cache as needed. Use --rebuild to force a full rebuild without relying on cached intermediates.
uv run winml sys Check EP compatibility uv run winml analyze -m model.onnx --ep <ep> Verbose output Add -v or --verbose to any command Skip a pipeline stage --no-quant, --no-compile, --no-optimize Force rebuild (ignore cache) uv run winml build -c config.json -m <model> -o output/ --rebuild Regenerate config uv run winml config -m <model> -d <device> -o dir/ Free disk space Delete C:\\Users\\<user>\\.cache\\winml"},{"location":"troubleshooting/#see-also","title":"See also","text":"Verify an ONNX model is compatible with a target execution provider before deployment.
"},{"location":"commands/analyze/#when-to-use-this","title":"When to use this","text":"Use winml analyze before running the full build pipeline to confirm that your ONNX model's operators are supported by the intended execution provider and device. It surfaces operator gaps and actionable recommendations early, saving time that would otherwise be spent on a failed compile or quantize run.
$ winml analyze [options]\n"},{"location":"commands/analyze/#flags","title":"Flags","text":"Flag Short Type Default Description --model -m PATH (required) Path to the ONNX model file to analyze. --ep choice auto Target execution provider. Accepts full names (e.g., QNNExecutionProvider) or short aliases (qnn, openvino, vitisai, cpu, cuda, dml, nvtensorrtrtx, migraphx). Use all for every rule-data-backed EP, or auto to infer from local availability. --device cpu\\|gpu\\|npu\\|all\\|auto auto Target device type. auto infers from local availability; all evaluates all rule-data-backed devices. --verbose -v flag off Enable verbose output. --quiet -q flag off Suppress non-essential output. --config -c PATH (none) Build configuration file (YAML/JSON). --output PATH (none) Save the full JSON result to a file in addition to printing the console summary. --information / --no-information flag enabled Include detailed per-operator recommendations and remediation hints in the output. Pass --no-information for a compact pass/fail summary. --htp-metadata PATH (none) Path to an HTP metadata JSON file (produced by winml export). Enriches subgraph pattern extraction by mapping nodes back to their source module hierarchy. Benefits all target EPs. --run-unknown-op / --no-run-unknown-op flag disabled For operators not in the rule database, build a minimal ONNX graph and run it on the target EP locally to determine support. Enable when local EP libraries are available. --save-node partial\\|unsupported (none) Save partial or unsupported node subgraphs to disk for further investigation. Can be specified multiple times: --save-node partial --save-node unsupported. --optim-config PATH (none) Save the auto-discovered optimization config (merged across all analyzed EPs) to a JSON file."},{"location":"commands/analyze/#how-it-works","title":"How it works","text":"winml analyze loads the ONNX model and runs a static analysis pass via ONNXStaticAnalyzer. 
For each operator (and recognized subgraph pattern), the analyzer consults the target EP's rule database. For operators not in the database, it can optionally probe them locally when --run-unknown-op is enabled. The combined answer classifies each node as supported, partial, unsupported, or unknown (see Analyze and optimize for definitions).
The analysis always produces a lint result \u2014 the pass/fail verdict. When --information is enabled (the default), it additionally produces an autoconf result: a set of fusion-flag suggestions that, if applied in the optimize stage, would resolve partial or unsupported patterns. Pass --no-information to skip autoconf and get just the lint verdict.
Exit code 0: All operators are fully supported on the target EP. Exit code 1: At least one operator is unsupported, partially supported, or unknown. Exit code 2: Input or configuration error (bad path, unknown EP, etc.). These exit codes make winml analyze safe to use as a CI gate with set -e or $? checks.
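For pipelines driven from Python rather than a shell, the exit codes listed above can be interpreted with a small wrapper. This is a sketch only: describe_exit and run_gate are hypothetical names, the command line uses flags documented on this page, and the run parameter exists purely so the wrapper can be tested without winml installed.

```python
import subprocess

# Meanings taken from the exit-code list above.
EXIT_MEANINGS = {
    0: 'all operators fully supported',
    1: 'unsupported, partial, or unknown operators found',
    2: 'input or configuration error',
}


def describe_exit(code):
    # Map an exit code to a short description.
    return EXIT_MEANINGS.get(code, 'unexpected exit code')


def run_gate(model_path, ep, run=subprocess.run):
    # Hypothetical CI wrapper: run a lint-only analysis and report the result.
    cmd = ['winml', 'analyze', '--model', model_path, '--ep', ep, '--no-information']
    result = run(cmd)
    print(f'winml analyze exited with {result.returncode}: {describe_exit(result.returncode)}')
    return result.returncode
```

A CI script could end with sys.exit(run_gate('model.onnx', 'qnn')) so that the job fails whenever the analysis reports issues or an input error.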
Analyze using auto-detected EP and device:
$ winml analyze --model microsoft/resnet-50.onnx\n The output shows a live progress table per EP followed by an ANALYSIS SUMMARY section. Each EP line displays support counts in S/P/U/Unk format (Supported / Partial / Unsupported / Unknown) with color-coded indicators.
Check QNN NPU support using the short alias:
$ winml analyze --model bert-base-uncased.onnx --ep qnn --device NPU\n Check Intel OpenVINO GPU support and print operator-level recommendations:
$ winml analyze --model bert-base-uncased.onnx --ep openvino --device GPU --information\n Save the full JSON result for offline inspection while still printing the console summary:
$ winml analyze --model facebook/convnext-tiny-224.onnx --output results.json\n Use HTP metadata for enhanced subgraph pattern extraction:
$ winml analyze --model bert-base-uncased.onnx \\\n --ep qnn --device NPU \\\n --htp-metadata bert-base-uncased_htp_metadata.json\n Run a lint-only pass (no recommendations) for a CI gate:
$ winml analyze --model model.onnx --ep qnn --device NPU --no-information\necho \"Exit code: $?\" # 0 = clean, 1 = issues, 2 = input error\n Dump unsupported subgraphs to disk for debugging:
$ winml analyze --model model.onnx --ep qnn \\\n --save-node partial --save-node unsupported \\\n --output result.json\n Enable local execution for operators not in the rule database:
$ winml analyze --model model.onnx --ep qnn --device NPU --run-unknown-op\n"},{"location":"commands/analyze/#common-pitfalls","title":"Common pitfalls","text":"--ep uses auto (inferred from local availability) \u2014 to analyze every EP regardless of what is installed, pass --ep all. Specify --ep <name> when you know your target hardware.--htp-metadata is EP-agnostic \u2014 HTP metadata enriches pattern extraction before any EP-specific checks, so it benefits all target EPs equally. You do not need separate metadata files per EP.--run-unknown-op is disabled by default \u2014 operators not covered by the rule database are classified as UNKNOWN (not unsupported) unless you explicitly pass --run-unknown-op to probe them locally. Enable it only when the target EP's libraries are available on the local machine..onnx file \u2014 symbolic HuggingFace model IDs are not accepted; export the model first with winml export.Run the entire winml-cli pipeline (export \u2192 optimize \u2192 quantize \u2192 compile) in one command.
"},{"location":"commands/build/#when-to-use-this","title":"When to use this","text":"Use winml build when you want to go from a Hugging Face model ID (or an existing .onnx file) to a deployment-ready artifact in a single invocation, without manually chaining winml export, winml optimize, winml quantize, and winml compile. A build config file \u2014 generated by winml config \u2014 controls every stage of the pipeline.
$ winml build [options]\n"},{"location":"commands/build/#flags","title":"Flags","text":"Flag Short Type Default Description --config -c path None WinMLBuildConfig JSON file, generated by winml config. If omitted, config is auto-generated from -m. --model -m string None Hugging Face model ID or path to an existing .onnx file. --output-dir -o path None Directory for all build artifacts. Mutually exclusive with --use-cache. --use-cache/--no-use-cache flag false Store artifacts in the winml-cli global cache (~/.cache/winml/). Mutually exclusive with --output-dir. --rebuild/--no-rebuild flag false Overwrite existing artifacts and re-run the full pipeline. --quant/--no-quant flag true Run the quantization stage (use --no-quant to skip), overriding the config. --no-compile / --compile flag None Override compilation. --compile forces enable (config must have a compile section). --no-compile forces skip. Default: inherit from config. --optimize/--no-optimize flag true Run the optimization stage (use --no-optimize to skip). --ep string None Target execution provider for the analyzer (e.g., qnn). Falls back to the compile config EP if not set. --device -d string auto Target device for the analyzer (e.g., npu, gpu). Default: auto (auto-detect). --analyze/--no-analyze flag true Run the analyzer loop during build (use --no-analyze to skip). --max-optim-iterations integer None Maximum autoconf re-optimization rounds (3 enforced internally when not set). --no-analyze implicitly sets this to 0. --trust-remote-code/--no-trust-remote-code flag false Allow executing custom code from model repositories. Use only with trusted sources. --allow-unsupported-nodes/--no-allow-unsupported-nodes flag false Allow unsupported nodes to remain in the graph instead of failing the build. 
--help -h flag Show this message and exit."},{"location":"commands/build/#how-it-works","title":"How it works","text":"winml build reads a WinMLBuildConfig JSON file (from winml config) that encodes device, precision, export, quantization, and compilation settings. When -m is a Hugging Face model ID, the full pipeline runs: export \u2192 optimize \u2192 quantize \u2192 compile. When -m points to an existing .onnx file, the export stage is skipped and the pipeline starts at optimization. After compilation, an optional analyzer loop (--max-optim-iterations) re-evaluates graph quality and applies further passes; --no-analyze disables it for a deterministic single-pass build. Individual stages can be suppressed with --no-quant, --no-compile, and --no-optimize without touching the config file.
Reproducible CI/CD builds
The config file is a portable, self-contained pipeline specification. Check it into source control and invoke winml build -c config.json in CI to produce identical artifacts without manual flag management. Set \"auto\": false in the config to disable the autoconf discovery loop for fully deterministic output.
# Full pipeline: HF model \u2192 export \u2192 optimize \u2192 quantize \u2192 compile\nwinml build -c config.json -m microsoft/resnet-50 -o output/\n winml build\n Config: config.json\n Model: microsoft/resnet-50\n Output: output/\n\n export done (28.3s)\n optimize done (4.1s)\n quantize done (6.8s)\n compile done (14.2s)\n\n Build complete in 53.4s\n Final artifact: output/resnet50_ctx.onnx\n # Start from a pre-exported ONNX file (skips export stage)\nwinml build -c config.json -m resnet50.onnx -o output/\n # Export and optimize only \u2014 skip quantization and compilation for quick testing\nwinml build -c config.json -m bert-base-uncased -o output/ \\\n --no-quant --no-compile\n # Force a clean rebuild, overwriting any cached artifacts\nwinml build -c config.json -m facebook/convnext-tiny-224 -o output/ --rebuild\n # Use the global cache and cap optimizer iterations for faster turnaround\nwinml build -c config.json -m microsoft/resnet-50 \\\n --use-cache --max-optim-iterations 1\n"},{"location":"commands/build/#common-pitfalls","title":"Common pitfalls","text":"--output-dir or --use-cache is required; they are mutually exclusive. Omitting both raises an error immediately.--use-cache is not supported in module mode. When the config is a JSON array (module mode), only --output-dir is accepted.winml config. The schema is strict; unknown keys are rejected.--rebuild to force a fresh run after changing the config.Browse the curated winml-cli catalog of validated models and benchmarks.
"},{"location":"commands/catalog/#when-to-use-this","title":"When to use this","text":"Use winml catalog to discover which HuggingFace models have been validated end-to-end by the winml-cli team \u2014 exported, quantized, compiled, and benchmarked on real Windows ML devices. It is the starting point when you want a model that is known to work before investing time in a custom build.
$ winml catalog [options]\n"},{"location":"commands/catalog/#flags","title":"Flags","text":"Flag Short Type Default Description --model-type string null Filter the catalog by model architecture (case-insensitive). Examples: bert, roberta, vit. --task -t string null Filter by HuggingFace task (case-insensitive). Examples: text-classification, image-segmentation. --ep/--execution-provider string null Filter by execution provider (e.g., qnn, dml). If not specified, shows all EPs. --device -d string null Filter by target device (e.g., npu, gpu). If not specified, shows all devices. --output -o path null Save the displayed results to a JSON file. --help -h flag \u2014 Show help and exit. winml catalog reads a local catalog bundled with the package \u2014 no network access is required.
The catalog is stored in winml/modelkit/data/hub_models.json and is loaded directly from the installed package data without any network call. Each catalog entry records the model ID, task, architecture type, and model size. Use --model-type, --task, --ep, or --device to narrow the displayed list. When --output is provided, the filtered results are written as indented JSON to the specified path.
# List all validated models in the catalog\n$ winml catalog\n +--- winml-cli Catalog | 12 validated model(s) --------------------------+\n| Model Task Model Type |\n| microsoft/resnet-50 image-classification resnet |\n| bert-base-uncased fill-mask bert |\n| ProsusAI/finbert text-classification bert |\n| ... |\n+---------------------------------------------------------------------------+\nUse --ep or --device to filter by execution provider or target device.\n # Filter to BERT-family models only\n$ winml catalog --model-type bert\n # Filter by task \u2014 show only text-classification models\n$ winml catalog --task text-classification\n # Combine filters \u2014 BERT models for text classification\n$ winml catalog --model-type bert --task text-classification\n # Save filtered results to JSON for offline review\n$ winml catalog --task image-classification --output results/image_catalog.json\n"},{"location":"commands/catalog/#common-pitfalls","title":"Common pitfalls","text":"--output only saves what was displayed. Combining a filter with --output saves the filtered list. There is no flag to dump the entire catalog in one call \u2014 omit all filters and add --output to do so.winml inspect and winml export work with any HuggingFace model that has a supported architecture, whether or not it appears in the catalog.Compile an ONNX model to an EP-specific format for fast runtime loading.
"},{"location":"commands/compile/#when-to-use-this","title":"When to use this","text":"Use winml compile as the final pipeline stage after winml quantize to produce an execution-provider-native artifact (for example, a QNN EPContext model) that loads faster and avoids online graph compilation at inference time.
$ winml compile [options]\n"},{"location":"commands/compile/#flags","title":"Flags","text":"Flag Short Type Default Description --model -m path (required unless --list) Input ONNX model file. --output -o path \u2014 Output file path (e.g., model_compiled.onnx). Takes precedence over --output-dir. --output-dir path same dir as input Directory to write compiled output artifacts. --device -d choice auto Target device: auto, npu, gpu, or cpu. --ep TEXT \u2014 Force a specific execution provider, overriding device-to-provider mapping. Accepts full names (e.g., QNNExecutionProvider) or aliases (qnn, dml, openvino, vitisai, migraphx, cpu, nvtensorrtrtx). --validate / --no-validate flag --validate Run a post-compilation validation pass on the target hardware. Enabled by default; pass --no-validate to skip when the target hardware or driver is unavailable. --compiler choice ort Compiler backend: ort (ONNX Runtime) or qairt (Qualcomm AI Runtime Tools). --qnn-sdk-root path None Path to the QNN SDK root directory. --embed/--no-embed flag false Embed the EP context blob inside the ONNX file instead of writing a separate .bin file. --list flag false List available compiler backends for the selected device and exit without compiling. --help -h flag Show this message and exit."},{"location":"commands/compile/#how-it-works","title":"How it works","text":"winml compile resolves the target execution provider from --device and --ep, then calls the winml-cli compiler API to hand the ONNX graph to the EP's offline compilation toolchain. When --device auto (the default), the target EP is determined by auto-detecting available hardware. For NPU targets, ONNX Runtime's QNN EP generates a binary .bin context file (or embeds it inline with --embed) that encodes the hardware-optimized execution plan, eliminating graph partitioning at load time. An optional post-compilation validation pass runs a forward pass through the target EP; skip it with --no-validate when the target hardware is absent.
# Compile with auto device detection (default compiler)\nwinml compile -m resnet50_qdq.onnx\n Input: resnet50_qdq.onnx\nDevice: npu\nProvider: qnn\nCompiler: ort\n\nCompiling model...\n\nSuccess! Model compiled\nOutput: resnet50_qdq_ctx.onnx\nCompile time: 12.40s\nTotal time: 13.05s\n # List available compiler backends for NPU before committing to a run\nwinml compile --list --device npu\n # Compile a pre-quantized BERT model for NPU with context embedded inline\nwinml compile -m bert-base-uncased_qdq.onnx --embed\n # Compile for GPU using the OpenVINO execution provider\nwinml compile -m microsoft_resnet50.onnx --device gpu --ep openvino\n"},{"location":"commands/compile/#common-pitfalls","title":"Common pitfalls","text":"--embed inflates the .onnx file significantly. Embedding the EP context produces a single portable file but can make it impractical to open or inspect the ONNX graph with standard tooling.--no-validate.--device auto auto-detects the best available hardware. Pass --device npu, --device gpu, or --device cpu explicitly when targeting specific hardware regardless of what is auto-detected.Generate a reusable build configuration for a Hugging Face model or ONNX file.
"},{"location":"commands/config/#when-to-use-this","title":"When to use this","text":"Use winml config at the start of a new model project to produce a WinMLBuildConfig JSON file. The config captures the model identity, task, precision, and per-stage settings in one shareable artifact that you can edit, version-control, and repeatedly pass to winml build. Running config first lets you review and adjust pipeline settings before committing to a full build.
$ winml config [options]\n"},{"location":"commands/config/#flags","title":"Flags","text":"Flag Short Type Default Description --model -m TEXT (none) HuggingFace model ID (e.g., microsoft/resnet-50) or path to an existing .onnx file. Optional when --model-type or --model-class is provided. --task -t TEXT (auto) Override the auto-detected task (e.g., image-classification, text-classification). When omitted, the first supported task for the model is selected automatically. --model-class TEXT (auto) Override the auto-detected model class (e.g., CLIPTextModelWithProjection). Useful for multi-component models. --model-type TEXT (auto) Override the auto-detected model type (e.g., bert, resnet). Can be used without -m to generate a config from HuggingFace default settings. --module TEXT (none) Generate configs for every submodule whose class name matches the given string (e.g., ResNetConvLayer). The output is a JSON array instead of a single object. --config -c PATH (none) JSON override file in WinMLBuildConfig format. Fields present in this file take precedence over auto-detected values. --shape-config PATH (none) JSON file with input shape overrides for dummy input generation. Valid keys by modality \u2014 text: sequence_length; vision: height, width, num_channels; audio: feature_size, nb_max_frames, audio_sequence_length. --device -d auto\\|npu\\|gpu\\|cpu auto Target device. Affects the generated quantization and compilation sub-configs. auto leaves those sections unchanged from the kit defaults. --ep TEXT (none) Force a specific execution provider (qnn, dml, migraphx, tensorrt, vitisai, openvino, cpu). Overrides the device-to-provider mapping. When used without --device, the device is inferred from the EP. --precision -p TEXT auto Target precision: auto, fp32, fp16, int8, int16, or a mixed format such as w8a16. auto selects the precision based on the chosen device. --output -o PATH (stdout) Write the generated JSON to this file instead of printing to stdout. 
--library TEXT transformers Source library for TasksManager task lookup. Defaults to transformers; set to diffusers or another Optimum-supported library when needed. --quant/--no-quant flag true Include quantization in the generated config (use --no-quant to omit it and set quant to null). --no-compile / --compile flag --no-compile (compile excluded by default) Controls whether compilation is included in the generated config. By default compilation is excluded (compile: null). Pass --compile to include a compile section. --trust-remote-code/--no-trust-remote-code flag false Allow execution of custom model code from the HuggingFace repository. Required for some community models. Only enable for repositories you trust."},{"location":"commands/config/#how-it-works","title":"How it works","text":"winml config queries the HuggingFace TasksManager to auto-detect the model's task, class, and ONNX export specification. For known model types it looks up a per-model kit in MODEL_BUILD_CONFIGS and uses that as a starting point, layering in your device, precision, and override file on top. When -m points to an existing .onnx file, the export stage is skipped by setting export to null in the output. The result is a complete WinMLBuildConfig JSON printed to stdout or written to a file, ready to be passed to winml build.
Generate a config for ResNet-50 with all auto-detected settings:
$ winml config -m microsoft/resnet-50\n Generating config for microsoft/resnet-50...\nAuto-selected task: image-classification (from 'microsoft/resnet-50')\nGenerated config for task 'image-classification'\n{\n \"loader\": { \"task\": \"image-classification\", ... },\n \"export\": { \"opset_version\": 17, ... },\n \"optim\": { ... },\n \"quant\": null,\n \"compile\": null\n}\n Target NPU with int8 quantization and save to a file:
$ winml config -m microsoft/resnet-50 --device npu --precision int8 -o resnet_npu.json\n Generate a config for BERT and override the task:
$ winml config -m bert-base-uncased --task text-classification -o bert_cls.json\n Generate from a model type alone (no HuggingFace download required at config time):
$ winml config --model-type bert --task fill-mask\n Generate a config from an already-exported ONNX file, skipping quantization (compilation is already excluded by default):
$ winml config -m facebook/convnext-tiny-224.onnx --no-quant -o convnext_optim_only.json\n"},{"location":"commands/config/#common-pitfalls","title":"Common pitfalls","text":"-m, --model-type, or --model-class is required \u2014 calling winml config with none of these three flags raises a usage error immediately. auto precision does not always map to a lower-bit type \u2014 when --device is also auto, precision stays at the kit default (usually fp32). Explicitly pass --device npu or --device gpu for auto precision to resolve to int8 or fp16. --module changes the output shape \u2014 with --module the JSON output is an array of configs, not a single object. Scripts that expect a single object will fail to parse this output. --trust-remote-code has security implications \u2014 only use this flag with model repositories you own or explicitly trust; it allows arbitrary Python code from the remote model repository to execute. --shape-config are modality-specific \u2014 passing a sequence_length key for a vision model has no effect. Check the --help description for valid keys per modality. WinMLBuildConfig and how stages interact. Evaluate ONNX model accuracy on a standard dataset.
"},{"location":"commands/eval/#when-to-use-this","title":"When to use this","text":"Use winml eval to measure how accurately a model performs on real data \u2014 especially after quantization, where comparing the quantized model against the floating-point baseline reveals any accuracy regression introduced by precision reduction.
$ winml eval [options]\n"},{"location":"commands/eval/#flags","title":"Flags","text":"Flag Short Type Default Description --model -m TEXT \u2014 HuggingFace model ID, or path to a local .onnx file. Required (unless --model-id is provided directly). --model-id TEXT \u2014 HuggingFace model ID used for preprocessor and config resolution when -m points to an .onnx file. Required when -m is an ONNX file. --task TEXT auto-detected Task name (e.g., image-classification). Auto-detected from --model-id when not provided. Required when -m is an ONNX file and the task cannot be inferred. --precision TEXT auto Precision used when building the model from a HuggingFace ID. One of auto, fp32, fp16, int8, int16, or a mixed w{x}a{y} spec (e.g., w8a16). fp16/fp32 skip quantization. Ignored when -m is a pre-built .onnx file \u2014 the precision is already baked in. --device choice auto Target device. Choices: auto, npu, gpu, cpu. auto selects the best available device. Combined with --precision, this drives the build when -m is a HuggingFace ID. --ep / --execution-provider TEXT \u2014 Target ONNX Runtime execution provider when finer control than --device is needed. Full names (e.g., QNNExecutionProvider, OpenVINOExecutionProvider, VitisAIExecutionProvider) and aliases (qnn, ov/openvino, vitis/vitisai) are accepted. --dataset TEXT task default HuggingFace dataset path (e.g., imagenet-1k, nyu-mll/glue). If omitted, a default dataset is selected based on the task. --dataset-name TEXT \u2014 Dataset configuration name for multi-config datasets. --dataset-revision TEXT \u2014 Git revision (branch, tag, or commit) of the dataset to load. Use refs/convert/parquet for HF datasets that are only served via the parquet mirror. --dataset-script TEXT \u2014 Path to a Python script that builds the evaluation dataset locally. Requires --trust-remote-code. --trust-remote-code / --no-trust-remote-code flag false Allow executing custom code from model repositories or dataset scripts. 
Required with --dataset-script. Use only with trusted sources. --samples INTEGER 100 Number of dataset samples to evaluate. --split TEXT validation Dataset split to use (e.g., validation, test, train). --shuffle / --no-shuffle flag shuffle Shuffle the dataset before sampling. Disable with --no-shuffle for reproducible sample ordering. --streaming / --no-streaming flag false Stream the dataset from the Hub instead of downloading the full split. Useful for large datasets. --column TEXT (multiple) \u2014 Column mapping as key=value pairs (e.g., --column input_column=image). Can be specified multiple times. --label-mapping PATH \u2014 Path to a JSON file mapping dataset label names to the integer class IDs the model emits: {\"label_name\": id}. --output -o PATH \u2014 Output JSON file path for the evaluation results. --schema flag false Print the expected dataset schema for the given --task and exit. Does not run evaluation. --mode onnx\\|compare onnx Evaluation mode. onnx evaluates the ONNX candidate on a dataset. compare runs the ONNX candidate and the HuggingFace reference on identical random inputs and reports per-tensor similarity metrics \u2014 no dataset required."},{"location":"commands/eval/#how-it-works","title":"How it works","text":"winml eval loads the model and runs the evaluation pipeline via the internal evaluate function (supporting both HuggingFace IDs and local ONNX files), then pulls the requested number of samples from a HuggingFace dataset. Each sample is preprocessed using the tokenizer or image processor associated with the model ID, passed through the ONNX Runtime session, and the output is compared against the ground-truth label. Aggregated metrics (accuracy, F1, etc.) are printed to the console and optionally written to a JSON file. When -m is an ONNX file, --model-id must be provided so the command knows which preprocessor and label vocabulary to use.
Evaluate a HuggingFace model using the task-default dataset:
$ winml eval -m microsoft/resnet-50\n Task: image-classification\nDataset: timm/mini-imagenet (test, 100 samples)\nDevice: auto\n\nAccuracy: 76.00%\n\nResults saved to: microsoft_resnet-50_eval.json\n Evaluate a pre-exported ONNX file, providing the source model ID for preprocessing:
$ winml eval -m model.onnx --model-id microsoft/resnet-50 --dataset timm/mini-imagenet\n Evaluate a BERT model on the MRPC paraphrase task with column remapping:
$ winml eval -m Intel/bert-base-uncased-mrpc --dataset nyu-mll/glue --dataset-name mrpc --column input_column=sentence1 --column second_input_column=sentence2 --samples 500\n Check what dataset columns are expected before running, then remap them to match your dataset:
$ winml eval --schema --task text-classification\n Input schema for text-classification models\n==================================================\n\n--column option schema\n\nEvaluating needs a dataset with the following columns:\n input_column\n input text (default: text)\n label_column\n class label (ClassLabel or integer) (default: label)\n second_input_column\n second text for sentence-pair tasks (optional) (default: None)\n\nOverride any default with --column:\n --column input_column=<your_text_column>\n --column label_column=<your_label_column>\n --column second_input_column=<your_pair_column>\n The GLUE SST-2 dataset uses sentence instead of the default text column, so remap it with a single --column override:
$ winml eval -m distilbert/distilbert-base-uncased-finetuned-sst-2-english --dataset nyu-mll/glue --dataset-name sst2 --column input_column=sentence --samples 500\n Evaluate against a custom dataset whose label names differ from the model's class IDs. The --label-mapping flag points to a JSON file whose keys are the label name strings as they appear in the dataset and whose values are the integer class IDs the model emits. For example, ResNet-50 outputs ImageNet-1k class IDs (0\u2013999), so if your custom dataset uses readable strings like \"tabby cat\" or \"golden retriever\", labels.json translates each dataset label to the corresponding ImageNet ID the model predicts:
{\n \"tabby cat\": 281,\n \"Egyptian cat\": 285,\n \"golden retriever\": 207\n}\n $ winml eval -m microsoft/resnet-50 --dataset my-org/my-pets-dataset --label-mapping labels.json -o results/resnet_eval.json\n Evaluate a composite model from pre-exported ONNX files. Some tasks (e.g., image-to-text, encoder-decoder, dual-encoder) split the model across multiple ONNX files, one per role. Pass -m once per role as <role>=<path>.onnx and supply --model-id so the preprocessor and tokenizer can be resolved. Run winml eval --schema --task image-to-text to see the expected roles for a task:
$ winml eval -m encoder=encoder.onnx -m decoder=decoder.onnx --model-id microsoft/trocr-base-printed\n"},{"location":"commands/eval/#common-pitfalls","title":"Common pitfalls","text":"--model-id fails. When -m is a .onnx path, --model-id is mandatory. Without it the command cannot resolve the preprocessor or label vocabulary and will exit with a usage error.
--dataset (and --label-mapping if needed) when evaluating a model whose label space or domain differs from the task default.
imagenet-1k) require a HuggingFace account with accepted terms of use. Log in with huggingface-cli login before running eval on gated data.
--shuffle is on by default. The random 100-sample slice changes between runs unless you pass --no-shuffle. Use --no-shuffle when comparing two model variants to ensure they see identical samples.
--streaming skips the local cache. Streaming mode avoids downloading the full split but prevents random shuffling on large datasets. For reproducible evaluation, download the split once and omit --streaming.
winml eval --schema --task <task> to inspect the expected schema and use --column to remap dataset field names to the expected names.
--device option
Convert a PyTorch / Hugging Face model to ONNX, preserving module hierarchy.
"},{"location":"commands/export/#when-to-use-this","title":"When to use this","text":"Use winml export when you have a Hugging Face model ID or a local PyTorch checkpoint and need an ONNX file as the first step of the optimization pipeline. This is the entry point before winml quantize or winml compile.
$ winml export [options]\n"},{"location":"commands/export/#flags","title":"Flags","text":"Flag Short Type Default Description --model -m string (required) Hugging Face model name or local path (e.g., prajjwal1/bert-tiny). --output -o path (required) Output ONNX file path (e.g., model.onnx). --with-report/--no-with-report flag false Generate full export reports: Markdown, JSON, and a console tree. --hierarchy/--no-hierarchy flag true Preserve hierarchy_tag metadata in ONNX nodes (use --no-hierarchy for a clean ONNX file). --dynamo/--no-dynamo flag false Enable PyTorch 2.9+ dynamo export for richer node metadata. (Experimental \u2014 currently logs a warning.) --torch-module string None Comma-separated list of torch.nn module types to include in hierarchy (e.g., LayerNorm,Embedding). (Experimental \u2014 currently logs a warning.) --input-specs path None JSON file with explicit input tensor specifications. Auto-generated when omitted. --task -t string None Override auto-detected Hugging Face task (e.g., image-feature-extraction). --export-config path None JSON file with ONNX export parameters such as opset_version and do_constant_folding. --shape-config path None JSON object mapping symbolic dimension names to concrete sizes (e.g., {\"sequence_length\": 2048}). Ignored when --input-specs is provided. --trust-remote-code/--no-trust-remote-code flag false Allow executing custom code from model repositories during export. Use only with trusted sources. --allow-unsupported-nodes/--no-allow-unsupported-nodes flag false Allow unsupported nodes to remain in the exported graph instead of failing export. 
--help -h flag Show this message and exit."},{"location":"commands/export/#how-it-works","title":"How it works","text":"winml export loads the model via Hugging Face transformers, then runs the eight-step Hierarchy-preserving Tags Protocol (HTP): model preparation, input generation, module-hierarchy tracing, TorchScript ONNX export, node-tagger creation, per-node tagging, tag injection into ONNX metadata_props, and optional report generation. The hierarchy metadata allows downstream tools to reason about operators grouped by their originating module rather than flat graph position. When --no-hierarchy is specified, hierarchy steps are bypassed and a bare ONNX file is written, useful for third-party tools that do not understand custom metadata.
# Minimal export: Hugging Face model ID to ONNX file\nwinml export -m microsoft/resnet-50 -o resnet50.onnx\n Model: microsoft/resnet-50\nOutput: resnet50.onnx\n\nStarting HTP export...\n Detected task: image-classification\n\nSuccess! Model exported to: resnet50.onnx\n # Export with verbose output and full Markdown + JSON reports\nwinml export -m facebook/convnext-tiny-224 -o convnext.onnx -v --with-report\n # Export a BERT model, overriding input shapes for longer sequences\nwinml export -m bert-base-uncased -o bert.onnx \\\n --shape-config shape.json\n# shape.json: {\"sequence_length\": 512}\n # Export with a hand-crafted input-spec file (skips auto-detection)\nwinml export -m bert-base-uncased -o bert.onnx --input-specs inputs.json\n # Produce clean ONNX without hierarchy metadata (for third-party optimizers)\nwinml export -m microsoft/resnet-50 -o resnet50_clean.onnx --no-hierarchy\n"},{"location":"commands/export/#see-also","title":"See also","text":"-t with the correct task string, for example -t image-feature-extraction.--shape-config is silently ignored when --input-specs is set. --input-specs takes full priority; remove it if you only want to override individual dimensions.--dynamo and --torch-module are experimental. Both flags emit a warning and have no effect in the current release. Do not rely on them in automated pipelines yet.HF_HOME or HF_HUB_CACHE to control the download location.Inspect a model's tasks, classes, and hierarchy before committing to an export.
"},{"location":"commands/inspect/#when-to-use-this","title":"When to use this","text":"Use winml inspect to understand how winml-cli will treat a HuggingFace model before running winml export or winml build. It answers questions like \"which task will be auto-detected?\", \"which HF model class will be loaded?\", and \"does this model have a supported exporter?\" without downloading weights or writing any files.
$ winml inspect -m <model_id> [options]\n"},{"location":"commands/inspect/#flags","title":"Flags","text":"Flag Short Type Default Description --model -m string required HuggingFace model ID (e.g. openai/clip-vit-base-patch32). Required unless --list-tasks or --help is used. --format -f table | json table Output format. table renders rich panels; json emits a machine-readable object. --task -t string null Override the auto-detected task (e.g. image-classification, feature-extraction). --hierarchy/--no-hierarchy -H flag false Print the PyTorch module tree. Instantiates the model with random weights \u2014 no weight download required. --verbose -v flag false Show full configuration details. --list-tasks flag false List all known tasks and exit. Does not require --model. --model-type string null Override model type (e.g. bert, resnet). Can be used without --model. --model-class string null Override model class (e.g. BertForMaskedLM). Can be used without --model. --help -h flag \u2014 Show help and exit. winml inspect does not accept --device, --ep, --precision, or --output. It is a read-only discovery command that does not produce any artifacts.
winml inspect calls into the winml-cli registry to resolve the model ID against the known loader and exporter configurations. It fetches only the model's config.json from HuggingFace Hub (no weights), uses the architecture field to look up the matching HF model class and WinML inference class, and then renders the result. When --hierarchy is supplied, the model is instantiated locally with random weights using AutoModel.from_config(), and a forward-pass trace records the full PyTorch module tree. Because no real weights are downloaded, hierarchy inspection is fast even for large models.
# Basic inspection \u2014 check task detection and loader/exporter classes\n$ winml inspect -m microsoft/resnet-50\n +--------------------------- microsoft/resnet-50 ---------------------------+\n| Task image-classification |\n| Model Class ResNetForImageClassification |\n| Exporter OptimumExporter |\n| WinML Class WinMLImageClassificationModel |\n| Status Supported |\n+---------------------------------------------------------------------------+\n # JSON output \u2014 useful for scripting or CI pre-flight checks\n$ winml inspect -m bert-base-uncased --format json\n # Override task when auto-detection picks the wrong one\n$ winml inspect -m bert-base-uncased --task feature-extraction\n # Print the full PyTorch module hierarchy (no weight download)\n$ winml inspect -m openai/clip-vit-base-patch32 --hierarchy\n # Combine verbose logging with hierarchy for deep diagnostics\n$ winml inspect -m facebook/convnext-tiny-224 -v -H\n"},{"location":"commands/inspect/#common-pitfalls","title":"Common pitfalls","text":"--model is required for model inspection. The flag is marked required for model-specific lookups; omitting it returns an error. The only exception is --list-tasks, which lists all known tasks and exits without needing a model.transformers installation, --hierarchy will fail with an import error. Update transformers or omit the flag.--task changes which exporter and WinML class are reported, not just the task field. If the override is incompatible with the model architecture, the status will show as unsupported.--format json is silent on unsupported models. When the model is not found in the winml-cli registry, the command raises a ClickException. Wrap the call in winml inspect ... && ... or check the exit code when scripting.config.json is always fetched from HuggingFace Hub. 
Set HF_HUB_OFFLINE=1 if you need fully offline inspection of a locally cached model.
winml.hierarchy.tag metadata is written and what you can do with the module tree
Apply graph optimizations and fusions to an ONNX model to reduce node count and improve inference speed.
"},{"location":"commands/optimize/#when-to-use-this","title":"When to use this","text":"Use winml optimize after exporting an ONNX model and before quantization or compilation. Graph fusions reduce operator count, improve memory locality, and can make downstream quantization more accurate by presenting cleaner subgraphs to the calibration pass. It is also useful as a standalone step when you want to optimize a pre-exported ONNX file without running the full build pipeline.
$ winml optimize [options]\n"},{"location":"commands/optimize/#flags","title":"Flags","text":"Flag Short Type Default Description --model -m PATH (required unless listing) Input ONNX model file. Not required when --list-capabilities or --list-rewrites is used. --output -o PATH {input}_opt.onnx Output path for the optimized model. Defaults to the input filename with _opt inserted before the extension. --config -c PATH (none) YAML or JSON configuration file. Fields in the file override capability defaults; CLI flags override the file. --verbose -v flag off Enable verbose output. --list-capabilities -l flag off Print all registered optimization capabilities grouped by category and exit. Add --verbose for descriptions and ORT names. --list-rewrites flag off Print all available pattern-rewrite families with their source-to-target mappings and exit. (dynamic) flag (per capability) Each registered capability generates a --enable-<name> / --disable-<name> pair. Run --list-capabilities to see the full current list. Examples: --enable-gelu-fusion, --disable-constant-folding. Pattern-rewrite flags follow the form --enable-<source-slug>-<target-slug>; run --list-rewrites to discover all names."},{"location":"commands/optimize/#configuration-precedence","title":"Configuration precedence","text":"When multiple sources are provided, settings are resolved in this order (highest wins):
CLI flags (--enable-X / --disable-X)
Config file (-c)
Capability defaults
winml optimize loads the ONNX model, builds a final capability configuration by merging capability defaults, an optional config file, and any explicit CLI flags, then runs all enabled passes through the Optimizer. Each capability maps to a named optimization or fusion pipe in the winml.modelkit.optim registry. The capability flags are auto-generated at startup from that registry, so adding a new optimization to the registry automatically makes it available as a CLI flag without any change to this command's source. After optimization, the command prints the before-and-after node count and percentage reduction so you can quantify the effect.
Optimize a model with all capability defaults:
$ winml optimize -m microsoft/resnet-50.onnx\n Input: microsoft/resnet-50.onnx\nOutput: microsoft/resnet-50_opt.onnx\n\nLoading model...\nRunning optimizer...\nSaving optimized model...\n\nSuccess! Model optimized: microsoft/resnet-50_opt.onnx\nNodes: 312 -> 289 (7.4% reduction)\n Enable specific fusions for a BERT model:
$ winml optimize -m bert-base-uncased.onnx \\\n --enable-layer-norm-fusion \\\n --enable-attention-fusion \\\n -o bert_layernorm_attn.onnx\n Use a config file to set capabilities and save the result for downstream compilation:
$ winml optimize -m facebook/convnext-tiny-224.onnx \\\n -c optimize_config.yaml \\\n -o convnext_opt.onnx\n List all available optimization capabilities:
$ winml optimize --list-capabilities\n Discover pattern-rewrite families and their flag names:
$ winml optimize --list-rewrites\n"},{"location":"commands/optimize/#common-pitfalls","title":"Common pitfalls","text":"--model is required for actual optimization \u2014 it can be omitted only when using --list-capabilities or --list-rewrites. Missing --model in any other case raises a usage error.
--disable-X CLI flag always wins over a config file value that enables the same capability, but omitting the flag leaves the config file value in effect. To turn off a capability set by a config file, pass the explicit --disable-X flag.
--list-capabilities to confirm the current set of flags rather than relying on a cached list.
-o, the second run silently overwrites {input}_opt.onnx. Specify an explicit output path in scripts.
WinMLBuildConfig that includes optimization settings
winml-cli exposes a CLI named winml with 12 subcommands covering the full journey from model discovery to a deployment-ready artifact. Every subcommand shares a consistent invocation style \u2014 winml <command> [flags] \u2014 and the same global flags are available on the root winml group.
The commands group by user intent. Discover (sys, inspect, catalog, analyze) helps you understand your hardware and model before writing any artifacts. Configure (config, optimize) produces a reusable build configuration and tunes the ONNX graph. Build (export, quantize, compile, build) runs the pipeline stages that produce deployment artifacts. Measure (perf, eval) benchmarks and validates the result.
The typical workflow follows that order: run winml sys to confirm hardware and EPs, then winml inspect or winml catalog to verify model support. Use winml config to generate a build configuration, then winml build to execute the full pipeline \u2014 or chain export \u2192 analyze \u2192 optimize \u2192 quantize \u2192 compile individually for finer control. Close with winml perf and winml eval to measure speed and accuracy.
sys Discover Inspect your machine \u2014 devices, EPs, and runtime versions at a glance. inspect Discover Inspect a model's tasks, classes, and hierarchy before committing to an export. catalog Discover Browse the curated winml-cli catalog of validated models and benchmarks. config Configure Generate a reusable build configuration for a Hugging Face model or ONNX file. export Build Convert a PyTorch / Hugging Face model to ONNX, preserving module hierarchy. analyze Build Verify an ONNX model is compatible with a target execution provider before deployment. optimize Build Apply graph optimizations and fusions to an ONNX model to reduce node count and improve inference speed. quantize Build Quantize an ONNX model with QDQ insertion and calibration-based scaling. compile Build Compile an ONNX model to an EP-specific format for fast runtime loading. build Build Run the entire winml-cli pipeline (export \u2192 optimize \u2192 quantize \u2192 compile) in one command. perf Measure Benchmark an ONNX model's latency and throughput on a target device. eval Measure Evaluate ONNX model accuracy on a standard dataset."},{"location":"commands/overview/#choosing-a-command","title":"Choosing a command","text":"winml syswinml inspectwinml catalogwinml analyzewinml exportwinml buildwinml perfwinml eval-v / --verbose, -q / --quiet, --version, and -h / --help live on the root winml group only. Subcommands access them through ctx.obj and do not redefine them. See src/winml/modelkit/cli.py for the canonical contract.
Several flags share semantics across the commands that accept them: -m / --model, -d / --device, --ep, -o / --output, -t / --task, and --precision. Defaults and accepted values can differ per command (e.g., -p is a short form for --precision only on config and quantize); check the Flags section of each command page rather than assuming they transfer.
WinMLBuildConfig and how stages interact
--device / --ep interact
Benchmark an ONNX model's latency and throughput on a target device.
"},{"location":"commands/perf/#when-to-use-this","title":"When to use this","text":"Use winml perf when you want a quantitative latency and throughput baseline for a model on a specific device, or when you need to compare the performance impact of different precision settings, execution providers, or batch sizes.
$ winml perf [options]\n"},{"location":"commands/perf/#flags","title":"Flags","text":"Flag Short Type Default Description --model -m TEXT \u2014 HuggingFace model ID or path to a local .onnx file. Required. --task TEXT auto-detected Explicit task override (e.g., image-classification). Inferred from the model if omitted. --iterations INTEGER 100 Number of timed inference iterations used to compute statistics. --warmup INTEGER 10 Number of warm-up iterations run before timing begins; excluded from statistics. --device -d auto\\|cpu\\|gpu\\|npu auto Device to run the benchmark on. auto selects the highest-priority available device. --precision TEXT auto Precision mode applied during model build: auto, fp32, fp16, int8, int16, or compound forms such as w8a16. --ep TEXT \u2014 Force a specific execution provider (e.g., qnn, dml, vitisai, openvino, cpu). Overrides the device-to-provider mapping. --ep-options KEY=VALUE (multiple) \u2014 Runtime EP provider option forwarded to the inference session (e.g., --ep-options htp_performance_mode=burst). Repeatable. Applies to both HuggingFace model IDs and ONNX file inputs. Unlike build-time options set via --config, these tune the runtime session, not the compiled graph. --output -o PATH ~/.cache/winml/perf/<slug>/<timestamp>.json Output JSON file path for the benchmark report. --batch-size INTEGER 1 Batch size used when generating synthetic input tensors. --shape-config PATH \u2014 Path to a JSON file containing shape overrides (e.g., {\"height\": 480, \"width\": 480}). Ignored for pre-exported ONNX files and in --module mode. --quantize/--no-quantize flag true Run quantization during model build (use --no-quantize to skip it). Useful for measuring the fp32 baseline. --rebuild/--no-rebuild flag false Force model rebuild even if a cached artifact already exists. --ignore-cache/--no-ignore-cache flag false Build from scratch in a temporary folder and discard the artifact after benchmarking. Implies --rebuild. 
--module TEXT \u2014 PyTorch module class name for per-module benchmarking (e.g., BertAttention). Builds and times each matching instance separately. See Load and export. --monitor/--no-monitor flag false Show a live NPU/CPU utilization chart while the benchmark runs and include hardware metrics in the JSON report."},{"location":"commands/perf/#how-it-works","title":"How it works","text":"winml perf loads the model through WinMLAutoModel \u2014 accepting both HuggingFace IDs and local ONNX files \u2014 then generates random input tensors from the model's I/O configuration. It runs the specified number of warm-up iterations (excluded from statistics) followed by the timed iterations, collecting per-sample latency. The final report includes mean, min, max, P50, P90, P95, P99, standard deviation, and throughput in samples per second. When --monitor is active, a hardware polling loop runs in parallel and records NPU / GPU utilization, CPU usage, and device memory alongside the timing data.
Basic benchmark on the best available device:
$ winml perf -m microsoft/resnet-50\n Device: npu\nPrecision: auto\nTask: image-classification\nIterations: 100 (+ 10 warmup)\nBatch Size: 1\n\nLatency (ms)\n Avg P50 P90 P95 P99 Min Max Std\n 2.14 2.11 2.38 2.51 2.79 1.97 3.04 0.12\n\nThroughput: 467.29 samples/sec\n\nResults saved to: ~/.cache/winml/perf/microsoft_resnet-50/2026-05-27T120000.json\n Benchmark a pre-exported ONNX file on CPU with more iterations:
$ winml perf -m model.onnx --device cpu --iterations 500\n Benchmark a text model with an explicit task, targeting the NPU:
$ winml perf -m bert-base-uncased --task text-classification --device npu --precision w8a16\n Benchmark with live hardware monitoring enabled:
$ winml perf -m microsoft/resnet-50 --device npu --monitor\n Pass runtime EP provider options to tune the session (repeatable):
$ winml perf -m model.onnx --device npu \\\n --ep-options htp_performance_mode=burst \\\n --ep-options htp_graph_finalization_optimization_mode=3\n Per-module benchmarking to find latency hot-spots across all attention blocks:
$ winml perf -m bert-base-uncased --module BertAttention --iterations 200\n"},{"location":"commands/perf/#common-pitfalls","title":"Common pitfalls","text":"--warmup 30 or higher to reach steady-state latency.
--shape-config is ignored in two cases. It has no effect on pre-exported ONNX files (shapes are baked into the graph) and is ignored in --module mode. The command prints a warning in both situations.
winml perf separately with different --device values and compare the resulting JSON reports.
perf benchmarks
--module per-instance benchmarking works
--device vs --ep
Quantize an ONNX model with QDQ insertion and calibration-based scaling.
"},{"location":"commands/quantize/#when-to-use-this","title":"When to use this","text":"Use winml quantize after winml export to insert QuantizeLinear/DequantizeLinear (QDQ) node pairs into an ONNX graph. The resulting model is ready for winml compile targeting an NPU or other quantization-aware execution provider.
$ winml quantize [options]\n"},{"location":"commands/quantize/#flags","title":"Flags","text":"Flag Short Type Default Description --model -m path (required) Input ONNX model file. --output -o path {input}_qdq.onnx Output path for the quantized model. --task string \u2014 Task name (e.g., image-classification, text-classification) used to select a task-appropriate calibration dataset. Pair with --model-name so the dataset is preprocessed exactly the way the model expects. Without --task, calibration falls back to synthetic random data. --model-name string \u2014 HuggingFace model ID (e.g., microsoft/resnet-50) used to load the matching preprocessor/tokenizer for calibration. Only used when --task is provided. --precision -p string None Precision shorthand: int8, int16, or mixed-precision like w8a16. Overridden by explicit --weight-type / --activation-type. --samples integer 10 Number of calibration samples used to compute quantization ranges. --method choice minmax Calibration algorithm: minmax, entropy, or percentile. --weight-type choice \u2014 Per-tensor type for weights: uint8, int8, uint16, or int16. Overrides --precision. When unset, defaults to uint8 (or the type implied by --precision). --activation-type choice \u2014 Per-tensor type for activations: uint8, int8, uint16, or int16. Overrides --precision. When unset, defaults to uint8 (or the type implied by --precision). --per-channel/--no-per-channel flag false Apply per-channel (rather than per-tensor) quantization to weight tensors. --symmetric/--no-symmetric flag false Use symmetric quantization (zero-point fixed at 0). --help -h flag Show this message and exit."},{"location":"commands/quantize/#how-it-works","title":"How it works","text":"winml quantize applies static post-training quantization (PTQ) using the ONNX Runtime quantization API. 
Calibration passes collect activation range statistics, which are used to compute scale and zero-point values baked into QuantizeLinear / DequantizeLinear node pairs around each eligible operator. The --method flag controls range estimation: minmax uses global observed extremes, entropy minimizes KL-divergence, and percentile clips outliers. Precision can be set at a coarse level with --precision or tuned per tensor type with --weight-type and --activation-type; explicit type flags always override --precision.
Calibration data is selected from --task and --model-name. For a supported task, a built-in default calibration dataset is loaded and preprocessed through the model's own tokenizer or image processor, so the calibration tensors match what the model will see at inference time. For an unsupported task \u2014 or when --task is omitted entirely \u2014 calibration falls back to synthetic random data synthesized from the ONNX input specification. Random-data calibration is fast and always works, but the resulting scales are typically less accurate than dataset-driven calibration, so always provide --task and --model-name when the model task is supported.
# Minimal quantization: defaults (10 samples, uint8 weights and activations)\nwinml quantize -m resnet50.onnx\n Input: resnet50.onnx\nOutput: resnet50_qdq.onnx\nWeight type: uint8\nActivation type: uint8\nSamples: 10\nMethod: minmax\n\nRunning quantization...\n\nSuccess! Model quantized\nOutput: resnet50_qdq.onnx\nQDQ nodes inserted: 53\nTotal time: 4.31s\n # Task-aware calibration: real samples preprocessed through the model's own image processor\nwinml quantize -m resnet50.onnx --task image-classification --model-name microsoft/resnet-50 --samples 128\n # int8 precision shorthand (equivalent to --weight-type int8 --activation-type int8)\nwinml quantize -m resnet50.onnx -p int8\n # Mixed-precision: int8 weights, uint16 activations with entropy calibration\nwinml quantize -m bert-base-uncased.onnx --weight-type int8 --activation-type uint16 --method entropy --samples 64\n # Per-channel symmetric quantization to a specific output path\nwinml quantize -m facebook_convnext.onnx -o facebook_convnext_qdq.onnx --per-channel --symmetric --samples 32\n # int16 precision (suitable for models sensitive to int8 accuracy loss)\nwinml quantize -m bert-base-uncased.onnx --precision int16\n"},{"location":"commands/quantize/#common-pitfalls","title":"Common pitfalls","text":"--task and --model-name, scales and zero-points are computed from random tensors synthesized from the ONNX input specification \u2014 the model never sees realistic activations, so accuracy after quantization can degrade noticeably. Always pass --task and --model-name for supported tasks (e.g., --task image-classification --model-name microsoft/resnet-50) so calibration runs on real samples preprocessed through the model's own tokenizer or image processor.--weight-type / --activation-type silently override --precision. If you pass both, the explicit type flags win. Omit --precision when setting types explicitly to avoid confusion.--per-channel increases model size. 
Per-channel quantization stores a separate scale and zero-point per output channel; this can noticeably inflate the model file size compared to per-tensor mode.
{stem}_qdq.onnx in the same directory as the input. Always pass -o when writing to a specific location to avoid accidentally overwriting or cluttering the source directory.
winml compile --no-quant instead if the model already contains QDQ nodes.
Inspect your machine \u2014 devices, EPs, and runtime versions at a glance.
"},{"location":"commands/sys/#when-to-use-this","title":"When to use this","text":"Run winml sys before starting any export or build workflow to confirm that the required ML libraries are installed and that the target hardware is visible. It is also the first command to run when diagnosing an unexpected export failure.
$ winml sys [options]\n"},{"location":"commands/sys/#flags","title":"Flags","text":"Flag Short Type Default Description --format -f text | json | compact text Output format. text renders rich tables, json emits machine-readable JSON, compact prints a single-line summary. --list-device \u2014 flag false List available compute devices (NPU, GPU, CPU) in priority order instead of showing the full system report. --list-ep \u2014 flag false List available ONNX Runtime execution providers instead of showing the full system report. Can be combined with --list-device. --verbose -v flag false Surface additional diagnostic sections: backend availability and Export Readiness. --help -h flag \u2014 Show help and exit. winml sys takes no --model, --device, --ep, --task, or --precision arguments. It describes the host environment, not a specific model.
winml sys queries Python's platform and importlib.metadata modules to report library versions, then probes PyTorch for CUDA availability and GPU device names. Backend availability checks use the installed runtime environment, while device enumeration queries hardware directly in NPU > GPU > CPU priority order, and EP enumeration merges the WinML EP registry with ONNX Runtime's get_available_providers(). When --format json is used the full report \u2014 including devices and EPs \u2014 is emitted as a single JSON object, making it easy to capture in CI pipelines.
# Full human-readable system report\n$ winml sys\n +------------------------------------+\n| winml-cli System Information |\n+------------------------------------+\n\nEnvironment\n Python Version 3.11.9\n Python Executable C:\\...\\python.exe\n OS Windows 11\n Machine AMD64\n\nML Libraries\n Library Version Status\n torch 2.4.0 OK\n transformers 4.44.0 OK\n onnx 1.16.1 OK\n ...\n\nAvailable Devices (priority order)\n #1 NPU Qualcomm(R) Hexagon NPU\n #2 GPU Qualcomm(R) Adreno GPU\n #3 CPU Snapdragon(R) X Elite\n\nAvailable Execution Providers\n QNNExecutionProvider -> NPU/GPU\n DmlExecutionProvider -> GPU\n CPUExecutionProvider -> CPU\n # Compact one-liner \u2014 useful for CI logs\n$ winml sys --format compact\n # Machine-readable JSON \u2014 pipe to jq or save for later comparison\n$ winml sys --format json > env.json\n # Only list devices \u2014 skip everything else\n$ winml sys --list-device\n # List EPs as JSON \u2014 useful for scripting EP selection\n$ winml sys --list-ep --format json\n"},{"location":"commands/sys/#common-pitfalls","title":"Common pitfalls","text":"--list-device and --list-ep suppress the full report. When either flag is present, only the requested section is printed. Omit both flags to see the complete system report.--format compact omits device and EP tables. The compact format is designed for single-line log entries and does not include device or EP details. Use text or json when you need the full picture.torch+cuXXX). A CPU-only torch wheel will always report cuda_available: false.--device / --ep flags interactNot every ONNX graph runs efficiently on every execution provider. An operator that compiles cleanly on CPU may be unsupported on an NPU, and a correct graph may still leave performance on the table because adjacent operations were not fused. winml-cli separates the concern into two commands \u2014 winml analyze and winml optimize \u2014 that together form a graph-quality loop driven automatically by winml build.
winml analyze performs static analysis on an ONNX graph to answer one question: will this model run end-to-end on my target execution provider, and if not, what needs to change?
Unlike profiling, static analysis does not require executing the full model on the target device. It inspects each operator (and recognized subgraph pattern) against a rule database of known EP capabilities, classifies every node, and emits actionable recommendations. The same analyzer also drives the autoconf feedback loop inside winml build, so understanding how it works is useful even when you never invoke winml analyze directly.
Specify a target EP with --ep (e.g., --ep qnn or --ep openvino) and a device with --device (CPU, GPU, or NPU). The default --ep auto infers from locally available EPs; pass --ep all to evaluate every rule-data-backed EP regardless of local availability. Results print to the console by default; add --output results.json to save the report as JSON for scripting or archiving.
For each operator (and matched subgraph pattern) the analyzer follows a two-step process:
--run-unknown-op is enabled, the analyzer builds a minimal ONNX graph for the op and runs it on the target EP locally to determine support (see Local op execution below).The combined answer is recorded as a SupportLevel:
| SupportLevel | Runs on target EP | Runs with CPU fallback | Label | Exit code |
|---|---|---|---|---|
| SUPPORTED | yes | yes | Fully Supported | 0 |
| PARTIAL | no | yes | Partial Support | 1 (warning) |
| UNSUPPORTED | no | no | Not Supported | 1 (error) |
| UNKNOWN | n/a | n/a | Unknown Support | 1 |

A PARTIAL classification means the operator cannot be dispatched to the requested EP but the ONNX Runtime can still execute the model by falling back to CPU. This is technically a working model, but the latency and power-efficiency goals of NPU deployment are not met. UNSUPPORTED means even the CPU fallback path fails, so the model will not run at all. UNKNOWN appears only when the analyzer lacks both rule-database data and the ability to test locally.
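A minimal Python sketch of the classification rule described above. The names `SupportLevel`, `classify`, and `exit_code` are hypothetical, and the sketch assumes the two yes/no values per level mean "dispatches to the target EP" and "runs with CPU fallback"; it is an illustration, not the analyzer's actual implementation.

```python
from enum import Enum
from typing import Optional


class SupportLevel(Enum):
    """Hypothetical mirror of the four support levels and their labels."""
    SUPPORTED = "Fully Supported"
    PARTIAL = "Partial Support"
    UNSUPPORTED = "Not Supported"
    UNKNOWN = "Unknown Support"


def classify(runs_on_ep: Optional[bool], runs_with_cpu_fallback: Optional[bool]) -> SupportLevel:
    """Combine the two answers into one level; None means 'no data available'."""
    if runs_on_ep is None or runs_with_cpu_fallback is None:
        return SupportLevel.UNKNOWN
    if runs_on_ep:
        return SupportLevel.SUPPORTED
    if runs_with_cpu_fallback:
        return SupportLevel.PARTIAL
    return SupportLevel.UNSUPPORTED


def exit_code(level: SupportLevel) -> int:
    """Only a fully supported result maps to exit code 0."""
    return 0 if level is SupportLevel.SUPPORTED else 1
```

Modelling "no data" as `None` keeps UNKNOWN distinct from a measured failure, which matches the description of UNKNOWN as the absence of both rule data and a local test.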
Every analysis produces a lint result; the default (full) mode additionally produces an autoconf result. Understanding these two outputs separately is the easiest way to understand what winml analyze is for and how to consume it.
Lint is the analyzer's verdict on the model as it stands today. It classifies every operator and recognized pattern against the target EP and rolls the classifications up into:
errors \u2014 count of UNSUPPORTED patterns. The model will not run.warnings \u2014 count of PARTIAL patterns. The model runs, but these nodes fall back to CPU.passed \u2014 True iff errors == 0 and warnings == 0.Lint always runs. It is deterministic and sufficient for a yes/no CI gate \u2014 the CLI's exit code is derived from it.
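The roll-up from per-pattern classifications to a lint verdict can be sketched in a few lines of Python. `LintResult` and `lint` are hypothetical names used only for illustration; the real result object may carry more fields.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass
class LintResult:
    errors: int    # number of UNSUPPORTED patterns
    warnings: int  # number of PARTIAL patterns

    @property
    def passed(self) -> bool:
        # Passed only when there are neither errors nor warnings.
        return self.errors == 0 and self.warnings == 0


def lint(levels: Iterable[str]) -> LintResult:
    """Count classifications by level name and roll them up."""
    counts = Counter(levels)
    return LintResult(errors=counts["UNSUPPORTED"], warnings=counts["PARTIAL"])
```

For example, `lint(["SUPPORTED", "PARTIAL"])` yields zero errors and one warning, so `passed` is false even though the model would still run.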
Autoconf is the analyzer's suggestion for how to fix the current model. It lists the fusion flags which, if enabled in the optimize stage, would convert one or more PARTIAL/UNSUPPORTED patterns into SUPPORTED ones.
Autoconf is what powers the build pipeline's re-optimization loop: when the analyzer says \"gelu_fusion would resolve these warnings\", the build re-runs optimize with that flag and re-analyzes \u2014 until no further suggestions remain or the iteration limit is hit. Autoconf is advisory; nothing else in the system flips fusion flags automatically.
winml analyze can run in two modes which differ only in whether autoconf is computed:
--no-information (CLI) or autoconf=False (Python) Lint only. optimization_config is None. CI gate; pass/fail only Full (default) --information (CLI, default) or autoconf=True (Python) Lint plus autoconf and recommendations Local debugging; build pipeline's autoconf loop The only difference between the two modes is whether autoconf and the human-readable recommendations are computed. Skipping them gives a faster, leaner run. The lint result is identical either way.
"},{"location":"concepts/analyze-and-optimize/#three-classes-of-finding","title":"Three classes of finding","text":"Every analysis emits findings in three buckets. Each bucket maps to a different remediation pattern.
Errors (UNSUPPORTED patterns) block deployment. Either the operator does not exist on the target EP at all, or it does not handle the specific input shape/dtype the model uses. Typical remediations:
Each error pattern includes a recommendation that identifies the current pattern and the target pattern the EP does support, so the optimizer (or a manual rewrite) can apply the fix.
Warnings (PARTIAL patterns) mean the model will run, but the target EP cannot dispatch this pattern. Inference falls back to the CPU EP, breaking the deployment goal (e.g., NPU offload) without breaking correctness. Warnings are usually fusion opportunities \u2014 the analyzer recognized a sub-pattern that, if fused, would become a single EP-native op. The fix is to enable the relevant fusion flag in the optimize stage \u2014 this is exactly what the autoconf loop does automatically.
Info (Information items) are lower-priority insights: a hint that an alternative pattern exists, a QDQ-equivalent that could be used after quantization, or a description of why a node was classified as it was. Info entries never affect exit code.
The static rule database does not cover every operator and every shape/dtype combination. When --run-unknown-op is enabled and the analyzer encounters a pattern not present in the database, it builds a tiny ONNX graph containing just that op (with the model's actual input metadata) and runs it on the target EP locally. The compile/run result becomes the classification. Without --run-unknown-op (the default), such patterns are classified as UNKNOWN.
Leave --run-unknown-op disabled when:
When a pattern is unsupported and the recommendation does not immediately tell you what is wrong, use --save-node to dump the offending subgraph to disk as a self-contained, runnable .onnx file. You can then open it in Netron, re-analyze it in isolation, or attach it to a bug report as a minimal reproducer. See the analyze command reference for usage examples.
When a model is exported with hierarchy-preserving tags (HTP), the export produces a sidecar _htp_metadata.json that maps each ONNX node back to its source module (e.g., encoder.layer.0.attention.self.GELUActivation). Passing this file via --htp-metadata lets the PatternExtractor use the module hierarchy to match subgraph patterns more accurately than operator-level heuristics alone.
HTP metadata is consumed at the pattern extraction stage \u2014 before any EP-specific runtime checking \u2014 so the enriched patterns benefit all target EPs equally (QNN, OpenVINO, VitisAI, etc.). Without HTP metadata, the analyzer falls back to attribute-based tag matching and then the general-purpose PatternMatcher; with it, the analyzer can correctly identify fused patterns (GELU, LayerNorm, Attention) that are difficult to detect from the raw operator graph. See the analyze command reference for usage examples.
The analyzer is composed of five stages that run in order. You normally do not need to think about them, but they are worth knowing when reading recommendations or extending the analyzer:
| Stage | Job |
|---|---|
| ONNXLoader | Load the ONNX file (or ModelProto), record metadata. |
| PatternExtractor | Walk the graph, match operator and subgraph patterns from the rule catalog. Optionally consume HTP metadata. |
| RuntimeChecker | For each pattern, consult the rule database; if no rule applies, run the op locally (when allowed). |
| InformationEngine | Turn classifications into human-readable Information items; also runs model validators (constant folding, dynamic input, pattern matching, QDQ validation, shape inference). |
| OutputAggregator | Assemble the final AnalysisOutput (the JSON you get from --output). |

The model validators run regardless of whether there are runtime check results. They are model-level sanity checks (e.g., is shape inference complete? are QDQ pairs well-formed?) and can surface issues even when every operator looks fine in isolation.
"},{"location":"concepts/analyze-and-optimize/#what-optimize-does","title":"What optimize does","text":"winml optimize rewrites the ONNX graph by applying fusions and structural simplifications. Internally the optimizer runs four pipes in sequence:
Every optimization is a named capability toggled via --enable-<name> and --disable-<name> flags. Run --list-capabilities to see all registered optimizations and their defaults. The optimizer currently ships 57 static capabilities across 13 categories:
This granularity matters when a specific fusion breaks a downstream step or when you need an exact optimization profile for a given EP. Some capabilities declare dependencies (e.g., bias-gelu-fusion requires gelu-fusion); the optimizer resolves these automatically when you enable a flag.
Pattern rewrites are a complementary mechanism: instead of folding nodes, rewrites replace one subgraph pattern with a structurally equivalent alternative. Rules are defined in JSON files (default.json for general rewrites, qnn.json for QNN-specific rewrites). The optimizer currently ships 5 rewrite groups containing 12 individual rules \u2014 for example, four GELU source variants can each be rewritten to a single Gelu op, and a MatMul+Add pattern can be rewritten to a GEMM or to a Conv2D for Qualcomm NPU targets. Run --list-rewrites to discover available families and their flag names. Flags follow the form --enable-<source-slug>-<target-slug>.
Commit a specific combination of flags to a --config file for reproducible builds.
A single optimize pass may create fusion opportunities that were not present before, and a freshly fused graph may surface new operator compatibility issues. This is why winml build runs analyze and optimize in an alternating loop rather than once each.
The flow inside winml build (implemented in run_optimize_analyze_loop) is:
The initial optimize pass applies the flags from config.optim. The analyzer then inspects the result; if autoconf discovers fusion flags that were not yet enabled, the optimizer re-runs with those flags and the analyzer re-checks. This repeats up to --max-optim-iterations rounds (default: three). The loop exits early when autoconf suggests no further changes. After the loop, a final analysis validates the result \u2014 if unsupported patterns still exist, the build raises a RuntimeError.
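The control flow described above can be sketched as follows. This is an illustration under stated assumptions, not the code in run_optimize_analyze_loop: the function name, the `optimize` and `analyze` callables, and the report keys `suggested_flags` and `unsupported` are all hypothetical, and the sketch assumes each re-run optimizes the original input with the accumulated flag set.

```python
def run_optimize_analyze_sketch(source, initial_flags, optimize, analyze, max_iterations=3):
    """Alternate optimize and analyze until autoconf has nothing new to suggest."""
    enabled = set(initial_flags)
    model = optimize(source, enabled)          # initial pass with the configured flags

    for _ in range(max_iterations):
        report = analyze(model)
        new_flags = set(report["suggested_flags"]) - enabled
        if not new_flags:                      # early exit: no further suggestions
            break
        enabled |= new_flags
        model = optimize(source, enabled)      # re-run with the suggested flags added

    final = analyze(model)                     # final validation after the loop
    if final["unsupported"]:
        raise RuntimeError(f"unsupported patterns remain: {final['unsupported']}")
    return model, enabled
```

Passing `optimize` and `analyze` in as callables keeps the sketch testable with stubs; in a real build they would wrap the optimizer and analyzer stages.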
Use --no-analyze to skip the loop and run a single optimization pass \u2014 useful for deterministic rebuilds from a fixed ONNX checkpoint where the graph is already known good.
winml analyze (CLI) \u2014 exit code is the contract Embed analysis in a build script or notebook analyze_onnx(model, ep=...) (flat Python API) Post-process the full result programmatically ONNXStaticAnalyzer().analyze(...) (class API) Analyze an in-memory ModelProto ONNXStaticAnalyzer().analyze_from_proto(...) Optimize with full control over fusions winml optimize (CLI) with --enable- / --disable- flags Reproducible build from a config file winml build -c config.json (pipeline wrapper) The CLI and the flat Python API are sufficient for the vast majority of cases. The class-based API is only needed when you want to call is_fully_supported(ep), get_unsupported_operators(ep), or get_optimization_opportunities(ep) on the full result.
When you run winml compile, you are not simply copying an ONNX file to a new location. You are asking an execution provider (EP) to transform the model into a form it can load and run directly, without repeating that transformation at every startup. Understanding what the compiler produces \u2014 and why \u2014 helps you decide when to compile, what output format to choose, and how to balance file size against runtime performance.
Compilation is an offline, one-time step. The artifact it creates is what you ship with your application and what winml-cli uses for benchmarking and evaluation.
For EPs that are fully integrated into ONNX Runtime \u2014 CPU, DirectML, and similar providers \u2014 the compile step writes a new .onnx file that the runtime loads directly. The ONNX graph has been prepared and, in some cases, partitioned so that the EP's session initializer has less work to do when the application starts.
For EPs that support ahead-of-time compilation (e.g. --ep qnn for Qualcomm NPUs and --ep vitisai for AMD NPUs), the compiler goes further. It takes the ONNX graph and produces a binary artifact \u2014 the EP context blob \u2014 that encodes the fully compiled, hardware-ready version of the network. This blob is then associated with the ONNX model file. On subsequent loads, the EP reads the blob rather than re-compiling the graph, which makes session creation dramatically faster.
The default compiler backend is ort (ONNX Runtime).
For QNN compilation, winml-cli gives you a choice of where the EP context blob lives. By default the blob is written as a sidecar .bin file alongside the .onnx. Passing --embed instead inlines the blob directly into the ONNX file.
External (default): The .onnx is small and human-inspectable; the heavy binary data lives in a separate file. You must keep the two files together \u2014 the ONNX stores a relative path back to the .bin. This layout is preferable for version control and for scenarios where you want to inspect or diff the model graph.
Embedded (--embed): Everything ships in a single .onnx file. Deployment is simpler because there is only one artifact to track. The trade-off is file size: the .onnx grows by the full size of the compiled context, and the file is no longer human-readable in the usual sense. Choose embedded when your deployment tooling expects a single model file, or when you want to minimize the chance of the sidecar being misplaced.
The first time an ONNX Runtime session is created for a model on a hardware EP, the runtime must partition the graph, allocate buffers, and JIT-compile the operators. On an NPU this process can take several seconds. For applications with tight startup budgets \u2014 on-device inference in a UI flow, for example \u2014 that cold-start cost is often unacceptable.
A model produced by winml compile has already paid that cost. The EP context blob is the result of compilation, not its input. When the application loads the compiled model the EP reads the pre-built binary and the session is ready almost immediately. Shipping a compiled model is therefore the standard pattern for production deployments on QNN hardware.
If you are iterating on quantization settings or ONNX graphs and want to check whether the model compiles at all, pass an already-quantized (QDQ) model directly \u2014 winml compile compiles whatever ONNX file you supply and does not have a separate quantization pass to skip.
By default winml compile runs a validation pass after compilation finishes \u2014 it loads the compiled model into an inference session, feeds it dummy inputs (all-ones tensors), and checks that the outputs do not contain NaN or Inf values. This catches basic compilation failures early (e.g., the EP rejecting the graph or producing garbage outputs).
The --no-validate flag skips that pass. It is useful during rapid iteration when you only want to confirm that compilation succeeds without the overhead of a trial inference run.
--ep / --device flags
winml config and winml build are a producer/consumer pair. winml config inspects a Hugging Face model (or an existing ONNX file), auto-detects the task, model class, and I/O specifications, and writes a WinMLBuildConfig JSON file. winml build reads that file and runs the full pipeline (export, optimize, quantize, compile), producing a Windows ML-ready ONNX artifact.
Keeping these two responsibilities separate is intentional. The config file is a stable, human-readable description of exactly what the build will do. You can generate it once, review or edit it, commit it to source control, and replay the same build at any time without re-running model introspection. CI pipelines and team workflows both benefit from treating the config file as a versioned artifact rather than a transient intermediate.
"},{"location":"concepts/config-and-build/#generating-a-config","title":"Generating a config","text":"winml config produces a WinMLBuildConfig JSON with sensible defaults for the detected model type. At minimum, provide a model identifier:
winml config -m microsoft/resnet-50 -o resnet50.json\n Several flags shape what ends up in the config:
--task overrides the auto-detected Hugging Face task when detection is ambiguous or when you want a specific variant (for example, text-classification vs feature-extraction).--no-quant sets the quant section to null, so the quantize stage is omitted when winml build consumes the config. Use this for GPU workflows where float16 is preferred over QDQ quantization.--no-compile sets the compile section to null, producing a portable ONNX that the runtime compiles on first load instead of embedding a pre-compiled binary.--trust-remote-code allows model repositories that ship custom modeling code \u2014 required for some community models that define non-standard architectures outside the standard transformers library.If -o is omitted, the config is printed to stdout, which is convenient for piping or quick inspection. The generated JSON is plain text and can be edited directly before being passed to winml build.
A WinMLBuildConfig is a dataclass defined in src/winml/modelkit/config/build.py. It holds five nested sub-configs for the pipeline stages, plus an evaluation config and an auto flag:
| Field | Type | Description |
|---|---|---|
| loader | WinMLLoaderConfig | Task, model type, and model class used to load the Hugging Face model. |
| export | WinMLExportConfig | Input/output tensor specs, opset version, dynamic axes (null for pre-exported ONNX). |
| optim | WinMLOptimizationConfig | Graph fusion flags (GeLU, LayerNorm, MatMul+Add). |
| quant | WinMLQuantizationConfig | Precision types (weight_type, activation_type), calibration samples and method (null to skip). |
| compile | WinMLCompileConfig | Target EP provider, EPContext options, compiler backend (null to skip). |
| eval | WinMLEvaluationConfig \\| null | Evaluation settings run after the build (null to skip). |
| auto | bool | When true (default), auto-fills missing fields from model introspection. |

Setting quant or compile to null tells the pipeline to skip that stage entirely, equivalent to passing --no-quant or --no-compile on the command line.
A generated config looks similar to:
{\n \"loader\": {\n \"task\": \"image-classification\"\n },\n \"export\": {\n \"opset_version\": 17,\n \"batch_size\": 1\n },\n \"optim\": {\n \"gelu_fusion\": false,\n \"layer_norm_fusion\": false,\n \"matmul_add_fusion\": false\n },\n \"quant\": {\n \"mode\": \"qdq\",\n \"weight_type\": \"uint8\",\n \"activation_type\": \"uint8\",\n \"samples\": 10\n },\n \"compile\": {\n \"execution_provider\": \"qnn\",\n \"enable_ep_context\": true\n }\n}\n The file is plain JSON. You can hand-edit any field before passing it to winml build \u2014 adjust the calibration sample count, change the compile provider, or remove a fusion flag.
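Hand edits like these can also be scripted. The sketch below uses only Python's standard `json` module and the field names shown in the sample config above; the helper name `tweak_config` is hypothetical, and a real config may contain fields not shown here.

```python
import json


def tweak_config(config_text, samples=None, skip_compile=False):
    """Return edited config JSON text; the input text is left untouched."""
    config = json.loads(config_text)

    if samples is not None:
        if config.get("quant") is None:
            raise ValueError("config has no quant section to edit")
        config["quant"]["samples"] = samples   # adjust calibration sample count

    if skip_compile:
        config["compile"] = None               # serialized as null: skip the stage

    return json.dumps(config, indent=2)
```

Writing the returned text to a new file, rather than overwriting the original, keeps both variants available for comparison.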
Pass the config file to winml build with either an output directory or the global cache flag:
# Write artifacts to a local directory\nwinml build -c resnet50.json -m microsoft/resnet-50 --output-dir output/\n\n# Write to the global cache (~/.cache/winml/)\nwinml build -c resnet50.json -m microsoft/resnet-50 --use-cache\n --output-dir and --use-cache are mutually exclusive; you must supply one of the two when running winml build (enforced at runtime, not parse time). Within the output directory, winml build writes one ONNX file per completed stage so that intermediate artifacts are available for inspection, and it writes a copy of the resolved config so the full build parameters are recorded alongside the outputs.
CLI flags passed directly to winml build override the corresponding config sections for that run only, without modifying the JSON file on disk. This makes it straightforward to experiment with a variation without creating a new config:
# Skip quantization and compilation for this run only\nwinml build -c resnet50.json -m microsoft/resnet-50 --output-dir output/ --no-quant --no-compile\n\n# Skip optimization (for a pre-quantized input ONNX)\nwinml build -c resnet50.json -m model_qdq.onnx --output-dir output/ --no-optimize\n --no-quant, --no-compile, and --no-optimize each suppress the corresponding stage regardless of what the config file specifies. Because the config file is unchanged, re-running without the override flag reverts to the full pipeline described in the config.
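One way to picture the override semantics is as a pure function from the config plus per-run flags to the list of stages that will execute. This Python sketch is an assumption-laden illustration: `stages_to_run` is a hypothetical helper, and the real pipeline may decide stage execution differently.

```python
def stages_to_run(config, no_optimize=False, no_quant=False, no_compile=False):
    """Compute the stages for one run without mutating the config dict."""
    stages = []
    if config.get("export") is not None:       # null export: input is already ONNX
        stages.append("export")
    if not no_optimize:
        stages.append("optimize")
    if config.get("quant") is not None and not no_quant:
        stages.append("quantize")
    if config.get("compile") is not None and not no_compile:
        stages.append("compile")
    return stages
```

Because the function only reads the config, calling it again without the flags returns the full pipeline, mirroring how the JSON file on disk stays unchanged between runs.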
Storing the WinMLBuildConfig JSON in source control brings three concrete benefits:
Reproducibility. A config file pins every build decision (task, precision, quantization method, calibration sample count, target EP, fusion flags) in a single file. Running winml build -c config.json six months later should apply the same settings as it does today even if the tool's defaults change, provided the relevant fields are set explicitly in the file rather than left for auto to fill in.
CI integration. A CI job can run winml build -c config.json -m <model-id> --output-dir artifacts/ with no human intervention. Because all settings live in the config file, the CI script requires no per-model flag knowledge, and updating build parameters is a pull request to the config file, not a change to the pipeline script.
Team sharing. Handing a colleague a config file is enough for them to reproduce the exact build on their machine. There is no need to document the sequence of primitive commands, precision arguments, or calibration settings separately \u2014 the file is the documentation.
winml build vs individual primitive commands
An Execution Provider (EP) is a pluggable backend in ONNX Runtime that claims and runs a subset of graph nodes on a specific hardware target. When ONNX Runtime loads a model it partitions the graph among the registered EPs: operators that an EP claims are dispatched to it, and the remainder fall back to the CPU EP. This design lets a single ONNX model exploit an NPU, GPU, or CPU without any change to the graph itself.
A device is the hardware category that an EP targets \u2014 one of npu, gpu, or cpu. winml-cli exposes both levels of control: the high-level --device flag selects a hardware category, while the low-level --ep flag pins a specific ONNX Runtime provider name. In most workflows you set --device and let winml-cli resolve the best available EP; you reach for --ep when you need to compare or force a specific provider.
The table below lists every Execution Provider that winml-cli has explicit support for. EP names are the canonical ONNX Runtime strings accepted by --ep. You can also use the short alias (case-insensitive) anywhere the full name is accepted.
| EP name | Alias | Device | Hardware | Notes |
|---|---|---|---|---|
| QNNExecutionProvider | qnn | npu / gpu | Qualcomm NPU (Hexagon DSP) / Qualcomm GPU (Adreno) | Snapdragon-based Copilot+ PCs; best latency and power efficiency on Qualcomm silicon |
| VitisAIExecutionProvider | vitisai | npu | AMD NPU (XDNA) | AMD Ryzen AI platforms; targets the AMD AI Engine via the Vitis AI stack |
| OpenVINOExecutionProvider | openvino | npu / gpu / cpu | Intel CPU / GPU / NPU | Intel Core Ultra platforms; flexible device targeting across all three Intel compute types |
| DmlExecutionProvider | dml | gpu | GPU (DirectML) | Any DirectX 12 GPU on Windows; broad compatibility across AMD, Intel, and NVIDIA discrete/integrated graphics |
| NvTensorRTRTXExecutionProvider | nv_tensorrt_rtx | gpu | NVIDIA GPU (TensorRT RTX) | NVIDIA RTX GPUs; maximum throughput via TensorRT graph optimization |
| MIGraphXExecutionProvider | migraphx | gpu | AMD GPU (MIGraphX) | AMD discrete GPUs; hardware-accelerated inference via the MIGraphX graph engine |
| CPUExecutionProvider | cpu | cpu | CPU | Universal fallback; always available regardless of hardware |

To see which EPs are available on the current machine, run:
winml sys --list-ep\n"},{"location":"concepts/eps-and-devices/#device-vs-ep-on-the-cli","title":"Device vs. EP on the CLI","text":"winml-cli exposes two overlapping flags for targeting hardware. Understanding their relationship prevents confusion when using winml analyze, winml compile, or winml build.
--device (high-level)
Accepts one of four values: auto, cpu, gpu, or npu. When set to auto (the default), winml-cli inspects the machine and selects the highest-priority device class that has a compatible EP available, in the order NPU > GPU > CPU. Setting an explicit value such as --device npu requests a device category without naming the EP.
For winml analyze, --device also accepts all \u2014 this evaluates the model against every device that has rule data, producing a side-by-side compatibility report.
# Let winml-cli pick the best available device\nwinml analyze --model model.onnx --device auto\n\n# Target the NPU device class\nwinml analyze --model model.onnx --device npu\n\n# Analyze against all devices at once (analyze only)\nwinml analyze --model model.onnx --device all\n --ep (low-level override)
Accepts a valid EP name or alias (for example qnn, vitisai, dml, openvino), or auto to let winml-cli resolve the EP from the device. When --ep is provided with a specific value it takes precedence over --device and bypasses device-class resolution entirely. Use --ep when you need to pin a specific provider \u2014 for instance to compare QNNExecutionProvider against DmlExecutionProvider on the same machine.
For winml analyze, --ep also accepts all \u2014 this evaluates the model against every registered EP simultaneously.
# Force Qualcomm QNN regardless of device selection\nwinml analyze --model model.onnx --ep QNNExecutionProvider --device npu\n\n# Use the short alias; winml-cli normalizes it to the full name\nwinml analyze --model model.onnx --ep qnn\n\n# Analyze against all EPs at once (analyze only)\nwinml analyze --model model.onnx --ep all\n The --ep flag accepts a free-form string and is not restricted to the choices listed above. This allows forward compatibility with EP names that winml-cli does not yet enumerate.
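The interaction between the two flags can be summarized in a short Python sketch. The EP names, aliases, and device classes come from the table of supported EPs; `resolve_ep` itself is a hypothetical helper, and the tie-break between several EPs serving the same device (first in the `available` list wins) is an assumption, not documented behavior.

```python
EP_DEVICES = {
    "QNNExecutionProvider": ("npu", "gpu"),
    "VitisAIExecutionProvider": ("npu",),
    "OpenVINOExecutionProvider": ("npu", "gpu", "cpu"),
    "DmlExecutionProvider": ("gpu",),
    "NvTensorRTRTXExecutionProvider": ("gpu",),
    "MIGraphXExecutionProvider": ("gpu",),
    "CPUExecutionProvider": ("cpu",),
}
ALIASES = {
    "qnn": "QNNExecutionProvider", "vitisai": "VitisAIExecutionProvider",
    "openvino": "OpenVINOExecutionProvider", "dml": "DmlExecutionProvider",
    "nv_tensorrt_rtx": "NvTensorRTRTXExecutionProvider",
    "migraphx": "MIGraphXExecutionProvider", "cpu": "CPUExecutionProvider",
}
PRIORITY = ("npu", "gpu", "cpu")  # device classes, highest priority first


def resolve_ep(device="auto", ep="auto", available=()):
    """Pick an EP name: an explicit --ep wins, otherwise resolve from the device."""
    if ep != "auto":
        # Aliases are case-insensitive; unknown strings pass through unchanged.
        return ALIASES.get(ep.lower(), ep)
    candidates = PRIORITY if device == "auto" else (device,)
    for dev in candidates:
        for name in available:
            if dev in EP_DEVICES.get(name, ()):
                return name
    raise LookupError(f"no available EP for device {device!r}")
```

Note that an explicit `ep` short-circuits before `device` is examined, which is the precedence rule the section describes.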
winml eval answers one question: does this model produce correct results? It measures accuracy \u2014 how well outputs match ground truth \u2014 rather than latency or throughput. You give it a model, point it at a labeled dataset, and get back a JSON report of metric scores. Everything else in the pipeline (compilation, quantization, device selection) is about making the model fast; eval is about knowing whether it is still right.
The dataset is the source of truth. Eval iterates over dataset rows, runs each sample through the model, and compares the prediction to the label recorded in the dataset. This means the dataset must have both input features and ground-truth labels, and the columns carrying those values must be wired to the model's inputs and outputs. winml-cli handles standard tasks automatically, but the column-mapping flags let you override the defaults for non-standard datasets.
"},{"location":"concepts/eval-and-datasets/#what-eval-reports","title":"What eval reports","text":"The metric reported depends on the task. Classification tasks produce accuracy (top-1 and optionally top-5). Object detection tasks produce mean average precision (mAP). The exact set of metrics is printed to stdout and saved to the file specified by --output. The --output flag accepts any .json path; if omitted, results are printed but not persisted. Use --schema to print the expected dataset schema for a given task without running eval, which is useful when you are preparing a custom dataset.
--dataset takes a Hugging Face dataset path \u2014 for example imagenet-1k or glue. If you omit it, winml-cli selects a default dataset based on the detected task. For datasets that have multiple configurations, --dataset-name picks the specific config (e.g. --dataset-name mrpc when using the glue dataset).
By default eval runs on the validation split; --split overrides this. Full validation sets can be large. During development, --samples 200 caps the run to 200 rows so you get quick feedback. For very large datasets that you prefer not to download fully, --streaming fetches rows on demand instead of materialising the whole dataset locally. --shuffle (on by default) randomises sampling order so a capped run is representative rather than biased toward the first rows.
winml-cli must know which dataset column feeds which model input and which column holds the ground-truth label. For well-known task/dataset combinations this mapping is built in. When it is not, use --column key=value to declare it. The key is the name the task pipeline expects (e.g. input_column) and value is the actual column name in the dataset (e.g. image). You can repeat --column as many times as needed.
When the integer label IDs in the dataset do not match the class indices the model was trained against, --label-mapping accepts a JSON file of the form {\"class_name\": id} that translates between the two spaces. This is common with models fine-tuned on a relabelled subset of a public dataset.
Quantization is a lossy transformation. Converting weights from float32 to int8, or activations to a narrow range, introduces rounding error that accumulates differently across architectures and calibration data. The impact on accuracy cannot be predicted analytically; it must be measured. Running winml eval before and after quantization gives you a concrete accuracy delta. A drop within your acceptable threshold confirms the quantized model is ready; a larger drop means you should revisit calibration settings or switch to a less aggressive quantization scheme.
Make this a habit: quantize, then eval. Comparing two --output JSON files is a reliable, reproducible record that the trade-off between performance and accuracy was explicitly checked. See Quantization for the full quantization workflow.
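Comparing the two reports can be automated. The Python sketch below assumes each --output file is a flat JSON object with a numeric `accuracy` key; that schema, the helper name `accuracy_drop`, and the default threshold are assumptions for illustration, so adapt the key lookup to the structure of the actual report.

```python
import json


def accuracy_drop(baseline_text, quantized_text, metric="accuracy", max_drop=0.01):
    """Return (drop, within_threshold) for one metric across two eval reports."""
    baseline = json.loads(baseline_text)[metric]
    quantized = json.loads(quantized_text)[metric]
    drop = baseline - quantized        # positive means the quantized model is worse
    return drop, drop <= max_drop
```

A CI job could read both files, call this helper, and fail the build when the second element of the result is false.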
winml eval command reference \u2014 all flags with examples
A .onnx file is, at rest, a binary-serialized Protocol Buffer. Open it in a hex editor and you will find no magic header: the file begins directly with protobuf-encoded fields, a dense encoding of the model's learned weights plus the structural description of how those weights are combined to produce a prediction. The file is normally self-contained (very large models may keep weights in external data files): weights and computation recipe live together, making the artifact portable without any accompanying framework installation.
That computation recipe is a graph \u2014 a directed acyclic structure of operators wired together by named data edges. The graph is what the ONNX Intermediate Representation (IR) actually defines. When winml-cli loads or transforms a model, every operation works against this graph structure, not against framework-specific objects.
"},{"location":"concepts/graphs-and-ir/#what-is-in-a-onnx-file","title":"What is in a .onnx file","text":"An ONNX ModelProto wraps a single GraphProto. Inside the graph you will find:
pixel_values: float32[1, 3, 224, 224]).
winml.io.inputs (serialized tensor specs) and winml.hierarchy.tag attributes on individual nodes.
ONNX functions as an Intermediate Representation: a portable, framework-neutral description of a computation that can be loaded by any conforming runtime. Unlike a Python object graph or a compiled binary, the ONNX IR makes data flow completely explicit. Every node declares the exact names of its input and output edges; those names form a namespace shared across the whole graph, so any consumer can trace a tensor from the model inputs through every transformation to the final output.
This explicit wiring unlocks two capabilities that winml-cli relies on heavily. First, shape inference can propagate concrete or symbolic dimensions through the graph without running it \u2014 a prerequisite for correct quantization and for generating input specs automatically. Second, EP-targeted compilation can partition the graph by examining which nodes an Execution Provider supports, fuse eligible sub-graphs into accelerated kernels, and serialize the result back into a valid ONNX file using the EPContext convention. Neither of these would be tractable on an opaque binary or a dynamic execution trace.
Because the IR is static \u2014 describing the full computation at load time rather than at call time \u2014 winml-cli can inspect, validate, and transform a model without a GPU, a framework, or sample data.
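To make the idea of named edges concrete, here is a toy sketch in plain Python. It does not use the ONNX library: the node dictionaries, operator choices, and edge names are invented stand-ins for a graph's nodes. It shows how explicit edge names let a tool find every node that contributes to a tensor without running anything.

```python
# Toy graph: each node lists the names of the edges it reads and writes.
nodes = [
    {"name": "conv", "op": "Conv", "inputs": ["pixel_values", "w0"], "outputs": ["t0"]},
    {"name": "relu", "op": "Relu", "inputs": ["t0"], "outputs": ["t1"]},
    {"name": "pool", "op": "GlobalAveragePool", "inputs": ["t1"], "outputs": ["t2"]},
    {"name": "fc", "op": "Gemm", "inputs": ["t2", "w1", "b1"], "outputs": ["logits"]},
]


def trace_back(graph_nodes, tensor_name):
    """Return the names of nodes that contribute to tensor_name, in graph order."""
    producer = {out: node for node in graph_nodes for out in node["outputs"]}
    needed, stack = set(), [tensor_name]
    while stack:
        edge = stack.pop()
        node = producer.get(edge)  # None for graph inputs and weights
        if node is None or node["name"] in needed:
            continue
        needed.add(node["name"])
        stack.extend(node["inputs"])
    return [n["name"] for n in graph_nodes if n["name"] in needed]


print(trace_back(nodes, "t1"))
print(trace_back(nodes, "logits"))
```

The same walk, run over real graph nodes, is the kind of static analysis that shape inference and EP partitioning build on.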
"},{"location":"concepts/graphs-and-ir/#opsets-and-versioning","title":"Opsets and versioning","text":"Every operator in ONNX belongs to a domain, and every domain advances through numbered opset versions. An opset is a snapshot of the operator catalog: it defines which operators exist, what their inputs and outputs mean, and how edge cases are handled. When a model declares opset_import { domain: \"\" version: 17 }, it is saying \"all unnamed-domain operators in this file must be interpreted according to the rules published in opset 17.\"
winml-cli defaults to opset 17 when exporting a PyTorch model to ONNX. This is the value of opset_version: int = 17 in WinMLExportConfig (src/winml/modelkit/export/config.py, line 75). Opset 17 introduced a native LayerNormalization operator, eliminating the multi-node decomposition required by earlier opsets (a native GroupNormalization operator followed in opset 18), which is why opset 17 is the recommended baseline for modern transformer and vision architectures.
Higher opsets unlock additional operators and fix known edge-case behavior, but not every Execution Provider supports the latest opset. QNN, for instance, may lag behind the ONNX standard by one or two versions. If you need to target an older EP, pass a custom export configuration:
# Write a config override\necho '{\"opset_version\": 16}' > export_cfg.json\n\n# Export with the override\nwinml export -m prajjwal1/bert-tiny -o bert.onnx --export-config export_cfg.json\n You can also check the opset a saved model declares:
winml inspect -m bert.onnx\n Opset: ai.onnx == 17\n When winml-cli's optimization and quantization pipelines transform a model, they preserve the declared opset unless explicitly instructed otherwise, so the model you receive after winml quantize will carry the same opset version as the model you supplied.
winml-cli is a toolkit for converting PyTorch and Hugging Face models into ONNX artifacts that are optimized and compiled for Windows ML execution providers (EPs). Starting from a model identifier or a pre-exported ONNX file, winml-cli runs a staged pipeline \u2014 export, optimize, quantize, compile \u2014 and produces a final model.onnx ready for inference via a Windows ML session.
Each stage is independently controllable. Quantization and compilation are optional and can be bypassed with a flag or by leaving the corresponding section of the build configuration empty. The same pipeline API that powers winml build is also the programmatic entry point for WinMLAutoModel.from_pretrained().
The stages run in order, and each one writes an intermediate ONNX file to the output directory. All intermediate artifacts are preserved so you can inspect any stage's output or feed a pre-processed file into a later stage directly.
"},{"location":"concepts/how-it-works/#pipeline-stages","title":"Pipeline Stages","text":""},{"location":"concepts/how-it-works/#export-winml-export","title":"Export \u2014winml export","text":"winml export loads a Hugging Face model (pretrained or random-weight), traces it with torch.export or an Optimum-based exporter, and writes a portable, device-agnostic ONNX file. The output at this stage is a plain ONNX graph with float32 weights and no EP-specific nodes.
winml analyze","text":"winml analyze performs static compatibility analysis on an ONNX graph against a target execution provider. It classifies every node as Supported, Partial, Unsupported, or Unknown \u2014 without running the model on the device. Use it before building to check if your model (or an intermediate artifact from any pipeline stage) will run cleanly on the target EP:
winml analyze -m model.onnx --ep qnn --device npu\n Add --optim-config optim.json to output auto-discovered optimization recommendations that can be fed directly into winml optimize. The same analyzer also drives the autoconf feedback loop inside winml build.
winml optimize","text":"winml optimize runs graph-level transformations on the exported ONNX: operator fusion (attention, layer norm, GeLU), constant folding, and graph pruning. The optimize stage also contains an autoconf loop: a static analyzer inspects the graph for nodes that the target EP cannot dispatch natively, and re-runs optimization with adjusted fusion flags until no further improvements are found (up to a configurable iteration limit).
winml quantize","text":"winml quantize inserts Quantize-Dequantize (QDQ) nodes into the optimized graph to reduce weights and activations to lower-precision types (for example, int8 weights with uint8 activations). Calibration data is used to compute quantization parameters per tensor. If the input model already contains QDQ nodes, this stage is skipped automatically.
winml compile","text":"winml compile invokes an EP-specific compiler (for example, the QNN compiler for NPU targets) to embed a pre-compiled binary cache inside the ONNX graph as an EPContext node. At inference time, the EP loads the cached binary directly, bypassing per-session compilation. Compilation is optional; omitting it produces a portable ONNX that is compiled on first load by the runtime.
winml perf / winml eval","text":"After the model is built, winml perf benchmarks inference latency and throughput using a Windows ML session, and winml eval runs task-specific accuracy evaluation. Neither command modifies the model; they consume the final model.onnx produced by the pipeline.
winml build as the One-Shot Wrapper","text":"Running each stage individually is useful when iterating on a specific step, but the normal workflow is winml build, which orchestrates the full pipeline in a single command:
winml build -m microsoft/resnet-50 -o output/\n The -c config.json flag is optional. If omitted, winml build auto-generates a default config internally. To customize pipeline settings, generate a config first with winml config and then pass it:
winml config -m microsoft/resnet-50 -o config.json\nwinml build -c config.json -m microsoft/resnet-50 -o output/\n winml build auto-detects whether the input is a Hugging Face model ID or an existing ONNX file and calls the appropriate internal API (build_hf_model or build_onnx_model). When given an ONNX file directly, the export stage is skipped and the pipeline starts at optimize.
Individual stages can be bypassed from the command line without editing the config file:
# Skip quantization and compilation\nwinml build -m bert-base-uncased -o output/ --no-quant --no-compile\n\n# Skip optimization (for pre-quantized input)\nwinml build -m model_qdq.onnx -o output/ --no-optimize\n"},{"location":"concepts/how-it-works/#configuration-winmlbuildconfig-vs-cli-flags","title":"Configuration: WinMLBuildConfig vs CLI Flags","text":"Pipeline behavior is primarily governed by a WinMLBuildConfig JSON file generated by winml config. The config is a hierarchical structure with one section per stage:
WinMLBuildConfig\n\u251c\u2500\u2500 loader \u2014 model type, task, input constraints\n\u251c\u2500\u2500 export \u2014 input tensor specs, opset, backend\n\u251c\u2500\u2500 optim \u2014 fusion flags, optimization level\n\u251c\u2500\u2500 quant \u2014 precision, calibration settings (null = skip stage)\n\u251c\u2500\u2500 compile \u2014 target EP, device (null = skip stage)\n\u2514\u2500\u2500 eval \u2014 evaluation settings\n Setting quant or compile to null in the JSON file is equivalent to passing --no-quant or --no-compile on the command line; both result in the corresponding stage being skipped. CLI flags override the config at runtime without modifying the file, which is convenient for one-off experiments.
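As a rough illustration of the null-means-skip convention, the sketch below edits a config dictionary in Python. Only the six top-level section names come from the tree above; the nested fields (`precision`, `ep`) are placeholders rather than the real schema, and `skip_stages` is a hypothetical helper, not part of winml-cli.

```python
import json

# Heavily abbreviated, hypothetical config: only the top-level keys are real.
config = {
    "loader": {},
    "export": {"opset_version": 17},
    "optim": {},
    "quant": {"precision": "int8"},  # placeholder field
    "compile": {"ep": "qnn"},        # placeholder field
    "eval": {},
}


def skip_stages(cfg, stages):
    """Return a copy of cfg with the given optional stages set to None (JSON null)."""
    optional = {"quant", "compile"}
    unknown = set(stages) - optional
    if unknown:
        raise ValueError(f"only {sorted(optional)} may be nulled, got {sorted(unknown)}")
    return {key: (None if key in stages else value) for key, value in cfg.items()}


float_only = skip_stages(config, ["quant", "compile"])
print(json.dumps(float_only, indent=2))
```

Serializing `None` produces `null`, which is the value the build reads as "skip this stage".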
The config file is written (or updated) to the output directory after the optimize stage completes, capturing any autoconf-adjusted fusion flags so the build is reproducible. This persisted winml_build_config.json is a self-contained pipeline specification that you can check into version control and run in CI/CD (winml build -c winml_build_config.json -m <model> -o output/) for repeatable, unattended builds across environments.
For the full field-by-field schema, see Reference \u2014 Config Schema.
"},{"location":"concepts/how-it-works/#see-also","title":"See Also","text":"The first stage of the winml-cli pipeline is the most deterministic: bring a model into memory and convert it to ONNX. Everything that follows \u2014 optimization, quantization, compilation \u2014 operates on that ONNX artifact. A well-exported graph with accurate metadata travels cleanly through the rest of the pipeline without requiring patching or re-export.
Loading is an internal operation: the loader module resolves model provenance, selects the right HuggingFace model class, and prepares the weights for tracing. The winml export command is the surface users interact with directly.
When you point winml-cli at a model identifier, the internal loader resolves it in one of two ways. If the identifier looks like a HuggingFace Hub path (e.g., prajjwal1/bert-tiny), the loader downloads the model weights and configuration to the standard HuggingFace cache at ~/.cache/huggingface. Subsequent runs are served from that cache without re-downloading. If the identifier is a path to a local PyTorch checkpoint directory, the loader reads it directly without network access.
In both cases the loader auto-detects the task \u2014 image classification, text feature extraction, and so on \u2014 and selects a corresponding HuggingFace model class. The result is a PyTorch model object ready for tracing.
Before committing to a full export you can verify that the loader resolved everything correctly with winml inspect. It prints the detected task, the HuggingFace model class, the export configuration, and the WinML inference class \u2014 all without downloading weights. Add --hierarchy to reconstruct the PyTorch module tree from random-weight tracing.
Some community models host custom Python code in their repositories. The loader refuses to execute it by default. Pass --trust-remote-code to winml config when generating a build configuration for such a model.
winml export converts the loaded model to ONNX. The conversion uses TorchScript tracing by default, which follows actual execution paths and tends to produce compact, inference-oriented graphs. Note: the --dynamo flag is reserved for the PyTorch 2.x dynamo exporter but is not yet functional in the current release; passing it logs a warning and the flag is ignored.
By default the exporter runs an eight-step process that includes hierarchy tracing and tag injection. The result is an ONNX file enriched with structural metadata that powers downstream features such as per-module benchmarking, inspector views, and optimizer scoping.
"},{"location":"concepts/load-and-export/#hierarchy-tagging-in-detail","title":"Hierarchy tagging in detail","text":"During export the HTP (Hierarchy-preserving Tags Protocol) exporter attaches two pieces of information to every ONNX graph node via node.metadata_props:
| Key | Description | Example |
|---|---|---|
| winml.hierarchy.tag | Full module path the node originated from | /BertModel/BertEncoder/BertLayer.0/BertAttention |
| winml.hierarchy.depth | Number of path segments (integer as string) | 4 |
"},{"location":"concepts/load-and-export/#how-tags-are-built","title":"How tags are built","text":"The exporter registers PyTorch forward hooks on each module. When a module executes, a pre-hook pushes its class name onto a tag stack; the post-hook pops it. This produces hierarchical paths that mirror the PyTorch module tree:
flowchart LR\n A[Register hooks] --> B[Run forward pass]\n B --> C[Pre-hook pushes tag]\n C --> D[Child modules execute]\n D --> E[Post-hook pops tag]\n E --> F[Tag stack \u2192 path] Only modules that are actually executed during tracing receive tags \u2014 unused modules are excluded. For example, prajjwal1/bert-tiny has 48 registered modules but only 18 are reached during a forward pass.
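The push/pop mechanism can be sketched without PyTorch. In the toy below, `ToyModule` is an invented stand-in for a module with pre- and post-forward hooks, and the class names are simplified; it is not the exporter's code and it omits details such as the layer index in tags like BertLayer.0. It shows how a tag stack yields hierarchical paths and why a module that never executes gets no tag.

```python
class ToyModule:
    """Stand-in for a module: children plus pre/post forward hooks."""

    def __init__(self, *children):
        self.children = list(children)
        self.pre_hooks, self.post_hooks = [], []

    def __call__(self, x):
        for hook in self.pre_hooks:
            hook(self)
        for child in self.children:  # "forward": run children in order
            x = child(x)
        for hook in self.post_hooks:
            hook(self)
        return x


class Embeddings(ToyModule): pass
class Attention(ToyModule): pass
class Layer(ToyModule): pass
class Encoder(ToyModule): pass
class Pooler(ToyModule): pass


class Model(ToyModule):
    def __init__(self):
        self.pooler = Pooler()  # registered, but never called in forward
        super().__init__(Embeddings(), Encoder(Layer(Attention())))


def walk(module):
    yield module
    for child in module.children:
        yield from walk(child)


tag_stack, seen_tags = [], []

def pre_hook(module):
    tag_stack.append(type(module).__name__)      # push class name
    seen_tags.append("/" + "/".join(tag_stack))  # current stack is the path

def post_hook(module):
    tag_stack.pop()                              # pop on the way out


model = Model()
for m in list(walk(model)) + [model.pooler]:
    m.pre_hooks.append(pre_hook)
    m.post_hooks.append(post_hook)

model(0)  # one forward pass
print(seen_tags)
```

`Pooler` has hooks attached but never appears in `seen_tags`, which mirrors the statement that only executed modules receive tags.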
Running winml export -m prajjwal1/bert-tiny -o model.onnx -v produces the following hierarchy tree (18 traced modules, 132 ONNX nodes, 100 % coverage):
BertModel (132 nodes)\n\u251c\u2500\u2500 BertEmbeddings: embeddings (7 nodes)\n\u251c\u2500\u2500 BertEncoder: encoder (106 nodes)\n\u2502 \u251c\u2500\u2500 BertLayer: encoder.layer.0 (53 nodes)\n\u2502 \u2502 \u251c\u2500\u2500 BertAttention: encoder.layer.0.attention (39 nodes)\n\u2502 \u2502 \u2502 \u251c\u2500\u2500 BertSelfOutput: encoder.layer.0.attention.output (4 nodes)\n\u2502 \u2502 \u2502 \u2514\u2500\u2500 BertSdpaSelfAttention: encoder.layer.0.attention.self (35 nodes)\n\u2502 \u2502 \u251c\u2500\u2500 BertIntermediate: encoder.layer.0.intermediate (10 nodes)\n\u2502 \u2502 \u2502 \u2514\u2500\u2500 GELUActivation: encoder.layer.0.intermediate.intermediate_act_fn (8 nodes)\n\u2502 \u2502 \u2514\u2500\u2500 BertOutput: encoder.layer.0.output (4 nodes)\n\u2502 \u2514\u2500\u2500 BertLayer: encoder.layer.1 (53 nodes)\n\u2502 \u2514\u2500\u2500 ... (same structure)\n\u2514\u2500\u2500 BertPooler: pooler (0 nodes)\n Each ONNX node gets its tag from the module it belongs to. Here are a few examples from the actual exported model:
| ONNX node name | Assigned tag |
|---|---|
| /embeddings/word_embeddings/Gather | /BertModel/BertEmbeddings |
| /encoder/layer.0/attention/self/query/MatMul | /BertModel/BertEncoder/BertLayer.0/BertAttention/BertSdpaSelfAttention |
| /encoder/layer.0/intermediate/intermediate_act_fn/Mul | /BertModel/BertEncoder/BertLayer.0/BertIntermediate/GELUActivation |
| /Unsqueeze (no scope) | /BertModel (root fallback) |
"},{"location":"concepts/load-and-export/#node-to-module-mapping","title":"Node-to-module mapping","text":"After the ONNX graph is produced by torch.onnx.export, a 4-priority system assigns each ONNX node to the closest matching module:
/BertModel).
This guarantees 100 % tag coverage: every node in the graph carries a non-empty tag.
"},{"location":"concepts/load-and-export/#graph-level-metadata","title":"Graph-level metadata","text":"Beyond per-node tags, the exporter also writes model-level metadata properties:
| Key | Content |
|---|---|
| winml.io.inputs | JSON array of InputTensorSpec \u2014 name, shape, dtype, and optional value_range |
| winml.io.outputs | JSON array of OutputTensorSpec \u2014 name, shape, dtype |
These I/O specs enable tools like winml perf to generate correct dummy inputs for benchmarking and winml inspect to display tensor shapes without loading the model into a runtime.
Alongside the .onnx file, the exporter writes a *_htp_metadata.json sidecar containing:
- nodes \u2014 complete mapping of every ONNX node name \u2192 hierarchy tag
- modules \u2014 traced module information (class name, tag, execution order)
- statistics \u2014 export time, node counts, coverage percentage
- outputs \u2014 I/O tensor specifications
Use --with-report to additionally generate a human-readable markdown report (*_htp_export_report.md).
- winml inspect --hierarchy \u2014 traces the model with random weights and displays the resulting module tree in the terminal. This is a lightweight preview of what tags will look like after a full export.
- winml perf --module <ClassName> \u2014 isolates a submodule (e.g. BertAttention) and benchmarks it independently.
If you need a clean, standard-compliant ONNX without custom metadata \u2014 to hand off to a third-party tool, for example \u2014 pass --no-hierarchy. (The old --clean-onnx spelling remains as a deprecated hidden alias.) The graph behaviour is unchanged, but hierarchy-dependent features will not work against that file.
Most export failures fall into three categories.
Task mismatch. The loader auto-detects task from the model card and configuration, but some models are registered under multiple tasks or have ambiguous metadata. If the wrong task is selected the exporter generates incorrect dummy inputs and the trace fails or produces wrong output shapes. Override it explicitly with --task, for example --task image-feature-extraction.
Shape issues. Transformer models often have symbolic sequence-length dimensions; vision models may expect a fixed spatial resolution. If the default dummy inputs do not match what the model accepts, shape inference will fail or produce dynamic shapes that downstream tools cannot handle. Provide a --shape-config JSON file with explicit overrides, or use --input-specs to supply a fully specified input manifest.
Custom modules. Some models contain torch.nn.Module subclasses the tracer cannot automatically decompose. A --torch-module option (comma-separated class names) is intended to include them as distinct hierarchy nodes rather than inlining them \u2014 most often needed for custom normalization or attention implementations defined in the model repository. Note: the --torch-module flag is reserved for module-targeted export but is not yet functional in the current release \u2014 passing it logs a warning and the flag is ignored.
Knowing that a model produces correct outputs is necessary but not sufficient for a production deployment. You also need to know how fast it runs, how consistently it runs, and where the time goes when it does not run fast enough. winml perf is the primary tool in winml-cli for answering those questions. It synthesises end-to-end latency numbers and live hardware utilisation into a single benchmarking workflow.
Because winml perf accepts both HuggingFace model IDs and local .onnx files, you can benchmark at any stage of the development cycle \u2014 from a freshly exported float model through to a compiled, quantized production artifact.
At its core, winml perf runs a configurable number of inference iterations and reports latency statistics. Here is a real example benchmarking bert-tiny on CPU:
$ winml perf -m bert-tiny.onnx --device cpu --iterations 50 --warmup 5\n\nDevice: cpu / CPUExecutionProvider\nTask: auto (auto-detected)\nModel Precision: fp32\nInputs: input_ids [1, 512] int32\n attention_mask [1, 512] int32\n token_type_ids [1, 512] int32\nOutputs: last_hidden_state [1, 512, 128]\n Output latency table:
| Avg | P50 | P90 | P95 | P99 | Min | Max | Std |
|---|---|---|---|---|---|---|---|
| 5.53 | 5.40 | 6.55 | 6.87 | 7.65 | 4.89 | 7.65 | 0.58 |
Warmup: 14.14 ms avg (first 5 iterations)\nThroughput: 180.72 samples/sec\n Key parameters:
| Flag | Purpose | Default |
|---|---|---|
| --iterations | Number of benchmark iterations | 100 |
| --warmup | Warmup iterations excluded from statistics | 10 |
| --batch-size | Batch size for input generation | 1 |
| -d, --device | Target device: auto, cpu, gpu, npu | auto |
| --ep | Specific execution provider (e.g. qnn, dml, openvino) | auto-resolved from device |
| --precision | Precision mode: auto, fp32, fp16, int8, int16, or w{x}a{y} | auto |
| --quantize/--no-quantize | Include quantization during model build | --quantize |
| --skip-build/--no-skip-build | Skip the build pipeline for ONNX inputs | --skip-build |
"},{"location":"concepts/perf-and-monitoring/#output-format","title":"Output format","text":"Add -f json to emit structured JSON to stdout, suitable for CI pipelines or automated comparisons:
{\n \"benchmark_info\": {\n \"model_id\": \"bert-tiny.onnx\",\n \"task\": \"auto-detected\",\n \"device\": \"cpu\",\n \"ep\": \"CPUExecutionProvider\",\n \"precision\": \"auto\",\n \"iterations\": 50,\n \"warmup\": 5,\n \"batch_size\": 1,\n \"timestamp\": \"2026-06-11T03:27:24+00:00\"\n },\n \"model_info\": {\n \"input_names\": [\"input_ids\", \"attention_mask\", \"token_type_ids\"],\n \"input_shapes\": [[1, 512], [1, 512], [1, 512]],\n \"input_types\": [\"int32\", \"int32\", \"int32\"],\n \"output_names\": [\"last_hidden_state\"],\n \"output_shapes\": [[1, 512, 128]]\n },\n \"latency_ms\": {\n \"mean\": 5.53, \"p50\": 5.40, \"p90\": 6.55,\n \"p95\": 6.87, \"p99\": 7.65, \"min\": 4.89, \"max\": 7.65,\n \"std\": 0.58, \"warmup_mean\": 14.14\n },\n \"throughput\": { \"samples_per_sec\": 180.72, \"batches_per_sec\": 180.72 },\n \"raw_samples_ms\": [5.12, 5.40, ...]\n}\n Results are also saved automatically to ~/.cache/winml/perf/<model_slug>/<timestamp>.json for later comparison. Override the path with --output.
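For readers who want to post-process raw_samples_ms themselves, the following sketch recomputes the same kind of summary with the Python standard library. The interpolation rule for percentiles, the use of population standard deviation, and throughput as batch size divided by mean latency are assumptions; winml perf may compute them differently, and its printed values are rounded, so small mismatches are expected.

```python
import statistics


def percentile(sorted_samples, pct):
    """Linearly interpolated percentile of an ascending list (pct in 0..100)."""
    if not sorted_samples:
        raise ValueError("no samples")
    rank = (len(sorted_samples) - 1) * pct / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(sorted_samples) - 1)
    return sorted_samples[lo] + (sorted_samples[hi] - sorted_samples[lo]) * (rank - lo)


def summarize(samples_ms, batch_size=1):
    """Summarize per-iteration latencies given in milliseconds."""
    ordered = sorted(samples_ms)
    mean = statistics.fmean(ordered)
    return {
        "mean": mean,
        "p50": percentile(ordered, 50),
        "p90": percentile(ordered, 90),
        "p99": percentile(ordered, 99),
        "min": ordered[0],
        "max": ordered[-1],
        "std": statistics.pstdev(ordered),
        "samples_per_sec": batch_size * 1000.0 / mean,
    }


samples = [5.0, 5.2, 4.9, 6.1, 5.3]  # made-up latencies in ms
stats = summarize(samples)
for key, value in stats.items():
    print(f"{key}: {value:.2f}")
```

Feeding it the raw_samples_ms array from two saved result files gives a like-for-like comparison that does not depend on display rounding.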
Latency numbers alone do not tell you whether the hardware is actually being used. A slow NPU inference could mean the model is running on the NPU and hitting a memory bottleneck, or it could mean the EP silently fell back to CPU and is not using the NPU at all.
The --monitor flag adds a live terminal chart (powered by plotext + Rich Live) that streams hardware utilisation for whichever device is being benchmarked. The chart updates once per iteration so you can see whether utilisation is sustained, bursty, or absent. This is particularly useful when commissioning a new model on QNN or DirectML hardware, where EP fallback can be hard to detect from latency numbers alone. If the chart stays near zero while the benchmark runs, it is a strong signal that the model may not be executing on the expected device \u2014 investigate further with EP-specific tools.
winml perf -m model.onnx --device npu --monitor\n Display updates are not included in the timed inference call, but monitoring may introduce small system overhead from background PDH polling.
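The same check can be automated from JSON output, where monitoring data is reported under an hw_monitor key (its structure is described in the sections that follow). The sketch below is a heuristic, not a winml-cli feature: the function names and the 5 % threshold are arbitrary assumptions, and the two dictionaries are abbreviated samples.

```python
def device_utilisation(hw_monitor):
    """Return (device_kind, mean_pct), or (None, None) if no accelerator was recorded."""
    kind = hw_monitor.get("device_kind")
    if kind is None:
        return None, None
    return kind, hw_monitor.get(kind, {}).get("mean_pct")


def looks_like_fallback(hw_monitor, expected_kind, min_mean_pct=5.0):
    """Heuristic: flag runs where the expected device was absent or nearly idle."""
    kind, mean_pct = device_utilisation(hw_monitor)
    if kind != expected_kind:
        return True
    return mean_pct is None or mean_pct < min_mean_pct


# Abbreviated samples: a CPU-only run and a run with an active NPU.
cpu_run = {"device_kind": None, "cpu": {"mean_pct": 15.8}}
npu_run = {"device_kind": "npu", "cpu": {"mean_pct": 12.1}, "npu": {"mean_pct": 87.3}}

print(looks_like_fallback(cpu_run, "npu"))
print(looks_like_fallback(npu_run, "npu"))
```

A positive result is only a prompt to investigate with EP-specific tools, in the same way that a flat chart is.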
"},{"location":"concepts/perf-and-monitoring/#memory-and-resource-metrics","title":"Memory and resource metrics","text":"When --monitor is active, hardware metrics are sampled throughout the benchmark and reported at the end. These metrics help answer questions like \"how much device memory does this model need?\" and \"is the model memory-bound?\".
The metrics collected depend on the target device:
| Metric | CPU | GPU | NPU |
|---|---|---|---|
| CPU utilisation (mean/peak %) | \u2713 | \u2713 | \u2713 |
| RAM (used MB, peak MB) | \u2713 | \u2713 | \u2713 |
| Device utilisation (mean/peak %) | \u2014 | \u2713 | \u2713 |
| Device memory local (peak MB) | \u2014 | \u2713 | \u2713 |
| Device memory shared (peak MB) | \u2014 | \u2713 | \u2713 |
| Engine running time (ns) | \u2014 | \u2713 | \u2713 |
device_memory and running_time_ns are still present but will be zero.
local_peak_mb) and shared system memory (shared_peak_mb) allocated by the GPU driver.
local_peak_mb represents dedicated adapter memory; shared_peak_mb is system memory shared with the NPU.
CPU device:
Hardware (during benchmark)\n CPU: 8.3% avg | Mem: 644 MB\n NPU or GPU device:
Hardware (during benchmark)\n NPU: 87.3% avg, 100.0% peak | CPU: 12.1% avg | Mem: 1842 MB\n Device Mem: 245/0 MB (local/shared)\n"},{"location":"concepts/perf-and-monitoring/#json-structure","title":"JSON structure","text":"In JSON output (-f json), these metrics appear under the hw_monitor key:
\"hw_monitor\": {\n \"monitor\": \"HWMonitor\",\n \"device_kind\": null,\n \"adapter_luid\": null,\n \"cpu\": { \"mean_pct\": 15.8, \"peak_pct\": 16.71, \"sample_count\": 2 },\n \"ram\": { \"used_mb\": 640.21, \"peak_mb\": 640.21 },\n \"device_memory\": { \"local_peak_mb\": 0.0, \"shared_peak_mb\": 0.0 },\n \"running_time_ns\": 0\n}\n When a hardware accelerator is active, device_kind will be \"npu\" or \"gpu\", and an additional key (e.g. \"npu\") appears with device utilisation:
\"hw_monitor\": {\n \"monitor\": \"HWMonitor\",\n \"device_kind\": \"npu\",\n \"adapter_luid\": \"0x0000abcd12340000\",\n \"cpu\": { \"mean_pct\": 12.1, \"peak_pct\": 34.5, \"sample_count\": 50 },\n \"ram\": { \"used_mb\": 1842.0, \"peak_mb\": 1910.0 },\n \"device_memory\": { \"local_peak_mb\": 245.0, \"shared_peak_mb\": 0.0 },\n \"npu\": { \"mean_pct\": 87.3, \"peak_pct\": 100.0, \"sample_count\": 50 },\n \"running_time_ns\": 4820000000\n}\n This makes it straightforward to track memory consumption across model revisions or compare devices programmatically.
"},{"location":"concepts/perf-and-monitoring/#per-module-benchmarking","title":"Per-module benchmarking","text":"Large Transformer-family models contain many repeated module instances \u2014 attention blocks, feed-forward layers, encoder stages. When you want to understand the cost of one type of block rather than the full network, --module <ClassName> isolates and benchmarks matching modules from the HuggingFace model hierarchy.
winml perf -m bert-base-uncased --module BertAttention\n This builds and benchmarks each BertAttention instance separately and reports per-instance statistics. The --module argument must be a class name (e.g. BertAttention), not a dotted module path (e.g. not encoder.layer.0.attention).
Internally, --module uses torchinfo to discover all submodule instances matching the given class name in the HuggingFace model. For each match it generates a separate build config, exports an isolated ONNX file, and benchmarks it independently. This requires a HuggingFace model ID (not a local .onnx file) because it needs access to the PyTorch module tree.
--module targets gets written
winml-cli exposes two ways to turn a Hugging Face model or ONNX file into a Windows ML-ready artifact. You can invoke each stage of the pipeline as an individual primitive command \u2014 winml export, winml analyze, winml optimize, winml quantize, winml compile, winml perf, winml eval \u2014 running one step at a time with full control over inputs and outputs. Alternatively, winml build wraps all of those stages into a single command driven by a WinMLBuildConfig JSON file.
Understanding when to reach for a primitive versus the pipeline wrapper is the central workflow decision in winml-cli. Both paths produce the same artifacts; the difference is in repeatability, convenience, and how much you need to inspect or vary individual stages.
"},{"location":"concepts/primitives-and-pipeline/#the-primitive-commands","title":"The primitive commands","text":"Each primitive command corresponds to one stage of the pipeline described in How winml-cli works. They run in order, each producing an ONNX file that the next stage consumes:
- winml export \u2014 loads a Hugging Face model, traces it with PyTorch and the Optimum exporter, and writes a portable float32 ONNX file with no EP-specific nodes.
- winml analyze \u2014 runs compatibility and runtime checks on the exported ONNX graph, detecting unsupported operators, QDQ issues, and device-specific constraints before further pipeline stages.
- winml optimize \u2014 applies graph transformations (operator fusion, constant folding, graph pruning) and runs an autoconf loop to maximize EP-compatible coverage.
- winml quantize \u2014 inserts QDQ nodes using calibration data, reducing weight and activation types to lower precision (for example, int8) for efficient inference.
- winml compile \u2014 invokes an EP-specific compiler (for example, QNN for NPU targets) to embed a pre-compiled binary cache in the ONNX graph as an EPContext node.
- winml perf \u2014 benchmarks latency and throughput against a Windows ML session; does not modify the model.
- winml eval \u2014 evaluates task-specific accuracy on a dataset; does not modify the model.
You can enter the pipeline at any stage. If you already have an optimized ONNX file, pass it directly to winml quantize without re-exporting. Each command writes its output to a path you specify, so all intermediate artifacts are preserved for inspection.
winml build orchestrates all of the above stages in order from a single WinMLBuildConfig JSON file:
winml build -c config.json -m microsoft/resnet-50 -o output/\n The config file tells winml build which stages to run and how to configure them. Setting the quant or compile section to null in the JSON skips that stage; passing --no-quant, --no-compile, or --no-optimize on the command line achieves the same effect at runtime without editing the file.
When the model argument points to an existing ONNX file instead of a Hugging Face ID, winml build detects this and skips the export stage, running analyze \u2192 optimize \u2192 quantize \u2192 compile directly. This mirrors how each primitive command handles the same case.
winml build also accepts --use-cache in place of -o/--output-dir, routing artifacts to the winml-cli global cache at ~/.cache/winml/ instead of a local directory. Use --rebuild to force a clean re-run even when cached artifacts already exist.
Use primitive commands when:
Use winml build when:
winml build coordinates end-to-end.
quant: null in the config) rather than remembered flag-by-flag across invocations.
The two approaches are not exclusive. A common pattern is to prototype with primitives \u2014 iterating on winml optimize and winml quantize individually to tune fusion flags and calibration \u2014 and then encode the final settings into a WinMLBuildConfig for repeatable production builds via winml build.
WinMLBuildConfig
Every ONNX tensor carries data in a specific numeric type \u2014 float32, float16, int8, int16 \u2014 and every winml-cli pipeline makes deliberate choices about which type to use where. This page covers both halves of that decision: the datatype family winml-cli understands, and the quantization workflow that converts a model from one datatype to another to shrink it and run it faster on integer-native hardware.
Quantization is the headline use of datatypes in winml-cli. By replacing float32 weights and activations with int8 or mixed precisions, you typically get a 2\u20134\u00d7 smaller model artifact and a 2\u20138\u00d7 latency speedup on NPU hardware. The trade-off is a potential reduction in model accuracy, the degree of which depends on the precision chosen and the sensitivity of the model.
winml-cli exposes a precision shorthand on the --precision flag that encodes the weight/activation dtype pair as a single string. The table below lists every precision from _NAMED_PRECISIONS in config/precision.py, together with the resolved quantization types. Float precisions (fp32, fp16) carry no quantization types because weights and activations remain in floating point throughout.
| Precision | Weight type | Activation type | Notes |
|---|---|---|---|
| auto | device-dependent | device-dependent | Resolves to w8a16 (NPU), fp16 (GPU/CPU) at runtime |
| fp32 | float32 | float32 | No quantization; baseline accuracy |
| fp16 | float16 | float16 | Half-precision float; no QDQ nodes inserted |
| int8 | uint8 | uint8 | Static quantization; valid for QNN EP |
| int16 | int16 | uint16 | Higher-accuracy quantization; larger model than int8 |
| w8a8 | uint8 | uint8 | Equivalent to int8; explicit mixed-precision notation |
| w8a16 | uint8 | uint16 | Mixed: compact weights, wider activations for accuracy |
| w4a16 | n/a | n/a | Not supported. Rejected at validation \u2014 is_quantized_precision(\"w4a16\") returns False because 4-bit weight types are absent from _BITS_TO_WEIGHT_TYPE in precision.py. The string is not a recognized precision. |
The --weight-type and --activation-type flags on winml quantize accept uint8, int8, uint16, or int16 and override whatever the --precision shorthand would have resolved. This is useful when you need an unsigned weight type for QNN compatibility but a signed activation type for a specific operator constraint. See Weight and Activation for why the two need separate flags in the first place.
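The resolution rules in the precision rows listed above can be approximated in a few lines. This is an illustrative sketch only: the names `WEIGHT_TYPES`, `ACTIVATION_TYPES`, and `resolve_precision` are hypothetical and do not reproduce the contents of precision.py, the lookup tables cover just the combinations documented on this page, and auto is left out because it depends on the device.

```python
import re

# Bit width -> resolved type, following the documented w8a8 and w8a16 rows.
# 4-bit weights are deliberately absent, so w4a16 is rejected.
WEIGHT_TYPES = {8: "uint8"}
ACTIVATION_TYPES = {8: "uint8", 16: "uint16"}
NAMED = {"int8": ("uint8", "uint8"), "int16": ("int16", "uint16")}
FLOAT = {"fp32", "fp16"}


def resolve_precision(precision):
    """Return (weight_type, activation_type), or None for float precisions."""
    if precision in FLOAT:
        return None  # no QDQ nodes, so no quantization types
    if precision in NAMED:
        return NAMED[precision]
    match = re.fullmatch(r"w(\d+)a(\d+)", precision)
    if match is None:
        raise ValueError(f"unrecognized precision: {precision!r}")
    w_bits, a_bits = int(match.group(1)), int(match.group(2))
    if w_bits not in WEIGHT_TYPES or a_bits not in ACTIVATION_TYPES:
        raise ValueError(f"unsupported bit widths in {precision!r}")
    return WEIGHT_TYPES[w_bits], ACTIVATION_TYPES[a_bits]


for name in ["fp16", "int8", "w8a16"]:
    print(name, "->", resolve_precision(name))
```

Passing `"w4a16"` raises `ValueError`, matching the documented behaviour that the string is rejected at validation.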
winml-cli applies quantization by inserting QDQ (Quantize/Dequantize) nodes into the ONNX graph. The resulting file is a standard ONNX model that any ONNX Runtime execution provider can consume and optimize for its target hardware \u2014 the EP reads the QDQ pattern and fuses adjacent operations into true integer kernels.
"},{"location":"concepts/quantization/#calibration","title":"Calibration","text":"Static quantization \u2014 the kind winml-cli applies \u2014 requires a calibration pass before inserting QDQ nodes. During calibration, a small set of representative inputs runs through the original floating-point model so that winml-cli can observe the actual range of values each tensor takes at runtime. Those observed ranges are then used to choose the scale and zero-point constants baked into the QDQ nodes.
The --samples flag controls how many calibration inputs are used (default: 10). More samples generally produce better range estimates but take longer. The --method flag selects the algorithm used to summarize the observed ranges:
- minmax (default) \u2014 uses the absolute minimum and maximum observed values. Fast and predictable; can be sensitive to outliers.
- entropy \u2014 minimizes the KL-divergence between the original and quantized distribution. Often yields better accuracy on models with heavy-tailed activation distributions.
- percentile \u2014 clips a small fraction of extreme values before computing the range. A practical middle ground when outliers are present but entropy calibration is slow.

Example using entropy calibration with more samples:
winml quantize -m model.onnx --precision int8 --samples 128 --method entropy\n"},{"location":"concepts/quantization/#the-qdq-pattern","title":"The QDQ pattern","text":"The QDQ pattern is the standard ONNX representation for static quantization. winml-cli wraps the inputs and outputs of quantizable operators with pairs of QuantizeLinear and DequantizeLinear nodes. At the graph level the model still operates in floating-point; the QDQ nodes encode the scale and zero-point metadata that a runtime needs to fuse adjacent operations into true integer kernels.
When the model runs under ONNX Runtime, the execution provider \u2014 whether CPU, DirectML, or a dedicated NPU EP \u2014 reads those QDQ patterns and performs its own graph fusion. This means the EP is free to apply hardware-specific optimizations without winml-cli needing to know anything about the target device's internal ISA or operator library. The QDQ model produced by winml quantize is a single portable artifact that can be deployed to any EP that supports integer execution.
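To make the QDQ arithmetic concrete, here is a minimal pure-Python sketch of what a QuantizeLinear/DequantizeLinear pair computes for a uint8 tensor. The helper names and the sample range are assumptions made for illustration; this is not winml-cli code, and a real execution provider fuses these nodes into integer kernels rather than evaluating them one value at a time.

```python
def minmax_qparams(lo, hi, qmin=0, qmax=255):
    """Asymmetric scale and zero-point for an observed [lo, hi] range."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # keep 0.0 exactly representable
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale
    zero_point = max(qmin, min(qmax, round(qmin - lo / scale)))
    return scale, zero_point


def quantize_linear(x, scale, zero_point, qmin=0, qmax=255):
    """Round to nearest (ties to even), shift by zero_point, then saturate."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))


def dequantize_linear(q, scale, zero_point):
    """Map the integer code back to a floating-point value."""
    return (q - zero_point) * scale


# Assumed calibration range of [-1.0, 3.0]; 10.0 lies outside it on purpose.
scale, zero_point = minmax_qparams(-1.0, 3.0)
inputs = [-1.0, 0.0, 0.5, 3.0, 10.0]
round_trip = [
    dequantize_linear(quantize_linear(x, scale, zero_point), scale, zero_point)
    for x in inputs
]
```

Values inside the calibrated range come back within half a quantization step of the original, while the out-of-range input saturates at the top of the range. That saturation is the reason the observed ranges matter so much.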
Not all precision choices carry equal accuracy risk:
- fp16 is usually lossless in practice. Rounding errors relative to fp32 are small enough that most models show no measurable accuracy difference.
- int8 and int16 are inherently lossy. Compressing a 32-bit float into 8 or 16 bits discards information, and the magnitude of accuracy degradation depends on how well the calibration data represents the deployment distribution.
- Mixed precisions such as w8a16 reduce the risk compared to full int8 by preserving more precision in activations, but they are still lossy relative to fp32.

Always validate accuracy after quantizing a model to an integer precision. Run winml eval on a representative dataset and compare the metrics against the original floating-point baseline before shipping the quantized artifact.
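As an illustration of the kind of comparison such a check involves, the sketch below measures how often a quantized model picks the same top-1 class as the floating-point baseline. The function names and the logits are made up for the example; winml eval remains the documented way to run a real evaluation.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def top1_agreement(baseline, quantized):
    """Fraction of samples where both models predict the same class."""
    if len(baseline) != len(quantized):
        raise ValueError("both runs must cover the same samples")
    matches = sum(argmax(b) == argmax(q) for b, q in zip(baseline, quantized))
    return matches / len(baseline)


# Made-up logits for three samples and three classes.
fp32_logits = [[2.0, 0.1, -1.0], [0.2, 0.3, 0.1], [-0.5, 1.5, 1.4]]
int8_logits = [[1.9, 0.2, -0.9], [0.3, 0.2, 0.1], [-0.4, 1.4, 1.3]]
agreement = top1_agreement(fp32_logits, int8_logits)
```

Agreement with the baseline is a cheap first signal; task metrics on labeled data are still needed before shipping.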
Every neural network model stores two kinds of numeric tensors that matter for deployment: weights, the static parameters baked in at training time, and activations, the intermediate values that flow through the graph at every inference call. Understanding the distinction is the key to reading winml-cli's precision flags, deciding when quantization is safe, and knowing why a model that runs fine on one execution provider may stall or degrade on another.
"},{"location":"concepts/weight-and-activation/#weights-are-static","title":"Weights are static","text":"Weights are the trained parameters of the model: convolution kernels, linear projection matrices, attention weights, embedding tables, bias vectors. They are fixed at the moment the model is exported and stay constant for every inference call. Because they are static, their quantization parameters \u2014 the scale and zero-point used to compress them from fp32 to int8 \u2014 can be computed once, offline, using calibration data. winml quantize does exactly that: it observes the weight distributions in your exported ONNX and bakes the per-tensor scale/zero-point into the QDQ nodes that wrap the weights.
In ONNX terms, weights are stored as initializers inside the graph. The runtime treats them as graph inputs that are always pre-supplied; you do not pass weights to a session at inference time, the way you pass an image tensor or a text prompt.
"},{"location":"concepts/weight-and-activation/#activations-are-dynamic","title":"Activations are dynamic","text":"Activations are the intermediate results that flow through the graph during inference: the output of every matrix multiply, every layer norm, every attention softmax. Unlike weights, activations are regenerated on every forward pass and depend entirely on the input data. winml-cli cannot pre-compute their quantization parameters offline \u2014 instead, calibration runs a small set of representative inputs through the model and observes the actual ranges each activation tensor takes. Those observed ranges become the scale/zero-point baked into QDQ nodes around each activation.
This is why calibration data matters. If the calibration set fails to represent the inputs you will see in production, the per-activation ranges will be wrong and the quantized model will lose more accuracy than necessary on real traffic.
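The effect can be sketched with a toy experiment: quantize the same "production" values twice, once with a range calibrated on unrepresentative data and once with a matching range. Everything here (helper names, ranges, data) is assumed for illustration, and the helper expects a range that straddles zero.

```python
def fake_quant(x, lo, hi, levels=255):
    """Quantize x to uint8 over the calibrated [lo, hi] range, then dequantize."""
    scale = (hi - lo) / levels
    zero_point = round(-lo / scale)
    q = max(0, min(levels, round(x / scale) + zero_point))  # saturate to uint8
    return (q - zero_point) * scale


def mean_abs_error(data, lo, hi):
    """Average absolute round-trip error for a given calibrated range."""
    return sum(abs(x - fake_quant(x, lo, hi)) for x in data) / len(data)


# Made-up production activations spread over [-4.0, 4.0].
production = [i / 10 for i in range(-40, 41)]

err_narrow = mean_abs_error(production, -1.0, 1.0)  # calibration saw only [-1, 1]
err_wide = mean_abs_error(production, -4.0, 4.0)    # calibration matched production
```

With the narrow range, every value beyond the calibrated limits is clipped, so the error is dominated by saturation rather than rounding.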
"},{"location":"concepts/weight-and-activation/#why-they-need-separate-flags","title":"Why they need separate flags","text":"The --weight-type and --activation-type flags on winml quantize exist because the optimal bit-width for weights is not necessarily the optimal bit-width for activations:
The compound precision shorthand w8a16 (8-bit weights, 16-bit activations) reflects this asymmetry directly: weights and activations get different bit-widths in one config string. For the full precision family and how each maps to weight/activation dtypes, see Datatype and Quantization.
winml-cli ships a Copilot Skill (use-winml-cli) that lets AI coding agents drive the entire model-building pipeline on your behalf. When a coding agent has this skill attached, it can inspect models, generate configs, run builds, and interpret results \u2014 without you having to remember exact flags or stage ordering.
The skill teaches the agent:
Capability What the agent learns Pipeline shape The stage order (inspect \u2192 export \u2192 analyze \u2192 optimize \u2192 quantize \u2192 compile \u2192 perf) and when to enter mid-pipeline Flag discovery Always run winml <command> --help before quoting a command \u2014 never fabricate flags Output mapping Which command's -o produces the artifact the user actually needs Scope awareness Which model architectures are supported (classic DL) vs. out-of-scope (LLMs, diffusion) Hardware detection Use winml sys --list-ep to confirm what's available before targeting an EP Two paths When to use primitives (debugging, exploring) vs. config + build (production, CI)"},{"location":"getting-started/agent-skill/#how-to-use-it","title":"How to use it","text":""},{"location":"getting-started/agent-skill/#with-github-copilot-coding-agent","title":"With GitHub Copilot Coding Agent","text":"To make the Copilot Coding Agent (the cloud agent that creates PRs) follow the skill's guidance, reference it in .github/copilot-instructions.md. The Coding Agent reads that file automatically when working on this repository.
For agents that support custom instructions (e.g., Copilot Extensions, Claude, ChatGPT with file uploads, or custom MCP tool servers), attach the skill file as context:
skills/use-winml-cli/SKILL.md\n You can copy the file contents into your agent's system prompt, upload it as a reference document, or include it in a .github/copilot-instructions.md for VS Code Copilot Chat. The skill uses standard markdown with YAML front-matter \u2014 any agent that accepts text context can benefit from it.
winml-cli/\n\u2514\u2500\u2500 skills/\n \u2514\u2500\u2500 use-winml-cli/\n \u2514\u2500\u2500 SKILL.md \u2190 the skill definition\n"},{"location":"getting-started/agent-skill/#example-agent-interaction","title":"Example agent interaction","text":"User: Can I run ConvNeXt on my Snapdragon X Elite NPU?\n\nAgent (with skill):\n1. Runs `winml sys --list-ep` \u2192 confirms QNNExecutionProvider is registered\n2. Runs `winml inspect -m microsoft/convnext-tiny-224` \u2192 confirms supported\n3. Runs `winml config --onnx ... -d npu -o config.json`\n4. Runs `winml build -c config.json -m microsoft/convnext-tiny-224 -o output/`\n5. Runs `winml perf -m output/model.onnx -d npu --monitor`\n6. Reports latency + NPU utilization to user\n"},{"location":"getting-started/installation/","title":"Installation","text":""},{"location":"getting-started/installation/#prerequisites","title":"Prerequisites","text":"Component Details Windows Windows 11 24H2 or later (required for NPU support) Hardware Device with CPU, GPU, or NPU Python 3.11 Package manager uv Version control git No NPU?
You can follow most of these docs without NPU hardware. All winml-cli commands accept --device auto and fall back to CPU or DirectML automatically. The tutorials document explicit CPU fallback paths.
uv python install 3.11\nuv pip install winml-cli\n uv python install 3.11 downloads and pins the exact Python version the project requires. uv pip install winml-cli installs the latest release from PyPI into a managed environment. No separate venv activation is needed.
Install from source (for development)
If you want to contribute or run the latest unreleased code:
git clone https://github.com/microsoft/winml-cli.git\ncd winml-cli\nuv sync\n"},{"location":"getting-started/installation/#verify","title":"Verify","text":"winml sys\n Expected output (abbreviated):
+------------------------------------+\n| winml-cli System Information |\n+------------------------------------+\n\nEnvironment\n Python Version 3.11.x\n OS Windows 11\n Machine AMD64\n\nML Libraries\n Library Version Status\n torch 2.x.x OK\n onnx 1.x.x OK\n\nAvailable Devices (priority order)\n #1 NPU ...\n #2 GPU ...\n #3 CPU ...\n\nAvailable Execution Providers\n QNNExecutionProvider -> NPU\n DmlExecutionProvider -> GPU\n CPUExecutionProvider -> CPU\n This command enumerates available compute devices and execution providers on your machine. If an expected device or execution provider is missing, winml sys is the right place to diagnose it. See winml sys for the full flag reference and troubleshooting tips.
Run the following command to enumerate available devices and execution providers on your machine:
uv run winml sys --list-device --list-ep\n --list-device and --list-ep print only the hardware and EP inventory. If the command exits without error, your winml-cli install is ready. See winml sys for the full flag reference.
Before downloading any models, confirm that winml-cli recognises the model:
uv run winml inspect -m microsoft/resnet-50\n +--------------------------- microsoft/resnet-50 ---------------------------+\n| Task image-classification |\n| Model Class ResNetForImageClassification |\n| Exporter OptimumExporter |\n| WinML Class WinMLImageClassificationModel |\n| Status Supported |\n+---------------------------------------------------------------------------+\n Tip
Always inspect before build to catch unsupported architectures early.
"},{"location":"getting-started/quickstart/#build-the-model","title":"Build the model","text":"uv run winml build -m microsoft/resnet-50 -o resnet_out/ --no-quant\n winml build runs all pipeline steps in sequence \u2014 export, optimize, quantize. You can start a model build without a config file, or provide one to configure each step in the sequence (see winml config to customize). All intermediate artifacts land in resnet_out/, so you can reuse any stage independently.
After a successful build, you will find the following outputs in resnet_out/:
- analyze_result.json \u2014 detailed model compatibility insights for each Windows ML EP, including supported, partially supported, and unsupported operators, detected optimization patterns, and recommended optimization workflows.
- winml_build_config file \u2014 automatically generated after the build step to capture the full workflow end-to-end.

uv run winml perf -m resnet_out/model.onnx --device auto --iterations 50 --monitor\n --device auto lets the CLI resolve the best available device on your machine \u2014 NPU first, then GPU, then CPU.
- winml build
- winml inspect
- winml perf
- winml sys

If you prefer a graphical interface, you can use the Foundry Toolkit extension for VS Code to run Windows ML CLI model conversion without typing commands.
"},{"location":"getting-started/ui-quickstart/#quick-reference","title":"Quick reference","text":"Foundry Toolkit in the VS Code Extensions viewFor a full walkthrough, see Build with Windows ML CLI (Preview) in the VS Code documentation.
"},{"location":"reference/","title":"Reference \u2014 Config Schema","text":"This page documents the full schema for WinMLBuildConfig, the JSON configuration file that drives the winml-cli pipeline. Generate a config with winml config, then pass it to any command with -c config.json.
The config is accepted by all pipeline commands \u2014 not just winml build. For example, winml export -c config.json, winml quantize -c config.json, and winml compile -c config.json each read the relevant section of the same config file. This lets you use a single config as the source of truth across all stages.
{\n \"loader\": { ... },\n \"export\": { ... },\n \"optim\": { ... },\n \"quant\": { ... },\n \"compile\": { ... },\n \"eval\": { ... },\n \"auto\": true\n}\n Setting quant or compile to null skips that pipeline stage entirely. Setting auto to true (default) lets winml-cli auto-configure downstream stages based on the target device and precision.
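As a rough sketch of how a stage is switched off, the snippet below builds a config as a Python dictionary and serializes it. Python's `None` becomes JSON `null`, which is the value that skips a stage. The field values are illustrative only, and the snippet does not validate anything against the real WinMLBuildConfig schema.

```python
import json

# Illustrative config: quantization and compilation are both skipped.
config = {
    "loader": {"task": "image-classification"},
    "export": {"opset_version": 17, "batch_size": 1},
    "optim": {"gelu_fusion": False},
    "quant": None,    # serialized as null: skip the quantize stage
    "compile": None,  # serialized as null: skip the compile stage
    "auto": True,
}

config_text = json.dumps(config, indent=2)
```

Writing `config_text` to a file such as config.json would give a config that can be passed to a pipeline command with -c.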
loader \u2014 Model Loading","text":"Field Type Default Description task str \\| null null HuggingFace task (e.g., image-classification). Auto-detected if omitted. model_class str \\| null null Override model class (e.g., AutoModelForCTC). model_type str \\| null null HuggingFace model type (e.g., bert, resnet). module_path str \\| null null Dotted path to a submodule for targeted export. user_script str \\| null null Path to custom model class script. trust_remote_code bool false Trust remote code from HuggingFace."},{"location":"reference/#export-onnx-export","title":"export \u2014 ONNX Export","text":"Field Type Default Description opset_version int 17 ONNX opset version. batch_size int 1 Static batch size. Use 1 for QNN compatibility. input_tensors list[InputTensorSpec] \\| null null Input tensor specifications. Auto-inferred if omitted. output_tensors list[OutputTensorSpec] \\| null null Output tensor specifications. dynamic_axes dict \\| null null Dynamic axes mapping. \u26a0\ufe0f Breaks MatMulAddFusion on QNN. export_params bool true Include model parameters in ONNX. do_constant_folding bool true Fold constants during export. verbose bool false Verbose export logging. dynamo bool false Use PyTorch 2.x Dynamo exporter. enable_hierarchy_tags bool true Add module hierarchy tags to ONNX nodes. clean_onnx bool false Strip hierarchy tags after export. hierarchy_tag_format \"full\" \\| \"module_only\" \"full\" Tag detail level. InputTensorSpec:
Field Type Descriptionname str \\| null Tensor name (e.g., pixel_values). dtype str \\| null Data type (e.g., float32, int64). shape list[int] \\| null Tensor shape (e.g., [1, 3, 224, 224]). value_range [float, float] \\| null Min/max for dummy tensor generation."},{"location":"reference/#optim-graph-optimization","title":"optim \u2014 Graph Optimization","text":"A dictionary of boolean fusion flags. All default to false unless auto-configured.
| Field | Type | Description |
|---|---|---|
| gelu_fusion | bool | Fuse GeLU activation patterns. |
| layer_norm_fusion | bool | Fuse LayerNorm patterns. |
| matmul_add_fusion | bool | Fuse MatMul + Add (enables BiasGelu). |

Additional fusion flags can be added as key-value pairs.
"},{"location":"reference/#quant-quantization","title":"quant \u2014 Quantization","text":"Set to null to skip quantization.
mode \"qdq\" \\| \"static\" \\| \"dynamic\" \"qdq\" Quantization mode. weight_type \"uint8\" \\| \"int8\" \\| \"uint16\" \\| \"int16\" \"uint8\" Weight data type. activation_type \"uint8\" \\| \"int8\" \\| \"uint16\" \\| \"int16\" \"uint8\" Activation data type. calibration_method \"minmax\" \\| \"entropy\" \\| \"percentile\" \"minmax\" Scale computation method. samples int 10 Number of calibration samples. per_channel bool false Per-channel quantization. symmetric bool false Symmetric quantization. task str \\| null null Task for dataset-aware calibration. model_name str \\| null null Model ID for calibration dataset resolution. dataset_name str \\| null null Override calibration dataset. distribution str \"uniform\" Random distribution for dummy data. seed int \\| null null Random seed for reproducibility. calibration_load_path str \\| null null Load pre-computed calibration scales. calibration_save_path str \\| null null Save calibration scales. op_types_to_quantize list[str] \\| null null Operator types to quantize (all if null). nodes_to_exclude list[str] \\| null null Node names to skip."},{"location":"reference/#compile-ep-compilation","title":"compile \u2014 EP Compilation","text":"Set to null to skip compilation.
ep_config.provider str \"qnn\" EP alias: qnn, cpu, dml, openvino, tensorrt, vitisai, migraphx. ep_config.device str \"auto\" Target device: npu, gpu, cpu, auto. ep_config.enable_ep_context bool true Generate EPContext model. ep_config.embed_context bool false Embed binary in ONNX (true) or external .bin (false). ep_config.compiler str \"ort\" Compiler backend: ort or qairt. ep_config.provider_options dict {} EP-specific options. ep_config.qnn_sdk_root str \\| null null QNN SDK path for QAIRT compiler backend. validate bool true Validate compiled model. verbose bool false Verbose compilation logging."},{"location":"reference/#eval-evaluation","title":"eval \u2014 Evaluation","text":"Set to null (default) to skip evaluation.
model_id str \\| null null HuggingFace model ID for config resolution. model_path str \\| dict[str, str] \\| null null Path to .onnx file, or a {role: path} dict for composite models. task str \\| null null Task type. device str \"auto\" Inference device. precision str \"auto\" Precision (fp32, fp16, w8a16, etc.). ep str \\| null null EP override. dataset.path str \\| null null HuggingFace dataset path. dataset.name str \\| null null Dataset config name. dataset.split str \"validation\" Dataset split. dataset.samples int 100 Evaluation sample count. dataset.shuffle bool true Shuffle before sampling. dataset.seed int 42 Random seed. output_path str \\| null null Path for JSON results output."},{"location":"reference/#example-full-config","title":"Example: Full Config","text":"{\n \"loader\": {\n \"task\": \"image-classification\",\n \"model_type\": \"resnet\"\n },\n \"export\": {\n \"opset_version\": 17,\n \"batch_size\": 1\n },\n \"optim\": {\n \"gelu_fusion\": true,\n \"layer_norm_fusion\": true,\n \"matmul_add_fusion\": true\n },\n \"quant\": {\n \"mode\": \"qdq\",\n \"weight_type\": \"uint8\",\n \"activation_type\": \"uint8\",\n \"samples\": 10,\n \"calibration_method\": \"minmax\"\n },\n \"compile\": {\n \"ep_config\": {\n \"provider\": \"qnn\",\n \"device\": \"npu\",\n \"enable_ep_context\": true,\n \"embed_context\": false\n },\n \"validate\": true\n },\n \"auto\": true\n}\n"},{"location":"reference/#the-auto-field","title":"The auto field","text":"The top-level \"auto\" field (default: true) controls whether the build pipeline runs the autoconf loop \u2014 an iterative analyze \u2192 discover \u2192 re-optimize cycle that automatically detects which additional graph optimizations the model needs for the target EP.
- true (default): After initial optimization, the analyzer inspects the graph for unsupported or sub-optimal nodes and proposes additional optimization flags. The pipeline re-optimizes using the discovered flags and repeats (up to --max-optim-iterations, default 3). The final optimization result depends on what the analyzer discovers at runtime, so outputs may vary if the model or EP support changes between runs.
- false: The pipeline applies only the explicit optim flags from the config \u2014 no autoconf discovery, no re-optimization loop. Builds are fully deterministic given the same config and input model. Use this for reproducible CI builds or when you have already tuned the optimization flags manually.

When auto is true and the autoconf loop discovers additional flags, the final persisted config (written to the output directory) includes the merged result so you can inspect what was discovered.
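One way to turn a discovered config into a reproducible one is to copy the persisted file and switch auto off. The sketch below assumes a persisted config with the fields described on this page; `pin_config` and the file names in the demo are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def pin_config(src: Path, dst: Path) -> dict:
    """Copy a persisted build config to dst with the autoconf loop disabled."""
    config = json.loads(src.read_text(encoding="utf-8"))
    config["auto"] = False  # keep the discovered optim flags, stop rediscovery
    dst.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


# Demo on a throwaway file standing in for a persisted build config.
with tempfile.TemporaryDirectory() as tmp:
    persisted = Path(tmp) / "winml_build_config.json"
    persisted.write_text(
        json.dumps({"optim": {"gelu_fusion": True}, "auto": True}),
        encoding="utf-8",
    )
    pinned = pin_config(persisted, Path(tmp) / "ci_config.json")
```

The pinned file keeps whatever flags the autoconf loop merged in, so a later build from it applies exactly those flags and nothing else.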
When you run winml build, the tool writes all artifacts to the output directory. This page documents what each file is and which ones you need for deployment.
After a full pipeline run (export \u2192 optimize \u2192 quantize \u2192 compile):
output/\n\u251c\u2500\u2500 model.onnx \u2190 FINAL artifact (deploy this)\n\u251c\u2500\u2500 model.onnx.data \u2190 External weights (if model \u2265 100 MiB)\n\u251c\u2500\u2500 winml_build_config.json \u2190 Persisted build config\n\u251c\u2500\u2500 analyze_result.json \u2190 Static analysis (EP compatibility)\n\u251c\u2500\u2500 build_manifest.json \u2190 Build provenance (Python API only)\n\u251c\u2500\u2500 export_htp_metadata.json \u2190 HTP export metadata (hierarchy info)\n\u251c\u2500\u2500 export.onnx \u2190 Intermediate: raw ONNX export\n\u251c\u2500\u2500 export.onnx.data\n\u251c\u2500\u2500 optimized.onnx \u2190 Intermediate: after graph optimization\n\u251c\u2500\u2500 optimized.onnx.data\n\u251c\u2500\u2500 quantized.onnx \u2190 Intermediate: after QDQ insertion\n\u251c\u2500\u2500 quantized.onnx.data\n\u251c\u2500\u2500 compiled.onnx \u2190 Intermediate: after EP compilation\n\u2514\u2500\u2500 compiled.onnx.data\n"},{"location":"reference/output-layout/#file-categories","title":"File Categories","text":""},{"location":"reference/output-layout/#final-artifacts-keep-for-deployment","title":"Final Artifacts (Keep for Deployment)","text":"File Purpose model.onnx The deployment-ready model. Always present. model.onnx.data External weight data (only if model \u2265 100 MiB). Must stay alongside model.onnx. winml_build_config.json The complete pipeline config used for this build (includes auto-discovered optimization flags). This file is a reproducible pipeline specification \u2014 check it into version control or feed it directly to winml build -c in a CI/CD pipeline to guarantee identical model processing across machines and runs (set \"auto\": false for fully deterministic builds). analyze_result.json Static analysis output: EP compatibility, operator classification, detected patterns. build_manifest.json Build provenance with stage timings. Only generated via the Python API (build_hf_model/build_onnx_model). 
export_htp_metadata.json HTP export metadata: module hierarchy, tracing info, tagging coverage."},{"location":"reference/output-layout/#intermediate-files-can-delete-after-build","title":"Intermediate Files (Can Delete After Build)","text":"File Stage Contents export.onnx Export Raw PyTorch \u2192 ONNX conversion (float32) optimized.onnx Optimize Graph with fused operators, shape inference applied quantized.onnx Quantize QDQ nodes inserted, calibrated scales compiled.onnx Compile EPContext binary embedded or sidecar Each intermediate has a corresponding .onnx.data file if the model exceeds 100 MiB.
winml export)","text":"output/\n\u251c\u2500\u2500 export.onnx\n\u2514\u2500\u2500 export.onnx.data (if \u2265 100 MiB)\n"},{"location":"reference/output-layout/#optimize-only-winml-optimize","title":"Optimize only (winml optimize)","text":"output/\n\u251c\u2500\u2500 optimized.onnx\n\u2514\u2500\u2500 optimized.onnx.data\n"},{"location":"reference/output-layout/#full-build-winml-build","title":"Full build (winml build)","text":"All stages write their intermediate, and model.onnx is a copy of the last successful stage output. If you skip quantization (--no-quant), the final model is a copy of optimized.onnx. If you skip compilation too, it's still a copy of optimized.onnx.
Models of 100 MiB or more store weights in a separate .onnx.data file. Both files must be kept together \u2014 the .onnx file contains a reference to the data file by name.
| Model size | Files |
|---|---|
| < 100 MiB | model.onnx only (weights embedded) |
| \u2265 100 MiB | model.onnx + model.onnx.data |

Warning
If you move model.onnx, always move model.onnx.data alongside it. The ONNX file references the data file by relative path.
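A small helper can make that rule hard to forget. The sketch below is an assumption-laden illustration, not part of winml-cli: `move_model` is a hypothetical name, and it only handles the single `<name>.data` sidecar described above.

```python
import shutil
import tempfile
from pathlib import Path


def move_model(model_path: Path, dest_dir: Path) -> Path:
    """Move an ONNX file and, if present, its .data sidecar into dest_dir."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    sidecar = model_path.with_name(model_path.name + ".data")
    moved = Path(shutil.move(str(model_path), str(dest_dir / model_path.name)))
    if sidecar.exists():
        # Same directory and same file name, so the relative reference still resolves.
        shutil.move(str(sidecar), str(dest_dir / sidecar.name))
    return moved


# Demo with dummy files standing in for a real build output.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "output"
    out_dir.mkdir()
    (out_dir / "model.onnx").write_bytes(b"graph")
    (out_dir / "model.onnx.data").write_bytes(b"weights")
    moved = move_model(out_dir / "model.onnx", Path(tmp) / "deploy")
    sidecar_moved = (moved.parent / "model.onnx.data").exists()
    source_emptied = not any(out_dir.iterdir())
```

Moving both files in one call keeps them in the same directory, which is what the relative reference inside the .onnx file relies on.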
analyze_result.json contains the static analysis output from the build pipeline's analyze stage. It reports EP compatibility and operator classification:
{\n \"analysis_timestamp\": \"2026-06-04T19:45:17.496169\",\n \"metadata\": {\n \"model_path\": \"iter.onnx\",\n \"opset_version\": 17,\n \"producer_name\": \"pytorch\",\n \"producer_version\": \"2.12.0\",\n \"total_operators\": 122,\n \"operator_counts\": {\n \"Conv\": 53,\n \"Relu\": 49,\n \"MaxPool\": 1,\n \"Add\": 16,\n \"GlobalAveragePool\": 1,\n \"Flatten\": 1,\n \"Gemm\": 1\n },\n \"unique_operator_types\": 7,\n \"detected_pattern_count\": {}\n },\n \"results\": [\n {\n \"ihv_type\": \"Microsoft\",\n \"ep_type\": \"CPUExecutionProvider\",\n \"device_type\": \"cpu\",\n \"runtime_support\": false,\n \"has_errors\": false,\n \"has_warnings\": false,\n \"classification\": {\n \"supported\": [],\n \"partial\": [],\n \"unsupported\": [],\n \"unknown\": [\n \"OP/ai.onnx/Conv\",\n \"OP/ai.onnx/Relu\",\n \"OP/ai.onnx/MaxPool\",\n \"OP/ai.onnx/Add\",\n \"OP/ai.onnx/GlobalAveragePool\",\n \"OP/ai.onnx/Flatten\",\n \"OP/ai.onnx/Gemm\"\n ]\n },\n \"information\": []\n }\n ]\n}\n Key fields:
Field Descriptionmetadata.total_operators Total ONNX operator nodes in the model graph metadata.operator_counts Frequency of each operator type metadata.detected_pattern_count Fused subgraph patterns (GeLU, LayerNorm, etc.) results[].ihv_type Hardware vendor (\"Microsoft\", \"QC\", \"Intel\", etc.) results[].runtime_support true if the EP can run all operators results[].classification Operators grouped by support level: supported, partial, unsupported, unknown results[].has_errors true if unsupported ops exist (model won't run on that EP)"},{"location":"reference/output-layout/#build-manifest","title":"Build Manifest","text":"build_manifest.json records provenance for every build:
{\n \"schema_version\": 1,\n \"model_id\": \"microsoft/resnet-50\",\n \"task\": \"image-classification\",\n \"cache_key\": \"a1b2c3d4e5f6\",\n \"config_hash\": \"f7e8d9c0b1a2\",\n \"timestamp\": \"2026-01-15T10:30:00.000000+00:00\",\n \"elapsed_seconds\": 45.1,\n \"final_artifact\": \"model.onnx\",\n \"analyze_iterations\": 2,\n \"analyze_unsupported_node_count\": 0,\n \"analyze_details\": { \"lint\": {}, \"autoconf\": {} },\n \"stages\": [\n {\n \"name\": \"export\",\n \"status\": \"completed\",\n \"filename\": \"export.onnx\",\n \"elapsed_seconds\": 12.5\n },\n {\n \"name\": \"optimize\",\n \"status\": \"completed\",\n \"filename\": \"optimized.onnx\",\n \"elapsed_seconds\": 8.2\n },\n {\n \"name\": \"quantize\",\n \"status\": \"completed\",\n \"filename\": \"quantized.onnx\",\n \"elapsed_seconds\": 15.3,\n \"nodes_quantized\": 150,\n \"nodes_skipped\": 12\n },\n {\n \"name\": \"compile\",\n \"status\": \"completed\",\n \"filename\": \"compiled.onnx\",\n \"elapsed_seconds\": 9.1\n }\n ]\n}\n"},{"location":"reference/output-layout/#rebuild-behavior","title":"Rebuild Behavior","text":"model.onnx already exists and rebuild=False (default), the build is skipped entirely.--rebuild (CLI) or force_rebuild=True (Python API) to force a fresh build..onnx and .onnx.data files are deleted before the pipeline runs.winml-cli can be used as a Python library for programmatic model building and inference. This page documents the public API surface.
"},{"location":"reference/python-api/#quick-example","title":"Quick Example","text":"from winml.modelkit import WinMLAutoModel\n\n# Build and load in one call\nmodel = WinMLAutoModel.from_pretrained(\"microsoft/resnet-50\", device=\"npu\")\noutput = model(pixel_values=images)\n\n# From a local ONNX file\nmodel = WinMLAutoModel.from_onnx(\"model.onnx\", task=\"image-classification\")\n"},{"location":"reference/python-api/#winmlautomodel","title":"WinMLAutoModel","text":"Factory class for automatic model building and loading. Not instantiable directly \u2014 use the class methods.
"},{"location":"reference/python-api/#from_pretrained","title":"from_pretrained()","text":"Build and load a model from a HuggingFace ID or local path. Runs the full pipeline: config \u2192 export \u2192 optimize \u2192 quantize \u2192 compile \u2192 load.
WinMLAutoModel.from_pretrained(\n model_id_or_path: str | Path,\n *,\n task: str | None = None,\n config: WinMLBuildConfig | None = None,\n device: str = \"auto\",\n precision: str = \"auto\",\n cache_dir: str | Path | None = None,\n use_cache: bool = True,\n force_rebuild: bool = False,\n trust_remote_code: bool = False,\n shape_config: dict | None = None,\n no_compile: bool = False,\n) -> WinMLPreTrainedModel\n Parameter Type Default Description model_id_or_path str \\| Path required HuggingFace model ID or path to local model. task str \\| None None Task name. Auto-detected if omitted. config WinMLBuildConfig \\| None None Custom build config. Auto-generated if omitted. device str \"auto\" Target device: \"auto\", \"npu\", \"gpu\", \"cpu\". precision str \"auto\" Precision: \"auto\", \"fp32\", \"fp16\", \"w8a8\", etc. cache_dir str \\| Path \\| None None Cache directory for built artifacts. use_cache bool True Reuse cached build if available. force_rebuild bool False Force rebuild even if cache exists. trust_remote_code bool False Trust remote code from HuggingFace. no_compile bool False Skip the compilation stage. Returns: A task-specific WinMLPreTrainedModel subclass.
from_onnx()","text":"Build from a pre-exported ONNX file. Runs: optimize \u2192 quantize \u2192 compile \u2192 load.
WinMLAutoModel.from_onnx(\n onnx_path: str | Path | dict[str, str | Path],\n *,\n task: str | None = None,\n config: WinMLBuildConfig | None = None,\n device: str = \"auto\",\n precision: str = \"auto\",\n ep: str | None = None,\n cache_dir: str | Path | None = None,\n use_cache: bool = True,\n force_rebuild: bool = False,\n skip_build: bool = False,\n session_options: Any | None = None,\n hf_config: PretrainedConfig | None = None,\n **kwargs: Any,\n) -> WinMLPreTrainedModel | WinMLCompositeModel\n Parameter Type Default Description onnx_path str \\| Path \\| dict required ONNX file path, or dict of submodel paths for composite models. skip_build bool False Load ONNX directly without running optimize/quantize/compile. hf_config PretrainedConfig \\| None None Required for composite models (dict inputs)."},{"location":"reference/python-api/#supported_tasks","title":"supported_tasks()","text":"WinMLAutoModel.supported_tasks() -> list[str]\n Returns all task strings with dedicated inference classes (16 tasks).
"},{"location":"reference/python-api/#build-pipeline-functions","title":"Build Pipeline Functions","text":"Lower-level functions for fine-grained control over the pipeline.
"},{"location":"reference/python-api/#build_hf_model","title":"build_hf_model()","text":"from winml.modelkit.build import build_hf_model\n\nresult = build_hf_model(\n config: WinMLBuildConfig,\n output_dir: Path,\n *,\n model_id: str | None = None,\n pytorch_model: nn.Module | None = None,\n rebuild: bool = False,\n trust_remote_code: bool = False,\n random_init: bool = False,\n cache_key: str | None = None,\n ep: str | None = None,\n device: str | None = None,\n **kwargs: Any,\n) -> BuildResult\n Runs the full pipeline (export \u2192 optimize \u2192 analyze \u2192 quantize \u2192 compile) and writes all artifacts to output_dir.
build_onnx_model()","text":"from winml.modelkit.build import build_onnx_model\n\nresult = build_onnx_model(\n onnx_path: Path | str,\n *,\n config: WinMLBuildConfig,\n output_dir: Path | str,\n rebuild: bool = False,\n ep: str | None = None,\n device: str | None = None,\n **kwargs: Any,\n) -> BuildResult\n Builds from an existing ONNX file (skips export).
"},{"location":"reference/python-api/#buildresult","title":"BuildResult","text":"@dataclass\nclass BuildResult:\n output_dir: Path # Directory containing all artifacts\n final_onnx_path: Path # Path to final model.onnx\n config_path: Path # Path to winml_build_config.json\n stages_completed: list[str] # e.g., [\"export\", \"optimize\", \"quantize\"]\n stages_skipped: list[str]\n stage_timings: dict[str, float] # Per-stage seconds\n elapsed: float # Total build time (seconds)\n reused: bool # True if cache hit, no build ran\n manifest_path: Path | None # Path to build_manifest.json\n"},{"location":"reference/python-api/#config-generation","title":"Config Generation","text":""},{"location":"reference/python-api/#generate_build_config","title":"generate_build_config()","text":"from winml.modelkit.config import generate_build_config\n\nconfig = generate_build_config(\n model_id: str | None = None,\n *,\n task: str | None = None,\n model_class: str | None = None,\n model_type: str | None = None,\n module: str | None = None,\n override: WinMLBuildConfig | None = None,\n shape_config: dict | None = None,\n library_name: str = \"transformers\",\n device: str = \"auto\",\n precision: str = \"auto\",\n trust_remote_code: bool = False,\n ep: str | None = None,\n onnx_path: str | Path | None = None,\n) -> WinMLBuildConfig | list[WinMLBuildConfig]\n Auto-generates a complete build config by probing the model's config.json (does not download weights). Equivalent to what winml config produces. Returns a list when module is specified (one config per submodule).
All inference models inherit from WinMLPreTrainedModel and are HuggingFace pipeline-compatible.
WinMLPreTrainedModel (Base)","text":"class WinMLPreTrainedModel:\n def __call__(self, **kwargs) -> Any: ...\n def perf(self, warmup: int = 0) -> ContextManager: ...\n\n @property\n def device(self) -> str: ...\n @property\n def ep_name(self) -> str | None: ...\n @property\n def io_config(self) -> dict: ...\n @property\n def task(self) -> str | None: ...\n"},{"location":"reference/python-api/#task-specific-classes","title":"Task-Specific Classes","text":"Class Task WinMLModelForImageClassification image-classification WinMLModelForSequenceClassification text-classification WinMLModelForImageSegmentation image-segmentation WinMLModelForSemanticSegmentation semantic-segmentation WinMLModelForObjectDetection object-detection WinMLModelForFeatureExtraction feature-extraction WinMLModelForQuestionAnswering question-answering WinMLModelForZeroShotImageClassification zero-shot-image-classification WinMLModelForGenericTask fallback (raw outputs)"},{"location":"reference/python-api/#performance-tracking","title":"Performance Tracking","text":"model = WinMLAutoModel.from_pretrained(\"microsoft/resnet-50\", device=\"npu\")\n\nwith model.perf(warmup=5) as stats:\n for img in test_images:\n model(pixel_values=img)\n\nprint(f\"P99 latency: {stats.p99_ms:.2f} ms\")\n"},{"location":"reference/python-api/#see-also","title":"See also","text":"Windows ML CLI has validated a set of models for compatibility across all Execution Providers (EPs)\u2014see the full Model Accuracy Report.
winml-cli supports a wide range of model architectures and tasks. This page lists what's validated and how to discover model support.
"},{"location":"reference/supported-models/#discovery-commands","title":"Discovery Commands","text":"# Browse the curated catalog (64 validated models)\nuv run winml catalog\n\n# Filter by task\nuv run winml catalog -t image-classification\n\n# Check if a specific model is supported\nuv run winml inspect -m microsoft/resnet-50\n\n# List all known tasks\nuv run winml inspect --list-tasks\n"},{"location":"reference/supported-models/#supported-tasks","title":"Supported Tasks","text":"winml-cli recognizes 35 task types across vision, NLP, audio, and multimodal domains. Of these, 16 have dedicated inference classes; the remainder are supported via the generic task fallback.
"},{"location":"reference/supported-models/#vision","title":"Vision","text":"Task Example Modelsimage-classification ResNet, ConvNeXt, ViT, Swin image-segmentation Segformer, Mask2Former semantic-segmentation Segformer object-detection DETR, YOLOS, Table-Transformer depth-estimation Depth Anything, ZoeDepth image-feature-extraction DINOv2, ViT zero-shot-image-classification CLIP, SigLIP"},{"location":"reference/supported-models/#nlp","title":"NLP","text":"Task Example Models text-classification BERT, RoBERTa, XLM-RoBERTa token-classification BERT, RoBERTa (NER) question-answering BERT, RoBERTa fill-mask BERT, RoBERTa feature-extraction BGE, BERT, all-MiniLM text-generation Qwen3 (composite) text2text-generation T5, BART, Marian"},{"location":"reference/supported-models/#audio","title":"Audio","text":"Task Example Models automatic-speech-recognition Whisper audio-classification Wav2Vec2"},{"location":"reference/supported-models/#multimodal","title":"Multimodal","text":"Task Example Models zero-shot-image-classification CLIP (text + vision) image-to-text VisionEncoderDecoder visual-question-answering BLIP"},{"location":"reference/supported-models/#validated-model-catalog","title":"Validated Model Catalog","text":"The following models have been validated end-to-end with EP compatibility testing. Use winml catalog to browse the full list interactively.
apple/mobilevit-small MobileViT dima806/fairface_age_image_detection ViT facebook/convnext-tiny-224 ConvNeXt google/vit-base-patch16-224 ViT microsoft/resnet-18 ResNet microsoft/resnet-50 ResNet microsoft/swin-large-patch4-window7-224 Swin rizvandwiki/gender-classification ViT"},{"location":"reference/supported-models/#image-feature-extraction","title":"Image Feature Extraction","text":"Model Architecture facebook/dino-vitb16 ViT facebook/dino-vits16 ViT facebook/dinov2-small DINOv2 google/vit-base-patch16-224-in21k ViT"},{"location":"reference/supported-models/#feature-extraction-text","title":"Feature Extraction (Text)","text":"Model Architecture BAAI/bge-base-en-v1.5 BERT BAAI/bge-m3 XLM-RoBERTa BAAI/bge-small-en-v1.5 BERT google-bert/bert-base-multilingual-cased BERT Intel/bert-base-uncased-mrpc BERT laion/CLIP-ViT-B-32-laion2B-s34B-b79K CLIP openai/clip-vit-base-patch16 CLIP openai/clip-vit-base-patch32 CLIP sentence-transformers/all-MiniLM-L6-v2 BERT sentence-transformers/all-mpnet-base-v2 MPNet sentence-transformers/multi-qa-mpnet-base-dot-v1 MPNet sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 BERT"},{"location":"reference/supported-models/#sentence-similarity","title":"Sentence Similarity","text":"Model Architecture BAAI/bge-base-en-v1.5 BERT BAAI/bge-large-en-v1.5 BERT BAAI/bge-m3 XLM-RoBERTa BAAI/bge-small-en-v1.5 BERT sentence-transformers/all-MiniLM-L6-v2 BERT sentence-transformers/all-mpnet-base-v2 MPNet sentence-transformers/multi-qa-mpnet-base-dot-v1 MPNet sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 BERT sentence-transformers/paraphrase-multilingual-mpnet-base-v2 XLM-RoBERTa"},{"location":"reference/supported-models/#fill-mask","title":"Fill-Mask","text":"Model Architecture distilbert/distilbert-base-uncased DistilBERT FacebookAI/roberta-base RoBERTa FacebookAI/roberta-large RoBERTa FacebookAI/xlm-roberta-base XLM-RoBERTa google-bert/bert-base-multilingual-cased BERT google-bert/bert-base-multilingual-uncased BERT 
google-bert/bert-base-uncased BERT"},{"location":"reference/supported-models/#text-classification","title":"Text Classification","text":"Model Architecture cardiffnlp/twitter-roberta-base-sentiment-latest RoBERTa distilbert/distilbert-base-uncased-finetuned-sst-2-english DistilBERT Intel/bert-base-uncased-mrpc BERT ProsusAI/finbert BERT"},{"location":"reference/supported-models/#token-classification","title":"Token Classification","text":"Model Architecture Babelscape/wikineural-multilingual-ner BERT dbmdz/bert-large-cased-finetuned-conll03-english BERT dslim/bert-base-NER BERT Isotonic/distilbert_finetuned_ai4privacy_v2 DistilBERT w11wo/indonesian-roberta-base-posp-tagger RoBERTa"},{"location":"reference/supported-models/#question-answering","title":"Question Answering","text":"Model Architecture deepset/bert-large-uncased-whole-word-masking-squad2 BERT deepset/roberta-base-squad2 RoBERTa deepset/tinyroberta-squad2 RoBERTa distilbert/distilbert-base-cased-distilled-squad DistilBERT distilbert/distilbert-base-uncased-distilled-squad DistilBERT google-bert/bert-large-uncased-whole-word-masking-finetuned-squad BERT"},{"location":"reference/supported-models/#zero-shot-classification","title":"Zero-Shot Classification","text":"Model Architecture joeddav/xlm-roberta-large-xnli XLM-RoBERTa"},{"location":"reference/supported-models/#zero-shot-image-classification","title":"Zero-Shot Image Classification","text":"Model Architecture openai/clip-vit-base-patch16 CLIP"},{"location":"reference/supported-models/#image-segmentation","title":"Image Segmentation","text":"Model Architecture mattmdjaga/segformer_b2_clothes Segformer nvidia/segformer-b1-finetuned-ade-512-512 Segformer nvidia/segformer-b2-finetuned-ade-512-512 Segformer nvidia/segformer-b5-finetuned-ade-640-640 Segformer"},{"location":"reference/supported-models/#image-to-text","title":"Image-to-Text","text":"Model Architecture microsoft/trocr-base-handwritten VisionEncoderDecoder microsoft/trocr-base-printed 
VisionEncoderDecoder microsoft/trocr-large-handwritten VisionEncoderDecoder"},{"location":"reference/supported-models/#execution-provider-compatibility","title":"Execution Provider Compatibility","text":"Each validated model is tested against available EPs:
EP Alias Devices Notes NvTensorRTRTXExecutionProvider nvtensorrtrtx, nv_tensorrt_rtx GPU NVIDIA TensorRT-RTX; NVIDIA GPU with TensorRT runtime CUDAExecutionProvider cuda GPU NVIDIA CUDA; any CUDA-capable GPU MIGraphXExecutionProvider migraphx GPU AMD ROCm MIGraphX QNNExecutionProvider qnn NPU, GPU Qualcomm Snapdragon; bundled in ORT OpenVINOExecutionProvider openvino NPU, GPU, CPU Intel hardware DmlExecutionProvider dml GPU DirectML; any DirectX 12 GPU CPUExecutionProvider cpu CPU Always available VitisAIExecutionProvider vitisai NPU AMD/Xilinx"},{"location":"reference/supported-models/#adding-unsupported-models","title":"Adding Unsupported Models","text":"If your model architecture isn't in the catalog, winml-cli may still support it through auto-detection:
# Try inspecting first\nuv run winml inspect -m your-org/your-model\n\n# If \"Status: Supported\", proceed normally\nuv run winml build -m your-org/your-model -d auto -o output/\n For truly custom architectures, use --trust-remote-code to allow execution of model code from the Hugging Face Hub.
BERT (bert-base-uncased) is a canonical text model that exercises every stage of the winml-cli pipeline: it has multiple input tensors, benefits from graph fusion (GeLU, LayerNorm, MatMul+Add), and produces quantizable activations that run well on NPU. That combination makes it a useful reference point for teams deploying transformer encoders on Windows.
This sample walks through the production-style workflow: generate a reusable WinMLBuildConfig JSON file with winml config, run the full export \u2192 optimize \u2192 quantize \u2192 compile pipeline in one shot with winml build, and measure the result with winml perf. If you want to understand each pipeline stage individually before running the all-in-one command, read the Hugging Face Model to NPU tutorial first.
winml on your PATH.winml config -m bert-base-uncased -t text-classification -o bert_config.json\n This writes a WinMLBuildConfig JSON file to bert_config.json. The file captures every pipeline setting in a single artifact that you can version-control and share. A representative excerpt looks like this:
{\n \"loader\": {\n \"task\": \"text-classification\",\n \"model_class\": \"AutoModelForSequenceClassification\",\n \"model_type\": \"bert\"\n },\n \"export\": {\n \"opset_version\": 17,\n \"batch_size\": 1\n .. // truncated: input_tensors, output_tensors\n },\n \"optim\": {\n \"clamp_constant_values\": true\n },\n \"quant\": {\n \"mode\": \"qdq\",\n \"weight_type\": \"uint8\",\n \"activation_type\": \"uint16\",\n \"samples\": 10,\n \"calibration_method\": \"minmax\",\n \"task\": \"text-classification\",\n \"model_name\": \"bert-base-uncased\"\n ... // truncated: per_channel, symmetric, distribution, ...\n },\n \"compile\": null\n}\n Note
The five top-level keys \u2014 loader, export, optim, quant, and compile \u2014 map directly to the five pipeline stages. Setting quant or compile to null skips that stage entirely. See Config and build for a field-by-field description of every option.
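The note above says that setting quant or compile to null skips that stage. As a minimal sketch of that edit, the snippet below rewrites a config file with Python's standard `json` module. The `skip_stage` helper and the abbreviated stand-in config are hypothetical and exist only for illustration; a file produced by `winml config` carries many more fields.

```python
import json
import tempfile
from pathlib import Path

SKIPPABLE = ("quant", "compile")  # the two keys the note says may be null


def skip_stage(config_path: Path, stage: str) -> dict:
    """Hypothetical helper: set one stage key to null in a build config."""
    if stage not in SKIPPABLE:
        raise ValueError(f"stage must be one of {SKIPPABLE}, got {stage!r}")
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config[stage] = None  # json.dumps writes None as null
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


# Abbreviated stand-in for a generated config (not real winml output).
demo = {
    "loader": {"task": "text-classification"},
    "export": {"opset_version": 17},
    "optim": {"clamp_constant_values": True},
    "quant": {"mode": "qdq"},
    "compile": None,
}
path = Path(tempfile.mkdtemp()) / "bert_config.json"
path.write_text(json.dumps(demo, indent=2), encoding="utf-8")

updated = skip_stage(path, "quant")
print(updated["quant"])
```

Editing the file this way changes every later build that uses it, so keep the edited config under version control next to the original.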
winml build -c bert_config.json -m bert-base-uncased --output-dir bert_out/\n winml-cli reads the config, downloads the model weights once, and runs the pipeline in sequence. Terminal output shows each stage as it completes:
winml build\n Config: bert_config.json\n Model: bert-base-uncased\n Output: bert_out/\n\n export done (42.1s)\n optimize done (6.3s)\n quantize done (18.7s)\n compile done (21.4s)\n\n Build complete in 88.5s\n Final artifact: bert_out/model.onnx\n Note
After the optimize stage, winml-cli runs an analyzer loop that inspects the graph for nodes the target EP cannot dispatch natively and re-runs optimization with adjusted fusion flags. The loop repeats up to --max-optim-iterations times (default: 3). Pass --no-optimize to skip this stage entirely when starting from a pre-optimized ONNX file. See How winml-cli Works for a full description of the autoconf loop.
winml perf -m bert_out/model.onnx --iterations 50\n After a short warm-up, winml perf reports latency percentiles and throughput:
Device: npu\nTask: text-classification\nIterations: 50 (+ 10 warmup)\nBatch Size: 1\n\nLatency (ms)\n Avg P50 P90 P95 P99 Min Max Std\n 4.83 4.79 5.12 5.31 5.68 4.51 6.04 0.21\n\nThroughput: 206.99 samples/sec\n\nResults saved to: model_perf.json\n"},{"location":"samples/bert-config-build/#customizing-the-config","title":"Customizing the config","text":"The JSON file is plain text and can be edited before running winml build. Two common adjustments:
Change precision. To target fp16 instead of the default uint8 QDQ quantization, regenerate the config with an explicit precision flag:
winml config -m bert-base-uncased -t text-classification --precision fp16 -o bert_config.json\n Alternatively, edit bert_config.json directly: set quant.weight_type and quant.activation_type to \"int8\" or \"uint16\", or set quant to null to skip quantization entirely.
Disable a stage at build time. You can suppress a stage for a single run, without touching the config file, by passing a flag such as --no-quant:
winml build -c bert_config.json -m bert-base-uncased --output-dir bert_out/ --no-quant \n This is useful for measuring the fp32 baseline before committing to a quantized build. The quant section in bert_config.json is unchanged; the flag only affects this invocation. See Config and build for the full list of configurable fields.
winml config generates a complete, version-controllable WinMLBuildConfig JSON from a HuggingFace model ID in one command. winml build orchestrates the full export \u2192 optimize \u2192 quantize \u2192 compile pipeline from a single config file and model ID. winml perf gives a latency and throughput baseline on the built artifact in seconds. CLIP (openai/clip-vit-base-patch32) is a dual-encoder vision-language model: one tower encodes images, the other encodes text, and both project into a shared embedding space. winml-cli treats it as a composite model \u2014 a model that is split into multiple ONNX sub-models that run together at inference time. For CLIP, the two sub-models are:
image-encoder Encodes images into embeddings pixel_values [1, 3, 224, 224] image_embeds [1, 512] text-encoder Encodes text labels into embeddings input_ids [1, 77] text_embeds [1, 512] Zero-shot classification is achieved by embedding the image and the candidate text labels, then ranking the labels by the cosine similarity between their embeddings. Splitting the towers into two ONNX graphs lets each encoder have fully static shapes (required for efficient NPU compilation) and lets you build, cache, and benchmark them independently.
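The ranking step described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the composite wrapper's actual code: the three-dimensional vectors are toy stand-ins for the 512-dimensional `image_embeds` and `text_embeds`, and the `cosine` and `rank_labels` helpers are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_labels(image_embed, text_embeds):
    """Return (label, score) pairs sorted from most to least similar."""
    scores = {label: cosine(image_embed, vec) for label, vec in text_embeds.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy embeddings; real ones come from the two encoders.
image_embed = [0.9, 0.1, 0.0]
text_embeds = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "a photo of a dog": [0.0, 1.0, 0.0],
    "a photo of a car": [0.0, 0.0, 1.0],
}

ranking = rank_labels(image_embed, text_embeds)
for label, score in ranking:
    print(f"{label}: {score:.3f}")
```

Because cosine similarity normalizes both vectors, only the direction of each embedding matters, which is why the text labels can be embedded once and reused for many images.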
"},{"location":"samples/clip-composite/#prerequisites","title":"Prerequisites","text":"winml on your PATH.The composite model architecture for CLIP:
graph LR\n A[winml config] -->|\"(clip, zero-shot-image-classification)\"| B[Composite Registry]\n B --> C[image-encoder config]\n B --> D[text-encoder config]\n C --> E[winml build \u2192 image-encoder.onnx]\n D --> F[winml build \u2192 text-encoder.onnx]\n E --> G[WinMLAutoModel]\n F --> G\n G -->|logits_per_image| H[Classification scores]"},{"location":"samples/clip-composite/#step-1-generate-build-configs","title":"Step 1: Generate build configs","text":"winml config -m openai/clip-vit-base-patch32 --task zero-shot-image-classification -o clip.json\n Because (clip, zero-shot-image-classification) is registered as a composite model, this command produces two config files \u2014 one per sub-model:
clip_image-encoder.json \u2014 export config using the image-feature-extraction task. clip_text-encoder.json \u2014 export config using the feature-extraction task. Each config includes CLIP-specific optimizations (GELU fusion, LayerNorm fusion, MatMul+Add fusion, and clamp constant values).
"},{"location":"samples/clip-composite/#step-2-build-each-sub-model","title":"Step 2: Build each sub-model","text":"Build both sub-models individually using their config files:
# Build the image encoder\nwinml build -c clip_image-encoder.json -m openai/clip-vit-base-patch32 -o output/image-encoder\n\n# Build the text encoder\nwinml build -c clip_text-encoder.json -m openai/clip-vit-base-patch32 -o output/text-encoder\n Each winml build runs the full pipeline: export \u2192 optimize \u2192 quantize \u2192 compile. The output directories contain the final ONNX files ready for inference.
To target a specific execution provider (e.g., QNN for NPU):
winml build -c clip_image-encoder.json -m openai/clip-vit-base-patch32 -o output/image-encoder --ep qnn\nwinml build -c clip_text-encoder.json -m openai/clip-vit-base-patch32 -o output/text-encoder --ep qnn\n"},{"location":"samples/clip-composite/#step-3-benchmark-each-sub-model","title":"Step 3: Benchmark each sub-model","text":"winml perf output/image-encoder -d npu\nwinml perf output/text-encoder -d npu\n This lets you identify whether the image or text encoder is the bottleneck on your target hardware.
"},{"location":"samples/clip-composite/#step-4-run-inference-python-api","title":"Step 4: Run inference (Python API)","text":"There are two ways to get a ready-to-run model. Both return the same WinMLModelForZeroShotImageClassification \u2014 a single object that orchestrates the two encoders and combines their projected embeddings into similarity scores \u2014 so the inference code afterward is identical.
Option 1 \u2014 Load the ONNX files built in Step 2 (skips re-export/optimization). Pass a dict mapping each component name to its built model.onnx, plus the HF config so the composite registry can resolve (clip, zero-shot-image-classification):
from transformers import AutoConfig\n\nfrom winml.modelkit.models import WinMLAutoModel\n\nmodel = WinMLAutoModel.from_onnx(\n {\n \"image-encoder\": \"output/image-encoder/model.onnx\",\n \"text-encoder\": \"output/text-encoder/model.onnx\",\n },\n task=\"zero-shot-image-classification\",\n hf_config=AutoConfig.from_pretrained(\"openai/clip-vit-base-patch32\"),\n skip_build=True,\n)\n Option 2 \u2014 Build both encoders from the HuggingFace model in one call. WinMLAutoModel.from_pretrained detects the composite task and runs the full pipeline for each sub-model:
from winml.modelkit.models import WinMLAutoModel\n\nmodel = WinMLAutoModel.from_pretrained(\n \"openai/clip-vit-base-patch32\",\n task=\"zero-shot-image-classification\",\n)\n Either way, run inference the same way \u2014 prepare an image plus candidate labels with the HF processor, then call the model:
from PIL import Image\nfrom transformers import CLIPProcessor\n\nprocessor = CLIPProcessor.from_pretrained(\"openai/clip-vit-base-patch32\")\nimage = Image.open(\"cat.jpg\")\nlabels = [\"a photo of a cat\", \"a photo of a dog\", \"a photo of a car\"]\ninputs = processor(text=labels, images=image, return_tensors=\"pt\", padding=True)\n\n# Run both encoders and combine into per-label similarity scores\noutputs = model(**inputs)\nprobs = outputs.logits_per_image.softmax(dim=-1)\nfor label, p in zip(labels, probs[0].tolist()):\n print(f\"{label}: {p:.4f}\")\n The text encoder's fixed sequence length (77) is handled for you \u2014 the processor's tokens are padded or truncated to match the ONNX graph before each run.
"},{"location":"samples/clip-composite/#customizing-shape-config-per-sub-model","title":"Customizing shape config per sub-model","text":"Each encoder takes its own shape_config, passed through sub_model_kwargs. The image encoder accepts vision keys (height, width); the text encoder accepts text keys (sequence_length):
model = WinMLAutoModel.from_pretrained(\n \"openai/clip-vit-base-patch32\",\n task=\"zero-shot-image-classification\",\n sub_model_kwargs={\n \"image-encoder\": {\"shape_config\": {\"height\": 224, \"width\": 224}},\n \"text-encoder\": {\"shape_config\": {\"sequence_length\": 77}},\n },\n)\n"},{"location":"samples/clip-composite/#other-composite-models","title":"Other composite models","text":"The same composite model pattern is used for:
google/siglip-base-patch16-224) \u2014 dual-encoder zero-shot image classification; shares the same composite wrapper as CLIPgoogle-t5/t5-small) \u2014 encoder + decoder for translation/summarizationfacebook/bart-large-cnn) \u2014 encoder + decoder for summarization and table-question-answering (TAPEX)Helsinki-NLP/opus-mt-en-de) \u2014 encoder + decoder for translationQwen/Qwen3-0.6B) \u2014 prefill + generation decoders for text generationSalesforce/blip-image-captioning-base) \u2014 vision encoder + text decoder for image-to-text captioningmicrosoft/trocr-base-handwritten) \u2014 vision encoder + text decoder for image-to-text (TrOCR, Donut)Tutorials are linear, prescriptive, end-to-end walkthroughs that guide you through building something concrete with winml-cli. Each tutorial moves in one direction\u2014start to finish\u2014so you can follow along without making decisions. If you need to understand the reasoning behind a feature, see the Concepts section (the why and when). If you need a quick reference for a specific command, see Commands (the what). Tutorials sit alongside Samples, which are reference-style demos that compare multiple approaches side by side rather than walking through a single path.
More tutorials are coming, covering additional model families, execution providers, and deployment scenarios. Check back as the winml-cli documentation expands.
This tutorial walks you through the complete workflow for optimizing, analyzing, and deploying an ONNX model you already have \u2014 whether you exported it yourself (torch.onnx.export, ONNX Runtime tools), received it from a teammate, or downloaded it from the ONNX Model Zoo.
Unlike the Hugging Face Model to NPU tutorial, which starts from a Hugging Face model ID, this tutorial assumes you already have a .onnx file on disk and want to make it run faster on your target hardware.
The tutorial is split into two sections. Section A walks through the analyze \u2192 optimize \u2192 re-analyze loop using primitive commands, teaching you how the optimization feedback cycle works. Section B shows how winml build automates that same loop in a single command, optionally targeting NPU with quantization.
pip install uv or follow astral.sh/uv)my_model.onnx as a placeholder; substitute your own fileNo NPU? Set --device cpu wherever you see --device npu. Every other flag stays the same.
Working through the primitive commands one at a time reveals how the analyze\u2013optimize feedback cycle works. Each command accepts the output of the previous step as input, and every intermediate artifact is available for inspection.
"},{"location":"tutorials/build-from-onnx/#step-1-analyze-the-original-model","title":"Step 1: Analyze the original model","text":"Before any optimization, run the static analyzer to understand your model's EP compatibility and get optimization recommendations:
uv run winml analyze --model my_model.onnx --optim-config optim_config.json\n The analyzer classifies every operator in the graph as supported, partial, unsupported, or unknown for each available EP. It also detects fusible subgraph patterns and writes the recommended optimization flags to optim_config.json.
To target a specific EP:
uv run winml analyze --model my_model.onnx --ep qnn --device npu --optim-config optim_config.json\n The output shows per-EP compatibility results:
\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\n ANALYSIS SUMMARY\n\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\n QNNExecutionProvider (NPU): 122/0/0/0\n Ready to deploy\n If the analyzer detects fusible patterns (GeLU, LayerNorm, etc.), they will appear in the output and the optim_config.json will contain the recommended fusion settings. If no patterns are detected (as with simple architectures like ResNet), the config will be empty {}.
What we just did
The analyzer performs static analysis \u2014 no runtime or hardware required. It tells you two things: (1) whether the model can run on your target EP at all, and (2) whether there are graph patterns that the optimizer can fuse to improve performance. The --optim-config flag writes a JSON file with the exact optimization settings the optimizer needs. The four counts in the summary line (122/0/0/0 above) are S/P/U/Unk: Supported/Partial/Unsupported/Unknown.
Pass the analyzer's output config directly to the optimizer:
uv run winml optimize -m my_model.onnx -c optim_config.json -o my_model_optimized.onnx\n The optimizer applies the fusions specified in the config and reports how many nodes it reduced:
Input: my_model.onnx\nOutput: my_model_optimized.onnx\n\nSuccess! Model optimized: my_model_optimized.onnx\nNodes: 122 -> 122 (0.0% reduction)\n Tip
The node reduction depends on your model's architecture. Simple models like ResNet (only Conv, Relu, Add) have no fusible patterns. Transformer-based models (BERT, ViT) typically see 10\u201330% node reduction from GeLU, LayerNorm, and Attention fusions.
What we just did
Graph optimization fuses multi-node patterns (like the 5-node GeLU/Erf sequence) into single high-level operators that EPs can execute more efficiently. The optimizer is purely a graph transformation \u2014 it doesn't change the model's numerical behavior or require calibration data. Running it before quantization is important: calibration should be performed on the already-fused topology, not the verbose original graph.
"},{"location":"tutorials/build-from-onnx/#step-3-re-analyze-the-optimized-model","title":"Step 3: Re-analyze the optimized model","text":"Run the analyzer again on the optimized output to confirm that the fusions resolved and no new issues appeared:
uv run winml analyze --model my_model_optimized.onnx --ep qnn --device npu\n If the original analysis found fusible patterns that were optimized away, this run should show zero detected patterns and the same or better EP compatibility score.
What we just did
The analyze \u2192 optimize \u2192 re-analyze cycle is the fundamental feedback loop in winml-cli. In Section B you'll see that winml build automates this loop \u2014 it calls the analyzer, applies recommendations, re-analyzes, and repeats until convergence (typically 1\u20133 iterations). Doing it manually here teaches you what the automation is actually doing under the hood.
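The feedback loop can be pictured with a toy model in Python. Everything here is a hypothetical stand-in: the graph is a plain list of operator names, and `analyze` and `optimize` are not the winml-cli functions. Only the loop shape (analyze, apply recommendations, re-analyze, stop on convergence or after the iteration cap) follows the description above.

```python
MAX_ITERATIONS = 3  # mirrors the documented default iteration cap
FUSIBLE = {"gelu", "layernorm"}  # toy stand-ins for fusible patterns


def analyze(graph):
    """Toy analyzer: list the fusible patterns still present."""
    return sorted(op for op in graph if op in FUSIBLE)


def optimize(graph, patterns):
    """Toy optimizer: replace each reported pattern with a fused node."""
    return [f"fused_{op}" if op in patterns else op for op in graph]


def run_loop(graph):
    """Repeat analyze -> optimize until nothing is left to fuse."""
    passes = 0
    while passes < MAX_ITERATIONS:
        patterns = analyze(graph)
        if not patterns:  # converged: nothing more to fuse
            break
        graph = optimize(graph, patterns)
        passes += 1
    return graph, passes


final_graph, passes = run_loop(["matmul", "gelu", "layernorm", "add"])
print(final_graph, passes)
```

In this toy run a single optimization pass removes both patterns, and the second analysis confirms convergence.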
Insert QDQ (Quantize-Dequantize) nodes into the optimized graph using static calibration:
uv run winml quantize -m my_model_optimized.onnx -o my_model_int8.onnx --precision int8 --samples 32\n The quantizer generates 32 random calibration samples, runs them through the model to collect activation statistics, and uses those statistics to set the quantization scale and zero-point for each tensor.
What we just did
--precision int8 sets both weights and activations to 8-bit integers, which is the precision most NPU compilers expect. The output model still contains standard QuantizeLinear and DequantizeLinear ONNX nodes, so it is portable and can run on any ONNX Runtime backend. See Concepts \u2192 Quantization and QDQ for calibration methods and per-channel options.
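To make "scale and zero-point" concrete, here is a minimal sketch of standard asymmetric (affine) 8-bit quantization derived from a tensor's observed minimum and maximum. It is a textbook formulation, not the winml-cli calibration code; the function names and the uint8 range are assumptions chosen for illustration.

```python
def qparams(min_val, max_val, qmin=0, qmax=255):
    """Derive scale and zero-point from observed activation statistics."""
    # Widen the range so that 0.0 is exactly representable.
    min_val = min(min_val, 0.0)
    max_val = max(max_val, 0.0)
    scale = (max_val - min_val) / (qmax - qmin)
    if scale == 0.0:  # constant-zero tensor: any positive scale works
        scale = 1.0
    zero_point = int(round(qmin - min_val / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """QuantizeLinear-style mapping: round, shift, then saturate."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """DequantizeLinear-style mapping back to floating point."""
    return (q - zero_point) * scale


scale, zero_point = qparams(-1.0, 3.0)
for value in (-1.0, 0.0, 3.0):
    q = quantize(value, scale, zero_point)
    print(value, q, dequantize(q, scale, zero_point))
```

More calibration samples give more reliable minimum and maximum estimates, which is one reason increasing the sample count can reduce accuracy loss.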
Compilation converts the portable quantized ONNX into an EP-specific binary format that the execution provider can load directly, skipping JIT compilation at inference time:
# Qualcomm NPU\nuv run winml compile -m my_model_int8.onnx --device npu --ep qnn\n # Intel NPU\nuv run winml compile -m my_model_int8.onnx --device npu --ep openvino\n # AMD NPU\nuv run winml compile -m my_model_int8.onnx --device npu --ep vitisai\n # CPU\nuv run winml compile -m my_model_int8.onnx --device cpu\n What we just did
Compilation embeds EP context \u2014 the compiled binary \u2014 inside or alongside the ONNX file using the EPContext node convention. At inference time the runtime loads the pre-compiled binary directly rather than re-compiling from the ONNX graph. See Concepts \u2192 Compile and EPContext for details.
Measure the performance of your model:
# Optimized (CPU)\nuv run winml perf -m my_model_optimized.onnx --device cpu --warmup 5 --iterations 50\n # Compiled (NPU)\nuv run winml perf -m my_model_int8_npu_ctx.onnx --device npu --iterations 50 --monitor\n What we just did
winml perf generates random inputs matching the model's I/O spec, runs warmup iterations (excluded from statistics), then the benchmark iterations, and reports full latency percentiles alongside throughput. The --monitor flag activates live hardware utilization polling. See Concepts \u2192 Perf and monitoring for details.
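The reported statistics can be approximated from raw per-iteration timings as follows. This is a sketch under assumptions: it uses the nearest-rank percentile definition and a batch size of 1, and the `summarize` helper is hypothetical; the percentile method winml perf uses internally is not stated here and may differ.

```python
import math
import statistics


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(latencies_ms, warmup):
    """Drop warmup iterations, then compute latency and throughput stats."""
    measured = sorted(latencies_ms[warmup:])
    avg = statistics.fmean(measured)
    return {
        "avg": avg,
        "p50": percentile(measured, 50),
        "p90": percentile(measured, 90),
        "p99": percentile(measured, 99),
        "min": measured[0],
        "max": measured[-1],
        "throughput": 1000.0 / avg,  # samples/sec at batch size 1
    }


# Two slow warmup runs followed by four measured runs (made-up numbers).
stats = summarize([9.0, 8.0, 4.0, 5.0, 6.0, 5.0], warmup=2)
print(stats)
```

Excluding warmup matters because the first runs often include one-time costs such as session initialization, which would otherwise inflate the tail percentiles.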
winml build","text":"Once you understand the analyze \u2192 optimize \u2192 re-analyze loop (which you now do), you can let winml build handle everything in one command. When you pass a .onnx file, winml-cli auto-detects it and skips the export stage \u2014 running the optimization loop, quantization, and compilation automatically.
uv run winml build -m my_model.onnx -o output/ --device npu --precision int8\n Config file is optional
The -c config.json flag is optional. Without it, winml build auto-generates an internal config from the flags you pass (like --device and --precision). If you need a reusable config, generate one with winml config:
uv run winml config --onnx my_model.onnx -d npu --precision int8 -o config.json\nuv run winml build -m my_model.onnx -c config.json -o output/\n The pipeline runs: analyze \u2192 optimize \u2192 (re-analyze \u2192 re-optimize if needed) \u2192 quantize \u2192 compile \u2192 model.onnx. The output directory looks like:
output/\n\u251c\u2500\u2500 model.onnx \u2190 FINAL: deploy this\n\u251c\u2500\u2500 my_model.onnx \u2190 Copy of your input\n\u251c\u2500\u2500 my_model_optimized.onnx \u2190 After optimization loop converged\n\u251c\u2500\u2500 my_model_quantized.onnx \u2190 After INT8 quantization\n\u251c\u2500\u2500 my_model_compiled.onnx \u2190 After EP compilation\n\u251c\u2500\u2500 winml_build_config.json \u2190 Config used (including auto-detected options)\n\u2514\u2500\u2500 analyze_result.json \u2190 Analysis from optimize stage\n You can selectively skip stages using the override flags:
--no-optimize \u2014 skip graph optimization (rarely needed; useful if you have a pre-optimized ONNX)
--no-quant \u2014 skip quantization (produces a floating-point compiled model)
--no-compile \u2014 skip compilation (produces a quantized but not device-locked ONNX)
For example, to produce an optimized model without quantization or compilation:
uv run winml build -m my_model.onnx -o output/ --device cpu --no-quant --no-compile\n What we just did
winml build is the production workflow. It guarantees that stages run in the correct order, passes intermediate artifacts through the pipeline automatically, and records which stages completed or were skipped in the result summary.
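A short post-build check can confirm that the artifacts from the example output tree were actually written. The sketch below is illustrative and not part of winml-cli: the helper name missing_artifacts is hypothetical, and the file names are copied from the example tree for an input called my_model.onnx, so they will differ for other inputs or when stages are skipped.

```python
from pathlib import Path

# File names copied from the example output tree for an input named my_model.onnx.
# Assumption: a full build (no stages skipped) writes all of them.
EXPECTED_ARTIFACTS = [
    'model.onnx',
    'my_model.onnx',
    'my_model_optimized.onnx',
    'my_model_quantized.onnx',
    'my_model_compiled.onnx',
    'winml_build_config.json',
    'analyze_result.json',
]


def missing_artifacts(output_dir, expected=EXPECTED_ARTIFACTS):
    # Return the expected file names that are absent from output_dir.
    root = Path(output_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == '__main__':
    missing = missing_artifacts('output')
    print('all artifacts present' if not missing else f'missing: {missing}')
```

An empty list means every file from the example tree exists; anything else names the stage output to investigate first.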
Once the build completes, benchmark the final artifact:
uv run winml perf -m output/model.onnx --device npu --iterations 50 --monitor\n"},{"location":"tutorials/build-from-onnx/#using-the-python-api","title":"Using the Python API","text":"from winml.modelkit import WinMLAutoModel\n\n# Load from a pre-built ONNX (skips the build pipeline)\nmodel = WinMLAutoModel.from_onnx(\n \"output/model.onnx\",\n task=\"image-classification\", # set your task\n skip_build=True,\n)\n\noutput = model(pixel_values=your_input_tensor)\n Or trigger the full build programmatically:
from winml.modelkit.build import build_onnx_model\nfrom winml.modelkit.config import generate_build_config\n\nconfig = generate_build_config(onnx_path=\"my_model.onnx\", device=\"npu\", precision=\"int8\")\nresult = build_onnx_model(\"my_model.onnx\", config=config, output_dir=\"output/\")\nprint(f\"Final model: {result.final_onnx_path}\")\n"},{"location":"tutorials/build-from-onnx/#troubleshooting","title":"Troubleshooting","text":"
| Problem | Solution |
|---|---|
| \"ONNX file not found\" | Use an absolute path or ensure the file is in the current directory |
| Analyzer reports unsupported ops | Check if an optimization fusion resolves them; if not, the model needs modification for that EP |
| Optimization loop doesn't converge | The default max is 3 iterations; if patterns persist, they may not be fusible \u2014 use --no-quant --no-compile and inspect |
| Quantization accuracy regression | Try --precision int16, --per-channel, or increase --samples for better calibration |
| EP compilation fails | Check the selected EP, model compatibility, and target device availability |
| Model too large for memory | Use --no-compile and compile on the target device |
"},{"location":"tutorials/build-from-onnx/#where-to-go-next","title":"Where to go next","text":"analyze_result.json schema
Pick the right ConvNeXt page
Two pages use ConvNeXt as their vehicle:
winml build one-shot. Start here if you want to ship to NPU.
This tutorial walks you through the complete journey from a pretrained Hugging Face model \u2014 facebook/convnext-tiny-224 \u2014 to a quantized, compiled artifact running on an NPU. By the end you will have benchmarked the model on your device and measured real inference latency. Nothing is skipped, and every command produces a file you can inspect or reuse.
The primary hardware target is a Copilot+ PC with a Snapdragon X-class NPU (40+ TOPS). If you do not have an NPU, every step works on CPU or DirectML as a fallback; the only things that change are the --device and --ep flags on the compile and perf commands. Those variations are shown explicitly in the tabbed blocks below.
The tutorial is split into two sections. Section A runs through the primitive commands, one per pipeline stage, so you understand what each stage does, what artifact it produces, and why it matters. Section B shows that winml build runs the same pipeline in a single command, with or without a config file. Most production workflows live in Section B; Section A is how you learn to trust it.
pip install uv or follow astral.sh/uv)
No NPU? Set --device cpu wherever you see --device npu and drop --monitor from perf commands. Every other flag stays the same.
Working through the primitive commands one at a time is the best way to understand what the winml build wrapper does under the hood. Each step accepts the output of the previous step as its input, so the chain is explicit and every intermediate artifact is available for inspection.
Before downloading any weights, confirm that winml-cli knows how to handle facebook/convnext-tiny-224.
uv run winml inspect -m facebook/convnext-tiny-224\n You should see output similar to the following:
Model facebook/convnext-tiny-224\nTask image-classification\nModel class ConvNextForImageClassification\nExporter optimum/onnx\nInput pixel_values: float32 [1, 3, 224, 224]\nOutput logits: float32 [1, 1000]\nSupport status supported\n What we just did
winml inspect queries the Hugging Face model card and winml-cli's internal registry without downloading weights. It confirms three things: the auto-detected task (image-classification), the model class that will be used for loading, and the exporter that will handle the ONNX conversion. If this command fails, stop here \u2014 something about the model is unsupported and proceeding would waste time. A successful inspect is the green light for every stage that follows.
Generate a WinMLBuildConfig JSON file for the model. The file is optional, both for the primitive workflow (you can drive each stage entirely through CLI flags) and for winml build in Section B, but generating it now gives you a versioned record of every auto-detected setting and a reusable config to pass with -c.
uv run winml config -m facebook/convnext-tiny-224 --device npu --precision int8 -o convnext_config.json\n Open convnext_config.json to see what was auto-detected: the task, I/O tensor shapes, quantization parameters, and the compile target. The --device npu --precision int8 flags tell the config generator to pre-populate the quantization and compile sections for NPU deployment rather than leaving them at defaults.
What we just did
winml config auto-resolves every setting that would otherwise require you to look up flags manually. The resulting JSON is the single source of truth for a reproducible build. You can commit it to version control, share it with teammates, edit a single field to try a different precision, and replay the exact same build on any machine. See Concepts \u2192 Config and build for a deeper look at the config schema and how the stages interact.
Download the pretrained weights and convert the PyTorch model to ONNX format.
uv run winml export -m facebook/convnext-tiny-224 -o convnext.onnx\n This runs a multi-stage export pipeline: model preparation, input generation, hierarchy building, ONNX conversion, node tagging, tag injection, and metadata generation. The result is a standards-compliant ONNX file with winml-cli's Hierarchy-preserving Tags Protocol (HTP) metadata embedded in node metadata_props. That metadata is what lets downstream tools make architecture-aware optimization decisions without hardcoded model knowledge.
What we just did
The default export embeds hierarchy tags \u2014 a tree of source module names mapped onto ONNX nodes \u2014 so that the optimizer and analyzer can reason about the graph in terms of the original model structure rather than flat node lists. If you need a clean ONNX without that metadata (for compatibility with other tools), add --no-hierarchy. See Concepts \u2192 Load and export for what hierarchy preservation adds and when it matters.
Before spending time on optimization and quantization, check that the model's operators are supported by your target execution provider.
uv run winml analyze -m convnext.onnx --ep qnn --device npu\n The analyzer performs static analysis \u2014 no runtime required \u2014 and classifies every operator in the graph as supported, partial, or unsupported for the target EP. It reports a coverage summary, flags any operators that may fall back to CPU, and exits with code 0 for full support or 1 for partial support.
For CPU fallback, run:
uv run winml analyze -m convnext.onnx --ep cpu --device cpu\n What we just did
Knowing your operator coverage before you quantize or compile saves you from discovering EP incompatibilities at the very last step of a long pipeline. ConvNeXt's operators (Conv, GELU, LayerNorm, Add) have broad support across QNN and OpenVINO, so this command should exit 0. If it exits 1, the output tells you which operators are problematic and includes recommendations for resolving them \u2014 typically by enabling a graph rewrite in the optimizer that fuses the unsupported pattern into a supported one. See Concepts \u2192 Analyze and optimize for details on the analyzer's recommendation engine.
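Because the analyzer signals its verdict through the exit code, a script can gate later stages on it. The sketch below is one way to do that from Python, under stated assumptions: the helper names are hypothetical, only the flags shown in this step are used, and treating any code other than 0 or 1 as an error is a guess rather than documented behaviour.

```python
import subprocess


def analyze_command(model_path, ep, device):
    # Build the same command line shown in this step; no extra flags are assumed.
    return ['uv', 'run', 'winml', 'analyze', '-m', model_path, '--ep', ep, '--device', device]


def describe_exit_code(code):
    # 0 and 1 are the codes described in this step.
    # Assumption: any other value is reported here as an analyzer error.
    if code == 0:
        return 'full support'
    if code == 1:
        return 'partial support'
    return 'analyzer error'


def run_analyze(model_path, ep, device):
    # check=False returns the exit code instead of raising on non-zero.
    result = subprocess.run(analyze_command(model_path, ep, device), check=False)
    return describe_exit_code(result.returncode)
```

Calling run_analyze('convnext.onnx', 'qnn', 'npu') requires winml-cli to be installed; the two pure helpers can be reused in any build script without it.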
"},{"location":"tutorials/npu-convnext/#step-5-optimize-the-graph","title":"Step 5: Optimize the graph","text":"Apply graph-level optimizations: operator fusion, constant folding, shape inference, and EP-specific graph rewrites.
uv run winml optimize -m convnext.onnx -o convnext_optim.onnx\n The optimizer reports how many nodes it reduced. A typical ConvNeXt-tiny optimization fuses several element-wise sequences and removes redundant reshape operations, cutting the node count noticeably without changing model semantics. If you want to apply a specific preset suited to the Snapdragon NPU, add --preset qnn-compatible to disable fusions that QNN does not benefit from.
What we just did
Graph optimization is a separate stage from quantization so that you can inspect the intermediate graph, compare node counts, and selectively enable or disable individual fusion passes using the --enable-* / --disable-* flags. Run uv run winml optimize --list-capabilities to see every registered optimization flag and its default state. Optimization always happens on the floating-point graph; quantization is applied after so that calibration statistics are computed on the already-fused topology.
Insert QDQ (Quantize-Dequantize) nodes into the optimized graph using static calibration. This reduces model size and speeds up inference on hardware with integer execution units, which includes Snapdragon NPUs and Intel NPUs.
uv run winml quantize -m convnext_optim.onnx -o convnext_int8.onnx --precision int8 --samples 32\n The quantizer generates 32 random calibration samples, runs them through the model to collect activation statistics, and uses those statistics (with the default minmax method) to set the quantization scale and zero-point for each tensor. Thirty-two samples is sufficient for a vision model with fixed-size inputs like ConvNeXt. For models with variable-length inputs or complex activation distributions, increase --samples to 64 or 128.
What we just did
--precision int8 sets both weights and activations to 8-bit integers, which is the precision most NPU compilers expect. The output model still contains standard QuantizeLinear and DequantizeLinear ONNX nodes, so it is portable and can run on any ONNX Runtime backend \u2014 you do not need special tooling to inspect it. See Concepts \u2192 Quantization and QDQ for a detailed explanation of the QDQ node pattern, calibration methods, and how to choose between per-tensor and per-channel quantization.
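To make the scale and zero-point idea concrete, here is a minimal sketch of textbook asymmetric minmax quantization for a single tensor, followed by the quantize and dequantize round trip that a QDQ pair represents. It is an assumption that winml quantize uses these exact formulas; the function names are hypothetical and the code only illustrates the general technique.

```python
def minmax_qparams(values, qmin=-128, qmax=127):
    # Extend the observed range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    if hi == lo:
        return 1.0, 0  # degenerate all-zero tensor
    scale = (hi - lo) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    # Float to integer, as a QuantizeLinear node would do for one element.
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    # Integer back to float, as a DequantizeLinear node would do.
    return (q - zero_point) * scale


if __name__ == '__main__':
    activations = [-1.0, 0.5, 2.0]  # stand-in for collected calibration statistics
    scale, zp = minmax_qparams(activations)
    for x in activations:
        q = quantize(x, scale, zp)
        print(x, q, dequantize(q, scale, zp))
```

The round-trip error for any value inside the calibrated range is at most half a quantization step, which is why a range stretched by outliers hurts accuracy: the step size grows for every value in the tensor.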
Compilation converts the portable quantized ONNX into an EP-specific binary format that the execution provider can load directly, skipping JIT compilation at inference time. This is the step that produces a device-locked artifact tied to the selected EP.
The examples below use the default compiler backend (--compiler ort), which uses ONNX Runtime's built-in EP context compiler:
uv run winml compile -m convnext_int8.onnx --device npu --ep qnn\n uv run winml compile -m convnext_int8.onnx --device npu --ep openvino\n uv run winml compile -m convnext_int8.onnx --device npu --ep vitisai\n uv run winml compile -m convnext_int8.onnx --device cpu\n The compiled output file appears in the same directory as the input model. The file name follows the pattern convnext_int8_npu_ctx.onnx (using the resolved device string npu, not the EP name) and an accompanying .bin context binary is written alongside it (unless --embed is passed, which embeds the binary inside the ONNX file). CPU builds do not produce a new artifact \u2014 the compile step validates EP compatibility but writes no output file; use convnext_int8.onnx directly for CPU inference.
What we just did
Compilation embeds EP context \u2014 the compiled binary \u2014 inside or alongside the ONNX file using the EPContext node convention. At inference time the runtime loads the pre-compiled binary directly rather than re-compiling from the ONNX graph, eliminating the 15\u201360 second JIT penalty on first load. The default --compiler ort backend bundles compilation within ONNX Runtime itself. See Concepts \u2192 Compile and EPContext for the full picture of what gets embedded and how the context is consumed at runtime.
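A script that chains compile and perf needs to know which file to benchmark next. The helper below is a sketch that generalizes the single naming example from this step (convnext_int8.onnx compiled for npu becomes convnext_int8_npu_ctx.onnx); the general pattern and the function name are assumptions, so confirm against the actual compile output before relying on it.

```python
from pathlib import Path


def expected_compiled_path(model_path, device):
    # CPU builds write no new artifact, so the input model is used directly.
    if device == 'cpu':
        return None
    src = Path(model_path)
    # Assumption: <input stem>_<device>_ctx.onnx, written next to the input model.
    return src.with_name(f'{src.stem}_{device}_ctx.onnx')


def model_to_benchmark(model_path, device):
    # Pick the compiled artifact when one is expected, otherwise the input itself.
    compiled = expected_compiled_path(model_path, device)
    return Path(model_path) if compiled is None else compiled


if __name__ == '__main__':
    print(model_to_benchmark('convnext_int8.onnx', 'npu'))
    print(model_to_benchmark('convnext_int8.onnx', 'cpu'))
```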
Measure inference latency and throughput with the --monitor flag to see live NPU utilization alongside the timing numbers.
uv run winml perf -m convnext_int8_npu_ctx.onnx --device npu --iterations 50 --monitor\n uv run winml perf -m convnext_int8_npu_ctx.onnx --device npu --ep openvino --iterations 50 --monitor\n uv run winml perf -m convnext_int8.onnx --device cpu --iterations 50\n A representative run on a Snapdragon X Elite NPU produces output like the following:
Device: npu\nTask: image-classification\nIterations: 50 (+ 10 warmup)\nBatch Size: 1\n\nLatency (ms)\n Avg P50 P90 P95 P99 Min Max Std\n 2.14 2.11 2.31 2.38 2.59 1.98 2.71 0.14\n\nThroughput: 467.29 samples/sec\n\nHardware (during benchmark)\n NPU: 72.4% avg, 89.1% peak | CPU: 3.2% avg\n Sys Mem: 1842 MB | Device Mem: 48/12 MB (local/shared)\n The CPU fallback (same model, --device cpu) will typically show latencies 8\u201315x higher and near-zero NPU utilization. The contrast between those two runs is the best proof that your NPU path is actually being used.
What we just did
winml perf generates random inputs matching the model's I/O spec, runs the configured number of warmup iterations (excluded from statistics), then the benchmark iterations, and reports full latency percentiles alongside throughput. The --monitor flag activates live hardware utilization polling at 200 ms intervals, displaying an in-terminal chart and attaching the hardware metrics to the JSON report saved alongside the console output. See Concepts \u2192 Perf and monitoring for how to interpret the utilization numbers and what hw_monitor fields look like in the JSON report.
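The statistics in the report can be reproduced from raw per-iteration timings. The sketch below drops the warmup runs, then computes the average, percentiles, and throughput. The sample report is consistent with throughput = 1000 / average latency in ms at batch size 1 (1000 / 2.14 is about 467.29); the percentile interpolation and the use of population standard deviation are assumptions, since the exact conventions winml perf uses are not stated here.

```python
import statistics


def percentile(sorted_values, pct):
    # Linear interpolation between the two closest ranks (one common convention).
    if not sorted_values:
        raise ValueError('no samples')
    rank = (len(sorted_values) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def summarize(latencies_ms, warmup, batch_size=1):
    # Warmup iterations come first and are excluded from every statistic.
    measured = sorted(latencies_ms[warmup:])
    avg = statistics.fmean(measured)
    return {
        'avg': avg,
        'p50': percentile(measured, 50),
        'p90': percentile(measured, 90),
        'p95': percentile(measured, 95),
        'p99': percentile(measured, 99),
        'min': measured[0],
        'max': measured[-1],
        'std': statistics.pstdev(measured),
        'throughput': batch_size * 1000.0 / avg,  # samples per second
    }


if __name__ == '__main__':
    timings = [9.8, 7.5, 2.1, 2.0, 2.3, 2.2, 2.1, 2.6]  # made-up values in ms
    print(summarize(timings, warmup=2))
```

Keeping warmup out of the statistics matters because the first iterations include one-time costs such as session setup and cache population, which would otherwise inflate the maximum and the upper percentiles.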
After quantization it is good practice to verify that INT8 accuracy is close to the FP32 baseline. The winml eval command runs the model against a held-out dataset slice and reports task-relevant metrics.
uv run winml eval -m convnext_int8.onnx --model-id facebook/convnext-tiny-224 --dataset imagenet-1k --split validation --samples 100 --device npu\n The --model-id flag is required when passing an ONNX file, because the evaluator needs it to locate the preprocessor and label mappings. The command downloads 100 shuffled validation samples, runs inference, and reports top-1 and top-5 accuracy. A well-quantized ConvNeXt-tiny should lose less than 0.5 percentage points of top-1 accuracy compared to the floating-point checkpoint.
What we just did
Accuracy evaluation gives you a principled stopping criterion for quantization decisions. If the accuracy drop is larger than acceptable, return to Step 6 and try --precision int16 or per-channel quantization (--per-channel) instead of the default per-tensor int8. See Concepts \u2192 Eval and datasets for the full list of supported datasets, tasks, and column mapping options.
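The stopping criterion can be written down as a small check. The sketch below computes top-k accuracy from raw logits and compares the INT8 result with the FP32 baseline in percentage points. The function names and the 0.5 point default budget (taken from the guideline in this step) are illustrative assumptions, not part of winml eval.

```python
def top_k_accuracy(logits, labels, k):
    # Fraction of samples whose true label is among the k highest-scoring classes.
    hits = 0
    for row, label in zip(logits, labels):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


def accuracy_drop_pp(fp32_acc, int8_acc):
    # Drop in percentage points, with accuracies given as fractions in [0, 1].
    return (fp32_acc - int8_acc) * 100.0


def within_budget(fp32_acc, int8_acc, budget_pp=0.5):
    return accuracy_drop_pp(fp32_acc, int8_acc) < budget_pp


if __name__ == '__main__':
    logits = [[0.1, 0.9, 0.0], [0.2, 0.1, 0.7]]  # toy scores for 3 classes
    labels = [1, 0]
    print(top_k_accuracy(logits, labels, k=1), top_k_accuracy(logits, labels, k=2))
    print(within_budget(0.821, 0.818))
```

With 100 samples each prediction moves accuracy by a full percentage point, so resolving a 0.5 point difference needs a larger --samples value.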
winml build","text":"Once you understand what each primitive stage does (which you now do), you can collapse the entire pipeline into a single command. winml build orchestrates export, optimize, quantize, and compile in sequence.
uv run winml build -m facebook/convnext-tiny-224 -o convnext_out/ --device npu --precision int8\n Config file is optional
The -c config.json flag is optional. Without it, winml build auto-generates an internal config from the flags you pass (like --device and --precision). If you need a reusable config, generate one with winml config.
The command downloads the pretrained weights, runs all four pipeline stages, and writes every intermediate and final artifact into convnext_out/. The stage timing is printed as each stage completes, and the final line tells you the path of the compiled model.
You can selectively skip stages using the override flags:
--no-optimize \u2014 skip graph optimization (rarely needed; useful if you have a pre-optimized ONNX)
--no-quant \u2014 skip quantization (produces a floating-point compiled model)
--no-compile \u2014 skip compilation (produces a quantized but not device-locked ONNX)
For example, to produce an optimized and quantized model without the compile step:
uv run winml build -m facebook/convnext-tiny-224 -o convnext_out/ --device npu --precision int8 --no-compile\n What we just did
winml build is the production workflow. It guarantees that stages run in the correct order, passes intermediate artifacts through the pipeline automatically, and records which stages completed or were skipped in the result summary.
Once the build completes, benchmark the final artifact from convnext_out/:
uv run winml perf -m convnext_out/model.onnx --device npu --iterations 50 --monitor\n The result should be close to what you saw in Step 8, which indicates that the winml build pipeline produces a model equivalent to the one from the manual primitive chain. Small run-to-run differences in latency are expected.