feat: precision-driven quantization (FP16, RTN int4, static QDQ) via --precision flag #872
Open
DingmaomaoBJTU wants to merge 15 commits into
timenick
reviewed
Jun 11, 2026
Add FP16 precision conversion support across all model pipeline commands:
- Create optim/fp16.py with convert_to_fp16() utility (wraps ORT float16)
- optimize: --precision fp16 with --fp16-keep-io-types and --fp16-op-block-list
- build: --precision fp16 stage between optimize and quantize
- export: --precision fp16 as post-export conversion
- Add shared precision_option() CLI decorator in utils/cli.py
Design: FP16 is a precision transformation (not a graph optimization), so it lives as a command-layer utility rather than an optimizer pipe. All three commands share the same convert_to_fp16() function.
Fixes #867
- Add algorithm, fp16, fp16_only, fp16_keep_io_types, fp16_op_block_list, and RTN fields to WinMLQuantizationConfig
- quantize_onnx now supports pure-FP16 fast path (fp16_only=True skips QDQ) and FP16 post-processing after QDQ (fp16=True, fp16_only=False)
- resolve_quant_compile_config returns fp16_only quant config for precision=fp16
- Remove _run_fp16_stage and skip-quantize hack from build.py pipelines
- Build pipeline unified: Export -> Optimize -> Quantize Stage -> Compile, where Quantize Stage handles both QDQ and FP16 conversion
- Update tests to reflect new behavior (fp16 produces quant config, not None)
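The interaction between fp16 and fp16_only described in this commit can be illustrated with a minimal sketch. It is not the PR's quantize_onnx; `QuantPlanConfig` and `plan_quantize_stage` are hypothetical names, and the step labels are placeholders for the real conversions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QuantPlanConfig:
    """Hypothetical stand-in for the FP16-related fields of WinMLQuantizationConfig."""

    fp16: bool = False       # convert remaining FP32 ops to FP16 after QDQ
    fp16_only: bool = False  # skip QDQ entirely and only convert to FP16


def plan_quantize_stage(cfg: QuantPlanConfig) -> List[str]:
    """Return the ordered steps the quantize stage would run for this config."""
    if cfg.fp16_only:
        # Pure-FP16 fast path: no calibration and no QDQ nodes.
        return ["fp16"]
    steps = ["qdq"]
    if cfg.fp16:
        # FP16 post-processing runs after QDQ has been applied.
        steps.append("fp16")
    return steps
```

Under these assumptions, fp16_only takes priority over fp16, which mirrors the "fp16_only=True skips QDQ" wording above.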
- Remove --precision flag and FP16 conversion from export command
- Remove --precision, --fp16-keep-io-types, --fp16-op-block-list from optimize command and all FP16 conversion logic
- Add --precision fp16 support to quantize command (creates fp16_only config, uses quantize_onnx FP16 fast path)
- FP16 precision is now only available through:
  - winml quantize --precision fp16 (standalone)
  - winml build --precision fp16 (E2E pipeline)
  - winml perf/eval --precision fp16 (E2E commands)
Expand build's --precision from fp32/fp16 only to the full precision
range: auto, fp32, fp16, int8, int16, and w{x}a{y} format (e.g., w8a8,
w8a16). This unifies the build and quantize CLI experience.
Changes:
- Update precision_option() to accept free-form string instead of
click.Choice restricted to fp32/fp16
- Pass precision to generate_build_config() for proper quant config
resolution at config generation time
- Pass precision to resolve_quant_compile_config() in _patch_device
for config-file builds with --precision override
- Propagate fp16/fp16_only fields when patching existing quant config
- Add early validation using _is_valid_precision() for clear error
messages
- Add precision examples to build command help text
Replace 'import onnx' + 'from onnx import ...' dual-import pattern with consistent 'from onnx import ...' style to satisfy CodeQL's 'Module is imported with import and import from' check.
- Remove duplicate old precision_option (main already has expanded version)
- Update test_precision_fp16_clears_quant to expect fp16_only quant config instead of quant=None (matches our FP16-in-quantize design)
- Remove duplicate --precision fp16 build example (main already has one)
When --precision fp16 is used, calibration-related flags (--samples, --method, --weight-type, --activation-type) have no effect. Add explicit warnings in both the CLI layer (quantize command) and the API layer (quantize_onnx) so users are not silently surprised.
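A warning helper of this kind might look like the sketch below. It assumes the options arrive as a dict where `None` means "not set by the user"; the function name and the message text are illustrative, not taken from the PR.

```python
import warnings
from typing import Dict, List, Optional

# Flags the commit message names as having no effect with --precision fp16.
CALIBRATION_FLAGS = ("samples", "method", "weight_type", "activation_type")


def warn_ignored_calibration_options(
    precision: str, options: Dict[str, Optional[object]]
) -> List[str]:
    """Warn about explicitly set calibration options that fp16 ignores.

    Returns the CLI flag names that were ignored (empty if none).
    """
    if precision != "fp16":
        return []
    ignored = [
        "--" + name.replace("_", "-")
        for name in CALIBRATION_FLAGS
        if options.get(name) is not None
    ]
    if ignored:
        warnings.warn(
            "--precision fp16 does not calibrate; ignoring " + ", ".join(ignored),
            UserWarning,
            stacklevel=2,
        )
    return ignored
```

Returning the ignored flag names lets the CLI layer and the API layer share one helper while formatting the message in their own way.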
FP16-only quantization configs do not perform calibration, so they do not need task or model_name fields. The validation now treats fp16_only the same as ONNX builds and submodule builds.
Only static QDQ quantization requires calibration data (and thus task/model_name). RTN (weight-only) and dynamic quantization do not need calibration, so they should not require these fields.
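The validation rule from the two commits above can be sketched as follows. The function name and signature are assumptions made for illustration; only the rule itself (static QDQ needs task/model_name, while FP16-only, RTN, and dynamic do not) comes from the commit messages.

```python
from typing import List, Optional

# Per the commit messages, only static QDQ needs calibration data.
ALGORITHMS_NEEDING_CALIBRATION = {"static"}


def missing_calibration_fields(
    algorithm: str,
    fp16_only: bool,
    task: Optional[str],
    model_name: Optional[str],
) -> List[str]:
    """Return the names of required-but-missing fields for this quant config."""
    if fp16_only or algorithm not in ALGORITHMS_NEEDING_CALIBRATION:
        # FP16-only, RTN, and dynamic configs never calibrate.
        return []
    missing = []
    if not task:
        missing.append("task")
    if not model_name:
        missing.append("model_name")
    return missing
```

A caller could raise a configuration error whenever the returned list is non-empty.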
- Add int4 to named precisions, support w4a{8,16} as weight-only RTN
- Add is_weight_only_precision() and extract_weight_bits() helpers
- resolve_quant_compile_config creates RTN config for weight-only
- quantize command: add RTN fast path between FP16 and QDQ paths
- quantize_onnx: implement RTN path using ORT MatMulNBitsQuantizer
- Update tests for new valid precision values (int4, w4a16)
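The two helpers named in this commit could be approximated as below. The function names come from the commit message, but the signatures, the set of named presets, and the accepted activation widths are assumptions; the PR only states that 4 was added to the valid weight bits and that w4a4 is rejected.

```python
import re
from typing import Optional

_WXAY = re.compile(r"^w(\d+)a(\d+)$")
_NAMED_WEIGHT_BITS = {"int4": 4, "int8": 8, "int16": 16}  # assumed preset mapping
_VALID_WEIGHT_BITS = {4, 8, 16}
_VALID_ACTIVATION_BITS = {8, 16}  # assumption: explains why w4a4 is invalid


def extract_weight_bits(precision: str) -> Optional[int]:
    """Return the weight bit-width for an integer precision, or None if not applicable."""
    if precision in _NAMED_WEIGHT_BITS:
        return _NAMED_WEIGHT_BITS[precision]
    match = _WXAY.match(precision)
    if match is None:
        return None
    weight_bits, activation_bits = int(match.group(1)), int(match.group(2))
    if weight_bits not in _VALID_WEIGHT_BITS or activation_bits not in _VALID_ACTIVATION_BITS:
        return None
    return weight_bits


def is_weight_only_precision(precision: str) -> bool:
    """4-bit weights are handled as weight-only RTN rather than QDQ."""
    return extract_weight_bits(precision) == 4
```

With these assumptions, int4, w4a8, and w4a16 are all treated as weight-only, while banana and w4a4 yield no weight bit-width at all.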
…tion
- _patch_device now propagates algorithm/rtn_bits to existing quant config
- _run_quantize_stage: add RTN path with proper StageLive output
- quantizer: extract .model (ModelProto) from ONNXModel wrapper
- Add type annotation to fp16.py convert result (no-any-return)
- Add assert for precision not None in quantize.py (union-attr)
- Remove duplicate imports in build.py _run_quantize_stage
- Add RTN branch in generate_hf_build_config (int4/w4a16 was silently skipped)
- Pass use_external_data to save_onnx in FP16 and RTN paths (quantizer.py)
- Extract _warn_ignored_calibration_options helper to remove duplication
Summary
Adds precision-driven quantization to `winml quantize` and `winml build`. The `--precision` flag auto-selects the appropriate quantization algorithm (FP16 conversion, RTN weight-only, or static QDQ), so there is no need to specify `--algorithm` or `--rtn-bits` manually.
Resolves #867
Supported Commands
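| Command | `--precision` support |
| --- | --- |
| `winml build` | Yes |
| `winml quantize` | Yes |
| `winml config` | |
| `winml perf` | |
| `winml eval` | |
| `winml export` | No (removed in this PR) |
| `winml optimize` | No (removed in this PR) |

Precision → Algorithm Auto-Resolution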
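| Precision | Algorithm |
| --- | --- |
| `fp16` | FP16 conversion |
| `int4` | RTN (weight-only) |
| `w4a16` | RTN (weight-only) |
| `w4a8` | RTN (weight-only) |
| `int8` | Static QDQ |
| `int16` | Static QDQ |
| `w8a16` | Static QDQ |
| `w8a8` | Static QDQ |
| `auto` | |

Key rule: 4-bit weights → RTN (no QDQ support for 4-bit); 8/16-bit → static QDQ.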
Usage
Design
Algorithm Selection Logic
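As a rough illustration of the selection rule, the sketch below maps the precision strings named in this description to quantization settings. `resolve_algorithm` is a hypothetical name, not the PR's resolve_quant_compile_config. The RTN values are the defaults this description gives for `int4`. `auto` and `fp32` are left out because their resolution is not spelled out here.

```python
from typing import Any, Dict

# Precision groups as listed in this description (assumed to be exhaustive here).
_RTN_PRECISIONS = {"int4", "w4a16", "w4a8"}
_STATIC_QDQ_PRECISIONS = {"int8", "int16", "w8a16", "w8a8"}


def resolve_algorithm(precision: str) -> Dict[str, Any]:
    """Map a precision string to quantization settings (illustrative sketch)."""
    if precision == "fp16":
        # FP16 conversion only: no QDQ and no calibration.
        return {"fp16_only": True}
    if precision in _RTN_PRECISIONS:
        # 4-bit weights: weight-only RTN with the bit-width inferred from precision.
        return {
            "algorithm": "rtn",
            "rtn_bits": 4,
            "rtn_block_size": 128,
            "rtn_symmetric": True,
        }
    if precision in _STATIC_QDQ_PRECISIONS:
        # 8/16-bit: static QDQ, which needs calibration data.
        return {"algorithm": "static"}
    raise ValueError(f"Precision {precision!r} is not covered by this sketch")
```

Keeping the mapping in one function means the CLI and the build-config generator can resolve a precision string the same way.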
RTN Configuration
`--precision int4` automatically sets:
- `algorithm = "rtn"`
- `rtn_bits = 4` (derived from precision)
- `rtn_block_size = 128` (default, tunable via future CLI flag)
- `rtn_symmetric = True` (default)
Advanced users can tune RTN params without needing `--rtn-bits`; bit-width is always inferred from precision.
Validation
- RTN, FP16, and dynamic configs skip `task`/`model_name` validation (no calibration needed)
- `algorithm="static"` requires calibration → requires `task`/`model_name` for HF builds
- Invalid precisions (`banana`, `w4a4`) produce clear error messages
E2E Verified (ConvNeXt-Tiny-224)
- `winml build --precision fp16`
- `winml build --precision int4`
- `winml quantize --precision fp16`
- `winml quantize --precision int4`
- `winml quantize --precision int8`
- `winml quantize --precision w4a16`
- `winml quantize --precision fp16 --samples 50`
- `winml quantize --precision int4 --samples 50`
- `winml quantize --precision banana`
Files Changed
Core
- `config/precision.py`: Added `int4` preset, `is_weight_only_precision()`, `extract_weight_bits()`; expanded `_VALID_WEIGHT_BITS` to include 4
- `config/build.py`: `resolve_quant_compile_config` creates RTN config for weight-only; validation skips calibration requirements for RTN/FP16/dynamic
- `quant/quantizer.py`: Three execution paths: FP16 fast path → RTN (MatMulNBitsQuantizer) → static QDQ
- `quant/config.py`: `algorithm` field (static/dynamic/rtn), RTN params, FP16 params
- `commands/quantize.py`: CLI routing: FP16 → RTN → QDQ, with calibration-ignored warnings
- `commands/build.py`: `_patch_device` propagates algorithm/RTN fields; `_run_quantize_stage` has dedicated RTN path with StageLive output
- `optim/fp16.py`: `convert_to_fp16()` with keep_io_types, already-FP16 skip, topo sort fix
Removed
- `--precision` from `export` and `optimize` commands (quantize stage handles all precision work)
TODO (follow-up PRs)
- Support `--precision int8 --fp16` to run QDQ quantization first, then convert remaining FP32 ops to FP16. The config infrastructure supports this (`fp16=True, fp16_only=False`) but CLI flags and build pipeline routing are not yet wired.
- Expose `--rtn-block-size`, `--rtn-symmetric`, `--rtn-accuracy-level` on `winml quantize` and `winml build` for advanced users.
- Expose `--algorithm dynamic` through the quantize command (config already supports `algorithm="dynamic"` but no CLI flag yet).