New PTQ Framework for W4A4 Quantization of Wan2.2-I2V Video Diffusion Transformers
A new post-training quantization (PTQ) framework targets W4A4 quantization (4-bit weights and 4-bit activations) of large video diffusion Transformers, addressing two obstacles: activation outliers and activation distributions that shift with the denoising timestep. The framework combines three components: SVDQuant for low-rank outlier compensation, GPTQ for reconstruction-aware quantization of the residual weights, and a per-layer activation clipping-ratio search performed separately for each timestep bin and each expert. It is designed for the Mixture-of-Experts DiT architecture of Wan2.2-I2V, whose high-noise and low-noise experts differ in quantization sensitivity. On the OpenS2V-Eval benchmark, the method reduces peak GPU memory by 59.3% relative to the BF16 baseline, while the VBench average score drops by only 0.9%. The paper is available on arXiv under ID 2605.27003.
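The low-rank outlier compensation idea can be illustrated with a small NumPy sketch. This is not the paper's implementation: it covers weights only, it uses plain round-to-nearest for the residual where the framework uses GPTQ, and the helper names `quantize_int4` and `low_rank_plus_residual`, the rank, and the synthetic weight matrix are all assumptions made for illustration.

```python
import numpy as np

def quantize_int4(x, axis=-1):
    """Symmetric round-to-nearest fake quantization to the signed 4-bit range [-8, 7]."""
    max_abs = np.max(np.abs(x), axis=axis, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 7.0, 1.0)  # avoid division by zero
    q = np.clip(np.round(x / scale), -8, 7)
    return q * scale  # dequantized values, so errors can be compared directly

def low_rank_plus_residual(w, rank):
    """Keep the top-`rank` singular components in high precision and
    quantize only the residual to 4 bits (hypothetical helper)."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
    # Round-to-nearest stands in for GPTQ here to keep the sketch short.
    residual_q = quantize_int4(w - low_rank, axis=1)
    return low_rank, residual_q

# Synthetic weight matrix with one outlier column (an assumption, not real model data).
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
w[:, 3] *= 20.0

low_rank, residual_q = low_rank_plus_residual(w, rank=4)
err_plain = np.linalg.norm(w - quantize_int4(w, axis=1))
err_svd = np.linalg.norm(w - (low_rank + residual_q))
print(f"plain 4-bit error: {err_plain:.2f}, low-rank + 4-bit residual error: {err_svd:.2f}")
```

The low-rank branch absorbs the outlier column, so the residual has a much smaller dynamic range and loses less information when rounded to 16 levels.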
Key facts
- Proposed framework combines SVDQuant, GPTQ, and timestep-bin-wise clipping-ratio search.
- Addresses activation outliers and timestep-dependent distributions in Wan2.2-I2V.
- Targets the two-expert Mixture-of-Experts DiT design of Wan2.2-I2V, whose high-noise and low-noise experts have distinct quantization sensitivities.
- Reduces peak GPU memory by 59.3% relative to the BF16 baseline on the OpenS2V-Eval benchmark.
- VBench average score drops by only 0.9% compared to the BF16 baseline.
- Published on arXiv with ID 2605.27003.
- Method is post-training quantization (PTQ).
- W4A4 quantization (4-bit weights, 4-bit activations) enables substantial memory savings for video diffusion Transformers.
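A timestep-bin-wise clipping-ratio search for a single layer of a single expert might look like the following minimal sketch. The objective (mean squared error of the fake-quantized activations), the per-token scaling, the candidate grid, the number of bins, and the function names are all assumptions; the summary above does not specify these details. In the described framework, such a search would be run per layer and independently for the high-noise and low-noise experts.

```python
import numpy as np

def quantize_act_int4(x, clip_ratio):
    """Per-token symmetric 4-bit fake quantization with a clipped maximum."""
    max_abs = np.max(np.abs(x), axis=-1, keepdims=True) * clip_ratio
    scale = np.where(max_abs > 0, max_abs / 7.0, 1.0)
    return np.clip(np.round(x / scale), -8, 7) * scale

def search_clip_ratio(acts, candidates):
    """Return the candidate ratio with the lowest quantization MSE."""
    errors = [np.mean((acts - quantize_act_int4(acts, c)) ** 2) for c in candidates]
    return float(candidates[int(np.argmin(errors))])

def search_per_bin(acts_by_timestep, num_bins, num_timesteps, candidates):
    """Group calibration activations into equal-width timestep bins and
    search one clipping ratio per bin (hypothetical helper)."""
    edges = np.linspace(0, num_timesteps, num_bins + 1)
    ratios = {}
    for b in range(num_bins):
        chunk = [a for t, a in acts_by_timestep.items() if edges[b] <= t < edges[b + 1]]
        if chunk:  # skip bins with no calibration samples
            ratios[b] = search_clip_ratio(np.concatenate(chunk, axis=0), candidates)
    return ratios

# Synthetic calibration data: shape (tokens, channels) per sampled timestep.
# Heavier tails at larger timesteps are an arbitrary choice for illustration.
rng = np.random.default_rng(0)
acts_by_timestep = {}
for t in range(0, 1000, 50):
    df = 3 if t >= 500 else 30
    acts_by_timestep[t] = rng.standard_t(df, size=(32, 128))

candidates = np.linspace(0.5, 1.0, 11)
ratios = search_per_bin(acts_by_timestep, num_bins=4, num_timesteps=1000, candidates=candidates)
print(ratios)
```

Searching each bin separately lets the clipping ratio follow the activation distribution as it changes over the denoising trajectory, instead of forcing one ratio to fit all timesteps.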
Entities
Institutions
- arXiv