Quantized Matrix Multiplication with Covariance for LLM Quantization
This paper, the second part of a study on quantized matrix multiplication (MatMul), addresses the setting where the covariance matrix Σ_X of the second factor's columns is available, as in weight-only post-training quantization of LLMs. It connects weight-only quantization to weighted mean squared error (WMSE) source coding, whose classical waterfilling solution dictates the optimal allocation of rate across coordinates. The authors show how waterfilling can improve practical LLM quantization algorithms such as GPTQ, which currently allocate rate uniformly. They analyze a recent scheme, WaterSIC, that uses scalar INT quantizers, and prove that its high-rate performance is basis-free, characterized by the determinant of Σ_X alone.
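The waterfilling idea can be sketched numerically. In the minimal version below (an illustration, not the paper's algorithm), each coordinate with variance λ_i receives rate R_i = max(0, ½·log2(λ_i/θ)), and the water level θ is found by bisection so the rates sum to the total bit budget; the helper name `waterfill_rates` is hypothetical.

```python
import numpy as np

def waterfill_rates(eigvals, total_bits, iters=60):
    """Illustrative reverse-waterfilling rate allocation.

    Coordinates with variance below the water level theta get zero rate;
    the rest get 0.5*log2(lam/theta) bits. Bisection on theta enforces
    the total rate budget.
    """
    eigvals = np.asarray(eigvals, dtype=float)
    lo, hi = 1e-12, float(eigvals.max())
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        rates = np.maximum(0.0, 0.5 * np.log2(eigvals / theta))
        if rates.sum() > total_bits:
            lo = theta  # too much rate spent: raise the water level
        else:
            hi = theta
    return np.maximum(0.0, 0.5 * np.log2(eigvals / hi))

# High-variance coordinates receive more bits than low-variance ones,
# unlike the uniform allocation used by GPTQ-style schemes.
rates = waterfill_rates([4.0, 1.0, 0.25], total_bits=3.0)
```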
Key facts
- Second part of work on quantized matrix multiplication
- Considers setting with covariance matrix Σ_X available
- Applies to weight-only post-training quantization of LLMs
- Connects to weighted mean squared error (WMSE) source coding
- Waterfilling solution dictates optimal rate distribution
- Shows waterfilling can improve GPTQ algorithm
- Analyzes WaterSIC scheme using scalar INT quantizers
- High-rate performance is basis-free, characterized by determinant of Σ_X
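The basis-free claim in the last fact rests on a standard linear-algebra identity: the determinant of a covariance matrix is invariant under an orthogonal change of basis, det(QΣQᵀ) = det(Σ). A minimal numerical check (illustrative only, not the paper's proof):

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a valid covariance matrix (symmetric positive definite).
A = rng.standard_normal((4, 4))
Sigma = A @ A.T + 1e-3 * np.eye(4)

# Random orthogonal change of basis via QR decomposition.
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))

det_original = np.linalg.det(Sigma)
det_rotated = np.linalg.det(Q @ Sigma @ Q.T)
# The two determinants agree, so any performance figure determined
# solely by det(Sigma) is independent of the chosen basis.
```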
Entities
—