Quantized Matrix Multiplication with Covariance for LLM Quantization
This paper, the second part of a study on quantized matrix multiplication (MatMul), addresses the setting where the covariance matrix Σ_X of the second factor's columns is available, as in weight-only post-training quantization of LLMs. It connects weight-only quantization to weighted mean squared error (WMSE) source coding, whose classical waterfilling solution dictates the optimal allocation of rate across coordinates. The authors show how waterfilling can improve practical LLM quantization algorithms such as GPTQ, which currently allocate rate uniformly. They analyze a recent scheme, WaterSIC, that uses scalar INT quantizers, and prove that its high-rate performance is basis-free, characterized by the determinant of Σ_X alone.
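The waterfilling idea can be sketched numerically. In the minimal version below (an illustration, not the paper's algorithm), each coordinate with variance λ_i receives rate R_i = max(0, ½·log2(λ_i/θ)), and the water level θ is found by bisection so the rates sum to the total bit budget; the helper name `waterfill_rates` is hypothetical.

```python
import numpy as np

def waterfill_rates(eigvals, total_bits, iters=60):
    """Illustrative reverse-waterfilling rate allocation.

    Coordinates with variance below the water level theta get zero rate;
    the rest get 0.5*log2(lam/theta) bits. Bisection on theta enforces
    the total rate budget.
    """
    eigvals = np.asarray(eigvals, dtype=float)
    lo, hi = 1e-12, float(eigvals.max())
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        rates = np.maximum(0.0, 0.5 * np.log2(eigvals / theta))
        if rates.sum() > total_bits:
            lo = theta  # too much rate spent: raise the water level
        else:
            hi = theta
    return np.maximum(0.0, 0.5 * np.log2(eigvals / hi))

# High-variance coordinates receive more bits than low-variance ones,
# unlike the uniform allocation used by GPTQ-style schemes.
rates = waterfill_rates([4.0, 1.0, 0.25], total_bits=3.0)
```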
Key facts
- Second part of work on quantized matrix multiplication
- Considers setting with covariance matrix Σ_X available
- Applies to weight-only post-training quantization of LLMs
- Connects to weighted mean squared error (WMSE) source coding
- Waterfilling solution dictates optimal rate distribution
- Shows waterfilling can improve GPTQ algorithm
- Analyzes WaterSIC scheme using scalar INT quantizers
- High-rate performance is basis-free, characterized by determinant of Σ_X
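The basis-free claim in the last fact rests on a standard linear-algebra identity: the determinant of a covariance matrix is invariant under an orthogonal change of basis, det(QΣQᵀ) = det(Σ). A minimal numerical check (illustrative only, not the paper's proof):

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a valid covariance matrix (symmetric positive definite).
A = rng.standard_normal((4, 4))
Sigma = A @ A.T + 1e-3 * np.eye(4)

# Random orthogonal change of basis via QR decomposition.
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))

det_original = np.linalg.det(Sigma)
det_rotated = np.linalg.det(Q @ Sigma @ Q.T)
# The two determinants agree, so any performance figure determined
# solely by det(Sigma) is independent of the chosen basis.
```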
Entities
—