FASQ: Calibration-Free LLM Compression via Product Quantization
FASQ (Flexible Accelerated Subspace Quantization) is a calibration-free framework that compresses large language models (LLMs) by applying product quantization to their weight matrices. Two parameters, sub-vector size and codebook cardinality, yield a continuous range of compression ratios, from 27% to 49% of the original FP16 size, filling the gaps left by fixed-bit methods. On Meta-Llama-3-8B, FASQ reaches 67.1-67.7% accuracy at 37-42% of the original model size, surpassing 4-bit GPTQ and AWQ, with consistent results on Qwen3-8B and Qwen3.5-9B-Base. Custom CUDA kernels enable efficient inference through a LUT-free direct-compute GEMV and an output-stationary design.
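As a concrete illustration of the scheme described above, the sketch below product-quantizes one weight matrix with per-subspace k-means codebooks and computes the resulting storage fraction. This is a minimal sketch, not FASQ's implementation: the parameter names `d` (sub-vector size) and `K` (codebook cardinality), the one-codebook-per-subspace layout, and the use of scikit-learn's `KMeans` are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def pq_quantize(W, d=2, K=256, seed=0):
    """Product-quantize W (out_features x in_features), calibration-free:
    codebooks are fit on the weights alone, with no activation data."""
    out_f, in_f = W.shape
    assert in_f % d == 0, "in_features must be divisible by sub-vector size"
    n_sub = in_f // d
    codebooks = np.empty((n_sub, K, d), dtype=np.float32)
    codes = np.empty((out_f, n_sub), dtype=np.uint16)
    for s in range(n_sub):
        sub = W[:, s * d:(s + 1) * d]          # every row's s-th sub-vector
        km = KMeans(n_clusters=K, n_init=1, random_state=seed).fit(sub)
        codebooks[s] = km.cluster_centers_     # K codewords of length d
        codes[:, s] = km.labels_               # nearest-codeword indices
    return codebooks, codes

def compressed_fraction(out_f, in_f, d, K, fp16_bits=16):
    """Stored size as a fraction of the FP16 matrix: log2(K)/d bits per
    weight for the indices, plus FP16 codebooks (one per subspace here)."""
    index_bits = out_f * (in_f / d) * np.log2(K)
    codebook_bits = (in_f / d) * K * d * fp16_bits
    return (index_bits + codebook_bits) / (out_f * in_f * fp16_bits)
```

Under this assumed layout, d = 2 and K = 256 cost log2(256)/2 = 4 index bits per weight, about 25% of FP16 before codebook overhead, and sweeping d and K moves the ratio continuously, consistent with the 27-49% range reported; the exact configurations FASQ uses are not stated here.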
Key facts
- FASQ applies product quantization to LLM weight matrices.
- It requires no calibration data.
- Two parameters control compression: sub-vector size and codebook cardinality.
- Compression range spans 27-49% of original FP16 model size.
- On Meta-Llama-3-8B, accuracy reaches 67.1-67.7% at 37-42% model size.
- Surpasses 4-bit GPTQ and AWQ in accuracy.
- Consistent results on Qwen3-8B and Qwen3.5-9B-Base.
- Custom CUDA kernels include LUT-free direct-compute GEMV and output-stationary design.
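To make the last point concrete, here is a NumPy sketch contrasting the classic LUT-based PQ GEMV with the direct-compute form, reusing the `codebooks`/`codes` layout from the sketch above. It only models what the kernels compute; the CUDA tiling, the output-stationary accumulation, and every name here are assumptions, not FASQ's kernel code.

```python
import numpy as np

def gemv_lut(codebooks, codes, x):
    """Classic PQ GEMV: build a per-subspace table of codeword-activation
    dot products, then gather one table entry per stored index."""
    n_sub, K, d = codebooks.shape
    xs = x.reshape(n_sub, d)
    lut = np.einsum('skd,sd->sk', codebooks, xs)      # (n_sub, K) table
    return lut[np.arange(n_sub), codes].sum(axis=1)   # gather + reduce

def gemv_direct(codebooks, codes, x):
    """LUT-free direct compute: decode each index to its d FP weights and
    multiply by the activation sub-vector on the fly, with no table pass."""
    n_sub, K, d = codebooks.shape
    xs = x.reshape(n_sub, d)
    decoded = codebooks[np.arange(n_sub), codes]      # (out_f, n_sub, d)
    return np.einsum('rsd,sd->r', decoded, xs)        # one dot per row
```

The direct-compute form skips the per-input table build and its memory traffic, which matters at GEMV batch size 1, and it pairs naturally with an output-stationary design in which each thread holds one output element's partial sum in a register across the whole reduction.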