FASQ: Calibration-Free LLM Compression via Product Quantization
FASQ (Flexible Accelerated Subspace Quantization) is a calibration-free framework that compresses large language models (LLMs) by applying product quantization to their weight matrices. Two parameters, sub-vector size and codebook cardinality, yield a continuous range of compression ratios, from 27% to 49% of the original FP16 size, filling the gaps left by fixed-bit methods. On Meta-Llama-3-8B, FASQ reaches 67.1-67.7% accuracy at 37-42% of the original model size, surpassing 4-bit GPTQ and AWQ, with consistent results on Qwen3-8B and Qwen3.5-9B-Base. Custom CUDA kernels enable efficient inference through a LUT-free direct-compute GEMV and an output-stationary design.
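As a concrete illustration of the scheme described above, the sketch below product-quantizes one weight matrix with per-subspace k-means codebooks and computes the resulting storage fraction. This is a minimal sketch, not FASQ's implementation: the parameter names `d` (sub-vector size) and `K` (codebook cardinality), the one-codebook-per-subspace layout, and the use of scikit-learn's `KMeans` are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def pq_quantize(W, d=2, K=256, seed=0):
    """Product-quantize W (out_features x in_features), calibration-free:
    codebooks are fit on the weights alone, with no activation data."""
    out_f, in_f = W.shape
    assert in_f % d == 0, "in_features must be divisible by sub-vector size"
    n_sub = in_f // d
    codebooks = np.empty((n_sub, K, d), dtype=np.float32)
    codes = np.empty((out_f, n_sub), dtype=np.uint16)
    for s in range(n_sub):
        sub = W[:, s * d:(s + 1) * d]          # every row's s-th sub-vector
        km = KMeans(n_clusters=K, n_init=1, random_state=seed).fit(sub)
        codebooks[s] = km.cluster_centers_     # K codewords of length d
        codes[:, s] = km.labels_               # nearest-codeword indices
    return codebooks, codes

def compressed_fraction(out_f, in_f, d, K, fp16_bits=16):
    """Stored size as a fraction of the FP16 matrix: log2(K)/d bits per
    weight for the indices, plus FP16 codebooks (one per subspace here)."""
    index_bits = out_f * (in_f / d) * np.log2(K)
    codebook_bits = (in_f / d) * K * d * fp16_bits
    return (index_bits + codebook_bits) / (out_f * in_f * fp16_bits)
```

Under this assumed layout, d = 2 and K = 256 cost log2(256)/2 = 4 index bits per weight, about 25% of FP16 before codebook overhead, and sweeping d and K moves the ratio continuously, consistent with the 27-49% range reported; the exact configurations FASQ uses are not stated here.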
Key facts
- FASQ applies product quantization to LLM weight matrices.
- It requires no calibration data.
- Two parameters control compression: sub-vector size and codebook cardinality.
- Compression range spans 27-49% of original FP16 model size.
- On Meta-Llama-3-8B, accuracy reaches 67.1-67.7% at 37-42% model size.
- Surpasses 4-bit GPTQ and AWQ in accuracy.
- Consistent results on Qwen3-8B and Qwen3.5-9B-Base.
- Custom CUDA kernels include LUT-free direct-compute GEMV and output-stationary design.
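To make the last point concrete, here is a NumPy sketch contrasting the classic LUT-based PQ GEMV with the direct-compute form, reusing the `codebooks`/`codes` layout from the sketch above. It only models what the kernels compute; the CUDA tiling, the output-stationary accumulation, and every name here are assumptions, not FASQ's kernel code.

```python
import numpy as np

def gemv_lut(codebooks, codes, x):
    """Classic PQ GEMV: build a per-subspace table of codeword-activation
    dot products, then gather one table entry per stored index."""
    n_sub, K, d = codebooks.shape
    xs = x.reshape(n_sub, d)
    lut = np.einsum('skd,sd->sk', codebooks, xs)      # (n_sub, K) table
    return lut[np.arange(n_sub), codes].sum(axis=1)   # gather + reduce

def gemv_direct(codebooks, codes, x):
    """LUT-free direct compute: decode each index to its d FP weights and
    multiply by the activation sub-vector on the fly, with no table pass."""
    n_sub, K, d = codebooks.shape
    xs = x.reshape(n_sub, d)
    decoded = codebooks[np.arange(n_sub), codes]      # (out_f, n_sub, d)
    return np.einsum('rsd,sd->r', decoded, xs)        # one dot per row
```

The direct-compute form skips the per-input table build and its memory traffic, which matters at GEMV batch size 1, and it pairs naturally with an output-stationary design in which each thread holds one output element's partial sum in a register across the whole reduction.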