ARTFEED — Contemporary Art Intelligence

FASQ: Calibration-Free LLM Compression via Product Quantization

ai-technology · 2026-05-07

FASQ (Flexible Accelerated Subspace Quantization) is a calibration-free framework for compressing large language models (LLMs) by applying product quantization to their weight matrices. Two parameters, sub-vector size and codebook cardinality, yield a continuous range of compression ratios spanning 27-49% of the original FP16 size, filling the gaps left by fixed-bit methods. On Meta-Llama-3-8B, FASQ reaches 67.1-67.7% accuracy at 37-42% of the original model size, surpassing 4-bit GPTQ and AWQ, with consistent results on Qwen3-8B and Qwen3.5-9B-Base. Custom CUDA kernels, including a LUT-free direct-compute GEMV and an output-stationary design, enable efficient inference.
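The FASQ implementation itself is not reproduced in this summary, but the core idea, product quantization fit directly on the weights with no calibration data, can be sketched in a few lines. The following is an illustrative toy, not the paper's code: sub-vectors of each weight row share a single k-means codebook, and each sub-vector is stored as one codebook index. All function names here are invented for illustration.

```python
import numpy as np

def pq_compress(W, sub_dim, codebook_size, iters=10, seed=0):
    """Toy product quantization: split rows of W into sub-vectors of
    length sub_dim, fit one shared codebook with plain k-means on the
    weights themselves (hence calibration-free), and store one code
    per sub-vector. Illustrative sketch, not the FASQ implementation."""
    rng = np.random.default_rng(seed)
    rows, cols = W.shape
    assert cols % sub_dim == 0, "columns must divide evenly into sub-vectors"
    subvecs = W.reshape(-1, sub_dim)
    # initialize centroids from randomly chosen sub-vectors
    codebook = subvecs[rng.choice(len(subvecs), codebook_size, replace=False)].copy()
    for _ in range(iters):  # Lloyd's algorithm
        dists = ((subvecs[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for k in range(codebook_size):
            members = subvecs[assign == k]
            if len(members):
                codebook[k] = members.mean(0)
    codes = ((subvecs[:, None, :] - codebook[None, :, :]) ** 2).sum(-1).argmin(1)
    return codes.astype(np.uint16).reshape(rows, cols // sub_dim), codebook

def pq_decompress(codes, codebook):
    """Reconstruct the (lossy) weight matrix by codebook lookup."""
    return codebook[codes].reshape(codes.shape[0], -1)
```

Because the codebook is fit on the weight matrix alone, no activation statistics or calibration set are needed, which is the property the article highlights.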

Key facts

  • FASQ applies product quantization to LLM weight matrices.
  • It requires no calibration data.
  • Two parameters control compression: sub-vector size and codebook cardinality.
  • Compression range spans 27-49% of original FP16 model size.
  • On Meta-Llama-3-8B, accuracy reaches 67.1-67.7% at 37-42% model size.
  • Surpasses 4-bit GPTQ and AWQ in accuracy.
  • Consistent results on Qwen3-8B and Qwen3.5-9B-Base.
  • Custom CUDA kernels include LUT-free direct-compute GEMV and output-stationary design.
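The two control parameters map to storage cost in a simple way: each group of `sub_dim` weights is replaced by a single log2(K)-bit index, so the index cost is log2(K)/sub_dim bits per weight. A back-of-envelope sketch follows; the parameter values are illustrative, not the paper's actual configurations, and codebook storage plus any layers kept in FP16 add overhead on top of the index fraction shown.

```python
import math

def pq_bits_per_weight(sub_dim, codebook_size):
    # one log2(K)-bit code covers sub_dim weights
    return math.log2(codebook_size) / sub_dim

def index_fraction_of_fp16(sub_dim, codebook_size):
    # index storage relative to 16-bit weights; codebook storage and
    # unquantized layers add overhead on top of this figure
    return pq_bits_per_weight(sub_dim, codebook_size) / 16

print(pq_bits_per_weight(4, 65536))      # 4.0 bits per weight
print(index_fraction_of_fp16(4, 65536))  # 0.25
```

Varying either knob moves this figure continuously, which is how a single scheme can sweep a range like 27-49% of FP16 size rather than being pinned to fixed bit widths.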

Entities

Institutions

  • arXiv
