Quantization Undoes Alignment: Bias Emergence in Compressed LLMs
A recent study published on arXiv reports that post-training quantization of large language models (LLMs) can cause stereotypical biases to re-emerge, even in models that were well aligned at full precision. The study evaluated three instruction-tuned models (Qwen2.5-7B, Mistral-7B, and Phi-3.5-mini) at five precision levels, ranging from BF16 down to 3-bit. The researchers ran 12,148 items from the BBQ bias benchmark across five random seeds, producing 911,100 inference records in total. The results show a dose-response pattern, confirmed by logistic regression, and at 3-bit precision 6-21% of previously unbiased items exhibited new stereotypical behavior. The findings point to a need for bias-aware compression methods when quantized LLMs are deployed in cloud and edge settings.
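To make the "previously unbiased items exhibiting new stereotypical behaviors" figure concrete, the sketch below shows one plausible way such an emergence rate could be computed. The function name `emergence_rate` and the per-item boolean flags are hypothetical, the toy inputs are made up, and the paper's exact definition may differ.

```python
def emergence_rate(baseline_biased, quantized_biased):
    """Share of items unbiased at baseline that become biased after quantization.

    Both arguments are per-item boolean flags in the same item order
    (hypothetical inputs; the study's own scoring procedure is not shown here).
    """
    if len(baseline_biased) != len(quantized_biased):
        raise ValueError("inputs must cover the same items")
    # Keep only items the full-precision model answered without a stereotype.
    unbiased_idx = [i for i, biased in enumerate(baseline_biased) if not biased]
    if not unbiased_idx:
        return 0.0
    # Count how many of those items flip to a stereotypical answer.
    emerged = sum(1 for i in unbiased_idx if quantized_biased[i])
    return emerged / len(unbiased_idx)


# Toy flags for five items, invented for illustration (not data from the study).
baseline = [False, False, True, False, False]
quantized = [False, True, True, False, False]
rate = emergence_rate(baseline, quantized)
```

Restricting the denominator to items that were unbiased at baseline keeps the metric focused on newly emerged bias rather than bias the model already had.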
Key facts
- Source: arXiv:2605.15208
- Models: Qwen2.5-7B, Mistral-7B, Phi-3.5-mini
- Precision levels: BF16 through 3-bit
- Benchmark: 12,148 BBQ bias items
- Random seeds: 5
- Total inference records: 911,100
- At 3-bit precision, 6-21% of previously unbiased items develop stereotypical behaviors
- Dose-response pattern confirmed via logistic regression
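The reported record count matches what a full factorial design would produce. The short check below assumes every benchmark item was run once for each combination of model, precision level, and seed. The summary gives only the totals, so this is an inference from them, and the variable names are illustrative.

```python
# Figures taken from the key facts above.
n_items = 12_148      # BBQ benchmark items
n_models = 3          # Qwen2.5-7B, Mistral-7B, Phi-3.5-mini
n_precisions = 5      # BF16 through 3-bit
n_seeds = 5           # random seeds

# Assumption: one inference record per (item, model, precision, seed) combination.
total_records = n_items * n_models * n_precisions * n_seeds
print(total_records)  # 911100
```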
Entities
Institutions
- arXiv