Quantization Undoes Alignment: Bias Emergence in Compressed LLMs

ai-technology · 2026-05-18

A recent arXiv preprint reports that post-training quantization can cause stereotypical biases to resurface in large language models (LLMs), even in models that were well aligned before compression. The study evaluated three instruction-tuned models (Qwen2.5-7B, Mistral-7B, and Phi-3.5-mini) at five precision levels, from BF16 down to 3-bit. The researchers ran 12,148 items from the BBQ bias benchmark across five random seeds, producing 911,100 inference records in total. The results show a dose-response relationship, confirmed by logistic regression: bias emergence increases as precision drops, and at 3-bit, 6-21% of previously unbiased items exhibit new stereotypical behavior. The findings point to a need for bias-aware compression methods before quantized LLMs are deployed in cloud and edge settings.
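The reported record count is consistent with a full factorial design, in which every benchmark item is run once per model, precision level, and seed. The source does not state the design explicitly, so the sketch below is an arithmetic check under that assumption; the variable names are illustrative.

```python
# Figures quoted in the article.
items = 12_148          # BBQ benchmark items
models = 3              # Qwen2.5-7B, Mistral-7B, Phi-3.5-mini
precision_levels = 5    # BF16 through 3-bit
seeds = 5               # random seeds

# Assumption (not stated in the source): each item is evaluated once
# for every (model, precision level, seed) combination.
expected_records = items * models * precision_levels * seeds

print(expected_records)  # 911100
```

The product matches the 911,100 inference records reported in the study.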

Key facts

Study published on arXiv as arXiv:2605.15208
Models: Qwen2.5-7B, Mistral-7B, Phi-3.5-mini
Precision levels: BF16 through 3-bit
Benchmark: 12,148 BBQ bias items
Random seeds: 5
Total inference records: 911,100
3-bit quantization causes 6-21% of previously unbiased items to develop stereotypical behaviors
Dose-response pattern confirmed via logistic regression
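
The "6-21% of previously unbiased items" figure can be read as an emergence rate: among items answered without bias at baseline precision, the share that become biased after quantization. The following is a minimal sketch of that reading, not the authors' code. The function name `emergence_rate` and the toy labels are hypothetical, and the study's exact definition may differ.

```python
def emergence_rate(baseline_biased, quantized_biased):
    """Share of items unbiased at baseline that are biased after quantization.

    Both arguments are equal-length lists of booleans, one entry per
    benchmark item (True = stereotypical answer). Hypothetical helper.
    """
    if len(baseline_biased) != len(quantized_biased):
        raise ValueError("label lists must have the same length")

    # Only items that were unbiased at baseline can show *new* bias.
    unbiased_idx = [i for i, biased in enumerate(baseline_biased) if not biased]
    if not unbiased_idx:
        return 0.0

    flipped = sum(1 for i in unbiased_idx if quantized_biased[i])
    return flipped / len(unbiased_idx)


# Toy labels for illustration only; these are not data from the study.
baseline = [False, False, False, False, True]
quantized = [False, True, False, False, True]

rate = emergence_rate(baseline, quantized)
print(rate)  # 0.25
```

Conditioning on items that were unbiased at baseline is what separates newly emerged bias from bias the model already showed before compression.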
