Quantization Trap Breaks Neural Scaling Laws in Multi-Hop Reasoning
A new study on arXiv reports that neural scaling laws, which typically promise proportional efficiency gains as numerical precision is reduced, break down on multi-hop reasoning tasks. The authors identify a phenomenon they call the 'quantization trap': lowering precision from 16 bits to 8 or 4 bits increases net energy consumption while degrading reasoning accuracy. They attribute the trap to hardware casting overhead, hidden latency in dequantization kernels, and the failure of fixed per-step energy costs to amortize across sequential reasoning hops. The study also introduces a Critical Model Scale N*, which predicts when the trap dissolves or deepens as a function of model size, batch size, and hardware concurrency. The finding challenges the assumption that reducing precision always improves efficiency, especially in sequential reasoning.
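To see how such a trap can arise, here is a minimal sketch in Python, assuming a purely illustrative additive energy model: per-hop compute energy shrinks with precision, while a fixed casting/dequantization overhead is amortized over the batch but not over the reasoning hops. The constants, function names, and the decomposition itself are hypothetical and are not taken from the paper.

```python
def hop_energy_joules(precision_bits: int,
                      model_params: float,
                      batch_size: int,
                      compute_energy_per_param_fp16: float = 2e-9,
                      cast_overhead_per_hop: float = 0.12,
                      dequant_latency_energy: float = 0.06) -> float:
    """Energy for one reasoning hop under a simple additive toy model."""
    # Compute energy shrinks roughly in proportion to precision: the usual
    # scaling-law expectation.
    compute = compute_energy_per_param_fp16 * model_params * (precision_bits / 16)
    if precision_bits >= 16:
        return compute
    # Lower precision pays a fixed per-hop cost for weight casting and
    # dequantization kernels; it is shared across the batch but NOT across hops.
    overhead = (cast_overhead_per_hop + dequant_latency_energy) / batch_size
    return compute + overhead


def chain_energy(precision_bits: int, model_params: float,
                 batch_size: int, hops: int) -> float:
    """Total energy for a multi-hop chain: per-hop costs add sequentially."""
    return hops * hop_energy_joules(precision_bits, model_params, batch_size)


if __name__ == "__main__":
    # Small model, batch of 1, long chain: 8- and 4-bit cost MORE than 16-bit.
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit, 1e8 params, batch 1, 12 hops:  "
              f"{chain_energy(bits, 1e8, 1, 12):.2f} J")
    # Larger model and batch: the overhead amortizes and low precision wins again.
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit, 1e10 params, batch 32, 12 hops: "
              f"{chain_energy(bits, 1e10, 32, 12):.2f} J")
```

Under these made-up constants, the small-model, batch-1 chain spends more energy at 8 and 4 bits than at 16 bits, while the larger, batched configuration recovers the expected savings.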
Key facts
- Neural scaling laws predict linear efficiency gains from reduced numerical precision.
- Reducing precision from 16-bit to 8- or 4-bit increases net energy consumption in multi-hop reasoning.
- The 'quantization trap' degrades reasoning accuracy while increasing energy use.
- Hardware casting overhead and dequantization kernel latency are primary causes.
- The failure of per-step energy overheads to amortize across sequential hops also contributes to the trap.
- A Critical Model Scale N* predicts when the trap dissolves or deepens (see the sketch after this list).
- The trap depends on model size, batch size, and hardware concurrency.
- In practice, the breakdown of scaling laws is unavoidable for multi-hop reasoning.
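Under the same illustrative model, a critical scale can be written in closed form by solving for the model size at which the compute-energy savings of lower precision exactly offset the per-hop overhead. This is a hypothetical derivation from the sketch above, not the paper's definition of N*.

```python
def critical_model_scale(precision_bits: int,
                         batch_size: int,
                         compute_energy_per_param_fp16: float = 2e-9,
                         cast_overhead_per_hop: float = 0.12,
                         dequant_latency_energy: float = 0.06) -> float:
    """Model size at which low-precision and 16-bit per-hop energy are equal
    under the additive toy model above; illustrative, not the paper's formula."""
    # Per-hop overhead, amortized over the batch only.
    overhead = (cast_overhead_per_hop + dequant_latency_energy) / batch_size
    # Compute energy saved per parameter by dropping below 16 bits.
    savings_per_param = compute_energy_per_param_fp16 * (1 - precision_bits / 16)
    # Below this size the overhead dominates (trap deepens); above it, savings win.
    return overhead / savings_per_param


if __name__ == "__main__":
    for bits in (8, 4):
        for batch in (1, 32):
            print(f"{bits}-bit, batch {batch}: critical scale ~ "
                  f"{critical_model_scale(bits, batch):.2e} parameters")
```

The dependence on batch size in this toy expression mirrors the paper's claim that the trap hinges on model size, batch size, and hardware concurrency: larger batches shrink the critical scale, so quantization pays off for smaller models.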
Entities
Institutions
- arXiv