FP16 Precision Causes Deterministic Token Divergence in KV-Cached Transformer Inference
A study posted to arXiv (arXiv:2604.15409v1) reports that KV caching, a widely used optimization for autoregressive transformer inference, is not numerically equivalent to cache-free computation under standard FP16 precision. The cache-ON and cache-OFF execution paths accumulate floating-point values in different orders, and because FP16 arithmetic is non-associative, the two paths produce deterministically different decoded token sequences. Experiments on three open-weight models (LLaMA-2-7B, Mistral-7B-v0.3, and Gemma-2-2B) on the GSM8K dataset showed a 100% token divergence rate under every sampling strategy tested, including greedy decoding, which rules out sampling randomness as the cause. Cache-ON yielded higher accuracy in 8 of 9 conditions, indicating that the divergence has a systematic direction. A controlled FP32 falsification experiment reduced divergence by eight orders of magnitude and eliminated token flips. The finding challenges the long-presumed numerical equivalence of KV caching, with implications for the reliability of transformer inference in tasks that need numerical precision.
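The non-associativity behind this effect can be seen in a toy example. The sketch below is not the paper's experiment and does not model a KV cache; it only shows that summing the same three numbers in two different orders gives different FP16 results while FP32 agrees on these inputs. The helper `sum_in_order` and the chosen values are illustrative assumptions.

```python
import numpy as np


def sum_in_order(values, dtype):
    """Accumulate values left to right, rounding to `dtype` after every add."""
    acc = dtype(0)
    for v in values:
        acc = dtype(acc + dtype(v))
    return float(acc)


# Illustrative values (an assumption, not taken from the paper).
# FP16 has an 11-bit significand, so representable values in [2048, 4096)
# are spaced 2 apart: 2049 is a tie and rounds to 2048 (ties-to-even).
terms = [2048.0, 1.0, 1.0]

# Same terms, two accumulation orders, standing in for two execution paths.
fp16_forward = sum_in_order(terms, np.float16)        # (2048 + 1) + 1
fp16_reverse = sum_in_order(terms[::-1], np.float16)  # (1 + 1) + 2048

# FP32 has enough precision to represent every intermediate sum exactly here.
fp32_forward = sum_in_order(terms, np.float32)
fp32_reverse = sum_in_order(terms[::-1], np.float32)

print("FP16:", fp16_forward, fp16_reverse)  # FP16: 2048.0 2050.0
print("FP32:", fp32_forward, fp32_reverse)  # FP32: 2050.0 2050.0
```

In a real model, differences of this kind arise in much longer accumulations, and a small shift in a near-tied logit is enough to change the token chosen by greedy decoding.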
Key facts
- KV caching optimization in autoregressive transformer inference is not numerically equivalent to cache-free computation under FP16 precision
- Cache-ON and cache-OFF paths use different floating-point accumulation orderings, causing deterministic divergence due to FP16 non-associativity
- Experiments on the LLaMA-2-7B, Mistral-7B-v0.3, and Gemma-2-2B models show a 100% token divergence rate on the GSM8K dataset
- Divergence occurs across all sampling strategies, including greedy decoding, ruling out sampling randomness
- Cache-ON yielded higher accuracy in 8 out of 9 conditions, indicating that the divergence has a systematic direction
- Controlled FP32 falsification reduces divergence by eight orders of magnitude and eliminates token flips
- The paper is arXiv:2604.15409v1, with arXiv announcement type "cross" (a cross-listing)
- The finding challenges long-presumed numerical equivalence in KV caching
Entities
Institutions
- arXiv