ARTFEED — Contemporary Art Intelligence

FP16 Precision Causes Deterministic Token Divergence in KV-Cached Transformer Inference

ai-technology · 2026-04-20

A study released on arXiv reports that KV caching, a widely used optimization for autoregressive transformer inference, is not numerically equivalent to cache-free computation under standard FP16 precision. The discrepancy arises because the cache-ON and cache-OFF execution paths accumulate floating-point values in different orders, and FP16 addition is not associative, so the two paths decode consistently different token sequences. Experiments on three open-weight models (LLaMA-2-7B, Mistral-7B-v0.3, and Gemma-2-2B) over the GSM8K dataset found a 100% token divergence rate across all sampling strategies, including greedy decoding, which rules out sampling randomness as the cause. Cache-ON produced higher accuracy in 8 of 9 conditions, indicating that the divergence has a systematic direction. A controlled FP32 falsification reduces the divergence by eight orders of magnitude and eliminates token flips. The paper, arXiv:2604.15409v1, challenges the long-presumed numerical equivalence of KV caching, with implications for the reliability of transformer models in precision-sensitive tasks.
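The root cause can be reproduced in a few lines: FP16 addition is not associative, so regrouping the same operands changes the rounded result. A minimal NumPy sketch (the values here are illustrative, not taken from the paper):

```python
import numpy as np

# The same three FP16 values, summed under two different groupings.
a, b, c = np.float16(1e4), np.float16(-1e4), np.float16(1.0)

left = (a + b) + c   # (10000 - 10000) + 1 -> 1.0
right = a + (b + c)  # -10000 + 1 rounds back to -10000, so the sum -> 0.0

print(float(left), float(right))  # 1.0 0.0
```

Any reordering a kernel performs (for example, different GEMM tiling for cached versus uncached tensor shapes) can therefore change the logits, and a logit change near a decision boundary flips the decoded token.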

Key facts

  • KV caching optimization in autoregressive transformer inference is not numerically equivalent to cache-free computation under FP16 precision
  • Cache-ON and cache-OFF paths use different floating-point accumulation orderings, causing deterministic divergence due to FP16 non-associativity
  • Experiments on LLaMA-2-7B, Mistral-7B-v0.3, and Gemma-2-2B models show 100% token divergence rate on GSM8K dataset
  • Divergence occurs across all sampling strategies, including greedy decoding, ruling out sampling randomness
  • Cache-ON yielded higher accuracy in 8 out of 9 conditions, indicating systematic divergence direction
  • Controlled FP32 falsification reduces divergence by eight orders of magnitude and eliminates token flips
  • The paper is arXiv:2604.15409v1, a cross-listed announcement
  • The finding challenges long-presumed numerical equivalence in KV caching

Entities

Institutions

  • arXiv

Sources