ARTFEED — Contemporary Art Intelligence

FP16 Precision Causes Deterministic Token Divergence in KV-Cached Transformer Inference

ai-technology · 2026-04-20

A study released on arXiv reports that KV caching, a widely used optimization for autoregressive transformer inference, is not numerically equivalent to cache-free computation under standard FP16 precision. The discrepancy arises because the cache-ON and cache-OFF execution paths accumulate floating-point values in different orders, and FP16 addition is not associative, so the two paths decode consistently different token sequences. Experiments on three open-weight models (LLaMA-2-7B, Mistral-7B-v0.3, and Gemma-2-2B) over the GSM8K dataset found a 100% token divergence rate across all sampling strategies, including greedy decoding, which rules out sampling randomness as the cause. Cache-ON produced higher accuracy in 8 of 9 conditions, indicating that the divergence has a systematic direction. A controlled FP32 falsification reduces the divergence by eight orders of magnitude and eliminates token flips. The paper, arXiv:2604.15409v1, challenges the long-presumed numerical equivalence of KV caching, with implications for the reliability of transformer models in precision-sensitive tasks.
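The root cause can be reproduced in a few lines: FP16 addition is not associative, so regrouping the same operands changes the rounded result. A minimal NumPy sketch (the values here are illustrative, not taken from the paper):

```python
import numpy as np

# The same three FP16 values, summed under two different groupings.
a, b, c = np.float16(1e4), np.float16(-1e4), np.float16(1.0)

left = (a + b) + c   # (10000 - 10000) + 1 -> 1.0
right = a + (b + c)  # -10000 + 1 rounds back to -10000, so the sum -> 0.0

print(float(left), float(right))  # 1.0 0.0
```

Any reordering a kernel performs (for example, different GEMM tiling for cached versus uncached tensor shapes) can therefore change the logits, and a logit change near a decision boundary flips the decoded token.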

Key facts

  • KV caching optimization in autoregressive transformer inference is not numerically equivalent to cache-free computation under FP16 precision
  • Cache-ON and cache-OFF paths use different floating-point accumulation orderings, causing deterministic divergence due to FP16 non-associativity
  • Experiments on LLaMA-2-7B, Mistral-7B-v0.3, and Gemma-2-2B models show 100% token divergence rate on GSM8K dataset
  • Divergence occurs across all sampling strategies, including greedy decoding, ruling out sampling randomness
  • Cache-ON yielded higher accuracy in 8 out of 9 conditions, indicating systematic divergence direction
  • Controlled FP32 falsification reduces divergence by eight orders of magnitude and eliminates token flips
  • The paper is arXiv:2604.15409v1, a cross-listed announcement
  • The finding challenges long-presumed numerical equivalence in KV caching

Entities

Institutions

  • arXiv

Sources