ARTFEED — Contemporary Art Intelligence

Runtime-Certified Bounded-Error Quantized Attention for LLMs

ai-technology · 2026-05-22

A new tiered KV cache architecture enables runtime-certified attention in large language models, keeping the error introduced by quantization within a bound that is checked at runtime. The architecture stores INT8 keys and INT4 values in GPU memory and keeps the FP16 originals in system RAM as a fallback. Using a two-term error decomposition, it computes bounds on attention distribution distortion and value reconstruction error for each head at each step. These bounds drive adaptive precision selection and a multi-stage fallback ladder that guarantees recovery to the exact dense attention output when necessary. Evaluated on LLaMA 3.1-8B with contexts of up to 128K across the PG-19, NIAH, and RULER benchmarks, the system performs comparably to dense full-precision attention while lowering memory costs. The work addresses the lack of runtime error detection in current KV cache quantization systems, which rely solely on average-case robustness.
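
One way such a two-term bound could be formed is sketched below in plain Python for a single head and a single step. It is an illustration under stated assumptions, not the paper's formulation: symmetric round-to-nearest quantization per vector, a Hölder bound on the score error, and the standard fact that shifting every softmax logit by at most eps changes the distribution by at most exp(2*eps) - 1 in L1 norm. The function names (quantize, certified_bound) are hypothetical, and Python floats stand in for the FP16 originals.

```python
import math
import random


def quantize(vec, bits):
    """Symmetric round-to-nearest quantization of one vector: (ints, scale)."""
    qmax = 2 ** (bits - 1) - 1                        # 127 for INT8, 7 for INT4
    scale = max(abs(x) for x in vec) / qmax or 1.0    # 1.0 guards an all-zero vector
    return [round(x / scale) for x in vec], scale


def dequantize(ints, scale):
    return [i * scale for i in ints]


def attention(q, keys, values):
    """Dense softmax attention for one query vector."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    return [sum(p * v[j] for p, v in zip(probs, values)) for j in range(len(values[0]))]


def certified_bound(q, key_scales, deq_values, value_scales):
    """Upper bound on ||exact_out - quantized_out||_2 (assumed decomposition)."""
    d = len(q)
    dv = len(deq_values[0])
    # Each key element is off by at most scale/2, so each score is off by at most eps.
    eps = sum(abs(x) for x in q) * max(key_scales) / 2 / math.sqrt(d)
    # Term 1: L1 distortion of the attention distribution (never more than 2),
    # multiplied by the largest dequantized value norm. The inner min avoids overflow.
    distortion = min(2.0, math.expm1(min(2 * eps, 2.0)))
    max_norm = max(math.sqrt(sum(x * x for x in v)) for v in deq_values)
    # Term 2: worst-case L2 reconstruction error of a single value vector.
    value_err = math.sqrt(dv) * max(value_scales) / 2
    return distortion * max_norm + value_err


random.seed(0)
d, n = 8, 16
q = [random.gauss(0, 1) for _ in range(d)]
keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
values = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

qk = [quantize(k, 8) for k in keys]       # INT8 keys
qv = [quantize(v, 4) for v in values]     # INT4 values
deq_keys = [dequantize(i, s) for i, s in qk]
deq_values = [dequantize(i, s) for i, s in qv]

exact = attention(q, keys, values)
approx = attention(q, deq_keys, deq_values)
err = math.sqrt(sum((a - b) ** 2 for a, b in zip(exact, approx)))
bound = certified_bound(q, [s for _, s in qk], deq_values, [s for _, s in qv])
print(f"observed error {err:.4f}, certified bound {bound:.4f}")
```

In this sketch the bound is computed from the query, the quantization scales, and the dequantized values only, so it never reads the full-precision originals; those are used here solely to check that the observed error stays under the bound.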

Key facts

  • Tiered KV cache architecture with INT8 keys and INT4 values in GPU memory, FP16 originals in system RAM
  • Two-term error decomposition yields per-head, per-step bounds on attention distortion and value error
  • Adaptive precision selection and multi-stage fallback ladder guarantee recovery to exact dense attention
  • Tested on LLaMA 3.1-8B with contexts of up to 128K tokens
  • Benchmarks: PG-19, NIAH, RULER
  • Performs comparably to dense full-precision attention while reducing memory cost
  • Presented as the first system to provide runtime-certified bounded error for quantized attention
  • Published on arXiv: 2605.20868
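
The fallback ladder in the list above can be pictured as a loop over precision tiers, cheapest first, that stops at the first tier whose certified bound fits a tolerance. The sketch below is a hypothetical illustration: the Tier class, the tier names, the middle INT8/INT8 stage, and the numeric bounds, outputs, and tolerance are all assumptions made for the example, not details reported for the system.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Tier:
    name: str
    error_bound: Callable[[], float]    # certified bound for this head and step
    attend: Callable[[], List[float]]   # attention output at this precision


def attend_with_fallback(ladder: Sequence[Tier], tolerance: float) -> Tuple[str, List[float]]:
    """Return the output of the cheapest tier whose bound is within tolerance."""
    for tier in ladder:
        if tier.error_bound() <= tolerance:
            return tier.name, tier.attend()
    raise ValueError("ladder should end with an exact tier whose bound is 0.0")


# Toy ladder: bounds and outputs are made-up numbers for illustration only.
ladder = [
    Tier("int8-keys/int4-values (GPU)", lambda: 0.30, lambda: [0.91, 0.12]),
    Tier("int8-keys/int8-values (GPU)", lambda: 0.02, lambda: [0.99, 0.10]),
    Tier("fp16 dense (system RAM)", lambda: 0.0, lambda: [1.00, 0.10]),
]

chosen, output = attend_with_fallback(ladder, tolerance=0.05)
print(chosen)
strict, exact_output = attend_with_fallback(ladder, tolerance=0.0)
print(strict)
```

Because the last tier recomputes from the full-precision originals, its bound is zero, so the loop always ends with an exact result when no cheaper tier qualifies.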
