Runtime-Certified Bounded-Error Quantized Attention for LLMs

ai-technology · 2026-05-22

A new tiered KV cache architecture enables runtime-certified attention in large language models, bounding the error that quantization introduces. It stores INT8 keys and INT4 values in GPU memory and keeps the FP16 originals in system RAM as a reliable fallback. A two-term error decomposition yields bounds, for each head and each step, on attention distribution distortion and on value reconstruction error. These bounds drive adaptive precision selection and a multi-stage fallback ladder that guarantees recovery to exact dense attention output when needed. Evaluated on LLaMA 3.1-8B with contexts up to 128K on the PG-19, NIAH, and RULER benchmarks, the system matches dense full-precision attention while reducing memory cost. The work addresses the lack of runtime error detection in existing KV cache quantization systems, which rely solely on average-case robustness.
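
To make the two-term decomposition concrete, the Python sketch below shows one way such a bound could be computed at runtime from the quantized cache alone. It is not the paper's derivation: the symmetric per-tensor quantizer, the function names (quantize, certified_attention), and the specific inequalities are assumptions chosen for illustration. With exact output p·V and quantized output p̂·V̂, the error splits as (p̂ − p)·V̂ + p·(V̂ − V); the first term reflects attention distribution distortion and the second reflects value reconstruction error.

```python
import numpy as np

def quantize(x, bits):
    """Hypothetical symmetric per-tensor round-to-nearest quantizer.

    Returns the dequantized tensor and the scale; each element is off
    by at most scale / 2 because the range is chosen to avoid clipping.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = max(float(np.abs(x).max()), 1e-12) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale, scale

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def certified_attention(q, k_hat, v_hat, k_scale, v_scale):
    """Single-head, single-query attention on quantized K/V plus an L2 error bound.

    Only quantities available without the FP16 originals are used.
    """
    d = q.shape[0]
    p_hat = softmax(k_hat @ q / np.sqrt(d))
    out = p_hat @ v_hat

    # Term 1: attention distribution distortion.
    # Each logit moves by at most delta, so ||p_hat - p||_1 <= exp(2*delta) - 1
    # (and never more than 2 for two probability vectors).
    delta = np.abs(q).sum() * (k_scale / 2) / np.sqrt(d)
    l1_shift = min(2.0, float(np.expm1(2 * delta)))
    dist_term = l1_shift * float(np.linalg.norm(v_hat, axis=1).max())

    # Term 2: value reconstruction error.
    # p is a probability vector, so the weighted row error is at most the
    # worst row error, which is sqrt(d_v) * v_scale / 2.
    value_term = float(np.sqrt(v_hat.shape[1])) * v_scale / 2

    return out, dist_term + value_term

# Illustrative usage with random data (sizes are arbitrary assumptions).
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
k = rng.standard_normal((256, 64))
v = rng.standard_normal((256, 64))
k_hat, k_scale = quantize(k, bits=8)   # INT8 keys
v_hat, v_scale = quantize(v, bits=4)   # INT4 values
out_hat, bound = certified_attention(q, k_hat, v_hat, k_scale, v_scale)
exact = softmax(k @ q / np.sqrt(64)) @ v
print("true error:", np.linalg.norm(out_hat - exact), "certified bound:", bound)
```

A worst-case bound of this kind is conservative by construction, so the true error is typically far below it. A practical system would presumably use tighter, finer-grained scales (per channel or per group) to keep the certificate useful.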

Key facts

Tiered KV cache architecture with INT8 keys and INT4 values in GPU memory, FP16 originals in system RAM
Two-term error decomposition yields per-head, per-step bounds on attention distortion and value error
Adaptive precision selection and multi-stage fallback ladder guarantee recovery to exact dense attention
Tested on LLaMA 3.1-8B with contexts up to 128K
Benchmarks: PG-19, NIAH, RULER
Matches dense full-precision attention while reducing memory cost
First system to provide runtime-certified bounded error for quantized attention
Published on arXiv: 2605.20868
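
The fallback ladder named in the key facts can be pictured as a loop over precision tiers, cheapest first, that stops at the first tier whose certified bound is within tolerance. The Python sketch below is a minimal illustration under stated assumptions: the tier names, the tolerance, and the bound values are hypothetical, and the source does not describe the stages of the actual ladder.

```python
from typing import Callable, List, Sequence, Tuple

# A tier is (name, fn); fn() returns (attention_output, certified_error_bound).
Tier = Tuple[str, Callable[[], Tuple[List[float], float]]]

def attend_with_fallback(tiers: Sequence[Tier], tol: float):
    """Return (name, output, bound) for the first tier with bound <= tol.

    Tiers are evaluated lazily, so more expensive tiers run only when
    cheaper ones fail certification. The final tier is assumed to be
    exact dense attention on the FP16 originals, with bound 0.0.
    """
    for name, fn in tiers:
        out, bound = fn()
        if bound <= tol:
            return name, out, bound
    raise RuntimeError("no tier met the tolerance; the last tier should be exact")

# Hypothetical ladder with placeholder outputs and made-up bounds.
ladder: List[Tier] = [
    ("int8_keys_int4_values", lambda: ([0.10, 0.21], 0.20)),
    ("int8_keys_fp16_values", lambda: ([0.10, 0.20], 0.04)),
    ("fp16_dense",            lambda: ([0.10, 0.20], 0.0)),
]

chosen, output, bound = attend_with_fallback(ladder, tol=0.05)
print(chosen, bound)
```

Because the last rung carries a bound of zero, the loop always terminates with an answer that meets any non-negative tolerance, which is the sense in which such a ladder can guarantee recovery to exact dense attention.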

Sources

arXiv cs.AI — 2026-05-21