Hardware Accelerator for Long-Context LLM Attention Decoding
A new hardware accelerator addresses the computational challenges of long-context attention decoding in large language models. Because decoding repeatedly streams the entire KV cache, whose size grows linearly with sequence length, memory and bandwidth demands degrade performance on existing accelerators designed for short contexts. The proposed solution is a hardware-software co-design: on the software side, dual-compression dynamic sparse attention combines ultra-low-precision quantization with feature sparsity to shrink KV-cache traffic, and a hardware-friendly approximate Top-K selection lowers the token-filtering complexity from O(n log k) to O(n); on the hardware side, the accelerator is deeply optimized for these sparse computations. The work is available on arXiv (2604.24820).
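The summary does not say how the O(n) approximate Top-K is realized; one common hardware-friendly approach is a single-pass histogram over bucketed scores, which replaces the heap maintenance of exact Top-K with a fixed-cost linear scan. The NumPy sketch below illustrates that idea under this assumption; the function name `approx_topk` and the bucket count are hypothetical, not taken from the paper.

```python
import numpy as np

def approx_topk(scores: np.ndarray, k: int, num_buckets: int = 256) -> np.ndarray:
    """Approximate Top-K via a single-pass histogram over quantized scores.

    Runs in O(n) for a fixed bucket count, versus O(n log k) for an
    exact heap-based Top-K. The result may slightly over- or
    under-shoot k, because every element in the threshold bucket is
    kept together.
    """
    lo, hi = scores.min(), scores.max()
    if lo == hi:                       # degenerate case: all scores equal
        return np.arange(min(k, scores.size))
    # Quantize each score to a bucket id (one pass over the data).
    buckets = ((scores - lo) / (hi - lo) * (num_buckets - 1)).astype(np.int64)
    counts = np.bincount(buckets, minlength=num_buckets)
    # Walk buckets from the highest score down until ~k elements are covered.
    threshold_bucket = 0
    cumulative = 0
    for b in range(num_buckets - 1, -1, -1):
        cumulative += counts[b]
        if cumulative >= k:
            threshold_bucket = b
            break
    # Keep every element at or above the threshold bucket.
    return np.nonzero(buckets >= threshold_bucket)[0]

# Example: select roughly 16 of 4096 attention scores.
rng = np.random.default_rng(0)
scores = rng.standard_normal(4096)
idx = approx_topk(scores, k=16)
print(len(idx), scores[idx].min())
```

The appeal for hardware is that the histogram pass is branch-light and fully parallelizable, whereas an exact Top-K heap creates data-dependent control flow that maps poorly onto fixed pipelines.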
Key facts
- Compute and memory footprints grow linearly with sequence length.
- The decoding phase continuously accesses the massive KV cache, increasing bandwidth and compute pressure.
- Existing accelerators suffer performance degradation on long contexts.
- The proposed accelerator uses hardware-software co-design.
- Software: dual-compression dynamic sparse attention combining ultra-low-precision quantization with feature sparsity (see the sketch after this list).
- A hardware-friendly approximate Top-K selection reduces filter complexity from O(n log k) to O(n).
- The hardware is deeply optimized for sparse computations.
- Paper available on arXiv (ID 2604.24820).
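As a rough illustration of the software pipeline, the sketch below shows one plausible reading of dual-compression dynamic sparse attention at decode time: score the query against a 4-bit copy of the keys restricted to a small set of salient feature dimensions (the two compressions), select the most relevant cached tokens, and run exact attention only over that subset. The 4-bit format, the largest-|q| feature heuristic, and all function names are assumptions for illustration, not the paper's design.

```python
import numpy as np

def quantize_4bit(x: np.ndarray):
    """Symmetric per-row 4-bit quantization: returns int codes and scales."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0 + 1e-8
    codes = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return codes, scale

def sparse_decode_attention(q, K, V, k=64, num_features=32):
    """One decode step of (assumed) dual-compression sparse attention.

    q: (d,) current query; K, V: (n, d) cached keys/values.
    Compression 1: keys are quantized to 4 bits for the scoring pass.
    Compression 2: scoring uses only the num_features dimensions where
    |q| is largest (feature sparsity).
    Full-precision attention then runs over just the selected tokens.
    """
    n, d = K.shape
    feats = np.argpartition(np.abs(q), -num_features)[-num_features:]
    codes, scale = quantize_4bit(K[:, feats])            # low-precision K slice
    approx_scores = (codes * scale) @ q[feats]           # cheap scoring pass
    # Stand-in exact Top-K; the accelerator would use the O(n)
    # approximate selector sketched earlier.
    sel = np.argpartition(approx_scores, -min(k, n))[-min(k, n):]
    logits = (K[sel] @ q) / np.sqrt(d)                   # exact, sparse attention
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ V[sel]

rng = np.random.default_rng(0)
n, d = 8192, 128
q, K, V = rng.standard_normal(d), rng.standard_normal((n, d)), rng.standard_normal((n, d))
out = sparse_decode_attention(q, K, V)
print(out.shape)  # (128,)
```

The point of the two-stage structure is that the expensive full-precision pass touches only k of the n cached tokens, which is what relieves the KV-cache bandwidth pressure noted above.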