GQLA: Hardware-Adaptive Attention for LLM Decoding

ai-technology · 2026-05-18

Researchers have introduced Group-Query Latent Attention (GQLA), a modification of the Multi-head Latent Attention (MLA) used in DeepSeek-V2/V3. MLA compresses keys and values into a low-rank latent and reaches near-perfect roofline performance on H100 GPUs, but it is tied to H100-class compute-bandwidth ratios: it forfeits tensor parallelism along the head axis and yields no Multi-Token Prediction (MTP) gain on GPUs such as the H20. GQLA makes a minimal adjustment to MLA's trained weights that exposes two equivalent decoding paths: an MQA-absorb path, which mirrors MLA, and a GQA path, which keeps an expanded cache per group. The runtime can then select whichever path suits the hardware, with no retraining and no custom kernels, so the same GQLA weights reach roofline performance on both H100 (MQA-absorb path, s_q=1) and H20 (GQA path). The approach is aimed at keeping large language model inference adaptable across hardware under export restrictions.
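The claim that an absorbed path and an expanded path are equivalent rests on a general property of latent attention, which the following minimal sketch illustrates for a single head. It is not the GQLA construction itself: the summary does not describe how groups are formed, so the sketch only shows the generic absorption identity, and it omits RoPE, masking, and batching. All names (`W_uk`, `W_uv`, `expanded_path`, `absorbed_path`) and the toy dimensions are assumptions made for illustration.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, v):
    """Multiply matrix M (rows x cols) by vector v (length cols)."""
    return [dot(row, v) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def weighted_sum(weights, vectors):
    width = len(vectors[0])
    return [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(width)]

def expanded_path(q, latents, W_uk, W_uv):
    """Materialize a key and a value per cached token, then attend (GQA/MHA style)."""
    keys = [matvec(W_uk, c) for c in latents]
    values = [matvec(W_uv, c) for c in latents]
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([scale * dot(q, k) for k in keys])
    return weighted_sum(weights, values)

def absorbed_path(q, latents, W_uk, W_uv):
    """Fold W_uk into the query and W_uv into the output; attend over latents only."""
    q_latent = matvec(transpose(W_uk), q)          # query moved into latent space
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([scale * dot(q_latent, c) for c in latents])
    out_latent = weighted_sum(weights, latents)    # attention output in latent space
    return matvec(W_uv, out_latent)                # up-project once at the end

# Toy sizes (hypothetical): latent width 8, head width 4, 5 cached tokens.
random.seed(0)
D_LATENT, D_HEAD, N_TOKENS = 8, 4, 5
W_uk = [[random.gauss(0, 1) for _ in range(D_LATENT)] for _ in range(D_HEAD)]
W_uv = [[random.gauss(0, 1) for _ in range(D_LATENT)] for _ in range(D_HEAD)]
latents = [[random.gauss(0, 1) for _ in range(D_LATENT)] for _ in range(N_TOKENS)]
q = [random.gauss(0, 1) for _ in range(D_HEAD)]

out_expanded = expanded_path(q, latents, W_uk, W_uv)
out_absorbed = absorbed_path(q, latents, W_uk, W_uv)
max_diff = max(abs(a - b) for a, b in zip(out_expanded, out_absorbed))
```

Both functions compute the same attention output up to floating-point rounding. The difference is what they read from memory: the absorbed path touches only the shared latent cache and does more arithmetic per byte, while the expanded path reads larger per-token keys and values and does less arithmetic per byte. That trade-off is what makes the choice of path hardware-dependent.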

Key facts

GQLA modifies DeepSeek-V2/V3's Multi-head Latent Attention (MLA).
MLA compresses keys and values into a low-rank latent.
MLA achieves near-perfect roofline on H100 GPUs.
MLA is tied to H100-class compute-bandwidth ratios.
MLA forfeits tensor parallelism along the head axis.
MLA yields no Multi-Token Prediction (MTP) gain on H20 GPUs.
GQLA exposes two decoding paths: MQA-absorb and GQA.
GQLA requires no retraining or custom kernels.
GQLA targets H100 and H20 GPUs.
The approach addresses hardware adaptability under export restrictions.
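
The roofline reasoning behind these facts can be sketched with a rough cost model. Everything numeric here is an assumption rather than something stated in the source: the attention dimensions follow DeepSeek-V3's published configuration, the group count of 8 is hypothetical, the GPU figures are approximate public dense-BF16 and HBM-bandwidth specs, and the per-token cost formulas are a simplified model of decode attention that ignores projections, softmax, and kernel overheads. The selector `select_path` is an illustrative helper, not the paper's runtime logic.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Gpu:
    name: str
    peak_flops: float     # dense BF16 FLOP/s (approximate, assumed)
    mem_bandwidth: float  # HBM bytes/s (approximate, assumed)

@dataclass(frozen=True)
class AttnDims:
    n_heads: int = 128        # DeepSeek-V3 style dimensions (assumed)
    d_latent: int = 512
    d_rope: int = 64
    d_nope: int = 128
    d_v: int = 128
    n_groups: int = 8         # hypothetical group count for the GQA path
    bytes_per_elem: int = 2   # BF16 cache

def mqa_absorb_cost(d, s_q):
    """FLOPs and cache bytes per cached token when all heads share one latent."""
    cache_width = d.d_latent + d.d_rope
    flops = 2 * d.n_heads * s_q * (cache_width + d.d_latent)
    bytes_read = cache_width * d.bytes_per_elem
    return flops, bytes_read

def gqa_cost(d, s_q):
    """FLOPs and cache bytes per cached token with expanded K/V per group."""
    per_group_width = (d.d_nope + d.d_rope) + d.d_v
    flops = 2 * d.n_heads * s_q * per_group_width
    bytes_read = d.n_groups * per_group_width * d.bytes_per_elem
    return flops, bytes_read

def roofline_time(gpu, flops, bytes_read):
    """Roofline estimate: time is set by the slower of compute and memory."""
    return max(flops / gpu.peak_flops, bytes_read / gpu.mem_bandwidth)

def select_path(gpu, dims=AttnDims(), s_q=1):
    """Pick the path with the lower estimated time per cached token."""
    costs = {"mqa_absorb": mqa_absorb_cost(dims, s_q), "gqa": gqa_cost(dims, s_q)}
    return min(costs, key=lambda path: roofline_time(gpu, *costs[path]))

H100 = Gpu("H100", peak_flops=989e12, mem_bandwidth=3.35e12)
H20 = Gpu("H20", peak_flops=148e12, mem_bandwidth=4.0e12)

for gpu in (H100, H20):
    print(gpu.name, select_path(gpu, s_q=1))
```

Under these assumed numbers the model picks the MQA-absorb path on H100 and the GQA path on H20, which matches the pairing described above. It also shows why MTP does not help MLA on H20 in this model: the MQA-absorb path is already compute-bound there, so doubling s_q doubles its estimated time, whereas the GQA path stays memory-bound at s_q=2 and its estimated time does not change.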

Sources

arXiv cs.AI — 2026-05-18