Power Capping Illusion in LLM Decode: Phase-Aware Energy Study Across Attention Architectures
A study posted to arXiv (2605.11999v1) finds that power capping, a standard GPU energy lever in LLM serving, is ineffective during autoregressive decode, the phase that dominates production serving. Across four attention paradigms (GQA, MLA, Gated DeltaNet, and Mamba2) on NVIDIA H200 GPUs, decode draws only 137–300 W on a 700 W GPU: because memory-bound decode saturates HBM bandwidth rather than compute, the cap never triggers and the power headroom goes unused. Firmware-initiated clock throttling compounds the illusion by corrupting throughput measurements. Locking the SM clock resolves both issues: it Pareto-dominates power capping, recovering up to 32% of decode energy with minimal throughput loss. The study also identifies three architecture-dependent factors that shape energy efficiency.
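The metric behind a claim like "recovers up to 32% of decode energy with minimal throughput loss" is energy per generated token: average power (J/s) divided by token throughput (tokens/s). A minimal sketch of that arithmetic, where the 300 W default draw matches the study's reported range but the throughput figures and locked-clock power are illustrative assumptions, not numbers from the paper:

```python
def energy_per_token_joules(avg_power_watts: float, tokens_per_second: float) -> float:
    # Energy per token (J) = power (W = J/s) / throughput (tokens/s)
    return avg_power_watts / tokens_per_second

# Hypothetical decode measurements (locked-clock power and both
# throughputs are illustrative, not taken from the study)
default_j = energy_per_token_joules(300.0, 100.0)  # default clocks
locked_j = energy_per_token_joules(200.0, 98.0)    # SM clock locked lower

savings = 1.0 - locked_j / default_j
print(f"energy/token: {default_j:.2f} J -> {locked_j:.2f} J ({savings:.0%} saved)")
# → energy/token: 3.00 J -> 2.04 J (32% saved)
```

In these terms, "Pareto-dominates" means the clock-locked configuration is at least as good on both axes at once, lower energy per token without a worse throughput trade-off than any power-cap setting achieves.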
Key facts
- Power capping is ineffective during autoregressive decode in LLM serving.
- Study tested GQA, MLA, Gated DeltaNet, and Mamba2 on NVIDIA H200.
- Decode draws only 137–300 W on a 700 W GPU.
- Memory-bound decode saturates HBM bandwidth, not compute.
- Firmware clock throttling corrupts throughput measurements.
- SM clock locking recovers up to 32% of decode energy.
- Clock locking Pareto-dominates power capping.
- Three architecture-dependent factors identified.
Entities
Institutions
- arXiv
- NVIDIA