Post-Training N:M Activation Pruning for Efficient LLM Inference
A new study posted to arXiv (2509.22166) investigates post-training N:M activation sparsification for large language models (LLMs), where only N non-zero values are kept in every contiguous block of M activations. The authors find that activation pruning preserves generative capabilities better than weight pruning at equivalent sparsity levels. The work evaluates lightweight error mitigation techniques and pruning criteria, establishing hardware-friendly baselines that require minimal calibration. It also explores sparsity patterns beyond NVIDIA's standard 2:4, showing that the 16:32 pattern achieves performance nearly on par with unstructured pruning. The research addresses the underexplored area of dynamic, input-adaptive activation compression to reduce I/O overhead in LLM inference.
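As an illustration of the basic operation (a minimal sketch using a simple magnitude criterion, not necessarily the paper's exact pruning criterion), the snippet below applies an N:M rule to an activation tensor: in every contiguous block of M values along the hidden dimension, the N largest-magnitude entries are kept and the rest are zeroed. The function name and example shapes are assumptions for illustration.

```python
import torch

def nm_prune_activations(x: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the n largest-magnitude values in every contiguous block of m
    along the last dimension and zero the rest (illustrative magnitude rule)."""
    assert x.shape[-1] % m == 0, "hidden size must be divisible by m"
    blocks = x.reshape(-1, m)                         # one row per N:M block
    topk = blocks.abs().topk(n, dim=-1).indices       # positions to keep
    mask = torch.zeros_like(blocks, dtype=torch.bool).scatter_(-1, topk, True)
    return (blocks * mask).reshape(x.shape)

# Both patterns give 50% sparsity, but the larger 16:32 block allows more
# flexible placement of the surviving values than the hardware-standard 2:4.
x = torch.randn(1, 8, 128)                            # (batch, tokens, hidden)
sparse_2_4   = nm_prune_activations(x, n=2,  m=4)
sparse_16_32 = nm_prune_activations(x, n=16, m=32)
```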
Key facts
- Study focuses on post-training N:M activation pruning in LLMs
- Activation pruning preserves generative capabilities better than weight pruning at equivalent sparsity
- Evaluates lightweight, plug-and-play error mitigation techniques (see the sketch after this list)
- Establishes hardware-friendly baselines requiring minimal calibration
- Explores sparsity patterns beyond NVIDIA's standard 2:4
- 16:32 pattern achieves performance nearly on par with unstructured pruning
- Addresses dynamic, input-adaptive compression and I/O overhead reduction
- Published on arXiv with ID 2509.22166
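One generic example of a lightweight, calibration-free correction (a hypothetical sketch, not necessarily one of the paper's own mitigation techniques) is to rescale the surviving values in each block so the block keeps its dense L1 mass, partially compensating the error introduced by zeroing activations.

```python
import torch

def nm_prune_with_rescale(x: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Magnitude N:M pruning followed by a per-block rescale so each block
    retains its original L1 mass (hypothetical error-mitigation example)."""
    assert x.shape[-1] % m == 0, "hidden size must be divisible by m"
    blocks = x.reshape(-1, m)
    topk = blocks.abs().topk(n, dim=-1).indices
    mask = torch.zeros_like(blocks, dtype=torch.bool).scatter_(-1, topk, True)
    pruned = blocks * mask
    # Scale surviving values so the block's L1 norm matches the dense block.
    dense_l1 = blocks.abs().sum(dim=-1, keepdim=True)
    kept_l1 = pruned.abs().sum(dim=-1, keepdim=True).clamp_min(1e-8)
    return (pruned * (dense_l1 / kept_l1)).reshape(x.shape)
```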
Entities
Institutions
- arXiv
- NVIDIA