GRASPrune Framework Enables Efficient Structured Pruning of Large Language Models
A novel structured pruning technique, GRASPrune, has been introduced to reduce the computational cost of large language models. The method jointly prunes feed-forward network (FFN) channels and groups of key-value (KV) heads under a single global budget constraint. Unlike approaches that impose the budget only after training, GRASPrune uses a projected straight-through estimator to learn lightweight gate scores, enforcing a hard mask that satisfies the budget at every training step while keeping the backbone weights frozen. Once the mask is fixed, scaling factors are calibrated on the retained units to correct pruning-induced scale mismatch, then folded into the pruned weights to produce a smaller dense checkpoint. Applied to LLaMA-2-7B, the framework removes 50% of parameters and reaches a perplexity of 12.18 on WikiText-. The work, published as arXiv:2604.19398v1, offers a post-pretraining pruning strategy that improves efficiency without sacrificing performance.
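The core mechanism, learning gate scores under a hard budget via a projected straight-through estimator, can be illustrated with a toy sketch. Everything below (`project_topk`, the channel count, the loss) is an illustrative assumption, not GRASPrune's actual API: the forward pass uses a hard top-k mask that always meets the budget, while the backward pass passes the gradient straight through to the scores and leaves the backbone values frozen.

```python
import numpy as np

def project_topk(scores, k):
    # Hard projection onto the budget: keep only the k highest-scoring units.
    mask = np.zeros_like(scores)
    mask[np.argsort(scores)[-k:]] = 1.0
    return mask

# Hypothetical toy setup: 8 FFN channels, a global budget of k = 4.
rng = np.random.default_rng(0)
scores = rng.normal(size=8)            # learnable gate scores
scores_init = scores.copy()
acts = rng.normal(size=8)              # frozen channel outputs (backbone stays fixed)
target = acts.copy()
target[3] = 0.0                        # pretend channel 3 is redundant

for _ in range(50):
    mask = project_topk(scores, k=4)   # hard mask satisfies the budget every step
    out = mask * acts                  # forward pass uses the hard mask
    grad_out = 2.0 * (out - target)    # d(squared error)/d(out)
    # Straight-through estimator: treat d(mask)/d(scores) as identity,
    # so the gradient reaches the scores despite the hard projection.
    scores -= 0.1 * grad_out * acts    # only the gate scores are updated

mask = project_topk(scores, k=4)       # final mask still meets the budget
```

Gradient descent pushes down the score of the redundant channel whenever it is selected, so it eventually drops out of the retained set while the budget of four channels is honored at every step.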
Key facts
- GRASPrune is a structured pruning framework for large language models
- It jointly prunes FFN channels and KV head groups under a single global budget
- Uses projected straight-through estimator to learn gate scores with hard mask constraints
- Keeps backbone weights frozen during pruning process
- Calibrates scaling factors on retained units to mitigate pruning-induced scale mismatch
- Folds scaling factors into pruned weights to create smaller dense checkpoint
- On LLaMA-2-7B, removes 50% of parameters while achieving 12.18 perplexity on WikiText-
- Addresses memory and latency costs from parameters, attention computation, and KV caches
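The fold-in step listed above can be sketched in a few lines. The shapes, the `keep` index set, and the `alpha` scales below are hypothetical: the point is that per-channel scale calibration on the retained units can be absorbed into the down-projection matrix, so the exported checkpoint is an ordinary smaller dense model with no extra scaling parameters.

```python
import numpy as np

# Hypothetical shapes for one FFN layer: 16 hidden channels pruned to 8.
rng = np.random.default_rng(1)
W_up = rng.normal(size=(16, 4))    # up-projection: 4 inputs -> 16 channels
W_down = rng.normal(size=(4, 16))  # down-projection: 16 channels -> 4 outputs
keep = np.array([0, 2, 3, 5, 7, 9, 12, 14])     # channels surviving the mask
alpha = 1.0 + 0.1 * rng.normal(size=keep.size)  # calibrated per-channel scales

# Drop pruned rows/columns, then fold alpha into the down-projection columns
# so the checkpoint is a plain dense matrix.
W_up_p = W_up[keep]                   # (8, 4)
W_down_p = W_down[:, keep] * alpha    # scaling folded into the weights

x = rng.normal(size=4)
h = np.maximum(W_up_p @ x, 0.0)           # pruned forward pass (ReLU sketch)
y_folded = W_down_p @ h                   # folded checkpoint
y_scaled = W_down[:, keep] @ (alpha * h)  # explicit scaling, same result
```

The two outputs match because scaling a channel's activation and scaling the matching down-projection column are algebraically identical, which is what makes the fold-in lossless.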