GRASPrune Framework Enables Efficient Structured Pruning of Large Language Models
A novel structured pruning technique, GRASPrune, has been introduced to reduce the computational cost of large language models. The method jointly prunes feed-forward network (FFN) channels and groups of key-value (KV) heads under a single global budget constraint. Unlike approaches that impose the budget only after training, GRASPrune uses a projected straight-through estimator to learn lightweight gate scores, enforcing a hard mask that satisfies the budget at every training step while keeping the backbone weights frozen. Once the mask is fixed, scaling factors are calibrated on the retained units to correct pruning-induced scale mismatch, then folded into the pruned weights to produce a smaller dense checkpoint. Applied to LLaMA-2-7B, the framework removes 50% of parameters and reaches a perplexity of 12.18 on WikiText-. The work, published as arXiv:2604.19398v1, offers a post-pretraining pruning strategy that improves efficiency without sacrificing performance.
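The core mechanism, learning gate scores under a hard budget via a projected straight-through estimator, can be illustrated with a toy sketch. Everything below (`project_topk`, the channel count, the loss) is an illustrative assumption, not GRASPrune's actual API: the forward pass uses a hard top-k mask that always meets the budget, while the backward pass passes the gradient straight through to the scores and leaves the backbone values frozen.

```python
import numpy as np

def project_topk(scores, k):
    # Hard projection onto the budget: keep only the k highest-scoring units.
    mask = np.zeros_like(scores)
    mask[np.argsort(scores)[-k:]] = 1.0
    return mask

# Hypothetical toy setup: 8 FFN channels, a global budget of k = 4.
rng = np.random.default_rng(0)
scores = rng.normal(size=8)            # learnable gate scores
scores_init = scores.copy()
acts = rng.normal(size=8)              # frozen channel outputs (backbone stays fixed)
target = acts.copy()
target[3] = 0.0                        # pretend channel 3 is redundant

for _ in range(50):
    mask = project_topk(scores, k=4)   # hard mask satisfies the budget every step
    out = mask * acts                  # forward pass uses the hard mask
    grad_out = 2.0 * (out - target)    # d(squared error)/d(out)
    # Straight-through estimator: treat d(mask)/d(scores) as identity,
    # so the gradient reaches the scores despite the hard projection.
    scores -= 0.1 * grad_out * acts    # only the gate scores are updated

mask = project_topk(scores, k=4)       # final mask still meets the budget
```

Gradient descent pushes down the score of the redundant channel whenever it is selected, so it eventually drops out of the retained set while the budget of four channels is honored at every step.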
Key facts
- GRASPrune is a structured pruning framework for large language models
- It jointly prunes FFN channels and KV head groups under a single global budget
- Uses projected straight-through estimator to learn gate scores with hard mask constraints
- Keeps backbone weights frozen during pruning process
- Calibrates scaling factors on retained units to mitigate pruning-induced scale mismatch
- Folds scaling factors into pruned weights to create smaller dense checkpoint
- On LLaMA-2-7B, removes 50% of parameters while achieving 12.18 perplexity on WikiText-
- Addresses memory and latency costs from parameters, attention computation, and KV caches
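The fold-in step listed above can be sketched in a few lines. The shapes, the `keep` index set, and the `alpha` scales below are hypothetical: the point is that per-channel scale calibration on the retained units can be absorbed into the down-projection matrix, so the exported checkpoint is an ordinary smaller dense model with no extra scaling parameters.

```python
import numpy as np

# Hypothetical shapes for one FFN layer: 16 hidden channels pruned to 8.
rng = np.random.default_rng(1)
W_up = rng.normal(size=(16, 4))    # up-projection: 4 inputs -> 16 channels
W_down = rng.normal(size=(4, 16))  # down-projection: 16 channels -> 4 outputs
keep = np.array([0, 2, 3, 5, 7, 9, 12, 14])     # channels surviving the mask
alpha = 1.0 + 0.1 * rng.normal(size=keep.size)  # calibrated per-channel scales

# Drop pruned rows/columns, then fold alpha into the down-projection columns
# so the checkpoint is a plain dense matrix.
W_up_p = W_up[keep]                   # (8, 4)
W_down_p = W_down[:, keep] * alpha    # scaling folded into the weights

x = rng.normal(size=4)
h = np.maximum(W_up_p @ x, 0.0)           # pruned forward pass (ReLU sketch)
y_folded = W_down_p @ h                   # folded checkpoint
y_scaled = W_down[:, keep] @ (alpha * h)  # explicit scaling, same result
```

The two outputs match because scaling a channel's activation and scaling the matching down-projection column are algebraically identical, which is what makes the fold-in lossless.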