ARTFEED — Contemporary Art Intelligence

Budgeted Attention Allocation Boosts Transformer Efficiency

other · 2026-05-09

Budgeted Attention Allocation is a technique that lets a single trained transformer operate at multiple cost-quality levels. It uses a monotone head-gating mechanism conditioned on a requested attention budget, and dense warm-starting is highlighted as important for training stability. On a synthetic sequence task, the budgeted model reached 99.7% accuracy at an estimated attention cost of 0.303 and 100.0% at a cost of 0.504. On AG News, a custom word-level transformer with hard-gate adaptation attained 82.1% accuracy with a 1.28x speedup at a budget of 0.50. Budgeted structural pruning of a pretrained BERT-Mini reached 87.6% accuracy with a 1.20x speedup at the same budget, surpassing a zero-shot dense post-hoc baseline (86.1%) and closely approaching a per-budget specialist after one recovery epoch (87.9%). The approach was also evaluated on DBpedia14 with BERT-Mini.
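The core idea, per the summary above, is a per-head gate that is a monotone function of the requested budget. The PyTorch sketch below shows one way such a gate could be parameterized; the learned per-head thresholds, the sigmoid form, the temperature, and the mean-gate cost proxy are illustrative assumptions, not necessarily the paper's exact formulation.

import torch
import torch.nn as nn

class BudgetedHeadGate(nn.Module):
    """Monotone per-head gates conditioned on a scalar budget b in [0, 1].

    Illustration only: the learned-threshold sigmoid below is an assumed
    parameterization. The property it preserves is monotonicity: raising
    the budget can only open gates wider, never close them.
    """
    def __init__(self, num_heads, temperature=0.1):
        super().__init__()
        # Learned per-head "opening point": heads with low thresholds switch
        # on at small budgets, heads with high thresholds only at large ones.
        self.thresholds = nn.Parameter(torch.linspace(0.1, 0.9, num_heads))
        self.temperature = temperature

    def forward(self, budget):
        # budget: scalar (float or 0-dim tensor) in [0, 1].
        # Returns (num_heads,) gates in (0, 1), non-decreasing in budget.
        return torch.sigmoid((budget - self.thresholds) / self.temperature)

    def expected_cost(self, budget):
        # Simple proxy for attention cost: mean gate value across heads.
        return self.forward(budget).mean()

# Usage sketch: scale each head's attention output by its gate before the
# output projection; a penalty such as |expected_cost - budget| could be
# added to the training loss to tie realized cost to the requested budget.
gate = BudgetedHeadGate(num_heads=8)
g = gate(torch.tensor(0.5))                # per-head gates at budget 0.5
attn_heads = torch.randn(2, 8, 16, 32)     # (batch, heads, seq, head_dim)
gated = attn_heads * g.view(1, -1, 1, 1)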

Key facts

  • Budgeted Attention Allocation is a monotone head-gating mechanism conditioned on a requested attention budget.
  • Dense warm-starting is important for stability.
  • On a synthetic sequence task, the model reached 99.7% accuracy at 0.303 cost and 100.0% at 0.504 cost.
  • On AG News with a custom word-level transformer, hard-gate adaptation achieved 82.1% accuracy with 1.28x speedup at budget 0.50 (a minimal hard-gating sketch follows this list).
  • On pretrained BERT-Mini AG News, budgeted structural pruning reached 87.6% accuracy with 1.20x speedup at budget 0.50.
  • A zero-shot dense post-hoc structural baseline reached 86.1% accuracy.
  • One recovery epoch raised the per-budget specialist to 87.9% accuracy.
  • The method was also tested on DBpedia14 with BERT-Mini.
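
For the hard-gate adaptation and structural pruning results above, a plausible reading is that the soft gates are converted to a binary keep/drop decision per head so that the kept fraction matches the budget; skipping dropped heads entirely is what would yield real wall-clock speedups such as the reported 1.28x and 1.20x at budget 0.50. The sketch below (the function name hard_gate_mask and the top-k selection rule are hypothetical) illustrates that conversion step under those assumptions, not the paper's exact procedure.

import torch

def hard_gate_mask(gates: torch.Tensor, budget: float) -> torch.Tensor:
    """Turn soft per-head gates into a binary keep/drop mask at a budget.

    Assumed rule: keep the round(budget * num_heads) heads with the
    largest gate values and drop the rest. The paper's rule may differ.
    """
    num_heads = gates.numel()
    k = max(1, round(budget * num_heads))
    keep = torch.zeros(num_heads, dtype=torch.bool)
    keep[gates.topk(k).indices] = True
    return keep

# Example: at budget 0.50, half of the 8 heads are kept. For structural
# pruning of a pretrained model (e.g. BERT-Mini), the dropped heads' weight
# slices could then be removed from the attention projections entirely.
gates = torch.tensor([0.9, 0.1, 0.7, 0.2, 0.8, 0.05, 0.6, 0.3])
mask = hard_gate_mask(gates, budget=0.50)
print(mask)  # tensor([ True, False,  True, False,  True, False,  True, False])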

Entities

Institutions

  • arXiv

Sources