Focus: Efficient Attention with Learnable Centroids for Pretrained Models
Focus is a method for efficient attention that adds learnable centroids, as few as 148K parameters per layer, which act as gates restricting long-range interactions to token pairs within the same centroid group. The method can be retrofitted into any pretrained model: only the centroids are trained, while the original weights stay frozen. Experiments on models from 124M to 70B parameters across five attention architectures show no degradation on downstream benchmarks. At the 124M scale, sparse Focus attention reaches a perplexity of 30.3 versus 31.4 for full attention, and it matches full attention at larger scales.
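The gating idea can be pictured as a hard assignment of tokens to centroids followed by a block-sparse attention mask. The sketch below is a minimal single-head PyTorch illustration, not the authors' implementation: the nearest-centroid assignment rule, the fully masked cross-group pairs, and the omission of multi-head and causal masking are simplifying assumptions, and the real method presumably uses a differentiable relaxation so gradients can reach the centroids.

```python
import torch
import torch.nn.functional as F


def centroid_gated_attention(x, w_q, w_k, w_v, centroids):
    """Minimal sketch of centroid-gated sparse attention.

    x:            (batch, seq, d_model) hidden states entering a frozen attention layer
    w_q/w_k/w_v:  (d_model, d_model) frozen projection weights from the pretrained model
    centroids:    (num_centroids, d_model) learnable parameters, the only trained weights
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # Hard-assign each token to its nearest centroid (assignment rule is an assumption).
    group = (x @ centroids.T).argmax(dim=-1)                 # (batch, seq)

    # Gate: attention is only allowed between tokens in the same centroid group.
    # The diagonal is always allowed, since a token shares a group with itself.
    same_group = group.unsqueeze(-1) == group.unsqueeze(-2)  # (batch, seq, seq)

    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    scores = scores.masked_fill(~same_group, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


# Toy usage with random weights standing in for a pretrained layer.
b, t, d, c = 2, 16, 64, 4
x = torch.randn(b, t, d)
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
centroids = torch.nn.Parameter(torch.randn(c, d))
print(centroid_gated_attention(x, w_q, w_k, w_v, centroids).shape)  # (2, 16, 64)
```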
Key facts
- Focus adds learnable centroids (as few as 148K parameters per layer) to gate token pair attention.
- Only centroids are trained; original pretrained weights remain frozen.
- Experiments cover model sizes from 124M to 70B parameters and five attention architectures.
- Sparse Focus attention achieves 30.3 perplexity vs. 31.4 for full attention at 124M scale.
- Focus matches full attention performance on downstream benchmarks with zero degradation.
- Standard attention scales quadratically with sequence length; Focus reduces this cost.
- Focus is composable and can be retrofitted into any pretrained model (see the retrofitting sketch after this list).
- The method learns which token pairs matter for efficient attention.
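The retrofit step amounts to freezing every pretrained weight and registering one small centroid tensor per layer, so only those tensors receive gradients. The sketch below is a rough illustration under an assumed model structure: `model.layers`, the per-layer `d_model` attribute, the centroid count, and the optimizer settings are placeholders, not details from the paper.

```python
import torch
from torch import nn


def retrofit_with_centroids(model, num_centroids=64):
    """Freeze a pretrained model and attach one trainable centroid matrix per layer."""
    # Keep the original pretrained weights intact.
    for p in model.parameters():
        p.requires_grad = False

    # Register a small learnable centroid matrix on each layer; these are the
    # only parameters that will be trained.
    for layer in model.layers:
        layer.centroids = nn.Parameter(torch.randn(num_centroids, layer.d_model))

    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-4)  # learning rate is a placeholder
```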