Gated Subspace Inference Accelerates Transformer Models Up to 10.5x
Gated Subspace Inference is a method that accelerates inference in transformer language models by exploiting the low effective rank of token activation manifolds. It decomposes each activation vector into a subspace component and a residual, and caches a low-rank image of the weights for the subspace path, cutting the memory bandwidth spent on weight reads. A per-token gate decides whether the residual correction is computed, bounding how far the output distribution can drift from the full computation. Evaluated on GPT-2 124M, GPT-J 6B, and OPT 6.7B on an AMD MI300X, the method delivers 3.0x to 10.5x speedups on linear-layer weight reads, with perplexity ratios below 1.00 and over 98% top-1 token agreement, all without retraining, architectural changes, or attention approximation, at an operating point of k=256.
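The per-token mechanics are easiest to see in code. Below is a minimal sketch, assuming an orthonormal subspace basis `U` (d x k), a cached low-rank weight image `W_U = W @ U`, and a residual-norm gate with a placeholder threshold `tau`; these names and the gating criterion are illustrative, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 768, 3072, 256                 # hidden size, output size, subspace rank (k=256)

W = rng.standard_normal((m, d)) / np.sqrt(d)      # frozen linear-layer weight
U, _ = np.linalg.qr(rng.standard_normal((d, k)))  # orthonormal basis of the activation subspace
W_U = W @ U                              # cached low-rank weight image (m x k)

def gated_linear(x, tau=0.1):
    """One token through a gated linear layer (sketch)."""
    z = U.T @ x                          # coordinates of x in the subspace
    y = W_U @ z                          # cheap path: reads only the m*k cached image
    r = x - U @ z                        # residual component outside the subspace
    # Per-token gate: pay for the full m*d weight read only when the
    # residual is large relative to the token (threshold tau is a placeholder).
    if np.linalg.norm(r) > tau * np.linalg.norm(x):
        y = y + W @ r                    # residual correction restores exactness
    return y

x = rng.standard_normal(d)
print(np.allclose(gated_linear(x, tau=0.0), W @ x))  # gate always fires -> exact (True)
```

With the gate always firing the path is exact, since W x = W_U (Uᵀ x) + W (x − U Uᵀ x); the savings come from how rarely the gate needs to fire on activations that stay close to the subspace.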
Key facts
- Method exploits low effective rank of token activation manifolds.
- Decomposes activation vectors into subspace and residual components.
- Caches low-rank weight image for subspace to reduce memory bandwidth.
- Per-token gate controls residual correction computation.
- Validated on GPT-2 124M, GPT-J 6B, OPT 6.7B models.
- Tested on AMD MI300X hardware.
- Achieves 3.0x to 10.5x speedups on linear-layer weight reads (a bandwidth sketch follows this list).
- Perplexity ratios below 1.00 and top-1 token agreement above 98%.
- No retraining, architectural modification, or attention approximation required.
- Operating point uses k=256.
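To relate the headline range to the cached image size, here is a back-of-the-envelope weight-read model; it is my own illustration under stated assumptions, not the paper's measurement methodology. A dense linear layer reads m·d weights per token, while the gated path reads the m·k cached image for every token plus the full m·d matrix only for gated tokens, giving a speedup of d / (k + p·d) at gate firing rate p.

```python
# Illustrative weight-read model (assumptions, not the paper's measurements):
# dense path reads m*d weights per token; gated path reads the m*k cached
# image for every token plus the full m*d matrix for the gated fraction p.
def weight_read_speedup(d: int, k: int, gate_rate: float) -> float:
    return d / (k + gate_rate * d)   # m cancels out of the ratio

for d in (768, 4096):                # GPT-2 124M and GPT-J 6B hidden sizes
    for p in (0.00, 0.05, 0.25):     # hypothetical gate firing rates
        print(f"d={d:4d}  k=256  gate_rate={p:.2f}  ->  {weight_read_speedup(d, 256, p):.1f}x")
```

At k=256 this simple model caps the speedup at d/256: exactly 3.0x for GPT-2's d=768 and up to 16x for the d=4096 models, so the reported 3.0x to 10.5x range is consistent with low gate firing rates, though the paper's measured numbers need not follow this model.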
Entities
Institutions
- arXiv
- AMD