LEAP: New Training Method Delivers a 1.61x Transformer Inference Speedup
Researchers have identified a significant incompatibility between layer-aligned distillation and convergence-based early exit mechanisms in transformer models. Distillation objectives that align a student's intermediate layers with teacher representations suppress the layer-to-layer representational convergence that early exits rely on, rendering them ineffective. To address this, the team introduces LEAP (Layer-wise Exit-Aware Pretraining), an auxiliary training objective that requires no architectural changes. LEAP augments standard distillation with a constraint that pulls intermediate-layer representations toward the final-layer representation. Applied to MiniLM, LEAP yields a 1.61x wall-clock speedup at batch size 1 on an NVIDIA L4 GPU with an exit threshold of θ=0.95, with 91.9% of samples exiting by layer 7. The paper is available on arXiv under identifier 2605.01058.
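The summary does not state LEAP's exact loss formula. The following is a minimal sketch of what an exit-aware alignment term could look like, assuming each intermediate layer is pulled toward a detached copy of the final-layer representation via cosine similarity; the function name, loss form, and weighting are assumptions, not the paper's definition.

```python
# Hypothetical LEAP-style auxiliary alignment loss (sketch, not the paper's objective).
import torch
import torch.nn.functional as F

def leap_alignment_loss(hidden_states: list[torch.Tensor]) -> torch.Tensor:
    """Pull each intermediate layer's representation toward the final layer's.

    hidden_states: per-layer outputs, each of shape (batch, seq_len, hidden_dim).
    The final layer is detached so it serves as a fixed alignment target.
    """
    target = hidden_states[-1].detach()
    loss = 0.0
    for h in hidden_states[:-1]:
        # 1 - cosine similarity, averaged over tokens and batch
        loss = loss + (1.0 - F.cosine_similarity(h, target, dim=-1)).mean()
    return loss / (len(hidden_states) - 1)

# Combined with the standard distillation objective (the 0.1 weight is an assumption):
# total_loss = distillation_loss + 0.1 * leap_alignment_loss(student_hidden_states)
```

Detaching the final-layer target keeps the auxiliary term from dragging the last layer toward earlier, less refined representations; whether the paper uses this exact construction is not stated in the summary.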
Key facts
- LEAP reconciles the incompatibility between distillation and early exit.
- No architectural modifications required.
- LEAP-MiniLM achieves a 1.61x speedup on an NVIDIA L4 GPU.
- 91.9% of samples exit by layer 7 at θ=0.95 (see the inference sketch after this list).
- Paper available on arXiv: 2605.01058.
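The stopping rule behind the θ=0.95 threshold is not spelled out in this summary. Below is a plausible sketch of convergence-based early exit at inference time, assuming the exit test compares consecutive layer outputs by mean token-level cosine similarity; the criterion and function names are assumptions.

```python
# Hypothetical convergence-based early-exit loop; the actual exit criterion
# used by LEAP-MiniLM is not specified in this summary.
import torch
import torch.nn.functional as F

@torch.no_grad()
def forward_with_early_exit(layers, x: torch.Tensor, theta: float = 0.95):
    """Run transformer blocks sequentially and stop once the representation
    has converged, i.e. consecutive layer outputs are nearly identical.

    layers: sequence of callable transformer blocks.
    x: input hidden states of shape (batch, seq_len, hidden_dim).
    Returns (hidden_state, index_of_exit_layer).
    """
    prev = x
    for i, layer in enumerate(layers):
        h = layer(prev)
        # Mean token-level cosine similarity to the previous layer's output.
        sim = F.cosine_similarity(h, prev, dim=-1).mean()
        if sim >= theta:
            return h, i  # converged: skip the remaining layers
        prev = h
    return prev, len(layers) - 1
```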
Entities
Institutions
- arXiv
- NVIDIA