New Two-Dimensional Early Exit Strategy Accelerates LLM Inference for Classification Tasks
A new two-dimensional early exit technique for large language models combines layer-wise and sentence-wise exits to achieve significant computational efficiency. The strategy processes input incrementally, sentence by sentence, while progressively activating deeper layers, outperforming optimizations that target only one of the two dimensions. Evaluation on four state-of-the-art LLMs ranging from 3B to 8B parameters, Llama 3.1, Llama 3.2, Gemma, and Qwen, showed speed-ups of 1.4x to 2.3x over the optimal layer-wise early exit across three sentiment classification datasets. Performance degrades gracefully on more complex multi-class tasks, and while fine-tuning reduces the method's benefits, it does not eliminate them. The approach is model-agnostic, requires only lightweight classification adapters, and is complementary to other efficiency techniques such as quantization. The work is described in an arXiv preprint, identified as 2604.18592v1.
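The coordination of the two exit dimensions can be sketched as a nested early-exit loop. In this minimal, hypothetical sketch, `confidence` stands in for a lightweight classification adapter read out at an intermediate layer; the function, its toy scoring rule, and the exit threshold are illustrative assumptions, not the paper's implementation.

```python
def confidence(sentence_idx, layer_idx):
    # Toy stand-in for an adapter's prediction confidence: in this sketch,
    # confidence grows as more sentences are read and deeper layers are run.
    return min(1.0, 0.2 * (sentence_idx + 1) + 0.15 * (layer_idx + 1))

def two_dim_early_exit(num_sentences, num_layers, threshold=0.9):
    """Process input sentence by sentence; for each sentence, activate layers
    progressively and exit as soon as the adapter is confident enough.
    Returns (sentences read, exit layer, total layer passes as a cost proxy)."""
    layers_used = 0
    for s in range(num_sentences):
        for l in range(num_layers):
            layers_used += 1  # one layer pass = one unit of compute
            if confidence(s, l) >= threshold:
                return s + 1, l + 1, layers_used
    # No confident exit: the full input and depth were consumed.
    return num_sentences, num_layers, layers_used
```

Because the loop can stop early along either axis, the saved compute compounds: skipping later sentences and skipping deeper layers multiply, which is the source of the multiplicative savings claimed over single-dimension early exit.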
Key facts
- A two-dimensional early exit strategy coordinates layer-wise and sentence-wise exiting for LLM classification tasks.
- The method processes input incrementally sentence-by-sentence while progressively activating deeper layers.
- It achieves multiplicative computational savings exceeding those from optimizing either dimension independently.
- Experimental evaluation involved four state-of-the-art LLMs: Llama 3.1, Llama 3.2, Gemma, and Qwen.
- Models ranged from 3B to 8B parameters.
- Testing was conducted on three sentiment classification datasets.
- Speed-ups of 1.4x to 2.3x were observed over the optimal layer-wise early exit on simpler tasks.
- The approach is model-agnostic and requires only lightweight classification adapters.
Entities
Institutions
- arXiv