AHASD: Asynchronous Heterogeneous Architecture for LLM Adaptive Drafting Speculative Decoding on Mobile Devices
A recent paper presents AHASD, a task-level asynchronous NPU-PIM heterogeneous architecture for speculative decoding of large language models (LLMs) on mobile devices. In speculative decoding, a small draft language model (DLM) generates candidate tokens, which a larger target language model (TLM) then verifies in batches. With adaptive drafting, mobile single-NPU-PIM systems incur idle overhead under synchronous execution and wasted computation under asynchronous execution, largely because draft lengths fluctuate. AHASD addresses this by decoupling the DLM and TLM at the task level, so that drafting runs on the PIM in parallel with verification on the single NPU. Two mechanisms, Entropy-History-Aware Drafting Control and Time-Aware Pre-Verification Control, dynamically manage when the adaptive drafting algorithm executes and when pre-verification is triggered, with the aim of avoiding invalid drafts that stem from low-confidence predictions. The paper is available on arXiv under ID 2604.25326.
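The draft-then-verify loop described above can be illustrated with a minimal sketch. It is a generic, greedy form of speculative decoding and is not taken from the AHASD paper. The names `speculative_step`, `toy_target`, and `toy_draft` are hypothetical, and the two "models" are toy functions standing in for a real DLM and TLM.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_model: Model,
                     target_model: Model, draft_len: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens to append."""
    # Draft phase: the small model proposes draft_len tokens autoregressively.
    drafts: List[int] = []
    ctx = list(prefix)
    for _ in range(draft_len):
        tok = draft_model(ctx)
        drafts.append(tok)
        ctx.append(tok)

    # Verify phase: the target model checks every drafted position.
    # Real systems do this in one batched forward pass; a loop is used
    # here only to keep the sketch readable.
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in drafts:
        expected = target_model(ctx)
        if expected != tok:
            # First mismatch: keep the target's token, discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # All drafts accepted: the target's next token comes for free.
    accepted.append(target_model(ctx))
    return accepted


def toy_target(ctx: List[int]) -> int:
    # Hypothetical target model: counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Hypothetical draft model: agrees with the target except after token 3.
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step([0], toy_draft, toy_target, draft_len=4))
    print(speculative_step([4], toy_draft, toy_target, draft_len=2))
```

In the first call the fourth drafted token is rejected, so the work spent drafting it is wasted. That is the kind of loss that grows when draft lengths fluctuate and drafting runs ahead of verification.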
Key facts
- AHASD is a task-level asynchronous mobile NPU-PIM heterogeneous architecture for speculative decoding.
- Speculative decoding uses a small DLM to generate drafts and a large TLM to verify them in batches.
- Adaptive drafting on mobile single-NPU-PIM systems faces idle overhead in synchronous execution.
- Asynchronous execution suffers from wasted computation due to fluctuations in draft length.
- AHASD achieves parallel drafting on PIM and verification on a single NPU via task-level DLM-TLM decoupling.
- Entropy-History-Aware Drafting Control dynamically manages adaptive drafting algorithm execution.
- Time-Aware Pre-Verification Control manages pre-verification timing.
- The paper is available on arXiv under ID 2604.25326.
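To make the idea of entropy-based drafting control more concrete, the sketch below shows one plausible way a drafter could decide whether to keep drafting. The summary does not specify the paper's actual rule, so everything here is an assumption: the class name `EntropyHistoryGate`, the sliding-window average, and the `window` and `threshold` parameters are all hypothetical.

```python
import math
from collections import deque
from typing import Deque, Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in bits of a next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


class EntropyHistoryGate:
    """Hypothetical gate: stop drafting when recent entropy is too high.

    High entropy means the draft model is unsure of its next token, so
    further drafts are more likely to be rejected during verification.
    """

    def __init__(self, window: int = 3, threshold: float = 1.0) -> None:
        # Assumed parameters, not values from the paper.
        self.history: Deque[float] = deque(maxlen=window)
        self.threshold = threshold

    def should_continue(self, probs: Sequence[float]) -> bool:
        # Record the newest entropy, then compare the windowed mean
        # against the threshold.
        self.history.append(shannon_entropy(probs))
        mean_entropy = sum(self.history) / len(self.history)
        return mean_entropy <= self.threshold


if __name__ == "__main__":
    gate = EntropyHistoryGate(window=2, threshold=1.0)
    confident = [1.0]          # entropy 0 bits
    uncertain = [0.25] * 4     # entropy 2 bits
    print(gate.should_continue(confident))
    print(gate.should_continue(uncertain))
    print(gate.should_continue(uncertain))
```

Averaging over a short history rather than reacting to a single token is one way to avoid stopping on an isolated uncertain prediction. Whether AHASD uses a mean, a trend, or another statistic is not stated in this summary.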
Entities
Institutions
- arXiv