ARTFEED — Contemporary Art Intelligence

MobileLLM-Flash: Latency-Guided On-Device LLM Design

other · 2026-04-30

A novel approach has been unveiled for designing on-device large language models (OD-LLMs) for real-time AI applications on resource-constrained hardware. The technique applies hardware-in-the-loop architecture search under mobile latency constraints, so the resulting models deploy without custom kernels and run on standard mobile runtimes such as ExecuTorch. Rather than introducing specialized attention mechanisms, it accelerates long-context processing through attention skipping. The search jointly optimizes the model architecture (layer count and dimensions) and the attention pattern, treating each candidate as a pruned version of a pretrained backbone that inherits its weights, which keeps candidate evaluation cheap. The framework targets industry-scale deployment, maximizing user reach through broad hardware compatibility and near-real-time responses.
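The search loop described above can be sketched in a few lines. This is a minimal illustration, not the published method: `measure_latency_ms` and `eval_quality` are hypothetical stand-ins for on-device profiling and for scoring a candidate as a weight-sharing sub-network of the pretrained backbone, and the search space and latency budget are invented for the example.

```python
import random

# Hypothetical search space: each candidate is (num_layers, hidden_dim,
# skip_ratio), where skip_ratio is the fraction of layers whose attention
# block is skipped. Values are illustrative, not from the paper.
SEARCH_SPACE = {
    "num_layers": [12, 16, 20, 24],
    "hidden_dim": [512, 768, 1024],
    "skip_ratio": [0.0, 0.25, 0.5],
}

LATENCY_BUDGET_MS = 15.0  # assumed mobile latency constraint


def measure_latency_ms(cand):
    # Stand-in for hardware-in-the-loop profiling on the target device.
    # Here a toy linear cost model: attention layers dominate the cost.
    attn_layers = cand["num_layers"] * (1 - cand["skip_ratio"])
    return 0.8 * attn_layers + 0.001 * cand["hidden_dim"] * cand["num_layers"] / 24


def eval_quality(cand):
    # Stand-in for evaluating the candidate as a pruned version of the
    # pretrained backbone with inherited weights (larger scores better here).
    return cand["num_layers"] * cand["hidden_dim"] * (1 - 0.3 * cand["skip_ratio"])


def search(n_samples=200, seed=0):
    """Random search: reject candidates over budget, keep the best survivor."""
    rng = random.Random(seed)
    best, best_q = None, float("-inf")
    for _ in range(n_samples):
        cand = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
        if measure_latency_ms(cand) > LATENCY_BUDGET_MS:
            continue  # hard latency constraint: discard, don't penalize
        q = eval_quality(cand)
        if q > best_q:
            best, best_q = cand, q
    return best


best = search()
print(best)
```

Because every candidate reuses the backbone's weights, no retraining is needed inside the loop; the latency check acts as a hard filter rather than a soft penalty, which is what ties the search to real device constraints.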

Key facts

  • Methodology uses hardware-in-the-loop architecture search under mobile latency constraints.
  • Models are deployable without custom kernels and compatible with ExecuTorch.
  • Avoids specialized attention mechanisms; uses attention skipping for long-context acceleration.
  • Jointly optimizes model architecture (layers, dimensions) and attention pattern.
  • Each candidate is treated as a pruned version of a pretrained backbone with inherited weights.
  • Designed for industry-scale deployment on resource-constrained hardware.
  • Aims to maximize user reach through broad hardware compatibility.
  • Targets near-real-time responses for interactive AI experiences.
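To make the attention-skipping idea concrete, the toy forward pass below runs a stack of transformer-style blocks in which some blocks omit their (quadratic-cost) attention sublayer and keep only the (linear-cost) feed-forward sublayer. The skip pattern, dimensions, and weights are invented for illustration; a real searched model would choose the pattern jointly with the architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # toy hidden dimension


def attention(x):
    # Toy single-head self-attention: quadratic in sequence length.
    scores = x @ x.T / np.sqrt(D)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x


def ffn(x, w1, w2):
    # Two-layer feed-forward network: linear in sequence length.
    return np.maximum(x @ w1, 0.0) @ w2


class Block:
    def __init__(self, skip_attention):
        self.skip_attention = skip_attention
        self.w1 = rng.standard_normal((D, 4 * D)) * 0.02
        self.w2 = rng.standard_normal((4 * D, D)) * 0.02

    def __call__(self, x):
        if not self.skip_attention:
            x = x + attention(x)  # only non-skipped blocks pay quadratic cost
        return x + ffn(x, self.w1, self.w2)


# Hypothetical searched pattern: True means the block skips attention.
pattern = [False, True, False, True]
blocks = [Block(skip) for skip in pattern]

x = rng.standard_normal((128, D))  # 128-token toy sequence
for b in blocks:
    x = b(x)
print(x.shape)
```

For long contexts the skipped blocks cost O(n) instead of O(n^2) in sequence length, which is where the long-context speedup comes from, and the model still runs on a standard runtime because no custom attention kernel is involved.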
