Samsung Galaxy devices to run multilingual LLMs with dynamic LoRA switching and multi-stream decoding
Deploying large language models on smartphones is constrained by tight limits on memory, latency, and runtime adaptability. A hardware-aware framework (arXiv:2604.18655v1) addresses these constraints, enabling efficient on-device inference of a LLaMA-based multilingual foundation model on Samsung Galaxy S24 and S25 phones, powered by the Qualcomm SM8650 and SM8750 chipsets respectively. The system feeds application-specific LoRA adapters as runtime inputs to a single frozen inference graph, so tasks can be switched on the fly without recompilation or extra memory overhead. A multi-stream decoding scheme generates several stylistic variants of a response (for example formal, polite, or cheerful) concurrently in one forward pass, cutting latency by up to 6x. Token generation is further accelerated by Dynamic Self-Speculative Decoding (DS2D), a tree-based lookahead method that drafts future tokens.
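The idea of adapters as runtime inputs can be sketched in a few lines: the base weight stays frozen inside one graph, while the low-rank LoRA matrices arrive as ordinary tensors at call time, so switching tasks means swapping inputs rather than rebuilding anything. This is a minimal NumPy illustration under assumed dimensions, not the paper's implementation; the `lora_linear` helper and all shapes are hypothetical.

```python
import numpy as np

def lora_linear(x, W, A, B, scale=1.0):
    """One linear layer with a LoRA update supplied as an input.

    W is the frozen base weight baked into the graph; A (r x d_in) and
    B (d_out x r) are runtime tensors, so a task switch is just a
    different (A, B) pair fed to the same compiled function.
    """
    return x @ W.T + scale * (x @ A.T) @ B.T

# Hypothetical sizes for illustration only.
d_in, d_out, r = 64, 64, 4
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))   # frozen base weight
x = rng.standard_normal((1, d_in))       # one input activation

# Two task adapters: same frozen graph, different runtime inputs.
A_task1, B_task1 = rng.standard_normal((r, d_in)), rng.standard_normal((d_out, r))
A_task2, B_task2 = rng.standard_normal((r, d_in)), rng.standard_normal((d_out, r))

y1 = lora_linear(x, W, A_task1, B_task1)
y2 = lora_linear(x, W, A_task2, B_task2)
```

With zero adapters the layer reduces exactly to the frozen base projection, which is why loading a new task costs only the (small) adapter tensors.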
Key facts
- A hardware-aware framework enables on-device LLM inference on Samsung Galaxy S24 and S25
- The framework supports a LLaMA-based multilingual foundation model
- Application-specific LoRAs are integrated as runtime inputs to a single frozen inference graph
- Dynamic task switching is enabled without recompilation or memory overhead
- Multi-stream decoding generates stylistic variations concurrently in a single forward pass
- Latency is reduced by up to 6x
- Dynamic Self-Speculative Decoding (DS2D) accelerates token generation
- The work addresses challenges of memory, latency, and runtime flexibility on smartphones
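For intuition on the DS2D entry above: DS2D itself is a tree-based self-speculative method, but it builds on the classic linear draft-and-verify loop of speculative decoding, sketched below with toy stand-in models. Everything here is hypothetical (`target_next` and `draft_next` are deterministic toys, not real LLM heads); the point is only that verified draft tokens let one target pass emit several tokens while producing exactly the target model's greedy output.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50

def target_next(context):
    """Stand-in for the full model: deterministic greedy next token."""
    return (sum(context) * 31 + 7) % VOCAB

def draft_next(context):
    """Cheaper draft predictor that agrees with the target most of the time."""
    t = target_next(context)
    return t if rng.random() < 0.8 else (t + 1) % VOCAB

def speculative_step(context, k=4):
    """Draft k tokens cheaply, then verify them against the target.

    The longest prefix the target agrees with is accepted; the first
    mismatch is replaced by the target's token. Each step therefore
    emits between 1 and k+1 tokens, never changing the final output.
    """
    ctx = list(context)
    draft = []
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    accepted = []
    ctx = list(context)
    for t in draft:
        expected = target_next(ctx)
        if t == expected:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(expected)  # correct the mismatch and stop
            break
    else:
        accepted.append(target_next(ctx))  # bonus token: all drafts accepted
    return accepted
```

Because every accepted token is checked against the target, the decoded sequence is identical to plain greedy decoding; the speedup comes from amortizing target passes over multiple draft tokens.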
Entities
Institutions
- Samsung
- Qualcomm