Samsung Galaxy devices to run multilingual LLMs with dynamic LoRA switching and multi-stream decoding
Deploying large language models on smartphones is constrained by tight limits on memory, latency, and runtime adaptability. A hardware-aware framework (arXiv:2604.18655v1) addresses these constraints, enabling efficient on-device inference of a LLaMA-based multilingual foundation model on Samsung Galaxy S24 and S25 phones, powered by the Qualcomm SM8650 and SM8750 chipsets respectively. The system feeds application-specific LoRA adapters as runtime inputs to a single frozen inference graph, so tasks can be switched on the fly without recompilation or extra memory overhead. A multi-stream decoding scheme generates several stylistic variants of a response (for example formal, polite, or cheerful) concurrently in one forward pass, cutting latency by up to 6x. Token generation is further accelerated by Dynamic Self-Speculative Decoding (DS2D), a tree-based lookahead method that drafts future tokens.
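The idea of adapters as runtime inputs can be sketched in a few lines: the base weight stays frozen inside one graph, while the low-rank LoRA matrices arrive as ordinary tensors at call time, so switching tasks means swapping inputs rather than rebuilding anything. This is a minimal NumPy illustration under assumed dimensions, not the paper's implementation; the `lora_linear` helper and all shapes are hypothetical.

```python
import numpy as np

def lora_linear(x, W, A, B, scale=1.0):
    """One linear layer with a LoRA update supplied as an input.

    W is the frozen base weight baked into the graph; A (r x d_in) and
    B (d_out x r) are runtime tensors, so a task switch is just a
    different (A, B) pair fed to the same compiled function.
    """
    return x @ W.T + scale * (x @ A.T) @ B.T

# Hypothetical sizes for illustration only.
d_in, d_out, r = 64, 64, 4
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))   # frozen base weight
x = rng.standard_normal((1, d_in))       # one input activation

# Two task adapters: same frozen graph, different runtime inputs.
A_task1, B_task1 = rng.standard_normal((r, d_in)), rng.standard_normal((d_out, r))
A_task2, B_task2 = rng.standard_normal((r, d_in)), rng.standard_normal((d_out, r))

y1 = lora_linear(x, W, A_task1, B_task1)
y2 = lora_linear(x, W, A_task2, B_task2)
```

With zero adapters the layer reduces exactly to the frozen base projection, which is why loading a new task costs only the (small) adapter tensors.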
Key facts
- A hardware-aware framework enables on-device LLM inference on Samsung Galaxy S24 and S25
- The framework supports a LLaMA-based multilingual foundation model
- Application-specific LoRAs are integrated as runtime inputs to a single frozen inference graph
- Dynamic task switching is enabled without recompilation or memory overhead
- Multi-stream decoding generates stylistic variations concurrently in a single forward pass
- Latency is reduced by up to 6x
- Dynamic Self-Speculative Decoding (DS2D) accelerates token generation
- The work addresses challenges of memory, latency, and runtime flexibility on smartphones
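For intuition on the DS2D entry above: DS2D itself is a tree-based self-speculative method, but it builds on the classic linear draft-and-verify loop of speculative decoding, sketched below with toy stand-in models. Everything here is hypothetical (`target_next` and `draft_next` are deterministic toys, not real LLM heads); the point is only that verified draft tokens let one target pass emit several tokens while producing exactly the target model's greedy output.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50

def target_next(context):
    """Stand-in for the full model: deterministic greedy next token."""
    return (sum(context) * 31 + 7) % VOCAB

def draft_next(context):
    """Cheaper draft predictor that agrees with the target most of the time."""
    t = target_next(context)
    return t if rng.random() < 0.8 else (t + 1) % VOCAB

def speculative_step(context, k=4):
    """Draft k tokens cheaply, then verify them against the target.

    The longest prefix the target agrees with is accepted; the first
    mismatch is replaced by the target's token. Each step therefore
    emits between 1 and k+1 tokens, never changing the final output.
    """
    ctx = list(context)
    draft = []
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    accepted = []
    ctx = list(context)
    for t in draft:
        expected = target_next(ctx)
        if t == expected:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(expected)  # correct the mismatch and stop
            break
    else:
        accepted.append(target_next(ctx))  # bonus token: all drafts accepted
    return accepted
```

Because every accepted token is checked against the target, the decoded sequence is identical to plain greedy decoding; the speedup comes from amortizing target passes over multiple draft tokens.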
Entities
Institutions
- Samsung
- Qualcomm