SOMA: Efficient Multi-turn LLM Serving via Small Language Model
A new framework called SOMA optimizes multi-turn LLM serving by routing later dialogue turns to a small surrogate model. SOMA learns soft prompts that maximize the semantic divergence between the large and small models' outputs, applies anti-degeneration control, and distills knowledge from the large model into the surrogate, maintaining response quality while reducing latency, memory, and API costs. The approach is detailed in arXiv:2605.11317.
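The core serving idea, handing later dialogue turns to a small surrogate, can be sketched as a simple turn-indexed router. This is a hypothetical illustration, not SOMA's actual API: the model callables, the `LARGE_TURN_BUDGET` threshold, and the history format are all assumptions made for the sketch.

```python
# Hypothetical sketch of SOMA-style turn routing: the large model serves
# early turns, and a small surrogate model serves later turns. Models are
# stubbed as plain callables; names are illustrative.

LARGE_TURN_BUDGET = 2  # assumed cutoff: turns served by the large model

def serve_turn(history, user_msg, large_model, small_model):
    """Route one dialogue turn to the large or small model by turn index."""
    turn_index = len(history) // 2  # each turn appends a (user, assistant) pair
    model = large_model if turn_index < LARGE_TURN_BUDGET else small_model
    reply = model(history, user_msg)
    history.extend([("user", user_msg), ("assistant", reply)])
    return reply

# Stub models standing in for real LLM endpoints
large = lambda history, msg: f"[large] {msg}"
small = lambda history, msg: f"[small] {msg}"

history = []
print(serve_turn(history, "hi", large, small))     # served by the large model
print(serve_turn(history, "more", large, small))   # served by the large model
print(serve_turn(history, "again", large, small))  # served by the small model
```

In a real deployment the routing decision would presumably also consult the learned divergence signal rather than a fixed turn count, but the fixed budget keeps the sketch self-contained.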
Key facts
- arXiv:2605.11317
- SOMA framework
- multi-turn LLM serving
- small surrogate model
- soft prompts
- semantic divergence
- anti-degeneration control
- distillation
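The distillation step listed above can be illustrated with a minimal sketch: assuming SOMA distills the large model's next-token distribution into the small surrogate by minimizing a KL-divergence loss, a pure-Python stand-in (logits, `softmax`, and `kl_divergence` are all illustrative, not SOMA's implementation) looks like this.

```python
# Illustrative distillation objective: the student (small model) is penalized
# for deviating from the teacher's (large model's) output distribution.
# Pure Python for self-containment; real systems would use a DL framework.
import math

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q): distillation loss between teacher p and student q."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

teacher = softmax([2.0, 1.0, 0.1])  # hypothetical large-model logits
student = softmax([1.8, 1.1, 0.2])  # hypothetical small-model logits
print(kl_divergence(teacher, student))  # small non-negative loss
```

Minimizing this loss over dialogue data pulls the surrogate's responses toward the large model's, which is what lets later turns be served cheaply without a sharp quality drop.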