SOMA: Efficient Multi-turn LLM Serving via Small Language Model
A new framework called SOMA optimizes multi-turn LLM serving by routing later dialogue turns to a small surrogate model. SOMA learns soft prompts that maximize the semantic divergence between the large and small models' outputs, applies anti-degeneration control, and distills knowledge from the large model into the surrogate, maintaining response quality while reducing latency, memory, and API costs. The approach is detailed in arXiv:2605.11317.
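The core serving idea, handing later dialogue turns to a small surrogate, can be sketched as a simple turn-indexed router. This is a hypothetical illustration, not SOMA's actual API: the model callables, the `LARGE_TURN_BUDGET` threshold, and the history format are all assumptions made for the sketch.

```python
# Hypothetical sketch of SOMA-style turn routing: the large model serves
# early turns, and a small surrogate model serves later turns. Models are
# stubbed as plain callables; names are illustrative.

LARGE_TURN_BUDGET = 2  # assumed cutoff: turns served by the large model

def serve_turn(history, user_msg, large_model, small_model):
    """Route one dialogue turn to the large or small model by turn index."""
    turn_index = len(history) // 2  # each turn appends a (user, assistant) pair
    model = large_model if turn_index < LARGE_TURN_BUDGET else small_model
    reply = model(history, user_msg)
    history.extend([("user", user_msg), ("assistant", reply)])
    return reply

# Stub models standing in for real LLM endpoints
large = lambda history, msg: f"[large] {msg}"
small = lambda history, msg: f"[small] {msg}"

history = []
print(serve_turn(history, "hi", large, small))     # served by the large model
print(serve_turn(history, "more", large, small))   # served by the large model
print(serve_turn(history, "again", large, small))  # served by the small model
```

In a real deployment the routing decision would presumably also consult the learned divergence signal rather than a fixed turn count, but the fixed budget keeps the sketch self-contained.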
Key facts
- arXiv:2605.11317
- SOMA framework
- multi-turn LLM serving
- small surrogate model
- soft prompts
- semantic divergence
- anti-degeneration control
- distillation
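The distillation step listed above can be illustrated with a minimal sketch: assuming SOMA distills the large model's next-token distribution into the small surrogate by minimizing a KL-divergence loss, a pure-Python stand-in (logits, `softmax`, and `kl_divergence` are all illustrative, not SOMA's implementation) looks like this.

```python
# Illustrative distillation objective: the student (small model) is penalized
# for deviating from the teacher's (large model's) output distribution.
# Pure Python for self-containment; real systems would use a DL framework.
import math

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q): distillation loss between teacher p and student q."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

teacher = softmax([2.0, 1.0, 0.1])  # hypothetical large-model logits
student = softmax([1.8, 1.1, 0.2])  # hypothetical small-model logits
print(kl_divergence(teacher, student))  # small non-negative loss
```

Minimizing this loss over dialogue data pulls the surrogate's responses toward the large model's, which is what lets later turns be served cheaply without a sharp quality drop.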