ARTFEED — Contemporary Art Intelligence

SOMA: Efficient Multi-turn LLM Serving via Small Language Model

ai-technology · 2026-05-13

A new framework, SOMA, cuts the cost of multi-turn LLM serving by handing later dialogue turns to a small surrogate model. It learns soft prompts that maximize semantic divergence between the large and small models, applies anti-degeneration control to keep the surrogate's outputs stable, and uses knowledge distillation to preserve response quality while reducing latency, memory use, and API costs. The approach is detailed in arXiv:2605.11317.
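The serving pattern described above can be sketched as a simple router: the full-size model answers early turns, and the small surrogate takes over afterward. This is a minimal illustration under stated assumptions; the function names, the `switch_after` threshold, and the stub models are hypothetical and not details from the paper.

```python
# Hypothetical sketch of surrogate-based turn routing (not the paper's code).
# Early turns go to the large model; later turns go to a cheap surrogate.

def serve_turn(turn_index, history, large_model, small_model, switch_after=1):
    """Route one dialogue turn to the large or the small model."""
    if turn_index < switch_after:
        return large_model(history)   # early turn: full-size LLM
    return small_model(history)       # later turn: small surrogate

# Toy usage with stub callables standing in for real model calls.
large = lambda h: f"[large] reply to {len(h)} msgs"
small = lambda h: f"[small] reply to {len(h)} msgs"

history = ["hi"]
first = serve_turn(0, history, large, small)              # large model
second = serve_turn(1, history + [first], large, small)   # surrogate
```

In a real deployment the routing decision would also consult quality signals (the anti-degeneration control mentioned above), not just the turn index.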

Key facts

  • arXiv:2605.11317
  • SOMA framework
  • multi-turn LLM serving
  • small surrogate model
  • soft prompts
  • semantic divergence
  • anti-degeneration control
  • distillation
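The distillation step listed above trains the surrogate to imitate the large model. The paper's exact objective isn't given here; a standard choice, shown as an assumption, is to minimize the KL divergence between the teacher's and student's output distributions:

```python
import math

def kl_divergence(p, q):
    """KL(p || q): how far the student distribution q drifts from teacher p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions (illustrative values, not from the paper).
teacher = [0.7, 0.2, 0.1]   # large model
student = [0.6, 0.3, 0.1]   # surrogate before distillation

loss = kl_divergence(teacher, student)  # distillation minimizes this
```

Driving this loss toward zero over many dialogue contexts is what lets the surrogate stand in for the large model on later turns without a visible drop in quality.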

Entities

Institutions

  • arXiv
