TAPER: Regulating Branch Parallelism in LLM Serving

ai-technology · 2026-05-11

A recent study published on arXiv (2605.06914) presents TAPER, a per-step admission controller for LLM serving systems that regulates branch parallelism. Existing systems either admit all independent decoding branches eagerly, which inflates the latency of the shared decode step and slows co-batched requests, or impose fixed caps that forgo potential throughput. TAPER treats additional branches as opportunistic work and admits them only when the predicted branch externality fits within the batch's current slack budget. The approach works because branch-level scheduling decouples compute from admission decisions.

Key facts

Paper on arXiv: 2605.06914
Announce type: cross (cross-listed)
TAPER is a per-step admission controller
Addresses branch externality in LLM serving
Eager admission inflates shared decode step latency
Fixed caps forgo throughput
Safe width depends on batch composition, context lengths, accumulated slack
Branch-level scheduling decouples compute from admission
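
The admission rule summarized above can be illustrated with a small sketch. This is not the paper's implementation: the `Request` fields, the linear cost model in `predicted_step_ms`, the choice of the tightest request's slack as the batch budget, and the round-robin order are all assumptions made for illustration. It only shows the general shape of a per-step check in which an extra branch is admitted when its predicted added latency fits within the available slack.

```python
from dataclasses import dataclass


@dataclass
class Request:
    context_len: int     # tokens already cached for this request (hypothetical field)
    slack_ms: float      # accumulated slack against its latency target (hypothetical field)
    extra_branches: int  # independent branches it could decode in parallel


def predicted_step_ms(context_lens):
    """Toy cost model (an assumption): fixed overhead plus a per-token cost."""
    return 5.0 + 0.001 * sum(context_lens)


def admit_branches(batch):
    """Return how many extra branches each request may run in this decode step."""
    if not batch:
        return []
    # Every request always runs its primary branch.
    active = [r.context_len for r in batch]
    baseline_ms = predicted_step_ms(active)
    # Assumption: the step may slow down only as much as the tightest request can absorb.
    budget_ms = min(r.slack_ms for r in batch)
    admitted = [0] * len(batch)
    progress = True
    while progress:
        progress = False
        # Round-robin over requests, trying one more branch for each.
        for i, r in enumerate(batch):
            if admitted[i] >= r.extra_branches:
                continue
            # Externality: predicted slowdown of the shared step, relative to
            # the baseline, if this branch joins those already admitted.
            externality = predicted_step_ms(active + [r.context_len]) - baseline_ms
            if externality <= budget_ms:
                active.append(r.context_len)
                admitted[i] += 1
                progress = True
    return admitted


if __name__ == "__main__":
    batch = [Request(2000, 4.0, 3), Request(500, 10.0, 2)]
    print(admit_branches(batch))
```

Under this toy model the admitted width changes with context lengths and slack rather than following a fixed cap: a batch containing a request with zero slack admits no extra branches, while a batch with short contexts and ample slack admits more.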
