TAPER: Regulating Branch Parallelism in LLM Serving
A recent paper on arXiv (2605.06914) presents TAPER, a per-step admission controller for LLM serving systems that regulates branch parallelism. Existing systems either admit every independent decoding branch eagerly, which inflates the latency of the shared decode step and penalizes co-batched requests, or impose fixed caps that forgo throughput. TAPER treats additional branches as opportunistic work and admits them only when their predicted externality, meaning the cost they impose on the rest of the batch, fits within the batch's current slack budget. Because the safe width depends on batch composition, context lengths, and accumulated slack, the decision is made at every step, and scheduling at the branch level decouples compute from admission.
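The admission rule described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the linear step-latency model, its constants, the choice of the tightest request's slack as the batch budget, the cheapest-first ordering, and all names (`Request`, `predicted_step_ms`, `admit_branches`) are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Request:
    context_len: int           # tokens this request's sequences attend over
    slack_ms: float            # accumulated latency slack (hypothetical unit: ms)
    pending_branches: int = 0  # extra independent branches waiting for admission


def predicted_step_ms(context_lens, base_ms=5.0, per_token_ms=0.001):
    """Hypothetical cost model: fixed overhead plus a cost linear in context tokens."""
    return base_ms + per_token_ms * sum(context_lens)


def admit_branches(batch):
    """Decide, for one decode step, how many extra branches each request may run.

    Returns a dict mapping request index to the number of admitted branches.
    """
    if not batch:
        return {}
    # Assumption: the tightest co-batched request bounds the extra latency
    # that the whole step can absorb.
    budget = min(r.slack_ms for r in batch)
    lens = [r.context_len for r in batch]  # one mandatory sequence per request
    baseline = predicted_step_ms(lens)
    admitted = {i: 0 for i in range(len(batch))}
    # Consider the cheapest candidate branches first.
    candidates = sorted(
        (r.context_len, i)
        for i, r in enumerate(batch)
        for _ in range(r.pending_branches)
    )
    for ctx, i in candidates:
        # Externality: predicted step-latency increase over the branch-free step.
        externality = predicted_step_ms(lens + [ctx]) - baseline
        if externality > budget:
            break  # remaining candidates cost at least as much
        lens.append(ctx)
        admitted[i] += 1
    return admitted


batch = [
    Request(context_len=1000, slack_ms=3.0, pending_branches=2),
    Request(context_len=4000, slack_ms=10.0, pending_branches=1),
]
result = admit_branches(batch)
```

In this sketch the answer changes from step to step as slack and context lengths change, which is why a fixed cap cannot capture it: the same branch that is admitted when the batch has slack is deferred when any co-batched request is running tight.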
Key facts
- Paper on arXiv: 2605.06914
- Announce type: cross
- TAPER is a per-step admission controller
- Addresses branch externality in LLM serving
- Eager admission inflates shared decode step latency
- Fixed caps forgo throughput
- Safe branch width depends on batch composition, context lengths, and accumulated slack
- Branch-level scheduling decouples compute from admission
Entities
Institutions
- arXiv