SPECTRE Framework Boosts LLM Inference Efficiency via Hybrid Parallel Speculative Decoding
SPECTRE is a framework for improving resource efficiency in multi-model cloud LLM serving. It addresses long-tailed demand, where a few large models receive most requests while smaller tail models sit underutilized. SPECTRE reuses these idle tail-model services as remote drafters for the heavily loaded large models via speculative decoding, letting draft generation and target-side verification run in parallel. Three techniques make this work: hybrid ordinary-parallel speculative decoding governed by a throughput-derived threshold, speculative priority scheduling that preserves draft-target overlap under multi-tenant traffic, and draft-side prompt compression to reduce drafting latency. SPECTRE is implemented in SGLang.
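To make the draft-and-verify mechanism concrete, here is a minimal sketch of one speculative-decoding step. The toy "models" are plain next-token functions and the names (`speculative_step`, `draft_next`, `target_next`) are illustrative assumptions, not SPECTRE's API; the real system runs the drafter as a remote service and verifies drafts in a single batched target pass.

```python
# Minimal speculative-decoding sketch. Assumption: deterministic toy
# "models" exposed as next-token functions; not SPECTRE's actual interface.

def speculative_step(draft_next, target_next, prefix, k=4):
    """Draft k tokens with the cheap model, then let the target verify.

    Returns the tokens accepted this step: the longest prefix of the
    draft the target agrees with, plus one token from the target
    (either its correction or a bonus token if all k were accepted).
    """
    # Drafter proposes k tokens autoregressively.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    # Target verifies the draft (conceptually one parallel forward pass).
    accepted, ctx = [], list(prefix)
    for tok in draft:
        want = target_next(ctx)
        if want == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(want)  # target's correction ends the step
            break
    else:
        accepted.append(target_next(ctx))  # all k accepted: bonus token
    return accepted
```

When drafter and target agree, each step yields up to k+1 tokens for a single target pass; a disagreement still yields the target's own next token, so output quality matches target-only decoding.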
Key facts
- SPECTRE stands for Parallel SPECulative Decoding with a Multi-Tenant REmote Drafter.
- It targets multi-model cloud LLM serving platforms with long-tailed request distributions.
- It reuses underutilized tail-model services as remote drafters for large models.
- It enables parallel draft generation and target-side verification.
- Three techniques: hybrid ordinary-parallel speculative decoding, speculative priority scheduling, draft-side prompt compression.
- Implementation is in SGLang.
- The paper is arXiv:2605.08151v1.
- The approach is designed to improve resource efficiency.
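The second technique above, speculative priority scheduling, can be illustrated with a toy drafter-side queue in which draft requests tied to an in-flight verification preempt ordinary tail-model traffic, keeping drafting overlapped with target-side verification. The class name, priority values, and FIFO tie-break below are assumptions for illustration, not SPECTRE's implementation.

```python
# Hypothetical sketch of speculative priority scheduling on the
# tail-model (drafter) side: speculative draft requests are served
# before the drafter's own ordinary traffic.
import heapq
import itertools

class DrafterQueue:
    SPECULATIVE, ORDINARY = 0, 1  # lower value = higher priority

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # FIFO tie-break within a class

    def submit(self, request, speculative=False):
        prio = self.SPECULATIVE if speculative else self.ORDINARY
        heapq.heappush(self._heap, (prio, next(self._seq), request))

    def next_request(self):
        """Pop the highest-priority pending request, or None if empty."""
        return heapq.heappop(self._heap)[2] if self._heap else None
```

Under this policy the drafter never stalls a large model's verification pipeline behind its own batch of tail-model requests, which is the overlap property the key facts describe.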
Entities
Systems and platforms
- arXiv (preprint repository hosting the paper)
- SGLang (LLM serving framework used for the implementation)