Qrita: Efficient Top-k and Top-p Algorithm for LLM Sampling
Researchers propose Qrita, an algorithm for efficient Top-k and Top-p sampling in large language models. Existing methods rely either on sorting, which incurs high GPU computation and memory overhead, or on stochastic approaches, which alter the algorithm's output. Qrita instead uses pivot-based truncation and selection, built on two key techniques: Gaussian-based sigma-truncation, which reduces the vocabulary search space, and quaternary pivot search with duplication handling, which halves the number of pivot search iterations and guarantees deterministic output. Implemented in Triton, Qrita is reported to outperform the Top-k and Top-p kernels from SGLang and FlashInfer, which are used in high-performance LLM execution engines. The work targets the cost of Top-k and Top-p truncation over large vocabularies.
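For context, the sorting-based baseline that the summary contrasts Qrita with can be sketched in a few lines of plain Python. This is an illustrative reference, not code from the paper or from any engine: the function names are hypothetical, and it assumes Top-k is applied before Top-p and that the Top-p mass is measured on the original, un-renormalized probabilities (conventions differ between implementations).

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_top_p_sorted(logits, k, p):
    """Return the sorted token indices kept by Top-k followed by Top-p.

    Assumption: Top-p keeps the smallest prefix of the Top-k tokens whose
    cumulative probability reaches p, without renormalizing after Top-k.
    """
    probs = softmax(logits)
    # Full sort over the vocabulary: the step that pivot-based methods avoid.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    order = order[:k]                  # Top-k: the k most likely tokens
    kept, cumulative = [], 0.0
    for i in order:                    # Top-p: stop once enough mass is covered
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return sorted(kept)
```

For example, `top_k_top_p_sorted([2.0, 1.0, 0.5, -1.0, 3.0], 3, 0.8)` keeps tokens 0 and 4. The sort touches every vocabulary entry even though only a handful of tokens survive, which is the overhead a pivot-based approach tries to remove.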
Key facts
- Qrita is a Top-k and Top-p algorithm based on pivot-based truncation and selection.
- It uses Gaussian-based sigma-truncation to reduce vocabulary search space.
- Quaternary pivot search with duplication handling halves pivot search iterations.
- Qrita guarantees deterministic output.
- Implementation uses Triton.
- Evaluated against Top-k and Top-p kernels from SGLang and FlashInfer.
- Reported to outperform those kernels.
- Addresses GPU computation and memory overhead of sorting-based methods.
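The two techniques listed above can be illustrated for the Top-k case with a small CPU sketch. It is not the paper's Triton kernel, and several details are assumptions made for illustration: the cutoff `mean + z * std` with a default `z = 1.0`, the fallback to the full vocabulary when too few candidates survive, and the rule that ties at the k-th value are broken by lowest token index. The function names are hypothetical. It needs Python 3.9 or later for `math.nextafter`.

```python
import math
import statistics

def sigma_truncate(logits, k, z=1.0):
    """Indices of logits >= mean + z*std, or all indices if fewer than k survive.

    Assumption: treating logits as roughly Gaussian is only a heuristic,
    so the result is checked and the full range is used as a fallback.
    """
    mu = statistics.fmean(logits)
    sigma = statistics.pstdev(logits, mu)
    cutoff = mu + z * sigma
    kept = [i for i, x in enumerate(logits) if x >= cutoff]
    return kept if len(kept) >= k else list(range(len(logits)))

def top_k_pivot(logits, k):
    """Sorted indices of the k largest logits, found without sorting the values.

    Ties at the k-th largest value go to the lowest index (an assumed rule),
    which makes the result deterministic.
    """
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and len(logits)")
    cand = sigma_truncate(logits, k)
    vals = [logits[i] for i in cand]
    # Invariant: count(x >= lo) >= k and count(x > hi) < k,
    # so the k-th largest value always lies in [lo, hi].
    lo, hi = min(vals), max(vals)
    while True:
        window = [x for x in vals if lo <= x <= hi]
        lo, hi = min(window), max(window)   # snap bounds to actual data values
        if lo == hi:                        # one distinct value left: the k-th largest
            break
        # Quaternary step: three pivots split [lo, hi] into four parts,
        # and all three counts are gathered in a single pass over the data.
        pivots = [lo + (hi - lo) * j / 4 for j in (1, 2, 3)]
        counts = [0, 0, 0]
        for x in vals:
            for j, pivot in enumerate(pivots):
                counts[j] += x >= pivot
        for pivot, count in zip(pivots, counts):
            if count >= k:
                lo = pivot                             # k-th largest is at or above it
            else:
                hi = math.nextafter(pivot, -math.inf)  # k-th largest is strictly below it
                break
    kth = lo
    above = [i for i in cand if logits[i] > kth]
    ties = [i for i in cand if logits[i] == kth]       # duplicates of the k-th value
    return sorted(above + ties[: k - len(above)])
```

With `logits = [0.1, 2.0, -1.0, 2.0, 0.5, 2.0, 3.0, -0.3]` and `k = 3`, the value 2.0 appears three times but only two copies fit, so the sketch returns `[1, 3, 6]`. Each pass narrows the search interval to about a quarter of its width, compared with a half for a binary pivot search, which is one way to read the "halves pivot search iterations" claim. A Top-p variant would presumably compare the probability mass above each pivot against p rather than a count; that is not shown here.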
Entities
Software
- SGLang
- FlashInfer
- Triton