Hybrid JIT-CUDA Graph Optimization Reduces LLM Inference Latency by Up to 66%
Researchers propose a hybrid runtime framework that combines Just-In-Time (JIT) compilation with CUDA Graph execution to reduce inference latency in large language models (LLMs). The approach partitions transformer inference into static components, replayed via CUDA Graph, and dynamic components, handled by JIT-compiled kernels, enabling asynchronous graph capture and reuse of captured graphs across autoregressive decoding steps. Evaluated on LLaMA-2 7B with single-GPU, batch-size-one inference over prompt lengths of 10 to 500 tokens, the method reduces Time-to-First-Token (TTFT) by up to 66.0%. The work targets kernel launch overhead, which dominates in interactive, short-sequence, small-batch settings where individual GPU kernels finish faster than the CPU can launch them, a key obstacle to practical LLM deployment.
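The paper's runtime internals are not reproduced here, but the capture-and-replay mechanism it builds on is exposed through PyTorch's public CUDA Graph API. A minimal sketch of that mechanism follows; the module, tensor shapes, and warm-up count are illustrative assumptions, not the paper's configuration:

```python
import torch

# Stand-in for a fixed-shape transformer decode body (assumption for illustration).
block = torch.nn.Linear(4096, 4096).cuda().eval()
static_input = torch.zeros(1, 4096, device="cuda")  # fixed buffer reused every step

# Warm up on a side stream so capture records steady-state kernel selections.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side), torch.no_grad():
    for _ in range(3):
        block(static_input)
torch.cuda.current_stream().wait_stream(side)

# Capture one fixed-shape forward pass; replay() later re-issues every captured
# kernel with a single launch, removing per-kernel CPU launch overhead.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = block(static_input)

def decode_step(hidden):
    static_input.copy_(hidden)  # refresh inputs in place; graph topology is fixed
    graph.replay()              # one launch replays the whole captured sequence
    return static_output
```

Because the graph's kernel topology is frozen at capture time, inputs must be written into the same pre-allocated buffers before each replay, which is why the static partition is restricted to fixed-shape work.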
Key facts
- Hybrid runtime combines JIT compilation with CUDA Graph execution
- Partitions transformer inference into static and dynamic components (see the sketch after this list)
- Static components executed via CUDA Graph replay
- Dynamic components handled through JIT-compiled kernels
- Enables asynchronous graph capture and reuse across decoding steps
- Evaluated on LLaMA-2 7B with single-GPU, batch-size-one inference
- Prompt lengths from 10 to 500 tokens
- Reduces Time-to-First-Token (TTFT) by up to 66.0%
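A hedged sketch of how the static/dynamic split could be wired together, continuing the capture example above. Here `torch.compile` stands in for the paper's JIT-compiled kernels, and `dynamic_prefill` is a hypothetical placeholder for the prompt-length-dependent work:

```python
import torch

block = torch.nn.Linear(4096, 4096).cuda().eval()  # assumed static decode body
static_in = torch.zeros(1, 4096, device="cuda")

@torch.compile(dynamic=True)  # JIT path: tolerates varying prompt lengths
def dynamic_prefill(prompt_embeds):
    # Hypothetical stand-in for prompt-dependent prefill computation.
    return prompt_embeds.mean(dim=1)

# Warm up and capture the fixed-shape part once (side-stream warm-up as in
# the earlier sketch is omitted here for brevity).
with torch.no_grad():
    block(static_in)
torch.cuda.synchronize()
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_out = block(static_in)

def first_token(prompt_embeds):
    h = dynamic_prefill(prompt_embeds)  # dynamic: JIT handles variable shapes
    static_in.copy_(h)                  # hand off to the fixed-shape buffer
    graph.replay()                      # static: single-launch graph replay
    return static_out

out = first_token(torch.randn(1, 37, 4096, device="cuda"))  # e.g., 37-token prompt
```

The design intuition matches the summary: variable-shape prompt work goes through the recompilable JIT path once, while the shape-stable per-step decode body is replayed with a single launch on every subsequent step.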
Entities
- LLaMA-2 7B
- CUDA Graph
- Just-In-Time (JIT) compilation
- Time-to-First-Token (TTFT)