Hybrid JIT-CUDA Graph Optimization Reduces LLM Inference Latency by Up to 66%
Researchers propose a hybrid runtime framework that combines Just-In-Time (JIT) compilation with CUDA Graph execution to reduce inference latency in large language models (LLMs). The approach partitions transformer inference into static components, replayed via CUDA Graph, and dynamic components, handled by JIT-compiled kernels, enabling asynchronous graph capture and reuse of captured graphs across autoregressive decoding steps. Evaluated on LLaMA-2 7B with single-GPU, batch-size-one inference over prompt lengths of 10 to 500 tokens, the method reduces Time-to-First-Token (TTFT) by up to 66.0%. The work targets kernel launch overhead, which dominates in interactive, short-sequence, small-batch settings where individual GPU kernels finish faster than the CPU can launch them, a key obstacle to practical LLM deployment.
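The paper's runtime internals are not reproduced here, but the capture-and-replay mechanism it builds on is exposed through PyTorch's public CUDA Graph API. A minimal sketch of that mechanism follows; the module, tensor shapes, and warm-up count are illustrative assumptions, not the paper's configuration:

```python
import torch

# Stand-in for a fixed-shape transformer decode body (assumption for illustration).
block = torch.nn.Linear(4096, 4096).cuda().eval()
static_input = torch.zeros(1, 4096, device="cuda")  # fixed buffer reused every step

# Warm up on a side stream so capture records steady-state kernel selections.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side), torch.no_grad():
    for _ in range(3):
        block(static_input)
torch.cuda.current_stream().wait_stream(side)

# Capture one fixed-shape forward pass; replay() later re-issues every captured
# kernel with a single launch, removing per-kernel CPU launch overhead.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = block(static_input)

def decode_step(hidden):
    static_input.copy_(hidden)  # refresh inputs in place; graph topology is fixed
    graph.replay()              # one launch replays the whole captured sequence
    return static_output
```

Because the graph's kernel topology is frozen at capture time, inputs must be written into the same pre-allocated buffers before each replay, which is why the static partition is restricted to fixed-shape work.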
Key facts
- Hybrid runtime combines JIT compilation with CUDA Graph execution
- Partitions transformer inference into static and dynamic components (see the sketch after this list)
- Static components executed via CUDA Graph replay
- Dynamic components handled through JIT-compiled kernels
- Enables asynchronous graph capture and reuse across decoding steps
- Evaluated on LLaMA-2 7B with single-GPU, batch-size-one inference
- Prompt lengths from 10 to 500 tokens
- Reduces Time-to-First-Token (TTFT) by up to 66.0%
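A hedged sketch of how the static/dynamic split could be wired together, continuing the capture example above. Here `torch.compile` stands in for the paper's JIT-compiled kernels, and `dynamic_prefill` is a hypothetical placeholder for the prompt-length-dependent work:

```python
import torch

block = torch.nn.Linear(4096, 4096).cuda().eval()  # assumed static decode body
static_in = torch.zeros(1, 4096, device="cuda")

@torch.compile(dynamic=True)  # JIT path: tolerates varying prompt lengths
def dynamic_prefill(prompt_embeds):
    # Hypothetical stand-in for prompt-dependent prefill computation.
    return prompt_embeds.mean(dim=1)

# Warm up and capture the fixed-shape part once (side-stream warm-up as in
# the earlier sketch is omitted here for brevity).
with torch.no_grad():
    block(static_in)
torch.cuda.synchronize()
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_out = block(static_in)

def first_token(prompt_embeds):
    h = dynamic_prefill(prompt_embeds)  # dynamic: JIT handles variable shapes
    static_in.copy_(h)                  # hand off to the fixed-shape buffer
    graph.replay()                      # static: single-launch graph replay
    return static_out

out = first_token(torch.randn(1, 37, 4096, device="cuda"))  # e.g., 37-token prompt
```

The design intuition matches the summary: variable-shape prompt work goes through the recompilable JIT path once, while the shape-stable per-step decode body is replayed with a single launch on every subsequent step.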
Entities
- LLaMA-2 7B
- CUDA Graph
- Just-In-Time (JIT) compilation
- Time-to-First-Token (TTFT)