FastKernels: A Production-Grade GPU Kernel Benchmark for LLM Agents

other · 2026-05-25

FastKernels is a newly introduced benchmark that aims to close the gap between current GPU kernel benchmarks and production inference frameworks. Existing benchmarks evaluate kernels on single GPUs with synthetic inputs, ignore the surrounding compilation stack, and reward replicating known optimizations rather than discovering new ones. As a result, agents learn to generate kernels that score well in sandboxes but cause compatibility problems, compilation-stack conflicts, and silent correctness regressions in real systems. FastKernels is built on 46 representative architectures spanning 8 categories, which together cover 96.2% (409/425) of HuggingFace Transformers architectures. It is intended as a streamlined, production-grade benchmark that provides precise reward signals for LLM-based agents that generate GPU kernels.

Key facts

Existing GPU kernel benchmarks are poorly aligned with production inference frameworks.
Existing benchmarks evaluate kernels on single GPUs with synthetic inputs.
Current benchmarks ignore the surrounding compilation stack.
Existing benchmarks reward replicating known optimizations rather than discovering new ones.
Agents learn to generate kernels that score well in sandboxes but fail in real systems.
FastKernels is a new benchmark based on 46 representative architectures.
The 46 architectures span 8 categories.
FastKernels covers 96.2% (409/425) of HuggingFace Transformers architectures.
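
The coverage figure follows directly from the two counts quoted above. A minimal sketch of the arithmetic, where the function name `coverage_percent` is a hypothetical helper introduced only for illustration and is not part of FastKernels:

```python
def coverage_percent(covered: int, total: int) -> float:
    """Return the share of covered items as a percentage, rounded to one decimal."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * covered / total, 1)


# Counts taken from the summary: 409 of 425 HuggingFace Transformers architectures.
fastkernels_coverage = coverage_percent(409, 425)
print(f"{fastkernels_coverage}%")  # prints 96.2%
```

The remaining 16 architectures (425 - 409) are the ones not covered by the 46 representative architectures.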
