AgentKernelArena Benchmark Tests AI Coding Agents on GPU Kernel Optimization
AgentKernelArena is an open-source benchmark for evaluating AI coding agents on GPU kernel optimization. It comprises 196 tasks spanning HIP-to-HIP optimization, Triton-to-Triton optimization, and PyTorch-to-HIP translation. The benchmark evaluates complete agent workflows in isolated workspaces, using gated compilation, correctness, and performance checks, centralized scoring, and a generalization protocol that tests whether optimizations carry over to unseen configurations. Whereas existing kernel benchmarks evaluate single LLM calls, AgentKernelArena evaluates full agent workflows and combines kernel-to-kernel optimization with unseen-configuration generalization testing. GPU kernel optimization is critical for efficient deep learning systems, yet writing high-performance kernels requires considerable low-level expertise. Recent AI coding agents can iteratively analyze code, invoke compilers and profilers, and refine their implementations.
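The gated evaluation described above can be pictured with a minimal sketch. This is not AgentKernelArena's actual code or scoring rule: the `TaskResult` fields, the `gated_score` function, and the use of a plain baseline-to-optimized time ratio as the speedup are all assumptions made for illustration. The only idea taken from the summary is that compilation, correctness, and performance are checked in order, with each stage gating the next.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskResult:
    """Hypothetical record of one agent submission (field names are assumptions)."""
    compiled: bool       # did the submitted kernel build?
    correct: bool        # did its outputs match the reference?
    baseline_ms: float   # runtime of the original kernel
    optimized_ms: float  # runtime of the agent's kernel


def gated_score(result: TaskResult) -> dict:
    """Report the last stage reached; a speedup is only computed if all gates pass."""
    speedup: Optional[float] = None
    if not result.compiled:
        return {"stage": "compile", "passed": False, "speedup": speedup}
    if not result.correct:
        return {"stage": "correctness", "passed": False, "speedup": speedup}
    # Performance is measured only for kernels that compile and are correct.
    speedup = result.baseline_ms / result.optimized_ms
    return {"stage": "performance", "passed": True, "speedup": speedup}
```

Under this sketch, a fast but incorrect kernel receives no speedup at all, which is the point of gating: performance numbers are only meaningful for submissions that build and produce correct results.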
Key facts
- AgentKernelArena is an open-source benchmark for AI coding agents on GPU kernel optimization.
- The benchmark contains 196 tasks.
- Tasks span HIP-to-HIP optimization, Triton-to-Triton optimization, and PyTorch-to-HIP translation.
- It evaluates complete agent workflows in isolated workspaces.
- It uses gated compilation, correctness, and performance checks.
- It includes centralized scoring and an unseen-configuration generalization protocol.
- Existing kernel benchmarks evaluate single LLM calls, not full agent workflows.
- GPU kernel optimization is critical for efficient deep learning systems.
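One way to think about the unseen-configuration protocol listed above is to compare the speedup an optimized kernel achieves on the configurations the agent worked with against its speedup on held-out ones. The sketch below is an illustrative assumption, not the benchmark's published metric: the function names and the choice of a geometric-mean ratio are hypothetical.

```python
import math
from typing import Sequence


def geometric_mean(values: Sequence[float]) -> float:
    """Geometric mean of positive values, computed in log space."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be non-empty and strictly positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def generalization_ratio(seen_speedups: Sequence[float],
                         unseen_speedups: Sequence[float]) -> float:
    """Hypothetical metric: held-out speedup relative to speedup on seen configs.

    A value near 1.0 suggests the optimization carries over; a value well
    below 1.0 suggests it was tuned to the configurations the agent saw.
    """
    return geometric_mean(unseen_speedups) / geometric_mean(seen_speedups)
```

For example, speedups of 2.0 on every seen configuration and 1.0 on every unseen one give a ratio of about 0.5, indicating that the gain did not transfer.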
Entities
Institutions
- arXiv