ARTFEED — Contemporary Art Intelligence

AgentKernelArena Benchmark Tests AI Coding Agents on GPU Kernel Optimization

ai-technology · 2026-05-20

AgentKernelArena is an open-source benchmark for evaluating AI coding agents on GPU kernel optimization. It comprises 196 tasks in three categories: HIP-to-HIP optimization, Triton-to-Triton optimization, and PyTorch-to-HIP translation. The benchmark evaluates complete agent workflows in isolated workspaces, using gated compilation, correctness, and performance checks, centralized scoring, and a generalization protocol that tests whether optimizations carry over to unseen configurations. Existing kernel benchmarks focus on single LLM calls; AgentKernelArena instead evaluates full agent workflows and combines kernel-to-kernel optimization with unseen-configuration generalization testing. GPU kernel optimization is critical for efficient deep learning, yet writing high-performance kernels requires considerable low-level expertise. Recent AI coding agents can iteratively analyze code, invoke compilers and profilers, and refine implementations.
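
The gated evaluation described above can be illustrated with a minimal sketch. The source does not publish the scoring formula, so everything here is an assumption made for illustration: the `KernelResult` fields, the `gated_score` function, and the choice of speedup over a baseline as the final score are hypothetical and are not the benchmark's actual API.

```python
from dataclasses import dataclass


@dataclass
class KernelResult:
    """Outcome of one agent submission (hypothetical fields, not the benchmark's schema)."""
    compiled: bool        # did the submitted kernel build?
    correct: bool         # did its outputs match the reference?
    baseline_ms: float    # runtime of the original kernel
    optimized_ms: float   # runtime of the agent's kernel


def gated_score(result: KernelResult) -> float:
    """Return speedup over the baseline, or 0.0 if an earlier gate fails.

    Assumption: each gate must pass before the next is considered, so a
    fast but incorrect kernel earns nothing.
    """
    if not result.compiled:
        return 0.0  # compilation gate
    if not result.correct:
        return 0.0  # correctness gate
    if result.optimized_ms <= 0:
        raise ValueError("optimized_ms must be positive")
    return result.baseline_ms / result.optimized_ms  # performance stage
```

Under these assumptions, a kernel that compiles and passes correctness with a 10 ms baseline and a 5 ms optimized runtime scores 2.0, while one that fails either gate scores 0.0 regardless of its timing.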

Key facts

  • AgentKernelArena is an open-source benchmark for AI coding agents on GPU kernel optimization.
  • The benchmark contains 196 tasks.
  • Tasks span HIP-to-HIP optimization, Triton-to-Triton optimization, and PyTorch-to-HIP translation.
  • It evaluates complete agent workflows in isolated workspaces.
  • It uses gated compilation, correctness checks, and performance checks.
  • Includes centralized scoring and an unseen-configuration generalization protocol.
  • Existing kernel benchmarks evaluate single LLM calls, not full agent workflows.
  • GPU kernel optimization is critical for efficient deep learning systems.
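
One way to picture the unseen-configuration generalization check is to compare speedups on the configurations an agent tuned against with speedups on held-out ones. The sketch below is an assumption, not the benchmark's published protocol: `geomean` and `generalization_ratio` are hypothetical helpers, and the geometric mean is simply a common choice for aggregating speedups.

```python
import math
from typing import Sequence


def geomean(values: Sequence[float]) -> float:
    """Geometric mean of positive speedups."""
    if not values:
        raise ValueError("values must be non-empty")
    if any(v <= 0 for v in values):
        raise ValueError("speedups must be positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def generalization_ratio(seen: Sequence[float], unseen: Sequence[float]) -> float:
    """Ratio of held-out speedup to tuned speedup (hypothetical metric).

    A value near 1.0 suggests the optimization carries over to new
    configurations; a value well below 1.0 suggests it was overfitted
    to the configurations the agent saw.
    """
    return geomean(unseen) / geomean(seen)
```

For example, speedups of 4.0 on seen configurations and 2.0 on unseen ones would give a ratio of about 0.5 under this hypothetical metric.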

Entities

Institutions

  • arXiv

Sources