ProcBench: New Benchmark Evaluates LLM Coding Agent Execution Process

ai-technology · 2026-05-22

ProcBench is a benchmark that assesses LLM coding agents on the quality of their execution process rather than only the end result. It organizes recurring execution errors into an ontology of 11 defect types across four categories and analyzes agent trajectories using standardized process evidence. The benchmark converts raw logs into a unified trajectory representation so that different agents can be compared on the same footing. It also introduces a metric called control preservation, which measures the extent to which execution remains interpretable, interruptible, correctable, reversible, and able to return authority. The research is detailed in arXiv paper 2605.20251.
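To make the idea of a unified trajectory representation more concrete, here is a minimal sketch in Python. It is not taken from the paper: the `Step` and `Trajectory` records, the raw-log keys (`action`/`tool`, `target`/`args`, `rationale`/`thought`), and the toy `control_preservation` score are all hypothetical, chosen only to illustrate how logs from different agents might be mapped onto one shape and how the five control properties might be tallied.

```python
from dataclasses import dataclass, field

# Hypothetical unified step record. Field names are illustrative
# assumptions, not the schema used by ProcBench.
@dataclass
class Step:
    index: int
    action: str               # e.g. "edit_file", "run_command"
    target: str               # file path or command string
    rationale: str = ""       # agent's stated reason, if logged
    reversible: bool = False  # True if the log marks the step as undoable


@dataclass
class Trajectory:
    agent: str
    steps: list[Step] = field(default_factory=list)


def normalize(agent: str, raw_log: list[dict]) -> Trajectory:
    """Map raw log entries with differing key names onto the shared Step record."""
    steps = []
    for i, entry in enumerate(raw_log):
        steps.append(Step(
            index=i,
            action=entry.get("action") or entry.get("tool", "unknown"),
            target=entry.get("target") or entry.get("args", ""),
            rationale=entry.get("rationale") or entry.get("thought", ""),
            reversible=bool(entry.get("reversible", False)),
        ))
    return Trajectory(agent=agent, steps=steps)


# The five properties named in the summary above.
CONTROL_PROPERTIES = (
    "interpretable",
    "interruptible",
    "correctable",
    "reversible",
    "returns_authority",
)


def control_preservation(checks: dict[str, bool]) -> float:
    """Toy score (an assumption): fraction of the five properties judged to hold."""
    held = sum(bool(checks.get(p, False)) for p in CONTROL_PROPERTIES)
    return held / len(CONTROL_PROPERTIES)
```

With this sketch, two agents that log under different key names both yield a `Trajectory` with the same fields, so a downstream check can run over either one. How ProcBench itself scores control preservation is not described in this summary; the equal-weight fraction here is only a placeholder.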

Key facts

ProcBench evaluates execution-process defects in LLM coding agents
Covers 11 defect types in 4 categories
Standardizes raw logs into unified trajectory representation
Introduces control preservation as a quality metric
Published on arXiv with ID 2605.20251
Focuses on process evidence rather than final outcomes
Supports comparison across heterogeneous agents
Reports calibrated scorecards over process-level findings
