ARTFEED — Contemporary Art Intelligence

ProcBench: New Benchmark Evaluates LLM Coding Agent Execution Process

ai-technology · 2026-05-22

ProcBench is a benchmark that assesses LLM coding agents on the quality of their execution process rather than on end results alone. It organizes recurring execution errors into an ontology of 11 defect types across four categories and analyzes agent trajectories using standardized process evidence. Raw logs are converted into a unified trajectory representation so that heterogeneous agents can be compared. The benchmark also introduces a metric called control preservation, which measures the extent to which execution remains interpretable, interruptible, correctable, reversible, and able to return authority. The research is detailed in arXiv paper 2605.20251.
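
To make the idea of a unified trajectory and a control preservation score more concrete, here is a minimal Python sketch. It is not ProcBench's actual schema or scoring method, which this summary does not describe. The `Step` class, the `normalize` and `control_preservation` functions, the raw log field names, and the per-step averaging rule are all hypothetical. Only the five property names come from the summary above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The five properties are taken from the summary; everything else here
# (schema, field names, scoring rule) is an illustrative assumption.
CONTROL_PROPERTIES = (
    "interpretable",
    "interruptible",
    "correctable",
    "reversible",
    "returns_authority",
)


@dataclass
class Step:
    """One normalized step in a unified trajectory (hypothetical schema)."""
    agent: str
    action: str
    # True when the step preserved the named control property.
    flags: Dict[str, bool] = field(default_factory=dict)


def normalize(agent: str, raw_log: List[dict]) -> List[Step]:
    """Map one agent's raw log records onto the shared Step schema."""
    steps = []
    for record in raw_log:
        # Missing properties are treated as not preserved.
        flags = {p: bool(record.get(p, False)) for p in CONTROL_PROPERTIES}
        steps.append(Step(agent=agent, action=str(record.get("action", "")), flags=flags))
    return steps


def control_preservation(trajectory: List[Step]) -> Dict[str, float]:
    """Fraction of steps preserving each property (an assumed scoring rule)."""
    if not trajectory:
        return {p: 0.0 for p in CONTROL_PROPERTIES}
    total = len(trajectory)
    return {
        p: sum(step.flags.get(p, False) for step in trajectory) / total
        for p in CONTROL_PROPERTIES
    }


# Made-up raw log for a hypothetical agent, for illustration only.
raw = [
    {"action": "edit_file", "interpretable": True, "interruptible": True,
     "correctable": True, "reversible": True, "returns_authority": True},
    {"action": "force_push", "interpretable": True, "interruptible": False,
     "correctable": False, "reversible": False, "returns_authority": True},
]
scores = control_preservation(normalize("agent_a", raw))
print(scores)
```

Because every agent's log is first mapped onto the same `Step` shape, the same scoring function can be applied to agents with different native log formats, which is the kind of cross-agent comparison the summary describes.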

Key facts

  • ProcBench evaluates execution-process defects in LLM coding agents
  • Covers 11 defect types in 4 categories
  • Standardizes raw logs into unified trajectory representation
  • Introduces control preservation as a quality metric
  • Published on arXiv with ID 2605.20251
  • Focuses on process evidence rather than final outcomes
  • Supports comparison across heterogeneous agents
  • Reports calibrated scorecards over process-level findings

Entities

Institutions

  • arXiv

Sources