Harness-Bench: Benchmarking Execution-Layer Effects in LLM Agent Workflows
Harness-Bench is a diagnostic benchmark that examines how harness configurations affect the performance of LLM agents in realistic workflows. The harness is the execution layer that manages context, tools, state, constraints, permissions, tracing, and recovery. Existing benchmarks typically abstract away execution or hold the harness fixed, which makes it hard to attribute performance differences to the execution layer. Harness-Bench evaluates a set of representative harness configurations across multiple model backends under shared task environments, budgets, and evaluation protocols, while preserving each harness's native execution characteristics. The benchmark comprises 106 sandboxed offline tasks derived from practical agent-use patterns. The paper is available on arXiv (ID 2605.27922).
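The evaluation design described above, in which harnesses and backends vary while tasks and budgets are held fixed, can be pictured as a grid. The following is a minimal illustrative sketch, not the benchmark's actual code: the names `SharedConditions`, `evaluate_grid`, `harness_spread`, and `stub_runner`, the `max_steps` budget unit, and the harness and model labels are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, Sequence, Tuple


@dataclass(frozen=True)
class SharedConditions:
    """Settings held constant for every (harness, backend) pair (hypothetical)."""
    task_ids: Tuple[str, ...]
    max_steps: int  # assumed budget unit; the paper's budget definition may differ


# A runner takes (harness, backend, task_id, max_steps) and returns True on success.
Runner = Callable[[str, str, str, int], bool]


def evaluate_grid(
    harnesses: Sequence[str],
    backends: Sequence[str],
    shared: SharedConditions,
    run_task: Runner,
) -> Dict[Tuple[str, str], float]:
    """Run every harness/backend pair on the same tasks and return pass rates."""
    if not shared.task_ids:
        raise ValueError("at least one task is required")
    results: Dict[Tuple[str, str], float] = {}
    for harness, backend in product(harnesses, backends):
        passed = sum(
            run_task(harness, backend, task_id, shared.max_steps)
            for task_id in shared.task_ids
        )
        results[(harness, backend)] = passed / len(shared.task_ids)
    return results


def harness_spread(results: Dict[Tuple[str, str], float], backend: str) -> float:
    """Gap between the best and worst harness for one fixed backend."""
    rates = [rate for (_, b), rate in results.items() if b == backend]
    return max(rates) - min(rates)


def stub_runner(harness: str, backend: str, task_id: str, max_steps: int) -> bool:
    """Placeholder for a sandboxed run; a real runner would execute the agent."""
    return harness == "harness_a" or task_id == "task_1"
```

Calling `evaluate_grid(["harness_a", "harness_b"], ["model_x"], shared, stub_runner)` with a two-task `SharedConditions` yields one pass rate per pair. Because the backend, tasks, and budget are identical within a column of the grid, `harness_spread` isolates the part of the difference that is attributable to the harness, which is the kind of comparison a fixed-harness benchmark cannot make.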
Key facts
- Harness-Bench is a diagnostic benchmark for evaluating harness effects in LLM agent workflows.
- It evaluates harness configurations across multiple model backends under shared conditions.
- The benchmark contains 106 sandboxed offline tasks from practical agent-use patterns.
- Existing benchmarks abstract away execution or hold the harness fixed.
- The harness manages context, tools, state, constraints, permissions, tracing, and recovery.
- The paper is available on arXiv with ID 2605.27922.
Entities
Institutions
- arXiv