AuditRepairBench: A Corpus for Evaluator-Channel Ranking Instability in Agent Repair
A new study introduces AuditRepairBench, a dataset designed to probe ranking inconsistencies in agent-repair leaderboards caused by evaluator reconfiguration. The corpus comprises 576,000 registered cells, of which 96,000 have been executed, and targets ranking instability tied to evaluator-channel blocking within a defined observability range. The study finds that much of the reordering stems from methods that consult evaluator-derived signals when selecting repair candidates. To screen for this, a modular framework provides four interchangeable implementations, including a learned influence proxy and a sparse human-audit proxy, and its output feeds the downstream scoring and ranking mechanisms.
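The paper does not publish its screening code, but the modular design it describes can be sketched as four interchangeable functions combined into a per-cell posterior. Everything below is illustrative: the `Cell` record, the feature names, and the weighted-average combination rule are assumptions, and each screen is a stub standing in for the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical cell record: one registered execution of a
# (system, task, evaluator-configuration) triple.
@dataclass
class Cell:
    system: str
    features: Dict[str, float]  # signals logged for this cell (assumed names)

# Each screen maps a cell to a score in [0, 1]: how strongly the repair
# method appears to have consulted evaluator-derived signals.
def learned_influence_proxy(cell: Cell) -> float:
    # Stand-in for a trained model's influence score.
    return min(1.0, cell.features.get("influence", 0.0))

def channel_exposure_ratio(cell: Cell) -> float:
    # Rule-based: fraction of candidate selections that touched an
    # evaluator channel.
    exposed = cell.features.get("exposed_selections", 0.0)
    total = cell.features.get("total_selections", 1.0)
    return exposed / max(total, 1.0)

def counterfactual_sensitivity_proxy(cell: Cell) -> float:
    # Proxy for how much the outcome shifts when the evaluator channel
    # is blocked.
    return min(1.0, abs(cell.features.get("delta_under_block", 0.0)))

def sparse_human_audit_proxy(cell: Cell) -> float:
    # Sparse human labels; default to 0.5 (uninformative) when unlabeled.
    return cell.features.get("audit_label", 0.5)

SCREENS: List[Callable[[Cell], float]] = [
    learned_influence_proxy,
    channel_exposure_ratio,
    counterfactual_sensitivity_proxy,
    sparse_human_audit_proxy,
]

def screening_posterior(cell: Cell, weights: List[float] = None) -> float:
    """Combine the interchangeable screens into one posterior per cell."""
    weights = weights or [1.0 / len(SCREENS)] * len(SCREENS)
    return sum(w * screen(cell) for w, screen in zip(weights, SCREENS))
```

Because each screen shares the same signature, any subset can be swapped in or reweighted without touching the scoring code that consumes the posterior, which is the modularity the paper emphasizes.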
Key facts
- AuditRepairBench is a paired-execution trace corpus.
- The corpus has 576,000 registered cells, 96,000 executed.
- It addresses ranking instability in agent-repair leaderboards.
- Instability arises from evaluator reconfiguration.
- Reordering is driven by methods that consult evaluator-derived signals.
- Four screening implementations are used: learned influence proxy, rule-based channel-exposure ratio, counterfactual sensitivity proxy, sparse human-audit proxy.
- The architecture produces a screening posterior, cell-level flip functional, set-valued label, stratified system score, and set-valued leaderboard.
- The paper is published on arXiv with ID 2605.04624.
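The outputs listed above (cell-level flip functional, set-valued label, stratified system score, set-valued leaderboard) can be sketched as a small pipeline over screening posteriors. This is a minimal sketch under assumed semantics: the thresholds, the keep-or-drop stratification, and the rank-interval rule are all my illustrative choices, not the paper's definitions.

```python
from typing import Dict, List, Tuple

def set_valued_label(posterior: float, low: float = 0.2, high: float = 0.8) -> set:
    """Map a cell's screening posterior to a set-valued label:
    {0} = clean, {1} = flagged as evaluator-influenced, {0, 1} = ambiguous.
    Thresholds are assumed, not from the paper."""
    if posterior <= low:
        return {0}
    if posterior >= high:
        return {1}
    return {0, 1}

def stratified_score(cells: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Score a system as an interval: each cell is (success, posterior).
    The optimistic end keeps every cell; the conservative end drops any
    cell whose label could be 'flagged'."""
    optimistic = sum(s for s, _ in cells) / len(cells)
    kept = [s for s, p in cells if 1 not in set_valued_label(p)]
    conservative = sum(kept) / len(kept) if kept else 0.0
    return (min(optimistic, conservative), max(optimistic, conservative))

def set_valued_leaderboard(
    scores: Dict[str, Tuple[float, float]]
) -> Dict[str, Tuple[int, int]]:
    """Turn score intervals into rank intervals: a system's best rank
    counts only rivals that certainly beat it, its worst rank counts
    every rival that possibly beats it."""
    ranks = {}
    for name, (lo, hi) in scores.items():
        best = 1 + sum(1 for n, (l, _) in scores.items() if n != name and l > hi)
        worst = 1 + sum(1 for n, (_, h) in scores.items() if n != name and h >= lo)
        ranks[name] = (best, worst)
    return ranks
```

For example, score intervals `{"A": (0.9, 0.95), "B": (0.5, 0.7), "C": (0.6, 0.8)}` give A the fixed rank (1, 1) while B and C both get (2, 3): their intervals overlap, so the leaderboard reports a set of admissible orderings rather than forcing a single one.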
Entities
Institutions
- arXiv