AuditRepairBench: A Corpus for Evaluator-Channel Ranking Instability in Agent Repair
A new study introduces AuditRepairBench, a dataset designed to probe ranking inconsistencies in agent-repair leaderboards caused by evaluator reconfiguration. The corpus comprises 576,000 registered cells, of which 96,000 have been executed, and targets ranking instability tied to evaluator-channel blocking within a defined observability range. The study finds that much of the reordering stems from methods that consult evaluator-derived signals when selecting repair candidates. To screen for this, a modular framework provides four interchangeable implementations, including a learned influence proxy and a sparse human-audit proxy, and its output feeds the downstream scoring and ranking mechanisms.
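The paper does not publish its screening code, but the modular design it describes can be sketched as four interchangeable functions combined into a per-cell posterior. Everything below is illustrative: the `Cell` record, the feature names, and the weighted-average combination rule are assumptions, and each screen is a stub standing in for the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical cell record: one registered execution of a
# (system, task, evaluator-configuration) triple.
@dataclass
class Cell:
    system: str
    features: Dict[str, float]  # signals logged for this cell (assumed names)

# Each screen maps a cell to a score in [0, 1]: how strongly the repair
# method appears to have consulted evaluator-derived signals.
def learned_influence_proxy(cell: Cell) -> float:
    # Stand-in for a trained model's influence score.
    return min(1.0, cell.features.get("influence", 0.0))

def channel_exposure_ratio(cell: Cell) -> float:
    # Rule-based: fraction of candidate selections that touched an
    # evaluator channel.
    exposed = cell.features.get("exposed_selections", 0.0)
    total = cell.features.get("total_selections", 1.0)
    return exposed / max(total, 1.0)

def counterfactual_sensitivity_proxy(cell: Cell) -> float:
    # Proxy for how much the outcome shifts when the evaluator channel
    # is blocked.
    return min(1.0, abs(cell.features.get("delta_under_block", 0.0)))

def sparse_human_audit_proxy(cell: Cell) -> float:
    # Sparse human labels; default to 0.5 (uninformative) when unlabeled.
    return cell.features.get("audit_label", 0.5)

SCREENS: List[Callable[[Cell], float]] = [
    learned_influence_proxy,
    channel_exposure_ratio,
    counterfactual_sensitivity_proxy,
    sparse_human_audit_proxy,
]

def screening_posterior(cell: Cell, weights: List[float] = None) -> float:
    """Combine the interchangeable screens into one posterior per cell."""
    weights = weights or [1.0 / len(SCREENS)] * len(SCREENS)
    return sum(w * screen(cell) for w, screen in zip(weights, SCREENS))
```

Because each screen shares the same signature, any subset can be swapped in or reweighted without touching the scoring code that consumes the posterior, which is the modularity the paper emphasizes.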
Key facts
- AuditRepairBench is a paired-execution trace corpus.
- The corpus has 576,000 registered cells, 96,000 executed.
- It addresses ranking instability in agent-repair leaderboards.
- Instability arises from evaluator reconfiguration.
- Reordering is driven by methods that consult evaluator-derived signals.
- Four screening implementations are used: learned influence proxy, rule-based channel-exposure ratio, counterfactual sensitivity proxy, sparse human-audit proxy.
- The architecture produces a screening posterior, cell-level flip functional, set-valued label, stratified system score, and set-valued leaderboard.
- The paper is published on arXiv with ID 2605.04624.
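The outputs listed above (cell-level flip functional, set-valued label, stratified system score, set-valued leaderboard) can be sketched as a small pipeline over screening posteriors. This is a minimal sketch under assumed semantics: the thresholds, the keep-or-drop stratification, and the rank-interval rule are all my illustrative choices, not the paper's definitions.

```python
from typing import Dict, List, Tuple

def set_valued_label(posterior: float, low: float = 0.2, high: float = 0.8) -> set:
    """Map a cell's screening posterior to a set-valued label:
    {0} = clean, {1} = flagged as evaluator-influenced, {0, 1} = ambiguous.
    Thresholds are assumed, not from the paper."""
    if posterior <= low:
        return {0}
    if posterior >= high:
        return {1}
    return {0, 1}

def stratified_score(cells: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Score a system as an interval: each cell is (success, posterior).
    The optimistic end keeps every cell; the conservative end drops any
    cell whose label could be 'flagged'."""
    optimistic = sum(s for s, _ in cells) / len(cells)
    kept = [s for s, p in cells if 1 not in set_valued_label(p)]
    conservative = sum(kept) / len(kept) if kept else 0.0
    return (min(optimistic, conservative), max(optimistic, conservative))

def set_valued_leaderboard(
    scores: Dict[str, Tuple[float, float]]
) -> Dict[str, Tuple[int, int]]:
    """Turn score intervals into rank intervals: a system's best rank
    counts only rivals that certainly beat it, its worst rank counts
    every rival that possibly beats it."""
    ranks = {}
    for name, (lo, hi) in scores.items():
        best = 1 + sum(1 for n, (l, _) in scores.items() if n != name and l > hi)
        worst = 1 + sum(1 for n, (_, h) in scores.items() if n != name and h >= lo)
        ranks[name] = (best, worst)
    return ranks
```

For example, score intervals `{"A": (0.9, 0.95), "B": (0.5, 0.7), "C": (0.6, 0.8)}` give A the fixed rank (1, 1) while B and C both get (2, 3): their intervals overlap, so the leaderboard reports a set of admissible orderings rather than forcing a single one.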
Entities
Institutions
- arXiv