RepoMirage Tests Code Agents' Repository Context Reasoning with Perturbations
Researchers have introduced RepoMirage, a two-stage evaluation suite built on SWE-Bench Verified that tests whether code agents genuinely reason over repository context in end-to-end tasks such as issue resolution. The suite uses perturbations as diagnostic tools: by changing how the repository is presented, it increases how much contextual reasoning a task requires. The first stage, RepoMirage-Perturb, applies three types of semantics-preserving, repository-level perturbations; performance drops noticeably when a correct solution depends on access to broader context. The second stage, RepoMirage-Extend, turns these structural bottlenecks into explicit tasks beyond issue resolution, and average performance drops further. The findings suggest that benchmark success may not reflect genuine reasoning about multi-file relationships.
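To make "semantics-preserving perturbation" concrete, here is a minimal Python sketch. It does not reproduce any of the three perturbation types used by RepoMirage, which this summary does not specify. The toy two-file repository, the consistent symbol rename, and the helper names `rename_symbol` and `run_repo` are all assumptions for illustration only. The idea is that the repository looks different to an agent while behaving identically when run.

```python
import re


def rename_symbol(repo, old, new):
    """Return a copy of `repo` (path -> source) with every whole-word
    occurrence of `old` replaced by `new` in all files."""
    pattern = re.compile(rf"\b{re.escape(old)}\b")
    return {path: pattern.sub(new, source) for path, source in repo.items()}


def run_repo(repo, entry):
    """Toy runner (an assumption, not a real import system): execute the
    files in sorted path order in one shared namespace, then call `entry`."""
    namespace = {}
    for path in sorted(repo):
        exec(repo[path], namespace)
    return namespace[entry]()


# Hypothetical two-file repository: b_main.py depends on a helper in a_utils.py.
original = {
    "a_utils.py": "def normalize(x):\n    return x.strip().lower()\n",
    "b_main.py": "def main():\n    return normalize('  Hello ')\n",
}

# Perturb: give the helper an uninformative name everywhere it appears.
perturbed = rename_symbol(original, "normalize", "helper_7f3a")

# Semantics are preserved if both versions produce the same result.
assert run_repo(original, "main") == run_repo(perturbed, "main")
```

In this toy, an agent can no longer guess what the helper does from its name and has to read `a_utils.py` to understand `main`. That is the kind of added pressure on cross-file reasoning the article describes.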
Key facts
- RepoMirage is built on SWE-Bench Verified.
- The suite has two stages: RepoMirage-Perturb and RepoMirage-Extend.
- Three types of semantics-preserving perturbations are applied.
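- Agent performance drops noticeably when a correct solution requires broader repository context.
- RepoMirage-Extend turns structural bottlenecks into explicit tasks beyond issue resolution, where average performance drops further.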
- The study questions whether benchmark success reflects true reasoning.
- Code agents currently perform well on repository-level benchmarks.
- Perturbations are used as diagnostic tools.