AgentLens Reveals Lucky Pass Problem in SWE Agent Evaluation
A recent arXiv paper (2605.12925) presents AgentLens, a framework for assessing software engineering (SWE) agent trajectories at the process level. The study analyzes 2,614 OpenHands trajectories from eight model backends on 60 SWE-bench Verified tasks. Of these tasks, 47 have enough passing trajectories to build process references, which yields a 1,815-trajectory evaluation subset. Within that subset, 10.7% of passing trajectories are 'Lucky Passes': the agent reaches a passing result despite regression cycles, blind retries, missing verification, or temporally disordered exploration, rather than through a systematic process. The authors also release AgentLens-Bench, a dataset of 1,815 trajectories annotated with quality scores, waste signals, and divergence points.
Key facts
- arXiv paper 2605.12925 introduces the AgentLens framework
- Evaluates 2,614 OpenHands trajectories from eight model backends
- Uses 60 SWE-bench Verified tasks
- 47 tasks have enough passing trajectories for process references
- 1,815-trajectory evaluation subset
- 10.7% of passing trajectories are Lucky Passes
- Lucky Pass patterns include regression cycles, blind retries, missing verification, and temporally disordered exploration
- AgentLens-Bench dataset released with quality scores, waste signals, divergence points
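To make the 10.7% figure concrete, the following is a minimal Python sketch of how a Lucky Pass rate could be computed over annotated trajectories. Everything in it is an assumption for illustration: the `Trajectory` record, the signal names, and the rule that a single waste signal is enough to flag a passing run are hypothetical and are not taken from the paper or from the AgentLens-Bench schema.

```python
from dataclasses import dataclass, field

# Hypothetical signal labels, named after the patterns listed above.
# The real AgentLens-Bench annotations may use different names and structure.
WASTE_SIGNALS = {
    "regression_cycle",
    "blind_retry",
    "missing_verification",
    "disordered_exploration",
}


@dataclass
class Trajectory:
    """Hypothetical record for one agent run on one task."""
    task_id: str
    passed: bool
    signals: set = field(default_factory=set)


def is_lucky_pass(traj: Trajectory) -> bool:
    # Assumed rule: a run is "lucky" if it passes while showing
    # at least one waste signal. The paper's criterion may be stricter.
    return traj.passed and bool(traj.signals & WASTE_SIGNALS)


def lucky_pass_rate(trajectories: list) -> float:
    """Share of passing trajectories that are flagged as Lucky Passes."""
    passing = [t for t in trajectories if t.passed]
    if not passing:
        return 0.0
    return sum(is_lucky_pass(t) for t in passing) / len(passing)


if __name__ == "__main__":
    sample = [
        Trajectory("task-a", True),
        Trajectory("task-a", True, {"blind_retry"}),
        Trajectory("task-b", False, {"regression_cycle"}),
        Trajectory("task-b", True, {"missing_verification", "blind_retry"}),
    ]
    print(f"Lucky Pass rate: {lucky_pass_rate(sample):.1%}")
```

The denominator is passing trajectories only, which matches how the 10.7% figure is stated; failing runs with waste signals do not count toward the rate.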
Entities
Institutions
- arXiv
- OpenHands
- SWE-bench