AgentLens Reveals Lucky Pass Problem in SWE Agent Evaluation

other · 2026-05-14

A recent arXiv paper (2605.12925) presents AgentLens, a framework for assessing software engineering (SWE) agent trajectories at the process level. The study analyzes 2,614 OpenHands trajectories from eight model backends on 60 SWE-bench Verified tasks. Of those tasks, 47 have enough passing trajectories to serve as process references, which yields a 1,815-trajectory evaluation subset. Within that subset, 10.7% of passing trajectories are 'Lucky Passes': the agent reaches a passing result through regression cycles, blind retries, missing verification, or temporally disordered exploration rather than a systematic approach. The authors also release AgentLens-Bench, a dataset of 1,815 trajectories annotated with quality scores, waste signals, and divergence points.

Key facts

arXiv paper 2605.12925 introduces AgentLens framework
Evaluates 2,614 OpenHands trajectories from eight model backends
Uses 60 SWE-bench Verified tasks
47 tasks have enough passing trajectories for process references
1,815-trajectory evaluation subset
10.7% of passing trajectories are Lucky Passes
Lucky Pass signals include regression cycles, blind retries, missing verification, and temporally disordered exploration
AgentLens-Bench dataset released with quality scores, waste signals, divergence points
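
To make the 10.7% figure concrete, the sketch below shows one way a Lucky Pass rate could be computed over annotated trajectories. It is an illustration only: the `Trajectory` record, its field names, and the signal identifiers are hypothetical and are not the AgentLens-Bench schema. The rule that a passing trajectory with at least one waste signal counts as a Lucky Pass is also an assumption; the paper's actual criterion may differ.

```python
from dataclasses import dataclass, field

# Signal names follow the list above; the identifiers themselves are hypothetical.
WASTE_SIGNALS = {
    "regression_cycle",
    "blind_retry",
    "missing_verification",
    "disordered_exploration",
}


@dataclass
class Trajectory:
    """Hypothetical annotated trajectory record (not the real dataset schema)."""
    task_id: str
    passed: bool
    waste_signals: set[str] = field(default_factory=set)


def is_lucky_pass(t: Trajectory) -> bool:
    # Assumption: a passing run that shows any known waste signal is a Lucky Pass.
    return t.passed and bool(t.waste_signals & WASTE_SIGNALS)


def lucky_pass_rate(trajectories: list[Trajectory]) -> float:
    # The rate is taken over passing trajectories only, matching the
    # "10.7% of passing trajectories" phrasing in the summary.
    passing = [t for t in trajectories if t.passed]
    if not passing:
        return 0.0
    return sum(is_lucky_pass(t) for t in passing) / len(passing)


sample = [
    Trajectory("task-a", passed=True),
    Trajectory("task-a", passed=True, waste_signals={"blind_retry"}),
    Trajectory("task-b", passed=True),
    Trajectory("task-b", passed=False, waste_signals={"regression_cycle"}),
]
print(f"Lucky Pass rate: {lucky_pass_rate(sample):.1%}")
```

Failing trajectories are excluded from both the numerator and the denominator, so a failed run with waste signals does not affect the rate.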
