ARTFEED — Contemporary Art Intelligence

AgentLens Reveals Lucky Pass Problem in SWE Agent Evaluation

other · 2026-05-14

A recent arXiv paper (2605.12925) presents AgentLens, a framework for assessing software engineering (SWE) agent trajectories at the process level. The study analyzes 2,614 OpenHands trajectories from eight model backends on 60 SWE-bench Verified tasks. Among 1,815 successful trajectories drawn from 47 of those tasks, 10.7% are 'Lucky Passes': the agent reaches a passing result through trial and error rather than a systematic approach. The authors also release AgentLens-Bench, a dataset of 1,815 trajectories annotated with quality scores, waste signals, and divergence points.
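The article gives the Lucky Pass share only as a percentage. As a rough illustration, the short Python sketch below works out which absolute counts are consistent with 10.7% of 1,815 trajectories. It takes the two figures at face value and assumes the percentage was rounded to one decimal place; the function name is made up for this example and the result is not a number reported by the paper.

```python
TOTAL = 1815            # trajectories in the evaluated set (figure from the article)
REPORTED_TENTHS = 107   # 10.7%, written in tenths of a percent


def consistent_counts(total: int, tenths: int) -> list[int]:
    """Return every count k whose share of `total` rounds to tenths/10 percent.

    Integer arithmetic only: k qualifies when
    (tenths/10 - 0.05)% <= 100*k/total < (tenths/10 + 0.05)%.
    Assumption: the reported figure was rounded to one decimal place.
    """
    lo = (tenths * 10 - 5) * total   # inclusive lower bound on 10000 * k
    hi = (tenths * 10 + 5) * total   # exclusive upper bound on 10000 * k
    return [k for k in range(total + 1) if lo <= 10000 * k < hi]


print(consistent_counts(TOTAL, REPORTED_TENTHS))  # [194, 195]
```

Under those assumptions, roughly 194 or 195 of the 1,815 trajectories would count as Lucky Passes.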

Key facts

  • arXiv paper 2605.12925 introduces the AgentLens framework
  • Evaluates 2,614 OpenHands trajectories from eight model backends
  • Uses 60 SWE-bench Verified tasks
  • 47 tasks have enough passing trajectories for process references
  • Evaluation subset contains 1,815 trajectories
  • 10.7% of passing trajectories are Lucky Passes
  • Lucky Pass signals include regression cycles, blind retries, missing verification, and temporally disordered exploration
  • AgentLens-Bench dataset released with quality scores, waste signals, and divergence points
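To make two of the listed Lucky Pass signals concrete, here is a minimal Python sketch of how "blind retries" and "missing verification" might be flagged in a toy trajectory. Everything in it is an assumption made for illustration: the Step schema, the "edit"/"test"/"read" labels, and both rules are hypothetical and are not taken from AgentLens or from the OpenHands trajectory format.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One agent action in a toy trajectory (hypothetical schema)."""
    kind: str        # "edit", "test", or "read" (labels assumed for this sketch)
    payload: str     # command or patch description
    ok: bool = True  # whether the step succeeded


def blind_retries(steps: list[Step]) -> int:
    """Count steps that repeat the immediately preceding failed step verbatim."""
    count = 0
    for prev, cur in zip(steps, steps[1:]):
        if not prev.ok and (cur.kind, cur.payload) == (prev.kind, prev.payload):
            count += 1
    return count


def missing_verification(steps: list[Step]) -> bool:
    """Return True if the trajectory has an edit with no test run after the last one."""
    last_edit = max((i for i, s in enumerate(steps) if s.kind == "edit"), default=-1)
    if last_edit == -1:
        return False  # no edits were made, so there is nothing to verify
    return not any(s.kind == "test" for s in steps[last_edit + 1:])


trajectory = [
    Step("read", "src/parser.py"),
    Step("test", "pytest tests/test_parser.py", ok=False),
    Step("test", "pytest tests/test_parser.py", ok=False),  # same failing command again
    Step("edit", "patch src/parser.py"),
]

print(blind_retries(trajectory))         # 1
print(missing_verification(trajectory))  # True
```

A trajectory like this one could still pass the task's final tests, which is the kind of case a pass/fail score alone would not distinguish from a methodical fix.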

Entities

Institutions

  • arXiv
  • OpenHands
  • SWE-bench
