ARTFEED — Contemporary Art Intelligence

PhoneSafety: Benchmark Separates Agent Safety from Inability

other · 2026-05-11

A new benchmark, PhoneSafety, tests whether phone-use agents avoid harm because they recognize risk or simply because they are unable to act. It comprises 700 safety-critical moments drawn from real phone interactions across more than 130 apps, and it isolates the agent's next decision at each risky moment. Eight representative agents were tested. The results show that stronger general phone-use ability does not guarantee safer behavior: agents often either take unsafe actions or fail to act at all. The study highlights the need for evaluations that distinguish deliberate safe choices from failures to act.

Key facts

  • PhoneSafety benchmark includes 700 safety-critical moments.
  • Moments are drawn from real phone interactions across more than 130 apps.
  • Eight representative phone-use agents were evaluated.
  • Stronger general phone-use ability does not guarantee safer behavior.
  • Current benchmarks often conflate safe actions with inability to act.
  • Each instance asks whether the agent takes a safe action, an unsafe action, or no useful action.
  • The study distinguishes between risk recognition and execution failure.
  • Different causes of harm avoidance require different fixes.
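
To make the three-way distinction concrete, here is a minimal sketch of how per-instance outcomes could be tallied so that "did nothing useful" is not counted as "chose safely". It is not the benchmark's actual scoring code: the `Outcome` labels, the `summarize` and `naive_harm_avoidance` helpers, and the toy data are all hypothetical and invented for illustration.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    """Hypothetical labels for an agent's next decision at a risky moment."""
    SAFE = "safe"
    UNSAFE = "unsafe"
    NO_USEFUL_ACTION = "no_useful_action"


def summarize(outcomes):
    """Return the share of each outcome across a list of labeled instances."""
    if not outcomes:
        raise ValueError("need at least one labeled instance")
    counts = Counter(outcomes)
    total = len(outcomes)
    # Iterate over the enum so every label appears, even with a zero count.
    return {label: counts[label] / total for label in Outcome}


def naive_harm_avoidance(rates):
    """Score that only asks 'was harm avoided?', merging SAFE with inaction."""
    return 1.0 - rates[Outcome.UNSAFE]


# Toy data, invented for illustration only (not results from the study).
toy_labels = (
    [Outcome.SAFE] * 5
    + [Outcome.UNSAFE] * 2
    + [Outcome.NO_USEFUL_ACTION] * 3
)
rates = summarize(toy_labels)
print("safe rate:", rates[Outcome.SAFE])
print("inaction rate:", rates[Outcome.NO_USEFUL_ACTION])
print("naive harm-avoidance:", naive_harm_avoidance(rates))
```

On the toy data, the naive score comes out higher than the safe rate because it also credits the instances where the agent took no useful action. Reporting the three rates separately keeps those two cases apart.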
