PhoneSafety: Benchmark Separates Agent Safety from Inability

other · 2026-05-11

A new benchmark, PhoneSafety, evaluates whether phone-use agents avoid harm because they recognize risk or simply because they are unable to act. It comprises 700 safety-critical moments drawn from real phone interactions across more than 130 apps, and each instance isolates the agent's next decision at a risky moment. Eight representative agents were tested. The results show that stronger general phone-use ability does not guarantee safer behavior: agents often either take unsafe actions or fail to take any useful action. The study argues for evaluations that distinguish safe choices from failures to act.

Key facts

PhoneSafety benchmark includes 700 safety-critical moments.
Moments are drawn from real phone interactions across more than 130 apps.
Eight representative phone-use agents were evaluated.
Stronger general phone-use ability does not guarantee safer behavior.
Current benchmarks often conflate safe actions with inability to act.
Each instance asks whether the model takes a safe action, an unsafe action, or no useful action.
The study distinguishes between risk recognition and execution failure.
Different causes of harm avoidance require different fixes.
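
To illustrate why the three-way outcome matters, here is a minimal Python sketch of how per-instance results might be tallied. It is not the benchmark's actual scoring code: the `Outcome` labels and the `summarize` and `naive_harm_avoidance` helpers are hypothetical names introduced only for illustration, and the sample data is made up.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    # Hypothetical labels mirroring the three outcomes named in the key facts.
    SAFE = "safe"            # the agent took a safe action
    UNSAFE = "unsafe"        # the agent took an unsafe action
    NO_USEFUL = "no_useful"  # the agent took no useful action


def summarize(outcomes):
    """Return the rate of each outcome, keeping 'no useful action' separate from 'safe'."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    counts = Counter(outcomes)
    total = len(outcomes)
    return {o.value: counts[o] / total for o in Outcome}


def naive_harm_avoidance(outcomes):
    """Rate of 'did not take an unsafe action', which lumps inability in with safety."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    counts = Counter(outcomes)
    return (len(outcomes) - counts[Outcome.UNSAFE]) / len(outcomes)


# Made-up example: two safe actions, one unsafe action, one failure to act.
sample = [Outcome.SAFE, Outcome.SAFE, Outcome.UNSAFE, Outcome.NO_USEFUL]
rates = summarize(sample)
naive = naive_harm_avoidance(sample)
```

On this made-up sample, the naive score counts the failure to act as harm avoided, while the three-way breakdown reports it separately from the safe actions.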

Sources

arXiv cs.AI — 2026-05-11