PhoneSafety: Benchmark Reveals Agent Safety vs. Inability
A new benchmark called PhoneSafety evaluates whether phone-use agents avoid harm because of safety awareness or simply because they are unable to act. The benchmark comprises 700 safety-critical moments drawn from real phone interactions across more than 130 apps, and each instance isolates the agent's next decision at a risky moment. Eight representative agents were tested. The results show that stronger general phone-use ability does not guarantee safer behavior: agents often either fail to act or take unsafe actions. The study highlights the need for evaluations that distinguish safe choices from failures to act.
Key facts
- PhoneSafety benchmark includes 700 safety-critical moments.
- Moments are drawn from real phone interactions across more than 130 apps.
- Eight representative phone-use agents were evaluated.
- Stronger general phone-use ability does not guarantee safer behavior.
- Current benchmarks often conflate safe actions with inability to act.
- Each instance asks whether the agent takes a safe action, an unsafe action, or no useful action.
- The study distinguishes between risk recognition and execution failure.
- Different causes of harm avoidance require different fixes.
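The three-way labeling described above can be illustrated with a minimal Python sketch. It is not the benchmark's actual scoring code: the `Outcome` labels, the function names, and the two agents' outcome lists are hypothetical and chosen only to show why a single "did not cause harm" rate can hide the difference between choosing safely and failing to act.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    """Hypothetical labels for an agent's next decision at a risky moment."""
    SAFE = "safe"
    UNSAFE = "unsafe"
    NO_USEFUL_ACTION = "no_useful_action"


def outcome_rates(outcomes):
    """Return the fraction of instances that fall into each outcome."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    counts = Counter(outcomes)
    total = len(outcomes)
    return {label.value: counts[label] / total for label in Outcome}


def harm_avoided_rate(outcomes):
    """Conflated metric: every instance that is not unsafe counts as a success."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    not_unsafe = sum(1 for o in outcomes if o is not Outcome.UNSAFE)
    return not_unsafe / len(outcomes)


# Illustrative, made-up outcomes for two hypothetical agents (10 instances each).
agent_a = [Outcome.SAFE] * 5 + [Outcome.UNSAFE] * 1 + [Outcome.NO_USEFUL_ACTION] * 4
agent_b = [Outcome.UNSAFE] * 1 + [Outcome.NO_USEFUL_ACTION] * 9

for name, outcomes in (("agent_a", agent_a), ("agent_b", agent_b)):
    print(name, harm_avoided_rate(outcomes), outcome_rates(outcomes))
```

In this toy data both agents avoid harm on 9 of 10 instances, yet only the first ever makes a safe choice; the second avoids harm only by doing nothing useful. Reporting the three rates separately keeps that difference visible.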