Claude Code Auto Mode Fails Stress Test with 81% False Negative Rate
An independent assessment of Claude Code's auto mode, the first deployed permission system for AI coding agents, has uncovered an end-to-end false negative rate of 81.0% (95% CI: 73.8%-87.4%) on deliberately ambiguous authorization scenarios, far above the 17% false negative rate Anthropic reports for production traffic. Anthropic's system gates dangerous tool calls with a two-stage transcript classifier and reports a false positive rate of just 0.4% in production. The evaluation used AmPermBench, a benchmark of 128 prompts spanning four DevOps task families and three controlled ambiguity axes, scoring 253 state-changing actions individually against oracle ground truth. The results point to a substantial gap between reported production performance and behavior under stress-testing with ambiguous authorization requests.
Key facts
- Claude Code's auto mode is the first deployed permission system for AI coding agents
- System uses a two-stage transcript classifier to gate dangerous tool calls
- Anthropic reports 0.4% false positive rate and 17% false negative rate on production traffic
- Independent evaluation uses AmPermBench, a 128-prompt benchmark
- Benchmark spans four DevOps task families and three controlled ambiguity axes
- 253 state-changing actions evaluated individually against oracle ground truth
- End-to-end false negative rate is 81.0% (95% CI: 73.8%-87.4%)
- Study focuses on deliberately ambiguous authorization scenarios
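The headline figure pairs a point estimate with a 95% confidence interval for a binomial proportion. As a minimal sketch of how such an interval can be derived, the snippet below computes a Wilson score interval for roughly 205 missed actions out of 253 (81.0%); the study's exact CI method and unit of analysis are not stated here, so the resulting interval need not match the reported 73.8%-87.4%, which may come from, e.g., a per-prompt bootstrap.

```python
from math import sqrt

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion k/n at confidence z."""
    p = k / n
    denom = 1 + z**2 / n
    center = p + z**2 / (2 * n)
    margin = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - margin) / denom, (center + margin) / denom

# ~81.0% of 253 state-changing actions corresponds to about 205 false negatives
lo, hi = wilson_ci(205, 253)
print(f"FN rate: {205/253:.1%}, 95% Wilson CI: [{lo:.1%}, {hi:.1%}]")
```

The Wilson interval is preferred over the simpler normal approximation for proportions near 0 or 1, where the normal interval can extend past the [0, 1] bounds.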
Entities
Institutions
- Anthropic