Longer Contexts Degrade AI Monitor Performance on Dangerous Actions
A new study reveals that frontier AI models fail to detect dangerous actions more frequently in longer transcripts. Researchers found that Opus 4.6, GPT 5.4, and Gemini 3.1 miss subtly dangerous coding actions 2 to 30 times more often after 800K tokens of benign activity compared to short contexts. The paper, titled "Classifier Context Rot: Monitor Performance Degrades with Context Length," highlights a critical weakness in monitoring agents for dangerous behavior, as prior benchmarks rarely exceeded 100K tokens. Prompting techniques like periodic reminders partially mitigate the issue, but evaluations ignoring long-context degradation likely overestimate monitor performance.
Key facts
- Current frontier models fail to notice dangerous actions more often in longer transcripts.
- Opus 4.6, GPT 5.4, and Gemini 3.1 miss dangerous actions 2 to 30 times more often after 800K tokens of benign activity.
- Prior agent monitoring benchmarks rarely contain transcripts longer than 100K tokens.
- Weaknesses can be partially mitigated with prompting techniques such as periodic reminders.
- Monitor evaluations that do not consider long-context degradation are likely overestimating monitor performance.
- The study focuses on classifying transcripts that often exceed 500K tokens.
- The paper is titled 'Classifier Context Rot: Monitor Performance Degrades with Context Length'.
- The research was published on arXiv.
Entities
Institutions
- arXiv