Monitoring-Control Gap in Retrieval-Augmented LLMs
A new study posted to arXiv reports a critical flaw in retrieval-augmented large language models (LLMs): they can detect contradictory evidence but fail to resolve it safely in multi-turn interactions. The research covers four model families ranging from 1.5B to 32B parameters and more than 50,000 turn-level evaluations, and it shows that single-turn diagnostics overestimate the safety of retrieval-augmented generation (RAG). This monitoring-control gap means that a model's acknowledgement of a contradiction is uncorrelated with whether it resolves that contradiction safely, a pattern corroborated by human validation. No universal prompt fix exists, and hidden-state probing and attention analysis provide mechanistic evidence for the findings.
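To make the "uncorrelated" claim concrete, here is a minimal sketch of one way to measure the relationship between acknowledgement and safe resolution across turn-level records. It is not the paper's method: the `TurnRecord` type, its field names, and the choice of the phi coefficient (the Pearson correlation for two binary variables) are all assumptions made for illustration, and the data is a toy example.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class TurnRecord:
    """One hypothetical turn-level evaluation (field names are assumptions)."""
    acknowledged: bool      # model explicitly noted the contradictory evidence
    resolved_safely: bool   # model's answer handled the conflict safely


def phi_coefficient(records):
    """Phi coefficient between acknowledgement and safe resolution."""
    # Build the 2x2 contingency table.
    n11 = sum(r.acknowledged and r.resolved_safely for r in records)
    n10 = sum(r.acknowledged and not r.resolved_safely for r in records)
    n01 = sum((not r.acknowledged) and r.resolved_safely for r in records)
    n00 = sum((not r.acknowledged) and (not r.resolved_safely) for r in records)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        # Phi is undefined when either variable is constant; return 0.0 here.
        return 0.0
    return (n11 * n00 - n10 * n01) / denom


def safe_rate(records, acknowledged):
    """Share of safe resolutions among turns with the given acknowledgement flag."""
    subset = [r for r in records if r.acknowledged == acknowledged]
    if not subset:
        return 0.0
    return sum(r.resolved_safely for r in subset) / len(subset)


# Toy data: acknowledgement tells us nothing about safe resolution.
toy = (
    [TurnRecord(True, True)] * 3
    + [TurnRecord(True, False)] * 3
    + [TurnRecord(False, True)] * 3
    + [TurnRecord(False, False)] * 3
)

if __name__ == "__main__":
    print("phi:", phi_coefficient(toy))
    print("safe rate | acknowledged:", safe_rate(toy, True))
    print("safe rate | not acknowledged:", safe_rate(toy, False))
```

On this toy set the safe-resolution rate is the same whether or not the model acknowledged the contradiction, so phi is 0. A gap of the kind described above would show up in real evaluations as a phi near zero despite a high acknowledgement rate.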
Key facts
- arXiv paper 2605.27157
- Four model families tested (1.5B-32B parameters)
- Over 50,000 turn-level evaluations
- Single-turn diagnostics overestimate RAG safety
- Contradiction acknowledgement uncorrelated with safe resolution
- No universal prompt fix exists
- Hidden-state probing and attention analysis used
- Human validation corroborated the pattern
Entities
Institutions
- arXiv