ARTFEED — Contemporary Art Intelligence

Monitoring-Control Gap in Retrieval-Augmented LLMs

ai-technology · 2026-05-27

A new study posted to arXiv reports a critical flaw in retrieval-augmented large language models (LLMs): they can detect contradictory evidence but fail to resolve it safely in multi-turn interactions. Drawing on four model families ranging from 1.5B to 32B parameters and more than 50,000 turn-level evaluations, the research shows that single-turn diagnostics overestimate the safety of retrieval-augmented generation (RAG). This monitoring-control gap means that whether a model acknowledges a contradiction does not correlate with whether it resolves that contradiction safely, a pattern corroborated by human validation. The study finds no universal prompt fix, and mechanistic evidence from hidden-state probing and attention analysis supports the findings.
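The summary does not say how the authors quantify the link between acknowledging a contradiction and resolving it safely. As an illustration only, assuming each turn carries two binary labels (acknowledged or not, resolved safely or not), one simple measure is the phi coefficient, which is Pearson correlation applied to two binary variables. The function name and the labels below are hypothetical and are not data from the paper.

```python
import math
from typing import Sequence


def phi_coefficient(acknowledged: Sequence[bool], resolved_safely: Sequence[bool]) -> float:
    """Phi coefficient (Pearson correlation for two binary variables).

    Hypothetical helper: assumes one pair of boolean labels per evaluated turn.
    Returns NaN when either label never varies, since the correlation is undefined.
    """
    if len(acknowledged) != len(resolved_safely):
        raise ValueError("label sequences must have equal length")
    pairs = list(zip(acknowledged, resolved_safely))
    # 2x2 contingency counts: first index = acknowledged, second = resolved safely.
    n11 = sum(1 for a, r in pairs if a and r)
    n10 = sum(1 for a, r in pairs if a and not r)
    n01 = sum(1 for a, r in pairs if not a and r)
    n00 = sum(1 for a, r in pairs if not a and not r)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return math.nan
    return (n11 * n00 - n10 * n01) / denom


# Hypothetical turn labels: the model acknowledges the contradiction in half the
# turns, yet safe resolution is equally likely with or without acknowledgement.
acknowledged = [True, True, True, True, False, False, False, False]
resolved_safely = [True, False, True, False, True, False, True, False]
print(phi_coefficient(acknowledged, resolved_safely))  # 0.0
```

In this toy case a monitoring signal (acknowledgement) is present, but it carries no information about the control outcome (safe resolution), which is the shape of the gap the study describes.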

Key facts

  • arXiv paper 2605.27157
  • Four model families tested (1.5B-32B parameters)
  • Over 50,000 turn-level evaluations
  • Single-turn diagnostics overestimate RAG safety
  • Contradiction acknowledgement uncorrelated with safe resolution
  • No universal prompt fix exists
  • Hidden-state probing and attention analysis used
  • Human validation corroborated the pattern

Entities

Institutions

  • arXiv

Sources