Study Questions Reliability of LLMs for Assessing User States
A recent study published on arXiv questions the assumption that large language models (LLMs) can reliably assess user states in conversational and adaptive systems. Titled "Can We Trust AI-Inferred User States," the paper empirically examines the psychometric reliability of AI-derived measures, applying replication evaluation procedures to three bimodal LLMs: GPT-4o audio, Gemini 2.0 Flash, and Gemini 2.5 Flash. The authors distinguish the reliability of individual scores from the reliability of aggregated scores and find that metric reliability is not a default property in interpretive domains. The instability observed at the individual score level raises concerns for real-time adaptation that depends on these measures.
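The distinction between individual and aggregated reliability can be illustrated with a small simulation. The sketch below is not the paper's procedure: it uses synthetic scores, estimates single-run reliability as the mean pairwise Pearson correlation between replicated scoring runs, and projects the reliability of a k-run average with the classical Spearman-Brown formula. All names (`latent`, `runs`, `single_run_reliability`) and all parameter values are assumptions chosen for illustration.

```python
import random
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def single_run_reliability(runs):
    """Mean pairwise correlation between replicated scoring runs."""
    return mean(pearson(a, b) for a, b in combinations(runs, 2))


def spearman_brown(r, k):
    """Projected reliability of the average of k runs, given single-run reliability r."""
    return k * r / (1 + (k - 1) * r)


# Synthetic data (assumption): 40 conversations, each scored 5 times by the
# same hypothetical model, with run-to-run noise around a latent user state.
rng = random.Random(0)
n_items, k_runs = 40, 5
latent = [rng.uniform(1, 7) for _ in range(n_items)]
runs = [[s + rng.gauss(0, 1.5) for s in latent] for _ in range(k_runs)]

r_single = single_run_reliability(runs)
r_aggregated = spearman_brown(r_single, k_runs)

print(f"single-run reliability:    {r_single:.2f}")
print(f"aggregated ({k_runs} runs) estimate: {r_aggregated:.2f}")
```

Under these assumptions the averaged score is always at least as reliable as a single run, which is why a measure can look acceptable in aggregate while any one score remains too noisy to act on.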
Key facts
- Paper published on arXiv with ID 2605.15734
- Focuses on psychometric reliability of AI measures of user states
- Evaluates three bimodal LLMs: GPT-4o audio, Gemini 2.0 Flash, Gemini 2.5 Flash
- Uses replication evaluation procedures
- Distinguishes individual score reliability from aggregated reliability
- Finds that metric reliability is not a default property in interpretive domains
- Finds a lack of stability at the individual score level
- Implications for real-time adaptation in conversational systems
Entities
Institutions
- arXiv