Study Questions Reliability of LLMs for Assessing User States

ai-technology · 2026-05-18

A recent study published on arXiv challenges the assumption that large language models (LLMs) can reliably evaluate user states in conversational and adaptive systems. Titled "Can We Trust AI-Inferred User States," the paper empirically examines the psychometric reliability of AI-generated measures, applying replication-based evaluation procedures to three bimodal LLMs: GPT-4o audio, Gemini 2.0 Flash, and Gemini 2.5 Flash. The authors distinguish the reliability of individual scores from the reliability of aggregated scores and find that metric reliability is not a default property in interpretive domains. The instability observed at the individual-score level suggests that real-time adaptation driven by single scores may be hard to do dependably.

Key facts

Paper published on arXiv with ID 2605.15734
Focuses on psychometric reliability of AI measures of user states
Evaluates three bimodal LLMs: GPT-4o audio, Gemini 2.0 Flash, Gemini 2.5 Flash
Uses replication-based evaluation procedures
Distinguishes individual score reliability from aggregated reliability
Finds metric reliability not a default property in interpretive domains
Reports a lack of stability at the individual-score level
Raises implications for real-time adaptation in conversational systems
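
The distinction between individual-score and aggregated reliability can be illustrated with a small simulation. This is a minimal sketch and not the paper's procedure: it assumes a one-way random-effects intraclass correlation (ICC) as the reliability measure, and the function names, sample sizes, and noise level are hypothetical choices made for illustration. Each simulated item (for example, one conversation) is scored several times, as if the same model were queried repeatedly.

```python
import random
from statistics import mean


def icc_oneway(scores):
    """One-way random-effects ICC for a table of repeated scores.

    scores: list of rows, one row per item, each row holding k repeated
    scores for that item. Returns (single_score_icc, averaged_score_icc).
    """
    n = len(scores)
    k = len(scores[0])
    grand = mean(x for row in scores for x in row)
    row_means = [mean(row) for row in scores]
    # Mean squares from a one-way ANOVA with items as the grouping factor.
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(scores, row_means) for x in row
    ) / (n * (k - 1))
    # Reliability of one score versus the mean of k scores.
    single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    averaged = (ms_between - ms_within) / ms_between
    return single, averaged


def simulate_repeated_scores(n_items=200, k_runs=5, noise_sd=1.5, seed=0):
    """Hypothetical data: a stable per-item state plus run-to-run noise."""
    rng = random.Random(seed)
    table = []
    for _ in range(n_items):
        true_state = rng.gauss(0.0, 1.0)
        table.append([true_state + rng.gauss(0.0, noise_sd) for _ in range(k_runs)])
    return table


single, averaged = icc_oneway(simulate_repeated_scores())
print(f"single-score ICC:   {single:.2f}")
print(f"averaged-score ICC: {averaged:.2f}")
```

Under these assumptions, averaging over repeated runs yields a higher reliability estimate than any single run, which is why a metric can look acceptable in aggregate while individual scores remain too unstable to drive moment-to-moment adaptation.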
