Human Psychometric Questionnaires Fail to Characterize LLM Behavior
A recent study posted on arXiv (2509.10078) reports that psychometric questionnaires designed for humans do not reliably characterize the behavior of large language models (LLMs). The study examined eight open-source LLMs, comparing the value and personality profiles obtained from Likert self-reports (PVQ-40/21, BFI-44/10) with the models' generation probabilities for responses to value-oriented user queries. The two profiles diverged substantially. The within-construct item consistency that is often cited as evidence of stable LLM traits disappeared in the generation probabilities. The study attributes the discrepancy to explicit lexical cues in the questionnaire items, which let models recognize what is being measured and answer in socially desirable ways; realistic user queries provide no such cues. Demographic persona prompts also shift the models' responses to the human questionnaires.
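The two comparisons described above can be illustrated with a minimal sketch. It assumes Cronbach's alpha as the measure of within-construct item consistency and Pearson correlation as the measure of agreement between two profiles; this summary does not say which statistics the study actually used. The function names and all scores below are hypothetical toy values, not data from the paper.

```python
from statistics import mean, variance


def cronbach_alpha(rows):
    """Cronbach's alpha for one construct.

    rows: one list of item scores per respondent (or per sampled model run);
    every row must contain the same number of items.
    """
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least 2 items and 2 rows")
    # Sample variance of each item (column) across rows.
    item_vars = [variance(col) for col in zip(*rows)]
    # Sample variance of the summed score per row.
    total_var = variance([sum(r) for r in rows])
    if total_var == 0:
        raise ValueError("total score has zero variance; alpha is undefined")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def pearson(x, y):
    """Pearson correlation between two equally long score profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


# Hypothetical toy data: three items of one construct, three runs each.
# Questionnaire-style scores where the items move together.
self_report_items = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
# Generation-style scores where the items do not move together.
generation_items = [[1, 3, 2], [2, 1, 3], [3, 2, 2]]

alpha_self_report = cronbach_alpha(self_report_items)
alpha_generation = cronbach_alpha(generation_items)

# Hypothetical construct-level profiles from the two measurement routes.
profile_self_report = [4.5, 3.0, 2.0, 5.0]
profile_generation = [2.5, 4.0, 3.5, 2.0]
profile_agreement = pearson(profile_self_report, profile_generation)

print(alpha_self_report, alpha_generation, profile_agreement)
```

In this toy setup, a high alpha for the self-report items alongside a low alpha for the generation-based items would mirror the pattern the study describes, and a weak or negative profile correlation would correspond to the reported divergence between the two profiles.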
Key facts
- Study examines reliability of human psychometric questionnaires for LLM behavior characterization
- Eight open-source LLMs analyzed
- Comparison of Likert self-reports (PVQ-40/21, BFI-44/10) and generation probabilities
- The two resulting profiles diverge substantially
- Within-construct item consistency disappears in generation probabilities
- Explicit lexical cues in questionnaires allow socially desirable responses
- Realistic user queries provide no such cues
- Demographic persona prompts shift responses to human questionnaires
Entities
Institutions
- arXiv