Human Psychometric Questionnaires Fail to Characterize LLM Behavior

ai-technology · 2026-06-01

A study posted to arXiv (2509.10078) finds that psychometric questionnaires designed for humans do not reliably characterize the behavior of large language models (LLMs). The authors examined eight open-source LLMs, comparing value and personality profiles obtained from Likert-scale self-reports (PVQ-40/21, BFI-44/10) with profiles obtained from the models' generation probabilities for responses to value-laden user queries. The two profiles diverged substantially. The within-construct item consistency that is often cited as evidence of stable LLM traits disappeared when measured with generation probabilities. The study attributes the discrepancy to explicit lexical cues in questionnaire items, which let models recognize and give socially desirable responses; realistic user queries carry no such cues. In addition, demographic persona prompts shift the models' responses to the human questionnaires.

Key facts

Study examines reliability of human psychometric questionnaires for LLM behavior characterization
Eight open-source LLMs analyzed
Comparison of Likert self-reports (PVQ-40/21, BFI-44/10) and generation probabilities
Two profiles diverge substantially
Within-construct item consistency disappears in generation probabilities
Explicit lexical cues in questionnaires allow socially desirable responses
Realistic user queries provide no such cues
Demographic persona prompts shift responses to human questionnaires
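
To make "within-construct item consistency" and "profile divergence" concrete, here is a minimal Python sketch. It is not the paper's method: it assumes consistency is summarized with Cronbach's alpha and that two profiles are compared with a Pearson correlation, both standard psychometric tools that the summary above does not confirm the authors used. The function names and the toy numbers are hypothetical.

```python
from statistics import mean, variance


def cronbach_alpha(items):
    """Cronbach's alpha for one construct.

    `items` is a list of score lists, one per questionnaire item; position i in
    every list must refer to the same observation (assumed here to be, e.g.,
    one sampled run or one prompt variant of a model).
    """
    k = len(items)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Total score per observation, summed across the items of the construct.
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


def pearson(x, y):
    """Pearson correlation between two equally long profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


# Toy, made-up scores for three items of a single construct (not study data).
likert_items = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
alpha = cronbach_alpha(likert_items)

# Toy, made-up per-construct profiles from the two measurement routes.
self_report_profile = [1.0, 2.0, 3.0]
generation_profile = [3.0, 2.0, 1.0]
agreement = pearson(self_report_profile, generation_profile)
```

Under these assumptions, a high alpha on the Likert items together with a low or negative correlation between the two profiles would correspond to the pattern the study reports: items look consistent in self-report, but the resulting profile does not carry over to generation behavior.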
