AMEL: How Prior Chat History Biases LLM Judgments
A recent investigation published on arXiv (2605.22714) indicates that large language models (LLMs) function as automated evaluators with a notable bias influenced by the tone of previous conversations. This effect, identified as the accumulated message effect on LLM judgments (AMEL), was analyzed through 75,898 API calls involving 11 models from OpenAI, Anthropic, Google, and four open-source alternatives. The same test items were evaluated either alone or following dialogues filled with predominantly positive or negative feedback. Findings demonstrate that models align with the dominant sentiment of the conversation (d = -0.17, p < 10^-46). This bias is most pronounced in instances where the model's initial certainty is low (d = -0.34 for high-entropy items compared to d = -0.15 for deterministic ones). Interestingly, the degree of bias remains consistent regardless of the context length: five prior exchanges yield the same shift as fifty (Spearman |r| < 0.01; OLS slope p = 0). These results carry significant implications for LLM applications in code review, content moderation, and assessment.
Key facts
- Study on accumulated message effect on LLM judgments (AMEL)
- 75,898 API calls to 11 models from 4 providers
- Models: OpenAI, Anthropic, Google, and four open-source models
- Identical test items presented in isolation or after biased histories
- Models shift toward prevailing polarity (d = -0.17, p < 10^-46)
- Effect concentrates on uncertain items (d = -0.34 for high-entropy)
- Bias does not grow with context length (5 vs 50 turns same shift)
- Implications for automated evaluation in code review, moderation, scoring
Entities
Institutions
- OpenAI
- Anthropic