AI Agents as Jurors: Multi-Agent LLM Deliberation Tested on '12 Angry Men'

ai-technology · 2026-05-06

A recent arXiv paper (2605.01986) uses the plot of Sidney Lumet's 1957 film '12 Angry Men' as a benchmark for multi-agent LLM deliberation. Twelve AI agents, each conditioned on a persona faithful to one of the film's jurors, deliberate over a murder trial. The study tests two models, GPT-4o (closed-source, heavily aligned) and Llama-4-Scout (open-weight, less aligned), under three conditions (baseline, open-minded prompt, no initial vote), with three replications per model-condition cell, for 2 × 3 × 3 = 18 runs in total. Of those 18 runs, 17 end in a hung jury: the minority-to-majority persuasion at the heart of the film almost never occurs, and the paper identifies anchoring as the dominant failure mode.

Key facts

Paper uses '12 Angry Men' as a multi-agent benchmark for LLM deliberation.
Twelve AI agents are conditioned on film-faithful personas.
Models tested: GPT-4o and Llama-4-Scout.
Three conditions: baseline, open-minded prompt, no initial vote.
18 runs total (2 models × 3 conditions, N=3 per cell).
17 of 18 runs end in a hung jury.
Minority-to-majority persuasion almost never occurs.
Anchoring identified as dominant failure mode.
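
The run count and hung-jury share in the facts above follow directly from the design grid. The following is a minimal Python sketch of that arithmetic, not code from the paper: the names `MODELS`, `CONDITIONS`, `REPLICATIONS`, and `design_grid` are hypothetical, and the only figures used are the ones reported in this summary.

```python
from itertools import product

# Factors as reported in the summary; variable names are illustrative only.
MODELS = ["GPT-4o", "Llama-4-Scout"]
CONDITIONS = ["baseline", "open-minded prompt", "no initial vote"]
REPLICATIONS = 3  # N=3 per model-condition cell


def design_grid():
    """Enumerate every (model, condition, replication) run in the design."""
    return [
        (model, condition, rep)
        for model, condition in product(MODELS, CONDITIONS)
        for rep in range(1, REPLICATIONS + 1)
    ]


runs = design_grid()
total_runs = len(runs)  # 2 models x 3 conditions x 3 replications

# Reported outcome: 17 of the 18 runs ended in a hung jury.
hung_runs = 17
hung_rate = hung_runs / total_runs

print(f"{total_runs} runs, {hung_rate:.1%} hung")
```

With only three runs per cell, the per-cell counts are too small to compare models or conditions with much confidence; the aggregate 17-of-18 figure is the more robust signal.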

Entities

Artists

Sidney Lumet

Institutions

arXiv

Sources

arXiv cs.AI — 2026-05-05