AI Peer Review Systems Are Easily Gameable, Study Finds
A position paper on arXiv (2605.03202) argues that large language models should not be used for peer review without rigorous evaluation. The authors empirically compare human- and AI-generated reviews for ICLR 2026 and test how automated paper rewriting affects different AI reviewers. They identify two major concerns. First, AI reviewers exhibit a hivemind effect: excessive agreement within and across papers, which reduces the diversity of perspectives that peer review depends on. Second, AI review scores can be manipulated through paper laundering: rewriting a paper with an LLM significantly raises AI reviewer scores, suggesting that stylistic changes can outweigh scientific merit. The authors conclude that non-gameability and review diversity are necessary but not sufficient conditions for automating peer review, and they urge caution in deploying AI reviewers.
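To make the hivemind concern concrete, here is a minimal sketch of one way excessive agreement could be measured: comparing the per-paper spread of scores between a human panel and an AI panel. The scores and function names below are invented for illustration; they are not data or code from the paper.

```python
# Hypothetical sketch: quantifying a "hivemind effect" as low
# within-paper score spread. All scores below are invented.
from statistics import mean, stdev

# Each inner list holds the scores several reviewers gave one paper.
human_scores = [[3, 6, 8], [5, 2, 7], [6, 9, 4]]
ai_scores = [[5, 6, 5], [5, 5, 6], [6, 5, 5]]

def mean_within_paper_spread(panel: list[list[int]]) -> float:
    """Average per-paper standard deviation; lower means more agreement."""
    return mean(stdev(scores) for scores in panel)

print(f"human panel spread: {mean_within_paper_spread(human_scores):.2f}")
print(f"AI panel spread:    {mean_within_paper_spread(ai_scores):.2f}")
```

A markedly lower spread for the AI panel across many papers would be the kind of homogeneity the authors flag as reducing diversity of perspectives.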
Key facts
- arXiv:2605.03202 is a position paper arguing against using AI for peer review without rigorous evaluation.
- The authors empirically compared human- and AI-generated reviews of ICLR 2026 submissions.
- AI reviewers show a hivemind effect of excessive agreement within and across papers.
- AI review scores are gameable through paper laundering: rewriting a paper with an LLM increases scores (a sketch of this check follows the list).
- Stylistic changes, not scientific results, can manipulate AI reviewers.
- Non-gameability and review diversity are necessary but not sufficient for automation.
- The study evaluates the effect of automated paper rewriting on different AI reviewers.
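As noted in the list above, a minimal sketch of the paper-laundering check follows. The `ai_review_score` and `llm_rewrite` callables are hypothetical stand-ins for whatever reviewer model and rewriting pipeline an experiment would actually use; this is not the authors' code.

```python
# Hypothetical sketch of a paper-laundering check: score each paper
# before and after an LLM rewrite and report the average score shift.
from statistics import mean
from typing import Callable

def laundering_gain(
    papers: list[str],
    ai_review_score: Callable[[str], float],  # stand-in AI reviewer
    llm_rewrite: Callable[[str], str],        # stand-in LLM rewriter
) -> float:
    """Mean score change caused purely by rewriting the text."""
    return mean(
        ai_review_score(llm_rewrite(p)) - ai_review_score(p)
        for p in papers
    )

if __name__ == "__main__":
    # Toy stand-ins, purely to make the sketch executable: the "reviewer"
    # scores by length and the "rewriter" pads the text.
    demo_papers = ["short draft", "a somewhat longer draft"]
    score = lambda text: min(10.0, len(text) / 5)
    rewrite = lambda text: text + " with polished, verbose phrasing"
    print(f"mean gain: {laundering_gain(demo_papers, score, rewrite):+.2f}")
```

A gain consistently above zero would indicate the reviewer rewards style rather than scientific merit, which is the gameability concern the paper raises.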
Entities
Organizations
- arXiv (preprint repository)
- ICLR (machine learning conference)