ResearchArena Benchmarks AI Research Paper Quality

ai-technology · 2026-05-20

A recent study introduces ResearchArena, a minimal scaffold that lets off-the-shelf AI agents carry out the entire research process (ideation, experimentation, writing, and self-improvement) with little human oversight. The platform was tested on 13 computer science seed topics with three agents: Claude Code with Opus 4.6, Codex with GPT-5.4, and Kimi Code with K2.5. Three trials per agent-seed combination yielded 117 papers. The papers were evaluated from three perspectives: a manuscript-only review (SAR), an artifact-aware peer review (PR), and a human-led meta-review. Claude Code scored highest under SAR, surpassing Analemma's FARS and matching human-written papers. The study highlights how much quality varies among automated research outputs, which the field has not yet examined systematically, and offers ResearchArena as a benchmark for future automated research systems.
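The paper count follows from the experimental grid: 3 agents × 13 seeds × 3 trials = 117. A minimal sketch of that enumeration is shown here. The agent names come from the summary, but the seed labels are hypothetical placeholders, since the 13 topics are not listed.

```python
from itertools import product

# Agent names are taken from the summary above.
AGENTS = ["Claude Code (Opus 4.6)", "Codex (GPT-5.4)", "Kimi Code (K2.5)"]

# Hypothetical placeholder labels: the actual 13 seed topics are not named here.
SEEDS = [f"seed-{i:02d}" for i in range(1, 14)]

# Three independent trials per agent-seed combination.
TRIALS = [1, 2, 3]

# One run (and so one paper) per (agent, seed, trial) triple.
runs = list(product(AGENTS, SEEDS, TRIALS))

print(len(runs))  # 117
```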

Key facts

ResearchArena is a minimal scaffold for autonomous AI research.
Three agents tested: Claude Code (Opus 4.6), Codex (GPT-5.4), Kimi Code (K2.5).
117 agent-generated papers produced across 3 agents, 13 seed topics, and 3 trials each.
Evaluation includes manuscript-only review (SAR), artifact-aware peer review (PR), and human-led meta-review.
Claude Code scored highest under SAR, outperforming Analemma's FARS and matching human-written papers.
Auto-research systems can produce complete papers but quality varies.
The field lacks systematic study of agent-generated paper quality.
ResearchArena provides a benchmark for future automated research systems.

Entities

Institutions

arXiv
ResearchArena
Claude Code
Opus 4.6
Codex
GPT-5.4
Kimi Code
K2.5
Analemma
FARS

Sources

arXiv cs.AI — 2026-05-20