ResearchArena Benchmarks AI Research Paper Quality
A recent study introduces ResearchArena, a minimal scaffold that lets off-the-shelf AI agents carry out the entire research process (ideation, experimentation, writing, and self-improvement) with minimal oversight. The platform was tested on 13 computer science seed topics with three agents and three trials per agent-seed combination, yielding 117 papers. The agents assessed were Claude Code with Opus 4.6, Codex with GPT-5.4, and Kimi Code with K2.5. The papers were evaluated from three perspectives: a manuscript-only review (SAR), an artifact-aware peer review (PR), and a human-led meta-review. Claude Code scored highest under SAR, outperforming Analemma's FARS and matching human-written papers. The study highlights how much quality varies across automated research outputs and how little systematic evaluation of agent-generated papers exists, and it positions ResearchArena as a benchmark for future automated research systems.
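The paper count follows from the experimental grid: 3 agents, 13 seed topics, and 3 trials per agent-seed pair. The sketch below enumerates that grid to make the arithmetic explicit. It is illustrative only and not taken from the study: the agent names come from the summary, while the seed labels and the function name `build_run_grid` are placeholders, since the 13 seed topics are not listed here.

```python
from itertools import product

# Agent names are taken from the summary above.
AGENTS = ["Claude Code (Opus 4.6)", "Codex (GPT-5.4)", "Kimi Code (K2.5)"]

# Hypothetical labels: the actual 13 seed topics are not named in this summary.
SEEDS = [f"seed-{i:02d}" for i in range(1, 14)]

TRIALS_PER_PAIR = 3


def build_run_grid(agents, seeds, trials):
    """Return one (agent, seed, trial) tuple per paper-producing run."""
    return list(product(agents, seeds, range(1, trials + 1)))


runs = build_run_grid(AGENTS, SEEDS, TRIALS_PER_PAIR)
print(len(runs))  # 3 agents * 13 seeds * 3 trials = 117
```

Each tuple corresponds to one generated paper, so the grid size matches the 117 papers reported.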
Key facts
- ResearchArena is a minimal scaffold for autonomous AI research.
- Three agents tested: Claude Code (Opus 4.6), Codex (GPT-5.4), Kimi Code (K2.5).
- 117 agent-generated papers produced: 13 seeds × 3 agents × 3 trials each.
- Evaluation includes SAR, artifact-aware peer review, and human meta-review.
- Claude Code scored highest under SAR, outperforming Analemma's FARS.
- Auto-research systems can produce complete papers but quality varies.
- The field lacks systematic study of agent-generated paper quality.
- ResearchArena provides a benchmark for future automated research systems.
Entities
Institutions
- arXiv
- ResearchArena
- Claude Code
- Opus 4.6
- Codex
- GPT-5.4
- Kimi Code
- K2.5
- Analemma
- FARS