ARTFEED — Contemporary Art Intelligence

ResearchArena Benchmarks AI Research Paper Quality

ai-technology · 2026-05-20

A recent study introduces ResearchArena, a streamlined framework that lets off-the-shelf AI agents run the entire research process (ideation, experimentation, writing, and self-improvement) with minimal oversight. The platform was tested on 13 computer science seed topics with three trials per agent-seed combination, yielding 117 papers. The agents assessed were Claude Code with Opus 4.6, Codex with GPT-5.4, and Kimi Code with K2.5. Papers were evaluated from three perspectives: a manuscript-only review (SAR), an artifact-aware peer review (PR), and a human-led meta-review. Under SAR, Claude Code scored highest, surpassing Analemma's FARS and matching human-written papers. The study highlights how much quality varies among automated research outputs and the need for systematic evaluation, and it offers a benchmark for future automated research systems.
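The paper count follows from the experimental grid: 3 agents, 13 seed topics, and 3 trials for each pairing. A minimal sketch of that arithmetic is below. The seed topics are not named in this summary, so they are represented by placeholder indices, and the variable names are illustrative rather than taken from the study.

```python
from itertools import product

# Agents as listed in the summary.
agents = ["Claude Code (Opus 4.6)", "Codex (GPT-5.4)", "Kimi Code (K2.5)"]

# Seed topics are not named in the summary; indices stand in for them (assumption).
num_seeds = 13
trials_per_pair = 3

# One run per (agent, seed, trial) combination, each producing one paper.
runs = list(product(agents, range(num_seeds), range(trials_per_pair)))
total_papers = len(runs)

print(total_papers)  # 3 * 13 * 3 = 117
```

Each agent therefore accounts for 39 of the 117 papers (13 seeds times 3 trials).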

Key facts

  • ResearchArena is a minimal scaffold for autonomous AI research.
  • Three agents tested: Claude Code (Opus 4.6), Codex (GPT-5.4), Kimi Code (K2.5).
  • 117 agent-generated papers produced across 13 seeds, 3 agents, and 3 trials per agent-seed pair.
  • Evaluation includes SAR, artifact-aware peer review, and human meta-review.
  • Claude Code scored highest under SAR, outperforming Analemma's FARS.
  • Auto-research systems can produce complete papers but quality varies.
  • The field lacks systematic study of agent-generated paper quality.
  • ResearchArena provides a benchmark for future automated research systems.

Entities

Institutions

  • arXiv
  • ResearchArena
  • Claude Code
  • Opus 4.6
  • Codex
  • GPT-5.4
  • Kimi Code
  • K2.5
  • Analemma
  • FARS
