ARTFEED — Contemporary Art Intelligence

CoCoReviewBench: Benchmark for Evaluating AI Reviewers

ai-technology · 2026-05-11

Researchers have introduced CoCoReviewBench, a benchmark designed to evaluate AI reviewers along two axes: completeness and correctness. To work around the unreliability of human reviews as gold references, the benchmark builds category-specific subsets and skips evaluation for categories where human reviews are missing. It leverages reviewer-author-meta-review discussions as expert annotations and filters out unreliable reviews. CoCoReviewBench curates 3,900 papers from ICLR and NeurIPS, enabling fine-grained evaluation. Analysis reveals that AI reviewers are limited in correctness and prone to hallucinations, and that reasoning models prove more effective.
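
To make the protocol concrete, the sketch below shows what a category-aware evaluation loop with a skip-on-missing-reference rule could look like in Python. It is a minimal illustration, not CoCoReviewBench's actual code: the Review type, the evaluate function, and the toy word-overlap metric are all assumptions.

    # Illustrative sketch only; names and the scoring metric are hypothetical,
    # not CoCoReviewBench's implementation.
    from collections import defaultdict
    from dataclasses import dataclass

    @dataclass
    class Review:
        category: str          # e.g. "soundness" or "novelty" (hypothetical labels)
        text: str
        reliable: bool = True  # set by an upstream filtering step

    def score_against_gold(ai_review: Review, gold: list[Review]) -> float:
        # Toy word-overlap metric standing in for completeness/correctness checks.
        matches = sum(
            1 for g in gold
            if set(ai_review.text.lower().split()) & set(g.text.lower().split())
        )
        return matches / len(gold)

    def evaluate(ai_reviews: list[Review], human_reviews: list[Review]) -> dict[str, float]:
        # Build category-specific subsets of reliable human gold references.
        gold_by_category: dict[str, list[Review]] = defaultdict(list)
        for r in human_reviews:
            if r.reliable:                      # filter out unreliable reviews
                gold_by_category[r.category].append(r)

        scores: dict[str, float] = {}
        for ai_review in ai_reviews:
            gold = gold_by_category.get(ai_review.category)
            if not gold:
                continue                        # skip when human reviews are missing
            scores[ai_review.category] = score_against_gold(ai_review, gold)
        return scores

Computing the reliability flag upstream keeps the evaluation loop itself simple; one candidate source for that flag is sketched after the key facts below.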

Key facts

  • CoCoReviewBench is a benchmark for AI reviewers.
  • It focuses on completeness and correctness.
  • Human reviews are unreliable as gold references.
  • Category-specific subsets are used.
  • Evaluation is skipped when human reviews are missing.
  • Reviewer-author-meta-review discussions serve as expert annotations (see the filtering sketch after this list).
  • 3,900 papers from ICLR and NeurIPS are curated.
  • AI reviewers are prone to hallucinations.
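
One plausible way to read the annotation-and-filtering step above is to treat signals from the reviewer-author-meta-review thread as a reliability proxy for each human review. The sketch below is purely illustrative: the thread fields, the endorsement flag, and the 0.5 concession threshold are assumptions, not the benchmark's documented pipeline.

    # Hypothetical reliability filter driven by discussion-thread signals.
    # Field names and the 0.5 threshold are assumptions for illustration.
    from dataclasses import dataclass

    @dataclass
    class DiscussionThread:
        review_id: str
        meta_review_endorsed: bool = False  # meta-reviewer upheld the review's points
        points_raised: int = 1              # claims the reviewer made
        points_conceded: int = 0            # claims withdrawn after the author rebuttal

    def is_reliable(thread: DiscussionThread) -> bool:
        # Keep a review as a gold reference only if the meta-review endorses it
        # and the reviewer did not withdraw most claims during the discussion.
        if not thread.meta_review_endorsed:
            return False
        concession_rate = thread.points_conceded / max(thread.points_raised, 1)
        return concession_rate < 0.5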

Entities

Conferences

  • ICLR
  • NeurIPS
