CoCoReviewBench: Benchmark for Evaluating AI Reviewers
Researchers have introduced CoCoReviewBench, a benchmark for evaluating AI reviewers along two axes: completeness and correctness. Because human reviews are unreliable as gold references, the benchmark builds category-specific subsets, treats reviewer-author-meta-review discussions as expert annotations, filters out unreliable reviews, and skips evaluation where human reviews are missing. CoCoReviewBench curates 3,900 papers from ICLR and NeurIPS, enabling fine-grained evaluation. Analysis shows that AI reviewers fall short on correctness and are prone to hallucinations, and that reasoning models are more effective.
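The curation step described above can be read as a simple filtering pass over paper records. The sketch below illustrates that idea in Python under assumed field names (human_reviews, confirmed_in_discussion, and category are hypothetical labels, not taken from the paper); it is not the benchmark's actual pipeline.

```python
from collections import defaultdict

def build_subsets(papers):
    """Group papers into category-specific subsets, skipping papers whose
    human reviews are missing or judged unreliable (a minimal sketch)."""
    subsets = defaultdict(list)
    for paper in papers:
        reviews = paper.get("human_reviews", [])
        if not reviews:
            continue  # skip evaluation when human reviews are missing
        # Keep only reviews corroborated by the reviewer-author-meta-review
        # discussion, which serves as the expert annotation.
        reliable = [r for r in reviews if r.get("confirmed_in_discussion")]
        if not reliable:
            continue  # filter out papers left with only unreliable reviews
        subsets[paper["category"]].append({**paper, "human_reviews": reliable})
    return dict(subsets)
```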
Key facts
- CoCoReviewBench is a benchmark for AI reviewers.
- It focuses on completeness and correctness.
- Human reviews are unreliable as gold references.
- Category-specific subsets are used.
- Evaluation is skipped when human reviews are missing.
- Reviewer-author-meta-review discussions serve as expert annotations.
- 3,900 papers from ICLR and NeurIPS are curated.
- AI reviewers are prone to hallucinations.
Entities
Conferences
- ICLR
- NeurIPS