CoCoReviewBench: Benchmark for Evaluating AI Reviewers
Researchers have introduced CoCoReviewBench, a benchmark for evaluating AI reviewers along two axes: completeness and correctness. Because human reviews are unreliable as gold references, the benchmark builds category-specific subsets, treats reviewer-author-meta-review discussions as expert annotations, filters out unreliable reviews, and skips evaluation where human reviews are missing. CoCoReviewBench curates 3,900 papers from ICLR and NeurIPS, enabling fine-grained evaluation. Analysis shows that AI reviewers fall short on correctness and are prone to hallucinations, and that reasoning models are more effective.
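The curation step described above can be read as a simple filtering pass over paper records. The sketch below illustrates that idea in Python under assumed field names (human_reviews, confirmed_in_discussion, and category are hypothetical labels, not taken from the paper); it is not the benchmark's actual pipeline.

```python
from collections import defaultdict

def build_subsets(papers):
    """Group papers into category-specific subsets, skipping papers whose
    human reviews are missing or judged unreliable (a minimal sketch)."""
    subsets = defaultdict(list)
    for paper in papers:
        reviews = paper.get("human_reviews", [])
        if not reviews:
            continue  # skip evaluation when human reviews are missing
        # Keep only reviews corroborated by the reviewer-author-meta-review
        # discussion, which serves as the expert annotation.
        reliable = [r for r in reviews if r.get("confirmed_in_discussion")]
        if not reliable:
            continue  # filter out papers left with only unreliable reviews
        subsets[paper["category"]].append({**paper, "human_reviews": reliable})
    return dict(subsets)
```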
Key facts
- CoCoReviewBench is a benchmark for AI reviewers.
- It focuses on completeness and correctness.
- Human reviews are unreliable as gold references.
- Category-specific subsets are used.
- Evaluation is skipped when human reviews are missing.
- Reviewer-author-meta-review discussions serve as expert annotations.
- 3,900 papers from ICLR and NeurIPS are curated.
- AI reviewers are prone to hallucinations.
Entities
Conferences
- ICLR
- NeurIPS