RPC-Bench: Benchmarking Research Paper Comprehension in AI
Researchers have unveiled RPC-Bench, a comprehensive question-answering benchmark for assessing how well foundation models comprehend research papers. Built from high-quality review-rebuttal exchanges on computer science papers, it contains 15,000 human-verified QA pairs. A fine-grained taxonomy aligned with the scientific research process organizes the questions into why, what, and how types. An LLM-human collaborative annotation framework enables large-scale labeling with quality control. Evaluation follows the LLM-as-a-Judge paradigm, scoring answers on correctness, completeness, and conciseness, and these scores correlate strongly with human assessments. Experiments show that even the strongest models struggle with the task.
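This summary does not reproduce the paper's actual judge prompts, but an LLM-as-a-Judge setup along these lines would score each answer on the three reported criteria. The rubric wording, the 1-5 scale, and the `llm_call` interface below are illustrative assumptions, not RPC-Bench's implementation.

```python
# Minimal sketch of an LLM-as-a-Judge scorer for the three criteria the
# benchmark reports. Rubric text, scale, and interface are assumptions;
# the paper's actual prompts and aggregation may differ.
import json

RUBRIC = """You are grading an answer to a question about a research paper.
Rate each criterion from 1 (poor) to 5 (excellent) and reply as JSON:
{{"correctness": 0, "completeness": 0, "conciseness": 0}}

Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
"""

def judge(llm_call, question: str, reference: str, candidate: str) -> dict:
    """Score one QA pair; `llm_call` is any prompt -> text callable."""
    prompt = RUBRIC.format(question=question, reference=reference,
                           candidate=candidate)
    raw = llm_call(prompt)        # judge model's text completion
    scores = json.loads(raw)      # assumes the judge replies with valid JSON
    return {k: int(scores[k])
            for k in ("correctness", "completeness", "conciseness")}
```

Reported benchmark scores would then come from averaging these per-pair criterion scores across the dataset and checking their correlation with human ratings.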
Key facts
- RPC-Bench is a benchmark for research paper comprehension
- Built from review-rebuttal exchanges of computer science papers
- Contains 15,000 human-verified QA pairs
- Uses a fine-grained taxonomy aligned with scientific research flow
- Assesses why, what, and how questions (illustrated in the sketch after this list)
- Employs an LLM-human collaborative annotation framework
- Evaluates answers on correctness, completeness, and conciseness via LLM-as-a-Judge
- Even strong models perform poorly on this benchmark
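For a concrete picture of the data behind these facts, one human-verified QA pair with its taxonomy label might be represented like the sketch below. Every field name and value here is a hypothetical illustration; the paper's actual schema is not reproduced in this summary.

```python
# Hypothetical record layout for one RPC-Bench QA pair; field names
# are illustrative, not the benchmark's published schema.
from dataclasses import dataclass

@dataclass
class QAPair:
    paper_id: str         # e.g. an arXiv identifier
    question: str         # drawn from a reviewer's comment
    answer: str           # grounded in the authors' rebuttal
    question_type: str    # taxonomy label: "why", "what", or "how"
    human_verified: bool  # True once an annotator has checked the pair

example = QAPair(
    paper_id="2401.00000",  # placeholder identifier
    question="Why was ablation X omitted?",
    answer="The authors state that X is subsumed by experiment Y.",
    question_type="why",
    human_verified=True,
)
```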