AI Models Learn to Predict Research Success via Comparative Evaluation
A preprint (arXiv:2605.21491) investigates whether language models can forecast the empirical success of research ideas before any experiments are run. The authors introduce comparative empirical forecasting: given a benchmark goal and two candidate ideas, predict which one yields better performance. They constructed a dataset of 11,488 idea pairs from PapersWithCode outcomes. Off-the-shelf 8B-parameter models achieved only 30% accuracy, but supervised fine-tuning (SFT) raised accuracy to 77.1%, surpassing GPT-5's 61.1%. Models trained with reinforcement learning with verifiable rewards (RLVR) reached 71.35% accuracy while producing interpretable justifications. The study addresses a bottleneck in AI-driven research: efficiently evaluating the large number of ideas that models generate.
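To make the task format concrete, here is a minimal Python sketch of how accuracy on a comparative forecasting task could be scored. It is an illustration only, not the preprint's code: the `IdeaPair` record, the `pairwise_accuracy` function, the `always_a` baseline, and the toy pairs are all hypothetical names and invented placeholder data.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class IdeaPair:
    """One comparison item (hypothetical schema, not the dataset's real format)."""
    goal: str    # benchmark goal both ideas target
    idea_a: str  # first candidate idea
    idea_b: str  # second candidate idea
    winner: str  # "A" or "B": which idea performed better in practice


# A predictor maps (goal, idea_a, idea_b) to "A" or "B".
Predictor = Callable[[str, str, str], str]


def pairwise_accuracy(pairs: Sequence[IdeaPair], predict: Predictor) -> float:
    """Fraction of pairs where the predicted winner matches the recorded one."""
    if not pairs:
        raise ValueError("need at least one idea pair")
    correct = sum(
        1 for p in pairs if predict(p.goal, p.idea_a, p.idea_b) == p.winner
    )
    return correct / len(pairs)


def always_a(goal: str, idea_a: str, idea_b: str) -> str:
    """Trivial baseline that ignores its inputs and always picks idea A."""
    return "A"


# Invented placeholder pairs, for illustration only.
toy_pairs = [
    IdeaPair("goal 1", "idea 1a", "idea 1b", "A"),
    IdeaPair("goal 2", "idea 2a", "idea 2b", "B"),
    IdeaPair("goal 3", "idea 3a", "idea 3b", "A"),
    IdeaPair("goal 4", "idea 4a", "idea 4b", "B"),
]

baseline_accuracy = pairwise_accuracy(toy_pairs, always_a)
print(f"always-A baseline accuracy: {baseline_accuracy:.2f}")
```

A real predictor would replace `always_a` with a call to a language model that reads the goal and both ideas and returns its choice. On this balanced toy set the constant baseline scores 0.50, the reference point for any binary comparison with evenly split labels.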
Key facts
- Study focuses on comparative empirical forecasting of research ideas.
- Dataset includes 11,488 idea pairs from PapersWithCode.
- Off-the-shelf 8B-parameter models achieve 30% accuracy.
- SFT improves accuracy to 77.1%.
- GPT-5 achieves 61.1% accuracy.
- RLVR yields 71.35% accuracy with interpretable justifications.
- Research addresses bottleneck in evaluating AI-generated ideas.
- Preprint is arXiv:2605.21491.
Entities
Institutions
- arXiv
- PapersWithCode