AI Models Learn to Predict Research Success via Comparative Evaluation

ai-technology · 2026-05-23

A new preprint (arXiv:2605.21491) investigates whether language models can forecast the empirical success of research ideas before any experiments are run. The authors introduce comparative empirical forecasting: given a benchmark goal and two candidate ideas, predict which one yields better performance. They constructed a dataset of 11,488 idea pairs from PapersWithCode outcomes. Off-the-shelf 8B-parameter models achieved only 30% accuracy, but supervised fine-tuning (SFT) raised accuracy to 77.1%, surpassing GPT-5's 61.1%. Models trained with reinforcement learning with verifiable rewards (RLVR) reached 71.35% accuracy while also producing interpretable justifications. The study targets a bottleneck in AI-driven research: efficiently evaluating the large number of ideas such systems generate.
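To make the task concrete, here is a minimal Python sketch of how a comparative forecast could be scored. It is an illustration only, not the preprint's code: the `IdeaPair` fields, the `verifiable_reward` and `pairwise_accuracy` helpers, and the assumptions that a higher benchmark score is better and that ties do not occur are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class IdeaPair:
    """One comparison: a benchmark goal and two candidate ideas (hypothetical schema)."""
    goal: str
    idea_a: str
    idea_b: str
    score_a: float  # measured benchmark result for idea A
    score_b: float  # measured benchmark result for idea B

    @property
    def winner(self) -> str:
        # Assumption: higher score is better and the two scores differ.
        return "A" if self.score_a > self.score_b else "B"


def verifiable_reward(pair: IdeaPair, prediction: str) -> float:
    """Return 1.0 when the forecast matches the recorded outcome, else 0.0."""
    return 1.0 if prediction == pair.winner else 0.0


def pairwise_accuracy(
    pairs: Sequence[IdeaPair],
    forecaster: Callable[[IdeaPair], str],
) -> float:
    """Fraction of pairs for which the forecaster picks the better idea."""
    if not pairs:
        raise ValueError("need at least one idea pair")
    return sum(verifiable_reward(p, forecaster(p)) for p in pairs) / len(pairs)


# Toy data, invented purely for illustration.
pairs = [
    IdeaPair("image classification", "idea 1", "idea 2", score_a=0.91, score_b=0.88),
    IdeaPair("machine translation", "idea 3", "idea 4", score_a=27.4, score_b=29.1),
]

# A trivial baseline that always answers "A".
always_a_accuracy = pairwise_accuracy(pairs, lambda pair: "A")
```

Because the correct answer for each pair is fixed by the recorded benchmark outcome, the same 0-or-1 check could in principle serve both as an evaluation metric and as a reward signal for RLVR-style training.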

Key facts

Study focuses on comparative empirical forecasting of research ideas.
Dataset includes 11,488 idea pairs from PapersWithCode.
Off-the-shelf 8B-parameter models achieve 30% accuracy.
SFT improves accuracy to 77.1%.
GPT-5 achieves 61.1% accuracy.
RLVR yields 71.35% accuracy with interpretable justifications.
Research addresses bottleneck in evaluating AI-generated ideas.
Preprint is arXiv:2605.21491.
