SciCoQA Dataset Reveals LLMs Fail to Detect Scientific Paper-Code Discrepancies
A new dataset, SciCoQA, comprising 635 paper-code discrepancies (92 real, 543 synthetic), has been introduced to evaluate whether large language models (LLMs) can reliably detect mismatches between scientific papers and their accompanying code. The study, posted to arXiv (2601.12910), evaluated 22 models and found that even the best performers, Gemini 3.1 Pro and GPT-5 Mini, detected only 46.7% of the real-world discrepancies, highlighting a critical gap in automated scientific quality assurance. The dataset was built from GitHub issues and reproducibility papers, and a synthetic generation pipeline extends its coverage beyond AI to fields such as Physics and Quantitative Biology. The authors also developed a taxonomy of discrepancy types and categories to characterize the mismatches.
Key facts
- SciCoQA dataset contains 635 paper-code discrepancies.
- 92 discrepancies are real-world, 543 are synthetic.
- 22 LLMs were evaluated on the dataset.
- Best models: Gemini 3.1 Pro and GPT-5 Mini.
- Best models detect only 46.7% of real discrepancies.
- Dataset built from GitHub issues and reproducibility papers.
- Synthetic pipeline extends to Physics, Quantitative Biology.
- Taxonomy of discrepancy types and categories introduced.
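The headline figure (46.7% of real discrepancies detected) is a per-subset detection rate: the fraction of discrepancies in a given split that a model flags. A minimal sketch of how such a metric could be computed, assuming a simple list-of-dicts representation rather than the authors' actual evaluation harness:

```python
def detection_rate(discrepancies, detected_ids, subset):
    """Fraction of discrepancies in `subset` ('real' or 'synthetic')
    whose IDs appear in the model's detected set."""
    pool = [d for d in discrepancies if d["source"] == subset]
    if not pool:
        return 0.0
    hits = sum(1 for d in pool if d["id"] in detected_ids)
    return hits / len(pool)

# Toy example mirroring the dataset's real/synthetic split
# (hypothetical IDs, not actual SciCoQA entries).
dataset = [
    {"id": 1, "source": "real"},
    {"id": 2, "source": "real"},
    {"id": 3, "source": "synthetic"},
]
detected = {1, 3}  # IDs the model flagged as mismatches

print(detection_rate(dataset, detected, "real"))       # 0.5
print(detection_rate(dataset, detected, "synthetic"))  # 1.0
```

Reporting the real and synthetic subsets separately, as the study does, matters because synthetic discrepancies can be systematically easier or harder than those found in the wild.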