SciMDR Dataset Boosts Multimodal Scientific Document Reasoning

ai-technology · 2026-04-30

Researchers have released SciMDR, a training dataset for cross-modal understanding of scientific literature. It contains 300K question-answer pairs with explicit reasoning chains, drawn from 20K scientific papers. The dataset was built with a synthesize-and-reground approach intended to balance scale, fidelity, and realism. The process has two stages. Claim-Centric QA Synthesis produces accurate, isolated QA pairs and reasoning, each focused on a specific segment of a paper. Document-Scale Regrounding then programmatically re-embeds these pairs into full-document tasks that reflect realistic complexity. The team also created SciMDR-Eval, an expert-annotated benchmark for assessing multimodal comprehension in complete scientific workflows. In the reported experiments, models fine-tuned on SciMDR show significant improvements in scientific multimodal document reasoning.
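
To make the two-stage idea concrete, here is a minimal Python sketch of what regrounding a segment-level QA pair into its full document could look like. It is illustrative only and is not the authors' pipeline: the names ClaimQA, DocumentTask, and reground, along with every field, are hypothetical, and the sketch assumes a paper is available as an ordered mapping from segment IDs to text.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ClaimQA:
    """Hypothetical stage-1 output: a QA pair tied to one segment of a paper."""
    paper_id: str
    segment_id: str
    question: str
    reasoning: str
    answer: str


@dataclass
class DocumentTask:
    """Hypothetical stage-2 output: the same QA pair set in the full document."""
    paper_id: str
    context: list[str]   # every segment of the paper, in document order
    evidence_index: int  # position in `context` of the supporting segment
    question: str
    reasoning: str
    answer: str


def reground(qa: ClaimQA, segments: dict[str, str]) -> DocumentTask:
    """Embed a segment-level QA pair into its whole-document context.

    `segments` maps segment IDs to text, in document order (an assumption
    of this sketch; dicts preserve insertion order).
    """
    ids = list(segments)
    if qa.segment_id not in segments:
        raise KeyError(f"segment {qa.segment_id!r} not found in {qa.paper_id!r}")
    return DocumentTask(
        paper_id=qa.paper_id,
        context=[segments[i] for i in ids],
        evidence_index=ids.index(qa.segment_id),
        question=qa.question,
        reasoning=qa.reasoning,
        answer=qa.answer,
    )


# Usage with made-up data: the model must now locate the evidence
# among all segments instead of being handed the relevant one.
paper = {
    "sec1": "Introduction text.",
    "fig2": "Figure 2: accuracy versus model size.",
    "sec3": "Results text.",
}
qa = ClaimQA("paper-001", "fig2", "What does Figure 2 plot?",
             "The caption names both axes.", "Accuracy versus model size.")
task = reground(qa, paper)
```

The point of the sketch is the separation of concerns: the QA pair is verified against a small segment first, and only afterwards is the surrounding document attached, so the question stays faithful while the task becomes document-scale.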

Key facts

SciMDR is a large-scale training dataset for cross-modal comprehension.
It contains 300K QA pairs with explicit reasoning chains.
The dataset spans 20K scientific papers.
It was constructed using a synthesize-and-reground framework.
The framework includes Claim-Centric QA Synthesis and Document-Scale Regrounding.
SciMDR-Eval is an expert-annotated benchmark for evaluation.
Models fine-tuned on SciMDR show significant improvements in scientific multimodal document reasoning.
The research is published on arXiv with ID 2603.12249.
