SciMDR Dataset Boosts Multimodal Scientific Document Reasoning
Researchers have released SciMDR, a training dataset for cross-modal understanding of scientific literature. It contains 300K question-answer pairs with explicit reasoning chains, drawn from 20K scientific papers. The dataset was built with a synthesize-and-reground framework designed to balance scale, fidelity, and realism. The framework has two stages: Claim-Centric QA Synthesis, which produces accurate, isolated QA pairs and reasoning focused on specific segments of a paper, and Document-Scale Regrounding, which programmatically re-embeds those pairs into full-document tasks so that they reflect realistic complexity. The team also created SciMDR-Eval, an expert-annotated benchmark for assessing multimodal comprehension in complete scientific workflows. In the reported experiments, models fine-tuned on SciMDR show significant improvements in scientific multimodal document reasoning.
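The two-stage flow described above can be pictured with a small Python sketch. It is illustrative only and is not the authors' pipeline: the class names, field names, and the templated stand-in for QA generation are all assumptions, and a real system would presumably use a model for the synthesis stage.

```python
from dataclasses import dataclass


@dataclass
class QAPair:
    """One claim-level QA pair (all field names are hypothetical)."""
    paper_id: str
    segment_id: str        # the isolated segment the claim came from
    question: str
    answer: str
    reasoning: list[str]   # explicit reasoning chain, one step per entry


@dataclass
class DocumentTask:
    """A QA pair placed back into the context of the full paper."""
    qa: QAPair
    context_segments: list[str]  # every segment of the paper, in order
    evidence_index: int          # position of the supporting segment


def synthesize_claim_qa(paper_id: str, segments: dict[str, str]) -> list[QAPair]:
    """Stage 1 stand-in: build one isolated QA pair per segment.

    This placeholder only fills in a template so the data flow is
    visible; it does not generate real questions.
    """
    pairs = []
    for seg_id, text in segments.items():
        pairs.append(QAPair(
            paper_id=paper_id,
            segment_id=seg_id,
            question=f"What does {seg_id} report?",
            answer=text,
            reasoning=[f"Locate {seg_id}.", "Read the claim it states."],
        ))
    return pairs


def reground(pair: QAPair, segments: dict[str, str]) -> DocumentTask:
    """Stage 2 stand-in: pose the same question over the whole document."""
    ordered_ids = list(segments)
    return DocumentTask(
        qa=pair,
        context_segments=[segments[s] for s in ordered_ids],
        evidence_index=ordered_ids.index(pair.segment_id),
    )


# Toy paper with three segments (made-up content).
paper = {
    "fig1": "Accuracy rises with model size.",
    "tab2": "Method A outperforms method B.",
    "sec3": "Ablations remove the retrieval step.",
}
pairs = synthesize_claim_qa("paper-001", paper)
tasks = [reground(p, paper) for p in pairs]
```

The point of the split is that each pair is first checked against a single segment, where accuracy is easier to control, and only afterwards is the question posed over the entire paper, where the model has to find the supporting evidence itself.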
Key facts
- SciMDR is a large-scale training dataset for cross-modal comprehension.
- It contains 300K QA pairs with explicit reasoning chains.
- The dataset spans 20K scientific papers.
- It was constructed using a synthesize-and-reground framework.
- The framework includes Claim-Centric QA Synthesis and Document-Scale Regrounding.
- SciMDR-Eval is an expert-annotated benchmark for evaluation.
- Models fine-tuned on SciMDR show significant improvements in scientific multimodal document reasoning.
- The research is published on arXiv with ID 2603.12249.
Entities
Institutions
- arXiv