DiagramBank Dataset Enables AI-Generated Scientific Diagrams
Researchers have introduced DiagramBank, a large-scale dataset of 89,422 schematic diagrams sourced from top-tier scientific publications. Designed to address a bottleneck in autonomous "AI scientist" systems, the dataset enables multimodal retrieval and exemplar-driven generation of publication-grade scientific figures, such as teaser figures. Unlike derivative data plots, these diagrams require conceptual synthesis to translate complex logic into compelling visuals. The dataset is intended to support retrieval-augmented generation for scientific figure creation, filling a gap left by existing AI systems, which often omit such diagrams or produce inferior alternatives. The work is detailed in arXiv preprint 2604.20857.
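To make the retrieval step concrete, here is a minimal sketch of exemplar retrieval for retrieval-augmented figure generation. It is not DiagramBank's actual pipeline or API: the index entries, captions, embedding vectors, and the `retrieve_exemplars` helper are all hypothetical, and a real system would obtain embeddings from a multimodal encoder rather than hard-coding them.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_exemplars(query_vec, index, k=2):
    """Return the k index entries most similar to the query embedding."""
    ranked = sorted(
        index,
        key=lambda item: cosine(query_vec, item["embedding"]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical toy index: ids, captions, and embeddings are made up for illustration.
index = [
    {"id": "diagram-a", "caption": "Encoder-decoder pipeline overview", "embedding": [0.9, 0.1, 0.0]},
    {"id": "diagram-b", "caption": "Training loop with feedback", "embedding": [0.1, 0.9, 0.1]},
    {"id": "diagram-c", "caption": "Two-stage retrieval architecture", "embedding": [0.8, 0.2, 0.1]},
]

# Assumed embedding of the method description that needs a figure.
query_vec = [1.0, 0.0, 0.0]

exemplars = retrieve_exemplars(query_vec, index, k=2)
exemplar_ids = [e["id"] for e in exemplars]
print(exemplar_ids)
```

In an exemplar-driven setup, the retrieved diagrams would then be passed to a generator as references alongside the text describing the method to be illustrated.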
Key facts
- DiagramBank contains 89,422 schematic diagrams from top-tier scientific publications.
- The dataset is designed for multimodal retrieval and exemplar-driven scientific figure generation.
- It addresses a bottleneck in autonomous AI scientist systems for producing publication-grade diagrams.
- Teaser figures serve as strategic visual interfaces requiring conceptual synthesis.
- Existing AI systems often omit scientific diagrams or produce inferior alternatives.
- The dataset supports retrieval-augmented generation for scientific figure creation.
- The research is available as arXiv preprint 2604.20857.
- The dataset targets schematic diagrams, not derivative data plots.
Entities
Institutions
- arXiv