MolRecBench-Wild: New Benchmark for Chemical Structure Recognition
Researchers have released MolRecBench-Wild, a benchmark of 5,029 molecular structures drawn from 820 recent chemistry publications. It is designed to evaluate Optical Chemical Structure Recognition (OCSR) systems on real-world images. The benchmark is built on MOSAIC, a dual-dimensional difficulty framework with 37 fine-grained labels covering visual interference and chemical semantics. To support accurate evaluation, the team also introduces CARBON, a representation language that can express valence changes, icon-based categories, and other non-standard chemical semantics. A dual-track evaluation protocol accepts both CARBON and SMILES outputs, so that systems producing either format can be scored.
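As a rough illustration of what dual-track scoring could look like, the Python sketch below routes each prediction to the reference in the same format and reports accuracy per track. It is not the benchmark's actual protocol: the names Prediction, normalize and score_by_track are hypothetical, the CARBON strings are opaque placeholders (the real CARBON syntax is not described here), and matching is plain string equality after trimming whitespace. A real SMILES track would need proper canonicalization by a cheminformatics toolkit.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Prediction:
    """One system output (hypothetical structure, for illustration only)."""
    sample_id: str
    track: str   # "carbon" or "smiles"
    output: str


def normalize(text: str) -> str:
    # Placeholder normalization: trims surrounding whitespace only.
    # Assumption: a real evaluator would canonicalize the structure instead.
    return text.strip()


def score_by_track(
    predictions: Iterable[Prediction],
    references: Dict[str, Dict[str, str]],
) -> Dict[str, float]:
    """Return exact-match accuracy per track.

    references maps sample_id -> {track: reference string}.
    Predictions with no reference on their own track are skipped.
    """
    hits: Dict[str, int] = {}
    totals: Dict[str, int] = {}
    for pred in predictions:
        ref = references.get(pred.sample_id, {}).get(pred.track)
        if ref is None:
            continue
        totals[pred.track] = totals.get(pred.track, 0) + 1
        if normalize(pred.output) == normalize(ref):
            hits[pred.track] = hits.get(pred.track, 0) + 1
    return {track: hits.get(track, 0) / n for track, n in totals.items()}


# Toy data: the "carbon-ref-1" value is a placeholder, not real CARBON syntax.
references = {
    "m1": {"smiles": "CCO", "carbon": "carbon-ref-1"},
    "m2": {"smiles": "CCN"},
}
predictions = [
    Prediction("m1", "smiles", " CCO "),
    Prediction("m2", "smiles", "CCC"),
    Prediction("m1", "carbon", "carbon-ref-1"),
]
scores = score_by_track(predictions, references)
```

Keeping the two tracks separate in the result means a system that emits only SMILES still gets a score, while its number is never mixed with CARBON results that may encode semantics SMILES cannot express.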
Key facts
- MolRecBench-Wild contains 5,029 structures from 820 recent chemistry papers.
- MOSAIC is a dual-dimensional difficulty framework with 37 fine-grained labels.
- CARBON is a new representation language for non-standard chemical semantics.
- The benchmark covers the full difficulty spectrum of real publications.
- A dual-track evaluation protocol supports both CARBON and SMILES outputs.
- OCSR aims to translate molecular diagrams into machine-readable formats.
- Current OCSR systems remain unreliable on real-world images.
- The work is available on arXiv under ID 2605.05832.
Entities
Institutions
- arXiv