MechVQA Dataset Benchmarks Multimodal LLMs on Mechanical Drawings
Researchers have released MechVQA, a dataset for evaluating multimodal large language models (MLLMs) on mechanical engineering drawings. It contains 3,300 high-density images and 21,000 question-answer pairs across ten fine-grained tasks, grouped into three capability levels: Recognition, Reasoning, and Judging. Built with a semi-automated construction and quality-control pipeline, the dataset targets three difficulties MLLMs face on such drawings: high annotation density, limited domain knowledge, and unreliable reasoning about spatial relations under strict geometric constraints. The researchers also introduce MechVL, a model developed on top of MechVQA, to improve understanding of real-world mechanical drawings.
Key facts
- MechVQA is presented as the first comprehensive dataset for mechanical drawing understanding.
- It contains 3,300 high-density images with 21,000 question-answer pairs.
- The dataset spans 10 fine-grained tasks across Recognition, Reasoning, and Judging levels.
- Created via a semi-automated construction and quality-control pipeline.
- MLLMs currently perform poorly on mechanical engineering drawings.
- The MechVL model is developed on top of MechVQA.
- The work targets three challenges: high annotation density, limited domain knowledge, and unreliable spatial reasoning.
- The dataset serves as a testbed for MLLM understanding of real-world mechanical drawings.
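To make the three-level structure concrete, here is a minimal Python sketch of how predictions on a benchmark organized this way might be scored per capability level. The record fields, the task names, and the exact-match scoring rule are all assumptions made for illustration; the brief does not describe MechVQA's actual file format or evaluation protocol.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QARecord:
    """Hypothetical record layout; not MechVQA's published schema."""
    image_id: str
    level: str      # "Recognition", "Reasoning", or "Judging"
    task: str       # one of the ten fine-grained tasks (names here are made up)
    question: str
    answer: str


def accuracy_by_level(records, predictions):
    """Return exact-match accuracy per capability level.

    Exact match (case- and whitespace-insensitive) is an assumed
    scoring rule, chosen only to keep the sketch simple.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record, prediction in zip(records, predictions):
        total[record.level] += 1
        if prediction.strip().lower() == record.answer.strip().lower():
            correct[record.level] += 1
    return {level: correct[level] / total[level] for level in total}


# Illustrative usage with made-up records and model outputs.
records = [
    QARecord("img_001", "Recognition", "thread_label", "What thread is specified?", "M8"),
    QARecord("img_001", "Recognition", "view_count", "How many views are shown?", "3"),
    QARecord("img_002", "Judging", "dimension_check", "Is the part fully dimensioned?", "No"),
]
predictions = ["m8", "2", "no"]
print(accuracy_by_level(records, predictions))
```

Reporting accuracy per level rather than as one overall number keeps a strong Recognition score from hiding weak Reasoning or Judging results. For scale, the figures above work out to about 6.4 question-answer pairs per image (21,000 / 3,300).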