UML Class Diagram Benchmark for Vision Language Models
Researchers have introduced a benchmark for visual question answering on UML class diagrams, filling a gap in VLM research, which has largely concentrated on photographs and simpler graphics such as bar charts. They also constructed a training dataset of 16,000 image-question-answer triples. On the new benchmark, a LoRA-based fine-tuning approach outperformed Qwen 3.5 27B, a recently released top-performing VLM.
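The exact data format is not specified in this summary. As a rough illustration only, image-question-answer triples are often stored as JSONL with one record per line; the field names below (image_path, question, answer) are assumptions for illustration, not the authors' schema:

```python
import json
from dataclasses import dataclass
from pathlib import Path

@dataclass
class UMLVQATriple:
    """One training example: a UML class diagram plus a question about it."""
    image_path: str   # path to the rendered class-diagram image
    question: str     # e.g. "Which classes inherit from AbstractRepository?"
    answer: str       # ground-truth short answer

def load_triples(jsonl_file: Path) -> list[UMLVQATriple]:
    """Read image-question-answer triples from a JSONL file, one record per line."""
    with jsonl_file.open() as f:
        return [UMLVQATriple(**json.loads(line)) for line in f]
```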
Key facts
- Vision Language Models (VLMs) struggle with diagram understanding compared to photos.
- Prior research focused on bar charts and line charts, not computer science diagrams like UML class diagrams.
- A new benchmark for visual question answering on UML class diagrams was introduced.
- A training dataset of 16,000 image-question-answer triples was constructed.
- LoRA-based fine-tuning outperformed Qwen 3.5 27B on the UML benchmark (see the fine-tuning sketch after this list).
- The work is published on arXiv under the computer vision and pattern recognition (cs.CV) category.
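The summary does not state which base model or hyperparameters the authors used, so the following is only a minimal sketch of LoRA-based fine-tuning using the Hugging Face PEFT library; the checkpoint name, target modules, and all hyperparameter values are assumptions for illustration, not the paper's setup.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder checkpoint; the paper's actual base model is not stated in this summary.
model = AutoModelForCausalLM.from_pretrained("some-org/some-vlm")

# LoRA injects trainable low-rank adapter matrices into selected layers
# while the original weights stay frozen, so only a small fraction of
# parameters is updated during fine-tuning.
lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update (assumed value)
    lora_alpha=32,                        # scaling factor for the adapter output
    lora_dropout=0.05,                    # dropout applied to adapter inputs
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt (assumed)
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports the small trainable fraction
```

Because only the adapter matrices are trained, a dataset on the order of 16,000 triples can adapt a large VLM at a fraction of the memory and compute cost of full fine-tuning.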
Entities
Institutions
- arXiv