VLMs Struggle with 3D-Printed Objects in Robotic Scene Understanding
A recent study posted on arXiv (2506.19579) examines how well Vision-Language Models (VLMs) handle domain shift in single-view robotic scene understanding. The researchers created a controlled physical domain shift by contrasting real tools with 3D-printed counterparts that are geometrically similar but differ in texture, color, and material. They benchmark state-of-the-art, locally deployable VLMs on object captioning in tabletop scenes captured by a robotic manipulator. The findings indicate that while the VLMs describe typical real-world objects reliably, their performance degrades markedly on the 3D-printed items even though the shapes closely match. The study also exposes weaknesses in standard caption-evaluation metrics, which can fail to register the domain shift and may reward fluent but factually inaccurate captions.
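As a rough illustration of the kind of pipeline being benchmarked, the sketch below captions a single tabletop image with a locally runnable VLM. It uses BLIP via Hugging Face transformers purely as a stand-in; the summary does not name the specific models or inference code the authors used, and the image path is hypothetical.

```python
# Minimal sketch of single-view object captioning with a locally run VLM.
# BLIP is only an illustrative stand-in; the paper's actual models and
# prompts are not specified in this summary.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Hypothetical input: one tabletop frame captured by the manipulator's camera.
image = Image.open("tabletop_scene.jpg").convert("RGB")

inputs = processor(images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(output_ids[0], skip_special_tokens=True)
print(caption)  # e.g. "a wrench on a wooden table" for a real tool
```

The domain-shift question is then whether the same pipeline still names the object and its material correctly when the wrench is swapped for a 3D-printed replica.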
Key facts
- Study evaluates VLM robustness to domain shift in single-view robotic scene understanding
- Domain shift contrasts real tools with 3D-printed counterparts differing in texture, color, material
- Benchmarks state-of-the-art locally deployable VLMs on object captioning
- Performance degrades on 3D-printed items despite similar shapes
- Standard evaluation metrics can fail to detect the domain shift and may reward fluent but incorrect captions (see the sketch after this list)
- Research conducted on tabletop scenes captured by robotic manipulator
- Published on arXiv with ID 2506.19579
- Focus on semantic alignment and factual grounding
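To make the metric criticism concrete, the toy example below uses a simple unigram-precision score as a stand-in for n-gram overlap metrics such as BLEU. The reference and candidate captions are invented; this is not the paper's evaluation code or data.

```python
# Toy illustration (not the paper's evaluation code): a unigram-precision
# score, standing in for n-gram metrics such as BLEU, can reward fluent
# captions that get the key attribute, or even the object, wrong.
from collections import Counter

def unigram_precision(reference: str, candidate: str) -> float:
    """Fraction of candidate tokens that also occur in the reference (clipped counts)."""
    ref = Counter(reference.lower().split())
    cand = candidate.lower().split()
    matched = sum(min(n, ref[tok]) for tok, n in Counter(cand).items())
    return matched / len(cand) if cand else 0.0

reference = "a gray 3d printed wrench lying on a wooden table"

wrong_material = "a gray metal wrench lying on a wooden table"        # misses "3d printed"
wrong_object = "a silver metal screwdriver lying on a wooden table"   # wrong tool entirely

print(unigram_precision(reference, wrong_material))  # ~0.89: near-perfect despite the material error
print(unigram_precision(reference, wrong_object))    # ~0.67: substantial credit for the wrong object
```

Because most of the score comes from scene and style words, the error that matters for a robot (what the object is, or what it is made of) barely moves the number, which is the kind of surface-overlap blind spot the study attributes to conventional metrics.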