VLMs Struggle with 3D-Printed Objects in Robotic Scene Understanding
A recent study posted on arXiv (2506.19579) examines how well Vision-Language Models (VLMs) handle domain shift in single-view robotic scene understanding. The researchers created a controlled physical domain shift by contrasting real tools with 3D-printed counterparts that are geometrically similar but differ in texture, color, and material. They benchmark state-of-the-art, locally deployable VLMs on object captioning in tabletop scenes captured by a robotic manipulator. The findings indicate that while the VLMs describe typical real-world objects reliably, their performance degrades markedly on the 3D-printed items even though the shapes closely match. The study also exposes weaknesses in standard caption-evaluation metrics, which can fail to register the domain shift and may reward fluent but factually inaccurate captions.
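As a rough illustration of the kind of pipeline being benchmarked, the sketch below captions a single tabletop image with a locally runnable VLM. It uses BLIP via Hugging Face transformers purely as a stand-in; the summary does not name the specific models or inference code the authors used, and the image path is hypothetical.

```python
# Minimal sketch of single-view object captioning with a locally run VLM.
# BLIP is only an illustrative stand-in; the paper's actual models and
# prompts are not specified in this summary.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Hypothetical input: one tabletop frame captured by the manipulator's camera.
image = Image.open("tabletop_scene.jpg").convert("RGB")

inputs = processor(images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(output_ids[0], skip_special_tokens=True)
print(caption)  # e.g. "a wrench on a wooden table" for a real tool
```

The domain-shift question is then whether the same pipeline still names the object and its material correctly when the wrench is swapped for a 3D-printed replica.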
Key facts
- Study evaluates VLM robustness to domain shift in single-view robotic scene understanding
- Domain shift contrasts real tools with 3D-printed counterparts differing in texture, color, material
- Benchmarks state-of-the-art locally deployable VLMs on object captioning
- Performance degrades on 3D-printed items despite similar shapes
- Standard evaluation metrics can fail to detect the domain shift and may reward fluent but incorrect captions (see the sketch after this list)
- Research conducted on tabletop scenes captured by robotic manipulator
- Published on arXiv with ID 2506.19579
- Focus on semantic alignment and factual grounding
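To make the metric criticism concrete, the toy example below uses a simple unigram-precision score as a stand-in for n-gram overlap metrics such as BLEU. The reference and candidate captions are invented; this is not the paper's evaluation code or data.

```python
# Toy illustration (not the paper's evaluation code): a unigram-precision
# score, standing in for n-gram metrics such as BLEU, can reward fluent
# captions that get the key attribute, or even the object, wrong.
from collections import Counter

def unigram_precision(reference: str, candidate: str) -> float:
    """Fraction of candidate tokens that also occur in the reference (clipped counts)."""
    ref = Counter(reference.lower().split())
    cand = candidate.lower().split()
    matched = sum(min(n, ref[tok]) for tok, n in Counter(cand).items())
    return matched / len(cand) if cand else 0.0

reference = "a gray 3d printed wrench lying on a wooden table"

wrong_material = "a gray metal wrench lying on a wooden table"        # misses "3d printed"
wrong_object = "a silver metal screwdriver lying on a wooden table"   # wrong tool entirely

print(unigram_precision(reference, wrong_material))  # ~0.89: near-perfect despite the material error
print(unigram_precision(reference, wrong_object))    # ~0.67: substantial credit for the wrong object
```

Because most of the score comes from scene and style words, the error that matters for a robot (what the object is, or what it is made of) barely moves the number, which is the kind of surface-overlap blind spot the study attributes to conventional metrics.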