VLMs Fail at Basic Visual Path Following Due to Local Distractor Competition
A recent study posted to arXiv (2605.15672) reports that advanced vision-language models (VLMs) struggle with line tracing, a basic visual task that requires following a designated path through successive local continuations. The researchers designed controlled tracing tasks that minimize semantic and topological ambiguity; even so, the best-performing VLMs often leave the intended path in favor of nearby alternatives that look similar. Internal analyses and behavioral interventions suggest that these errors stem from local competition with nearby distractors. Increasing model size yields only limited gains, and reasoning partially compensates without fully resolving the issue. These results call into question the presumed robustness of VLMs in fundamental visual operations.
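To make "following a path through successive local continuations" concrete, here is a minimal toy sketch. It is not the paper's benchmark and not a model of how VLMs process images. The grid, the two hand-drawn lines, the `trace` function, and the heading-based tie-break are all assumptions made for illustration. The sketch shows that a tracer relying only on local neighbors faces contested choices wherever a similar-looking distractor comes close to the target, and that a poor local choice can carry it onto the wrong line.

```python
from typing import List, Set, Tuple

Point = Tuple[int, int]  # (row, col) on a small pixel grid


def ink_neighbors(p: Point, ink: Set[Point], visited: Set[Point]) -> List[Point]:
    """Unvisited inked pixels among the 8 neighbors of p, in sorted order."""
    r, c = p
    return sorted(
        (r + dr, c + dc)
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
        and (r + dr, c + dc) in ink
        and (r + dr, c + dc) not in visited
    )


def trace(ink: Set[Point], start: Point, heading: Point, use_heading: bool):
    """Follow inked pixels one local step at a time.

    Returns the visited path and the points where more than one
    continuation was available (the locally contested steps).
    """
    path, visited, contested = [start], {start}, []
    current = start
    while True:
        options = ink_neighbors(current, ink, visited)
        if not options:
            return path, contested
        if len(options) > 1:
            contested.append(current)
        if use_heading:
            # Hypothetical tie-break: prefer the step closest to the current heading.
            nxt = min(
                options,
                key=lambda q: abs(q[0] - current[0] - heading[0])
                + abs(q[1] - current[1] - heading[1]),
            )
        else:
            # Naive tie-break: take the first candidate in sorted order.
            nxt = options[0]
        heading = (nxt[0] - current[0], nxt[1] - current[1])
        path.append(nxt)
        visited.add(nxt)
        current = nxt


# Target: a straight line along row 2. Distractor: a line along row 0
# that dips to row 1 at column 3, coming within one pixel of the target.
target = {(2, c) for c in range(7)}
distractor = {(0, 0), (0, 1), (0, 2), (1, 3), (0, 4), (0, 5), (0, 6)}
ink = target | distractor

guided_path, guided_contested = trace(ink, (2, 0), (0, 1), use_heading=True)
naive_path, naive_contested = trace(ink, (2, 0), (0, 1), use_heading=False)

print("guided tracer ends at:", guided_path[-1])
print("naive tracer ends at:", naive_path[-1])
print("contested steps on the target:", guided_contested)
```

In this toy setup the only contested steps are the ones next to the distractor's dip; everywhere else the continuation is unambiguous. That mirrors, in a very loose sense, the summary's point that the failures concentrate where a nearby alternative competes locally with the target path.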
Key facts
- Vision-language models (VLMs) often fail at line-tracing tasks.
- Failures occur when nearby distractors look similar to the target path.
- Standard scaling of model size provides only limited gains.
- Reasoning partially compensates for the tracing bottleneck.
- The study is available on arXiv under ID 2605.15672.
- Controlled tasks reduced semantic and topological ambiguity.
- Internal analyses and behavioral interventions point to local competition with distractors as the cause of failure.
- Even state-of-the-art models often deviate from the intended path.
Entities
Institutions
- arXiv