SceneFunRI Benchmark Tests VLMs on Occluded Object Reasoning
A new benchmark named SceneFunRI has been unveiled to assess how well vision-language models (VLMs) can infer the locations of occluded functional objects. Built on the SceneFun3D dataset, it comprises 855 instances that require models to deduce object locations from task instructions and commonsense reasoning. The top-performing model, Gemini 3 Flash, recorded a CAcc@75 of 15.20, an mIoU of 0.74, and a Dist of 28.65. The study also analyzed several prompting strategies, revealing significant limitations in VLMs' spatial reasoning when faced with unfamiliar environments.
Key facts
- SceneFunRI is a benchmark for reasoning about invisible functional objects.
- Based on the SceneFun3D dataset.
- Comprises 855 instances.
- Requires models to infer locations of occluded objects from task instructions and commonsense reasoning.
- Gemini 3 Flash achieved CAcc@75 of 15.20, mIoU of 0.74, and Dist of 28.65.
- Prompting analysis includes Strong Instruction Prompting, Reasoning-based Prompting, and Spatial Process of Elimination (SPoE).
- Addresses a major challenge for vision-language models (VLMs).
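The source does not define how CAcc@75, mIoU, and Dist are computed in SceneFunRI. As a rough illustration only, the sketch below shows generic 2D versions of such localization metrics: mean intersection-over-union, accuracy at an IoU threshold, and mean centre distance. The box format `(x1, y1, x2, y2)` and every function name here are assumptions, not the benchmark's actual protocol.

```python
# Hypothetical illustration of IoU-style localization metrics.
# SceneFunRI's exact metric definitions are not given in the source;
# axis-aligned 2D boxes (x1, y1, x2, y2) are assumed for simplicity.
import math


def iou(a, b):
    # Intersection-over-union of two axis-aligned boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def centre_dist(a, b):
    # Euclidean distance between box centres.
    ca = ((a[0] + a[2]) / 2, (a[1] + a[3]) / 2)
    cb = ((b[0] + b[2]) / 2, (b[1] + b[3]) / 2)
    return math.hypot(ca[0] - cb[0], ca[1] - cb[1])


def evaluate(preds, gts, thresh=0.75):
    # Aggregate the three metrics over paired predictions and ground truths.
    ious = [iou(p, g) for p, g in zip(preds, gts)]
    dists = [centre_dist(p, g) for p, g in zip(preds, gts)]
    return {
        "mIoU": sum(ious) / len(ious),
        f"CAcc@{int(thresh * 100)}": sum(i >= thresh for i in ious) / len(ious),
        "Dist": sum(dists) / len(dists),
    }
```

Under these assumptions, a perfect prediction would yield an mIoU of 1.0, a CAcc@75 of 1.0, and a Dist of 0.0; whether the benchmark reports CAcc as a fraction or a percentage is not stated in the source.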