Flat-Pack Bench: New Benchmark Tests LVLMs on Furniture Assembly
Researchers have introduced Flat-Pack Bench, a benchmark designed to evaluate Large Vision-Language Models (LVLMs) on fine-grained spatio-temporal understanding through furniture assembly tasks. Existing video understanding benchmarks focus on coarse-grained tasks like action segmentation, classification, captioning, and retrieval, often relying on easily identifiable entities such as household objects, animals, and human subjects. This limits their applicability to complex, in-the-wild video scenarios. Flat-Pack Bench addresses this gap by requiring step-by-step understanding of assembly actions, including temporal ordering, temporal localization of assembly states, and part matching. The benchmark aims to push LVLMs toward more nuanced video comprehension needed for applications like furniture assembly and cooking.
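To make two of the task types concrete, here is a minimal sketch of how temporal ordering and temporal localization answers could be scored. This summary does not say which metrics Flat-Pack Bench uses, so both measures below (pairwise order agreement and temporal intersection-over-union) are illustrative assumptions, and the function names and step labels are hypothetical.

```python
from itertools import combinations


def pairwise_order_accuracy(predicted, reference):
    """Fraction of step pairs whose relative order matches the reference.

    Assumed metric for illustration; not taken from the benchmark.
    Both lists must contain the same unique step labels.
    """
    if sorted(predicted) != sorted(reference):
        raise ValueError("predicted and reference must contain the same steps")
    position = {step: i for i, step in enumerate(predicted)}
    pairs = list(combinations(reference, 2))
    if not pairs:
        return 1.0  # zero or one step: nothing to misorder
    agree = sum(1 for a, b in pairs if position[a] < position[b])
    return agree / len(pairs)


def temporal_iou(pred, ref):
    """Intersection over union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], ref[1]) - max(pred[0], ref[0]))
    union = (pred[1] - pred[0]) + (ref[1] - ref[0]) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical assembly steps, for illustration only.
reference_order = ["attach_legs", "fit_shelf", "mount_door"]
model_order = ["attach_legs", "mount_door", "fit_shelf"]
order_score = pairwise_order_accuracy(model_order, reference_order)

# Hypothetical interval during which an assembly state is visible.
overlap_score = temporal_iou((0.0, 10.0), (5.0, 15.0))
```

Pairwise agreement gives partial credit when only some steps are swapped, which an exact-match score would not.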
Key facts
- Flat-Pack Bench is a new benchmark for evaluating LVLMs on furniture assembly tasks.
- It focuses on fine-grained spatio-temporal understanding.
- Existing benchmarks focus on coarse-grained tasks and easily identifiable entities such as household objects, animals, and human subjects.
- The benchmark covers temporal ordering, temporal localization of assembly states, and part matching.
- It targets applications like furniture assembly and cooking.
- The work is posted on arXiv with ID 2605.21625.
- The arXiv announcement type is "cross", meaning the paper is a cross-listing.
- The benchmark aims to address gaps in current video understanding evaluations.
Entities
Institutions
- arXiv