TouchSafeBench: Benchmarking Collision Grounding in VLMs for Human-Robot Safety
TouchSafeBench is a physics-grounded benchmark that evaluates vision-language models (VLMs) on collision grounding: determining whether a robot is safe, currently colliding, or about to collide with a person or object. Built in Habitat 3.0, it contains 2,940 simulated indoor co-presence episodes spanning social navigation and social rearrangement tasks. It provides synchronized multi-view RGB-D observations, top-down trajectory maps, calibrated camera metadata, and simulator-derived contact labels. The benchmark defines two deployment-facing tasks: classifying the current safety state and warning about imminent collision. The findings underscore that effective human-robot collaboration requires more than describing a scene; a model must bind visual observations to robot body geometry, camera viewpoint, scene layout, human proximity, and temporal motion. The paper is available on arXiv as arXiv:2605.31196.
Key facts
- TouchSafeBench is a physics-grounded benchmark for collision grounding in VLMs.
- Built in Habitat 3.0.
- Contains 2,940 simulated indoor co-presence episodes.
- Covers social navigation and social rearrangement tasks.
- Provides synchronized multi-view RGB-D observations, top-down trajectory maps, calibrated camera metadata, and simulator-derived contact labels.
- Two deployment-facing tasks: classifying current safety state and warning about imminent collision.
- Collision grounding requires binding visual observations to robot body geometry, camera viewpoint, scene layout, human proximity, and temporal motion.
- Paper ID: arXiv:2605.31196.
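To make the two deployment-facing tasks concrete, here is a minimal Python sketch of how per-frame contact flags could be turned into the three safety states and how a model's predictions could be scored against them. Everything in it is an illustrative assumption rather than the benchmark's actual schema or protocol: the `Frame` record, the `SafetyState` names, the `horizon` look-ahead used to define "imminent", and the plain accuracy metric are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Sequence


class SafetyState(Enum):
    # Three states named in the summary; the identifiers are hypothetical.
    SAFE = "safe"
    IMMINENT = "imminent_collision"
    COLLIDING = "colliding"


@dataclass
class Frame:
    # Hypothetical per-frame record; the real benchmark schema may differ.
    step: int
    in_contact: bool  # stand-in for a simulator-derived contact label


def label_frames(frames: Sequence[Frame], horizon: int = 3) -> List[SafetyState]:
    """Assign a safety state to each frame.

    Assumed rule (not taken from the paper): a frame is COLLIDING if it is
    in contact, IMMINENT if any of the next `horizon` frames is in contact,
    and SAFE otherwise.
    """
    labels = []
    for i, frame in enumerate(frames):
        if frame.in_contact:
            labels.append(SafetyState.COLLIDING)
        elif any(f.in_contact for f in frames[i + 1 : i + 1 + horizon]):
            labels.append(SafetyState.IMMINENT)
        else:
            labels.append(SafetyState.SAFE)
    return labels


def accuracy(predicted: Sequence[SafetyState],
             reference: Sequence[SafetyState]) -> float:
    """Fraction of frames where the predicted state matches the reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


# Toy episode: contact occurs only at step 4.
contacts = [False, False, False, False, True, False]
episode = [Frame(step=i, in_contact=c) for i, c in enumerate(contacts)]
reference = label_frames(episode, horizon=2)

# A hypothetical model that always answers SAFE misses every risky frame.
always_safe = [SafetyState.SAFE] * len(episode)
score = accuracy(always_safe, reference)
```

In this toy episode, steps 2 and 3 are labeled imminent and step 4 colliding, so the always-safe baseline is correct on only half of the frames. A real evaluation would presumably also condition the model on the RGB-D views, trajectory maps, and camera metadata listed above, which this sketch leaves out.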
Entities
Institutions
- arXiv