TouchSafeBench: Benchmarking Collision Grounding in VLMs for Human-Robot Safety

ai-technology · 2026-06-01

TouchSafeBench is a physics-grounded benchmark that evaluates vision-language models (VLMs) on collision grounding: determining whether a robot is in a safe position, currently colliding, or about to collide with a person or object. Built in Habitat 3.0, the benchmark contains 2,940 simulated indoor co-presence episodes spanning social navigation and social rearrangement tasks. It provides synchronized multi-view RGB-D observations, top-down trajectory maps, calibrated camera metadata, and contact labels derived from the simulator. The benchmark centers on two deployment-facing tasks: classifying the current safety state and warning about an imminent collision. The work stresses that effective human-robot collaboration requires more than describing what is visible; a model must bind visual observations to robot body geometry, camera viewpoint, scene layout, human proximity, and motion over time. The paper is available on arXiv as arXiv:2605.31196.

Key facts

TouchSafeBench is a physics-grounded benchmark for collision grounding in VLMs.
Built in Habitat 3.0.
Contains 2,940 simulated indoor co-presence episodes.
Covers social navigation and social rearrangement tasks.
Provides synchronized multi-view RGB-D observations, top-down trajectory maps, calibrated camera metadata, and simulator-derived contact labels.
Two deployment-facing tasks: classifying current safety state and warning about imminent collision.
Collision grounding requires binding visual observations to robot body geometry, camera viewpoint, scene layout, human proximity, and temporal motion.
Paper ID: arXiv:2605.31196.
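
To make the two tasks in the list above concrete, here is a minimal Python sketch of how per-frame contact flags could be turned into three safety states and how predictions could be scored against them. It illustrates the general idea only and is not the benchmark's actual labeling or evaluation code: the names SafetyState, label_frames, and accuracy, the look-ahead rule, and the horizon parameter are all assumptions made for this example.

```python
from enum import Enum
from typing import List, Sequence


class SafetyState(Enum):
    """Hypothetical three-way safety state (names are assumptions)."""
    SAFE = "safe"
    IMMINENT = "imminent"
    COLLIDING = "colliding"


def label_frames(contact: Sequence[bool], horizon: int = 3) -> List[SafetyState]:
    """Derive a safety state per frame from per-frame contact flags.

    Assumed rule, not taken from the paper: a frame is COLLIDING if
    contact is true at that frame, IMMINENT if contact occurs within
    the next `horizon` frames, and SAFE otherwise.
    """
    labels = []
    for t, in_contact in enumerate(contact):
        if in_contact:
            labels.append(SafetyState.COLLIDING)
        elif any(contact[t + 1 : t + 1 + horizon]):
            labels.append(SafetyState.IMMINENT)
        else:
            labels.append(SafetyState.SAFE)
    return labels


def accuracy(predicted: Sequence[SafetyState],
             reference: Sequence[SafetyState]) -> float:
    """Fraction of frames where the predicted state matches the reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


if __name__ == "__main__":
    # Toy contact trace: contact begins at frame 4 and ends after frame 5.
    contact_flags = [False, False, False, False, True, True, False]
    reference = label_frames(contact_flags, horizon=2)
    print([state.value for state in reference])
```

In this sketch the current-state task corresponds to separating COLLIDING from the other two states, while the warning task corresponds to detecting IMMINENT early enough; a real evaluation would likely also report per-class metrics, since safe frames tend to dominate.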
