BilliardPhys-Bench Tests Physical Reasoning in Multimodal LLMs
BilliardPhys-Bench is a benchmark that assesses physical reasoning in multimodal large language models (MLLMs) using synthetic billiards environments. A procedural engine generates randomized scenarios with friction and elastic collisions, and the benchmark evaluates three skills: predicting ball-to-ball collisions, reasoning about bounces off walls, and estimating where balls come to rest. Evaluations of MLLMs from the GPT, Claude, Gemini, and Qwen families show that performance declines as simulation time grows and scene geometry becomes more complex. A notable failure mode is "stasis bias," in which models predict no interaction when the correct outcome is hard to infer. The results suggest that current MLLMs, although strong at static image recognition, remain limited in intuitive physical reasoning.
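To make the kind of scenario described above concrete, here is a minimal Python sketch of a randomized billiards rollout with friction and elastic wall bounces. It is not the benchmark's actual engine: the function names (`random_scene`, `simulate_ball`), the table size, the friction model (constant deceleration), and the time step are all assumptions chosen for illustration, and ball-to-ball collisions are left out for brevity.

```python
import math
import random


def random_scene(rng, width=2.0, height=1.0, max_speed=2.0):
    """Sample a hypothetical start position and velocity for one ball."""
    x = rng.uniform(0.0, width)
    y = rng.uniform(0.0, height)
    angle = rng.uniform(0.0, 2.0 * math.pi)
    speed = rng.uniform(0.1, max_speed)
    return x, y, speed * math.cos(angle), speed * math.sin(angle)


def simulate_ball(x, y, vx, vy, width=2.0, height=1.0, friction=0.5, dt=0.001):
    """Roll one point-like ball until friction stops it.

    Assumed physics: constant deceleration `friction` along the direction
    of motion, and perfectly elastic reflection at the four walls.
    Returns (final_x, final_y, wall_bounces).
    """
    bounces = 0
    speed = math.hypot(vx, vy)
    while speed > 0.0:
        # Advance the position by one explicit Euler step.
        x += vx * dt
        y += vy * dt

        # Elastic wall bounce: mirror the overshoot and flip the
        # velocity component normal to the wall.
        if x < 0.0:
            x, vx, bounces = -x, -vx, bounces + 1
        elif x > width:
            x, vx, bounces = 2.0 * width - x, -vx, bounces + 1
        if y < 0.0:
            y, vy, bounces = -y, -vy, bounces + 1
        elif y > height:
            y, vy, bounces = 2.0 * height - y, -vy, bounces + 1

        # Friction: shrink the speed, keeping the direction unchanged.
        new_speed = max(0.0, speed - friction * dt)
        scale = new_speed / speed
        vx *= scale
        vy *= scale
        speed = new_speed
    return x, y, bounces


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the scene is reproducible
    scene = random_scene(rng)
    print(simulate_ball(*scene))
```

Under these assumptions, the bounce count and the resting position returned by `simulate_ball` would serve as ground truth for wall-bounce and final-position questions. Longer rollouts accumulate more bounces, which is one plausible reason such questions get harder as simulation time grows.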
Key facts
- BilliardPhys-Bench is a benchmark for physical reasoning in synthetic billiards environments.
- It tests three abilities: predicting ball-to-ball collisions, wall bounces, and final ball positions.
- The procedural engine generates randomized scenarios with friction and elastic collisions.
- Evaluated MLLMs include GPT, Claude, Gemini, and Qwen families.
- Performance drops with longer simulation time and more complex scene geometry.
- A failure mode called "stasis bias" is observed: models predict no interaction when outcomes are hard to infer.
- Current MLLMs handle static images well but struggle with intuitive physical reasoning.
- The benchmark is introduced in arXiv paper 2605.30900.
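One way to put a number on the stasis bias mentioned above is to measure how often a model answers "no interaction" on examples where an interaction actually occurs. The sketch below is an assumption, not the paper's definition: the function name `stasis_bias_rate` and the label strings (`"none"`, `"collision"`, `"bounce"`) are hypothetical.

```python
def stasis_bias_rate(predictions, ground_truth, null_label="none"):
    """Fraction of true-interaction examples predicted as "no interaction".

    `null_label` is a hypothetical label meaning that nothing happens.
    Returns 0.0 when the ground truth contains no interactions at all.
    """
    positives = 0  # examples where an interaction really occurs
    missed = 0     # of those, how many the model called "no interaction"
    for pred, truth in zip(predictions, ground_truth):
        if truth != null_label:
            positives += 1
            if pred == null_label:
                missed += 1
    return missed / positives if positives else 0.0


if __name__ == "__main__":
    preds = ["none", "collision", "none", "none"]
    truth = ["collision", "collision", "none", "bounce"]
    print(stasis_bias_rate(preds, truth))
```

In this toy example, three of the four ground-truth labels are interactions and the model answers "none" on two of them, so the rate is 2/3. Restricting the denominator to true interactions keeps correct "no interaction" answers from being counted as bias.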
Entities
Institutions
- arXiv