ARTFEED — Contemporary Art Intelligence

VLMs Fail at Relative Camera Pose Estimation

ai-technology · 2026-05-01

A recent study indicates that vision-language models (VLMs) largely fail at relative camera pose estimation (RCPE) from image pairs, a core test of multi-view spatial reasoning. The researchers cast RCPE as a discrete verbal classification task and built two benchmarks: VRRPI-Bench, derived from real RGB-D frames with object-centric camera motion, and VRRPI-Diag, which isolates individual motion degrees of freedom. Humans reached 0.91 accuracy and the specialized geometric pipeline LoFTR reached 0.99, but the best VLM scored only 0.66, with most performing near random. The gap persists even though strong VLMs are close to saturating single-image spatial benchmarks. The models were also unstable under source-target reversal, i.e., swapping which image is treated as the source view (best-case consistency of 59.7%), and they remained weak even in simplified single-degree-of-freedom settings, especially for optical-axis motions such as roll and depth translation. These results point to a fundamental limitation in the cross-view spatial reasoning of current VLMs.
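To make the task concrete, here is a minimal sketch of how RCPE can be cast as discrete verbal classification, assuming 4x4 world-to-camera extrinsics and an illustrative label scheme (the study's exact vocabulary and bin thresholds are not reproduced here):

    import numpy as np

    def relative_pose(T_src, T_tgt):
        # T_src, T_tgt: 4x4 world-to-camera extrinsics (assumed convention).
        # The ground-truth relative pose maps the source camera frame to the
        # target camera frame: T_rel = T_tgt @ inv(T_src).
        return T_tgt @ np.linalg.inv(T_src)

    def verbalize(T_rel, t_eps=0.05):
        # Discretize translation into coarse verbal labels. Camera convention
        # assumed: +x right, +y down, +z forward (the optical axis).
        t = T_rel[:3, 3]
        names = [("right", "left"), ("down", "up"), ("forward", "backward")]
        labels = {}
        for value, (pos, neg), axis in zip(t, names, "xyz"):
            labels[axis] = "none" if abs(value) < t_eps else (pos if value > 0 else neg)
        # Overall rotation angle recovered from the trace of the rotation block.
        R = T_rel[:3, :3]
        cos_angle = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
        labels["rotation_deg"] = float(np.degrees(np.arccos(cos_angle)))
        return labels

A model is then scored on whether its verbal answer matches these discrete labels, which is presumably what makes the human, LoFTR, and VLM accuracies directly comparable.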
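For contrast, the 0.99 geometric baseline follows the classical matching-plus-epipolar-geometry recipe. A sketch, assuming kornia's LoFTR implementation and OpenCV's pose recovery (specific library choices, not necessarily the study's exact pipeline):

    import cv2
    import torch
    import kornia.feature as KF

    def geometric_relative_pose(img0, img1, K):
        # img0, img1: grayscale tensors of shape (1, 1, H, W), values in [0, 1].
        # K: 3x3 camera intrinsics as a NumPy array.
        matcher = KF.LoFTR(pretrained="outdoor")
        with torch.no_grad():
            out = matcher({"image0": img0, "image1": img1})
        kpts0 = out["keypoints0"].cpu().numpy()
        kpts1 = out["keypoints1"].cpu().numpy()
        # Essential matrix from the dense matches, then decompose into R, t.
        # t is recovered only up to scale, as always in two-view geometry.
        E, mask = cv2.findEssentialMat(kpts0, kpts1, K, method=cv2.RANSAC,
                                       prob=0.999, threshold=1.0)
        _, R, t, _ = cv2.recoverPose(E, kpts0, kpts1, K, mask=mask)
        return R, t

The discrete verbal labels then fall out of R and t exactly as in the verbalization sketch above.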

Key facts

  • VLMs struggle with relative camera pose estimation from image pairs.
  • Humans achieve 0.91 accuracy on the task.
  • Specialized geometric pipeline LoFTR achieves 0.99 accuracy.
  • Best VLM reaches only 0.66 accuracy.
  • Most VLMs perform near random on the task.
  • VLMs are unstable under source-target reversal (best 59.7% consistency; see the scoring sketch after this list).
  • Weakness persists in simplified single-degree-of-freedom settings.
  • Optical-axis motions like roll and depth translation are particularly challenging.
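The reversal-consistency figure can be read as follows: swapping the two images should invert every predicted motion. A minimal scoring sketch, assuming a discrete label set with a fixed inverse mapping (illustrative, not the study's exact vocabulary):

    # Inverse label pairs under source-target reversal: a camera that moved
    # left going A -> B must have moved right going B -> A.
    INVERSE = {"left": "right", "right": "left", "up": "down", "down": "up",
               "forward": "backward", "backward": "forward", "none": "none"}

    def reversal_consistent(pred_ab, pred_ba):
        # A pair of predictions is consistent when the B -> A answer is the
        # exact inverse of the A -> B answer.
        return INVERSE.get(pred_ab) == pred_ba

    def consistency_rate(prediction_pairs):
        # Fraction of image pairs whose two predictions are mutually
        # consistent; the best VLM reaches only 59.7% by such a measure.
        return sum(reversal_consistent(a, b)
                   for a, b in prediction_pairs) / len(prediction_pairs)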
