ERGeoBench: Benchmarking Embodied Geo-Localization in Multimodal LLMs

ai-technology · 2026-06-01

Researchers have introduced ERGeoBench, a diagnostic benchmark for evaluating how well multimodal large language models (MLLMs) perform vision-driven geo-localization. It comprises 2,207 globally distributed street-view panoramas and tests models in three settings: single-view, panorama-view, and embodied-view, in which an agent can change its perspective by adjusting yaw, pitch, and zoom. The benchmark targets four abilities: foundational perception, spatial awareness, common sense reasoning, and geo-localization reasoning. Initial tests on a range of MLLMs show that the models can infer high-level geographic semantics but struggle with fine-grained perceptual operations, metric localization, and maintaining spatial consistency across views. The benchmark addresses a gap in fine-grained evaluation of embodied geo-localization, an area that has not yet been explored in depth.
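To make the embodied-view setting more concrete, the following is a minimal sketch of how an agent's viewpoint could be tracked as it adjusts yaw, pitch, and zoom. It is not taken from ERGeoBench: the names `ViewState` and `adjust_view`, the pitch limit, and the zoom range are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViewState:
    """Hypothetical camera state inside a panorama (not the benchmark's API)."""
    yaw_deg: float = 0.0    # heading, wrapped into [0, 360)
    pitch_deg: float = 0.0  # tilt, clamped to +/- pitch_limit
    zoom: float = 1.0       # magnification, clamped to zoom_range


def adjust_view(state, d_yaw=0.0, d_pitch=0.0, zoom_factor=1.0,
                pitch_limit=90.0, zoom_range=(1.0, 4.0)):
    """Return a new ViewState after one yaw/pitch/zoom action.

    The limits are assumed values; a real environment would define its own.
    """
    # Yaw is circular, so turning past 360 degrees wraps around.
    yaw = (state.yaw_deg + d_yaw) % 360.0
    # Pitch and zoom are bounded, so out-of-range requests are clamped.
    pitch = max(-pitch_limit, min(pitch_limit, state.pitch_deg + d_pitch))
    zoom = max(zoom_range[0], min(zoom_range[1], state.zoom * zoom_factor))
    return ViewState(yaw, pitch, zoom)


if __name__ == "__main__":
    view = ViewState()
    view = adjust_view(view, d_yaw=90.0)        # turn right
    view = adjust_view(view, d_pitch=-15.0)     # look slightly down
    view = adjust_view(view, zoom_factor=2.0)   # zoom in
    print(view)
```

Keeping the state immutable and returning a new `ViewState` per action makes it straightforward to log the sequence of views an agent visited, which is one way spatial consistency across views could be inspected.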

Key facts

ERGeoBench is a diagnostic benchmark for vision-driven embodied geo-localization.
It contains 2,207 globally distributed street-view panoramas.
It evaluates models under single-view, panorama-view, and embodied-view settings.
It measures foundational perception, spatial awareness, common sense reasoning, and geo-localization reasoning.
Current MLLMs can infer high-level geographic semantics but struggle with fine-grained operations.
Models have difficulty with metric localization and spatial consistency across views.
The benchmark fills a gap in fine-grained evaluation for embodied geo-localization.
The study was published on arXiv with ID 2605.31251.
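
Metric localization, noted in the key facts above as a weak point for current models, is commonly scored by the great-circle distance between a predicted and a true coordinate. The summary does not say which metric ERGeoBench uses, so the sketch below simply assumes the standard haversine formula; the function name `haversine_km` and the example coordinates are illustrative only.

```python
import math

# Mean Earth radius in kilometres (spherical approximation).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


if __name__ == "__main__":
    # Illustrative only: a prediction in Paris scored against a true location in London.
    predicted = (48.8566, 2.3522)
    truth = (51.5074, -0.1278)
    print(f"localization error: {haversine_km(*predicted, *truth):.1f} km")
```

A model that correctly names the country can still be hundreds of kilometres off under a measure like this, which is the kind of gap between high-level geographic semantics and precise localization that the benchmark reports.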
