ARTFEED — Contemporary Art Intelligence

OMIBench: New Benchmark Tests Olympiad-Level Multi-Image Reasoning in LVLMs

other · 2026-04-24

Researchers have introduced OMIBench, a benchmark designed to evaluate large vision-language models (LVLMs) on Olympiad-level reasoning tasks that require integrating information across multiple images. The benchmark draws problems from biology, chemistry, mathematics, and physics Olympiads, accompanied by manually annotated rationales and evaluation protocols for both exact and semantic answer matching. Experiments reveal significant performance gaps among current models: the strongest LVLM, Gemini-3-Pro, achieves only about 50% accuracy. OMIBench addresses a gap left by existing Olympiad-level multimodal benchmarks, which focus on single-image analysis, and provides a resource for studying and improving multi-image reasoning in LVLMs.
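The two evaluation protocols mentioned above can be sketched as follows. The paper's actual implementation is not specified here, so the `normalize` helper, the token-overlap (Jaccard) stand-in for semantic matching, and the threshold value are illustrative assumptions; real semantic-matching protocols often rely on an LLM judge or embedding similarity instead.

```python
import re

def normalize(ans: str) -> str:
    """Lowercase, trim, collapse whitespace, and strip trailing periods."""
    s = re.sub(r"\s+", " ", ans.strip().lower())
    return s.rstrip(".")

def exact_match(pred: str, gold: str) -> bool:
    """Exact matching: normalized answer strings must be identical."""
    return normalize(pred) == normalize(gold)

def semantic_match(pred: str, gold: str, threshold: float = 0.6) -> bool:
    """Hypothetical stand-in for semantic matching: Jaccard overlap
    between token sets of the normalized answers."""
    p = set(normalize(pred).split())
    g = set(normalize(gold).split())
    if not p or not g:
        return False
    return len(p & g) / len(p | g) >= threshold
```

For example, `exact_match("42 J", "42 j")` accepts a casing difference, while `semantic_match` tolerates small phrasing differences such as an extra leading word in an otherwise identical answer.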

Key facts

  • OMIBench evaluates Olympiad-level reasoning across multiple images.
  • It covers biology, chemistry, mathematics, and physics Olympiads.
  • Includes manually annotated rationales and evaluation protocols.
  • Gemini-3-Pro achieves only about 50% accuracy on OMIBench.
  • Existing models show meaningful performance gaps.
  • Current Olympiad benchmarks emphasize single-image analysis.
  • Problems are designed so that solving them requires integrating contextual information across images.
  • The benchmark is a focused resource for multi-image reasoning.
