ARTFEED — Contemporary Art Intelligence

GPT-4o and Other Multimodal Models Benchmarked on Standard Vision Tasks

ai-technology · 2026-05-04

A recent study published on arXiv evaluates several multimodal foundation models (MFMs), including GPT-4o, o4-mini, Gemini 1.5 Pro, Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, and Llama 3.2, on standard computer vision tasks: semantic segmentation, object detection, image classification, and depth and surface-normal prediction. Benchmarking on well-known datasets such as COCO and ImageNet runs into two practical obstacles: the models primarily generate text and cannot natively express outputs like segmentation masks or 3D geometry, and many are proprietary and accessible only through APIs. To work around both constraints, the authors translate the vision tasks into text-promptable formats via prompt chaining, establishing a consistent evaluation framework across models.
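
The paper's exact prompting code is not reproduced here, but the core idea of prompt chaining, breaking a vision task into a sequence of small text questions, can be illustrated with a short sketch. The snippet below narrows an ImageNet-style label set in stages, first keeping the most plausible candidate from each small batch of categories and then asking one final question over the survivors; the `ask_model` callable, batch size, and prompt wording are illustrative assumptions standing in for any API-only model, not the study's actual pipeline.

```python
from typing import Callable, Sequence

# Placeholder for an API-backed multimodal model call (GPT-4o, Gemini, Claude, ...).
# Takes a text prompt and raw image bytes, returns the model's text answer.
AskModel = Callable[[str, bytes], str]

def classify_by_chaining(image: bytes, labels: Sequence[str],
                         ask_model: AskModel, batch_size: int = 20) -> str:
    """Chain text prompts to pick one label out of a large label set.

    Stage 1: split the labels into small batches and ask the model to keep
             the most plausible candidate from each batch.
    Stage 2: ask a single final question over the surviving candidates.
    """
    candidates = list(labels)
    # Repeatedly shrink the candidate set until it fits in one prompt.
    while len(candidates) > batch_size:
        survivors = []
        for i in range(0, len(candidates), batch_size):
            batch = candidates[i:i + batch_size]
            prompt = ("Which of these categories best matches the image? "
                      "Answer with exactly one category.\n" + "\n".join(batch))
            answer = ask_model(prompt, image).strip()
            # Keep the model's pick if it is a valid label, else fall back to the batch head.
            survivors.append(answer if answer in batch else batch[0])
        candidates = survivors
    final_prompt = ("Choose the single best category for the image:\n"
                    + "\n".join(candidates))
    answer = ask_model(final_prompt, image).strip()
    return answer if answer in candidates else candidates[0]
```

Answers produced this way can then be scored against ImageNet ground truth in the same way as a conventional classifier's top-1 predictions.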

Key facts

  • Paper benchmarks GPT-4o, o4-mini, Gemini 1.5 Pro, Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, Llama 3.2.
  • Tasks include semantic segmentation, object detection, image classification, depth and surface normal prediction.
  • Uses COCO and ImageNet datasets.
  • Models output only text and cannot natively express segmentation masks or 3D geometry; dense outputs must be rebuilt from text answers (see the segmentation sketch after this list).
  • Many models are proprietary with API-only access.
  • Prompt chaining used to translate vision tasks into text-promptable formats.
  • Published on arXiv with ID 2507.01955.
  • Study evaluates detailed visual understanding beyond question answering.
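
Because the models emit only text, dense predictions such as segmentation masks have to be reassembled from many small answers. The sketch below shows one plausible decomposition, assumed for illustration rather than taken from the paper: oversegment the image into SLIC superpixels, ask the model to classify a crop of each region, and paint the answers back into a label map. The `ask_model` placeholder, the superpixel granularity, and the prompt wording are assumptions.

```python
import numpy as np
from skimage.segmentation import slic

def segment_by_chaining(image: np.ndarray, classes: list[str],
                        ask_model, n_segments: int = 100) -> np.ndarray:
    """Build a dense label map from per-region text answers.

    image:   H x W x 3 uint8 array.
    returns: H x W array of class indices into `classes`.
    """
    # Oversegment the image into superpixels, then query each region separately.
    regions = slic(image, n_segments=n_segments, compactness=10, start_label=0)
    label_map = np.zeros(regions.shape, dtype=np.int64)
    for region_id in np.unique(regions):
        mask = regions == region_id
        ys, xs = np.nonzero(mask)
        crop = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
        prompt = ("Name the single class that best describes this image crop. "
                  "Choose from: " + ", ".join(classes))
        answer = ask_model(prompt, crop).strip().lower()
        # Fall back to class 0 if the model's answer is not a valid class name.
        class_idx = classes.index(answer) if answer in classes else 0
        label_map[mask] = class_idx
    return label_map
```

A label map rebuilt this way can be scored with standard metrics such as mean IoU against COCO-style annotations, which keeps the comparison to conventional vision models on equal terms.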

Entities

Institutions

  • arXiv

Sources