LLMs Struggle with Multimodal Physics Problems

ai-technology · 2026-05-07

A study posted on arXiv evaluates three large language models (Claude, Gemini, and ChatGPT) on multimodal physics problems drawn from the OpenStax database. All models achieved 96% accuracy on text-only problems, but performance dropped substantially on multimodal tasks. The authors develop an empirical error taxonomy and test a structured dialogue intervention aimed at the models' multimodal processing limitations.

Key facts

Study evaluates LLMs on multimodal physics problems
Models tested: Claude, Gemini, ChatGPT
Problems from OpenStax database
96% accuracy on text-only problems
Performance declined on multimodal problems
Empirical error taxonomy developed
Structured multimodal dialogue intervention tested
arXiv paper ID: 2605.04131
