LLMs Struggle with Multimodal Physics Problems
A study on arXiv evaluates three large language models (Claude, Gemini, and ChatGPT) on multimodal physics problems drawn from the OpenStax database. All models achieved 96% accuracy on text-only problems, but performance dropped substantially on multimodal tasks. The authors develop an empirical error taxonomy and test a structured dialogue intervention intended to address the models' multimodal processing limitations.
Key facts
- Study evaluates LLMs on multimodal physics problems
- Models tested: Claude, Gemini, ChatGPT
- Problems from OpenStax database
- 96% accuracy on text-only problems
- Performance dropped substantially on multimodal problems
- Empirical error taxonomy developed
- Structured multimodal dialogue intervention tested
- arXiv paper ID: 2605.04131
Entities
Institutions
- OpenStax
- arXiv