Higher Observation Fidelity Hurts Embodied LLM Problem Solving

ai-technology · 2026-05-20

A new study on arXiv (2605.20072) reports that large language models (LLMs) used in robotic systems perform worse with perfect observations than with raw RGB images. The researchers evaluated LLM agents on the Lockbox, a complex puzzle with hidden connections, comparing RGB, RGB-D, and ground-truth symbolic observations in a real-world robotic setting. Agents given only raw RGB images outperformed those given perfect data. In a controlled simulation, randomly flipping perceived outcomes also improved performance: success peaked at a flip probability of 40%, a 2.85-fold increase. The results challenge the assumption that better observation quality always helps in these tasks.

Key facts

Study published on arXiv with ID 2605.20072
LLMs used as cognitive components for robotic systems
Lockbox puzzle used for evaluation
RGB, RGB-D, and ground-truth symbolic observations tested
Raw RGB input yielded best performance
Perfect ground-truth observations yielded worst performance
Moderate noise (40% flip probability) improved success rate 2.85-fold in simulation
Controlled simulation used to probe behavior
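
The "flip probability" in the simulation result can be illustrated with a minimal Python sketch. It assumes each perceived outcome is a binary success/failure signal that is inverted independently with probability flip_prob. The function names and the binary encoding are illustrative assumptions, not the study's code.

```python
import random


def perceive(true_outcome: bool, flip_prob: float, rng: random.Random) -> bool:
    """Return the outcome the agent sees (hypothetical helper).

    With probability flip_prob the true outcome is inverted; otherwise it
    is passed through unchanged.
    """
    if not 0.0 <= flip_prob <= 1.0:
        raise ValueError("flip_prob must be in [0, 1]")
    # rng.random() is uniform on [0, 1), so 0.0 never flips and 1.0 always flips.
    if rng.random() < flip_prob:
        return not true_outcome
    return true_outcome


def observed_flip_rate(flip_prob: float, trials: int = 10_000, seed: int = 0) -> float:
    """Estimate how often a true 'success' is perceived as 'failure'."""
    rng = random.Random(seed)
    flips = sum(1 for _ in range(trials) if not perceive(True, flip_prob, rng))
    return flips / trials


if __name__ == "__main__":
    # 0.4 is the flip probability the summary reports as optimal.
    print(observed_flip_rate(0.4))
```

At a flip probability of 0.4, roughly four in ten perceived outcomes differ from what actually happened. That is the noise level at which the summary reports the 2.85-fold gain.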
