Higher Observation Fidelity Hurts Embodied LLM Problem Solving
A new study on arXiv (2605.20072) reports that large language models (LLMs) used as the cognitive component of a robotic system can solve problems worse when given perfect observations than when given raw RGB images. The researchers evaluated LLM agents on the Lockbox, a complex puzzle with hidden connections, comparing RGB, RGB-D, and ground-truth symbolic observations in a real-world robotic setting. The agents given only raw RGB images performed best, while those given ground-truth observations performed worst. In a controlled simulation, randomly flipping the outcomes the agent perceived also improved performance: at the optimal flip probability of 40%, the success rate rose 2.85-fold. The findings challenge the assumption that better observation quality always helps in these tasks.
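To make the idea of a "flip probability" concrete, here is a minimal Python sketch of one way such noise could be injected. It is an illustration under stated assumptions, not the study's code: it assumes each perceived outcome is a binary success/failure signal, and the names `perceive` and `empirical_flip_rate` are hypothetical.

```python
import random


def perceive(true_outcome: bool, flip_prob: float, rng: random.Random) -> bool:
    """Return the outcome the agent sees.

    Assumption (not from the study): outcomes are binary, and noise
    inverts the true outcome with probability `flip_prob`.
    """
    if not 0.0 <= flip_prob <= 1.0:
        raise ValueError("flip_prob must be between 0 and 1")
    # rng.random() is uniform on [0, 1), so this branch fires with
    # probability flip_prob.
    if rng.random() < flip_prob:
        return not true_outcome
    return true_outcome


def empirical_flip_rate(flip_prob: float, trials: int, seed: int = 0) -> float:
    """Estimate how often a true success is reported to the agent as a failure."""
    rng = random.Random(seed)
    flipped = sum(1 for _ in range(trials) if not perceive(True, flip_prob, rng))
    return flipped / trials


if __name__ == "__main__":
    # With flip_prob = 0.4, roughly 40% of true successes are shown as failures.
    print(empirical_flip_rate(0.4, trials=10_000))
```

A flip probability of 0 corresponds to the agent always seeing the true outcome, while a value of 1 inverts every outcome. The 40% setting reported in the study sits between those extremes, which is what makes the reported improvement surprising.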
Key facts
- Study published on arXiv with ID 2605.20072
- LLMs used as cognitive components for robotic systems
- Lockbox puzzle used for evaluation
- RGB, RGB-D, and ground-truth symbolic observations tested
- Raw RGB input yielded best performance
- Perfect ground-truth observations yielded worst performance
- Moderate noise (40% flip probability) improved success rate 2.85-fold
- Controlled simulation used to probe behavior
Entities
Institutions
- arXiv