Study Finds AI Scientific Agents Ignore Evidence in 68% of Reasoning Traces
A new study posted to arXiv (2604.18805v1) reports that large language model (LLM)-based systems deployed for autonomous scientific research frequently violate core epistemic norms. Analyzing more than 25,000 agent runs across eight scientific domains, the researchers found that agents disregarded evidence in 68% of reasoning traces. The study combined two complementary methodologies: a systematic performance analysis separating the contribution of the base model from that of the agent scaffold, and a behavioral analysis of the epistemological structure of agent reasoning. The base model accounted for 41.4% of the explained variance in both performance and behavior, versus just 1.5% for the scaffold. Refutation-driven belief revision occurred in only 26% of cases, and convergent evidence from multiple tests remained rare. The findings call into question whether LLM-based scientific agents follow the self-correcting principles essential to scientific inquiry, particularly in workflow execution and hypothesis-driven investigation.
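The summary above attributes 41.4% of explained variance to the base model and 1.5% to the scaffold, but does not spell out how such an attribution is computed. A common approach is a factor-wise variance decomposition (eta-squared from a two-factor ANOVA); the sketch below illustrates that idea only. It is not the paper's actual analysis, and the column names, factor levels, and toy data are all illustrative assumptions.

```python
# Minimal sketch (assumption, not the study's method): attribute explained
# variance in per-run performance to "base model" vs. "scaffold" factors
# using eta-squared from a two-factor ANOVA.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Toy stand-in for per-run results: each row is one agent run.
runs = pd.DataFrame({
    "base_model":  ["model_a", "model_a", "model_b", "model_b"] * 25,
    "scaffold":    ["scaffold_x", "scaffold_y"] * 50,
    "performance": pd.Series(range(100)) / 100.0,  # placeholder scores
})

# Two-factor linear model, then partition the sum of squares by factor.
fit = ols("performance ~ C(base_model) + C(scaffold)", data=runs).fit()
anova = sm.stats.anova_lm(fit, typ=2)

# Eta-squared: each factor's sum of squares over the total sum of squares,
# i.e. the share of variance in performance that the factor explains.
eta_sq = anova["sum_sq"] / anova["sum_sq"].sum()
print(eta_sq.rename("explained_variance_share"))
```

Run on the study's real data, the two factor rows of such an output would correspond to the reported 41.4% (base model) and 1.5% (scaffold) shares, with the remainder unexplained residual variance.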
Key facts
- Study published as arXiv:2604.18805v1
- Analyzed over 25,000 LLM-based agent runs
- Evidence ignored in 68% of reasoning traces
- Base model accounted for 41.4% of explained variance
- Agent scaffold accounted for 1.5% of explained variance
- Refutation-driven belief revision occurred in 26% of cases
- Convergent multi-test evidence was rare
- Examined eight scientific domains