CMBAgent Fails Silently in Astrophysical Tasks
A recent study evaluating CMBAgent across 18 astrophysical tasks finds that the agent excels at well-defined problems but often produces plausible yet incorrect outputs without the ability to self-correct. In the One-Shot setting, incorporating domain-specific context improves performance roughly sixfold (0.85 versus ~0 without context). The dominant failure mode, however, is silent incorrect computation: syntactically valid code that runs cleanly yet yields erroneous results. In the Deep Research setting, the system fails silently under stress testing, producing physically inconsistent outcomes. The study's central warning is that the most alarming failure in agentic scientific workflows is not overt error but seemingly valid yet wrong conclusions.
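To make the "silent incorrect computation" failure mode concrete, here is a minimal hypothetical sketch (not taken from the study): code that is syntactically valid, runs without any error, and returns a plausible-looking number, yet is physically wrong because of a missing unit conversion.

```python
# Hypothetical illustration of a silent incorrect computation.
# Both versions run cleanly; only one is physically meaningful.

H0 = 70.0  # Hubble constant, km/s/Mpc

# Buggy version: inverts H0 directly, ignoring units entirely.
# Produces a positive, plausible-looking number with units of
# (s * Mpc) / km rather than a time in any standard unit.
hubble_time_wrong = 1.0 / H0

# Correct version: convert Mpc to km so 1/H0 comes out in seconds,
# then convert seconds to gigayears.
KM_PER_MPC = 3.0857e19   # kilometers per megaparsec
SEC_PER_GYR = 3.1557e16  # seconds per gigayear
hubble_time_gyr = KM_PER_MPC / H0 / SEC_PER_GYR  # ~13.97 Gyr

print(hubble_time_wrong, hubble_time_gyr)
```

No exception is raised on either path, which is exactly why this class of error evades detection without an external physical-consistency check.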
Key facts
- CMBAgent was evaluated across two workflow paradigms and 18 astrophysical tasks.
- In the One-Shot setting, domain-specific context yields ~6x performance improvement (0.85 vs. ~0 without context).
- Primary failure mode is silent incorrect computation: syntactically valid code producing plausible but inaccurate results.
- In the Deep Research setting, the system frequently exhibits silent failures across stress tests.
- The system produces physically inconsistent posteriors without self-diagnosis.
- Performance degrades on problems designed to probe reasoning limits, often without visible error signals.
- The most concerning failure mode is not overt errors but plausible yet wrong results.
- The study is published on arXiv with ID 2604.25345.
Entities
Institutions
- arXiv