XDomainBench Diagnoses Reasoning Collapse in LLMs for Science
A new benchmark, XDomainBench, reveals that large language models (LLMs) suffer systematic reasoning collapse when composing knowledge across scientific disciplines. Introduced in an arXiv paper (2605.14754), the benchmark simulates interactive interdisciplinary scientific workflows through 8,598 sessions spanning 20 domains and 4 task categories. It formalizes composition order and mixture structure to stress-test models from single-discipline to interdisciplinary reasoning. Evaluations show that as composition order increases, LLM reasoning collapses, a failure the authors attribute to two root causes. The study addresses a gap in existing benchmarks, which focus on single-turn scenarios and fail to capture the complexity of real-world AI4S (AI for Science) applications.
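The summary names composition order and mixture structure but does not define them. A minimal illustrative sketch, assuming (hypothetically, not from the paper) that a session is an ordered sequence of domain-tagged turns and that composition order counts the distinct domains those turns draw on:

```python
from dataclasses import dataclass, field

@dataclass
class Session:
    # One interactive session: an ordered sequence of (domain, task) turns.
    # The Session class and its fields are illustrative assumptions,
    # not XDomainBench's actual data format.
    turns: list[tuple[str, str]] = field(default_factory=list)

    @property
    def composition_order(self) -> int:
        # Assumed definition: number of distinct domains a session composes.
        # Order 1 = single-discipline; higher orders = interdisciplinary.
        return len({domain for domain, _ in self.turns})

# Example: a session mixing chemistry and biology has composition order 2.
s = Session(turns=[("chemistry", "derivation"),
                   ("biology", "analysis"),
                   ("chemistry", "verification")])
print(s.composition_order)  # 2
```

Under this reading, the reported collapse would mean accuracy degrading as `composition_order` rises from 1 toward the interdisciplinary end of the 20 domains.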
Key facts
- XDomainBench is a diagnostic benchmark for interactive interdisciplinary scientific reasoning.
- It comprises 8,598 interactive sessions across 20 domains and 4 task categories.
- The benchmark includes 8 realistic trajectory patterns covering difficulty and domain-mixture dynamics.
- Large-scale evaluation of LLMs reveals systematic reasoning collapse as composition order increases.
- The collapse stems from two root causes (not specified in the abstract).
- Existing benchmarks primarily focus on single-turn restricted scenarios.
- The benchmark simulates real-world AI4S scenarios.
- The paper is available on arXiv with ID 2605.14754.
Entities
Institutions
- arXiv (preprint repository hosting the paper)