XDomainBench Diagnoses Reasoning Collapse in LLMs for Science
A new benchmark, XDomainBench, reveals that large language models (LLMs) suffer systematic reasoning collapse when composing knowledge across scientific disciplines. Introduced in an arXiv paper (2605.14754), the benchmark simulates interactive interdisciplinary scientific workflows through 8,598 sessions spanning 20 domains and 4 task categories. It formalizes composition order and mixture structure to stress-test models from single-discipline to interdisciplinary reasoning. Evaluations show that as composition order increases, LLM reasoning collapses, a failure the authors attribute to two root causes. The study addresses a gap in existing benchmarks, which focus on single-turn scenarios and fail to capture the complexity of real-world AI4S (AI for Science) applications.
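The summary names composition order and mixture structure but does not define them. A minimal illustrative sketch, assuming (hypothetically, not from the paper) that a session is an ordered sequence of domain-tagged turns and that composition order counts the distinct domains those turns draw on:

```python
from dataclasses import dataclass, field

@dataclass
class Session:
    # One interactive session: an ordered sequence of (domain, task) turns.
    # The Session class and its fields are illustrative assumptions,
    # not XDomainBench's actual data format.
    turns: list[tuple[str, str]] = field(default_factory=list)

    @property
    def composition_order(self) -> int:
        # Assumed definition: number of distinct domains a session composes.
        # Order 1 = single-discipline; higher orders = interdisciplinary.
        return len({domain for domain, _ in self.turns})

# Example: a session mixing chemistry and biology has composition order 2.
s = Session(turns=[("chemistry", "derivation"),
                   ("biology", "analysis"),
                   ("chemistry", "verification")])
print(s.composition_order)  # 2
```

Under this reading, the reported collapse would mean accuracy degrading as `composition_order` rises from 1 toward the interdisciplinary end of the 20 domains.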
Key facts
- XDomainBench is a diagnostic benchmark for interactive interdisciplinary scientific reasoning.
- It comprises 8,598 interactive sessions across 20 domains and 4 task categories.
- The benchmark includes 8 realistic trajectory patterns covering difficulty and domain-mixture dynamics.
- Large-scale evaluation of LLMs reveals systematic reasoning collapse as composition order increases.
- The collapse stems from two root causes (not specified in the abstract).
- Existing benchmarks primarily focus on single-turn restricted scenarios.
- The benchmark simulates real-world AI4S scenarios.
- The paper is available on arXiv with ID 2605.14754.
Entities
Institutions
- arXiv (preprint repository hosting the paper)