NoisyCausal Benchmark Tests LLM Causal Reasoning Under Noise
Researchers have introduced NoisyCausal, a new benchmark for assessing the causal reasoning capabilities of large language models (LLMs) under structured noise. Each benchmark instance is derived from a ground-truth causal graph and embedded in a natural-language context, with controllable noise types: irrelevant distractors, value perturbations, confounding, and partial observability. The authors also propose a modular reasoning framework that combines LLMs with explicit causal structure to tackle these challenges. The benchmark's goal is to evaluate whether LLMs can distinguish correlation from causation when faced with perturbed observations or irrelevant information. The research is described in a paper available on arXiv (ID 2605.04313).
Key facts
- NoisyCausal is a new benchmark for evaluating causal reasoning under structured noise.
- Each instance is generated from a ground-truth causal graph.
- Noise types include irrelevant distractors, value perturbations, confounding, and partial observability.
- A modular reasoning framework combining LLMs with explicit causal structure is proposed.
- LLMs struggle to disentangle correlation from causation under noisy conditions.
- The benchmark is designed to test causal reasoning in natural language scenarios.
- The paper is available on arXiv with ID 2605.04313.
- The research focuses on evaluating LLMs' causal reasoning abilities.
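To make the noise types concrete, here is a minimal sketch of how a NoisyCausal-style instance could be generated: a clean observation is sampled from a toy ground-truth causal graph, then distractors, value perturbations, and partial observability are applied. The graph, variable names, and noise parameters are illustrative assumptions, not the paper's actual implementation.

```python
import random

def sample_clean(rng):
    """Sample one observation from a toy ground-truth causal graph:
    rain -> wet_ground -> slippery (hypothetical example)."""
    rain = rng.random() < 0.3
    wet_ground = rain or (rng.random() < 0.1)      # exogenous cause of wetness
    slippery = wet_ground and (rng.random() < 0.8)  # effect of wet ground
    return {"rain": rain, "wet_ground": wet_ground, "slippery": slippery}

def add_noise(obs, rng, distractors=2, flip_p=0.1, hide_p=0.2):
    """Apply three of the controllable noise types described above."""
    noisy = dict(obs)
    # 1. Irrelevant distractors: variables with no causal role.
    for i in range(distractors):
        noisy[f"distractor_{i}"] = rng.random() < 0.5
    # 2. Value perturbations: randomly flip some recorded values.
    for k in obs:
        if rng.random() < flip_p:
            noisy[k] = not noisy[k]
    # 3. Partial observability: drop some true variables entirely.
    for k in list(obs):
        if rng.random() < hide_p:
            del noisy[k]
    return noisy

rng = random.Random(0)
instance = add_noise(sample_clean(rng), rng)
```

A model queried on such an instance would then be asked, in natural language, whether (say) rain causes the ground to be slippery, with the distractor variables and missing values making a purely correlational answer unreliable.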
Entities
Institutions
- arXiv