LongDS-Bench: Benchmarking Long-Horizon Data Analysis Agents

ai-technology · 2026-06-01

Researchers have introduced LongDS-Bench, a benchmark for evaluating AI agents on long-horizon, multi-turn data analysis. It comprises 68 tasks drawn from real-world Kaggle notebooks, totaling 2,225 turns across six domains, including Business, Geoscience, and Education. Agents must maintain, update, restore, and create analytical states that evolve over the course of a task, and the average dependency span is 11.3 turns. Across the five advanced models tested, the best average accuracy was only 48.45%, and performance fell by nearly 47 points from early to late turns. Long-horizon errors accounted for 52%–69% of failures, which underscores how hard it is to preserve analytical context over prolonged interactions.
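To make the "average dependency span" figure concrete, here is a minimal Python sketch of how such a statistic could be computed. The summary does not give the benchmark's exact definition, so this assumes that a turn's span is the distance back to the earliest prior turn it depends on. The function name `dependency_spans` and the toy dependency map are hypothetical and are not taken from LongDS-Bench.

```python
from statistics import mean


def dependency_spans(depends_on):
    """Return {turn: span} for every turn that has at least one dependency.

    Assumption (not from the benchmark): the span of a turn is the distance,
    in turns, back to the earliest prior turn it depends on.
    """
    spans = {}
    for turn, deps in depends_on.items():
        if deps:  # turns with no dependencies contribute no span
            spans[turn] = turn - min(deps)
    return spans


# Hypothetical toy task: turn number -> earlier turns whose state it needs.
toy_task = {1: [], 2: [1], 3: [1, 2], 4: [3], 5: [1]}

spans = dependency_spans(toy_task)
average_span = mean(spans.values())
print(spans)         # per-turn spans
print(average_span)  # average over turns that have dependencies
```

Under this assumed definition, an average of 11.3 would mean that a typical turn relies on state established about eleven turns earlier, which is the kind of context an agent has to carry forward.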

Key facts

LongDS-Bench evaluates AI agents on long-horizon, multi-turn data analysis.
The benchmark includes 68 tasks from real-world Kaggle notebooks.
Tasks span 2,225 turns across six domains, including Geoscience, Business, and Education.
Average dependency span is 11.3 turns.
Best model achieved 48.45% average accuracy.
Performance dropped nearly 47 points from early to late turns.
Long-horizon errors account for 52%–69% of failures.
Tasks involve state-evolution patterns like counterfactual perturbation and rollback.
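The state-evolution patterns named in the last item can be illustrated with a small Python sketch. It is only an illustration of what rollback and counterfactual perturbation mean for an analysis state, not the benchmark's implementation. The `AnalysisState` class, its methods, and the parameter names are all hypothetical.

```python
import copy


class AnalysisState:
    """Hypothetical container for the parameters an analysis depends on."""

    def __init__(self, params):
        self.params = dict(params)
        self._snapshots = {}  # turn number -> saved copy of params

    def checkpoint(self, turn):
        # Save a copy of the current state so a later turn can restore it.
        self._snapshots[turn] = copy.deepcopy(self.params)

    def update(self, **changes):
        # Ordinary evolution: later turns overwrite earlier settings.
        self.params.update(changes)

    def rollback(self, turn):
        # Rollback: restore the state exactly as it was at an earlier turn.
        self.params = copy.deepcopy(self._snapshots[turn])

    def counterfactual(self, **changes):
        # Counterfactual perturbation: a what-if copy; the real state is untouched.
        alt = copy.deepcopy(self.params)
        alt.update(changes)
        return alt


state = AnalysisState({"threshold": 0.5, "region": "all"})
state.checkpoint(1)                              # turn 1: remember the baseline
state.update(threshold=0.8)                      # turn 2: the user tightens a filter
what_if = state.counterfactual(region="north")   # turn 3: "what if only the north?"
current = dict(state.params)                     # the real state ignores the what-if
state.rollback(1)                                # turn 4: "go back to turn 1"
```

An agent answering turn 4 correctly has to use the turn 1 parameters rather than the most recent ones, which is where errors in tracking long-horizon state would show up.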
