ARTFEED — Contemporary Art Intelligence

AI Models Evaluated on System Dynamics Tasks: Cloud vs Local Performance on Causal Loop Diagrams

ai-technology · 2026-04-22

A new research paper systematically evaluates large language models for System Dynamics AI assistance, comparing proprietary cloud APIs with locally hosted open-source models. The study introduces two benchmarks: the CLD Leaderboard, with 53 tests for structured causal loop diagram extraction, and the Discussion Leaderboard, covering interactive model discussion, feedback explanation, and model-building coaching.

On CLD extraction tasks, cloud models achieved overall pass rates between 77% and 89%. The best-performing local model, Kimi K2.5 GGUF Q3 using a zero-shot engine, reached 77% accuracy, matching mid-tier cloud model performance. On Discussion tasks, local models showed mixed results: 50-100% on model-building steps and 47-75% on feedback explanation, but only 0-50% on error fixing.

The paper attributes the weak error-fixing performance to long-context prompts that expose memory limitations in local deployments. A key contribution is the systematic analysis of how model type affects performance across task categories. The research was published on arXiv with identifier 2604.18566v2.
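The leaderboard scores described above reduce to per-category pass-rate arithmetic (tests passed divided by tests run). A minimal sketch of that aggregation, using invented category names and results rather than the paper's actual data:

```python
# Hypothetical leaderboard-style pass-rate aggregation; the test results
# below are invented for illustration and are not from the paper.
from collections import defaultdict

# Each entry is (task_category, passed) for one benchmark test.
results = [
    ("cld_extraction", True), ("cld_extraction", False),
    ("cld_extraction", True), ("model_building", True),
    ("feedback_explanation", True), ("error_fixing", False),
]

totals = defaultdict(lambda: [0, 0])  # category -> [passed_count, total_count]
for category, passed in results:
    totals[category][0] += int(passed)
    totals[category][1] += 1

pass_rates = {cat: passed / total for cat, (passed, total) in totals.items()}
```

With this toy data, `pass_rates["cld_extraction"]` is 2/3; a real leaderboard would run all 53 CLD tests per model and report the category-level rates quoted above.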

Key facts

  • Systematic evaluation of large language models for System Dynamics AI assistance
  • Comparison of proprietary cloud APIs and locally-hosted open-source models
  • Two benchmarks: CLD Leaderboard (53 tests) and Discussion Leaderboard
  • Cloud models achieved 77-89% pass rates on CLD extraction
  • Best local model (Kimi K2.5 GGUF Q3) reached 77% on CLD extraction
  • Local models scored 50-100% on model-building steps in Discussion tasks
  • Local models scored only 0-50% on error fixing, attributed to long-context memory limitations
  • Research published on arXiv with identifier 2604.18566v2

Entities

Institutions

  • arXiv

Sources