LLMs tested on agent-based model replication from ODD specifications
A study evaluated 17 large language models (LLMs) on their ability to implement agent-based models from standardized ODD (Overview, Design concepts, Details) specifications, using the PPHPC predator-prey model as a benchmark. The generated Python implementations were assessed for executability, statistical agreement with a validated NetLogo baseline, and runtime efficiency and maintainability. The findings indicate that behaviorally faithful implementations are achievable but not guaranteed, and that executability alone is insufficient for scientific use. GPT-4.1 consistently produced statistically valid and efficient implementations, with Claude 3.7 Sonnet also performing well.
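As a rough illustration of the statistical-comparison step, the sketch below compares a simple focal measure (the mean of each output over all time steps) between replicate runs of a generated Python implementation and the NetLogo baseline using a Mann-Whitney U test. The directory layout, column names, focal measure, and significance level are illustrative assumptions, not the paper's exact protocol.

```python
# Hedged sketch of the statistical-comparison step. Directory layout,
# column names, and the focal measure (mean over time steps) are
# illustrative assumptions, not the study's exact protocol.
import glob

import pandas as pd
from scipy.stats import mannwhitneyu


def focal_measure(csv_path: str, column: str) -> float:
    """Reduce one replicate's time series to a single number:
    the mean of the given output column over all time steps."""
    return pd.read_csv(csv_path)[column].mean()


def outputs_match(column: str, alpha: float = 0.01) -> bool:
    """True if the focal measure from the generated Python runs is
    statistically indistinguishable from the NetLogo baseline runs."""
    python_runs = [focal_measure(p, column) for p in sorted(glob.glob("python_runs/*.csv"))]
    netlogo_runs = [focal_measure(p, column) for p in sorted(glob.glob("netlogo_runs/*.csv"))]
    _, p_value = mannwhitneyu(python_runs, netlogo_runs, alternative="two-sided")
    return p_value > alpha


if __name__ == "__main__":
    for col in ("prey_count", "predator_count", "grass_quantity"):
        verdict = "indistinguishable" if outputs_match(col) else "significantly different"
        print(f"{col}: {verdict}")
```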
Key facts
- 17 contemporary LLMs were evaluated on ODD-to-code translation
- PPHPC predator-prey model used as fully specified reference
- Generated Python implementations compared against validated NetLogo baseline
- GPT-4.1 consistently produced statistically valid and efficient implementations
- Claude 3.7 Sonnet also performed well
- Executability alone insufficient for scientific use (see the sketch after this list)
- Behaviorally faithful implementations achievable but not guaranteed
- Study published on arXiv (2602.10140)
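For the executability criterion noted above, a bare check might simply run the generated script and verify that it completes cleanly. The sketch below uses a hypothetical script name and a fixed timeout; a clean exit is only the weakest criterion and says nothing about behavioral fidelity to the ODD specification.

```python
# Hedged sketch of a bare executability check. The script name and
# timeout are assumptions; a clean exit says nothing about whether the
# model is behaviorally faithful to the ODD specification.
import subprocess
import sys


def runs_cleanly(script_path: str, timeout_s: int = 300) -> bool:
    """True if the generated script finishes with exit code 0 within
    the timeout; this is only the first, weakest evaluation criterion."""
    try:
        result = subprocess.run(
            [sys.executable, script_path],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print(runs_cleanly("generated_pphpc.py"))  # hypothetical file name
```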
Entities
Institutions
- arXiv