LLMs tested on agent-based model replication from ODD specifications
A study evaluated 17 large language models (LLMs) on their ability to implement agent-based models from standardized ODD (Overview, Design concepts, Details) specifications, using the PPHPC predator-prey model as a benchmark. The generated Python implementations were assessed for executability, statistical agreement with a validated NetLogo baseline, and runtime efficiency and maintainability. The findings indicate that behaviorally faithful implementations are achievable but not guaranteed, and that executability alone is insufficient for scientific use. GPT-4.1 consistently produced statistically valid and efficient implementations, with Claude 3.7 Sonnet also performing well.
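As a rough illustration of the statistical-comparison step, the sketch below compares a simple focal measure (the mean of each output over all time steps) between replicate runs of a generated Python implementation and the NetLogo baseline using a Mann-Whitney U test. The directory layout, column names, focal measure, and significance level are illustrative assumptions, not the paper's exact protocol.

```python
# Hedged sketch of the statistical-comparison step. Directory layout,
# column names, and the focal measure (mean over time steps) are
# illustrative assumptions, not the study's exact protocol.
import glob

import pandas as pd
from scipy.stats import mannwhitneyu


def focal_measure(csv_path: str, column: str) -> float:
    """Reduce one replicate's time series to a single number:
    the mean of the given output column over all time steps."""
    return pd.read_csv(csv_path)[column].mean()


def outputs_match(column: str, alpha: float = 0.01) -> bool:
    """True if the focal measure from the generated Python runs is
    statistically indistinguishable from the NetLogo baseline runs."""
    python_runs = [focal_measure(p, column) for p in sorted(glob.glob("python_runs/*.csv"))]
    netlogo_runs = [focal_measure(p, column) for p in sorted(glob.glob("netlogo_runs/*.csv"))]
    _, p_value = mannwhitneyu(python_runs, netlogo_runs, alternative="two-sided")
    return p_value > alpha


if __name__ == "__main__":
    for col in ("prey_count", "predator_count", "grass_quantity"):
        verdict = "indistinguishable" if outputs_match(col) else "significantly different"
        print(f"{col}: {verdict}")
```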
Key facts
- 17 contemporary LLMs were evaluated on ODD-to-code translation
- PPHPC predator-prey model used as fully specified reference
- Generated Python implementations compared against validated NetLogo baseline
- GPT-4.1 consistently produced statistically valid and efficient implementations
- Claude 3.7 Sonnet also performed well
- Executability alone insufficient for scientific use (see the sketch after this list)
- Behaviorally faithful implementations achievable but not guaranteed
- Study published on arXiv (2602.10140)
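For the executability criterion noted above, a bare check might simply run the generated script and verify that it completes cleanly. The sketch below uses a hypothetical script name and a fixed timeout; a clean exit is only the weakest criterion and says nothing about behavioral fidelity to the ODD specification.

```python
# Hedged sketch of a bare executability check. The script name and
# timeout are assumptions; a clean exit says nothing about whether the
# model is behaviorally faithful to the ODD specification.
import subprocess
import sys


def runs_cleanly(script_path: str, timeout_s: int = 300) -> bool:
    """True if the generated script finishes with exit code 0 within
    the timeout; this is only the first, weakest evaluation criterion."""
    try:
        result = subprocess.run(
            [sys.executable, script_path],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print(runs_cleanly("generated_pphpc.py"))  # hypothetical file name
```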
Entities
Institutions
- arXiv