Mage: Multi-Axis Evaluation of LLM-Generated Game Scenes Beyond Compile-Pass Rate
Mage, a newly introduced evaluation framework, shows why compile-pass rate alone is a poor measure of LLM-generated code in complex domains. The study covers 858 attempts at synthesizing executable game scenes, using four open-weight LLMs (7B to 30B) and 26 hand-crafted Unity goal patterns. Generating C# directly from natural language achieves a mean runtime-pass rate of 43% but produces structurally vacuous scenes (mechanism F1 ≈ 0.12). Conditioning on a structural intermediate representation (IR), by contrast, halves runtime success but recovers domain-faithful structure (F1 up to 1.00). Within IR conditioning, the two granularities tested, behavior-only and full-scene, are statistically indistinguishable (McNemar p = 1.0). The study is available on arXiv (2605.07342).
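To make the mechanism F1 figures concrete, here is a minimal sketch of a set-based F1 score. It assumes that mechanism F1 compares the set of mechanisms a goal pattern expects against the set detected in the generated scene; the summary does not give the paper's exact definition, and the function name and mechanism labels below are hypothetical.

```python
def mechanism_f1(expected: set[str], found: set[str]) -> float:
    """Set-overlap F1 between expected and detected mechanism labels.

    Assumption: mechanisms are compared as unordered label sets.
    The paper's actual scoring procedure may differ.
    """
    if not expected and not found:
        return 1.0  # nothing expected, nothing produced: trivially perfect
    true_positives = len(expected & found)
    if true_positives == 0:
        return 0.0  # no overlap, so precision or recall is zero
    precision = true_positives / len(found)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels, for illustration only (not taken from the paper).
expected = {"collect", "timer", "score", "win_condition"}
sparse_scene = {"collect"}       # runs, but implements almost nothing
faithful_scene = set(expected)   # implements every expected mechanism

sparse_score = mechanism_f1(expected, sparse_scene)
faithful_score = mechanism_f1(expected, faithful_scene)
```

Under this definition a scene can compile and run while scoring low, because recall collapses when most expected mechanisms are missing.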
Key facts
- Mage is a four-axis evaluation protocol: compile success, runtime success, structural fidelity, mechanism adherence.
- 858 generation attempts across four open-weight LLMs (7B-30B).
- 26 hand-crafted Unity goal-pattern playable concepts serve as generation targets.
- Two automatically extracted IR granularity levels tested.
- Direct NL-to-C# generation achieves 43% mean runtime-pass rate.
- Direct generation yields mechanism F1 ≈ 0.12 (structurally vacuous).
- Structural IR conditioning halves the runtime-pass rate but recovers domain-faithful structure (F1 up to 1.00).
- Behavior-only and full-scene granularity are statistically indistinguishable (McNemar p = 1.0).
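The McNemar result can be illustrated with a short sketch of the exact (binomial) form of the test. It assumes paired pass/fail outcomes for the two IR granularities on the same tasks; the summary does not say which variant of the test the authors used, and the counts below are illustrative rather than taken from the paper.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: pairs where only condition A passed
    c: pairs where only condition B passed
    Concordant pairs (both pass or both fail) do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) probability of a split at least as lopsided as k
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Illustrative counts only (not from the paper).
balanced = mcnemar_exact(3, 3)   # disagreements split evenly
lopsided = mcnemar_exact(0, 5)   # every disagreement favors one side
```

A p-value of 1.0 arises whenever the discordant pairs split evenly between the two conditions, which is consistent with the two granularities being indistinguishable.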