Mage: Multi-Axis Evaluation of LLM-Generated Game Scenes Beyond Compile-Pass Rate

ai-technology · 2026-05-11

Mage, a newly introduced evaluation framework, shows why compile-pass rate alone is a poor measure of LLM-generated code in complex domains. The study covers 858 attempts to synthesize executable game scenes, using four open-weight LLMs (7B to 30B parameters) and 26 hand-crafted Unity goal patterns. Generating C# directly from natural language reaches a mean runtime-pass rate of 43%, but the resulting scenes are structurally vacuous (mechanism F1 ≈ 0.12). Conditioning generation on a structural intermediate representation (IR) roughly halves the runtime-pass rate but recovers domain-faithful structure (F1 up to 1.00). Within IR conditioning, the two granularity levels, behavior-only and full-scene, are statistically indistinguishable (McNemar p = 1.0). The paper is available on arXiv (2605.07342).
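To make the mechanism F1 figures concrete, here is a minimal Python sketch of a set-based F1 score. It assumes that mechanism adherence is scored by comparing the set of mechanisms a goal pattern expects against the set detected in the generated scene; the brief does not spell out the paper's exact definition, and the function name and mechanism labels below are hypothetical.

```python
def mechanism_f1(expected: set[str], found: set[str]) -> float:
    """Set-based F1 between expected and detected mechanism labels.

    Assumption: this is one plausible reading of "mechanism F1";
    the paper's own scoring procedure may differ.
    """
    if not expected and not found:
        return 1.0  # nothing expected, nothing found: trivially perfect
    true_positives = len(expected & found)
    if true_positives == 0:
        return 0.0  # no overlap, so precision and recall are both zero
    precision = true_positives / len(found)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


# Hypothetical mechanism labels, for illustration only.
expected = {"collect_item", "open_door", "timer", "score_counter"}
print(mechanism_f1(expected, expected))          # every mechanism present
print(mechanism_f1(expected, {"collect_item"}))  # one of four present
```

Under this reading, a scene can compile and run while implementing almost none of the intended mechanisms, which is how a 43% runtime-pass rate can sit alongside an F1 near 0.12.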

Key facts

Mage is a four-axis evaluation protocol: compile success, runtime success, structural fidelity, mechanism adherence.
858 generation attempts across four open-weight LLMs (7B-30B).
26 hand-crafted Unity goal patterns (playable concepts) used as generation targets.
Two automatically extracted IR granularity levels tested.
Direct NL-to-C# generation achieves a 43% mean runtime-pass rate.
Direct generation yields mechanism F1 ≈ 0.12 (structurally vacuous).
Structural IR conditioning halves the runtime-pass rate but recovers domain-faithful structure (F1 up to 1.00).
Behavior-only and full-scene granularity are statistically indistinguishable (McNemar p = 1.0).
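The McNemar comparison in the last fact can be sketched as follows. This is a minimal exact (binomial) version in Python, assuming paired pass/fail outcomes for the same generation tasks under the two IR granularities. The brief does not say which McNemar variant the paper used, and the sample data is invented for illustration.

```python
from math import comb


def mcnemar_exact(a_pass: list[bool], b_pass: list[bool]) -> float:
    """Two-sided exact McNemar p-value for paired binary outcomes."""
    if len(a_pass) != len(b_pass):
        raise ValueError("outcome lists must be paired and equal in length")
    # Only discordant pairs (one condition passes, the other fails) matter.
    only_a = sum(x and not y for x, y in zip(a_pass, b_pass))
    only_b = sum(y and not x for x, y in zip(a_pass, b_pass))
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(only_a, only_b)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Invented outcomes: each condition wins exactly one discordant pair.
behavior_only = [True, False, True, False]
full_scene = [False, True, True, False]
print(mcnemar_exact(behavior_only, full_scene))
```

With this exact form, a p-value of 1.0 arises when the discordant pairs split evenly between the two conditions or when there are none at all, so the reported result indicates no detectable difference rather than proven equivalence.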

Sources

arXiv cs.AI — 2026-05-11