ARTFEED — Contemporary Art Intelligence

DexBench: New Benchmark Tests LLMs' Dual Reasoning on Program Execution

ai-technology · 2026-04-25

A research paper on arXiv (2604.20917) introduces DexBench, a benchmark designed to evaluate large language models' understanding of program execution through two complementary reasoning tasks: predicting a program's observed behavior for a given input, and inferring how the input must be mutated to achieve a specific behavioral objective. The authors argue that existing benchmarks focus narrowly on predicting program properties tied to specific inputs, which yields a limited view of dynamic code reasoning and leaves them prone to data contamination. DexBench comprises 445 paired instances and was used to evaluate 13 LLMs. The results demonstrate that evaluating both directions of reasoning is essential for assessing causal understanding of execution flow.
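The two task directions can be illustrated with a toy sketch (the program, names, and mutation below are hypothetical examples, not drawn from the paper's actual benchmark instances):

```python
# Toy illustration of the two complementary reasoning tasks described above.

def program(x: int) -> str:
    """A tiny program whose execution a model must reason about."""
    if x % 2 == 0:
        return "even-branch"
    return "odd-branch"

# Forward task: given a concrete input, predict the observed behavior.
forward_input = 6
predicted_behavior = program(forward_input)  # model must predict "even-branch"

# Backward task: given a target behavior, infer how to mutate the input
# so the program exhibits it (here: flip parity by adding 1).
target_behavior = "odd-branch"
mutated_input = forward_input + 1
achieved = program(mutated_input) == target_behavior
```

The backward task is what distinguishes this setup from output-prediction benchmarks: the model must reason causally about which input change produces the desired change in execution, not just simulate a fixed run.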

Key facts

  • arXiv paper 2604.20917 introduces DexBench
  • DexBench evaluates LLMs on two reasoning tasks: predicting behavior and inferring input mutations
  • Existing benchmarks focus on predicting program properties for specific inputs
  • DexBench has 445 paired instances
  • 13 LLMs were evaluated on DexBench
  • The paper argues for evaluating the inherent duality in program execution understanding
  • Dual reasoning tasks probe causal understanding of execution flow
  • The research highlights limitations of current benchmarks regarding data contamination

Entities

Institutions

  • arXiv

Sources