ARTFEED — Contemporary Art Intelligence

DexBench: New Benchmark Tests LLMs' Dual Reasoning on Program Execution

ai-technology · 2026-04-25

A research paper on arXiv (2604.20917) introduces DexBench, a benchmark designed to evaluate large language models' understanding of program execution through two complementary reasoning tasks: predicting a program's observed behavior for a given input, and inferring how the input must be mutated to achieve a specific behavioral objective. The authors argue that existing benchmarks focus narrowly on predicting program properties tied to specific inputs, which yields a limited view of dynamic code reasoning and leaves them prone to data contamination. DexBench comprises 445 paired instances and was used to evaluate 13 LLMs. The results demonstrate that evaluating both directions of reasoning is essential for assessing causal understanding of execution flow.
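The two task directions can be illustrated with a toy sketch (the program, names, and mutation below are hypothetical examples, not drawn from the paper's actual benchmark instances):

```python
# Toy illustration of the two complementary reasoning tasks described above.

def program(x: int) -> str:
    """A tiny program whose execution a model must reason about."""
    if x % 2 == 0:
        return "even-branch"
    return "odd-branch"

# Forward task: given a concrete input, predict the observed behavior.
forward_input = 6
predicted_behavior = program(forward_input)  # model must predict "even-branch"

# Backward task: given a target behavior, infer how to mutate the input
# so the program exhibits it (here: flip parity by adding 1).
target_behavior = "odd-branch"
mutated_input = forward_input + 1
achieved = program(mutated_input) == target_behavior
```

The backward task is what distinguishes this setup from output-prediction benchmarks: the model must reason causally about which input change produces the desired change in execution, not just simulate a fixed run.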

Key facts

  • arXiv paper 2604.20917 introduces DexBench
  • DexBench evaluates LLMs on two reasoning tasks: predicting behavior and inferring input mutations
  • Existing benchmarks focus on predicting program properties for specific inputs
  • DexBench has 445 paired instances
  • 13 LLMs were evaluated on DexBench
  • The paper argues for evaluating the inherent duality in program execution understanding
  • Dual reasoning tasks probe causal understanding of execution flow
  • The research highlights limitations of current benchmarks regarding data contamination

Entities

Institutions

  • arXiv

Sources