TraceEval: First Execution-Verified Multi-Language Benchmark for Code Semantic Reasoning
TraceEval marks a significant advance in assessing whether large language models (LLMs) can extract execution-relevant program structure from source code, moving beyond merely generating code that passes tests. Existing benchmarks such as HumanEval, MBPP, LiveCodeBench, and SWE-Bench primarily reward test-passing outputs and therefore offer limited insight into program semantics. TraceEval is the first multi-language, execution-verified benchmark for code semantic reasoning, specifically targeting recovery of a program's runtime call structure. Unlike earlier call-graph benchmarks that depend on static-tool output or hand-annotated ground truth, every positive edge in TraceEval is witnessed by a validating execution, eliminating annotator disagreement and label noise. The benchmark comprises 10,583 real-world programs.
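To make "execution-witnessed" concrete, the sketch below shows one way runtime call edges can be observed for a Python program using the standard library's sys.setprofile hook. This is a minimal illustration, not TraceEval's actual collection pipeline (which spans multiple languages), and the helper name collect_call_edges is hypothetical.

```python
# Minimal sketch (not the TraceEval pipeline): collecting execution-witnessed
# (caller, callee) call edges for a Python program via the profiler hook.
import sys

def collect_call_edges(entry_fn, *args, **kwargs):
    """Run entry_fn and record (caller, callee) edges observed at runtime.
    The harness's own call into entry_fn also appears as an edge."""
    edges = set()

    def profiler(frame, event, arg):
        if event == "call":  # a Python function frame is being entered
            caller = frame.f_back.f_code.co_name if frame.f_back else "<top>"
            callee = frame.f_code.co_name
            edges.add((caller, callee))

    sys.setprofile(profiler)
    try:
        entry_fn(*args, **kwargs)
    finally:
        sys.setprofile(None)
    return edges

# Toy program: main -> helper -> leaf; every edge below is witnessed by execution.
def leaf():
    return 1

def helper():
    return leaf() + 1

def main():
    return helper()

if __name__ == "__main__":
    for edge in sorted(collect_call_edges(main)):
        print(edge)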
Key facts
- TraceEval is the first execution-verified, multi-language benchmark for code semantic reasoning.
- It evaluates LLMs on recovering runtime call structure from source code (see the scoring sketch after this list).
- Existing benchmarks like HumanEval, MBPP, LiveCodeBench, and SWE-Bench focus on test-passing outputs.
- TraceEval eliminates annotator disagreement and label noise through mechanical validation execution.
- The benchmark includes 10,583 real-world programs.
- Every positive edge in TraceEval is witnessed by validation execution.
- Prior call-graph benchmarks rely on static-tool output or hand-annotated ground truth.
- TraceEval targets execution-relevant program structure recovery.
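Evaluation, in this framing, amounts to comparing a model's predicted call-edge set against the execution-witnessed edges. The snippet below sketches set-level precision/recall/F1 over (caller, callee) pairs as one plausible scoring scheme; the metric choice and the score_edges helper are assumptions for illustration, not TraceEval's documented protocol.

```python
# Hedged sketch of edge-level scoring: compare a model-predicted edge set
# against execution-witnessed ground-truth edges.
def score_edges(predicted, ground_truth):
    """Both arguments are iterables of (caller, callee) pairs."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    tp = len(predicted & ground_truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: the model misses one runtime edge and predicts one edge that
# no execution ever witnesses, so both precision and recall suffer.
truth = {("main", "helper"), ("helper", "leaf")}
pred = {("main", "helper"), ("main", "leaf")}
print(score_edges(pred, truth))  # {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```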
Entities
Institutions
- arXiv