ARTFEED — Contemporary Art Intelligence

LLMs Show Divergent Robustness in Code Understanding Under Perturbations

ai-technology · 2026-04-22

A study examining the robustness of large language models in understanding code execution semantics reveals significant behavioral differences. While the frontier model GPT-5.2 achieves near-perfect 99% accuracy on the unperturbed CRUXEval benchmark, its performance becomes brittle under code transformations and input perturbations, with accuracy dropping 20-24%. Open-source reasoning models from the DeepSeek-R1 family behave more stably, maintaining accuracies between 38% and 67% under similar perturbations. The research, published as arXiv:2604.16320v1, investigates whether LLMs rely on internal world models or on sophisticated pattern matching. Many models perform particularly poorly when predicting behavior on perturbed inputs that raise exceptions, with performance varying by exception type. The study also explores potential remedies for these robustness deficiencies. The analysis uses a standard program-output prediction task to evaluate code understanding capabilities.
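To make the setup concrete, here is a minimal sketch of a CRUXEval-style output-prediction item together with a semantics-preserving perturbation (identifier renaming). The function and inputs are our own illustration, not items from the paper; the point is that a model with a genuine execution model should predict the same output for both versions.

```python
# Illustrative output-prediction item (hypothetical, not from the benchmark).
# The model sees the function and an input, and must predict the output.

def f(s):
    # Count distinct characters that appear more than once in s.
    return sum(1 for c in set(s) if s.count(c) > 1)

original_output = f("abab")  # the model should predict 2

# Semantics-preserving perturbation: rename identifiers.
# Execution behavior is unchanged, so a robust model's prediction
# should not change either.
def g(text):
    return sum(1 for ch in set(text) if text.count(ch) > 1)

perturbed_output = g("abab")

assert original_output == perturbed_output == 2
```

A pattern-matching model may have memorized the original surface form; the perturbed version probes whether it can still trace the same execution.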

Key facts

  • GPT-5.2 achieves 99% accuracy on unperturbed CRUXEval benchmark
  • GPT-5.2 accuracy drops 20-24% under code transformations and input perturbations
  • DeepSeek-R1 family models maintain 38-67% accuracy under perturbations
  • Study examines LLM robustness in code execution semantics understanding
  • Research published as arXiv:2604.16320v1
  • Many models perform worse on perturbed inputs that raise exceptions
  • Prediction performance depends on exception type
  • Study explores remedies for robustness deficiencies
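The exception-related findings above can be illustrated with a hypothetical perturbed input (our example, not the paper's): predicting behavior correctly requires tracing execution to determine which exception actually fires, not just noticing that the input "looks" invalid.

```python
def f(xs):
    # Ratio of the first element to the number of remaining elements.
    return xs[0] / len(xs[1:])

# Unperturbed input: straightforward arithmetic.
assert f([4, 1, 1]) == 2.0

# Perturbed input: a single-element list. The correct prediction is
# ZeroDivisionError (len(xs[1:]) == 0), not IndexError -- xs[0] is
# valid even though the list is short. Distinguishing the two requires
# following the execution order, which is where many models slip.
try:
    f([4])
    raised = None
except ZeroDivisionError:
    raised = "ZeroDivisionError"

assert raised == "ZeroDivisionError"
```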
