ARTFEED — Contemporary Art Intelligence

LLMs Show Divergent Robustness in Code Understanding Under Perturbations

ai-technology · 2026-04-22

A study examining the robustness of large language models in understanding code execution semantics reveals significant behavioral differences. While the frontier model GPT-5.2 achieves near-perfect 99% accuracy on the unperturbed CRUXEval benchmark, its performance becomes brittle under code transformations and input perturbations, with accuracy dropping 20-24%. Open-source reasoning models from the DeepSeek-R1 family behave more stably, maintaining accuracies between 38% and 67% under similar perturbations. The research, published as arXiv:2604.16320v1, investigates whether LLMs rely on internal world models or on sophisticated pattern matching. Many models perform particularly poorly when predicting behavior on perturbed inputs that raise exceptions, with performance varying by exception type. The study also explores potential remedies for these robustness deficiencies. The analysis uses a standard program-output prediction task to evaluate code understanding capabilities.
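To make the setup concrete, here is a minimal sketch of a CRUXEval-style output-prediction item together with a semantics-preserving perturbation (identifier renaming). The function and inputs are our own illustration, not items from the paper; the point is that a model with a genuine execution model should predict the same output for both versions.

```python
# Illustrative output-prediction item (hypothetical, not from the benchmark).
# The model sees the function and an input, and must predict the output.

def f(s):
    # Count distinct characters that appear more than once in s.
    return sum(1 for c in set(s) if s.count(c) > 1)

original_output = f("abab")  # the model should predict 2

# Semantics-preserving perturbation: rename identifiers.
# Execution behavior is unchanged, so a robust model's prediction
# should not change either.
def g(text):
    return sum(1 for ch in set(text) if text.count(ch) > 1)

perturbed_output = g("abab")

assert original_output == perturbed_output == 2
```

A pattern-matching model may have memorized the original surface form; the perturbed version probes whether it can still trace the same execution.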

Key facts

  • GPT-5.2 achieves 99% accuracy on unperturbed CRUXEval benchmark
  • GPT-5.2 accuracy drops 20-24% under code transformations and input perturbations
  • DeepSeek-R1 family models maintain 38-67% accuracy under perturbations
  • Study examines LLM robustness in code execution semantics understanding
  • Research published as arXiv:2604.16320v1
  • Many models perform worse on perturbed inputs that raise exceptions
  • Prediction performance depends on exception type
  • Study explores remedies for robustness deficiencies
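The exception-related findings above can be illustrated with a hypothetical perturbed input (our example, not the paper's): predicting behavior correctly requires tracing execution to determine which exception actually fires, not just noticing that the input "looks" invalid.

```python
def f(xs):
    # Ratio of the first element to the number of remaining elements.
    return xs[0] / len(xs[1:])

# Unperturbed input: straightforward arithmetic.
assert f([4, 1, 1]) == 2.0

# Perturbed input: a single-element list. The correct prediction is
# ZeroDivisionError (len(xs[1:]) == 0), not IndexError -- xs[0] is
# valid even though the list is short. Distinguishing the two requires
# following the execution order, which is where many models slip.
try:
    f([4])
    raised = None
except ZeroDivisionError:
    raised = "ZeroDivisionError"

assert raised == "ZeroDivisionError"
```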
