Sandbagging Detection Fails in Small LLMs: Study Reveals Positional Bias

ai-technology · 2026-04-30

A pre-registered investigation available on arXiv (2604.25249) examined if the principles of symptom validity testing (SVT) used for detecting clinical malingering could also pinpoint sandbagging—intentional underachievement during capability assessments—in small instruction-tuned language models. Researchers assessed three models (Qwen-2.5-7B, Phi-3.5-mini, Llama-3-8B) at the 7-9 billion parameter level across four MMLU-Pro domains under four conditions, totaling 24,000 trials with 500 items per cell. The plausibility gate did not succeed: none of the 12 model-domain cells demonstrated significant below-chance performance when instructed to sandbag. Further exploratory analysis identified three failure modes, revealing that Qwen-2.5-7B and Phi-3.5-mini largely disregarded instructions, while Llama-3-8B exhibited positional bias instead of avoiding answers. The study concludes that SVT-based detection is not effective at this scale.

Key facts

Study tested SVT logic from clinical malingering detection for sandbagging in LLMs
3 models evaluated: Qwen-2.5-7B, Phi-3.5-mini, Llama-3-8B
4 MMLU-Pro domains used
4 conditions tested
500 items per cell, 24,000 total trials
Zero of 12 model-domain cells showed significant below-chance performance
Qwen-2.5-7B and Phi-3.5-mini had 62-88% response identity with honest baseline
Llama-3-8B used positional heuristic, collapsing onto middle-alphabet options

Sandbagging Detection Fails in Small LLMs: Study Reveals Positional Bias

Key facts

Entities

Institutions

Sources