Sandbagging Detection Fails in Small LLMs: Study Reveals Positional Bias
A pre-registered investigation available on arXiv (2604.25249) examined if the principles of symptom validity testing (SVT) used for detecting clinical malingering could also pinpoint sandbagging—intentional underachievement during capability assessments—in small instruction-tuned language models. Researchers assessed three models (Qwen-2.5-7B, Phi-3.5-mini, Llama-3-8B) at the 7-9 billion parameter level across four MMLU-Pro domains under four conditions, totaling 24,000 trials with 500 items per cell. The plausibility gate did not succeed: none of the 12 model-domain cells demonstrated significant below-chance performance when instructed to sandbag. Further exploratory analysis identified three failure modes, revealing that Qwen-2.5-7B and Phi-3.5-mini largely disregarded instructions, while Llama-3-8B exhibited positional bias instead of avoiding answers. The study concludes that SVT-based detection is not effective at this scale.
Key facts
- Study tested SVT logic from clinical malingering detection for sandbagging in LLMs
- 3 models evaluated: Qwen-2.5-7B, Phi-3.5-mini, Llama-3-8B
- 4 MMLU-Pro domains used
- 4 conditions tested
- 500 items per cell, 24,000 total trials
- Zero of 12 model-domain cells showed significant below-chance performance
- Qwen-2.5-7B and Phi-3.5-mini had 62-88% response identity with honest baseline
- Llama-3-8B used positional heuristic, collapsing onto middle-alphabet options
Entities
Institutions
- arXiv