Adversarial LLM Evaluation Reveals Positional Collapse Under Complex Instructions

ai-technology · 2026-05-01

A recent study published on arXiv (2604.27249) explores the reactions of language models to adversarial prompts in multiple-choice tests. Researchers tested Llama-3-8B and Llama-3.1-8B using a six-condition adversarial instruction-specificity gradient across 2,000 MMLU-Pro items. They discovered three distinct patterns: vague instructions led to a moderate drop in accuracy while maintaining content engagement; standard sandbagging and capability-imitation prompts resulted in a collapse of positional entropy with some content engagement; and a two-step answer-aware avoidance instruction caused a significant positional collapse, focusing almost entirely on one answer. This research delineates the line between content engagement and reliance on positional shortcuts.

Key facts

arXiv paper 2604.27249
Six-condition adversarial instruction-specificity gradient
Two instruction-tuned LLMs: Llama-3-8B and Llama-3.1-8B
2,000 MMLU-Pro items used
Three regimes identified: vague, standard sandbagging/capability-imitation, two-step answer-aware avoidance
Vague instructions: moderate accuracy reduction, preserved content engagement
Standard sandbagging/capability-imitation: positional entropy collapse, partial content engagement
Two-step answer-aware avoidance: extreme positional collapse, near-total concentration on a single answer
Distributional screening (response-position entropy) and content-engagement criterion (difficulty-accuracy correlation) used

Adversarial LLM Evaluation Reveals Positional Collapse Under Complex Instructions

Key facts

Entities

Institutions

Sources