ARTFEED — Contemporary Art Intelligence

Adversarial LLM Evaluation Reveals Positional Collapse Under Complex Instructions

ai-technology · 2026-05-01

A recent study posted to arXiv (2604.27249) examines how instruction-tuned language models respond to adversarial prompts on multiple-choice benchmarks. The researchers tested Llama-3-8B and Llama-3.1-8B with a six-condition adversarial instruction-specificity gradient over 2,000 MMLU-Pro items and identified three distinct regimes: vague instructions produced a moderate drop in accuracy while content engagement was preserved; standard sandbagging and capability-imitation prompts collapsed positional entropy while retaining partial content engagement; and a two-step answer-aware avoidance instruction caused extreme positional collapse, with responses concentrating almost entirely on a single answer position. The findings help delineate the boundary between genuine content engagement and reliance on positional shortcuts.

Key facts

  • arXiv paper 2604.27249
  • Six-condition adversarial instruction-specificity gradient
  • Two instruction-tuned LLMs: Llama-3-8B and Llama-3.1-8B
  • 2,000 MMLU-Pro items used
  • Three regimes identified: vague, standard sandbagging/capability-imitation, two-step answer-aware avoidance
  • Vague instructions: moderate accuracy reduction, preserved content engagement
  • Standard sandbagging/capability-imitation: positional entropy collapse, partial content engagement
  • Two-step answer-aware avoidance: extreme positional collapse, near-total concentration on a single answer
  • Distributional screening (response-position entropy) and content-engagement criterion (difficulty-accuracy correlation) used
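The two screening metrics in the last bullet can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the 10-option assumption (MMLU-Pro uses up to ten choices), and the use of Shannon entropy plus a Pearson correlation are all assumptions for demonstration.

```python
import math

def position_entropy(answers, num_options=10):
    """Shannon entropy (bits) of the answer-position distribution.

    A model engaging with content spreads its answers across positions
    (high entropy); positional collapse concentrates on one option
    (entropy near 0). `answers` is a list of chosen option indices.
    """
    counts = [0] * num_options
    for a in answers:
        counts[a] += 1
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def difficulty_accuracy_correlation(difficulty, correct):
    """Pearson correlation between per-item difficulty and correctness.

    A content-engaged model is less accurate on harder items, giving a
    negative correlation; a positionally collapsed model that ignores
    content shows a correlation near zero.
    """
    n = len(difficulty)
    md = sum(difficulty) / n
    mc = sum(correct) / n
    cov = sum((d - md) * (c - mc) for d, c in zip(difficulty, correct))
    sd = math.sqrt(sum((d - md) ** 2 for d in difficulty))
    sc = math.sqrt(sum((c - mc) ** 2 for c in correct))
    return cov / (sd * sc) if sd and sc else 0.0

# Fully collapsed: every answer at position 0 -> entropy 0.0
print(position_entropy([0] * 100))            # → 0.0
# Uniform over 10 positions -> maximum entropy log2(10) ≈ 3.32
print(round(position_entropy(list(range(10)) * 10), 2))  # → 3.32
```

Under this sketch, the "extreme positional collapse" regime would show entropy near zero together with a difficulty-accuracy correlation near zero, while the "vague instruction" regime would keep both signals close to an unmanipulated baseline.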

Entities

Institutions

  • arXiv

Benchmarks

  • MMLU-Pro

Sources