ARTFEED — Contemporary Art Intelligence

New Adversarial Framework Exposes Vulnerabilities in Black-Box NLP Pipelines

ai-technology · 2026-04-29

Researchers have developed a novel adversarial attack framework, called Agentic Adversarial Rewriting, that exposes vulnerabilities in black-box natural language processing (NLP) pipelines used for misinformation detection. The framework operates under a strict threat model: binary-only decision feedback, no gradient access, and a hard budget of 10 queries. It pairs two agents: an Attacker Agent that generates meaning-preserving rewrites of the input text, and a Prompt Optimization Agent that refines the attack strategy using only the detector's binary accept/reject decisions.

Evaluated against four evidence-based misinformation detection pipelines, the framework achieved evasion rates of 19.95% to 40.34% on modern large language model (LLM) based systems. Token-level perturbation baselines, which depend on surrogate models and therefore cannot operate under the same threat model, achieved at most 3.90% evasion. The results highlight significant architectural vulnerabilities in current NLP pipelines, particularly those relying on LLMs, and underscore the need for more robust defenses. The paper is available on arXiv under the identifier 2604.23483.
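The two-agent loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the agent functions here are hypothetical stand-ins for LLM calls, and the strategy names are invented for the example. Only the loop structure, the binary-only feedback, and the 10-query budget come from the article.

```python
QUERY_BUDGET = 10  # hard query budget from the stated threat model

def attacker_rewrite(text, strategy):
    """Hypothetical Attacker Agent: produce a meaning-preserving rewrite.

    In the real framework this would be an LLM call; here we just tag the
    text with the current strategy so the loop is runnable."""
    return f"[{strategy}] {text}"

def optimize_prompt(strategy, evaded):
    """Hypothetical Prompt Optimization Agent: refine the attack strategy
    from binary feedback only (no scores, no gradients)."""
    strategies = ["paraphrase", "hedge-claims", "reframe-evidence", "formalize-tone"]
    if evaded:
        return strategy  # keep a strategy that already works
    # On a binary failure signal, rotate to the next candidate strategy.
    return strategies[(strategies.index(strategy) + 1) % len(strategies)]

def attack(text, detector):
    """Run the two-agent loop under the query budget.

    `detector` is the black-box pipeline: it returns True if the text is
    flagged as misinformation. Returns (success, adversarial_text, queries)."""
    strategy = "paraphrase"
    for query in range(1, QUERY_BUDGET + 1):
        candidate = attacker_rewrite(text, strategy)
        flagged = detector(candidate)        # binary-only feedback
        if not flagged:
            return True, candidate, query    # evasion succeeded
        strategy = optimize_prompt(strategy, evaded=False)
    return False, text, QUERY_BUDGET         # budget exhausted
```

The design point the sketch captures is that the optimizer never sees a confidence score or gradient, only pass/fail, which is what separates this threat model from surrogate-based token-level attacks.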

Key facts

  • The framework operates under a strict black-box threat model with binary-only feedback, no gradient access, and a 10-query budget.
  • It consists of an Attacker Agent and a Prompt Optimization Agent.
  • Evaluated against four evidence-based misinformation detection pipelines.
  • Evasion rates of 19.95% to 40.34% on modern LLM-based systems.
  • Token-level perturbation baselines achieved at most 3.90% evasion.
  • The research was published on arXiv with identifier 2604.23483.

Entities

Institutions

  • arXiv