Adversarial Explanation Attacks Manipulate Human Trust in AI

ai-technology · 2026-05-18

Researchers have introduced a new class of attack, adversarial explanation attacks (AEAs), in which an attacker manipulates the explanations an LLM generates in order to modulate human trust in incorrect outputs. The attack exploits a vulnerability at the cognitive level: its attack surface is the communication channel between an AI system and its users. To quantify the effect, the researchers use a trust miscalibration gap, a metric that captures the difference in human trust between benign and adversarial explanations. The study highlights the behavioral risks of persuasive framing, showing how strategically framed explanations can miscalibrate human trust in AI-supported decision-making.
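As a rough illustration of what a trust miscalibration gap could look like in practice, the sketch below compares average trust ratings that participants give to the same incorrect outputs under benign and adversarial explanations. The function name, the 1-to-5 rating scale, the sign convention (adversarial minus benign), and the sample ratings are all assumptions made for this example; the paper's exact definition may differ.

```python
from statistics import mean


def trust_miscalibration_gap(benign_ratings, adversarial_ratings):
    """Hypothetical gap: mean trust under adversarial explanations
    minus mean trust under benign explanations, for incorrect outputs.

    A positive value means the adversarial explanations raised trust
    in answers that did not deserve it. This is an assumed definition,
    not necessarily the one used in the study.
    """
    if not benign_ratings or not adversarial_ratings:
        raise ValueError("both rating lists must be non-empty")
    return mean(adversarial_ratings) - mean(benign_ratings)


# Made-up ratings on an assumed 1-5 trust scale, all for incorrect outputs.
benign = [2, 3, 2, 3]
adversarial = [4, 5, 4, 5]

gap = trust_miscalibration_gap(benign, adversarial)
print(f"trust miscalibration gap: {gap:+.2f}")
```

Under this convention, a gap near zero would suggest the explanation framing had little effect on trust, while a large positive gap would indicate that the adversarial framing inflated trust in erroneous results.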

Key facts

Adversarial explanation attacks (AEAs) manipulate LLM-generated explanations to modulate human trust in incorrect outputs.
Attack surface is the communication channel between AI and its users.
Trust miscalibration gap metric captures difference in human trust between benign and adversarial explanations.
Study highlights behavioral risks of persuasive explanation framing.

Entities

—

Sources

arXiv cs.AI — 2026-05-18