REALISTA: New Attack Method Induces LLM Hallucinations
Researchers have introduced REALISTA, a framework for generating realistic adversarial prompts that elicit hallucinations in large language models (LLMs). The work frames hallucination elicitation as a constrained optimization problem: find adversarial prompts that remain semantically equivalent to benign user inputs. Existing techniques fall short on one side or the other: discrete prompt-based attacks preserve meaning but search over only a limited set of prompt variations, while continuous latent-space attacks often decode into invalid, nonsensical rephrasings. REALISTA bridges this gap by constructing an input-dependent dictionary of valid editing directions tied to semantically coherent rewordings, then optimizing continuous latent vectors within that dictionary to trigger hallucinations. The research is detailed in arXiv preprint 2605.12813.
Key facts
- REALISTA is a realistic latent-space attack framework.
- It elicits hallucinations in large language models.
- Hallucination elicitation is framed as a constrained optimization problem.
- The goal is to find semantically coherent adversarial prompts equivalent to benign prompts.
- Discrete prompt-based attacks search over a limited set of prompt variations.
- Continuous latent-space attacks often decode into invalid rephrasings.
- REALISTA uses an input-dependent dictionary of valid editing directions.
- The preprint is available on arXiv under ID 2605.12813.
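To make the constrained-optimization framing concrete, here is a minimal illustrative sketch, not the authors' actual method: a perturbation to a prompt embedding is restricted to the span of a dictionary of editing directions (standing in for REALISTA's valid, semantically coherent edits), and its magnitude is bounded to keep the adversarial prompt close to the benign one. All names (`latent_dictionary_attack`, the L2 bound `eps`, the toy objective) are assumptions for illustration; the real framework's dictionary construction and hallucination objective are defined in the preprint.

```python
import numpy as np

def project_l2(c, eps):
    """Project coefficients onto an L2 ball of radius eps
    (a stand-in for the semantic-similarity constraint)."""
    norm = np.linalg.norm(c)
    return c if norm <= eps else c * (eps / norm)

def latent_dictionary_attack(z, D, loss_grad, eps=0.5, lr=0.1, steps=100):
    """Projected gradient ascent over dictionary coefficients.

    z         : (d,) embedding of the benign prompt
    D         : (d, k) dictionary of valid editing directions
    loss_grad : gradient of a hallucination objective w.r.t. the embedding

    The perturbation D @ c lies in span(D), so only 'valid' edits are
    explored, and ||c|| <= eps keeps the result near the benign prompt.
    """
    c = np.zeros(D.shape[1])
    for _ in range(steps):
        g = D.T @ loss_grad(z + D @ c)  # chain rule: d(loss)/dc
        c = project_l2(c + lr * g, eps)
    return z + D @ c, c
```

With a toy linear objective (maximize alignment with a target direction `t`, so the gradient is constantly `t`), the coefficients saturate on the constraint boundary while the objective improves over the benign embedding:

```python
rng = np.random.default_rng(0)
d, k = 8, 3
z = rng.normal(size=d)
D = np.linalg.qr(rng.normal(size=(d, k)))[0]  # orthonormal directions
t = rng.normal(size=d)                        # toy objective gradient
z_adv, c = latent_dictionary_attack(z, D, lambda x: t, eps=0.5)
```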
Entities
Institutions
- arXiv