REALISTA: New Attack Method Induces LLM Hallucinations
Researchers have introduced REALISTA, a framework for generating realistic adversarial prompts that elicit hallucinations in large language models (LLMs). The work frames hallucination elicitation as a constrained optimization problem: find adversarial prompts that remain semantically equivalent to benign user inputs. Existing techniques fall short on one side or the other: discrete prompt-based attacks preserve meaning but search over only a limited set of prompt variations, while continuous latent-space attacks often decode into invalid, nonsensical rephrasings. REALISTA bridges this gap by constructing an input-dependent dictionary of valid editing directions tied to semantically coherent rewordings, then optimizing continuous latent vectors within that dictionary to trigger hallucinations. The research is detailed in arXiv preprint 2605.12813.
Key facts
- REALISTA is a realistic latent-space attack framework.
- It elicits hallucinations in large language models.
- Hallucination elicitation is framed as a constrained optimization problem.
- The goal is to find semantically coherent adversarial prompts equivalent to benign prompts.
- Discrete prompt-based attacks search over a limited set of prompt variations.
- Continuous latent-space attacks often decode into invalid rephrasings.
- REALISTA uses an input-dependent dictionary of valid editing directions.
- The preprint is available on arXiv under ID 2605.12813.
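To make the constrained-optimization framing concrete, here is a minimal illustrative sketch, not the authors' actual method: a perturbation to a prompt embedding is restricted to the span of a dictionary of editing directions (standing in for REALISTA's valid, semantically coherent edits), and its magnitude is bounded to keep the adversarial prompt close to the benign one. All names (`latent_dictionary_attack`, the L2 bound `eps`, the toy objective) are assumptions for illustration; the real framework's dictionary construction and hallucination objective are defined in the preprint.

```python
import numpy as np

def project_l2(c, eps):
    """Project coefficients onto an L2 ball of radius eps
    (a stand-in for the semantic-similarity constraint)."""
    norm = np.linalg.norm(c)
    return c if norm <= eps else c * (eps / norm)

def latent_dictionary_attack(z, D, loss_grad, eps=0.5, lr=0.1, steps=100):
    """Projected gradient ascent over dictionary coefficients.

    z         : (d,) embedding of the benign prompt
    D         : (d, k) dictionary of valid editing directions
    loss_grad : gradient of a hallucination objective w.r.t. the embedding

    The perturbation D @ c lies in span(D), so only 'valid' edits are
    explored, and ||c|| <= eps keeps the result near the benign prompt.
    """
    c = np.zeros(D.shape[1])
    for _ in range(steps):
        g = D.T @ loss_grad(z + D @ c)  # chain rule: d(loss)/dc
        c = project_l2(c + lr * g, eps)
    return z + D @ c, c
```

With a toy linear objective (maximize alignment with a target direction `t`, so the gradient is constantly `t`), the coefficients saturate on the constraint boundary while the objective improves over the benign embedding:

```python
rng = np.random.default_rng(0)
d, k = 8, 3
z = rng.normal(size=d)
D = np.linalg.qr(rng.normal(size=(d, k)))[0]  # orthonormal directions
t = rng.normal(size=d)                        # toy objective gradient
z_adv, c = latent_dictionary_attack(z, D, lambda x: t, eps=0.5)
```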
Entities
Institutions
- arXiv