Adaptive Unlearning Suppresses LLM Hallucinations in Code Generation
A new framework called Adaptive Unlearning (AU) surgically suppresses hallucinations in deployed large language models (LLMs) without costly retraining. Hallucinations (plausible but factually incorrect outputs) pose a critical supply-chain vulnerability in code generation, where models recommend non-existent software packages. Attackers can register these fictional package names on public registries and seed them with malicious payloads, a class of attack known as slopsquatting. Existing mitigation methods either degrade model utility or require a pre-specified forget-set, which is impractical given the unbounded space of possible hallucinations. AU operates post-deployment, targeting specific failure modes while preserving overall performance. The paper is published on arXiv (2605.01047) and addresses a key challenge in AI safety for autonomous code agents.
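To make the slopsquatting risk concrete, the sketch below checks whether an LLM-suggested dependency actually exists on PyPI before it is installed. This is generic defensive tooling for illustration only, not part of the AU framework described in the paper; the package name "fastjson-utils-pro" is a hypothetical hallucinated name.

```python
# Illustrative guardrail against slopsquatting: confirm an LLM-suggested
# dependency is a real PyPI project before installing it. Not part of AU;
# a generic check for hallucinated package names.
import urllib.error
import urllib.request


def exists_on_pypi(package: str) -> bool:
    """Return True if `package` is a published project on PyPI's simple index."""
    url = f"https://pypi.org/simple/{package}/"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except urllib.error.URLError:
        # 404 (project does not exist) or network failure: do not trust the name.
        return False


# "requests" is real; "fastjson-utils-pro" is a hypothetical hallucinated name.
for pkg in ["requests", "fastjson-utils-pro"]:
    verdict = "ok" if exists_on_pypi(pkg) else "NOT on PyPI - do not install blindly"
    print(f"{pkg}: {verdict}")
```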
Key facts
- Adaptive Unlearning (AU) is a post-deployment framework for suppressing LLM hallucinations.
- Hallucinations in code generation create supply-chain vulnerabilities via slopsquatting attacks.
- Existing approaches cause severe degradation of model utility or rely on a pre-specified forget-set.
- AU does not require full retraining and targets specific failure modes (a generic sketch of this kind of targeted unlearning follows the list).
- The paper is available on arXiv with identifier 2605.01047.
- Hallucinations are defined as outputs that sound plausible but are factually incorrect.
- Slopsquatting involves registering fictional packages on public registries with malicious payloads.
- The framework addresses the unbounded space of hallucinations.
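The source does not spell out AU's actual update rule, so the following is only a generic gradient-based unlearning sketch: gradient ascent on a flagged hallucination, gradient descent on a retained example. It assumes a Hugging Face-style causal LM whose forward pass returns a loss when given labels, and is meant only to illustrate what suppressing a specific failure mode without full retraining can look like; it is not the paper's algorithm.

```python
# Generic targeted-unlearning step, NOT the AU algorithm (the source does not
# describe AU's update rule). Assumes a Hugging Face-style causal LM whose
# forward pass returns .loss when given labels, plus a matching tokenizer.
def targeted_unlearning_step(model, tokenizer, hallucinated_text, retain_text,
                             optimizer, retain_weight=1.0):
    """Push the model away from one flagged hallucination (gradient ascent)
    while anchoring it on an example of behaviour worth keeping (descent)."""
    model.train()
    optimizer.zero_grad()

    # Loss on the hallucinated completion (e.g. code importing a non-existent
    # package); we maximise it to suppress that output.
    forget = tokenizer(hallucinated_text, return_tensors="pt")
    forget_loss = model(**forget, labels=forget["input_ids"]).loss

    # Loss on a retained example; standard descent to preserve utility.
    retain = tokenizer(retain_text, return_tensors="pt")
    retain_loss = model(**retain, labels=retain["input_ids"]).loss

    (-forget_loss + retain_weight * retain_loss).backward()
    optimizer.step()
    return forget_loss.item(), retain_loss.item()
```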
Entities
Institutions
- arXiv