DareU: LLM Unlearning via Data Attribution Rewards

ai-technology · 2026-06-01

DareU is a new framework for unlearning in large language models (LLMs). Instead of maximizing loss on a forget set, it reframes the optimization goal as zeroing out data attribution. DareU updates the LLM with reinforcement learning so that generated responses receive lower attribution scores with respect to the data owners to be forgotten, a process referred to as de-attributing. In empirical tests that use an LLM classifier as an efficient approximation of attribution, DareU outperforms existing baselines, achieving successful unlearning while reducing over-forgetting and preserving model utility. The paper is available on arXiv under ID 2605.30919.
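The idea of rewarding low attribution can be illustrated with a toy sketch. This is not the paper's implementation: the keyword-based attribution_score function stands in for the LLM classifier, the "policy" is a softmax over four fixed candidate responses rather than an LLM, and the update is plain REINFORCE with a baseline. All names, markers, and hyperparameters here are assumptions made for illustration.

```python
import math
import random

def attribution_score(response: str) -> float:
    """Hypothetical stand-in for an attribution classifier.

    Returns a score in [0, 1] estimating how strongly the response is
    attributable to the data owner to be forgotten (toy keyword check).
    """
    forget_markers = ("alice", "secret")  # made-up markers, illustration only
    hits = sum(marker in response.lower() for marker in forget_markers)
    return hits / len(forget_markers)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_attribution(candidates, logits):
    """Average attribution score of responses under the current policy."""
    probs = softmax(logits)
    return sum(p * attribution_score(c) for p, c in zip(probs, candidates))

def de_attribute(candidates, logits, steps=500, lr=0.5, seed=0):
    """REINFORCE on a toy softmax policy with reward = 1 - attribution score."""
    rng = random.Random(seed)
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        idx = rng.choices(range(len(candidates)), weights=probs)[0]
        reward = 1.0 - attribution_score(candidates[idx])
        # Baseline: expected reward under the current policy (variance reduction).
        baseline = 1.0 - expected_attribution(candidates, logits)
        advantage = reward - baseline
        # Gradient of log pi(idx) w.r.t. the logits is one_hot(idx) - probs.
        for j in range(len(logits)):
            grad = (1.0 if j == idx else 0.0) - probs[j]
            logits[j] += lr * advantage * grad
    return logits

candidates = [
    "Alice's secret recipe uses saffron.",   # fully attributable in this toy
    "Alice likes cooking.",                  # partially attributable
    "I don't have information about that.",  # not attributable
    "Saffron is a common spice.",            # not attributable
]
initial_logits = [2.0, 1.0, 0.0, 0.0]  # policy starts out favoring the forget data
trained_logits = de_attribute(candidates, initial_logits)

print("expected attribution before:", expected_attribution(candidates, initial_logits))
print("expected attribution after: ", expected_attribution(candidates, trained_logits))
```

In this sketch the policy shifts probability toward responses with a low attribution score. Note that the fourth candidate, which is unrelated to the forget owner but still useful, is rewarded just as much as the refusal. That loosely mirrors the stated goal of unlearning without over-forgetting, although the toy says nothing about how DareU achieves this in practice.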

Key facts

DareU is presented as the first LLM unlearning framework based on data attribution rewards.
It uses reinforcement learning to de-attribute responses from the data owners to be forgotten.
The approach addresses over-forgetting and poor model utility.
Empirical evaluation uses an LLM classifier for efficient attribution approximation.
DareU outperforms existing baselines.
The paper is available on arXiv (ID 2605.30919).
The work frames unlearning as zeroing out data attribution.
The method reduces attribution scores of generated responses.
