New Framework Harmonizes Multi-Objective Unlearning for Large Language Models
A novel framework for Large Language Model (LLM) unlearning addresses multiple critical objectives simultaneously, moving beyond the narrower focus of existing methods. The approach harmonizes the removal of undesirable knowledge with the preservation of general utility, while also preventing over-refusal of concepts neighboring the unlearning target and ensuring robustness against adversarial probing attacks. Existing unlearning techniques typically concentrate on efficacy and utility preservation alone, often neglecting robustness and boundary behavior. The proposed method uses a co-design of data and optimization to achieve this multi-objective balance, standardizing training corpora into a unified data representation to reduce domain gaps. The research, documented in the preprint arXiv:2604.15482v1, targets the safe removal of hazardous or privacy-sensitive information from LLMs, and it highlights that naively extending current single-objective methods can cause the different objectives to interfere with one another.
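To make the multi-objective balance concrete, the sketch below shows one generic way such objectives are often combined in unlearning work: a forgetting term (gradient ascent on the forget set), a utility-retention term, and a boundary term that keeps the model answering neighboring but benign prompts. The weights, loss terms, and the Hugging Face-style model interface are assumptions for illustration, not the paper's actual formulation.

```python
# Illustrative sketch of a multi-objective unlearning loss (not the paper's method).
# Assumes a Hugging Face-style causal LM whose forward pass returns `.loss`
# when `labels` are provided in each batch dict.
import torch


def multi_objective_unlearning_loss(model, forget_batch, retain_batch, boundary_batch,
                                    w_forget=1.0, w_retain=1.0, w_boundary=0.5):
    """Combine forgetting, utility retention, and boundary objectives into one scalar.

    Each *_batch is a dict with `input_ids`, `attention_mask`, and `labels` tensors.
    The weights are illustrative hyperparameters.
    """
    # 1) Forgetting: negate the next-token loss on the forget set, so minimizing
    #    the combined loss pushes the model away from the targeted knowledge.
    forget_term = -model(**forget_batch).loss

    # 2) Utility retention: keep the usual next-token loss low on general data.
    retain_term = model(**retain_batch).loss

    # 3) Boundary behavior: keep loss low on prompts about *neighboring* concepts,
    #    discouraging over-refusal of related but benign queries.
    boundary_term = model(**boundary_batch).loss

    return w_forget * forget_term + w_retain * retain_term + w_boundary * boundary_term
```

In practice the three batches would be drawn from separate forget, retain, and boundary corpora each optimization step; how the actual framework weights or schedules these terms is not specified in this summary.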
Key facts
- The paper proposes a novel multi-objective unlearning framework for Large Language Models (LLMs).
- The framework aims to remove undesirable or privacy-leaking knowledge while preserving general model utility.
- It specifically addresses the challenge of avoiding over-refusal of concepts neighboring the target unlearning data.
- A key objective is ensuring robustness against adversarial probing attacks post-unlearning.
- Existing unlearning methods are criticized for focusing primarily on efficacy and utility, overlooking robustness and boundary behavior.
- The method uses a co-design of data and optimization to harmonize these multiple objectives.
- Training corpora are standardized into a unified data representation to reduce domain gaps (see the sketch after this list).
- The research is documented in the preprint arXiv:2604.15482v1, announced as a cross submission.
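The summary does not detail what the unified data representation looks like; the snippet below is a minimal sketch of one plausible normalization step, mapping heterogeneous records such as QA pairs and raw passages into a single prompt/response schema. The schema, field names, and helper functions are hypothetical.

```python
# Illustrative sketch: normalizing heterogeneous unlearning corpora into one
# prompt/response schema so forget, retain, and boundary sets share a format.
# The schema below is an assumption, not the paper's actual representation.
from dataclasses import dataclass


@dataclass
class UnifiedExample:
    prompt: str
    response: str
    split: str  # "forget", "retain", or "boundary"


def from_qa(question: str, answer: str, split: str) -> UnifiedExample:
    # QA pairs map directly onto the prompt/response schema.
    return UnifiedExample(prompt=question.strip(), response=answer.strip(), split=split)


def from_passage(text: str, split: str, prompt_len: int = 64) -> UnifiedExample:
    # Raw passages are split into a leading chunk (prompt) and the remaining
    # continuation (response) so they fit the same schema as QA data.
    return UnifiedExample(prompt=text[:prompt_len], response=text[prompt_len:], split=split)
```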