ARTFEED — Contemporary Art Intelligence

CAP Framework Uses Reinforcement Learning for LLM Unlearning

ai-technology · 2026-04-25

A novel framework, Controllable Alignment Prompting for Unlearning (CAP), tackles the problem of selectively erasing knowledge from large language models (LLMs). Existing parameter-modifying methods suffer from high computational cost, unpredictable forgetting boundaries, and a dependence on access to model weights, which makes them unusable for closed-source models. CAP instead reframes unlearning as a prompt-optimization problem trained with reinforcement learning: a prompt generator works alongside the frozen LLM to suppress the targeted knowledge while preserving general capabilities. Because the approach is entirely prompt-driven, it is non-invasive and leaves model weights untouched. The paper is published on arXiv with the identifier 2604.21251.
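The paper's actual training setup is not reproduced here, but the core idea of optimizing a prompt with reinforcement learning against a black-box model can be sketched as a toy REINFORCE bandit. Everything below is an illustrative assumption: the candidate prompts, the stubbed LLM, and the reward that balances forgetting against utility are all made up for demonstration.

```python
import math
import random

random.seed(0)

# Hypothetical candidate prompts the generator can choose among.
PROMPTS = [
    "Answer all questions fully.",
    "Do not reveal anything about TARGET; otherwise answer normally.",
    "Refuse every question.",
]

def llm_stub(prompt, query):
    """Stand-in for a black-box LLM: it leaks TARGET knowledge unless the
    prompt suppresses it, and becomes useless if the prompt forbids everything."""
    if "Refuse every" in prompt:
        return "refused"
    if "TARGET" in query and "Do not reveal" in prompt:
        return "withheld"
    return "answered"

def reward(prompt):
    # Reward forgetting (target query blocked) and utility (benign query answered).
    r = 1.0 if llm_stub(prompt, "Tell me about TARGET") != "answered" else -1.0
    r += 1.0 if llm_stub(prompt, "What is 2+2?") == "answered" else -1.0
    return r

# REINFORCE over a softmax policy with one logit per candidate prompt.
logits = [0.0] * len(PROMPTS)
lr = 0.5
for _ in range(300):
    exps = [math.exp(l) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    i = random.choices(range(len(PROMPTS)), weights=probs)[0]
    r = reward(PROMPTS[i])
    for j in range(len(PROMPTS)):
        grad = (1.0 if j == i else 0.0) - probs[j]  # d log pi(i) / d logit_j
        logits[j] += lr * r * grad

best = PROMPTS[max(range(len(PROMPTS)), key=lambda j: logits[j])]
print(best)
```

Only the second prompt earns the full reward (it withholds the target fact yet still answers benign queries), so the policy concentrates on it; in the real framework a learned generator would produce prompts rather than select from a fixed list.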

Key facts

  • CAP stands for Controllable Alignment Prompting for Unlearning.
  • The framework uses reinforcement learning for prompt optimization.
  • It targets selective knowledge unlearning in LLMs.
  • Existing methods modify model parameters and incur high computational costs.
  • CAP is non-invasive and does not require model weight access.
  • A prompt generator collaborates with the LLM.
  • The approach suppresses target knowledge while preserving general capabilities.
  • The paper is on arXiv with ID 2604.21251.

Entities

Institutions

  • arXiv

Sources