Optimus Framework Defends LLMs Against Toxicity During Fine-Tuning
Researchers have developed Optimus, a defense framework that reduces the risk of harmful behavior when Large Language Models (LLMs) are fine-tuned on untrusted datasets. Existing defenses depend on accurate toxicity detection or strict filtering; Optimus is designed to keep mitigating toxicity even when the toxicity classifier is flawed or biased. The framework has two parts. First, a training-free toxicity classification scheme repurposes the safety alignment of commodity LLMs. Second, a dual-strategy alignment method combines synthetic 'healing data' with Direct Preference Optimization (DPO) to steer models toward safer outputs. According to the authors' evaluations, Optimus reduces toxicity even when it relies on heavily biased classifiers with up to 85% degradation in recall. The paper is available on arXiv under the identifier 2507.05660.
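For readers unfamiliar with DPO, the sketch below computes the standard DPO loss for a single preference pair in plain Python. It is an illustration only, not the Optimus implementation: the function name `dpo_loss`, the default `beta`, and the pairing of a safe response as "chosen" against a toxic response as "rejected" are assumptions made for this example.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or a frozen reference model.
    Assumption for this sketch: "chosen" is a safe response and
    "rejected" is a toxic one.
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)

    # Loss is -log(sigmoid(z)), written in a numerically stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Policy identical to the reference: the loss is log(2).
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)

# Policy favors the safe response more than the reference does: lower loss.
safer = dpo_loss(-1.0, -5.0, -2.0, -2.0)

print(f"neutral={neutral:.4f} safer={safer:.4f}")
```

The loss equals log 2 when the policy matches the reference model and falls as the policy shifts probability toward the chosen response relative to the reference.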
Key facts
- Optimus is a defense framework for fine-tuning LLMs on untrusted datasets.
- It mitigates toxic behaviors without relying on precise toxicity detection.
- The framework uses a training-free toxicity classification scheme.
- It repurposes the safety alignment of commodity LLMs.
- Optimus employs synthetic 'healing data' and Direct Preference Optimization (DPO).
- It still reduces toxicity when its classifier is heavily biased, with up to 85% degradation in recall.
- The research is published on arXiv (2507.05660).
- The framework preserves conversational utility while ensuring safety.
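To make the recall figure concrete, the following sketch computes recall for a hypothetical toxicity classifier. The sample counts are invented for illustration, and reading "85% degradation" as a drop from perfect recall to 0.15 is an assumption, not a figure taken from the paper.

```python
def recall(labels, predictions):
    """Fraction of truly toxic samples that the classifier flags as toxic."""
    flagged_toxic = sum(1 for y, p in zip(labels, predictions) if y and p)
    total_toxic = sum(1 for y in labels if y)
    return flagged_toxic / total_toxic if total_toxic else 0.0


# Hypothetical dataset: 20 toxic samples (True) followed by 80 benign ones.
labels = [True] * 20 + [False] * 80

# An accurate classifier flags all 20 toxic samples.
accurate = [True] * 20 + [False] * 80
# A biased classifier flags only 3 of the 20 toxic samples.
biased = [True] * 3 + [False] * 97

baseline_recall = recall(labels, accurate)
biased_recall = recall(labels, biased)
degradation = baseline_recall - biased_recall

# Toxic samples that a filter built on the biased classifier would let through.
missed_toxic = sum(1 for y, p in zip(labels, biased) if y and not p)

print(f"baseline={baseline_recall:.2f} biased={biased_recall:.2f} "
      f"degradation={degradation:.2f} missed={missed_toxic}")
```

Under these assumptions, 17 of the 20 toxic samples would slip past a filter that trusts the biased classifier, which is the kind of setting the framework is reported to handle.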
Entities
Institutions
- arXiv