Cross-Lingual Safety Transfer for LLMs via Self-Distillation
Large language models (LLMs) suffer from severe multilingual safety misalignment: their safeguards are robust in high-resource languages but remain vulnerable to jailbreak attacks in low-resource languages. Existing safety-alignment methods require high-quality response data for every target language, which is costly and difficult to produce. To address this, the authors propose Multilingual Self-Distillation (MSD), a framework that transfers an LLM's safety behavior from high-resource languages such as English to low-resource languages such as Javanese without requiring any response data. The framework is compatible with various self-distillation techniques; two instantiations, on-policy MSD and off-policy MSD, achieve effective cross-lingual safety transfer using only multilingual queries. The paper is available on arXiv under ID 2605.02971.
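To make the core idea concrete, here is a minimal, self-contained sketch of distilling safe behavior from a high-resource query onto its low-resource translation. All names, numbers, and the three-token vocabulary are illustrative assumptions, not the paper's implementation: the same model acts as teacher on the English query and as student on the Javanese translation, and the student's next-token distribution is pulled toward the teacher's by reducing a KL-divergence loss.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distill_step(teacher_probs, student_probs, lr=0.5):
    """One simplified 'update': mix the student's distribution toward the teacher's.
    This stands in for a gradient step on the KL loss in real training."""
    mixed = [(1 - lr) * s + lr * t for s, t in zip(student_probs, teacher_probs)]
    z = sum(mixed)
    return [m / z for m in mixed]

# Teacher: safety-aligned behavior on the English query (refusal dominates).
# Hypothetical distribution over three outcomes: P(refuse), P(comply), P(other).
teacher = [0.85, 0.10, 0.05]
# Student: the same model on the Javanese translation, initially unsafe.
student = [0.20, 0.70, 0.10]

loss_before = kl_divergence(teacher, student)
for _ in range(10):
    student = distill_step(teacher, student)
loss_after = kl_divergence(teacher, student)
```

Note that only the query needs translating; the target signal comes from the model's own high-resource behavior, which is why no response data is required.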
Key facts
- LLMs have severe multilingual safety misalignment.
- High-resource languages have strong safeguards; low-resource languages are vulnerable.
- Current safety alignment methods require expensive response data per language.
- MSD framework transfers safety from high-resource to low-resource languages.
- MSD eliminates the need for response data in any language.
- Two methods: on-policy MSD and off-policy MSD.
- Both methods use only multilingual queries.
- Paper available on arXiv: 2605.02971.