Cross-Lingual Safety Transfer for LLMs via Self-Distillation
Large language models (LLMs) suffer from severe multilingual safety misalignment: their safeguards are robust in high-resource languages but remain vulnerable to jailbreak attacks in low-resource languages. Existing safety-alignment methods require high-quality response data for every target language, which is costly and difficult to produce. To address this, the authors propose Multilingual Self-Distillation (MSD), a framework that transfers an LLM's safety behavior from high-resource languages such as English to low-resource languages such as Javanese without requiring any response data. The framework is compatible with various self-distillation techniques; two instantiations, on-policy MSD and off-policy MSD, achieve effective cross-lingual safety transfer using only multilingual queries. The paper is available on arXiv under ID 2605.02971.
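To make the core idea concrete, here is a minimal, self-contained sketch of distilling safe behavior from a high-resource query onto its low-resource translation. All names, numbers, and the three-token vocabulary are illustrative assumptions, not the paper's implementation: the same model acts as teacher on the English query and as student on the Javanese translation, and the student's next-token distribution is pulled toward the teacher's by reducing a KL-divergence loss.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distill_step(teacher_probs, student_probs, lr=0.5):
    """One simplified 'update': mix the student's distribution toward the teacher's.
    This stands in for a gradient step on the KL loss in real training."""
    mixed = [(1 - lr) * s + lr * t for s, t in zip(student_probs, teacher_probs)]
    z = sum(mixed)
    return [m / z for m in mixed]

# Teacher: safety-aligned behavior on the English query (refusal dominates).
# Hypothetical distribution over three outcomes: P(refuse), P(comply), P(other).
teacher = [0.85, 0.10, 0.05]
# Student: the same model on the Javanese translation, initially unsafe.
student = [0.20, 0.70, 0.10]

loss_before = kl_divergence(teacher, student)
for _ in range(10):
    student = distill_step(teacher, student)
loss_after = kl_divergence(teacher, student)
```

Note that only the query needs translating; the target signal comes from the model's own high-resource behavior, which is why no response data is required.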
Key facts
- LLMs have severe multilingual safety misalignment.
- High-resource languages have strong safeguards; low-resource languages are vulnerable.
- Current safety alignment methods require expensive response data per language.
- MSD framework transfers safety from high-resource to low-resource languages.
- MSD eliminates the need for response data in any language.
- Two methods: on-policy MSD and off-policy MSD.
- Both methods use only multilingual queries.
- Paper available on arXiv: 2605.02971.