ARTFEED — Contemporary Art Intelligence

TLPO: Token-Level Policy Optimization to Fix Language Confusion in LLMs

ai-technology · 2026-04-30

Researchers have introduced Token-Level Policy Optimization (TLPO), a fine-tuning framework that targets language confusion in large language models (LLMs). Unlike earlier sequence-level techniques such as DPO, ORPO, and GRPO, which adjust entire responses and can impair overall performance, TLPO makes localized updates at the token level: it pinpoints error-prone positions, evaluates alternative token candidates, and refines the policy with a tailored objective that suppresses confusion-inducing outputs. This selective intervention mitigates language confusion while preserving the model's general capabilities. Further details can be found in arXiv:2604.26553v1.
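
In code, the targeted update can be pictured as a per-token preference loss applied only at flagged positions. The PyTorch sketch below is a minimal illustration under stated assumptions: the names flagged_positions, good_ids, bad_ids, and beta are hypothetical, and the DPO-style pairwise form stands in for the paper's actual objective, which is not reproduced here.

```python
import torch
import torch.nn.functional as F

def tlpo_style_loss(logits, flagged_positions, good_ids, bad_ids, beta=1.0):
    """Hypothetical token-level preference loss at flagged positions only.

    logits:            (seq_len, vocab) model outputs for one sequence
    flagged_positions: (k,) long tensor of error-prone positions
    good_ids:          (k,) intended-language token at each flagged position
    bad_ids:           (k,) confusion-inducing candidate at each position
    """
    log_probs = F.log_softmax(logits, dim=-1)            # (seq_len, vocab)
    pos_lp = log_probs[flagged_positions]                # (k, vocab)
    good_lp = pos_lp.gather(1, good_ids.unsqueeze(1)).squeeze(1)
    bad_lp = pos_lp.gather(1, bad_ids.unsqueeze(1)).squeeze(1)
    # Prefer the intended-language token over the confusing candidate;
    # positions outside flagged_positions contribute no gradient.
    return -F.logsigmoid(beta * (good_lp - bad_lp)).mean()
```

Because the loss touches only the flagged positions, the rest of the output distribution receives no gradient, which is what keeps the intervention selective.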

Key facts

  • TLPO is a token-level fine-tuning framework for mitigating language confusion in LLMs.
  • Prior methods like DPO, ORPO, and GRPO operate at the sequence level and can degrade general capabilities.
  • TLPO identifies error-prone positions and explores alternative candidate tokens (one plausible flagging pass is sketched after this list).
  • The policy is updated using a tailored objective to suppress error-inducing outputs.
  • Selective intervention enables effective mitigation without compromising general abilities.
  • The paper is available on arXiv with ID 2604.26553v1.
  • Language confusion refers to LLMs failing to consistently generate responses in the intended language.
  • TLPO provides a more fine-grained alternative to sequence-level fine-tuning.
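
The facts above do not say how error-prone positions are found. One plausible mechanism, sketched below purely as an assumption, is to flag every position where the emitted token, or a strong top-k alternative, falls outside the intended language; the helper flag_confusion_positions and the predicate is_intended_lang (e.g., a Unicode-script check) are hypothetical, not the paper's actual detector.

```python
import torch
import torch.nn.functional as F

def flag_confusion_positions(logits, sampled_ids, is_intended_lang, top_k=5):
    """Flag positions where generation risks leaving the target language.

    logits:           (seq_len, vocab) model outputs for one sequence
    sampled_ids:      (seq_len,) tokens the model actually emitted
    is_intended_lang: callable(token_id) -> bool, e.g. a script check
    """
    topk_ids = F.softmax(logits, dim=-1).topk(top_k, dim=-1).indices
    flagged = []
    for t in range(logits.size(0)):
        emitted_ok = is_intended_lang(sampled_ids[t].item())
        # A strong off-language alternative in the top-k also marks
        # an error-prone position worth intervening on.
        strong_off = any(not is_intended_lang(i.item()) for i in topk_ids[t])
        if not emitted_ok or strong_off:
            flagged.append(t)
    return torch.tensor(flagged, dtype=torch.long)
```

The flagged indices could then feed directly into a per-token loss like the one sketched earlier.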

Entities

Institutions

  • arXiv

Sources

  • arXiv:2604.26553v1