GRPO Boosts Encoder-Decoder MT Models Without Reference Data

ai-technology · 2026-05-18

A recent study applies Group Relative Policy Optimization (GRPO) to encoder-decoder machine translation models, specifically NLLB-200 at 600M and 1.3B parameters. The method uses a hybrid reference-free reward that combines LaBSE and COMET-Kiwi, so fine-tuning requires no parallel data. The study reports consistent improvements across 13 typologically diverse languages, with gains of up to +5.03 chrF++ for Traditional Chinese. On morphologically complex languages, the approach is competitive with 3-epoch supervised fine-tuning while using no target-language data. Gains are largest where baseline performance is weakest and reward discriminability is highest, which suggests the approach is well suited to low-resource settings. The work addresses a gap in reinforcement learning fine-tuning for machine translation: prior work has focused mainly on decoder-only LLMs above 7B parameters, whereas practical deployments rely on encoder-decoder Seq2Seq models.
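To illustrate why reward discriminability matters, the sketch below shows the group-relative advantage step that gives GRPO its name: several candidate translations are sampled for one source sentence, and each reward is normalized against the group's mean and standard deviation. This is a minimal sketch, not the study's implementation. The function name, the reward values, the `eps` constant, and the choice of population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled translations.

    Assumption: population standard deviation and a small eps for
    numerical stability; the summarized study may differ in detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for 4 sampled translations of one source sentence.
spread = group_relative_advantages([0.2, 0.4, 0.6, 0.8])

# If the reward cannot tell candidates apart, every advantage is zero
# and the group contributes no learning signal.
flat = group_relative_advantages([0.5, 0.5, 0.5, 0.5])

print(spread)
print(flat)
```

When the reward separates candidates clearly, the advantages are spread around zero and the policy update can favor the better translations. When all candidates score the same, the advantages collapse to zero, which is consistent with the observation that gains are largest where reward discriminability is highest.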

Key facts

GRPO applied to NLLB-200 (600M and 1.3B) encoder-decoder models
Hybrid reference-free reward uses LaBSE and COMET-Kiwi
No parallel data required at fine-tuning time
Evaluated across 13 typologically diverse languages
Up to +5.03 chrF++ improvement for Traditional Chinese
Competes with 3-epoch supervised fine-tuning on morphologically complex languages
Gains largest where baseline performance is weakest
Addresses gap in RL fine-tuning for Seq2Seq MT models
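
One way a hybrid reference-free reward could be assembled is sketched below. The summary does not say how the LaBSE and COMET-Kiwi signals are combined, so the weighted average, the `alpha` weight, the clipping of negative similarities, and the function names are all assumptions. The embeddings and the quality-estimation score are passed in as plain numbers, standing in for the outputs of the real models, which are not called here.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hybrid_reward(src_emb, hyp_emb, qe_score, alpha=0.5):
    """Hypothetical hybrid reward: weighted mix of two reference-free signals.

    src_emb, hyp_emb: sentence embeddings of source and hypothesis
                      (stand-ins for LaBSE outputs).
    qe_score:         quality-estimation score, assumed to lie roughly
                      in [0, 1] (stand-in for a COMET-Kiwi output).
    alpha:            assumed mixing weight, not taken from the study.
    """
    similarity = max(0.0, cosine_similarity(src_emb, hyp_emb))
    return alpha * similarity + (1 - alpha) * qe_score


# Toy 2-dimensional embeddings; real sentence embeddings are much larger.
aligned = hybrid_reward([1.0, 0.0], [1.0, 0.0], qe_score=0.8)
unrelated = hybrid_reward([1.0, 0.0], [0.0, 1.0], qe_score=0.8)

print(aligned, unrelated)
```

Neither signal needs a reference translation: the similarity term compares the source with the hypothesis across languages, and the quality-estimation term scores the hypothesis given only the source. That is what allows fine-tuning without parallel data.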

Sources

arXiv cs.AI — 2026-05-18