GRPO Boosts Encoder-Decoder MT Models Without Reference Data
A recent study applies Group Relative Policy Optimization (GRPO) to encoder-decoder machine translation models, specifically NLLB-200 at 600M and 1.3B parameters. The method uses a hybrid reference-free reward that combines LaBSE and COMET-Kiwi, so no parallel data is needed during fine-tuning. The study reports consistent improvements across 13 typologically diverse languages, with gains of up to +5.03 chrF++ for Traditional Chinese. On morphologically complex languages, the approach is competitive with 3-epoch supervised fine-tuning despite using no target-language data. Gains are largest where baseline performance is weakest and reward discriminability is highest, which suggests the method is effective in low-resource settings. The work addresses a gap in reinforcement learning fine-tuning for machine translation: prior work has centered on decoder-only LLMs larger than 7B parameters, whereas practical applications depend on encoder-decoder Seq2Seq models.
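The "group relative" part of GRPO can be illustrated with a minimal sketch. It assumes the commonly described formulation, in which several candidate translations are sampled for one source sentence and each reward is normalized by the mean and standard deviation of its group. The function name and the `eps` parameter are hypothetical, and the exact variant used in the study is not given in this summary.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled translations.

    Assumption: `rewards` holds one scalar reward per candidate
    translation of the same source sentence. `eps` is a hypothetical
    guard against division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards that differ across candidates give a usable ranking signal.
spread = group_relative_advantages([0.2, 0.4, 0.6])

# Identical rewards give all-zero advantages, so nothing is learned.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

One plausible reading of the discriminability finding follows from this: when the reward cannot tell candidates apart, as in `flat`, every advantage is zero and the policy receives no learning signal.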
Key facts
- GRPO applied to NLLB-200 (600M and 1.3B) encoder-decoder models
- Hybrid reference-free reward uses LaBSE and COMET-Kiwi
- No parallel data required at fine-tuning time
- Evaluated across 13 typologically diverse languages
- Up to +5.03 chrF++ improvement for Traditional Chinese
- Competes with 3-epoch supervised fine-tuning on morphologically complex languages
- Gains largest where baseline performance is weakest
- Addresses gap in RL fine-tuning for Seq2Seq MT models
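A hybrid reference-free reward of the kind listed above might be sketched as follows. This is an illustration under stated assumptions, not the study's formula: the summary does not say how the two signals are combined, so the weighted sum and the `alpha` parameter are hypothetical. The embedding vectors stand in for LaBSE sentence embeddings of the source and the hypothesis, and `qe_score` stands in for a COMET-Kiwi quality estimate; neither model is actually called here.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_u * norm_v)


def hybrid_reward(src_emb, hyp_emb, qe_score, alpha=0.5):
    """Combine cross-lingual similarity with a quality-estimation score.

    Assumptions: `src_emb` and `hyp_emb` are placeholder embeddings of
    the source sentence and the candidate translation; `qe_score` is a
    placeholder reference-free quality score; `alpha` is a hypothetical
    mixing weight. No reference translation is involved.
    """
    similarity = cosine_similarity(src_emb, hyp_emb)
    return alpha * similarity + (1 - alpha) * qe_score


# Toy values only: identical embeddings and a quality score of 0.8.
reward = hybrid_reward([1.0, 0.0], [1.0, 0.0], qe_score=0.8)
```

Because both inputs depend only on the source sentence and the model's own output, a reward of this shape needs no parallel data at fine-tuning time.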