BalCapRL Framework Balances RL-Based MLLM Image Captioning
BalCapRL is a new reinforcement learning (RL) framework that addresses trade-offs in image captioning for multimodal large language models (MLLMs). Existing RL methods often optimize a single narrow objective. Utility-oriented objectives tend to produce noisy, hallucinated, or overlong captions that boost downstream task performance but harm fluency, while arena-style objectives favor fluent but generic descriptions with limited usefulness. BalCapRL jointly optimizes utility-aware correctness, reference coverage, and linguistic quality to produce more balanced captions. The framework is detailed in a paper on arXiv (2605.07394).
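To make the idea of joint optimization concrete, here is a minimal sketch of how three per-caption scores could be folded into one scalar reward. It is an illustration only: the weighted average, the names `CaptionScores` and `balanced_reward`, the default weights, and the example scores are all assumptions, not the reward formulation from the BalCapRL paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptionScores:
    """Hypothetical per-caption scores, each assumed to lie in [0, 1]."""
    correctness: float  # utility-aware correctness
    coverage: float     # reference coverage
    quality: float      # linguistic quality


def balanced_reward(scores: CaptionScores,
                    weights: tuple = (1.0, 1.0, 1.0)) -> float:
    """Weighted average of the three scores (an assumed combination rule)."""
    total = sum(weights)
    if total <= 0 or min(weights) < 0:
        raise ValueError("weights must be non-negative with a positive sum")
    parts = (scores.correctness, scores.coverage, scores.quality)
    return sum(w * p for w, p in zip(weights, parts)) / total


# Made-up scores for three caption styles described in the summary.
noisy = CaptionScores(correctness=0.9, coverage=0.9, quality=0.2)
generic = CaptionScores(correctness=0.4, coverage=0.3, quality=0.95)
balanced = CaptionScores(correctness=0.8, coverage=0.8, quality=0.8)

for name, s in [("noisy", noisy), ("generic", generic), ("balanced", balanced)]:
    print(name, round(balanced_reward(s), 3))
```

With equal weights, the balanced caption in this toy example scores highest. Setting the quality weight to zero, as in `balanced_reward(noisy, (1.0, 1.0, 0.0))`, ranks the noisy caption first instead, which mirrors the trade-off attributed to narrow utility objectives.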
Key facts
- BalCapRL is a balanced RL framework for MLLM image captioning
- Existing RL methods create trade-offs between utility and fluency
- BalCapRL optimizes correctness, coverage, and linguistic quality
- Paper available on arXiv with ID 2605.07394
- Image captioning is a fundamental computer vision task
- MLLMs have drawn attention for open-ended captioning
- Utility objectives can cause hallucinations and overlong captions
- Arena-style objectives favor fluent but generic descriptions
Entities
Institutions
- arXiv