EBM-RL: Decoupled Reinforcement Learning for Video Role-Playing
EBM-RL (Eye-Brain-Mouth Reinforcement Learning) is a novel framework developed to enhance text-based role-playing models for applications such as VR games and interactive storytelling. This GRPO-based framework decouples observation ([perception]), reasoning ([think]), and utterance ([answer]) into distinct stages, fostering human-like sensory grounding: the model must first attend to visual cues, then form internal interpretations, and finally produce contextually relevant dialogue. EBM-RL combines four complementary rewards, including CLIP-based scene-text alignment for mood and emotion, a Perceptual-Cognitive reward that strengthens the [perception] and [think] stages by increasing the likelihood of the reference response, and answer accuracy. The work addresses the shortcomings of existing models in capturing scene atmosphere and evolving tension, both crucial for immersive experiences. The paper is available on arXiv under identifier 2605.04733.
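To make the decoupled structure concrete, the sketch below parses a response into its [perception]/[think]/[answer] segments and combines per-segment rewards. This is an illustrative assumption, not the paper's implementation: the tag names come from the summary above, while `scene_alignment_reward` is a keyword-overlap stand-in for the CLIP-based scene-text alignment score, and `format_reward` is a hypothetical structure check.

```python
import re

TAGS = ("perception", "think", "answer")

def parse_segments(response: str) -> dict:
    """Extract the text inside each [tag]...[/tag] block; missing tags -> ''."""
    segments = {}
    for tag in TAGS:
        match = re.search(rf"\[{tag}\](.*?)\[/{tag}\]", response, re.DOTALL)
        segments[tag] = match.group(1).strip() if match else ""
    return segments

def scene_alignment_reward(perception: str, scene_keywords: set) -> float:
    """Stand-in for the CLIP-based scene-text alignment reward: the fraction
    of scene keywords the perception segment mentions."""
    if not scene_keywords:
        return 0.0
    words = set(re.findall(r"\w+", perception.lower()))
    return len(words & scene_keywords) / len(scene_keywords)

def format_reward(segments: dict) -> float:
    """Reward 1.0 only when all three decoupled segments are present."""
    return 1.0 if all(segments[t] for t in TAGS) else 0.0

def total_reward(response: str, scene_keywords: set,
                 w_align: float = 0.5, w_fmt: float = 0.5) -> float:
    """Weighted sum of the stand-in rewards for one rollout."""
    segs = parse_segments(response)
    return (w_align * scene_alignment_reward(segs["perception"], scene_keywords)
            + w_fmt * format_reward(segs))

demo = ("[perception]A dim rainy alley, neon signs flicker.[/perception]"
        "[think]The mood is tense; speak quietly.[/think]"
        "[answer]Stay close. Something is watching us.[/answer]")
print(total_reward(demo, {"rainy", "neon", "alley"}))  # → 1.0
```

In a GRPO setup, such a scalar reward would be computed per sampled rollout and group-normalized into advantages; the actual framework uses four rewards, including the CLIP-based and Perceptual-Cognitive terms described above.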
Key facts
- EBM-RL is a decoupled GRPO-based framework for video-grounded role-playing dialogue.
- The framework separates observation, reasoning, and utterance into distinct processes.
- It uses four complementary rewards including CLIP-based scene-text alignment.
- The research aims to improve scene atmosphere and evolving tension in VR games and interactive narratives.
- The paper is published on arXiv with identifier 2605.04733.
Entities
Institutions
- arXiv