SeePhys Pro Benchmark Reveals Modality Transfer Gaps in Multimodal RLVR
SeePhys Pro is a fine-grained modality-transfer benchmark that examines whether AI models preserve their reasoning ability when essential information shifts from text to images. Unlike typical vision-centric benchmarks that assess a single input type, SeePhys Pro provides four semantically aligned variants of each problem, with progressively more of the content rendered visually. Evaluations show that leading models are not representation-invariant: performance typically declines as information moves from language to diagrams, with visual variable grounding emerging as the primary bottleneck. To probe this inference-time vulnerability, the researchers built large training corpora for multimodal RLVR and used blind training, in which all training images are masked, as a diagnostic tool. They found that reinforcement learning, even with every training image obscured, can still improve performance on unmasked validation sets. The research is available on arXiv, identifier 2605.09266.
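The core measurement behind such a benchmark can be sketched as a per-variant accuracy comparison. This is a minimal illustration, not the paper's evaluation code: the variant names, the record layout, and the function name are all assumptions; the only grounded idea is scoring the same problems across four semantically aligned versions and looking for a drop as content becomes more visual.

```python
from typing import Dict, List

# Hypothetical variant labels, ordered from fully textual to fully visual.
VARIANTS = ["text_only", "mostly_text", "mostly_visual", "visual_only"]

def variant_accuracies(
    predictions: List[Dict[str, str]],  # predictions[i][variant] -> model answer
    answers: List[str],                 # gold answer per problem
) -> Dict[str, float]:
    """Accuracy per modality variant. A monotone decline along VARIANTS
    suggests the model is not a representation-invariant reasoner."""
    acc = {}
    for v in VARIANTS:
        correct = sum(p[v] == a for p, a in zip(predictions, answers))
        acc[v] = correct / len(predictions)
    return acc
```

A gap between `acc["text_only"]` and `acc["visual_only"]` on the same underlying problems is exactly the modality-transfer degradation the benchmark is designed to expose.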
Key facts
- SeePhys Pro is a modality transfer benchmark for AI reasoning.
- It tests performance as information moves from text to image.
- Four semantically aligned variants per problem with increasing visual elements.
- Current frontier models are not representation-invariant reasoners.
- Performance degrades when information shifts from language to diagrams.
- Visual variable grounding is the most critical bottleneck.
- Large training corpora for multimodal RLVR were developed.
- Blind training with masked images still improves performance on unmasked sets.
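The blind-training diagnostic above amounts to a data transform: keep the text side of each RLVR training sample intact while replacing its image with an uninformative placeholder. This is a hedged sketch under assumed field names (`"image"` as nested pixel rows), not the authors' pipeline:

```python
from typing import Any, Dict, List

def blind_batch(batch: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return a copy of a training batch with every image replaced by a
    same-shaped block of zeros, so the policy can learn only from text.
    Comparing runs on blinded vs. intact batches isolates how much of the
    reward signal actually flows through the visual channel."""
    blinded = []
    for sample in batch:
        masked = dict(sample)  # shallow copy; original batch is untouched
        masked["image"] = [[0] * len(row) for row in sample["image"]]
        blinded.append(masked)
    return blinded
```

If a model trained only on blinded batches still gains accuracy on unmasked validation sets, the improvement cannot come from visual grounding, which is the paper's diagnostic point.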