SeePhys Pro Benchmark Reveals Modality Transfer Gaps in Multimodal RLVR
SeePhys Pro is a fine-grained modality-transfer benchmark that examines whether AI models preserve their reasoning ability when essential information shifts from text to images. Unlike typical vision-centric benchmarks that assess a single input type, SeePhys Pro provides four semantically aligned variants of each problem, with progressively more of the content rendered visually. Evaluations show that leading models are not representation-invariant: performance typically declines as information moves from language to diagrams, with visual variable grounding emerging as the primary bottleneck. To probe this inference-time vulnerability, the researchers built large training corpora for multimodal RLVR and used blind training, in which all training images are masked, as a diagnostic tool. They found that reinforcement learning, even with every training image obscured, can still improve performance on unmasked validation sets. The research is available on arXiv, identifier 2605.09266.
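The core measurement behind such a benchmark can be sketched as a per-variant accuracy comparison. This is a minimal illustration, not the paper's evaluation code: the variant names, the record layout, and the function name are all assumptions; the only grounded idea is scoring the same problems across four semantically aligned versions and looking for a drop as content becomes more visual.

```python
from typing import Dict, List

# Hypothetical variant labels, ordered from fully textual to fully visual.
VARIANTS = ["text_only", "mostly_text", "mostly_visual", "visual_only"]

def variant_accuracies(
    predictions: List[Dict[str, str]],  # predictions[i][variant] -> model answer
    answers: List[str],                 # gold answer per problem
) -> Dict[str, float]:
    """Accuracy per modality variant. A monotone decline along VARIANTS
    suggests the model is not a representation-invariant reasoner."""
    acc = {}
    for v in VARIANTS:
        correct = sum(p[v] == a for p, a in zip(predictions, answers))
        acc[v] = correct / len(predictions)
    return acc
```

A gap between `acc["text_only"]` and `acc["visual_only"]` on the same underlying problems is exactly the modality-transfer degradation the benchmark is designed to expose.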
Key facts
- SeePhys Pro is a modality transfer benchmark for AI reasoning.
- It tests performance as information moves from text to image.
- Four semantically aligned variants per problem with increasing visual elements.
- Current frontier models are not representation-invariant reasoners.
- Performance degrades when information shifts from language to diagrams.
- Visual variable grounding is the most critical bottleneck.
- Large training corpora for multimodal RLVR were developed.
- Blind training with masked images still improves performance on unmasked sets.
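The blind-training diagnostic above amounts to a data transform: keep the text side of each RLVR training sample intact while replacing its image with an uninformative placeholder. This is a hedged sketch under assumed field names (`"image"` as nested pixel rows), not the authors' pipeline:

```python
from typing import Any, Dict, List

def blind_batch(batch: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return a copy of a training batch with every image replaced by a
    same-shaped block of zeros, so the policy can learn only from text.
    Comparing runs on blinded vs. intact batches isolates how much of the
    reward signal actually flows through the visual channel."""
    blinded = []
    for sample in batch:
        masked = dict(sample)  # shallow copy; original batch is untouched
        masked["image"] = [[0] * len(row) for row in sample["image"]]
        blinded.append(masked)
    return blinded
```

If a model trained only on blinded batches still gains accuracy on unmasked validation sets, the improvement cannot come from visual grounding, which is the paper's diagnostic point.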