ARTFEED — Contemporary Art Intelligence

SeePhys Pro Benchmark Reveals Modality Transfer Gaps in Multimodal RLVR

ai-technology · 2026-05-12

SeePhys Pro is a fine-grained modality transfer benchmark that tests whether AI models keep their reasoning ability when essential information moves from text into images. Unlike typical vision-centric benchmarks, which assess a single input format, SeePhys Pro provides four semantically aligned versions of each problem, with progressively more visual elements. Evaluations show that leading models are not representation-invariant: performance typically declines as information shifts from language to diagrams, and visual variable grounding is the primary bottleneck. To address this inference-time vulnerability, the researchers built large training corpora for multimodal RLVR (reinforcement learning with verifiable rewards) and used blind training as a diagnostic tool. They found that reinforcement learning can improve performance on unmasked validation sets even when all training images are masked. The paper is available on arXiv under identifier 2605.09266.
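
As a rough illustration of how a four-variant design of this kind could be scored, the sketch below computes accuracy per variant and the drop relative to the text-only version. The variant labels (`v0_text` through `v3_visual`), the record layout, and the helpers `accuracy_by_variant` and `transfer_gap` are hypothetical and are not taken from the SeePhys Pro release. The toy records are made-up placeholders, not reported results.

```python
from collections import defaultdict

# Hypothetical variant labels, ordered from text-only to most visual.
VARIANTS = ["v0_text", "v1", "v2", "v3_visual"]


def accuracy_by_variant(records):
    """Compute accuracy per variant.

    records: iterable of (problem_id, variant, is_correct) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for _problem_id, variant, is_correct in records:
        total[variant] += 1
        correct[variant] += int(is_correct)
    return {v: correct[v] / total[v] for v in VARIANTS if total[v]}


def transfer_gap(acc):
    """Accuracy drop of each variant relative to the text-only baseline."""
    base = acc[VARIANTS[0]]
    return {v: base - acc[v] for v in VARIANTS if v in acc}


# Toy, invented outcomes for two problems (assumption, for illustration only).
toy = [
    ("p1", "v0_text", True), ("p1", "v1", True),
    ("p1", "v2", True), ("p1", "v3_visual", False),
    ("p2", "v0_text", True), ("p2", "v1", True),
    ("p2", "v2", False), ("p2", "v3_visual", False),
]

acc = accuracy_by_variant(toy)
gap = transfer_gap(acc)
```

A representation-invariant model would show a gap near zero for every variant. A gap that grows toward the most visual variant would match the degradation pattern the article describes.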

Key facts

  • SeePhys Pro is a modality transfer benchmark for AI reasoning.
  • It tests performance as information moves from text to image.
  • Each problem has four semantically aligned variants with progressively more visual elements.
  • Current frontier models are not representation-invariant reasoners.
  • Performance degrades when information shifts from language to diagrams.
  • Visual variable grounding is the most critical bottleneck.
  • Large training corpora for multimodal RLVR were developed.
  • Blind training with masked images still improves performance on unmasked sets.
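
The blind-training diagnostic in the last point can be pictured with a minimal sketch: training examples have their image replaced by a blank placeholder, while validation examples keep the original image. The article does not say how the images are masked, so the constant fill value, the `mask_image` and `prepare_example` helpers, and the example layout are all assumptions made for illustration.

```python
def mask_image(pixels, fill=0):
    """Return a same-shaped image with every pixel replaced by `fill`.

    Assumption: masking means a constant fill; the actual scheme may differ.
    """
    return [[fill for _ in row] for row in pixels]


def prepare_example(example, split):
    """Mask the image for training examples only; other splits stay unmasked."""
    if split == "train":
        # Build a new dict so the original example is left untouched.
        return {**example, "image": mask_image(example["image"])}
    return example


# Toy example with a 2x2 grayscale "image" (invented data).
sample = {
    "question": "What is the net force on the block?",
    "image": [[12, 200], [34, 90]],
}

train_view = prepare_example(sample, "train")
val_view = prepare_example(sample, "val")
```

If a model trained only on `train_view`-style inputs still improves on unmasked validation data, part of the gain cannot have come from learning to read the images, which is what makes the setup useful as a diagnostic.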

Entities

Institutions

  • arXiv
