ARTFEED — Contemporary Art Intelligence

PRISM-VL: Grounding Vision-Language Models in RAW Sensor Data

ai-technology · 2026-05-13

PRISM-VL is a new method for grounding vision-language models in RAW sensor data rather than processed RGB images. Instead of post-ISP renderings, it operates on RAW-derived Meas.-XYZ inputs, uses camera-conditioned grounding to adapt across sensors, and employs Exposure-Bracketed Supervision Aggregation, with the aim of avoiding the information loss introduced by RGB rendering. The model was trained on a quality-controlled 150K instruction-tuning set and benchmarked under difficult conditions such as low light and high dynamic range. PRISM-VL-8B scored 0.6120 BLEU, 0.4571 ROUGE-L, and 82.66% LLM-Judge accuracy, outperforming the RGB-based Qwen3-VL-8B baseline by clear margins, evidence that measurement-centric inputs can improve robustness in challenging visual settings.

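The paper's exact preprocessing is not spelled out in this brief, but a minimal sketch of one plausible RAW-to-Meas.-XYZ path, assuming an RGGB Bayer mosaic and a per-camera color matrix, might look like the following. The function name, black/white levels, and the source of the matrix are illustrative assumptions, not details from the paper; the per-camera matrix is also where a camera-conditioned step would naturally enter.

```python
# Hypothetical sketch: convert a RAW Bayer mosaic into a linear XYZ
# "measurement" image, bypassing the post-ISP RGB rendering pipeline.
# All names and parameters here are illustrative, not from the paper.
import numpy as np

def raw_to_meas_xyz(bayer: np.ndarray,
                    black_level: float,
                    white_level: float,
                    cam_to_xyz: np.ndarray) -> np.ndarray:
    """Map an RGGB Bayer mosaic to a half-resolution XYZ image.

    bayer      : (H, W) raw sensor counts, RGGB pattern assumed.
    cam_to_xyz : (3, 3) per-camera color matrix (e.g. from DNG metadata).
    """
    # Linearize: remove the sensor's black offset and normalize to [0, 1].
    lin = (bayer.astype(np.float64) - black_level) / (white_level - black_level)
    lin = np.clip(lin, 0.0, 1.0)

    # Naive "demosaic": treat each 2x2 RGGB tile as one RGB sample,
    # averaging the two greens. Real pipelines use proper demosaicing.
    r = lin[0::2, 0::2]
    g = 0.5 * (lin[0::2, 1::2] + lin[1::2, 0::2])
    b = lin[1::2, 1::2]
    cam_rgb = np.stack([r, g, b], axis=-1)            # (H/2, W/2, 3)

    # Apply the camera-specific matrix to land in CIE XYZ, keeping the
    # result linear in scene radiance rather than display-referred RGB.
    return cam_rgb @ cam_to_xyz.T
```
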
Key facts

  • PRISM-VL uses RAW-derived Meas.-XYZ inputs instead of post-ISP RGB images.
  • The method incorporates camera-conditioned grounding and Exposure-Bracketed Supervision Aggregation (see the sketch after this list).
  • Training used a quality-controlled 150K instruction-tuning set.
  • Benchmark targeted low-light, HDR, visibility-sensitive, and hallucination-sensitive cases.
  • PRISM-VL-8B achieved BLEU 0.6120, ROUGE-L 0.4571, and LLM-Judge accuracy 82.66%.
  • Improvement over Qwen3-VL-8B baseline: +0.1074 BLEU, +0.1071 ROUGE-L, +4.46 percentage points LLM-Judge accuracy.
  • The approach aims to reduce information loss from RGB rendering.
  • Published on arXiv with ID 2605.11727.
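
How Exposure-Bracketed Supervision Aggregation combines its brackets is not detailed in this brief. One plausible reading, sketched below purely as an assumption, is to pool the training loss across several exposures of the same scene while down-weighting brackets whose pixels are mostly crushed or clipped. Every name and the weighting rule here are hypothetical.

```python
# Hedged sketch of exposure-bracketed supervision aggregation: average
# per-exposure losses, weighted by how well-exposed each bracket is.
# The weighting rule and all names are assumptions, not the paper's method.
import torch

def well_exposed_weight(xyz: torch.Tensor,
                        lo: float = 0.02, hi: float = 0.98) -> torch.Tensor:
    """Fraction of pixels neither crushed nor clipped, per exposure."""
    y = xyz[..., 1]                          # Y (luminance) channel of XYZ
    ok = (y > lo) & (y < hi)
    return ok.float().mean(dim=(-2, -1))     # (batch, n_exposures)

def aggregate_bracket_loss(per_exposure_losses: torch.Tensor,
                           xyz_brackets: torch.Tensor) -> torch.Tensor:
    """Combine supervision across an exposure bracket of one scene.

    per_exposure_losses : (batch, n_exposures) language-modeling losses.
    xyz_brackets        : (batch, n_exposures, H, W, 3) Meas.-XYZ images.
    """
    w = well_exposed_weight(xyz_brackets)    # (batch, n_exposures)
    w = w / w.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    return (w * per_exposure_losses).sum(dim=-1).mean()
```

Weighting by a well-exposedness score is one way such an aggregation could let the pooled signal favor the bracket in which the scene content is actually measurable, which fits the benchmark's emphasis on low-light and HDR cases.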

Entities

Institutions

  • arXiv

Sources