PRISM-VL: Grounding Vision-Language Models in RAW Sensor Data
PRISM-VL is a method for grounding vision-language models in RAW sensor data rather than in processed RGB images. Instead of consuming post-ISP renders, it operates on RAW-derived Meas.-XYZ inputs, uses camera-conditioned grounding to adapt across sensors, and applies Exposure-Bracketed Supervision Aggregation, shifting learning from RGB appearance to physical measurements. The model was trained on a quality-controlled 150K instruction-tuning set and evaluated under challenging conditions such as low light and HDR. PRISM-VL-8B reached a BLEU of 0.6120, a ROUGE-L of 0.4571, and 82.66% LLM-Judge accuracy, outperforming the RGB-based Qwen3-VL-8B baseline by +0.1074 BLEU, +0.1071 ROUGE-L, and +4.46 percentage points, suggesting that measurement-centric inputs can improve performance in difficult visual settings.
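The exact Meas.-XYZ pipeline is not spelled out here, but a minimal sketch of the underlying idea, assuming an RGGB Bayer RAW frame, black/white-level metadata, and a per-camera RAW-to-CIE-XYZ color matrix (all names and the demosaic choice below are hypothetical), might look like this:

```python
import numpy as np

def raw_to_meas_xyz(raw, black_level, white_level, cam_to_xyz):
    """Convert a Bayer RAW frame to a measurement-space XYZ image (sketch).

    raw         : (H, W) uint16 Bayer mosaic, RGGB layout assumed
    black_level : sensor black level to subtract
    white_level : sensor saturation level
    cam_to_xyz  : (3, 3) camera-specific RAW->CIE XYZ matrix
    """
    # Normalize to [0, 1] linear sensor response; no tone curve or gamma,
    # so values stay proportional to scene radiance.
    lin = (raw.astype(np.float32) - black_level) / (white_level - black_level)
    lin = np.clip(lin, 0.0, 1.0)

    # Naive half-resolution demosaic: average the two greens per 2x2 tile.
    r = lin[0::2, 0::2]
    g = 0.5 * (lin[0::2, 1::2] + lin[1::2, 0::2])
    b = lin[1::2, 1::2]
    rgb = np.stack([r, g, b], axis=-1)   # (H/2, W/2, 3), camera RGB

    # Map camera RGB into CIE XYZ so inputs from different sensors share
    # one measurement space.
    xyz = rgb @ cam_to_xyz.T             # (H/2, W/2, 3)
    return xyz
```

A tone-mapped post-ISP RGB image discards this linearity and clips highlight and shadow detail, which is the information loss the measurement-centric approach aims to avoid.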
Key facts
- PRISM-VL uses RAW-derived Meas.-XYZ inputs instead of post-ISP RGB images.
- The method incorporates camera-conditioned grounding and Exposure-Bracketed Supervision Aggregation (sketched after this list).
- Training used a quality-controlled 150K instruction-tuning set.
- Benchmark targeted low-light, HDR, visibility-sensitive, and hallucination-sensitive cases.
- PRISM-VL-8B achieved BLEU 0.6120, ROUGE-L 0.4571, and LLM-Judge accuracy 82.66%.
- Improvement over the Qwen3-VL-8B baseline: +0.1074 BLEU, +0.1071 ROUGE-L, and +4.46 percentage points in LLM-Judge accuracy.
- The approach aims to reduce information loss from RGB rendering.
- Published on arXiv with ID 2605.11727.
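The paper's aggregation rule is not given here; one plausible reading, sketched below with hypothetical names and a uniform weighting, is that the same instruction target supervises every exposure in a bracket, with the per-exposure losses aggregated into a single update. The `camera_id` argument stands in for camera-conditioned grounding; the model's interface is an assumption.

```python
import torch
import torch.nn.functional as F

def bracketed_supervision_loss(model, brackets, camera_ids, targets):
    """Aggregate supervision across one exposure bracket (hypothetical sketch).

    brackets   : list of Meas.-XYZ tensors, one per exposure, each (3, H, W)
    camera_ids : int id per exposure, used for camera-conditioned grounding
    targets    : token ids of the shared instruction-tuning answer, (T,)
    """
    losses = []
    for image, cam_id in zip(brackets, camera_ids):
        # Camera-conditioned grounding: the model sees the camera identity
        # alongside the measurement image (this interface is an assumption).
        logits = model(image.unsqueeze(0),
                       camera_id=torch.tensor([cam_id]))  # (1, T, vocab)
        losses.append(F.cross_entropy(logits.squeeze(0), targets))
    # Uniform aggregation over exposures; the paper may instead weight by
    # exposure quality -- also an assumption.
    return torch.stack(losses).mean()
```

Training every exposure against one shared target encourages answers that hold across the whole bracket, which is consistent with the benchmark's focus on low-light, HDR, and visibility-sensitive cases.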
Entities
Institutions
- arXiv