ARTFEED — Contemporary Art Intelligence

VPSG Method Corrects Coordinate Prediction Bias in MLLMs

ai-technology · 2026-04-30

A recent study finds that high-resolution inputs degrade the visual positional encodings (VPEs) of multimodal large language models (MLLMs), producing predictable, directional biases in coordinate predictions rather than random noise. To address this, the researchers propose Vision-PE Shuffle Guidance (VPSG), a training-free correction applied at inference time. VPSG shuffles the VPEs to expose the model's position-unconditioned tendencies, then uses this negative evidence to steer digit decoding through a lightweight finite-state machine. On the ScreenSpot-Pro benchmark, VPSG corrects coordinate drift and yields consistent gains in localization accuracy across model sizes.
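The core idea resembles contrastive guidance: run decoding once normally and once with shuffled VPEs, then push the output distribution away from what the position-blind pass prefers. A minimal sketch of that adjustment, with the function name, guidance strength `alpha`, and plain-list logits all assumptions for illustration (the paper's exact formula may differ):

```python
def vpsg_adjust_logits(logits, logits_shuffled, alpha=1.0):
    """Down-weight tokens the model favors even when visual positional
    information is destroyed (negative evidence from the shuffled pass).

    logits          -- per-token scores from the normal forward pass
    logits_shuffled -- scores from the pass with shuffled VPEs
    alpha           -- assumed guidance strength (not from the paper)
    """
    return [l + alpha * (l - s) for l, s in zip(logits, logits_shuffled)]


# Toy usage: token 1 is preferred by the grounded pass, token 0 only by
# the position-blind pass, so guidance widens the gap in favor of token 1.
adjusted = vpsg_adjust_logits([1.0, 2.0], [2.0, 1.0], alpha=1.0)
```

In this toy case the adjusted scores become `[0.0, 3.0]`, amplifying the token that the position-aware pass uniquely supports.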

Key facts

  • Multimodal Large Language Models (MLLMs) show degraded visual positional encodings (VPEs) with high-resolution inputs.
  • Encoding failures trigger predictable, directional biases, not random noise.
  • Models default to internal spatial priors when grounding signals are weak.
  • Vision-PE Shuffle Guidance (VPSG) is a training-free, inference-time correction method.
  • VPSG shuffles VPEs to isolate position-unconditioned tendencies.
  • A lightweight finite-state machine steers digit decoding using negative evidence.
  • Evaluation on ScreenSpot-Pro benchmark shows consistent improvements in localization accuracy.
  • The method works across various model scales.
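The finite-state machine mentioned above plausibly serves to gate the guidance so it only touches the digit tokens of a coordinate, leaving ordinary text decoding unchanged. A toy sketch, where the two-state design and the transition rule are assumptions rather than the paper's actual automaton:

```python
from enum import Enum


class State(Enum):
    TEXT = 0    # decoding ordinary text; no guidance
    DIGITS = 1  # decoding coordinate digits; apply guidance


class DigitGateFSM:
    """Illustrative gate: report whether VPSG guidance should apply,
    based on whether the model is currently emitting digit tokens."""

    def __init__(self):
        self.state = State.TEXT

    def step(self, token: str) -> bool:
        # Transition on the just-emitted token; digits enter/stay in
        # the DIGITS state, anything else falls back to TEXT.
        self.state = State.DIGITS if token.isdigit() else State.TEXT
        return self.state is State.DIGITS
```

For a decoded sequence like `(`, `3`, `2`, `,` the gate would switch on only for the two digit tokens, which is how a lightweight automaton can confine the correction to coordinate output.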
