PGT: Procedurally Generated Tasks Boost Visual Grounding in MLLMs

ai-technology · 2026-05-25

Researchers propose Procedurally Generated Tasks (PGT), a data-driven framework for improving fine-grained visual understanding in Multimodal Large Language Models (MLLMs). PGT overlays geometric primitives on images to generate dense supervision, which disentangles visual grounding from semantic priors. Augmenting LLaVA-v1.5-Instruct with PGT data yields gains of up to +20% on the What'sUp benchmark and +13.3% on CV-Bench-2D while maintaining general perception capabilities. The framework also serves as a low-cost diagnostic tool for identifying perception failures. The paper is available on arXiv under ID 2605.23883.
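To make the overlay idea concrete, here is a minimal sketch of how a procedurally generated question could be derived from primitive placements alone. It is not the paper's pipeline: the shape and color vocabularies, the `Primitive` record, the left/right question template, and the function names are all assumptions made for illustration, and the rendering step that would actually draw the primitives onto an image is omitted.

```python
import random
from dataclasses import dataclass

# Hypothetical vocabularies; the source does not list the primitives PGT uses.
SHAPES = ("circle", "square", "triangle")
COLORS = ("red", "green", "blue", "yellow")


@dataclass(frozen=True)
class Primitive:
    shape: str
    color: str
    cx: int    # center x in pixels
    cy: int    # center y in pixels
    size: int  # half-width in pixels


def sample_primitives(width, height, n, rng, size=20):
    """Sample n primitives with distinct (color, shape) pairs, fully inside the image."""
    combos = rng.sample([(c, s) for c in COLORS for s in SHAPES], n)
    return [
        Primitive(s, c, rng.randint(size, width - size), rng.randint(size, height - size), size)
        for c, s in combos
    ]


def left_right_question(a, b):
    """Build one QA pair whose answer depends only on the overlay coordinates."""
    question = f"Is the {a.color} {a.shape} to the left or right of the {b.color} {b.shape}?"
    answer = "left" if a.cx < b.cx else "right"
    return {"question": question, "answer": answer}


def make_task(width, height, seed):
    """Sample two overlays and a question about them (assumes width is well above 3 * size)."""
    rng = random.Random(seed)
    a, b = sample_primitives(width, height, 2, rng)
    # Resample until the horizontal relation is unambiguous.
    while abs(a.cx - b.cx) < a.size:
        a, b = sample_primitives(width, height, 2, rng)
    return {"overlays": [a, b], **left_right_question(a, b)}
```

Because the answer is computed from coordinates rather than from anything in the underlying photo, a model cannot rely on what objects usually look like or where they usually appear, which is one way to read the claim about separating grounding from semantic priors.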

Key facts

PGT stands for Procedurally Generated Tasks
PGT improves fine-grained visual understanding in MLLMs
PGT overlays geometric primitives on images
PGT disentangles visual grounding from semantic priors
Up to +20% improvement on the What'sUp benchmark
+13.3% improvement on CV-Bench-2D
PGT acts as a low-cost diagnostic tool
Paper available on arXiv: 2605.23883
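
The diagnostic use mentioned above could, under some assumptions, amount to scoring a model's answers separately for each task type and looking for the weak ones. The sketch below assumes a hypothetical record format with `task`, `answer`, and `prediction` keys; neither the format nor the task names come from the paper.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Return exact-match accuracy per task type.

    records: iterable of dicts with 'task', 'answer', and 'prediction' keys
    (a hypothetical format chosen for this sketch).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        totals[r["task"]] += 1
        # Case- and whitespace-insensitive exact match.
        if r["prediction"].strip().lower() == r["answer"].strip().lower():
            correct[r["task"]] += 1
    return {task: correct[task] / totals[task] for task in totals}
```

A task type whose accuracy sits near chance would point to a specific perception failure worth investigating, without requiring any human-annotated data.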
