ARTFEED — Contemporary Art Intelligence

PGT: Procedurally Generated Tasks Boost Visual Grounding in MLLMs

ai-technology · 2026-05-25

Researchers propose Procedurally Generated Tasks (PGT), a data-driven framework for improving fine-grained visual understanding in Multimodal Large Language Models (MLLMs). PGT overlays geometric primitives on images to generate dense supervision, which disentangles visual grounding from semantic priors. When LLaVA-v1.5-Instruct is augmented with PGT data, experiments show gains of up to +20% on the What'sUp benchmark and +13.3% on CV-Bench-2D, while general perception capabilities are maintained. The framework also serves as a low-cost diagnostic tool for identifying perception failures. The paper is available on arXiv under ID 2605.23883.
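To make the overlay idea concrete, here is a minimal sketch of how a procedurally generated sample might be built. It is an illustration under stated assumptions, not the paper's implementation: the function name overlay_primitives, the choice of a circle and a square, the question wording, and the list-of-RGB-tuples image format are all hypothetical. The point it shows is that the label comes only from the sampled coordinates, so the answer cannot be guessed from what the underlying image depicts.

```python
import random


def overlay_primitives(image, rng, size=6):
    """Draw a red circle and a blue square on a copy of `image`.

    `image` is a list of rows, each row a list of RGB tuples (hypothetical
    format chosen to keep the sketch dependency-free). Returns the new image
    and a question/answer pair derived purely from the sampled coordinates.
    """
    height, width = len(image), len(image[0])
    out = [list(row) for row in image]  # copy so the input stays untouched

    # Put one shape in each half of the image so they never overlap and the
    # left/right relation is unambiguous.
    half = width // 2
    left_cx = rng.randrange(size, half - size)
    right_cx = rng.randrange(half + size, width - size)
    circle_on_left = rng.random() < 0.5
    circle_cx, square_cx = (
        (left_cx, right_cx) if circle_on_left else (right_cx, left_cx)
    )
    circle_cy = rng.randrange(size, height - size)
    square_cy = rng.randrange(size, height - size)

    for y in range(height):
        for x in range(width):
            if (x - circle_cx) ** 2 + (y - circle_cy) ** 2 <= size ** 2:
                out[y][x] = (255, 0, 0)  # filled circle
            elif abs(x - square_cx) <= size and abs(y - square_cy) <= size:
                out[y][x] = (0, 0, 255)  # filled square

    sample = {
        "question": "Is the red circle to the left of the blue square?",
        "answer": "yes" if circle_on_left else "no",
        "circle_center": (circle_cx, circle_cy),
        "square_center": (square_cx, square_cy),
    }
    return out, sample


# Usage: a flat gray image stands in for a real photo.
rng = random.Random(0)
background = [[(128, 128, 128)] * 64 for _ in range(48)]
augmented, sample = overlay_primitives(background, rng)
print(sample["question"], sample["answer"])
```

Because every sample carries its own ground truth, the same generator could in principle double as a cheap probe: a model that answers such questions incorrectly is failing at perception rather than at world knowledge.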

Key facts

  • PGT stands for Procedurally Generated Tasks
  • PGT improves fine-grained visual understanding in MLLMs
  • PGT overlays geometric primitives on images
  • PGT disentangles visual grounding from semantic priors
  • Up to +20% improvement on the What'sUp benchmark
  • +13.3% improvement on CV-Bench-2D
  • PGT acts as a low-cost diagnostic tool
  • Paper available on arXiv: 2605.23883

Entities

Institutions

  • arXiv
