Ghost-100 Benchmark Tests Vision-Language Models Under Coercive Prompting
A new research paper introduces Ghost-100, a benchmark designed to evaluate how Vision-Language Models (VLMs) respond to coercive prompt phrasing. The study addresses a gap in existing hallucination benchmarks, which typically use neutral prompts and binary detection methods. Ghost-100 contains 800 synthetically generated images across eight categories within three task families: text-illegibility, time-reading, and object-absence. Each task is constructed under a negative-ground-truth principle, ensuring the queried target is absent, illegible, or indeterminate. Every image is paired with five prompts from a structured 5-Level Prompt Intensity Framework, allowing researchers to measure both the incidence and intensity of fabrication under graded linguistic pressure. The research, published as arXiv:2604.18803v1, targets settings where reliable visual grounding has operational consequences, characterizing how VLMs behave across structurally distinct task types as prompt language becomes progressively more coercive.
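To make the benchmark's structure concrete, here is a minimal sketch of how a single Ghost-100 item could be represented. The schema, field names, and the `GhostItem` type are illustrative assumptions, not the paper's released data format.

```python
from dataclasses import dataclass

# Hypothetical schema for one Ghost-100 item; field names and types are
# illustrative assumptions, not the paper's released data format.
TASK_FAMILIES = ("text-illegibility", "time-reading", "object-absence")

@dataclass(frozen=True)
class GhostItem:
    image_path: str    # one of the 800 synthetic images
    task_family: str   # one of the three task families above
    category: str      # one of the eight categories
    # Five prompts ordered by the 5-Level Prompt Intensity Framework,
    # from level 1 (neutral) to level 5 (most coercive).
    prompts: tuple[str, str, str, str, str]
    # Negative ground truth: the queried target is absent, illegible,
    # or indeterminate, so any concrete answer is a fabrication.
    ground_truth: str = "indeterminate"
```

Fixing the prompt order inside each item is what lets an evaluator attribute a fabrication to a specific intensity level rather than to the image alone.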
Key facts
- Ghost-100 is a benchmark for evaluating Vision-Language Models under coercive prompting
- Contains 800 synthetically generated images across eight categories
- Images span three task families: text-illegibility, time-reading, and object-absence
- Each task follows a negative-ground-truth principle
- Every image paired with five prompts from a 5-Level Prompt Intensity Framework
- Research published as arXiv:2604.18803v1
- Addresses gap in existing hallucination benchmarks that use neutral prompts
- Measures both incidence and intensity of fabrication under linguistic pressure (see the scoring sketch after this list)
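Under stated assumptions about what counts as a refusal, the two reported quantities could be scored roughly as follows. The `is_fabrication` judge and its keyword heuristics are hypothetical stand-ins for whatever judging procedure the paper actually uses.

```python
# Hypothetical scoring sketch; the refusal heuristics and aggregation
# rules are assumptions, not the paper's actual metrics.

def is_fabrication(response: str) -> bool:
    """Assumed judge: because the ground truth is negative (target
    absent/illegible/indeterminate), any concrete answer counts as a
    fabrication and only refusal-like responses count as faithful."""
    refusal_markers = ("cannot", "unable", "not visible", "illegible")
    return not any(marker in response.lower() for marker in refusal_markers)

def score_item(responses: list[str]) -> tuple[bool, int | None]:
    """responses[i] is the model's answer at prompt intensity level i+1.

    Returns (incidence, intensity):
      incidence - whether the model fabricated at any of the 5 levels
      intensity - lowest intensity level (1-5) at which a fabrication
                  first appeared, or None if the model never fabricated
    """
    for level, response in enumerate(responses, start=1):
        if is_fabrication(response):
            return True, level
    return False, None

# Example: a model that holds out until the coercive level-4 prompt.
responses = [
    "The text is illegible.",
    "I cannot read the sign.",
    "It is not visible in the image.",
    "The sign says 'EXIT'.",   # fabricated under pressure
    "The sign says 'EXIT'.",
]
print(score_item(responses))   # (True, 4)
```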