Ghost-100 Benchmark Tests Vision-Language Models Under Coercive Prompting
A new research paper introduces Ghost-100, a benchmark designed to evaluate how Vision-Language Models (VLMs) respond to coercive prompt phrasing. The study addresses a gap in existing hallucination benchmarks, which typically use neutral prompts and binary detection methods. Ghost-100 contains 800 synthetically generated images across eight categories within three task families: text-illegibility, time-reading, and object-absence. Each task is constructed under a negative-ground-truth principle, ensuring the queried target is absent, illegible, or indeterminate. Every image is paired with five prompts from a structured 5-Level Prompt Intensity Framework, allowing researchers to measure both the incidence and intensity of fabrication under graded linguistic pressure. The research, published as arXiv:2604.18803v1, targets settings where reliable visual grounding has operational consequences, characterizing how VLMs behave across structurally distinct task types as prompt language becomes progressively more coercive.
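To make the benchmark's structure concrete, here is a minimal sketch of how a single Ghost-100 item could be represented. The schema, field names, and the `GhostItem` type are illustrative assumptions, not the paper's released data format.

```python
from dataclasses import dataclass

# Hypothetical schema for one Ghost-100 item; field names and types are
# illustrative assumptions, not the paper's released data format.
TASK_FAMILIES = ("text-illegibility", "time-reading", "object-absence")

@dataclass(frozen=True)
class GhostItem:
    image_path: str    # one of the 800 synthetic images
    task_family: str   # one of the three task families above
    category: str      # one of the eight categories
    # Five prompts ordered by the 5-Level Prompt Intensity Framework,
    # from level 1 (neutral) to level 5 (most coercive).
    prompts: tuple[str, str, str, str, str]
    # Negative ground truth: the queried target is absent, illegible,
    # or indeterminate, so any concrete answer is a fabrication.
    ground_truth: str = "indeterminate"
```

Fixing the prompt order inside each item is what lets an evaluator attribute a fabrication to a specific intensity level rather than to the image alone.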
Key facts
- Ghost-100 is a benchmark for evaluating Vision-Language Models under coercive prompting
- Contains 800 synthetically generated images across eight categories
- Images span three task families: text-illegibility, time-reading, and object-absence
- Each task follows a negative-ground-truth principle
- Every image paired with five prompts from a 5-Level Prompt Intensity Framework
- Research published as arXiv:2604.18803v1
- Addresses gap in existing hallucination benchmarks that use neutral prompts
- Measures both incidence and intensity of fabrication under linguistic pressure (see the scoring sketch after this list)
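Under stated assumptions about what counts as a refusal, the two reported quantities could be scored roughly as follows. The `is_fabrication` judge and its keyword heuristics are hypothetical stand-ins for whatever judging procedure the paper actually uses.

```python
# Hypothetical scoring sketch; the refusal heuristics and aggregation
# rules are assumptions, not the paper's actual metrics.

def is_fabrication(response: str) -> bool:
    """Assumed judge: because the ground truth is negative (target
    absent/illegible/indeterminate), any concrete answer counts as a
    fabrication and only refusal-like responses count as faithful."""
    refusal_markers = ("cannot", "unable", "not visible", "illegible")
    return not any(marker in response.lower() for marker in refusal_markers)

def score_item(responses: list[str]) -> tuple[bool, int | None]:
    """responses[i] is the model's answer at prompt intensity level i+1.

    Returns (incidence, intensity):
      incidence - whether the model fabricated at any of the 5 levels
      intensity - lowest intensity level (1-5) at which a fabrication
                  first appeared, or None if the model never fabricated
    """
    for level, response in enumerate(responses, start=1):
        if is_fabrication(response):
            return True, level
    return False, None

# Example: a model that holds out until the coercive level-4 prompt.
responses = [
    "The text is illegible.",
    "I cannot read the sign.",
    "It is not visible in the image.",
    "The sign says 'EXIT'.",   # fabricated under pressure
    "The sign says 'EXIT'.",
]
print(score_item(responses))   # (True, 4)
```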