FBHM Benchmark Exposes VLMs' Inability to Detect Hateful Memes
Researchers have introduced FBHM (Functionality Based Hateful Memes), a benchmark for testing vision-language models (VLMs) on hateful meme detection. Unlike existing datasets, which confound rhetorical hate mechanisms with features of the targeted community, FBHM systematically separates 25 distinct rhetorical functionalities across 10 target communities in a set of 5,000 memes. Evaluation of state-of-the-art VLMs reveals a severe generalization gap: models that score well on standard datasets drop to near-random accuracy on FBHM, indicating reliance on dataset-specific heuristics rather than robust multimodal reasoning. To address this, the team proposes LSV (learnable steering vectors), a strategy for the ultra-low-data regime that applies causal intervention using as few as 500 steering samples to close the performance gap efficiently. The work is detailed in arXiv paper 2605.31349.
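To make the idea of a functionality-based evaluation concrete, the sketch below shows one way predictions could be broken down by rhetorical functionality and by target community instead of being reported as a single aggregate score. With 25 functionalities and 10 communities there are at most 250 combinations; the source does not say how the 5,000 memes are distributed over them. The record fields (`functionality`, `community`, `label`, `pred`) and the toy values are hypothetical and are not taken from the FBHM release.

```python
from collections import defaultdict

def accuracy_by_group(records, key):
    """Return {group: accuracy}, grouping records by record[key]."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r[key]] += 1
        correct[r[key]] += int(r["pred"] == r["label"])
    return {group: correct[group] / total[group] for group in total}

# Hypothetical toy records; field names and values are illustrative only.
records = [
    {"functionality": "F1", "community": "C1", "label": 1, "pred": 1},
    {"functionality": "F1", "community": "C2", "label": 1, "pred": 0},
    {"functionality": "F2", "community": "C1", "label": 0, "pred": 0},
    {"functionality": "F2", "community": "C2", "label": 1, "pred": 1},
]

by_functionality = accuracy_by_group(records, "functionality")
by_community = accuracy_by_group(records, "community")
print(by_functionality)  # {'F1': 0.5, 'F2': 1.0}
print(by_community)      # {'C1': 1.0, 'C2': 0.5}
```

A breakdown of this kind can reveal a model that looks strong in aggregate but fails on specific functionalities or communities, which is the sort of gap the benchmark is described as exposing.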
Key facts
- FBHM benchmark introduced for hateful meme detection
- 25 rhetorical functionalities and 10 target communities
- 5,000 memes in the benchmark
- State-of-the-art VLMs show near-random performance on FBHM
- Models rely on dataset-specific heuristics
- LSV (learnable steering vectors) proposed as solution
- LSV uses as few as 500 steering samples
- Paper available on arXiv: 2605.31349
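The summary does not describe how LSV is implemented, so the following is only a generic toy illustration of what a learnable steering vector can look like: a trainable vector `v` is added to the hidden state of a frozen model and fitted on a small labeled set, while the model's own weights stay untouched. The frozen head, the synthetic data, the dimensions, and all function names are assumptions made for illustration; this is not the paper's method.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(h, v, w, b):
    """Frozen head applied to the steered hidden state h + v."""
    z = sum(wj * math.tanh(hj + vj) for wj, hj, vj in zip(w, h, v)) + b
    return sigmoid(z)

def mean_loss(data, v, w, b):
    """Mean binary cross-entropy over (hidden_state, label) pairs."""
    total = 0.0
    for h, y in data:
        p = predict(h, v, w, b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)

def fit_steering_vector(data, w, b, lr=0.1, epochs=200):
    """Full-batch gradient descent on v only; w and b stay frozen."""
    v = [0.0] * len(w)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for h, y in data:
            err = predict(h, v, w, b) - y  # dLoss/dlogit for cross-entropy
            for j in range(len(w)):
                t = math.tanh(h[j] + v[j])
                grad[j] += err * w[j] * (1.0 - t * t)
        v = [vj - lr * gj / len(data) for vj, gj in zip(v, grad)]
    return v

# Hypothetical setup: a frozen head and hidden states shifted away from
# the region where that head separates the two classes.
random.seed(0)
w, b = [1.0, -1.0, 0.5, 0.5], 0.0
shift = [-1.0, 1.0, 0.0, 0.0]
data = []
for _ in range(20):  # a tiny "steering set"
    y = random.randint(0, 1)
    sign = 1.0 if y else -1.0
    base = [random.gauss(0.0, 0.5) + sign * s for s in (1.0, -1.0, 0.0, 0.0)]
    data.append(([bj + sj for bj, sj in zip(base, shift)], y))

loss_before = mean_loss(data, [0.0] * len(w), w, b)
v = fit_steering_vector(data, w, b)
loss_after = mean_loss(data, v, w, b)
print(loss_before, loss_after)
```

Only the entries of `v` are trained here, which is why such an approach can in principle work with very few samples: the number of learned parameters equals the hidden dimension rather than the size of the model.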
Entities
Institutions
- arXiv