FBHM Benchmark Exposes VLMs' Inability to Detect Hateful Memes

other · 2026-06-01

Researchers have introduced FBHM (Functionality Based Hateful Memes), a benchmark that tests vision-language models (VLMs) on hateful meme detection. Existing datasets confound rhetorical hate mechanisms with features of the target community; FBHM's 5,000 memes systematically separate the two, covering 25 distinct rhetorical functionalities across 10 target communities. Evaluation of state-of-the-art VLMs reveals a severe generalization gap: models that score well on standard datasets drop to near-random accuracy on FBHM, indicating reliance on dataset-specific heuristics rather than robust multimodal reasoning. To address this, the team proposes LSV (learnable steering vectors), a strategy for the ultra-low-data regime that applies causal intervention using as few as 500 steering samples to close the performance gap efficiently. The work is detailed in arXiv paper 2605.31349.
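
Separating rhetorical functionality from target community means accuracy can be reported along each axis independently. The sketch below illustrates that idea on toy data. It is not the paper's evaluation code: the record fields (`functionality`, `community`, `label`, `pred`), the group names, and the helper `accuracy_by_group` are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: accuracy} for records grouped by the field `key`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r[key]] += 1
        correct[r[key]] += int(r["label"] == r["pred"])
    return {group: correct[group] / total[group] for group in total}


# Hypothetical toy records; names and values are illustrative only,
# not taken from FBHM.
records = [
    {"functionality": "F1", "community": "C1", "label": 1, "pred": 1},
    {"functionality": "F1", "community": "C2", "label": 1, "pred": 0},
    {"functionality": "F2", "community": "C1", "label": 0, "pred": 0},
    {"functionality": "F2", "community": "C2", "label": 1, "pred": 1},
]

# Accuracy along each axis, computed independently of the other.
by_functionality = accuracy_by_group(records, "functionality")
by_community = accuracy_by_group(records, "community")

print(by_functionality)
print(by_community)
```

A breakdown of this kind shows whether errors cluster on a particular rhetorical mechanism or on a particular target community, which a single aggregate score would hide.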

Key facts

FBHM benchmark introduced for hateful meme detection
25 rhetorical functionalities and 10 target communities
5,000 memes in the benchmark
State-of-the-art VLMs show near-random performance on FBHM
Models rely on dataset-specific heuristics
LSV (learnable steering vectors) proposed as solution
LSV uses as few as 500 steering samples
Paper available on arXiv: 2605.31349
