ARTFEED — Contemporary Art Intelligence

RuleSafe-VL Benchmark Tests Vision-Language Models on Content Moderation Rules

ai-technology · 2026-05-11

Researchers have introduced RuleSafe-VL, a benchmark for assessing rule-based decision-making in vision-language content moderation. Derived from publicly accessible platform moderation guidelines, it formalizes 93 atomic rules and 92 typed rule relations, yielding 2,166 context-sensitive image-text pairs. The benchmark targets a shortcoming of existing multimodal safety benchmarks, which reduce moderation to matching predefined final labels without testing whether models apply policy rules correctly or merely rely on superficial cues. RuleSafe-VL evaluates how well models handle explicit policy rules and context-dependent conditions when deciding whether user content should be allowed, restricted, or removed. The research is available on arXiv under identifier 2605.07760.

Key facts

  • RuleSafe-VL is a benchmark for rule-conditioned decision reasoning in vision-language content moderation.
  • It is derived from publicly available platform moderation policies.
  • The benchmark formalizes 93 atomic rules and 92 typed rule relations.
  • It includes 2,166 context-sensitive image-text pairs.
  • Current multimodal safety benchmarks reduce moderation to matching predefined final labels.
  • RuleSafe-VL tests whether models apply policy rules correctly or rely on superficial cues.
  • The research is published on arXiv with identifier 2605.07760.
  • The benchmark evaluates how models handle explicit policy rules and context-dependent conditions.
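The setup described above, in which atomic rules fire on content and typed relations between rules (such as an exception overriding a prohibition) determine a final allow/restrict/remove decision, can be sketched in a few lines. This is a hypothetical illustration, not the RuleSafe-VL schema: the rule names, feature dictionary, and the single "exception" relation type shown here are assumptions made for clarity (the benchmark itself formalizes 92 relation types).

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of rule-conditioned moderation. An AtomicRule
# fires on features extracted from an image-text pair; an exception
# relation lets one fired rule override another.

@dataclass
class AtomicRule:
    rule_id: str
    condition: Callable[[dict], bool]  # predicate over content features
    verdict: str                       # "restrict" or "remove" when triggered

def moderate(features: dict, rules: list[AtomicRule],
             exceptions: dict[str, str]) -> str:
    """Return 'allow', 'restrict', or 'remove' for one image-text item.

    `exceptions` maps a rule_id to the rule_id that overrides it --
    one simple relation type standing in for the benchmark's 92.
    """
    fired = {r.rule_id: r for r in rules if r.condition(features)}
    # Drop any fired rule whose overriding exception also fired.
    active = [r for rid, r in fired.items()
              if exceptions.get(rid) not in fired]
    if not active:
        return "allow"
    # Most severe verdict wins: remove > restrict.
    return "remove" if any(r.verdict == "remove" for r in active) else "restrict"

# Illustrative rules (invented): a prohibition and a contextual exception.
rules = [
    AtomicRule("R1", lambda f: f.get("depicts_weapon", False), "remove"),
    AtomicRule("R2", lambda f: f.get("news_context", False), "restrict"),
]
relations = {"R1": "R2"}  # R2 (news context) overrides R1 (weapon ban)

print(moderate({"depicts_weapon": True}, rules, relations))
# "remove" -- the prohibition fires with no exception present
print(moderate({"depicts_weapon": True, "news_context": True}, rules, relations))
# "restrict" -- the contextual exception downgrades the decision
```

The point the benchmark probes is exactly the second call: a model that pattern-matches on the weapon alone, ignoring the context-dependent exception, produces the wrong decision even though it "detected" the unsafe content.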

Entities

Institutions

  • arXiv

Sources