Bot-Mod: A Framework for Detecting Malicious Intent in Multi-Agent Systems
A new moderation framework called Bot-Mod (Bot-Moderation) addresses the challenge of detecting malicious intent in multi-agent systems. Unlike traditional content-based moderation, Bot-Mod infers an agent's underlying intent through multi-turn dialogue exchanges guided by Gibbs-based sampling over candidate intent hypotheses, progressively narrowing the space of plausible agent objectives to uncover hidden malicious behavior. The framework is evaluated on a dataset derived from Moltbook that spans diverse scenarios. The research is published on arXiv under the identifier 2605.12856.
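The summary names the technique (multi-turn dialogue plus Gibbs-based sampling over candidate intent hypotheses) without giving details, so the following is only a minimal sketch of how such a sampler might look: each agent's latent intent is resampled conditioned on the other agents' current assignments and that agent's observed dialogue turns. The intent labels, the `log_likelihood` stand-in, the coupling term, and all defaults are illustrative assumptions, not Bot-Mod's actual design.

```python
import math
import random

# Hypothetical candidate intent hypotheses. The summary does not specify
# Bot-Mod's actual hypothesis space, likelihood model, or parameters;
# every name and default below is illustrative.
INTENTS = ["benign", "data_exfiltration", "prompt_injection", "collusion"]


def log_likelihood(turn_features, intent):
    """Toy stand-in for log P(dialogue turn | intent). A real system would
    score the utterance with a learned model; here each turn is assumed to
    be a dict mapping intents to compatibility scores in (0, 1]."""
    return math.log(turn_features.get(intent, 1e-6))


def gibbs_sweep(assignments, dialogue, prior, coupling=0.5):
    """One Gibbs pass: resample each agent's latent intent conditioned on
    the other agents' current assignments and its own dialogue turns."""
    for agent in assignments:
        log_post = []
        for intent in INTENTS:
            lp = math.log(prior[intent])
            for turn in dialogue[agent]:
                lp += log_likelihood(turn, intent)
            # Assumed coupling term: sharing an intent with other agents
            # (e.g., collusion) raises the conditional probability.
            peers = sum(1 for a, i in assignments.items()
                        if a != agent and i == intent)
            lp += coupling * peers
            log_post.append(lp)
        # Normalize in log space and sample this agent's new intent.
        m = max(log_post)
        weights = [math.exp(lp - m) for lp in log_post]
        assignments[agent] = random.choices(INTENTS, weights=weights)[0]
    return assignments


def moderate(dialogue, n_sweeps=200, burn_in=50, threshold=0.8):
    """Run repeated sweeps over the multi-turn dialogue and flag agents
    whose post-burn-in posterior mass on non-benign intents exceeds
    `threshold` (names and defaults are illustrative)."""
    prior = {i: 1.0 / len(INTENTS) for i in INTENTS}
    assignments = {a: random.choice(INTENTS) for a in dialogue}
    counts = {a: {i: 0 for i in INTENTS} for a in dialogue}
    for sweep in range(n_sweeps):
        assignments = gibbs_sweep(assignments, dialogue, prior)
        if sweep >= burn_in:
            for a, i in assignments.items():
                counts[a][i] += 1
    n = n_sweeps - burn_in
    return {a: 1.0 - c["benign"] / n
            for a, c in counts.items()
            if 1.0 - c["benign"] / n >= threshold}
```

As samples accumulate across sweeps, the empirical posterior concentrates on a shrinking set of plausible intents per agent, which mirrors the progressive narrowing of candidate hypotheses the summary describes.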
Key facts
- Bot-Mod is a moderation framework for multi-agent systems.
- It targets underlying malicious intent rather than relying on content-level signals.
- The framework uses multi-turn dialogue and Gibbs-based sampling.
- It progressively narrows candidate intent hypotheses.
- Evaluation uses a dataset derived from Moltbook.
- The paper is available on arXiv with ID 2605.12856.
- The approach addresses novel moderation challenges beyond content filtering.
- Malicious agents may produce benign-looking content to evade detection.