Bot-Mod: A Framework for Detecting Malicious Intent in Multi-Agent Systems
A new moderation framework called Bot-Mod (Bot-Moderation) addresses the challenge of detecting malicious intent in multi-agent systems. Unlike traditional content-based moderation, Bot-Mod infers an agent's underlying intent through multi-turn dialogue exchanges guided by Gibbs-based sampling over candidate intent hypotheses, progressively narrowing the space of plausible agent objectives to uncover hidden malicious behavior. The framework is evaluated on a dataset derived from Moltbook that spans diverse scenarios. The research is published on arXiv under the identifier 2605.12856.
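The summary names the technique (multi-turn dialogue plus Gibbs-based sampling over candidate intent hypotheses) without giving details, so the following is only a minimal sketch of how such a sampler might look: each agent's latent intent is resampled conditioned on the other agents' current assignments and that agent's observed dialogue turns. The intent labels, the `log_likelihood` stand-in, the coupling term, and all defaults are illustrative assumptions, not Bot-Mod's actual design.

```python
import math
import random

# Hypothetical candidate intent hypotheses. The summary does not specify
# Bot-Mod's actual hypothesis space, likelihood model, or parameters;
# every name and default below is illustrative.
INTENTS = ["benign", "data_exfiltration", "prompt_injection", "collusion"]


def log_likelihood(turn_features, intent):
    """Toy stand-in for log P(dialogue turn | intent). A real system would
    score the utterance with a learned model; here each turn is assumed to
    be a dict mapping intents to compatibility scores in (0, 1]."""
    return math.log(turn_features.get(intent, 1e-6))


def gibbs_sweep(assignments, dialogue, prior, coupling=0.5):
    """One Gibbs pass: resample each agent's latent intent conditioned on
    the other agents' current assignments and its own dialogue turns."""
    for agent in assignments:
        log_post = []
        for intent in INTENTS:
            lp = math.log(prior[intent])
            for turn in dialogue[agent]:
                lp += log_likelihood(turn, intent)
            # Assumed coupling term: sharing an intent with other agents
            # (e.g., collusion) raises the conditional probability.
            peers = sum(1 for a, i in assignments.items()
                        if a != agent and i == intent)
            lp += coupling * peers
            log_post.append(lp)
        # Normalize in log space and sample this agent's new intent.
        m = max(log_post)
        weights = [math.exp(lp - m) for lp in log_post]
        assignments[agent] = random.choices(INTENTS, weights=weights)[0]
    return assignments


def moderate(dialogue, n_sweeps=200, burn_in=50, threshold=0.8):
    """Run repeated sweeps over the multi-turn dialogue and flag agents
    whose post-burn-in posterior mass on non-benign intents exceeds
    `threshold` (names and defaults are illustrative)."""
    prior = {i: 1.0 / len(INTENTS) for i in INTENTS}
    assignments = {a: random.choice(INTENTS) for a in dialogue}
    counts = {a: {i: 0 for i in INTENTS} for a in dialogue}
    for sweep in range(n_sweeps):
        assignments = gibbs_sweep(assignments, dialogue, prior)
        if sweep >= burn_in:
            for a, i in assignments.items():
                counts[a][i] += 1
    n = n_sweeps - burn_in
    return {a: 1.0 - c["benign"] / n
            for a, c in counts.items()
            if 1.0 - c["benign"] / n >= threshold}
```

As samples accumulate across sweeps, the empirical posterior concentrates on a shrinking set of plausible intents per agent, which mirrors the progressive narrowing of candidate hypotheses the summary describes.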
Key facts
- Bot-Mod is a moderation framework for multi-agent systems.
- It targets underlying malicious intent rather than relying on content-level signals.
- The framework uses multi-turn dialogue and Gibbs-based sampling.
- It progressively narrows candidate intent hypotheses.
- Evaluation uses a dataset derived from Moltbook.
- The paper is available on arXiv with ID 2605.12856.
- The approach addresses novel moderation challenges beyond content filtering.
- Malicious agents may produce benign-looking content to evade detection.