ARTFEED — Contemporary Art Intelligence

Bot-Mod: A Framework for Detecting Malicious Intent in Multi-Agent Systems

ai-technology · 2026-05-14

A new moderation framework called Bot-Mod (Bot-Moderation) addresses the challenge of detecting malicious intent in multi-agent systems. Traditional moderation inspects content, which malicious agents can evade by producing benign-looking outputs; Bot-Mod instead infers an agent's underlying intent through multi-turn dialogue exchanges guided by Gibbs-based sampling over candidate intent hypotheses. This approach progressively narrows the space of plausible agent objectives until hidden malicious behavior is exposed. The framework is evaluated on a dataset derived from Moltbook that spans diverse scenarios. The research is published on arXiv under the identifier 2605.12856.
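
To make the idea concrete, here is a minimal toy sketch of Gibbs-style inference over intent hypotheses. It is not the paper's algorithm: the model, the binary intent space, the per-turn "on-intent" flags, and all probabilities below are illustrative assumptions. A global latent intent z and per-turn latent flags s_t are resampled in alternation, so repeated dialogue probes progressively concentrate posterior mass on one intent hypothesis.

```python
import random

# Hypothetical toy model (NOT the Bot-Mod paper's actual formulation):
# z      -- global latent intent, "benign" or "malicious"
# s[t]   -- latent flag: does turn t reveal the intent, or is it filler?
# x[t]   -- observed per-turn suspicion score (0 or 1) from a dialogue probe
# Gibbs sweep: (1) resample each s[t] given z, (2) resample z given all s.

INTENTS = ["benign", "malicious"]

def turn_likelihood(x, z, on_intent):
    """P(score x | intent z, flag). Filler turns are uninformative (0.5)."""
    if not on_intent:
        return 0.5
    p = 0.8 if z == "malicious" else 0.2  # on-intent turns track z
    return p if x == 1 else 1.0 - p

def gibbs(turn_scores, iters=2000, seed=0):
    rng = random.Random(seed)
    z = rng.choice(INTENTS)
    s = [rng.random() < 0.5 for _ in turn_scores]
    counts = {i: 0 for i in INTENTS}
    for it in range(iters):
        # (1) resample each per-turn flag given the current intent z
        for t, x in enumerate(turn_scores):
            w_on = 0.5 * turn_likelihood(x, z, True)    # prior P(s=1) = 0.5
            w_off = 0.5 * turn_likelihood(x, z, False)
            s[t] = rng.random() < w_on / (w_on + w_off)
        # (2) resample the global intent given all flags
        weights = []
        for cand in INTENTS:
            w = 0.5  # uniform prior over the two intent hypotheses
            for x, st in zip(turn_scores, s):
                w *= turn_likelihood(x, cand, st)
            weights.append(w)
        z = INTENTS[0] if rng.random() * sum(weights) < weights[0] else INTENTS[1]
        if it >= iters // 2:  # discard first half as burn-in
            counts[z] += 1
    total = sum(counts.values())
    return {i: c / total for i, c in counts.items()}

# Mostly suspicious probes should shift posterior mass toward "malicious".
print(gibbs([1, 1, 0, 1, 1, 1]))
```

The sketch mirrors the narrowing behavior described above: each Gibbs sweep re-weights the intent hypotheses against the accumulated dialogue evidence, so an agent whose probes keep scoring as suspicious ends up with most posterior mass on the malicious hypothesis even if individual turns look benign.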

Key facts

  • Bot-Mod is a moderation framework for multi-agent systems.
  • It detects malicious intent rather than relying on content-level signals.
  • The framework uses multi-turn dialogue and Gibbs-based sampling.
  • It progressively narrows candidate intent hypotheses.
  • Evaluation uses a dataset derived from Moltbook.
  • The paper is available on arXiv with ID 2605.12856.
  • The approach addresses novel moderation challenges beyond content filtering.
  • Malicious agents may produce benign-looking content to evade detection.

Entities

Institutions

  • arXiv

Sources