ARTFEED — Contemporary Art Intelligence

ROME and ARISE: New Methods for Testing LLM Agent Safety on Deceptive Scenarios

ai-technology · 2026-05-07

Researchers have developed ROME (Red-team Orchestrated Multi-agent Evolution), a controlled benchmark-construction framework that rewrites existing unsafe agent trajectories into intricate, deceptive evaluation scenarios while retaining their original risk labels. Starting from 100 unsafe source trajectories, ROME produces 300 challenge instances laced with contextual ambiguity and hidden risks that complicate an agent's decision-making. The findings indicate these challenge sets sharply degrade safety-judgment performance, particularly on cases with concealed risks, which trouble even frontier models. The team also introduces ARISE (Analogical Reasoning for Improved Safety Evaluation) to further refine safety evaluation methods. The full study is available on arXiv.

Key facts

  • ROME is a controlled benchmark-construction pipeline that rewrites unsafe trajectories into deceptive evaluation instances.
  • ROME produces 300 challenge instances from 100 unsafe source trajectories.
  • Challenge instances span contextual ambiguity, implicit risks, and shortcut decision-making.
  • Hidden-risk cases degrade safety-judgment performance even for frontier models.
  • ARISE is a method for analogical reasoning to improve safety evaluation.
  • Existing safety benchmarks emphasize explicit risks, potentially overstating model ability.
  • The paper is published on arXiv with ID 2605.03242.
  • Tool-using agent systems powered by LLMs are deployed across web, app, OS, and transactional environments.
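The pipeline described above can be sketched in miniature. The data model, field names, and the trivial `obfuscate` rewrite below are illustrative assumptions, not the paper's actual schema; a real ROME-style pipeline would use multi-agent LLM rewriting rather than string prefixing. The sketch only shows the invariants stated in the key facts: each source trajectory yields one variant per challenge category, and the risk label is inherited unchanged.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    steps: list[str]          # agent actions, e.g. tool calls
    risk_label: str           # e.g. "unsafe"

@dataclass
class ChallengeInstance:
    source: Trajectory
    steps: list[str]
    risk_label: str           # inherited unchanged from the source
    challenge_type: str       # one of the three categories below

# The three challenge categories named in the key facts.
CHALLENGE_TYPES = ["contextual_ambiguity", "implicit_risk", "shortcut_decision"]

def obfuscate(steps: list[str], challenge_type: str) -> list[str]:
    """Placeholder rewrite; a real pipeline would bury the risk in
    plausible surrounding context via LLM-driven rewriting."""
    return [f"[{challenge_type}] {s}" for s in steps]

def build_benchmark(sources: list[Trajectory]) -> list[ChallengeInstance]:
    # One variant per challenge type: 100 sources -> 300 instances.
    return [
        ChallengeInstance(
            source=t,
            steps=obfuscate(t.steps, ct),
            risk_label=t.risk_label,   # label retained by construction
            challenge_type=ct,
        )
        for t in sources
        for ct in CHALLENGE_TYPES
    ]

if __name__ == "__main__":
    seeds = [Trajectory(steps=["open_url", "submit_form"], risk_label="unsafe")
             for _ in range(100)]
    bench = build_benchmark(seeds)
    print(len(bench))  # 300
```

Because labels are copied rather than re-judged, the transformed set can be scored against ground truth without relabeling, which is what makes the construction "controlled".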

Entities

Institutions

  • arXiv
