ARTFEED — Contemporary Art Intelligence

SIREN AI Model Detects Harmful Content Using Internal LLM Representations

ai-technology · 2026-04-22

A new AI safety model named SIREN identifies harmful content in large language models by examining internal representations rather than final outputs alone. This lightweight guard model locates safety-related neurons across multiple internal layers through linear probing and combines their signals with an adaptive layer-weighted strategy. Because SIREN leaves the original LLM architecture untouched, it provides a non-invasive safety approach. Extensive evaluations show that SIREN surpasses existing open-source guard models across numerous benchmarks while using 250 times fewer trainable parameters. The model also generalizes better to unseen benchmarks and supports real-time streaming detection. The research, detailed in arXiv:2604.18519v1, addresses the shortcomings of current guard models that rely solely on terminal-layer representations.
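To make the probing idea concrete, the sketch below shows how one linear probe per layer could score pooled hidden states taken from a frozen LLM. The class name, layer count, and hidden size are illustrative assumptions, not the authors' released code.

    import torch
    import torch.nn as nn

    NUM_LAYERS, HIDDEN_DIM = 32, 4096   # illustrative sizes, not from the paper

    class LayerProbe(nn.Module):
        # One linear probe per internal layer; each scores a pooled hidden
        # state as harmful vs. benign. The base LLM itself stays frozen.
        def __init__(self, num_layers, hidden_dim):
            super().__init__()
            self.probes = nn.ModuleList(
                nn.Linear(hidden_dim, 1) for _ in range(num_layers)
            )

        def forward(self, hidden_states):
            # hidden_states: (num_layers, batch, hidden_dim), pooled over tokens
            per_layer = [probe(hidden_states[i]) for i, probe in enumerate(self.probes)]
            return torch.stack(per_layer, dim=0).squeeze(-1)   # (num_layers, batch)

    probe = LayerProbe(NUM_LAYERS, HIDDEN_DIM)
    dummy = torch.randn(NUM_LAYERS, 2, HIDDEN_DIM)             # fake states for 2 prompts
    print(probe(dummy).shape)                                  # torch.Size([32, 2])

Training such probes only touches the small linear layers, which is consistent with the article's point that the guard adds far fewer trainable parameters than a full generative guard model.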

Key facts

  • SIREN is a lightweight guard model for detecting harmful content in LLMs
  • It analyzes internal representations across multiple layers rather than just terminal outputs
  • The model identifies safety neurons via linear probing
  • It uses an adaptive layer-weighted strategy to combine per-layer safety signals (see the sketch after this list)
  • SIREN outperforms state-of-the-art open-source guard models across multiple benchmarks
  • It uses 250 times fewer trainable parameters than current models
  • The model enables real-time streaming detection of harmful content
  • SIREN improves inference efficiency compared to generative guard models
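Building on the probe idea above, a minimal sketch of the adaptive layer weighting and the streaming check might look like the following. The AdaptiveLayerGuard class, the softmax weighting, the running-mean pooling, and the 0.5 threshold are assumptions made for illustration; the paper's exact aggregation and detection rule may differ.

    import torch
    import torch.nn as nn

    class AdaptiveLayerGuard(nn.Module):
        # Per-layer probes plus one learned weight per layer, softmax-normalized,
        # so more informative layers contribute more to the final harm score.
        def __init__(self, num_layers, hidden_dim):
            super().__init__()
            self.probes = nn.ModuleList(
                nn.Linear(hidden_dim, 1) for _ in range(num_layers)
            )
            self.layer_weights = nn.Parameter(torch.zeros(num_layers))

        def forward(self, hidden_states):
            # hidden_states: (num_layers, batch, hidden_dim)
            scores = torch.stack(
                [p(hidden_states[i]).squeeze(-1) for i, p in enumerate(self.probes)]
            )                                                   # (num_layers, batch)
            w = torch.softmax(self.layer_weights, dim=0)        # (num_layers,)
            return torch.sigmoid((w[:, None] * scores).sum(0))  # (batch,) harm probability

    def stream_check(guard, per_token_states, threshold=0.5):
        # per_token_states: one (num_layers, 1, hidden_dim) tensor per generated token.
        # Keep a running mean of the states and flag as soon as the score crosses
        # the threshold, instead of waiting for the full response.
        running, count = None, 0
        for h in per_token_states:
            count += 1
            running = h if running is None else running + (h - running) / count
            if guard(running).item() > threshold:
                return True
        return False

Because the guard only reads hidden states that the LLM already computes during generation, a check of this kind can run token by token, which is how the real-time streaming detection described above becomes feasible.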

Entities

Institutions

  • arXiv

Sources