ARTFEED — Contemporary Art Intelligence

SLAM: Structural Linguistic Activation Marking for LLM Watermarks

ai-technology · 2026-05-09

A new watermarking technique for large language models, SLAM (Structural Linguistic Activation Marking), has been posted to arXiv. Unlike conventional schemes that bias token distributions, SLAM embeds the watermark in linguistic structure: sparse autoencoders identify residual-stream directions that encode attributes such as voice, tense, and clause order, and SLAM steers those directions at generation time without constraining lexical sampling or semantics. On Gemma-2 2B and 9B, SLAM achieves 100% detection accuracy at a quality cost of only 1-2 reward points, versus 7.5-11.5 for KGW, EWD, and Unigram, while matching the naturalness and diversity of unwatermarked output. It resists word-level edits but remains vulnerable to other attack types, giving it a robustness profile complementary to distribution-based schemes. The paper is available at arXiv:2605.05443.
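
The core idea of steering a residual-stream direction can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the dimensions, the steering strength `alpha`, and the random stand-in for an SAE feature direction are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL = 64    # hypothetical residual-stream width
N_TOKENS = 200  # length of the generated passage

# Stand-in for an SAE feature direction encoding a structural
# attribute (e.g. active vs. passive voice); SLAM would obtain this
# from a sparse autoencoder trained on the model's residual stream.
direction = rng.normal(size=D_MODEL)
direction /= np.linalg.norm(direction)

def steer(resid: np.ndarray, alpha: float = 2.0) -> np.ndarray:
    """Watermark: nudge each token's residual along `direction`.

    Logit sampling is untouched, so the lexical distribution is
    unconstrained; only the structural geometry shifts.
    """
    return resid + alpha * direction

def score(resid: np.ndarray) -> float:
    """Detector statistic: mean projection onto the marked direction."""
    return float((resid @ direction).mean())

plain = rng.normal(size=(N_TOKENS, D_MODEL))  # toy unwatermarked activations
marked = steer(plain)
print(round(score(marked) - score(plain), 6))  # shift equals alpha -> 2.0
```

Because the perturbation is added in activation space rather than at the sampling step, the same text can in principle carry the mark regardless of which tokens were chosen.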

Key facts

  • SLAM stands for Structural Linguistic Activation Marking
  • It is a white-box watermarking scheme for LLMs
  • Uses sparse autoencoders to identify residual-stream directions encoding linguistic structure
  • Steers those directions at generation time without constraining lexical sampling or semantics
  • Tested on Gemma-2 2B and 9B models
  • Achieves 100% detection accuracy
  • Quality cost of 1-2 reward points vs 7.5-11.5 for KGW, EWD, and Unigram
  • Resists word-level edits but is vulnerable to other attack types, a robustness profile complementary to distribution-based schemes

Entities

Institutions

  • arXiv

Sources