BarrierSteer: LLM Safety via Learned Barrier Functions

ai-technology · 2026-05-25

BarrierSteer is a framework for improving the safety of large language models (LLMs) at inference time by embedding learned nonlinear safety constraints in the model's latent representation space. It treats hidden-state safety classifiers as Control Barrier Functions (CBFs) and uses them to steer latent trajectories away from unsafe regions during generation. Multiple safety constraints are merged efficiently, and because the LLM's parameters are never modified, the model's utility is preserved. Theoretical results support the approach, which targets adversarial attacks and unsafe content generation in high-stakes LLM applications.
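The steering idea can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a discrete-time barrier condition of the form h(x_next) >= (1 - alpha) * h(x_prev), a standard choice in the CBF literature, and it replaces the learned hidden-state classifier with a hand-written toy barrier (a ball around a hypothetical "safe" centroid). The names `steer`, `h`, `grad_h`, `center`, `radius`, and `alpha` are all illustrative assumptions.

```python
import numpy as np

def steer(h, grad_h, x_prev, x_nom, alpha=0.5, max_iters=50, tol=1e-9):
    """Correct a proposed latent state x_nom until h(x) >= (1 - alpha) * h(x_prev).

    h      : barrier function, h(x) >= 0 is read as "safe" (assumption)
    grad_h : gradient of h with respect to x
    Each iteration takes the smallest step that satisfies the *linearized*
    condition, so several steps may be needed when h is nonlinear.
    """
    target = (1.0 - alpha) * h(x_prev)
    x = np.array(x_nom, dtype=float)
    for _ in range(max_iters):
        gap = target - h(x)  # > 0 means the barrier condition is violated
        if gap <= tol:
            break
        g = grad_h(x)
        x = x + (gap / (g @ g + 1e-12)) * g  # first-order correction along grad h
    return x

# Toy stand-in for a learned classifier: safe inside a ball (hypothetical).
center = np.zeros(4)
radius = 1.0

def h(x):
    return radius ** 2 - float(np.sum((x - center) ** 2))

def grad_h(x):
    return -2.0 * (x - center)

x_prev = np.array([0.5, 0.0, 0.0, 0.0])  # previous latent state, h = 0.75
x_nom = np.array([3.0, 0.0, 0.0, 0.0])   # proposed next state, h = -8 (unsafe)
x_safe = steer(h, grad_h, x_prev, x_nom)

print("barrier before steering:", h(x_nom))
print("barrier after steering: ", h(x_safe))
```

The model's weights play no role in the sketch: only the latent state is adjusted, which mirrors the claim that the LLM's parameters stay untouched. A state that already satisfies the condition is returned unchanged.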

Key facts

BarrierSteer is an inference-time safety framework for LLMs.
It embeds learned nonlinear safety constraints into latent representation space.
Hidden-state safety classifiers are treated as Control Barrier Functions (CBFs).
During generation, constraint guidance steers latent trajectories away from unsafe regions.
Multiple safety constraints are composed via efficient merging.
No modification of underlying LLM parameters is required.
Model utility is preserved.
The framework targets adversarial attacks and unsafe content generation.
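The summary does not say how several constraints are merged. As one possible illustration, the sketch below combines per-constraint barrier values with a smooth minimum (log-sum-exp), a common way to compose barrier functions. This is an assumption, not necessarily BarrierSteer's method, and the names `soft_min`, `soft_min_weights`, and `sharpness` are hypothetical.

```python
import numpy as np

def soft_min(values, sharpness=10.0):
    """Smooth under-approximation of min(values).

    Computes -(1/k) * log(sum(exp(-k * v_i))) in a numerically stable way.
    The result never exceeds min(values), so a non-negative merged barrier
    implies every individual barrier is non-negative.
    """
    z = -sharpness * np.asarray(values, dtype=float)
    m = np.max(z)
    return float(-(m + np.log(np.sum(np.exp(z - m)))) / sharpness)

def soft_min_weights(values, sharpness=10.0):
    """Partial derivatives of soft_min with respect to each input value.

    These are softmax(-k * v): the most violated constraint gets the
    largest weight when gradients of the individual barriers are mixed.
    """
    z = -sharpness * np.asarray(values, dtype=float)
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

# Hypothetical barrier values from three separate safety classifiers.
scores = [0.8, 0.1, 0.5]
merged = soft_min(scores)
weights = soft_min_weights(scores)

print("merged barrier:", merged)
print("gradient weights:", weights)
```

A single merged barrier means one steering step can handle all constraints at once, at the cost of being slightly conservative: the smooth minimum sits below the true minimum by at most log(n) / sharpness for n constraints.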

Sources

arXiv cs.AI — 2026-05-25