Input Embeddings Optimized for Safety in Aligned LLMs
Researchers demonstrate that input word embeddings can be optimized to reduce the semantic harmfulness of responses from aligned language models, which typically produce a bimodal output distribution: they either refuse a request or comply with it. Because the text-moderation API that scores harmfulness is a black box, the authors estimate gradients with zeroth-order methods and apply gradient descent directly to the input embeddings, operating at a sub-lexical level where the optimized vectors need not correspond to vocabulary tokens. This extends prior work on steering pretrained text-completion models via their embeddings, which was limited to reducing surface-level profanity; the study (arXiv 2604.26167) treats safety-aligned models as the natural next step.
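The core primitive is a finite-difference gradient estimate that needs only score queries, with no backpropagation through the API. Below is a minimal sketch under stated assumptions: the hypothetical `score_fn` stands in for the full pipeline (run the aligned model from the embeddings, send the generated text to the moderation API, return a scalar), and the function name, hyperparameters, and dummy scorer are illustrative, not taken from the paper.

```python
import numpy as np

def zeroth_order_grad(score_fn, emb, n_samples=64, eps=1e-3, rng=None):
    """Two-point (SPSA-style) estimate of d(score)/d(embeddings) for a
    black-box scalar scorer: probe random directions, take directional
    finite differences, and average."""
    rng = rng if rng is not None else np.random.default_rng(0)
    grad = np.zeros_like(emb)
    for _ in range(n_samples):
        u = rng.standard_normal(emb.shape)               # random probe direction
        delta = score_fn(emb + eps * u) - score_fn(emb - eps * u)
        grad += (delta / (2.0 * eps)) * u                # slope along u, back-projected
    return grad / n_samples

# Dummy scorer standing in for "aligned model + moderation API":
# harmfulness is lowest at the zero embedding.
harm = lambda e: float(np.sum(e ** 2))
emb = np.ones((4, 8))                                    # (seq_len, embed_dim)
g_hat = zeroth_order_grad(harm, emb)
true_g = 2 * emb
cos = np.sum(g_hat * true_g) / (np.linalg.norm(g_hat) * np.linalg.norm(true_g))
print(f"cosine similarity with true gradient: {cos:.2f}")  # noisy but aligned
```

Each probe costs two scorer calls, so `n_samples` trades gradient noise against query budget; in the paper's black-box setting every call would also involve a full model generation, which makes this trade-off the dominant cost.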
Key facts
- arXiv paper 2604.26167
- Input word embeddings serve as control variables for steering model behavior
- Prior work was demonstrated only on pretrained text-completion models, and only for reducing surface-level profanity
- Aligned models produce bimodal refuse-or-comply output distribution
- Approach uses zeroth-order gradient estimation of a black-box text-moderation API's scores (see the estimator sketch above)
- Gradient descent applied to input embeddings at a sub-lexical level (see the optimization loop after this list)
- Objective is to minimize semantic harmfulness of aligned model responses
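Putting the pieces together, the sketch below runs plain gradient descent on the embedding matrix, re-estimating the gradient with the same two-point probes at every step. Again, `descend_on_embeddings`, the step sizes, and the dummy scorer are illustrative assumptions rather than the paper's exact procedure. Because updates land on continuous embedding vectors instead of discrete tokens, the optimization is sub-lexical: the result need not decode back to vocabulary words.

```python
import numpy as np

def descend_on_embeddings(score_fn, emb, steps=30, lr=0.1,
                          n_samples=64, eps=1e-3, seed=0):
    """Minimize a black-box harmfulness score by gradient descent on the
    input-embedding matrix, using two-point zeroth-order gradient
    estimates in place of true gradients."""
    rng = np.random.default_rng(seed)
    emb = emb.copy()
    for _ in range(steps):
        grad = np.zeros_like(emb)
        for _ in range(n_samples):
            u = rng.standard_normal(emb.shape)
            delta = score_fn(emb + eps * u) - score_fn(emb - eps * u)
            grad += (delta / (2.0 * eps)) * u
        emb -= lr * (grad / n_samples)    # step toward lower harmfulness
    return emb

# Same dummy scorer as above: the score shrinks toward its minimum.
harm = lambda e: float(np.sum(e ** 2))
emb0 = np.ones((4, 8))
emb1 = descend_on_embeddings(harm, emb0)
print(f"harmfulness: {harm(emb0):.2f} -> {harm(emb1):.4f}")
```

In the real setting, the refuse-or-comply bimodality noted above presumably makes the score landscape far noisier than this smooth toy objective, which is part of what distinguishes the aligned-model case from prior profanity-reduction work.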
Entities
Institutions
- arXiv