Input Embeddings Optimized for Safety in Aligned LLMs
Researchers demonstrate that input word embeddings can be optimized to reduce the semantic harmfulness of responses from aligned language models, which typically produce a bimodal output distribution: they either refuse a request or comply with it. Because the text-moderation API that scores harmfulness is a black box, the authors estimate gradients with zeroth-order methods and apply gradient descent directly to the input embeddings, operating at a sub-lexical level where the optimized vectors need not correspond to vocabulary tokens. This extends prior work on steering pretrained text-completion models via their embeddings, which was limited to reducing surface-level profanity; the study (arXiv 2604.26167) treats safety-aligned models as the natural next step.
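The core primitive is a finite-difference gradient estimate that needs only score queries, with no backpropagation through the API. Below is a minimal sketch under stated assumptions: the hypothetical `score_fn` stands in for the full pipeline (run the aligned model from the embeddings, send the generated text to the moderation API, return a scalar), and the function name, hyperparameters, and dummy scorer are illustrative, not taken from the paper.

```python
import numpy as np

def zeroth_order_grad(score_fn, emb, n_samples=64, eps=1e-3, rng=None):
    """Two-point (SPSA-style) estimate of d(score)/d(embeddings) for a
    black-box scalar scorer: probe random directions, take directional
    finite differences, and average."""
    rng = rng if rng is not None else np.random.default_rng(0)
    grad = np.zeros_like(emb)
    for _ in range(n_samples):
        u = rng.standard_normal(emb.shape)               # random probe direction
        delta = score_fn(emb + eps * u) - score_fn(emb - eps * u)
        grad += (delta / (2.0 * eps)) * u                # slope along u, back-projected
    return grad / n_samples

# Dummy scorer standing in for "aligned model + moderation API":
# harmfulness is lowest at the zero embedding.
harm = lambda e: float(np.sum(e ** 2))
emb = np.ones((4, 8))                                    # (seq_len, embed_dim)
g_hat = zeroth_order_grad(harm, emb)
true_g = 2 * emb
cos = np.sum(g_hat * true_g) / (np.linalg.norm(g_hat) * np.linalg.norm(true_g))
print(f"cosine similarity with true gradient: {cos:.2f}")  # noisy but aligned
```

Each probe costs two scorer calls, so `n_samples` trades gradient noise against query budget; in the paper's black-box setting every call would also involve a full model generation, which makes this trade-off the dominant cost.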
Key facts
- arXiv paper 2604.26167
- Input word embeddings serve as control variables for steering model behavior
- Prior work was demonstrated only on pretrained text-completion models, and only for reducing surface-level profanity
- Aligned models produce bimodal refuse-or-comply output distribution
- Approach uses zeroth-order gradient estimation of a black-box text-moderation API's scores (see the estimator sketch above)
- Gradient descent applied to input embeddings at a sub-lexical level (see the optimization loop after this list)
- Objective is to minimize semantic harmfulness of aligned model responses
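Putting the pieces together, the sketch below runs plain gradient descent on the embedding matrix, re-estimating the gradient with the same two-point probes at every step. Again, `descend_on_embeddings`, the step sizes, and the dummy scorer are illustrative assumptions rather than the paper's exact procedure. Because updates land on continuous embedding vectors instead of discrete tokens, the optimization is sub-lexical: the result need not decode back to vocabulary words.

```python
import numpy as np

def descend_on_embeddings(score_fn, emb, steps=30, lr=0.1,
                          n_samples=64, eps=1e-3, seed=0):
    """Minimize a black-box harmfulness score by gradient descent on the
    input-embedding matrix, using two-point zeroth-order gradient
    estimates in place of true gradients."""
    rng = np.random.default_rng(seed)
    emb = emb.copy()
    for _ in range(steps):
        grad = np.zeros_like(emb)
        for _ in range(n_samples):
            u = rng.standard_normal(emb.shape)
            delta = score_fn(emb + eps * u) - score_fn(emb - eps * u)
            grad += (delta / (2.0 * eps)) * u
        emb -= lr * (grad / n_samples)    # step toward lower harmfulness
    return emb

# Same dummy scorer as above: the score shrinks toward its minimum.
harm = lambda e: float(np.sum(e ** 2))
emb0 = np.ones((4, 8))
emb1 = descend_on_embeddings(harm, emb0)
print(f"harmfulness: {harm(emb0):.2f} -> {harm(emb1):.4f}")
```

In the real setting, the refuse-or-comply bimodality noted above presumably makes the score landscape far noisier than this smooth toy objective, which is part of what distinguishes the aligned-model case from prior profanity-reduction work.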
Entities
Institutions
- arXiv