ARTFEED — Contemporary Art Intelligence

Input Embeddings Optimized for Safety in Aligned LLMs

ai-technology · 2026-04-30

Researchers demonstrate that input word embeddings can be optimized to reduce the semantic harmfulness of responses from aligned language models, which typically produce a bimodal refuse-or-comply output distribution. Estimating gradients of a black-box text-moderation API via zeroth-order methods, they apply gradient descent directly on the input embeddings at a sub-lexical level. This extends prior work that steered pretrained text-completion models through their embeddings but was limited to reducing surface-level profanity; the study, published on arXiv (2604.26167), treats safety-aligned models as the natural next step.
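The paper's exact procedure is not reproduced here, but the core mechanism — zeroth-order (SPSA-style) gradient estimation of a black-box harmfulness score, driving gradient descent on a continuous input-embedding vector — can be sketched as follows. The `harmfulness` function below is a toy stand-in for the moderation API, and all names and constants are illustrative assumptions:

```python
import numpy as np

# Toy stand-in for the black-box text-moderation API: maps an input
# embedding to a harmfulness score in (0, 1). In the paper's setting this
# would involve running the aligned model and scoring its response; the
# linear-logistic form here is purely illustrative.
W = np.full(16, 0.25)

def harmfulness(emb: np.ndarray) -> float:
    return float(1.0 / (1.0 + np.exp(-(W @ emb + 1.0))))

def spsa_gradient(f, x, eps=1e-2, n_probes=8, rng=np.random.default_rng(0)):
    """Zeroth-order (SPSA) gradient estimate: only queries f, no backprop."""
    grad = np.zeros_like(x)
    for _ in range(n_probes):
        delta = rng.choice([-1.0, 1.0], size=x.shape)   # Rademacher probe
        grad += (f(x + eps * delta) - f(x - eps * delta)) / (2.0 * eps) * delta
    return grad / n_probes

# Gradient descent directly on a continuous input-embedding vector
# ("sub-lexical": no constraint that it decode back to real tokens).
emb = np.random.default_rng(1).normal(size=16)
score_before = harmfulness(emb)
for _ in range(200):
    emb -= 0.5 * spsa_gradient(harmfulness, emb)
score_after = harmfulness(emb)
```

Because the moderation API is queried as a black box, only function evaluations are available; the two-sided probes trade extra API calls for a lower-variance gradient estimate.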

Key facts

  • arXiv paper 2604.26167
  • Input word embeddings serve as control variables for steering model behavior
  • Prior work demonstrated embedding-based steering only on pretrained text-completion models, and only for reducing profanity
  • Aligned models produce bimodal refuse-or-comply output distribution
  • Approach uses zeroth-order gradient estimation of black-box text-moderation API
  • Gradient descent applied on input embeddings at sub-lexical level
  • Objective is to minimize semantic harmfulness of aligned model responses

Entities

Institutions

  • arXiv