ARTFEED — Contemporary Art Intelligence

Geometry-Lite: Interpreting LLM Safety Probe Geometry

ai-technology · 2026-05-22

Geometry-Lite is a prompt-level probe that examines how large language models (LLMs) separate safe from unsafe prompts at each layer. The method maps each layer's final prompt-token representation to signed margins using three readouts: centroid, local-neighborhood, and supervised linear-boundary. It then summarizes the resulting margin profiles by boundary position, layer-to-layer change, and coarse shape. Evaluated on nine instruction-tuned backbones ranging from 1.2B to 70B parameters and on seven safety benchmarks, Geometry-Lite outperforms single-layer probes while remaining geometrically interpretable. The research examines the gap between strong average detection performance and the underlying separation geometry, how safety evidence builds up across layers, and which geometric biases persist when the benchmark changes. The paper is available on arXiv.
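To make the three readouts concrete, here is a minimal sketch for a single layer, not the paper's implementation. Everything in it is an assumption made for illustration: the function names, the toy 2-D vectors standing in for final prompt-token representations, the sign convention (positive means the unsafe side), the choice of k, and the use of plain logistic regression as the supervised linear boundary.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def centroid_margin(x, safe, unsafe):
    """Distance to the safe centroid minus distance to the unsafe centroid."""
    return dist(x, mean(safe)) - dist(x, mean(unsafe))

def knn_margin(x, safe, unsafe, k=3):
    """Mean label (+1 unsafe, -1 safe) over the k nearest reference points."""
    labelled = [(dist(x, v), 1) for v in unsafe] + [(dist(x, v), -1) for v in safe]
    nearest = sorted(labelled)[:k]
    return sum(label for _, label in nearest) / k

def fit_linear(safe, unsafe, epochs=200, lr=0.1):
    """Logistic regression trained by simple stochastic gradient descent."""
    data = [(v, 1.0) for v in unsafe] + [(v, 0.0) for v in safe]
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        for v, y in data:
            p = 1.0 / (1.0 + math.exp(-(dot(w, v) + b)))
            g = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * vi for wi, vi in zip(w, v)]
            b -= lr * g
    return w, b

def linear_margin(x, w, b):
    """Signed Euclidean distance from x to the hyperplane w.x + b = 0."""
    return (dot(w, x) + b) / math.sqrt(dot(w, w))

# Toy reference sets for one layer (hypothetical data, not from the paper).
SAFE = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
UNSAFE = [[3.0, 3.0], [3.0, 4.0], [4.0, 3.0]]
W, B = fit_linear(SAFE, UNSAFE)

def layer_margins(x):
    """All three signed margins for one prompt representation at one layer."""
    return {
        "centroid": centroid_margin(x, SAFE, UNSAFE),
        "knn": knn_margin(x, SAFE, UNSAFE),
        "linear": linear_margin(x, W, B),
    }

print(layer_margins([3.5, 3.5]))  # a point near the unsafe cluster
print(layer_margins([0.2, 0.2]))  # a point near the safe cluster
```

Under these assumptions, repeating `layer_margins` on every layer's representation would give one margin profile per readout for each prompt.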

Key facts

  • Geometry-Lite is a prompt-level safety probe for LLMs.
  • It maps each layer's final prompt-token representation to signed margins.
  • Readouts include centroid, local-neighborhood, and supervised linear-boundary.
  • Margin profiles summarize boundary position, layer-to-layer change, and coarse shape.
  • Tested on nine instruction-tuned backbones from 1.2B to 70B parameters.
  • Evaluated on seven safety benchmarks.
  • Outperforms single-layer probes.
  • Provides interpretable geometric insights into safety separation.
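The margin-profile summary can be illustrated with a short sketch. The source does not define how boundary position, layer-to-layer change, or coarse shape are computed, so every statistic, threshold, and label below (`summarize_profile`, `flat_tol`, the shape names) is a hypothetical stand-in, and the sign convention (positive means the unsafe side) is likewise assumed.

```python
def summarize_profile(margins, flat_tol=0.05):
    """Summarize one prompt's signed margins, ordered from first to last layer."""
    # Layer-to-layer change: difference between consecutive layers.
    deltas = [b - a for a, b in zip(margins, margins[1:])]
    largest_jump = max(range(len(deltas)), key=lambda i: abs(deltas[i]))

    # Boundary position: first layer whose margin is on the positive side.
    first_positive = next((i for i, m in enumerate(margins) if m > 0), None)

    # Coarse shape: a crude label based on the endpoints and the peak.
    last = len(margins) - 1
    peak = max(range(len(margins)), key=lambda i: margins[i])
    net = margins[last] - margins[0]
    if all(abs(d) <= flat_tol for d in deltas) and abs(net) <= flat_tol:
        shape = "flat"
    elif 0 < peak < last and margins[peak] - max(margins[0], margins[last]) > flat_tol:
        shape = "peaked"
    elif net > 0:
        shape = "rising"
    else:
        shape = "falling"

    return {
        "first_positive_layer": first_positive,
        "largest_jump_after_layer": largest_jump,
        "final_margin": margins[last],
        "shape": shape,
    }

# Hypothetical four-layer profile: evidence for "unsafe" builds with depth.
print(summarize_profile([-0.4, -0.1, 0.3, 0.6]))
```

A profile such as `[0.1, 0.5, 0.1]` would be labelled "peaked" by this rule, which shows how two prompts with a similar final margin can still have different shapes across layers.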

Entities

Institutions

  • arXiv