Mapping Bias Fingerprints in LLMs: Locating Stereotypes in Neural Networks
A new study investigates where stereotypes reside inside large language models, specifically GPT-2 Small and Llama 3.2. Using contrastive analysis, the researchers locate neuron activations and attention heads that encode stereotyped outputs, calling these patterns 'bias fingerprints.' The work offers initial insights toward mitigating harmful societal biases in AI systems.
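To make the idea concrete, here is a minimal sketch of what a contrastive neuron-activation analysis could look like. It is an illustration under assumptions, not the paper's actual procedure: the prompt pair is hypothetical, and residual-stream hidden dimensions are treated loosely as 'neurons'.

```python
# Illustrative sketch only: contrast hidden activations on a minimal prompt
# pair and rank the dimensions that differ most. Real studies use curated
# stereotype benchmarks and more careful neuron definitions (e.g. MLP units).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

def last_token_states(prompt: str) -> torch.Tensor:
    """Hidden states at the final token, stacked as (layers, d_model)."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return torch.stack([h[0, -1, :] for h in out.hidden_states])

# Hypothetical contrastive pair (pronoun swap); assumed for illustration.
h_a = last_token_states("The nurse said that she")
h_b = last_token_states("The nurse said that he")

diff = (h_a - h_b).abs()                 # (layers, d_model)
flat = diff.flatten()
for idx in torch.topk(flat, k=10).indices.tolist():
    layer, dim = divmod(idx, diff.shape[1])
    print(f"layer {layer:2d}  dim {dim:4d}  |delta| = {flat[idx].item():.4f}")
```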
Key facts
- Study focuses on GPT-2 Small and Llama 3.2
- Investigates the internal mechanisms behind stereotype-related activations
- Two complementary analyses: contrastive neuron activations and attention-head patterns (illustrated in the sketches before and after this list)
- Term 'bias fingerprints' used to describe biased neural patterns
- Aims to provide initial insights for mitigating stereotypes
- Published on arXiv under Computer Science > Computation and Language
- Submission history not specified beyond preprint date
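For the attention-head side, a similarly hedged sketch: score each head by how much its attention pattern shifts between a contrastive prompt pair. The pair, the divergence measure (mean absolute difference), and the GPT-2 Small backbone are all assumptions for illustration, not the study's method.

```python
# Illustrative sketch only: flag the attention head whose pattern diverges
# most between a stereotyped and an anti-stereotyped prompt.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_attentions=True)
model.eval()

def attention_maps(prompt: str) -> torch.Tensor:
    """Attention weights stacked as (layers, heads, seq, seq)."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return torch.stack([a[0] for a in out.attentions])

# Hypothetical minimal pair; must tokenize to the same length to compare maps.
a_1 = attention_maps("The doctor said she was tired")
a_2 = attention_maps("The doctor said he was tired")
assert a_1.shape == a_2.shape, "prompt pair must have equal token lengths"

# Per-head divergence: mean absolute difference over each attention matrix.
scores = (a_1 - a_2).abs().mean(dim=(-1, -2))        # (layers, heads)
layer, head = divmod(int(scores.argmax()), scores.shape[1])
print(f"most divergent head: layer {layer}, head {head}, "
      f"score {scores.max().item():.4f}")
```

A heatmap of `scores` over layers and heads would give a quick visual 'fingerprint' of where the pair diverges, which is one plausible reading of the term the authors use.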
Entities
Institutions
- arXiv