Mapping Bias Fingerprints in LLMs: Locating Stereotypes in Neural Networks
A new study investigates where stereotypes reside inside large language models, specifically GPT-2 Small and Llama 3.2. Using contrastive analysis, the researchers locate neuron activations and attention heads that encode stereotyped outputs, calling these patterns 'bias fingerprints.' The work offers initial insights toward mitigating harmful societal biases in AI systems.
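To make the idea concrete, here is a minimal sketch of what a contrastive neuron-activation analysis could look like. It is an illustration under assumptions, not the paper's actual procedure: the prompt pair is hypothetical, and residual-stream hidden dimensions are treated loosely as 'neurons'.

```python
# Illustrative sketch only: contrast hidden activations on a minimal prompt
# pair and rank the dimensions that differ most. Real studies use curated
# stereotype benchmarks and more careful neuron definitions (e.g. MLP units).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

def last_token_states(prompt: str) -> torch.Tensor:
    """Hidden states at the final token, stacked as (layers, d_model)."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return torch.stack([h[0, -1, :] for h in out.hidden_states])

# Hypothetical contrastive pair (pronoun swap); assumed for illustration.
h_a = last_token_states("The nurse said that she")
h_b = last_token_states("The nurse said that he")

diff = (h_a - h_b).abs()                 # (layers, d_model)
flat = diff.flatten()
for idx in torch.topk(flat, k=10).indices.tolist():
    layer, dim = divmod(idx, diff.shape[1])
    print(f"layer {layer:2d}  dim {dim:4d}  |delta| = {flat[idx].item():.4f}")
```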
Key facts
- Study focuses on GPT-2 Small and Llama 3.2
- Investigates the internal mechanisms behind stereotype-related activations
- Two complementary analyses: contrastive neuron activations and attention-head patterns (illustrated in the sketches before and after this list)
- Term 'bias fingerprints' used to describe biased neural patterns
- Aims to provide initial insights for mitigating stereotypes
- Published on arXiv under Computer Science > Computation and Language
- Submission history not specified beyond preprint date
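For the attention-head side, a similarly hedged sketch: score each head by how much its attention pattern shifts between a contrastive prompt pair. The pair, the divergence measure (mean absolute difference), and the GPT-2 Small backbone are all assumptions for illustration, not the study's method.

```python
# Illustrative sketch only: flag the attention head whose pattern diverges
# most between a stereotyped and an anti-stereotyped prompt.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_attentions=True)
model.eval()

def attention_maps(prompt: str) -> torch.Tensor:
    """Attention weights stacked as (layers, heads, seq, seq)."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return torch.stack([a[0] for a in out.attentions])

# Hypothetical minimal pair; must tokenize to the same length to compare maps.
a_1 = attention_maps("The doctor said she was tired")
a_2 = attention_maps("The doctor said he was tired")
assert a_1.shape == a_2.shape, "prompt pair must have equal token lengths"

# Per-head divergence: mean absolute difference over each attention matrix.
scores = (a_1 - a_2).abs().mean(dim=(-1, -2))        # (layers, heads)
layer, head = divmod(int(scores.argmax()), scores.shape[1])
print(f"most divergent head: layer {layer}, head {head}, "
      f"score {scores.max().item():.4f}")
```

A heatmap of `scores` over layers and heads would give a quick visual 'fingerprint' of where the pair diverges, which is one plausible reading of the term the authors use.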
Entities
Institutions
- arXiv