Protein Language Model ESM2-8M Relies on Retrieval, Not Biological Evidence, for Methionine Prediction
A recent study posted on arXiv reports that the ESM2-8M protein language model does not predict methionine at a masked position from biological evidence at that position. Instead, it retrieves a methionine-favoring signal from a reference representation tied to the beginning-of-sequence token. The final prediction emerges through competition with context-dependent circuits, which highlights the difference between a reliable prediction and genuine biological recognition. The study also introduces a norm-direction decomposition of attention scores across rotary frequency bands to explain how positional information reaches the readout, and shows that positional encoding operates through coupled changes in those scores.
Key facts
- Study examines ESM2-8M's prediction that proteins begin with methionine
- Model retrieves signal from beginning-of-sequence token, not from masked position
- Final output emerges through competition with context-dependent circuits
- Researchers introduce norm-direction decomposition of attention scores
- Positional encoding operates through coupled changes in attention scores
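
To make the idea of a norm-direction decomposition concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the study's code: the function names `rope_rotate` and `band_decomposition`, the "rotate-half" pairing of dimensions, the base of 10000, and the random toy vectors are all assumptions chosen for the example, and the usual 1/sqrt(d) scaling is omitted. It shows only the general identity that a rotary attention score can be written as a sum over 2-D frequency bands, where each band contributes (query norm) x (key norm) x (cosine of the angle between them).

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply a rotary position embedding to a 1-D vector of even length.

    Assumes the "rotate-half" layout: dimension i is paired with
    dimension i + d/2, and each pair forms one frequency band.
    """
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)  # one frequency per band
    angles = pos * freqs
    x1, x2 = x[:half], x[half:]
    return np.concatenate([
        x1 * np.cos(angles) - x2 * np.sin(angles),
        x1 * np.sin(angles) + x2 * np.cos(angles),
    ])

def band_decomposition(q, k):
    """Split q.k into per-band norms and a per-band direction (cosine) term."""
    half = q.shape[-1] // 2
    qb = np.stack([q[:half], q[half:]], axis=1)  # shape (bands, 2)
    kb = np.stack([k[:half], k[half:]], axis=1)
    q_norm = np.linalg.norm(qb, axis=1)
    k_norm = np.linalg.norm(kb, axis=1)
    dots = (qb * kb).sum(axis=1)
    denom = q_norm * k_norm
    # Cosine is defined as 0 for a band where either vector is zero.
    cos = np.divide(dots, denom, out=np.zeros_like(dots), where=denom > 0)
    return q_norm, k_norm, cos

# Toy example: a query at position 5 attending to a key at position 0
# (position 0 stands in for the beginning-of-sequence token).
rng = np.random.default_rng(0)
q = rope_rotate(rng.normal(size=8), pos=5)
k = rope_rotate(rng.normal(size=8), pos=0)

q_norm, k_norm, cos = band_decomposition(q, k)
per_band = q_norm * k_norm * cos  # contribution of each frequency band

# The per-band contributions sum back to the unscaled attention score.
print(per_band.sum(), q @ k)
```

Inspecting `q_norm`, `k_norm`, and `cos` band by band separates how much of a score comes from vector magnitudes and how much from their relative direction in each frequency band.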
Entities
Institutions
- arXiv