ARTFEED — Contemporary Art Intelligence

Protein Language Model ESM2-8M Relies on Retrieval, Not Biological Evidence, for Methionine Prediction

other · 2026-05-20

A recent arXiv study reports that the ESM2-8M protein language model does not rely on biological evidence when it predicts methionine at masked positions. Instead, it retrieves a methionine-favoring signal from a reference representation tied to the beginning-of-sequence token. The final prediction emerges through competition with context-dependent circuits, which highlights the gap between a reliable prediction and genuine biological recognition. The study also introduces a norm-direction decomposition of attention scores across rotary frequency bands to explain how positional information reaches the readout, and it shows that positional encoding operates through coupled changes in those scores.
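The study's exact formulation is not reproduced here. As a rough illustration of what a norm-direction decomposition over rotary frequency bands could look like, the sketch below splits a query-key dot product into one term per two-dimensional rotary band, each written as a norm product times a cosine. The function names, the interleaved pairing of dimensions, the base of 10000, and the toy vectors are all assumptions made for illustration, not details taken from the paper or from the ESM2 code.

```python
import math

def rotate(vec, pos, base=10000.0):
    """Apply a rotary position embedding to an even-length vector.

    Assumption: dimensions are paired as (0, 1), (2, 3), ... and band b
    rotates by pos * base ** (-2b / d). Real implementations may pair
    dimensions differently.
    """
    d = len(vec)
    out = []
    for b in range(d // 2):
        theta = pos * base ** (-2.0 * b / d)
        x, y = vec[2 * b], vec[2 * b + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def band_decomposition(q, k):
    """Return one (norm_product, cosine) pair per rotary band.

    Summing norm_product * cosine over all bands recovers the dot product q.k.
    """
    bands = []
    for b in range(len(q) // 2):
        qb = q[2 * b:2 * b + 2]
        kb = k[2 * b:2 * b + 2]
        norm_product = math.hypot(*qb) * math.hypot(*kb)
        dot = qb[0] * kb[0] + qb[1] * kb[1]
        cosine = dot / norm_product if norm_product > 0 else 0.0
        bands.append((norm_product, cosine))
    return bands

# Hypothetical toy vectors: a query at position 5 attending to a key at position 0.
q_rot = rotate([1.0, 0.5, -0.3, 2.0], pos=5)
k_rot = rotate([0.2, -1.0, 0.7, 0.4], pos=0)

for band, (norm_product, cosine) in enumerate(band_decomposition(q_rot, k_rot)):
    print(f"band {band}: norm product {norm_product:.4f}, cosine {cosine:.4f}")
```

Separating each band into a magnitude term and an angular term makes it possible to ask which of the two carries a change in the score when the query position moves, which is the kind of question the summary attributes to the study.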

Key facts

  • Study examines ESM2-8M's prediction that proteins begin with methionine
  • Model retrieves signal from beginning-of-sequence token, not from masked position
  • Final output emerges through competition with context-dependent circuits
  • Researchers introduce norm-direction decomposition of attention scores
  • Positional encoding operates through coupled changes in attention scores

Entities

Institutions

  • arXiv

Sources