AMix-2: A Protein-Text Foundation Model for Unified Biological Understanding and Design

ai-technology · 2026-06-01

Researchers have developed AMix-2, a protein-text foundation model that treats protein sequences as a native modality of large language models (LLMs). The model merges protein understanding and sequence design into one framework, removing the need for separate task-specific models. AMix-2 rests on two principal innovations. The first is a unified formulation that places natural language and protein sequences in a shared token space for biological reasoning and conditional design. The second is a block-wise diffusion language modeling backbone that combines causal generation across blocks with bidirectional context and iterative refinement, a design that aligns more closely with the inherent characteristics of proteins. The team also introduces ProteinArena, a comprehensive benchmark for assessing protein foundation models under realistic generalization scenarios. The work is available as a preprint on arXiv (ID: 2605.30963).
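To make the block-wise idea concrete, here is a minimal sketch of an attention mask that is causal across blocks and bidirectional inside each block. It is an illustration only, not the AMix-2 implementation: the function name `block_causal_mask`, the fixed `block_size`, and the assumption that bidirectional context applies within a block are all hypothetical choices made for this example.

```python
def block_causal_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Assumption (not from the paper): tokens are grouped into fixed-size
    blocks. A token sees every token in its own block (bidirectional)
    and every token in earlier blocks (causal across blocks).
    """
    if seq_len < 0 or block_size <= 0:
        raise ValueError("seq_len must be >= 0 and block_size must be > 0")
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Six positions in two blocks of three.
mask = block_causal_mask(seq_len=6, block_size=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In this sketch, positions in the first block see only each other, while positions in the second block see the whole sequence. An ordinary token-level causal mask is the special case `block_size=1`.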

Key facts

AMix-2 is a protein-text foundation model.
It establishes protein as a native modality in LLMs.
The model unifies protein understanding and sequence design.
It uses a unified protein-text formulation with shared token space.
A block-wise diffusion language modeling backbone is employed.
ProteinArena is a new benchmark for protein foundation models.
ProteinArena includes time-aware and homology-aware protocols.
The preprint is available on arXiv (ID: 2605.30963).
