Ensembits: First Tokenizer for Protein Conformational Ensembles
Researchers introduced Ensembits, the first tokenizer designed for protein conformational ensembles. Existing protein structure tokenizers (PSTs) capture only the local geometry of static structures, whereas Ensembits handles the correlated motions and alternative states found in molecular dynamics data. It uses a Residual VQ-VAE trained with a frame distillation objective on a large molecular dynamics corpus. The method outperforms related approaches on root-mean-square fluctuation (RMSF) prediction and matches or exceeds static tokenizers on motion amplitude analysis.
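For readers unfamiliar with the RMSF metric mentioned above, the following is a minimal sketch of how per-residue RMSF is commonly computed from a trajectory. It is not taken from the Ensembits work: the function name `rmsf` and the input layout (frames x residues x 3 coordinates, already superimposed on a common reference) are assumptions made for illustration.

```python
import numpy as np

def rmsf(traj):
    """Per-residue root-mean-square fluctuation.

    traj: array of shape (n_frames, n_residues, 3).
    Assumption: frames are already aligned to a common reference,
    so no superposition is performed here.
    """
    traj = np.asarray(traj, dtype=float)
    # Mean position of each residue over all frames: (n_residues, 3)
    mean_pos = traj.mean(axis=0)
    # Squared displacement from the mean per frame and residue: (n_frames, n_residues)
    sq_dev = ((traj - mean_pos) ** 2).sum(axis=-1)
    # Average over frames, then take the square root: (n_residues,)
    return np.sqrt(sq_dev.mean(axis=0))

# Toy trajectory: residue 0 moves between two positions, residue 1 is static.
toy = np.array([
    [[0.0, 0.0, 0.0], [5.0, 5.0, 5.0]],
    [[2.0, 0.0, 0.0], [5.0, 5.0, 5.0]],
])
print(rmsf(toy))
```

A mobile residue gets a larger value than a rigid one, which is why RMSF is a natural target for judging whether tokens carry information about motion.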
Key facts
- Ensembits is the first tokenizer of protein conformational ensembles.
- Existing PSTs only capture local geometry of static structures.
- Ensembits addresses three challenges: deriving geometric descriptors across conformations, encoding them in a permutation-invariant way, and overcoming sparsity.
- Trained with a Residual VQ-VAE using a frame distillation objective on a large molecular dynamics corpus.
- Outperforms all related methods on RMSF prediction.
- Strongest standalone structural tokenizer on a token-conditioned ANOVA test for per-residue motion amplitude.
- Matches or exceeds static tokenizers on motion amplitude analysis.
- Published on arXiv with ID 2605.13789.
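The Residual VQ-VAE named in the key facts relies on residual vector quantization, in which each codebook quantizes whatever the previous codebooks left unexplained. The sketch below shows only that generic quantization step, assuming plain nearest-neighbour lookup with Euclidean distance. It is not the Ensembits implementation: the function `residual_quantize`, the toy codebooks, and the two-level depth are hypothetical, and the encoder, decoder, and frame distillation objective are omitted.

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Generic residual vector quantization (illustrative, not Ensembits' code).

    x: vector of shape (d,).
    codebooks: list of arrays, each of shape (n_codes, d).
    Returns the chosen code index per level and the summed reconstruction.
    """
    residual = np.asarray(x, dtype=float).copy()
    recon = np.zeros_like(residual)
    indices = []
    for cb in codebooks:
        # Nearest code to the current residual under Euclidean distance
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        # Add the chosen code to the reconstruction, then remove it from the residual
        recon += cb[idx]
        residual -= cb[idx]
    return indices, recon

# Hypothetical two-level codebooks: a coarse level and a fine correction level.
codebooks = [
    np.array([[0.0, 0.0], [1.0, 1.0]]),
    np.array([[0.0, 0.0], [0.25, -0.25]]),
]
tokens, approx = residual_quantize(np.array([1.2, 0.8]), codebooks)
print(tokens, approx)
```

Each input is thus represented by a short sequence of discrete indices, one per level, with later levels refining the coarse approximation of earlier ones.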
Entities
Institutions
- arXiv