๐ฎ
๐ฎ
The Ethereal
Viral Proteins Reveal Geometry of Protein Language Models
June 10, 2026 ยท Grace Period ยท ๐ ICML 2026 GenBio Workshop and FM4LS Workshop
Authors
Arthur Bigot, Harmon Bhasin, Core Francisco Park, Eugene Shakhnovich, Dianzhuo Wang
arXiv ID
2606.12609
Category
cs.LG: Machine Learning
Cross-listed
q-bio.QM
Citations
0
Venue
ICML 2026 GenBio Workshop and FM4LS Workshop
Abstract
Protein language models are trained on highly imbalanced datasets, raising the question of how they represent underrepresented biological sequences. Using viral proteins as a case study across ESM model families, we identify a dominant nativeness axis in embedding space, aligned with masked reconstruction perplexity, that orders sequences from well-modeled cellular proteins through viral proteins to shuffled and random sequences. Scaling contracts this axis unevenly across viral families. Despite this, protein language model embeddings retain viral-specific signal: viral proteins remain linearly separable beyond zero-shot perplexity and shallow sequence features. Together, these results suggest that pLM representations are structured by a general notion of nativeness while preserving information specific to distinct biological groups.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
๐ Similar Papers
In the same crypt โ Machine Learning
๐ฎ
๐ฎ
The Ethereal
Continuous control with deep reinforcement learning
๐
๐
Old Age
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
๐
๐
Old Age
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
๐
๐
Old Age
SGDR: Stochastic Gradient Descent with Warm Restarts
๐ฎ
๐ฎ
The Ethereal