ProCyon: A multimodal foundation model for protein phenotypes.
| Authors | |
| Abstract | Characterizing human proteins remains a major challenge: approximately 29% of human proteins lack experimentally validated functions and even well-annotated proteins often lack context-specific phenotypic insights. To enable universal modeling of protein phenotypes, we present ProCyon, a multimodal foundation model that utilizes protein sequence, structure, and natural language for generating and predicting protein phenotypes across diverse knowledge domains. ProCyon is trained on our novel dataset, ProCyon-Instruct, with 33 million protein phenotype instructions. On dozens of benchmarking tasks, ProCyon performs competitively against single-modal and multimodal models. Further, ProCyon conditionally retrieves proteins via mechanisms of action of small molecule drugs and disease contexts, and it generates candidate phenotypic descriptions for poorly characterized proteins, including those implicated in Parkinson's disease that were identified after ProCyon's knowledge cutoff date. We experimentally confirm ProCyon's predictions in multiple sclerosis using post-mortem brain RNA-seq, identifying novel MS genes and elucidating associated pathway mechanisms consistent with cortical pathology. ProCyon paves the way toward a general approach to generate functional insights into the human proteome. |
| Year of Publication | 2025
|
| Journal | bioRxiv : the preprint server for biology
|
| Date Published | 11/2025
|
| ISSN | 2692-8205
|
| DOI | 10.1101/2024.12.10.627665
|
| PubMed ID | 41279541
|
| Links |