PMCID
PMC12632626

ProCyon: A multimodal foundation model for protein phenotypes.

bioRxiv : the preprint server for biology
Abstract

Characterizing human proteins remains a major challenge: approximately 29% of human proteins lack experimentally validated functions and even well-annotated proteins often lack context-specific phenotypic insights. To enable universal modeling of protein phenotypes, we present ProCyon, a multimodal foundation model that utilizes protein sequence, structure, and natural language for generating and predicting protein phenotypes across diverse knowledge domains. ProCyon is trained on our novel dataset, ProCyon-Instruct, with 33 million protein phenotype instructions. On dozens of benchmarking tasks, ProCyon performs competitively against single-modal and multimodal models. Further, ProCyon conditionally retrieves proteins via mechanisms of action of small molecule drugs and disease contexts, and it generates candidate phenotypic descriptions for poorly characterized proteins, including those implicated in Parkinson's disease that were identified after ProCyon's knowledge cutoff date. We experimentally confirm ProCyon's predictions in multiple sclerosis using post-mortem brain RNA-seq, identifying novel MS genes and elucidating associated pathway mechanisms consistent with cortical pathology. ProCyon paves the way toward a general approach to generate functional insights into the human proteome.

Year of Publication
2025
Journal
bioRxiv : the preprint server for biology
Date Published
11/2025
ISSN
2692-8205
DOI
10.1101/2024.12.10.627665
PubMed ID
41279541