Protein language models uncover carbohydrate-active enzyme function in metagenomics.

BMC Bioinformatics
Abstract

BACKGROUND: The functional annotation of uncharacterized microbial enzymes from metagenomic data remains a significant challenge, limiting our understanding of microbial metabolic dynamics. Traditional annotation methods often rely on sequence homology, which can fail to identify remote homologs or enzymes with structural rather than sequence conservation. To address this gap, we developed CAZyLingua, the first annotation tool to use protein language models (pLMs) for the accurate classification of carbohydrate-active enzyme (CAZyme) families and subfamilies.

RESULTS: CAZyLingua demonstrated high performance, maintaining precision and recall comparable to state-of-the-art hidden Markov model-based methods while outperforming purely sequence-based approaches. When applied to a metagenomic gene catalog from mother/infant pairs, CAZyLingua identified over 27,000 putative CAZymes missed by other tools, including horizontally transferred enzymes implicated in infant microbiome development. In datasets from patients with Crohn's disease and IgG4-related disease, CAZyLingua uncovered disease-associated CAZymes, highlighting an expansion of carbohydrate esterases (CEs) in IgG4-related disease. A CE17 enzyme predicted to be overabundant in Crohn's disease was functionally validated, confirming its catalytic activity on acetylated manno-oligosaccharides.

CONCLUSIONS: CAZyLingua is a powerful tool that effectively augments existing functional annotation pipelines for CAZymes. By leveraging the deep contextual information captured by pLMs, our method can uncover novel CAZyme diversity and reveal enzymatic functions relevant to health and disease, contributing to a further understanding of biological processes related to host health and nutrition.

Year of Publication
2025
Journal
BMC Bioinformatics
Volume
26
Issue
1
Pages
285
Date Published
11/2025
ISSN
1471-2105
DOI
10.1186/s12859-025-06286-y
PubMed ID
41299229