Natural language processing of gene descriptions for overrepresentation analysis with GeneTEA.

Genome biology

Authors	Isabella Boyle, Nayeem Aquib, Mustafa Kocak, Randy Creasi, Philip Montgomery, Catarina Campbell, Joshua Dempster
Keywords	Gene sets, Natural language processing, Overrepresentation analysis
Abstract	Overrepresentation analysis is used to identify biological enrichment in a list of genes. Here, we introduce GeneTEA, a model that ingests free-text gene descriptions and incorporates natural language processing methods to learn a sparse gene-by-term embedding, which can be treated as a de novo gene set database. In benchmarks against existing overrepresentation analysis tools, only GeneTEA properly controls false discovery while consistently surfacing the most relevant biology, doing so with less redundancy. We show that the same approach can be applied to other organisms' genomes or compounds. Furthermore, we provide an interactive app and API for the trained GeneTEA model.
Year of Publication	2025
Journal	Genome biology
Volume	26
Issue	1
Pages	376
Date Published	10/2025
ISSN	1474-760X
DOI	10.1186/s13059-025-03844-8
PubMed ID	41168869