Natural language processing of gene descriptions for overrepresentation analysis with GeneTEA.

Genome Biology
Abstract

Overrepresentation analysis is used to identify biological enrichment in a list of genes. Here, we introduce GeneTEA, a model that ingests free-text gene descriptions and incorporates natural language processing methods to learn a sparse gene-by-term embedding, which can be treated as a de novo gene set database. In benchmarks against existing overrepresentation analysis tools, only GeneTEA properly controls false discovery while consistently surfacing the most relevant biology, doing so with less redundancy. We show that the same approach can be applied to other organisms' genomes or compounds. Furthermore, we provide an interactive app and API for the trained GeneTEA model.
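The idea summarized above (turn free-text gene descriptions into term-level gene sets, then test a query list for overrepresented terms while controlling false discovery) can be illustrated with a minimal sketch. This is not GeneTEA's implementation: the toy descriptions, the whitespace tokenizer, the stopword list, and all function names are hypothetical, and a plain hypergeometric test with Benjamini-Hochberg correction stands in for whatever model and statistics the paper actually uses.

```python
from math import comb

# Toy, made-up gene descriptions (hypothetical; not GeneTEA's corpus).
DESCRIPTIONS = {
    "G1": "kinase involved in dna repair",
    "G2": "dna repair helicase",
    "G3": "ribosomal protein of the large subunit",
    "G4": "dna repair nuclease",
    "G5": "ribosomal protein of the small subunit",
    "G6": "membrane transporter for glucose",
    "G7": "transcription factor kinase target",
    "G8": "glucose metabolism enzyme",
}

# Assumed stopword list; a real pipeline would use far richer NLP.
STOPWORDS = {"of", "the", "for", "in", "involved"}


def build_term_sets(descriptions):
    """Map each term to the set of genes whose description mentions it."""
    term_to_genes = {}
    for gene, text in descriptions.items():
        for token in set(text.lower().split()):
            if token not in STOPWORDS:
                term_to_genes.setdefault(token, set()).add(gene)
    return term_to_genes


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n genes from N, K of which carry the term."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def enrich(query, descriptions):
    """Return (term, p-value, FDR) rows sorted by p-value."""
    term_to_genes = build_term_sets(descriptions)
    universe = set(descriptions)
    query = set(query) & universe
    terms, pvals = [], []
    for term, genes in sorted(term_to_genes.items()):
        overlap = len(genes & query)
        if overlap == 0:
            continue  # only test terms that appear in the query
        terms.append(term)
        pvals.append(hypergeom_sf(overlap, len(universe), len(genes), len(query)))
    fdr = benjamini_hochberg(pvals)
    return sorted(zip(terms, pvals, fdr), key=lambda row: row[1])


results = enrich(["G1", "G2", "G4"], DESCRIPTIONS)
for term, p, q in results:
    print(f"{term}\tp={p:.4f}\tFDR={q:.4f}")
```

In this toy setting, the terms shared by all three query genes rank first. Note that "dna" and "repair" are reported as two separate hits even though they describe the same biology, which is the kind of redundancy the abstract says GeneTEA reduces; this sketch makes no attempt to address it.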

Year of Publication
2025
Journal
Genome Biology
Volume
26
Issue
1
Pages
376
Date Published
10/2025
ISSN
1474-760X
DOI
10.1186/s13059-025-03844-8
PubMed ID
41168869