Natural language processing of gene descriptions for overrepresentation analysis with GeneTEA.
Genome Biology
| Authors | |
| Keywords | |
| Abstract | Overrepresentation analysis is used to identify biological enrichment in a list of genes. Here, we introduce GeneTEA, a model that ingests free-text gene descriptions and incorporates natural language processing methods to learn a sparse gene-by-term embedding, which can be treated as a de novo gene set database. In benchmarks against existing overrepresentation analysis tools, only GeneTEA properly controls false discovery while consistently surfacing the most relevant biology, doing so with less redundancy. We show that the same approach can be applied to other organisms' genomes or compounds. Furthermore, we provide an interactive app and API for the trained GeneTEA model. |
| Year of Publication | 2025 |
| Journal | Genome Biology |
| Volume | 26 |
| Issue | 1 |
| Pages | 376 |
| Date Published | 10/2025 |
| ISSN | 1474-760X |
| DOI | 10.1186/s13059-025-03844-8 |
| PubMed ID | 41168869 |
| Links |