Multiple-instance learning of somatic mutations for the classification of tumour type and the prediction of microsatellite status.

Nature Biomedical Engineering
Authors
Abstract

Large-scale genomic data are well suited to analysis by deep learning algorithms. However, for many genomic datasets, labels are at the level of the sample rather than for individual genomic measures. Machine learning models leveraging these datasets generate predictions by using statically encoded measures that are then aggregated at the sample level. Here we show that a single weakly supervised end-to-end multiple-instance-learning model with multi-headed attention can be trained to encode and aggregate the local sequence context or genomic position of somatic mutations, hence allowing for the modelling of the importance of individual measures for sample-level classification and thus providing enhanced explainability. The model solves synthetic tasks that conventional models fail at, and achieves best-in-class performance for the classification of tumour type and for predicting microsatellite status. By improving the performance of tasks that require aggregate information from genomic datasets, multiple-instance deep learning may generate biological insight.
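The aggregation step described in the abstract can be illustrated with a minimal sketch of attention-based multiple-instance pooling. This is not the authors' implementation: the function name `attention_mil_pool`, the tanh encoder, the weight matrices `W_enc` and `W_att`, and all dimensions are assumptions chosen for illustration, and the weights are random rather than trained. Each sample is treated as a bag of instances (for example, one feature vector per somatic mutation). Each attention head assigns a weight to every instance, and the weighted sums form a single sample-level vector.

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_mil_pool(instances, W_enc, W_att):
    """Pool a bag of instances into one sample-level vector.

    instances: (n_instances, n_features) array, one row per instance
    W_enc:     (n_features, d) hypothetical encoder weights
    W_att:     (d, n_heads) hypothetical attention weights, one column per head
    Returns the pooled vector of length n_heads * d and the
    (n_instances, n_heads) attention weights.
    """
    encoded = np.tanh(instances @ W_enc)   # (n_instances, d) instance encodings
    scores = encoded @ W_att               # (n_instances, n_heads) raw scores
    weights = softmax(scores, axis=0)      # each head sums to 1 over instances
    pooled = weights.T @ encoded           # (n_heads, d) weighted sums per head
    return pooled.reshape(-1), weights

# Toy bag: 5 instances with 8 features each (random, illustrative only).
rng = np.random.default_rng(0)
bag = rng.normal(size=(5, 8))
W_enc = rng.normal(size=(8, 4))
W_att = rng.normal(size=(4, 2))

sample_vector, weights = attention_mil_pool(bag, W_enc, W_att)
```

In this sketch, `weights` plays the role of the per-instance importance that the abstract links to explainability, and the pooled vector does not depend on the order of instances in the bag. A trained model would learn the encoder and attention parameters jointly from sample-level labels.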

Year of Publication
2023
Journal
Nature Biomedical Engineering
Date Published
11/2023
ISSN
2157-846X
DOI
10.1038/s41551-023-01120-3
PubMed ID
37919367