Meaningful choice/curation of pre-training data in alignment with a downstream task
Stanley Hua
University of Toronto
Motivation: A lack of labeled data is a common problem when applying computer vision to biomedical imaging. A common remedy is transfer learning, in which a model is first pre-trained on a separate (and sometimes unrelated) dataset so that it starts from a good initialization whose features generalize to the new task. Most pre-training approaches in biomedical computer vision rely on ImageNet or other natural-image datasets. However, given the large domain shift, we hypothesized that transfer from natural images may not be the most effective approach for transfer learning on biomedical images. Here, we sought to understand whether curating and pre-training on microscopy images can yield better performance on downstream microscopy analysis tasks. We present CytoImageNet, a large-scale dataset of openly sourced and weakly labeled microscopy images (890K images, 894 classes). Intriguingly, while models pre-trained on CytoImageNet do not surpass the performance of ImageNet-pretrained models, we show that fusing their features improves performance on unseen microscopy images across the board, suggesting that CytoImageNet features capture information not available in ImageNet-trained features. Our work highlights the potential of meaningfully curating domain-relevant datasets to learn domain-relevant features. The CytoImageNet dataset is made available at .
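The feature-fusion idea can be illustrated with a minimal sketch: extract features from an ImageNet-pretrained encoder and a CytoImageNet-pretrained encoder, then concatenate them before a downstream evaluator (e.g., a linear probe or kNN). The backbone choice (ResNet-18), the checkpoint path, and the probe setup below are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of late feature fusion between an ImageNet-pretrained encoder and a
# CytoImageNet-pretrained encoder. ResNet-18 and the checkpoint path are
# assumptions for illustration only.
import torch
import torch.nn as nn
from torchvision import models

def make_feature_extractor(weights=None):
    """ResNet-18 with the classification head removed (outputs 512-d features)."""
    backbone = models.resnet18(weights=weights)
    backbone.fc = nn.Identity()  # drop the classifier head, keep pooled features
    return backbone.eval()

# Encoder A: ImageNet-pretrained features.
imagenet_encoder = make_feature_extractor(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Encoder B: stands in for a CytoImageNet-pretrained encoder; in practice its
# weights would be loaded from a checkpoint, e.g.:
# cyto_encoder.load_state_dict(torch.load("cytoimagenet_resnet18.pt"))  # hypothetical path
cyto_encoder = make_feature_extractor(weights=None)

@torch.no_grad()
def fused_features(x):
    """Concatenate features from both encoders for downstream evaluation."""
    return torch.cat([imagenet_encoder(x), cyto_encoder(x)], dim=1)  # shape (N, 1024)

# Example: a batch of 4 RGB microscopy crops resized to 224x224.
batch = torch.randn(4, 3, 224, 224)
print(fused_features(batch).shape)  # torch.Size([4, 1024])
```

The fused representation is simply the concatenation of the two feature vectors, so any downstream classifier trained on it can draw on both natural-image and microscopy-specific features.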