

U Penn School of Medicine
Gone Fishing: Unsupervised methods for discovery from public data

Abstract:  Public gene expression data are abundant. Anybody with an internet connection can download more than 2 million genome-wide assays of gene expression. Learning from these data remains challenging. For example, public data often lack the annotations that enable traditional meta-analysis. If we could surmount these barriers, however, we'd have a valuable resource at our fingertips. Our lab uses machine learning methods to integrate these heterogeneous, noisy, and often poorly or incorrectly annotated data. We focus specifically on algorithms that are unsupervised and robust to noise in order to tackle unannotated data. We've shown that these algorithms can robustly reveal biological features in data from cancer biopsies to microbial systems. And we share these algorithms by building user-friendly software and web servers. Our aim is to make the reproducible analysis of big public data as routine in life sciences labs as wet-bench techniques like PCR.


Greene Lab
Primer: Integrating biomedical knowledge to predict new uses for existing drugs

Abstract:  How do you teach a computer biology? Our goal was to predict new uses for existing drugs. But we're data scientists, not pharmacologists. So we set out to encode the knowledge from millions of biomedical studies from the last half century. Using a heterogeneous network (hetnet) as our data structure, we were able to condense a large portion of biomedical knowledge into a network with 47,031 nodes of 11 types and 2,250,197 relationships of 24 types. The network is named Hetionet v1.0 and lives at .
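To make the data structure concrete, here is a minimal sketch of a hetnet in Python: a graph in which every node and every relationship carries a type. The `Hetnet` class, its method names, and the placeholder nodes (`compound_A`, `gene_1`, `disease_X`) are all invented for illustration and are not taken from Hetionet or its tooling.

```python
from collections import Counter, defaultdict


class Hetnet:
    """Toy heterogeneous network: typed nodes joined by typed, undirected edges."""

    def __init__(self):
        self.node_type = {}            # node id -> node type
        self.edges = defaultdict(set)  # (node id, edge type) -> neighbor ids

    def add_node(self, node_id, node_type):
        self.node_type[node_id] = node_type

    def add_edge(self, source, edge_type, target):
        # Refuse edges that mention nodes we have not typed yet.
        for node_id in (source, target):
            if node_id not in self.node_type:
                raise KeyError(f"unknown node: {node_id}")
        # Store both directions so the edge can be traversed either way.
        self.edges[(source, edge_type)].add(target)
        self.edges[(target, edge_type)].add(source)

    def neighbors(self, node_id, edge_type):
        return self.edges.get((node_id, edge_type), set())


# Hypothetical placeholder data, not real biomedical facts.
net = Hetnet()
net.add_node("compound_A", "Compound")
net.add_node("gene_1", "Gene")
net.add_node("gene_2", "Gene")
net.add_node("disease_X", "Disease")
net.add_edge("compound_A", "binds", "gene_1")
net.add_edge("gene_1", "associates", "disease_X")
net.add_edge("gene_2", "associates", "disease_X")

# Tally how many nodes exist of each type.
type_counts = Counter(net.node_type.values())
print(dict(type_counts))
print(sorted(net.neighbors("disease_X", "associates")))
```

Keying the adjacency map on both the node and the edge type keeps lookups restricted to one kind of relationship, which is the property that distinguishes a hetnet from an untyped graph.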

Hetionet enables queries that span many types of information. While such queries were possible before Hetionet, they often took months of data integration, preprocessing, and specialized query scripts. Now complex queries can be written in minutes using the Cypher query language for hetnets. Accordingly, we were able to perform ~47 million queries to assess the connectivity between 136 diseases and 1,538 compounds. Next, we compiled a catalog of 755 disease-modifying treatments and learned which types of network paths could predict whether a compound treats a disease. In total, we predicted probabilities of treatment for 209,168 compound-disease pairs. Our method also allows you to compare which types of information were valuable for predicting drug efficacy. Project Rephetio, the codename for this project, was performed openly online in real time. In total, 40 community members provided feedback across 86 project discussions.
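The idea of measuring connectivity along a given type of path can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the project's Cypher queries or its actual features: the edge list, the edge type names (`binds`, `associates`), and the helper functions `build_adjacency` and `count_paths` are hypothetical, and the score is a plain count of type-constrained walks.

```python
from collections import defaultdict

# Hypothetical placeholder edges: (source, edge type, target), undirected.
EDGES = [
    ("compound_A", "binds", "gene_1"),
    ("compound_A", "binds", "gene_2"),
    ("compound_B", "binds", "gene_2"),
    ("gene_1", "associates", "disease_X"),
    ("gene_2", "associates", "disease_X"),
    ("gene_2", "associates", "disease_Y"),
]


def build_adjacency(edges):
    """Index neighbors by (node, edge type), storing both directions."""
    adjacency = defaultdict(set)
    for source, edge_type, target in edges:
        adjacency[(source, edge_type)].add(target)
        adjacency[(target, edge_type)].add(source)
    return adjacency


def count_paths(adjacency, source, target, edge_types):
    """Count walks from source to target whose edges follow edge_types in order.

    Note: this counts walks, so a node could in principle repeat on longer
    patterns; for the two-step pattern below no repeats are possible.
    """
    frontier = {source: 1}  # node -> number of walks reaching it so far
    for edge_type in edge_types:
        next_frontier = defaultdict(int)
        for node, n_walks in frontier.items():
            for neighbor in adjacency.get((node, edge_type), ()):
                next_frontier[neighbor] += n_walks
        frontier = next_frontier
    return frontier.get(target, 0)


adjacency = build_adjacency(EDGES)
pattern = ["binds", "associates"]  # Compound -> Gene -> Disease

# One connectivity score per compound-disease pair for this path type.
scores = {
    (compound, disease): count_paths(adjacency, compound, disease, pattern)
    for compound in ("compound_A", "compound_B")
    for disease in ("disease_X", "disease_Y")
}
for pair, score in sorted(scores.items()):
    print(pair, score)
```

Repeating this for several path types gives each compound-disease pair a row of scores, which is the kind of table a classifier could then be trained on against a catalog of known treatments.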

Attend the primer to learn more about Project Rephetio & Hetionet as well as hetnets for data integration and the Neo4j graph database. Research continues as a set of open source GitHub repositories, allowing anyone interested to get involved.