ML4H builds AI that works the way medicine does: across modalities, over time, and grounded in real patient data. Our projects span the full arc from foundational models and biological discovery to clinical risk prediction and disease monitoring, always with an eye toward what will matter in the clinic.
Our research is built on large-scale, carefully curated clinical data resources that enable NLP-driven phenotyping, disease prediction, and biological discovery across the full ML4H portfolio.
The Community Care Cohort Project (C3PO)
Longitudinal, high-resolution clinical data for over 500,000 individuals, with ~80 billion tokens of clinical text. Used across ML4H projects for NLP-driven phenotyping and disease discovery.
Publication(s):
PADME: The Pregnancy Cohort
A multi-institutional EHR cohort of over 56,000 pregnancies with longitudinal cardiovascular outcomes, built to study cardiometabolic risk and cardiovascular complications during pregnancy.
Publication(s):
ML4H develops new machine learning methods, not just applications, that advance the state of the art in learning from complex biomedical data.
Patient Contrastive Learning Representations (PCLR)
A transfer learning approach for ECGs that outperforms supervised models trained from scratch in low-data settings, making high-quality ECG modeling broadly accessible.
Publication(s):
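The summary above does not spell out PCLR's training objective. As an illustration only, the sketch below shows one common way to do patient-level contrastive learning: two ECG embeddings from the same patient form a positive pair, and embeddings from other patients in the batch serve as negatives. The function names, the temperature value, and the one-directional InfoNCE form are assumptions made for this example, not details taken from the PCLR publication.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def patient_contrastive_loss(anchors, positives, temperature=0.1):
    """One-directional InfoNCE-style loss (illustrative, hypothetical helper).

    anchors[i] and positives[i] are embeddings of two different ECGs from
    patient i; positives[j] with j != i act as negatives for anchors[i].
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [cosine(anchors[i], positives[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the same-patient embedding as the correct class.
        total += log_denom - logits[i]
    return total / n


# Toy batch of two patients with 2-D embeddings (made-up numbers).
anchors = [[1.0, 0.0], [0.0, 1.0]]
matched = [[0.9, 0.1], [0.1, 0.9]]   # same-patient pairs are aligned
swapped = [[0.1, 0.9], [0.9, 0.1]]   # pairs deliberately misaligned
print(patient_contrastive_loss(anchors, matched))
print(patient_contrastive_loss(anchors, swapped))
```

The loss is small when same-patient embeddings are closer to each other than to other patients' embeddings, which is the property an encoder pretrained this way would be optimized for before fine-tuning on a small labeled set.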
ECG-MRI Cross-Modal Autoencoder (DropFuse)
A framework integrating ECGs and cardiac MRIs into a unified cardiovascular representation, improving phenotype prediction and enabling data imputation when one modality is missing.
Publication(s):
Registered Autoencoders for GWAS and PheWAS
Autoencoders built across multiple imaging modalities and cohorts using anatomical atlas registration, greatly increasing the number of genetic and phenotypic associations detected in the learned latent space.
Publication(s):
Representation Fusion for Multimodal Diagnostics
A framework for decoupling diagnostic information across modalities in multimodal learning, enabling more robust representations from combined clinical data.
Publication(s):
xMADD: Diffusion Models for Medical Image and Waveform Synthesis
A unified diffusion framework for conditioned synthesis of medical images and ECG waveforms, supporting data augmentation and privacy-preserving data generation.
Publication(s):
The Latentverse: Benchmarking Latent Representations
An open-source toolkit for evaluating the quality and clinical utility of latent representations learned from biomedical data across tasks and modalities.
Publication(s):
ML4H uses deep learning on cardiac MRI, echocardiography, and abdominal imaging to extract phenotypes at a scale impossible with manual methods, linking them to clinical outcomes and genetic architecture.
Echocardiogram Phenotyping
A segmentation-free deep learning model extracts measures of cardiac structure and function from echocardiogram videos; these measures are strongly correlated with future clinical outcomes.
Publication(s):
Left Atrial Structure and Atrial Fibrillation Risk
Deep learning characterizes left atrial structure and function across the UK Biobank, revealing strong links to atrial fibrillation risk.
Publication(s):
Left Ventricular Mass: Imaging and ECG
Deep learning models measure left ventricular mass from cardiac MRI and predict it directly from 12-lead ECGs, providing non-invasive access to a key marker of cardiac health.
Publication(s):
Right Heart Structure and Genetics
Analysis of cardiac MRI in over 40,000 individuals identifies the genetic determinants of right heart structure and heritable drivers of right-sided cardiac disease.
Publication(s):
Mitral Valve Prolapse Detection from Echocardiogram
A deep learning model identifies mitral valve prolapse from echocardiographic images, offering a scalable approach to a commonly missed condition.
Publication(s):
Phenotyping Cardiac Fibrosis
An AI model quantifies myocardial fibrosis from T1-map cardiac MRI. The resulting fibrosis scores correlate with disease risk and are associated with genetic loci that point to potential therapeutic pathways.
Publication(s):
Papillary Muscle Segmentation and Fibrosis
Deep learning segments papillary muscles from cardiac MRI T1 maps and quantifies fibrosis, identifying associations with cardiovascular disease and distinct GWAS loci.
Multi-Organ Fibrosis: Shared and Organ-Specific Pathways
Using abdominal MRI T1 maps, ML4H quantifies fibrosis across liver, kidney, and pancreas, identifying shared genetic pathways and linking multi-organ fibrosis to all-cause mortality.
Publication(s):
MRI-Based Fat Distribution and Cardiometabolic Risk
Deep learning measures visceral, subcutaneous, and gluteofemoral fat from MRI, showing that fat distribution — more than BMI — drives diabetes and coronary artery disease risk.
Publication(s):
Body Fat Distribution from Silhouette Images
A model estimates body fat distribution from standard silhouette images, enabling low-cost cardiometabolic phenotyping at scale.
Publication(s):
Aortic Dimensions, Genetics, and Disease Risk
Analysis of 2M+ cardiac MRI images identifies ~80 GWAS loci for aortic diameter and builds polygenic risk scores for aortic aneurysm, stenosis, and dissection.
Publication(s):
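For readers unfamiliar with polygenic risk scores, the standard definition is a weighted sum of effect-allele dosages across variants, with weights taken from GWAS effect sizes. The sketch below illustrates only that arithmetic. The variant identifiers, weights, and function name are hypothetical placeholders and are not taken from the aortic analyses described above.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of effect-allele dosages (illustrative helper).

    dosages: variant ID -> effect-allele count (0, 1, 2, or a fractional
             value for imputed genotypes) for one individual.
    weights: variant ID -> per-allele effect size from a GWAS.
    """
    missing = set(weights) - set(dosages)
    if missing:
        raise KeyError(f"no dosage for variants: {sorted(missing)}")
    return sum(weights[v] * dosages[v] for v in weights)


# Hypothetical weights and one individual's genotypes (made-up values).
weights = {"variant_a": 0.12, "variant_b": -0.05, "variant_c": 0.30}
dosages = {"variant_a": 2, "variant_b": 1, "variant_c": 0}
print(round(polygenic_risk_score(dosages, weights), 2))  # 0.19
```

In practice, scores are usually standardized within a reference population before being used for risk stratification.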
Ascending Aortic Diameter Prediction
A population-based model predicts ascending aortic diameter in asymptomatic individuals, enabling earlier identification of those at risk for aortic disease.
Publication(s):
ML4H has built a comprehensive portfolio of AI models that extract far more from the ECG than conventional interpretation: predicting disease onset, stratifying risk, and uncovering biological signals.
ECG-AI: Predicting Time to Incident Atrial Fibrillation
A deep learning model using 12-lead ECGs predicts time to incident AF, with improved utility when combined with a clinical risk model.
Publication(s):
Risk-Guided AF Screening in the VITAL-AF Trial
AI-enabled ECG models guide AF screening by identifying patients most likely to benefit from targeted monitoring, validated in a randomized trial.
Publication(s):
Single-Lead ECG AI for Incident AF
An AI model applied to handheld single-lead ECGs — as used in consumer devices — predicts incident AF, validated in the VITAL-AF trial.
Publication(s):
Predicting AF Recurrence After Cardioversion
An ECG-based model predicts AF recurrence following cardioversion for newly diagnosed AF, supporting personalized post-procedure management.
Publication(s):
Stressor-Associated AF: Predicting Recurrence and Outcomes
A deep learning model predicts recurrence and outcomes in patients who develop AF following a physiological stressor — a distinct and understudied AF subtype.
Publication(s):
Genetic Susceptibility to AF from ECG Deep Learning
Deep learning representations of the 12-lead ECG capture heritable variation in AF risk, linking ECG features to genetic susceptibility loci.
Publication(s):
ECG2Hypertension: A Digital Biomarker for Hypertension
A deep learning model detects hypertension from a single 12-lead ECG. The model's output is a stronger predictor of mortality, stroke, heart failure, and myocardial infarction than systolic blood pressure.
Publication(s):
AI-ECG for Incident Heart Failure
A deep learning model applied to the 12-lead ECG predicts incident heart failure, extending routine ECG data into long-term risk stratification.
Publication(s):
ECG-Based Identification of Coronary Artery Disease
A deep learning model identifies coronary artery disease from the standard 12-lead ECG, offering a non-invasive screening signal for a leading cause of death.
Publication(s):
ECG AI to Discriminate Cardioembolic Stroke and Post-Stroke AF Risk
An AI model discriminates cardioembolic from non-cardioembolic stroke and stratifies post-stroke AF risk, supporting targeted secondary prevention.
Publication(s):
ECG Deep Learning to Predict VO2 Max
A resting 12-lead ECG model predicts peak exercise oxygen uptake (VO2 max), a powerful fitness marker, without requiring an exercise test.
Publication(s):
Identifying Impaired Heart Rate Recovery from the ECG
A resting ECG model identifies impaired heart rate recovery — a marker of autonomic dysfunction and elevated cardiovascular risk.
Publication(s):
Pulmonary Capillary Wedge Pressure from the ECG
A deep learning model infers elevated pulmonary capillary wedge pressure — a key indicator of heart failure severity — from the standard 12-lead ECG.
Publication(s):
ECG Age and Clonal Hematopoiesis
Deep learning-derived ECG age is associated with clonal hematopoiesis of indeterminate potential, linking a non-invasive cardiac signal to an emerging cardiovascular risk factor.
Publication(s):
ML4H develops NLP and large language model (LLM) methods to unlock unstructured clinical text for phenotyping, prediction, and discovery.
NLP for Heart Failure Adjudication
An NLP model identifies heart failure events from unstructured discharge summaries, enabling scalable adjudication in clinical trials and real-world cohorts.
Publication(s):
LLM-Based Dementia Risk Prediction from Longitudinal Notes (CLIN-SUMM)
LLMs extract early dementia markers (mentions of falls and hearing loss) from EHR notes and apply temporal summarization to build predictive models for early risk identification.
Publication(s):
Cardiac MRI Report Measurement Extraction
A data-efficient NLP approach extracts structured measurements from cardiac MRI radiology reports, reducing manual effort in research and clinical workflows.
Publication(s):
C3PO NLP Phenotyping at Scale
Using ~80 billion tokens in the C3PO corpus, ML4H scales phenotype ascertainment across hundreds of thousands of patients via NLP — enabling analyses impossible with manual chart review.
Publication(s):
ML4H connects AI-derived phenotypes to genetic and molecular data, identifying biological drivers of disease and revealing new targets for therapy.
Liver Fat Genetics
Machine learning on large-scale imaging data reveals new genetic contributions to liver fat accumulation, a key driver of metabolic liver disease.
Publication(s):
Dilated Cardiomyopathy: Genetic Insights from Cardiac MRI
Analysis of cardiac MRI in 36,000 individuals identifies genetic loci associated with dilated cardiomyopathy, demonstrating the power of AI-enabled imaging phenotyping for cardiac genetics.
Publication(s):
Multimodal Feature Selection for Coronary Artery Disease
Machine learning selects 51 predictive features from 13,782 multimodal candidates, improving CAD prediction and identifying the most informative signals across clinical, imaging, and genetic data.
Publication(s):
Circulating Proteins Associated with Myocardial Fibrosis
Proteomics analysis identifies circulating proteins linked to myocardial interstitial fibrosis, providing potential biomarkers and therapeutic targets for fibrotic heart disease.
Publication(s):
Pulse Waveform and Genetic Associations
Machine learning identifies genetic and clinical factors associated with the dicrotic notch of the pulse waveform, linking a non-invasive vascular signal to heritable cardiovascular biology.
Publication(s):
Wearable devices generate continuous, real-world data that complements episodic clinical measurements. ML4H builds methods to extract health insights from this modality at population scale.
Physical Activity Patterns and Cardiometabolic Disease
"Weekend Warrior" activity — concentrating exercise in one or two days — reduces cardiometabolic disease risk comparably to more distributed activity patterns.
Publication(s):
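To make the "Weekend Warrior" definition concrete, the sketch below labels one week of accelerometer-derived activity minutes. It assumes a weekly target of 150 minutes of moderate-to-vigorous activity and treats a week as "weekend warrior" when at least half of the total falls on the two most active days. Both thresholds and the function name are illustrative assumptions; the text above only states that exercise is concentrated in one or two days.

```python
def classify_activity_pattern(daily_minutes, weekly_target=150.0, concentration=0.5):
    """Label one week of activity (illustrative thresholds, not the study's).

    daily_minutes: seven values, minutes of moderate-to-vigorous activity per day.
    weekly_target: assumed minimum weekly total to count as active.
    concentration: assumed share of the weekly total that must fall on the
                   two most active days to count as "weekend warrior".
    """
    if len(daily_minutes) != 7:
        raise ValueError("expected exactly seven daily values")
    total = sum(daily_minutes)
    if total < weekly_target:
        return "inactive"
    top_two = sum(sorted(daily_minutes, reverse=True)[:2])
    if top_two / total >= concentration:
        return "weekend warrior"
    return "regular"


print(classify_activity_pattern([0, 0, 0, 0, 0, 90, 90]))  # weekend warrior
print(classify_activity_pattern([30] * 7))                  # regular
print(classify_activity_pattern([10] * 7))                  # inactive
```

Keeping the thresholds as parameters makes it easy to test how sensitive a downstream association is to the exact definition.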
Wearable Monitoring at Population Scale
Analysis of 20M+ days of wearable data characterizes population-level physical activity and sedentary behavior patterns and links them to cardiovascular outcomes.
Publication(s):