Celeste: A cloud-based genomics infrastructure with variant-calling pipeline suited for population-scale sequencing projects.

medRxiv : the preprint server for health sciences
Abstract

BACKGROUND: The Research Program is one of the world's largest sequencing efforts and will generate genetic data for over one million individuals from diverse backgrounds. This historic megaproject will create novel research platforms that integrate an unprecedented amount of genetic data with longitudinal health information. Here, we describe the design of Celeste, a resilient, open-source cloud architecture for implementing genomics workflows that has successfully analyzed petabytes of participant genomic information for the program, thereby giving other large-scale sequencing efforts a comprehensive set of tools to power analysis. The infrastructure is highly scalable and has routinely processed fluctuating workloads of up to 9,000 whole-genome sequencing (WGS) samples per month for the program. It also lends itself to multiple projects. Serverless technology and container orchestration form the basis of Celeste's system for managing this volume of data.

RESULTS: In 12 months of production (within a single Amazon Web Services (AWS) Region), around 200 million serverless functions and over 20 million messages coordinated the analysis of 1.8 million bioinformatics, quality control, and clinical reporting jobs. Adapting WGS analysis to clinical projects requires tuning variant-calling methods to improve the reliable detection of variants with known clinical importance. Thus, we also share the process by which we tuned the variant-calling pipeline used by the multiple genome centers supporting the program to maximize precision and accuracy for low-fraction variant calls with clinical significance.
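
To put the production figures above in proportion, the following minimal sketch derives rough per-job and per-month averages from the numbers quoted in the RESULTS paragraph. It assumes that "serverless functions" refers to function invocations, and it treats "around 200 million" and "over 20 million" as exact values, so the results are order-of-magnitude estimates only. All variable and function names are hypothetical and do not come from the Celeste codebase.

```python
# Figures as quoted in the RESULTS paragraph (approximate in the source).
FUNCTION_INVOCATIONS = 200_000_000  # "around 200 million serverless functions"
MESSAGES = 20_000_000               # "over 20 million messages"
JOBS = 1_800_000                    # bioinformatics, QC, and clinical reporting jobs
MONTHS = 12                         # length of the production period


def per_job(total: int, jobs: int = JOBS) -> float:
    """Average count of some event per analysis job."""
    return total / jobs


invocations_per_job = per_job(FUNCTION_INVOCATIONS)
messages_per_job = per_job(MESSAGES)
jobs_per_month = JOBS / MONTHS

print(f"function invocations per job: about {invocations_per_job:.0f}")
print(f"messages per job: about {messages_per_job:.0f}")
print(f"jobs per month: about {jobs_per_month:,.0f}")
```

Under these assumptions, each job involves on the order of a hundred function invocations and roughly ten coordination messages, which is consistent with an event-driven design in which many small functions track and advance each job.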

CONCLUSIONS: When combined with hardware-accelerated implementations for genomic analysis, Celeste had far-reaching, positive implications for turn-around time, dynamic scalability, security, and storage of analysis for one hundred thousand whole-genome samples and counting. Other groups may align their sequencing workflows to this harmonized pipeline standard, included within the Celeste framework, to meet clinical requisites for population-scale sequencing efforts. Celeste is available as an Amazon Web Services (AWS) deployment on GitHub and includes command-line parameters and software containers.

Year of Publication
2025
Journal
medRxiv : the preprint server for health sciences
Date Published
04/2025
DOI
10.1101/2025.04.29.25326690
PubMed ID
40343041