A data model for population descriptors in genomic research.

American Journal of Human Genetics
Abstract

Population descriptors used in genetic studies have broad social and translational implications. There are no globally agreed-upon definitions or usages of common population descriptors (e.g., race, ethnicity, nationality, and tribe), many of which are applied ad hoc and/or derived from political or bureaucratic conventions. Recent recommendations have encouraged the retention of as much granularity in population descriptors as possible during data preparation, analysis, and interpretation of research results. However, genomic research infrastructures (i.e., current practices, resources, and workflows in genomic research) often lack systematic and flexible organization, structure, and harmonization of multifaceted and detailed population descriptor data. This can lead to loss of information, barriers to international collaboration, and potential issues in clinical translation. Here, we describe a data model, developed by the NIH-funded Polygenic Risk Methods in Diverse Populations (PRIMED) Consortium, that organizes and retains detailed population descriptor data for future research use. The model supports a versatile, traceable, and reproducible harmonization system that offers multiple benefits over existing data structures. This data model affords researchers the flexibility to thoughtfully choose and scientifically justify their choice of population descriptors. It avoids the conflation of social identities with biological categories and guards against harmful typological inferences. Genomic research tools of this kind will be crucial for producing scientifically robust findings that minimize potential harms of descriptor misuse while maximizing benefits for diverse communities.
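The abstract does not spell out the structure of the PRIMED data model, so the following is only a minimal sketch of the general idea it describes: keep each reported descriptor verbatim, and derive analysis labels through a named, versioned rule so that every label can be traced back to its source value. All class names, field names, and example values here (`DescriptorRecord`, `HarmonizationRule`, `HarmonizedLabel`, `harmonize`, the study names) are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DescriptorRecord:
    """One population descriptor exactly as reported (hypothetical schema)."""
    participant_id: str
    descriptor_type: str   # e.g. "race", "ethnicity", "nationality", "tribe"
    reported_value: str    # verbatim value; never overwritten
    source: str            # where the value came from


@dataclass
class HarmonizationRule:
    """A named, versioned mapping from reported values to analysis labels."""
    name: str
    version: str
    descriptor_type: str
    mapping: Dict[str, str]


@dataclass(frozen=True)
class HarmonizedLabel:
    """A derived label that keeps a pointer to its original record and rule."""
    record: DescriptorRecord
    label: Optional[str]   # None when the rule has no mapping for the value
    rule_name: str
    rule_version: str


def harmonize(records: List[DescriptorRecord],
              rule: HarmonizationRule) -> List[HarmonizedLabel]:
    """Apply one rule to the records of its descriptor type.

    The input records are left untouched, so a different rule can be
    applied later without any loss of the originally reported detail.
    """
    labels = []
    for rec in records:
        if rec.descriptor_type != rule.descriptor_type:
            continue  # the rule only covers one descriptor type
        labels.append(
            HarmonizedLabel(
                record=rec,
                label=rule.mapping.get(rec.reported_value),
                rule_name=rule.name,
                rule_version=rule.version,
            )
        )
    return labels


# Invented example data, for illustration only.
records = [
    DescriptorRecord("P1", "nationality", "Kenya", "study_A_questionnaire"),
    DescriptorRecord("P2", "nationality", "kenyan", "study_B_intake_form"),
    DescriptorRecord("P3", "nationality", "unmapped free text", "study_B_intake_form"),
    DescriptorRecord("P1", "ethnicity", "example ethnicity value", "study_A_questionnaire"),
]

rule = HarmonizationRule(
    name="example_nationality_rule",
    version="v1",
    descriptor_type="nationality",
    mapping={"Kenya": "Kenya", "kenyan": "Kenya"},
)

labels = harmonize(records, rule)
```

In this sketch the harmonized labels form a separate layer on top of the stored records. A researcher could define a second rule with a different level of granularity and justify that choice for a given analysis, while the reported values remain available unchanged.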

Year of Publication
2025
Journal
American Journal of Human Genetics
Volume
112
Issue
7
Pages
1504-1514
Date Published
07/2025
ISSN
1537-6605
DOI
10.1016/j.ajhg.2025.05.011
PubMed ID
40513563