Public RNA-seq data are not representative of global human diversity

Abstract

The field of human genetics has reached a consensus that it is important to work with diverse and globally representative participant groups. This diverse sampling is required to build a robust understanding of the genomic basis of complex traits and diseases as well as human evolution, and to ensure that all people benefit from downstream scientific discoveries. While previous work has characterized compositional biases and disparities for public genome-wide association (GWAS), microbiome, and epigenomic studies, we currently lack a comprehensive understanding of the degree of bias for transcriptomic studies. To address this gap, we analyzed the metadata for RNA-seq studies from two public databases: the Sequence Read Archive (SRA), representing 795,071 samples from 21,209 studies, and the Database of Genotypes and Phenotypes (dbGaP), representing 167,389 samples from 649 studies. We also randomly selected 620 studies from SRA for detailed, manual evaluation. We found that 3% of samples in SRA and 21% of individuals described in the literature had population descriptors (race, ethnicity, or ancestry); 28% of samples in dbGaP had paired genotype data that was used to empirically infer ancestry. In SRA, dbGaP, and the literature, race, ethnicity, and ancestry terms were frequently conflated and difficult to disambiguate. After standardizing population descriptors, we observed many clear biases: for example, among samples in SRA that were coded using US Census terms, 69.0% came from white donors, corresponding to a 1.2x overrepresentation of this group relative to the US population. Among samples in SRA coded using continental ancestry labels, 55.6% came from European ancestry donors, a 4.1x overrepresentation of this group relative to the global population. These biases were generally similar across datasets (SRA, dbGaP, literature review), and were comparable to previous reports for other ’omics data types.
However, we note that, relative to other ’omics data types such as GWAS, there is considerably less information, of arguably worse quality, about who is participating in RNA-seq studies. Together, these results demonstrate a critical need to improve our thoughtfulness, consistency, and effort around reporting population descriptors in RNA-seq studies, and to more generally strive for greater diversity in this important data type.
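
The overrepresentation figures quoted in the abstract can be read as a simple ratio: a group's share among samples divided by its share in a reference population. The sketch below is illustrative only and is not the authors' analysis code. The function names are hypothetical, and the reference shares are back-calculated from the rounded values in the abstract rather than taken from the source, so they are approximate.

```python
def fold_representation(sample_share: float, population_share: float) -> float:
    """Share of a group among samples divided by its share in a reference population.

    Values above 1 indicate overrepresentation; values below 1, underrepresentation.
    """
    if not 0 <= sample_share <= 1:
        raise ValueError("sample_share must be a proportion in [0, 1]")
    if not 0 < population_share <= 1:
        raise ValueError("population_share must be a proportion in (0, 1]")
    return sample_share / population_share


def implied_population_share(sample_share: float, fold: float) -> float:
    """Invert the ratio: the reference share implied by a sample share and a fold value."""
    if fold <= 0:
        raise ValueError("fold must be positive")
    return sample_share / fold


# Figures reported in the abstract (rounded, so the implied shares are approximate).
# ASSUMPTION: the reference shares below are back-calculated, not quoted from the source.
us_white_reference = implied_population_share(0.690, 1.2)
global_european_reference = implied_population_share(0.556, 4.1)

print(f"Implied US reference share: {us_white_reference:.1%}")
print(f"Implied global reference share: {global_european_reference:.1%}")
```

Under these assumptions, the reported 1.2x and 4.1x values correspond to reference shares of roughly 57.5% and 13.6%, respectively. The exact baselines used in the study may differ because the published figures are rounded.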

Publication
bioRxiv

Related