Access to publicly available human single-cell gene expression datasets, or scRNA-seq datasets, has significantly enhanced researchers' understanding of both complex biological systems and the etiology of various diseases. However, this increased accessibility raises greater concern about the privacy of the individuals who donated the cells and the likelihood of their private health details being shared without consent.
Previous studies on these privacy breaches have focused on bulk gene expression data sharing, where the average expression levels of genes are measured across a large population of cells from a tissue or sample rather than an individual cell. Because single-cell datasets can contain a lot of variation or "noise", researchers did not consider them at high risk for information leaks. Now, researchers at the New York Genome Center, Columbia University, and Brown University have challenged this assumption.
A new study, published on October 2 in Cell, reports that individuals in single-cell gene expression datasets are vulnerable to "linking attacks". In such attacks, hackers match supposedly anonymous expression data to other data sources to uncover private genetic and physical trait information of research participants.
"Recently released population scale single-cell datasets allowed us to approach the topic of privacy leakage and address the question of whether a hacker can work through the noise of single-cell data using publicly available information only to gain insight on a patient's genetic makeup and phenotypic traits and diseases," said corresponding author Gamze Gürsoy, PhD, Core Faculty Member at the New York Genome Center (NYGC), and Herbert Irving Assistant Professor of Biomedical Informatics at Columbia University.
Dr. Gürsoy and the study authors first gathered data from a lupus study and the OneK1K cohort, and linked individuals to their genetic and phenotypic data by comparing the single-cell expression data against publicly available bulk expression quantitative trait loci (eQTLs). They then demonstrated that this linking could be performed even more accurately using cell-type-specific eQTLs. Finally, they showed that linking individuals to their genetic and phenotypic profiles is still feasible when eQTL data is unavailable, by using genetic and single-cell data from a smaller number of individuals to train a predictive model.
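To make the idea of a linking attack concrete, here is a minimal sketch in Python on simulated data. It is not the study's method or code: the cohort size, effect sizes, noise level, and every function name (`simulate_cohort`, `zscore`, `link`) are assumptions chosen for illustration. The toy attacker knows the direction of each eQTL effect and holds a panel of candidate genotypes, then assigns each anonymous expression profile to the candidate whose genotypes correlate best with it.

```python
import numpy as np


def simulate_cohort(n_individuals=50, n_eqtls=300, noise_sd=1.0, seed=0):
    """Simulate genotypes and noisy expression (all values are assumptions)."""
    rng = np.random.default_rng(seed)
    # Minor allele frequency per variant, then genotypes coded as 0, 1 or 2.
    maf = rng.uniform(0.2, 0.5, size=n_eqtls)
    genotypes = rng.binomial(2, maf, size=(n_individuals, n_eqtls))
    # Each variant shifts the expression of one gene up or down.
    effects = rng.choice([-1.0, 1.0], size=n_eqtls) * rng.uniform(0.5, 1.0, size=n_eqtls)
    noise = rng.normal(0.0, noise_sd, size=genotypes.shape)
    expression = genotypes * effects + noise
    return genotypes, effects, expression


def zscore(matrix):
    """Standardize each column; the small constant avoids division by zero."""
    return (matrix - matrix.mean(axis=0)) / (matrix.std(axis=0) + 1e-9)


def link(expression, genotypes, effects):
    """For each expression profile, return the index of the best-matching genotype."""
    expr_z = zscore(expression)
    # Flip the sign of each variant so that a higher value predicts higher expression.
    geno_z = zscore(genotypes) * np.sign(effects)
    # scores[i, j]: average agreement between profile i and candidate genotype j.
    scores = expr_z @ geno_z.T / expression.shape[1]
    return scores.argmax(axis=1)


genotypes, effects, expression = simulate_cohort()

# Shuffle the expression profiles so the attacker does not know who is who.
shuffle = np.random.default_rng(1).permutation(len(genotypes))
predicted = link(expression[shuffle], genotypes, effects)

accuracy = float((predicted == shuffle).mean())
print(f"Fraction of profiles linked to the correct genotype: {accuracy:.2f}")
```

In this toy setting each gene carries only a weak signal, but averaging over hundreds of eQTLs is enough to separate the true match from the other candidates. Real single-cell data is sparser and noisier than this simulation, so the sketch illustrates the principle rather than the accuracy reported in the study.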
"We all know that gene expression patterns are influenced by genetic mutations, combinations of which are unique to each individual," adds Conor Walker, a former post-doc in Dr. Gürsoy's labs at NYGC and Columbia. "We showed that by using genetic variants and single-cell RNA-Seq data from one cohort, we can identify positions that can be predicted in other studies, relying solely on the single-cell expression data from those studies. This approach allows the retrieval of genetic information that participants in unrelated studies never consented to sharing."
Since the data does not need to originate from the same group or population, datasets from healthy individuals can be used to predict information about individuals in a disease dataset. Gene expression in healthy and diseased individuals shares enough underlying commonalities that disease does not greatly alter the gene expression signals, even in single cells.
"The ability to leverage data generated in a different lab and even processed with a different method, to then use it to link individuals in a completely different anonymous dataset, is rather striking and highlights a real privacy issue for single-cell data," added Dr. Gürsoy. "We aim for this study to help quantify risks before data release and shape the design of future studies to ensure greater privacy for patients."
The hope is that this discovery will help develop clear and detailed consent policies that highlight the privacy risk for donors of single-cell data, and help shape legislation that prevents attackers from using this information for harm.
All authors (from the New York Genome Center and Columbia University unless noted): Conor R. Walker, Xiaoting Li, Manav Cakravarthy (Brown University), William Lounsbery-Scaife, Yoolim A. Choi, Rithambara Singh (Brown University), and Gamze Gürsoy.
This work was supported by the following grants: R35GM147004 and R00HG010909.