Greater Availability of De-Identified Patient Health Data Could Enable Better Treatments and Diagnostics
BOSTON – Artificial intelligence has come a long way in recent years. Scientists have made great strides in developing algorithms that can analyze patient data to diagnose disease or predict which treatments work best for different patients. The success of those algorithms depends on access to patient health data that has been stripped of personal information that could be used to identify individuals in the dataset. However, the possibility that individuals could be re-identified through other means has raised concerns among privacy advocates.
In a recent study, researchers at Beth Israel Deaconess Medical Center (BIDMC) have quantified the potential risk of this kind of patient re-identification and found that it is currently extremely low relative to the risk of data breach. In fact, between 2016 and 2021, the period examined in the study, there were no reports of patient re-identification through publicly available health data. The findings, published in PLOS Digital Health, suggest that the potential risk to patient privacy is greatly outweighed by the gains for patients, who benefit from better diagnosis and treatment that large datasets can yield.
"We agree that there is some risk to patient privacy, but there is also a risk of not sharing data," said senior author Leo Anthony Celi, MD, MPH, MSc, a physician-researcher in the Division of Pulmonary, Critical Care and Sleep Medicine at BIDMC. "There is harm when data is not shared, and that needs to be factored into the equation. It is my hope that, in the near future, these datasets will become more widely available and include a more diverse group of patients."
Large health record databases created by hospitals and other institutions contain a wealth of information on diseases such as heart disease, cancer, macular degeneration, and COVID-19, which researchers mine to discover new ways to diagnose and treat disease. Celi and colleagues at Massachusetts Institute of Technology's (MIT) Laboratory for Computational Physiology have created several publicly available databases including the Medical Information Mart for Intensive Care (MIMIC), which they recently leveraged to develop algorithms that can help doctors make better medical decisions. Many other research groups have also used the data, and others have created similar databases in countries around the world.
Typically, when patient information is entered into this kind of database, certain identifying facts are removed, such as patients' names, addresses, and phone numbers. However, concerns about privacy have slowed the development of more publicly available databases. In the new study, the team set out to determine the actual risk of patient re-identification. First, they searched PubMed, a database of scientific papers, for any reports of patient re-identification from publicly available health data, but found none. To expand the search, the researchers then examined media reports from September 2016 to September 2021, using Media Cloud, an open-source global news database and analysis tool.
"In a search of more than 10,000 U.S. media publications, we did not find a single instance of patient re-identification from publicly available health data," said lead author Kenneth Seastedt, MD, a thoracic surgery fellow at BIDMC. "During the same time period, we actually found that the health records of nearly 100 million people were stolen through data breaches of information that was supposed to be securely stored."
More widespread sharing of de-identified health data is necessary, the scientists say, to improve the inclusion of minority groups in the United States, who have traditionally been underrepresented in medical studies. They are also working to encourage the development of more such databases in low- and middle-income countries.
"We cannot move forward with AI unless we address the biases that lurk in our datasets," said Celi. "When we have this debate over privacy, no one hears the voice of the people who are not represented. People are deciding for them that their data need to be protected and should not be shared. But they are the ones whose health is at stake; they're the ones who would most likely benefit from data-sharing."
"Our study team includes researchers from across the world," said Seastedt. "Given the lack of access to and representation of low-income countries in data-sharing work, it is important to ensure that voices from the countries that could benefit from these efforts are included."
The team is also enhancing existing safeguards to protect datasets such as MIMIC by sharing the data in a way that prevents it from being downloaded and allows database administrators to monitor all queries run on it. This allows them to flag any user inquiry that appears not to be for legitimate research purposes.
"What we are advocating for is performing data analysis in a very secure environment so that we weed out any nefarious players trying to use the data for other reasons apart from improving population health," said Celi, who is also a principal research scientist at MIT. "We're not saying that we should disregard patient privacy. What we're saying is that we have to also balance that with the value of data sharing."
Co-authors included Patrick Schwab of GlaxoSmithKline; Zach O'Brien of Monash University; Edith Wakida of Mbarara University of Science and Technology; Karen Herrera of Hospital Militar, Managua; Portia Grace F. Marcelo and Alvin Marcelo of University of the Philippines; Louis Agha-Mir-Salim of Institute of Medical Informatics, Charité - Universitätsmedizin; Xavier Borrat Frigola of Hospital Clinic de Barcelona and Harvard-MIT Division of Health Sciences & Technology; and Emily Boardman Ndulue of Northeastern University.
The research was funded by the National Institutes of Health (grant NIBIB R01 EB017205).
Schwab is an employee and shareholder of GlaxoSmithKline plc. The funders had no role in study design, decision to publish, or preparation of the manuscript.
This press release was repurposed from content developed by Anne Trafton of the MIT News Office.