At the intersection of data privacy, open science and public health, the University of Waterloo has initiated a collaborative effort with several Canadian government agencies to explore the use of synthetic data for research and analysis.
Dr. Helen Chen, a professor in the Faculty of Health and director of the Professional Practice Centre in Health Systems, is leading the initiative.
"This collaboration primarily aims to utilize synthetic data's capabilities in addressing the sensitive information held by government agencies," Chen says. "It is very important we preserve privacy while enabling data sharing and analysis - synthetic data stands to be a critical tool in achieving this balance."
Synthetic data is artificially generated information that closely mimics the characteristics and patterns of real data without containing any sensitive or personally identifiable information. It is developed through advanced machine learning algorithms and statistical techniques to replicate the properties and structures of real-world datasets.
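To make the idea concrete, here is a minimal sketch in Python of one of the simplest statistical approaches: learn a separate distribution for each column of a table, then sample new rows from those distributions. The records, column names and function names are invented for illustration, and this is not how MDClone or the Waterloo team generate data. Production tools also model the relationships between columns and add explicit privacy safeguards, both of which this sketch omits.

```python
import random
import statistics

def fit_column_models(records):
    """Learn an independent model per column: a Gaussian for numeric
    columns, a frequency table for categorical ones."""
    models = {}
    for column in records[0]:
        values = [row[column] for row in records]
        if all(isinstance(v, (int, float)) for v in values):
            models[column] = ("numeric", statistics.mean(values), statistics.stdev(values))
        else:
            categories = sorted(set(values))
            weights = [values.count(c) for c in categories]
            models[column] = ("categorical", categories, weights)
    return models

def sample_synthetic(models, n, seed=0):
    """Draw n artificial rows from the fitted per-column models."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for column, model in models.items():
            if model[0] == "numeric":
                _, mean, stdev = model
                row[column] = rng.gauss(mean, stdev)
            else:
                _, categories, weights = model
                row[column] = rng.choices(categories, weights=weights, k=1)[0]
        rows.append(row)
    return rows

# Hypothetical records, made up for this example (not real patient data).
real_records = [
    {"age": 34, "dose_mg": 50.0, "sex": "F"},
    {"age": 61, "dose_mg": 75.0, "sex": "M"},
    {"age": 47, "dose_mg": 50.0, "sex": "F"},
    {"age": 29, "dose_mg": 25.0, "sex": "M"},
    {"age": 72, "dose_mg": 75.0, "sex": "F"},
    {"age": 55, "dose_mg": 50.0, "sex": "M"},
]

synthetic_records = sample_synthetic(fit_column_models(real_records), n=1000)
print(synthetic_records[0])
```

Because each column is sampled independently, the synthetic rows reproduce the averages and category frequencies of the original table but not the correlations between columns, which is one reason real-world generators are considerably more sophisticated.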
Chen and her team have worked closely with The Ottawa Hospital on a case study using synthetic data to study a specific drug's use patterns and to train a machine learning model to predict the drug's efficacy on individual patients.
"The Ottawa Hospital uses MDClone, a synthetic data analytics platform, to generate high quality synthetic data that closely mirrors the characteristics of real data, ensuring no patient information is involved," Chen says. "Our model based on synthetic data is then provided to the team at the hospital for validation with real data - enabling us to evaluate the effectiveness of synthetic data when compared to the real data."
This collaboration caught the attention of multiple government agencies at a conference in Ottawa organized by Waterloo's Cybersecurity and Privacy Institute. The Public Health Agency of Canada (PHAC) and the Canada Border Services Agency both expressed interest in using synthetic data for research because of the sensitive nature of the data they hold.
These agencies acknowledge the significance of employing synthetic data for various purposes, such as enhancing pandemic response applications, assessing border crossing patterns and evaluating health policy impacts on vulnerable communities.
By harnessing synthetic data, agencies can conduct essential testing, training and policy evaluations without compromising individual privacy or data security. This approach holds immense promise for enhancing decision-making processes, particularly in cases where real data must remain protected.
Encouraged by the results of The Ottawa Hospital case study, the Waterloo research team and the federal agencies initiated a feasibility study. It was conducted by a dedicated team of students and faculty members from the University and data scientists from PHAC.
This interdisciplinary team, including students from systems design engineering, computer science, public health sciences and urban planning, collaborated closely with PHAC's Corporate Data and Surveillance Division to create and assess synthetic data solutions. The students involved are gaining invaluable experiential learning opportunities and contributing to research in this emerging field.
"One of the key members of the project, Dr. Shu-feng Tsao, has started her System Impact Post-Doctorial Fellowship at the Canadian Institute for Health Information, investigating data governance, ethics and quality measurements of generating and sharing synthetic health data," Chen says. "Some students have joined the government agencies as part-time employees, engaging in scientific inquiries and further research within this domain."
Bing Hu is a graduate student interested in medical applications of artificial intelligence (AI) and data and is on the research team.
"Open data and science are essential for ensuring fairness and equity, especially when real data can't be openly shared," Hu says. "Through this project, I've actively advanced the cause of open data and science and enjoyed meeting with and sharing my passion and knowledge of AI and synthetic data with critical stakeholders at PHAC."
The collaboration has already hosted discussions regarding ethical considerations in synthetic data generation and is planning workshops to focus on data generation techniques and evaluation practices.
"The agencies recognize the importance of promoting beneficial data use while ensuring privacy protection," highlights Chen on the government agencies' commitment to advancing open data and open science initiatives. "As part of our collaboration, we are developing frameworks to evaluate synthetic data's utility, fidelity and privacy - with the aim of setting national standards."
Looking forward, the team plans to explore synthetic data's potential across various government agencies and datasets. Their objective is to facilitate knowledge dissemination and establish the de facto standard for synthetic data sharing in Canada.
Waterloo's collaboration with The Ottawa Hospital and Canadian government agencies on synthetic data represents a significant stride towards advancing open science, safeguarding privacy and building and maintaining trust while accelerating research and analysis. As the project unfolds, it promises to reshape the handling and utilization of sensitive data for the Canadian public.