Researchers from the National Institute of Health Data Science at Peking University and the Department of Clinical Epidemiology and Biostatistics at Peking University People's Hospital have conducted a systematic review of strategies for addressing missing data in electronic health records (EHRs). Published in Health Data Science, the study highlights the growing role of machine learning methods, relative to traditional statistical approaches, for this task.
Electronic health records have become a cornerstone in modern healthcare research, enabling analysis across clinical trials, treatment effectiveness studies, and genetic association research. However, missing data remains a persistent challenge, potentially introducing bias and undermining the reliability of findings. This study reviewed 46 research papers published between 2010 and 2024, systematically comparing the performance of traditional statistical methods, such as Multiple Imputation by Chained Equations (MICE), with modern machine learning approaches like Generative Adversarial Networks (GANs) and k-Nearest Neighbors (KNN).
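To make the idea of imputation concrete, the following is a minimal sketch of plain, unweighted k-nearest-neighbors imputation in Python. It is not the Med.KNN variant or any implementation evaluated in the review; the function name `knn_impute` and the toy `records` values are hypothetical and chosen only for illustration. Each missing entry is filled with the average of that column across the k most similar records, where similarity is measured only on the columns both records have observed.

```python
import math


def knn_impute(rows, k=2):
    """Return a copy of rows with None entries filled by a k-nearest-neighbor mean.

    Hypothetical helper for illustration; not the method used in the review.
    """

    def distance(a, b):
        # Root-mean-square difference over the columns observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf  # no overlap, so the rows cannot be compared
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(row) for row in rows]  # leave the input untouched
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows in which column j is observed.
            candidates = [
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            candidates = [c for c in candidates if c[0] != math.inf]
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            if nearest:
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


# Toy records (made-up values): three numeric measurements per patient.
records = [
    [60.0, 120.0, 5.0],
    [62.0, 122.0, 5.2],
    [61.0, None, 5.1],   # second measurement is missing
    [90.0, 160.0, 9.0],
]

filled = knn_impute(records, k=2)
print(filled[2])
```

In this toy example the incomplete record is closest to the first two records, so the missing value is filled with the mean of their second measurements. Real EHR data would additionally need feature scaling, handling of categorical variables, and attention to the missingness mechanism, none of which this sketch attempts.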
The findings reveal that machine learning techniques, particularly GAN-based methods and context-aware time-series imputation (CATSI), consistently outperformed traditional statistical approaches in handling both longitudinal and cross-sectional datasets. For longitudinal data, Med.KNN and CATSI showed superior performance, while probabilistic principal component analysis (PPCA) and MICE were more effective for cross-sectional datasets.
"Machine learning methods show significant promise for addressing missing data in EHRs," said Dr. Huixin Liu, Associate Professor at Peking University People's Hospital. "However, no single approach offers a universally applicable solution, underscoring the need for standardized benchmarking analyses across diverse datasets and missingness scenarios."
The study also identifies key challenges, including the heterogeneity of EHR datasets, the opacity of machine learning models, and the lack of universal benchmarks for assessing how well each method performs. Future research aims to establish a standardized protocol for handling missing EHR data and to develop benchmarking datasets for comprehensive evaluation.
"Our ultimate goal is to create a universally accepted protocol for handling missing data in electronic health records, ensuring more reliable and reproducible findings across medical research," added Dr. Shenda Hong, Assistant Professor at the National Institute of Health Data Science at Peking University.
This research marks a significant step toward addressing one of the most pressing challenges in digital healthcare research, offering insights that can help bridge the gap between data scarcity and robust analysis.