EVA: Generating Longitudinal Electronic Health Records Using Conditional Variational Autoencoders

December 18, 2020 · Declared Dead · 🏛 Machine Learning in Health Care

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Siddharth Biswal, Soumya Ghosh, Jon Duke, Bradley Malin, Walter Stewart, Jimeng Sun
arXiv ID: 2012.10020
Category: cs.LG (Machine Learning)
Cross-listed: cs.AI
Citations: 47
Venue: Machine Learning in Health Care
Last Checked: 3 months ago

Abstract

Researchers require timely access to real-world longitudinal electronic health records (EHR) to develop, test, validate, and implement machine learning solutions that improve the quality and efficiency of healthcare. In contrast, health systems deeply value patient privacy and data security. De-identified EHRs do not adequately address the needs of health systems, because de-identified data are susceptible to re-identification and limited in volume. Synthetic EHRs offer a potential solution. In this paper, we propose EHR Variational Autoencoder (EVA) for synthesizing sequences of discrete EHR encounters (e.g., clinical visits) and encounter features (e.g., diagnoses, medications, procedures). We illustrate that EVA can produce realistic EHR sequences, account for individual differences among patients, and be conditioned on specific disease conditions, thus enabling disease-specific studies. We design efficient, accurate inference algorithms by combining stochastic gradient Markov Chain Monte Carlo with amortized variational inference. We assess the utility of the methods on large real-world EHR repositories containing over 250,000 patients. Our experiments, which include user studies with knowledgeable clinicians, indicate that the generated EHR sequences are realistic. We confirm that the performance of predictive models trained on the synthetic data is similar to that of models trained on real EHRs. Additionally, our findings indicate that augmenting real data with synthetic EHRs yields the best predictive performance, improving the best baseline by as much as 8% in top-20 recall.
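
The abstract reports its headline gain in top-20 recall but does not define the metric. As a rough illustration only, the sketch below assumes the common definition used for next-visit code prediction: the fraction of a visit's true codes that appear among the k highest-scoring predicted codes. The function name `recall_at_k` and its arguments are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Set


def recall_at_k(scores: Sequence[float], true_codes: Set[int], k: int = 20) -> float:
    """Fraction of true codes found among the k highest-scoring codes.

    Assumed definition for illustration; the paper may compute it differently.
    scores[i] is the model's score for code index i.
    """
    if not true_codes:
        raise ValueError("true_codes must be non-empty")
    # Rank code indices by descending score; ties keep their original order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_k = set(ranked[:k])
    return len(top_k & true_codes) / len(true_codes)


# Toy example with 5 candidate codes and 2 true codes.
toy_scores = [0.1, 0.9, 0.3, 0.8, 0.05]
toy_truth = {1, 2}
print(recall_at_k(toy_scores, toy_truth, k=2))  # 0.5: only code 1 is in the top 2
```

In practice the per-visit values would be averaged over all visits in the evaluation set; that averaging step is likewise an assumption here.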