Author: Yale, Andrew Jonathan
ProQuest Information and Learning Co
Rensselaer Polytechnic Institute. Computer Science
Title: Privacy Preserving Synthetic Health Data Generation and Evaluation
Publication: 2020
Description: 1 online resource (99 pages)
Content type: text
Media type: computer
Carrier type: online resource
Notes: Source: Dissertations Abstracts International, Volume: 82-03, Section: B
Advisor: Bennett, Kristin P
Thesis (Ph.D.)--Rensselaer Polytechnic Institute, 2020
Includes bibliographical references
We consider the problem of health data availability in education and research. Efforts to teach health informatics courses are severely limited without access to health data. Research projects are restricted to those with resources to attain health data. Even for those fortunate groups with access, progress is still impeded by privacy regulations. Research and educational efforts are biased towards a few publicly available datasets of limited scope. Deidentifying datasets by mapping and removing variables is costly and time-consuming, and reidentification is always a risk. Studies involving patient data are not reproducible by other researchers, which limits innovation. The ability to generate synthetic data suitable for a wide variety of studies involving healthcare enables much richer healthcare data to be used in educational programs of all types. High quality libraries of synthetic datasets can advance health research by making data publicly available to all researchers without compromising on features and while still adhering to privacy regulations.
The goal of this thesis is to define an end-to-end process for taking a defined medical dataset in a secure environment and creating a high quality synthetic version of the dataset that can be used in a public environment. We use this approach to create synthetic versions of two studies on autism spectrum disorder in the OptumLabs® Data Warehouse secure environment. The OptumLabs Data Warehouse is a longitudinal, real-world data asset with de-identified administrative claims and electronic health record data, containing information on over 200 million patient lives. The OptumLabs Data Policy prevents exporting patient-level data in order to minimize re-identification risk for the millions of patient lives represented in the database. Thus our strategy is to train a novel generative model in the secure environment, assess the quality and privacy of that model and the data it generates, export the model from the secure environment, generate the data externally, and deploy the data in educational settings. This approach offers a new paradigm for health research. Researchers can develop algorithms using the synthetic data first, and then import these methods into a secure environment to conduct a final study on the real data to ensure results and conclusions are valid. The challenge is to create a process that OptumLabs has full confidence does not violate privacy. In order to address this challenge, we define novel measures to empirically assess privacy. To accomplish this goal, we pursue a series of steps described in the chapters of this work.
In Chapter 1, we define characteristics of high quality synthetic medical data using the categories of privacy, resemblance, utility, and efficiency. Privacy is a measure that verifies that no private information about an individual is contained in the synthetic data. Resemblance is split into three types: distribution, coverage, and fidelity resemblance. Distribution resemblance examines the statistical similarities between the distributions of the real and synthetic data. Coverage resemblance characterizes how well the supports of the synthetic and real data match. Fidelity resemblance examines how well the synthetic data preserves the qualitative features of the real data. Utility is defined in terms of targeted or universal utility. Targeted utility means the synthetic data is useful for a specific analysis task. Universal utility means the synthetic data can be used for any analysis task for which the real data would be used. Efficiency specifies the constraints on time, size, and data retention for the model. Efficiency ensures another layer of privacy protection within the model itself.
Chapter 2 examines approaches for synthetic data generation. First, we examine existing published approaches for synthetic data generation. For each approach, we discuss strengths and weaknesses with respect to privacy, resemblance, utility, and efficiency. Then, we select five baseline synthetic data generation methods, including one novel method, that excel in one or more of privacy, resemblance, utility, and efficiency, but have downsides in other categories. These methods are multivariate Gaussian, Parzen windows, differential privacy obfuscation, copying the real data, and our novel approach of the additive noise model. Second, we discuss existing methods included as alternatives to the baseline approaches. These methods are similar to the baseline methods, but have enough drawbacks that they are not considered for final baseline or candidate methods. Finally, we explore the candidate generative adversarial network (GAN) methods. The strengths of the baseline methods are compared to the GAN methods to examine how GAN models perform equivalently to the best parts of each baseline model.
In Chapter 3, we create a dataset for testing the baseline methods and candidate GAN methods. The dataset is based on MIMIC-III ICU data and includes diagnosis, demographic, vital sign, and health outcome data. We evaluate the medGAN using metrics from the original paper and find it to be underperforming. We create a novel method, the HealthGAN, as the final candidate method. We define a suite of metrics and visualizations for assessing the privacy, resemblance, and utility of synthetic data, demonstrate their use, and apply them in a benchmark study. These metrics are developed to be used on any type of synthetic data generation method. Theoretical results for privacy, resemblance, and utility do not exist for most methods. Our work provides empirical results, which are more valuable to rely on than theoretical results, and existing work on assessment of synthetic data is haphazard. Thus a primary contribution of this thesis is a novel suite of methods to empirically assess privacy, resemblance, and utility. To demonstrate this suite of methods, we create a MIMIC-III testbed and demonstrate that the novel HealthGAN method yields strong results.
We propose targeted utility as the primary goal of synthetic data, i.e., the ability of the synthetic data to work for an intended usage. How to assess the performance of targeted utility necessarily depends on the targeted tasks. In Chapter 4, we examine targeted utility for two different types of task, all based on synthetic datasets generated from the MIMIC-III data. The first task examines the use of the data in an education setting by using the synthetic data as an online challenge in a class. The second task examines the effectiveness of synthetic data for classroom usage and research studies by reproducing published studies using the real and synthetic MIMIC data. In all cases, we use HealthGAN to generate the data. On six problems, HealthGAN synthesized data that proved valuable for instructional purposes, and the synthetic data did a reasonable job of reproducing research results.
Chapter 5 demonstrates the full end-to-end process. To evaluate this process, we replicate two different studies on autism spectrum disorder. We train our generative models on real de-identified patient data inside the secure OptumLabs environment, replicating as closely as possible the exact cohorts used in previously published work. Through the process of repeating these studies, we identify several areas for improvements to the HealthGAN method that result in higher quality data. The synthetic data is evaluated for privacy, resemblance, and utility using our defined metrics. Finally, we export the synthetic data models from the OptumLabs environment.
Chapter 6 concludes with a summary of contributions and discussions of future work.
Electronic reproduction. Ann Arbor, Mich. : ProQuest, 2021
Mode of access: World Wide Web
Subjects: Computer science
Public health
Information science
Health data availability
End-to-end process
Defined medical dataset
Electronic books.
0984
0723
0573
ISBN/ISSN: 9798662575981
Related links: full text (PQDT)