MARC LEADER 00000nam a2200493Ki 4500 
001    AAI3728239 
005    20181030085013.5 
006    m     o  u         
007    cr mn||||a|a|| 
008    181030s2015    xx      sbm   000 0 eng d 
020    9781339136318 
035    (MiAaPQ)AAI3728239 
035    (MiAaPQ)umn:16425 
040    MiAaPQ|beng|cMiAaPQ|dNTU 
100 1  Sarkar, Chandrima 
245 10 Improving Predictive Modeling in High Dimensional, 
       Heterogeneous and Sparse Health Care Data 
264  0 |c2015 
300    1 online resource (140 pages) 
336    text|btxt|2rdacontent 
337    computer|bc|2rdamedia 
338    online resource|bcr|2rdacarrier 
500    Source: Dissertation Abstracts International, Volume: 
       77-03(E), Section: B 
500    Advisers: Jaideep Srivastava; Sarah Cooley 
502    Thesis (Ph.D.)--University of Minnesota, 2015 
504    Includes bibliographical references 
520    In the past few decades predictive modeling has emerged as
       an important tool for exploratory data analysis and 
       decision making in health care. Predictive modeling is a 
       commonly used statistical and data mining technique that 
       works by analyzing historical and current data and 
       generating a model to help predict future outcomes. It 
       gives us the power to discover hidden relationships in 
       volumes of data and use those insights to confidently 
       predict the outcome of future events and interactions. In 
       health care, complex models can be created to combine 
       patient information, such as demographic and clinical 
       information from care providers, in order to predict 
       outcomes and improve model accuracy. Predictive modeling 
       in health care seeks out subtle data patterns to enhance 
       decision making, so that, for example, care providers can 
       recommend prescription drugs and services based on a 
       patient's profile 
520    Although all predictive techniques have different 
       strengths and weaknesses, model accuracy is mostly 
       dependent on the raw input data with various features used
       to train a predictive model. Model building often requires
       data pre-processing to reduce the impact of skew or 
       outliers in the data, which can significantly improve 
       performance. From hundreds of 
       available raw data fields, a subset is selected and fields
       are pre-processed before being presented to a predictive 
       modeling technique. For example, there can be thousands of
       variables consisting of genetic, clinical and demographic 
       information for different groups of patients. Therefore, 
       detecting the significant variables for a particular group
       of patients can enhance model accuracy. Hence, the secret 
       behind a good predictive model often depends more on good 
       pre-processing than on the technique used to train the 
       model 
520    While the role of an effective and efficient data 
       pre-processing mechanism and its use with predictive 
       modeling in health care data is better understood, three 
       key challenges facing this data pre-processing task were 
       identified. These include: 1) 
       High dimensionality: The challenge of high-dimensionality 
       arises in diverse fields, ranging from health care and 
       computational biology to financial engineering and risk 
       management. This work identifies that no single feature 
       selection strategy is robust across different families of 
       classification or prediction algorithms. Existing feature 
       selection techniques produce different results with 
       different predictive models. This can be a problem when 
       deciding on the best predictive model to use with real 
       high-dimensional health care data, especially without 
       domain experts 
520    2) Heterogeneity in the data and data redundancy: Most 
       real-world data is heterogeneous in nature, i.e. the 
       population consists of overlapping homogeneous groups. In 
       health care, Electronic Health Records (EHR) data consists
       of diverse groups of patients with a wide range of diverse
       health conditions. This thesis identifies that predictive 
       modeling with a single learning model over heterogeneous 
       data can yield inconclusive results and an ineffective 
       explanation of an outcome. Therefore, this thesis proposes
       that a data segmentation / co-clustering technique is 
       needed, one that extracts groups from the data while 
       removing insignificant features and extraneous rows, 
       resulting in improved predictive modeling with a learning 
       model 
520    3) Data sparseness: When a row is created, storage is 
       allocated for every column, irrespective of whether a 
       value exists for a given field. This gives rise to sparse 
       data, in which a relatively high percentage of a 
       variable's cells are missing actual data. In health care,
       not all patients undergo every possible medical 
       diagnostic test, and lab results are equally sparse. Such 
       sparse information or missing values cause predictive 
       models to produce inconclusive results. One primitive 
       technique is manual imputation of missing values by 
       domain experts. Today, this is almost impossible because 
       the data is huge and high-dimensional in nature. A 
       variety of statistical and machine learning based missing 
       value estimation techniques exist, which estimate missing 
       values by statistical analysis of the available data set. 
       However, most of these techniques do not consider the 
       importance of a domain expert's opinion in estimating 
       missing data. This thesis proposes that techniques that 
       use statistical information from the data as well as the 
       opinion of experts can estimate missing values more 
       effectively. This imputation procedure can result in 
       non-sparse data that is closer to the ground truth and 
       that improves predictive modeling 
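The notion of sparseness described in the abstract above, a high share of a variable's cells lacking a value, can be illustrated with a minimal sketch. It is not taken from the thesis: the function name missing_fraction and the sample patient records are hypothetical, and a missing value is assumed to be represented by None.

```python
def missing_fraction(rows, columns):
    """Return, for each column, the fraction of rows with no value (None)."""
    if not rows:
        raise ValueError("at least one row is required")
    total = len(rows)
    return {
        col: sum(1 for row in rows if row.get(col) is None) / total
        for col in columns
    }


# Hypothetical patient records; not every patient has every lab result.
records = [
    {"age": 54, "hba1c": None, "ldl": 130},
    {"age": 61, "hba1c": 7.2, "ldl": None},
    {"age": 47, "hba1c": None, "ldl": None},
    {"age": 39, "hba1c": None, "ldl": 110},
]

sparsity = missing_fraction(records, ["age", "hba1c", "ldl"])
print(sparsity)
```

Columns with a high missing fraction are the ones where an imputation step of the kind the abstract proposes would matter most.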
520    In this thesis, the following computational approaches 
       have been proposed to handle the challenges described 
       above for effective and improved predictive modeling: 1) 
       For handling high-dimensional data, a novel robust rank 
       aggregation-based feature selection technique has been 
       developed using exclusive rank aggregation strategies by 
       Borda (1781) and Kemeny (1959). The concept of robustness 
       of a feature selection algorithm has been introduced, 
       which can be defined as the property that characterizes 
       the stability of a ranked feature set toward achieving 
       similar classification accuracy across a wide range of 
       classifiers. This concept has been quantified with an 
       evaluation measure, namely the robustness index (RI). The 
       concept of inter-rater agreement for improving the quality
       of the rank aggregation approach for feature selection has
       also been proposed in this thesis 
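As an illustration of the rank aggregation idea mentioned in the abstract above, the following is a minimal sketch of a plain Borda count over several feature rankings. It is not the thesis's technique: the Kemeny strategy, the robustness index and the inter-rater agreement step are not reproduced, and the function name borda_aggregate and the feature names are hypothetical. Each input ranking is assumed to list the same features, best first, as if produced by a different feature selection method.

```python
from collections import defaultdict


def borda_aggregate(rankings):
    """Combine best-first rankings of the same features with a Borda count.

    In a ranking of n features, the feature at 0-based position p earns
    n - 1 - p points; features are then ordered by total points.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, feature in enumerate(ranking):
            scores[feature] += n - 1 - position
    # Highest total first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda f: (-scores[f], f))


# Hypothetical rankings from three different feature selection methods.
rankings = [
    ["age", "bmi", "glucose", "hdl"],
    ["glucose", "age", "hdl", "bmi"],
    ["age", "glucose", "bmi", "hdl"],
]

consensus = borda_aggregate(rankings)
print(consensus)  # ['age', 'glucose', 'bmi', 'hdl']
```

The top k features of the consensus ranking could then be passed to any classifier, which is the setting in which the abstract's notion of robustness across classifiers applies.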
520    2) A co-clustering concept dedicated to improving 
       predictive modeling has been proposed. The novel idea of 
       Learning based Co-Clustering (LCC) has been formulated as 
       an optimization problem for more effective and improved 
       predictive analysis. An important property of this 
       algorithm is that there is no need to specify the number 
       of co-clusters. A separate model testing framework has 
       also been proposed in this work to reduce model 
       over-fitting and obtain more accurate results. The 
       methodology has been evaluated on health care data as a 
       case study as well as several other publicly available 
       data sets 
520    3) A missing value imputation technique based on domain 
       experts' knowledge and statistical analysis of the 
       available data has been proposed in this thesis. The 
       medical domain of HSCT has been chosen for the case study,
       and the domain experts' knowledge is the opinion of a 
       group of stem cell transplant physicians. The machine 
       learning approach developed can be described as rule 
       mining with expert knowledge and similarity scoring based 
       missing value imputation. This technique has been 
       developed and validated using a real-world medical data 
       set. The results 
       demonstrate the effectiveness and utility of this 
       technique in practice 
533    Electronic reproduction.|bAnn Arbor, Mich. :|cProQuest,
       |d2018 
538    Mode of access: World Wide Web 
650  4 Computer science 
650  4 Health care management 
655  7 Electronic books.|2local 
690    0984 
690    0769 
710 2  ProQuest Information and Learning Co 
710 2  University of Minnesota.|bComputer Science 
773 0  |tDissertation Abstracts International|g77-03B(E) 
856 40 |uhttps://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=3728239
       |zclick for full text (PQDT) 
912    PQDT 