Computational Phenotype Discovery Using Unsupervised Feature Learning over Noisy, Sparse, and Irregular Clinical Data

Cited by: 190
Authors
Lasko, Thomas A. [1]
Denny, Joshua C. [1, 2]
Levy, Mia A. [1, 2, 3]
Affiliations
[1] Vanderbilt Univ, Sch Med, Dept Biomed Informat, Nashville, TN 37212 USA
[2] Vanderbilt Univ, Sch Med, Dept Med, Nashville, TN 37212 USA
[3] Vanderbilt Univ, Sch Med, Vanderbilt Ingram Canc Ctr, Nashville, TN 37212 USA
Keywords
ELECTRONIC MEDICAL-RECORDS; RISK STRATIFICATION; HEART-FAILURE; MODELS; REGULARIZATION; TIME;
DOI
10.1371/journal.pone.0066341
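Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
07; 0710; 09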
Abstract
Inferring precise phenotypic patterns from population-scale clinical data is a core computational task in the development of precision, personalized medicine. The traditional approach uses supervised learning, in which an expert designates which patterns to look for (by specifying the learning task and the class labels), and where to look for them (by specifying the input variables). While appropriate for individual tasks, this approach scales poorly and misses the patterns that we don't think to look for. Unsupervised feature learning overcomes these limitations by identifying patterns (or features) that collectively form a compact and expressive representation of the source data, with no need for expert input or labeled examples. Its rising popularity is driven by new deep learning methods, which have produced high-profile successes on difficult standardized problems of object recognition in images. Here we introduce its use for phenotype discovery in clinical data. This use is challenging because the largest source of clinical data - Electronic Medical Records - typically contains noisy, sparse, and irregularly timed observations, rendering them poor substrates for deep learning methods. Our approach couples dirty clinical data to deep learning architecture via longitudinal probability densities inferred using Gaussian process regression. From episodic, longitudinal sequences of serum uric acid measurements in 4368 individuals we produced continuous phenotypic features that suggest multiple population subtypes, and that accurately distinguished (0.97 AUC) the uric-acid signatures of gout vs. acute leukemia despite not being optimized for the task. The unsupervised features were as accurate as gold-standard features engineered by an expert with complete knowledge of the domain, the classification task, and the class labels. Our findings demonstrate the potential for achieving computational phenotype discovery at population scale. 
We expect such data-driven phenotypes to expose unknown disease variants and subtypes and to provide rich targets for genetic association studies.
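The abstract's central step, turning sparse and irregularly timed measurements into a continuous longitudinal estimate with uncertainty via Gaussian process regression, can be illustrated with a minimal sketch. This is not the authors' implementation: the stationary squared-exponential kernel, the hyperparameter values, the helper names `rbf_kernel` and `gp_posterior`, and the measurement values are all assumptions chosen for illustration, and the data are synthetic rather than taken from the paper.

```python
import numpy as np


def rbf_kernel(a, b, length_scale, signal_var):
    """Squared-exponential covariance between two 1-D arrays of times."""
    diff = a[:, None] - b[None, :]
    return signal_var * np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(t_obs, y_obs, t_grid,
                 length_scale=30.0, signal_var=1.0, noise_var=0.1):
    """Posterior mean and standard deviation of a GP evaluated on t_grid.

    The prior mean is a constant set to the sample mean of y_obs
    (an assumption made here for simplicity).
    """
    y_mean = y_obs.mean()
    # Covariance of the noisy observations.
    K = rbf_kernel(t_obs, t_obs, length_scale, signal_var)
    K += noise_var * np.eye(len(t_obs))
    # Cross-covariance between grid times and observation times.
    K_s = rbf_kernel(t_grid, t_obs, length_scale, signal_var)
    # Cholesky-based solve for numerical stability.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs - y_mean))
    mean = y_mean + K_s @ alpha
    v = np.linalg.solve(L, K_s.T)
    var = signal_var - np.sum(v ** 2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))


# Synthetic, made-up measurements (days since first visit, mg/dL).
t_obs = np.array([0.0, 3.0, 45.0, 50.0, 200.0, 365.0])
y_obs = np.array([7.8, 8.1, 6.5, 6.2, 9.0, 5.9])

# Regular daily grid: the kind of dense, evenly spaced input that
# downstream feature learning would expect.
t_grid = np.arange(0.0, 366.0, 1.0)
mean, std = gp_posterior(t_obs, y_obs, t_grid)
```

In this sketch the posterior standard deviation is small near clusters of measurements and grows toward the prior level in long gaps between visits, which is how a density-based representation can carry the sparsity of the record forward instead of hiding it.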
Pages: 13