Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes

Cited by: 75
Authors
Brock, Guy N. [2]
Shaffer, John R. [1]
Blakesley, Richard E. [3]
Lotz, Meredith J. [3]
Tseng, George C. [1, 3, 4]
Affiliations
[1] Univ Pittsburgh, Grad Sch Publ Hlth, Dept Human Genet, Pittsburgh, PA 15261 USA
[2] Univ Louisville, Sch Publ Hlth & Informat Sci, Dept Bioinformat & Biostat, Louisville, KY 40292 USA
[3] Univ Pittsburgh, Grad Sch Publ Hlth, Dept Biostat, Pittsburgh, PA 15261 USA
[4] Univ Pittsburgh, Sch Med, Dept Computat Biol, Pittsburgh, PA 15213 USA
DOI
10.1186/1471-2105-9-12
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: Gene expression data frequently contain missing values; however, most downstream analyses for microarray experiments require complete data. In the literature, many methods have been proposed to estimate missing values using the correlation patterns within the gene expression matrix. Each method has its own advantages, but the specific conditions under which each method is preferred remain largely unclear. In this report we describe an extensive evaluation of eight current imputation methods on multiple types of microarray experiments, including time series, multiple exposures, and multiple exposures × time series data. We then introduce two complementary selection schemes for determining the most appropriate imputation method for any given data set.
Results: We found that the optimal imputation algorithms (LSA, LLS, and BPCA) are all highly competitive with each other, and that no method is uniformly superior in all the data sets we examined. The success of each method can also depend on the underlying "complexity" of the expression data, where we take complexity to indicate the difficulty in mapping the gene expression matrix to a lower-dimensional subspace. We developed an entropy measure to quantify the complexity of expression matrices and found that, by incorporating this information, the entropy-based selection (EBS) scheme is useful for selecting an appropriate imputation algorithm. We further propose a simulation-based self-training selection (STS) scheme. This technique has been used previously for microarray data imputation, but for different purposes. The scheme selects the optimal or near-optimal method with high accuracy but at an increased computational cost.
Conclusion: Our findings provide insight into the problem of which imputation method is optimal for a given data set. Three top-performing methods (LSA, LLS, and BPCA) are competitive with each other. Global-based imputation methods (PLS, SVD, BPCA) performed better on microarray data with lower complexity, while neighbour-based methods (KNN, OLS, LSA, LLS) performed better on data with higher complexity. We also found that the EBS and STS schemes serve as complementary and effective tools for selecting the optimal imputation algorithm.
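The two selection ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not give the entropy formula, so `spectrum_entropy` assumes a normalized Shannon entropy over squared singular values. The self-training loop assumes the general recipe of hiding observed entries, re-imputing them, and comparing root-mean-square error. The two mean imputers are hypothetical stand-ins for the eight methods in the study, and every function and parameter name here is illustrative.

```python
import numpy as np

def spectrum_entropy(x):
    """Assumed complexity measure: normalized entropy of the squared
    singular values of a complete, row-centered matrix (needs >= 2 rows/cols)."""
    centered = x - x.mean(axis=1, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)
    p = s**2 / np.sum(s**2)
    p = p[p > 0]  # avoid log(0)
    return float(-(p * np.log(p)).sum() / np.log(len(s)))

def impute_row_mean(x):
    """Stand-in imputer: fill each missing entry with its row (gene) mean."""
    filled = x.copy()
    means = np.nanmean(x, axis=1)
    means = np.where(np.isnan(means), np.nanmean(x), means)  # empty-row fallback
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = means[rows]
    return filled

def impute_col_mean(x):
    """Stand-in imputer: fill each missing entry with its column (array) mean."""
    filled = x.copy()
    means = np.nanmean(x, axis=0)
    means = np.where(np.isnan(means), np.nanmean(x), means)  # empty-column fallback
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = means[cols]
    return filled

def self_training_select(x, methods, frac=0.05, n_rounds=5, seed=0):
    """Hide a fraction of the observed entries, impute them with each
    candidate, and return the method with the lowest mean RMSE."""
    rng = np.random.default_rng(seed)
    observed = np.argwhere(~np.isnan(x))
    n_hide = max(1, int(frac * len(observed)))
    scores = {name: [] for name in methods}
    for _ in range(n_rounds):
        pick = observed[rng.choice(len(observed), size=n_hide, replace=False)]
        rows, cols = pick[:, 0], pick[:, 1]
        masked = x.copy()
        masked[rows, cols] = np.nan  # simulate additional missing values
        for name, impute in methods.items():
            err = impute(masked)[rows, cols] - x[rows, cols]
            scores[name].append(np.sqrt(np.mean(err**2)))
    mean_rmse = {name: float(np.mean(v)) for name, v in scores.items()}
    return min(mean_rmse, key=mean_rmse.get), mean_rmse
```

Calling `self_training_select(x, {"row_mean": impute_row_mean, "col_mean": impute_col_mean})` on a matrix `x` with `NaN` for missing values returns the name of the lowest-error candidate and the per-method RMSE. Repeating the masking over several rounds is what makes this kind of scheme more expensive than a single entropy calculation, which is consistent with the computational cost noted in the abstract.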
Pages: 12