Score distributions for simultaneous matching to multiple motifs

Cited by: 52
Authors
Bailey, TL
Gribskov, M
Affiliations
[1] San Diego Supercomputer Center, San Diego, CA 92186-9784
Keywords
protein motifs; profiles; score p-values; score normalization; extreme-value distributions; sum statistics
DOI
10.1089/cmb.1997.4.45
Chinese Library Classification
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Several computer algorithms now exist for discovering multiple motifs (expressed as weight matrices) that characterize a family of protein sequences known to be homologous. This paper describes a method for performing similarity searches of protein sequence databases using such a group of motifs. By simultaneously using all the motifs that characterize a protein family, the sensitivity and specificity of the database search are increased. We define the p-value for a target sequence to be the probability of a random sequence of the same length scoring as well or better in comparison to all the motifs that characterize the family. (The p-value of a database search can be determined from this value and the size of the database.) We show that estimating the distribution of single motif scores by a Gaussian extreme value distribution is insufficiently accurate to provide a useful estimate of the p-value, but that this deficiency can be corrected by reestimating the parameters of the underlying Gaussian distribution from observed scores for comparison of a given motif and sequence database. These parameters are used to calculate a "reduced variate" which has a Gumbel limiting distribution. Multiple motif scores are combined to give a single p-value by using the sum of the reduced variates for the motif scores as the test statistic. We give a computationally efficient approximation to the distribution of the sum of independent Gumbel random variables and verify experimentally that it closely approximates the distribution of the test statistic. Experiments on pseudorandom sequences show that the approximated p-values are conservative, so the significance of high scores in database searches will not be overstated. Experiments with real protein sequences and motifs identified by the MEME algorithm show that determining an overall p-value based on the combination of multiple motifs gives significantly better database search results than using p-values of single motifs.
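The combination step described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's method: the abstract does not give the closed-form approximation to the distribution of the sum, so a plain Monte Carlo estimate is swapped in here. The sketch assumes the reduced variates have already been computed and each follows a standard Gumbel distribution with CDF exp(-exp(-x)). The function names (`gumbel_sf`, `sum_gumbel_pvalue_mc`, `database_pvalue`) are hypothetical, and the database-level formula 1 - (1 - p)^N is an assumption that treats the N database sequences as independent.

```python
import math
import random


def gumbel_sf(x):
    """P(G >= x) for a standard Gumbel variable G with CDF exp(-exp(-x))."""
    return -math.expm1(-math.exp(-x))


def sum_gumbel_pvalue_mc(reduced_variates, n_samples=20000, seed=0):
    """Monte Carlo estimate of P(sum of k standard Gumbels >= observed sum).

    Stand-in for the paper's analytic approximation (an assumption of this
    sketch). `reduced_variates` holds one reduced variate per motif.
    """
    rng = random.Random(seed)
    k = len(reduced_variates)
    observed = sum(reduced_variates)
    hits = 0
    for _ in range(n_samples):
        total = 0.0
        for _ in range(k):
            u = rng.random()
            while u == 0.0:  # keep u strictly inside (0, 1) so both logs are defined
                u = rng.random()
            total += -math.log(-math.log(u))  # inverse-CDF sample of a Gumbel
        if total >= observed:
            hits += 1
    return hits / n_samples


def database_pvalue(p_sequence, n_sequences):
    """Chance that at least one of n independent sequences reaches p_sequence.

    Computes 1 - (1 - p)^n in a numerically stable way; independence across
    database sequences is assumed.
    """
    if p_sequence >= 1.0:
        return 1.0
    return -math.expm1(n_sequences * math.log1p(-p_sequence))


if __name__ == "__main__":
    variates = [4.0, 3.5, 5.0]  # illustrative reduced variates for three motifs
    p_seq = sum_gumbel_pvalue_mc(variates)
    print("sequence p-value (Monte Carlo):", p_seq)
    print("database p-value for 50000 sequences:", database_pvalue(p_seq, 50000))
```

A Monte Carlo estimate cannot resolve p-values much smaller than 1/n_samples, which is one reason an analytic approximation such as the one the paper reports is preferable for the very small p-values that matter in database searches.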
Pages: 45-59
Page count: 15