Comparative accuracy of methods for protein sequence similarity search

Times cited: 21
Authors
Agarwal, P [1]
States, DJ [1]
Affiliations
[1] Washington Univ, Inst Biomed Comp, St Louis, MO 63110 USA
DOI
10.1093/bioinformatics/14.1.40
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Searching a protein sequence database for homologs is a powerful tool for discovering the structure and function of a sequence. Two new methods for searching sequence databases have recently been described: Probabilistic Smith-Waterman (PSW), which is based on hidden Markov models for a single sequence using a standard scoring matrix, and a new version of BLAST (WU-BLAST2), which uses Sum statistics for gapped alignments. Results: This paper compares and contrasts the effectiveness of these methods with three older methods (Smith-Waterman: SSEARCH, FASTA and BLASTP). The analysis indicates that the new methods are useful and often offer improved accuracy. These tools are compared using a curated (by Bill Pearson) version of the annotated portion of PIR 39. Three different statistical criteria are utilized: equivalence number, minimum errors and the receiver operating characteristic. For complete-length protein query sequences from large families, PSW's accuracy is superior to that of the other methods, but its accuracy is poor when used with partial-length query sequences. False negatives are twice as common as false positives irrespective of the search method if a family-specific threshold score that minimizes the total number of errors (i.e. the most favorable threshold score possible) is used. Thus, sensitivity, not selectivity, is the major problem. Among the analyzed methods using default parameters, the best accuracy was obtained from SSEARCH and PSW for complete-length proteins, and the two BLAST programs, plus SSEARCH, for partial-length proteins. Availability: The data and search tools are available from their original authors. Contact: agarwal@mh.us.sbphrd.com, states@ibc.wustl.edu.
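The three criteria named in the abstract can be illustrated on a ranked hit list. The sketch below is not the authors' code. It assumes the definitions commonly used in this literature: the equivalence number is the false-positive count at the cutoff where false positives equal false negatives, minimum errors is the smallest total of false positives and false negatives over all cutoffs, and the receiver operating characteristic is summarized here by the area under the curve. The function names and the toy scores are hypothetical, and tied scores are not handled specially.

```python
from typing import List, Tuple

Hit = Tuple[float, bool]  # (search score, True if the hit is a real family member)


def rank_labels(hits: List[Hit]) -> List[bool]:
    """Return membership labels ordered from best score to worst."""
    return [is_member for _, is_member in sorted(hits, key=lambda h: -h[0])]


def error_counts(labels: List[bool]) -> List[Tuple[int, int]]:
    """For each cutoff k (accept the top k hits), return (false_pos, false_neg)."""
    total_members = sum(labels)
    counts = []
    true_pos = 0
    for k in range(len(labels) + 1):
        if k > 0 and labels[k - 1]:
            true_pos += 1
        counts.append((k - true_pos, total_members - true_pos))
    return counts


def minimum_errors(labels: List[bool]) -> int:
    """Smallest false_pos + false_neg over all cutoffs (best family-specific threshold)."""
    return min(fp + fn for fp, fn in error_counts(labels))


def equivalence_number(labels: List[bool]) -> int:
    """False positives at the cutoff where false_pos == false_neg.

    With P true members, that cutoff is k = P, so this counts the
    non-members ranked among the top P hits.
    """
    total_members = sum(labels)
    return sum(1 for is_member in labels[:total_members] if not is_member)


def roc_area(labels: List[bool]) -> float:
    """Area under the ROC curve: fraction of (member, non-member) pairs ranked correctly."""
    members_seen = 0
    correct_pairs = 0
    for is_member in labels:
        if is_member:
            members_seen += 1
        else:
            correct_pairs += members_seen  # members ranked above this non-member
    n_members = members_seen
    n_others = len(labels) - n_members
    return correct_pairs / (n_members * n_others)


# Toy data (made up for illustration only).
example_hits = [(95, True), (90, True), (80, False), (75, True),
                (60, False), (50, True), (40, False), (30, False)]
example_labels = rank_labels(example_hits)
```

On the toy list, `equivalence_number(example_labels)` is 1, `minimum_errors(example_labels)` is 2, and `roc_area(example_labels)` is 0.8125. Lower values of the first two and higher values of the third indicate a more accurate search.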
Pages: 40-47
Page count: 8