Variations on probabilistic suffix trees: statistical modeling and prediction of protein families

Cited by: 107
Authors
Bejerano, G
Yona, G
Affiliations
[1] Hebrew Univ Jerusalem, Sch Engn & Comp Sci, IL-91904 Jerusalem, Israel
[2] Stanford Univ, Dept Biol Struct, Stanford, CA 94305 USA
DOI
10.1093/bioinformatics/17.1.23
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Motivation: We present a method for modeling protein families by means of probabilistic suffix trees (PSTs). The method is based on identifying significant patterns in a set of related protein sequences. The patterns can be of arbitrary length, and the input sequences need not be aligned, nor is delineation of domain boundaries required. The method is automatic and can be applied, without assuming any preliminary biological information, with surprising success. Basic biological considerations, such as amino acid background probabilities and amino acid substitution probabilities, can be incorporated to improve performance.
Results: The PST can serve as a predictive tool for protein sequence classification and for detecting conserved patterns (possibly functionally or structurally important) within protein sequences. The method was tested on the Pfam database of protein families with more than satisfactory performance. Exhaustive evaluations show that the PST model detects many more related sequences than pairwise methods such as Gapped-BLAST, and is almost as sensitive as a hidden Markov model trained from a multiple alignment of the input sequences, while being much faster.
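The idea of scoring a sequence by its variable-length contexts can be illustrated with a small sketch. This is not the authors' PST algorithm (which uses its own pruning and smoothing criteria); it is a simplified variable-order Markov model written under stated assumptions. The names `train_counts`, `next_symbol_prob`, and `average_log_likelihood`, and the parameters `max_depth`, `min_count`, and `pseudo`, are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict

# Assumption: sequences use only the 20 standard amino acid letters.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def train_counts(sequences, max_depth=3):
    """Count which symbol follows every context of length 0..max_depth."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, symbol in enumerate(seq):
            for depth in range(min(max_depth, i) + 1):
                context = seq[i - depth:i]  # the `depth` symbols preceding position i
                counts[context][symbol] += 1
    return counts


def next_symbol_prob(counts, history, symbol,
                     max_depth=3, min_count=2, pseudo=0.01):
    """Predict `symbol` from the longest suffix of `history` that was
    observed at least `min_count` times; fall back to the empty context."""
    for depth in range(min(max_depth, len(history)), -1, -1):
        context = history[len(history) - depth:]
        followers = counts.get(context, {})
        total = sum(followers.values())
        if total >= min_count or depth == 0:
            seen = followers.get(symbol, 0)
            # Additive smoothing so unseen symbols keep a nonzero probability.
            return (seen + pseudo) / (total + pseudo * len(AMINO_ACIDS))


def average_log_likelihood(counts, seq, max_depth=3):
    """Per-residue log-likelihood of `seq` under the trained counts."""
    total = 0.0
    for i, symbol in enumerate(seq):
        total += math.log(next_symbol_prob(counts, seq[:i], symbol, max_depth))
    return total / len(seq)
```

Trained on unaligned toy sequences that share a short motif, such a model would be expected to give a higher per-residue score to a new sequence containing the motif than to one that does not, which mirrors the classification use described in the abstract.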
Pages: 23-43
Page count: 21