Exploiting the past and the future in protein secondary structure prediction

Cited by: 331
Authors
Baldi, P [1]
Brunak, S
Frasconi, P
Soda, G
Pollastri, G
Affiliations
[1] Univ Calif Irvine, Coll Med, Dept Informat & Comp Sci, Irvine, CA 92697 USA
[2] Univ Calif Irvine, Coll Med, Dept Biol Chem, Irvine, CA 92697 USA
[3] Tech Univ Denmark, Ctr Biol Sequence Anal, DK-2800 Lyngby, Denmark
[4] Univ Florence, Dipartimento Sistemi & Informat, I-50139 Florence, Italy
DOI
10.1093/bioinformatics/15.11.937
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Motivation: Predicting the secondary structure of a protein (alpha-helix, beta-sheet, coil) is an important step towards elucidating its three-dimensional structure, as well as its function. Presently, the best predictors are based on machine learning approaches, in particular neural network architectures with a fixed and relatively short input window of amino acids, centered at the prediction site. Although a fixed small window avoids overfitting problems, it does not permit capturing variable long-range information.
Results: We introduce a family of novel architectures which can learn to make predictions based on variable ranges of dependencies. These architectures extend recurrent neural networks, introducing non-causal bidirectional dynamics to capture both upstream and downstream information. The prediction algorithm is completed by the use of mixtures of estimators that leverage evolutionary information, expressed in terms of multiple alignments, both at the input and output levels. While our system currently achieves an overall performance close to 76% correct prediction, at least comparable to the best existing systems, the main emphasis here is on the development of new algorithmic ideas.
Availability: The executable program for predicting protein secondary structure is available from the authors free of charge.
Contact: pfbaldi@ics.uci.edu, gpollast@ics.uci.edu, brunak@cbs.dtu.dk, paolo@dsi.unifi.it
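The bidirectional idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the single-layer tanh recurrences, the one-hot input encoding, the hidden size, and all names (`scan`, `brnn_predict`, `init_params`, `encode`) are assumptions chosen for brevity, and the weights are random and untrained. It only shows how a forward chain (upstream context) and a backward chain (downstream context) are combined with the local input to give a helix/strand/coil distribution at every position.

```python
import math
import random

AMINO = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids
STATES = ("H", "E", "C")        # helix, strand, coil


def encode(seq):
    """One-hot encode a sequence (an assumption; real systems often use alignment profiles)."""
    return [[1.0 if a == c else 0.0 for a in AMINO] for c in seq]


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def init_params(n_in, hidden, rng):
    """Random, untrained weights for the two chains and the output layer."""
    return {
        "W_in_f": rand_matrix(hidden, n_in, rng),
        "W_rec_f": rand_matrix(hidden, hidden, rng),
        "W_in_b": rand_matrix(hidden, n_in, rng),
        "W_rec_b": rand_matrix(hidden, hidden, rng),
        "W_out": rand_matrix(len(STATES), n_in + 2 * hidden, rng),
    }


def scan(inputs, W_in, W_rec):
    """Simple tanh recurrence; returns one hidden state per position."""
    state = [0.0] * len(W_rec)
    states = []
    for x in inputs:
        a, b = matvec(W_in, x), matvec(W_rec, state)
        state = [math.tanh(p + q) for p, q in zip(a, b)]
        states.append(state)
    return states


def brnn_predict(inputs, params):
    """Combine upstream (forward) and downstream (backward) context at each position."""
    fwd = scan(inputs, params["W_in_f"], params["W_rec_f"])
    # The backward chain reads the sequence right to left; re-reverse to align positions.
    bwd = scan(inputs[::-1], params["W_in_b"], params["W_rec_b"])[::-1]
    return [softmax(matvec(params["W_out"], x + f + b))
            for x, f, b in zip(inputs, fwd, bwd)]


rng = random.Random(0)
params = init_params(len(AMINO), 4, rng)
for residue, p in zip("MKVLA", brnn_predict(encode("MKVLA"), params)):
    print(residue, [round(v, 3) for v in p])
```

Because the backward chain carries information from the end of the sequence, changing the last residue alters the prediction at the first position, which a fixed input window centered on the first residue could not do once the change lies outside the window.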
Pages: 937-946
Number of pages: 10