CONTRAfold: RNA secondary structure prediction without physics-based models

Cited by: 385
Authors
Do, Chuong B. [1]
Woods, Daniel A. [1]
Batzoglou, Serafim [1]
Affiliations
[1] Stanford Univ, Dept Comp Sci, Stanford, CA 94305 USA
DOI
10.1093/bioinformatics/btl246
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Motivation: For several decades, free energy minimization methods have been the dominant strategy for single sequence RNA secondary structure prediction. More recently, stochastic context-free grammars (SCFGs) have emerged as an alternative probabilistic methodology for modeling RNA structure. Unlike physics-based methods, which rely on thousands of experimentally-measured thermodynamic parameters, SCFGs use fully-automated statistical learning algorithms to derive model parameters. Despite this advantage, however, probabilistic methods have not replaced free energy minimization methods as the tool of choice for secondary structure prediction, as the accuracies of the best current SCFGs have yet to match those of the best physics-based models.

Results: In this paper, we present CONTRAfold, a novel secondary structure prediction method based on conditional log-linear models (CLLMs), a flexible class of probabilistic models which generalize upon SCFGs by using discriminative training and feature-rich scoring. In a series of cross-validation experiments, we show that grammar-based secondary structure prediction methods formulated as CLLMs consistently outperform their SCFG analogs. Furthermore, CONTRAfold, a CLLM incorporating most of the features found in typical thermodynamic models, achieves the highest single sequence prediction accuracies to date, outperforming currently available probabilistic and physics-based techniques. Our result thus closes the gap between probabilistic and thermodynamic models, demonstrating that statistical learning procedures provide an effective alternative to empirical measurement of thermodynamic parameters for RNA secondary structure prediction.
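
Illustrative note: the abstract describes conditional log-linear models as scoring structures with weighted features. The sketch below shows the general form of such a model, in which the probability of a candidate structure is proportional to the exponential of its weighted feature sum. It is a toy illustration only, not the CONTRAfold implementation: the function name `cllm_probabilities`, the feature names, the weights, and the three candidate structures are all hypothetical, and a real predictor would sum over all structures with dynamic programming rather than enumerate a short list.

```python
import math


def cllm_probabilities(feature_counts, weights):
    """Return P(y | x) for each candidate structure y of one sequence x.

    feature_counts: dict mapping a candidate label to {feature: count}
    weights: dict mapping feature name to a real-valued weight
    """
    # Linear score of each candidate: sum of weight * feature count.
    scores = {
        label: sum(weights.get(name, 0.0) * count for name, count in feats.items())
        for label, feats in feature_counts.items()
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    exp_scores = {label: math.exp(s - top) for label, s in scores.items()}
    partition = sum(exp_scores.values())
    return {label: e / partition for label, e in exp_scores.items()}


# Hypothetical weights and candidates, chosen only for illustration.
weights = {"gc_pair": 1.2, "au_pair": 0.5, "hairpin_loop": -0.8}
candidates = {
    "unfolded": {},
    "hairpin_a": {"gc_pair": 2, "au_pair": 1, "hairpin_loop": 1},
    "hairpin_b": {"au_pair": 2, "hairpin_loop": 1},
}
probs = cllm_probabilities(candidates, weights)
best = max(probs, key=probs.get)
```

In discriminative training, weights of this kind would be fit to maximize the conditional probability of known structures given their sequences, which is the contrast with generatively trained SCFGs that the abstract draws.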
Pages: E90-E98
Page count: 9