Approaches to the automatic discovery of patterns in biosequences

Cited by: 143
Authors
Brazma, A
Jonassen, I [1]
Eidhammer, I
Gilbert, D
Affiliations
[1] Univ Bergen, Dept Informat, HIB, N-5020 Bergen, Norway
[2] European Bioinformat Inst, EMBL Outstn, Cambridge CB10 1SD, England
[3] City Univ London, Dept Comp Sci, London EC1V 0HB, England
Keywords
automatic discovery; bioinformatics; biosequences; machine learning; patterns
DOI
10.1089/cmb.1998.5.279
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
This paper surveys approaches to the discovery of patterns in biosequences and places these approaches within a formal framework that systematises the types of patterns and the discovery algorithms. Patterns with expressive power in the class of regular languages are considered, and a classification of pattern languages in this class is developed, covering the patterns that are the most frequently used in molecular bioinformatics. A formulation is given of the problem of the automatic discovery of such patterns from a set of sequences, and an analysis is presented of the ways in which an assessment can be made of the significance of the discovered patterns. It is shown that the problem is related to problems studied in the field of machine learning. The major part of this paper comprises a review of a number of existing methods developed to solve the problem and how these relate to each other, focusing on the algorithms underlying the approaches. A comparison is given of the algorithms, and examples are given of patterns that have been discovered using the different methods.
Pages: 279-305
Number of pages: 27