The global trace graph, a novel paradigm for searching protein sequence databases

Cited by: 17
Authors
Heger, Andreas
Mallick, Swapan
Wilton, Christopher
Holm, Liisa
Affiliations
[1] Univ Helsinki, Inst Biotechnol, FI-00014 Helsinki, Finland
[2] Univ Helsinki, Dept Biol & Environm Sci, Div Genet, FI-00014 Helsinki, Finland
[3] Univ Oxford, Dept Physiol Anat & Genet, MRC, Funct Genet Unit, Oxford OX1 3QX, England
[4] Harvard Univ, Sch Med, Dept Genet, Boston, MA USA
[5] Babraham Inst, Cambridge, England
DOI
10.1093/bioinformatics/btm358
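Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 [Biochemistry and Molecular Biology]; 081704 [Applied Chemistry];
Abstract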
Motivation: Propagating functional annotations to sequence-similar, presumably homologous proteins lies at the heart of the bioinformatics industry. Correct propagation is crucially dependent on the accurate identification of subtle sequence motifs that are conserved in evolution. The evolutionary signal can be difficult to detect because functional sites may consist of non-contiguous residues while segments in-between may be mutated without affecting fold or function.
Results: Here, we report a novel graph clustering algorithm in which all known protein sequences simultaneously self-organize into hypothetical multiple sequence alignments. This eliminates noise so that non-contiguous sequence motifs can be tracked down between extremely distant homologues. The novel data structure enables fast sequence database searching methods which are superior to profile-profile comparison at recognizing distant homologues. This study will boost the leverage of structural and functional genomics and opens up new avenues for data mining a complete set of functional signature motifs.
Pages: 2361-2367
Number of pages: 7