INTRINSIC AND EXTRINSIC APPROACHES FOR DETECTING GENES IN A BACTERIAL GENOME

Cited by: 78
Authors
BORODOVSKY, M [1]
RUDD, KE [1]
KOONIN, EV [1]
Affiliation
[1] NATL LIB MED, NATL CTR BIOTECHNOL INFORMAT, BETHESDA, MD 20894 USA
DOI
10.1093/nar/22.22.4756
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
The unannotated regions of the Escherichia coli genomic DNA sequence in the EcoSeq6 database, comprising 1,278 'intergenic' sequences with a combined length of 359,279 base pairs, were analyzed using computer-assisted methods with the aim of identifying putative unknown genes. The proposed strategy for finding new genes includes two key elements: (i) prediction of expressed open reading frames (ORFs) using the GeneMark method, which is based on Markov chain models for coding and non-coding regions of E. coli DNA, and (ii) a search for protein sequence similarities using programs based on the BLAST algorithm and programs for motif identification. A total of 354 putative expressed ORFs were predicted by GeneMark. Using the BLASTX and TBLASTN programs, it was shown that 208 ORFs located in the unannotated regions of the E. coli chromosome are significantly similar to other protein sequences. Identification of 182 ORFs as probable genes was supported by both GeneMark and BLAST; these comprise 51.4% of the GeneMark 'hits' and 87.5% of the BLAST 'hits'. Seventy-three putative new genes, comprising 20.6% of the GeneMark predictions, belong to ancient conserved protein families that include both eubacterial and eukaryotic members. This value is close to the overall proportion of highly conserved sequences among eubacterial proteins, indicating that most of the putative expressed ORFs that are predicted by GeneMark but have no significant BLAST hits are nevertheless likely to be real genes. The majority of the putative genes identified by the BLAST search have been described since the release of the EcoSeq6 database, but about 70 genes have not been detected so far. Among these new identifications are genes encoding proteins with a variety of predicted functions, including dehydrogenases, kinases, several other metabolic enzymes, ATPases, rRNA methyltransferases, membrane proteins, and different types of regulatory proteins.
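The agreement figures quoted in the abstract can be rechecked with a few lines of arithmetic. The sketch below is only an illustration, not part of the original record: the variable and function names are hypothetical, and the "only" and "union" counts assume that the GeneMark and BLAST results can be treated as two simple sets of ORFs, so that inclusion-exclusion applies.

```python
# Counts taken directly from the abstract.
genemark_orfs = 354  # putative expressed ORFs predicted by GeneMark
blast_orfs = 208     # ORFs with significant protein similarity (BLASTX/TBLASTN)
both = 182           # ORFs supported by both GeneMark and BLAST
conserved = 73       # putative genes in ancient conserved protein families


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Shares reported in the abstract.
share_of_genemark = percent(both, genemark_orfs)
share_of_blast = percent(both, blast_orfs)
conserved_share = percent(conserved, genemark_orfs)

# Derived counts (assumption: the two result lists behave as plain sets).
genemark_only = genemark_orfs - both
blast_only = blast_orfs - both
union = genemark_orfs + blast_orfs - both

print(share_of_genemark, share_of_blast, conserved_share)
print(genemark_only, blast_only, union)
```

Under that set assumption, the ORFs predicted by GeneMark alone are the ones the abstract argues are nevertheless likely to be real genes.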
Pages: 4756-4767
Number of pages: 12