SYSTEMATIC ANALYSIS OF CODING AND NONCODING DNA-SEQUENCES USING METHODS OF STATISTICAL LINGUISTICS

Cited by: 109
Authors
MANTEGNA, RN
BULDYREV, SV
GOLDBERGER, AL
HAVLIN, S
PENG, CK
SIMONS, M
STANLEY, HE
Affiliations
[1] BOSTON UNIV, DEPT PHYS, BOSTON, MA 02215 USA
[2] UNIV PALERMO, DIPARTIMENTO ENERGET & APPLICAZ FIS, I-90128 PALERMO, ITALY
[3] HARVARD UNIV, BETH ISRAEL HOSP, SCH MED, DIV CARDIOVASC, BOSTON, MA 02215 USA
[4] BOSTON UNIV, DEPT BIOMED ENGN, BOSTON, MA 02215 USA
[5] BAR ILAN UNIV, DEPT PHYS, RAMAT GAN, ISRAEL
DOI
10.1103/PhysRevE.52.2939
Chinese Library Classification (CLC) number
O35 [Fluid mechanics]; O53 [Plasma physics];
Discipline classification code
070204 ; 080103 ; 080704 ;
Abstract
We compare the statistical properties of coding and noncoding regions in eukaryotic and viral DNA sequences by adapting two tests developed for the analysis of natural languages and symbolic sequences. The data set comprises all 30 sequences of length above 50 000 base pairs in GenBank Release No. 81.0, as well as the recently published sequences of C. elegans chromosome III (2.2 Mbp) and yeast chromosome XI (661 Kbp). We find that for the three chromosomes we studied the statistical properties of noncoding regions appear to be closer to those observed in natural languages than those of the coding regions. In particular, (i) an n-tuple Zipf analysis of noncoding regions reveals a regime close to power-law behavior, while the coding regions show logarithmic behavior over a wide interval, and (ii) an n-gram entropy measurement shows that the noncoding regions have a lower n-gram entropy (and hence a larger "n-gram redundancy") than the coding regions. In contrast to the three chromosomes, we find that for vertebrates such as primates and rodents and for viral DNA, the difference between the statistical properties of coding and noncoding regions is not pronounced and therefore the results of the analyses of the investigated sequences are less conclusive. After noting the intrinsic limitations of the n-gram redundancy analysis, we also briefly discuss the failure of zeroth- and first-order Markovian models or simple nucleotide repeats to account fully for these "linguistic" features of DNA. Finally, we emphasize that our results by no means prove the existence of a "language" in noncoding DNA.
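As an illustration of the two tests named in the abstract, the following Python sketch computes an n-tuple Zipf ranking and an n-gram entropy for a nucleotide string. It is not the authors' implementation. The function names, the overlapping sliding window of step 1, and the redundancy formula 1 - H(n) / (n log2 4) are assumptions made here for illustration; the abstract does not specify them.

```python
from collections import Counter
from math import log2


def ngram_counts(seq, n):
    """Count n-tuples with an overlapping window of step 1 (an assumption)."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def zipf_ranking(seq, n):
    """Return (rank, relative frequency) pairs, most frequent n-tuple first."""
    counts = ngram_counts(seq, n)
    total = sum(counts.values())
    ordered = sorted(counts.values(), reverse=True)
    return [(rank, c / total) for rank, c in enumerate(ordered, start=1)]


def ngram_entropy(seq, n):
    """Shannon entropy, in bits, of the observed n-tuple distribution."""
    counts = ngram_counts(seq, n)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


def ngram_redundancy(seq, n, alphabet_size=4):
    """Redundancy as 1 - H(n) / (n * log2(alphabet_size)); assumed definition."""
    return 1.0 - ngram_entropy(seq, n) / (n * log2(alphabet_size))
```

Under these assumptions, plotting the output of `zipf_ranking` on log-log axes would show whether the rank-frequency curve approaches a straight line (power-law behavior), and a lower `ngram_entropy` corresponds to a larger `ngram_redundancy`. On short sequences the entropy estimate is biased downward for large n, which is one form of the intrinsic limitation the abstract alludes to.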
Pages: 2939-2950
Page count: 12
Related papers
36 in total
[1]   LANGUAGE AND CODIFICATION DEPENDENCE OF LONG-RANGE CORRELATIONS IN TEXTS [J].
Amit, M. ;
Shmerler, Y. ;
Eisenberg, E. ;
Abraham, M. ;
Shnerb, N. .
FRACTALS-COMPLEX GEOMETRY PATTERNS AND SCALING IN NATURE AND SOCIETY, 1994, 2 (01) :7-13
[2]   CHARACTERIZING LONG-RANGE CORRELATIONS IN DNA-SEQUENCES FROM WAVELET ANALYSIS [J].
ARNEODO, A ;
BACRY, E ;
GRAVES, PV ;
MUZY, JF .
PHYSICAL REVIEW LETTERS, 1995, 74 (16) :3293-3296
[3]   A GENERAL RULE FOR RANGED SERIES OF CODON FREQUENCIES IN DIFFERENT GENOMES [J].
BORODOVSKY, MY ;
GUSEIN-ZADE, SM .
JOURNAL OF BIOMOLECULAR STRUCTURE & DYNAMICS, 1989, 6 (05) :1001-1012
[4]  
BRILLOUIN L, 1956, SCI INFORMATION THEO
[5]   LONG-RANGE CORRELATION-PROPERTIES OF CODING AND NONCODING DNA-SEQUENCES - GENBANK ANALYSIS [J].
BULDYREV, SV ;
GOLDBERGER, AL ;
HAVLIN, S ;
MANTEGNA, RN ;
MATSA, ME ;
PENG, CK ;
SIMONS, M ;
STANLEY, HE .
PHYSICAL REVIEW E, 1995, 51 (05) :5084-5091
[6]  
BULDYREV SV, UNPUB
[7]   Genome sequence of the nematode C. elegans: A platform for investigating biology [J].
[Author unknown] .
SCIENCE, 1998, 282 (5396) :2012-2018
[8]   CORRELATIONS IN BINARY SEQUENCES AND A GENERALIZED ZIPF ANALYSIS [J].
CZIROK, A ;
MANTEGNA, RN ;
HAVLIN, S ;
STANLEY, HE .
PHYSICAL REVIEW E, 1995, 52 (01) :446-452
[9]   HIERARCHICAL APPROACH TO COMPLEXITY WITH APPLICATIONS TO DYNAMIC-SYSTEMS [J].
DALESSANDRO, G ;
POLITI, A .
PHYSICAL REVIEW LETTERS, 1990, 64 (14) :1609-1612
[10]   GAUGING SIMILARITY WITH N-GRAMS - LANGUAGE-INDEPENDENT CATEGORIZATION OF TEXT [J].
DAMASHEK, M .
SCIENCE, 1995, 267 (5199) :843-848