Automatic clustering of orthologs and in-paralogs from pairwise species comparisons

Cited by: 901
Authors
Remm, M
Storm, CEV
Sonnhammer, ELL [1]
Affiliations
[1] Karolinska Inst, Ctr Genom & Bioinformat, S-17177 Stockholm, Sweden
[2] Estonian Bioctr, EE-51010 Tartu, Estonia
Keywords
orthologs; paralogs; automatic clustering; genome comparison
DOI
10.1006/jmbi.2000.5197
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
Orthologs are genes in different species that originate from a single gene in the last common ancestor of these species. Such genes have often retained identical biological roles in the present-day organisms. It is hence important to identify orthologs for transferring functional information between genes in different organisms with a high degree of reliability. For example, orthologs of human proteins are often functionally characterized in model organisms. Unfortunately, orthology analysis between human and e.g. invertebrates is often complex because of large numbers of paralogs within protein families. Paralogs that predate the species split, which we call out-paralogs, can easily be confused with true orthologs. Paralogs that arose after the species split, which we call in-paralogs, however, are bona fide orthologs by definition. Orthologs and in-paralogs are typically detected with phylogenetic methods, but these are slow and difficult to automate. Automatic clustering methods based on two-way best genome-wide matches, on the other hand, have so far not separated in-paralogs from out-paralogs effectively. We present a fully automatic method for finding orthologs and in-paralogs from two species. Ortholog clusters are seeded with a two-way best pairwise match, after which an algorithm for adding in-paralogs is applied. The method bypasses multiple alignments and phylogenetic trees, which can be slow and error-prone steps in classical ortholog detection. Still, it robustly detects complex orthologous relationships and assigns confidence values for both orthologs and in-paralogs. The program, called INPARANOID, was tested on all completely sequenced eukaryotic genomes. To assess the quality of INPARANOID results, ortholog clusters were generated from a dataset of worm and mammalian transmembrane proteins, and were compared to clusters derived by manual tree-based ortholog detection methods. This study led to the identification with a high degree of confidence of over a dozen novel worm-mammalian ortholog assignments that were previously undetected because of shortcomings of phylogenetic methods. A WWW server that allows searching for orthologs between human and several fully sequenced genomes is installed at http://www.cgb.ki.se/inparanoid/. This is the first comprehensive resource with orthologs of all fully sequenced eukaryotic genomes. Programs and tables of orthology assignments are available from the same location. © 2001 Academic Press.
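The two steps named in the abstract (seeding a cluster with a two-way best match, then adding in-paralogs) can be illustrated with a minimal Python sketch. It is not the INPARANOID program. The function names, the toy protein identifiers, and the single symmetric score per pair are all hypothetical, and the in-paralog rule used here (a same-species protein joins the cluster when it scores at least as high against the seed as the seed pair scores across species) is an assumption about the general idea, since the abstract does not spell out the rule. Confidence values and the handling of overlapping clusters are left out.

```python
def best_hits(scores):
    """Map each query protein to its highest-scoring target: {query: (target, score)}."""
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return best


def inparalogs(seed, seed_score, within):
    """Same-species proteins scoring at least seed_score against the seed.

    ASSUMPTION: this threshold rule is illustrative; `within` holds one
    score per unordered pair of proteins from the same species.
    """
    found = set()
    for (p, q), score in within.items():
        if score >= seed_score:
            if p == seed:
                found.add(q)
            elif q == seed:
                found.add(p)
    return sorted(found)


def cluster_orthologs(between, within_a, within_b):
    """Seed clusters with two-way best matches, then add in-paralogs.

    between:  {(a, b): score} for protein a of species A and b of species B
    within_a: {(a1, a2): score} for protein pairs inside species A
    within_b: {(b1, b2): score} for protein pairs inside species B
    """
    best_a = best_hits(between)
    best_b = best_hits({(b, a): s for (a, b), s in between.items()})
    clusters = []
    for a, (b, score) in sorted(best_a.items()):
        # Keep the pair only if each protein is the other's best match.
        if best_b[b][0] != a:
            continue
        clusters.append({
            "seed": (a, b),
            "score": score,
            "inparalogs_a": inparalogs(a, score, within_a),
            "inparalogs_b": inparalogs(b, score, within_b),
        })
    return clusters


# Hypothetical toy scores (higher means more similar).
between = {("hA1", "wB1"): 500, ("hA2", "wB1"): 300,
           ("hA1", "wB2"): 200, ("hA3", "wB3"): 400}
within_a = {("hA1", "hA2"): 600, ("hA1", "hA3"): 100}
within_b = {("wB1", "wB2"): 150}

for cluster in cluster_orthologs(between, within_a, within_b):
    print(cluster)
```

In this toy input, hA2 is closer to hA1 than hA1 is to its cross-species partner wB1, so it is added to that cluster as an in-paralog, while wB2 falls below the seed score and stays out.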
Pages: 1041-1052
Page count: 12