Development and evaluation of an automated annotation pipeline and cDNA annotation system

Cited by: 23
Authors
Kasukawa, T
Furuno, M
Nikaido, I
Bono, H
Hume, DA
Bult, C
Hill, DP
Baldarelli, R
Gough, J
Kanapin, A
Matsuda, H
Schriml, LM
Hayashizaki, Y
Okazaki, Y
Quackenbush, J [1]
Affiliations
[1] Inst Genom Res, Rockville, MD 20850 USA
[2] RIKEN, Yokohama Inst, GSC, Lab Genome Explorat,Tsurumi Ku, Kanagawa 2300045, Japan
[3] NTT Software Corp, Adv Technol Dev Dept, Multimedia Dev Ctr, Kanagawa 2318554, Japan
[4] Univ Queensland, Inst Mol Biosci, ARC Special Res Ctr Funct & Appl Genom, Brisbane, Qld 4072, Australia
[5] Jackson Lab, Mouse Genome Informat Grp, Bar Harbor, ME 04609 USA
[6] MRC, Mol Biol Lab, Cambridge CB2 2QH, England
[7] European Bioinformat Inst, Cambridge CB10 1SD, England
[8] Osaka Univ, Grad Sch Informat Sci & Technol, Toyonaka, Osaka 5608531, Japan
[9] NIH, Natl Ctr Biotechnol Informat, Bethesda, MD 20892 USA
[10] RIKEN, Genome Sci Lab, Wako, Saitama 3510198, Japan
DOI
10.1101/gr.992803
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
Manual curation has long been held to be the "gold standard" for functional annotation of DNA sequence. Our experience with the annotation of more than 20,000 full-length cDNA sequences revealed problems with this approach, including inaccurate and inconsistent assignment of gene names, as well as many good assignments that were difficult to reproduce using only computational methods. For the FANTOM2 annotation of more than 60,000 cDNA clones, we developed a number of methods and tools to circumvent some of these problems, including an automated annotation pipeline that provides high-quality preliminary annotation for each sequence by introducing an "uninformative filter" that eliminates uninformative annotations, controlled vocabularies to accurately reflect both the functional assignments and the evidence supporting them, and a highly refined, Web-based manual annotation tool that allows users to view a wide array of sequence analyses and to assign gene names and putative functions using a consistent nomenclature. The ultimate utility of our approach is reflected in the low rate of reassignment of automated assignments by manual curation. Based on these results, we propose a new standard for large-scale annotation, in which the initial automated annotations are manually investigated and then computational methods are iteratively modified and improved based on the results of manual curation.
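The "uninformative filter" mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual rules, so everything here is an assumption for illustration: the pattern list, the function names `is_informative` and `first_informative`, and the fallback label `"unclassifiable"` are all hypothetical. The idea shown is only that ranked similarity-hit descriptions are scanned in order and the first one that carries real functional information is kept.

```python
import re

# Hypothetical patterns for descriptions that carry no functional information.
# The abstract does not list the real rules; these are illustrative only.
UNINFORMATIVE_PATTERNS = [
    r"\bhypothetical protein\b",
    r"\bunknown\b",
    r"\bunnamed protein product\b",
    r"\bcDNA clone\b",
]
_UNINFORMATIVE = re.compile("|".join(UNINFORMATIVE_PATTERNS), re.IGNORECASE)


def is_informative(description):
    """Return True if the description is non-empty and matches no pattern."""
    text = description.strip()
    return bool(text) and _UNINFORMATIVE.search(text) is None


def first_informative(descriptions, default="unclassifiable"):
    """Return the first informative description from a ranked list.

    `descriptions` is assumed to be ordered best hit first. The `default`
    label is a placeholder, not a term taken from the paper.
    """
    for description in descriptions:
        if is_informative(description):
            return description.strip()
    return default


if __name__ == "__main__":
    hits = [
        "hypothetical protein",
        "Unknown (protein for MGC:12345)",
        "cyclin-dependent kinase 2",
    ]
    print(first_informative(hits))
```

In a sketch like this, the pattern list is the part a curator would revise over time, which matches the iterative loop the abstract proposes: manual review exposes uninformative names that slipped through, and the filter is updated accordingly.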
Pages: 1542-1551
Page count: 10
Related papers
20 items in total
[1]   The genome sequence of Drosophila melanogaster [J].
Adams, MD ;
Celniker, SE ;
Holt, RA ;
Evans, CA ;
Gocayne, JD ;
Amanatides, PG ;
Scherer, SE ;
Li, PW ;
Hoskins, RA ;
Galle, RF ;
George, RA ;
Lewis, SE ;
Richards, S ;
Ashburner, M ;
Henderson, SN ;
Sutton, GG ;
Wortman, JR ;
Yandell, MD ;
Zhang, Q ;
Chen, LX ;
Brandon, RC ;
Rogers, YHC ;
Blazej, RG ;
Champe, M ;
Pfeiffer, BD ;
Wan, KH ;
Doyle, C ;
Baxter, EG ;
Helt, G ;
Nelson, CR ;
Miklos, GLG ;
Abril, JF ;
Agbayani, A ;
An, HJ ;
Andrews-Pfannkoch, C ;
Baldwin, D ;
Ballew, RM ;
Basu, A ;
Baxendale, J ;
Bayraktaroglu, L ;
Beasley, EM ;
Beeson, KY ;
Benos, PV ;
Berman, BP ;
Bhandari, D ;
Bolshakov, S ;
Borkova, D ;
Botchan, MR ;
Bouck, J ;
Brokstein, P .
SCIENCE, 2000, 287 (5461) :2185-2195
[2]  
ALTSCHUL SF, 1990, J MOL BIOL, V215, P403, DOI 10.1006/jmbi.1990.9999
[3]   The InterPro database, an integrated documentation resource for protein families, domains and functional sites [J].
Apweiler, R ;
Attwood, TK ;
Bairoch, A ;
Bateman, A ;
Birney, E ;
Biswas, M ;
Bucher, P ;
Cerutti, T ;
Corpet, F ;
Croning, MDR ;
Durbin, R ;
Falquet, L ;
Fleischmann, W ;
Gouzy, J ;
Hermjakob, H ;
Hulo, N ;
Jonassen, I ;
Kahn, D ;
Kanapin, A ;
Karavidopoulou, Y ;
Lopez, R ;
Marx, B ;
Mulder, NJ ;
Oinn, TM ;
Pagni, M ;
Servant, F ;
Sigrist, CJA ;
Zdobnov, EM .
NUCLEIC ACIDS RESEARCH, 2001, 29 (01) :37-40
[4]  
Ashburner M, 2001, GENOME RES, V11, P1425
[5]   The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000 [J].
Bairoch, A ;
Apweiler, R .
NUCLEIC ACIDS RESEARCH, 2000, 28 (01) :45-48
[6]   The Mouse Genome Database (MGD): the model organism database for the laboratory mouse [J].
Blake, JA ;
Richardson, JE ;
Bult, CJ ;
Kadin, JA ;
Eppig, JT .
NUCLEIC ACIDS RESEARCH, 2002, 30 (01) :113-115
[7]   ESTABLISHING A HUMAN TRANSCRIPT MAP [J].
BOGUSKI, MS ;
SCHULER, GD .
NATURE GENETICS, 1995, 10 (04) :369-371
[8]   DBEST - DATABASE FOR EXPRESSED SEQUENCE TAGS [J].
BOGUSKI, MS ;
LOWE, TMJ ;
TOLSTOSHEV, CM .
NATURE GENETICS, 1993, 4 (04) :332-333
[9]   READ: RIKEN expression array database [J].
Bono, H ;
Kasukawa, T ;
Hayashizaki, Y ;
Okazaki, Y .
NUCLEIC ACIDS RESEARCH, 2002, 30 (01) :211-213
[10]  
BONO H, 2003, GENOME RES