Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid genotype data

Cited by: 323
Authors
Fallin, D
Schork, NJ
Affiliations
[1] Case Western Reserve Univ, Dept Epidemiol & Biostat, Cleveland, OH 44109 USA
[2] Jackson Lab, Bar Harbor, ME 04609 USA
[3] Harvard Univ, Sch Publ Hlth, Dept Biostat, Boston, MA 02115 USA
[4] Harvard Univ, Sch Publ Hlth, Program Populat Genet, Boston, MA 02115 USA
DOI
10.1086/303069
Chinese Library Classification
Q3 [Genetics];
Discipline classification codes
071007; 090102
Abstract
Haplotype analyses have become increasingly common in genetic studies of human disease because of their ability to identify unique chromosomal segments likely to harbor disease-predisposing genes. The study of haplotypes is also used to investigate many population processes, such as migration and immigration rates, linkage-disequilibrium strength, and the relatedness of populations. Unfortunately, many haplotype-analysis methods require phase information that can be difficult to obtain from samples of nonhaploid species. There are, however, strategies for estimating haplotype frequencies from unphased diploid genotype data collected on a sample of individuals that make use of the expectation-maximization (EM) algorithm to overcome the missing phase information. The accuracy of such strategies, compared with other phase-determination methods, must be assessed before their use can be advocated. In this study we consider and explore sources of error between EM-derived haplotype frequency estimates and their population parameters, noting that much of this error is due to sampling error, which is inherent in all studies, even when phase can be determined. In light of this, we focus on the additional error between haplotype frequencies within a sample data set and EM-derived haplotype frequency estimates incurred by the estimation procedure. We assess the accuracy of haplotype frequency estimation as a function of a number of factors, including sample size, number of loci studied, allele frequencies, and locus-specific allelic departures from Hardy-Weinberg and linkage equilibrium. We point out the relative impacts of sampling error and estimation error, calling attention to the pronounced accuracy of EM estimates once sampling error has been accounted for. We also suggest that many factors that may influence accuracy can be assessed empirically within a data set, a fact that can be used to create "diagnostics" that a user can turn to for assessing potential inaccuracies in estimation.
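The EM strategy mentioned in the abstract can be illustrated with a minimal sketch for the simplest case, two biallelic loci. This is not the authors' implementation: the function name `em_two_locus`, the 0/1/2 allele-count coding of genotypes, the uniform starting frequencies, and the convergence tolerance are all assumptions made for illustration. With two loci, only double heterozygotes have ambiguous phase, so the E-step reduces to a single weight.

```python
from collections import Counter

HAPLOTYPES = ("00", "01", "10", "11")


def em_two_locus(genotypes, max_iter=200, tol=1e-10):
    """Estimate two-locus haplotype frequencies from unphased genotypes.

    genotypes: iterable of (g1, g2), where each value is the number of
    copies (0, 1, or 2) of the allele labeled '1' at that locus.
    (Hypothetical coding chosen for this sketch.)
    """
    counts = Counter(genotypes)
    n_chrom = 2 * sum(counts.values())
    if n_chrom == 0:
        raise ValueError("at least one genotype is required")

    # Haplotype counts from individuals whose phase is unambiguous,
    # i.e. everyone except double heterozygotes (1, 1).
    known = dict.fromkeys(HAPLOTYPES, 0.0)
    for (g1, g2), c in counts.items():
        if (g1, g2) == (1, 1):
            continue
        # Alleles carried on the two chromosomes at each locus. Because at
        # least one locus is homozygous, the pairing below is unambiguous.
        a1 = (int(g1 >= 1), int(g1 == 2))
        a2 = (int(g2 >= 1), int(g2 == 2))
        for x, y in zip(a1, a2):
            known[f"{x}{y}"] += c

    n_dh = counts[(1, 1)]  # number of double heterozygotes
    freq = dict.fromkeys(HAPLOTYPES, 0.25)  # assumed uniform start
    for _ in range(max_iter):
        # E-step: probability that a double heterozygote carries 00/11
        # rather than 01/10, given the current frequency estimates.
        cis = freq["00"] * freq["11"]
        trans = freq["01"] * freq["10"]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5

        # M-step: re-estimate frequencies from expected haplotype counts.
        expected = {
            "00": known["00"] + n_dh * w,
            "11": known["11"] + n_dh * w,
            "01": known["01"] + n_dh * (1 - w),
            "10": known["10"] + n_dh * (1 - w),
        }
        new_freq = {h: expected[h] / n_chrom for h in HAPLOTYPES}
        converged = max(abs(new_freq[h] - freq[h]) for h in HAPLOTYPES) < tol
        freq = new_freq
        if converged:
            break
    return freq


if __name__ == "__main__":
    # Two phase-known homozygotes plus one double heterozygote.
    print(em_two_locus([(0, 0), (2, 2), (1, 1)]))
```

In this sketch, the gap between the returned estimates and the true within-sample haplotype frequencies corresponds to what the abstract calls estimation error, as distinct from sampling error. Extending it to more loci would require enumerating all haplotype pairs compatible with each genotype.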
Pages: 947-959
Number of pages: 13