Detecting individual ancestry in the human genome
Andreas Wollstein^{1, 2} and Oscar Lao^{1, 3}
DOI: 10.1186/s13323-015-0019-x
© Wollstein and Lao; licensee BioMed Central. 2015
Received: 13 October 2014
Accepted: 12 January 2015
Published: 1 May 2015
Abstract
Detecting and quantifying the population substructure present in a sample of individuals is of major interest in genetic epidemiology, population genetics, and forensics, among other fields. To date, several algorithms have been proposed for estimating the amount of genetic ancestry within an individual. In the present review, we introduce the most widely used methods in population genetics for detecting individual genetic ancestry. We further show, by means of simulations, the performance of popular algorithms for detecting individual ancestry in various controlled demographic scenarios. Finally, we provide some hints on how to interpret the results from these algorithms.
Keywords
Population substructure; Human genetic variability; SNPs; Global ancestry; Individual ancestry; ADMIXTURE; fastSTRUCTURE; MDS; PCA; sNMF
Review
Introduction
The genetic variability within the human species is known to be relatively low compared to other primate species [1]. Paradoxically, there are more genetic differences between Western and Eastern chimpanzee individuals sampled in the African continent [2] than between the genomes of any two human individuals sampled in different continents [3]. Human genetic diversity also tends to be positively correlated with the geographic distance between the sampled individuals [4-6], which is mainly a result of isolation by distance [7]. Studies using the classical partition of the human genetic variance based on analysis of molecular variance (AMOVA [8]), and its generalization GAMOVA [9], have consistently shown that a small proportion (approximately 10% to 15%) of the total genetic variability is explained by continent of origin, whereas the majority (approximately 80%) is explained by within-individual variation. The remaining approximately 5% of the genetic variation is explained by the populations [10]. Interpreting these results in terms of human population substructure and individual prediction to a population cluster is still controversial [11]. Some argue that humans should be considered as one genetically homogeneous group [12]; others suggest that, although small, the geographic dependence of human genetic diversity (at least) supports the existence of continental groups [11,13].
Inferring population substructure in the human genome is cumbersome and is the main goal for the large number of genetic ancestry algorithms and approaches that have been proposed in the last decade. A basic assumption is that any current individual genome or population is a mixture of ancestries from past populations [14]. Therefore, genetic ancestry is defined at different scales of complexity: at populations, at individuals within a population, and at a locus within an individual. In the present review, we focus on current methods for inferring genetic ancestry in the genome of an individual. We analyze the performance of some of the most commonly used programs through simulated data and show the range of parameters in which each program provides reliable results in those settings.
Methods for identifying individual ancestry
Table 1. Algorithms commonly applied to SNP data for quantifying individual population substructure in humans

| Type | Method | Name of package | Web address | Reference |
|---|---|---|---|---|
| Model-free | Principal component analysis | EIGENSOFT^{a} | http://genetics.med.harvard.edu/reich/Reich_Lab/Software.html | [70] |
| | Principal components and Moran's I | adegenet (R software) | | [71] |
| | Multidimensional scaling | PLINK^{a} | | [28] |
| | Principal coordinates | PCO-MC | | [72] |
| | Spectral graph theory | GemTools | http://wpicr.wpic.pitt.edu/WPICCompGen/GemTools/GemTools.htm | [43] |
| | Spectral graph theory | SpectralGem | http://wpicr.wpic.pitt.edu/WPICCompGen/SpectralGEM/GEM+.htm | [56] |
| | Laplacian eigenfunction | LAPSTRUCT | | [57] |
| | Genetic algorithm coupled to AMOVA | GAGA | | [73] |
| Model-based | Log-likelihood HWE | ADMIXTURE | | [24] |
| | Log-likelihood HWE | FRAPPE | | [31] |
| | Bayesian HWE | STRUCTURE | | [22] |
| | Bayesian HWE | fastSTRUCTURE | | [59] |
| | Non-negative matrix factorization | sNMF | | [25] |
| | Bayesian | BAPS | | [74] |
| | Chromopainting and Bayesian classifier | fineSTRUCTURE | | [60] |
| | Log-likelihood genotypic/haplotypic gradients | LOCOLD | | [37] |
| | Log-likelihood allelic gradients | SPA | | [36] |
| | ADMIXTURE and linear regression | GPS | | [39] |
| | Bayesian clustering with spatial information | TESS | | [38] |
Popular methods for estimating the allele frequencies f of all loci in the ancestral populations and the ancestry proportions q of each individual include Bayesian approaches (for example, STRUCTURE [22]) and maximum likelihood approaches (for example, FRAPPE [31] and ADMIXTURE [24]).
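The quantity that maximum likelihood programs of this family optimize can be sketched as follows. This is a minimal illustration, assuming biallelic SNPs coded as 0, 1, or 2 copies of a reference allele and Hardy-Weinberg equilibrium within each ancestral population, as in the model described for ADMIXTURE [24]. The function name and the toy matrices are hypothetical and are not taken from any package.

```python
import math

def admixture_loglik(G, Q, F):
    """Log-likelihood of genotypes G given ancestry Q and allele frequencies F.

    G[i][j]: copies (0, 1, 2) of the reference allele, individual i, SNP j
    Q[i][k]: proportion of individual i's genome from ancestral population k
    F[k][j]: reference allele frequency of SNP j in ancestral population k
             (assumed strictly between 0 and 1 so that the logarithms exist)
    """
    loglik = 0.0
    for i, genotypes in enumerate(G):
        for j, g in enumerate(genotypes):
            # Allele frequency expected for this individual: mixture over K populations
            p = sum(Q[i][k] * F[k][j] for k in range(len(F)))
            loglik += g * math.log(p) + (2 - g) * math.log(1.0 - p)
    return loglik

# Toy example (hypothetical values): two individuals, two SNPs, K = 2
G = [[0, 2], [1, 1]]
Q = [[0.9, 0.1], [0.5, 0.5]]
F = [[0.1, 0.8], [0.6, 0.3]]
print(admixture_loglik(G, Q, F))
```

An optimizer would search over Q and F to maximize this value; a Bayesian program such as STRUCTURE instead places priors on both and samples from the posterior.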
Recently, new types of global ancestry methods have been proposed. These methods take advantage of the spatial dependence of human population substructure [32] to estimate ancestral geographic coordinates of an individual (BAPS2 [33], GENELAND [34], sPCA [35], SPA [36], LOCOLD [37], TESS [38], or GPS [39] among others).
Since model-based methods explore the space of possible solutions starting from an initial point, it is recommended to run the algorithm several times from different starting points for each proposed K and to check the reproducibility of the results [44]. Different strategies have been proposed for combining the results from different runs. One possibility is to compute a consensus ancestry value by merging all the solutions [44]. Another is simply to keep the run that provides the best value of the model performance criterion [24].
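Both strategies can be sketched in a few lines. The example below is an illustration only, not the CLUMPP algorithm [44]: it assumes each run is an individuals-by-K matrix of ancestry proportions, resolves label switching by exhaustive search over column permutations (feasible only for small K), and uses hypothetical function names.

```python
from itertools import permutations

def align_to_reference(q_ref, q_run):
    """Permute the columns (cluster labels) of q_run to best match q_ref."""
    k = len(q_ref[0])

    def cost(perm):
        # Sum of squared differences after relabeling the clusters of q_run
        return sum((row_ref[c] - row[perm[c]]) ** 2
                   for row_ref, row in zip(q_ref, q_run)
                   for c in range(k))

    best = min(permutations(range(k)), key=cost)
    return [[row[best[c]] for c in range(k)] for row in q_run]

def consensus(runs):
    """Average the ancestry matrices after aligning every run to the first."""
    ref = runs[0]
    aligned = [ref] + [align_to_reference(ref, run) for run in runs[1:]]
    n, k = len(ref), len(ref[0])
    return [[sum(a[i][c] for a in aligned) / len(aligned) for c in range(k)]
            for i in range(n)]

def best_run(runs, scores):
    """Return the run with the highest score (for example, the log-likelihood)."""
    return runs[max(range(len(runs)), key=lambda r: scores[r])]
```

A consensus that differs strongly from the individual runs is itself a warning that the runs converged to different solutions rather than to relabeled versions of the same one.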
Usually, investigators apply both model-free (for example, PCA or MDS) and model-based methods (for example, ADMIXTURE, FRAPPE, or STRUCTURE) to the same dataset [45,46]. Plots (and further interpretation) tend to include the solutions for the optimal or best supported number of clusters.
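As an illustration of the model-free side, the following sketch computes the first principal component of a small genotype matrix. It assumes genotypes coded as 0, 1, or 2 and the per-SNP centering and scaling by the square root of p(1 - p) described for eigenanalysis [27]; it uses plain power iteration instead of a full eigendecomposition and is not the EIGENSOFT implementation. All names and the toy data are hypothetical.

```python
import math
import random

def normalize_genotypes(G):
    """Center each SNP column and scale it by sqrt(p * (1 - p))."""
    n, m = len(G), len(G[0])
    X = [[0.0] * m for _ in range(n)]
    for j in range(m):
        mean = sum(G[i][j] for i in range(n)) / n
        p = mean / 2.0
        if p <= 0.0 or p >= 1.0:
            continue  # monomorphic SNP: carries no information, left at zero
        scale = math.sqrt(p * (1.0 - p))
        for i in range(n):
            X[i][j] = (G[i][j] - mean) / scale
    return X

def first_pc(G, iters=200, seed=1):
    """Leading eigenvector of the individual-by-individual covariance matrix."""
    X = normalize_genotypes(G)
    n, m = len(X), len(X[0])
    C = [[sum(X[a][j] * X[b][j] for j in range(m)) / m for b in range(n)]
         for a in range(n)]
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iters):  # power iteration
        w = [sum(C[a][b] * v[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Toy data: individuals 0-1 and 2-3 come from two differentiated groups
G = [[0, 0, 2, 2],
     [0, 1, 2, 1],
     [2, 2, 0, 0],
     [2, 1, 0, 1]]
print(first_pc(G))
```

In this toy example the coordinates of the first two individuals share one sign and those of the last two share the opposite sign, which is the kind of separation usually inspected in a PCA plot.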
Further improvements in genotyping technology, with the description of millions of single nucleotide polymorphisms (SNPs) in the human genome [15], have enabled a third generation of ancestry methods that model the genetic ancestry of local fragments of the genome, such as HapMix or the StepPCO scripts [14,47], among others.
Ups and downs of individual genetic ancestry estimation
Individual ancestry methods can depict a detailed picture of the genetic landscape of human populations [15]. Furthermore, these algorithms are routinely applied to any dataset before conducting a genome-wide association study (GWAS), in order to correct for the putative presence of hidden population substructure [48]. Moreover, they have been used to test hypotheses about the ancestral origin of the perpetrator at a crime scene in forensic cases [49].
In principle, averaging the fragments of local ancestry over the genome of one individual gives the global ancestry estimate for that individual; similarly, averaging all of the global individual ancestries in one population provides a migration/admixture estimate for that population. Moreover, the mean and variance of the length of the ancestry fragments and the global ancestry proportions can be used to estimate parameters such as the time or migration rate of the admixture event in particular demographic scenarios [50]. Nevertheless, population-based methods are sometimes preferred over global or local ancestry methods [18,51]. The main reason is that the results of global and local ancestry methods can be particularly difficult to interpret [21,52]. For example, several demographic scenarios can produce the same observed admixture pattern in PCA [30,53,54]. In humans, multiple demographic events can be identified in the same geographic area [55]; therefore, a plausible ad hoc explanation can likely be found for any estimated admixture pattern (for example, see [53]). Unequal sample sizes of the (a priori unknown) populations can also bias the output of some algorithms, such as PCA [30,56]; highly genetically related individuals and genetic outliers can likewise bias the output of different algorithms (such as PCA [57]). Furthermore, the outcome of the different algorithms can differ substantially even for the same dataset [58]. Ultimately, there is the question of what a proposed ‘ancestral population’ is. By definition, since new populations appear by splitting from previous ones, population ancestry (and hence genetic admixture) can be defined at different time scales, taking into account that all individuals from a species ultimately share a common ancestral origin.
However, this population ‘birth and death’ process is not really modeled in the model-based methods (nor, by default, in the model-free methods); in contrast, it is one of the main goals of population-based methods, conditional on a proper definition of ‘what a current population is’.
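The two averaging steps described above can be written down directly. The sketch below is a minimal illustration, assuming that local ancestry is available as (start, end, label) segments along the genome and that the average over fragments is weighted by segment length; the function names and coordinates are hypothetical.

```python
def global_ancestry(segments, source):
    """Length-weighted fraction of the genome assigned to one source population.

    segments: list of (start, end, label) tuples covering the genome
    """
    total = sum(end - start for start, end, _ in segments)
    from_source = sum(end - start
                      for start, end, label in segments if label == source)
    return from_source / total

def population_admixture(individuals, source):
    """Mean of the individual global ancestries in a sampled population."""
    values = [global_ancestry(segments, source) for segments in individuals]
    return sum(values) / len(values)

# Toy example: two individuals, genome of length 100 in arbitrary units
ind1 = [(0, 30, "A"), (30, 100, "B")]
ind2 = [(0, 20, "B"), (20, 70, "A"), (70, 100, "B")]
print(global_ancestry(ind1, "A"))               # 30 of 100 units from A
print(population_admixture([ind1, ind2], "A"))  # mean over both individuals
```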
We exemplify some of these caveats using unsupervised analyses from four ascertained global-based algorithms on simulated and real data, using the default parameter settings of each algorithm. In particular, we consider ADMIXTURE [24], sNMF [25], fastSTRUCTURE [59], PCA [27], and MDS in PLINK [28]. This selection is based on methodological, historical, and computational characteristics. For example, we did not consider fineSTRUCTURE [60], a recently developed algorithm with enhanced power for detecting population substructure [61], because of its computational burden when the numbers of SNPs and sampled individuals are large (see the manuals of fineSTRUCTURE and chromoPainter for details). ADMIXTURE and fastSTRUCTURE represent model-based algorithms. ADMIXTURE [24] is a maximum likelihood algorithm. It can be considered the gold standard of model-based methods; it is relatively fast and allows for the use of a large number of SNPs and samples. fastSTRUCTURE is new software that implements a Bayesian framework similar to STRUCTURE [22]. However, in contrast to STRUCTURE, fastSTRUCTURE allows the fast analysis of a large number of samples and SNPs. PCA, MDS, and sNMF are model-free methods. PCA and MDS are based on eigenvalue decomposition. They produce almost identical results in real data [62,63]; therefore, we have used them interchangeably in the different simulations. sNMF [25] is novel software that in principle produces results very similar to those of ADMIXTURE [24] but at a lower computational cost.
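The close agreement between PCA and MDS has a simple algebraic basis, which the following sketch checks numerically. It assumes classical (metric) MDS applied to Euclidean distances between column-centered genotype vectors; in that case the double-centered matrix of squared distances equals the individual-by-individual cross-product matrix decomposed in PCA, so both methods share eigenvectors. PLINK computes MDS from identity-by-state distances, so in practice the match is approximate rather than exact. The names and toy data are hypothetical.

```python
def center_columns(G):
    """Subtract the mean of each SNP column."""
    n, m = len(G), len(G[0])
    means = [sum(G[i][j] for i in range(n)) / n for j in range(m)]
    return [[G[i][j] - means[j] for j in range(m)] for i in range(n)]

def gram_matrix(X):
    """Individual-by-individual cross-product matrix X X^T (PCA side)."""
    return [[sum(a * b for a, b in zip(ra, rb)) for rb in X] for ra in X]

def double_centered_distances(X):
    """Classical MDS matrix B = -1/2 * J * D^2 * J for Euclidean distances."""
    n = len(X)
    D2 = [[sum((a - b) ** 2 for a, b in zip(ra, rb)) for rb in X] for ra in X]
    row_mean = [sum(D2[i]) / n for i in range(n)]  # D2 is symmetric
    grand_mean = sum(row_mean) / n
    return [[-0.5 * (D2[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]

G = [[0, 1, 2], [2, 1, 0], [1, 1, 2], [0, 2, 2]]
X = center_columns(G)
A = gram_matrix(X)
B = double_centered_distances(X)
max_diff = max(abs(A[i][j] - B[i][j]) for i in range(4) for j in range(4))
print(max_diff)  # numerically zero, up to floating-point rounding
```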
Performance of global-based algorithms to estimate genetic ancestry on two simulated populations
Two populations with a genetic barrier
Table 2. Default parameters used in the two-population models, with and without migration

| Parameter | Abbreviation | Default value |
|---|---|---|
| Sample size population 1 | n1 | 100 |
| Sample size population 2 | n2 | 100 |
| Number of independent SNPs | nsnps | 5,000 |
| Mutation rate (length)^{a} | theta | 2 |
| Effective population size^{b} | N1, N2 | 10,000 |
| Divergence time | T1 | 2,000 |
| Constant migration rate | 4Nm | 0 |
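The default values can be converted to the scaled units used when reporting results. The short sketch below assumes the usual coalescent scaling, with time expressed in units of 4N generations (the T/(4N1) scale on which divergence time is reported) and migration expressed as 4Nm; the dictionary keys mirror the abbreviations of the table, and the function names are hypothetical.

```python
def scaled_divergence_time(t_generations, n_effective):
    """Divergence time in units of 4N generations."""
    return t_generations / (4.0 * n_effective)

def migrant_fraction(four_nm, n_effective):
    """Per-generation fraction of migrants m recovered from 4Nm."""
    return four_nm / (4.0 * n_effective)

defaults = {"n1": 100, "n2": 100, "nsnps": 5000, "theta": 2,
            "N": 10000, "T1": 2000, "4Nm": 0}

print(scaled_divergence_time(defaults["T1"], defaults["N"]))  # 0.05
print(migrant_fraction(100, defaults["N"]))                   # 0.0025
```

Under these assumptions, the default divergence time of 2,000 generations corresponds to the scaled value 0.05.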
Table 3. Results from the two-population model simulations

| Variable | sNMF (R^{2}) | ADMIXTURE (R^{2}) | fastSTRUCTURE (R^{2}) |
|---|---|---|---|
| Sampling depth, n1, n2 | | | |
| 8 | 99.92 | 100 | 39.56 |
| 10 | 99.83 | 100 | 34.03 |
| 20 | 99.87 | 100 | 100 |
| 40 | 99.81 | 100 | 100 |
| 100 | 99.74 | 100 | 100 |
| Uneven sampling, n1 | | | |
| 8 | 98.94 | 99.45 | 98.59 |
| 10 | 99.43 | 99.78 | 99.32 |
| 20 | 99.61 | 100 | 92.21 |
| 40 | 99.67 | 100 | 100 |
| 100 | 99.74 | 100 | 100 |
| Sequencing depth, nsnps | | | |
| 10 | 3.13 | 0.65 | 18.51 |
| 50 | 66.56 | 75.54 | 74.42 |
| 100 | 85.33 | 92.95 | 91.89 |
| 500 | 96.78 | 99.87 | 99.93 |
| 1,000 | 98.62 | 99.99 | 100 |
| 5,000 | 99.74 | 100 | 100 |
| Population size, theta | | | |
| 1 | 99.73 | 100 | 100 |
| 2 | 99.74 | 100 | 100 |
| 5 | 99.74 | 100 | 100 |
| 10 | 99.72 | 100 | 100 |
| Effective population size, N2 | | | |
| 100 | 99.98 | 100 | 100 |
| 2,500 | 99.94 | 100 | 100 |
| 7,500 | 99.82 | 100 | 100 |
| 10,000 | 99.74 | 100 | 100 |
| Divergence time (F_{st}), T/(4N_{1}) | | | |
| 0.000075 | 0.54 | 0.38 | 0.01 |
| 0.00025 | 0.24 | 0.03 | 0 |
| 0.00125 | 6.19 | 0.03 | 0.24 |
| 0.0025 | 69.36 | 95.28 | 0.53 |
| 0.0125 | 98.36 | 100 | 100 |
| 0.05 | 99.74 | 100 | 100 |
| Constant migration rate, 4Nm | | | |
| 0.1 | 99.77 | 100 | 100 |
| 1 | 99.78 | 100 | 100 |
| 5 | 99.56 | 100 | 100 |
| 10 | 99.15 | 99.99 | 100 |
| 50 | 93.95 | 99.98 | 33.3 |
| 100 | 41.61 | 94.06 | 0.56 |
Table 4. Results from admixture simulations with varying parameters in the HI model based on HapMap III data

| Parameter | sNMF (R^{2}) | ADMIXTURE (R^{2}) | fastSTRUCTURE (R^{2}) |
|---|---|---|---|
| Sample size | | | |
| 8 | 98.2 | 99 | 86.66 |
| 10 | 99.5 | 99.52 | 98.69 |
| 20 | 99.74 | 99.82 | 99.71 |
| 40 | 99.85 | 99.9 | 99.86 |
| 50 | 99.87 | 99.93 | 99.9 |
| 100 | 99.91 | 99.95 | 99.95 |
| nsnps | | | |
| 5 | 4.56 | 15.38 | 19.44 |
| 10 | 15.92 | 47.37 | 46.2 |
| 50 | 80.62 | 86.31 | 86.89 |
| 100 | 89.67 | 93.04 | 93.33 |
| 500 | 98.46 | 99.07 | 99.11 |
| 1,000 | 99.19 | 99.54 | 99.56 |
| 5,000 | 99.84 | 99.92 | 99.91 |
| 10,000 | 99.91 | 99.95 | 99.95 |
| Nbreaks | | | |
| 5 | 88.82 | 88.37 | 87.46 |
| 10 | 94.38 | 94.86 | 94.43 |
| 50 | 98.74 | 98.87 | 98.8 |
| 100 | 99.33 | 99.41 | 99.38 |
| 500 | 99.81 | 99.85 | 99.84 |
| 1,000 | 99.86 | 99.91 | 99.9 |
| 5,000 | 99.91 | 99.94 | 99.94 |
| 10,000 | 99.91 | 99.95 | 99.95 |
| alpha | | | |
| 0.01 | 99.94 | 99.99 | 99.99 |
| 0.03 | 99.93 | 99.97 | 99.95 |
| 0.07 | 99.93 | 99.97 | 99.92 |
| 0.1 | 99.93 | 99.97 | 99.91 |
| 0.3 | 99.91 | 99.96 | 99.95 |
| 0.5 | 99.91 | 99.95 | 99.95 |
Migration between the two descendant populations (continuous gene flow model)
In addition, we studied the parameter range in which migration becomes detectable depending on the start time and rate of migration in the continuous gene flow (CGF) model (see Figure 1A for the model and Figure 2 for results). Keeping the migration rate fixed at a high value (4Nm = 2,000), the populations remain distinguishable if migration started less than 100 generations back in time (Figure 2B). Beyond that value, the effect of migration is so strong that the two populations appear to be panmictic. In contrast, when fixing the start time of migration at ten generations, we observe that all populations are recognized by all programs for 4Nm < 500. The estimated proportions of ancestry do not match the proportion of migrants over time. A possible reason is that, because gene flow from one population into the other is continuous, recombination has not had enough time to produce the homogeneous mosaic of ancestral fragments that emerges under the HI model (see below). Therefore, the migration rate cannot be inferred from this analysis.
Performance of the algorithms on the hybrid admixture (HI) model
Simulated data
Analyses focused on the estimated individual ancestry proportions in the hybrid population using the HI model (Figure 1B). We compared them with the real proportions of genomic admixture in each individual; this measure was estimated for each simulation by tracing back the ancestry of the genomic fragments that compose the genome of each admixed individual to either of the two parental populations. Therefore, in contrast to other approaches, which produce admixed individuals in forward generations from sampled real populations (that is, African Americans have been modeled as a mixture of CEU and YRI individuals from HapMap III [66]; also see the next section), we avoid the artificial introduction of strong bottlenecks.
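The comparison between estimated and real ancestry proportions can be summarized by a single agreement statistic. The sketch below assumes that the R2 values reported in the result tables are squared Pearson correlations between true and estimated proportions, expressed as percentages; this definition is an assumption made for illustration, and the function name is hypothetical. With two ancestral populations the statistic is unaffected by label switching, because replacing q by 1 - q only changes the sign of the correlation.

```python
def r_squared_percent(true_q, est_q):
    """Squared Pearson correlation between two lists of proportions, in percent."""
    n = len(true_q)
    mean_true = sum(true_q) / n
    mean_est = sum(est_q) / n
    cov = sum((t - mean_true) * (e - mean_est) for t, e in zip(true_q, est_q))
    var_true = sum((t - mean_true) ** 2 for t in true_q)
    var_est = sum((e - mean_est) ** 2 for e in est_q)
    if var_true == 0.0 or var_est == 0.0:
        return 0.0  # no variation: the correlation is undefined, report no agreement
    return 100.0 * cov * cov / (var_true * var_est)

true_q = [0.10, 0.25, 0.40, 0.55, 0.90]
est_q = [0.12, 0.22, 0.43, 0.50, 0.88]
print(r_squared_percent(true_q, est_q))
# Same value when the cluster labels of the estimate are swapped
print(r_squared_percent(true_q, [1.0 - e for e in est_q]))
```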
Real data from HapMap III
We produced simulations of synthetically admixed populations using African (YRI) and European (CEU) samples as ancestral populations (see Table 4 for results and clarification of the applied methodology). We used the number of breakpoints to mimic the time of admixture [14] and sampled SNPs with a minimum distance of 1 Mb to ensure linkage equilibrium. The results for sample size, number of SNPs, and admixture time, represented here as the number of breaks, are quite similar to those of the two-population simulations above. The power of sNMF and ADMIXTURE is quite comparable. fastSTRUCTURE loses power more rapidly with lower sample size and maintains better power for low numbers of SNPs. All programs have an equally high power to estimate the ancestry components.
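One way to build such a synthetic admixed chromosome is sketched below. The exact procedure is not spelled out in this section, so the sketch rests on stated assumptions: breakpoints are placed uniformly at random along the SNP positions, and each resulting segment is copied from the first parental haplotype with probability alpha and from the second otherwise. The function and variable names are hypothetical. Because the origin of every position is recorded, the true ancestry proportion of the mosaic is known and can be compared with the estimates.

```python
import random

def mosaic_haplotype(hap_a, hap_b, n_breaks, alpha, rng):
    """Build an admixed haplotype and return it with its true ancestry proportion.

    hap_a, hap_b: parental haplotypes (lists of alleles, same length)
    n_breaks:     number of breakpoints (at most len(hap_a) - 1)
    alpha:        probability that a segment is copied from hap_a
    rng:          random.Random instance, passed in for reproducibility
    """
    n = len(hap_a)
    cuts = sorted(rng.sample(range(1, n), n_breaks))
    bounds = [0] + cuts + [n]
    hap, from_a_flags = [], []
    for start, end in zip(bounds, bounds[1:]):
        from_a = rng.random() < alpha
        source = hap_a if from_a else hap_b
        hap.extend(source[start:end])
        from_a_flags.extend([from_a] * (end - start))
    true_alpha = sum(from_a_flags) / n
    return hap, true_alpha

# Toy parents: one carries allele 1 everywhere, the other allele 0 everywhere
parent_a = [1] * 100
parent_b = [0] * 100
hap, true_alpha = mosaic_haplotype(parent_a, parent_b, 5, 0.3, random.Random(0))
print(true_alpha)
```

More breakpoints produce shorter ancestry fragments, which is how the number of breaks stands in for an older admixture event.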
Conclusions
Identifying hidden population substructure in the genome of an individual is important for a number of scientific disciplines. So far, the proposed algorithms are invaluable tools for detecting and controlling for the presence of hidden population substructure. In the simplest demographic models, these methods can also be used to estimate demographic parameters. However, interpreting the output of each algorithm from an evolutionary point of view can be difficult. Different demographic scenarios can lead to the same ancestry estimates, and different estimates can be retrieved when different algorithms are applied to the same dataset. Extrapolating the results from our simple simulations to real data (that is, suggesting which is the best algorithm) can be misleading; except for cases such as the admixture of European and Sub-Saharan African populations in the US [67], admixture usually involves more than two parental populations (for example, Latin America, although see [68]). In addition, parental populations tend to show non-negligible gene flow [61] with admixed populations, which can differ substantially in effective population size from the parental populations (for example, see the European Romani [46]), while usually the parental populations are unknown.
The number of SNPs and the sample size seem to be limiting factors in all the algorithms that we have tested; therefore, it is recommended to use as many markers (conditional on the absence of LD when required by the algorithm) and samples as possible. However, in our simple model, we already observe good estimates for >10 samples and >1,000 markers. If fewer markers are available, fastSTRUCTURE provides the best estimates, followed by ADMIXTURE and sNMF. Furthermore, it is advisable to run more than one algorithm on the same data, given the observed diversity of results, the different sensitivity of the algorithms to biased sample sizes, and ancestry noise. In this sense, combining global ancestry and population ancestry methods (for example, [69]), or using the output from these algorithms as summary statistics [40], can improve the identification of population substructure. Finally, although these algorithms can be used to generate hypotheses about the origin and evolution of populations, it is recommended to test such evolutionary hypotheses by means of other methods [46], rather than providing an ad hoc interpretation; in particular, any demographic interpretation from these methods should be further validated by means of demographic simulations, showing that the proposed demographic model can produce the observed output of genetic ancestry.
Notes
Abbreviations
AMOVA: Analysis of molecular variance
CGF: Continuous gene flow model
GAMOVA: Generalized analysis of molecular variance
GWAS: Genome-wide association study
HWE: Hardy-Weinberg equilibrium
MDS: Classical multidimensional scaling, also called principal coordinate analysis
PCA: Principal component analysis
SNP: Single nucleotide polymorphism
Declarations
Acknowledgements
This study was funded in part by the Erasmus University Medical Center Rotterdam. AW was additionally supported by Volkswagen Foundation (ref 80462). We would like to thank Susan Walsh and Wolfgang Stephan for helpful comments on the manuscript.
Authors’ Affiliations
References
1. Cavalli-Sforza LL. Human evolution and its relevance for genetic epidemiology. Annu Rev Genomics Hum Genet. 2007;8:1–15.
2. Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature. 2005;437:69–87.
3. Barbujani G, Colonna V. Human genome diversity: frequently asked questions. Trends Genet. 2010;26:285–95.
4. Handley LJL, Manica A, Goudet J, Balloux F. Going the distance: human population genetics in a clinal world. Trends Genet. 2007;23:432–9.
5. Ramachandran S, Deshpande O, Roseman CC, Rosenberg NA, Feldman MW, Cavalli-Sforza LL. Support from the relationship of genetic and geographic distance in human populations for a serial founder effect originating in Africa. Proc Natl Acad Sci U S A. 2005;102:15942–7.
6. Rosenberg NA. Algorithms for selecting informative marker panels for population assignment. J Comput Biol. 2005;12:1183–201.
7. Jay F, Sjödin P, Jakobsson M, Blum MGB. Anisotropic isolation by distance: the main orientations of human genetic differentiation. Mol Biol Evol. 2013;30:513–25.
8. Excoffier L, Smouse PE, Quattro JM. Analysis of molecular variance inferred from metric distances among DNA haplotypes: application to human mitochondrial DNA restriction data. Genetics. 1992;131:479–91.
9. Nievergelt CM, Libiger O, Schork NJ. Generalized analysis of molecular variance. PLoS Genet. 2007;3:e51.
10. Lewontin RC. The apportionment of human diversity. In: Evolutionary biology. US: Springer; 1995. p. 381–98.
11. Edwards AWF. Human genetic diversity: Lewontin’s fallacy. Bioessays. 2003;25:798–801.
12. Barbujani G. Human races: classifying people vs understanding diversity. Curr Genomics. 2005;6:215–26.
13. Risch N. Dissecting racial and ethnic differences. N Engl J Med. 2006;354:408–11.
14. Pugach I, Matveyev R, Wollstein A, Kayser M, Stoneking M. Dating the age of admixture via wavelet transform analysis of genome-wide data. Genome Biol. 2011;12:R19.
15. Novembre J, Ramachandran S. Perspectives on human population structure at the cusp of the sequencing era. Annu Rev Genomics Hum Genet. 2011;12:245–74.
16. Liu Y, Nyunoya T, Leng S, Belinsky SA, Tesfaigzi Y, Bruse S. Softwares and methods for estimating genetic ancestry in human populations. Hum Genomics. 2013;7:1.
17. Bernstein F. Die geographische Verteilung der Blutgruppen und ihre anthropologische Bedeutung. 1932.
18. Pickrell JK, Pritchard JK. Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genet. 2012;8:e1002967.
19. Bertorelle G, Excoffier L. Inferring admixture proportions from molecular data. Mol Biol Evol. 1998;15:1298–311.
20. Long JC. The genetic structure of admixed populations. Genetics. 1991;127:417–28.
21. Patterson N, Moorjani P, Luo Y, Mallick S, Rohland N, Zhan Y, et al. Ancient admixture in human history. Genetics. 2012;192:1065–93.
22. Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000;155:945–59.
23. Yang BZ, Zhao H, Kranzler HR, Gelernter J. Characterization of a likelihood based method and effects of markers informativeness in evaluation of admixture and population group assignment. BMC Genet. 2005;6:50.
24. Alexander DH, Novembre J, Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 2009;19:1655–64.
25. Frichot E, Mathieu F, Trouillon T, Bouchard G, François O. Fast and efficient estimation of individual ancestry coefficients. Genetics. 2014;196:973–83.
26. Jombart T, Pontier D, Dufour AB. Genetic markers in the playground of multivariate analysis. Heredity. 2009;102:330–41.
27. Patterson N, Price AL, Reich D. Population structure and eigenanalysis. PLoS Genet. 2006;2:e190.
28. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81:559–75.
29. Cox TF, Cox M. Multidimensional scaling. 2nd ed. New York: Chapman & Hall/CRC; 2010.
30. McVean G. A genealogical interpretation of principal components analysis. PLoS Genet. 2009;5:e1000686.
31. Tang H, Peng J, Wang P, Risch NJ. Estimation of individual admixture: analytical and study design considerations. Genet Epidemiol. 2005;28:289–301.
32. Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, et al. Genes mirror geography within Europe. Nature. 2008;456:98–101.
33. Cheng L, Connor TR, Sirén J, Aanensen DM, Corander J. Hierarchical and spatially explicit clustering of DNA sequences with BAPS software. Mol Biol Evol. 2013;30:1224–8.
34. Guillot G, Mortier F, Estoup A. Geneland: a computer package for landscape genetics. Mol Ecol Notes. 2005;5:712–5.
35. Jombart T, Devillard S, Dufour AB, Pontier D. Revealing cryptic spatial patterns in genetic variability by a new multivariate method. Heredity. 2008;101:92–103.
36. Yang WY, Novembre J, Eskin E, Halperin E. A model-based approach for analysis of spatial structure in genetic data. Nat Genet. 2012;44:725–31.
37. Baran Y, Quintela I, Carracedo A, Pasaniuc B, Halperin E. Enhanced localization of genetic samples through linkage-disequilibrium correction. Am J Hum Genet. 2013;92:882–94.
38. Chen C, Durand E, Forbes F, François O. Bayesian clustering algorithms ascertaining spatial population structure: a new computer program and a comparison study. Mol Ecol Notes. 2007;7:747–56.
39. Elhaik E, Tatarinova T, Chebotarev D, Piras IS, Maria Calò C, De Montis A, et al. Geographic population structure analysis of worldwide human populations infers their biogeographical origins. Nat Commun. 2014;5:3513.
40. Evanno G, Regnaut S, Goudet J. Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Mol Ecol. 2005;14:2611–20.
41. Fraley C, Raftery AE. Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc. 2002;97:611–31.
42. Lawson DJ, Falush D. Population identification using genetic data. Annu Rev Genomics Hum Genet. 2012;13:337–61.
43. Klei L, Kent BP, Melhem N, Devlin B, Roeder K. GemTools: a fast and efficient approach to estimating genetic ancestry. http://arxiv.org/abs/1104.1162. Accessed 10 June 2014.
44. Jakobsson M, Rosenberg NA. CLUMPP: a cluster matching and permutation program for dealing with label switching and multimodality in analysis of population structure. Bioinformatics. 2007;23:1801–6.
45. Wollstein A, Lao O, Becker C, Brauer S, Trent RJ, Nürnberg P, et al. Demographic history of Oceania inferred from genome-wide data. Curr Biol. 2010;20:1983–92.
46. Mendizabal I, Lao O, Marigorta UM, Wollstein A, Gusmão L, Ferak V, et al. Reconstructing the population history of European Romani from genome-wide data. Curr Biol. 2012;22:2342–9.
47. Price AL, Tandon A, Patterson N, Barnes KC, Rafaels N, Ruczinski I, et al. Sensitive detection of chromosomal segments of distinct ancestry in admixed populations. PLoS Genet. 2009;5:e1000519.
48. Tian C, Gregersen PK, Seldin MF. Accounting for ancestry: population substructure and genome-wide association studies. Hum Mol Genet. 2008;17:R143–50.
49. Kayser M, de Knijff P. Improving human forensics through advances in genetics, genomics and molecular biology. Nat Rev Genet. 2011;12:179–92.
50. Verdu P, Rosenberg NA. A general mechanistic model for admixture histories of hybrid populations. Genetics. 2011;189:1413–26.
51. Pickrell JK, Patterson N, Barbieri C, Berthold F, Gerlach L, Güldemann T, et al. The genetic prehistory of southern Africa. Nat Commun. 2012;3:1143.
52. Kalinowski ST. The computer program STRUCTURE does not reliably identify the main genetic clusters within species: simulations and implications for human population structure. Heredity. 2011;106:625–32.
53. Novembre J, Stephens M. Interpreting principal component analyses of spatial population genetic variation. Nat Genet. 2008;40:646–9.
54. François O, Currat M, Ray N, Han E, Excoffier L, Novembre J. Principal component analysis under population genetic models of range expansion and admixture. Mol Biol Evol. 2010;27:1257–68.
55. Sokal RR, Oden NL, Walker J, Di Giovanni D, Thomson BA. Historical population movements in Europe influence genetic relationships in modern samples. Hum Biol. 1996;68:873–98.
56. Lee AB, Luca D, Klei L, Devlin B, Roeder K. Discovering genetic ancestry using spectral graph theory. Genet Epidemiol. 2010;34:51–9.
57. Zhang J, Niyogi P, McPeek MS. Laplacian eigenfunctions learn population structure. PLoS One. 2009;4:e7928.
58. Barbujani G, Belle EMS. Genomic boundaries between human populations. Hum Hered. 2006;61:15–21.
59. Raj A, Stephens M, Pritchard JK. fastSTRUCTURE: variational inference of population structure in large SNP data sets. Genetics. 2014;197:573–89.
60. Lawson DJ, Hellenthal G, Myers S, Falush D. Inference of population structure using dense haplotype data. PLoS Genet. 2012;8:e1002453.
61. Hellenthal G, Busby GBJ, Band G, Wilson JF, Capelli C, Falush D, et al. A genetic atlas of human admixture history. Science. 2014;343:747–51.
62. Li JZ, Absher DM, Tang H, Southwick AM, Casto AM, Ramachandran S, et al. Worldwide human relationships inferred from genome-wide patterns of variation. Science. 2008;319:1100–4.
63. Jakobsson M, Scholz SW, Scheet P, Gibbs JR, VanLiere JM, Fung HC, et al. Genotype, haplotype and copy-number variation in worldwide human populations. Nature. 2008;451:998–1003.
64. Nielsen R, Mountain JL, Huelsenbeck JP, Slatkin M. Maximum-likelihood estimation of population divergence times and population phylogeny in models without mutation. Evolution. 1998;52:669–77.
65. Weir BS, Cockerham CC. Estimating F-statistics for the analysis of population structure. Evolution. 1984;38:1358–70.
66. International HapMap 3 Consortium, Altshuler DM, Gibbs RA, Peltonen L, Altshuler DM, Gibbs RA, et al. Integrating common and rare genetic variation in diverse human populations. Nature. 2010;467:52–8.
67. Yaeger R, Avila-Bront A, Abdul K, Nolan PC, Grann VR, Birchette MG, et al. Comparing genetic ancestry and self-described race in African Americans born in the United States and in Africa. Cancer Epidemiol Biomarkers Prev. 2008;17:1329–38.
68. Bryc K, Auton A, Nelson MR, Oksenberg JR, Hauser SL, Williams S, et al. Genome-wide patterns of population structure and admixture in West Africans and African Americans. Proc Natl Acad Sci U S A. 2010;107:786–91.
69. Reich D, Patterson N, Campbell D, Tandon A, Mazieres S, Ray N, et al. Reconstructing Native American population history. Nature. 2012;488:370–4.
70. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006;38:904–9.
71. Jombart T. adegenet: a R package for the multivariate analysis of genetic markers. Bioinformatics. 2008;24:1403–5.
72. Reeves PA, Richards CM. Accurate inference of subtle population structure (and other genetic discontinuities) using principal coordinates. PLoS One. 2009;4:e4269.
73. Lao O, Liu F, Wollstein A, Kayser M. GAGA: a new algorithm for genomic inference of geographic ancestry reveals fine level population substructure in Europeans. PLoS Comput Biol. 2014;10:e1003480.
74. Corander J, Waldmann P, Marttinen P, Sillanpää MJ. BAPS 2: enhanced possibilities for the analysis of genetic population structure. Bioinformatics. 2004;20:2363–9.
75. Hudson RR. Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics. 2002;18:337–8.
76. de Gruijter JM, Lao O, Vermeulen M, Xue Y, Woodwark C, Gillson CJ, et al. Contrasting signals of positive selection in genes involved in human skin-color variation from tests based on SNP scans and resequencing. Investig Genet. 2011;2:24.
77. Gravel S. Population genetics models of local ancestry. Genetics. 2012;191:607–19.
78. Nachman MW, Crowell SL. Estimate of the mutation rate per nucleotide in humans. Genetics. 2000;156:297–304.
Copyright
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.