Inference of human continental origin and admixture proportions using a highly discriminative ancestry informative 41-SNP panel

Background Accurate determination of genetic ancestry is of high interest for many areas such as biomedical research, personal genomics and forensics. It remains an important topic in genetic association studies, as it has been shown that population stratification, if not appropriately considered, can lead to false-positive and -negative results. While large association studies typically extract ancestry information from available genome-wide SNP genotypes, many important clinical data sets on rare phenotypes and historical collections assembled before the GWAS area are in need of a feasible method (i.e., ease of genotyping, small number of markers) to infer the geographic origin and potential admixture of the study subjects. Here we report on the development, application and limitations of a small, multiplexable ancestry informative marker (AIM) panel of SNPs (or AISNP) developed specifically for this purpose. Results Based on worldwide populations from the HGDP, a 41-AIM AISNP panel for multiplex application with the ABI SNPlex and a subset with 31 AIMs for the Sequenome iPLEX system were selected and found to be highly informative for inferring ancestry among the seven continental regions Africa, the Middle East, Europe, Central/South Asia, East Asia, the Americas and Oceania. The panel was found to be least informative for Eurasian populations, and additional AIMs for a higher resolution are suggested. A large reference set including over 4,000 subjects collected from 120 global populations was assembled to facilitate accurate ancestry determination. We show practical applications of this AIM panel, discuss its limitations for admixed individuals and suggest ways to incorporate ancestry information into genetic association studies. Conclusion We demonstrated the utility of a small AISNP panel specifically developed to discern global ancestry. We believe that it will find wide application because of its feasibility and potential for a wide range of applications.


Background
Characterization of human ancestry has been of interest for decades as information about population structure can provide novel insight into the human past and remains an important topic in the rapidly evolving biomedical field. For example, because genetic variants conferring risk to a particular disease may be geographically restricted because of evolutionary forces such as mutation, genetic drift, migration and natural selection, the assessment of the genetic background in individuals chosen for a study is crucial in genetic epidemiology [1].
While still a topic of controversy [2], there is ample evidence that self-reported race, as for example used in the US Census, can predict ancestral clusters in a population sample. However, it does not completely inform on how genetic variation is apportioned within and between racial groups, nor does information on race reveal the extent of admixture [2,3].
Especially in the context of mapping disease genes, more objective and accurate methods of defining homogenous populations for the investigation of specific population-disease associations are required. This is not only paramount for specific mapping approaches such as admixture mapping [4], but has also been recognized as a crucial prerequisite for genetic association studies, as the presence of undetected population structure can lead to both false-positive results and failures to detect genuine associations [5]. Furthermore, it has been shown that the consequences of population structure on association outcomes increase markedly with sample size, and even modest levels of population structure within population groups cannot safely be ignored in the large studies needed to detect typical genetic effects in common diseases [6].
In order to assess genetic background diversity, a large number of ancestry informative marker (AIM) panels have been developed for particular applications. Genome-wide panels for admixture mapping have been developed for Hispanic populations [7], African Americans [8] or three-way admixture in the Americas [9], and smaller AIM panels have been designed to discern ancestry at either the global level [10][11][12] or within specific populations such as the Native and Mexican Americans [13][14][15], Europeans [16][17][18][19][20] or African Americans [21,22]. In addition, genome-wide association studies (GWAS) are able to leverage ancestral information from the allele frequencies of the several thousand SNPs generated for whole-genome applications, alleviating the need for specific AIM panels [5].
However, determining ancestry and controlling for population structure is just as important in smaller genetic association studies. These include for example candidate gene studies involving only a few genetic markers, replication of GWAS findings, or consist of smaller, highly valuable collections of rare pathological phenotypes and historical collections with limited amounts of DNA. Genotyping these samples on large AIM panels or leveraging ancestry information from preexisting genotyping is often not practical or possible.
To address this specific need, we set out to develop a highly informative AIM panel that would allow us to infer a subject's ancestral origin at the continental level and estimate admixture proportions among at least seven main geographic regions Africa, the Middle East, Europe, Central and South Asia, East Asia, Oceania and the Americas. The selection of such AIMs has to focus on SNPs with the largest allele frequency differences between the continental regions of interest to achieve the desired resolution at the continental level. Such high resolution is required because genetic diversity of human populations follows gradients or geographic clines within and among continents rather than specific clusters or clades [3,23,24].
We further aimed for the development of a feasible method to determine ancestry, as resources such as funding and available DNA are often limited for these applications. We therefore developed panels of AISNPs suitable for multiplex application on two commonly used platforms, the ABI SNPlex [25] and Sequenome iPLEX [26] systems. Additionally, all markers are also included on the Illumina HumanHap550 array, thus allowing for a combined analysis with studies genotyped on the Illumina whole-genome arrays.
Lastly, we specifically focused on the applicability of our panel to determine the ancestry of subjects from any of the worldwide geographic origins. To date, most research involving genetic association studies has focused on populations of European descent, where longer LD blocks require fewer genetic markers to be genotyped [27]. However, current gene-mapping efforts specifically request more global research, thus increasing the need for global AIM panels. Furthermore, global ancestry determination is especially important in clinical samples ascertained in specific geographic regions such as Southern California that are inhabited by individuals with very diverse and often heavily admixed ancestries.
Here we describe the development of AIM panels based on the well-studied global reference populations from the HGDP-CEPH [28], which include 52 geographically diverse populations collected from seven continental regions. We then greatly expanded the reference population set by genotyping the AIMs in over 2,000 additional subjects of known ancestry with the goal of achieving the most comprehensive global reference collection possible. We report on these efforts and describe highly discriminative ancestry informative 41-and 31marker panels for multiplex applications.

Reference populations
AIM panels were developed based on the global reference populations from the HGDP-CEPH [28]. A total of 941 subjects including 52 populations from the standardized H952 subset were selected [29]. Based on the geographic origin of the samples, HGDP subjects were assigned to one of seven geographic or continental regions: Africa (n = 131), the Middle East (including the North African Moabites, n = 133), Europe (n = 158), Central/South Asia (CS Asia, n = 198), East Asia (E Asia, n = 229), the Americas (n = 64) and Oceania (n = 28) (Additional file 1: Table S1).
We used Infocalc 1.1 [30] to calculate the marker informativeness (I_n) among the seven continental regions for each of the 644,195 autosomal markers. The mean informativeness of all markers was 0.0539, with a wide range of I_n = 0.0003-0.406. AIMs were selected according to the following criteria: being autosomal, unambiguous (AC, AG, TC, TG) and present on the Illumina Hap550 array (n = 547,458). Next, the top 5,000 markers with the highest I_n were chosen (I_n > 0.077) and, to reduce the correlation of markers, were subjected to LD pruning using PLINK [31] at a VIF = 1.5. The resulting pool of AIMs included 1,442 SNPs (Additional file 2: Table S2).
A small panel for multiplexing applications was developed by first choosing from the pool of 1,442 AIMs the top ten markers with the highest allele frequency differences (δ) between each of the 21 pairwise continental region comparisons. This set of 210 markers was then further reduced in an iterative way by considering multiplex genotyping requirements for the ABI SNPlex genotyping system [25] and Sequenome iPLEX system [26], leading to the final 41-AIM set for ABI SNPlex genotyping and the matching 31-AIM set for Sequenome iPLEX genotyping.

Additional reference and test populations
To validate the AIM panels and increase the global coverage of the reference population set for downstream applications, we included two additional, very large data sets with worldwide populations: the International HapMap Project (http://hapmap.ncbi.nlm.nih. gov/; phase III release 2 and 3) standard set HAP1161 [32] included 931 subjects from 11 populations, and the Yale data set included 2,146 subjects from 57 populations [33]. The combined reference set included 4,018 unrelated subjects from 120 (partially overlapping) populations (Additional file 1: Table S1). These reference populations have been described previously [33], and geographic features such as latitude and longitude of these populations are presented in the allele frequency database ALFRED (http://alfred.med. yale.edu/) [34]. Genotypes of at least 40 of the 41 AIMs were available for all reference subjects.
Finally, to illustrate a practical application of the 41-AIM panel with our complete set of global reference populations, a contemporary population sample of 2,392 subjects ascertained in Southern California [35] was genotyped using the ABI SNPlex system. Ancestry was determined for all subjects with < 5% genotypes missing.

Statistical analyses
Population structure and individual ancestry estimates were obtained using STRUCTURE v2.3.2.1. [36,37]. To assess the global informativeness of the 41-AIM panel in the original HGDP reference populations, five independent runs without prior population assignment were performed at K = 2 to K = 7, using 20,000 burnin cycles and 20,000 MCMC replications under the admixture model. The "infer α" option with the same, uniform alpha for all populations was used under the λ = 1 option. All other parameters were set at default.
To further validate the 41-AIM panel, ancestry estimates of 3,077 independent subjects of known ancestry from 68 global populations (reference set 2) were determined at k = 7 using the above STRUCTURE parameters, but now including prior population information of the HGDP reference set. Allele frequencies were updated using only individuals with population information at a migration prior of 0.05. Graphs were plotted using DISTRUCT v1.1 [38].
CLUMPP v1.1.2 [39] was used to evaluate different replicates of STRUCTURE runs. To assign a subject to a specific cluster, we applied cutoffs of >85% and >50% cluster membership, respectively. These criteria were selected to facilitate a comparison with Seldin's 93-AIM panel [10]. Finally, to validate the AIM panels, the percentage of subjects that clustered correctly compared to the known geographic origin was calculated.
Population structure was further analyzed using principal component analysis (PCA) implemented in the EIGENSTRAT software [40] and multi-dimensional scaling (MDS) as implemented in PLINK. All other calculations were performed in R v2.15.0.
As a measure of informativeness of the different AIM panels at the population level, we calculated F ST , a genetic distance measure for inter-population differentiation compared to intra-population variation. Significance of pairwise F ST s was established using 10,000 permutations. A Mantel test was used to correlate the F ST matrices based on the 41-AIM and 31-AIM panels. Calculations were performed in ARLEQUIN 3.5 [41].
To investigate the informativeness of the AIM panels in detecting admixture at the individual level, subjects from two admixed populations of the Southern California test population (self-reported African Americans and self-reported Hispanic White and Native Americans) were selected. These subjects were subjected to the Illumina HumanOmniExpressExome array, and individual ancestry estimates were determined with a second, independent approach (see [42] for details). In brief, we used over 10,000 GWAS-derived SNPs, a set of 2,513 (partly overlapping) reference individuals and a two-step analysis approach implemented in ADMIX-TURE [43]. Individual admixture estimates based on the GWAS-derived panel were then compared to the admixture estimates based on the 41-and 31-AIM panels for these two admixed populations (see above).

Characterization of small AIM panels to determine continental ancestry
Fifty-two global populations from the HGDP-CEPH panel [28] were used to select AIMs optimized for the determination of continental ancestry. We developed a small 41-AIM panel specifically for multiplex application on the ABI system from a pre-selected pool of 1,442 highly informative AIMs (Additional file 1: Table S1). The panel was further reduced to 31 AIMs for application on the Sequenome iPLEX system. Table 1 shows the informativeness (I_n) and pairwise allele frequency differences (δ) among the seven continental regions for each of the 41 AIMs. I_n ranges from 0.08 -0.41 with a high mean of 0.23. The largest I_n and largest δ for each of the 21 continental comparisons are indicated in bold, highlighting the strength of a marker to distinguish between specific different global origins. Most continental comparisons included several markers with very high δ of >0.8. The smallest allele frequency differences were found for comparisons of regions within Eurasia where the top markers showed δ in the range of 0.4, indicating limited power to accurately distinguish subjects from Europe, the Middle East and Central/South Asia from each other.
The AIM panels were further characterized by calculating F ST [41] as a measure of the panel's relative strength to distinguish the seven geographic regions. Table 2 shows the genetic distance between the continental regions when using the 41-AIM (lower diagonal) and 31-AIM panel (upper diagonal), respectively. Intercontinent differentiation was based on allele frequencies from 51 HGDP populations; the atypical North African Mozabites were excluded here.
In general, we found high F ST values distinguishing the African, East Asian, American and Oceanian regions. As expected, the lower F ST values among Europe, the Middle East and Central/South Asia reflect the I_n and δ found for the single markers. When comparing the F ST values of the full 41-AIM panel with the reduced 31-AIM panel, no significant differences were found (Wilcoxon signed rank test, n = 21 paired comparisons, p > 0.38). In addition, a comparison of all pairwise F ST values among the 52 populations showed a highly significant correlation among the F ST values calculated based on the 41-AIM panel and the 31-AIM panel (Mantel test, r = 0.987, p < 0.001), further indicating no significant loss of power to discern global ancestry in the smaller panel.
Lastly, the population structure of the HGDP was analyzed using STRUCTURE. To facilitate a comparison with previous studies (e.g., [10,12,24,33,44]), we used similar model parameters without prior information about individual sampling locations. Figure 1 shows the most typical patterns with the highest likelihood from each of 20 independent runs at K = 2-7. Similar to Rosenberg's analyses including 377 microsatellites [44] and 993 SNPs [24], we found stable results with two clusters anchored by Africa and the Americas at K = 2 (20/20 runs) and a separation of Africa at K = 3 (19/20). At K = 4, a new cluster emerged isolating either the Americas (11/20) or alternatively Central/South Asia (9/ 20), and at K = 5 both of these regions were isolated (14/20). Most runs separated Europe from the Middle East at K = 6 (17/20), and at K = 7 the main continental regions for whose partitioning the panel was designed were separated from each other in the majority of runs (11/20) and with the highest likelihood.

Validation of the 41-AIM panel using additional populations of known origin
We further tested the performance of the 41-AIM panel in a realistic setting and estimated the ancestry of 3,077 test subjects from 68 regionally collected populations from the HapMap III and Yale collections. These test populations have been extensively characterized by us and others (see, e.g., [33] and [45]) and are well suited for this purpose. STRUCTURE was run with the HGDP as predefined reference populations at K = 7 (Yale samples were not genotyped for rs2717329). Table 3 shows the average cluster membership of individuals belonging to a specific population for each of the seven continental regions, Africa, the Middle East, Europe, Central/South Asia, East Asia, the Americas and Oceania (n = 68 populations). We calculated the percentage of subjects that clustered correctly, using criteria of >85% and >50% cluster membership (MS), respectively.
We found that African populations had very high cluster membership in the African cluster, but East African populations (e.g., Chagga, Maasai and Sandawe) showed slightly lower values. As expected, admixed African Americans as well as a population of Ethiopian Jews showed some cluster membership in Europe and the Middle East, and less than 50% of the subjects were included in the African group at the 85% MS criteria.
The ethnoreligious Samaritans, Yemenite Jews and Druze clustered with the Kuwaiti predominantly in the Middle East, but also showed a significant European contribution. As expected, most European populations clustered predominantly with Europe. However, there was a significant Middle Eastern component, even for the Northern European populations such as the Finns and Irish, demonstrating the somewhat reduced specificity of the 41-AIM panel to distinguish between Europe and the Middle East compared to the resolution between other continents. When applying the less stringent 50% MS criterion, most populations had over 90% of their subjects placed in Europe. Not surprisingly, the Russian Table 1 Informativeness (I_n) and allele frequency differences (δ) between seven continental regions for SNPs on the 41-AIM panel   populations Adygei, Chuvash, Komi Zyriane and Russian Vologda were found to have a significant Central/South Asian component. The Central/South Asian cluster included the majority of the Gujarati, Keralite and Thoti Indians at the 50% MS criterion. As expected, the Kachari Assam, located in the East, also showed a significant East Asian contribution. However, there was no predominant placing in any of the seven continental groupings for the Khanty, a population from western Siberia. This is expected since the current continental grouping at K = 7 does not include a specific Siberian/North Asian cluster. The Khanty are currently our only representatives of this large geographic area.
The East Asian test subjects from 15 diverse populations clustered in East Asia with almost no exception. Most Southern Malaysians also showed some Central/ South Asian contribution. Most Native American populations clustered predominantly in the Americas.
Exceptions were the admixed Muscogee and HapMap Mexicans, which were not placed in this cluster, but showed a strong European component. The Oceanic cluster included all Papua-New Guinean and Nasioi Melanesian subjects. However, the Micronesian and Samoan subjects from this broad geographic area were not assigned to Oceania at the 50% MS criterion, but were found to be admixed with a strong East Asian component.
Finally, we combined the HapMap III and Yale collections with the HGDP, and further analyses were conducted with our complete reference population set including 4,018 subjects genotyped on the 41-AIM panel. A principal component analysis (PCA) including all 4,018 subjects and averaged for each of the 120 populations is shown in Figure 2. We found that the first PC explained 27.6% of the genetic variability in the data set and corresponded with the Africa to Americas gradient found by STRUCTURE at K = 2. PC2 explained an     We performed an analysis of the eigenvalues of the first 15 PCs and found that over 56% of the genetic variation among the seven continental regions was accounted for by the first five PCs (Figure 3).

Applications of the 41-AIM panel
To highlight a practical application of the 41-AIM panel and our large collection of reference populations, we considered the case of a genetic association study with subjects collected in Southern California. In order to minimize spurious results due to population stratification (i.e., false-positive associations between a phenotype and genetic marker), a PCA is often applied. PCs can be used as an easy tool to visualize large amounts of data or can be included as covariates in association analyses to adjust for population stratification.
PC plots of the first three PCs generated based on genotype data of the 41 AIMs are shown in Figure 4 for the complete reference set of 4,018 subjects from 120 populations. When placed in the context of clusters, several populations appear as admixed among the eight colored continental regions (see Table 3; Siberia has been added as its own region here) or are truly admixed (such as the African Americans, Mexicans, Mozabites and others). To increase resolution, we removed these populations (n = 13, open symbols) in specific applications. Figure 5 shows PC plots of Southern Californian test subjects and 107 'typical' reference populations. The first five PCs (PC1 -PC5) explain a total of 51.5% variability in the data, and each identifies different aspects of the population distribution. PC2 highlights the European-African gradient and identified African Americans approach implemented in the genetic association software PLINK. MDS analyses lead to essentially the same results (Additional file 3: Figure S1).
Guided by these visual approaches, subjects are then typically grouped into a small number of more homogeneous groups (e.g., European Americans or African Americans) prior to association analysis, using clustering methods such as implemented in STRUCTURE. Additional population stratification and varying degrees of individual admixture are then accounted for within these more homogeneous groups.
To assess the informativeness of the AIM panels in detecting admixture at the individual level, we compared STRUCTURE admixture estimates based on the 41 and 31 AIMs with independently derived estimates based on a large, GWAS-derived panel (see Methods). Subjects were selected based on self-report from the admixed African American and Hispanic White and Native American populations ( Figure 6). Individual ancestry proportions derived from the 41-AIM and GWAS panels were strongly correlated for both the Hispanic White and Native American populations (n = 484, Pearson's r = 0.81 for the proportion of Native American ancestry in panel A, r = 0.81 for the proportion of European ancestry in panel C) and the African Americans (n = 106, Pearson's r = 0.86 for the proportion of African ancestry in panel B, r = 0.85 for the proportion of European ancestry in panel D; all p < 2 × 10 -16 ). Slightly reduced correlations of individual admixture estimates were achieved between the GWAS-derived and smaller 31-AIM panels (A: r = 0.77, B: r = 0.78, C: r = 0.84 and D: r = 0.85, respectively). Importantly, a detailed inspection of the scatter plots indicates that the AIM panels lack sensitivity in detecting admixture in individuals with low proportions of admixture (in the range of < 20-25%) when compared to the GWAS-derived panel.

Application and limitations
Our motivation was to develop a feasible method to discern continental ancestry that would enable a safeguard against the impact of population stratification in small genetic association studies where limited resources preclude large genotyping efforts. We achieved this by choosing a very small set of highly discriminative AIMs suitable for multiplexing applications, thus enabling lower cost and higher throughput. To ensure a wide application potential, we optimized our panel for two commonly used multiplex platforms, the ABI SNPlex [25] and Sequenome iPLEX systems [26]. At the same time, these AIMs perform well in single SNP TaqMan assays and can also be extracted from whole-genome arrays such as the Illumina HumanHap550 chip, thus allowing an easy combination of samples with genotyping from different sources. This is especially important for AIMs, where imputation of SNPs based on information from genotyped markers is not advisable.
Our panel was able to accurately discern the global ancestry of a large majority of subjects originating from one of the seven specific ancestral clusters. This was the case for both the full 41-AIM panel and the subset of 31 AIMs, indicating that a balanced reduction of markers in these small panels did not significantly impact the robustness of the results. A direct comparison of our findings with a previously published small panel of 93 AIMs published by Seldin's group [10] showed 89.7% agreement in continental assignment of HGDP subjects (data not shown), further validating our panel.
Not surprisingly, the biggest limitation to imputing global ancestry was found for subjects from Eurasia, where low F ST values of 0.06 -0.09 among Europe, the Middle East and Central/South Asia indicated little genetic diversity. The clinal distributions of allele frequencies between Europe and East Asia pose a challenge for the identification of highly discriminative markers, a limitation also impacting other small AIM panels [12,46]. We therefore suggest supplementing our panel with additional high-resolution markers for studies with focus on Eurasia. Such markers suitable for discerning specific pairs of regions can easily be extracted from our extensive preselected list of global AIMs.

Impact of reference populations
Independent of the statistical method used to determine ancestry and admixture proportions, the results of these analyses depend not only on the informativeness of the genetic markers, but also strongly on the set of reference populations included. An omission of reference subjects from an ancestral group likely leads to misclassification of test subjects with similar ancestries. For example, we previously found that African Americans clustered strongly with Central Asians in a three-way admixture analysis (erroneously) including only reference subjects from Europe, Central Asia and the Americas.
We considered this crucial issue during panel development and leveraged the publicly available 52 HGDP populations collected across the globe [47]. We then increased our reference set by leveraging AIM frequencies from the HapMap III and performing additional AIM genotyping in our large global collection [33]. With over 4,000 subjects from 120 global populations, we thus assembled one of the largest reference sets published for the purpose of ancestry determination. However, specific regions such as Siberia are still underrepresented, and efforts to expand our reference subject collection are ongoing.

Admixed subjects
Whereas ancestry assignment of subjects from a specific geographic area represented by a cluster in the reference population set is a quantifiable and relatively straightforward task, admixed subjects resulting from ancient or recent contact of populations with distinct ancestries pose challenges.
If such a cohort consists of admixed subjects with known ancestry contributions, such as two-way admixed Mexican Americans collected from a distinct area in Southern California, the varying degree of European and Native American ancestries can easily be estimated in admixture analyses implemented in Statistical packages such as STRUCTURE, ADMIXTURE [43] or BAPS [48]. Our AIM panels were able to detect admixture in individuals from these populations, but as expected for such small panels, were less sensitive when individual admixture proportions were low.
However, in a clinical sample including subjects of unknown ancestral origin and complex population structure, as is often the case in our studies (e.g., [35,49,50]), the presented methods may lack specificity to distinguish between admixture of distinct populations and erroneously place admixed subjects together with intermediate populations. This is especially true when including only the first few components of multivariate data reduction methods such as MDS and PC analyses (see, e.g., Figure 2, Panel A, for Mexican Americans clustering with Central/South Asians). In these cases, adding demographic information such as self-declared race and ethnicity information is strongly suggested to help minimize misassignments.

Challenges for genetic association studies
Adding to the complexities of accurately differentiating ancestral groups and estimating admixture proportions is the appropriate incorporation of this information into the design of genetic association studies. While the negative impact of population structure on association studies is well known [6], and methods to control for it are established and now routinely applied to studies of relatively homogeneous cohorts such as typically collected for GWAS (see e.g. a recent review [5]), the situation remains challenging for heterogeneous clinical collections or epidemiological cohorts.
Depending on the composition and relative numbers of subjects from different ancestral backgrounds, common questions in such studies include the genetic definition of African Americans, which typically show degrees of European admixture that vary among individuals [51]. There is currently no consensus for an appropriate cutoff point between European Americans and African Americans. Even less trivial is the incorporation into association studies of three-way admixed subjects such as Caribbean Latinos originating from Puerto Rico and the Dominican Republic [52], typically showing both Native American and high levels of African ancestry.
For practical purposes, we often employ a multi-tier approach: we first group subjects into continental clusters using a majority criterion with statistical methods such as STRUCTURE and then confirm the plausibility of the grouping with demographic data, where available. Next, we aim to place most subjects into a very small number of clusters including genetically similar subjects, for example, by combining similar continental groups such as from Eurasia, and excluding outliers of minority ancestries. Lastly, we control for additional population stratification within clusters by incorporating MDS components into association studies and ultimately combine