Multiplex genotyping system for efficient inference of matrilineal genetic ancestry with continental resolution

Background: In recent years, phylogeographic studies have produced detailed knowledge on the worldwide distribution of mitochondrial DNA (mtDNA) variants, linking specific clades of the mtDNA phylogeny with certain geographic areas. However, a multiplex genotyping system for the detection of the mtDNA haplogroups of major continental distribution that would be desirable for efficient DNA-based bio-geographic ancestry testing in various applications is still missing.
Results: Three multiplex genotyping assays, based on single-base primer extension technology, were developed targeting a total of 36 coding-region mtDNA variants that together differentiate 43 matrilineal haplo-/paragroups. These include the major diagnostic haplogroups for Africa, Western Eurasia, Eastern Eurasia and Native America. The assays show high sensitivity with respect to the amount of template DNA: successful amplification could still be obtained when using as little as 4 pg of genomic DNA, and the technology is suitable for medium-throughput analyses.
Conclusions: We introduce an efficient and sensitive multiplex genotyping system for bio-geographic ancestry inference from mtDNA that provides resolution on the continental level. The method can be applied in forensics, to aid tracing unknown suspects, as well as in population studies, genealogy and personal ancestry testing. For more complete inferences of overall bio-geographic ancestry from DNA, the mtDNA system provided here can be combined with multiplex systems for suitable autosomal and, in the case of males, Y-chromosomal ancestry-sensitive DNA markers.


Background
Establishing the geographic region of a person's genetic origin, also called bio-geographic ancestry, is of forensic relevance when the short tandem repeat (STR) profile of trace DNA found at a crime scene does not match that of a suspect or does not yield any matches in a criminal DNA database, because it may provide investigative leads for finding unknown persons [1]. Similarly, such information can be useful for locating antemortem samples or putative relatives of unidentified body remains, including disaster victim identification [2]. Furthermore, inferring geographic information from DNA data is important in population history studies [3,4] and has gained attention in the growing field of personal ancestry testing [5,6].
Several years of intensive research into the understanding of the geographic distribution of human genetic diversity present in the non-recombining mitochondrial genome and respective parts of the Y-chromosome (NRY), mostly for population history purposes, have produced an immense body of knowledge allowing us to pick specific mtDNA and NRY markers with restricted (sub)continental distributions [4,7,8]. MtDNA is especially useful for forensic application due to its high copy number (hundreds to thousands of copies per cell) and small size (16.6 kb), which allows the analysis of small amounts of degraded DNA often encountered in crime-scene situations [9]. Although mtDNA only reveals information about matrilineal ancestry, it can be seen as a first step toward a more comprehensive picture of personal ancestry when combined with suitable NRY and autosomal DNA evidence [10,11]. Furthermore, investigating the geographic origin of mtDNA in comparison to that of the Y-chromosome in a population can also reveal insights into sex-biased aspects of human population history such as those caused by patri- or matrilocal residence patterns [12].
In human population genetics studies, the typical approach for mtDNA analysis consists of sequencing the first hypervariable segment (HVS1), sometimes in combination with HVS2, within the non-coding control region (see, for example [13,14]), whereas in forensics it has nowadays become standard practice to sequence the entire control region [15]. Although haplogroup inference from HVS sequence data is possible for many mtDNA haplogroups, not all haplogroups present suitable diagnostic variants in HVS1 and/or HVS2 that allow an unequivocal assignment. In such cases, simple nucleotide polymorphisms (SNPs; i.e. single-nucleotide polymorphisms as well as small insertions and deletions) from the coding region of mtDNA are required in order to establish the haplogroup status. Moreover, because SNP typing assays are usually more sensitive and consume less DNA than sequencing, in many cases it might be desirable to perform SNP genotyping alone (in the absence of HVS data) or prior to HVS sequencing [16,17].
Several mtDNA SNP multiplex assays have already been developed focussing on particular geographic subregions (see, for example, [18]) or on the dissection of particular haplogroups (see, for example, [19]). However, what is missing so far is an mtDNA SNP multiplex system that includes the mtDNA haplogroups of major continental distribution. We describe a sensitive genotyping system based on single-base primer extension technology, consisting of three independent multiplex assays that together include 36 SNPs determining 43 mtDNA haplo-/paragroups that allow the inference of matrilineal bio-geographic ancestry at the level of continental resolution.

Multiplexes and targeted haplogroups
MtDNA coding-region SNPs defining the major haplogroups that occur in Africa, Western Eurasia, Eastern Eurasia and Native America were carefully selected (Figure 1) and combined into three multiplex genotyping assays (Figures 2, 3, 4), each consisting of a polymerase chain reaction (PCR) amplification step and a subsequent single-base primer extension step (Tables 1, 2, 3). The haplogroups detectable with Multiplex 1 and 2 are broadly similar to those typed by the Genographic Project [20] with some noticeable exceptions. Multiplex 1 (Figure 2) was designed to target haplogroups L0/L1, L2/L4/L6, L3, M, M1, C, D, N, N1, I, W, A, X and R. Due to the homoplasy of some of the selected markers in the worldwide mtDNA phylogeny [7], Multiplex 1 can additionally detect some (relatively rare) haplogroups that were not originally intended, namely L0k/L0d1a/L0d3, L5, X2a1, R11/B6 and B4a1. The hierarchical organization of the mitochondrial SNPs in Multiplex 1 ensures that all these haplogroups, intended and unintended, are well differentiable (Figure 2). Some haplogroups are only identified with Multiplex 1 on a broad level and, in those cases, additional genotyping with Multiplex 2 or 3 is needed to achieve further haplogroup resolution and final geographic inferences. Multiplex 2 (Figure 3) targets haplogroup R and haplogroups nested within R, namely R0, HV, HV0a (which includes V), H, R9 (which includes F), B, J, T, U, U6 and U8b (which includes K). A notable difference with the Genographic Project SNP panel [20] is that we included in our multiplexes haplogroups M1 and U6, which have a predominantly African distribution, probably due to back-migration events to Africa [21]. As such, Multiplex 1 and 2 together offer a convenient method for the classification of unknown mtDNAs into any of the major worldwide mtDNA haplogroups.
However, they do not allow for the differentiation of the Native American subsets of otherwise Eastern Eurasian haplogroups A, B, C and D and Western Eurasian/African haplogroup X. Therefore, we designed a third assay, Multiplex 3 (Figure 4), which specifically aims at detecting the Native American haplogroups A2, B2, C1, C4c, D1, D4h3a and X2a, as well as Eskimo/Siberian haplogroups A2a, A2b, D2a and D3 and Eastern Eurasian haplogroup C1a [22]. Together, the three multiplexes include 36 different coding-region mtDNA SNPs (of which 34 are single-nucleotide transitions/transversions and two are small insertion/deletion polymorphisms). It should be noted that, although haplogroups M1, C and D within macrohaplogroup M, haplogroups N1, A, W and X within macrohaplogroup N, and haplogroups R0, R9, B, JT and U within macrohaplogroup R can be detected with the method, much of the Southern Asian, East/Southeast Asian and Oceanic variation within M, N and R remains unresolved (denoted as M*, N* and R*, respectively, in Figure 1). This is inevitable given the large number of independent haplogroups descending from M, N and R, but it can be overcome by developing additional multiplex assays that specifically target the relevant subhaplogroups for those regions.
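The stepwise use of the three multiplexes described above can be sketched as a small lookup. This is only an illustration under stated assumptions: the routing (R to Multiplex 2; A, B, C, D and X to Multiplex 3) is read from the text, while the names `FOLLOW_UP` and `next_assay` are hypothetical and not part of the published method.

```python
# Hypothetical routing table derived from the text: which multiplex to run
# next once a broad haplogroup has been called.
FOLLOW_UP = {
    "R": "Multiplex 2",  # haplogroups nested within R are resolved by Multiplex 2
    "A": "Multiplex 3",  # Native American subsets of A, B, C, D and X
    "B": "Multiplex 3",  # B itself is called by Multiplex 2
    "C": "Multiplex 3",
    "D": "Multiplex 3",
    "X": "Multiplex 3",
}


def next_assay(haplogroup):
    """Return the follow-up multiplex for a broad haplogroup call, or None."""
    return FOLLOW_UP.get(haplogroup)


if __name__ == "__main__":
    for call in ("R", "C", "L3"):
        print(call, "->", next_assay(call))
```

A call such as L3 returns `None` here, reflecting that no further multiplex is indicated for it in the text.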

Design and optimization
The successful design of a useful multiplex single-base extension assay requires careful consideration of the SNPs and their PCR amplification primers as well as extension primers, followed by extensive laboratory testing [23]. One criterion of SNP selection was the overall level of homoplasy of the marker in the entire mtDNA phylogeny [7]. For each haplogroup, one or several defining SNPs are available; in the latter case care was taken to select the more stable (phylogenetically less recurrent) SNP sites. Nevertheless, some of the selected SNPs do occur more than once in the phylogeny (underlined in Figure 1) as discussed above. Notably, Multiplex 1 contains two tri-allelic SNPs: nucleotide position (np) 3552 is either a T (ancestral state), an A (haplogroup C), or a C (haplogroup X2a1); and np 12950 is either an A (ancestral state), a C (haplogroup M1) or a G (haplogroups L5, R11 and B6). Primer design using Primer3Plus [24] considered small amplicon size and avoided numt amplification [25]. The compatibility of primers within the same multiplex was checked with AutoDimer [26], especially avoiding 3' end complementarities. Amplicon sizes were kept small, ranging from 80 to 237 bp with an average of 133 bp (Tables 1, 2, 3), in order to facilitate the amplification of (partially) degraded DNA typically encountered in forensic settings as well as in population history studies when using difficult source materials (for example, ancient DNA). All primers were first tested in singleplex before combining them in a multiplex. Primers that showed substantial artifacts were replaced by alternatively designed primers. In order to ensure electrophoretic separation of extension primer products, extension primers within the same multiplex were given different lengths by adding 5' non-homologous (poly)GACT tails (Tables 1, 2, 3). Peak heights in the electropherograms (Figures 5, 6) were balanced by adjusting primer concentrations in the PCR and extension reactions (Tables 1, 2, 3).
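The two tri-allelic positions in Multiplex 1 can be written down as a simple allele lookup. The sketch below encodes only the allele states given in the text; the names `TRI_ALLELIC` and `interpret_allele` are hypothetical. A single allele is not a full haplogroup call, because the assay relies on the hierarchical combination of all SNPs in the multiplex.

```python
# Allele states of the two tri-allelic SNPs in Multiplex 1, as stated in the text.
# An empty list marks the ancestral state (no derived haplogroup indicated).
TRI_ALLELIC = {
    3552: {"T": [], "A": ["C"], "C": ["X2a1"]},
    12950: {"A": [], "C": ["M1"], "G": ["L5", "R11", "B6"]},
}


def interpret_allele(position, allele):
    """Return the haplogroups indicated by an allele at a tri-allelic position."""
    states = TRI_ALLELIC[position]
    allele = allele.upper()
    if allele not in states:
        raise ValueError(f"unexpected allele {allele!r} at np {position}")
    return states[allele]


if __name__ == "__main__":
    print(interpret_allele(3552, "A"))
    print(interpret_allele(12950, "G"))
```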

Haplogroup distribution and inferring bio-geographic ancestry
The labels used to describe the geographic affiliations of the haplogroups (Figure 1) mostly correspond to one of four regions or continents of the world, namely Africa, Western Eurasia, Eastern Eurasia and Native America, consistent with the terminology used in human genetics and anthropology literature. With some haplogroups, however, only combined regions can be inferred, namely Western Eurasia/Africa, Western Eurasia/Southern Asia, Eastern Eurasia/Oceania, Native America/Eastern Eurasia and Eastern Eurasia/Southern Asia/Oceania (Figure 1). While these geographic designations are convenient descriptors of the 'center of gravity' of haplogroup occurrence, it is important to keep in mind that, instead of sharp genetic borders, there exist transition areas between continents. Populations from the Middle East, for example, carry a considerable portion of African mtDNA lineages [27]. Similarly, Northern Africa has a relatively large portion of Western Eurasian mtDNA lineages [28,29]. In addition, the Central Asian mtDNA pool is composed of Western Eurasian, Eastern Eurasian and, to a lesser extent, also Southern Asian components [30,31].
Furthermore, one should be aware that traditional distribution patterns of genetic variation, including mtDNA, may have been affected by (evolutionarily recent) migration/admixture events, including as a result of colonialism, so that some populations carry portions of ancestry from multiple geographic regions. The most prominent case is, perhaps, the American continent where, due to colonization by Europeans which started around the beginning of the 16th century and the subsequent European introduction of African slaves, the current population carries a mixture of Native American, Western Eurasian and African mtDNA lineages, in varying proportions depending on the subpopulation [10,11,14,32,33]. Other well-known cases include Madagascar (African and Eastern Eurasian lineages) [13], and coastal/island parts of Near Oceania, as well as all of Remote Oceania (Oceanic and Eastern Eurasian lineages) [34]. In addition, groups of more or less recent immigrants often carry a mixture of 'native' lineages and lineages typical of the area to which they moved. For example, Polish Roma, having an ultimate origin in India, harbour both Southern Asian and Western Eurasian mtDNA variants [35]. Finally, rare cases have been reported where European individuals carried African mtDNA haplogroups without being aware of any African ancestry [36]. Therefore, for any bio-geographic ancestry prediction purposes, mtDNA evidence should be interpreted in the context of the relevant local demographic history. Also, because mtDNA only reflects the matrilineal portion of a person's genetic ancestry, ideally the markers should be combined with evidence obtained from autosomal and/or (when dealing with male DNA) Y-chromosome markers, to obtain a more accurate picture of a person's overall ancestry.
Figure 1 Overall phylogenetic scheme of targeted mtSNPs with geographic haplogroup classification. The combined use of the three multiplex assays allows any person's mtDNA to be classified into one of the colour-labelled haplogroups. Colours correspond to the geographic origin of the haplogroups as indicated. SNP position numbers are relative to the revised Cambridge Reference Sequence (rCRS). Deletion mutations are denoted by the suffix 'd'. Recurrent SNPs are underlined. The numbers 1, 2 or 3 in square brackets shown for each SNP refer to the respective multiplex assay in which the SNP is included. Note: haplogroups F, K and V are encompassed within R9, U8b and HV0a, respectively, as indicated because this does not follow logically from the nomenclature.
Figure 2 Marker phylogeny and haplogroup-defining genotypes of Multiplex 1. Table columns: inferred haplogroup; SNP positions 769, 1243, 1736, 3552, 4883, 7146, 10034, 10238, 10400, 12705, 12950, 13966 and 15301; further testing (Multiplex 2 or Multiplex 3). Recurrent SNPs are underlined. Boxed alleles indicate for each haplogroup those SNPs that are minimally required to define that haplogroup. If additional genotyping is required for more detailed haplogroup inference, the respective additional multiplex to be genotyped subsequently is noted.
van Oven et al. Investigative Genetics 2011, 2:6 http://www.investigativegenetics.com/content/2/1/6

Sensitivity testing
In order to establish the sensitivity of our multiplex assays we performed tests with different starting amounts of genomic DNA, ranging from 250 pg to 1 pg of template DNA, for four individuals originating from different continents and with respective diagnostic haplogroups: a European with haplogroup J; an African with L3*(xM,N); a Native American with C1*(xC1a); and an East Asian with R9 (Figures 5, 6). This enabled us to monitor the behaviour of the different SNP alleles with decreasing amounts of template DNA. Overall, we observed high sensitivity and basically full profiles could be obtained with all three multiplexes for all four individuals with as little as 4 pg of DNA template (with the sole exception of np 13368 in Multiplex 2, which sometimes caused difficulties in allele calling at 4 pg and below). Marker dropouts for some SNPs in all the individuals and all three multiplexes (except for Multiplex 1 in the European and the African sample and with Multiplex 3 in the European) started to occur only at the 1 pg level, as well as allele-calling difficulties for some other SNPs in all three multiplexes (Figures 5, 6). The achieved sensitivity is similar to that of two previously published mtDNA multiplex assays [18,37] but, presumably, higher than that of many other published mtDNA multiplexes which typically require 1-10 ng DNA (for example [19,38-40]; although many such studies do not provide details on sensitivity). Furthermore, the achieved sensitivity of our assays is significantly higher than that of commercially available STR multiplexes [41-43], which can be expected due to the higher relative abundance of mtDNA as compared to nuclear DNA. When working with ancient DNA or forensic trace DNA, it might be useful to quantify the amount of human DNA prior to genotyping because, in such situations, human DNA often represents only a fraction of total DNA due to the presence of non-human (for example, bacterial, fungal, or others) DNA.
Figure 3 Marker phylogeny and haplogroup-defining genotypes of Multiplex 2. Table columns: inferred haplogroup; SNP positions 2706, 3348, 3480, 8281-8289, 11251, 11719, 12308, 12705, 13368, 13928, 14766 and 15904; further testing. Boxed alleles indicate for each haplogroup those SNPs that are minimally required to define that haplogroup. The allelic states of deletion polymorphism 8281-8289 are denoted as 'a' (ancestral) and 'd' (deletion), respectively. If additional genotyping is required for more detailed haplogroup inference, the respective additional multiplex to be genotyped subsequently is noted.
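The abundance argument can be made concrete with a rough back-of-the-envelope calculation. The sketch below is illustrative only and rests on two assumptions not taken from this study: a mass of about 6.6 pg for one diploid human nuclear genome, and a range of 100 to 1000 mtDNA copies per cell (chosen to match the "hundreds to thousands" mentioned in the Background). The function name `template_copies` is hypothetical.

```python
# Assumption (not from the study): approximate mass of one diploid human
# nuclear genome in picograms.
DIPLOID_GENOME_PG = 6.6


def template_copies(dna_pg, mt_copies_per_cell):
    """Estimate template copy numbers in a given mass of genomic DNA."""
    cells = dna_pg / DIPLOID_GENOME_PG
    return {
        "cell_equivalents": cells,
        "autosomal_locus_copies": 2 * cells,  # two alleles per diploid cell
        "mtdna_copies": cells * mt_copies_per_cell,
    }


if __name__ == "__main__":
    # 4 pg is the lowest amount that still gave essentially full profiles.
    for copies_per_cell in (100, 1000):  # assumed range of mtDNA copies per cell
        est = template_copies(4, copies_per_cell)
        print(copies_per_cell, {k: round(v, 2) for k, v in est.items()})
```

Under these assumptions, 4 pg corresponds to less than one cell equivalent, so an autosomal STR locus is present at only about one copy, whereas mtDNA targets would still be present at tens to hundreds of copies.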

Illustration of the method application
In order to illustrate the reliability of our method in inferring bio-geographic ancestry from mtDNA, we compared, for individuals sampled worldwide, the haplogroup status determined from full mtDNA sequence data and the population affiliation known from the sampling region with the haplogroup and corresponding geographic information obtainable from our multiplex SNP assays (Table 4). The data used for this purpose consisted of 75 samples from the Centre d'Etude du Polymorphisme Humain-Human Genome Diversity Project (CEPH-HGDP) panel [44] for which entire mitochondrial genome sequences are available [45]. From the full mtDNA sequences we extracted the alleles of those SNP sites that are included in our assays and used the resulting genotypes to infer haplogroups and respective geographic regions of matrilineal origin. In all cases, the haplogroups inferable by our assays were consistent with the full sequence-based haplogroups (although a more detailed haplogroup assignment could be achieved from the sequence data as expected); accordingly, the regions of bio-geographic ancestry derived from the assay-inferable haplogroups were in agreement with the individuals' sampling origins (Table 4). For example, sample HGDP01076 is an individual from Sardinia (Italy) whose full mtDNA sequence can be classified as haplogroup J2b1a; our assays would predict the haplogroup of this person as J with Western Eurasian geographic origin. Notably, the HGDP samples from Pakistan exhibit both Western Eurasian and Southern Asian haplogroups (for example, HGDP00163 belongs to Western Eurasian haplogroup H2a and HGDP00165 belongs to Southern Asian haplogroup M30), consistent with previous observations (see Discussion above). Similarly, the Bedouin samples belong to both African and Western Eurasian haplogroups.

Conclusions
We developed an efficient and sensitive method for the multiplex genotyping of informative mtDNA SNPs, allowing for the inference of a person's matrilineal bio-geographic ancestry at a continental level. We would like to emphasize that matrilineal ancestry must be seen as reflecting only one aspect of the overall bio-geographic ancestry of a person [5,6,46]. A more accurate establishment of the overall bio-geographic ancestry is achievable when mtDNA is used in conjunction with informative Y-chromosomal (in the case of males) [8] and autosomal ancestry-informative DNA markers [47-50], especially when a person's biological ancestors are from different geographic regions resulting in mixed bio-geographic ancestry.
Multiplex 3 genotype table. Columns: inferred haplogroup; SNP positions 290-291, 2092, 3330, 3826, 6285, 6374, 11177, 11365, 11959, 12007, 14433 and 14502 (deletion allele denoted 290-291d).
[...] N). The three multiplex assays were each performed on five different starting amounts of DNA template, ranging from 0.25 ng to 0.001 ng. Grey circles indicate marker dropouts that occur at the very low DNA concentration whereas grey arrows indicate cases where allele calling becomes difficult due to artefacts that come up at the low DNA concentrations.
[...] Table 1 for Multiplex 1, in Table 2 for Multiplex 2 and in Table 3 for Multiplex 3. Extended primers were separated by capillary electrophoresis on a 3130xl Genetic Analyzer (Applied Biosystems) using POP-7 polymer by loading a mixture of 1 μL purified extension product, 8.8 μL Hi-Di formamide (Applied Biosystems) and 0.2 μL GeneScan-120 LIZ internal size standard (Applied Biosystems). Results were analysed using GeneMapper version 3.7 software (Applied Biosystems).

Dilution series
For the purpose of sensitivity testing, genomic DNA from four individuals of different matrilineal continental origin was extracted from buccal swabs. For each individual, the DNA was diluted to obtain a solution of precisely 1 ng/μL as determined by two independent Quantifiler (Applied Biosystems) measurements. A dilution series was made from each of the four 1 ng/μL DNA solutions, producing concentrations of 0.25, 0.063, 0.016, 0.004 and 0.001 ng/μL for each individual. Concentrations of the dilutions were measured again and confirmed by triplicate Quantifiler measurements. The Quantifiler assays were carried out according to the manufacturer's recommendations, except for the addition of two extra dilutions to the recommended standard curve to be able to measure the very low DNA concentrations.
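The reported concentrations are consistent with a four-fold serial dilution starting from 1 ng/μL. The dilution factor is an inference from the listed values, not something stated in the text, and the helper name `serial_dilution` is hypothetical. A minimal sketch that reproduces the series:

```python
from decimal import Decimal, ROUND_HALF_UP


def serial_dilution(start, factor, steps):
    """Return the concentrations of a serial dilution, rounded to 0.001 ng/uL."""
    conc = Decimal(str(start))
    series = []
    for _ in range(steps):
        conc = conc / Decimal(str(factor))  # carry the unrounded value forward
        series.append(conc.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP))
    return series


if __name__ == "__main__":
    # Assumed four-fold steps from the 1 ng/uL stock.
    print([str(c) for c in serial_dilution(1, 4, 5)])
```

`Decimal` with half-up rounding is used so that 0.0625 is reported as 0.063, matching the value given in the text.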