Early modern human dispersal from Africa: genomic evidence for multiple waves of migration

Background Anthropological and genetic data agree in indicating the African continent as the main place of origin for anatomically modern humans. However, it is unclear whether early modern humans left Africa through a single, major process, dispersing simultaneously over Asia and Europe, or in two main waves, first through the Arab Peninsula into southern Asia and Oceania, and later through a northern route crossing the Levant. Results Here, we show that accurate genomic estimates of the divergence times between European and African populations are more recent than those between Australo-Melanesia and Africa and incompatible with the effects of a single dispersal. This difference cannot possibly be accounted for by the effects of either hybridization with archaic human forms in Australo-Melanesia or back migration from Europe into Africa. Furthermore, in several populations of Asia we found evidence for relatively recent genetic admixture events, which could have obscured the signatures of the earliest processes. Conclusions We conclude that the hypothesis of a single major human dispersal from Africa appears hardly compatible with the observed historical and geographical patterns of genome diversity and that Australo-Melanesian populations seem still to retain a genomic signature of a more ancient divergence from Africa Electronic supplementary material The online version of this article (doi:10.1186/s13323-015-0030-2) contains supplementary material, which is available to authorized users.


Populations and markers
We combined genomic data from several published datasets: the Human Genome Diversity Cell Line Panel [32]  We devised a careful strategy to combine the seven datasets genotyped with different platforms according to different protocols developing a pipeline built on Perl. First, for each dataset, we checked for the presence of old rs ids, if necessary changing them with the new ones.
Then, we looked for the SNPs shared among all datasets and we mapped the genome positions of these variants to the human reference genome, build hg18 (NCBI 36).
When merging data from different SNP-chip versions, strand identification can be ambiguous, possibly leading to mistakes in identifying the right alleles for A/T and G/C SNPs (as also reported in the PLINK tool documentation [37]). Thus, to preserve as much genetic information as possible, we selected from each dataset only these ambiguous SNPs and we used the information contained in Affymetrix Annotation file to evaluate the strand polarity used to define each allele. We considered each dataset separately and we annotated the SNPs on the plus strand, flipping only the proper SNPs. We checked the reliability of this conversion process comparing the allele frequencies for these SNPs in specific populations typed in more than one 6 dataset (i.e Besemah, CEU, Onge), so as to verify the consistency of the frequency spectrums between the different datasets. Once these ambiguities have been resolved, with the PLINK v 1.07 software [37] we merged progressively the datasets selecting, from each one, just the individuals from populations of our interest and flipping SNPs discordant for strand.
Using the same software, we selected only the autosomal SNPs with genotyping success rate >98% and minor allele frequency (MAF) >0.01. We identified cryptic relatedness amongst samples computing identity-by-descent (IBD) statistic for all pairs of individuals, as unmodeled excess of genetic sharing would violate sample independence assumption of downstream analyses. When pairs of individuals showed a Pi-Hat value > 0.3, we removed the individual with the lowest genotyping rate. We did not apply this screening procedure for the South-East Asia and Oceania samples, since they come from populations with extremely low effective sizes, where a certain degree of random inbreeding is inevitable [38]. To determine whether there were genetic outliers within each population, we conducted in PLINK a "distance to the nearest neighbor analysis" (--neighbor option). Within each population, the measure of similarity in terms of identity by state (IBS) between each individual and their nearest neighbor was calculated and transformed into a Z-score. Z score distributions were examined from the first to the fifth neighbor. Outliers were identified by an extremely negative Z-score produced by low allele sharing with their nearest neighbor and were then dropped from the population. We grouped populations according to ethnological and linguistic information; the final dataset is shown in Figure 1a.
To visualize the genetic relationships between such populations, we performed a Principal Component Analysis using the R [39] SNPRelate package.

Population structure analysis
Individual genotypes were clustered, and admixture proportions were inferred, by the algorithm embedded in the software ADMIXTURE, based on the principle of maximum likelihood [40]. This method considers each genotype as drawn from an admixed population with contributions from k hypothetical ancestral populations. Because this model assumes linkage equilibrium among markers, we checked with the PLINK v1.07 tool [37] that the set of SNPs we used did not show a level of Linkage Disequilibrium higher than r 2 =0.3; this way, in the pruned dataset 54,978 markers were retained. The optimal value of k was evaluated through a crossvalidation procedure, testing values from k=2 to k=14, thus identifying the number of ancestral populations for which the model had the best predictive accuracy. We then ran an unsupervised analysis, assuming a number of ancestral admixing populations from k=2 to k=7. The proportion of the individuals' genome belonging to each ancestral population was calculated for each k value from 5 independent runs, then combined by the software CLUMPP [41] and plotted by the software Distruct [42].

Discriminant Analysis of Principal Components
In addition to ADMIXTURE, to identify and describe clusters of genetically related individuals we used a Discriminant Analysis of Principal Components (DAPC) [43] implemented in the R [39] package adegenet ver. 1.3-9.2 [44].DAPC methods allow one to assess the relationships between populations overlooking the within-group variation and summarizing the degree of between group variation. Being a multivariate method, DAPC is suitable for analysing large numbers of genome-wide SNPs, providing assignment of individuals to different groups and an intuitive visual description of between-population differentiation. Because it does not rely on any particular population genetics model, DAPC is free of assumptions about Hardy-Weinberg equilibrium or linkage equilibrium [43], and so we could use the full set of 96,156 SNPs for this analysis.
By the function find.clusters, we determined the most likely number of genetic clusters in our dataset, using all principal components (PCs) calculated on the data. The method uses a Kmeans clustering of principal components [10] and a Bayesian Information Criterion (BIC) approach to assess the best supported number of clusters.
Then, we determined the optimal number of principal components (PCs) to retain to perform a discriminant analysis avoiding unstable (and improper) assignment of individuals to clusters. It is worth noting that, unlike K-means, DAPC can benefit from not using too many PCs: retaining too many components with respect to the number of individuals can lead to over-fitting and instability in the membership probabilities returned by the method.

Population divergence dates
The divergence times between populations (T), was estimated from the population differentiation index (FST) and the effective population size (Ne). FST is the proportion of the total variance in allele frequencies that is found between groups and it was calculated between pairs of populations for each SNP individually under the random population model for diploid loci, as described by Weir and Cockerham [45], and then averaged over all loci to obtain a single value representing pairwise variation between populations . Under neutrality, the differences between populations accumulate because of genetic drift, and so their extent depends on two quantities: it is inversely proportional to the effective population sizes (Ne) and directly proportional to the time passed since separation of the two populations (T).
Therefore, to estimate T from genetic difference between populations, independent estimate of Ne are needed; for this purpose we focused on the relationship between Ne and the level of linkage disequilibrium within populations. Indeed, levels of LD also depend on Ne, and on the recombination rate between the SNPs considered [46], with LD between SNPs separated by large distances along the chromosome reflecting the effects of relatively recent Ne, whereas LD over short recombination distances depending on relatively ancient Ne [47]. Thus, we estimated LD independently in each population using all polymorphic markers available for that population (MAF > 0.05), from a minimum of ~ 90,000 SNPs in Polynesia to a maximum of ~ 370,000 SNPs in North India. This way, we also reduced the impact of ascertainment bias, i.e. the bias due to the fact that most SNPs in the genotyping platforms were discovered in a single (typically European) population [48].
We assigned to each SNP a genetic map position based on HapMap2 (Release #22) recombination data, and for each pair of SNPs separated by less than 0.25 cM we quantified LD as r 2 LD [49] or as σ 2 LD [50] (hereafter: ρ). All the observed ρ values were then binned into one of 250 overlapping recombination distance classes. Pairs of SNPs separated by less than 0.005 cM were not considered in the analysis, since at these very short distances gene conversion may mimic the effects of recombination [46]. We also adjusted the ρ value for the sample size using [46]. Finally, we calculated the effective population size for each population in each recombination distance class as , corresponding to the effective population size generations ago, where c is the recombination distance between loci, in Morgans [47,51,52]. Finally, the long-term Ne for each population was calculated as the harmonic mean of Ne over all recombination distance classes up to 0.25 cM. The confidence intervals of these Ne values were inferred from the observed variation in the estimates across chromosomes.
Based on the independently-estimated values of Ne, we could then estimate T as [53] where time is expressed in generations.
All procedures were performed by in-house developed software packages, NeON [54] and 4P [55].

Simulations
To understand whether the divergence times estimated were compatible with a SD model, we used a neutral coalescent approach to simulate genetic polymorphism data under the infinitesites model of mutation. We simulated data representing 1Mb chromosome segments in two populations according to the demographic scenario shown in Additional File 3 using the coalescent simulator ms [56]. We assumed an ancestral population with an initial Ne= 10,000. At t = T, the population splits into two populations. Population_2a's Ne remains constant, population_2b has a 50% reduction in Ne followed by an exponential growth, representing the genetic bottleneck experienced by populations dispersing out of Africa. In all simulations the scaled mutation rate (θ), and the scaled population recombination rate (ρ) were fixed at 400. For a sequence length of 1Mb and an effective population size of Ne= 10,000 these parameters correspond to a mutation rate of 10 -8 and a recombination rate of 1 cM/Mb.
To account for the uncertainty in the estimates of both the timing of the process and the effective population sizes, according to Ref. [57][58][59], this model was simulated considering 4 different separation times (T) (between 40,000 and 70,000 years ago, in steps of 10,000 years) and 11 1,000). For each of the 24 simulation conditions, 1,000 independent datasets were simulated and then analyzed according to the following procedure: 1) A sample of 50 individuals (i.e. 100 chromosomes) was randomly selected from each population. The simulated genetic data were single nucleotide polymorphisms (SNPs) segregating within the two populations.
3) Any SNPs with a minor allele frequency (MAF) less than 0.05 were removed from the datasets.

4)
We estimated the population differentiation index, effective population size and divergence time between the two simulated populations following the same procedures used for the observed data and detailed above.

Possible effects of a Denisovan admixture in Melanesia
To rule out the possibility that the divergence time estimated between Africans and New Guinea/Australia samples could reflect, largely or in part, admixture between the Denisovan archaic human population from Siberia [60] and the direct ancestor of Melanesians, we removed from our dataset the variants could be regarded as resulting from such a process of introgression.
These SNPs would carry the derived state in the archaic population and in the New Guinean/Australian samples, while being ancestral in East Africans and Europeans (i.e. those populations that did not show any signal of introgression from Denisova [5,60]).
Using the VCF tools [61] we extracted our 96,156 SNP from the high coverage Denisovan genome.
We then removed from these data all transitions SNPs (C/T and G/A) because in ancient DNA these sites are known to be prone to a much higher error rate than the transversions [5] .Then, we selected the sites meeting the following set of criteria: -the site has human-chimpanzee ancestry information; -the human-chimpanzee ancestral allele matches one of the two alleles at heterozygous sites; -Denisova has at least one derived allele; -New Guineans and Australians have at least one derived allele; -in East African and European individuals the ancestral allele is fixed; When the ancestry information was missing (1,438 SNPs), to define the ancestral state, we used the East African individuals selecting the SNPs where East Africans were homozygous and considering those as ancestral.
Once we had thus identified a subset of sites putatively introgressed from Denisova, we removed them from the dataset. The remaining 80,621 SNPs were used to compute the pairwise FST [45] values and to infer the divergence time between populations, as described above.

Testing for the effect of recent gene flow
Using TreeMix, we inferred from genomic data a tree in which populations may exchange migrants after they have split from their common ancestor, thus violating the assumptions upon which simple bifurcating trees are built [62]. This method first infers a maximum-likelihood tree from genome-wide allele frequencies, and then identifies populations showing a poor fit to this tree model; migration events involving these populations are finally added. This way, each population may have multiple origins, and the contributions of each parental population provide an estimate of the fraction of alleles in the descendant population that originated in each parental population.
Allele frequencies for the TreeMix analysis were calculated by PLINK tool [37], after pruning for linkage disequilibrium as we did for ADMIXTURE analysis. We modelled several scenarios allowing a number of migration events from 0 to 6, and stopping adding a migration when the 13 following event did not increase significantly the variance explained by the model. The trees were forced to have a root in East Asia, and we used the window size of 500 (-k option).

Geographical patterns of dispersal
We developed explicit geographic models of demographic expansion, and looked for the model giving the closest association between genomic and geographical distances. In all cases, migration routes were constrained by 5 obligatory waypoints, identified in Ref. [26] and accepted by several successive studies (see e.g Ref. [13]). In addition, because of some inconsistencies in the definition of the geographic regions affected by the two waves of migration under MD [12,14,16], To obtain a realistic representation of migrational distances between populations, we did not simply estimate the shortest (great-circle) distances between sampling localities. Rather, we modelled resistance to gene flow, based on the landscape features known to influence human dispersal. We used a resistance method from the circuit theory implemented in the software Circuitscape v.3.5.2 [63], starting from the landscape information in Ref [64] and referring to the distribution of land masses at the last glacial maximum. Next, we added data about altitude and 14 river presence from the Natural Earth database. Each area of the map was assigned a resistance value (rv) by the Reclassify tool in ArcGIS 10 (ESRI; Redlands, CA, USA), as follows: mountains higher than 2,000 m: rv=100; land or mountains below 2,000 m: rv=10; rivers: rv=5, oceans: NoData (absolute dispersal barrier); narrow arms of sea across which prehistoric migration is documented: rv=10. The low rv for rivers reflects the human tendency to follow, whenever possible, water bodies in their dispersal (see e.g. Ref. [65]).
Under the SD model we hampered movement from Arabia to India (rv=100), hence preventing the dispersal along the Southern route; under the MD models, we created a buffer of low resistance value (rv=1) along the Southern route. For all models we then estimated leastresistance distances between the populations analyzed, when applicable going through Addis Ababa, chosen as a starting point for the African expansion [26]. The final output was then exported in Google Earth where geographic distances were expressed in kilometers.
We evaluated by partial Mantel tests [66] the correlation between genomic (FST) and geographic distances, while holding divergence times constant. This way we could control for the drift effects, due to the fact that populations separated at distinct points in time and space.

Genomic structure of Old World populations.
We assembled genome-wide SNP data from the literature obtaining information As a preliminary step, we visualized by Principal Component Analysis the genetic relationships between such populations, as inferred from these autosomal SNPs (Fig. 2). The first two principal components, accounting respectively for 8.4% and 4.3% of the total genetic variance, show that the populations we grouped in meta-populations do cluster together genetically. In addition, genetic relationships largely correspond to geographical distances, with Eurasian populations separated from the African ones along the axis represented by PC1, and forming an orderly longitudinal cline, all the way from Europe to East Asia and Oceania, along the PC2 axis.
Then, to further investigate the worldwide genomic structure, we applied the unsupervised ancestry-inference algorithm of the ADMIXTURE software [40]. After identifying k=6 as the most

Population divergence dates.
There is a clear geographical structure in the data, which in principle allows one to test for the relative goodness of fit of the two models. The SD model implies that the separation time from  Guinea on the other, respectively at the lower and the upper tails of the distribution. Even considering the full range of uncertainty around these estimates (95% of the confidence interval) we observed no overlap, with Europe having an upper confidence limit 77/71K years ago (depending on the LD measure used, respectively the r 2 and σ 2 statistic) and Australia having a lower confidence limit 88/80K years ago. Because we kept into consideration the effects of Ne in the estimation procedure, this difference cannot possibly be accounted for by the different impact of genetic drift upon these populations, and supports a rather complex ''Out of Africa'' scenario, suggesting at least two main phenomena of AMH dispersal from Africa. The Australo-Melanesian populations, i.e. Australians and New Guineans, with their early separation times from East Africa, may be regarded as the putative descendants of an early dispersal process, whereas the status of most Asian populations would seem, at this stage of the analysis, unclear.

Comparing the predictions of single vs multiple African exit models: Divergence times.
Having shown that significantly different times of separation from Africa are estimated for Europe and Australia/New Guinea, the question arises whether it would be possible to obtain such results by chance alone, had AMH dispersed in a single wave, at the time period at which that dispersal is generally placed (in the calculations that follow, we always considered the Ne and T estimates obtained using the unweighted r 2 statistic). To answer, we needed a null distribution of T values under the SD model, which we constructed by simulation, using the software ms [56]. We plotted the (null) distribution of the 24,000 separation times derived from the simulations and we compared it with the observed T estimates. Whereas the value estimated in the European sample falls perfectly within the range of times predicted by the classical SD model, that is not the case for the New Guinean and the Australian values, falling in the right tail of this distribution at P<0.05 level (Fig. 4). This can only mean that a single exit from Africa, even considering the uncertainty in our knowledge of the relevant parameters, cannot account for the differences in the separation times from Africa observed, respectively, in Europe on the one hand, and in Australo-Melanesia on the other.

Recent analyses of the genetic relationships between modern humans and Denisovans
suggested that a fraction possibly as high as 6-8% of the Melanesian genomes may be traced back to Denisovan ancestor [60]. To rule out the possibility that the apparent difference in African divergence times for Europe and Australo-Melanesia may somewhat reflect Denisovan admixture, we removed from the analysis the SNPs that were identified as representing the Denisovan contribution to the latter's genome. We recalculated the FSTs from the 80,621 SNPs obtained from the filtering process, and reestimated the divergence times from Africa, finding they are still very close to those previously estimated (Additional File 10 and 11).

Estimates of population admixture.
Other Far Eastern populations, besides Australia and New Guineans for which we estimated a remote separation from Africans, may have taken part in an early exit from Africa through a Southern route. Identifying them is not straightforward, though, because we basically have a continuous set of divergence times from East Africa, from 66K to 107K years ago (Table 1).
This result is consistent with both a continuous migration process from Africa across some 40K years (which so far has never been proposed, to the best of our knowledge) and with an early exit, followed by genetic exchanges with later-dispersing groups, which has diluted or erased altogether the genetic evidence of the earliest migration. Our previous ADMIXTURE analysis highlighted an ancestral genetic component (green, Figure 3 As for admixture among modern populations after the split from Africa, which may affect estimates of their divergence time [70], the demographic history of large sections of Asia is too complex and elusive to allow one to tell apart admixed and non-admixed groups. However, we argue the impact of modern admixture upon our results cannot be too strong, because geographically-intermediate populations were excluded from the final analysis. This way, significant differences in time estimates were observed for populations (Europe, the Caucasus, New Guinea and Australia) showing a rather homogeneous genetic composition in the ADMIXTURE analysis, with most individual genotypes attributed to a single ancestral component ( Figure 3). 22 The method used in the present study allows us to safely rule out that fluctuations in longterm population sizes might have distorted our time estimates. Three-fold differences in very ancient (e.g. > 100,000 years ago; Additional File 7) population sizes may appear, at a first sight, difficult to justify, because at that time all Ne values should converge to a value representing the size of the common ancestral African population. However, a similar result was also obtained in the only previous study based on the same method [18], and interpreted as reflecting founder effects accompanying the dispersal from Africa. In turn, these phases of increased genetic drift may have increased LD, and hence caused underestimation of Ne in all non-Africans. However, the resulting distortion, if any, should have affected the absolute values of T, but not the relative timing of the Europeans' and Asians' separation from Africans, which is what this study is concerned with. Another possibility is that 100,000 or so years ago the ancestors of current Eurasians were already genetically distinct from the ancestors of modern Africans (as proposed by Refs. [71,72]). If so, the different Ne estimates of the present study would not be a statistical artifact, but would reflect actual differences between geographically-isolated ancient populations. When we modeled population dispersal in space, the correlation between genetic and geographic distances was higher under the MD than under the SD model, but this difference was statistically insignificant (Table 2). This seems likely due to the fact that the three models being compared share several features, such as the same set of geographic/genetic distances for the European populations, which reduces the power of any test. However, the separation times previously estimated made us confident that the SD model is inconsistent with the data, and so what was really important at this stage was the comparison between the two MD models. The better fit of model 3 than model 2 implies that the MD model works better if only part of the Asian genomic diversity is attributed to the earliest dispersal. A better fit of a MD than of a SD model was also observed in parallel analyses of cranial measures and of a much smaller genomic dataset [13], suggesting that our findings may indeed reflect a general pattern of human diversity.
The data we analyzed are probably affected, to an unknown but not negligible extent, by a bias due to the fact that most SNPs in the genotyping platforms were discovered in European populations; however the measure we used to calculate Ne and hence the separation time, LD, is expected to be relatively unaffected by this kind of bias [18,73]. At any rate, a likely effect of such a bias would be a spurious increase of the estimated differences between Europeans and the populations being compared with them, Africans in this case. Quite to the contrary, here the Europeans appeared significantly closer to Africans than Australo-Melanesians, a result which therefore cannot be due to that kind of ascertainment bias.
Can selection account, at least in part, for these findings? In principle, we have no way to rule this out. However, in practice, even though positive selection may have extensively affected the human genome, large allele-frequency shifts at individual loci are surprisingly rare [74], so much so that so far only for very few SNPs any effects of selection have been demonstrated [75]. If we also consider that genomic regions with large allele-frequency differences are not generally associated with high levels of linkage disequilibrium, in contrast with what would be expected after a selective sweep [74,76], it seems fair to conclude that the main allele frequency shifts occurred in a rather remote past and are unlikely to reflect geographic differences in the selection regimes [77]. In any case, only 8% of the SNPs we considered map within expressed loci, or in their control regions ( Figure 5); therefore, the impact of selection upon the results of this study, if any, can hardly be regarded as substantial.

Conclusions
Analyses of genomic data based on different sets of assumptions and different methods agree in indicating: (i) that a model with a single early dispersal from Africa fails to account for one crucial aspect of human genome diversity, the distribution of divergence times from Africa, and (ii) that within the model of multiple dispersal, geographical patterns of genome diversity are more accurately predicted assuming that not all Asian and New Guinea/Australian populations have had the same evolutionary history.
In the light of these results, we propose that at least two major dispersal phenomena from Africa led to the peopling of Eurasia and Australo-Melanesia. These phenomena seem clearly distinct both in their timing and in their geographical scope.
The view whereby only part of the ancestors of current non-African populations dispersed through the Levant has some non-trivial consequences upon the possible interactions between AMH and archaic forms, traces of whose genomes have been identified in many non-African populations, including New Guineans [4,78]. The estimated contribution of Neandertals is less in the European than in the Asian/Melanesian genomes, despite the long coexistence between Neandertals and Europeans [79]. At present, the standard way to explain this finding is to assume one single, major episode of hybridization in Palestine (or perhaps further North and East [80]) 47K to 65K years ago [58], followed by a split between the Europeans' ancestors on the one hand, and the Asians' and Oceanians' on the other [80,81]. After that, additional contacts might have occurred, but only between Neandertals and Asians [82]. However, if most ancestors of New Guineans dispersed through a Southern route, as this study shows, they would have missed by 2,000 km or so the nearest documented Neandertals with whom they could have intercrossed. 25 Thus, this study raises the possibility that the current patterns of human diversity need more complex models to be fully explained. One possibility is that admixture with Neandertals might have occurred before AMH left Africa [83]. Another is that common ancestry, rather than hybridization, may account for the excess similarity of Eurasians with Neandertals, in the presence of an ancient structuring of populations [72,84]. A third possibility is that the apparent traces of Neandertal hybridization in Papua New Guinea may in fact be due to Denisovan admixture. We are not in a position to actually test for these possibilities, but exploring these hypotheses may contribute to a better understanding of the main human dispersal processes and of the relationships between archaic and modern human forms.     Table 1 | Estimated divergence times from (East) Africa using the r 2 or σ 2 statistics as estimator of LD level. For each comparisons with East African, the three columns report the 95% lower 35 confidence limit, the point estimate (in years, assuming a generation interval =25 years [85]), and the 95% upper confidence limit. permutations of one matrix' rows and columns.