- Open Access
Contrasting signals of positive selection in genes involved in human skin-color variation from tests based on SNP scans and resequencing
Investigative Geneticsvolume 2, Article number: 24 (2011)
Numerous genome-wide scans conducted by genotyping previously ascertained single-nucleotide polymorphisms (SNPs) have provided candidate signatures for positive selection in various regions of the human genome, including in genes involved in pigmentation traits. However, it is unclear how well the signatures discovered by such haplotype-based test statistics can be reproduced in tests based on full resequencing data. Four genes (oculocutaneous albinism II (OCA2), tyrosinase-related protein 1 (TYRP1), dopachrome tautomerase (DCT), and KIT ligand (KITLG)) implicated in human skin-color variation, have shown evidence for positive selection in Europeans and East Asians in previous SNP-scan data. In the current study, we resequenced 4.7 to 6.7 kb of DNA from each of these genes in Africans, Europeans, East Asians, and South Asians.
Applying all commonly used neutrality-test statistics for allele frequency distribution to the newly generated sequence data provided conflicting results regarding evidence for positive selection. Previous haplotype-based findings could not be clearly confirmed. Although some tests were marginally significant for some populations and genes, none of them were significant after multiple-testing correction. Combined P values for each gene-population pair did not improve these results. Application of Approximate Bayesian Computation Markov chain Monte Carlo based to these sequence data using a simple forward simulator revealed broad posterior distributions of the selective parameters for all four genes, providing no support for positive selection. However, when we applied this approach to published sequence data on SLC45A2, another human pigmentation candidate gene, we could readily confirm evidence for positive selection, as previously detected with sequence-based and some haplotype-based tests.
Overall, our data indicate that even genes that are strong biological candidates for positive selection and show reproducible signatures of positive selection in SNP scans do not always show the same replicability of selection signals in other tests, which should be considered in future studies on detecting positive selection in genetic data.
Large-scale genotyping projects using genome-wide single-nucleotide polymorphisms (SNPs) have provided large amounts of data describing the genetic diversity of human populations [1–6]. Several statistical methods have been developed and used for detection of signatures of selective processes from genome-wide SNP data, which we refer to as 'SNP scans' . All these approaches try to recover fingerprints of selective sweeps by detecting signals in the haplotypic variation of a genomic region and/or the spectrum of the variation of the genetic diversity [8–15]. However, the results obtained with the different test statistics usually show limited overlap (see results from Voight et al.  Wang et al.  and Akey ), therefore, it would be desirable to compare the results from SNP haplotype-based tests using the 'gold standard' of full resequencing data and suitable statistical tests. The most widely used approach involves computing a set of neutrality-test values for sequence data from a particular genomic region, and then estimating the likelihood of such values are under neutrality. This is achieved either by comparing the computed one in the region of interest with the value of the statistic in other regions of the genome (that is, by using empirical distributions ) or with the values obtained under demographic (neutral) simulations ). However, it is also desirable to have an estimate of the selective parameters (s and h) rather than just rejecting the hypothesis of neutrality. The latter could be in principle obtained by applying Approximate Bayesian Computation (ABC) , a statistical technique used to recover the posterior distribution of parameters shaping the statistical model, which is applied when computing the likelihood of the data given the parameters is not possible, but it is possible to simulate data under the model of interest . ABC has proven to be a valuable statistical tool for making inferences about demographic parameters in population genetics. Moreover, it has also been used in estimating selective parameters [20, 21].
Skin pigmentation is an excellent candidate system for investigating positive selection. It is very likely that this trait is under selective pressure, given the biological role of pigmentation and the large differences seen in pigmentation intensity between continental populations . For a large number of genes involved in the pigmentation pathway, signatures of recent selective sweeps have been suggested from SNP-scan data [12, 14, 23–26], and DNA sequence-based evidence has been tested for a limited number of the genes associated with human skin pigmentation (melanocortin 1 receptor (MC1R), solute carrier family 45 member 2 (SLC45A2), tyrosinase-related protein 1 (TYRP1, dopachrome tautomerase (DCT) and tyrosinase (TYR)[27–30]. In a previous SNP-based study, we identified signatures of selective sweeps in the genes oculocutaneous albinism II (OCA2),TYRP1, DCT, and KIT ligand (KITLG) in Europeans, in OCA2, DCT, KITLG, epidermal growth factor receptor (EGFR) and dopamine receptor D2 (DRD2) in East Asians. In contrast, Africans did not show any evidence of positive selection in any of the genes that we tested . Evidence for selection in OCA2, DCT, KITLG, and TYRP1 was also shown by other SNP-based studies applying similar haplotype-based tests to data from other samples [23–26].
In the present study, we generated DNA sequence data from approximately 4.7 to 6.7 kb of each of the four genes OCA2, TYRP1, DCT and KITLG in Africans, Europeans, East Asians and South Asians, and applied both neutrality tests and an ABC-Markov chain Monte Carlo (MCMC) approach for detecting evidence of positive selection. Furthermore, we compared the outcomes from such sequence-based test with our previous results from haplotype-based tests using SNP-scan data.
We sequenced DNA regions of approximately 4.7 to 6.7 kb from each four genes, OCA2, DCT, TYRP1 and KITLG, involved mainly in human skin pigmentation, in 24 to 26 DNA samples from African, European, East Asian and South Asian populations (Table 1). In total, we detected 146 polymorphic positions: 47 in the OCA2 fragment, 35 in the TYRP1 fragment, 31 in the DCT fragment and 33 in the KITLG fragment (Table 1).
Neutrality tests and their statistical significance
The application of the four-gamete rule between all pairs of SNPs within each population suggested that, for all genes except TYRP1, the sequenced region could be considered as a single block (see Additional file 1). We compiled results from commonly used neutrality tests under different demographic models and using an empirical distribution (Table 2), and computed the two-tailed P value for each statistic given the population and gene (see Additional file 2). We found discrepancies in rejecting the null hypothesis of neutrality depending on which expected neutral distribution was used. After Bonferroni correction for multiple testing (P<0.05/96 computed tests for each expected distribution considered under neutrality in the case of the cosi model (CM) and the Gutenkunst model (GM, a best-fit demographic model); and P<0.05/60 computed tests in the case of Encyclopedia of DNA Elements (ENCODE) data (ED)), Fu's Fs statistic was significant for the TYRP1 gene and the Han Chinese from Beijing (CHB) group when using the empirical distribution from the ED, and marginally significant (P = 0.045) for the OCA2 gene and the Council for Education on Public Health (CEPH) Utah (CEU) group when using the empirical distribution from CM. None of the neutrality tests reached statistical significance after Bonferroni correction using the CM, and only the Fu and Li D and D* statistics were significant for KITLG in Europeans using the GM. Combining the P values of the different statistics produced a different picture. We detected significant (combined P< 0.05) departures from neutrality in the African and European populations for OCA2, the African population for DCT, and the European population for KITLG using the CM, and for KITLG in the European population using the GM. By contrast, ED-based P values were significant for TYRP1 in the Asian populations.
To visualize the phylogenetic relationships between the haplotypes and to provide additional insights into their evolutionary history, we constructed median-joining networks. Overall, the networks had few reticulations (Figure 1). On average, the African population tended to show a large number of haplotypes at low frequency, whereas the European, East Asian and Pakistanis populations all had two main haplotypes separated by a large number of mutations. This was particularly dramatic in the case of KITLG and DCT, and less striking for OCA2 and TYRP1.
Approximate Bayesian computation/Markov chain Monte Carlo analyses
We first estimated demographic parameters for our simplified OOA (out-of-Africa) model (Figure 2) considering 50 loci from the Environmental Genome Project; the statistics of centrality and dispersion of the different parameters are described (Table 3). In all cases, the posterior distributions strongly diverged from the priors (which were all uniform) (Table 3), and thus were influenced by the data. Posterior estimates of the parameters related to a putative selective event for each of the four genes under study are shown (Table 4). As controls, we also performed ABC-MCMC with data from SLC45A2 and from a simulated neutral region of 5 kb. Histograms of the posterior distributions for each gene are provided (see Additional file 3). Posterior distributions of the parameters were similar to the prior distributions in all the cases except for SLC45A2; in this case, selection was restricted to Europeans, beginning after the split from East Asians and fitting a dominant model of inheritance (h = 1). The estimated posterior distribution of σ was skewed towards large values.
In the present study, we focused on reinvestigating previous conclusions about positive selection based on long-range haplotype (LRH) tests, using four genes putatively associated with human pigmentation. The phylogenetic networks for each gene based on sequence data were in agreement with our previous findings ; the different populations tended to show a high frequency of one of the major haplotypes, which tended to diverge from the others by a large number of mutations, and the single SNP differentiation between populations was also in agreement with previous results. This was particularly evident for OCA2, KITLG and DCT in European and Asian populations, and less evident in the African population. The Pakistani population, geographically situated between the Europeans and Asians, shares the main haplotypes with these two populations. The presence of long network branches within each population can be indicative of balancing selection ; however, we failed to replicate previous LRH findings with the sequence-based tests, and we observed dependence of the statistical significance of the sequence-based tests on which neutrality distribution was used. Only the KITLG in the European population had statistical departures from neutrality in the CM and GM, which is in agreement with the outcome from the LRH test, but neutrality could not be rejected using ENCODE data. Furthermore, we were not able to replicate a previously suggested signal for the DCT gene in Asian populations . As the whole DCT gene is a single linkage disequilibrium (LD) block, as seen in the HapMap East Asian data, it seems unlikely that the discrepancy is explicable by the different DCT regions sequenced. Indeed, the agreement between different SNP-scan studies has been described as 'underwhelming' .
The discrepancies we detected here between haplotype-based and sequence-based test outcomes can be explained by a number of factors.
First, we cannot exclude the possibility that the positive-selection signals from our previous SNP-based study were false positives; the complex demographic history of humans [12, 32, 33] and the power dependency of the tested site  can affect the outcome of such tests.
Second, it has been emphasized that the SNP ascertainment bias introduced during marker discovery [35, 36] and genotyping array can lead to spurious false-positive findings in haplotype-based tests [37, 38].
Third, there might be a lack of power in the sequence-based tests because of the small sample sizes and/or small sequenced regions . Although we cannot exclude this possibility, the length (approximately 5 kb) sequenced from the four genes proved to be sufficient to detect departures from neutrality in SLC45A2 in the European population (data not shown).
A fourth possibility is that the distributions that we computed for each statistic under neutrality do not represent the true underlying distribution for the human species. Parameters of the demographic events need to be defined a priori, which in humans is challenging because of the complex history of migrations, admixture, expansions and bottlenecks . The differences seen in the values of the parameters could be indeed a major source of variation. The ENCODE data we used as an alternative is hampered by the fact that the considered regions were ascertained based on their genomic peculiarities , and they may not be representative of the genetic variability of the genome. There has been progress in resequencing entire genomes (for example, the 1000 Genomes Project; http://www.1000genomes.org/page.php); however, current projects rely on combining low-coverage data from multiple samples, and are not able to produce the accurate sequence for each genome that is needed for such comparisons .
The fifth, and perhaps most likely, reason for discrepancies between LRH and sequence-based tests we observed here may be the different underlying assumptions of the evolutionary models used (that is, instantaneous selective sweep versus incomplete selective sweeps) in the definition of each statistic, and the evolutionary timescale over which each type of test can recover departures from neutrality . In that case, our results might indicate an extremely recent selection in the pigmentation genes, which would be recovered by haplotype-based but not sequence-based tests.
We also used a Bayesian approach to estimate selective parameters of the populations putatively under positive selection in each gene. The demographic parameters were on average in concordance with those described in previous studies [43–46]; however, it should be noted that they were not entirely comparable as there were a large number of differences in the assumptions of the models and data. Despite this technical limitation of the approach, the estimates of the time when selection started and the mode of inheritance correlated well with expectations in the case of SLC45A2, independently of whether the complete 10 kb sequence or a subsample of 5 kb was used (data not shown). To our knowledge, this is the first time this known selective sweep has been quantified in such a way. However, for OCA2, TYRP1, DCT and KITLG and the neutral simulated region, the strong resemblance between the prior and the posterior distributions suggests that the latter are mainly dominated by the priors rather than by the information contained in the genetic data.
In this study, we have shown that using sequence-based neutrality tests to confirm signatures of positive selection derived from LRH tests on SNP-scan data can be difficult, even though there is a strong likelihood that the skin-pigmentation genes we studied have been targets of selection . Deciphering whether this is a consequence of the power of the different statistics to detect the fingerprint of selection on different timescales; different assumptions on the strength of the selective event; or lack of power due to experimental limitations seems challenging.
Our findings should be considered in future studies that set out to further investigate signatures of selection in other genes or regions of the human genome suggested by SNP-scan data. It may be argued that the final proof of positive selection should not be provided by additional genetic data but rather by functional evidence, but this may have its own caveats. A recent study  has demonstrated that the TRPV6 gene shows strong evidence of positive selection in all non-African populations tested using a novel modification of the extended haplotype homozygosity (EHH) test, but no functional differences between the ancestral and derived sequence could be detected using experiments relevant to the known gene function. Clearly, further advancements in the methods used to detect and validate putative signatures of positive selection are needed, and provide one of the most exciting areas for future developments. Complete understanding of positive selection in the human genome will require the combination of multiple lines of evidence.
Genomic DNAs (Coriell Institute for Medical Research; Camden, NJ, USA from randomly ascertained participants from three continents chosen from the HapMap panel  were used: 24 Yoruba from Ibadan, Nigeria (YRI); 24 Han Chinese from Beijing (CHB); and 24 CEPH Utah residents with ancestry from northern and western Europe (CEU). In addition, 26 (OCA2 and TYRP1) or 25 (DCT and KITLG) Brahui (BRU) from Pakistan  were included because of their distinct and intermediate geographic location and history. In some analyses, we used published data on SLC42A2 (Table 3 of Soejima et al.) in European-Africans from Cape Town, Ghanaians from Accra, and Chinese from Guangzhou; these data had excluded SNPs at a frequency of <5%. Genomic DNA samples from one chimpanzee (Pan troglodytes), one orangutan (Pongo pygmaeus) and one gorilla (Gorilla gorilla) (all from the European Collection of Cell Cultures, Salisbury, Wiltshire, UK) were also included to allow inference of the ancestral state of each SNP.
Ascertainment of the sequenced regions
For each of the four genes included in this study (OCA2, DCT, TYRP1 and KITLG) a DNA region of approximately 4.7 to 6.7 kb surrounding the most informative SNP was selected for resequencing (Table 1). The region of interest was chosen based on its high rate of differentiation (quantified by means of the mean informativeness of ancestry  in the region) between East Asians, Europeans and Africans (Figure 3). As established previously , any candidate for explaining the differences between the amounts of pigmentation should genetically co-vary with such differences. Further EHH analyses suggested that these regions showed evidence of selective sweeps . Additional evidence linking skin-pigmentation phenotype with the ascertained region is available for TYRP1; an LD r2 value of 0.704 was seen between rs2733832, which is associated with pigmentation, and rs683, which lies within the region ascertained for TYRP1 in HapMapII CEU. The SNPs included in the ascertained regions were rs2311806, rs7166228, rs7170869, rs2311805, rs1375164 and rs12442147 for OCA2; rs2209277, rs10960756, rs10809828, rs17280629, rs2733834, rs683, rs2762464, rs910 and rs1063380 for TYRP1; rs3782974, rs1325611, rs16949829, rs7992630, rs9516414, rs9524491, rs2892680 and rs7987802 for DCT; and rs1873681, rs3782179, rs3782180, rs3782181, rs4590952, rs1907702, rs1907703 and rs7953414 for KITLG. Although most of the ascertained regions consisted of intronic sequence, the TYRP1 and DCT fragments also contained exonic sequence: OCA2 (introns 2 to 3), TYRP1 (introns 6 to 7 and 7 to 8 plus exons 7 and 8), DCT (introns 6 to 7 + exon 7 + introns 7 to 8) and KITLG (introns 2 to 3).
Enzymatic amplification by PCR
Primers were designed from the human reference sequence obtained from GenBank (OCA2 accession number NC_000015.8; TYRP1 accession number NC_000009.10; DCT accession number NC_000013.9 and KITLG accession number NC_000012.9), and used to amplify fragments of approximately 6 to 7 kb (positions on chromosome: OCA2 25959080-25966266; TYRP1 12694505-12700830; DCT 93890084-93895889; KITLG 87482044-87476046) covering the chosen regions. PCR assays were performed using a volume of 25 μl containing approximately 40 ng of genomic DNA, 2 μmol/l of each primer, 200 μmol/l of each dNTP, 2 mmol/l MgSO4, 0.5 U high-fidelity Taq polymerase (Platinum Taq; Invitrogen Corp., Carlsbad, CA, USA). Amplification was performed in a themal cycler (Peltier; MJ Research Inc., Waltham, MA, USA) using the following cycling profile: 94°C for 2 minutes, followed by 35 cycles at 94°C for 30 seconds, 59°C (OCA2, DCT and KITLG) or 56°C (TYRP1) for 1 minute, and 72°C for 1 minute, and a final step at 72°C for 5 minutes. Subsequently, 3 μl samples of a 1:200 dilution of these PCR products were used as templates to reamplify overlapping fragments with sizes of approximately 350 to 700 bp. PCR assays for reamplification were performed in a volume of 15 μl containing 2 μmol/l of each primer, 1.6 mmol/l MgCl2, 200 μmol/l of each dNTP, and 0.5 U Taq (Platinum Taq; Invitrogen). The cycling conditions for the reamplification were 94°C for 2 minutes, followed by 31 cycles at 94°C for 45 seconds, 60°C for 45 seconds, and 72°C for 1.5 minutes; then 72°C for 3 minutes. Details of all primers used in this study are given in Additional File 4.
PCR products were purified (ExoSAP-IT; GE Healthcare, Princeton, NJ, USA) before sequencing. The primers used for reamplification (see Additional File 4) were used (individually) to sequence in both orientations, using a large-scale sequencing facility (Wellcome Trust Sanger Institute) with standard capillary methods. For each individual, each nucleotide position was determined from both strands by at least two reads each. The genomic DNA sequence for OCA2 (accession number NC_000015.8), TYRP1 (accession number NC_000009.10), DCT (accession number NC_000013.9) and KITLG (accession number NC_000012.9) were obtained from GenBank and used as the reference sequence for the relevant gene. Sequence traces were processed using the software program ExoTrace (Wellcome Trust Sanger Institute; http://www.sanger.ac.uk/ humgen/exoseq/analysis.shtml). Potentially polymorphic positions were flagged by the program, and were then checked manually. Variable positions were compared in overlapping and complementary reads for all samples. A quality-control test was performed to verify the SNPs called in each gene by comparing the genotype assignment performed by one investigator with the same genotype assignment performed by a second investigator. This test was performed on 2 kb of sequence per gene. The largest discrepancies were seen in the gene TYRP1, (3.4% discrepancies over all the polymorphic sites examined) and DCT (1.3%); KITLG and OCA2 rates of 0.28% and 0.6%, respectively. All of these discrepancies reflected differences in investigator interpretation, and all could be resolved by re-examining the traces. Therefore, for the rest of the sequenced regions, a maximum number of plausible variants was first identified by the two investigators, and the genotypes were then confirmed by a third investigator. In addition, genotypes of previously identified SNPs were compared with the genotype of the same individual in the HapMap database; 97% of the genotypes corresponded with each other, which is similar to the accuracy of the HapMap data themselves .
Data analysis and coalescent simulations
Haplotypes were reconstructed using PHASE software (version 2.02 (http://depts.washington.edu/uwc4c/express-licenses/assets/phase/) [52, 53]. The four-gamete rule was computed for each pair of loci in each gene. A value of 1 was assigned when the four-gamete rule could not be rejected and 0 when it could. LDheatmap http://stat-db.stat.sfu.ca:8080/statgen/research/LDheatmap/ was used to plot the matrix of loci for each gene. Measures of genetic diversity, including nucleotide diversity (π)  and Watterson's estimator, , were calculated with DnaSP (version 4.10; http://www.ub.edu/dnasp) . Maximum parsimony networks (using the median-joining algorithm) were constructed using the Network 4.1 software package http://www.fluxus-technology.com. Tajima's D ; Fu and Li's D* and F* (no outgroup), and D and F (with outgroup ; Fay and Wu's H ; Fu's Fs ; and Ewens and Watterson homozygosity-test  statistics were estimated using a custom script to automate the computations. To test the reliability of these computations, we first compared the results obtained using our script with those obtained with DnaSP (version 4.10)  for some dummy data files, and saw no discrepancies. The observed values of these neutrality-test statistics in the European, Asian and African populations were compared with the values of the same neutrality-test statistics based on 10,000 coalescent simulations under the best-fit complex demographic history assuming an OOA model (the CM) implemented in the cosi software package (http://www.broadinstitute.org/~sfs/cosi/) . We also estimated the departure of neutrality using the demographic model proposed by Gutenkunst et al. (GM) using the ms software with the syntax described in that paper (, supplementary material). For each statistic, a two-tailed P value was calculated  and for each population and gene, a combined P value was obtained .
Data analysis by comparison with ENCODE data
DNA sequence information from three different regions of 500 kb each, free or almost free of genes, were obtained from the ENCODE project (EP; http://www.genome.gov/10005107) , comprising regions Enr112, Enr113 and Enr213. These regions have been resequenced by the ENCODE project in several samples from the populations (YRI, CEU, CHB, and JPT) present in the HapMap project . We then performed 10,000 re-samples for each population (YRI, CEU, and CHB) and gene by dividing each of the 500 kb regions into bins of size corresponding to the sequenced region of interest, and taking at random the same number of chromosomes per population as sequenced in each of the four genes. For each bin, we computed Tajima's D, Fu and Li's D*, Fu and Li's F*, Fu's Fs and the Ewens-Watterson homozygosity-test statistics, and compared them with those computed in the gene and population of interest. These statistics were preferred over the others because they do not require the ancestral state of each SNP, which is difficult to estimate for some of the SNPs found through the ENCODE project.
Approximate Bayesian computation/Markov chain Monte Carlo
We also implemented an MCMC without likelihoods in a similar way to the algorithm proposed by Marjoram et al. . For details of the implementation of the ABC-MCMC, choice of summary statistics and the forward simulator we have used, see Additional File 1.
To estimate the strength of s and h and the time when selection started in each of the four genes putatively under selection, the SLC45A2 gene (both the full sequenced region and a randomly ascertained region of 5 kb), and the outcome from a neutral simulation of 5 kb with the OOA model (s = 0, h = 0), we used a two-step approach. In the first step, we performed ABC-MCMC using observed data for 50 regions of 10 kb each from the genes sequenced (see Additional file 5) in the same three populations by the Environmental Genome Project , which has previously been used to quantify demographic parameters . Absence of recombination in each 10 kb segment was tested by applying the four-gamete rule in each population. The demographic model we used here is a simplification of the OOA model (Figure 2) . The choice of a simple model in our study was motivated by the computational time of the forward simulations when increasing the complexity of the demographic parameters (such as bottlenecks and population expansions). As the topology of the coalescence tree is defined with a small number of sequences , but estimating the number of differences between pairs of sequences is performed in quadratic time, we only used 10 samples per population in order to minimize the number of computations. All the demographic parameters were scaled by a factor of 10 to reduce computational time  and the mutation rate per nucleotide was thus 2.35 × 10-7. In total, 10,000 simulations were performed, with a burn-in of 1,000 and a thinning of 9. Prior distributions for all the demographic parameters were uniform (Table 3). In the second step, we set as the constant the demographic parameters with the median value from the estimated posterior distributions, and estimated the selective parameters of each of the genomic regions under study, using all the chromosomes per population. For each genomic region, the population under selection was set as the population previously described as being under selective pressure. In all the analyses, prior distributions (without scaling) were: truncated normal (mean ± sd 1810 ± 800 generations, range 400 to 4000) for tsel, truncated normal (mean ± sd -4 ± 3, range -4 to 2) for s in logit form, and truncated normal (mean ± sd 1 ± 1, range 0.5 to 6) for the h parameter. In total, 20,000 simulations were performed with a burn-in of 5,000 and a thinning of 15. Prior distributions of s, h and tsel under selection were obtained by taking a sample from the prior distributions, performing a simulation, and retaining the sampled values if the simulated data contained the mutation under selection. Using this procedure, 1,000 simulations were retained. We repeated the procedure twice, considering CEU and CHB as being under positive or balancing selective pressures depending on the overdominance parameter.
approximate Bayesian computation
Brahui from Pakistan
Council for Education on Public Health
Han Chinese from Beijing
Encyclopedia of DNA Elements (ENCODE) data
extended haplotype homozygosity
Japanese from Tokyo
long-range haplotype test
Markov chain Monte Carlo
out of Africa
polymerase chain reaction
Yoruba from Ibadan, Nigeria.
International HapMap Consortium: A haplotype map of the human genome. Nature. 2005, 437: 1299-1320. 10.1038/nature04226.
Hinds DA, Stuve LL, Nilsen GB, Halperin E, Eskin E, Ballinger DG, Frazer KA, Cox DR: Whole-genome patterns of common DNA variation in three human populations. Science. 2005, 307: 1072-1079. 10.1126/science.1105436.
Jakobsson M, Scholz SW, Scheet P, Gibbs JR, VanLiere JM, Fung HC, Szpiech ZA, Degnan JH, Wang K, Guerreiro R, et al: Genotype, haplotype and copy-number variation in worldwide human populations. Nature. 2008, 451: 998-1003. 10.1038/nature06742.
Lao O, Lu TT, Nothnagel M, Junge O, Freitag-Wolf S, Caliebe A, Balascakova M, Bertranpetit J, Bindoff LA, Comas D, et al: Correlation between genetic and geographic structure in Europe. Curr Biol. 2008, 18: 1241-1248. 10.1016/j.cub.2008.07.049.
Li JZ, Absher DM, Tang H, Southwick AM, Casto AM, Ramachandran S, Cann HM, Barsh GS, Feldman M, Cavalli-Sforza LL, Myers RM: Worldwide human relationships inferred from genome-wide patterns of variation. Science. 2008, 319: 1100-1104. 10.1126/science.1153717.
Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King KS, Bergmann S, Nelson MR, et al: Genes mirror geography within Europe. Nature. 2008, 456: 98-101. 10.1038/nature07331.
Sabeti PC, Schaffner SF, Fry B, Lohmueller J, Varilly P, Shamovsky O, Palma A, Mikkelsen TS, Altshuler D, Lander ES: Positive natural selection in the human lineage. Science. 2006, 312: 1614-1620. 10.1126/science.1124309.
Kimura R, Fujimoto A, Tokunaga K, Ohashi J: A practical genome scan for population-specific strong selective sweeps that have reached fixation. PLoS ONE. 2007, 2: e286-10.1371/journal.pone.0000286.
Nielsen R, Hellmann I, Hubisz M, Bustamante C, Clark AG: Recent and ongoing selection in the human genome. Nat Rev Genet. 2007, 8: 857-868.
Sabeti PC, Varilly P, Fry B, Lohmueller J, Hostetter E, Cotsapas C, Xie X, Byrne EH, McCarroll SA, Gaudet R, et al: Genome-wide detection and characterization of positive selection in human populations. Nature. 2007, 449: 913-918. 10.1038/nature06250.
Tang K, Thornton KR, Stoneking M: A new approach for using genome scans to detect recent positive selection in the human genome. PLoS Biol. 2007, 5: e171-10.1371/journal.pbio.0050171.
Voight BF, Kudaravalli S, Wen X, Pritchard JK: A map of recent positive selection in the human genome. PLoS Biol. 2006, 4: e72-10.1371/journal.pbio.0040072.
Wang ET, Kodama G, Baldi P, Moyzis RK: Global landscape of recent inferred Darwinian selection for Homo sapiens. Proc Natl Acad Sci USA. 2006, 103: 135-140. 10.1073/pnas.0509691102.
Williamson SH, Hubisz MJ, Clark AG, Payseur BA, Bustamante CD, Nielsen R: Localizing recent adaptive evolution in the human genome. PLoS Genet. 2007, 3: e90-10.1371/journal.pgen.0030090.
Zhang C, Bailey DK, Awad T, Liu G, Xing G, Cao M, Valmeekam V, Retief J, Matsuzaki H, Taub M, et al: A whole genome long-range haplotype (WGLRH) test for detecting imprints of positive selection in human populations. Bioinformatics. 2006, 22: 2122-2128. 10.1093/bioinformatics/btl365.
Akey JM: Constructing genomic maps of positive selection in humans: where do we go from here?. Genome Res. 2009, 19: 711-722. 10.1101/gr.086652.108.
Quach H, Barreiro LB, Laval G, Zidane N, Patin E, Kidd KK, Kidd JR, Bouchier C, Veuille M, Antoniewski C, Quintana-Murci L: Signatures of purifying and local positive selection in human miRNAs. Am J Hum Genet. 2009, 84: 316-327. 10.1016/j.ajhg.2009.01.022.
Stefansson H, Helgason A, Thorleifsson G, Steinthorsdottir V, Masson G, Barnard J, Baker A, Jonasdottir A, Ingason A, Gudnadottir VG, et al: A common inversion under selection in Europeans. Nat Genet. 2005, 37: 129-137. 10.1038/ng1508.
Beaumont MA, Zhang W, Balding DJ: Approximate bayesian computation in population genetics. Genetics. 2002, 162: 2025-2035.
Przeworski M: Estimating the time since the fixation of a beneficial allele. Genetics. 2003, 164: 1667-1676.
Jensen JD, Thornton KR, Andolfatto P: An approximate bayesian estimator suggests strong, recurrent selective sweeps in Drosophila. PLoS Genet. 2008, 4: e1000198-10.1371/journal.pgen.1000198.
Parra EJ: Human pigmentation variation: evolution, genetic basis, and implications for public health. Am J Phys Anthropol. 2007, Suppl 45: 85-105.
Izagirre N, Garcia I, Junquera C, de la Rua C, Alonso S: A scan for signatures of positive selection in candidate loci for skin pigmentation in humans. Mol Biol Evol. 2006, 23: 1697-1706. 10.1093/molbev/msl030.
Lao O, de Gruijter JM, van Duijn K, Navarro A, Kayser M: Signatures of positive selection in genes associated with human skin pigmentation as revealed from analyses of single nucleotide polymorphisms. Ann Hum Genet. 2007, 71: 354-369. 10.1111/j.1469-1809.2006.00341.x.
Myles S, Somel M, Tang K, Kelso J, Stoneking M: Identifying genes underlying skin pigmentation differences among human populations. Hum Genet. 2007, 120: 613-621.
Norton HL, Kittles RA, Parra E, McKeigue P, Mao X, Cheng K, Canfield VA, Bradley DG, McEvoy B, Shriver MD: Genetic evidence for the convergent evolution of light skin in Europeans and East Asians. Mol Biol Evol. 2007, 24: 710-722.
Rana BK, Hewett-Emmett D, Jin L, Chang BH, Sambuughin N, Lin M, Watkins S, Bamshad M, Jorde LB, Ramsay M, et al: High polymorphism at the human melanocortin 1 receptor locus. Genetics. 1999, 151: 1547-1557.
Harding RM, Healy E, Ray AJ, Ellis NS, Flanagan N, Todd C, Dixon C, Sajantila A, Jackson IJ, Birch-Machin MA, Rees JL: Evidence for variable selective pressures at MC1R. Am J Hum Genet. 2000, 66: 1351-1361. 10.1086/302863.
Alonso S, Izagirre N, Smith-Zubiaga I, Gardeazabal J, Diaz-Ramon JL, Diaz-Perez JL, Zelenika D, Boyano MD, Smit N, de la Rua C: Complex signatures of selection for the melanogenic loci TYR, TYRP1 and DCT in humans. BMC evolutionary biology. 2008, 8: 74-10.1186/1471-2148-8-74.
Soejima M, Tachida H, Ishida T, Sano A, Koda Y: Evidence for recent positive selection at the human AIM1 locus in a European population. Mol Biol Evol. 2006, 23: 179-188.
Andres AM, Hubisz MJ, Indap A, Torgerson DG, Degenhardt JD, Boyko AR, Gutenkunst RN, White TJ, Green ED, Bustamante CD, et al: Targets of balancing selection in the human genome. Mol Biol Evol. 2009, 26: 2755-2764. 10.1093/molbev/msp190.
Sabeti PC, Reich DE, Higgins JM, Levine HZ, Richter DJ, Schaffner SF, Gabriel SB, Platko JV, Patterson NJ, McDonald GJ, et al: Detecting recent positive selection in the human genome from haplotype structure. Nature. 2002, 419: 832-837. 10.1038/nature01140.
Teshima KM, Coop G, Przeworski M: How reliable are empirical genomic scans for selective sweeps?. Genome Res. 2006, 16: 702-712. 10.1101/gr.5105206.
Zeng K, Mano S, Shi S, Wu CI: Comparisons of site- and haplotype-frequency methods for detecting positive selection. Mol Biol Evol. 2007, 24: 1562-1574. 10.1093/molbev/msm078.
Clark AG, Hubisz MJ, Bustamante CD, Williamson SH, Nielsen R: Ascertainment bias in studies of human genome-wide polymorphism. Genome Res. 2005, 15: 1496-1502. 10.1101/gr.4107905.
Nielsen R, Signorovitch J: Correcting for ascertainment biases when analyzing SNP data: applications to the estimation of linkage disequilibrium. Theor Popul Biol. 2003, 63: 245-255. 10.1016/S0040-5809(03)00005-4.
Kreitman M, Di Rienzo A: Balancing claims for balancing selection. Trends Genet. 2004, 20: 300-304. 10.1016/j.tig.2004.05.002.
Soldevila M, Calafell F, Helgason A, Stefansson K, Bertranpetit J: Assessing the signatures of selection in PRNP from polymorphism data: results support Kreitman and Di Rienzo's opinion. Trends Genet. 2005, 21: 389-391. 10.1016/j.tig.2005.05.001.
Simonsen KL, Churchill GA, Aquadro CF: Properties of statistical tests of neutrality for DNA polymorphism data. Genetics. 1995, 141: 413-429.
Harding RM, McVean G: A structured ancestral population for the evolution of modern humans. Curr Opin Genet Dev. 2004, 14: 667-674. 10.1016/j.gde.2004.08.010.
Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE, et al: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007, 447: 799-816. 10.1038/nature05874.
Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, et al: Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008, 456: 53-59. 10.1038/nature07517.
Fagundes NJ, Ray N, Beaumont M, Neuenschwander S, Salzano FM, Bonatto SL, Excoffier L: Statistical evaluation of alternative models of human evolution. Proc Natl Acad Sci USA. 2007, 104: 17614-17619. 10.1073/pnas.0708280104.
Gutenkunst RN, Hernandez RD, Williamson SH, Bustamante CD: Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet. 2009, 5: e1000695-10.1371/journal.pgen.1000695.
Schaffner SF, Foo C, Gabriel S, Reich D, Daly MJ, Altshuler D: Calibrating a coalescent simulation of human genome sequence variation. Genome Res. 2005, 15: 1576-1583. 10.1101/gr.3709305.
Wall JD, Lohmueller KE, Plagnol V: Detecting ancient admixture and estimating demographic parameters in multiple human populations. Mol Biol Evol. 2009, 26: 1823-1827. 10.1093/molbev/msp096.
Jablonski NG: The evolution of human skin and skin color. Annual Review of Anthropology. 2004, 33: 585-623. 10.1146/annurev.anthro.33.070203.143955.
Hughes DA, Tang K, Strotmann R, Schoneberg T, Prenen J, Nilius B, Stoneking M: Parallel selection on TRPV6 in human populations. PLoS ONE. 2008, 3: e1686-10.1371/journal.pone.0001686.
Qamar R, Ayub Q, Mohyuddin A, Helgason A, Mazhar K, Mansoor A, Zerjal T, Tyler-Smith C, Mehdi SQ: Y-chromosomal DNA variation in Pakistan. Am J Hum Genet. 2002, 70: 1107-1124. 10.1086/339929.
Rosenberg NA, Li LM, Ward R, Pritchard JK: Informativeness of genetic markers for inference of ancestry. Am J Hum Genet. 2003, 73: 1402-1422. 10.1086/380416.
Sturm RA: Molecular genetics of human pigmentation diversity. Hum Mol Genet. 2009, 18: R9-17. 10.1093/hmg/ddp003.
Stephens M, Donnelly P: A comparison of bayesian methods for haplotype reconstruction from population genotype data. Am J Hum Genet. 2003, 73: 1162-1169. 10.1086/379378.
Stephens JC, Schneider JA, Tanguay DA, Choi J, Acharya T, Stanley SE, Jiang R, Messer CJ, Chew A, Han JH, et al: Haplotype variation and linkage disequilibrium in 313 human genes. Science. 2001, 293: 489-493. 10.1126/science.1059431.
Shin J-H, Blay S, McNeney B, Graham J: LDheatmap: An R function for graphical display of pairwise linkage disequilibria between single nucleotide polymorphisms. Journal of Statistical Software. 2006, 16: Code Snippet 3
Nei M: Molecular Evolutionary Genetics. 1987, New York: Columbia University Press
Watterson GA: On the number of segregating sites in genetical models without recombination. Theor Popul Biol. 1975, 7: 256-276. 10.1016/0040-5809(75)90020-9.
Rozas J, Sanchez-DelBarrio JC, Messeguer X, Rozas R: DnaSP, DNA polymorphism analyses by the coalescent and other methods. Bioinformatics. 2003, 19: 2496-2497. 10.1093/bioinformatics/btg359.
Tajima FV: Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics. 1989, 123: 585-595.
Fu YX, Li WH: Statistical tests of neutrality of mutations. Genetics. 1993, 133: 693-709.
Fay JC, Wu CI: Hitchhiking under positive Darwinian selection. Genetics. 2000, 155: 1405-1413.
Fu YX: Statistical tests of neutrality of mutations against population growth, hitchhiking and background selection. Genetics. 1997, 147: 915-925.
Watterson GA: The homozygosity test of neutrality. Genetics. 1978, 88: 405-417.
Voight BF, Adams AM, Frisse LA, Qian Y, Hudson RR, Di Rienzo A: Interrogating multiple aspects of variation in a full resequencing data set to infer human population size changes. Proc Natl Acad Sci USA. 2005, 102: 18508-18513. 10.1073/pnas.0507325102.
Wilson DJ, McVean G: Estimating diversifying selection and functional constraint in the presence of recombination. Genetics. 2006, 172: 1411-1425.
Marjoram P, Molitor J, Plagnol V, Tavare S: Markov chain Monte Carlo without likelihoods. Proc Natl Acad Sci USA. 2003, 100: 15324-15328. 10.1073/pnas.0306899100.
Livingston RJ, von Niederhausern A, Jegga AG, Crawford DC, Carlson CS, Rieder MJ, Gowrisankar S, Aronow BJ, Weiss RB, Nickerson DA: Pattern of sequence variation across 213 environmental response genes. Genome Res. 2004, 14: 1821-1831. 10.1101/gr.2730004.
DeGiorgio M, Jakobsson M, Rosenberg NA: Out of Africa: modern human origins special feature: explaining worldwide patterns of human genetic variation using a coalescent-based serial founder model of migration outward from Africa. Proc Natl Acad Sci USA. 2009, 106: 16057-16062. 10.1073/pnas.0903341106.
Rosenberg NA, Nordborg M: Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. Nat Rev Genet. 2002, 3: 380-390. 10.1038/nrg795.
Hoggart CJ, Chadeau-Hyam M, Clark TG, Lampariello R, Whittaker JC, De Iorio M, Balding DJ: Sequence-level population simulations over large genomic regions. Genetics. 2007, 177: 1725-1731. 10.1534/genetics.106.069088.
We thank Stefano Mona for stimulating discussions and suggestions about the ABC approach, Mirjam van der Hout for support and computer supplies when running initial implementations of ABC, the Wellcome Trust Sanger Institute large-scale sequencing facility for support in data generation, and Steven Leonard for help with ExoTrace. We also thank the International HapMap project and BRU donors for making their DNA samples available. This study was supported by The Wellcome Trust and by the Erasmus University Medical Center and partially by funds from the Netherlands Forensic Institute.
The authors declare that they have no competing interests.
O. L. carried out most of the data analyses including analysis design and development and data interpretation, and took the lead in writing the manuscript. J. M. dG., with support from the Wellcome Trust Sanger Institute large-scale sequencing facility, took the lead in acquisition of data, carried out network analysis, and contributed to manuscript drafting. Y. X. was involved in acquisition of data, and provided input on the manuscript. M. V., C. W., C. J. G., A. J. C. and Q. A. were all involved in acquisition and preparation of the data. M. K. and C. T-S. designed the study, provided various resources, and were involved in data interpretation and manuscript writing. All authors have read and approved the final manuscript.
Johanna Maria de Gruijter, Oscar Lao, Manfred Kayser and Chris Tyler-Smith contributed equally to this work.
Electronic supplementary material
Additional file 3: Histograms of the posterior distributions (in black) of the selective parameters (σ (4 × N e × s), h and t sel ) of (A) SLC45A2, (B) OCA2, (C) DCT, (D) TYRP1, (E) KITLG and (F) a neutral simulated sequence. These should be compared with the histograms of the prior distributions (in red) for the same parameters, using as population under selective pressures (A,B,D,E) the Council for Education on Public Health Utah (CEU) and (D) East Asian populations. (PDF 81 KB)
About this article
- Selective Sweep
- Approximate Bayesian Computation
- Wellcome Trust Sanger Institute
- Oculocutaneous Albinism
- Extended Haplotype Homozygosity