Contrasting signals of positive selection in genes involved in human skin-color variation from tests based on SNP scans and resequencing

Background Numerous genome-wide scans conducted by genotyping previously ascertained single-nucleotide polymorphisms (SNPs) have provided candidate signatures for positive selection in various regions of the human genome, including in genes involved in pigmentation traits. However, it is unclear how well the signatures discovered by such haplotype-based test statistics can be reproduced in tests based on full resequencing data. Four genes (oculocutaneous albinism II (OCA2), tyrosinase-related protein 1 (TYRP1), dopachrome tautomerase (DCT), and KIT ligand (KITLG)) implicated in human skin-color variation, have shown evidence for positive selection in Europeans and East Asians in previous SNP-scan data. In the current study, we resequenced 4.7 to 6.7 kb of DNA from each of these genes in Africans, Europeans, East Asians, and South Asians. Results Applying all commonly used neutrality-test statistics for allele frequency distribution to the newly generated sequence data provided conflicting results regarding evidence for positive selection. Previous haplotype-based findings could not be clearly confirmed. Although some tests were marginally significant for some populations and genes, none of them were significant after multiple-testing correction. Combined P values for each gene-population pair did not improve these results. Application of Approximate Bayesian Computation Markov chain Monte Carlo based to these sequence data using a simple forward simulator revealed broad posterior distributions of the selective parameters for all four genes, providing no support for positive selection. However, when we applied this approach to published sequence data on SLC45A2, another human pigmentation candidate gene, we could readily confirm evidence for positive selection, as previously detected with sequence-based and some haplotype-based tests. Conclusions Overall, our data indicate that even genes that are strong biological candidates for positive selection and show reproducible signatures of positive selection in SNP scans do not always show the same replicability of selection signals in other tests, which should be considered in future studies on detecting positive selection in genetic data.


Background
Large-scale genotyping projects using genome-wide single-nucleotide polymorphisms (SNPs) have provided large amounts of data describing the genetic diversity of human populations [1][2][3][4][5][6]. Several statistical methods have been developed and used for detection of signatures of selective processes from genome-wide SNP data, which we refer to as 'SNP scans' [7]. All these approaches try to recover fingerprints of selective sweeps by detecting signals in the haplotypic variation of a genomic region and/or the spectrum of the variation of the genetic diversity [8][9][10][11][12][13][14][15]. However, the results obtained with the different test statistics usually show limited overlap (see results from Voight et al. [12] Wang et al. [14] and Akey [16]), therefore, it would be desirable to compare the results from SNP haplotype-based tests using the 'gold standard' of full resequencing data and suitable statistical tests. The most widely used approach involves computing a set of neutrality-test values for sequence data from a particular genomic region, and then estimating the likelihood of such values are under neutrality. This is achieved either by comparing the computed one in the region of interest with the value of the statistic in other regions of the genome (that is, by using empirical distributions [17]) or with the values obtained under demographic (neutral) simulations [18]). However, it is also desirable to have an estimate of the selective parameters (s and h) rather than just rejecting the hypothesis of neutrality. The latter could be in principle obtained by applying Approximate Bayesian Computation (ABC) [19], a statistical technique used to recover the posterior distribution of parameters shaping the statistical model, which is applied when computing the likelihood of the data given the parameters is not possible, but it is possible to simulate data under the model of interest [19]. ABC has proven to be a valuable statistical tool for making inferences about demographic parameters in population genetics. Moreover, it has also been used in estimating selective parameters [20,21].
Skin pigmentation is an excellent candidate system for investigating positive selection. It is very likely that this trait is under selective pressure, given the biological role of pigmentation and the large differences seen in pigmentation intensity between continental populations [22]. For a large number of genes involved in the pigmentation pathway, signatures of recent selective sweeps have been suggested from SNP-scan data [12,14,[23][24][25][26], and DNA sequence-based evidence has been tested for a limited number of the genes associated with human skin pigmentation (melanocortin 1 receptor (MC1R), solute carrier family 45 member 2 (SLC45A2), tyrosinase-related protein 1 (TYRP1, dopachrome tautomerase (DCT) and tyrosinase (TYR) [27][28][29][30]. In a previous SNP-based study, we identified signatures of selective sweeps in the genes oculocutaneous albinism II (OCA2),TYRP1, DCT, and KIT ligand (KITLG) in Europeans, in OCA2, DCT, KITLG, epidermal growth factor receptor (EGFR) and dopamine receptor D2 (DRD2) in East Asians. In contrast, Africans did not show any evidence of positive selection in any of the genes that we tested [24]. Evidence for selection in OCA2, DCT, KITLG, and TYRP1 was also shown by other SNP-based studies applying similar haplotype-based tests to data from other samples [23][24][25][26].
In the present study, we generated DNA sequence data from approximately 4.7 to 6.7 kb of each of the four genes OCA2, TYRP1, DCT and KITLG in Africans, Europeans, East Asians and South Asians, and applied both neutrality tests and an ABC-Markov chain Monte Carlo (MCMC) approach for detecting evidence of positive selection. Furthermore, we compared the outcomes from such sequence-based test with our previous results from haplotype-based tests using SNP-scan data.

Results
We sequenced DNA regions of approximately 4.7 to 6.7 kb from each four genes, OCA2, DCT, TYRP1 and KITLG, involved mainly in human skin pigmentation, in 24 to 26 DNA samples from African, European, East Asian and South Asian populations (Table 1). In total, we detected 146 polymorphic positions: 47 in the OCA2 fragment, 35 in the TYRP1 fragment, 31 in the DCT fragment and 33 in the KITLG fragment (Table 1).

Neutrality tests and their statistical significance
The application of the four-gamete rule between all pairs of SNPs within each population suggested that, for all genes except TYRP1, the sequenced region could be considered as a single block (see Additional file 1). We compiled results from commonly used neutrality tests under different demographic models and using an empirical distribution (Table 2), and computed the two-tailed P value for each statistic given the population and gene (see Additional file 2). We found discrepancies in rejecting the null hypothesis of neutrality depending on which expected neutral distribution was used. After Bonferroni correction for multiple testing (P<0.05/96 computed tests for each expected distribution considered under neutrality in the case of the cosi model (CM) and the Gutenkunst model (GM, a best-fit demographic model); and P<0.05/60 computed tests in the case of Encyclopedia of DNA Elements (ENCODE) data (ED)), Fu's Fs statistic was significant for the TYRP1 gene and the Han Chinese from Beijing (CHB) group when using the empirical distribution from the ED, and marginally significant (P = 0.045) for the OCA2 gene and the Council for Education on Public Health (CEPH) Utah (CEU) group when using the empirical distribution from CM. None of the neutrality tests reached statistical significance after Bonferroni correction using the CM, and only the Fu and Li D and D* statistics were significant for KITLG in Europeans using the GM. Combining the P values of the different statistics produced a different picture. We detected significant (combined P< 0.05) departures from neutrality in the African and European populations for OCA2, the African population for DCT, and the European population for KITLG using the CM, and for KITLG in the European population using the GM. By contrast, ED-based P values were significant for TYRP1 in the Asian populations.

Phylogenetic networks
To visualize the phylogenetic relationships between the haplotypes and to provide additional insights into their evolutionary history, we constructed median-joining networks. Overall, the networks had few reticulations ( Figure 1). On average, the African population tended to show a large number of haplotypes at low frequency, whereas the European, East Asian and Pakistanis populations all had two main haplotypes separated by a large number of mutations. This was particularly dramatic in the case of KITLG and DCT, and less striking for OCA2 and TYRP1.

Approximate Bayesian computation/Markov chain Monte Carlo analyses
We first estimated demographic parameters for our simplified OOA (out-of-Africa) model ( Figure 2) considering 50 loci from the Environmental Genome Project; the statistics of centrality and dispersion of the different parameters are described (Table 3). In all cases, the posterior distributions strongly diverged from the priors (which were all uniform) ( Table 3), and thus were influenced by the data. Posterior estimates of the parameters related to a putative selective event for each of the four genes under study are shown ( Table 4). As controls, we also performed ABC-MCMC with data from SLC45A2 and from a simulated neutral region of 5 kb. Histograms of the posterior distributions for each gene are provided (see Additional file 3). Posterior distributions of the parameters were similar to the prior distributions in all the cases except for SLC45A2; in this case, selection was restricted to Europeans, beginning after the split from East Asians and fitting a dominant model of inheritance (h = 1). The estimated posterior distribution of σ was skewed towards large values.

Discussion
In the present study, we focused on reinvestigating previous conclusions about positive selection based on long-range haplotype (LRH) tests, using four genes putatively associated with human pigmentation. The phylogenetic networks for each gene based on sequence data were in agreement with our previous findings [24]; the different populations tended to show a high frequency of one of the major haplotypes, which tended to diverge         [29]. As the whole DCT gene is a single linkage disequilibrium (LD) block, as seen in the HapMap East Asian data, it seems unlikely that the discrepancy is explicable by the different DCT regions sequenced. Indeed, the agreement between different SNP-scan studies has been described as 'underwhelming' [16]. The discrepancies we detected here between haplotype-based and sequence-based test outcomes can be explained by a number of factors.
First, we cannot exclude the possibility that the positiveselection signals from our previous SNP-based study were false positives; the complex demographic history of humans [12,32,33] and the power dependency of the tested site [34] can affect the outcome of such tests.
Second, it has been emphasized that the SNP ascertainment bias introduced during marker discovery [35,36] and genotyping array can lead to spurious false-positive findings in haplotype-based tests [37,38].
Third, there might be a lack of power in the sequencebased tests because of the small sample sizes and/or small sequenced regions [39]. Although we cannot exclude this possibility, the length (approximately 5 kb) sequenced from the four genes proved to be sufficient to detect departures from neutrality in SLC45A2 in the European population (data not shown).
A fourth possibility is that the distributions that we computed for each statistic under neutrality do not represent the true underlying distribution for the human species. Parameters of the demographic events need to be defined a priori, which in humans is challenging because of the complex history of migrations, admixture, expansions and bottlenecks [40]. The differences seen in the values of the parameters could be indeed a major source of variation. The ENCODE data we used as an alternative is hampered by the fact that the considered regions were ascertained based on their genomic peculiarities [41], and they may not be representative of the genetic variability of the genome. There has been progress in resequencing entire genomes (for example, the 1000 Genomes Project; http://www.1000genomes.org/page.php); however, current projects rely on combining low-coverage data from multiple samples, and are not able to produce the accurate sequence for each genome that is needed for such comparisons [42].
The fifth, and perhaps most likely, reason for discrepancies between LRH and sequence-based tests we observed here may be the different underlying assumptions of the evolutionary models used (that is, instantaneous selective sweep versus incomplete selective sweeps) in the definition of each statistic, and the evolutionary timescale over which each type of test can recover departures from neutrality [7]. In that case, our results might indicate an extremely recent selection in the pigmentation genes, which would be recovered by haplotype-based but not sequence-based tests.
We also used a Bayesian approach to estimate selective parameters of the populations putatively under positive selection in each gene. The demographic parameters were on average in concordance with those described in previous studies [43][44][45][46]; however, it should be noted that they were not entirely comparable as there were a large number of differences in the assumptions of the models and data. Despite this technical limitation of the approach, the estimates of the time when selection started and the mode of inheritance correlated well with expectations in the case of SLC45A2 [30], independently of whether the complete 10 kb sequence or a subsample of 5 kb was used (data not shown). To our knowledge, this is the first time this known selective sweep has been quantified in such a way. However, for OCA2, TYRP1, DCT and KITLG and the neutral simulated region, the strong resemblance between the prior and the posterior distributions suggests that the latter are mainly dominated by the priors rather than by the information contained in the genetic data.

Conclusions
In this study, we have shown that using sequence-based neutrality tests to confirm signatures of positive selection derived from LRH tests on SNP-scan data can be difficult, even though there is a strong likelihood that the skin-pigmentation genes we studied have been targets of selection [47]. Deciphering whether this is a consequence of the power of the different statistics to detect the fingerprint of selection on different timescales; different assumptions on the strength of the selective event; or lack of power due to experimental limitations seems challenging. Our findings should be considered in future studies that set out to further investigate signatures of selection in other genes or regions of the human genome suggested by SNP-scan data. It may be argued that the final proof of positive selection should not be provided by additional genetic data but rather by functional evidence, but this may have its own caveats. A recent study [48] has demonstrated that the TRPV6 gene shows strong evidence of positive selection in all non-African populations tested using a novel modification of the extended haplotype homozygosity (EHH) test, but no functional differences between the ancestral and derived sequence could be detected using experiments relevant to the known gene function. Clearly, further advancements in the methods used to detect and validate putative signatures of positive selection are needed, and provide one of the most exciting areas for future developments. Complete understanding of positive selection in the human genome will require the combination of multiple lines of evidence.

Population samples
Genomic DNAs (Coriell Institute for Medical Research; Camden, NJ, USA from randomly ascertained participants from three continents chosen from the HapMap panel [1] were used: 24 Yoruba from Ibadan, Nigeria (YRI); 24 Han Chinese from Beijing (CHB); and 24 CEPH Utah residents with ancestry from northern and western Europe (CEU). In addition, 26 (OCA2 and TYRP1) or 25 (DCT and KITLG) Brahui (BRU) from Pakistan [49] were included because of their distinct and intermediate geographic location and history. In some analyses, we used published data on SLC42A2 (Table 3 of Soejima et al. [30]) in European-Africans from Cape Town, Ghanaians from Accra, and Chinese from Guangzhou; these data had excluded SNPs at a frequency of <5%. Genomic DNA samples from one chimpanzee (Pan troglodytes), one orangutan (Pongo pygmaeus) and one gorilla (Gorilla gorilla) (all from the European Collection of Cell Cultures, Salisbury, Wiltshire, UK) were also included to allow inference of the ancestral state of each SNP.

Ascertainment of the sequenced regions
For each of the four genes included in this study (OCA2, DCT, TYRP1 and KITLG) a DNA region of approximately 4.7 to 6.7 kb surrounding the most informative SNP was selected for resequencing ( Table 1). The region of interest was chosen based on its high rate of differentiation (quantified by means of the mean informativeness of ancestry [50] in the region) between East Asians, Europeans and Africans ( Figure 3). As established previously [24], any candidate for explaining the differences between the amounts of pigmentation should genetically co-vary with such differences. Further EHH analyses suggested that these regions showed evidence of selective sweeps [24]. Additional evidence linking skin-pigmentation phenotype with the ascertained region is available for TYRP1 [51]; an LD r 2 value of 0.704 was seen between rs2733832, which is associated with pigmentation, and rs683, which lies within the region ascertained for TYRP1 in HapMapII CEU. The SNPs included in the ascertained regions were rs2311806, rs7166228, rs7170869, rs2311805, rs1375164 and rs12442147 for OCA2; rs2209277, rs10960756, rs10809828, rs17280629, rs2733834, rs683, rs2762464, rs910 and rs1063380 for TYRP1; rs3782974, rs1325611, rs16949829, rs7992630, rs9516414, rs9524491, rs2892680 and rs7987802 for DCT;and rs1873681, rs3782179, rs3782180, rs3782181, rs4590952, rs1907702, rs1907703 and rs7953414 for KITLG. Although most of the ascertained regions consisted of intronic sequence, the TYRP1 and DCT fragments also contained exonic sequence: OCA2 (introns 2 to 3), TYRP1 (introns 6 to 7 and 7 to 8 plus exons 7 and 8), DCT (introns 6 to 7 + exon 7 + introns 7 to 8) and KITLG (introns 2 to 3).

Enzymatic amplification by PCR
Primers were designed from the human reference sequence obtained from GenBank (OCA2 accession number NC_000015.8; TYRP1 accession number NC_000009.10; DCT accession number NC_000013.9 and KITLG accession number NC_000012.9), and used to amplify fragments of approximately 6 to 7 kb (positions on chromosome:  3 μl samples of a 1:200 dilution of these PCR products were used as templates to reamplify overlapping fragments with sizes of approximately 350 to 700 bp. PCR assays for reamplification were performed in a volume of 15 μl containing 2 μmol/l of each primer, 1.6 mmol/l MgCl 2 , 200 μmol/l of each dNTP, and 0.5 U Taq (Platinum Taq; Invitrogen). The cycling conditions for the reamplification were 94°C for 2 minutes, followed by 31 cycles at 94°C for 45 seconds, 60°C for 45 seconds, and 72°C for 1.5 minutes; then 72°C for 3 minutes. Details of all primers used in this study are given in Additional File 4.

DNA sequencing
PCR products were purified (ExoSAP-IT; GE Healthcare, Princeton, NJ, USA) before sequencing. The primers used for reamplification (see Additional File 4) were used (individually) to sequence in both orientations, using a large-scale sequencing facility (Wellcome Trust Sanger Institute) with standard capillary methods. For each individual, each nucleotide position was determined from both strands by at least two reads each. The genomic DNA sequence for OCA2 (accession number NC_000015.8), TYRP1 (accession number NC_000009.10), DCT (accession number NC_000013.9) and KITLG (accession number NC_000012.9) were obtained from GenBank and used as the reference sequence for the relevant gene. Sequence traces were processed using the software program ExoTrace (Wellcome Trust Sanger Institute; http://www.sanger.ac.uk/ humgen/exoseq/analysis.shtml). Potentially polymorphic positions were flagged by the program, and were then checked manually. Variable positions were compared in overlapping and complementary reads for all samples. A quality-control test was performed to verify the SNPs called in each gene by comparing the genotype assignment performed by one investigator with the same genotype assignment performed by a second investigator. This test was performed on 2 kb of sequence per gene. The largest discrepancies were seen in the gene TYRP1, (3.4% discrepancies over all the polymorphic sites examined) and DCT (1.3%); KITLG and OCA2 rates of 0.28% and 0.6%, respectively. All of these discrepancies reflected differences in investigator interpretation, and all could be resolved by re-examining the traces. Therefore, for the rest of the sequenced regions, a maximum number of plausible variants was first identified by the two investigators, and the genotypes were then confirmed by a third investigator. In addition, genotypes of previously identified SNPs were compared with the genotype of the same individual in the HapMap database; 97% of the genotypes corresponded with each other, which is similar to the accuracy of the HapMap data themselves [1].

Data analysis and coalescent simulations
Haplotypes were reconstructed using PHASE software (version 2.02 (http://depts.washington.edu/uwc4c/ express-licenses/assets/phase/) [52,53]. The four-gamete rule was computed for each pair of loci in each gene. A value of 1 was assigned when the four-gamete rule could not be rejected and 0 when it could. LDheatmap http://stat-db.stat.sfu.ca:8080/statgen/research/LDheatmap/ [54] was used to plot the matrix of loci for each gene. Measures of genetic diversity, including nucleotide diversity (π) [55] and Watterson's estimator, ∧ θ w [56], were calculated with DnaSP (version 4.10; http://www. ub.edu/dnasp) [57]. Maximum parsimony networks (using the median-joining algorithm) were constructed using the Network 4.1 software package http://www. fluxus-technology.com. Tajima's D [58]; Fu and Li's D* and F* (no outgroup), and D and F (with outgroup [59]; Fay and Wu's H [60]; Fu's Fs [61]; and Ewens and Watterson homozygosity-test [62] statistics were estimated using a custom script to automate the computations. To test the reliability of these computations, we first compared the results obtained using our script with those obtained with DnaSP (version 4.10) [57] for some dummy data files, and saw no discrepancies. The observed values of these neutrality-test statistics in the European, Asian and African populations were compared with the values of the same neutrality-test statistics based on 10,000 coalescent simulations under the best-fit complex demographic history assuming an OOA model (the CM) implemented in the cosi software package (http://www.broadinstitute.org/~sfs/cosi/) [45]. We also estimated the departure of neutrality using the demographic model proposed by Gutenkunst et al. (GM) using the ms software with the syntax described in that paper ( [44], supplementary material). For each statistic, a two-tailed P value was calculated [63] and for each population and gene, a combined P value was obtained [64].
Data analysis by comparison with ENCODE data DNA sequence information from three different regions of 500 kb each, free or almost free of genes, were obtained from the ENCODE project (EP; http://www. genome.gov/10005107) [41], comprising regions Enr112, Enr113 and Enr213. These regions have been resequenced by the ENCODE project in several samples from the populations (YRI, CEU, CHB, and JPT) present in the HapMap project [1]. We then performed 10,000 re-samples for each population (YRI, CEU, and CHB) and gene by dividing each of the 500 kb regions into bins of size corresponding to the sequenced region of interest, and taking at random the same number of chromosomes per population as sequenced in each of the four genes. For each bin, we computed Tajima's D, Fu and Li's D*, Fu and Li's F*, Fu's Fs and the Ewens-Watterson homozygosity-test statistics, and compared them with those computed in the gene and population of interest. These statistics were preferred over the others because they do not require the ancestral state of each SNP, which is difficult to estimate for some of the SNPs found through the ENCODE project.