pSTR Finder: a rapid method to discover polymorphic short tandem repeat markers from whole-genome sequences
© Lee et al. 2015
Received: 18 March 2015
Accepted: 21 July 2015
Published: 5 August 2015
Whole-genome sequencing is performed routinely as a means to identify polymorphic genetic loci such as short tandem repeat loci. We have developed a simple tool, called pSTR Finder, which is freely available as a means of identifying putative polymorphic short tandem repeat (STR) loci from data generated from genome-wide sequences. The program performs cross comparisons on the STR sequences generated using the Tandem Repeats Finder based on multiple-genome samples in a FASTA format. These comparisons generate reports listing identical, polymorphic, and different STR loci when comparing two samples.
The web site http://forensic.mc.ntu.edu.tw:9000/PSTRWeb/Default has been developed as a means to identify polymorphic STR loci within complex mass genome sequences. The program was developed to generate a series of user-friendly reports.
As proof of concept for the program, four FASTA genome sequence samples of human chromosome X (AC_000155.1, CM000685.1, NC_018934.2, and CM000274.1) were obtained from GenBank and were analyzed for the presence of putative STR regions. The sequences within AC-000155.1 were used as an initial reference sequence from which there were 5443 identical and 4305 polymorphic STR loci identified using a repeat unit of 1–6 and 10 bp as the flanking sequence either side of the putative STR loci. A reliability test was used to compare five FASTA samples, which had sections of DNA sequence removed to mimic partial or fragmented DNA sequences, to determine whether pSTR Finder can efficiently and consistently find identical, polymorphic, and different STR loci.
From the mass of DNA sequence data, the project was found to reproducibly identify polymorphic STR loci and generate user-friendly reports detailing the number and location of these potential polymorphic loci. This freely available program was found to be a useful tool to find polymorphic STR within whole-genome sequence data in forensic genetic studies.
KeywordsSTR Bioinformatics Whole-genome sequences MASSIVE parallel sequencing Forensic TRF FASTA
Microsatellites, also known as short tandem repeat (STR) loci, are abundant in eukaryotic genomes . Their polymorphic nature makes them suitable for use in population biology, especially in forensic science and parentage testing . The search for polymorphic STR loci can be laborious and time-consuming using conventional means for any species for which little genome sequence is known . The advent of massive parallel sequencing technology has led to the development of a number of search tools for identifying STR loci [4–7]. Despite these recent developments, it is a far from simple task to identify polymorphic STR loci from a mass of DNA sequence data [8, 9]. This process is still labor intensive, and few software programs are available to assist with identifying whether putative STR loci are polymorphic and suitable for further inclusion in any multiplex. Additionally, it is often very difficult, if not impossible, to detect all potential STR sequences in the whole-genome sequence data.
There are a number of software program developed for this purpose particularly when dealing with data from massive parallel sequencing such as MyFLq , lobSTR , and RepeatFinder , while some programs are useful in identifying a putative STR, not all are designed to indicate whether the locus is polymorphic and able to pullout the flanking DNA.
We have developed pSTR Finder (pSTR) to efficiently analyze multiple-genome sequence samples for the presence of STR loci using Tandem Repeats Finder (TRF) . pSTR accepts sample data in the FASTA format and utilizes TRF. The pSTR program is then designed to analyze all input samples to discover and record putative polymorphic STR loci, regardless of whether the input sample was complete, or fractions of, a genome. We have found this program to be highly efficient when screening for potential polymorphic STR loci from genome-wide sequences and a major improvement on the current situation such that polymorphic STR loci can be identified rapidly from a large dataset.
pSTR is a web application available for non-commercial use (http://forensic.mc.ntu.edu.tw:9000/PSTRWeb/Default.aspx). pSTR requires STR data generated using TRF. The most recent release of TRF that has been tested and integrated with pSTR is version 4.07b.
pSTR relies on TRF to adequately generate user-desired sizes (1–10 bp) of repeat sequences as the input data. While the user interface for pSTR is intuitive, it is possible to download and run either the desktop or command line version of TRF using the desired specific options and then submit the results of the TRF-generated repeat sequences to pSTR for processing.
In regard to the requirements for the input of TRF data, the user submits multiple FASTA contig sequences from which one sequence file is designated as the sample reference. Currently, the program is designed with the characterization of STR loci in mind, and therefore, users can specify the size of the 5′ and 3′ flanking sequences. These flanking sequences are used by pSTR to compare two sample sequences entered.
A summary report shows the number of identical STR loci, the number of polymorphic STR loci, and the number of different STR loci between two matching samples. The summary report also shows the total number of identical (or matching in repeat motif) STR loci, polymorphic STR loci, and unique (identified only once) STR loci for all samples analyzed, based on comparison between the tested sequence file and the designated reference sample file.
The detailed report includes in an CSV format: the 5′ and 3′ flanking sequences, the number of bases constituting the repeat unit, the sequence of the STR motif, the position of the first base in the STR repeat, the variation in the number of repeat units based on the samples included and with the same 5′ and 3′ flanking sequences, the total number of STR loci within the sample, and the number of repeats for every other sample included. Currently, excluded in this detailed report are STR sequence records that are duplicates such as where there is the same putative STR locus recorded but having within it a possible shorter repeat motif and STR sequence records having the same 5′ and 3′ flanking sequences but a smaller number of repeats. These two types of STR sequence, excluded from the detailed report, are saved for each sample in the duplicated STR report.
Duplicated STR report: a separate report for each sample captures the ‘duplicated’ STR records described above.
Identical STR report: this report captures all ‘identical’ STR sequence records from the comparison of two samples. An ‘identical’ STR is determined as having both the same flanking sequences and the repeat number for any two matching STR loci within the two input samples.
Polymorphic STR report: this report captures all STR sequence records having the same flanking sequences but a different repeat number between two input samples.
Different STR report: this report, like 4 and 5 above, requires only two input samples being the source sample of interest and any target file. This report captures all STR sequence records that exist in the source sample only and not in the other sequence file.
Different STR report after switching samples: this report also captures all ‘different’ STR sequence records when the ‘source’ and ‘target’ samples are switched and then re-matched.
The performance of pSTR was tested using four sets of sequences of human chromosome X (AC_000155.1, CM000685.1, NC_018934.2, and CM000274.1) from GenBank. There are 143,733,266, 155,270,560, 155,181,468, and 155,407,050 bp in the respective FASTA source samples.
Alignment Score for match, mismatch, indel: 2, 7, 7
Minimum Alignment Score: 50
Maximum Repeat Unit Size: 6
5′ and 3′ Flanking Sequence Size: 10
A reliability test was performed using AC_000155.1 as the reference sample. To mimic fragmented or partial DNA sequences, AC_000155.1 was divided into between four DNA fragments each comprising 5–10 Mbp. These new files were called AC_000155.1a, AC_000155.1b, AC_000155.1c, and AC_000155.1d.
Results and discussion
As a result of searching the FASTA contig sequences of four X-chromosome samples with access number AC_000155.1, CM000685.1, NC_018934.2, and CM000274.1, the TRF found 14,026, 16,936, 15,891 and 16,698 potential STR loci, respectively. Loci that were found to be duplicates, i.e., those with the same repeat motif and the same flanking sequences, were saved to the specific duplicated STR report prior to matching. It should also be noted that the number and existence of false counts of identical STR loci, polymorphic STR loci, and different STR loci depends on the size of 5′ and 3′ flanking sequences specified (in this case, only ten were used).
Summary results of the number of identical, polymorphic and different STR loci among four samples after searching using pSTR
Summary results of the number of identical, polymorphic and different STR loci among five samples after searching using pSTR with 10 bp flanking sequences used as a reliability test
Summary results of the number of identical, polymorphic and different STR loci among five samples after searching using pSTR with 100 bp flanking sequences used as a reliability test
The summary report (see Table 1) provides a quick view of the number of identical and polymorphic STR loci, how many potentially different STR were found, and how many STR loci were found to be unique. A record of these findings are grouped and saved in the detailed report. Additionally, the following is provided: different STR report, identical STR report, and polymorphic STR report. If required, users may utilize Excel pivot tables and graphs or other tools as an advanced method to analyze grouped STR loci in any of the reports generated by the pSTR program. For instance, the flanking sequences saved in the reports can be used as the sources of PCR primers for the development of a STR genotyping system.
There are between 2790 and 4654 polymorphic STR based on the comparison of any two samples found by the pSTR program. All STR loci identified as exhibiting variation in the number of repeats based on the comparison of two samples are saved in the polymorphic STR report. Likewise, all identical STRs after the comparison of two samples are saved in the identical STR report.
The duplicated STR report captures all the STR records that are at the same position but have different annotated STR sequences and also the STR records having the same 5′ and 3′ flanking sequences but found at different positions. For instance, in the duplicated STR report of sample AC_000155.1, there are 54 clusters containing two STR loci recorded as being at the same position. Further, there are 157 clusters containing between 2 and 14 STRs having the same flanking sequences. In addition, there are 35 clusters having the same position and also the same flanking sequences. pSTR effectively minimizes the false data commonly seen in genome-wide sequences due to errors created during the assembling of the data by excluding all the duplicated STR from the duplicated STR reports.
The different STR report, generated by analyzing two samples, captures all the STR loci that exist in one sample, called the ‘source sample’, but do not exist in another sample called the ‘target sample’. For the same pair of matching samples, a separate different STR report can be generated if the source sample and target sample are switched. The performance test found between 2109 and 5033 different STR loci as matches when two samples were compared. As this is a different STR report, then a smaller number suggests a higher similarity between the samples being compared.
No unique STR loci were found among four samples using ‘AC_000155.1’ as the reference sample in our performance test. A unique STR may be recorded if there is variation in the flanking sequences or variation in a repeat sequence. It is also possible that the presence of a unique STR locus was due to sequencing or assembling errors.
In the reliability test, since all samples are a ‘subset’ of the reference sample AC_000155.1, no polymorphic STR loci should be observed in the summary report. In Table 2, however, there are a few polymorphic STR loci recorded. Analysis of the data found that these are all artifacts of the sequences over the conjunctions of DNA contigs in AC_000155.1a to d. For the same reason, no ‘different STR’ was found while comparing all the ‘subset’ samples with the reference sample, an exception being between AC_000155.1b, and AC_000155.1 where 1 is recorded in Table 3.
As pSTR uses the 5′ and 3′ flanking sequences to compare between two input samples, the accuracy of identification of polymorphic STR loci can be increased by increasing the length of the flanking DNA. For instance, by increasing the 5′ and 3′ flanking size from 10 to 100 bp, the polymorphic STR count of each matching of two samples became 0.
Examples of all these reports can be found in the Additional file 1.
The program pSTR was found to efficiently analyze multiple TRF generated data to accurately identify putative STR loci. The program will generate reports detailing different, identical, and polymorphic STR data from all the input samples. The concept was to design a useful and valuable tool that is easy to use when searching for polymorphic and unique STRs from genome-wide sequences.
Recently published programs such as RepeatFinder [12, 13], lobSTR , and myFLq  are very effective at identifying putative STR loci and are an advance of prior methods when examining a plethora of data generated by massively parallel sequencing. pSTR not only identifies putative STR loci but also generates numerous reports detailing the polymorphic nature of the loci such that these loci may be suitable for further genetic testing.
A practical use of pSTR relevant to forensic practice is to identify variations of STR loci between two whole-genome sequences from the same sample type but of different size or complexity. An example of which is a DNA sample from a known person, where a complete genome is available from a near pristine sample, and an evidential sample collected from a scene, where the size of genetic data may be limited or compromised.
Another practical use of pSTR is to search SNPs within whole-genome sequence data to differentiate identical twins . Since the mutation rate of some STR loci is approximately 100,000 times higher than some SNPs , there would be a high expectation of identifying STRs that separate even identical twins. This program can work on sections of a genome or a complete genome to identify polymorphic STR loci that may distinguish such monozygotic twins. The program pSTR can also find SNPs by searching large flanking sequences from data obtained from such identical twins.
It is intended that this program will aid in the easy analysis of data generated from whole-genome sequencing and become a labor-saving program.
comma separated values is a file format
text-based format for DNA and protein sequences used in bioinformatics
the abbreviation used for pSTR Finder, the software described in this paper
- short tandem repeat:
a form of hypervariable DNA sequence where the repeat motif occurs in a tandemly repeated manner and the number of repeats is variable between individuals
Tandem Repeats Finder is a software tool to identify potential repetitive polymorphic DNA sequences
We thank the National Science Council of Taiwan by a grant [NSC 102-2320-B-002-010] and the Integrated Core Facility for Functional Genomics of the National Core Facility Program for Biotechnology, the Microarray Core Facility of the National Taiwan University Center of Genomic Medicine for technical support.
- Ellegren H. Microsatellites: simple sequences with complex evolution. Nat Rev Genet. 2004;5:435–45.PubMedView ArticleGoogle Scholar
- Butler JM. Advanced topics in forensic DNA typing: methodology. San Diego: Elsevier Academic Press; 2012.Google Scholar
- Edwards A, Civitello A, Hammond HA, Caskey CT. DNA typing and genetic mapping with trimeric and tetrameric tandem repeats. Am J Hum Genet. 1991;49:746–56.PubMed CentralPubMedGoogle Scholar
- Kofler R, Schlotterer C, Lelley T. SciRoKo: a new tool for whole genome microsatellite search and investigation. Bioinformatics. 2007;23:1683–5.PubMedView ArticleGoogle Scholar
- Meglécz E, Costedoat C, Dubut V, Gilles A, Malausa T, Pech N, et al. QDD: a user-friendly program to select microsatellite markers and design primers from large sequencing projects. Bioinformatics. 2010;26:403–4.PubMedView ArticleGoogle Scholar
- Churbanov A, Ryan R, Hasan N, Bailey D, Chen H, Milligan B, et al. High SSR high-throughput SSR characterization and locus development from next-gen sequencing data. Bioinformatics. 2012;28:2797–803.PubMed CentralPubMedView ArticleGoogle Scholar
- Du L, Li Y, Zhang X, Yue B. MSDB: a user-friendly program for reporting distribution and building databases of microsatellites from genome sequences. J Hered. 2013;104:154–7.PubMedView ArticleGoogle Scholar
- Zalapa JE, Cuevas H, Zhu H, Steffan S, Senalik D, Zeldin E, et al. Using next-generation sequencing approaches to isolate simple sequence repeat (SSR) loci in the plant sciences. Am J Bot. 2012;99:193–208.PubMedView ArticleGoogle Scholar
- Miller AD, Good RT, Coleman RA, Lancaster ML, Weeks AR. Microsatellite loci and the complete mitochondrial DNA sequence characterized through next generation sequencing and de novo genome assembly for the critically endangered orange-bellied parrot Neophema chrysogaster. Mol Biol Rep. 2013;40:35–42.PubMedView ArticleGoogle Scholar
- Van Neste C, Vandewoestyne M, Van Criekinge W, Deforce D, Van Nieuwerburgh F. My-Forensic-Loci-queries (MyFLq) framework for analysis of forensic STR data generated by massive parallel sequencing. Forensic Sci Int Genet. 2014;9:1–8.PubMedView ArticleGoogle Scholar
- Gymrek M, Golan D, Rosset S, Erlich Y. lobSTR: a short tandem repeat profiler for personal genomes. Genome Res. 2012;22(6):1154–62.PubMed CentralPubMedView ArticleGoogle Scholar
- Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999;27:573–80.PubMed CentralPubMedView ArticleGoogle Scholar
- Center for Bioinformatic and Computational Biology ‘Repeat Finder’ http://www.cbcb.umd.edu/software/RepeatFinder/ accessed 16 June 2015.
- Weber-Lehmann J, Schilling E, Gradl G, Richter DC, Wiehler J, Rolf B. Finding the needle in the haystack: differentiating "identical" twins in paternity testing and forensics by ultra-deep next generation sequencing. Forensic Sci Int Genet. 2014;9:42–6.PubMedView ArticleGoogle Scholar
- Butler JM, Coble MD, Vallone PM. STRs vs. SNPs: thoughts on the future of forensic DNA testing. Forensic Sci Med Pathol. 2007;3:200–5.PubMedView ArticleGoogle Scholar
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.