This is the full tutorial for pyRAD v.2.16. It attempts to explain, fairly exhaustively, the inner workings of each parameter setting used in assembling data sets in pyRAD. I highly recommend that users begin with a run-through of the RAD tutorial, which takes only a few minutes and will help you gain a general sense of how the program works and what the data and results look like. Then follow up with a more detailed look at this tutorial to see the full range of options available, and if you still have questions, please send them to the google group at the link above.
Reduced-representation genomic sequence data (e.g., RADseq, GBS, ddRAD) are commonly used to study population-level research questions and consequently most software packages for assembling or analyzing RADseq data are designed for data with little variation across samples. Phylogenetic analyses typically include species with deeper divergence times (more variable loci across samples) and thus a different approach to clustering and identifying orthologs will perform better.
pyRAD, the software pipeline described here, is intended to maximize phylogenetic information across disparate samples from RAD, ddRAD, or GBS data, hereafter referred to generally as RADseq. Written in Python, the program is flexible and easy to use. The code is human-readable and open-source, and may be modified to meet users' specific needs. With respect to constructing phylogenetic data sets, the largest difference between pyRAD and alternative programs such as Stacks is in the way that loci are clustered. Stacks clusters reads by grouping those with fewer than N differences between them, but does not allow for indels, such that the presence of even a single indel will cause a frameshift resulting in many base differences between otherwise highly similar sequences. pyRAD instead uses a measure of sequence similarity based on a global alignment of sequences through the program USEARCH. Alignment clustering allows for indels, and should perform better in data sets that include more distantly related samples.
pyRAD was written with the aim that it be easy to use, fast, and flexible. It is easy in the sense that installation does not require any difficult libraries or extensions, and that it requires few arguments on the command line. It is fast by offering a number of heuristics to speed clustering of very large data sets, or long (paired-end) reads. And it is flexible in that it has built-in methods to account for different library types (RAD, ddRAD, GBS, paired-ddRAD, paired-GBS) as well as quality filtering, including overlap among paired or reverse-complemented reads, and the presence of adapter sequences. Thus, data do not need to be pre-screened or filtered before analysis in pyRAD.
Two fast clustering methods: hierarchical-clustering for large data sets, and split-clustering for paired-end ddRAD data are described further below. Using these methods population or phylogenetic scale data sets with even many hundreds of individuals can be assembled in relatively short time by distributing across multiple processing cores. pyRAD can be used to perform D-statistic tests on RADseq loci to test for historical introgression. Assembled data sets can be output in a number of formats for easy input into external applications such as BEAST, RAxML, TreeMix, Structure, and migrate-n.
An open access pre-print describing pyRAD and comparing it with Stacks is available here: link
pyRAD is available for Linux and Mac (but not Windows). Find the download link here. A few Python packages and external software are required as dependencies. I list how you can find and install each below:
Numpy and Scipy are common Python libraries required for numerical and scientific calculations.
On Ubuntu Linux these can be installed using apt-get commands:
sudo apt-get install python-numpy
sudo apt-get install python-scipy
On a Mac there are several ways to install these, and it varies depending on the version of your OS. I suggest you search "Mac install scipy", or "Scipysuperpack", or "Conda install", for installing these two packages. To check that they installed properly, open Python in a command terminal by typing "python" and enter the import statements below. If everything is installed properly you will get no error messages.
import numpy
import scipy
import scipy.stats
The other two required programs are MUSCLE and USEARCH. Both are available at http://www.drive5.com. Once downloaded you can move the executables into your $PATH such that they can be called from the command line. If you don't know what that means, don't worry; simply take note of the location where you saved the downloads, as this is all you will need to execute them. Sometimes the downloaded executable of USEARCH does not have the correct permissions set for execution (I don't know why). If you get an error when you execute the program saying you do not have permission to use it, then run the following Unix command to change the permissions of the file:
chmod 555 usearch_v.7.0.1090_i86linux32
Important note: Versions of pyRAD v.1.70 and above are compatible with USEARCH v.7.0.1090 and above, not with older versions.
The barcodes file is a simple table linking barcodes to samples. Barcodes can be of varying lengths. Each line should have one name and one barcode, separated by a single tab (important: no spaces).
sample1 ACAGG
sample2 ATTCA
sample3 CGGCATA
sample4 AAGAACA
If your data are already de-multiplexed (sorted) into separate files for each individual then a barcodes file is not needed.
The parameter input file, usually named params.txt, can be created with the -n option in pyRAD, like below. This file lists all of the options or parameter settings necessary for an analysis.
Every line has a parameter (e.g., a number or character string) followed by any number of spaces or tabs and then two hash marks (“##”), after which the parameter is described, or comments can be added. In parentheses I list the step of a pyRAD analysis affected by the parameter. If a line is left blank the default value is used. I highly recommend beginning all analysis by creating an empty template params.txt file with the following command:
~/pyrad_v.2.16/pyRAD -n
The following section describes the params file line by line. The first 14 lines are required, all lines following this are optional and used for more advanced analyses or data filtering. For each input line I list several possible formats:
## 1. uses current as working directory
./ ## 1. uses current as working directory
/home/user/RAD/ ## 1. uses set location as working directory
./raw/*.fastq ## 2. path to raw data files
raw/*.fastq.gz ## 2. path to raw data files (gzipped)
./barcodes.txt ## 3. path to barcodes file
mydata.barcodes ## 3. path to barcodes file
/user/bin/usearch7.0.1090_i86linux32 ## 4. full path
usearch7.0.1090_i86linux32 ## 4. in system PATH
muscle ## 5. path to call muscle
muscle3.8.31_i86linux32 ## 5. path to muscle long name
TGCAG ## 6. PstI cutsite overhang C|TGCAG
CWGC,AATT ## 6. ddRAD example for apeKI, EcoRI
8 ## 7. N processors
10 ## 8. mindepth for base calls
4 ## 9. maxN: max number of Ns in reads
.90 ## 10. clustering threshold
rad ## 11. Datatype
pairddrad ## 11. Datatype
4 ## 12. minCov
4 ## 13. maxSharedH.
arabidopsis_c90m4p4 ## 14. output name prefix
These options are not necessary for all analyses and can be left empty in the params file, in which case pyRAD uses their default values. They mostly pertain to options for additional quality filtering, for paired-end analyses, or for increasing the speed of analyses.
A ## 15. selects prefix subset of files
outg1,outg2 ## 16. outgroup selector for step 7
sample3,sample4 ## 17. Exclude from step 7
~/RAD/fastq/*.fastq ## 18. Sorted files
@~/RAD/fastq/*.fastq ## 18. sorted and stripped files
1 ## 19. max barcode mismatches
33 ## 20. min Quality score step 2
0 ## 21. strictness of Q filter
0.0005,0.001 ## 22. E and H, respectively
5 ## 23. MaxN in consensus seqs
5 ## 24. MaxH in a consensus seq
1 ## 25. diploid filter max haplos=2
10 ## 26. max SNPs in a final locus
8,12 ## 26. max SNPs in 1st and 2nd pair read
3 ## 27. max indels in a within-sample cluster
3,10 ## 27. max within-sample, max for across-sample cluster
3,6,10,10 ## 27. max within_read1,within_read2,across_read1,across_read2
112233 ## 28. random number seed
1,1 ## 29. overhang/trim ends of locus
a,n ## 30. outputs formats
1 ## 31. call majority base on low depth sites
30 ## 32. minimum overlap (nbases) of pairs
0 ## 33. don't keep trimmed sequences
50 ## 33. keep trimmed longer than 50
## 34. default max depth option
50000 ## 34. a set max depth
999999999999999 ## 34. no max depth
1 ## 35. exclude singletons
5 ## 35. exclude reads occurring less than 5 times
A 4 1A0,1B0,1C0,1D0
B 4 2*
C 4 3*
pyRAD should either be called from the directory in which it comes (i.e., with its other dependency .py files), or you can create a symlink to call the program from anywhere. For example, the command below would create a link so that you call pyRAD by simply typing 'pyRAD' if you were to put the link in your $PATH.
ln -s ~/pyrad_v.2.16/pyRAD pyRAD
If the symlink business seems confusing, don't worry. You can just call pyRAD from its directory. For example, to call pyRAD if it had been downloaded to a user’s home/ directory, type the command below, which will print the help menu.
~/pyrad_v.2.16/pyRAD -h
Or if you created a symlink:
pyRAD -h
There are six options to the command line interface of pyRAD:
The parameter input file (params.txt) should be created using the -n option and then filled in based on the type of analysis you plan to do. If the params file is properly filled out then the entire analysis – converting raw RADseq data into assembled de novo loci – can be done by simply entering:
pyRAD -p params.txt
This will perform all steps (1-7) of the analysis, as described below. If, however, you wish to perform only one, or a few steps at a time, then you can select these using the -s (steps) option. For example, to select only the first step:
pyRAD -p params.txt -s 1
And to run steps 2-7 you could enter:
pyRAD -p params.txt -s 234567
After finishing one assembly, you could rerun step 7 with the same or a different params.txt file indicating different step 7 filtering options.
pyRAD -p params.txt -s 7
When learning new software, or applying it to a new data set, it can take time to find the appropriate parameter settings and options. This can be especially frustrating with computationally intensive analyses, where one might have to wait many hours (or days) before finding they made a simple mistake. The -s option is an attempt to reduce this annoyance by breaking the analysis into seven discrete jobs, each of which is performed sequentially. In this way, you know whether step 1 worked before moving to step 2. And if step 2 fails, you retain the results of step 1, and can start again from where the error arose. I suggest running the example data sets (tutorials) to make yourself familiar with the steps of an analysis and what the output files and statistics look like before beginning your own analysis.
Step 1: de-multiplexing. This step uses information from the barcodes file to separate sequences from your raw fastq files into a separate file for each sample. These are placed in a new directory within your working directory called “fastq/”. File names are not important for single end data. However, for paired-end reads it is necessary that the raw data file names follow a specific format: The first read files must contain "_R1_" in the name, and the second read files must be identical to the first read files but with "_R2_" in place of "_R1_". Here is an example pair of input files:
yourfilename_R1_001.fastq
yourfilename_R2_001.fastq
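The naming rule above can be checked programmatically. This is a small illustrative sketch (not pyRAD code): for each first-read file it derives the expected second-read partner by substituting "_R2_" for "_R1_":

```python
# Sketch: derive the expected _R2_ partner for each _R1_ raw data file,
# following the paired-end naming convention described above.

def find_pairs(r1_files):
    """Return (first-read, expected second-read) filename tuples."""
    pairs = []
    for r1 in r1_files:
        if "_R1_" not in r1:
            raise ValueError("not a first-read file: " + r1)
        pairs.append((r1, r1.replace("_R1_", "_R2_")))
    return pairs

pairs = find_pairs(["yourfilename_R1_001.fastq"])
```

You could then confirm with `os.path.exists` that each expected _R2_ file is actually present before starting step 1.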
A reminder: if your data are already demultiplexed this step can be skipped. You will not need to enter a location for your raw data in the params file; instead, enter the location of your demultiplexed files on line 18 of the params file.
Step 2: filtering. This step uses the Phred quality scores recorded in the FASTQ data files to filter low-quality base calls. Sites with a score below the set value (line 20) are changed into “N”s, and reads with more than the allowed number of “N”s are discarded. Files are written to the “edits/” directory with the suffix “.edit”. This step also implements a number of optional filters: for paired-end data you can separate out overlapping sequences (line 32), and you can filter for Illumina adapter sequences (line 21).
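The core masking logic is simple to illustrate. This is only a simplified sketch of the idea, not pyRAD's actual implementation, and the threshold values are illustrative:

```python
# Sketch of the step 2 quality filter: bases whose Phred score falls below
# the threshold become N, and reads with too many Ns are dropped entirely.

def filter_read(seq, quals, min_q=20, max_n=4):
    """quals: list of Phred scores, one per base. Returns masked seq or None."""
    masked = "".join(b if q >= min_q else "N" for b, q in zip(seq, quals))
    return masked if masked.count("N") <= max_n else None

kept = filter_read("TGCAG", [30, 30, 10, 30, 30])  # third base is masked
```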
Step 3: within-sample clustering. This step first dereplicates the filtered sequences from step 2, recording the number of times each unique read is observed. The unique reads are then clustered with USEARCH, grouping those that match within the set sequence-similarity threshold. Sequences that clustered together are then aligned and written to a new file in the “clust.xx/” directory with the ending “.clustS.gz”.
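The dereplication half of this step amounts to counting identical reads, which can be sketched in a few lines (pyRAD does this internally before calling USEARCH; this is only an illustration):

```python
# Sketch of dereplication: collapse identical reads and record how many
# times each unique read was observed.
from collections import Counter

def dereplicate(reads):
    """Map each unique read to its observation count."""
    return Counter(reads)

reads = ["TGCAGAA", "TGCAGAA", "TGCAGAT"]
uniques = dereplicate(reads)
```

Clustering then operates on the unique reads only, with the counts carried along as depth information for the base calls made later in step 5.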
Step 4: error-rate and heterozygosity estimates. This step uses the ML equation of Lynch (20XX) to jointly estimate error rate and heterozygosity from the base counts in each site across all clusters. Results are written to the Pi_estimate.txt file in the stats/ directory.
Step 5: create consensus sequences. Using the mean error rate and heterozygosity estimated in step 4, this step creates consensus sequences for each cluster. Clusters with less than the minimum coverage, more than the maximum number of undetermined sites, more than the maximum number of heterozygous sites, or more than the allowed number of alleles are discarded. If two alleles are present, the phase of heterozygous sites is retained in the consensus sequences.
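The filters applied here can be expressed as a simple predicate. The sketch below is illustrative only (the thresholds mirror params lines 8, 23, and 24, but the defaults shown are arbitrary, and pyRAD's real logic is more involved):

```python
# Sketch of the step 5 filters: keep a cluster's consensus only if it passes
# the minimum-coverage, maxN (undetermined), and maxH (heterozygous) checks.

def passes_filters(consensus, depth, mindepth=10, max_n=5, max_h=5):
    """Return True if the consensus sequence survives all three filters."""
    n_undetermined = consensus.count("N")
    # IUPAC ambiguity codes mark heterozygous sites in consensus sequences
    n_hetero = sum(consensus.count(c) for c in "RYSWKM")
    return depth >= mindepth and n_undetermined <= max_n and n_hetero <= max_h

keep = passes_filters("TGCAGRAACNT", depth=12)
```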
Step 6. Consensus sequences are clustered across samples using the same settings as in step 3. If a sample is heterozygous at a site, one allele is randomly sampled and used in this step, although both alleles are retained in the final data set.
Step 7. Alignment, detection of paralogs, output of a human-readable fasta-like file (.loci) and of concatenated loci (.phy); other optional formats can be specified (.nex, .snps, .alleles), and final statistics are written to a .stats file. This step is relatively fast, and can be repeated with different values for options 12, 13, 16, and 17 to create different data sets that are optimized in coverage for different samples.
The following reduced representation library data types are supported by pyRAD, and I list beside each my general experience with analyzing empirical data sets. My quality judgements are based on the frequency with which users have generated data sets that include many short fragment reads with adapter sequences, or which require extra types of analyses (e.g., reverse complement matching for overlapping or paired-end GBS.) This may change as library preparations of GBS and ddRAD libraries are improving. I describe each data type and how pyRAD treats them differently below.
RAD (very good)
ddRAD (usually good)
GBS (sometimes difficult/messy)
nextRAD (very good)
paired ddRAD (usually difficult/messy)
paired GBS (very difficult/messy)
other types (... I haven't tried yet)
In general single-end Illumina data that employ a good size selection window will produce very good and usable data by any of the library prep methods above. Problems tend to arise when the size selection is too small (or non-existent), and especially if this coincides with paired-end sequencing.
Poor size selection has little effect on traditional RAD or on single end ddRAD, as you can see above, except to include adapter sequences in the reads which can be filtered out. It has a more significant effect on paired ddRAD, and a very significant effect on single or paired GBS data. This is because short fragments will be sequenced from both ends of these types of data, leading to sequence overlap.
In paired ddRAD the first and second reads come in pairs, and pyRAD can test for overlap of these reads using the merge overlap option. This will remove merged reads and write them to a separate file. These can be analyzed separately from the non-overlapping reads if they make up a large proportion of your data set. Recommendations for this type of analysis are made in the paired ddRAD tutorial. I only recommend trying to rescue merged reads if they comprise a large proportion of your data.
For GBS data the problem is more difficult. In single end data the Illumina adapter can ligate to either end of fragments, meaning short single-end fragments can have overlap leading to duplication in the data set. pyRAD can correct for this by performing reverse-complement clustering. This is much slower than normal clustering. It detects overlap of the reads and groups them into partially or fully overlapping clusters that are then used to make consensus base calls to create merged contigs. Recommendations for this type of analysis are made in the GBS tutorial.
For paired GBS data the merge overlap option can merge overlapping reads, however the first and second reads are completely interchangeable, so reverse complement clustering is also required in this case.
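Reverse-complement clustering relies on comparing each read both as-is and reverse-complemented. As a minimal illustration of that operation (pyRAD and USEARCH handle this internally):

```python
# Sketch: reverse-complement a read, as used when matching GBS fragments
# that were sequenced from opposite ends.

def revcomp(seq):
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G", "N": "N"}
    return "".join(comp[b] for b in reversed(seq))

rc = revcomp("TGCAG")
```

A read and its reverse complement represent the same fragment, which is why naive clustering would otherwise split such fragments into two artificial loci.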
See Example Tutorials at this page
By default pyRAD will return the final data in two formats:
There are a number of additional formats possible as well.
And two more complicated output formats that allow pooling individuals into populations. See instructions in the Example RAD tutorial.
These additional formats will be output if indicated on line 30 of the params file by their first letter, e.g., 'a' for alleles, 'n' for nexus, etc., except for structure, which is indicated by 'k'.
Examples are best viewed by running the RAD tutorial under Example Analyses, above, but are also described below.
This is a custom format that is easy to read, showing each individual locus with variable sites indicated. Custom scripts can easily parse this file for loci containing certain amounts of taxon coverage or variable sites. It is also the most easily readable file for verifying that your analyses are working properly. Example:
For paired-end data the two linked loci are shown separated by a 'xxxx' separator.
This is a phylip formatted data file which contains all of the loci from the .loci file concatenated into a supermatrix, with missing data for any sample filled in with N's. This format is used in RAxML among other phylogenetic programs.
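The concatenation and N-padding described above can be sketched as follows. This is a toy illustration of the supermatrix idea, not pyRAD's actual writer, and it omits the phylip header:

```python
# Sketch: build a concatenated supermatrix from per-locus alignments,
# padding samples missing from a locus with Ns.

def concatenate(loci, samples):
    """loci: list of dicts mapping sample -> sequence (equal length per locus)."""
    matrix = {s: "" for s in samples}
    for locus in loci:
        length = len(next(iter(locus.values())))
        for s in samples:
            matrix[s] += locus.get(s, "N" * length)
    return matrix

loci = [{"A": "TGCA", "B": "TGCT"}, {"A": "GGAT"}]  # sample B lacks locus 2
matrix = concatenate(loci, ["A", "B"])
```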
This is a nexus formatted data file which contains all of the loci from the .loci file concatenated into a supermatrix, but printed in an interleaved format, with missing data for any sample filled in with N's, and with data information appended to the beginning. This format is used in BEAST among other phylogenetic programs.
This is the same as the .loci format but instead of ambiguity codes for the consensus sequences the two alleles are printed out for each sample at each locus. Output of polyploid (>2) alleles is not supported.
This format lists only the variable sites in each locus with a space between SNPs from different loci. An underscore represents an invariable locus. Loci are in the same order as in the .loci file. Paired loci are treated as a single locus, meaning SNPs from the two loci are not separated in this file (linked).
Each column contains one SNP sampled from one locus. If a locus has multiple SNPs, the SNP site containing the least missing data across taxa is sampled; if several sites have equal amounts of missing data, one is sampled at random.
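The sampling rule can be written out explicitly. The sketch below is an illustration of the rule as described, not pyRAD's code; each SNP site is represented by its index and its count of missing taxa:

```python
# Sketch of the unlinked-SNP sampling rule: among a locus's SNP columns,
# take the one with the least missing data, breaking ties at random.
import random

def sample_snp(columns):
    """columns: list of (site_index, n_missing) for SNP sites in one locus."""
    least = min(n for _, n in columns)
    best = [i for i, n in columns if n == least]
    return random.choice(best)

site = sample_snp([(4, 2), (11, 0), (20, 1)])  # site 11 has no missing data
```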
The excluded loci file is in the same format as the .loci file but shows the loci that were filtered out in step 7, with the triggering filter listed below each locus. %P indicates the locus was filtered as a paralog because a heterozygous site was shared across more than the allowed number of samples. %D indicates the locus had more than one sequence from the same individual. %S indicates more SNPs were detected in the locus than the maximum allowed number. %I means there were more indels in the locus than the maximum allowed number.
>A CCRACACAGAGGGCCAATAGACACGTAGCTCGAGATCTCACCGCCCCCGCGATCT
>B CCRACACAGAGGGCCAATAGACACGTAGCTCGAGATCTCACCGCCCCCGCGATCT
>C CCRACACAGAGGGCCAATAGACACGTAGCTCGAGATCTCACCGCCCCCGCGATCT
>D CCRACACAGAGGGCCAATAGACACGTAGCTCGAGATCTCACCGCMCCCGCGATCT
>E CCRACACAGAGGGCCAATAGACACGTAGCTCGAGATCTCAGCACCCCCGCGATCT
>F CCAACACAGAGGGCCAATAGACACGTAGCTCGAGATCTCAGCAACCCCGCGATCT
//%P * *-- |
>A CGTACAATAGGATGCACCGTTCGTAAAGTTCTTGACGGTGGACCGAGAACTAGGT
>B CGTACAATAGGATGCACCGTTCGTAAAGTTCTTGACGGTGGACCGAGAACTAGGT
>C CGTACAATAGGATGCACCGTTCGTAAAGTTCTTGACGGTGGACCGAGAACTAGGT
>D CGTACAATAGGATGCACCGTTCGTCAAGTTGTTGACGGTGGACCGAGAACTAGGT
>D CGTACAATAGGATGCACCGTTCGTCAAGTTGTTGACGGTGGACCGAGAACTAGGT
//%D - - |
>A CGTACAATAGGATGCACCGTTCGTAAAGTTCTTGACGGTGGACCGAGAACTAGGT
>B CGTACAATAGGATGCACCGTTCGTAAAGTTCTTGACGGTGGGACGGACGGACGGA
>C CGTACAATAGGATGCACCGTTCGTAAAGTTCTTGACGGTGGGACGGACGGACGGA
>D CGTACAATAGGATGCACCGTTCGTCAAGTTGTTGACGGTGGACCGAGAACTAGGT
//%S - - ** ******* *|
...
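Because the filter code appears on each locus's "//" separator line, it is easy to tally why loci were excluded. This is a hypothetical helper script, not part of pyRAD; the example lines are simplified:

```python
# Sketch: count the filter codes (%P, %D, %S, %I) on the separator lines of
# an excluded-loci file, to summarize why loci were dropped in step 7.
from collections import Counter

def count_filters(lines):
    """Count occurrences of each filter code on the '//' separator lines."""
    codes = Counter()
    for line in lines:
        if line.startswith("//%"):
            codes[line[2:4]] += 1
    return codes

counts = count_filters(["//%P  * *-- |", ">A ...", "//%D  - - |"])
```

A high proportion of %P exclusions, for example, might prompt you to revisit the maxSharedH setting on line 13 of the params file.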
Clustering across samples (step 6 of the analysis) can be quite fast for data sets with few sampled individuals, but for large studies (>100 individuals or >3 lanes of data) this step can sometimes take very long (weeks), depending on the total number of unique sequences. Hierarchical clustering splits this job up into multiple smaller jobs that each run on a separate processing core, allowing clustering of large data sets many times faster.
The most significant speed increase comes from excluding loci at the first step of clustering that have significant missing data within a subgroup. By excluding singletons, or loci with little shared data, early in the clustering process, time is not spent comparing them to all of the other more distantly related samples. This is particularly relevant if you plan to only use loci in your final data set that recover data shared across many individuals.
The figure above illustrates the general concept. Individuals are grouped a priori into a small number of clades or populations, in this case a, b, and c. The assignment of individuals does not have to be perfectly correct in terms of knowing their true phylogeny, rather this is an approximation or heuristic, where those grouped together are expected to share more data than those which are not grouped together.
In this case each clade has four sampled taxa. To group our samples into these three clades we assign them to groups in the bottom lines of the params file.
Each group is listed on a separate line, and it is OK to leave a line blank. Each line contains three elements, separated by spaces: (1) a group name (e.g., 'A' or '1'); these names are arbitrary, just choose something different for each group. (2) The minimum size of a cluster within this group; any cluster smaller than this will not be clustered across groups. (3) The names of samples/individuals in this group. Sample names can be listed individually, comma-separated, or selected using a wildcard selector (e.g., *).
The sample names that this is matching to are the prefix names to the ".consens.gz" files located in the clust directory. For example, imagine we have the samples below.
sample1.consens.gz
sample2.consens.gz
sample3.consens.gz
sample4.consens.gz
sample5.consens.gz
sample6.consens.gz
sample7.consens.gz
sample8.consens.gz
sample9.consens.gz
Imagine also that we know samples 1-4 are individuals of species A, samples 5-8 are individuals of species B, and sample9 is from a more distant outgroup species. We could cluster them using the following settings:
A 4 sample1,sample2,sample3,sample4
B 2 sample5,sample6,sample7,sample8
C 1 sample9
The minimum cluster size option tells pyRAD to only cluster loci across clades that have at least N matches within a clade. In this way loci that are singletons, or shared by very few taxa, are not clustered across clades. In the case of the outgroup sample C, we entered 1 for the minimum size, and thus all reads from this sample are clustered with the other two clades. In contrast, we chose 4 for clade A, and 2 for clade B. This means only loci that have data for all four samples in clade A will be clustered with clades B and C, and only loci with data from at least 2 individuals in clade B will be clustered with clades A and C.
Following this logic, you can assign individuals to clades in a way that speeds clustering but also maximizes the amount of data retained in your data set. If a clade has very few individuals in it, or those individuals have less data overall than other samples then you may want to set a low minimum cluster size to discard as little data as possible. In general, though, you only get dramatic speed increases if at least some clades have minimum cluster size set >1.
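The screening rule above reduces to a simple size check per within-group cluster. The sketch below is only an illustration of that heuristic, with hypothetical inputs:

```python
# Sketch of the hierarchical-clustering screen: a within-group cluster is
# passed on to across-group clustering only if it contains data from at
# least the group's minimum number of samples.

def keep_for_across(cluster_samples, min_size):
    """cluster_samples: set of sample names with data at this locus."""
    return len(cluster_samples) >= min_size

# With 'A 4 ...' in the params file, a locus needs all four clade-A samples:
ok = keep_for_across({"sample1", "sample2", "sample3", "sample4"}, 4)
```

This is why the speed gain is largest when at least some groups have a minimum cluster size greater than 1: small clusters are discarded before the expensive across-group comparison.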
There is an example tutorial implementing hierarchical clustering and which provides more detail in the Examples section above.
coming soon...