Full Tutorial: pyRAD v.3.0

date: 12/4/2014

author: Deren Eaton | [email protected]


This is the full tutorial for pyRAD v.3.0. It is an attempt to explain the inner workings of each parameter setting used in assembling RADseq data sets in pyRAD. I highly recommend that users begin by doing a run through of the RAD tutorial (v.3.0), which only takes a few minutes, and will help you to gain a general sense for how the program works, and what the data and results look like. Then follow up with a more detailed look at this tutorial to see the full range of options available, and if you still have questions, please send them to the google group at the link above.


Contents:

1. Why use pyRAD?

Reduced-representation genomic sequence data (e.g., RADseq, GBS, ddRAD) are commonly used to study population-level research questions and consequently most software packages for assembling or analyzing such data are designed for sequences with little variation across samples. Phylogenetic analyses typically include species with deeper divergence times (more variable loci across samples) and thus a different approach to clustering and identifying orthologs will perform better.

pyRAD, the software pipeline described here, is intended to maximize phylogenetic information across disparate samples from RAD, ddRAD, or GBS data, hereafter referred to generally as RADseq. Written in Python, the program is flexible and easy to use. The code is human-readable, open-source and may be modified to meet users specific needs. With respect to constructing phylogenetic data sets, the largest difference between pyRAD and alternative programs such as Stacks, is in the way that loci are clustered. Stacks clusters reads by grouping those with less than N differences between them, but does not allow for indels, such that the presence of even a single indel will cause a frameshift resulting in many base differences between otherwise highly similar sequences. pyRAD instead uses a measure of sequence similarity based on a global alignment of sequences through the program VSEARCH (or USEARCH). Alignment clustering allows for indels, and should perform better in data sets including more distantly related samples.

pyRAD was written with the aim that it be easy to use, fast, and flexible. It is easy in that installation does not require any difficult libraries or extensions, and that it requires few arguments to the command line. It is fast, using both multi-processing and multi-threading, and also offers a number of heuristics to speed clustering of extremely large data sets. And it is flexible in that it has built-in methods to account for different library types (RAD, ddRAD, GBS, paired-ddRAD, paired-GBS) as well as quality filtering, including overlap among paired or reverse-complemented reads, and the presence of adapter sequences. Thus, data require little or no filtering before before analysis in pyRAD.

Two fast clustering methods: hierarchical-clustering for large data sets, and split-clustering for paired-end ddRAD data are described further below. Using these methods data sets with even many hundreds of individuals can be assembled in a few days by distributing across multiple processing cores. pyRAD can be used to perform D-statistic tests on RADseq loci to test for historical introgression. Assembled data sets can be output in a number of formats for easy input into external applications such as BEAST, RAxML, TreeMix, Structure, Admixture and migrate-n.

An open access pre-print describing pyRAD v.1 and comparing it with Stacks is available here: link


2. Installation

pyRAD

You can download a stable release that will work on Mac or Linux here or clone the git repository onto your machine with the following command:

In [ ]:
git clone https://github.com/dereneaton/pyrad.git

With git, future updates can be attained by using cd into the pyrad directory and running git pull.

pyRAD requires the common Python packages Numpy and Scipy, and two external programs, muscle and vsearch.

Numpy & Scipy:

These will already be available on almost any HPC machine. To install them on a personal machine you can use the apt-get command sudo apt-get install python-numpy python-scipy on Ubuntu Linux. On a Mac there are several ways to install these, but it varies depending on the version of your OS. I recommend using conda install which you should be able to find with google.

Muscle

Muscle is used for sequence alignments. This will already be available on most HPC machines. It can be installed on Ubuntu linux with the command apt-get install muscle, or you can find binaries at www.drive5.com/muscle/downloads.

Vsearch

Vsearch is used to cluster sequences by similarity. Both source and binaries are available here: https://github.com/torognes/vsearch. Older versions of pyrad required the program Usearch (usearch_v.7.0.1090) for clustering, and this is still supported. However, I now recommend vsearch as a replacement because it is free and open-source, and I've found that it can perform more than 30X faster while yielding identical results. The speed improvement does require greater memory load.


Take note of the location where you install muscle and vsearch. Below I will describe how to ensure pyRAD can find your installations.

3. Input files

3.1 - the barcodes file

The barcodes file is a simple table linking barcodes to samples. Barcodes can be of varying lengths. Each line should have one name and one barcode, separated by a tab or space.

In [ ]:
sample1     ACAGG
sample2     ATTCA  
sample3     CGGCATA  
sample4     AAGAACA    

If your data are already de-multiplexed (sorted) into separate files for each individual then a barcodes file is not needed.

3.2 - the params file

The parameter input file, usually named params.txt, can be created with the -n option in pyRAD, like below. This file lists all of the options or paramater settings necessary for an analysis.

Every line has a parameter (e.g., a number or character string) followed by any number of spaces or tabs and then two hash marks (“##”), after which the parameter is described, or comments can be added. In parentheses I list the step of a pyRAD analysis affected by the parameter. If a line is left blank the default value is used. I highly recommend beginning all analysis by creating an empty template params.txt file with the following command:

In [ ]:
pyRAD -n

The following section describes the params file line by line. The first 14 lines are required, all lines following this are optional and used for more advanced analyses or data filtering. For each input line I list several possible formats:

Required lines 1-14


  • Line 1: The working directory for your analysis. Three different examples are shown below. This is where all of the output files from the analyses will be placed.
In [ ]:
                   ## 1. uses current as working directory
./                 ## 1. uses current as working directory
/home/user/RAD/    ## 1. uses set location as working directory
  • Line 2: The location of the raw fastq formatted Illumina sequence data. Use the wildcard character * to select multiple files in the same directory. Paired-end data files should be in pairs that differ only by the presence of a "R1" and "R2" in their names. Data files can be gzipped.
In [ ]:
./raw/*.fastq       ## 2. path to raw data files
raw/*.fastq.gz      ## 2. path to raw data files (gzipped)
  • Line 3: The location of the barcodes file (described above). If your data are already de-multiplexed then lines 2 and 3 can be left blank, but line 18 (see below) must be entered. Example:
In [ ]:
./barcodes.txt        ## 3. path to barcodes file
mydata.barcodes       ## 3. path to barcodes file
  • Line 4: The command to call vsearch (or usearch). If vsearch is in your $PATH, then simply list the executable file name here. Otherwise, enter the full path to vsearch that is required for you to call it from the command line. Examples:
In [ ]:
vsearch                                ## 4. in systemPATH
/user/bin/usearch7.0.1090_i86linux32   ## 4. full path
usearch7.0.1090_i86linux32             ## 4. in system PATH
  • Line 5: The command to call muscle. Similar to above.
In [ ]:
muscle                       ## 5. path to call muscle  
muscle3.8.31_i86linux32      ## 5. path to muscle long name    
  • Line 6: The sequence of the restriction recognition site overhang. If your data are ddRAD list the common cutter (attached to first reads) followed by the rare cutter (attached to second reads), separated by a comma. If you're not sure of the sequence of your cutters look at your raw data files for the common sequence found at the left side of your reads. Cutters can contain ambiguous bases (Example: CWGC). Examples:
In [ ]:
TGCAG                  ## 6. PstI cutsite overhang C|TGCAG
CWGC,AATT              ## 6. ddRAD example for apeKI, EcoRI
  • Line 7: The number of parallel jobs to run. For most non-clustering steps this will also be the number of processing cores that are used. Parallel processing in pyRAD usually works by executing individual samples separately on different processors, so you rarely gain speed by setting a number greater than the number of samples in your data set. However, for clustering steps 3 and 6 the number of threads per parallel job can be set to use many more cores (see line 36). Example:
In [ ]:
8                      ## 7. N processors
  • Line 8: The minimum depth of coverage to make a statistical base call at each site in a cluster. Depending on the inferred error rate, five is typically the minimum depth at which a base can be called. For popgen analyses you may want to use a higher minimum depth, like 10. If your data are very low coverage you can instead call majority consensus base calls. For this, set a low number here (e.g., 2), and turn on majority rule calling (line 29; see below).
In [ ]:
5                     ## 8. mindepth for base calls
  • Line 9: The maximum number of low quality, undetermined (“N”) sites in step 2 filtered sequences. This number should be selected with respect to the clustering threshold. A low clustering threshold (.85) can allow more undetermined sites; whereas a high clustering threshold (.95) should exclude reads with too many undetermined sites or they will affect clustering. Similarly, very long sequences (200 bp paired ddRAD) should allow more Ns since there are more bases and because second reads may be lower quality.
In [ ]:
4                      ## 9. maxN: max number of Ns in reads     
  • Line 10: The similarity threshold used by vsearch for sequence clustering, entered as a decimal. This value is used for both within-sample clustering and across-sample clustering. I (more or less) require you to use the same threshold for both, and recommend doing so. Given the way the clustering algorithm works it only makes sense (to me) to do so, otherwise reads that do not cluster together within a sample will later cluster together when you do across-sample clustering, which is undesirable. During clustering the cutsite bases remain attached to the reads, thus all loci will have an invariable set of ~4-6 bases on one side. For 100 bp reads containing a 5-bp cutter and with have a 5 bp barcode trimmed off this results in 95 bp of data, 5 of which are invariable. Thus, if you want to allow 5 base differences between reads (90/95 = .947) you should use a setting of .94. Repeat regions are masked using the 'dust' method by default (see masking; line 35). Example:
In [ ]:
.90               ## 10. clustering threshold
  • Line 11: The type of Reduced representation library. There are six options, currently. Entered as all lowercase lettering: rad, ddrad, gbs, pairddrad, pairgbs, merge. The different treatments of these data types are described below in section RADseq/RRL datatypes.
In [ ]:
rad               ## 11. Datatype  
pairddrad         ## 11. Datatype
  • Line 12: Minimum taxon coverage. The minimum number of (ingroup) samples with data for a given locus to be retained in the final data set. If you enter a number equal to the full number of samples in your data set then it will return only loci that have data across all samples. If you enter a lower value, like 4, it will return a more sparse matrix, including any loci for which at least four samples contain data. Example:
In [ ]:
4                 ## 12. minCov
  • Line 13: Maximum number (or proportion) of shared polymorphic sites in a locus. Enter a number, or a decimal with the prefix “p” (e.g., p.10 for 10%). This option is used to detect potential paralogs, as a shared heterozygous site across many samples likely represents clustering of paralogs with a fixed difference rather than a true heterozygous site. Example:
In [ ]:
4                 ## 13. maxSharedH.
  • Line 14: Prefix name for final output files. I usually enter a name indicative of the other parameters from the lines above. Example:
In [ ]:
arabidopsis_c90d5m4p4      ## 14. output name prefix       

optional lines 15-37

These options are not necessary for all analyses and can be left empty in the params file, in which case pyRAD uses their default values. They mostly pertain to options for additional quality filtering, for paired-end analyses, or for increasing the speed of analyses.


  • Line 15: Subset selector for steps 2-7. If, for example, you have data for a number of individuals from two different clades, and for one clade all of the sample names begin with the letter “A”, then you can select only these samples to analyze by entering the shared prefix "A*” on this line.
In [ ]:
A               ## 15. selects prefix subset of files
  • Line 16: Add-on (outgroup) taxa selector. The MinCov parameter on line 12 tells pyRAD to keep only loci with data from at least n ingroup samples. If you wanted to maximize coverage within a certain clade you could set the other taxa as "outgroups", and only loci with at least MinCov ingroup samples will be kept, and any additional matching of ’outgroup’ taxa to those will be included in the data, but not count toward the minimum number of samples. List ’outgroup’ taxa names here separated by commas.
In [ ]:
outg1,outg2            ## 16. outgroup selector for step 7
  • Line 17: Exclude taxa. When you are constructing data sets at step 7 you can exclude certain taxa in order to maximize coverage among other included samples. For example you may want to exclude taxa that are found to have very few data, or to exclude outgroups for population analyses. List sample names comma separated here. Example:
In [ ]:
sample3,sample4        ## 17. Exclude from step 7
  • Line 18: If your data are already de-multiplexed into separate files for each barcode then you can skip step 1 in the analysis and go straight to step 2. Now, you must enter the location of your sorted fastq files on this line. The reads should have the barcode trimmed off. If they also have the cutsite trimmed off then enter the ’@’ symbol at the beginning of the line to indicate this. Files can be compressed (.gz) or not. Examples:
In [ ]:
~/RAD/fastq/*.fastq    ## 18. Sorted files
@~/RAD/fastq/*.fastq   ## 18. sorted and stripped files
  • Line 19: The maximum number of mismatches allowed in a barcode during de-multiplexing. Default is 1, it's usually not necessary to adjust this.
In [ ]:
1                      ## 19. max barcode mismatches
  • Line 20: Phred Quality score offset (usually 33, but sometimes 64). Base calls with a phred score below 20 are converted to Ns. If you want this value to be more strict you can change the offset to be higher. For example, to raise the minimum score to 30 set the offset to 43. If left blank the default is 33.
In [ ]:
33                   ## 20. min Quality score step 2
  • Line 21: Strictness of filtering in step 2. (0) is default and means no filtering for barcodes, adapters and cut sites, only correct for quality scores of base calls. (1) looks for cutsite+adapters and trims them out. (2) tries to detect these while allowing some errors in them, thus enforcing the most strict filtering. For most data sets that do not include many overlapping and short fragments options 0 or 1 is recommended. If you prepared your own libraries instead of using a commercial service I highly recommend running at least filter=1 to check for adapter sequences.
In [ ]:
0                    ## 21. strictness of Q filter
  • Line 22: a priori error rate and heterozygosity. Step 4 of the analysis jointly estimates the error rate and heterozygosity, and step 5 uses these values to make base calls. If for some reason you wish to skip the error rate estimate (e.g., your samples are polyploid), then you can enter an a priori estimate of the error rate here. You could estimated this from data for a diploid relative, for example. Caution: you should generally let the program estimate it by using step 4, as the error rate here is being measured on data that have already passed through some filters, so error rates estimated from elsewhere may not be correct. Across several data sets I typically find error rates between 0.0001 - 0.002, depending on length and sequence chemistry. Example:
In [ ]:
0.0005,0.001        ## 22. E and H, respectively
  • Line 23: maximum number of “N”s in a consensus sequence. Clustering across samples can be affected by the number of “N”s, and so this number should be chosen with respect to line 6 and the lengths of reads. Default is 4.
In [ ]:
5                  ## 23. MaxN in consensus seqs
  • Line 24: maximum number Hs (heterozygous sites) in a consensus sequence. Among several methods to exclude potential paralogs or highly repetitive genomic regions, you can set a maximum number of heterozygous sites in a consensus sequence. Default is 4.
In [ ]:
5                  ## 24. MaxH in a consensus seq
  • Line 25: Ploidy (options=1,2, or any number >2). Default=2. If diploid this allows only 2 haplotypes in a consensus sequence after correcting for sequencing errors, and excludes consensus seqs with more than 2 alleles. If set to 1 (haploid) then E,H are estimated with H fixed to 0.0, and base calls are made with this estimated error rate and excludes consensus seqs with more than 1 allele present. If ploidy is set >2 then E,H and base calls are made using diploid assumptions, but seqs with more than 2 alleles are not excluded (see polyploidy section in FAQs). Example:
In [ ]:
1                  ## 25. diploid filter max haplos=2
  • Line 26: Maximum number of SNPs in a final locus. This can remove potential effects of poor alignments in repetitive regions in a final data set by excluding loci with more than N snps in them. For paired data you can set max SNPs in the first read and second read separately, or use one value for the pair, Examples:
In [ ]:
10                ## 26. max SNPs in a final locus
8,12              ## 26. max SNPs in 1st and 2nd pair read
  • Line 27: Maximum number of insertions/deletions. If only one value, applies to within-sample clusters only. If two values, first is within-sample clusters, second is across-sample clusters. For paired-data, enter four values: withinclusterread1, withinclusterread2, acrossclustersread1, acrossclustersread2. Defaults are 3,6,99,99 respectively. Examples:
In [ ]:
3                ## 27. max indels in a within-sample cluster
3,10             ## 27. max within-sample, max for across-sample cluster
3,6,10,10        ## 27. max within_read1,within_read2,across_read1,across_read2
  • Line 28: random number seed. Randomization is used to sort the input order of sequences before clustering in steps 3 and 6. This can have a (usually minimal) effect on results. For the unlinked SNPs outputs (.usnps, .treemix, .str, .geno) one SNP is randomly sampled from each variable RAD locus in step 7. The default seed is 112233, so analyses are repeatable by default. You can change the seed to get different random sampling.
In [ ]:
998877           ## 28. random number seed
  • Line 29: Allow overhanging ends of reads in final data set. If reads are different lengths or overlap to different degrees, 1,1 will trim to shortest sequence on either side of locus; 0,1 would trim only the right side; 0,0 allows both ends to overhang. For paired data 1,1,1,1 would trim overhangs on left1,right1,left2,right2.
In [ ]:
1,1              ## 29. overhang/trim ends of locus
  • Line 30: Output formats. * selects all possible. options=(p,n,a,s,v,u,t,m,k,g). List letters comma-separated. See section on output formats.
In [ ]:
a,n              ## 30. outputs formats
  • Line 31: Call simple consensus (majority base) on clusters with depth less than mindepth, (allows you to set line 6 as low as 2). Still makes statistical base calls on high coverage sites.
In [ ]:
1                ## 31. call majority base on low depth sites
  • Line 32: Keep trimmed sequences longer than X (default=0=don't keep). This option should only be used for very messy single end GBS or ddRAD data, where a size selection method did not work properly such that many fragments were shorter than the read length and thus the adapters were frequently sequenced. Instead of discarding sequences with adapters, this option trims off adapters and keeps the remaining sequence if it is longer than X.
In [ ]:
0               ## 32. don't keep trimmed sequences 
50              ## 32. keep trimmed longer than 50
  • Line 33: maximum depth of a within-sample cluster. The default (empty) is max(meandepth+2*SD, 500), meaning the higher value of either the mean plus two times the standard deviation of cluster depth, or 500. If instead you want to set an absolute value enter it here. Or enter a ridiculously high value for no maxdepth filtering.
In [ ]:
                  ## 33. default max depth option
50000             ## 33. a set max depth
9999999999        ## 33. no max depth
  • line 34: minimum number of copies of de-replicated reads used for clustering. Default=1. Reads are dereplicated, or collapsed, into a single read that records its number of occurrences before clustering. The default is to not exclude any data. But for exploratory analyses minderep=2 may be useful because this can drastically increase the speed of within-sample clustering. In general I do not recommend discarding data for your final analysis, but for data sets with deep coverage it has little effect on results and greatly improves clustering times.
In [ ]:
2                 ## 34. exclude singletons 
5                 ## 34. exclude reads occurring less than 5 times
  • line 35: Use hierarchical clustering (default=No=0). See hierarchical clustering section. This can be used to speed up across-sample clustering, and also to improve clustering of close relatives when assembling a data set that also includes very divergent relatives.
In [ ]:
1                 ## 35. use hierarchical clustering
  • line 36: Repeat masking. Default=1=use default masking method in vsearch or usearch. If set to 0 then no masking is used. Vsearch uses the 'dust' masking method whereas usearch has a custom method.
In [ ]:
0                 ## 36. turn off repeat masking
  • line 37: Max number of threads per parallel clustering job used by vsearch. Default=6. This has not been optimized yet. It is much faster than usearch even when processors x threads is greater than the available number of cores, though this uses more memory! Until I or someone else finds an optimal setting feel free to change this and see if you get a speed increase or not.
In [ ]:
6                 ## 37. N vsearch threads

Remaining lines:

Group/clade assignments. These are used for "population" formatted output files (e.g., treemix, migrate-n), and for hierarchical clustering. Each group is listed on a separate line, it is OK to leave a line blank. Each line will contain three elements, each separated by spaces or tabs: (1) a group name (e.g., 'A' or '1'), these names are arbitrary, just choose something different for each group. (2) the minimum size of a cluster within this group, any cluster smaller than this will not be clustered across-groups, or not output in the assembled data set. (3) the names of samples/individuals in this group. Sample names can be individually listed comma separated, or selected using a wildcard selector (e.g., *). Example:

In [ ]:
A 4 1A0,1B0,1C0,1D0  
B 4 2*               
C 4 3*          

4. Running pyRAD

pyRAD should either be called from the directory in which it comes (i.e., with its other dependency .py files), or you can create a symlink to call the program from anywhere. For example, the command below would create a link so that you call pyRAD by simply typing 'pyRAD' if you were to put the link in your $PATH.

In [ ]:
ln -s ~/pyrad_v.3.0/pyRAD pyRAD

If the symlink business seems confusing, don't worry. You can just call pyRAD from its directory. For example, to call pyRAD if it had been downloaded to a user’s home/ directory, type the command below, which will print the help menu.

In [ ]:
~/pyrad_v.3.0/pyRAD -h

Or if you created a symlink:

In [ ]:
pyRAD -h

There are six options to the command line interface of pyRAD:

  • -h (help screen)
  • -n (generate new params file)
  • -p (designate params file)
  • -s (choose steps)
  • -d (D-test input file)
  • -D (generate new dtest input file)

The parameter input file (params.txt) should be created using the -n option and then filled in based on the type of analysis you plan to do. If the params file is properly filled out then the entire analysis – converting raw RADseq data into a assembled de novo loci – can be done by simply entering:

In [ ]:
pyRAD -p params.txt

This will perform all steps (1-7) of the analysis, as described below. If, however, you wish to perform only one, or a few step at a time, then you can select these using the -s (steps) option. For example, to select only the first step:

In [ ]:
pyRAD -p params.txt -s 1

And to run steps 2-7 you could enter:

In [ ]:
pyRAD -p params.txt -s 234567 

After finishing one assembly, you could rerun step 7 with the same or a different params.txt file indicating different step 7 filtering options.

In [ ]:
pyRAD -p params.txt -s 7

When learning new software, or applying it to a new data set, it can take time to find the appropriate parameter settings and options. This can be espeicially frustrating with computationally intensive analyses, where one might have to wait many hours (or days) before finding they made a simple mistake. The -s option is an attempt to reduce this annoyance by breaking the analysis into seven discrete jobs, each of which is performed sequentially. In this way, you know whether step 1 worked before moving to step 2. And if step 2 fails, you retain the results of step 1, and can start again from where the error arose. I suggest running the example data sets (tutorials) to make yourself familiar with the steps of an analysis and what the output files and statistics look like before beginning your own analysis.

The seven steps described

Step 1: de-multiplexing. This step uses information from the barcodes file to separate sequences from your raw fastq files into a separate file for each sample. These are placed in a new directory within your working directory called “fastq/”. File names are not important for single end data. However, for paired-end reads it is necessary that the raw data file names follow a specific format: The first read files must contain "_R1_" in the name, and the second read files must be identical to the first read files but with "_R2_" in place of "_R1_". Here is an example pair of input files:

In [ ]:
yourfilename_R1_001.fastq  
yourfilename_R2_001.fastq

A reminder, if your data are already demultiplexed this step can be skipped. You will not need to enter a location for your raw data in the params file, but instead you must enter the location of your demultiplexed files into the proper location on line 18 of the params file.

Step 2: filtering. This step uses the Phred quality score recorded in the FASTQ data files to filter low quality base calls. Sites with a score below a set value are changed into “N”s, and reads with more than the number of allowed “N”s are discarded. Files are written to the “edits/” directory with the suffix “.edit”. It also implements a number of optional filters.

Step 3: within-sample clustering. This step first dereplicates the filtered sequences from step 2, recording the number of times each unique read is observed. These are then clustered using VSEARCH, recording if they match within a set sequence similarity. Sequences that clustered together are then aligned and written to a new file in the “clust.xx/” directory with the ending “.clustS.gz”.

Step 4: error-rate and heterozygosity estimates. This step uses the ML equation of Lynch (20XX) to jointly estimate error rate and heterozygosity from the base counts in each site across all clusters. Results are written to the Pi_estimate.txt file in the stats/ directory.

Step 5: create consensus sequences. Using the mean error rate and heterozygosity estimated in step 4, this step creates consensus sequences for each cluster. Those which have less than the minimum coverage, more than the maximum number of undetermined sites, or more than the maximum number of heterozygous sites, or more than the allowed number of alleles, are discarded. In diploid data if two alleles are present the phase of heterozygous sites are retained in the consensus sequences.

Step 6. Consensus sequences are clustered across samples using the same settings as in step 3. If heterozygous, one allele is randomly sampled and used in this step, although both alleles are retained in the final data set.

Step 7. Alignment, filtering for paralogs, and output of human readable fasta-like file (.loci). A large number of alternative formats are also available (see output formats). Final assembly statistics are written to the stats/ directory in the .stats file. This step is relatively fast, and can be repeated with different values for options 12,13,14,16,17,etc. to create different data sets that are optimized in coverage for different samples.


5. RADseq/RRL datatypes

The following reduced representation library data types are supported by pyRAD, and I list beside each my general experience with analyzing empirical data sets. My quality judgements are based on the frequency with which users have generated data sets that include many short fragment reads with adapter sequences, or which require extra types of analyses (e.g., reverse complement matching for overlapping or paired-end GBS.) This may change as library preparations of GBS and ddRAD libraries are improving. I describe each data type and how pyRAD treats them differently below.

  RAD             (very good)
  ddRAD           (usually good)
  GBS             (sometimes difficult/messy)
  nextRAD         (very good)
  paired ddRAD    (sometimes difficult/messy)
  paired GBS      (very difficult/messy)
  other types     (... I haven't tried yet)

Alt text

In general single-end Illumina data that employ a good size selection window will produce very good and usable data by any of the library prep methods above. Problems tend to arise when the size selection is too small (or non-existent), and especially if this coincides with paired-end sequencing.

Poor size selection has little effect on traditional RAD or on single end ddRAD, as you can see above, except to include adapter sequences in the reads which can be filtered out. It has a more significant effect on paired ddRAD, and a very significant effect on single or paired GBS data. This is because short fragments will be sequenced from both ends of these types of data, leading to sequence overlap.

In paired ddRAD the first and second reads come in pairs, and pyRAD can test for overlap of these reads using the merge overlap option. This will remove merged reads and write them to a separate file. These can be analyzed separately from the non-overlapping reads if they make up a large proportion of your data set. Recommendations for this type of analysis are made in the paired ddRAD tutorial. I only recommend trying to rescue merged reads if they comprise a large proportion of your data.

For GBS data the problem is more difficult. In single end data the Illumina adapter can ligate to either end of fragments, meaning short single-end fragments can have overlap leading to duplication in the data set. pyRAD can correct for this by performing reverse-complement clustering. This is much slower than normal clustering. It detects overlap of the reads and groups them into partially or fully overlapping clusters that are then used to make consensus base calls to create merged contigs. Recommendations for this type of analysis are made in the GBS tutorial.

For paired GBS data the merge overlap option can merge overlapping reads, however the first and second reads are completely interchangeable, so reverse complement clustering is also required in this case.

6. Example Tutorials

See Example Tutorials at this page

7. Output data formats

sequence files

By default pyRAD will return the final data set and excluded loci in the easily readable .loci format.

  • (.loci file): human-readable individual loci file
  • (.excluded_loci): loci that did not pass step 7 filtering

These data can be transformed into a large number of additional formats, each listed by a single letter in the params file:

  • a: (.alleles): .loci file with two alleles separated (diploid analyses only).
  • p: (.phy): non-interleaved phylip file used by raxml.
  • n: (.nex): interleaved nexus file used by BEAST.

snp files

  • s: (.snps): all SNPs in the assembled data set, see example.
  • v: (.vcf): variant call format of all SNPs.

sampled snp files

  • u: (.usnps): unlinked SNPs, a random sample of one SNP per locus.
  • k: (.str): Structure format version of unlinked SNPs file.
  • g: (.geno): Admixture format version of unlinked SNPs file.

population seq or snp files

And two more complicated output formats that allow pooling individuals into populations/clades using the assignments in the params file. See instructions in the Example RAD tutorial.

  • t: (.treemix): TreeMix formatted unlinked SNPs file w/ pooled samples and filtered by pop coverage.
  • m: (.migrate): migrate-n formatted loci w/ pooled samples and filtered by pop coverage.

Examples are best viewed in the RAD tutorial above, but are also described below.


*.loci

This is a custom format that is easy to read, showing each individual locus with variable sites indicated. Custom scripts can easily parse this file for loci containing certain amounts of taxon coverage or variable sites. Also it is the most easily readable file for assuring that your analyses are working properly. a (-) indicates a variable site, a (*) indicates the site is phylogenetically informative. Example:

For paired-end data the two linked loci are shown separated by a 'xxxx' separator.

*.phy

This is a phylip formatted data file which contains all of the loci from the .loci file concatenated into a supermatrix, with missing data for any sample filled in with N's. This format is used in RAxML among other phylogenetic programs.

*.nex

This is a nexus formatted data file which contains all of the loci from the .loci file concatenated into a supermatrix, but printed in an interleaved format, with missing data for any sample filled in with N's, and with data information appended to the beginning. This format is used in BEAST among other phylogenetic programs.

*.alleles

This is the same as the .loci format but instead of ambiguity codes for the consensus sequences the two alleles are printed out for each sample at each locus. Output of polyploid (>2) alleles is not supported.

*.snps

This format lists only the variable sites in each locus with a space between SNPs from different loci. An underscore represents an invariable locus. Loci are in the same order as in the .loci file. Paired loci are treated as a single locus, meaning SNPs from the two reads are not separated in this file (they're linked).

*.unlinked_snps

Each column contains one SNP sampled from one locus. If multiple SNPs in a locus, SNP sites that contain the least missing data across taxa are sampled, if equal amounts of missing data, they are randomly sampled.

*.excluded_loci

The excluded loci file is in the same format as the .loci file but shows the loci that were filtered out in step 7. The filter is listed below each locus. %P indicates the locus was filtered as a paralog because the samples shared a heterozygous site over more than the allowed number of samples. %D indicates the locus had more than one sequence from the same individual. %S indicates there were more SNPs detected in the locus than the set maximum allowed number. %I means there were more indels in the locus than the max allowed number.

In [ ]:
>A     CCRACACAGAGGGCCAATAGACACGTAGCTCGAGATCTCACCGCCCCCGCGATCT
>B     CCRACACAGAGGGCCAATAGACACGTAGCTCGAGATCTCACCGCCCCCGCGATCT
>C     CCRACACAGAGGGCCAATAGACACGTAGCTCGAGATCTCACCGCCCCCGCGATCT
>D     CCRACACAGAGGGCCAATAGACACGTAGCTCGAGATCTCACCGCMCCCGCGATCT
>E     CCRACACAGAGGGCCAATAGACACGTAGCTCGAGATCTCAGCACCCCCGCGATCT
>F     CCAACACAGAGGGCCAATAGACACGTAGCTCGAGATCTCAGCAACCCCGCGATCT
//%P     *                                       *--          |
>A     CGTACAATAGGATGCACCGTTCGTAAAGTTCTTGACGGTGGACCGAGAACTAGGT
>B     CGTACAATAGGATGCACCGTTCGTAAAGTTCTTGACGGTGGACCGAGAACTAGGT
>C     CGTACAATAGGATGCACCGTTCGTAAAGTTCTTGACGGTGGACCGAGAACTAGGT
>D     CGTACAATAGGATGCACCGTTCGTCAAGTTGTTGACGGTGGACCGAGAACTAGGT
>D     CGTACAATAGGATGCACCGTTCGTCAAGTTGTTGACGGTGGACCGAGAACTAGGT
//%D                           -     -                        |
>A     CGTACAATAGGATGCACCGTTCGTAAAGTTCTTGACGGTGGACCGAGAACTAGGT
>B     CGTACAATAGGATGCACCGTTCGTAAAGTTCTTGACGGTGGGACGGACGGACGGA
>C     CGTACAATAGGATGCACCGTTCGTAAAGTTCTTGACGGTGGGACGGACGGACGGA
>D     CGTACAATAGGATGCACCGTTCGTCAAGTTGTTGACGGTGGACCGAGAACTAGGT
//%S                           -     -          **  *******  *|
...

8. Hierarchical clustering

Clustering across samples (step 6 of the analysis) can be quite fast for data sets with few sampled individuals, but for large studies (many hundreds of individuals) this step can sometimes take very long (>1 week), depending on the total number of unique sequences. Hierarchical clustering splits this job up into multiple smaller jobs that each run on a separate processing core, allowing clustering of large data sets many times faster.

(I no longer recommend excluding data through hierarchical clustering since clustering times are much shorter with vsearch, perhaps only for very very large data sets). The most significant speed increase comes from excluding loci at the first step of clustering that have significant missing data within a subgroup. By excluding singletons, or loci with little shared data early in the clustering process time is not spent comparing them to all of the other more distantly related samples. This is particularly relevant if you plan to only use loci in your final data set that recover data shared across many individuals.

However, without excluding data hierarchical clustering can still be useful, particularly for studies of highly divergent organisms. Because the input order of sequences can effect their clustering, the use of hierarchical clustering can be used to better cluster very highly divergent sequences by making close relatives cluster to each other first, and then cluster to more distant groups.

The figure above illustrates the general concept. Individuals are grouped a priori into a small number of clades or populations, in this case a, b, and c, through their assignment listed in the params file. The assignment of individuals does not have to be perfectly correct in terms of knowing their true phylogeny, rather this is an approximation or heuristic, where those grouped together are expected to share more data than those which are not grouped together.

In this case each clade has four sampled taxa. To group our samples into these three clades we assign them to groups in the bottom lines of the params file.

Each group is listed on a separate line, it is OK to leave a line blank. Each line will contain three elements, each separated by spaces or tabs: (1) a group name (e.g., 'A' or '1'), these names are arbitrary, just choose something different for each group. (2) the minimum size of a cluster within this group, any cluster smaller than this will not be clustered across-groups, 1=do not exclude any data. (3) the names of samples/individuals in this group. Sample names can be individually listed comma separated, or selected using a wildcard selector (e.g., *).

The sample names that this is matching to are the prefix names to the ".consens.gz" files located in the clust directory. For example, imagine we have the samples below.

In [ ]:
sample1.consens.gz  
sample2.consens.gz  
sample3.consens.gz  
sample4.consens.gz  
sample5.consens.gz  
sample6.consens.gz    
sample7.consens.gz  
sample8.consens.gz  
sample9.consens.gz   

Imagine also that we know samples 1-4 are individuals of species A, samples 5-8 are individuals of species B, and sample9 is from a more distant outgroup species. We could cluster them using the following settings:

In [ ]:
A 4 sample1,sample2,sample3,sample4  
B 2 sample5,sample6,sample7,sample8  
C 1 sample9  

The minimum cluster size option tells pyRAD to only cluster loci across clades that have at least N matches within a clade. In this way loci that are singletons, or shared by very few taxa are not clustered across clades. In the case of the outgroup sample C, we entered 1 for the minimum size, and thus all reads from this sample are clustered to the other two clades. In contrast, we chose 4 for clade A, and 2 for clade B. This means only loci that have data for all four samples in clade A will be clustered with clades B and C, and only loci with data from at least 2 individuals in clade B will be clustered with clades A and C.

Following this logic, you can assign individuals to clades in a way that speeds clustering but also maximizes the amount of data retained in your data set. If a clade has very few individuals in it, or those individuals have less data overall than other samples then you may want to set a low minimum cluster size to discard as little data as possible.

There is an example tutorial implementing hierarchical clustering and which provides more detail in the Examples section above.

9. FAQs

coming soon...

Using population assignments in step 7

For the migrate-n and treemix outputs the minimum size option in the populations assignments filters out loci that have data for fewer than this number of individuals for that population. Only loci with data for at least the minium number of individuals across all populations are included in the data set.