The double digest RADseq (ddRAD) library preparation method was developed and described by Peterson et al. (2012).
A ddRAD library is prepared by cutting genomic DNA with two different restriction enzymes and selecting the intervening fragments that fall within a certain size window. Each fragment carries the overhang from the respective cut site on either side. One end is ligated to a barcode plus the Illumina adapter, and the other end is ligated to only the reverse Illumina adapter. Depending on the size-selection window, the far-end adapter and cut site may or may not appear in the sequenced portion of the fragment (the read).
For this tutorial I simulated ddRAD data on a 12-taxon species tree, shown below. You can download the data using the script below and assemble them by following along with the instructions.
%%bash
## download the data
wget -q http://www.dereneaton.com/downloads/simddrads.zip
## unzip the data
unzip simddrads.zip
Archive:  simddrads.zip
  inflating: simddrads.barcodes
  inflating: simddrads_R1_.fastq.gz
These data were simulated with a size selection window between 50-250 bp, meaning that many of the reads will contain the far-end adapter sequence within the 100 bp sequenced read. I included this because it is a fairly common occurrence in empirical ddRAD data sets. I will show how to filter out these adapters in step 2 of the pyrad analysis.
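Before assembling anything, you can get a rough sense of how often this happens by searching reads for the second cut-site overhang followed by the start of the adapter. The two toy reads below are illustrative (not taken from the simulated file), and the adapter prefix AGATCGG is the common Illumina read-through sequence; check which adapter your kit uses.

```shell
## toy reads: the second comes from a short fragment, so the far-end cut
## site (AATT) runs directly into the adapter inside the sequenced read
reads='TGCAGACTCAATTGACAGGATCTTGAGCGGGTGCGTCAGTAACC
TGCAGACTCGACAGGATCTTGAGCGGAATTAGATCGGAAGAGCA'

## count reads containing cut-site overhang + adapter start
printf '%s\n' "$reads" | grep -c 'AATTAGATCGG'
```

On the real data you would run the same `grep -c` on the decompressed fastq to estimate the fraction of adapter-contaminated reads.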
We first create an empty params.txt input file for the pyRAD analysis. The following command will create a template which we will fill in with all relevant parameter settings for the analysis.
%%bash
pyrad -n
new params.txt file created
Take a look at the default options. Each line designates a parameter; a "##" symbol marks the start of a comment describing the parameter, and the affected step(s) for each parameter are shown in parentheses. The first 14 parameters are required; numbers 15-37 are optional.
%%bash
cat params.txt
==** parameter inputs for pyRAD version 3.0.4  **======================== affected step ==
./                        ## 1. Working directory                                 (all)
./*.fastq.gz              ## 2. Loc. of non-demultiplexed files (if not line 16)  (s1)
./*.barcodes              ## 3. Loc. of barcode file (if not line 16)             (s1)
vsearch                   ## 4. command (or path) to call vsearch (or usearch)    (s3,s6)
muscle                    ## 5. command (or path) to call muscle                  (s3,s7)
TGCAG                     ## 6. Restriction overhang (e.g., C|TGCAG -> TGCAG)     (s1,s2)
2                         ## 7. N processors (parallel)                           (all)
6                         ## 8. Mindepth: min coverage for a cluster              (s4,s5)
4                         ## 9. NQual: max # sites with qual < 20 (or see line 20)(s2)
.88                       ## 10. Wclust: clustering threshold as a decimal        (s3,s6)
rad                       ## 11. Datatype: rad,gbs,ddrad,pairgbs,pairddrad,merged (all)
4                         ## 12. MinCov: min samples in a final locus             (s7)
3                         ## 13. MaxSH: max inds with shared hetero site          (s7)
c88d6m4p3                 ## 14. Prefix name for final output (no spaces)         (s7)
==== optional params below this line ===================================  affected step ==
                       ## 15.opt.: select subset (prefix* only selector)            (s2-s7)
                       ## 16.opt.: add-on (outgroup) taxa (list or prefix*)         (s6,s7)
                       ## 17.opt.: exclude taxa (list or prefix*)                   (s7)
                       ## 18.opt.: loc. of de-multiplexed data                      (s2)
                       ## 19.opt.: maxM: N mismatches in barcodes (def= 1)          (s1)
                       ## 20.opt.: phred Qscore offset (def= 33)                    (s2)
                       ## 21.opt.: filter: def=0=NQual 1=NQual+adapters. 2=strict   (s2)
                       ## 22.opt.: a priori E,H (def= 0.001,0.01, if not estimated) (s5)
                       ## 23.opt.: maxN: max Ns in a cons seq (def=5)               (s5)
                       ## 24.opt.: maxH: max heterozyg. sites in cons seq (def=5)   (s5)
                       ## 25.opt.: ploidy: max alleles in cons seq (def=2;see docs) (s4,s5)
                       ## 26.opt.: maxSNPs: (def=100). Paired (def=100,100)         (s7)
                       ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)
                       ## 28.opt.: random number seed (def. 112233)                 (s3,s6,s7)
                       ## 29.opt.: trim overhang left,right on final loci, def(0,0) (s7)
                       ## 30.opt.: output formats: p,n,a,s,v,u,t,m,k,g,* (see docs) (s7)
                       ## 31.opt.: maj. base call at depth>x<mindepth (def.x=mindepth) (s5)
                       ## 32.opt.: keep trimmed reads (def=0). Enter min length.    (s2)
                       ## 33.opt.: max stack size (int), def= max(500,mean+2*SD)    (s3)
                       ## 34.opt.: minDerep: exclude dereps with <= N copies, def=1 (s3)
                       ## 35.opt.: use hierarchical clustering (def.=0, 1=yes)      (s6)
                       ## 36.opt.: repeat masking (def.=1='dust' method, 0=no)      (s3,s6)
                       ## 37.opt.: vsearch max threads per job (def.=6; see docs)   (s3,s6)
==== optional: list group/clade assignments below this line (see docs) ==================
I will use the script below to substitute new values, but you can simply use any text editor to make the changes. For this analysis I made the following changes from the defaults:
6. set the two restriction enzymes used to generate the ddRAD data
10. lowered clustering threshold to .85
11. set datatype to ddrad
14. changed the output name prefix
19. mismatches for demultiplexing set to 0, meaning barcodes must match exactly.
24. raised maxH (note: lower values are more stringent for filtering paralogs).
30. added additional output formats (e.g., nexus,SNPs,STRUCTURE)
%%bash
sed -i '/## 6. /c\TGCAG,AATT ## 6. cutsites... ' ./params.txt
sed -i '/## 10. /c\.85 ## 10. lowered clust thresh' ./params.txt
sed -i '/## 11. /c\ddrad ## 11. datatype... ' ./params.txt
sed -i '/## 14. /c\c85d6m4p3 ## 14. prefix name ' ./params.txt
sed -i '/## 19./c\0 ## 19. errbarcode... ' ./params.txt
sed -i '/## 24./c\10 ## 24. maxH... ' ./params.txt
sed -i '/## 30./c\* ## 30. outformats... ' ./params.txt
Let's take a look at the edited params.txt file
%%bash
cat params.txt
==** parameter inputs for pyRAD version 3.0.4  **======================== affected step ==
./                        ## 1. Working directory                                 (all)
./*.fastq.gz              ## 2. Loc. of non-demultiplexed files (if not line 16)  (s1)
./*.barcodes              ## 3. Loc. of barcode file (if not line 16)             (s1)
vsearch                   ## 4. command (or path) to call vsearch (or usearch)    (s3,s6)
muscle                    ## 5. command (or path) to call muscle                  (s3,s7)
TGCAG,AATT                ## 6. cutsites...
2                         ## 7. N processors (parallel)                           (all)
6                         ## 8. Mindepth: min coverage for a cluster              (s4,s5)
4                         ## 9. NQual: max # sites with qual < 20 (or see line 20)(s2)
.85                       ## 10. lowered clust thresh
ddrad                     ## 11. datatype...
4                         ## 12. MinCov: min samples in a final locus             (s7)
3                         ## 13. MaxSH: max inds with shared hetero site          (s7)
c85d6m4p3                 ## 14. prefix name
==== optional params below this line ===================================  affected step ==
                       ## 15.opt.: select subset (prefix* only selector)            (s2-s7)
                       ## 16.opt.: add-on (outgroup) taxa (list or prefix*)         (s6,s7)
                       ## 17.opt.: exclude taxa (list or prefix*)                   (s7)
                       ## 18.opt.: loc. of de-multiplexed data                      (s2)
0                      ## 19. errbarcode...
                       ## 20.opt.: phred Qscore offset (def= 33)                    (s2)
                       ## 21.opt.: filter: def=0=NQual 1=NQual+adapters. 2=strict   (s2)
                       ## 22.opt.: a priori E,H (def= 0.001,0.01, if not estimated) (s5)
                       ## 23.opt.: maxN: max Ns in a cons seq (def=5)               (s5)
10                     ## 24. maxH...
                       ## 25.opt.: ploidy: max alleles in cons seq (def=2;see docs) (s4,s5)
                       ## 26.opt.: maxSNPs: (def=100). Paired (def=100,100)         (s7)
                       ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)
                       ## 28.opt.: random number seed (def. 112233)                 (s3,s6,s7)
                       ## 29.opt.: trim overhang left,right on final loci, def(0,0) (s7)
*                      ## 30. outformats...
                       ## 31.opt.: maj. base call at depth>x<mindepth (def.x=mindepth) (s5)
                       ## 32.opt.: keep trimmed reads (def=0). Enter min length.    (s2)
                       ## 33.opt.: max stack size (int), def= max(500,mean+2*SD)    (s3)
                       ## 34.opt.: minDerep: exclude dereps with <= N copies, def=1 (s3)
                       ## 35.opt.: use hierarchical clustering (def.=0, 1=yes)      (s6)
                       ## 36.opt.: repeat masking (def.=1='dust' method, 0=no)      (s3,s6)
                       ## 37.opt.: vsearch max threads per job (def.=6; see docs)   (s3,s6)
==== optional: list group/clade assignments below this line (see docs) ==================
Your data will likely come to you non-demultiplexed (i.e., not sorted by the individual the reads belong to). You can demultiplex the reads according to their barcodes using pyRAD or separate software. Here pyRAD uses information from the barcodes file to demultiplex the reads.
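Conceptually, demultiplexing is just a lookup of each read's leading bases against the barcode list. The sketch below uses exact matching (the behavior we get with maxM set to 0 on line 19) and only two of the twelve barcodes, purely for illustration:

```shell
## toy demultiplexer: report the sample whose barcode starts the read,
## or "nomatch" if none does (only two barcodes shown for brevity)
barcode_of() {
  case "$1" in
    CATCAT*) echo 1A0 ;;
    GGGAAT*) echo 1B0 ;;
    *)       echo nomatch ;;
  esac
}

barcode_of 'CATCATTGCAGGCGAAGG'   # read starting with 1A0's barcode
barcode_of 'TTTTTTTGCAGGCGAAGG'   # unknown barcode
```

pyRAD does the same thing across the whole fastq, and additionally tolerates up to maxM mismatches when that parameter is greater than 0.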
%%bash
cat simddrads.barcodes
1A0     CATCAT
1B0     GGGAAT
1C0     AAGGTG
1D0     GTTTTA
2E0     GTTAGT
2F0     TATAGT
2G0     TGTGAA
2H0     TGTGGT
3I0     TATAAG
3J0     GGAAAG
3K0     GTGTTT
3L0     GGAGGT
%%bash
pyrad -p params.txt -s 1
------------------------------------------------------------
 pyRAD : RADseq for phylogenetics & introgression analyses
------------------------------------------------------------
step 1: sorting reads by barcode
 .
Now we can look at the stats output for this below:
%%bash
cat stats/s1.sorting.txt
file                   Nreads    cut_found   bar_matched
simddrads_.fastq.gz    240000    240000      240000

sample    true_bar   obs_bars   N_obs
1C0       AAGGTG     AAGGTG     20000
1A0       CATCAT     CATCAT     20000
3J0       GGAAAG     GGAAAG     20000
3L0       GGAGGT     GGAGGT     20000
1B0       GGGAAT     GGGAAT     20000
3K0       GTGTTT     GTGTTT     20000
2E0       GTTAGT     GTTAGT     20000
1D0       GTTTTA     GTTTTA     20000
3I0       TATAAG     TATAAG     20000
2F0       TATAGT     TATAGT     20000
2G0       TGTGAA     TGTGAA     20000
2H0       TGTGGT     TGTGGT     20000
nomatch   _          0
The demultiplexed reads are written to a new file for each individual, placed in a directory called fastq/ created within your working directory.
%%bash
ls fastq/
1A0_R1.fq.gz 1B0_R1.fq.gz 1C0_R1.fq.gz 1D0_R1.fq.gz 2E0_R1.fq.gz 2F0_R1.fq.gz 2G0_R1.fq.gz 2H0_R1.fq.gz 3I0_R1.fq.gz 3J0_R1.fq.gz 3K0_R1.fq.gz 3L0_R1.fq.gz
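A FASTQ file stores four lines per read, so reads = lines / 4. This makes an easy sanity check that each sample received its expected number of reads. The demo file below is built on the spot so the snippet is self-contained; on the real data you would run the same count on fastq/1A0_R1.fq.gz and expect 20000.

```shell
## build a tiny two-read gzipped fastq to count
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n' | gzip > demo.fq.gz

## reads = total lines / 4
echo $(( $(gunzip -c demo.fq.gz | wc -l) / 4 ))
```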
An individual file will look like the example below:
%%bash
## show the first 12 lines, cut to 80 characters to print clearly here
## (gunzip -c decompresses to stdout; less behaves like cat in a pipe)
gunzip -c fastq/1A0_R1.fq.gz | head -n 12 | cut -c 1-80
@lane1_fakedata0_R1_0 1:N:0:
TGCAGGCGAAGGTATATTGAGCGTCTTAAGCCCCCGTGGTACTTTCTGCTCTTAGCCCATTATCTTGTCTACGAATTCAT
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_fakedata0_R1_1 1:N:0:
TGCAGGCGAAGGTATATTGAGCGTCTTAAGCCCCCGTGGTACTTTCTGCTCTTAGCCCATTATCTTGTCTACGAATTCAT
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_fakedata0_R1_2 1:N:0:
TGCAGGCGAAGGTATATTGAGCGTCTTAAGCCCCCGTGGTACTTTCTGCTCTTAGCCCATTATCTTGTCTACGAATTCAT
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
If your data were previously demultiplexed you need the following things before step 2:
your sorted file names should be formatted similarly to those above, but with your sample names substituted for 1A0, 1A1, etc.
the files can be gzipped (.gz) or not (.fq or .fastq).
the barcodes should already be removed (i.e., not present on the left side of reads).
the restriction site should not be removed, but if it is, enter a '@' symbol before the location of your sorted data.
enter the location of your sorted data on line 18 of the params file.
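Editing line 18 works the same way as the earlier sed substitutions. The snippet below writes a one-line stand-in params file (demo_params.txt) so it runs on its own; the path ./mysorted/ is hypothetical. On the real params.txt you would drop the first line and point sed at ./params.txt instead. (GNU sed syntax; on macOS/BSD sed use `-i ''`.)

```shell
## stand-in for the real params.txt line 18
echo '                ## 18.opt.: loc. of de-multiplexed data (s2)' > demo_params.txt

## point line 18 at already-sorted data (prepend '@' to the path if the
## restriction site was also removed from your reads)
sed -i '/## 18./c\./mysorted/*.fq.gz ## 18. loc. of sorted data' demo_params.txt

cat demo_params.txt
```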
Next we apply the quality filtering.
Here we set the quality filter (param 21) to 1, meaning that it will check for the presence of Illumina adapters and trim them off. We also set the minimum length of kept reads after trimming (param 32) to 35.
In addition to filtering for adapters, this step also filters based on quality scores. Low quality sites are converted to Ns, and any locus with more than X number of Ns is discarded, where X is the number set in param 9.
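The N-masking logic can be sketched in a few lines of awk. The phred+33 offset and Q20 cutoff below match the defaults on params lines 20 and 9; the toy sequence and quality string are made up for illustration:

```shell
## mask bases whose quality score is below Q20 with N
## (phred+33 encoding: score = ASCII code - 33; 'I' = Q40, '#' = Q2)
seq='ACGTACGT'; qual='IIII##II'

awk -v s="$seq" -v q="$qual" 'BEGIN {
  for (c = 33; c < 127; c++) ord[sprintf("%c", c)] = c   # ASCII lookup table
  for (i = 1; i <= length(s); i++)
    out = out ((ord[substr(q, i, 1)] - 33 >= 20) ? substr(s, i, 1) : "N")
  print out
}'
```

A read whose masked copy ends up with more Ns than the NQual limit (param 9) would then be discarded.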
%%bash
sed -i '/## 21./c\1 ## 21. filter setting ' ./params.txt
sed -i '/## 32./c\35 ## 32. min trim length' ./params.txt
%%bash
pyrad -p params.txt -s 2
------------------------------------------------------------
 pyRAD : RADseq for phylogenetics & introgression analyses
------------------------------------------------------------
step 2: editing raw reads
............
Statistics for the number of reads that passed filtering can be found in the stats/ directory. A few reads were filtered out erroneously because the common cutter overhang (AATT) happened to sit adjacent to sequence resembling the start of the adapter. In this case 3 loci (~60 read copies) per sample were excluded. About 200 loci (20%) had the adapter detected but were still retained after trimming.
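Those per-locus figures follow from the per-read counts with a little arithmetic (assuming ~20 read copies per locus, as in these simulated data):

```shell
## reads trimmed for adapters (from sample 1A0's stats) / depth per locus
## ~= number of loci carrying the adapter
echo $(( 4136 / 20 ))
```

which matches the "about 200 loci" quoted above; the same division on the 60 reads lost per sample gives the 3 excluded loci.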
%%bash
cat stats/s2.rawedit.txt
sample   Nreads   passed   passed.w.trim   passed.total
1A0      20000    15804    4136            19940
1B0      20000    15802    4139            19941
1C0      20000    15820    4120            19940
1D0      20000    15798    4142            19940
2E0      20000    15799    4140            19939
2F0      20000    15804    4137            19941
2G0      20000    15800    4140            19940
2H0      20000    15810    4131            19941
3I0      20000    15799    4141            19940
3J0      20000    15801    4139            19940
3K0      20000    15780    4160            19940
3L0      20000    15821    4119            19940

    Nreads = total number of reads for a sample
    passed = retained reads that passed quality filtering at full length
    passed.w.trim = retained reads that were trimmed due to detection of adapters
    passed.total = total kept reads of sufficient length
    note: you can set the option in params file to include trimmed reads of xx length.
The filtered data files are converted to fasta format and written to a directory called edits/. I show this below:
%%bash
ls edits/
1A0.edit 1B0.edit 1C0.edit 1D0.edit 2E0.edit 2F0.edit 2G0.edit 2H0.edit 3I0.edit 3J0.edit 3K0.edit 3L0.edit
The first few edited reads from one sample look like this:
%%bash
head -n 8 edits/1A0.edit
>1A0_0_r1
TGCAGACTCAATTGACAGGATCTTGAGCGGGTGCGTCAGTAACCGAAGACCCCTGAGGCTACTTGAGGGCTGATCAACATAGCCAATTTAAGTG
>1A0_1_r1
TGCAGACTCAATTGACAGGATCTTGAGCGGGTGCGTCAGTAACCGAAGACGCCTGAGGCTACTTGAGGGCTGATCAACATAGCCAATTTAAGTG
>1A0_2_r1
TGCAGACTCAATTGACAGGATCTTGAGCGGGTGCGTCAGTAACCGAAGACCCCTGAGGCTACTTGAGGGCTGATCAACATAGCCAATTTAAGTG
>1A0_3_r1
TGCAGACTCAATTGACAGGATCTTGAGCGGGTGCGTCAGTAACCGAAGACCCCTGAGGCTACTTGAGGGCTGATCAACATAGCCAATTTAAGTG
Steps 3-7 proceed the same as in the original 'rad' datatype, with the only difference being that in step 7 the second cut site is trimmed from the end of loci when it is present.
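That end-trimming can be mimicked with an anchored sed substitution. The locus sequence below is a made-up example; anchoring the pattern with `$` ensures only a trailing cut-site remnant is removed, never an internal AATT:

```shell
## strip a trailing second-cut-site remnant (AATT) from a locus;
## the internal AATT near the start is untouched because of the $ anchor
echo 'TGCAGACTCAATTGACAGGATCTTGAGCGGAATT' | sed 's/AATT$//'
```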
%%bash
pyrad -p params.txt -s 3
------------------------------------------------------------
 pyRAD : RADseq for phylogenetics & introgression analyses
------------------------------------------------------------
de-replicating files for clustering...
step 3: within-sample clustering of 12 samples at '.85' similarity.
Running 2 parallel jobs with up to 6 threads per job.
If needed, adjust to avoid CPU and MEM limits
sample 3L0 finished, 997 loci
sample 1C0 finished, 998 loci
sample 2H0 finished, 999 loci
sample 1B0 finished, 999 loci
sample 2F0 finished, 998 loci
sample 1A0 finished, 1000 loci
sample 3J0 finished, 997 loci
sample 2G0 finished, 997 loci
sample 1D0 finished, 997 loci
sample 3I0 finished, 997 loci
sample 2E0 finished, 999 loci
sample 3K0 finished, 997 loci
%%bash
head -n 23 stats/s3.clusters.txt
taxa   total   dpt.me   dpt.sd   d>5.tot   d>5.me   d>5.sd   badpairs
1A0    1000    19.94    1.024    997       19.996   0.077    0
1B0    999     19.961   0.85     997       19.999   0.032    0
1C0    998     19.98    0.602    997       19.999   0.032    0
1D0    997     20.0     0.0      997       20.0     0.0      0
2E0    999     19.959   0.851    997       19.997   0.055    0
2F0    998     19.981   0.601    997       20.0     0.0      0
2G0    997     20.0     0.0      997       20.0     0.0      0
2H0    999     19.961   0.85     997       19.999   0.032    0
3I0    997     20.0     0.0      997       20.0     0.0      0
3J0    997     20.0     0.0      997       20.0     0.0      0
3K0    997     20.0     0.0      997       20.0     0.0      0
3L0    997     20.0     0.0      997       20.0     0.0      0

## total = total number of clusters, including singletons
## dpt.me = mean depth of clusters
## dpt.sd = standard deviation of cluster depth
## >N.tot = number of clusters with depth greater than N
## >N.me = mean depth of clusters with depth greater than N
## >N.sd = standard deviation of cluster depth for clusters with depth greater than N
## badpairs = mismatched 1st & 2nd reads (only for paired ddRAD data)
We next make consensus base calls for each cluster within each individual. First we estimate the error rate and heterozygosity within each sample:
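At its simplest, a consensus call at one site is a majority vote across the reads in a cluster, which the pipeline below sketches. The real step 5 is more sophisticated: it uses a binomial model with the estimated error rate E and heterozygosity H to distinguish true heterozygous sites from sequencing errors, which this toy vote does not implement.

```shell
## majority base across six reads at one site (toy column of bases):
## count each base, sort by count, report the most frequent
printf '%s\n' A A A C A A | sort | uniq -c | sort -rn | head -n 1 | awk '{print $2}'
```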
%%bash
pyrad -p params.txt -s 4
------------------------------------------------------------
 pyRAD : RADseq for phylogenetics & introgression analyses
------------------------------------------------------------
step 4: estimating error rate and heterozygosity
............
%%bash
pyrad -p params.txt -s 5
------------------------------------------------------------
 pyRAD : RADseq for phylogenetics & introgression analyses
------------------------------------------------------------
step 5: creating consensus seqs for 12 samples, using H=0.00143 E=0.00050
............
%%bash
cat stats/s5.consens.txt
taxon   nloci   f1loci   f2loci   nsites   npoly   poly
1C0     998     997      997      79620    121     0.0015197
2F0     998     997      996      79510    126     0.0015847
1A0     1000    997      997      79592    112     0.0014072
2E0     999     997      997      79594    117     0.00147
1B0     999     997      997      79594    115     0.0014448
3K0     997     997      997      79571    124     0.0015584
2G0     997     997      997      79592    123     0.0015454
3L0     997     997      997      79611    134     0.0016832
2H0     999     997      997      79601    126     0.0015829
3I0     997     997      997      79594    93      0.0011684
1D0     997     997      997      79592    134     0.0016836
3J0     997     997      997      79591    127     0.0015957

## nloci = number of loci
## f1loci = number of loci with >N depth coverage
## f2loci = number of loci with >N depth and passed paralog filter
## nsites = number of sites across f loci
## npoly = number of polymorphic sites in nsites
## poly = frequency of polymorphic sites
This step clusters consensus sequences across samples. It can take a long time for very large data sets. If run normally it will print its progress to the screen so you can judge how long it might take. This example will take less than a minute.
%%bash
pyrad -p params.txt -s 6
------------------------------------------------------------
 pyRAD : RADseq for phylogenetics & introgression analyses
------------------------------------------------------------
step 6: clustering across 12 samples at '.85' similarity

vsearch v1.0.7_linux_x86_64, 7.5GB RAM, 4 cores
https://github.com/torognes/vsearch

Reading file /home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_pairddRAD/clust.85/cat.haplos_ 100%
1074692 nt in 11963 seqs, min 49, max 97, avg 90
Indexing sequences 100%
Counting unique k-mers 100%
Clustering 100%
Writing clusters 100%
Clusters: 997 Size min 11, max 12, avg 12.0
Singletons: 0, 0.0% of seqs, 0.0% of clusters
finished clustering
Step 7 aligns loci, computes statistics, and writes output files in various formats.
%%bash
pyrad -p params.txt -s 7
------------------------------------------------------------
 pyRAD : RADseq for phylogenetics & introgression analyses
------------------------------------------------------------

ingroup 1A0,1B0,1C0,1D0,2E0,2F0,2G0,2H0,3I0,3J0,3K0,3L0
addon
exclude
...
final stats written to:
 /home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_pairddRAD/stats/c85d6m4p3.stats
output files being written to:
 /home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_pairddRAD/outfiles/ directory

writing nexus file
writing phylip file
  + writing full SNPs file
  + writing unlinked SNPs file
  + writing STRUCTURE file
  + writing geno file
  ** must enter group/clade assignments for treemix output
writing vcf file
writing alleles file
  ** must enter group/clade assignments for migrate-n output
The stats for this run are below. In this case 996 loci were shared across all 12 samples (remember that a few loci were removed during the adapter filtering in step 2).
%%bash
cat stats/c85d6m4p3.stats
997        ## loci with > minsp containing data
997        ## loci with > minsp containing data & paralogs removed
997        ## loci with > minsp containing data & paralogs removed & final filtering

## number of loci recovered in final data set for each taxon.
taxon   nloci
1A0     997
1B0     997
1C0     997
1D0     997
2E0     997
2F0     996
2G0     997
2H0     997
3I0     997
3J0     997
3K0     997
3L0     997

## nloci = number of loci with data for exactly ntaxa
## ntotal = number of loci for which at least ntaxa have data
ntaxa   nloci   saved   ntotal
1       -
2       -       -
3       -       -
4       0       *       997
5       0       *       997
6       0       *       997
7       0       *       997
8       0       *       997
9       0       *       997
10      0       *       997
11      1       *       997
12      996     *       996

## nvar = number of loci containing n variable sites (pis+autapomorphies).
## sumvar = sum of variable sites (SNPs).
## pis = number of loci containing n parsimony informative sites.
## sumpis = sum of parsimony informative sites.
        nvar    sumvar  PIS     sumPIS
0       21      0       295     0
1       78      78      332     332
2       168     414     222     776
3       187     975     104     1088
4       204     1791    30      1208
5       160     2591    12      1268
6       94      3155    2       1280
7       38      3421    0       1280
8       33      3685    0       1280
9       6       3739    0       1280
10      6       3799    0       1280
11      2       3821    0       1280
total var= 3821
total pis= 1280
sampled unlinked SNPs= 976
sampled unlinked bi-allelic SNPs= 928
%%bash
ls outfiles/
c85d6m4p3.alleles c85d6m4p3.excluded_loci c85d6m4p3.geno c85d6m4p3.loci c85d6m4p3.nex c85d6m4p3.phy c85d6m4p3.snps c85d6m4p3.str c85d6m4p3.unlinked_snps c85d6m4p3.vcf
%%bash
ls stats/
c85d6m4p3.stats Pi_E_estimate.txt s1.sorting.txt s2.rawedit.txt s3.clusters.txt s5.consens.txt