Tutorial:

paired-end ddRAD w/ merged reads

(or other two-cutter based datatypes)

pyRAD v.3.0.4


Topics:

  • Setup of params file for paired-end ddRAD (pairddrad)
  • Check for merge/overlap of paired reads
  • Assemble simulated data set with merged reads
  • Combine merged with non-merged assembly

What do the data look like?

The double digest library preparation method was developed and described by Peterson et al. 2012. Here I will be talking about paired-end ddRAD data and describe my recommendations for how to set up a params file to analyze them in pyRAD when many of the reads are merged.

A ddRAD library is prepared by cutting genomic DNA with two different restriction enzymes and selecting the intervening fragments that are within a certain size window. These will contain overhangs from the respective cut sites on either side. One side will have a barcode+Illumina adapter ligated, and the other end will have only the reverse Illumina adapter ligated. The first reads may come in one or multiple files with "_R1_" in the name, and the second reads are in separate files with "_R2_". Second read files contain the corresponding pairs from the first read files in the exact same order.



Which cutters did you use?

A feature of ddRAD data is that two different cutters are used to generate the data: typically a rare cutter and a common cutter. You will need to know the overhang sequence that each cutter leaves on your sequences. This can easily be found by looking at the raw forward and reverse read files: find the invariant sequence near the beginning of the R1 files for the first cutter, and the invariant sequence at the end of the R2 files for the second cutter. You will list them in this order in the params file, discussed below.
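One quick way to spot the invariant overhang is to compare the starts of a handful of R1 reads and keep whatever prefix they all share. The sketch below illustrates the idea on a few made-up sequences (`common_prefix` is a hypothetical helper, not part of pyRAD):

```python
def common_prefix(seqs):
    """Return the longest prefix shared by all sequences."""
    prefix = seqs[0]
    for seq in seqs[1:]:
        # shrink the candidate prefix until it matches this read too
        while not seq.startswith(prefix):
            prefix = prefix[:-1]
    return prefix

## hypothetical first reads; in practice, paste in a few lines
## from the top of one of your R1 fastq files
reads = [
    "TGCAGCATGACAATCTATGG",
    "TGCAGCTCAAGCCTCTTAGG",
    "TGCAGGATATGAACTAGTGG",
]
print(common_prefix(reads))   # the shared start is the overhang: TGCAG
```

The same trick on reversed R2 reads reveals the second cutter's overhang at the read ends.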

Barcodes

Your data will likely come to you non-demultiplexed (meaning not sorted by which individual the reads belong to). You can demultiplex the reads according to their barcodes using pyRAD or separate software. If your reads are already de-multiplexed that is OK as well.

Detecting merged reads

A failure to merge paired-end reads that have overlapping sequences can lead to major problems during the assembly. A number of external programs are available to check for overlap of paired-end reads, and you can run your data through one of these programs before inputting them to pyRAD. At the time of writing, I recommend the software PEAR (https://github.com/xflouris/PEAR), which I'll demonstrate below.

The Example data

For this tutorial I simulated paired-end ddRAD reads on a 12 taxon species tree, shown below. You can download the data using the script below and assemble these data by following along with all of the instructions.

In [1]:
%%bash
## download the data
wget -q https://github.com/dereneaton/dereneaton.github.io/raw/master/downloads/simpairddradsmerge.zip

## unzip the data
unzip simpairddradsmerge.zip
Archive:  simpairddradsmerge.zip
  inflating: simpairddradsmerge.barcodes  
  inflating: simpairddradsmerge_R1_.fastq.gz  
  inflating: simpairddradsmerge_R2_.fastq.gz  

Where the example data come from (skip this section if you want)

You can simply download the data, but it is also worth describing how they were generated. In this case I simulated ddRAD-like data with my program simRRLs.py, which uses the egglib coalescent simulator, using the following options:

In [2]:
indels  = 0.005   ## low rate of indels (prob. mutation is indel)
dropout = 0       ## if 1, mutations to restriction site can cause locus dropout.
nloci   = 1000    ## Total Nloci simulated (less if locus dropout or merged reads) 
ninds   = 1       ## sampled individuals per tip taxon
shortfrag = 50    ## smallest size of digested fragments
longfrag  = 300   ## largest size of digested fragments
In [3]:
## Because the data are simulated at 20X coverage, 
## and the reads are sequenced from both ends (paired-end)
## the total number of reads is:  
reads = 12*ninds*nloci*20*2
print "%s sequenced reads" % reads
480000 sequenced reads

Here I execute the simulation script to simulate the data and write it to files with the prefix simpairddradsmerge. If you simply downloaded the data you do not need to do this step; I show it only for those interested. The relevant point is that the fragment size is randomly selected between 50 and 300 bp, meaning that around half of our fragments will be <200 bp long, which in the case of paired 100 bp reads will lead to overlap.
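The expected overlap rate can be checked with some back-of-the-envelope arithmetic. Assuming fragment lengths are uniform on [50, 300] and 100 bp reads come off both ends, the two reads overlap by (200 - fragment length) bases, so any fragment under 200 bp overlaps at all, and PEAR's default minimum overlap of 10 bp requires fragments of at most 190 bp. (This is a rough sketch; the observed merge rate from PEAR below comes out near 50%.)

```python
shortfrag, longfrag = 50, 300   # fragment size window from the simulation
readlen = 100                   # length of each paired read
min_overlap = 10                # PEAR's default minimum overlap

span = longfrag - shortfrag

## fraction of fragments shorter than 2*readlen (any overlap at all)
frac_overlapping = (2 * readlen - shortfrag) / float(span)

## fraction short enough to satisfy the minimum-overlap requirement
frac_mergeable = (2 * readlen - min_overlap - shortfrag) / float(span)

print("overlapping: %.2f, mergeable: %.2f" % (frac_overlapping, frac_mergeable))
## overlapping: 0.60, mergeable: 0.56
```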

In [4]:
%%bash
## download the simulation program
wget -q http://www.dereneaton.com/downloads/simRRLs.py

## simulate data with param settings described above
## (requires the Python package `Egglib`)  <---- !! 
python simRRLs.py 0.005 0 1000 1 50,300 pairddrad simpairddradsmerge
	simulating pairddrad data
	simulating 1000 loci at 20X coverage across 12 tip taxa with 1 samples per taxon
	indels arise at frequency of 0.005000 per mutation
	mutations in restriction site = False
	sequencing error rate = 0.0005
	theta=4Nu= 0.0014
	min fragment length allows read overlaps/adapter sequences 
	creating new barcode map
.

Assembling the data set with pyRAD

We first create an empty params.txt input file for the pyRAD analysis.
The following command will create a template which we will fill in with all relevant parameter settings for the analysis.

In [5]:
%%bash
pyrad -n
	new params.txt file created

Take a look at the default options. Each line designates a parameter and contains a "##" symbol, after which comments appear, including a description of the parameter. The affected step for each parameter is shown in parentheses. The first 14 parameters are required; numbers 15-37 are optional.

In [6]:
%%bash
cat params.txt
==** parameter inputs for pyRAD version 3.0.4  **======================== affected step ==
./                          ## 1. Working directory                                 (all)
./*.fastq.gz                ## 2. Loc. of non-demultiplexed files (if not line 16)  (s1)
./*.barcodes                ## 3. Loc. of barcode file (if not line 16)             (s1)
vsearch                     ## 4. command (or path) to call vsearch (or usearch)    (s3,s6)
muscle                      ## 5. command (or path) to call muscle                  (s3,s7)
TGCAG                       ## 6. Restriction overhang (e.g., C|TGCAG -> TGCAG)     (s1,s2)
2                           ## 7. N processors (parallel)                           (all)
6                           ## 8. Mindepth: min coverage for a cluster              (s4,s5)
4                           ## 9. NQual: max # sites with qual < 20 (or see line 20)(s2)
.88                         ## 10. Wclust: clustering threshold as a decimal        (s3,s6)
rad                         ## 11. Datatype: rad,gbs,ddrad,pairgbs,pairddrad,merged (all)
4                           ## 12. MinCov: min samples in a final locus             (s7)
3                           ## 13. MaxSH: max inds with shared hetero site          (s7)
c88d6m4p3                   ## 14. Prefix name for final output (no spaces)         (s7)
==== optional params below this line ===================================  affected step ==
                       ## 15.opt.: select subset (prefix* only selector)            (s2-s7)
                       ## 16.opt.: add-on (outgroup) taxa (list or prefix*)         (s6,s7)
                       ## 17.opt.: exclude taxa (list or prefix*)                   (s7)
                       ## 18.opt.: loc. of de-multiplexed data                      (s2)
                       ## 19.opt.: maxM: N mismatches in barcodes (def= 1)          (s1)
                       ## 20.opt.: phred Qscore offset (def= 33)                    (s2)
                       ## 21.opt.: filter: def=0=NQual 1=NQual+adapters. 2=strict   (s2)
                       ## 22.opt.: a priori E,H (def= 0.001,0.01, if not estimated) (s5)
                       ## 23.opt.: maxN: max Ns in a cons seq (def=5)               (s5)
                       ## 24.opt.: maxH: max heterozyg. sites in cons seq (def=5)   (s5)
                       ## 25.opt.: ploidy: max alleles in cons seq (def=2;see docs) (s4,s5)
                       ## 26.opt.: maxSNPs: (def=100). Paired (def=100,100)         (s7)
                       ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)
                       ## 28.opt.: random number seed (def. 112233)              (s3,s6,s7)
                       ## 29.opt.: trim overhang left,right on final loci, def(0,0) (s7)
                       ## 30.opt.: output formats: p,n,a,s,v,u,t,m,k,g,* (see docs) (s7)
                       ## 31.opt.: maj. base call at depth>x<mindepth (def.x=mindepth) (s5)
                       ## 32.opt.: keep trimmed reads (def=0). Enter min length.    (s2)
                       ## 33.opt.: max stack size (int), def= max(500,mean+2*SD)    (s3)
                       ## 34.opt.: minDerep: exclude dereps with <= N copies, def=1 (s3)
                       ## 35.opt.: use hierarchical clustering (def.=0, 1=yes)      (s6)
                       ## 36.opt.: repeat masking (def.=1='dust' method, 0=no)      (s3,s6)
                       ## 37.opt.: vsearch max threads per job (def.=6; see docs)   (s3,s6)
==== optional: list group/clade assignments below this line (see docs) ==================

Edit the params file

I will use the script below to substitute new values, but you could instead use any text editor to make the changes. For this analysis I made the following changes from the defaults:


 6. set the two restriction enzymes used to generate the ddRAD data
 10. lowered clustering threshold to .85
 11. set datatype to pairddrad
 14. changed the output name prefix
 19. mismatches for demultiplexing set to 0, exact match.
 24. Raised maxH. Lower is better for filtering paralogs.
 30. added additional (all) output formats (e.g., nexus,SNPs,STRUCTURE)


In [7]:
%%bash
sed -i '/## 6. /c\TGCAG,AATT                 ## 6. cutsites... ' ./params.txt
sed -i '/## 10. /c\.85                       ## 10. lowered clust thresh' ./params.txt
sed -i '/## 11. /c\pairddrad                 ## 11. datatype... ' ./params.txt
sed -i '/## 14. /c\merged                    ## 14. prefix name ' ./params.txt
sed -i '/## 19./c\0                     ## 19. errbarcode... ' ./params.txt
sed -i '/## 24./c\10                    ## 24. maxH... ' ./params.txt
sed -i '/## 30./c\*                     ## 30. outformats... ' ./params.txt

Let's take a look at the edited params.txt file

In [8]:
%%bash
cat params.txt
==** parameter inputs for pyRAD version 3.0.4  **======================== affected step ==
./                          ## 1. Working directory                                 (all)
./*.fastq.gz                ## 2. Loc. of non-demultiplexed files (if not line 16)  (s1)
./*.barcodes                ## 3. Loc. of barcode file (if not line 16)             (s1)
vsearch                     ## 4. command (or path) to call vsearch (or usearch)    (s3,s6)
muscle                      ## 5. command (or path) to call muscle                  (s3,s7)
TGCAG,AATT                 ## 6. cutsites... 
2                           ## 7. N processors (parallel)                           (all)
6                           ## 8. Mindepth: min coverage for a cluster              (s4,s5)
4                           ## 9. NQual: max # sites with qual < 20 (or see line 20)(s2)
.85                       ## 10. lowered clust thresh
pairddrad                 ## 11. datatype... 
4                           ## 12. MinCov: min samples in a final locus             (s7)
3                           ## 13. MaxSH: max inds with shared hetero site          (s7)
merged                    ## 14. prefix name 
==== optional params below this line ===================================  affected step ==
                       ## 15.opt.: select subset (prefix* only selector)            (s2-s7)
                       ## 16.opt.: add-on (outgroup) taxa (list or prefix*)         (s6,s7)
                       ## 17.opt.: exclude taxa (list or prefix*)                   (s7)
                       ## 18.opt.: loc. of de-multiplexed data                      (s2)
0                     ## 19. errbarcode... 
                       ## 20.opt.: phred Qscore offset (def= 33)                    (s2)
                       ## 21.opt.: filter: def=0=NQual 1=NQual+adapters. 2=strict   (s2)
                       ## 22.opt.: a priori E,H (def= 0.001,0.01, if not estimated) (s5)
                       ## 23.opt.: maxN: max Ns in a cons seq (def=5)               (s5)
10                    ## 24. maxH... 
                       ## 25.opt.: ploidy: max alleles in cons seq (def=2;see docs) (s4,s5)
                       ## 26.opt.: maxSNPs: (def=100). Paired (def=100,100)         (s7)
                       ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)
                       ## 28.opt.: random number seed (def. 112233)              (s3,s6,s7)
                       ## 29.opt.: trim overhang left,right on final loci, def(0,0) (s7)
*                     ## 30. outformats... 
                       ## 31.opt.: maj. base call at depth>x<mindepth (def.x=mindepth) (s5)
                       ## 32.opt.: keep trimmed reads (def=0). Enter min length.    (s2)
                       ## 33.opt.: max stack size (int), def= max(500,mean+2*SD)    (s3)
                       ## 34.opt.: minDerep: exclude dereps with <= N copies, def=1 (s3)
                       ## 35.opt.: use hierarchical clustering (def.=0, 1=yes)      (s6)
                       ## 36.opt.: repeat masking (def.=1='dust' method, 0=no)      (s3,s6)
                       ## 37.opt.: vsearch max threads per job (def.=6; see docs)   (s3,s6)
==== optional: list group/clade assignments below this line (see docs) ==================

Step 1: De-multiplexing the data


Four examples of acceptable input file name formats for paired-end data:

   1. xxxx_R1_001.fastq      xxxx_R2_001.fastq
   2. xxxx_R1_001.fastq.gz   xxxx_R2_001.fastq.gz
   3. xxxx_R1_100.fq.gz      xxxx_R2_100.fq.gz
   4. xxxx_R1_.fq            xxxx_R2_.fq

  • The file ending can be .fastq, .fq, or .gz.
  • There should be a unique name or number shared by each pair and the characters _R1_ and _R2_.
  • For every file name with _R1_ there should be a corresponding _R2_ file.
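The pairing rule above is easy to check programmatically before starting a run. Here is a small sketch (the `unpaired` helper and the filenames are hypothetical; in practice you would glob your own data directory):

```python
def unpaired(filenames):
    """Return _R1_ files whose corresponding _R2_ file is missing."""
    names = set(filenames)
    return [f for f in filenames
            if "_R1_" in f and f.replace("_R1_", "_R2_") not in names]

## example: the second pair is missing its _R2_ file
files = ["xxxx_R1_001.fastq.gz",
         "xxxx_R2_001.fastq.gz",
         "yyyy_R1_001.fastq.gz"]

print(unpaired(files))   # ['yyyy_R1_001.fastq.gz']
```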

If your data are already demultiplexed skip step 1 and see step 2 below.


Now we run step 1 of the analysis by designating the params file with the -p flag, and the step with the -s flag.

In [9]:
%%bash
pyrad -p params.txt -s 1

     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------


	step 1: sorting reads by barcode
	 .

Now we can look at the stats output for this below:

In [10]:
%%bash
cat stats/s1.sorting.txt
file    	Nreads	cut_found	bar_matched
simpairddradsmerge_.fastq.gz	240000	240000	240000


sample	true_bar	obs_bars	N_obs
1C0    	AAAAGA    	AAAAGA	20000   
2G0    	AAGTGA    	AAGTGA	20000   
1D0    	AGAATG    	AGAATG	20000   
2H0    	AGGGTG    	AGGGTG	20000   
1A0    	CATCAT    	CATCAT	20000   
3I0    	GAATGG    	GAATGG	20000   
3K0    	GAGTAA    	GAGTAA	20000   
2E0    	GGAGAG    	GGAGAG	20000   
1B0    	GTAGTG    	GTAGTG	20000   
3L0    	GTTGAA    	GTTGAA	20000   
2F0    	TGATTT    	TGATTT	20000   
3J0    	TTAATG    	TTAATG	20000   

nomatch  	_            	0

The de-multiplexed reads are written to a new file for each individual in a new directory, fastq/, created within your working directory.

In [11]:
%%bash 
ls fastq/
1A0_R1.fq.gz
1A0_R2.fq.gz
1B0_R1.fq.gz
1B0_R2.fq.gz
1C0_R1.fq.gz
1C0_R2.fq.gz
1D0_R1.fq.gz
1D0_R2.fq.gz
2E0_R1.fq.gz
2E0_R2.fq.gz
2F0_R1.fq.gz
2F0_R2.fq.gz
2G0_R1.fq.gz
2G0_R2.fq.gz
2H0_R1.fq.gz
2H0_R2.fq.gz
3I0_R1.fq.gz
3I0_R2.fq.gz
3J0_R1.fq.gz
3J0_R2.fq.gz
3K0_R1.fq.gz
3K0_R2.fq.gz
3L0_R1.fq.gz
3L0_R2.fq.gz

An individual file will look like this:

In [12]:
%%bash
## FIRST READS file -- 
## I show only the first 12 lines and 80 characters to print it clearly here.
less fastq/1A0_R1.fq.gz | head -n 12 | cut -c 1-80
@lane1_fakedata0_R1_0 1:N:0:
TGCAGCATGACAATCTATGGACCACAGAGCGCGAAATCTGCTCTCGGACATCAACGTCTGAAGTTCGGGCCCGTAACGCT
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_fakedata0_R1_1 1:N:0:
TGCAGCATGACAATCTATGGACCACAGAGCGCGAAATCTGCTCTCGGACATCAACGTCTGAAGTTCGGGCCCGTAACGCT
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_fakedata0_R1_2 1:N:0:
TGCAGCATGACAATCTATGGACCACAGAGCGCGAAATCTGCTCTCGGACATCAACGTCTGAAGTTCGGGCCCGTAACGCT
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
In [13]:
%%bash
## SECOND READS file
less fastq/1A0_R2.fq.gz | head -n 12 | cut -c 1-80
@lane1_fakedata0_R2_0 1:N:0:
AATTGTTGGTTGTTTTACATGCAGGATATGAACTAGTGGCACTGATAGGATATTATCCCGTGCGAGCGCGTATACCGTGG
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_fakedata0_R2_1 1:N:0:
AATTGTTGGTTGTTTTACATGCAGGATATGAACTAGTGGCACTGATAGGATATTATCCCGTGCGAGCGCGTATACCGTGG
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_fakedata0_R2_2 1:N:0:
AATTGTTGGTTGTTTTACATGCAGGATATGAACTAGTGGCACTGATAGGATATTATCCCGTGCGAGCGCGTATACCGTGG
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

The reads were sorted into a separate file for the first (R1) and second (R2) reads for each individual.

If your data were previously de-multiplexed you need the following things before step 2:

  • your sorted file names should be formatted similar to above, but with sample names substituted for 1A0, 1A1, etc.
  • file names should include "_R1." in first read files, and "_R2." in second read files (note this is different from the format for non de-multiplexed data files).
  • the files can be gzipped (.gz) or not (.fq or .fastq).
  • the barcode should be removed (not on left side of first reads)
  • the restriction site should not be removed, but if it is, enter a '@' symbol before the location of your sorted data.

  • Enter the location of your sorted data on line 18 of the params file.


Before Step 2: Merge/overlap filtering

Due to the use of a common cutter (e.g., EcoRI), paired ddRAD data often contain fragments that are shorter than the combined length of the two reads, such that the first and second reads overlap. It is usually beneficial to separate these from the rest of your data. To do so, you can run the program PEAR as shown below. If you think your data don't have this problem then you could skip this step (but most data sets have some overlap).

In [14]:
%%bash
## you first need to un-archive the files
for gfile in fastq/*.gz; 
  do gunzip $gfile;
done
In [15]:
%%bash
## then run PEAR on each data file
for gfile in fastq/*_R1.fq; 
  do pear -f $gfile \
          -r ${gfile/_R1.fq/_R2.fq} \
          -o ${gfile/_R1.fq/} \
          -n 33 \
          -j 4  >> pear.log 2>&1
done

take a peek at the log file

In [16]:
%%bash 
tail -n 42 pear.log
 ____  _____    _    ____ 
|  _ \| ____|  / \  |  _ \
| |_) |  _|   / _ \ | |_) |
|  __/| |___ / ___ \|  _ <
|_|   |_____/_/   \_\_| \_\

PEAR v0.9.7 [February 12, 2015]

Citation - PEAR: a fast and accurate Illumina Paired-End reAd mergeR
Zhang et al (2014) Bioinformatics 30(5): 614-620 | doi:10.1093/bioinformatics/btt593

Forward reads file.................: fastq/3L0_R1.fq
Reverse reads file.................: fastq/3L0_R2.fq
PHRED..............................: 33
Using empirical frequencies........: YES
Statistical method.................: OES
Maximum assembly length............: 999999
Minimum assembly length............: 33
p-value............................: 0.010000
Quality score threshold (trimming).: 0
Minimum read size after trimming...: 1
Maximal ratio of uncalled bases....: 1.000000
Minimum overlap....................: 10
Scoring method.....................: Scaled score
Threads............................: 4

Allocating memory..................: 200,000,000 bytes
Computing empirical frequencies....: DONE
  A: 0.256721
  C: 0.240800
  G: 0.247794
  T: 0.254686
  0 uncalled bases
Assemblying reads: 100%

Assembled reads ...................: 9,981 / 20,000 (49.905%)
Discarded reads ...................: 0 / 20,000 (0.000%)
Not assembled reads ...............: 10,019 / 20,000 (50.095%)
Assembled reads file...............: fastq/3L0.assembled.fastq
Discarded reads file...............: fastq/3L0.discarded.fastq
Unassembled forward reads file.....: fastq/3L0.unassembled.forward.fastq
Unassembled reverse reads file.....: fastq/3L0.unassembled.reverse.fastq

Important: Assembling our merged reads

As expected, about half of the reads are merged, since we simulated fragments in a size range of 50-300 bp. We now have a number of decisions to make: we could (1) analyze only the merged reads, (2) analyze only the non-merged reads, or (3) analyze both. If most of your data are merged, it may not be worth your time to bother with the non-merged data, or vice versa, and combining them may introduce more noise than signal.

We could combine the merged and non-merged reads early in the analysis; however, I recommend assembling the two separately (if you choose to keep both) and combining them at the end, if desired. This way you can easily check whether the two give conflicting signals, and whether one data set or the other appears messy or noisy. To see how to assemble the non-merged reads, see the paired ddRAD non-merged tutorial; for step 2 you would enter the location of the ".unassembled.*" reads instead of the ".assembled.*" reads, as we do here. I show at the end of this tutorial how to combine the data sets.

In this tutorial I focus on how to assemble the merged data:

For this, we need to make two changes to the params file:

  • Change param 11 (datatype) from 'pairddrad' to 'ddrad'. We do not use the 'merged' datatype for ddRAD data because the forward and reverse adapters ligate to different cut sites, so the reads do not need to be reverse-complement clustered.
  • Set the location of our PEAR output files for param 18. This tells pyRAD to use these files instead of simply selecting all files in the fastq/ directory.
In [17]:
%%bash
## set location of demultiplexed data that are 'pear' filtered
sed -i '/## 11./c\ddrad                     ## 11. datatype         ' ./params.txt
sed -i '/## 18./c\fastq/*.assembled.*       ## 18. demulti data loc ' ./params.txt

Step 2: Quality filtering

Next we apply the quality filtering.

  • We set the filter (param 21) to its default of 0, meaning that filtering is based only on the quality scores of base calls. In this step, low quality sites are converted to Ns, and any read with more than X Ns is discarded, where X is the number set on line 9 of the params file. We do not need to filter for Illumina adapters because PEAR has already done this.
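The NQual rule can be sketched in a few lines (a toy illustration on a single read; `nqual_filter` is a hypothetical helper, not pyRAD's implementation, which works on full fastq records):

```python
def nqual_filter(seq, quals, offset=33, maxn=4):
    """Mask bases with phred quality < 20 as N; drop the read
    (return None) if more than maxn sites were masked."""
    masked = "".join(
        base if ord(q) - offset >= 20 else "N"
        for base, q in zip(seq, quals)
    )
    return masked if masked.count("N") <= maxn else None

## 'B' is phred 33 (high quality), '#' is phred 2 (low quality)
print(nqual_filter("TGCAGCAT", "BBBB#B#B"))          # TGCANCNT
print(nqual_filter("TGCAGCAT", "BBBB#B#B", maxn=1))  # None (too many Ns)
```

The default `maxn=4` here mirrors the value on line 9 of the params file above.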
In [18]:
%%bash
pyrad -p params.txt -s 2

     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------

	sorted .fastq from fastq/*.assembled.* being used
	step 2: editing raw reads 
	............

Statistics for the number of reads that passed filtering can be found in the stats/ directory. You can see that not all samples have the exact same number of reads. This is because there is some variation in how well reads are merged by PEAR when they overlap by only a few bases.

In [19]:
%%bash 
cat stats/s2.rawedit.txt
sample 	Nreads	passed	passed.w.trim	passed.total
1A0.assembled	9960	9960	0	9960
1B0.assembled	9970	9970	0	9970
1C0.assembled	9961	9961	0	9961
1D0.assembled	9960	9960	0	9960
2E0.assembled	9961	9961	0	9961
2F0.assembled	9962	9962	0	9962
2G0.assembled	9980	9980	0	9980
2H0.assembled	9960	9960	0	9960
3I0.assembled	9960	9960	0	9960
3J0.assembled	9960	9960	0	9960
3K0.assembled	9961	9961	0	9961
3L0.assembled	9981	9981	0	9981

    Nreads = total number of reads for a sample
    passed = retained reads that passed quality filtering at full length
    passed.w.trim= retained reads that were trimmed due to detection of adapters
    passed.total  = total kept reads of sufficient length
    note: you can set the option in params file to include trimmed reads of xx length. 

Assembled (merged) reads

The filtered data files are converted to fasta format and written to a directory called edits/. Our merged data set has the suffix ".assembled" attached to it, which keeps it separate from our un-merged data set.

In [20]:
%%bash
ls edits/
1A0.assembled.edit
1B0.assembled.edit
1C0.assembled.edit
1D0.assembled.edit
2E0.assembled.edit
2F0.assembled.edit
2G0.assembled.edit
2H0.assembled.edit
3I0.assembled.edit
3J0.assembled.edit
3K0.assembled.edit
3L0.assembled.edit

Take a look at the merged reads

The merged data should generally have the first cutter at the beginning of the read and the second cutter at the end, like below:

In [21]:
%%bash
head -n 8 edits/1A0.assembled.edit
>1A0.assembled_0_r1
TGCAGCTCAAGCCTCTTAGGGGCCAGCGGTGCTACATGCTCCGCGTGCACTGGGTTCCTCCTGTTACCGTTCTGGGTGCTGCTACTACGTCTAGCGCTTCTGAAAGGTGGATGCCGGTTTGACCAAATATGCTACTGCGAGTATATACCTTGTGTGAGAGAATT
>1A0.assembled_1_r1
TGCAGCTCAAGCCTCTTAGGGGCCAGCGGTGCTACATGCTCCGCGTGCACTGGGTTCCTCCTGTTACCGTTCTGGGTGCTGCTACTACGTCTAGCGCTTCTGAAAGGTGGATGCCGGTTTGACCAAATATGCTACTGCGAGTATATACCTTGTGTGAGAGAATT
>1A0.assembled_2_r1
TGCAGCTCAAGCCTCTTAGGGGCCAGCGGTGCTACATGCTCCGCGTGCACTGGGTTCCTCCTGTTACCGTTCTGGGTGCTGCTACTACGTCTAGCGCTTCTGAAAGGTGGATGCCGGTTTGACCAAATATGCTACTGCGAGTATATACCTTGTGTGAGAGAATT
>1A0.assembled_3_r1
TGCAGCTCAAGCCTCTTAGGGGCCAGCGGTGCTACATGCTCCGCGTGCACTGGGTTCCTCCTGTTACCGTTCTGGGTGCTGCTACTACGTCTAGCGCTTCTGAAAGGTGGATGCCGGTTTGACCAAATATGCTACTGCGAGTATATACCTTGTGTGAGAGAATT

Step 3: Within-sample clustering

Unlike for paired-GBS or ez-RAD data, we do not need to perform reverse-complement clustering for paired or single-end ddRAD data. There are some options for different ways to cluster paired-ddRAD data in the un-merged tutorial, but for merged data it is pretty straightforward, since the data act like single-end reads. With our datatype set to 'ddrad' we simply run step 3 as below:

In [22]:
%%bash
pyrad -p params.txt -s 3

     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------


	de-replicating files for clustering...

	step 3: within-sample clustering of 12 samples at 
	        '.85' similarity. Running 2 parallel jobs
	 	with up to 6 threads per job. If needed, 
		adjust to avoid CPU and MEM limits

	sample 3L0 finished, 500 loci
	sample 2G0 finished, 499 loci
	sample 1B0 finished, 499 loci
	sample 2F0 finished, 499 loci
	sample 1C0 finished, 499 loci
	sample 3K0 finished, 499 loci
	sample 2E0 finished, 499 loci
	sample 2H0 finished, 498 loci
	sample 1A0 finished, 498 loci
	sample 3I0 finished, 498 loci
	sample 1D0 finished, 498 loci
	sample 3J0 finished, 498 loci
In [23]:
%%bash
head -n 23 stats/s3.clusters.txt
taxa	total	dpt.me	dpt.sd	d>5.tot	d>5.me	d>5.sd	badpairs
1A0.assembled	498	20.0	0.0	498	20.0	0.0	0
1B0.assembled	499	19.98	0.447	499	19.98	0.447	0
1C0.assembled	499	19.962	0.85	498	20.0	0.0	0
1D0.assembled	498	20.0	0.0	498	20.0	0.0	0
2E0.assembled	499	19.962	0.85	498	20.0	0.0	0
2F0.assembled	499	19.964	0.805	498	20.0	0.0	0
2G0.assembled	499	19.98	0.447	499	19.98	0.447	0
2H0.assembled	498	20.0	0.0	498	20.0	0.0	0
3I0.assembled	498	20.0	0.0	498	20.0	0.0	0
3J0.assembled	498	20.0	0.0	498	20.0	0.0	0
3K0.assembled	499	19.962	0.85	498	20.0	0.0	0
3L0.assembled	500	19.962	0.849	499	20.0	0.0	0

    ## total = total number of clusters, including singletons
    ## dpt.me = mean depth of clusters
    ## dpt.sd = standard deviation of cluster depth
    ## >N.tot = number of clusters with depth greater than N
    ## >N.me = mean depth of clusters with depth greater than N
    ## >N.sd = standard deviation of cluster depth for clusters with depth greater than N
    ## badpairs = mismatched 1st & 2nd reads (only for paired ddRAD data)

Steps 4 & 5: Consensus base calling

We next make consensus base calls for each cluster within each individual. First we estimate the error rate and heterozygosity within each sample:

In [24]:
%%bash
pyrad -p params.txt -s 4

     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------


	step 4: estimating error rate and heterozygosity
	............

Calling consensus sequences applies a number of filters, as listed in the params file and the general tutorial.
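The gist of consensus calling can be caricatured as follows: at each site in a cluster, take the majority base, and record an IUPAC ambiguity code when two bases are both common (a heterozygous call). This is only a cartoon of the idea; pyRAD's actual caller uses a binomial model with the E and H estimates from step 4, plus the filters listed in the params file.

```python
from collections import Counter

## IUPAC codes for two-base heterozygous sites
AMBIG = {frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("AC"): "M",
         frozenset("GT"): "K", frozenset("AT"): "W", frozenset("CG"): "S"}

def consensus(reads, het_frac=0.25):
    """Toy consensus: majority base per column, ambiguity code when the
    second most common base exceeds het_frac of the column depth."""
    out = []
    for column in zip(*reads):
        top = Counter(column).most_common(2)
        if len(top) == 2 and top[1][1] >= het_frac * len(column):
            out.append(AMBIG[frozenset((top[0][0], top[1][0]))])
        else:
            out.append(top[0][0])
    return "".join(out)

## site 3 is an even C/A split, so it is called as the ambiguity "M"
print(consensus(["TGCAG", "TGCAG", "TGAAG", "TGAAG"]))   # TGMAG
```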

In [25]:
%%bash
pyrad -p params.txt -s 5

     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------


	step 5: creating consensus seqs for 12 samples, using H=0.00129 E=0.00049
	............
In [26]:
%%bash
cat stats/s5.consens.txt
taxon          	nloci	f1loci	f2loci	nsites	npoly	poly
1A0.assembled	498	498	498	55339	61	0.0011023
2E0.assembled	499	498	497	55187	83	0.001504
3I0.assembled	498	498	498	55339	83	0.0014998
3L0.assembled	500	499	499	55508	71	0.0012791
2F0.assembled	499	498	498	55340	88	0.0015902
2H0.assembled	498	498	498	55339	62	0.0011204
3J0.assembled	498	498	498	55339	73	0.0013191
1D0.assembled	498	498	498	55340	74	0.0013372
2G0.assembled	499	499	499	55509	83	0.0014953
1C0.assembled	499	498	498	55339	73	0.0013191
1B0.assembled	499	499	498	55349	70	0.0012647
3K0.assembled	499	498	498	55339	72	0.0013011

    ## nloci = number of loci
    ## f1loci = number of loci with >N depth coverage
    ## f2loci = number of loci with >N depth and passed paralog filter
    ## nsites = number of sites across f loci
    ## npoly = number of polymorphic sites in nsites
    ## poly = frequency of polymorphic sites

Step 6: Across-sample clustering

This step will sometimes take the longest, depending on the size of your data set. Here it will go very quickly.

In [27]:
%%bash
pyrad -p params.txt -s 6
vsearch v1.0.7_linux_x86_64, 7.5GB RAM, 4 cores
https://github.com/torognes/vsearch


	finished clustering

     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------


	step 6: clustering across 12 samples at '.85' similarity 

Reading file /home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_pairddRAD/clust.85/cat.haplos_ 100%
724037 nt in 5977 seqs, min 59, max 184, avg 121
Indexing sequences 100%
Counting unique k-mers 100%
Clustering 100%
Writing clusters 100%
Clusters: 501 Size min 1, max 12, avg 11.9
Singletons: 3, 0.1% of seqs, 0.6% of clusters

Step 7: Statistics and output files

Alignment of loci, statistics, and output of various formatted data files.

In [28]:
%%bash
pyrad -p params.txt -s 7
	ingroup 1A0.assembled,1B0.assembled,1C0.assembled,1D0.assembled,2E0.assembled,2F0.assembled,2G0.assembled,2H0.assembled,3I0.assembled,3J0.assembled,3K0.assembled,3L0.assembled
	addon 
	exclude 
	
	final stats written to:
	 /home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_pairddRAD/stats/merged.stats
	output files being written to:
	 /home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_pairddRAD/outfiles/ directory

	writing nexus file
	writing phylip file
	  + writing full SNPs file
	  + writing unlinked SNPs file
	  + writing STRUCTURE file
	  + writing geno file
	  ** must enter group/clade assignments for treemix output 
	writing vcf file
	writing alleles file
	  ** must enter group/clade assignments for migrate-n output 

     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------

...

Stats for this run. In this case 496 loci were shared across all 12 samples.

In [37]:
%%bash
cat stats/merged.stats

498         ## loci with > minsp containing data
498         ## loci with > minsp containing data & paralogs removed
498         ## loci with > minsp containing data & paralogs removed & final filtering

## number of loci recovered in final data set for each taxon.
taxon	nloci
1A0.assembled	498
1B0.assembled	497
1C0.assembled	498
1D0.assembled	498
2E0.assembled	497
2F0.assembled	498
2G0.assembled	498
2H0.assembled	498
3I0.assembled	498
3J0.assembled	498
3K0.assembled	498
3L0.assembled	498


## nloci = number of loci with data for exactly ntaxa
## ntotal = number of loci for which at least ntaxa have data
ntaxa	nloci	saved	ntotal
1	-
2	-		-
3	-		-
4	0	*	498
5	0	*	498
6	0	*	498
7	0	*	498
8	0	*	498
9	0	*	498
10	0	*	498
11	2	*	498
12	496	*	496


## nvar = number of loci containing n variable sites (pis+autapomorphies).
## sumvar = sum of variable sites (SNPs).
## pis = number of loci containing n parsimony informative sites.
## sumpis = sum of parsimony informative sites.
	nvar	sumvar	PIS	sumPIS
0	9	0	107	0
1	21	21	150	150
2	53	127	125	400
3	78	361	63	589
4	69	637	33	721
5	82	1047	13	786
6	56	1383	3	804
7	45	1698	2	818
8	35	1978	2	834
9	21	2167	0	834
10	11	2277	0	834
11	9	2376	0	834
12	6	2448	0	834
13	2	2474	0	834
14	0	2474	0	834
15	1	2489	0	834
total var= 2489
total pis= 834
sampled unlinked SNPs= 489
sampled unlinked bi-allelic SNPs= 465
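
The sumvar column is cumulative: the value at row n is the running total of n × nvar over all rows up to n. As a sanity check (the number pairs below are copied from the nvar table above), the running sum reproduces the reported total of 2489 variable sites:

```shell
# Multiply each n (col 1) by the number of loci with n variable
# sites (col 2) and sum; values taken from the nvar table above.
printf '0 9\n1 21\n2 53\n3 78\n4 69\n5 82\n6 56\n7 45\n8 35\n9 21\n10 11\n11 9\n12 6\n13 2\n14 0\n15 1\n' |
awk '{ total += $1 * $2 } END { print total }'
# prints 2489
```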

Output files

In [38]:
%%bash
ls outfiles/
merged.alleles
merged.excluded_loci
merged.geno
merged.loci
merged.nex
merged.phy
merged.snps
merged.str
merged.unlinked_snps
merged.vcf
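
Each locus in a .loci file ends with a "//" separator line, so a quick way to count loci is to count those lines. The toy file below is invented for illustration; on the real output you would point grep at outfiles/merged.loci instead:

```shell
# Build a two-locus toy .loci file, then count the "//" separators.
printf '>A  ACGT\n>B  ACGT\n//   -\n>A  TTTT\n>B  TTTT\n//\n' > toy.loci
grep -c '^//' toy.loci
# prints 2
```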

Stats files

In [39]:
%%bash
ls stats/
merged.stats
Pi_E_estimate.txt
s1.sorting.txt
s2.rawedit.txt
s3.clusters.txt
s5.consens.txt

Take a look at the data

In [42]:
%%bash
head -n 39 outfiles/merged.loci | cut -b 1-80
>1A0.assembled    GGCTGCGTAAGAATGTCGACCTGACGACAAAGCTGCCCCCGCCCTAATATTTACCCCTATTA
>1B0.assembled    GGCTGCGTAAGAATGTCGACCTGACGACAAAGCTGCCCCCGCCCTAATATTTACCCCTATTA
>1C0.assembled    GGCTGCGTAAGAATGTCGACCTGACGACAAAGCTGCCCCCGCCCTAATATTTACCCCTATTA
>1D0.assembled    GGGTGCGTAAGAATGTCGACCTGACGACAAAGCTGCCCCCGCCCTAATATTTACCCCTATTA
>2E0.assembled    GGCTGCGTAAGAATGTCGACCTGACGACAAAGCTGCCCCCGCCCGAATATTTACCCCTATTA
>2F0.assembled    GGCTGCGTAAGAATGTCGACCTGACGACAAAGCTGCCCCCGCCCTAATATTTACCCCTATTA
>2G0.assembled    GGCTGCGTAAGAATGTCGACCTGACGACAAAGCTGCCCCCGCCCTAATATTTACCCCTATTA
>2H0.assembled    GGCTGCGTAAGAATGTCGACCTGACGACAAAGCTGCCCCCGCCCTAATATTTACCCCTATTA
>3I0.assembled    GGCTGCGTAAGAATGTCGACCTGACGACAAAGCTGCCCCCGCCCTAATATTTACCCCTATTA
>3J0.assembled    GGCTGCGTAAGAATGTCGACCTGACGACAAAGCTGCCCCCGCCCTAATATTTACCCCTATTA
>3K0.assembled    GGCTGCGTAAGAATGTCGACCTGACGACAAAGCTGCCCCCGCCCTAATATTTACCCCTATTA
>3L0.assembled    GGCTGCGTAAGAATGTCGACCTGACGACAAAGCTGCCCCCGCCCTAATATTTACCCCTATTA
//                  -                                         -                 
>1A0.assembled    GCATGAGCGAAAAATCTTACGCTTAAGGAGTTCGATCTTTTTTCAGGGGCAGTTAGCCACTG
>1B0.assembled    GCATGAGCGAAAAATCTTACGCTTAAGGAGTTCGATCTTTTTTCAGGGGCAGTTAGCCACTG
>1C0.assembled    GCATGAGCGAAAAATCTTACGCTTAAGGAGTTCGATCTTTTTTCAGGGGCAGTTAGCCACTG
>1D0.assembled    GCGTGAGCGAAAAATCTTACGCTTAAGGAGTTCGATCTTTTTTCAGGGGCAGTTAGCCACTG
>2E0.assembled    GCATGAGCGAAAAATCTTACGCTTAAGGAGTTCGATCTTTWTTCAGGGGCAGTTAGCCACTG
>2F0.assembled    GCATGAGCGAAAAATCTTAMGCTTAAGGAGTTCGATCTTTTTTCAGGGGCAGTTAGCCACTG
>2G0.assembled    GCATGAGCGAAAAATCTTACGCTTAAGGAGTTCGATCTTTTTTCAGGGGCAGTTAGCCACTG
>2H0.assembled    GCATGAGCGAAAAATCTTACGCTTAAGGAGTTCGATCTTTTTTCAGGGGCAGTTAGCCACTG
>3I0.assembled    GCATGAGCGAAAAATCTTACGCTTAAGGAGTTCGATCTTTTTTCAGGGGCAGTTATCCACTG
>3J0.assembled    GCATGAGCGAAAAATCTTACGCTTAAGGAGTTCGATCTTTTTTCAGGGGCAGTTATCCACTG
>3K0.assembled    GCATGAGCGAAAAATCTTACGCTTAAGGAGTTCGATCTTTTTTCAGGGGCAGTTATCCACTG
>3L0.assembled    GCATGAGCGAAAAATCTTACGCTTAAGGAGTTCGATCTTTTTTCAGGGGCAGTTATCCACTG
//                  -                -                    -              *      
>1A0.assembled    TCATCCGACCCGCGCCTCCATGTACCATGCCTGTATTTGGTACAAACTACGGTGAAGCCCGC
>1B0.assembled    TCATCCGACCCGCGCCTCCATGTACCATGCCTGTATTTGGTACAAACTACGGTGAAGCCCGC
>1C0.assembled    TCATCCGACCCGCGCCTCCATGTACCATGCCTGTATTTGGTACAAACTACGGTGAAGCCCGC
>1D0.assembled    TCATCCGACCCGCGCCTCCATGTACCATGCCTGTATTTGGTACAAACTACGGTGAAGCCCGC
>2E0.assembled    TCATCCGACCCGCGCCTCCATGTACCATGCCTGTATTTGGTACAAACTACGGTGAAGCCCGC
>2F0.assembled    TCATCCGACCCGCGCCTCCATGTACCATGCCTGTATTTGGTACAAACTACGGTGAAGCCCGC
>2G0.assembled    TCATCCGACCCGCGCCTCCATGTACCATGCCTGTATTTGGTACAAACTACGGTGAAGCCCGC
>2H0.assembled    TCATCCGACCCGCGCCTCCATGTACCATGCCTGTATTTGGTACAAACTACGGTGAAGCCCGC
>3I0.assembled    TCATCCGACCCGCGCCTCCATGTACCATGCCTGTATTTGGTACAAACTACGGTGAAGCCCGC
>3J0.assembled    TCATCCGACCCGCGCCTCCATGTACCATGCCTGTATTTGGTACAAACTACGGTGAAGCCCGC
>3K0.assembled    TCATCCGACCCGCGCCTCCATGTACCATGCCTGTATTTGGTACAAACTACGGTGAAGCCCGC
>3L0.assembled    TCATCCGACCCGCGCCTCCATGTACCATGCCTGTATTTGGTACAAACTMCGGTGAAGCCCRC
//                                                                -           - 
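
On the "//" separator lines above, a "-" marks a variable site (autapomorphy) and a "*" marks a parsimony-informative site. Since every SNP carries one of these two markers, you can tally SNPs by counting them across separator lines. The toy excerpt below is invented for illustration; on real data you would replace toy.loci with outfiles/merged.loci:

```shell
# Toy .loci excerpt: locus 1 has an autapomorphy ("-"),
# locus 2 has a parsimony-informative site ("*").
cat > toy.loci <<'EOF'
>A    GGCTG
>B    GGATG
>C    GGCTG
>D    GGCTG
//      -
>A    TTATT
>B    TTATT
>C    TTCTT
>D    TTCTT
//      *
EOF
# Keep only "-" and "*" characters on separator lines and count them.
grep '^//' toy.loci | tr -cd '*-' | wc -c
# prints 2
```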

What to do with the non-merged reads

Assemble them as described in the non-merged pairddrad tutorial.

Combining the two data sets

(1) Concatenate the .loci output files from the merged and non-merged analyses to a tempfile.

In [33]:
%%bash
#cat outfiles/merged.loci outfiles/nonmerged.loci > outfiles/totaldata.loci

(2) Remove the ".assembled" and ".unassembled" suffixes from the sample names in the file.

In [34]:
%%bash
#sed -i 's/\.unassembled//g' outfiles/totaldata.loci
#sed -i 's/\.assembled//g' outfiles/totaldata.loci
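
A minimal sketch of what the suffix-stripping does, run on an invented two-line excerpt (the commands above target the real outfiles/totaldata.loci). Note the dots are escaped so sed matches "." literally rather than any character, and ".unassembled" is stripped before ".assembled":

```shell
# Toy excerpt with one merged and one non-merged sample name.
printf '>1A0.assembled      GGCTG\n>1A0.unassembled    GCATG\n' > demo.loci
# Strip the suffixes (GNU sed assumed for in-place -i editing).
sed -i 's/\.unassembled//g; s/\.assembled//g' demo.loci
cat demo.loci
# both headers now read ">1A0"
```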

(3) Set the output prefix (param 14) to the name of your concatenated data file, and request additional output formats (param 30).

In [35]:
%%bash
#sed -i '/## 14. /c\totaldata                 ## 14. prefix name ' ./params.txt
#sed -i '/## 30. /c\*                         ## 30. additional formats ' ./params.txt

(4) Run step 7 to create additional output file formats from the concatenated .loci file.

In [36]:
%%bash
#pyrad -p params.txt -s 7