#!/usr/bin/env python
# coding: utf-8

# # Tutorial:
# ### paired-end ddRAD w/ merged reads
# #### (or other two-cutter based datatypes)
# ### _pyRAD_ v.3.0.4
# -----------------------------
#
# ### __Topics__:
#
# + Setup of params file for paired-end ddRAD (pairddrad)
# + Check for merge/overlap of paired reads
# + Assemble simulated data set with merged reads
# + Combine merged with non-merged assembly
#
# -----------------------------

# ## What do the data look like?

# The double digest library preparation method was developed and described by [Peterson et al. 2012](http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0037135). Here I will be talking about __ _paired-end ddRAD_ __ data, and describe my recommendations for how to set up a params file to analyze them in _pyRAD_ when many of the reads are __merged__.

# A ddRAD library is prepared by cutting genomic DNA with two different restriction enzymes and selecting the intervening fragments that fall within a certain size window. These fragments will contain overhangs from the respective cut sites on either side. One side will have a barcode+Illumina adapter ligated, and the other end will have only the reverse Illumina adapter ligated. The first reads may come in one or multiple files with "\_R1\_" in the name, and the second reads are in one or more separate files with "\_R2\_". Second-read files contain the corresponding pairs from the first-read files in the exact same order.
#
# -------------------
# ![alt](https://dl.dropboxusercontent.com/u/2538935/PYRAD_TUTORIALS/figures/diag_ddradpair.svg)

# #### Which cutters did you use?
# A feature of ddRAD data is that two different cutters are used to generate the data: typically a rare cutter and a common cutter. You will need to know the overhang sequences that these cutters leave on your reads. These can easily be found by looking at the raw forward and reverse read files: find the invariant sequence near the beginning of the _R1_ files for the first cutter, and the invariant sequence at the end of the _R2_ files for the second cutter. You will list them in this order in the params file, discussed below.
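# As a quick way to eyeball those overhangs, something like the sketch below may help. This is purely hypothetical example code: `raw_R1_.fastq.gz` is a placeholder for your own raw first-read file, and TGCAG happens to be the first-cutter overhang used in this tutorial's simulated data.

# In[ ]:

get_ipython().run_cell_magic('bash', '', "## HYPOTHETICAL sketch -- raw_R1_.fastq.gz stands in for your own raw file\n## peek at the start of the first reads (barcode, then the invariant overhang)\nzcat raw_R1_.fastq.gz | head -n 8 | cut -c 1-40\n\n## count how many of the first 1000 reads carry a candidate overhang\n## (TGCAG is the first-cutter overhang used in this tutorial)\nzcat raw_R1_.fastq.gz | head -n 4000 | awk 'NR % 4 == 2' | grep -c TGCAG\n")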
# #### Barcodes
# Your data will likely come to you non-demultiplexed (meaning not sorted by the individual the reads belong to). You can demultiplex the reads according to their barcodes using _pyRAD_ or separate software. If your reads are already demultiplexed that is OK as well.
#
# #### Detecting merged reads
# A failure to merge paired-end reads that have overlapping sequences can lead to _major_ problems during the assembly. A number of external programs are available to check for overlap of paired-end reads, and you can run your data through one of these programs before input to _pyRAD_. At the time of writing this, I recommend the software PEAR (https://github.com/xflouris/PEAR), which I'll demonstrate below.

# ## The Example data
# For this tutorial I simulated paired-end ddRAD reads on a 12-taxon species tree, shown below. You can download the data using the script below and assemble them by following along with all of the instructions.

# ![](https://dl.dropboxusercontent.com/u/2538935/PYRAD_TUTORIALS/figures/fig_tree4sims.svg)

# In[1]:

get_ipython().run_cell_magic('bash', '', '## download the data\nwget -q https://github.com/dereneaton/dereneaton.github.io/raw/master/downloads/simpairddradsmerge.zip\n\n## unzip the data\nunzip simpairddradsmerge.zip\n')

# ### Where the example data come from (skip this section if you want)

# You can simply download the data, but it might also be worth describing how they were generated. In this case I simulated ddRAD-like data with the egglib coalescent simulator, using my program [simRRLs.py](http://dereneaton.com/software/simrrls/) with the following options:

# In[2]:

indels = 0.005    ## low rate of indels (prob. mutation is indel)
dropout = 0       ## if 1, mutations to restriction site can cause locus dropout.
nloci = 1000      ## total Nloci simulated (less if locus dropout or merged reads)
ninds = 1         ## sampled individuals per tip taxon
shortfrag = 50    ## smallest size of digested fragments
longfrag = 300    ## largest size of digested fragments

# In[3]:

## Because the data are simulated at 20X coverage,
## and the reads are sequenced from both ends (paired-end),
## the total number of reads is:
reads = 12*ninds*nloci*20*2
print("%s sequenced reads" % reads)

# Here I execute the simulation script to simulate the data and write them to a file called _simpairddradsmerge_. If you simply downloaded the data you do not need to do this step; I show it only for those interested. The relevant point is that the fragment size is randomly selected between 50 and 300 bp, meaning that around half of our fragments will be <200 bp long, which in the case of paired 100 bp reads will lead to overlap.

# In[4]:

get_ipython().run_cell_magic('bash', '', '## download the simulation program\nwget -q http://www.dereneaton.com/downloads/simRRLs.py\n\n## simulate data with param settings described above\n## (requires the Python package `Egglib`) <---- !! \npython simRRLs.py 0.005 0 1000 1 50,300 pairddrad simpairddradsmerge\n')
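# We can make the "around half" claim above precise with back-of-the-envelope arithmetic: fragment sizes are drawn uniformly between 50 and 300 bp, and paired 100 bp reads overlap whenever the fragment is shorter than 2 * 100 = 200 bp, so we expect (200 - 50)/(300 - 50) = 0.6 of fragments to be mergeable. The cell below is just this illustrative calculation, not part of the pipeline; `readlen` is the simulated read length.

# In[ ]:

## illustrative arithmetic only: expected fraction of mergeable fragments
## (shortfrag and longfrag were defined in the cell above)
readlen = 100    ## simulated read length (bp)
overlap_frac = float(2 * readlen - shortfrag) / (longfrag - shortfrag)
print("expected fraction of mergeable fragments: %.2f" % overlap_frac)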
# # Assembling the data set with _pyRAD_

# We first create an empty params.txt input file for the _pyRAD_ analysis.
# The following command will create a template which we will fill in with all relevant parameter settings for the analysis.

# In[5]:

get_ipython().run_cell_magic('bash', '', 'pyrad -n\n')

# -------------
#
# Take a look at the default options. Each line designates a parameter and contains a "##" symbol, after which comments can be added, including a description of the parameter. The affected step for each parameter is shown in parentheses. The first 14 parameters are required; numbers 15-37 are optional.

# In[6]:

get_ipython().run_cell_magic('bash', '', 'cat params.txt\n')

# ---------------
#
# ### Edit the params file
#
# I will use the script below to substitute new values, but you could simply __use any text editor__ to make the changes. For this analysis I made the following changes from the defaults:
#
# --------------------
#
# 6. set the two restriction enzymes used to generate the ddRAD data
# 10. lowered clustering threshold to .85
# 11. set datatype to pairddrad
# 14. changed the output name prefix
# 19. mismatches for demultiplexing set to 0, exact match.
# 24. raised maxH. Lower is better for filtering paralogs.
# 30. added additional (all) output formats (e.g., nexus, SNPs, STRUCTURE)
#
# --------------------

# In[7]:

get_ipython().run_cell_magic('bash', '', "sed -i '/## 6. /c\\TGCAG,AATT ## 6. cutsites... ' ./params.txt\nsed -i '/## 10. /c\\.85 ## 10. lowered clust thresh' ./params.txt\nsed -i '/## 11. /c\\pairddrad ## 11. datatype... ' ./params.txt\nsed -i '/## 14. /c\\merged ## 14. prefix name ' ./params.txt\nsed -i '/## 19./c\\0 ## 19. errbarcode... ' ./params.txt\nsed -i '/## 24./c\\10 ## 24. maxH... ' ./params.txt\nsed -i '/## 30./c\\* ## 30. outformats... ' ./params.txt\n")

# -----------------
#
# Let's take a look at the edited params.txt file.

# In[8]:

get_ipython().run_cell_magic('bash', '', 'cat params.txt\n')

# ## Step 1: De-multiplexing the data
# ----------------
#
# Four examples of acceptable input file name formats for paired-end data:
#
# 1. xxxx_R1_001.fastq     xxxx_R2_001.fastq
# 2. xxxx_R1_001.fastq.gz  xxxx_R2_001.fastq.gz
# 3. xxxx_R1_100.fq.gz     xxxx_R2_100.fq.gz
# 4. xxxx_R1_.fq           xxxx_R2_.fq
#
# + The file ending can be .fastq, .fq, or .gz.
# + There should be a unique name or number shared by each pair, plus the characters \_R1\_ and \_R2\_.
# + For every file name with \_R1\_ there should be a corresponding \_R2\_ file.
#
# If your data are _already_ demultiplexed, skip step 1 and see step 2 below.
#
# -----------------
#
# Now we run step 1 of the analysis by designating the params file with the -p flag, and the step with the -s flag.

# In[9]:

get_ipython().run_cell_magic('bash', '', 'pyrad -p params.txt -s 1\n')

# Now we can look at the stats output for this step:

# In[10]:

get_ipython().run_cell_magic('bash', '', 'cat stats/s1.sorting.txt\n')

# The de-multiplexed reads are written to a new file for each individual in a new directory, created within your working directory, called fastq/.

# In[11]:

get_ipython().run_cell_magic('bash', '', 'ls fastq/\n')

# An individual file will look like this:

# In[12]:

get_ipython().run_cell_magic('bash', '', '## FIRST READS file --\n## I show only the first 12 lines and 80 characters to print it clearly here.\nzcat fastq/1A0_R1.fq.gz | head -n 12 | cut -c 1-80\n')

# In[13]:

get_ipython().run_cell_magic('bash', '', '## SECOND READS file\nzcat fastq/1A0_R2.fq.gz | head -n 12 | cut -c 1-80\n')

# -------------------------
#
# The reads were sorted into a separate file for the first (R1) and second (R2) reads of each individual.
#
# __If your data were previously de-multiplexed__ you need the following things before step 2:
#
# + your sorted file names should be formatted similarly to those above, but with your sample names substituted for 1A0, 1A1, etc.
# + file names should include "_\_R1._" in first-read files, and "_\_R2._" in second-read files (note this is different from the format for non-demultiplexed data files).
# + the files can be gzipped (.gz) or not (.fq or .fastq).
# + the barcode should be removed (not on the left side of first reads)
# + the restriction site should _not_ be removed, but if it is, enter a '@' symbol before the location of your sorted data.
#
# + __Enter on line 18 of the params file the location of your sorted data.__
#
# --------------------
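# Before moving on, it can be worth confirming that every first-read file has a matching second-read file. Below is a minimal sketch using this tutorial's file names; adjust the _R1./_R2. suffixes to match your own data.

# In[ ]:

get_ipython().run_cell_magic('bash', '', '## report any first-read file without a matching second-read file\nfor ffile in fastq/*_R1.fq.gz;\n do if [ ! -f ${ffile/_R1./_R2.} ];\n  then echo "no mate found for $ffile";\n fi;\ndone\n')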
# ## Before Step 2: Merge/overlap filtering
# Due to the use of a common cutter (e.g., EcoRI), paired ddRAD data often contain fragments short enough that the first and second reads overlap. It is usually beneficial to separate these out of your data. To do so, you could run your data through the program PEAR as shown below. If you think your data don't have this problem then skip this step (but most data sets have overlaps).

# In[14]:

get_ipython().run_cell_magic('bash', '', '## you first need to un-archive the files\nfor gfile in fastq/*.gz; \n do gunzip $gfile;\ndone\n')

# In[15]:

get_ipython().run_cell_magic('bash', '', '## then run PEAR on each data file\nfor gfile in fastq/*_R1.fq; \n do pear -f $gfile \\\n -r ${gfile/_R1.fq/_R2.fq} \\\n -o ${gfile/_R1.fq/} \\\n -n 33 \\\n -j 4 >> pear.log 2>&1\ndone\n')

# #### Take a peek at the log file

# In[16]:

get_ipython().run_cell_magic('bash', '', 'tail -n 42 pear.log\n')
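# To tally merge rates across all samples without scrolling the whole log, you can pull out PEAR's summary lines. This is a hedged sketch: it assumes your PEAR version writes summary lines beginning with "Assembled reads", as current versions did at the time of writing.

# In[ ]:

get_ipython().run_cell_magic('bash', '', '## pull just the per-sample merge-rate summary lines from the PEAR log\ngrep "^Assembled reads" pear.log\n')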
# ## Important: Assembling our merged reads
# As expected, about half of the reads are merged, since we simulated fragments in a size range of 50-300 bp. We now have a number of decisions to make. We could (1) analyze only the merged reads, (2) analyze only the non-merged reads, or (3) analyze both. If most of your data are merged, it may not be worth your time to bother with the non-merged data, or vice versa. And combining them may just introduce more noise rather than more signal.
#
# We could combine the merged and non-merged reads early in the analysis; however, I recommend assembling the two separately (if you choose to keep both) and combining them at the end, if desired. This way you can easily check whether the two give conflicting signals, and whether one data set or the other appears messy or noisy.

# To see how to assemble the non-merged reads, see the [paired ddRAD non-merged tutorial](link). For step 2 you would enter the location of the ".unassembled.\*" reads below instead of the ".assembled.\*" reads, as we do here. I show at the end of this tutorial how to combine the two data sets.
#
# __In this tutorial I focus on how to assemble the merged data:__
#
# For this, we need to make two changes to the params file:
#
# + __Change param 11 datatype from 'pairddrad' to 'ddrad'__ (we do not use the 'merged' datatype for ddRAD data because ddRAD reads do not need to be reverse-complement clustered; the forward and reverse adapters ligate to different cut sites).
# + __Set the location of our _pear_ output files for param 18__. This tells pyrad to use these files instead of simply selecting all files in the fastq/ directory.

# In[17]:

get_ipython().run_cell_magic('bash', '', "## set location of demultiplexed data that are 'pear' filtered\nsed -i '/## 11./c\\ddrad ## 11. datatype ' ./params.txt\nsed -i '/## 18./c\\fastq/*.assembled.* ## 18. demulti data loc ' ./params.txt\n")

# ## Step 2: Quality filtering
# Next we apply the quality filtering.
#
# + We set the filter (param 21) to its default of 0, meaning that filtering is based only on the quality scores of base calls. In this step, low-quality sites are converted to Ns, and any read with more than X Ns is discarded, where X is the number set on line 9 of the params file. We do not need to filter for Illumina adapters because PEAR has already done this.

# In[18]:

get_ipython().run_cell_magic('bash', '', 'pyrad -p params.txt -s 2\n')

# Statistics for the number of reads that passed filtering can be found in the stats/ directory. You can see that not all samples have exactly the same number of reads. This is because there is some variation in how well _pear_ merges reads when they overlap by only a few bases.

# In[19]:

get_ipython().run_cell_magic('bash', '', 'cat stats/s2.rawedit.txt\n')

# #### Assembled (merged) reads
#
# The filtered data files are converted to fasta format and written to a directory called edits/. Our merged data set will have the name ".assembled" attached to it. This is important for keeping it separate from our un-merged data set.

# In[20]:

get_ipython().run_cell_magic('bash', '', 'ls edits/\n')

# #### Take a look at the merged reads
# The merged data should generally have the first cutter at the beginning of the read, and the second cutter at the end of the read, like below:

# In[21]:

get_ipython().run_cell_magic('bash', '', 'head -n 8 edits/1A0.assembled.edit\n')

# --------------
#
# ## Step 3: Within-sample clustering
# Unlike for paired-GBS or ezRAD data, we do not need to perform reverse-complement clustering for paired or single ddRAD data. There are some options for different ways to cluster paired-ddRAD data in the un-merged tutorial, but for merged data it is pretty straightforward, since the data act like single-end data. With our datatype set to 'ddrad' we simply run step 3 as below:

# In[22]:

get_ipython().run_cell_magic('bash', '', 'pyrad -p params.txt -s 3 \n')

# In[23]:

get_ipython().run_cell_magic('bash', '', 'head -n 23 stats/s3.clusters.txt\n')

# ## Steps 4 & 5: Consensus base calling
# We next make consensus base calls for each cluster within each individual. First we estimate the error rate and heterozygosity within each sample:

# In[24]:

get_ipython().run_cell_magic('bash', '', 'pyrad -p params.txt -s 4\n')

# Calling consensus sequences applies a number of filters, as listed in the params file and described in the general tutorial.

# In[25]:

get_ipython().run_cell_magic('bash', '', 'pyrad -p params.txt -s 5\n')

# In[26]:

get_ipython().run_cell_magic('bash', '', 'cat stats/s5.consens.txt\n')

# ## Step 6: Across-sample clustering
# This step will sometimes take the longest, depending on the size of your data set. Here it will go very quickly.

# In[27]:

get_ipython().run_cell_magic('bash', '', 'pyrad -p params.txt -s 6\n')

# ## Step 7: Statistics and output files
# Alignment of loci, statistics, and output of various formatted data files.

# In[28]:

get_ipython().run_cell_magic('bash', '', 'pyrad -p params.txt -s 7\n')

# Stats for this run. In this case ~490 loci were shared across all 12 samples.

# In[37]:

get_ipython().run_cell_magic('bash', '', 'cat stats/merged.stats\n')

# ### Output files

# In[38]:

get_ipython().run_cell_magic('bash', '', 'ls outfiles/\n')

# ### Stats files

# In[39]:

get_ipython().run_cell_magic('bash', '', 'ls stats/\n')

# ### Take a look at the data

# In[42]:

get_ipython().run_cell_magic('bash', '', 'head -n 39 outfiles/merged.loci | cut -b 1-80\n')
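# If you just want a quick locus count to compare against the stats file, you can count the separator lines in the .loci output. This sketch assumes the "//" lines that pyRAD writes between locus blocks in this format.

# In[ ]:

get_ipython().run_cell_magic('bash', '', '## count loci in the .loci output (each locus block ends with a "//" line)\ngrep -c "^//" outfiles/merged.loci\n')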
# ## What to do with the non-merged reads
# Assemble them as described in the pairddrad non-merged tutorial [link](link).
#
# ## Combining the two data sets

# (1) Concatenate the .loci output files from the merged and non-merged analyses into a temporary file.

# In[33]:

get_ipython().run_cell_magic('bash', '', '#cat outfiles/merged.loci outfiles/nonmerged.loci > outfiles/totaldata.loci\n')

# (2) Remove ".unassembled." and ".assembled." from the sample names in the file. (The order matters: "unassembled" contains "assembled" as a substring, so strip ".unassembled." first, as below.)

# In[34]:

get_ipython().run_cell_magic('bash', '', '#sed -i s/.unassembled.//g outfiles/totaldata.loci\n#sed -i s/.assembled.//g outfiles/totaldata.loci\n')

# (3) Set the output prefix (param 14) to the name of your concatenated data file, and ask for additional output formats.

# In[35]:

get_ipython().run_cell_magic('bash', '', "#sed -i '/## 14. /c\\totaldata ## 14. prefix name ' ./params.txt\n#sed -i '/## 30. /c\\* ## 30. additional formats ' ./params.txt\n")

# (4) Run step 7 to create additional output file formats using the concatenated .loci file.

# In[36]:

get_ipython().run_cell_magic('bash', '', '#pyrad -p params.txt -s 7\n')
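# As an optional final sanity check (a hedged sketch, commented out like the steps above since the non-merged data are not assembled in this run): after the renaming in step (2), no "assembled"/"unassembled" tags should remain in the sample names, so the grep below should print 0.

# In[ ]:

get_ipython().run_cell_magic('bash', '', '## should print 0 if the renaming in step (2) succeeded\n## (|| true keeps bash happy when grep finds no matches)\n#grep -c "assembled" outfiles/totaldata.loci || true\n')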