#!/usr/bin/env python
# coding: utf-8

# # Tutorial:
# ### paired-end ddRAD analysis
# #### (or other two-cutter based datatypes)
# ### _pyRAD_ v.3.0.4
# -----------------------------
#
# ### __Topics__:
#
# + Setup of params file for paired-end ddRAD (pairddrad)
# + Assemble a simulated data set
# + Visualize ideal results on simulated data
#
# -----------------------------

# ## What is ddRAD?

# The double digest library preparation method was developed and described by [Peterson et al. 2012](http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0037135). Here I will be talking about __ _paired-end ddRAD_ __ data and describe my recommendations for how to set up a params file to analyze them in _pyRAD_.

# A ddRAD library is prepared by cutting genomic DNA with two different restriction enzymes and selecting the intervening fragments that fall within a certain size window. These fragments will contain overhangs from the respective cut sites on either side. One side will have a barcode + Illumina adapter ligated, and the other end will have only the reverse Illumina adapter ligated. The first reads may come in one or multiple files with "\_R1\_" in the name, and the second reads come in a corresponding file or files with "\_R2\_". Second-read files contain the mate of each read in the first-read files, in exactly the same order.
#
# -------------------
# ![alt](https://dl.dropboxusercontent.com/u/2538935/PYRAD_TUTORIALS/figures/diag_ddradpair.svg)
# ---------------------
#
# Your data will likely come to you non-demultiplexed (meaning not sorted by which individual the reads belong to). You can demultiplex the reads according to their barcodes using _pyRAD_ or separate software.
#
# A number of programs are available to check for overlap of paired-end reads, and you can run your data through these programs before inputting them to _pyRAD_. At the time of writing this, I recommend the software PEAR (https://github.com/xflouris/PEAR), which I demonstrate on pairddrad data in a separate [tutorial here](http://nbviewer.ipython.org/gist/dereneaton/dc6241083c912519064e/tutorial_pairddRAD_3.0.4-merged.ipynb). A minimal example invocation is sketched just below.
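# As an aside, here is a minimal sketch of a PEAR run for checking read overlap. It is not part of the present tutorial (the simulated data used here skip this step), and the input file names are hypothetical placeholders for your own raw data; see the PEAR documentation for the full set of options.

# In[ ]:

get_ipython().run_cell_magic('bash', '', '## hypothetical example: assemble overlapping read pairs with PEAR\n## -f = raw first reads, -r = raw second reads, -o = output prefix\npear -f raw_R1_.fastq -r raw_R2_.fastq -o merged\n')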
# ## The Example data

# For this tutorial I simulated paired-end ddRAD loci on a 12-taxon species tree, shown below. You can download the data using the script below and assemble them by following along with all of the instructions.

# ![](https://dl.dropboxusercontent.com/u/2538935/PYRAD_TUTORIALS/figures/fig_tree4sims.svg)

# In[1]:

get_ipython().run_cell_magic('bash', '', '## download the data\nwget -q http://www.dereneaton.com/downloads/simpairddrads.zip -O simpairddrads.zip\n## unzip the data\nunzip simpairddrads.zip\n')

# ---------------
# # Assembling the data set with _pyRAD_

# We first create an empty params.txt input file for the _pyRAD_ analysis.
# The following command creates a template which we will fill in with all relevant parameter settings for the analysis.

# In[2]:

get_ipython().run_cell_magic('bash', '', 'pyrad -n\n')

# -------------
#
# Take a look at the default options. Each line designates a parameter and contains a "##" symbol, after which comments can be added, including a description of the parameter. The affected step for each parameter is shown in parentheses. The first 14 parameters are required; numbers 15-37 are optional.

# In[3]:

get_ipython().run_cell_magic('bash', '', 'cat params.txt\n')

# ---------------
#
# ### Edit the params file
#
# I will use the script below to substitute new values, but you should simply __use any text editor__ to make changes. For this analysis I made the following changes from the defaults:
#
# --------------------
#
# 6. set the two restriction enzymes used to generate the ddRAD data
# 10. lowered the clustering threshold to .85
# 11. set the datatype to pairddrad
# 14. changed the output name prefix
# 19. mismatches for demultiplexing set to 0, exact match
# 24. raised maxH (lower is better for filtering paralogs)
# 30. added additional output formats (e.g., nexus, SNPs, STRUCTURE)
#
# --------------------
#

# In[4]:

get_ipython().run_cell_magic('bash', '', "sed -i '/## 6. /c\\TGCAG,AATT ## 6. cutsites... ' ./params.txt\nsed -i '/## 10. /c\\.85 ## 10. lowered clust thresh' ./params.txt\nsed -i '/## 11. /c\\pairddrad ## 11. datatype... ' ./params.txt\nsed -i '/## 14. /c\\c85d6m4p3 ## 14. prefix name ' ./params.txt\nsed -i '/## 19./c\\0 ## 19. errbarcode... ' ./params.txt\nsed -i '/## 24./c\\10 ## 24. maxH... ' ./params.txt\nsed -i '/## 30./c\\* ## 30. outformats... ' ./params.txt\n")

# -----------------
#
# Let's take a look at the edited params.txt file.
#

# In[5]:

get_ipython().run_cell_magic('bash', '', 'cat params.txt\n')

# ## Step 1: De-multiplexing the data
# ----------------
#
# Four examples of acceptable input file name formats for paired-end data:
#
# 1. xxxx_R1_001.fastq xxxx_R2_001.fastq
# 2. xxxx_R1_001.fastq.gz xxxx_R2_001.fastq.gz
# 3. xxxx_R1_001.fq.gz xxxx_R2_001.fq.gz
# 4. xxxx_R1_001.fq xxxx_R2_001.fq
#
# The file ending can be .fastq, .fq, or .gz.
# There should be a unique name or number shared by each pair, as well as the characters "\_R1\_" and "\_R2\_".
# For every file name with "\_R1\_" there should be a corresponding "\_R2\_" file.
#
# If your data are _already_ demultiplexed, skip step 1 and see step 2 below.
#
# -----------------
#

# In[6]:

get_ipython().run_cell_magic('bash', '', 'pyrad -p params.txt -s 1\n')

# Now we can look at the stats output for this below:

# In[7]:

get_ipython().run_cell_magic('bash', '', 'cat stats/s1.sorting.txt\n')

# The de-multiplexed reads are written to a new file for each individual in a new directory created within your working directory called fastq/.

# In[8]:

get_ipython().run_cell_magic('bash', '', 'ls fastq/\n')

# An individual file will look like below:

# In[9]:

get_ipython().run_cell_magic('bash', '', '## FIRST READS file\n## I show the first 12 lines and 80 characters to print clearly here\nless fastq/1A0_R1.fq.gz | head -n 12 | cut -c 1-80\n')

# In[10]:

get_ipython().run_cell_magic('bash', '', '## SECOND READS file\nless fastq/1A0_R2.fq.gz | head -n 12 | cut -c 1-80\n')

# -------------------------
#
# The reads were sorted into a separate file for the first (R1) and second (R2) reads for each individual.
#
# __If your data were previously de-multiplexed__ you need the following things before step 2:
#
# + your sorted file names should be formatted like those above, but with your sample names substituted for 1A0, 1A1, etc.
# + the files can be zipped (.gz) or not (.fq or .fastq).
# + the barcode should be removed (not on the left side of reads)
# + the restriction site should _not_ be removed, but if it is, enter a '@' symbol before the location of your sorted data.
# + __enter on line 18 of the params file the location of your sorted data__ (an example edit is sketched just below).
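# For example, if your sorted data sat in a directory called sorted/, a one-line edit in the same style as the sed commands used above could point line 18 at them. This is only a sketch; the path is a hypothetical placeholder for wherever your own files live:

# In[ ]:

get_ipython().run_cell_magic('bash', '', "## hypothetical: point line 18 at already de-multiplexed data\nsed -i '/## 18./c\\./sorted/*.fq.gz ## 18. sorted data location ' ./params.txt\n")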
# --------------------
#
# ## Step 2: Quality filtering

# Next we apply quality filtering.
#
# + We left the quality filter (line 21) at the default value of 0, meaning that it will only filter based on quality scores and not look for the presence of Illumina adapters. Low-quality sites are converted to Ns, and any locus with more than X Ns is discarded, where X is the number set on line 9 of the params file. If you used _pear_ to check for merged reads then this filter can remain at 0; otherwise I suggest setting it to 1 (check for adapters) or 2 (a more stringent check).

# In[11]:

get_ipython().run_cell_magic('bash', '', 'pyrad -p params.txt -s 2\n')

# Statistics for the number of reads that passed filtering can be found in the stats/ directory.

# In[12]:

get_ipython().run_cell_magic('bash', '', 'cat stats/s2.rawedit.txt\n')

# The filtered data files are converted to fasta format and written to a directory called edits/. I show this below:

# In[13]:

get_ipython().run_cell_magic('bash', '', 'ls edits/\n')

# ------------
#
# In the file shown below you can see that the first and second reads have been concatenated into a single read separated by 'nnnn' in these files, and that the read quality scores have been discarded.
#
# ----------
#

# In[14]:

get_ipython().run_cell_magic('bash', '', 'head -n 8 edits/1A0.edit\n')

# --------------
#
# ## Step 3: Within-sample clustering

# From here you have two options for how to proceed. The first is to treat these concatenated reads like single-end ddRAD sequences and cluster them as concatenated reads. To do this, change the datatype option in the params file from "pairddrad" to "ddrad" and proceed with steps 3-7; _pyRAD_ will detect and remove the 'nnnn' separator and cluster the long reads (a one-line edit for this option is sketched below, after the discussion of the two methods).
#
# Alternatively, the more complex option is to cluster the reads separately, which I call the "split clustering" method. This will do the following:
#
# + de-replicate the concatenated reads
# + split them and cluster first reads only
# + match second reads back to firsts
# + align second-read clusters separately from first-read clusters
# + discard loci for which second reads align poorly (incomplete digestion or paralogs)
#
# This is illustrated graphically below:

# ![](https://dl.dropboxusercontent.com/u/2538935/PYRAD_TUTORIALS/figures/pairclustermethod.svg)

# ### The screen captures below show split clustering performed on an empirical paired ddRAD data set:

# #### The well-aligned clusters are written to a .clustS.gz file

# ![](https://dl.dropboxusercontent.com/u/2538935/PYRAD_TUTORIALS/figures/goodpairs.png)

# #### The poorly aligned pairs, which are excluded from further analysis, are written to a separate file ending in .badpairs.gz

# ![](https://dl.dropboxusercontent.com/u/2538935/PYRAD_TUTORIALS/figures/badpairs.png)

# --------------
#
# The benefit of this method over clustering concatenated first+second reads is that second reads can be retained that are either more divergent than the clustering threshold or that contain many low-quality base calls (Ns), thus potentially yielding more SNPs. Also, for very large data sets this method clusters faster.
#
# Drawbacks are that it discards sequences for which the first and second reads do not appear to come from the same DNA fragment, whereas the concatenated clustering method would likely cluster them separately. Also, even a single incompletely digested second read can cause an otherwise good cluster to be discarded.
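# If you prefer the simpler concatenated option described above, the datatype can be switched with a one-line edit in the same style as before. This is only a sketch of that alternative; we do not run it here, and the analysis below keeps the pairddrad setting:

# In[ ]:

get_ipython().run_cell_magic('bash', '', "## hypothetical: switch to concatenated (single-end style) clustering\n## we do NOT run this here; the tutorial keeps datatype = pairddrad\nsed -i '/## 11. /c\\ddrad ## 11. datatype... ' ./params.txt\n")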
# The split clustering method filters the alignment of second reads based on the number of indels in the alignment. This number can be set on line 27 of the params file. The default values are 3,6,99,99, meaning that for within-sample clusters we allow up to 3 indels in first-read clusters and 6 in second-read clusters, while for across-sample clusters we allow 99 indels in both first- and second-read clusters. If only two numbers are set (e.g., 3,10) this is equivalent to 3,3,10,10.

# In[14]:

get_ipython().run_cell_magic('bash', '', '## we are using the split-clustering method since the \n## datatype is still set to pairddrad\npyrad -p params.txt -s 3 \n')

# In[15]:

get_ipython().run_cell_magic('bash', '', 'head -n 23 stats/s3.clusters.txt\n')

# ## Steps 4 & 5: Consensus base calling

# We next make consensus base calls for each cluster within each individual. First we estimate the error rate and heterozygosity within each sample:

# In[16]:

get_ipython().run_cell_magic('bash', '', 'pyrad -p params.txt -s 4\n')

# Calling consensus sequences applies a number of filters, listed in the params file, to remove potential paralogs or highly repetitive markers from the data set. For paired-end data the maxN filter applies only to first reads, since we don't mind allowing low-quality base calls in second reads. The maxH filter applies to both reads together. The diploid filter (allow only 2 haplotypes in a consensus sequence) also applies to the two reads together.

# In[17]:

get_ipython().run_cell_magic('bash', '', 'pyrad -p params.txt -s 5\n')

# In[18]:

get_ipython().run_cell_magic('bash', '', 'cat stats/s5.consens.txt\n')

# ## Step 6: Across-sample clustering

# This step clusters consensus sequences across samples. It can take a long time for very large data sets; if run normally it prints its progress to the screen so you can judge how long it might take. This example will take less than a minute.

# In[19]:

get_ipython().run_cell_magic('bash', '', 'pyrad -p params.txt -s 6\n')

# ## Step 7: Statistics and output files

# Alignment of loci, statistics, and output of variously formatted data files.

# In[20]:

get_ipython().run_cell_magic('bash', '', 'pyrad -p params.txt -s 7\n')

# Stats for this run. In this case 1000 loci were shared across all 12 samples.

# In[21]:

get_ipython().run_cell_magic('bash', '', 'cat stats/c85d6m4p3.stats\n')

# ### Output files

# In[22]:

get_ipython().run_cell_magic('bash', '', 'ls outfiles/\n')

# ### Stats files

# In[23]:

get_ipython().run_cell_magic('bash', '', 'ls stats/\n')
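# As a final sanity check, the locus count reported in the stats file can be recovered directly from the .loci output, where each locus ends with a '//' separator line. The file name below follows the prefix we set on line 14; adjust it if you chose a different prefix:

# In[ ]:

get_ipython().run_cell_magic('bash', '', "## count loci in the final output; each locus in the .loci file\n## is terminated by a '//' separator line\ngrep -c '//' outfiles/c85d6m4p3.loci\n")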