#!/usr/bin/env python
# coding: utf-8

# # Tutorial:
# ### single-end ddRAD analysis
# #### (or other two-cutter based datatypes)
# ### _pyRAD_ v.3.0.4

# -----------------------------

# ### __Topics__:
#
# + Setup of params file for single-end ddRAD (ddrad)
# + Assemble simulated data set
# + Visualize results

# -----------------------------

# ## What is ddRAD?

# The double digest library preparation method was developed and described by [Peterson et al. 2012](http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0037135).

# A ddRAD library is prepared by cutting genomic DNA with two different restriction enzymes and selecting the intervening fragments that fall within a certain size window. These fragments carry overhangs from the respective cut sites on either side. One side has a barcode+Illumina adapter ligated, and the other end has only the reverse Illumina adapter ligated. Depending on the size selection window, the far-end adapter and cut site may not appear in the sequenced portion of the fragment (the read).

# ## The example data

# For this tutorial I simulated ddRAD data on a 12 taxon species tree, shown below. You can download the data using the script below and assemble these data by following along with all of the instructions.

# ![](https://dl.dropboxusercontent.com/u/2538935/PYRAD_TUTORIALS/figures/fig_tree4sims.svg)

# In[1]:

get_ipython().run_cell_magic('bash', '', '## download the data\nwget -q http://www.dereneaton.com/downloads/simddrads.zip\n## unzip the data\nunzip simddrads.zip\n')

# #### Notes on the data set

# These data were simulated with a size selection window of 50-250 bp, meaning that many of the reads will contain the far-end adapter sequence within the 100 bp sequenced read. I included this because it is a fairly common occurrence in empirical ddRAD data sets. I will show how to filter out these adapters in step 2 of the pyrad analysis.

# ---------------

# # Assembling the data set with _pyRAD_

# We first create an empty params.txt input file for the _pyRAD_ analysis. The following command creates a template which we will fill in with all relevant parameter settings for the analysis.

# In[2]:

get_ipython().run_cell_magic('bash', '', 'pyrad -n\n')

# -------------
#
# Take a look at the default options. Each line designates a parameter and contains a "##" symbol, after which comments can be added, including a description of the parameter. The affected step for each parameter is shown in parentheses. The first 14 parameters are required; numbers 15-37 are optional.

# In[3]:

get_ipython().run_cell_magic('bash', '', 'cat params.txt\n')

# ---------------
#
# ### Edit the params file
#
# I will use the script below to substitute new values, but you could simply __use any text editor__ to make the changes. For this analysis I made the following changes from the defaults:
#
# --------------------
#
# 6. set the two restriction enzymes used to generate the ddRAD data
# 10. lowered the clustering threshold to .85
# 11. set the datatype to ddrad
# 14. changed the output name prefix
# 19. set mismatches for demultiplexing to 0 (exact match)
# 24. raised maxH; lower is better for filtering paralogs
# 30. added additional output formats (e.g., nexus, SNPs, STRUCTURE)
#
# --------------------

# In[4]:

get_ipython().run_cell_magic('bash', '', "sed -i '/## 6. /c\\TGCAG,AATT ## 6. cutsites... ' ./params.txt\nsed -i '/## 10. /c\\.85 ## 10. lowered clust thresh' ./params.txt\nsed -i '/## 11. /c\\ddrad ## 11. datatype... ' ./params.txt\nsed -i '/## 14. /c\\c85d6m4p3 ## 14. prefix name ' ./params.txt\nsed -i '/## 19./c\\0 ## 19. errbarcode... ' ./params.txt\nsed -i '/## 24./c\\10 ## 24. maxH... ' ./params.txt\nsed -i '/## 30./c\\* ## 30. outformats... ' ./params.txt\n")
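# If you prefer not to use sed, the same substitutions can be scripted in Python. The cell below is an optional sketch of my own, not part of pyRAD itself: the replace_param() helper is hypothetical, and it simply swaps in a new value while keeping the "##" comment on the matched params.txt line.

# In[ ]:

def replace_param(params_file, key, value):
    """Swap in `value` on the params.txt line whose comment contains `key`."""
    with open(params_file) as infile:
        lines = infile.readlines()
    with open(params_file, "w") as outfile:
        for line in lines:
            if ("## %s" % key) in line:
                ## keep the "##" comment portion, replace only the value before it
                line = "%-20s %s" % (value, line[line.find("##"):])
            outfile.write(line)

## example usage, equivalent to the sed command for parameter 10 above
# replace_param("./params.txt", "10. ", ".85")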
# -----------------
#
# Let's take a look at the edited params.txt file.

# In[5]:

get_ipython().run_cell_magic('bash', '', 'cat params.txt\n')

# ## Step 1: De-multiplexing the data

# Your data will likely come to you non-demultiplexed (i.e., not sorted by which individual the reads belong to). You can demultiplex the reads according to their barcodes using _pyRAD_ or separate software. Here pyrad uses information from the barcodes file to demultiplex the reads.

# #### Take a look at the barcodes file

# In[6]:

get_ipython().run_cell_magic('bash', '', 'cat simddrads.barcodes\n')

# In[7]:

get_ipython().run_cell_magic('bash', '', 'pyrad -p params.txt -s 1\n')

# Now we can look at the stats output for this step:

# In[8]:

get_ipython().run_cell_magic('bash', '', 'cat stats/s1.sorting.txt\n')

# The de-multiplexed reads are written to a separate file for each individual in a new directory called fastq/, created within your working directory.

# In[9]:

get_ipython().run_cell_magic('bash', '', 'ls fastq/\n')

# An individual file will look like this:

# In[10]:

get_ipython().run_cell_magic('bash', '', '## I show the first 12 lines and 80 characters to print clearly here\nless fastq/1A0_R1.fq.gz | head -n 12 | cut -c 1-80\n')

# -------------------------
#
# __If your data were previously de-multiplexed__ you need the following things before step 2:
#
# + your sorted file names should be formatted similar to above, but with your sample names substituted for 1A0, 1A1, etc.
# + the files can be gzipped (.gz) or not (.fq or .fastq).
# + the barcode should be removed (not on the left side of reads)
# + the restriction site should _not_ be removed, but if it is, enter a '@' symbol before the location of your sorted data.
#
# + __Enter on line 18 of the params file the location of your sorted data.__
#
# --------------------

# ## Step 2: Quality filtering

# Next we apply the quality filtering.
#
# + Here we set the quality filter (param 21) to 1, meaning that it will check for the presence of Illumina adapters and trim them off. In combination, we also set the minimum length of kept reads after trimming (param 32) to 35.
#
# + In addition to filtering for adapters, this step also filters based on quality scores. Low quality sites are converted to Ns, and any read with more than X Ns is discarded, where X is the number set in param 9. A sketch of this logic is shown after the stats output below.

# In[11]:

get_ipython().run_cell_magic('bash', '', "sed -i '/## 21./c\\1 ## 21. filter setting ' ./params.txt\nsed -i '/## 32./c\\35 ## 32. min trim length' ./params.txt\n")

# In[12]:

get_ipython().run_cell_magic('bash', '', 'pyrad -p params.txt -s 2\n')

# Statistics for the number of reads that passed filtering can be found in the stats/ directory. You can see here that a few reads were filtered out erroneously due to detection of the common cutter overhang (AATT) adjacent to the beginning of what appeared to be the adapter sequence. In this case 3 loci (~60 read copies) per sample were excluded. About 200 (20%) loci had the adapter detected but were still retained after trimming.

# In[13]:

get_ipython().run_cell_magic('bash', '', 'cat stats/s2.rawedit.txt\n')
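# To build intuition for what the quality filter is doing, here is a minimal sketch (my own illustration, not pyRAD's actual code) of the two basic operations described above: masking low-quality base calls as Ns using the phred scores, and discarding reads that end up with too many Ns. The filter_read() helper and its thresholds are hypothetical.

# In[ ]:

def filter_read(seq, quals, min_qual=20, max_n=4):
    """Mask bases below min_qual as N; return None if too many Ns remain."""
    ## phred+33 encoded quality string -> integer scores
    scores = [ord(ch) - 33 for ch in quals]
    masked = "".join(b if s >= min_qual else "N" for b, s in zip(seq, scores))
    ## discard the read if it has more Ns than allowed (cf. param 9 in pyRAD)
    if masked.count("N") > max_n:
        return None
    return masked

## example: the two '#' (phred 2) positions are masked, the read is kept
print(filter_read("TGCAGAATTC", "III#IIIII#"))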
# The filtered data files are converted to fasta format and written to a directory called edits/, as shown below.

# In[14]:

get_ipython().run_cell_magic('bash', '', 'ls edits/\n')

# The edited reads are saved in fasta format in the edits/ directory.

# In[15]:

get_ipython().run_cell_magic('bash', '', 'head -n 8 edits/1A0.edit\n')

# --------------
#
# ## Step 3: Within-sample clustering

# Steps 3-7 proceed the same as for the original 'rad' datatype, with the only difference being that in step 7 the second cut site is trimmed from the end of loci when it is present.

# In[16]:

get_ipython().run_cell_magic('bash', '', 'pyrad -p params.txt -s 3 \n')

# In[18]:

get_ipython().run_cell_magic('bash', '', 'head -n 23 stats/s3.clusters.txt\n')

# ## Steps 4 & 5: Consensus base calling

# We next make consensus base calls for each cluster within each individual. First we estimate the error rate and heterozygosity within each sample:

# In[19]:

get_ipython().run_cell_magic('bash', '', 'pyrad -p params.txt -s 4\n')

# In[20]:

get_ipython().run_cell_magic('bash', '', 'pyrad -p params.txt -s 5\n')

# In[21]:

get_ipython().run_cell_magic('bash', '', 'cat stats/s5.consens.txt\n')

# ## Step 6: Across-sample clustering

# This step clusters consensus sequences across samples. It can take a long time for very large data sets; when run normally it prints its progress to the screen so you can judge how long it might take. This example takes less than a minute.

# In[22]:

get_ipython().run_cell_magic('bash', '', 'pyrad -p params.txt -s 6\n')

# ## Step 7: Statistics and output files

# This step aligns the loci, writes summary statistics, and outputs data files in the requested formats.

# In[23]:

get_ipython().run_cell_magic('bash', '', 'pyrad -p params.txt -s 7\n')

# Stats for this run. In this case 996 loci were shared across all 12 samples (remember that a few loci were filtered out when we were testing for adapter sequences).

# In[24]:

get_ipython().run_cell_magic('bash', '', 'cat stats/c85d6m4p3.stats\n')

# ### Output files

# In[25]:

get_ipython().run_cell_magic('bash', '', 'ls outfiles/\n')

# ### Stats files

# In[26]:

get_ipython().run_cell_magic('bash', '', 'ls stats/\n')
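# As a final, optional check, the .loci outfile is easy to parse directly. The sketch below is my own illustration, assuming the standard pyRAD .loci layout in which each locus has one ">sample sequence" line per sample and loci are separated by lines beginning with "//". It tallies sample coverage per locus, which should match the coverage summary in the stats file above.

# In[ ]:

from collections import Counter

coverage = Counter()
nsamples = 0
with open("outfiles/c85d6m4p3.loci") as infile:
    for line in infile:
        if line.startswith(">"):
            nsamples += 1
        elif line.startswith("//"):
            ## end of a locus: record how many samples it contained
            coverage[nsamples] += 1
            nsamples = 0

print("total loci: %d" % sum(coverage.values()))
print("loci shared by all 12 samples: %d" % coverage[12])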