Assembling simulated and empirical data

This Jupyter notebook provides a completely reproducible record of all the assembly analyses for the ipyrad manuscript. In this notebook we assemble multiple data sets using five different software programs: ipyrad, pyrad, stacks, aftrRAD, and dDocent. We include executable code to download and install each program, to download each data set, and to run each data set through each program.

Jupyter notebook SSH Tunneling

This notebook was executed on the Yale farnam cluster with access to 40 compute cores on a single node. We performed local I/O in the notebook using SSH Tunneling as described in the ipyrad Documentation (http://ipyrad.readthedocs.io/HPC_Tunnel.html).

Organize the directory structure

In [1]:
import shutil
import glob
import sys
import os
In [2]:
## Set the scratch dir for data analysis
WORK_DIR = "/ysm-gpfs/scratch60/de243/manuscript-analysis"
if not os.path.exists(WORK_DIR):
    os.makedirs(WORK_DIR)
    
## Set a local dir for new (locally) installed software
SOFTWARE = os.path.realpath("./tmp-software")
if not os.path.exists(SOFTWARE):
    os.makedirs(SOFTWARE)
    
## Make paths to data (we download data into these dirs later)
EMPIRICAL_DATA_DIR = os.path.join(WORK_DIR, "empirical_data")
if not os.path.exists(EMPIRICAL_DATA_DIR):
    os.makedirs(EMPIRICAL_DATA_DIR)
    
SIMULATED_DATA_DIR = os.path.join(WORK_DIR, "simulated_data")
if not os.path.exists(SIMULATED_DATA_DIR):
    os.makedirs(SIMULATED_DATA_DIR)
In [3]:
## create dirs for results from each software
PACKAGES = ["ipyrad", "pyrad", "stacks", "ddocent"]
rdirs = {}
for pack in PACKAGES:
    rdir = os.path.join(WORK_DIR, "assembly-{}".format(pack))
    if not os.path.exists(rdir):
        os.makedirs(rdir)
    rdirs[pack] = rdir

## create subdirectories for results
for key, val in rdirs.iteritems():
    ## make for sim and real data
    real = os.path.join(val, "REALDATA")
    sim = os.path.join(val, "SIMDATA")
    for idir in [real, sim]:
        if not os.path.exists(idir):
            os.makedirs(idir)
            
    ## extra subdirectories for stacks
    if key == "stacks":
        extras = [os.path.join(real, "gapped"), 
                  os.path.join(real, "ungapped"), 
                  os.path.join(real, "default")]
        for idir in extras:
            if not os.path.exists(idir):
                os.makedirs(idir)
In [4]:
## create simulated data directories
SIMDATA = ["simno", "simlo", "simhi", "simla"]
sdirs = {}
for simd in SIMDATA:
    sdir = os.path.join(SIMULATED_DATA_DIR, simd)
    if not os.path.exists(sdir):
        os.makedirs(sdir)
    sdirs[simd] = sdir

Download the empirical RAD Pedicularis data set

This is a RAD data set for 13 taxa from Eaton and Ree (2013) open access link. Here we grab the demultiplexed fastq data from a public Dropbox link, but the same data is also hosted on the NCBI SRA as SRP021469.

In [12]:
%%bash -s $EMPIRICAL_DATA_DIR --err stderr --out stdout

## the location of the data 
dataloc="https://dl.dropboxusercontent.com/u/2538935/example_empirical_rad.tar.gz"

## grab data from the public link
curl -LkO $dataloc 
tar -xvzf example_empirical_rad.tar.gz
rm example_empirical_rad.tar.gz

## move data to the EMPIRICAL_DATA_DIR
mv ./example_empirical_rad/ $1/Pedicularis

Download the empirical Heliconius reference genome¶

Davey, John W., et al. "Major improvements to the Heliconius melpomene genome assembly used to confirm 10 chromosome fusion events in 6 million years of butterfly evolution." G3: Genes| Genomes| Genetics 6.3 (2016): 695-708.

In [11]:
%%bash -s $EMPIRICAL_DATA_DIR --err stderr --out stdout

## location of the data
dataloc="http://butterflygenome.org/sites/default/files/Hmel2-0_Release_20151013.tgz"

## grab data from the public link 
curl -LkO $dataloc
tar -zxvf Hmel2-0_Release_20151013.tgz
rm Hmel2-0_Release_20151013.tgz

## move data to EMPIRICAL_DATA_DIR
mv ./Hmel2/ $1/Heliconius

Install all the software

First install a new isolated miniconda directory

We could create a separate environment in an existing conda installation, but this is simpler since it works for users whether they have conda installed or not, and we can easily remove all the software at the end when we are done by simply deleting the 'tmp-software' directory. So all of the work we do in this notebook will be done in a completely isolated software environment.

In [13]:
%%bash -s $SOFTWARE --err stderr --out stdout

## Fetch the latest miniconda 
miniconda="https://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh"
curl -LkO $miniconda

## Install the new miniconda silently into the 'software' directory.
## -b = batch mode; -f = force overwrite; -p = install dir
bash Miniconda-latest-Linux-x86_64.sh -b -f -p $1

## update conda, if this fails your other conda installation may be 
## getting in the way. Try clearing your ~/.condarc file.
$1/bin/conda update -y conda

## remove the install file
rm Miniconda-latest-Linux-x86_64.sh
In [23]:
%%bash -s $SOFTWARE 
## check installation
$1/bin/conda --version
conda 4.3.4

Install ipyrad (Eaton & Overcast 2017)

In [18]:
%%bash -s $SOFTWARE --err stderr --out stdout
## installs the latest release (silently)
$1/bin/conda install -y -c ipyrad ipyrad
In [22]:
%%bash -s $SOFTWARE 
## check installation
$1/bin/ipyrad -v
ipyrad 0.5.15

Install stacks (Catchen)

In [25]:
%%bash -s $SOFTWARE --err stderr --out stdout
## installs the latest release (silently)
$1/bin/conda install -y -c bioconda stacks
In [28]:
%%bash -s $SOFTWARE 
## check installation
$1/bin/ustacks -v
ustacks 1.44

Install ddocent (Puritz)

In [29]:
%%bash -s $SOFTWARE --err stderr --out stdout
## install all of the ddocent reqs
$1/bin/conda install -y -c conda-forge unzip  
$1/bin/conda install -y -c bioconda ddocent=2.2.4
In [42]:
%%bash -s $SOFTWARE --err stderr
## when running dDocent you have to always set the tmp-software
## directory to the top of your PATH
export PATH="$1/bin:$PATH"

## check installation (no -v argument, print start of stdout)
$1/bin/dDocent | head -n 1
dDocent 2.24 

Install pyrad (Eaton 2014)

In [49]:
%%bash -s $SOFTWARE --err stderr --out stdout

## Should be unnecessary because numpy and scipy already installed
$1/bin/conda install numpy scipy

## get muscle binary
muscle="http://www.drive5.com/muscle/downloads3.8.31/muscle3.8.31_i86linux64.tar.gz"
curl -LkO $muscle 
tar -xvzf muscle*.tar.gz 
mv muscle3.8.31_i86linux64 $1/bin/muscle
rm muscle3.8.31_i86linux64.tar.gz

## Download and install vsearch
vsearch="https://github.com/torognes/vsearch/releases/download/v2.0.3/vsearch-2.0.3-linux-x86_64.tar.gz"
curl -LkO $vsearch
tar xzf vsearch-2.0.3-linux-x86_64.tar.gz
mv vsearch-2.0.3-linux-x86_64/bin/vsearch $1/bin/vsearch
rm -r vsearch-2.0.3-linux-x86_64/ vsearch-2.0.3-linux-x86_64.tar.gz

## Fetch pyrad source from git repository 
if [ ! -d ./pyrad-git ]; then
  git clone https://github.com/dereneaton/pyrad.git pyrad-git
fi;

## and install to conda using pip
cd ./pyrad-git
$1/bin/pip install -e . 
cd ..
In [51]:
%%bash -s $SOFTWARE 
## check installation
$1/bin/pyrad --version
pyRAD 3.0.66

Simulate data

We will use simrrls to generate some simulated RAD-seq data for testing. This is a program that was written by Deren Eaton and is available on github: github.com/dereneaton/simrrls.git. simrrls requires the python egglib module, which is a pain to install in full, but we only need the simulation aspects of it, which are fairly easy to install. See below.

In [57]:
%%bash -s $SOFTWARE --err stderr --out stdout

## install gsl
$1/bin/conda install gsl -y 

## fetch egglib cpp and mv into tmp-software/pkgs
eggcpp="http://mycor.nancy.inra.fr/egglib/releases/2.1.11/egglib-cpp-2.1.11.tar.gz"
curl -LkO $eggcpp
tar -xzvf egglib-cpp-2.1.11.tar.gz 
mv egglib-cpp-2.1.11/ $1/pkgs/

## install egglib-cpp
cd $1/pkgx/egglib-cpp-2.1.11/
sh ./configure --prefix=$1
make
make install 
cd -

## fetch egglib py module and mv into tmp-software/pkgs
eggpy="http://mycor.nancy.inra.fr/egglib/releases/2.1.11/egglib-py-2.1.11.tar.gz"
curl -LkO $eggpy 
tar -xvzf egglib-py-2.1.11.tar.gz
mv egglib-py-2.1.11/ $1/pkgs/

## install egglib-py
cd $1/pkgs/egglib-py-2.1.11/
$1/bin/python setup.py build --prefix=$1
$1/bin/python setup.py install --prefix=$1
cd -

## clone simrrls into tmp-software/pkgs/
if [ ! -d $1/pkgs/simrrls ]; then
  git clone https://github.com/dereneaton/simrrls.git $1/pkgs/simrrls
fi

## install with tmp-software/bin/pip
cd $1/pkgs/simrrls
$1/bin/pip install -e .
cd -
In [68]:
%%bash -s $SOFTWARE 
## check installations
echo 'egglib' $($1/bin/egglib | grep Version)
$1/bin/simrrls --version
egglib Version number: 2.1.11
simrrls 0.0.13

Simulating different RAD-datasets

Both pyRAD and stacks have undergone a lot of work since the original pyrad analysis. Because improvements have been made we want to test performance of all the current pipelines and be able to compare current to past performance. We'll follow the original pyRAD manuscript analysis (Eaton 2013) by simulating modest sized datasets with variable amounts of indels. We'll also simulate one much larger dataset. Also, because stacks has since included an option for handling gapped analysis we'll test both gapped and ungapped assembly.

Tuning simrrls indel parameter

The -I parameter for simrrls has changed since the initial pyrad manuscript, so the we had to explore new values for this parameter that will approximate the number of indels we are after. I figured out a way to run simrrls and pipe the output to muscle to get a quick idea of the indel variation for different params. This gives a good idea of how many indel bearing seqs are generated.

In [12]:
# simrrls -n 1 -L 1 -I 1 -r1 $RANDOM 2>&1 | \
#                             grep 0 -A 1 | \
#                             tr '@' '>' | \
#                             muscle | grep T | head -n 60
In [13]:
# for i in {1..50}; 
#   do simrrls -n 1 -L 1 -I .05 -r1 $RANDOM 2>&1 | \
#                                   grep 0 -A 1 | \
#                                   tr '@' '>' | \
#                                   muscle |  \
#                                   grep T | \
#                                   head -n 40 >> rpt.txt;
# done
# grep "-" rpt.txt | wc -l

From experimentation:

-I value -- %loci w/ indels

0.02 -- ~10%
0.05 -- ~15%
0.10 -- ~25%

The simulated data will live in these directories:

SIM_NO_DIR = WORK_DIR/simulated_data/simno
SIM_LO_DIR = WORK_DIR/simulated_data/simlo
SIM_HI_DIR = WORK_DIR/simulated_data/simhi
SIM_LARGE_DIR = WORK_DIR/simulated_data/simlarge

Timing:

10K loci -- ~8MB -- ~ 2 minutes
100K loci -- ~80MB -- ~ 20 minutes

Call simrrls to simulate data

In [70]:
%%bash -s $SOFTWARE $SIMULATED_DATA_DIR

## call simrrls (No INDELS)
$1/bin/simrrls -o $2/simno/simno -ds 2 -L 10000 -I 0 

## call simrrls (Low INDELS)
$1/bin/simrrls -o $2/simlo/simlo -ds 2 -L 10000 -I 0.02

## call simrrls (High INDELS) 
$1/bin/simrrls -o $2/simhi/simhi -ds 2 -L 10000 -I 0.05

## call simrls on Large data set (this takes a few hours, sorry!)
## (30x12=360 Individuals at 100K loci) = gzipped 2.2GB
$1/bin/simrrls -o $2/simla/Big_i360_L100K -ds 0 -L 100000 -n 30

Demultiplex all four libraries

We will start each analysis from the demultiplexed data files, since this is commonly what's available when data are downloaded from NCBI. We use ipyrad to demultiplex the data.

In [ ]:
import ipyrad as ip

for name in ["simno", "simlo", "simhi", "simla"]:
    ## demultiplex the library
    data = ip.Assembly(name, quiet=True)
    pdir = os.path.join(SIMULATED_DATA_DIR, name)
    data.set_params("project_dir", pdir)
    data.set_params("raw_fastq_path", os.path.join(pdir, "*.gz"))
    data.set_params("barcodes_path", os.path.join(pdir, "*barcodes.txt"))
    data.run("1")
    print "demultiplexed fastq data written to: {}".format(data.dirs.fastqs)
  Assembly: simno
  [####################] 100%  chunking large files  | 0:00:00 | s1 | 
  [####################] 100%  sorting reads         | 0:00:28 | s1 | 
  [####################] 100%  writing/compressing   | 0:00:04 | s1 | 

  Assembly: simlo
  [####################] 100%  chunking large files  | 0:00:00 | s1 | 
  [####################] 100%  sorting reads         | 0:00:28 | s1 | 
  [####################] 100%  writing/compressing   | 0:00:04 | s1 | 

  Assembly: simhi
  [####################] 100%  chunking large files  | 0:00:00 | s1 | 
  [####################] 100%  sorting reads         | 0:00:30 | s1 | 
  [####################] 100%  writing/compressing   | 0:00:04 | s1 | 

  Assembly: simla
  [                    ]   0%  chunking large files  | 0:03:38 | s1 | 
In [ ]:
print "demultiplexed fastq data written to: {}".format(data.dirs.fastqs)

Assemble simulated data sets

ipyrad : simulated data assembly

Results
Data set Cores Loci Time
simno 4 10000 13:58
simlo 4 10000 15:17
simhi 4 10000 13:58
simla 4 100000 13:38
simla 80 100000 ...

In [ ]:
## A function to run ipyrad on a given data set

def run_ipyrad(name):
    ## create Assembly
    data = ip.Assembly(name)

    ## set data paths and params
    data.set_params("project_dir", 
        os.path.join(SIMULATED_DATA_DIR, name))
    data.set_params("sorted_fastq_path", 
        os.path.join(SIMULATED_DATA_DIR, name, name+"_fastqs", "*.gz"))
    data.set_params('max_low_qual_bases', 4)
    data.set_params('filter_min_trim_len', 69)
    data.set_params('max_Ns_consens', (99,99))
    data.set_params('max_Hs_consens', (99,99))
    data.set_params('max_SNPs_locus', (100, 100))
    data.set_params('min_samples_locus', 2)
    data.set_params('max_Indels_locus', (99,99))
    data.set_params('max_shared_Hs_locus', 99)

    ## run ipyrad steps 1-7
    data.run("1234567", show_cluster=True)
    return data
In [ ]:
%%timeit -n1 -r1 
## this records time to run the code in this cell once
run_ipyrad("simno")
In [ ]:
%%timeit -n1 -r1 
## this records time to run the code in this cell once
run_ipyrad("simlo")
In [ ]:
%%timeit -n1 -r1 
## this records time to run the code in this cell once
run_ipyrad("simhi")
In [ ]:
%%timeit -n1 -r1 
## this records time to run the code in this cell once
run_ipyrad("simla")

stacks: simulated data assembly

In [ ]: