Assembly and analysis of Ficus RAD-seq data

A RAD-seq library of 95 samples was prepared by Floragenex with the PstI restriction enzyme, followed by sonication and size selection. Stats reported by Floragenex include: AverageFragmentSize=386bp, Concentration=2.51ng/uL, Concentration=10nM. The library was sequenced on two lanes of Illumina HiSeq 3000, yielding 378,809,976 reads in lane 1 and 375,813,513 reads in lane 2, for a total of ~755M reads.

This notebook

This is a Jupyter notebook, a tool used to create an executable document that fully reproduces our analyses. This notebook contains all of the code to assemble the Ficus RAD-seq data set with ipyrad. We begin by demultiplexing the raw data (section The raw data). The demultiplexed data (will be) archived and available online. If you downloaded the demultiplexed data you can skip to the section The demultiplexed data and begin by loading in those data. The data were assembled under a range of parameter settings, which you can see in the Within-sample assembly section. Several Samples were filtered from the data set due to low coverage. The data were then clustered across Samples, and final output files were created, in the Across-sample assembly section.

Required software

The following conda commands will locally install all of the software required for this notebook.

In [2]:

## conda install ipyrad -c ipyrad
## conda install toytree -c eaton-lab

In [1]:

import ipyrad as ip
import toyplot

In [2]:

## print software versions (print() with one argument runs on Python 2 and 3)
print('ipyrad v.{}'.format(ip.__version__))
print('toyplot v.{}'.format(toyplot.__version__))

ipyrad v.0.6.20
toyplot v.0.14.4

Cluster info

I started an ipcluster instance on a 40-core workstation with the ipcluster command shown below. The cluster_info() command shows that ipyrad is able to find all 40 cores on the cluster.

In [3]:

##
## ipcluster start --n=40
##

In [4]:

#print ip.cluster_info()

The raw data

The data came to us as two large 20GB files. The barcodes file was provided by Floragenex and maps sample names to barcodes, which are contained inline in the sequences and are 10bp in length. The barcodes are printed a little further below. I ran FastQC on the raw data files as a quality check; the results are available here lane1-fastqc and here lane2-fastqc. Overall, quality scores were very high and there was little (but some) adapter contamination, which we will filter out in the ipyrad analysis.

In [30]:

## The reference genome link
reference = """\
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/\
002/002/945/GCA_002002945.1_F.carica_assembly01/\
GCA_002002945.1_F.carica_assembly01_genomic.fna.gz\
"""

## Download the reference genome of F. carica
# ! wget $reference

## decompress it
# ! gunzip ./GCA_002002945.1_F.carica_assembly01_genomic.fna.gz
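
The trailing backslashes inside the triple-quoted string above suppress the line breaks, so `reference` ends up as a single-line URL that can be handed to wget. The quick check below makes that explicit without downloading anything; the `filename` variable is introduced here only for illustration.

```python
## the same string literal as above: each trailing backslash joins
## the next line, so no newline characters remain in the URL
reference = """\
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/\
002/002/945/GCA_002002945.1_F.carica_assembly01/\
GCA_002002945.1_F.carica_assembly01_genomic.fna.gz\
"""

## the file name that the download would be saved under
filename = reference.rsplit("/", 1)[-1]
print(filename)
```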

In [8]:

## Locations of the raw data and barcodes file
lane1data = "~/Documents/RADSEQ_DATA/Ficus/Ficus-1_S1_L001_R1_001.fastq.gz"
lane2data = "~/Documents/RADSEQ_DATA/Ficus/Ficus-2_S2_L002_R1_001.fastq.gz"
barcodes = "~/Documents/RADSEQ_DATA/barcodes/Ficus_Jander_2016_95barcodes.txt"

Create an ipyrad Assembly object for each lane of data

We set the location of the data and the barcodes file for each object, and set the max barcode mismatch parameter to zero (strict), allowing no mismatches. You can see the full barcode information at this link.

In [9]:

## create an object to demultiplex each lane
demux1 = ip.Assembly("lane1")
demux2 = ip.Assembly("lane2")

## set path to data, bcodes, and max_mismatch params
demux1.set_params("project_dir", "./ficus_demux_reads")
demux1.set_params("raw_fastq_path", lane1data)
demux1.set_params("barcodes_path", barcodes)
demux1.set_params("max_barcode_mismatch", 0)

## set path to data, bcodes, and max_mismatch params
demux2.set_params("project_dir", "./ficus_demux_reads")
demux2.set_params("raw_fastq_path", lane2data)
demux2.set_params("barcodes_path", barcodes)
demux2.set_params("max_barcode_mismatch", 0)

  New Assembly: lane1
  New Assembly: lane2
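
To make the zero-mismatch setting concrete, here is a minimal pure-Python sketch of assigning a read to a Sample by its inline barcode. It is not how ipyrad implements demultiplexing; the two barcodes, the sample names, and the `assign_sample` function are all hypothetical. The reads start with a 10bp barcode followed by the PstI overhang (TGCAG).

```python
## hypothetical 10bp inline barcodes -> sample names (not the real file)
barcodes = {
    "AACCGGTTAC": "sample_A",
    "TTGGCCAATG": "sample_B",
}

def assign_sample(read, barcodes, max_mismatch=0):
    """Return the sample whose barcode matches the start of the read."""
    for bcode, name in barcodes.items():
        prefix = read[:len(bcode)]
        ## a read shorter than the barcode can never match
        if len(prefix) < len(bcode):
            continue
        ## count mismatched positions between barcode and read prefix
        diffs = sum(1 for a, b in zip(bcode, prefix) if a != b)
        if diffs <= max_mismatch:
            return name
    return None

## exact barcode match followed by the restriction overhang
print(assign_sample("AACCGGTTACTGCAGGATTACA", barcodes))

## one mismatch in the barcode: unassigned under the strict setting
print(assign_sample("AACCGGTTAGTGCAGGATTACA", barcodes))
```

With `max_mismatch=1` the second read would be assigned to sample_A. The strict setting trades some reads for a lower chance of assigning a read to the wrong Sample.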

Demultiplex raw data from both lanes

In [12]:

demux1.run("1")
demux2.run("1")

  Assembly: lane1
  [####################] 100%  chunking large files  | 0:21:55 | s1 | 
  [####################] 100%  sorting reads         | 0:43:43 | s1 | 
  [####################] 100%  writing/compressing   | 0:14:25 | s1 | 

  Assembly: lane2
  [####################] 100%  chunking large files  | 0:27:44 | s1 | 
  [####################] 100%  sorting reads         | 0:45:12 | s1 | 
  [####################] 100%  writing/compressing   | 0:15:53 | s1 |

The demultiplexed data

Now we have two directories of demultiplexed data, each containing one gzipped fastq file per Sample that holds all of the reads from that lane of sequencing matching the Sample's barcode. These are the data that (will be uploaded) to Genbank SRA when we publish. We load the sorted fastq data at this step, following the same procedure one would use when starting from the demultiplexed data.

In [18]:

lib1_fastqs = "./ficus_demux_reads/lane1_fastqs/*.gz"
lib2_fastqs = "./ficus_demux_reads/lane2_fastqs/*.gz"

In [19]:

lib1 = ip.Assembly("lib1", quiet=True)
lib1.set_params("sorted_fastq_path", lib1_fastqs)
lib1.run("1")

lib2 = ip.Assembly("lib2", quiet=True)
lib2.set_params("sorted_fastq_path", lib2_fastqs)
lib2.run("1")

  Assembly: lib1
  [####################] 100%  loading reads         | 0:01:47 | s1 | 

  Assembly: lib2
  [####################] 100%  loading reads         | 0:00:34 | s1 |
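
If you are starting from the demultiplexed files, it can be useful to sanity check them before loading. The sketch below counts reads per gzipped fastq file (four lines per read) using only the standard library. It writes a tiny two-read example file to a temporary directory so that it runs anywhere; the file name and the `count_reads` helper are hypothetical.

```python
import glob
import gzip
import os
import tempfile

def count_reads(path):
    """Count reads in a gzipped fastq file (4 lines per read)."""
    with gzip.open(path, "rb") as infile:
        nlines = sum(1 for _ in infile)
    return nlines // 4

## write a tiny two-read example file so the sketch is self-contained
tmpdir = tempfile.mkdtemp()
example = os.path.join(tmpdir, "sampleA_R1_.fastq.gz")
with gzip.open(example, "wb") as out:
    out.write(b"@r1\nTGCAGAAT\n+\nIIIIIIII\n@r2\nTGCAGCCG\n+\nIIIIIIII\n")

## the same kind of glob pattern as lib1_fastqs above
counts = {os.path.basename(f): count_reads(f)
          for f in glob.glob(os.path.join(tmpdir, "*.gz"))}
print(counts)
```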

Merge the two lanes of data into one Assembly

We join the two demultiplexed libraries into a single Assembly with the merge() command in ipyrad. On this merged Assembly we then set the parameters that we will use to assemble the data set.

In [21]:

## merge both lanes into one Assembly named "merged"
data = ip.merge("merged", [demux1, demux2])

## set several non-default parameters
data.set_params("project_dir", "analysis-ipyrad")
data.set_params("filter_adapters", 3)
data.set_params("phred_Qscore_offset", 43)
data.set_params("max_Hs_consens", (5, 5))
data.set_params("max_shared_Hs_locus", 4)
data.set_params("filter_min_trim_len", 60)
data.set_params("trim_loci", (0, 8, 0, 0))
data.set_params("output_formats", list("lksapnv"))

## print parameters for posterity's sake
data.get_params()

  0   assembly_name               merged                                       
  1   project_dir                 ./analysis-ipyrad                            
  2   raw_fastq_path              Merged: lane1, lane2                         
  3   barcodes_path               Merged: lane1, lane2                         
  4   sorted_fastq_path           Merged: lane1, lane2                         
  5   assembly_method             denovo                                       
  6   reference_sequence                                                       
  7   datatype                    rad                                          
  8   restriction_overhang        ('TGCAG', '')                                
  9   max_low_qual_bases          5                                            
  10  phred_Qscore_offset         43                                           
  11  mindepth_statistical        6                                            
  12  mindepth_majrule            6                                            
  13  maxdepth                    10000                                        
  14  clust_threshold             0.85                                         
  15  max_barcode_mismatch        0                                            
  16  filter_adapters             3                                            
  17  filter_min_trim_len         60                                           
  18  max_alleles_consens         2                                            
  19  max_Ns_consens              (5, 5)                                       
  20  max_Hs_consens              (5, 5)                                       
  21  min_samples_locus           4                                            
  22  max_SNPs_locus              (20, 20)                                     
  23  max_Indels_locus            (8, 8)                                       
  24  max_shared_Hs_locus         4                                            
  25  trim_reads                  (0, 0, 0, 0)                                 
  26  trim_loci                   (0, 8, 0, 0)                                 
  27  output_formats              ('l', 'k', 's', 'a', 'p', 'n', 'v')          
  28  pop_assign_file
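
One of the non-default settings above is phred_Qscore_offset=43. As I understand the ipyrad parameter documentation, a base is treated as low quality when its score after subtracting the offset falls below 20, so raising the offset from 33 to 43 on standard phred+33 data amounts to requiring Q30 instead of Q20. The sketch below illustrates that arithmetic under this assumption; `count_low_quality` is a hypothetical helper, not an ipyrad function.

```python
def count_low_quality(qual_string, offset=33, min_score=20):
    """Count bases whose (ASCII code - offset) falls below min_score."""
    return sum(1 for char in qual_string if ord(char) - offset < min_score)

## a phred+33 quality string: 'I' is Q40, '5' is Q20, '+' is Q10
quals = "IIII5555++"

## default offset: only the two Q10 bases count as low quality
low_default = count_low_quality(quals, offset=33)

## offset 43: the four Q20 bases now count as low quality too
low_strict = count_low_quality(quals, offset=43)

print("offset 33: {} low-quality bases".format(low_default))
print("offset 43: {} low-quality bases".format(low_strict))
```

With max_low_qual_bases set to 5, a read like this one would pass under the default offset but exceed the limit under the stricter one.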

Create branch `ficus` and drop the control Sample

First we will drop the control sequence included for quality checking by Floragenex (FGXCONTROL). To do this we create a new branch using the argument subsamples to include all Samples except FGXCONTROL.

In [22]:

## drop the Floragenex control sample if it is in the data
snames = [i for i in data.samples if i != "FGXCONTROL"]

## working branch
data = data.branch("ficus", subsamples=snames)

A summary of the number of reads per Sample

In [23]:

print("summary of raw read coverage")
print(data.stats.reads_raw.describe().astype(int))

summary of raw read coverage
count          95
mean      5149596
std       7716026
min         14703
25%        289541
50%       1890328
75%       7402440
max      51339646
Name: reads_raw, dtype: int64
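
The summary can be combined with the lane totals reported at the top of the notebook to estimate what fraction of the raw reads was assigned to the 95 Samples. This is only approximate: the mean above is truncated to an integer, and reads assigned to FGXCONTROL are not included.

```python
## read totals reported for the two lanes
lane1_reads = 378809976
lane2_reads = 375813513
total_raw = lane1_reads + lane2_reads

## count and (integer-truncated) mean from the summary above
nsamples = 95
mean_reads = 5149596
total_assigned = nsamples * mean_reads

## share of raw reads that ended up in one of the 95 Samples
fraction = float(total_assigned) / total_raw

print("raw reads:      {}".format(total_raw))
print("assigned reads: {}".format(total_assigned))
print("fraction:       {:.3f}".format(fraction))
```

Roughly 65% of the raw reads are accounted for by the 95 Samples. The median (1.9M) sits far below the mean (5.1M), so coverage is strongly skewed toward a few Samples.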

Filtering options

From looking closely at the data it appears there are some poor-quality reads with adapter contamination, and also some conspicuous long strings of poly repeats, which are probably due to the library being loaded on the sequencer at the wrong concentration (the facility failed to do a qPCR quantification). Setting the filter_adapters parameter in ipyrad to strict (2) uses 'cutadapt' to filter the reads. By default ipyrad would look just for the Illumina universal adapter, but I'm also adding a few additional poly-{A,C,G,T} sequences to be trimmed (filter_adapters was set to 3 on the merged Assembly above). These appeared to be somewhat common in the raw data, followed by nonsense.

In [24]:

## run step 2
data.run("2", force=True)

  Assembly: ficus
  [####################] 100%  concatenating inputs  | 0:02:43 | s2 | 
  [####################] 100%  processing reads      | 0:51:52 | s2 |
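
To illustrate the idea behind trimming poly repeats, here is a simplified sketch that cuts a read at the first long run of a single base and discards it if the remainder is shorter than filter_min_trim_len (60, as set above). This is not the algorithm cutadapt uses, which performs error-tolerant adapter alignment; the run length of 8 and the `trim_poly` helper are assumptions made for the example.

```python
import re

## a run of 8 or more identical bases (the length 8 is an assumption)
POLY_RUN = re.compile(r"A{8,}|C{8,}|G{8,}|T{8,}")

def trim_poly(read, min_len=60):
    """Cut the read at the first poly-repeat; drop it if too short."""
    match = POLY_RUN.search(read)
    if match:
        read = read[:match.start()]
    return read if len(read) >= min_len else None

## an 85bp read with no repeat is returned unchanged
good = "TGCAG" + "ACGT" * 20

## a read with a poly-G run starting at position 25, then nonsense
bad = "TGCAG" + "ACGT" * 5 + "G" * 12 + "ACGT" * 10

print(trim_poly(good) == good)
print(trim_poly(bad))
```

The second read is cut to 25bp, which is below the minimum length, so it is dropped entirely.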

Within-sample assembly

Steps 2-5 of ipyrad filter and cluster reads and call consensus haplotypes within each Sample. We'll look more closely at the stats for each step after it has finished.

Reference & de novo assemblies with ipyrad

In [33]:

## create new branches for assembly method
ficus_d = data.branch("ficus_d")
ficus_r = data.branch("ficus_r")

## set reference info
reference = "GCA_002002945.1_F.carica_assembly01_genomic.fna"
ficus_r.set_params("reference_sequence", reference)
ficus_r.set_params("assembly_method", "reference")

In [ ]:

## map reads to reference genome
ficus_r.run("3", force=True)

In [35]:

## cluster reads denovo
ficus_d.run("3")

  Assembly: ficus_d
  [####################] 100%  dereplicating         | 0:04:31 | s3 | 
  [####################] 100%  clustering            | 0:13:54 | s3 | 
  [####################] 100%  building clusters     | 0:00:46 | s3 | 
  [####################] 100%  chunking              | 0:00:11 | s3 | 
  [####################] 100%  aligning              | 0:23:58 | s3 | 
  [####################] 100%  concatenating         | 0:00:14 | s3 |

In [7]:

ficus_d.run("4")

  Assembly: ficus_d
  [####################] 100%  inferring [H, E]      | 0:14:01 | s4 |

Branch to make consensus calls at different mindepth settings

Now that the reads are filtered and clustered within each Sample we want to try applying several different parameter settings for downstream analyses. One major difference will be in the minimum depth of sequencing we require to make a confident base call. We will leave one Assembly with the default setting of 6, which is somewhat conservative. We will also create a 'lowdepth' Assembly that allows base calls at much lower depths (mindepth_majrule is set to 1 in the code below).

In [6]:

ficus_dhi = ficus_d.branch("ficus_dhi")
ficus_dlo = ficus_d.branch("ficus_dlo")
ficus_dlo.set_params("mindepth_majrule", 1)

In [7]:

ficus_dlo.run("5")
ficus_dhi.run("5")

  Assembly: ficus_dlo
  [####################] 100%  calculating depths    | 0:00:23 | s5 | 
  [####################] 100%  chunking clusters     | 0:00:21 | s5 | 
  [####################] 100%  consens calling       | 0:15:17 | s5 | 

  Assembly: ficus_dhi
  [####################] 100%  calculating depths    | 0:00:23 | s5 | 
  [####################] 100%  chunking clusters     | 0:00:21 | s5 | 
  [####################] 100%  consens calling       | 0:14:24 | s5 |
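
The effect of the minimum depth setting can be pictured with a toy majority-rule caller. This is only a sketch of the depth threshold: ipyrad's consensus calling also uses the heterozygosity and error rate estimates from step 4 and can call heterozygous sites, none of which is modeled here. The `majrule_call` function and the example site are hypothetical.

```python
from collections import Counter

def majrule_call(bases, mindepth):
    """Return the most common base, or 'N' if depth is below mindepth."""
    if len(bases) < mindepth:
        return "N"
    return Counter(bases).most_common(1)[0][0]

## a site covered by only three reads
site = ["A", "A", "G"]

## too shallow under the default setting of 6
print(majrule_call(site, mindepth=6))

## called under the low-depth setting
print(majrule_call(site, mindepth=1))
```

Lowering the threshold recovers more sites in poorly sequenced Samples, at the cost of calls supported by very few reads.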

Plot consens reads

Compare the high-depth (ficus_dhi) and low-depth (ficus_dlo) assemblies. The difference is not actually that great. Regardless, the samples with very few reads are going to recover very few clusters.

In [9]:

import numpy as np
import toyplot

## stack columns of consens stats
zero = np.zeros(ficus_dhi.stats.shape[0])
upper = ficus_dhi.stats.reads_consens
lower = -1*ficus_dlo.stats.reads_consens
boundaries = np.column_stack((lower, zero, upper))

## plot barplots
canvas = toyplot.Canvas(width=700, height=300)
axes = canvas.cartesian()
axes.bars(boundaries, baseline=None)
axes.y.ticks.show = True
axes.y.ticks.labels.angle = -90

Cluster data across Samples

In [14]:

## run step 6 on the low-depth and high-depth assemblies
ficus_dlo.run("6")
ficus_dhi.run("6")

  Assembly: ficus_dlo
  [####################] 100%  concat/shuffle input  | 0:00:51 | s6 | 
  [####################] 100%  clustering across     | 3:26:51 | s6 | 
  [####################] 100%  building clusters     | 0:00:40 | s6 | 
  [####################] 100%  aligning clusters     | 0:03:15 | s6 | 
  [####################] 100%  database indels       | 0:01:00 | s6 | 
  [####################] 100%  indexing clusters     | 0:07:29 | s6 | 
  [####################] 100%  building database     | 0:51:26 | s6 | 

  Assembly: ficus_dhi
  [####################] 100%  concat/shuffle input  | 0:00:45 | s6 | 
  [####################] 100%  clustering across     | 2:05:50 | s6 | 
  [####################] 100%  building clusters     | 0:00:37 | s6 | 
  [####################] 100%  aligning clusters     | 0:02:56 | s6 | 
  [####################] 100%  database indels       | 0:00:50 | s6 | 
  [####################] 100%  indexing clusters     | 0:06:03 | s6 | 
  [####################] 100%  building database     | 0:41:25 | s6 |

Create branches that include/exclude Samples with little data

We have several Samples that recovered very little data, probably as a result of having low quality DNA extractions. Figs are hard. We'll assemble one data set that includes all of these samples, but since they are likely to have little information we'll also assemble most of our data sets without these low data samples.
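
Choosing which Samples to exclude comes down to a threshold on the amount of data per Sample. A minimal sketch of that selection follows, using hypothetical sample names with read counts borrowed from the quantiles in the summary above. The cutoff of 100,000 reads is also hypothetical; the value actually used is not stated in this section.

```python
## hypothetical sample names and raw read counts (not the real data)
reads_raw = {
    "sample_A": 14703,
    "sample_B": 289541,
    "sample_C": 1890328,
    "sample_D": 7402440,
}

## hypothetical cutoff for "little data"
min_reads = 100000

## names to keep and to drop, in a stable order
keep = sorted(name for name, n in reads_raw.items() if n >= min_reads)
drop = sorted(set(reads_raw) - set(keep))

print("keep: {}".format(keep))
print("drop: {}".format(drop))
```

A list like `keep` could then be passed as the subsamples argument of branch(), in the same way the FGXCONTROL Sample was dropped earlier.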