A RAD-seq library of 95 samples was prepared by Floragenex with the PstI restriction enzyme, followed by sonication and size selection. Stats reported by Floragenex include: AverageFragmentSize=386bp, Concentration=2.51ng/uL, Concentation=10nM. The library was sequenced on two lanes of Illumina HiSeq 3000 yielding 378,809,976 reads in lane 1, and 375,813,513 reads in lane 2, for a total of ~755M reads.
This is a jupyter notebook, a tool used to create an executable document to full reproduce our analyses. This notebook contains all of the code to assemble the Ficus RAD-seq data set with ipyrad. We begin by demultiplexing The raw data. The demultiplexed data (will be) archived and available online. If you downloaded the demultiplexed data you can skip to section The Demultiplexed Data and begin by loading in those data. The data were assembled under a range of parameter settings, which you can see in the Within-sample assembly section. Several Samples were filtered from the data set due to low coverage. The data was then clustered across Samples and final output files were created Across-sample assembly.
The following conda commands will locally install of the software required for this notebook.
## conda install ipyrad -c ipyrad
## conda install toytree -c eaton-lab
import ipyrad as ip
import toyplot
## print software versions
print 'ipyrad v.{}'.format(ip.__version__)
print 'toyplot v.{}'.format(toyplot.__version__)
ipyrad v.0.6.20 toyplot v.0.14.4
I started an ipcluster instance on a 40 core workstation with the ipcluster command as shown below. The cluster_info()
command shows that ipyrad is able to find all 40 cores on the cluster.
##
## ipcluster start --n=40
##
import ipyparallel as ipp
ipyclient = ipp.Client(profile="testing")
#print ip.cluster_info()
The data came to us as two large 20GB files. The barcodes file was provided by Floragenex and maps sample names to barcodes that are contained inline in the sequences, and are 10bp in length. The barcodes are printed a little further below. I ran the program fastQC on the raw data files to do a quality check, the results of which are available here lane1-fastqc and here lane2-fastqc. Overall, quality scores were very high and there was little (but some) adapter contamination, which we will filter out in the ipyrad analysis.
## The reference genome link
reference = """\
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/\
002/002/945/GCA_002002945.1_F.carica_assembly01/\
GCA_002002945.1_F.carica_assembly01_genomic.fna.gz\
"""
## Download the reference genome of F. carica
# ! wget $reference
## decompress it
# ! gunzip ./GCA_002002945.1_F.carica_assembly01_genomic.fna.gz
## Locations of the raw data and barcodes file
lane1data = "~/Documents/RADSEQ_DATA/Ficus/Ficus-1_S1_L001_R1_001.fastq.gz"
lane2data = "~/Documents/RADSEQ_DATA/Ficus/Ficus-2_S2_L002_R1_001.fastq.gz"
barcodes = "~/Documents/RADSEQ_DATA/barcodes/Ficus_Jander_2016_95barcodes.txt"
We set the location to the data and barcodes info for each object, and set the max barcode mismatch parameter to zero (strict), allowing no mismatches. You can see the full barcode information at this link.
## create an object to demultiplex each lane
demux1 = ip.Assembly("lane1")
demux2 = ip.Assembly("lane2")
## set path to data, bcodes, and max_mismatch params
demux1.set_params("project_dir", "./ficus_demux_reads")
demux1.set_params("raw_fastq_path", lane1data)
demux1.set_params("barcodes_path", barcodes)
demux1.set_params("max_barcode_mismatch", 0)
## set path to data, bcodes, and max_mismatch params
demux2.set_params("project_dir", "./ficus_demux_reads")
demux2.set_params("raw_fastq_path", lane2data)
demux2.set_params("barcodes_path", barcodes)
demux2.set_params("max_barcode_mismatch", 0)
New Assembly: lane1 New Assembly: lane2
demux1.run("1")
demux2.run("1")
Assembly: lane1 [####################] 100% chunking large files | 0:21:55 | s1 | [####################] 100% sorting reads | 0:43:43 | s1 | [####################] 100% writing/compressing | 0:14:25 | s1 | Assembly: lane2 [####################] 100% chunking large files | 0:27:44 | s1 | [####################] 100% sorting reads | 0:45:12 | s1 | [####################] 100% writing/compressing | 0:15:53 | s1 |
Now we have two directories with demultiplexed data, each with one gzipped fastq file corresponding to all of the reads matching to a particular Sample's barcode from that lane of sequencing. These are the data that (will be uploaded) to Genbank SRA when we publish. We will load the sorted fastq data at this step, to copy the same procedure that one would take if they were starting from access to the demultiplexed data.
lib1_fastqs = "./ficus_demux_reads/lane1_fastqs/*.gz"
lib2_fastqs = "./ficus_demux_reads/lane2_fastqs/*.gz"
lib1 = ip.Assembly("lib1", quiet=True)
lib1.set_params("sorted_fastq_path", lib1_fastqs)
lib1.run("1")
lib2 = ip.Assembly("lib2", quiet=True)
lib2.set_params("sorted_fastq_path", lib2_fastqs)
lib2.run("1")
Assembly: lib1 [####################] 100% loading reads | 0:01:47 | s1 | Assembly: lib2 [####################] 100% loading reads | 0:00:34 | s1 |
We will join these two demultiplexed libraries into a single analysis that has the set of parameters we will use to assemble the data set. To do this we use the merge()
command in ipyrad. On this merged Assembly we will then set a number of parameter settings that we will use to assemble the data.
## named corresponding to some params we are changing
data = ip.merge("merged", [demux1, demux2])
## set several non-default parameters
data.set_params("project_dir", "analysis-ipyrad")
data.set_params("filter_adapters", 3)
data.set_params("phred_Qscore_offset", 43)
data.set_params("max_Hs_consens", (5, 5))
data.set_params("max_shared_Hs_locus", 4)
data.set_params("filter_min_trim_len", 60)
data.set_params("trim_loci", (0, 8, 0, 0))
data.set_params("output_formats", list("lksapnv"))
## print parameters for prosperity's sake
data.get_params()
0 assembly_name merged 1 project_dir ./analysis-ipyrad 2 raw_fastq_path Merged: lane1, lane2 3 barcodes_path Merged: lane1, lane2 4 sorted_fastq_path Merged: lane1, lane2 5 assembly_method denovo 6 reference_sequence 7 datatype rad 8 restriction_overhang ('TGCAG', '') 9 max_low_qual_bases 5 10 phred_Qscore_offset 43 11 mindepth_statistical 6 12 mindepth_majrule 6 13 maxdepth 10000 14 clust_threshold 0.85 15 max_barcode_mismatch 0 16 filter_adapters 3 17 filter_min_trim_len 60 18 max_alleles_consens 2 19 max_Ns_consens (5, 5) 20 max_Hs_consens (5, 5) 21 min_samples_locus 4 22 max_SNPs_locus (20, 20) 23 max_Indels_locus (8, 8) 24 max_shared_Hs_locus 4 25 trim_reads (0, 0, 0, 0) 26 trim_loci (0, 8, 0, 0) 27 output_formats ('l', 'k', 's', 'a', 'p', 'n', 'v') 28 pop_assign_file
ficus
and drop the control Sample¶First we will drop the control sequence included for quality checking by Floragenex (FGXCONTROL). To do this we create a new branch using the argument subsample
to include all Samples except FGXCONTROL.
## drop the Floragenex control sample if it is in the data
snames = [i for i in data.samples if i != "FGXCONTROL"]
## working branch
data = data.branch("ficus", subsamples=snames)
print "summary of raw read covereage"
print data.stats.reads_raw.describe().astype(int)
summary of raw read covereage count 95 mean 5149596 std 7716026 min 14703 25% 289541 50% 1890328 75% 7402440 max 51339646 Name: reads_raw, dtype: int64
From looking closely at the data it appears there are som poor quality reads with adapter contamination, and also that there are some conspicuous long strings of poly repeats, which are probably due to the library being put on the sequencer in the wrong concentration (the facility failed to do a qPCR quantification). Setting the filter parameter in ipyrad to strict (2) uses 'cutadapt' to filter the reads. By default ipyrad would look just for the Illumina universal adapter, but I'm also adding a few additional poly-{A,C,G,T} sequences to be trimmed. These appeared to be somewhat common in the raw data, followed by nonsense.
## run step 2
data.run("2", force=True)
Assembly: ficus [####################] 100% concatenating inputs | 0:02:43 | s2 | [####################] 100% processing reads | 0:51:52 | s2 |
Steps 2-5 of ipyrad function to filter and cluster reads, and to call consensus haplotypes within samples. We'll look more closely at the stats for each step after it's finished.
## create new branches for assembly method
ficus_d = data.branch("ficus_d")
ficus_r = data.branch("ficus_r")
## set reference info
reference = "GCA_002002945.1_F.carica_assembly01_genomic.fna"
ficus_r.set_params("reference_sequence", reference)
ficus_r.set_params("assembly_method", "reference")
## map reads to reference genome
ficus_r.run("3", force=True)
## cluster reads denovo
ficus_d.run("3")
Assembly: ficus_d [####################] 100% dereplicating | 0:04:31 | s3 | [####################] 100% clustering | 0:13:54 | s3 | [####################] 100% building clusters | 0:00:46 | s3 | [####################] 100% chunking | 0:00:11 | s3 | [####################] 100% aligning | 0:23:58 | s3 | [####################] 100% concatenating | 0:00:14 | s3 |
ficus_d.run("4")
Assembly: ficus_d [####################] 100% inferring [H, E] | 0:14:01 | s4 |
Now that the reads are filtered and clustered within each Sample we want to try applying several different parameter settings for downstream analyses. One major difference will be in the minimum depth of sequencing we require to make a confident base call. We will leave one Assembly with the default setting of 6, which is somewhat conservative. We will also create a 'lowdepth' Assembly that allows base calls for depths as low as 2.
ficus_dhi = ficus_d.branch("ficus_dhi")
ficus_dlo = ficus_d.branch("ficus_dlo")
ficus_dlo.set_params("mindepth_majrule", 1)
ficus_dlo.run("5")
ficus_dhi.run("5")
Assembly: ficus_dlo [####################] 100% calculating depths | 0:00:23 | s5 | [####################] 100% chunking clusters | 0:00:21 | s5 | [####################] 100% consens calling | 0:15:17 | s5 | Assembly: ficus_dhi [####################] 100% calculating depths | 0:00:23 | s5 | [####################] 100% chunking clusters | 0:00:21 | s5 | [####################] 100% consens calling | 0:14:24 | s5 |
Compare hidepth and lodepth assemblies. The difference is not actually that great. Regardless, the samples with very few reads are going to recover very few clusters.
import numpy as np
import toyplot
## stack columns of consens stats
zero = np.zeros(ficus_dhi.stats.shape[0])
upper = ficus_dhi.stats.reads_consens
lower = -1*ficus_dlo.stats.reads_consens
boundaries = np.column_stack((lower, zero, upper))
## plot barplots
canvas = toyplot.Canvas(width=700, height=300)
axes = canvas.cartesian()
axes.bars(boundaries, baseline=None)
axes.y.ticks.show = True
axes.y.ticks.labels.angle = -90
## run step 6 on full and subsampled data sets
ficus_dlo.run("6")
ficus_dhi.run("6")
Assembly: ficus_dlo [####################] 100% concat/shuffle input | 0:00:51 | s6 | [####################] 100% clustering across | 3:26:51 | s6 | [####################] 100% building clusters | 0:00:40 | s6 | [####################] 100% aligning clusters | 0:03:15 | s6 | [####################] 100% database indels | 0:01:00 | s6 | [####################] 100% indexing clusters | 0:07:29 | s6 | [####################] 100% building database | 0:51:26 | s6 | Assembly: ficus_dhi [####################] 100% concat/shuffle input | 0:00:45 | s6 | [####################] 100% clustering across | 2:05:50 | s6 | [####################] 100% building clusters | 0:00:37 | s6 | [####################] 100% aligning clusters | 0:02:56 | s6 | [####################] 100% database indels | 0:00:50 | s6 | [####################] 100% indexing clusters | 0:06:03 | s6 | [####################] 100% building database | 0:41:25 | s6 |
We have several Samples that recovered very little data, probably as a result of having low quality DNA extractions. Figs are hard. We'll assemble one data set that includes all of these samples, but since they are likely to have little information we'll also assemble most of our data sets without these low data samples.
ficus_dhi = ip.load_json("analysis-ipyrad/ficus_dhi.json", 1)
ficus_dlo = ip.load_json("analysis-ipyrad/ficus_dlo.json", 1)
## infer a min4 assembly to see which samples have least data
ficus_dhi.run("7", force=True)
ficus_dlo.run("7", force=True)
Assembly: ficus_dhi [####################] 100% filtering loci | 0:01:13 | s7 | [####################] 100% building loci/stats | 0:00:51 | s7 | [####################] 100% building alleles | 0:01:02 | s7 | [####################] 100% building vcf file | 0:02:37 | s7 | [####################] 100% writing vcf file | 0:00:01 | s7 | [####################] 100% building arrays | 0:00:43 | s7 | [####################] 100% writing outfiles | 0:02:12 | s7 | Outfiles written to: ~/Documents/Ficus/analysis-ipyrad/ficus_dhi_outfiles Assembly: ficus_dlo [####################] 100% filtering loci | 0:01:07 | s7 | [####################] 100% building loci/stats | 0:00:59 | s7 | [####################] 100% building alleles | 0:01:11 | s7 | [####################] 100% building vcf file | 0:03:07 | s7 | [####################] 100% writing vcf file | 0:00:01 | s7 | [####################] 100% building arrays | 0:00:54 | s7 | [####################] 100% writing outfiles | 0:02:20 | s7 | Outfiles written to: ~/Documents/Ficus/analysis-ipyrad/ficus_dlo_outfiles
ficus_dlo.stats_dfs.s7_samples.head(10)
sample_coverage | |
---|---|
A01_paraensis | 6133 |
A02_paraensis | 11534 |
A04_paraensis | 35023 |
A05_paraensis | 14597 |
A06_obtusifolia | 605 |
A07_obtusifolia | 455 |
A16_citrifolia | 3664 |
A18_citrifolia | 3804 |
A19_citrifolia | 9194 |
A26_popenoei | 542 |
## get list of samples with > 1000 reads in the min4
mask = ficus_dlo.stats_dfs.s7_samples.sample_coverage > 1000
skeep = ficus_dlo.stats.index[mask].tolist()
## print who was excluded
print "excluded samples:\t\tconsens reads"
for name, dat in ficus_dlo.samples.items():
if name not in skeep:
print " {:<22}\t{}".format(name, int(dat.stats.reads_consens))
excluded samples: consens reads A07_obtusifolia 3128 A62_turbinata 3400 C02_citrifolia 44 A26_popenoei 3374 A63_turbinata 1562 A34_nymphaeifolia 12204 A75_colubrinae 3982 A29_popenoei 1473 A28_popenoei 3168 C33_triangle 1373 A38_nymphaeifolia 6826 C09_costaricana 189 A06_obtusifolia 5760 C01_bullenei 3826 C52_citrifolia 269 C54_citrifolia 1409 C32_triangleXtrigonata 2389 A27_popenoei 2467 C34_triangle 374
Also with different thresholds for missing data.
## final min4s, min10s, and min20
for assembly in [ficus_dlo, ficus_dhi]:
for minsamp in [4, 10, 20]:
## new branch names
subname = assembly.name + "_s{}".format(minsamp)
## make new branches with subsampling
subdata = assembly.branch(subname, subsamples=skeep)
## set minsamp value
subdata.set_params("min_samples_locus", minsamp)
## run step7
subdata.run("7", force=True)
Assembly: ficus_dlo_s4 [####################] 100% filtering loci | 0:00:43 | s7 | [####################] 100% building loci/stats | 0:00:34 | s7 | [####################] 100% building alleles | 0:00:44 | s7 | [####################] 100% building vcf file | 0:01:26 | s7 | [####################] 100% writing vcf file | 0:00:01 | s7 | [####################] 100% building arrays | 0:00:40 | s7 | [####################] 100% writing outfiles | 0:01:21 | s7 | Outfiles written to: ~/Documents/Ficus/analysis-ipyrad/ficus_dlo_s4_outfiles Assembly: ficus_dlo_s10 [####################] 100% filtering loci | 0:00:44 | s7 | [####################] 100% writing outfiles | 0:00:47 | s7 | Outfiles written to: ~/Documents/Ficus/analysis-ipyrad/ficus_dlo_s10_outfiles Assembly: ficus_dlo_s20 [####################] 100% filtering loci | 0:00:44 | s7 | [####################] 100% building loci/stats | 0:00:33 | s7 | [####################] 100% building alleles | 0:00:41 | s7 | [####################] 100% building vcf file | 0:01:02 | s7 | [####################] 100% writing vcf file | 0:00:00 | s7 | [####################] 100% building arrays | 0:00:38 | s7 | [####################] 100% writing outfiles | 0:00:32 | s7 | Outfiles written to: ~/Documents/Ficus/analysis-ipyrad/ficus_dlo_s20_outfiles Assembly: ficus_dhi_s4 [####################] 100% filtering loci | 0:00:36 | s7 | [####################] 100% building loci/stats | 0:00:27 | s7 | [####################] 100% building alleles | 0:00:37 | s7 | [####################] 100% building vcf file | 0:01:14 | s7 | [####################] 100% writing vcf file | 0:00:01 | s7 | [####################] 100% building arrays | 0:00:32 | s7 | [####################] 100% writing outfiles | 0:01:14 | s7 | Outfiles written to: ~/Documents/Ficus/analysis-ipyrad/ficus_dhi_s4_outfiles Assembly: ficus_dhi_s10 [####################] 100% filtering loci | 0:00:37 | s7 | [####################] 100% building loci/stats | 0:00:27 | s7 | [####################] 100% building alleles | 0:00:35 | s7 | [####################] 100% building vcf file | 0:01:00 | s7 | [####################] 100% writing vcf file | 0:00:00 | s7 | [####################] 100% building arrays | 0:00:31 | s7 | [####################] 100% writing outfiles | 0:00:43 | s7 | Outfiles written to: ~/Documents/Ficus/analysis-ipyrad/ficus_dhi_s10_outfiles Assembly: ficus_dhi_s20 [####################] 100% filtering loci | 0:00:36 | s7 | [####################] 100% building loci/stats | 0:00:27 | s7 | [####################] 100% building alleles | 0:00:35 | s7 | [####################] 100% building vcf file | 0:00:53 | s7 | [####################] 100% writing vcf file | 0:00:00 | s7 | [####################] 100% building arrays | 0:00:30 | s7 | [####################] 100% writing outfiles | 0:00:29 | s7 | Outfiles written to: ~/Documents/Ficus/analysis-ipyrad/ficus_dhi_s20_outfiles
#subdata = ip.load_json("analysis-ipyrad/ficus_dhi_s20")
subdata.stats_dfs.s7_filters
total_filters | applied_order | retained_loci | |
---|---|---|---|
total_prefiltered_loci | 250234 | 0 | 250234 |
filtered_by_rm_duplicates | 6458 | 6458 | 243776 |
filtered_by_max_indels | 4889 | 4889 | 238887 |
filtered_by_max_snps | 9980 | 4476 | 234411 |
filtered_by_max_shared_het | 10985 | 9067 | 225344 |
filtered_by_min_sample | 197775 | 186819 | 38525 |
filtered_by_max_alleles | 17038 | 4609 | 33916 |
total_filtered_loci | 33916 | 0 | 33916 |
subdata.stats_dfs.s7_samples
sample_coverage | |
---|---|
A01_paraensis | 3178 |
A02_paraensis | 6990 |
A04_paraensis | 26258 |
A05_paraensis | 10147 |
A16_citrifolia | 1870 |
A18_citrifolia | 2028 |
A19_citrifolia | 5755 |
A33_nymphaeifolia | 2344 |
A41_nymphaeifolia | 19448 |
A42_nymphaeifolia | 25988 |
A48_trigonata | 5099 |
A49_trigonata | 5963 |
A55_triangle | 5659 |
A59_dugandii | 22571 |
A60_dugandii | 1555 |
A61_turbinata | 4224 |
A65_pertusa | 26857 |
A67_bullenei | 12912 |
A69_bullenei | 1530 |
A70_bullenei | 9034 |
A71_bullenei | 1141 |
A72_bullenei | 2881 |
A77_colubrinae | 22439 |
A82_perforata | 17661 |
A83_perforata | 17932 |
A84_perforata | 26821 |
A85_perforata | 16701 |
A87_costaricana | 2200 |
A94_maxima | 18643 |
A95_insipida | 10999 |
A96_glabrata | 20708 |
A97_glabrata | 16130 |
B102_obtusifolia | 26421 |
B103_obtusifolia | 728 |
B118_maxima | 1542 |
B119_maxima | 17984 |
B120_maxima | 20001 |
B123_maxima | 15354 |
B126_insipida | 22977 |
B127_insipida | 18709 |
B128_insipida | 22847 |
B130_glabrataXmaxima | 22352 |
B131_glabrataXmaxima | 23084 |
B133_glabrata | 21720 |
B134_glabrata | 22720 |
C04_colubrinae | 24782 |
C11_costaricana | 26317 |
C12_dugandii | 7438 |
C14_dugandii | 26742 |
C15_insipida | 22907 |
C17_maxima | 18066 |
C18_maxima | 23139 |
C19_nymphaeifolia | 26665 |
C21_obtusifolia | 2627 |
C22_obtusifolia | 27114 |
C24_obtusifolia | 25269 |
C25_popenoei | 26463 |
C26_popenoei | 25572 |
C27_popenoei | 15584 |
C28_pertusa | 25187 |
C30_triangle | 7256 |
C31_triangle | 12661 |
C36_trigonata | 27118 |
C37_trigonata | 26778 |
C39_trigonata | 26270 |
C41_trigonata | 10822 |
C43_trigonata | 27126 |
C45_yoponensis | 18584 |
C46_yoponensis | 22783 |
C47_yoponensis | 23199 |
C48_tonduzii | 20918 |
C49_dugandii | 24327 |
C50_insipida | 23203 |
C51_perforata | 22246 |
C53_citrifolia | 3392 |
C5_colubrinae | 26818 |
subdata.stats_dfs.s7_loci
locus_coverage | sum_coverage | |
---|---|---|
1 | 0 | 0 |
2 | 0 | 0 |
3 | 0 | 0 |
4 | 0 | 0 |
5 | 0 | 0 |
6 | 0 | 0 |
7 | 0 | 0 |
8 | 0 | 0 |
9 | 0 | 0 |
10 | 0 | 0 |
11 | 0 | 0 |
12 | 0 | 0 |
13 | 0 | 0 |
14 | 0 | 0 |
15 | 0 | 0 |
16 | 0 | 0 |
17 | 0 | 0 |
18 | 0 | 0 |
19 | 0 | 0 |
20 | 2432 | 2432 |
21 | 1887 | 4319 |
22 | 1248 | 5567 |
23 | 910 | 6477 |
24 | 844 | 7321 |
25 | 848 | 8169 |
26 | 841 | 9010 |
27 | 882 | 9892 |
28 | 911 | 10803 |
29 | 849 | 11652 |
30 | 829 | 12481 |
31 | 795 | 13276 |
32 | 705 | 13981 |
33 | 680 | 14661 |
34 | 664 | 15325 |
35 | 638 | 15963 |
36 | 544 | 16507 |
37 | 538 | 17045 |
38 | 589 | 17634 |
39 | 576 | 18210 |
40 | 644 | 18854 |
41 | 686 | 19540 |
42 | 754 | 20294 |
43 | 770 | 21064 |
44 | 941 | 22005 |
45 | 1019 | 23024 |
46 | 1094 | 24118 |
47 | 1073 | 25191 |
48 | 1150 | 26341 |
49 | 1122 | 27463 |
50 | 1083 | 28546 |
51 | 1129 | 29675 |
52 | 908 | 30583 |
53 | 848 | 31431 |
54 | 691 | 32122 |
55 | 570 | 32692 |
56 | 416 | 33108 |
57 | 317 | 33425 |
58 | 220 | 33645 |
59 | 122 | 33767 |
60 | 77 | 33844 |
61 | 36 | 33880 |
62 | 18 | 33898 |
63 | 15 | 33913 |
64 | 2 | 33915 |
65 | 0 | 33915 |
66 | 1 | 33916 |
67 | 0 | 33916 |
68 | 0 | 33916 |
69 | 0 | 33916 |
70 | 0 | 33916 |
71 | 0 | 33916 |
72 | 0 | 33916 |
73 | 0 | 33916 |
74 | 0 | 33916 |
75 | 0 | 33916 |
76 | 0 | 33916 |