Assembly and analysis of Ficus RAD-seq data¶

A RAD-seq library of 95 samples was prepared by Floragenex with the PstI restriction enzyme, followed by sonication and size selection. Stats reported by Floragenex include: AverageFragmentSize=386bp, Concentration=2.51ng/uL, Concentation=10nM. The library was sequenced on two lanes of Illumina HiSeq 3000 yielding 378,809,976 reads in lane 1, and 375,813,513 reads in lane 2, for a total of ~755M reads.

This notebook¶

This is a jupyter notebook, a tool used to create an executable document to full reproduce our analyses. This notebook contains all of the code to assemble the Ficus RAD-seq data set with ipyrad. We begin by demultiplexing The raw data. The demultiplexed data (will be) archived and available online. If you downloaded the demultiplexed data you can skip to section The Demultiplexed Data and begin by loading in those data. The data were assembled under a range of parameter settings, which you can see in the Within-sample assembly section. Several Samples were filtered from the data set due to low coverage. The data was then clustered across Samples and final output files were created Across-sample assembly.

Required software¶

The following conda commands will locally install of the software required for this notebook.

In [1]:

## conda install ipyrad -c ipyrad
## conda install toytree -c eaton-lab

In [2]:

import ipyrad as ip
import toyplot

In [3]:

## print software versions
print 'ipyrad v.{}'.format(ip.__version__)
print 'toyplot v.{}'.format(toyplot.__version__)

ipyrad v.0.6.20
toyplot v.0.14.4

Cluster info¶

I started an ipcluster instance on a 40 core workstation with the ipcluster command as shown below. The cluster_info() command shows that ipyrad is able to find all 40 cores on the cluster.

In [4]:

##
## ipcluster start --n=40
##

In [14]:

import ipyparallel as ipp
ipyclient = ipp.Client(profile="testing")
#print ip.cluster_info()

The raw data¶

The data came to us as two large 20GB files. The barcodes file was provided by Floragenex and maps sample names to barcodes that are contained inline in the sequences, and are 10bp in length. The barcodes are printed a little further below. I ran the program fastQC on the raw data files to do a quality check, the results of which are available here lane1-fastqc and here lane2-fastqc. Overall, quality scores were very high and there was little (but some) adapter contamination, which we will filter out in the ipyrad analysis.

In [30]:

## The reference genome link
reference = """\
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/\
002/002/945/GCA_002002945.1_F.carica_assembly01/\
GCA_002002945.1_F.carica_assembly01_genomic.fna.gz\
"""

## Download the reference genome of F. carica
# ! wget $reference

## decompress it
# ! gunzip ./GCA_002002945.1_F.carica_assembly01_genomic.fna.gz

In [8]:

## Locations of the raw data and barcodes file
lane1data = "~/Documents/RADSEQ_DATA/Ficus/Ficus-1_S1_L001_R1_001.fastq.gz"
lane2data = "~/Documents/RADSEQ_DATA/Ficus/Ficus-2_S2_L002_R1_001.fastq.gz"
barcodes = "~/Documents/RADSEQ_DATA/barcodes/Ficus_Jander_2016_95barcodes.txt"

Create an ipyrad Assembly object for each lane of data¶

We set the location to the data and barcodes info for each object, and set the max barcode mismatch parameter to zero (strict), allowing no mismatches. You can see the full barcode information at this link.

In [9]:

## create an object to demultiplex each lane
demux1 = ip.Assembly("lane1")
demux2 = ip.Assembly("lane2")

## set path to data, bcodes, and max_mismatch params
demux1.set_params("project_dir", "./ficus_demux_reads")
demux1.set_params("raw_fastq_path", lane1data)
demux1.set_params("barcodes_path", barcodes)
demux1.set_params("max_barcode_mismatch", 0)

## set path to data, bcodes, and max_mismatch params
demux2.set_params("project_dir", "./ficus_demux_reads")
demux2.set_params("raw_fastq_path", lane2data)
demux2.set_params("barcodes_path", barcodes)
demux2.set_params("max_barcode_mismatch", 0)

  New Assembly: lane1
  New Assembly: lane2

Demultiplex raw data from both lanes¶

In [12]:

demux1.run("1")
demux2.run("1")

  Assembly: lane1
  [####################] 100%  chunking large files  | 0:21:55 | s1 | 
  [####################] 100%  sorting reads         | 0:43:43 | s1 | 
  [####################] 100%  writing/compressing   | 0:14:25 | s1 | 

  Assembly: lane2
  [####################] 100%  chunking large files  | 0:27:44 | s1 | 
  [####################] 100%  sorting reads         | 0:45:12 | s1 | 
  [####################] 100%  writing/compressing   | 0:15:53 | s1 |

The demultiplexed data¶

Now we have two directories with demultiplexed data, each with one gzipped fastq file corresponding to all of the reads matching to a particular Sample's barcode from that lane of sequencing. These are the data that (will be uploaded) to Genbank SRA when we publish. We will load the sorted fastq data at this step, to copy the same procedure that one would take if they were starting from access to the demultiplexed data.

In [18]:

lib1_fastqs = "./ficus_demux_reads/lane1_fastqs/*.gz"
lib2_fastqs = "./ficus_demux_reads/lane2_fastqs/*.gz"

In [19]:

lib1 = ip.Assembly("lib1", quiet=True)
lib1.set_params("sorted_fastq_path", lib1_fastqs)
lib1.run("1")

lib2 = ip.Assembly("lib2", quiet=True)
lib2.set_params("sorted_fastq_path", lib2_fastqs)
lib2.run("1")

  Assembly: lib1
  [####################] 100%  loading reads         | 0:01:47 | s1 | 

  Assembly: lib2
  [####################] 100%  loading reads         | 0:00:34 | s1 |

Merge the two lanes of data into one Assembly¶

We will join these two demultiplexed libraries into a single analysis that has the set of parameters we will use to assemble the data set. To do this we use the merge() command in ipyrad. On this merged Assembly we will then set a number of parameter settings that we will use to assemble the data.

In [21]:

## named corresponding to some params we are changing
data = ip.merge("merged", [demux1, demux2])

## set several non-default parameters
data.set_params("project_dir", "analysis-ipyrad")
data.set_params("filter_adapters", 3)
data.set_params("phred_Qscore_offset", 43)
data.set_params("max_Hs_consens", (5, 5))
data.set_params("max_shared_Hs_locus", 4)
data.set_params("filter_min_trim_len", 60)
data.set_params("trim_loci", (0, 8, 0, 0))
data.set_params("output_formats", list("lksapnv"))

## print parameters for prosperity's sake
data.get_params()

  0   assembly_name               merged                                       
  1   project_dir                 ./analysis-ipyrad                            
  2   raw_fastq_path              Merged: lane1, lane2                         
  3   barcodes_path               Merged: lane1, lane2                         
  4   sorted_fastq_path           Merged: lane1, lane2                         
  5   assembly_method             denovo                                       
  6   reference_sequence                                                       
  7   datatype                    rad                                          
  8   restriction_overhang        ('TGCAG', '')                                
  9   max_low_qual_bases          5                                            
  10  phred_Qscore_offset         43                                           
  11  mindepth_statistical        6                                            
  12  mindepth_majrule            6                                            
  13  maxdepth                    10000                                        
  14  clust_threshold             0.85                                         
  15  max_barcode_mismatch        0                                            
  16  filter_adapters             3                                            
  17  filter_min_trim_len         60                                           
  18  max_alleles_consens         2                                            
  19  max_Ns_consens              (5, 5)                                       
  20  max_Hs_consens              (5, 5)                                       
  21  min_samples_locus           4                                            
  22  max_SNPs_locus              (20, 20)                                     
  23  max_Indels_locus            (8, 8)                                       
  24  max_shared_Hs_locus         4                                            
  25  trim_reads                  (0, 0, 0, 0)                                 
  26  trim_loci                   (0, 8, 0, 0)                                 
  27  output_formats              ('l', 'k', 's', 'a', 'p', 'n', 'v')          
  28  pop_assign_file

Create branch `ficus` and drop the control Sample¶

First we will drop the control sequence included for quality checking by Floragenex (FGXCONTROL). To do this we create a new branch using the argument subsample to include all Samples except FGXCONTROL.

In [22]:

## drop the Floragenex control sample if it is in the data
snames = [i for i in data.samples if i != "FGXCONTROL"]

## working branch
data = data.branch("ficus", subsamples=snames)

A summary of the number of reads per Sample.¶

In [23]:

print "summary of raw read covereage"
print data.stats.reads_raw.describe().astype(int)

summary of raw read covereage
count          95
mean      5149596
std       7716026
min         14703
25%        289541
50%       1890328
75%       7402440
max      51339646
Name: reads_raw, dtype: int64

Filtering options¶

From looking closely at the data it appears there are som poor quality reads with adapter contamination, and also that there are some conspicuous long strings of poly repeats, which are probably due to the library being put on the sequencer in the wrong concentration (the facility failed to do a qPCR quantification). Setting the filter parameter in ipyrad to strict (2) uses 'cutadapt' to filter the reads. By default ipyrad would look just for the Illumina universal adapter, but I'm also adding a few additional poly-{A,C,G,T} sequences to be trimmed. These appeared to be somewhat common in the raw data, followed by nonsense.

In [24]:

## run step 2
data.run("2", force=True)

  Assembly: ficus
  [####################] 100%  concatenating inputs  | 0:02:43 | s2 | 
  [####################] 100%  processing reads      | 0:51:52 | s2 |

Within-sample assembly¶

Steps 2-5 of ipyrad function to filter and cluster reads, and to call consensus haplotypes within samples. We'll look more closely at the stats for each step after it's finished.

reference & de novo assemblies with ipyrad¶

In [33]:

## create new branches for assembly method
ficus_d = data.branch("ficus_d")
ficus_r = data.branch("ficus_r")

## set reference info
reference = "GCA_002002945.1_F.carica_assembly01_genomic.fna"
ficus_r.set_params("reference_sequence", reference)
ficus_r.set_params("assembly_method", "reference")

In [ ]:

## map reads to reference genome
ficus_r.run("3", force=True)

In [35]:

## cluster reads denovo
ficus_d.run("3")

  Assembly: ficus_d
  [####################] 100%  dereplicating         | 0:04:31 | s3 | 
  [####################] 100%  clustering            | 0:13:54 | s3 | 
  [####################] 100%  building clusters     | 0:00:46 | s3 | 
  [####################] 100%  chunking              | 0:00:11 | s3 | 
  [####################] 100%  aligning              | 0:23:58 | s3 | 
  [####################] 100%  concatenating         | 0:00:14 | s3 |

In [7]:

ficus_d.run("4")

  Assembly: ficus_d
  [####################] 100%  inferring [H, E]      | 0:14:01 | s4 |

Branch to make consensus calls at different mindepth settings¶

Now that the reads are filtered and clustered within each Sample we want to try applying several different parameter settings for downstream analyses. One major difference will be in the minimum depth of sequencing we require to make a confident base call. We will leave one Assembly with the default setting of 6, which is somewhat conservative. We will also create a 'lowdepth' Assembly that allows base calls for depths as low as 2.

In [6]:

ficus_dhi = ficus_d.branch("ficus_dhi")
ficus_dlo = ficus_d.branch("ficus_dlo")
ficus_dlo.set_params("mindepth_majrule", 1)

In [7]:

ficus_dlo.run("5")
ficus_dhi.run("5")

  Assembly: ficus_dlo
  [####################] 100%  calculating depths    | 0:00:23 | s5 | 
  [####################] 100%  chunking clusters     | 0:00:21 | s5 | 
  [####################] 100%  consens calling       | 0:15:17 | s5 | 

  Assembly: ficus_dhi
  [####################] 100%  calculating depths    | 0:00:23 | s5 | 
  [####################] 100%  chunking clusters     | 0:00:21 | s5 | 
  [####################] 100%  consens calling       | 0:14:24 | s5 |

Plot consens reads¶

Compare hidepth and lodepth assemblies. The difference is not actually that great. Regardless, the samples with very few reads are going to recover very few clusters.

In [9]:

import numpy as np
import toyplot

## stack columns of consens stats
zero = np.zeros(ficus_dhi.stats.shape[0])
upper = ficus_dhi.stats.reads_consens
lower = -1*ficus_dlo.stats.reads_consens
boundaries = np.column_stack((lower, zero, upper))

## plot barplots
canvas = toyplot.Canvas(width=700, height=300)
axes = canvas.cartesian()
axes.bars(boundaries, baseline=None)
axes.y.ticks.show = True
axes.y.ticks.labels.angle = -90

Cluster data across Samples¶

In [14]:

## run step 6 on full and subsampled data sets
ficus_dlo.run("6")
ficus_dhi.run("6")

  Assembly: ficus_dlo
  [####################] 100%  concat/shuffle input  | 0:00:51 | s6 | 
  [####################] 100%  clustering across     | 3:26:51 | s6 | 
  [####################] 100%  building clusters     | 0:00:40 | s6 | 
  [####################] 100%  aligning clusters     | 0:03:15 | s6 | 
  [####################] 100%  database indels       | 0:01:00 | s6 | 
  [####################] 100%  indexing clusters     | 0:07:29 | s6 | 
  [####################] 100%  building database     | 0:51:26 | s6 | 

  Assembly: ficus_dhi
  [####################] 100%  concat/shuffle input  | 0:00:45 | s6 | 
  [####################] 100%  clustering across     | 2:05:50 | s6 | 
  [####################] 100%  building clusters     | 0:00:37 | s6 | 
  [####################] 100%  aligning clusters     | 0:02:56 | s6 | 
  [####################] 100%  database indels       | 0:00:50 | s6 | 
  [####################] 100%  indexing clusters     | 0:06:03 | s6 | 
  [####################] 100%  building database     | 0:41:25 | s6 |

Create branches that include/exclude Samples with little data¶

We have several Samples that recovered very little data, probably as a result of having low quality DNA extractions. Figs are hard. We'll assemble one data set that includes all of these samples, but since they are likely to have little information we'll also assemble most of our data sets without these low data samples.

In [7]:

ficus_dhi = ip.load_json("analysis-ipyrad/ficus_dhi.json", 1)
ficus_dlo = ip.load_json("analysis-ipyrad/ficus_dlo.json", 1)

In [8]:

## infer a min4 assembly to see which samples have least data
ficus_dhi.run("7", force=True)
ficus_dlo.run("7", force=True)

  Assembly: ficus_dhi
  [####################] 100%  filtering loci        | 0:01:13 | s7 | 
  [####################] 100%  building loci/stats   | 0:00:51 | s7 | 
  [####################] 100%  building alleles      | 0:01:02 | s7 | 
  [####################] 100%  building vcf file     | 0:02:37 | s7 | 
  [####################] 100%  writing vcf file      | 0:00:01 | s7 | 
  [####################] 100%  building arrays       | 0:00:43 | s7 | 
  [####################] 100%  writing outfiles      | 0:02:12 | s7 | 
  Outfiles written to: ~/Documents/Ficus/analysis-ipyrad/ficus_dhi_outfiles

  Assembly: ficus_dlo
  [####################] 100%  filtering loci        | 0:01:07 | s7 | 
  [####################] 100%  building loci/stats   | 0:00:59 | s7 | 
  [####################] 100%  building alleles      | 0:01:11 | s7 | 
  [####################] 100%  building vcf file     | 0:03:07 | s7 | 
  [####################] 100%  writing vcf file      | 0:00:01 | s7 | 
  [####################] 100%  building arrays       | 0:00:54 | s7 | 
  [####################] 100%  writing outfiles      | 0:02:20 | s7 | 
  Outfiles written to: ~/Documents/Ficus/analysis-ipyrad/ficus_dlo_outfiles

In [54]:

ficus_dlo.stats_dfs.s7_samples.head(10)

Out[54]:

	sample_coverage
A01_paraensis	6133
A02_paraensis	11534
A04_paraensis	35023
A05_paraensis	14597
A06_obtusifolia	605
A07_obtusifolia	455
A16_citrifolia	3664
A18_citrifolia	3804
A19_citrifolia	9194
A26_popenoei	542

In [63]:

## get list of samples with > 1000 reads in the min4
mask = ficus_dlo.stats_dfs.s7_samples.sample_coverage > 1000
skeep = ficus_dlo.stats.index[mask].tolist()

## print who was excluded
print "excluded samples:\t\tconsens reads"
for name, dat in ficus_dlo.samples.items():
    if name not in skeep:
        print "  {:<22}\t{}".format(name, int(dat.stats.reads_consens))

excluded samples:		consens reads
  A07_obtusifolia       	3128
  A62_turbinata         	3400
  C02_citrifolia        	44
  A26_popenoei          	3374
  A63_turbinata         	1562
  A34_nymphaeifolia     	12204
  A75_colubrinae        	3982
  A29_popenoei          	1473
  A28_popenoei          	3168
  C33_triangle          	1373
  A38_nymphaeifolia     	6826
  C09_costaricana       	189
  A06_obtusifolia       	5760
  C01_bullenei          	3826
  C52_citrifolia        	269
  C54_citrifolia        	1409
  C32_triangleXtrigonata	2389
  A27_popenoei          	2467
  C34_triangle          	374

Assemble final data sets¶

Also with different thresholds for missing data.

In [ ]:

## final min4s, min10s, and min20
for assembly in [ficus_dlo, ficus_dhi]:
    for minsamp in [4, 10, 20]:
        ## new branch names
        subname = assembly.name + "_s{}".format(minsamp)
        
        ## make new branches with subsampling
        subdata = assembly.branch(subname, subsamples=skeep)
        
        ## set minsamp value
        subdata.set_params("min_samples_locus", minsamp)
        
        ## run step7
        subdata.run("7", force=True)

  Assembly: ficus_dlo_s4
  [####################] 100%  filtering loci        | 0:00:43 | s7 | 
  [####################] 100%  building loci/stats   | 0:00:34 | s7 | 
  [####################] 100%  building alleles      | 0:00:44 | s7 | 
  [####################] 100%  building vcf file     | 0:01:26 | s7 | 
  [####################] 100%  writing vcf file      | 0:00:01 | s7 | 
  [####################] 100%  building arrays       | 0:00:40 | s7 | 
  [####################] 100%  writing outfiles      | 0:01:21 | s7 | 
  Outfiles written to: ~/Documents/Ficus/analysis-ipyrad/ficus_dlo_s4_outfiles

  Assembly: ficus_dlo_s10
  [####################] 100%  filtering loci        | 0:00:44 | s7 | 
  [####################] 100%  writing outfiles      | 0:00:47 | s7 | 
  Outfiles written to: ~/Documents/Ficus/analysis-ipyrad/ficus_dlo_s10_outfiles

  Assembly: ficus_dlo_s20
  [####################] 100%  filtering loci        | 0:00:44 | s7 | 
  [####################] 100%  building loci/stats   | 0:00:33 | s7 | 
  [####################] 100%  building alleles      | 0:00:41 | s7 | 
  [####################] 100%  building vcf file     | 0:01:02 | s7 | 
  [####################] 100%  writing vcf file      | 0:00:00 | s7 | 
  [####################] 100%  building arrays       | 0:00:38 | s7 | 
  [####################] 100%  writing outfiles      | 0:00:32 | s7 | 
  Outfiles written to: ~/Documents/Ficus/analysis-ipyrad/ficus_dlo_s20_outfiles

  Assembly: ficus_dhi_s4
  [####################] 100%  filtering loci        | 0:00:36 | s7 | 
  [####################] 100%  building loci/stats   | 0:00:27 | s7 | 
  [####################] 100%  building alleles      | 0:00:37 | s7 | 
  [####################] 100%  building vcf file     | 0:01:14 | s7 | 
  [####################] 100%  writing vcf file      | 0:00:01 | s7 | 
  [####################] 100%  building arrays       | 0:00:32 | s7 | 
  [####################] 100%  writing outfiles      | 0:01:14 | s7 | 
  Outfiles written to: ~/Documents/Ficus/analysis-ipyrad/ficus_dhi_s4_outfiles

  Assembly: ficus_dhi_s10
  [####################] 100%  filtering loci        | 0:00:37 | s7 | 
  [####################] 100%  building loci/stats   | 0:00:27 | s7 | 
  [####################] 100%  building alleles      | 0:00:35 | s7 | 
  [####################] 100%  building vcf file     | 0:01:00 | s7 | 
  [####################] 100%  writing vcf file      | 0:00:00 | s7 | 
  [####################] 100%  building arrays       | 0:00:31 | s7 | 
  [####################] 100%  writing outfiles      | 0:00:43 | s7 | 
  Outfiles written to: ~/Documents/Ficus/analysis-ipyrad/ficus_dhi_s10_outfiles

  Assembly: ficus_dhi_s20
  [####################] 100%  filtering loci        | 0:00:36 | s7 | 
  [####################] 100%  building loci/stats   | 0:00:27 | s7 | 
  [####################] 100%  building alleles      | 0:00:35 | s7 | 
  [####################] 100%  building vcf file     | 0:00:53 | s7 | 
  [####################] 100%  writing vcf file      | 0:00:00 | s7 | 
  [####################] 100%  building arrays       | 0:00:30 | s7 | 
  [####################] 100%  writing outfiles      | 0:00:29 | s7 | 
  Outfiles written to: ~/Documents/Ficus/analysis-ipyrad/ficus_dhi_s20_outfiles

Check some final stats¶

In [69]:

#subdata = ip.load_json("analysis-ipyrad/ficus_dhi_s20")
subdata.stats_dfs.s7_filters

Out[69]:

	total_filters	applied_order	retained_loci
total_prefiltered_loci	250234	0	250234
filtered_by_rm_duplicates	6458	6458	243776
filtered_by_max_indels	4889	4889	238887
filtered_by_max_snps	9980	4476	234411
filtered_by_max_shared_het	10985	9067	225344
filtered_by_min_sample	197775	186819	38525
filtered_by_max_alleles	17038	4609	33916
total_filtered_loci	33916	0	33916

In [70]:

subdata.stats_dfs.s7_samples

Out[70]:

	sample_coverage
A01_paraensis	3178
A02_paraensis	6990
A04_paraensis	26258
A05_paraensis	10147
A16_citrifolia	1870
A18_citrifolia	2028
A19_citrifolia	5755
A33_nymphaeifolia	2344
A41_nymphaeifolia	19448
A42_nymphaeifolia	25988
A48_trigonata	5099
A49_trigonata	5963
A55_triangle	5659
A59_dugandii	22571
A60_dugandii	1555
A61_turbinata	4224
A65_pertusa	26857
A67_bullenei	12912
A69_bullenei	1530
A70_bullenei	9034
A71_bullenei	1141
A72_bullenei	2881
A77_colubrinae	22439
A82_perforata	17661
A83_perforata	17932
A84_perforata	26821
A85_perforata	16701
A87_costaricana	2200
A94_maxima	18643
A95_insipida	10999
A96_glabrata	20708
A97_glabrata	16130
B102_obtusifolia	26421
B103_obtusifolia	728
B118_maxima	1542
B119_maxima	17984
B120_maxima	20001
B123_maxima	15354
B126_insipida	22977
B127_insipida	18709
B128_insipida	22847
B130_glabrataXmaxima	22352
B131_glabrataXmaxima	23084
B133_glabrata	21720
B134_glabrata	22720
C04_colubrinae	24782
C11_costaricana	26317
C12_dugandii	7438
C14_dugandii	26742
C15_insipida	22907
C17_maxima	18066
C18_maxima	23139
C19_nymphaeifolia	26665
C21_obtusifolia	2627
C22_obtusifolia	27114
C24_obtusifolia	25269
C25_popenoei	26463
C26_popenoei	25572
C27_popenoei	15584
C28_pertusa	25187
C30_triangle	7256
C31_triangle	12661
C36_trigonata	27118
C37_trigonata	26778
C39_trigonata	26270
C41_trigonata	10822
C43_trigonata	27126
C45_yoponensis	18584
C46_yoponensis	22783
C47_yoponensis	23199
C48_tonduzii	20918
C49_dugandii	24327
C50_insipida	23203
C51_perforata	22246
C53_citrifolia	3392
C5_colubrinae	26818

In [71]:

subdata.stats_dfs.s7_loci

Out[71]:

	locus_coverage	sum_coverage
1	0	0
2	0	0
3	0	0
4	0	0
5	0	0
6	0	0
7	0	0
8	0	0
9	0	0
10	0	0
11	0	0
12	0	0
13	0	0
14	0	0
15	0	0
16	0	0
17	0	0
18	0	0
19	0	0
20	2432	2432
21	1887	4319
22	1248	5567
23	910	6477
24	844	7321
25	848	8169
26	841	9010
27	882	9892
28	911	10803
29	849	11652
30	829	12481
31	795	13276
32	705	13981
33	680	14661
34	664	15325
35	638	15963
36	544	16507
37	538	17045
38	589	17634
39	576	18210
40	644	18854
41	686	19540
42	754	20294
43	770	21064
44	941	22005
45	1019	23024
46	1094	24118
47	1073	25191
48	1150	26341
49	1122	27463
50	1083	28546
51	1129	29675
52	908	30583
53	848	31431
54	691	32122
55	570	32692
56	416	33108
57	317	33425
58	220	33645
59	122	33767
60	77	33844
61	36	33880
62	18	33898
63	15	33913
64	2	33915
65	0	33915
66	1	33916
67	0	33916
68	0	33916
69	0	33916
70	0	33916
71	0	33916
72	0	33916
73	0	33916
74	0	33916
75	0	33916
76	0	33916

Assembly and analysis of Ficus RAD-seq data¶

This notebook¶

Required software¶

Cluster info¶

The raw data¶

Create an ipyrad Assembly object for each lane of data¶

Demultiplex raw data from both lanes¶

The demultiplexed data¶

Merge the two lanes of data into one Assembly¶

Create branch ficus and drop the control Sample¶

A summary of the number of reads per Sample.¶

Filtering options¶

Within-sample assembly¶

reference & de novo assemblies with ipyrad¶

Branch to make consensus calls at different mindepth settings¶

Plot consens reads¶

Cluster data across Samples¶

Create branches that include/exclude Samples with little data¶

Assemble final data sets¶

Check some final stats¶

Link to all of the final stats files on github¶

Create branch `ficus` and drop the control Sample¶