This notebook provides all code necessary to reproduce the assembled GBS data sets used in Federman et al. (xxxx). Starting from demultiplexed fastq data files we assemble the data into four complete data sets that were used in downstream analyses. All code in this notebook is written in Python and uses the ipyrad package for assembly.
## conda install ipyrad -c ipyrad
## conda install sra-tools -c bioconda
## conda install entrez-direct -c bioconda
import ipyrad as ip
import ipyrad.analysis as ipa
print "ipyrad v.{}".format(ip.__version__)
ipyrad v.0.7.20
import ipyparallel as ipp
ipyclient = ipp.Client()
ip.cluster_info(ipyclient)
host compute node: [40 cores] on sacra
## get accession data from sra
sra = ipa.sratools(accession="SRP106882", workdir="./fastq-files")
## print run info for posterity
run_info = sra.fetch_runinfo((1,4,6,29,30))
run_info
Fetching project data...
Run | spots | spots_with_mates | ScientificName | SampleName | |
---|---|---|---|---|---|
0 | SRR5534921 | 12818693 | 0 | Canarium lamianum | SF327 |
1 | SRR5534922 | 5675773 | 0 | Canarium longistipulatum | D12950 |
2 | SRR5534923 | 34404746 | 0 | Canarium ovatum | D14269 |
3 | SRR5534924 | 3382649 | 0 | Canarium pilicarpum | 5573 |
4 | SRR5534925 | 16632442 | 0 | Canarium obtusifolium | SF228 |
5 | SRR5534926 | 10769159 | 0 | Canarium odontophyllum | SFC1988 |
6 | SRR5534927 | 47625 | 0 | Canarium ntidifolium | 4304 |
7 | SRR5534928 | 17034881 | 0 | Canarium obtusifolium | SF224 |
8 | SRR5534929 | 211932 | 0 | Canarium ferrugineum | SF343 |
9 | SRR5534930 | 402493 | 0 | Canarium galokense | D13101 |
10 | SRR5534931 | 5796976 | 0 | Canarium galokense | SF155 |
11 | SRR5534932 | 3219083 | 0 | Canarium globosum | SF200 |
12 | SRR5534933 | 1902860 | 0 | Canarium globosum | SF209 |
13 | SRR5534934 | 9709559 | 0 | Canarium indicum | D13374 |
14 | SRR5534935 | 2192159 | 0 | Canarium lamianum | D13063 |
15 | SRR5534936 | 450626 | 0 | Canarium lamianum | SF160 |
16 | SRR5534937 | 12874061 | 0 | Canarium pulchrebracteatum | SF286 |
17 | SRR5534938 | 2499902 | 0 | Canarium compressum | D13090 |
18 | SRR5534939 | 7260851 | 0 | Canarium ferrugineum | SF172 |
19 | SRR5534940 | 15736904 | 0 | Canarium pulchrebracteatum | SF276 |
20 | SRR5534941 | 12539878 | 0 | Canarium pilicarpum | D13052 |
21 | SRR5534942 | 23355083 | 0 | Canarium compressum | D13097 |
22 | SRR5534943 | 2097537 | 0 | Canarium egregium | D13103 |
23 | SRR5534944 | 1033763 | 0 | Canarium elegans | D12963 |
24 | SRR5534945 | 2725698 | 0 | Canarium bengalense | D13852 |
25 | SRR5534946 | 12132085 | 0 | Canarium betamponae | SF175 |
26 | SRR5534947 | 18446248 | 0 | Canarium betamponae | SF328 |
27 | SRR5534948 | 400266 | 0 | Canarium boivinii | D12962 |
28 | SRR5534949 | 11647948 | 0 | Canarium velutinifolium | D14505 |
29 | SRR5534950 | 12852942 | 0 | Canarium velutinifolium | D14504 |
30 | SRR5534951 | 16100807 | 0 | Canarium scholasticum | SF197 |
31 | SRR5534952 | 186682 | 0 | Canarium scholasticum | SF301 |
32 | SRR5534953 | 1152381 | 0 | Canarium planifolium | SF153 |
33 | SRR5534954 | 3803237 | 0 | Canarium multiflorum | D14501 |
34 | SRR5534955 | 4734247 | 0 | Canarium multiflorum | D14485 |
35 | SRR5534956 | 10744745 | 0 | Canarium multinerve | D14482 |
36 | SRR5534957 | 2757099 | 0 | Canarium multiflorum | D14513 |
37 | SRR5534958 | 6645549 | 0 | Canarium multiflorum | D14477 |
38 | SRR5534959 | 225668 | 0 | Canarium madagascariense | D13091 |
39 | SRR5534960 | 7962974 | 0 | Canarium multiflorum | D14480 |
40 | SRR5534961 | 19015238 | 0 | Canarium multiflorum | D14478 |
41 | SRR5534962 | 668679 | 0 | Canarium scholasticum | D13075 |
42 | SRR5534963 | 294419 | 0 | Canarium multinerve | D14492 |
43 | SRR5534964 | 11466007 | 0 | Canarium multinerve | D14483 |
44 | SRR5534965 | 9829153 | 0 | Canarium planifolium | SF164 |
45 | SRR5534966 | 1788748 | 0 | Canarium pulchrebracteatum | D14528 |
46 | SRR5534967 | 5687824 | 0 | Canarium velutinifolium | D14506 |
47 | SRR5534968 | 1694555 | 0 | Canarium ferrugineum | D13053 |
## run parallel download
sra.run(ipyclient=ipyclient)
[####################] 100% Downloading fastq files | 0:24:21 | 48 fastq files downloaded to /home/deren/Documents/Canarium/fastq-files
Enter parameter values for the ipyrad assembly .
## create an Assembly
data = ip.Assembly("Canarium")
## set params
data.set_params("project_dir", "analysis-ipyrad")
data.set_params("sorted_fastq_path", "./fastq-files/*.gz")
data.set_params("restriction_overhang", ("CWGC", "CWGC"))
data.set_params("datatype", "gbs")
data.set_params("clust_threshold", 0.90)
data.set_params("filter_adapters", 2)
data.set_params("max_SNPs_locus", (10, 10))
data.set_params("max_shared_Hs_locus", 4)
data.set_params("trim_reads", (0, 0))
data.set_params("trim_loci", (0, 5))
data.set_params("output_formats", list("lpsvka"))
## print params for posterity
data.get_params()
New Assembly: Canarium 0 assembly_name Canarium 1 project_dir ./analysis-ipyrad 2 raw_fastq_path 3 barcodes_path 4 sorted_fastq_path ./fastq-files/*.gz 5 assembly_method denovo 6 reference_sequence 7 datatype gbs 8 restriction_overhang ('CWGC', 'CWGC') 9 max_low_qual_bases 5 10 phred_Qscore_offset 33 11 mindepth_statistical 6 12 mindepth_majrule 6 13 maxdepth 10000 14 clust_threshold 0.9 15 max_barcode_mismatch 0 16 filter_adapters 2 17 filter_min_trim_len 35 18 max_alleles_consens 2 19 max_Ns_consens (5, 5) 20 max_Hs_consens (8, 8) 21 min_samples_locus 4 22 max_SNPs_locus (10, 10) 23 max_Indels_locus (8, 8) 24 max_shared_Hs_locus 4 25 trim_reads (0, 0, 0, 0) 26 trim_loci (0, 5) 27 output_formats ('l', 'p', 's', 'v', 'k', 'a') 28 pop_assign_file
data.run("12")
data.run("3", force=True)
Assembly: Canarium [####################] 100% dereplicating | 0:01:55 | s3 | | [####################] 100% clustering | 15:30:31 | s3 | [####################] 100% building clusters | 0:04:02 | s3 | [####################] 100% chunking | 0:00:42 | s3 | [####################] 100% aligning | 1:00:07 | s3 | [####################] 100% concatenating | 0:03:13 | s3 |
data.run("45")
Assembly: Canarium [####################] 100% inferring [H, E] | 0:12:06 | s4 | [####################] 100% calculating depths | 0:00:54 | s5 | [####################] 100% chunking clusters | 0:01:22 | s5 | [####################] 100% consens calling | 0:21:34 | s5 |
data.run("6")
Assembly: Canarium [####################] 100% concat/shuffle input | 0:01:06 | s6 | [####################] 100% clustering across | 5:43:06 | s6 | [####################] 100% building clusters | 0:00:53 | s6 | [####################] 100% aligning clusters | 0:04:13 | s6 | [####################] 100% database indels | 0:02:13 | s6 | [####################] 100% indexing clusters | 0:02:19 | s6 | [####################] 100% building database | 0:24:48 | s6 |
data.run("7")
Assembly: Canarium [####################] 100% filtering loci | 0:00:39 | s7 | [####################] 100% building loci/stats | 0:00:29 | s7 | [####################] 100% building alleles | 0:00:36 | s7 | [####################] 100% building vcf file | 0:01:06 | s7 | [####################] 100% writing vcf file | 0:00:00 | s7 | [####################] 100% building arrays | 0:00:31 | s7 | [####################] 100% writing outfiles | 0:01:00 | s7 | Outfiles written to: ~/Documents/Canarium/analysis-ipyrad/Canarium_outfiles
## re-load assembly in case coming back to this notebook later
data = ip.load_json("analysis-ipyrad/Canarium.json")
loading Assembly: Canarium from saved path: ~/Documents/Canarium/analysis-ipyrad/Canarium.json
## print some stats
print "total N reads:", data.stats.reads_raw.sum()
print "mean N reads/sample:", data.stats.reads_raw.mean()
print "S.D. N reads/sample:", data.stats.reads_raw.std()
total N reads: 365012834 mean N reads/sample: 7604434.04167 S.D. N reads/sample: 7407655.17926
## number of consens reads per sample.
data.stats_dfs.s7_samples
sample_coverage | |
---|---|
4304 | 7 |
5573 | 37672 |
D12950 | 44861 |
D12962 | 1981 |
D12963 | 10141 |
D13052 | 75316 |
D13053 | 21812 |
D13063 | 28227 |
D13075 | 4096 |
D13090 | 27541 |
D13091 | 1059 |
D13097 | 81399 |
D13101 | 2860 |
D13103 | 26359 |
D13374 | 56701 |
D13852 | 13687 |
D14269 | 79114 |
D14477 | 61515 |
D14478 | 97836 |
D14480 | 70588 |
D14482 | 69750 |
D14483 | 71352 |
D14485 | 51841 |
D14492 | 1272 |
D14501 | 49981 |
D14504 | 77524 |
D14505 | 77445 |
D14506 | 60470 |
D14513 | 39505 |
D14528 | 26587 |
SF153 | 9269 |
SF155 | 53298 |
SF160 | 2856 |
SF164 | 69764 |
SF172 | 64881 |
SF175 | 69099 |
SF197 | 82914 |
SF200 | 42029 |
SF209 | 28325 |
SF224 | 81039 |
SF228 | 82338 |
SF276 | 82961 |
SF286 | 78681 |
SF301 | 785 |
SF327 | 72815 |
SF328 | 80219 |
SF343 | 894 |
SFC1988 | 51364 |
## make subset lists to exclude taxa with little data
subs = [i.name for i in data.samples.values() if i.stats.reads_consens > 12000]
subsnout = list(set(subs) - set(["D14269", "D13374", "SFC1988", "D13852"]))
## make new branches
Can_min4 = data.branch("Canarium-min4", subsamples=subs)
Can_min10 = data.branch("Canarium-min10", subsamples=subs)
Can_min20 = data.branch("Canarium-min20", subsamples=subs)
Can_min30nout = data.branch("Canarium-min30-nout", subsamples=subsnout)
## set params on new assemblies
Can_min10.set_params("min_samples_locus", 10)
Can_min20.set_params("min_samples_locus", 20)
Can_min30nout.set_params("min_samples_locus", 30)
## final assemblies
Can_min4.run("7", force=True)
Can_min10.run("7", force=True)
Can_min20.run("7", force=True)
Can_min30nout.run("7", force=True)
Assembly: Canarium-min4 [####################] 100% filtering loci | 0:00:50 | s7 | [####################] 100% building loci/stats | 0:00:29 | s7 | [####################] 100% building alleles | 0:00:36 | s7 | [####################] 100% building vcf file | 0:01:00 | s7 | [####################] 100% writing vcf file | 0:00:00 | s7 | [####################] 100% building arrays | 0:00:40 | s7 | [####################] 100% writing outfiles | 0:00:50 | s7 | Outfiles written to: ~/Documents/Canarium/analysis-ipyrad/Canarium-min4_outfiles Assembly: Canarium-min10 [####################] 100% filtering loci | 0:00:41 | s7 | [####################] 100% building loci/stats | 0:00:29 | s7 | [####################] 100% building alleles | 0:00:35 | s7 | [####################] 100% building vcf file | 0:00:49 | s7 | [####################] 100% writing vcf file | 0:00:00 | s7 | [####################] 100% building arrays | 0:00:39 | s7 | [####################] 100% writing outfiles | 0:00:27 | s7 | Outfiles written to: ~/Documents/Canarium/analysis-ipyrad/Canarium-min10_outfiles Assembly: Canarium-min20 [####################] 100% filtering loci | 0:00:41 | s7 | [####################] 100% building loci/stats | 0:00:29 | s7 | [####################] 100% building alleles | 0:00:35 | s7 | [####################] 100% building vcf file | 0:00:42 | s7 | [####################] 100% writing vcf file | 0:00:00 | s7 | [####################] 100% building arrays | 0:00:38 | s7 | [####################] 100% writing outfiles | 0:00:15 | s7 | Outfiles written to: ~/Documents/Canarium/analysis-ipyrad/Canarium-min20_outfiles Assembly: Canarium-min30-nout [####################] 100% filtering loci | 0:00:37 | s7 | [####################] 100% building loci/stats | 0:00:28 | s7 | [####################] 100% building alleles | 0:00:33 | s7 | [####################] 100% building vcf file | 0:00:35 | s7 | [####################] 100% writing vcf file | 0:00:00 | s7 | [####################] 100% building arrays | 0:00:35 | s7 | [####################] 100% writing outfiles | 0:00:06 | s7 | Outfiles written to: ~/Documents/Canarium/analysis-ipyrad/Canarium-min30-nout_outfiles
See the github page for stats of each assembly.
## reoload assemblies from their JSON files
Can_min4 = ip.load_json("analysis-ipyrad/Canarium-min4.json")
Can_min10 = ip.load_json("analysis-ipyrad/Canarium-min10.json")
Can_min20 = ip.load_json("analysis-ipyrad/Canarium-min20.json")
Can_min30n = ip.load_json("analysis-ipyrad/Canarium-min30-nout.json")
loading Assembly: Canarium-min4 from saved path: ~/Documents/Canarium/analysis-ipyrad/Canarium-min4.json loading Assembly: Canarium-min10 from saved path: ~/Documents/Canarium/analysis-ipyrad/Canarium-min10.json loading Assembly: Canarium-min20 from saved path: ~/Documents/Canarium/analysis-ipyrad/Canarium-min20.json loading Assembly: Canarium-min30-nout from saved path: ~/Documents/Canarium/analysis-ipyrad/Canarium-min30-nout.json