This is an IPython notebook. Most of the code is composed of bash scripts, indicated by %%bash at the top of the cell, otherwise it is IPython code. This notebook includes code to download, assemble and analyze a published RADseq data set.
### Notebook 5
### Data set 5 (Heliconius)
### Authors: Nadeau et al. (2013)
### Data Location: ERP000991
Below I read in EraRunTable.txt for this project which contains all of the information we need to download the data.
%%bash
## make a new directory for this analysis
mkdir -p empirical_5/fastq/
The authors used only a subset of the data that are on the archive for their phylogenetic analyses so I will choose the same 54 samples here which are listed in Table S1 of their publication.
subsamples = ['ERS070268','ERS070269','ERS070270','ERS070271','ERS070272',
'ERS070273','ERS070274','ERS070275','ERS070276','ERS070277',
'ERS070246','ERS070247','ERS070248','ERS070249','ERS070245',
'ERS070252','ERS070253','ERS070254','ERS070256','ERS070255',
'ERS074398','ERS074399','ERS074400','ERS074401','ERS074402',
'ERS074403','ERS074404','ERS074405','ERS074406','ERS074407',
'ERS074408','ERS074409','ERS074410','ERS074411','ERS074412',
'ERS074413','ERS074414','ERS074415','ERS074416','ERS074417',
'ERS074418','ERS074419','ERS074420','ERS074421','ERS074422',
'ERS074423','ERS074424','ERS074425','ERS074426','ERS074427',
'ERS074428','ERS074429','ERS070250','ERS070251']
len(subsamples)
54
## IPython code
import pandas as pd
import numpy as np
import urllib2
import os
## open the SRA run table from github url
url = "https://raw.githubusercontent.com/"+\
"dereneaton/RADmissing/master/empirical_9_EraRunTable.txt"
intable = urllib2.urlopen(url)
indata = pd.read_table(intable, sep="\t")
## print first few rows
print indata.head()
study_accession secondary_study_accession sample_accession \ 0 PRJEB2743 ERP000991 SAMEA1322899 1 PRJEB2743 ERP000991 SAMEA1322899 2 PRJEB2743 ERP000991 SAMEA1322873 3 PRJEB2743 ERP000991 SAMEA1322873 4 PRJEB2743 ERP000991 SAMEA1322940 secondary_sample_accession experiment_accession run_accession tax_id \ 0 ERS070236 ERX030872 ERR053824 33444 1 ERS070236 ERX030873 ERR053825 33444 2 ERS070237 ERX030874 ERR053826 33444 3 ERS070237 ERX030875 ERR053827 33444 4 ERS070238 ERX030876 ERR053828 33444 scientific_name instrument_model library_layout \ 0 Heliconius elevatus Illumina Genome Analyzer IIx PAIRED 1 Heliconius elevatus Illumina Genome Analyzer IIx PAIRED 2 Heliconius elevatus Illumina Genome Analyzer IIx PAIRED 3 Heliconius elevatus Illumina Genome Analyzer IIx PAIRED 4 Heliconius elevatus Illumina Genome Analyzer IIx PAIRED fastq_ftp \ 0 ftp.sra.ebi.ac.uk/vol1/fastq/ERR053/ERR053824/... 1 ftp.sra.ebi.ac.uk/vol1/fastq/ERR053/ERR053825/... 2 ftp.sra.ebi.ac.uk/vol1/fastq/ERR053/ERR053826/... 3 ftp.sra.ebi.ac.uk/vol1/fastq/ERR053/ERR053827/... 4 ftp.sra.ebi.ac.uk/vol1/fastq/ERR053/ERR053828/... fastq_galaxy \ 0 ftp.sra.ebi.ac.uk/vol1/fastq/ERR053/ERR053824/... 1 ftp.sra.ebi.ac.uk/vol1/fastq/ERR053/ERR053825/... 2 ftp.sra.ebi.ac.uk/vol1/fastq/ERR053/ERR053826/... 3 ftp.sra.ebi.ac.uk/vol1/fastq/ERR053/ERR053827/... 4 ftp.sra.ebi.ac.uk/vol1/fastq/ERR053/ERR053828/... submitted_ftp \ 0 ftp.sra.ebi.ac.uk/vol1/ERA066/ERA066292/fastq/... 1 ftp.sra.ebi.ac.uk/vol1/ERA066/ERA066292/fastq/... 2 ftp.sra.ebi.ac.uk/vol1/ERA066/ERA066292/fastq/... 3 ftp.sra.ebi.ac.uk/vol1/ERA066/ERA066292/fastq/... 4 ftp.sra.ebi.ac.uk/vol1/ERA066/ERA066292/fastq/... submitted_galaxy cram_index_ftp \ 0 ftp.sra.ebi.ac.uk/vol1/ERA066/ERA066292/fastq/... NaN 1 ftp.sra.ebi.ac.uk/vol1/ERA066/ERA066292/fastq/... NaN 2 ftp.sra.ebi.ac.uk/vol1/ERA066/ERA066292/fastq/... NaN 3 ftp.sra.ebi.ac.uk/vol1/ERA066/ERA066292/fastq/... NaN 4 ftp.sra.ebi.ac.uk/vol1/ERA066/ERA066292/fastq/... NaN cram_index_galaxy 0 NaN 1 NaN 2 NaN 3 NaN 4 NaN
def wget_download_ERR(ERR, outdir, outname):
""" Python function to get sra data from ncbi and write to
outdir with a new name using bash call wget """
## get output name
output = os.path.join(outdir, outname+".fastq.gz")
## create a call string
call = "wget -q -r -nH --cut-dirs=9 -O "+output+" "+\
"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/"+\
"{}/{}/{}_1.fastq.gz".format(ERR[:6], ERR, ERR)
## call bash script
! $call
Here we pass the SRR number and the sample name to the wget_download
function so that the files are saved with their sample names.
for ID, ERS, ERR in zip(indata.scientific_name,
indata.secondary_sample_accession,
indata.run_accession):
if ERS in subsamples:
name = ID.split()[1]
name += "_"+ERS+"_"+ERR
print "{:<35}\t{}\t{}".format(ID, ERS, ERR)
wget_download_ERR(ERR, "empirical_5/fastq/", name)
Heliconius timareta ssp. JD-2011 ERS070246 ERR053848 Heliconius timareta ssp. JD-2011 ERS070246 ERR053849 Heliconius timareta ssp. JD-2011 ERS070246 ERR053850 Heliconius timareta ssp. JD-2011 ERS070247 ERR053851 Heliconius timareta ssp. JD-2011 ERS070247 ERR053852 Heliconius timareta ssp. JD-2011 ERS070247 ERR053853 Heliconius timareta ssp. JD-2011 ERS070245 ERR053854 Heliconius timareta ssp. JD-2011 ERS070245 ERR053855 Heliconius timareta ssp. JD-2011 ERS070248 ERR053856 Heliconius timareta ssp. JD-2011 ERS070248 ERR053857 Heliconius timareta ssp. JD-2011 ERS070249 ERR053858 Heliconius timareta ssp. JD-2011 ERS070249 ERR053859 Heliconius timareta timareta ERS070250 ERR053860 Heliconius timareta timareta ERS070250 ERR053861 Heliconius timareta timareta ERS070250 ERR053862 Heliconius timareta timareta ERS070251 ERR053863 Heliconius timareta timareta ERS070251 ERR053864 Heliconius timareta timareta ERS070251 ERR053865 Heliconius hecale felix ERS070252 ERR053866 Heliconius hecale felix ERS070252 ERR053867 Heliconius hecale felix ERS070253 ERR053868 Heliconius hecale felix ERS070253 ERR053869 Heliconius hecale felix ERS070254 ERR053870 Heliconius hecale felix ERS070254 ERR053871 Heliconius hecale felix ERS070255 ERR053872 Heliconius hecale felix ERS070255 ERR053873 Heliconius hecale felix ERS070255 ERR053874 Heliconius hecale felix ERS070256 ERR053875 Heliconius hecale felix ERS070256 ERR053876 Heliconius hecale felix ERS070256 ERR053877 Heliconius melpomene aglaope ERS070268 ERR053897 Heliconius melpomene aglaope ERS070268 ERR053898 Heliconius melpomene aglaope ERS070269 ERR053899 Heliconius melpomene aglaope ERS070269 ERR053900 Heliconius melpomene aglaope ERS070270 ERR053901 Heliconius melpomene aglaope ERS070270 ERR053902 Heliconius melpomene aglaope ERS070271 ERR053903 Heliconius melpomene aglaope ERS070271 ERR053904 Heliconius melpomene aglaope ERS070272 ERR053905 Heliconius melpomene aglaope ERS070272 ERR053906 Heliconius melpomene amaryllis ERS070273 ERR053907 Heliconius melpomene amaryllis ERS070273 ERR053908 Heliconius melpomene amaryllis ERS070274 ERR053909 Heliconius melpomene amaryllis ERS070274 ERR053910 Heliconius melpomene amaryllis ERS070275 ERR053911 Heliconius melpomene amaryllis ERS070275 ERR053912 Heliconius melpomene amaryllis ERS070276 ERR053913 Heliconius melpomene amaryllis ERS070276 ERR053914 Heliconius melpomene amaryllis ERS070277 ERR053915 Heliconius melpomene amaryllis ERS070277 ERR053916 Heliconius heurippa ERS074398 ERR055402 Heliconius heurippa ERS074399 ERR055403 Heliconius heurippa ERS074400 ERR055404 Heliconius heurippa ERS074401 ERR055405 Heliconius heurippa ERS074402 ERR055406 Heliconius heurippa ERS074403 ERR055407 Heliconius cydno cordula ERS074404 ERR055408 Heliconius cydno cordula ERS074405 ERR055409 Heliconius cydno cordula ERS074406 ERR055410 Heliconius cydno cordula ERS074407 ERR055411 Heliconius cydno cordula ERS074408 ERR055412 Heliconius cydno cordula ERS074409 ERR055413 Heliconius cydno chioneus ERS074410 ERR055414 Heliconius cydno chioneus ERS074411 ERR055415 Heliconius cydno weymeri ERS074412 ERR055416 Heliconius cydno weymeri ERS074413 ERR055417 Heliconius cydno cydnides ERS074414 ERR055418 Heliconius cydno cydnides ERS074415 ERR055419 Heliconius melpomene melpomene ERS074416 ERR055420 Heliconius melpomene melpomene ERS074417 ERR055421 Heliconius melpomene melpomene ERS074418 ERR055422 Heliconius melpomene melpomene ERS074419 ERR055423 Heliconius melpomene melpomene ERS074420 ERR055424 Heliconius melpomene melpomene ERS074421 ERR055425 Heliconius melpomene melpomene ERS074422 ERR055426 Heliconius melpomene melpomene ERS074423 ERR055427 Heliconius melpomene melpomene ERS074424 ERR055428 Heliconius melpomene melpomene ERS074425 ERR055429 Heliconius melpomene rosina ERS074426 ERR055430 Heliconius melpomene rosina ERS074427 ERR055431 Heliconius melpomene ecuadorensis ERS074428 ERR055432 Heliconius melpomene ecuadorensis ERS074429 ERR055433
The data look kind of weird because there are a lot of As in the beginning. I figured out it is just because the sequences are sorted alphabetically.
This study includes several technical replicates per sequenced individuals, which we combine into a single file for each individual here.
%%bash
mkdir -p empirical_5/fastq/merged
##IPython code
import glob
import os
## get all fastq files
taxa = [i.split("/")[-1].rsplit('_',1)[0] for i in \
glob.glob("empirical_5/fastq/*.gz")]
## iterate over individuals and create merge file
for taxon in set(taxa):
print taxon, "merged"
reps = glob.glob("empirical_5/fastq/"+taxon+"*")
if len(reps) > 1:
## get replicate files
with open("empirical_5/fastq/merged/"+taxon+".fastq.gz", 'wb') as out:
for rep in reps:
with open(rep, 'rb') as tempin:
out.write(tempin.read())
else:
dirs, ff = os.path.split(reps[0])
os.rename(os.path.join(dirs,ff), os.path.join(dirs,"merged",taxon+".fastq.gz"))
melpomene_ERS070276 merged melpomene_ERS070277 merged melpomene_ERS070274 merged melpomene_ERS070275 merged melpomene_ERS070272 merged melpomene_ERS070273 merged melpomene_ERS070270 merged melpomene_ERS070271 merged timareta_ERS070247 merged timareta_ERS070246 merged timareta_ERS070245 merged timareta_ERS070249 merged timareta_ERS070248 merged cydno_ERS074413 merged cydno_ERS074412 merged cydno_ERS074411 merged cydno_ERS074410 merged cydno_ERS074415 merged cydno_ERS074414 merged hecale_ERS070253 merged hecale_ERS070252 merged hecale_ERS070256 merged hecale_ERS070255 merged hecale_ERS070254 merged melpomene_ERS074418 merged melpomene_ERS074419 merged melpomene_ERS074416 merged melpomene_ERS074417 merged melpomene_ERS070269 merged melpomene_ERS070268 merged timareta_ERS070250 merged timareta_ERS070251 merged
%%bash
pyrad --version
pyRAD 3.0.63
%%bash
## remove old params file if it exists
rm params.txt
## create a new default params file
pyrad -n
new params.txt file created
The data here are from Illumina Casava <1.8, so the phred scores are offset by 64 instead of 33, so we use that in the params file below.
%%bash
## substitute new parameters into file
sed -i '/## 1. /c\empirical_5/ ## 1. working directory ' params.txt
sed -i '/## 6. /c\TGCAG ## 6. cutters ' params.txt
sed -i '/## 7. /c\20 ## 7. N processors ' params.txt
sed -i '/## 9. /c\6 ## 9. NQual ' params.txt
sed -i '/## 10./c\.85 ## 10. clust threshold ' params.txt
sed -i '/## 12./c\4 ## 12. MinCov ' params.txt
sed -i '/## 13./c\10 ## 13. maxSH ' params.txt
sed -i '/## 14./c\empirical_5_m4 ## 14. output name ' params.txt
sed -i '/## 18./c\empirical_5/fastq/merged/*.gz ## 18. data location ' params.txt
sed -i '/## 20./c\64 ## 20. phred offset ' params.txt
sed -i '/## 29./c\2,2 ## 29. trim overhang ' params.txt
sed -i '/## 30./c\p,n,s ## 30. output formats ' params.txt
cat params.txt
==** parameter inputs for pyRAD version 3.0.63 **======================== affected step == empirical_5/ ## 1. working directory ./*.fastq.gz ## 2. Loc. of non-demultiplexed files (if not line 18) (s1) ./*.barcodes ## 3. Loc. of barcode file (if not line 18) (s1) vsearch ## 4. command (or path) to call vsearch (or usearch) (s3,s6) muscle ## 5. command (or path) to call muscle (s3,s7) TGCAG ## 6. cutters 20 ## 7. N processors 6 ## 8. Mindepth: min coverage for a cluster (s4,s5) 6 ## 9. NQual .85 ## 10. clust threshold rad ## 11. Datatype: rad,gbs,pairgbs,pairddrad,(others:see docs)(all) 4 ## 12. MinCov 10 ## 13. maxSH empirical_5 ## 14. output name ==== optional params below this line =================================== affected step == ## 15.opt.: select subset (prefix* only selector) (s2-s7) ## 16.opt.: add-on (outgroup) taxa (list or prefix*) (s6,s7) ## 17.opt.: exclude taxa (list or prefix*) (s7) empirical_5/fastq/merged/*.gz ## 18. data location ## 19.opt.: maxM: N mismatches in barcodes (def= 1) (s1) 64 ## 20. phred offset ## 21.opt.: filter: def=0=NQual 1=NQual+adapters. 2=strict (s2) ## 22.opt.: a priori E,H (def= 0.001,0.01, if not estimated) (s5) ## 23.opt.: maxN: max Ns in a cons seq (def=5) (s5) ## 24.opt.: maxH: max heterozyg. sites in cons seq (def=5) (s5) ## 25.opt.: ploidy: max alleles in cons seq (def=2;see docs) (s4,s5) ## 26.opt.: maxSNPs: (def=100). Paired (def=100,100) (s7) ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7) ## 28.opt.: random number seed (def. 112233) (s3,s6,s7) 2,2 ## 29. trim overhang p,n,s ## 30. output formats ## 31.opt.: maj. base call at depth>x<mindepth (def.x=mindepth) (s5) ## 32.opt.: keep trimmed reads (def=0). Enter min length. (s2) ## 33.opt.: max stack size (int), def= max(500,mean+2*SD) (s3) ## 34.opt.: minDerep: exclude dereps with <= N copies, def=1 (s3) ## 35.opt.: use hierarchical clustering (def.=0, 1=yes) (s6) ## 36.opt.: repeat masking (def.=1='dust' method, 0=no) (s3,s6) ## 37.opt.: vsearch max threads per job (def.=6; see docs) (s3,s6) ==== optional: list group/clade assignments below this line (see docs) ==================
%%bash
pyrad -p params.txt -s 234567 >> log.txt 2>&1
%%bash
sed -i '/## 12./c\2 ## 12. MinCov ' params.txt
sed -i '/## 14./c\empirical_5_m2 ## 14. output name ' params.txt
pyrad -p params.txt -s 7 >> log.txt 2>&1
We are interested in the relationship between the amount of input (raw) data between any two samples, the average coverage they recover when clustered together, and the phylogenetic distances separating samples.
The average number of raw reads per sample is 1.36M.
## read in the data
s2dat = pd.read_table("empirical_5/stats/s2.rawedit.txt", header=0, nrows=55)
## print summary stats
print s2dat["passed.total"].describe()
## find which sample has the most raw data
maxraw = s2dat["passed.total"].max()
print "\nmost raw data in sample:"
print s2dat['sample '][s2dat['passed.total']==maxraw]
count 54.000000 mean 5315416.333333 std 2877440.775877 min 2022986.000000 25% 3310234.000000 50% 3784971.000000 75% 8453907.500000 max 12011939.000000 Name: passed.total, dtype: float64 most raw data in sample: 15 hecale_ERS070255 Name: sample , dtype: object
pyrad v.3.0.63 outputs depth information for each sample which I read in here and plot. First let's ask which sample has the highest depth of coverage. The std of coverages is pretty low in this data set compared to several others.
## read in the s3 results
s5dat = pd.read_table("empirical_5/stats/s3.clusters.txt", header=0, nrows=54)
## print summary stats
print "summary of means\n=================="
print s5dat['dpt.me'].describe()
## print summary stats
print "\nsummary of std\n=================="
print s5dat['dpt.sd'].describe()
## print summary stats
print "\nsummary of proportion lowdepth\n=================="
print pd.Series(1-s5dat['d>5.tot']/s5dat["total"]).describe()
## find which sample has the greatest depth of retained loci
max_hiprop = (s5dat["d>5.tot"]/s5dat["total"]).max()
print "\nhighest coverage in sample:"
print s5dat['taxa'][s5dat['d>5.tot']/s5dat["total"]==max_hiprop]
summary of means ================== count 54.000000 mean 47.842574 std 19.705225 min 15.745000 25% 38.288250 50% 44.435500 75% 56.646000 max 92.441000 Name: dpt.me, dtype: float64 summary of std ================== count 54.000000 mean 165.924389 std 76.682339 min 34.403000 25% 118.026500 50% 150.211500 75% 215.882000 max 381.038000 Name: dpt.sd, dtype: float64 summary of proportion lowdepth ================== count 54.000000 mean 0.359732 std 0.135457 min 0.151041 25% 0.257660 50% 0.347190 75% 0.466101 max 0.669051 dtype: float64 highest coverage in sample: 53 timareta_ERS070251 Name: taxa, dtype: object
## print mean and std of coverage for the highest coverage sample
with open("empirical_5/clust.85/timareta_ERS070251.depths", 'rb') as indat:
depths = np.array(indat.read().strip().split(","), dtype=int)
print depths.mean(), depths.std()
44.0298150232 117.15674992
Green shows the loci that were discarded and orange the loci that were retained. The majority of data were discarded for being too low of coverage.
import toyplot
import toyplot.svg
import numpy as np
## read in the depth information for this sample
with open("empirical_5/clust.85/timareta_ERS070251.depths", 'rb') as indat:
depths = np.array(indat.read().strip().split(","), dtype=int)
## make a barplot in Toyplot
canvas = toyplot.Canvas(width=350, height=300)
axes = canvas.axes(xlabel="Depth of coverage (N reads)",
ylabel="N loci",
label="dataset5/sample=timareta_ERS070251")
## select the loci with depth > 5 (kept)
keeps = depths[depths>5]
## plot kept and discarded loci
edat = np.histogram(depths, range(30)) # density=True)
kdat = np.histogram(keeps, range(30)) #, density=True)
axes.bars(edat)
axes.bars(kdat)
#toyplot.svg.render(canvas, "empirical_5_depthplot.svg")
<toyplot.mark.BarMagnitudes at 0x7ff22c0e08d0>
cat empirical_5/stats/empirical_5_m4.stats
125876 ## loci with > minsp containing data 119819 ## loci with > minsp containing data & paralogs removed 119819 ## loci with > minsp containing data & paralogs removed & final filtering ## number of loci recovered in final data set for each taxon. taxon nloci cydno_ERS074404 37350 cydno_ERS074405 38572 cydno_ERS074406 37734 cydno_ERS074407 37863 cydno_ERS074408 37843 cydno_ERS074409 38464 cydno_ERS074410 33024 cydno_ERS074411 30593 cydno_ERS074412 22874 cydno_ERS074413 22699 cydno_ERS074414 22996 cydno_ERS074415 22966 hecale_ERS070252 38275 hecale_ERS070253 38129 hecale_ERS070254 38378 hecale_ERS070255 38013 hecale_ERS070256 37917 heurippa_ERS074398 38458 heurippa_ERS074399 38354 heurippa_ERS074400 37843 heurippa_ERS074401 38376 heurippa_ERS074402 38231 heurippa_ERS074403 38201 melpomene_ERS070268 40055 melpomene_ERS070269 40284 melpomene_ERS070270 39727 melpomene_ERS070271 39908 melpomene_ERS070272 38809 melpomene_ERS070273 40151 melpomene_ERS070274 36786 melpomene_ERS070275 40068 melpomene_ERS070276 39807 melpomene_ERS070277 37965 melpomene_ERS074416 33006 melpomene_ERS074417 36365 melpomene_ERS074418 36483 melpomene_ERS074419 35993 melpomene_ERS074420 38888 melpomene_ERS074421 38023 melpomene_ERS074422 37885 melpomene_ERS074423 38333 melpomene_ERS074424 30796 melpomene_ERS074425 33266 melpomene_ERS074426 35757 melpomene_ERS074427 37341 melpomene_ERS074428 32355 melpomene_ERS074429 31705 timareta_ERS070245 40166 timareta_ERS070246 38704 timareta_ERS070247 37990 timareta_ERS070248 39932 timareta_ERS070249 40204 timareta_ERS070250 38882 timareta_ERS070251 37739 ## nloci = number of loci with data for exactly ntaxa ## ntotal = number of loci for which at least ntaxa have data ntaxa nloci saved ntotal 1 - 2 - - 3 - - 4 26561 * 119819 5 13321 * 93258 6 6556 * 79937 7 5141 * 73381 8 4286 * 68240 9 3795 * 63954 10 3316 * 60159 11 2938 * 56843 12 2831 * 53905 13 2541 * 51074 14 2294 * 48533 15 2163 * 46239 16 1941 * 44076 17 1816 * 42135 18 1740 * 40319 19 1691 * 38579 20 1663 * 36888 21 1568 * 35225 22 1426 * 33657 23 1412 * 32231 24 1368 * 30819 25 1170 * 29451 26 1217 * 28281 27 1130 * 27064 28 1122 * 25934 29 1032 * 24812 30 1028 * 23780 31 1011 * 22752 32 1003 * 21741 33 947 * 20738 34 922 * 19791 35 881 * 18869 36 838 * 17988 37 806 * 17150 38 811 * 16344 39 802 * 15533 40 849 * 14731 41 800 * 13882 42 849 * 13082 43 864 * 12233 44 902 * 11369 45 959 * 10467 46 746 * 9508 47 794 * 8762 48 903 * 7968 49 1171 * 7065 50 1834 * 5894 51 480 * 4060 52 611 * 3580 53 969 * 2969 54 2000 * 2000 ## nvar = number of loci containing n variable sites (pis+autapomorphies). ## sumvar = sum of variable sites (SNPs). ## pis = number of loci containing n parsimony informative sites. ## sumpis = sum of parsimony informative sites. nvar sumvar PIS sumPIS 0 10137 0 29807 0 1 9141 9141 16094 16094 2 8326 25793 11449 38992 3 7494 48275 8901 65695 4 7121 76759 7571 95979 5 6775 110634 6655 129254 6 6705 150864 5800 164054 7 6231 194481 5399 201847 8 6035 242761 4762 239943 9 5580 292981 4234 278049 10 5326 346241 3797 316019 11 4909 400240 3255 351824 12 4535 454660 2658 383720 13 4217 509481 2376 414608 14 3864 563577 1919 441474 15 3381 614292 1423 462819 16 3056 663188 1085 480179 17 2665 708493 844 494527 18 2365 751063 594 505219 19 2096 790887 407 512952 20 1780 826487 297 518892 21 1596 860003 168 522420 22 1319 889021 129 525258 23 1131 915034 85 527213 24 912 936922 39 528149 25 706 954572 32 528949 26 563 969210 17 529391 27 465 981765 9 529634 28 394 992797 5 529774 29 277 1000830 7 529977 30 228 1007670 1 530007 31 152 1012382 0 530007 32 118 1016158 0 530007 33 88 1019062 0 530007 34 48 1020694 0 530007 35 28 1021674 0 530007 36 13 1022142 0 530007 37 15 1022697 0 530007 38 7 1022963 0 530007 39 5 1023158 0 530007 40 4 1023318 0 530007 41 3 1023441 0 530007 42 3 1023567 0 530007 43 1 1023610 0 530007 44 1 1023654 0 530007 45 0 1023654 0 530007 46 1 1023700 0 530007 47 0 1023700 0 530007 48 1 1023748 0 530007 49 0 1023748 0 530007 50 1 1023798 0 530007 total var= 1023798 total pis= 530007 sampled unlinked SNPs= 109682 sampled unlinked bi-allelic SNPs= 89479
%%bash
head -n 20 empirical_5/stats/empirical_5_m2.stats
218875 ## loci with > minsp containing data 212818 ## loci with > minsp containing data & paralogs removed 212818 ## loci with > minsp containing data & paralogs removed & final filtering ## number of loci recovered in final data set for each taxon. taxon nloci cydno_ERS074404 40114 cydno_ERS074405 41446 cydno_ERS074406 40560 cydno_ERS074407 40694 cydno_ERS074408 40758 cydno_ERS074409 41320 cydno_ERS074410 35272 cydno_ERS074411 32803 cydno_ERS074412 51705 cydno_ERS074413 47291 cydno_ERS074414 52449 cydno_ERS074415 52917
%%bash
## raxml argumement w/ ...
raxmlHPC-PTHREADS-AVX -f a -m GTRGAMMA -N 100 -x 12345 -p 12345 -T 20 \
-w /home/deren/Documents/RADmissing/empirical_5/ \
-n empirical_5_m4 -s empirical_5/outfiles/empirical_5_m4.phy
%%bash
## raxml argumement w/ ...
raxmlHPC-PTHREADS-AVX -f a -m GTRGAMMA -N 100 -x 12345 -p 12345 -T 20 \
-w /home/deren/Documents/RADmissing/empirical_5/ \
-n empirical_5_m2 -s empirical_5/outfiles/empirical_5_m2.phy
%%bash
head -n 20 empirical_5/RAxML_info.empirical_5_m4
This is RAxML version 8.0.16 released by Alexandros Stamatakis on March 21 2014. With greatly appreciated code contributions by: Andre Aberer (HITS) Simon Berger (HITS) Alexey Kozlov (HITS) Nick Pattengale (Sandia) Wayne Pfeiffer (SDSC) Akifumi S. Tanabe (NRIFS) David Dao (KIT) Charlie Taylor (UF) Alignment has 546632 distinct alignment patterns Proportion of gaps and completely undetermined characters in this alignment: 70.04% RAxML rapid bootstrapping and subsequent ML search
%%bash
head -n 20 empirical_5/RAxML_info.empirical_5_m2
This is RAxML version 8.0.16 released by Alexandros Stamatakis on March 21 2014. With greatly appreciated code contributions by: Andre Aberer (HITS) Simon Berger (HITS) Alexey Kozlov (HITS) Nick Pattengale (Sandia) Wayne Pfeiffer (SDSC) Akifumi S. Tanabe (NRIFS) David Dao (KIT) Charlie Taylor (UF) Alignment has 338778 distinct alignment patterns Proportion of gaps and completely undetermined characters in this alignment: 80.98% RAxML rapid bootstrapping and subsequent ML search
%load_ext rpy2.ipython
%%R
library(ape)
ltre <- read.tree()
mean(cophenetic.phylo(ltre))
[1] 0.05029949