This notebook shows how to use the ipyrad.analysis
toolkit to generate structure input files that use population information.
# conda install ipyrad -c ipyrad
# conda install structure clumpp -c ipyrad
# conda install toytree -c eaton-lab
import ipyrad.analysis as ipa
import toyplot
If you include a 'mapfile' then we will use locus information to subsample just a single SNP from each locus so that the resulting data file will meet the expectations of structure that SNPs are "unlinked". If you create multiple replicates files using different random seeds then different SNPs will be selected in each rep.
s = ipa.structure(
name="test",
workdir="analysis-structure",
data="analysis-ipyrad/ped_min10_outfiles/ped_min10.str",
mapfile="analysis-ipyrad/ped_min10_outfiles/ped_min10.snps.map",
)
These values are used to generate the "mainparams" and "extraparams" files for structure.
## set run parameters (you probably want to run >10X this long)
s.mainparams.burnin = 1000
s.mainparams.numreps = 5000
## tell structure to expect popdata & popflag
s.mainparams.popdata = 1
s.mainparams.popflag = 1
## print all mainparams
s.mainparams
burnin 1000 extracols 0 label 1 locdata 0 mapdistances 0 markernames 0 markovphase 0 missing -9 notambiguous -999 numreps 5000 onerowperind 0 phased 0 phaseinfo 0 phenotype 0 ploidy 2 popdata 1 popflag 1 recessivealleles 0
## tell structure to use popinfo
s.extraparams.usepopinfo = 1
## print all other extraparams
s.extraparams
admburnin 500 alpha 1.0 alphamax 10.0 alphapriora 1.0 alphapriorb 2.0 alphapropsd 0.025 ancestdist 0 ancestpint 0.9 computeprob 1 echodata 0 fpriormean 0.01 fpriorsd 0.05 freqscorr 1 gensback 2 inferalpha 1 inferlambda 0 intermedsave 0 lambda_ 1.0 linkage 0 locispop 0 locprior 0 locpriorinit 1.0 log10rmax 1.0 log10rmin -4.0 log10rpropsd 0.1 log10rstart -2.0 maxlocprior 20.0 metrofreq 10 migrprior 0.01 noadmix 0 numboxes 1000 onefst 0 pfrompopflagonly 0 popalphas 0 popspecificlambda 0 printlambda 1 printlikes 0 printnet 1 printqhat 0 printqsum 1 randomize 0 reporthitrate 0 seed 12345 sitebysite 0 startatpopinfo 0 unifprioralpha 1 updatefreq 10000 usepopinfo 1
s.header
labels | popdata | popflag | locdata | phenotype | |
---|---|---|---|---|---|
0 | 29154_superba | ||||
1 | 30556_thamno | ||||
2 | 30686_cyathophylla | ||||
3 | 32082_przewalskii | ||||
4 | 33413_thamno | ||||
5 | 33588_przewalskii | ||||
6 | 35236_rex | ||||
7 | 35855_rex | ||||
8 | 38362_rex | ||||
9 | 39618_rex | ||||
10 | 40578_rex | ||||
11 | 41478_cyathophylloides | ||||
12 | 41954_cyathophylloides |
popdata
is the a priori population assignment of an individual to a population. Assignments should be non-zero integers (e.g., 1, 2, 3). Zero is reserved to mean that there is no a priori assignment. popflag
indicates whether or not to use the population assignment in the analysis (1) or to leave it to be inferred (0). So in the example below seven samples have assigned populations (popflag=1), and six samples will have their population assignments inferred (popflag=0). The popdata information will only be used for the seven individuals with assigned pops.
## assign popdata and popflag by entering a list of values
s.popdata = [1, 3, 1, 2, 3, 2, 3, 3, 3, 3, 3, 1, 1]
s.popflag = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
## print the header information
s.header
labels | popdata | popflag | locdata | phenotype | |
---|---|---|---|---|---|
0 | 29154_superba | 1 | 1 | ||
1 | 30556_thamno | 3 | 1 | ||
2 | 30686_cyathophylla | 1 | 0 | ||
3 | 32082_przewalskii | 2 | 0 | ||
4 | 33413_thamno | 3 | 0 | ||
5 | 33588_przewalskii | 2 | 0 | ||
6 | 35236_rex | 3 | 0 | ||
7 | 35855_rex | 3 | 0 | ||
8 | 38362_rex | 3 | 1 | ||
9 | 39618_rex | 3 | 1 | ||
10 | 40578_rex | 3 | 1 | ||
11 | 41478_cyathophylloides | 1 | 1 | ||
12 | 41954_cyathophylloides | 1 | 1 |
This will write the .str
file (and subsample SNPs if you included a mapfile) with the header information included, and it will write a mainparams and extraparams file with the parameter settings that we entered above.
## writes three input files and then call structure yourself
s.write_structure_files(kpop=3)
('/home/deren/Documents/ipyrad/tests/analysis-structure/tmp-test-3-1.mainparams.txt', '/home/deren/Documents/ipyrad/tests/analysis-structure/tmp-test-3-1.extraparams.txt', '/home/deren/Documents/ipyrad/tests/analysis-structure/tmp-test-3-1.strfile.txt')
If you start an ipcluster instance (see other tutorials) you can submit structure jobs directly to the cluster and easily collect the results, like below.
## connect to a running ipcluster instance
import ipyparallel as ipp
ipyclient = ipp.Client()
print "connected to {} cores".format(len(ipyclient))
connected to 40 cores
## submit job replicates to ipyclient
s.run(kpop=3, nreps=3, ipyclient=ipyclient)
submitted 3 structure jobs [test-K-3]
## wait for jobs to finish
ipyclient.wait()
True
## get table of summarized results
table = s.get_clumpp_table(3)
mean scores across 3 replicates.
## reorder table by membership in groups
table.sort_values(by=[0, 1, 2], inplace=True)
## build barplot
canvas = toyplot.Canvas(width=500, height=250)
axes = canvas.cartesian(bounds=("10%", "90%", "10%", "45%"))
axes.bars(table)
## add labels to x-axis
ticklabels = [i for i in table.index.tolist()]
axes.x.ticks.locator = toyplot.locator.Explicit(labels=ticklabels)
axes.x.ticks.labels.angle = -60
axes.x.ticks.show = True
axes.x.ticks.labels.offset = 10
axes.x.ticks.labels.style = {"font-size": "12px"}