ipyrad testing tutorial

Getting started

Import ipyrad and remove previous test files if they are already present

In [1]:
## import modules
import ipyrad as ip                ## 
print "version", ip.__version__    ## print version
version 0.1.47
In [2]:
## clear data from test directory if it already exists
import shutil
import os
if os.path.exists("./test_rad/"):
    shutil.rmtree("./test_rad/")

Assembly and Sample objects

Assembly and Sample objects are used by ipyrad to access data stored on disk and to manipulate it. Each biological sample in a data set is represented in a Sample object, and a set of Samples is stored inside an Assembly object. The Assembly object has functions to assemble the data, and stores a log of all steps performed and the resulting statistics of those steps. Assembly objects can be copied or merged to allow branching events where different parameters can subsequently be applied to different Assemblies going forward. Examples of this are shown below.

We'll being by creating a single Assembly object named "data1". It is created with a set of default assembly parameters and without any Samples linked to it. The name provided will be used in the output files that this Assembly creates.

In [3]:
## create an Assembly object named data1. 
data1 = ip.Assembly("data1")
  New Assembly: data1

Modifying assembly parameters

An Assembly object's parameter settings can be viewed using its get_params() function. To get more detailed information about all parameters use the function ip.get_params_info() or select a single parameter with ip.get_params_info(N), where N is the number or string representation of a parameter. Assembly objects have a function set_params() that is used to modify parameters, like below.

In [4]:
## modify parameters for this Assembly object
data1.set_params('project_dir', "./test_rad")
data1.set_params('raw_fastq_path', "./data/sim_rad_test_R1_.fastq.gz")
data1.set_params('barcodes_path', "./data/sim_rad_test_barcodes.txt")
data1.set_params('filter_adapters', 0)
data1.set_params('datatype', 'rad')

## test on real data
#data1.set_params(2, "~/Dropbox/UO_C353_1.fastq.part-aa.gz")
#data1.set_params(3, "/home/deren/Dropbox/Viburnum_revised.barcodes")

## print the new parameters to screen
data1.get_params()
  0   assembly_name               data1                                        
  1   project_dir                 ./test_rad                                   
  2   raw_fastq_path              ./data/sim_rad_test_R1_.fastq.gz             
  3   barcodes_path               ./data/sim_rad_test_barcodes.txt             
  4   sorted_fastq_path                                                        
  5   assembly_method             denovo                                       
  6   reference_sequence                                                       
  7   datatype                    rad                                          
  8   restriction_overhang        ('TGCAG', '')                                
  9   max_low_qual_bases          5                                            
  10  phred_Qscore_offset         33                                           
  11  mindepth_statistical        6                                            
  12  mindepth_majrule            6                                            
  13  maxdepth                    1000                                         
  14  clust_threshold             0.85                                         
  15  max_barcode_mismatch        1                                            
  16  filter_adapters             0                                            
  17  filter_min_trim_len         35                                           
  18  max_alleles_consens         2                                            
  19  max_Ns_consens              (5, 5)                                       
  20  max_Hs_consens              (8, 8)                                       
  21  min_samples_locus           4                                            
  22  max_SNPs_locus              (100, 100)                                   
  23  max_Indels_locus            (5, 99)                                      
  24  max_shared_Hs_locus         0.25                                         
  25  edit_cutsites               (0, 0)                                       
  26  trim_overhang               (1, 2, 2, 1)                                 
  27  output_formats              *                                            
  28  pop_assign_file                                                          
  29  excludes                                                                 
  30  outgroups                                                                

Starting data

If the data are already demultiplexed then fastq files can be linked directly to the Assembly object, which in turn will create new Sample objects from them, or link them to existing Sample objects based on the file names (or pair of fastq files for paired data files). The files may be gzip compressed. If the data are not demultiplexed then you will have to run the step1 function below to demultiplex the raw data.

In [5]:
## This would link fastq files from the 'sorted_fastq_path' if present
## Here it raises an error because there are no files in the sorted_fastq_path

## data1.link_fastqs() #path="./test_rad/data1_fastqs/*")

Step 1: Demultiplexing raw data files

Step1 uses barcode information to demultiplex data files found in param 2 ['raw_fastq_path']. It will create a Sample object for each barcoded sample. Below we use the step1() function to demultiplex. The stats attribute of an Assembly object is returned as a pandas data frame.

In [6]:
## run step 1 to demultiplex the data
data1.step1(force=True)

## print the results for each Sample in data1
print data1.stats
    Saving Assembly.
      state  reads_raw
1A_0      1      20099
1B_0      1      19977
1C_0      1      20114
1D_0      1      19895
2E_0      1      19928
2F_0      1      19934
2G_0      1      20026
2H_0      1      19936
3I_0      1      20084
3J_0      1      20011
3K_0      1      20117
3L_0      1      19901

Step 2: Filter reads

If for some reason we wanted to execute on just a subsample of our data, we could do this by selecting only certain samples to call the step2 function on. Because step2 is a function of data, it will always execute with the parameters that are linked to data.

In [7]:
## example of ways to run step 2 to filter and trim reads
data1.step2()  #["1B_0", "1C_0"], force=True)      ## run on one or more samples

## print the results
print data1.statsfiles.s2
    Saving Assembly.
      reads_raw  filtered_by_qscore  filtered_by_adapter  reads_passed
1A_0      20099                   0                    0         20099
1B_0      19977                   0                    0         19977
1C_0      20114                   0                    0         20114
1D_0      19895                   0                    0         19895
2E_0      19928                   0                    0         19928
2F_0      19934                   0                    0         19934
2G_0      20026                   0                    0         20026
2H_0      19936                   0                    0         19936
3I_0      20084                   0                    0         20084
3J_0      20011                   0                    0         20011
3K_0      20117                   0                    0         20117
3L_0      19901                   0                    0         19901

Branching Assembly objects

Let's imagine at this point that we are interested in clustering our data at two different clustering thresholds. We will try 0.90 and 0.85. First we need to make a copy/branch of the Assembly object. This will inherit the locations of the data linked in the first object, but diverge in any future applications to the object. Thus, the two Assembly objects can share the same working directory, and inherit shared files, but will diverge in creating new files linked to only one or the other. You can view the directories linked to an Assembly object with the .dirs argument, shown below. The prefix_outname (param 14) of the new object is automatically set to the Assembly object name.

In [8]:
## create a branch of our Assembly object
data2 = data1.branch(newname="data2")

## set clustering threshold to 0.90
data2.set_params("clust_threshold", 0.90)

## look at inherited parameters
data2.get_params()
  0   assembly_name               data1                                        
  1   project_dir                 ./test_rad                                   
  2   raw_fastq_path              ./data/sim_rad_test_R1_.fastq.gz             
  3   barcodes_path               ./data/sim_rad_test_barcodes.txt             
  4   sorted_fastq_path                                                        
  5   assembly_method             denovo                                       
  6   reference_sequence                                                       
  7   datatype                    rad                                          
  8   restriction_overhang        ('TGCAG', '')                                
  9   max_low_qual_bases          5                                            
  10  phred_Qscore_offset         33                                           
  11  mindepth_statistical        6                                            
  12  mindepth_majrule            6                                            
  13  maxdepth                    1000                                         
  14  clust_threshold             0.9                                          
  15  max_barcode_mismatch        1                                            
  16  filter_adapters             0                                            
  17  filter_min_trim_len         35                                           
  18  max_alleles_consens         2                                            
  19  max_Ns_consens              (5, 5)                                       
  20  max_Hs_consens              (8, 8)                                       
  21  min_samples_locus           4                                            
  22  max_SNPs_locus              (100, 100)                                   
  23  max_Indels_locus            (5, 99)                                      
  24  max_shared_Hs_locus         0.25                                         
  25  edit_cutsites               (0, 0)                                       
  26  trim_overhang               (1, 2, 2, 1)                                 
  27  output_formats              *                                            
  28  pop_assign_file                                                          
  29  excludes                                                                 
  30  outgroups                                                                

Step 3: clustering within-samples

In [9]:
import ipyrad as ip
data1 = ip.load_json("test_rad/data1.json")
  loading Assembly: data1 [~/Documents/ipyrad/tests/test_rad/data1.json]
In [10]:
## run step 3 to cluster reads within samples using vsearch
data1.step3(force=True)

## print the results
print data1.statsfiles.s3
    Saving Assembly.
      avg_depth_mj  avg_depth_stat  clusters_hidepth  clusters_total  \
1A_0        20.099          20.099              1000            1000   
1B_0        19.977          19.977              1000            1000   
1C_0        20.114          20.114              1000            1000   
1D_0        19.895          19.895              1000            1000   
2E_0        19.928          19.928              1000            1000   
2F_0        19.934          19.934              1000            1000   
2G_0        20.026          20.026              1000            1000   
2H_0        19.936          19.936              1000            1000   
3I_0        20.084          20.084              1000            1000   
3J_0        20.011          20.011              1000            1000   
3K_0        20.117          20.117              1000            1000   
3L_0        19.901          19.901              1000            1000   

      avg_depth_total  
1A_0           20.099  
1B_0           19.977  
1C_0           20.114  
1D_0           19.895  
2E_0           19.928  
2F_0           19.934  
2G_0           20.026  
2H_0           19.936  
3I_0           20.084  
3J_0           20.011  
3K_0           20.117  
3L_0           19.901  
In [11]:
## run step 3 to cluster reads in data2 at 0.90 sequence similarity
data2.step3(force=True) 

## print the results
print data2.stats
    Saving Assembly.
      state  reads_raw  reads_filtered  clusters_total  clusters_hidepth
1A_0      3      20099           20099            1000              1000
1B_0      3      19977           19977            1000              1000
1C_0      3      20114           20114            1000              1000
1D_0      3      19895           19895            1000              1000
2E_0      3      19928           19928            1000              1000
2F_0      3      19934           19934            1000              1000
2G_0      3      20026           20026            1000              1000
2H_0      3      19936           19936            1000              1000
3I_0      3      20084           20084            1000              1000
3J_0      3      20011           20011            1000              1000
3K_0      3      20117           20117            1000              1000
3L_0      3      19901           19901            1000              1000

Branched Assembly objects

You can see below that the two Assembly objects are now working with several shared directories (working, fastq, edits) but with different clust directories (data1_clust_0.85 and data2_clust_0.9).

In [12]:
print "data1 directories:"
for (i,j) in data1.dirs.items():
    print "{}\t{}".format(i, j)
    
print "\ndata2 directories:"
for (i,j) in data2.dirs.items():
    print "{}\t{}".format(i, j)
data1 directories:
project	/home/deren/Documents/ipyrad/tests/test_rad
edits	/home/deren/Documents/ipyrad/tests/test_rad/data1_edits
fastqs	/home/deren/Documents/ipyrad/tests/test_rad/data1_fastqs
outfiles	
clusts	/home/deren/Documents/ipyrad/tests/test_rad/data1_clust_0.85
consens	

data2 directories:
consens	
project	/home/deren/Documents/ipyrad/tests/test_rad
edits	/home/deren/Documents/ipyrad/tests/test_rad/data1_edits
fastqs	/home/deren/Documents/ipyrad/tests/test_rad/data1_fastqs
outfiles	
clusts	/home/deren/Documents/ipyrad/tests/test_rad/data2_clust_0.9

Saving stats outputs

Example: two simple ways to save the stats data frame to a file.

In [13]:
data1.stats.to_csv("data1_results.csv", sep="\t")
data1.stats.to_latex("data1_results.tex")

Example of plotting with ipyrad

There are a a few simple plotting functions in ipyrad useful for visualizing results. These are in the module ipyrad.plotting. Below is an interactive plot for visualizing the distributions of coverages across the 12 samples in the test data set.

In [14]:
import ipyrad.plotting as iplot

## plot for one or more selected samples
#iplot.depthplot(data1, ["1A_0", "1B_0"])

## plot for all samples in data1
iplot.depthplot(data1)

## save plot as pdf and html
#iplot.depthplot(data1, outprefix="testfig")
010203040500501001501A_0010203040500501001501B_0010203040500501001501C_0010203040500501001501D_0010203040500501001502E_0010203040500501001502F_0010203040500501001502G_0010203040500501001502H_0010203040500501001503I_0010203040500501001503J_0010203040500501001503K_0010203040500501001503L_0

Step 4: Joint estimation of heterozygosity and error rate

In [15]:
## run step 4
data1.step4(force=True)

## print the results
print data1.stats
    Saving Assembly.
      clusters_hidepth  clusters_total  error_est  hetero_est  reads_filtered  \
1A_0              1000            1000   0.000756    0.002223           20099   
1B_0              1000            1000   0.000775    0.001910           19977   
1C_0              1000            1000   0.000751    0.002260           20114   
1D_0              1000            1000   0.000731    0.001876           19895   
2E_0              1000            1000   0.000770    0.001809           19928   
2F_0              1000            1000   0.000725    0.002103           19934   
2G_0              1000            1000   0.000707    0.001910           20026   
2H_0              1000            1000   0.000755    0.002215           19936   
3I_0              1000            1000   0.000784    0.001877           20084   
3J_0              1000            1000   0.000741    0.001698           20011   
3K_0              1000            1000   0.000767    0.001821           20117   
3L_0              1000            1000   0.000755    0.002003           19901   

      reads_raw  state  
1A_0      20099      4  
1B_0      19977      4  
1C_0      20114      4  
1D_0      19895      4  
2E_0      19928      4  
2F_0      19934      4  
2G_0      20026      4  
2H_0      19936      4  
3I_0      20084      4  
3J_0      20011      4  
3K_0      20117      4  
3L_0      19901      4  

Step 5: Consensus base calls

In [16]:
## run step 5
data1.step5(force=True) #"1B_0")

## print the results
print data1.stats
    Saving Assembly.
      clusters_hidepth  clusters_total  error_est  hetero_est  reads_consens  \
1A_0              1000            1000   0.000756    0.002223           1000   
1B_0              1000            1000   0.000775    0.001910           1000   
1C_0              1000            1000   0.000751    0.002260           1000   
1D_0              1000            1000   0.000731    0.001876           1000   
2E_0              1000            1000   0.000770    0.001809           1000   
2F_0              1000            1000   0.000725    0.002103           1000   
2G_0              1000            1000   0.000707    0.001910           1000   
2H_0              1000            1000   0.000755    0.002215           1000   
3I_0              1000            1000   0.000784    0.001877           1000   
3J_0              1000            1000   0.000741    0.001698           1000   
3K_0              1000            1000   0.000767    0.001821           1000   
3L_0              1000            1000   0.000755    0.002003           1000   

      reads_filtered  reads_raw  state  
1A_0           20099      20099      5  
1B_0           19977      19977      5  
1C_0           20114      20114      5  
1D_0           19895      19895      5  
2E_0           19928      19928      5  
2F_0           19934      19934      5  
2G_0           20026      20026      5  
2H_0           19936      19936      5  
3I_0           20084      20084      5  
3J_0           20011      20011      5  
3K_0           20117      20117      5  
3L_0           19901      19901      5  
In [17]:
## run step 6
data1.step6(force=True)
    Saving Assembly.
In [18]:
data1.step7(force=True)
    Outfiles written to: ~/Documents/ipyrad/tests/test_rad/data1_outfiles
    Saving Assembly.
In [ ]:
 
In [ ]:
 

Other stuffm

In [ ]:
import h5py
ioh5 = h5py.File(data1.database, 'r')
superseqs = ioh5["seqs"][:]
superseqs[superseqs == ""] = "N"
In [30]:
AMBIGS = {"R":("G", "A"),
          "K":("G", "T"),
          "S":("G", "C"),
          "Y":("T", "C"),
          "W":("T", "A"),
          "M":("C", "A")}
In [64]:
def ucount(sitecol):
    """ 
    Used to count the number of unique bases in a site for snpstring. 
    returns as a spstring with * and - 
    """
    ## make into set
    site = set(sitecol)

    ## get resolutions of ambiguitites
    for iupac in "RSKYWM":
        if iupac in site:
            site.discard(iupac)
            site.update(AMBIGS[iupac])

    ## remove - and N
    site.discard("N")
    site.discard("-")

    ## if site is invariant return ""
    if len(site) < 2:
        return " "
    else:
        ## get how many bases come up more than once 
        ccx = np.sum(np.array([np.sum(sitecol == base) for base in site]) > 1)
        ## if another site came up more than once return *
        if ccx > 1:
            return "*"
        else:
            return "-"
In [65]:
snps = np.apply_along_axis(ucount, 1, superseqs)
In [86]:
snpsarr = np.zeros((superseqs.shape[0], superseqs.shape[2], 2), dtype=np.bool)
snpsarr[:, :, 0] = snps == "-"
snpsarr[:, :, 1] = snps == "*"
snpsarr.shape
Out[86]:
(100, 150, 2)
In [107]:
import numpy as np
import os
a = np.array(['a', 'b'])
set(a)
Out[107]:
{'a', 'b'}
In [112]:
path = os.path.join(data1.dirs.outfiles, data1.name+"_tmpchunks", "snpsarr.0.npy")

np.load(path).shape
Out[112]:
(100, 150, 2)
In [113]:
path = os.path.join(data1.dirs.outfiles, data1.name+"_tmpchunks", "hetf.0.npy")

np.load(path).shape
Out[113]:
(100,)

Assembly finished

In [5]:
ll test_rad/outfiles/
total 9068
-rw-rw-r-- 1 deren 2600999 Jan 23 02:03 data1.alleles
-rw-rw-r-- 1 deren 1252625 Jan 23 02:03 data1.full.vcf
-rw-rw-r-- 1 deren 1250895 Jan 23 02:03 data1.gphocs
-rw-rw-r-- 1 deren 1353000 Jan 23 02:03 data1.loci
-rw-rw-r-- 1 deren 1269535 Jan 23 02:03 data1.nex
-rw-rw-r-- 1 deren 1128105 Jan 23 02:03 data1.phy
-rw-rw-r-- 1 deren   21655 Jan 23 02:03 data1.phy.partitions
-rw-rw-r-- 1 deren      34 Jan 23 02:03 data1.treemix.gz
-rw-rw-r-- 1 deren  388482 Jan 23 02:03 data1.vcf

Quick parameter explanations are always on-hand

In [18]:
ip.paramsinfo(10)
    (10) phred_Qscore_offset ---------------------------------------------
    Examples:
    ----------------------------------------------------------------------
    data.set_params(10) = 33
    data.set_params("phred_Qscore_offset") = 33
    ----------------------------------------------------------------------
    

Log history

A common problem at the end of an analysis, or while troubleshooting it, is that you find you've completely forgotten which parameters you used at what point, and when you changed them. Documenting or executing code inside Jupyter notebooks (like the one you're reading right now) is a great way to keep track of all of this. In addition, ipyrad also stores a log history which time stamps all modifications to Assembly objects.

In [ ]:
for i in data1.log:
    print i
    
print "\ndata 2 log includes its pre-branching history with data1"
for i in data2.log:
    print i

Saving Assembly objects

Assembly objects can be saved and loaded so that interactive analyses can be started, stopped, and returned to quite easily. The format of these saved files is a serialized 'dill' object used by Python. Individual Sample objects are saved within Assembly objects. These objects to not contain the actual sequence data, but only link to it, and so are not very large. The information contained includes parameters and the log of Assembly objects, and the statistics and state of Sample objects. Assembly objects are autosaved each time an assembly step function is called, but you can also create your own checkpoints with the save command.

In [ ]:
## save assembly object
#ip.save_assembly("data1.p")

## load assembly object
#data = ip.load_assembly("data1.p")
#print data.name
In [60]:
isinstance(0.25, float)
Out[60]:
True