#!/usr/bin/env python # coding: utf-8 # ## Cookbook recipe: Access and plot ipyrad stats in R # Jupyter notebooks provide a convenient interface for sharing data and functions between Python and R through use of the Python `rpy2` module. By combining all of your code from across multiple languages inside a single notebook your workflow from analyses to plots is easy to follow and reproduce. In this notebook, I show an example of exporting data from an ipyrad JSON object into R so that we can create high quality plots using plotting libraries in R. # # # ### Why do this? # A large motivation for creating the JSON storage object for ipyrad Assemblies is that this object stores all of the information for an entire assembly, and thus provides a useful portable format. This way you can execute very large assemblies on a remote HPC cluster and simply import the small JSON file onto your laptop to analyze and compare its size and other stats. It is of course easiest to analyze the stats in Python since ipyrad has a built-in parser for these JSON objects. However, many users may prefer to use R for plotting, and so here we show how to easily transfer results from ipyrad to R. # # # ### Two ways of accessing data in the ipyrad JSON file: # 1. Load the data in with the ipyrad API and save to CSV. # 2. Load the data in with the ipyrad API and export to R using rpy2. # ### Start by importing ipyrad # In[2]: ## import ipyrad and give it a shorter name import ipyrad as ip # ### Either create a new ipyrad assembly or load an existing one: # Here we use the API to run the example RAD data set, which only takes about 3 minutes on a 4-core laptop. # In[3]: ## create a test assembly data = ip.Assembly("data") data.set_params('project_dir', 'test') data.set_params('raw_fastq_path', 'ipsimdata/rad_example_R1_.fastq.gz') data.set_params('barcodes_path', 'ipsimdata/rad_example_barcodes.txt') ## Assemble data set; runs steps 1-7 data.run('1') # ### Or load a finished assembly from its JSON file # In[4]: ## load the JSON file for this assembly data = ip.load_json("test/data.json") # ### Look at the stats summary for this assembly # In[5]: ## Data can be accessed from the object's stats and stats_df attributes print data.stats # ### Load R-language extension # In[6]: ## This requires that you have the Python module `rpy2` installed. ## If you do not, it can be installed in anaconda with: ## conda install rpy2 # ### Transfer Python object to R # There are a few odd tricks to using this module. One being that you shouldn't try to transfer objects with '.' in their names. R doesn't like that. So simply rename these objects before passing them. Below I rename the stats data frame and then use the '-i' flag in R to import it into R namespace. The "%R" at the beginning of the line tells IPyhton to execute just that line as R code. # In[7]: get_ipython().run_line_magic('load_ext', 'rpy2.ipython') # In[25]: ## rename data.stats as statsDF statsDF = data.stats ## import statsDF into R namespace get_ipython().run_line_magic('R', '-i statsDF') # ### Now R knows about statsDF # We can access it just like a normal R DataFrame, and even create plots. Using the cell header %%R everything in the cell will execute as R code. # In[26]: get_ipython().run_cell_magic('R', '', 'print(statsDF)\n') # In[92]: get_ipython().run_cell_magic('R', '-w 350 -h 350', '## the dimensions above tell IPython how big to make the embedded figure\n## alternatively you can adjust the size when you save the figure\n\nplot(statsDF$reads_raw, \n statsDF$reads_filtered, \n pch=20, cex=3)\n') # ### Let's transfer more data from Python to R # In[13]: ### Other stats from our assembly are also available. ### First store names and then import into R s5 = data.stats_dfs.s5 s7L = data.stats_dfs.s7_loci s7S = data.stats_dfs.s7_snps s7N = data.stats_dfs.s7_samples ## no spaces allowed between comma-separated names when ## transferring multiple objects to R get_ipython().run_line_magic('R', '-i s5,s7L,s7S,s7N') # ### Plot coverage among samples # Kinda boring in this example... # In[24]: get_ipython().run_cell_magic('R', '-w 800 -h 320', '## \nbarplot(s7N$sample_coverage, \n col=\'grey30\', names=rownames(s7N),\n ylab="N loci",\n xlab="Sample")\n') # ### Plot the distribution of SNPs among loci # In[14]: get_ipython().run_cell_magic('R', '-w 450 -h 400', 'print(s7S)\n\nbarplot(s7S$var, \n col=rgb(0,0,1,1/4), \n names=rownames(s7S),\n ylab="N loci", ylim=c(0, 400),\n xlab="N variable sites")\n\nbarplot(s7S$pis, \n col=rgb(1,0,0,1/4), \n names=rownames(s7S),\n ylab="N loci", ylim=c(0, 400),\n xlab="N variable sites", \n add=TRUE)\n') # ### Future # You can of course make much more complex figures with the data for your own Assembly. And you can also create figures to compare stats among multiple Assemblies. See also the R package RADami which has some functions for analysis and plotting using the .loci output file from ipyrad. # # If you're an R afficionado please contribute to the ipyrad cookbooks with more advanced plots or analyses, and add your notebooks to the cookbook collection by submitting them to the ipyrad github repo.