Jupyter notebooks provide a convenient interface for sharing data and functions between Python and R through use of the Python rpy2
module. By combining all of your code from across multiple languages inside a single notebook your workflow from analyses to plots is easy to follow and reproduce. In this notebook, I show an example of exporting data from an ipyrad JSON object into R so that we can create high quality plots using plotting libraries in R.
A large motivation for creating the JSON storage object for ipyrad Assemblies is that this object stores all of the information for an entire assembly, and thus provides a useful portable format. This way you can execute very large assemblies on a remote HPC cluster and simply import the small JSON file onto your laptop to analyze and compare its size and other stats. It is of course easiest to analyze the stats in Python since ipyrad has a built-in parser for these JSON objects. However, many users may prefer to use R for plotting, and so here we show how to easily transfer results from ipyrad to R.
## import ipyrad and give it a shorter name
import ipyrad as ip
Here we use the API to run the example RAD data set, which only takes about 3 minutes on a 4-core laptop.
## create a test assembly
data = ip.Assembly("data")
data.set_params('project_dir', 'test')
data.set_params('raw_fastq_path', 'ipsimdata/rad_example_R1_.fastq.gz')
data.set_params('barcodes_path', 'ipsimdata/rad_example_barcodes.txt')
## Assemble data set; runs steps 1-7
data.run('1')
New Assembly: data Assembly: data [####################] 100% chunking large files | 0:00:00 | s1 | [####################] 100% sorting reads | 0:00:02 | s1 | [####################] 100% writing/compressing | 0:00:00 | s1 |
## load the JSON file for this assembly
data = ip.load_json("test/data.json")
loading Assembly: data from saved path: ~/Documents/ipyrad/tests/test/data.json
## Data can be accessed from the object's stats and stats_df attributes
print data.stats
state reads_raw 1A_0 1 19862 1B_0 1 20043 1C_0 1 20136 1D_0 1 19966 2E_0 1 20017 2F_0 1 19933 2G_0 1 20030 2H_0 1 20199 3I_0 1 19885 3J_0 1 19822 3K_0 1 19965 3L_0 1 20008
## This requires that you have the Python module `rpy2` installed.
## If you do not, it can be installed in anaconda with:
## conda install rpy2
There are a few odd tricks to using this module. One being that you shouldn't try to transfer objects with '.' in their names. R doesn't like that. So simply rename these objects before passing them. Below I rename the stats data frame and then use the '-i' flag in R to import it into R namespace. The "%R" at the beginning of the line tells IPyhton to execute just that line as R code.
%load_ext rpy2.ipython
--------------------------------------------------------------------------- ImportError Traceback (most recent call last) <ipython-input-7-691c6d73b073> in <module>() ----> 1 get_ipython().magic(u'load_ext rpy2.ipython') /home/deren/miniconda2/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc in magic(self, arg_s) 2156 magic_name, _, magic_arg_s = arg_s.partition(' ') 2157 magic_name = magic_name.lstrip(prefilter.ESC_MAGIC) -> 2158 return self.run_line_magic(magic_name, magic_arg_s) 2159 2160 #------------------------------------------------------------------------- /home/deren/miniconda2/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc in run_line_magic(self, magic_name, line) 2077 kwargs['local_ns'] = sys._getframe(stack_depth).f_locals 2078 with self.builtin_trap: -> 2079 result = fn(*args,**kwargs) 2080 return result 2081 <decorator-gen-62> in load_ext(self, module_str) /home/deren/miniconda2/lib/python2.7/site-packages/IPython/core/magic.pyc in <lambda>(f, *a, **k) 186 # but it's overkill for just that one bit of state. 187 def magic_deco(arg): --> 188 call = lambda f, *a, **k: f(*a, **k) 189 190 if callable(arg): /home/deren/miniconda2/lib/python2.7/site-packages/IPython/core/magics/extension.pyc in load_ext(self, module_str) 35 if not module_str: 36 raise UsageError('Missing module name.') ---> 37 res = self.shell.extension_manager.load_extension(module_str) 38 39 if res == 'already loaded': /home/deren/miniconda2/lib/python2.7/site-packages/IPython/core/extensions.pyc in load_extension(self, module_str) 81 if module_str not in sys.modules: 82 with prepended_to_syspath(self.ipython_extension_dir): ---> 83 __import__(module_str) 84 mod = sys.modules[module_str] 85 if self._call_load_ipython_extension(mod): /home/deren/miniconda2/lib/python2.7/site-packages/rpy2-2.5.6-py2.7-linux-x86_64.egg/rpy2/ipython/__init__.py in <module>() ----> 1 from .rmagic import load_ipython_extension /home/deren/miniconda2/lib/python2.7/site-packages/rpy2-2.5.6-py2.7-linux-x86_64.egg/rpy2/ipython/rmagic.py in <module>() 73 argument, magic_arguments, parse_argstring, argument_group 74 ) ---> 75 from IPython.external.simplegeneric import generic 76 from IPython.utils.py3compat import str_to_unicode, unicode_to_str, PY3 77 ImportError: No module named simplegeneric
## rename data.stats as statsDF
statsDF = data.stats
## import statsDF into R namespace
%R -i statsDF
We can access it just like a normal R DataFrame, and even create plots. Using the cell header %%R everything in the cell will execute as R code.
%%R
print(statsDF)
state reads_raw reads_filtered clusters_total clusters_hidepth hetero_est 1A_0 6 20093 20093 1000 1000 0.001965305 1B_0 6 19938 19938 1000 1000 0.001865104 1C_0 6 20054 20054 1000 1000 0.001968599 1D_0 6 20066 20066 1000 1000 0.001934149 2E_0 6 19838 19838 1000 1000 0.002077262 2F_0 6 20033 20033 1000 1000 0.001886744 2G_0 6 20145 20145 1000 1000 0.001898473 2H_0 6 20158 20158 1000 1000 0.001931724 3I_0 6 20010 20010 1000 1000 0.001865512 3J_0 6 19977 19977 1000 1000 0.002291394 3K_0 6 20061 20061 1000 1000 0.001808438 3L_0 6 19975 19975 1000 1000 0.001990599 error_est reads_consens 1A_0 0.0007549364 1000 1B_0 0.0007512644 1000 1C_0 0.0007436178 1000 1D_0 0.0007661128 1000 2E_0 0.0007511313 1000 2F_0 0.0007386290 1000 2G_0 0.0007652827 1000 2H_0 0.0007361525 1000 3I_0 0.0007643904 1000 3J_0 0.0007778005 1000 3K_0 0.0007410198 1000 3L_0 0.0007273741 1000
%%R -w 350 -h 350
## the dimensions above tell IPython how big to make the embedded figure
## alternatively you can adjust the size when you save the figure
plot(statsDF$reads_raw,
statsDF$reads_filtered,
pch=20, cex=3)
### Other stats from our assembly are also available.
### First store names and then import into R
s5 = data.stats_dfs.s5
s7L = data.stats_dfs.s7_loci
s7S = data.stats_dfs.s7_snps
s7N = data.stats_dfs.s7_samples
## no spaces allowed between comma-separated names when
## transferring multiple objects to R
%R -i s5,s7L,s7S,s7N
Kinda boring in this example...
%%R -w 800 -h 320
##
barplot(s7N$sample_coverage,
col='grey30', names=rownames(s7N),
ylab="N loci",
xlab="Sample")
%%R -w 450 -h 400
print(s7S)
barplot(s7S$var,
col=rgb(0,0,1,1/4),
names=rownames(s7S),
ylab="N loci", ylim=c(0, 400),
xlab="N variable sites")
barplot(s7S$pis,
col=rgb(1,0,0,1/4),
names=rownames(s7S),
ylab="N loci", ylim=c(0, 400),
xlab="N variable sites",
add=TRUE)
var sum_var pis sum_pis 0 9 0 316 0 1 44 44 369 369 2 117 278 204 777 3 174 800 84 1029 4 182 1528 19 1105 5 165 2353 5 1130 6 143 3211 2 1142 7 76 3743 1 1149 8 45 4103 0 1149 9 25 4328 0 1149 10 14 4468 0 1149 11 5 4523 0 1149 12 0 4523 0 1149 13 1 4536 0 1149
You can of course make much more complex figures with the data for your own Assembly. And you can also create figures to compare stats among multiple Assemblies. See also the R package RADami which has some functions for analysis and plotting using the .loci output file from ipyrad.
If you're an R afficionado please contribute to the ipyrad cookbooks with more advanced plots or analyses, and add your notebooks to the cookbook collection by submitting them to the ipyrad github repo.