#!/usr/bin/env python
# coding: utf-8

# > This is one of the 100 recipes of the [IPython Cookbook](http://ipython-books.github.io/), the definitive guide to high-performance scientific computing and data science in Python.
#
# # 7.8. Analyzing data with R in the IPython notebook

# **UPDATE (2014-09-29)**: in newer versions of rpy2, the IPython extension with the R magic is `rpy2.ipython` and not `rmagic` as stated in the book.

# There are three steps to use R from IPython. First, install R and rpy2 (R to Python interface). Of course, you only need to do this step once. Then, to use R in an IPython session, you need to load the IPython R extension.

# 1. Download and install R for your operating system. (http://cran.r-project.org/mirrors.html)
# 2. Download and install [rpy2](http://rpy.sourceforge.net/rpy2.html). Windows users can try to download an *experimental* installer on Chris Gohlke's webpage. (http://www.lfd.uci.edu/~gohlke/pythonlibs/#rpy2)
# 3. Then, to be able to execute R code in an IPython notebook, execute `%load_ext rpy2.ipython` first.

# rpy2 does not appear to work well on Windows. We recommend using Linux or Mac OS X.

# To install R and rpy2 on Ubuntu, run the following commands:
#
#     sudo apt-get install r-base-dev
#     sudo apt-get install python-rpy2

# Here, we will use the following workflow. First, we load data from Python. Then, we use R to design and fit a model, and to make some plots in the IPython notebook. We could also load data from R, or design and fit a statistical model with Python's statsmodels package, etc. In particular, the analysis we do here could be done entirely in Python, without resorting to the R language. This recipe just shows the basics of R and illustrates how R and Python can play together within an IPython session.

# 1. Let's load the *longley* dataset with the statsmodels package. This dataset contains a few economic indicators in the US from 1947 to 1962. We also load the IPython R extension.
# In[ ]:

import statsmodels.datasets as sd

# In[ ]:

data = sd.longley.load_pandas()

# In[ ]:

get_ipython().run_line_magic('load_ext', 'rpy2.ipython')

# 2. We define `x` and `y` as the exogenous (independent) and endogenous (dependent) variables, respectively. The endogenous variable quantifies the total employment in the country.

# In[ ]:

data.endog_name, data.exog_name

# In[ ]:

y, x = data.endog, data.exog

# 3. For convenience, we add the endogenous variable to the `x` DataFrame.

# In[ ]:

x['TOTEMP'] = y

# In[ ]:

x

# 4. We will make a simple plot in R. First, we need to pass Python variables to R. We can use the `%R -i var1,var2` magic. Then, we can call R's `plot` command.

# In[ ]:

gnp = x['GNP']
totemp = x['TOTEMP']

# In[ ]:

get_ipython().run_line_magic('R', '')

# In[ ]:

get_ipython().run_line_magic('R', '-i totemp,gnp plot(gnp, totemp)')

# 5. Now that the data has been passed to R, we can fit a linear model to the data. The `lm` function lets us perform a linear regression. Here, we want to express `totemp` (total employment) as a function of the country's GNP.

# In[ ]:

get_ipython().run_cell_magic('R', '', 'fit <- lm(totemp ~ gnp); # Least-squares regression\nprint(fit$coefficients) # Display the coefficients of the fit.\nplot(gnp, totemp) # Plot the data points.\nabline(fit) # And plot the linear regression.\n')

# > You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).
#
# > [IPython Cookbook](http://ipython-books.github.io/), by [Cyrille Rossant](http://cyrille.rossant.net), Packt Publishing, 2014 (500 pages).
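
# As noted in the introduction, the analysis could be done entirely in Python. The cell below is a minimal sketch of the arithmetic behind a one-predictor least-squares fit such as `lm(totemp ~ gnp)`: it returns the intercept and slope of the best-fitting line. The helper name `fit_line` and the four sample points are hypothetical and purely illustrative; they are not part of the recipe and are not the longley data.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points.

    Hypothetical helper for illustration; not part of the original recipe.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up points lying exactly on y = 1 + 2x (NOT the longley data).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```

# With the recipe's data, one would presumably pass `list(gnp)` and `list(totemp)` instead of the sample points and compare the result with `fit$coefficients` from step 5.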