#!/usr/bin/env python
# coding: utf-8

# > This is one of the 100 recipes of the [IPython Cookbook](http://ipython-books.github.io/), the definitive guide to high-performance scientific computing and data science in Python.
# 

# # 7.8. Analyzing data with R in the IPython notebook

# **UPDATE (2014-09-29)**: in newer versions of rpy2, the IPython extension with the R magic is `rpy2.ipython` and not `rmagic` as stated in the book.

# Using R from IPython takes three steps. The first two, installing R and rpy2 (the R-to-Python interface), only need to be done once. The third, loading the IPython R extension, must be repeated in every IPython session where you want to use R.

# 1. Download and install R for your operating system (http://cran.r-project.org/mirrors.html).
# 2. Download and install [rpy2](http://rpy.sourceforge.net/rpy2.html). Windows users can try the *experimental* installer on Chris Gohlke's webpage (http://www.lfd.uci.edu/~gohlke/pythonlibs/#rpy2).
# 3. To execute R code in an IPython notebook, first run `%load_ext rpy2.ipython`.

# rpy2 does not appear to work well on Windows. We recommend using Linux or Mac OS X.

# To install R and rpy2 on Ubuntu, run the following commands:
# 
#     sudo apt-get install r-base-dev
#     sudo apt-get install python-rpy2
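
# After installation, it can help to confirm that rpy2 is visible to the Python interpreter running the notebook before loading the extension. The cell below is a minimal sketch using only the standard library; `has_module` is a hypothetical helper name introduced here for illustration, not part of rpy2 or IPython.

# In[ ]:


import importlib.util


def has_module(module_name):
    """Return True if `module_name` can be located without importing it."""
    # find_spec returns None when a top-level module cannot be found.
    return importlib.util.find_spec(module_name) is not None


# Hypothetical check: this only tells us that the package can be found,
# not that R itself is installed or that the extension will load cleanly.
rpy2_available = has_module("rpy2")
print("rpy2 importable:", rpy2_available)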

# Here, we will use the following workflow. First, we load data from Python. Then, we use R to design and fit a model, and to make some plots in the IPython notebook. We could also load data from R, or design and fit a statistical model with Python's statsmodels package, etc. In particular, the analysis we do here could be done entirely in Python, without resorting to the R language. This recipe just shows the basics of R and illustrates how R and Python can play together within an IPython session.

# 1. Let's load the *longley* dataset with the statsmodels package. This dataset contains a few economic indicators in the US from 1947 to 1962. We also load the IPython R extension.

# In[ ]:


import statsmodels.datasets as sd


# In[ ]:


data = sd.longley.load_pandas()


# In[ ]:


get_ipython().run_line_magic('load_ext', 'rpy2.ipython')


# 2. We define `x` and `y` as the exogenous (independent) and endogenous (dependent) variables, respectively. The endogenous variable quantifies the total employment in the country.

# In[ ]:


data.endog_name, data.exog_name


# In[ ]:


y, x = data.endog, data.exog


# 3. For convenience, we add the endogenous variable to the `x` DataFrame.

# In[ ]:


x['TOTEMP'] = y


# In[ ]:


x
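

# Assigning a new column as above modifies the `x` DataFrame in place. If the original table should stay untouched, pandas' `DataFrame.assign` returns a new DataFrame instead. The cell below is a minimal sketch on a tiny made-up table; the `demo_*` names and their values are hypothetical and are not taken from the Longley data.

# In[ ]:


import pandas as pd

# Made-up values, for illustration only (not the Longley data).
demo_exog = pd.DataFrame({"GNP": [1.0, 2.0, 3.0]})
demo_endog = pd.Series([10.0, 20.0, 30.0], name="TOTEMP")

# assign() returns a copy with the extra column; demo_exog is left as it was.
demo_combined = demo_exog.assign(TOTEMP=demo_endog)
print(list(demo_combined.columns))
print(list(demo_exog.columns))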


# 4. We will make a simple plot in R. First, we need to pass Python variables to R. We can use the `%R -i var1,var2` magic. Then, we can call R's `plot` command.

# In[ ]:


gnp = x['GNP']
totemp = x['TOTEMP']


# In[ ]:


get_ipython().run_line_magic('R', '')


# In[ ]:


get_ipython().run_line_magic('R', '-i totemp,gnp plot(gnp, totemp)')


# 5. Now that the data has been passed to R, we can fit a linear model to it. The `lm` function performs a linear regression. Here, we want to express `totemp` (total employment) as a function of the country's GNP.

# In[ ]:


get_ipython().run_cell_magic('R', '', 'fit <- lm(totemp ~ gnp);  # Least-squares regression\nprint(fit$coefficients)  # Display the coefficients of the fit.\nplot(gnp, totemp)  # Plot the data points.\nabline(fit)  # And plot the linear regression.\n')
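

# As noted earlier, this analysis could also be done entirely in Python. The cell below is a minimal sketch of a straight-line least-squares fit using NumPy's `polyfit`, which can serve as a cross-check on R's `lm`. It is not the method used in this recipe: `fit_line` is a hypothetical helper introduced here, and the demo points are made up for illustration rather than taken from the Longley data.

# In[ ]:


import numpy as np


def fit_line(x_values, y_values):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    x_array = np.asarray(x_values, dtype=float)
    y_array = np.asarray(y_values, dtype=float)
    # For deg=1, polyfit returns the coefficients highest degree first: [slope, intercept].
    slope, intercept = np.polyfit(x_array, y_array, deg=1)
    return intercept, slope


# Made-up points lying exactly on y = 1 + 2 * x (for illustration only).
demo_x = [0.0, 1.0, 2.0, 3.0]
demo_y = [1.0, 3.0, 5.0, 7.0]
demo_intercept, demo_slope = fit_line(demo_x, demo_y)
print(demo_intercept, demo_slope)


# With the variables of this recipe, `fit_line(gnp, totemp)` should return an intercept and slope in the same order as the `(Intercept)` and `gnp` entries of `fit$coefficients`, up to numerical precision.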


# > You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).
# 
# > [IPython Cookbook](http://ipython-books.github.io/), by [Cyrille Rossant](http://cyrille.rossant.net), Packt Publishing, 2014 (500 pages).