The jupyter notebook project is now designed to be a 'language agnostic' web-application front-end for any one of many possible software language kernels. We've been mostly using python but there are in fact several dozen other language kernels that can be made to work with it including Julia, R, Matlab, C, Go, Fortran and Stata.
The ecosystem of libraries and packages for scientific computing with python is huge and constantly growing but there are still many statistics and econometrics applications that are available as built-in or user-written modules in Stata that have not yet been ported to python or are just simply easier to use in Stata. On the other hand there are some libraries such as python pandas and different visualization libraries such as seaborn or matplotlib that give features that are not available in Stata.
Fortunately you don't have to choose between using Stata or python, you can use them both together, to get the best of both worlds.
R is a powerful open source software environment for statistical computing. R has R markdown which allows you to create R-markdown notebooks similar in concept to jupyter notebooks. But you can also run R inside a jupyter notebook (indeed the name 'Jupyter' is from Julia, iPython and R).
See my notebook with notes on Research Discontinuity Design for an example of a jupyter notebook running R. To install an R kernel see the IRkernel project.
Kyle Barron has created a stata_kernel that offers several useful features including code-autocompletion, inline graphics, and generally fast responses.
For this to work you must have a working licensed copy of Stata version 14 or greater on your machine.
Sometimes it may be useful to combine python and Stata in the same notebook. Ties de Kok has written a nice python library called ipystata that allows one to execute Stata code in codeblocks inside an ipython notebook when preceded by a %%stata
magic command.
This workflow allows you to pass data between python and Stata sessions and to display Stata plots inline. Compared to the stata_kernel option the response times are not quite as fast.
The remainder of this notebook illustrates the use of ipystata.
For more details see the example notebook and documentation on the ipystata repository.
%matplotlib inline
import seaborn as sns
import pandas as pd
import statsmodels.formula.api as smf
import ipystata
The following opens a Stata session where we load a dataset and summarize the data. The -o
flag following the `%%Stata``` magic instructs it to output or return the dataset in Stata memory as a pandas dataframe in python.
%%stata -o life_df
sysuse lifeexp.dta
summarize
(Life expectancy, 1998) Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- region | 68 1.5 .7431277 1 3 country | 0 popgrowth | 68 .9720588 .9311918 -.5 3 lexp | 68 72.27941 4.715315 54 79 gnppc | 63 8674.857 10634.68 370 39980 -------------+--------------------------------------------------------- safewater | 40 76.1 17.89112 28 100
Let's confirm the data was returned as a pandas dataframe:
life_df.head(3)
region | country | popgrowth | lexp | gnppc | safewater | |
---|---|---|---|---|---|---|
0 | Eur & C.Asia | Albania | 1.2 | 72 | 810.0 | 76.0 |
1 | Eur & C.Asia | Armenia | 1.1 | 74 | 460.0 | NaN |
2 | Eur & C.Asia | Austria | 0.4 | 79 | 26830.0 | NaN |
A simple generate variable command and ols regression in Stata:
%%stata -o life_df
gen lngnppc = ln(gnppc)
regress lexp lngnppc
(5 missing values generated) Source | SS df MS Number of obs = 63 -------------+---------------------------------- F(1, 61) = 97.09 Model | 873.264865 1 873.264865 Prob > F = 0.0000 Residual | 548.671643 61 8.99461709 R-squared = 0.6141 -------------+---------------------------------- Adj R-squared = 0.6078 Total | 1421.93651 62 22.9344598 Root MSE = 2.9991 ------------------------------------------------------------------------------ lexp | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- lngnppc | 2.768349 .2809566 9.85 0.000 2.206542 3.330157 _cons | 49.41502 2.348494 21.04 0.000 44.71892 54.11113 ------------------------------------------------------------------------------
And the same regression using statsmodels and pandas:
model = smf.ols(formula = 'lexp ~ lngnppc',
data = life_df)
results = model.fit()
print(results.summary())
OLS Regression Results ============================================================================== Dep. Variable: lexp R-squared: 0.614 Model: OLS Adj. R-squared: 0.608 Method: Least Squares F-statistic: 97.09 Date: Tue, 08 May 2018 Prob (F-statistic): 3.13e-14 Time: 14:25:48 Log-Likelihood: -157.57 No. Observations: 63 AIC: 319.1 Df Residuals: 61 BIC: 323.4 Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 49.4150 2.348 21.041 0.000 44.719 54.111 lngnppc 2.7683 0.281 9.853 0.000 2.207 3.330 ============================================================================== Omnibus: 21.505 Durbin-Watson: 1.518 Prob(Omnibus): 0.000 Jarque-Bera (JB): 46.140 Skew: -1.038 Prob(JB): 9.57e-11 Kurtosis: 6.643 Cond. No. 52.7 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Let's change one of the variables in the dataframe in python:
life_df.popgrowth = life_df.popgrowth * 100
life_df.popgrowth.mean()
97.20587921142578
And now let's push the modified dataframe into the Stata dataset with the -d
flag:
%%stata -d life_df
summarize
Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- index | 68 33.5 19.77372 0 67 region | 68 .5 .7431277 0 2 country | 0 popgrowth | 68 97.20588 93.11918 -50 300 lexp | 68 72.27941 4.715315 54 79 -------------+--------------------------------------------------------- gnppc | 63 8674.857 10634.68 370 39980 safewater | 40 76.1 17.89112 28 100 lngnppc | 63 8.250023 1.355677 5.913503 10.59613
A Stata plot:
%%stata -d life_df --graph
graph twoway (scatter lexp lngnppc) (lfit lexp lngnppc)
Now on the python side use lmplot from the seaborn library to graph a similar scatter and fitted line but by region.
sns.set_style("whitegrid")
g=sns.lmplot(y='lexp', x='lngnppc', col='region', data=life_df,col_wrap=2)