Author: Ties de Kok tdekok@uw.edu
Homepage: https://github.com/TiesdeKok/ipystata
PyPi: https://pypi.python.org/pypi/ipystata
Stata Automation mode
See GitHub for an example notebook that uses the Stata Batch Mode (supported for Windows, Mac OS X, and Linux).
import pandas as pd
import ipystata
Note: You can ignore the "Javascript error adding output!" warning if it pops up with the newest version of Jupyter Notebook.
Make sure that you have registered your Stata instance. (See GitHub for instructions).
%%stata
display "Hello, I am printed by Stata."
Hello, I am printed by Stata.
The code cell below runs the Stata command sysuse auto.dta to load the dataset and returns it to Python via the -o car_df argument.
%%stata -o car_df
sysuse auto.dta
(1978 Automobile Data)
car_df is a regular Pandas dataframe on which Python / Pandas actions can be performed.
car_df.head()
| | make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displacement | gear_ratio | foreign |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AMC Concord | 4099 | 22 | 3.0 | 2.5 | 11 | 2930 | 186 | 40 | 121 | 3.58 | Domestic |
1 | AMC Pacer | 4749 | 17 | 3.0 | 3.0 | 11 | 3350 | 173 | 40 | 258 | 2.53 | Domestic |
2 | AMC Spirit | 3799 | 22 | NaN | 3.0 | 12 | 2640 | 168 | 35 | 121 | 3.08 | Domestic |
3 | Buick Century | 4816 | 20 | 3.0 | 4.5 | 16 | 3250 | 196 | 40 | 196 | 2.93 | Domestic |
4 | Buick Electra | 7827 | 15 | 4.0 | 4.0 | 20 | 4080 | 222 | 43 | 350 | 2.41 | Domestic |
The argument -d or --data is used to define which dataframe should be set as the dataset in Stata.
In the example below the Stata command tabulate is used to generate some descriptive statistics for the dataframe car_df.
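# Raw string so the backslashes in the Windows path are not read as escapes (e.g. \t).
# pandas accepts version 114, 117, 118, or 119; 117 can be read by Stata 13 or later.
car_df.to_stata(r'D:\Software\stata15\test.df', version=117)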
%%stata -d car_df
tabulate foreign headroom
           |                                        headroom
   foreign |       1.5          2        2.5          3        3.5          4        4.5          5 |     Total
-----------+----------------------------------------------------------------------------------------+----------
  Domestic |         3         10          4          7         13         10          4          1 |        52
   Foreign |         1          3         10          6          2          0          0          0 |        22
-----------+----------------------------------------------------------------------------------------+----------
     Total |         4         13         14         13         15         10          4          1 |        74
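As a side note on the to_stata call above, the following is a minimal sketch of a round trip through a Stata file using only pandas. The two-row dataframe and the temporary file name are made up for illustration, and the sketch assumes a pandas version that supports version=117.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row sample; not the full auto dataset.
sample_df = pd.DataFrame({"make": ["AMC Concord", "AMC Pacer"],
                          "price": [4099, 4749]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # os.path.join avoids hand-written backslashes in the path.
    dta_path = os.path.join(tmp_dir, "sample.dta")
    # write_index=False keeps the pandas index out of the Stata file.
    sample_df.to_stata(dta_path, version=117, write_index=False)
    round_trip = pd.read_stata(dta_path)

print(round_trip)
```

Building the path with os.path.join (or using a raw string) sidesteps the escape-sequence problems that plain Windows paths can cause in Python string literals.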
These descriptive statistics can be replicated in Pandas using the crosstab function; see the code below.
pd.crosstab(car_df['foreign'], car_df['headroom'], margins=True)
headroom | 1.5 | 2.0 | 2.5 | 3.0 | 3.5 | 4.0 | 4.5 | 5.0 | All |
---|---|---|---|---|---|---|---|---|---|
foreign | | | | | | | | | |
Domestic | 3 | 10 | 4 | 7 | 13 | 10 | 4 | 1 | 52 |
Foreign | 1 | 3 | 10 | 6 | 2 | 0 | 0 | 0 | 22 |
All | 4 | 13 | 14 | 13 | 15 | 10 | 4 | 1 | 74 |
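To see what margins=True contributes, here is a small self-contained sketch. The five-row dataframe is hypothetical and only serves to make the "All" totals easy to check by hand.

```python
import pandas as pd

# Hypothetical mini-sample; the values are made up for illustration.
mini_df = pd.DataFrame({
    "foreign": ["Domestic", "Domestic", "Foreign", "Foreign", "Domestic"],
    "headroom": [2.5, 3.0, 2.5, 2.5, 3.0],
})

# margins=True appends the "All" row and column, comparable to Stata's Total.
table = pd.crosstab(mini_df["foreign"], mini_df["headroom"], margins=True)
print(table)

# The "All" column matches a plain count per category.
counts = mini_df["foreign"].value_counts()
print(counts)
```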
IPyStata automatically checks whether any new graphs have been generated.
To show multiple graphs, make sure to use the name(.., replace) option in your Stata code.
Note: the display order is not guaranteed to match the generation order, so it is recommended to use the title() option when showing multiple graphs.
Graphs can be prevented from showing with the -nogr or --nograph argument.
%%stata -s graph_session
use https://stats.idre.ucla.edu/stat/data/hsb2.dta, clear
graph twoway scatter read math, name(a, replace) title("Graph a")
graph twoway scatter math science, name(b, replace) title("Graph b")
(highschool and beyond (200 cases))
In many situations it is convenient to define values or variable names in a Python list or, equivalently, in a Stata macro.
The -i or --input argument makes a Python list available for use in Stata as a local macro.
For example, -i main_var converts the Python list ['mpg', 'rep78'] into the Stata local macro `main_var'.
main_var = ['mpg', 'rep78']
control_var = ['gear_ratio', 'trunk', 'weight', 'displacement']
%%stata -i main_var -i control_var -os
display "`main_var'"
display "`control_var'"
regress price `main_var' `control_var', vce(robust)
mpg rep78
gear_ratio trunk weight displacement

Linear regression                               Number of obs     =         69
                                                F(6, 62)          =       8.60
                                                Prob > F          =     0.0000
                                                R-squared         =     0.4124
                                                Root MSE          =     2338.1

------------------------------------------------------------------------------
             |               Robust
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -76.95578   84.95038    -0.91   0.369    -246.7692     92.8576
       rep78 |   899.0818   299.7541     3.00   0.004      299.882    1498.282
  gear_ratio |   1479.744   917.5363     1.61   0.112    -354.3846    3313.873
       trunk |  -110.3163   80.16622    -1.38   0.174    -270.5663    49.93365
      weight |   1.139509   1.187361     0.96   0.341    -1.233991     3.51301
displacement |   17.82274   8.523647     2.09   0.041     .7842094    34.86126
       _cons |  -5163.323   4965.389    -1.04   0.302    -15088.99    4762.348
------------------------------------------------------------------------------
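To make the list-to-macro mapping concrete, here is a rough sketch of how a Python list could be turned into a Stata local definition. The helper list_to_local is hypothetical and is not part of IPyStata; it only illustrates that the list elements end up as a space-separated macro body, which is consistent with the display output above.

```python
def list_to_local(name, values):
    """Build a Stata `local` statement from a Python list (illustrative helper)."""
    body = " ".join(str(value) for value in values)
    return f"local {name} {body}"


main_var = ['mpg', 'rep78']
control_var = ['gear_ratio', 'trunk', 'weight', 'displacement']

print(list_to_local("main_var", main_var))
print(list_to_local("control_var", control_var))
```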
It is possible to create new variables or modify the existing dataset in Stata and have it returned as a Pandas dataframe.
In the example below the output -o car_df will overwrite the car_df previously created.
Note: the argument -np or --noprint can be used to suppress any output below the code cell.
%%stata -o car_df -np
generate weight_squared = weight^2
generate log_weight = log(weight)
car_df.head(3)
| | make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displacement | gear_ratio | foreign | weight_squared | log_weight |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AMC Concord | 4099 | 22 | 3.0 | 2.5 | 11 | 2930 | 186 | 40 | 121 | 3.58 | Domestic | 8584900.0 | 7.982758 |
1 | AMC Pacer | 4749 | 17 | 3.0 | 3.0 | 11 | 3350 | 173 | 40 | 258 | 2.53 | Domestic | 11222500.0 | 8.116715 |
2 | AMC Spirit | 3799 | 22 | NaN | 3.0 | 12 | 2640 | 168 | 35 | 121 | 3.08 | Domestic | 6969600.0 | 7.878534 |
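For comparison, the same two variables can be computed directly in pandas. This is a minimal sketch using only the three weight values shown in the table above; the dataframe name weights_df is made up for the example.

```python
import numpy as np
import pandas as pd

# Weights of the first three cars, copied from the table above.
weights_df = pd.DataFrame({"weight": [2930, 3350, 2640]})

# Stata's log() is the natural logarithm, as is np.log.
weights_df["weight_squared"] = weights_df["weight"] ** 2
weights_df["log_weight"] = np.log(weights_df["weight"])
print(weights_df)
```

Small differences in the last digits relative to the Stata result are possible, because Stata's generate stores new variables as single-precision floats by default.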
The -gm or --getmacro argument allows a macro to be extracted from a Stata session. The macro will be added to the macro_dict dictionary.
%%stata -s macro_example -gm macro_1 -gm macro_2
local macro_1 one two
local macro_2 three four
Several (2x) macros have been added to the dictionary: macro_dict
macro_dict
{'macro_1': ['one', 'two'], 'macro_2': ['three', 'four']}
macro_dict['macro_1']
['one', 'two']
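The output above suggests that each macro body is split on whitespace into a Python list. The sketch below reproduces that behaviour under this assumption; macro_to_list is a hypothetical helper, not IPyStata's own code.

```python
def macro_to_list(macro_body):
    """Split a macro body on whitespace (illustrative helper, not IPyStata's API)."""
    return macro_body.split()


# Macro bodies taken from the Stata cell above.
macro_dict = {
    "macro_1": macro_to_list("one two"),
    "macro_2": macro_to_list("three four"),
}
print(macro_dict)
```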
import os
os.chdir(r'C:/')
%%stata -cwd
display "`c(pwd)'"
Set the working directory of Stata to: C:\
%%stata -s mata_session
sysuse auto
(1978 Automobile Data)
%%stata --mata -s mata_session
y = st_data(., "price")
X = st_data(., "mpg trunk")
n = rows(X)
X = X,J(n,1,1)
XpX = quadcross(X, X)
XpXi = invsym(XpX)
b = XpXi*quadcross(X, y)
b'
Mata output:

                  1              2              3
    +----------------------------------------------+
  1 |  -220.1648801    43.55851009    10254.94983  |
    +----------------------------------------------+
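The Mata cell above computes OLS coefficients from the normal equations. As a rough NumPy counterpart, the sketch below follows the same steps on a small hypothetical dataset constructed so that the expected coefficients are known in advance; it uses np.linalg.solve instead of an explicit inverse, which gives the same solution here.

```python
import numpy as np

# Hypothetical data constructed so that y = 2*x1 + 3*x2 + 1 exactly.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
y = np.array([9.0, 8.0, 19.0, 18.0, 29.0])

n = X.shape[0]
X = np.column_stack([X, np.ones(n)])   # X = X,J(n,1,1): append a constant column
XpX = X.T @ X                          # quadcross(X, X)
Xpy = X.T @ y                          # quadcross(X, y)
b = np.linalg.solve(XpX, Xpy)          # same solution as invsym(XpX)*Xpy
print(b)
```

As in the Mata example, the constant is the last column, so the intercept is the last element of b.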
IPyStata 0.2 introduces the possibility to use many different Stata sessions that by default run in the background.
These sessions are defined using the -s or --session argument.
%%stata -s session_1 -np
local session Hello I am session 1 and I am persistent
%%stata -s session_2 -np
local session Hello I am session 2 and I am persistent
%%stata -s session_1
display "`session'"
Hello I am session 1 and I am persistent
%%stata -s session_2
display "`session'"
Hello I am session 2 and I am persistent
In this example a logistic regression is performed in one cell and a postestimation (predict) is performed on this regression in the next cell.
%%stata -s auto_session
sysuse auto
logit foreign weight mpg
(1978 Automobile Data)

Iteration 0:   log likelihood =  -45.03321
Iteration 1:   log likelihood = -29.238536
Iteration 2:   log likelihood = -27.244139
Iteration 3:   log likelihood = -27.175277
Iteration 4:   log likelihood = -27.175156
Iteration 5:   log likelihood = -27.175156

Logistic regression                             Number of obs     =         74
                                                LR chi2(2)        =      35.72
                                                Prob > chi2       =     0.0000
Log likelihood = -27.175156                     Pseudo R2         =     0.3966

------------------------------------------------------------------------------
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0039067   .0010116    -3.86   0.000    -.0058894    -.001924
         mpg |  -.1685869   .0919175    -1.83   0.067    -.3487418     .011568
       _cons |   13.70837   4.518709     3.03   0.002     4.851859    22.56487
------------------------------------------------------------------------------
%%stata -s auto_session
predict probhat
summarize probhat
(option pr assumed; Pr(foreign))

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     probhat |         74    .2972973    .3052979    .000729   .8980594
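The predicted probability that predict stores in probhat is the logistic function applied to the linear index. The sketch below applies this to a single car, using the coefficients from the logit output above and the weight and mpg of the AMC Concord from the earlier table; the function name predicted_probability is made up for the example.

```python
import math

# Coefficients copied from the logit output above.
coef = {"weight": -0.0039067, "mpg": -0.1685869, "_cons": 13.70837}


def predicted_probability(weight, mpg):
    """Return Pr(foreign) from the linear index via the logistic function."""
    xb = coef["_cons"] + coef["weight"] * weight + coef["mpg"] * mpg
    return 1.0 / (1.0 + math.exp(-xb))


# AMC Concord from the auto dataset: weight 2930, mpg 22.
p_concord = predicted_probability(weight=2930, mpg=22)
print(round(p_concord, 4))
```

Because the coefficients are rounded as displayed, the result can differ slightly from the value Stata computes internally.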
In order to avoid using unnecessary system resources several tools and automatic cleanup routines are included.
%%stata
sessions
The following sessions have been found:
main [active]
graph_session [active]
macro_example [active]
mata_session [active]
session_1 [active]
session_2 [active]
auto_session [active]
%%stata
reveal all
Revealed 7 Stata sessions.
%%stata
hide all
7 Stata sessions have been hidden.
%%stata
close
The following sessions have been closed:
main
graph_session
macro_example
mata_session
session_1
session_2
auto_session
Terminated unattached Stata session.
Close all Stata sessions (Warning! This closes all Stata windows)
%%stata
close all
Terminated 0 running Stata processes
Create the variable large in Python and use it as the dependent variable for a binary choice estimation by Stata.
car_df['large'] = [1 if x > 3 and y > 200 else 0 for x, y in zip(car_df['headroom'], car_df['length'])]
car_df[['headroom', 'length', 'large']].head(7)
| | headroom | length | large |
---|---|---|---|
0 | 2.5 | 186 | 0 |
1 | 3.0 | 173 | 0 |
2 | 3.0 | 168 | 0 |
3 | 4.5 | 196 | 0 |
4 | 4.0 | 222 | 1 |
5 | 4.0 | 218 | 1 |
6 | 3.0 | 170 | 0 |
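The list comprehension above can also be written as a vectorized pandas expression. The sketch below uses only the seven rows shown in the table above; the dataframe name rows_df is made up for the example.

```python
import pandas as pd

# First seven rows of headroom and length, copied from the table above.
rows_df = pd.DataFrame({
    "headroom": [2.5, 3.0, 3.0, 4.5, 4.0, 4.0, 3.0],
    "length": [186, 173, 168, 196, 222, 218, 170],
})

# Combine the two boolean conditions with &, then cast True/False to 1/0.
rows_df["large"] = ((rows_df["headroom"] > 3) & (rows_df["length"] > 200)).astype(int)
print(rows_df)
```

The vectorized form avoids the explicit Python loop, which tends to matter more on larger dataframes.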
main_var = ['mpg', 'rep78']
control_var = ['gear_ratio', 'trunk', 'weight', 'displacement']
%%stata -d car_df -i main_var -i control_var
logit large `main_var' `control_var', vce(cluster make)
Iteration 0:   log pseudolikelihood =  -39.60355
Iteration 1:   log pseudolikelihood = -19.307161
Iteration 2:   log pseudolikelihood = -13.526857
Iteration 3:   log pseudolikelihood = -10.999644
Iteration 4:   log pseudolikelihood = -10.726345
Iteration 5:   log pseudolikelihood = -10.723111
Iteration 6:   log pseudolikelihood = -10.723109
Iteration 7:   log pseudolikelihood = -10.723109

Logistic regression                             Number of obs     =         69
                                                Wald chi2(6)      =      12.90
                                                Prob > chi2       =     0.0446
Log pseudolikelihood = -10.723109               Pseudo R2         =     0.7292

                                  (Std. Err. adjusted for 69 clusters in make)
------------------------------------------------------------------------------
             |               Robust
       large |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -.5846335   .2941083    -1.99   0.047    -1.161075   -.0081918
       rep78 |  -1.298127   1.264918    -1.03   0.305    -3.777322    1.181067
  gear_ratio |  -1.331913   3.389448    -0.39   0.694    -7.975109    5.311283
       trunk |   1.210178   .4830082     2.51   0.012     .2634991    2.156856
      weight |  -.0007284   .0022358    -0.33   0.745    -.0051105    .0036536
displacement |    .001631   .0160425     0.10   0.919    -.0298119    .0330738
       _cons |  -.2977676    16.7841    -0.02   0.986    -33.19401    32.59847
------------------------------------------------------------------------------
Note: 8 failures and 0 successes completely determined.