# RESEARCH IN PYTHON: INSTRUMENTAL VARIABLES ESTIMATION
# by J. NATHAN MATIAS March 18, 2015
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
This section is taken from Chapter 10 of Methods Matter by Richard Murnane and John Willett. The descriptions are taken from Wikipedia, for copyright reasons.
In statistics, econometrics, epidemiology and related disciplines, the method of instrumental variables (IV) is used to estimate causal relationships when controlled experiments are not feasible or when a treatment is not successfully delivered to every unit in a randomized experiment.
In linear models, there are two main requirements for using an IV:

* The instrument must be correlated with the endogenous explanatory variable.
* The instrument must not be correlated with the error term of the outcome equation; it should affect the outcome only through the explanatory variable.
Can we use college attainment (COLLEGE) to predict the probability of civic engagement (REGISTER)? College attainment is not randomized, and the arrow of causality may move in the opposite direction, so all we can do with standard regression is to establish a correlation.
In this example, we use the distance between the student's school and the nearest community college (DISTANCE) as an instrumental variable to estimate a causal relationship. This is possible only if DISTANCE is related to college attainment and NOT related to the residuals of regressing REGISTER on COLLEGE.
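As a sanity check on this logic, here is a small simulation, not from the textbook: all variable names and parameter values are invented, and it uses continuous variables rather than the binary COLLEGE/REGISTER. It shows that ordinary regression is biased when the predictor shares an unobserved cause with the outcome, while a simple instrumental-variables (Wald) estimator recovers the true effect.

```python
import random

random.seed(42)
n = 20000
true_effect = 2.0

z, x, y = [], [], []
for _ in range(n):
    zi = random.gauss(0, 1)                            # instrument (think: distance)
    ui = random.gauss(0, 1)                            # unobserved confounder
    xi = 0.5 * zi + ui + random.gauss(0, 1)            # endogenous predictor (think: college)
    yi = true_effect * xi + ui + random.gauss(0, 1)    # outcome (think: registration)
    z.append(zi); x.append(xi); y.append(yi)

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

# Naive OLS slope: biased upward, because x and y share the confounder u
ols_slope = cov(x, y) / cov(x, x)

# Wald/IV estimator: uses only the variation in x that comes from the instrument
iv_slope = cov(z, y) / cov(z, x)

print(ols_slope)  # noticeably above the true effect of 2.0
print(iv_slope)   # close to 2.0
```

The OLS slope converges to about 2.44 here (the true 2.0 plus the confounding term cov(u, x)/var(x)), while the IV estimate converges to the true effect, because the instrument z is independent of the confounder u.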
The Python code listed here roughly parallels the code listed in the textbook example for Methods Matter Chapter 10. If you're curious about how to do a similar example in R, check out "A Simple Instrumental Variables Problem" by Adam Hyland on R-Bloggers or Ani Katchova's "Instrumental Variables in R" video on YouTube.
# THINGS TO IMPORT
# This is a baseline set of libraries I import by default if I'm rushed for time.
import codecs # load UTF-8 Content
import json # load JSON files
import pandas as pd # Pandas handles dataframes
import numpy as np # Numpy handles lots of basic maths operations
import matplotlib.pyplot as plt # Matplotlib for plotting
import seaborn as sns # Seaborn for beautiful plots
from dateutil import * # I prefer dateutil for parsing dates
import math # transformations
import statsmodels.formula.api as smf # for doing statistical regression
import statsmodels.api as sm # access to the wider statsmodels library, including R datasets
from collections import Counter # Counter is useful for grouping and counting
import scipy.stats # statistical tests, including the chi-square test below
import urllib2 # downloading files (Python 2)
import os.path # check whether the dataset has already been downloaded
if not os.path.isfile("dee.dta"):
    response = urllib2.urlopen("http://www.ats.ucla.edu/stat/stata/examples/methods_matter/chapter10/dee.dta")
    if response.getcode() == 200:
        f = open("dee.dta", "wb") # write in binary mode, since .dta is a binary format
        f.write(response.read())
        f.close()
dee_df = pd.read_stata("dee.dta")
dee_df[['register','college', 'distance']].describe()
|       | register    | college     | distance    |
|-------|-------------|-------------|-------------|
| count | 9227.000000 | 9227.000000 | 9227.000000 |
| mean  | 0.670857    | 0.547090    | 9.735992    |
| std   | 0.469927    | 0.497805    | 8.702286    |
| min   | 0.000000    | 0.000000    | 0.000000    |
| 25%   | 0.000000    | 0.000000    | 3.000000    |
| 50%   | 1.000000    | 1.000000    | 7.000000    |
| 75%   | 1.000000    | 1.000000    | 15.000001   |
| max   | 1.000000    | 1.000000    | 35.000000   |
print pd.crosstab(dee_df.register, dee_df.college)
chi2 = scipy.stats.chi2_contingency(pd.crosstab(dee_df.register, dee_df.college))
print "chi2: %(c)d" % {"c":chi2[0]}
print "p: %(p)0.03f" % {"p":chi2[1]}
print "df: %(df)0.03f" % {"df":chi2[2]}
print "expected:"
print chi2[3]
college      0     1
register
0         1780  1257
1         2399  3791

chi2: 323
p: 0.000
df: 1.000
expected:
[[ 1375.48748239  1661.51251761]
 [ 2803.51251761  3386.48748239]]
sns.corrplot(dee_df[['register','college','distance']])
result = smf.ols(formula = "register ~ college", data = dee_df).fit()
print result.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:               register   R-squared:                       0.035
Model:                            OLS   Adj. R-squared:                  0.035
Method:                 Least Squares   F-statistic:                     335.9
Date:                Wed, 18 Mar 2015   Prob (F-statistic):           1.03e-73
Time:                        22:15:12   Log-Likelihood:                -5959.0
No. Observations:                9227   AIC:                         1.192e+04
Df Residuals:                    9225   BIC:                         1.194e+04
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      0.5741      0.007     80.391      0.000       0.560     0.588
college        0.1769      0.010     18.326      0.000       0.158     0.196
==============================================================================
Omnibus:                      603.688   Durbin-Watson:                   1.955
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1441.712
Skew:                          -0.689   Prob(JB):                         0.00
Kurtosis:                       1.640   Cond. No.                         2.74
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In two-stage least squares regression, we regress COLLEGE on DISTANCE and use the predictions from that model as the predictors for REGISTER.
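Before running this on the dee.dta data, the mechanics can be sketched on simulated data (all names and values here are invented for illustration). The point of the sketch is that the second-stage slope on the first-stage fitted values is algebraically identical to the Wald estimator cov(z, y) / cov(z, x).

```python
import random

random.seed(1)
n = 10000
z, x, y = [], [], []
for _ in range(n):
    zi = random.gauss(0, 1)                      # instrument
    ui = random.gauss(0, 1)                      # unobserved confounder
    xi = 0.5 * zi + ui + random.gauss(0, 1)      # endogenous predictor
    yi = 2.0 * xi + ui + random.gauss(0, 1)      # outcome; true effect is 2.0
    z.append(zi); x.append(xi); y.append(yi)

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

# First stage: regress x on z and keep the fitted values
b1 = cov(z, x) / cov(z, z)
a1 = mean(x) - b1 * mean(z)
x_fitted = [a1 + b1 * zi for zi in z]

# Second stage: regress y on the fitted values
iv_2sls = cov(x_fitted, y) / cov(x_fitted, x_fitted)

# The Wald estimator gives the same answer in this single-instrument case
iv_wald = cov(z, y) / cov(z, x)
```

Since x_fitted is just a linear function of z, the two estimates agree up to floating point, and both converge to the true effect as n grows.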
print "=============================================================================="
print " FIRST STAGE"
print "=============================================================================="
result = smf.ols(formula = "college ~ distance", data = dee_df).fit()
print result.summary()
dee_df['college_fitted'] = result.predict()
print
print
print "=============================================================================="
print " SECOND STAGE"
print "=============================================================================="
result = smf.ols(formula = "register ~ college_fitted", data=dee_df).fit()
print result.summary()
==============================================================================
                                 FIRST STAGE
==============================================================================
                            OLS Regression Results
==============================================================================
Dep. Variable:                college   R-squared:                       0.012
Model:                            OLS   Adj. R-squared:                  0.012
Method:                 Least Squares   F-statistic:                     115.9
Date:                Wed, 18 Mar 2015   Prob (F-statistic):           7.35e-27
Time:                        22:15:12   Log-Likelihood:                -6598.2
No. Observations:                9227   AIC:                         1.320e+04
Df Residuals:                    9225   BIC:                         1.321e+04
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      0.6091      0.008     78.812      0.000       0.594     0.624
distance      -0.0064      0.001    -10.764      0.000      -0.008    -0.005
==============================================================================
Omnibus:                       54.830   Durbin-Watson:                   1.703
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1463.752
Skew:                          -0.190   Prob(JB):                         0.00
Kurtosis:                       1.086   Cond. No.                         19.7
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

==============================================================================
                                 SECOND STAGE
==============================================================================
                            OLS Regression Results
==============================================================================
Dep. Variable:               register   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     10.35
Date:                Wed, 18 Mar 2015   Prob (F-statistic):            0.00130
Time:                        22:15:12   Log-Likelihood:                -6118.9
No. Observations:                9227   AIC:                         1.224e+04
Df Residuals:                    9225   BIC:                         1.226e+04
Df Model:                           1
Covariance Type:            nonrobust
==================================================================================
                     coef    std err          t      P>|t|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
Intercept          0.5157      0.049     10.632      0.000       0.421     0.611
college_fitted     0.2837      0.088      3.216      0.001       0.111     0.457
==============================================================================
Omnibus:                      658.930   Durbin-Watson:                   1.938
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1638.898
Skew:                          -0.726   Prob(JB):                         0.00
Kurtosis:                       1.532   Cond. No.                         23.4
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
^^^^^^ Not sure what's going on with the R2 statistic here (it's 0.001 here, versus 0.022 in the published example), although everything else matches the Stata output. One likely explanation: when 2SLS is run by hand like this, the second-stage R2 (and the standard errors) are computed against the fitted values of COLLEGE rather than the actual values, so they will generally differ from what a dedicated IV routine reports.
In the case of the covariate of race/ethnicity, we expect that there might be a relationship between race/ethnicity and distance to a community college, as well as a relationship between race/ethnicity and voter registration.
While race/ethnicity fails the test for an instrumental variable, it can still be included as a covariate in a multiple regression model. In such cases, it is essential to include the covariates at both stages of the two-stage model.
sns.corrplot(dee_df[['register','college','distance', 'black','hispanic','otherrace']])
print "=============================================================================="
print " FIRST STAGE"
print "=============================================================================="
result = smf.ols(formula = "college ~ distance + black + hispanic + otherrace", data = dee_df).fit()
print result.summary()
dee_df['college_fitted'] = result.predict()
print
print
print "=============================================================================="
print " SECOND STAGE"
print "=============================================================================="
result = smf.ols(formula = "register ~ college_fitted + black + hispanic + otherrace", data=dee_df).fit()
print result.summary()
==============================================================================
                                 FIRST STAGE
==============================================================================
                            OLS Regression Results
==============================================================================
Dep. Variable:                college   R-squared:                       0.022
Model:                            OLS   Adj. R-squared:                  0.021
Method:                 Least Squares   F-statistic:                     51.24
Date:                Wed, 18 Mar 2015   Prob (F-statistic):           9.75e-43
Time:                        22:15:20   Log-Likelihood:                -6554.4
No. Observations:                9227   AIC:                         1.312e+04
Df Residuals:                    9222   BIC:                         1.315e+04
Df Model:                           4
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      0.6431      0.009     71.039      0.000       0.625     0.661
distance      -0.0069      0.001    -11.636      0.000      -0.008    -0.006
black         -0.0577      0.016     -3.613      0.000      -0.089    -0.026
hispanic      -0.1162      0.013     -8.766      0.000      -0.142    -0.090
otherrace      0.0337      0.024      1.404      0.160      -0.013     0.081
==============================================================================
Omnibus:                       53.726   Durbin-Watson:                   1.715
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1408.886
Skew:                          -0.188   Prob(JB):                    1.16e-306
Kurtosis:                       1.123   Cond. No.                         62.3
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

==============================================================================
                                 SECOND STAGE
==============================================================================
                            OLS Regression Results
==============================================================================
Dep. Variable:               register   R-squared:                       0.005
Model:                            OLS   Adj. R-squared:                  0.004
Method:                 Least Squares   F-statistic:                     10.53
Date:                Wed, 18 Mar 2015   Prob (F-statistic):           1.65e-08
Time:                        22:15:20   Log-Likelihood:                -6103.0
No. Observations:                9227   AIC:                         1.222e+04
Df Residuals:                    9222   BIC:                         1.225e+04
Df Model:                           4
Covariance Type:            nonrobust
==================================================================================
                     coef    std err          t      P>|t|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
Intercept          0.5266      0.047     11.200      0.000       0.434     0.619
college_fitted     0.2489      0.082      3.041      0.002       0.088     0.409
black              0.0617      0.015      4.007      0.000       0.032     0.092
hispanic           0.0283      0.015      1.879      0.060      -0.001     0.058
otherrace         -0.1067      0.023     -4.604      0.000      -0.152    -0.061
==============================================================================
Omnibus:                      654.082   Durbin-Watson:                   1.939
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1619.291
Skew:                          -0.723   Prob(JB):                         0.00
Kurtosis:                       1.543   Cond. No.                         22.6
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In this case, we explore whether interactions between college and race/ethnicity are significant predictors of voter registration. Here, it's important to meet the "rank condition": that "for every endogenous predictor included in the second stage, there must be at least one instrument included in the first stage."
To do this, we need to create a series of stage-one instruments: one for the main effect, and one for each interaction.
print "=============================================================================="
print " FIRST STAGE"
print "=============================================================================="
# generate the stage one main effect instrument
result = smf.ols(formula = "college ~ distance + black + hispanic + otherrace +" +
"distance:black + distance:hispanic + distance:otherrace", data = dee_df).fit()
dee_df['college_fitted'] = result.predict()
print result.summary()
# generate the stage one instrument for the college:black interaction
# note that we have DROPPED the irrelevant terms.
# The full form for each interaction, which gives the exact same result, is:
# result = smf.ols(formula = "college:black ~ distance + black + hispanic + otherrace +" +
# "distance:black + distance:hispanic + distance:otherrace", data = dee_df).fit()
result = smf.ols(formula = "college:black ~ distance + black + distance:black", data = dee_df).fit()
dee_df['collegeXblack'] = result.predict()
# generate the stage one instrument for the college:hispanic interaction
result = smf.ols(formula = "college:hispanic ~ distance + hispanic + distance:hispanic", data = dee_df).fit()
dee_df['collegeXhispanic'] = result.predict()
# generate the stage one instrument for the college:otherrace interaction
result = smf.ols(formula = "college:otherrace ~ distance + otherrace + distance:otherrace", data = dee_df).fit()
dee_df['collegeXotherrace'] = result.predict()
# generate the final model, that includes these interactions as predictors
result = smf.ols(formula = "register ~ college_fitted + black + hispanic + otherrace +" +
"collegeXblack + collegeXhispanic + collegeXotherrace", data = dee_df).fit()
print result.summary()
==============================================================================
                                 FIRST STAGE
==============================================================================
                            OLS Regression Results
==============================================================================
Dep. Variable:                college   R-squared:                       0.022
Model:                            OLS   Adj. R-squared:                  0.021
Method:                 Least Squares   F-statistic:                     29.57
Date:                Wed, 18 Mar 2015   Prob (F-statistic):           1.12e-40
Time:                        22:15:20   Log-Likelihood:                -6553.3
No. Observations:                9227   AIC:                         1.312e+04
Df Residuals:                    9219   BIC:                         1.318e+04
Df Model:                           7
Covariance Type:            nonrobust
======================================================================================
                         coef    std err          t      P>|t|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------------
Intercept              0.6452      0.010     64.147      0.000       0.625     0.665
distance              -0.0071      0.001     -9.830      0.000      -0.009    -0.006
black                 -0.0651      0.023     -2.858      0.004      -0.110    -0.020
hispanic              -0.1276      0.019     -6.588      0.000      -0.166    -0.090
otherrace              0.0590      0.035      1.681      0.093      -0.010     0.128
distance:black         0.0009      0.002      0.442      0.658      -0.003     0.005
distance:hispanic      0.0013      0.002      0.818      0.413      -0.002     0.004
distance:otherrace    -0.0030      0.003     -1.018      0.309      -0.009     0.003
==============================================================================
Omnibus:                       53.396   Durbin-Watson:                   1.715
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1407.217
Skew:                          -0.188   Prob(JB):                    2.67e-306
Kurtosis:                       1.124   Cond. No.                         93.0
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

                            OLS Regression Results
==============================================================================
Dep. Variable:               register   R-squared:                       0.005
Model:                            OLS   Adj. R-squared:                  0.004
Method:                 Least Squares   F-statistic:                     6.691
Date:                Wed, 18 Mar 2015   Prob (F-statistic):           6.29e-08
Time:                        22:15:20   Log-Likelihood:                -6100.6
No. Observations:                9227   AIC:                         1.222e+04
Df Residuals:                    9219   BIC:                         1.227e+04
Df Model:                           7
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
Intercept             0.4640      0.056      8.360      0.000       0.355     0.573
college_fitted        0.3587      0.097      3.703      0.000       0.169     0.549
black                 0.2780      0.163      1.702      0.089      -0.042     0.598
hispanic              0.1765      0.121      1.456      0.145      -0.061     0.414
otherrace             0.1742      0.176      0.988      0.323      -0.171     0.520
collegeXblack        -0.3986      0.303     -1.315      0.189      -0.993     0.196
collegeXhispanic     -0.2929      0.249     -1.177      0.239      -0.781     0.195
collegeXotherrace    -0.4634      0.285     -1.623      0.105      -1.023     0.096
==============================================================================
Omnibus:                      653.574   Durbin-Watson:                   1.940
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1616.409
Skew:                          -0.723   Prob(JB):                         0.00
Kurtosis:                       1.545   Cond. No.                         88.9
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
^^^ In this particular case, we find no significant interactions, so we fall back on our previous model, which simply included race/ethnicity as a covariate.
In this example, we use a logistic model with a two-stage least-squares regression. NOTE: This is not attempted in the textbook example, so I cannot be completely certain about this, unlike the above results.
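To see the mechanics in isolation, here is a self-contained sketch on simulated data (all names and values are invented; the logistic fit is a tiny two-parameter Newton-Raphson routine so nothing beyond the standard library is needed). One caveat worth knowing, and consistent with the uncertainty noted above: plugging first-stage fitted values into a nonlinear second stage is sometimes called a "forbidden regression", and unlike linear 2SLS it is not guaranteed to be consistent, so treat the resulting coefficients with care.

```python
import math
import random

random.seed(7)
n = 5000
z, x, y = [], [], []
for _ in range(n):
    zi = random.gauss(0, 1)                          # instrument
    ui = random.gauss(0, 1)                          # unobserved confounder
    xi = 0.5 * zi + ui + random.gauss(0, 1)          # endogenous predictor
    prob = 1.0 / (1.0 + math.exp(-(0.5 * xi + ui)))  # binary outcome with confounding
    z.append(zi); x.append(xi); y.append(1 if random.random() < prob else 0)

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

# First stage: linear regression of x on the instrument z
b1 = cov(z, x) / cov(z, z)
a1 = mean(x) - b1 * mean(z)
x_fitted = [a1 + b1 * zi for zi in z]

def logistic_fit(xs, ys, steps=25):
    """Newton-Raphson for a logistic model with intercept and one slope."""
    b0 = bx = 0.0
    for _ in range(steps):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(xs, ys):
            pi = 1.0 / (1.0 + math.exp(-(b0 + bx * xi)))
            g0 += yi - pi                   # gradient of the log-likelihood
            g1 += (yi - pi) * xi
            w = pi * (1.0 - pi)             # Fisher information weights
            h00 += w; h01 += w * xi; h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det   # Newton step
        bx += (h00 * g1 - h01 * g0) / det
    return b0, bx

# Second stage: logistic regression of the binary outcome on the fitted values
intercept, slope = logistic_fit(x_fitted, y)
```

The slope on the fitted values comes out positive, as it should given the data-generating process, but in this nonlinear setting it is an approximation rather than a consistent estimate of the structural coefficient.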
print "=============================================================================="
print " FIRST STAGE"
print "=============================================================================="
result = smf.glm(formula = "college ~ distance + black + hispanic + otherrace",
data=dee_df,
family=sm.families.Binomial()).fit()
print result.summary()
dee_df['college_fitted'] = result.predict()
print
print
print "=============================================================================="
print " SECOND STAGE"
print "=============================================================================="
result = smf.glm(formula = "register ~ college_fitted + black + hispanic + otherrace",
data=dee_df,
family=sm.families.Binomial()).fit()
print result.summary()
==============================================================================
                                 FIRST STAGE
==============================================================================
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                college   No. Observations:                 9227
Model:                            GLM   Df Residuals:                     9222
Model Family:                Binomial   Df Model:                            4
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -6253.9
Date:                Wed, 18 Mar 2015   Deviance:                       12508.
Time:                        22:15:20   Pearson chi2:                 9.23e+03
No. Iterations:                     6
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      0.5840      0.038     15.447      0.000       0.510     0.658
distance      -0.0283      0.002    -11.456      0.000      -0.033    -0.023
black         -0.2372      0.066     -3.621      0.000      -0.366    -0.109
hispanic      -0.4744      0.055     -8.699      0.000      -0.581    -0.368
otherrace      0.1426      0.101      1.414      0.157      -0.055     0.340
==============================================================================

==============================================================================
                                 SECOND STAGE
==============================================================================
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:               register   No. Observations:                 9227
Model:                            GLM   Df Residuals:                     9222
Model Family:                Binomial   Df Model:                            4
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -5825.1
Date:                Wed, 18 Mar 2015   Deviance:                       11650.
Time:                        22:15:20   Pearson chi2:                 9.23e+03
No. Iterations:                     6
==================================================================================
                     coef    std err          z      P>|z|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
Intercept          0.0572      0.211      0.272      0.786      -0.356     0.470
college_fitted     1.1312      0.367      3.079      0.002       0.411     1.851
black              0.2899      0.073      3.995      0.000       0.148     0.432
hispanic           0.1284      0.068      1.886      0.059      -0.005     0.262
otherrace         -0.4589      0.101     -4.566      0.000      -0.656    -0.262
==================================================================================
In this example, we use a probit model with a two-stage least-squares regression. NOTE: This is not attempted in the textbook example, so I cannot be completely certain about this, unlike the above results.
import patsy
print "=============================================================================="
print " FIRST STAGE"
print "=============================================================================="
a,b = patsy.dmatrices("college ~ distance + black + hispanic + otherrace",
dee_df,return_type="dataframe")
result = sm.Probit(a,b).fit()
print result.summary()
dee_df['college_fitted'] = result.predict()
print
print
print "=============================================================================="
print " SECOND STAGE"
print "=============================================================================="
a,b = patsy.dmatrices("register ~ college_fitted + black + hispanic + otherrace",
dee_df,return_type="dataframe")
result = sm.Probit(a,b).fit()
print result.summary()
==============================================================================
                                 FIRST STAGE
==============================================================================
Optimization terminated successfully.
         Current function value: 0.677791
         Iterations 4
                          Probit Regression Results
==============================================================================
Dep. Variable:                college   No. Observations:                 9227
Model:                         Probit   Df Residuals:                     9222
Method:                           MLE   Df Model:                            4
Date:                Wed, 18 Mar 2015   Pseudo R-squ.:                 0.01585
Time:                        22:15:20   Log-Likelihood:                -6254.0
converged:                       True   LL-Null:                       -6354.7
                                        LLR p-value:                 1.862e-42
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      0.3640      0.023     15.566      0.000       0.318     0.410
distance      -0.0176      0.002    -11.504      0.000      -0.021    -0.015
black         -0.1474      0.041     -3.607      0.000      -0.227    -0.067
hispanic      -0.2959      0.034     -8.709      0.000      -0.363    -0.229
otherrace      0.0888      0.062      1.423      0.155      -0.033     0.211
==============================================================================

==============================================================================
                                 SECOND STAGE
==============================================================================
Optimization terminated successfully.
         Current function value: 0.631312
         Iterations 4
                          Probit Regression Results
==============================================================================
Dep. Variable:               register   No. Observations:                 9227
Model:                         Probit   Df Residuals:                     9222
Method:                           MLE   Df Model:                            4
Date:                Wed, 18 Mar 2015   Pseudo R-squ.:                0.003563
Time:                        22:15:20   Log-Likelihood:                -5825.1
converged:                       True   LL-Null:                       -5845.9
                                        LLR p-value:                 1.962e-08
==================================================================================
                     coef    std err          z      P>|z|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
Intercept          0.0427      0.129      0.330      0.742      -0.211     0.296
college_fitted     0.6903      0.226      3.060      0.002       0.248     1.133
black              0.1752      0.044      4.021      0.000       0.090     0.261
hispanic           0.0782      0.042      1.879      0.060      -0.003     0.160
otherrace         -0.2834      0.063     -4.533      0.000      -0.406    -0.161
==================================================================================