Estimating simple regression models in Julia¶

By Tyler Ransom, Duke University Social Science Research Institute

tyler.ransom@duke.edu

With our knowledge of the DataFrames package, we can use Julia to estimate common regression models which are useful in summarizing and describing data.

First, let's call the package we'll need for this demonstration. We'll be using Julia version 0.4.1 with GLM version 0.5.0 and DataFrames version 0.6.10.

In [8]:

using GLM
using DataFrames

OLS¶

Let's start by loading in the same CSV file from our DataFrames example: (available at https://github.com/jmxpearson/duke-julia-ssri-2016/auto.csv)

In [10]:

auto = readtable("auto.csv");
showcols(auto)

74x12 DataFrames.DataFrame
| Col # | Name         | Eltype     | Missing |
|-------|--------------|------------|---------|
| 1     | make         | UTF8String | 0       |
| 2     | price        | Int64      | 0       |
| 3     | mpg          | Int64      | 0       |
| 4     | rep78        | Int64      | 5       |
| 5     | headroom     | Float64    | 0       |
| 6     | trunk        | Int64      | 0       |
| 7     | weight       | Int64      | 0       |
| 8     | length       | Int64      | 0       |
| 9     | turn         | Int64      | 0       |
| 10    | displacement | Int64      | 0       |
| 11    | gear_ratio   | Float64    | 0       |
| 12    | foreign      | Int64      | 0       |

Now let's use the GLM package to estimate a simple ordinary least squares regression. Let's use :price as the dependent variable and :mpg and :weight as the indepenent variables.

The syntax for estimating an OLS model with the GLM package is:

object = glm(Y~X,data,Normal(),IdentityLink()) where

object is the Julia object that stores all of the results
Y is the dependent variable
X is a linear combination of independent variables
data is the name of the data frame that holds the data
And the last two options can be adjusted to estimate other models (which we'll get to later)

Users with experience in R will recognize the tight correspondence between Julia's GLM estimation syntax and R's.

Let's estimate our model now:

In [13]:

olsest = glm(price~mpg+weight,auto,Normal(),IdentityLink())

Out[13]:

DataFrames.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Normal,GLM.IdentityLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Float64}:

Coefficients:
             Estimate Std.Error   z value Pr(>|z|)
(Intercept)   1946.07   3597.05  0.541018   0.5885
mpg          -49.5122    86.156 -0.574681   0.5655
weight        1.74656  0.641354   2.72324   0.0065

We see the reported coefficients, standard errors, z-values, and corresponding p-values for each covariate included in the model. Note that the intercept was automatically included.

Additionally, we can extract these objects, and others, by the following syntax:

In [19]:

coef(olsest)

Out[19]:

3-element Array{Float64,1}:
 1946.07   
  -49.5122 
    1.74656

In [20]:

stderr(olsest)

Out[20]:

3-element Array{Float64,1}:
 3597.05    
   86.156   
    0.641354

In [21]:

[coef(olsest) stderr(olsest)]

Out[21]:

3x2 Array{Float64,2}:
 1946.07     3597.05    
  -49.5122     86.156   
    1.74656     0.641354

Additional objects that we can extract from the stored estimates object include:

deviance(), which is the weighted sum of squares of the residuals
vcov(), which is the estimated variance-covariance matrix of the parameters
predict(), which obtains predicted values of the outcome for each observation

In [22]:

deviance(olsest)

Out[22]:

4.487441163821706e8

In [27]:

pricehat = predict(olsest)

Out[27]:

74-element Array{Float64,1}:
 5974.22
 6955.33
 5467.72
 6632.14
 8329.35
 7464.72
 4553.58
 6684.54
 7930.52
 6943.64
 8815.5 
 8064.48
 8399.05
    ⋮   
 3918.89
 7226.13
 3854.95
 3793.59
 5264.06
 4253.62
 5718.16
 4579.86
 3479.05
 4079.12
 4183.92
 6640.95

In [28]:

[convert(Array,auto[:,:price]) pricehat]

Out[28]:

74x2 Array{Float64,2}:
  4099.0  5974.22
  4749.0  6955.33
  3799.0  5467.72
  4816.0  6632.14
  7827.0  8329.35
  5788.0  7464.72
  4453.0  4553.58
  5189.0  6684.54
 10372.0  7930.52
  4082.0  6943.64
 11385.0  8815.5 
 14500.0  8064.48
 15906.0  8399.05
     ⋮           
  3995.0  3918.89
 12990.0  7226.13
  3895.0  3854.95
  3798.0  3793.59
  5899.0  5264.06
  3748.0  4253.62
  5719.0  5718.16
  7140.0  4579.86
  5397.0  3479.05
  4697.0  4079.12
  6850.0  4183.92
 11995.0  6640.95

In [30]:

vcov(olsest)

Out[30]:

3x3 Array{Float64,2}:
     1.29388e7    -2.9276e5  -2191.9     
    -2.9276e5   7422.86         44.6017  
 -2191.9          44.6017        0.411335

In [31]:

[sqrt(diag(vcov(olsest))) stderr(olsest)]

Out[31]:

3x2 Array{Float64,2}:
 3597.05      3597.05    
   86.156       86.156   
    0.641354     0.641354

Binary response models¶

Here we will estimate parameters associated with a binary dependent variable. The two canonical models that we will discuss here are the logistic regression model (logit) and the probit model. The two models usually give similar results, and discussing their differences is beyond the current scope.

Logit¶

The logit model is called using the Binomial() and LogitLink() options in the glm() function:

In [32]:

logitest = glm(foreign~mpg+weight,auto,Binomial(),LogitLink())

Out[32]:

DataFrames.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Binomial,GLM.LogitLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Float64}:

Coefficients:
               Estimate  Std.Error  z value Pr(>|z|)
(Intercept)     13.7084    4.51742  3.03456   0.0024
mpg           -0.168587  0.0919032  -1.8344   0.0666
weight       -0.0039067 0.00101117 -3.86356   0.0001

Here, we can predict the probability that a given vehicle is from a foreign maker and compare that with what was observed.

In [33]:

[convert(Array,auto[:,:foreign]) predict(logitest)]

Out[33]:

74x2 Array{Float64,2}:
 0.0  0.190436  
 0.0  0.0957767 
 0.0  0.422082  
 0.0  0.0862626 
 0.0  0.00849477
 0.0  0.0249945 
 0.0  0.648662  
 0.0  0.0774615 
 0.0  0.0155653 
 0.0  0.0585486 
 0.0  0.00380411
 0.0  0.0200754 
 0.0  0.00136983
 ⋮              
 1.0  0.714123  
 1.0  0.117869  
 1.0  0.898059  
 1.0  0.449941  
 1.0  0.778794  
 1.0  0.471888  
 1.0  0.560431  
 1.0  0.800974  
 1.0  0.236247  
 1.0  0.875856  
 1.0  0.848046  
 1.0  0.176266

Probit¶

Similar to the logit case, the probit model is called using the Binomial() and ProbitLink() options in the glm() function:

In [34]:

probitest = glm(foreign~mpg+weight,auto,Binomial(),ProbitLink())

Out[34]:

DataFrames.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Binomial,GLM.ProbitLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Float64}:

Coefficients:
                Estimate   Std.Error  z value Pr(>|z|)
(Intercept)      8.27532     2.57855  3.20929   0.0013
mpg            -0.103948   0.0542063 -1.91763   0.0552
weight       -0.00233551 0.000556961 -4.19331    <1e-4

In [35]:

[convert(Array,auto[:,:foreign]) predict(probitest)]

Out[35]:

74x2 Array{Float64,2}:
 0.0  0.196393   
 0.0  0.0941283  
 0.0  0.429645   
 0.0  0.0816519  
 0.0  0.00245574 
 0.0  0.0151149  
 0.0  0.642254   
 0.0  0.0715818  
 0.0  0.0071502  
 0.0  0.0504584  
 0.0  0.000496129
 0.0  0.0110559  
 0.0  4.30191e-5 
 ⋮               
 1.0  0.702837   
 1.0  0.121525   
 1.0  0.902976   
 1.0  0.440127   
 1.0  0.781031   
 1.0  0.466058   
 1.0  0.566884   
 1.0  0.799495   
 1.0  0.226333   
 1.0  0.878817   
 1.0  0.848251   
 1.0  0.185297

Factor variables¶

Suppose you have a categorical variable (e.g. ethnicity, or gender) that you want to include as an independent variable in your model.

Several existing software packages have this capability. In Stata, use i.gender; in R use gender <- as.factor(gender).

In Julia, factor levels are created using what is called a "pooled data array."

Let's see how they work by estimating an OLS model on the auto data frame, using the :rep78 variable, which takes on categorical values 1 through 5.

In [36]:

olsest2 = glm(price~mpg+weight+rep78,auto,Normal(),IdentityLink())

Out[36]:

DataFrames.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Normal,GLM.IdentityLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Float64}:

Coefficients:
             Estimate Std.Error   z value Pr(>|z|)
(Intercept)  -1939.87   3622.35 -0.535528   0.5923
mpg          -52.2172   83.7396 -0.623567   0.5329
weight        2.11149  0.619008   3.41108   0.0006
rep78         820.812   320.899   2.55786   0.0105

Here, the glm() function thinks that :rep78 is a continuous variable. In order to "pool" this variable, we use the following syntax:

In [37]:

pool!(auto,:rep78);

Once the pooling is done, we can look at the levels of the pooled vector using the levels() function:

In [38]:

levels(auto[:rep78])

Out[38]:

5-element Array{Int64,1}:
 1
 2
 3
 4
 5

Now when we re-run this regression, we should expect to obtain four different coefficient estimates representing the four uniquely identifiable categories of :rep78.

In [39]:

olsest3 = glm(price~mpg+weight+rep78,auto,Normal(),IdentityLink())

Out[39]:

DataFrames.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Normal,GLM.IdentityLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Float64}:

Coefficients:
             Estimate Std.Error   z value Pr(>|z|)
(Intercept)  -598.967    3960.9  -0.15122   0.8798
mpg          -63.0971   87.4528 -0.721499   0.4706
weight        2.09307  0.636901   3.28633   0.0010
rep78 - 2     753.702   1919.76  0.392602   0.6946
rep78 - 3     1349.36   1772.71  0.761187   0.4465
rep78 - 4     2030.47   1810.09   1.12175   0.2620
rep78 - 5     3376.91   1900.17   1.77716   0.0755

This is indeed the case.

Pooling imported data automatically¶

It is possible to automatically pool the data frame upon import using the makefactors option of readtable:

In [40]:

auto2 = readtable("auto.csv",makefactors=true);

However, this functionality only works with string variables, so any numerical categorical variables must be pooled manually.