Implementing and vectorizing a Maximum Likelihood model with scipy

By: Matt Ranger

This notebook walks through the process of coding, testing, estimating, and vectorizing a Maximum Likelihood (MLE) model "from scratch" using scipy's numerical optimization package. Not all MLE models are available in pre-cooked packages, so this skill is necessary for some research topics.

We will only be touching a simple model here (regular probit), so as not to get bogged down in extraneous complexities. This general method of estimation, testing, and vectorization will however work for a wide range of MLE models.

Probit

The basic probit model takes a binary dependent variable, $Y$, and assumes that $Pr(Y=1 | X) = \Phi(X^T \beta)$ where $X$ is the matrix of independent variables, $\beta$ is the vector of parameters to estimate, and $\Phi$ is the CDF of the standard normal distribution. We want to take the likelihood function $L=PR(\beta|X, Y)$, and maximize it over $\beta$ to get the most likely $\beta$ paramaters given the data $X,Y$. We usually use the log of the likelihood function in practice, because it is simpler in both math and computation, and the maximum point is the same. The probit log likelihood is as follows:

$$ln L(\beta|X,Y) = \sum_{i=1}^n[y_i ln \Phi(x_i'\beta)+(1-y_i)ln(1-\Phi(x_i'\beta)) ]$$

Which we can translate into a naive python function like this:

In [1]:
import numpy as np
from scipy.stats import norm

def LogLikeProbit(betas, y, x):
    """
    Probit Log Likelihood function
    Very slow naive Python version
    Input:
        betas is a np.array of parameters
        y is a one dimensional np.array of endogenous data
        x is a 2 dimensional np.array of exogenous data
            First vertical colmn of X is assumed to be constant term,
            corresponding to betas[0]
    returns:
        negative of log likehood value (scalar)
    """
    result = 0
    #Sum operation
    for i in range(0, len(y)):
        #Get X_i * Beta value
        xb = np.dot(x[i], betas)
        
        #compute both binary probabilities from xb     
        #Add to total log likelihood
        llf = y[i]*np.log(norm.cdf(xb)) + (1-y[i])*np.log(1 - norm.cdf(xb))
        result += llf
    return -result

Note that we return the negative value of the result because we want to maximize over this function, and numerical optimizers are traditionally minimizers. Minimizing over the negative values will be the same as maximizing the function.

Generating a testing environment for your model

When creating a model from scratch, we need to know it is correct on data where we know the real values and distributions. Here is artificial data to test our probit model on:

In [5]:
######################
#ARTIFICIAL DATA
######################

#sample size
n = 1000

#random generators
z1 = np.random.randn(n)
z2 = np.random.randn(n)

#create artificial exogenous variables 
x1 = 0.8*z1 + 0.2*z2
x2 = 0.2*z1 + 0.8*z2
#create error term
u = 2*np.random.randn(n)

#create endogenous variable from x1, x2 and u
ystar = 0.5 + 0.75*x1 - 0.75*x2 + u

#create latent binary variable from ystar
def create_dummy(data, cutoff):
    result = np.zeros(len(data))
    for i in range(0, len(data)):
        if data[i] >= cutoff:
            result[i] = 1
        else:
            result[i] = 0
    return result

#get latent LHS variable
y = create_dummy(ystar, 0.5)

#prepend vector of ones to RHS variables matrix
#for constant term
const = np.ones(n)
x = np.column_stack((const, np.column_stack((x1, x2))))

Testing the model

We can now maximize the probit log likelihood to get the most likely vector of parameters given the artificial data using scipy's powerful numerical optimization library:

In [6]:
from scipy.optimize import minimize

#create beta hat vector to maximize on
#will store the values of maximum likelihood beta parameters
#Arbitrarily initialized to all zeros
bhat = np.zeros(len(x[0]))

#unvectorized MLE estimation
probit_est = minimize(LogLikeProbit, bhat, args=(y,x), method='nelder-mead')

#print vector of maximized betahats
probit_est['x']
Out[6]:
array([ 0.05125986,  0.34315356, -0.4281118 ])

Note here that probit regression results can't be interpreted directly like OLS regression results. Parameter estimates are divided by $\sigma$ (which we set to 2 when generating the error term). You can multiply the last two values in the results displayed ($\hat{\beta_1}$ and $\hat{\beta_2}$ ) by $\sigma$ and compare that to our generated values of the constant, $x_1$ and $x_2$ (0.75 and -0.75 respectively).

The $\hat{\beta}_0$ constant depends on the cutoff point that determines the binary variable value and $\sigma$; we're not interested in its value for today.

The estimates seem to be slightly off with our small samle size. $\hat{\beta_1}$ and $\hat{\beta_2}$ should be around 0.375 and -0.375.

Summary Statistics

This is a good time to get some summary statistics on our empirical estimate. In maximum likelihood estimation, the standard error is usually computed from the Cramèr-Rao lower bound. We can then use the standard error to get t- and p-values. The C-R lower bound is computed by the square root of the diagonal elements of the inverse of the Hessian at our estimated parameters. Statsmodels' numerical differentiation toolbox makes this easy:

In [8]:
import statsmodels.tools.numdiff as smt
import scipy as sc

#Get inverse hessian for Cramer Rao lower bound
b_estimates = probit_est['x']
Hessian = smt.approx_hess3(b_estimates, LogLikeProbit, args=(y,x))
invHessian = np.linalg.inv(Hessian)

#Standard Errors from C-R LB
#from diagonal elements of invHessian
SE = np.zeros(len(invHessian))
for i in range(0, len(invHessian)):
    SE[i] =  np.sqrt(invHessian[i,i])
    
#t and p values
t_statistics = (b_estimates/SE)
pval = (sc.stats.t.sf(np.abs(t_statistics), 999)*2)

print("Beta Hats: ", b_estimates)
print("SE: ", SE)
print("t stat: ", t_statistics)
print("P value: ", pval)
Beta Hats:  [ 0.05125986  0.34315356 -0.4281118 ]
SE:  [ 0.04049858  0.05533922  0.05726539]
t stat:  [ 1.26571979  6.20091066 -7.47592581]
P value:  [  2.05908521e-01   8.20432023e-10   1.67354497e-13]

We can see our $\hat{\beta_1}$ and $\hat{\beta_2}$ estimates are within one standard error of the expected values.

Vectorizing

Now that we know this probit model works, it would be a good time to vectorize the naive python into a performant version.

The naive model is very inefficient. More complex models will require nontrivial computational power to estimate, and proper vectorization can easily reduce a computation from taking hours to taking minutes.

The main idea is to replace as much computation from "pure python" to optimized numpy/scipy functions. To do this we need to look closely at what our code is doing. Here is the main loop in the naive probit model:

In [ ]:
for i in range(0, len(y)):
        xb = np.dot(X[i], betas)
        llf = y[i]*np.log(norm.cdf(xb)) + (1-y[i])*np.log(1 - norm.cdf(xb))
        result += llf

The outer loop is actually just a $\sum_{i=0}^n$ operation, so we can replace the outer for loop by numpy's optimized sum function. To do this, we have to make a couple of changes to the code.

  • First, we replace the for loop by wrapping np.sum() around the code inside the loop.

  • Move the $X_i'\beta$ computation outside the loop to run it only once.

  • Relace the conditional $y_i$ and $(1-y_i)$ in the loop by pythonic conditionals that are allowed inside the sum() function. The (y==1) and (y==0) conditional expressions can do this inside the sum function.

In [ ]:
xb = np.dot(x, betas)
result = np.sum(    
        (y==1)*np.log(1 - norm.cdf(xb)) + 
        (y==0)*np.log(norm.cdf(xb))
        )

For further optimization, we can use the natural log's mathematical propreties and move the log calculation to the outer part of the loop, so it's calculated once for the entire sum, instead of at each observation:

In [ ]:
xb = np.dot(x, betas)
result = np.sum(np.log(  
        (y==1)*(1 - norm.cdf(xb)) + 
        (y==0)*(norm.cdf(xb))
        ))

So the new vectorized log likelihood function looks like this:

In [9]:
def VectorizedProbitLL(betas, y, x):
    xb = np.dot(x, betas)
    result = np.sum(np.log(  
        (y==0)*(1 - norm.cdf(xb)) + 
        (y==1)*(norm.cdf(xb))
        ))
    return -result

Again, we return the negative value of the sum because optimization libraries generally minimize, and we're trying to maximize. We'll see a drastic difference in runtime right away:

In [10]:
import timeit

%timeit minimize(VectorizedProbitLL, bhat, args=(y,x), method='nelder-mead')
10 loops, best of 3: 98.4 ms per loop
In [11]:
%timeit minimize(LogLikeProbit, bhat, args=(y,x), method='nelder-mead')
1 loop, best of 3: 50.2 s per loop

So the vectorized version is 400-500x faster (!!!) with the Nelder-Mead algorithm. This was done on an intel i5-4210u, a mid range laptop processor, for reference.

Note that while 50 seconds is not a problem on our trivial sample size (n=1000 in the artificial data above), unvectorized code can become a serious problem on large datasets.

There are many algorithms in scipy.optimize.minimize's module; this is a good reference to help choose the best one.If in doubt, the Nelder-Mead algorithm is included in scipy's minimizer and is usually a good choice. Nelder-Mead doesn't require estimating derivatives of the function, and as such fails less often, at a cost of being slow to converge on larger datasets.

BFGS is a popular algorithm due to its speed, but necessitates computing second derivatives at each iteration:

In [12]:
%timeit minimize(VectorizedProbitLL, bhat, args=(y,x), method='bfgs')
10 loops, best of 3: 41.4 ms per loop

BFGS is about twice as fast as Nelder-Mead in this case.

Note that the BFGS algorithm sometimes throws a warning about a division by 0 here. This most likely comes from the fact that all our data is binary; methods that have to estimate derivatives have a harder time with sparse or binary data (among others).