In this lab we'll go beyond standard data analysis and actually code a machine learning algorithm from scratch. As this is our first such exercise we'll start with a classic linear model, Logistic Regression. We'll accomplish this with the following steps:
Before we do any programming, we need to first understand the optimization algorithm we'll use. As a reference, we'll be following the presentation presented here: https://tminka.github.io/papers/logreg/minka-logreg-old.pdf.
The logistic loss is convex so we'll use a convex optimization process called gradient descent. In particular, we will use a variant called Newton's Method that uses the 2nd derivative of the loss function to set the learning rate. To start, here are some primitives:
$L = \sum y * ln(p) + (1-y)*ln(1-p)$
(Note, we'll drop subscripts over the data instances to keep notation simple. All sums are over examples in the data set.)
$g_j = \nabla L_j = \sum (y-p)*x_j$
(This is the 1st derivative, where $j$ indicates feature dimension $j$)
$H_{jk} = \sum p*(1-p)*x_j*x_k$
(This is the 2nd derivative w.r.t. features $j$ and $k$)
Now the general form of Gradient Descent follows this function:
$w_{new} = w_{old} - \nu * g$
Where $w$ is the weight and $\nu$ is an appropriately chosen step size. In our case, we're going to use the inverse of the 2nd derviative matrix (the Hessian) as our learning rate. If we define $H$ as the Hessian matrix with entries $H_{jk}$ from above, and $G$ a gradient vector who's $jth$ entry is $g_j$, then our optimization problem becomes:
$w_{new} = w_{old} - H^{-1} * G$
This is what we are going to program here. Let's define a class that has the following methods:
#write a class with the following API
import numpy as np
class MyFirstLogReg(object):
def __init__(self, tol = 10**-8, max_iterations = 20):
self.tolerance = tol
self.max_iterations = max_iterations
self.beta = None
self.alpha = 0
def predict(self, Xint):
'''
Compute probs using the inverse logit
- Inputs: The NxK X matrix
- Outputs: Vector of probs of length N
'''
#First compute X*beta+alpha
XB = Xint.dot(self.b)
return (1+np.exp(-1*XB))**-1
def compute_gradient(self, Xint, y, p):
'''
Computes the gradient vector
-Inputs:
- NxK X matrix
- Nx1 y (label) vector
- Nx1 ps vector of predictions
-Outputs: 1xK vector of gradients
'''
return (y-p).dot(Xint)
def compute_hessian(self, Xint, P):
'''
computes the Hessian matrix
-inputs:
- NxK X matrix
- Nx1 vector of predictions
-outputs:
- KxK Hessian matrix H=X^T * Diag(Q) * X
'''
#Note, in first
Q = np.diag(P*(1-P))
return (Xint.T).dot(Q).dot(Xint)
def update_weights(self, Xint, y, i):
'''
Updates existing weight vector
-Inputs:
-NxK X matrix
-Nx1 y vector
-updates weights by calling predict, compute_gradient and compute_hessian
'''
p = self.predict(Xint)
g = self.compute_gradient(Xint, y, p)
H = self.compute_hessian(Xint, p)
#Store the current weights before updating so we can check for convergence
self.prior_b = self.b
#update the weights
self.b = self.b + np.linalg.inv(H).dot(g)
def check_stop(self):
'''
check to see if euclidean distance between old and new weights (normalized)
is less than the tolerance
returns: True or False on whether stopping criteria is met
'''
b_old_norm = self.prior_b / (np.sqrt(self.prior_b.dot(self.prior_b)))
b_new_norm = self.b / (np.sqrt(self.b.dot(self.b)))
diff = b_new_norm - b_old_norm
return (np.sqrt(diff.dot(diff)) < self.tolerance)
def fit(self, X, y):
'''
X is the Nx(K-1) data matrix
Y is the labels, using {0,1} coding
'''
#set initial weights - add an extra dimension for the intercept
self.b = np.zeros(X.shape[1] + 1)
#Initialize the slope parameter to log(base rate/(1-base rate))
self.b[-1] = np.log(y.mean() / (1-y.mean()))
#create a new X matrix that includes a column of ones for the intercept
Xint = np.hstack((X, np.ones((X.shape[0],1))))
for i in range(self.max_iterations):
#print(i)
self.update_weights(Xint, y, i)
self.beta = self.b[0:-1]
self.alpha = self.b[-1]
if self.check_stop():
break
Note about testing
One way we can test this implementation is to generate some random data according to a logistic model, and see if our model returns the correct weights. To do this:
#Generate a dataset with N=10k and K=5
def gen_logistic(N, K, Beta, Alpha):
X = np.random.random((N,K))
XB = X.dot(Beta) + Alpha * np.ones(N)
P = (1 + np.exp(-1*XB))**-1
Y = (np.random.random(N) < P)
return X, Y
K = 2
Beta = 2*(np.random.random(K)-1)
Alpha = -1
X, Y = gen_logistic(1000, K, Beta, Alpha)
Now that we have our dataset, let's test out our algorithm
lr = MyFirstLogReg()
lr.fit(X, Y)
print('The real Betas and Alpha are:')
print(Beta, Alpha)
print('')
print('The fitted Betas and Alpha are:')
print(lr.beta, lr.alpha)
The real Betas and Alpha are: [-1.84180703 -0.20277473] -1 The fitted Betas and Alpha are: [-1.64438904 -0.41202617] -1.1480464234681091
Now let's compare our fitted results to SkLearn's Logistic Regression implementation
from sklearn.linear_model import LogisticRegression
LR_SK = LogisticRegression(C=10**10).fit(X,Y)
LR_SK.coef_, LR_SK.intercept_
(array([[-1.64438881, -0.41202617]]), array([-1.14804636]))
We can see that this is pretty close, though not exact at all digits of precision. We can also see that both systems produce estimates that are far from the truth. Does this mean that the code is correct? If it is indeed correct, why might we observe the above? We can write a more robust test by running multiple draws, and seeing if on average we get the right answer. We can also increase the sample size. Doing either will be more computationally expensive. Before going there let's profile our code to see if there is a way to optimize it.
import cProfile
cProfile.run('lr.fit(X, Y)')
226 function calls in 0.028 seconds Ordered by: standard name ncalls tottime percall cumtime percall filename:lineno(function) 5 0.000 0.000 0.000 0.000 <ipython-input-1-5667aecc335f>:14(predict) 5 0.000 0.000 0.000 0.000 <ipython-input-1-5667aecc335f>:25(compute_gradient) 5 0.000 0.000 0.022 0.004 <ipython-input-1-5667aecc335f>:36(compute_hessian) 5 0.004 0.001 0.027 0.005 <ipython-input-1-5667aecc335f>:50(update_weights) 5 0.000 0.000 0.000 0.000 <ipython-input-1-5667aecc335f>:69(check_stop) 1 0.000 0.000 0.028 0.028 <ipython-input-1-5667aecc335f>:82(fit) 1 0.000 0.000 0.028 0.028 <string>:1(<module>) 2 0.000 0.000 0.000 0.000 _methods.py:48(_count_reduce_items) 2 0.000 0.000 0.000 0.000 _methods.py:58(_mean) 5 0.000 0.000 0.000 0.000 linalg.py:103(get_linalg_error_extobj) 5 0.000 0.000 0.000 0.000 linalg.py:108(_makearray) 10 0.000 0.000 0.000 0.000 linalg.py:113(isComplexType) 5 0.000 0.000 0.000 0.000 linalg.py:126(_realType) 5 0.000 0.000 0.000 0.000 linalg.py:141(_commonType) 5 0.000 0.000 0.000 0.000 linalg.py:200(_assertRankAtLeast2) 5 0.000 0.000 0.000 0.000 linalg.py:211(_assertNdSquareness) 5 0.000 0.000 0.001 0.000 linalg.py:468(inv) 1 0.000 0.000 0.000 0.000 numeric.py:156(ones) 5 0.000 0.000 0.000 0.000 numeric.py:433(asarray) 9 0.000 0.000 0.000 0.000 numeric.py:504(asanyarray) 2 0.000 0.000 0.000 0.000 shape_base.py:11(atleast_1d) 1 0.000 0.000 0.000 0.000 shape_base.py:236(hstack) 1 0.000 0.000 0.000 0.000 shape_base.py:283(<listcomp>) 5 0.000 0.000 0.009 0.002 twodim_base.py:197(diag) 5 0.000 0.000 0.000 0.000 {built-in method builtins.abs} 1 0.000 0.000 0.028 0.028 {built-in method builtins.exec} 5 0.000 0.000 0.000 0.000 {built-in method builtins.getattr} 2 0.000 0.000 0.000 0.000 {built-in method builtins.hasattr} 4 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance} 17 0.000 0.000 0.000 0.000 {built-in method builtins.issubclass} 7 0.000 0.000 0.000 0.000 {built-in method builtins.len} 14 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.array} 1 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.concatenate} 1 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.copyto} 1 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.empty} 6 0.008 0.001 0.008 0.001 {built-in method numpy.core.multiarray.zeros} 5 0.000 0.000 0.000 0.000 {method '__array_prepare__' of 'numpy.ndarray' objects} 2 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects} 5 0.000 0.000 0.000 0.000 {method 'astype' of 'numpy.ndarray' objects} 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects} 40 0.014 0.000 0.014 0.000 {method 'dot' of 'numpy.ndarray' objects} 5 0.000 0.000 0.000 0.000 {method 'get' of 'dict' objects} 2 0.000 0.000 0.000 0.000 {method 'mean' of 'numpy.ndarray' objects} 2 0.000 0.000 0.000 0.000 {method 'reduce' of 'numpy.ufunc' objects}
What method is taking the most amount of time? Looking more closely at the exact sequence of operations, what might be taking up a lot of memory or time? How can we achieve the same mathematic results, but do it in a more memory friendly way?
Let's do a quick test, where we design an alternative, and hopefully faster way to compute the Hessian.
def hessian_slow(X, P):
'''
Copy the operations used in the class above
'''
Q = np.diag(P*(1-P))
return (X.T).dot(Q).dot(X)
def hessian_fast(X, P):
'''
Rewrite this without using the np.diag function
'''
Q = P*(1-P)
XQ = X.T * Q
return XQ.dot(X)
P = 0.5 * np.ones(X.shape[0])
%timeit hessian_slow(X, P)
100 loops, best of 3: 3.49 ms per loop
%timeit hessian_fast(X, P)
The slowest run took 14.29 times longer than the fastest. This could mean that an intermediate result is being cached. 10000 loops, best of 3: 21.4 µs per loop
Does this make a difference? Now create a new class, same as above, but overwrite the compute_hessian method with the above faster version. We're going to do this in a very light and efficient way. Essentially we'll inherit MyFirstLogReg class, which means all of the methods in the base class are callable for this one. We can overwrite the compute_hessian method with our faster approach (note, in normal software development we'd likely just revise the original class, but we're doing it this way to show the example).
class MyFasterFirstLogReg(MyFirstLogReg):
def compute_hessian(self, Xint, p):
'''
computes the Hessian matrix
-inputs:
- NxK X matrix
- Nx1 vector of predictions
-outputs:
- KxK Hessian matrix H=X^T * Diag(Q) * X
'''
Q = p*(1-p)
XintQ = Xint.T * Q
return XintQ.dot(Xint)
#Now rerun
lr = MyFasterFirstLogReg()
lr.fit(X, Y)
lr.beta, lr.alpha
(array([-1.64438904, -0.41202617]), -1.1480464234681091)
As one last speed test, let's compare our faster class to sklearn.
%timeit LogisticRegression().fit(X,Y)
1000 loops, best of 3: 856 µs per loop
%timeit MyFasterFirstLogReg().fit(X, Y)
1000 loops, best of 3: 782 µs per loop
Now that we've got an efficient class written, let's perform 3 different unit tests:
#First let's generate a dataset with 1000000 examples
K = 4
Beta = 2*(np.random.random(K)-1)
Alpha = -2
X, Y = gen_logistic(1000000, K, Beta, Alpha)
LR_SK = LogisticRegression(C=10**30).fit(X,Y)
LR_Mine = MyFasterFirstLogReg()
LR_Mine.fit(X, Y)
print('Truth: beta=' + str(LR_SK.coef_) + ', Alpha=' +str(Alpha))
print('SkLearn: beta=' + str(LR_SK.coef_) + ', Alpha=' +str(Alpha))
print('Ours: beta=' + str(LR_Mine.beta) + ', Alpha=' +str(LR_Mine.alpha))
Truth: beta=[[-0.44567075 -1.42752653 -0.22013092 -1.85578601]], Alpha=-2 SkLearn: beta=[[-0.44567075 -1.42752653 -0.22013092 -1.85578601]], Alpha=-2 Ours: beta=[-0.44566344 -1.42750999 -0.22013456 -1.85580747], Alpha=-2.0060087546764263
For this test we'll check whether on average our own LR class is correct. To do this, run a loop where in each iteration, generate a random data set of size N, using the same Beta and Alpha. Then fit the model and store the weights. After this runs we can look at the distributions of features.
betas = []
alphas = []
for i in range(10000):
X, Y = gen_logistic(10000, K, Beta, Alpha)
LR_Mine = MyFasterFirstLogReg()
LR_Mine.fit(X, Y)
betas.append(LR_Mine.beta)
alphas.append(LR_Mine.alpha)
Now plot the histograms of each parameter distribution
#First put it all in a dataframe
import pandas as pd
df = pd.DataFrame(betas, columns = ['b' + str(k) for k in range(K)])
df['alpha'] = alphas
#Let's check the means against the truth
df.mean(), Beta, Alpha
(b0 -0.452829 b1 -1.459697 b2 -0.212411 b3 -1.845400 alpha -2.003525 dtype: float64, array([-0.451672 , -1.45290954, -0.20970989, -1.84053181]), -2)
import matplotlib.pyplot as plt
%matplotlib inline
bmeans = df.mean()
fig = plt.figure(figsize=(12,8))
for i in range(K):
fig.add_subplot(2,2,i+1)
plt.hist(df['b'+str(i)], bins=100)
plt.plot()
plt.title('b {}: Fit={}, Truth={}'.format(i, np.round(bmeans[i],2), np.round(Beta[i],2)))