Last updated: 19 July 2017

In [1]:
import pandas as pd
import numpy as np
import plotly
import plotly.graph_objs as go
plotly.offline.init_notebook_mode(connected=True)
%matplotlib inline

Logistic Regression

In this first part, we develop an unregularized logistic regression model that predicts whether a student gets admitted into a university.

In [2]:
admit_data = pd.read_csv("ex2data1.txt", header = None, 
                   names = ["Exam 1 Score", "Exam 2 Score", "Admission Result"])
admit_data.head()
Out[2]:
Exam 1 Score Exam 2 Score Admission Result
0 34.623660 78.024693 0
1 30.286711 43.894998 0
2 35.847409 72.902198 0
3 60.182599 86.308552 1
4 79.032736 75.344376 1

Filter the students into two groups, admitted and not admitted, so that each group can be plotted as a separate trace in the same figure.

In [3]:
admitted = admit_data[admit_data["Admission Result"] == 1]
notadmitted = admit_data[admit_data["Admission Result"] == 0]

Visualizing the data

In [4]:
admits = go.Scatter(x = admitted["Exam 1 Score"], 
                    y= admitted["Exam 2 Score"],
                    mode = "markers", name= "Admitted")
notadmits = go.Scatter(x = notadmitted["Exam 1 Score"],
                       y = notadmitted["Exam 2 Score"],
                       mode = "markers", name= "Not admitted")
plotly.offline.iplot({"data": [admits, notadmits],
                      "layout": go.Layout(title = "Scatter plot of training data",
                                          xaxis = dict(title="Exam 1 score"),
                                          yaxis = dict(title="Exam 2 score"))})

Sigmoid, Cost, and Gradient Functions

In [5]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
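As a quick sanity check of my own (not part of the exercise), the sigmoid should return exactly 0.5 at $z = 0$, saturate towards 1 and 0 for large positive and negative inputs, and broadcast over arrays:

```python
import numpy as np

# Re-defining sigmoid so this cell is self-contained
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))                             # 0.5 exactly
print(sigmoid(np.array([-35.0, 0.0, 35.0])))  # saturates towards [0, 0.5, 1]
```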
In [6]:
def compute_cost(theta, X, y):
    return 1/len(y) * (-y @ np.log(sigmoid(X@theta)) - (1-y) @ np.log(1-sigmoid(X@theta)) )

Implementation notes for compute_gradient()

The gradient function is vectorized according to the form given in the subsequent programming exercise 3 (I skipped ahead), making for an elegant implementation that does not use any loops.

In [7]:
def compute_gradient(theta, X, y):
    XT = X.T
    beta = sigmoid(X@theta) - y
    gradient = 1/len(y) * XT@beta
    return gradient
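As another sanity check of my own (not part of the exercise), the vectorized gradient can be compared against a centered finite-difference approximation of the cost on a small random problem; the two should agree to high precision:

```python
import numpy as np

# Repeating the definitions so this cell stands alone
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def compute_cost(theta, X, y):
    return 1/len(y) * (-y @ np.log(sigmoid(X@theta)) - (1-y) @ np.log(1-sigmoid(X@theta)))

def compute_gradient(theta, X, y):
    return 1/len(y) * X.T @ (sigmoid(X@theta) - y)

rng = np.random.RandomState(0)
X = np.hstack([np.ones((20, 1)), rng.randn(20, 2)])  # small design matrix with bias column
y = rng.randint(0, 2, 20)
theta = 0.1 * rng.randn(3)

# Centered finite differences: perturb each theta_j by +/- eps
eps = 1e-5
numeric = np.array([(compute_cost(theta + eps*np.eye(3)[j], X, y) -
                     compute_cost(theta - eps*np.eye(3)[j], X, y)) / (2*eps)
                    for j in range(3)])
print(np.max(np.abs(numeric - compute_gradient(theta, X, y))))  # a very small number
```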

Prepare the input matrices and vectors.

Matrix X is built by stacking the feature columns and transposing, so that each row is one training example, i.e. the design matrix. Parameter vector theta_init is initialized to 0 by convention.

In [8]:
X = np.array([np.ones(len(admit_data.index)), 
               admit_data["Exam 1 Score"], admit_data["Exam 2 Score"]]).T
theta_init = np.zeros(3)
# No need to reshape y into a column vector; keeping y as a 1-D array
# lets it multiply on the left in the cost function
# (otherwise y would have to be on the right)
y = np.asarray(admit_data["Admission Result"])
X[0:5]
Out[8]:
array([[  1.        ,  34.62365962,  78.02469282],
       [  1.        ,  30.28671077,  43.89499752],
       [  1.        ,  35.84740877,  72.90219803],
       [  1.        ,  60.18259939,  86.3085521 ],
       [  1.        ,  79.03273605,  75.34437644]])

Learning the parameters using conjugate gradient

Using the minimize function from the scipy.optimize module requires some special care, noted below.

Implementation notes

At first, I also couldn't get the optimization to work, because the arguments to my functions were in the order (X, y, theta). After some simple Googling I came across a Stack Overflow post whose author got his cost function working correctly with the arguments ordered (theta, X, y). It made me wonder whether this order mattered, so I applied the change to my functions and the optimization immediately worked!

In retrospect, I understand why theta must be the first argument passed into the cost and gradient functions. This is because the interface of the minimize function in scipy.optimize expects its x0 argument to be the initial guess, i.e. the initialized parameter values. In our case $\theta$ is the parameter, for which we set the initial guess $\theta = [\theta_0,\theta_1,\theta_2] = [0,0,0]$. minimize then performs an optimization routine on my cost function as described in compute_cost(), iteratively updating $\theta$ until the cost value converges, and returns the $\theta$ that achieved the minimized cost.

That is why the x0 argument of minimize is special and must always be the array of parameters for the cost function! Ordering my arguments as (X,y,theta) was therefore wrong, since it made X the value bound to x0 when it should have been theta.
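The same convention can be demonstrated on a toy problem of my own (the quadratic below is purely illustrative): minimize always binds x0 to the objective's first positional argument and unpacks args into the remaining positions.

```python
import numpy as np
from scipy.optimize import minimize

# The optimization variable must come FIRST; fixed data arrives via args
def quad_cost(w, A, b):
    return 0.5 * w @ A @ w - b @ w

A = np.diag([1.0, 4.0])
b = np.array([1.0, 2.0])
res = minimize(quad_cost, x0=np.zeros(2), args=(A, b), method='CG')
print(res.x)  # converges to the analytic minimum A^{-1} b = [1.0, 0.5]
```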

In [9]:
from scipy.optimize import minimize

results = minimize(compute_cost, theta_init, args = (X,y),
                   method = 'CG', jac = compute_gradient, 
                   options = {"maxiter": 400, "disp" : True})
# optimized parameters are accessible through the x attribute
theta_optimized = results.x
print("Vector of learnt parameters:",theta_optimized)
Optimization terminated successfully.
         Current function value: 0.203498
         Iterations: 62
         Function evaluations: 140
         Gradient evaluations: 140
Vector of learnt parameters: [-25.16120038   0.20623067   0.20147049]
/home/kohaugustine/miniconda2/envs/intro_machine_learning_python/lib/python3.5/site-packages/ipykernel_launcher.py:2: RuntimeWarning:

divide by zero encountered in log

Plot the Decision Boundary

The decision boundary is the line that separates the regions where y = 0 (student is not admitted) and where y = 1 (student is admitted).

The optimized parameters computed via conjugate gradient are $\theta = [\theta_0,\theta_1,\theta_2] = [-25.16120038, 0.20623067, 0.20147049]$. The input $z = \theta^Tx$ to the sigmoid function $g(z)$ is $\theta_0x_0+\theta_1x_1+\theta_2x_2$, where $x_0 = 1$. The sigmoid function has the property that $g(z)\geq0.5$ when $z\geq0$ and $g(z)<0.5$ when $z<0$.

Since we defined the hypothesis as $h_\theta(x) = g(\theta^Tx)$ and fixed the classification threshold so that the predicted class $y=1$ when $h_\theta(x)\geq0.5$ and $y=0$ when $h_\theta(x)<0.5$, it follows that $y=1$ when $\theta^Tx\geq0$ and $y=0$ when $\theta^Tx<0$. Geometrically, any data point above the decision boundary is predicted as class 1, and any point below as class 0.

Therefore, the decision boundary is the line $\theta^Tx=0 \Leftrightarrow -25.16120038+0.20623067x_1+0.20147049x_2=0$, which can be plotted in the same space as the training data.

In [10]:
x1 = np.linspace(start = admit_data["Exam 1 Score"].min(), 
                 stop = admit_data["Exam 1 Score"].max(), num = 100)
x0 = np.ones(100)
x2 = -theta_optimized[1]/theta_optimized[2] * x1 - (theta_optimized[0]/theta_optimized[2] * x0)
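As a check on the algebra (using the learnt parameter values printed above; the construction mirrors the cell), every point on the boundary line should satisfy $\theta^Tx=0$ and hence sit exactly at the $0.5$ threshold of the sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

theta = np.array([-25.16120038, 0.20623067, 0.20147049])  # learnt values from above
x1 = np.linspace(30, 100, 100)
x2 = -(theta[0] + theta[1] * x1) / theta[2]  # solve theta^T x = 0 for x2
z = np.column_stack([np.ones_like(x1), x1, x2]) @ theta
print(np.allclose(sigmoid(z), 0.5))  # True: boundary points score exactly 0.5
```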
In [11]:
decisionboundary = go.Scatter(x = x1, y = x2, mode="lines", 
                              name = "Decision Boundary" )
plotly.offline.iplot({"data": [admits, notadmits, decisionboundary],
                      "layout": go.Layout(title="Training data with decision boundary",
                                          xaxis = dict(title="Exam 1 score"),
                                          yaxis = dict(title="Exam 2 score"))})

Evaluating logistic regression

We test the predictions that our logistic regression gives on:

  • a particular case of a student with Exam 1 score = 45 and Exam 2 score = 85

  • the training dataset

Probability of a particular student

In [12]:
p = sigmoid(z = (np.array([1,45,85]) @ theta_optimized))
print("The admission probability of a student with Exam 1 score = 45 and Exam 2 score = 85 is {}.".format(p))
The admission probability of a student with Exam 1 score = 45 and Exam 2 score = 85 is 0.7762893321212395.

Accuracy on the entire training dataset

In [13]:
hypothesis_out = sigmoid(X@theta_optimized)
# By definition, whenever the hypothesis is greater than or equal to 0.5, 
# the predicted class should be set to 1, else 0.
predicted_training = np.where(hypothesis_out >= 0.5, 1, 0) 
acc = np.mean(predicted_training == admit_data["Admission Result"])*100
print("The accuracy of the logistic regression classifier as tested on the training dataset is {} percent.".format(acc))
The accuracy of the logistic regression classifier as tested on the training dataset is 89.0 percent.

Regularized logistic regression

In this second part, we develop a regularized logistic regression model that predicts whether microchips from a fabrication plant pass quality assurance.

In [14]:
chip_data = pd.read_csv("./ex2data2.txt", header = None,
                   names = ["Microchip Test 1", "Microchip Test 2",
                            "Result"])
chip_data.head()
Out[14]:
Microchip Test 1 Microchip Test 2 Result
0 0.051267 0.69956 1
1 -0.092742 0.68494 1
2 -0.213710 0.69225 1
3 -0.375000 0.50219 1
4 -0.513250 0.46564 1
In [15]:
accept = chip_data[chip_data["Result"] == 1]
reject = chip_data[chip_data["Result"] == 0]

Visualizing the data

We see that it is impossible to separate the two classes of accepted and rejected microchips without a non-linear decision boundary. We need to build our classifier with polynomial terms generated from the given features in order to obtain a non-linear boundary that can separate our data.

In [16]:
accepted = go.Scatter(x = accept["Microchip Test 1"], 
                    y= accept["Microchip Test 2"],
                    mode = "markers", name= "Accepted")
rejected = go.Scatter(x = reject["Microchip Test 1"],
                       y = reject["Microchip Test 2"],
                       mode = "markers", name= "Rejected")
plotly.offline.iplot({"data": [accepted, rejected],
                      "layout": go.Layout(title = "Scatter plot of training data",
                                          xaxis = dict(title="Microchip Test 1"),
                                          yaxis = dict(title="Microchip Test 2"))})

Feature Mapping

Polynomial terms are generated iteratively, up to the specified degree (6 in this case), from vectors of the two features, $x_1, x_2$, that we have in our training dataset.

TODO: Explore more efficient way...perhaps by using dynamic programming.

In [17]:
def map_features(x1, x2, deg):
    # Initialize array with first column of all ones, ie. the 0th degree term
    featuremap = np.ones((len(x1), 1))
    # Begin iteratively generating polynomial features from 
    # the 2nd column onwards, degree 1 terms and above
    for d in range(1,deg+1): # add 1 to the deg in order to match the upper bound to the degree value
        for j in range(0,d+1):
            newfeature = x1**(d-j) * x2**j
            featuremap = np.append(featuremap, newfeature.reshape(len(x1),1), axis = 1)
    return featuremap            
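A quick shape check of my own: for degree $d$ the mapping yields $\frac{(d+1)(d+2)}{2}$ columns, so degree 6 produces 28 features per sample.

```python
import numpy as np

# Same mapping as above, repeated so this cell stands alone
def map_features(x1, x2, deg):
    featuremap = np.ones((len(x1), 1))
    for d in range(1, deg+1):
        for j in range(0, d+1):
            featuremap = np.append(featuremap, (x1**(d-j) * x2**j).reshape(len(x1), 1), axis=1)
    return featuremap

x1 = np.array([0.05, -0.09])
x2 = np.array([0.70, 0.68])
out = map_features(x1, x2, 6)
print(out.shape)  # (2, 28): one row per sample, (6+1)(6+2)/2 = 28 polynomial terms
```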

Regularized Cost and Gradient Functions

The unregularized cost function was of the form $$J(\theta) = \frac{1}{m} \sum_{i=1}^m \left[-y^{(i)} \log( h_\theta(x^{(i)}) ) - (1 - y^{(i)}) \log( 1 - h_\theta(x^{(i)}))\right]$$ and we now add a regularization term at the end to get $$J(\theta) = \frac{1}{m} \sum_{i=1}^m \left[-y^{(i)} \log( h_\theta(x^{(i)}) ) - (1 - y^{(i)}) \log( 1 - h_\theta(x^{(i)}) ) \right]+\frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2$$ where $m$ is the number of rows in the training data and $n$ the number of features.

Implementation Note: Do not include the bias term $\theta_0$ when computing the regularized cost. So we index the parameter vector theta starting from parameter 1, which is $\theta_1$, through the last parameter.

In [18]:
def compute_cost_regularized(theta, X, y, lda):
    reg = lda/(2*len(y)) * np.sum(theta[1:]**2)
    return 1/len(y) * np.sum(-y @ np.log(sigmoid(X@theta)) - (1-y) @ np.log(1-sigmoid(X@theta))) + reg

The unregularized gradient function was $$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}_j $$ and with the regularization term added to the cost function above, the gradient becomes $$\frac{\partial J(\theta)}{\partial \theta_j} = \left(\frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}_j\right) + \frac{\lambda}{m}\theta_j$$ for $j\geq1$.

Implementation notes for adding regularization to gradient function in a vectorized form

With the gradient function already vectorized in the earlier unregularized implementation, adding regularization in a vectorized manner is also easy. We form the regularization term as a vector, regterm, in which every parameter except the first, $\theta_1 \dots \theta_n$, is multiplied by the regularization factor $\frac{\lambda}{m}$. In the implementation below, the bias parameter $\theta_0$ is left unregularized by setting the value at that position in the array to $0$ before regterm is added to the partial derivative term.

In [19]:
def compute_gradient_regularized(theta, X, y, lda):
    XT = X.T
    beta = sigmoid(X@theta) - y
    regterm = lda/len(y) * theta
    # theta_0 does not get regularized, so a 0 is substituted in its place
    regterm[0] = 0
    gradient = 1/len(y) * XT@beta + regterm
    return gradient

Initialize the input variables. The design matrix $X$ has 28 columns because of the added polynomial features.

In [20]:
X = map_features(np.array(chip_data["Microchip Test 1"]), np.array(chip_data["Microchip Test 2"]),6)
theta_init = np.zeros(X.shape[1])
y = np.asarray(chip_data["Result"])
lda = 1 # Regularization parameter, lambda, value 
X[0:2]
Out[20]:
array([[  1.00000000e+00,   5.12670000e-02,   6.99560000e-01,
          2.62830529e-03,   3.58643425e-02,   4.89384194e-01,
          1.34745327e-04,   1.83865725e-03,   2.50892595e-02,
          3.42353606e-01,   6.90798869e-06,   9.42624411e-05,
          1.28625106e-03,   1.75514423e-02,   2.39496889e-01,
          3.54151856e-07,   4.83255257e-06,   6.59422333e-05,
          8.99809795e-04,   1.22782870e-02,   1.67542444e-01,
          1.81563032e-08,   2.47750473e-07,   3.38066048e-06,
          4.61305487e-05,   6.29470940e-04,   8.58939846e-03,
          1.17205992e-01],
       [  1.00000000e+00,  -9.27420000e-02,   6.84940000e-01,
          8.60107856e-03,  -6.35227055e-02,   4.69142804e-01,
         -7.97681228e-04,   5.89122275e-03,  -4.35092419e-02,
          3.21334672e-01,   7.39785525e-05,  -5.46363780e-04,
          4.03513411e-03,  -2.98012201e-02,   2.20094970e-01,
         -6.86091891e-06,   5.06708697e-05,  -3.74226408e-04,
          2.76382476e-03,  -2.04120477e-02,   1.50751849e-01,
          6.36295342e-07,  -4.69931780e-06,   3.47065055e-05,
         -2.56322636e-04,   1.89305413e-03,  -1.39810280e-02,
          1.03255971e-01]])

Compute some test values...

In [21]:
print(compute_cost_regularized(np.ones(X.shape[1]), X,y, lda))
print(compute_gradient_regularized(np.ones(X.shape[1]), X,y,lda)[0:5])
2.13484831467
[ 0.34604507  0.08508073  0.11852457  0.1505916   0.01591449]

Learning the parameters with regularization using conjugate gradient

In [22]:
results = minimize(compute_cost_regularized, theta_init, args = (X,y,lda),
                   method = 'CG', jac = compute_gradient_regularized, 
                   options = {"maxiter": 400, "disp" : True})
# optimized parameters are accessible through the x attribute
theta_optimized = results.x
print("Vector of learnt parameters:", theta_optimized)
Optimization terminated successfully.
         Current function value: 0.529003
         Iterations: 19
         Function evaluations: 55
         Gradient evaluations: 55
Vector of learnt parameters: [ 1.2726322   0.62526851  1.18110054 -2.01977776 -0.91750969 -1.43140105
  0.12396937 -0.36542088 -0.35720348 -0.17516018 -1.45817405 -0.05109747
 -0.61556197 -0.2747374  -1.19275643 -0.24225813 -0.20594332 -0.04478629
 -0.27777193 -0.29534671 -0.45647333 -1.04328679  0.02771439 -0.29243876
  0.01551392 -0.3273817  -0.14391016 -0.92473298]

Checking the accuracy

The accuracy of the predictions on the training dataset is approximately $83.1\%$, which closely matches the expected accuracy at $\lambda=1$!

In [23]:
print(np.mean(y == np.where(sigmoid(X@theta_optimized)>=0.5,1,0))*100,"%")
83.0508474576 %

Plotting the decision boundary

The decision boundary is visualized as a contour plot. The classifier's prediction was computed for each point in an evenly spaced grid of feature values $x_1, x_2$ and stored in a 2D array which was then used to generate the plot.

Implementation notes

When I tried to vectorize the grid computation by constructing a 2D grid for every feature using featuremap = map_features(x1,x2,6) and np.meshgrid(*[featuremap[i] for i in range(0, featuremap.shape[1])]), the memory requirement exploded with the size of the two arrays x1 and x2. When these arrays had just 2 elements each, the calculation completed but consumed a lot of memory; at any larger size numpy refused to execute the computation.

Generate evenly spaced feature values.

In [24]:
x1 = np.linspace(start = chip_data["Microchip Test 1"].min(), 
                 stop = chip_data["Microchip Test 1"].max(), num = 100)
x2 = np.linspace(start = chip_data["Microchip Test 2"].min(), 
                 stop = chip_data["Microchip Test 2"].max(), num = 100)

Define a simpler version of the feature-mapping function that accepts two scalar feature values and returns a vector of polynomial features.

In [25]:
# Returns a vector of new polynomial features given two scalars
def map_features_from_scalar(x1,x2,deg):
    featvec = np.ones(1)
    for d in range(1,deg+1): # add 1 to the deg in order to match the upper bound to the degree value
        for j in range(0,d+1):
            newfeature = x1**(d-j) * x2**j
            featvec = np.append(featvec, newfeature)
    return featvec   
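A quick consistency check of my own: for any single pair of values, the scalar version should reproduce the corresponding row of map_features (both definitions are repeated here so the cell stands alone):

```python
import numpy as np

def map_features(x1, x2, deg):
    featuremap = np.ones((len(x1), 1))
    for d in range(1, deg+1):
        for j in range(0, d+1):
            featuremap = np.append(featuremap, (x1**(d-j) * x2**j).reshape(len(x1), 1), axis=1)
    return featuremap

def map_features_from_scalar(x1, x2, deg):
    featvec = np.ones(1)
    for d in range(1, deg+1):
        for j in range(0, d+1):
            featvec = np.append(featvec, x1**(d-j) * x2**j)
    return featvec

vec = map_features_from_scalar(0.3, -0.5, 6)
row = map_features(np.array([0.3]), np.array([-0.5]), 6)[0]
print(np.allclose(vec, row))  # True
```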
    

Compute classifier predictions for each point on the evenly spaced grid

As noted above, I initially attempted to vectorize the grid generation across all 28 features using np.meshgrid(), but found that this requires far too much memory.

I eventually settled on a double loop to generate the grid, recomputing all the polynomial features at each point. Fully vectorizing that approach would need more memory than my puny machine can handle; it is a trade-off between saving computation time and saving memory.
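In hindsight there is a middle ground (a sketch of my own, not the approach used below): instead of meshgridding all 28 feature columns, meshgrid only the two raw features and pass the flattened grids to map_features. A 100x100 grid then needs only a 10000x28 matrix, which easily fits in memory:

```python
import numpy as np

# Same polynomial feature mapping as earlier, repeated so this cell stands alone
def map_features(x1, x2, deg):
    featuremap = np.ones((len(x1), 1))
    for d in range(1, deg+1):
        for j in range(0, d+1):
            featuremap = np.append(featuremap, (x1**(d-j) * x2**j).reshape(len(x1), 1), axis=1)
    return featuremap

x1 = np.linspace(-1.0, 1.5, 100)
x2 = np.linspace(-1.0, 1.5, 100)
g1, g2 = np.meshgrid(x1, x2)                 # only two 100x100 grids
F = map_features(g1.ravel(), g2.ravel(), 6)  # (10000, 28): all grid points at once
theta = np.zeros(28)                         # stand-in for the learnt parameter vector
z = (F @ theta).reshape(g1.shape)            # theta^T x over the whole grid, no loops
print(F.shape, z.shape)
```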

In [26]:
z = np.zeros((len(x1),len(x1)))
for i in range(0, len(x1)):
    for j in range(0,len(x2)):
        # Multiply the vector of polynomial features for each pair of (x1, x2) values
        # by the vector of optimized parameters to obtain theta^T x at that grid point
        z[i,j] = map_features_from_scalar(x1[i],x2[j],6) @ theta_optimized
# Convert z to predictions
z = np.where(sigmoid(z)>=0.5,1,0)
In [27]:
contmap = go.Contour(z = z, x=x1, y=x2, showlegend = True, 
                     showscale = True, ncontours=1, 
                     text =np.asarray(list(map(lambda x: "Accepted" if x == 1 else "Rejected",
                            z.flatten()))).reshape(len(x1),len(x1)),
                        hoverinfo = "text" ,
                     colorbar=dict(ticktext=["Rejected", "Accepted"],
                                  tickmode = "array", 
                                   title = "Prediction"),
                     contours = dict(start=0,end=1))

plotly.offline.iplot({
        "data": [contmap],
        "layout" : go.Layout(title ="Decision Boundary as Contour Plot",
                             xaxis = dict(title="Microchip Test 1"),
                             yaxis = dict(title="Microchip Test 2"))
    })

Optional exercises

In this part, we explore different regularization parameters for the dataset to understand how regularization prevents overfitting. Before we begin, it will be helpful to wrap up our above code into functions that can be called with different regularization parameters multiple times easily.

In [28]:
def compute_optimal_theta(X,y,lda):
    results = minimize(compute_cost_regularized, theta_init, args = (X,y,lda),
                   method = 'CG', jac = compute_gradient_regularized, 
                  options = {"maxiter": 400})
    theta_optimized = results.x
    return theta_optimized
In [29]:
def generate_classification_grid(optimal_theta):
    x1 = np.linspace(start = chip_data["Microchip Test 1"].min(), 
                     stop = chip_data["Microchip Test 1"].max(), num = 200)
    x2 = np.linspace(start = chip_data["Microchip Test 2"].min(), 
                     stop = chip_data["Microchip Test 2"].max(), num = 200)
    z = np.zeros((len(x1),len(x1)))
    for i in range(0, len(x1)):
        for j in range(0,len(x2)):
            z[i,j] = map_features_from_scalar(x1[i],x2[j],6) @ optimal_theta
    
    z=np.where(sigmoid(z)>=0.5,1,0)
    return x1,x2,z
In [30]:
def generate_decision_boundary_for_given_lambda(X,y,lda,_title):
    optimal_theta = compute_optimal_theta(X,y,lda)
    a,b,c = generate_classification_grid(optimal_theta)
    
    # We only show lines on the contour plot in order to display the decision boundary clearly
    contmap = go.Contour(z = c, x=a, y=b, showlegend = True, 
                         name = "Prediction", 
                         text =np.asarray(list(map(lambda x: "Accepted" if x == 1 else "Rejected",
                                               c.flatten()))).reshape(len(a),len(a)),
                        hoverinfo = "name+x+y+text" ,
                         ncontours = 2,
                         line = dict(smoothing=1.3, width = 2),
                         contours = dict(coloring="lines"), 
                          showscale = False)

    plotly.offline.iplot({
        "data": [accepted,rejected,contmap],
        "layout" : go.Layout(title = _title,
                             xaxis = dict(title="Microchip Test 1"),
                             yaxis = dict(title="Microchip Test 2"))
    })

Examples of good regularization

Let us consider the decision boundaries when $\lambda = 0.2$ and $\lambda = 1$. Click the "Show closest data on hover" button in the top right-hand corner of the plot to reduce clutter while hovering.

In [31]:
generate_decision_boundary_for_given_lambda(X,y,0.2,"Good Regularization (When lambda = 0.2)")
In [32]:
generate_decision_boundary_for_given_lambda(X,y,1,"Good Regularization (When lambda = 1)")