Last updated as of 19 July 2017

In [1]:

```
import pandas as pd
import numpy as np
import plotly
import plotly.graph_objs as go
plotly.offline.init_notebook_mode(connected=True)
%matplotlib inline
```

In this first part, we develop an unregularized logistic regression model that predicts whether a student gets admitted into a university.

In [2]:

```
admit_data = pd.read_csv("ex2data1.txt", header = None,
                         names = ["Exam 1 Score", "Exam 2 Score", "Admission Result"])
admit_data.head()
```

Out[2]:

Filter the two groups of students, admitted and not admitted, and plot them as two separate traces in the same figure.

In [3]:

```
admitted = admit_data[admit_data["Admission Result"] == 1]
notadmitted = admit_data[admit_data["Admission Result"] == 0]
```

In [4]:

```
admits = go.Scatter(x = admitted["Exam 1 Score"],
                    y = admitted["Exam 2 Score"],
                    mode = "markers", name = "Admitted")
notadmits = go.Scatter(x = notadmitted["Exam 1 Score"],
                       y = notadmitted["Exam 2 Score"],
                       mode = "markers", name = "Not admitted")
plotly.offline.iplot({"data": [admits, notadmits],
                      "layout": go.Layout(title = "Scatter plot of training data",
                                          xaxis = dict(title="Exam 1 score"),
                                          yaxis = dict(title="Exam 2 score"))})
```

In [5]:

```
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
```
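As a quick sanity check (my own addition, not part of the exercise), $g(0)$ should be exactly $0.5$ and the function should saturate towards $0$ and $1$ at the extremes. The snippet redefines `sigmoid` so it stands alone:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))                        # 0.5
print(sigmoid(np.array([-10.0, 10.0])))  # very close to [0, 1]
```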

In [6]:

```
def compute_cost(theta, X, y):
    return 1/len(y) * (-y @ np.log(sigmoid(X@theta)) - (1-y) @ np.log(1-sigmoid(X@theta)))
```
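A useful property to check: with $\theta$ initialized to all zeros, $h_\theta(x) = g(0) = 0.5$ for every example, so the cost reduces to $-\log(0.5) = \log 2 \approx 0.693$ regardless of the data. A minimal check on hypothetical toy data (my own addition), with the functions redefined so the snippet runs on its own:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def compute_cost(theta, X, y):
    return 1/len(y) * (-y @ np.log(sigmoid(X@theta)) - (1-y) @ np.log(1-sigmoid(X@theta)))

# Toy design matrix (bias column + one feature) and labels
X_toy = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y_toy = np.array([0, 1, 1])
cost_at_zero = compute_cost(np.zeros(2), X_toy, y_toy)
print(cost_at_zero)  # log(2), about 0.6931
```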

The gradient function `compute_gradient()` is vectorized according to the form given in the subsequent programming exercise 3 (I skipped ahead), making for an elegant implementation that does not use any loops.

In [7]:

```
def compute_gradient(theta, X, y):
    gradient = np.zeros(len(theta))
    XT = X.T
    beta = sigmoid(X@theta) - y
    gradient = 1/len(y) * XT@beta
    return gradient
```

Matrix `X` is formed with a transpose so that it becomes the design matrix, with one training example per row and a leading column of ones. Parameter vector `theta_init` is initialized to all zeros by convention.

In [8]:

```
X = np.array([np.ones(len(admit_data.index)),
              admit_data["Exam 1 Score"], admit_data["Exam 2 Score"]]).T
theta_init = np.zeros(3)
# Do not need to reshape y to be a column vector, keep y as row vector
# in order to complete the multiplication with y on the left hand side
# otherwise y needs to be on the right
y = np.asarray(admit_data["Admission Result"])
X[0:5]
```

Out[8]:

Use of the `minimize` function in the `scipy.optimize` module requires some special care, noted below.

At first, I couldn't get the optimization to work, and it was because the arguments to my functions were in the order `(X, y, theta)`. After some simple Googling, I came across a Stack Overflow post whose author got his cost function to work correctly with the arguments in the order `(theta, X, y)`. It made me wonder whether this order mattered, so I applied the change to my functions and the optimization immediately worked!

In retrospect, I understand why `theta` must be the first argument passed into the cost and gradient functions. This is because the interface of the `minimize` function in `scipy.optimize` expects its `x0` argument to be the **initial guess**, i.e. the initialized parameter values. In our case $\theta$ is the parameter for which we set our initial guess to $\theta = [\theta_0,\theta_1,\theta_2] = [0,0,0]$. `minimize` then performs an optimization routine on the cost function described in `compute_cost()`, iteratively updating $\theta$ until the cost value converges. It then returns the $\theta$ for which this minimized cost value was obtained.

That is why the `x0` argument of `minimize` is special and must always be the array of parameters for the cost function! It was thus wrong to order my arguments as `(X, y, theta)`, as this made `X` the argument passed to `x0` when it should have been `theta`.
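The `x0` convention can be demonstrated on a toy objective (a hypothetical example of mine, not part of the exercise): `minimize` passes the `x0` array as the first positional argument of the objective on every iteration, with everything in `args` appended after it.

```python
import numpy as np
from scipy.optimize import minimize

# The objective must take the parameter vector first; extra data arrives via `args`
def quadratic(theta, offset):
    return np.sum((theta - offset)**2)

res = minimize(quadratic, x0=np.zeros(2), args=(np.array([3.0, -1.0]),))
print(res.x)  # converges to the minimizer, approximately [3, -1]
```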

In [9]:

```
from scipy.optimize import minimize
results = minimize(compute_cost, theta_init, args = (X,y),
                   method = 'CG', jac = compute_gradient,
                   options = {"maxiter": 400, "disp" : True})
# optimized parameters are accessible through the x attribute
theta_optimized = results.x
print("Vector of learnt parameters:",theta_optimized)
```

The **decision boundary** is the line that separates the regions where y = 0 (student is not admitted) and where y = 1 (student is admitted).

Our optimized parameters as computed via conjugate gradient are $\theta = [\theta_0,\theta_1,\theta_2] = [-25.16120038, 0.20623067, 0.20147049]$. The input to the sigmoid function $g(z)$ is $z = \theta^Tx = \theta_0x_0+\theta_1x_1+\theta_2x_2$, where $x_0 = 1$. The sigmoid function has the property that $g(z)\geq0.5 \space when \space z\geq0$ and $g(z)<0.5 \space when \space z<0$.

Since we have defined our hypothesis as $h_\theta(x) = g(\theta^Tx)$ and fixed the threshold so that the predicted class value, $y$, is $ y=1 \space when \space h_\theta(x)\geq0.5$ and $y=0 \space when \space h_\theta(x)<0.5$, it follows that $y=1\space when \space \theta^Tx\geq0 \space$ and $\space y=0 \space when \space \theta^Tx<0$. Geometrically, any data point above the decision boundary is predicted to be class 1, and any point below it class 0.

Therefore, the **decision boundary** is defined by the line $\theta^Tx=0 \Leftrightarrow -25.16120038+0.20623067x_1+0.20147049x_2=0$ which can be plotted in the same space as the training dataset.
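Solving $\theta^Tx=0$ for $x_2$ gives the line computed in the next cell:
$$x_2 = -\frac{\theta_1}{\theta_2}x_1 - \frac{\theta_0}{\theta_2}$$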

In [10]:

```
x1 = np.linspace(start = admit_data["Exam 1 Score"].min(),
                 stop = admit_data["Exam 1 Score"].max(), num = 100)
x0 = np.ones(100)
x2 = -theta_optimized[1]/theta_optimized[2] * x1 - (theta_optimized[0]/theta_optimized[2] * x0)
```

In [11]:

```
decisionboundary = go.Scatter(x = x1, y = x2, mode="lines",
                              name = "Decision Boundary")
plotly.offline.iplot({"data": [admits, notadmits, decisionboundary],
                      "layout": go.Layout(title="Training data with decision boundary",
                                          xaxis = dict(title="Exam 1 score"),
                                          yaxis = dict(title="Exam 2 score"))})
```

We test the predictions that our logistic regression model gives on:

- a particular case of a student with Exam 1 score = 45 and Exam 2 score = 85
- the training dataset

In [12]:

```
p = sigmoid(z = (np.array([1,45,85]) @ theta_optimized))
print("The admission probability of a student with Exam 1 score = 45 and Exam 2 score = 85 is {}.".format(p))
```

In [13]:

```
hypothesis_out = sigmoid(X@theta_optimized)
# By definition, whenever the hypothesis is greater than or equal to 0.5,
# the predicted class should be set to 1, else 0.
predicted_training = np.where(hypothesis_out >= 0.5, 1, 0)
acc = np.mean(predicted_training == admit_data["Admission Result"])*100
print("The accuracy of the logistic regression classifier as tested on the training dataset is {} percent.".format(acc))
```

In this second part, we develop a regularized logistic regression model that predicts whether microchips from a fabrication plant pass quality assurance.

In [14]:

```
chip_data = pd.read_csv("./ex2data2.txt", header = None,
                        names = ["Microchip Test 1", "Microchip Test 2",
                                 "Result"])
chip_data.head()
```

Out[14]:

In [15]:

```
accept = chip_data[chip_data["Result"] == 1]
reject = chip_data[chip_data["Result"] == 0]
```

We see that it is impossible to separate the two classes of accepted and rejected microchips without a non-linear decision boundary. We need to build our classifier with **polynomial terms** generated from the given features in order to obtain a non-linear boundary that can separate our data.

In [16]:

```
accepted = go.Scatter(x = accept["Microchip Test 1"],
                      y = accept["Microchip Test 2"],
                      mode = "markers", name = "Accepted")
rejected = go.Scatter(x = reject["Microchip Test 1"],
                      y = reject["Microchip Test 2"],
                      mode = "markers", name = "Rejected")
plotly.offline.iplot({"data": [accepted, rejected],
                      "layout": go.Layout(title = "Scatter plot of training data",
                                          xaxis = dict(title="Microchip Test 1"),
                                          yaxis = dict(title="Microchip Test 2"))})
```

Polynomial terms are generated iteratively, up to the specified degree (6 in this case), from vectors of the two features, $x_1, x_2$, that we have in our training dataset.

TODO: Explore more efficient way...perhaps by using dynamic programming.

In [17]:

```
def map_features(x1, x2, deg):
    # Initialize array with first column of all ones, i.e. the 0th degree term
    featuremap = np.ones((len(x1), 1))
    # Begin iteratively generating polynomial features from
    # the 2nd column onwards, degree 1 terms and above
    for d in range(1, deg+1): # add 1 to deg so the upper bound matches the degree value
        for j in range(0, d+1):
            newfeature = x1**(d-j) * x2**j
            featuremap = np.append(featuremap, newfeature.reshape(len(x1), 1), axis = 1)
    return featuremap
```
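One easy efficiency win (my own sketch, not the dynamic-programming idea in the TODO above): the repeated `np.append` calls copy the whole array on every iteration, so collecting the columns in a list and stacking once avoids that. The column order matches `map_features()` above.

```python
import numpy as np

def map_features_stacked(x1, x2, deg):
    # One column per term x1^(d-j) * x2^j for d = 0..deg, j = 0..d;
    # d = 0, j = 0 yields the column of ones
    cols = [x1**(d - j) * x2**j for d in range(deg + 1) for j in range(d + 1)]
    return np.column_stack(cols)

out = map_features_stacked(np.array([2.0]), np.array([3.0]), 2)
print(out)  # [[1. 2. 3. 4. 6. 9.]]
```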

**Unregularized cost function** was of the form $$J(\theta) = \frac{1}{m} \sum_{i=1}^m \left[-y^{(i)} \log( h_\theta(x^{(i)}) ) - (1 - y^{(i)}) \log( 1 - h_\theta(x^{(i)}))\right]$$
and now we add a regularization term at the end to get
$$J(\theta) = \frac{1}{m} \sum_{i=1}^m \left[-y^{(i)} \log( h_\theta(x^{(i)}) ) - (1 - y^{(i)}) \log( 1 - h_\theta(x^{(i)}) ) \right]+\frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2$$ where $m$ is the number of rows in the training data and $n$ the number of features.

**Implementation Note:** Do not include the bias term $\theta_0$ when computing the regularized cost. So we index the parameter vector `theta` starting from parameter `1`, which is $\theta_1$, all the way to the last parameter.

In [18]:

```
def compute_cost_regularized(theta, X, y, lda):
    reg = lda/(2*len(y)) * np.sum(theta[1:]**2)
    return 1/len(y) * np.sum(-y @ np.log(sigmoid(X@theta)) - (1-y) @ np.log(1-sigmoid(X@theta))) + reg
```

**Unregularized gradient function** was
$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}_j $$
and with the regularization term added to the cost function above, the gradient becomes
$$\frac{\partial J(\theta)}{\partial \theta_j} = \left(\frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}_j \right) + \frac{\lambda}{m}\theta_j$$ for $j\geq1$.

With the gradient function already vectorized from the earlier unregularized implementation, adding the regularization in a vectorized manner is also easy. We form the regularization term as a vector, `regterm`, in which every parameter except the first, i.e. $\theta_1 \dots \theta_n$, is multiplied by the regularization parameter. In the implementation below, the first parameter, $\theta_0$, is left unregularized by setting the value at that position in the array to $0$ before `regterm` is added to the partial derivative term.

In [19]:

```
def compute_gradient_regularized(theta, X, y, lda):
    gradient = np.zeros(len(theta))
    XT = X.T
    beta = sigmoid(X@theta) - y
    regterm = lda/len(y) * theta
    # theta_0 does not get regularized, so a 0 is substituted in its place
    regterm[0] = 0
    gradient = 1/len(y) * XT@beta + regterm
    return gradient
```
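A finite-difference check (my own sanity test on hypothetical toy data, not part of the exercise) can confirm that the vectorized gradient matches a numerical differentiation of the regularized cost. The functions are redefined so the snippet stands alone:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def compute_cost_regularized(theta, X, y, lda):
    reg = lda/(2*len(y)) * np.sum(theta[1:]**2)
    return 1/len(y) * (-y @ np.log(sigmoid(X@theta)) - (1-y) @ np.log(1-sigmoid(X@theta))) + reg

def compute_gradient_regularized(theta, X, y, lda):
    regterm = lda/len(y) * theta
    regterm[0] = 0  # theta_0 is not regularized
    return 1/len(y) * X.T @ (sigmoid(X@theta) - y) + regterm

# Central-difference approximation of the gradient on random toy data
rng = np.random.RandomState(0)
X_toy = np.column_stack([np.ones(10), rng.randn(10, 2)])
y_toy = rng.randint(0, 2, 10)
theta = rng.randn(3)
lda, eps = 1.0, 1e-6
analytic = compute_gradient_regularized(theta.copy(), X_toy, y_toy, lda)
numeric = np.zeros(3)
for j in range(3):
    step = np.zeros(3)
    step[j] = eps
    numeric[j] = (compute_cost_regularized(theta + step, X_toy, y_toy, lda)
                  - compute_cost_regularized(theta - step, X_toy, y_toy, lda)) / (2*eps)
print(np.max(np.abs(analytic - numeric)))  # maximum discrepancy, should be near zero
```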

Initialize the input variables. The design matrix $X$ has 28 columns due to the polynomial features.

In [20]:

```
X = map_features(np.array(chip_data["Microchip Test 1"]), np.array(chip_data["Microchip Test 2"]),6)
theta_init = np.zeros(X.shape[1])
y = np.asarray(chip_data["Result"])
lda = 1 # Regularization parameter, lambda, value
X[0:2]
```

Out[20]:

Compute some test values...

In [21]:

```
print(compute_cost_regularized(np.ones(X.shape[1]), X,y, lda))
print(compute_gradient_regularized(np.ones(X.shape[1]), X,y,lda)[0:5])
```

In [22]:

```
results = minimize(compute_cost_regularized, theta_init, args = (X,y,lda),
                   method = 'CG', jac = compute_gradient_regularized,
                   options = {"maxiter": 400, "disp" : True})
# optimized parameters are accessible through the x attribute
theta_optimized = results.x
print("Vector of learnt parameters:", theta_optimized)
```

The accuracy of the prediction results from the training is $83.1\%$, which closely matches the expected accuracy at $\lambda=1$!

In [23]:

```
print(np.mean(y == np.where(sigmoid(X@theta_optimized)>=0.5,1,0))*100,"%")
```

The decision boundary is visualized as a contour plot. The classifier's prediction is computed for each point in an evenly spaced grid of feature values $x_1, x_2$ and stored in a 2D array, which is then used to generate the plot.

When I tried to vectorize the grid computation by constructing a 2D grid for each feature using `featuremap = map_features(x1,x2,6)` and `np.meshgrid(*[featuremap[i] for i in range(0, featuremap.shape[1])])`, the memory requirement exploded with the size of the two arrays `x1` and `x2`. When these arrays held just 2 elements, the calculation proceeded but ate up a lot of memory; any larger and `numpy` refused to execute the computation.

Generate evenly spaced feature values.

In [24]:

```
x1 = np.linspace(start = chip_data["Microchip Test 1"].min(),
                 stop = chip_data["Microchip Test 1"].max(), num = 100)
x2 = np.linspace(start = chip_data["Microchip Test 2"].min(),
                 stop = chip_data["Microchip Test 2"].max(), num = 100)
```

Define a simpler version of the feature mapping function that accepts two scalars of feature values and returns a vector of polynomial features.

In [25]:

```
# Returns a vector of new polynomial features given two scalars
def map_features_from_scalar(x1, x2, deg):
    featvec = np.ones(1)
    for d in range(1, deg+1): # add 1 to deg so the upper bound matches the degree value
        for j in range(0, d+1):
            newfeature = x1**(d-j) * x2**j
            featvec = np.append(featvec, newfeature)
    return featvec
```

As noted above, I initially attempted to vectorize the grid generation for all 28 features using `np.meshgrid()` but found that this was not possible because it would require far too much memory.

Eventually I realized there is no escape from using a double loop to generate the grid, computing all the features at each point. Vectorizing this fully would require more memory than my puny machine can handle; it is a trade-off between saving computation time and saving memory.
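For what it's worth, there is a middle ground (my own sketch, not the approach used below): meshgrid only the two raw features, flatten them, and feed the flattened vectors through the feature mapper. That yields a 10000×28 matrix, roughly 2 MB of floats, instead of a meshgrid over all 28 feature columns. The snippet redefines a stacked variant of `map_features` and uses a stand-in parameter vector so it runs on its own:

```python
import numpy as np

def map_features(x1, x2, deg):
    # One column per term x1^(d-j) * x2^j for d = 0..deg, j = 0..d
    cols = [x1**(d - j) * x2**j for d in range(deg + 1) for j in range(d + 1)]
    return np.column_stack(cols)

x1 = np.linspace(-1, 1, 100)
x2 = np.linspace(-1, 1, 100)
# Mesh only the two raw features, then map the flattened grid to 28 columns
G1, G2 = np.meshgrid(x1, x2, indexing="ij")
features = map_features(G1.ravel(), G2.ravel(), 6)   # shape (10000, 28)
theta_demo = np.ones(features.shape[1])              # stand-in for theta_optimized
z = (features @ theta_demo).reshape(len(x1), len(x2))
print(features.shape, z.shape)
```

With `indexing="ij"`, `z[i,j]` corresponds to `x1[i], x2[j]`, matching the double loop below.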

In [26]:

```
z = np.zeros((len(x1), len(x1)))
for i in range(0, len(x1)):
    for j in range(0, len(x2)):
        # Multiply the vector of polynomial features for each pair of x1 and x2
        # feature values with the vector of optimized parameters
        # in order to obtain theta^T x at that grid point
        z[i,j] = map_features_from_scalar(x1[i], x2[j], 6) @ theta_optimized
# Convert z to predictions
z = np.where(sigmoid(z) >= 0.5, 1, 0)
```

In [27]:

```
contmap = go.Contour(z = z, x = x1, y = x2, showlegend = True,
                     showscale = True, ncontours = 1,
                     text = np.asarray(list(map(lambda x: "Accepted" if x == 1 else "Rejected",
                                                z.flatten()))).reshape(len(x1), len(x1)),
                     hoverinfo = "text",
                     colorbar = dict(ticktext = ["Rejected", "Accepted"],
                                     tickmode = "array",
                                     title = "Prediction"),
                     contours = dict(start=0, end=1))
plotly.offline.iplot({
    "data": [contmap],
    "layout": go.Layout(title = "Decision Boundary as Contour Plot",
                        xaxis = dict(title="Microchip Test 1"),
                        yaxis = dict(title="Microchip Test 2"))
})
```

In this part, we explore different regularization parameters for the dataset **to understand how regularization prevents overfitting**. Before we begin, it will be helpful to wrap up our above code into functions that can be called with different regularization parameters multiple times easily.

In [28]:

```
def compute_optimal_theta(X, y, lda):
    results = minimize(compute_cost_regularized, theta_init, args = (X,y,lda),
                       method = 'CG', jac = compute_gradient_regularized,
                       options = {"maxiter": 400})
    theta_optimized = results.x
    return theta_optimized
```

In [29]:

```
def generate_classification_grid(optimal_theta):
    x1 = np.linspace(start = chip_data["Microchip Test 1"].min(),
                     stop = chip_data["Microchip Test 1"].max(), num = 200)
    x2 = np.linspace(start = chip_data["Microchip Test 2"].min(),
                     stop = chip_data["Microchip Test 2"].max(), num = 200)
    z = np.zeros((len(x1), len(x1)))
    for i in range(0, len(x1)):
        for j in range(0, len(x2)):
            z[i,j] = map_features_from_scalar(x1[i], x2[j], 6) @ optimal_theta
    z = np.where(sigmoid(z) >= 0.5, 1, 0)
    return x1, x2, z
```

In [30]:

```
def generate_decision_boundary_for_given_lambda(X, y, lda, _title):
    optimal_theta = compute_optimal_theta(X, y, lda)
    a, b, c = generate_classification_grid(optimal_theta)
    # We only show lines on the contour plot in order to display the decision boundary clearly
    contmap = go.Contour(z = c, x = a, y = b, showlegend = True,
                         name = "Prediction",
                         text = np.asarray(list(map(lambda x: "Accepted" if x == 1 else "Rejected",
                                                    c.flatten()))).reshape(len(a), len(a)),
                         hoverinfo = "name+x+y+text",
                         ncontours = 2,
                         line = dict(smoothing=1.3, width = 2),
                         contours = dict(coloring="lines"),
                         showscale = False)
    plotly.offline.iplot({
        "data": [accepted, rejected, contmap],
        "layout": go.Layout(title = _title,
                            xaxis = dict(title="Microchip Test 1"),
                            yaxis = dict(title="Microchip Test 2"))
    })
```

Let us consider the decision boundaries when $\lambda = 0.2$ and $\lambda = 1$. Click on the "Show closest data on hover" button in the top-right corner of the plot for less clutter while hovering.

In [31]:

```
generate_decision_boundary_for_given_lambda(X,y,0.2,"Good Regularization (When lambda = 0.2)")
```

In [32]:

```
generate_decision_boundary_for_given_lambda(X,y,1,"Good Regularization (When lambda = 1)")
```