## Week 2+3. Regularization, Linear Classification, towards Logistic regression¶

Augustin Cosse

Material covered: Ridge and LASSO regression, linear classification through Multiple Discriminant, OLS and Normal Equations, one-vs-rest and one-vs-one classifiers. Use of the meshgrid function to display the classification boundaries.

### Example 1. Binary discriminant through OLS and Normal Equations¶

In this first demo, we will solve the classification problem as follows.

• We first assign 0/1 labels to each of the points (recall that we are in the classification framework so the target now take a finite set of values). That is we define our targets as +1 and 0 depending on whether our points are from the yellow or purple class. This gives a vector $$\mathbf{t} = [t^{(1)}, t^{(2)}, \ldots, t^{(N)}]$$

if we have $n$ points

• We then store the sample points in a matrix $\mathbf{X}$ as we did for regression. In this case the points are 2D, we thus have
$$\mathbf{X} = \left[\begin{array} (\mathbf{x}^{(1)})^T\\ \mathbf{x}^{(2)})^T\\ \vdots\\ (\mathbf{x}^{(N)})^T \end{array}\right]$$

where each $\mathbf{x}^{(i)} = \left[\begin{array}{c} x^{(i)}_1\\ x^{(i)}_2 \end{array}\right]$ now encodes the two coordinates of the corresponding point in the dataset below. We want to learn a model of the form

$y(\mathbf{x}) = \beta_0 + \beta_1X_1 + \beta_2X_2$

that outputs a prediction $y(\mathbf{x}^{(i)})$ that is as close as possible to the target of the point $t^{(i)}$. We will encode this model by adding an additional column of $1$'s to the matrix $\mathbf{X}$ above to get

$$\mathbf{X} = \left[\begin{array}{cc} 1 & (\mathbf{x}^{(1)})^T\\ 1& \mathbf{x}^{(2)})^T\\ \vdots & \vdots\\ 1& (\mathbf{x}^{(N)})^T \end{array}\right]$$

we can then write the model as $\mathbf{y} = \mathbf{X}\mathbf{\beta}$ and we want $\mathbf{y}$ as close as possible to $\mathbf{t}$ (given that what we can achieve is limited by the linearity of the model)

A natural approach, given what we learned so far, is thus to minimize the OLS criterion,

$$\min_{\beta_0, \beta_1, \beta_2} \sum_{i=1}^N \frac{1}{N}|t^{(i)} - (\beta_0 + \beta_1X^{(i)}_1 + \beta_2X^{(i)}_2)|^2$$

As we saw in regression, this model can read in matrix form as

$$\mathbf{v} = \mathbf{X}\mathbf{\beta} - \mathbf{t}$$

and then

$$\min_{\mathbf{\beta}}\frac{1}{N}\mathbf{v}^T\mathbf{v} = \min_{\mathbf{\beta}} \left(\mathbf{X}\mathbf{\beta} - \mathbf{t}\right)^T(\mathbf{X}\mathbf{\beta} - \mathbf{t})$$

Instead of using a gradient descent approach, we could alternatively set the derivative of the loss with respect to the weights $\mathbf{\beta}$ to zero and solve the equations. In this case (you can verify it by computing the derivatives with respect to each of the $\beta_j$ and setting them to zero), this gives the set of equations

$$\mathbf{X}^T\left(\mathbf{X}\mathbf{\beta} - \mathbf{t}\right) = 0 \leftrightarrow \mathbf{X}^T\mathbf{X}\mathbf{\beta} = \mathbf{X}^T\mathbf{t}$$

which can solve by using the inverse of $\mathbf{X}^T\mathbf{X}$ as $\mathbf{\beta} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{t}$

In :
# We start by generating data using the make_blobs function from scikit learn.
# We first consider a two classes dataset. We then learn the discriminant by solving the normal equations

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

X, t = make_blobs(n_samples=100,n_features=2,centers=[[0,0], [1,1]],cluster_std = [.2, .2])

plt.scatter(X[:,0], X[:,1], c = t)
plt.show()

num_samples = X.shape

# 1) Learn the discriminant through the normal equations

sampleMat =  np.hstack((np.ones((num_samples,1)), X))

RHS = np.dot(sampleMat.T,t)
normalMat = np.dot(sampleMat.T, sampleMat)

# beta = (XT*X)^{-1}*(X^t t)
# the line below is just solving the system nomalMat*beta = RHS

beta = np.linalg.solve(normalMat, RHS)

print(beta) [0.05629461 0.48213025 0.43630057]

In :
# To represent our model, Now that we have learned the beta, we can divide the space into a grid of equispaced points.
# Concretely, we first compute the minimum and maximum X1 and X2. Once we have those, we then generate all the
# points between X1_{min} and X1_{max} on the one hand, as well as between X2_{min} and X2_{max} on the other. Once we
# have those two vectors, let us call them v1 and v2. Those two vectors contain the X1 coordinates and X2 coordinates
# of the grid points.

x1min = np.amin(X[:,0])
x1max = np.amax(X[:,0])

x2min = np.amin(X[:,1])
x2max = np.amax(X[:,1])

x1mesh = np.linspace(x1min, x1max, 100)
x2mesh = np.linspace(x2min, x2max, 100)

# To compute the prediction, we however need to generate all the pairs $(X1, X2)$ for all the points
# on the grid. This is achieved through the meshgrid function. This function takes as first and second arguments
# the vectors of X1 and X2 coordinates respectively and returns the X1 and x2 coordinates of all the points on the grid.
# As an example if we want a grid with X1 = (1,2,3) and X2 = (1,2,3), meshgrid(X1, X2) will return the matrices
# (1 2 3; 1 2 3; 1 2 3) and (1 1 1; 2 2 2; 3 3 3).

x1MeshMat, x2MeshMat = np.meshgrid(x1mesh.reshape(-1,1), x2mesh.reshape(-1,1))

# Meshgrid thus gives us the X1 and X2 coordinates
# of all the points on the grid (nopt just the border). Those coordinates are returned as matrices, so to
# build the X matrix to compute our prediction, we need to flatten the X1 and X2 meshgrid matrices and then
# add a column of 1 on top of those coordinates

x1MeshMat = x1MeshMat.flatten()
x2MeshMat = x2MeshMat.flatten()

X12 = np.vstack((x1MeshMat, x2MeshMat)).T

totalNumPointsGrid = len(x1MeshMat)

# generating the additional vector of ones
addon = np.ones((totalNumPointsGrid, 1))

# stacking it to the matrix of prototypes coordinates

pointsGrid = np.hstack((addon.reshape(-1,1), X12))

# Now that we have computed the X matrix, and since we have previously obtained the weights of our discriminant,
# we can now get the predicted labels for the points over the whole space.

prediction = np.dot(pointsGrid, beta)

# reorganize my targets (according to original values given). Since we chose the targets to be 0 or 1. We classify our
# points depending on whether the output of the model gives a value larger or smaller than 1/2

prediction[prediction<.5] = 0
prediction[prediction>.5] = 1

print(prediction)

# To display the class of each point, we will use the function contourf. This function takes as first
# argument a matrix of X1 coordinates (coordinates of all points on the grid ) and a matrix of X2 coordinates.
# The reverse of meshgrid if you will.

coordinatesX1 = np.reshape(x1MeshMat, (100,100))
coordinatesX2 = np.reshape(x2MeshMat, (100,100))
reshaped_prediction = np.reshape(prediction, (100,100))

fig, ax = plt.subplots(constrained_layout=True)
plt.scatter(X[:,0], X[:,1], c = t)
ax.contourf(coordinatesX1, coordinatesX2, reshaped_prediction, alpha=0.1)
plt.show()

[0. 0. 0. ... 1. 1. 1.] ### Example 2. From the Binary classifier to multiclass discriminant.¶

Material covered : one vs rest, one vs one, multiclass discriminant, one hot encoding

In :
print(np.shape(coordinates_grid))

print(np.shape(np.ones((numSamplePoints, 1))))

(10000, 2)
(100, 1)

In :
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

X, t = make_blobs(n_samples=100,n_features=2,centers=[[-1,0], [1,0], [.5, 1]],cluster_std = [.2, .2, .2])

plt.scatter(X[:,0], X[:,1], c = t)
plt.show()

# we first consider a 3 classes problem for which we code the one vs rest and one vs one classifiers

K = 3
dim = 2
numSamplePoints = X.shape

# 1) one vs rest

target = np.zeros((numSamplePoints,K))
Xtilde = np.hstack((np.ones((numSamplePoints, 1)), X))
beta = np.zeros((dim+1, K))

for i in np.arange(0, K):

target[np.where(t==i), i] = 1
beta[:,i] = np.dot(np.linalg.inv(np.dot(Xtilde.T, Xtilde)), np.dot(Xtilde.T, target[:,i]))

# plotting and generating the grid
x1min = np.amin(X[:,0])
x1max = np.amax(X[:,0])
x2min = np.amin(X[:,1])
x2max = np.amax(X[:,1])

x1mesh = np.linspace(x1min, x1max, 100)
x2mesh = np.linspace(x2min, x2max, 100)

X1meshMat, X2meshMat = np.meshgrid(x1mesh, x2mesh)

coordinates_grid  = np.vstack((X1meshMat.flatten(), X2meshMat.flatten())).T
coordinates_tilde = np.hstack((np.ones((np.shape(coordinates_grid), 1)),coordinates_grid ))

# getting the predictions

out_classifier = np.zeros((np.shape(coordinates_grid),K))
indClassification = np.zeros((np.shape(coordinates_grid), K))

target_out_total = np.zeros((np.shape(coordinates_grid), 1))

for i in np.arange(0, K):

out_classifier[:,i] = np.dot(coordinates_tilde, beta[:,i])
indClassification[np.where(out_classifier[:,i]>0.5),i] = 1
target_out_total[np.where(out_classifier[:,i]>0.5)] = i

# looking for overlaps

misclassification = np.sum(indClassification, axis=1)

ind_misclassified1 = np.where(misclassification>1)
ind_misclassified2 = np.where(misclassification==0)

target_out_total[ind_misclassified1] = K
target_out_total[ind_misclassified2] = K

print(ind_misclassified) (array(),)

In :

# plotting the resulting classifier

fig, ax = plt.subplots(constrained_layout=True)
plt.scatter(X[:,0], X[:,1], c = t)
ax.contourf(X1meshMat, X2meshMat, target_out_total.reshape((100,100)), alpha=0.2)
plt.show()

# The representation below also shows that the classifier cannot properly classify some regions in
# the (x_1, x_2) space (show in yellow). For those regions, points are either classified as belonging to
# multiple classes (such as in the center) or they are do not get any label at all (such as the triangles on the border) In :
# I now code the one vs one classifier. We still consider the three classes dataset from above

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

X, t = make_blobs(n_samples=100,n_features=2,centers=[[-1,0], [1,0], [.5, 1]],cluster_std = [.2, .2, .2])
plt.scatter(X[:,0], X[:,1], c = t)
plt.show()

# we first consider a 3 classes problem for which we code the one vs rest and one vs one classifiers

K = 3
dim = 2
numSamplePoints = X.shape

Xtilde = np.hstack((np.ones((numSamplePoints, 1)),X))

# we will sore our beta vectors in a 3D array

beta_tensor = np.zeros((K,K, 3))

# generating the grid (we will classify the points from the grid in real time)
x1min = np.amin(X[:,0])
x1max = np.amax(X[:,0])
x2min = np.amin(X[:,1])
x2max = np.amax(X[:,1])

x1mesh = np.linspace(x1min, x1max, 100)
x2mesh = np.linspace(x2min, x2max, 100)

X1meshMat, X2meshMat = np.meshgrid(x1mesh, x2mesh)

coordinates_grid  = np.vstack((X1meshMat.flatten(), X2meshMat.flatten())).T
coordinates_tilde = np.hstack((np.ones((np.shape(coordinates_grid), 1)),coordinates_grid ))

# define vector counting the votes for each point

count = np.zeros((np.shape(coordinates_grid), K))

for i in np.arange(0, K):

for j in np.arange(0,K):

if i > j:

XtildeLocal = Xtilde[np.squeeze(np.hstack((np.where(t==i), np.where(t==j)))), :]
tlocal = t[np.squeeze(np.hstack((np.where(t==i), np.where(t==j))))]
tlocal2 = np.zeros(np.shape(tlocal))
tlocal2[tlocal == i] = 1 # here we do i vs j
tlocal2[tlocal == j] = 0

beta_ij = np.dot(np.linalg.inv(np.dot(XtildeLocal.T, XtildeLocal)), np.dot(XtildeLocal.T, tlocal2))

beta_tensor[i,j, :] = beta_ij
beta_tensor[j,i, :] = beta_ij

prediction = np.dot(coordinates_tilde,beta_ij)
ind_i = np.where(prediction > 0.5)
ind_j = np.where(prediction < 0.5)
count[ind_i, i] += 1
count[ind_j, j] += 1

# Now that we have all the beta, we compute each of the regions. We will proceed through a majority vote.
# That is we start by getting all the labels for each of the binary classifiers. And we add one each time a point
# is classified in a given class

count_final = np.squeeze(np.zeros((np.shape(coordinates_grid), 1)))

for i in np.arange(np.shape(coordinates_grid)):
count_final[i] = np.argmax(count[i,:])

# add a few lines highlighting the regions where points are classified at random In :
# plotting the resulting classifier

fig, ax = plt.subplots(constrained_layout=True)
plt.scatter(X[:,0], X[:,1], c = t)
ax.contourf(X1meshMat, X2meshMat, count_final.reshape((100,100)), alpha=0.2)
plt.show()

# note that here although the plot does not show it, we can have regions in space that get the same
# number of votes and there is thus also some form of ambiguity. When taking the max of the votes, in case of an
# equality numpy selects an index at random which is why it is not visible on the plot. In :
# To conclude, we consider a K class discriminant function. We still consider the 3 classes example from above
# In this case, the first step is to encode the
# each of the targets through a 1-hot encoding. We have

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

X, t = make_blobs(n_samples=100,n_features=2,centers=[[-1,0], [1,0], [.5, 1]],cluster_std = [.2, .2, .2])
plt.scatter(X[:,0], X[:,1], c = t)
plt.show()

# we first consider a 3 classes problem for which we code the one vs rest and one vs one classifiers

K = 3
dim = 2
numSamplePoints = X.shape

oneHotEncodingT = np.zeros((len(t),K))
oneHotEncodingT[np.arange(0,len(t)), t] = 1

vec_ones = np.ones((numSamplePoints, 1))
# As before, we build the matrix Xtilde by adding a column of 1 to X.

print(np.shape((np.ones((np.shape(X),1)))))
print(np.shape(X))

Xtilde = np.hstack((np.ones((np.shape(X),1)), X))

# we then solve the matrix system Xtilde*BETA = T (here both X and BETA are matrices)

solutionBeta = np.dot(np.linalg.inv(np.dot(Xtilde.T, Xtilde)), np.dot(Xtilde.T, oneHotEncodingT))

# if you print the shape of the matrix you now see that this matrix has size numClasses*3 so that we
# now have 3 weight vectors. If we apply this weight matrix to the meshgrid data, as we did before,

x1min = np.amin(X[:,0])
x1max = np.amax(X[:,0])
x2min = np.amin(X[:,1])
x2max = np.amax(X[:,1])

x1mesh = np.linspace(x1min, x1max, 100)
x2mesh = np.linspace(x2min, x2max, 100)

X1meshMat, X2meshMat = np.meshgrid(x1mesh, x2mesh)

Xmeshgrid = np.vstack((X1meshMat.flatten(), X2meshMat.flatten()))

XmeshgridTilde = np.hstack((np.ones((np.shape(Xmeshgrid.T),1)),Xmeshgrid.T ))

prediction_grid  = np.dot(XmeshgridTilde, solutionBeta)

# From this, we can then define our class to be the index of the row that gives the highest value for t[i,j], for each column j
# the columns corresponding to the dataset points

final_classificationTargets = np.squeeze(np.zeros((np.shape(prediction_grid),1)))

for i in np.arange(0, len(final_classificationTargets)):

final_classificationTargets[i] = np.argmax(prediction_grid[i,:])

# Now we display the prediction using contourf (100, 1)
(100, 2)

In :
# plotting the resulting classifier

fig, ax = plt.subplots(constrained_layout=True)
plt.scatter(X[:,0], X[:,1], c = t)
ax.contourf(X1meshMat, X2meshMat, final_classificationTargets.reshape((100,100)), alpha=0.2)
plt.show() In [ ]:
# we are now guaranteed to predict a single class for each of the points from our dataset
# (except if our classifier outputs twice the same value which is quite unlikely).
# The OLS classifier which we used throughout the various steps above is however not robust to outliers.


### Example 2. The sigmoid function¶

In :
# to resolve the lack of robustness of the OLS classifier vis a vis outliers, we can consider an additional
# non linear function which should return approximately the same value regardless of the distance of a point to
# the separating plane. This means that we would like the function to return a value alpha_0 when the output to the
# OLS classifier is positive and a value alpha1 when the output to the classifier is negative. A further improvement
# can be achieved if we could 'force' those values to be 0 and 1. In this case taking the decision regarding the
# classes becomes a lot easier. As we will see, this is what the sigmoid will approximately give us.

# From the two figures below, you can see that the sigmoid always return a value between 0 and 1 and it
# quickly saturates as we move aways from 0.

x = np.linspace(-10,10, 100)

sigmoid  = np.true_divide(1, 1+np.exp(-x))

plt.plot(x, sigmoid)
plt.show() In :
x = np.linspace(-100,100, 100)

sigmoid  = np.true_divide(1, 1+np.exp(-x))

plt.plot(x, sigmoid)
plt.show() 