whoami

stu
Machine Learning Engineer @Opendoor
@mstewart141

# Goal: Puzzle through the gradient descent algorithm towards a working prototype for linear and logistic regression.

Per Wikipedia:

Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point.

## Ok, but what is a gradient?

The gradient stores all the partial derivative information of a multivariable function.

The gradient is a vector-valued function: a vector of partial derivatives.

source: Wikipedia
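
For example, a quick sympy sketch (the two-variable $f$ here is just made up for illustration): the gradient of $f(x, y) = x^2 + 3y^2$ packages both partial derivatives into one vector.

In [ ]:
from sympy import symbols, diff

x, y = symbols('x y')
f = x**2 + 3*y**2

[diff(f, x), diff(f, y)]  # the gradient of f: one partial derivative per variable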

In [2]:
%matplotlib inline

import numpy as np
import sympy

from numpy.linalg import inv
from scipy.special import expit
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.datasets import make_classification
from sympy import diff
from sympy.solvers import solve
from sympy.plotting import plot
from toolz import compose, pipe

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

RS = 47
np.random.seed(RS)


## Let's look at an example

In [3]:
e = 'x*x'
plot(e);

In [4]:
diff(e)

Out[4]:
2*x
In [5]:
gradient = np.array(['2*x'])

In [6]:
solve(e)

Out[6]:
[0]

### You told us to "step proportional to the negative of the gradient"?

What if x > 0? What if x < 0? Either way the gradient 2x points away from the minimum, so stepping against it moves us toward x = 0.
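
A tiny numeric sketch of those steps on $f(x) = x^2$, starting from either side of the minimum (`descend_1d` is just an illustrative helper):

In [ ]:
def descend_1d(x, learning_rate=0.1, iterations=50):
    """Step proportional to the negative gradient of f(x) = x**2."""
    for _ in range(iterations):
        x = x - learning_rate * 2 * x  # the gradient of x**2 is 2x
    return x

descend_1d(3.0), descend_1d(-3.0)  # both starting points end up near 0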

# Now, we'll do gradient descent live

## Starting with linear regression

In [7]:
X, Y = make_classification(n_samples=1000, n_features=3, n_informative=3,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           random_state=RS)


#### We need to introduce a bias term (a column of ones, so the first coefficient acts as the intercept):

In [8]:
bias_column = np.ones(X.shape[0])

X = np.c_[bias_column, X]
Y = Y.ravel()


# $$\beta_{n \times 1} = (X^{T}X)^{-1}_{n \times n}\; X^{T}_{n \times m}\; Y_{m \times 1}$$

#### This gives us a way to compute our beta vector:

In [9]:
betas_normal_eq = inv(X.T @ X) @ X.T @ Y


## Sanity check: Scikit-learn

In [11]:
linreg = LinearRegression(fit_intercept=False)
linreg.fit(X,Y);

In [12]:
betas_sklearn = linreg.coef_

In [13]:
betas_normal_eq
betas_sklearn

Out[13]:
array([ 0.45801758,  0.27764882, -0.10227603, -0.04232889])
Out[13]:
array([ 0.45801758,  0.27764882, -0.10227603, -0.04232889])

# $$F(z) = \dfrac{1}{1+e^{-z}}$$

### These two functions (identity and sigmoid), let's write 'em up

In [14]:
identity = lambda z: z  # for linear regression, the hypothesis is just X @ betas

sigmoid = expit  # scipy's implementation of the logistic function

def sigmoid(z):  # ...or roll our own
    return 1 / (1 + np.exp(-z))
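
As a quick sanity check (a sketch, nothing fancy), the hand-rolled version should agree with scipy's expit:

In [ ]:
z = np.linspace(-5, 5, 11)
np.allclose(sigmoid(z), expit(z))  # expected: True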


### Why do these $F(z)$ equations matter? What is our hypothesis?

#### Recall the normal equation?

In [16]:
hypothesis = X @ betas_sklearn
hypothesis_shape = hypothesis.shape

hypothesis_shape

Out[16]:
(1000,)
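
This is our hypothesis: $\hat{Y} = F(X\beta)$. With the identity, $F(X\beta) = X\beta$ is an ordinary linear prediction; with the sigmoid, the output is squashed into $(0, 1)$ and can be read as the probability of class 1.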

Gradient descent is a first-order iterative optimization algorithm

We must define an update step that moves us closer to the solution each iteration.
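
Concretely, with learning rate $\alpha$ and $m$ rows in $X$, the update the next cell implements is:

$$\beta \leftarrow \beta - \alpha \cdot \frac{1}{m} X^{T}\big(F(X\beta) - Y\big)$$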

In [40]:
def gradient_descent(X=X, Y=Y, kind='linear', learning_rate=0.01, iterations=int(1e5)):
    betas = np.zeros(X.shape[1])
    assert kind in {'linear', 'logistic'}, f"Whoops! Don't support kind: {kind}."
    fn = identity if kind == 'linear' else sigmoid

    def update_step(betas):
        hypothesis = fn(X @ betas)                # F(X @ betas): identity or sigmoid
        loss = hypothesis - Y                     # residuals
        gradient = X.T @ loss * (1 / X.shape[0])  # average gradient over the m rows
        return betas - learning_rate * gradient   # step against the gradient

    return pipe(betas, *([update_step] * iterations))  # thread betas through `iterations` update steps

In [40]:
# (tweetable version)
def gradient_descent(X=X, Y=Y, kind='linear', lr=0.01, n=int(1e5)):
    return pipe(np.zeros(X.shape[1]), *([
        lambda betas: betas - lr *
            X.T @ (((identity if kind == 'linear' else sigmoid)(X @ betas)) - Y) * (1 / X.shape[0])
    ] * n))


### Does it work for linear regression?

In [41]:
betas_gd = gradient_descent()

In [42]:
betas_gd
betas_normal_eq
betas_sklearn

Out[42]:
array([ 0.45801758,  0.27764882, -0.10227603, -0.04232889])
Out[42]:
array([ 0.45801758,  0.27764882, -0.10227603, -0.04232889])
Out[42]:
array([ 0.45801758,  0.27764882, -0.10227603, -0.04232889])
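
### Does it work for logistic regression?

In [ ]:
betas_gdl = gradient_descent(kind='logistic')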

### What saith Scikit?

In [25]:
logr = LogisticRegression(fit_intercept=False, C=1e18)
logr.fit(X,Y);

In [26]:
betas_logr = logr.coef_[0]

In [30]:
np.round(betas_logr, 2)
np.round(betas_gdl, 2)

Out[30]:
array([-0.13,  3.08, -1.04, -0.41])
Out[30]:
array([-0.13,  3.08, -1.04, -0.41])

### Bold claim: we're 90% of the way to a fully formed sklearn estimator!

In [32]:
class GradientDescentClassifierRegressor:
    def __init__(self, kind='linear', learning_rate=0.01, iterations=int(1e5)):
        self.kind = kind
        self.learning_rate = learning_rate
        self.iterations = iterations

    def fit(self, X, Y):
        self.coef_ = gradient_descent(
            X,
            Y,
            self.kind,
            self.learning_rate,
            self.iterations,
        )
        return self

    def predict(self, X):
        try:
            getattr(self, 'coef_')  # illustrative: raises AttributeError if we haven't fit yet
            fn = identity if self.kind == 'linear' else lambda arr: np.round(sigmoid(arr)).astype(int)
            return fn(X @ self.coef_)
        except AttributeError:
            raise RuntimeError('Must fit first!!!')

In [35]:
gdcr = GradientDescentClassifierRegressor(kind='linear')
gdcr.fit(X,Y);

In [36]:
gdcr.predict(X[:10])
linreg.predict(X[:10])

Out[36]:
array([ 0.61911337,  0.3343834 ,  0.49442232,  0.5050518 , -0.20501244,
0.00771493, -0.06387987,  1.02075149,  0.51484657,  0.77049524])
Out[36]:
array([ 0.61911337,  0.3343834 ,  0.49442232,  0.5050518 , -0.20501244,
0.00771493, -0.06387987,  1.02075149,  0.51484657,  0.77049524])
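
The classifier half should work the same way; a quick sketch (`gdcr_clf` is just an illustrative name) comparing it against the scikit-learn logistic fit:

In [ ]:
gdcr_clf = GradientDescentClassifierRegressor(kind='logistic')
gdcr_clf.fit(X, Y);

gdcr_clf.predict(X[:10])
logr.predict(X[:10])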

Questions?