The goal of this lab is to explore how chasing function gradients can find the function minimum. If the function is a loss function representing the quality of a model's fit to a training set, we can use function minimization to train models.
When there is no symbolic solution to minimizing the loss function, we need an iterative solution, such as gradient descent.
import numpy as np
import pandas as pd
from mpl_toolkits.mplot3d import Axes3D # required even though not ref'd!
from sklearn.linear_model import LinearRegression, LogisticRegression, Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_boston # removed in scikit-learn 1.2; requires scikit-learn < 1.2
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, precision_score, recall_score, log_loss, mean_absolute_error
import matplotlib.pyplot as plt
import matplotlib as mpl
%config InlineBackend.figure_format = 'retina'
def normalize(X):
    X = X.copy()
    for colname in X.columns:
        u = np.mean(X[colname])
        s = np.std(X[colname])
        if s > 0.0:
            X[colname] = (X[colname] - u) / s
        else:
            X[colname] = X[colname] - u
    return X
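As a quick sanity check (a sketch using toy data, not the Boston set), each nonconstant column should come out of `normalize` with mean 0 and standard deviation 1, while a constant column is merely centered:

```python
import numpy as np
import pandas as pd

# normalize() as defined above, repeated so this sketch is self-contained
def normalize(X):
    X = X.copy()
    for colname in X.columns:
        u = np.mean(X[colname])
        s = np.std(X[colname])
        if s > 0.0:
            X[colname] = (X[colname] - u) / s
        else:
            X[colname] = X[colname] - u
    return X

df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [7.0, 7.0, 7.0]})
dfn = normalize(df)
print(dfn['a'].mean(), dfn['a'].std(ddof=0))  # ~0.0 and ~1.0 (up to float error)
print(dfn['b'].tolist())                      # [0.0, 0.0, 0.0]: centered only
```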
def plot3d(X, y, b0_range, b1_range):
    b0_mesh, b1_mesh = np.meshgrid(b0_range, b1_range, indexing='ij')
    L = np.zeros(b0_mesh.shape)
    for i in range(len(b0_range)):
        for j in range(len(b1_range)):
            L[i,j] = loss([b0_range[i],b1_range[j]], X=X, y=y)
    fig = plt.figure(figsize=(5,4))
    ax = fig.add_subplot(111, projection='3d')
    surface = ax.plot_surface(b0_mesh, b1_mesh, L, alpha=0.7, cmap='coolwarm')
    ax.set_xlabel('$\\beta_0$', fontsize=14)
    ax.set_ylabel('$\\beta_1$', fontsize=14)
Let's define a very simple quadratic in one variable, $y = f(x) = (x-2)^2$ and then use an iterative solution to find the minimum value.
def f(x) : return (x-2)**2
We can hide all of the plotting details in a function, as we will use it multiple times.
def fplot(f, xrange, fstr='', x0=None, xn=None):
    plt.figure(figsize=(3.5,2))
    lx = np.linspace(*xrange, 200)
    fx = [f(x) for x in lx]
    plt.plot(lx, fx, lw=.75)
    if x0 is not None:
        plt.scatter([x0], [f(x0)], c='orange')
    if xn is not None:
        plt.scatter([xn], [f(xn)], c='green')
    plt.xlabel("$x$", fontsize=12)
    plt.ylabel(fstr, fontsize=12)
fplot(f, xrange=(0,4), fstr="$(x-2)^2$")
To minimize a function of $x$, we need the derivative of $f(x)$, which is just a function that gives the slope of the curve at every $x$.
1. Define a function returning the derivative of $f(x)$
You can ask for symbolic derivatives from a variety of online tools, but here's one solution.
def df(x): ...
def df(x): return 2*(x-2)
2. Pick an initial $x$ location and take a single step according to the derivative
Use a learning rate of $\eta = 0.4$. The output should be 1.76. (Also keep in mind that the minimum value is clearly at $x=2$.)
x = .8 # initial x location
x = ...
print(x)
x = x - .4 * df(x); print(x)
Q. How can we symbolically optimize a quadratic function like this with a single minimum?
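One answer, sketched here by hand rather than with a computer algebra system: a quadratic has a single critical point, so set the derivative to zero and solve directly.

```python
# For f(x) = (x-2)**2 the derivative is df/dx = 2*(x-2).
# Setting 2*(x-2) = 0 gives x = 2 in one algebraic step -- no iteration needed.
def f(x): return (x - 2)**2
def df(x): return 2*(x - 2)

x_star = 2.0                  # solved by hand from df(x) = 0
print(df(x_star), f(x_star))  # 0.0 0.0: zero slope and minimum value at x = 2
```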
3. Create a loop that takes five more steps (same learning rate)
The output should look like:
1.952
1.9904
1.99808
1.999616
1.9999232
for i in range(5):
    x = x - 0.4 * df(x)
    print(x)
Notice how fast the iteration moves $x$ to the location where $f(x)$ is minimum!
This iterative minimization approach works for any (smooth) function, assuming we choose a small enough learning rate. For example, let's find one of the minima for $f(x) = x \sin(0.6x)$ in the range [-1,10]. The plot should look something like:
Depending on where we start, minimization will find either the minimum at $x=0$ or the one near $x=8.18$. The location of the lowest function value is called the global minimum and any others are called local minima.
1. Define a function for $x \sin(0.6x)$
def f(x) : ...
def f(x) : return np.sin(0.6*x)*x
fplot(f, xrange=(-1,10), fstr="$x \sin(0.6x)$")
#plt.tight_layout(); plt.savefig("xsinx.png",dpi=150,bbox_inches=0)
2. Define the derivative function: $\frac{df}{dx} = 0.6x \cos(0.6 x) + \sin(0.6 x)$
def df(x): ...
def df(x): return 0.6*x * np.cos(0.6*x) + np.sin(0.6*x)
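It's worth verifying a hand-derived derivative numerically; this sketch compares the analytic `df` against a central finite difference at a few points:

```python
import numpy as np

def f(x): return x * np.sin(0.6 * x)
def df(x): return 0.6 * x * np.cos(0.6 * x) + np.sin(0.6 * x)

h = 1e-6
for x in [0.5, 3.0, 8.0]:
    approx = (f(x + h) - f(x - h)) / (2 * h)   # central difference
    print(f"x={x}: analytic={df(x):.8f}, numeric={approx:.8f}")
# the analytic and numeric columns should agree closely
```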
3. Pick a random initial value, $x_0$, between -1 and 10; display that value
x0 = np.random.rand()*11 - 1 # pick value between -1 and 10
x0
4. Start $x$ at $x_0$ and iterate 12 times using the gradient descent method
Use a learning rate of 0.4.
x = x0
for i in range(12):
    x = x - .4 * df(x); print(f"{x:.10f}")
5. Plot the starting and stopping locations on the curve
fplot(f, xrange=(-1,10), fstr="$x \sin(0.6x)$", x0=x0, xn=x)
6. Rerun the notebook several times to see how the random start location affects where it terminates.
Q. Rather than iterating a fixed number of times, what's a better way to terminate the iteration?
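One common answer: stop when the update (or the gradient) becomes negligibly small. A minimal sketch, reusing the $x \sin(0.6x)$ derivative and a hypothetical tolerance of $10^{-8}$:

```python
import numpy as np

def df(x): return 0.6 * x * np.cos(0.6 * x) + np.sin(0.6 * x)

def minimize_until_converged(df, x0, eta, tol=1e-8, max_iter=1000):
    x = x0
    for _ in range(max_iter):
        step = eta * df(x)
        x = x - step
        if abs(step) < tol:    # terminate when x barely moves
            break
    return x

x = minimize_until_converged(df, x0=9.0, eta=0.4)
print(x)   # near the global minimum around x = 8.18
```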
Let's move back to the simple function $f(x) = (x-2)^2$ and consider different learning rates to see the effect.
def df(x): return 2*(x-2)
Let's codify the minimization process in a handy function:
def minimize(df, x0, eta):
    x = x0
    for i in range(10):
        x = x - eta * df(x)
        print(f"{x:.2f}")
1. Update the gradient descent loop to use a learning rate of 1.0
Notice how the learning rate is so large that iteration oscillates between two (incorrect) solutions. The output should be:
3.20
0.80
3.20
0.80
3.20
0.80
3.20
0.80
3.20
0.80
minimize(df, x0=0.8, eta=...)
2. Update the gradient descent loop to use a learning rate of 2.0
Notice how the solution diverges when the learning rate is too big. The output should be:
5.60
-8.80
34.40
-95.20
293.60
-872.80
2626.40
-7871.20
23621.60
-70856.80
minimize(df, x0=0.8, eta=...)
3. Update the gradient descent loop to use a learning rate of 0.01
Notice how slowly the solution converges when the learning rate is too small. The output should be:
0.82
0.85
0.87
0.89
0.92
0.94
0.96
0.98
1.00
1.02
minimize(df, x0=0.8, eta=...)
Q. How do you choose the learning rate $\eta$?
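There is no single rule; one pragmatic sketch (using a hypothetical helper `final_x`, not part of the lab) is to try a few rates and keep the largest one that still converges:

```python
def df(x): return 2*(x - 2)

def final_x(eta, x0=0.8, steps=20):
    # run gradient descent for a fixed number of steps and return where it ends
    x = x0
    for _ in range(steps):
        x = x - eta * df(x)
    return x

for eta in [2.0, 1.0, 0.4, 0.01]:
    print(f"eta={eta}: x after 20 steps = {final_x(eta):.4f}")
# eta=2.0 diverges, eta=1.0 oscillates between 0.8 and 3.2,
# eta=0.4 converges quickly, eta=0.01 converges but very slowly
```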
Turning to a common toy data set, the Boston housing data set, let's pick the most important single feature and look at the loss function for simple OLS regression.
1. Load the Boston data set into a data frame
boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = boston.target
X.head()
2. Train an OLS linear regression model
lm = LinearRegression()
lm.fit(X, y)
3. Using the rfpimp package, display the feature importances
from rfpimp import *
I = importances(lm, X, y)
plot_importances(I)
4. LSTAT is the most important variable, so train a new model with just X['LSTAT']
Print out the true $\beta_0, \beta_1$ coefficients.
X_ = X['LSTAT'].values.reshape(-1,1) # Extract just one x variable
lm = LinearRegression()
lm.fit(X_, y)
print(f"True OLS coefficients: {np.array([lm.intercept_]+list(lm.coef_))}")
5. Show marginal plot of LSTAT vs price
fig, ax1 = plt.subplots(figsize=(5,2.0))
ax1.scatter(X_, y, s=15, alpha=.5)
lx = np.linspace(np.min(X_), np.max(X_), num=len(X))
ax1.plot(lx, lm.predict(lx.reshape(-1,1)), c='orange')
ax1.set_xlabel("LSTAT", fontsize=10)
ax1.set_ylabel("price", fontsize=10)
plt.show()
6. Define an MSE loss function for single variable regression
$$ \frac{1}{n} \sum_{i=1}^n \left(y^{(i)} - (\beta_0 + \beta_1 x^{(i)})\right)^2 $$
def loss(B,X,y): # B=[beta0, beta1]
    y_pred = ...
    return np.mean(...)
def loss(B,X,y): # B=[beta0, beta1]
    y_pred = B[0] + X*B[1]
    return np.mean((y - y_pred)**2)
7. Check the loss function value at the true OLS coordinates
loss(np.array([34.55384088, -0.95004935]), X_, y) # demo loss function at minimum
8. Plot the loss function in 3D in the region around the true $\beta$s
When you enter the correct loss function above, the plot should look something like:
b0_range = np.linspace(-50, 120, 70)
b1_range = np.linspace(-6, 4, 70)
plot3d(X_, y, b0_range, b1_range)
#plt.tight_layout(); plt.savefig("boston-loss.png",dpi=150,bbox_inches=0)
1. Normalize the $x$ variables
X_norm = normalize(X)
2. Retrain the model
X_ = X_norm['LSTAT'].values.reshape(-1,1)
lm = LinearRegression()
lm.fit(X_, y)
print(f"True OLS coefficients: {np.array([lm.intercept_]+list(lm.coef_))}")
3. Show the marginal plot again
Notice that only the $x$ scale has changed; neither $y$ nor the shape of the relationship has changed.
fig, ax1 = plt.subplots(figsize=(5,2.0))
ax1.scatter(X_, y, s=15, alpha=.5)
lx = np.linspace(np.min(X_), np.max(X_), num=len(X))
ax1.plot(lx, lm.predict(lx.reshape(-1,1)), c='orange')
ax1.set_xlabel("LSTAT", fontsize=10)
ax1.set_ylabel("price", fontsize=10)
plt.show()
4. Plot the cost surface in a region around the new minimum location
b0_range = np.linspace(15, 30, 70)
b1_range = np.linspace(-10, 5, 70)
plot3d(X_, y, b0_range, b1_range)
Q. Compare the loss function contour lines of the unnormalized and normalized variables.
Q. Look at the loss function directly from above; in which direction do the gradients point?
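To explore those gradient directions concretely, here is a sketch of gradient descent on the same two-parameter MSE loss, using synthetic normalized data in place of LSTAT (an assumption, so the snippet stands alone) and the analytic gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)                       # stands in for a normalized feature
y_true = 22.5 - 6.0 * x                        # hypothetical "true" line
y = y_true + rng.normal(scale=2.0, size=200)   # noisy observations

def loss_gradient(B, x, y):
    # gradient of mean((y - (b0 + b1*x))**2) with respect to [b0, b1]
    r = y - (B[0] + B[1] * x)
    return np.array([-2 * np.mean(r), -2 * np.mean(r * x)])

B = np.array([0.0, 0.0])
eta = 0.1
for _ in range(500):
    B = B - eta * loss_gradient(B, x, y)       # step opposite the gradient
print(B)   # close to the true [22.5, -6.0]
```

Because the feature is normalized, the bowl is well-conditioned and the gradient points nearly straight at the minimum; with unnormalized data the contours are elongated and the gradient steps zig-zag.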