The goal of this lab is to explore how chasing function gradients can find the function minimum. If the function is a loss function representing the quality of a model's fit to a training set, we can use function minimization to train models.
When there is no symbolic solution to minimizing the loss function, we need an iterative solution, such as gradient descent.
import numpy as np
import pandas as pd
from mpl_toolkits.mplot3d import Axes3D # required even though not ref'd!
from sklearn.linear_model import LinearRegression, LogisticRegression, Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_boston # removed in scikit-learn 1.2; requires scikit-learn < 1.2
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, precision_score, recall_score, log_loss, mean_absolute_error
import matplotlib.pyplot as plt
import matplotlib as mpl
%config InlineBackend.figure_format = 'retina'
def normalize(X):
    X = X.copy()
    for colname in X.columns:
        u = np.mean(X[colname])
        s = np.std(X[colname])
        if s > 0.0:
            X[colname] = (X[colname] - u) / s
        else:
            X[colname] = X[colname] - u
    return X
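As a quick sanity check (a sketch using toy data, not the Boston set), each nonconstant column should come out of `normalize` with mean 0 and standard deviation 1, while a constant column is merely centered:

```python
import numpy as np
import pandas as pd

# normalize() as defined above, repeated so this sketch is self-contained
def normalize(X):
    X = X.copy()
    for colname in X.columns:
        u = np.mean(X[colname])
        s = np.std(X[colname])
        if s > 0.0:
            X[colname] = (X[colname] - u) / s
        else:
            X[colname] = X[colname] - u
    return X

df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [7.0, 7.0, 7.0]})
dfn = normalize(df)
print(dfn['a'].mean(), dfn['a'].std(ddof=0))  # ~0.0 and ~1.0 (up to float error)
print(dfn['b'].tolist())                      # [0.0, 0.0, 0.0]: centered only
```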
def plot3d(X, y, b0_range, b1_range):
    b0_mesh, b1_mesh = np.meshgrid(b0_range, b1_range, indexing='ij')
    L = np.zeros(b0_mesh.shape)
    for i in range(len(b0_range)):
        for j in range(len(b1_range)):
            L[i,j] = loss([b0_range[i],b1_range[j]], X=X, y=y)
    fig = plt.figure(figsize=(5,4))
    ax = fig.add_subplot(111, projection='3d')
    surface = ax.plot_surface(b0_mesh, b1_mesh, L, alpha=0.7, cmap='coolwarm')
    ax.set_xlabel('$\\beta_0$', fontsize=14)
    ax.set_ylabel('$\\beta_1$', fontsize=14)
Let's define a very simple quadratic in one variable, $y = f(x) = (x-2)^2$ and then use an iterative solution to find the minimum value.
def f(x) : return (x-2)**2
We can hide all of the plotting details in a function, as we will use it multiple times.
def fplot(f, xrange, fstr='', x0=None, xn=None):
    plt.figure(figsize=(3.5,2))
    lx = np.linspace(*xrange, 200)
    fx = [f(x) for x in lx]
    plt.plot(lx, fx, lw=.75)
    if x0 is not None:
        plt.scatter([x0], [f(x0)], c='orange')
    if xn is not None:
        plt.scatter([xn], [f(xn)], c='green')
    plt.xlabel("$x$", fontsize=12)
    plt.ylabel(fstr, fontsize=12)
fplot(f, xrange=(0,4), fstr="$(x-2)^2$")
To minimize a function of $x$, we need the derivative of $f(x)$, which is just a function that gives the slope of the curve at every $x$.
1. Define a function returning the derivative of $f(x)$
You can ask for symbolic derivatives from a variety of online tools, but here's one solution.
def df(x): ...
def df(x): return 2*(x-2)
2. Pick an initial $x$ location and take a single step according to the derivative
Use a learning rate of $\eta = 0.4$. The output should be 1.76. (Also keep in mind that the minimum value is clearly at $x=2$.)
x = .8 # initial x location
x = ...
print(x)
x = x - .4 * df(x); print(x)
Q. How can we symbolically optimize a quadratic function like this with a single minimum?
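One answer, sketched here by hand rather than with a computer algebra system: a quadratic has a single critical point, so set the derivative to zero and solve directly.

```python
# For f(x) = (x-2)**2 the derivative is df/dx = 2*(x-2).
# Setting 2*(x-2) = 0 gives x = 2 in one algebraic step -- no iteration needed.
def f(x): return (x - 2)**2
def df(x): return 2*(x - 2)

x_star = 2.0                  # solved by hand from df(x) = 0
print(df(x_star), f(x_star))  # 0.0 0.0: zero slope and minimum value at x = 2
```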
3. Create a loop that takes five more steps (same learning rate)
The output should look like:
1.952
1.9904
1.99808
1.999616
1.9999232
for i in range(5):
    x = x - 0.4 * df(x)
    print(x)
Notice how fast the iteration moves $x$ to the location where $f(x)$ is minimum!
This iterative minimization approach works for any (smooth) function, assuming we choose a small enough learning rate. For example, let's find one of the minima for $f(x) = x \sin(0.6x)$ in the range [-1,10]. The plot should look something like:
Depending on where we start, minimization will find either the minimum at $x=0$ or the one near $x=8.18$. The location of the lowest function value is called the global minimum and any others are called local minima.
1. Define a function for $x \sin(0.6x)$
def f(x) : ...
def f(x) : return np.sin(0.6*x)*x
fplot(f, xrange=(-1,10), fstr="$x \sin(0.6x)$")
#plt.tight_layout(); plt.savefig("xsinx.png",dpi=150,bbox_inches=0)
2. Define the derivative function: $\frac{df}{dx} = 0.6x \cos(0.6 x) + \sin(0.6 x)$
def df(x): ...
def df(x): return 0.6*x * np.cos(0.6*x) + np.sin(0.6*x)
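It's worth verifying a hand-derived derivative numerically; this sketch compares the analytic `df` against a central finite difference at a few points:

```python
import numpy as np

def f(x): return x * np.sin(0.6 * x)
def df(x): return 0.6 * x * np.cos(0.6 * x) + np.sin(0.6 * x)

h = 1e-6
for x in [0.5, 3.0, 8.0]:
    approx = (f(x + h) - f(x - h)) / (2 * h)   # central difference
    print(f"x={x}: analytic={df(x):.8f}, numeric={approx:.8f}")
# the analytic and numeric columns should agree closely
```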
3. Pick a random initial value, $x_0$, between -1 and 10; display that value
x0 = np.random.rand()*11 - 1 # pick value between -1 and 10
x0
4. Start $x$ at $x_0$ and iterate 12 times using the gradient descent method
Use a learning rate of 0.4.
x = x0
for i in range(12):
    x = x - .4 * df(x); print(f"{x:.10f}")
5. Plot the starting and stopping locations on the curve
fplot(f, xrange=(-1,10), fstr="$x \sin(0.6x)$", x0=x0, xn=x)
6. Rerun the notebook several times to see how the random start location affects where it terminates.
Q. Rather than iterating a fixed number of times, what's a better way to terminate the iteration?
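One common answer: stop when the update (or the gradient) becomes negligibly small. A minimal sketch, reusing the $x \sin(0.6x)$ derivative and a hypothetical tolerance of $10^{-8}$:

```python
import numpy as np

def df(x): return 0.6 * x * np.cos(0.6 * x) + np.sin(0.6 * x)

def minimize_until_converged(df, x0, eta, tol=1e-8, max_iter=1000):
    x = x0
    for _ in range(max_iter):
        step = eta * df(x)
        x = x - step
        if abs(step) < tol:    # terminate when x barely moves
            break
    return x

x = minimize_until_converged(df, x0=9.0, eta=0.4)
print(x)   # near the global minimum around x = 8.18
```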
Let's move back to the simple function $f(x) = (x-2)^2$ and consider different learning rates to see the effect.
def df(x): return 2*(x-2)
Let's codify the minimization process in a handy function:
def minimize(df, x0, eta):
    x = x0
    for i in range(10):
        x = x - eta * df(x)
        print(f"{x:.2f}")
1. Update the gradient descent loop to use a learning rate of 1.0
Notice how the learning rate is so large that iteration oscillates between two (incorrect) solutions. The output should be:
3.20
0.80
3.20
0.80
3.20
0.80
3.20
0.80
3.20
0.80
minimize(df, x0=0.8, eta=...)
2. Update the gradient descent loop to use a learning rate of 2.0
Notice how the solution diverges when the learning rate is too big. The output should be:
5.60
-8.80
34.40
-95.20
293.60
-872.80
2626.40
-7871.20
23621.60
-70856.80
minimize(df, x0=0.8, eta=...)
3. Update the gradient descent loop to use a learning rate of 0.01
Notice how slowly the solution converges when the learning rate is too small. The output should be:
0.82
0.85
0.87
0.89
0.92
0.94
0.96
0.98
1.00
1.02
minimize(df, x0=0.8, eta=...)
Q. How do you choose the learning rate $\eta$?
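There is no single rule; one pragmatic sketch (using a hypothetical helper `final_x`, not part of the lab) is to try a few rates and keep the largest one that still converges:

```python
def df(x): return 2*(x - 2)

def final_x(eta, x0=0.8, steps=20):
    # run gradient descent for a fixed number of steps and return where it ends
    x = x0
    for _ in range(steps):
        x = x - eta * df(x)
    return x

for eta in [2.0, 1.0, 0.4, 0.01]:
    print(f"eta={eta}: x after 20 steps = {final_x(eta):.4f}")
# eta=2.0 diverges, eta=1.0 oscillates between 0.8 and 3.2,
# eta=0.4 converges quickly, eta=0.01 converges but very slowly
```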
Turning to a common toy data set, the Boston housing data set, let's pick the most important single feature and look at the loss function for simple OLS regression.
1. Load the Boston data set into a data frame
boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = boston.target
X.head()
2. Train an OLS linear regression model
lm = LinearRegression()
lm.fit(X, y)
3. Using the rfpimp package, display the feature importances
from rfpimp import *
I = importances(lm, X, y)
plot_importances(I)
4. LSTAT is the most important variable, so train a new model with just X['LSTAT']
Print out the true $\beta_0, \beta_1$ coefficients.
X_ = X['LSTAT'].values.reshape(-1,1) # Extract just one x variable
lm = LinearRegression()
lm.fit(X_, y)
print(f"True OLS coefficients: {np.array([lm.intercept_]+list(lm.coef_))}")
5. Show marginal plot of LSTAT vs price
fig, ax1 = plt.subplots(figsize=(5,2.0))
ax1.scatter(X_, y, s=15, alpha=.5)
lx = np.linspace(np.min(X_), np.max(X_), num=len(X))
ax1.plot(lx, lm.predict(lx.reshape(-1,1)), c='orange')
ax1.set_xlabel("LSTAT", fontsize=10)
ax1.set_ylabel("price", fontsize=10)
plt.show()
6. Define an MSE loss function for single variable regression
$$ \frac{1}{n} \sum_{i=1}^n \left(y^{(i)} - (\beta_0 + \beta_1 x^{(i)})\right)^2 $$
def loss(B,X,y): # B=[beta0, beta1]
    y_pred = ...
    return np.mean(...)
def loss(B,X,y): # B=[beta0, beta1]
    y_pred = B[0] + X*B[1]
    return np.mean((y - y_pred)**2)
7. Check the loss function value at the true OLS coordinates
loss(np.array([34.55384088, -0.95004935]), X_, y) # demo loss function at minimum
8. Plot the loss function in 3D in the region around the true $\beta$s
When you enter the correct loss function above, the plot should look something like:
b0_range = np.linspace(-50, 120, 70)
b1_range = np.linspace(-6, 4, 70)
plot3d(X_, y, b0_range, b1_range)
#plt.tight_layout(); plt.savefig("boston-loss.png",dpi=150,bbox_inches=0)
1. Normalize the $x$ variables
X_norm = normalize(X)
2. Retrain the model
X_ = X_norm['LSTAT'].values.reshape(-1,1)
lm = LinearRegression()
lm.fit(X_, y)
print(f"True OLS coefficients: {np.array([lm.intercept_]+list(lm.coef_))}")
3. Show the marginal plot again
Notice that only the $x$ scale has changed; neither $y$ nor the shape of the relationship has changed.
fig, ax1 = plt.subplots(figsize=(5,2.0))
ax1.scatter(X_, y, s=15, alpha=.5)
lx = np.linspace(np.min(X_), np.max(X_), num=len(X))
ax1.plot(lx, lm.predict(lx.reshape(-1,1)), c='orange')
ax1.set_xlabel("LSTAT", fontsize=10)
ax1.set_ylabel("price", fontsize=10)
plt.show()
4. Plot the cost surface in a region around the new minimum location
b0_range = np.linspace(15, 30, 70)
b1_range = np.linspace(-10, 5, 70)
plot3d(X_, y, b0_range, b1_range)
Q. Compare the loss function contour lines of the unnormalized and normalized variables.
Q. Look at the loss function directly from above; in which direction do the gradients point?
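To explore those gradient directions concretely, here is a sketch of gradient descent on the same two-parameter MSE loss, using synthetic normalized data in place of LSTAT (an assumption, so the snippet stands alone) and the analytic gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)                       # stands in for a normalized feature
y_true = 22.5 - 6.0 * x                        # hypothetical "true" line
y = y_true + rng.normal(scale=2.0, size=200)   # noisy observations

def loss_gradient(B, x, y):
    # gradient of mean((y - (b0 + b1*x))**2) with respect to [b0, b1]
    r = y - (B[0] + B[1] * x)
    return np.array([-2 * np.mean(r), -2 * np.mean(r * x)])

B = np.array([0.0, 0.0])
eta = 0.1
for _ in range(500):
    B = B - eta * loss_gradient(B, x, y)       # step opposite the gradient
print(B)   # close to the true [22.5, -6.0]
```

Because the feature is normalized, the bowl is well-conditioned and the gradient points nearly straight at the minimum; with unnormalized data the contours are elongated and the gradient steps zig-zag.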