# A2.5 Multilayer Neural Networks for Nonlinear Regression¶

A2.5 (Feb 18, 11:30 am)

• Small change to A2grader.py (Download A2grader.zip again). Fixed grading problem with Adam errors.

A2.4 (Feb 17, 9:45am)

• In the start of the Adam implementation, replaced g and g2 with m and v and removed the comment about alphat. The comments in adam now align with the code in the 06.1 lecture notes.
• Also A2grader.zip is updated. Download this one.

A2.3:

• Fixed errors in my grading of the partition function.

A2.2:

• Fixed initial values of self.beta1t and self.beta2t.
• Fixed the description of the partition function.

## Summary¶

In this assignment you will

• make some modifications to the supplied neural network implementation,
• define a function that partitions data into training, validation and test sets,
• apply it to a data set,
• define a function that runs experiments with a variety of parameter values,
• describe your observations of these results.

## Optimizers¶

First, we need a class that includes our optimization algorithms, sgd and adam. The following code cell implements sgd. You must complete the implementation of adam, following its implementation in the lecture notes.
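For reference, the Adam update that the lecture notes follow (Kingma and Ba, 2015) can be sketched on a standalone weight vector. The function and variable names below are illustrative only; your class version must instead update self.mt, self.vt, self.beta1t and self.beta2t as the template's comments describe.

```python
import numpy as np

def adam_step(w, grad, m, v, beta1t, beta2t, alpha=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    '''One generic Adam update, modifying w, m, and v in place.
    m and v are running first- and second-moment estimates of the gradient;
    beta1t and beta2t accumulate powers of beta1 and beta2 for bias correction.'''
    m[:] = beta1 * m + (1 - beta1) * grad
    v[:] = beta2 * v + (1 - beta2) * grad * grad
    beta1t *= beta1
    beta2t *= beta2
    mhat = m / (1 - beta1t)   # bias-corrected first moment
    vhat = v / (1 - beta2t)   # bias-corrected second moment
    w -= alpha * mhat / (np.sqrt(vhat) + epsilon)
    return beta1t, beta2t

# Minimize (w - 5)^2 with repeated Adam steps.
w = np.array([0.0])
m, v = np.zeros_like(w), np.zeros_like(w)
b1t, b2t = 1.0, 1.0
for _ in range(5000):
    grad = 2 * (w - 5)
    b1t, b2t = adam_step(w, grad, m, v, b1t, b2t, alpha=0.01)
```

Because Adam normalizes the gradient by its second moment, each step has magnitude near alpha, so convergence to within a small neighborhood of the minimum takes on the order of (distance / alpha) steps.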

Notice that all_weights is updated in place by these optimization algorithms. The new values of all_weights are not returned from these functions; the calling code allocates the memory for all_weights and keeps a reference to it, so it has direct access to the updated values.
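This sharing of one array can be demonstrated with a tiny NumPy example (the variable names here are made up):

```python
import numpy as np

all_weights = np.zeros(3)          # memory allocated by the caller
held_by_optimizer = all_weights    # a second reference, not a copy
held_by_optimizer -= 0.5           # in-place update with -=
print(all_weights)                 # the caller sees the new values
```

If the optimizer instead assigned `held_by_optimizer = held_by_optimizer - 0.5`, a new array would be created and the caller's all_weights would be left unchanged, which is why the template stresses updating with -=.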

In [5]:
import numpy as np
import matplotlib.pyplot as plt

In [ ]:
class Optimizers():

    def __init__(self, all_weights):
        '''all_weights is a vector of all of a neural network's weights concatenated into a one-dimensional vector'''

        self.all_weights = all_weights

        # The following initializations are only used by adam.
        # Only initializing mt, vt, beta1t and beta2t here allows multiple calls to adam to handle training
        # with multiple subsets (batches) of training data.
        self.mt = np.zeros_like(all_weights)
        self.vt = np.zeros_like(all_weights)
        self.beta1 = 0.9
        self.beta2 = 0.999
        self.beta1t = 1  # was self.beta1
        self.beta2t = 1  # was self.beta2

    def sgd(self, error_f, gradient_f, fargs=[], n_epochs=100, learning_rate=0.001, error_convert_f=None):
        '''
        error_f: function that requires X and T as arguments (given in fargs) and returns mean squared error.
        gradient_f: function that requires X and T as arguments (in fargs) and returns gradient of mean squared error
            with respect to each weight.
        error_convert_f: function that converts the standardized error from error_f to original T units.
        '''

        error_trace = []
        epochs_per_print = n_epochs // 10

        for epoch in range(n_epochs):

            error = error_f(*fargs)
            grad = gradient_f(*fargs)

            # Update all weights using -= to modify their values in-place.
            self.all_weights -= learning_rate * grad

            if error_convert_f:
                error = error_convert_f(error)
            error_trace.append(error)

            if (epoch + 1) % max(1, epochs_per_print) == 0:
                print(f'sgd: Epoch {epoch + 1:d} Error={error:.5f}')

        return error_trace

    def adam(self, error_f, gradient_f, fargs=[], n_epochs=100, learning_rate=0.001, error_convert_f=None):
        '''
        error_f: function that requires X and T as arguments (given in fargs) and returns mean squared error.
        gradient_f: function that requires X and T as arguments (in fargs) and returns gradient of mean squared error
            with respect to each weight.
        error_convert_f: function that converts the standardized error from error_f to original T units.
        '''

        alpha = learning_rate  # learning rate called alpha in original paper on adam
        epsilon = 1e-8
        error_trace = []
        epochs_per_print = n_epochs // 10

        for epoch in range(n_epochs):

            error = error_f(*fargs)
            grad = gradient_f(*fargs)

            # Finish Adam implementation here by updating
            #   self.mt
            #   self.vt
            #   self.beta1t
            #   self.beta2t
            # and updating values of self.all_weights

            . . .

            if error_convert_f:
                error = error_convert_f(error)
            error_trace.append(error)

            if (epoch + 1) % max(1, epochs_per_print) == 0:
                print(f'Adam: Epoch {epoch + 1:d} Error={error:.5f}')

        return error_trace


Test Optimizers using the function test_optimizers. You should get the same results shown below.

In [2]:
def test_optimizers():

    def parabola(wmin):
        return ((w - wmin) ** 2)[0]

    def parabola_gradient(wmin):
        return 2 * (w - wmin)

    w = np.array([0.0])
    optimizer = Optimizers(w)

    wmin = 5
    optimizer.sgd(parabola, parabola_gradient, fargs=[wmin], n_epochs=100, learning_rate=0.1)
    print(f'sgd: Minimum of parabola is at {wmin}. Value found is {w}')

    w = np.array([0.0])
    optimizer = Optimizers(w)
    optimizer.adam(parabola, parabola_gradient, fargs=[wmin], n_epochs=100, learning_rate=0.1)
    print(f'adam: Minimum of parabola is at {wmin}. Value found is {w}')

In [3]:
test_optimizers()

sgd: Epoch 10 Error=0.45036
sgd: Epoch 20 Error=0.00519
sgd: Epoch 30 Error=0.00006
sgd: Epoch 40 Error=0.00000
sgd: Epoch 50 Error=0.00000
sgd: Epoch 60 Error=0.00000
sgd: Epoch 70 Error=0.00000
sgd: Epoch 80 Error=0.00000
sgd: Epoch 90 Error=0.00000
sgd: Epoch 100 Error=0.00000
sgd: Minimum of parabola is at 5. Value found is [5.]
adam: Minimum of parabola is at 5. Value found is [5.03900403]


## NeuralNetwork class¶

Now we can implement the NeuralNetwork class that calls the above Optimizers functions to update the weights.

You must first complete the use function. You can make use of the forward_pass function.
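The unstandardization step that use must perform is the inverse of the standardization done in train. In isolation, with made-up means and standard deviations, the two steps look like this:

```python
import numpy as np

Tmeans = np.array([20.0])   # illustrative values; the class computes and
Tstds = np.array([5.0])     # stores these from the training targets

T = np.array([[15.0], [20.0], [30.0]])
T_std = (T - Tmeans) / Tstds        # standardize, as train does
T_back = T_std * Tstds + Tmeans     # unstandardize, as use must do

print(np.allclose(T_back, T))       # True: the two steps are inverses
```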

In [4]:
class NeuralNetwork():

    def __init__(self, n_inputs, n_hiddens_per_layer, n_outputs):
        self.n_inputs = n_inputs
        self.n_outputs = n_outputs

        # Set self.n_hiddens_per_layer to [] if argument is 0, [], or [0]
        if n_hiddens_per_layer == 0 or n_hiddens_per_layer == [] or n_hiddens_per_layer == [0]:
            self.n_hiddens_per_layer = []
        else:
            self.n_hiddens_per_layer = n_hiddens_per_layer

        # Initialize weights, by first building list of all weight matrix shapes.
        n_in = n_inputs
        shapes = []
        for nh in self.n_hiddens_per_layer:
            shapes.append((n_in + 1, nh))
            n_in = nh
        shapes.append((n_in + 1, n_outputs))

        # self.all_weights:  vector of all weights
        # self.Ws: list of weight matrices by layer
        self.all_weights, self.Ws = self.make_weights_and_views(shapes)

        # Define arrays to hold gradient values.
        # One array for each W array with same shape.
        self.all_gradients, self.dE_dWs = self.make_weights_and_views(shapes)

        self.trained = False
        self.total_epochs = 0
        self.error_trace = []
        self.Xmeans = None
        self.Xstds = None
        self.Tmeans = None
        self.Tstds = None

    def make_weights_and_views(self, shapes):
        # Vector of all weights built by horizontally stacking flattened matrices
        # for each layer initialized with uniformly-distributed values.
        all_weights = np.hstack([np.random.uniform(size=shape).flat / np.sqrt(shape[0])
                                 for shape in shapes])
        # Build list of views by reshaping corresponding elements from vector of all weights
        # into correct shape for each layer.
        views = []
        start = 0
        for shape in shapes:
            size = shape[0] * shape[1]
            views.append(all_weights[start:start + size].reshape(shape))
            start += size
        return all_weights, views

    # Return string that shows how the constructor was called
    def __repr__(self):
        return f'NeuralNetwork({self.n_inputs}, {self.n_hiddens_per_layer}, {self.n_outputs})'

    # Return string that is more informative to the user about the state of this neural network.
    def __str__(self):
        if self.trained:
            return self.__repr__() + f' trained for {self.total_epochs} epochs, final training error {self.error_trace[-1]}'
        else:
            return self.__repr__()

    def train(self, X, T, n_epochs, learning_rate, method='sgd'):
        '''
        train:
          X: n_samples x n_inputs matrix of input samples, one per row
          T: n_samples x n_outputs matrix of target output values, one sample per row
          n_epochs: number of passes to take through all samples updating weights each pass
          learning_rate: factor controlling the step size of each update
          method: either 'sgd' or 'adam'
        '''

        # Setup standardization parameters
        if self.Xmeans is None:
            self.Xmeans = X.mean(axis=0)
            self.Xstds = X.std(axis=0)
            self.Xstds[self.Xstds == 0] = 1  # So we don't divide by zero when standardizing
            self.Tmeans = T.mean(axis=0)
            self.Tstds = T.std(axis=0)

        # Standardize X and T
        X = (X - self.Xmeans) / self.Xstds
        T = (T - self.Tmeans) / self.Tstds

        # Instantiate Optimizers object by giving it vector of all weights
        optimizer = Optimizers(self.all_weights)

        # Define function to convert value from error_f into error in original T units.
        error_convert_f = lambda err: (np.sqrt(err) * self.Tstds)[0]  # to scalar

        if method == 'sgd':

            error_trace = optimizer.sgd(self.error_f, self.gradient_f,
                                        fargs=[X, T], n_epochs=n_epochs,
                                        learning_rate=learning_rate,
                                        error_convert_f=error_convert_f)

        elif method == 'adam':

            error_trace = optimizer.adam(self.error_f, self.gradient_f,
                                         fargs=[X, T], n_epochs=n_epochs,
                                         learning_rate=learning_rate,
                                         error_convert_f=error_convert_f)

        else:
            raise Exception("method must be 'sgd' or 'adam'")

        self.total_epochs += len(error_trace)
        self.error_trace = error_trace
        self.trained = True

        # Return neural network object to allow applying other methods after training.
        #  Example:    Y = nnet.train(X, T, 100, 0.01).use(X)
        return self

    def forward_pass(self, X):
        '''X assumed already standardized. Output returned as standardized.'''
        self.Ys = [X]
        for W in self.Ws[:-1]:
            self.Ys.append(np.tanh(self.Ys[-1] @ W[1:, :] + W[0:1, :]))
        last_W = self.Ws[-1]
        self.Ys.append(self.Ys[-1] @ last_W[1:, :] + last_W[0:1, :])
        return self.Ys

    # Function to be minimized by optimizer method, mean squared error
    def error_f(self, X, T):
        Ys = self.forward_pass(X)
        mean_sq_error = np.mean((T - Ys[-1]) ** 2)
        return mean_sq_error

    # Gradient of function to be minimized for use by optimizer method
    def gradient_f(self, X, T):
        '''Assumes forward_pass just called with layer outputs in self.Ys.'''
        error = T - self.Ys[-1]
        n_samples = X.shape[0]
        n_outputs = T.shape[1]
        delta = - error / (n_samples * n_outputs)
        n_layers = len(self.n_hiddens_per_layer) + 1
        # Step backwards through the layers to back-propagate the error (delta)
        for layeri in range(n_layers - 1, -1, -1):
            # gradient of all but bias weights
            self.dE_dWs[layeri][1:, :] = self.Ys[layeri].T @ delta
            # gradient of just the bias weights
            self.dE_dWs[layeri][0:1, :] = np.sum(delta, 0)
            # Back-propagate this layer's delta to previous layer
            delta = delta @ self.Ws[layeri][1:, :].T * (1 - self.Ys[layeri] ** 2)
        return self.all_gradients

    def use(self, X):
        '''X assumed to not be standardized. Return the unstandardized prediction.'''

        . . .


Then test it with the test_neuralnetwork function. Your results should be the same as those shown, because the pseudo-random number generator used to initialize the weights is set to start with the same seed.

In [18]:
np.random.seed(42)
np.random.uniform(-0.1, 0.1, size=(2, 2))

Out[18]:
array([[-0.02509198,  0.09014286],
[ 0.04639879,  0.0197317 ]])
In [19]:
np.random.uniform(-0.1, 0.1, size=(2, 2))

Out[19]:
array([[-0.06879627, -0.0688011 ],
[-0.08838328,  0.07323523]])
In [20]:
np.random.seed(42)
np.random.uniform(-0.1, 0.1, size=(2, 2))

Out[20]:
array([[-0.02509198,  0.09014286],
[ 0.04639879,  0.0197317 ]])
In [2]:
def test_neuralnetwork():

    np.random.seed(42)

    X = np.arange(100).reshape((-1, 1))
    T = np.sin(X * 0.04)

    n_hiddens = [10, 10]
    n_epochs = 2000
    learning_rate = 0.01

    nnetsgd = NeuralNetwork(1, n_hiddens, 1)
    nnetsgd.train(X, T, n_epochs, learning_rate, method='sgd')

    print()  # skip a line

    Ysgd = nnetsgd.use(X)

    plt.figure(figsize=(15, 10))
    plt.subplot(1, 3, 1)
    plt.plot(nnetsgd.error_trace, label='SGD')
    plt.xlabel('Epoch')
    plt.ylabel('RMSE')
    plt.legend()

    plt.subplot(1, 3, 2)
    plt.plot(T, Ysgd, 'o', label='SGD')
    a = min(np.min(T), np.min(Ysgd))
    b = max(np.max(T), np.max(Ysgd))
    plt.plot([a, b], [a, b], 'k-', lw=3, alpha=0.5, label='45 degree')
    plt.xlabel('Target')
    plt.ylabel('Predicted')
    plt.legend()

    plt.subplot(1, 3, 3)
    plt.plot(Ysgd, 'o-', label='SGD')
    plt.plot(T, label='Target')
    plt.xlabel('Sample')
    plt.ylabel('Target or Predicted')
    plt.legend()

    plt.tight_layout()
In [3]:
test_neuralnetwork()

sgd: Epoch 200 Error=0.49330
sgd: Epoch 400 Error=0.46833
sgd: Epoch 600 Error=0.44525
sgd: Epoch 800 Error=0.42264
sgd: Epoch 1000 Error=0.39428
sgd: Epoch 1200 Error=0.35526
sgd: Epoch 1400 Error=0.30300
sgd: Epoch 1600 Error=0.24079
sgd: Epoch 1800 Error=0.18020
sgd: Epoch 2000 Error=0.13423



## ReLU Activation Function¶

Cut and paste your NeuralNetwork class cell here. Then modify it to allow use of the ReLU activation function, in addition to the tanh activation function that NeuralNetwork currently uses.

To do this:

• Add the argument activation_function to the NeuralNetwork constructor that can be given values of tanh or relu, with tanh being its default value.
• Define two new class functions, relu(s) that accepts a matrix of weighted sums and returns the ReLU values, and grad_relu(s) that returns the gradient of relu(s) with respect to each value in s.
• Add if statements to forward_pass and gradient_f to selectively use the tanh or relu activation function. This is easy if you assign a new class variable in the NeuralNetwork constructor that has the value of the argument activation_function.
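A minimal sketch of the two new functions, written here as plain functions rather than class methods:

```python
import numpy as np

def relu(s):
    '''Elementwise max(0, s) applied to a matrix of weighted sums.'''
    return np.maximum(0, s)

def grad_relu(s):
    '''Gradient of relu with respect to each element of s: 1 where s > 0, else 0.'''
    return (s > 0).astype(float)

s = np.array([[-1.0, 0.0, 2.0]])
print(relu(s))       # [[0. 0. 2.]]
print(grad_relu(s))  # [[0. 0. 1.]]
```

In gradient_f, grad_relu(self.Ys[layeri]) would replace the factor (1 - self.Ys[layeri] ** 2), which is the corresponding gradient for tanh.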

## Now for the Experiments!¶

Now that your code is working, let's apply it to some interesting data.

Read in the auto-mpg.data that we have used in lectures. Let's apply neural networks to predict mpg using various neural network architectures, numbers of epochs, and our two activation functions.

This time we will partition the data into five parts after randomly rearranging the samples. We will assign the first partition as the validation set, the second one as the test set, and the remaining parts will be vertically stacked to form the training set, as discussed in lecture. We can use the RMSE on the validation set to pick the best values of the number of epochs and the network architecture. Then to report on the RMSE we expect on new data, we will report the test set RMSE.
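The fold logic described above can be sketched independently of the required partition signature; the array sizes and variable names here are illustrative:

```python
import numpy as np

X = np.arange(20).reshape(10, 2)    # 10 toy samples, 2 inputs
T = np.arange(10).reshape(10, 1)

n_folds = 5
row_indices = np.arange(X.shape[0])
np.random.shuffle(row_indices)                # randomly reorder the samples
folds = np.array_split(row_indices, n_folds)  # five roughly equal parts

validate_rows = folds[0]                      # first partition: validation set
test_rows = folds[1]                          # second partition: test set
train_rows = np.hstack(folds[2:])             # remaining parts stacked for training

Xtrain, Ttrain = X[train_rows], T[train_rows]
Xvalidate, Tvalidate = X[validate_rows], T[validate_rows]
Xtest, Ttest = X[test_rows], T[test_rows]
```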

Read in the auto-mpg.data using pandas and remove all samples that contain missing values. You should end up with 392 samples.

Now randomly reorder the samples. First run np.random.seed(42) to guarantee that we all use the same random ordering of samples.

Partition the data into five folds, as shown in lecture. To do this, complete the following function.

In [ ]:
def partition(X, T, n_folds, random_shuffle=True):

    . . .

    return Xtrain, Ttrain, Xvalidate, Tvalidate, Xtest, Ttest


Write a function named run_experiment that uses three nested for loops to try different values of the parameters n_epochs, n_hidden_units_per_layer and activation_function which will just be either tanh or relu. Don't forget to try [0] for one of the values of n_hidden_units_per_layer to include a linear model in your tests. For each set of parameter values, create and train a neural network using the 'adam' optimization method and use the neural network on the training, validation and test sets. Collect the parameter values and the RMSE for the training, validation, and test set in a list. When your loops are done, construct a pandas.DataFrame from the list of results, for easy printing. The first five lines might look like:

   epochs        nh    lr act func  RMSE Train  RMSE Val  RMSE Test
0    1000       [0]  0.01     tanh    3.356401  3.418705   3.116480
1    1000       [0]  0.01     relu    3.354528  3.428324   3.125064
2    1000      [20]  0.01     tanh    1.992509  2.355746   2.459506
3    1000      [20]  0.01     relu    2.448536  2.026954   2.581707
4    1000  [20, 20]  0.01     tanh    1.518916  2.468188   3.118376

Your function must return a pandas.DataFrame like this one.

Before starting the nested for loops, your run_experiment function must first call your partition function to form the training, validation and test sets.
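Collecting one row per parameter combination and converting to a DataFrame at the end might look like this; the column names are copied from the example table, while the parameter choices and RMSE values below are placeholders:

```python
import pandas as pd

results = []
for n_epochs in [1000, 2000]:
    for nh in [[0], [20]]:
        for act in ['tanh', 'relu']:
            # ... create and train a network here, then compute the three RMSE values ...
            rmse_train, rmse_val, rmse_test = 0.0, 0.0, 0.0  # placeholders
            results.append([n_epochs, nh, 0.01, act, rmse_train, rmse_val, rmse_test])

result_df = pd.DataFrame(results, columns=['epochs', 'nh', 'lr', 'act func',
                                           'RMSE Train', 'RMSE Val', 'RMSE Test'])
print(result_df.shape)  # (8, 7): one row per parameter combination
```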

An example call of your function would look like this:

In [ ]:
result_df = run_experiment(X, T, n_folds=5,
                           n_epochs_choices=[1000, 2000],
                           n_hidden_units_per_layer_choices=[[0], [10], [100, 10]],
                           activation_function_choices=['tanh', 'relu'])


Find the lowest value of RMSE Val in your table and report the RMSE Test and the parameter values that produced this. This is your expected error in predicted miles per gallon. Discuss how good this prediction is.

Plot the RMSE values for training, validation and test sets versus the combined parameter values of number of epochs and network architecture. Make one plot for tanh as the activation function and a second one for relu. Your plots should look like this, but with different RMSE values, and will of course be different if you choose different network architectures and numbers of epochs.
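One way to arrange such a plot (the grouping and labels here are hypothetical) is to filter the results by activation function and plot the three RMSE columns against combined parameter labels:

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

# Hypothetical results table with the same columns as run_experiment's output.
result_df = pd.DataFrame(
    [[1000, [0], 0.01, 'tanh', 3.36, 3.42, 3.12],
     [1000, [20], 0.01, 'tanh', 1.99, 2.36, 2.46],
     [2000, [20], 0.01, 'tanh', 1.52, 2.47, 3.12]],
    columns=['epochs', 'nh', 'lr', 'act func', 'RMSE Train', 'RMSE Val', 'RMSE Test'])

tanh_rows = result_df[result_df['act func'] == 'tanh']
labels = [f'{e}, {nh}' for e, nh in zip(tanh_rows['epochs'], tanh_rows['nh'])]

plt.figure()
for col in ['RMSE Train', 'RMSE Val', 'RMSE Test']:
    plt.plot(range(len(tanh_rows)), tanh_rows[col], 'o-', label=col)
plt.xticks(range(len(tanh_rows)), labels, rotation=20)
plt.ylabel('RMSE')
plt.legend()
```

A second figure built the same way from the relu rows completes the pair of plots.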

Describe at least three different observations you make about these plots. What do you find interesting?

1. ...
2. ...
3. ...

Your notebook will be partially run and graded automatically. Test this grading process by first downloading A2grader.zip and extract A2grader.py from it. Run the code in the following cell to demonstrate an example grading session. You should see a perfect execution score of 70/70 if your functions are defined correctly. The remaining 30 points will be based on other testing and the results you obtain and your discussions.

A different, but similar, grading script will be used to grade your checked-in notebook. It will include additional tests. You should design and perform additional tests on all of your functions to be sure they run correctly before checking in your notebook.

For the grading script to run correctly, you must first name this notebook Lastname-A2.ipynb, with Lastname being your last name, and then save this notebook and check it in at the A2 assignment link in our Canvas web page.

In [4]:
%run -i A2grader.py

Testing
w = np.array([0.0])
def cubic(wmin):
    return (w[0] - wmin) ** 3 + (w[0] - wmin) ** 2
def grad_cubic(wmin):
    return 3 * (w[0] - wmin) ** 2 + 2 * (w[0] - wmin)
wmin = 0.5
opt = Optimizers(w)
errors_sgd = opt.sgd(cubic, grad_cubic, [wmin], 100, 0.01)

sgd: Epoch 10 Error=0.11889
sgd: Epoch 20 Error=0.11092
sgd: Epoch 30 Error=0.10176
sgd: Epoch 40 Error=0.09162
sgd: Epoch 50 Error=0.08081
sgd: Epoch 60 Error=0.06972
sgd: Epoch 70 Error=0.05879
sgd: Epoch 80 Error=0.04844
sgd: Epoch 90 Error=0.03901
sgd: Epoch 100 Error=0.03072

--- 10/10 points. Returned correct value.

Testing
w = np.array([0.0])
def cubic(wmin):
    return (w[0] - wmin) ** 3 + (w[0] - wmin) ** 2
def grad_cubic(wmin):
    return 3 * (w[0] - wmin) ** 2 + 2 * (w[0] - wmin)
wmin = 0.5
opt = Optimizers(w)
errors_adam = opt.adam(cubic, grad_cubic, [wmin], 100, 0.01)

--- 10/10 points. Returned correct value.

Testing
np.random.seed(42)

nnet = NeuralNetwork(2, [10], 1)
X = np.arange(40).reshape(20, 2)
T = X[:, 0:1] * X[:, 1:]

--- 20/20 points. Returned correct value.

Testing
np.random.seed(42)

# Using X and T from previous test
a, b, c, d, e, f = partition(X, T, 3)

--- 10/10 points. Returned correct values.

Testing
np.random.seed(42)

result = run_experiment(X, T, 4,
                        n_epochs_choices=[100, 200],
                        n_hidden_units_per_layer_choices=[[0], [10]],
                        activation_function_choices=['tanh', 'relu'])

first_test_rmse = result.iloc[0]['RMSE Test']

--- 20/20 points. Returned correct values.

======================================================================
a2 Execution Grade is 70 / 70
======================================================================

__ / 30 Discussion of at least three observations about

======================================================================
a2 FINAL GRADE is  _  / 100
======================================================================


a2 EXTRA CREDIT is 0 / 1


## Extra Credit: 1 point¶

Add the Swish activation function as a third choice in your train function in your NeuralNetwork class. A little googling will find definitions of it and its gradient. Start with this article.

Use your run_experiment function to compare results for all three activation functions. Discuss the results.
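One common definition is swish(s) = s · sigmoid(s); a sketch of it and its gradient as plain functions (in the class they would be methods alongside relu and grad_relu):

```python
import numpy as np

def sigmoid(s):
    return 1 / (1 + np.exp(-s))

def swish(s):
    '''swish(s) = s * sigmoid(s)'''
    return s * sigmoid(s)

def grad_swish(s):
    '''d/ds [s * sigmoid(s)] = sigmoid(s) + s * sigmoid(s) * (1 - sigmoid(s))'''
    sig = sigmoid(s)
    return sig + s * sig * (1 - sig)

s = np.array([0.0, 2.0])
print(swish(s))        # swish(0) = 0, swish(2) = 2 * sigmoid(2)
print(grad_swish(s))   # grad_swish(0) = 0.5
```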