A2.5 Multilayer Neural Networks for Nonlinear Regression

A2.5 (Feb 18, 11:30 am)

  • Small change to A2grader.py (Download A2grader.zip again). Fixed grading problem with Adam errors.

A2.4 (Feb 17, 9:45am)

  • At the start of the Adam implementation, replaced g and g2 with m and v and removed the comment about alphat. The comments in Adam are now more closely aligned with the code used in the 06.1 lecture notes.
  • Also A2grader.zip is updated. Download this one.


  • Fixed errors in my grading of partition function.


  • Fixed initial values of self.beta1t and self.beta2t.
  • Fixed description of partition function
  • New A2grader.zip

Type your name here


In this assignment you will

  • make some modifications to the supplied neural network implementation,
  • define a function that partitions data into training, validation and test sets,
  • apply it to a data set,
  • define a function that runs experiments with a variety of parameter values,
  • describe your observations of these results.


First, we need a class that includes our optimization algorithms, sgd and adam. The following code cell implements sgd. You must complete the implementation of adam, following its implementation in the lecture notes.

Notice that all_weights is updated in place by these optimization algorithms. The new values of all_weights are not returned from these functions, because the code that calls these functions allocates the memory for all_weights and keeps the reference to it so has direct access to the new values.
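
The difference between updating in place and rebinding a name can be seen in a small sketch. The array names and the two helper functions below are made up for illustration and are not part of the assignment code.

```python
import numpy as np

all_weights = np.zeros(6)
# Two views into the same memory, shaped like small weight matrices.
W1 = all_weights[:4].reshape(2, 2)
W2 = all_weights[4:].reshape(2, 1)

def step_in_place(w, grad, learning_rate):
    # -= writes into the existing array, so every view sees the change.
    w -= learning_rate * grad

def step_rebinding(w, grad, learning_rate):
    # This builds a new array and rebinds the local name only;
    # the caller's array is left untouched.
    w = w - learning_rate * grad

grad = np.ones(6)
step_rebinding(all_weights, grad, 0.1)
print(W1[0, 0])             # still 0.0
step_in_place(all_weights, grad, 0.1)
print(W1[0, 0], W2[1, 0])   # both -0.1
```

This is why the optimizers use -= on self.all_weights: the weight matrices held by the caller are views of that same vector.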

In [5]:
import numpy as np
import matplotlib.pyplot as plt
In [ ]:
class Optimizers():

    def __init__(self, all_weights):
        '''all_weights is a one-dimensional vector holding all of a neural network's weights, concatenated'''
        self.all_weights = all_weights

        # The following initializations are only used by adam.
        # Only initializing mt, vt, beta1t and beta2t here allows multiple calls to adam to handle training
        # with multiple subsets (batches) of training data.
        self.mt = np.zeros_like(all_weights)
        self.vt = np.zeros_like(all_weights)
        self.beta1 = 0.9
        self.beta2 = 0.999
        self.beta1t = 1  # was self.beta1
        self.beta2t = 1  # was self.beta2

    def sgd(self, error_f, gradient_f, fargs=[], n_epochs=100, learning_rate=0.001, error_convert_f=None):
        '''
        error_f: function that requires X and T as arguments (given in fargs) and returns mean squared error.
        gradient_f: function that requires X and T as arguments (in fargs) and returns gradient of mean squared error
                    with respect to each weight.
        error_convert_f: function that converts the standardized error from error_f to original T units.
        '''

        error_trace = []
        epochs_per_print = n_epochs // 10

        for epoch in range(n_epochs):

            error = error_f(*fargs)
            grad = gradient_f(*fargs)

            # Update all weights using -= to modify their values in-place.
            self.all_weights -= learning_rate * grad

            if error_convert_f:
                error = error_convert_f(error)
            # Record this epoch's error so the returned trace is not empty.
            error_trace.append(error)

            if (epoch + 1) % max(1, epochs_per_print) == 0:
                print(f'sgd: Epoch {epoch+1:d} Error={error:.5f}')

        return error_trace

    def adam(self, error_f, gradient_f, fargs=[], n_epochs=100, learning_rate=0.001, error_convert_f=None):
        '''
        error_f: function that requires X and T as arguments (given in fargs) and returns mean squared error.
        gradient_f: function that requires X and T as arguments (in fargs) and returns gradient of mean squared error
                    with respect to each weight.
        error_convert_f: function that converts the standardized error from error_f to original T units.
        '''

        alpha = learning_rate  # learning rate called alpha in original paper on adam
        epsilon = 1e-8
        error_trace = []
        epochs_per_print = n_epochs // 10

        for epoch in range(n_epochs):

            error = error_f(*fargs)
            grad = gradient_f(*fargs)

            # Finish Adam implementation here by updating
            #   self.mt
            #   self.vt
            #   self.beta1t
            #   self.beta2t
            # and updating values of self.all_weights
            . . .

            if error_convert_f:
                error = error_convert_f(error)
            # Record this epoch's error so the returned trace is not empty.
            error_trace.append(error)

            if (epoch + 1) % max(1, epochs_per_print) == 0:
                print(f'Adam: Epoch {epoch+1:d} Error={error:.5f}')

        return error_trace
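
The initial values of 1 for self.beta1t and self.beta2t make sense if, as in the original Adam paper, these hold the running products beta1**t and beta2**t used for bias correction. That the lecture notes do it this way is an assumption here. The sketch below shows only the running products, not the weight update.

```python
beta1, beta2 = 0.9, 0.999
beta1t, beta2t = 1.0, 1.0   # start at 1, which is beta ** 0

for t in range(1, 6):
    # Multiply first, so that after t steps the products equal beta ** t.
    beta1t *= beta1
    beta2t *= beta2
    assert abs(beta1t - beta1 ** t) < 1e-12
    assert abs(beta2t - beta2 ** t) < 1e-12
    # Bias-correction denominators, as defined in the Adam paper.
    print(t, 1 - beta1t, 1 - beta2t)
```

Note that 1 - beta1t is zero until the first multiplication, so under this scheme the products must be updated before they are used in a division.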

Test Optimizers using the function test_optimizers. You should get the same results shown below.

In [2]:
def test_optimizers():

    def parabola(wmin):
        return ((w - wmin) ** 2)[0]

    def parabola_gradient(wmin):
        return 2 * (w - wmin)

    w = np.array([0.0])
    optimizer = Optimizers(w)

    wmin = 5
    optimizer.sgd(parabola, parabola_gradient, [wmin], n_epochs=100, learning_rate=0.1)
    print(f'sgd: Minimum of parabola is at {wmin}. Value found is {w}')

    w = np.array([0.0])
    optimizer = Optimizers(w)
    optimizer.adam(parabola, parabola_gradient, [wmin], n_epochs=100, learning_rate=0.1)
    print(f'adam: Minimum of parabola is at {wmin}. Value found is {w}')
In [3]:
test_optimizers()
sgd: Epoch 10 Error=0.45036
sgd: Epoch 20 Error=0.00519
sgd: Epoch 30 Error=0.00006
sgd: Epoch 40 Error=0.00000
sgd: Epoch 50 Error=0.00000
sgd: Epoch 60 Error=0.00000
sgd: Epoch 70 Error=0.00000
sgd: Epoch 80 Error=0.00000
sgd: Epoch 90 Error=0.00000
sgd: Epoch 100 Error=0.00000
sgd: Minimum of parabola is at 5. Value found is [5.]
Adam: Epoch 10 Error=16.85565
Adam: Epoch 20 Error=9.93336
Adam: Epoch 30 Error=5.21627
Adam: Epoch 40 Error=2.37740
Adam: Epoch 50 Error=0.90515
Adam: Epoch 60 Error=0.26972
Adam: Epoch 70 Error=0.05453
Adam: Epoch 80 Error=0.00453
Adam: Epoch 90 Error=0.00016
Adam: Epoch 100 Error=0.00147
adam: Minimum of parabola is at 5. Value found is [5.03900403]

NeuralNetwork class

Now we can implement the NeuralNetwork class that calls the above Optimizers functions to update the weights.

You must first complete the use function. You can make use of the forward_pass function.
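
Because train standardizes X and T and forward_pass works in standardized units, predictions have to be mapped back to the original T units. A minimal sketch of that round trip, using a made-up target matrix and local variable names rather than the class attributes:

```python
import numpy as np

T = np.array([[10.0], [20.0], [40.0]])   # made-up targets
Tmeans = T.mean(axis=0)
Tstds = T.std(axis=0)

# Standardize: zero mean and unit standard deviation per column.
T_standardized = (T - Tmeans) / Tstds

# Unstandardize: invert the two steps in reverse order.
T_recovered = T_standardized * Tstds + Tmeans

print(T_standardized.ravel())
print(T_recovered.ravel())
```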

In [4]:
class NeuralNetwork():

    def __init__(self, n_inputs, n_hiddens_per_layer, n_outputs):
        self.n_inputs = n_inputs
        self.n_outputs = n_outputs

        # Set self.n_hiddens_per_layer to [] if argument is 0, [], or [0]
        if n_hiddens_per_layer == 0 or n_hiddens_per_layer == [] or n_hiddens_per_layer == [0]:
            self.n_hiddens_per_layer = []
        else:
            self.n_hiddens_per_layer = n_hiddens_per_layer

        # Initialize weights, by first building list of all weight matrix shapes.
        n_in = n_inputs
        shapes = []
        for nh in self.n_hiddens_per_layer:
            shapes.append((n_in + 1, nh))
            n_in = nh
        shapes.append((n_in + 1, n_outputs))

        # self.all_weights:  vector of all weights
        # self.Ws: list of weight matrices by layer
        self.all_weights, self.Ws = self.make_weights_and_views(shapes)

        # Define arrays to hold gradient values.
        # One array for each W array with same shape.
        self.all_gradients, self.dE_dWs = self.make_weights_and_views(shapes)

        self.trained = False
        self.total_epochs = 0
        self.error_trace = []
        self.Xmeans = None
        self.Xstds = None
        self.Tmeans = None
        self.Tstds = None

    def make_weights_and_views(self, shapes):
        # vector of all weights built by horizontally stacking flattened matrices
        # for each layer initialized with uniformly-distributed values.
        all_weights = np.hstack([np.random.uniform(size=shape).flat / np.sqrt(shape[0])
                                 for shape in shapes])
        # Build list of views by reshaping corresponding elements from vector of all weights
        # into correct shape for each layer.
        views = []
        start = 0
        for shape in shapes:
            size = shape[0] * shape[1]
            views.append(all_weights[start:start + size].reshape(shape))
            start += size
        return all_weights, views

    # Return string that shows how the constructor was called
    def __repr__(self):
        return f'NeuralNetwork({self.n_inputs}, {self.n_hiddens_per_layer}, {self.n_outputs})'

    # Return string that is more informative to the user about the state of this neural network.
    def __str__(self):
        if self.trained:
            return self.__repr__() + f' trained for {self.total_epochs} epochs, final training error {self.error_trace[-1]}'
        else:
            # __str__ must always return a string, even before training.
            return self.__repr__()

    def train(self, X, T, n_epochs, learning_rate, method='sgd'):
        '''
        X: n_samples x n_inputs matrix of input samples, one per row
        T: n_samples x n_outputs matrix of target output values, one sample per row
        n_epochs: number of passes to take through all samples updating weights each pass
        learning_rate: factor controlling the step size of each update
        method: is either 'sgd' or 'adam'
        '''

        # Setup standardization parameters
        if self.Xmeans is None:
            self.Xmeans = X.mean(axis=0)
            self.Xstds = X.std(axis=0)
            self.Xstds[self.Xstds == 0] = 1  # So we don't divide by zero when standardizing
            self.Tmeans = T.mean(axis=0)
            self.Tstds = T.std(axis=0)
        # Standardize X and T
        X = (X - self.Xmeans) / self.Xstds
        T = (T - self.Tmeans) / self.Tstds

        # Instantiate Optimizers object by giving it vector of all weights
        optimizer = Optimizers(self.all_weights)

        # Define function to convert value from error_f into error in original T units.
        error_convert_f = lambda err: (np.sqrt(err) * self.Tstds)[0] # to scalar

        if method == 'sgd':

            error_trace = optimizer.sgd(self.error_f, self.gradient_f,
                                        fargs=[X, T], n_epochs=n_epochs,
                                        learning_rate=learning_rate,
                                        error_convert_f=error_convert_f)

        elif method == 'adam':

            error_trace = optimizer.adam(self.error_f, self.gradient_f,
                                         fargs=[X, T], n_epochs=n_epochs,
                                         learning_rate=learning_rate,
                                         error_convert_f=error_convert_f)

        else:
            raise Exception("method must be 'sgd' or 'adam'")
        self.error_trace = error_trace
        # Update the state that __str__ reports.
        self.total_epochs += n_epochs
        self.trained = True

        # Return neural network object to allow applying other methods after training.
        #  Example:    Y = nnet.train(X, T, 100, 0.01).use(X)
        return self

    def forward_pass(self, X):
        '''X assumed already standardized. Output returned as standardized.'''
        self.Ys = [X]
        for W in self.Ws[:-1]:
            self.Ys.append(np.tanh(self.Ys[-1] @ W[1:, :] + W[0:1, :]))
        last_W = self.Ws[-1]
        self.Ys.append(self.Ys[-1] @ last_W[1:, :] + last_W[0:1, :])
        return self.Ys

    # Function to be minimized by optimizer method, mean squared error
    def error_f(self, X, T):
        Ys = self.forward_pass(X)
        mean_sq_error = np.mean((T - Ys[-1]) ** 2)
        return mean_sq_error

    # Gradient of function to be minimized for use by optimizer method
    def gradient_f(self, X, T):
        '''Assumes forward_pass just called with layer outputs in self.Ys.'''
        error = T - self.Ys[-1]
        n_samples = X.shape[0]
        n_outputs = T.shape[1]
        delta = - error / (n_samples * n_outputs)
        n_layers = len(self.n_hiddens_per_layer) + 1
        # Step backwards through the layers to back-propagate the error (delta)
        for layeri in range(n_layers - 1, -1, -1):
            # gradient of all but bias weights
            self.dE_dWs[layeri][1:, :] = self.Ys[layeri].T @ delta
            # gradient of just the bias weights
            self.dE_dWs[layeri][0:1, :] = np.sum(delta, 0)
            # Back-propagate this layer's delta to previous layer
            delta = delta @ self.Ws[layeri][1:, :].T * (1 - self.Ys[layeri] ** 2)
        return self.all_gradients

    def use(self, X):
        '''X assumed to not be standardized. Return the unstandardized prediction'''

        . . .

Then test it with the test_neuralnetwork function. Your results should be the same as those shown, because the pseudo-random number generator used to initialize the weights is set to start with the same seed.

In [18]:
np.random.seed(42)
np.random.uniform(-0.1, 0.1, size=(2, 2))
array([[-0.02509198,  0.09014286],
       [ 0.04639879,  0.0197317 ]])
In [19]:
np.random.uniform(-0.1, 0.1, size=(2, 2))
array([[-0.06879627, -0.0688011 ],
       [-0.08838328,  0.07323523]])
In [20]:
np.random.seed(42)
np.random.uniform(-0.1, 0.1, size=(2, 2))
array([[-0.02509198,  0.09014286],
       [ 0.04639879,  0.0197317 ]])
In [2]:
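def test_neuralnetwork():

    # Seed the generator so the initial weights, and so the results, are repeatable.
    np.random.seed(42)

    X = np.arange(100).reshape((-1, 1))
    T = np.sin(X * 0.04)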

    n_hiddens = [10, 10]
    n_epochs = 2000
    learning_rate = 0.01
    nnetsgd = NeuralNetwork(1, n_hiddens, 1)
    nnetsgd.train(X, T, n_epochs, learning_rate, method='sgd')

    print()  # skip a line
    nnetadam = NeuralNetwork(1, n_hiddens, 1)
    nnetadam.train(X, T, n_epochs, learning_rate, method='adam')

    Ysgd = nnetsgd.use(X)
    Yadam = nnetadam.use(X)

    plt.subplot(1, 3, 1)
    plt.plot(nnetsgd.error_trace, label='SGD')
    plt.plot(nnetadam.error_trace, label='Adam')
    plt.xlabel('Epoch')
    plt.ylabel('RMSE')
    plt.legend()

    plt.subplot(1, 3, 2)
    plt.plot(T, Ysgd, 'o', label='SGD')
    plt.plot(T, Yadam, 'o', label='Adam')
    a = min(np.min(T), np.min(Ysgd))
    b = max(np.max(T), np.max(Ysgd))
    plt.plot([a, b], [a, b], 'k-', lw=3, alpha=0.5, label='45 degree')
    plt.xlabel('Target')
    plt.ylabel('Predicted')
    plt.legend()

    plt.subplot(1, 3, 3)
    plt.plot(Ysgd, 'o-', label='SGD')
    plt.plot(Yadam, 'o-', label='Adam')
    plt.plot(T, label='Target')
    plt.xlabel('Sample')
    plt.ylabel('Target or Predicted')
    plt.legend()

    plt.tight_layout()

In [3]:
test_neuralnetwork()
sgd: Epoch 200 Error=0.49330
sgd: Epoch 400 Error=0.46833
sgd: Epoch 600 Error=0.44525
sgd: Epoch 800 Error=0.42264
sgd: Epoch 1000 Error=0.39428
sgd: Epoch 1200 Error=0.35526
sgd: Epoch 1400 Error=0.30300
sgd: Epoch 1600 Error=0.24079
sgd: Epoch 1800 Error=0.18020
sgd: Epoch 2000 Error=0.13423

Adam: Epoch 200 Error=0.11620
Adam: Epoch 400 Error=0.00795
Adam: Epoch 600 Error=0.00362
Adam: Epoch 800 Error=0.00268
Adam: Epoch 1000 Error=0.00236
Adam: Epoch 1200 Error=0.00213
Adam: Epoch 1400 Error=0.00200
Adam: Epoch 1600 Error=0.00184
Adam: Epoch 1800 Error=0.00337
Adam: Epoch 2000 Error=0.00162

ReLU Activation Function

Copy and paste your NeuralNetwork class cell here. Then modify it to allow the use of the ReLU activation function, in addition to the tanh activation function that NeuralNetwork currently uses.

To do this:

  • Add the argument activation_function to the NeuralNetwork constructor that can be given values of tanh or relu, with tanh being its default value.
  • Define two new class functions, relu(s) that accepts a matrix of weighted sums and returns the ReLU values, and grad_relu(s) that returns the gradient of relu(s) with respect to each value in s.
  • Add if statements to forward_pass and gradient_f to selectively use the tanh or relu activation function. This is easy if you assign a new class variable in the NeuralNetwork constructor that has the value of the argument activation_function.
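
As a reference point, here is a minimal standalone sketch of the standard ReLU definition and its gradient. These are plain functions, not the class methods asked for above, and the value of the gradient at exactly zero is a convention chosen here.

```python
import numpy as np

def relu(s):
    # Element-wise max(0, s): negative weighted sums are clipped to zero.
    return np.maximum(0, s)

def grad_relu(s):
    # Derivative is 1 where s > 0 and 0 elsewhere (0 at s == 0 by convention).
    return (s > 0).astype(float)

s = np.array([[-2.0, 0.0, 3.0]])
print(relu(s))       # [[0. 0. 3.]]
print(grad_relu(s))  # [[0. 0. 1.]]
```

How these are wired into forward_pass and gradient_f is left to your implementation.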

Now for the Experiments!

Now that your code is working, let's apply it to some interesting data.

Read in the auto-mpg.data that we have used in lectures. Let's apply neural networks to predict mpg using various neural network architectures, numbers of epochs, and our two activation functions.

This time we will partition the data into five parts after randomly rearranging the samples. We will assign the first partition as the validation set, the second one as the test set, and the remaining parts will be vertically stacked to form the training set, as discussed in lecture. We can use the RMSE on the validation set to pick the best values of the number of epochs and the network architecture. Then to report on the RMSE we expect on new data, we will report the test set RMSE.
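
For reference, RMSE is the square root of the mean squared difference between targets and predictions. A minimal sketch, with a hypothetical helper name and made-up values:

```python
import numpy as np

def rmse(T, Y):
    # Root mean squared error over all samples and outputs.
    return np.sqrt(np.mean((T - Y) ** 2))

T = np.array([[1.0], [2.0], [3.0]])
Y = np.array([[1.0], [2.0], [5.0]])
print(rmse(T, Y))   # sqrt(4 / 3), about 1.1547
```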

Read in the auto-mpg.data using pandas and remove all samples that contain missing values. You should end up with 392 samples.

Now randomly reorder the samples. First run np.random.seed(42) to guarantee that we all use the same random ordering of samples.

Partition the data into five folds, as shown in lecture. To do this, complete the following function.

In [ ]:
def partition(X, T, n_folds, random_shuffle=True):
    . . .
    return Xtrain, Ttrain, Xvalidate, Tvalidate, Xtest, Ttest
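
The fold layout described above can be sketched on row indices alone. This is only an illustration under stated assumptions: the function name is hypothetical, np.array_split is one possible way to size the folds, and a local generator is used instead of np.random.seed, so the fold sizes and ordering may differ from what the lecture notes and the grader expect.

```python
import numpy as np

def fold_index_sketch(n_samples, n_folds, rng):
    # Shuffle the row indices, then cut them into n_folds nearly equal pieces.
    rows = rng.permutation(n_samples)
    folds = np.array_split(rows, n_folds)
    validate_rows = folds[0]             # first fold
    test_rows = folds[1]                 # second fold
    train_rows = np.hstack(folds[2:])    # remaining folds (needs n_folds >= 3)
    return train_rows, validate_rows, test_rows

rng = np.random.default_rng(0)
train_rows, validate_rows, test_rows = fold_index_sketch(392, 5, rng)
print(len(train_rows), len(validate_rows), len(test_rows))
```

Indexing X and T with the same row arrays keeps each sample paired with its target.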

Write a function named run_experiment that uses three nested for loops to try different values of the parameters n_epochs, n_hidden_units_per_layer and activation_function which will just be either tanh or relu. Don't forget to try [0] for one of the values of n_hidden_units_per_layer to include a linear model in your tests. For each set of parameter values, create and train a neural network using the 'adam' optimization method and use the neural network on the training, validation and test sets. Collect the parameter values and the RMSE for the training, validation, and test set in a list. When your loops are done, construct a pandas.DataFrame from the list of results, for easy printing. The first five lines might look like:

   epochs        nh    lr act func  RMSE Train  RMSE Val  RMSE Test
0    1000       [0]  0.01     tanh    3.356401  3.418705   3.116480
1    1000       [0]  0.01     relu    3.354528  3.428324   3.125064
2    1000      [20]  0.01     tanh    1.992509  2.355746   2.459506
3    1000      [20]  0.01     relu    2.448536  2.026954   2.581707
4    1000  [20, 20]  0.01     tanh    1.518916  2.468188   3.118376

Your function must return a pandas.DataFrame like this one.

Before starting the nested for loops, your run_experiment function must first call your partition function to form the training, validation and test sets.

An example call of your function would look like this:

In [ ]:
result_df = run_experiment(X, T, n_folds=5, 
                           n_epochs_choices=[1000, 2000],
                           n_hidden_units_per_layer_choices=[[0], [10], [100, 10]],
                           activation_function_choices=['tanh', 'relu'])

Find the lowest value of RMSE Val in your table and report the RMSE Test and the parameter values that produced this. This is your expected error in predicted miles per gallon. Discuss how good this prediction is.
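
One way to pull out that row with pandas is idxmin. The sketch below rebuilds the five example rows shown earlier, so the numbers are the illustrative ones from that table, not results you should expect to reproduce.

```python
import pandas as pd

result_df = pd.DataFrame(
    [[1000, [0], 0.01, 'tanh', 3.356401, 3.418705, 3.116480],
     [1000, [0], 0.01, 'relu', 3.354528, 3.428324, 3.125064],
     [1000, [20], 0.01, 'tanh', 1.992509, 2.355746, 2.459506],
     [1000, [20], 0.01, 'relu', 2.448536, 2.026954, 2.581707],
     [1000, [20, 20], 0.01, 'tanh', 1.518916, 2.468188, 3.118376]],
    columns=['epochs', 'nh', 'lr', 'act func',
             'RMSE Train', 'RMSE Val', 'RMSE Test'])

# idxmin returns the index label of the smallest validation RMSE.
best = result_df.loc[result_df['RMSE Val'].idxmin()]
print(best)
```

For these example rows the selected row is index 3, the [20] network with relu, even though a different row has the lowest test RMSE. Selecting on the validation set keeps the test RMSE an honest estimate.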

Plot the RMSE values for training, validation and test sets versus the combined parameter values of number of epochs and network architecture. Make one plot for tanh as the activation function and a second one for relu. Your plots should look like this, but with different RMSE values, and will of course be different if you choose different network architectures and numbers of epochs.

Describe at least three different observations you make about these plots. What do you find interesting?

  1. ...
  2. ...
  3. ...

Grading and Check-in

Your notebook will be partially run and graded automatically. Test this grading process by first downloading A2grader.zip and extracting A2grader.py from it. Run the code in the following cell to demonstrate an example grading session. You should see a perfect execution score of 70/70 if your functions are defined correctly. The remaining 30 points will be based on other testing, the results you obtain, and your discussions.

A different, but similar, grading script will be used to grade your checked-in notebook. It will include additional tests. You should design and perform additional tests on all of your functions to be sure they run correctly before checking in your notebook.
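
One additional test you might write is a finite-difference check of a gradient function. The sketch below is an assumption about how such a check could look, not part of the grader. It reuses the parabola from test_optimizers, and numerical_gradient is a hypothetical helper.

```python
import numpy as np

def numerical_gradient(f, w, h=1e-6):
    # Central differences, perturbing one element of w at a time (in place).
    grad = np.zeros_like(w)
    for i in range(w.size):
        original = w[i]
        w[i] = original + h
        f_plus = f()
        w[i] = original - h
        f_minus = f()
        w[i] = original          # restore the weight
        grad[i] = (f_plus - f_minus) / (2 * h)
    return grad

w = np.array([0.0])
wmin = 5

def parabola():
    return ((w - wmin) ** 2)[0]

def parabola_gradient():
    return 2 * (w - wmin)

# The analytic and numerical gradients should agree closely.
print(parabola_gradient(), numerical_gradient(parabola, w))
```

The same idea applies to a network: perturb each entry of all_weights in turn and compare the result with what gradient_f returns.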

For the grading script to run correctly, you must first name this notebook Lastname-A2.ipynb, with Lastname being your last name, and then save this notebook and check it in at the A2 assignment link in our Canvas web page.

In [4]:
%run -i A2grader.py
  w = np.array([0.0])
  def cubic(wmin):
      return (w[0] - wmin) ** 3 + (w[0] - wmin) ** 2
  def grad_cubic(wmin):
      return 3 * (w[0] - wmin) ** 2 + 2 * (w[0] - wmin)
  wmin = 0.5
  opt = Optimizers(w)
  errors_sgd = opt.sgd(cubic, grad_cubic, [wmin], 100, 0.01)

sgd: Epoch 10 Error=0.11889
sgd: Epoch 20 Error=0.11092
sgd: Epoch 30 Error=0.10176
sgd: Epoch 40 Error=0.09162
sgd: Epoch 50 Error=0.08081
sgd: Epoch 60 Error=0.06972
sgd: Epoch 70 Error=0.05879
sgd: Epoch 80 Error=0.04844
sgd: Epoch 90 Error=0.03901
sgd: Epoch 100 Error=0.03072

--- 10/10 points. Returned correct value.

  w = np.array([0.0])
  def cubic(wmin):
      return (w[0] - wmin) ** 3 + (w[0] - wmin) ** 2
  def grad_cubic(wmin):
      return 3 * (w[0] - wmin) ** 2 + 2 * (w[0] - wmin)
  wmin = 0.5
  opt = Optimizers(w)
  errors_adam = opt.adam(cubic, grad_cubic, [wmin], 100, 0.01)

Adam: Epoch 10 Error=0.09899
Adam: Epoch 20 Error=0.06515
Adam: Epoch 30 Error=0.03305
Adam: Epoch 40 Error=0.01054
Adam: Epoch 50 Error=0.00110
Adam: Epoch 60 Error=0.00009
Adam: Epoch 70 Error=0.00044
Adam: Epoch 80 Error=0.00017
Adam: Epoch 90 Error=0.00001
Adam: Epoch 100 Error=0.00001

--- 10/10 points. Returned correct value.

    nnet = NeuralNetwork(2, [10], 1)
    X = np.arange(40).reshape(20, 2)
    T = X[:, 0:1] * X[:, 1:]
    nnet.train(X, T, 1000, 0.01, method='adam')

Adam: Epoch 100 Error=62.64463
Adam: Epoch 200 Error=35.83151
Adam: Epoch 300 Error=24.70152
Adam: Epoch 400 Error=18.70035
Adam: Epoch 500 Error=14.78695
Adam: Epoch 600 Error=11.85496
Adam: Epoch 700 Error=9.50694
Adam: Epoch 800 Error=7.55042
Adam: Epoch 900 Error=5.89951
Adam: Epoch 1000 Error=4.54197

--- 20/20 points. Returned correct value.

    # Using X and T from previous test
    a, b, c, d, e, f = partition(X, T, 3)

--- 10/10 points. Returned correct values.


    result = run_experiment(X, T, 4,
                            n_epochs_choices=[100, 200],
                            n_hidden_units_per_layer_choices=[[0], [10]],
                            activation_function_choices=['tanh', 'relu'])

    first_test_rmse = result.iloc[0]['RMSE Test']

Adam: Epoch 1 Error=89.53352
Adam: Epoch 2 Error=87.94694
Adam: Epoch 3 Error=87.01938
Adam: Epoch 4 Error=86.94322
Adam: Epoch 5 Error=87.30763
Adam: Epoch 6 Error=87.61791
Adam: Epoch 7 Error=87.67736
Adam: Epoch 8 Error=87.53176
Adam: Epoch 9 Error=87.29965
Adam: Epoch 10 Error=87.07802
Adam: Epoch 1 Error=348.51614
Adam: Epoch 2 Error=339.20790
Adam: Epoch 3 Error=329.92686
Adam: Epoch 4 Error=320.68025
Adam: Epoch 5 Error=311.47569
Adam: Epoch 6 Error=302.32116
Adam: Epoch 7 Error=293.22497
Adam: Epoch 8 Error=284.19583
Adam: Epoch 9 Error=275.24278
Adam: Epoch 10 Error=266.37524
Adam: Epoch 1 Error=412.14380
Adam: Epoch 2 Error=386.30379
Adam: Epoch 3 Error=364.45416
Adam: Epoch 4 Error=346.36303
Adam: Epoch 5 Error=331.88276
Adam: Epoch 6 Error=320.53967
Adam: Epoch 7 Error=311.73030
Adam: Epoch 8 Error=305.56897
Adam: Epoch 9 Error=300.97797
Adam: Epoch 10 Error=297.02832
Adam: Epoch 1 Error=350.41235
Adam: Epoch 2 Error=338.88922
Adam: Epoch 3 Error=327.98057
Adam: Epoch 4 Error=317.90531
Adam: Epoch 5 Error=308.92094
Adam: Epoch 6 Error=300.45091
Adam: Epoch 7 Error=292.59161
Adam: Epoch 8 Error=285.32170
Adam: Epoch 9 Error=278.66623
Adam: Epoch 10 Error=272.49345
Adam: Epoch 2 Error=261.01604
Adam: Epoch 4 Error=251.25526
Adam: Epoch 6 Error=242.60641
Adam: Epoch 8 Error=234.81381
Adam: Epoch 10 Error=227.31028
Adam: Epoch 12 Error=219.64909
Adam: Epoch 14 Error=211.79640
Adam: Epoch 16 Error=203.96324
Adam: Epoch 18 Error=196.37226
Adam: Epoch 20 Error=189.12872
Adam: Epoch 2 Error=191.08839
Adam: Epoch 4 Error=183.35901
Adam: Epoch 6 Error=175.86820
Adam: Epoch 8 Error=168.46335
Adam: Epoch 10 Error=161.30874
Adam: Epoch 12 Error=154.28219
Adam: Epoch 14 Error=147.54895
Adam: Epoch 16 Error=141.06177
Adam: Epoch 18 Error=134.85946
Adam: Epoch 20 Error=129.01892
Adam: Epoch 2 Error=397.50671
Adam: Epoch 4 Error=355.62556
Adam: Epoch 6 Error=329.88713
Adam: Epoch 8 Error=316.00098
Adam: Epoch 10 Error=308.33035
Adam: Epoch 12 Error=301.86276
Adam: Epoch 14 Error=293.88322
Adam: Epoch 16 Error=282.87490
Adam: Epoch 18 Error=268.86735
Adam: Epoch 20 Error=252.90209
Adam: Epoch 2 Error=285.86603
Adam: Epoch 4 Error=268.58494
Adam: Epoch 6 Error=253.57117
Adam: Epoch 8 Error=240.31659
Adam: Epoch 10 Error=228.82310
Adam: Epoch 12 Error=218.71252
Adam: Epoch 14 Error=209.74831
Adam: Epoch 16 Error=201.76172
Adam: Epoch 18 Error=194.70368
Adam: Epoch 20 Error=188.43514

--- 20/20 points. Returned correct values.

a2 Execution Grade is 70 / 70

 __ / 30 Discussion of at least three observations about
your results.  Use at least 10 sentences in your discussion.

a2 FINAL GRADE is  _  / 100

Extra Credit:
Add the Swish activation function as a third choice in your train function in your NeuralNetwork class. A little googling will find definitions of it and its gradient.

Use your run_experiment function to compare results for all three activation functions. Discuss the results.

a2 EXTRA CREDIT is 0 / 1

Extra Credit: 1 point

Add the Swish activation function as a third choice in your train function in your NeuralNetwork class. A little googling will find definitions of it and its gradient. Start with this article.

Use your run_experiment function to compare results for all three activation functions. Discuss the results.