Partitioning Data to Compare Neural Network Performance

To fairly compare the performance of machine learning models, we must partition data into training, validation, and testing parts. Remember, "partition" means disjoint subsets, ones whose pairwise intersections are empty. We will train our model on the training partition, calculate its performance on the validation partition after every epoch, and keep the parameter values that resulted in the best performance on the validation partition. We then apply the trained model to the testing partition and report its performance. Only this last calculation, performance on the testing data, is useful for predicting how well our model will do on new data.
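
In pseudocode, with hypothetical helpers update_parameters, error, and copy_of standing in for implementation details, the workflow looks roughly like this:

best_val_error = float('inf')
for epoch in range(n_epochs):
    update_parameters(model, Xtrain, Ttrain)         # one epoch of training
    val_error = error(model.use(Xval), Tval)         # monitor validation performance
    if val_error < best_val_error:
        best_val_error = val_error
        best_parameters = copy_of(model.parameters)  # remember the best model so far
model.parameters = best_parameters                   # restore the best model
test_error = error(model.use(Xtest), Ttest)          # the only estimate reported for new data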

One way to partition data into training, validation, and testing partitions is to simply randomly assign samples from the available data to these three disjoint subsets. We can specify the size of each subset as a fraction of the total number of data samples.

It would be nice to have such a function, which could be called like this:

Xtrain, Ttrain, Xval, Tval, Xtest, Ttest = partition(X, T, fractions=(0.6, 0.2, 0.2), 
                                                     shuffle=True, classification=True)

This would randomly take 60% of the samples as training data, 20% as validation data, and 20% as testing data.

In fact, this would be so nice to have ... here it is!

In [2]:
import numpy as np
import matplotlib.pyplot as plt
In [3]:
def partition(X, T, fractions, shuffle=True, classification=False):
    """Usage: Xtrain,Train,Xvalidate,Tvalidate,Xtest,Ttest = partition(X,T,(0.6,0.2,0.2),classification=True)
      X is nSamples x nFeatures.
      fractions can have just two values, for partitioning into train and test only
      If classification=True, T is target class as integer. Data partitioned
        according to class proportions.
        """
    train_fraction = fractions[0]
    if len(fractions) == 2:
        # Skip the validation step
        validate_fraction = 0
        test_fraction = fractions[1]
    else:
        validate_fraction = fractions[1]
        test_fraction = fractions[2]
        
    row_indices = np.arange(X.shape[0])
    if shuffle:
        np.random.shuffle(row_indices)
    
    if not classification:
        # regression, so do not partition according to targets.
        n = X.shape[0]
        n_train = round(train_fraction * n)
        n_validate = round(validate_fraction * n)
        n_test = round(test_fraction * n)
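        # rounding can make the three counts sum to more than n; shrink the test count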
        if n_train + n_validate + n_test > n:
            n_test = n - n_train - n_validate
        Xtrain = X[row_indices[:n_train], :]
        Ttrain = T[row_indices[:n_train], :]
        if n_validate > 0:
            Xvalidate = X[row_indices[n_train:n_train + n_validate], :]
            Tvalidate = T[row_indices[n_train:n_train + n_validate], :]
        Xtest = X[row_indices[n_train + n_validate:n_train + n_validate + n_test], :]
        Ttest = T[row_indices[n_train + n_validate:n_train + n_validate + n_test], :]
        
    else:
        # classifying, so partition data according to target class
        classes = np.unique(T)
        train_indices = []
        validate_indices = []
        test_indices = []
        for c in classes:
            # positions within the shuffled row_indices of samples in class c
            rows_this_class = np.where(T[row_indices, :] == c)[0]
            # collect row indices for class c for each partition
            n = len(rows_this_class)
            n_train = round(train_fraction * n)
            n_validate = round(validate_fraction * n)
            n_test = round(test_fraction * n)
            if n_train + n_validate + n_test > n:
                n_test = n - n_train - n_validate
            train_indices += row_indices[rows_this_class[:n_train]].tolist()
            if n_validate > 0:
                validate_indices += row_indices[rows_this_class[n_train:n_train + n_validate]].tolist()
            test_indices += row_indices[rows_this_class[n_train + n_validate:n_train + n_validate + n_test]].tolist()
        Xtrain = X[train_indices, :]
        Ttrain = T[train_indices, :]
        if n_validate > 0:
            Xvalidate = X[validate_indices, :]
            Tvalidate = T[validate_indices, :]
        Xtest = X[test_indices, :]
        Ttest = T[test_indices, :]
    if n_validate > 0:
        return Xtrain, Ttrain, Xvalidate, Tvalidate, Xtest, Ttest
    else:
        return Xtrain, Ttrain, Xtest, Ttest
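
The classification=True branch partitions each class separately, so the class proportions are preserved across the three subsets. Here is a quick, hypothetical check with made-up integer targets, two classes of 10 samples each:

Xc = np.arange(20).reshape(-1, 1)
Tc = np.array([0] * 10 + [1] * 10).reshape(-1, 1)  # two classes, 10 samples each
Xctrain, Tctrain, Xcval, Tcval, Xctest, Tctest = partition(Xc, Tc, (0.6, 0.2, 0.2),
                                                           shuffle=True, classification=True)
# Each partition now contains roughly equal numbers of both classes.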
In [4]:
X = np.arange(10).reshape(-1, 1)
T = X + 0.1
X, T
Out[4]:
(array([[0],
        [1],
        [2],
        [3],
        [4],
        [5],
        [6],
        [7],
        [8],
        [9]]),
 array([[0.1],
        [1.1],
        [2.1],
        [3.1],
        [4.1],
        [5.1],
        [6.1],
        [7.1],
        [8.1],
        [9.1]]))
In [5]:
def print_them(Xtrain, Ttrain, Xval, Tval, Xtest, Ttest):
    print('Train')
    print(np.hstack((Xtrain, Ttrain)))
    print('Val')
    print(np.hstack((Xval, Tval)))
    print('Test')
    print(np.hstack((Xtest, Ttest)))

Run the following cell several times to see different partitions.

In [6]:
Xtrain, Ttrain, Xval, Tval, Xtest, Ttest = partition(X, T, (0.6, 0.2, 0.2), shuffle=True)
print_them(Xtrain, Ttrain, Xval, Tval, Xtest, Ttest)
Train
[[4.  4.1]
 [8.  8.1]
 [9.  9.1]
 [0.  0.1]
 [6.  6.1]
 [2.  2.1]]
Val
[[5.  5.1]
 [1.  1.1]]
Test
[[3.  3.1]
 [7.  7.1]]

Now that we can partition the data as many times as we like, let's work on another function that returns the training, validation, and testing performance for multiple random partitions.

But first, we need some data to play with.

In [7]:
import pandas as pd
import os

if os.path.isfile('automobile.csv'):
    print("Reading data from 'automobile.csv'.")
    automobile = pd.read_csv('automobile.csv')
else:
    print('Downloading auto-mpg.data from UCI ML Repository.')
    automobile = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data',
                             header=None, sep=r'\s+', na_values='?',
                             usecols=range(8))
    automobile.columns = ('mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
                          'acceleration', 'year', 'origin')

    print(f'Number of rows in original data file: {len(automobile)}.')
    automobile = automobile.dropna(axis=0)
    print(f'Number of rows after dropping rows with missing values: {len(automobile)}.')

    automobile.to_csv('automobile.csv', index=False)  # so row numbers are not written to file
Reading data from 'automobile.csv'.
In [8]:
automobile
Out[8]:
      mpg  cylinders  displacement  horsepower  weight  acceleration  year  origin
0    18.0          8         307.0       130.0  3504.0          12.0    70       1
1    15.0          8         350.0       165.0  3693.0          11.5    70       1
2    18.0          8         318.0       150.0  3436.0          11.0    70       1
3    16.0          8         304.0       150.0  3433.0          12.0    70       1
4    17.0          8         302.0       140.0  3449.0          10.5    70       1
..    ...        ...           ...         ...     ...           ...   ...     ...
387  27.0          4         140.0        86.0  2790.0          15.6    82       1
388  44.0          4          97.0        52.0  2130.0          24.6    82       2
389  32.0          4         135.0        84.0  2295.0          11.6    82       1
390  28.0          4         120.0        79.0  2625.0          18.6    82       1
391  31.0          4         119.0        82.0  2720.0          19.4    82       1

392 rows × 8 columns
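
Note that in the runs transcribed below, X and T still refer to the small toy arrays defined earlier. To repeat the comparison with the automobile data instead, we would first extract NumPy arrays from the DataFrame, along these lines (a sketch, assuming mpg is the target and the remaining columns are the inputs):

T = automobile[['mpg']].values             # targets: n_samples x 1
X = automobile.drop(columns='mpg').values  # inputs: the remaining seven columns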

Now we need a function that creates and trains a new neural network for each partition, and returns the results.

In [12]:
def multiple_runs_regression(n_partitions, X, T, fractions, n_hiddens_list, n_epochs, learning_rate):
    
    def rmse(Y, T):
        return np.sqrt(np.mean((T - Y) ** 2))

    print(f'Hiddens {n_hiddens_list}: Repetition', end=' ')
    results = []
    for rep in range(n_partitions):
        
        print(f'{rep + 1}', end=' ')
        
        Xtrain, Ttrain, Xval, Tval, Xtest, Ttest = partition(X, T, fractions,
                                                             shuffle=True, classification=False)
        
        nnet = NeuralNetworkTorch(X.shape[1], n_hiddens_list, T.shape[1])
        nnet.train(Xtrain, Ttrain, n_epochs, learning_rate, method='adam', Xval=Xval, Tval=Tval, verbose=False)
        
        Ytrain = nnet.use(Xtrain)
        Yval = nnet.use(Xval)
        Ytest = nnet.use(Xtest)
        
        structure = str(n_hiddens_list)
        results.extend([[structure, 'train', rmse(Ytrain, Ttrain)],
                        [structure, 'validation', rmse(Yval, Tval)],
                        [structure, 'test', rmse(Ytest, Ttest)]])
    print()
    return results

Let's use the NeuralNetworkTorch that we defined in Lecture Notes 16.1. You will use that code in your solution to A5.
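
If you do not yet have that code at hand, here is a minimal stand-in sketch with the interface assumed above: a constructor taking n_inputs, n_hiddens_list, and n_outputs, a train method accepting method='adam' and optional validation data, and a use method returning a NumPy array. This is a simplified assumption, not the Lecture Notes 16.1 implementation; in particular it ignores Xval and Tval instead of restoring the weights with the best validation error.

import numpy as np
import torch

class NeuralNetworkTorch:

    def __init__(self, n_inputs, n_hiddens_list, n_outputs):
        # Fully-connected network with Tanh hidden layers.
        # An empty n_hiddens_list yields a purely linear model.
        layers = []
        ni = n_inputs
        for nh in n_hiddens_list:
            layers.extend([torch.nn.Linear(ni, nh), torch.nn.Tanh()])
            ni = nh
        layers.append(torch.nn.Linear(ni, n_outputs))
        self.model = torch.nn.Sequential(*layers)

    def train(self, X, T, n_epochs, learning_rate, method='adam',
              Xval=None, Tval=None, verbose=True):
        # method, Xval, and Tval are accepted only for interface compatibility;
        # this sketch always uses Adam and keeps the final weights.
        Xt = torch.from_numpy(X.astype(np.float32))
        Tt = torch.from_numpy(T.astype(np.float32))
        optimizer = torch.optim.Adam(self.model.parameters(), lr=learning_rate)
        mse = torch.nn.MSELoss()
        for epoch in range(n_epochs):
            optimizer.zero_grad()
            loss = mse(self.model(Xt), Tt)
            loss.backward()
            optimizer.step()
            if verbose and (epoch + 1) % max(1, n_epochs // 10) == 0:
                print(f'Epoch {epoch + 1}: MSE {loss.item():.5f}')
        return self

    def use(self, X):
        # Predictions as a NumPy array, without tracking gradients.
        with torch.no_grad():
            return self.model(torch.from_numpy(X.astype(np.float32))).numpy()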

In [13]:
from A5mysolution import *
In [14]:
fractions = (0.6, 0.2, 0.2)
n_hiddens_list = [100, 100]
n_epochs = 1000
learning_rate = 0.001

n_partitions = 10
results = multiple_runs_regression(n_partitions, X, T, fractions, n_hiddens_list, n_epochs, learning_rate)
Hiddens [100, 100]: Repetition 1 2 3 4 5 6 7 8 9 10 
In [15]:
results
Out[15]:
[['[100, 100]', 'train', 0.004062758725166755],
 ['[100, 100]', 'validation', 0.1353611102136738],
 ['[100, 100]', 'test', 0.030880251632869154],
 ['[100, 100]', 'train', 0.00823247983172531],
 ['[100, 100]', 'validation', 0.11043832114743887],
 ['[100, 100]', 'test', 0.028416360071150438],
 ['[100, 100]', 'train', 0.40122134940580084],
 ['[100, 100]', 'validation', 0.05924082369573957],
 ['[100, 100]', 'test', 0.23956245872430842],
 ['[100, 100]', 'train', 0.1901793548782251],
 ['[100, 100]', 'validation', 0.0063642058693349065],
 ['[100, 100]', 'test', 0.5241358454319545],
 ['[100, 100]', 'train', 0.008299690420836031],
 ['[100, 100]', 'validation', 0.08765151014254802],
 ['[100, 100]', 'test', 0.29181913369227463],
 ['[100, 100]', 'train', 0.014493253494874988],
 ['[100, 100]', 'validation', 0.00300954120162004],
 ['[100, 100]', 'test', 0.014333701341829257],
 ['[100, 100]', 'train', 0.0034862982859318057],
 ['[100, 100]', 'validation', 0.04208833444664035],
 ['[100, 100]', 'test', 0.11987970658415248],
 ['[100, 100]', 'train', 0.010900611299816763],
 ['[100, 100]', 'validation', 0.15167900315897329],
 ['[100, 100]', 'test', 0.025156286379082154],
 ['[100, 100]', 'train', 0.011939286853332511],
 ['[100, 100]', 'validation', 0.002145348144511191],
 ['[100, 100]', 'test', 0.0982790440174706],
 ['[100, 100]', 'train', 0.4793987304516573],
 ['[100, 100]', 'validation', 0.2912593675536652],
 ['[100, 100]', 'test', 1.4765085996927851]]

The above form of the results seems a little wordy, with the network structure and the words 'train', 'validation' and 'test' repeated so often. But you will soon see why I chose this form.

In [19]:
resultsdf = pd.DataFrame(results, columns=('Structure', 'Partition', 'RMSE'))

Violin plots can be a helpful way to visualize the distribution of values from multiple runs; a violin plot shows the full shape of a distribution, revealing more than a boxplot does. Matplotlib includes a violinplot function, but we will use the seaborn plotting package's version of violinplot, which works directly with a DataFrame and builds the legend and x-axis labels for us.

Here is a simple example of how to make a violin plot showing the train, validation, and test set error distributions for different hidden layer structures.

In [17]:
import seaborn as sns

sns.violinplot(x='Structure', y='RMSE', hue='Partition', data=resultsdf)
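# draw dashed vertical separators between the structure groups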
for x in range(len(n_hiddens_list) - 1):
    plt.axvline(x + 0.5, color='r', linestyle='--', alpha=0.5)

To compare the performance of different neural networks, just wrap the call to multiple_runs_regression in a for loop over different structures and collect the results.

In [18]:
fractions = (0.6, 0.2, 0.2)
n_hiddens_list = [[], [2], [5], [10], [20, 20], [20, 20, 20]]  # Notice the first one: [] means no hidden layers, a linear model
n_epochs = 1000
learning_rate = 0.001

n_partitions = 10

results = []
for nh in n_hiddens_list:
    results.extend(multiple_runs_regression(n_partitions, X, T, fractions, nh, n_epochs, learning_rate))
    
resultsdf = pd.DataFrame(results, columns=('Structure', 'Partition', 'RMSE'))
Hiddens []: Repetition 1 2 3 4 5 6 7 8 9 10 
Hiddens [2]: Repetition 1 2 3 4 5 6 7 8 9 10 
Hiddens [5]: Repetition 1 2 3 4 5 6 7 8 9 10 
Hiddens [10]: Repetition 1 2 3 4 5 6 7 8 9 10 
Hiddens [20, 20]: Repetition 1 2 3 4 5 6 7 8 9 10 
Hiddens [20, 20, 20]: Repetition 1 2 3 4 5 6 7 8 9 10 
In [20]:
resultsdf
Out[20]:
        Structure   Partition      RMSE
0              []       train  2.919725
1              []  validation  0.894220
2              []        test  2.252018
3              []       train  1.158353
4              []  validation  2.631544
..            ...         ...       ...
175  [20, 20, 20]  validation  0.184229
176  [20, 20, 20]        test  0.045723
177  [20, 20, 20]       train  0.008183
178  [20, 20, 20]  validation  0.041767
179  [20, 20, 20]        test  0.102511

180 rows × 3 columns

In [21]:
plt.figure(figsize=(12, 10))
sns.violinplot(x='Structure', y='RMSE', hue='Partition', data=resultsdf)
for x in range(len(n_hiddens_list) - 1):
    plt.axvline(x + 0.5, color='r', linestyle='--', alpha=0.5)
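
As a numeric companion to the violin plot, the same long-form DataFrame can be summarized with pandas, for example by the mean RMSE for each structure and partition:

resultsdf.groupby(['Structure', 'Partition'])['RMSE'].mean().unstack()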