Low-Dimensional Representations of Data

Download the QSAR data set from this UCI ML Repository site. It consists of eight measurements of water quality that may affect a ninth measurement, of aquatic toxicity towards Daphnia Magna.

In [1]:
!head qsar_aquatic_toxicity.csv








In [2]:
!wc qsar_aquatic_toxicity.csv
  546   546 22903 qsar_aquatic_toxicity.csv
In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [4]:
datadf = pd.read_csv('qsar_aquatic_toxicity.csv', delimiter=';', header=None)
datadf
Out[4]:
0 1 2 3 4 5 6 7 8
0 0.00 0.000 0 2.419 1.225 0.667 0 0 3.740
1 0.00 0.000 0 2.638 1.401 0.632 0 0 4.330
2 9.23 11.000 0 5.799 2.930 0.486 0 0 7.019
3 9.23 11.000 0 5.453 2.887 0.495 0 0 6.723
4 9.23 11.000 0 4.068 2.758 0.695 0 0 5.979
... ... ... ... ... ... ... ... ... ...
541 24.06 35.776 2 3.326 2.837 0.849 2 0 4.651
542 9.23 11.000 0 3.275 2.727 0.874 0 0 3.953
543 0.00 0.000 0 5.165 3.111 0.732 0 0 6.219
544 13.14 9.507 0 2.859 2.614 0.827 0 0 4.995
545 0.00 0.000 0 2.255 1.800 0.917 0 0 2.480

546 rows × 9 columns

What is in each column? Must go back to the UCI Repository.

Input Columns:

  1. TPSA(Tot) (Molecular properties)
  2. SAacc (Molecular properties)
  3. H-050 (Atom-centred fragments)
  4. MLOGP (Molecular properties)
  5. RDCHI (Connectivity indices)
  6. GATS1p (2D autocorrelations)
  7. nN (Constitutional indices)
  8. C-040 (Atom-centred fragments)

Target Column:

  1. LC50, the concentration that causes death in 50% of test D. magna over a test duration of 48 hours
In [5]:
names = ('TPSA', 'SAacc', 'H-050', 'MLOGP', 'RDCHI', 'GATS1p', 'nN', 'C-40', 'LC50')
datadf.columns = names
datadf
Out[5]:
TPSA SAacc H-050 MLOGP RDCHI GATS1p nN C-40 LC50
0 0.00 0.000 0 2.419 1.225 0.667 0 0 3.740
1 0.00 0.000 0 2.638 1.401 0.632 0 0 4.330
2 9.23 11.000 0 5.799 2.930 0.486 0 0 7.019
3 9.23 11.000 0 5.453 2.887 0.495 0 0 6.723
4 9.23 11.000 0 4.068 2.758 0.695 0 0 5.979
... ... ... ... ... ... ... ... ... ...
541 24.06 35.776 2 3.326 2.837 0.849 2 0 4.651
542 9.23 11.000 0 3.275 2.727 0.874 0 0 3.953
543 0.00 0.000 0 5.165 3.111 0.732 0 0 6.219
544 13.14 9.507 0 2.859 2.614 0.827 0 0 4.995
545 0.00 0.000 0 2.255 1.800 0.917 0 0 2.480

546 rows × 9 columns

In [6]:
data = datadf.values
data
Out[6]:
array([[ 0.   ,  0.   ,  0.   , ...,  0.   ,  0.   ,  3.74 ],
       [ 0.   ,  0.   ,  0.   , ...,  0.   ,  0.   ,  4.33 ],
       [ 9.23 , 11.   ,  0.   , ...,  0.   ,  0.   ,  7.019],
       ...,
       [ 0.   ,  0.   ,  0.   , ...,  0.   ,  0.   ,  6.219],
       [13.14 ,  9.507,  0.   , ...,  0.   ,  0.   ,  4.995],
       [ 0.   ,  0.   ,  0.   , ...,  0.   ,  0.   ,  2.48 ]])

For such a simple csv file, we can also use the numpy function loadtxt.

In [7]:
data = np.loadtxt('qsar_aquatic_toxicity.csv', delimiter=';')
data
Out[7]:
array([[ 0.   ,  0.   ,  0.   , ...,  0.   ,  0.   ,  3.74 ],
       [ 0.   ,  0.   ,  0.   , ...,  0.   ,  0.   ,  4.33 ],
       [ 9.23 , 11.   ,  0.   , ...,  0.   ,  0.   ,  7.019],
       ...,
       [ 0.   ,  0.   ,  0.   , ...,  0.   ,  0.   ,  6.219],
       [13.14 ,  9.507,  0.   , ...,  0.   ,  0.   ,  4.995],
       [ 0.   ,  0.   ,  0.   , ...,  0.   ,  0.   ,  2.48 ]])
In [8]:
data.shape
Out[8]:
(546, 9)
In [9]:
plt.figure(figsize=(10, 10))
for i in range(data.shape[1]):
    plt.subplot(3, 3, i+1)
    plt.plot(data[:, i])
    plt.ylabel(names[i])
plt.tight_layout()

pd.plotting.scatter_matrix(datadf, figsize=(15, 15), marker = 'o', hist_kwds = {'bins': 15}, s = 10, alpha = 0.8);

Let's say we want to use one tenth of the data as test data.

In [10]:
546 * 0.1
Out[10]:
54.6
In [11]:
np.random.shuffle(data)

Xtrain = data[:-55, :-1]
Ttrain = data[:-55, -1:]
Xtest = data[-55:, :-1]
Ttest = data[-55:, -1:]

Xtrain.shape, Ttrain.shape, Xtest.shape, Ttest.shape
Out[11]:
((491, 8), (491, 1), (55, 8), (55, 1))
In [12]:
plt.plot(Ttrain)
Out[12]:
[<matplotlib.lines.Line2D at 0x7f27867506a0>]
In [13]:
import A3mysolution as nn
In [14]:
nnet = nn.NeuralNetwork(Xtrain.shape[1], [10], 1)
nnet.train(Xtrain, Ttrain, 2000, method='scg')
plt.plot(nnet.get_error_trace());
SCG: Epoch 200 Error=0.85848
SCG: Epoch 400 Error=0.77550
SCG: Epoch 600 Error=0.75750
SCG: Epoch 800 Error=0.74818
SCG: Epoch 1000 Error=0.74118
SCG: Epoch 1200 Error=0.73875
SCG: Epoch 1400 Error=0.73747
SCG: Epoch 1600 Error=0.73664
SCG: Epoch 1800 Error=0.73590
SCG: Epoch 2000 Error=0.73524
In [15]:
def rmse(Y, T):
    return np.sqrt(np.mean((T-Y)**2))
In [16]:
rmse(nnet.use(Xtrain), Ttrain)
Out[16]:
0.7352397091869803

How can we also create a plot of test data error versus epoch? Our optimizers.py code doesn't allow this.

One way is to call train multiple times for subsets of epochs and call rmse on train and test data to create error versus epoch lists.

In [17]:
n_epochs = 2000
n_epochs_per_train = 20
n_reps = n_epochs // n_epochs_per_train
n_epochs_last_train = n_epochs - n_reps * n_epochs_per_train
if n_epochs_last_train > 0:
    n_reps += 1

rmses = []
nnet = nn.NeuralNetwork(Xtrain.shape[1], [10], 1)
for reps in range(n_reps):
    n_epoch = n_epochs_per_train if reps < n_reps-1 else n_epochs_last_train
    nnet.train(Xtrain, Ttrain, n_epochs_per_train, method='scg', verbose=False)
    rmses.append([rmse(nnet.use(Xtrain), Ttrain),
                  rmse(nnet.use(Xtest), Ttest)])
In [18]:
plt.plot(nnet.get_error_trace())
Out[18]:
[<matplotlib.lines.Line2D at 0x7f27240df310>]
In [19]:
len(rmses)
Out[19]:
100
In [20]:
rmses[:10]
Out[20]:
[[1.1092905281015732, 1.074803514874808],
 [1.0422523967062938, 1.1507525173468052],
 [0.971343490054443, 1.1007517133208455],
 [0.9136015653490543, 1.1445257087064515],
 [0.8802663491664203, 1.1235300184227706],
 [0.8582983903885313, 1.1324399516602002],
 [0.8443238285666306, 1.15526254630405],
 [0.8338929488393266, 1.1433157378635803],
 [0.8239529692400904, 1.137936582928785],
 [0.8165316795859134, 1.1035905831154897]]
In [21]:
plt.plot(rmses)
plt.legend(('Train', 'Test'))
plt.xlabel(f'Batches of {n_epochs_per_train} epochs')
plt.ylabel('RMSE');
In [22]:
def run(Xtrain, Ttrain, Xtest, Ttest,
        n_epochs, n_epochs_per_train, n_hiddens_list,
        method, learning_rate=None):
    
    n_reps = n_epochs // n_epochs_per_train
    n_epochs_last_train = n_epochs - n_reps * n_epochs_per_train
    if n_epochs_last_train > 0:
        n_reps += 1

    nnet = nn.NeuralNetwork(Xtrain.shape[1], n_hiddens_list, Ttrain.shape[1])

    rmses = []
    for reps in range(n_reps):
        n_epoch = n_epochs_per_train 
        if n_epochs_last_train > 0 and reps == n_reps-1:
            n_epoch = n_epochs_last_train
            
        nnet.train(Xtrain, Ttrain, n_epoch, method='scg', verbose=False)
        
        rmses.append([(reps + 1) * n_epoch,
                      rmse(nnet.use(Xtrain), Ttrain),
                      rmse(nnet.use(Xtest), Ttest)])
        
    rmses = np.array(rmses)
    best_test_index = np.argmin(rmses[:, -1])
    best_test_epoch, best_test_train_rmse, best_test_test_rmse = rmses[best_test_index, :]
    print(f'Best test error is at epoch {best_test_epoch}', end='')
    print(f' with RMSE Train of {best_test_train_rmse:.2f} and Test of {best_test_test_rmse:.2f}')
        
    plt.plot(rmses[:, 0], rmses[:, 1], label='Train')
    plt.plot(rmses[:, 0], rmses[:, 2], label='Test')
    plt.legend()
    ymin, ymax = plt.ylim()
    plt.plot([best_test_epoch, best_test_epoch], [ymin, ymax], 'r')
    plt.xlabel(f'Epochs')
    plt.ylabel('RMSE')
    
    return nnet, best_test_epoch, best_test_train_rmse, best_test_test_rmse
In [23]:
run(Xtrain, Ttrain, Xtest, Ttest, 2000, 23, [10], 'scg')
Best test error is at epoch 115.0 with RMSE Train of 0.87 and Test of 0.99
Out[23]:
(NeuralNetwork(8, [10], 1), 115.0, 0.8687695479194892, 0.9892779064589123)
In [24]:
run(Xtrain, Ttrain, Xtest, Ttest, 1000, 5, [10], 'scg');
Best test error is at epoch 5.0 with RMSE Train of 1.44 and Test of 1.23
In [25]:
run(Xtrain, Ttrain, Xtest, Ttest, 1000, 5, [10, 20], 'scg');
Best test error is at epoch 80.0 with RMSE Train of 0.98 and Test of 1.02
In [26]:
run(Xtrain, Ttrain, Xtest, Ttest, 500, 10, [10], 'sgd', 0.01);
Best test error is at epoch 60.0 with RMSE Train of 0.99 and Test of 1.00
In [27]:
run(Xtrain, Ttrain, Xtest, Ttest, 1000, 1, [5]*10, 'sgd', 0.01);
Best test error is at epoch 20.0 with RMSE Train of 1.66 and Test of 1.42

Now, finally we can start to play with reduced representations. What if we add a middle hidden layer of just two units? The neural net will learn to pass as much information as it can about each sample through that constricted, bottleneck layer. The two values that are output from that layer must capture as much information about the input sample as possible.

In [28]:
run(Xtrain, Ttrain, Xtest, Ttest, 1000, 10, [10, 10], 'scg');
Best test error is at epoch 40.0 with RMSE Train of 1.10 and Test of 1.01
In [29]:
run(Xtrain, Ttrain, Xtest, Ttest, 1000, 10, [10, 2, 10], 'scg');
Best test error is at epoch 40.0 with RMSE Train of 1.14 and Test of 1.04
In [30]:
nnet, _, _, _ = run(Xtrain, Ttrain, Xtest, Ttest, 60, 5, [10, 2, 10], 'scg')
Best test error is at epoch 35.0 with RMSE Train of 1.17 and Test of 1.06
In [31]:
Y, Z = nnet.use(Xtrain, return_hidden_layer_outputs=True)
In [32]:
len(Z)
Out[32]:
3
In [33]:
Z[1].shape
Out[33]:
(491, 2)
In [34]:
plt.plot(Z[1][:, 0], Z[1][:, 1], 'o');
In [35]:
plt.scatter(Z[1][:, 0], Z[1][:, 1]);
In [36]:
plt.scatter(Z[1][:, 0], Z[1][:, 1], c=Ttrain)
plt.colorbar();
In [37]:
Ytest, Ztest = nnet.use(Xtest, return_hidden_layer_outputs=True)

plt.scatter(Ztest[1][:, 0], Ztest[1][:, 1], c=Ttest, marker='^')
plt.colorbar();
In [38]:
plt.figure(figsize=(12, 10))
plt.scatter(Z[1][:, 0], Z[1][:, 1], c=Ttrain, s=80)
plt.scatter(Ztest[1][:, 0], Ztest[1][:, 1], c=Ttest, s=200, marker='^')
plt.colorbar();