In this notebook we will cover the following topics:
One of the most important parts of the data science workflow is evaluating the performance of a trained model and deciding:
Let's load up Keras and train an overly simple model on the CIFAR10 data.
import numpy as np
import warnings
warnings.filterwarnings('ignore') # Hide np.floating warning
import keras
from keras.datasets import cifar10
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
# Prevent TensorFlow from grabbing all the GPU memory
import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.allow_growth=True
sess = tf.Session(config=config)
import holoviews as hv
hv.extension('bokeh')
Using TensorFlow backend.
Same data preparation as before.
(Pro tip: If this weren't a tutorial, we'd move this repetitive code into a Python module and import it in the notebook, so that every experiment prepares the data the same way.)
from keras.datasets import cifar10
import keras.utils
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
# Save an unmodified copy of y_test for later, flattened to one column
y_test_true = y_test[:,0].copy()
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
num_classes = 10
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
# The data only has numeric categories so we also have the string labels below
cifar10_labels = np.array(['airplane', 'automobile', 'bird', 'cat', 'deer',
'dog', 'frog', 'horse', 'ship', 'truck'])
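As a rough illustration of what the one-hot encoding step above does to the labels, here is a minimal NumPy sketch. The `one_hot` helper and the toy labels are hypothetical and are not part of Keras; the sketch only mimics the effect of `to_categorical` on a column of integer class IDs, and shows that `argmax` recovers the original IDs.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Hypothetical helper: turn integer class IDs into one-hot rows."""
    labels = np.asarray(labels).reshape(-1)      # flatten an (N, 1) column to (N,)
    return np.eye(num_classes, dtype='float32')[labels]

# Toy labels shaped like the raw CIFAR10 labels: one column of class IDs
toy_labels = np.array([[3], [0], [9]])
encoded = one_hot(toy_labels, 10)

# argmax along each row undoes the encoding
decoded = encoded.argmax(axis=1)
print(encoded.shape, decoded)
```

This is also why an unmodified copy of the test labels is saved before encoding: metrics that expect class IDs need the integer form.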
This model resembles the one from the previous notebook, but we've removed one of the convolutional groups.
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
activation='relu',
input_shape=x_train.shape[1:]))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss=keras.losses.categorical_crossentropy,
optimizer=keras.optimizers.Adadelta(),
metrics=['accuracy'])
history = model.fit(x_train, y_train,
batch_size=128,
epochs=8,
verbose=1,
validation_data=(x_test, y_test))
Train on 50000 samples, validate on 10000 samples
Epoch 1/8
50000/50000 [==============================] - 5s 92us/step - loss: 1.6955 - acc: 0.3959 - val_loss: 1.3856 - val_acc: 0.5180
Epoch 2/8
50000/50000 [==============================] - 4s 73us/step - loss: 1.1942 - acc: 0.5804 - val_loss: 1.0840 - val_acc: 0.6108
Epoch 3/8
50000/50000 [==============================] - 4s 74us/step - loss: 0.9838 - acc: 0.6567 - val_loss: 1.0674 - val_acc: 0.6304
Epoch 4/8
50000/50000 [==============================] - 4s 73us/step - loss: 0.8353 - acc: 0.7098 - val_loss: 0.9725 - val_acc: 0.6634
Epoch 5/8
50000/50000 [==============================] - 4s 72us/step - loss: 0.7062 - acc: 0.7545 - val_loss: 1.0049 - val_acc: 0.6643
Epoch 6/8
50000/50000 [==============================] - 4s 73us/step - loss: 0.5871 - acc: 0.7972 - val_loss: 1.0326 - val_acc: 0.6665
Epoch 7/8
50000/50000 [==============================] - 4s 73us/step - loss: 0.4664 - acc: 0.8401 - val_loss: 0.9669 - val_acc: 0.6930
Epoch 8/8
50000/50000 [==============================] - 4s 73us/step - loss: 0.3547 - acc: 0.8800 - val_loss: 1.0325 - val_acc: 0.6888
%%opts Curve [width=400 height=300]
%%opts Curve (line_width=3)
%%opts Overlay [legend_position='top_left']
train_acc = hv.Curve((history.epoch, history.history['acc']), 'epoch', 'accuracy', label='training')
val_acc = hv.Curve((history.epoch, history.history['val_acc']), 'epoch', 'accuracy', label='validation')
(train_acc * val_acc).redim(accuracy=dict(range=(0.4, 1.1)))
This model shows a large gap between training and validation accuracy, a sign of overfitting. After about the fourth epoch, validation accuracy levels off between roughly 0.66 and 0.69 and validation loss hovers around 1.0, while training accuracy keeps climbing to 0.88. The model is increasingly memorizing the training data rather than learning features that generalize.
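To put numbers on the gap, the sketch below copies the per-epoch values from the training log above into a plain dictionary with the same keys that `history.history` uses. The `summarize_history` helper is hypothetical. It reports the train/validation accuracy gap per epoch and the epoch with the lowest validation loss.

```python
# Values copied from the training log of the first model
history_log = {
    'acc':      [0.3959, 0.5804, 0.6567, 0.7098, 0.7545, 0.7972, 0.8401, 0.8800],
    'val_acc':  [0.5180, 0.6108, 0.6304, 0.6634, 0.6643, 0.6665, 0.6930, 0.6888],
    'val_loss': [1.3856, 1.0840, 1.0674, 0.9725, 1.0049, 1.0326, 0.9669, 1.0325],
}

def summarize_history(hist):
    """Hypothetical helper: accuracy gaps and the 1-based epoch of lowest val_loss."""
    gaps = [train - val for train, val in zip(hist['acc'], hist['val_acc'])]
    val_loss = hist['val_loss']
    best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
    return gaps, best_epoch

gaps, best_epoch = summarize_history(history_log)
for epoch, gap in enumerate(gaps, start=1):
    print(f"epoch {epoch}: train - val accuracy = {gap:+.4f}")
print("lowest validation loss at epoch", best_epoch)
```

In a live session the same helper could be called on `history.history` directly, assuming the metric keys match the ones shown in the log.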
When dealing with models that predict categories, it is helpful to look at the confusion matrix as well. This will show which categories are being predicted poorly, and what kind of mispredictions are happening.
As the confusion matrix is a standard tool in all of machine learning, the sklearn package includes a function that computes it from an array of true category IDs and an array of predicted category IDs:
from sklearn.metrics import confusion_matrix
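# Predicted class = index of the largest softmax output
# (equivalent to predict_classes, which newer Keras releases no longer provide)
y_pred = np.argmax(model.predict(x_test), axis=1)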
confuse = confusion_matrix(y_test_true, y_pred)
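To show what the confusion matrix contains, here is a minimal NumPy sketch that builds the same kind of table by hand on toy data. The `confusion_counts` helper and the toy arrays are hypothetical. The sketch follows the convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

def confusion_counts(y_true, y_pred, num_classes):
    """Hypothetical helper: counts[i, j] = samples of true class i predicted as j."""
    counts = np.zeros((num_classes, num_classes), dtype=int)
    # np.add.at accumulates correctly even when an (i, j) pair repeats
    np.add.at(counts, (y_true, y_pred), 1)
    return counts

y_true_toy = np.array([0, 0, 1, 1, 2, 2])
y_pred_toy = np.array([0, 1, 1, 1, 2, 0])
toy_confuse = confusion_counts(y_true_toy, y_pred_toy, num_classes=3)
print(toy_confuse)
```

Correct predictions land on the diagonal, so the off-diagonal cells are the ones worth inspecting.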
# Holoviews hack to tilt labels by 45 degrees
from math import pi
def angle_label(plot, element):
plot.state.xaxis.major_label_orientation = pi / 4
%%opts HeatMap [width=500 height=400 tools=['hover'] finalize_hooks=[angle_label]]
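# confusion_matrix returns rows = true class, columns = predicted class, while a gridded
# HeatMap reads the array as [y, x]; transpose so that the x axis is the true class.
hv.HeatMap((cifar10_labels, cifar10_labels, confuse.T)).redim.label(x='true', y='predict')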
From this we can see that dogs, deer, cats, and birds are particularly problematic classes, with the confusion between cats and dogs being especially high. Note that because the test data are already balanced to have equal examples from each class, we do not need to do any special normalization of the above.
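If the classes were not balanced, raw counts would be dominated by the largest classes. A common remedy is to divide each row by the number of true examples in that class. The sketch below shows this on toy counts; the `normalize_rows` helper is hypothetical and assumes rows are true classes, as in the output of `confusion_matrix`.

```python
import numpy as np

def normalize_rows(confuse):
    """Hypothetical helper: convert counts to per-true-class fractions."""
    confuse = np.asarray(confuse, dtype=float)
    row_totals = confuse.sum(axis=1, keepdims=True)
    # Leave all-zero rows as zeros instead of dividing by zero
    return np.divide(confuse, row_totals,
                     out=np.zeros_like(confuse), where=row_totals > 0)

# Toy unbalanced data: 100 examples of class 0 but only 10 of class 1
toy_counts = np.array([[90, 10],
                       [ 5,  5]])
toy_fractions = normalize_rows(toy_counts)
per_class_recall = np.diag(toy_fractions)
print(toy_fractions)
print(per_class_recall)
```

After normalization the diagonal is the fraction of each class that was predicted correctly, which makes classes of different sizes directly comparable.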
Overfitting is more or less inevitable if we train long enough. The goal is to control it with regularization techniques such as dropout. Dropout is a surprisingly effective technique where layer inputs are passed to the output, with a random subset of outputs forced to zero during training. The subset of zeroed outputs changes after every batch. When the model is used for prediction after training, the dropout layers have no effect.
For more details about dropout, see this paper.
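The mechanism described above can be sketched in a few lines of NumPy. This is an illustrative toy, not the Keras implementation: the `dropout` function and its arguments are hypothetical. It uses the "inverted dropout" convention, in which the surviving values are scaled up by `1 / (1 - rate)` during training so that nothing needs to change at prediction time.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Hypothetical sketch of inverted dropout on an array of activations."""
    if not training or rate == 0.0:
        return x                                # no effect at prediction time
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob      # a fresh random mask on every call
    return x * mask / keep_prob                 # zero some values, rescale the rest

rng = np.random.default_rng(0)
activations = np.ones((4, 5))

train_out = dropout(activations, rate=0.5, rng=rng)                  # zeros and 2.0s
test_out = dropout(activations, rate=0.5, rng=rng, training=False)   # unchanged
print(train_out)
print(test_out)
```

Because a new mask is drawn on each call, the network cannot rely on any single unit being present, which is the intuition behind why dropout discourages memorization.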
model2 = Sequential()
model2.add(Conv2D(32, kernel_size=(3, 3),
activation='relu',
input_shape=x_train.shape[1:]))
model2.add(Conv2D(64, (3, 3), activation='relu'))
model2.add(MaxPooling2D(pool_size=(2, 2)))
model2.add(Dropout(0.25))
model2.add(Flatten())
model2.add(Dense(128, activation='relu'))
model2.add(Dropout(0.5))
model2.add(Dense(num_classes, activation='softmax'))
model2.compile(loss=keras.losses.categorical_crossentropy,
optimizer=keras.optimizers.Adadelta(),
metrics=['accuracy'])
history2 = model2.fit(x_train, y_train,
batch_size=128,
epochs=11,
verbose=1,
validation_data=(x_test, y_test))
Train on 50000 samples, validate on 10000 samples
Epoch 1/11
50000/50000 [==============================] - 4s 84us/step - loss: 1.8089 - acc: 0.3464 - val_loss: 1.4873 - val_acc: 0.4812
Epoch 2/11
50000/50000 [==============================] - 4s 81us/step - loss: 1.3893 - acc: 0.5060 - val_loss: 1.2267 - val_acc: 0.5629
Epoch 3/11
50000/50000 [==============================] - 4s 82us/step - loss: 1.2019 - acc: 0.5740 - val_loss: 1.0820 - val_acc: 0.6140
Epoch 4/11
50000/50000 [==============================] - 4s 81us/step - loss: 1.0799 - acc: 0.6213 - val_loss: 0.9693 - val_acc: 0.6666
Epoch 5/11
50000/50000 [==============================] - 4s 81us/step - loss: 0.9997 - acc: 0.6490 - val_loss: 0.9357 - val_acc: 0.6734
Epoch 6/11
50000/50000 [==============================] - 4s 82us/step - loss: 0.9324 - acc: 0.6735 - val_loss: 0.9630 - val_acc: 0.6613
Epoch 7/11
50000/50000 [==============================] - 4s 81us/step - loss: 0.8723 - acc: 0.6946 - val_loss: 0.9178 - val_acc: 0.6865
Epoch 8/11
50000/50000 [==============================] - 4s 81us/step - loss: 0.8131 - acc: 0.7137 - val_loss: 0.9005 - val_acc: 0.6895
Epoch 9/11
50000/50000 [==============================] - 4s 82us/step - loss: 0.7757 - acc: 0.7273 - val_loss: 0.8945 - val_acc: 0.6963
Epoch 10/11
50000/50000 [==============================] - 4s 82us/step - loss: 0.7340 - acc: 0.7453 - val_loss: 0.8881 - val_acc: 0.7030
Epoch 11/11
50000/50000 [==============================] - 4s 81us/step - loss: 0.6874 - acc: 0.7578 - val_loss: 0.8635 - val_acc: 0.7113
%%opts Curve [width=600 height=450]
%%opts Curve (line_width=3)
%%opts Overlay [legend_position='top_left']
train_acc = hv.Curve((history.epoch, history.history['acc']), 'epoch', 'accuracy', label='training without dropout')
val_acc = hv.Curve((history.epoch, history.history['val_acc']), 'epoch', 'accuracy', label='validation without dropout')
train_acc2 = hv.Curve((history2.epoch, history2.history['acc']), 'epoch', 'accuracy', label='training with dropout')
val_acc2 = hv.Curve((history2.epoch, history2.history['val_acc']), 'epoch', 'accuracy', label='validation with dropout')
(train_acc * val_acc * train_acc2 * val_acc2).redim(accuracy=dict(range=(0.4, 1.1)))
Here we can see some common features of a model with dropout:
Unfortunately, the amount of improvement in this case is still not enough to increase accuracy by more than a few percent. It looks like we need a more complex model.
To increase the sophistication of this model, we're going to employ a few strategies:
Unfortunately, this is the hardest thing to figure out in practice. Sometimes we need more layers, sometimes we need bigger layers, and sometimes we need a different model entirely. Looking at what others have done is your best guide here until you get some intuition.
model3 = Sequential()
model3.add(Conv2D(32, kernel_size=(3, 3), padding='same',
activation='relu',
input_shape=x_train.shape[1:]))
model3.add(Conv2D(32, (3, 3), activation='relu'))
model3.add(MaxPooling2D(pool_size=(2, 2)))
model3.add(Dropout(0.25))
# Second group of convolutions
model3.add(Conv2D(64, kernel_size=(3, 3), padding='same',
activation='relu'))
model3.add(Conv2D(64, (3, 3), activation='relu'))
model3.add(MaxPooling2D(pool_size=(2, 2)))
model3.add(Dropout(0.25))
model3.add(Flatten())
model3.add(Dense(512, activation='relu'))
model3.add(Dropout(0.5))
model3.add(Dense(num_classes, activation='softmax'))
model3.compile(loss=keras.losses.categorical_crossentropy,
optimizer=keras.optimizers.Adadelta(),
metrics=['accuracy'])
history3 = model3.fit(x_train, y_train,
batch_size=128,
epochs=15,
verbose=1,
validation_data=(x_test, y_test))
Train on 50000 samples, validate on 10000 samples
Epoch 1/15
50000/50000 [==============================] - 5s 96us/step - loss: 1.8905 - acc: 0.3121 - val_loss: 1.4884 - val_acc: 0.4695
Epoch 2/15
50000/50000 [==============================] - 5s 91us/step - loss: 1.4267 - acc: 0.4854 - val_loss: 1.2828 - val_acc: 0.5423
Epoch 3/15
50000/50000 [==============================] - 5s 91us/step - loss: 1.2250 - acc: 0.5624 - val_loss: 1.0711 - val_acc: 0.6167
Epoch 4/15
50000/50000 [==============================] - 5s 91us/step - loss: 1.0753 - acc: 0.6195 - val_loss: 0.9471 - val_acc: 0.6651
Epoch 5/15
50000/50000 [==============================] - 5s 91us/step - loss: 0.9663 - acc: 0.6611 - val_loss: 0.8573 - val_acc: 0.6984
Epoch 6/15
50000/50000 [==============================] - 5s 91us/step - loss: 0.8762 - acc: 0.6944 - val_loss: 0.8515 - val_acc: 0.7034
Epoch 7/15
50000/50000 [==============================] - 5s 91us/step - loss: 0.8153 - acc: 0.7128 - val_loss: 0.7912 - val_acc: 0.7220
Epoch 8/15
50000/50000 [==============================] - 5s 91us/step - loss: 0.7584 - acc: 0.7355 - val_loss: 0.8715 - val_acc: 0.6957
Epoch 9/15
50000/50000 [==============================] - 5s 91us/step - loss: 0.7159 - acc: 0.7510 - val_loss: 0.6997 - val_acc: 0.7578
Epoch 10/15
50000/50000 [==============================] - 5s 91us/step - loss: 0.6693 - acc: 0.7661 - val_loss: 0.7505 - val_acc: 0.7440
Epoch 11/15
50000/50000 [==============================] - 5s 92us/step - loss: 0.6347 - acc: 0.7789 - val_loss: 0.6630 - val_acc: 0.7730
Epoch 12/15
50000/50000 [==============================] - 5s 93us/step - loss: 0.6037 - acc: 0.7903 - val_loss: 0.6399 - val_acc: 0.7778
Epoch 13/15
50000/50000 [==============================] - 5s 91us/step - loss: 0.5751 - acc: 0.7966 - val_loss: 0.6530 - val_acc: 0.7760
Epoch 14/15
50000/50000 [==============================] - 5s 90us/step - loss: 0.5521 - acc: 0.8065 - val_loss: 0.6370 - val_acc: 0.7800
Epoch 15/15
50000/50000 [==============================] - 5s 91us/step - loss: 0.5239 - acc: 0.8146 - val_loss: 0.7490 - val_acc: 0.7551
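It can help to check what "more complex" means in terms of trainable parameters. The sketch below counts them by hand for the dropout model and for this deeper model, using the layer definitions from the notebook. The helper functions are hypothetical and written in plain Python, not a Keras API. They assume 3x3 kernels with stride 1, 2x2 max pooling that rounds down, and a 32x32x3 CIFAR10 input. Dropout and Flatten layers add no parameters, so they are omitted.

```python
def conv2d(shape, filters, kernel=3, padding='valid'):
    """Return (output_shape, n_params) for a stride-1 convolution layer."""
    h, w, channels = shape
    n_params = kernel * kernel * channels * filters + filters  # weights + biases
    if padding == 'valid':
        h, w = h - kernel + 1, w - kernel + 1   # 'same' keeps h and w unchanged
    return (h, w, filters), n_params

def max_pool(shape, pool=2):
    """Return the output shape of a max pooling layer (no parameters)."""
    h, w, channels = shape
    return (h // pool, w // pool, channels)

def dense(n_inputs, units):
    """Return n_params for a fully connected layer (weights + biases)."""
    return n_inputs * units + units

def count_simple_model(input_shape=(32, 32, 3), num_classes=10):
    shape, p1 = conv2d(input_shape, 32)
    shape, p2 = conv2d(shape, 64)
    shape = max_pool(shape)
    flat = shape[0] * shape[1] * shape[2]
    return p1 + p2 + dense(flat, 128) + dense(128, num_classes)

def count_complex_model(input_shape=(32, 32, 3), num_classes=10):
    shape, p1 = conv2d(input_shape, 32, padding='same')
    shape, p2 = conv2d(shape, 32)
    shape = max_pool(shape)
    shape, p3 = conv2d(shape, 64, padding='same')
    shape, p4 = conv2d(shape, 64)
    shape = max_pool(shape)
    flat = shape[0] * shape[1] * shape[2]
    return p1 + p2 + p3 + p4 + dense(flat, 512) + dense(512, num_classes)

simple_params = count_simple_model()
complex_params = count_complex_model()
print(simple_params, complex_params)  # 1626442 1250858
```

Under these assumptions the deeper model has fewer parameters than the simpler one, because the second pooling step shrinks the input to the first dense layer from 12544 values to 2304. Here "more complex" means more convolutional stages, not more weights. Calling `model3.summary()` in a live session is a way to cross-check the count.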
%%opts Curve [width=600 height=500]
%%opts Curve (line_width=3)
%%opts Overlay [legend_position='top_left']
# Name each curve after what it actually holds
val_acc_simple = hv.Curve((history2.epoch, history2.history['val_acc']), 'epoch', 'accuracy', label='validation (simple model)')
train_acc_complex = hv.Curve((history3.epoch, history3.history['acc']), 'epoch', 'accuracy', label='training (complex model)')
val_acc_complex = hv.Curve((history3.epoch, history3.history['val_acc']), 'epoch', 'accuracy', label='validation (complex model)')
(val_acc_simple * val_acc_complex * train_acc_complex).redim(accuracy=dict(range=(0.4, 1.1)))
If you screw everything up, you can use File / Revert to Checkpoint to go back to the first version of the notebook and restart the Jupyter kernel with Kernel / Restart.