This notebook is going to lean pretty heavily on Keras, so it's worth taking a minute to understand a little bit about what Keras is.
According to its own documentation, Keras is "a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano."
So while TensorFlow, for example, is an API that helps with machine learning in a broader sense, Keras was written specifically to help with neural networks.
Let's see this in action.
Remember how we called MNIST a famous dataset? MNIST is so famous, Keras contains its own copy!
from keras.datasets import mnist
(x_trn, y_trn), (x_val, y_val) = mnist.load_data()
Using Theano backend. Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 5103)
x_trn contains 60,000 28x28 images:
x_trn.shape
(60000, 28, 28)
We have to change the shape of our inputs a bit - Keras expects an image with multiple color channels by default, but we can just add a single channel for our black and white images:
import numpy as np
x_trn = np.expand_dims(x_trn, 1)
x_val = np.expand_dims(x_val, 1)
x_trn.shape
(60000, 1, 28, 28)
And y_trn contains the labels:
y_trn[:5]
array([5, 0, 4, 1, 9], dtype=uint8)
Just like last time, this data should be in one-hot encoded format, but we can just use the provided Keras function this time.
from keras.utils.np_utils import to_categorical
y_trn = to_categorical(y_trn)
y_val = to_categorical(y_val)
y_val[:3]
array([[ 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.], [ 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.], [ 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]])
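Under the hood, one-hot encoding is simple enough to sketch in plain NumPy - indexing into an identity matrix does the trick (this is an illustration, not Keras's actual implementation):

```python
import numpy as np

labels = np.array([5, 0, 4])                    # a few example labels, like the start of y_trn
one_hot = np.eye(10, dtype=np.float32)[labels]  # row i of eye(10) is the one-hot vector for class i
```

Each row has a single 1 in the column matching its label, exactly like the to_categorical output above.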
The simplest type of Keras model is a Sequential model.
from keras.models import Sequential
model = Sequential()
It's called "sequential" because it contains a stack, or "sequence," of linear layers.
from keras.layers import Dense
model.add(Dense(10, input_dim=28))
We should know by now that what we're doing here is telling the first layer in our model to take an input of size (, 28) and produce an output of size (, 10).
We can stack as many model layers as we like:
model.add(Dense(5))
model.add(Dense(2))
And we no longer have to specify input dimensions because each layer just takes the output of the previous layer.
When we're done here we can compile the model:
model.compile(
loss="categorical_crossentropy",
optimizer="sgd",
metrics=["accuracy"]
)
Can we run this on MNIST?
model.fit(x_trn, y_trn, nb_epoch=3, batch_size=32)
--------------------------------------------------------------------------- Exception Traceback (most recent call last) <ipython-input-12-5d4f6032f99c> in <module>() ----> 1 model.fit(x_trn, y_trn, nb_epoch=3, batch_size=32) /home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/models.pyc in fit(self, x, y, batch_size, nb_epoch, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, **kwargs) 618 shuffle=shuffle, 619 class_weight=class_weight, --> 620 sample_weight=sample_weight) 621 622 def evaluate(self, x, y, batch_size=32, verbose=1, /home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/engine/training.pyc in fit(self, x, y, batch_size, nb_epoch, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight) 1032 class_weight=class_weight, 1033 check_batch_dim=False, -> 1034 batch_size=batch_size) 1035 # prepare validation data 1036 if validation_data: /home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/engine/training.pyc in _standardize_user_data(self, x, y, sample_weight, class_weight, check_batch_dim, batch_size) 959 self.internal_input_shapes, 960 check_batch_dim=False, --> 961 exception_prefix='model input') 962 y = standardize_input_data(y, self.output_names, 963 output_shapes, /home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/engine/training.pyc in standardize_input_data(data, names, shapes, check_batch_dim, exception_prefix) 95 ' to have ' + str(len(shapes[i])) + 96 ' dimensions, but got array with shape ' + ---> 97 str(array.shape)) 98 for j, (dim, ref_dim) in enumerate(zip(array.shape, shapes[i])): 99 if not j and not check_batch_dim: Exception: Error when checking model input: expected dense_input_1 to have 2 dimensions, but got array with shape (60000, 1, 28, 28)
I guess not. Let's have a look at the error message:
expected dense_input_1 to have 2 dimensions, but got array with shape (60000, 1, 28, 28)
Hey, that makes sense! We told our model to expect an input of size (, 28) but instead we gave it the entire (60000, 1, 28, 28) MNIST dataset.
And we could try changing the first layer in our model to something like model.add(Dense(10, input_shape=(28, 28))), but it actually wouldn't be a very good model (and also Keras will spit an error back at us).
Remember on Day Twelve when we said our spreadsheet neural network wasn't a real neural network because it was just a bunch of linear matrix multiplications?
Right now, that's all our Keras model is. Each Dense
layer is the equivalent of a single matrix multiplication in the spreadsheet.
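To make that concrete, here's a quick NumPy sketch (not Keras internals) showing that a Dense layer with no activation is just a matrix multiplication, and that stacking two of them collapses into a single one:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 28))     # a batch of 4 inputs of size 28
W1 = rng.standard_normal((28, 10))   # weights for a "Dense(10)"-like layer
W2 = rng.standard_normal((10, 5))    # weights for a "Dense(5)"-like layer

stacked = (x @ W1) @ W2              # two linear layers applied in sequence...
collapsed = x @ (W1 @ W2)            # ...equal one linear layer with combined weights
```

np.allclose(stacked, collapsed) holds, which is exactly why a stack of purely linear layers is no more expressive than a single one.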
Okay, so if we don't want just a bunch of linear layers, what do we want? Since we don't have any strong opinions about what the architecture of a neural network for image recognition should look like, let's just see what the VGG model looks like and start from there:
from vgg16 import Vgg16
vgg = Vgg16()
vgg.model.summary()
____________________________________________________________________________________________________ Layer (type) Output Shape Param # Connected to ==================================================================================================== lambda_1 (Lambda) (None, 3, 224, 224) 0 lambda_input_1[0][0] ____________________________________________________________________________________________________ zeropadding2d_1 (ZeroPadding2D) (None, 3, 226, 226) 0 lambda_1[0][0] ____________________________________________________________________________________________________ convolution2d_1 (Convolution2D) (None, 64, 224, 224) 1792 zeropadding2d_1[0][0] ____________________________________________________________________________________________________ zeropadding2d_2 (ZeroPadding2D) (None, 64, 226, 226) 0 convolution2d_1[0][0] ____________________________________________________________________________________________________ convolution2d_2 (Convolution2D) (None, 64, 224, 224) 36928 zeropadding2d_2[0][0] ____________________________________________________________________________________________________ maxpooling2d_1 (MaxPooling2D) (None, 64, 112, 112) 0 convolution2d_2[0][0] ____________________________________________________________________________________________________ zeropadding2d_3 (ZeroPadding2D) (None, 64, 114, 114) 0 maxpooling2d_1[0][0] ____________________________________________________________________________________________________ convolution2d_3 (Convolution2D) (None, 128, 112, 112) 73856 zeropadding2d_3[0][0] ____________________________________________________________________________________________________ zeropadding2d_4 (ZeroPadding2D) (None, 128, 114, 114) 0 convolution2d_3[0][0] ____________________________________________________________________________________________________ convolution2d_4 (Convolution2D) (None, 128, 112, 112) 147584 zeropadding2d_4[0][0] 
____________________________________________________________________________________________________ maxpooling2d_2 (MaxPooling2D) (None, 128, 56, 56) 0 convolution2d_4[0][0] ____________________________________________________________________________________________________ zeropadding2d_5 (ZeroPadding2D) (None, 128, 58, 58) 0 maxpooling2d_2[0][0] ____________________________________________________________________________________________________ convolution2d_5 (Convolution2D) (None, 256, 56, 56) 295168 zeropadding2d_5[0][0] ____________________________________________________________________________________________________ zeropadding2d_6 (ZeroPadding2D) (None, 256, 58, 58) 0 convolution2d_5[0][0] ____________________________________________________________________________________________________ convolution2d_6 (Convolution2D) (None, 256, 56, 56) 590080 zeropadding2d_6[0][0] ____________________________________________________________________________________________________ zeropadding2d_7 (ZeroPadding2D) (None, 256, 58, 58) 0 convolution2d_6[0][0] ____________________________________________________________________________________________________ convolution2d_7 (Convolution2D) (None, 256, 56, 56) 590080 zeropadding2d_7[0][0] ____________________________________________________________________________________________________ maxpooling2d_3 (MaxPooling2D) (None, 256, 28, 28) 0 convolution2d_7[0][0] ____________________________________________________________________________________________________ zeropadding2d_8 (ZeroPadding2D) (None, 256, 30, 30) 0 maxpooling2d_3[0][0] ____________________________________________________________________________________________________ convolution2d_8 (Convolution2D) (None, 512, 28, 28) 1180160 zeropadding2d_8[0][0] ____________________________________________________________________________________________________ zeropadding2d_9 (ZeroPadding2D) (None, 512, 30, 30) 0 convolution2d_8[0][0] 
____________________________________________________________________________________________________ convolution2d_9 (Convolution2D) (None, 512, 28, 28) 2359808 zeropadding2d_9[0][0] ____________________________________________________________________________________________________ zeropadding2d_10 (ZeroPadding2D) (None, 512, 30, 30) 0 convolution2d_9[0][0] ____________________________________________________________________________________________________ convolution2d_10 (Convolution2D) (None, 512, 28, 28) 2359808 zeropadding2d_10[0][0] ____________________________________________________________________________________________________ maxpooling2d_4 (MaxPooling2D) (None, 512, 14, 14) 0 convolution2d_10[0][0] ____________________________________________________________________________________________________ zeropadding2d_11 (ZeroPadding2D) (None, 512, 16, 16) 0 maxpooling2d_4[0][0] ____________________________________________________________________________________________________ convolution2d_11 (Convolution2D) (None, 512, 14, 14) 2359808 zeropadding2d_11[0][0] ____________________________________________________________________________________________________ zeropadding2d_12 (ZeroPadding2D) (None, 512, 16, 16) 0 convolution2d_11[0][0] ____________________________________________________________________________________________________ convolution2d_12 (Convolution2D) (None, 512, 14, 14) 2359808 zeropadding2d_12[0][0] ____________________________________________________________________________________________________ zeropadding2d_13 (ZeroPadding2D) (None, 512, 16, 16) 0 convolution2d_12[0][0] ____________________________________________________________________________________________________ convolution2d_13 (Convolution2D) (None, 512, 14, 14) 2359808 zeropadding2d_13[0][0] ____________________________________________________________________________________________________ maxpooling2d_5 (MaxPooling2D) (None, 512, 7, 7) 0 convolution2d_13[0][0] 
____________________________________________________________________________________________________ flatten_1 (Flatten) (None, 25088) 0 maxpooling2d_5[0][0] ____________________________________________________________________________________________________ dense_4 (Dense) (None, 4096) 102764544 flatten_1[0][0] ____________________________________________________________________________________________________ dropout_1 (Dropout) (None, 4096) 0 dense_4[0][0] ____________________________________________________________________________________________________ dense_5 (Dense) (None, 4096) 16781312 dropout_1[0][0] ____________________________________________________________________________________________________ dropout_2 (Dropout) (None, 4096) 0 dense_5[0][0] ____________________________________________________________________________________________________ dense_6 (Dense) (None, 1000) 4097000 dropout_2[0][0] ==================================================================================================== Total params: 138357544 ____________________________________________________________________________________________________
It looks like VGG starts with a Lambda layer, contains a number of ZeroPadding and Convolution layers with some MaxPooling sprinkled in, then a Flatten layer, then Dense > Dropout > Dense a couple of times.
Piece of cake, right?
Let's start at the top.
The first layer in VGG is a Lambda layer, which the docs tell us can "[wrap an] arbitrary expression as a Layer object".
I believe what this means is "turn any function into a layer", and if we look at the VGG declaration, the line containing the Lambda layer is:
model.add(Lambda(vgg_preprocess, input_shape=(3,224,224), output_shape=(3,224,224)))
So input and output shapes are unchanged, but the input is transformed by a function called vgg_preprocess:
vgg_mean = np.array([123.68, 116.779, 103.939], dtype=np.float32).reshape((3,1,1))

def vgg_preprocess(x):
    """
    Subtracts the mean RGB value, and transposes RGB to BGR.
    The mean RGB was computed on the image set used to train the VGG model.

    Args:
        x: Image array (height x width x channels)
    Returns:
        Image array (height x width x transposed_channels)
    """
    x = x - vgg_mean
    return x[:, ::-1]  # reverse axis: RGB -> BGR
It looks like vgg_preprocess does two things:
1. Subtracts the mean RGB value (of the ImageNet dataset) from our input
This step normalizes our inputs. If we had an input with values [100, 200, 300], subtracting the mean would give us [-100, 0, 100]. This is ostensibly easier for a neural network to work with because it keeps things within a certain range.
2. Transposes our input from RGB to BGR
Necessary because VGG was trained using Caffe (a deep learning framework), Caffe uses OpenCV (a computer vision library), and OpenCV works with images in BGR. So we have to reverse the order of the color channels in our RGB images to use them with VGG.
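Here's a toy demonstration of both steps on a made-up channels-first "image" (only vgg_mean comes from the notebook; the image array is invented). Note that vgg_preprocess reverses axis 1 with x[:, ::-1] because it operates on batches; on a single image the channel axis is axis 0:

```python
import numpy as np

vgg_mean = np.array([123.68, 116.779, 103.939], dtype=np.float32).reshape((3, 1, 1))

img = np.full((3, 2, 2), 150.0, dtype=np.float32)  # fake 2x2 RGB image, channels first
centered = img - vgg_mean                          # step 1: subtract the per-channel mean
bgr = centered[::-1]                               # step 2: reverse the channel axis, RGB -> BGR
```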
So the second part is VGG-specific (and since we're building a model from scratch on black and white images, we don't have to worry about color channels at all), but let's set up a normalization step for MNIST:
def mnist_normalize(x):
    mnist_mean = x_trn.mean().astype(np.float32)
    mnist_stdev = x_trn.std().astype(np.float32)
    return (x - mnist_mean) / mnist_stdev
Notice that we go one step further than VGG when it comes to normalization - VGG only subtracts the mean, but we also divide our input by its standard deviation.
I don't know why VGG doesn't do this, but Wikipedia says dividing by standard deviation is a valid approach.
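A quick sanity check on made-up data (the array below is purely illustrative) shows what mean/std normalization buys us: the result has mean 0 and standard deviation 1:

```python
import numpy as np

fake_pixels = np.array([0., 50., 100., 150., 200.], dtype=np.float32)
normalized = (fake_pixels - fake_pixels.mean()) / fake_pixels.std()
# normalized now has mean ~0 and standard deviation ~1
```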
from keras.layers.core import Lambda
model = Sequential()
model.add(Lambda(mnist_normalize, input_shape=(1, 28, 28)))
Input normalized!
Next up we have a repeating pattern of ZeroPadding and Convolution layers, with the occasional MaxPooling layer mixed in.
We're going to do an entire post on convolutions later on, but one way to think about a convolutional layer is as a layer that takes an image and runs randomly generated image filters over it to highlight certain features. These features can range from simple (like lines or circles) to complex (like textures or faces).
This post by Victor Powell contains a great way to visualize filters and their effects.
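To build a little intuition, here's a naive NumPy sketch of a "valid" 2-D convolution (technically cross-correlation, which is what most deep learning libraries compute). The hand-made vertical-edge filter responds wherever pixel intensity changes from left to right; all the names and values here are illustrative:

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Slide the kernel over the image, summing elementwise products at each position."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1), dtype=np.float32)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i+kh, j:j+kw] * kernel).sum()
    return out

# A vertical-edge filter: positive on the left, negative on the right.
edge = np.array([[1, 0, -1],
                 [1, 0, -1],
                 [1, 0, -1]], dtype=np.float32)

img = np.zeros((5, 5), dtype=np.float32)
img[:, 3:] = 1.0                  # left side dark, right side bright
response = conv2d_valid(img, edge)  # nonzero only near the dark-to-bright boundary
```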
Because filters are smaller than the images they filter (a filter is usually 3x3 pixels), we need to create some empty space around the image for when a filter hits the edge of the image. The ZeroPadding layer adds a single zero-valued pixel (essentially a black border) to each of the four edges of an image.
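np.pad shows the idea in miniature: a 1-pixel border of zeros around a 3x3 image yields a 5x5 image.

```python
import numpy as np

img = np.ones((3, 3), dtype=np.float32)
padded = np.pad(img, 1, mode="constant")  # add a 1-pixel zero border on every edge
# padded.shape is (5, 5); the border is all zeros, the original 3x3 sits in the middle
```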
Let's add a couple of these to our model.
from keras.layers.convolutional import Convolution2D, ZeroPadding2D
model.add(
ZeroPadding2D((1, 1)) # Number of pixels to add
)
model.add(
Convolution2D(
32, # Number of filters to use
3, # Number of rows in the convolution kernel AKA filter height
3, # Number of columns in the convolution kernel AKA filter width
activation="relu"
)
)
model.add(ZeroPadding2D((1, 1)))
model.add(Convolution2D(32, 3, 3, activation="relu"))
The current output of our model should be 32 filtered images, one for each of our 32 filters:
model.output_shape
(None, 32, 28, 28)
The final, unexplained parameter in our Convolution2D layer (activation="relu") adds a nonlinear activation function (in this case, a rectified linear unit, or ReLU) to the end of the layer.
ReLU computes the function f(x) = max(0, x) - in plain English: if x is larger than 0, pass it on; otherwise, pass on 0.
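In code, ReLU is a one-liner:

```python
import numpy as np

def relu(x):
    # pass positive values through unchanged, clamp everything else to 0
    return np.maximum(0, x)

# relu(np.array([-2., -1., 0., 3.])) -> [0., 0., 0., 3.]
```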
The interaction of linear and nonlinear layers in a neural network is actually explained by the intimidatingly-named universal approximation theorem but we're going to save that for another day (because I don't actually understand it).
It might sort of make sense how a 3x3 image filter can detect something small like a corner or a line, but how can such a tiny filter detect a face? A face usually contains elements that would be difficult to express in a 9-pixel space.
There are two ways we can think about solving the problem: we can increase the size of the filter, or we can reduce the resolution of the image. MaxPooling helps us do the latter.
from keras.layers.pooling import MaxPooling2D
model.add(MaxPooling2D((2, 2))) # Factors by which to downsize vertically and horizontally
We can see that the MaxPooling layer cuts the resolution of our images in half:
model.output_shape
(None, 32, 14, 14)
As the image gets smaller through successive MaxPooling layers, our filters become better able to account for the position of elements within the photo.
Because each filter is now looking at a larger part of the image, even a 3x3 filter can meaningfully evaluate the position of elements like eyes, noses, and mouths relative to each other.
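Here's what 2x2 max pooling does, sketched in NumPy on a made-up 4x4 image: each non-overlapping 2x2 block is replaced by its largest value, halving the resolution in each dimension.

```python
import numpy as np

def max_pool_2x2(img):
    """Downsample by taking the max of each non-overlapping 2x2 block."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

img = np.array([[1, 2, 5, 6],
                [3, 4, 7, 8],
                [9, 1, 2, 3],
                [4, 5, 6, 7]], dtype=np.float32)
pooled = max_pool_2x2(img)  # 4x4 -> 2x2, keeping the strongest response per block
```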
Combining filters with progressively lower-resolution versions of the image means we can build a model that detects simple features (like lines or circles) in its early layers and complex ones (like textures or faces) in its later layers.
To preserve the amount of information propagating through the network, we double the number of filters after each MaxPooling layer:
model.add(ZeroPadding2D((1, 1)))
model.add(Convolution2D(64, 3, 3, activation="relu"))
model.add(ZeroPadding2D((1, 1)))
model.add(Convolution2D(64, 3, 3, activation="relu"))
model.add(MaxPooling2D((2, 2)))
model.output_shape
(None, 64, 7, 7)
The preservation is not perfect, of course - in theory we should 4x the number of image filters because both image height and image width are being reduced by a factor of 2. It's not clear to me why this isn't done.
I'm also not sure why a final MaxPooling layer is added.
We can quickly check in with the state of our model at this time:
model.summary()
____________________________________________________________________________________________________ Layer (type) Output Shape Param # Connected to ==================================================================================================== lambda_2 (Lambda) (None, 1, 28, 28) 0 lambda_input_2[0][0] ____________________________________________________________________________________________________ zeropadding2d_14 (ZeroPadding2D) (None, 1, 30, 30) 0 lambda_2[0][0] ____________________________________________________________________________________________________ convolution2d_14 (Convolution2D) (None, 32, 28, 28) 320 zeropadding2d_14[0][0] ____________________________________________________________________________________________________ zeropadding2d_15 (ZeroPadding2D) (None, 32, 30, 30) 0 convolution2d_14[0][0] ____________________________________________________________________________________________________ convolution2d_15 (Convolution2D) (None, 32, 28, 28) 9248 zeropadding2d_15[0][0] ____________________________________________________________________________________________________ maxpooling2d_6 (MaxPooling2D) (None, 32, 14, 14) 0 convolution2d_15[0][0] ____________________________________________________________________________________________________ zeropadding2d_16 (ZeroPadding2D) (None, 32, 16, 16) 0 maxpooling2d_6[0][0] ____________________________________________________________________________________________________ convolution2d_16 (Convolution2D) (None, 64, 14, 14) 18496 zeropadding2d_16[0][0] ____________________________________________________________________________________________________ zeropadding2d_17 (ZeroPadding2D) (None, 64, 16, 16) 0 convolution2d_16[0][0] ____________________________________________________________________________________________________ convolution2d_17 (Convolution2D) (None, 64, 14, 14) 36928 zeropadding2d_17[0][0] 
____________________________________________________________________________________________________ maxpooling2d_7 (MaxPooling2D) (None, 64, 7, 7) 0 convolution2d_17[0][0] ==================================================================================================== Total params: 64992 ____________________________________________________________________________________________________
And it looks kinda like VGG.
It's time to make the change from convolutional layers to dense ones, but to do that we need to flatten our input. Remember that Keras error from way back?
expected dense_input_1 to have 2 dimensions, but got array with shape (60000, 1, 28, 28)
model.output_shape
(None, 64, 7, 7)
from keras.layers.core import Flatten
model.add(Flatten())
model.output_shape
(None, 3136)
64 * 7 * 7 = 3,136. Ta-dah!
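Flatten just reshapes each sample into a single long vector, something we can mimic with a plain reshape (an illustration, not Keras internals):

```python
import numpy as np

batch = np.zeros((2, 64, 7, 7), dtype=np.float32)  # 2 samples shaped like our model's output
flat = batch.reshape(batch.shape[0], -1)           # each sample becomes a 64*7*7 = 3136 vector
```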
Finally, we can add in our Dense layers and they will work as expected:
model.add(Dense(512, activation="relu"))
Recall that we don't have to specify input_dim=3136 because we're adding to an existing model instead of building a new one with Dense as the first layer.
Why 512? There doesn't actually seem to be a good rule about how to pick output dimensions (this forum response by Jeremy Howard calls it more art than science). Maybe one day!
VGG also adds a layer called Dropout at this point. Dropout is a technique for reducing overfitting, but I don't think we have a problem with overfitting yet, so we're going to ignore it for now.
One final layer with its output dimension set to 10 (one for each of the 10 possible digits in the MNIST data)...
model.add(Dense(10, activation="softmax"))
... And we're done!
Let's compile the model and see how it does.
model.compile(
loss="categorical_crossentropy",
optimizer="sgd",
metrics=["accuracy"]
)
It's a good idea to wrap our data in generators so we can feed it to the model in batches instead of all at once:
from keras.preprocessing import image
gen = image.ImageDataGenerator()
trn_batches = gen.flow(x_trn, y_trn, batch_size=32)
val_batches = gen.flow(x_val, y_val, batch_size=32)
model.fit_generator(
trn_batches,
trn_batches.N,
nb_epoch=3,
validation_data=val_batches,
nb_val_samples=val_batches.N
)
Epoch 1/3 60000/60000 [==============================] - 50s - loss: 0.1761 - acc: 0.9445 - val_loss: 0.0546 - val_acc: 0.9829 Epoch 2/3 60000/60000 [==============================] - 50s - loss: 0.0527 - acc: 0.9834 - val_loss: 0.0366 - val_acc: 0.9872 Epoch 3/3 60000/60000 [==============================] - 50s - loss: 0.0347 - acc: 0.9896 - val_loss: 0.0318 - val_acc: 0.9893
<keras.callbacks.History at 0x7fdd27f0eed0>
That's 98%, almost 99% accuracy! With a model that was built using Keras components but trained from scratch.
Our score doesn't quite get us on the leaderboard, but next time we're going to look at a couple of techniques that will.
And finally, here's our model from scratch, all in one place:
model = Sequential([
Lambda(mnist_normalize, input_shape=(1,28,28)),
ZeroPadding2D(),
Convolution2D(32, 3, 3, activation="relu"),
ZeroPadding2D(),
Convolution2D(32, 3, 3, activation="relu"),
MaxPooling2D(),
ZeroPadding2D(),
Convolution2D(64, 3, 3, activation="relu"),
ZeroPadding2D(),
Convolution2D(64, 3, 3, activation="relu"),
MaxPooling2D(),
Flatten(),
Dense(512, activation='relu'),
Dense(10, activation='softmax')
])
Notice how we were able to drop a lot of the parameters for ZeroPadding and MaxPooling? That's because the defaults for both layers already contain exactly what we need!