Neural Network Layers¶

This notebook is going to lean pretty heavily on Keras, so it's worth taking a minute to understand a little bit about what Keras is.

According to its own documentation, Keras is "a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano."

So while TensorFlow, for example, is an API that helps with machine learning in a broader sense, Keras was written specifically to help with neural networks.

Let's see this in action.

Get training & test data¶

Remember how we called MNIST a famous dataset? MNIST is so famous, Keras contains its own copy!

In [1]:
from keras.datasets import mnist

(x_trn, y_trn), (x_val, y_val) = mnist.load_data()

Using Theano backend.
Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 5103)
/home/ubuntu/anaconda2/lib/python2.7/site-packages/theano/sandbox/cuda/__init__.py:600: UserWarning: Your cuDNN version is more recent than the one Theano officially supports. If you see any problems, try updating Theano or downgrading cuDNN to version 5.
warnings.warn(warn)


x_trn contains 60,000 28x28 images:

In [2]:
x_trn.shape

Out[2]:
(60000, 28, 28)

We have to change the shape of our inputs a bit - Keras expects an image with multiple color channels by default, but we can just add a single channel for our black and white images:

In [3]:
import numpy as np

x_trn = np.expand_dims(x_trn, 1)
x_val = np.expand_dims(x_val, 1)

In [4]:
x_trn.shape

Out[4]:
(60000, 1, 28, 28)
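Here's a minimal NumPy sketch of what `np.expand_dims` is doing, using a small stand-in array instead of the real MNIST data (`fake_images` is a made-up name, not something from this notebook):

```python
import numpy as np

# Hypothetical stand-in for x_trn: 5 blank 28x28 "images"
fake_images = np.zeros((5, 28, 28), dtype=np.uint8)

# Insert a new axis of length 1 at position 1 (the channel axis)
with_channel = np.expand_dims(fake_images, 1)

print(fake_images.shape)   # (5, 28, 28)
print(with_channel.shape)  # (5, 1, 28, 28)
```

No pixel values change; the array just gains an extra axis that says "one color channel".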

And y_trn contains the labels:

In [5]:
y_trn[:5]

Out[5]:
array([5, 0, 4, 1, 9], dtype=uint8)

Just like last time, this data should be in one-hot encoded format, but we can just use the provided Keras function this time.

In [6]:
from keras.utils.np_utils import to_categorical

y_trn = to_categorical(y_trn)
y_val = to_categorical(y_val)

In [7]:
y_val[:3]

Out[7]:
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])
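For reference, one-hot encoding can be sketched in a couple of lines of NumPy. This is not how Keras implements `to_categorical`; `one_hot` is a hypothetical helper, and the labels 7, 2, 1 are read off the three rows shown above:

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Return one row per label, with a 1.0 in the column of that label."""
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes)[labels]

labels = np.array([7, 2, 1])  # the labels behind the three rows above
encoded = one_hot(labels)
print(encoded)
```

Each row has exactly one nonzero entry, and its position is the digit.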

Linear¶

The simplest type of Keras model is a Sequential model.

In [8]:
from keras.models import Sequential

model = Sequential()


It's called "sequential" because it contains a stack or "sequence" of layers.

In [9]:
from keras.layers import Dense

model.add(Dense(10, input_dim=28))


We should know by now that what we're doing here is telling the first layer in our model to take an input of size (, 28) and produce an output of size (, 10).

We can stack as many model layers as we like:

In [10]:
model.add(Dense(5))


And we no longer have to specify input dimensions because each layer just takes the output of the previous layer.

When we're done here we can compile the model:

In [11]:
model.compile(
    loss="categorical_crossentropy",
    optimizer="sgd",
    metrics=["accuracy"]
)


Can we run this on MNIST?

In [12]:
model.fit(x_trn, y_trn, nb_epoch=3, batch_size=32)

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-12-5d4f6032f99c> in <module>()
----> 1 model.fit(x_trn, y_trn, nb_epoch=3, batch_size=32)

/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/models.pyc in fit(self, x, y, batch_size, nb_epoch, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, **kwargs)
618                               shuffle=shuffle,
619                               class_weight=class_weight,
--> 620                               sample_weight=sample_weight)
621
622     def evaluate(self, x, y, batch_size=32, verbose=1,

/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/engine/training.pyc in fit(self, x, y, batch_size, nb_epoch, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight)
1032                                                            class_weight=class_weight,
1033                                                            check_batch_dim=False,
-> 1034                                                            batch_size=batch_size)
1035         # prepare validation data
1036         if validation_data:

/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/engine/training.pyc in _standardize_user_data(self, x, y, sample_weight, class_weight, check_batch_dim, batch_size)
959                                    self.internal_input_shapes,
960                                    check_batch_dim=False,
--> 961                                    exception_prefix='model input')
962         y = standardize_input_data(y, self.output_names,
963                                    output_shapes,

/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/engine/training.pyc in standardize_input_data(data, names, shapes, check_batch_dim, exception_prefix)
95                                 ' to have ' + str(len(shapes[i])) +
96                                 ' dimensions, but got array with shape ' +
---> 97                                 str(array.shape))
98             for j, (dim, ref_dim) in enumerate(zip(array.shape, shapes[i])):
99                 if not j and not check_batch_dim:

Exception: Error when checking model input: expected dense_input_1 to have 2 dimensions, but got array with shape (60000, 1, 28, 28)

I guess not. Let's have a look at the error message:

expected dense_input_1 to have 2 dimensions, but got array with shape (60000, 1, 28, 28)

Hey, that makes sense! We told our model to expect an input of size (, 28) but instead we gave it the entire (60000, 1, 28, 28) MNIST dataset.

And we could try changing the first layer in our model to something like model.add(Dense(10, input_shape=(28, 28))) but it actually wouldn't be a very good model (and also Keras will spit an error back at us).

Remember on Day Twelve when we said our spreadsheet neural network wasn't a real neural network because it was just a bunch of linear matrix multiplications?

Right now, that's all our Keras model is. Each Dense layer is the equivalent of a single matrix multiplication in the spreadsheet.
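To see why a stack of purely linear layers doesn't buy us much, here's a small NumPy sketch. The shapes mirror the Dense(10) and Dense(5) layers above, but the weights are random made-up values and biases are left out for simplicity (both are assumptions for illustration):

```python
import numpy as np

rng = np.random.RandomState(0)

x = rng.randn(4, 28)    # 4 made-up inputs of size 28
w1 = rng.randn(28, 10)  # weights of a first linear layer (28 -> 10)
w2 = rng.randn(10, 5)   # weights of a second linear layer (10 -> 5)

# Two linear layers applied one after the other...
two_layers = x.dot(w1).dot(w2)

# ...give the same result as a single layer whose weights are w1.dot(w2)
one_layer = x.dot(w1.dot(w2))

print(np.allclose(two_layers, one_layer))  # True
```

However many linear layers we stack, they collapse into one matrix multiplication, so the model can't learn anything a single layer couldn't.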

Great artists steal¶

Okay, so if we don't want just a bunch of linear layers, what do we want? Since we don't have any strong opinions about what the architecture of a neural network for image recognition should look like, let's just see what the VGG model looks like and start from there:

In [13]:
from vgg16 import Vgg16

vgg = Vgg16()
vgg.model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to
====================================================================================================
lambda_1 (Lambda)                (None, 3, 224, 224)   0           lambda_input_1[0][0]
____________________________________________________________________________________________________
zeropadding2d_1 (ZeroPadding2D)  (None, 3, 226, 226)   0           lambda_1[0][0]
____________________________________________________________________________________________________
convolution2d_1 (Convolution2D)  (None, 64, 224, 224)  1792        zeropadding2d_1[0][0]
____________________________________________________________________________________________________
zeropadding2d_2 (ZeroPadding2D)  (None, 64, 226, 226)  0           convolution2d_1[0][0]
____________________________________________________________________________________________________
convolution2d_2 (Convolution2D)  (None, 64, 224, 224)  36928       zeropadding2d_2[0][0]
____________________________________________________________________________________________________
maxpooling2d_1 (MaxPooling2D)    (None, 64, 112, 112)  0           convolution2d_2[0][0]
____________________________________________________________________________________________________
zeropadding2d_3 (ZeroPadding2D)  (None, 64, 114, 114)  0           maxpooling2d_1[0][0]
____________________________________________________________________________________________________
convolution2d_3 (Convolution2D)  (None, 128, 112, 112) 73856       zeropadding2d_3[0][0]
____________________________________________________________________________________________________
zeropadding2d_4 (ZeroPadding2D)  (None, 128, 114, 114) 0           convolution2d_3[0][0]
____________________________________________________________________________________________________
convolution2d_4 (Convolution2D)  (None, 128, 112, 112) 147584      zeropadding2d_4[0][0]
____________________________________________________________________________________________________
maxpooling2d_2 (MaxPooling2D)    (None, 128, 56, 56)   0           convolution2d_4[0][0]
____________________________________________________________________________________________________
zeropadding2d_5 (ZeroPadding2D)  (None, 128, 58, 58)   0           maxpooling2d_2[0][0]
____________________________________________________________________________________________________
convolution2d_5 (Convolution2D)  (None, 256, 56, 56)   295168      zeropadding2d_5[0][0]
____________________________________________________________________________________________________
zeropadding2d_6 (ZeroPadding2D)  (None, 256, 58, 58)   0           convolution2d_5[0][0]
____________________________________________________________________________________________________
convolution2d_6 (Convolution2D)  (None, 256, 56, 56)   590080      zeropadding2d_6[0][0]
____________________________________________________________________________________________________
zeropadding2d_7 (ZeroPadding2D)  (None, 256, 58, 58)   0           convolution2d_6[0][0]
____________________________________________________________________________________________________
convolution2d_7 (Convolution2D)  (None, 256, 56, 56)   590080      zeropadding2d_7[0][0]
____________________________________________________________________________________________________
maxpooling2d_3 (MaxPooling2D)    (None, 256, 28, 28)   0           convolution2d_7[0][0]
____________________________________________________________________________________________________
zeropadding2d_8 (ZeroPadding2D)  (None, 256, 30, 30)   0           maxpooling2d_3[0][0]
____________________________________________________________________________________________________
convolution2d_8 (Convolution2D)  (None, 512, 28, 28)   1180160     zeropadding2d_8[0][0]
____________________________________________________________________________________________________
zeropadding2d_9 (ZeroPadding2D)  (None, 512, 30, 30)   0           convolution2d_8[0][0]
____________________________________________________________________________________________________
convolution2d_9 (Convolution2D)  (None, 512, 28, 28)   2359808     zeropadding2d_9[0][0]
____________________________________________________________________________________________________
zeropadding2d_10 (ZeroPadding2D) (None, 512, 30, 30)   0           convolution2d_9[0][0]
____________________________________________________________________________________________________
convolution2d_10 (Convolution2D) (None, 512, 28, 28)   2359808     zeropadding2d_10[0][0]
____________________________________________________________________________________________________
maxpooling2d_4 (MaxPooling2D)    (None, 512, 14, 14)   0           convolution2d_10[0][0]
____________________________________________________________________________________________________
zeropadding2d_11 (ZeroPadding2D) (None, 512, 16, 16)   0           maxpooling2d_4[0][0]
____________________________________________________________________________________________________
convolution2d_11 (Convolution2D) (None, 512, 14, 14)   2359808     zeropadding2d_11[0][0]
____________________________________________________________________________________________________
zeropadding2d_12 (ZeroPadding2D) (None, 512, 16, 16)   0           convolution2d_11[0][0]
____________________________________________________________________________________________________
convolution2d_12 (Convolution2D) (None, 512, 14, 14)   2359808     zeropadding2d_12[0][0]
____________________________________________________________________________________________________
zeropadding2d_13 (ZeroPadding2D) (None, 512, 16, 16)   0           convolution2d_12[0][0]
____________________________________________________________________________________________________
convolution2d_13 (Convolution2D) (None, 512, 14, 14)   2359808     zeropadding2d_13[0][0]
____________________________________________________________________________________________________
maxpooling2d_5 (MaxPooling2D)    (None, 512, 7, 7)     0           convolution2d_13[0][0]
____________________________________________________________________________________________________
flatten_1 (Flatten)              (None, 25088)         0           maxpooling2d_5[0][0]
____________________________________________________________________________________________________
dense_4 (Dense)                  (None, 4096)          102764544   flatten_1[0][0]
____________________________________________________________________________________________________
dropout_1 (Dropout)              (None, 4096)          0           dense_4[0][0]
____________________________________________________________________________________________________
dense_5 (Dense)                  (None, 4096)          16781312    dropout_1[0][0]
____________________________________________________________________________________________________
dropout_2 (Dropout)              (None, 4096)          0           dense_5[0][0]
____________________________________________________________________________________________________
dense_6 (Dense)                  (None, 1000)          4097000     dropout_2[0][0]
====================================================================================================
Total params: 138357544
____________________________________________________________________________________________________


It looks like VGG starts with a Lambda layer, contains a number of ZeroPadding and Convolution layers with some MaxPooling sprinkled in, then a Flatten layer, then Dense > Dropout > Dense a couple of times.

Piece of cake, right?

Let's start at the top.

Lambda¶

The first layer in VGG is a Lambda layer, which the docs tell us can "[wrap an] arbitrary expression as a Layer object".

I believe what this means is "turn any function into a layer", and if we look at the VGG declaration, the line containing the Lambda layer is:

model.add(Lambda(vgg_preprocess, input_shape=(3,224,224), output_shape=(3,224,224)))

So input and output shapes are unchanged, but the input is transformed by a function called vgg_preprocess:

In [14]:
vgg_mean = np.array([123.68, 116.779, 103.939], dtype=np.float32).reshape((3,1,1))
def vgg_preprocess(x):
    """
    Subtracts the mean RGB value, and transposes RGB to BGR.
    The mean RGB was computed on the image set used to train the VGG model.

    Args:
        x: Image array (batch x channels x height x width)
    Returns:
        Image array (batch x transposed_channels x height x width)
    """
    x = x - vgg_mean
    return x[:, ::-1] # reverse axis RGB > BGR


It looks like vgg_preprocess does two things:

1. Subtracts the mean RGB value (of the ImageNet dataset) from our input

This step normalizes our inputs. If we had an input with values [100, 200, 300], subtracting the mean would give us [-100, 0, 100]. This is ostensibly easier for a neural network to work with because it keeps things within a certain range.

2. Transposes our input from RGB to BGR

Necessary because VGG was trained using Caffe (a deep learning framework), Caffe uses OpenCV (a computer vision library), and OpenCV works with images in BGR. So we have to reverse the order of the color channels in our RGB images to use them with VGG.
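Both steps can be tried out on a tiny made-up batch. This is just a sketch: the `vgg_mean` values are the ones from the cell above, while the 2x2 "image" and its pixel values are invented so the channel reordering is easy to see:

```python
import numpy as np

# Per-channel means (R, G, B), shaped so they broadcast over height and width
vgg_mean = np.array([123.68, 116.779, 103.939], dtype=np.float32).reshape((3, 1, 1))

# Hypothetical batch: 1 image, 3 channels, 2x2 pixels, one constant per channel
batch = np.zeros((1, 3, 2, 2), dtype=np.float32)
batch[0, 0] = 200.0  # R
batch[0, 1] = 150.0  # G
batch[0, 2] = 100.0  # B

centered = batch - vgg_mean  # step 1: subtract the mean of each channel
bgr = centered[:, ::-1]      # step 2: reverse the channel axis, RGB > BGR

print(centered[0, :, 0, 0])  # one pixel, channels in R, G, B order
print(bgr[0, :, 0, 0])       # the same pixel, channels in B, G, R order
```

Note that `x[:, ::-1]` reverses the second axis, which is the channel axis once a batch dimension is in front.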

So the second part is VGG-specific (since we're building a model from scratch, and we don't have to worry about multiple color channels), but let's set up a normalization step for MNIST:

In [15]:
def mnist_normalize(x):
    mnist_mean = x_trn.mean().astype(np.float32)
    mnist_stdev = x_trn.std().astype(np.float32)
    return (x - mnist_mean) / mnist_stdev


Notice that we go one step further than VGG when it comes to normalization - VGG only subtracts the mean, but we also divide our input by its standard deviation.

I don't know why VGG doesn't do this, but Wikipedia says dividing by standard deviation is a valid approach.
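As a quick sanity check of what "subtract the mean, divide by the standard deviation" does, here's a sketch on random stand-in data (the `pixels` array is invented, not MNIST):

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical stand-in for image data: integer values between 0 and 255
pixels = rng.randint(0, 256, size=(100, 1, 28, 28)).astype(np.float32)

mean = pixels.mean()
stdev = pixels.std()
standardized = (pixels - mean) / stdev

print(standardized.mean())  # very close to 0
print(standardized.std())   # very close to 1
```

After this step the inputs are centered on zero with a spread of about one, whatever range the raw pixel values had.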

In [16]:
from keras.layers.core import Lambda

model = Sequential()
model.add(Lambda(mnist_normalize, input_shape=(1, 28, 28)))


Input normalized!

Next up we have a repeating pattern of:

• ((ZeroPadding > Convolution) x2 > MaxPooling) x2
• ((ZeroPadding > Convolution) x3 > MaxPooling) x3

We're going to do an entire post on convolutions later on, but one way to think about a convolutional layer is as a layer that takes an image and runs image filters over it to highlight certain features. The filters start out randomly generated and are adjusted during training. The features they pick up can range from simple (like lines or circles) to complex (like textures or faces).

This post by Victor Powell contains a great way to visualize filters and their effects.

Because filters are smaller than the images they filter (a filter is usually 3x3 pixels), we need to create some empty space around the image for when a filter hits the edge of the image. The ZeroPadding layer adds a single, zero-valued pixel (essentially a black border) to the four edges of an image.

Let's add a couple of these to our model.

In [17]:
from keras.layers.convolutional import Convolution2D, ZeroPadding2D

model.add(
    ZeroPadding2D((1, 1)) # Number of pixels to add
)
model.add(
    Convolution2D(
        32, # Number of filters to use
        3, # Number of rows in the convolution kernel AKA filter height
        3, # Number of columns in the convolution kernel AKA filter width
        activation="relu"
    )
)
model.add(ZeroPadding2D((1, 1)))
model.add(Convolution2D(32, 3, 3, activation="relu"))


The current output of our model should be 32 filtered images, one for each of our 32 filters:

In [18]:
model.output_shape

Out[18]:
(None, 32, 28, 28)
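The 28x28 in that shape is the padding at work. Here's a rough NumPy sketch of a single filter sliding over a single image, with and without a one-pixel border of zeros. `conv2d_valid` is a hypothetical helper written for illustration, not what Keras runs, and it skips the kernel flip of a textbook convolution (the output shapes are the same either way):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding, so the output shrinks."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float32)
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the filter with the patch under it and sum the result
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.RandomState(0)
image = rng.rand(28, 28).astype(np.float32)  # made-up 28x28 image
kernel = rng.randn(3, 3).astype(np.float32)  # made-up 3x3 filter

padded = np.pad(image, 1, mode="constant")   # one-pixel border of zeros

print(conv2d_valid(image, kernel).shape)   # (26, 26)
print(padded.shape)                        # (30, 30)
print(conv2d_valid(padded, kernel).shape)  # (28, 28)
```

Without padding, every 3x3 convolution would shave two pixels off each dimension.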

Nonlinear Activation¶

The final, unexplained parameter in our Convolution2D layer (activation="relu") adds a nonlinear activation function (in this case, a rectified linear unit or ReLU) to the end of the layer.

ReLU computes the function f(x)=max(0, x) - in plain English, if x is larger than 0, pass it on, otherwise, pass on 0.
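In NumPy, ReLU is a one-liner. This is a sketch with made-up inputs, not the Keras implementation:

```python
import numpy as np

def relu(x):
    """Pass positive values through unchanged and replace the rest with 0."""
    return np.maximum(0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
activated = relu(x)
print(activated)
```

The negative inputs come out as 0 and the positive ones are untouched.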

The interaction of linear and nonlinear layers in a neural network is actually explained by the intimidatingly-named universal approximation theorem but we're going to save that for another day (because I don't actually understand it).

MaxPooling¶

It might sort of make sense how a 3x3 image filter can detect something small like a corner or a line, but how can such a tiny filter detect a face? A face usually contains elements that would be difficult to express in a 9-pixel space.

There are two ways we can think about solving the problem: we can increase the size of the filter, or we can reduce the resolution of the image. MaxPooling helps us do the latter.

In [19]:
from keras.layers.pooling import MaxPooling2D

model.add(MaxPooling2D((2, 2))) # Factors by which to downsize vertically and horizontally


We can see that the MaxPooling layer cuts the resolution of our images in half:

In [20]:
model.output_shape

Out[20]:
(None, 32, 14, 14)
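Here's a small sketch of what 2x2 max pooling does to a single image. `max_pool_2x2` is a hypothetical helper that assumes the height and width are both even; it is not how Keras implements the layer:

```python
import numpy as np

def max_pool_2x2(image):
    """Keep the largest value from each non-overlapping 2x2 block."""
    h, w = image.shape
    # Split rows and columns into pairs, then take the max over each pair
    blocks = image.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

image = np.arange(16).reshape(4, 4)
pooled = max_pool_2x2(image)

print(image)
print(pooled)  # [[ 5  7]
               #  [13 15]]
```

A 4x4 image becomes 2x2, in the same way our 28x28 feature maps become 14x14.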

As the image gets smaller through successive MaxPooling layers, each filter covers a larger share of the original image, so our filters get better at taking the position of an element within the photo into account.

Because each filter is now looking at a larger part of the image, even a 3x3 filter can meaningfully evaluate the position of elements like eyes, noses, and mouths relative to each other.

Combining the use of filters with high and low-resolution images means we can build a model that:

1. Can find features everywhere (on a high-resolution image)
2. Cares about how features relate to each other positionally (on a low-resolution image)

To preserve the amount of information propagating through the network, we double the number of filters after each MaxPooling layer:

In [21]:
model.add(ZeroPadding2D((1, 1)))
model.add(Convolution2D(64, 3, 3, activation="relu"))
model.add(ZeroPadding2D((1, 1)))
model.add(Convolution2D(64, 3, 3, activation="relu"))
model.add(MaxPooling2D((2, 2)))

In [22]:
model.output_shape

Out[22]:
(None, 64, 7, 7)

The preservation is not perfect, of course - in theory we should 4x the number of image filters because both image height and image width are being reduced by a factor of 2. It's not clear to me why this isn't done.
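The arithmetic behind that claim can be written out by counting the values each layer outputs per image. `activations` is a made-up helper, and the 128-filter case is hypothetical (it's the "4x" option we didn't take):

```python
def activations(shape):
    """Number of values in one (channels, height, width) output."""
    channels, height, width = shape
    return channels * height * width

before_pool = activations((32, 28, 28))  # 25088
after_pool = activations((32, 14, 14))   # 6272
doubled = activations((64, 14, 14))      # 12544
quadrupled = activations((128, 14, 14))  # 25088

print(before_pool, after_pool, doubled, quadrupled)
```

Pooling cuts the count to a quarter, doubling the filters brings it back to half, and only quadrupling would restore the original number.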

I'm also not sure why a final MaxPooling layer is added.

We can quickly check in with the state of our model at this time:

In [23]:
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to
====================================================================================================
lambda_2 (Lambda)                (None, 1, 28, 28)     0           lambda_input_2[0][0]
____________________________________________________________________________________________________
zeropadding2d_14 (ZeroPadding2D) (None, 1, 30, 30)     0           lambda_2[0][0]
____________________________________________________________________________________________________
convolution2d_14 (Convolution2D) (None, 32, 28, 28)    320         zeropadding2d_14[0][0]
____________________________________________________________________________________________________
zeropadding2d_15 (ZeroPadding2D) (None, 32, 30, 30)    0           convolution2d_14[0][0]
____________________________________________________________________________________________________
convolution2d_15 (Convolution2D) (None, 32, 28, 28)    9248        zeropadding2d_15[0][0]
____________________________________________________________________________________________________
maxpooling2d_6 (MaxPooling2D)    (None, 32, 14, 14)    0           convolution2d_15[0][0]
____________________________________________________________________________________________________
zeropadding2d_16 (ZeroPadding2D) (None, 32, 16, 16)    0           maxpooling2d_6[0][0]
____________________________________________________________________________________________________
convolution2d_16 (Convolution2D) (None, 64, 14, 14)    18496       zeropadding2d_16[0][0]
____________________________________________________________________________________________________
zeropadding2d_17 (ZeroPadding2D) (None, 64, 16, 16)    0           convolution2d_16[0][0]
____________________________________________________________________________________________________
convolution2d_17 (Convolution2D) (None, 64, 14, 14)    36928       zeropadding2d_17[0][0]
____________________________________________________________________________________________________
maxpooling2d_7 (MaxPooling2D)    (None, 64, 7, 7)      0           convolution2d_17[0][0]
====================================================================================================
Total params: 64992
____________________________________________________________________________________________________


And it looks kinda like VGG.
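The parameter counts in the summary can be checked by hand. Assuming each filter has one weight per input channel per kernel position, plus one bias, a sketch looks like this (`conv_params` is a hypothetical helper):

```python
def conv_params(in_channels, filters, kernel_h=3, kernel_w=3):
    """Weights plus one bias per filter for a 2D convolution layer."""
    return (kernel_h * kernel_w * in_channels + 1) * filters

counts = [
    conv_params(1, 32),   # first convolution:  320
    conv_params(32, 32),  # second convolution: 9248
    conv_params(32, 64),  # third convolution:  18496
    conv_params(64, 64),  # fourth convolution: 36928
]

print(counts)
print(sum(counts))  # 64992
```

The padding and pooling layers contribute nothing, which matches the zeros in the Param # column.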

Flatten¶

It's time to make the change from convolutional layers to dense ones, but to do that we need to flatten our input. Remember that Keras error from way back?

expected dense_input_1 to have 2 dimensions, but got array with shape (60000, 1, 28, 28)

In [24]:
model.output_shape

Out[24]:
(None, 64, 7, 7)

In [25]:
from keras.layers.core import Flatten

model.add(Flatten())


In [26]:
model.output_shape

Out[26]:
(None, 3136)

64 * 7 * 7 = 3,136. Ta-dah!
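In NumPy terms, flattening is just a reshape that keeps the batch axis and merges everything else. A sketch with a made-up batch of two images:

```python
import numpy as np

# Hypothetical batch: 2 images, each with 64 feature maps of 7x7
feature_maps = np.zeros((2, 64, 7, 7), dtype=np.float32)

# Keep the first (batch) axis, merge the remaining axes into one
flat = feature_maps.reshape(feature_maps.shape[0], -1)

print(flat.shape)  # (2, 3136)
```

Now each image is a single row of 3,136 numbers, which is the 2-dimensional input a Dense layer expects.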

Dense & Dropout¶

Finally, we can add in our Dense layers and they will work as expected:

In [27]:
model.add(Dense(512, activation="relu"))


Recall that we don't have to specify input_dim=3136 because we're adding to an existing model instead of building a new one with Dense as the first layer.

Why 512? There doesn't actually seem to be a good rule about how to pick output dimensions (this forum response by Jeremy Howard calls it more art than science). Maybe one day!

VGG also adds a layer called Dropout at this point. Dropout is a technique for reducing overfitting, but I don't think we have a problem with overfitting yet, so we're going to ignore it for now.

One final layer with its output dimension set to 10 (one for each of the 10 possibilities in the MNIST data)...

In [28]:
model.add(Dense(10, activation="softmax"))


... And we're done!

Performance¶

Let's compile the model and see how it does.

In [29]:
model.compile(
    loss="categorical_crossentropy",
    optimizer="sgd",
    metrics=["accuracy"]
)


It's a good idea to turn our data into batches, so we can feed it to the model a few images at a time instead of all at once:

In [30]:
from keras.preprocessing import image

gen = image.ImageDataGenerator()
trn_batches = gen.flow(x_trn, y_trn, batch_size=32)
val_batches = gen.flow(x_val, y_val, batch_size=32)
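Conceptually, a batch generator just hands out slices of the data. Here's a bare-bones sketch; `iterate_batches` is a hypothetical helper, and unlike the Keras generator it doesn't shuffle, augment, or loop forever:

```python
import numpy as np

def iterate_batches(x, y, batch_size=32):
    """Yield (inputs, labels) slices of at most `batch_size` samples."""
    for start in range(0, len(x), batch_size):
        end = start + batch_size
        yield x[start:end], y[start:end]

# Made-up data: 100 blank images with 100 blank one-hot labels
x = np.zeros((100, 1, 28, 28), dtype=np.float32)
y = np.zeros((100, 10), dtype=np.float32)

batch_sizes = [len(xb) for xb, yb in iterate_batches(x, y)]
print(batch_sizes)  # [32, 32, 32, 4]
```

The last batch is smaller because 100 doesn't divide evenly by 32.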

In [31]:
model.fit_generator(
    trn_batches,
    trn_batches.N,
    nb_epoch=3,
    validation_data=val_batches,
    nb_val_samples=val_batches.N
)

Epoch 1/3
60000/60000 [==============================] - 50s - loss: 0.1761 - acc: 0.9445 - val_loss: 0.0546 - val_acc: 0.9829
Epoch 2/3
60000/60000 [==============================] - 50s - loss: 0.0527 - acc: 0.9834 - val_loss: 0.0366 - val_acc: 0.9872
Epoch 3/3
60000/60000 [==============================] - 50s - loss: 0.0347 - acc: 0.9896 - val_loss: 0.0318 - val_acc: 0.9893

Out[31]:
<keras.callbacks.History at 0x7fdd27f0eed0>

That's 98%, almost 99% accuracy! With a model that was built using Keras components but trained from scratch.

Our score doesn't quite get us on the leaderboard, but next time we're going to look at a couple of techniques that will.

And finally, here's our model from scratch, all in one place:

In [32]:
model = Sequential([
Lambda(mnist_normalize, input_shape=(1,28,28)),
Convolution2D(32, 3, 3, activation="relu"),
Convolution2D(32, 3, 3, activation="relu"),
MaxPooling2D(),