#!/usr/bin/env python
# coding: utf-8

# ## Neural Network Layers

# This notebook is going to lean pretty heavily on [Keras](https://keras.io/), so it's worth taking a minute to understand a little bit about what Keras is.
#
# According to its own documentation, Keras is "a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano."
#
# So while TensorFlow, for example, is an API that helps with machine learning in a broader sense, Keras was written specifically to help with neural networks.
#
# Let's see this in action.

# ## Get training & test data

# Remember how we called MNIST a famous dataset? MNIST is so famous, Keras contains [its own copy](https://keras.io/datasets/#mnist-database-of-handwritten-digits)!

# In[1]:

from keras.datasets import mnist

(x_trn, y_trn), (x_val, y_val) = mnist.load_data()

# `x_trn` contains 60,000 28x28 images:

# In[2]:

x_trn.shape

# We have to change the shape of our inputs a bit - Keras expects images with multiple color channels by default, but we can just add a single channel for our black and white images:

# In[3]:

import numpy as np

x_trn = np.expand_dims(x_trn, 1)
x_val = np.expand_dims(x_val, 1)

# In[4]:

x_trn.shape

# And `y_trn` contains the labels:

# In[5]:

y_trn[:5]

# Just like last time, this data should be in one-hot encoded format, but we can just use the provided Keras function this time.

# In[6]:

from keras.utils.np_utils import to_categorical

y_trn = to_categorical(y_trn)
y_val = to_categorical(y_val)

# In[7]:

y_val[:3]

# ## Linear

# The simplest type of Keras model is a `Sequential` model.

# In[8]:

from keras.models import Sequential

model = Sequential()

# It's called "Sequential" because it contains a linear stack - a "sequence" - of layers.

# In[9]:

from keras.layers import Dense

model.add(Dense(10, input_dim=28))

# We should know by now that what we're doing here is telling the first layer in our model to take an input of size (, 28) and produce an output of size (, 10).
#
# We can stack as many layers as we like:

# In[10]:

model.add(Dense(5))
model.add(Dense(2))

# And we no longer have to specify input dimensions, because each layer just takes the output of the previous layer.
#
# When we're done here we can compile the model:

# In[11]:

model.compile(
    loss="categorical_crossentropy",
    optimizer="sgd",
    metrics=["accuracy"]
)

# Can we run this on MNIST?

# In[12]:

model.fit(x_trn, y_trn, nb_epoch=3, batch_size=32)

# I guess not. Let's have a look at the error message:
#
# > `expected dense_input_1 to have 2 dimensions, but got array with shape (60000, 28, 28)`
#
# Hey, that makes sense! We told our model to expect an input of size (, 28), but instead we gave it the entire (60000, 28, 28) MNIST dataset.
#
# We could try changing the first layer in our model to something like `model.add(Dense(10, input_shape=(28, 28)))`, but it wouldn't be a very good model (and Keras will spit an error back at us anyway).
#
# Remember on [Day Twelve](http://theianchanc.com/one-data-science-a-day/day-twelve/) when we said our [spreadsheet neural network](https://docs.google.com/spreadsheets/d/1fXL-hSkdDZaca4Wc7Q7x5wYTlbYxsjgrXvjNk3woKxU/edit?usp=sharing) wasn't a real neural network because it was just a bunch of linear matrix multiplications?
#
# Right now, that's all our Keras model is. Each `Dense` layer is the equivalent of a single matrix multiplication in the spreadsheet.
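# Just to convince ourselves: a stack of linear layers collapses into a single linear layer. Here's a minimal NumPy sketch (ignoring biases, and using random stand-in weight matrices rather than the model's actual weights):

x = np.random.rand(28)       # one fake input row of size (, 28)
W1 = np.random.rand(28, 10)  # stand-in weights for Dense(10)
W2 = np.random.rand(10, 5)   # stand-in weights for Dense(5)

stacked = x.dot(W1).dot(W2)  # pass the input through two "layers"
single = x.dot(W1.dot(W2))   # one multiplication by a single 28x5 matrix

np.allclose(stacked, single)  # True - two linear layers are equivalent to one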
# ## Great artists steal

# Okay, so if we don't want just a bunch of linear layers, what do we want? Since we don't have any strong opinions about what the architecture of a neural network for image recognition should look like, let's just see what the VGG model looks like and start from there:

# In[13]:

from vgg16 import Vgg16

vgg = Vgg16()
vgg.model.summary()

# It looks like VGG starts with a Lambda layer, contains a number of ZeroPadding and Convolution layers with some MaxPooling sprinkled in, then a Flatten layer, then Dense > Dropout > Dense a couple of times.
#
# Piece of cake, right?
#
# Let's start at the top.

# ## Lambda

# The first layer in VGG is a Lambda layer, which the docs tell us can "[wrap an] arbitrary expression as a Layer object".
#
# I *believe* what this means is "turn any function into a layer", and if we look at the VGG declaration, the line containing the Lambda layer is:
#
# > `model.add(Lambda(vgg_preprocess, input_shape=(3,224,224), output_shape=(3,224,224)))`
#
# So input and output shapes are unchanged, but the input is transformed by a function called `vgg_preprocess`:

# In[14]:

vgg_mean = np.array([123.68, 116.779, 103.939], dtype=np.float32).reshape((3, 1, 1))

def vgg_preprocess(x):
    """
    Subtracts the mean RGB value, and transposes RGB to BGR.
    The mean RGB was computed on the image set used to train the VGG model.

    Args:
        x: Image array (height x width x channels)
    Returns:
        Image array (height x width x transposed_channels)
    """
    x = x - vgg_mean
    return x[:, ::-1]  # reverse axis RGB > BGR

# It looks like `vgg_preprocess` does two things:
#
# **1. Subtracts the mean RGB value (of the ImageNet dataset) from our input**
#
# This step normalizes our inputs. If we had an input with values `[100, 200, 300]`, subtracting the mean would give us `[-100, 0, 100]`. This is ostensibly easier for a neural network to work with because it keeps things within a certain range.
#
# **2. Transposes our input from RGB to BGR**
#
# Necessary because [VGG was trained using Caffe](https://github.com/jcjohnson/neural-style/issues/207#issuecomment-210287465) (a deep learning framework), Caffe uses OpenCV (a computer vision library), and OpenCV works with images in BGR. So we have to reverse the order of the color channels in our RGB images to use them with VGG.
#
# The second part is VGG-specific (and since we're building a model from scratch with single-channel images, we don't have to worry about color channel order), but let's set up a normalization step for MNIST:

# In[15]:

def mnist_normalize(x):
    # Normalize to zero mean and unit standard deviation
    mnist_mean = x_trn.mean().astype(np.float32)
    mnist_stdev = x_trn.std().astype(np.float32)
    return (x - mnist_mean) / mnist_stdev

# Notice that we go one step further than VGG when it comes to normalization - VGG only subtracts the mean, but we also divide our input by its standard deviation.
#
# I don't know why VGG doesn't do this, but [Wikipedia says][1] dividing by standard deviation is a valid approach.
#
# [1]: https://en.wikipedia.org/wiki/Normalization_(statistics)#Examples

# In[16]:

from keras.layers.core import Lambda

model = Sequential()
model.add(Lambda(mnist_normalize, input_shape=(1, 28, 28)))

# Input normalized!

# ## ZeroPadding & Convolution

# Next up we have a repeating pattern of:
#
# * ((ZeroPadding > Convolution) x2 > MaxPooling) x2
# * ((ZeroPadding > Convolution) x3 > MaxPooling) x3
#
# We're going to do an entire post on convolutions later on, but one way to think about a convolutional layer is as **a layer that takes an image and runs randomly generated image filters over it to highlight certain features**. These features can range from simple (like lines or circles) to complex (like textures or faces).
#
# [This post](http://setosa.io/ev/image-kernels/) by Victor Powell contains a great way to visualize filters and their effects.
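# To make that concrete, here's a minimal sketch of what "running a filter over an image" means, in plain NumPy. The `convolve2d` helper is written here just for this demo, and the 3x3 kernel is a hand-picked edge detector for illustration only - the filters in a Convolution layer start out random and are learned during training:

def convolve2d(img, kernel):
    """Slide `kernel` over `img`, summing the element-wise products at each position."""
    kh, kw = kernel.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i + kh, j:j + kw] * kernel).sum()
    return out

edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]], dtype=np.float32)

# Filtering one MNIST digit highlights the edges of its strokes
filtered = convolve2d(x_trn[0, 0].astype(np.float32), edge_kernel)
filtered.shape  # (26, 26) - a little smaller than 28x28, which is where ZeroPadding comes in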
# Because filters are smaller than the images they filter (a filter is usually 3x3 pixels), we need to create some empty space around the image for when a filter hits the edge of the image. **The ZeroPadding layer adds a single zero-valued pixel (essentially a black border) to each of the four edges of an image**.
#
# Let's add a couple of these to our model.

# In[17]:

from keras.layers.convolutional import Convolution2D, ZeroPadding2D

model.add(
    ZeroPadding2D((1, 1))  # Number of pixels to add
)
model.add(
    Convolution2D(
        32,  # Number of filters to use
        3,   # Number of rows in the convolution kernel AKA filter height
        3,   # Number of columns in the convolution kernel AKA filter width
        activation="relu"
    )
)
model.add(ZeroPadding2D((1, 1)))
model.add(Convolution2D(32, 3, 3, activation="relu"))

# The current output of our model should be 32 filtered images, one for each of our 32 filters:

# In[18]:

model.output_shape

# ## Nonlinear Activation

# The final, unexplained parameter in our Convolution2D layer (`activation="relu"`) adds a nonlinear activation function (in this case, a rectified linear unit or ReLU) to the end of the layer.
#
# ReLU computes the function `f(x) = max(0, x)` - in plain English, if x is larger than 0, pass it on; otherwise, pass on 0 (so `[-2, 3, -1]` would become `[0, 3, 0]`).
#
# The interaction of linear and nonlinear layers in a neural network is actually explained by the intimidatingly-named [universal approximation theorem](https://en.wikipedia.org/wiki/Universal_approximation_theorem), but we're going to save that for another day (because I don't actually understand it).

# ## MaxPooling

# It might sort of make sense how a 3x3 image filter can detect something small like a corner or a line, but how can such a tiny filter detect a face? A face usually contains elements that would be difficult to express in a 9-pixel space.
#
# There are two ways we can think about solving the problem: we can increase the size of the filter, or we can reduce the resolution of the image. MaxPooling helps us do the latter.

# In[19]:

from keras.layers.pooling import MaxPooling2D

model.add(MaxPooling2D((2, 2)))  # Factors by which to downsize vertically and horizontally

# We can see that the MaxPooling layer cuts the resolution of our images in half:

# In[20]:

model.output_shape
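# To see what MaxPooling actually does to the values, here's a minimal NumPy sketch of 2x2 max pooling on a small made-up array - each non-overlapping 2x2 block of the input is replaced by its largest value:

a = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 1, 2, 3],
              [5, 6, 4, 0]], dtype=np.float32)

# Reshape the 4x4 array into 2x2 blocks, then take the max of each block
pooled = a.reshape(2, 2, 2, 2).max(axis=(1, 3))
pooled  # array([[4., 8.], [9., 4.]])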
# As the image gets smaller through successive MaxPooling layers, our filters become increasingly able to place importance on the **position** of an element within the image.
#
# Because each filter is now looking at a larger part of the image, **even a 3x3 filter can meaningfully evaluate the position of elements like eyes, noses, and mouths relative to each other**.
#
# Combining the use of filters with high and low-resolution images means we can build a model that:
#
# 1. Can find features everywhere (on a high-resolution image)
# 2. Cares about how features relate to each other positionally (on a low-resolution image)
#
# To preserve the amount of information propagating through the network, we double the number of filters after each MaxPooling layer:

# In[21]:

model.add(ZeroPadding2D((1, 1)))
model.add(Convolution2D(64, 3, 3, activation="relu"))
model.add(ZeroPadding2D((1, 1)))
model.add(Convolution2D(64, 3, 3, activation="relu"))
model.add(MaxPooling2D((2, 2)))

# In[22]:

model.output_shape

# The preservation is not perfect, of course - in theory we should 4x the number of image filters, because both image height and image width are being reduced by a factor of 2. It's not clear to me why this isn't done.
#
# I'm also not sure why a final MaxPooling layer is added.

# We can quickly check in on the state of our model at this point:

# In[23]:

model.summary()

# And it looks *kinda* like VGG.

# ## Flatten

# It's time to make the change from convolutional layers to dense ones, but to do that we need to flatten our input. Remember that Keras error from way back?
#
# > `expected dense_input_1 to have 2 dimensions, but got array with shape (60000, 28, 28)`

# In[24]:

model.output_shape

# In[25]:

from keras.layers.core import Flatten

model.add(Flatten())

# In[26]:

model.output_shape

# 64 \* 7 \* 7 = 3,136. Ta-dah!

# ## Dense & Dropout

# Finally, we can add in our Dense layers and they will work as expected:

# In[27]:

model.add(Dense(512, activation="relu"))

# Recall that we don't have to specify `input_dim=3136`, because we're adding to an existing model instead of building a new one with Dense as the first layer.
#
# Why 512? There doesn't actually seem to be a good rule about how to pick output dimensions ([this forum response][1] by Jeremy Howard calls it more art than science). Maybe one day!
#
# VGG also adds a layer called Dropout at this point. Dropout is a technique for reducing overfitting, but I don't think we have a problem with overfitting yet, so we're going to ignore it for now.
#
# One final layer with its output dimension set to 10 (one for each of the 10 possibilities in the MNIST data)...
#
# [1]: http://forums.fast.ai/t/output-dimensions-for-dense-layers/847

# In[28]:

model.add(Dense(10, activation="softmax"))

# ... And we're done!

# ## Performance

# Let's compile the model and see how it does.

# In[29]:

model.compile(
    loss="categorical_crossentropy",
    optimizer="sgd",
    metrics=["accuracy"]
)

# It's a good idea to set up generators for our data, so we can feed it to the model in batches instead of all at once:

# In[30]:

from keras.preprocessing import image

gen = image.ImageDataGenerator()
trn_batches = gen.flow(x_trn, y_trn, batch_size=32)
val_batches = gen.flow(x_val, y_val, batch_size=32)

# In[31]:

model.fit_generator(
    trn_batches,
    trn_batches.N,
    nb_epoch=3,
    validation_data=val_batches,
    nb_val_samples=val_batches.N
)

# That's 98%, almost 99% accuracy - with a model that was built using Keras components but trained **from scratch**.
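# As a quick sanity check, we can also eyeball a few individual predictions. A minimal sketch: `model.predict` returns the softmax probabilities for each class, and `argmax` picks out the most likely digit (and undoes the one-hot encoding on the labels):

probs = model.predict(x_val[:5], batch_size=32)  # shape (5, 10): one probability per digit per image
predicted = probs.argmax(axis=1)
actual = y_val[:5].argmax(axis=1)
predicted, actual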
# Our score doesn't *quite* get us on [the leaderboard](http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html), but next time we're going to look at a couple of techniques that will.

# And finally, here's our model from scratch, all in one place:

# In[32]:

model = Sequential([
    Lambda(mnist_normalize, input_shape=(1, 28, 28)),
    ZeroPadding2D(),
    Convolution2D(32, 3, 3, activation="relu"),
    ZeroPadding2D(),
    Convolution2D(32, 3, 3, activation="relu"),
    MaxPooling2D(),
    ZeroPadding2D(),
    Convolution2D(64, 3, 3, activation="relu"),
    ZeroPadding2D(),
    Convolution2D(64, 3, 3, activation="relu"),
    MaxPooling2D(),
    Flatten(),
    Dense(512, activation="relu"),
    Dense(10, activation="softmax")
])

# Notice how we were able to drop a lot of the parameters for ZeroPadding and MaxPooling? That's because the defaults for both layers already contain exactly what we need!
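# If we wanted to train this consolidated version, we'd compile and fit it exactly the same way as before - the same calls we used above:

model.compile(
    loss="categorical_crossentropy",
    optimizer="sgd",
    metrics=["accuracy"]
)

model.fit_generator(
    trn_batches,
    trn_batches.N,
    nb_epoch=3,
    validation_data=val_batches,
    nb_val_samples=val_batches.N
)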