Code samples adapted from Francois Chollet: https://github.com/fchollet/deep-learning-with-python-notebooks/
SMU Contributors: Ian Johnson, Eric C. Larson
Deep learning models are frequently treated with a "black-box" mentality: users care only about the outputs and rarely look inside to see what is going on. Since convolutional neural networks are, abstractly speaking, "representations of visual concepts" (Chollet), they lend themselves well to visualization.
There are numerous ways to visualize the underlying structure and state of a CNN. Chollet explores three particularly useful and intuitive techniques for visualizing convolutional networks:

1. Visualizing intermediate activations (the feature maps each layer outputs for a given input)
2. Visualizing convnet filters, by synthesizing an input image that maximally activates each filter
3. Visualizing heatmaps of class activation (Grad-CAM), which show which parts of an image drive a classification
The following visualization code is adapted from Chollet's Jupyter Notebooks for his book: https://github.com/fchollet/deep-learning-with-python-notebooks/
import keras
keras.__version__
from keras.models import load_model
model = load_model('models/cats_and_dogs_small_2.h5')
model.summary() # As a reminder.
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_21 (Conv2D)           (None, 148, 148, 32)      896
_________________________________________________________________
max_pooling2d_21 (MaxPooling (None, 74, 74, 32)        0
_________________________________________________________________
conv2d_22 (Conv2D)           (None, 72, 72, 64)        18496
_________________________________________________________________
max_pooling2d_22 (MaxPooling (None, 36, 36, 64)        0
_________________________________________________________________
conv2d_23 (Conv2D)           (None, 34, 34, 128)       73856
_________________________________________________________________
max_pooling2d_23 (MaxPooling (None, 17, 17, 128)       0
_________________________________________________________________
conv2d_24 (Conv2D)           (None, 15, 15, 128)       147584
_________________________________________________________________
max_pooling2d_24 (MaxPooling (None, 7, 7, 128)         0
_________________________________________________________________
flatten_6 (Flatten)          (None, 6272)              0
_________________________________________________________________
dropout_3 (Dropout)          (None, 6272)              0
_________________________________________________________________
dense_11 (Dense)             (None, 512)               3211776
_________________________________________________________________
dense_12 (Dense)             (None, 1)                 513
=================================================================
Total params: 3,453,121
Trainable params: 3,453,121
Non-trainable params: 0
_________________________________________________________________
Here we load an image to use for our visualization. Feel free to try an image of your own choosing.
img_url = 'https://raw.githubusercontent.com/8000net/LectureNotes/master/images/dog.jpg'
# We preprocess the image into a 4D tensor
from keras.preprocessing import image
import numpy as np
import requests
from io import BytesIO
from PIL import Image
def load_image_as_array(url, size=(150, 150)):
response = requests.get(url)
img = Image.open(BytesIO(response.content))
img = img.resize(size)
return np.array(img).astype(float)
img_tensor = load_image_as_array(img_url)
img_tensor = np.expand_dims(img_tensor, axis=0)
# Remember that the model was trained on inputs
# that were preprocessed in the following way:
img_tensor /= 255.
# Its shape is (1, 150, 150, 3)
print(img_tensor.shape)
import matplotlib.pyplot as plt
plt.imshow(img_tensor[0])
plt.show()
(1, 150, 150, 3)
Given an arbitrary input to a CNN, we can visualize the activation of each layer of the network by displaying the feature maps that each layer outputs. For each layer, there is a 3-dimensional feature map which can be visualized as a set of 2D images, one per channel. The resulting images are representations of the filter activations of individual channels in a convolutional or pooling layer. Chollet uses the example of a cat image to visualize the contents of a CNN trained to discriminate between cats and dogs; here we use a dog image instead.
Now that we've loaded an image to visualize the network with, we can create a Model that accepts inputs of image batches and returns outputs of all the layers of the original network.
"To do this, we will use the Keras class Model. A Model is instantiated using two arguments: an input tensor (or list of input tensors), and an output tensor (or list of output tensors). The resulting class is a Keras model, just like the Sequential models that you are familiar with, mapping the specified inputs to the specified outputs. What sets the Model class apart is that it allows for models with multiple outputs, unlike Sequential. For more information about the Model class, see Chapter 7, Section 1." (Chollet)
from keras import models
# Extracts the outputs of the first 8 layers:
layer_outputs = [layer.output for layer in model.layers[:8]]
# Creates a model that will return these outputs, given the model input:
activation_model = models.Model(inputs=model.input, outputs=layer_outputs)
Notice that this is not a traditional Keras model, in that it has multiple outputs. Generally speaking, Keras supports an arbitrary number of inputs and outputs per model, but we have used one-input, one-output models up to this point.
# This will return a list of 8 Numpy arrays:
# one array per layer activation
activations = activation_model.predict(img_tensor)
first_layer_activation = activations[0]
print(first_layer_activation.shape)
(1, 148, 148, 32)
Using this multi-output model, we can visualize the activation of any arbitrary channel of any layer of the network. For example, here is a channel of the first layer that appears to function as a diagonal edge detector.
import matplotlib.pyplot as plt
plt.matshow(first_layer_activation[0, :, :, 29], cmap='viridis')
plt.show()
This can be extended to visualize every channel of every layer in the network, which gives us eyes into the black box of the convolutional cats-and-dogs network. The following code (Chollet) plots every single channel side-by-side for each layer of the network.
import keras
# These are the names of the layers, so we can have them as part of our plot
layer_names = []
for layer in model.layers[:8]:
layer_names.append(layer.name)
images_per_row = 16
# Now let's display our feature maps
for layer_name, layer_activation in zip(layer_names, activations):
# This is the number of features in the feature map
n_features = layer_activation.shape[-1]
# The feature map has shape (1, size, size, n_features)
size = layer_activation.shape[1]
# We will tile the activation channels in this matrix
n_cols = n_features // images_per_row
display_grid = np.zeros((size * n_cols, images_per_row * size))
# We'll tile each filter into this big horizontal grid
for col in range(n_cols):
for row in range(images_per_row):
channel_image = layer_activation[0,
:, :,
col * images_per_row + row]
# Post-process the feature to make it visually palatable
channel_image -= channel_image.mean()
channel_image /= (channel_image.std() + 1e-5)  # epsilon avoids dividing by zero on dead channels
channel_image *= 64
channel_image += 128
channel_image = np.clip(channel_image, 0, 255).astype('uint8')
display_grid[col * size : (col + 1) * size,
row * size : (row + 1) * size] = channel_image
# Display the grid
scale = 1. / size
plt.figure(figsize=(scale * display_grid.shape[1],
scale * display_grid.shape[0]))
plt.title(layer_name)
plt.grid(False)
plt.imshow(display_grid, aspect='auto', cmap='viridis')
plt.show()
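The double loop that tiles channels into display_grid can also be written as one reshape/transpose, which is convenient when experimenting with grid layouts. A minimal numpy sketch on a dummy activation (the per-channel normalization from above is omitted; only the tiling is shown):

```python
import numpy as np

images_per_row = 16
# Dummy activation with shape (1, size, size, n_features)
layer_activation = np.random.rand(1, 8, 8, 32)
n_features = layer_activation.shape[-1]
size = layer_activation.shape[1]
n_cols = n_features // images_per_row

# Split the channel axis into (n_cols, images_per_row), then move the
# grid axes ahead of the pixel axes and flatten into one big 2D image
grid = (layer_activation[0]
        .reshape(size, size, n_cols, images_per_row)
        .transpose(2, 0, 3, 1)
        .reshape(n_cols * size, images_per_row * size))

# Channel (col * images_per_row + row) lands in block (col, row),
# matching the explicit double loop above
print(grid.shape)  # (16, 128)
```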
Notice that as we move downward through this figure, into deeper layers of the network, the activations retain less and less of the original input content. The early layers' channels act mostly as edge detectors and the like, while later layers retain very little of the original form of the image. Moreover, some later-layer channels do not activate at all. For example, in max_pooling2d_23, there are two channels with no activation, which indicates that the patterns those channels encode are not present in the source image.
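Dead channels like the ones in max_pooling2d_23 can also be found programmatically by checking each channel's peak activation. A small numpy sketch (in this notebook, `activations[5]` would hold the max_pooling2d_23 output; a toy array stands in here):

```python
import numpy as np

def dead_channels(layer_activation, tol=1e-8):
    """Indices of channels that never activate anywhere in the feature map.

    layer_activation: array of shape (1, height, width, n_channels).
    """
    peak = np.abs(layer_activation).max(axis=(0, 1, 2))  # per-channel peak
    return np.where(peak <= tol)[0].tolist()

# Toy stand-in: zero out channels 1 and 3 of a random activation map
demo = np.random.rand(1, 17, 17, 4)
demo[..., 1] = 0.0
demo[..., 3] = 0.0
print(dead_channels(demo))  # [1, 3]
```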
It is important to note that, as you move deeper into the network, the features that activate a given channel become more and more abstract. Chollet describes this behaviour by comparing a convolutional neural network to an information distillation pipeline, which iteratively transforms raw data so that irrelevant information is "filtered out and useful information is magnified."
I encourage you to try your own image of a dog or cat and examine the activations with that image. Spend some time looking at what all the activations look like. Can you identify any high-level abstract concepts that are identified by later-level layers? Perhaps dog ears or cat eyes?
Now let's apply this analysis to a live webcam feed with VGG. You can run the following scripts in Python, assuming you have OpenCV installed.
cd activation-demo
python Activations.py
We can also perform the inverse operation, in some sense, by synthesizing images that maximize the response of a specific filter. This lets us visualize the pattern that a given channel responds to. It is done with gradient ascent: starting from a gray image with random noise, we repeatedly adjust the input image in the direction of the gradient of the filter's activation. The final image, if the ascent converges well, maximally activates the chosen filter.
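Gradient ascent is simply gradient descent with the update sign flipped: we step along the gradient to increase an objective, rather than against it to decrease a loss. A minimal one-dimensional illustration, maximizing f(x) = -(x - 3)^2 from a starting point of 0 (standing in for the initial image):

```python
# Maximize f(x) = -(x - 3)^2, whose unique maximum is at x = 3
def grad_f(x):
    return -2.0 * (x - 3.0)  # df/dx

x = 0.0     # analogous to the initial image
step = 0.1  # analogous to the gradient-ascent step size used below
for _ in range(100):
    x += step * grad_f(x)  # ascent: add the gradient (descent would subtract)

print(round(x, 3))  # 3.0, the maximizer
```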
Chollet performs this with the following code, which uses gradient ascent to synthesize an image that maximally activates an arbitrary filter.
from keras.applications import VGG16
from keras import backend as K
# Load the pre-trained VGG16 model
model = VGG16(weights='imagenet', include_top=False)
# Selecting a layer and channel to visualize
layer_name = 'block3_conv1'
filter_index = 0
# Isolate the output and loss for the given channel
layer_output = model.get_layer(layer_name).output
loss = K.mean(layer_output[:, :, :, filter_index])
# We take the gradient of this loss using keras backend.gradients
grads = K.gradients(loss, model.input)[0]
# Before performing gradient descent, we divide the gradient tensor by its L2 norm (square root
# of the mean of the square of values in the tensor). We add a small epsilon term to the L2 norm
# to avoid division by zero.
grads /= (K.sqrt(K.mean(K.square(grads))) + 1e-5)
# We use a keras backend function to accept a numpy tensor and return a loss and gradient for that tensor.
iterate = K.function([model.input], [loss, grads])
# To quickly test the interface:
import numpy as np
loss_value, grads_value = iterate([np.zeros((1, 150, 150, 3))])
When we perform gradient ascent and iteratively update the input image, the result is not guaranteed to be a valid, displayable image. We can fix that using a simple utility function below:
def deprocess_image(x):
# normalize tensor: center on 0., ensure std is 0.1
x -= x.mean()
x /= (x.std() + 1e-5)
x *= 0.1
# clip to [0, 1]
x += 0.5
x = np.clip(x, 0, 1)
# convert to RGB array
x *= 255
x = np.clip(x, 0, 255).astype('uint8')
return x
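As a quick sanity check, deprocess_image maps any real-valued tensor into a valid uint8 RGB image. The function is restated here so the snippet runs standalone:

```python
import numpy as np

def deprocess_image(x):
    # Normalize: center on 0., set std to 0.1
    x = x - x.mean()
    x = x / (x.std() + 1e-5)
    x *= 0.1
    # Shift to [0, 1] and clip
    x += 0.5
    x = np.clip(x, 0, 1)
    # Scale to displayable uint8 RGB
    x *= 255
    return np.clip(x, 0, 255).astype('uint8')

raw = np.random.randn(150, 150, 3) * 40.0 + 7.0  # arbitrary float "image"
img = deprocess_image(raw)
print(img.dtype, img.shape)  # uint8 (150, 150, 3)
```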
Using this deprocess_image utility function and the code above, we can write a generalized function to generate an optimal image for activating any arbitrary filter in the network. Notice the additional code at the bottom of the function which, for 40 iterations, performs gradient ascent on the input image to maximize the filter activation.
def generate_pattern(layer_name, filter_index, size=150):
# Build a loss function that maximizes the activation
# of the nth filter of the layer considered.
layer_output = model.get_layer(layer_name).output
loss = K.mean(layer_output[:, :, :, filter_index])
# Compute the gradient of the input picture wrt this loss
grads = K.gradients(loss, model.input)[0]
# Normalization trick: we normalize the gradient
grads /= (K.sqrt(K.mean(K.square(grads))) + 1e-5)
# This function returns the loss and grads given the input picture
iterate = K.function([model.input], [loss, grads])
# We start from a gray image with some noise
input_img_data = np.random.random((1, size, size, 3)) * 20 + 128.
# Run gradient ascent for 40 steps
step = 1.
for i in range(40):
loss_value, grads_value = iterate([input_img_data])
input_img_data += grads_value * step
# Convert the gradient-ascent result into a displayable image
img = input_img_data[0]
return deprocess_image(img)
plt.imshow(generate_pattern('block3_conv1', 7))
plt.show()
This can be repeated for any filter in the network. The following code from Chollet plots 64 of the filters from each of four convolutional layers, as four 8x8 grids of filters with black margins between them.
for layer_name in ['block1_conv1', 'block2_conv1', 'block3_conv1', 'block4_conv1']:
size = 64
margin = 5
# This is an empty (black) image where we will store our results.
results = np.zeros((8 * size + 7 * margin, 8 * size + 7 * margin, 3))
for i in range(8): # iterate over the rows of our results grid
for j in range(8): # iterate over the columns of our results grid
# Generate the pattern for filter `i + (j * 8)` in `layer_name`
filter_img = generate_pattern(layer_name, i + (j * 8), size=size)
# Put the result in the square `(i, j)` of the results grid
horizontal_start = i * size + i * margin
horizontal_end = horizontal_start + size
vertical_start = j * size + j * margin
vertical_end = vertical_start + size
results[horizontal_start: horizontal_end, vertical_start: vertical_end, :] = filter_img
# Display the results grid
plt.figure(figsize=(20, 20))
# Cast to uint8 so imshow interprets the 0-255 values correctly
plt.imshow(results.astype('uint8'))
plt.show()
Notice that the trend observed in the first set of visualizations is very evident here: as we move into deeper layers of the network, the convolutional filters resemble increasingly abstract ideas. The first layer consists of primarily simple textures, and the filters get increasingly complex until we reach the final layer, where filters have abstract meanings recognizable to the human eye (feathers, eyes, bricks, etc.)
Now try to manipulate the above code to generate images like DeepDream. To achieve this, start from an existing photograph rather than random noise, maximize the mean activation of an entire layer rather than a single filter, and repeat the process at a few increasing image scales.
For classification CNNs, it can be useful to identify which parts of an input image have the most influence on the final classification. This general visualization technique is referred to as Class Activation Maps (CAMs), and we will use Chollet's implementation of Grad-CAM (Selvaraju et al., "Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization") to show some examples of CAMs for VGG16.
from keras.applications.vgg16 import VGG16
from keras import backend as K
K.clear_session()
# Note that we are including the densely-connected classifier on top;
# all previous times, we were discarding it.
model = VGG16(weights='imagenet')
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 224, 224, 3)       0
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0
_________________________________________________________________
block3_conv1 (Conv2D)        (None, 56, 56, 256)       295168
_________________________________________________________________
block3_conv2 (Conv2D)        (None, 56, 56, 256)       590080
_________________________________________________________________
block3_conv3 (Conv2D)        (None, 56, 56, 256)       590080
_________________________________________________________________
block3_pool (MaxPooling2D)   (None, 28, 28, 256)       0
_________________________________________________________________
block4_conv1 (Conv2D)        (None, 28, 28, 512)       1180160
_________________________________________________________________
block4_conv2 (Conv2D)        (None, 28, 28, 512)       2359808
_________________________________________________________________
block4_conv3 (Conv2D)        (None, 28, 28, 512)       2359808
_________________________________________________________________
block4_pool (MaxPooling2D)   (None, 14, 14, 512)       0
_________________________________________________________________
block5_conv1 (Conv2D)        (None, 14, 14, 512)       2359808
_________________________________________________________________
block5_conv2 (Conv2D)        (None, 14, 14, 512)       2359808
_________________________________________________________________
block5_conv3 (Conv2D)        (None, 14, 14, 512)       2359808
_________________________________________________________________
block5_pool (MaxPooling2D)   (None, 7, 7, 512)         0
_________________________________________________________________
flatten (Flatten)            (None, 25088)             0
_________________________________________________________________
fc1 (Dense)                  (None, 4096)              102764544
_________________________________________________________________
fc2 (Dense)                  (None, 4096)              16781312
_________________________________________________________________
predictions (Dense)          (None, 1000)              4097000
=================================================================
Total params: 138,357,544
Trainable params: 138,357,544
Non-trainable params: 0
_________________________________________________________________
Now we will use some pre-processing code to convert an arbitrary image into the correct format for VGG. The sample input image here is of Dallas Hall and lawn on a sunny day.
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input, decode_predictions
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
import requests
from io import BytesIO
from PIL import Image
def load_image_as_array(url, size=(150, 150)):
response = requests.get(url)
img = Image.open(BytesIO(response.content))
img = img.resize(size)
return np.array(img).astype(float)
# The local path to our target image
img_url = 'https://raw.githubusercontent.com/8000net/LectureNotes/master/images/dallas_hall.jpg'
img = load_image_as_array(img_url, size=(224, 224))
# We add a dimension to transform our array into a "batch"
# of size (1, 224, 224, 3)
x = np.expand_dims(img, axis=0)
# Finally we preprocess the batch
# (this does channel-wise color normalization)
x = preprocess_input(x)
plt.imshow(np.squeeze(x)/256.0+0.5)
preds = model.predict(x)
print('Predicted:', decode_predictions(preds, top=3)[0])
predicted_class = np.argmax(preds)
Predicted: [('n03877845', 'palace', 0.4713494), ('n03733281', 'maze', 0.41759244), ('n04355338', 'sundial', 0.043737043)]
The model interprets Dallas Hall (and lawn) to be a palace. (A reasonable prediction, if you ask me)
To visualize why the maximal class was chosen, we use the following Grad-CAM implementation from Chollet. Grad-CAM is a visualization technique originally introduced by Selvaraju et al., which uses the gradient of the predicted class with respect to the final convolutional layer to generate a map of the input-image locations that drive the predicted class.
predicted_class_output = model.output[:, predicted_class] # defines class of interest
# This is the output feature map of the `block5_conv3` layer,
# the last convolutional layer in VGG16
last_conv_layer = model.get_layer('block5_conv3')
# This is the gradient of the predicted class with regard to
# the output feature map of `block5_conv3`
grads = K.gradients(predicted_class_output, last_conv_layer.output)[0]
# This is a vector of shape (512,), where each entry
# is the mean intensity of the gradient over a specific feature map channel
pooled_grads = K.mean(grads, axis=(0, 1, 2))
# This function allows us to access the values of the quantities we just defined:
# `pooled_grads` and the output feature map of `block5_conv3`,
# given a sample image
iterate = K.function([model.input], [pooled_grads, last_conv_layer.output[0]])
# These are the values of these two quantities, as Numpy arrays,
# given our sample image
pooled_grads_value, conv_layer_output_value = iterate([x])
# We multiply each channel in the feature map array
# by "how important this channel is" with regard to the predicted class
for i in range(512):
conv_layer_output_value[:, :, i] *= pooled_grads_value[i]
# The channel-wise mean of the resulting feature map
# is our heatmap of class activation
heatmap = np.mean(conv_layer_output_value, axis=-1)
# We then normalize the heatmap 0-1 for visualization:
heatmap = np.maximum(heatmap, 0)
heatmap /= np.max(heatmap)
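The per-channel weighting loop above can be collapsed into a single broadcast multiply. A numpy sketch, with dummy arrays standing in for the real conv_layer_output_value (14, 14, 512) and pooled_grads_value (512,):

```python
import numpy as np

# Dummy stand-ins with the VGG16 block5_conv3 shapes
conv_layer_output_value = np.random.rand(14, 14, 512)
pooled_grads_value = np.random.rand(512)

# Broadcasting weights every channel by its pooled gradient in one step
weighted = conv_layer_output_value * pooled_grads_value

# Equivalent to the explicit loop over the 512 channels
looped = conv_layer_output_value.copy()
for i in range(512):
    looped[:, :, i] *= pooled_grads_value[i]

# Channel-wise mean, ReLU, and 0-1 normalization, as above
heatmap = np.mean(weighted, axis=-1)
heatmap = np.maximum(heatmap, 0)
heatmap /= np.max(heatmap)
print(heatmap.shape)  # (14, 14)
```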
import cv2
# The notebook loaded the image from a URL, so rather than cv2.imread
# we reuse the downloaded array (reversing channels from RGB to BGR for cv2)
img = load_image_as_array(img_url, size=(224, 224)).astype('uint8')[:, :, ::-1]
# We resize the heatmap to have the same size as the original image
heatmap = cv2.resize(heatmap, (img.shape[1], img.shape[0]))
# We convert the heatmap to RGB
heatmap = np.uint8(255 * heatmap)
# We apply the heatmap to the original image
heatmap = cv2.applyColorMap(heatmap, cv2.COLORMAP_JET)
# 0.4 here is a heatmap intensity factor
superimposed_img = heatmap * 0.4 + img
cv2.imwrite('images/dallas_hall_palace_heatmap.jpg', superimposed_img)
True
The following heatmap shows that the center of the building's facade is what most identifies it as a palace.
VGG also identifies that Dallas Hall looks like a maze. The heatmap of the activation for the maze class shows that the crossing sidewalks are the culprit. This is quite intuitive, as a series of crossing paths is likely to be indicative of a maze.
Finally, VGG identifies that the image may be a sundial. This is a slightly less intuitive reaction to the image (and, accordingly, the sundial class activates to a much lesser degree than the previous classes). However, the class is activated by the fountain and, to a lesser extent, by the top of the dome on Dallas Hall. Upon further inspection, it is reasonable to see how those two parts of the image could be confused for a sundial.
I encourage you to try this with an image of your own choosing and explore the code in detail.
Now let's apply this analysis to a live webcam feed with VGG. You can run the following scripts in Python, assuming you have OpenCV installed.
cd activation-demo
python Heatmap.py
The following software versions were used to run this Jupyter notebook:
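A small cell like the following (a sketch; extend it with whichever libraries you actually use) reports the versions:

```python
import sys
import numpy as np

print('python:', sys.version.split()[0])
print('numpy: ', np.__version__)

# keras is only reported if it is installed in the environment
try:
    import keras
    print('keras: ', keras.__version__)
except ImportError:
    pass
```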