This is a short overview of what I learned in the CNN course of the Deep Learning Specialization on Coursera by Andrew Ng.
# TODO: apply conv matrixes to an image and show result
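A minimal NumPy sketch for the TODO above (the toy image and the kernel are my own; a real demo would load an actual photo). It applies the classic 3x3 vertical-edge filter from the lectures with a "valid" convolution:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation (what DL frameworks call 'convolution')."""
    h, w = image.shape
    f, _ = kernel.shape
    out = np.zeros((h - f + 1, w - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

# Toy 6x6 image: left half bright, right half dark
image = np.array([[10.0] * 3 + [0.0] * 3] * 6)

# Vertical-edge detection kernel
vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]], dtype=float)

result = conv2d(image, vertical_edge)
print(result)  # 4x4 map with a strong response (30) at the vertical edge
```

Note the output shrinks from 6x6 to 4x4, which is exactly what padding (next section) is meant to prevent.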
Padding adds zeros around the image to avoid shrinking it and losing information at the edges.
For "same" padding (output keeps the input size): p = (f - 1) / 2
params:
input: $n_{h}^{[l-1]} \times n_{w}^{[l-1]} \times n_{c}^{[l-1]}$.
output: $n_{h}^{[l]} \times n_{w}^{[l]} \times n_{c}^{[l]}$.
$n^{[l]} = \left\lfloor \frac{n^{[l-1]} + 2p^{[l]} - f^{[l]}}{s^{[l]}} \right\rfloor + 1$
each filter has shape: $f^{[l]} \times f^{[l]} \times n_{c}^{[l-1]}$
activations: $a^{[l]} \rightarrow n_{h}^{[l]} \times n_{w}^{[l]} \times n_{c}^{[l]}$.
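The size formula above as a quick helper (the function name is mine):

```python
import math

def conv_output_size(n, f, p=0, s=1):
    """Spatial output size of a conv layer: floor((n + 2p - f) / s) + 1."""
    return math.floor((n + 2 * p - f) / s) + 1

# 39x39 input, 3x3 filter, no padding, stride 1 -> 37x37
print(conv_output_size(39, f=3))            # 37
# "same" padding for f=3: p = (3 - 1) // 2 = 1 keeps the size
print(conv_output_size(39, f=3, p=1))       # 39
# stride 2 roughly halves the spatial size
print(conv_output_size(39, f=3, p=1, s=2))  # 20
```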
Pooling can be max pooling or average pooling. It splits the matrix into regions and takes the max/average value of each region, storing it in a new matrix.
params: filter size $f$, stride $s$.
properties: these are hyperparameters only, pooling has no parameters to learn.
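A NumPy sketch of 2x2 pooling with stride 2 (helper name and the toy matrix are mine):

```python
import numpy as np

def pool2d(x, f=2, s=2, mode="max"):
    """Pooling: take max/average over f x f regions with stride s."""
    h, w = x.shape
    out_h = (h - f) // s + 1
    out_w = (w - f) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = x[i * s:i * s + f, j * s:j * s + f]
            out[i, j] = region.max() if mode == "max" else region.mean()
    return out

x = np.array([[1., 3., 2., 1.],
              [2., 9., 1., 1.],
              [1., 3., 2., 3.],
              [5., 6., 1., 2.]])
print(pool2d(x, mode="max"))  # [[9. 2.] [6. 3.]]
print(pool2d(x, mode="avg"))  # [[3.75 1.25] [3.75 2.  ]]
```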
conv -> pool -> conv -> pool -> … -> fc -> ... -> fc -> softmax.
It is not covered in the course itself, but there is a quick overview in the exercises.
article: Going Deeper with Convolutions - Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich - https://arxiv.org/abs/1409.4842
features:
1x1 convolution features:
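A 1x1 convolution is a fully connected layer applied at every pixel across the channel dimension, so it can shrink (or grow) the number of channels while leaving the spatial size untouched. A NumPy sketch (the toy shapes are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(28, 28, 192))   # h x w x n_c input
W = rng.normal(size=(192, 32))       # 32 filters, each of shape 1 x 1 x 192

# Per-pixel matmul over the channel axis: 192 channels -> 32 channels,
# spatial dimensions unchanged.
y = x @ W
print(y.shape)  # (28, 28, 32)
```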
Landmark detection
YOLO (You Only Look Once) / object detection
Validate - (IoU) Intersection Over Union
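Intersection over Union as code (the corner box format (x1, y1, x2, y2) is my assumption):

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corners."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, partial overlap
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # 0.0, no overlap
```

The course uses IoU >= 0.5 as the usual "correct detection" threshold.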
Non-max Suppression
Anchor Boxes
Alternative
Problem: One Shot Learning:
Solution:
train function which gives similarity: $$ d(face\_image^{(i)}, face\_image^{(j)}). $$
Siamese Network
Triplet Loss
- to exclude trivial solutions where $d$ is always 0 (e.g. $f \equiv 0$), we add a small margin ($\alpha$):
$$
||f(A) - f(P)||^{2} + \alpha \leq ||f(A) - f(N)||^{2}
$$

- _ME: why not just replace $\leq$ with $<$?_
- Training:
- few images per person -> make triplets (A, P, N) for training; a single image per person is not enough
- random triplets (A, P, N) are not enough either, because they satisfy the loss too easily. Choose triplets that are hard to train on, i.e. $d(A,P) \approx d(A,N)$
- Prediction:
- a single image per person can be enough
- precompute $f(A)$ for the known persons in the database
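The triplet loss above, sketched in NumPy on precomputed embeddings (the names and the $\alpha$ value are mine):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """max(||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + alpha, 0)"""
    d_ap = np.sum((f_a - f_p) ** 2)
    d_an = np.sum((f_a - f_n) ** 2)
    return max(d_ap - d_an + alpha, 0.0)

anchor   = np.array([0.0, 1.0])
positive = np.array([0.0, 1.1])   # same person: close to the anchor
negative = np.array([1.0, 0.0])   # different person: far from the anchor

print(triplet_loss(anchor, positive, negative))  # 0.0, constraint satisfied
print(triplet_loss(anchor, negative, positive))  # positive, triplet violated
```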
FaceNet: A Unified Embedding for Face Recognition and Clustering - Florian Schroff, Dmitry Kalenichenko, James Philbin - https://arxiv.org/abs/1503.03832
We can use this approach to check the similarity of other things, for example photos by artists, tweets, etc.
A loss function based on shared layers (see https://keras.io/getting-started/functional-api-guide/#shared-layers):
p(tweet_1 and tweet_2 belong to one person) = log_regression(f(tweet_1) + f(tweet_2))
log_regression == Dense(1, activation="sigmoid")(merged_vector)
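A plain NumPy sketch of the idea above (the shared embedding `f`, the weight shapes, and concatenating the two embeddings as the merged vector are all my assumptions; the Keras guide linked above builds the same thing with shared layers):

```python
import numpy as np

rng = np.random.default_rng(1)

def f(tweet_vec, W):
    """Shared embedding applied to both inputs (the Siamese part)."""
    return np.tanh(W @ tweet_vec)

def same_author_prob(t1, t2, W, w, b):
    """log_regression == Dense(1, activation='sigmoid') on the merged vector."""
    merged = np.concatenate([f(t1, W), f(t2, W)])
    return 1.0 / (1.0 + np.exp(-(w @ merged + b)))

W = rng.normal(size=(8, 16))   # shared embedding weights (untrained here)
w = rng.normal(size=16)        # logistic-regression weights over the merged vector
b = 0.0
t1, t2 = rng.normal(size=16), rng.normal(size=16)
p = same_author_prob(t1, t2, W, w, b)
print(p)  # a probability in (0, 1); training would make it meaningful
```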
ME: How should we search these images? Brute force?
Content (C) + Style (S) => Generated image (G)
Start with a random image G. For example, G of shape 100x100x3.
Use gradient descent to minimize J(G)
use a hidden layer $l$ to compute the content cost. $l$ is somewhere in between (too low - "pixel-to-pixel" matching, too high - only "a feature is somewhere")
the content cost is the (normalized) squared difference of the activation matrices: $J_{content}(C,G) \propto ||a^{[l](C)} - a^{[l](G)}||^{2}$
use the style matrix (Gram matrix): the correlation (strictly, the uncentered covariance) of different channels of one layer
$$ G^{[l]}_{kk'} = \sum_{i=1}^{n^{[l]}_{h}} \sum_{j=1}^{n^{[l]}_{w}} a^{[l]}_{ijk}a^{[l]}_{ijk'} $$

for one layer:
$$ J^{[l]}_{style}(S,G) = \frac{1}{(2 n^{[l]}_{H} n^{[l]}_{W} n^{[l]}_{C})^{2}} ||G^{[l](S)} - G^{[l](G)}||^{2}_{F} = \frac{1}{(2 n^{[l]}_{H} n^{[l]}_{W} n^{[l]}_{C})^{2}} \sum_{k} \sum_{k'} \left( G^{[l](S)}_{kk'} - G^{[l](G)}_{kk'} \right)^{2} $$

We get a better result by summing over several layers:
$$ J_{style}(S,G) = \sum_{l} \lambda^{[l]} J^{[l]}_{style}(S,G) $$
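The style matrix and the per-layer style cost above, sketched in NumPy (the toy shapes are mine):

```python
import numpy as np

def gram(a):
    """Style matrix G_kk' = sum_ij a_ijk * a_ijk' for one layer's (h, w, c) activations."""
    h, w, c = a.shape
    flat = a.reshape(h * w, c)      # one row per spatial position
    return flat.T @ flat            # (c, c) channel-by-channel correlations

def style_cost_layer(a_s, a_g):
    """J_style^[l] = ||G(S) - G(G)||^2 / (2 * h * w * c)^2"""
    h, w, c = a_s.shape
    norm = (2 * h * w * c) ** 2
    return np.sum((gram(a_s) - gram(a_g)) ** 2) / norm

rng = np.random.default_rng(2)
a_style = rng.normal(size=(4, 4, 3))  # style-image activations at layer l
a_gen = rng.normal(size=(4, 4, 3))    # generated-image activations at layer l

G = gram(a_style)
print(np.allclose(G, G.T))                 # Gram matrix is symmetric
print(style_cost_layer(a_style, a_style))  # identical activations give zero cost
```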