This is a short overview of what I learned in the CNN course of the Deep Learning Specialization on Coursera by Andrew Ng.
# TODO: apply conv matrixes to an image and show result
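A minimal NumPy sketch for the TODO above (the toy image and the kernel are my own; a real demo would load an actual photo). It applies the classic 3x3 vertical-edge filter from the lectures with a "valid" convolution:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation (what DL frameworks call 'convolution')."""
    h, w = image.shape
    f, _ = kernel.shape
    out = np.zeros((h - f + 1, w - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

# Toy 6x6 image: left half bright, right half dark
image = np.array([[10.0] * 3 + [0.0] * 3] * 6)

# Vertical-edge detection kernel
vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]], dtype=float)

result = conv2d(image, vertical_edge)
print(result)  # 4x4 map with a strong response (30) at the vertical edge
```

Note the output shrinks from 6x6 to 4x4, which is exactly what padding (next section) is meant to prevent.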
Padding adds zeros around the image to avoid shrinking it and losing information at the edges.
For "same" padding (output keeps the input size): p = (f - 1) / 2
params:
input: $n_{h}^{[l-1]} \times n_{w}^{[l-1]} \times n_{c}^{[l-1]}$.
output: $n_{h}^{[l]} \times n_{w}^{[l]} \times n_{c}^{[l]}$.
$n^{[l]} = \left\lfloor \frac{n^{[l-1]} + 2p^{[l]} - f^{[l]}}{s^{[l]}} \right\rfloor + 1$
each filter has shape: $f^{[l]} \times f^{[l]} \times n_{c}^{[l-1]}$
activations: $a^{[l]} \rightarrow n_{h}^{[l]} \times n_{w}^{[l]} \times n_{c}^{[l]}$.
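The size formula above as a quick helper (the function name is mine):

```python
import math

def conv_output_size(n, f, p=0, s=1):
    """Spatial output size of a conv layer: floor((n + 2p - f) / s) + 1."""
    return math.floor((n + 2 * p - f) / s) + 1

# 39x39 input, 3x3 filter, no padding, stride 1 -> 37x37
print(conv_output_size(39, f=3))            # 37
# "same" padding for f=3: p = (3 - 1) // 2 = 1 keeps the size
print(conv_output_size(39, f=3, p=1))       # 39
# stride 2 roughly halves the spatial size
print(conv_output_size(39, f=3, p=1, s=2))  # 20
```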
Pooling can be max pooling or average pooling. It splits the matrix into regions and takes the max/average value of each region, storing it in a new matrix.
params: filter size $f$, stride $s$.
properties: these are hyperparameters only, pooling has no parameters to learn.
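A NumPy sketch of 2x2 pooling with stride 2 (helper name and the toy matrix are mine):

```python
import numpy as np

def pool2d(x, f=2, s=2, mode="max"):
    """Pooling: take max/average over f x f regions with stride s."""
    h, w = x.shape
    out_h = (h - f) // s + 1
    out_w = (w - f) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = x[i * s:i * s + f, j * s:j * s + f]
            out[i, j] = region.max() if mode == "max" else region.mean()
    return out

x = np.array([[1., 3., 2., 1.],
              [2., 9., 1., 1.],
              [1., 3., 2., 3.],
              [5., 6., 1., 2.]])
print(pool2d(x, mode="max"))  # [[9. 2.] [6. 3.]]
print(pool2d(x, mode="avg"))  # [[3.75 1.25] [3.75 2.  ]]
```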
conv -> pool -> conv -> pool -> … -> fc -> ... -> fc -> softmax.
It is not covered in the course itself, but there is a quick overview in the exercises.
article: Going Deeper with Convolutions - Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich - https://arxiv.org/abs/1409.4842
features:
1x1 convolution features:
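A 1x1 convolution is a fully connected layer applied at every pixel across the channel dimension, so it can shrink (or grow) the number of channels while leaving the spatial size untouched. A NumPy sketch (the toy shapes are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(28, 28, 192))   # h x w x n_c input
W = rng.normal(size=(192, 32))       # 32 filters, each of shape 1 x 1 x 192

# Per-pixel matmul over the channel axis: 192 channels -> 32 channels,
# spatial dimensions unchanged.
y = x @ W
print(y.shape)  # (28, 28, 32)
```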
Landmark detection
YOLO (You Only Look Once) / object detection
Validate - (IoU) Intersection Over Union
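Intersection over Union as code (the corner box format (x1, y1, x2, y2) is my assumption):

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corners."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, partial overlap
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # 0.0, no overlap
```

The course uses IoU >= 0.5 as the usual "correct detection" threshold.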
Non-max Suppression
Anchor Boxes
Alternative
Problem: One Shot Learning:
Solution:
train function which gives similarity: $$ d(face\_image^{(i)}, face\_image^{(j)}). $$
Siamese Network
Triplet Loss
- to exclude trivial solutions where $d$ is always 0 (e.g. $f \equiv 0$), we add a small margin ($\alpha$):
$$
||f(A) - f(P)||^{2} + \alpha \leq ||f(A) - f(N)||^{2}
$$

- _ME: why not just replace $\leq$ with $<$?_
- Training:
- few images per person -> make triplets (A, P, N) for training; a single image per person is not enough
- random triplets (A, P, N) are not enough either, because they satisfy the loss too easily. Choose triplets that are hard to train on, i.e. $d(A,P) \approx d(A,N)$
- Prediction:
- a single image per person can be enough
- precompute $f(A)$ for the known persons in the database
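The triplet loss above, sketched in NumPy on precomputed embeddings (the names and the $\alpha$ value are mine):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """max(||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + alpha, 0)"""
    d_ap = np.sum((f_a - f_p) ** 2)
    d_an = np.sum((f_a - f_n) ** 2)
    return max(d_ap - d_an + alpha, 0.0)

anchor   = np.array([0.0, 1.0])
positive = np.array([0.0, 1.1])   # same person: close to the anchor
negative = np.array([1.0, 0.0])   # different person: far from the anchor

print(triplet_loss(anchor, positive, negative))  # 0.0, constraint satisfied
print(triplet_loss(anchor, negative, positive))  # positive, triplet violated
```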
FaceNet: A Unified Embedding for Face Recognition and Clustering - Florian Schroff, Dmitry Kalenichenko, James Philbin - https://arxiv.org/abs/1503.03832
We can use this approach to check the similarity of other things, for example photos by artists, tweets, etc.
A loss function based on shared layers (see https://keras.io/getting-started/functional-api-guide/#shared-layers):
p(tweet_1 and tweet_2 belong to one person) = log_regression(f(tweet_1) + f(tweet_2))
log_regression == Dense(1, activation="sigmoid")(merged_vector)
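A plain NumPy sketch of the idea above (the shared embedding `f`, the weight shapes, and concatenating the two embeddings as the merged vector are all my assumptions; the Keras guide linked above builds the same thing with shared layers):

```python
import numpy as np

rng = np.random.default_rng(1)

def f(tweet_vec, W):
    """Shared embedding applied to both inputs (the Siamese part)."""
    return np.tanh(W @ tweet_vec)

def same_author_prob(t1, t2, W, w, b):
    """log_regression == Dense(1, activation='sigmoid') on the merged vector."""
    merged = np.concatenate([f(t1, W), f(t2, W)])
    return 1.0 / (1.0 + np.exp(-(w @ merged + b)))

W = rng.normal(size=(8, 16))   # shared embedding weights (untrained here)
w = rng.normal(size=16)        # logistic-regression weights over the merged vector
b = 0.0
t1, t2 = rng.normal(size=16), rng.normal(size=16)
p = same_author_prob(t1, t2, W, w, b)
print(p)  # a probability in (0, 1); training would make it meaningful
```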
ME: How should we search these images? Brute force?
Content (C) + Style (S) => Generated image (G)
Start with a random image G. For example, G of shape 100x100x3.
Use gradient descent to minimize J(G)
use a hidden layer $l$ to compute the content cost. $l$ is somewhere in between (too low - "pixel-to-pixel" matching, too high - only "a feature is somewhere")
the content cost is the (normalized) squared difference of the activation matrices: $J_{content}(C,G) \propto ||a^{[l](C)} - a^{[l](G)}||^{2}$
use the style matrix (Gram matrix): the correlation (strictly, the uncentered covariance) of different channels of one layer
$$ G^{[l]}_{kk'} = \sum_{i=1}^{n^{[l]}_{h}} \sum_{j=1}^{n^{[l]}_{w}} a^{[l]}_{ijk}a^{[l]}_{ijk'} $$

for one layer:
$$ J^{[l]}_{style}(S,G) = \frac{1}{(2 n^{[l]}_{H} n^{[l]}_{W} n^{[l]}_{C})^{2}} ||G^{[l](S)} - G^{[l](G)}||^{2}_{F} = \frac{1}{(2 n^{[l]}_{H} n^{[l]}_{W} n^{[l]}_{C})^{2}} \sum_{k} \sum_{k'} \left( G^{[l](S)}_{kk'} - G^{[l](G)}_{kk'} \right)^{2} $$

We get a better result by summing over several layers:
$$ J_{style}(S,G) = \sum_{l} \lambda^{[l]} J^{[l]}_{style}(S,G) $$
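The style matrix and the per-layer style cost above, sketched in NumPy (the toy shapes are mine):

```python
import numpy as np

def gram(a):
    """Style matrix G_kk' = sum_ij a_ijk * a_ijk' for one layer's (h, w, c) activations."""
    h, w, c = a.shape
    flat = a.reshape(h * w, c)      # one row per spatial position
    return flat.T @ flat            # (c, c) channel-by-channel correlations

def style_cost_layer(a_s, a_g):
    """J_style^[l] = ||G(S) - G(G)||^2 / (2 * h * w * c)^2"""
    h, w, c = a_s.shape
    norm = (2 * h * w * c) ** 2
    return np.sum((gram(a_s) - gram(a_g)) ** 2) / norm

rng = np.random.default_rng(2)
a_style = rng.normal(size=(4, 4, 3))  # style-image activations at layer l
a_gen = rng.normal(size=(4, 4, 3))    # generated-image activations at layer l

G = gram(a_style)
print(np.allclose(G, G.T))                 # Gram matrix is symmetric
print(style_cost_layer(a_style, a_style))  # identical activations give zero cost
```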