In the early days of artificial intelligence, the field rapidly tackled and solved problems that are intellectually difficult for human beings but relatively straightforward for computers - problems that can be described by a list of formal, mathematical rules. The true challenge to artificial intelligence proved to be solving the tasks that are easy for people to perform but hard for people to describe formally—problems that we solve intuitively, that feel automatic, like recognizing spoken words or faces in images.
Goodfellow et al. 2016, Deep Learning
... a layer at a time
Several "origins" in different fields, see e.g.
Henry J. Kelley (1960). Gradient theory of optimal flight paths. ARS Journal, 30(10), 947-954.
Arthur E. Bryson (1961). A gradient method for optimizing multi-stage allocation processes. In Proceedings of the Harvard University Symposium on Digital Computers and Their Applications.
Paul Werbos (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University.
David E. Rumelhart, Geoffrey E. Hinton & Ronald J. Williams (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533-536.
So why the hype and success now?
It is true that some skill is required to get good performance from a deep learning algorithm. Fortunately, the amount of skill required reduces as the amount of training data increases. The learning algorithms reaching human performance on complex tasks today are nearly identical to the learning algorithms that struggled to solve toy problems in the 1980s [...].
Goodfellow et al. 2016, Deep Learning
As of 2016, a rough rule of thumb is that a supervised deep learning algorithm will generally achieve acceptable performance with around 5,000 labeled examples per category, and will match or exceed human performance when trained with a dataset containing at least 10 million labeled examples.
Goodfellow et al. 2016, Deep Learning
thanks to faster/better hardware
Since the introduction of hidden units, artificial neural networks have doubled in size roughly every 2.4 years.
Goodfellow et al. 2016, Deep Learning
and most importantly: Deep learning is highly profitable
Deep learning is now used by many top technology companies including Google, Microsoft, Facebook, IBM, Baidu, Apple, Adobe, Netflix, NVIDIA and NEC.
Goodfellow et al. 2016, Deep Learning
So “multi-layer” neural networks do not use the perceptron learning procedure.
They should never have been called multi-layer perceptrons.
Geoffrey Hinton, Neural Networks for Machine Learning Lec. 3
We want to predict XOR:
$f(\mathbf{x}; \mathbf{w}, b) = \mathbf{x}^T\mathbf{w} + b$
$f(\mathbf{x}; \mathbf{W}, \mathbf{c}, \mathbf{w}, b) = \mathbf{w}^T (\mathbf{W}^T\mathbf{x} + \mathbf{c}) + b$
Design matrix: $\mathbf{X} = \begin{bmatrix}0 & 0 \\0 & 1\\1 & 0 \\1 & 1\end{bmatrix}$
Parameters: $\mathbf{W} = \begin{bmatrix}1 & 1 \\1 & 1\end{bmatrix}$, $\mathbf{c} = \begin{bmatrix}0 \\ -1 \end{bmatrix}$, $\mathbf{w} = \begin{bmatrix}1 \\ -2 \end{bmatrix}$
Input to hidden layer: $\mathbf{X}\mathbf{W} = \begin{bmatrix}0 & 0 \\1 & 1\\1 & 1 \\2 & 2\end{bmatrix}$, add $\mathbf{c}$ to every row ==> $\begin{bmatrix}0 & -1 \\1 & 0\\1 & 0 \\2 & 1\end{bmatrix}$
$f(\mathbf{x}; \mathbf{W}, \mathbf{c}, \mathbf{w}, b) = \mathbf{w}^T \max(0, \mathbf{W}^T\mathbf{x} + \mathbf{c}) + b$
Output of rectified linear transformation: $\begin{bmatrix}0 & 0 \\1 & 0\\1 & 0 \\2 & 1\end{bmatrix}$
The remaining hidden-to-output transformation is linear, but in this hidden representation the two classes are already linearly separable: multiplying by $\mathbf{w}$ (with $b = 0$) gives the outputs $\begin{bmatrix}0 & 1 & 1 & 0\end{bmatrix}^T$, exactly XOR.
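A minimal NumPy sketch of this forward pass (variable names mirror the symbols above; $b = 0$ is assumed, consistent with the parameters listed):

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # design matrix: all four XOR inputs
W = np.array([[1, 1], [1, 1]])                  # input-to-hidden weights
c = np.array([0, -1])                           # hidden bias
w = np.array([1, -2])                           # hidden-to-output weights
b = 0                                           # output bias (assumed 0)

H = np.maximum(0, X @ W + c)  # rectified linear hidden layer
y = H @ w + b                 # linear output layer

print(H)  # [[0 0] [1 0] [1 0] [2 1]]
print(y)  # [0 1 1 0] -> XOR of the two input bits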
Minimize squared error $f(\mathbf{\hat{\beta}}) = ||\mathbf{X\hat{\beta}} - \mathbf{y}||^2_2$
Closed form: solve the normal equations $\mathbf{\hat{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$
Alternatively, follow the gradient: $\nabla_{\mathbf{\hat{\beta}}} f(\mathbf{\hat{\beta}}) = 2(\mathbf{X}^T\mathbf{X}\mathbf{\hat{\beta}} - \mathbf{X}^T\mathbf{y})$
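A small NumPy sketch contrasting the two approaches (the synthetic data, step size, and iteration count are illustrative choices, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                   # design matrix
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=100)  # noisy linear targets

# closed form: solve the normal equations X^T X beta = X^T y
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# gradient descent on f(beta) = ||X beta - y||^2
beta = np.zeros(3)
lr = 1e-3
for _ in range(500):
    beta -= lr * 2 * (X.T @ X @ beta - X.T @ y)

print(beta_closed, beta)  # both should be close to beta_true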
How about a net with several layers?
the loss (or cost) function quantifies the cost incurred by a wrong prediction / misclassification
probably the best-known loss function in machine learning is mean squared error:
$\frac{1}{n} \sum_{i=1}^{n}{(\hat{y}_i - y_i)^2}$
most of the time, in deep learning we use cross entropy:
$- \sum_j{t_j \log(y_j)}$
With a one-hot target $\mathbf{t}$, this is the negative log probability of the right answer.
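A quick NumPy sketch of both losses (the example values are made up):

import numpy as np

# mean squared error for a small regression example
y_hat = np.array([2.5, 0.0, 2.1])
y = np.array([3.0, -0.5, 2.0])
mse = np.mean((y_hat - y) ** 2)

# cross entropy for a single classification example
t = np.array([0, 0, 1])        # one-hot target: class 2 is correct
p = np.array([0.1, 0.2, 0.7])  # predicted class probabilities (e.g. a softmax output)
cross_entropy = -np.sum(t * np.log(p))  # = -log(0.7), the negative log probability of the right answer

print(mse, cross_entropy)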
purpose of activation function: introduce nonlinearity (see above)
for a long time, the sigmoid (logistic) activation function was used a lot:
$y = \frac{1}{1 + e^{-z}}$
now rectified linear units (ReLUs) are preferred:
$y = \max(0, z)$
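Both are one-liners; a NumPy sketch:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

z = np.linspace(-5, 5, 11)
print(sigmoid(z))  # smoothly saturates towards 0 and 1
print(relu(z))     # zero for negative inputs, identity for positive inputs

One reason ReLUs are preferred is that their gradient does not saturate for positive inputs, which makes deep networks easier to train.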
A = [1,2,3;4,5,6;7,8,9]  # input "image"
# padded input matrix (one row/column of zeros on each side), for easier visualization
A_padded = [zeros(1,size(A,2)+2); [zeros(size(A,1),1), A, zeros(size(A,1),1)]; zeros(1,size(A,2)+2)]
B = [1,0;0,0]  # kernel
# real convolution (the kernel is flipped before sliding over the input)
C_full = conv2(A,B,'full')    # default: every position with any overlap, 4x4 output
C_same = conv2(A,B,'same')    # output cropped to the size of A, 3x3
C_valid = conv2(A,B,'valid')  # only positions where B fits entirely inside A, 2x2
# cross-correlation (no kernel flip; in Octave, xcorr2 is provided by the signal package)
XC = xcorr2(A,B)
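The same comparison as a Python sketch using scipy.signal (real convolution is cross-correlation with the kernel flipped in both dimensions):

import numpy as np
from scipy.signal import convolve2d, correlate2d

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])  # input "image"
B = np.array([[1, 0], [0, 0]])                   # kernel

C_full = convolve2d(A, B, mode='full')    # 4x4
C_same = convolve2d(A, B, mode='same')    # 3x3
C_valid = convolve2d(A, B, mode='valid')  # 2x2
XC = correlate2d(A, B, mode='full')       # cross-correlation: no kernel flip

# convolution equals cross-correlation with the kernel rotated by 180 degrees
assert np.array_equal(C_full, correlate2d(A, np.rot90(B, 2), mode='full'))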
Mikolov et al. (2013a). Efficient estimation of word representations in vector space. arXiv:1301.3781.
Jane walked into the room. John walked in too. It was late in the day, and everyone was walking home after a long day at work. Jane said hi to ___
def rnn_cell(rnn_input, state):
    # reuse=True: 'W' and 'b' are assumed to have been created earlier in the 'rnn_cell' scope,
    # so every time step shares the same weights
    with tf.variable_scope('rnn_cell', reuse=True):
        W = tf.get_variable('W', [num_classes + state_size, state_size])
        b = tf.get_variable('b', [state_size], initializer=tf.constant_initializer(0.0))
    # new state = tanh([input, state] W + b); tf.concat(1, ...) is the pre-TF-1.0 argument order
    return tf.tanh(tf.matmul(tf.concat(1, [rnn_input, state]), W) + b)

# unroll the RNN: apply the same cell to every input in the sequence
state = init_state
rnn_outputs = []
for rnn_input in rnn_inputs:
    state = rnn_cell(rnn_input, state)
    rnn_outputs.append(state)
final_state = rnn_outputs[-1]
from: http://r2rt.com/recurrent-neural-networks-in-tensorflow-i.html
class BasicRNNCell(RNNCell):
  """The most basic RNN cell."""

  def __init__(self, num_units, input_size=None, activation=tanh):
    self._num_units = num_units
    self._activation = activation

  @property
  def state_size(self):
    return self._num_units

  def __call__(self, inputs, state, scope=None):
    """Most basic RNN: output = new_state = act(W * input + U * state + B)."""
    with vs.variable_scope(scope or "basic_rnn_cell"):
      output = self._activation(
          _linear([inputs, state], self._num_units, True, scope=scope))
    return output, output
class GRUCell(RNNCell):
  """Gated Recurrent Unit cell (cf. http://arxiv.org/abs/1406.1078)."""

  def __call__(self, inputs, state, scope=None):
    """Gated recurrent unit (GRU) with nunits cells."""
    with vs.variable_scope(scope or "gru_cell"):
      with vs.variable_scope("gates"):  # Reset gate and update gate.
        # We start with bias of 1.0 to not reset and not update.
        r, u = array_ops.split(value=_linear([inputs, state],
                                             2 * self._num_units,
                                             True,
                                             1.0,
                                             scope=scope),
                               num_or_size_splits=2,
                               axis=1)
        r, u = sigmoid(r), sigmoid(u)
      with vs.variable_scope("candidate"):
        c = self._activation(_linear([inputs, r * state],
                                     self._num_units,
                                     True, scope=scope))
      new_h = u * state + (1 - u) * c
    return new_h, new_h
class BasicLSTMCell(RNNCell):

  def __call__(self, inputs, state, scope=None):
    with vs.variable_scope(scope or "basic_lstm_cell"):
      c, h = array_ops.split(1, 2, state)
      concat = _linear([inputs, h], 4 * self._num_units, True, scope=scope)
      # i = input_gate, j = new_input, f = forget_gate, o = output_gate
      i, j, f, o = array_ops.split(1, 4, concat)
      new_c = (c * sigmoid(f + self._forget_bias) + sigmoid(i) * self._activation(j))
      new_h = self._activation(new_c) * sigmoid(o)
      new_state = array_ops.concat_v2([new_c, new_h], 1)
    return new_h, new_state
"If you can express your computation as a data flow graph, you can use TensorFlow."
char-rnn demo
(based on https://github.com/sherjilozair/char-rnn-tensorflow)