Important: Please read the installation page for details about how to install the toolboxes. $\newcommand{\dotp}[2]{\langle #1, #2 \rangle}$ $\newcommand{\enscond}[2]{\lbrace #1, #2 \rbrace}$ $\newcommand{\pd}[2]{ \frac{ \partial #1}{\partial #2} }$ $\newcommand{\umin}[1]{\underset{#1}{\min}\;}$ $\newcommand{\umax}[1]{\underset{#1}{\max}\;}$ $\newcommand{\umin}[1]{\underset{#1}{\min}\;}$ $\newcommand{\uargmin}[1]{\underset{#1}{argmin}\;}$ $\newcommand{\norm}[1]{\|#1\|}$ $\newcommand{\abs}[1]{\left|#1\right|}$ $\newcommand{\choice}[1]{ \left\{ \begin{array}{l} #1 \end{array} \right. }$ $\newcommand{\pa}[1]{\left(#1\right)}$ $\newcommand{\diag}[1]{{diag}\left( #1 \right)}$ $\newcommand{\qandq}{\quad\text{and}\quad}$ $\newcommand{\qwhereq}{\quad\text{where}\quad}$ $\newcommand{\qifq}{ \quad \text{if} \quad }$ $\newcommand{\qarrq}{ \quad \Longrightarrow \quad }$ $\newcommand{\ZZ}{\mathbb{Z}}$ $\newcommand{\CC}{\mathbb{C}}$ $\newcommand{\RR}{\mathbb{R}}$ $\newcommand{\EE}{\mathbb{E}}$ $\newcommand{\Zz}{\mathcal{Z}}$ $\newcommand{\Ww}{\mathcal{W}}$ $\newcommand{\Vv}{\mathcal{V}}$ $\newcommand{\Nn}{\mathcal{N}}$ $\newcommand{\NN}{\mathcal{N}}$ $\newcommand{\Hh}{\mathcal{H}}$ $\newcommand{\Bb}{\mathcal{B}}$ $\newcommand{\Ee}{\mathcal{E}}$ $\newcommand{\Cc}{\mathcal{C}}$ $\newcommand{\Gg}{\mathcal{G}}$ $\newcommand{\Ss}{\mathcal{S}}$ $\newcommand{\Pp}{\mathcal{P}}$ $\newcommand{\Ff}{\mathcal{F}}$ $\newcommand{\Xx}{\mathcal{X}}$ $\newcommand{\Mm}{\mathcal{M}}$ $\newcommand{\Ii}{\mathcal{I}}$ $\newcommand{\Dd}{\mathcal{D}}$ $\newcommand{\Ll}{\mathcal{L}}$ $\newcommand{\Tt}{\mathcal{T}}$ $\newcommand{\si}{\sigma}$ $\newcommand{\al}{\alpha}$ $\newcommand{\la}{\lambda}$ $\newcommand{\ga}{\gamma}$ $\newcommand{\Ga}{\Gamma}$ $\newcommand{\La}{\Lambda}$ $\newcommand{\si}{\sigma}$ $\newcommand{\Si}{\Sigma}$ $\newcommand{\be}{\beta}$ $\newcommand{\de}{\delta}$ $\newcommand{\De}{\Delta}$ $\newcommand{\phi}{\varphi}$ $\newcommand{\th}{\theta}$ $\newcommand{\om}{\omega}$ $\newcommand{\Om}{\Omega}$ $\newcommand{\eqdef}{\equiv}$
This tour details fully connected multi-layers neural netorks.
We recommend that after doing this Numerical Tours, you apply it to your own data, for instance using a dataset from LibSVM.
Disclaimer: these machine learning tours are intended to be overly-simplistic implementations and applications of baseline machine learning methods. For more advanced uses and implementations, we recommend to use a state-of-the-art library, the most well known being Scikit-Learn
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
%load_ext autoreload
%autoreload 2
# convert to a column vector
def MakeCol(y): return y.reshape(-1,1)
# convert to a row vector
def MakeRow(y): return y.reshape(1,-1)
# find non zero/true elements
def find(x): return np.nonzero(x)[0]
n0 = 100 # number of points per class
p = 2 # dimensionality
k = 3 # number of classes
n = n0*k # Total number of points
x = np.zeros((p,n))
y = np.zeros((1,n))
for j in np.arange(0,k):
I = np.arange(n0*(j-1),n0*j)
r = np.linspace(0.0,1,n0) # radius
t = np.linspace(j*4,(j+1)*4,n0) + np.random.randn(1,n0)*0.2 # angle
x[0,I] = r*np.sin(t)
x[1,I] = r*np.cos(t)
y[0,I] = j
Display.
col = np.array( [ [1,0,0], [0,1,0], [0,0,1], [0,0,0], [0,1,1], [1,0,1], [1,1,0], [1,.5,.5], [.5,1,.5], [.5,.5,1] ] ).transpose()
plt.clf
for j in np.arange(0,k):
I = find(y.flatten()==j)
plt.plot(x[0,I], x[1,I], '.', color=col[:,j])
plt.axis('equal')
plt.axis('off');
Class probability matrix.
U = np.tile( MakeCol(np.arange(1,k+1)), (1,n) )
Y = np.double( U == np.tile( y+1, (k,1) ) )
We setup the network. It is parameterized by the dimensions of the layers.
The network is composed of $R$ layers, and operate by initialyzing $x_0=x$ and then iterating $$ \forall r=0,\ldots,R, \quad x_{r+1} \eqdef \rho(A_r x_r + b_r). $$ Here $\rho : \RR \mapsto \RR$ is a non-linear activation function which operate coordinate by coordinate. The intermediate variables are $x_r \in \RR^{d_r}$ with $(d_0,d_{L+1})=(p,k)$. The matrices have size $A_r \in \RR^{d_{r+1} \times d_r}$ while the biases have size $b_r \in \RR^{d_{r+1}}$.
The final value is obtained by comparing the predicted value $x_{R+1}$ to the data $y$ using some loss function $$ \ell \eqdef \Ll(x_{R+1},y). $$
Load the loss and its gradient. Here we use a multi-class logistic loss $$ \Ll(z,y) \eqdef \log \sum_{i=1}^k e^{z_i} - \dotp{z}{y}. $$
Note that in practice the computation is done in parallel over an array $x$ of size $(p,n)$ of $n$ points in $\RR^p$, where the class probability to predict is an array $y$ of size $k,n$ where $k$ is the number of classes.
Useful helpers.
# def dotp(x,y): return x.flatten().dot( y.flatten().transpose() )
def dotp(x,y): return sum( x.flatten() * y.flatten() )
def max2(S): return np.tile( S.max(axis=1,keepdims=1), (1,S.shape[1]) )
Stabilized log-sum-exp
def LSE0(S): return np.log( np.exp(S).sum(axis=1,keepdims=1))
def LSE(S): return LSE0( S-max2(S) ) + S.max(axis=1,keepdims=1)
stabilized soft-max
def SM0(S): return np.exp(S) / np.tile( np.exp(S).sum(axis=1,keepdims=1), (1, S.shape[1]) )
def SM(S): return SM0(S-max2(S))
Energy, $y$ being a target probability distribution
def lossF(z,y): return LSE(z.transpose()).sum() - dotp(z,y)
gradient
def lossG(z,y): return SM(z.transpose()).transpose() - y
Load the activation function. Here we use an atan sigmoid function.
def rhoF(u): return np.arctan(u)
def rhoG(u): return 1/(1+u**2)
Display the activation.
t = np.linspace(-5,5,201)
plt.clf
plt.plot(t, rhoF(t))
plt.axis('tight');
Dimensions $d_r$ of the layers.
D = np.array([p,15,k]) # here a single hidden layer
Initialize the layers randomly.
R = D.size-1
A = []
b = []
for r in np.arange(0,R):
A.append(np.random.randn(D[r+1],D[r]))
b.append(np.random.randn(D[r+1],1))
Evaluate the network. Bookkeep the intermediate results: this is crucial for the computation of the gradient.
def ForwardNN(A,b,Z,R):
X = []
X.append(Z)
for r in np.arange(0,R):
X.append( rhoF( np.dot(A[r],X[r]) + np.tile(b[r],[1,Z.shape[1]]) ) )
return X
X = ForwardNN(A,b,x,R)
L = lossF(X[-1],Y)
The network parameters are optimized by minimizing the non-convex empirical loss minimization through gradient descent.
Initialize the gradient as $$ \nabla_{x_{R+1}} \ell = \nabla \Ll(x_{R+1},y) $$
gx = lossG(X[R],Y)
The successive gradient with respect to the intermediate variables $x_r$ are solutions of a backward recursion, which corresponds to the celebrated backpropagation algorithm. $$ \forall r=R,\ldots,1, \quad \nabla_{x_{r}} \ell = A_r^\top M_r $$ where we denoted $$ M_r \eqdef \rho'(A_r x_r + b_r ) \odot \nabla_{x_{r+1}} \ell, $$ where $\odot$ is entry-wise multiplications.
From these gradients with respect to the intermediate layers variables, the gradient with respect to the network paramters are retrieved as $$ \nabla_{A_r} \ell = M_r x_r^\top, \qandq \nabla_{b_r} \ell = M_r 1_n. $$
Perform the back-propagation.
def BackwardNN(A,b,X,R):
gA = [] # gradient with respect to A
gb = [] # gradient with respect to b
for r in np.arange(0,R):
gA.append([])
gb.append([])
gx = lossG(X[R],Y) # initialize the gradient
for r in np.arange(R-1,-1,-1):
M = rhoG( A[r].dot(X[r]) + np.tile(b[r],[1,n]) ) * gx
# nabla_X[r]
gx = A[r].transpose().dot(M)
# nabla_A[r]
gA[r] = M.dot(X[r].transpose())
# nabla_b[r]
gb[r] = MakeCol(M.sum(axis=1))
return [gA,gb]
Exercise 1
Implement the gradient descent.
run -i nt_solutions/ml_6_nn/exo1
## Insert your code here.
Grid for evaluation.
q = 100;
t = np.linspace(-1,1,q)
[Yg,Xg] = np.meshgrid(t,t)
Z = np.vstack([Xg.flatten(), Yg.flatten()])
Classification maps
V = ForwardNN(A,b,Z,R)
U = np.reshape(SM(V[R].transpose()), [q,q,k] )
Turn it into color.
R = np.zeros((q,q,3))
for i in np.arange(0,k):
for a in np.arange(0,3):
R[:,:,a] = R[:,:,a] + U[:,:,i] * col[a,i]
Final display of points and class probability.
M=1
plt.clf
plt.imshow(R.transpose((1, 0, 2)), origin="lower", extent=[-M,M,-M,M])
for i in np.arange(0,k):
I = find(y==i+1)
plt.plot(x[0,I], x[1,I], '.', color=col[:,i]*.8)
plt.axis('off');
Exercise 2
Check the influence of the number of layers
run -i nt_solutions/ml_6_nn/exo2
## Insert your code here.
Instead of computing the gradient "by hand", we use an automatic differentation toolbox, here AutoGrad.
It uses reverse mode automatic differentation to compute the gradient at the same expense at computing the function itself. See this tutorial for more information.
from __future__ import absolute_import
import autograd.numpy as np
import autograd as ag
from autograd import elementwise_grad as egrad
Test on a scalar value function, computing and plotting higher-order derivatives.
def tanh(x):
return (1.0 - np.exp(-x)) / (1.0 + np.exp(-x))
t = np.linspace(-7, 7, 200)
plt.plot(t, tanh(t),
t, egrad(tanh)(t), # first derivative
t, egrad(egrad(tanh))(t), # second derivative
t, egrad(egrad(egrad(tanh)))(t), # third derivative
t, egrad(egrad(egrad(egrad(tanh))))(t), # fourth derivative
t, egrad(egrad(egrad(egrad(egrad(tanh)))))(t), # fifth derivative
t, egrad(egrad(egrad(egrad(egrad(egrad(tanh))))))(t)); # sixth derivative
Gradient of a quadratic functions.
M = np.random.randn(5,2);
def f(x):
u = np.dot( M, x )
return np.sum( u*u )
u = MakeCol( np.array( [1.0,2.0] ) )
print( 'f=' + str(f(u)) )
g = ag.grad(f)
print( 'nabla f=' + str( g( u ) ) )
f=8.283198013925736 nabla f=[[11.25997634] [ 2.65320984]]
Function to be minimized for neural networks.
def FuncNN(P): # A,b
X = ForwardNN(P[0],P[1],x,R)
return lossF(X[-1],Y)
D = [p, 4, 5, 8, 4, k]
R = np.size(D)-1
A = []
b = []
for r in np.arange(0,R):
A.append(np.random.randn(D[r+1],D[r]))
b.append(np.random.randn(D[r+1],1))
Compute the funcfion and the gradient
FuncNNG = ag.value_and_grad(FuncNN)
[u,g] = FuncNNG((A,b))
gA = g[0] # gradient with respect to A
gb = g[1] # gradient with respect to b
Compare with the gradient computed "by hand".
import numpy
X = ForwardNN(A,b,x,R)
[gA1,gb1] = BackwardNN(A,b,X,R)
R = np.size(A)
e = 0
for i in range(R):
e += np.linalg.norm(gA[r]-gA1[r])
e += np.linalg.norm(gb[r]-gb1[r])
print('Error=' + str(e))
Error=7.622089138719459e-14