# Neural Networks¶


This tour details fully connected multi-layers neural netorks.

We recommend that after doing this Numerical Tours, you apply it to your own data, for instance using a dataset from <https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ LibSVM>.

Disclaimer: these machine learning tours are intended to be overly-simplistic implementations and applications of baseline machine learning methods. For more advanced uses and implementations, we recommend to use a state-of-the-art library, the most well known being <http://scikit-learn.org/ Scikit-Learn>

In [93]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

The autoreload extension is already loaded. To reload it, use:

In [94]:
#  convert to a column vector
def MakeCol(y): return y.reshape(-1,1)
#  convert to a row vector
def MakeRow(y): return y.reshape(1,-1)
# find non zero/true elements
def find(x): return np.nonzero(x)[0]


## Dataset Generation¶

Build a synthetic data set for classification

Generate Data

In [95]:
n0 = 100 # number of points per class
p = 2   # dimensionality
k = 3   # number of classes
n = n0*k # Total number of points
x = np.zeros((p,n))
y = np.zeros((1,n))
for j in np.arange(0,k):
I = np.arange(n0*(j-1),n0*j)
t = np.linspace(j*4,(j+1)*4,n0) + np.random.randn(1,n0)*0.2 # angle
x[0,I] = r*np.sin(t)
x[1,I] = r*np.cos(t)
y[0,I] = j


Display.

In [96]:
col = np.array( [ [1,0,0], [0,1,0], [0,0,1], [0,0,0], [0,1,1], [1,0,1], [1,1,0], [1,.5,.5], [.5,1,.5], [.5,.5,1]  ] ).transpose()
plt.clf
for j in np.arange(0,k):
I = find(y.flatten()==j)
plt.plot(x[0,I], x[1,I], '.', color=col[:,j])
plt.axis('equal')
plt.axis('off');


Class probability matrix.

In [97]:
U = np.tile( MakeCol(np.arange(1,k+1)), (1,n) )
Y = np.double( U == np.tile( y+1, (k,1) ) )


## Building the Network¶

We setup the network. It is parameterized by the dimensions of the layers.

The network is composed of $R$ layers, and operate by initialyzing $x_0=x$ and then iterating $$\forall r=0,\ldots,R, \quad x_{r+1} \eqdef \rho(A_r x_r + b_r).$$ Here $\rho : \RR \mapsto \RR$ is a non-linear activation function which operate coordinate by coordinate. The intermediate variables are $x_r \in \RR^{d_r}$ with $(d_0,d_{L+1})=(p,k)$. The matrices have size $A_r \in \RR^{d_{r+1} \times d_r}$ while the biases have size $b_r \in \RR^{d_{r+1}}$.

The final value is obtained by comparing the predicted value $x_{R+1}$ to the data $y$ using some loss function $$\ell \eqdef \Ll(x_{R+1},y).$$

Load the loss and its gradient. Here we use a multi-class logistic loss $$\Ll(z,y) \eqdef \log \sum_{i=1}^k e^{z_i} - \dotp{z}{y}.$$

Note that in practice the computation is done in parallel over an array $x$ of size $(p,n)$ of $n$ points in $\RR^p$, where the class probability to predict is an array $y$ of size $k,n$ where $k$ is the number of classes.

Useful helpers.

In [98]:
# def dotp(x,y): return x.flatten().dot( y.flatten().transpose() )
def dotp(x,y): return sum( x.flatten() * y.flatten() )
def max2(S): return np.tile( S.max(axis=1,keepdims=1), (1,S.shape[1]) )


Stabilized log-sum-exp

In [99]:
def LSE0(S): return np.log( np.exp(S).sum(axis=1,keepdims=1))
def LSE(S): return LSE0( S-max2(S) ) + S.max(axis=1,keepdims=1)


stabilized soft-max

In [100]:
def SM0(S): return np.exp(S) / np.tile( np.exp(S).sum(axis=1,keepdims=1), (1, S.shape[1]) )
def SM(S): return SM0(S-max2(S))


Energy, $y$ being a target probability distribution

In [101]:
def lossF(z,y): return LSE(z.transpose()).sum() - dotp(z,y)


In [102]:
def lossG(z,y): return SM(z.transpose()).transpose() -  y


Load the activation function. Here we use an atan sigmoid function.

In [103]:
def rhoF(u): return np.arctan(u)
def rhoG(u): return 1/(1+u**2)


Display the activation.

In [104]:
t = np.linspace(-5,5,201)
plt.clf
plt.plot(t, rhoF(t))
plt.axis('tight');


Dimensions $d_r$ of the layers.

In [105]:
D = np.array([p,15,k]) # here a single hidden layer


Initialize the layers randomly.

In [106]:
R = D.size-1
A = []
b = []
for r in np.arange(0,R):
A.append(np.random.randn(D[r+1],D[r]))
b.append(np.random.randn(D[r+1],1))


Evaluate the network. Bookkeep the intermediate results: this is crucial for the computation of the gradient.

In [107]:
def ForwardNN(A,b,Z,R):
X = []
X.append(Z)
for r in np.arange(0,R):
X.append( rhoF( np.dot(A[r],X[r]) + np.tile(b[r],[1,Z.shape[1]]) ) )
return X
X = ForwardNN(A,b,x,R)
L = lossF(X[-1],Y)


## Network Optimization¶

The network parameters are optimized by minimizing the non-convex empirical loss minimization through gradient descent.

Initialize the gradient as $$\nabla_{x_{R+1}} \ell = \nabla \Ll(x_{R+1},y)$$

In [108]:
gx = lossG(X[R],Y)


The successive gradient with respect to the intermediate variables $x_r$ are solutions of a backward recursion, which corresponds to the celebrated backpropagation algorithm. $$\forall r=R,\ldots,1, \quad \nabla_{x_{r}} \ell = A_r^\top M_r$$ where we denoted $$M_r \eqdef \rho'(A_r x_r + b_r ) \odot \nabla_{x_{r+1}} \ell,$$ where $\odot$ is entry-wise multiplications.

From these gradients with respect to the intermediate layers variables, the gradient with respect to the network paramters are retrieved as $$\nabla_{A_r} \ell = M_r x_r^\top, \qandq \nabla_{b_r} \ell = M_r 1_n.$$

Perform the back-propagation.

In [109]:
def BackwardNN(A,b,X,R):
gA = [] # gradient with respect to A
gb = [] # gradient with respect to b
for r in np.arange(0,R):
gA.append([])
gb.append([])
gx = lossG(X[R],Y) # initialize the gradient
for r in np.arange(R-1,-1,-1):
M = rhoG( A[r].dot(X[r]) + np.tile(b[r],[1,n]) ) * gx
# nabla_X[r]
gx = A[r].transpose().dot(M)
# nabla_A[r]
gA[r] =  M.dot(X[r].transpose())
# nabla_b[r]
gb[r] =  MakeCol(M.sum(axis=1))
return [gA,gb]


Exercise 1

In [110]:
run -i nt_solutions/ml_5_nn/exo1

In [111]:
## Insert your code here.


Grid for evaluation.

In [112]:
q = 100;
t = np.linspace(-1,1,q)
[Yg,Xg] = np.meshgrid(t,t)
Z = np.vstack([Xg.flatten(), Yg.flatten()])


Classification maps

In [113]:
V = ForwardNN(A,b,Z,R)
U = np.reshape(SM(V[R].transpose()), [q,q,k] )


Turn it into color.

In [114]:
R = np.zeros((q,q,3))
for i in np.arange(0,k):
for a in np.arange(0,3):
R[:,:,a] = R[:,:,a] + U[:,:,i] * col[a,i]


Final display of points and class probability.

In [115]:
M=1
plt.clf
plt.imshow(R.transpose((1, 0, 2)), origin="lower", extent=[-M,M,-M,M])
for i in np.arange(0,k):
I = find(y==i+1)
plt.plot(x[0,I], x[1,I], '.', color=col[:,i]*.8)
plt.axis('off');


Exercise 2

Check the influence of the number of layers

In [116]:
D = Layers[il]
A

Out[116]:
[array([[-3.29284689,  1.49260118],
[ 1.37717411,  1.58627402],
[-0.80847499, -4.04485085],
[-2.77291595, -3.2371961 ],
[-1.16796575,  8.24310395],
[-0.11288313,  0.15242016],
[ 6.62389652,  5.89589916],
[ 1.14380596,  1.12267236],
[-0.2469626 ,  0.22520564],
[-9.98003355,  3.00569398],
[-4.67762539,  1.43051334],
[-5.76283231,  1.86451588],
[ 0.97346821, -0.48986319],
[-1.08462158,  7.07794009],
[-4.96509514, -3.98473611]]),
array([[-1.46602295, -2.05840468,  3.67989712,  1.43298496,  5.05436557,
-1.0725481 ,  4.46247667, -1.0529891 ,  0.61408046, -1.12107628,
-1.89210352, -2.67492959,  0.27324336, -6.17942339,  5.19088946],
[-2.0667059 ,  0.58920165, -1.48053501, -4.90046138,  0.72283288,
-0.88651266, -4.6487042 ,  2.66585634,  1.68710386,  5.98923599,
-3.26935353, -4.72661609, -0.84925318,  1.50110362, -4.6023117 ],
[ 3.72321229, -1.03620466,  0.06729369,  1.91073085, -3.39529315,
-0.7677965 , -0.74708235, -0.43700477,  0.45635066, -5.28548548,
4.31764138,  4.70269184, -2.6960545 ,  6.32810019,  1.94687973]])]
In [117]:
run -i nt_solutions/ml_5_nn/exo2

In [118]:
## Insert your code here.


## Automatic Differentiation¶

It uses reverse mode automatic differentation to compute the gradient at the same expense at computing the function itself. See this tutorial for more information.

In [119]:
from __future__ import absolute_import


Test on a scalar value function, computing and plotting higher-order derivatives.

In [120]:
def tanh(x):
return (1.0 - np.exp(-x))  / (1.0 + np.exp(-x))
t = np.linspace(-7, 7, 200)
plt.plot(t, tanh(t),


In [121]:
M = np.random.randn(5,2);
def f(x):
u = np.dot( M, x )
return np.sum( u*u )
u = MakeCol( np.array( [1.0,2.0] )  )
print( 'f=' + str(f(u)) )
print( 'nabla f=' + str( g( u ) ) )

f=8.088903186499566
nabla f=[[5.18356321]
[5.49712158]]


Function to be minimized for neural networks.

In [122]:
def FuncNN(P): # A,b
X = ForwardNN(P[0],P[1],x,R)
return lossF(X[-1],Y)

In [123]:
D = [p, 4, 5, 8, 4, k]
R = np.size(D)-1
A = []
b = []
for r in np.arange(0,R):
A.append(np.random.randn(D[r+1],D[r]))
b.append(np.random.randn(D[r+1],1))


Compute the funcfion and the gradient

In [124]:
FuncNNG = ag.value_and_grad(FuncNN)
[u,g] = FuncNNG((A,b))
gA = g[0] # gradient with respect to A
gb = g[1] # gradient with respect to b


Compare with the gradient computed "by hand".

In [125]:
import numpy
X = ForwardNN(A,b,x,R)
[gA1,gb1] = BackwardNN(A,b,X,R)
R = np.size(A)
e = 0
for i in range(R):
e += np.linalg.norm(gA[r]-gA1[r])
e += np.linalg.norm(gb[r]-gb1[r])
print('Error=' + str(e))

Error=1.0848339337041004e-13