Choosing a good topology for the neural network is key to successful training, as the number of layers and the number of units in each layer directly determine the number of trainable weights. But how to choose the number of layers and units? This question is of particular importance when neural networks are used for financial modelling, as these models are subject to model governance as well as internal and external model validation. Thus, the network topology has to be chosen in a way that is compatible with model governance.
Typical approaches to choosing a network topology (in finance and any other context) are:

1. manual trial and error,
2. automated hyperparameter search (AutoML),
3. deriving the topology from theoretical results in the literature.
The first method is of course straightforward and quite often used successfully in practice. However, it is not always successful. For financial models, this approach is particularly problematic, as pure trial and error is not compatible with any model governance framework.
The second method can also work well for many practical use cases, but from a model validation perspective the problem is that one tries to shed light on a black box with another black box. AutoML could only be used to validate a bank's machine learning solution after it has successfully completed a model validation process itself. That might be possible, but it requires a lot of additional resources, potentially more than validating the neural network in question directly. Also, it is doubtful that a regulator would sign such a blank cheque.
The third method has the advantage that a lot of literature and research already exists in that area, most famously the universal approximation theorem for neural networks. This literature is very helpful in providing sound methodological justification and theoretical background. In practice, however, asymptotic convergence results will often not suffice to uniquely determine the concrete parameters of the network for the problem at hand.
We will therefore propose an intermediate solution that improves the first method with some of the techniques used in the second, in particular grid search. We will formulate this proposal in a language that takes the more classical perspective of degrees of freedom, which is very common in model validation. The aim is to have a framework in place that can be used in practice to get a production-level machine learning application in quantitative finance through a model validation process.
from keras.models import Sequential
from keras.layers import Dense
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
%matplotlib notebook
Let's first consider the case of MLPs. For MLPs one has to choose the number $n_L$ of layers and, for each hidden layer $i$, the number $n_{u_i}$ of units. For any choice of these parameters, the degrees of freedom in the resulting MLP are precisely the number $n_w$ of trainable weights, which we compute in the following.
For a primer on MLPs and the notation convention used in the following, see here.
In the easiest case, one only has $n_L = 1$ layer. In that case, the number $n_w$ of trainable weights is given by $$ n_w = n_o n_i + n_o,$$ where

- $n_i$ is the input dimension and
- $n_o$ is the output dimension.
This is because we need a weight matrix in $\mathbb{R}^{n_o \times n_i}$ and a vector in $\mathbb{R}^{n_o}$ for the bias (we always assume that we train the bias).
In particular, for a network with only $n_L = 1$ layer, the topology is completely determined by the input and the output dimensions and there is no choice to be made.
# in keras, this setup corresponds to
n_o = 3
n_i = 2
model = Sequential([
Dense(units=n_o, input_shape=(n_i,)),
])
print(n_i*n_o + n_o)
model.summary()
9
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_1 (Dense)              (None, 3)                 9
=================================================================
Total params: 9
Trainable params: 9
Non-trainable params: 0
_________________________________________________________________
If we allow $n_L=2$ layers, then we have to choose the number $n_u$ of units in the first layer. The resulting number of weights will be $$ n_w = n_u n_i + n_u + n_o n_u + n_o = n_u(n_i + n_o + 1) + n_o, $$ because for the first layer, we need one matrix of shape $n_u \times n_i$ and a vector of shape $n_u$, and for the second layer we need a matrix of shape $n_o \times n_u$ and a vector of shape $n_o$. Thus, the only parameter we need to choose is $n_u$.
# in keras, this setup corresponds to
n_o = 3
n_i = 2
n_u = 10
model = Sequential([
Dense(units=n_u, input_shape=(n_i,)),
Dense(units=n_o),
])
print(n_u * (n_i + n_o + 1) + n_o)
model.summary()
63
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_2 (Dense)              (None, 10)                30
_________________________________________________________________
dense_3 (Dense)              (None, 3)                 33
=================================================================
Total params: 63
Trainable params: 63
Non-trainable params: 0
_________________________________________________________________
In case the number of layers is $n_L \geq 3$, we have to choose the number of units separately for all but the last layer. If we enumerate the layers from input to output by $1, \ldots, n_L$ and denote by $n_{u_i}$ the number of units in layer $i$, then the total number of trainable weights is given by $$ n_w = n_i n_{u_1} + n_{u_1} + n_{u_{n_L - 1}} n_o + n_o + \sum_{i=2}^{n_L - 1}{\left( n_{u_{i-1}} n_{u_{i}} + n_{u_i} \right)},$$ which in theory leaves us with many choices.
If we assume that all but the output layer have the same number of units, i.e. $n_{u_i}=n_u$ for $1 \leq i < n_L$, then we only have to choose one number $n_u$ and the resulting number of trainable weights simplifies to $$ n_w = (n_L - 2) n_u^2 + (n_i + n_o + n_L - 1)n_u + n_o ,$$ which leaves us with two choices, namely $n_L$, the number of layers, and $n_u$, the number of units per layer.
# in keras, an example is given by
n_o = 3
n_i = 2
n_u = 10
n_L = 5
model = Sequential([
Dense(units=n_u, input_shape=(n_i,)),
Dense(units=n_u),
Dense(units=n_u),
Dense(units=n_u),
Dense(units=n_o),
])
print( (n_L - 2) * n_u**2 + (n_i + n_o + n_L - 1) * n_u + n_o)
model.summary()
393
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_4 (Dense)              (None, 10)                30
_________________________________________________________________
dense_5 (Dense)              (None, 10)                110
_________________________________________________________________
dense_6 (Dense)              (None, 10)                110
_________________________________________________________________
dense_7 (Dense)              (None, 10)                110
_________________________________________________________________
dense_8 (Dense)              (None, 3)                 33
=================================================================
Total params: 393
Trainable params: 393
Non-trainable params: 0
_________________________________________________________________
We have seen that under the assumption that every layer has the same number of units (except the output layer), the number $n_w$ of trainable weights is a function $n_w = n_w(n_L, n_u)$ of the number $n_L$ of layers and the number $n_u$ of units per layer.
In order to find a good network topology for a given problem, one could of course simply try out a grid of networks parametrised by $n_L$ and $n_u$. However, if we study this function using some simple examples, we see that this might not be the best choice.
# implement $n_w$ function
def num_weights(n_i, n_o, n_L, n_u):
    """
    Computes the total number of parameters in an MLP, assuming all layers have the same
    number of units (except the output layer).
    param n_i: number of inputs
    param n_o: number of outputs
    param n_L: number of layers
    param n_u: number of units per layer
    returns: total number of trainable weights
    """
    if n_L == 2:
        return n_u * (n_i + n_o + 1) + n_o
    else:
        return (n_L - 2) * n_u**2 + (n_i + n_o + n_L - 1) * n_u + n_o
# create example
n_o = 3
n_i = 2
units = np.arange(20, 120, 20)
layers = np.arange(2, 6)
xx, yy = np.meshgrid(units, layers)
x, y = xx.ravel(), yy.ravel()
z = np.array([[num_weights(n_i, n_o, l, u) for u in units] for l in layers]).ravel()
bottom = np.zeros_like(z)
# plot example
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
width = 5
depth = 0.25
ax.bar3d(x, y, bottom, width, depth, z, shade=True)
ax.set_title('Trainable weights')
ax.set_xlabel('number of units')
ax.set_ylabel('number of layers')
ax.set_zlabel('number of weights')
ax.zaxis.set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
ax.yaxis.set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
ax.set_yticks(layers)
plt.show()
As illustrated by the above plot, the degrees of freedom in the network, i.e. the number $n_w$ of weights, depend linearly on the number $n_L$ of layers, but quadratically on the number $n_u$ of units. To leading order, doubling the number of layers therefore doubles the number of weights, while doubling the number of units quadruples it.
Thus, if we want to find an optimal, or at least a ''good'', network topology for a given problem, simply trying out combinations makes it hard to say whether it is better to increase the number of units or the number of layers, because these increases have different effects on the number of weights. If we find that doubling the number of units (thus roughly quadrupling the number of weights) results in a network with a lower bias than doubling the number of layers (thus only roughly doubling the number of weights), would it be fair to say that doubling the number of units is better?
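To make the different growth rates concrete, here is a small numerical check with the illustrative values $n_i = 2$, $n_o = 3$ used throughout: the increments of $n_w$ are constant when adding layers, but grow when adding units.

```python
# illustrative check with n_i = 2, n_o = 3: increments of n_w are constant
# when adding layers, but grow when adding units
def num_weights(n_i, n_o, n_L, n_u):
    # restated from above so this cell stands alone
    if n_L == 2:
        return n_u * (n_i + n_o + 1) + n_o
    return (n_L - 2) * n_u**2 + (n_i + n_o + n_L - 1) * n_u + n_o

layer_increments = [num_weights(2, 3, L + 1, 50) - num_weights(2, 3, L, 50)
                    for L in range(3, 7)]
unit_increments = [num_weights(2, 3, 4, u + 10) - num_weights(2, 3, 4, u)
                   for u in range(30, 70, 10)]
print(layer_increments)  # [2550, 2550, 2550, 2550] -- constant in n_L
print(unit_increments)   # [1480, 1880, 2280, 2680] -- growing in n_u
```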
Therefore, it is more insightful to start with a set of candidate numbers of units $n_u$ and then to increase the number of layers while keeping the total number of weights fixed. This can be achieved by reducing the number of units whenever the number of layers is increased. Mathematically, this amounts to rewriting the equation for $n_w$
$$ n_w = (n_L - 2) n_u^2 + (n_i + n_o + n_L - 1)n_u + n_o \quad \Longleftrightarrow \quad n_u^2 + \underbrace{\frac{n_i + n_o + n_L - 1}{n_L - 2}}_{=:p} n_u + \underbrace{\frac{n_o - n_w}{n_L - 2}}_{=:q} = 0 $$and solving this quadratic equation for $n_u$. The positive solution is thus given by $$ n_u = - \frac{p}{2} + \sqrt{\frac{p^2}{4} - q} $$
Using this we can keep the number of weights constant when increasing the number of layers. We illustrate this with an example.
# implement $n_u$ as function of $n_w$
def num_units(n_i, n_o, n_L, n_w):
    """
    Computes the number of units per layer needed to achieve a given number of parameters,
    assuming all layers have the same number of units (except the output layer).
    param n_i: number of inputs
    param n_o: number of outputs
    param n_L: number of layers
    param n_w: number of weights
    returns: number of units per layer needed
    """
    if n_L == 2:
        return int((n_w - n_o) / (n_i + n_o + 1))
    else:
        p = n_i + n_o + n_L - 1
        p /= n_L - 2
        q = n_o - n_w
        q /= n_L - 2
        return int(-p/2 + np.sqrt(p**2/4 - q))
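As a quick sanity check with the illustrative values $n_i = 2$, $n_o = 3$, we can start from the weight count of a $2$-layer network with $100$ units and recover the unit counts for deeper networks at (roughly) constant weight count; the formulas are inlined so the cell stands alone.

```python
import numpy as np

# sanity check with n_i = 2, n_o = 3: take the weight count of a 2-layer
# network with 100 units and recover unit counts for deeper networks
n_i, n_o = 2, 3
target = 100 * (n_i + n_o + 1) + n_o  # n_w for n_L = 2, n_u = 100: 603
for n_L in range(3, 6):
    p = (n_i + n_o + n_L - 1) / (n_L - 2)
    q = (n_o - target) / (n_L - 2)
    n_u = int(-p / 2 + np.sqrt(p**2 / 4 - q))
    n_w = (n_L - 2) * n_u**2 + (n_i + n_o + n_L - 1) * n_u + n_o
    print(n_L, n_u, n_w)  # n_w stays within roughly 10% of 603
```

The residual deviation from 603 comes from rounding $n_u$ to an integer, which is discussed below.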
# create example
n_o = 3
n_i = 2
layers = np.arange(2, 6)
units = np.arange(20, 120, 20)
weights ={u : num_weights(n_i, n_o, layers[0], u) for u in units}
xx, yy = np.meshgrid(units, layers)
x, y = xx.ravel(), yy.ravel()
z = np.array([[num_weights(n_i, n_o, l, num_units(n_i, n_o, l, weights[u])) for u in units] for l in layers]).ravel()
bottom = np.zeros_like(z)
# plot example
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
width = 5
depth = 0.25
ax.bar3d(x, y, bottom, width, depth, z, shade=True)
ax.set_title('Trainable weights')
ax.set_xlabel('original number of units')
ax.set_ylabel('number of layers')
ax.set_zlabel('number of weights')
ax.zaxis.set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
ax.yaxis.set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
ax.set_yticks(layers)
plt.show()
We see that the number of weights now increases linearly with the number of units, but stays roughly constant with the number of layers (because the number of units is decreased). It does not stay exactly constant due to rounding the solution of the quadratic equation.
This leaves us with the following method of determining a good network topology for any given problem:
Input:
- training data $(x, y)$ with input dimension $n_i$ and output dimension $n_o$,
- a range of candidate unit numbers $n_u$ (for the baseline case $n_L = 2$),
- a range of layer numbers $n_L$,
- thresholds for the acceptable bias and variance of the trained network.

Steps:
1. For each candidate $n_u$, compute the number $n_w$ of weights of the corresponding $2$-layer network.
2. For each $n_L$ in the range, compute via the quadratic formula above the number of units that keeps $n_w$ (roughly) constant.
3. Train a network for each resulting pair $(n_u, n_L)$ and measure its bias and variance on $(x, y)$.
4. Among the pairs whose bias and variance are within the thresholds, select the optimal one.
Output: A pair $(n_u, n_L)$ of units and layers for the network $\operatorname{NN}$ such that the bias and the variance of the network on $(x,y)$ are within the thresholds and the pair $(n_u, n_L)$ is optimal within the given range.
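A minimal sketch of how such a grid of candidate topologies could be assembled in code; `topology_grid` is an illustrative helper name, not an established API, and the two formulas from above are restated so the cell stands alone. Training and evaluating each candidate against the thresholds is left to the user's own routine.

```python
import numpy as np

def num_weights(n_i, n_o, n_L, n_u):
    # restated from above so this cell stands alone
    if n_L == 2:
        return n_u * (n_i + n_o + 1) + n_o
    return (n_L - 2) * n_u**2 + (n_i + n_o + n_L - 1) * n_u + n_o

def num_units(n_i, n_o, n_L, n_w):
    # restated from above so this cell stands alone
    if n_L == 2:
        return int((n_w - n_o) / (n_i + n_o + 1))
    p = (n_i + n_o + n_L - 1) / (n_L - 2)
    q = (n_o - n_w) / (n_L - 2)
    return int(-p / 2 + np.sqrt(p**2 / 4 - q))

def topology_grid(n_i, n_o, candidate_units, max_layers):
    """Return a list of (n_L, n_u) pairs; for each candidate unit count,
    the weight count is held roughly constant while the depth increases."""
    grid = []
    for n_u0 in candidate_units:
        n_w = num_weights(n_i, n_o, 2, n_u0)
        for n_L in range(2, max_layers + 1):
            grid.append((n_L, num_units(n_i, n_o, n_L, n_w)))
    return grid

grid = topology_grid(n_i=2, n_o=3, candidate_units=[20, 50, 100], max_layers=5)
```

Each pair in `grid` would then be trained on $(x, y)$, its bias and variance measured, and the optimal pair within the thresholds selected.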
If one wants to compare the MLP with a Long Short-Term Memory network (LSTM), one has to determine the number of weights in an LSTM as well.
The number of weights in a single LSTM layer with $k$ features and $m$ units is given by \begin{align*} 4m^2 + 4(k+1)m. \end{align*}
For an LSTM with $n_L-1$ layers with $n_u$ units, $n_i$ inputs followed by a single dense output layer of dimension $n_o$, we obtain \begin{align*} n_w & = 4 (2 n_L -3) n_u^2 + (4 n_i + n_o + 4 n_L - 4 )n_u + n_o. \end{align*}
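As a cross-check, the closed-form expression can be compared against a direct layer-by-layer count; `lstm_layer_weights` and `lstm_num_weights` are illustrative helper names.

```python
# cross-check of the closed-form LSTM weight count against a direct
# layer-by-layer sum
def lstm_layer_weights(k, m):
    # one LSTM layer with k input features and m units: four gates, each
    # with an m x k input matrix, an m x m recurrent matrix and an m bias
    return 4 * m**2 + 4 * (k + 1) * m

def lstm_num_weights(n_i, n_o, n_L, n_u):
    total = lstm_layer_weights(n_i, n_u)               # first LSTM layer
    total += (n_L - 2) * lstm_layer_weights(n_u, n_u)  # stacked LSTM layers
    total += n_o * n_u + n_o                           # dense output layer
    return total

n_i, n_o, n_L, n_u = 2, 3, 4, 10
closed_form = 4 * (2 * n_L - 3) * n_u**2 + (4 * n_i + n_o + 4 * n_L - 4) * n_u + n_o
print(lstm_num_weights(n_i, n_o, n_L, n_u), closed_form)  # both 2233
```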