4.1 Layers and Blocks

  • MXNet's Blocks

    • Layers are blocks
    • Many layers can be a block
    • Many blocks can be a block
    • Code can be a block
    • Blocks take care of a lot of housekeeping, such as parameter initialization, backpropagation, and related issues.
    • Sequential concatenations of layers and blocks are handled by the eponymous Sequential block.
  • Blocks are combinations of one or more layers.

  • Network design is aided by code that generates such blocks on demand.
In [1]:
from mxnet import nd
from mxnet.gluon import nn

x = nd.random.uniform(shape=(2, 20))

net = nn.Sequential()
net.add(nn.Dense(256, activation='relu'))
net.add(nn.Dense(10))
net.initialize()
net(x)
Out[1]:
[[ 0.09543003  0.04614332 -0.00286653 -0.07790346 -0.05130243  0.02942039
   0.08696645 -0.0190793  -0.04122177  0.05088576]
 [ 0.0769287   0.03099705  0.00856576 -0.04467198 -0.0692684   0.09132432
   0.06786594 -0.06187843 -0.03436674  0.04234695]]
<NDArray 2x10 @cpu(0)>
  • We used the nn.Sequential constructor to generate an empty network into which we then inserted two layers.
    • This really just constructs a block.
    • These blocks can be combined into larger artifacts, often recursively.
  • A block behaves very much like a fancy layer
    • It needs to ingest data (the input).
    • It needs to produce a meaningful output.
      • It allows us to invoke a block via net(x) to obtain the desired output.
      • It invokes forward to perform forward propagation.
    • It needs to produce a gradient with regard to its input when invoking backward.
      • Typically this is automatic.
    • It needs to store parameters that are inherent to the block.
    • Obviously it also needs to initialize these parameters as needed.
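  • As a minimal check of these behaviors (a sketch, not part of the original cells; it reuses net and x from In [1]):
In [ ]:
# Invoking the block ingests the input and dispatches to forward().
y = net(x)
print(y.shape)               # (2, 10)

# The block stores its (already initialized) parameters.
print(net.collect_params())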

4.1.1 A Custom Block

  • The nn.Block class
    • It is a model constructor provided in the nn module, which we can inherit to define the model we want.
  • The following MLP class inherits the Block class to construct the multilayer perceptron
    • It overrides the __init__ and forward functions of the Block class.
    • They are used to create model parameters and define forward computations, respectively.
In [2]:
from mxnet import nd
from mxnet.gluon import nn

class MLP(nn.Block):
    # Declare a layer with model parameters. 
    # Here, we declare two fully connected layers.
    def __init__(self, **kwargs):
        # Call the constructor of the MLP parent class Block to perform the necessary initialization. 
        # In this way, other function parameters can also be specified when constructing an instance, 
        # such as the model parameter, params, described in the following sections.
        super(MLP, self).__init__(**kwargs)
        self.hidden = nn.Dense(256, activation='relu')  # Hidden layer.
        self.output = nn.Dense(10)  # Output layer.

    # Define the forward computation of the model
    # That is, how to return the required model output based on the input x.
    def forward(self, x):
        return self.output(self.hidden(x))
  • The forward method computes the network's output simply by evaluating the hidden layer self.hidden(x) and then evaluating the output layer self.output( ... ).
    • This is what we expect in the forward pass of this block.
  • The __init__ method
    • Defines the layers.
      • It first performs the Block-related initialization via the parent constructor and then constructs the requisite layers.
  • There is no need to define a backpropagation method in the class.
    • The system automatically generates the backward method
  • The same applies to the initialize method, which is generated automatically.
In [3]:
net = MLP()
net.initialize()
net(x)
Out[3]:
[[ 0.0036223   0.00633331  0.03201144 -0.01369375  0.10336448 -0.03508019
  -0.00032164 -0.01676024  0.06978628  0.01303308]
 [ 0.03871716  0.02608212  0.03544959 -0.02521311  0.11005434 -0.01430662
  -0.03052465 -0.03852826  0.06321152  0.0038594 ]]
<NDArray 2x10 @cpu(0)>
  • A subclass of the Block class can be several things:
    • it can be a layer (such as the Dense class provided by Gluon),
    • it can be a model (such as the MLP class we just derived),
    • it can be a part of a model (this is what typically happens when designing very deep networks).

4.1.2 A Sequential Block

  • The purpose of the Sequential class is to provide some useful convenience functions.
    • The add method allows us to add concatenated Block subclass instances one by one.
    • The forward computation of the model applies these instances one by one, in the order in which they were added.
In [4]:
class MySequential(nn.Block):
    def __init__(self, **kwargs):
        super(MySequential, self).__init__(**kwargs)

    def add(self, block):
        # Here, block is an instance of a Block subclass, and we assume it has a unique name. 
        # We save it in the member variable _children of the Block class, and its type is OrderedDict. 
        self._children[block.name] = block

    def forward(self, x):
        # OrderedDict guarantees that members will be traversed in the order they were added.
        for block in self._children.values():
            x = block(x)
        return x
  • When a MySequential instance calls the initialize function, the system automatically initializes all members of _children.
In [5]:
net = MySequential()
net.add(nn.Dense(256, activation='relu'))
net.add(nn.Dense(10))
net.initialize()
net(x)
Out[5]:
[[ 0.07787763  0.00216402  0.016822    0.0305988  -0.00702019  0.01668715
   0.04822846  0.0039432  -0.09300035 -0.04494302]
 [ 0.08891079 -0.00625484 -0.01619132  0.03807179 -0.01451489  0.02006173
   0.0303478   0.02463485 -0.07605447 -0.04389168]]
<NDArray 2x10 @cpu(0)>

4.1.3 Blocks with Code

  • Constant model parameter

    • These are parameters that are not updated by backpropagation during training. $$f(\mathbf{x},\mathbf{w}) = 3 \cdot \mathbf{w}^\top \mathbf{x}.$$

    • In this case 3 is a constant parameter.

    • We could change 3 to something else, say $c$ via $$f(\mathbf{x},\mathbf{w}) = c \cdot \mathbf{w}^\top \mathbf{x}.$$
In [8]:
class FancyMLP(nn.Block):
    def __init__(self, **kwargs):
        super(FancyMLP, self).__init__(**kwargs)
        # Random weight parameters created with get_constant are treated as constants:
        # they are not updated during training.
        self.rand_weight = self.params.get_constant(
            'rand_weight', 
            nd.random.uniform(shape=(20, 20))
        )
        self.dense = nn.Dense(20, activation='relu')

    def forward(self, x):
        x = self.dense(x)
        
        # Use the constant parameters created, as well as the relu and dot functions of NDArray.
        x = nd.relu(nd.dot(x, self.rand_weight.data()) + 1)

        # Reuse the fully connected layer. 
        # This is equivalent to sharing parameters with two fully connected layers.
        x = self.dense(x)
        
        # Control flow: call asscalar() to convert the norm to a Python scalar for comparison.
        while x.norm().asscalar() > 1:
            x /= 2
            
        if x.norm().asscalar() < 0.8:
            x *= 10
            
        return x.sum()
  • In this FancyMLP model, we used a constant weight rand_weight (note that it is not a model parameter), performed a matrix multiplication (nd.dot), and reused the same Dense layer.
  • We used the same layer twice.
    • The two applications share the same parameters (confirmed by the check after the output below).
In [9]:
net = FancyMLP()
net.initialize()
net(x)
Out[9]:
[25.522684]
<NDArray 1 @cpu(0)>
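  • As a quick check (a sketch reusing the net instance above), collect_params should list a single set of dense parameters plus the constant rand_weight, confirming that both applications of self.dense share the same weights.
In [ ]:
print(net.collect_params())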
  • The example below shows how to build a block from individual blocks, which in turn may be blocks themselves.
  • Furthermore, we can even combine multiple strategies inside the same forward function.
In [11]:
class NestMLP(nn.Block):
    def __init__(self, **kwargs):
        super(NestMLP, self).__init__(**kwargs)
        self.net = nn.Sequential()
        self.net.add(
            nn.Dense(64, activation='relu'),
            nn.Dense(32, activation='relu')
        )
        self.dense = nn.Dense(16, activation='relu')

    def forward(self, x):
        return self.dense(self.net(x))

chimera = nn.Sequential()
chimera.add(
    NestMLP(), 
    nn.Dense(20), 
    FancyMLP()
)

chimera.initialize()
chimera(x)
Out[11]:
[3.853818]
<NDArray 1 @cpu(0)>

4.1.4 Compilation

  • We have lots of dictionary lookups, code execution, and plenty of other Pythonic things going on in what is supposed to be a high-performance deep learning library.
  • The problems of Python’s Global Interpreter Lock are well known.
    • In the context of deep learning it means that we have a super fast GPU (or multiple of them) which might have to wait until a puny single CPU core running Python gets a chance to tell it what to do next.
    • This is clearly awful and there are many ways around it.
    • The best way to speed up Python is by avoiding it altogether.
  • Gluon does this by allowing for Hybridization.
    • With hybridization, the Python interpreter executes the block the first time it is invoked.
    • The Gluon runtime records what is happening, and the next time around it short-circuits any calls to Python.
    • This can accelerate things considerably in some cases, but care needs to be taken with control flow (see the sketch below).
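  • Below is a minimal sketch of hybridization (not from the original cells): HybridSequential and HybridBlock provide a hybridize method that traces the first forward pass and then reuses the cached computation graph, bypassing the Python interpreter on subsequent calls.
In [ ]:
from mxnet import nd
from mxnet.gluon import nn

net = nn.HybridSequential()
net.add(nn.Dense(256, activation='relu'))
net.add(nn.Dense(10))
net.initialize()
net.hybridize()    # subsequent forward passes reuse the compiled graph

x = nd.random.uniform(shape=(2, 20))
net(x)             # first call: traced and compiled
net(x)             # later calls short-circuit Python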

4.2 Parameter Management

  • Accessing parameters for debugging, diagnostics, visualization, or saving is the first step to understanding how to work with custom models.
  • Secondly, we want to set them in specific ways, e.g. for initialization purposes.
    • We discuss the structure of parameter initializers.
  • Lastly, we show how this knowledge can be put to good use by building networks that share some parameters.
In [12]:
from mxnet import init, nd
from mxnet.gluon import nn

net = nn.Sequential()
net.add(nn.Dense(256, activation='relu'))
net.add(nn.Dense(10))

net.initialize()  # Use the default initialization method.

x = nd.random.uniform(shape=(2, 20))
net(x)            # Forward computation.
Out[12]:
[[ 0.00407254  0.1019081   0.02062148  0.0552136   0.07915469 -0.05606864
  -0.1041737   0.00337543 -0.06740113 -0.06313396]
 [ 0.01474816  0.0497599   0.00468814  0.0468959   0.06075    -0.07501648
  -0.07173473  0.06645283 -0.08554209 -0.16031   ]]
<NDArray 2x10 @cpu(0)>

4.2.1 Parameter Access

  • In the case of a Sequential class we can access the parameters with ease, simply by indexing each of the layers in the network.
  • The names of the parameters, such as dense17_weight, are very useful since they allow us to identify parameters uniquely even in a network of hundreds of layers with nontrivial structure.
In [13]:
print(net[0].params)
print(net[1].params)
dense17_ (
  Parameter dense17_weight (shape=(256, 20), dtype=float32)
  Parameter dense17_bias (shape=(256,), dtype=float32)
)
dense18_ (
  Parameter dense18_weight (shape=(10, 256), dtype=float32)
  Parameter dense18_bias (shape=(10,), dtype=float32)
)
In [19]:
print(net[0].weight)
print(net[0].weight.data())
print(net[1].weight.data())
Parameter dense17_weight (shape=(256, 20), dtype=float32)

[[-0.05357582 -0.00228109 -0.03202471 ... -0.06692369 -0.00955358
  -0.01753462]
 [ 0.01603388  0.02262501 -0.06019409 ... -0.03063859 -0.02505398
   0.02994981]
 [-0.06580696  0.00862081  0.0332156  ...  0.05478401 -0.06591336
  -0.06983094]
 ...
 [ 0.02946895  0.05579274  0.01646009 ...  0.04695714  0.0208929
  -0.06849758]
 [ 0.01405259 -0.02814856  0.02697545 ... -0.03466139 -0.00090686
   0.02379511]
 [-0.05085108 -0.0290781   0.04582401 ...  0.00601977 -0.00817193
   0.06228926]]
<NDArray 256x20 @cpu(0)>

[[ 0.00338574  0.04148472 -0.01888602 ... -0.06870207 -0.06303862
  -0.04540806]
 [ 0.02585206  0.05058105  0.00044364 ... -0.00163042 -0.04103333
   0.06294077]
 [ 0.04751863  0.06542363 -0.03117647 ...  0.00775644  0.01028717
   0.02544965]
 ...
 [-0.02485485  0.01089642  0.0489713  ...  0.02502301  0.03442856
  -0.03999568]
 [ 0.02737013 -0.04429683  0.03048034 ...  0.00809494  0.00763652
   0.05087072]
 [ 0.01182987 -0.06716982  0.01266196 ...  0.01583868 -0.00265694
  -0.00011061]]
<NDArray 10x256 @cpu(0)>
In [16]:
print(net[0].bias)
print(net[0].bias.data())
print(net[1].bias.data())
Parameter dense17_bias (shape=(256,), dtype=float32)

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
<NDArray 256 @cpu(0)>

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
<NDArray 10 @cpu(0)>
In [17]:
print(net[0].params['dense17_weight'])
print(net[0].params['dense17_weight'].data())
Parameter dense17_weight (shape=(256, 20), dtype=float32)

[[-0.05357582 -0.00228109 -0.03202471 ... -0.06692369 -0.00955358
  -0.01753462]
 [ 0.01603388  0.02262501 -0.06019409 ... -0.03063859 -0.02505398
   0.02994981]
 [-0.06580696  0.00862081  0.0332156  ...  0.05478401 -0.06591336
  -0.06983094]
 ...
 [ 0.02946895  0.05579274  0.01646009 ...  0.04695714  0.0208929
  -0.06849758]
 [ 0.01405259 -0.02814856  0.02697545 ... -0.03466139 -0.00090686
   0.02379511]
 [-0.05085108 -0.0290781   0.04582401 ...  0.00601977 -0.00817193
   0.06228926]]
<NDArray 256x20 @cpu(0)>
  • We can compute the gradient with respect to the parameters.
    • It has the same shape as the weight.
  • However, since we have not yet invoked backpropagation, the values are all 0 (a sketch of invoking it follows the output below).
In [18]:
net[0].weight.grad()
Out[18]:
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
<NDArray 256x20 @cpu(0)>
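  • As a sketch (not in the original cells), one forward and backward pass under autograd.record() populates the gradient buffer; its shape still matches the weight, and the entries are now generally nonzero.
In [ ]:
from mxnet import autograd

with autograd.record():
    y = net(x)
y.backward()

print(net[0].weight.grad().shape)   # (256, 20)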
  • All Parameters at Once
    • The collect_params method grabs all parameters of a network in one dictionary so that we can traverse it with ease.
      • It does so by iterating over all constituents of a block, calling collect_params on sub-blocks as needed.
In [20]:
# parameters only for the first layer 
print(net[0].collect_params())

# parameters of the entire network
print(net.collect_params())
dense17_ (
  Parameter dense17_weight (shape=(256, 20), dtype=float32)
  Parameter dense17_bias (shape=(256,), dtype=float32)
)
sequential5_ (
  Parameter dense17_weight (shape=(256, 20), dtype=float32)
  Parameter dense17_bias (shape=(256,), dtype=float32)
  Parameter dense18_weight (shape=(10, 256), dtype=float32)
  Parameter dense18_bias (shape=(10,), dtype=float32)
)
In [21]:
net.collect_params()['dense18_bias'].data()
Out[21]:
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
<NDArray 10 @cpu(0)>
  • We can use regular expressions to filter out the required parameters.
In [24]:
print(net.collect_params('.*weight'))
print(net.collect_params('.*bias'))
sequential5_ (
  Parameter dense17_weight (shape=(256, 20), dtype=float32)
  Parameter dense18_weight (shape=(10, 256), dtype=float32)
)
sequential5_ (
  Parameter dense17_bias (shape=(256,), dtype=float32)
  Parameter dense18_bias (shape=(10,), dtype=float32)
)
  • Rube Goldberg strikes again
    • Let’s see how the parameter naming conventions work if we nest multiple blocks inside each other.
In [25]:
def block1():
    net = nn.Sequential()
    net.add(nn.Dense(32, activation='relu'))
    net.add(nn.Dense(16, activation='relu'))
    return net

def block2():
    net = nn.Sequential()
    for i in range(4):
        net.add(block1())
    return net

rgnet = nn.Sequential()
rgnet.add(block2())
rgnet.add(nn.Dense(10))
rgnet.initialize()
rgnet(x)
Out[25]:
[[ 6.6884764e-09 -1.9991958e-08 -4.7974535e-09 -8.7700771e-09
  -1.6186359e-08  1.0396601e-08  1.0741704e-08  6.3689147e-09
  -1.9723858e-09  3.0433571e-09]
 [ 8.6247640e-09 -1.8395822e-08 -2.2687403e-09 -1.6464673e-08
  -2.4844146e-08  1.4356444e-08  1.6593912e-08  6.3606223e-09
  -9.6643706e-09  8.3527123e-09]]
<NDArray 2x10 @cpu(0)>
In [26]:
print(rgnet.collect_params)
print(rgnet.collect_params())
<bound method Block.collect_params of Sequential(
  (0): Sequential(
    (0): Sequential(
      (0): Dense(20 -> 32, Activation(relu))
      (1): Dense(32 -> 16, Activation(relu))
    )
    (1): Sequential(
      (0): Dense(16 -> 32, Activation(relu))
      (1): Dense(32 -> 16, Activation(relu))
    )
    (2): Sequential(
      (0): Dense(16 -> 32, Activation(relu))
      (1): Dense(32 -> 16, Activation(relu))
    )
    (3): Sequential(
      (0): Dense(16 -> 32, Activation(relu))
      (1): Dense(32 -> 16, Activation(relu))
    )
  )
  (1): Dense(16 -> 10, linear)
)>
sequential6_ (
  Parameter dense19_weight (shape=(32, 20), dtype=float32)
  Parameter dense19_bias (shape=(32,), dtype=float32)
  Parameter dense20_weight (shape=(16, 32), dtype=float32)
  Parameter dense20_bias (shape=(16,), dtype=float32)
  Parameter dense21_weight (shape=(32, 16), dtype=float32)
  Parameter dense21_bias (shape=(32,), dtype=float32)
  Parameter dense22_weight (shape=(16, 32), dtype=float32)
  Parameter dense22_bias (shape=(16,), dtype=float32)
  Parameter dense23_weight (shape=(32, 16), dtype=float32)
  Parameter dense23_bias (shape=(32,), dtype=float32)
  Parameter dense24_weight (shape=(16, 32), dtype=float32)
  Parameter dense24_bias (shape=(16,), dtype=float32)
  Parameter dense25_weight (shape=(32, 16), dtype=float32)
  Parameter dense25_bias (shape=(32,), dtype=float32)
  Parameter dense26_weight (shape=(16, 32), dtype=float32)
  Parameter dense26_bias (shape=(16,), dtype=float32)
  Parameter dense27_weight (shape=(10, 16), dtype=float32)
  Parameter dense27_bias (shape=(10,), dtype=float32)
)
In [30]:
print(rgnet[0][1][0].bias.name)
print(rgnet[0][1][0].bias.data())
dense21_bias

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0.]
<NDArray 32 @cpu(0)>

4.2.2 Parameter Initialization

  • By default, MXNet initializes the weight matrices uniformly by drawing from $U[-0.07, 0.07]$ and the bias parameters are all set to $0$.
  • MXNet’s init module provides a variety of preset initialization methods, but if we want something out of the ordinary, we need a bit of extra work.
  • Built-in Initialization
    • force_reinit ensures that the variables are initialized again, regardless of whether they were already initialized previously.
In [31]:
net.initialize(init=init.Normal(sigma=0.01), force_reinit=True)
net[0].weight.data()[0]
Out[31]:
[ 2.3467798e-02 -6.5989629e-03 -4.6144146e-04 -1.0800398e-03
 -2.5858415e-05 -6.9288602e-03  4.7301534e-03  1.6473899e-02
 -8.4304502e-03  3.8224545e-03  6.4377831e-03  9.0460032e-03
 -2.7124031e-04 -6.6581573e-03 -8.7738056e-03 -1.9149805e-03
  4.9869940e-03  1.7430604e-02 -9.3654627e-03 -1.5981171e-03]
<NDArray 20 @cpu(0)>
  • If we wanted to initialize all parameters to 1, we could do this simply by changing the initializer to Constant(1).
In [32]:
net.initialize(init=init.Constant(1), force_reinit=True)
net[0].weight.data()[0]
Out[32]:
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
<NDArray 20 @cpu(0)>
  • We initialize the second layer to a constant value of 42 and we use the Xavier initializer for the weights of the first layer.
In [35]:
net[1].initialize(init=init.Constant(42), force_reinit=True)
net[0].weight.initialize(init=init.Xavier(), force_reinit=True)
print(net[1].weight.data()[0])
print(net[0].weight.data()[0])
[42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42.
 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42.
 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42.
 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42.
 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42.
 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42.
 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42.
 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42.
 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42.
 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42.
 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42.
 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42.
 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42.
 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42.
 42. 42. 42. 42.]
<NDArray 256 @cpu(0)>

[ 0.08490172  0.13223866  0.01630534 -0.00707628 -0.03077595  0.14420772
  0.13430956  0.07363294  0.02899179 -0.13734338 -0.11237526  0.08715159
 -0.02431636  0.12052891  0.0830339   0.06951596  0.05713288 -0.06902333
  0.12277207 -0.10455534]
<NDArray 20 @cpu(0)>
  • Custom Initialization
    • Sometimes, the initialization methods we need are not provided in the init module.
    • At this point, we can implement a subclass of the Initializer class so that we can use it like any other initialization method.
    • Usually, we only need to implement the _init_weight function, which modifies the incoming NDArray in place to produce the desired initial values.
    • In the example below, we pick a decidedly bizarre and nontrivial distribution, just to prove the point.
    • We draw the coefficients from the following distribution: $$ \begin{aligned} w \sim \begin{cases} U[5, 10] & \text{ with probability } \frac{1}{4} \\ 0 & \text{ with probability } \frac{1}{2} \\ U[-10, -5] & \text{ with probability } \frac{1}{4} \end{cases} \end{aligned} $$
In [36]:
class MyInit(init.Initializer):
    def _init_weight(self, name, data):
        print('Init', name, data.shape)
        data[:] = nd.random.uniform(low=-10, high=10, shape=data.shape)
        data *= data.abs() >= 5

net.initialize(MyInit(), force_reinit=True)
net[0].weight.data()[0]
Init dense17_weight (256, 20)
Init dense18_weight (10, 256)
Out[36]:
[-9.572826   7.9411488 -7.953664   0.        -0.        -7.483777
  9.6598015 -5.8997717 -7.205085   8.736895  -0.        -0.
 -8.978939  -0.        -0.        -0.        -0.        -0.
  8.936142  -0.       ]
<NDArray 20 @cpu(0)>
  • Since data() returns an NDArray, we can access it just like any other matrix.
  • If you want to adjust parameters within an autograd scope, you need to use set_data to avoid confusing the automatic differentiation mechanics (sketched after the next cell).
In [37]:
net[0].weight.data()[:] += 1
net[0].weight.data()[0,0] = 42
net[0].weight.data()[0]
Out[37]:
[42.         8.941149  -6.953664   1.         1.        -6.483777
 10.6598015 -4.8997717 -6.205085   9.736895   1.         1.
 -7.978939   1.         1.         1.         1.         1.
  9.936142   1.       ]
<NDArray 20 @cpu(0)>
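  • Below is a sketch of the set_data call mentioned above (not in the original cells); unlike in-place indexing on data(), it updates the parameter value through the Parameter API.
In [ ]:
w = net[0].weight
w.set_data(w.data() + 1)   # add 1 to every entry of the weight
net[0].weight.data()[0]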

4.2.3 Tied Parameters

  • In some cases, we want to share model parameters across multiple layers.
  • In the following we allocate a dense layer and then use its parameters specifically to set those of another layer.
In [43]:
net = nn.Sequential()
# we need to give the shared layer a name such that we can reference its parameters
shared = nn.Dense(8, activation='relu')
net.add(
    nn.Dense(8, activation='relu'),
    shared,
    nn.Dense(8, activation='relu', params=shared.params),
    nn.Dense(10)
)
net.initialize()

x = nd.random.uniform(shape=(2, 20))
net(x)

# Check whether the parameters are the same
print(net[1].weight.data()[0] == net[2].weight.data()[0])
print(net[1].weight.data()[0])
print(net[2].weight.data()[0])

# And make sure that they're actually the same object rather than just having the same value.
net[1].weight.data()[0,0] = 100
print(net[1].weight.data()[0] == net[2].weight.data()[0])
[1. 1. 1. 1. 1. 1. 1. 1.]
<NDArray 8 @cpu(0)>

[-0.03439966 -0.05555296  0.0232332  -0.02662065  0.04434159 -0.05426525
  0.01500529 -0.06945959]
<NDArray 8 @cpu(0)>

[-0.03439966 -0.05555296  0.0232332  -0.02662065  0.04434159 -0.05426525
  0.01500529 -0.06945959]
<NDArray 8 @cpu(0)>

[1. 1. 1. 1. 1. 1. 1. 1.]
<NDArray 8 @cpu(0)>

4.3 Deferred Initialization

  • In the previous examples...
    • We defined the network architecture with no regard to the input dimensionality.
    • We added layers without regard to the output dimension of the previous layer.
    • We even ‘initialized' these parameters without knowing how many parameters there were to initialize.
  • The ability to set parameters without the need to know what the dimensionality is can greatly simplify statistical modeling.
  • In what follows, we will discuss how this works using initialization as an example.
  • After all, we cannot initialize variables that we don’t know exist.

4.3.1 Instantiating a Network

In [44]:
from mxnet import init, nd
from mxnet.gluon import nn

def getnet():
    net = nn.Sequential()
    net.add(nn.Dense(256, activation='relu'))
    net.add(nn.Dense(10))
    return net

net = getnet()
  • At this point, each layer needs weights and biases, albeit of unspecified dimensionality.
In [45]:
print(net.collect_params)
print(net.collect_params())
<bound method Block.collect_params of Sequential(
  (0): Dense(None -> 256, Activation(relu))
  (1): Dense(None -> 10, linear)
)>
sequential18_ (
  Parameter dense52_weight (shape=(256, 0), dtype=float32)
  Parameter dense52_bias (shape=(256,), dtype=float32)
  Parameter dense53_weight (shape=(10, 0), dtype=float32)
  Parameter dense53_bias (shape=(10,), dtype=float32)
)
  • Trying to access net[0].weight.data() at this point would trigger a runtime error stating that the network needs initializing before it can do anything.
In [46]:
net[0].weight.data()
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-46-59ea5453a5fc> in <module>
----> 1 net[0].weight.data()

~/anaconda3/envs/gluon/lib/python3.6/site-packages/mxnet/gluon/parameter.py in data(self, ctx)
    391         NDArray on ctx
    392         """
--> 393         return self._check_and_get(self._data, ctx)
    394 
    395     def list_data(self):

~/anaconda3/envs/gluon/lib/python3.6/site-packages/mxnet/gluon/parameter.py in _check_and_get(self, arr_list, ctx)
    187             "with Block.collect_params() instead of Block.params " \
    188             "because the later does not include Parameters of " \
--> 189             "nested child Blocks"%(self.name))
    190 
    191     def _load_init(self, data, ctx):

RuntimeError: Parameter 'dense52_weight' has not been initialized. Note that you should initialize parameters and create Trainer with Block.collect_params() instead of Block.params because the later does not include Parameters of nested child Blocks
In [47]:
net.initialize()
net.collect_params()
Out[47]:
sequential18_ (
  Parameter dense52_weight (shape=(256, 0), dtype=float32)
  Parameter dense52_bias (shape=(256,), dtype=float32)
  Parameter dense53_weight (shape=(10, 0), dtype=float32)
  Parameter dense53_bias (shape=(10,), dtype=float32)
)
  • Nothing really changed.
In [49]:
net[0].weight.data()
---------------------------------------------------------------------------
DeferredInitializationError               Traceback (most recent call last)
<ipython-input-49-59ea5453a5fc> in <module>
----> 1 net[0].weight.data()

~/anaconda3/envs/gluon/lib/python3.6/site-packages/mxnet/gluon/parameter.py in data(self, ctx)
    391         NDArray on ctx
    392         """
--> 393         return self._check_and_get(self._data, ctx)
    394 
    395     def list_data(self):

~/anaconda3/envs/gluon/lib/python3.6/site-packages/mxnet/gluon/parameter.py in _check_and_get(self, arr_list, ctx)
    181                 "Please pass one batch of data through the network before accessing Parameters. " \
    182                 "You can also avoid deferred initialization by specifying in_units, " \
--> 183                 "num_features, etc., for network layers."%(self.name))
    184         raise RuntimeError(
    185             "Parameter '%s' has not been initialized. Note that " \

DeferredInitializationError: Parameter 'dense52_weight' has not been initialized yet because initialization was deferred. Actual initialization happens during the first forward pass. Please pass one batch of data through the network before accessing Parameters. You can also avoid deferred initialization by specifying in_units, num_features, etc., for network layers.
  • Only once we provide the network with some data do we see a difference.
In [51]:
x = nd.random.uniform(shape=(2, 20))
net(x) # Forward computation.

net.collect_params()
Out[51]:
sequential18_ (
  Parameter dense52_weight (shape=(256, 20), dtype=float32)
  Parameter dense52_bias (shape=(256,), dtype=float32)
  Parameter dense53_weight (shape=(10, 256), dtype=float32)
  Parameter dense53_bias (shape=(10,), dtype=float32)
)
In [52]:
net[0].weight.data()
Out[52]:
[[-0.05247737 -0.01900016  0.06498937 ...  0.02672191 -0.02730501
   0.03611466]
 [ 0.0618015   0.03916474 -0.05941451 ...  0.04577643 -0.0453134
  -0.04038748]
 [ 0.06184389  0.04633274  0.03094608 ...  0.00510379  0.05605743
  -0.05085221]
 ...
 [-0.06550431  0.04614966  0.04391201 ... -0.01563684  0.04479967
   0.06039421]
 [-0.06207634  0.00493836 -0.0689486  ...  0.02575751 -0.05235828
   0.05903549]
 [-0.01011717  0.01382479  0.02665275 ... -0.05540304 -0.02307985
   0.00403536]]
<NDArray 256x20 @cpu(0)>

4.3.2 Deferred Initialization in Practice

In [53]:
class MyInit(init.Initializer):
    def _init_weight(self, name, data):
        print('Init', name, data.shape)
        # The actual initialization logic is omitted here.

net = getnet()
net.initialize(init=MyInit())
In [54]:
x = nd.random.uniform(shape=(2, 20))
y = net(x)
Init dense54_weight (256, 20)
Init dense55_weight (10, 256)
  • When performing a forward calculation based on the input x, the system can automatically infer the shape of the weight parameters of all layers based on the shape of the input.
  • Once the system has created these parameters, it calls the MyInit instance to initialize them before proceeding to the forward calculation.
  • This initialization is only performed during the first forward calculation.
  • After that, running the forward calculation net(x) again will not re-initialize the parameters.
In [55]:
y = net(x)

4.3.3 Forced Initialization

  • Deferred initialization does not occur if the system knows the shape of all parameters when calling the initialize function. This can occur in two cases:
    • We’ve already seen some data and we just want to reset the parameters.
    • We specified all input and output dimensions of the network when defining it.
In [56]:
net.initialize(init=MyInit(), force_reinit=True)
Init dense54_weight (256, 20)
Init dense55_weight (10, 256)
  • We specify in_units so that initialization can occur immediately, as soon as initialize is called.
In [57]:
net = nn.Sequential()
net.add(nn.Dense(256, in_units=20, activation='relu'))
net.add(nn.Dense(10, in_units=256))
net.initialize(init=MyInit())
Init dense56_weight (256, 20)
Init dense57_weight (10, 256)