Initialization

In the Neural Networks section we played fast and loose with setting up our networks. In particular we did the following things that shouldn’t work:

  • We defined the network architecture with no regard to the input dimensionality.

  • We added layers without regard to the output dimension of the previous layer.

  • We even ‘initialized’ these parameters without knowing how many parameters we were going to initialize.

All of those things sound impossible and indeed, they are. After all, there’s no way MXNet (or any other framework for that matter) could predict what the input dimensionality of a network would be. Later on, when working with convolutional networks and images this problem will become even more pertinent, since the input dimensionality (i.e. the resolution of an image) will affect the dimensionality of subsequent layers. The ability to determine parameter dimensionality during run-time rather than at coding time greatly simplifies the process of doing deep learning.

Instantiating a Network

Let’s see what happens when we instantiate a network. We start by defining a multi-layer perceptron.

[1]:
from mxnet import init, np
from mxnet.gluon import nn


def getnet():
    net = nn.Sequential()
    net.add(nn.Dense(256, activation='relu'))
    net.add(nn.Dense(10))
    return net

net = getnet()

At this point the network doesn’t really know yet what the dimensionalities of the various parameters should be. All one could tell at this point is that each layer needs weights and bias, albeit of unspecified dimensionality. If we try accessing the parameters, that’s exactly what happens.

[2]:
print(net.collect_params())
{'0.weight': Parameter (shape=(256, -1), dtype=float32), '0.bias': Parameter (shape=(256,), dtype=float32), '1.weight': Parameter (shape=(10, -1), dtype=float32), '1.bias': Parameter (shape=(10,), dtype=float32)}

You’ll notice -1 in the weight shape of each Dense layer. This placeholder is how MXNet keeps track of unspecified dimensionality. In particular, trying to access net[0].weight.data() at this point would trigger a runtime error stating that the network needs initializing before it can do anything.

Note that if we did want to specify dimensionality, we could have done so by using the kwarg in_units, e.g. Dense(256, activation='relu', in_units=20).

Let’s see whether anything changes after we initialize the parameters:

[3]:
net.initialize()
net.collect_params()
[3]:
{'0.weight': Parameter (shape=(256, -1), dtype=float32),
 '0.bias': Parameter (shape=(256,), dtype=float32),
 '1.weight': Parameter (shape=(10, -1), dtype=float32),
 '1.bias': Parameter (shape=(10,), dtype=float32)}

As we can see, nothing really changed. Only once we provide the network with some data do we see a difference. Let’s try it out.

[4]:
x = np.random.uniform(size=(2, 20))
net(x)  # Forward computation
print(net.collect_params())
{'0.weight': Parameter (shape=(256, 20), dtype=float32), '0.bias': Parameter (shape=(256,), dtype=float32), '1.weight': Parameter (shape=(10, 256), dtype=float32), '1.bias': Parameter (shape=(10,), dtype=float32)}

We see all the dimensions have been determined and the parameters initialized. This is because shape inference and parameter initialization have been performed in a lazy manner, so they are performed only when needed. In the above case, they are performed as a prerequisite to the forward computation.

Dimensional inference works like this: as soon as we know the input dimensionality, \(\mathbf{x} \in \mathbb{R}^{20}\), it is possible to define the weight matrix for the first layer, i.e. \(\mathbf{W}_1 \in \mathbb{R}^{256 \times 20}\). With that out of the way, we can progress to the second layer, define its dimensionality to be \(10 \times 256\), and so on through the computational graph, resolving all the dimensions as they become available. Once all shapes are known, we can proceed by initializing the parameters. This is the solution to the three problems outlined above.
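
The propagation of shapes through a chain of dense layers can be sketched without any framework. The helper below is a hypothetical illustration rather than MXNet's internal implementation: the name infer_weight_shapes and its arguments are assumptions made for this example, and it only covers a plain sequence of dense layers.

```python
def infer_weight_shapes(input_dim, layer_units):
    """Resolve the weight shape of each dense layer in a simple chain.

    Hypothetical helper for illustration only; it mimics the idea of shape
    inference, not MXNet's actual implementation.
    """
    shapes = []
    in_units = input_dim  # known only once we have seen the input
    for units in layer_units:
        # A dense layer maps in_units inputs to `units` outputs, so its
        # weight matrix has shape (units, in_units).
        shapes.append((units, in_units))
        # The output of this layer is the input of the next one.
        in_units = units
    return shapes


shapes = infer_weight_shapes(input_dim=20, layer_units=[256, 10])
print(shapes)
```

Each layer only needs the output size of its predecessor, which is why a single pass over the layers is enough to resolve every shape.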

Deferred Initialization in Practice

Now that we know how it works in theory, let’s see when the initialization is actually triggered. In order to do so, we mock up an initializer which does nothing but report a debug message stating when it was invoked and with which parameters.

[5]:
class MyInit(init.Initializer):
    def _init_weight(self, name, data):
        print('Init', name, data.shape)
        # The actual initialization logic is omitted here.

net = getnet()
net.initialize(init=MyInit())

Note that, although MyInit will print information about the model parameters when it is called, the above initialize function does not print any information after it has been executed. Therefore there is no actual initialization when calling the initialize function: the initialization is deferred until forward is called for the first time. Next, we define the input and perform a forward calculation.

[6]:
x = np.random.uniform(size=(2, 20))
y = net(x)
Init weight (256, 20)
Init weight (10, 256)

At this time, information on the model parameters is printed. When performing a forward calculation based on the input x, the system can automatically infer the shape of the weight parameters of all layers based on the shape of the input. Once the system has created these parameters, it calls the MyInit instance to initialize them before proceeding to the forward calculation.

Of course, this initialization will only be called when completing the initial forward calculation. After that, we will not re-initialize when we run the forward calculation net(x), so the output of the MyInit instance will not be generated again.

[7]:
y = net(x)
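
The behavior observed above (nothing happens at initialize, the initializer runs on the first forward pass, and it never runs again afterwards) can be mimicked with a toy layer in plain Python. This is only a sketch under stated assumptions: LazyDense, init_calls and the other names are hypothetical and do not correspond to MXNet internals.

```python
import random


class LazyDense:
    """Toy dense layer whose weights are created on the first forward pass.

    Illustration only: the class and attribute names are assumptions made
    for this sketch and do not correspond to MXNet internals.
    """

    def __init__(self, units, in_units=None):
        self.units = units
        self.in_units = in_units  # None plays the role of "unknown"
        self.weight = None
        self.init_calls = 0  # counts how often the initializer ran

    def _init_weight(self):
        self.init_calls += 1
        self.weight = [[random.uniform(-0.07, 0.07)
                        for _ in range(self.in_units)]
                       for _ in range(self.units)]

    def __call__(self, x):
        if self.weight is None:
            # First forward pass: infer the input size if it is still
            # unknown, then run the initializer exactly once.
            if self.in_units is None:
                self.in_units = len(x)
            self._init_weight()
        return [sum(w * v for w, v in zip(row, x)) for row in self.weight]


layer = LazyDense(3)
assert layer.weight is None      # nothing allocated yet
y = layer([1.0, 2.0])            # triggers inference and initialization
y = layer([0.5, 0.5])            # reuses the existing weights
print(layer.in_units, layer.init_calls)
```

The counter stays at one after the second call, which corresponds to MyInit printing nothing when net(x) is run again.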

Deferred initialization can also cause confusion. Before the first forward calculation, we cannot directly manipulate the model parameters; for example, we cannot use the data and set_data functions to get and modify the parameters. Therefore, we often force initialization by sending a sample observation through the network.

Forced Initialization

Deferred initialization does not occur if the system knows the shape of all parameters when calling the initialize function. This can occur in two cases:

  • We’ve already seen some data and we just want to reset the parameters.

  • We specified all input and output dimensions of the network or layer when defining it.

The first case works just fine, as illustrated below.

[8]:
net.initialize(init=MyInit(), force_reinit=True)
Init weight (256, 20)
Init weight (10, 256)

The second case requires us to specify the remaining set of parameters when creating the layer. For instance, for dense layers we also need to specify the in_units so that initialization can occur immediately once initialize is called.

[9]:
net = nn.Sequential()
net.add(nn.Dense(256, in_units=20, activation='relu'))
net.add(nn.Dense(10, in_units=256))

net.initialize(init=MyInit())
Init weight (256, 20)
Init weight (10, 256)

Parameter Initialization

By default, MXNet initializes the weight matrices by drawing random values from a uniform distribution between \(-0.07\) and \(0.07\) (\(U[-0.07, 0.07]\)) and sets all bias parameters to \(0\). However, we often need to use other methods to initialize the weights. MXNet’s init module provides a variety of preset initialization methods, but if we want something out of the ordinary, we need a bit of extra work.

Built-in Initialization

Let’s begin with the built-in initializers. The code below initializes all parameters with Gaussian random variables.

[10]:
# force_reinit ensures that the variables are initialized again, regardless of
# whether they were already initialized previously.
net.initialize(init=init.Normal(sigma=0.01), force_reinit=True)
print(net[0].weight.data()[0])
[ 0.00267808  0.00041055 -0.00435571 -0.00582388  0.00668746 -0.00819734
 -0.0049379   0.00828357  0.00697993 -0.01395126  0.00758456 -0.00089832
  0.00114823  0.01394226 -0.01350909  0.01093321  0.010512   -0.01405186
 -0.00102349  0.00497417]

If we wanted to initialize all parameters to \(1\), we could do this simply by changing the initializer to Constant(1).

[11]:
net.initialize(init=init.Constant(1), force_reinit=True)
net[0].weight.data()[0]
[11]:
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1.])

If we want to initialize only a specific parameter in a different manner, we can simply set the initializer only for the appropriate subblock (or parameter). For instance, below we use the Xavier initializer for the weights of the first layer and initialize the weights of the second layer to a constant value of \(42\). The biases of both layers remain at \(0\).

[12]:
net[0].weight.initialize(init=init.Xavier(), force_reinit=True)
net[1].initialize(init=init.Constant(42), force_reinit=True)

# First layer
print(net[0].weight.data()[0])
print(net[0].bias.data()[0])  # initialized to 0

# Second layer
print(net[1].weight.data()[0,0])
print(net[1].bias.data()[0])  # initialized to 0
[-0.10605425 -0.04578616 -0.06133834 -0.06036376 -0.11265133 -0.07754673
 -0.06069804  0.14513943  0.08760884 -0.03520384 -0.07174218  0.13920212
  0.0969938   0.08525378 -0.05766197 -0.12251727  0.06729299 -0.08634925
 -0.08256494 -0.04841156]
0.0
42.0
0.0
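
The range of the Xavier-initialized values printed above can be checked by hand. The sketch below assumes the initializer's default settings (uniform sampling, factor_type='avg' and magnitude=3), under which weights are drawn from \(U[-c, c]\) with \(c = \sqrt{3 / ((n_\mathrm{in} + n_\mathrm{out}) / 2)}\). If other settings were used, the bound would differ.

```python
import math

fan_out, fan_in = 256, 20          # shape of net[0].weight
magnitude = 3                      # assumed default of init.Xavier
factor = (fan_in + fan_out) / 2    # assumed default factor_type='avg'
bound = math.sqrt(magnitude / factor)
print(round(bound, 4))

# First row of net[0].weight, copied from the output printed above.
row = [-0.10605425, -0.04578616, -0.06133834, -0.06036376, -0.11265133,
       -0.07754673, -0.06069804, 0.14513943, 0.08760884, -0.03520384,
       -0.07174218, 0.13920212, 0.0969938, 0.08525378, -0.05766197,
       -0.12251727, 0.06729299, -0.08634925, -0.08256494, -0.04841156]

# Every sampled weight should lie inside [-bound, bound].
print(all(abs(w) <= bound for w in row))
```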

Custom Initialization

Sometimes, the initialization methods we need are not provided in the init module. At this point, we can implement a subclass of the Initializer class so that we can use it like any other initialization method. Usually, we only need to implement the _init_weight function to suit our needs. In the example below, we pick a decidedly bizarre and nontrivial distribution, just to prove the point. We draw the coefficients from the following distribution:

\[\begin{split}\begin{aligned} w \sim \begin{cases} U[5, 10] & \text{ with probability } \frac{1}{4} \\ 0 & \text{ with probability } \frac{1}{2} \\ U[-10, -5] & \text{ with probability } \frac{1}{4} \end{cases} \end{aligned}\end{split}\]
[13]:
class MyInit(init.Initializer):
    def _init_weight(self, name, data):
        print('Init', name, data.shape)
        data[:] = np.random.uniform(low=-10, high=10, size=data.shape)
        data *= np.abs(data) >= 5

net.initialize(MyInit(), force_reinit=True)
net[0].weight.data()[0]
Init weight (256, 20)
Init weight (10, 256)
[13]:
array([ 0.       , -9.587537 ,  5.8298836, -0.       , -5.893178 ,
        5.8981   , -8.595652 ,  0.       , -0.       ,  0.       ,
       -0.       ,  0.       ,  0.       ,  5.394665 ,  9.243319 ,
       -0.       , -8.316738 ,  6.274517 , -0.       ,  5.1410913])
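
It may not be obvious that the two lines inside _init_weight produce the mixture distribution stated above. The following framework-free sketch repeats the same recipe with Python's standard library (draw from \(U[-10, 10]\), then zero out every value whose magnitude is below \(5\)) and estimates the three probabilities empirically. The function name sample_weight and the sample size are choices made for this illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible


def sample_weight():
    # Draw from U[-10, 10], then zero out values with magnitude below 5,
    # mirroring the two lines in _init_weight above.
    w = random.uniform(-10, 10)
    return w if abs(w) >= 5 else 0.0


n = 100_000
samples = [sample_weight() for _ in range(n)]
frac_zero = sum(w == 0.0 for w in samples) / n
frac_pos = sum(w >= 5 for w in samples) / n
frac_neg = sum(w <= -5 for w in samples) / n
print(frac_zero, frac_pos, frac_neg)
```

The estimates should come out close to \(\frac{1}{2}\), \(\frac{1}{4}\) and \(\frac{1}{4}\): half of the interval \([-10, 10]\) has magnitude below \(5\), and the remainder splits evenly between the two outer ranges.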

If this functionality is insufficient, we can even set parameters directly. Since data() returns an NDArray we can access it just like any other matrix. A note for advanced users - if you want to adjust parameters within an autograd scope you need to use set_data to avoid confusing the automatic differentiation mechanics.

[14]:
net[0].weight.data()[:] += 1
net[0].weight.data()[0,0] = 42
net[0].weight.data()[0]
[14]:
array([42.       , -8.587537 ,  6.8298836,  1.       , -4.893178 ,
        6.8981   , -7.5956516,  1.       ,  1.       ,  1.       ,
        1.       ,  1.       ,  1.       ,  6.394665 , 10.243319 ,
        1.       , -7.316738 ,  7.274517 ,  1.       ,  6.1410913])

Tied Parameters

In some cases, we want to share model parameters across multiple layers. For instance, when we want to find good word embeddings, we may decide to use the same parameters both for encoding and decoding of words. Let’s see how to do this elegantly. In the following we construct a dense layer and then use its parameters to set those of another layer.

[15]:
net = nn.Sequential()
# We need to give the shared layer a name such that we can reference its
# parameters.
shared = nn.Dense(8, activation='relu')
net.add(nn.Dense(8, activation='relu'),
        shared,
        nn.Dense(8, activation='relu').share_parameters(shared.params),
        nn.Dense(10))
net.initialize()

x = np.random.uniform(size=(2, 20))
net(x)

# Check whether the parameters are the same.
print(net[1].weight.data()[0] == net[2].weight.data()[0])
net[1].weight.data()[0,0] = 100
# And make sure that they're actually the same object rather than just having
# the same value.
print(net[1].weight.data()[0] == net[2].weight.data()[0])
[ True  True  True  True  True  True  True  True]
[ True  True  True  True  True  True  True  True]
/work/mxnet/python/mxnet/util.py:755: UserWarning: Parameter 'bias' is already initialized, ignoring. Set force_reinit=True to re-initialize.
  return func(*args, **kwargs)

The above example shows that the parameters of the second and third layer are tied. As Python objects, they are identical rather than just being equal. That is, by changing one of the parameters the other one changes too. What happens to the gradients is quite ingenious. Since each parameter stores its own gradient and both layers refer to the same parameter objects, the gradient contributions of the second and third hidden layers are added together in the gradient of the shared parameters (e.g. shared.weight.grad()) during backpropagation.
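
The accumulation of gradients for tied parameters can be verified on a scalar toy problem, with no framework involved. In the sketch below, two "layers" multiply by the same weight \(w\), so \(y = w \cdot (w \cdot x)\). The names forward, grad_first_use and grad_second_use are assumptions made for this illustration.

```python
def forward(w_a, w_b, x):
    # Two scalar "layers" applied in sequence: y = w_b * (w_a * x).
    return w_b * (w_a * x)


w, x = 1.5, 2.0

# Gradient of y with respect to each use of the weight, treating the two
# uses as if they were separate parameters.
grad_first_use = w * x        # dy/dw_a = w_b * x
grad_second_use = w * x       # dy/dw_b = w_a * x

# With tied parameters both uses refer to the same w, so the
# contributions add up.
tied_grad = grad_first_use + grad_second_use

# Numerical check: differentiate y = w * (w * x) directly by central
# differences, moving both uses of w together.
eps = 1e-6
numeric = (forward(w + eps, w + eps, x)
           - forward(w - eps, w - eps, x)) / (2 * eps)
print(tied_grad, numeric)
```

The summed per-use gradients agree with the derivative of \(w^2 x\), namely \(2wx\), which is what accumulating into a single shared gradient achieves.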

Conclusion

In this tutorial you learnt how to initialize a neural network, and should now understand the difference between deferred and forced initialization. Some more advanced cases you should now be aware of include custom initialization and tied parameters.