Deferred Initialization¶
In the previous examples we played fast and loose with setting up our networks. In particular, we did the following things that shouldn’t work:
- We defined the network architecture with no regard to the input dimensionality.
- We added layers without regard to the output dimension of the previous layer.
- We even ‘initialized’ these parameters without knowing how many parameters there were to initialize.
All of those things sound impossible and indeed, they are. After all, there’s no way MXNet (or any other framework for that matter) could predict what the input dimensionality of a network would be. Later on, when working with convolutional networks and images, this problem will become even more pertinent, since the input dimensionality (i.e. the resolution of an image) affects the dimensionality of layers much further down the network. Hence, the ability to set parameters without needing to know, at the time of writing the code, what the dimensionality is can greatly simplify statistical modeling. In what follows, we will discuss how this works using initialization as an example. After all, we cannot initialize variables that we don’t know exist.
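To make that long-range effect concrete, here is a small illustrative sketch (an addition to the text, with layer sizes chosen purely for illustration): in a tiny convolutional network, the shape of the final dense layer’s weight is determined by the resolution of the input image, and deferred initialization resolves it only once data actually flows through.
In [ ]:
from mxnet import nd
from mxnet.gluon import nn

convnet = nn.Sequential()
convnet.add(nn.Conv2D(channels=6, kernel_size=5, activation='relu'),
            nn.MaxPool2D(pool_size=2),
            nn.Dense(10))
convnet.initialize()                              # shapes are still unknown at this point
convnet(nd.random.uniform(shape=(1, 1, 28, 28)))  # e.g. a single 28x28 greyscale image
print(convnet[2].weight.shape)                    # (10, 864): 6 channels of 12x12 remain after pooling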
Instantiating a Network¶
Let’s see what happens when we instantiate a network. We start with our trusty MLP as before.
In [1]:
from mxnet import init, nd
from mxnet.gluon import nn
def getnet():
    net = nn.Sequential()
    net.add(nn.Dense(256, activation='relu'))
    net.add(nn.Dense(10))
    return net
net = getnet()
At this point the network doesn’t yet know what the dimensionalities of the various parameters should be. All one can tell at this point is that each layer needs a weight matrix and a bias vector, albeit of unspecified dimensionality. If we try accessing the parameters, that’s exactly what happens.
In [2]:
print(net.collect_params)
print(net.collect_params())
<bound method Block.collect_params of Sequential(
(0): Dense(None -> 256, Activation(relu))
(1): Dense(None -> 10, linear)
)>
sequential0_ (
Parameter dense0_weight (shape=(256, 0), dtype=float32)
Parameter dense0_bias (shape=(256,), dtype=float32)
Parameter dense1_weight (shape=(10, 0), dtype=float32)
Parameter dense1_bias (shape=(10,), dtype=float32)
)
In particular, trying to access net[0].weight.data() at this point would trigger a runtime error stating that the network needs initializing before it can do anything, as the brief sketch below illustrates.
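Here is a minimal sketch (an addition, not part of the original notebook) that simply attempts the access and prints the resulting error; the exact exception type and message depend on the MXNet version.
In [ ]:
try:
    net[0].weight.data()
except Exception as e:  # exact error type and wording may vary across MXNet versions
    print('Accessing the parameter data failed:', e)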
Let’s see whether anything changes after we initialize the parameters:
In [3]:
net.initialize()
net.collect_params()
Out[3]:
sequential0_ (
Parameter dense0_weight (shape=(256, 0), dtype=float32)
Parameter dense0_bias (shape=(256,), dtype=float32)
Parameter dense1_weight (shape=(10, 0), dtype=float32)
Parameter dense1_bias (shape=(10,), dtype=float32)
)
As we can see, nothing really changed. Only once we provide the network with some data do we see a difference. Let’s try it out.
In [4]:
x = nd.random.uniform(shape=(2, 20))
net(x) # Forward computation.
net.collect_params()
Out[4]:
sequential0_ (
Parameter dense0_weight (shape=(256, 20), dtype=float32)
Parameter dense0_bias (shape=(256,), dtype=float32)
Parameter dense1_weight (shape=(10, 256), dtype=float32)
Parameter dense1_bias (shape=(10,), dtype=float32)
)
The main difference from before is that as soon as we know the input dimensionality, \(\mathbf{x} \in \mathbb{R}^{20}\), it is possible to define the weight matrix for the first layer, i.e. \(\mathbf{W}_1 \in \mathbb{R}^{256 \times 20}\). With that out of the way, we can progress to the second layer, define its dimensionality to be \(10 \times 256\), and so on through the computational graph, binding all the dimensions as they become available. Once this is done, we can proceed to initialize the parameters. This is the solution to the three problems outlined above.
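As a quick sanity check of the shapes just described, the following sketch (an addition, not from the original notebook) reads the per-layer weight shapes back from the parameters directly.
In [ ]:
# Both shapes are now fully bound: the first by the 20-dimensional input,
# the second by the 256 outputs of the first layer.
print(net[0].weight.shape)  # expected: (256, 20)
print(net[1].weight.shape)  # expected: (10, 256)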
Deferred Initialization in Practice¶
Now that we know how it works in theory, let’s see when the initialization is actually triggered. In order to do so, we mock up an initializer which does nothing but report a debug message stating when it was invoked and with which parameters.
In [5]:
class MyInit(init.Initializer):
    def _init_weight(self, name, data):
        print('Init', name, data.shape)
        # The actual initialization logic is omitted here.
net = getnet()
net.initialize(init=MyInit())
Note that, although MyInit will print information about the model parameters when it is called, the above initialize function does not print any information after it has been executed. Therefore no real parameter initialization takes place when the initialize function is called.
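To convince ourselves that nothing has been allocated yet, the following sketch (added here for illustration) inspects the parameters again; the weight shapes still contain a 0 for the unknown input dimension.
In [ ]:
# The shapes are still unbound, e.g. (256, 0) and (10, 0),
# so no storage has been attached to the weights yet.
print(net.collect_params())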
Next, we define the input and perform a forward calculation.
In [6]:
x = nd.random.uniform(shape=(2, 20))
y = net(x)
Init dense2_weight (256, 20)
Init dense3_weight (10, 256)
At this time, information on the model parameters is printed. When performing a forward calculation based on the input x, the system can automatically infer the shapes of the weight parameters of all layers from the shape of the input. Once the system has created these parameters, it calls the MyInit instance to initialize them before proceeding to the forward calculation.
Of course, this initialization will only be called when completing the initial forward calculation. After that, we will not re-initialize when we run the forward calculation net(x), so the output of the MyInit instance will not be generated again.
In [7]:
y = net(x)
As mentioned at the beginning of this section, deferred initialization can also cause confusion. Before the first forward calculation, we were unable to directly manipulate the model parameters; for example, we could not use the data and set_data functions to get and modify the parameters. Therefore, we often force initialization by sending a sample observation through the network.
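For completeness, here is a short sketch (an addition, assuming the forward pass above has already run) showing that data and set_data now behave as expected.
In [ ]:
w = net[0].weight.data()        # now a fully shaped (256, 20) NDArray
net[0].weight.set_data(w + 1)   # e.g. shift every weight by one, in place
print(net[0].weight.data()[0, :3])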
Forced Initialization¶
Deferred initialization does not occur if the system knows the shape of all parameters when calling the initialize function. This can occur in two cases:
- We’ve already seen some data and we just want to reset the parameters.
- We specified all input and output dimensions of the network when defining it.
The first case works just fine, as illustrated below.
In [8]:
net.initialize(init=MyInit(), force_reinit=True)
Init dense2_weight (256, 20)
Init dense3_weight (10, 256)
The second case requires us to specify the remaining set of parameters when creating the layer. For instance, for dense layers we also need to specify in_units so that initialization can occur immediately once initialize is called.
In [9]:
net = nn.Sequential()
net.add(nn.Dense(256, in_units=20, activation='relu'))
net.add(nn.Dense(10, in_units=256))
net.initialize(init=MyInit())
Init dense4_weight (256, 20)
Init dense5_weight (10, 256)
Summary¶
- Deferred initialization is a good thing. It allows Gluon to set many things automagically and it removes a great source of errors from defining novel network architectures.
- We can override this by specifying all implicitly defined variables.
- Initialization can be repeated (or forced) by setting the force_reinit=True flag.
Problems¶
- What happens if you specify only parts of the input dimensions? Do you still get immediate initialization?
- What happens if you specify mismatching dimensions?
- What would you need to do if you have input of varying dimensionality? Hint - look at parameter tying.