Layers and Blocks
One of the key components that helped propel deep learning is powerful software. Much as semiconductor design progressed from specifying transistors to logical circuits to writing code, we now witness similar progress in the design of deep networks. The previous chapters have seen us move from designing single neurons to entire layers of neurons. However, even network design by layers can be tedious when we have 152 layers, as is the case in ResNet-152, which was proposed by He et al. in 2016 for computer vision problems. Such networks have a fair degree of regularity: they consist of blocks of repeated (or at least similarly designed) layers. These blocks then form the basis of more complex network designs. In short, blocks are combinations of one or more layers. This design is aided by code that generates such blocks on demand, just as a Lego factory produces bricks that can be combined into terrific artifacts.
We start with a very simple block, namely the block for a multilayer perceptron, such as the one we encountered previously. A common strategy is to construct a two-layer network as follows:
In [1]:
from mxnet import nd
from mxnet.gluon import nn

x = nd.random.uniform(shape=(2, 20))  # A minibatch of 2 examples with 20 features each

net = nn.Sequential()
net.add(nn.Dense(256, activation='relu'))  # Hidden layer with 256 units
net.add(nn.Dense(10))                      # Output layer with 10 units
net.initialize()
net(x)
Out[1]:
[[ 0.09543004 0.04614332 -0.00286654 -0.07790349 -0.05130243 0.02942037
0.08696642 -0.0190793 -0.04122177 0.05088576]
[ 0.0769287 0.03099705 0.00856576 -0.04467199 -0.06926839 0.09132434
0.06786595 -0.06187842 -0.03436673 0.04234694]]
<NDArray 2x10 @cpu(0)>
This generates a network with a hidden layer of 256 units, followed by a ReLU activation, and another layer of 10 units governing the output. In particular, we used the nn.Sequential constructor to generate an empty network into which we then inserted both layers. What exactly happens inside nn.Sequential has remained rather mysterious so far. In the following we will see that this really just constructs a block. These blocks can be combined into larger artifacts, often recursively. The diagram below shows how:
Figure: Multiple layers are combined into blocks.
In the following we will explain the various steps needed to go from defining layers to defining blocks (of one or more layers). To get started we need a bit of reasoning about software. For most intents and purposes a block behaves very much like a fancy layer. That is, it provides the following functionality:
- It needs to ingest data (the input).
- It needs to produce a meaningful output. This is typically encoded in what we will call the forward function. It allows us to invoke a block via net(X) to obtain the desired output. What happens behind the scenes is that it invokes forward to perform forward propagation.
- It needs to produce a gradient with regard to its input when invoking backward. Typically this is automatic.
- It needs to store parameters that are inherent to the block. For instance, the block above contains two dense layers, and we need a place to store their parameters.
- Obviously it also needs to initialize these parameters as needed.
A Custom Block
The nn.Block class provides the functionality required for much of what we need. It is a model constructor provided in the nn module, which we can inherit to define the model we want. The following inherits the Block class to construct the multilayer perceptron mentioned at the beginning of this section. The MLP class defined here overrides the __init__ and forward functions of the Block class. They are used to create model parameters and define forward computations, respectively. Forward computation is also known as forward propagation.
In [2]:
from mxnet import nd
from mxnet.gluon import nn
class MLP(nn.Block):
    # Declare a layer with model parameters. Here, we declare two fully
    # connected layers.
    def __init__(self, **kwargs):
        # Call the constructor of the MLP parent class Block to perform the
        # necessary initialization. In this way, other function parameters can
        # also be specified when constructing an instance, such as the model
        # parameter, params, described in the following sections.
        super(MLP, self).__init__(**kwargs)
        self.hidden = nn.Dense(256, activation='relu')  # Hidden layer
        self.output = nn.Dense(10)                      # Output layer

    # Define the forward computation of the model, that is, how to return the
    # required model output based on the input x.
    def forward(self, x):
        return self.output(self.hidden(x))
Let’s look at it a bit more closely. The forward method invokes a network simply by evaluating the hidden layer self.hidden(x) and subsequently by evaluating the output layer self.output( ... ). This is what we expect in the forward pass of this block.

In order for the block to know what it needs to evaluate, we first need to define the layers. This is what the __init__ method does. It first initializes all of the Block-related parameters and then constructs the requisite layers. This attaches the corresponding layers and the required parameters to the class. Note that there is no need to define a backpropagation method in the class. The system automatically generates the backward method needed for backpropagation by automatically finding the gradient. The same applies to the initialize method, which is generated automatically. Let’s try this out:
In [3]:
net = MLP()
net.initialize()
net(x)
Out[3]:
[[ 0.00362228 0.00633332 0.03201144 -0.01369375 0.10336449 -0.03508018
-0.00032164 -0.01676023 0.06978628 0.01303309]
[ 0.03871715 0.02608213 0.03544959 -0.02521311 0.11005433 -0.0143066
-0.03052466 -0.03852827 0.06321152 0.0038594 ]]
<NDArray 2x10 @cpu(0)>
As explained above, the Block class can be quite versatile in terms of what it does. For instance, its subclass can be a layer (such as the Dense class provided by Gluon), it can be a model (such as the MLP class we just derived), or it can be a part of a model (this is what typically happens when designing very deep networks). Throughout this chapter we will see how to use this with great flexibility.
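As a small illustration of the last point, here is a sketch (our own example; the name big_net and the layer width are arbitrary) that uses the MLP class we just defined as one component of a larger Sequential model:

big_net = nn.Sequential()
big_net.add(MLP(),         # the custom block defined above acts as a sub-model
            nn.Dense(5))   # an ordinary layer appended after it
big_net.initialize()
big_net(x)                 # produces an output of shape (2, 5)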
A Sequential Block
The Block class is a generic component describing dataflow. In fact, the Sequential class is derived from the Block class: when the forward computation of the model is a simple concatenation of computations for each layer, we can define the model in a much simpler way. The purpose of the Sequential class is to provide some useful convenience functions. In particular, the add method allows us to add concatenated Block subclass instances one by one, while the forward computation of the model simply evaluates these instances one by one, in the order in which they were added. Below, we implement a MySequential class that has the same functionality as the Sequential class. This may help you understand more clearly how the Sequential class works.
In [4]:
class MySequential(nn.Block):
    def __init__(self, **kwargs):
        super(MySequential, self).__init__(**kwargs)

    def add(self, block):
        # Here, block is an instance of a Block subclass, and we assume it has
        # a unique name. We save it in the member variable _children of the
        # Block class, whose type is OrderedDict. When the MySequential
        # instance calls the initialize function, the system automatically
        # initializes all members of _children.
        self._children[block.name] = block

    def forward(self, x):
        # OrderedDict guarantees that members will be traversed in the order
        # they were added.
        for block in self._children.values():
            x = block(x)
        return x
At its core is the add method. It adds any block to the ordered dictionary of children. These are then executed in sequence when forward propagation is invoked. Let’s see what the MLP looks like now.
In [5]:
net = MySequential()
net.add(nn.Dense(256, activation='relu'))
net.add(nn.Dense(10))
net.initialize()
net(x)
Out[5]:
[[ 0.07787765 0.00216401 0.01682201 0.03059879 -0.00702019 0.01668714
0.04822845 0.00394321 -0.09300036 -0.044943 ]
[ 0.08891079 -0.00625484 -0.01619132 0.03807178 -0.01451489 0.02006172
0.0303478 0.02463485 -0.07605445 -0.04389167]]
<NDArray 2x10 @cpu(0)>
Indeed, we can observe that the use of the MySequential class is no different from the use of the Sequential class described in the “Gluon implementation of multilayer perceptron” section.
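As a quick sanity check (our own addition, not part of the original example), we can confirm that the two Dense layers added through our add method were registered with Gluon’s parameter system, just as they would be with nn.Sequential:

# collect_params() gathers the parameters of every child stored in _children,
# so the weights and biases of both Dense layers should appear here.
print(net.collect_params())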
Blocks with Code
Although the Sequential class can make model construction easier, and you do not need to define the forward method, directly inheriting the Block class can greatly expand the flexibility of model construction. In particular, we will use Python’s control flow within the forward method. While we’re at it, we need to introduce another concept, that of the constant parameter. These are parameters that are not used when invoking backprop. This sounds very abstract but here’s what’s really going on. Assume that we have some function

\[f(\mathbf{x}, \mathbf{w}) = 3 \cdot \mathbf{w}^\top \mathbf{x}.\]

In this case 3 is a constant parameter. We could change 3 to something else, say \(c\), via

\[f(\mathbf{x}, \mathbf{w}) = c \cdot \mathbf{w}^\top \mathbf{x}.\]
Nothing has really changed, except that we can adjust the value of \(c\). It is still a constant as far as \(\mathbf{w}\) and \(\mathbf{x}\) are concerned. However, since Gluon doesn’t know about this beforehand, it’s worthwhile to give it a hand (this makes the code go faster, too, since we’re not sending the Gluon engine on a wild goose chase after a parameter that doesn’t change). get_constant is the method that can be used to accomplish this. Let’s see what this looks like in practice.
In [6]:
class FancyMLP(nn.Block):
    def __init__(self, **kwargs):
        super(FancyMLP, self).__init__(**kwargs)
        # Random weight parameters created with get_constant are not updated
        # during training (i.e. they are constant parameters).
        self.rand_weight = self.params.get_constant(
            'rand_weight', nd.random.uniform(shape=(20, 20)))
        self.dense = nn.Dense(20, activation='relu')

    def forward(self, x):
        x = self.dense(x)
        # Use the constant parameters created above, as well as the relu and
        # dot functions of NDArray.
        x = nd.relu(nd.dot(x, self.rand_weight.data()) + 1)
        # Reuse the fully connected layer. This is equivalent to sharing
        # parameters between two fully connected layers.
        x = self.dense(x)
        # For the control flow below, we call asscalar to obtain a Python
        # scalar for the comparison.
        while x.norm().asscalar() > 1:
            x /= 2
        if x.norm().asscalar() < 0.8:
            x *= 10
        return x.sum()
In this FancyMLP model, we used the constant weight rand_weight (note that it is not a model parameter), performed a matrix multiplication operation (nd.dot), and reused the same Dense layer. Note that this is very different from using two dense layers with different sets of parameters. Instead, we used the same network twice. Quite often in deep networks one also says that the parameters are tied when one wants to express that multiple parts of a network share the same parameters. Let’s see what happens if we construct it and feed data through it.
In [7]:
net = FancyMLP()
net.initialize()
net(x)
Out[7]:
[25.522684]
<NDArray 1 @cpu(0)>
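As a small aside (our own sketch, not part of FancyMLP itself), parameter tying can also be made explicit by adding the same layer instance to a network more than once; the names shared and tied_net are arbitrary:

shared = nn.Dense(8, activation='relu')
tied_net = nn.Sequential()
# The shared layer appears twice, so both positions use one set of weights.
tied_net.add(nn.Dense(8, activation='relu'),
             shared,
             shared,
             nn.Dense(10))
tied_net.initialize()
tied_net(x)
print(shared.weight.data().shape)  # a single (8, 8) weight matrix is reused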
There’s no reason why we couldn’t mix and match these ways of building a network. Obviously the example below more closely resembles a chimera, or, less charitably, a Rube Goldberg machine. That said, it combines examples for building a block from individual blocks, which, in turn, may be blocks themselves. Furthermore, we can even combine multiple strategies inside the same forward function. To demonstrate this, here’s the network.
In [8]:
class NestMLP(nn.Block):
    def __init__(self, **kwargs):
        super(NestMLP, self).__init__(**kwargs)
        self.net = nn.Sequential()
        self.net.add(nn.Dense(64, activation='relu'),
                     nn.Dense(32, activation='relu'))
        self.dense = nn.Dense(16, activation='relu')

    def forward(self, x):
        return self.dense(self.net(x))

chimera = nn.Sequential()
chimera.add(NestMLP(), nn.Dense(20), FancyMLP())
chimera.initialize()
chimera(x)
Out[8]:
[30.518448]
<NDArray 1 @cpu(0)>
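If you want to see how such a nested model is structured, you can simply print it; Gluon blocks render a readable summary of their children (the exact formatting depends on the MXNet version):

print(chimera)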
Compilation
The avid reader is probably starting to worry about the efficiency of this. After all, we have lots of dictionary lookups, code execution, and lots of other Pythonic things going on in what is supposed to be a high-performance deep learning library. The problems of Python’s Global Interpreter Lock are well known. In the context of deep learning it means that we have a super fast GPU (or several of them) which might have to wait until a puny single CPU core running Python gets a chance to tell it what to do next. This is clearly awful and there are many ways around it. The best way to speed up Python is by avoiding it altogether.
Gluon does this by allowing for hybridization. With it, the Python interpreter executes the block the first time it is invoked. The Gluon runtime records what is happening and, the next time around, it short-circuits any calls to Python. This can accelerate things considerably in some cases, but care needs to be taken with control flow. We suggest that the interested reader skip forward to the section covering hybridization and compilation after finishing the current chapter.
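Before moving on, here is a minimal sketch of how hybridization is switched on (our own example; it mirrors the two-layer network from the beginning of this section):

hybrid_net = nn.HybridSequential()
hybrid_net.add(nn.Dense(256, activation='relu'),
               nn.Dense(10))
hybrid_net.initialize()
hybrid_net.hybridize()  # subsequent calls are traced into a symbolic graph
hybrid_net(x)

Note that a block like FancyMLP, whose forward method relies on asscalar and Python control flow, cannot be hybridized in this way; the hybridization section discusses these restrictions.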
Summary
- Layers are blocks.
- Many layers can be a block.
- Many blocks can be a block.
- Code can be a block.
- Blocks take care of a lot of housekeeping, such as parameter initialization, backprop, and related issues.
- Sequential concatenations of layers and blocks are handled by the eponymous Sequential block.
Problems
- What kind of error message will you get if the parent class’s __init__ method is not called in the __init__ function of a subclass?
- What kinds of problems will occur if you remove the asscalar function in the FancyMLP class?
- What kinds of problems will occur if you change self.net, defined by the Sequential instance in the NestMLP class, to self.net = [nn.Dense(64, activation='relu'), nn.Dense(32, activation='relu')]?
- Implement a block that takes two blocks as an argument, say net1 and net2, and returns the concatenated output of both networks in the forward pass (this is also called a parallel block).
- Assume that you want to concatenate multiple instances of the same network. Implement a factory function that generates multiple instances of the same block and builds a larger network from it.