Asynchronous Computing¶
MXNet utilizes asynchronous programming to improve computing performance. Understanding how asynchronous programming works helps us to develop more efficient programs and, by proactively limiting the number of queued computations, to minimize memory overhead when memory resources are limited. First, we will import the packages and modules needed for this section's experiment.
In [1]:
from mxnet import autograd, gluon, nd
from mxnet.gluon import loss as gloss, nn
import os
import subprocess
import time
Asynchronous Programming in MXNet¶
Broadly speaking, MXNet includes a front-end that users interact with directly and a back-end that the system uses to perform the computation. For example, users can write MXNet programs in various front-end languages, such as Python, R, Scala, and C++. Regardless of the front-end programming language used, MXNet programs execute primarily in the C++ back-end. In other words, front-end MXNet programs written by users are passed on to the back-end to be computed. The back-end possesses its own threads that continuously collect and execute queued tasks.
Through the interaction between front-end and back-end threads, MXNet is able to implement asynchronous programming. Asynchronous programming means that the front-end threads continue to execute subsequent instructions without waiting for the back-end threads to return the results of the current instruction. For simplicity's sake, assume that the Python front-end thread calls the following four instructions.
In [2]:
a = nd.ones((1, 2))
b = nd.ones((1, 2))
c = a * b + 2
c
Out[2]:
[[3. 3.]]
<NDArray 1x2 @cpu(0)>
In asynchronous computing, whenever the Python front-end thread executes one of the first three statements, it simply pushes the task onto the back-end queue. Only when the last statement's result needs to be printed does the Python front-end thread wait for the C++ back-end thread to finish computing the result of the variable c. One benefit of such a design is that the Python front-end thread in this example does not need to perform actual computations. Thus, there is little impact on the program's overall performance, regardless of Python's performance. MXNet will deliver consistently high performance, regardless of the front-end language's performance, provided the C++ back-end can meet the efficiency requirements.
To further demonstrate asynchronous computation's performance, we will implement a simple timing class.
In [3]:
class Benchmark():  # This class is saved in the Gluonbook module for future reference.
    def __init__(self, prefix=None):
        self.prefix = prefix + ' ' if prefix else ''

    def __enter__(self):
        self.start = time.time()

    def __exit__(self, *args):
        print('%stime: %.4f sec' % (self.prefix, time.time() - self.start))
The following example uses timing to demonstrate the effect of asynchronous programming. As we can see, when y = nd.dot(x, x).sum() returns, it does not actually wait for the variable y to be calculated. Only when the print function needs to print the variable y must it wait for y to be calculated.
In [4]:
with Benchmark('Workloads are queued.'):
    x = nd.random.uniform(shape=(2000, 2000))
    y = nd.dot(x, x).sum()

with Benchmark('Workloads are finished.'):
    print('sum =', y)
Workloads are queued. time: 0.0004 sec
sum =
[2.0003661e+09]
<NDArray 1 @cpu(0)>
Workloads are finished. time: 0.1985 sec
In truth, whether or not the current result is already calculated in memory is irrelevant, unless we need to print or save the computation results. So long as the data is stored in NDArray and the operators provided by MXNet are used, MXNet will utilize asynchronous programming by default to attain superior computing performance.
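For instance, a chain of NDArray operations stays queued on the back-end, and the front-end only blocks once the values have to leave NDArray, such as when they are printed or saved to disk. Below is a minimal sketch reusing a, b, and c from above; the file name async_demo is our own arbitrary choice:

d = a * b + c             # queued asynchronously; inputs and outputs stay in NDArray
d = d * 2                 # still queued; the front-end returns immediately
nd.save('async_demo', d)  # saving needs the actual values, so this blocks until d is computed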
Use of the Synchronization Function to Allow the Front-End to Wait for the Computation Results¶
In addition to the print function we just introduced, there are other ways to make the front-end thread wait for the completion of the back-end computations. The wait_to_read function can be used to make the front-end wait until the NDArray's result has been fully computed before executing the following statement. Alternatively, we can use the waitall function to make the front-end wait for the completion of all previous computations. The latter is a common method used in performance testing.
Below, we use the wait_to_read function as an example. The time output includes the calculation time of y.
In [5]:
with Benchmark():
    y = nd.dot(x, x)
    y.wait_to_read()
time: 0.1248 sec
Below, we use waitall as an example. The time output includes the calculation time of both y and z.
In [6]:
with Benchmark():
    y = nd.dot(x, x)
    z = nd.dot(x, x)
    nd.waitall()
time: 0.2435 sec
Additionally, any operation that converts an NDArray into a data structure that does not support asynchronous programming will cause the front-end to wait for the computation results. For example, calling the asnumpy and asscalar functions:
In [7]:
with Benchmark():
    y = nd.dot(x, x)
    y.asnumpy()
time: 0.1310 sec
In [8]:
with Benchmark():
    y = nd.dot(x, x)
    y.norm().asscalar()
time: 0.1607 sec
The wait_to_read, waitall, asnumpy, asscalar, and print functions described above will cause the front-end to wait for the back-end computation results. Such functions are often referred to as synchronization functions.
Using Asynchronous Programming to Improve Computing Performance¶
In the following example, we use a for loop to continuously assign values to the variable y. When the synchronization function wait_to_read is used inside the for loop, every iteration blocks, so asynchronous programming is not exploited. When the synchronization function waitall is used only once, outside the for loop, the iterations run asynchronously.
In [9]:
with Benchmark('synchronous.'):
    for _ in range(1000):
        y = x + 1
        y.wait_to_read()

with Benchmark('asynchronous.'):
    for _ in range(1000):
        y = x + 1
    nd.waitall()
synchronous. time: 0.9919 sec
asynchronous. time: 0.7109 sec
We have observed that certain aspects of computing performance can be improved by making use of asynchronous programming. To explain this, we will slightly simplify the interaction between the Python front-end thread and the C++ back-end thread. In each loop, the interaction between front-end and back-end can be largely divided into three stages:

- The front-end orders the back-end to insert the calculation task y = x + 1 into the queue.
- The back-end receives the computation task from the queue and performs the actual computation.
- The back-end returns the computation result to the front-end.
Assume that the durations of these three stages are \(t_1, t_2, t_3\), respectively. If we do not use asynchronous programming, the total time taken to perform 1000 computations is approximately \(1000 (t_1 + t_2 + t_3)\). If asynchronous programming is used, the total time taken to perform 1000 computations can be reduced to \(t_1 + 1000 t_2 + t_3\) (assuming \(1000 t_2 > 999 t_1\)), since the front-end does not have to wait for the back-end to return computation results for each loop.
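As a quick numeric illustration (the stage durations below are made-up values, not measurements), suppose each dispatch takes \(t_1 = 0.5\) ms, each computation \(t_2 = 1\) ms, and each return \(t_3 = 0.5\) ms:

t1, t2, t3 = 0.0005, 0.001, 0.0005  # hypothetical stage durations in seconds
sync_total = 1000 * (t1 + t2 + t3)  # front-end waits on every loop: 2.000 sec
async_total = t1 + 1000 * t2 + t3   # back-end compute dominates: 1.001 sec
print('synchronous: %.3f sec, asynchronous: %.3f sec' % (sync_total, async_total))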
The Impact of Asynchronous Programming on Memory¶
In order to explain the impact of asynchronous programming on memory usage, recall what we learned in the previous chapters. Throughout the model training process implemented in the previous chapters, we usually evaluated things like the loss or accuracy of the model on each mini-batch. Detail-oriented readers may have noticed that such evaluations often make use of synchronization functions, such as asscalar or asnumpy. If these synchronization functions are removed, the front-end will pass a large number of mini-batch computing tasks to the back-end in a very short time, which might cause a spike in memory usage. When each mini-batch makes use of a synchronization function, the front-end passes only one mini-batch task to the back-end per iteration, which typically reduces memory use.
Because deep learning models are usually large and memory resources are usually limited, we recommend using synchronization functions for each mini-batch throughout model training, for example by using the asscalar or asnumpy functions to evaluate model performance. Similarly, we also recommend utilizing synchronization functions for each mini-batch prediction (such as directly printing out the current batch's prediction results), in order to reduce memory usage during model prediction.
Next, we will demonstrate asynchronous programming's impact on memory. We will first define a data retrieval function data_iter, which, upon being called, starts timing and regularly prints out the time taken to retrieve data batches.
In [10]:
def data_iter():
    start = time.time()
    num_batches, batch_size = 100, 1024
    for i in range(num_batches):
        X = nd.random.normal(shape=(batch_size, 512))
        y = nd.ones((batch_size,))
        yield X, y
        if (i + 1) % 50 == 0:
            print('batch %d, time %f sec' % (i + 1, time.time() - start))
The multilayer perceptron, optimization algorithm, and loss function are defined below.
In [11]:
net = nn.Sequential()
net.add(nn.Dense(2048, activation='relu'),
        nn.Dense(512, activation='relu'),
        nn.Dense(1))
net.initialize()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.005})
loss = gloss.L2Loss()
A helper function to monitor memory use is defined below. Note that this function can only be run on Linux or macOS.
In [12]:
def get_mem():
    # Query ps for this process; the 16th whitespace-separated field of the
    # output is the resident set size (RSS) in KB. Dividing by 1e3 gives MB.
    res = subprocess.check_output(['ps', 'u', '-p', str(os.getpid())])
    return int(str(res).split()[15]) / 1e3
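As a quick sanity check of this helper (the printed value is machine-dependent and only indicative):

print('current memory use: %.1f MB' % get_mem())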
Now we can begin testing. To initialize the net parameters we will try running the system once. See the section “Deferred Initialization of Model Parameters” for further discussions related to initialization.
In [13]:
for X, y in data_iter():
    break
loss(y, net(X)).wait_to_read()
When training the net model, we can naturally use the synchronization function asscalar to read out the loss of each mini-batch, which is produced in NDArray format, and to print the model loss after each iteration. Doing so increases the generation interval of each mini-batch, but keeps the memory overhead small.
In [14]:
l_sum, mem = 0, get_mem()
for X, y in data_iter():
    with autograd.record():
        l = loss(y, net(X))
    l_sum += l.mean().asscalar()  # Use of the asscalar synchronization function.
    l.backward()
    trainer.step(X.shape[0])
nd.waitall()
print('increased memory: %f MB' % (get_mem() - mem))
batch 50, time 6.061835 sec
batch 100, time 12.181416 sec
increased memory: 6.864000 MB
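asscalar is not the only possible choice here. As a hedged variant of the loop above, calling wait_to_read on the loss once per iteration would also synchronize each mini-batch without converting the loss to a Python scalar; we would expect a similarly small memory footprint:

mem = get_mem()
for X, y in data_iter():
    with autograd.record():
        l = loss(y, net(X))
    l.backward()
    trainer.step(X.shape[0])
    l.wait_to_read()  # synchronize once per mini-batch
nd.waitall()
print('increased memory: %f MB' % (get_mem() - mem))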
Even though each mini-batch's generation interval is shorter, memory usage during training may still be high if the synchronization function is removed. This is because, with the default asynchronous programming, the front-end passes all of the mini-batch computations to the back-end in a short amount of time. As a result, a large number of intermediate results cannot be released and may pile up in memory. In this experiment, we can see that all of the data (X and y) is generated in under a second. However, because the training is not fast enough to keep up, this data can only be held in memory and cannot be cleared in time, resulting in extra memory usage.
In [15]:
mem = get_mem()
for X, y in data_iter():
    with autograd.record():
        l = loss(y, net(X))
    l.backward()
    trainer.step(X.shape[0])
nd.waitall()
print('increased memory: %f MB' % (get_mem() - mem))
batch 50, time 0.073484 sec
batch 100, time 0.144795 sec
increased memory: 200.512000 MB
Summary¶
- MXNet includes the front-end used directly by users for interaction and the back-end used by the system to perform the computation.
- MXNet can improve computing performance through the use of asynchronous programming.
- We recommend using at least one synchronization function for each mini-batch training or prediction to avoid passing on too many computation tasks to the back-end in a short period of time.
Problems¶
- In the section “Using Asynchronous Programming to Improve Computing Performance”, we mentioned that using asynchronous computation can reduce the total amount of time needed to perform 1000 computations to \(t_1 + 1000 t_2 + t_3\). Why do we have to assume \(1000 t_2 > 999 t_1\) here?