Automatic Parallelism¶
MXNet automatically constructs computational graphs at the back end.
Using a computational graph, the system is aware of all the
computational dependencies, and can selectively execute multiple
non-interdependent tasks in parallel to improve computing performance.
For instance, the first example in the “Asynchronous Computing” section executes a = nd.ones((1, 2)) and b = nd.ones((1, 2)) in turn. There is no dependency between these two steps, so the system can choose to execute them in parallel.
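As a minimal illustration (reusing the example from that section), both tasks below are handed to the back-end engine immediately, and since neither depends on the other, the engine is free to schedule them concurrently. The c = a + b step is added here only to show a task that does have dependencies and therefore must wait for both inputs.

from mxnet import nd

a = nd.ones((1, 2))  # independent of b
b = nd.ones((1, 2))  # independent of a
c = a + b            # depends on both a and b, so it runs after them
c.wait_to_read()     # block the front end until c is actually computed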
Typically, a single operator will use all the computational resources on
all CPUs or a single GPU. For example, the dot
operator will use all
threads on all CPUs (even if there are multiple CPU processors on a
single machine) or a single GPU. If the computational load of each operator is large enough and multiple operators are run in parallel only on the CPU or only on a single GPU, then each operator can receive only a portion of the computational resources of that CPU or GPU. Even if these computations can be parallelized, the resulting improvement in computing performance may not be significant. In this section, our
discussion of automatic parallel computation mainly focuses on parallel
computation using both CPUs and GPUs, as well as the parallelization of
computation and communication.
First, import the packages or modules required for the experiments in this section. Note that at least one GPU is needed to run them.
In [1]:
import gluonbook as gb
import mxnet as mx
from mxnet import nd
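The timings below use the gb.Benchmark context manager from gluonbook, which simply prints how long the code in its body takes. If gluonbook is unavailable, a roughly equivalent timer can be written as follows; this is only a sketch, and the actual gluonbook implementation may differ in detail.

import time

class Benchmark():
    # Minimal timing context manager: prints the wall-clock time spent
    # inside the `with` block, prefixed by an optional label.
    def __init__(self, prefix=None):
        self.prefix = prefix + ' ' if prefix else ''

    def __enter__(self):
        self.start = time.time()

    def __exit__(self, *args):
        print('%stime: %.4f sec' % (self.prefix, time.time() - self.start))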
Parallel Computation using CPUs and GPUs¶
First, let's discuss parallel computation using CPUs and GPUs, for example, when the computation in a program occurs both on the CPU and a GPU. To begin, define the run function so that it performs 10 matrix multiplications.
In [2]:
def run(x):
    # Perform 10 matrix multiplications; there are no dependencies between them.
    return [nd.dot(x, x) for _ in range(10)]
Next, create an NDArray on both the CPU and GPU.
In [3]:
x_cpu = nd.random.uniform(shape=(2000, 2000))
x_gpu = nd.random.uniform(shape=(6000, 6000), ctx=mx.gpu(0))
Then, use the two NDArrays to run the run
function on both the CPU
and GPU and print the time required.
In [4]:
run(x_cpu)  # Warm-up begins.
run(x_gpu)
nd.waitall()  # Warm-up ends.

with gb.Benchmark('Run on CPU.'):
    run(x_cpu)
    nd.waitall()

with gb.Benchmark('Then run on GPU.'):
    run(x_gpu)
    nd.waitall()
Run on CPU. time: 1.2215 sec
Then run on GPU. time: 1.2221 sec
Next, we remove nd.waitall() between the two computing tasks run(x_cpu) and run(x_gpu), hoping that the system will automatically parallelize these two tasks.
In [5]:
with gb.Benchmark('Run on both CPU and GPU in parallel.'):
    run(x_cpu)
    run(x_gpu)
    nd.waitall()
Run on both CPU and GPU in parallel. time: 1.2225 sec
As we can see, when two computing tasks are executed together, the total execution time is less than the sum of their separate execution times. This means that MXNet can effectively automate parallel computation on CPUs and GPUs.
Parallelization of Computation and Communication¶
In computations that use both the CPU and GPU, we often need to copy data between the CPU and GPU, resulting in data communication. In the example below, we compute on the GPU and then copy the results back to the CPU. We print the GPU computation time and the communication time from the GPU to CPU.
In [6]:
def copy_to_cpu(x):
    # Copy each NDArray in the list to the CPU (main memory).
    return [y.copyto(mx.cpu()) for y in x]

with gb.Benchmark('Run on GPU.'):
    y = run(x_gpu)
    nd.waitall()

with gb.Benchmark('Then copy to CPU.'):
    copy_to_cpu(y)
    nd.waitall()
Run on GPU. time: 1.2285 sec
Then copy to CPU. time: 0.5279 sec
We remove the waitall function between computation and communication and print the total time needed to complete both tasks.
In [7]:
with gb.Benchmark('Run and copy in parallel.'):
    y = run(x_gpu)
    copy_to_cpu(y)
    nd.waitall()
Run and copy in parallel. time: 1.2752 sec
As we can see, the total time required to perform computation and
communication is less than the sum of their separate execution times. It
should be noted that this computation and communication task is
different from the parallel computation task that simultaneously used
the CPU and GPU described earlier in this section. Here, there is a dependency between computation and communication: y[i] must be computed before it can be copied to the CPU. Fortunately, the system can copy y[i-1] while computing y[i] to reduce the total running time of computation and communication.
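To make this overlap explicit, one could also issue each copy right after the corresponding multiplication, so the engine can start transferring every result as soon as it is ready. The sketch below reuses x_gpu and gb.Benchmark from above; its timing should be comparable to the combined run above, though we have not measured it here.

with gb.Benchmark('Interleave computation and copying explicitly.'):
    # Each copyto is scheduled as soon as its nd.dot result is available.
    y_cpu = [nd.dot(x_gpu, x_gpu).copyto(mx.cpu()) for _ in range(10)]
    nd.waitall()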
Summary¶
- MXNet can improve computing performance through automatic parallel computation, such as parallel computation using the CPU and GPU and the parallelization of computation and communication.
Problems¶
- 10 operations were performed in the run function defined in this section, and there are no dependencies between them. Design an experiment to see if MXNet will automatically execute them in parallel (a possible starting point is sketched after this list).
- Design computation tasks that include more complex data dependencies, and run experiments to see if MXNet can obtain the correct results and improve computing performance.
- When the computational load of an operator is small enough, parallel computation on only the CPU or a single GPU may also improve the computing performance. Design an experiment to verify this.
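As a possible starting point for the first problem (a sketch, not a full solution), one could compare forcing each of the 10 multiplications to finish before the next one starts against letting the engine schedule all of them; run and x_cpu are the objects defined earlier in this section. Keep in mind that, as noted above, the gain from parallelizing operators on a single device may be small when one operator already uses all of its resources.

with gb.Benchmark('10 multiplications, forced to run one after another.'):
    for _ in range(10):
        # wait_to_read() blocks until this result is ready, so the next
        # multiplication cannot overlap with it.
        nd.dot(x_cpu, x_cpu).wait_to_read()

with gb.Benchmark('10 multiplications, scheduled freely by the engine.'):
    run(x_cpu)
    nd.waitall()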