GPUs¶
In the introduction to this book we discussed the rapid growth of computation over the past two decades. In a nutshell, GPU performance has increased by a factor of 1000 every decade since 2000. This offers great opportunities, but it also suggests a significant need to provide such performance.
Decade | Dataset (number of examples) | Memory | Floating Point Calculations per Second
---|---|---|---
1970 | 100 (Iris) | 1 KB | 100 KF (Intel 8080) |
1980 | 1 K (House prices in Boston) | 100 KB | 1 MF (Intel 80186) |
1990 | 10 K (optical character recognition) | 10 MB | 10 MF (Intel 80486) |
2000 | 10 M (web pages) | 100 MB | 1 GF (Intel Core) |
2010 | 10 G (advertising) | 1 GB | 1 TF (Nvidia C2050) |
2020 | 1 T (social network) | 100 GB | 1 PF (Nvidia DGX-2) |
In this section we begin to discuss how to harness this computational performance for your research: first by using single GPUs and, at a later point, by using multiple GPUs and multiple servers (each with multiple GPUs). You might have noticed that MXNet's NDArray looks almost identical to NumPy, but there are a few crucial differences. One of the key features that differentiates MXNet from NumPy is its support for diverse hardware devices.
In MXNet, every array has a context. In fact, whenever we have displayed an NDArray so far, it added a cryptic @cpu(0) notice to the output, which has remained unexplained. As we will discover, this just indicates that the computation is being executed on the CPU. Other contexts might be various GPUs. Things can get even hairier when we deploy jobs across multiple servers. By assigning arrays to contexts intelligently, we can minimize the time spent transferring data between devices. For example, when training neural networks on a server with a GPU, we typically prefer the model's parameters to live on the GPU.
In short, for complex neural networks and large-scale data, using only
CPUs for computation may be inefficient. In this section, we will
discuss how to use a single Nvidia GPU for calculations. First, make
sure you have at least one Nvidia GPU installed. Then, download
CUDA and follow the
prompts to set the appropriate path. Once these preparations are
complete, the nvidia-smi
command can be used to view the graphics
card information.
In [1]:
!nvidia-smi
Sun Dec 16 05:03:23 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.37 Driver Version: 396.37 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla M60 Off | 00000000:00:1D.0 Off | 0 |
| N/A 52C P0 48W / 150W | 0MiB / 7618MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla M60 Off | 00000000:00:1E.0 Off | 0 |
| N/A 56C P0 44W / 150W | 0MiB / 7618MiB | 96% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Next, we need to confirm that the GPU version of MXNet is installed. If a CPU version of MXNet is already installed, we need to uninstall it first, for example with the pip uninstall mxnet command, and then install the MXNet build that matches the installed CUDA version. Assuming you have CUDA 9.0 installed, you can install the MXNet version that supports CUDA 9.0 via pip install mxnet-cu90. To run the programs in this section, you need at least two GPUs.
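For example, switching from the CPU build to the CUDA 9.0 build might look as follows (a sketch only; adjust the package suffix to your installed CUDA version, e.g. mxnet-cu92 for CUDA 9.2):

!pip uninstall -y mxnet     # remove the CPU-only build, if present
!pip install mxnet-cu90     # install the build compiled against CUDA 9.0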
Note that this might be extravagant for most desktop computers but it is easily available in the cloud, e.g. by using the AWS EC2 multi-GPU instances. Almost all other sections do not require multiple GPUs. Instead, this is simply to illustrate how data flows between different devices.
Computing Devices¶
MXNet can specify devices, such as CPUs and GPUs, for storage and calculation. By default, MXNet creates data in main memory and then uses the CPU for calculations. In MXNet, the CPU and GPU are indicated by cpu() and gpu(). Note that mx.cpu() (or mx.cpu(i) with any integer i) refers to all physical CPUs and memory; this means that MXNet's calculations will try to use all CPU cores. In contrast, mx.gpu() represents only one graphics card and its corresponding graphics memory. If there are multiple GPUs, we use mx.gpu(i) to represent the \(i\)-th GPU (\(i\) starts from 0). Also, mx.gpu(0) and mx.gpu() are equivalent.
In [2]:
import mxnet as mx
from mxnet import nd
from mxnet.gluon import nn
mx.cpu(), mx.gpu(), mx.gpu(1)
Out[2]:
(cpu(0), gpu(0), gpu(1))
NDArray and GPUs¶
By default, NDArray objects are created on the CPU. Therefore, we will
see the @cpu(0)
identifier each time we print an NDArray.
In [3]:
x = nd.array([1, 2, 3])
x
Out[3]:
[1. 2. 3.]
<NDArray 3 @cpu(0)>
We can use the context property of NDArray to view the device where the NDArray is located. It is important to note that whenever we want to operate on multiple terms, they need to be in the same context. For instance, if we sum two variables, we need to make sure that both arguments live on the same device; otherwise MXNet would not know where to store the result, or even where to perform the computation.
In [4]:
x.context
Out[4]:
cpu(0)
Storage on the GPU¶
There are several ways to store an NDArray on the GPU. For example, we can specify a storage device with the ctx parameter when creating an NDArray. Next, we create the NDArray variable x on gpu(0). Notice that when printing x, the device information becomes @gpu(0). An NDArray created on a GPU only consumes the memory of that GPU. We can use the nvidia-smi command to view GPU memory usage. In general, we need to make sure we do not create data that exceeds the GPU memory limit.
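As a quick sanity check before allocating large arrays, recent MXNet versions (1.4 and later is an assumption here) provide helpers to query the visible GPUs and their memory:

print(mx.context.num_gpus())                  # number of GPUs MXNet can see
free, total = mx.context.gpu_memory_info(0)   # free and total memory of gpu(0), in bytes
print('gpu(0): %d MiB free of %d MiB' % (free // 1024**2, total // 1024**2))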
In [5]:
x = nd.ones((2, 3), ctx=mx.gpu())
x
Out[5]:
[[1. 1. 1.]
[1. 1. 1.]]
<NDArray 2x3 @gpu(0)>
Assuming you have at least two GPUs, the following code will create a
random array on gpu(1)
.
In [6]:
y = nd.random.uniform(shape=(2, 3), ctx=mx.gpu(1))
y
Out[6]:
[[0.59119 0.313164 0.76352036]
[0.9731786 0.35454726 0.11677533]]
<NDArray 2x3 @gpu(1)>
Copying¶
If we want to compute \(\mathbf{x} + \mathbf{y}\), we need to decide where to perform this operation. For instance, we can transfer \(\mathbf{x}\) to gpu(1) and perform the operation there. Do not simply add x + y, since this will result in an exception: the runtime engine cannot find data on the same device, so it would not know where to perform the computation or where to store the result, and it fails.
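To see this for yourself, you can wrap the failing addition in a try/except block (a small sketch; the exact error type and message may vary across MXNet versions):

try:
    (x + y).wait_to_read()   # x lives on gpu(0), y on gpu(1): the engine cannot place the result
except mx.MXNetError:
    print('cross-device add failed: operands must live on the same context')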
(Figure: copyto copies arrays to the target device.)
copyto
copies the data to another device such that we can add them.
Since \(\mathbf{y}\) lives on the second GPU we need to move
\(\mathbf{x}\) there before we can add the two.
In [7]:
z = x.copyto(mx.gpu(1))
print(x)
print(z)
[[1. 1. 1.]
[1. 1. 1.]]
<NDArray 2x3 @gpu(0)>
[[1. 1. 1.]
[1. 1. 1.]]
<NDArray 2x3 @gpu(1)>
Now that the data is on the same GPU (both \(\mathbf{z}\) and
\(\mathbf{y}\) are), we can add them up. In such cases MXNet places
the result on the same device as its constituents. In our case that is
@gpu(1)
.
In [8]:
y + z
Out[8]:
[[1.59119 1.313164 1.7635204]
[1.9731786 1.3545473 1.1167753]]
<NDArray 2x3 @gpu(1)>
Imagine that your variable z already lives on your second GPU (gpu(1)). What happens if we call z.copyto(gpu(1))? It will make a copy and allocate new memory, even though the variable already lives on the desired device. Depending on the environment our code is running in, two variables may already live on the same device, so we only want to make a copy if they currently live in different contexts. In these cases, we can call as_in_context(). If the variable already lives in the specified context, this is a no-op. In fact, unless you specifically want to make a copy, as_in_context() is the method of choice.
In [9]:
z = x.as_in_context(mx.gpu(1))
z
Out[9]:
[[1. 1. 1.]
[1. 1. 1.]]
<NDArray 2x3 @gpu(1)>
It is important to note that, if the context of the source variable and the target context are the same, the as_in_context function simply returns the source variable, so the target variable and the source variable share the same memory.
In [10]:
y.as_in_context(mx.gpu(1)) is y
Out[10]:
True
The copyto
function always creates new memory for the target
variable.
In [11]:
y.copyto(mx.gpu()) is y
Out[11]:
False
Watch Out¶
People use GPUs to do machine learning because they expect them to be fast. But transferring variables between contexts is slow. So we want you to be 100% certain that you want to do something slow before we let you do it. If MXNet just did the copy automatically without crashing then you might not realize that you had written some slow code.
Also, transferring data between devices (CPU, GPUs, other machines) is something that is much slower than computation. It also makes parallelization a lot more difficult, since we have to wait for data to be sent (or rather to be received) before we can proceed with more operations. This is why copy operations should be taken with great care. As a rule of thumb, many small operations are much worse than one big operation. Moreover, several operations at a time are much better than many single operations interspersed in the code (unless you know what you’re doing). This is the case since such operations can block if one device has to wait for the other before it can do something else. It’s a bit like ordering your coffee in a queue rather than pre-ordering it by phone and finding out that it’s ready when you are.
Lastly, when we print NDArray data or convert NDArrays to NumPy format, if the data is not in main memory, MXNet will copy it to main memory first, resulting in additional transfer overhead. Even worse, it is now subject to the dreaded Global Interpreter Lock, which makes everything wait for Python to complete.
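The difference is easy to see in a small sketch (hypothetical sizes; actual timings depend on your hardware). Pulling a scalar back to Python after every step forces a synchronization and a device-to-host copy, while accumulating the log on the GPU requires only a single copy at the end:

import time

m = nd.random.uniform(shape=(100, 100), ctx=mx.gpu())

# Per-step copies: every asscalar() blocks until the GPU finishes and copies to main memory.
start = time.time()
log = []
for _ in range(1000):
    log.append(nd.norm(nd.dot(m, m)).asscalar())
print('per-step copies: %.3f sec' % (time.time() - start))

# Device-side log: keep the results on the GPU and transfer them once at the very end.
start = time.time()
log = []
for _ in range(1000):
    log.append(nd.norm(nd.dot(m, m)))      # stays on gpu(0), no synchronization here
log = nd.concat(*log, dim=0).asnumpy()     # a single copy to main memory
print('one final copy: %.3f sec' % (time.time() - start))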
Gluon and GPUs¶
Similar to NDArray, Gluon models can have a device specified through the ctx parameter during initialization. The following code initializes the model parameters on the GPU (we will see many more examples of how to run models on GPUs later on, simply because they will become somewhat more compute intensive).
In [12]:
net = nn.Sequential()
net.add(nn.Dense(1))
net.initialize(ctx=mx.gpu())
When the input is an NDArray on the GPU, Gluon will calculate the result on the same GPU.
In [13]:
net(x)
Out[13]:
[[0.04995865]
[0.04995865]]
<NDArray 2x1 @gpu(0)>
Let us confirm that the model parameters are stored on the same GPU.
In [14]:
net[0].weight.data()
Out[14]:
[[0.0068339 0.01299825 0.0301265 ]]
<NDArray 1x3 @gpu(0)>
In short, as long as all data and parameters are on the same device, we can learn models efficiently. In the following we will see several such examples.
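For instance, a single training step with the net defined above might look as follows (a minimal sketch with hypothetical random data; the forward pass, the backward pass, and the parameter update all stay on gpu(0)):

from mxnet import autograd, gluon

loss_fn = gluon.loss.L2Loss()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})

data = nd.random.uniform(shape=(10, 3), ctx=mx.gpu())    # features created directly on the GPU
label = nd.random.uniform(shape=(10, 1), ctx=mx.gpu())   # labels on the same device

with autograd.record():
    loss = loss_fn(net(data), label)   # forward pass runs on gpu(0)
loss.backward()                        # so does the backward pass
trainer.step(batch_size=10)            # and so does the parameter update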
Summary¶
- MXNet can specify devices for storage and calculation, such as CPU or GPU. By default, MXNet creates data in the main memory and then uses the CPU to calculate it.
- MXNet requires all input data for calculation to be on the same device, be it CPU or the same GPU.
- You can lose significant performance by moving data without care. A typical mistake is the following: computing the loss for every minibatch on the GPU and reporting it back to the user on the command line (or logging it in a NumPy array) triggers the global interpreter lock, which stalls all GPUs. It is much better to allocate memory for logging inside the GPU and only move larger logs.
Problems¶
- Try a larger computation task, such as the multiplication of large matrices, and see the difference in speed between the CPU and GPU. What about a task with a small amount of computation?
- How should we read and write model parameters on the GPU?
- Measure the time it takes to compute 1000 matrix-matrix multiplications of \(100 \times 100\) matrices and log \(\mathrm{tr}\,(M M^\top)\) (the squared Frobenius norm) one result at a time vs. keeping a log on the GPU and transferring only the final result.
- Measure how much time it takes to perform two matrix-matrix multiplications on two GPUs at the same time vs. in sequence on one GPU (hint - you should see almost linear scaling).
References¶
[1] CUDA download address. https://developer.nvidia.com/cuda-downloads