Gluon Implementation for Multi-GPU Computation¶
In Gluon, we can conveniently use data parallelism to perform multi-GPU computation. For example, we do not have to implement the helper functions that synchronize data among multiple GPUs ourselves, as we did in the “Multi-GPU Computation” section.
First, import the required packages or modules for the experiment in this section. Running the programs in this section requires at least two GPUs.
In [1]:
import gluonbook as gb
import mxnet as mx
from mxnet import autograd, gluon, init, nd
from mxnet.gluon import loss as gloss, nn, utils as gutils
import time
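As an optional sanity check, we can confirm that at least two GPUs are visible before continuing. The snippet below is only a sketch and assumes an MXNet 1.x release in which mx.context.num_gpus() is available.

# Optional check (assumes mx.context.num_gpus() exists in this MXNet version).
num_gpus = mx.context.num_gpus()
assert num_gpus >= 2, 'this section requires at least two GPUs, found %d' % num_gpus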
Initialize Model Parameters on Multiple GPUs¶
In this section, we use ResNet-18 as a sample model. Since the input images in this section are kept at their original size (not enlarged), the model constructed here differs from the ResNet-18 structure described in the “ResNet” section: it uses a smaller convolution kernel, stride, and padding at the beginning and removes the maximum pooling layer.
In [2]:
def resnet18(num_classes):  # This function is saved in the gluonbook package for future use.
    def resnet_block(num_channels, num_residuals, first_block=False):
        blk = nn.Sequential()
        for i in range(num_residuals):
            if i == 0 and not first_block:
                blk.add(gb.Residual(
                    num_channels, use_1x1conv=True, strides=2))
            else:
                blk.add(gb.Residual(num_channels))
        return blk

    net = nn.Sequential()
    # This model uses a smaller convolution kernel, stride, and padding and
    # removes the maximum pooling layer.
    net.add(nn.Conv2D(64, kernel_size=3, strides=1, padding=1),
            nn.BatchNorm(), nn.Activation('relu'))
    net.add(resnet_block(64, 2, first_block=True),
            resnet_block(128, 2),
            resnet_block(256, 2),
            resnet_block(512, 2))
    net.add(nn.GlobalAvgPool2D(), nn.Dense(num_classes))
    return net
net = resnet18(10)
Previously, we discussed how to use the initialize function's ctx parameter to initialize model parameters on a CPU or a single GPU. In fact, ctx can also accept a list of CPUs and GPUs, in which case the initialized model parameters are copied to every device in ctx.
In [3]:
ctx = [mx.gpu(0), mx.gpu(1)]
net.initialize(init=init.Normal(sigma=0.01), ctx=ctx)
Gluon provides the split_and_load function, which plays the same role as the one we implemented in the previous section: it divides a mini-batch of data instances and copies the slices to each CPU or GPU. The model computation for the data assigned to a given CPU or GPU then takes place on that same device.
In [4]:
x = nd.random.uniform(shape=(4, 1, 28, 28))
gpu_x = gutils.split_and_load(x, ctx)
net(gpu_x[0]), net(gpu_x[1])
Out[4]:
(
[[ 5.4814936e-06 -8.3371094e-07 -1.6316770e-06 -6.3674099e-07
-3.8216162e-06 -2.3514044e-06 -2.5469599e-06 -9.4784696e-08
-6.9033558e-07 2.5756231e-06]
[ 5.4710872e-06 -9.4246496e-07 -1.0494070e-06 9.8081841e-08
-3.3251815e-06 -2.4862918e-06 -3.3642798e-06 1.0455864e-07
-6.1001344e-07 2.0327841e-06]]
<NDArray 2x10 @gpu(0)>,
[[ 5.6176345e-06 -1.2837586e-06 -1.4605541e-06 1.8302967e-07
-3.5511653e-06 -2.4371013e-06 -3.5731798e-06 -3.0974860e-07
-1.1016571e-06 1.8909889e-06]
[ 5.1418697e-06 -1.3729932e-06 -1.1520088e-06 1.1507450e-07
-3.7372811e-06 -2.8289724e-06 -3.6477197e-06 1.5781629e-07
-6.0733043e-07 1.9712013e-06]]
<NDArray 2x10 @gpu(1)>)
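By default, split_and_load splits along the batch axis into equal slices. If the batch size is not divisible by the number of devices, passing even_split=False allows unequal slices; the call below is a small sketch and assumes this MXNet version supports that parameter.

# Split a batch of 5 instances across 2 GPUs; even_split=False permits unequal
# slices (assumes this MXNet version supports the even_split parameter).
x_odd = nd.random.uniform(shape=(5, 1, 28, 28))
shards = gutils.split_and_load(x_odd, ctx, even_split=False)
print([shard.shape for shard in shards])  # e.g. [(3, 1, 28, 28), (2, 1, 28, 28)]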
Now we can access the initialized model parameter values through the data method. Note that weight.data() tries to return the parameter values on the CPU by default; since we initialized the model parameters only on the two GPUs, this raises an error, so we need to specify a GPU when accessing the parameter values. As we can see, the same parameter has identical values on the different GPUs.
In [5]:
weight = net[0].params.get('weight')
try:
weight.data()
except RuntimeError:
print('not initialized on', mx.cpu())
weight.data(ctx[0])[0], weight.data(ctx[1])[0]
not initialized on cpu(0)
Out[5]:
(
[[[-0.01473444 -0.01073093 -0.01042483]
[-0.01327885 -0.01474966 -0.00524142]
[ 0.01266256 0.00895064 -0.00601594]]]
<NDArray 1x3x3 @gpu(0)>,
[[[-0.01473444 -0.01073093 -0.01042483]
[-0.01327885 -0.01474966 -0.00524142]
[ 0.01266256 0.00895064 -0.00601594]]]
<NDArray 1x3x3 @gpu(1)>)
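If we do not want to name the devices explicitly, the Parameter object also exposes convenience methods for this. The snippet below is a sketch that assumes list_ctx and list_data are available in this MXNet version.

# Inspect the contexts the parameter lives on and fetch its copies there
# (assumes Parameter.list_ctx() and Parameter.list_data() exist in this MXNet version).
print(weight.list_ctx())                          # e.g. [gpu(0), gpu(1)]
first_rows = [d[0] for d in weight.list_data()]   # the same values on every device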
Multi-GPU Model Training¶
When we use multiple GPUs to train the model, the Trainer instance automatically performs data parallelism: it divides each mini-batch of data instances, copies the slices to the individual GPUs, sums the gradients computed on each GPU, and broadcasts the result to all GPUs (a minimal sketch of this aggregation step follows below). This makes it easy to implement the training function.
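For reference, the gradient aggregation that Trainer handles for us corresponds roughly to the allreduce we wrote by hand in the previous section. The sketch below is illustrative only and is not Gluon's actual internal implementation.

# Illustrative only: sum a list of per-device NDArrays on the first device,
# then broadcast the result back to every device (as in the previous section).
def allreduce(data):
    for i in range(1, len(data)):
        data[0][:] += data[i].copyto(data[0].context)
    for i in range(1, len(data)):
        data[0].copyto(data[i])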
In [6]:
def train(num_gpus, batch_size, lr):
    train_iter, test_iter = gb.load_data_fashion_mnist(batch_size)
    ctx = [mx.gpu(i) for i in range(num_gpus)]
    print('running on:', ctx)
    net.initialize(init=init.Normal(sigma=0.01), ctx=ctx, force_reinit=True)
    trainer = gluon.Trainer(
        net.collect_params(), 'sgd', {'learning_rate': lr})
    loss = gloss.SoftmaxCrossEntropyLoss()
    for epoch in range(4):
        start = time.time()
        for X, y in train_iter:
            # Split the mini-batch and copy the slices to each GPU.
            gpu_Xs = gutils.split_and_load(X, ctx)
            gpu_ys = gutils.split_and_load(y, ctx)
            with autograd.record():
                ls = [loss(net(gpu_X), gpu_y)
                      for gpu_X, gpu_y in zip(gpu_Xs, gpu_ys)]
            for l in ls:
                l.backward()
            # Gradients are summed across GPUs, so we step with the full batch size.
            trainer.step(batch_size)
        # Wait for all asynchronous computation to finish before timing.
        nd.waitall()
        train_time = time.time() - start
        test_acc = gb.evaluate_accuracy(test_iter, net, ctx[0])
        print('epoch %d, training time: %.1f sec, test_acc %.2f' % (
            epoch + 1, train_time, test_acc))
First, use a single GPU for training.
In [7]:
train(num_gpus=1, batch_size=256, lr=0.1)
running on: [gpu(0)]
epoch 1, training time: 63.5 sec, test_acc 0.86
epoch 2, training time: 60.8 sec, test_acc 0.91
epoch 3, training time: 61.0 sec, test_acc 0.91
epoch 4, training time: 61.1 sec, test_acc 0.93
Then we try using 2 GPUs for training. Compared with the LeNet used in the previous section, the ResNet-18 computation is more complex, so the communication time is short relative to the computation time, and parallel computing therefore yields a more noticeable performance improvement for ResNet-18.
In [8]:
train(num_gpus=2, batch_size=512, lr=0.2)
running on: [gpu(0), gpu(1)]
epoch 1, training time: 32.3 sec, test_acc 0.77
epoch 2, training time: 31.5 sec, test_acc 0.86
epoch 3, training time: 31.2 sec, test_acc 0.88
epoch 4, training time: 31.3 sec, test_acc 0.89
Summary¶
- In Gluon, we can conveniently perform multi-GPU computations, such as initializing model parameters and training models on multiple GPUs.
Problems¶
- This section uses ResNet-18. Try different epochs, batch sizes, and learning rates. Use more GPUs for computation if conditions permit.
- Sometimes, different devices provide different computing power: we may want to use CPUs and GPUs at the same time, or GPUs of different models. How should we divide mini-batches among such CPUs or GPUs? One possible starting point is sketched below.
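As a hint for the last problem, one simple idea is to split each mini-batch in proportion to an estimate of each device's throughput rather than evenly. The helper below is hypothetical (it is not part of Gluon), and the weights would have to be measured for your own hardware.

# Hypothetical helper: split a batch across devices proportionally to 'weights'
# (e.g. measured images/sec per device); not part of Gluon.
def weighted_split_and_load(X, ctx, weights):
    total = float(sum(weights))
    sizes = [int(round(X.shape[0] * w / total)) for w in weights[:-1]]
    sizes.append(X.shape[0] - sum(sizes))  # make the slice sizes add up exactly
    shards, start = [], 0
    for size, device in zip(sizes, ctx):
        shards.append(X[start:start + size].as_in_context(device))
        start += size
    return shards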