Automatic Gradient¶
In deep learning, we often need to find the gradient of a function. This section describes how to use the autograd package provided by MXNet to automatically find the gradient. If you are unfamiliar with the mathematical concepts (such as gradients) used in this section, you can refer to the “Mathematical Basics” section in the appendix.
In [1]:
from mxnet import autograd, nd
Simple Examples¶
Let’s look at a simple example: find the gradient of the function
\(y = 2\boldsymbol{x}^{\top}\boldsymbol{x}\) with respect to the
column vector \(\boldsymbol{x}\). First, we create the variable x and assign it an initial value.
In [2]:
x = nd.arange(4).reshape((4, 1))
x
Out[2]:
[[0.]
[1.]
[2.]
[3.]]
<NDArray 4x1 @cpu(0)>
To find the gradient of the variable x, we need to call the attach_grad function to allocate the memory needed to store the gradient.
In [3]:
x.attach_grad()
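After attach_grad is called, x carries a gradient array of the same shape as x. As a quick sanity check (a minimal sketch, not one of the original notebook cells), we can print it before any backward call:
# x.grad is allocated by attach_grad with the same shape as x and is
# initialized to zeros; it is only filled in after backward is called
print(x.grad)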
Next, we define the function with respect to the variable x. To reduce computational and memory overhead, MXNet does not record the calculations needed for gradients by default. We need to call the record function to ask MXNet to record the calculations related to the gradient.
In [4]:
with autograd.record():
    y = 2 * nd.dot(x.T, x)
Since the shape of x is (4, 1), y is a scalar. Next, we can automatically find the gradient by calling the backward function. It should be noted that if y is not a scalar, MXNet will by default first sum the elements in y to get a new scalar variable, and then find the gradient of that variable with respect to x.
In [5]:
y.backward()
The gradient of the function \(y = 2\boldsymbol{x}^{\top}\boldsymbol{x}\) with respect to \(\boldsymbol{x}\) should be \(4\boldsymbol{x}\), since \(\nabla_{\boldsymbol{x}}(\boldsymbol{x}^{\top}\boldsymbol{x}) = 2\boldsymbol{x}\). Now let’s verify that the gradient produced is correct.
In [6]:
assert (x.grad - 4 * x).norm().asscalar() == 0
x.grad
Out[6]:
[[ 0.]
[ 4.]
[ 8.]
[12.]]
<NDArray 4x1 @cpu(0)>
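To illustrate the summing behavior described above, here is a minimal sketch (not one of the original notebook cells): when y is not a scalar, calling y.backward() behaves like calling y.sum().backward().
with autograd.record():
    y = x * x  # y has shape (4, 1), so it is not a scalar
y.backward()   # equivalent to calling y.sum().backward()
x.grad         # the gradient of sum(x * x) with respect to x is 2 * x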
Training Mode and Prediction Mode¶
As you can see from the above, after calling the record function, MXNet records the calculations and computes the gradient. In addition, autograd also switches the running mode from the prediction mode to the training mode by default. This can be checked by calling the is_training function.
In [7]:
print(autograd.is_training())
with autograd.record():
    print(autograd.is_training())
False
True
In some cases, the same model behaves differently in the training and prediction modes. We will cover these differences in detail in later chapters.
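Related to these modes, here is a minimal sketch (not one of the original cells, and assuming the autograd.predict_mode context manager provided by MXNet): record switches to the training mode, but we can locally fall back to the prediction mode while still recording.
with autograd.record():
    print(autograd.is_training())      # True: record enables the training mode
    with autograd.predict_mode():
        print(autograd.is_training())  # False: predict_mode takes effect locally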
Finding the Gradient of Python Control Flow¶
One benefit of using MXNet is that even if the computational graph of the function contains Python’s control flow (such as conditional and loop control), we may still be able to find the gradient of a variable.
Consider the following program, which contains Python’s conditional and loop control. It should be emphasized that the number of iterations of the while loop and the branch taken by the if statement depend on the value of the input a.
In [8]:
def f(a):
    b = a * 2
    while b.norm().asscalar() < 1000:
        b = b * 2
    if b.sum().asscalar() > 0:
        c = b
    else:
        c = 100 * b
    return c
As previously stated, we still use the record function to record the calculation, and call the backward function to find the gradient.
In [9]:
a = nd.random.normal(shape=1)
a.attach_grad()
with autograd.record():
    c = f(a)
c.backward()
Let’s analyze the f function defined above. Given an arbitrary input a, its output must have the form f(a) = x * a, where the value of the scalar coefficient x depends on the input a. Since c = f(a), the gradient of c with respect to a is x, whose value equals c / a. We can therefore verify the correctness of the gradient of this control flow result with the following example.
In [10]:
a.grad == c / a
Out[10]:
[1.]
<NDArray 1 @cpu(0)>
Summary¶
- MXNet provides an autograd package to automate the derivation process.
- MXNet’s autograd package can be used to derive general imperative programs.
- The running modes of MXNet include the training mode and the prediction mode. We can determine the running mode by calling autograd.is_training().
Exercises¶
- In the example of finding the gradient of the control flow shown in this section, change the variable a to a random vector or matrix. The result of the calculation c is then no longer a scalar. What happens to the running result? How do we analyze it?
- Redesign an example of finding the gradient of the control flow. Run and analyze the result.