Linear Regression Implementation from Scratch¶
After getting some background on linear regression, we are now ready for a hands-on implementation. While a powerful deep learning framework minimizes repetitive work, relying on it too heavily can make it hard to understand how deep learning really works. This matters in particular if we want to change things later, e.g. define our own layers, loss functions, etc. Because of this, we start by describing how to implement linear regression training using only NDArray and autograd.
Before we begin, let’s import the packages and modules required for this section’s experiment; matplotlib will be used for plotting and will be set to render figures inline in the notebook.
In [1]:
%matplotlib inline
from IPython import display
from matplotlib import pyplot as plt
from mxnet import autograd, nd
import random
Generating Data Sets¶
By constructing a simple artificial training data set, we can visually compare the differences between the parameters we have learned and the actual model parameters. Set the number of examples in the training data set to 1000 and the number of inputs (features) to 2. Using the randomly generated design matrix \(\mathbf{X}\in \mathbb{R}^{1000 \times 2}\), the actual weight \(\mathbf{w} = [2, -3.4]^\top\) and bias \(b = 4.2\) of the linear regression model, and a random noise term \(\epsilon\), we generate the labels

\[\mathbf{y} = \mathbf{X}\mathbf{w} + b + \epsilon\]

The noise term \(\epsilon\) (or rather each coordinate of it) obeys a normal distribution with a mean of 0 and a standard deviation of 0.01. To get a better idea, let us generate the dataset.
In [2]:
num_inputs = 2
num_examples = 1000
true_w = nd.array([2, -3.4])
true_b = 4.2
features = nd.random.normal(scale=1, shape=(num_examples, num_inputs))
labels = nd.dot(features, true_w) + true_b
labels += nd.random.normal(scale=0.01, shape=labels.shape)
Note that each row in features consists of a 2-dimensional data point and that each entry in labels is a 1-dimensional target value (a scalar).
In [3]:
features[0], labels[0]
Out[3]:
(
[2.2122064 0.7740038]
<NDArray 2 @cpu(0)>,
[6.000587]
<NDArray 1 @cpu(0)>)
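As a quick sanity check (an illustration, not part of the original experiment), we can strip away the noiseless part of the model and confirm that the residual has the standard deviation we asked for:

residual = labels - (nd.dot(features, true_w) + true_b)
print(residual.asnumpy().std())  # should be close to 0.01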
By generating a scatter plot of the second feature features[:, 1] against labels, we can clearly observe the linear correlation between the two.
In [4]:
def use_svg_display():
    # Displayed in vector graphics.
    display.set_matplotlib_formats('svg')

def set_figsize(figsize=(3.5, 2.5)):
    use_svg_display()
    # Set the size of the graph to be plotted.
    plt.rcParams['figure.figsize'] = figsize

set_figsize()
plt.scatter(features[:, 1].asnumpy(), labels.asnumpy(), 1);
The plotting function plt, as well as the use_svg_display and set_figsize functions, are defined in the gluonbook package. We will call gluonbook.plt directly for future plotting. To render figures as vector graphics and set their size, we only need to call gluonbook.set_figsize() before plotting, because plt is a global variable in the gluonbook package.
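For instance, assuming gluonbook is installed, the scatter plot above could later be reproduced along the following lines (a sketch, not part of this section’s code):

import gluonbook as gb

gb.set_figsize()
gb.plt.scatter(features[:, 1].asnumpy(), labels.asnumpy(), 1);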
Reading Data¶
We need to iterate over the entire data set and continuously examine mini-batches of data examples when training the model. Here we define a function whose purpose is to return the features and labels of batch_size (batch size) randomly chosen examples every time it’s called. One might wonder why we are not reading one observation at a time but rather construct an iterator that returns a few observations at a time. This has mostly to do with efficiency when optimizing. Recall that when we processed one dimension at a time the algorithm was quite slow. The same thing happens when processing single observations rather than an entire ‘batch’ of them, which can be represented as a matrix rather than just a vector. In particular, GPUs are much faster when it comes to dealing with matrices, up to an order of magnitude. This is one of the reasons why deep learning usually operates on mini-batches rather than singletons.
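To make the speed difference concrete, here is a rough, illustrative timing comparison (the array sizes are arbitrary); since NDArray evaluates lazily, we call wait_to_read() to force the computation before stopping the clock:

from time import time

a = nd.random.normal(shape=(1000, 100))
b = nd.random.normal(shape=(100,))

# One observation at a time: 1000 small dot products.
start = time()
c = nd.zeros(shape=(1000,))
for i in range(1000):
    c[i] = nd.dot(a[i], b)
c.wait_to_read()
print('one at a time: %.4f sec' % (time() - start))

# The whole batch as a single matrix-vector product.
start = time()
d = nd.dot(a, b)
d.wait_to_read()
print('whole batch: %.4f sec' % (time() - start))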
In [5]:
# This function has been saved in the gluonbook package for future use.
def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    # The examples are read at random, in no particular order.
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        j = nd.array(indices[i: min(i + batch_size, num_examples)])
        # The "take" function returns the elements at the given indices.
        yield features.take(j), labels.take(j)
Let’s read and print the first small batch of data examples. The shape of the features in each batch corresponds to the batch size and the number of input dimensions. Likewise, we obtain as many labels as requested by the batch size.
In [6]:
batch_size = 10

for X, y in data_iter(batch_size, features, labels):
    print(X, y)
    break
[[ 0.39343107 -0.46811363]
[-0.32538787 0.01583581]
[-0.27784348 0.3698966 ]
[-0.02846896 -1.1795315 ]
[ 2.3101764 0.24071898]
[ 0.33749524 -0.4193833 ]
[-1.2895633 -1.0160087 ]
[-0.6434393 0.39862037]
[-0.33623475 1.8557605 ]
[ 1.2006497 -0.6992593 ]]
<NDArray 10x2 @cpu(0)>
[ 6.579802 3.5013514 2.382981 8.146603 8.0141945 6.2993846
5.070937 1.5484271 -2.7828512 8.984203 ]
<NDArray 10 @cpu(0)>
Clearly, if we run the iterator again, we obtain a different minibatch until all the data has been exhausted (try this). Note that the iterator described above is a bit inefficient (it requires that we load all data in memory and that we perform a lot of random memory access). The built-in iterators are more efficient and they can deal with data stored on file (or being fed via a data stream).
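For instance, to see the shuffling at work without printing whole batches, one can compare the first minibatch of two fresh passes; since each pass shuffles independently, they almost surely differ:

X1, _ = next(data_iter(batch_size, features, labels))
X2, _ = next(data_iter(batch_size, features, labels))
print((X1 - X2).norm().asscalar() > 0)  # True with overwhelming probability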
Initialize Model Parameters¶
Weights are initialized as normal random numbers with a mean of 0 and a standard deviation of 0.01, and the bias \(b\) is set to zero.
In [7]:
w = nd.random.normal(scale=0.01, shape=(num_inputs, 1))
b = nd.zeros(shape=(1,))
In the succeeding cells, we’re going to update these parameters to better fit our data. This will involve taking the gradient (a multi-dimensional derivative) of some loss function with respect to the parameters. We’ll update each parameter in the direction that reduces the loss. In order for autograd to know that it needs to set up the appropriate data structures, track changes, etc., we need to attach gradients explicitly.
In [8]:
w.attach_grad()
b.attach_grad()
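To see what attach_grad buys us, here is a minimal throwaway illustration (the variable x below is a toy, unrelated to the model): after recording a computation and calling backward, the gradient lands in the .grad attribute.

x = nd.array([1.0, 2.0])
x.attach_grad()
with autograd.record():
    y = (x * x).sum()  # a toy scalar function of x
y.backward()
print(x.grad)  # [2. 4.], i.e. 2 * x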
Define the Model¶
Next we’ll want to define our model. In this case, we’ll be working with linear models, the simplest possible useful neural network. To calculate the output of the linear model, we simply take the matrix-vector product of a given input \(\mathbf{X}\) with the model’s weights \(\mathbf{w}\), and add the bias \(b\).
In [9]:
def linreg(X, w, b): # This function has been saved in the gluonbook package for future use.
return nd.dot(X, w) + b
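As a quick shape check (illustrative only): feeding a minibatch of 10 examples through the model yields one prediction per example, thanks to broadcasting of the bias b.

print(linreg(features[:10], w, b).shape)  # (10, 1)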
Define the Loss Function¶
We will use the squared loss function described in the previous section to define the linear regression loss. In the implementation, we need to reshape the true value y into the shape of the predicted value y_hat. The result returned by the following function will also have the same shape as y_hat.
In [10]:
def squared_loss(y_hat, y): # This function has been saved in the gluonbook package for future use.
return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2
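The reshape matters because of broadcasting; the following illustration (not part of the training code) shows what would go wrong without it:

y_hat = linreg(features[:3], w, b)   # shape (3, 1)
y = labels[:3]                       # shape (3,)
print(squared_loss(y_hat, y).shape)  # (3, 1), as intended
print(((y_hat - y) ** 2 / 2).shape)  # (3, 3): broadcasting gone wrong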
Define the Optimization Algorithm¶
Linear regression actually has a closed-form solution. However, most interesting models that we’ll care about cannot be solved analytically. So we’ll solve this problem by stochastic gradient descent sgd. At each step, we’ll estimate the gradient of the loss with respect to our weights, using one batch randomly drawn from our dataset. Then, we’ll update our parameters a small amount in the direction that reduces the loss. Here, the gradient calculated by the automatic differentiation module is the sum of gradients over a batch of examples. We divide it by the batch size to obtain the average. The size of the step is determined by the learning rate lr.
In [11]:
def sgd(params, lr, batch_size):  # This function has been saved in the gluonbook package for future use.
    for param in params:
        param[:] = param - lr * param.grad / batch_size
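A one-parameter toy example (illustrative only) shows the update rule in action:

p = nd.array([1.0])
p.attach_grad()
with autograd.record():
    q = 3 * p  # gradient of q with respect to p is 3
q.backward()
sgd([p], 0.1, 1)
print(p)  # 1.0 - 0.1 * 3 / 1 = [0.7]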
Training¶
In training, we will iterate over the data to improve the model parameters. In each iteration, we compute the mini-batch stochastic gradient by first calling the backward function on the currently read mini-batch of data examples (features X and labels y), and then calling the optimization algorithm sgd to update the model parameters. Since we previously set the batch size batch_size to 10, the loss l for each mini-batch has shape (10, 1).
- Initialize parameters \((\mathbf{w}, b)\)
- Repeat until done
  - Compute gradient \(\mathbf{g} \leftarrow \partial_{(\mathbf{w},b)} \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} l(\mathbf{x}^i, y^i, \mathbf{w}, b)\)
  - Update parameters \((\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \eta \mathbf{g}\)
Since nobody wants to compute gradients explicitly (this is tedious and error prone), we use automatic differentiation to compute \(\mathbf{g}\). See the section “Automatic Gradient” for more details. Since the loss l is not a scalar variable, running l.backward() first adds together the elements in l to obtain a scalar, and then calculates the gradient of that scalar with respect to the model parameters.
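A tiny illustration of this behavior (a toy example, unrelated to the model): calling backward on a vector-valued result gives the same gradient as summing it first.

v = nd.array([1.0, 2.0, 3.0])
v.attach_grad()
with autograd.record():
    u = v * v  # element-wise, so u is not a scalar
u.backward()   # behaves like u.sum().backward()
print(v.grad)  # [2. 4. 6.]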
In an epoch (a pass through the data), we will iterate through the data_iter function once, using all the examples in the training data set (assuming the number of examples is divisible by the batch size). The number of epochs num_epochs and the learning rate lr are both hyper-parameters and are set to 3 and 0.03, respectively. Unfortunately, in practice the majority of hyper-parameters require some adjustment by trial and error. For instance, the model might actually become more accurate by training longer (but this increases computational cost). Likewise, we might want to change the learning rate on the fly. We will discuss this later in the chapter on “Optimization Algorithms”.
In [12]:
lr = 0.03  # learning rate
num_epochs = 3  # number of epochs
net = linreg  # our fancy linear model
loss = squared_loss  # 0.5 * (y - y')^2

for epoch in range(num_epochs):
    # Assuming the number of examples can be divided by the batch size, all
    # the examples in the training data set are used once in one epoch
    # iteration. The features and labels of mini-batch examples are given by
    # X and y respectively.
    for X, y in data_iter(batch_size, features, labels):
        with autograd.record():
            l = loss(net(X, w, b), y)  # minibatch loss in X and y
        l.backward()  # compute gradient on l with respect to [w, b]
        sgd([w, b], lr, batch_size)  # update parameters [w, b] using their gradient
    train_l = loss(net(features, w, b), labels)
    print('epoch %d, loss %f' % (epoch + 1, train_l.mean().asnumpy()))
epoch 1, loss 0.040224
epoch 2, loss 0.000154
epoch 3, loss 0.000050
To evaluate the trained model, we can compare the actual parameters used with the parameters we have learned after the training has been completed. They are very close to each other.
In [13]:
print('Error in estimating w', true_w - w.reshape(true_w.shape))
print('Error in estimating b', true_b - b)
Error in estimating w
[ 0.00061035 -0.00046229]
<NDArray 2 @cpu(0)>
Error in estimating b
[0.00023651]
<NDArray 1 @cpu(0)>
Note that we should not take it for granted that we are able to recover the parameters accurately. This only happens for a special category of problems: strongly convex optimization problems with ‘enough’ data to ensure that the noisy samples allow us to recover the underlying dependency correctly. In most cases, neither condition holds. In fact, the parameters of a deep network are rarely the same (or even close) between two different runs, unless everything is kept identical, including the order in which the data is traversed. Nonetheless this can lead to very good solutions, mostly due to the fact that quite often there are many sets of parameters that work well.
Summary¶
We saw how a deep network can be implemented and optimized from scratch, using just NDArray and autograd, without any need for defining layers, fancy optimizers, etc. This only scratches the surface of what is possible. In the following sections, we will describe additional deep learning models based on what we have just learned, and you will learn how to implement them more concisely.
Problems¶
- What would happen if we were to initialize the weights \(\mathbf{w} = 0\)? Would the algorithm still work?
- Assume that you’re Georg Simon Ohm trying to come up with a model between voltage and current. Can you use autograd to learn the parameters of your model?
- Can you use Planck’s Law to determine the temperature of an object using spectral energy density?
- What are the problems you might encounter if you wanted to extend autograd to second derivatives? How would you fix them?
- Why is the reshape function needed in the squared_loss function?
- Experiment using different learning rates to find out how fast the loss function value drops.
- If the number of examples cannot be divided by the batch size, what happens to the data_iter function’s behavior?