Building a NN from scratch in python and a bit of pytorch. This is mostly my personal re-writing of the fantastic lesson 1 of the fast.ai course, Part 2.

Lorenzo Posti
March 22, 2022

import numpy as np
from torch import tensor
from torch import nn
import torch, math
import random
%config Completer.use_jedi = False
rng = np.random.default_rng()
When approaching the study of a new subject I find it extremely useful to get my hands dirty and play around with the stuff I’m learning, in order to cement the knowledge that I’m passively acquiring by reading or listening to a lecture. In the case of deep learning, before starting to make heavy use of the superb python libraries available, e.g. pytorch or fast.ai, I think it’s critical to build a simple NN from scratch.
The bits required are just linear operations (e.g. matrix multiplications), function composition, and the chain rule to get the derivatives during back-propagation. None of this sounds terrible at all, so we just need a bit of organization to glue all the pieces together.
We take inspiration from the pytorch library and we start by building an abstract Module class.
class Module():
    """ abstract class: on call it saves the input and output, and it returns the output """
    def __call__(self, *args):
        self.args = args
        self.out = self.forward(*args)
        return self.out
    def forward(self): raise Exception('not implemented')
    def backward(self): self.bwd(self.out, *self.args)
When called, Module stores the input and the output and just returns the output, which is computed by the forward method; this needs to be overridden by the derived class. Another method, backward, will compute the derivative of the function, thus implementing the step needed for back-propagation.
Let’s now use the class Module to implement a sigmoid activation function.
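A minimal sketch of such a class looks like this:

class Sigmoid(Module):
    # forward: just the value of the sigmoid function
    def forward(self, inp): return 1. / (1. + torch.exp(-inp))
    # bwd: derivative of the sigmoid times the gradient of the output (chain rule)
    def bwd(self, out, inp): inp.g = out * (1. - out) * out.g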
Here the class Sigmoid inherits from Module and we just need to specify the forward method, which is just the value of the sigmoid function, and the bwd method, which is what gets called by backward. We use bwd to implement the derivative of the sigmoid \[
\sigma'(x) = \sigma(x)\left[1-\sigma(x)\right],
\] which we store in the .g attribute, which stands for gradient, of the input. Storing the gradient of the layer in the .g attribute of the input, combined with the final multiplication by out.g in the bwd method, is basically the chain rule: the gradient at each layer of an NN is the derivative of the layer times the gradient of its output. Once computed, we store this in the gradient of inp, which is exactly the same variable as out of the previous layer, so we can reference its gradient with out.g when climbing back through the hierarchy of layers.
Similarly, a linear layer \(W{\bf x} + b\), where \(W\) is a matrix, \({\bf x}\) is a vector and \(b\) is a bias term, can be written as a class derived from Module.
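A sketch of what this could look like (the class name Lin is my choice, and the bias gradient is summed over the batch):

class Lin(Module):
    def __init__(self, w, b): self.w, self.b = w, b
    # forward: the affine transformation of the input
    def forward(self, inp): return inp @ self.w + self.b
    def bwd(self, out, inp):
        # gradient w.r.t. the input: multiply out.g by the transposed weight matrix
        inp.g = out.g @ self.w.t()
        # gradients w.r.t. the parameters that we want to learn
        self.w.g = inp.t() @ out.g
        self.b.g = out.g.sum(0)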
As before, forward implements the linear layer (@ is the matrix multiplication operator in pytorch) and bwd implements the gradient. The derivative of a matrix multiplication \(W{\bf x}\) is just a matrix multiplication by the transpose of the matrix, \(W^T\). Since the linear layer has the weights \(W\) and the bias \(b\) as parameters that we want to learn, we also need to calculate the gradient of the output of the layer with respect to the weights and the bias. This is what is implemented in self.w.g and self.b.g.
We can then define the loss as a class derived from Module.
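For instance (assuming the prediction and the target have the same shape):

class Mse(Module):
    # forward: the mean squared error between input and target
    def forward(self, inp, target): return ((inp - target)**2).mean()
    # bwd: gradient of the MSE w.r.t. the input; no multiplication by out.g
    # since the loss is the last node of the network
    def bwd(self, out, inp, target): inp.g = 2. * (inp - target) / inp.numel()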
This is a mean squared error loss function, \(L({\bf y},{\bf y}_{\rm target}) = \frac{1}{N}\sum_i (y_i-y_{i,\rm target})^2\), where the forward and bwd methods have the same meaning as above. Notice that here the bwd method just stores the inp.g attribute and does not multiply by out.g, because this is the final layer of our NN.
Finally we can bundle everything together in a Model class, which takes as input a list of layers and implements a forward method, which maps the input through each layer sequentially, and a backward method, which goes through the gradients of each layer in reverse order.
class Model():
    def __init__(self, layers):
        self.layers = layers
        self.loss = Mse()
    def __call__(self, x, target): return self.forward(x, target)
    def forward(self, x, target):
        for l in self.layers: x = l(x)
        return self.loss(x, target)
    def backward(self):
        self.loss.backward()
        for l in reversed(self.layers): l.backward()
Let’s now take some fake data and randomly initialize the weights and biases (using standard Xavier initialization, so that the outputs of the layers still have zero mean and unit variance).
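As a sketch, with sizes that are purely my choice:

# fake dataset: n points with n_in features each, and a random target
n, n_in, n_hidden, n_out = 200, 10, 50, 1
x      = torch.randn(n, n_in)
target = torch.randn(n, n_out)

# Xavier initialization: weights scaled by 1/sqrt(fan_in)
w1 = torch.randn(n_in, n_hidden) / math.sqrt(n_in)
b1 = torch.zeros(n_hidden)
w2 = torch.randn(n_hidden, n_out) / math.sqrt(n_hidden)
b2 = torch.zeros(n_out)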
We can now define a model as a sequence of linear and activation layers and we can make a forward pass to calculate the loss…
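Something along these lines (the variable names are mine):

model = Model([Lin(w1, b1), Sigmoid(), Lin(w2, b2)])
loss  = model(x, target)   # forward pass through all the layers + the MSE loss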
…and also a backward pass to calculate the gradients
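Again, just a sketch:

model.backward()
# every tensor in the graph now carries its gradient in the .g attribute,
# e.g. w1.g, b1.g, w2.g and b2.g for the parameters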
The architecture above is basically equivalent to an nn.Sequential model.
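For reference, a roughly equivalent pytorch version (with the same layer sizes assumed above) could be:

model_pt = nn.Sequential(nn.Linear(n_in, n_hidden),
                         nn.Sigmoid(),
                         nn.Linear(n_hidden, n_out))

loss_pt = nn.MSELoss()(model_pt(x), target)   # forward pass
loss_pt.backward()                            # backward pass via pytorch's autograd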