Linear layers and activation functions from scratch

When approaching the study of a new subject I find it extremely useful to get my hands dirty and play around with the material I'm learning, in order to cement the knowledge I'm passively acquiring by reading or listening to lectures. In the case of deep learning, before making heavy use of the superb python libraries available, e.g. pytorch or fast.ai, I think it's critical to build a simple NN from scratch.

The required ingredients are just linear operations (essentially matrix multiplications), functional composition, and the chain rule to get the derivatives during back-propagation. None of this sounds terrible, so we just need a bit of organization to glue all the pieces together.

We take inspiration from the pytorch library and we start by building an abstract Module class.

import numpy as np
from torch import nn
import torch, math

class Module():
    """ abstract class: on call it saves the input and output, and it returns the output """
    def __call__(self, *args):
        self.args = args
        self.out = self.forward(*args)
        return self.out
    
    def forward(self, *args): raise NotImplementedError('forward must be implemented by the subclass')
    def backward(self): self.bwd(self.out, *self.args)

When called, Module stores its input and output and simply returns the output, which is computed by the forward method; forward needs to be overridden by the derived class. Another method, backward, computes the derivative of the function and stores it, thus implementing the step needed for back-propagation.

Let's now use the class Module to implement a sigmoid activation function:

sig = lambda x: 1.0/(1.0+np.exp(-x))
    
class Sigmoid(Module):
    def forward(self, inp): return sig(inp)
    def bwd(self, out, inp): inp.g = sig(inp) * (1-sig(inp)) * out.g

Here the class Sigmoid inherits from Module and we just need to specify the forward method, which simply evaluates the sigmoid function, and the bwd method, which is what backward calls. We use bwd to implement the derivative of the sigmoid $$ \sigma'(x) = \sigma(x)\left[1-\sigma(x)\right], $$ which we store in the .g attribute (which stands for gradient) of the input. Storing the gradient in the .g attribute of the input, combined with the final multiplication by out.g in the bwd method, is essentially the chain rule: the gradient at each layer is the local derivative of that layer times the gradient arriving from the layer above, which is out.g. Once computed, we store it in the gradient of inp, which is exactly the same object as the out of the previous layer, so that layer can reference it as out.g as we climb back up the hierarchy of layers.
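
To convince ourselves that the bookkeeping is right, here is a minimal sanity check (my own addition, with throwaway names xt, xa and s): we feed some numbers through our Sigmoid, pretend the upstream gradient is 1, and compare the stored gradient with what PyTorch's autograd computes for torch.sigmoid.

xt = torch.randn(5)
s = Sigmoid()
out = s(xt)
out.g = torch.ones_like(out)     # pretend the upstream gradient is 1
s.backward()                     # fills xt.g with our hand-made gradient

xa = xt.clone().requires_grad_(True)
torch.sigmoid(xa).sum().backward()
print(torch.allclose(xt.g, xa.grad, atol=1e-6))   # should print True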

Similarly, a linear layer ${\bf x}W + {\bf b}$, where $W$ is the weight matrix, ${\bf x}$ is the input vector and ${\bf b}$ is the bias vector, can be written as:

class Lin(Module):
    def __init__(self, w, b): self.w,self.b = w,b

    def forward(self, inp): return inp@self.w + self.b

    def bwd(self, out, inp):
        inp.g = out.g @ self.w.t()    # gradient with respect to the input
        self.w.g = inp.t() @ out.g    # gradient with respect to the weights
        self.b.g = out.g.sum(0)       # gradient with respect to the bias, summed over the batch

As before, forward implements the linear layer (@ is the matrix multiplication operator in pytorch) and bwd implements the gradients. The derivative of the matrix product ${\bf x}W$ with respect to ${\bf x}$ is just a matrix multiplication by the transpose of the matrix, $W^T$. Since the linear layer also carries the weights $W$ and the bias ${\bf b}$ as parameters that we want to learn, we need the gradients of the loss with respect to the weights and the bias as well: this is what is implemented in self.w.g and self.b.g.
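
To see where the three lines of bwd come from, write the inputs of the batch as the rows of a matrix $X$, so that the layer computes $Y = XW + {\bf b}$. If $\partial L/\partial Y$ is the gradient arriving from the layer above (out.g in the code), the chain rule gives $$ \frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y}\,W^T, \qquad \frac{\partial L}{\partial W} = X^T\,\frac{\partial L}{\partial Y}, \qquad \frac{\partial L}{\partial {\bf b}} = \sum_i \left(\frac{\partial L}{\partial Y}\right)_i, $$ where the sum in the last expression runs over the batch index, which is exactly the out.g.sum(0) in the code.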

Next, we can define the loss as a class derived from Module:

class Mse(Module):
    def forward (self, inp, target): return (inp.squeeze(-1) - target).pow(2).mean()
    def bwd(self, out, inp, target): inp.g = 2*(inp.squeeze(-1)-target).unsqueeze(-1) / target.shape[0]

This is a mean squared error loss function, $L({\bf y},{\bf y}_{\rm target}) = \frac{1}{N}\sum_{i=1}^{N} (y_i-y_{i,\rm target})^2$, where the forward and bwd methods have the same meaning as above and $N$ is the number of samples. The bwd method stores the derivative of the loss, $\partial L/\partial y_i = \frac{2}{N}(y_i-y_{i,\rm target})$, in inp.g. Notice that here bwd does not multiply by out.g, because the loss is the final node of our NN, so there is no upstream gradient (equivalently, it is equal to 1).

Finally, we can bundle everything together in a Model class. It takes as input a list of layers and implements a forward method, which passes the input through each layer in turn and then evaluates the loss, and a backward method, which calls the backward of each layer in reverse order, starting from the loss.

class Model():
    def __init__(self, layers):
        self.layers = layers
        self.loss = Mse()
        
    def __call__(self, x, target): return self.forward(x, target)
        
    def forward(self, x, target):
        for l in self.layers: x = l(x)
        return self.loss(x, target)
    
    def backward(self):
        self.loss.backward()
        for l in reversed(self.layers): l.backward()

Let's now generate some fake data and randomly initialize the weights and biases (using standard Xavier initialization, so that the outputs of the layers still have roughly zero mean and unit variance):

n,m = 200,1

x = torch.randn(n,m)
y = x.pow(2).squeeze(-1)  # 1D target, matching the squeeze in Mse

nh = 100
# standard xavier init
w1 = torch.randn(m,nh)/math.sqrt(m)
b1 = torch.zeros(nh)
w2 = torch.randn(nh,1)/math.sqrt(nh)
b2 = torch.zeros(1)
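
As a quick check of what the initialization buys us (this snippet is my own addition), the pre-activations of the first layer should indeed come out roughly normalized:

h = x @ w1 + b1
print(h.mean(), h.std())   # expected: roughly 0 and 1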

We can now define a model as a sequence of linear and activation layers and we can make a forward pass to calculate the loss...

model = Model([Lin(w1,b1), Sigmoid(), Lin(w2,b2)])
loss = model(x, y)

...and also a backward pass to calculate the gradients

model.backward()
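
As a sanity check of the whole machinery (again my addition; w1a, b1a, etc. are just throwaway clones), we can redo the same computation with PyTorch's autograd and compare the gradients stored in the .g attributes with the .grad attributes filled by autograd:

w1a = w1.clone().requires_grad_(True)
b1a = b1.clone().requires_grad_(True)
w2a = w2.clone().requires_grad_(True)
b2a = b2.clone().requires_grad_(True)

out = torch.sigmoid(x @ w1a + b1a) @ w2a + b2a
loss_auto = (out.squeeze(-1) - y).pow(2).mean()
loss_auto.backward()

for ours, theirs in [(w1.g, w1a.grad), (b1.g, b1a.grad),
                     (w2.g, w2a.grad), (b2.g, b2a.grad)]:
    print(torch.allclose(ours, theirs, atol=1e-5))   # should all print True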

The architecture above is basically equivalent to an nn.Sequential model:

nn.Sequential(nn.Linear(m,nh), nn.Sigmoid(), nn.Linear(nh,1))
Sequential(
  (0): Linear(in_features=1, out_features=100, bias=True)
  (1): Sigmoid()
  (2): Linear(in_features=100, out_features=1, bias=True)
)
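
To check the equivalence explicitly (my addition; seq is just a throwaway name), we can copy our parameters into the nn.Sequential model, keeping in mind that nn.Linear stores its weight matrix transposed, and verify that the two forward passes coincide:

seq = nn.Sequential(nn.Linear(m, nh), nn.Sigmoid(), nn.Linear(nh, 1))
with torch.no_grad():
    seq[0].weight.copy_(w1.t())   # nn.Linear keeps weights as (out_features, in_features)
    seq[0].bias.copy_(b1)
    seq[2].weight.copy_(w2.t())
    seq[2].bias.copy_(b2)

out_scratch = x
for layer in [Lin(w1,b1), Sigmoid(), Lin(w2,b2)]:
    out_scratch = layer(out_scratch)

print(torch.allclose(seq(x), out_scratch, atol=1e-6))   # should print True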