OwnFramework


Multi-Layered Perceptrons

Building our own Neural Framework

This notebook is a part of the AI for Beginners Curriculum. Visit the repository for the complete set of learning materials.

In this notebook, we will gradually build our own neural framework capable of solving multi-class classification tasks as well as regression with multi-layered perceptrons.

First, let's import some required libraries.

[14]

Sample Dataset

As before, we will start with a simple sample dataset with two parameters.

[15]
[16]
[17]
[18]
[[ 1.3382818  -0.98613256]
 [ 0.5128146   0.43299454]
 [-0.4473693  -0.2680512 ]
 [-0.9865851  -0.28692   ]
 [-1.0693829   0.41718036]]
[1 1 0 0 0]

Machine Learning Problem

Suppose we have an input dataset $\langle X,Y\rangle$, where $X$ is a set of features, and $Y$ are the corresponding labels. For a regression problem, $y_i\in\mathbb{R}$, and for classification it is represented by a class number $y_i\in\{0,\dots,n\}$.

Any machine learning model can be represented by a function $f_\theta(x)$, where $\theta$ is a set of parameters. Our goal is to find such parameters $\theta$ that our model fits the dataset in the best way. The criterion is defined by the loss function $\mathcal{L}$, and we need to find the optimal value

$$\theta = \mathrm{argmin}_\theta \mathcal{L}(f_\theta(X),Y)$$

The loss function depends on the problem being solved.

Loss functions for regression

For regression, we often use absolute error $\mathcal{L}_{abs}(\theta) = \sum_{i=1}^n |y_i - f_\theta(x_i)|$, or mean squared error: $\mathcal{L}_{sq}(\theta) = \sum_{i=1}^n (y_i - f_\theta(x_i))^2$
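As a quick sketch of the formulas above (the helper names `absolute_loss` and `squared_loss` are ours, not from the notebook), both losses take just a couple of NumPy lines:

```python
import numpy as np

# Hypothetical helpers illustrating the two regression losses.
def absolute_loss(y, fx):
    # L_abs = sum of |y_i - f(x_i)|
    return np.sum(np.abs(y - fx))

def squared_loss(y, fx):
    # L_sq = sum of (y_i - f(x_i))^2
    return np.sum((y - fx) ** 2)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])

l_abs = absolute_loss(y_true, y_pred)  # 0.5 + 0.0 + 1.0 = 1.5
l_sq = squared_loss(y_true, y_pred)    # 0.25 + 0.0 + 1.0 = 1.25
```

Note that squared error penalizes large deviations more heavily, which makes the two losses behave differently on outliers.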

[19]
[20]

Loss functions for classification

Let's consider binary classification for a moment. In this case we have two classes, numbered 0 and 1. The output of the network $f_\theta(x_i)\in [0,1]$ essentially defines the probability of choosing the class 1.

0-1 loss

0-1 loss is the same as calculating accuracy of the model - we compute the number of correct classifications:

$$\mathcal{L}_{0-1} = \sum_{i=1}^n l_i \quad\mathrm{where}\quad l_i = \begin{cases} 0 & (f(x_i)<0.5 \land y_i=0) \lor (f(x_i)\geq 0.5 \land y_i=1) \\ 1 & \mathrm{otherwise} \end{cases}$$

However, accuracy itself does not show how far we are from the right classification. It could be that we missed the correct class by just a little bit, which is in a way "better" (in the sense that we need to correct the weights much less) than missing significantly. Thus, logistic loss is more often used, as it takes this into account.

Logistic Loss

$$\mathcal{L}_{log} = \sum_{i=1}^n -y_i\log(f_\theta(x_i)) - (1-y_i)\log(1-f_\theta(x_i))$$
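A minimal NumPy sketch of this loss (the `eps` clipping is our addition, to avoid taking the log of exactly 0):

```python
import numpy as np

# Logistic loss for binary labels y in {0, 1} and predicted
# probabilities fx in [0, 1].
def logistic_loss(y, fx, eps=1e-12):
    fx = np.clip(fx, eps, 1 - eps)  # avoid log(0)
    return np.sum(-y * np.log(fx) - (1 - y) * np.log(1 - fx))

y = np.array([1, 0, 1])
fx = np.array([0.9, 0.1, 0.8])
loss = logistic_loss(y, fx)  # small, since all predictions are confident and correct
```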

[21]
[22]
C:\Users\dmitryso\AppData\Local\Temp/ipykernel_55820/331859503.py:10: RuntimeWarning: divide by zero encountered in log
  return -np.log(fx)
<IPython.core.display.Javascript object>

To understand logistic loss, consider two cases of the expected output:

  • If we expect the output to be 1 ($y=1$), then the loss is $-\log f_\theta(x_i)$. The loss is 0 if the network predicts 1 with probability 1, and grows larger as the predicted probability of 1 gets smaller.
  • If we expect the output to be 0 ($y=0$), the loss is $-\log(1-f_\theta(x_i))$. Here, $1-f_\theta(x_i)$ is the probability of 0 predicted by the network, and the meaning of the log-loss is the same as described in the previous case.

Neural Network Architecture

We have generated a dataset for binary classification problem. However, let's consider it as multi-class classification right from the start, so that we can then easily switch our code to multi-class classification. In this case, our one-layer perceptron will have the following architecture:

The two outputs of the network correspond to the two classes, and the class with the highest value among the two outputs corresponds to the right solution.

The model is defined as

$$f_\theta(x) = W\times x + b$$

where $\theta = \langle W,b\rangle$ are parameters.

We will define this linear layer as a Python class with a forward function that performs the calculation. It receives input value xx, and produces the output of the layer. Parameters W and b are stored within the layer class, and are initialized upon creation with random values and zeroes respectively.

[23]
array([[ 1.77202116, -0.25384488],
       [ 0.28370828, -0.39610552],
       [-0.30097433,  0.30513182],
       [-0.8120485 ,  0.56079421],
       [-1.23519653,  0.3394973 ]])

In many cases, it is more efficient to operate not on one input value, but on a vector of input values. Because we use NumPy operations, we can pass a vector of input values to our network, and it will give us a vector of output values.
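The linear layer described above can be sketched roughly as follows (class and method names follow the text; the initialization details are an assumption):

```python
import numpy as np

# A linear layer: W initialized with small random values, b with zeros.
class Linear:
    def __init__(self, nin, nout):
        self.W = np.random.normal(0, 1.0 / np.sqrt(nin), (nout, nin))
        self.b = np.zeros((1, nout))

    def forward(self, x):
        # x has shape (minibatch, nin); output has shape (minibatch, nout)
        return np.dot(x, self.W.T) + self.b

net = Linear(2, 2)
out = net.forward(np.random.randn(5, 2))  # 5 samples in, 5 output rows out
```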

Softmax: Turning Outputs into Probabilities

As you can see, our outputs are not probabilities - they can take any values. In order to convert them into probabilities, we need to normalize the values across all classes. This is done using the softmax function: $$\sigma(\mathbf{z}_c) = \frac{e^{z_c}}{\sum_{j} e^{z_j}}, \quad\mathrm{for}\quad c\in 1 .. |C|$$

Output of the network $\sigma(\mathbf{z})$ can be interpreted as a probability distribution on the set of classes $C$: $q = \sigma(\mathbf{z}_c) = \hat{p}(c | x)$

We will define the Softmax layer in the same manner, as a class with forward function:

[24]
array([[0.88348621, 0.11651379],
       [0.66369714, 0.33630286],
       [0.35294795, 0.64705205],
       [0.20216095, 0.79783905],
       [0.17154828, 0.82845172],
       [0.24279153, 0.75720847],
       [0.18915732, 0.81084268],
       [0.17282951, 0.82717049],
       [0.13897531, 0.86102469],
       [0.72746882, 0.27253118]])

You can see that we are now getting probabilities as outputs, i.e. the sum of each output vector is exactly 1.
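A sketch of such a Softmax layer is below; subtracting the row maximum before exponentiation is a standard numerical-stability trick (our addition) that does not change the result:

```python
import numpy as np

# Softmax as a layer with a forward function.
class Softmax:
    def forward(self, z):
        zmax = z.max(axis=1, keepdims=True)   # for numerical stability
        expz = np.exp(z - zmax)
        return expz / expz.sum(axis=1, keepdims=True)

p = Softmax().forward(np.array([[2.0, 1.0],
                                [0.0, 0.0]]))
# each row of p now sums to 1; equal inputs give equal probabilities
```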

In case we have more than 2 classes, softmax will normalize probabilities across all of them. Here is a diagram of network architecture that does MNIST digit classification:

MNIST Classifier

Cross-Entropy Loss

A loss function in classification is typically a logistic function, which can be generalized as cross-entropy loss. Cross-entropy loss is a function that can calculate similarity between two arbitrary probability distributions. You can find more detailed discussion about it on Wikipedia.

In our case, the first distribution is the probabilistic output of our network, and the second one is the so-called one-hot distribution, which specifies that a given class $c$ has corresponding probability 1 (all the rest being 0). In such a case cross-entropy loss can be calculated as $-\log p_c$, where $c$ is the expected class, and $p_c$ is the corresponding probability of this class given by our neural network.

If the network returns probability 1 for the expected class, cross-entropy loss is 0. The closer the probability of the actual class is to 0, the higher the cross-entropy loss (and it can go up to infinity!).

[25]
[26]

Cross-entropy loss will be defined again as a separate layer, but forward function will have two input values: output of the previous layers of the network p, and the expected class y:

[27]
1.429664938969559

IMPORTANT: The loss function returns a number that shows how well (or badly) our network performs. It should return one number for the whole dataset, or for a part of the dataset (a minibatch). Thus, after calculating cross-entropy loss for each individual component of the input vector, we need to average (or add) all components together - which is done by the call to .mean().
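A sketch of such a cross-entropy layer (the class name follows the text; the toy probabilities below are made up for illustration):

```python
import numpy as np

# Cross-entropy as a layer whose forward takes probabilities p and
# expected class labels y, and averages the loss over the minibatch.
class CrossEntropyLoss:
    def forward(self, p, y):
        self.p = p
        self.y = y
        p_of_y = p[np.arange(len(y)), y]  # probability assigned to the true class
        return -np.log(p_of_y).mean()

p = np.array([[0.9, 0.1],
              [0.4, 0.6]])
y = np.array([0, 1])
loss = CrossEntropyLoss().forward(p, y)  # mean of -log(0.9) and -log(0.6)
```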

Computational Graph

Up to this moment, we have defined different classes for different layers of the network. The composition of those layers can be represented as a computational graph. Now we can compute the loss for a given training dataset (or part of it) in the following manner:

[28]
1.429664938969559
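The whole forward pass through the computational graph - linear layer, softmax, cross-entropy - can be sketched end-to-end on a toy minibatch (all values here are made up for illustration):

```python
import numpy as np

np.random.seed(0)
W = np.random.randn(2, 2)                  # linear layer parameters
b = np.zeros((1, 2))
x = np.random.randn(5, 2)                  # toy minibatch of 5 samples
y = np.array([0, 1, 0, 1, 0])              # expected classes

z = np.dot(x, W.T) + b                     # linear layer
e = np.exp(z - z.max(axis=1, keepdims=True))
p = e / e.sum(axis=1, keepdims=True)       # softmax
loss = -np.log(p[np.arange(len(y)), y]).mean()  # cross-entropy
```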

Loss Minimization Problem and Network Training

Once we have defined our network as $f_\theta$, and given the loss function $\mathcal{L}(Y,f_\theta(X))$, we can consider $\mathcal{L}$ as a function of $\theta$ under our fixed training dataset: $\mathcal{L}(\theta) = \mathcal{L}(Y,f_\theta(X))$

In this case, the network training would be a minimization problem of L\mathcal{L} under argument θ\theta:

$$\theta = \mathrm{argmin}_{\theta} \mathcal{L}(Y,f_\theta(X))$$

There is a well-known method of function optimization called gradient descent. The idea is that we can compute the derivative (in the multi-dimensional case called the gradient) of the loss function with respect to the parameters, and vary the parameters in such a way that the error decreases.

Gradient descent works as follows:

  • Initialize parameters with some random values $W^{(0)}$, $b^{(0)}$
  • Repeat the following step many times:

$$\begin{align}
W^{(i+1)}&=W^{(i)}-\eta\frac{\partial\mathcal{L}}{\partial W}\\
b^{(i+1)}&=b^{(i)}-\eta\frac{\partial\mathcal{L}}{\partial b}
\end{align}$$

During training, the optimization steps are supposed to be calculated considering the whole dataset (remember that loss is calculated as a sum/average over all training samples). However, in real life we take small portions of the dataset called minibatches, and calculate gradients based on a subset of the data. Because the subset is taken randomly each time, such a method is called stochastic gradient descent (SGD).
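As a self-contained sketch of SGD on a toy problem (here we use a plain linear model with mean squared error, for which the gradient can be written analytically - an assumption for illustration, not the notebook's classifier):

```python
import numpy as np

np.random.seed(1)
X = np.random.randn(100, 2)
Y = X @ np.array([2.0, -1.0]) + 0.5       # data generated by a known linear rule

W = np.zeros(2)
b = 0.0
eta = 0.1                                 # learning rate

for _ in range(200):
    idx = np.random.randint(0, len(X), 16)   # random minibatch of 16 samples
    xb, yb = X[idx], Y[idx]
    err = xb @ W + b - yb                    # prediction error on the minibatch
    W -= eta * (xb.T @ err) / len(xb)        # dL/dW for mean squared error
    b -= eta * err.mean()                    # dL/db

# W and b should now be close to the true values (2, -1) and 0.5
```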

Backward Propagation

$$\begin{align}
\frac{\partial\mathcal{L}}{\partial W} &= \frac{\partial\mathcal{L}}{\partial p}\frac{\partial p}{\partial z}\frac{\partial z}{\partial W} \\
\frac{\partial\mathcal{L}}{\partial b} &= \frac{\partial\mathcal{L}}{\partial p}\frac{\partial p}{\partial z}\frac{\partial z}{\partial b}
\end{align}$$

To compute $\partial\mathcal{L}/\partial W$ we can use the chain rule for computing derivatives of a composite function, as you can see in the formulae above. It corresponds to the following idea:

  • Suppose that under the given input we have obtained loss $\Delta\mathcal{L}$
  • To minimize it, we would have to adjust the softmax output $p$ by the value $\Delta p = (\partial\mathcal{L}/\partial p)\Delta\mathcal{L}$
  • This corresponds to changes to node $z$ by $\Delta z = (\partial p/\partial z)\Delta p$
  • To minimize this error, we need to adjust the parameters accordingly: $\Delta W = (\partial z/\partial W)\Delta z$ (and the same for $b$)

This process distributes the loss error from the output of the network back to its parameters. Thus the process is called back propagation.

One pass of the network training consists of two parts:

  • Forward pass, when we calculate the value of loss function for a given input minibatch
  • Backward pass, when we try to minimize this error by distributing it back to the model parameters through the computational graph.

Implementation of Back Propagation

  • Let's add backward function to each of our nodes that will compute the derivative and propagate the error during the backward pass.
  • We also need to implement parameter updates according to the procedure described above

We need to compute derivatives for each layer manually, for example for the linear layer $z = x\times W+b$:

$$\begin{align}
\frac{\partial z}{\partial W} &= x \\
\frac{\partial z}{\partial b} &= 1
\end{align}$$

If we need to compensate for the error $\Delta z$ at the output of the layer, we need to update the weights accordingly:

$$\begin{align}
\Delta x &= \Delta z \times W \\
\Delta W &= \frac{\partial z}{\partial W} \Delta z = \Delta z \times x \\
\Delta b &= \frac{\partial z}{\partial b} \Delta z = \Delta z
\end{align}$$

**IMPORTANT:** Calculations are done not for each training sample independently, but rather for a whole **minibatch**. Required parameter updates $\Delta W$ and $\Delta b$ are computed across the whole minibatch, and the respective vectors have dimensions: $x\in\mathbb{R}^{\mathrm{minibatch}\, \times\, \mathrm{nclass}}$
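A minimal sketch of the linear layer with such a backward pass (the class and method names follow the text; initialization details are assumptions):

```python
import numpy as np

class Linear:
    def __init__(self, nin, nout):
        self.W = np.random.normal(0, 1.0 / np.sqrt(nin), (nout, nin))
        self.b = np.zeros((1, nout))

    def forward(self, x):
        self.x = x                       # remember the input for backward
        return np.dot(x, self.W.T) + self.b

    def backward(self, dz):
        self.dW = np.dot(dz.T, self.x)   # dL/dW, accumulated over the minibatch
        self.db = dz.sum(axis=0)         # dL/db
        return np.dot(dz, self.W)        # dL/dx, propagated to the previous layer

    def update(self, lr):
        self.W -= lr * self.dW           # gradient descent step
        self.b -= lr * self.db

lin = Linear(2, 2)
out = lin.forward(np.random.randn(5, 2))
dx = lin.backward(np.ones((5, 2)))       # propagate a dummy error back
```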
[29]

In the same manner we can define backward function for the rest of our layers:
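As a sketch of what those backward functions might look like (here we use the combined softmax-plus-cross-entropy derivative, which simplifies to $p$ minus the one-hot target divided by the minibatch size; the notebook may implement the two backward passes separately):

```python
import numpy as np

class Softmax:
    def forward(self, z):
        e = np.exp(z - z.max(axis=1, keepdims=True))
        self.p = e / e.sum(axis=1, keepdims=True)
        return self.p

class CrossEntropyLoss:
    def forward(self, p, y):
        self.p, self.y = p, y
        return -np.log(p[np.arange(len(y)), y]).mean()

    def backward(self):
        # combined softmax + cross-entropy gradient w.r.t. pre-softmax z
        d = self.p.copy()
        d[np.arange(len(self.y)), self.y] -= 1.0
        return d / len(self.y)

z = np.array([[2.0, 1.0], [0.0, 3.0]])
y = np.array([0, 1])
sm, ce = Softmax(), CrossEntropyLoss()
loss = ce.forward(sm.forward(z), y)
dz = ce.backward()   # each row sums to zero: probability mass just shifts
```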

[30]

Training the Model

Now we are ready to write the training loop, which will go through our dataset and perform the optimization minibatch by minibatch. One complete pass through the dataset is often called an epoch:
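A compact sketch of such an epoch loop (a bare linear classifier with softmax cross-entropy, written inline; the toy data and hyperparameters are our assumptions):

```python
import numpy as np

np.random.seed(0)
n = 200
X = np.random.randn(n, 2)
Y = (X[:, 0] + X[:, 1] > 0).astype(int)   # linearly separable toy labels

W = np.zeros((2, 2))
b = np.zeros((1, 2))
lr, batch = 0.5, 20

for i in range(0, n, batch):              # one epoch, minibatch by minibatch
    xb, yb = X[i:i+batch], Y[i:i+batch]
    z = xb @ W.T + b                      # forward: linear
    e = np.exp(z - z.max(axis=1, keepdims=True))
    p = e / e.sum(axis=1, keepdims=True)  # forward: softmax
    d = p.copy()
    d[np.arange(len(yb)), yb] -= 1        # backward: softmax + cross-entropy
    d /= len(yb)
    W -= lr * d.T @ xb                    # update parameters
    b -= lr * d.sum(axis=0)

acc = np.mean((X @ W.T + b).argmax(axis=1) == Y)
```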

[31]
Initial accuracy:  0.725
Final accuracy:  0.825

Nice to see how we can increase the accuracy of the model from about 70% to over 80% in one epoch.

Network Class

Since in many cases a neural network is just a composition of layers, we can build a class that will allow us to stack layers together and make forward and backward passes through them without explicitly programming that logic. We will store the list of layers inside the Net class, and use the add() function to add new layers:
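A sketch of such a Net container (the toy Scale layer below is ours, added only to demonstrate the chaining; the notebook's real layers would plug in the same way):

```python
# Net stores layers and chains forward/backward calls through them.
class Net:
    def __init__(self):
        self.layers = []

    def add(self, layer):
        self.layers.append(layer)

    def forward(self, x):
        for layer in self.layers:          # forward: first to last
            x = layer.forward(x)
        return x

    def backward(self, dz):
        for layer in reversed(self.layers):  # backward: last to first
            dz = layer.backward(dz)
        return dz

    def update(self, lr):
        for layer in self.layers:          # only layers with parameters update
            if hasattr(layer, 'update'):
                layer.update(lr)

class Scale:                               # hypothetical toy layer
    def __init__(self, k): self.k = k
    def forward(self, x): return x * self.k
    def backward(self, dz): return dz * self.k

net = Net()
net.add(Scale(2.0))
net.add(Scale(3.0))
out = net.forward(1.0)     # 1 * 2 * 3 = 6
grad = net.backward(1.0)   # chain rule: 1 * 3 * 2 = 6
```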

[32]

With this Net class, our model definition and training become much neater:

[33]
Initial loss=0.6212072429381601, accuracy=0.6875: 
Final loss=0.44369925927417986, accuracy=0.8: 
Test loss=0.4767711377257787, accuracy=0.85: 

Plotting the Training Process

It would be nice to see visually how the network is being trained! We will define a train_and_plot function for that. To visualize the state of the network we will use level map, i.e. we will represent different values of the network output using different colors.

Do not worry if you do not understand some of the plotting code below - it is more important to understand the underlying neural network concepts.

[34]
[35]
[36]
[37]

After running the cell above, you should be able to see interactively how the boundary between classes changes during training. Note that we have chosen a very small learning rate so that we can watch how the process happens.

Multi-Layered Models

The network above has been constructed from several layers, but we still had only one Linear layer, which does the actual classification. What happens if we decide to add several such layers?

Surprisingly, our code will work! A very important thing to note, however, is that in between linear layers we need to have a non-linear activation function, such as tanh. Without such non-linearity, several linear layers would have the same expressive power as just one layer - because a composition of linear functions is also linear!
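Such an activation can be sketched as a layer in the same style as the others (the class name Tanh is an assumption; its backward multiplies the incoming gradient by the derivative $1-\tanh^2(x)$):

```python
import numpy as np

# Tanh activation layer that can sit between two linear layers.
class Tanh:
    def forward(self, x):
        self.y = np.tanh(x)            # remember the output for backward
        return self.y

    def backward(self, dy):
        return (1.0 - self.y ** 2) * dy  # d/dx tanh(x) = 1 - tanh(x)^2

t = Tanh()
y = t.forward(np.array([0.0, 1.0]))
dx = t.backward(np.array([1.0, 1.0]))  # derivative at 0 is exactly 1
```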

[38]

Adding several layers makes sense because, unlike a one-layer network, a multi-layered model will be able to accurately classify sets that are not linearly separable. I.e., a model with several layers will be richer.

It can be demonstrated that with a sufficient number of neurons a two-layered model is capable of classifying any convex set of data points, and a three-layered network can classify virtually any set.

Mathematically, a multi-layered perceptron would be represented by a more complex function $f_\theta$ that can be computed in several steps:

  • $z_1 = W_1\times x+b_1$
  • $z_2 = W_2\times\alpha(z_1)+b_2$
  • $f = \sigma(z_2)$

Here, $\alpha$ is a non-linear activation function, $\sigma$ is a softmax function, and $\theta=\langle W_1,b_1,W_2,b_2\rangle$ are parameters.

The gradient descent algorithm would remain the same, but it would be more difficult to calculate gradients. Given the chain differentiation rule, we can calculate derivatives as:

$$\begin{align}
\frac{\partial\mathcal{L}}{\partial W_2} &= \color{red}{\frac{\partial\mathcal{L}}{\partial\sigma}\frac{\partial\sigma}{\partial z_2}}\color{black}{\frac{\partial z_2}{\partial W_2}} \\
\frac{\partial\mathcal{L}}{\partial W_1} &= \color{red}{\frac{\partial\mathcal{L}}{\partial\sigma}\frac{\partial\sigma}{\partial z_2}}\color{black}{\frac{\partial z_2}{\partial\alpha}\frac{\partial\alpha}{\partial z_1}\frac{\partial z_1}{\partial W_1}}
\end{align}$$

Note that the beginning of both those expressions is the same, and thus we can continue back propagation beyond one linear layer to adjust further weights up the computational graph.

Let's now experiment with two-layered network:

[39]
[40]

Why Not Always Use Multi-Layered Model?

We have seen that a multi-layered model is more powerful and expressive than a one-layered one. You may be wondering why we don't always use a many-layered model. The answer to this question is overfitting.

We will deal with this term more in later sections, but the idea is the following: the more powerful the model is, the better it can approximate the training data, and the more data it needs to properly generalize to new data it has not seen before.

A linear model:

  • We are likely to get high training loss - so-called underfitting, when the model does not have enough power to correctly separate all data.
  • Validation loss and training loss are more or less the same. The model is likely to generalize well to test data.

A complex multi-layered model:

  • Low training loss - the model can approximate training data well, because it has enough expressive power.
  • Validation loss can be much higher than training loss and can start to increase during training - this is because the model "memorizes" training points, and loses the "overall picture"

Overfitting

In this picture, x stands for training data, and o for validation data. On the left is a linear model (one layer), which approximates the nature of the data pretty well. On the right is an overfitted model: it approximates the training data perfectly, but stops making sense for any other data (the validation error is very high).

Takeaways

  • Simple models (fewer layers, fewer neurons) with low number of parameters ("low capacity") are less likely to overfit
  • More complex models (more layers, more neurons on each layer, high capacity) are likely to overfit. We need to monitor validation error to make sure it does not start to rise with further training
  • More complex models need more data to train on.
  • You can solve overfitting problem by either:
    • simplifying your model
    • increasing the amount of training data
  • Bias-variance trade-off is a term that shows that you need to strike a compromise
    • between the power of the model and the amount of data,
    • between overfitting and underfitting
  • There is no single recipe for how many layers or parameters you need - the best way is to experiment

Credits

This notebook is a part of AI for Beginners Curricula, and has been prepared by Dmitry Soshnikov. It is inspired by Neural Network Workshop at Microsoft Research Cambridge. Some code and illustrative materials are taken from presentations by Katja Hoffmann, Matthew Johnson and Ryoto Tomioka, and from NeuroWorkshop repository.