When datasets are large and high-dimensional, it is computationally very expensive (sometimes impossible!) to find an analytical solution for the optimal parameters of your network. Instead, we use optimization methods. A vanilla optimization approach would be to sample different combinations of parameters and choose the one with the lowest loss value.
- Is this a good idea?
- Would it be possible to extract another piece of information to direct our search towards the optimal parameters?
This is exactly what gradient descent does! Apart from the loss value, gradient descent computes the local gradient of the loss when evaluating potential parameters. This information is used to decide which direction the search should go to find better parameter values. This extra piece of information (the local gradient) can be computed relatively easily using backpropagation. This recursive algorithm breaks up complex derivatives into smaller parts through the chain rule.
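As a minimal sketch of how the local gradient directs the search, the snippet below runs gradient descent on a toy one-parameter loss. The loss `(w - 3)**2`, the starting point, the learning rate, and the function names are all assumptions chosen for illustration; none of them come from the text above.

```python
def loss(w):
    # Toy quadratic loss with its minimum at w = 3 (illustrative assumption).
    return (w - 3.0) ** 2


def grad(w):
    # Analytical derivative of the toy loss: d/dw (w - 3)^2 = 2 (w - 3).
    return 2.0 * (w - 3.0)


def gradient_descent(w0, learning_rate=0.1, steps=100):
    w = w0
    for _ in range(steps):
        # Step against the gradient: its sign says which direction is downhill.
        w -= learning_rate * grad(w)
    return w


w_final = gradient_descent(w0=-5.0)
print(w_final, loss(w_final))
```

Unlike sampling parameters blindly, every update here uses the slope at the current point, so each step moves towards lower loss as long as the learning rate is small enough.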
To help understand gradient descent, let’s visualize the setup.
Let’s consider a linear regression. You have a data set $(x, y)$ with $m$ examples. In other words, $x = (x_1, \dots, x_m)$ and $y = (y_1, \dots, y_m)$ are row vectors of $m$ scalar examples. The goal is to find the scalar parameters $w$ and $b$ such that the line $y = wx + b$ optimally fits the data. This can be achieved using gradient descent.
The first step of gradient descent is to compute the loss. To do this, define your model’s output and loss function. In this regression setting, we use the mean squared error loss.
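Written out for the univariate case, one common formulation of the model output and the mean squared error loss is shown below. The notation is an assumption: $\hat{y}_i$ denotes the prediction for example $i$, $w$ and $b$ the scalar weight and bias, $m$ the number of examples, and the loss is normalized by $1/m$ (some texts use $1/(2m)$ instead).

```latex
\hat{y}_i = w x_i + b,
\qquad
\mathcal{L}(w, b) = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}_i - y_i \right)^2
```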
The next step is to compute the local gradient of the loss with respect to the parameters (i.e. $w$ and $b$). This means you need to calculate derivatives. Note that values stored during the forward propagation are used in the gradient equations.
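The two steps can be sketched in plain Python for the univariate regression. This is an illustrative sketch, not a reference implementation: it assumes a mean squared error normalized by the number of examples, a scalar weight `w` and bias `b`, and a hypothetical noise-free data set generated from a known line; the learning rate and step count are also arbitrary choices.

```python
def forward(w, b, xs):
    # Forward propagation: predictions y_hat_i = w * x_i + b.
    return [w * x + b for x in xs]


def mse(y_hat, ys):
    # Mean squared error over all examples.
    return sum((p - y) ** 2 for p, y in zip(y_hat, ys)) / len(ys)


def gradients(y_hat, xs, ys):
    m = len(xs)
    # The residuals reuse the predictions stored during the forward pass.
    residuals = [p - y for p, y in zip(y_hat, ys)]
    dw = 2.0 / m * sum(r * x for r, x in zip(residuals, xs))  # dL/dw
    db = 2.0 / m * sum(residuals)                             # dL/db
    return dw, db


def fit(xs, ys, learning_rate=0.05, steps=2000):
    w, b = 0.0, 0.0
    for _ in range(steps):
        y_hat = forward(w, b, xs)
        dw, db = gradients(y_hat, xs, ys)
        # Move each parameter against its gradient.
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b


# Hypothetical data generated from y = 2x + 1 (no noise).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = fit(xs, ys)
print(w, b, mse(forward(w, b, xs), ys))
```

Because the data lie exactly on a line, the fitted parameters should approach the slope and intercept used to generate them.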
Now, consider the case where $X$ is a matrix of shape $(n, m)$ and $y$ is still a row vector of shape $(1, m)$. Instead of a single scalar value, the weights $w$ will be a vector (one element per feature) of shape $(1, n)$. The bias parameter $b$ is still a scalar.
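In vectorized form, the forward pass and gradients for this multivariate case can be written as follows. The notation is an assumption: $X$ holds $n$ features for each of $m$ examples as columns, $w$ is a row vector with one weight per feature, $b$ is a scalar broadcast across examples, and the loss is the mean squared error normalized by $1/m$.

```latex
\begin{aligned}
\hat{y} &= wX + b, \qquad
\mathcal{L} = \frac{1}{m} \lVert \hat{y} - y \rVert_2^2 \\
\frac{\partial \mathcal{L}}{\partial \hat{y}} &= \frac{2}{m} \left( \hat{y} - y \right) \\
\frac{\partial \mathcal{L}}{\partial w} &= \frac{\partial \mathcal{L}}{\partial \hat{y}} X^\top, \qquad
\frac{\partial \mathcal{L}}{\partial b} = \sum_{i=1}^{m} \frac{\partial \mathcal{L}}{\partial \hat{y}_i}
\end{aligned}
```

A quick shape check helps catch mistakes: $\partial \mathcal{L} / \partial \hat{y}$ has one entry per example, so multiplying it by $X^\top$ yields one entry per feature, matching the shape of $w$.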
Two Layer Linear Network
Consider stacking two linear layers together. You can introduce a hidden variable $Z$ of shape $(k, m)$, which is the output of the first linear layer. The first layer is parameterized by a weight matrix $W_1$ of shape $(k, n)$ and bias $b_1$ of shape $(k, 1)$ broadcasted to $(k, m)$. The second layer will be the same as in the multivariate regression case, but its input will be $Z$ instead of $X$.
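Applying the chain rule layer by layer gives the gradients below. The symbols are assumptions introduced for this sketch: $k$ is the number of hidden units, $W_1$ and $b_1$ parameterize the first layer, $w_2$ (a row vector with $k$ entries) and the scalar $b_2$ parameterize the second, and the loss is the mean squared error normalized by $1/m$.

```latex
\begin{aligned}
Z &= W_1 X + b_1, \qquad
\hat{y} = w_2 Z + b_2, \qquad
\mathcal{L} = \frac{1}{m} \lVert \hat{y} - y \rVert_2^2 \\
\frac{\partial \mathcal{L}}{\partial \hat{y}} &= \frac{2}{m} \left( \hat{y} - y \right) \\
\frac{\partial \mathcal{L}}{\partial w_2} &= \frac{\partial \mathcal{L}}{\partial \hat{y}} Z^\top, \qquad
\frac{\partial \mathcal{L}}{\partial b_2} = \sum_{i=1}^{m} \frac{\partial \mathcal{L}}{\partial \hat{y}_i} \\
\frac{\partial \mathcal{L}}{\partial Z} &= w_2^\top \frac{\partial \mathcal{L}}{\partial \hat{y}} \\
\frac{\partial \mathcal{L}}{\partial W_1} &= \frac{\partial \mathcal{L}}{\partial Z} X^\top, \qquad
\frac{\partial \mathcal{L}}{\partial b_1} = \sum_{i=1}^{m} \left( \frac{\partial \mathcal{L}}{\partial Z} \right)_{:, i}
\end{aligned}
```

Each gradient reuses a quantity computed one step earlier, which is the recursive structure that backpropagation exploits.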
Two Layer Nonlinear Network
In this example, before sending the hidden variable $Z$ as the input to the second layer, you will pass it through the sigmoid function. The output is denoted $A = \sigma(Z)$ and is the input of the second layer.
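The following sketch implements backpropagation for a small network of this kind in plain Python and compares one analytical gradient against a finite-difference estimate. It rests on several assumptions: a mean squared error loss normalized by the number of examples, hypothetical names (`W1`, `b1`, `w2`, `b2`), randomly generated data, and examples stored as rows rather than columns to keep the loops simple.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, X):
    # X is a list of m examples, each a list of n features.
    W1, b1, w2, b2 = params
    # A[i][j]: sigmoid activation of hidden unit j for example i.
    A = [[sigmoid(sum(w * v for w, v in zip(W1[j], x)) + b1[j])
          for j in range(len(W1))] for x in X]
    y_hat = [sum(w * a for w, a in zip(w2, a_i)) + b2 for a_i in A]
    return A, y_hat


def mse(y_hat, y):
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)


def backward(params, X, y, A, y_hat):
    W1, b1, w2, b2 = params
    m, k, n = len(X), len(W1), len(X[0])
    dW1 = [[0.0] * n for _ in range(k)]
    db1, dw2, db2 = [0.0] * k, [0.0] * k, 0.0
    for x, t, a, p in zip(X, y, A, y_hat):
        d_yhat = 2.0 / m * (p - t)  # dL/dy_hat for this example
        db2 += d_yhat
        for j in range(k):
            dw2[j] += d_yhat * a[j]
            # Chain rule through the sigmoid: sigma'(z) = a * (1 - a).
            d_z = d_yhat * w2[j] * a[j] * (1.0 - a[j])
            db1[j] += d_z
            for c in range(n):
                dW1[j][c] += d_z * x[c]
    return dW1, db1, dw2, db2


random.seed(0)
n, k, m = 3, 4, 5  # features, hidden units, examples (arbitrary sizes)
X = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(m)]
y = [random.uniform(-1, 1) for _ in range(m)]
W1 = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(k)]
b1 = [0.0] * k
w2 = [random.uniform(-1, 1) for _ in range(k)]
params = (W1, b1, w2, 0.0)

A, y_hat = forward(params, X)
dW1, db1, dw2, db2 = backward(params, X, y, A, y_hat)

# Central finite-difference estimate of dL/dW1[0][0] as a sanity check.
eps = 1e-6
W1[0][0] += eps
loss_plus = mse(forward(params, X)[1], y)
W1[0][0] -= 2 * eps
loss_minus = mse(forward(params, X)[1], y)
W1[0][0] += eps  # restore the original value
numeric = (loss_plus - loss_minus) / (2 * eps)
print(dW1[0][0], numeric)
```

The two printed numbers should agree closely; a large gap would point to a mistake in one of the chain-rule steps.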
We have compiled an illustrative set of handwritten notes that cover:
- Scalar Operations and their derivatives,
- Detailed visualizations of the backprop calculation for logistic regression and a 2-layer perceptron,
- An overview of common optimization techniques: mini-batch GD, momentum, RMSprop, and Adam.
We also put together a sheet of exercises (and their solutions) to help you test your understanding of gradient descent and backpropagation, as well as provide useful practice for the exam.