Stochastic Gradient Descent Convergence
This video covers:
- Making sure stochastic gradient descent is converging
- Picking the right learning rate, alpha
- As discussed earlier, batch gradient descent has to scan the entire training set before making a single parameter update, which can take a huge amount of time on large datasets.
- In stochastic gradient descent, to check convergence, compute the cost of each example just *before* updating the thetas on it, then plot the average of that cost over the last 1000 examples.
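The bullet above can be sketched in code. This is a minimal, hypothetical implementation for linear regression (the function name and `window` parameter are my own, not from the video): the cost of each example is recorded before theta is updated on it, and the mean of the last 1000 costs becomes one plot point.

```python
import random

def sgd_with_cost_monitoring(data, theta, alpha, window=1000):
    # Sketch of one pass of stochastic gradient descent for linear
    # regression. The cost of (x, y) is computed *before* theta is
    # updated on that example; the mean cost of the last `window`
    # examples is recorded as one point for the convergence plot.
    random.shuffle(data)                 # shuffle the training set first
    recent, averaged = [], []
    for i, (x, y) in enumerate(data, start=1):
        h = sum(t * xj for t, xj in zip(theta, x))   # hypothesis h_theta(x)
        recent.append(0.5 * (h - y) ** 2)            # cost before the update
        theta = [t - alpha * (h - y) * xj            # update on this one example
                 for t, xj in zip(theta, x)]
        if i % window == 0:              # one plot point per `window` examples
            averaged.append(sum(recent) / len(recent))
            recent = []
    return theta, averaged
```

Plotting the returned `averaged` list (one point per 1000 examples) gives exactly the kind of curve the next bullets describe.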
- These are 4 different plots we might encounter when plotting the cost averaged over 1000 examples:
- In the first plot, the curve bounces up and down (it's stochastic) but trends down toward a point near the actual minimum of the cost function.
- The red line shows what happens with a smaller alpha: the oscillations shrink and the curve can settle slightly closer to the minimum.
- The second plot, in blue, averages over 1000 examples and still looks noisy.
- When averaging over 5000 examples instead, we see a smoother line (shown in red), but the drawback is delayed feedback on how well the algorithm is learning, since we only get one plot point per 5000 examples.
- The third plot shows a case where the algorithm may appear not to be learning at all: the blue line (1000 examples) looks flat.
- The red line shows that, with more averaging, the trend may actually be converging down toward the minimum; we just couldn't see it in blue because the 1000-example average is too stochastic.
- Of course, the curve may also be genuinely flat; in this case, try changing the model, e.g. by using more features.
- Apart from plotting, it's also important to think about the learning rate alpha.
- The upper diagram shows what happens with a constant learning rate, whereas the lower one shows what happens when we keep making the learning rate smaller and smaller.
- One way to make the learning rate smaller over time uses three values: iterNumber, const1, and const2, e.g. alpha = const1 / (iterNumber + const2).
- iterNumber keeps increasing as the iterations move forward, so alpha keeps getting smaller.
- const1 and const2 are constants we play with a bit to find values that work well.
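The decay schedule above can be written as a one-liner. This is a sketch; the default values for `const1` and `const2` are arbitrary placeholders, not values from the video, and in practice they have to be tuned by hand.

```python
def decaying_alpha(iter_number, const1=4.0, const2=50.0):
    # Learning rate that shrinks as iterations progress:
    #   alpha = const1 / (iter_number + const2)
    # const1 and const2 are hand-tuned constants (placeholder values here);
    # const2 keeps alpha finite at iteration 0.
    return const1 / (iter_number + const2)
```

With this schedule, the parameter updates take smaller and smaller steps, so the oscillation around the minimum shrinks over time.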
- So there are two ways to help make sure stochastic gradient descent is converging:
- Plot the averaged cost function to check that it is trending down
- Slowly make alpha smaller over time