Mini Batch Gradient Descent


  • The idea is simple: instead of using just one example per update as in stochastic gradient descent, take b examples, where 1 < b < m.
  • Then perform the parameter update on those b examples (a sketch of the update rule is below).



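As a sketch of that update (assuming the squared-error cost and hypothesis h_θ(x) used in the batch and stochastic versions), each parameter θ_j is updated on one mini-batch of b examples starting at index i as:

$$
\theta_j := \theta_j - \alpha \, \frac{1}{b} \sum_{k=i}^{i+b-1} \left( h_\theta\!\left(x^{(k)}\right) - y^{(k)} \right) x_j^{(k)}
$$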
  • Here is a more complete description of mini-batch gradient descent.
  • We take b = 10, so the index i jumps by 10 on each iteration.
  • The constants change accordingly (boxed in magenta in the original figure): 1/b instead of 1/m, and the summation runs over examples i through i + 9. See the Python sketch after this list.
  • This is still significantly faster than batch GD.
  • So why use mini-batch instead of stochastic gradient descent?
  • The answer lies in vectorization. With b examples per update and a good linear algebra library, the gradient computation can be vectorized and parallelized across cores, whereas a single example offers little to parallelize.
  • So mini-batch is significantly faster than batch gradient descent and somewhat faster than stochastic gradient descent.
  • (Final note from me.) I think mini-batch oscillates less, i.e. behaves less stochastically, than stochastic gradient descent when converging toward the global minimum.
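Below is a minimal NumPy sketch of the procedure, assuming linear regression with the squared-error cost; the function name and variables (X, y, theta, alpha, b) are my own, not from the source. The gradient over each mini-batch is computed as a single matrix product, which is the vectorization point made above.

```python
import numpy as np

def mini_batch_gradient_descent(X, y, alpha=0.01, b=10, epochs=100):
    """Mini-batch gradient descent for linear regression (squared-error cost).

    X: (m, n) feature matrix, y: (m,) targets, b: mini-batch size.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        # Shuffle once per pass so each mini-batch is a random slice of the data.
        perm = np.random.permutation(m)
        X_shuffled, y_shuffled = X[perm], y[perm]
        for i in range(0, m, b):  # i jumps by b (e.g. 10) each iteration
            X_batch = X_shuffled[i:i + b]
            y_batch = y_shuffled[i:i + b]
            # Vectorized gradient over the b examples: (1/b) * X_b^T (X_b theta - y_b)
            grad = X_batch.T @ (X_batch @ theta - y_batch) / len(y_batch)
            theta -= alpha * grad
    return theta

# Tiny usage example on synthetic data (hypothetical).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.c_[np.ones(200), rng.normal(size=(200, 1))]  # bias column + one feature
    y = 3.0 + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
    print(mini_batch_gradient_descent(X, y, alpha=0.1, b=10, epochs=200))
```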