- The idea is simple: take b examples per update instead of just one example as in stochastic gradient descent, where m > b > 1.
- Then perform the gradient-descent update described above on each mini-batch.
- Here is a more complete description of mini-batch gradient descent:
- We take b = 10, so the index i jumps by 10 at each iteration (i = 1, 11, 21, ...).
- The constants in the update are the averaging factor 1/b = 1/10 and the summation over the 10 examples from i to i+9.
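As a sketch of that update (assuming a linear-regression hypothesis $h_\theta$ and, for concreteness, $m = 1000$ training examples, neither of which is stated above), one pass looks like:

$$
\begin{aligned}
&\text{Repeat until convergence:}\\
&\quad \text{for } i = 1, 11, 21, \ldots, 991:\\
&\qquad \theta_j := \theta_j - \alpha\,\frac{1}{10}\sum_{k=i}^{i+9}\bigl(h_\theta(x^{(k)}) - y^{(k)}\bigr)\,x_j^{(k)} \qquad \text{(simultaneously for every } j\text{)}
\end{aligned}
$$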
- This is still significantly faster than batch gradient descent.
- So why use mini-batch instead of stochastic gradient descent?
- The answer lies in vectorization. With b examples and a good linear algebra library, the mini-batch computation can be vectorized and spread across multiple cores (up to b in parallel), whereas a single example offers little to parallelize.
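A minimal NumPy sketch of the vectorized mini-batch update, assuming the same linear-regression setup as above; the function and variable names here are my own illustration, not from the course:

```python
import numpy as np

def minibatch_gradient_step(theta, X_batch, y_batch, alpha):
    """One mini-batch update for linear regression, fully vectorized.

    X_batch: (b, n) matrix of b examples, y_batch: (b,) targets.
    The matrix products below let the linear-algebra library use multiple
    cores, which a single-example (stochastic) update cannot exploit as well.
    """
    b = X_batch.shape[0]
    errors = X_batch @ theta - y_batch      # (b,) residuals for the whole batch
    gradient = (X_batch.T @ errors) / b     # average gradient over b examples
    return theta - alpha * gradient

# Toy usage (hypothetical data): m = 1000 examples, batch size b = 10
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_theta = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_theta + 0.1 * rng.normal(size=1000)

theta = np.zeros(5)
b, alpha = 10, 0.1
for epoch in range(20):
    for i in range(0, 1000, b):             # i jumps by b = 10 each iteration
        theta = minibatch_gradient_step(theta, X[i:i+b], y[i:i+b], alpha)
```

The two matrix products do the work for all b examples at once, which is where the speedup over one-example updates comes from.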
- So mini-batch gradient descent is significantly faster than batch gradient descent and can be somewhat faster than stochastic gradient descent.
- (Final note from me.) I think mini-batch updates can't be as noisy as stochastic ones when converging toward the global minimum, since each step averages over b examples.