
The idea is simple: take b examples per iteration instead of just one, as in stochastic gradient descent, where m > b > 1.

Then perform the update step above on each mini-batch.

Here is the more complete form of mini-batch gradient descent:

We take b = 10, so the loop index i advances by 10 each iteration: the loop jumps every 10 examples.

And the constants boxed in magenta are 1/b and the summation running from example i up to example i + 9.
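As a sketch of the update just described (assuming a linear-regression hypothesis h(x) = θᵀx; the function name and parameters here are my own, not from the original), one epoch with b = 10 might look like:

```python
import numpy as np

def minibatch_gd_epoch(X, y, theta, alpha=0.01, b=10):
    """One epoch of mini-batch gradient descent for linear regression.

    X: (m, n) design matrix, y: (m,) targets, theta: (n,) parameters.
    The loop index i jumps by b = 10 each iteration, as described above.
    """
    m = X.shape[0]
    for i in range(0, m, b):
        Xb = X[i:i + b]  # the b examples x^(i) ... x^(i+9)
        yb = y[i:i + b]
        # (1/b) * sum over the mini-batch of (h(x) - y) * x
        grad = Xb.T @ (Xb @ theta - yb) / len(Xb)
        theta = theta - alpha * grad
    return theta
```

Run over several epochs, this drives theta toward the least-squares solution, one mini-batch update at a time.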

This is still significantly faster than batch gradient descent.

So why use mini-batch instead of stochastic gradient descent?

The answer lies in vectorization. With b examples and a good linear algebra library, the mini-batch computation can be vectorized and parallelized, for example spread across b cores. A single example, in contrast, offers little or no opportunity for parallelism.
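As an illustrative sketch (numpy standing in for the "pretty good linear algebra library", with made-up data), the mini-batch gradient can be computed as one matrix product that the underlying BLAS can parallelize, instead of b separate per-example computations, and the two agree:

```python
import numpy as np

rng = np.random.default_rng(42)
b, n = 10, 3
Xb = rng.normal(size=(b, n))  # a mini-batch of b examples
yb = rng.normal(size=b)
theta = rng.normal(size=n)

# Per-example loop: b separate gradient computations, as SGD would do them
grad_loop = np.zeros(n)
for k in range(b):
    grad_loop += (Xb[k] @ theta - yb[k]) * Xb[k]
grad_loop /= b

# Vectorized: one matrix product the linear algebra library can parallelize
grad_vec = Xb.T @ (Xb @ theta - yb) / b

print(np.allclose(grad_loop, grad_vec))  # the two computations agree
```

The vectorized form is where mini-batch gets its speed advantage over processing one example at a time.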

So mini-batch is significantly faster than batch gradient descent and somewhat faster than stochastic gradient descent.

(Final note from me.) I think mini-batch cannot be as noisy as stochastic gradient descent when converging toward the global minimum, since averaging over b examples smooths the updates.