
• The idea is simple: instead of using just one example per iteration (as in stochastic gradient descent), use b examples, where 1 < b < m.
• Then perform the usual gradient-descent update on those b examples.

• Here is the more complete picture of mini-batch gradient descent.
• We take b = 10, so the loop index i jumps by 10 on each iteration; each batch covers examples i through i+9.
• The constants boxed in magenta on the slide are the 1/b factor and the summation running up to i+9; the update is θ_j := θ_j − α · (1/10) · Σ_{k=i}^{i+9} (h_θ(x^(k)) − y^(k)) · x_j^(k).
• This is still significantly faster than batch gradient descent.
• So why use mini-batch instead of stochastic gradient descent?
• The answer lies in vectorization. With b examples per update, a good linear algebra library can parallelize the computation over the batch (e.g., spreading it across b cores), whereas a single example offers little opportunity for parallelism (see the sketch after this list).
• So mini-batch GD can be significantly faster than batch GD and somewhat faster than stochastic GD.
• (Final note from me.) I think mini-batch is less noisy than stochastic gradient descent when converging toward the global minimum, since each step averages over b examples.
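
Below is a minimal NumPy sketch of mini-batch gradient descent for linear regression with b = 10, illustrating the 1/b averaging, the i to i+9 summation, and the vectorized batch update described above. The function name, the synthetic data, and the hyperparameters (learning rate alpha, number of epochs, shuffling each pass) are my own illustrative assumptions, not from the source.

```python
import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.01, b=10, epochs=50):
    """Mini-batch gradient descent for linear regression (illustrative sketch)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        # Shuffle once per pass so batches are not always the same examples (assumed detail).
        perm = np.random.permutation(m)
        X_shuf, y_shuf = X[perm], y[perm]
        # i jumps by b (= 10) each iteration: each batch is examples i .. i+b-1.
        for i in range(0, m, b):
            X_batch = X_shuf[i:i + b]
            y_batch = y_shuf[i:i + b]
            # Vectorized update: (1/b) * sum over the batch of (h(x) - y) * x,
            # computed as one matrix product instead of a per-example loop.
            error = X_batch @ theta - y_batch
            grad = (X_batch.T @ error) / len(X_batch)
            theta -= alpha * grad
    return theta

# Illustrative usage on synthetic data (assumed, not from the source).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m = 1000
    X = np.column_stack([np.ones(m), rng.normal(size=m)])  # bias column + one feature
    y = X @ np.array([4.0, 3.0]) + rng.normal(scale=0.1, size=m)
    print(minibatch_gradient_descent(X, y))
```

The inner batch update is where the vectorization argument from the notes shows up: the matrix products over the b rows let the linear algebra library parallelize the work, which a one-example-at-a-time stochastic update cannot exploit to the same degree.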