Mini Batch Gradient Descent

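  • The idea is simple: use b examples per update instead of the single example used in stochastic gradient descent, where 1 < b < m and m is the number of training examples.
  • Then perform the same gradient update as before, averaged over those b examples.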
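
To make the update concrete, here is one mini-batch step written out for linear regression. The original figure is not reproduced here, so the symbols (parameters \theta_j, learning rate \alpha, hypothesis h_\theta, and k as the index inside the batch) are assumed from the usual course notation rather than taken from the source. The batch starts at example i and covers b examples:

```latex
\theta_j := \theta_j - \alpha \, \frac{1}{b} \sum_{k=i}^{i+b-1}
    \left( h_\theta\!\left(x^{(k)}\right) - y^{(k)} \right) x_j^{(k)}
    \qquad \text{(simultaneously for every } j\text{)}
```

Setting b = 1 would give back the stochastic update, and setting b = m the batch update.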

  • Here is the more complete form of mini-batch gradient descent.
  • We take b = 10, so i jumps by 10 on each iteration.
  • The parts that change, boxed in magenta, are the constant 1/b and the summation over the 10 examples from i to i+9.
  • This is still significantly faster than batch GD.
  • So why use mini-batch instead of stochastic?
  • The answer lies in vectorization. With b examples and a good linear algebra library, the gradient computation for one mini-batch can be parallelized, for example by spreading the b examples across several cores. A single example, by contrast, offers little or nothing to parallelize.
  • So mini-batch is significantly faster than batch and, with a vectorized implementation, can be somewhat faster than stochastic.
  • (Final note from me.) I think mini-batch updates are less erratic than stochastic ones as they converge toward the global minimum, since each step averages over b examples.
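
The loop and the vectorization point above can be sketched in a few lines of NumPy. This is a minimal illustration, not the course's reference code: the function name `minibatch_gradient_descent`, the linear regression setting, the synthetic noise-free data, and the values chosen for `alpha` and `epochs` are all assumptions made for the example. Shuffling the data between passes is left out to keep it short.

```python
import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.1, b=10, epochs=50):
    """Fit linear regression parameters with mini-batch gradient descent.

    Illustrative sketch only: X is an (m, n) design matrix, y an (m,) target.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        # i jumps by b on each iteration: 0, 10, 20, ... when b = 10
        # (0-based here, unlike the 1-based indexing in the notes).
        for i in range(0, m, b):
            X_batch = X[i:i + b]
            y_batch = y[i:i + b]
            # One matrix-vector product computes the errors for the whole
            # batch; this is the part a linear algebra library can parallelize.
            errors = X_batch @ theta - y_batch
            # Average gradient over the batch: the 1/b factor and the
            # sum over the examples i .. i+b-1.
            gradient = X_batch.T @ errors / len(y_batch)
            theta -= alpha * gradient
    return theta

# Synthetic, noise-free data (an assumption for the demo): y = 2 - 3x.
rng = np.random.default_rng(0)
m = 1000
X = np.column_stack([np.ones(m), rng.uniform(-1.0, 1.0, size=m)])
true_theta = np.array([2.0, -3.0])
y = X @ true_theta

theta = minibatch_gradient_descent(X, y)
print(theta)
```

With b = 10 and m = 1000, each pass over the data makes 100 parameter updates, compared with a single update per pass for batch gradient descent. Each of those updates still works on a whole 10-row block at once rather than one example at a time.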