• These are the steps to implement a neural network:
  • First we need the weight matrices, since these are the parameters that backprop will update.
  • Randomly initialize the weight matrices, because the weights must not all start with the same value (otherwise every hidden unit ends up computing the same function); draw each weight from a random range bounded by epsilon, i.e. [-epsilon, epsilon] (a minimal sketch follows this list).
  • Then use forward propagation until we get the hypothesis output and the cost function J(theta).
  • Use forward propagation and backprop for every example: forward and backprop on the first example, then move to the second example, forward and backprop again, and so on until we reach the final example.
  • When first implementing backprop, it is not recommended to go without for loops; loop over the examples explicitly.
  • So for each training example, go through the layers with forward propagation and then backpropagation, compute and accumulate the delta terms, and keep doing this until the final example is reached.
  • Then, once that code is written, finally compute the partial derivatives of the cost function from the accumulated deltas, adding the regularization term that we defined earlier (see the backprop sketch after this list).
  • Use gradient checking to make sure that the code we implemented is correct, by comparing the backprop gradients against a numerical approximation (sketched after this list).
  • Use gradient descent, advanced optimization, or other techniques to try to minimize the cost function J(theta).
  • Note that J(theta) for a neural network is non-convex, so gradient descent can in principle end up in a local optimum; in practice this is usually not much of a problem, as it tends to find a very good optimum even if it is not the global one.
  • If gradient descent in neural networks still feels magical, the graph below will help to visualize what is happening.
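
As a minimal sketch of the random-initialization step (assuming NumPy; the layer sizes of 3 inputs, 5 hidden units, and 1 output, and the epsilon value of 0.12, are just illustrative choices, not values from these notes):

```python
import numpy as np

def rand_initialize_weights(l_in, l_out, epsilon=0.12):
    """Return an l_out x (l_in + 1) matrix of weights drawn from [-epsilon, epsilon].

    The extra column is for the bias unit. Random, non-identical values break
    the symmetry that would otherwise keep every hidden unit computing the
    same function after each backprop update.
    """
    return np.random.uniform(-epsilon, epsilon, size=(l_out, l_in + 1))

# Assumed layer sizes: 3 inputs -> 5 hidden units -> 1 output.
Theta1 = rand_initialize_weights(3, 5)   # maps layer 1 to layer 2, shape (5, 4)
Theta2 = rand_initialize_weights(5, 1)   # maps layer 2 to layer 3, shape (1, 6)
```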
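
The per-example forward/backprop loop can be sketched roughly as below, assuming a three-layer network with sigmoid activations, a cross-entropy cost, and regularization applied to everything except the bias column. The tiny dataset, layer sizes, and lambda are made up for illustration; this is a sketch, not a definitive implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(Theta1, Theta2, X, y, lam):
    """Loop over the m examples, run forward prop, then backprop the errors.

    Returns the regularized cost J and the gradients dTheta1, dTheta2.
    """
    m = X.shape[0]
    Delta1 = np.zeros_like(Theta1)
    Delta2 = np.zeros_like(Theta2)
    J = 0.0

    for i in range(m):                       # one example at a time (for-loop version)
        # ---- forward propagation ----
        a1 = np.concatenate(([1.0], X[i]))   # add bias unit
        z2 = Theta1 @ a1
        a2 = np.concatenate(([1.0], sigmoid(z2)))
        z3 = Theta2 @ a2
        a3 = sigmoid(z3)                     # hypothesis h(x^(i))

        # ---- cost for this example (cross-entropy) ----
        J += -(y[i] * np.log(a3) + (1 - y[i]) * np.log(1 - a3)).sum()

        # ---- backpropagation of the error terms (deltas) ----
        d3 = a3 - y[i]                                            # output-layer error
        d2 = (Theta2[:, 1:].T @ d3) * sigmoid(z2) * (1 - sigmoid(z2))

        # ---- accumulate the gradient contributions ----
        Delta2 += np.outer(d3, a2)
        Delta1 += np.outer(d2, a1)

    # Average over m and add regularization (bias column is not regularized).
    J = J / m + (lam / (2 * m)) * (np.sum(Theta1[:, 1:] ** 2) + np.sum(Theta2[:, 1:] ** 2))
    dTheta1 = Delta1 / m
    dTheta2 = Delta2 / m
    dTheta1[:, 1:] += (lam / m) * Theta1[:, 1:]
    dTheta2[:, 1:] += (lam / m) * Theta2[:, 1:]
    return J, dTheta1, dTheta2

# Tiny made-up dataset: 4 examples, 3 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
y = np.array([0.0, 1.0, 1.0, 0.0])
Theta1 = rng.uniform(-0.12, 0.12, size=(5, 4))   # 3 inputs -> 5 hidden
Theta2 = rng.uniform(-0.12, 0.12, size=(1, 6))   # 5 hidden -> 1 output
J, g1, g2 = backprop(Theta1, Theta2, X, y, lam=1.0)
print(J, g1.shape, g2.shape)
```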
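
Gradient checking can be sketched as follows. To keep the example self-contained, the check is shown on a toy cost whose gradient is known in closed form; in practice J would be the neural-network cost and the analytic gradient would be the one produced by backprop.

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    """Approximate dJ/dtheta_i with the two-sided difference (J(t+eps) - J(t-eps)) / (2*eps)."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        t_plus, t_minus = theta.copy(), theta.copy()
        t_plus[i] += eps
        t_minus[i] -= eps
        grad[i] = (J(t_plus) - J(t_minus)) / (2 * eps)
    return grad

# Toy cost with a known gradient; in practice J is the network cost and
# analytic_grad is the gradient returned by backprop.
J = lambda t: np.sum(t ** 2) + 3 * t[0]
analytic_grad = lambda t: 2 * t + np.array([3.0, 0.0, 0.0])

theta = np.array([0.5, -1.0, 2.0])
num = numerical_gradient(J, theta)
ana = analytic_grad(theta)
# The relative difference should be tiny if the analytic gradient is correct.
print(num, ana)
print(np.linalg.norm(num - ana) / np.linalg.norm(num + ana))
```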

  • Here is an example plot with three axes: the cost function, the first parameter, and the second parameter, respectively.
  • Of course, in reality there tend to be many more parameters.
  • Where the cost function is low (a valley), the hypothesis is close (similar) to y; where it is high (a hill), the hypothesis is far from y.
  • Starting from the initial point, backprop determines the direction to move toward the optimum, and gradient descent keeps taking steps in that direction until it settles there (a minimal gradient descent loop is sketched after this list).
  • So what backprop with gradient descent (or a more advanced technique) is doing is finding parameter values whose hypothesis output matches the outputs in the training set.
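
To connect with the two-parameter picture, here is a minimal gradient descent loop on a made-up, bowl-shaped cost over two parameters. The surface, starting point, and learning rate are assumptions for illustration only; the real neural-network surface is non-convex, with many hills and valleys.

```python
import numpy as np

# A made-up bowl-shaped cost over two parameters (theta1, theta2).
def cost(theta):
    return (theta[0] - 1.0) ** 2 + 2.0 * (theta[1] + 0.5) ** 2

def gradient(theta):
    return np.array([2.0 * (theta[0] - 1.0), 4.0 * (theta[1] + 0.5)])

theta = np.array([4.0, 3.0])     # start somewhere up on a "hill"
alpha = 0.1                      # learning rate

for step in range(50):
    theta -= alpha * gradient(theta)   # take a step downhill

print(theta, cost(theta))  # theta approaches (1.0, -0.5), the bottom of the valley
```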