Kernels II

  • This video continues with kernels and how they are used to define new features.
  • It fills in what was missing from the previous kernels video and discusses the bias/variance trade-off.
  • For every training example we have, we put a landmark at exactly the same location.
  • That way we can measure the similarity between each landmark and an example, and we end up with as many landmarks as training examples (see the line after this list).
  • The landmarks are placed according to the training examples' features (say x1 and x2); the colors of the plot shown on the right do not matter much.
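Concretely, placing one landmark at each training example just means:

$$l^{(1)} = x^{(1)},\quad l^{(2)} = x^{(2)},\quad \ldots,\quad l^{(m)} = x^{(m)}$$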

  • This section is where we put the SVM together with kernels.
  • First of all we create m landmarks, and assign each landmark to a training example x, so l(i) = x(i).
  • Then for every example x (which can come from the training, cross-validation, or test set) we compute the f vector.
  • The f vector is built by computing the similarity between x and every landmark (giving f1 through fm) and gathering all those similarity values into a single vector.
  • In other words, each example x(i) gets an m-dimensional feature vector f(i), with elements f1(i) through fm(i).
  • Among those f values there is always one, fi(i), that compares x(i) with its own landmark l(i) = x(i), so it equals exactly 1.
  • Then, instead of representing the example by x(i), we represent it by f(i), the m-dimensional vector whose elements are the similarities of x(i) to every landmark (a sketch of this computation follows this list).
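A minimal sketch of this feature mapping, assuming a Gaussian (RBF) similarity and NumPy; the function names and toy data are illustrative, not from the video:

```python
import numpy as np

def gaussian_similarity(x, landmark, sigma=1.0):
    # f = exp(-||x - l||^2 / (2 * sigma^2)): the Gaussian kernel similarity.
    return np.exp(-np.sum((x - landmark) ** 2) / (2 * sigma ** 2))

def feature_vector(x, landmarks, sigma=1.0):
    # Map one example x to its m-dimensional feature vector f,
    # using every training example as a landmark.
    return np.array([gaussian_similarity(x, l, sigma) for l in landmarks])

# Toy data: m = 3 training examples with 2 original features (x1, x2).
X_train = np.array([[1.0, 2.0],
                    [3.0, 0.5],
                    [0.0, 1.0]])

f = feature_vector(X_train[0], X_train, sigma=1.0)
print(f)  # f[0] == 1.0 because x(1) is its own landmark
```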

  • Instead of the usual hypothesis that uses x(i), we replace it with f(i).
  • The prediction then becomes: predict y = 1 whenever theta transpose times f is greater than or equal to 0.
  • Looking at the formula, each theta is multiplied by the corresponding f: theta0 + theta1*f1 + ... + thetam*fm.
  • Since f has one entry per training example, the number of theta parameters equals the number of training examples.
  • n = number of features, which now equals m (one feature per training example).
  • How do we get the parameter values theta? By minimizing over theta the SVM training cost function discussed earlier (written out after this list).
  • As we can see in that SVM formula, x(i) has already been replaced with f(i).
  • The regularization term is slightly different: as discussed, n is now the same as m, and keep in mind that theta0 is still unregularized.
  • For SVM problems with large training sets this can be computationally expensive, since we now have m parameters, one per training example.
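For reference, here is the training objective the bullets above describe, with x(i) replaced by f(i) (my reconstruction of the course formula; cost1 and cost0 are the SVM's hinge-like costs for y = 1 and y = 0):

$$\min_{\theta}\; C \sum_{i=1}^{m} \Big[ y^{(i)} \, \text{cost}_1\big(\theta^T f^{(i)}\big) + \big(1 - y^{(i)}\big) \, \text{cost}_0\big(\theta^T f^{(i)}\big) \Big] + \frac{1}{2} \sum_{j=1}^{m} \theta_j^2$$

with the hypothesis predicting y = 1 whenever $\theta^T f \ge 0$.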

  • So we slightly modify the regularization term to suit the SVM and make the calculation efficient.
  • Instead of the usual regularization term theta transpose times theta (still excluding theta0), implementations compute theta transpose times M times theta (see the note after this list).
  • M gives a rescaled version of the parameter vector that the kernelized SVM minimizes.
  • M is a mathematical detail, there for computational efficiency.
  • Kernels can also be applied to logistic regression, but doing so is computationally expensive and can be really slow.
  • The SVM, by its nature, goes well with kernels, and advanced optimization techniques written specifically for the SVM make it really efficient.
  • Use external packages rather than writing your own code, because they have been tested and tuned to compute really fast; the SVM is built into most common libraries.
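The regularization tweak mentioned above, as described in the lecture, amounts to replacing the plain sum of squares with a kernel-dependent rescaling:

$$\sum_{j=1}^{m} \theta_j^2 = \theta^T \theta \;\longrightarrow\; \theta^T M \theta$$

where M is a matrix that depends on the kernel; this changes the objective slightly but lets SVM software scale to much larger training sets.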

  • One more topic worth mentioning for the SVM with kernels.
  • Choose the value of C wisely: as we learned, a higher C tends toward overfitting (lower bias, higher variance), while a lower C tends toward underfitting (higher bias, lower variance).
  • The same goes for sigma squared in the Gaussian kernel: with a bigger sigma the features fi vary more smoothly over a longer range (higher bias), while a smaller sigma gives a narrower similarity that falls off rapidly (higher variance).
  • In summary, that is the SVM with kernels algorithm and how it behaves; a small usage sketch with an off-the-shelf SVM follows.
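A hedged usage sketch with scikit-learn's SVC, one of the packaged SVM implementations the notes recommend; its gamma parameter plays the role of 1/(2*sigma^2), and the dataset and parameter values below are illustrative only:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Toy, illustrative data (not from the course).
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_train, X_cv, y_train, y_cv = train_test_split(X, y, random_state=0)

# Larger C and larger gamma (smaller sigma) -> lower bias, higher variance;
# smaller C and smaller gamma (larger sigma) -> higher bias, lower variance.
for C, gamma in [(0.1, 0.5), (1.0, 1.0), (100.0, 10.0)]:
    clf = SVC(C=C, kernel="rbf", gamma=gamma).fit(X_train, y_train)
    print(f"C={C}, gamma={gamma}, CV accuracy={clf.score(X_cv, y_cv):.2f}")
```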