Kernels II

For every training example we have, put a landmark at exactly the same location.
That way each feature measures how similar an example is to one of the training examples. So we end up with as many landmarks as training examples.
In the plot, the examples are placed by their features (say x1 and x2). The colors of the points in the graph on the right do not matter much.

This section puts SVMs together with kernels.
First, given m training examples, choose m landmarks l(1), ..., l(m) and set each landmark to a training example: l(i) = x(i).
Then, for any example x (it can come from the training, cross-validation, or test set), compute the feature vector f.
The vector f collects the similarities between x and every landmark: f1 = similarity(x, l(1)), ..., fm = similarity(x, l(m)).
In other words, example x(i) is mapped to an m-dimensional vector f(i) with elements f1 through fm.
Among those elements, the i-th one compares x(i) with its own landmark l(i) = x(i), so with the Gaussian kernel it equals exactly 1.
From here on, instead of representing the example by x(i), we represent it by f(i), the vector of its similarities to all landmarks.
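
The mapping from x to f described above can be sketched in a few lines of NumPy. This is only an illustration, assuming the Gaussian kernel as the similarity function; the names gaussian_kernel and feature_vector and the toy data are made up for this example.

```python
import numpy as np

def gaussian_kernel(x, l, sigma):
    """Similarity between example x and landmark l: exp(-||x - l||^2 / (2 * sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(l, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def feature_vector(x, landmarks, sigma):
    """Map x to f = [f1, ..., fm], one similarity value per landmark."""
    return np.array([gaussian_kernel(x, l, sigma) for l in landmarks])

# Toy training set (made-up values); every example doubles as a landmark.
X_train = np.array([[1.0, 2.0], [3.0, 1.0], [0.0, 0.0]])
landmarks = X_train.copy()

# Row i is f(i), the feature vector of training example x(i).
F_train = np.array([feature_vector(x, landmarks, sigma=1.0) for x in X_train])
```

With m = 3 examples, F_train is a 3 x 3 matrix, and its diagonal holds the similarity of each example to its own landmark.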

Instead of the usual hypothesis that uses x(i), we use f(i).
So the prediction is computed from theta transpose times f: predict y = 1 if thetaT * f >= 0.
Looking at the formula, each theta_j is multiplied by the matching f_j.
Since f has one element per training example, the number of parameters theta matches the number of training examples (plus theta0).
n = number of features, which here equals m (the number of training examples).
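
The prediction step can be sketched as follows. This assumes the usual intercept convention f0 = 1, so theta has m + 1 entries; the theta values are hypothetical and hand-picked for illustration, not learned.

```python
import numpy as np

def predict(theta, f):
    """Predict 1 if thetaT * f >= 0, else 0. An intercept feature f0 = 1 is prepended to f."""
    f_with_bias = np.concatenate(([1.0], f))
    return 1 if float(theta @ f_with_bias) >= 0 else 0

# Hypothetical parameters for m = 3 landmarks: [theta0, theta1, theta2, theta3].
theta = np.array([-0.5, 1.0, 1.0, 0.0])

f_near = np.array([0.9, 0.1, 0.0])    # example close to landmark 1
f_far = np.array([0.05, 0.02, 0.0])   # example far from every landmark

label_near = predict(theta, f_near)
label_far = predict(theta, f_far)
```

Here the example near landmark 1 is classified as 1, and the example far from all landmarks is classified as 0, because only theta0 contributes noticeably to its score.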
How do we get the parameter values (theta)? We use the SVM training cost function discussed earlier and minimize it with respect to theta.
In the SVM formula, x(i) is now replaced by f(i).
The regularization term is also slightly different: since n = m, the sum over theta_j runs from j = 1 to m. Keep in mind that theta0 is still not regularized.
With kernels we have m features and therefore m parameters, so when the training set is large, solving for all of them becomes computationally expensive.
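
Written out, the training objective described above looks like this. The notation cost_1 and cost_0 for the two SVM cost terms is assumed to match the cost function from the earlier section.

```latex
\min_{\theta} \; C \sum_{i=1}^{m} \left[ y^{(i)} \, \mathrm{cost}_1\!\left(\theta^{T} f^{(i)}\right)
  + \left(1 - y^{(i)}\right) \mathrm{cost}_0\!\left(\theta^{T} f^{(i)}\right) \right]
  + \frac{1}{2} \sum_{j=1}^{m} \theta_j^{2}
```

The regularization sum starts at j = 1, so theta0 is left out, and it ends at m because n = m.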

So the regularization term is modified slightly to suit the SVM and make the computation efficient.
The usual regularization term, the sum of theta_j squared, can be written as thetaT * theta (excluding theta0).
In practice, most SVM implementations replace it with thetaT * M * theta, where M is a matrix that depends on the kernel. This is a rescaled version of thetaT * theta, so it still keeps theta small.
M is a mathematical detail; it is there for computational efficiency and lets the SVM scale to much larger training sets.
Kernels can also be applied to logistic regression, but the tricks that make SVMs fast do not carry over, so it tends to be very slow.
SVMs and kernels go well together: advanced optimization techniques designed specifically for SVMs make the combination efficient.
Use an existing software package instead of writing your own solver, because those packages have already been tested and optimized to run fast. SVM support is built into common libraries.
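
As one example of relying on a library, here is a minimal sketch using scikit-learn's SVC. The choice of library and the toy data are assumptions for illustration; these notes do not name a specific package. scikit-learn writes the Gaussian (RBF) kernel as exp(-gamma * ||x - l||^2), so gamma = 1 / (2 * sigma^2). Internally it solves an equivalent formulation rather than the theta form written in these notes.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data (made-up values): two well-separated clusters.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [2.0, 2.0], [2.1, 1.9], [1.8, 2.2]])
y = np.array([0, 0, 0, 1, 1, 1])

sigma = 1.0
gamma = 1.0 / (2.0 * sigma ** 2)  # convert sigma to scikit-learn's gamma

# The library handles landmarks, kernel evaluation, and optimization internally.
clf = SVC(kernel="rbf", C=1.0, gamma=gamma)
clf.fit(X, y)

preds = clf.predict(np.array([[0.1, 0.1], [2.0, 2.1]]))
```

The two query points sit inside the two clusters, so they are assigned to class 0 and class 1 respectively.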

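One more topic worth mentioning for SVMs with kernels is how to choose the parameters.
Choose C carefully. C plays a role similar to 1/lambda: a large C gives lower bias and higher variance (more prone to overfitting), while a small C gives higher bias and lower variance (more prone to underfitting).
The other parameter is sigma squared of the Gaussian kernel. With a larger sigma squared the features fi vary smoothly and fall off slowly over a long range (higher bias, lower variance); with a smaller sigma squared they fall off rapidly and the kernel is narrower (lower bias, higher variance).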
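
The effect of sigma squared can be checked numerically. The sketch below evaluates the Gaussian kernel at a few distances from a landmark for a small and a large sigma squared; the specific values 0.5 and 4.0 are arbitrary choices for illustration.

```python
import numpy as np

def gaussian_similarity(distance, sigma_squared):
    """Gaussian kernel value as a function of the distance ||x - l||."""
    return np.exp(-distance ** 2 / (2.0 * sigma_squared))

# Distances from the landmark at which the similarity is evaluated.
distances = np.array([0.0, 1.0, 2.0, 3.0])

narrow = gaussian_similarity(distances, sigma_squared=0.5)  # small sigma^2: rapid fall-off
wide = gaussian_similarity(distances, sigma_squared=4.0)    # large sigma^2: slow, smooth fall-off
```

Both curves start at 1 at distance 0, but at every positive distance the wide kernel stays higher than the narrow one.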