Choosing what features to use

2018-04-05 00:00 | Source

In the previous video, we implemented an anomaly detection system and evaluated it.
As it turns out, choosing features that let the algorithm distinguish anomalies from the rest of the data is a hard problem, and the choice has a huge impact on how well the system works.
This lesson gives suggestions and guidelines for selecting features for anomaly detection.

The graph shown above is a histogram of our data. As a sanity check, let's see whether the data matches a Gaussian distribution (a bell-shaped curve).
The graph on the bottom left does not look Gaussian, so we may want to apply a transformation to make the data look more Gaussian.
Each feature x can be transformed. In the formula, x1, x2, x3 and x4 show different ways to transform a feature into something that looks like the bell-shaped Gaussian in the bottom right.
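The formula itself is not reproduced in these notes. Transformations commonly tried for this purpose have the following form; the constant c and the exponents are assumptions to be tuned per feature, not values taken from the slide:

```latex
x_1 \leftarrow \log(x_1), \qquad
x_2 \leftarrow \log(x_2 + c), \qquad
x_3 \leftarrow \sqrt{x_3} = x_3^{1/2}, \qquad
x_4 \leftarrow x_4^{1/3}
```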

The first image shows the raw data, which clearly does not look Gaussian.
So we want to change it. As we can see in the terminal, Ng writes a few lines of code to try out different transformations until the histogram of the data looks like a bell-shaped Gaussian.
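The trial-and-error step above can be sketched in Python. This is not the code from the video: the data is a synthetic right-skewed sample, the candidate transformations and the constant 1.0 are arbitrary choices, and sample skewness (close to 0 for a symmetric, bell-shaped distribution) stands in for inspecting the histogram by eye.

```python
import math
import random
import statistics

def skewness(values):
    """Sample skewness: close to 0 for a symmetric, bell-shaped distribution."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)

# Hypothetical right-skewed feature (log-normal), standing in for the raw data.
rng = random.Random(0)
x = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]

# Candidate transformations to try; the constant and exponents are arbitrary.
candidates = {
    "x": lambda v: v,
    "log(x)": math.log,
    "log(x + 1)": lambda v: math.log(v + 1.0),
    "x ** 0.5": lambda v: v ** 0.5,
    "x ** (1/3)": lambda v: v ** (1.0 / 3.0),
}

# Skewness of the feature after each candidate transformation.
skew_by_name = {
    name: skewness([f(v) for v in x]) for name, f in candidates.items()
}
for name, s in skew_by_name.items():
    print(f"{name:12s} skewness = {s:+.3f}")

# Pick the transformation whose result is closest to symmetric.
best = min(skew_by_name, key=lambda name: abs(skew_by_name[name]))
print("closest to symmetric:", best)
```

Skewness is only a rough proxy. Plotting the histogram after each transformation, as done in the video, remains the more direct check.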

This slide shows a common problem that arises once the data has been modelled with a Gaussian.
Suppose there is an anomaly, marked by the green 'x', that lies just inside the Gaussian boundary used to decide whether an example is an anomaly.
Suppose this x really is an anomaly, but the Gaussian fails to flag it. We can create a new feature from other measurements and experiment with it until the anomaly lies further outside the Gaussian circle.
In the example above, feature x1 alone fails to flag the anomalous example, so we create a new feature x2. In the resulting 2D plot, the anomalous example lies much further away (its x2 value is very high, so it falls far outside the purple Gaussian circle).
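A minimal sketch of this effect, assuming p(x) is modelled as a product of independent per-feature Gaussians. The training data, the anomalous example and both thresholds are made up for illustration:

```python
import math
import random
import statistics

def gaussian_pdf(value, mean, std):
    """Density of a univariate Gaussian N(mean, std^2) at value."""
    z = (value - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def fit(column):
    """Estimate the mean and standard deviation of one feature."""
    return statistics.fmean(column), statistics.pstdev(column)

rng = random.Random(1)
# Hypothetical normal (non-anomalous) training examples as (x1, x2) pairs.
train = [(rng.gauss(5.0, 1.0), rng.gauss(2.0, 0.5)) for _ in range(1000)]
params = [fit([row[j] for row in train]) for j in range(2)]

def p(example, n_features):
    """Product of per-feature Gaussian densities over the first n_features."""
    prob = 1.0
    for j in range(n_features):
        mean, std = params[j]
        prob *= gaussian_pdf(example[j], mean, std)
    return prob

anomaly = (5.3, 6.0)  # x1 looks ordinary, x2 is unusually large
typical = (5.0, 2.0)

# Hypothetical thresholds; in practice epsilon is chosen on a validation set.
eps_1d, eps_2d = 0.05, 0.01
print("x1 only:   flagged =", p(anomaly, 1) < eps_1d)
print("x1 and x2: flagged =", p(anomaly, 2) < eps_2d)
```

With x1 alone, the anomalous example has a density similar to a typical one. Once x2 is included, its density collapses and it falls below the threshold.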

So when we look at the anomalous examples and it is unclear whether they stand out from the rest of the data, we should look for features that take unusually large or unusually small values for those examples, since such features are what separate the anomalies.

The new features x5 and x6 above are produced this way. We can come up with new features that combine two existing features, so that the anomalies stand out from the normal examples.
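These notes do not spell out how x5 and x6 are defined, so the sketch below assumes one possible combination: a ratio of two hypothetical measurements, CPU load and network traffic. All values are synthetic. It compares how far the anomalous example lies from the mean, in standard deviations, for each original feature and for the combined one.

```python
import random
import statistics

rng = random.Random(2)

# Hypothetical normal machines: CPU load grows roughly with network traffic.
normal = []
for _ in range(500):
    traffic = rng.uniform(10.0, 100.0)
    cpu = 0.8 * traffic + rng.gauss(0.0, 2.0)
    normal.append((cpu, traffic))

# Hypothetical anomaly: high CPU with low traffic.
# Each value on its own is within the range seen in normal machines.
anomaly = (70.0, 15.0)

def ratio(cpu, traffic):
    """Combined feature: CPU load relative to network traffic."""
    return cpu / traffic

def z_score(value, column):
    """Number of standard deviations between value and the mean of column."""
    return (value - statistics.fmean(column)) / statistics.pstdev(column)

z_cpu = z_score(anomaly[0], [c for c, _ in normal])
z_traffic = z_score(anomaly[1], [t for _, t in normal])
z_ratio = z_score(ratio(*anomaly), [ratio(c, t) for c, t in normal])

print(f"z(cpu) = {z_cpu:+.2f}")
print(f"z(traffic) = {z_traffic:+.2f}")
print(f"z(ratio) = {z_ratio:+.2f}")
```

Neither original feature is unusual on its own, but the ratio takes a value far outside what normal machines produce, which is exactly the property a good anomaly detection feature should have.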

SUMMARY

We introduced transformations whose parameters can be tuned to make a feature's distribution look like a bell-shaped Gaussian.
We also saw how to create new features (possibly by combining two existing features) to distinguish anomalies from the rest of the data.