- In large-scale problems, n-dimensional feature vectors often have thousands of features (n), and most of them are highly correlated.
- This lesson shows how to reduce, for example, a thousand features (n) down to a hundred features (k) effectively while still retaining most of the data (see the sketch below).
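- A minimal sketch of that reduction with numpy. The dimensions (n = 1000, k = 100), the sample size, and the placeholder data X are hypothetical, and the features are assumed to be mean-normalized already:

```python
import numpy as np

m, n, k = 500, 1000, 100          # hypothetical: 500 examples, 1000 features, keep 100
X = np.random.randn(m, n)         # placeholder data; assumes mean-normalized features

Sigma = (X.T @ X) / m             # n x n covariance matrix
U, S, Vt = np.linalg.svd(Sigma)   # singular value decomposition
U_reduce = U[:, :k]               # first k principal directions
Z = X @ U_reduce                  # compressed data: m x k
X_approx = Z @ U_reduce.T         # reconstruction back in n dimensions
```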
- To choose k, we need at least a quantitative measure of how much (in %) of the original data we retain.
- Usually we target about 99%, 95%, or 90%. The preferred phrasing is "__% of variance is retained".
- That is a lot simpler than saying "we chose k components, because ... and ... and the resulting error is ...".
- A typical choice is to retain >= 95% of the variance.
- The fraction of variance lost is the average squared projection error divided by the total variation in the data:

  $$\frac{\frac{1}{m}\sum_{i=1}^{m}\left\|x^{(i)} - x_{\text{approx}}^{(i)}\right\|^{2}}{\frac{1}{m}\sum_{i=1}^{m}\left\|x^{(i)}\right\|^{2}}$$

  and the % of variance retained is 1 minus this ratio, so "99% retained" means the ratio is at most 0.01.
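- As a sketch, this ratio can be computed directly from the reconstruction; X and X_approx are the illustrative names from the snippet above:

```python
import numpy as np

def variance_lost(X, X_approx):
    """Average squared projection error divided by total variation.

    The 1/m factors cancel, so plain sums suffice.
    """
    return np.sum((X - X_approx) ** 2) / np.sum(X ** 2)

# 99% of variance is retained when variance_lost(X, X_approx) <= 0.01
```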
- Check whether k = 1 satisfies the requirement. If it doesn't, increase k by one and check again.
- This is really inefficient if we rerun PCA and recompute the projection error for every candidate k.
- Instead, compute [U, S, V] = svd(Sigma) once: the matrix S is diagonal (all other entries zero), and the variance retained for any k is simply $\sum_{i=1}^{k} S_{ii} \big/ \sum_{i=1}^{n} S_{ii}$, which is much simpler to evaluate per iteration (see the sketch below).
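- A short sketch of this trick, assuming U, S, Vt = np.linalg.svd(Sigma) from the earlier snippet; no recomputation per k is needed:

```python
import numpy as np

# S holds the singular values of the covariance matrix, in decreasing order,
# so one cumulative sum gives the retained variance for every k at once.
retained = np.cumsum(S) / np.sum(S)   # retained[k-1] = fraction retained by first k components
print(retained[99])                   # e.g., fraction retained with k = 100
```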
- Remember that k in PCA is the number of components kept, and from k we can compute the % of variance retained, in other words, how much of the original data survives the compression.
- Alternatively, we can initialize k = 1;
- Then, using the S-matrix formula above, increase k and choose the smallest value for which "99% of variance retained" is satisfied (see the sketch below).
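- A sketch of the smallest-k search, using the `retained` array from the previous snippet; since it is nondecreasing, a sorted search finds the answer directly:

```python
import numpy as np

# First index where retained variance reaches 0.99, converted to a count k.
k = int(np.searchsorted(retained, 0.99) + 1)
```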
- If you set k manually, back up the choice with the variance-retained formula given above.
- That way, others can see the reasoning behind your recommendation and approve it.
- PCA tries to minimize the average squared projection error between the original data (x) and its projection (x_approx).
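- In symbols, the objective PCA minimizes is

```latex
\min \; \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - x_{\text{approx}}^{(i)} \right\|^{2}
```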
- SUMMARY
- For thousand-dimensional, highly correlated data, PCA compresses each very large vector to a much smaller one while retaining most of the data.