Non-linear hypothesis

  • an older idea, but still one of the most powerful learning algorithms for many ML problems

Regularized Logistic Regression

  • regularization applied both to gradient descent on the cost function and to the more advanced optimization routines that take the cost function and its derivative (see the sketch below)
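a rough Octave sketch of the regularized cost and gradient (function name is mine; sigmoid is assumed defined elsewhere; theta(1), i.e. theta_0, is left unregularized as in the course convention):

    function [jVal, gradient] = regularizedLogisticCost(theta, X, y, lambda)
      % X: m x (n+1) design matrix with a leading column of ones
      m = length(y);
      h = sigmoid(X * theta);              % logistic hypothesis
      regTheta = [0; theta(2:end)];        % do not regularize theta(1)
      jVal = -(1/m) * sum(y .* log(h) + (1 - y) .* log(1 - h)) ...
             + (lambda/(2*m)) * sum(regTheta .^ 2);
      gradient = (1/m) * (X' * (h - y)) + (lambda/m) * regTheta;
    end

this same [jVal, gradient] shape is what both the gradient descent loop and the advanced optimizers expect.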

Regularized Linear Regression

  • regularization applied to both the gradient descent and normal equation algorithms for linear regression (see the sketch below)
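a sketch of both regularized forms in Octave (my own variable names; X has a leading column of ones, theta(1) is not regularized):

    % one gradient descent step for regularized linear regression
    regTheta = [0; theta(2:end)];                        % leave theta(1) unregularized
    theta = theta - (alpha/m) * (X' * (X*theta - y) + lambda * regTheta);

    % regularized normal equation: identity matrix with its (1,1) entry zeroed
    n = size(X, 2);
    L = eye(n);  L(1, 1) = 0;
    theta = pinv(X' * X + lambda * L) * X' * y;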

The problem of overfitting

the problem of a learning algorithm overfitting the data
regularization: a technique for reducing the overfitting problem

high bias (underfit): the hypothesis is too simple and fits the data poorly
high variance (overfit): too many features (e.g. many high-order polynomial terms) and too little data to pin down a good hypothesis

the same problem also shows up in logistic regression
plotting the fitted hypothesis is one tool for judging whether the algorithm is overfitting or underfitting...

with a lot of features the hypothesis can contain many high-order polynomial terms....
which makes it much harder to visualize the fit (e.g. with over 100 features)


first option: reduce the number of features, either manually or with an automatic model-selection algorithm (discussed later in greater depth)..
the disadvantage is that we throw information away; we may not know in advance which features will turn out to be useful, and sometimes all the features matter...

second option: use regularization, where we keep all the features but reduce the magnitude of the parameters theta_j, so each feature contributes only a small amount to the hypothesis..
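the idea, written out for linear regression (lambda is the regularization parameter; theta_0 is conventionally left out of the penalty):

    J(\theta) = \frac{1}{2m} \Big[ \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \Big]

a larger lambda pushes the theta_j toward zero, so the corresponding features contribute less.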

multiclass classification

-the one-vs-all (one-vs-rest) classification algorithm
classification with more than two classes

whether the class labels start from index 0 or 1 doesn't matter..

the algorithms defined earlier handle binary classification with logistic regression...
the one-vs-all algorithm extends this to solve multiclass classification.
  • the algorithm essentially splits the data into two groups for each class, that class against all the rest, and trains one hypothesis per class (denoted by a superscript); to predict, run all the hypotheses and pick the class with the highest value (see the sketch below)
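a rough Octave sketch of the prediction step only (allTheta is assumed to hold one fitted parameter row per class; sigmoid assumed defined):

    % one-vs-all prediction: evaluate every per-class hypothesis, pick the most confident
    % X: m x (n+1) design matrix; allTheta: K x (n+1), one row per class
    probabilities = sigmoid(X * allTheta');          % m x K matrix of h(x) values
    [~, predictions] = max(probabilities, [], 2);    % class index with the largest h(x)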

Advanced Optimization

-optimization algorithms that run faster than gradient descent
-scale efficiently to problems with very many features


the new algorithms minimize the cost function, given code that computes J(theta) and its derivative terms


the 3 other optimization algorithms (conjugate gradient, BFGS, L-BFGS) are very complex and shouldn't be implemented by hand unless you are an expert in numerical computing....

for any given language, choose the best library by trying it on the particular problem we have...  writing our own implementation is not really recommended


after writing the cost function code (shown at the right of the slide), we can write the optimization call itself,
passing in the options that were set,

the Octave function fminunc takes a pointer (handle) to the cost function, an initial theta, and the options; it chooses the learning rate automatically and returns the optimal theta...
optimset = builds the set of options for the optimization (e.g. 'GradObj' on, maximum iterations)

here's how to implement it in Octave, once the cost function has been defined

exitFlag indicates whether the cost function actually converged...
initialTheta has to be a vector with at least 2 elements (fminunc needs theta to be at least 2-dimensional)

this is the optimization of a simple quadratic function... (sketched below)
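a sketch along those lines, with a 2-element theta and the minimum placed at (5, 5) (the exact constants on the slide may differ):

    % costFunction.m : simple quadratic cost with its minimum at theta = (5, 5)
    function [jVal, gradient] = costFunction(theta)
      jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;
      gradient = zeros(2, 1);
      gradient(1) = 2 * (theta(1) - 5);
      gradient(2) = 2 * (theta(2) - 5);
    end

    options = optimset('GradObj', 'on', 'MaxIter', 100);  % we supply the gradient ourselves
    initialTheta = zeros(2, 1);                            % at least a 2-element vector
    [optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);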


remember that Octave indexing starts from 1, so theta_0 becomes theta(1)
to use this for logistic regression, we need code that computes the cost function and its gradient...


these advanced optimizers can do better than plain gradient descent.. (see the sketch below)
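a sketch of a logistic regression cost function in the form the optimizers expect (unregularized; names mine, sigmoid assumed defined, X with a leading column of ones):

    function [jVal, gradient] = logisticCost(theta, X, y)
      m = length(y);
      h = sigmoid(X * theta);
      jVal = -(1/m) * sum(y .* log(h) + (1 - y) .* log(1 - h));
      gradient = (1/m) * (X' * (h - y));
    end

    % wrap it so fminunc sees a function of theta alone
    options = optimset('GradObj', 'on', 'MaxIter', 400);
    [optTheta, cost, exitFlag] = fminunc(@(t) logisticCost(t, X, y), initialTheta, options);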

Simplified cost function and gradient descent

-a simpler cost function
-apply gradient descent to logistic regression
-a fully working logistic regression

there is a simpler, more compact way to write the cost function


taking advantage of the fact that y is always either 0 or 1,
the two cases can be compressed into one line: when y = 1 the (1 - y) term vanishes, and when y = 0 the y term vanishes..
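written out:

    Cost(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))

    J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \big[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \big]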



this one-line cost is plugged into the same overall form of J(theta) used for linear regression (an average over the training examples)
what's left is how to minimize J(theta)


the gradient descent update shown above already includes the derivative of the cost with respect to theta

it is almost the same as linear regression, except that the hypothesis h_theta(x) is now the sigmoid stated above...
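the update rule (same form as linear regression, but with the sigmoid hypothesis):

    \theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \, x_j^{(i)}, \qquad h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}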

the same technique as in linear regression also applies for checking that gradient descent is converging to the optimum...
(plot J(theta) at every iteration and check that it decreases on every iteration)


this is the vectorized implementation of gradient descent for logistic regression... (one-line sketch below)
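a one-line Octave version of the update (X is the m x (n+1) design matrix; sigmoid assumed defined):

    theta = theta - (alpha/m) * X' * (sigmoid(X*theta) - y);   % updates every theta_j simultaneously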

feature scaling also applies here and makes gradient descent converge faster

this is the most widely used algorithm for classification...


Decision Boundary

-intuition for the logistic regression hypothesis.
the sigmoid asymptotes toward 1 on one side (and toward 0 on the other)

so to classify between 0 and 1,
we predict y = 1 whenever the hypothesis >= 0.5,
and y = 0 whenever the hypothesis < 0.5; since g(z) >= 0.5 exactly when z >= 0, this is the same as predicting y = 1 whenever theta'x >= 0

so that's how predictions are made; next, how the hypothesis gives rise to a decision boundary

the region (the purple and blue areas) is set not by the training set, but by the value of theta (the parameters). here the thetas are -3, 1, 1 respectively, predefined earlier....
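plugging those values in:

    h_\theta(x) = g(-3 + x_1 + x_2), \quad \text{predict } y = 1 \iff -3 + x_1 + x_2 \ge 0 \iff x_1 + x_2 \ge 3

so the line x_1 + x_2 = 3 is the decision boundary separating the two regions.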

when theta is still unknown, the training set is used to fit the value of theta...



next,  we talk about non-linear decision boundaries....


the first graph shows a non-linear decision boundary: y = 1 inside the purple circle... and y = 0 everywhere else (outside the circle)...

again, the decision boundary is determined by the values of theta..
the training set is used to fit the parameters, but it is the parameters themselves that define the decision boundary...
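for illustration only, with made-up parameter values (not necessarily the ones in the figure), using the extra features x_1^2 and x_2^2 and theta = (1, 0, 0, -1, -1):

    h_\theta(x) = g(1 - x_1^2 - x_2^2), \quad \text{predict } y = 1 \iff x_1^2 + x_2^2 \le 1

i.e. the boundary is the unit circle, with y = 1 predicted inside it.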


the second graph uses more complex parameters with higher-order polynomial features...
as we can see, the boundary becomes much more complicated...

these visualizations give a better sense of the wide range of boundaries the hypothesis can represent...


next, we talk about how to choose the values of the parameters automatically from the training set...

Classification


linear regression does not work well for classification: the fitted line is easily thrown off by the data, and its outputs are not confined to the 0/1 label values

hypothesis representation

the target is to build a hypothesis whose output is always between zero and one: 0 <= h_theta(x) <= 1

the hypothesis for logistic regression is modified as stated above.

the sigmoid function and the logistic function are basically the same thing.

we're going to write z for theta transpose x (z = theta'x).

as z approaches positive infinity, g(z) approaches 1;
likewise, as z approaches negative infinity, g(z) approaches 0
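the formulas being described:

    h_\theta(x) = g(\theta^T x), \qquad g(z) = \frac{1}{1 + e^{-z}}, \qquad 0 < g(z) < 1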

by convention the hypothesis gives the probability of y = 1,
so if h_theta(x) = 0.7 there is a
70% chance of the tumor being malignant;
the default for x0 is, as usual, 1,
and x1 is the tumor-size feature

the probabilities of y = 0 and y = 1 must sum to 1, so P(y = 0) = 1 - h_theta(x)
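written out:

    h_\theta(x) = P(y = 1 \mid x; \theta), \qquad P(y = 0 \mid x; \theta) = 1 - h_\theta(x)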

in summary, this lesson introduces the hypothesis function and the mathematics behind logistic regression...