Non-linear hypothesis

  • an older idea, but still one of the most powerful learning algorithms for many ML problems

Regularized Logistic Regression

  • regularization applied both to gradient descent on the cost function and to the more advanced optimization routines that take the cost function and its derivative (see the sketch below)
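a rough Octave sketch of the regularized cost and gradient (function name is mine; sigmoid is assumed defined elsewhere; theta(1), i.e. theta_0, is left unregularized as in the course convention):

    function [jVal, gradient] = regularizedLogisticCost(theta, X, y, lambda)
      % X: m x (n+1) design matrix with a leading column of ones
      m = length(y);
      h = sigmoid(X * theta);              % logistic hypothesis
      regTheta = [0; theta(2:end)];        % do not regularize theta(1)
      jVal = -(1/m) * sum(y .* log(h) + (1 - y) .* log(1 - h)) ...
             + (lambda/(2*m)) * sum(regTheta .^ 2);
      gradient = (1/m) * (X' * (h - y)) + (lambda/m) * regTheta;
    end

this same [jVal, gradient] shape is what both the gradient descent loop and the advanced optimizers expect.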

Regularized Linear Regression

  • regularization applied to both the gradient descent and normal equation algorithms for linear regression (see the sketch below)
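a sketch of both regularized forms in Octave (my own variable names; X has a leading column of ones, theta(1) is not regularized):

    % one gradient descent step for regularized linear regression
    regTheta = [0; theta(2:end)];                        % leave theta(1) unregularized
    theta = theta - (alpha/m) * (X' * (X*theta - y) + lambda * regTheta);

    % regularized normal equation: identity matrix with its (1,1) entry zeroed
    n = size(X, 2);
    L = eye(n);  L(1, 1) = 0;
    theta = pinv(X' * X + lambda * L) * X' * y;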

The problem of overfitting

the problem of a learning algorithm overfitting the data
regularization: a technique for reducing the overfitting problem

high bias (underfit): the hypothesis is too simple and fits the data poorly
high variance (overfit): too many features (e.g. many high-order polynomial terms) and too little data to pin down a good hypothesis

the same problem also shows up in logistic regression
plotting the fitted hypothesis is one tool for judging whether the algorithm is overfitting or underfitting...

with a lot of features the hypothesis can contain many high-order polynomial terms....
which makes it much harder to visualize the fit (e.g. with over 100 features)


first option: reduce the number of features, either manually or with an automatic model-selection algorithm (discussed later in greater depth)..
the disadvantage is that we throw information away; we may not know in advance which features will turn out to be useful, and sometimes all the features matter...

second option: use regularization, where we keep all the features but reduce the magnitude of the parameters theta_j, so each feature contributes only a small amount to the hypothesis..
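the idea, written out for linear regression (lambda is the regularization parameter; theta_0 is conventionally left out of the penalty):

    J(\theta) = \frac{1}{2m} \Big[ \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \Big]

a larger lambda pushes the theta_j toward zero, so the corresponding features contribute less.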

multiclass classification

-the one-vs-all (one-vs-rest) classification algorithm
classification with more than two classes

whether the class labels start from index 0 or 1 doesn't matter..

the algorithms defined earlier handle binary classification with logistic regression...
the one-vs-all algorithm extends this to solve multiclass classification.
  • the algorithm essentially splits the data into two groups for each class, that class against all the rest, and trains one hypothesis per class (denoted by a superscript); to predict, run all the hypotheses and pick the class with the highest value (see the sketch below)
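a rough Octave sketch of the prediction step only (allTheta is assumed to hold one fitted parameter row per class; sigmoid assumed defined):

    % one-vs-all prediction: evaluate every per-class hypothesis, pick the most confident
    % X: m x (n+1) design matrix; allTheta: K x (n+1), one row per class
    probabilities = sigmoid(X * allTheta');          % m x K matrix of h(x) values
    [~, predictions] = max(probabilities, [], 2);    % class index with the largest h(x)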

Advanced Optimization

-optimization algorithms that run faster than gradient descent
-scale efficiently to problems with very many features


the new algorithms minimize the cost function, given code that computes J(theta) and its derivative terms


the 3 other optimization algorithms (conjugate gradient, BFGS, L-BFGS) are very complex and shouldn't be implemented by hand unless you are an expert in numerical computing....

for any given language, choose the best library by trying it on the particular problem we have...  writing our own implementation is not really recommended


after writing the cost function code (shown at the right of the slide), we can write the optimization call itself,
passing in the options that were set,

the Octave function fminunc takes a pointer (handle) to the cost function, an initial theta, and the options; it chooses the learning rate automatically and returns the optimal theta...
optimset = builds the set of options for the optimization (e.g. 'GradObj' on, maximum iterations)

here's how to implement it in Octave, once the cost function has been defined

exitFlag indicates whether the cost function actually converged...
initialTheta has to be a vector with at least 2 elements (fminunc needs theta to be at least 2-dimensional)

this is the optimization of a simple quadratic function... (sketched below)
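a sketch along those lines, with a 2-element theta and the minimum placed at (5, 5) (the exact constants on the slide may differ):

    % costFunction.m : simple quadratic cost with its minimum at theta = (5, 5)
    function [jVal, gradient] = costFunction(theta)
      jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;
      gradient = zeros(2, 1);
      gradient(1) = 2 * (theta(1) - 5);
      gradient(2) = 2 * (theta(2) - 5);
    end

    options = optimset('GradObj', 'on', 'MaxIter', 100);  % we supply the gradient ourselves
    initialTheta = zeros(2, 1);                            % at least a 2-element vector
    [optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);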


remember that Octave indexing starts from 1, so theta_0 becomes theta(1)
to use this for logistic regression, we need code that computes the cost function and its gradient...


these advanced optimizers can do better than plain gradient descent.. (see the sketch below)
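a sketch of a logistic regression cost function in the form the optimizers expect (unregularized; names mine, sigmoid assumed defined, X with a leading column of ones):

    function [jVal, gradient] = logisticCost(theta, X, y)
      m = length(y);
      h = sigmoid(X * theta);
      jVal = -(1/m) * sum(y .* log(h) + (1 - y) .* log(1 - h));
      gradient = (1/m) * (X' * (h - y));
    end

    % wrap it so fminunc sees a function of theta alone
    options = optimset('GradObj', 'on', 'MaxIter', 400);
    [optTheta, cost, exitFlag] = fminunc(@(t) logisticCost(t, X, y), initialTheta, options);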

Simplified cost function and gradient descent

-a simpler cost function
-apply gradient descent to logistic regression
-a fully working logistic regression

there is a simpler, more compact way to write the cost function


taking advantage of the fact that y is always either 0 or 1,
the two cases can be compressed into one line: when y = 1 the (1 - y) term vanishes, and when y = 0 the y term vanishes..
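written out:

    Cost(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))

    J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \big[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \big]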



this one-line cost is plugged into the same overall form of J(theta) used for linear regression (an average over the training examples)
what's left is how to minimize J(theta)


the gradient descent update shown above already includes the derivative of the cost with respect to theta

it is almost the same as linear regression, except that the hypothesis h_theta(x) is now the sigmoid stated above...
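the update rule (same form as linear regression, but with the sigmoid hypothesis):

    \theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \, x_j^{(i)}, \qquad h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}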

the same technique as in linear regression also applies for checking that gradient descent is converging to the optimum...
(plot J(theta) at every iteration and check that it decreases on every iteration)


this is the vectorized implementation of gradient descent for logistic regression... (one-line sketch below)
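a one-line Octave version of the update (X is the m x (n+1) design matrix; sigmoid assumed defined):

    theta = theta - (alpha/m) * X' * (sigmoid(X*theta) - y);   % updates every theta_j simultaneously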

feature scaling also applies here and makes gradient descent converge faster

this is the most widely used algorithm for classification...


Decision Boundary

-intuition for the logistic regression hypothesis.
the sigmoid asymptotes toward 1 on one side (and toward 0 on the other)

so to classify between 0 and 1,
we predict y = 1 whenever the hypothesis >= 0.5,
and y = 0 whenever the hypothesis < 0.5; since g(z) >= 0.5 exactly when z >= 0, this is the same as predicting y = 1 whenever theta'x >= 0

so that's how predictions are made; next, how the hypothesis gives rise to a decision boundary

the region (the purple and blue areas) is set not by the training set, but by the value of theta (the parameters). here the thetas are -3, 1, 1 respectively, predefined earlier....
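plugging those values in:

    h_\theta(x) = g(-3 + x_1 + x_2), \quad \text{predict } y = 1 \iff -3 + x_1 + x_2 \ge 0 \iff x_1 + x_2 \ge 3

so the line x_1 + x_2 = 3 is the decision boundary separating the two regions.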

when theta is still unknown, the training set is used to fit the value of theta...



next,  we talk about non-linear decision boundaries....


the first graph shows a non-linear decision boundary: y = 1 inside the purple circle... and y = 0 everywhere else (outside the circle)...

again, the decision boundary is determined by the values of theta..
the training set is used to fit the parameters, but it is the parameters themselves that define the decision boundary...
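for illustration only, with made-up parameter values (not necessarily the ones in the figure), using the extra features x_1^2 and x_2^2 and theta = (1, 0, 0, -1, -1):

    h_\theta(x) = g(1 - x_1^2 - x_2^2), \quad \text{predict } y = 1 \iff x_1^2 + x_2^2 \le 1

i.e. the boundary is the unit circle, with y = 1 predicted inside it.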


the second graph uses more complex parameters with higher-order polynomial features...
as we can see, the boundary becomes much more complicated...

these visualizations give a better sense of the wide range of boundaries the hypothesis can represent...


next, we talk about how to choose the values of the parameters automatically from the training set...

Classification


linear regression does not work well for classification: the fitted line is easily thrown off by the data, and its outputs are not confined to the 0/1 label values

hypothesis representation

the target is to build a hypothesis whose output is always between zero and one: 0 <= h_theta(x) <= 1

the hypothesis for logistic regression is modified as stated above.

the sigmoid function and the logistic function are basically the same thing.

we're going to write z for theta transpose x (z = theta'x).

as z approaches positive infinity, g(z) approaches 1;
likewise, as z approaches negative infinity, g(z) approaches 0
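the formulas being described:

    h_\theta(x) = g(\theta^T x), \qquad g(z) = \frac{1}{1 + e^{-z}}, \qquad 0 < g(z) < 1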

by convention the hypothesis gives the probability of y = 1,
so if h_theta(x) = 0.7 there is a
70% chance of the tumor being malignant;
the default for x0 is, as usual, 1,
and x1 is the tumor-size feature

the probabilities of y = 0 and y = 1 must sum to 1, so P(y = 0) = 1 - h_theta(x)
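written out:

    h_\theta(x) = P(y = 1 \mid x; \theta), \qquad P(y = 0 \mid x; \theta) = 1 - h_\theta(x)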

in summary, this lesson introduces the hypothesis function and the mathematics behind logistic regression...