Today I want to talk about Bayesian learning. Bayes' theorem has been around for a while. It is named after the statistician Thomas Bayes. Interestingly enough, he died before his major scientific achievement was published to the world as we know it today.

from IPython.display import Image

Bayes found that when event A depends on event B, the probability of A given B only counts the cases where B is also present.

I'm not going to get bogged down in the basics here, but some intuition about Bayes will be useful in the later discussion.

NOTE: This is what I have learned so far from the Bayesian Learning lessons in Udacity's Machine Learning: Supervised Learning course.

Machine learning is a relatively new field in artificial intelligence. It is about teaching a machine to learn from data. In other words, from the data we infer the function that produces the output. We act as the teacher: here are the problems, here are the solutions. Then the learner, in this case the machine, learns by induction from that experience how to produce the solution for future input.

Bayesian learning covers machine learning algorithms that apply Bayes' rule to the learning problem. This is our usual Bayes:

Image(url='https://www.evernote.com/shard/s376/res/d587c056-ec5d-46f7-8d5f-870bf1adf80d/Screen%20Shot%202014-10-22%20at%202.29.31%20PM.jpg?resizeSmall&width=821')

Bayesian learning calculates which hypothesis in our hypothesis set is the most probable, so that it can act as the mapping function from our data to our labeled output.

Now, how do we implement Bayes in machine learning? It turns out the data relates directly to our hypothesis. This is how we calculate the probability of a hypothesis given the data, P(h|D): it depends on the probability of the data given the hypothesis, P(D|h), times the prior P(h), divided by our normalization term P(D). This is for one particular hypothesis h out of our possible hypotheses.

Image(url='https://www.evernote.com/shard/s376/res/4569bdad-ad84-4355-a8e2-efe9020b96c9')
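To make the rule concrete, here is a minimal sketch of computing P(h|D) for every hypothesis at once. The function name `posterior`, the hypothesis names, and all the numbers are made up for illustration; they are not from the course.

```python
def posterior(priors, likelihoods):
    """Return P(h|D) for each hypothesis, given P(h) and P(D|h)."""
    # Unnormalized posterior for each h: P(D|h) * P(h)
    joint = {h: likelihoods[h] * priors[h] for h in priors}
    # Normalization term: P(D) = sum over h of P(D|h) * P(h)
    evidence = sum(joint.values())
    if evidence == 0:
        raise ValueError("the data has zero probability under every hypothesis")
    return {h: p / evidence for h, p in joint.items()}


# Hypothetical numbers, for illustration only
priors = {"h1": 0.5, "h2": 0.25, "h3": 0.25}
likelihoods = {"h1": 0.2, "h2": 0.4, "h3": 0.8}

post = posterior(priors, likelihoods)
for h, p in post.items():
    print(h, round(p, 3))
```

Note that h1 starts with the largest prior, but h3 explains the data better, so it ends up with the largest posterior.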

In machine learning we are given pairs (tuples) of an input (x) and a label (d). To oversimplify this learning problem, assume our samples are noise-free: there is no sensor error, human error, natural error, or any other kind of error.

We also assume that the target concept (the true hypothesis for our problem) is in fact in our hypothesis set.

We also have a uniform prior, where every hypothesis in our hypothesis set is equally likely before we see any data. This is important, as some people overlook it. It means every hypothesis has the same fighting chance.

The probability of h in our H (the hypothesis set), since the distribution is uniform, is just P(h) = 1/|H|.

Next, P(D|h) is 1 if our hypothesis gives the same result as the training output on every example, and 0 otherwise. So a hypothesis only keeps its probability if it is consistent with all of the training data.

Now, since we have calculated the probability of D given each particular h, we can calculate the probability of D by summing P(D|h)P(h) across the whole population of h. Well, what does it equal? Only the hypotheses that are consistent with the data contribute. The ones that aren't consistent drop out of the equation (we filtered them out at P(D|h)). The consistent hypotheses form a more specific set, the version space (VS), so P(D) = |VS|/|H|.

Finally, applying Bayes' rule for P(h|D), we get the result: P(h|D) = 1/|VS| if h is in the version space, and 0 otherwise. So no matter what the hypotheses are, we are choosing among the hypotheses that fit the requirements, the hypotheses in the version space. Every hypothesis in the version space is equally fit to be THE hypothesis.

Image(url='https://www.evernote.com/shard/s376/res/148db25e-26a9-4492-bbba-893de4cc8e9e')
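The noise-free, uniform-prior argument above can be checked numerically. This is only a sketch: the threshold hypotheses, the tiny dataset, and the name `version_space_posterior` are all my own assumptions for illustration.

```python
def version_space_posterior(hypotheses, data):
    """P(h|D) assuming noise-free labels and a uniform prior over H."""
    prior = 1.0 / len(hypotheses)  # P(h) = 1/|H|
    # P(D|h) is 1 if h agrees with every labeled example, else 0
    likelihood = {
        name: 1.0 if all(h(x) == d for x, d in data) else 0.0
        for name, h in hypotheses.items()
    }
    # P(D) = sum over h of P(D|h) * P(h), which works out to |VS|/|H|
    evidence = sum(likelihood[name] * prior for name in hypotheses)
    if evidence == 0:
        raise ValueError("no hypothesis is consistent with the data")
    return {name: likelihood[name] * prior / evidence for name in hypotheses}


# Hypothetical hypothesis set: simple threshold rules on a number
hypotheses = {
    "x>=1": lambda x: x >= 1,
    "x>=2": lambda x: x >= 2,
    "x>=3": lambda x: x >= 3,
    "x>=4": lambda x: x >= 4,
}
# Hypothetical noise-free training pairs (x, d)
data = [(0, False), (3, True)]

vs_post = version_space_posterior(hypotheses, data)
print(vs_post)
```

Here three of the four hypotheses agree with both examples, so each of them gets 1/|VS| = 1/3, and the inconsistent one ("x>=4") gets 0.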

Now, even though we have learned how to calculate the most probable hypothesis, it's not the case that we pick that one hypothesis alone. It may lead to that, but actually we calculate the predicted output by voting. Which hypotheses are allowed to vote? Only the hypotheses in the version space. In the example, the most probable hypothesis, with probability 0.4, says positive. But since the other two say negative and together outweigh it, the predicted label for x is -.
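A minimal sketch of that vote, where each hypothesis votes for its label with a weight equal to its posterior. The 0.4 comes from the example; the 0.3/0.3 split for the other two hypotheses is my assumption (the text only implies they share the remaining 0.6), and the names `vote`, `h1`, `h2`, `h3` are hypothetical.

```python
from collections import defaultdict


def vote(posteriors, predictions):
    """Weighted vote: add up P(h|D) over the hypotheses predicting each label."""
    totals = defaultdict(float)
    for h, p in posteriors.items():
        totals[predictions[h]] += p
    # The predicted label is the one with the largest total weight
    best = max(totals, key=totals.get)
    return best, dict(totals)


# 0.4 is from the example; the 0.3/0.3 split is an assumption
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {"h1": "+", "h2": "-", "h3": "-"}

label, totals = vote(posteriors, predictions)
print(label, totals)
```

The single most probable hypothesis says +, but the weighted vote comes out as -, because the combined weight behind - is larger than 0.4.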