Naive Bayes

  |   Source

Naive Bayes is one of the machine learning algorithms. Before digging in, let's first look at the definition of machine learning.

Sebastian Thrun Experience

In [1]:
from IPython.display import Image
In [2]:
Image('nb-ud/Screen Shot 2014-11-19 at 11.22.14 AM.jpg')
Out[2]:

Sebastian Thrun led Google's self-driving car project. He used supervised classification to train the car. Supervised means that we give the system many correct examples, the way lesson material is given to a student. That's what a learner does: it observes a teacher. For example, we watch our parents drive correctly, and then it's our turn to drive. That's what machine learning does. To win the DARPA Grand Challenge 2005, Stanley, the car he built at Stanford, observed thousands of miles of human driving in the desert.

Supervised Classification Example

In [3]:
Image('nb-ud/Screen Shot 2014-11-19 at 11.45.46 AM.jpg')
Out[3]:
  1. This one is definitely supervised learning. Given examples (photos tagged in an album), the system (e.g. Facebook) tries to recognize someone's face.
  2. Analyzing bank records is not supervised learning; it doesn't learn from labeled examples.
  3. Companies like Netflix learn from your movie or music choices as features and try to recommend similar movies or songs to you.
  4. This one is also based on supervised learning.

Feature and Labels

In [7]:
Image('nb-ud/Screen Shot 2014-11-19 at 11.52.49 AM.jpg')
Out[7]:

Katie loves to listen to Let It Go. What sort of features in the song make Katie like it? Based on those features, she labels the song: like or dislike.
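
Purely as an illustration of what features and a label might look like for one song (the feature names and values here are made up, not the lesson's):

# hypothetical features describing one song
song_features = {"intensity": 0.7, "tempo": 137, "genre": "pop"}

# Katie's label for that song
label = "like"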

Features Visualization

In [8]:
Image('nb-ud/Screen Shot 2014-11-19 at 12.02.38 PM.jpg')
Out[8]:
In [10]:
Image('nb-ud/Screen Shot 2014-11-19 at 12.03.19 PM.jpg')
Out[10]:

If we plot two features and the classes look linearly separable, it's easier to draw a conclusion about the next point. But if the plot looks like the second picture, it may be unclear how to decide. In that case it's better to add new samples or transform the features (polynomial terms, log10, etc.).
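
As a rough sketch of that second option, transforming features before fitting a classifier (the transforms and values below are illustrative, not the lesson's):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# toy feature matrix: two features per sample, the second spanning a large range
X = np.array([[1.0, 10.0], [2.0, 100.0], [3.0, 1000.0]])

# compress the second feature with log10
X_log = np.column_stack([X[:, 0], np.log10(X[:, 1])])

# or add polynomial terms: x1, x2, x1^2, x1*x2, x2^2
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)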

Stanley Speed

In [13]:
Image('nb-ud/Screen Shot 2014-11-19 at 12.11.55 PM.jpg')
Out[13]:

Suppose Stanley's speed is decided by two factors: the bumpiness of the terrain and its slope. We can turn these into a classification problem.

In [15]:
Image('nb-ud/Screen Shot 2014-11-19 at 12.25.11 PM.jpg')
Out[15]:

The machine learning system draws a decision surface (boundary) that makes it easy to tell which side of the boundary the next sample falls on.

Introduction to Naive Bayes

Okay, we have enough of a description of machine learning. Let's dig deeper into Naive Bayes. Bayes was actually a religious man trying to prove the existence of God; the algorithm named after him is the one we now call Naive Bayes.
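
For reference, the "naive" part is the assumption that the features are conditionally independent given the class, so the classifier scores each class y as

P(y | x1, ..., xn) ∝ P(y) · P(x1 | y) · P(x2 | y) · ... · P(xn | y)

and predicts the class with the highest score.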

In [16]:
Image('nb-ud/Screen Shot 2014-11-19 at 12.35.35 PM.jpg')
Out[16]:

Naive Bayes will later produce a decision boundary like the one in the picture, so an incoming sample can be labeled according to which side of the boundary it falls on in this graph.

SKlearn

Now we want to create a picture like the one above using the scikit-learn library.

In [17]:
import numpy as np
In [18]:
%pylab inline
Populating the interactive namespace from numpy and matplotlib
In [19]:
from sklearn.naive_bayes import GaussianNB
In [20]:
clf = GaussianNB()
In [21]:
X = np.array([[-1,-1],[-2,-1],[-3,-2],[1,1],[2,1],[3,2]] )
Y = np.array([1,1,1,2,2,2])
In [22]:
clf.fit(X,Y)
Out[22]:
GaussianNB()
In [23]:
print clf.predict([[-0.8,-1]])
[1]
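
If you also want class probabilities instead of a hard label, GaussianNB provides predict_proba (not run here):

print clf.predict_proba([[-0.8,-1]])   # one column per class, ordered as in clf.classes_
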
In [24]:
%%writefile prep_terrain_data.py
#!/usr/bin/python
import random


def makeTerrainData(n_points=1000):
###############################################################################
### make the toy dataset
    random.seed(42)
    grade = [random.random() for ii in range(0,n_points)]
    bumpy = [random.random() for ii in range(0,n_points)]
    error = [random.random() for ii in range(0,n_points)]
    y = [round(grade[ii]*bumpy[ii]+0.3+0.1*error[ii]) for ii in range(0,n_points)]
    for ii in range(0, len(y)):
        if grade[ii]>0.8 or bumpy[ii]>0.8:
            y[ii] = 1.0

### split into train/test sets
    X = [[gg, ss] for gg, ss in zip(grade, bumpy)]
    split = int(0.75*n_points)
    X_train = X[0:split]
    X_test  = X[split:]
    y_train = y[0:split]
    y_test  = y[split:]

    grade_sig = [X_train[ii][0] for ii in range(0, len(X_train)) if y_train[ii]==0]
    bumpy_sig = [X_train[ii][1] for ii in range(0, len(X_train)) if y_train[ii]==0]
    grade_bkg = [X_train[ii][0] for ii in range(0, len(X_train)) if y_train[ii]==1]
    bumpy_bkg = [X_train[ii][1] for ii in range(0, len(X_train)) if y_train[ii]==1]

#    training_data = {"fast":{"grade":grade_sig, "bumpiness":bumpy_sig}
#            , "slow":{"grade":grade_bkg, "bumpiness":bumpy_bkg}}


    grade_sig = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii]==0]
    bumpy_sig = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii]==0]
    grade_bkg = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii]==1]
    bumpy_bkg = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii]==1]

    test_data = {"fast":{"grade":grade_sig, "bumpiness":bumpy_sig}
            , "slow":{"grade":grade_bkg, "bumpiness":bumpy_bkg}}

    return X_train, y_train, X_test, y_test
#    return training_data, test_data
Writing prep_terrain_data.py
In [25]:
%%writefile class_vis.py
#!/usr/bin/python

#from udacityplots import *
import matplotlib 
matplotlib.use('agg')

import matplotlib.pyplot as plt
import pylab as pl
import numpy as np

#import numpy as np
#import matplotlib.pyplot as plt
#plt.ioff()

def prettyPicture(clf, X_test, y_test):
    x_min = 0.0; x_max = 1.0
    y_min = 0.0; y_max = 1.0

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    h = .01  # step size in the mesh
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())

    plt.pcolormesh(xx, yy, Z, cmap=pl.cm.seismic)

    # Plot also the test points
    grade_sig = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii]==0]
    bumpy_sig = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii]==0]
    grade_bkg = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii]==1]
    bumpy_bkg = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii]==1]

    plt.scatter(grade_sig, bumpy_sig, color = "b", label="fast")
    plt.scatter(grade_bkg, bumpy_bkg, color = "r", label="slow")
    plt.legend()
    plt.xlabel("bumpiness")
    plt.ylabel("grade")

    plt.savefig("test.png")
    
import base64
import json
import subprocess

def output_image(name, format, bytes):
    image_start = "BEGIN_IMAGE_f9825uweof8jw9fj4r8"
    image_end = "END_IMAGE_0238jfw08fjsiufhw8frs"
    data = {}
    data['name'] = name
    data['format'] = format
    data['bytes'] = base64.encodestring(bytes)
    print image_start+json.dumps(data)+image_end
Writing class_vis.py
In [29]:
%%writefile ClassifyNB.py

from sklearn.naive_bayes  import GaussianNB

def classify(features_train, labels_train):   
    ### import the sklearn module for GaussianNB
    ### create classifier
    ### fit the classifier on the training features and labels
    ### return the fit classifier
    clf = GaussianNB()
    clf.fit(features_train,labels_train)
    
    return clf
        

    
Overwriting ClassifyNB.py
In [76]:
# %%writefile GaussianNB_Deployment_on_Terrain_Data.py
#!/usr/bin/python

""" Complete the code below with the sklearn Naaive Bayes
    classifier to classify the terrain data
    
    The objective of this exercise is to recreate the decision 
    boundary found in the lesson video, and make a plot that
    visually shows the decision boundary """


from prep_terrain_data import makeTerrainData
from class_vis import prettyPicture, output_image
from ClassifyNB import classify

import numpy as np
import pylab as pl

from ggplot import *

features_train, labels_train, features_test, labels_test = makeTerrainData()

### the training data (features_train, labels_train) have both "fast" and "slow" points mixed
### in together--separate them so we can give them different colors in the scatterplot,
### and visually identify them
grade_fast = [features_train[ii][0] for ii in range(0, len(features_train)) if labels_train[ii]==0]
bumpy_fast = [features_train[ii][1] for ii in range(0, len(features_train)) if labels_train[ii]==0]
grade_slow = [features_train[ii][0] for ii in range(0, len(features_train)) if labels_train[ii]==1]
bumpy_slow = [features_train[ii][1] for ii in range(0, len(features_train)) if labels_train[ii]==1]




### create the classifier and fit it on the training data
clf = classify(features_train, labels_train)

### draw the decision boundary with the test points overlaid
prettyPicture(clf, features_test, labels_test)
#output_image("test.png", "png", open("test.png", "rb").read())
In [62]:
%%writefile classify.py

from sklearn.naive_bayes import GaussianNB

def NBAccuracy(features_train, labels_train, features_test, labels_test):
    """ compute the accuracy of your Naive Bayes classifier """
    ### import the sklearn module for GaussianNB
    from sklearn.naive_bayes import GaussianNB

    ### create classifier
    clf = GaussianNB()

    ### fit the classifier on the training features and labels
    clf.fit(features_train,labels_train)

    ### use the trained classifier to predict labels for the test features
    pred = clf.predict(features_test)


    ### calculate and return the accuracy on the test data
    ### this is slightly different than the example, 
    ### where we just print the accuracy
    ### you might need to import an sklearn module
    accuracy = clf.score(features_test,labels_test)
    return accuracy
Overwriting classify.py
In [36]:
# %%writefile submitAccuracy.py
from class_vis import prettyPicture
from prep_terrain_data import makeTerrainData
from classify import NBAccuracy

import matplotlib.pyplot as plt
import numpy as np
import pylab as pl


features_train, labels_train, features_test, labels_test = makeTerrainData()

def submitAccuracy():
    accuracy = NBAccuracy(features_train, labels_train, features_test, labels_test)
    return accuracy
In [37]:
print submitAccuracy()
0.884

That means we do in fact get close to 90% accuracy in predicting our data. It is always important to hold out part of your dataset as a test set (this course recommends a 90:10 split, others 80:20) so that we can tell whether our model is overfitting.
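
As a rough sketch of doing that split with scikit-learn (assuming your features and labels are in arrays X and y; in newer versions the function lives in sklearn.model_selection rather than sklearn.cross_validation):

from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions

# hold out 10% of the data as a test set (the 90:10 split mentioned above)
features_train, features_test, labels_train, labels_test = train_test_split(
    X, y, test_size=0.1, random_state=42)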

Bayes Rule

In [39]:
Image('nb-ud/Screen Shot 2014-11-19 at 2.12.38 PM.jpg')
Out[39]:

According to Sebastian Thrun, and here I quote, "Bayes Rule is perhaps the Holy Grail of probabilistic inference". It was found by Rev. Thomas Bayes, who was trying to infer the existence of God. What he didn't know back then is that he had opened endless possibilities for the field of Artificial Intelligence that we know today.

In [40]:
Image('nb-ud/Screen Shot 2014-11-19 at 2.19.16 PM.jpg')
Out[40]:

Here Bayes' rule lets us infer the probability of a hypothesis given an observed condition. Now let's put it into a quiz.

This question is tricky, especially since both specificity and sensitivity are 90%. Intuitively, given the test result is positive, we know we are in the shaded region (blue and red). The true positive is depicted by red. As an estimate, which answer best describes the ratio of the red shaded region to the total (red + blue) shaded region?

In [42]:
Image('nb-ud/Screen Shot 2014-11-19 at 2.25.56 PM.jpg')
Out[42]:

Looking at the graph, we are observing the probability of cancer within the "test positive" region, ignoring the rest of the population for now. Of those who test positive, we want to calculate the fraction who actually have the disease. For this we also need the other part of the positive region: people who do not have cancer but still test positive. Adding the two gives the total probability of a positive test, regardless of whether the person has the disease.

Here's the total probability for the problem

In [44]:
Image('nb-ud/Screen Shot 2014-11-19 at 2.40.25 PM.jpg')
Out[44]:

And here's the simpler intuition

In [46]:
Image('nb-ud/Screen Shot 2014-11-19 at 2.43.11 PM.jpg')
Out[46]:

By adding P(C|Pos) and P(notC|Pos) you get a total probability of 1.
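
To make the arithmetic concrete, here is the calculation in code, assuming the 1% prior P(C) used in the lesson video (sensitivity and specificity are the 90% values stated above):

p_c = 0.01                # prior P(C); assumed to be the lesson's 1% prior
p_pos_given_c = 0.9       # sensitivity, P(Pos | C)
p_neg_given_notc = 0.9    # specificity, P(Neg | not C)

# joint probabilities of a positive test
p_c_and_pos = p_c * p_pos_given_c                      # 0.009
p_notc_and_pos = (1 - p_c) * (1 - p_neg_given_notc)    # 0.099

# total probability of testing positive
p_pos = p_c_and_pos + p_notc_and_pos                   # 0.108

# posteriors; note they sum to 1
p_c_given_pos = p_c_and_pos / p_pos                    # about 0.083
p_notc_given_pos = p_notc_and_pos / p_pos              # about 0.917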

Bayes Rule for Classification

In [47]:
Image('nb-ud/Screen Shot 2014-11-19 at 3.13.58 PM.jpg')
Out[47]: