Decision Trees with scikit-learn
Decision Trees is one of the oldest machine learning algorithm. It's extremely robutst, and it can traceback for decades. Decision Trees is the algorithm that without expensive kernel in SVM, able to solve non-linear problem with linear surface.
For other information, please check this link. This also the final of the third learning algorithm of supervised classfication at Introduction to Machine Learning at Udacity. Check the reference at the bottom of the page for more information. D
In binary-classification(only two outcomes), Decision Trees count all possible output availables.
For this particular problem, we asked one feature first(e.g. windy?) and keep doing it until we reach all leaves.
This is the more complex problem for decision tree.
It still use binary classification. For this x,y feature and range from 1-5, we're still asking two options and use less than or greated than. For this example, we choose first options |x<3|. Now for the example that we have, if we choose YES |x<3| then we will create a new boundary (decision boundary) that split the region in two. After that we specify the other feature (y). and ultimately, the outcome of Y (either yes or no) would point the output that are given. For |x>3| which means No, we also ask second question with Y and will produce the outcome of either blue circle or red cross
%pylab inline import sys sys.path.append('../tools')
Populating the interactive namespace from numpy and matplotlib
%%writefile classifyDT.py def classify(features_train, labels_train): from sklearn import tree clf = tree.DecisionTreeClassifier() clf.fit(features_train, labels_train) return clf
#!/usr/bin/python """ lecture and example code for decision tree unit """ import sys from class_vis import prettyPicture, output_image from prep_terrain_data import makeTerrainData import matplotlib.pyplot as plt import numpy as np import pylab as pl from classifyDT import classify features_train, labels_train, features_test, labels_test = makeTerrainData() ### the classify() function in classifyDT is where the magic ### happens--it's your job to fill this in! clf = classify(features_train, labels_train) # clf = tree.DecisionTreeClassifier() # clf.fit(features_train, labels_train) #### grader code, do not modify below this line prettyPicture(clf, features_test, labels_test) # output_image("test.png", "png", open("test.png", "rb").read())
/Users/jon/anaconda/lib/python2.7/site-packages/matplotlib/__init__.py:1155: UserWarning: This call to matplotlib.use() has no effect because the backend has already been chosen; matplotlib.use() must be called *before* pylab, matplotlib.pyplot, or matplotlib.backends is imported for the first time. warnings.warn(_use_error_msg)
In this plot, we know that decision tree is trying to separate the region by linear separable line. Notice that decision tree is overfitting in this example
Let's see the accuracy of the algorithm
The Decision Tree have almost 91% accuracy, but we can see there's overfitting in our solution. Let's see the sklearn documentation to tackle this problem.
from IPython.display import Image
min_samples_split in documentation of sklearn, specifies, that if we have 100 samples, and min split=2(default) it will keep splitting the data, if the samples, have at least two point in the region. if the leaf reach 1, the decision tree will ignore the it, so the leaf with sample count 1 will not get split.
It's important to note that, the DT will always split if there's region that have min_samples_split count in the area. So if we at least, specify that min_samples=50, it will get at least 50 samples(point) to perform splitting. Let's test this
from sklearn import tree
clf_min_samples_50 = tree.DecisionTreeClassifier(min_samples_split=50)
DecisionTreeClassifier(compute_importances=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None, min_density=None, min_samples_leaf=1, min_samples_split=50, random_state=None, splitter='best')
We can see that the decision tree is less overfitting. And because it's less overfitting, means that we at least should have higher accuracy.
Let's test this!
And there you go, we have 91.2% accuracy compared to min_samples_2(default) which has 90.8%
Entropy is the one of Decision Tree parameter that can tell the algorithm,how it should expand the split area. The Entropy that is perfectly split the the region into two class will have high entropy. In the other hand, if we split the region and we still get same class for both region then it's zero. The entropy is like how much information gain that we have if we split the area.
Here's some other notes from Udacity:
"Some sources use other bases for the logarithm (for example, they might use log base 10 or the natural log, with a base of approx. 2.72)--those details can change the maximal value of entropy that you can get. In our case, where there are 2 classes, the log base 2 formula that we use will have a maximal value of 1.
In practice, when you use a decision tree, you will rarely have to deal with the details of the log base--the important takeaway here is that lower entropy points toward more organized data, and that a decision tree uses that as a way how to classify events."
Now in the top right we have some formula for calculating the entropy. The formula actually lack of minus sign in the beginnning.
First we observe one node, the speed, and the samples in the node are [ssff].If we get Pslow, of all the examples we have 0.5, and that also applies to Pfast 0.5.
Now if we are to calculate the entropy of this node, based in the intuition earlier, we should get 1.0, because we perfectly split the node into two class, slow and fast, both equal 0.5, 0.5.
Let's test this by code out of curiousity!
Because we have same probability for both fast and slow, we assign by 1 variable
entropi = -(0.5)*math.log(0.5,2)
after we assign a variable, we want to calculate both of class with same probability, so we just multiply by two.
For entropy to work with Decision Tree, we take into account an Information Gain
Check out other other information about entropy here
The decision tree algorithm will maximize information gain. What does it mean by this, is when we select a feature, with various different value, we want some limit of that particular feature that maximize the split.In other words, to have an order of which the features should be prioritized first to do splitting, decision tree algorithm do sorting of Information Gain value of each features from maximum to minimum.
Let's go back to Sebastian Thrun's examples. The output(predicting value) should be speed, either slow or fast. Then we have 3 different parameters, each have binary value.
Let's first observe grade. by plotting tree like above, either steep or flat, we separate it by the column grade specifies. the value are st-st-fl-st. If we look at the plot flat only get f, and steep combination of slow and fast.
The entropy of node flat in this case is zero. This is because we only have fast in this node, so perform some splitting will get nothing, we only split in the same class.In the contrary, the steep has ssf, and for slow the probability is 2/3, and fast 1/3
After we have entropy flat, we calculate the entropy of steep, given Pslow and Pfast. Here's we have the the formula for calculating the entropy, and the solution code as follows:
frac = 2.0/3
- frac * math.log(frac,2) - (1-frac) * math.log((1-frac),2)
Or, we can use scipy
Now that we have entropy of both steep and flat, we are ready to calculate total entropy(Information Gain) of feature 'grade'
The weighted average is the probablity of which steep/flat occurs.
1.0-(3.0/4 * 0.9184)
This is the Information Gain value for feature 'grade'.