Prioritizing What to Work On

Spam Classification Example
This topic may seem a little more complicated because it deals with higher-level questions in machine learning.
For more complex machine learning problems, the math formulas in this video tend to help a lot.
This lesson is about how to prioritize what to work on in a machine learning problem.

How do we use supervised learning to classify spam or not-spam?
The spam example on the left is recognizable by the digits substituted for letters in its words.

Features: choose words related to purchasing as indicators of spam, and words such as names as indicators of non-spam.
This is spam/non-spam classification using logistic regression.
We create a feature vector from a list of 100 words.
For each word in the list, record whether it appears in the example: mark 1 if it does, 0 otherwise.
Rather than manually choosing the list of words, we can instead choose the most frequent words (10K-50K) in the training examples.
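
The feature construction above can be sketched in Python. This is only a minimal sketch under assumptions: the tokenizer, the function names (tokenize, build_vocabulary, to_feature_vector), and the four-word list are hypothetical and stand in for the 100-word or 10K-50K-word list from the notes.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(emails, size):
    """Pick the `size` most frequent words across the training emails."""
    counts = Counter(word for email in emails for word in tokenize(email))
    return [word for word, _ in counts.most_common(size)]


def to_feature_vector(email, vocabulary):
    """Return a 0/1 list: 1 if the vocabulary word appears in the email."""
    present = set(tokenize(email))
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical hand-picked list (the notes use 100 words; 4 are shown here).
vocabulary = ["buy", "deal", "discount", "andrew"]
x = to_feature_vector("Deal of the week! Buy now!", vocabulary)
```

The vector x can then be used as the input to a logistic regression classifier. To avoid hand-picking words, build_vocabulary(training_emails, 10000) would produce the list from the training examples instead.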

Ways to spend time to get high accuracy:
Honeypot: create lots of fake email addresses and let them be spammed, so the spam they receive can be used as training examples.
Build features from the email header, which contains routing information. Spam sometimes takes an unusual route from its source.
In the email body, misspellings in spam are often deliberate, to avoid the words that spam filters look for.
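
The header and misspelling ideas above could be turned into features roughly as follows. This is a sketch under assumptions: counting "Received" headers as a proxy for the route, and treating a digit between letters as a sign of a deliberate misspelling, are illustrative heuristics, and the function names are hypothetical.

```python
import re
from email import message_from_string


def count_received_hops(raw_email):
    """Count the "Received" headers, one per mail server the message passed through."""
    msg = message_from_string(raw_email)
    return len(msg.get_all("Received", []))


def has_digit_inside_word(text):
    """Return True if some word has digits between letters, as in "w4tches"."""
    return re.search(r"[A-Za-z]\d+[A-Za-z]", text) is not None


# Hypothetical raw message with two routing hops.
raw = (
    "Received: from a.example by b.example\n"
    "Received: from c.example by a.example\n"
    "Subject: Deal of the week\n"
    "\n"
    "Rolex w4tches\n"
)
hops = count_received_hops(raw)
suspicious = has_digit_inside_word("Rolex w4tches")
```

Each of these values could be appended to the word-based feature vector. Whether they actually help is something to check on data rather than assume.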

These are the options for what to work on in the spam classifier example.
Many machine learning practitioners fixate on one of these options and spend a lot of time on it. Sometimes this is not fruitful at all.
What is not recommended is choosing by "gut feeling", such as picking whichever solution felt right when you woke up in the morning.

Next: error analysis, and how to choose where to spend time to improve learning performance.