Prioritizing What to Work On

  • Spam Classification Example
  • This material may seem a little more complicated because it deals with a higher level of abstraction in machine learning.
  • For more complex machine learning problems, the math in this video tends to help a lot.
  • This lesson is about how to prioritize work on a machine learning problem.

  • How do we use supervised learning to classify email as spam or non-spam?
  • The non-spam email on the left is denoted by the use of numbers in its words.

  • Features: choose words that indicate spam, such as words related to purchasing, and words that indicate non-spam, such as a person's name.
  • This is spam/non-spam classification using logistic regression.
  • We create a feature vector from a list of 100 words.
  • For each word in the list, record whether it appears in the example email: mark 1 if it does, 0 otherwise.
  • Rather than manually choosing the list of words, we can instead take the most frequent words (10,000 to 50,000) in the training examples.
  • How should we spend our time to make the classifier highly accurate? Some options:
  • Honeypot: create lots of fake email addresses and let them get spammed, so the spam they receive can be used as training examples.
  • Build features from the email header, which contains routing information. Spam sometimes takes an unusual route from its source.
  • In the email body, misspellings in spam are often intentional, to avoid the words that spam filters look for.
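
The feature construction described above (a word list, then a 1/0 entry per word) can be sketched in a few lines of Python. This is only an illustrative sketch, not the course's implementation: the helper names `tokenize`, `build_vocabulary` and `to_feature_vector`, the toy emails, and the vocabulary size of 4 (standing in for 10,000 to 50,000) are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(emails, size):
    """Return the `size` most frequent words in the emails, sorted alphabetically."""
    counts = Counter(word for email in emails for word in tokenize(email))
    return sorted(word for word, _ in counts.most_common(size))


def to_feature_vector(email, vocabulary):
    """Binary feature vector: 1 if the vocabulary word appears in the email, else 0."""
    words = set(tokenize(email))
    return [1 if word in words else 0 for word in vocabulary]


# Hypothetical toy training set; a real system would use thousands of emails.
training_emails = [
    "Buy now! Big discount on watches, buy today",
    "Andrew, are we still meeting now to discuss the project?",
    "Discount deal: buy cheap watches now",
]

vocabulary = build_vocabulary(training_emails, size=4)
x = to_feature_vector("Best deal, buy now", vocabulary)

print(vocabulary)  # ['buy', 'discount', 'now', 'watches']
print(x)           # [1, 0, 1, 0]
```

Each entry records only whether a word is present, not how often it occurs, which matches the 1/0 marking in the notes. The vector `x` would then be the input to the logistic regression classifier.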

  • These are options for what to work on in the spam classifier example.
  • Machine learning practitioners often spend a long time fixated on one of these options, and sometimes this is not fruitful at all.
  • What is not recommended is choosing by "gut feeling", such as going with whichever solution felt right on waking up in the morning.
  • Next: error analysis, and how to spend time choosing the right approach to improve learning performance.