• Features: choose words indicative of spam (for example, purchasing-related words) and words indicative of non-spam (for example, the recipient's name).
  • This is spam/non-spam classification using logistic regression.
  • We build a feature vector from a list of 100 words.
  • For each word in the list, record whether it appears in the example email: 1 if it does, 0 otherwise.
  • Rather than choosing the list of words manually, we can instead take the most frequent words (10K-50K) in the training examples.
  • How should we spend our time to get high accuracy? Some options follow.
  • Honeypot: create lots of fake email addresses and let them be spammed, so that the spam they receive can serve as training examples.
  • Gather features from the email header, which contains routing information. Spam sometimes takes an unusual route from its source.
  • In the email body, misspellings in spam are often deliberate, intended to evade word-based spam filters.
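The vocabulary and the 0/1 feature vector described above can be sketched as follows. This is a minimal illustration, not the method from the lecture: the helper names (`tokenize`, `build_vocabulary`, `to_feature_vector`), the tokenization rule, and the three sample emails are all assumptions made for the example, and the vocabulary size is 5 rather than 10K-50K to keep it readable.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenization: lowercase, then keep runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(emails, size):
    # Count every word across the training emails.
    counts = Counter()
    for email in emails:
        counts.update(tokenize(email))
    # Most frequent first; ties broken alphabetically so the result is stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:size]]


def to_feature_vector(email, vocabulary):
    # 1 if the vocabulary word appears in the email, 0 otherwise.
    present = set(tokenize(email))
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical training emails, invented for this sketch.
training_emails = [
    "Buy now, great deal on watches",
    "Discount deal, buy today",
    "Andrew, lunch today?",
]

vocabulary = build_vocabulary(training_emails, size=5)
features = to_feature_vector("Buy this deal now", vocabulary)

print(vocabulary)  # ['buy', 'deal', 'today', 'andrew', 'discount']
print(features)    # [1, 1, 0, 0, 0]
```

Each vector produced this way is one input x for the logistic regression classifier, paired with a label y (1 for spam, 0 for non-spam). Note that "now" appears in the test email but not in the 5-word vocabulary, so it does not contribute to the vector.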

  • These are options for what to work on when building a spam classifier.
  • Machine learning scientists often spend a lot of time fixated on one of these options, and sometimes that turns out not to be fruitful at all.
  • What's not recommended is deciding by "gut feeling", such as going with whatever solution felt right when you woke up in the morning.
  • Next: error analysis, and how to choose where to spend time in order to improve learning performance.