Cleaning the data

The data can be dirty:
  • dates arrive in US format (MM/DD/YYYY) when European format (DD/MM/YYYY) is expected
  • values are not normally distributed when we expect them to be
  • a field is expected to hold one kind of value but returns several kinds

  • Basically, every time a human is involved, errors creep in.
  • The last one is not a human error, though.
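
As a rough illustration of the date-format problem in the list above, here is a minimal sketch that audits date strings and reports which ones can only be US format, which can only be European format, and which are ambiguous. The function names (classify_date, audit_dates), the slash-separated layout, and the sample values are assumptions made for this example, not part of any particular dataset.

```python
from datetime import datetime


def classify_date(text):
    """Label a date string as 'us', 'european', 'ambiguous', or 'invalid'."""
    def parses(fmt):
        try:
            datetime.strptime(text, fmt)
            return True
        except ValueError:
            return False

    as_us = parses("%m/%d/%Y")        # month first
    as_european = parses("%d/%m/%Y")  # day first
    if as_us and as_european:
        return "ambiguous"            # e.g. 03/04/2014 reads either way
    if as_us:
        return "us"
    if as_european:
        return "european"
    return "invalid"


def audit_dates(values):
    """Count how many values fall under each label."""
    counts = {}
    for value in values:
        label = classify_date(value)
        counts[label] = counts.get(label, 0) + 1
    return counts


# Hypothetical sample values, for illustration only.
sample = ["25/12/2014", "12/25/2014", "03/04/2014", "2014-12-25"]
print(audit_dates(sample))
```

Ambiguous values cannot be resolved from the string alone; they need context, such as where the record came from, which is one reason some cleaning ends up being manual.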

  • Audit: this is how we plan our data cleaning. First audit the data, checking for errors and outliers.
  • Create cleaning plan:
    • identify causes: this is problem-specific
    • operations: also problem-specific
    • test: test early to build a solid understanding
  • execute the plan
  • Sometimes the data can't be cleaned programmatically and has to be cleaned manually, by you or by others.
  • Because all of these steps are performed by humans, they are subject to human error too, which is why we should iterate over the steps two or more times.
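
The steps above (audit, plan, test, execute, then audit again) can be sketched roughly as follows. This is only one possible shape for such a loop; the field values, the EXPECTED set, the MAPPING table, and the function names are all hypothetical and chosen for illustration.

```python
from collections import Counter

# Hypothetical raw values for a field that should hold one kind of value.
raw = ["Street", "St.", "street", "St", "Avenue", "Ave", "Rd ??"]

# Assumption: these are the only values we consider clean.
EXPECTED = {"Street", "Avenue"}

# Cleaning plan. Cause: inconsistent abbreviations and casing.
# Operation: map each known variant to its canonical form.
MAPPING = {
    "st": "Street",
    "st.": "Street",
    "street": "Street",
    "ave": "Avenue",
    "avenue": "Avenue",
}


def audit(values):
    """Count the values that fall outside the expected set."""
    return Counter(v for v in values if v not in EXPECTED)


def clean(value):
    """Return the canonical form, or the value unchanged if no rule applies."""
    return MAPPING.get(value.strip().lower(), value)


# Test the operation on a few known cases before running it on everything.
assert clean("St.") == "Street"
assert clean("Avenue") == "Avenue"

problems_before = audit(raw)          # first audit
cleaned = [clean(v) for v in raw]     # execute the plan
problems_after = audit(cleaned)       # audit again

print(problems_before)
print(problems_after)
```

Whatever the second audit still reports (here the "Rd ??" entry) either needs a new rule in the plan or has to be fixed manually, and then the loop runs again.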