Cleaning the data

Data can be dirty in several ways:
  • dates arrive in US format when European format is expected
  • the values are not normally distributed
  • a field is expected to hold one kind of value but holds several kinds


  • Basically, every time a human is involved, there will be errors.
  • The last one is not a human error, though.
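
The date-format problem can be checked programmatically. The following is a minimal sketch, assuming dates are slash-separated strings with a four-digit year; the function name classify_date and the sample values are hypothetical and only illustrate the idea.

```python
from datetime import datetime

def classify_date(text):
    """Label a date string as 'european', 'us', 'ambiguous', or 'invalid'."""
    def parses(fmt):
        # strptime raises ValueError when the text does not match the format
        try:
            datetime.strptime(text, fmt)
            return True
        except ValueError:
            return False

    european = parses("%d/%m/%Y")  # day first
    us = parses("%m/%d/%Y")        # month first
    if european and us:
        return "ambiguous"         # e.g. 03/04/2013 is valid either way
    if european:
        return "european"
    if us:
        return "us"
    return "invalid"

# Hypothetical sample values
dates = ["25/12/2013", "12/25/2013", "03/04/2013", "not a date"]
report = {d: classify_date(d) for d in dates}
```

Values labelled "ambiguous" cannot be resolved by parsing alone; they need knowledge of where the data came from.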


  • Audit: this is how we plan our data cleaning. First audit the data, checking for errors and outliers.
  • Create a cleaning plan:
    • identify causes: this is problem-specific
    • operations: also problem-specific
    • test: test early to build a solid understanding
  • Execute the plan.
  • Sometimes the data can't be cleaned programmatically and has to be cleaned manually, by you or by others.
  • Because all of these steps are performed by humans, they are also subject to human error, which is why we should repeat the steps two or more times.
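
The audit, clean, and re-audit cycle can be sketched in a few lines. This is only an illustration under stated assumptions: the column is meant to hold integers, and the names audit, clean, and raw are hypothetical. Values that cannot be converted are set aside for manual cleaning instead of being dropped.

```python
from collections import Counter

def audit(values):
    """Count the type of each value to reveal unexpected kinds."""
    return Counter(type(v).__name__ for v in values)

def clean(values):
    """Convert what can be converted to int; set the rest aside for manual review."""
    cleaned, manual = [], []
    for v in values:
        try:
            # Note: int() truncates floats such as 3.7, so a real plan
            # may want a stricter rule here.
            cleaned.append(int(v))
        except (TypeError, ValueError):
            manual.append(v)
    return cleaned, manual

# Hypothetical column that is expected to hold integers
raw = [42, "17", 3.0, "twelve", None]

before = audit(raw)            # audit: several kinds where one is expected
cleaned, manual = clean(raw)   # execute the plan
after = audit(cleaned)         # audit again to confirm the result
```

Running the audit a second time after cleaning is what catches mistakes made in the cleaning step itself.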