Cleaning the data
The data can be dirty:
- A US date format shows up where a European format is expected.
- Values are not normally distributed.
- We expect one kind of value but get various kinds in return.
- Basically, every time a human is involved, there is always an error.
- The last one is not a human error, though.
- Audit: this is how we plan our data cleaning. First audit the data, checking for errors and outliers.
- Create a cleaning plan:
  - identify causes: this is problem-specific
  - operations: also problem-specific
- Test: test early to build a solid understanding.
- Execute the plan.
- Sometimes we can't clean programmatically; then the data has to be cleaned manually, by you or by others.
- Because all of these steps are performed by humans, there is also human error, which is why we should repeat the steps two or more times.
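
To make the audit, plan, test, execute loop more concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption for illustration: the `records` sample, the field names `date` and `amount`, the helper names, and the outlier rule (more than 10 times the median) are made up, not taken from a real dataset. It audits for the kinds of dirt listed above (wrong date format, wrong kind of value, outliers), applies one cleaning operation, then audits again.

```python
from datetime import datetime
from statistics import median

# Hypothetical records; field names and values are invented for this sketch.
records = [
    {"date": "31/12/2020", "amount": "10.5"},
    {"date": "12/31/2020", "amount": "11.0"},   # US-style date
    {"date": "14/02/2021", "amount": "n/a"},    # not a number
    {"date": "15/03/2021", "amount": "9.5"},
    {"date": "16/03/2021", "amount": "10.0"},
    {"date": "17/03/2021", "amount": "950.0"},  # suspiciously large
]


def parse_european(text):
    """Return a date if text is a valid DD/MM/YYYY string, else None."""
    try:
        return datetime.strptime(text, "%d/%m/%Y").date()
    except ValueError:
        return None


def parse_amount(text):
    """Return a float if text looks like a number, else None."""
    try:
        return float(text)
    except ValueError:
        return None


def audit(rows, outlier_factor=10):
    """Audit step: list (row index, problem) pairs without changing the data."""
    amounts = [a for a in (parse_amount(r["amount"]) for r in rows) if a is not None]
    typical = median(amounts) if amounts else None
    problems = []
    for i, row in enumerate(rows):
        if parse_european(row["date"]) is None:
            problems.append((i, "date"))
        amount = parse_amount(row["amount"])
        if amount is None:
            problems.append((i, "amount_type"))
        elif typical and amount > outlier_factor * typical:
            # Assumed rule of thumb, not a statistical test.
            problems.append((i, "amount_outlier"))
    return problems


def clean(rows):
    """Execute step: one problem-specific operation, rewriting US-style dates."""
    cleaned = []
    for row in rows:
        fixed = dict(row)  # copy, so the original rows stay untouched
        if parse_european(fixed["date"]) is None:
            try:
                us = datetime.strptime(fixed["date"], "%m/%d/%Y")
                fixed["date"] = us.strftime("%d/%m/%Y")
            except ValueError:
                pass  # neither format matches: leave it for manual cleaning
        cleaned.append(fixed)
    return cleaned


before = audit(records)          # audit first
after = audit(clean(records))    # execute the plan, then audit again
print("before:", before)
print("after: ", after)
```

In this sketch the second audit no longer reports the date problem, while the non-numeric amount and the outlier are still listed, which is the part left for manual cleaning. Note one limit of this approach: a date such as 01/02/2021 is valid in both formats, so it passes the audit silently, and that is a reason to repeat the audit with other checks.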