The data can be dirty:
-
Dates arrive in US format (MM/DD/YYYY) when European format (DD/MM/YYYY) is expected.
-
The values are not normally distributed.
-
We expect one kind of value but get several kinds in return.
-
Basically, whenever humans are involved, there will be errors.
-
The last one is not a human error, though.
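As a rough illustration of the date-format problem, here is a minimal sketch that checks which day/month order can parse a date string. The helper name `classify_date` and the `NN/NN/YYYY` layout are assumptions made for this example, not something taken from a particular dataset.

```python
from datetime import datetime


def classify_date(text):
    """Report which day/month order can parse `text` (assumed 'NN/NN/YYYY')."""
    matches = []
    for label, fmt in (("european", "%d/%m/%Y"), ("us", "%m/%d/%Y")):
        try:
            datetime.strptime(text, fmt)
            matches.append(label)
        except ValueError:
            # This order does not produce a valid calendar date.
            pass
    if len(matches) == 2:
        # Both orders are valid, so the value alone cannot tell us the format.
        return "ambiguous"
    return matches[0] if matches else "invalid"
```

A value such as `03/04/2020` is valid in both orders, so it can only be resolved by knowing where the data came from. Values like `25/12/2020` give the format away, which can help guess the convention used by the rest of the column.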
-
Audit: This is how we plan our data cleaning. First audit the data, checking for errors and outliers.
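A minimal sketch of what such an audit could look like for a single column that is expected to hold numbers. The function name `audit_numeric` is hypothetical, and the 1.5 × IQR rule used to flag outliers is one common convention chosen for this example. The right threshold depends on the problem.

```python
import statistics


def audit_numeric(values):
    """Summarize problems in a column that is expected to hold numbers."""
    numbers, non_numeric = [], []
    missing = 0
    for value in values:
        if value is None or str(value).strip() == "":
            missing += 1
            continue
        try:
            numbers.append(float(value))
        except (TypeError, ValueError):
            non_numeric.append(value)

    outliers = []
    if len(numbers) >= 4:
        # Assumed rule: flag anything beyond 1.5 * IQR from the quartiles.
        q1, _, q3 = statistics.quantiles(numbers, n=4)
        spread = q3 - q1
        low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
        outliers = [x for x in numbers if x < low or x > high]

    return {"missing": missing, "non_numeric": non_numeric, "outliers": outliers}
```

The audit only reports. It does not change anything, so the findings can feed the cleaning plan before any value is touched.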
-
Create a cleaning plan:
-
Identify causes: this is problem-specific.
-
Operations: also problem-specific.
-
Test: test early to build a solid understanding.
-
Execute the plan.
-
Sometimes the data can't be cleaned programmatically and has to be cleaned manually, by you or by others.
-
Because all of these steps are performed by humans, they are also subject to human error, which is why we should repeat them two or more times.