How to handle missing data
-
Listwise Deletion: Drop any row that has a missing value. For example, a row missing its birthdate or deathdate is excluded from every computation, not only the ones that need those fields.
-
Pairwise Deletion: Drop the row only from the computations that need the missing value, and keep it for computations that use its other fields (e.g. height).
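
The difference between the two deletion strategies can be sketched in a few lines of plain Python. The records and the field names (`birth_year`, `death_year`, `height`) are made up for illustration, and `None` is assumed to mark a missing value.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
rows = [
    {"birth_year": 1950, "death_year": 2010, "height": 70},
    {"birth_year": 1962, "death_year": None, "height": 74},
    {"birth_year": None, "death_year": 1999, "height": 68},
    {"birth_year": 1940, "death_year": 2000, "height": None},
]

def is_complete(row, fields):
    """Return True if none of the given fields is missing in this row."""
    return all(row[f] is not None for f in fields)

all_fields = ["birth_year", "death_year", "height"]

# Listwise deletion: keep only rows that have no missing field at all.
listwise = [r for r in rows if is_complete(r, all_fields)]

# Pairwise deletion: each computation keeps every row that has
# the fields that particular computation needs.
lifespan_rows = [r for r in rows if is_complete(r, ["birth_year", "death_year"])]
height_rows = [r for r in rows if is_complete(r, ["height"])]

mean_lifespan = mean(r["death_year"] - r["birth_year"] for r in lifespan_rows)
mean_height_pairwise = mean(r["height"] for r in height_rows)
mean_height_listwise = mean(r["height"] for r in listwise)

print(len(listwise), len(lifespan_rows), len(height_rows))
```

In this toy data, listwise deletion leaves a single row for every statistic, while pairwise deletion keeps two rows for the lifespan and three for the height.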
-
Deleting rows with missing values can hurt the quality of the data and of the statistics computed from it. We want to get rid of the gaps, but the rows that remain may no longer represent the whole population.
-
There is a technique called imputation, which is discussed briefly here.
-
One way is to fill the missing values with the average of the existing ones; that way the mean stays the same.
-
But this also weakens the correlation between the imputed variable and the other variables (e.g. someone with height 5" is unlikely to have the average weight of 191.67).
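
A small sketch of both effects, using made-up heights and weights (not the data behind the 191.67 figure) and a hand-written Pearson correlation so that only the standard library is needed:

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data; None marks a missing weight.
heights = [66, 68, 70, 72, 74, 76]
weights = [150, None, 180, 195, None, 225]

# Mean imputation: replace every missing weight with the mean of the known ones.
observed = [w for w in weights if w is not None]
fill = mean(observed)  # 187.5 for this data
imputed = [fill if w is None else w for w in weights]

# Correlation using only the rows where the weight was actually observed.
pairs = [(h, w) for h, w in zip(heights, weights) if w is not None]
r_observed = pearson([h for h, _ in pairs], [w for _, w in pairs])

# Correlation after imputation, using all rows.
r_imputed = pearson(heights, imputed)

print(mean(observed), mean(imputed))  # the mean is unchanged
print(r_observed, r_imputed)          # the correlation drops after imputation
```

The filled-in rows all get the same weight regardless of height, which is why the height/weight correlation comes out lower than in the observed rows.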
-
Another technique is to use linear regression, borrowed from machine learning.
-
Based on the fields that are present, we want to predict the missing one (say, given avg and height as input parameters, with weight as the labeled output).
-
Fit a linear regression model on the rows where the value is known, then use it to predict the output for the rows where it is missing.
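
As a minimal sketch of the idea, the example below fits ordinary least squares by hand with a single predictor (height) and uses the fitted line to fill in missing weights. The data is made up, and restricting the model to one input is a simplification for brevity; with several inputs such as avg and height, a library regression routine would be the more practical choice.

```python
from statistics import mean

# Hypothetical data; None marks a missing weight.
heights = [66, 68, 70, 72, 74, 76]
weights = [150, None, 180, 195, None, 225]

# Train only on rows where the weight is known.
known = [(h, w) for h, w in zip(heights, weights) if w is not None]
xs = [h for h, _ in known]
ys = [w for _, w in known]

# Ordinary least squares for one predictor: weight = intercept + slope * height.
mx, my = mean(xs), mean(ys)
slope = sum((x - mx) * (y - my) for x, y in known) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

# Predict the missing weights; keep the observed ones untouched.
imputed = [
    intercept + slope * h if w is None else w
    for h, w in zip(heights, weights)
]

print(imputed)
```

Every imputed value lies exactly on the fitted line, so the filled-in rows carry none of the scatter that real observations would have.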
-
These two imputation techniques are only the tip of the iceberg, and both have disadvantages.
-
They over-emphasize the patterns already present in the existing data (which may lead to overfitting).