How to handle missing data


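  • Listwise deletion: drop any row that has a missing value. For example, a row missing its birthdate or deathdate is excluded from every computation, even ones that do not use those fields.
  • Pairwise deletion: drop the row only from computations that need the missing value (e.g. anything using birthdate and deathdate), but keep it for computations that rely on other fields (e.g. height).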
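
The difference between the two deletion strategies can be sketched in a few lines of plain Python. The records, field values, and helper names (`complete`, `mean`) below are made up for illustration; `None` stands in for a missing value.

```python
# Hypothetical records; None marks a missing value.
rows = [
    {"birthdate": 1950, "deathdate": 2010, "height": 70},
    {"birthdate": 1962, "deathdate": None, "height": 72},
    {"birthdate": None, "deathdate": 1999, "height": 68},
    {"birthdate": 1948, "deathdate": 2001, "height": None},
]

def complete(row, columns):
    """Return True when the row has a value for every listed column."""
    return all(row[c] is not None for c in columns)

def mean(values):
    return sum(values) / len(values)

# Listwise deletion: a row with ANY missing value is dropped everywhere.
listwise = [r for r in rows if complete(r, r.keys())]
listwise_lifespan = mean([r["deathdate"] - r["birthdate"] for r in listwise])
listwise_height = mean([r["height"] for r in listwise])

# Pairwise deletion: each computation keeps every row that has the
# columns that particular computation needs.
pairwise_lifespan = mean(
    [r["deathdate"] - r["birthdate"]
     for r in rows if complete(r, ["birthdate", "deathdate"])]
)
pairwise_height = mean([r["height"] for r in rows if complete(r, ["height"])])

print(len(listwise), listwise_lifespan, listwise_height)
print(pairwise_lifespan, pairwise_height)
```

With listwise deletion only one of the four rows survives, while pairwise deletion lets the lifespan use two rows and the height use three.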

  • Dropping rows with missing values can undermine the data and the statistics computed from it: the rows that remain may no longer represent the whole population.
  • An alternative is a technique called imputation, which fills in the missing values. It is discussed briefly here.

  • One way is to fill each missing value with the mean of the existing values in that column. That way the column's mean stays unchanged.
  • But this weakens the correlation between that column and the other variables, because the filled-in value ignores the rest of the row (e.g. someone with a height of 5' is unlikely to have a weight of 191.67).
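
A minimal sketch of this trade-off, using made-up heights (inches) and weights (pounds): filling the gaps with the observed mean keeps the mean of the weights the same, but the correlation between height and weight drops. The `pearson` helper is written out here only to keep the example self-contained.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data; None marks a missing weight.
heights = [60, 64, 68, 72, 76]
weights = [120, None, 160, None, 200]

# Mean of the weights that are actually present.
observed = [(h, w) for h, w in zip(heights, weights) if w is not None]
observed_mean = mean([w for _, w in observed])

# Mean imputation: every gap gets the same value, whatever the height.
imputed = [observed_mean if w is None else w for w in weights]

r_before = pearson([h for h, _ in observed], [w for _, w in observed])
r_after = pearson(heights, imputed)

print(observed_mean, mean(imputed))  # the mean is unchanged
print(r_before, r_after)             # the correlation is weaker after imputation
```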

  • Another technique is to use linear regression from machine learning.
  • Based on the parameters we do have, we want to predict the missing output (say, given avg and height as parameters, with weight as the label).
  • Build the model with linear regression on the rows where the output is known, then use it to predict the output for the rows where it is missing.
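
As a rough sketch of regression imputation, the example below fits an ordinary least-squares line on the rows where weight is known and uses it to fill the gaps. To stay short it uses a single predictor (height) rather than several parameters, and the data and the `fit_line` helper are assumptions made for illustration.

```python
def mean(values):
    return sum(values) / len(values)

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = (
        sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        / sum((x - mx) ** 2 for x in xs)
    )
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical data; None marks a missing weight.
heights = [60, 64, 68, 72, 76]
weights = [120, None, 160, None, 200]

# Train only on rows where the label (weight) is present.
train = [(h, w) for h, w in zip(heights, weights) if w is not None]
slope, intercept = fit_line([h for h, _ in train], [w for _, w in train])

# Predict the missing weights from the corresponding heights.
filled = [
    slope * h + intercept if w is None else w
    for h, w in zip(heights, weights)
]

print(slope, intercept)
print(filled)
```

Unlike mean imputation, each filled-in weight now depends on the height in the same row.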

  • These two imputation techniques are only the tip of the iceberg, and both have disadvantages.
  • Both over-emphasize patterns already present in the existing data, which may lead to overfitting.