Statistics vs Machine Learning


In a data science process, after the data scientist has questioned the data and extracted useful information, it's time to move on to modeling.

The modeling process is commonly divided into two categories: statistical modeling and machine learning. Statistical modeling requires the data scientist to understand the model deeply: specifically, which features act as the important predictors, and why. Statistical modeling is about the quality of the features.

Machine learning, by contrast, lets the data scientist brute-force every feature that may increase accuracy. It doesn't require understanding why the features are important. As long as the model predicts unseen data accurately, the machine learning is a success. Machine learning is about the quantity of the features: as long as they increase accuracy and generalize well, hundreds or thousands of features still make sense.

Take the Titanic as an example. Understanding what kind of passengers survived the accident is statistical modeling: looking at age, gender, first-class cabin, and socio-economic status, at how each feature relates to passenger survival, and at the standard error of each estimate. Machine learning uses all of those features as long as they increase accuracy. Even a passenger's height is taken into account if it increases accuracy.

Statistical modeling

In general, statistical modeling requires us to understand the various predictors and how they contribute to the outcome, while also accounting for confounding variables that add noise to the model. We use statistical inference to determine which features are really important.

Linear regression usually comes first when the outcome is a numerical variable. The predictors can be numerical or categorical. We test models built from various combinations of predictors and polynomial degrees. These predictors are chosen from our earlier EDA as interesting features that relate to the outcome.
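As a rough illustration of the inference step, here is a minimal sketch of a one-predictor linear regression that reports the slope together with its standard error and t-statistic. The data and the function name `simple_ols` are made up for this example; in practice a statistics library would do this for you.

```python
import math


def simple_ols(x, y):
    """Fit y = intercept + slope * x by least squares.

    Returns (slope, intercept, se_slope, t_stat), where t_stat is the
    slope divided by its standard error.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    sigma2 = sum(r * r for r in residuals) / (n - 2)
    se_slope = math.sqrt(sigma2 / sxx)
    return slope, intercept, se_slope, slope / se_slope


# Hypothetical data: the outcome grows by roughly 2 per unit of the predictor.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
slope, intercept, se_slope, t_stat = simple_ols(x, y)
```

A large t-statistic relative to the standard error is the kind of evidence a statistician looks for before calling a predictor important.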

To test these models, we use a number of evaluation methods. First, we look at effect size and significance: is one model a statistically significant improvement over the other? We can use hypothesis testing to check this. Second, it's important to consider which features carry more weight than the others. Finally, a parsimonious model should be prioritized, that is, the model with the simplest and fewest features that is still statistically significant.
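One common way to put a number on parsimony is adjusted R-squared, which penalizes every extra predictor. It is not a hypothesis test, just a simple stand-in for the idea. The sketch below uses made-up R-squared values for a small and a large model.

```python
def r_squared(y_true, y_pred):
    """Fraction of the variance in y_true explained by y_pred."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_features):
    """Shrink R-squared according to how many predictors the model uses."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Hypothetical numbers: six extra features barely improve the raw R-squared.
small = adjusted_r_squared(0.800, n_samples=30, n_features=2)
large = adjusted_r_squared(0.805, n_samples=30, n_features=8)
prefer_small = small > large
```

With these numbers the two-feature model comes out ahead after the penalty, even though the eight-feature model has the slightly higher raw R-squared.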

Machine Learning

Earlier, when talking about statistical modeling, we discussed which features are associated with a numerical outcome. That's why linear regression is often used for statistical modeling. If, on the other hand, the goal is classification, it's often good to use classification algorithms from machine learning.

Machine learning uses as many features as it wants as long as the model 1) achieves high accuracy, and 2) generalizes well to unseen data. Machine learning follows these steps: learn from the training data, tune the parameters on held-out validation data using evaluation metrics, and then test on the test data. The right evaluation metric depends on whether the target outcome is skewed or evenly proportioned.
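Here is a minimal sketch of how the data might be split before training. The function name, the 60/20/20 proportions, and the fixed seed are all assumptions for illustration; the idea is simply that tuning and final testing use rows the model never trained on.

```python
import random


def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle rows and cut them into train, validation, and test lists."""
    rows = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test


# Stand-in data: 100 row ids instead of real observations.
train, val, test = train_val_test_split(range(100))
```

The model learns on `train`, parameters are tuned against `val`, and `test` is touched only once at the end to estimate performance on unseen data.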

A digital advertising company has two options to consider, with a trade-off between them. Do you want to maximize clicks, or minimize budget? If you want as many clicks as possible, you can predict that everyone will click. And you're right in one sense: everyone who would potentially click sees the ad. But at the same time, you spend an enormous budget because you show ads to everyone, including the many people who never click. This strategy maximizes recall (also called sensitivity) at the cost of precision, because it produces many false positives.

On the other hand, you may want to minimize the budget as much as you can, and only show ads to people who will almost certainly click. You don't want to pay for people you're not convinced will click. Congratulations, you've minimized the budget, but you get a very small number of clicks. In the extreme case, you show no ads at all, so you've reduced the budget to nothing. This strategy maximizes precision at the cost of recall.
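To make the trade-off concrete, here is a small sketch using the standard definitions: precision is the share of shown ads that were clicked, and recall is the share of would-be clickers who were shown an ad. The ten users and the three targeting strategies are made up for illustration.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall); None when a ratio is undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall


# Hypothetical ground truth: 3 of 10 users would click.
clicked = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

show_everyone = [1] * 10                       # maximize clicks
show_few = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]      # only the surest user
show_nobody = [0] * 10                         # spend nothing

everyone_scores = precision_recall(clicked, show_everyone)
few_scores = precision_recall(clicked, show_few)
nobody_scores = precision_recall(clicked, show_nobody)
```

Showing ads to everyone reaches every clicker (recall 1.0) but only 3 in 10 impressions pay off (precision 0.3). Showing one carefully chosen ad gives precision 1.0 but reaches only a third of the clickers. Showing nothing leaves precision undefined and recall at zero.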

Machine learning is about tuning the parameters to get the best out of an algorithm. If the model still doesn't perform well, you can choose another algorithm and tune it in turn. Don't forget to track the parameters and algorithms you try so that you can save the best model. If all the optimizations still don't give a great score, you may want to consider getting more or different data.
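The tune-and-track loop might look something like the sketch below. The "model" here is just a hypothetical score threshold and the validation data is made up; the point is the bookkeeping, where every trial is recorded and the best one is kept.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def threshold_model(threshold):
    """Toy classifier: predict 1 when the score reaches the threshold."""
    return lambda scores: [1 if s >= threshold else 0 for s in scores]


# Hypothetical validation set: model scores and the true labels.
val_scores = [0.1, 0.35, 0.4, 0.6, 0.8, 0.9]
val_labels = [0, 0, 1, 1, 1, 1]

history = []  # every (parameter, score) pair that was tried
best = {"threshold": None, "accuracy": -1.0}
for threshold in [0.2, 0.4, 0.6, 0.8]:
    predictions = threshold_model(threshold)(val_scores)
    score = accuracy(val_labels, predictions)
    history.append((threshold, score))
    if score > best["accuracy"]:
        best = {"threshold": threshold, "accuracy": score}
```

Keeping `history` around makes it easy to see later which settings were already tried, and `best` is what you would save.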

Machine Learning vs Traditional Statistics

There is still an ongoing debate between machine learning and statistical modeling. But both follow the statistical approach of treating observations as independent. When you have log data, it's better to make each observation represent one user, so that the observations are independent at the user level.
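As a sketch of what "one observation per user" could mean for log data, the snippet below collapses raw log rows into one record per user. The field names `user_id` and `clicked` are assumptions for this example, not a real log schema.

```python
from collections import defaultdict

# Hypothetical ad log: one row per impression, several rows per user.
log_rows = [
    {"user_id": "u1", "clicked": 1},
    {"user_id": "u1", "clicked": 0},
    {"user_id": "u2", "clicked": 0},
    {"user_id": "u1", "clicked": 1},
    {"user_id": "u3", "clicked": 1},
]


def aggregate_by_user(rows):
    """Collapse impression-level rows into one summary record per user."""
    totals = defaultdict(lambda: {"impressions": 0, "clicks": 0})
    for row in rows:
        stats = totals[row["user_id"]]
        stats["impressions"] += 1
        stats["clicks"] += row["clicked"]
    # Add a per-user click-through rate to each summary.
    return {
        user_id: {**stats, "ctr": stats["clicks"] / stats["impressions"]}
        for user_id, stats in totals.items()
    }


per_user = aggregate_by_user(log_rows)
```

Five log rows become three observations, so a very active user no longer counts several times over.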

You reduce the influence of outliers by applying a log, square root, or other mathematical transformation that brings each skewed feature closer to a normal distribution. This is also common to both approaches.
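A quick way to see the effect of such a transformation is to compare skewness before and after. The sketch below uses made-up spending values with a long right tail and a simple moment-based skewness; `log1p` is used here as one reasonable choice of log transform.

```python
import math


def skewness(values):
    """Moment-based skewness: 0 for symmetric data, positive for a right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical feature: most values are small, a few are very large.
spend = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]
# log1p(v) = log(1 + v), which also behaves well for values near zero.
log_spend = [math.log1p(v) for v in spend]

skew_before = skewness(spend)
skew_after = skewness(log_spend)
```

The transformed feature is still right-skewed, but noticeably less so, and the single extreme value no longer dominates.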

In terms of generalizability, statistics sees it as how well the model generalizes to the population of interest. Machine learning, on the other hand, sees generalizability as how the model performs against unseen future data, which it estimates by splitting the data into training, validation, and test sets. The principles are somewhat similar.

In terms of feature engineering, both rely on data analysis to create features. In terms of feature selection, statistics uses measures such as p-values or R-squared to select significant features, while machine learning selects the top K features or a top percentile based on their contribution to prediction accuracy.
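Here is a minimal sketch of top-K selection. To keep it short it scores each feature by its absolute Pearson correlation with the target rather than by model accuracy, which is a simplification on my part. The feature names and values are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def top_k_features(features, target, k):
    """Rank features by absolute correlation with the target, keep the top k."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Hypothetical columns: one strong, one moderate, one unrelated feature.
target = [1, 2, 3, 4, 5]
features = {
    "f_noise": [3, 1, 3, 1, 3],
    "f_medium": [1, 3, 2, 5, 4],
    "f_strong": [2, 4, 6, 8, 10],
}
selected = top_k_features(features, target, k=2)
```

Swapping the correlation score for a cross-validated accuracy score would bring this closer to what the machine learning approach does, at the cost of training a model per candidate.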

At this point, the statistician may stop, since the significance of each feature is already known. Machine learning goes further: it tries each classification or regression algorithm and tunes the hyperparameters, evaluated by predefined evaluation metrics.

Case Study: Netflix Prize

Back to the Netflix example. The winner used machine learning and achieved great prediction accuracy. Machine learning did a great job because a complex model outperformed a parsimonious model. But it turns out Netflix didn't use it because it was too complex: the computation itself was not feasible with Netflix's technology. In this case, a parsimonious model was better than the complex model.


It's important to know your question. Different questions can lead to different modeling and different conclusions. Back to the digital advertising example: statistical modeling will ask, "What makes people click?", while machine learning will ask, "Can we maximize the CTR?". Statistical modeling will answer which features are significant and detect confounding variables that contribute to clicks. Machine learning will take all the features, regardless of their significance or confounding variables, as long as they maximize CTR.