10 Minutes into Data Science


I recently followed the Johns Hopkins Executive Data Science course. In the first chapter of the course, Jeff Leek said,

In Data Science, the importance is science and not data. Data Science is only useful when we use data to answer the question.

That is true to a large extent. I've seen too many companies brag about how big their data is without knowing how to pose a question. If the data can't be used to answer a question that helps the company grow, collecting it is not a good idea. They keep pushing for data, data, data, but in the end machine learning is only a tool for answering a question. It's critical to pose the question first, then get or build the data to answer it. More importantly, don't be afraid to bring in data from sources outside your company.

When investigating a problem and communicating it to a broader audience, it's important to find the right question first, and then find the data needed to answer it. This is what all data science work looks like. Even when we measure a confidence interval in an A/B test between control and experiment groups, the real question is still how we can use the data for the investigation.

In that course, Jeff Leek gave the example of Moneyball. We can find evaluation metrics to measure players' skills, but the key question is, "Can we be a winning team with a small budget?" Creating the best predictive model is not the most important thing. In the case of the Netflix Prize, the million-dollar algorithm was never put into production because it was too expensive to scale to the whole customer base.

Statistics and Machine Learning

Statistics is mainly divided into two parts: descriptive statistics and inferential statistics. Descriptive statistics, as the name implies, uses statistics to better understand the data. This includes using summary statistics and visualization to explore the data. Jake VanderPlas, the author of the Python Data Science Handbook, showed on his blog how to use this approach to understand patterns in Seattle's bicycle habits.
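As a rough illustration of summary statistics, here is a minimal Python sketch that uses only the standard library. The daily bicycle counts are made up for this example and are not taken from the Seattle data; `describe` is a hypothetical helper name.

```python
import statistics

# Hypothetical daily bicycle counts, made up for illustration only.
daily_counts = [120, 135, 150, 90, 60, 200, 210]


def describe(values):
    """Return a few common summary statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


summary = describe(daily_counts)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

A gap between the mean and the median, or a large standard deviation, is often the first hint that the data deserves a closer look or a plot.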

Inferential statistics, on the other hand, lets you use hypothesis testing and confidence intervals to make inferences about your assumptions. Suppose you have two groups with a numerical variable (female age vs. male age), and you want to find out whether the groups differ significantly from each other. Statistical inference requires you to follow an experimental design, so that your inference generalizes well to your population of interest and a correlation you find can suggest causation. This method is useful for learning whether two variables are related to each other.
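The two-group comparison can be sketched in Python with the standard library. The ages below are made up, and `compare_means` is a hypothetical helper. It uses Welch's standard error with a normal approximation, which is only rough for small samples; a t distribution would be the more careful choice.

```python
import math
import statistics
from statistics import NormalDist

# Hypothetical ages, made up for illustration only.
female_ages = [23, 31, 35, 28, 40, 37, 29, 33]
male_ages = [25, 30, 27, 22, 34, 26, 29, 24]


def compare_means(a, b, confidence=0.95):
    """Difference in means, approximate confidence interval, two-sided p-value.

    Assumption: the normal approximation is acceptable. For samples this
    small, a t distribution would give a wider, more honest interval.
    """
    diff = statistics.mean(a) - statistics.mean(b)
    # Welch's standard error: does not assume equal variances.
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return diff, (diff - z_crit * se, diff + z_crit * se), p_value


diff, (low, high), p_value = compare_means(female_ages, male_ages)
print(f"difference in means: {diff:.2f}")
print(f"approx. 95% CI: ({low:.2f}, {high:.2f})")
print(f"approx. p-value: {p_value:.4f}")
```

If the interval excludes zero, the data is at least consistent with a real difference between the groups. Whether that difference matters is still a question for the experiment design.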

Machine learning is a field of artificial intelligence that gives machines the capability to learn from your data. Thanks to the modern era, in which computing power has increased exponentially, and to the hype around data science, machine learning has grown into a broad area. Two of its interesting topics are supervised learning and unsupervised learning. In supervised learning, given a set of inputs and outputs, the algorithm learns to predict future outputs from future inputs. Think of a student who is given a stack of graded quizzes showing the right and wrong answers; he learns from them and is able to answer future quizzes. An example of unsupervised learning is a clustering algorithm, as discussed earlier.
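To make the supervised learning idea concrete, here is a minimal sketch that fits a straight line by ordinary least squares and then predicts an unseen input. The hours-and-scores data is made up, `fit_line` and `predict` are hypothetical helper names, and a straight line is only one of many possible models.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training data: hours studied (input) and quiz score (output).
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 65, 70, 78]

model = fit_line(hours, scores)      # learn from known input/output pairs
print(predict(model, 6))             # predict the output for a future input
```

The known pairs play the role of the graded quizzes, and the call to `predict` is the future quiz.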

Machine learning serves a different use case than statistical inference. Have you seen Kaggle competitions? Take a look at their leaderboards: the scores are extremely close. Often anyone in the top 100 is already a winner. Competitors are willing to take on 2-3 times more complexity to get a 1% increase. That is often not practical, as we saw with the Netflix Prize, where the company gave up on the million-dollar code because its computation was too expensive. So if you're in it for the accuracy game, go ahead! Otherwise, a simpler model is better.

So what do you use when you want to make a prediction? Use statistical modeling to understand what your prediction means. Use machine learning to make your prediction better. Statistical modeling cares about model complexity because we have to understand the model, whereas machine learning scales as complexity increases. Moreover, machine learning examines which parameters to tune for performance. Statistics, on the other hand, favors a parsimonious model (the simplest model that best explains the pattern).

Software Engineering

So why is software engineering important in data science? Because you will often have to get data using a programming language. Sure, there are reports that you can download from Google Analytics or a dashboard in your company, but those only contain summary and aggregate metrics. Things get harder if you need advanced or specialized metrics. Log data will always contain some messy records. If you have human-input data, you will always have human error. To deal with all of that, you need engineering skills. Pulling data from a database alone requires some programming or SQL skills. If you need additional data from open data on the web, you also need programming to get it through an API.
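As a small sketch of pulling data with SQL and cleaning human-input errors, the example below uses Python's built-in `sqlite3` module with an in-memory database. The `signups` table, its rows, and the 0 to 120 age range are all made up for illustration. A real pipeline would connect to your own database and apply your own validation rules.

```python
import sqlite3

# In-memory database with a hypothetical "signups" table (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signups (name TEXT, age TEXT)")
conn.executemany(
    "INSERT INTO signups VALUES (?, ?)",
    [("Ana", "29"), ("Budi", " 41 "), ("Citra", "n/a"), ("Dewi", "-3"), ("Eko", "35")],
)


def clean_age(raw):
    """Return the age as an int, or None when the value is not a plausible age."""
    try:
        age = int(raw.strip())
    except ValueError:
        return None
    # Assumed plausibility range; adjust to your own data.
    return age if 0 <= age <= 120 else None


# Pull the raw rows with SQL, then clean them in Python.
rows = conn.execute("SELECT name, age FROM signups ORDER BY rowid").fetchall()
conn.close()

cleaned = [(name, clean_age(age)) for name, age in rows]
valid_ages = [age for _, age in cleaned if age is not None]

print(f"{len(valid_ages)} of {len(rows)} rows usable")
print(f"mean age: {sum(valid_ages) / len(valid_ages):.1f}")
```

Rows that fail validation are kept as `None` rather than silently dropped, so you can still count how much of the data was unusable.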

So engineering skill is a critical part of data science. In fact, it's so critical that without it you won't get to the fun part, analyzing and making inferences or predictions, because you won't even know how to get the data and clean it. A software engineer alone can at least still do something: get some data, validate it by checking for inconsistencies, and compute some descriptive statistics to analyze it.


The data science toolbox often sparks a debate between choosing R or Python, but they are two different things. Python comes from a software engineering background. Because Python first became popular for web development, gathering and manipulating data became very important to it.

R, on the other hand, comes from a statistics background. A wide variety of statistical packages is available for it. Statisticians also use visualization to gain insight into data, so R has rich visualization packages.

So use Python when processing the data, and R when you want to analyze it. Of course, the two sometimes overlap, and you can choose to stick to one language. You can manipulate strings in R, or do statistical analysis in Python.

You have a few choices when you want to present the results of your analysis. If it is a narrative, you can use Jupyter or R Markdown to tell the story of your analysis. Sometimes you want to engage your audience with your findings. In that case, you want to create an interactive visualization. D3.js is great in this area.

So when you talk about data science, think more about the question that you want to answer. Can you get the information you need to answer the question? Do you have useful data to answer it? Even if you can answer it, is the answer practical to act on? Asking this series of questions will help you avoid missteps in the long run.