Research Methods


As data scientists, we often need to do a little research of our own instead of relying on our data alone. This can give good insight, because we see the data as a whole and get a clearer picture of what it means. There are many ways to conduct research. But a good research method requires us to:

  • Know how many people you surveyed
  • Know who you surveyed
  • Know how the survey was conducted

To learn about something, we pick some factors that help us get a result. These factors must be validated. But first we want to measure how closely the thing we are studying is related to the factors used to measure it. This is called the operational condition. In other words, we try to find a correlation between the thing and the factor.

Things that are hard to measure are called constructs, for example happiness or guilt. We can't measure them directly, so we look for indirect ways to measure them, such as an IQ test or a stress test. These methods are called operational definitions.

It's better to list all the possible operational definitions for measuring a construct. Every item on that list must then be validated for how accurately it measures the construct.

Across all these possibilities, be aware of extraneous factors (lurking variables). These are factors we have to pay attention to. We may need to hold such a variable constant across all the data to keep the result robust, or account for it in some other way. If we don't analyze these factors well, the results can shift and come out looking strange.

We usually use $\mu$ to describe the average of the entire population. Often it is not possible to get values for the entire population, so we take the average of a sample instead, written $\bar{x}$. $\bar{x}$ may not represent the whole population exactly, but it is a good estimate and can come close to the real $\mu$. The resulting sampling error is calculated as $$\mu - \bar{x}$$
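As a small illustration of the sampling error, the sketch below draws one random sample from a made-up population and compares its mean with the population mean. The population (the integers 1 to 1000), the sample size, and the helper name `sampling_error` are assumptions chosen only for this example.

```python
import random
import statistics


def sampling_error(population, sample_size, seed=0):
    """Return (mu, x_bar, mu - x_bar) for one simple random sample."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    sample = rng.sample(population, sample_size)  # draw without replacement
    mu = statistics.mean(population)  # population parameter
    x_bar = statistics.mean(sample)   # sample statistic
    return mu, x_bar, mu - x_bar


# Hypothetical population: the integers 1..1000
population = list(range(1, 1001))
mu, x_bar, error = sampling_error(population, sample_size=50)
print(f"mu = {mu}, x_bar = {x_bar}, sampling error = {error}")
```

In practice $\mu$ is usually unknown, so this error cannot be computed directly. The example only works because the whole population is available.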

A sample can represent the whole population if it is drawn at random from the whole population. Random here means that every data point has an equal chance of being picked. A value computed from a sample is called a statistic, and a value describing the whole population is called a parameter. Note that because a statistic is computed from only part of the population, we should expect it to differ from the population value.
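To see the difference between a statistic and a parameter, the following sketch draws several simple random samples from the same hypothetical population and prints each sample mean next to the population mean. The population values, the sample size, and the function name `sample_means` are assumptions for illustration.

```python
import random
import statistics


def sample_means(population, sample_size, n_samples, seed=0):
    """Return the mean of each of n_samples simple random samples."""
    rng = random.Random(seed)
    # rng.sample gives every element the same chance of being picked
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]


# Hypothetical population: the integers 1..1000
population = list(range(1, 1001))
parameter = statistics.mean(population)  # describes the whole population
statistics_seen = sample_means(population, sample_size=30, n_samples=5)

print("parameter (population mean):", parameter)
for i, m in enumerate(statistics_seen, start=1):
    print(f"statistic from sample {i}: {m}")
```

Each sample yields its own statistic, and none of them has to equal the parameter exactly.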


Take a look at the example. Here memory is the construct, the thing we want to measure. One way to measure it is hours slept, the operational definition. In this case, we want to validate whether hours slept directly influences memory. In the sample test, hours slept (the independent variable) affects temporal memory (the dependent variable) on a linear scale.

We see a correlation between hours slept and memory. But this is correlation, not causation, because there are still other variables to observe. Descriptive statistics also can't be used as inferential statistics.
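One common way to quantify such a linear relationship is the Pearson correlation coefficient. The sketch below computes it by hand for invented numbers. The `hours_slept` and `memory_score` values are hypothetical and are not the data behind the example above.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented data, for illustration only
hours_slept = [4, 5, 6, 7, 8, 9]          # independent variable
memory_score = [52, 58, 61, 70, 74, 79]   # dependent variable

r = pearson_r(hours_slept, memory_score)
print(f"r = {r:.3f}")
```

A value of r close to 1 indicates a strong positive linear relationship in the sample. It says nothing about whether sleep causes the change in memory.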


A survey is one of the things we do when we want to collect data, in this case from people. It provides an easy and relatively inexpensive way to get samples of the population. We can ask people in the street, or conduct the survey remotely by phone or online. And if the survey is open to the public, anyone can access and analyze it.

But beware: some surveys don't reveal true information. People may become dishonest if we ask about private matters. People may give a wrong phone number because we ask them to pass the survey on to family or friends. People may lie and say the opposite. These are some of the things to consider. There can also be non-response bias, for example when students discard a university survey because they don't like them. The responses that we do get should therefore be treated as an indicator.
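Non-response bias can be illustrated with a small simulation. In the sketch below, each student has a satisfaction score from 1 to 5, and less satisfied students are assumed to be less likely to answer. The scores, the response probabilities, and the function name `simulate_survey` are all invented for this example.

```python
import random
import statistics


def simulate_survey(n_students=10_000, seed=0):
    """Return (true mean score, mean score among those who responded)."""
    rng = random.Random(seed)
    # Hypothetical satisfaction scores, 1 (unhappy) to 5 (happy)
    scores = [rng.randint(1, 5) for _ in range(n_students)]
    # Assumption: unhappy students are less likely to return the survey
    respond_prob = {1: 0.1, 2: 0.2, 3: 0.4, 4: 0.6, 5: 0.8}
    responses = [s for s in scores if rng.random() < respond_prob[s]]
    return statistics.mean(scores), statistics.mean(responses)


true_mean, observed_mean = simulate_survey()
print(f"true mean: {true_mean:.2f}, survey mean: {observed_mean:.2f}")
```

Under these assumptions the survey mean overstates satisfaction, because the people who did not answer differ systematically from those who did.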

A controlled experiment, however, is an experiment in which we force the conditions to be the way we want them. We close off all the possible lurking variables, and by doing this we get higher accuracy. A doctor can run an A/B test of two different pills given to his patients. The patients are not told which medication they receive. This way the patients don't feel that they are treated differently, which could otherwise cause response bias. The doctor also doesn't know which pill each patient gets. If the doctor knew, he might unconsciously favor one group over the other. This is called double blind, because both the doctor and the patients are blinded.

To keep other variables such as gender and age from interfering, the patients should be randomly and evenly distributed between the groups. If one group has too many women and the other too many men, the experiment may be biased by gender.

A placebo, on the other hand, is a fake medicine that we give to the patients. We do this to identify lurking variables. For example, if it turns out that some patients are cured after being given the placebo, we can look for the lurking variables that cured them.
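The random assignment and blinding described above could be sketched as follows. Patients are shuffled into two equal groups, and each patient gets an opaque pill code, so that neither the doctor nor the patient can tell the groups apart. The arm names `pill_A` and `pill_B`, the `PILL-` code format, and the function name are hypothetical.

```python
import random


def randomize_double_blind(patient_ids, seed=None):
    """Randomly split patients into two arms and hide the arm behind a code.

    Returns (blinded, key):
      blinded: patient id -> opaque pill code (seen by doctor and patient)
      key:     patient id -> actual arm (kept by a third party until the end)
    """
    rng = random.Random(seed)
    ids = list(patient_ids)
    shuffled = ids[:]
    rng.shuffle(shuffled)  # every ordering is equally likely
    half = len(shuffled) // 2
    key = {
        pid: ("pill_A" if i < half else "pill_B")
        for i, pid in enumerate(shuffled)
    }
    # Unique random codes that carry no information about the arm
    codes = rng.sample(range(1000, 10000), len(ids))
    blinded = {pid: f"PILL-{code}" for pid, code in zip(ids, codes)}
    return blinded, key


blinded, key = randomize_double_blind(range(1, 21), seed=42)
print(blinded[1])  # what the doctor and patient 1 see
print(key[1])      # revealed only after the experiment
```

Simple shuffling balances gender and age only on average. If a strict balance is needed, the same shuffle could be applied separately within each gender or age group.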