eda-fb
In this post, I want to perform exploratory data analysis (EDA) on a Facebook dataset. The dataset contains almost 100,000 users, with variables ranging from age, birthday, and gender to likes, mobile likes, and so on.
Well, this isn’t actually a real Facebook dataset. It is pseudo data provided by data analysts at Facebook, so we can expect it to be close to the real thing. This analysis draws on my experience with the Udacity course Exploratory Data Analysis with R, which is where I acquired the dataset. You should check it out; it’s a course I really recommend.
I generated the HTML using Knit HTML in RStudio. The code is as given.
Overview
To analyze the dataset well, it helps to have all the summaries we need up front. First, I will compute some basic summaries to get a better understanding of the dataset. Note that the column names contain the prefix ‘dob’, which means date of birth.
## userid age dob_day dob_year
## Min. :1000008 Min. : 13.00 Min. : 1.00 Min. :1900
## 1st Qu.:1298806 1st Qu.: 20.00 1st Qu.: 7.00 1st Qu.:1963
## Median :1596148 Median : 28.00 Median :14.00 Median :1985
## Mean :1597045 Mean : 37.28 Mean :14.53 Mean :1976
## 3rd Qu.:1895744 3rd Qu.: 50.00 3rd Qu.:22.00 3rd Qu.:1993
## Max. :2193542 Max. :113.00 Max. :31.00 Max. :2000
##
## dob_month gender tenure friend_count
## Min. : 1.000 female:40254 Min. : 0.0 Min. : 0.0
## 1st Qu.: 3.000 male :58574 1st Qu.: 226.0 1st Qu.: 31.0
## Median : 6.000 NA's : 175 Median : 412.0 Median : 82.0
## Mean : 6.283 Mean : 537.9 Mean : 196.4
## 3rd Qu.: 9.000 3rd Qu.: 675.0 3rd Qu.: 206.0
## Max. :12.000 Max. :3139.0 Max. :4923.0
## NA's :2
## friendships_initiated likes likes_received
## Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 17.0 1st Qu.: 1.0 1st Qu.: 1.0
## Median : 46.0 Median : 11.0 Median : 8.0
## Mean : 107.5 Mean : 156.1 Mean : 142.7
## 3rd Qu.: 117.0 3rd Qu.: 81.0 3rd Qu.: 59.0
## Max. :4144.0 Max. :25111.0 Max. :261197.0
##
## mobile_likes mobile_likes_received www_likes
## Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 4.0 Median : 4.00 Median : 0.00
## Mean : 106.1 Mean : 84.12 Mean : 49.96
## 3rd Qu.: 46.0 3rd Qu.: 33.00 3rd Qu.: 7.00
## Max. :25111.0 Max. :138561.00 Max. :14865.00
##
## www_likes_received
## Min. : 0.00
## 1st Qu.: 0.00
## Median : 2.00
## Mean : 58.57
## 3rd Qu.: 20.00
## Max. :129953.00
##
## userid age dob_day dob_year dob_month gender tenure friend_count
## 1 2094382 14 19 1999 11 male 266 0
## 2 1192601 14 2 1999 11 female 6 0
## 3 2083884 14 16 1999 11 male 13 0
## 4 1203168 14 25 1999 12 female 93 0
## 5 1733186 14 4 1999 12 male 82 0
## 6 1524765 14 1 1999 12 male 15 0
## friendships_initiated likes likes_received mobile_likes
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## mobile_likes_received www_likes www_likes_received
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
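The two outputs above look like what R's `summary()` and `head()` print for a data frame. Here is a minimal sketch of how they might be produced. The file name `pseudo_facebook.tsv` is an assumption (use whatever your copy is called), and the small stand-in data frame is invented so that the snippet still runs without the file. The name `pf` matches the `pf$gender` labels that appear in later output.

```r
# Assumed file name; adjust to wherever your copy of the dataset lives.
path <- "pseudo_facebook.tsv"

if (file.exists(path)) {
  # A .tsv file is tab-separated, so read.delim() fits.
  # stringsAsFactors = TRUE lets summary() count the gender levels.
  pf <- read.delim(path, stringsAsFactors = TRUE)
} else {
  # Hypothetical stand-in rows, only so the sketch runs without the file.
  pf <- data.frame(
    userid = c(1L, 2L, 3L, 4L),
    age = c(14L, 28L, 50L, 113L),
    gender = factor(c("male", "female", "male", NA)),
    friend_count = c(0L, 82L, 206L, 4923L)
  )
}

summary(pf)  # per-column min, quartiles, mean, max, and NA counts
head(pf)     # first six rows (fewer if the data has fewer)
```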
Lastly, we may want to plot each variable against the others to get better insight.
Now let’s discuss some of these graphs.
Female vs Male
As this dataset also contains gender, I want to see how each analysis differs between male and female users. First, let’s take a look at each gender separately, starting with the friend count for both female and male users.
## pf$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 37 96 242 244 4923
## --------------------------------------------------------
## pf$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 27 74 165 182 4917
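The median friend count for women is higher than for men (96 versus 74). The median is the better number to compare here, because a few users with a very high friend_count drag the mean upward: the means (242 and 165) sit well above the medians. The median is resistant to those outliers. It marks the midpoint of the data, so we can say that half of the users fall at or below it.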
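To make the point about outliers concrete, here is a small sketch in base R. The friend counts are made up for illustration and are not taken from the dataset.

```r
# Hypothetical friend counts: mostly small values plus one very large one.
friend_count <- c(0, 10, 30, 74, 96, 120, 4923)

mean(friend_count)    # pulled far upward by the single large value
median(friend_count)  # stays at the middle observation

# Dropping the outlier changes the mean a lot and the median only a little.
without_outlier <- friend_count[friend_count < 1000]
mean(without_outlier)
median(without_outlier)
```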
Number of years using Facebook vs number of users in the sample.
We see exponential growth in Facebook users over the last 3 years. With color added, the graph looks as follows:
Tenure is recorded in days, so 1 to 365 days is one year. In this case, I divide by 365 so that this range maps to 0 to 1, which gives me a histogram in years that is more convenient to read. Then I bin it with a binwidth of 0.25, that is, a quarter of a year.
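As a rough sketch of that binning, the snippet below converts days to years and counts users per quarter-year bin using base R. The tenure values are invented, and the variable names are only for illustration.

```r
# Hypothetical tenures in days.
tenure_days <- c(0, 45, 91, 200, 365, 400, 730, 1000)

# Convert days to years.
tenure_years <- tenure_days / 365

# Quarter-year bin edges covering the whole range.
bin_width <- 0.25
upper <- ceiling(max(tenure_years) / bin_width) * bin_width
breaks <- seq(0, upper, by = bin_width)

# right = FALSE makes each bin [lower, upper); include.lowest = TRUE
# closes the last bin so a value exactly at the top edge is kept.
bins <- cut(tenure_years, breaks = breaks, right = FALSE, include.lowest = TRUE)
table(bins)
```

The same `breaks` vector could be passed to `hist()` to draw the histogram instead of printing counts.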
Let’s look at the users’ ages in this dataset.
Some people don’t have any friends, so their friend_count is 0. The log base 10 of zero is negative infinity, which would disrupt the data. To avoid this, I increment every friend count by 1. The log transform pulls in the long tail of outliers and gives something closer to a normal distribution. I also use sqrt, to compare it with log base 10 as a way of bringing the data closer to a normal distribution.
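A quick sketch of why the increment matters, using made-up friend counts:

```r
# Hypothetical friend counts, including users with no friends.
friend_count <- c(0, 0, 9, 99, 999, 4923)

log10(0)  # -Inf: zero counts break a plain log transform

# Adding 1 first maps a count of 0 to 0 and keeps everything finite.
log_fc <- log10(friend_count + 1)
log_fc

# The square root also compresses large values, but less strongly.
sqrt_fc <- sqrt(friend_count)
sqrt_fc
```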
Female vs Male, Continued
Now let’s see which gender has the most likes through the web platform.
The data seems to tell me that women like more often than men. Let’s test this using a different approach.
## pf$gender: female
## [1] 3507665
## --------------------------------------------------------
## pf$gender: male
## [1] 1430175
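Totals per gender like the ones above can be computed with `by()`. The sketch below assumes the column being summed is `www_likes`, which is my reading of "likes by using web platform"; the data frame here is a tiny invented stand-in for `pf`.

```r
# Hypothetical stand-in for the real data frame pf.
pf <- data.frame(
  gender = factor(c("female", "male", "female", "male", NA)),
  www_likes = c(10, 3, 25, 0, 7)
)

# by() splits the first argument by the grouping factor and applies
# the function to each group. Rows with a missing gender are left out.
likes_by_gender <- by(pf$www_likes, pf$gender, sum)
likes_by_gender
```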
And using a boxplot:
The boxplot shows that women tend to have a higher friend_count than men.
On average, who initiated more friendships in our sample: men or women?
## pf$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 19.0 49.0 113.9 124.8 3654.0
## --------------------------------------------------------
## pf$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 15.0 44.0 103.1 111.0 4144.0
Here I see that women initiated more friendships than men. A boxplot has an advantage over simply using the by function with summary split by gender, because it gives me a better sense of where the outliers are and of how the data is distributed.
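The numbers behind a boxplot can also be inspected directly. This is a small sketch using base R's `boxplot.stats()` on invented values; it is the same calculation `boxplot()` uses to decide which points to draw as outliers.

```r
# Hypothetical friendships_initiated values with one extreme value.
initiated <- c(1, 2, 3, 4, 5, 100)

stats <- boxplot.stats(initiated)

# Lower whisker, lower hinge, median, upper hinge, upper whisker.
stats$stats

# Points more than 1.5 times the box height beyond the hinges.
stats$out
```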
Checking whether the Facebook mobile app really benefits the company
## [1] 0.6459097
With almost 65% of users on mobile devices, I can see why Facebook should build a mobile app. It is very beneficial for the company.
Don’t just try to understand the kinds of data you have; also think about what transformations you can make with your data. In this example I synthesized a new variable from the data, rather than merely applying a natural log or sqrt function.
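One way such a variable might be built is sketched below. This assumes a user counts as a mobile user when `mobile_likes` is greater than zero; the name `mobile_check_in` and the sample values are hypothetical.

```r
# Hypothetical mobile_likes values for ten users.
mobile_likes <- c(0, 4, 0, 46, 1, 0, 12, 3, 0, 7)

# New binary variable: 1 if the user has liked anything from mobile, else 0.
mobile_check_in <- ifelse(mobile_likes > 0, 1, 0)

# Share of users with at least one mobile like.
mean(mobile_check_in)
```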
Identifying anomalies in the dataset
People under thirty tend to have more friends. There are some extremes at ages above 90 (some users may be lying). It’s also important to notice the outliers in our data, and to decide how to audit the data.
The unusual spike appears because Facebook uses the first day of the month as the default in the birthday drop-down, so we should expect the count for the first day to be surprisingly high. It may be that users simply skip over this detail, or that they want to keep their privacy. Whatever the case, it’s important to look over these anomalies and to understand the data better. A similar case is data where people with a much larger income are encoded as having a normal income.
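A simple way to spot a spike like this is to count how often each day of the month occurs. The sketch below uses invented birth days, not the real column.

```r
# Hypothetical birth days with an inflated count for day 1.
dob_day <- c(1, 1, 1, 1, 5, 12, 12, 19, 25, 31)

# Frequency of each day of the month.
counts <- table(dob_day)
counts

# The day with the highest count; a spike at 1 hints at an untouched default.
most_common_day <- names(counts)[which.max(counts)]
most_common_day
```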
Overplotting
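Overplotting means that points are drawn on top of each other, so we can’t see exactly how many points lie in each region of the plot. In this case I want one fully dark point to represent 20 overlapping points. As x (age) is discrete, the points line up in vertical columns, which doesn’t really describe age, since age is continuous. So instead I use jitter. The warning about omitted missing values appears because I limit the plot to ages 13-70.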
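Here is a minimal base R sketch of the same idea. The data is invented, and the transparency of 1/20 is my reading of "20 points per dark point". Note that subsetting as done here drops the out-of-range ages silently, whereas axis limits in a plotting library may warn about removed rows.

```r
set.seed(1)  # make the random jitter reproducible

# Hypothetical ages (whole numbers) and friend counts.
age <- c(13, 13, 14, 20, 20, 20, 35, 70, 95)
friend_count <- c(5, 80, 0, 300, 120, 45, 60, 10, 2000)

# Keep only ages 13 to 70, mirroring the limit mentioned above.
keep <- age >= 13 & age <= 70

# jitter() adds uniform noise; amount = 0.5 moves each age by at most half a year.
age_jittered <- jitter(age[keep], amount = 0.5)

# alpha = 1/20: about 20 overlapping points are needed for a fully dark mark.
plot(age_jittered, friend_count[keep],
     col = rgb(0, 0, 0, alpha = 1/20), pch = 16,
     xlab = "age (jittered)", ylab = "friend_count")
```

With only a handful of points the marks will be very faint; the transparency pays off on the full dataset, where dense regions turn dark.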