Exploratory Data Analysis on Facebook

2014-11-18 07:01 | Source

In this post, I want to perform exploratory data analysis on a Facebook dataset. The dataset contains almost 100,000 users, with variables ranging from age, birthday, and gender to likes, mobile likes, and so on.

Well, this isn’t actually a real Facebook dataset. It is pseudo data provided by data analysts at Facebook, so we can expect it to be close to the real thing. This analysis draws on my experience with the Udacity course Exploratory Data Analysis with R, which is where I acquired the dataset. You should check it out; it’s a course I really recommend.

Here I generate the HTML using Knit HTML in RStudio. The code is as given.

Overview

Now, to get better at analyzing the dataset, it’s good to have all the summaries we need for this analysis. First, I will do some basic summaries to get a better understanding of the dataset. The column names contain the prefix ‘dob’, which means date of birth.

##      userid             age            dob_day         dob_year   
##  Min.   :1000008   Min.   : 13.00   Min.   : 1.00   Min.   :1900  
##  1st Qu.:1298806   1st Qu.: 20.00   1st Qu.: 7.00   1st Qu.:1963  
##  Median :1596148   Median : 28.00   Median :14.00   Median :1985  
##  Mean   :1597045   Mean   : 37.28   Mean   :14.53   Mean   :1976  
##  3rd Qu.:1895744   3rd Qu.: 50.00   3rd Qu.:22.00   3rd Qu.:1993  
##  Max.   :2193542   Max.   :113.00   Max.   :31.00   Max.   :2000  
##                                                                   
##    dob_month         gender          tenure        friend_count   
##  Min.   : 1.000   female:40254   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.: 3.000   male  :58574   1st Qu.: 226.0   1st Qu.:  31.0  
##  Median : 6.000   NA's  :  175   Median : 412.0   Median :  82.0  
##  Mean   : 6.283                  Mean   : 537.9   Mean   : 196.4  
##  3rd Qu.: 9.000                  3rd Qu.: 675.0   3rd Qu.: 206.0  
##  Max.   :12.000                  Max.   :3139.0   Max.   :4923.0  
##                                  NA's   :2                        
##  friendships_initiated     likes         likes_received    
##  Min.   :   0.0        Min.   :    0.0   Min.   :     0.0  
##  1st Qu.:  17.0        1st Qu.:    1.0   1st Qu.:     1.0  
##  Median :  46.0        Median :   11.0   Median :     8.0  
##  Mean   : 107.5        Mean   :  156.1   Mean   :   142.7  
##  3rd Qu.: 117.0        3rd Qu.:   81.0   3rd Qu.:    59.0  
##  Max.   :4144.0        Max.   :25111.0   Max.   :261197.0  
##                                                            
##   mobile_likes     mobile_likes_received   www_likes       
##  Min.   :    0.0   Min.   :     0.00     Min.   :    0.00  
##  1st Qu.:    0.0   1st Qu.:     0.00     1st Qu.:    0.00  
##  Median :    4.0   Median :     4.00     Median :    0.00  
##  Mean   :  106.1   Mean   :    84.12     Mean   :   49.96  
##  3rd Qu.:   46.0   3rd Qu.:    33.00     3rd Qu.:    7.00  
##  Max.   :25111.0   Max.   :138561.00     Max.   :14865.00  
##                                                            
##  www_likes_received 
##  Min.   :     0.00  
##  1st Qu.:     0.00  
##  Median :     2.00  
##  Mean   :    58.57  
##  3rd Qu.:    20.00  
##  Max.   :129953.00  
##

##    userid age dob_day dob_year dob_month gender tenure friend_count
## 1 2094382  14      19     1999        11   male    266            0
## 2 1192601  14       2     1999        11 female      6            0
## 3 2083884  14      16     1999        11   male     13            0
## 4 1203168  14      25     1999        12 female     93            0
## 5 1733186  14       4     1999        12   male     82            0
## 6 1524765  14       1     1999        12   male     15            0
##   friendships_initiated likes likes_received mobile_likes
## 1                     0     0              0            0
## 2                     0     0              0            0
## 3                     0     0              0            0
## 4                     0     0              0            0
## 5                     0     0              0            0
## 6                     0     0              0            0
##   mobile_likes_received www_likes www_likes_received
## 1                     0         0                  0
## 2                     0         0                  0
## 3                     0         0                  0
## 4                     0         0                  0
## 5                     0         0                  0
## 6                     0         0                  0
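For readers who want to produce output like the above, here is a minimal sketch of loading a tab-separated file and printing its summary and first rows. The rows are made up, `tsv_lines` and `toy_path` are hypothetical names, and I am assuming the real dataset is a tab-separated file with a header row, which the post does not confirm.

```r
# Toy stand-in for the real file: a few made-up rows in tab-separated form.
tsv_lines <- c(
  "userid\tage\tgender\ttenure\tfriend_count",
  "1\t14\tmale\t266\t0",
  "2\t14\tfemale\t6\t0",
  "3\t21\tfemale\t400\t150",
  "4\t35\tmale\t1200\t80"
)
toy_path <- tempfile(fileext = ".tsv")   # hypothetical path, not the real file
writeLines(tsv_lines, toy_path)

# read.delim() reads tab-separated text and treats the first line as a header.
pf <- read.delim(toy_path, stringsAsFactors = TRUE)

summary(pf)   # per-column summary (quartiles for numbers, counts for factors)
head(pf)      # first rows of the data frame
```

Reading gender as a factor is what makes `summary()` print counts per level rather than a plain character summary.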

Lastly, we may want to plot the variables against one another to get better insight.

Now let’s discuss some of these graphs.

Female vs Male

As this dataset also contains gender, I want to know what in the analysis differentiates males from females. First, let’s take a look at each gender, starting with the friend count for both females and males.

## pf$gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      37      96     242     244    4923 
## -------------------------------------------------------- 
## pf$gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      27      74     165     182    4917

The median is a better measure than the mean here, because a few users with a very high friend_count drag the mean to the right of the distribution. The median is resistant to those outliers: it marks the value that half of the users fall below. By either measure, females have more friends than males (median 96 vs 74).
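To make the point about outliers concrete, here is a small sketch with made-up numbers. The `by()` call mirrors the shape of the output above (the `pf$gender:` headers suggest that form), but the data frame is a toy one, and `friend_median` and `friend_mean` are names I introduce for illustration.

```r
# Made-up friend counts; one female user has an extreme value.
pf <- data.frame(
  gender = factor(c("female", "female", "female", "male", "male", "male")),
  friend_count = c(10, 40, 4000, 5, 30, 70)
)

# Summary of friend_count split by gender.
by(pf$friend_count, pf$gender, summary)

# The single outlier moves the female mean a lot, but not the median.
friend_median <- tapply(pf$friend_count, pf$gender, median)
friend_mean <- tapply(pf$friend_count, pf$gender, mean)
friend_median
friend_mean
```

In this toy data the female median stays at 40 while the mean jumps to 1350, which is the kind of gap the real summary shows on a smaller scale.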

Years using Facebook vs number of users in the sample

We see what looks like exponential growth in Facebook users over the last 3 years. With color added, the graph looks as follows:

Tenure is measured in days, so 1 to 365 days make up one year. In this case I convert it to years, so that the first year covers the range from 0 to 1, which makes the histogram more convenient to read. Then I bin it with a binwidth of 0.25.
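A rough sketch of that binning, assuming the conversion is simply tenure divided by 365. The tenure values are made up, and I use base R `hist()` here instead of a plotting package so the example runs on its own.

```r
# Made-up tenure values in days, with one missing value like the real data.
tenure_days <- c(30, 90, 200, 365, 400, 730, 1000, NA)
tenure_years <- tenure_days / 365

# Bin edges every quarter of a year, wide enough to cover the data.
edges <- seq(0, ceiling(max(tenure_years, na.rm = TRUE)), by = 0.25)

# hist() drops non-finite values; plot = FALSE returns only the counts.
h <- hist(tenure_years, breaks = edges, plot = FALSE)
data.frame(from = head(h$breaks, -1), count = h$counts)
```

Each row of the result is one quarter-year bin, so the first four rows together cover the first year of tenure.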

Let’s see users’ age in this dataset.

Some people don’t have any friends, so their friend_count is 0, and the base-10 log of zero goes to negative infinity, which would disrupt the data. To avoid this, I increment every friend count by 1. A log transform pulls in the long tail, so the data looks closer to a normal distribution despite the outliers. I also use sqrt as a second transform, to compare it with log base 10.
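The effect of adding 1 before taking the log can be checked directly. The counts below are made up (apart from 4923, the maximum in the summary above), and `log_fc` and `sqrt_fc` are illustrative names.

```r
# Made-up friend counts that include users with zero friends.
friend_count <- c(0, 0, 9, 99, 999, 4923)

log10(friend_count)                 # the zeros become -Inf

log_fc <- log10(friend_count + 1)   # shifting by 1 keeps every value finite
sqrt_fc <- sqrt(friend_count)       # sqrt needs no shift, since sqrt(0) is 0

log_fc
sqrt_fc
```

Both transforms compress the large values, but the log does so much more strongly than the square root.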

Female vs Male, Continued

Now let’s see which gender likes the most through the web platform.

The data seems to tell me that females like more often than males. Let’s test this using a different approach.

## pf$gender: female
## [1] 3507665
## -------------------------------------------------------- 
## pf$gender: male
## [1] 1430175

And by using a boxplot:

The boxplot shows that women tend to have a higher friend_count than men.

On average, who initiated more friendships in our sample: men or women?

## pf$gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    19.0    49.0   113.9   124.8  3654.0 
## -------------------------------------------------------- 
## pf$gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    15.0    44.0   103.1   111.0  4144.0

Here I see that women initiated more friendships than men. A boxplot has an advantage over the plain by() function with a summary split by gender, because it gives me a better sense of where the outliers are and of the shape of the distribution.

Checking if the Facebook mobile app really benefits the company

## [1] 0.6459097

With almost 65% of users on mobile devices, I can see why Facebook should make a mobile app. It’s very beneficial for the company.
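The post does not show how the 0.6459097 figure was computed. One plausible way, and this is my assumption, is to flag every user with at least one mobile like and take the share of flagged users. The numbers below are made up, and `mobile_check_in` and `share_mobile` are hypothetical names.

```r
# Made-up mobile like counts for ten users.
mobile_likes <- c(0, 0, 3, 12, 0, 7, 1, 0, 25, 4)

# 1 if the user has ever liked something from mobile, 0 otherwise.
mobile_check_in <- ifelse(mobile_likes > 0, 1, 0)

# The mean of a 0/1 flag is the proportion of users with the flag set.
share_mobile <- mean(mobile_check_in)
share_mobile
```

Turning a count into a yes/no flag like this is an example of creating a new variable from the data rather than only rescaling an existing one.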

Don’t just try to understand the kinds of data that you have, but also what transformations you can make with your data. In this example I synthesize a new variable from the data, rather than merely applying a natural log or sqrt function.

Identify the anomaly of the dataset

People below thirty tend to have more friends. There are some extremes where age > 90 (some may be lying). It’s also important to notice the outliers in our data and decide how to audit the data.

The unusual spike arises because Facebook uses the first day as the default in the birthday drop-down, so we should expect the count for the first day to be surprisingly high. It may be that users just skip over the detail, or want to maintain their privacy. Whatever the case, it’s important to look over these anomalies to understand the data better. …ed data, where people who have a much larger income are encoded as a normal income in stackholder.

Overplotting

Overplotting means so many points overlap that we can’t see what is really being plotted. In this case I want each fully dark point to represent 20 points. As X (age) is discrete, plotting only at whole-number ages doesn’t really describe age, so I use jitter instead. The warning about omitted missing values appears because I limit the plot to ages 13-70.

The points are now more spread out in the plot. Also keep in mind that alpha=1/20 in the geom means it takes 20 points at the same coordinate to make it completely black. From this I can see that most users over age 30 (in the blocks of black) have fewer than 1000 friends.
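The same two ideas, jitter and transparency, can be sketched in base R graphics. This is a swapped-in technique, not the plotting code behind the post, and the ages and friend counts are randomly generated stand-ins.

```r
set.seed(1)

# Randomly generated stand-in data: integer ages and made-up friend counts.
age <- sample(13:90, 500, replace = TRUE)
friend_count <- rpois(500, lambda = 200)

# Keep only ages 13-70, as in the post.
keep <- age >= 13 & age <= 70

# jitter() adds small random noise so integer ages stop stacking in columns;
# alpha.f = 1/20 means about 20 overlapping points are needed for solid black.
plot(jitter(age[keep]), friend_count[keep],
     col = adjustcolor("black", alpha.f = 1 / 20), pch = 16,
     xlab = "age", ylab = "friend_count")
```

Jittering only the x values keeps the friend counts exact, which matters when many users have a count of zero.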

We have some noise at age 69. The majority of Facebook users are below age 30 and roughly follow a normal distribution, whereas ages beyond 70 show some irregular peaks (either true, or users lying).

Age and Friend Count

In the earlier plot, we can see that age is a discrete value, so I can’t draw a continuous line graph of age against friend count. To do this, I have to create a new variable that makes age continuous by including the birth month. The birth month (variable dob_month), ranging from 1 to 12, is rescaled to a fraction between 0 and 1 and added to the age.
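The exact formula is not given in the post. One version that reproduces values such as 13.16667 and 13.25 is to add the fraction of the year remaining after the birth month, which assumes ages are measured at the end of the year. The three users below are made up.

```r
# Three made-up users: age in whole years and month of birth.
age <- c(13, 13, 14)
dob_month <- c(10, 9, 1)

# Assumption: someone born in December is exactly `age` years old, and each
# earlier birth month adds one twelfth of a year.
age_with_months <- age + (12 - dob_month) / 12
age_with_months
```

With this convention a user born in January ends up almost a full year older than one born in December of the same year.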

Next, I create a new data frame, so I don’t mess up the original.

## Source: local data frame [6 x 4]
## 
##   age_with_months friend_count_mean friend_count_median  n
## 1        13.16667          46.33333                30.5  6
## 2        13.25000         115.07143                23.5 14
## 3        13.33333         136.20000                44.0 25
## 4        13.41667         164.24242                72.0 33
## 5        13.50000         131.17778                66.0 45
## 6        13.58333         156.81481                64.0 54
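A table with these columns could be built by grouping on age_with_months and computing the mean, median, and count per group. The header in the output hints at a grouping package, but this sketch sticks to base R so that it runs without extra installs. The rows are made up, and `by_age_months` is a name I introduce.

```r
# Made-up rows: several users share the same fractional age.
pf <- data.frame(
  age_with_months = c(13.25, 13.25, 13.25, 13.5, 13.5),
  friend_count = c(10, 20, 60, 5, 15)
)

# Split friend counts into one vector per distinct age_with_months.
groups <- split(pf$friend_count, pf$age_with_months)

# One summary row per group, stored in a new data frame.
by_age_months <- data.frame(
  age_with_months = as.numeric(names(groups)),
  friend_count_mean = sapply(groups, mean),
  friend_count_median = sapply(groups, median),
  n = sapply(groups, length),
  row.names = NULL
)
by_age_months <- by_age_months[order(by_age_months$age_with_months), ]
by_age_months
```

Because the result lives in its own data frame, the original `pf` is left untouched.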

You can see that I have synthesized a new data frame with a continuous age, shown by the age_with_months variable. Now let’s plot this.

Now we have a continuous age, and we see that age 70 peaks as an outlier. To summarize this plot, I drew 3 different plots, each with a different kind of summary. From these we can judge which is the best summary of this plot.

Keep in mind that this is just descriptive statistics, as opposed to inferential statistics. From this plot, we can’t infer that as people get older they have a lower friend_count.

Female vs Male, Continued

Notice from the boxplot that women have the higher numbers, with a median beyond 30.

## Source: local data frame [6 x 5]
## 
##   age gender median_friend_count mean_friend_count    n
## 1  13 female                 148          259.1606  193
## 2  13   male                  55          102.1340  291
## 3  14 female                 224          362.4286  847
## 4  14   male                  92          164.1456 1078
## 5  15 female                 276          538.6813 1139
## 6  15   male                 106          200.6658 1478

This plot shows the friend count across the range of ages for each gender. We also spot that younger people tend to have more friends. Now you may want to ask a different question: by what ratio do women have more friends than men?

##   age female male
## 1  13    148   55
## 2  14    224   92
## 3  15    276  106
## 4  16    258  136
## 5  17    245  125
## 6  18    243  122

The pseudo Facebook data may reflect that many people join from various other countries where males tend to have a lower friend count than females. This shows us that younger women tend to have about twice the friend count of men, or more.
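The ratio can be worked out from the median table above. The six rows here are copied from that table, and `medians` and `ratio` are names I introduce for the sketch.

```r
# Median friend counts by age and gender, copied from the table above.
medians <- data.frame(
  age = 13:18,
  female = c(148, 224, 276, 258, 245, 243),
  male = c(55, 92, 106, 136, 125, 122)
)

# How many times more friends the median woman has than the median man.
medians$ratio <- medians$female / medians$male
round(medians$ratio, 2)
# 2.69 2.43 2.60 1.90 1.96 1.99
```

For these six ages the ratio starts above 2.4 for 13 to 15 year olds and settles at roughly 2 from age 16.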

In this plot we observe 3 variables, using x=friend_count, y=age, and year_joined.bucket as a categorical variable. Notice how people who joined earlier (longer tenure) have a higher friend count than those who joined later.
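The post does not show how year_joined.bucket was built. One way to derive such a variable from tenure is sketched below. The reference year of 2014 and the bucket edges are my assumptions, and the tenure values are made up apart from 3139, the maximum in the summary.

```r
# Tenure in days; made-up values plus one missing entry.
tenure <- c(30, 400, 800, 1500, 3139, NA)

reference_year <- 2014   # assumption: the year the data was collected
year_joined <- floor(reference_year - tenure / 365)

# cut() turns the numeric year into a categorical bucket.
# The break points are hypothetical; dig.lab = 4 keeps the labels readable.
year_joined.bucket <- cut(year_joined,
                          breaks = c(2004, 2009, 2011, 2012, 2014),
                          dig.lab = 4)
table(year_joined.bucket, useNA = "ifany")
```

Each bucket is open on the left and closed on the right, so a user who joined in 2012 lands in (2011,2012] rather than (2012,2014].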

By plotting these, we know that the mean graph isn’t entirely an artifact. So we want to ask another question: how many friends does a user gain each day?

Friendships Initiated.

Now this is what we want to ask: which year bucket initiates more friendships than the others?

This shows that people with more tenure typically have fewer friendships_initiated. Let’s see if we can clean up the noise and get better insight.

Smoothing also gives us a better understanding of the data. Be cautious though, as it also destroys some detail that we may want to pay attention to.
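One simple way to smooth, and not necessarily the one used for the plots in this post, is to group tenure into wider bins and average within each bin. The data below is simulated noise around a made-up decreasing trend, and the 30-day bin width is an arbitrary choice.

```r
set.seed(42)

# Simulated data: a decreasing trend plus random noise, one row per tenure day.
daily <- data.frame(tenure = 1:210)
daily$friendships_initiated <- 100 / sqrt(daily$tenure) + rnorm(210, sd = 3)

# Round tenure to the nearest multiple of 30 days (arbitrary bin width).
daily$tenure_bin <- 30 * round(daily$tenure / 30)

# Average within each bin: fewer, less noisy points.
binned <- aggregate(friendships_initiated ~ tenure_bin, data = daily, FUN = mean)

nrow(daily)    # points before binning
nrow(binned)   # points after binning
```

The wider the bins, the smoother the line, but any short-lived spike inside a bin gets averaged away, which is the detail loss the caution above refers to.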

Many interesting variables are derived from two or more others. For example, we might wonder how much of a person’s network on a service like Facebook the user actively initiated. Two users with the same degree (or number of friends) might be very different if one initiated most of those connections on the service, while the other initiated very few. So it could be useful to consider this proportion of existing friendships that the user initiated. This might be a good predictor of how active a user is compared with their peers, or other traits, such as personality (i.e., is this person an extrovert?).
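As a small sketch of such a derived variable, the proportion can be computed by dividing friendships_initiated by friend_count, with users who have no friends set to NA to avoid dividing by zero. The rows are made up, and `prop_initiated` is a name I am assuming for the new column.

```r
# Made-up users; the second and third have the same number of friends
# but initiated very different shares of those friendships.
pf <- data.frame(
  friend_count = c(0, 10, 10, 200),
  friendships_initiated = c(0, 9, 1, 50)
)

# Share of existing friendships that the user initiated.
# Users with zero friends get NA instead of 0/0.
pf$prop_initiated <- ifelse(pf$friend_count > 0,
                            pf$friendships_initiated / pf$friend_count,
                            NA)
pf
```

The two users with 10 friends come out at 0.9 and 0.1, which is exactly the difference that the raw friend count hides.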

Let’s see if we can smooth these.

So why do people who joined late initiate more friendships? Because people who join late tend to have friends who are already on Facebook, so they catch up by inviting their friends. When they first join, they are asked to invite friends they know, whereas earlier Facebook hadn’t implemented such a system. Facebook has become very famous, and a sort of de facto standard for social networks. People (must) use Facebook to engage with their network. Like Google, many smartphone apps nowadays use a Facebook account for quick login.