In this lesson we look at three or more variables at once. In particular, we use a third variable to check whether the relationship between two variables stays consistent across the values of that third variable.

Moira Perceived Audience Size Colored by Age

Third Qualitative Variable

library(ggplot2)
?read.csv
pf = read.csv('../lesson3/pseudo_facebook.tsv',sep = "\t")
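ggplot(aes(x = gender, y = age),
       data = subset(pf, !is.na(gender))) +
  geom_boxplot() +
  # mark the mean age of each gender with an x (shape 4);
  # fun.y is deprecated in ggplot2 3.3.0 and later, where the argument is called fun
  stat_summary(fun.y = mean, geom = "point", shape = 4)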

#+ geom_histogram()

# median friend count by age, one line per gender
ggplot(aes(x = age, y = friend_count),
       data = subset(pf, !is.na(gender))) +
  geom_line(aes(color = gender), stat = "summary", fun.y = median)

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
pf.fc_by_age_gender <- group_by(pf, age, gender) %>%
  filter(!is.na(gender)) %>%
  summarise(median_friend_count = median(friend_count),
            mean_friend_count = mean(friend_count),
            n = n()) %>%
  # summarise() drops the last grouping level (gender), so the result is
  # still grouped by age; ungroup() removes that remaining layer before sorting
  ungroup() %>%
  arrange(age)
head(pf.fc_by_age_gender)
## Source: local data frame [6 x 5]
## 
##   age gender median_friend_count mean_friend_count    n
## 1  13 female                 148          259.1606  193
## 2  13   male                  55          102.1340  291
## 3  14 female                 224          362.4286  847
## 4  14   male                  92          164.1456 1078
## 5  15 female                 276          538.6813 1139
## 6  15   male                 106          200.6658 1478
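
The effect of ungroup() in the pipeline above can be checked on a small made-up data frame. This is only a sketch: toy and its values are hypothetical, and it assumes a dplyr version that provides group_vars() and the default behaviour of summarise(), which drops the last grouping level.

```r
library(dplyr)

# hypothetical toy data, not taken from pseudo_facebook.tsv
toy <- data.frame(age = c(13, 13, 13, 14, 14),
                  gender = c("female", "female", "male", "female", "male"),
                  friend_count = c(100, 200, 50, 300, 90))

by_age_gender <- toy %>%
  group_by(age, gender) %>%
  summarise(median_friend_count = median(friend_count),
            n = n())

# summarise() peeled off gender, so one grouping level (age) is left
group_vars(by_age_gender)

# ungroup() removes the remaining level
group_vars(ungroup(by_age_gender))
```

Without the ungroup() step, later verbs would still operate within each age group, which is usually not what is wanted once the summary table is complete.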

Plotting Conditional Summaries

Create a line graph showing the median friend count over the ages for each gender. Be sure to use the data frame you just created, pf.fc_by_age_gender.

Instructor Notes

Your code should look similar to the code we used to make the plot the first time. It will not need to make use of the stat and fun.y parameters.

ggplot(aes(x = age, y = friend_count), data = subset(pf, !is.na(gender))) + geom_line(aes(color = gender), stat = 'summary', fun.y = median)

ggplot(aes(x= age, y=median_friend_count),
       data = pf.fc_by_age_gender)+
  geom_line(aes(color=gender))


Thinking in Ratios

Wide and Long Format

Reshaping Data

It's important to use quotes around the variable name that is assigned to value.var.

We could also create a similar data frame using the dplyr package. Within each age group, the logical test gender == 'male' picks out the male row of median_friend_count, and likewise for female.

pf.fc_by_age_gender.wide <- pf.fc_by_age_gender %>%
  group_by(age) %>%
  summarise(male = median_friend_count[gender == 'male'],
            female = median_friend_count[gender == 'female'],
            ratio = female / male) %>%
  arrange(age)

head(pf.fc_by_age_gender.wide)

library(reshape2)
pf.fc_by_age_gender.wide <- dcast(pf.fc_by_age_gender,
                                  # formula: the left side (age) stays as rows,
                                  # the right side (gender) is spread into columns
                                  age ~ gender,
                                  value.var = 'median_friend_count')
head(pf.fc_by_age_gender.wide)
##   age female male
## 1  13    148   55
## 2  14    224   92
## 3  15    276  106
## 4  16    258  136
## 5  17    245  125
## 6  18    243  122
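
For comparison, the same long-to-wide step can be sketched with reshape() from base R. The data frame long below is typed in by hand from the first four rows of the long-format head() output shown earlier, and the names long and wide are illustrative only. Note that reshape() names the new columns differently from dcast().

```r
# first four rows of the long-format summary, typed in by hand
long <- data.frame(age = c(13, 13, 14, 14),
                   gender = c("female", "male", "female", "male"),
                   median_friend_count = c(148, 55, 224, 92))

# idvar stays as rows, timevar is spread into columns
wide <- reshape(long, idvar = "age", timevar = "gender", direction = "wide")

# column names are built as <value column>.<gender>
names(wide)
wide
```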

Ratio Plot

Plot the ratio of the female to male median friend counts using the data frame pf.fc_by_age_gender.wide.

Think about what geom you should use. Add a horizontal line to the plot with a y intercept of 1, which will be the base line. Look up the documentation for geom_hline to do that. Use the parameter linetype in geom_hline to make the line dashed.

The linetype parameter can take the values 0-6: 0 = blank, 1 = solid, 2 = dashed, 3 = dotted, 4 = dotdash, 5 = longdash, 6 = twodash.

library(ggplot2)
ggplot(aes(x = age, y = female / male),
       data = pf.fc_by_age_gender.wide) +
  geom_line() +
  # dashed baseline at a ratio of 1 (equal medians)
  geom_hline(yintercept = 1, linetype = 2)

The pseudo-Facebook data suggest that male users tend to have lower friend counts than female users, possibly because many users join from various other countries. The plot shows that younger women tend to have about twice the median friend count of men of the same age, or more.
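
As a quick check on the plot, the ratio can be computed by hand for the six ages shown in the head() output of pf.fc_by_age_gender.wide. The data frame wide_head below is a hand-typed copy of those rows, so it covers only ages 13 to 18.

```r
# hand-typed copy of the six rows printed by head(pf.fc_by_age_gender.wide)
wide_head <- data.frame(age = 13:18,
                        female = c(148, 224, 276, 258, 245, 243),
                        male = c(55, 92, 106, 136, 125, 122))

# female-to-male ratio of median friend counts, rounded for readability
wide_head$ratio <- round(wide_head$female / wide_head$male, 2)
wide_head
```

For these ages the ratio stays above 1.8, which is why the line sits well above the dashed baseline at 1.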

Third Quantitative Variable

Create a variable called year_joined in the pf data frame using the variable tenure and 2014 as the reference year.

The variable year_joined should contain the year in which a user joined Facebook.

Instructor Notes

A common mistake is to use tenure rather than pf$tenure or with(pf, tenure…). Remember that you need to access the variable in the data frame. This is not one of the hints! :)

Hint 1: Divide the tenure variable by a number. Tenure is measured in days, but we want to convert it to years.

Hint 2: Subtract tenure measured in years from 2014. What does the decimal portion represent? Should we round up or round down to the closest year?

Hint 3: You can use the floor() function to round down to the nearest integer. You can use the ceiling() function to round up to the nearest integer. Which one should you use?

pf$year_joined <- floor(2014 - pf$tenure/365)
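
To see how floor() and ceiling() differ here, the formula can be tried on a few made-up tenure values. The vector tenure_days is hypothetical, and the sketch follows the exercise in treating 2014 as the reference point and a year as 365 days.

```r
# hypothetical tenures in days
tenure_days <- c(100, 365, 400, 1000)

# fractional join year relative to the 2014 reference
exact <- 2014 - tenure_days / 365

floor(exact)    # the whole year in which the fractional value falls
ceiling(exact)  # one year later whenever there is a fractional part
```

With floor(), a user with a tenure of 100 days is assigned to 2013 rather than 2014, which matches the choice made in the line above.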

Cut a Variable

Using table, we can see how many users joined in each year. Next we bin year_joined into ranges with the cut function, which turns it into a categorical variable.

Create a new variable in the data frame called year_joined.bucket by using the cut function on the variable year_joined.

You need to create the following buckets for the new variable, year_joined.bucket

   (2004, 2009]
   (2009, 2011]
   (2011, 2012]
   (2012, 2014]

Note that a parenthesis means the year is excluded and a bracket means the year is included.

?cut
summary(pf$year_joined)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2005    2012    2012    2012    2013    2014       2
table(pf$year_joined)
## 
##  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014 
##     9    15   581  1507  4557  5448  9860 33366 43588    70
pf$year_joined.bucket <- cut(pf$year_joined, breaks=c(2004,2009,2011,2012,2014))
table(pf$year_joined.bucket)
## 
## (2004,2009] (2009,2011] (2011,2012] (2012,2014] 
##        6669       15308       33366       43658
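
The open and closed ends of the intervals can be verified on a handful of years. The vector years is made up for illustration; the breaks are the ones used above.

```r
# illustrative years, including both edges of several buckets
years <- c(2004, 2005, 2009, 2010, 2011, 2012, 2013, 2014)

buckets <- cut(years, breaks = c(2004, 2009, 2011, 2012, 2014))

# 2004 falls outside (2004,2009] and becomes NA; 2009 is included in it
data.frame(year = years, bucket = buckets)
```

Values outside every interval, like 2004 here, come out as NA, and so do missing values of year_joined.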

Plotting it All Together

We have now derived year_joined from tenure and binned it into year_joined.bucket. Next we bring that bucket together with age and friend_count in a single plot.

Create a line graph of friend_count vs. age so that each year_joined.bucket is a line tracking the median user friend_count across age. This means you should have four different lines on your plot.

You should subset the data to exclude the users whose year_joined.bucket is NA.

table(pf$year_joined, useNA = 'ifany')
## 
##  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  <NA> 
##     9    15   581  1507  4557  5448  9860 33366 43588    70     2
ggplot(aes(x = age, y = friend_count),
       data = subset(pf, !is.na(year_joined.bucket)))+
  geom_line(aes(color=year_joined.bucket), stat='summary', fun.y = median)