In this lesson we add a third variable (or more) to the analysis. A third variable also lets us check whether the relationship between two variables holds consistently across groups.
library(ggplot2)
?read.csv
pf = read.csv('../lesson3/pseudo_facebook.tsv',sep = "\t")
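ggplot(aes(x = gender, y = age),
       data = subset(pf, !is.na(gender))) +
  geom_boxplot() +
  # Mark each group's mean with an "x" (shape 4).
  # Note: ggplot2 3.3.0 and later prefer fun = mean over fun.y = mean.
  stat_summary(fun.y = mean, geom = "point", shape = 4)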
#+ geom_histogram()
ggplot(aes(x = age, y = friend_count),
data= subset(pf, !is.na(gender)))+
geom_line(aes(color=gender), stat="summary", fun.y = median)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
pf.fc_by_age_gender <- group_by(pf, age, gender) %>%
  filter(!is.na(gender)) %>%
  summarise(median_friend_count = median(friend_count),
            mean_friend_count = mean(friend_count),
            n = n()) %>%
  # We grouped by age and gender above. summarise() drops the gender layer;
  # ungroup() removes the remaining grouping before we arrange by age.
  ungroup() %>%
  arrange(age)
head(pf.fc_by_age_gender)
## Source: local data frame [6 x 5]
##
## age gender median_friend_count mean_friend_count n
## 1 13 female 148 259.1606 193
## 2 13 male 55 102.1340 291
## 3 14 female 224 362.4286 847
## 4 14 male 92 164.1456 1078
## 5 15 female 276 538.6813 1139
## 6 15 male 106 200.6658 1478
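To see why the ungroup() step is there, here is a minimal sketch on a made-up six-row data frame. The toy data and its values are hypothetical and not taken from pf, and the sketch assumes a reasonably recent dplyr in which group_vars() reports the grouping columns. summarise() drops only the innermost grouping level, so the result stays grouped by age until ungroup() is called.

```r
library(dplyr)

# Hypothetical toy data standing in for pf
toy <- data.frame(
  age = c(13, 13, 13, 14, 14, 14),
  gender = c("female", "male", "male", "female", "female", "male"),
  friend_count = c(10, 20, 30, 40, 50, 60)
)

by_age_gender <- toy %>%
  group_by(age, gender) %>%
  summarise(median_friend_count = median(friend_count), n = n())

group_vars(by_age_gender)           # grouping that survives summarise()
group_vars(ungroup(by_age_gender))  # grouping after ungroup()
```

Newer dplyr versions print a message about the remaining grouping after summarise(); it is informational only.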
Create a line graph showing the median friend count over the ages for each gender. Be sure to use the data frame you just created, pf.fc_by_age_gender.
Instructor Notes
Your code should look similar to the code we used to make the plot the first time. It will not need to make use of the stat and fun.y parameters.
ggplot(aes(x = age, y = friend_count), data = subset(pf, !is.na(gender))) +
  geom_line(aes(color = gender), stat = 'summary', fun.y = median)
ggplot(aes(x= age, y=median_friend_count),
data = pf.fc_by_age_gender)+
geom_line(aes(color=gender))
It's important to use quotes around the variable name that is assigned to value.var.
We could also create a similar data frame using the dplyr package.
pf.fc_by_age_gender.wide <- pf.fc_by_age_gender %>%
  group_by(age) %>%
  summarise(male = median_friend_count[gender == 'male'],
            female = median_friend_count[gender == 'female'],
            ratio = female / male) %>%
  arrange(age)
head(pf.fc_by_age_gender.wide)
library(reshape2)
pf.fc_by_age_gender.wide <- dcast(pf.fc_by_age_gender,
                                  # formula: left side = variable kept as rows (age),
                                  # right side = variable whose values become columns (gender)
                                  age ~ gender,
                                  value.var = 'median_friend_count')
head(pf.fc_by_age_gender.wide)
## age female male
## 1 13 148 55
## 2 14 224 92
## 3 15 276 106
## 4 16 258 136
## 5 17 245 125
## 6 18 243 122
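The same long-to-wide reshape can probably also be done with tidyr, which is an alternative to reshape2 and not what this lesson uses. The sketch below assumes tidyr 1.0 or later is installed and retypes the first four rows of pf.fc_by_age_gender by hand, so it runs without the data file.

```r
library(tidyr)

# First four rows of pf.fc_by_age_gender, typed in by hand
long <- data.frame(
  age = c(13, 13, 14, 14),
  gender = c("female", "male", "female", "male"),
  median_friend_count = c(148, 55, 224, 92)
)

# One row per age, one column per gender value
wide <- pivot_wider(long,
                    names_from = gender,
                    values_from = median_friend_count)
wide
```

Here names_from plays the role of the right side of the dcast formula and values_from plays the role of value.var.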
Plot the ratio of the female to male median friend counts using the data frame pf.fc_by_age_gender.wide.
Think about what geom you should use. Add a horizontal line to the plot with a y intercept of 1, which will be the baseline. Look up the documentation for geom_hline to do that. Use the parameter linetype in geom_hline to make the line dashed.
The linetype parameter can take the values 0-6: 0 = blank, 1 = solid, 2 = dashed, 3 = dotted, 4 = dotdash, 5 = longdash, 6 = twodash.
library(ggplot2)
ggplot(aes(x = age, y = female / male),
       data = pf.fc_by_age_gender.wide) +
  geom_line() +
  # dashed baseline at ratio = 1 (equal median friend counts)
  geom_hline(yintercept = 1, linetype = 2)
The pseudo Facebook data may include many users from various countries where men tend to have lower friend counts than women. The plot shows that among younger users, women tend to have about twice as many friends as men, at the median.
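As a quick numeric check of the plot, the ratio can be computed directly. This sketch retypes the first six rows of pf.fc_by_age_gender.wide printed earlier, so it covers ages 13 to 18 only and says nothing about older users.

```r
# First six rows of pf.fc_by_age_gender.wide, typed in by hand
wide <- data.frame(
  age = 13:18,
  female = c(148, 224, 276, 258, 245, 243),
  male = c(55, 92, 106, 136, 125, 122)
)

# Female-to-male ratio of median friend counts
wide$ratio <- round(wide$female / wide$male, 2)
wide
```

For these ages the ratio stays between roughly 1.9 and 2.7, well above the baseline of 1.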
Create a variable called year_joined in the pf data frame using the variable tenure and 2014 as the reference year.
The variable year_joined should contain the year that a user joined Facebook.
Instructor Notes
A common mistake is to use tenure rather than pf$tenure or with(pf, tenure…). Remember that you need to access the variable in the data frame. This is not one of the hints! :)
Hint 1: Divide the tenure variable by a number. Tenure is measured in days, but we want to convert it to years.
Hint 2: Subtract tenure measured in years from 2014. What does the decimal portion represent? Should we round up or round down to the closest year?
Hint 3: You can use the floor() function to round down to the nearest integer. You can use the ceiling() function to round up to the nearest integer. Which one should you use?
pf$year_joined <- floor(2014 - pf$tenure/365)
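To make the floor() versus ceiling() choice concrete, here is a small sketch with hypothetical tenure values (in days, not taken from pf). It shows the fractional year next to both roundings.

```r
# Hypothetical tenure values in days
tenure_days <- c(0, 100, 365, 400, 1000)

# Fractional "year joined" relative to the 2014 reference year
year_exact <- 2014 - tenure_days / 365

data.frame(
  tenure_days,
  year_exact = round(year_exact, 2),
  floor_year = floor(year_exact),
  ceiling_year = ceiling(year_exact)
)
```

A user with 100 days of tenure lands in 2013 with floor() and in 2014 with ceiling(). Which one is appropriate depends on how the 2014 reference point is interpreted; the solution above uses floor().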
Using table(), we can see how many users joined in each year. Next we want to bin year_joined into ranges with the cut function, so that it can be used as a categorical variable.
Create a new variable in the data frame called year_joined.bucket by using the cut function on the variable year_joined.
You need to create the following buckets for the new variable, year_joined.bucket
(2004, 2009]
(2009, 2011]
(2011, 2012]
(2012, 2014]
Note that a parenthesis means exclude the year and a bracket means include the year.
?cut
summary(pf$year_joined)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2005 2012 2012 2012 2013 2014 2
table(pf$year_joined)
##
## 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
## 9 15 581 1507 4557 5448 9860 33366 43588 70
pf$year_joined.bucket <- cut(pf$year_joined, breaks=c(2004,2009,2011,2012,2014))
table(pf$year_joined.bucket)
##
## (2004,2009] (2009,2011] (2011,2012] (2012,2014]
## 6669 15308 33366 43658
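The interval notation is easier to check on a few values that sit exactly on the break points. The years below are hypothetical and chosen for illustration; the breaks are the same as in the call above.

```r
# Hypothetical join years on and around the break points
years <- c(2004, 2005, 2009, 2010, 2011, 2012, 2013, 2014, NA)

buckets <- cut(years, breaks = c(2004, 2009, 2011, 2012, 2014))

# Each year next to the bucket it falls into
data.frame(year = years, bucket = buckets)
table(buckets, useNA = "ifany")
```

With the default right-closed intervals, 2009 falls into (2004,2009] and 2012 into (2011,2012], while 2004 itself gets NA because the lowest break is excluded. That does not matter for pf, where the earliest year_joined is 2005.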
We have now converted tenure into year_joined and used year_joined to create the buckets. Next we combine the buckets with age.
Create a line graph of friend_count vs. age so that each year_joined.bucket is a line tracking the median user friend_count across age. This means you should have four different lines on your plot.
You should subset the data to exclude the users whose year_joined.bucket is NA.
table(pf$year_joined, useNA = 'ifany')
##
## 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 <NA>
## 9 15 581 1507 4557 5448 9860 33366 43588 70 2
ggplot(aes(x = age, y = friend_count),
data = subset(pf, !is.na(year_joined.bucket)))+
geom_line(aes(color=year_joined.bucket), stat='summary', fun.y = median)