In this lesson we examine three or more variables at once. We also use a third variable to check whether the relationship between two variables of interest holds consistently.
pf = read.csv('../lesson3/pseudo_facebook.tsv',sep = "\t")
ggplot(aes(x = gender, y = age),
data = subset(pf, ! +geom_boxplot()+
stat_summary(fun.y = mean, geom = "point", shape =4 )
#+ geom_histogram()
ggplot(aes(x = age, y = friend_count),
data= subset(pf, !
geom_line(aes(color=gender), stat="summary", fun.y = median)
pf.fc_by_age_gender <- group_by(pf,age,gender) %>%
summarise(median_friend_count = median(friend_count),
mean_friend_count = mean(friend_count),
# Earlier we grouped by age and gender. summarise() removes only the last grouping layer (gender),
# so use ungroup() to drop the remaining age grouping, then arrange by age
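The grouping, summarising and ungrouping steps described above can be sketched on a small made-up data frame. This is only an illustration: `toy` and its values are hypothetical stand-ins for pf, and the dplyr package is assumed to be installed.

```r
library(dplyr)

# Hypothetical toy data standing in for pf (values are made up)
toy <- data.frame(
  age          = c(13, 13, 13, 14, 14, 14),
  gender       = c("female", "female", "male", "female", "male", "male"),
  friend_count = c(10, 30, 5, 40, 8, 12)
)

toy.fc_by_age_gender <- toy %>%
  group_by(age, gender) %>%
  summarise(mean_friend_count   = mean(friend_count),
            median_friend_count = median(friend_count),
            n = n()) %>%
  ungroup() %>%    # drop whatever grouping summarise() left behind
  arrange(age)     # sort the summary rows by age

print(toy.fc_by_age_gender)
```

The result has one row per age and gender combination, which is the shape the line plot of median_friend_count against age expects.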
Create a line graph showing the median friend count over the ages for each gender. Be sure to use the data frame you just created, pf.fc_by_age_gender.
Instructor Notes
Your code should look similar to the code we used to make the plot the first time. It will not need to make use of the stat and fun.y parameters.
ggplot(aes(x = age, y = friend_count), data = subset(pf.1, ! + geom_line(aes(color = gender), stat = 'summary', fun.y = median)
ggplot(aes(x= age, y=median_friend_count),
data = pf.fc_by_age_gender)+
It's important to use quotes around the variable name that is assigned to value.var.
We could also create a similar data frame using the dplyr package (this assumes rows with a missing gender have already been excluded):
pf.fc_by_age_gender.wide <- pf.fc_by_age_gender %>%
  group_by(age) %>%
  summarise(male = median_friend_count[gender == 'male'],
            female = median_friend_count[gender == 'female'],
            ratio = female / male) %>%
  arrange(age)
pf.fc_by_age_gender.wide <- dcast(pf.fc_by_age_gender,
age ~ gender,  # formula: left side = variable kept as rows, right side = variable spread into columns
Plot the ratio of the female to male median friend counts using the data frame pf.fc_by_age_gender.wide.
Think about what geom you should use. Add a horizontal line to the plot with a y intercept of 1, which will be the base line. Look up the documentation for geom_hline to do that. Use the parameter linetype in geom_hline to make the line dashed.
The linetype parameter can take the values 0-6: 0 = blank, 1 = solid, 2 = dashed, 3 = dotted, 4 = dotdash, 5 = longdash, 6 = twodash.
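As a rough sketch of the ratio plot with a dashed baseline, the following uses a small made-up wide data frame. `toy.wide` and its numbers are hypothetical; only the column layout (age, female, male) mirrors pf.fc_by_age_gender.wide, and ggplot2 is assumed to be installed.

```r
library(ggplot2)

# Hypothetical wide data: one row per age, one column per gender (made-up values)
toy.wide <- data.frame(
  age    = c(13, 14, 15, 16),
  female = c(150, 180, 200, 210),
  male   = c(75, 100, 160, 200)
)

p <- ggplot(aes(x = age, y = female / male), data = toy.wide) +
  geom_line() +
  geom_hline(yintercept = 1, linetype = 2)  # dashed baseline where the two medians are equal

print(p)
```

Points above the dashed line are ages where the female median exceeds the male median, and points below it are the reverse.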
ggplot(aes(x=age, y = female/male),
One possible explanation is that many pseudo-Facebook users joined from countries where male users tend to have lower friend counts than female users. The plot shows that among younger users, the median woman has almost twice as many friends as the median man.
***
Create a variable called year_joined in the pf data frame using the variable tenure and 2014 as the reference year.
The variable year_joined should contain the year that a user joined Facebook.
A common mistake is to use tenure rather than pf$tenure or with(pf, tenure…). Remember that you need to access the variable in the data frame. This is not one of the hints! :)
Hint 1: Divide the tenure variable by a number. Tenure is measured in days, but we want to convert it to years.
Hint 2: Subtract tenure measured in years from 2014. What does the decimal portion represent? Should we round up or round down to the closest year?
Hint 3: You can use the floor() function to round down to the nearest integer. You can use the ceiling() function to round up to the nearest integer. Which one should you use?
pf$year_joined <- floor(2014 - pf$tenure/365)
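To see why floor() is the appropriate rounding here, it can help to run the same formula on a few made-up tenure values. The vector `tenure_days` below is hypothetical and not taken from pf.

```r
# Hypothetical tenure values in days (made up for illustration)
tenure_days <- c(0, 200, 365, 400, 1000, 3000)

# Same formula as above: convert days to years, subtract from 2014, round down
year_joined <- floor(2014 - tenure_days / 365)

print(data.frame(tenure_days, year_joined))
```

A user with 200 days of tenure gives 2014 - 0.548 = 2013.452. The decimal part is the fraction of 2013 that had already passed when they joined, so rounding down gives the calendar year of joining.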
Using table, we can now see how many users joined in each year. Next we want to bin year_joined into ranges with the cut function, so that we can treat it as a categorical variable.
Create a new variable in the data frame called year_joined.bucket by using the cut function on the variable year_joined.
You need to create the following buckets for the new variable, year_joined.bucket
(2004, 2009]
(2009, 2011]
(2011, 2012]
(2012, 2014]
Note that a parenthesis means exclude the year and a bracket means include the year.
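A minimal sketch of how cut could produce these buckets, run on a made-up vector of years rather than on pf. The break points are read off the bucket list above, and cut's default right-closed intervals match the parenthesis and bracket convention.

```r
# Hypothetical years, chosen to sit on and between the bucket edges
year_joined <- c(2004, 2005, 2009, 2010, 2012, 2014)

# Default right = TRUE gives intervals of the form (lower, upper]
year_joined.bucket <- cut(year_joined, breaks = c(2004, 2009, 2011, 2012, 2014))

print(data.frame(year_joined, year_joined.bucket))
print(table(year_joined.bucket, useNA = "ifany"))
```

In this sketch 2009 falls in the first bucket because the upper edge is included, while 2004 falls in no bucket and becomes NA because the lower edge is excluded.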
So far we have derived year_joined from tenure and binned it into year_joined.bucket.
Create a line graph of friend_count vs. age so that each year_joined.bucket is a line tracking the median user friend_count across age. This means you should have four different lines on your plot.
You should subset the data to exclude the users whose year_joined.bucket is NA.
table(pf$year_joined, useNA = 'ifany')
ggplot(aes(x = age, y = friend_count),
data = subset(pf, !
geom_line(aes(color=year_joined.bucket), stat='summary', fun.y = median)
This plot now shows three variables: age on the x-axis, friend_count on the y-axis, and year_joined.bucket as the categorical variable mapped to color.
***
Write code to do the following:
Add another geom_line to code below to plot the grand mean of the friend count vs age.
Exclude any users whose year_joined.bucket is NA.
Use a different line type for the grand mean.
As a reminder, the parameter linetype can take the values 0-6:
0 = blank, 1 = solid, 2 = dashed, 3 = dotted, 4 = dotdash, 5 = longdash, 6 = twodash
ggplot(aes(x = age, y = friend_count),
data = subset(pf, !
geom_line(aes(color=year_joined.bucket), stat='summary', fun.y = mean)+
geom_line(fun.y = mean, stat='summary', linetype=2)
with(subset(pf, tenure >= 1), summary(friend_count / tenure))
Users who have been on the site longer tend to have more friends, and with more tenure they have had more time to initiate friendships.
What is the median friend rate? .2205
What is the maximum friend rate? 417
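The friend rate is simply friend_count divided by tenure, restricted to users with at least one day of tenure. A small sketch on made-up numbers (the data frame `toy` is hypothetical) shows the calculation and why the tenure filter matters.

```r
# Hypothetical users; the first has zero tenure and would produce 0/0 = NaN
toy <- data.frame(
  friend_count = c(0, 10, 50, 300),
  tenure       = c(0, 5, 100, 600)
)

# Keep users with at least one day of tenure, then compute friends per day
rate <- with(subset(toy, tenure >= 1), friend_count / tenure)

print(summary(rate))
```

The median and maximum of `rate` are the two quantities the questions above ask for.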
Create a line graph of mean of friendships_initiated per day (of tenure) vs. tenure colored by year_joined.bucket.
You need to make use of the variables tenure, friendships_initiated, and year_joined.bucket.
You also need to subset the data to only consider users with at least one day of tenure.
ggplot(aes(x = tenure, y = friendships_initiated/tenure),
data = subset(pf, tenure >= 1)) +
This shows that users with more tenure typically initiate fewer friendships per day.
***
Notice that the graph is noisy. Rounding the x values into wider bins reduces the noise (variance) at the cost of more bias.
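The binning trick is just arithmetic on x: dividing by the bin width, rounding, and multiplying back snaps each value to the nearest multiple of that width. A sketch on made-up values (both vectors below are hypothetical):

```r
# Hypothetical tenure values (days) and a made-up per-user rate
tenure <- c(1, 3, 4, 10, 11, 17, 18, 25)
rate   <- c(2, 4, 1, 3, 2, 2, 5, 1)

# Snap each tenure to the nearest multiple of 7 (weekly bins)
week_bin <- 7 * round(tenure / 7)

# Mean rate per bin: fewer, smoother points than one mean per distinct tenure
binned_mean <- tapply(rate, week_bin, mean)

print(week_bin)
print(binned_mean)
```

Wider bins (30 or 90 instead of 7) average over more users per point, which smooths the line further but hides more of the local structure.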
Instead of geom_line(), use geom_smooth() to add a smoother to the plot. You can use the defaults for geom_smooth() but do color the line by year_joined.bucket
ggplot(aes(x = tenure, y = friendships_initiated / tenure),
data = subset(pf, tenure >= 1)) +
geom_line(aes(color = year_joined.bucket),
stat = 'summary',
fun.y = mean)
ggplot(aes(x = 7 * round(tenure / 7), y = friendships_initiated / tenure),
data = subset(pf, tenure > 0)) +
geom_line(aes(color = year_joined.bucket),
stat = "summary",
fun.y = mean)
ggplot(aes(x = 30 * round(tenure / 30), y = friendships_initiated / tenure),
data = subset(pf, tenure > 0)) +
geom_line(aes(color = year_joined.bucket),
stat = "summary",
fun.y = mean)
ggplot(aes(x = 90 * round(tenure / 90), y = friendships_initiated / tenure),
data = subset(pf, tenure > 0)) +
geom_line(aes(color = year_joined.bucket),
stat = "summary",
fun.y = mean)
ggplot(aes(x = tenure, y = friendships_initiated / tenure),
data = subset(pf, tenure >= 1)) +
geom_smooth(aes(color = year_joined.bucket))
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
Smoothing also gives us a clearer picture of the overall trend in the data.
***
Bayesian Statistics and Marketing contains the data set and a case study on it.
The citation for the original paper on the yogurt data set is Kim, Jaehwan, Greg M. Allenby, and Peter E. Rossi. “Modeling consumer demand for variety.” Marketing Science 21.3 (2002): 229-250.
A special thanks to Professor Allenby for helping us understand this data set.
To learn more about scanner data, check out Panel Data Discrete Choice Models of Consumer Demand.
***
The yogurt data set comes as a CSV file with a different structure: each row records one purchase occasion.
yo = read.csv('yogurt.csv')
yo$id <- factor(yo$id)
geom_histogram(stat='bin', binwidth =10)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Notice that the higher prices account for more of the purchases. With binwidth = 10 the histogram is far more biased: it hides the discreteness of the price variable, so we can no longer see the individual prices.
***
Now we want to count the total number of yogurts bought in each household purchase.
all.purchases <- transform(yo,table(yo$id))
yo <- transform(yo, all.purchases=strawberry+blueberry+pina.colada+plain+mixed.berry)
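To check what the transform call computes, here is the same expression applied to a tiny made-up data frame. `toy` and its counts are hypothetical; only the flavor column names follow the yogurt data.

```r
# Hypothetical purchase occasions: one row each, one count column per flavor
toy <- data.frame(
  strawberry  = c(0, 1, 2),
  blueberry   = c(0, 0, 1),
  pina.colada = c(1, 0, 0),
  plain       = c(0, 0, 0),
  mixed.berry = c(2, 1, 0)
)

# Add a column holding the total number of yogurts bought on each occasion
toy <- transform(toy,
                 all.purchases = strawberry + blueberry + pina.colada + plain + mixed.berry)

print(toy)
```

The new column should equal the row sums of the five flavor columns, which is a quick way to sanity-check it.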
Create a scatterplot of price vs time. This will be an example of a time series plot.
data = yo )+
Note: x %in% y returns a logical (boolean) vector the same length as x that says whether each entry in x appears in y. That is, for each entry in x, it checks to see whether it is in y.
This allows us to subset the data so we get all the purchase occasions for the households in the sample. Then, we create scatterplots of price vs. time and facet by the sample id.
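A minimal sketch of the sampling and subsetting idea on made-up data. The data frame `toy`, the seed value and the sample size of 2 are all arbitrary choices for illustration, not values from the lesson.

```r
set.seed(1)  # arbitrary seed, chosen only so this sketch is reproducible

# Hypothetical purchase records for four households
toy <- data.frame(
  id    = factor(c("a", "a", "b", "c", "c", "c", "d")),
  price = c(60, 65, 58, 62, 62, 69, 55)
)

# Draw two household ids at random, then keep every row belonging to them
sample.ids <- sample(levels(toy$id), 2)
toy.sample <- subset(toy, id %in% sample.ids)

print(sample.ids)
print(toy.sample)
```

Because %in% tests each row's id against the sampled ids, a household with several purchase occasions keeps all of its rows, which is what the faceted time series plots need.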
Use the pch or shape parameter to specify the symbol when plotting points. Scroll down to ‘Plotting Points’ on QuickR’s Graphical Parameters.
#set the seed for reproducible results
sample.ids <- sample(levels(yo$id), 16)
ggplot(aes(x=time, y= price),
data= subset(yo, id %in% sample.ids))+
geom_point(aes(size=all.purchases), pch=1)+
Looking back at the Facebook plots, we cannot see how friendships initiated change over time for a given user, because the data is a cross-section: each user is observed once, and we can only compare groups by color. It is not a time series like the yogurt data, where we can follow each household's purchases. It would be valuable to have friendships initiated recorded per day for each user.
Dean also noted that EDA lets us explore relationships between variables, including using an additional variable to check whether the relationship between two variables holds consistently. We may also want to predict one variable from the others, or reduce the dimensionality (for example with PCA) to get a clearer visualization. Above all, let the data speak for itself: plot multiple graphs and visualizations to understand the data better.
A scatterplot matrix may not suit this data well, especially for the categorical variables.
Here’s the scatterplot matrix as a pdf.
You'll need to run the code install.packages('GGally') to install the package for creating this particular scatterplot matrix.
If the plot takes a long time to render or if you want to see some of the scatterplot matrix, then only examine a smaller number of variables. You can use the following code or select fewer variables. We recommend including gender (the 6th variable)!
pf_subset <- pf[ , c(2:7)]
pf_subset <- pf[,c(2:15)]
Great work on finding or computing the correlation coefficients.
Scatterplots are below the diagonal, and categorical variables, like gender, create faceted histograms. ggpairs also reports the correlation coefficients between pairs of variables, which works like a lookup table. ggpairs plots the variables as given rather than on a transformed scale such as a log scale, so it may not be the best view of skewed variables, but it is a good starting point for exploring the data.
nci <- read.table("nci.tsv")
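# Melt the data to long format
# Keep only the first 200 genes (rows) and all the columns
nci.long.samp <- melt(as.matrix(nci[1:200,]))
# Name the melted columns so they match the aesthetics used in the plot
names(nci.long.samp) <- c("gene", "case", "value")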
# Under-expressed values are drawn in blue and over-expressed values in red
# geom_tile draws one tile per gene and case, filled on a blue-to-red color scale
ggplot(aes(y = gene, x = case, fill = value),
data = nci.long.samp) +
geom_tile() +
scale_fill_gradientn(colours = colorRampPalette(c("blue", "red"))(100))
This heat map covers only 200 of the 6000 genes. Plotting all 6000 would only make the visualization harder to read. That is why it is useful to sample the data first and then work through different visualizations of the relationships between variables.