Exploring two variables in R with scatterplot, jitter and smoothing to handle overplotting
In this lesson we will learn how to investigate two variables, make a scatterplot, and hear about Moira’s EDA study on perceived audience size.
### Scatterplots and Perceived Audience Size
Notes: The x-axis shows the actual audience size and the y-axis shows the perceived audience size. People tend to choose round numbers (50, 100, 200, etc.) when they guess their audience size. In reality, the posts were seen by around 100 to 200 people. ***
Notes:
library(ggplot2)
pf = read.csv('../Lesson3/pseudo_facebook.tsv', sep='\t')
ggplot(aes(x = age, y=friend_count), data = pf) + geom_point()
Response: People under thirty tend to have more friends. There are some extreme values at ages above 90, where some users are probably lying about their age. One could even speculate that people who fake an age above 90 have a sense of humor and hence more friends. It’s also important to notice the outliers in our data and decide how to audit them. ***
Notes: x and y must be wrapped in aes(), and we have to add a geom to say what type of plot to draw.
summary(pf$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 20.00 28.00 37.28 50.00 113.00
ggplot(aes(x = age, y=friend_count), data = pf) + geom_point()
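Notes: Overplotting means that so many points are stacked on top of each other that we can’t tell how many points fall in each region. Here we set the transparency so that it takes 20 overlapping points to be as dark as one solid point. Because age is recorded in whole years, the points line up in vertical columns, which doesn’t reflect that age is really continuous, so we use jitter instead. The warning about removed missing values appears because we limit the x-axis to ages 13 to 90.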
ggplot(aes(x = age, y=friend_count), data = pf) + geom_jitter(alpha=1/20) + xlim(13,90)
## Warning: Removed 5176 rows containing missing values (geom_point).
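To make the idea of jitter concrete, here is a minimal base R sketch on made-up ages (not the pf data). It adds a small amount of uniform random noise to each whole-number age, which is similar in spirit to what geom_jitter does on the x-axis. The noise width of 0.4 and the variable names are assumptions chosen for illustration.

```r
set.seed(1)  # make the random noise reproducible

# Made-up ages recorded in whole years, so many points share the same x value
age <- c(13, 13, 13, 14, 14, 15)

# Add uniform noise in [-0.4, 0.4] so stacked points spread out horizontally
# (0.4 is a hypothetical width chosen for this sketch)
age_jittered <- age + runif(length(age), min = -0.4, max = 0.4)

print(round(age_jittered, 2))
```

Each jittered value stays within 0.4 of its original age, so points remain visually associated with the right year while no longer sitting exactly on top of each other.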
Response: The points are now more spread out in the plot. Keep in mind that alpha = 1/20 in the geom means it takes 20 points at the same coordinate to make a completely black point. With this we can see that most users (the dark regions), including those over 30, have fewer than 1000 friends. ***
Notes:
ggplot(aes(x = age, y=friend_count), data = pf) +
geom_point(alpha=1/20,position= position_jitter(h = 0)) +
xlim(13,90)
## Warning: Removed 5177 rows containing missing values (geom_point).
ggplot(aes(x = age, y=friend_count), data = pf) +
geom_point(alpha=1/20,position= position_jitter(h = 0)) +
xlim(13,90)+
coord_trans(y = "sqrt")
## Warning: Removed 5171 rows containing missing values (geom_point).
?coord_trans
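One likely reason the plots above pass h = 0 to position_jitter is that vertical jitter combined with a square-root scale can cause trouble. The sketch below uses made-up friend counts and a hypothetical noise vector to show that jittering a count of 0 downward makes it negative, and the square root of a negative number is NaN.

```r
# Made-up friend counts, including users with zero friends
friend_count <- c(0, 0, 1, 4)

# Hypothetical vertical jitter; with h = 0 this noise would not be added
noise <- c(-0.3, 0.2, -0.1, 0.3)
jittered <- friend_count + noise

# sqrt() of a negative value returns NaN (with a warning, silenced here)
transformed <- suppressWarnings(sqrt(jittered))
print(transformed)

# Without vertical jitter every value has a valid square root
print(sqrt(friend_count))
```

Restricting jitter to the x direction keeps every friend count non-negative, so the transformed axis can place all points.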
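The square-root transformation of the y-axis makes differences in friend count easier to see. We set h = 0 in position_jitter so that jitter is added only to age; jittering a friend count of 0 could make it negative, and the square root of a negative number is undefined. ***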
Notes:
ggplot(aes(x = age, y=friendships_initiated, color=gender), data = pf) +
geom_point(alpha=1/20,position= position_jitter(h = 0)) +
xlim(13,90)+
coord_trans(y = "sqrt")
# Save the plot just drawn; ggsave() is a separate call, not a layer added with +
ggsave('femvsmale.png')
## Saving 7 x 5 in image
## Warning: Removed 5193 rows containing missing values (geom_point).
## Warning: Removed 5185 rows containing missing values (geom_point).
Notes: There is a lot we can do with a scatterplot. In Moira’s case, the axes are transformed to percentages, so we can compare the guessed audience with the actual audience as percentages. Typically, people who guessed that 10-20% saw their post actually had about 60% seeing it. ***
Notes: It is hard to judge quantities such as the average from a jittered scatterplot. Instead, we want to see how the average friend count varies with age.
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
age_groups <- group_by(pf,age)
pf.fc_by_age <-summarise(age_groups,
friend_count_mean = mean(friend_count),
friend_count_median = median(friend_count),
n = n())
head(pf.fc_by_age)
## Source: local data frame [6 x 4]
##
## age friend_count_mean friend_count_median n
## 1 13 164.7500 74.0 484
## 2 14 251.3901 132.0 1925
## 3 15 347.6921 161.0 2618
## 4 16 351.9371 171.5 3086
## 5 17 350.3006 156.0 3283
## 6 18 331.1663 162.0 5196
tail(pf.fc_by_age)
## Source: local data frame [6 x 4]
##
## age friend_count_mean friend_count_median n
## 1 108 369.2426 213.0 1661
## 2 109 172.8889 120.0 9
## 3 110 336.7333 243.0 15
## 4 111 240.2222 166.0 18
## 5 112 484.9444 120.5 18
## 6 113 334.6683 206.0 202
#arrange the order
pf.fc_by_age <- arrange(pf.fc_by_age, age)
pf.fc_by_age
## Source: local data frame [101 x 4]
##
## age friend_count_mean friend_count_median n
## 1 13 164.7500 74.0 484
## 2 14 251.3901 132.0 1925
## 3 15 347.6921 161.0 2618
## 4 16 351.9371 171.5 3086
## 5 17 350.3006 156.0 3283
## 6 18 331.1663 162.0 5196
## 7 19 333.6921 157.0 4391
## 8 20 283.4991 135.0 3769
## 9 21 235.9412 121.0 3671
## 10 22 211.3948 106.0 3032
## .. ... ... ... ...
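As a rough cross-check of what the group_by and summarise step computes, the same per-age summary can be sketched in base R with tapply. The toy data frame below is made up for illustration and is not the pf data; the column names mirror pf.fc_by_age.

```r
# Toy data standing in for pf (values are made up for illustration)
toy <- data.frame(
  age          = c(13, 13, 13, 14, 14, 15),
  friend_count = c(10, 20, 60, 100, 300, 50)
)

# One row per age with the mean, median, and number of users in that group
by_age <- data.frame(
  age                 = sort(unique(toy$age)),
  friend_count_mean   = as.vector(tapply(toy$friend_count, toy$age, mean)),
  friend_count_median = as.vector(tapply(toy$friend_count, toy$age, median)),
  n                   = as.vector(tapply(toy$friend_count, toy$age, length))
)

print(by_age)
```

Note how the mean for age 13 is pulled above the median by the single larger value, which is the same pattern visible in the real summary table.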
Create your plot!
names(pf.fc_by_age)
## [1] "age" "friend_count_mean" "friend_count_median"
## [4] "n"
ggplot(aes(age,friend_count_mean), data = pf.fc_by_age) + geom_line()
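Notes: A quantile is the value below which a given fraction of the data falls. For example, the 0.1 quantile is the value that about 10% of the observations are at or below.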
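A small base R sketch of the quantile idea, using a made-up vector rather than friend counts. The variable names are assumptions for illustration.

```r
# Made-up data: the integers 0 through 100
x <- 0:100

# 10th and 90th percentiles, the same probs used for the dashed lines
q10 <- quantile(x, probs = 0.1)
q90 <- quantile(x, probs = 0.9)
print(c(q10, q90))

# Roughly 10% of the values are at or below the 0.1 quantile
share_below_q10 <- mean(x <= q10)
print(share_below_q10)
```

In the plot, the same calculation is done separately for the friend counts at each age, which gives one point per age on each dashed line.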
ggplot(aes(x = age, y=friend_count), data = pf) +
geom_point(alpha=1/20,
position= position_jitter(h = 0),
color = "orange") +
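# In ggplot2 >= 3.3.0 the summary function is passed as fun (older versions used fun.y)
geom_line(stat = "summary", fun = mean) +  # mean friend count at each age
geom_line(stat = "summary", fun = quantile, fun.args = list(probs = 0.1),  # 10th percentile
linetype = 2, color = "blue") +
geom_line(stat = "summary", fun = quantile, fun.args = list(probs = 0.9),  # 90th percentile
linetype = 2, color = "blue") +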
coord_cartesian(xlim=c(13,70), ylim=c(0,1000))
?coord_cartesian
Response: There is a spike at age 69. The majority of Facebook users are below age 30, and their friend counts rise and fall in a roughly bell-shaped curve, whereas beyond age 70 the lines show sharp peaks and dips (either real or caused by users lying about their age). ***
See the Instructor Notes of this video to download Moira’s paper on perceived audience size and to see the final plot.
Notes: In the final plot, people on the left side overestimate their audience size, people on the right underestimate it, and those at 0% estimate their audience size perfectly. ***
Notes: We compute the correlation between friend_count and age and get -0.02740737. The relationship is not monotonic, and there is essentially no linear correlation between the two variables.
?cor.test
cor.test(pf$friend_count,pf$age)
##
## Pearson's product-moment correlation
##
## data: pf$friend_count and pf$age
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03363072 -0.02118189
## sample estimates:
## cor
## -0.02740737
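A correlation near zero does not by itself mean two variables are unrelated, because Pearson’s r only measures linear association. The sketch below uses made-up data (not pf) in which y depends entirely on x, yet the relationship is not monotonic and r comes out as zero. It also computes r by hand to show what the sample estimate is.

```r
# Made-up data: y is fully determined by x, but the relationship is U-shaped
x <- -3:3
y <- x^2

# Pearson's r by hand: covariance divided by the product of the standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# The same quantity from cor(); cor.test() reports it as the sample estimate
r_builtin <- cor(x, y)

print(c(manual = r_manual, builtin = r_builtin))
```

This is why it helps to look at the scatterplot and the conditional means before reading too much into a single coefficient.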
?with
Look up the documentation for the cor.test function.
What’s the correlation between age and friend count? Round to three decimal places. Response: -0.027
Notes: If the variables appear uncorrelated, some data (such as outliers) may be hiding the relationship and should be set aside. Here we subset the data to ages less than or equal to 70. Don’t treat a descriptive statistic as an inferential one; for example, this correlation only describes the data and does not let us conclude that people become lonelier as they age.
with( subset(pf, age <= 70) , cor.test(age, friend_count))
##
## Pearson's product-moment correlation
##
## data: age and friend_count
## t = -52.5923, df = 91029, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1780220 -0.1654129
## sample estimates:
## cor
## -0.1717245
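Pearson’s coefficient is not the only option: cor and cor.test take a method argument that also accepts "spearman" and "kendall", which work on ranks. As a hedged sketch on made-up data, the example below shows a relationship that is perfectly monotonic but not linear, where the rank-based coefficients reach 1 and Pearson’s r does not.

```r
# Made-up data: y always increases with x, but along a curve rather than a line
x <- 1:10
y <- x^3

r_pearson  <- cor(x, y, method = "pearson")   # linear association
r_spearman <- cor(x, y, method = "spearman")  # rank-based (rho)
r_kendall  <- cor(x, y, method = "kendall")   # rank-based (tau)

print(c(pearson = r_pearson, spearman = r_spearman, kendall = r_kendall))
```

The same method argument can be passed to cor.test when a test and confidence information are wanted rather than only the coefficient.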
Notes: cor.test supports three methods: Pearson (the default), Spearman (which produces rho), and Kendall (which produces tau). ***
Notes:
names(pf)
## [1] "userid" "age"
## [3] "dob_day" "dob_year"
## [5] "dob_month" "gender"
## [7] "tenure" "friend_count"
## [9] "friendships_initiated" "likes"
## [11] "likes_received" "mobile_likes"
## [13] "mobile_likes_received" "www_likes"
## [15] "www_likes_received"
ggplot(aes(x = www_likes_received,y =likes_received ), data = pf)+
geom_point()
Notes: The correlation coefficient is invariant under a linear transformation of either X or Y, and the slope of the regression line when both X and Y have been transformed to z-scores is the correlation coefficient.
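Both statements can be checked numerically. The sketch below uses simulated data (an assumption, not the pf data) and a linear transformation with a positive scale factor; a negative scale factor would flip the sign of the coefficient.

```r
set.seed(42)  # reproducible simulated data

# Simulated variables with a moderate linear relationship
x <- rnorm(200)
y <- 0.5 * x + rnorm(200)

# Correlation before and after a linear transformation of x (positive scale)
r         <- cor(x, y)
r_shifted <- cor(3 * x + 10, y)

# Regress the z-score of y on the z-score of x and take the slope
zx <- as.vector(scale(x))
zy <- as.vector(scale(y))
slope_z <- unname(coef(lm(zy ~ zx))[2])

print(c(r = r, r_shifted = r_shifted, slope_z = slope_z))
```

All three printed values agree up to floating-point error.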
It’s important to note that we may not always be interested in the bulk of the data. Sometimes, the outliers ARE of interest, and it’s important that we understand their values and why they appear in the data set.
The code below zooms in on the bulk of the data by setting each axis limit at the 95th percentile, which ignores the outliers. This is a useful way to zoom in. We can also add a smoothing line, here a linear model, and read the relationship between the two variables from that line.
ggplot(aes(x = www_likes_received,y =likes_received ), data = pf)+
geom_point()+
xlim(0,quantile(pf$www_likes_received, 0.95))+
ylim(0,quantile(pf$likes_received, 0.95))+
geom_smooth(method = "lm", color="red")
## Warning: Removed 6075 rows containing missing values (stat_smooth).
## Warning: Removed 6075 rows containing missing values (geom_point).
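The warnings report rows removed because xlim and ylim drop values outside the limits before the points and the linear model are drawn. A minimal sketch of that filtering on made-up numbers (a stand-in for a skewed column such as likes_received, not the real data):

```r
# Made-up skewed values with one extreme outlier
likes <- c(0, 1, 1, 2, 3, 3, 4, 5, 8, 10, 12, 15, 20, 30, 45, 60, 90, 150, 400, 5000)

# Upper limit at the 95th percentile, as in ylim(0, quantile(..., 0.95))
upper <- quantile(likes, 0.95)

# Values outside [0, upper] are the ones that would be removed from the plot
kept    <- likes[likes >= 0 & likes <= upper]
dropped <- length(likes) - length(kept)

print(unname(upper))
print(dropped)

# Summaries computed after filtering no longer see the outlier
print(c(mean_all = mean(likes), mean_kept = mean(kept)))
```

Because the smoother is fitted only to the rows that remain, the red line describes the bulk of the data rather than the extreme values.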