Exploring two variables in R with scatterplot, jitter and smoothing to handle overplotting
In this lesson we learn how to investigate two variables, make a scatter plot, and hear about Moira’s EDA study on perceived audience size.
Scatterplots and Perceived Audience Size
Notes: The x-axis shows the actual audience size and the y-axis the perceived audience size. People tend to pick round numbers (50, 100, 200, etc.) when they guess their audience size. In reality, the number of people who saw a post was around 100 to 200.
library(ggplot2)
pf <- read.csv('../Lesson3/pseudo_facebook.tsv', sep = '\t')
ggplot(aes(x = age, y=friend_count), data = pf) + geom_point()
What are some things that you notice right away?
Response: Users below thirty tend to have more friends. There are some extreme cases at ages above 90, where some users are probably lying about their age. One could speculate that users who fake an age above 90 have a sense of humor and hence more friends. It is also important to notice the outliers in the data and decide how to audit them.
ggplot Syntax
Notes: The x and y variables must be wrapped in aes(), and a geom must be added to say what type of plot to draw.
summary(pf$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   13.00   20.00   28.00   37.28   50.00  113.00
ggplot(aes(x = age, y=friend_count), data = pf) + geom_point()
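Notes: Overplotting means that so many points are drawn on top of each other that we can’t tell how many observations lie in a region. Setting alpha = 1/20 means it takes 20 overlapping points to produce one fully black point. Because age is recorded as an integer, the points also line up in vertical columns, which does not reflect that age is really continuous, so we use geom_jitter() to add some noise. The warning about removed rows appears because xlim(13, 90) limits the plot to ages 13 to 90.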
ggplot(aes(x = age, y=friend_count), data = pf) + geom_jitter(alpha=1/20) + xlim(13,90)
## Warning: Removed 5176 rows containing missing values (geom_point).
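To make the effect of jittering concrete, here is a minimal base R sketch on made-up values. The toy vectors and the jitter width of 0.4 are assumptions chosen for illustration; they are not taken from pf or from ggplot2's internals. It adds a small random horizontal offset to integer ages so that stacked points separate, while friend_count is left untouched.

```r
set.seed(7)  # make the random offsets reproducible

# Hypothetical toy data: integer ages make points stack exactly
age <- c(20, 20, 20, 21, 21)
friend_count <- c(100, 100, 250, 100, 100)

# Number of distinct plotting positions before jittering
distinct_before <- nrow(unique(data.frame(age, friend_count)))

# Jitter only the x values; the width of 0.4 is an assumed illustration value
age_jittered <- age + runif(length(age), min = -0.4, max = 0.4)
distinct_after <- nrow(unique(data.frame(age_jittered, friend_count)))

print(c(before = distinct_before, after = distinct_after))
```

Jittering only the horizontal direction keeps every friend count at its true value, which matters when the y-axis is later transformed.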
What do you notice in the plot?
Response: The points are now spread out more evenly across the plot. Keep in mind that alpha = 1/20 means it takes 20 points at the same coordinate to make it completely black. With this setting we can see that the bulk of users over age 30 (the dark band) have fewer than 1000 friends.
ggplot(aes(x = age, y = friend_count), data = pf) +
  geom_point(alpha = 1/20, position = position_jitter(h = 0)) +
  xlim(13, 90)
## Warning: Removed 5177 rows containing missing values (geom_point).
Look up the documentation for coord_trans() and add a layer to the plot that transforms friend_count using the square root function. Create your plot!
ggplot(aes(x = age, y=friend_count), data = pf) +
geom_point(alpha=1/20,position= position_jitter(h = 0)) +
coord_trans(y = "sqrt")
## Warning: Removed 5171 rows containing missing values (geom_point).
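As a rough illustration of why the square-root scale helps, the sketch below uses a few hypothetical friend counts (not values from pf) and compares how much of the axis the low counts occupy on a linear scale and on a square-root scale.

```r
# Hypothetical friend counts spanning a wide range
friend_count <- c(0, 100, 400, 2500)

# Positions of these counts along a square-root-scaled axis
sqrt_position <- sqrt(friend_count)
print(sqrt_position)

# Share of the axis length taken up by counts from 0 to 400
share_linear <- 400 / max(friend_count)
share_sqrt <- sqrt(400) / max(sqrt_position)
print(c(linear = share_linear, sqrt = share_sqrt))
```

On the linear axis the counts up to 400 are squeezed into a small part of the plot, while the square-root axis gives them noticeably more room.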
What do you notice?
Response: The square-root scale spreads out the lower friend counts, so differences among them are easier to distinguish.
Alpha and Jitter
ggplot(aes(x = age, y = friendships_initiated, color = gender), data = pf) +
  geom_point(alpha = 1/20, position = position_jitter(h = 0)) +
  coord_trans(y = "sqrt") +
  xlim(13, 90)
## Warning: Removed 5193 rows containing missing values (geom_point).
## Warning: Removed 5185 rows containing missing values (geom_point).
Overplotting and Domain Knowledge
Notes: A scatter plot can be taken much further. In Moira’s case, the axes were transformed to percentages, so the plot compares the guessed audience with the actual audience as percentages. Typically, people who guessed 10-20% actually had about 60% of their audience seeing the post.
Conditional Means
Notes: It is hard to judge quantities such as the typical friend count from a jittered scatter plot. Instead, we can look at how the average friend_count varies with age.
library(dplyr)
# Group users by age, then compute per-age summaries
age_groups <- group_by(pf, age)
pf.fc_by_age <- summarise(age_groups,
                          friend_count_mean = mean(friend_count),
                          friend_count_median = median(friend_count),
                          n = n())
## Source: local data frame [6 x 4]
## age friend_count_mean friend_count_median n
## 1 13 164.7500 74.0 484
## 2 14 251.3901 132.0 1925
## 3 15 347.6921 161.0 2618
## 4 16 351.9371 171.5 3086
## 5 17 350.3006 156.0 3283
## 6 18 331.1663 162.0 5196
## Source: local data frame [6 x 4]
## age friend_count_mean friend_count_median n
## 1 108 369.2426 213.0 1661
## 2 109 172.8889 120.0 9
## 3 110 336.7333 243.0 15
## 4 111 240.2222 166.0 18
## 5 112 484.9444 120.5 18
## 6 113 334.6683 206.0 202
# arrange the rows in increasing order of age
pf.fc_by_age <- arrange(pf.fc_by_age, age)
## Source: local data frame [101 x 4]
## age friend_count_mean friend_count_median n
## 1 13 164.7500 74.0 484
## 2 14 251.3901 132.0 1925
## 3 15 347.6921 161.0 2618
## 4 16 351.9371 171.5 3086
## 5 17 350.3006 156.0 3283
## 6 18 331.1663 162.0 5196
## 7 19 333.6921 157.0 4391
## 8 20 283.4991 135.0 3769
## 9 21 235.9412 121.0 3671
## 10 22 211.3948 106.0 3032
## .. ... ... ... ...
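To see what the grouped summary computes, here is a small sketch on a made-up data frame. It deliberately swaps in base R's tapply() instead of dplyr so that it runs without extra packages; the toy values and the names toy and by_age are assumptions for illustration only.

```r
# Hypothetical toy data: two 13-year-olds and three 14-year-olds
toy <- data.frame(age = c(13, 13, 14, 14, 14),
                  friend_count = c(10, 30, 5, 100, 15))

# One row per age with the mean, the median and the group size
by_age <- data.frame(
  age = sort(unique(toy$age)),
  friend_count_mean = as.vector(tapply(toy$friend_count, toy$age, mean)),
  friend_count_median = as.vector(tapply(toy$friend_count, toy$age, median)),
  n = as.vector(tapply(toy$friend_count, toy$age, length))
)
print(by_age)
```

For age 14 the single large value of 100 pulls the mean well above the median, which is one reason to report both.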
Create your plot!
## [1] "age" "friend_count_mean" "friend_count_median"
## [4] "n"
ggplot(aes(age,friend_count_mean), data = pf.fc_by_age) + geom_line()
Overlaying Summaries with Raw Data
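Notes: A quantile is the value below which a given fraction of the data falls. For example, the 0.1 quantile (10th percentile) is the friend count that 10% of the users of a given age fall below.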
ggplot(aes(x = age, y = friend_count), data = pf) +
  # raw data underneath; alpha = 1/20 is assumed here to match the earlier plots
  geom_point(alpha = 1/20,
             position = position_jitter(h = 0),
             color = "orange") +
  # conditional mean; in ggplot2 >= 3.3.0 the argument fun.y is named fun
  geom_line(stat = "summary", fun.y = mean) +
  geom_line(stat = "summary", fun.y = quantile, probs = 0.1,  # 10th percentile
            linetype = 2, color = "blue") +
  geom_line(stat = "summary", fun.y = quantile, probs = 0.9,  # 90th percentile
            linetype = 2, color = "blue") +
  coord_cartesian(xlim = c(13, 70), ylim = c(0, 1000))
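The dashed lines in the plot are per-age quantiles. As a minimal sketch of what quantile() returns, the example below uses a made-up vector of 101 friend counts (an assumption for illustration, not data from pf).

```r
# Hypothetical friend counts for one age group: the integers 1 to 101
x <- 1:101

# 10th and 90th percentiles, as used for the dashed lines
q <- quantile(x, probs = c(0.1, 0.9))
print(q)

# Fraction of values strictly below the 10th percentile
below_q10 <- mean(x < q[1])
print(below_q10)
```

Roughly 10% of the values lie below the first number and roughly 10% lie above the second, so the band between the two dashed lines covers about 80% of users at each age.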
What are some of your observations of the plot?
Response: There is a spike at age 69. The majority of Facebook users are below age 30, where the curve has a roughly bell-shaped peak. Beyond age 70 the curve jumps up and down, which is either real or the result of users lying about their age.
Moira: Histogram Summary and Scatterplot
See the Instructor Notes of this video to download Moira’s paper on perceived audience size and to see the final plot.
Notes: In the first graph, people on the left side overestimated their audience size, people on the right side underestimated it, and those at 0% estimated their audience size perfectly.
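Notes: We compute the correlation between friend_count and age with cor.test(). The result is -0.02740737, which is very close to 0: the relationship is not monotonic, and there is no meaningful linear correlation between the two variables.
cor.test(pf$friend_count, pf$age)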
## Pearson's product-moment correlation
## data: pf$friend_count and pf$age
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03363072 -0.02118189
## sample estimates:
## cor
## -0.02740737
Look up the documentation for the cor.test function.
What’s the correlation between age and friend count? Round to three decimal places. Response: -0.027
Correlation on Subsets
Notes: A weak correlation may be caused by unreliable data (outliers) that we should exclude, so we now subset the data to users aged 70 or younger. Remember that this is a descriptive statistic, not an inferential one: the negative correlation describes this data set and does not by itself show that, for example, people get lonelier as they age.
with( subset(pf, age <= 70) , cor.test(age, friend_count))
## Pearson's product-moment correlation
## data: age and friend_count
## t = -52.5923, df = 91029, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1780220 -0.1654129
## sample estimates:
## cor
## -0.1717245
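The effect of subsetting can be reproduced on simulated data. In the sketch below, the data frame toy and its generating formula are assumptions made up for illustration: friend counts fall with age up to 70, while reported ages above 70 are unrelated to the trend.

```r
set.seed(123)

# Hypothetical data: a clear downward trend up to age 70, and
# unreliable reported ages above 70 that do not follow the trend
age <- sample(13:113, 5000, replace = TRUE)
friend_count <- ifelse(age <= 70, 500 - 5 * age, 300) + rnorm(5000, sd = 50)
toy <- data.frame(age, friend_count)

# Correlation on all rows versus on the subset aged 70 or younger
r_all <- with(toy, cor(age, friend_count))
r_subset <- with(subset(toy, age <= 70), cor(age, friend_count))
print(round(c(all = r_all, age_70_or_less = r_subset), 3))
```

The correlation on the subset is clearly more negative than on the full data, which mirrors what happens with pf when the implausible ages are excluded.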
Correlation Methods
Notes: cor.test() supports three methods: pearson (the default, which reports cor), spearman (which reports rho), and kendall (which reports tau).
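A small sketch of how the three methods differ, using a made-up relationship that is perfectly monotonic but not linear (the vectors x and y are assumptions for illustration):

```r
# Hypothetical data: y always increases with x, but not along a straight line
x <- 1:10
y <- x^3

pearson <- cor(x, y, method = "pearson")    # measures linear association
spearman <- cor(x, y, method = "spearman")  # rank-based, reported as rho
kendall <- cor(x, y, method = "kendall")    # rank-based, reported as tau

print(round(c(pearson = pearson, spearman = spearman, kendall = kendall), 3))
```

The two rank-based measures equal 1 because the ordering of x and y agrees perfectly, while Pearson's coefficient stays below 1 because the points do not lie on a straight line.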
Create Scatterplots
## [1] "userid" "age"
## [3] "dob_day" "dob_year"
## [5] "dob_month" "gender"
## [7] "tenure" "friend_count"
## [9] "friendships_initiated" "likes"
## [11] "likes_received" "mobile_likes"
## [13] "mobile_likes_received" "www_likes"
## [15] "www_likes_received"
ggplot(aes(x = www_likes_received, y = likes_received), data = pf) +
  geom_point()
Strong Correlations
Notes: The correlation coefficient is invariant under a linear transformation of either X or Y, and the slope of the regression line when both X and Y have been transformed to z-scores is the correlation coefficient.
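Both claims can be checked numerically. The sketch below uses simulated values (the vectors and the particular linear transformations are assumptions for illustration) and compares the correlation before and after rescaling, and against the slope of a regression on z-scores.

```r
set.seed(1)

# Hypothetical, linearly related data
x <- rnorm(200)
y <- 2 * x + rnorm(200)

r <- cor(x, y)

# Linear transformations with positive scale factors leave r unchanged
r_transformed <- cor(3 * x + 10, 0.5 * y - 4)

# Regression slope after converting both variables to z-scores
zx <- as.vector(scale(x))
zy <- as.vector(scale(y))
slope_z <- unname(coef(lm(zy ~ zx))[2])

print(round(c(r = r, r_transformed = r_transformed, slope_z = slope_z), 4))
```

All three numbers agree. Note that a negative scale factor on only one of the variables would flip the sign of the coefficient while keeping its magnitude.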
It’s important to note that we may not always be interested in the bulk of the data. Sometimes, the outliers ARE of interest, and it’s important that we understand their values and why they appear in the data set.
The code below zooms in on the bulk of the data by limiting each axis to its 95th percentile, which leaves out the largest outliers. This is a useful way to zoom in. We can also add a smoothing line, here a linear model, and read the relationship between the two variables from that line.
ggplot(aes(x = www_likes_received, y = likes_received), data = pf) +
  geom_point() +
  xlim(0, quantile(pf$www_likes_received, 0.95)) +
  ylim(0, quantile(pf$likes_received, 0.95)) +
  geom_smooth(method = "lm", color = "red")
## Warning: Removed 6075 rows containing missing values (stat_smooth).
## Warning: Removed 6075 rows containing missing values (geom_point).
What’s the correlation between the two variables? Include the top 5% of values for the variable in the calculation and round to 3 decimal places.
cor.test(pf$www_likes_received, pf$likes_received)
## Pearson's product-moment correlation
## data: pf$www_likes_received and pf$likes_received
## t = 937.1035, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9473553 0.9486176
## sample estimates:
## cor
## 0.9479902
Moira on Correlation
Notes: Correlation between two variables in the data is not always a good thing. It is useful to find correlations between variables in order to understand the data better. But when we feed the variables into a linear regression, which assumes that the variables are independent of each other, it becomes harder to know which feature matters if it is correlated with other features, so we can’t be sure about our feature selection.
More Caution with Correlation
Notes: Argument matching (when not providing them by name) in R is a bit complex.
First, arguments (or parameters) can be matched by name. If a parameter matches exactly, it is “removed” from the argument list and the remaining unnamed arguments are matched in the order that they are listed in the function definition.
R does the following to match arguments:
- checks for an exact match of a named argument
- checks for a partial match of the argument
- checks for a positional match
If R does not find a match for a parameter, it typically throws an “unused argument” error.
Type str(functionName) to find the order of the parameters and learn more about the parameters of an R function.
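The matching rules can be tried out on a small function. The function describe() and its parameter names are hypothetical and exist only for this sketch.

```r
# Hypothetical function with three parameters, one of which has a default
describe <- function(first, second, third = "default") {
  paste(first, second, third, sep = "/")
}

# Positional matching: arguments are assigned in definition order
by_position <- describe("x", "y")

# Exact name matching: the order of the named arguments does not matter
by_name <- describe(second = "y", first = "x")

# Partial matching: 'fir' uniquely identifies 'first'; the remaining
# unnamed argument is then matched by position to 'second'
by_partial <- describe("y", fir = "x")

# An argument that matches no parameter raises an error
err <- tryCatch(describe("x", "y", fourth = "z"), error = function(e) e)

print(c(by_position, by_name, by_partial))
print(conditionMessage(err))
```

All three successful calls produce the same string, and str(describe) shows the parameter order that positional matching relies on.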
## [1] "Month" "Temp"
Create your plot!
ggplot(aes(x = Month, y = Temp), data = Mitchell) +
  geom_point()
Noisy Scatterplots
- Take a guess for the correlation coefficient for the scatterplot. Guess: 0, because no sloped straight line appears to fit the points.
- What is the actual correlation of the two variables? (Round to the thousandths place)
cor.test(Mitchell$Month, Mitchell$Temp)
## Pearson's product-moment correlation
## data: Mitchell$Month and Mitchell$Temp
## t = 0.8182, df = 202, p-value = 0.4142
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.08053637 0.19331562
## sample estimates:
## cor
## 0.05747063
Making Sense of Data
Notes: Break up the x-axis so that the months are divided into 12-month periods.
ggplot(aes(x = (Month %% 12), y = Temp), data = Mitchell) +
  geom_point()
# scale_x_discrete(breaks = seq(0, 203, 12))
A New Perspective
What do you notice? Response:
library(energy)
dcor.ttest(Mitchell$Month, Mitchell$Temp)
## dcor t-test of independence
## data: Mitchell$Month and Mitchell$Temp
## T = -0.939, df = 20501, p-value = 0.8261
## sample estimates:
## Bias corrected dcor
## -0.006558215
Notes: Visualizing the data is essential for understanding it. The proportions of the plot matter as well: a height-to-width ratio of about 1:2 usually makes the pattern easier to see.
You could also get perspective on this data by overlaying each year’s data on top of each other, giving a clear, generally sinusoidal graph. You can do this by using R’s modulus operator %% in your code. Try running the code below!
ggplot(aes(x=(Month%%12),y=Temp),data=Mitchell)+ geom_point()
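Why folding the months helps can be shown on simulated data. The sketch below does not use the Mitchell data; the sinusoidal signal, its amplitude, and the noise level are assumptions chosen for illustration. Only the length of 204 monthly readings mirrors the Mitchell data set.

```r
set.seed(1)

# Hypothetical monthly temperatures over 17 years (204 readings)
month <- 0:203
temp <- 10 * sin(2 * pi * month / 12) + rnorm(length(month))

# Pearson correlation on the raw month index misses the yearly cycle
r_raw <- cor(month, temp)

# Folding with %% stacks all years on top of each other
monthly_mean <- tapply(temp, month %% 12, mean)

print(round(r_raw, 3))
print(round(monthly_mean, 1))
```

The raw correlation is close to zero even though temperature depends strongly on the time of year, which becomes obvious once the twelve folded monthly means are inspected.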
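There are other measures of association that can detect this kind of nonlinear dependence. The dcor.ttest() function in the energy package implements a non-parametric test of the independence of two variables. Note, however, that the output printed above for the raw Month and Temp values has a p-value of 0.8261, so that particular run does not reject independence.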
Understanding Noise: Age to Age Months
Notes: Assume the reference date for calculating age is December 31, 2013 and that the age variable gives age in years at the end of 2013.
The variable age_with_months in the data frame pf should be a decimal value. For example, the value of age_with_months for a 33 year old person born in March would be 33.75.
pf$age_with_months <- pf$age + (1.0 - pf$dob_month/12)
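As a quick check of the formula, the sketch below wraps it in a small helper and evaluates it for a few cases. The helper name age_in_months_units() and the example inputs are hypothetical; only the formula itself comes from the line above.

```r
# Hypothetical helper wrapping the formula used for pf$age_with_months
age_in_months_units <- function(age, dob_month) {
  age + (1 - dob_month / 12)
}

# A 33-year-old born in March (month 3), the example from the notes
march_example <- age_in_months_units(33, 3)

# A 33-year-old born in December has just turned 33 at the end of the year
december_example <- age_in_months_units(33, 12)

# A 13-year-old born in October
october_example <- age_in_months_units(13, 10)

print(c(march_example, december_example, october_example))
```

The March case reproduces the 33.75 given in the notes, and users born later in the year get a smaller fractional part because less time has passed since their birthday.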
## [1] "userid" "age"
## [3] "dob_day" "dob_year"
## [5] "dob_month" "gender"
## [7] "tenure" "friend_count"
## [9] "friendships_initiated" "likes"
## [11] "likes_received" "mobile_likes"
## [13] "mobile_likes_received" "www_likes"
## [15] "www_likes_received" "age_with_months"
Two alternate solutions:
pf$age_with_months <- pf$age + (1 - pf$dob_month / 12)
pf$age_with_months <- with(pf, age + (1 - dob_month / 12))
Age with Months Means
Create a new data frame called
pf.fc_by_age_months that contains
the mean friend count, the median friend
count, and the number of users in each
group of age_with_months. The rows of the
data frame should be arranged in increasing
order by the age_with_months variable.
For example, the first two rows of the resulting
data frame would look something like…
age_with_months friend_count_mean friend_count_median n
13 275.0000 275 2
13.25000 133.2000 101 11
See the Instructor Notes for two hints if you get stuck.
Hint 1: Use the group_by(), summarise(), and arrange() functions in the dplyr package to split the data frame by age_with_months. Make sure you arrange by the correct variable (it’s not age anymore).
Hint 2: The code should look similar to the code when we split the data frame by age and found summaries.
age_groups <- group_by(pf, age)
pf.fc_by_age <- summarise(age_groups,
                          friend_count_mean = mean(friend_count),
                          friend_count_median = median(friend_count),
                          n = n())
pf.fc_by_age <- arrange(pf.fc_by_age, age)
age_months_groups <- group_by(pf, age_with_months)
pf.fc_by_age_months <- summarise(age_months_groups,
friend_count_mean = mean(friend_count),
friend_count_median = median(friend_count),
n = n())
pf.fc_by_age_months <- arrange(pf.fc_by_age_months, age_with_months)
## Source: local data frame [6 x 4]
## age_with_months friend_count_mean friend_count_median n
## 1 13.16667 46.33333 30.5 6
## 2 13.25000 115.07143 23.5 14
## 3 13.33333 136.20000 44.0 25
## 4 13.41667 164.24242 72.0 33
## 5 13.50000 131.17778 66.0 45
## 6 13.58333 156.81481 64.0 54
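Programming assignment, using chained functions (executed all at once):
# %>% replaces the older %.% chain operator, which dplyr no longer provides
pf.fc_by_age_monthsc <- pf %>%
  group_by(age_with_months) %>%
  summarise(friend_count_mean = mean(friend_count),
            friend_count_median = median(friend_count),
            n = n()) %>%
  arrange(age_with_months)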
Noise in Conditional Means
Create a new scatterplot showing friend_count_mean
versus the new variable, age_with_months. Be sure to use
the correct data frame (the one you created in the last
exercise) AND subset the data to investigate
users with ages less than 71.
ggplot(aes(y = friend_count_mean, x = age_with_months), data = filter(pf.fc_by_age_months, age_with_months < 71)) +
  geom_line()
Smoothing Conditional Means
Notes: The three plots illustrate the bias-variance tradeoff: one has high variance, one sits in the middle, and one has high bias. With too much bias (too much smoothing), we may smooth away important features of the relationship, such as the peaks of a non-monotonic function.
p1 <- ggplot(aes(age, friend_count_mean), data = subset(pf.fc_by_age, age < 71)) +
  geom_line()
p2 <- ggplot(aes(y = friend_count_mean, x = age_with_months),
             data = filter(pf.fc_by_age_months, age_with_months < 71)) +
  geom_line()
p3 <- ggplot(aes(y = friend_count, x = round(age / 5) * 5),
             data = subset(pf, age < 71)) +
  geom_line(stat = "summary", fun.y = mean)
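The effect of the bin width on conditional means can be sketched with simulated data. Everything below is an assumption for illustration: the linear trend, the noise level, and the helper names cond_mean() and roughness() are made up and are not part of the lesson's code.

```r
set.seed(42)

# Hypothetical users: continuous ages and noisy friend counts
age <- runif(5000, min = 13, max = 70)
friends <- 300 - 3 * (age - 13) + rnorm(5000, sd = 150)

# Conditional mean of friends after rounding age to bins of a given width
cond_mean <- function(width) {
  as.vector(tapply(friends, round(age / width) * width, mean))
}

by_month <- cond_mean(1 / 12)  # narrow bins: few users each, noisy means
by_year <- cond_mean(1)        # medium bins
by_5yr <- cond_mean(5)         # wide bins: smooth, but detail is averaged away

# Roughness: how much neighbouring means jump around
roughness <- function(m) sd(diff(m))

print(c(bins_month = length(by_month), bins_year = length(by_year),
        bins_5yr = length(by_5yr)))
print(round(c(month = roughness(by_month), year = roughness(by_year)), 1))
```

Narrow bins give many noisy means (high variance), while wide bins give a few stable means that can hide local structure (high bias).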
Which Plot to Choose?
Notes: In the end we introduced several plots. The question is which one we should choose. EDA doesn’t require us to choose; rather, we can present all the possible combinations. When presenting to an audience, however, we may end up using just one or two visualizations.
Analyzing Two Variables
Reflection: We learned how to compare two variables in our data. We learned about correlation, along with its advantages and disadvantages. We learned several ways (such as conditional means) to visualize just two variables. We learned how to smooth and about the bias-variance tradeoff.
We should not trust our initial visualization but make several, handling overplotting through jitter and smoothing. We should also not trust correlation blindly when picking features to use in a model. Throughout, the scatter plot was our main visualization.