Exploring two variables in R with scatterplot, jitter and smoothing to handle overplotting


In this lesson we will learn how to investigate two variables, make a scatter plot, and hear about Moira’s EDA study on perceived audience size.

Scatterplots and Perceived Audience Size

Notes: The plot shows actual audience size on the x-axis and perceived audience size on the y-axis. People tend to pick round numbers (50, 100, 200, etc.) when they guess their audience size. In reality, the posts were seen by around 100 or 200 people.

Scatterplots

Notes:

library(ggplot2)
pf = read.csv('../Lesson3/pseudo_facebook.tsv', sep='\t')
ggplot(aes(x = age, y=friend_count), data = pf) + geom_point()


What are some things that you notice right away?

Response: People under thirty tend to have more friends. There are some extreme values at ages above 90, where some users may be lying about their age. We could also speculate that people who fake an age above 90 have a sense of humor and therefore more friends. It’s important to notice the outliers in our data and decide how to audit the data.

ggplot Syntax

Notes: The x and y variables must be wrapped in aes(), and we must say which type of geom to draw.

summary(pf$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   20.00   28.00   37.28   50.00  113.00
ggplot(aes(x = age, y=friend_count), data = pf) + geom_point() 


Overplotting

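Notes: Overplotting means that points are drawn on top of each other, so we can’t see how many points are really in a region. Setting alpha = 1/20 makes each point nearly transparent, so that it takes 20 overlapping points to look like one fully opaque point. Age is stored as whole numbers, so plain points stack in vertical columns that don’t reflect age being continuous; jitter spreads them out. The warning about removed missing values appears because we limit the x-axis to ages 13-90.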

ggplot(aes(x = age, y=friend_count), data = pf) + geom_jitter(alpha=1/20) + xlim(13,90)
## Warning: Removed 5176 rows containing missing values (geom_point).

What do you notice in the plot?

Response: The points are spread out more evenly in the plot. Keep in mind that alpha = 1/20 in the geom means it takes 20 points at the same coordinate to make it completely black. With this we can see that most users over age 30 (the dark block) have friend counts below 1000.
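
To see where the "Removed ... rows" warning comes from, here is a minimal sketch on made-up ages (the vector toy_ages is hypothetical, not taken from pseudo_facebook.tsv). It counts the values that fall outside the limits passed to xlim(). Because jitter moves points randomly, a few points sitting right at the limits may also be pushed outside, which would explain why the count in the warning changes slightly from run to run.

```r
# Hypothetical ages standing in for pf$age.
toy_ages <- c(13, 25, 40, 90, 95, 103, 113)

# xlim(13, 90) keeps only values inside [13, 90]; the rest are treated as missing.
outside <- toy_ages < 13 | toy_ages > 90

# Number of rows that would be dropped from the plot.
n_removed <- sum(outside)
n_removed
```

On the real data, sum(pf$age < 13 | pf$age > 90) should give a number close to the one in the warning.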

Coord_trans()

Notes:

ggplot(aes(x = age, y=friend_count), data = pf) +
  geom_point(alpha=1/20,position= position_jitter(h = 0)) + 
  xlim(13,90)
## Warning: Removed 5177 rows containing missing values (geom_point).

Look up the documentation for coord_trans() and add a layer to the plot that transforms friend_count using the square root function. Create your plot!

ggplot(aes(x = age, y=friend_count), data = pf) +
  geom_point(alpha=1/20,position= position_jitter(h = 0)) + 
  xlim(13,90)+
  coord_trans(y = "sqrt")
## Warning: Removed 5171 rows containing missing values (geom_point).

?coord_trans

What do you notice?

Response: It’s easier to distinguish the friend counts.
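
A small worked example may help show why the square root makes low friend counts easier to tell apart. The counts below are made up for illustration. It also suggests why the jitter uses h = 0: vertical noise could push a friend count of 0 below zero, and a negative number has no real square root.

```r
# Hypothetical friend counts, chosen to be evenly spaced on the square-root scale.
toy_counts <- c(0, 100, 400, 900, 1600)

# On the transformed axis these sit at equal distances from each other.
sqrt_counts <- sqrt(toy_counts)
sqrt_counts  # 0 10 20 30 40

# A negative value has no real square root: R returns NaN (with a warning).
negative_result <- suppressWarnings(sqrt(-1))
is.nan(negative_result)  # TRUE
```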

Alpha and Jitter

Notes:

ggplot(aes(x = age, y=friendships_initiated, color=gender), data = pf) +
  geom_point(alpha=1/20,position= position_jitter(h = 0)) + 
  xlim(13,90)+
  coord_trans(y = "sqrt")
# ggsave() is a separate call that saves the last plot; it is not a layer added with +
ggsave('femvsmale.png')
## Saving 7 x 5 in image
## Warning: Removed 5193 rows containing missing values (geom_point).
## Warning: Removed 5185 rows containing missing values (geom_point).


Overplotting and Domain Knowledge

Notes: There is a lot we can do with a scatter plot. In Moira’s case, the axes are transformed to percentages, so we can compare the audience size people guessed in the survey with their actual audience size. Typically, people who guessed 10-20% actually had about 60% seeing their posts.

Conditional Means

Notes: It is hard to judge quantities from a jittered scatter plot. Instead, we look at how the average friend count varies with age.

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# group the users by age, then compute one summary row per age
age_groups <- group_by(pf, age)
pf.fc_by_age <- summarise(age_groups,
                          friend_count_mean = mean(friend_count),
                          friend_count_median = median(friend_count),
                          n = n())  # number of users at each age


head(pf.fc_by_age)
## Source: local data frame [6 x 4]
## 
##   age friend_count_mean friend_count_median    n
## 1  13          164.7500                74.0  484
## 2  14          251.3901               132.0 1925
## 3  15          347.6921               161.0 2618
## 4  16          351.9371               171.5 3086
## 5  17          350.3006               156.0 3283
## 6  18          331.1663               162.0 5196
tail(pf.fc_by_age)
## Source: local data frame [6 x 4]
## 
##   age friend_count_mean friend_count_median    n
## 1 108          369.2426               213.0 1661
## 2 109          172.8889               120.0    9
## 3 110          336.7333               243.0   15
## 4 111          240.2222               166.0   18
## 5 112          484.9444               120.5   18
## 6 113          334.6683               206.0  202
# sort the summary rows by age
pf.fc_by_age <- arrange(pf.fc_by_age, age)
pf.fc_by_age
## Source: local data frame [101 x 4]
## 
##    age friend_count_mean friend_count_median    n
## 1   13          164.7500                74.0  484
## 2   14          251.3901               132.0 1925
## 3   15          347.6921               161.0 2618
## 4   16          351.9371               171.5 3086
## 5   17          350.3006               156.0 3283
## 6   18          331.1663               162.0 5196
## 7   19          333.6921               157.0 4391
## 8   20          283.4991               135.0 3769
## 9   21          235.9412               121.0 3671
## 10  22          211.3948               106.0 3032
## .. ...               ...                 ...  ...

Create your plot!

names(pf.fc_by_age)
## [1] "age"                 "friend_count_mean"   "friend_count_median"
## [4] "n"
ggplot(aes(age,friend_count_mean), data = pf.fc_by_age) + geom_line()
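
The grouping steps above can also be chained with dplyr’s pipe operator %>%, which passes each result on to the next function. This is a sketch on a small made-up data frame (toy_pf and its values are assumptions, not the real data).

```r
library(dplyr)

# Hypothetical stand-in for pf with two ages.
toy_pf <- data.frame(
  age          = c(14, 13, 13, 14, 14),
  friend_count = c(10, 100, 200, 20, 60)
)

# Group by age, summarise each group, then sort by age.
toy_fc_by_age <- toy_pf %>%
  group_by(age) %>%
  summarise(
    friend_count_mean   = mean(friend_count),
    friend_count_median = median(friend_count),
    n                   = n()
  ) %>%
  arrange(age)

toy_fc_by_age
```

The result has one row per age, with the same columns as pf.fc_by_age.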


Overlaying Summaries with Raw Data

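Notes: The quantile at probability p is the value below which a fraction p of the data falls. For example, the 0.1 quantile of friend_count is the count that 10% of users fall below.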
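
As a quick illustration of what quantile() returns, here is a sketch on made-up friend counts (the vector is hypothetical). With R’s default method, the 0.1 and 0.9 quantiles of these eleven evenly spaced values are the second and tenth values.

```r
# Hypothetical friend counts: 0, 100, ..., 1000
toy_counts <- seq(0, 1000, by = 100)

# 10th and 90th percentiles; the result is a named vector ("10%", "90%").
q <- quantile(toy_counts, probs = c(0.1, 0.9))
q  # 10% -> 100, 90% -> 900
```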

ggplot(aes(x = age, y=friend_count), data = pf) +
  geom_point(alpha=1/20,
             position= position_jitter(h = 0),
             color = "orange") +
  # fun and fun.args need ggplot2 >= 3.3.0; older versions used fun.y
  geom_line(stat = "summary", fun = mean)+
  # 10th percentile: 10% of users at each age have fewer friends than this
  geom_line(stat="summary", fun = quantile, fun.args = list(probs = 0.1),
            linetype =2, color="blue")+
  # 90th percentile: 90% of users at each age have fewer friends than this
  geom_line(stat="summary", fun = quantile, fun.args = list(probs = 0.9),
            linetype =2, color="blue")+
  coord_cartesian(xlim=c(13,70), ylim=c(0,1000))

?coord_cartesian

What are some of your observations of the plot?

Response: The lines are noisy around age 69. The majority of Facebook users are under age 30, where the data looks roughly normally distributed, whereas beyond age 70 the lines swing sharply up and down (either the data is real or users are lying about their age).

Moira: Histogram Summary and Scatterplot

See the Instructor Notes of this video to download Moira’s paper on perceived audience size and to see the final plot.

Notes: We can see that people overestimate their audience on the left side (graph 1) and underestimate it on the right, while the bar at 0% marks people who estimated their audience size perfectly.

Correlation

Notes: We compute the correlation between friend_count and age and get -0.02740737. The relationship is not monotonic, and there is essentially no linear correlation between the two variables.

?cor.test
cor.test(pf$friend_count,pf$age)
## 
##  Pearson's product-moment correlation
## 
## data:  pf$friend_count and pf$age
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03363072 -0.02118189
## sample estimates:
##         cor 
## -0.02740737
?with

Look up the documentation for the cor.test function.

What’s the correlation between age and friend count? Round to three decimal places.

Response: -0.027
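
The value can also be pulled out of the test result instead of being read off the printout. cor.test() returns a list whose estimate element holds the coefficient. The sketch below uses made-up vectors (hypothetical values) so that it runs without the data file.

```r
# Hypothetical ages and friend counts, not taken from the dataset.
toy_age     <- c(15, 20, 25, 30, 40, 50, 60)
toy_friends <- c(300, 350, 200, 150, 180, 120, 100)

result <- cor.test(toy_friends, toy_age)

# estimate is a named number ("cor"); unname() drops the label before rounding.
r <- round(unname(result$estimate), 3)
r
```

With the real data, round(unname(cor.test(pf$friend_count, pf$age)$estimate), 3) applies the same idea.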


Correlation on Subsets

Notes: If the two variables look uncorrelated, there may be some data (outliers) that we need to leave out. Here we subset the data to users aged 70 or younger. Don’t treat a descriptive statistic as an inference: for example, we can’t conclude from this graph that people get lonelier as they age.

with(subset(pf, age <= 70), cor.test(age, friend_count))
## 
##  Pearson's product-moment correlation
## 
## data:  age and friend_count
## t = -52.5923, df = 91029, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1780220 -0.1654129
## sample estimates:
##        cor 
## -0.1717245

Correlation Methods

Notes: cor.test supports three methods: Pearson (the default, reported as cor), Spearman (reported as rho), and Kendall (reported as tau).
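
The difference between the methods shows up when a relationship is monotonic but not linear. In this sketch on made-up data, y always increases with x, so the rank-based coefficients are 1 while Pearson’s is lower.

```r
x <- 1:10
y <- x^3  # strictly increasing, but curved

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")  # rho, based on ranks
kendall  <- cor(x, y, method = "kendall")   # tau, based on concordant pairs

c(pearson = pearson, spearman = spearman, kendall = kendall)
```

cor.test() accepts the same method argument, for example cor.test(x, y, method = "spearman").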

Create Scatterplots

Notes:

names(pf)
##  [1] "userid"                "age"                  
##  [3] "dob_day"               "dob_year"             
##  [5] "dob_month"             "gender"               
##  [7] "tenure"                "friend_count"         
##  [9] "friendships_initiated" "likes"                
## [11] "likes_received"        "mobile_likes"         
## [13] "mobile_likes_received" "www_likes"            
## [15] "www_likes_received"
ggplot(aes(x = www_likes_received,y =likes_received ), data = pf)+
  geom_point()


Strong Correlations

Notes: The correlation coefficient is invariant under a linear transformation of either X or Y, and the slope of the regression line when both X and Y have been transformed to z-scores is the correlation coefficient.
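
Both statements can be checked numerically. The sketch below uses simulated data (the variables and the seed are arbitrary choices for illustration). Note that the invariance holds for a positive scaling factor; a negative factor flips the sign of the coefficient.

```r
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

r <- cor(x, y)

# Rescaling and shifting x (with a positive factor) leaves r unchanged.
r_transformed <- cor(3 * x + 5, y)

# Regressing z-scores on z-scores gives a slope equal to r.
zx <- as.numeric(scale(x))
zy <- as.numeric(scale(y))
slope <- unname(coef(lm(zy ~ zx))[2])

all.equal(r, r_transformed)  # TRUE
all.equal(r, slope)          # TRUE
```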

It’s important to note that we may not always be interested in the bulk of the data. Sometimes, the outliers ARE of interest, and it’s important that we understand their values and why they appear in the data set.

This code zooms in on the bottom 95% of each variable, ignoring the outliers. It is a useful way to zoom in! We can also add a smoothing line, here a linear model, and judge the correlation between the two variables from the line.

ggplot(aes(x = www_likes_received,y =likes_received ), data = pf)+
  geom_point()+
  xlim(0,quantile(pf$www_likes_received, 0.95))+
  ylim(0,quantile(pf$likes_received, 0.95))+
  geom_smooth(method = "lm", color="red")
## Warning: Removed 6075 rows containing missing values (stat_smooth).
## Warning: Removed 6075 rows containing missing values (geom_point).
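
One detail worth keeping in mind: xlim() and ylim() drop the out-of-range rows before any statistic is computed (hence the warnings), so the linear model above is fitted to the trimmed data only. coord_cartesian(), used earlier, zooms without dropping rows. The sketch below uses a made-up three-row data frame and assumes ggplot2 3.3.0 or later for the fun argument; layer_data() returns the values a layer actually computed.

```r
library(ggplot2)

# Hypothetical data: one x position with a single large outlier in y.
toy <- data.frame(x = c(1, 1, 1), y = c(10, 20, 3000))
p <- ggplot(toy, aes(x, y)) + stat_summary(fun = mean, geom = "point")

# Zooming keeps the outlier in the mean: (10 + 20 + 3000) / 3
zoomed <- layer_data(p + coord_cartesian(ylim = c(0, 1000)))$y

# Setting limits removes the outlier first: (10 + 20) / 2
clipped <- suppressWarnings(layer_data(p + ylim(0, 1000))$y)

c(zoomed = zoomed, clipped = clipped)
```

If the fit should use all rows while the view stays zoomed, coord_cartesian(xlim = ..., ylim = ...) is the option to reach for.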