Exploring Two Variables

2014-11-12 20:08 | Source

Exploring two variables in R with scatterplot, jitter and smoothing to handle overplotting

In this lesson we will learn how toInvestigate two variable make a Scatter Plot and hear moira’s study in EDA perceive audience size ### Scatterplots and Perceived Audience Size Notes: x->actual vs y->perceive. We can see that people choose round up number(50,100,200,etc) when they perceived audience size In reality, people saw our post saw 100/200 ***

Scatterplots

Notes:

library(ggplot2)
pf = read.csv('../Lesson3/pseudo_facebook.tsv', sep='\t')
ggplot(aes(x = age, y=friend_count), data = pf) + geom_point()

What are some things that you notice right away?

Response: People below thirty would have more friends.there’s some extreme where ages>90 (some maybe lying). But that also can infer people who fake beyond age 90 have sense of humor hence more friends. It’s also important to notice the outliers of our data, and make actions how to audit the data. ***

ggplot Syntax

Notes: Need to say aes wrapper in x and y have to say what type of geom

summary(pf$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   20.00   28.00   37.28   50.00  113.00

ggplot(aes(x = age, y=friend_count), data = pf) + geom_point()

Overplotting

Notes: Overplotting means we can’t exactly see what are the real plotting. In this case we want to have a plot represent on 20 plot As X(age) is discrete, only attribute point doesn’t really describe age. So instead we using jitter The warning of ommited missing values because we limit to only age 13-90

ggplot(aes(x = age, y=friend_count), data = pf) + geom_jitter(alpha=1/20) + xlim(13,90)

## Warning: Removed 5176 rows containing missing values (geom_point).

What do you notice in the plot?

Response: We can see more distributed in the plot. Also keep in mind alpha=1/20 in geom means it will take 20 points in that coordinat to make it completely black. By doing this we know that most of users(in block of black) seen as age over 30 has below 1000 average friends. ***

Coord_trans()

Notes:

ggplot(aes(x = age, y=friend_count), data = pf) +
  geom_point(alpha=1/20,position= position_jitter(h = 0)) + 
  xlim(13,90)

## Warning: Removed 5177 rows containing missing values (geom_point).

Look up the documentation for coord_trans() and add a layer to the plot that transforms friend_count using the square root function. Create your plot!

ggplot(aes(x = age, y=friend_count), data = pf) +
  geom_point(alpha=1/20,position= position_jitter(h = 0)) + 
  xlim(13,90)+
  coord_trans(y = "sqrt")

## Warning: Removed 5171 rows containing missing values (geom_point).

?coord_trans

What do you notice?

It’s more distinction to see the friend count. ***

Alpha and Jitter

Notes:

ggplot(aes(x = age, y=friendships_initiated, color=gender), data = pf) +
  geom_point(alpha=1/20,position= position_jitter(h = 0)) + 
  xlim(13,90)+
  coord_trans(y = "sqrt")+
  ggsave('femvsmale.png')

## Saving 7 x 5 in image

## Warning: Removed 5193 rows containing missing values (geom_point).

## Warning: Removed 5185 rows containing missing values (geom_point).

Overplotting and Domain Knowledge

Notes: so much can do in scatter plot. In MOira’s case, we transform axis in percentage, that way we can see percentage of survey guess vs survey actual audience still typically we saw 10-20% would have actually 60% seeing our post. ***

Conditional Means

Notes: not possible to judge quantity in jitter(harder). avg friend-Coutn vary over age.

library(dplyr)

## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

age_groups <- group_by(pf,age)
pf.fc_by_age <-summarise(age_groups,
          friend_count_mean = mean(friend_count),
          friend_count_median = median(friend_count),
          n = n())


head(pf.fc_by_age)

## Source: local data frame [6 x 4]
## 
##   age friend_count_mean friend_count_median    n
## 1  13          164.7500                74.0  484
## 2  14          251.3901               132.0 1925
## 3  15          347.6921               161.0 2618
## 4  16          351.9371               171.5 3086
## 5  17          350.3006               156.0 3283
## 6  18          331.1663               162.0 5196

tail(pf.fc_by_age)

## Source: local data frame [6 x 4]
## 
##   age friend_count_mean friend_count_median    n
## 1 108          369.2426               213.0 1661
## 2 109          172.8889               120.0    9
## 3 110          336.7333               243.0   15
## 4 111          240.2222               166.0   18
## 5 112          484.9444               120.5   18
## 6 113          334.6683               206.0  202

#arrange the order
pf.fc_by_age <- arrange(pf.fc_by_age, age)
pf.fc_by_age

## Source: local data frame [101 x 4]
## 
##    age friend_count_mean friend_count_median    n
## 1   13          164.7500                74.0  484
## 2   14          251.3901               132.0 1925
## 3   15          347.6921               161.0 2618
## 4   16          351.9371               171.5 3086
## 5   17          350.3006               156.0 3283
## 6   18          331.1663               162.0 5196
## 7   19          333.6921               157.0 4391
## 8   20          283.4991               135.0 3769
## 9   21          235.9412               121.0 3671
## 10  22          211.3948               106.0 3032
## .. ...               ...                 ...  ...

Create your plot!

names(pf.fc_by_age)

## [1] "age"                 "friend_count_mean"   "friend_count_median"
## [4] "n"

ggplot(aes(age,friend_count_mean), data = pf.fc_by_age) + geom_line()

Overlaying Summaries with Raw Data

Notes: quantile = first nth percentage of data that we want to observe

ggplot(aes(x = age, y=friend_count), data = pf) +
  geom_point(alpha=1/20,
             position= position_jitter(h = 0),
             color = "orange") +
  geom_line(stat = "summary", fun.y = mean)+
  geom_line(stat="summary", fun.y = quantile, probs = 0.1,#10% of  users have ~f_count
            linetype =2, color="blue")+
  geom_line(stat="summary", fun.y = quantile, probs = 0.9,#90% of  users have ~f_count
            linetype =2, color="blue")+
  coord_cartesian(xlim=c(13,70), ylim=c(0,1000))

?coord_cartesian

What are some of your observations of the plot?

Response: We have some jitter at age 69 majority of users in facebook is below age 30, and have some normal distribution Whereas age beyond 70 have some peak upside down(either true or users lying) ***

Moira: Histogram Summary and Scatterplot

See the Instructor Notes of this video to download Moira’s paper on perceived audience size and to see the final plot.

Notes: We can see that people have overestimate on the left(graph1) side, right(underestimate), and the graph at 0% which perfectly estimate their audience size. ***

Correlation

Notes: We’re going to find the correlation of friend_count, and age with -0.02740737 as the result , it’s not monotonic, there’s isn’t correlation between two variable

?cor.test
cor.test(pf$friend_count,pf$age)

## 
##  Pearson's product-moment correlation
## 
## data:  pf$friend_count and pf$age
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03363072 -0.02118189
## sample estimates:
##         cor 
## -0.02740737

?with

Look up the documentation for the cor.test function.

What’s the correlation between age and friend count? Round to three decimal places. Response:

Correlation on Subsets

Notes: If not correlated, then perhaps there’s some data(outlier) that we need to hinder Now we want to subset our data less or equal than 70. Don’t use inference statistic in place of descriptive statistices e.g. based on this graph, we describe that people lonelier with aging

with(     subset(pf, age <= 70)            , cor.test(age, friend_count))

## 
##  Pearson's product-moment correlation
## 
## data:  age and friend_count
## t = -52.5923, df = 91029, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1780220 -0.1654129
## sample estimates:
##        cor 
## -0.1717245

Correlation Methods

Notes: we can use pearson, produce cor.test spearman produce rho and kendall, produce tau ***

Create Scatterplots

Notes:

names(pf)

##  [1] "userid"                "age"                  
##  [3] "dob_day"               "dob_year"             
##  [5] "dob_month"             "gender"               
##  [7] "tenure"                "friend_count"         
##  [9] "friendships_initiated" "likes"                
## [11] "likes_received"        "mobile_likes"         
## [13] "mobile_likes_received" "www_likes"            
## [15] "www_likes_received"

ggplot(aes(x = www_likes_received,y =likes_received ), data = pf)+
  geom_point()

Strong Correlations

Notes: The correlation coefficient is invariant under a linear transformation of either X or Y, and the slope of the regression line when both X and Y have been transformed to z-scores is the correlation coefficient.

It’s important to note that we may not always be interested in the bulk of the data. Sometimes, the outliers ARE of interest, and it’s important that we understand their values and why they appear in the data set.

This code will zooming to 95% of most of our data, ignoring outliers. What useful method to zoom in! We also can smoothing line, by drawing some line, linear model, and see from the line the correlation between the two.

ggplot(aes(x = www_likes_received,y =likes_received ), data = pf)+
  geom_point()+
  xlim(0,quantile(pf$www_likes_received, 0.95))+
  ylim(0,quantile(pf$likes_received, 0.95))+
  geom_smooth(method = "lm", color="red")

## Warning: Removed 6075 rows containing missing values (stat_smooth).

## Warning: Removed 6075 rows containing missing values (geom_point).

What’s the correlation betwen the two variables? Include the top 5% of values for the variable in the calculation and round to 3 decimal places.

cor.test(pf$www_likes_received,pf$likes_received)

## 
##  Pearson's product-moment correlation
## 
## data:  pf$www_likes_received and pf$likes_received
## t = 937.1035, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9473553 0.9486176
## sample estimates:
##       cor 
## 0.9479902

Response:

Moira on Correlation

Notes: Sometimes correlation of two variables in the data is not a good thing. It’s good thing that we find the correlation between the variables in data, to understand better about the data. But when we put it into linear regression, which its asumption is the data independence of each other, it becomes harder to know which features is important if it correlated with other features. We can’t be sure to know in feature selection. ***

More Caution with Correlation

Notes: Argument matching (when not providing them by name) in R is a bit complex.

First, arguments (or parameters) can be matched by name. If a parameter matches exactly, it is “removed” from the argument list and the remaining unnamed arguments are matched in the order that they are listed in the function definition.

R does the following to match arguments…

checks for exact match of named argument checks for a partial match of the argument checks for a positional match If R does not find a match for a parameter, it typically throws an “unused” parameter error.

Type str(functionName) to find the order of the parameters and learn more about the parameters of an R function.

library(alr3)

## Loading required package: car

data(Mitchell)
names(Mitchell)

## [1] "Month" "Temp"

library(ggplot2)

Create your plot!

ggplot(aes(x = Month, y = Temp), data=Mitchell)+
  geom_point()

Noisy Scatterplots

Take a guess for the correlation coefficient for the scatterplot. 0, because it appears we can’t draw some linear slope line.
What is the actual correlation of the two variables? (Round to the thousandths place)

cor.test(Mitchell$Month, Mitchell$Temp)

## 
##  Pearson's product-moment correlation
## 
## data:  Mitchell$Month and Mitchell$Temp
## t = 0.8182, df = 202, p-value = 0.4142
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08053637  0.19331562
## sample estimates:
##        cor 
## 0.05747063

Making Sense of Data

Notes: Brek up the x-axis so that the month divided every 12 months

ggplot(aes(x = (Month%%12), y = Temp), data=Mitchell)+
  geom_point()

  #scale_x_discrete(breaks = seq(0,203,12))

A New Perspective

What do you notice? Response:

library(energy)
dcor.ttest(Mitchell$Month, Mitchell$Temp)

## 
##  dcor t-test of independence
## 
## data:  Mitchell$Month and Mitchell$Temp
## T = -0.939, df = 20501, p-value = 0.8261
## sample estimates:
## Bias corrected dcor 
##        -0.006558215

Watch the solution video and check out the Instructor Notes! Notes: It’s important to take data visualization to make us understand about the data. To make it better visualize, usually we take about vertical-horizontal == 1:2

You could also get perspective on this data by overlaying each year’s data on top of each other, giving a clear, generally sinusoidal graph. You can do this by using the R’s modulus operator %% in your code. Try running the code below!

ggplot(aes(x=(Month%%12),y=Temp),data=Mitchell)+ geom_point()

Data Visualization Pioneers John Tukey William Playfair William Playfair and the Psychology of Graphs

There are other measures of associations that can detect this. The dcor.ttest() function in the energy package implements a non-parametric test of the independence of two variables. The test correctly rejects the independence. ***

Understanding Noise: Age to Age Months

Notes: Assume the reference date for calculating age is December 31, 2013 and that the age variable gives age in years at the end of 2013.

The variable age_with_months in the data frame pf should be a decimal value. For example, the value of age_with_months for a 33 year old person born in March would be 33.75.

pf$age_with_months <- pf$age + (1.0 - pf$dob_month/12)
names(pf)

##  [1] "userid"                "age"                  
##  [3] "dob_day"               "dob_year"             
##  [5] "dob_month"             "gender"               
##  [7] "tenure"                "friend_count"         
##  [9] "friendships_initiated" "likes"                
## [11] "likes_received"        "mobile_likes"         
## [13] "mobile_likes_received" "www_likes"            
## [15] "www_likes_received"    "age_with_months"

Two alternate solutions:

pf$age_with_months <- pf$age + (1 - pf$dob_month / 12)

pf$age_with_months <- with(pf, age + (1 - dob_month / 12)) ### Age with Months Means

Create a new data frame called

pf.fc_by_age_months that contains

the mean friend count, the median friend

count, and the number of users in each

group of age_with_months. The rows of the

data framed should be arranged in increasing

order by the age_with_months variable.

For example, the first two rows of the resulting

data frame would look something like…

age_with_months friend_count_mean friend_count_median n

13 275.0000 275 2

13.25000 133.2000 101 11

See the Instructor Notes for two hints if you get stuck.

This programming assignment will automatically be graded.

Hint 1: Use the group_by(), summarise(), and arrange() functions in the dplyr package to split the data frame by age_with_month. Make sure you arrange by the correct variable (it’s not age anymore).

Hint 2: The code should look similar to the code when we split the data frame by age and found summaries. age_groups <- group_by(pf, age) pf.fc_by_age <- summarise(age_groups, friend_count_mean = mean(friend_count), friend_count_median = median(friend_count), n = n()) pf.fc_by_age <- arrange(pf.fc_by_age, age)

head(pf.fc_by_age)

age_months_groups <- group_by(pf, age_with_months)
pf.fc_by_age_months <- summarise(age_months_groups,
                                 friend_count_mean = mean(friend_count),
                                 friend_count_median = median(friend_count),
                                 n = n())
pf.fc_by_age_months <- arrange(pf.fc_by_age_months, age_with_months)
head(pf.fc_by_age_months)

## Source: local data frame [6 x 4]
## 
##   age_with_months friend_count_mean friend_count_median  n
## 1        13.16667          46.33333                30.5  6
## 2        13.25000         115.07143                23.5 14
## 3        13.33333         136.20000                44.0 25
## 4        13.41667         164.24242                72.0 33
## 5        13.50000         131.17778                66.0 45
## 6        13.58333         156.81481                64.0 54

Programming Assignment, chain function(execute all at once)

pf.fc_by_age_monthsc <- pf%.%
  group_by(age_with_months)%.%
  summarise(friend_count_mean = mean(friend_count),
            friend_count_median = median(friend_count),
            n = n())%.%
  arrange(age_with_months)

## Warning: %.% is deprecated. Please use %>%

## Warning: %.% is deprecated. Please use %>%

## Warning: %.% is deprecated. Please use %>%

Noise in Conditional Means

Create a new scatterplot showing friend_count_mean

versus the new variable, age_with_months. Be sure to use

the correct data frame (the one you create in the last

exercise) AND subset the data to investigate

users with ages less than 71.

ggplot(aes(y=friend_count_mean,x=age_with_months),
       data=filter(pf.fc_by_age_months, age_with_months<=71))+
  geom_line()

Smoothing Conditional Means

Notes: The three graph are example of variance-normal-bias. With more bias, we may end up waste important meaning of a feature. Smoothing highlight waste impotance non-monotonic function.

p1 <- ggplot(aes(age,friend_count_mean), data =subset(pf.fc_by_age, age<=71)) +
  geom_line()+
  geom_smooth()


p2 <-  ggplot(aes(y=friend_count_mean,x=age_with_months),
       data=filter(pf.fc_by_age_months, age_with_months<=71))+
  geom_line()+
  geom_smooth()

p3 <- ggplot(aes(y= friend_count, x = round(age/5)*5),
             data=subset(pf, age<= 71))+
  geom_line(stat="summary", fun.y = mean)
  

library(gridExtra)

## Loading required package: grid

grid.arrange(p1,p2,p3,ncol=1)

## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

Which Plot to Choose?

Notes: So in the end, we introduced a couple of plot. The question is which should we choose? The EDA doesn’t require us to choose, rather present all possible combination. Sometimes to present in audience, we may end up using one/two visualization

Analyzing Two Variables

Reflection: We learn the comparison about two variables in our data. We learn the correlation, advantage and disadvantage. We learn multiple(conditional means) visualization through just two variables. We learn how to smooth and bias-variance tradeoff

Not trust our init visualization, make multiple and Handle overplotting through jitter and smooth not trust our correlation, and pick feature to use in our model. We also use scatter plot as our main visualization ***

Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!