Explore Many Variables

2014-11-17 10:33

In this lesson we look at three or more variables at once. We also use a third variable to check whether the relationship between two other variables holds consistently.

Moira Perceived Audience Size Colored by Age

Moira then looked at perceived audience size and came up with another question:
are older people better than younger people at estimating their audience?
She colored the points in her plot by age, but it did not help much.

Third Qualitative Variable

In Moira’s experiment, age showed no clear relationship with perceived audience size.
Here we turn to the pseudo-Facebook data and bring gender in as a third variable alongside age and friend_count.
Notice from the boxplot of age by gender that the women skew older, with a median age above 30.
The line plot then shows that women have a higher median friend_count than men at most ages.
Next we group by two variables at once using dplyr: group_by, summarise, and arrange.

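library(ggplot2)
?read.csv
# Read the tab-separated pseudo-Facebook data
pf <- read.csv('../lesson3/pseudo_facebook.tsv', sep = "\t")

# Boxplot of age by gender; the x marks (shape = 4) show the mean age
ggplot(aes(x = gender, y = age),
       data = subset(pf, !is.na(gender))) +
  geom_boxplot() +
  stat_summary(fun.y = mean, geom = "point", shape = 4)

# Median friend count at each age, one line per gender
# (in ggplot2 3.3.0 and later, fun.y is deprecated in favor of fun)
ggplot(aes(x = age, y = friend_count),
       data = subset(pf, !is.na(gender))) +
  geom_line(aes(color = gender), stat = "summary", fun.y = median)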

library(dplyr)

## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

pf.fc_by_age_gender <- group_by(pf,age,gender) %>%
  filter(!is.na(gender))%>%
  summarise(median_friend_count = median(friend_count),
            mean_friend_count = mean(friend_count),
            n=n())%>%
  # summarise() drops the last grouping level (gender), but the result is still
  # grouped by age; ungroup() removes that grouping before we sort by age
  ungroup()%>% 
  arrange(age)
head(pf.fc_by_age_gender)

## Source: local data frame [6 x 5]
## 
##   age gender median_friend_count mean_friend_count    n
## 1  13 female                 148          259.1606  193
## 2  13   male                  55          102.1340  291
## 3  14 female                 224          362.4286  847
## 4  14   male                  92          164.1456 1078
## 5  15 female                 276          538.6813 1139
## 6  15   male                 106          200.6658 1478
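The grouping behaviour described in the code comments can be checked on a toy table. This is a minimal sketch, assuming a reasonably recent dplyr in which group_vars() is available; the toy data frame is made up and is not part of pseudo_facebook.tsv.

```r
library(dplyr)

# Toy data, invented for illustration only
toy <- data.frame(
  age          = c(13, 13, 13, 14),
  gender       = c("female", "male", "male", "female"),
  friend_count = c(10, 20, 40, 5)
)

by_age_gender <- toy %>%
  group_by(age, gender) %>%
  summarise(median_friend_count = median(friend_count),
            n = n())

# summarise() peeled off the last grouping level (gender); age is still a group
group_vars(by_age_gender)

# ungroup() removes what is left, so later verbs see one flat table
flat <- by_age_gender %>%
  ungroup() %>%
  arrange(age)
group_vars(flat)
```

Leaving the age grouping in place is usually harmless for arrange, but it can change the result of later summaries, which is why the notes ungroup first.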

Plotting Conditional Summaries

Create a line graph showing the median friend count over the ages for each gender. Be sure to use the data frame you just created, pf.fc_by_age_gender.

Instructor Notes

Your code should look similar to the code we used to make the plot the first time. It will not need to make use of the stat and fun.y parameters.

ggplot(aes(x = age, y = friend_count), data = subset(pf.1, !is.na(gender))) +
  geom_line(aes(color = gender), stat = 'summary', fun.y = median)

ggplot(aes(x= age, y=median_friend_count),
       data = pf.fc_by_age_gender)+
  geom_line(aes(color=gender))

Thinking in Ratios

This plot shows the median friend count across ages for each gender.
We can also see that younger users tend to have more friends.
Now we may want to ask a different question: how many times more friends does the median woman have than the median man?

Wide and Long Format

To answer this, we reshape our data into a different format.
Notice that our summary data frame repeats each age, once per gender. This is long format.
We want wide format instead:
one row per age, with the median friend count in separate male and female columns.
It’s normal to go back and forth between different arrangements of the same data.
To do this, we use the reshape2 package.
Much like reshaping in Octave, we convert wide <-> long depending on what we need:
many columns (wide) into many rows (long), or the other way around.
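The long-to-wide round trip can be sketched on four rows. This is only an illustration: the numbers are the age 13 and 14 medians from the summary table above, and the names long, wide, and back are hypothetical.

```r
library(reshape2)

# Long format: one row per (age, gender) pair
long <- data.frame(
  age                 = c(13, 13, 14, 14),
  gender              = c("female", "male", "female", "male"),
  median_friend_count = c(148, 55, 224, 92)
)

# Long -> wide: age stays as rows, each gender level becomes a column
wide <- dcast(long, age ~ gender, value.var = "median_friend_count")

# Once the medians sit side by side, the ratio is a simple column operation
wide$ratio <- wide$female / wide$male

# Wide -> long again: the female and male columns are stacked back into rows
back <- melt(wide[, c("age", "female", "male")],
             id.vars = "age",
             variable.name = "gender",
             value.name = "median_friend_count")

wide
back
```

The wide form suits row-wise arithmetic such as the ratio, while the long form is what ggplot expects when gender is mapped to color.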

Reshaping Data

It’s important to use quotes around the variable name that is assigned to value.var.

We could also create a similar data frame using the dplyr package:

pf.fc_by_age_gender.wide <- pf.fc_by_age_gender %>%
  group_by(age) %>%
  summarise(male = median_friend_count[gender == 'male'],
            female = median_friend_count[gender == 'female'],
            ratio = female / male) %>%
  arrange(age)

head(pf.fc_by_age_gender.wide)

library(reshape2)
pf.fc_by_age_gender.wide <- dcast(pf.fc_by_age_gender,
                                  age~gender, # formula: age stays as rows, each gender level becomes a column
                                  value.var='median_friend_count')
head(pf.fc_by_age_gender.wide)

##   age female male
## 1  13    148   55
## 2  14    224   92
## 3  15    276  106
## 4  16    258  136
## 5  17    245  125
## 6  18    243  122

Ratio Plot

Plot the ratio of the female to male median friend counts using the data frame pf.fc_by_age_gender.wide.

Think about what geom you should use. Add a horizontal line to the plot with a y intercept of 1, which will be the base line. Look up the documentation for geom_hline to do that. Use the parameter linetype in geom_hline to make the line dashed.

The linetype parameter can take the values 0-6: 0 = blank, 1 = solid, 2 = dashed, 3 = dotted, 4 = dotdash, 5 = longdash, 6 = twodash

library(ggplot2)
ggplot(aes(x=age, y = female/male),
      data=pf.fc_by_age_gender.wide)+
   geom_line()+
   geom_hline(aes(yintercept=1),linetype=2)

The pseudo-Facebook data may include many users from countries where men tend to have lower friend counts than women. The plot shows that among younger users, the median woman has about twice as many friends as the median man, and more than that for the youngest teens.

Third Quantitative Variable

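Next we bring in another variable, tenure: how long a user has been on the site.
Users who joined earlier have had more time to accumulate friends, so tenure is related to friend_count.
The goal of this exercise is to look at age and tenure together and compare friend_count across them.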

Create a variable called year_joined in the pf data frame using the variable tenure and 2014 as the reference year.

The variable year joined should contain the year that a user joined facebook.

Instructor Notes

A common mistake is to use tenure rather than pf$tenure or with(pf, tenure…). Remember that you need to access the variable in the data frame. This is not one of the hints! :)

Hint 1: Divide the tenure variable by a number. Tenure is measured in days, but we want to convert it to years.
Hint 2: Subtract tenure measured in years from 2014. What does the decimal portion represent? Should we round up or round down to the closest year?
Hint 3: You can use the floor() function to round down to the nearest integer. You can use the ceiling() function to round up to the nearest integer. Which one should you use?

pf$year_joined <- floor(2014 - pf$tenure/365)
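To see what the hints are getting at, here is the arithmetic on three made-up tenures. The vector tenure_days is hypothetical, not taken from pf.

```r
# Made-up tenures in days
tenure_days <- c(30, 400, 730)

# Hint 1: convert days to years
years_on_site <- tenure_days / 365

# Hint 2: subtract from the reference year; the decimal part is a fraction of a year
joined_exact <- 2014 - years_on_site
joined_exact

# Hint 3: the two rounding choices disagree whenever there is a decimal part
joined_floor   <- floor(joined_exact)
joined_ceiling <- ceiling(joined_exact)
data.frame(tenure_days, joined_exact, joined_floor, joined_ceiling)
```

A user with 400 days of tenure lands at about 2012.9, so floor() assigns 2012 and ceiling() assigns 2013. The line above uses floor().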

Cut a Variable

Using table, we can see how many users joined in each year. Next we bin year_joined into ranges with the cut function, turning it into a categorical variable.

Create a new variable in the data frame called year_joined.bucket by using the cut function on the variable year_joined.

You need to create the following buckets for the new variable, year_joined.bucket

   (2004, 2009]
   (2009, 2011]
   (2011, 2012]
   (2012, 2014]

Note that a parenthesis means exclude the year and a bracket means include the year.

?cut
summary(pf$year_joined)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2005    2012    2012    2012    2013    2014       2

table(pf$year_joined)

## 
##  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014 
##     9    15   581  1507  4557  5448  9860 33366 43588    70

pf$year_joined.bucket <- cut(pf$year_joined, breaks=c(2004,2009,2011,2012,2014))
table(pf$year_joined.bucket)

## 
## (2004,2009] (2009,2011] (2011,2012] (2012,2014] 
##        6669       15308       33366       43658
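The parenthesis and bracket convention is easy to check on a few boundary years. A small sketch with the same breaks as above; the years vector is made up to hit the edges of each bucket.

```r
breaks <- c(2004, 2009, 2011, 2012, 2014)

# Boundary years chosen to test which side of each interval is closed
years <- c(2004, 2005, 2009, 2010, 2012, 2014)

# By default cut() builds intervals that are open on the left, closed on the right
buckets <- cut(years, breaks = breaks)

data.frame(year = years, bucket = buckets)
```

2004 falls outside every bucket and becomes NA because the first interval excludes its left endpoint, while 2009 belongs to (2004,2009] rather than (2009,2011].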

Plotting it All Together

We have now combined tenure and age, using year_joined to create the buckets.

Create a line graph of friend_count vs. age so that each year_joined.bucket is a line tracking the median user friend_count across age. This means you should have four different lines on your plot.

You should subset the data to exclude the users whose year_joined.bucket is NA.

table(pf$year_joined, useNA = 'ifany')

## 
##  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  <NA> 
##     9    15   581  1507  4557  5448  9860 33366 43588    70     2

ggplot(aes(x = age, y = friend_count),
       data = subset(pf, !is.na(year_joined.bucket)))+
  geom_line(aes(color=year_joined.bucket), stat='summary', fun.y = median)

In this plot we now observe three variables: age on the x axis, median friend_count on the y axis, and year_joined.bucket as the categorical (color) variable.

Plot the Grand Mean

Write code to do the following:

Add another geom_line to code below to plot the grand mean of the friend count vs age.
Exclude any users whose year_joined.bucket is NA.
Use a different line type for the grand mean.

As a reminder, the parameter linetype can take the values 0-6:

0 = blank, 1 = solid, 2 = dashed, 3 = dotted, 4 = dotdash, 5 = longdash, 6 = twodash

ggplot(aes(x = age, y = friend_count),
       data = subset(pf, !is.na(year_joined.bucket)))+
  geom_line(aes(color=year_joined.bucket), stat='summary', fun.y = mean)+
  geom_line(fun.y = mean, stat='summary', linetype=2)

Friending Rate

Plotting each cohort’s mean alongside the grand mean shows that the overall pattern is not purely an artifact.
So we ask another question: how many friends does a user have per day since joining?

with(subset(pf, tenure > 1), summary(friend_count/tenure))

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.0775   0.2204   0.6069   0.5652 417.0000

Friendships Initiated

The longer users have been on the site, the more friends they tend to have. Do users with more tenure also initiate more friendships?

What is the median friend rate? .2205

What is the maximum friend rate? 417

Create a line graph of mean of friendships_initiated per day (of tenure) vs. tenure colored by year_joined.bucket.

You need to make use of the variables tenure, friendships_initiated, and year_joined.bucket.

You also need to subset the data to only consider users with at least one day of tenure.

ggplot(aes(x = tenure, y = friendships_initiated/tenure),
       data = subset(pf, tenure >= 1))+
  # stat = 'summary' with fun.y = mean plots the mean rate at each tenure value
  geom_line(aes(color=year_joined.bucket), stat = 'summary', fun.y = mean)

This shows that users with more tenure typically initiate fewer friendships per day.

Bias-Variance Tradeoff Revisited

Notice the noise in our graph. By rounding x into coarser bins, we reduce the noise (variance) at the cost of more bias.

Instead of geom_line(), use geom_smooth() to add a smoother to the plot. You can use the defaults for geom_smooth(), but do color the line by year_joined.bucket.

ggplot(aes(x = tenure, y = friendships_initiated / tenure),
       data = subset(pf, tenure >= 1)) +
  geom_line(aes(color = year_joined.bucket),
            stat = 'summary',
            fun.y = mean)

ggplot(aes(x = 7 * round(tenure / 7), y = friendships_initiated / tenure),
       data = subset(pf, tenure > 0)) +
  geom_line(aes(color = year_joined.bucket),
            stat = "summary",
            fun.y = mean)

ggplot(aes(x = 30 * round(tenure / 30), y = friendships_initiated / tenure),
       data = subset(pf, tenure > 0)) +
  geom_line(aes(color = year_joined.bucket),
            stat = "summary",
            fun.y = mean)

ggplot(aes(x = 90 * round(tenure / 90), y = friendships_initiated / tenure),
       data = subset(pf, tenure > 0)) +
  geom_line(aes(color = year_joined.bucket),
            stat = "summary",
            fun.y = mean)

ggplot(aes(x = tenure, y = friendships_initiated / tenure),
       data = subset(pf, tenure >= 1)) +
  geom_smooth(aes(color = year_joined.bucket))

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

Smoothing also gives us a clearer picture of the overall trend in the data.
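The x = 7 * round(tenure / 7) trick in the plots above snaps each tenure to the nearest multiple of the bin width. A quick sketch on made-up tenure values shows how the bins coarsen:

```r
# Made-up tenures in days
tenure <- c(1, 3, 4, 10, 11, 44, 100)

# Snap each value to the nearest multiple of 7 (weekly bins)
weekly <- 7 * round(tenure / 7)

# Snap each value to the nearest multiple of 30 (roughly monthly bins)
monthly <- 30 * round(tenure / 30)

data.frame(tenure, weekly, monthly)

# Fewer distinct x values means each plotted mean averages over more users
length(unique(weekly))
length(unique(monthly))
```

Wider bins pool more users into each mean, which steadies the line (lower variance) but hides changes that happen inside a bin (higher bias).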

Sean’s NFL Fan Sentiment Study

Next we hear about Sean’s NFL fan sentiment study and how he handled the bias-variance tradeoff in his visualizations.

His study looks at the emotion (sad to happy) expressed around a particular NFL team.
The measure is a positive-to-negative ratio: positive means happy, negative means sad.
Plotted over time, the raw series is noisy, with sharp jumps up and down.
So the values are averaged per day and then smoothed over 7 days.
This gives a more interesting graph.
The outcomes are also converted into a more discrete format: win or lose.
To handle the bias-variance tradeoff, don’t go with your gut; listen to what the data tells you.
The raw series has high variance, visible as a lot of noise.
Heavy smoothing gives lower variance but higher bias.
So a spline (a smoother) is used to trade a little more bias for lower variance.
If the data is not good enough on its own, we may have to use EDA to ask some interesting questions.
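The effect of averaging over a 7-day window can be sketched on simulated data. Note the assumptions: the series below is random noise around a slow wave, not Sean’s data, and a centered moving average stands in for the spline mentioned above.

```r
set.seed(1)

# Simulated daily positive/negative ratio: slow trend plus day-to-day noise
days <- 1:60
daily_ratio <- 1 + 0.3 * sin(days / 10) + rnorm(60, sd = 0.5)

# Centered 7-day moving average; stats:: is spelled out because dplyr masks filter()
weekly_avg <- as.numeric(stats::filter(daily_ratio, rep(1 / 7, 7), sides = 2))

# Day-to-day jumps before and after smoothing
sd(diff(daily_ratio))
sd(diff(weekly_avg), na.rm = TRUE)

# The first and last three days have no full window, so they are NA
sum(is.na(weekly_avg))
```

The smoothed series jumps far less from day to day, which is the variance reduction; the price is that short-lived spikes get flattened, which is the added bias.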

Introducing the Yogurt Data Set

Bayesian Statistics and Marketing contains the data set and a case study on it.

The citation for the original paper on the yogurt data set is Kim, Jaehwan, Greg M. Allenby, and Peter E. Rossi. “Modeling consumer demand for variety.” Marketing Science 21.3 (2002): 229-250.

A special thanks to Professor Allenby for helping us understand this data set.

To learn more about scanner data, check out Panel Data Discrete Choice Models of Consumer Demand.

Histograms Revisited

The yogurt data set comes as a CSV file with a different structure: each row is one purchase occasion by a household.

yo = read.csv('yogurt.csv')
summary(yo)

##       obs               id               time         strawberry     
##  Min.   :   1.0   Min.   :2100081   Min.   : 9662   Min.   : 0.0000  
##  1st Qu.: 696.5   1st Qu.:2114348   1st Qu.: 9843   1st Qu.: 0.0000  
##  Median :1369.5   Median :2126532   Median :10045   Median : 0.0000  
##  Mean   :1367.8   Mean   :2128592   Mean   :10050   Mean   : 0.6492  
##  3rd Qu.:2044.2   3rd Qu.:2141549   3rd Qu.:10255   3rd Qu.: 1.0000  
##  Max.   :2743.0   Max.   :2170639   Max.   :10459   Max.   :11.0000  
##    blueberry        pina.colada          plain         mixed.berry    
##  Min.   : 0.0000   Min.   : 0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 0.0000   Median : 0.0000   Median :0.0000   Median :0.0000  
##  Mean   : 0.3571   Mean   : 0.3584   Mean   :0.2176   Mean   :0.3887  
##  3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :12.0000   Max.   :10.0000   Max.   :6.0000   Max.   :8.0000  
##      price      
##  Min.   :20.00  
##  1st Qu.:50.00  
##  Median :65.04  
##  Mean   :59.25  
##  3rd Qu.:68.96  
##  Max.   :68.96

str(yo)

## 'data.frame':    2380 obs. of  9 variables:
##  $ obs        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ id         : int  2100081 2100081 2100081 2100081 2100081 2100081 2100081 2100081 2100081 2100081 ...
##  $ time       : int  9678 9697 9825 9999 10015 10029 10036 10042 10083 10091 ...
##  $ strawberry : int  0 0 0 0 1 1 0 0 0 0 ...
##  $ blueberry  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pina.colada: int  0 0 0 0 1 2 0 0 0 0 ...
##  $ plain      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mixed.berry: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ price      : num  59 59 65 65 49 ...

yo$id <- factor(yo$id)
ggplot(aes(x=price),
       data=yo)+
  geom_histogram(stat='bin', binwidth =10)

ggplot(aes(x=price),
       data=yo)+
  geom_histogram(stat='bin')

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Notice that more purchases occur at the higher prices. With binwidth = 10 the histogram is far more biased: it hides the discreteness of the prices, so we can no longer see the individual price points.

Number of Purchases

Next we count the purchase occasions for each household, then add a variable with the total number of yogurts bought on each occasion.

table(yo$id)

## 
## 2100081 2100370 2100396 2100669 2100768 2100818 2100909 2101394 2101758 
##      34      13       3       2       2      10       7       6       2 
## 2101782 2101790 2101980 2102095 2102129 2102715 2102913 2103218 2103291 
##       6      47       2       3       3      12       4       3      18 
## 2103390 2103887 2103994 2104067 2104091 2104273 2104547 2104620 2104950 
##       2       3       3       2       4       4      22       2       6 
## 2105155 2105239 2105254 2105320 2105346 2105361 2105403 2105759 2106047 
##       4       3       6       2       4       4       7      26       4 
## 2106286 2106351 2106401 2106567 2106724 2106799 2107094 2107300 2107391 
##       7       2       2       2       6       8       2       2       5 
## 2107706 2107953 2108100 2108209 2108639 2108704 2108944 2108985 2109025 
##       5      10       4       5       8       4      15       4       2 
## 2109033 2109298 2109769 2110007 2110031 2110056 2110411 2110460 2110635 
##       2      17       2       5      22       2      19       3       3 
## 2110775 2110890 2110965 2111203 2111385 2111674 2111922 2112235 2112482 
##       3       3       3       2       4      10      18       4       3 
## 2112516 2113340 2113472 2113613 2113779 2114025 2114041 2114074 2114231 
##      15       5      12       4       7      22       2      11       3 
## 2114314 2114348 2114371 2114819 2114892 2114942 2115006 2115220 2115527 
##       3      16       2       4       2       6       2       4       9 
## 2115998 2116277 2116434 2116558 2117069 2117226 2117242 2117317 2117788 
##      28       2       2       4      13       2       2       7       4 
## 2118182 2118299 2118612 2118778 2118927 2119024 2119164 2119594 2119693 
##       3      19      12      19       9       4       5      11      11 
## 2119735 2120089 2120261 2120378 2120436 2120964 2121095 2121277 2121400 
##       8       9      25       4       2       2       5       3      29 
## 2121418 2121533 2121582 2122242 2122655 2122705 2122788 2122838 2123000 
##       7      13       5       3      10      12       2       2       7 
## 2123091 2123257 2123463 2123471 2123554 2123695 2123885 2123968 2124073 
##       6       4       7       2      19       6       3       2      50 
## 2124115 2124156 2124305 2124321 2124388 2124412 2124511 2124545 2124701 
##       4       6      16       2       2      11       5      17       5 
## 2124750 2124909 2124941 2125203 2125427 2125443 2125609 2125658 2126102 
##      25       6       4       4       8      13       2       3       3 
## 2126292 2126490 2126532 2126847 2126946 2127076 2127308 2127407 2127498 
##       8       2      10       2       2       3       3       2       2 
## 2127605 2127621 2127803 2127936 2128116 2128389 2128447 2128595 2128827 
##       3      12       6       6       6       5       5       7       2 
## 2128884 2128959 2129080 2129098 2129163 2129361 2129528 2129734 2129767 
##       2      12       2       2       4       3      39       5      12 
## 2129817 2129874 2129940 2130351 2130377 2130583 2130641 2130807 2130914 
##       8       2       2       3       4      59       5       6       3 
## 2130948 2131250 2131466 2131508 2132019 2132290 2132555 2133033 2133066 
##      12      16       2       2       2      74       2       2       7 
## 2133108 2133207 2133272 2133330 2133413 2133496 2133611 2133660 2133983 
##       2       2      17      10       3      14       2       3       4 
## 2134023 2134122 2134288 2134452 2134478 2134676 2134874 2135251 2135301 
##       2      20      22       2       2       8       2       2       2 
## 2135384 2135681 2135996 2136069 2136531 2136697 2136960 2137067 2137380 
##       3       2       5       2      23       2       4       2       3 
## 2137687 2137745 2138966 2139162 2139626 2139766 2139774 2140483 2141002 
##       4      12      14       4       2       9       3       5       9 
## 2141341 2141507 2141549 2141812 2141861 2142885 2142976 2143180 2143271 
##       2       4       6       6       4      10       4       9       3 
## 2143396 2143503 2143586 2143875 2144048 2144113 2144469 2144576 2144675 
##       2       4       3       9       8       3       2       4       3 
## 2145292 2145326 2145425 2145599 2145672 2146035 2146597 2146621 2146738 
##       5      27       2       8      11       7      16       2       4 
## 2147512 2147751 2147777 2147892 2147991 2148296 2148924 2149500 2149609 
##       2       3      17      24       4       4       7      50       6 
## 2150029 2150854 2151423 2151472 2151613 2151829 2152108 2152264 2152454 
##       4       5       3       2       3       3       2       2       5 
## 2152702 2152975 2153015 2153163 2153387 2153494 2153619 2154278 2154351 
##      29       2       5       6       9       5       8       3       2 
## 2154849 2155697 2155929 2156224 2157040 2157164 2157420 2158097 2158196 
##       3      16       6       3       4       2       3       9       2 
## 2158436 2158642 2158873 2159897 2160259 2160382 2160440 2160549 2160762 
##       7       6       2       3       5       2       7       2       3 
## 2161554 2161885 2162206 2162313 2162545 2162669 2164392 2164756 2164863 
##       8       6       6       5       4       3       2       2       3 
## 2165746 2165779 2165951 2166223 2166934 2167221 2167320 2167817 2167825 
##      24       9       7       2       3       2       3       3       2 
## 2168005 2168013 2168443 2169128 2169250 2169268 2169896 2170639 
##       2       2      15       4       2       7       7       2

# Total yogurts bought on each purchase occasion, summed across the five flavors
yo <- transform(yo, all.purchases = strawberry + blueberry + pina.colada + plain + mixed.berry)

Prices over Time

The histogram of all.purchases below shows something interesting:
most purchase occasions involve only a few yogurts, while a handful involve many more.
Why? First, let’s investigate the price of yogurt over time.

Create a scatterplot of price vs time. This will be an example of a time series plot.

ggplot(aes(x=all.purchases),
       data = yo )+
  geom_histogram(binwidth=1)

ggplot(aes(x=time,y=price),
       data=yo)+
  geom_point(alpha=1/20)

The scatterplot shows that prices tend to go up over time.
There are also stretches where lower prices persist, in which case the buyers may have been using coupons.

Sampling Observations

Dean said that when exploring data with many observations of multiple units, it is often useful to take a small subset of the data (a sample) and work through it in various ways.
For the yogurt data set, we sample just 16 households.
We can then ask more questions. At what prices do these households tend to buy? How many yogurts do they buy each time?

Looking at Samples of Households

Note: x %in% y returns a logical (boolean) vector the same length as x that says whether each entry in x appears in y. That is, for each entry in x, it checks to see whether it is in y.

This allows us to subset the data so we get all the purchases occasions for the households in the sample. Then, we create scatterplots of price vs. time and facet by the sample id.

Use the pch or shape parameter to specify the symbol when plotting points. Scroll down to ‘Plotting Points’ on QuickR’s Graphical Parameters.

The plot below displays only the small subset of the data whose id is in sample.ids.
We plot one line per household id, one facet each.
The points emphasize where the price changes along the line, and their size grows with the number of yogurts bought on that occasion.
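The %in% note above can be checked on a toy vector before applying it to yo. The ids and prices below are made up for illustration.

```r
# Made-up household ids, one entry per purchase occasion
ids <- c("a", "b", "c", "a", "d")

# The households we sampled
sample_ids <- c("a", "d")

# One TRUE/FALSE per entry of ids: is this entry among the sampled ids?
keep <- ids %in% sample_ids
keep

# Used inside subset(), it keeps every purchase row for the sampled households
purchases <- data.frame(id = ids, price = c(59, 65, 49, 59, 68))
sampled <- subset(purchases, id %in% sample_ids)
sampled
```

Household "a" appears twice in the result because %in% tests each row separately, so all of a sampled household’s purchases are kept.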

#set the seed for reproducible results
set.seed(10000)
sample.ids <- sample(levels(yo$id),  16)

ggplot(aes(x=time, y= price),
       data= subset(yo, id %in% sample.ids))+
  facet_wrap(~id)+
  geom_line()+
  geom_point(aes(size=all.purchases), pch=1)

# Save the last plot; ggsave() is called on its own rather than added with +
ggsave('Seed@10000.jpg')

## Saving 7 x 5 in image

Some households make small but steady purchases.
Some buy yogurt, come back after a long gap, and then buy larger amounts. Maybe they remembered the yogurt and bought extra at the request of friends.
One household bought a lot at first and then only a little. Perhaps the first purchase was enough.
One household makes a lot of purchases; perhaps they resell the yogurt or share it with others.
We can also see that no household always pays the higher price; the dips may be occasions where a coupon was used to lower the cost.

The Limits of Cross Sectional Data

Looking back at the Facebook plots, we cannot track how friendships initiated change over time for any one user, because the data is cross-sectional: a single snapshot of each user. All we could do was compare groups of users by color. It is not a time series like the yogurt data, where we can follow each household’s purchases. It would be great to have a time series of friendships initiated per day.

Many Variables

Dean also said that we use EDA to explore relationships between variables, often using a third variable to check the consistency of a relationship between two others. But we may also want to predict one variable from the rest, or reduce the number of dimensions (for example with PCA) to get a better visualization. Above all, let the data speak for itself: plot multiple graphs and visualizations to understand the data better.

Scatterplot Matrix

A scatterplot matrix may not work well for this particular data, especially for the categorical variables.

Here’s the scatterplot matrix as a pdf.

You’ll need to run the code install.packages('GGally') to install the package for creating this particular scatterplot matrix.

If the plot takes a long time to render or if you want to see some of the scatterplot matrix, then only examine a smaller number of variables. You can use the following code or select fewer variables. We recommend including gender (the 6th variable)!

pf_subset <- pf[ , c(2:7)]

library(GGally)
set.seed(1836)
pf_subset <- pf[,c(2:15)]
ggpairs(pf_subset[sample.int(nrow(pf_subset),1000),-1])

Great work on finding or computing the correlation coefficients.

Scatterplots are below the diagonal, and categorical variables, like gender, create faceted histograms. ggpairs also reports the correlation coefficients between pairs of numeric variables, which is what we want to look up here. ggpairs does not transform the variables (for example to a log scale), so it may not be ideal for skewed data, but it’s a good starting point for plotting.

Even More Variables

Genetic data can have far more variables (features) than we have seen so far.
The nci data set is a gene expression data set: roughly 6,000 genes (one per row) measured across 64 samples (one per column).

Heat Maps

nci <- read.table("nci.tsv")
names(nci)

##  [1] "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10" "V11"
## [12] "V12" "V13" "V14" "V15" "V16" "V17" "V18" "V19" "V20" "V21" "V22"
## [23] "V23" "V24" "V25" "V26" "V27" "V28" "V29" "V30" "V31" "V32" "V33"
## [34] "V34" "V35" "V36" "V37" "V38" "V39" "V40" "V41" "V42" "V43" "V44"
## [45] "V45" "V46" "V47" "V48" "V49" "V50" "V51" "V52" "V53" "V54" "V55"
## [56] "V56" "V57" "V58" "V59" "V60" "V61" "V62" "V63" "V64"

colnames(nci) <- c(1:64)#make it easier for colnames to just contain a number
head(nci)

##        1        2      3             4      5      6      7      8      9
## 1  0.300 0.679961  0.940  2.800000e-01  0.485  0.310 -0.830 -0.190  0.460
## 2  1.180 1.289961 -0.040 -3.100000e-01 -0.465 -0.030  0.000 -0.870  0.000
## 3  0.550 0.169961 -0.170  6.800000e-01  0.395 -0.100  0.130 -0.450  1.150
## 4  1.140 0.379961 -0.040 -8.100000e-01  0.905 -0.460 -1.630  0.080 -1.400
## 5 -0.265 0.464961 -0.605  6.250000e-01  0.200 -0.205  0.075  0.005 -0.005
## 6 -0.070 0.579961  0.000 -1.387779e-17 -0.005 -0.540 -0.360  0.350 -0.700
##       10     11     12     13     14     15     16     17     18     19
## 1  0.760  0.270 -0.450 -0.030  0.710 -0.360 -0.210 -0.500 -1.060  0.150
## 2  1.490  0.630 -0.060 -1.120  0.000 -1.420 -1.950 -0.520 -2.190 -0.450
## 3  0.280 -0.360  0.150 -0.050  0.160 -0.030 -0.700 -0.660 -0.130 -0.320
## 4  0.100 -1.040 -0.610  0.000 -0.770 -2.280 -1.650 -2.610  0.000 -1.610
## 5 -0.525  0.015 -0.395 -0.285  0.045  0.135 -0.075  0.225 -0.485 -0.095
## 6  0.360 -0.040  0.150 -0.250 -0.160 -0.320  0.060 -0.050 -0.430 -0.080
##       20     21     22     23     24     25     26     27     28     29
## 1 -0.290 -0.200  0.430 -0.490 -0.530 -0.010  0.640 -0.480  0.140  0.640
## 2  0.000  0.740  0.500  0.330 -0.050 -0.370  0.550  0.970  0.720  0.150
## 3  0.050  0.080 -0.730  0.010 -0.230 -0.160 -0.540  0.300 -0.240 -0.170
## 4  0.730  0.760  0.600 -1.660  0.170  0.930 -1.780  0.470  0.000  0.550
## 5  0.385 -0.105 -0.635 -0.185  0.825  0.395  0.315  0.425  1.715 -0.205
## 6  0.390 -0.080 -0.430 -0.140  0.010 -0.100  0.810  0.020  0.260  0.290
##       30    31     32     33     34     35     36          37     38
## 1  0.070 0.130  0.320  0.515  0.080  0.410 -0.200 -0.36998050 -0.370
## 2  0.290 2.240  0.280  1.045  0.120  0.000  0.000 -1.38998000  0.180
## 3  0.070 0.640  0.360  0.000  0.060  0.210  0.060 -0.05998047  0.000
## 4  1.310 0.680 -1.880  0.000  0.400  0.180 -0.070  0.07001953 -1.320
## 5  0.085 0.135  0.475  0.330  0.105 -0.255 -0.415 -0.07498047 -0.825
## 6 -0.620 0.300  0.110 -0.155 -0.190 -0.110  0.020  0.04001953 -0.130
##       39     40     41          42     43            44          45     46
## 1 -0.430 -0.380 -0.550 -0.32003900 -0.620 -4.900000e-01  0.07001953 -0.120
## 2 -0.590 -0.550  0.000  0.08996101  0.080  4.200000e-01 -0.82998050  0.000
## 3 -0.500 -1.710  0.100 -0.29003900  0.140 -3.400000e-01 -0.59998050 -0.010
## 4 -1.520 -1.870 -2.390 -1.03003900  0.740  7.000000e-02 -0.90998050  0.130
## 5 -0.785 -0.585 -0.215  0.09496101  0.205 -2.050000e-01  0.24501950  0.555
## 6  0.520  0.120 -0.620  0.05996101  0.000 -1.387779e-17 -0.43998050 -0.550
##       47         48     49         50         51     52          53     54
## 1 -0.290 -0.8100195  0.200 0.37998050  0.3100195  0.030 -0.42998050  0.160
## 2  0.030  0.0000000 -0.230 0.44998050  0.4800195  0.220 -0.38998050 -0.340
## 3 -0.310  0.2199805  0.360 0.65998050  0.9600195  0.150 -0.17998050 -0.020
## 4  1.500  0.7399805  0.180 0.76998050  0.9600195 -1.240  0.86001950 -1.730
## 5  0.005  0.1149805 -0.315 0.05498047 -0.2149805 -0.305  0.78501950 -0.625
## 6 -0.540  0.1199805  0.410 0.54998050  0.3700195  0.050  0.04001953 -0.140
##       55     56     57          58     59     60     61     62
## 1  0.010 -0.620 -0.380  0.04998047  0.650 -0.030 -0.270  0.210
## 2 -1.280 -0.130  0.000 -0.72001950  0.640 -0.480  0.630 -0.620
## 3 -0.770  0.200 -0.060  0.41998050  0.150  0.070 -0.100 -0.150
## 4  0.940 -1.410  0.800  0.92998050 -1.970 -0.700  1.100 -1.330
## 5 -0.015  1.585 -0.115 -0.09501953 -0.065 -0.195  1.045  0.045
## 6  0.270  1.160  0.180  0.19998050  0.130  0.410  0.080 -0.400
##              63     64
## 1 -5.000000e-02  0.350
## 2  1.400000e-01 -0.270
## 3 -9.000000e-02  0.020
## 4 -1.260000e+00 -1.230
## 5  4.500000e-02 -0.715
## 6 -2.710505e-20 -0.340

#Melt the data to long format
library(reshape2)
#Keep only the first 200 genes (rows), with all 64 columns
nci.long.samp <- melt(as.matrix(nci[1:200,]))
str(nci.long.samp)

## 'data.frame':    12800 obs. of  3 variables:
##  $ Var1 : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Var2 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ value: num  0.3 1.18 0.55 1.14 -0.265 ...

names(nci.long.samp) <- c("gene", "case", "value")
head(nci.long.samp)

##   gene case  value
## 1    1    1  0.300
## 2    2    1  1.180
## 3    3    1  0.550
## 4    4    1  1.140
## 5    5    1 -0.265
## 6    6    1 -0.070

#Under-expressed genes are drawn in blue, over-expressed genes in red
library(ggplot2)
#geom_tile() draws one tile per gene/case pair; the fill scale runs from blue to red
ggplot(aes(y = gene, x = case, fill = value),
  data = nci.long.samp) +
  geom_tile() +
  scale_fill_gradientn(colours = colorRampPalette(c("blue", "red"))(100))

The heat map shows just 200 of the roughly 6,000 genes. Plotting all 6,000 would only make the visualization more complex. That’s why it’s important to sample the data and then work through various visualizations and relationships among the variables.
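The melt step that feeds the heat map can be seen on a tiny matrix. A minimal sketch: the 3 x 2 matrix m is made up, with values loosely borrowed from the first rows of nci.

```r
library(reshape2)

# Tiny stand-in for nci: 3 genes (rows) by 2 cases (columns)
m <- matrix(c(0.30, 1.18, 0.55,
              0.68, 1.29, 0.17),
            nrow = 3)

# melt() on a matrix returns one row per cell: row index, column index, value
long <- melt(m)
names(long) <- c("gene", "case", "value")
long

# The long table has rows x columns entries, here 3 * 2 = 6
nrow(long) == nrow(m) * ncol(m)
```

The same arithmetic explains the 12800 rows in nci.long.samp: 200 genes times 64 cases, one tile each in the heat map.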

Analyzing Three or More Variables

This is a summary of how we got this far.
We explored how to compare many variables (at least three) at once.
We created new variables from existing ones to build better intuition about the data.
We plotted many visualizations (using GGally as a starting point) to get to know the data.
We tamed the complexity of our data with smoothing and sampling.
We looked at many variables at once and plotted them together.
We took the plots from lesson 4 and extended them, dividing the data into multiple groups (buckets) and observing many variables with the scatterplot matrix and the heat map.
Starting from one row per case, we converted to one row per combination, using reshape2 to go back and forth between long and wide format.
Next we take an in-depth look at the diamonds data set, and at how Solomon, as an expert, carries out a larger EDA and extends it. He also writes his own scraping code and uses the data to predict diamond prices.