explore-many-variables
In this lesson we look at three or more variables at once. In particular, we use a third variable to check whether the relationship between two variables holds up consistently.
Moira Perceived Audience Size Colored by Age
- Moira looks at perceived audience size and comes up with another question:
- Are older people better than younger people at estimating their audience size?
- So she colors the points in her plot by age, but it doesn’t help much.
Third Qualitative Variable
- In Moira’s experiment, there was no clear relationship between age and perceived audience size.
- Here we look instead at how friend_count relates to age and gender.
- We can see that women, on average, have a higher friend_count than men.
- Also notice from the boxplot of age by gender that women skew older, with a median age above 30.
- Next we group by two variables using dplyr’s group_by, summarise, and arrange.
library(ggplot2)
?read.csv
pf = read.csv('../lesson3/pseudo_facebook.tsv',sep = "\t")
ggplot(aes(x = gender, y = age),
data = subset(pf, !is.na(gender))) +geom_boxplot()+
stat_summary(fun.y = mean, geom = "point", shape =4 )
#+ geom_histogram()
ggplot(aes(x = age, y = friend_count),
data= subset(pf, !is.na(gender)))+
geom_line(aes(color=gender), stat="summary", fun.y = median)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
pf.fc_by_age_gender <- group_by(pf,age,gender) %>%
filter(!is.na(gender))%>%
summarise(median_friend_count = median(friend_count),
mean_friend_count = mean(friend_count),
n=n())%>%
# summarise() drops the last grouping layer (gender), leaving the data grouped by age;
# ungroup() removes that remaining layer before we arrange by age
ungroup()%>%
arrange(age)
head(pf.fc_by_age_gender)
## Source: local data frame [6 x 5]
##
## age gender median_friend_count mean_friend_count n
## 1 13 female 148 259.1606 193
## 2 13 male 55 102.1340 291
## 3 14 female 224 362.4286 847
## 4 14 male 92 164.1456 1078
## 5 15 female 276 538.6813 1139
## 6 15 male 106 200.6658 1478
Plotting Conditional Summaries
Create a line graph showing the median friend count over the ages for each gender. Be sure to use the data frame you just created, pf.fc_by_age_gender.
Instructor Notes
Your code should look similar to the code we used to make the plot the first time. It will not need to make use of the stat and fun.y parameters.
ggplot(aes(x = age, y = friend_count), data = subset(pf, !is.na(gender))) +
geom_line(aes(color = gender), stat = 'summary', fun.y = median)
ggplot(aes(x= age, y=median_friend_count),
data = pf.fc_by_age_gender)+
geom_line(aes(color=gender))
Thinking in Ratios
- This plot shows the median friend count across ages for each gender.
- We also spot that younger people tend to have more friends.
- Now we may want to ask a different question: by what ratio do women have more friends than men?
Wide and Long Format
- To answer that, we want to reshape our data into a different format.
- Notice that our summary data frame repeats each age, once per gender (long format).
- Now we want to reshape it into wide format:
- one row for each age, with the median value in separate male and female columns.
- It’s normal to go back and forth between different arrangements of the data.
- To do this, we use the ‘reshape2’ package.
- Similar to reshaping in Octave, we convert wide <-> long depending on what we need.
- Wide (multiple columns) to long (multiple rows), or the other way around.
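As a minimal sketch of the wide/long idea, the same conversion can be done with base R's reshape() on a toy data frame. The toy rows below copy ages 13 to 15 from the summary table shown earlier; the object names long, wide, and back are made up for this illustration, and this is not the reshape2 approach used in the lesson.

```r
# Toy long-format data: one row per (age, gender) pair.
long <- data.frame(
  age    = c(13, 13, 14, 14, 15, 15),
  gender = c("female", "male", "female", "male", "female", "male"),
  median_friend_count = c(148, 55, 224, 92, 276, 106)
)

# Long -> wide: one row per age, one column per gender.
wide <- reshape(long, idvar = "age", timevar = "gender", direction = "wide")
# reshape() prefixes the new columns with the value name; strip that prefix.
names(wide) <- sub("median_friend_count.", "", names(wide), fixed = TRUE)
rownames(wide) <- NULL
print(wide)

# Wide -> long: stack the female and male columns back into one value column.
back <- reshape(wide, direction = "long", idvar = "age",
                varying = c("female", "male"),
                v.names = "median_friend_count",
                timevar = "gender", times = c("female", "male"))
print(back)
```

The wide form is convenient for computing one column from another (for example female / male), while the long form is usually what plotting by group expects.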
Reshaping Data
It’s important to use quotes around the variable name that is assigned to value.var.
We could also create a similar data frame using the dplyr package.
pf.fc_by_age_gender.wide <- pf.fc_by_age_gender %>%
group_by(age) %>%
summarise(male = median_friend_count[gender == 'male'],
female = median_friend_count[gender == 'female'],
ratio = female / male) %>%
arrange(age)
head(pf.fc_by_age_gender.wide)
library(reshape2)
pf.fc_by_age_gender.wide <- dcast(pf.fc_by_age_gender,
age ~ gender, # formula: left side = variable kept as rows, right side = variable whose values become columns
value.var='median_friend_count')
head(pf.fc_by_age_gender.wide)
## age female male
## 1 13 148 55
## 2 14 224 92
## 3 15 276 106
## 4 16 258 136
## 5 17 245 125
## 6 18 243 122
Ratio Plot
Plot the ratio of the female to male median friend counts using the data frame pf.fc_by_age_gender.wide.
Think about what geom you should use. Add a horizontal line to the plot with a y intercept of 1, which will be the base line. Look up the documentation for geom_hline to do that. Use the parameter linetype in geom_hline to make the line dashed.
The linetype parameter can take the values 0-6: 0 = blank, 1 = solid, 2 = dashed, 3 = dotted, 4 = dotdash, 5 = longdash, 6 = twodash
library(ggplot2)
ggplot(aes(x=age, y = female/male),
data=pf.fc_by_age_gender.wide)+
geom_line()+
geom_hline(aes(yintercept=1),linetype=2)
In the pseudo-Facebook data, many users join from various other countries, where men may tend to have lower friend counts than women. The plot shows that among the youngest users, the median woman has more than twice as many friends as the median man (for example, 148 versus 55 at age 13).
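The ratio can be checked by hand from the six rows of the wide table printed above. This is a small sketch; the data frame name wide_head is made up for the illustration and holds only those six rows, not the full data.

```r
# The first six rows of the wide table (ages 13 to 18), typed in by hand.
wide_head <- data.frame(
  age    = 13:18,
  female = c(148, 224, 276, 258, 245, 243),
  male   = c(55, 92, 106, 136, 125, 122)
)

# Ratio of female to male median friend counts; 1 would mean no difference.
wide_head$ratio <- wide_head$female / wide_head$male
print(round(wide_head$ratio, 2))
# 2.69 2.43 2.60 1.90 1.96 1.99
```

Every ratio in this range is above 1, and the youngest ages sit well above 2.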
Third Quantitative Variable
- Next we bring in another variable, tenure.
- Tenure is how long a user has been on the site; users who joined earlier have had more time to build up their friend_count.
- The goal of this exercise is to combine age and tenure and compare friend_count across them.
Create a variable called year_joined in the pf data frame using the variable tenure and 2014 as the reference year.
The variable year joined should contain the year that a user joined facebook.
Instructor Notes
A common mistake is to use tenure rather than pf$tenure or with(pf, tenure…). Remember that you need to access the variable in the data frame. This is not one of the hints! :)
Hint 1: Divide the tenure variable by a number. Tenure is measured in days, but we want to convert it to years.
Hint 2: Subtract tenure measured in years from 2014. What does the decimal portion represent? Should we round up or round down to the closest year?
Hint 3: You can use the floor() function to round down to the nearest integer. You can use the ceiling() function to round up to the nearest integer. Which one should you use?
pf$year_joined <- floor(2014 - pf$tenure/365)
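To see what the choice between floor() and ceiling() does, here is a small sketch on a handful of made-up tenure values (tenure_days is hypothetical, not taken from pf). It uses the same 365-day year and 2014 reference as the line above.

```r
# Hypothetical tenures in days.
tenure_days <- c(0, 100, 365, 400, 730, 3000)

# Fractional year in which each user joined, counting back from 2014.
joined_exact <- 2014 - tenure_days / 365

year_floor   <- floor(joined_exact)    # rounds down
year_ceiling <- ceiling(joined_exact)  # rounds up

print(data.frame(tenure_days, joined_exact = round(joined_exact, 2),
                 year_floor, year_ceiling))
```

The two only agree when the tenure is a whole number of years; otherwise they differ by one year, which is why the hint asks you to think about what the decimal portion represents.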
Cut a Variable
Using table, we can see how many users joined in each year. Next we want to bin year_joined into ranges with the cut function, so that we can treat it as a categorical variable.
Create a new variable in the data frame called year_joined.bucket by using the cut function on the variable year_joined.
You need to create the following buckets for the new variable, year_joined.bucket
(2004, 2009]
(2009, 2011]
(2011, 2012]
(2012, 2014]
Note that a parenthesis means exclude the year and a bracket means include the year.
?cut
summary(pf$year_joined)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2005 2012 2012 2012 2013 2014 2
table(pf$year_joined)
##
## 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
## 9 15 581 1507 4557 5448 9860 33366 43588 70
pf$year_joined.bucket <- cut(pf$year_joined, breaks=c(2004,2009,2011,2012,2014))
table(pf$year_joined.bucket)
##
## (2004,2009] (2009,2011] (2011,2012] (2012,2014]
## 6669 15308 33366 43658
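The parenthesis/bracket convention is easy to check on a short vector of years. This sketch applies the same breaks as above to the years 2005 to 2014; the names years and buckets are made up for the illustration.

```r
years   <- 2005:2014
buckets <- cut(years, breaks = c(2004, 2009, 2011, 2012, 2014))

# By default cut() builds right-closed intervals: (lower, upper].
print(data.frame(year = years, bucket = buckets))

# A value equal to the lowest break is excluded and becomes NA.
print(cut(2004, breaks = c(2004, 2009, 2011, 2012, 2014)))
```

So 2009 lands in (2004,2009] and 2010 in (2009,2011]. If the lowest break itself should be included, cut() has an include.lowest argument for that.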
Plotting it All Together
We have now derived year_joined from tenure and binned it into buckets, which we can combine with age.
Create a line graph of friend_count vs. age so that each year_joined.bucket is a line tracking the median user friend_count across age. This means you should have four different lines on your plot.
You should subset the data to exclude the users whose year_joined.bucket is NA.
table(pf$year_joined, useNA = 'ifany')
##
## 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 <NA>
## 9 15 581 1507 4557 5448 9860 33366 43588 70 2
ggplot(aes(x = age, y = friend_count),
data = subset(pf, !is.na(year_joined.bucket)))+
geom_line(aes(color=year_joined.bucket), stat='summary', fun.y = median)
In this plot we observe three variables: age on the x axis, friend_count on the y axis, and year_joined.bucket as the categorical variable.
Plot the Grand Mean
Write code to do the following:
Add another geom_line to code below to plot the grand mean of the friend count vs age.
Exclude any users whose year_joined.bucket is NA.
Use a different line type for the grand mean.
As a reminder, the parameter linetype can take the values 0-6:
0 = blank, 1 = solid, 2 = dashed, 3 = dotted, 4 = dotdash, 5 = longdash, 6 = twodash
ggplot(aes(x = age, y = friend_count),
data = subset(pf, !is.na(year_joined.bucket)))+
geom_line(aes(color=year_joined.bucket), stat='summary', fun.y = mean)+
geom_line(fun.y = mean, stat='summary', linetype=2)
Friending Rate
- From this plot, we can see that the pattern in the means isn’t entirely an artifact.
- So we ask another question: how many friends does a user gain for each day of tenure?
with(subset(pf, tenure > 1), summary(friend_count/tenure))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0775 0.2204 0.6069 0.5652 417.0000
Friendships Initiated
Users who have been on the site longer tend to have more friends, so we ask whether users with more tenure also initiate more friendships.
What is the median friend rate? .2205
What is the maximum friend rate? 417
Create a line graph of mean of friendships_initiated per day (of tenure) vs. tenure colored by year_joined.bucket.
You need to make use of the variables tenure, friendships_initiated, and year_joined.bucket.
You also need to subset the data to only consider user with at least one day of tenure.
ggplot(aes(x = tenure, y = friendships_initiated/tenure),
data = subset(pf, tenure >= 1))+
geom_line(aes(color=year_joined.bucket), stat='summary', fun.y = mean)
This shows that users with more tenure typically initiate fewer friendships per day.
Bias-Variance Tradeoff Revisited
Notice that the graph is noisy. By rounding x into wider bins, we reduce the noise (variance) at the cost of more bias.
Instead of geom_line(), use geom_smooth() to add a smoother to the plot. You can use the defaults for geom_smooth() but do color the line by year_joined.bucket
ggplot(aes(x = tenure, y = friendships_initiated / tenure),
data = subset(pf, tenure >= 1)) +
geom_line(aes(color = year_joined.bucket),
stat = 'summary',
fun.y = mean)
ggplot(aes(x = 7 * round(tenure / 7), y = friendships_initiated / tenure),
data = subset(pf, tenure > 0)) +
geom_line(aes(color = year_joined.bucket),
stat = "summary",
fun.y = mean)
ggplot(aes(x = 30 * round(tenure / 30), y = friendships_initiated / tenure),
data = subset(pf, tenure > 0)) +
geom_line(aes(color = year_joined.bucket),
stat = "summary",
fun.y = mean)
ggplot(aes(x = 90 * round(tenure / 90), y = friendships_initiated / tenure),
data = subset(pf, tenure > 0)) +
geom_line(aes(color = year_joined.bucket),
stat = "summary",
fun.y = mean)
ggplot(aes(x = tenure, y = friendships_initiated / tenure),
data = subset(pf, tenure >= 1)) +
geom_smooth(aes(color = year_joined.bucket))
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
Smoothing also gives us a clearer picture of the overall trend in the data.
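The binning expression used in the plots above, 7 * round(tenure / 7), can be tried on a few numbers to see how it trades variance for bias. This is a minimal sketch with made-up values; tenure_toy and rate_toy are hypothetical and not drawn from pf.

```r
# Hypothetical tenures (days) and friending rates.
tenure_toy <- c(1, 3, 4, 10, 11, 17, 18, 25)
rate_toy   <- c(2.0, 1.0, 0.9, 0.7, 0.8, 0.4, 0.5, 0.3)

# Snap each tenure to the nearest multiple of 7 days.
binned <- 7 * round(tenure_toy / 7)
print(binned)
# 0 0 7 7 14 14 21 28

# One mean per bin instead of one value per distinct tenure.
print(tapply(rate_toy, binned, mean))
```

Eight distinct x values collapse to five, so each plotted mean averages over more users (less noise) but hides differences within a bin (more bias). Wider bins such as 30 or 90 days push further in the same direction.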
Sean’s NFL Fan Sentiment Study
- Next we hear about Sean’s NFL fan study and his visualization of the bias-variance tradeoff.
- His study is about the emotion (sad to happy) that fans of a particular NFL team express.
- The measure is a positive/negative ratio: positive means happy, negative means sad.
- Plotted over time, the raw series is noisy and jumps up and down.
- So he averages per day and then smooths over a 7-day window.
- The result is an interesting graph.
- He then converts it into a more discrete format: win or lose.
- To handle the bias-variance tradeoff, don’t rely on gut feeling; listen to what the data tells you.
- Earlier we had high variance, which showed up as a lot of noise.
- With heavy smoothing we get lower variance but higher bias.
- So we use a spline (smoother) and accept somewhat higher bias in exchange for lower variance.
- If the data isn’t good enough, we may have to use EDA to ask more interesting questions.
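Averaging over a 7-day window can be sketched with a centered moving average in base R. The series below is invented for the illustration (a slow wave plus an alternating jitter); it is not Sean's data, and a moving average is a stand-in here for whatever smoother he actually used.

```r
# Hypothetical daily sentiment ratio over four weeks.
days      <- 1:28
sentiment <- 1 + 0.5 * sin(days / 5) + rep(c(0.4, -0.4), 14)

# Centered 7-day moving average. stats::filter is called explicitly
# because dplyr masks filter() when it is loaded.
smooth7 <- stats::filter(sentiment, rep(1 / 7, 7), sides = 2)

# The first and last three days have no full window and come back as NA.
print(round(as.numeric(smooth7), 2))

# Day-to-day jumps are much smaller after smoothing.
print(sd(diff(sentiment)))
print(sd(diff(as.numeric(smooth7)), na.rm = TRUE))
```

A longer window would flatten the series further, which lowers the variance again but also starts to hide real changes, the bias side of the tradeoff.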
Introducing the Yogurt Data Set
Bayesian Statistics and Marketing contains the data set and a case study on it.
The citation for the original paper on the yogurt data set is Kim, Jaehwan, Greg M. Allenby, and Peter E. Rossi. “Modeling consumer demand for variety.” Marketing Science 21.3 (2002): 229-250.
A special thanks to Professor Allenby for helping us understand this data set.
To learn more about scanner data, check out Panel Data Discrete Choice Models of Consumer Demand.
Histograms Revisited
The yogurt data set comes as a separate CSV file with one purchase occasion per row.
yo = read.csv('yogurt.csv')
summary(yo)
## obs id time strawberry
## Min. : 1.0 Min. :2100081 Min. : 9662 Min. : 0.0000
## 1st Qu.: 696.5 1st Qu.:2114348 1st Qu.: 9843 1st Qu.: 0.0000
## Median :1369.5 Median :2126532 Median :10045 Median : 0.0000
## Mean :1367.8 Mean :2128592 Mean :10050 Mean : 0.6492
## 3rd Qu.:2044.2 3rd Qu.:2141549 3rd Qu.:10255 3rd Qu.: 1.0000
## Max. :2743.0 Max. :2170639 Max. :10459 Max. :11.0000
## blueberry pina.colada plain mixed.berry
## Min. : 0.0000 Min. : 0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 0.0000 Median : 0.0000 Median :0.0000 Median :0.0000
## Mean : 0.3571 Mean : 0.3584 Mean :0.2176 Mean :0.3887
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :12.0000 Max. :10.0000 Max. :6.0000 Max. :8.0000
## price
## Min. :20.00
## 1st Qu.:50.00
## Median :65.04
## Mean :59.25
## 3rd Qu.:68.96
## Max. :68.96
str(yo)
## 'data.frame': 2380 obs. of 9 variables:
## $ obs : int 1 2 3 4 5 6 7 8 9 10 ...
## $ id : int 2100081 2100081 2100081 2100081 2100081 2100081 2100081 2100081 2100081 2100081 ...
## $ time : int 9678 9697 9825 9999 10015 10029 10036 10042 10083 10091 ...
## $ strawberry : int 0 0 0 0 1 1 0 0 0 0 ...
## $ blueberry : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pina.colada: int 0 0 0 0 1 2 0 0 0 0 ...
## $ plain : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mixed.berry: int 1 1 1 1 1 1 1 1 1 1 ...
## $ price : num 59 59 65 65 49 ...
yo$id <- factor(yo$id)
ggplot(aes(x=price),
data=yo)+
geom_histogram(stat='bin', binwidth =10)
ggplot(aes(x=price),
data=yo)+
geom_histogram(stat='bin')
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Notice that there are more purchases at the higher prices. With binwidth = 10, the bias is much higher: the histogram loses the discreteness of the data, and we fail to see the individual prices.
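The loss of discreteness can be seen without a plot by counting a few prices exactly and then in 10-unit bins. This is a rough sketch: price_toy is a made-up sample that reuses price points from the summary above, and the bin edges chosen here are an assumption that need not match where ggplot2 places its bins.

```r
# Hypothetical sample of prices.
price_toy <- c(20.00, 50.00, 65.04, 65.04, 68.96, 68.96, 68.96)

# Exact counts keep every distinct price visible.
exact <- table(price_toy)
print(exact)

# Bins of width 10 merge 65.04 and 68.96 into a single bar.
coarse <- table(cut(price_toy, breaks = seq(20, 70, by = 10),
                    include.lowest = TRUE))
print(coarse)
```

Four distinct prices end up in three non-empty bins, so the two most common prices can no longer be told apart.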
Number of Purchases
Now we want to count the total number of yogurts in each household’s purchases.
table(yo$id)
##
## 2100081 2100370 2100396 2100669 2100768 2100818 2100909 2101394 2101758
## 34 13 3 2 2 10 7 6 2
## 2101782 2101790 2101980 2102095 2102129 2102715 2102913 2103218 2103291
## 6 47 2 3 3 12 4 3 18
## 2103390 2103887 2103994 2104067 2104091 2104273 2104547 2104620 2104950
## 2 3 3 2 4 4 22 2 6
## 2105155 2105239 2105254 2105320 2105346 2105361 2105403 2105759 2106047
## 4 3 6 2 4 4 7 26 4
## 2106286 2106351 2106401 2106567 2106724 2106799 2107094 2107300 2107391
## 7 2 2 2 6 8 2 2 5
## 2107706 2107953 2108100 2108209 2108639 2108704 2108944 2108985 2109025
## 5 10 4 5 8 4 15 4 2
## 2109033 2109298 2109769 2110007 2110031 2110056 2110411 2110460 2110635
## 2 17 2 5 22 2 19 3 3
## 2110775 2110890 2110965 2111203 2111385 2111674 2111922 2112235 2112482
## 3 3 3 2 4 10 18 4 3
## 2112516 2113340 2113472 2113613 2113779 2114025 2114041 2114074 2114231
## 15 5 12 4 7 22 2 11 3
## 2114314 2114348 2114371 2114819 2114892 2114942 2115006 2115220 2115527
## 3 16 2 4 2 6 2 4 9
## 2115998 2116277 2116434 2116558 2117069 2117226 2117242 2117317 2117788
## 28 2 2 4 13 2 2 7 4
## 2118182 2118299 2118612 2118778 2118927 2119024 2119164 2119594 2119693
## 3 19 12 19 9 4 5 11 11
## 2119735 2120089 2120261 2120378 2120436 2120964 2121095 2121277 2121400
## 8 9 25 4 2 2 5 3 29
## 2121418 2121533 2121582 2122242 2122655 2122705 2122788 2122838 2123000
## 7 13 5 3 10 12 2 2 7
## 2123091 2123257 2123463 2123471 2123554 2123695 2123885 2123968 2124073
## 6 4 7 2 19 6 3 2 50
## 2124115 2124156 2124305 2124321 2124388 2124412 2124511 2124545 2124701
## 4 6 16 2 2 11 5 17 5
## 2124750 2124909 2124941 2125203 2125427 2125443 2125609 2125658 2126102
## 25 6 4 4 8 13 2 3 3
## 2126292 2126490 2126532 2126847 2126946 2127076 2127308 2127407 2127498
## 8 2 10 2 2 3 3 2 2
## 2127605 2127621 2127803 2127936 2128116 2128389 2128447 2128595 2128827
## 3 12 6 6 6 5 5 7 2
## 2128884 2128959 2129080 2129098 2129163 2129361 2129528 2129734 2129767
## 2 12 2 2 4 3 39 5 12
## 2129817 2129874 2129940 2130351 2130377 2130583 2130641 2130807 2130914
## 8 2 2 3 4 59 5 6 3
## 2130948 2131250 2131466 2131508 2132019 2132290 2132555 2133033 2133066
## 12 16 2 2 2 74 2 2 7
## 2133108 2133207 2133272 2133330 2133413 2133496 2133611 2133660 2133983
## 2 2 17 10 3 14 2 3 4
## 2134023 2134122 2134288 2134452 2134478 2134676 2134874 2135251 2135301
## 2 20 22 2 2 8 2 2 2
## 2135384 2135681 2135996 2136069 2136531 2136697 2136960 2137067 2137380
## 3 2 5 2 23 2 4 2 3
## 2137687 2137745 2138966 2139162 2139626 2139766 2139774 2140483 2141002
## 4 12 14 4 2 9 3 5 9
## 2141341 2141507 2141549 2141812 2141861 2142885 2142976 2143180 2143271
## 2 4 6 6 4 10 4 9 3
## 2143396 2143503 2143586 2143875 2144048 2144113 2144469 2144576 2144675
## 2 4 3 9 8 3 2 4 3
## 2145292 2145326 2145425 2145599 2145672 2146035 2146597 2146621 2146738
## 5 27 2 8 11 7 16 2 4
## 2147512 2147751 2147777 2147892 2147991 2148296 2148924 2149500 2149609
## 2 3 17 24 4 4 7 50 6
## 2150029 2150854 2151423 2151472 2151613 2151829 2152108 2152264 2152454
## 4 5 3 2 3 3 2 2 5
## 2152702 2152975 2153015 2153163 2153387 2153494 2153619 2154278 2154351
## 29 2 5 6 9 5 8 3 2
## 2154849 2155697 2155929 2156224 2157040 2157164 2157420 2158097 2158196
## 3 16 6 3 4 2 3 9 2
## 2158436 2158642 2158873 2159897 2160259 2160382 2160440 2160549 2160762
## 7 6 2 3 5 2 7 2 3
## 2161554 2161885 2162206 2162313 2162545 2162669 2164392 2164756 2164863
## 8 6 6 5 4 3 2 2 3
## 2165746 2165779 2165951 2166223 2166934 2167221 2167320 2167817 2167825
## 24 9 7 2 3 2 3 3 2
## 2168005 2168013 2168443 2169128 2169250 2169268 2169896 2170639
## 2 2 15 4 2 7 7 2
# add the total number of yogurts (across all flavors) bought on each purchase occasion
yo <- transform(yo, all.purchases = strawberry + blueberry + pina.colada + plain + mixed.berry)
Prices over Time
- The histogram of all.purchases below shows something interesting.
- Most households buy only a few yogurts at a time, while a small number buy many more.
- Why? First let’s investigate the price of yogurt over time.
Create a scatterplot of price vs time. This will be an example of a time series plot.
ggplot(aes(x=all.purchases),
data = yo )+
geom_histogram(binwidth=1)
ggplot(aes(x=time,y=price),
data=yo)+
geom_point(alpha=1/20)
- The scatterplot shows that the price tends to go up over time.
- Some stretches of the plot stay flat or sit lower, in which case the buyers may have been using coupons.
Sampling Observations
- Dean says that when exploring data that involves many plots and many units, it is often useful to take a small subset of the data (a sample) and look at it in various ways.
- For the yogurt data set, we sample just 16 households.
- We can then ask further questions: at what prices do households tend to buy, and how many yogurts do they buy?
Looking at Samples of Households
Note: x %in% y returns a logical (boolean) vector the same length as x that says whether each entry in x appears in y. That is, for each entry in x, it checks to see whether it is in y.
This allows us to subset the data so we get all the purchase occasions for the households in the sample. Then, we create scatterplots of price vs. time and facet by the sample id.
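A tiny sketch makes the behavior of %in% concrete. The data frame yo_toy and the ids "A" to "D" are invented for this illustration, and toy_ids stands in for sample.ids.

```r
# Hypothetical purchase occasions for four households.
yo_toy <- data.frame(
  id    = factor(c("A", "A", "B", "C", "C", "D")),
  price = c(59, 65, 49, 65, 69, 39)
)
toy_ids <- c("A", "C")

# One TRUE/FALSE per row of yo_toy: is this row's id among toy_ids?
keep <- yo_toy$id %in% toy_ids
print(keep)
# TRUE TRUE FALSE TRUE TRUE FALSE

# The same test inside subset() keeps every row for the chosen households.
toy_sample <- subset(yo_toy, id %in% toy_ids)
print(toy_sample)
```

The result has one entry per row of the left-hand side, which is what makes it usable as a row filter.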
Use the pch or shape parameter to specify the symbol when plotting points. Scroll down to ‘Plotting Points’ on QuickR’s Graphical Parameters.
- The plot below displays only the small subset of the data whose id appears in sample.ids.
- We draw one line per household id, each in its own facet.
- The points emphasize where the line changes; their size grows with the number of items the household bought on that occasion.
#set the seed for reproducible results
set.seed(10000)
sample.ids <- sample(levels(yo$id), 16)
ggplot(aes(x=time, y= price),
data= subset(yo, id %in% sample.ids))+
facet_wrap(~id)+
geom_line()+
geom_point(aes(size=all.purchases), pch=1)
# ggsave() saves the last plot displayed; call it on its own rather than adding it with +
ggsave('Seed@10000.jpg')
## Saving 7 x 5 in image
- Some households make small but steady purchases.
- Some buy yogurt, come back after a long gap, and buy larger amounts. Maybe they remembered the product and also bought for friends.
- One household bought a lot at first and then only a little. Perhaps the first purchase was enough.
- One household makes a lot of purchases; perhaps they resell the yogurt or share it with others.
- We can also see that no household’s price stays at the high level throughout; buyers appear to receive coupons and use them to lower their cost.
The Limits of Cross Sectional Data
Looking back at the Facebook plots, we can’t see how friendships initiated change over time for a given user, because the data is cross-sectional: each user is observed once, and we can only compare groups, shown by the different colors in the graph. It is not a time series like the yogurt data, where we can follow each household’s purchases. It would be great to have time-series data on friendships initiated per day.
Many Variables
Dean also says that we use EDA to explore relationships between variables, bringing in another variable to check the consistency of the relationship between the two we are observing. We may also want to predict one variable from the rest, or reduce the number of dimensions (for example with PCA) to get a better visualization. Above all, let the data speak for itself: plot multiple graphs and visualizations to understand the data better.
Scatterplot Matrix
A scatterplot matrix may not work well for this particular data, especially for categorical variables.
Here’s the scatterplot matrix as a pdf.
You’ll need to run the code install.packages('GGally') to install the package for creating this particular scatterplot matrix.
If the plot takes a long time to render or if you want to see some of the scatterplot matrix, then only examine a smaller number of variables. You can use the following code or select fewer variables. We recommend including gender (the 6th variable)!
pf_subset <- pf[ , c(2:7)]
library(GGally)
set.seed(1836)
pf_subset <- pf[,c(2:15)]
ggpairs(pf_subset[sample.int(nrow(pf_subset),1000),-1])
Great work on finding or computing the correlation coefficients.
Scatterplots are below the diagonal, and categorical variables, like gender, create faceted histograms. ggpairs also produces a lookup table of correlation coefficients for the pairs of variables we want to compare. It plots the variables untransformed, so it may not be ideal for data that calls for a log scale, but it is a good starting point for plotting.
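The correlation coefficients that appear in a scatterplot matrix can also be computed directly with base R's cor(). This is a small sketch on invented numbers; the data frame toy_pf and its values are hypothetical and only borrow column names from pf.

```r
# Hypothetical values for three numeric variables.
toy_pf <- data.frame(
  age                   = c(15, 18, 22, 25, 31, 40, 47, 52, 60, 68),
  friend_count          = c(420, 380, 300, 310, 220, 180, 200, 150, 120, 90),
  friendships_initiated = c(250, 200, 180, 160, 120, 90, 110, 70, 60, 40)
)

# Pairwise Pearson correlations: a symmetric matrix with 1s on the diagonal.
cor_matrix <- cor(toy_pf)
print(round(cor_matrix, 2))
```

Each off-diagonal entry corresponds to one pair of variables, the same pairing a scatterplot matrix shows as one panel.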
Even More Variables
- Genetic data can have far more numeric parameters (features).
- nci is a gene expression data set with a large number of rows: about 6000 genes, each measured across 64 cases.
Heat Maps
nci <- read.table("nci.tsv")
names(nci)
## [1] "V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V9" "V10" "V11"
## [12] "V12" "V13" "V14" "V15" "V16" "V17" "V18" "V19" "V20" "V21" "V22"
## [23] "V23" "V24" "V25" "V26" "V27" "V28" "V29" "V30" "V31" "V32" "V33"
## [34] "V34" "V35" "V36" "V37" "V38" "V39" "V40" "V41" "V42" "V43" "V44"
## [45] "V45" "V46" "V47" "V48" "V49" "V50" "V51" "V52" "V53" "V54" "V55"
## [56] "V56" "V57" "V58" "V59" "V60" "V61" "V62" "V63" "V64"
colnames(nci) <- c(1:64) # rename the columns to plain numbers so they are easier to work with
head(nci)
## 1 2 3 4 5 6 7 8 9
## 1 0.300 0.679961 0.940 2.800000e-01 0.485 0.310 -0.830 -0.190 0.460
## 2 1.180 1.289961 -0.040 -3.100000e-01 -0.465 -0.030 0.000 -0.870 0.000
## 3 0.550 0.169961 -0.170 6.800000e-01 0.395 -0.100 0.130 -0.450 1.150
## 4 1.140 0.379961 -0.040 -8.100000e-01 0.905 -0.460 -1.630 0.080 -1.400
## 5 -0.265 0.464961 -0.605 6.250000e-01 0.200 -0.205 0.075 0.005 -0.005
## 6 -0.070 0.579961 0.000 -1.387779e-17 -0.005 -0.540 -0.360 0.350 -0.700
## 10 11 12 13 14 15 16 17 18 19
## 1 0.760 0.270 -0.450 -0.030 0.710 -0.360 -0.210 -0.500 -1.060 0.150
## 2 1.490 0.630 -0.060 -1.120 0.000 -1.420 -1.950 -0.520 -2.190 -0.450
## 3 0.280 -0.360 0.150 -0.050 0.160 -0.030 -0.700 -0.660 -0.130 -0.320
## 4 0.100 -1.040 -0.610 0.000 -0.770 -2.280 -1.650 -2.610 0.000 -1.610
## 5 -0.525 0.015 -0.395 -0.285 0.045 0.135 -0.075 0.225 -0.485 -0.095
## 6 0.360 -0.040 0.150 -0.250 -0.160 -0.320 0.060 -0.050 -0.430 -0.080
## 20 21 22 23 24 25 26 27 28 29
## 1 -0.290 -0.200 0.430 -0.490 -0.530 -0.010 0.640 -0.480 0.140 0.640
## 2 0.000 0.740 0.500 0.330 -0.050 -0.370 0.550 0.970 0.720 0.150
## 3 0.050 0.080 -0.730 0.010 -0.230 -0.160 -0.540 0.300 -0.240 -0.170
## 4 0.730 0.760 0.600 -1.660 0.170 0.930 -1.780 0.470 0.000 0.550
## 5 0.385 -0.105 -0.635 -0.185 0.825 0.395 0.315 0.425 1.715 -0.205
## 6 0.390 -0.080 -0.430 -0.140 0.010 -0.100 0.810 0.020 0.260 0.290
## 30 31 32 33 34 35 36 37 38
## 1 0.070 0.130 0.320 0.515 0.080 0.410 -0.200 -0.36998050 -0.370
## 2 0.290 2.240 0.280 1.045 0.120 0.000 0.000 -1.38998000 0.180
## 3 0.070 0.640 0.360 0.000 0.060 0.210 0.060 -0.05998047 0.000
## 4 1.310 0.680 -1.880 0.000 0.400 0.180 -0.070 0.07001953 -1.320
## 5 0.085 0.135 0.475 0.330 0.105 -0.255 -0.415 -0.07498047 -0.825
## 6 -0.620 0.300 0.110 -0.155 -0.190 -0.110 0.020 0.04001953 -0.130
## 39 40 41 42 43 44 45 46
## 1 -0.430 -0.380 -0.550 -0.32003900 -0.620 -4.900000e-01 0.07001953 -0.120
## 2 -0.590 -0.550 0.000 0.08996101 0.080 4.200000e-01 -0.82998050 0.000
## 3 -0.500 -1.710 0.100 -0.29003900 0.140 -3.400000e-01 -0.59998050 -0.010
## 4 -1.520 -1.870 -2.390 -1.03003900 0.740 7.000000e-02 -0.90998050 0.130
## 5 -0.785 -0.585 -0.215 0.09496101 0.205 -2.050000e-01 0.24501950 0.555
## 6 0.520 0.120 -0.620 0.05996101 0.000 -1.387779e-17 -0.43998050 -0.550
## 47 48 49 50 51 52 53 54
## 1 -0.290 -0.8100195 0.200 0.37998050 0.3100195 0.030 -0.42998050 0.160
## 2 0.030 0.0000000 -0.230 0.44998050 0.4800195 0.220 -0.38998050 -0.340
## 3 -0.310 0.2199805 0.360 0.65998050 0.9600195 0.150 -0.17998050 -0.020
## 4 1.500 0.7399805 0.180 0.76998050 0.9600195 -1.240 0.86001950 -1.730
## 5 0.005 0.1149805 -0.315 0.05498047 -0.2149805 -0.305 0.78501950 -0.625
## 6 -0.540 0.1199805 0.410 0.54998050 0.3700195 0.050 0.04001953 -0.140
## 55 56 57 58 59 60 61 62
## 1 0.010 -0.620 -0.380 0.04998047 0.650 -0.030 -0.270 0.210
## 2 -1.280 -0.130 0.000 -0.72001950 0.640 -0.480 0.630 -0.620
## 3 -0.770 0.200 -0.060 0.41998050 0.150 0.070 -0.100 -0.150
## 4 0.940 -1.410 0.800 0.92998050 -1.970 -0.700 1.100 -1.330
## 5 -0.015 1.585 -0.115 -0.09501953 -0.065 -0.195 1.045 0.045
## 6 0.270 1.160 0.180 0.19998050 0.130 0.410 0.080 -0.400
## 63 64
## 1 -5.000000e-02 0.350
## 2 1.400000e-01 -0.270
## 3 -9.000000e-02 0.020
## 4 -1.260000e+00 -1.230
## 5 4.500000e-02 -0.715
## 6 -2.710505e-20 -0.340
#Melt the data to long format
library(reshape2)
# Here we take just the first 200 genes (rows) and keep all 64 columns
nci.long.samp <- melt(as.matrix(nci[1:200,]))
str(nci.long.samp)
## 'data.frame': 12800 obs. of 3 variables:
## $ Var1 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Var2 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ value: num 0.3 1.18 0.55 1.14 -0.265 ...
names(nci.long.samp) <- c("gene", "case", "value")
head(nci.long.samp)
## gene case value
## 1 1 1 0.300
## 2 2 1 1.180
## 3 3 1 0.550
## 4 4 1 1.140
## 5 5 1 -0.265
## 6 6 1 -0.070
# under-expressed values are drawn in blue, over-expressed values in red
library(ggplot2)
# geom_tile() draws one tile per gene/case pair; the fill scale runs from blue to red
ggplot(aes(y = gene, x = case, fill = value),
data = nci.long.samp) +
geom_tile() +
scale_fill_gradientn(colours = colorRampPalette(c("blue", "red"))(100))
This heat map covers just 200 of the roughly 6000 genes. Using all 6000 would only increase the complexity of the visualization. That’s why it’s important to sample the data and then work through various visualizations of the relationships between the variables.
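What melting a matrix does can be sketched in base R on a 3 x 2 corner of the data. The values are the first three genes and first two cases from the head(nci) output above; the names m and long_toy are made up for the illustration, and this is a hand-rolled equivalent rather than a call to reshape2's melt.

```r
# 3 genes (rows) x 2 cases (columns), filled column by column.
m <- matrix(c(0.300, 1.180, 0.550,
              0.679961, 1.289961, 0.169961),
            nrow = 3)

# One row per cell: row index, column index, and the cell value.
long_toy <- data.frame(
  gene  = as.vector(row(m)),
  case  = as.vector(col(m)),
  value = as.vector(m)
)
print(long_toy)
```

A 3 x 2 matrix becomes 6 rows, just as the 200 x 64 sample becomes 12800 rows. The long form gives a tile-based heat map one x, one y, and one fill value per tile.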
Analyzing Three or More Variables
- This is a summary of how we got this far.
- We explored how to compare many variables (at least three).
- We combined our variables to build better intuition about the data.
- We plotted many visualizations (using GGally as a starting point) to get to know the data.
- We dealt with the complexity of our data through smoothing and sampling.
- We looked at many variables at once and plotted them.
- We took the plots from lesson 4 and extended them, dividing the data into multiple groups (buckets) and observing many variables with a scatterplot matrix and a heat map.
- Starting from one row per case, we converted to one row per combination of variables, using reshape to go back and forth between long and wide format.
- Next we go in depth on the diamonds data set and see how Solomon, as an expert, performs the larger part of an EDA and extends it. He also writes code to scrape data and uses it to predict diamond prices.