In this lesson we look at three or more variables at once. In particular, we use a third variable to check whether the relationship between two variables we are observing holds consistently.
library(ggplot2)
?read.csv
pf = read.csv('../lesson3/pseudo_facebook.tsv',sep = "\t")
ggplot(aes(x = gender, y = age),
data = subset(pf, !is.na(gender))) +geom_boxplot()+
stat_summary(fun.y = mean, geom = "point", shape =4 )
#+ geom_histogram()
ggplot(aes(x = age, y = friend_count),
data= subset(pf, !is.na(gender)))+
geom_line(aes(color=gender), stat="summary", fun.y = median)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
pf.fc_by_age_gender <- group_by(pf,age,gender) %>%
filter(!is.na(gender))%>%
summarise(median_friend_count = median(friend_count),
mean_friend_count = mean(friend_count),
n=n())%>%
# We grouped by age and gender; summarise() peels off the last grouping layer (gender),
# so ungroup() drops the remaining grouping by age before we arrange by age
ungroup()%>%
arrange(age)
head(pf.fc_by_age_gender)
## Source: local data frame [6 x 5]
##
## age gender median_friend_count mean_friend_count n
## 1 13 female 148 259.1606 193
## 2 13 male 55 102.1340 291
## 3 14 female 224 362.4286 847
## 4 14 male 92 164.1456 1078
## 5 15 female 276 538.6813 1139
## 6 15 male 106 200.6658 1478
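To see what ungroup() is doing in the pipeline above, here is a minimal sketch on a made-up data frame. The name toy and its values are assumptions for illustration only; they are not part of pseudo_facebook.tsv. Recent dplyr versions may also print a message about the grouping of the result.

```r
library(dplyr)

# Hypothetical toy data: five users, two ages, two genders
toy <- data.frame(
  age = c(13, 13, 13, 14, 14),
  gender = c("female", "male", "male", "female", "male"),
  friend_count = c(10, 4, 6, 20, 8)
)

by_age_gender <- toy %>%
  group_by(age, gender) %>%
  summarise(median_friend_count = median(friend_count),
            n = n())

# summarise() removed the last grouping layer (gender); age is still a group
group_vars(by_age_gender)

# ungroup() removes the remaining layer, leaving a plain table
flat <- ungroup(by_age_gender)
group_vars(flat)
```

The first group_vars() call should report only age, and the second should report no grouping variables at all.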
Create a line graph showing the median friend count over the ages for each gender. Be sure to use the data frame you just created, pf.fc_by_age_gender.
Instructor Notes
Your code should look similar to the code we used to make the plot the first time. It will not need to make use of the stat and fun.y parameters.
ggplot(aes(x = age, y = friend_count), data = subset(pf.1, !is.na(gender))) + geom_line(aes(color = gender), stat = 'summary', fun.y = median)
ggplot(aes(x= age, y=median_friend_count),
data = pf.fc_by_age_gender)+
geom_line(aes(color=gender))
It's important to use quotes around the variable name that is assigned to value.var.
We could also create a similar data frame using the dplyr package:
pf.fc_by_age_gender.wide <- pf.fc_by_age_gender %>%
group_by(age) %>%
summarise(male = median_friend_count[gender == 'male'],
female = median_friend_count[gender == 'female'],
ratio = female / male) %>%
arrange(age)
head(pf.fc_by_age_gender.wide)
library(reshape2)
pf.fc_by_age_gender.wide <- dcast(pf.fc_by_age_gender,
age ~ gender, # formula: left side = variable kept as rows, right side = variable whose values become columns
value.var='median_friend_count')
head(pf.fc_by_age_gender.wide)
## age female male
## 1 13 148 55
## 2 14 224 92
## 3 15 276 106
## 4 16 258 136
## 5 17 245 125
## 6 18 243 122
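As a small sketch of the long-to-wide reshape, the block below rebuilds the first two ages of the table by hand and casts them with dcast. The data frame name long is an assumption; the four median values are copied from the head() output of pf.fc_by_age_gender shown earlier.

```r
library(reshape2)

# Long format: one row per (age, gender) pair, values taken from the output above
long <- data.frame(
  age = c(13, 13, 14, 14),
  gender = c("female", "male", "female", "male"),
  median_friend_count = c(148, 55, 224, 92)
)

# Wide format: one row per age, one column per gender
wide <- dcast(long, age ~ gender, value.var = "median_friend_count")

# With both genders on the same row, the ratio is a simple column operation
wide$ratio <- wide$female / wide$male
wide
```

Having female and male side by side on each row is what makes the female / male ratio easy to compute and plot.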
Plot the ratio of the female to male median friend counts using the data frame pf.fc_by_age_gender.wide.
Think about what geom you should use. Add a horizontal line to the plot with a y intercept of 1, which will be the base line. Look up the documentation for geom_hline to do that. Use the parameter linetype in geom_hline to make the line dashed.
The linetype parameter can take the values 0-6: 0 = blank, 1 = solid, 2 = dashed, 3 = dotted, 4 = dotdash, 5 = longdash, 6 = twodash
library(ggplot2)
ggplot(aes(x=age, y = female/male),
data=pf.fc_by_age_gender.wide)+
geom_line()+
geom_hline(aes(yintercept=1),linetype=2)
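The pseudo-Facebook data may include many users from countries where men tend to have lower friend counts than women. The plot shows that younger women tend to have about twice as many friends as men of the same age, or more: in the table above, the female-to-male ratio of median friend counts is about 2.7 at age 13 and about 2 at age 18.
***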
Create a variable called year_joined in the pf data frame using the variable tenure and 2014 as the reference year.
The variable year joined should contain the year that a user joined facebook.
Instructor Notes
A common mistake is to use tenure rather than pf$tenure or with(pf, tenure…). Remember that you need to access the variable in the data frame. This is not one of the hints! :)
Hint 1: Divide the tenure variable by a number. Tenure is measured in days, but we want to convert it to years.
Hint 2: Subtract tenure measured in years from 2014. What does the decimal portion represent? Should we round up or round down to the closest year?
Hint 3: You can use the floor() function to round down to the nearest integer. You can use the ceiling() function to round up to the nearest integer. Which one should you use?
pf$year_joined <- floor(2014 - pf$tenure/365)
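To make the arithmetic concrete, here is a small worked sketch on a few made-up tenure values. The vector tenure_days is an assumption for illustration; the formula is the one used above.

```r
# Hypothetical tenures in days
tenure_days <- c(0, 100, 365, 400, 3000)

# Convert days to years, then count back from the reference year 2014
years_on_site <- tenure_days / 365
year_joined <- floor(2014 - years_on_site)

data.frame(tenure_days, years_on_site, year_joined)
```

With floor(), a user with 100 days of tenure is assigned to 2013 and a user with 400 days to 2012, because any fractional year is rounded down.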
Using table(), we can see how many users joined in each year. Next we bin year_joined into ranges with the cut function, which turns it into a categorical variable.
Create a new variable in the data frame called year_joined.bucket by using the cut function on the variable year_joined.
You need to create the following buckets for the new variable, year_joined.bucket
(2004, 2009]
(2009, 2011]
(2011, 2012]
(2012, 2014]
Note that a parenthesis means exclude the year and a bracket means include the year.
?cut
summary(pf$year_joined)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2005 2012 2012 2012 2013 2014 2
table(pf$year_joined)
##
## 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
## 9 15 581 1507 4557 5448 9860 33366 43588 70
pf$year_joined.bucket <- cut(pf$year_joined, breaks=c(2004,2009,2011,2012,2014))
table(pf$year_joined.bucket)
##
## (2004,2009] (2009,2011] (2011,2012] (2012,2014]
## 6669 15308 33366 43658
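The following sketch checks the bracket rule on a handful of hand-picked years, using the same breaks as above. The vector years is an assumption for illustration.

```r
# Hypothetical years, including the boundary values and one missing value
years <- c(2005, 2009, 2010, 2011, 2012, 2014, NA)

bucket <- cut(years, breaks = c(2004, 2009, 2011, 2012, 2014))

# Each year next to the bucket it lands in
data.frame(years, bucket)
```

Because the intervals are closed on the right, 2009 falls in (2004,2009] and 2011 falls in (2009,2011]. A missing year stays NA.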
We have now derived year_joined from tenure and binned it into year_joined.bucket, which we can plot together with age.
Create a line graph of friend_count vs. age so that each year_joined.bucket is a line tracking the median user friend_count across age. This means you should have four different lines on your plot.
You should subset the data to exclude the users whose year_joined.bucket is NA.
table(pf$year_joined, useNA = 'ifany')
##
## 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 <NA>
## 9 15 581 1507 4557 5448 9860 33366 43588 70 2
ggplot(aes(x = age, y = friend_count),
data = subset(pf, !is.na(year_joined.bucket)))+
geom_line(aes(color=year_joined.bucket), stat='summary', fun.y = median)
In this plot we now observe three variables: age on the x axis, friend_count on the y axis, and year_joined.bucket as a categorical variable shown by color.
***
Write code to do the following:
Add another geom_line to code below to plot the grand mean of the friend count vs age.
Exclude any users whose year_joined.bucket is NA.
Use a different line type for the grand mean.
As a reminder, the parameter linetype can take the values 0-6:
0 = blank, 1 = solid, 2 = dashed, 3 = dotted, 4 = dotdash, 5 = longdash, 6 = twodash
ggplot(aes(x = age, y = friend_count),
data = subset(pf, !is.na(year_joined.bucket)))+
geom_line(aes(color=year_joined.bucket), stat='summary', fun.y = mean)+
geom_line(fun.y = mean, stat='summary', linetype=2)
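The difference between the per-bucket means and the grand mean can be sketched without a plot. The toy data frame and its bucket labels "a" and "b" are assumptions for illustration; aggregate() from base R stands in for what stat = 'summary' computes at each age.

```r
# Hypothetical users: two ages, two buckets
toy <- data.frame(
  age = c(20, 20, 20, 30, 30),
  bucket = c("a", "a", "b", "a", "b"),
  friend_count = c(10, 20, 60, 30, 50)
)

# One mean per (age, bucket): the colored lines
group_means <- aggregate(friend_count ~ age + bucket, data = toy, FUN = mean)

# One mean per age over all users: the grand mean line
grand_means <- aggregate(friend_count ~ age, data = toy, FUN = mean)

group_means
grand_means
```

The grand mean is taken over all users of a given age, so it is weighted by how many users are in each bucket. It is not simply the average of the bucket means.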
with(subset(pf, tenure > 1), summary(friend_count/tenure))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0775 0.2204 0.6069 0.5652 417.0000
Users who have been on the site longer have had more time to accumulate friends, so we look at the friend rate: friend count per day of tenure.
What is the median friend rate? .2205
What is the maximum friend rate? 417
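A minimal sketch of the friend rate calculation on made-up users is shown below. The data frame users and its values are assumptions; the point is why users with zero tenure are filtered out before dividing.

```r
# Hypothetical users; the last one joined today (tenure of 0 days)
users <- data.frame(
  friend_count = c(10, 50, 417, 0),
  tenure = c(100, 200, 1, 0)
)

# Keep only users with at least one day of tenure to avoid dividing by zero
active <- subset(users, tenure >= 1)

# Friends per day of tenure
rate <- with(active, friend_count / tenure)
summary(rate)
```

A user with many friends and only one day of tenure produces a very large rate, which is how a maximum far above the median can appear.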
Create a line graph of mean of friendships_initiated per day (of tenure) vs. tenure colored by year_joined.bucket.
You need to make use of the variables tenure, friendships_initiated, and year_joined.bucket.
You also need to subset the data to only consider users with at least one day of tenure.
ggplot(aes(x = tenure, y = friendships_initiated/tenure),
data = subset(pf, tenure >= 1))+
geom_line(aes(color=year_joined.bucket), stat='summary', fun.y = mean)
This shows that users with more tenure typically initiate fewer friendships per day.
***
Notice that the graph is noisy. By rounding x into wider bins we reduce the noise (variance) at the cost of more bias.
Instead of geom_line(), use geom_smooth() to add a smoother to the plot. You can use the defaults for geom_smooth() but do color the line by year_joined.bucket
ggplot(aes(x = tenure, y = friendships_initiated / tenure),
data = subset(pf, tenure >= 1)) +
geom_line(aes(color = year_joined.bucket),
stat = 'summary',
fun.y = mean)
ggplot(aes(x = 7 * round(tenure / 7), y = friendships_initiated / tenure),
data = subset(pf, tenure > 0)) +
geom_line(aes(color = year_joined.bucket),
stat = "summary",
fun.y = mean)
ggplot(aes(x = 30 * round(tenure / 30), y = friendships_initiated / tenure),
data = subset(pf, tenure > 0)) +
geom_line(aes(color = year_joined.bucket),
stat = "summary",
fun.y = mean)
ggplot(aes(x = 90 * round(tenure / 90), y = friendships_initiated / tenure),
data = subset(pf, tenure > 0)) +
geom_line(aes(color = year_joined.bucket),
stat = "summary",
fun.y = mean)
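The expression 7 * round(tenure / 7) snaps each tenure to the nearest multiple of 7, so nearby days share one x value and their y values are averaged together. A small sketch on made-up numbers (tenure and rate are assumptions for illustration):

```r
# Hypothetical tenures in days and friendships initiated per day
tenure <- c(1, 3, 4, 10, 11, 17, 18)
rate <- c(2, 4, 1, 3, 0.5, 1.5, 1)

# Snap each tenure to the nearest multiple of 7
binned <- 7 * round(tenure / 7)

# Mean rate per bin, which is what stat = "summary" with mean plots at each x
bin_means <- tapply(rate, binned, mean)

binned
bin_means
```

Seven distinct x values collapse into four, so the line has fewer and steadier points. Using 30 or 90 instead of 7 widens the bins and smooths the line further, while hiding more of the day-to-day detail.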
ggplot(aes(x = tenure, y = friendships_initiated / tenure),
data = subset(pf, tenure >= 1)) +
geom_smooth(aes(color = year_joined.bucket))
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
Smoothing also gives us a better understanding of the overall trend in the data.
***
Bayesian Statistics and Marketing contains the data set and a case study on it.
The citation for the original paper on the yogurt data set is Kim, Jaehwan, Greg M. Allenby, and Peter E. Rossi. “Modeling consumer demand for variety.” Marketing Science 21.3 (2002): 229-250.
A special thanks to Professor Allenby for helping us understand this data set.
To learn more about scanner data, check out Panel Data Discrete Choice Models of Consumer Demand
***
The yogurt data set is a separate csv file with a different structure: each row is one purchase occasion.
yo = read.csv('yogurt.csv')
summary(yo)
## obs id time strawberry
## Min. : 1.0 Min. :2100081 Min. : 9662 Min. : 0.0000
## 1st Qu.: 696.5 1st Qu.:2114348 1st Qu.: 9843 1st Qu.: 0.0000
## Median :1369.5 Median :2126532 Median :10045 Median : 0.0000
## Mean :1367.8 Mean :2128592 Mean :10050 Mean : 0.6492
## 3rd Qu.:2044.2 3rd Qu.:2141549 3rd Qu.:10255 3rd Qu.: 1.0000
## Max. :2743.0 Max. :2170639 Max. :10459 Max. :11.0000
## blueberry pina.colada plain mixed.berry
## Min. : 0.0000 Min. : 0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 0.0000 Median : 0.0000 Median :0.0000 Median :0.0000
## Mean : 0.3571 Mean : 0.3584 Mean :0.2176 Mean :0.3887
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :12.0000 Max. :10.0000 Max. :6.0000 Max. :8.0000
## price
## Min. :20.00
## 1st Qu.:50.00
## Median :65.04
## Mean :59.25
## 3rd Qu.:68.96
## Max. :68.96
str(yo)
## 'data.frame': 2380 obs. of 9 variables:
## $ obs : int 1 2 3 4 5 6 7 8 9 10 ...
## $ id : int 2100081 2100081 2100081 2100081 2100081 2100081 2100081 2100081 2100081 2100081 ...
## $ time : int 9678 9697 9825 9999 10015 10029 10036 10042 10083 10091 ...
## $ strawberry : int 0 0 0 0 1 1 0 0 0 0 ...
## $ blueberry : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pina.colada: int 0 0 0 0 1 2 0 0 0 0 ...
## $ plain : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mixed.berry: int 1 1 1 1 1 1 1 1 1 1 ...
## $ price : num 59 59 65 65 49 ...
yo$id <- factor(yo$id)
ggplot(aes(x=price),
data=yo)+
geom_histogram(stat='bin', binwidth =10)
ggplot(aes(x=price),
data=yo)+
geom_histogram(stat='bin')
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
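Notice that there are more purchases at the higher prices. With binwidth = 10 the histogram is much more biased: it loses the discreteness of the prices, so we fail to see the individual price points.
***
Now we want to count the total number of yogurts bought on each purchase occasion. First, table() shows how many purchase occasions each household has.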
table(yo$id)
##
## 2100081 2100370 2100396 2100669 2100768 2100818 2100909 2101394 2101758
## 34 13 3 2 2 10 7 6 2
## 2101782 2101790 2101980 2102095 2102129 2102715 2102913 2103218 2103291
## 6 47 2 3 3 12 4 3 18
## 2103390 2103887 2103994 2104067 2104091 2104273 2104547 2104620 2104950
## 2 3 3 2 4 4 22 2 6
## 2105155 2105239 2105254 2105320 2105346 2105361 2105403 2105759 2106047
## 4 3 6 2 4 4 7 26 4
## 2106286 2106351 2106401 2106567 2106724 2106799 2107094 2107300 2107391
## 7 2 2 2 6 8 2 2 5
## 2107706 2107953 2108100 2108209 2108639 2108704 2108944 2108985 2109025
## 5 10 4 5 8 4 15 4 2
## 2109033 2109298 2109769 2110007 2110031 2110056 2110411 2110460 2110635
## 2 17 2 5 22 2 19 3 3
## 2110775 2110890 2110965 2111203 2111385 2111674 2111922 2112235 2112482
## 3 3 3 2 4 10 18 4 3
## 2112516 2113340 2113472 2113613 2113779 2114025 2114041 2114074 2114231
## 15 5 12 4 7 22 2 11 3
## 2114314 2114348 2114371 2114819 2114892 2114942 2115006 2115220 2115527
## 3 16 2 4 2 6 2 4 9
## 2115998 2116277 2116434 2116558 2117069 2117226 2117242 2117317 2117788
## 28 2 2 4 13 2 2 7 4
## 2118182 2118299 2118612 2118778 2118927 2119024 2119164 2119594 2119693
## 3 19 12 19 9 4 5 11 11
## 2119735 2120089 2120261 2120378 2120436 2120964 2121095 2121277 2121400
## 8 9 25 4 2 2 5 3 29
## 2121418 2121533 2121582 2122242 2122655 2122705 2122788 2122838 2123000
## 7 13 5 3 10 12 2 2 7
## 2123091 2123257 2123463 2123471 2123554 2123695 2123885 2123968 2124073
## 6 4 7 2 19 6 3 2 50
## 2124115 2124156 2124305 2124321 2124388 2124412 2124511 2124545 2124701
## 4 6 16 2 2 11 5 17 5
## 2124750 2124909 2124941 2125203 2125427 2125443 2125609 2125658 2126102
## 25 6 4 4 8 13 2 3 3
## 2126292 2126490 2126532 2126847 2126946 2127076 2127308 2127407 2127498
## 8 2 10 2 2 3 3 2 2
## 2127605 2127621 2127803 2127936 2128116 2128389 2128447 2128595 2128827
## 3 12 6 6 6 5 5 7 2
## 2128884 2128959 2129080 2129098 2129163 2129361 2129528 2129734 2129767
## 2 12 2 2 4 3 39 5 12
## 2129817 2129874 2129940 2130351 2130377 2130583 2130641 2130807 2130914
## 8 2 2 3 4 59 5 6 3
## 2130948 2131250 2131466 2131508 2132019 2132290 2132555 2133033 2133066
## 12 16 2 2 2 74 2 2 7
## 2133108 2133207 2133272 2133330 2133413 2133496 2133611 2133660 2133983
## 2 2 17 10 3 14 2 3 4
## 2134023 2134122 2134288 2134452 2134478 2134676 2134874 2135251 2135301
## 2 20 22 2 2 8 2 2 2
## 2135384 2135681 2135996 2136069 2136531 2136697 2136960 2137067 2137380
## 3 2 5 2 23 2 4 2 3
## 2137687 2137745 2138966 2139162 2139626 2139766 2139774 2140483 2141002
## 4 12 14 4 2 9 3 5 9
## 2141341 2141507 2141549 2141812 2141861 2142885 2142976 2143180 2143271
## 2 4 6 6 4 10 4 9 3
## 2143396 2143503 2143586 2143875 2144048 2144113 2144469 2144576 2144675
## 2 4 3 9 8 3 2 4 3
## 2145292 2145326 2145425 2145599 2145672 2146035 2146597 2146621 2146738
## 5 27 2 8 11 7 16 2 4
## 2147512 2147751 2147777 2147892 2147991 2148296 2148924 2149500 2149609
## 2 3 17 24 4 4 7 50 6
## 2150029 2150854 2151423 2151472 2151613 2151829 2152108 2152264 2152454
## 4 5 3 2 3 3 2 2 5
## 2152702 2152975 2153015 2153163 2153387 2153494 2153619 2154278 2154351
## 29 2 5 6 9 5 8 3 2
## 2154849 2155697 2155929 2156224 2157040 2157164 2157420 2158097 2158196
## 3 16 6 3 4 2 3 9 2
## 2158436 2158642 2158873 2159897 2160259 2160382 2160440 2160549 2160762
## 7 6 2 3 5 2 7 2 3
## 2161554 2161885 2162206 2162313 2162545 2162669 2164392 2164756 2164863
## 8 6 6 5 4 3 2 2 3
## 2165746 2165779 2165951 2166223 2166934 2167221 2167320 2167817 2167825
## 24 9 7 2 3 2 3 3 2
## 2168005 2168013 2168443 2169128 2169250 2169268 2169896 2170639
## 2 2 15 4 2 7 7 2
# Total number of yogurts bought on each purchase occasion, summed over all flavors
yo <- transform(yo, all.purchases=strawberry+blueberry+pina.colada+plain+mixed.berry)
Create a scatterplot of price vs time. This will be an example of a time series plot.
ggplot(aes(x=all.purchases),
data = yo )+
geom_histogram(binwidth=1)
ggplot(aes(x=time,y=price),
data=yo)+
geom_point(alpha=1/20)
Note: x %in% y returns a logical (boolean) vector the same length as x that says whether each entry in x appears in y. That is, for each entry in x, it checks to see whether it is in y.
This allows us to subset the data so we get all the purchases occasions for the households in the sample. Then, we create scatterplots of price vs. time and facet by the sample id.
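A short sketch of how %in% behaves, first on plain vectors and then inside subset(). The vectors x and y and the purchases data frame are assumptions for illustration, not part of the yogurt data.

```r
x <- c("a", "b", "c", "d")
y <- c("b", "d", "z")

# One TRUE/FALSE per entry of x: is that entry found anywhere in y?
in_y <- x %in% y
in_y

# Hypothetical purchases from three households
purchases <- data.frame(
  id = c("h1", "h2", "h1", "h3"),
  price = c(59, 65, 49, 65)
)

# Keep every purchase occasion for the sampled households
sampled <- subset(purchases, id %in% c("h1", "h3"))
sampled
```

The result of %in% always has the same length as its left-hand side, which is why it can be used directly as a row filter.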
Use the pch or shape parameter to specify the symbol when plotting points. Scroll down to ‘Plotting Points’ on QuickR’s Graphical Parameters.
#set the seed for reproducible results
set.seed(10000)
sample.ids <- sample(levels(yo$id), 16)
ggplot(aes(x=time, y= price),
data= subset(yo, id %in% sample.ids))+
facet_wrap(~id)+
geom_line()+
geom_point(aes(size=all.purchases), pch=1)
# ggsave() is a separate call, not a layer; by default it saves the last plot
ggsave('Seed@10000.jpg')
## Saving 7 x 5 in image
Looking back at the Facebook plots: that data is cross-sectional, with one observation per user, so we cannot track how the friendships a user initiates change over time. We can only compare groups, shown as different colors in the graph. It is not a time series like the yogurt data, where we can follow each household's purchases over time. It would be great to have a time series of friendships initiated per day.
Dean also pointed out that we use EDA to explore relationships between variables, using an additional variable to check the consistency of the relationship between two variables we observe. We may also want to predict one variable from the others, or reduce the dimensionality (for example with PCA) to get a better visualization. Above all, let the data speak for itself: plot multiple graphs and visualizations to understand the data better.
A scatterplot matrix may not work well for this particular data set, especially for the categorical variables.
Here’s the scatterplot matrix as a pdf.
You’ll need to run the code install.packages('GGally') to install the package for creating this particular scatterplot matrix.
If the plot takes a long time to render or if you want to see some of the scatterplot matrix, then only examine a smaller number of variables. You can use the following code or select fewer variables. We recommend including gender (the 6th variable)!
pf_subset <- pf[ , c(2:7)]
library(GGally)
set.seed(1836)
pf_subset <- pf[,c(2:15)]
ggpairs(pf_subset[sample.int(nrow(pf_subset),1000),-1])
Great work on finding or computing the correlation coefficients.
Scatterplots are below the diagonal, and categorical variables, like gender, create faceted histograms. ggpairs also reports the correlation coefficients for the pairs of variables we want to observe. ggpairs does not transform the variables (for example, to a log scale), so it is not a substitute for a detailed analysis, but it is a good starting point for deciding what to plot.
nci <- read.table("nci.tsv")
names(nci)
## [1] "V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V9" "V10" "V11"
## [12] "V12" "V13" "V14" "V15" "V16" "V17" "V18" "V19" "V20" "V21" "V22"
## [23] "V23" "V24" "V25" "V26" "V27" "V28" "V29" "V30" "V31" "V32" "V33"
## [34] "V34" "V35" "V36" "V37" "V38" "V39" "V40" "V41" "V42" "V43" "V44"
## [45] "V45" "V46" "V47" "V48" "V49" "V50" "V51" "V52" "V53" "V54" "V55"
## [56] "V56" "V57" "V58" "V59" "V60" "V61" "V62" "V63" "V64"
colnames(nci) <- c(1:64) # rename the columns to plain numbers for convenience
head(nci)
## 1 2 3 4 5 6 7 8 9
## 1 0.300 0.679961 0.940 2.800000e-01 0.485 0.310 -0.830 -0.190 0.460
## 2 1.180 1.289961 -0.040 -3.100000e-01 -0.465 -0.030 0.000 -0.870 0.000
## 3 0.550 0.169961 -0.170 6.800000e-01 0.395 -0.100 0.130 -0.450 1.150
## 4 1.140 0.379961 -0.040 -8.100000e-01 0.905 -0.460 -1.630 0.080 -1.400
## 5 -0.265 0.464961 -0.605 6.250000e-01 0.200 -0.205 0.075 0.005 -0.005
## 6 -0.070 0.579961 0.000 -1.387779e-17 -0.005 -0.540 -0.360 0.350 -0.700
## 10 11 12 13 14 15 16 17 18 19
## 1 0.760 0.270 -0.450 -0.030 0.710 -0.360 -0.210 -0.500 -1.060 0.150
## 2 1.490 0.630 -0.060 -1.120 0.000 -1.420 -1.950 -0.520 -2.190 -0.450
## 3 0.280 -0.360 0.150 -0.050 0.160 -0.030 -0.700 -0.660 -0.130 -0.320
## 4 0.100 -1.040 -0.610 0.000 -0.770 -2.280 -1.650 -2.610 0.000 -1.610
## 5 -0.525 0.015 -0.395 -0.285 0.045 0.135 -0.075 0.225 -0.485 -0.095
## 6 0.360 -0.040 0.150 -0.250 -0.160 -0.320 0.060 -0.050 -0.430 -0.080
## 20 21 22 23 24 25 26 27 28 29
## 1 -0.290 -0.200 0.430 -0.490 -0.530 -0.010 0.640 -0.480 0.140 0.640
## 2 0.000 0.740 0.500 0.330 -0.050 -0.370 0.550 0.970 0.720 0.150
## 3 0.050 0.080 -0.730 0.010 -0.230 -0.160 -0.540 0.300 -0.240 -0.170
## 4 0.730 0.760 0.600 -1.660 0.170 0.930 -1.780 0.470 0.000 0.550
## 5 0.385 -0.105 -0.635 -0.185 0.825 0.395 0.315 0.425 1.715 -0.205
## 6 0.390 -0.080 -0.430 -0.140 0.010 -0.100 0.810 0.020 0.260 0.290
## 30 31 32 33 34 35 36 37 38
## 1 0.070 0.130 0.320 0.515 0.080 0.410 -0.200 -0.36998050 -0.370
## 2 0.290 2.240 0.280 1.045 0.120 0.000 0.000 -1.38998000 0.180
## 3 0.070 0.640 0.360 0.000 0.060 0.210 0.060 -0.05998047 0.000
## 4 1.310 0.680 -1.880 0.000 0.400 0.180 -0.070 0.07001953 -1.320
## 5 0.085 0.135 0.475 0.330 0.105 -0.255 -0.415 -0.07498047 -0.825
## 6 -0.620 0.300 0.110 -0.155 -0.190 -0.110 0.020 0.04001953 -0.130
## 39 40 41 42 43 44 45 46
## 1 -0.430 -0.380 -0.550 -0.32003900 -0.620 -4.900000e-01 0.07001953 -0.120
## 2 -0.590 -0.550 0.000 0.08996101 0.080 4.200000e-01 -0.82998050 0.000
## 3 -0.500 -1.710 0.100 -0.29003900 0.140 -3.400000e-01 -0.59998050 -0.010
## 4 -1.520 -1.870 -2.390 -1.03003900 0.740 7.000000e-02 -0.90998050 0.130
## 5 -0.785 -0.585 -0.215 0.09496101 0.205 -2.050000e-01 0.24501950 0.555
## 6 0.520 0.120 -0.620 0.05996101 0.000 -1.387779e-17 -0.43998050 -0.550
## 47 48 49 50 51 52 53 54
## 1 -0.290 -0.8100195 0.200 0.37998050 0.3100195 0.030 -0.42998050 0.160
## 2 0.030 0.0000000 -0.230 0.44998050 0.4800195 0.220 -0.38998050 -0.340
## 3 -0.310 0.2199805 0.360 0.65998050 0.9600195 0.150 -0.17998050 -0.020
## 4 1.500 0.7399805 0.180 0.76998050 0.9600195 -1.240 0.86001950 -1.730
## 5 0.005 0.1149805 -0.315 0.05498047 -0.2149805 -0.305 0.78501950 -0.625
## 6 -0.540 0.1199805 0.410 0.54998050 0.3700195 0.050 0.04001953 -0.140
## 55 56 57 58 59 60 61 62
## 1 0.010 -0.620 -0.380 0.04998047 0.650 -0.030 -0.270 0.210
## 2 -1.280 -0.130 0.000 -0.72001950 0.640 -0.480 0.630 -0.620
## 3 -0.770 0.200 -0.060 0.41998050 0.150 0.070 -0.100 -0.150
## 4 0.940 -1.410 0.800 0.92998050 -1.970 -0.700 1.100 -1.330
## 5 -0.015 1.585 -0.115 -0.09501953 -0.065 -0.195 1.045 0.045
## 6 0.270 1.160 0.180 0.19998050 0.130 0.410 0.080 -0.400
## 63 64
## 1 -5.000000e-02 0.350
## 2 1.400000e-01 -0.270
## 3 -9.000000e-02 0.020
## 4 -1.260000e+00 -1.230
## 5 4.500000e-02 -0.715
## 6 -2.710505e-20 -0.340
#Melt the data to long format
library(reshape2)
# Keep only the first 200 genes (rows), with all 64 columns (cases)
nci.long.samp <- melt(as.matrix(nci[1:200,]))
str(nci.long.samp)
## 'data.frame': 12800 obs. of 3 variables:
## $ Var1 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Var2 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ value: num 0.3 1.18 0.55 1.14 -0.265 ...
names(nci.long.samp) <- c("gene", "case", "value")
head(nci.long.samp)
## gene case value
## 1 1 1 0.300
## 2 2 1 1.180
## 3 3 1 0.550
## 4 4 1 1.140
## 5 5 1 -0.265
## 6 6 1 -0.070
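To see what melt() does to a matrix, here is a sketch on a tiny 3 x 2 matrix. The matrix m is an assumption for illustration; its values are rounded from the first rows of the first two nci columns shown above.

```r
library(reshape2)

# Hypothetical 3 genes (rows) x 2 cases (columns)
m <- matrix(c(0.3, 1.18, 0.55,
              0.68, 1.29, 0.17), nrow = 3)

# One row per matrix cell: row index, column index, cell value
long <- melt(m)
names(long) <- c("gene", "case", "value")
long
```

A 3 x 2 matrix becomes 6 rows, in the same way that the 200 x 64 sample above becomes 12800 rows. This one-row-per-cell layout is what geom_tile needs.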
# Underexpressed genes will be shown in blue and overexpressed genes in red
library(ggplot2)
# Draw the data as tiles, with the fill color scaled from blue to red
ggplot(aes(y = gene, x = case, fill = value),
data = nci.long.samp) +
geom_tile() +
scale_fill_gradientn(colours = colorRampPalette(c("blue", "red"))(100))
This heat map shows only 200 of the roughly 6,000 genes in the data. Plotting all of them would just increase the complexity of the visualization. That is why it is important to sample the data first, and then work through different visualizations and the relationships among the variables.