We will explore the diamonds dataset and its history, and use EDA to build a quantitative analysis.
- Salomon, a data analyst at Facebook, will use EDA to explore the diamonds data.
- In the end, we will be able to judge whether a given diamond is a good deal or not.
- We will also be able to predict the price of a given diamond.
library(ggplot2)
data(diamonds)
names(diamonds)
##  "carat" "cut" "color" "clarity" "depth" "table" "price"
##  "x" "y" "z"
ggplot(aes(x = carat, y = price), data = diamonds) +
  geom_point() +
  coord_cartesian(xlim = c(0, quantile(diamonds$carat, 0.99)),
                  ylim = c(0, quantile(diamonds$price, 0.99))) +
  stat_smooth(method = "lm")
Price and Carat Relationship
- There is a clear relationship between carat and price.
- Diamonds of the same carat can have different prices, depending on the other variables.
The greater the carat weight, the higher the price; it does not go lower.
- We can see a roughly exponential increase as the price goes higher.
- The dispersion increases as carat and price get higher.
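To make the last two observations concrete, here is a minimal sketch (not part of the original analysis) that fits a straight line of price on carat and compares the residual spread for smaller and larger stones. The 1-carat cutoff and the names `fit`, `small`, `resid_sd`, and `low_pred` are assumptions chosen only for illustration.

```r
library(ggplot2)  # only needed here for the diamonds data set
data(diamonds)

# Straight-line fit of price on carat (illustrative, not the final model)
fit <- lm(price ~ carat, data = diamonds)

# Residual spread below vs. at or above 1 carat (arbitrary cutoff)
small <- diamonds$carat < 1
resid_sd <- c(small = sd(resid(fit)[small]),
              large = sd(resid(fit)[!small]))
print(resid_sd)

# Predicted price for a 0.2 carat stone under the straight-line fit
low_pred <- predict(fit, newdata = data.frame(carat = 0.2))
print(low_pred)
```

If the spread is clearly larger for the bigger stones, or the prediction for a small stone looks implausible, that is a hint that a plain linear model does not describe this relationship well.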
With a linear model, our price predictions may be off (too much bias!)
- We can't just input the diamond data and pop out the price.
- The price of diamonds has a background story behind it.
- Large diamond deposits were found in South Africa.
- Before that, diamonds were found only in India and Brazil. Back then, diamonds were priced purely by their supply.
- Then De Beers, the biggest diamond cartel, was formed and took control of the diamond market. It advertised diamonds in many ways, including heavily in the US.
A diamond is... FOREVER
- Diamonds used to be only for the rich, but the slogan "A diamond is forever", written by Frances Gerety, promoted the idea that an engagement should come with a diamond engagement ring.
The Rise of Diamonds
- The slogan itself is powerful. It created an intense desire for diamonds.
- They could do that because, as said earlier, the company had created a cartel and monopolized the diamonds in South Africa.
- Since then they have given diamonds to movie stars, and celebrities have given each other diamonds of varying prices.
- They even got the British royal family to use diamonds in their crowns over other gems.
- They established the idea that an engagement ring should carry a diamond, and advertised the price of a diamond relative to what a man has achieved in life.
- Engagement symbol on Facebook
- Most movie engagements feature a diamond
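- ggpairs plots each variable against every other variable.
- The plot type depends on the pair: qual-qual, qual-quan, or quan-quan (scatterplots).
- Grouped histograms at the top left for qual-qual pairs, grouped by the x variable.
- Boxplots for qual-quan pairs.
- Correlations at the lower right for quan-quan pairs.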
# install these if necessary
# install.packages('GGally')
# install.packages('scales')
# install.packages('memisc')
# install.packages('lattice')
# install.packages('MASS')
# install.packages('car')
# install.packages('reshape')
# install.packages('plyr')

# load the ggplot graphics package and the others
library(ggplot2)
library(GGally)
library(scales)
library(memisc)
## Loading required package: lattice
## Loading required package: MASS
##
## Attaching package: 'memisc'
##
## The following object is masked from 'package:scales':
##
##     percent
##
## The following objects are masked from 'package:stats':
##
##     contr.sum, contr.treatment, contrasts
##
## The following object is masked from 'package:base':
##
##     as.array
# sample 10,000 diamonds from the data set
set.seed(20022012)
diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ]
# note: `params` comes from older GGally releases; newer ones may require wrap() instead
ggpairs(diamond_samp, params = c(shape = I('.'), outlier.shape = I('.')))
What are some things you notice in the ggpairs output?
- Price and carat are highly correlated, as shown by a correlation coefficient close to 1 (the value the cor.test function reports).
- Synthesizing variables (merging them) may make for useful analysis.
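As a quick check of the first observation, the correlation can be computed directly rather than read off the ggpairs grid. This is a minimal sketch that rebuilds the same 10,000-row sample. The name `ct` is introduced here for illustration, and `sample(nrow(diamonds), ...)` is used as an equivalent of the indexing above.

```r
library(ggplot2)  # only needed here for the diamonds data set
data(diamonds)

# Rebuild the 10,000-row sample with the same seed as above
set.seed(20022012)
diamond_samp <- diamonds[sample(nrow(diamonds), 10000), ]

# Pearson correlation test between carat and price
ct <- cor.test(diamond_samp$carat, diamond_samp$price)
print(ct$estimate)
```

An estimate near 1 indicates a strong linear association. It says nothing about the curvature visible in the scatterplot.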
The Demand of Diamonds
library(gridExtra)  # provides grid.arrange()
## Loading required package: grid
plot1 <- ggplot(aes(x = price), data = diamonds) +
  geom_histogram(fill = 'orange') +
  ggtitle('Price')
plot2 <- ggplot(aes(x = price), data = diamonds) +
  geom_histogram(fill = 'red') +
  scale_x_log10() +
  ggtitle('Price(log10)')
grid.arrange(plot1, plot2)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Connecting Demand and Price Distributions
- Notice that after transforming price to log10, the distribution looks much closer to the familiar normal distribution.
- It has two peaks in the middle, which makes it a bimodal distribution.
- Notice the split in the middle of the plot (also shown later). It suggests how buyers divide into those with less money and those with more money.
- By using a cube-root function that we define ourselves, we can turn the exponential-looking relationship into a linear one.
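One rough way to see what the transformations buy us is to compare a straight-line fit on the raw variables with one on log10(price) and the cube root of carat. This is only a sketch: the names `raw_fit`, `trans_fit`, and `r2` are illustrative, and R-squared values computed on different response scales are not strictly comparable, so treat the numbers as a hint rather than a formal test.

```r
library(ggplot2)  # only needed here for the diamonds data set
data(diamonds)

# Straight-line fit on the raw variables
raw_fit <- lm(price ~ carat, data = diamonds)

# Straight-line fit after log10 on price and cube root on carat
trans_fit <- lm(log10(price) ~ I(carat^(1/3)), data = diamonds)

# Share of variance explained by each fit (on its own scale)
r2 <- c(raw = summary(raw_fit)$r.squared,
        transformed = summary(trans_fit)$r.squared)
print(r2)
```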
Create a new function to transform the carat variable
library(scales)
cuberoot_trans = function() trans_new('cuberoot',
                                      transform = function(x) x^(1/3),
                                      inverse = function(x) x^3)
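Before handing the transformation to a scale, it can be worth checking that the transform and its inverse undo each other. The sketch below repeats the definition so it runs on its own. The names `tr`, `carats`, and `round_trip` are illustrative, and it assumes the installed scales version still provides `trans_new`.

```r
library(scales)

# Same transformation as above, repeated so this block is self-contained
cuberoot_trans <- function() {
  trans_new('cuberoot',
            transform = function(x) x^(1/3),
            inverse   = function(x) x^3)
}

tr <- cuberoot_trans()
carats <- c(0.2, 0.5, 1, 2, 3)

# Applying the transform and then its inverse should return the input
round_trip <- tr$inverse(tr$transform(carats))
print(all.equal(round_trip, carats))
```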
Use the cuberoot_trans function
# ggplot(aes(carat, price), data = diamonds) +
#   geom_point() +
#   scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
#                      breaks = c(0.2, 0.5, 1, 2, 3)) +
#   scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
#                      breaks = c(350, 1000, 5000, 10000, 15000)) +
#   ggtitle('Price (log10) by Cube-Root of Carat')
- As we learned earlier, overplotting can obscure key points that may be somewhere in the plot.
- If we look at the data, sorting the value counts and taking the head shows the most frequent prices and carats.
##
##  605  802  625  828  776  698
##  132  127  126  125  124  121
##
##  0.3 0.31 1.01  0.7 0.32    1
## 2604 2249 2242 1981 1840 1558
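The code that produced these two tables is not shown. A minimal sketch that gives counts of this shape uses `table`, `sort`, and `head`. The names `top_price` and `top_carat` are illustrative.

```r
library(ggplot2)  # only needed here for the diamonds data set
data(diamonds)

# Six most frequent prices and six most frequent carat values
top_price <- head(sort(table(diamonds$price), decreasing = TRUE))
top_carat <- head(sort(table(diamonds$carat), decreasing = TRUE))

print(top_price)
print(top_carat)
```

Many diamonds share exactly the same price or carat value, which is why points stack on top of each other in the scatterplot.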
- Overplotting can be countered simply with jitter or alpha (transparency).
Add a layer to adjust the features of the scatterplot. Set the transparency to one half, the size to three-fourths, and jitter the points.
ggplot(aes(carat, price), data = diamonds) +
  geom_point(position = 'jitter', size = 0.75, alpha = 1/2) +
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) +
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1691 rows containing missing values (geom_point).
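The warning appears because the scale limits turn points outside the given ranges into missing values, which geom_point then drops. As a rough check, the sketch below counts the rows that fall outside the limits used in the plot. The names `outside` and `n_outside` are illustrative, and the count is only expected to be in the neighbourhood of the figure in the warning.

```r
library(ggplot2)  # only needed here for the diamonds data set
data(diamonds)

# Rows outside the x limits (0.2 to 3 carat) or y limits (350 to 15000)
outside <- with(diamonds,
                carat < 0.2 | carat > 3 | price < 350 | price > 15000)
n_outside <- sum(outside)
print(n_outside)
```

If those stones should stay visible, the limits can be widened, or the view can be zoomed with coord_cartesian as in the first plot.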