us-candidates-contributors-area
Introduction:
Research question: Is there a relationship between the 2012 U.S. presidential candidates and the neighborhoods of their financial contributors?
Candidates in a presidential election need donations from contributors to run a campaign. You have seen the voting results for New York and its cities, but state and city are too coarse to draw a conclusion from. This research dives down to the neighborhood level and analyzes whether there is an association between contributors' neighborhoods and the candidates they support. If there is, it could give useful guidance to future U.S. presidential candidates.
Data:
The data is collected from the Federal Election Commission, http://www.fec.gov/ . This is an official U.S. government site, and it publishes detailed records about the 2012 U.S. presidential election. The data contains contributors' details such as id, name, occupation, city, status, employer, contribution amount, etc. This data can be downloaded at this link. It comes from a form that every contributor must fill in to donate to a candidate's campaign.
One contributor can submit more than once. Since I only pay attention to the contributors themselves, I drop duplicate contributors, which leaves 378562 unique contributors as cases. The variables I focus on are the contributor's zip code (categorical, many levels) and the candidate name (categorical, several levels).
This is an observational study. The data is collected from a mandatory form filled in by the contributors themselves rather than generated by a computer, which is why the same city can appear under two different spellings (“North Hills,” vs “North Hills”). There are 2207 distinct city values in the dataset, as opposed to the 62 cities New York actually has.
The results can be generalized to all New York financial contributions in 2012, but only to New Yorkers who contributed to the presidential election. They can't be generalized to people who did not contribute, perhaps because they lacked the money or for other reasons. Such factors are potential confounders that prevent generalizing to the whole New York population. The data covers only 2012 New York financial contributions. The 2016 presidential election will have different candidates, so the results could differ greatly. We can't infer causality, because this is an observational study.
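The spelling problem with city names can be reduced with a little normalization before counting distinct values. The snippet below is a minimal sketch and not part of the original analysis: `normalize_city` is a hypothetical helper, and apart from the "North Hills" pair the sample values are made up.

```python
import string

def normalize_city(name):
    """Hypothetical helper: trim whitespace and punctuation, collapse spaces, uppercase."""
    cleaned = name.strip().strip(string.punctuation).strip()
    return " ".join(cleaned.split()).upper()

# "North Hills," and "North Hills" come from the text; the rest are invented.
raw_cities = ["North Hills,", "North Hills", " north  hills ", "Albany"]
unique_cities = sorted({normalize_city(c) for c in raw_cities})
print(unique_cities)
```

Even with a step like this, free-text city fields tend to keep some misspellings, which is one reason the analysis relies on zip codes instead.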
Exploratory data analysis:
The libraries I will be using are:
library(ggplot2)
library(ggmap)
library(zipcode)
library(dplyr)
data(zipcode)
Here I load the data into dataframe, and filter for just two candidates, Obama and Romney.
df = read.csv("fc2012ny.csv")
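#Row wise, the intent is to drop duplicates and subset for the two main candidates, Obama and Romney.
#Note: duplicated('contbr_zip') tests the literal string, not the column, so it is always FALSE and no rows are dropped here.
#Column wise, select only the two variables required, 3=cand_nm, 7=contbr_zip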
rb_zip = subset(df[!duplicated('contbr_zip'),c(3,7)],
cand_nm == 'Romney, Mitt' |
cand_nm == 'Obama, Barack')
#Redefined the factor levels to only 2
rb_zip = droplevels(rb_zip)
This plot shows how many contributors Obama and Romney each have.
#Plotting by ggplot
ggplot(rb_zip,aes(x=cand_nm)) + geom_bar() +
xlab('Candidate names')+
ylab('Number of contributors')+
ggtitle('Contributors for candidates presidential election 2012')
#Save the last plot; ggsave() is a separate call, not a layer to add with +
ggsave('plot.jpg',limitsize = T)
Even in a simple bar chart, we can see that Obama has more than 8 times as many contributors as Romney. With this many contributors, Obama has more freedom to run his campaign than Romney. Below is the table of contributor counts for Obama and Romney,
table(rb_zip$cand_nm)
##
## Obama, Barack Romney, Mitt
## 338450 40112
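To put a number on the gap, the ratio and share can be computed directly from the two counts in the table. This is just a worked arithmetic sketch in Python; the variable names are chosen for illustration.

```python
# Counts copied from the table above.
obama = 338450
romney = 40112

ratio = obama / romney                      # how many Obama rows per Romney row
obama_share = obama / (obama + romney)      # Obama's fraction of the two-candidate total

print(f"Obama/Romney ratio: {ratio:.2f}")
print(f"Obama share: {obama_share:.1%}")
```

The ratio comes out a little above 8, and the two counts add up to the 378562 cases mentioned in the Data section.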
To help me with the analysis, I use the cool zipcode package by Jeffrey Breen. I extract the latitude, longitude and city from it (I did not use the city in the df dataset, because it needs some wrangling). Since some people fill in only a 5-digit zip code, I convert all of them into 5-digit zip codes. Then I use color to differentiate the two candidates. Below I use another awesome plotting package, ggmap by David Kahle and Hadley Wickham.
#Clean and join the zipcode data, by Jeffrey Breen, author of R zipcode package.
rb_zip$contbr_zip = clean.zipcodes(rb_zip$contbr_zip)
zip_map= merge(rb_zip, zipcode, by.x = 'contbr_zip', by.y = 'zip')
#Draw original map of new york
map = get_map(location="ny",zoom=6,source="stamen")
#Plot the point
ggmap(map)+
geom_point(data=zip_map,
aes(x=longitude,y=latitude,colour=cand_nm))
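The zip code handling boils down to keeping the first five digits of each code so that 5-digit and 9-digit entries can be matched against the same lookup table. A minimal Python sketch of that idea follows. `to_zip5` is a hypothetical helper written for illustration; it is not how `clean.zipcodes` is implemented, and the sample inputs are assumptions.

```python
import re

def to_zip5(raw):
    """Hypothetical helper: return the 5-digit zip prefix, or None if unusable."""
    digits = re.sub(r"\D", "", str(raw))   # drop hyphens, spaces and other non-digits
    if len(digits) in (5, 9):              # plain 5-digit zip or ZIP+4
        return digits[:5]
    return None                            # anything else is treated as missing

samples = ["101281314", "10128-1314", "10128", "1012", ""]
print([to_zip5(s) for s in samples])
```

Entries that are neither 5 nor 9 digits long are returned as `None` here, so they would simply fail to match in a later join.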
Here we see that most of the contributors favor Obama over Romney. The plot suffers from overplotting, though, so I'm going to limit it to the area where the majority of contributors are. Based on the plot, I can get a pretty good estimate if I limit latitude to between 40 and 45, and longitude to between -80 and -72.
zip_map.filtered = subset(zip_map,
!is.na(longitude) &
!is.na(latitude) &
latitude > 40 & latitude < 45 &
longitude > -80 & longitude < -72)
Since there are so many levels in the zip code variable, I group the points by coordinates rounded to the nearest integer. Then, for each group, I take the average latitude and longitude and draw a bubble chart on the map. The larger the bubble, the more contributors in the area. Diverging colors describe the difference between Obama and Romney contributors: more blue means more contributors to Obama, and brown means the two groups are roughly equal.
zip_map.by_coor = group_by(zip_map.filtered,round(latitude),round(longitude))%>%
summarise(latmean = mean(latitude),
lonmean = mean(longitude),
OfromR = table(cand_nm)["Obama, Barack"] - table(cand_nm)["Romney, Mitt"],
count = n()) %>%
ungroup()
#Create the ggmap,
# map =background of New York, stamen
# x = longitude
# y = latitude
# color = difference of Obama to Romney
# alpha = darker per 20 count
# size = based on the number of contributors in the area
# Color palette from Red to Blue
ggmap(map,fullpage = T)+
geom_point(data=zip_map.by_coor,
aes(x=lonmean,y=latmean,
colour=OfromR,
alpha=1/20,
size =count))+scale_size_area(max_size=50,breaks=c(1000,5000,10000,50000),guide=F)+
scale_colour_distiller(palette = 'RdBu',
guide=guide_legend(title="difference contributors"))+
scale_alpha(guide=F)+
ggtitle('Map of New York contributors in 2012, for Obama and Romney')
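For readers who want to see the grouping step in isolation, here is a small Python sketch of the same idea: bin points by rounded latitude and longitude, then compute the mean position, the Obama-minus-Romney difference and the count per bin. The coordinates below are made up, and the sketch uses only the standard library rather than dplyr.

```python
from collections import defaultdict

# Toy records: (latitude, longitude, candidate). All values are made up.
records = [
    (40.71, -74.00, "Obama, Barack"),
    (40.78, -73.96, "Obama, Barack"),
    (40.65, -73.95, "Romney, Mitt"),
    (42.65, -73.75, "Obama, Barack"),
    (42.89, -78.88, "Romney, Mitt"),
    (43.05, -76.15, "Romney, Mitt"),
]

# Group rows by their coordinates rounded to the nearest integer.
groups = defaultdict(list)
for lat, lon, cand in records:
    groups[(round(lat), round(lon))].append((lat, lon, cand))

# Summarise each group with the same fields as the R code (latmean, lonmean, OfromR, count).
summary = {}
for cell, rows in groups.items():
    n_obama = sum(1 for _, _, c in rows if c == "Obama, Barack")
    n_romney = sum(1 for _, _, c in rows if c == "Romney, Mitt")
    summary[cell] = {
        "latmean": sum(r[0] for r in rows) / len(rows),
        "lonmean": sum(r[1] for r in rows) / len(rows),
        "OfromR": n_obama - n_romney,
        "count": len(rows),
    }

for cell, stats in sorted(summary.items()):
    print(cell, stats)
```

Rounding to whole degrees gives very coarse cells, which is why this view is useful for an overview but cannot show individual neighborhoods.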
We can see that the majority of contributors live in New York City, and the rest are scattered across the state. Contributors in New York City side with Obama more than in any other area. This could simply be because this area has the highest population. Areas close to each other (as represented by longitude and latitude) have roughly equal differences. But once again, this could be because the differences are too small to notice at this scale. We can't see neighborhoods at this level, but the map serves as a good estimate of how the difference in contributors varies across areas.
Inference:
In this section I use hypothesis testing to see whether there is a relationship between contributors' neighborhoods and the candidates they contribute to. I set my hypotheses as follows:
- H0 : neighborhood and candidates are independent.
- HA : neighborhood and candidates are dependent.
Python is used to wrangle the data. If you know an R equivalent of the following code, feel free to comment below. Basically, I first filter the data down to the two major candidates, Barack Obama and Mitt Romney. Next, I filter it to include only 9-digit zip codes instead of 5-digit zip codes. This ensures that the area each zip code covers is narrower.
import pandas as pd
# Read the csv file
df = pd.read_csv('fc2012ny.csv')
# Turn the 'contbr_zip' column into strings (dropping any trailing '.0')
df.contbr_zip = df.contbr_zip.map(lambda x : str(x).split('.')[0])
# Only keep contributions to Barack Obama and Mitt Romney.
# Keep only zip codes of length 9, because 5-digit zip codes cover too wide an area to observe.
# Keep only the candidate name and the contributor's zip code columns.
df2 = df[(df.cand_nm.isin(['Obama, Barack', 'Romney, Mitt'])) & (df.contbr_zip.map(len) == 9)][['cand_nm','contbr_zip']]
# Create the contingency table between both columns
dfc = pd.crosstab(df2.contbr_zip,df2.cand_nm)
# Keep only zip codes where every cell count is above 5, so the table is not dominated by tiny counts.
# The zip codes that pass this criterion are fed back into the dataframe as a filter.
df3 = df2[df2.contbr_zip.isin(dfc[(dfc > 5).all(axis=1)].index)]
df3.to_csv('fc2012ny-cleaned.csv',index=False)
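To make the filtering rule concrete, the sketch below applies the same "every cell above 5" criterion to a tiny made-up dataset using only the standard library. The zip codes and counts are invented for illustration, and `MIN_COUNT` is a name introduced here, not one from the original code.

```python
from collections import Counter

CANDIDATES = ("Obama, Barack", "Romney, Mitt")
MIN_COUNT = 5  # keep zip codes where every cell is strictly above this

# Toy (zip, candidate) rows; the zip codes and counts are made up.
rows = (
    [("100000001", "Obama, Barack")] * 8 + [("100000001", "Romney, Mitt")] * 7
    + [("100000002", "Obama, Barack")] * 20 + [("100000002", "Romney, Mitt")] * 5
    + [("100000003", "Obama, Barack")] * 6
)

cells = Counter(rows)            # contingency table as {(zip, candidate): count}
zips = {z for z, _ in rows}

# A zip code survives only if both candidates have more than MIN_COUNT rows there.
kept = sorted(z for z in zips if all(cells[(z, c)] > MIN_COUNT for c in CANDIDATES))
filtered = [r for r in rows if r[0] in kept]

print(kept, len(filtered))
```

The second zip code is dropped because a count of exactly 5 does not pass a strict "above 5" test, and the third is dropped because it has no Romney rows at all.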
The data is loaded into R, and the table below shows the contributor counts for the first 10 zip codes, by candidate.
df_cleaned = read.csv('fc2012ny-cleaned.csv')
table(df_cleaned$cand_nm,df_cleaned$contbr_zip)[,1:10]
##
## 100017334 100035944 100042400 100051108 100072710
## Obama, Barack 8 47 7 6 24
## Romney, Mitt 7 13 21 7 9
##
## 100116333 100141505 100162759 100163892 100167303
## Obama, Barack 6 9 114 16 27
## Romney, Mitt 12 8 7 6 6
Earlier, the data was filtered so that each cell in the table has a count above 5. This is a sample-size condition for the chi-square test; it does not by itself make zip codes and candidates independent. The data has 5449 contributors, which is less than 10% of all 378562 New York contributors, so we can be reasonably assured that the dataset is big enough and, more importantly, that the contributors are independent of one another. The test used is the chi-square test of independence. Again, I'm using another R function, provided by the Coursera statistics course of Dr. Mine Çetinkaya-Rundel.
source("http://bit.ly/dasi_inference")
inference(as.factor(df_cleaned$contbr_zip),
df_cleaned$cand_nm,
est="proportion",
type="ht",
method="simulation",
success="Obama, Barack",
alternative="greater",
eda_plot=F,
nsim=10000)
## Response variable: categorical, Explanatory variable: categorical
## Chi-square test of independence
##
## Summary statistics:
## x
## y Obama, Barack Romney, Mitt Sum
## 100017334 8 7 15
## 100035944 47 13 60
## 100042400 7 21 28
## 100051108 6 7 13
## 100072710 24 9 33
## 100116333 6 12 18
## 100141505 9 8 17
## 100162759 114 7 121
## 100163892 16 6 22
## 100167303 27 6 33
## 100168538 26 6 32
## 100171807 33 7 40
## 100173540 9 7 16
## 100173904 6 14 20
## 100174400 16 21 37
## 100174800 8 6 14
## 100187001 33 7 40
## 100191566 15 14 29
## 100193386 6 7 13
## 100195401 13 14 27
## 100196028 8 17 25
## 100212625 27 10 37
## 100212656 14 7 21
## 100212757 10 8 18
## 100212759 10 16 26
## 100213138 6 6 12
## 100213241 13 8 21
## 100213737 9 8 17
## 100214153 44 18 62
## 100214156 13 6 19
## 100214176 9 8 17
## 100214255 6 15 21
## 100214268 25 9 34
## 100214357 10 8 18
## 100214370 16 10 26
## 100214568 6 8 14
## 100214768 8 7 15
## 100214769 6 6 12
## 100214855 15 7 22
## 100214956 15 9 24
## 100214985 14 17 31
## 100215000 8 6 14
## 100215159 68 12 80
## 100215163 20 9 29
## 100215704 72 10 82
## 100217098 34 15 49
## 100221185 6 7 13
## 100222300 48 14 62
## 100222511 30 7 37
## 100223071 7 7 14
## 100224614 11 7 18
## 100224810 7 6 13
## 100226022 6 6 12
## 100226404 22 10 32
## 100226445 25 35 60
## 100232624 15 6 21
## 100233402 11 6 17
## 100234198 53 9 62
## 100234250 88 23 111
## 100236200 36 7 43
## 100236211 12 8 20
## 100237400 9 7 16
## 100237708 19 8 27
## 100238106 7 6 13
## 100238221 8 7 15
## 100242605 142 13 155
## 100243020 32 9 41
## 100243039 7 9 16
## 100243512 61 19 80
## 100244926 8 10 18
## 100245306 25 6 31
## 100246020 63 10 73
## 100246029 21 6 27
## 100253506 7 7 14
## 100257629 25 10 35
## 100274707 6 10 16
## 100280112 15 6 21
## 100280132 7 6 13
## 100280135 14 9 23
## 100280212 6 8 14
## 100280553 21 12 33
## 100280809 17 9 26
## 100280902 11 6 17
## 100280934 15 8 23
## 100281031 24 12 36
## 100281032 12 10 22
## 100281057 16 15 31
## 100284318 8 7 15
## 100287533 29 27 56
## 100287534 32 20 52
## 100287535 58 6 64
## 100287552 17 16 33
## 100287553 43 6 49
## 100287907 16 8 24
## 100296527 51 19 70
## 100296931 25 29 54
## 100366518 26 7 33
## 100655718 48 7 55
## 100655955 6 6 12
## 100655985 10 10 20
## 100657216 7 14 21
## 100657313 6 9 15
## 100657380 7 6 13
## 100658014 12 7 19
## 100690901 13 26 39
## 100750325 7 14 21
## 100750480 19 9 28
## 100750590 7 7 14
## 100751102 7 11 18
## 100759202 23 6 29
## 101060001 6 8 14
## 101070001 52 11 63
## 101120015 8 11 19
## 101280104 6 12 18
## 101280122 22 14 36
## 101280144 10 16 26
## 101280509 11 11 22
## 101280615 7 15 22
## 101280648 19 15 34
## 101280671 18 7 25
## 101280705 10 9 19
## 101280709 11 7 18
## 101280724 7 19 26
## 101280807 19 6 25
## 101281000 52 16 68
## 101281003 10 9 19
## 101281132 76 32 108
## 101281152 7 50 57
## 101281154 39 6 45
## 101281176 22 8 30
## 101281188 17 6 23
## 101281200 12 9 21
## 101281205 11 6 17
## 101281211 22 17 39
## 101281212 21 7 28
## 101281213 17 9 26
## 101281225 43 17 60
## 101281234 11 8 19
## 101281235 15 20 35
## 101281242 18 9 27
## 101281243 26 6 32
## 101281255 15 22 37
## 101281308 30 12 42
## 101281314 20 6 26
## 101281733 10 21 31
## 101540004 8 12 20
## 101620025 6 12 18
## 101650006 16 6 22
## 103101535 8 7 15
## 105102405 15 32 47
## 105331512 11 7 18
## 105381227 9 10 19
## 105382143 22 10 32
## 105914908 12 24 36
## 108051166 13 11 24
## 109702200 6 13 19
## 110301100 21 13 34
## 110301437 7 7 14
## 111033834 18 6 24
## 112014507 45 9 54
## 112015051 6 8 14
## 112151405 98 9 107
## 112153008 37 8 45
## 112153702 10 6 16
## 112154502 27 7 34
## 112155917 25 7 32
## 113601183 10 11 21
## 115304205 7 7 14
## 115702614 8 6 14
## 117332259 16 10 26
## 117681525 18 8 26
## 141272525 6 6 12
## 146183513 8 11 19
## 146203400 8 9 17
## Sum 3568 1881 5449
##
## H_0: Response and explanatory variable are independent.
## H_A: Response and explanatory variable are dependent.
##
## Pearson's Chi-squared test with simulated p-value (based on 10000
## replicates)
##
## data: y_table
## X-squared = 825.1471, df = NA, p-value = 9.999e-05
The p-value from the simulation test is practically 0. Thus, at the 5% significance level, I reject the null hypothesis and conclude that the data provide convincing evidence that, in the 2012 presidential election, financial contributors' neighborhoods and the candidates they contribute to are dependent.
The method used is the chi-square test of independence, since it is the standard method for testing the relationship between two categorical variables when at least one has more than two levels. The simulation-based version of the test is chosen because the sample size in some cells is too small (in particular, for Romney).
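For intuition about what a simulated p-value means, here is a minimal Python sketch of a simulation-based chi-square test. It is not the code behind `inference()`: it uses label permutation as a stand-in for the simulation, works on just three zip codes copied from the summary table, and the function names are invented for illustration. The p-value is computed as (1 + number of simulated statistics at least as large as the observed one) / (number of simulations + 1).

```python
import random
from collections import Counter

def chi_square(zips, cands):
    """Pearson chi-square statistic for two parallel lists of labels."""
    n = len(zips)
    row_tot = Counter(zips)
    col_tot = Counter(cands)
    observed = Counter(zip(zips, cands))
    stat = 0.0
    for r, rt in row_tot.items():
        for c, ct in col_tot.items():
            expected = rt * ct / n          # expected count under independence
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat

def simulated_p_value(zips, cands, n_sim=2000, seed=0):
    """Shuffle candidate labels n_sim times and compare against the observed statistic."""
    rng = random.Random(seed)
    observed_stat = chi_square(zips, cands)
    shuffled = list(cands)
    exceed = 0
    for _ in range(n_sim):
        rng.shuffle(shuffled)               # breaks any zip/candidate association
        if chi_square(zips, shuffled) >= observed_stat:
            exceed += 1
    return observed_stat, (1 + exceed) / (n_sim + 1)

# Three rows copied from the summary table: zip -> (Obama count, Romney count).
table = {"100162759": (114, 7), "101281152": (7, 50), "100017334": (8, 7)}
zips, cands = [], []
for z, (n_obama, n_romney) in table.items():
    zips += [z] * (n_obama + n_romney)
    cands += ["Obama, Barack"] * n_obama + ["Romney, Mitt"] * n_romney

stat, p = simulated_p_value(zips, cands)
print(stat, p)

# With 10000 replicates and no simulated statistic reaching the observed one,
# the smallest p-value this formula can return is 1 / 10001.
print(1 / (10000 + 1))
```

Under this formula, the reported p-value of 9.999e-05 is consistent with none of the 10000 replicates reaching the observed statistic, so "practically 0" here means "as small as the simulation can report".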
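The small-sample concern can be checked against the usual rule of thumb that every expected count under independence should be at least 5. The sketch below computes expected counts from the totals in the "Sum" row of the summary table; the function name is chosen for illustration.

```python
# Totals copied from the "Sum" row of the summary table above.
obama_total = 3568
romney_total = 1881
grand_total = 5449

def expected_counts(row_total):
    """Expected (Obama, Romney) counts for a zip code under independence."""
    return (row_total * obama_total / grand_total,
            row_total * romney_total / grand_total)

# 12 is the smallest zip code total in the table (6 + 6); 15 is the first row (8 + 7).
for total in (12, 15):
    exp_obama, exp_romney = expected_counts(total)
    print(total, round(exp_obama, 2), round(exp_romney, 2))
```

For a zip code with 12 contributors the expected Romney count falls slightly below 5, even though every observed cell is above 5, which supports using the simulation-based test rather than the theoretical chi-square distribution.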
Conclusion:
So where does this leave us? We now know that even at the smallest area level, the neighborhood, there is an association with how people contributed financially to candidates in the 2012 U.S. election. So it is not just any strategic area that matters: areas with many potential financial contributors are an important factor. Candidates could win both votes and financial contributions there, which would drive their campaigns even further.
There are still potential lurking variables that we have to pay attention to. We know that family, workplace, and a candidate's positions can all influence which candidate a contributor chooses. Neighborhood is likely not as significant a factor as these, but it can't simply be ignored.
Things could be done more accurately. For example, we could group donations by zip code and see which neighborhoods have large donations versus small ones. This could benefit candidates, whoever they are, because they could run a campaign in those areas. The bubble chart on the map could also incorporate donations: bubble size would no longer vary by the number of contributors but by the donation amount. That way we would know how large the donations are in a particular area, and color would show the difference in donations between the two main candidates.
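As a rough sketch of that follow-up, the snippet below aggregates donation amounts by zip code and candidate, then derives a bubble size (total donations) and a color value (Obama minus Romney donations). The rows are made up, and the amount field is an assumption: the original code never uses it, though it presumably corresponds to the contribution amount mentioned in the Data section.

```python
from collections import defaultdict

# Toy rows: (zip5, candidate, amount in dollars). All values are made up.
contributions = [
    ("10128", "Obama, Barack", 250.0),
    ("10128", "Romney, Mitt", 1000.0),
    ("10128", "Obama, Barack", 50.0),
    ("11215", "Obama, Barack", 100.0),
    ("11215", "Obama, Barack", 35.0),
]

# Sum donation amounts per zip code and candidate.
totals = defaultdict(lambda: defaultdict(float))
for zip5, cand, amount in contributions:
    totals[zip5][cand] += amount

# Bubble size = total donated in the area; colour = Obama total minus Romney total.
bubbles = {
    z: {"size": sum(by_cand.values()),
        "colour": by_cand["Obama, Barack"] - by_cand["Romney, Mitt"]}
    for z, by_cand in totals.items()
}

for z, b in sorted(bubbles.items()):
    print(z, b)
```

A zip code with many small Obama donations and one large Romney donation can flip sign under this measure, which is exactly the kind of difference the contributor counts alone would hide.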
Appendix
head(rb_zip, n=50L)
## cand_nm contbr_zip
## 1 Romney, Mitt 123032602
## 2 Romney, Mitt 148509700
## 3 Romney, Mitt 114271345
## 4 Romney, Mitt 130789600
## 5 Romney, Mitt 148303636
## 6 Romney, Mitt 110401804
## 7 Romney, Mitt 110401804
## 8 Romney, Mitt 110401804
## 9 Romney, Mitt 110401804
## 10 Romney, Mitt 110401804
## 11 Romney, Mitt 112323134
## 12 Romney, Mitt 105602708
## 13 Romney, Mitt 117874838
## 14 Romney, Mitt 101281314
## 15 Romney, Mitt 134032612
## 16 Romney, Mitt 100114672
## 17 Romney, Mitt 115452023
## 18 Romney, Mitt 110242113
## 19 Romney, Mitt 100656625
## 20 Romney, Mitt 115762844
## 21 Romney, Mitt 121101025
## 22 Romney, Mitt 142252518
## 23 Romney, Mitt 130363413
## 24 Romney, Mitt 119750688
## 25 Romney, Mitt 100214249
## 26 Romney, Mitt 121152005
## 27 Romney, Mitt 115701214
## 28 Romney, Mitt 115701214
## 29 Romney, Mitt 127715223
## 30 Romney, Mitt 100132103
## 31 Romney, Mitt 100132103
## 32 Romney, Mitt 100132103
## 33 Romney, Mitt 112053910
## 34 Romney, Mitt 137431307
## 35 Romney, Mitt 108017154
## 36 Romney, Mitt 107082609
## 37 Romney, Mitt 105761308
## 38 Romney, Mitt 145801303
## 39 Romney, Mitt 133281220
## 40 Romney, Mitt 109523601
## 41 Romney, Mitt 109523601
## 42 Romney, Mitt 109523601
## 43 Romney, Mitt 109523601
## 44 Romney, Mitt 142021443
## 45 Romney, Mitt 115613125
## 46 Romney, Mitt 130789304
## 47 Romney, Mitt 110241714
## 48 Romney, Mitt 125782226
## 49 Romney, Mitt 115763071
## 50 Romney, Mitt 100195173