Introduction:

Research question: Is there a relationship between the 2012 U.S. presidential candidates and the neighborhoods of their financial contributors?

Presidential candidates need donations from contributors to run a campaign. You have seen the voting results for New York and its cities, but state and city are too coarse to draw a conclusion. This research goes down to the neighborhood level and analyzes whether there is a relationship between neighborhood and candidate. If there is indeed one, it will provide a meaningful factor for the next U.S. presidential candidates.

Data:

The data is collected from the Federal Election Commission, http://www.fec.gov/ . This is an official U.S. government site, and it publishes detailed records for the 2012 U.S. presidential election. The data contains contributors’ details such as id, name, occupation, city, status, employer, contribution amount, etc. The data can be downloaded at this link. It comes from a mandatory form filled in by every contributor who wants to donate to a candidate’s campaign.

One contributor can donate more than once. Since I only pay attention to the contributors themselves, I drop duplicate contributors, so the cases are 378562 unique contributors. The variables of interest are the contributors’ zip codes (categorical, many levels) and candidate name (categorical, several levels).
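
As a rough illustration of the de-duplication idea, here is a minimal Python sketch. It assumes a contributor can be identified by the pair (name, zip code); that key, the sample records, and the helper name unique_contributors are assumptions made for illustration and are not taken from the FEC file or from the analysis code below.

```python
def unique_contributors(records):
    """Keep the first record seen for each (name, zip) pair.

    Assumption: name plus zip code identifies one contributor.
    """
    seen = set()
    unique = []
    for rec in records:
        key = (rec["contbr_nm"], rec["contbr_zip"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Hypothetical sample records, not real FEC rows
sample = [
    {"contbr_nm": "DOE, JANE", "contbr_zip": "100212625", "cand_nm": "Obama, Barack"},
    {"contbr_nm": "DOE, JANE", "contbr_zip": "100212625", "cand_nm": "Obama, Barack"},
    {"contbr_nm": "ROE, RICHARD", "contbr_zip": "112151405", "cand_nm": "Romney, Mitt"},
]
deduped = unique_contributors(sample)
print(len(deduped))
```

A different key (for example, a contributor id if one is available) would change which rows count as duplicates, so the choice of key is worth stating alongside the reported number of cases.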

This is an observational study. The data is collected from a mandatory form filled in by the contributors themselves, not generated by a computer, as some cities appear under two different spellings (“North Hills,” vs “North Hills”). There are 2207 distinct city values in the dataset, as opposed to the 62 actual cities in New York.

The results can be generalized to all New York financial contributors in 2012, but only to those New Yorkers who contributed to the presidential election. They can’t be generalized to people who did not contribute, perhaps because they had insufficient money or for other reasons. These are extraneous variables that prevent the results from generalizing to the whole New York population. The data also covers only 2012 New York financial contributions. The 2016 presidential election will have different candidates, so the results may vary greatly. Finally, we can’t infer causality, because this is an observational study.

Exploratory data analysis:

The libraries that I will be using are:

library(ggplot2)
library(ggmap)
library(zipcode)
library(dplyr)
data(zipcode)

Here I load the data into a data frame and filter for just the two main candidates, Obama and Romney.

df = read.csv("fc2012ny.csv")

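#Row wise, subset for the two main candidates, Obama and Romney.
#Note: duplicated('contbr_zip') tests the literal string, not the column, so no rows are dropped here;
#duplicated(df$contbr_zip) would be needed to actually drop repeated zip codes.
#Column wise, select only the two variables required, 3=cand_nm, 7=contbr_zip
rb_zip = subset(df[!duplicated('contbr_zip'),c(3,7)],
                cand_nm == 'Romney, Mitt' |
                cand_nm == 'Obama, Barack')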
#Redefined the factor levels to only 2
rb_zip = droplevels(rb_zip)

This plot shows the number of contributors for Obama and Romney.

#Plotting by ggplot
ggplot(rb_zip,aes(x=cand_nm)) + geom_bar() +
  xlab('Candidate names')+
  ylab('Number of contributors')+
  ggtitle('Contributors for candidates presidential election 2012')
#Save the last plot; ggsave() is called on its own, not added to the plot with +
ggsave('plot.jpg',limitsize = TRUE)

Even from a simple bar chart, we can see that Obama has more than 8 times as many contributors as Romney. With this many contributors, Obama has more freedom to run his campaign than Romney. Below is the frequency table of financial contributors for Obama and Romney:

table(rb_zip$cand_nm)

## 
## Obama, Barack  Romney, Mitt 
##        338450         40112
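
To make the comparison concrete, the two counts in the table can be turned into a ratio and a share. This is a small arithmetic sketch in Python using only the numbers printed above; the variable names are illustrative.

```python
# Counts taken from the frequency table above
obama = 338450
romney = 40112

total = obama + romney
ratio = obama / romney          # how many Obama contributors per Romney contributor
obama_share = obama / total     # fraction of all contributors who gave to Obama

print(f"total = {total}")
print(f"Obama/Romney ratio = {ratio:.2f}")
print(f"Obama share = {obama_share:.1%}")
```

The total matches the 378562 cases reported in the Data section.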

To help me with the analysis, I use the cool zipcode package by Jeffrey Breen. From it I extract the latitude, longitude and city (I did not use the city column in the df dataset, because it needs some wrangling). Since some people filled in only the 5-digit zip code, I convert all zip codes to their 5-digit prefix. Then I use color to differentiate the two candidates. Below I use another awesome plotting package, ggmap by David Kahle and Hadley Wickham.

#Clean and join the zipcode data, by Jeffrey Breen, author of R zipcode package.
rb_zip$contbr_zip = clean.zipcodes(rb_zip$contbr_zip)
zip_map= merge(rb_zip, zipcode, by.x = 'contbr_zip', by.y = 'zip')

#Draw the base map of New York
map = get_map(location="ny",zoom=6,source="stamen")
#Plot the point
ggmap(map)+
  geom_point(data=zip_map,
             aes(x=longitude,y=latitude,colour=cand_nm))
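
The zip code normalization used before the merge (reducing every value to its 5-digit prefix) can be sketched in Python as follows. This is not how clean.zipcodes is implemented; the helper name to_zip5 and its handling of malformed values are assumptions for illustration only.

```python
def to_zip5(raw):
    """Return the 5-digit prefix of a ZIP or ZIP+4 value, or None if unusable.

    Assumption: values arrive as strings or numbers, possibly with a
    trailing '.0' (float parsing) or a dash between ZIP and +4.
    """
    text = str(raw).strip().split(".")[0]           # drop a trailing '.0'
    digits = "".join(ch for ch in text if ch.isdigit())
    if len(digits) == 9:                            # ZIP+4, e.g. 100212625
        return digits[:5]
    if len(digits) == 5:                            # already a 5-digit ZIP
        return digits
    return None                                     # anything else is treated as invalid


for value in ["100212625", "10021-2625", 10021, 100212625.0, "abc"]:
    print(repr(value), "->", to_zip5(value))
```

Returning None for malformed values keeps them out of the later join instead of silently mapping them to a wrong location.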

Here we see that most contributors favor Obama over Romney. The plot suffers from overplotting, though, so I limit it to the area where most contributors are. Based on the plot, I get a pretty good approximation if I limit latitude to between 40 and 45, and longitude to between -80 and -72.

zip_map.filtered = subset(zip_map,
                          !is.na(longitude) & 
                          !is.na(latitude) &
                          latitude > 40 & latitude < 45 &
                          longitude > -80 & longitude < -72)

Since the zip code has so many levels, I group the points by coordinates rounded to the nearest integer. For each group, I take the average latitude and longitude, and I draw a bubble chart on the map. The larger the bubble, the more contributors in the area. Diverging colors describe the difference between Obama and Romney contributors: more blue means more contributors to Obama, while brown means the two candidates have roughly equal numbers.

zip_map.by_coor = group_by(zip_map.filtered,round(latitude),round(longitude))%>%
  summarise(latmean = mean(latitude),
            lonmean = mean(longitude),
            OfromR = table(cand_nm)["Obama, Barack"] - table(cand_nm)["Romney, Mitt"],
            count = n()) %>%
  ungroup()

#Create the ggmap,
# map =background of New York, stamen
# x = longitude
# y = latitude
# color = difference of Obama to Romney
# alpha = darker per 20 count
# size = based on the population in the area
# Color palette from Red to Blue
ggmap(map,fullpage = T)+
  geom_point(data=zip_map.by_coor,
             aes(x=lonmean,y=latmean,
                 colour=OfromR,
                 alpha=1/20,
                 size =count))+scale_size_area(max_size=50,breaks=c(1000,5000,10000,50000),guide=F)+
  scale_colour_distiller(palette = 'RdBu',
                         guide=guide_legend(title="difference contributors"))+
  scale_alpha(guide=F)+
  ggtitle('Map of New York contributors in 2012, for Obama and Romney')
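
The grouping step behind the bubble chart (round the coordinates, average them per group, and take the Obama-minus-Romney difference) can be sketched in plain Python. The function name summarise_by_cell and the four sample points are hypothetical; the output keys simply mirror the column names used in the R code above.

```python
from collections import defaultdict


def summarise_by_cell(points):
    """Group (lat, lon, candidate) points by coordinates rounded to integers."""
    cells = defaultdict(lambda: {"lat": [], "lon": [], "obama": 0, "romney": 0})
    for lat, lon, cand in points:
        cell = cells[(round(lat), round(lon))]
        cell["lat"].append(lat)
        cell["lon"].append(lon)
        if cand == "Obama, Barack":
            cell["obama"] += 1
        elif cand == "Romney, Mitt":
            cell["romney"] += 1

    summary = {}
    for key, c in cells.items():
        n = len(c["lat"])
        summary[key] = {
            "latmean": sum(c["lat"]) / n,          # bubble position (latitude)
            "lonmean": sum(c["lon"]) / n,          # bubble position (longitude)
            "OfromR": c["obama"] - c["romney"],    # bubble color
            "count": n,                            # bubble size
        }
    return summary


# Hypothetical points: three near New York City, one near Buffalo
points = [
    (40.7, -74.0, "Obama, Barack"),
    (40.8, -73.9, "Obama, Barack"),
    (41.2, -74.1, "Romney, Mitt"),
    (42.9, -78.8, "Romney, Mitt"),
]
summary = summarise_by_cell(points)
for cell, stats in sorted(summary.items()):
    print(cell, stats)
```

Because each cell spans roughly one degree of latitude and longitude, a single bubble covers a far larger area than a neighborhood, which is why the map is only a coarse summary.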

We can see that the majority of contributors live in New York City, while the rest are scattered around New York State. Contributors in New York City side with Obama more than in any other area, though this could also be because this area has the largest population. Areas close to each other (in longitude and latitude) show roughly similar differences, but once again this could be because the differences are too small to notice. We can’t distinguish neighborhoods at this level, but the map serves as a good estimate of how the difference in contributors varies across areas.

Inference:

In this section I use hypothesis testing to see whether there is a relationship between contributors’ neighborhoods and the candidates they contribute to. I set my hypotheses as follows:

H0 : neighborhood and candidates are independent.
HA : neighborhood and candidates are dependent.

Python is used to wrangle the data. If you know the R equivalent of the following code, feel free to comment below. Basically, I first filter the data down to the two major candidates, Barack Obama and Mitt Romney. Next, I keep only the 9-digit zip codes and drop the 5-digit ones, so that each zip code covers a narrower area.

import pandas as pd
#Read csv file
df = pd.read_csv('fc2012ny.csv')
#Turn 'contbr_zip' column into string, dropping any trailing '.0' left by float parsing
df.contbr_zip = df.contbr_zip.map(lambda x : str(x).split('.')[0])
#Only observe the campaigns of Barack Obama and Mitt Romney.
#Keep only the zipcodes of length 9, because a 5-digit zipcode covers too wide an area to observe.
#Keep only the candidate name and the zipcode of the financial contributors.
df2 = df[(df.cand_nm.isin(['Obama, Barack', 'Romney, Mitt'])) & (df.contbr_zip.map(len) == 9)][['cand_nm','contbr_zip']]
#Create contingency table between both columns
dfc = pd.crosstab(df2.contbr_zip,df2.cand_nm)
#Keep only the zipcodes where every cell count is above 5, to meet the sample-size condition of the chi-square test.
#Feed the zipcodes that pass the criterion back into the dataframe as a filter.
df3 = df2[df2.contbr_zip.isin(dfc[dfc > 5].dropna().index)]
df3.to_csv('fc2012ny-cleaned.csv',index=False)

The cleaned data is loaded into R, and you can see below the first 10 columns of the table, showing the contributor count for each zip code and each candidate.

df_cleaned = read.csv('fc2012ny-cleaned.csv')
table(df_cleaned$cand_nm,df_cleaned$contbr_zip)[,1:10]

##                
##                 100017334 100035944 100042400 100051108 100072710
##   Obama, Barack         8        47         7         6        24
##   Romney, Mitt          7        13        21         7         9
##                
##                 100116333 100141505 100162759 100163892 100167303
##   Obama, Barack         6         9       114        16        27
##   Romney, Mitt         12         8         7         6         6

Earlier, the data was filtered so that each cell in the table has a count above 5. This is to meet the sample-size condition of the chi-square test, not to make zip codes and candidates independent. The filtered data has 5449 contributors, which is less than 10% of all 378562 New York contributors, so we can reasonably treat the contributors as independent of one another. The test used is the chi-square test of independence. Again, I’m using another R tool, the inference function created by Dr. Mine Çetinkaya-Rundel of the Coursera statistics course.

source("http://bit.ly/dasi_inference")
inference(as.factor(df_cleaned$contbr_zip),
          df_cleaned$cand_nm,
          est="proportion",
          type="ht",
          method="simulation",
          success="Obama, Barack",
          alternative="greater",
          eda_plot=F,
          nsim=10000)

## Response variable: categorical, Explanatory variable: categorical
## Chi-square test of independence
## 
## Summary statistics:
##            x
## y           Obama, Barack Romney, Mitt  Sum
##   100017334             8            7   15
##   100035944            47           13   60
##   100042400             7           21   28
##   100051108             6            7   13
##   100072710            24            9   33
##   100116333             6           12   18
##   100141505             9            8   17
##   100162759           114            7  121
##   100163892            16            6   22
##   100167303            27            6   33
##   100168538            26            6   32
##   100171807            33            7   40
##   100173540             9            7   16
##   100173904             6           14   20
##   100174400            16           21   37
##   100174800             8            6   14
##   100187001            33            7   40
##   100191566            15           14   29
##   100193386             6            7   13
##   100195401            13           14   27
##   100196028             8           17   25
##   100212625            27           10   37
##   100212656            14            7   21
##   100212757            10            8   18
##   100212759            10           16   26
##   100213138             6            6   12
##   100213241            13            8   21
##   100213737             9            8   17
##   100214153            44           18   62
##   100214156            13            6   19
##   100214176             9            8   17
##   100214255             6           15   21
##   100214268            25            9   34
##   100214357            10            8   18
##   100214370            16           10   26
##   100214568             6            8   14
##   100214768             8            7   15
##   100214769             6            6   12
##   100214855            15            7   22
##   100214956            15            9   24
##   100214985            14           17   31
##   100215000             8            6   14
##   100215159            68           12   80
##   100215163            20            9   29
##   100215704            72           10   82
##   100217098            34           15   49
##   100221185             6            7   13
##   100222300            48           14   62
##   100222511            30            7   37
##   100223071             7            7   14
##   100224614            11            7   18
##   100224810             7            6   13
##   100226022             6            6   12
##   100226404            22           10   32
##   100226445            25           35   60
##   100232624            15            6   21
##   100233402            11            6   17
##   100234198            53            9   62
##   100234250            88           23  111
##   100236200            36            7   43
##   100236211            12            8   20
##   100237400             9            7   16
##   100237708            19            8   27
##   100238106             7            6   13
##   100238221             8            7   15
##   100242605           142           13  155
##   100243020            32            9   41
##   100243039             7            9   16
##   100243512            61           19   80
##   100244926             8           10   18
##   100245306            25            6   31
##   100246020            63           10   73
##   100246029            21            6   27
##   100253506             7            7   14
##   100257629            25           10   35
##   100274707             6           10   16
##   100280112            15            6   21
##   100280132             7            6   13
##   100280135            14            9   23
##   100280212             6            8   14
##   100280553            21           12   33
##   100280809            17            9   26
##   100280902            11            6   17
##   100280934            15            8   23
##   100281031            24           12   36
##   100281032            12           10   22
##   100281057            16           15   31
##   100284318             8            7   15
##   100287533            29           27   56
##   100287534            32           20   52
##   100287535            58            6   64
##   100287552            17           16   33
##   100287553            43            6   49
##   100287907            16            8   24
##   100296527            51           19   70
##   100296931            25           29   54
##   100366518            26            7   33
##   100655718            48            7   55
##   100655955             6            6   12
##   100655985            10           10   20
##   100657216             7           14   21
##   100657313             6            9   15
##   100657380             7            6   13
##   100658014            12            7   19
##   100690901            13           26   39
##   100750325             7           14   21
##   100750480            19            9   28
##   100750590             7            7   14
##   100751102             7           11   18
##   100759202            23            6   29
##   101060001             6            8   14
##   101070001            52           11   63
##   101120015             8           11   19
##   101280104             6           12   18
##   101280122            22           14   36
##   101280144            10           16   26
##   101280509            11           11   22
##   101280615             7           15   22
##   101280648            19           15   34
##   101280671            18            7   25
##   101280705            10            9   19
##   101280709            11            7   18
##   101280724             7           19   26
##   101280807            19            6   25
##   101281000            52           16   68
##   101281003            10            9   19
##   101281132            76           32  108
##   101281152             7           50   57
##   101281154            39            6   45
##   101281176            22            8   30
##   101281188            17            6   23
##   101281200            12            9   21
##   101281205            11            6   17
##   101281211            22           17   39
##   101281212            21            7   28
##   101281213            17            9   26
##   101281225            43           17   60
##   101281234            11            8   19
##   101281235            15           20   35
##   101281242            18            9   27
##   101281243            26            6   32
##   101281255            15           22   37
##   101281308            30           12   42
##   101281314            20            6   26
##   101281733            10           21   31
##   101540004             8           12   20
##   101620025             6           12   18
##   101650006            16            6   22
##   103101535             8            7   15
##   105102405            15           32   47
##   105331512            11            7   18
##   105381227             9           10   19
##   105382143            22           10   32
##   105914908            12           24   36
##   108051166            13           11   24
##   109702200             6           13   19
##   110301100            21           13   34
##   110301437             7            7   14
##   111033834            18            6   24
##   112014507            45            9   54
##   112015051             6            8   14
##   112151405            98            9  107
##   112153008            37            8   45
##   112153702            10            6   16
##   112154502            27            7   34
##   112155917            25            7   32
##   113601183            10           11   21
##   115304205             7            7   14
##   115702614             8            6   14
##   117332259            16           10   26
##   117681525            18            8   26
##   141272525             6            6   12
##   146183513             8           11   19
##   146203400             8            9   17
##   Sum                3568         1881 5449
## 
## H_0: Response and explanatory variable are independent.
## H_A: Response and explanatory variable are dependent.
## 
##  Pearson's Chi-squared test with simulated p-value (based on 10000
##  replicates)
## 
## data:  y_table
## X-squared = 825.1471, df = NA, p-value = 9.999e-05

The simulated p-value is practically 0. Thus, at the 5% significance level, I reject the null hypothesis and conclude that the data provide convincing evidence that, in the 2012 presidential election, financial contributors’ neighborhoods and the candidates they contribute to are dependent.

The method used is the chi-square test of independence, since it is the standard method for testing the relationship between two categorical variables when at least one of them has more than two levels. The simulation-based version of the test is chosen because the sample size is small (in particular, for Romney).
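
For readers who want to see what a simulation-based chi-square test does, here is a minimal Python sketch. It is not the code behind the inference function: it shuffles the candidate labels to simulate independence (a permutation approach that keeps both margins fixed) and uses the common (extreme + 1) / (nsim + 1) convention for the p-value. The function names are illustrative, and only three zip codes from the table above are used to keep the example short.

```python
import random
from collections import Counter


def chi_square(pairs):
    """Pearson chi-square statistic for a list of (row, column) observations."""
    n = len(pairs)
    observed = Counter(pairs)
    row_tot = Counter(r for r, _ in pairs)
    col_tot = Counter(c for _, c in pairs)
    stat = 0.0
    for r in row_tot:
        for c in col_tot:
            expected = row_tot[r] * col_tot[c] / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat


def simulated_p_value(pairs, nsim=2000, seed=0):
    """Shuffle column labels to simulate independence; return (statistic, p-value)."""
    rng = random.Random(seed)
    observed_stat = chi_square(pairs)
    rows = [r for r, _ in pairs]
    cols = [c for _, c in pairs]
    extreme = 0
    for _ in range(nsim):
        rng.shuffle(cols)
        if chi_square(list(zip(rows, cols))) >= observed_stat:
            extreme += 1
    return observed_stat, (extreme + 1) / (nsim + 1)


# Counts for three zip codes copied from the summary table above
counts = {
    "100017334": {"Obama, Barack": 8, "Romney, Mitt": 7},
    "100162759": {"Obama, Barack": 114, "Romney, Mitt": 7},
    "101281152": {"Obama, Barack": 7, "Romney, Mitt": 50},
}
pairs = [(z, cand) for z, row in counts.items()
         for cand, k in row.items() for _ in range(k)]

stat, p = simulated_p_value(pairs)
print(f"X-squared = {stat:.2f}, simulated p-value = {p:.5f}")
```

Under this convention the smallest p-value that 10000 simulations can produce is 1/10001, which is about 9.999e-05. That matches the p-value reported above and suggests that none of the simulated tables was as extreme as the observed one.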

Conclusion:

So where does this leave us? We now know that even at the finest level of area, the neighborhood, location is associated with how people financially contributed to candidates in the 2012 U.S. election. So candidates should care not just about any strategic area, but about areas with many potential financial contributors. There they could win both votes and financial contributions, which would drive their campaign even further.

There are still potential lurking variables that we have to pay attention to. We know that family, workplace, and the candidates’ opinions can all influence a contributor’s choice. Neighborhood is likely not the most significant factor, but one can’t simply ignore it.

The analysis could be done more precisely. For example, we could group donations by zip code and see which neighborhoods give large donations versus small ones. This could benefit candidates, whoever they are, since they could focus their campaign on those areas. The bubble chart on the map could also incorporate donations: bubble size would no longer vary by the number of contributors but by the donation amount. That way we would know how large the donations are in a particular area, and the color would show the difference in donations between the two main candidates.
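
A possible starting point for that follow-up is sketched below in Python. It assumes each record carries a donation amount in a field called contb_receipt_amt; that field name, the helper names, and the sample records are assumptions for illustration, and none of this was run on the FEC data.

```python
from collections import defaultdict


def donations_by_zip(records):
    """Sum donation amounts per zip code and candidate."""
    totals = defaultdict(lambda: defaultdict(float))
    for rec in records:
        totals[rec["contbr_zip"]][rec["cand_nm"]] += rec["contb_receipt_amt"]
    return {z: dict(t) for z, t in totals.items()}


def donation_gap(totals, a="Obama, Barack", b="Romney, Mitt"):
    """Difference in total donations (a minus b) for each zip code."""
    return {z: t.get(a, 0.0) - t.get(b, 0.0) for z, t in totals.items()}


# Hypothetical records, not real FEC rows
records = [
    {"contbr_zip": "10021", "cand_nm": "Obama, Barack", "contb_receipt_amt": 250.0},
    {"contbr_zip": "10021", "cand_nm": "Romney, Mitt", "contb_receipt_amt": 1000.0},
    {"contbr_zip": "10021", "cand_nm": "Obama, Barack", "contb_receipt_amt": 50.0},
    {"contbr_zip": "11215", "cand_nm": "Obama, Barack", "contb_receipt_amt": 100.0},
]
totals = donations_by_zip(records)
gap = donation_gap(totals)
print(totals)
print(gap)
```

On the map, the per-zip total could then drive the bubble size and the gap could drive the color, in place of the contributor counts used earlier.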

References:

Dataset from the Federal Election Commission (FEC), http://www.fec.gov/

D. Kahle and H. Wickham. ggmap: Spatial Visualization with ggplot2. The R Journal, 5(1), 144-161

Jeffrey Breen. zipcode (R package).

Appendix:

head(rb_zip, n=50L)

##         cand_nm contbr_zip
## 1  Romney, Mitt  123032602
## 2  Romney, Mitt  148509700
## 3  Romney, Mitt  114271345
## 4  Romney, Mitt  130789600
## 5  Romney, Mitt  148303636
## 6  Romney, Mitt  110401804
## 7  Romney, Mitt  110401804
## 8  Romney, Mitt  110401804
## 9  Romney, Mitt  110401804
## 10 Romney, Mitt  110401804
## 11 Romney, Mitt  112323134
## 12 Romney, Mitt  105602708
## 13 Romney, Mitt  117874838
## 14 Romney, Mitt  101281314
## 15 Romney, Mitt  134032612
## 16 Romney, Mitt  100114672
## 17 Romney, Mitt  115452023
## 18 Romney, Mitt  110242113
## 19 Romney, Mitt  100656625
## 20 Romney, Mitt  115762844
## 21 Romney, Mitt  121101025
## 22 Romney, Mitt  142252518
## 23 Romney, Mitt  130363413
## 24 Romney, Mitt  119750688
## 25 Romney, Mitt  100214249
## 26 Romney, Mitt  121152005
## 27 Romney, Mitt  115701214
## 28 Romney, Mitt  115701214
## 29 Romney, Mitt  127715223
## 30 Romney, Mitt  100132103
## 31 Romney, Mitt  100132103
## 32 Romney, Mitt  100132103
## 33 Romney, Mitt  112053910
## 34 Romney, Mitt  137431307
## 35 Romney, Mitt  108017154
## 36 Romney, Mitt  107082609
## 37 Romney, Mitt  105761308
## 38 Romney, Mitt  145801303
## 39 Romney, Mitt  133281220
## 40 Romney, Mitt  109523601
## 41 Romney, Mitt  109523601
## 42 Romney, Mitt  109523601
## 43 Romney, Mitt  109523601
## 44 Romney, Mitt  142021443
## 45 Romney, Mitt  115613125
## 46 Romney, Mitt  130789304
## 47 Romney, Mitt  110241714
## 48 Romney, Mitt  125782226
## 49 Romney, Mitt  115763071
## 50 Romney, Mitt  100195173