
Research question: Is there a relationship between the 2012 U.S. presidential candidates and the neighborhoods of their financial contributors?

Presidential candidates need donations from contributors to run a campaign. You have seen the voting results for New York and its cities, but state and city are too coarse to support a conclusion. This research goes down to the neighborhood level and analyzes whether there is an association between a contributor’s neighborhood and the candidate they support. If there is, it gives the next U.S. presidential candidates a meaningful factor to consider.


The data was collected by the Federal Election Commission. Its site is an official U.S. government site that documents the 2012 U.S. presidential election in detail. The data describes each contributor: id, name, occupation, city, status, employer, contribution amount, and so on. This data can be downloaded at this link. It comes from a form that every contributor must fill in to donate to a candidate’s campaign.

One contributor can donate more than once. Since I only care about the contributors themselves, I drop duplicate contributors, which leaves 378562 unique contributors as cases. The variables of interest are the contributor’s zip code (categorical, many levels) and the candidate name (categorical, several levels).
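As a minimal sketch of the de-duplication step, the snippet below keeps the first record for each contributor in a toy list. The contributor names and the choice of (name, zip code, candidate) as the identity key are assumptions made for illustration; only cand_nm and contbr_zip are named in this write-up.

```python
# Toy records as (contributor name, zip code, candidate).
# The names and the identity key are hypothetical, for illustration only.
records = [
    ("DOE, JANE", "10001", "Obama, Barack"),
    ("DOE, JANE", "10001", "Obama, Barack"),    # second donation, same person
    ("ROE, RICHARD", "10002", "Romney, Mitt"),
    ("DOE, JANE", "11201", "Obama, Barack"),    # same name, different zip code
]


def unique_contributors(rows):
    """Keep the first record for each distinct (name, zip, candidate) triple."""
    seen = set()
    kept = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            kept.append(row)
    return kept


unique = unique_contributors(records)
print(len(records), "rows ->", len(unique), "unique contributors")
```

Which columns define "the same contributor" is a design choice: a key that is too narrow merges different people, while one that is too wide counts one person several times.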

This is an observational study. The data comes from a mandatory form filled in by the contributors themselves rather than generated by a computer, which shows in the city field: the same city can appear under two different spellings (“North Hills,” vs “North Hills”). The dataset contains 2207 distinct city values, as opposed to the 62 cities New York actually has.
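To illustrate why hand-typed city names inflate the number of distinct values, here is a small sketch that normalizes spellings before counting them. The helper normalize_city is hypothetical and is not part of the analysis in this post.

```python
def normalize_city(name):
    """Trim whitespace, strip trailing commas or periods, and upper-case."""
    return name.strip().rstrip(",.").strip().upper()


cities = ["North Hills,", "North Hills", "north hills ", "Albany"]
distinct_raw = set(cities)
distinct_clean = {normalize_city(c) for c in cities}
print(len(distinct_raw), "raw spellings ->", len(distinct_clean), "cities")
```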

The results can be generalized to all New York financial contributions in 2012, but only to New Yorkers who contributed to the presidential election. They can’t be generalized to people who did not contribute, for example because they could not afford to, or for other reasons. Such extraneous variables prevent generalizing to the whole New York population. The data also covers only 2012; the 2016 presidential election will have different candidates, so the pattern may differ greatly. Finally, because this is an observational study, we can’t draw causal conclusions.

Exploratory data analysis:

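The libraries I will be using are:

library(ggplot2)  # ggplot(), ggsave()
library(dplyr)    # group_by(), summarise(), %>%
library(zipcode)  # clean.zipcodes() and the zipcode data set
library(ggmap)    # get_map(), ggmap()
data(zipcode)

Here I load the data into a data frame and keep just two candidates, Obama and Romney.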

df = read.csv("fc2012ny.csv")

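# Row-wise: keep only the two main candidates, Obama and Romney.
# Note: duplicated('contbr_zip') is evaluated on the literal string, so it is
# FALSE and no row is dropped here; duplicated(df$contbr_zip) would drop
# repeated zip codes instead.
# Column-wise: select the two variables required, 3 = cand_nm, 7 = contbr_zip.
rb_zip = subset(df[!duplicated('contbr_zip'), c(3, 7)],
                cand_nm == 'Romney, Mitt' |
                cand_nm == 'Obama, Barack')
# Reduce the factor levels to the 2 remaining candidates
rb_zip = droplevels(rb_zip)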

This plot shows how many contributors Obama and Romney each have.

# Plot with ggplot
ggplot(rb_zip, aes(x = cand_nm)) + geom_bar() +
  xlab('Candidate names') +
  ylab('Number of contributors') +
  ggtitle('Contributors for candidates presidential election 2012')
# Save the last plot; ggsave() is a separate call, not a layer added with +
ggsave('plot.jpg', limitsize = TRUE)

Even from a simple bar chart, we can see that Obama has more than eight times as many contributors as Romney. With this many contributors, Obama has more freedom to run his campaign than Romney. Below are the contributor counts for Obama and Romney:

## Obama, Barack  Romney, Mitt 
##        338450         40112
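The counts above can be turned into shares and a ratio with a few lines of arithmetic. This sketch only restates the two numbers from the table; nothing else is assumed.

```python
# Counts copied from the table above.
counts = {"Obama, Barack": 338450, "Romney, Mitt": 40112}

total = sum(counts.values())
ratio = counts["Obama, Barack"] / counts["Romney, Mitt"]

for name, n in counts.items():
    print(f"{name}: {n} ({n / total:.1%})")
print(f"Obama-to-Romney ratio: {ratio:.2f}")
```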

To help with the analysis, I use the zipcode package by Jeffrey Breen. From it I extract the latitude, longitude, and city (I did not use the city column in df, because it needs some wrangling). Since some contributors filled in only the 5-digit zip code, I convert all zip codes to their 5-digit form. Then I use color to distinguish the two candidates. For the map I use another great plotting package, ggmap by David Kahle and Hadley Wickham.

#Clean and join the zipcode data, by Jeffrey Breen, author of R zipcode package.
rb_zip$contbr_zip = clean.zipcodes(rb_zip$contbr_zip)
zip_map= merge(rb_zip, zipcode, by.x = 'contbr_zip', by.y = 'zip')

#Draw original map of new york
map = get_map(location="ny",zoom=6,source="stamen")
#Plot the point
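The zip code clean-up and join can be sketched in plain Python as follows. This is an illustration of the idea (reduce every zip code to 5 digits, then look up coordinates), not a port of clean.zipcodes(). The lookup table and its approximate coordinates are made up to stand in for the zipcode data set.

```python
def clean_zip(raw):
    """Return the 5-digit zip code as a string, or None if it is unusable."""
    text = str(raw).split(".")[0].strip()       # "12207.0" -> "12207"
    digits = "".join(ch for ch in text if ch.isdigit())
    if len(digits) == 9:                         # ZIP+4 -> first five digits
        return digits[:5]
    if 3 <= len(digits) <= 5:                    # restore dropped leading zeros
        return digits.zfill(5)
    return None


# Hypothetical lookup: zip code -> (latitude, longitude), approximate values.
zip_lookup = {"10001": (40.75, -73.99), "12207": (42.65, -73.75)}

raw_zips = ["100011234", 10001, "12207.0", "abc"]
cleaned = [clean_zip(z) for z in raw_zips]
coords = [zip_lookup.get(z) for z in cleaned]   # None where no match exists
print(list(zip(cleaned, coords)))
```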

Here we see that most contributors favor Obama over Romney. The plot suffers from overplotting, though, so I limit it to the area where most contributors live. Based on the plot, I get a pretty good estimate if I limit latitude to between 40 and 45 and longitude to between -80 and -72.

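zip_map.filtered = subset(zip_map,
                          !is.na(latitude) &    # assumed: the original condition is missing here
                          !is.na(longitude) &   # assumed: the original condition is missing here
                          latitude > 40 & latitude < 45 &
                          longitude > -80 & longitude < -72)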
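The same bounding-box filter, written as a small Python sketch on toy points. The bounds are the ones stated above; the sample coordinates are made up.

```python
LAT_RANGE = (40.0, 45.0)
LON_RANGE = (-80.0, -72.0)


def in_box(lat, lon):
    """True when the point has coordinates and lies strictly inside the box."""
    if lat is None or lon is None:
        return False
    return (LAT_RANGE[0] < lat < LAT_RANGE[1]
            and LON_RANGE[0] < lon < LON_RANGE[1])


# Toy (latitude, longitude) pairs; the last two fall outside or are missing.
points = [(40.75, -73.99), (42.65, -73.75), (34.05, -118.24), (None, None)]
kept = [p for p in points if in_box(*p)]
print(kept)
```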

Since the zip code has so many levels, I group the zip codes by their coordinates rounded to the nearest integer. For each group I take the average latitude and longitude, then draw a bubble chart on the map. The larger the bubble, the more contributors in the area. Diverging colors show the difference between Obama and Romney contributors: more blue means more contributors to Obama, brown means the two are roughly equal.

zip_map.by_coor = group_by(zip_map.filtered, round(latitude), round(longitude)) %>%
  summarise(latmean = mean(latitude),
            lonmean = mean(longitude),
            OfromR = table(cand_nm)["Obama, Barack"] - table(cand_nm)["Romney, Mitt"],
            count = n())

# Create the ggmap:
# map = background of New York, stamen
# x = longitude
# y = latitude
# color = difference of Obama to Romney
# alpha = darker per 20 count
# size = based on the number of contributors in the area
# Color palette from red to blue
ggmap(map, fullpage = T) +
  # Assumed: the opening of geom_point() is reconstructed from the comments
  # above; only the size mapping survives from the original call, and the
  # alpha setting is not recoverable, so it is left out.
  geom_point(data = zip_map.by_coor,
             aes(x = lonmean, y = latmean, colour = OfromR,
                 size = count)) +
  scale_size_area(max_size = 50, breaks = c(1000, 5000, 10000, 50000), guide = F) +
  scale_colour_distiller(palette = 'RdBu',
                         guide = guide_legend(title = "difference contributors")) +
  ggtitle('Map of New York contributors in 2012, for Obama and Romney')
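For readers who want to check the grouping logic without R, here is a rough Python sketch of the same summary: points are binned by rounded coordinates, and each bin gets its mean position, its contributor count, and the Obama-minus-Romney difference. The four sample rows are made up.

```python
from collections import defaultdict

# Toy (latitude, longitude, candidate) rows.
rows = [
    (40.75, -73.99, "Obama, Barack"),
    (40.71, -74.01, "Obama, Barack"),
    (40.68, -73.94, "Romney, Mitt"),
    (42.65, -73.75, "Romney, Mitt"),
]

# Bin each row by its coordinates rounded to the nearest integer.
groups = defaultdict(list)
for lat, lon, cand in rows:
    groups[(round(lat), round(lon))].append((lat, lon, cand))

summary = {}
for cell, members in groups.items():
    n = len(members)
    obama = sum(1 for m in members if m[2] == "Obama, Barack")
    romney = sum(1 for m in members if m[2] == "Romney, Mitt")
    summary[cell] = {
        "latmean": sum(m[0] for m in members) / n,
        "lonmean": sum(m[1] for m in members) / n,
        "OfromR": obama - romney,   # positive means more Obama contributors
        "count": n,
    }

for cell, stats in sorted(summary.items()):
    print(cell, stats)
```

Rounding to whole degrees gives very coarse cells, each covering many zip codes, which is why this view cannot resolve individual neighborhoods.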

We can see that the majority of contributors live in New York City, with the rest scattered around the state. Contributors in New York City side with Obama more strongly than those in any other area, which could simply be because this area has the largest population. Contributors in nearby areas (as represented by longitude and latitude) show roughly equal differences, but once again this could be because the differences are too small to notice. We can’t resolve neighborhoods at this level, but the map serves as a good estimate of how the difference in contributors varies across areas.


In this section I use hypothesis testing to see whether there is a relationship between contributors’ neighborhoods and the candidates they contribute to. My null hypothesis is that neighborhood and candidate are independent; the alternative is that they are dependent.

Python is used to wrangle the data. If you know the R equivalent of the following code, feel free to comment below. First I filter the data down to the two major candidates, Barack Obama and Mitt Romney. Next, I keep only 9-digit zip codes and drop 5-digit ones, so that each zip code covers a narrower area.

import pandas as pd

# Read the csv file
df = pd.read_csv('fc2012ny.csv')
# Turn the 'contbr_zip' column into strings, dropping any trailing '.0'
df.contbr_zip = df.contbr_zip.apply(lambda x: str(x).split('.')[0])
# Only observe the campaigns of Barack Obama and Mitt Romney.
# Keep only zip codes of length 9, because 5-digit zip codes are too many and cover too wide an area to observe.
# Keep only the candidate name and contributor zip code columns.
df2 = df[(df.cand_nm.isin(['Obama, Barack', 'Romney, Mitt'])) & (df.contbr_zip.str.len() == 9)][['cand_nm', 'contbr_zip']]
# Create the contingency table between both columns
dfc = pd.crosstab(df2.contbr_zip, df2.cand_nm)
# Keep only zip codes where every cell count is above 5, to meet the sample-size condition of the chi-square test.
# The zip codes that pass are fed back into the dataframe as a filter.
df3 = df2[df2.contbr_zip.isin(dfc[dfc.applymap(lambda x: x > 5)].dropna().index)]

The data is then loaded into R, and the first 10 rows of the table below show the contributor count for each zip code and each candidate.

df_cleaned = read.csv('fc2012ny-cleaned.csv')
## H_0: Response and explanatory variable are independent.
## H_A: Response and explanatory variable are dependent.
##  Pearson's Chi-squared test with simulated p-value (based on 10000
##  replicates)
## data:  y_table
## X-squared = 825.1471, df = NA, p-value = 9.999e-05

The p-value from the simulation is practically 0 (about 0.0001). Thus, at the 5% significance level, I reject the null hypothesis and conclude that the data provide convincing evidence that, in the 2012 presidential election, financial contributors’ neighborhoods and the candidates they contribute to are dependent.
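To make the simulated p-value less of a black box, here is a minimal Python sketch of a simulation-based chi-square test on toy data. It shuffles the candidate labels, which keeps both margins fixed, and it is analogous to, not a copy of, what R does. R reports (1 + number of simulated statistics at least as large as the observed one) / (B + 1), so with 10000 replicates the smallest possible p-value is 1/10001, about 9.999e-05, which is exactly the value in the output above. The two areas "A" and "B" and their counts are made up.

```python
import random
from collections import Counter


def chi_square(areas, cands):
    """Pearson chi-square statistic for two parallel lists of labels."""
    n = len(areas)
    row_totals = Counter(areas)
    col_totals = Counter(cands)
    observed = Counter(zip(areas, cands))
    stat = 0.0
    for r, rn in row_totals.items():
        for c, cn in col_totals.items():
            expected = rn * cn / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat


def simulated_p_value(areas, cands, replicates=2000, seed=0):
    """Shuffle candidate labels to approximate the null distribution."""
    rng = random.Random(seed)
    stat = chi_square(areas, cands)
    shuffled = list(cands)
    extreme = 0
    for _ in range(replicates):
        rng.shuffle(shuffled)
        if chi_square(areas, shuffled) >= stat:
            extreme += 1
    return stat, (1 + extreme) / (replicates + 1)


# Toy data: area A leans Obama (25 vs 5), area B leans Romney (8 vs 22).
areas = ["A"] * 30 + ["B"] * 30
cands = (["Obama"] * 25 + ["Romney"] * 5) + (["Obama"] * 8 + ["Romney"] * 22)

stat, p_value = simulated_p_value(areas, cands)
print(f"X-squared = {stat:.4f}, simulated p-value = {p_value:.4g}")
print(f"Smallest p-value with 10000 replicates: {1 / 10001:.3e}")
```

The "+ 1" in the numerator and denominator counts the observed table as one of the simulated ones, so the reported p-value can never be exactly 0.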

The method used is the chi-square test of independence, the standard choice for testing the relationship between two categorical variables when at least one has more than two levels. The simulation-based version is chosen because the sample size is too small (in particular, for Romney).
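One common way to judge whether the sample is "too small" for the theoretical chi-square test is to look at the expected cell counts under independence and flag any below 5. The sketch below does this for a made-up table of three zip codes by two candidates; the threshold of 5 is the usual rule of thumb, and the numbers are not from the dataset.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]


# Toy contingency table: rows are zip codes, columns are (Obama, Romney).
table = [[120, 9], [45, 6], [300, 20]]
expected = expected_counts(table)

small_cells = [
    (i, j)
    for i, row in enumerate(expected)
    for j, value in enumerate(row)
    if value < 5
]
print(expected)
print("Cells with expected count below 5:", small_cells)
```

When such cells exist, the simulation-based p-value is the safer choice, because it does not rely on the chi-square approximation.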


So where does this leave us? We now know that even at the finest level, the neighborhood, area is associated with how people contributed financially to candidates in the 2012 U.S. election. So it is not just any strategic area that matters: areas with many potential financial contributors are an important factor. Candidates could win both votes and financial contributions there, which would drive their campaigns even further.

There are still potential lurking variables that deserve attention. We know that family, workplace, and the candidates’ positions can all influence a contributor’s choice. Neighborhood is probably not as significant a factor as these, but it can’t simply be ignored.

The analysis could be made more precise. For example, we could group donations by zip code and see which neighborhoods give large donations and which give small ones. This would benefit candidates, whoever they are, since they could campaign in those areas. The bubble chart on the map could also incorporate donations: bubble size would no longer vary with the number of contributors but with the donation total. That way we would know how large the donations are in a particular area, and color would show the difference in donations between the two main candidates.
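A rough sketch of that follow-up idea: sum the donation amounts per zip code and candidate, then derive a total (for bubble size) and an Obama-minus-Romney difference (for color). The amounts below are made up, and the real dataset's amount column is not named in this post, so the field layout here is an assumption.

```python
from collections import defaultdict

# Toy (zip code, candidate, amount) rows; amounts are made up for illustration.
donations = [
    ("10001", "Obama, Barack", 250.0),
    ("10001", "Obama, Barack", 100.0),
    ("10001", "Romney, Mitt", 500.0),
    ("12207", "Romney, Mitt", 75.0),
]

# Sum amounts per zip code and candidate.
totals = defaultdict(lambda: defaultdict(float))
for zip_code, cand, amount in donations:
    totals[zip_code][cand] += amount

summary = {}
for zip_code, by_cand in totals.items():
    obama = by_cand.get("Obama, Barack", 0.0)
    romney = by_cand.get("Romney, Mitt", 0.0)
    summary[zip_code] = {
        "total": obama + romney,     # would drive the bubble size
        "OfromR": obama - romney,    # would drive the bubble color
    }

for zip_code, stats in sorted(summary.items()):
    print(zip_code, stats)
```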



