DATA210-FinalProject-econandpolitcaldata
###
# DATA 210
# Adefoluke Shemsu
# FINAL EXAM
###
# PROBLEM 1
setwd("~/Documents/Education/Penn/Classes/DATA 210/Week 8")
library(tidyverse)
library(dslabs)
library(dplyr)
library(randomizr)
library(forcats)
# 1. In this question, you will use a series of datasets to investigate population density in the United States.
# a) Load in population data for Alabama (“sub-est2016_1.csv”) and Alaska (“sub-est2016_2.csv”),
# then append the two datasets together so that all of the information is within one dataframe.
pop.data.1 <- read_csv("~/Documents/Education/Penn/Classes/DATA 210/Week 8/sub-est2016_1.csv")
pop.data.2 <- read_csv("~/Documents/Education/Penn/Classes/DATA 210/Week 8/sub-est2016_2.csv")
pop.data <- rbind(pop.data.1, pop.data.2) # Appending data
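# Quick sanity check (an aside, not required by the prompt): the appended dataframe should have
# exactly as many rows as the two state files combined.
stopifnot(nrow(pop.data) == nrow(pop.data.1) + nrow(pop.data.2))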
# b) Read in the csv file that already contains population information for each state.
# Check to see which unique states are included in this dataset.
pop.data.all <- read_csv("~/Documents/Education/Penn/Classes/DATA 210/Week 8/sub-est2016_all.csv")
unique(pop.data.all$STNAME) # All 50 states plus DC (51 entries) are included
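# As an aside, a quick way to confirm what those 51 values cover: setdiff() against R's built-in
# state.name vector should leave only the District of Columbia.
length(unique(pop.data.all$STNAME))              # expect 51
setdiff(unique(pop.data.all$STNAME), state.name) # expect "District of Columbia"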
# c) There’s a lot of interesting data in this population dataset, but for our purposes in this problem set,
# we are only interested in a few columns. Use the subset() function to subset the “NAME”, “STNAME”, and
# “POPESTIMATE2012” columns into a new dataset. (Using a different function to complete the same task will
# result in partial credit.)
name.st.pop <- subset(pop.data.all, select = c("NAME", "STNAME", "POPESTIMATE2012"))
# d) This new subsetted dataset definitely makes our lives easier, but it still includes the population stats for
# each city and town. You'll notice, however, that the first observation for each new state is the population
# total for the entire state, where the state's name appears in both the NAME and STNAME columns.
# Use the subset() function to choose only these rows. Make sure that your new dataset doesn't have any
# repeating/redundant observations or columns. (The resulting dataframe should be 51 x 2.)
name.st.pop <- subset(name.st.pop, name.st.pop$NAME == name.st.pop$STNAME)
name.st.pop <- subset(name.st.pop, select = c(-NAME))
name.st.pop <- subset(unique(name.st.pop))
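# Sanity check (aside): the prompt says the result should be 51 x 2, so verify the dimensions before moving on.
dim(name.st.pop) # expect 51 2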
# e) We’re going to try to find the population density of each state. Our first step in doing this is to read in some
# online data about the square mileage of each state from this link:
# (https://raw.githubusercontent.com/jakevdp/PythonDataScienceHandbook/master/notebooks/data/state-areas.csv)
# Once the data is read in, merge that data set with our 2012 state populations dataset from the last question.
# Which observations can be matched? Make sure to not merge observation(s) that have no match.
st.area <- read.csv(url("https://raw.githubusercontent.com/jakevdp/PythonDataScienceHandbook/master/notebooks/data/state-areas.csv"))
name.st.pop <- rename(name.st.pop, # Rename the columns so the merge key matches the "state" column in st.area
                      "state" = "STNAME",
                      "popestimate2012" = "POPESTIMATE2012")
name.st.area.pop <- merge(name.st.pop, st.area, by = "state")
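# The prompt also asks which observations can be matched. As an aside, a quick way to see the
# non-matches on each side (the default merge() above already keeps matched rows only):
setdiff(st.area$state, name.st.pop$state)  # area entries with no 2012 population row (likely Puerto Rico)
setdiff(name.st.pop$state, st.area$state)  # population rows with no area entry (expect none)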
# f) Next, we are going to create a new variable in this merged dataset that tells us each state’s population density
# in 2012. Do this by dividing the population variable by the state size variable.
name.st.area.pop <- mutate(name.st.area.pop,
                           pop.density.2012 = popestimate2012 / area..sq..mi.)
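# Quick look (aside) at the densest and sparsest states as a sanity check on the new variable.
head(arrange(name.st.area.pop, desc(pop.density.2012)), 3) # expect DC well ahead of everything else
head(arrange(name.st.area.pop, pop.density.2012), 3)       # expect Alaska at the bottom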
# g) Finally, we've finished preparing our dataset; now we're going to get into some more interesting investigative work.
# Let’s first load in the “ECN_2012_US_52A1.csv” dataset which includes economic data for each sector within each state.
# Get rid of the first row, as this merely gives us descriptions of each variable.
econ.data <- read_csv("~/Documents/Education/Penn/Classes/DATA 210/Week 8/ECN_2012_US_52A1.csv")
econ.data <- econ.data[-1,]
# h) Find the total revenue per sector by state.
# First, ensure revenue is stored as a numeric variable: read_csv likely parsed RCPTOT as text because of the
# description row we just dropped, and max() on text compares values alphabetically rather than numerically.
econ.data$RCPTOT <- as.numeric(econ.data$RCPTOT)
# Next, aggregate the revenue data in order to display revenue per sector per state.
revenue <- aggregate(econ.data$RCPTOT,
                     by = list(econ.data$`GEO.display-label`, econ.data$`NAICS.display-label`),
                     FUN = max)
# Then a second round of aggregation to get total revenue per state overall.
revenue.2 <- aggregate(revenue$x,
                       by = list(revenue$Group.1),
                       FUN = sum,
                       na.rm = TRUE)
# Going to clean up a little more by renaming the columns.
revenue.2 <- rename(revenue.2,
                    "state" = "Group.1",
                    "total.rev" = "x")
# i) Now merge this dataset with our population density dataset.
rev.density <- merge(name.st.area.pop, revenue.2, by = "state")
# j) Plot the relationship between state population density and the state’s total revenue to see if there’s a relationship.
# Comment on your findings.
library(ggplot2)
# Note: no colour aesthetic is mapped here, so a colour scale would have no effect; the points are simply drawn in blue.
ggplot(rev.density, aes(pop.density.2012, total.rev)) +
  geom_point(color = "blue", alpha = .5) +
  ggtitle("Population Density <> Revenue Relationship") +
  labs(y = "Total Revenue", x = "Population Density") +
  theme_classic()
# While testing this in graphs, I've found that Washington DC and New York are such massive outliers that they inhibit the overall
# analysis. In my view, one can still spot the trends without them in the data set, so I will remove them,
# then graph it again.
rev.density.2 <- rev.density %>%
  filter(pop.density.2012 < 1200, total.rev < 700000000)
ggplot(rev.density.2, aes(pop.density.2012, total.rev)) +
  geom_point(color = "blue", alpha = .5) +
  ggtitle("Population Density <> Revenue Relationship") +
  labs(y = "Total Revenue", x = "Population Density") +
  theme_classic()
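# To put a number on the visual impression (aside), the correlation between density and revenue in the trimmed data:
cor(rev.density.2$pop.density.2012, rev.density.2$total.rev, use = "complete.obs")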
# As seen above, the pattern is much easier to see. Based on this data, it is reasonable to say that there isn't a definitive
# relationship between population density and revenue, as higher density doesn't automatically equate to more revenue.
# Rather, I believe revenue has a stronger relationship to sector than to population density, since different states
# will have disproportionately different sectors that each comprise different facets of total revenue, and some of these sectors
# generate much more in today's economy than others. Not to mention that population density doesn't guarantee revenue generation.
# A great example of this is Washington DC, where the population density is extremely high but revenue doesn't come close to NYC
# or other major cities, due (I'm guessing here) to the fact that the area is known as the DMV, implying economic
# cross-pollination between DC, VA, and MD.
# PROBLEM 2
# For this question, you will use the data file ‘nes.rda’ (which will require you to use load(“nes.rda”)
# to read in the data.) The codebook, called ‘nes2012_codebook.pdf’ is also available to you on Canvas.
# This data comes from the ANES 2012 Time Series Study, which looks at attitudes toward political ideologies and groups,
# among many other things.
load("~/Documents/Education/Penn/Classes/DATA 210/Week 8/nes.rda")
library(weights)
library(anesrake)
library(purrr)
# 1. According to this survey, of those who claimed to have voted in the 2008 election, what percentage of survey
# respondents voted for Barack Obama in 2008? (Hint: you will need to search the codebook to find the variables
# ‘interest_voted2008’ and ‘interest_whovote2008’ in order to clean them correctly.)
sum(nes$interest_whovote2008 == 1) /  # Dividing the number of Obama votes by the number of respondents who claimed to vote in 2008
  sum(nes$interest_voted2008 == 1)    # According to this code, about 67% of self-reported voters voted for Obama
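# Cross-check (aside): a quick frequency table of 2008 vote choice among self-reported 2008 voters,
# assuming the codebook's coding where 1 indicates an Obama vote.
voted.2008 <- subset(nes, interest_voted2008 == 1)
prop.table(table(voted.2008$interest_whovote2008)) # the proportion for code 1 should match the share above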
# 2. A ‘Feeling Thermometer’ is a type of survey question that asks respondents to rate how warmly or cool they feel
# toward an individual or group. A feeling thermometer score of 100 indicates a respondent feels the most positive
# toward that entity. A feeling score of 0 indicates the respondent feels most negative about that entity. A score of
# 50 indicates indifference. Using the variable that records the feeling thermometer score towards the ‘Federal Government
# in Washington,’ clean the variable to only include scores between 0 and 100. (Use the codebook to locate the ‘ftgr_fedgov’
# variable to clean it properly.)
nes.exp <- subset(nes, nes$ftgr_fedgov > -1 & nes$ftgr_fedgov < 101) # Setting the range
# 3. Using the cleaned variable, what is the average feeling thermometer for the Federal Government in Washington,
# according to this survey?
mean(nes.exp$ftgr_fedgov) # 52.48 (or just above 'indifference') is the average feeling toward the fed gov in Washington
# 4. Using the ‘prevote_regpty’ variable, create a new variable that indicates whether a respondent is a Democrat or a
# Republican. All other political affiliations or unknowns should be set to ‘NA.’ (Use the codebook to clean this
# variable correctly.)
party.data <- subset(nes.exp, nes.exp$prevote_regpty %in% c(1, 2)) # Keeping only Democratic (1) and Republican (2) registrants; %in% avoids the recycling trap of == 1:2
party.data <- mutate(party.data, party = prevote_regpty == 1)      # TRUE = Democrat, FALSE = Republican
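# As an aside, a version closer to the literal wording of the prompt: a labeled party variable on the
# full cleaned dataset, with everything other than Democrat/Republican set to NA (this assumes, as the
# code above does, that 1 codes Democrat and 2 codes Republican in the codebook).
nes.exp <- mutate(nes.exp,
                  party.label = case_when(prevote_regpty == 1 ~ "Democrat",
                                          prevote_regpty == 2 ~ "Republican",
                                          TRUE ~ NA_character_))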
# 5. Find the difference in means between the average feeling thermometer score for Democrats vs. Republicans.
# What do you conclude?
dem.data <- subset(party.data, party.data$party == TRUE)
rep.data <- subset(party.data, party.data$party == FALSE)
mean(dem.data$ftgr_fedgov) - # Dems mean score - 58.49
mean(rep.data$ftgr_fedgov) # GOP mean score - 41.74
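# A quick significance check (aside): a two-sample t-test on the same comparison, to see whether the
# roughly 17-point gap is larger than we'd expect from sampling noise alone.
t.test(ftgr_fedgov ~ party, data = party.data)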
# Without making strong assumptions, the difference of 16.75 points on average in favor of Democrats
# indicates that Democrats in general tend to lean more toward supporting Washington's federal government,
# though this data may arguably be skewed by the aforementioned Obama election data, which would support
# an additional thesis that this difference is influenced by the party in office at a given time.
# We can test this thesis loosely with a regression model, using reported 2008 vote choice as the predictor
summary(lm(ftgr_fedgov ~ factor(interest_whovote2008), data = party.data))
# This test gives a rough sense of whether who a person voted for in 2008 is associated with their
# level of approval of the federal government in Washington.