forked from NickCH-K/CausalitySlides
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathLecture_02_Describing_Relationships.Rmd
550 lines (418 loc) · 20.6 KB
/
Lecture_02_Describing_Relationships.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
---
title: "Lecture 2: Describing Relationships"
author: "Nick Huntington-Klein"
date: "`r Sys.Date()`"
output:
revealjs::revealjs_presentation:
theme: solarized
transition: slide
self_contained: true
smart: true
fig_caption: true
reveal_options:
slideNumber: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
library(tidyverse)
library(purrr)
library(patchwork)
library(ggpubr)
library(haven)
oster <- read_dta('nhanes_summary_cleaned.dta') %>%
mutate(supplement_vite_single = case_when(
!supplement_vite_single ~ 'No Vitamin E',
TRUE ~ 'Took Vitamin E'
))
theme_set(theme_gray(base_size = 15))
```
## Relationships
- Two variables $X$ and $Y$ are *independent* if learning the value of one tells you nothing about the other
- For example, knowing the outcome of a roulette spin tells you nothing about what the next spin will be
- They are *dependent* if learning the value of one *changes the distribution* of the other
- For example, among all Americans, about 50.8\% are legally female and about 49.2\% are legally male
- But if we *learn that someone's name is Susan*, that distribution will change considerably in favor of female!
## Conditional Values
- Another way of saying we *learn the value of one variable* is to say we *condition* on the value of that variable
- "Conditional on someone's name being Susan, the distribution of legal sex is 96% female and 4% male"
- P(Gender | Name = "Susan")
- All the stuff we talked about last time with distributions applies here
- It just applies only to *a portion of the data* (the Susans!)
## Conditional Means
- In many fields, including economics, we are very interested in calculating *conditional means*
- i.e. *what is the mean of the conditional distribution?*
- And we want to know this not only for *one* value of the variable we're conditioning on, but *all* the values
- The *population* conditional mean is the *conditional expectation*, or $E(Y|X)$
## Conditional Means/Expectations and Relationships
- Talking about how two variables are related is just another way of talking about conditional values (usually, conditional means)
- If $X$ and $Y$ are positively related, that means "at higher values of $X$, the conditional expectation $E(Y|X)$ is higher"
- If they're negatively related, "at higher values of $X$, $E(Y|X)$ is lower"
- If their relationship is U-shaped (or similar), "at higher values of $X$, $E(Y|X)$ changes but not always in the same direction"
## Conditional Means and Causal Inference
- *Statistics* is great at telling us how two variables are related
- *Causal inference* is *entirely* about figuring out *which conditional relationships are causal and in what way*
- You can't infer causality from a relationship alone!
- As I'll show, some of the relationships $Y|X$ you'll see today may reflect an $X$ that causes $Y$, or a $Y$ that causes $X$, or something else entirely!
## "Explaining"
- We can also say that $Y|X$ is "the part of $Y$ that is *explained by* $X$"
- If $E(CoffeeCupsPerDay |$ $Occupation = Professor) = 1.79$, and I drink 3 cups per day, then 1.79 of my cups are "explained by" the fact that I'm a professor, and $3 - 1.79 = 1.21$ of my cups are "not explained by" being a professor
- This can extend to multiple variables!
- If $E(CoffeeCupsPerDay |$ $Occupation = Professor,$ $Gender = Male) = 2.13$, then 2.13 of my cups are "explained by" occupation and age, and $3 - 2.13 = .83$ of my cups are "not explained by" those two things
## "Explaining"
- This is a purely statistical explanation
- This doesn't necessarily mean that 2.13 of my cups are *because* I'm a professor and a man, but rather that 2.13 of my cups are *what would be expected* given what you know about me
- The "explanation why" would be the same *statistical* calculation, but would require us to figure out *which* of those calculations is causal in nature.
## Ways of Demonstrating Conditional Means
- When $X$ only takes a few values, we can just show the distrbution of $Y$ conditional on each of the values of $X$, and compare them to each other. Any distribution-showing method works - density plots, bar graphs...
- Scatterplots show all the data and imply the conditional mean
- For continuous $X$ data, you can calculate the conditional mean $Y|X$ over different *ranges* of $X$, either splitting $X$ into bins, or doing "local means"
- Regression
## Example
- Following the textbook, we'll be using data from Emily Oster's study of the relationship between taking Vitamin E and health outcomes
- (and whether that relationship changes as a result of Vitamin E being briefly recommended, then not!)
- We'll start with versions that can be used for *discrete* $X$ values, like "smoking vs. non-smoking"
- Remember, the proportion of a binary variable *is* its mean
## Conditional Means: Contrasting Bar Graphs
```{r, echo = FALSE}
ggplot(oster %>%
select(smoke,supplement_vite_single) %>%
na.omit() %>%
mutate(smoke = as.factor(
case_when(smoke == 0 ~ 'No Smoking',
TRUE ~ 'Smoking')
)) %>%
group_by(smoke) %>%
mutate(BigN = n()) %>%
group_by(smoke, supplement_vite_single) %>%
summarize(N = n()/first(BigN)) %>%
group_by(smoke, supplement_vite_single) %>%
mutate(sup_label = case_when(
row_number() == 1 ~ supplement_vite_single,
TRUE ~ NA_character_
)),
aes(x = smoke, y = N, group = supplement_vite_single)) +
geom_col(position = 'dodge', fill = 'white', color = 'black') +
geom_text(aes(label = sup_label, y = N + .05),position = position_dodge(0.9), family = 'Garamond', size = 14/.pt) +
guides(fill = FALSE) +
labs(x = NULL, y = 'Proportion') +
scale_fill_manual(values = c('gray','black'))+
theme_pubr()
```
## Conditional Means
- Or we can just show that information in a table
```{r, echo = TRUE}
oster %>% select(smoke,supplement_vite_single) %>%
mutate(smoke = as.factor(
case_when(smoke == 0 ~ 'No Smoking',
TRUE ~ 'Smoking'))) %>%
table() %>% prop.table(margin = 1)
```
## Conditional Means
- That was $TookVitaminE|Smoking$. How about $Smoking|TookVitaminE$?
```{r, echo = TRUE}
oster %>% select(smoke,supplement_vite_single) %>%
mutate(smoke = as.factor(
case_when(smoke == 0 ~ 'No Smoking',
TRUE ~ 'Smoking'))) %>%
table() %>% prop.table(margin = 2)
```
## Conditional Distributions: Contrasting Densities
```{r, echo = TRUE, eval = FALSE}
ggplot(oster, aes(x = vite, linetype = factor(vigorous_exercise_month))) +
geom_density(size = 1.5) +
scale_x_log10() +
labs(x = 'Vitamin E Taken (Log Scale)',
y = 'Density') +
annotate(geom = 'label', x = 3.5, y = .75, label = 'No Vigorous Exercise\nLast Month', hjust = 1, family = 'Garamond', size = 13/.pt) +
annotate(geom = 'label', x = 15, y = .75, label = 'Vigorous Exercise\nLast Month', hjust = 0, family = 'Garamond', size = 13/.pt) +
guides(linetype = FALSE) +
theme_pubr()
```
## Conditional Distributions: Contrasting Densities
```{r}
ggplot(oster, aes(x = vite, linetype = factor(vigorous_exercise_month))) +
geom_density(size = 1.5) +
scale_x_log10() +
labs(x = 'Vitamin E Taken (Log Scale)',
y = 'Density') +
annotate(geom = 'label', x = 3.5, y = .75, label = 'No Vigorous Exercise\nLast Month', hjust = 1, family = 'Garamond', size = 13/.pt) +
annotate(geom = 'label', x = 15, y = .75, label = 'Vigorous Exercise\nLast Month', hjust = 0, family = 'Garamond', size = 13/.pt) +
guides(linetype = FALSE) +
theme_pubr()
```
## Continuous $X$ Variables
- What if the $X$ variable we want to condition on is continuous?
- We have options!
- A scatterplot will just show *all* the data (which may be too much/busy!)
- Of course, if we really want to *understand* or *describe* the relationship, we'll need some way of summarizing what we see!
## Scatterplots
```{r, echo = TRUE, eval = FALSE}
ggplot(oster %>% slice(150:300), aes(x= age, y = heart_health)) +
geom_point() +
labs(x = 'Age',
y = 'Heart Health Score') +
theme_pubr()
```
## Scatterplots
```{r}
ggplot(oster %>% slice(150:300), aes(x= age, y = heart_health)) +
geom_point() +
labs(x = 'Age',
y = 'Heart Health Score') +
theme_pubr()
```
## Data in the Social Sciences
- Looks like a blob with sort of a suggestion of a relationship, rather than something really clear, doesn't it?
- That's par for the course in the social sciences
- Super-clear scatterplots are for relationships that are truly *bivariate*, i.e. $X$ explains $Y$ almost perfectly
- If there's *a lot of other stuff going on*, i.e. unexplained parts, the data reflects that and looks more blobby
- In the social sciences, there's ALWAYS a lot of other stuff going on
## Binning
- How can we calculate a conditional mean when there might only be one observation with a given $X$ value? Can't really estimate $E(Y|X=x)$ well if there's only one observation with $X = x$!
- We'll need to group observations together
- One way is binning - we'll rarely actually use this but it's good for demonstration
- Cut up $X$ into bins, and take the average of $Y$ within each bin
## Binning
```{r, echo = TRUE}
oster %>% filter(bmi < 100) %>%
mutate(bmi_cut = cut(bmi, 8)) %>% group_by(bmi_cut) %>%
summarize(vite = mean(heart_health, na.rm = TRUE)) %>%
mutate(vite = scales::number(vite, accuracy = .001)) %>%
rename(`BMI Bin` = bmi_cut,
`Heart Health Index` = vite)
```
## Binning
- Any guesses why that last bin is weird?
```{r}
oster %>% filter(bmi < 100) %>%
#sample_n(500) %>%
ggplot(aes(x = bmi, y = heart_health)) +
geom_point(alpha = .4) +
geom_step(data = oster %>%
filter(bmi < 100) %>%
mutate(bmi_cut = cut(bmi, 8)) %>%
group_by(bmi_cut) %>%
summarize(`Heart Health Index` = mean(heart_health, na.rm = TRUE), bmi = min(bmi)),
mapping = aes(x = bmi, y = `Heart Health Index`),
size = 2, color = 'red') +
scale_y_continuous(limits = c(-5,5)) +
labs(x = 'Body Mass Index', y = 'Heart Health Index',
caption = 'Outliers on the y-axis not pictured to allow zoom-in') +
theme_pubr()
```
## Binning
- Doing this binning thing tells us the average of $Y$ within certain defined ranges of $X$, letting us estimate the population $E(Y | X \in [StartofBin,EndofBin])$
- So I can get $E(Y|X=x)$ by just seeing which bin $x$ is in
- Of course, this is a bit arbitrary and strange - where do the bin definitions come from, how close should we make them, and is it really reasonable to see big jumps going from the edge of one bin to another?
## Local Means
- Another approach is to take *local expectations*
- Same binning idea, but we define the bin in a *rolling* manner
- For each value $x$, use a bin *centered on that $x$* (perhaps weighting closer values more highly)
- This gives us a smooth estimate (no jumps!) and while we still have decisions to make (width of bin, weights), it's less arbitrary
- If we fit a quadtratic shape at each of those points, that's a LOESS (Locally Estimated Scatterplot Smoothing)
## LOESS
```{r, echo = TRUE, eval = FALSE}
oster %>% filter(bmi < 100) %>%
ggplot(aes(x = bmi, y = heart_health)) +
geom_point(alpha = .4) +
geom_smooth(size = 2, color = 'red', se = FALSE) +
scale_y_continuous(limits = c(-5,5)) +
labs(x = 'Body Mass Index', y = 'Heart Health Index',
caption = 'Outliers on the y-axis omitted to allow zoom-in') +
theme_pubr()
```
## LOESS
```{r}
oster %>% filter(bmi < 100) %>%
ggplot(aes(x = bmi, y = heart_health)) +
geom_point(alpha = .4) +
geom_smooth(size = 2, color = 'red', se = FALSE) +
scale_y_continuous(limits = c(-5,5)) +
labs(x = 'Body Mass Index', y = 'Heart Health Index',
caption = 'Outliers on the y-axis omitted to allow zoom-in') +
theme_pubr()
```
## LOESS
Benefits of LOESS:
- Nonparametric
- Easy to understand
Downsides:
- Difficult to use to sum up a relationship
- Or try to uncover population relationships
## Regression
- This brings us to regression!
- Regression takes the *idea* behind a LOESS curve - use a line to represent the conditional mean at each value of $X$, smoothing over gaps between the $X$'s - and generalizes it using a *shape*
- Regression is the process of fitting a line to data, *and requiring that that line holds a particular shape*
- In basic forms of regression, that shape is a straight line
- (You can make it a bit curvy by adding polynomial terms)
## Linear Regression
```{r, echo = TRUE, eval = FALSE}
oster %>% filter(bmi < 100) %>%
ggplot(aes(x = bmi, y = heart_health)) +
geom_point(alpha = .4) +
geom_smooth(size = 2, color = 'red', se = FALSE, method = 'lm') +
scale_y_continuous(limits = c(-5,5)) +
labs(x = 'Body Mass Index', y = 'Heart Health Index',
caption = 'Outliers on the y-axis omitted to allow zoom-in') +
theme_pubr()
```
## Linear Regression
```{r}
oster %>% filter(bmi < 100) %>%
ggplot(aes(x = bmi, y = heart_health)) +
geom_point(alpha = .4) +
geom_smooth(size = 2, color = 'red', se = FALSE, method = 'lm') +
scale_y_continuous(limits = c(-5,5)) +
labs(x = 'Body Mass Index', y = 'Heart Health Index',
caption = 'Outliers on the y-axis omitted to allow zoom-in') +
theme_pubr()
```
## Line-Fitting
$$ Y = \beta_0 + \beta_1X + \varepsilon $$
Pros:
- $\beta_1$ is a lot easier to interpret than continuously changing local means
- We can extrapolate to areas of $X$ where we have no observations (but don't go beyond edge of data!)
- We understand the sampling distribution of $\hat{\beta}_1$
Cons:
- If we don't use the *right shape* the results will be bad
- We may toss out some interesting information that doesn't fit the shape
## Line-Fitting
- Easy interpretation: a one-unit change in $X$ is **associated with** a $\beta_1$-unit change in $Y$
- (If we've *identified a causal effect* then a one-unit change in $X$ **causes** a $\beta_1$-unit change in $Y$ )
- $\hat{\beta}_1$ follows a normal distribution (or nearly so, if $\varepsilon$ isn't normal), so easy to do hypothesis tests
- Eight million notes and details to go over that we won't today, but they'll come up, and you hopefully covered some in econometrics!
## Line-Fitting
Running a regression. This sees whether restaurant chains with more locations get higher/lower health inspection scores:
```{r, echo = TRUE, eval = FALSE}
df <- read_csv('restaurant_data.csv')
m1 <- lm(inspection_score ~ NumberofLocations, data = df)
library(modelsummary)
msummary(m1, stars = TRUE)
```
## Interpret!
```{r, echo = FALSE, eval = TRUE}
df <- read_csv('restaurant_data.csv')
m1 <- lm(inspection_score ~ NumberofLocations, data = df)
library(modelsummary)
msummary(m1, stars = TRUE)
```
## Conditional Conditional Means
- The other bonus regression gives us is the ability to add *control variables*
$$ Y = \beta_0 + \beta_1X + \beta_2Z + \varepsilon $$
- This gives us a "conditional conditional mean", i.e. "the conditional mean of $Y$ given $X$, conditional on $Z$ "
- "What is the part of the relationship between $X$ and $Y$ that is not explained by differences in $Z$?"
## Conditional Conditional Means
- Adding a control *removes the part of $X$ that is explained by $Z$* and also *removes the part of $Y$ that is explained by $Z$*
- What's left over is the unrelated parts, and we can see how those unrelated parts relate
- This is demonstrated most cleanly by the Frisch-Waugh-Lovell theorem
- (this also extends to more than one control)
## Conditional Conditional Means
```{r, echo = TRUE, eval = FALSE}
m2 <- lm(inspection_score ~ NumberofLocations + Year, data = df)
df <- df %>%
mutate(inspection_score_res = resid(lm(inspection_score ~ Year)),
NumberofLocations_res = resid(lm(NumberofLocations ~ Year)))
m3 <- lm(inspection_score_res ~ NumberofLocations_res, data = df)
msummary(list(m2, m3), stars = TRUE, fmt = 5,
gof_omit = 'AIC|BIC|Lik')
```
## Conditional Conditional Means
```{r, echo = FALSE, eval = TRUE}
m2 <- lm(inspection_score ~ NumberofLocations + Year, data = df)
df <- df %>%
mutate(inspection_score_res = resid(lm(inspection_score ~ Year)),
NumberofLocations_res = resid(lm(NumberofLocations ~ Year)))
m3 <- lm(inspection_score_res ~ NumberofLocations_res, data = df)
msummary(list(m2, m3), stars = TRUE, fmt = 5,
gof_omit = 'AIC|BIC|Lik')
```
## Regression Command Notes
```{r, eval = FALSE, echo = TRUE}
# Predictions and residuals
pred <- predict(m3); res <- resid(m3)
# Polynomials
m4 <- lm(inspection_score ~ Year + I(Year^2), data = df)
# Interactions (* includes both by themselves as well as interacted)
m5 <- lm(inspection_score ~ NumberofLocations*Weekend, data = df)
# Getting average marginal effects when you have interactions or polynomials
library(margins)
margins(m4, at = list(Year = 2009)) %>% summary()
margins(m5, at = list(Weekend = TRUE)) %>% summary()
```
## Regression Command Notes
```{r, eval = FALSE, echo = TRUE}
# Logit
m6 <- glm(Weekend ~ Year + NumberofLocations, data = df, family = binomial(link = 'logit'))
# Average marginal effects from logit
margins(m6) %>% summary()
# Heteroskedasticity-robust standard errors
# (currently poking modelsummary author to improve this syntax)
msummary(list(m4,m5,m6), vcov = 'robust', stars = TRUE)
```
## Interactions
- We'll be using interactions a lot. They're very common in causal inference models, usually at least one of the interacted variables is binary (phew!)
- Let's be sure we can interpret them properly!
- Think about what coefficients are being multiplied by $X$ when $D = 0$ and when $D = 1$
- And how much the effect of $X$ goes up by switching $D=0$ to $D = 1$!
$$ Y = \beta_0 + \beta_1X + \beta_2D + \beta_3DX + \varepsilon $$
## Interactions
- Coefficient on $X$ alone: effect of $X$ when $D = 0$
- Coefficient on interaction alone: how much stronger the effect of $X$ is for $D = 1$ than for $D = 0$
- $\beta_1 + \beta_3$: effect of $X$ when $D = 1$
- With interactions (or polynomials), you need to consider *all the related coefficients at once*, not on their own
## Interactions
Interpret!
```{r, echo = FALSE}
m5 <- lm(inspection_score ~ NumberofLocations*Weekend, data = df)
msummary(m5, stars = TRUE,
gof_omit = 'AIC|BIC|Lik')
```
## Practice
- Install (if you haven't) and load the packages **tidyverse, vtable, modelsummary, sandwich**
- Load the `storms` data set with `data(storms)`
- Use `sumtable` to look at the data. Pick a few numbers you see and use that number in a descriptive sentence (i.e. "the mean of the distribution of... is...")
## Practice
- Use `ggplot(storms, aes(x = SOMETHING))` `+ geom_density()` to look at the distribution of a numeric variable
- Use `ggplot(storms, aes(x = SOMETHING))` `+ geom_bar()` to look at the distribution of a categorical variable
- Use `mutate()` to create a variable `Hurricane = status == 'hurricane'`
- Then...
## Practice
- Run a regression of `pressure` on `year` and `Hurricane`, and then another with the interaction between them
- Use `msummary()` to show both regressions on a table, with heteroskedasticity-robust SEs
- Interpret each coefficient in the table in a sentence (including one of the intercepts!)
## Practice Answers (for the code parts)
```{r, echo = TRUE, eval = FALSE}
library(tidyverse); library(vtable); library(modelsummary); library(sandwich)
data(storms)
# st is short for sumtable, works too!
st(storms)
ggplot(storms, aes(x = wind)) + geom_density() + ggpubr::theme_pubr()
ggplot(storms, aes(x = status)) + geom_bar() + ggpubr::theme_pubr()
storms <- storms %>% mutate(Hurricane = status == 'hurricane')
```
## Practice Answers (for the code parts)
```{r, echo = TRUE, eval = FALSE}
m1 <- lm(pressure ~ year + Hurricane, data = storms)
m2 <- lm(pressure ~ year*Hurricane, data = storms)
msummary(list(m1, m2), stars = TRUE, statistic_override = vcovHC)
```
## Practice Answers
```{r}
library(tidyverse); library(vtable); library(modelsummary); library(sandwich)
data(storms)
# st is short for sumtable, works too!
storms <- storms %>% mutate(Hurricane = status == 'hurricane')
m1 <- lm(pressure ~ year + Hurricane, data = storms)
m2 <- lm(pressure ~ year*Hurricane, data = storms)
msummary(list(m1, m2), stars = TRUE, statistic_override = vcovHC,
gof_omit = 'AIC|BIC|Lik')
```
## Practice Answers (interpretation!)
Model 1:
- If it's not a hurricane and it's in year 0, we predict pressure of 1294.063
- Hurricanes have pressure 33.176 lower than non-hurricanes
- Pressure drops by .146 each year
Model 2:
- Among non-hurricanes, pressure drops by .09 each year
- In year 0, hurricanes have pressure 356.178 higher than non-hurricanes
- Pressure drops by .195 *more* every year for hurricanes than non-hurricanes