---
title: "Day 5"
sidebar:
  style: "floating"
  search: true
  contents: auto
---
Outline of topics:

- Course Core Concepts Script
- Reproducibility and Robustness
- Visualizing and Reporting on Regression
- QR Code Activity
- Data Linkage Methods
- Onyike et al. Article
::: {.callout-tip collapse='true'}
### Course Core Concepts Script
Find this script online [here](https://gist.github.com/ctesta01/9a7fe6b1f8eeb7ce4f3f324cb53d3c0b)
or written out below:
```r
# know how to install packages:
# install.packages("tidyverse")
# set up a project so we could use the {here} package
# dependencies ------------------------------------------------------------
library(tidyverse)
library(broom)
library(here)
library(palmerpenguins)
library(gtsummary)
# read in data ------------------------------------------------------------
# we could use a csv dataset like this:
# df <- readr::read_csv(here("data.csv"))
# or use an example dataset like penguins from palmerpenguins:
# use View to look at it in RStudio
View(penguins)
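# or, for a compact text overview that also works outside RStudio:
# dplyr::glimpse(penguins)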
# data manipulation -------------------------------------------------------
# use group_by and summarize together to create summary statistics per-group
penguins_summarized <- penguins |>
  group_by(species) |>
  summarize(
    mean_flipper_length_mm = mean(flipper_length_mm, na.rm = TRUE))
# know how to use mutate to update columns (either creating new ones or updating
# existing ones):
# here, we'll just convert species to a character vector just for an example
# so then we can next practice making it a factor:
penguins <- penguins |>
  mutate(species = as.character(species))
# convert a variable to a factor:
#
# method 1: base R
# here, the levels default to the sorted unique values of penguins$species:
penguins$species <- factor(penguins$species)
#
# method 2: dplyr
penguins <- penguins |>
  mutate(species = factor(species))
# if I wanted to change the reference category, I could use relevel:
penguins$species <- relevel(penguins$species, 'Chinstrap')
# or the dplyr way:
penguins <- penguins |>
  mutate(species = relevel(species, 'Chinstrap'))
# you can also use forcats::fct_relevel
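# for example, a sketch equivalent to the relevel calls above:
# penguins <- penguins |>
#   mutate(species = forcats::fct_relevel(species, 'Chinstrap'))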
# data visualization ------------------------------------------------------
# use ggplot2 to make some graphics
# a scatter plot:
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point() +
  ggtitle("Penguin Bill Lengths and Depths by Species")
# use ggsave to save your work
ggsave(here("output/penguins_scatterplot.png"), width = 7, height = 5)
# a histogram with facets:
ggplot(penguins, aes(x = flipper_length_mm)) +
  geom_histogram() +
  facet_wrap(~species) +
  ggtitle("Penguin Flipper Lengths by Species")
# again use ggsave and here() to save it within your project
ggsave(here("output/penguins_faceted_histogram.png"), width = 8, height = 3)
# you might also want to quickly plot regression lines in ggplot, so
# use geom_smooth:
ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  ggtitle("Linear regression of flipper length on body mass")
# analyze a model -------------------------------------------------------------
model <- lm(flipper_length_mm ~ body_mass_g + species, penguins)
# use broom::tidy to extract the coefficients and their statistics
model_output <- broom::tidy(model, conf.int = TRUE)
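# for an lm fit, the tidied data frame has the columns term, estimate,
# std.error, statistic, p.value, and (because conf.int = TRUE) conf.low
# and conf.high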
# visualize model results
model_output |>
  filter(term != '(Intercept)') |>
  ggplot(aes(x = estimate, y = term, xmin = conf.low, xmax = conf.high)) +
  geom_pointrange()
# create a table of the results
gtsummary::tbl_regression(model)
# one example with multiple models --------------------------------------------
model1 <- lm(flipper_length_mm ~ species, penguins)
model2 <- lm(flipper_length_mm ~ species + body_mass_g, penguins)
model3 <- lm(flipper_length_mm ~ species + body_mass_g + island, penguins)
# extract tables of results
model_results <- list(
  bind_cols(model = 'model1', broom::tidy(model1, conf.int = TRUE)),
  bind_cols(model = 'model2', broom::tidy(model2, conf.int = TRUE)),
  bind_cols(model = 'model3', broom::tidy(model3, conf.int = TRUE)))
# make into one data frame
model_results <- bind_rows(model_results)
# create a plot of covariates from multiple models
model_results |>
  filter(term %in% c('speciesGentoo', 'speciesAdelie')) |>
  ggplot(
    aes(x = estimate,
        y = term,
        xmin = conf.low,
        xmax = conf.high,
        color = model,
        shape = model)) +
  geom_pointrange(position = position_dodge(width = 0.5)) +
  ggtitle("Coefficient estimates for species effect",
    stringr::str_wrap(
      paste(
        "Model 1 includes no other covariates, model 2 includes body mass,",
        "and model 3 includes body mass and island effects"
      )
    ))
```
:::
::: {.callout-tip collapse='true'}
### Lecture 1: Reproducibility and Robustness
<iframe src="https://id529.github.io/slides-day5/lecture1-reproducibility-and-robustness/reproducibility.html" width="100%" height="500px"></iframe>
:::
<a href="https://id529.github.io/slides-day5/lecture1-reproducibility-and-robustness/reproducibility.html" >link to view slides fullscreen</a>
<br>
<!--
::: {.callout-tip collapse='true'}
### Lecture 2: Data Dictionaries and Documentation
<iframe src="https://id529.github.io/lectures/day2/lecture2-data-dictionaries-and-documentation/data-documentation.html#/data-dictionaries" width="100%" height="500px"></iframe>
:::
<a href="https://id529.github.io/lectures/day2/lecture2-data-dictionaries-and-documentation/data-documentation.html#/data-dictionaries" >link to view slides fullscreen</a>
<br>
<a href="https://drive.google.com/file/d/1tlhk9Ach86d4cwkCzzlG1wfJdq3cByX2/view?usp=share_link" > link to slide PDFs </a>
<br>
:::
-->
::: {.callout-tip collapse='true'}
### Lecture 2: Regression (part 2!)
<iframe src="https://id529.github.io/slides-day5/lecture2-regression-reporting/id529_day5_regression_pretty_output.html" width="100%" height="500px"></iframe>
:::
<a href="https://id529.github.io/slides-day5/lecture2-regression-reporting/id529_day5_regression_pretty_output.html" >link to view slides fullscreen</a>
<br>
<a href="https://github.com/jarvischen01/id529-regression-models" >link to repository</a></iframe>
<br>
::: {.callout-tip collapse='true'}
### Lecture 3: Data linkage methods
<iframe src="https://id529.github.io/slides-day5/lecture3-data-linkage-methods/data-linkage-methods.html#/introduction-to-data-linkage" width="100%" height="500px"></iframe>
:::
<a href="https://id529.github.io/slides-day5/lecture3-data-linkage-methods/data-linkage-methods.html#/introduction-to-data-linkage" >link to view slides fullscreen</iframe>
<br>
<a href="https://github.com/ID529/lectures2024/blob/master/day5/lecture3-data-linkage-methods/nhanes-activity.R" >link to follow along code</a>
::: {.callout-tip collapse='true'}
### Brown et al. Response and Reproduction of Onyike et al.
**Preface**

In 2021 Brown et al. published an article titled *A Reproduction of the Results
of Onyike et al. (2003)* in Meta-Psychology, a journal that is free, open access,
and conducts open peer review. The Onyike et al. article they reproduce,
*Is obesity associated with major depression? Results from the Third National Health and Nutrition Examination Survey*,
was published in the American Journal of Epidemiology and has been cited 1159+
times according to Google Scholar.

I want to point out an interesting section from Meta-Psychology's About page
that you might keep in mind as you read on:
> Prior to publication, all statistical analyses are reproduced by our
statistical reproduction team, which consists of the Statistical Editor and our
editorial assistant. This makes the article eligible for the reproducibility
badge.
**Recommended Reading**

Please read the article by Brown et al. (<https://open.lnu.se/index.php/metapsychology/article/view/2071>).
The abstract is repeated here as a teaser:
> Onyike et al. (2003) analyzed data from a large-scale US-American data set,
the Third National Health and Nutrition Examination Survey (NHANES-III), and
reported an association between obesity and major depression, especially among
people with severe obesity. Here, we report the results of a detailed
replication of Onyike et al.’s analyses. While we were able to reproduce the
majority of these authors’ descriptive statistics, this took a substantial
amount of time and effort, and we found several minor errors in the univariate
descriptive statistics reported in their Tables 1 and 2. We were able to
reproduce most of Onyike et al.’s bivariate findings regarding the relationship
between obesity and depression (Tables 3 and 4), albeit with some small
discrepancies (e.g., with respect to the magnitudes of standard errors). On the
other hand, we were unable to reproduce Table 5, containing Onyike et al.’s
findings with respect to the relationship between obesity and depression when
controlling for plausible confounding variables—arguably the paper’s most
important results—because some of the included predictor variables appear to be
either unavailable, or not coded in the way reported by Onyike et al., in the
public NHANES-III data sets. We discuss the implications of our findings for the
transparency of reporting and the reproducibility of published results.
Their code is freely and publicly accessible on the
[Open Science Framework](https://osf.io/) (OSF), a file storage service from
the [Center for Open Science](https://www.cos.io/).

- Brown et al.'s code+file repository: https://osf.io/j32yw/
- Download their code+files (direct link): https://files.osf.io/v1/resources/j32yw/providers/osfstorage/?zip=
Note that in order to run their code, you will want to either a) make a new
R project in the folder with their code on your computer, or b) open a new
RStudio window, open up their .R file, and use `setwd('filepath/goes/here/')`
so that your R session's working directory matches the folder with their code.
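A minimal sketch of option (b), assuming you have downloaded and unzipped the
OSF files; the folder and file names here are hypothetical placeholders:

```r
# point the R session at the folder holding Brown et al.'s files
# (hypothetical path; replace with the actual location on your computer):
setwd("~/Downloads/brown-et-al-osf")

# confirm their scripts are visible from the new working directory:
list.files(pattern = "\\.R$")

# then run one of their scripts by name, e.g.:
# source("analysis.R")  # hypothetical file name
```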
Consider the following questions:
* Do you believe that the results Brown et al. have shared are more likely to
  be correct than those that Onyike et al. published? If so, why? If not, why
  not?
* What do you find compelling about their re-analysis and code?
* What do you find lacking about their re-analysis and code?
* How do you think the non-reproducibility of Onyike et al.'s article could
  have been avoided?
* When, if at all, do you think articles should be required to share code and
data?
* What about in situations where the data are private or sensitive?
* What about in situations where the subject matter is highly politically charged
and there might be malicious actors who could see shared data and code as
additional surface area to attack?
* Has reading this article made you more skeptical of research publications
that don't share code?
* For articles where the code & data are too sensitive to be shared, do you
  think there is an alternative process that would make you similarly confident
  in the stated results?
**An important aside**

This isn't a class about stigma and health, but since this is a Population
Health Science program, I think it's important to leave breadcrumbs here for
you to do your own follow-up reading and learning.

Because the articles in this homework discuss body weight and health, I want to
emphatically point out that this subject matter is not at all cut and dried. It's
important to acknowledge that:
* Weight-based stigma is real and causes harm to health through
  multiple mechanisms, including discrimination and health care
  practitioners' attitudes and behaviors:
* https://ajph.aphapublications.org/doi/full/10.2105/AJPH.2009.159491
* https://onlinelibrary.wiley.com/doi/full/10.1111/obr.12266
* The decision by health organizations to classify obesity as a "disease" is debated:
* https://www.healthline.com/health/is-obesity-a-disease
* The language and terminology that we use can perpetuate stigma:
* https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5051141/?report=classic
* https://news.yale.edu/2012/07/12/choosing-words-wisely-when-talking-patients-about-their-weight
If anything, what I hope you take away from this aside is that data do not speak
for themselves, but rather are subject to interpretation and leave room for
either the perpetuation or casting aside of pre-existing biases (See "Data Never Speak for Themselves"
from Nancy Krieger's article [*Structural Racism, Health Inequities, and the Two-Edged Sword of Data: Structural Problems Require Structural Solutions*](https://www.frontiersin.org/articles/10.3389/fpubh.2021.655447/full)). It's not
enough to engage with open-science practices and leverage sophisticated
statistical analyses made possible in programs like R; instead, it's necessary to
combine advances in the state of the art in computing with advances in our
conceptual frameworks to do science that can truly shift narratives in ways that
benefit marginalized groups.
:::
::: {.callout-tip collapse='true'}
## Video Recording from 2025
<iframe width="100%" height="500" src="https://www.youtube.com/embed/cpAMPe9sKVE?si=wV04S1Xyb0D4GtfH" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
:::
::: {.callout-tip collapse='true'}
## Video Recording from 2024
<iframe width="100%" height="500" src="https://www.youtube.com/embed/a1b4uWT-J7M?si=SVV2fZW0beZrjb9K" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
:::
::: {.callout-tip collapse='true'}
## Video Recording from 2023
<iframe width="100%" height="500" src="https://www.youtube.com/embed/CUxrO5VVvgM" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
:::