---
title: "Day 5"
sidebar:
  style: "floating"
  search: true
  contents: auto
---
Outline of topics:

- Course Core Concepts Script
- Reproducibility and Robustness
- Visualizing and Reporting on Regression
- QR Code Activity
- Data Linkage Methods
- Onyike et al. Article
::: {.callout-tip collapse='true'}
### Course Core Concepts Script
Find this script online [here](https://gist.github.com/ctesta01/9a7fe6b1f8eeb7ce4f3f324cb53d3c0b)
or written out below:
```r
# know how to install packages:
# install.packages("tidyverse")
# set up a project so we could use the {here} package
# dependencies ------------------------------------------------------------
library(tidyverse)
library(broom)
library(here)
library(palmerpenguins)
library(gtsummary)
# read in data ------------------------------------------------------------
# we could use a csv dataset like this:
# df <- readr::read_csv(here("data.csv"))
# or use an example dataset like penguins from palmerpenguins:
# use View to look at it in RStudio
View(penguins)
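# or, for a compact text overview that also works outside RStudio:
# dplyr::glimpse(penguins)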
# data manipulation -------------------------------------------------------
# use group_by and summarize together to create summary statistics per-group
penguins_summarized <- penguins |>
  group_by(species) |>
  summarize(
    mean_flipper_length_mm = mean(flipper_length_mm, na.rm = TRUE))
# know how to use mutate to update columns (either creating new ones or updating
# existing ones):
# here, we'll just convert species to a character vector just for an example
# so then we can next practice making it a factor:
penguins <- penguins |>
  mutate(species = as.character(species))
# convert a variable to a factor:
#
# method 1: base R
# here, the levels default to the sorted unique values of penguins$species:
penguins$species <- factor(penguins$species)
#
# method 2: dplyr
penguins <- penguins |>
  mutate(species = factor(species))
# if I wanted to change the reference category, I could use relevel:
penguins$species <- relevel(penguins$species, 'Chinstrap')
# or the dplyr way:
penguins <- penguins |>
  mutate(species = relevel(species, 'Chinstrap'))
# you can also use forcats::fct_relevel
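# for example, a sketch equivalent to the relevel calls above:
# penguins <- penguins |>
#   mutate(species = forcats::fct_relevel(species, 'Chinstrap'))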
# data visualization ------------------------------------------------------
# use ggplot2 to make some graphics
# a scatter plot:
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point() +
  ggtitle("Penguin Bill Lengths and Depths by Species")
# use ggsave to save your work
ggsave(here("output/penguins_scatterplot.png"), width = 7, height = 5)
# a histogram with facets:
ggplot(penguins, aes(x = flipper_length_mm)) +
  geom_histogram() +
  facet_wrap(~species) +
  ggtitle("Penguin Flipper Lengths by Species")
# again use ggsave and here() to save it within your project
ggsave(here("output/penguins_faceted_histogram.png"), width = 8, height = 3)
# you might also want to quickly plot regression lines in ggplot, so
# use geom_smooth:
ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  ggtitle("Linear regression of flipper length on body mass")
# analyze a model -------------------------------------------------------------
model <- lm(flipper_length_mm ~ body_mass_g + species, penguins)
# use broom::tidy to extract the coefficients and their statistics
model_output <- broom::tidy(model, conf.int = TRUE)
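# for an lm fit, the tidied data frame has the columns term, estimate,
# std.error, statistic, p.value, and (because conf.int = TRUE) conf.low
# and conf.high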
# visualize model results
model_output |>
  filter(term != '(Intercept)') |>
  ggplot(aes(x = estimate, y = term, xmin = conf.low, xmax = conf.high)) +
  geom_pointrange()
# create a table of the results
gtsummary::tbl_regression(model)
# one example with multiple models --------------------------------------------
model1 <- lm(flipper_length_mm ~ species, penguins)
model2 <- lm(flipper_length_mm ~ species + body_mass_g, penguins)
model3 <- lm(flipper_length_mm ~ species + body_mass_g + island, penguins)
# extract tables of results
model_results <- list(
  bind_cols(model = 'model1', broom::tidy(model1, conf.int = TRUE)),
  bind_cols(model = 'model2', broom::tidy(model2, conf.int = TRUE)),
  bind_cols(model = 'model3', broom::tidy(model3, conf.int = TRUE)))
# make into one data frame
model_results <- bind_rows(model_results)
# create a plot of covariates from multiple models
model_results |>
  filter(term %in% c('speciesGentoo', 'speciesAdelie')) |>
  ggplot(
    aes(x = estimate,
        y = term,
        xmin = conf.low,
        xmax = conf.high,
        color = model,
        shape = model)) +
  geom_pointrange(position = position_dodge(width = 0.5)) +
  ggtitle("Coefficient estimates for species effect",
    stringr::str_wrap(
      paste(
        "Model 1 includes no other covariates, model 2 includes body mass,",
        "and model 3 includes body mass and island effects"
      )
    ))
```
:::
::: {.callout-tip collapse='true'}
### Lecture 1: Reproducibility and Robustness
<iframe src="https://id529.github.io/slides-day5/lecture1-reproducibility-and-robustness/reproducibility.html" width="100%" height="500px"></iframe>
:::
<a href="https://id529.github.io/slides-day5/lecture1-reproducibility-and-robustness/reproducibility.html" >link to view slides fullscreen</a>
<br>
<!--
::: {.callout-tip collapse='true'}
### Lecture 2: Data Dictionaries and Documentation
<iframe src="https://id529.github.io/lectures/day2/lecture2-data-dictionaries-and-documentation/data-documentation.html#/data-dictionaries" width="100%" height="500px"></iframe>
:::
<a href="https://id529.github.io/lectures/day2/lecture2-data-dictionaries-and-documentation/data-documentation.html#/data-dictionaries" >link to view slides fullscreen</a>
<br>
<a href="https://drive.google.com/file/d/1tlhk9Ach86d4cwkCzzlG1wfJdq3cByX2/view?usp=share_link" > link to slide PDFs </a>
<br>
:::
-->
::: {.callout-tip collapse='true'}
### Lecture 2: Regression (part 2!)
<iframe src="https://id529.github.io/slides-day5/lecture2-regression-reporting/id529_day5_regression_pretty_output.html" width="100%" height="500px"></iframe>
:::
<a href="https://id529.github.io/slides-day5/lecture2-regression-reporting/id529_day5_regression_pretty_output.html" >link to view slides fullscreen</a>
<br>
<a href="https://github.com/jarvischen01/id529-regression-models" >link to repository</a></iframe>
<br>
::: {.callout-tip collapse='true'}
### Lecture 3: Data linkage methods
<iframe src="https://id529.github.io/slides-day5/lecture3-data-linkage-methods/data-linkage-methods.html#/introduction-to-data-linkage" width="100%" height="500px"></iframe>
:::
<a href="https://id529.github.io/slides-day5/lecture3-data-linkage-methods/data-linkage-methods.html#/introduction-to-data-linkage" >link to view slides fullscreen</iframe>
<br>
<a href="https://github.com/ID529/lectures2024/blob/master/day5/lecture3-data-linkage-methods/nhanes-activity.R" >link to follow along code</a>
::: {.callout-tip collapse='true'}
### Brown et al. Response and Reproduction of Onyike et al.
**Preface**

In 2021 Brown et al. published an article titled *A Reproduction of the Results
of Onyike et al. (2003)* in Meta-Psychology, a journal that is free, open access,
and conducts open peer review. The Onyike et al. article they reproduce,
*Is obesity associated with major depression? Results from the Third National Health and Nutrition Examination Survey*,
was published in the American Journal of Epidemiology and has been cited 1159+
times according to Google Scholar.

I want to point out an interesting section from Meta-Psychology's About page
that you might keep in mind as you read on:
> Prior to publication, all statistical analyses are reproduced by our
statistical reproduction team, which consists of the Statistical Editor and our
editorial assistant. This makes the article eligible for the reproducibility
badge.
**Recommended Reading**

Please read the article by Brown et al. (<https://open.lnu.se/index.php/metapsychology/article/view/2071>).
The abstract is repeated here as a teaser:
> Onyike et al. (2003) analyzed data from a large-scale US-American data set,
the Third National Health and Nutrition Examination Survey (NHANES-III), and
reported an association between obesity and major depression, especially among
people with severe obesity. Here, we report the results of a detailed
replication of Onyike et al.’s analyses. While we were able to reproduce the
majority of these authors’ descriptive statistics, this took a substantial
amount of time and effort, and we found several minor errors in the univariate
descriptive statistics reported in their Tables 1 and 2. We were able to
reproduce most of Onyike et al.’s bivariate findings regarding the relationship
between obesity and depression (Tables 3 and 4), albeit with some small
discrepancies (e.g., with respect to the magnitudes of standard errors). On the
other hand, we were unable to reproduce Table 5, containing Onyike et al.’s
findings with respect to the relationship between obesity and depression when
controlling for plausible confounding variables—arguably the paper’s most
important results—because some of the included predictor variables appear to be
either unavailable, or not coded in the way reported by Onyike et al., in the
public NHANES-III data sets. We discuss the implications of our findings for the
transparency of reporting and the reproducibility of published results.
Their code is freely and publicly accessible on the
[Open Science Framework](https://osf.io/) (OSF), a file storage service from
the [Center for Open Science](https://www.cos.io/).

- Brown et al.'s code+file repository: https://osf.io/j32yw/
- Download their code+files (direct link): https://files.osf.io/v1/resources/j32yw/providers/osfstorage/?zip=
Note that in order to run their code, you will want to either a) make a new
R project in the folder with their code on your computer, or b) open a new
RStudio window, open up their .R file, and use `setwd('filepath/goes/here/')`
so that your R session's working directory matches the folder with their code.
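A minimal sketch of option (b), assuming you have downloaded and unzipped the
OSF files; the folder and file names here are hypothetical placeholders:

```r
# point the R session at the folder holding Brown et al.'s files
# (hypothetical path; replace with the actual location on your computer):
setwd("~/Downloads/brown-et-al-osf")

# confirm their scripts are visible from the new working directory:
list.files(pattern = "\\.R$")

# then run one of their scripts by name, e.g.:
# source("analysis.R")  # hypothetical file name
```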
Consider the following questions:
* Do you believe that the results Brown et al. have shared are more likely to
  be correct than those that Onyike et al. published? If so, why? If not, why
  not?
* What do you find compelling about their re-analysis and code?
* What do you find lacking about their re-analysis and code?
* How do you think the non-reproducibility of Onyike et al.'s article could
  have been avoided?
* When, if at all, do you think articles should be required to share code and
data?
* What about in situations where the data are private or sensitive?
* What about in situations where the subject matter is highly politically charged
and there might be malicious actors who could see shared data and code as
additional surface area to attack?
* Has reading this article made you more skeptical of research publications
that don't share code?
* For articles where the code & data are too sensitive to be shared, do you
  think there is an alternative process that would make you similarly confident
  in the stated results?
**An important aside**

This isn't a class about stigma and health, but since this is a Population
Health Science program, I think it's important to leave breadcrumbs here for
you to do your own follow-up reading and learning.

Because the articles in this homework discuss body weight and health, I want to
emphatically point out that this subject matter is not at all cut and dried. It's
important to acknowledge that:
* Weight-based stigma is real and causes harm to health through
  multiple mechanisms, including discrimination and health care
  practitioners' attitudes and behaviors:
* https://ajph.aphapublications.org/doi/full/10.2105/AJPH.2009.159491
* https://onlinelibrary.wiley.com/doi/full/10.1111/obr.12266
* The decision by health organizations to classify obesity as a "disease" is debated:
* https://www.healthline.com/health/is-obesity-a-disease
* The language and terminology that we use can perpetuate stigma:
* https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5051141/?report=classic
* https://news.yale.edu/2012/07/12/choosing-words-wisely-when-talking-patients-about-their-weight
If anything, what I hope you take away from this aside is that data do not speak
for themselves, but rather are subject to interpretation and leave room for
either the perpetuation or casting aside of pre-existing biases (See "Data Never Speak for Themselves"
from Nancy Krieger's article [*Structural Racism, Health Inequities, and the Two-Edged Sword of Data: Structural Problems Require Structural Solutions*](https://www.frontiersin.org/articles/10.3389/fpubh.2021.655447/full)). It's not
enough to engage with open-science practices and leverage sophisticated
statistical analyses made possible in programs like R; instead, it's necessary to
combine advances in the state of the art in computing with advances in our
conceptual frameworks to do science that can truly shift narratives in ways that
benefit marginalized groups.
:::
::: {.callout-tip collapse='true'}
## Video Recording from 2025
<iframe width="100%" height="500" src="https://www.youtube.com/embed/cpAMPe9sKVE?si=wV04S1Xyb0D4GtfH" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
:::
::: {.callout-tip collapse='true'}
## Video Recording from 2024
<iframe width="100%" height="500" src="https://www.youtube.com/embed/a1b4uWT-J7M?si=SVV2fZW0beZrjb9K" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
:::
::: {.callout-tip collapse='true'}
## Video Recording from 2023
<iframe width="100%" height="500" src="https://www.youtube.com/embed/CUxrO5VVvgM" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
:::