-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathanalysis-gps.Rmd
322 lines (259 loc) · 10.8 KB
/
analysis-gps.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
---
title: "Base Ranker: Gamma Poisson Shrinker (GPS)"
author:
- name: Nan Xiao
url: https://nanx.me/
affiliation: Seven Bridges
affiliation_url: https://www.sevenbridges.com/
- name: Soner Koc
url: https://github.com/skoc
affiliation: Seven Bridges
affiliation_url: https://www.sevenbridges.com/
- name: Kaushik Ghose
url: https://kaushikghose.wordpress.com/
affiliation: Seven Bridges
affiliation_url: https://www.sevenbridges.com/
date: "`r Sys.Date()`"
output:
distill::distill_article:
toc: yes
bibliography: rankv.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval = TRUE, cache = TRUE)
```
# Data Model
Let's denote the vaccine-symptom pairs from the VAERS database as $2 \times 2$ contingency tables:
| Target vaccine | Target symptom | All other symptoms | Total |
| :------------- | :------------- | :----------------------- | :-------- |
| Yes | $n_{ij}$ | $n_i - n_{ij}$ | $n_i$ |
| No | $n_j - n_{ij}$ | $n - n_i - n_j + n_{ij}$ | $n - n_i$ |
| Total | $n_j$ | $n - n_j$ | $n$ |
It is often the case than not that $n_{ij}$ occures as a rare event such that the observed count follows a Poisson distribution. In the GPS algorithm, we model the relative reporting ratio (RRR), which measures the ratio of how many symptoms under vaccine were actually observed over the number of expected symptoms, under the assumption that the symptoms and vaccine exposure are independent. The RRR ($\lambda$) is defined as
$$
\lambda_{ij} = \frac{\mu_{ij}}{E_{ij}}.
$$
$\mu_{ij}$ is the mean of the Poisson distribution of $n_{ij}$. $E_{ij}$ is the expected event count under the assumption that the vaccine exposure and the symptom are idependent. It can be estimated as
$$
\hat{E_{ij}} = \frac{n_i n_j}{n}.
$$
For $\lambda$, we can use a gamma distribution as a conjugate prior since the Poisson distributed $n_{ij}$. ($n_{ij} \sim Poisson (\mu_{ij})$). In the GPS model, a simple mixture of two Gamma distributions was used as the prior to allow for more flexibility in the model:
$$
\lambda_{ij} \sim P \ \text{Gamma}(\alpha_1, \beta_1) + (1-P) \ \text{Gamma}(\alpha_2, \beta_2).
$$
Using an empirical Bayesian approach, we can use the data to estimate the paramaters ($\alpha_1$, $\alpha_2$, $\beta_1$, $\beta_2$, $P$), calculate the posterior for $\lambda$, and then derive the expectation of $\log(\lambda)$:
$$
E(\log(\lambda)) = Q (\psi (\alpha_1 + n_{ij}) - \log(1/\beta_1 + E_{ij})) + (1-Q) (\psi (\alpha_2 + n_{ij}) - \log(1/\beta_2) + E_{ij})
$$
where
$$
Q = \frac{P \cdot \text{NB}(n_{ij}, E_{ij} \ | \ \alpha_1, \beta_1)}{P \cdot \text{NB}(n_{ij}, E_{ij} \ | \ \alpha_1, \beta_1) + (1-P) \cdot \text{NB}(n_{ij}, E_{ij} \ | \ \alpha_2, \beta_2)}
$$
$\psi$ is the digamma function. $\text{NB}$ is the negative binomial distribution.
With the expectation of $\log(\lambda)$, we can define the Empirical Bayesian Geometric Mean (EBGM) as
$$EBGM = e^{E(\log(\lambda))}.$$
The fifth percentile of the posterior distribution of $\lambda$ is denoted as EB05 and interpreted as the lower one-sided 95% confidence limit for the EBGM. This estimator gives more conservative estimates when the event counts are small, thus yields a shrinkage estimate which could reduce the number of false-positive signals.
# Calculate PRR
To fit a Gamma Poisson Shrinker model [@dumouchel1999], we first calculate
actual counts (N), expected counts (E) of each vaccine-symptom combination,
relative reporting ratio (RRR), and proportional reporting ratio (PRR)
using age + gender stratification to control potential confounding effects:
```{r}
library("openEBGM")
library("magrittr")
library("kableExtra")
library("ggsci")
df_p <- readRDS("data-processed/df.rds") %>%
processRaw(stratify = TRUE, zeroes = FALSE)
```
```
# stratification variables used: strat_age, strat_gender
# there were 12 strata: <= 2-Female, <= 2-Male, <= 2-Unknown, > 65-Female, > 65-Male, > 65-Unknown, 18 - 65-Female, 18 - 65-Male, 18 - 65-Unknown, 2 - 18-Female, 2 - 18-Male, 2 - 18-Unknown
```
Sanity check:
```{r}
df_p %>% head()
```
```
# var1 var2 N E RR PRR
# 1 ADENOVIRUS (NO BRAND NAME) Arthralgia 1 0.119480157 8.37 16.26
# 2 ADENOVIRUS (NO BRAND NAME) Dyspnoea 1 0.083668035 11.95 17.43
# 3 ADENOVIRUS (NO BRAND NAME) Face oedema 1 0.011163701 89.58 86.42
# 4 ADENOVIRUS (NO BRAND NAME) Gait disturbance 1 0.014375342 69.56 67.12
# 5 ADENOVIRUS (NO BRAND NAME) Osteoarthritis 1 0.005387014 185.63 296.72
# 6 ADENOVIRUS (NO BRAND NAME) Petechiae 1 0.002717102 368.04 212.56
```
Save processed data to file:
```{r}
df_p %>% saveRDS(file = "data-processed/df_p.rds")
```
# Fit Prior
```{r}
df_p <- readRDS("data-processed/df_p.rds")
squashed <- squashData(df_p, bin_size = 100, keep_pts = 0)
squashed <- squashData(squashed, count = 2, bin_size = 12)
```
```{r}
library("foreach")
library("doParallel")
registerDoParallel(parallel::detectCores())
suppressMessages(library("DEoptim"))
set.seed(2020)
theta_hat <- DEoptim(
negLLsquash,
lower = c(rep(1e-05, 4), 0.001),
upper = c(rep(5, 4), 1 - 0.001),
control = DEoptim.control(
itermax = 500,
reltol = 1e-04,
steptol = 200,
NP = 100,
CR = 0.85,
F = 0.75,
trace = 25,
parallelType = 2
),
ni = squashed$N, ei = squashed$E, wi = squashed$weight
)
```
```
# Iteration: 25 bestvalit: 325555.202871 bestmemit: 0.014104 0.116785 1.749330 1.712070 0.101938
# Iteration: 50 bestvalit: 325354.375330 bestmemit: 0.002284 0.113023 2.013934 1.947101 0.094070
# Iteration: 75 bestvalit: 325347.784248 bestmemit: 0.000153 0.115025 2.093288 2.020985 0.097340
# Iteration: 100 bestvalit: 325347.575191 bestmemit: 0.000036 0.116657 2.103292 2.031426 0.099336
# Iteration: 125 bestvalit: 325347.567304 bestmemit: 0.000010 0.116819 2.103899 2.031816 0.099493
# Iteration: 150 bestvalit: 325347.567208 bestmemit: 0.000010 0.116802 2.103769 2.031757 0.099480
# Iteration: 175 bestvalit: 325347.567205 bestmemit: 0.000010 0.116798 2.103761 2.031760 0.099474
# Iteration: 200 bestvalit: 325347.567205 bestmemit: 0.000010 0.116799 2.103766 2.031765 0.099474
# Iteration: 225 bestvalit: 325347.567205 bestmemit: 0.000010 0.116799 2.103766 2.031764 0.099475
```
Check $\hat{\theta}$:
```{r}
theta_hat <- as.numeric(theta_hat$optim$bestmem)
theta_hat
```
```
# 1.000001e-05 1.167989e-01 2.103766e+00 2.031764e+00 9.947457e-02
```
Save to file:
```{r}
saveRDS(theta_hat, file = "data-processed/theta_hat.rds")
```
# Infer Empirical Bayes Scores
```{r}
df_p <- readRDS("data-processed/df_p.rds")
squashed <- squashData(df_p, bin_size = 100, keep_pts = 0)
squashed <- squashData(squashed, count = 2, bin_size = 12)
theta_hat <- readRDS("data-processed/theta_hat.rds")
```
```{r}
qn <- Qn(theta_hat, N = df_p$N, E = df_p$E)
head(qn)
identical(length(qn), nrow(df_p))
summary(qn)
```
```{r}
df_p$ebgm <- ebgm(theta_hat, N = df_p$N, E = df_p$E, qn = qn)
head(df_p) %>% kable() %>% kable_styling()
```
```{r}
df_p$QUANT_05 <- quantBisect(
5,
theta_hat = theta_hat,
N = df_p$N, E = df_p$E, qn = qn
)
df_p$QUANT_95 <- quantBisect(
95,
theta_hat = theta_hat,
N = df_p$N, E = df_p$E, qn = qn
)
head(df_p) %>% kable() %>% kable_styling()
```
Suspicious pair with EBGM05 > 5, which are clearly reporting signals:
```{r}
suspicious <- df_p[df_p$QUANT_05 >= 5, ]
nrow(suspicious)
nrow(df_p)
nrow(suspicious) / nrow(df_p)
suspicious <- suspicious[order(suspicious$QUANT_05, decreasing = TRUE), c("var1", "var2", "N", "E", "QUANT_05", "ebgm", "QUANT_95")]
head(suspicious, 5) %>% kable() %>% kable_styling()
```
```{r}
tabbed <- table(suspicious$var1)
head(tabbed[order(tabbed, decreasing = TRUE)]) %>% kable() %>% kable_styling()
```
```{r,eval=FALSE}
txt <- paste0(
1:nrow(suspicious),
". After controlling for the confounding effects from age and gender, the reporting of the adverse reaction ",
suspicious$var2,
" for vaccine ",
suspicious$var1,
" is disproportionately high compared to this same event for all other vaccines, with ",
suspicious$N,
" total reports and an EBGM05 score ",
suspicious$QUANT_05,
"."
)
```
```{r,eval=FALSE}
txt[1:5]
```
```
# [1] 1. After controlling for the confounding effects from age and gender,
the reporting of the adverse reaction Product use issue for vaccine
INFLUENZA (SEASONAL) (FLUCELVAX) is disproportionately high compared
to this same event for all other vaccines, with 57 total reports and
an EBGM05 score 115.19.
# [2] 2. After controlling for the confounding effects from age and gender,
the reporting of the adverse reaction Gastrointestinal haemorrhage for
vaccine ROTAVIRUS (ROTASHIELD) is disproportionately high compared to
this same event for all other vaccines, with 94 total reports and
an EBGM05 score 94.7.
# [3] 3. After controlling for the confounding effects from age and gender,
the reporting of the adverse reaction Product administered to patient
of inappropriate age for vaccine INFLUENZA (SEASONAL) (FLUBLOK QUADRIVALENT)
is disproportionately high compared to this same event for all other
vaccines, with 141 total reports and an EBGM05 score 74.88.
# [4] 4. After controlling for the confounding effects from age and gender,
the reporting of the adverse reaction Product use issue for vaccine HPV
(CERVARIX) is disproportionately high compared to this same event for all
other vaccines, with 22 total reports and an EBGM05 score 70.29.
# [5] 5. After controlling for the confounding effects from age and gender,
the reporting of the adverse reaction Drug administered to patient of
inappropriate age for vaccine INFLUENZA (SEASONAL) (FLUCELVAX) is
disproportionately high compared to this same event for all other
vaccines, with 350 total reports and an EBGM05 score 53.27.
```
# Sanity Check Empirical Bayes Scores
```{r}
df_p <- readRDS("data-processed/df_p.rds")
squashed <- squashData(df_p, bin_size = 100, keep_pts = 0)
squashed <- squashData(squashed, count = 2, bin_size = 12)
theta_hat <- readRDS("data-processed/theta_hat.rds")
```
```{r}
est <- list("estimates" = theta_hat)
names(est$estimates) <- c("alpha1", "beta1", "alpha2", "beta2", "P")
# for the 5th and 95th percentiles
ebout <- ebScores(
df_p,
hyper_estimate = est,
quantiles = c(5, 95)
)
print(ebout, threshold = 10)
```
The global overview might not be too helpful:
```{r}
plot(ebout, plot.type = "histogram")
plot(ebout, plot.type = "shrinkage") + scale_color_nejm()
```
Let's focus on a specific type of adverse event (vomiting):
```{r}
plot(ebout, event = "Vomiting") + scale_fill_nejm()
plot(ebout, plot.type = "histogram", event = "Vomiting")
plot(ebout, plot.type = "shrinkage", event = "Vomiting") + scale_color_nejm()
```
```{r,echo=FALSE}
saveRDS(suspicious, file = "data-processed/df_gps.rds")
```