-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathvignette.Rmd
233 lines (144 loc) · 7.59 KB
/
vignette.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
---
title: "Estimating Average Treatment Effect with ICDML and Bootstrap Confidence Intervals"
output:
html_document: default
pdf_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(data.table)
library(xgboost)
```
This vignette demonstrates the process of estimating the Average Treatment Effect (ATE) using Calibrated Debiased Machine Learning (C-DML). We'll start by generating synthetic data, fitting nuisance models using machine learning algorithms, and finally applying the C-DML estimation procedure with ensemble learning.
```{r}
# Check if CDML package is installed; if not, install it from GitHub
if(!require(CDML)) {
devtools::install_github("Larsvanderlaan/CDML")
library(CDML)
}
```
# Generate data
To start, we generate synthetic data for a causal inference problem. Our data consists of covariates \(X\), a binary treatment indicator \(A\), and an outcome \(Y\) influenced by both \(X\) and \(A\). This setup allows us to simulate a scenario where we estimate the causal effect of \(A\) on \(Y\).
```{r}
set.seed(123)
# Sample size
n <- 1000
# Generate covariates
X <- matrix(rnorm(n * 5), n, 5)
colnames(X) <- paste0("X", 1:5)
# Generate binary treatment indicator A based on a logistic model of X
pi1_true <- plogis(X %*% rep(0.3, 5)) # Linear predictor for logistic model
A <- rbinom(n, 1, pi1_true)
# True outcome model
Y <- as.vector(2 * A + X %*% rep(0.5, 5) + rnorm(n))
# Predicted outcome regressions (for demonstration, using true model)
mu1_true <- 2 + X %*% rep(0.5, 5) # Expected outcome if A = 1
mu0_true <- X %*% rep(0.5, 5) # Expected outcome if A = 0
print(mean(mu1_true - mu0_true))
```
## Fit initial nuisance models
Next, we fit initial models for nuisance parameters, which include the propensity score (probability of receiving treatment) and outcome regression models (predicted outcomes under treatment and control). We use the \texttt{sl3} package for flexible machine learning modeling with cross-fitting to reduce bias.
```{r}
if(!require(sl3)) {
devtools::install_github("tlverse/sl3", ref="devel")
library(sl3)
}
# Define the data as a sl3 task for propensity score
task_A <- sl3::make_sl3_Task(data = data.table(cbind(X, A = A)), covariates = colnames(X), outcome = "A", outcome_type = "binomial")
task_Y <- sl3::make_sl3_Task(data = data.table(X, Y = Y, A = A), covariates = c(colnames(X), "A"), outcome = "Y", outcome_type = "continuous")
# Define learners for propensity score and outcome regression
# Stack includes XGBoost and GAM learners
# lrnr_cv performs cross-fitting
lrnr_stack <- Stack$new(list(Lrnr_xgboost$new(max_depth = 4), Lrnr_gam$new()))
lrnr_cv <- Lrnr_cv$new(lrnr_stack)
lrnr_selector <- make_learner(Pipeline, lrnr_cv, Lrnr_cv_selector$new(loss_squared_error))
# Fit propensity score model (predicting A based on X)
propensity_fit <- lrnr_selector$train(task_A)
pi1 <- propensity_fit$predict(task_A) # Predictions for propensity scores
pi0 = 1 - pi1
# Define two outcome regression tasks: one with A set to 1, one with A set to 0
task_Y1 <- sl3::make_sl3_Task(data = data.table(X, Y = Y, A = 1), covariates = c(colnames(X), "A"), outcome = "Y", outcome_type = "continuous")
task_Y0 <- sl3::make_sl3_Task(data = data.table(X, Y = Y, A = 0), covariates = c(colnames(X), "A"), outcome = "Y", outcome_type = "continuous")
# Fit outcome regression model using original task
outcome_fit <- lrnr_selector$train(task_Y)
# Predict outcomes under treatment (A = 1) for all observations
mu1 <- outcome_fit$predict(task_Y1)
# Predict outcomes under control (A = 0) for all observations
mu0 <- outcome_fit$predict(task_Y0)
# Check some of the predictions
head(pi1)
head(mu1)
head(mu0)
```
## CDML with specified nuisance estimates
With the predicted nuisance estimates (propensity scores and outcome regressions), we now apply the CDML estimation procedure to estimate the causal effect of treatment on the outcome.
```{r}
output <- cdml(W=X,A,Y, pi_mat = cbind(pi1, pi0), mu_mat = cbind(mu1, mu0))
print(output)
```
# CDML with ensemble learning
The `cdml` function provides a user-friendly interface that integrates machine learning with the `sl3` ensemble learning framework. `sl3` is part of the **`tlverse`** ecosystem, providing a flexible framework for **Super Learner ensembles**—a method that optimally combines multiple algorithms to create robust predictive models. Leveraging cross-validation to evaluate and ensemble diverse learners, `sl3` is particularly suited for **Causal Doubly Robust Machine Learning (CDML)** applications. Its extensible architecture supports both built-in and custom learners, making it powerful for causal inference and predictive modeling. For more, see [tlverse.org/sl3](https://tlverse.org/sl3/).
```{r}
# default learners
output <- cdml(X,A,Y)
print(output)
# custom learners
library(sl3)
learners = list(
Lrnr_glmnet$new(),
Lrnr_earth$new(degree = 2),
Lrnr_gam$new(),
Lrnr_ranger$new(max.depth = 8),
Lrnr_xgboost_early_stopping$new(max_depth = 3),
Lrnr_xgboost_early_stopping$new(max_depth = 4),
Lrnr_xgboost_early_stopping$new(max_depth = 5)
)
output <- cdml(W=X,A,Y, learners_treatment = learners, learners_outcome = learners)
print(output)
```
# CDML with internal functions
## C-DML for ATE and bootstrap CIs
```{r}
## CDML Example
# Estimate the Average Treatment Effect (ATE) using ICDML estimator
# Inputs:
# - A: binary treatment indicator
# - Y: observed outcomes
# - mu1, mu0: predicted outcomes for treatment (A = 1) and control (A = 0)
# - pi1, pi0: propensity scores for treatment and control groups, respectively
estimate <- estimate_cdml_ate(A, Y, mu1, mu0, pi1, pi0, weights = NULL)
# Print the estimated causal effect
print(estimate)
# Estimate the ATE with a bootstrap confidence interval using ICDML
# The function provides both the point estimate and a confidence interval
bootstrap_result <- bootstrap_cdml_ate(A, Y, mu1, mu0, pi1, pi0, weights = NULL)
# Print the bootstrap results: includes ATE estimate and confidence interval
print(bootstrap_result)
```
# Isotonic calibration of nuisance functions for C-DML
To directly calibrate the inverse propensity score weights and outcome regression using isotonic regression, follow the steps below. This will allow you to construct isotonic calibrated nuisance estimators, which can be used in the C-DML framework to target various causal parameters.
The following code demonstrates how to calibrate the inverse propensity weights and outcome regression using isotonic regression with the CDML package.
```{r}
# Isotonic calibration of inverse propensity score weights and outcome regression
# Assuming that you have already computed the following:
# - propensity_scores: estimated propensity scores
# - mu1: predicted outcomes for the treatment group (A = 1)
# - mu0: predicted outcomes for the control group (A = 0)
# - A: treatment indicator
# - Y: observed outcomes
# Calibrate inverse propensity score weights
calibrated_weights <- calibrate_inverse_weights(A = A, pi1 = pi1, pi0 = pi0)
alpha1_star <- calibrated_weights$alpha1_star # Calibrated weights for treatment group
alpha0_star <- calibrated_weights$alpha0_star # Calibrated weights for control group
# Calibrate outcome regression predictions
calibrated_outcomes <- calibrate_outcome_regression(Y = Y, mu1 = mu1, mu0 = mu0, A = A)
mu1_star <- calibrated_outcomes$mu1_star # Calibrated outcome predictions for treatment group
mu0_star <- calibrated_outcomes$mu0_star # Calibrated outcome predictions for control group
# Check calibrated values
head(alpha1_star)
head(alpha0_star)
head(mu1_star)
head(mu0_star)
plot(1/pi1_true, alpha1_star)
plot(mu1_true, mu1_star)
```