Sample Splitting with Caret/SuperLearner

Train the model with Caret

We can train the model with the caret package (for further information about caret, see the original website). We use parallel computing to speed up the computation.

# parallel computing
library(doParallel)
cl <- makePSOCKcluster(2)
registerDoParallel(cl)

# stop after finishing the computation
stopCluster(cl)

The following example shows how to estimate the ITR with gradient boosting machine (GBM) using the caret package. Note that we have already loaded the data and specify the treatment, outcome, and covariates as shown in the Sample Splitting vignette. Since we are using the caret package, we need to specify the trainControl and/or tuneGrid arguments. The trainControl argument specifies the cross-validation method and the tuneGrid argument specifies the tuning grid. For more information about these arguments, please refer to the caret website.

We estimate the ITR with only one machine learning algorithm (GBM) and evaluate the ITR with the evaluate_itr() function. To compute PAPDp, we need to specify the algorithms argument with more than 2 machine learning algorithms.

library(evalITR)
library(caret)

# specify the trainControl method
fitControl <- caret::trainControl(
  method = "repeatedcv", # 3-fold CV
  number = 3, # repeated 3 times
  repeats = 3,
  search='grid',
  allowParallel = TRUE) # grid search

# specify the tuning grid
gbmGrid <- expand.grid(
  interaction.depth = c(5,9), 
  n.trees = (5:10)*100, 
  shrinkage = 0.1,
  n.minobsinnode = 20)

# estimate ITR
fit_caret <- estimate_itr(
  treatment = "treatment",
  form = user_formula,
  trControl = fitControl,
  data = star_data,
  algorithms = c("gbm"),
  budget = 0.2,
  split_ratio = 0.7,
  tuneGrid = gbmGrid,
  verbose = FALSE)
#> Evaluate ITR under sample splitting ...

# evaluate ITR
est_caret <- evaluate_itr(fit_caret)
#> Cannot compute PAPDp

We can extract the training model from caret and check the model performance. Other functions from caret can be applied to the training model.

# extract the final model
caret_model <- fit_caret$estimates$models$gbm
print(caret_model$finalModel)
#> A gradient boosted model with gaussian loss function.
#> 500 iterations were performed.
#> There were 53 predictors of which 49 had non-zero influence.

# check model performance
trellis.par.set(caretTheme()) # theme
plot(caret_model) 

# heatmap 
plot(
  caret_model, 
  plotType = "level",
  scales = list(x = list(rot = 90)))

Train the model with SuperLearner

Alternatively, we can train the model with the SuperLearner package (for further information about SuperLearner, see the original website). SuperLearner utilizes ensemble method by taking optimal weighted average of multiple machine learning algorithms to improve model performance.

We will compare the performance of the ITR estimated with causal_forest and SuperLearner.

library(SuperLearner)

fit_sl <- estimate_itr(
  treatment = "treatment",
  form = user_formula,
  data = star_data,
  algorithms = c("causal_forest","SuperLearner"),
  budget = 0.2,
  split_ratio = 0.7,
  SL_library = c("SL.ranger", "SL.glmnet"))
#> Evaluate ITR under sample splitting ...
#> Error : loading required package (ranger) failed
#> Error : loading required package (ranger) failed
#> Error : loading required package (ranger) failed
#> Error : loading required package (ranger) failed
#> Error : loading required package (ranger) failed
#> Error : loading required package (ranger) failed
#> Error : loading required package (ranger) failed
#> Error : loading required package (ranger) failed
#> Error : loading required package (ranger) failed
#> Error : loading required package (ranger) failed
#> Error : loading required package (ranger) failed

est_sl <- evaluate_itr(fit_sl)

# summarize estimates
summary(est_sl)
#> ── PAPE ────────────────────────────────────────────────────────────────────────
#>   estimate std.deviation     algorithm statistic p.value
#> 1      2.1          0.99 causal_forest       2.1   0.032
#> 2      1.1          0.86  SuperLearner       1.3   0.181
#> 
#> ── PAPEp ───────────────────────────────────────────────────────────────────────
#>   estimate std.deviation     algorithm statistic p.value
#> 1      2.2           1.1 causal_forest       2.0   0.051
#> 2      1.6           1.1  SuperLearner       1.4   0.169
#> 
#> ── PAPDp ───────────────────────────────────────────────────────────────────────
#>   estimate std.deviation                    algorithm statistic p.value
#> 1     0.67           1.5 causal_forest x SuperLearner      0.44    0.66
#> 
#> ── AUPEC ───────────────────────────────────────────────────────────────────────
#>   estimate std.deviation     algorithm statistic p.value
#> 1      2.0          0.84 causal_forest       2.4   0.017
#> 2      1.5          0.83  SuperLearner       1.9   0.064
#> 
#> ── GATE ────────────────────────────────────────────────────────────────────────
#>    estimate std.deviation     algorithm group statistic p.value upper lower
#> 1       -44           108 causal_forest     1    -0.409    0.68  -221   133
#> 2       -37           109 causal_forest     2    -0.338    0.74  -216   142
#> 3        20           108 causal_forest     3     0.181    0.86  -158   197
#> 4        86           109 causal_forest     4     0.792    0.43   -93   264
#> 5       -10           105 causal_forest     5    -0.097    0.92  -184   163
#> 6        28           109  SuperLearner     1     0.255    0.80  -151   207
#> 7      -107           109  SuperLearner     2    -0.979    0.33  -287    73
#> 8       130           108  SuperLearner     3     1.203    0.23   -48   308
#> 9       111           105  SuperLearner     4     1.060    0.29   -61   283
#> 10     -147           107  SuperLearner     5    -1.376    0.17  -324    29

We plot the estimated Area Under the Prescriptive Effect Curve for the writing score across a range of budget constraints, seperately for the two ITRs, estimated with causal_forest and SuperLearner.

# plot the AUPEC 
plot(est_sl)