--- title: "Sample Splitting with Caret/SuperLearner" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Sample Splitting with Caret/SuperLearner} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "../man/figures/README-" ) library(dplyr) load("../data/star.rda") # specifying the outcome outcomes <- "g3tlangss" # specifying the treatment treatment <- "treatment" # specifying the data (remove other outcomes) star_data <- star %>% dplyr::select(-c(g3treadss,g3tmathss)) # specifying the formula user_formula <- as.formula( "g3tlangss ~ treatment + gender + race + birthmonth + birthyear + SCHLURBN + GRDRANGE + GKENRMNT + GKFRLNCH + GKBUSED + GKWHITE ") ``` ### Train the model with Caret We can train the model with the `caret` package (for further information about `caret`, see [the original website](http://topepo.github.io/caret/index.html)). We use parallel computing to speed up the computation. ```{r parallel, message = FALSE, eval = FALSE} # parallel computing library(doParallel) cl <- makePSOCKcluster(2) registerDoParallel(cl) # stop after finishing the computation stopCluster(cl) ``` The following example shows how to estimate the ITR with gradient boosting machine (GBM) using the `caret` package. Note that we have already loaded the data and specify the treatment, outcome, and covariates as shown in the [Sample Splitting](sample_split.html) vignette. Since we are using the `caret` package, we need to specify the `trainControl` and/or `tuneGrid` arguments. The `trainControl` argument specifies the cross-validation method and the `tuneGrid` argument specifies the tuning grid. For more information about these arguments, please refer to the [caret website](http://topepo.github.io/caret/model-training-and-tuning.html). We estimate the ITR with only one machine learning algorithm (GBM) and evaluate the ITR with the `evaluate_itr()` function. To compute `PAPDp`, we need to specify the `algorithms` argument with more than 2 machine learning algorithms. ```{r caret estimate, message = FALSE} library(evalITR) library(caret) # specify the trainControl method fitControl <- caret::trainControl( method = "repeatedcv", # 3-fold CV number = 3, # repeated 3 times repeats = 3, search='grid', allowParallel = TRUE) # grid search # specify the tuning grid gbmGrid <- expand.grid( interaction.depth = c(5,9), n.trees = (5:10)*100, shrinkage = 0.1, n.minobsinnode = 20) # estimate ITR fit_caret <- estimate_itr( treatment = "treatment", form = user_formula, trControl = fitControl, data = star_data, algorithms = c("gbm"), budget = 0.2, split_ratio = 0.7, tuneGrid = gbmGrid, verbose = FALSE) # evaluate ITR est_caret <- evaluate_itr(fit_caret) ``` We can extract the training model from `caret` and check the model performance. Other functions from `caret` can be applied to the training model. 
We can extract the trained model from `caret` and check the model performance. Other functions from `caret` can also be applied to the trained model.

```{r caret_model, message = FALSE, warning = FALSE, fig.width = 6, fig.height = 4}
# extract the final model
caret_model <- fit_caret$estimates$models$gbm
print(caret_model$finalModel)

# check model performance
trellis.par.set(caretTheme()) # theme
plot(caret_model)

# heatmap
plot(
  caret_model,
  plotType = "level",
  scales = list(x = list(rot = 90)))
```

### Train the model with SuperLearner

Alternatively, we can train the model with the `SuperLearner` package (for further information about `SuperLearner`, see [the original website](https://CRAN.R-project.org/package=SuperLearner/vignettes/Guide-to-SuperLearner.html)). SuperLearner is an ensemble method that takes an optimal weighted average of multiple machine learning algorithms to improve model performance.

We will compare the performance of the ITRs estimated with `causal_forest` and `SuperLearner`.

```{r sl_summary, message = FALSE, warning = FALSE}
library(SuperLearner)

fit_sl <- estimate_itr(
  treatment = "treatment",
  form = user_formula,
  data = star_data,
  algorithms = c("causal_forest", "SuperLearner"),
  budget = 0.2,
  split_ratio = 0.7,
  SL_library = c("SL.ranger", "SL.glmnet"))

est_sl <- evaluate_itr(fit_sl)

# summarize estimates
summary(est_sl)
```

We plot the estimated Area Under the Prescriptive Effect Curve (AUPEC) for the writing score across a range of budget constraints, separately for the two ITRs estimated with `causal_forest` and `SuperLearner`.

```{r sl_plot, fig.width = 6, fig.height = 4, fig.align = "center"}
# plot the AUPEC
plot(est_sl)
```
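Analogous to the `caret` model above, we can inspect the fitted `SuperLearner` object to see how much weight the ensemble places on each candidate learner. The chunk below is a sketch (not run); it assumes the fitted object is stored as `fit_sl$estimates$models$SuperLearner`, mirroring the `fit_caret$estimates$models$gbm` extraction above, and uses the standard `SuperLearner` fields `coef` and `cvRisk`.

```{r sl_model, eval = FALSE}
# extract the fitted SuperLearner object
# (assumes the same `$estimates$models` structure as the caret fit above)
sl_model <- fit_sl$estimates$models$SuperLearner

# ensemble weights assigned to SL.ranger and SL.glmnet
sl_model$coef

# cross-validated risk of each candidate learner
sl_model$cvRisk
```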