---
title: "Sample Splitting"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Sample Splitting}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "../man/figures/README-"
  )
  
load("../data/star.rda")

```


This is an example using the `star` dataset (for more information about the dataset, please use `?star`). 

We start with a simple example with one outcome variable (writing scores) and one machine learning algorithm (causal forest). Then we move to incoporate multiple outcomes and compare model performances with several machine learning algorithms. 


To begin, we load the dataset and specify the outcome variable and covariates to be used in the model. Next, we utilize a random forest algorithm to develop an Individualized Treatment Rule (ITR) for estimating the varied impacts of small class sizes on students' writing scores.  Since the treatment is often costly for most policy programs, we consider a case with 20% budget constraint (`budget` = 0.2). The model will identify the top 20% of units who benefit from the treatment most and assign them to with the treatment. We train the model through sample splitting, with the `split_ratio` between the train and test sets determined by the `split_ratio` argument. Specifically, we allocate 70% of the data to train the model, while the remaining 30% is used as testing data (`split_ratio` = 0.7).


```{r sample_split, warning = FALSE, message = FALSE}
library(dplyr)
library(evalITR)

# specifying the outcome
outcomes <- "g3tlangss"

# specifying the treatment
treatment <- "treatment"

# specifying the data (remove other outcomes)
star_data <- star %>% dplyr::select(-c(g3treadss,g3tmathss))

# specifying the formula
user_formula <- as.formula(
  "g3tlangss ~ treatment + gender + race + birthmonth + 
  birthyear + SCHLURBN + GRDRANGE + GKENRMNT + GKFRLNCH + 
  GKBUSED + GKWHITE ")


# estimate ITR 
fit <- estimate_itr(
  treatment = treatment,
  form = user_formula,
  data = star_data,
  algorithms = c("causal_forest"),
  budget = 0.2,
  split_ratio = 0.7)


# evaluate ITR 
est <- evaluate_itr(fit)
```


The`summary()` function displays the following summary statistics: 


| Statistics | Description              |
|:-------- | :------------------------|
| `PAPE`    | population average prescriptive effect |
| `PAPEp`   | population average prescriptive effect with a budget constraint |
| `PAPDp`   | population average prescriptive effect difference with a budget constraint (this quantity will be computed with more than 2 machine learning algorithms) |
| `AUPEC`   | area under the prescriptive effect curve |
| `GATE`    | grouped average treatment effects |


For more information about these evaluation metrics, please refer to [Imai and Li (2021)](https://arxiv.org/abs/1905.05389) and [Imai and Li (2022)](https://arxiv.org/abs/2203.14511).


```{r sp_summary}
# summarize estimates
summary(est)
```


We can extract estimates from the `est` object. The following code shows how to extract the GATE estimates for the writing score with the causal forest algorithm. 
We can also plot the estimates using the `plot_estimate()` function and specify the type of estimates to be plotted 
(`GATE`, `PAPE`, `PAPEp`, `PAPDp`).


```{r est_extract, warning = FALSE, message = FALSE, fig.width = 6, fig.height = 4}
# plot GATE estimates
library(ggplot2)
gate_est <- summary(est)$GATE

plot_estimate(gate_est, type = "GATE") +
  scale_color_manual(values = c("#0072B2", "#D55E00"))
```  


We plot the estimated Area Under the Prescriptive Effect Curve for the writing score across a range of budget constraints for causal forest.

```{r sp_plot, fig.width = 6, fig.height = 4}
# plot the AUPEC 
plot(est)
```