causalTree: Causal Effect Regression and Estimation Trees

Description

Fits a causalTree model and returns an rpart object.

Usage

causalTree(
  formula,
  data,
  weights,
  treatment,
  subset,
  na.action = na.causalTree,
  split.Rule,
  split.Honest,
  HonestSampleSize,
  split.Bucket,
  bucketNum = 5,
  bucketMax = 100,
  cv.option,
  cv.Honest,
  minsize = 2L,
  x = FALSE,
  y = TRUE,
  propensity,
  control,
  split.alpha = 0.5,
  cv.alpha = 0.5,
  cv.gamma = 0.5,
  split.gamma = 0.5,
  cost,
  ...
)

Value

An object of class rpart. See rpart.object.

Arguments

formula: a formula, with a response and features but no interaction terms. If this is a data frame, it is taken as the model frame (see model.frame).
data: an optional data frame that includes the variables named in the formula.
weights: optional case weights.
treatment: a vector that indicates the treatment status of each observation. 1 represents treated and 0 represents control. Only binary treatment is supported in this version.
subset: optional expression saying that only a subset of the rows of the data should be used in the fit.
na.action: the default action deletes all observations for which y is missing, but keeps those in which one or more predictors are missing.
split.Rule: the causalTree splitting rule, one of "TOT", "CT", "fit", or "tstats". Note that "tstats" does not have an associated cross-validation method in cv.option; see Athey and Imbens (2016) for a discussion. Note further that split.Rule and cv.option can be mixed and matched.
split.Honest: boolean option, TRUE or FALSE, used when split.Rule is "CT" or "fit". If TRUE, do honest splitting, with default split.alpha = 0.5; if FALSE, do adaptive splitting with split.alpha = 1. The user's choice of split.alpha is ignored if split.Honest is FALSE and respected if it is TRUE. For split.Rule = "TOT", there is no honest splitting option and split.alpha does not matter. For split.Rule = "tstats", a value of TRUE enables use of split.alpha in calculating the risk function, which determines the order of pruning in cross-validation. Note also that the causalTree function returns estimates from the training data regardless of the value of split.Honest; to get honest estimates, the tree must be re-estimated using estimate.causalTree. The wrapper function honest.CausalTree does honest estimation in one step and returns a tree.
HonestSampleSize: number of observations anticipated to be used in honest re-estimation after building the tree. This enters the risk function used in both splitting and cross-validation.
split.Bucket: boolean option, TRUE or FALSE, used to specify whether to apply the discrete method in splitting the tree. If TRUE, when splitting a node, the observations in a leaf are ordered and then partitioned into buckets, with each bucket containing bucketNum treated and bucketNum control units. Splitting then takes place by bucket.
bucketNum: number of observations in each bucket when split.Bucket = TRUE. However, the code will override this choice in order to guarantee that there are at least minsize and at most bucketMax buckets.
bucketMax: the maximum number of buckets to use in splitting when split.Bucket = TRUE. The effective bucketNum can change depending on the choice of bucketMax.
cv.option: the cross-validation method, one of "TOT", "matching", "CT", or "fit". There is no cv.option for the split.Rule "tstats"; see Athey and Imbens (2016) for discussion.
cv.Honest: boolean option, TRUE or FALSE, only used when cv.option is "CT" or "fit", to specify whether to apply the honest risk evaluation function in cross validation. If TRUE, the honest risk function is used; otherwise the adaptive risk function is used. If FALSE, cv.alpha is set to 1 regardless of the user's choice. If TRUE, cv.alpha defaults to 0.5, but the user's choice of cv.alpha is respected. Note that honest cv estimates within-leaf variances and may perform better with larger leaf sizes and/or a small number of cross-validation sets.
minsize: in order to split, each leaf must have at least minsize treated cases and minsize control cases. The default value is set as 2.
x: keep a copy of the x matrix in the result.
y: keep a copy of the dependent variable in the result. If missing and model is supplied this defaults to FALSE.
propensity: propensity score used in "TOT" splitting and in the "TOT" and honest "CT" cross validation methods. The default value is the proportion of treated cases in all observations. In this implementation, the propensity score is a constant for the whole dataset. Unit-specific propensity scores are not supported; however, the user may use inverse propensity scores as case weights if desired.
control: a list of options that control details of the rpart algorithm. See rpart.control.
split.alpha: scale parameter between 0 and 1, used in splitting risk evaluation function for "CT". When split.Honest = FALSE, split.alpha will be set as 1. For split.Rule="tstats", if split.Honest=TRUE, split.alpha is used in calculating the risk function, which determines the order of pruning in cross-validation.
cv.alpha: scale parameter between 0 and 1, used in the cross validation risk evaluation function for "CT" and "fit". When cv.Honest = FALSE, cv.alpha will be set to 1.
cv.gamma, split.gamma: optional parameters used in evaluating policies.
cost: a vector of non-negative costs, one for each variable in the model. Defaults to one for all variables. These are scalings to be applied when considering splits, so the improvement on splitting on a variable is divided by its cost in deciding which split to choose.
...: arguments to rpart.control may also be specified in the call to causalTree. They are checked against the list of valid arguments. An example of a commonly set parameter would be xval, which sets the number of cross-validation samples. The parameter minsize is implemented differently in causalTree than in rpart; we require a minimum of minsize treated observations and a minimum of minsize control observations in each leaf.
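
Two of the argument rules above can be illustrated in base R. This is only a sketch and not the package's internal code: default_propensity and leaf_meets_minsize are hypothetical helper names introduced here for illustration, and the treatment vector w is made-up data. It assumes a 0/1 treatment vector as described for the treatment argument.

```r
# Hypothetical helpers for illustration only; neither is exported by htetree.

# Default propensity as described for the `propensity` argument:
# the proportion of treated cases among all observations.
default_propensity <- function(treatment) {
  mean(treatment == 1)
}

# The minsize rule: a leaf needs at least `minsize` treated cases
# and at least `minsize` control cases.
leaf_meets_minsize <- function(treatment, minsize = 2L) {
  sum(treatment == 1) >= minsize && sum(treatment == 0) >= minsize
}

# Made-up treatment vector: 3 treated and 5 control units.
w <- c(1, 1, 0, 0, 0, 1, 0, 0)

default_propensity(w)               # proportion treated: 3 of 8
leaf_meets_minsize(w, minsize = 2)  # enough units in both groups
leaf_meets_minsize(w, minsize = 4)  # fails: only 3 treated units
```

The second check is what distinguishes minsize here from the usual rpart node-size settings: the count is applied to each treatment group separately rather than to the leaf as a whole.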

Details

causalTree differs from the rpart function in the rpart package in its splitting rules and cross-validation methods. See Athey and Imbens (2016), Recursive Partitioning for Heterogeneous Causal Effects, for more details.

References

Breiman L., Friedman J. H., Olshen R. A., and Stone, C. J. (1984) Classification and Regression Trees. Wadsworth.

Athey S. and Imbens G. (2016) Recursive Partitioning for Heterogeneous Causal Effects. http://arxiv.org/abs/1504.01132

Examples

library("htetree")
library("rpart")
library("rpart.plot")
tree <- causalTree(y ~ x1 + x2 + x3 + x4, data = simulation.1,
                   treatment = simulation.1$treatment,
                   split.Rule = "CT", cv.option = "CT",
                   split.Honest = TRUE, cv.Honest = TRUE,
                   split.Bucket = FALSE, xval = 5,
                   cp = 0, minsize = 20, propensity = 0.5)

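# Pick the complexity parameter (column 1, CP) with the lowest
# cross-validated error (column 4, xerror) in the cp table.
opcp <- tree$cptable[, 1][which.min(tree$cptable[, 4])]

# Prune the tree at that complexity parameter and plot it.
opfit <- prune(tree, opcp)

rpart.plot(opfit)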

fittree <- causalTree(y~ x1 + x2 + x3 + x4, data = simulation.1,
                      treatment = simulation.1$treatment,
                      split.Rule = "fit", cv.option = "fit",
                      split.Honest = TRUE, cv.Honest = TRUE, split.Bucket = TRUE,
                      bucketNum = 5,
                      bucketMax = 200, xval = 10,
                      cp = 0, minsize = 20, propensity = 0.5)

tstatstree <- causalTree(y~ x1 + x2 + x3 + x4, data = simulation.1,
                         treatment = simulation.1$treatment,
                         split.Rule = "tstats", cv.option = "CT",
                         cv.Honest = TRUE, split.Bucket = TRUE,
                         bucketNum = 10,
                         bucketMax = 200, xval = 5,
                         cp = 0, minsize = 20, propensity = 0.5)
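
The idea behind honest re-estimation, mentioned under split.Honest, can be sketched in base R without the package. This is a conceptual illustration only, not what estimate.causalTree does internally: the data are simulated, the partition on x1 is a hypothetical stand-in for a fitted tree, and leaf_of and leaf_effect are helper names invented here. Each leaf's effect is estimated as a simple difference in means on rows that were held out from tree building.

```r
# Base-R sketch of honest leaf estimation with a fixed, hypothetical partition.
set.seed(1)
n <- 2000
dat <- data.frame(
  x1 = rnorm(n),
  treatment = rbinom(n, 1, 0.5)
)
# Simulated outcome: the treatment effect is 2 when x1 > 0 and 0 otherwise.
dat$y <- dat$x1 + 2 * dat$treatment * (dat$x1 > 0) + rnorm(n)

# Hold out half of the rows; only these are used to estimate leaf effects.
est_rows <- sample(n, n / 2)
est <- dat[est_rows, ]

# Stand-in for a fitted tree: assign each row to one of two leaves.
leaf_of <- function(d) ifelse(d$x1 > 0, "x1 > 0", "x1 <= 0")

# Difference in mean outcomes between treated and control rows in one leaf.
leaf_effect <- function(d) {
  mean(d$y[d$treatment == 1]) - mean(d$y[d$treatment == 0])
}

# One estimate per leaf, computed on the held-out estimation sample.
honest_effects <- sapply(split(est, leaf_of(est)), leaf_effect)
honest_effects
```

Because the estimation rows played no part in choosing the partition, the leaf estimates are not biased by the search over splits, which is the motivation for honest estimation in Athey and Imbens (2016).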