GenericML_single: Single iteration of the GenericML algorithm

Description

Performs generic ML inference for a single learning technique and a given split of the data. This can be seen as a single iteration of Algorithm 1 in Chernozhukov et al. (2020).

Usage

GenericML_single(
  Z,
  D,
  Y,
  learner,
  propensity_scores,
  M_set,
  A_set = setdiff(1:length(Y), M_set),
  Z_CLAN = NULL,
  HT = FALSE,
  quantile_cutoffs = c(0.25, 0.5, 0.75),
  X1_BLP = setup_X1(),
  X1_GATES = setup_X1(),
  diff_GATES = setup_diff(),
  diff_CLAN = setup_diff(),
  vcov_BLP = setup_vcov(),
  vcov_GATES = setup_vcov(),
  equal_variances_CLAN = FALSE,
  significance_level = 0.05,
  min_variation = 1e-05
)

Arguments

Z

A numeric design matrix that holds the covariates in its columns.

D

A binary vector of treatment assignment. Value one denotes assignment to the treatment group and value zero denotes assignment to the control group.

Y

A numeric vector containing the response variable.

learner

A character string specifying the machine learner to be used for estimating the baseline conditional average (BCA) and conditional average treatment effect (CATE). Either 'lasso', 'random_forest', 'tree', or a custom learner specified with mlr3 syntax. In the latter case, the specification must not state whether the learner is a regression or a classification learner. Example: 'mlr3::lrn("ranger", num.trees = 100)' for a random forest learner with 100 trees. Note that the specification is passed as a string and that it omits the classif. and regr. keywords. See https://mlr3learners.mlr-org.com for a list of mlr3 learners.

propensity_scores

A numeric vector of propensity score estimates.

M_set

A numerical vector of indices of observations in the main sample.

A_set

A numerical vector of indices of observations in the auxiliary sample. Default is the complement of M_set.
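As a minimal sketch of how the two index sets relate, the following base R snippet draws a main sample and derives the auxiliary sample as its complement, mirroring the default setdiff(1:length(Y), M_set). The sample size n is an arbitrary value chosen for illustration.

```r
set.seed(1)
n <- 10                                          # illustrative sample size (assumption)
M_set <- sort(sample(seq_len(n), size = n / 2))  # indices of the main sample
A_set <- setdiff(seq_len(n), M_set)              # complement: the auxiliary sample

# the two sets are disjoint and jointly cover all observations
length(intersect(M_set, A_set)) == 0
setequal(union(M_set, A_set), seq_len(n))
```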

Z_CLAN

A numeric matrix holding variables on which classification analysis (CLAN) shall be performed. CLAN will be performed on each column of the matrix. If NULL (default), then Z_CLAN = Z, i.e. CLAN is performed for all variables in Z.

HT

Logical. If TRUE, a Horvitz-Thompson (HT) transformation is applied in the BLP and GATES regressions. Default is FALSE.
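To give an idea of what a Horvitz-Thompson transformation does, the sketch below builds the weight H = (D - p) / (p * (1 - p)) from the treatment indicator and the propensity score, which is the form commonly used in the generic ML literature. It is an illustration on simulated data under that assumption, not the package's internal code; the data-generating process and all variable names are hypothetical.

```r
set.seed(1)
n <- 1000
D <- rbinom(n, 1, 0.5)        # random treatment assignment
p <- rep(0.5, n)              # known propensity scores
Y <- 1 + 2 * D + rnorm(n)     # simulated outcome with a true average effect of 2

# Horvitz-Thompson weight (assumed form): positive for treated, negative for controls
H <- (D - p) / (p * (1 - p))

# the mean of the transformed outcome Y * H estimates the average treatment effect
mean(Y * H)
```

With a constant propensity score of 0.5, the weight takes only the values 2 and -2, so the transformed outcome simply rescales and signs the response by treatment status.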

quantile_cutoffs

The cutoff points of the quantiles that shall be used for GATES grouping. Default is c(0.25, 0.5, 0.75), i.e. the quartile cutoffs, which split the sample into four groups.
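The following sketch illustrates how three cutoffs produce four groups. It bins a vector of hypothetical CATE proxy predictions (here just simulated noise) at their empirical quantiles with base R; the package may construct the groups differently, so treat this as an illustration only.

```r
set.seed(1)
cate_proxy <- rnorm(200)                 # hypothetical proxy predictions (assumption)
quantile_cutoffs <- c(0.25, 0.5, 0.75)

# empirical quantiles at the cutoffs, padded with -Inf/Inf to cover all values
breaks <- c(-Inf, quantile(cate_proxy, probs = quantile_cutoffs), Inf)

# assign each observation to one of length(quantile_cutoffs) + 1 groups
groups <- cut(cate_proxy, breaks = breaks, labels = FALSE)
table(groups)
```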

X1_BLP

Specifies the design matrix \(X_1\) in the BLP regression. Must be an object of class "setup_X1". See the documentation of setup_X1() for details.

X1_GATES

Same as X1_BLP, just for the GATES regression.

diff_GATES

Specifies the generic targets of GATES. Must be an object of class "setup_diff". See the documentation of setup_diff() for details.

diff_CLAN

Same as diff_GATES, just for the CLAN generic targets.

vcov_BLP

Specifies the covariance matrix estimator in the BLP regression. Must be an object of class "setup_vcov". See the documentation of setup_vcov() for details.

vcov_GATES

Same as vcov_BLP, just for the GATES regression.

equal_variances_CLAN

Logical. If TRUE, all within-group variances of the CLAN groups are assumed to be equal. Default is FALSE. This choice determines how the variance of the difference of two CLAN generic targets (i.e. the variance of the difference of two means) is estimated. If TRUE (homoskedasticity assumption), the pooled variance is used. If FALSE (heteroskedasticity), the variance of Welch's t-test is used.
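The two variance estimators for a difference of two means can be compared in a few lines of base R. The sketch below uses two simulated groups with deliberately unequal variances; the group sizes and distributions are assumptions made for illustration, and the snippet is not taken from the package.

```r
set.seed(1)
x <- rnorm(30, mean = 0,   sd = 1)   # hypothetical CLAN group 1
y <- rnorm(40, mean = 0.5, sd = 3)   # hypothetical CLAN group 2
n1 <- length(x)
n2 <- length(y)

# heteroskedastic case: variance used in Welch's t-test
var_welch <- var(x) / n1 + var(y) / n2

# homoskedastic case: pooled within-group variance
s2_pooled  <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2)
var_pooled <- s2_pooled * (1 / n1 + 1 / n2)

c(welch = var_welch, pooled = var_pooled)
```

Both quantities agree with the squared standard errors reported by t.test(x, y) and t.test(x, y, var.equal = TRUE), respectively.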

significance_level

Significance level for VEIN. Default is 0.05.

min_variation

Specifies a threshold for the minimum variation of the BCA/CATE predictions. If the variation of a BCA/CATE prediction falls below this threshold, random noise with distribution \(N(0, var(Y)/20)\) is added to it. Default is 1e-05.
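A rough sketch of this safeguard is given below. It assumes that "variation" is measured by the sample variance and reads \(N(0, var(Y)/20)\) as a normal distribution with variance var(Y)/20; both are assumptions, and the constant prediction vector is hypothetical.

```r
set.seed(1)
Y <- runif(100)                  # outcome variable
proxy <- rep(0.3, length(Y))     # hypothetical degenerate (constant) prediction
min_variation <- 1e-05

# if the predictions barely vary, perturb them with N(0, var(Y) / 20) noise
if (var(proxy) < min_variation) {
  proxy <- proxy + rnorm(length(proxy), mean = 0, sd = sqrt(var(Y) / 20))
}

var(proxy) > min_variation
```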

Value

A list with the following components:

BLP: An object of class "BLP".
GATES: An object of class "GATES".
CLAN: An object of class "CLAN".
proxy_BCA: An object of class "proxy_BCA".
proxy_CATE: An object of class "proxy_CATE".
best: Estimates of the \(\Lambda\) parameters for finding the best learner. Returned by lambda_parameters().

Details

The specifications "lasso", "random_forest", and "tree" in learner correspond to the following mlr3 specifications (we omit the keywords classif. and regr.). "lasso" is a cross-validated Lasso estimator, which corresponds to 'mlr3::lrn("cv_glmnet", s = "lambda.min", alpha = 1)'. "random_forest" is a random forest with 500 trees, which corresponds to 'mlr3::lrn("ranger", num.trees = 500)'. "tree" is a tree learner, which corresponds to 'mlr3::lrn("rpart")'.
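The correspondence described above can be written down as a small lookup table. The helper resolve_learner() below is hypothetical and only illustrates the mapping; it is not a function exported by the package.

```r
# shortcut names and the mlr3 specification strings they stand for
learner_shortcuts <- c(
  lasso         = 'mlr3::lrn("cv_glmnet", s = "lambda.min", alpha = 1)',
  random_forest = 'mlr3::lrn("ranger", num.trees = 500)',
  tree          = 'mlr3::lrn("rpart")'
)

# hypothetical helper: expand a shortcut, pass custom specifications through unchanged
resolve_learner <- function(learner) {
  if (learner %in% names(learner_shortcuts)) {
    learner_shortcuts[[learner]]
  } else {
    learner
  }
}

resolve_learner("tree")
resolve_learner("mlr3::lrn('ranger', num.trees = 10)")
```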

References

Chernozhukov V., Demirer M., Duflo E., Fernández-Val I. (2020). “Generic Machine Learning Inference on Heterogenous Treatment Effects in Randomized Experiments.” arXiv preprint arXiv:1712.04802. URL: https://arxiv.org/abs/1712.04802.

Lang M., Binder M., Richter J., Schratz P., Pfisterer F., Coors S., Au Q., Casalicchio G., Kotthoff L., Bischl B. (2019). “mlr3: A Modern Object-Oriented Machine Learning Framework in R.” Journal of Open Source Software, 4(44), 1903. doi:10.21105/joss.01903.

Examples

# NOT RUN {
if (require("ranger")) {
  ## generate data
  set.seed(1)
  n  <- 150                        # number of observations
  p  <- 5                          # number of covariates
  Z  <- matrix(runif(n*p), n, p)   # design matrix
  D  <- rbinom(n, 1, 0.5)          # random treatment assignment
  Y  <- runif(n)                   # outcome variable
  propensity_scores <- rep(0.5, n) # propensity scores
  M_set <- sample(1:n, size = n/2) # main set

  ## specify learner
  learner <- "mlr3::lrn('ranger', num.trees = 10)"

  ## run single GenericML iteration
  GenericML_single(Z, D, Y, learner, propensity_scores, M_set)
}

# }