roseRF (version 0.1.0)

roseRF_plm: ROSE random forest estimator for the partially linear model

Description

Estimates the parameter of interest \(\theta_0\) in the partially linear model $$\mathbb{E}[Y|X,Z] = X\theta_0 + f_0(Z),$$ which can be re-expressed in terms of the 'nuisance functions' \((\mathbb{E}[Y|Z], \mathbb{E}[X|Z])\) as $$\mathbb{E}[Y|X,Z]-\mathbb{E}[Y|Z] = (X-\mathbb{E}[X|Z])\theta_0.$$
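
The second display follows from the first by conditioning on \(Z\) alone. A short derivation, using only the tower property of conditional expectation: $$\mathbb{E}[Y|Z] = \mathbb{E}\big[\mathbb{E}[Y|X,Z]\,\big|\,Z\big] = \mathbb{E}[X|Z]\theta_0 + f_0(Z),$$ and subtracting this from the model equation cancels the unknown function \(f_0(Z)\): $$\mathbb{E}[Y|X,Z]-\mathbb{E}[Y|Z] = X\theta_0 + f_0(Z) - \mathbb{E}[X|Z]\theta_0 - f_0(Z) = (X-\mathbb{E}[X|Z])\theta_0.$$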

Usage

roseRF_plm(
  y_formula,
  y_learner,
  y_pars = list(),
  x_formula,
  x_learner,
  x_pars = list(),
  M1_formula = x_formula,
  M1_learner = x_learner,
  M1_pars = x_pars,
  M2_formula = NA,
  M2_learner = NA,
  M2_pars = list(),
  M3_formula = NA,
  M3_learner = NA,
  M3_pars = list(),
  M4_formula = NA,
  M4_learner = NA,
  M4_pars = list(),
  M5_formula = NA,
  M5_learner = NA,
  M5_pars = list(),
  data,
  K = 5,
  S = 1,
  max.depth = 10,
  num.trees = 500,
  min.node.size = max(10, ceiling(0.01 * (K - 1)/K * nrow(data))),
  replace = TRUE,
  sample.fraction = 0.8
)
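
A minimal call might look as follows. This is a sketch under stated assumptions rather than the package's own example: the simulated data, the variable names `Y`, `X` and `Z`, the reduced forest settings, and the use of the `gam` learner with mgcv-style smooth terms `s(Z)` in the formulas are all illustrative assumptions. The call is guarded so that the model is only fitted when roseRF is installed.

```r
# Simulate data from a partially linear model with theta_0 = 0.5
# (the data-generating process is made up for illustration).
set.seed(1)
n <- 500
Z <- rnorm(n)
X <- sin(Z) + rnorm(n)
Y <- 0.5 * X + cos(Z) + rnorm(n)
df <- data.frame(Y = Y, X = X, Z = Z)

# Fit only if the package is available.
# Assumption: the "gam" learner accepts smooth terms such as s(Z).
if (requireNamespace("roseRF", quietly = TRUE)) {
  fit <- roseRF::roseRF_plm(
    y_formula = Y ~ s(Z), y_learner = "gam",
    x_formula = X ~ s(Z), x_learner = "gam",
    data = df, K = 5, S = 1,
    max.depth = 3, num.trees = 50
  )
  print(fit$coefficients)
}
```

All M1 to M5 arguments are left at their defaults here, which corresponds to \(J=1\) and \(M_1(X)=X\).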

Value

A list containing:

theta

The estimator of \(\theta_0\).

stderror

Huber robust estimate of the standard error of the \(\theta_0\)-estimator.

coefficients

Table of \(\theta_0\) coefficient estimator, standard error, z-value and p-value.
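
The entries of the coefficients table are related in the usual way. The sketch below assumes the z-value and p-value follow the standard two-sided Wald test of \(\theta_0 = 0\) under a normal approximation, which the text above does not state explicitly; the numeric values of `theta` and `stderror` are hypothetical placeholders, not output from the package.

```r
# Hypothetical values standing in for the `theta` and `stderror` list elements.
theta <- 0.52
stderror <- 0.04

# Wald statistic and two-sided p-value under a normal approximation (assumed).
z_value <- theta / stderror
p_value <- 2 * pnorm(-abs(z_value))

# Approximate 95% confidence interval for theta_0.
ci <- theta + c(-1, 1) * qnorm(0.975) * stderror
```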

Arguments

y_formula

a two-sided formula object describing the model for \(\mathbb{E}[Y|Z]\).

y_learner

a string specifying the regression method to fit the regression of \(Y\) on \(Z\) as given by y_formula (e.g. randomforest, xgboost, neuralnet, gam).

y_pars

a list containing hyperparameters for the y_learner chosen. Default is an empty list, which performs hyperparameter tuning.

x_formula

a two-sided formula object describing the model for \(\mathbb{E}[X|Z]\).

x_learner

a string specifying the regression method to fit the regression of \(X\) on \(Z\) as given by x_formula (e.g. randomforest, xgboost, neuralnet, gam).

x_pars

a list containing hyperparameters for the x_learner chosen. Default is an empty list, which performs hyperparameter tuning.

M1_formula

a two-sided formula object for the model \(\mathbb{E}[M_1(X)|Z]\). Default is \(M_1(X)=X\).

M1_learner

a string specifying the regression method for \(\mathbb{E}[M_1(X)|Z]\) estimation.

M1_pars

a list containing hyperparameters for the M1_learner chosen.

M2_formula

a two-sided formula object for the model \(\mathbb{E}[M_2(X)|Z]\). Default is no formula / regression (i.e. \(J=1\)).

M2_learner

a string specifying the regression method for \(\mathbb{E}[M_2(X)|Z]\) estimation.

M2_pars

a list containing hyperparameters for the M2_learner chosen.

M3_formula

a two-sided formula object for the model \(\mathbb{E}[M_3(X)|Z]\). Default is no formula / regression (i.e. \(J=1\)).

M3_learner

a string specifying the regression method for \(\mathbb{E}[M_3(X)|Z]\) estimation.

M3_pars

a list containing hyperparameters for the M3_learner chosen.

M4_formula

a two-sided formula object for the model \(\mathbb{E}[M_4(X)|Z]\). Default is no formula / regression (i.e. \(J=1\)).

M4_learner

a string specifying the regression method for \(\mathbb{E}[M_4(X)|Z]\) estimation.

M4_pars

a list containing hyperparameters for the M4_learner chosen.

M5_formula

a two-sided formula object for the model \(\mathbb{E}[M_5(X)|Z]\). Default is no formula / regression (i.e. \(J=1\)).

M5_learner

a string specifying the regression method for \(\mathbb{E}[M_5(X)|Z]\) estimation.

M5_pars

a list containing hyperparameters for the M5_learner chosen.

data

a data frame containing the variables for the partially linear model.

K

the number of folds used for \(K\)-fold cross-fitting. Default is 5.

S

the number of repetitions of the \(K\)-fold cross-fitting procedure, used to mitigate the dependence of the estimator on the random sample splits. Default is 1.

max.depth

Maximum depth parameter used for ROSE random forests. Default is 10.

num.trees

Number of trees used for a single ROSE random forest. Default is 500.

min.node.size

Minimum node size of a leaf in each tree. Default is max(10, ceiling(0.01 * (K - 1)/K * nrow(data))).
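
To make the default concrete: \((K-1)/K\) times the number of rows is the number of observations outside a single fold, so the default is roughly 1% of that training size, with a floor of 10. The helper name `default_min_node_size` below is hypothetical and only wraps the expression from the Usage section.

```r
# Hypothetical helper reproducing the default expression for min.node.size.
default_min_node_size <- function(n, K = 5) {
  max(10, ceiling(0.01 * (K - 1) / K * n))
}

# With n = 1000 and K = 5, 1% of the training size is about 8,
# so the floor of 10 applies.
small <- default_min_node_size(n = 1000)

# With n = 20000 the 1% rule dominates the floor.
large <- default_min_node_size(n = 20000)
```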

replace

Whether sampling for a single random tree is performed with (bootstrap) or without replacement. Default is TRUE (i.e. bootstrap).

sample.fraction

Proportion of data used for each random tree. Default is 0.8.

Details

The estimator \(\hat{\theta}\) of the parameter of interest \(\theta_0\) solves the estimating equation $$\sum_{i}\psi(Y_i,X_i,Z_i; \theta,\hat{\eta}(Z),\hat{w}(Z)) = 0,$$ $$\psi(Y,X,Z;\theta,\eta_0,w) := \sum_{j=1}^J w_j(Z) \big( M_j(X) - \mathbb{E}[M_j(X)|Z]\big) \Big( \big(Y-\mathbb{E}[Y|Z]\big)-\big(X-\mathbb{E}[X|Z]\big)\theta \Big),$$ $$\eta_0 := \big(\mathbb{E}[Y|Z=\cdot], \mathbb{E}[X|Z=\cdot]\big),$$ where \(M_1(X),\ldots,M_J(X)\) denote user-chosen functions of \(X\) and \(w(Z)=\big(w_1(Z),\ldots,w_J(Z)\big)\) denotes weights estimated via ROSE random forests. The default takes \(J=1\) and \(M_1(X)=X\); when taking \(J\geq 2\), we recommend checking carefully that any additional user-chosen regression tasks are applicable and appropriate.
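
Since \(\psi\) is linear in \(\theta\), the estimating equation can be solved in closed form whenever the denominator below is nonzero. As a worked step (for scalar \(X\), and writing \(\hat{\mathbb{E}}\) for the fitted regressions, a notation introduced here for brevity): $$\hat{\theta} = \frac{\sum_{i}\sum_{j=1}^J \hat{w}_j(Z_i)\big(M_j(X_i)-\hat{\mathbb{E}}[M_j(X)|Z_i]\big)\big(Y_i-\hat{\mathbb{E}}[Y|Z_i]\big)}{\sum_{i}\sum_{j=1}^J \hat{w}_j(Z_i)\big(M_j(X_i)-\hat{\mathbb{E}}[M_j(X)|Z_i]\big)\big(X_i-\hat{\mathbb{E}}[X|Z_i]\big)}.$$ With \(J=1\), \(M_1(X)=X\) and constant weights, this reduces to the least-squares regression of the \(Y\)-residuals on the \(X\)-residuals.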

The parameter of interest \(\theta_0\) is estimated using a DML2 / \(K\)-fold cross-fitting framework, which allows for arbitrary learners for \(\hat{\eta}\) provided they converge faster than the \(n^{-1/4}\) rate, i.e. by solving the estimating equation $$\sum_{k \in [K]}\sum_{i \in I_k}\psi(Y_i,X_i,Z_i; \theta,\hat{\eta}^{(k)}(Z),\hat{w}^{(k)}(Z)) = 0,$$ where \(I_1,\ldots,I_K\) denotes a partition of the index set for the datapoints \((Y_i,X_i,Z_i)\), \(\hat{\eta}^{(k)}\) denotes an estimator for \(\eta_0\) trained on the data indexed by \(I_k^c\), and \(\hat{w}^{(k)}\) denotes a ROSE random forest (again trained on the data indexed by \(I_k^c\)).
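
The cross-fitting scheme can be sketched in base R as follows. This is an illustration of the DML2 structure only, not the package's implementation: it takes \(J=1\) and \(M_1(X)=X\), replaces the ROSE random forest weights with constant weights \(w \equiv 1\), uses polynomial regression via `lm` as a stand-in nuisance learner, and computes a generic sandwich-type standard error. The simulated data and all variable names are assumptions made for the example.

```r
# Simulated partially linear model with theta_0 = 0.5 (illustrative only).
set.seed(1)
n <- 600
K <- 5
Z <- runif(n, -2, 2)
X <- sin(Z) + rnorm(n)
Y <- 0.5 * X + cos(Z) + rnorm(n)
dat <- data.frame(Y = Y, X = X, Z = Z)

# Random partition I_1, ..., I_K of the indices 1, ..., n.
fold <- sample(rep(seq_len(K), length.out = n))

res_y <- numeric(n)  # Y - E[Y|Z], predicted out of fold
res_x <- numeric(n)  # X - E[X|Z], predicted out of fold
for (k in seq_len(K)) {
  in_fold <- fold == k
  train <- dat[!in_fold, ]  # data indexed by the complement of I_k
  # Stand-in nuisance learners (assumption: degree-4 polynomials in Z).
  fit_y <- lm(Y ~ poly(Z, 4), data = train)
  fit_x <- lm(X ~ poly(Z, 4), data = train)
  res_y[in_fold] <- dat$Y[in_fold] - predict(fit_y, newdata = dat[in_fold, ])
  res_x[in_fold] <- dat$X[in_fold] - predict(fit_x, newdata = dat[in_fold, ])
}

# Placeholder weights; ROSE random forests would estimate these instead.
w <- rep(1, n)

# Solve the pooled (DML2) estimating equation, which is linear in theta.
theta_hat <- sum(w * res_x * res_y) / sum(w * res_x^2)

# Generic sandwich-type standard error based on the estimating function.
psi <- w * res_x * (res_y - res_x * theta_hat)
se_hat <- sqrt(sum(psi^2)) / abs(sum(w * res_x^2))
```

The residuals for each fold are always computed from models trained on the other folds, and a single equation pooled over all folds is solved for \(\theta\), which is what distinguishes DML2 from averaging \(K\) per-fold estimates.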