cov.sel.high: Model-Free Covariate Selection in High Dimensions

Description

Model-free selection of covariates in high dimensions under unconfoundedness, for situations where the parameter of interest is an average causal effect. This package is based on the model-free backward elimination algorithms proposed in de Luna, Waernbaum and Richardson (2011) and VanderWeele and Shpitser (2011). Confounder selection can be performed via Markov/Bayesian networks, random forests, or LASSO.

Usage

cov.sel.high(T=NULL, Y=NULL, X=NULL,type=c("mmpc","mmhc","rf","lasso"), 
                    betahat=TRUE, parallel=FALSE, Simulate=TRUE,N=NULL, Setting=1,
                    rep=1, Models=c("Linear", "Nonlinear", "Binary"), 
                    alpha=0.05, mmhc_score=c("aic","bic"))

Arguments

T

A vector containing 0 and 1, indicating a binary treatment variable.

Y

A vector of observed outcomes.

X

A matrix or data frame containing columns of covariates. The covariates may be a mix of continuous, unordered discrete (to be specified in the data frame using factor), and ordered discrete (to be specified in the data frame using ordered).
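As a rough illustration of the input types described above, the following base R sketch builds a binary treatment vector, an outcome vector, and a covariate data frame that mixes continuous, factor, and ordered columns. All variable names and the data-generating choices are hypothetical. The final call is left commented out and untested; it assumes that own data are passed with Simulate=FALSE.

```r
# Hypothetical toy data; names and distributions are illustrative only
set.seed(1)
n <- 200

X <- data.frame(
  age    = rnorm(n, mean = 50, sd = 10),                                # continuous
  region = factor(sample(c("north", "south", "west"), n, replace = TRUE)),  # unordered discrete
  grade  = ordered(sample(c("low", "mid", "high"), n, replace = TRUE),
                   levels = c("low", "mid", "high"))                    # ordered discrete
)

# Binary treatment coded 0/1, with probability depending on age
T <- rbinom(n, size = 1, prob = plogis((X$age - 50) / 10))

# Observed outcome
Y <- 2 * T + 0.1 * X$age + rnorm(n)

str(X)
table(T)

# Assumed call on own data (not run here):
# ans <- cov.sel.high(T = T, Y = Y, X = X, type = "rf", betahat = FALSE, Simulate = FALSE)
```

Note that naming a vector T masks the shorthand T for TRUE in the current session, which is why the sketch spells out replace = TRUE.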

type

The type of method used for selection. The network algorithms are "mmpc" for max-min parents and children (Markov network) and "mmhc" for max-min hill climbing (Bayesian network). Other available methods are random forests, "rf", and LASSO, "lasso".

betahat

If betahat=TRUE, the average treatment effect is estimated for each selected subset and for the full covariate vector, using propensity score matching (PSM) via the function Match and targeted maximum likelihood estimation (TMLE) via the function tmle.

parallel

If parallel=TRUE and there is a registered parallel backend then the computation will be parallelized. Default is parallel=FALSE.

Simulate

If data is to be simulated according to one of the designs in Häggström (2017) then Simulate should be set to TRUE.

N

If Simulate=TRUE, N is the number of observations to be simulated.

Setting

If Simulate=TRUE, Setting is the simulation setting to be used. With Setting=1, unconfoundedness holds given X. With Setting=2, there is M-bias given X.

rep

If Simulate=TRUE, rep is the number of replications to be simulated.

Models

If Simulate=TRUE, Models is the type of outcome models to be used, options are "Linear", "Nonlinear" and "Binary".

alpha

A numeric value, the target nominal type I error rate (tuning parameter) for "mmpc" and "mmhc".

mmhc_score

The network score to use for "mmhc", either "aic" or "bic".

Value

cov.sel.high returns a list with the following content:

X.T

The set of covariates targeting the subset containing all causes of T.

Q.0

The set of covariates targeting the subset of X.T which is also associated with Y given T=0, the response in the control group.

Q.1

The set of covariates targeting the subset of X.T which is also associated with Y given T=1, the response in the treatment group.

Q

Union of Q.0 and Q.1.

X.0

The set of covariates targeting the subset containing all causes of Y given T=0.

X.1

The set of covariates targeting the subset containing all causes of Y given T=1.

X.Y

Union of X.0 and X.1.

Z.0

The set of covariates targeting the subset of X.0 which is also associated with T.

Z.1

The set of covariates targeting the subset of X.1 which is also associated with T.

Z

Union of Z.0 and Z.1.

X.TY

Union of X.T and X.Y, the set of covariates targeting the subset containing all causes of T and Y.
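The union relations among the returned subsets can be sketched in base R. The covariate names below are made up, and representing each subset as a character vector is an assumption for illustration only; the actual return format may differ.

```r
# Hypothetical selected subsets (covariate names are made up)
X.0 <- c("x1", "x2", "x5")   # selected for Y among controls (T=0)
X.1 <- c("x2", "x3")         # selected for Y among treated (T=1)
X.T <- c("x1", "x4")         # selected for T

# X.Y combines the outcome-related subsets from both treatment arms
X.Y <- union(X.0, X.1)

# X.TY combines the treatment-related and outcome-related subsets
X.TY <- union(X.T, X.Y)

print(X.Y)
print(X.TY)

# Cardinality of each subset
sapply(list(X.T = X.T, X.0 = X.0, X.1 = X.1, X.Y = X.Y, X.TY = X.TY), length)
```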

cardinalities

The cardinalities of each selected subset.

est_psm

The PSM estimate of the average causal effect, for the full covariate vector and each selected subset.

se_psm

The Abadie-Imbens standard error for the PSM estimate of the average causal effect, for the full covariate vector and each selected subset.

est_tmle

The TMLE estimate of the average causal effect, for the full covariate vector and each selected subset.

se_tmle

The influence-curve based standard error for the TMLE estimate of the average causal effect, for the full covariate vector and each selected subset.
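A point estimate and its standard error are commonly combined into a normal-approximation (Wald) confidence interval. The sketch below uses made-up numbers rather than output from cov.sel.high, and it assumes that the normal approximation is adequate for the estimator in question.

```r
# Hypothetical values standing in for one estimate and its standard error
est <- 2.10
se  <- 0.25

# 95% Wald interval: estimate +/- z * standard error
z  <- qnorm(0.975)
ci <- c(lower = est - z * se, upper = est + z * se)

print(round(ci, 2))
```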

N

The number of observations.

Setting

The Setting used.

rep

The number of replications.

Models

Models used.

type

type used.

alpha

alpha used.

mmhc_score

score used.

varnames

Variable names of the used data.

Details

See Häggström (2017).

References

de Luna, X., I. Waernbaum, and T. S. Richardson (2011). Covariate selection for the nonparametric estimation of an average treatment effect. Biometrika, 98, 861-875.

Häggström, J. (2017). Data-Driven Confounder Selection via Markov and Bayesian Networks. ArXiv e-prints.

Nagarajan, R., M. Scutari, and S. Lebre (2013). Bayesian Networks in R with Applications in Systems Biology. Springer, New York. ISBN 978-1461464457.

Scutari, M. (2010). Learning Bayesian Networks with the bnlearn R Package. Journal of Statistical Software, 35, 1-22. URL http://www.jstatsoft.org/v35/i03/.

Sekhon, J.S. (2011). Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching Package for R. Journal of Statistical Software, 42, 1-52. URL http://www.jstatsoft.org/v42/i07/.

Examples

##Use simulated data, select subsets using mmpc 
ans<-cov.sel.high(type="mmpc",N=1000, rep=2, Models="Linear", betahat=FALSE, mmhc_score="aic")


##Use simulated data, select subsets using mmpc and estimate ACEs, parallel version
#library(doParallel)
#library(doRNG)
#cl <- makeCluster(4)
#registerDoParallel(cl)
#ans<-cov.sel.high(type="mmpc", parallel=TRUE,  N=500, rep=10, Models="Linear", mmhc_score="aic")
#stopCluster(cl)
