filter_train currently fits an elastic net or a random forest to
find a reduced set of variables that are likely associated with the outcome (Y)
and/or treatment (A). Current options include:
1. glmnet: Wrapper for the function "glmnet" from the glmnet package. Here,
variables with estimated elastic net coefficients of 0 are filtered out. Uses LM/GLM/Cox
elastic net for family="gaussian", "binomial", "survival" respectively. Default is to
regress Y~ENET(X) with hyper-parameters:
hyper = list(lambda="lambda.min", family="gaussian", interaction=FALSE)
If interaction=TRUE, then Y~ENET(X,X*A), and variables with estimated coefficients of
zero in both the main effects (X) and treatment-interactions (X*A) are filtered. This
aims to find variables that are prognostic and/or predictive.
2. ranger: Wrapper for the function "ranger" (ranger R package) to calculate
random forest based variable importance (VI) p-values. Here, for the test of VI>0,
variables are filtered out if their one-sided p-value>=0.10. P-values are obtained
through subsampling based T-statistics, T=VI_j/SE(VI_j) for feature j, with SE(VI_j)
estimated by the delete-d jackknife, as described in Ishwaran and Lu 2017. Used for
continuous, binary, or survival outcomes. Default hyper-parameters are:
hyper=list(b=0.66, K=200, DF2=FALSE, FDR=FALSE, pval.thres=0.10)
where b is the proportion of the total data to sample (default=66%), K is the number of
subsamples, FDR applies an FDR based multiplicity correction to the p-values, and
pval.thres is the filtering threshold (default=0.10). If DF2=TRUE, fits Y~ranger(X, XA)
and calculates VI_2DF = VI_X+VI_XA, the variable importance of the main effect + the
interaction effect (joint test). Var(VI_2DF) = Var(VI_X)+Var(VI_XA)+2cov(VI_X, VI_XA),
where each component is calculated using the subsampling approach described above.
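
The p-value filter in option 2 can be sketched in base R as follows. This is an
illustrative sketch only, not the package's implementation: the helper vi_pvalues and
the toy VI values are hypothetical, and the delete-d jackknife variance is assumed to
take the form m/((n-m)K) * sum_k (VI_k - VI_full)^2, where m is the subsample size.
The normal approximation for the one-sided test is also an assumption.

```r
# Hypothetical helper (not part of the package): one-sided p-values for the
# test of VI > 0, given VI from the full data and from K subsamples.
vi_pvalues <- function(vi_full, vi_sub, n, b = 0.66) {
  # vi_full: named numeric vector of full-data VI (length p)
  # vi_sub:  K x p matrix of VI, one row per subsample of size m
  m <- round(b * n)
  K <- nrow(vi_sub)
  # Assumed delete-d jackknife variance: squared deviations of the subsample
  # estimates from the full-data estimate, scaled by m / ((n - m) * K).
  dev2 <- sweep(vi_sub, 2, vi_full)^2
  se <- sqrt(m / ((n - m) * K) * colSums(dev2))
  t_stat <- vi_full / se
  # One-sided p-value via a normal approximation (an assumption here).
  data.frame(var = names(vi_full), VI = vi_full, SE = se,
             T = t_stat, pval = 1 - pnorm(t_stat),
             stringsAsFactors = FALSE)
}

# Toy inputs (made up for illustration): x1 has a clearly positive VI,
# x2 has a full-data VI of exactly zero.
set.seed(1)
n <- 500
K <- 200
vi_full <- c(x1 = 0.50, x2 = 0.00)
vi_sub <- cbind(x1 = rnorm(K, mean = 0.50, sd = 0.05),
                x2 = rnorm(K, mean = 0.00, sd = 0.05))

res <- vi_pvalues(vi_full, vi_sub, n = n, b = 0.66)
# Keep variables whose one-sided p-value is below the threshold.
keep <- res$var[res$pval < 0.10]
```

With these toy inputs, x2 has T=0 and therefore a p-value of 0.5, so it is filtered
out, while x1 is retained. For the DF2 joint test, the same subsample matrix could be
used to estimate the covariance term between VI_X and VI_XA.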