Function mi_pre derives a sparse ensemble of rules and/or linear rules based on imputed data. The function is still experimental, so use it at your own risk.
mi_pre(
formula,
data,
weights = NULL,
obs_ids = NULL,
compl_frac = NULL,
nfolds = 10L,
sampfrac = 0.5,
...
)
Returns an object of class pre.
formula: a symbolic description of the model to be fit, of the form y ~ x1 + x2 + ... + xn. The response (left-hand side of the formula) should be of class numeric (for family = "gaussian" or "mgaussian"), integer (for family = "poisson"), or factor (for family = "binomial" or "multinomial"). See Examples below. Note that the minus sign (-) may not be used in the formula to omit the intercept or variables in data, and + 0 should not be used to omit the intercept. To omit the intercept from the final ensemble, add intercept = FALSE to the call (although omitting the intercept will only very rarely be appropriate). To omit variables from the final ensemble, make sure they are excluded from data.
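For illustration, a hedged sketch of valid and invalid formula specifications (imp is a list of imputed datasets, as constructed in the Examples below):
## Ozone is numeric, matching family = "gaussian" as described above
airq.ens <- mi_pre(Ozone ~ Wind + Temp, data = imp)
## Not allowed: Ozone ~ . - Wind  or  Ozone ~ . + 0
## To drop a predictor, remove its column from every imputed dataset;
## to drop the intercept, add intercept = FALSE to the call instead.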
data: a list of imputed datasets. The datasets must have identically named columns, but need not have the same number of rows (this can happen, for example, if a bootstrap sampling approach has been employed for multiple imputation).
weights: a list of observation weights for each observation in each imputed dataset. The list must have the same length as data, and each element must be a numeric vector of length identical to the number of rows of the corresponding imputed dataset in data. The default is NULL, yielding constant observation weights \(w_i = 1/M\), where \(M\) is the number of imputed datasets (i.e., length(data)).
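For illustration, the default constant weights could be constructed by hand as follows (a sketch; imp is a list of imputed datasets as in the Examples below):
## Rebuild the default observation weights w_i = 1/M explicitly,
## where M is the number of imputed datasets
M <- length(imp)
wts <- lapply(imp, function(d) rep(1 / M, nrow(d)))
airq.ens.w <- mi_pre(Ozone ~ ., data = imp, weights = wts)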
obs_ids: a list of observation ids, corresponding to the ids in the original data, of each observation in each imputed dataset. Defaults to NULL, which assumes that the imputed datasets contain the observations in identical order. If specified, the list must have the same length as data, and each element must be a numeric or character vector of length identical to the number of rows of the corresponding imputed dataset in data. At least some of the observation ids must be repeated one or more times, within or between imputed datasets.
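If the imputed datasets store the original observations in identical order, an explicit obs_ids list could, for example, be built from the row names (a hedged sketch; imp as in the Examples below):
## One id vector per imputed dataset; each original observation id is
## repeated across the imputed datasets
ids <- lapply(imp, rownames)
airq.ens.ids <- mi_pre(Ozone ~ ., data = imp, obs_ids = ids)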
compl_frac: an optional list specifying the fraction of observed values for each observation. This will be used to compute observation weights as a function of the fraction of complete data per observation, as per Wan et al. (2015), but note that this is only recommended for users who know the risks (i.e., it makes the analysis more like a complete-case analysis). If specified, the list must have the same length as data, and each element must be a numeric vector of length identical to the number of rows of the corresponding imputed dataset in data.
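A sketch of how such a list might be computed, here simply as the fraction of non-missing values per row of the incomplete data (airq and imp as in the Examples below); see the Details above and below for why this weighting is generally not advised:
## Fraction of observed values per observation, computed once from the
## incomplete data and repeated for every imputed dataset
frac <- rowMeans(!is.na(airq))
cf <- lapply(imp, function(d) frac)
airq.ens.cf <- mi_pre(Ozone ~ ., data = imp, compl_frac = cf)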
nfolds: positive integer. The number of cross-validation folds used to select the optimal value of the penalty parameter \(\lambda\) for the final ensemble.
sampfrac: numeric value \(> 0\) and \(\le 1\). Specifies the fraction of randomly selected training observations used to produce each tree. Values \(< 1\) will result in sampling without replacement (i.e., subsampling); a value of 1 will result in sampling with replacement (i.e., bootstrap sampling). Alternatively, a sampling function may be supplied, which should take arguments n (sample size) and weights; see the sketch below.
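For example, a custom subsampling function might look as follows (a sketch; it is assumed here that the function should return the row indices of the sampled observations, and the weights argument is ignored):
## Draw a 75% subsample without replacement for each tree
samp_func <- function(n, weights) sample(1:n, size = floor(n * 0.75), replace = FALSE)
airq.ens.sf <- mi_pre(Ozone ~ ., data = imp, sampfrac = samp_func)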
...: further arguments to be passed to cv.glmnet.
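For instance, arguments of cv.glmnet such as type.measure could be supplied through ... (a sketch):
## Use 5 CV folds and mean squared error as the CV criterion for lambda
airq.ens.cv <- mi_pre(Ozone ~ ., data = imp, nfolds = 5, type.measure = "mse")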
Experimental function to fit a prediction rule ensemble to multiply imputed data. Essentially, it is a wrapper around function pre(); the main differences relate to sampling for tree induction and to fold assignment for estimating the coefficients of the final ensemble. Function mi_pre implements a so-called stacking approach to the analysis of imputed data (see also Wood et al., 2008), where the imputed datasets are combined into one large dataset.
In addition to adjustments of the sampling procedures, adjustments to the observation weights are made to counter the artificial inflation of the sample size. To avoid overfitting, observations that occur repeatedly across the imputed datasets are always included in or excluded from each sample or fold as a whole. Thus, complete observations rather than individual imputed observations are sampled, both for tree and rule induction and for the cross-validation used to select the penalty parameter value for the final ensemble.
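Conceptually, the core of this stacking approach amounts to the following (a simplified sketch that omits the sample- and fold-level adjustments mi_pre performs; imp is a list of imputed datasets as in the Examples below):
## Stack the M imputed datasets into one large dataset and give every row
## weight 1/M, so the effective sample size matches the original data
M <- length(imp)
stacked <- do.call(rbind, imp)
w <- rep(1 / M, nrow(stacked))
## A naive call such as pre(Ozone ~ ., data = stacked, weights = w) would,
## however, not keep all copies of an observation together within samples
## and CV folds; that is precisely what mi_pre() adds on top of this idea.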
It is assumed that the data have already been imputed (using, e.g., R package mice or missForest); function mi_pre therefore takes a list of imputed datasets as input data.
Although the option to use the fraction of complete data for computing observation weights is provided through argument compl_frac, users are not advised to use it. See, e.g., Du et al. (2022): "An alternative weight specification, proposed in Wan et al. (2015), is o_i = f_i/D, where f_i is the number of observed predictors out of the total number of predictors for subject i [...] upweighting subjects with less missingness and downweighting subjects with more missingness can, in some sense, be viewed as making the optimization more like complete-case analysis, which might be problematic for Missing at Random (MAR) and Missing not at Random (MNAR) scenarios."
Du, J., Boss, J., Han, P., Beesley, L. J., Kleinsasser, M., Goutman, S. A., ... & Mukherjee, B. (2022). Variable selection with multiply-imputed datasets: Choosing between stacked and grouped methods. Journal of Computational and Graphical Statistics, 31(4), 1063-1075. https://doi.org/10.1080/10618600.2022.2035739
Wood, A. M., White, I. R., & Royston, P. (2008). How should variable selection be performed with multiply imputed data? Statistics in Medicine, 27(17), 3227-3246. https://doi.org/10.1002/sim.3177
See also: pre, mi_mean.
library("mice")
set.seed(42)
## Shoot extra holes in airquality data
airq <- sapply(airquality, function(x) {
x[sample(1:nrow(airquality), size = 25)] <- NA
return(x)
})
## impute the data
imp <- mice(airq, m = 5)
imp <- as.list(complete(imp, action = "all"))
## fit a rule ensemble to the imputed data
set.seed(42)
airq.ens.mi <- mi_pre(Ozone ~ . , data = imp)
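The fitted ensemble can then be inspected with the usual methods for objects of class pre (a brief sketch):
## Inspect the selected rules/terms and their coefficients
summary(airq.ens.mi)
coef(airq.ens.mi)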