Function mi_pre derives a sparse ensemble of rules and/or linear rules based on imputed data. The function is still experimental, so use it at your own risk.
mi_pre(
formula,
data,
weights = NULL,
obs_ids = NULL,
compl_frac = NULL,
nfolds = 10L,
sampfrac = 0.5,
...
)
Returns an object of class pre.
formula: a symbolic description of the model to be fit, of the form y ~ x1 + x2 + ... + xn. The response (left-hand side of the formula) should be of class numeric (for family = "gaussian" or "mgaussian"), integer (for family = "poisson"), or factor (for family = "binomial" or "multinomial"). See Examples below. Note that the minus sign (-) may not be used in the formula to omit the intercept or variables in data, and + 0 should not be used to omit the intercept. To omit the intercept from the final ensemble, add intercept = FALSE to the call (although omitting the intercept will only very rarely be appropriate). To omit variables from the final ensemble, make sure they are excluded from data.
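For illustration, a hedged sketch of valid and invalid formula specifications (imp is a list of imputed datasets, as constructed in the Examples below):
## Ozone is numeric, matching family = "gaussian" as described above
airq.ens <- mi_pre(Ozone ~ Wind + Temp, data = imp)
## Not allowed: Ozone ~ . - Wind  or  Ozone ~ . + 0
## To drop a predictor, remove its column from every imputed dataset;
## to drop the intercept, add intercept = FALSE to the call instead.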
data: a list of imputed datasets. The datasets must have identically named columns, but need not have the same number of rows (this can happen, for example, if a bootstrap sampling approach has been employed for multiple imputation).
weights: a list of observation weights for each observation in each imputed dataset. The list must have the same length as data, and each element must be a numeric vector of length identical to the number of rows of the corresponding imputed dataset in data. The default is NULL, yielding constant observation weights \(w_i = 1/M\), where \(M\) is the number of imputed datasets (i.e., length(data)).
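For illustration, the default constant weights could be constructed by hand as follows (a sketch; imp is a list of imputed datasets as in the Examples below):
## Rebuild the default observation weights w_i = 1/M explicitly,
## where M is the number of imputed datasets
M <- length(imp)
wts <- lapply(imp, function(d) rep(1 / M, nrow(d)))
airq.ens.w <- mi_pre(Ozone ~ ., data = imp, weights = wts)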
obs_ids: a list of observation ids, corresponding to the ids in the original data, of each observation in each imputed dataset. Defaults to NULL, which assumes that the imputed datasets contain the observations in identical order. If specified, the list must have the same length as data, and each element must be a numeric or character vector of length identical to the number of rows of the corresponding imputed dataset in data. At least some of the observation ids must be repeated one or more times, within or between imputed datasets.
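If the imputed datasets store the original observations in identical order, an explicit obs_ids list could, for example, be built from the row names (a hedged sketch; imp as in the Examples below):
## One id vector per imputed dataset; each original observation id is
## repeated across the imputed datasets
ids <- lapply(imp, rownames)
airq.ens.ids <- mi_pre(Ozone ~ ., data = imp, obs_ids = ids)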
compl_frac: an optional list specifying the fraction of observed values for each observation. This will be used to compute observation weights as a function of the fraction of complete data per observation, as per Wan et al. (2015), but note that this is only recommended for users who know the risks (i.e., it makes the analysis more like a complete-case analysis). If specified, the list must have the same length as data, and each element must be a numeric vector of length identical to the number of rows of the corresponding imputed dataset in data.
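A sketch of how such a list might be computed, here simply as the fraction of non-missing values per row of the incomplete data (airq and imp as in the Examples below); see the Details above and below for why this weighting is generally not advised:
## Fraction of observed values per observation, computed once from the
## incomplete data and repeated for every imputed dataset
frac <- rowMeans(!is.na(airq))
cf <- lapply(imp, function(d) frac)
airq.ens.cf <- mi_pre(Ozone ~ ., data = imp, compl_frac = cf)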
nfolds: positive integer. The number of cross-validation folds used to select the optimal value of the penalty parameter \(\lambda\) for the final ensemble.
sampfrac: numeric value \(> 0\) and \(\le 1\). Specifies the fraction of randomly selected training observations used to produce each tree. Values \(< 1\) will result in sampling without replacement (i.e., subsampling); a value of 1 will result in sampling with replacement (i.e., bootstrap sampling). Alternatively, a sampling function may be supplied, which should take arguments n (sample size) and weights; see the sketch below.
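For example, a custom subsampling function might look as follows (a sketch; it is assumed here that the function should return the row indices of the sampled observations, and the weights argument is ignored):
## Draw a 75% subsample without replacement for each tree
samp_func <- function(n, weights) sample(1:n, size = floor(n * 0.75), replace = FALSE)
airq.ens.sf <- mi_pre(Ozone ~ ., data = imp, sampfrac = samp_func)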
...: further arguments to be passed to cv.glmnet.
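For instance, arguments of cv.glmnet such as type.measure could be supplied through ... (a sketch):
## Use 5 CV folds and mean squared error as the CV criterion for lambda
airq.ens.cv <- mi_pre(Ozone ~ ., data = imp, nfolds = 5, type.measure = "mse")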
Experimental function to fit a prediction rule ensemble to multiply imputed data. Essentially, it is a wrapper around function pre(); the main differences relate to sampling for tree induction and to fold assignment for estimating the coefficients of the final ensemble. Function mi_pre implements a so-called stacking approach to the analysis of imputed data (see also Wood et al., 2008), where the imputed datasets are combined into one large dataset.
In addition to adjustments of the sampling procedures, adjustments to the observation weights are made to counter the artificial inflation of the sample size. To avoid overfitting, observations that occur repeatedly across the imputed datasets are always included in or excluded from each sample or fold as a whole. Thus, complete observations rather than individual imputed observations are sampled, both for tree and rule induction and for the cross-validation used to select the penalty parameter value for the final ensemble.
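Conceptually, the core of this stacking approach amounts to the following (a simplified sketch that omits the sample- and fold-level adjustments mi_pre performs; imp is a list of imputed datasets as in the Examples below):
## Stack the M imputed datasets into one large dataset and give every row
## weight 1/M, so the effective sample size matches the original data
M <- length(imp)
stacked <- do.call(rbind, imp)
w <- rep(1 / M, nrow(stacked))
## A naive call such as pre(Ozone ~ ., data = stacked, weights = w) would,
## however, not keep all copies of an observation together within samples
## and CV folds; that is precisely what mi_pre() adds on top of this idea.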
It is assumed that the data have already been imputed (using, e.g., R package mice or missForest); function mi_pre therefore takes a list of imputed datasets as input data.
Although the option to use the fraction of complete data for computing observation weights is provided through argument compl_frac, users are not advised to use it. See, e.g., Du et al. (2022): "An alternative weight specification, proposed in Wan et al. (2015), is o_i = f_i/D, where f_i is the number of observed predictors out of the total number of predictors for subject i [...] upweighting subjects with less missingness and downweighting subjects with more missingness can, in some sense, be viewed as making the optimization more like complete-case analysis, which might be problematic for Missing at Random (MAR) and Missing not at Random (MNAR) scenarios."
Du, J., Boss, J., Han, P., Beesley, L. J., Kleinsasser, M., Goutman, S. A., ... & Mukherjee, B. (2022). Variable selection with multiply-imputed datasets: Choosing between stacked and grouped methods. Journal of Computational and Graphical Statistics, 31(4), 1063-1075. https://doi.org/10.1080/10618600.2022.2035739
Wood, A. M., White, I. R., & Royston, P. (2008). How should variable selection be performed with multiply imputed data? Statistics in Medicine, 27(17), 3227-3246. https://doi.org/10.1002/sim.3177
See also: pre, mi_mean.
library("mice")
set.seed(42)
## Shoot extra holes in airquality data
airq <- sapply(airquality, function(x) {
x[sample(1:nrow(airquality), size = 25)] <- NA
return(x)
})
## impute the data
imp <- mice(airq, m = 5)
imp <- as.list(complete(imp, action = "all"))
## fit a rule ensemble to the imputed data
set.seed(42)
airq.ens.mi <- mi_pre(Ozone ~ . , data = imp)
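The fitted ensemble can then be inspected with the usual methods for objects of class pre (a brief sketch):
## Inspect the selected rules/terms and their coefficients
summary(airq.ens.mi)
coef(airq.ens.mi)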