refmodel-init-get: Reference model structure

Description

Function get_refmodel() is a generic function for creating the reference model structure from a specific object. The methods for get_refmodel() usually call init_refmodel() which is the underlying workhorse (and may also be used directly without a call to get_refmodel()). Some arguments are for $K$-fold cross-validation ($K$-fold CV) only; see cv_varsel() for the use of $K$-fold CV in projpred.

Usage

get_refmodel(object, ...)
# S3 method for refmodel
get_refmodel(object, ...)
# S3 method for vsel
get_refmodel(object, ...)
# S3 method for default
get_refmodel(object, formula, family = NULL, ...)
# S3 method for stanreg
get_refmodel(object, ...)
init_refmodel(
  object,
  data,
  formula,
  family,
  ref_predfun = NULL,
  div_minimizer = NULL,
  proj_predfun = NULL,
  extract_model_data,
  cvfun = NULL,
  cvfits = NULL,
  dis = NULL,
  cvrefbuilder = NULL,
  ...
)

Value

An object that can be passed to all the functions that take the reference model fit as the first argument, such as varsel(), cv_varsel(), project(), proj_linpred(), and proj_predict(). Usually, the returned object is of class refmodel. However, if object

is NULL, the returned object is of class datafit as well as of class refmodel (with datafit being first). Objects of class datafit are handled differently at several places throughout this package.

Arguments

object: Object from which the reference model is created. For init_refmodel(), an object on which the functions from arguments extract_model_data and ref_predfun can be applied, with a NULL object being treated specially (see section "Value" below). For get_refmodel.default(), an object on which function family() can be applied to retrieve the family (if argument family is NULL), additionally to the properties required for init_refmodel(). For non-default methods of get_refmodel(), an object of the corresponding class.
...: For get_refmodel.default() and get_refmodel.stanreg(): arguments passed to init_refmodel(). For the get_refmodel() generic: arguments passed to the appropriate method. Else: ignored.
formula: Reference model's formula. For general information on formulas in R, see formula. For multilevel formulas, see also package lme4 (in particular, functions lme4::lmer() and lme4::glmer()). For additive formulas, see also packages mgcv (in particular, function mgcv::gam()) and gamm4 (in particular, function gamm4::gamm4()) as well as the notes in section "Formula terms" below.
family: A family object representing the observational model (i.e., the distributional family for the response). May be NULL for get_refmodel.default() in which case the family is retrieved from object.
data: Data used for fitting the reference model. Any contrasts attributes of the dataset's columns are silently removed.
ref_predfun: Prediction function for the linear predictor of the reference model, including offsets (if existing). See also section "Arguments ref_predfun, proj_predfun, and div_minimizer" below. If object is NULL, ref_predfun is ignored and an internal default is used instead.
div_minimizer: A function for minimizing the Kullback-Leibler (KL) divergence from a submodel to the reference model (i.e., for performing the projection of the reference model onto a submodel). The output of div_minimizer is used, e.g., by proj_predfun's argument fits. See also section "Arguments ref_predfun, proj_predfun, and div_minimizer" below.
proj_predfun: Prediction function for the linear predictor of a submodel onto which the reference model is projected. See also section "Arguments ref_predfun, proj_predfun, and div_minimizer" below.
extract_model_data: A function for fetching some variables (response, observation weights, offsets) from the original dataset (i.e., the dataset used for fitting the reference model) or from a new dataset. See also section "Argument extract_model_data" below.
cvfun: For $K$-fold CV only. A function that, given a fold indices vector, fits the reference model separately for each fold and returns the $K$ model fits as a list. Each of the $K$ model fits needs to be a list. If object is NULL, cvfun may be NULL for using an internal default. Only one of cvfits and cvfun needs to be provided (for $K$-fold CV). Note that cvfits takes precedence over cvfun, i.e., if both are provided, cvfits is used.
cvfits: For $K$-fold CV only. A list containing a sub-list called fits containing the $K$ model fits from which reference model structures are created. The cvfits list (i.e., the super-list) needs to have attributes K and folds: K has to be a single integer giving the number of folds and folds has to be an integer vector giving the fold indices (one fold index per observation). Each element of cvfits$fits (i.e., each of the $K$ model fits) needs to be a list. Only one of cvfits and cvfun needs to be provided (for $K$-fold CV). Note that cvfits takes precedence over cvfun, i.e., if both are provided, cvfits is used.
dis: A vector of posterior draws for the dispersion parameter (if existing). May be NULL if the model has no dispersion parameter or if the model does have a dispersion parameter, but object is NULL (in which case 0 is used for dis). Note that for the gaussian() family, dis is the standard deviation, not the variance.
cvrefbuilder: For $K$-fold CV only. A function that, given a reference model fit for fold $k \in \{1, ..., K\}$ (this model fit is the $k$-th element of the return value of cvfun or the $k$-th element of cvfits$fits, extended by elements omitted (containing the indices of the left-out observations in that fold) and projpred_k (containing the integer $k$)), returns an object of the same type as init_refmodel() does. Argument cvrefbuilder may be NULL for using an internal default: get_refmodel() if object is not NULL and a function calling init_refmodel() appropriately (with the assumption dis = 0) if object is NULL.

Formula terms

For additive models (still an experimental feature), only mgcv::s() and mgcv::t2() are currently supported as smooth terms. Furthermore, these need to be called without any arguments apart from the predictor names (symbols). For example, for smoothing the effect of a predictor x, only s(x) or t2(x) are allowed. As another example, for smoothing the joint effect of two predictors x and z, only s(x, z) or t2(x, z) are allowed (and analogously for higher-order joint effects, e.g., of three predictors).

Arguments <code>ref_predfun</code>, <code>proj_predfun</code>, and <code>div_minimizer</code>

Arguments ref_predfun, proj_predfun, and div_minimizer may be NULL for using an internal default. Otherwise, let $N$ denote the number of observations (in case of CV, these may be reduced to each fold), $S_{\mbox{ref}}$ the number of posterior draws for the reference model's parameters, and $S_{\mbox{prj}}$ the number of (possibly clustered) parameter draws for projection (short: the number of projected draws). Then the functions supplied to these arguments need to have the following prototypes:

ref_predfun: ref_predfun(fit, newdata = NULL) where:
- fit accepts the reference model fit as given in argument object (but possibly re-fitted to a subset of the observations, as done in $K$-fold CV).
- newdata accepts either NULL (for using the original dataset, typically stored in fit) or data for new observations (at least in the form of a data.frame).
proj_predfun: proj_predfun(fits, newdata) where:
- fits accepts a list of length $S_{\mbox{prj}}$ containing this number of submodel fits. This list is the same as that returned by project() in its output element submodl (which in turn is the same as the return value of div_minimizer, except if project() was used with an object of class vsel based on an L1 search as well as with refit_prj = FALSE).
- newdata accepts data for new observations (at least in the form of a data.frame).
div_minimizer does not need to have a specific prototype, but it needs to be able to be called with the following arguments:
- formula accepts either a standard formula with a single response (if $S_{\mbox{prj}} = 1$) or a formula with $S_{\mbox{prj}} > 1$ response variables cbind()-ed on the left-hand side in which case the projection has to be performed for each of the response variables separately.
- data accepts a data.frame to be used for the projection.
- family accepts a family object.
- weights accepts either observation weights (at least in the form of a numeric vector) or NULL (for using a vector of ones as weights).
- projpred_var accepts an $N \times S_{\mbox{prj}}$ matrix of predictive variances (necessary for projpred's internal GLM fitter).
- projpred_regul accepts a single numeric value as supplied to argument regul of project(), for example.
- ... accepts further arguments specified by the user.

The return value of these functions needs to be:

ref_predfun: an $N \times S_{\mbox{ref}}$ matrix.
proj_predfun: an $N \times S_{\mbox{prj}}$ matrix.
div_minimizer: a list of length $S_{\mbox{prj}}$ containing this number of submodel fits.

Argument <code>extract_model_data</code>

The function supplied to argument extract_model_data needs to have the prototype

extract_model_data(object, newdata, wrhs = NULL, orhs = NULL, extract_y = TRUE)

where:

object accepts the reference model fit as given in argument object (but possibly re-fitted to a subset of the observations, as done in $K$-fold CV).
newdata accepts either NULL (for using the original dataset, typically stored in object) or data for new observations (at least in the form of a data.frame).
wrhs accepts at least either NULL (for using a vector of ones) or a right-hand side formula consisting only of the variable in newdata containing the weights.
orhs accepts at least either NULL (for using a vector of zeros) or a right-hand side formula consisting only of the variable in newdata containing the offsets.
extract_y accepts a single logical value indicating whether output element y (see below) shall be NULL (TRUE) or not (FALSE).

The return value of extract_model_data needs to be a list with elements y, weights, and offset, each being a numeric vector containing the data for the response, the observation weights, and the offsets, respectively. An exception is that y may also be NULL (depending on argument extract_y).

Examples

Run this code

if (requireNamespace("rstanarm", quietly = TRUE)) {
  # Data:
  dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x)

  # The "stanreg" fit which will be used as the reference model (with small
  # values for `chains` and `iter`, but only for technical reasons in this
  # example; this is not recommended in general):
  fit <- rstanarm::stan_glm(
    y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss,
    QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876
  )

  # Define the reference model explicitly:
  ref <- get_refmodel(fit)
  print(class(ref)) # gives `"refmodel"`
  # Now see, for example, `?varsel`, `?cv_varsel`, and `?project` for
  # possible post-processing functions. Most of the post-processing functions
  # call get_refmodel() internally at the beginning, so you will rarely need
  # to call get_refmodel() yourself.

  # A custom reference model which may be used in a variable selection where
  # the candidate predictors are not a subset of those used for the reference
  # model's predictions:
  ref_cust <- init_refmodel(
    fit,
    data = dat_gauss,
    formula = y ~ X6 + X7,
    family = gaussian(),
    extract_model_data = function(object, newdata = NULL, wrhs = NULL,
                                  orhs = NULL, extract_y = TRUE) {
      if (!extract_y) {
        resp_form <- NULL
      } else {
        resp_form <- ~ y
      }

      if (is.null(newdata)) {
        newdata <- dat_gauss
      }

      args <- projpred:::nlist(object, newdata, wrhs, orhs, resp_form)
      return(projpred::do_call(projpred:::.extract_model_data, args))
    },
    cvfun = function(folds) {
      kfold(
        fit, K = max(folds), save_fits = TRUE, folds = folds, cores = 1
      )$fits[, "fit"]
    },
    dis = as.matrix(fit)[, "sigma"]
  )
  # Now, the post-processing functions mentioned above (for example,
  # varsel(), cv_varsel(), and project()) may be applied to `ref_cust`.
}

Run the code above in your browser using DataLab