multiPIMboot: Bootstrap the multiPIM Function

Description

This function runs multiPIM once on the actual data, then repeatedly samples with replacement from the rows of the data and reruns multiPIM (with the same options) on each resampled data set, for the desired number of bootstrap replicates.

Usage

multiPIMboot(Y, A, W = NULL, times = 5000, id = 1:nrow(Y), multicore = FALSE, mc.num.jobs, mc.seed = 123, estimator = c("TMLE", "DR-IPCW", "IPCW", "G-COMP"), g.method = "main.terms.logistic", g.sl.cands = NULL, g.num.folds = NULL, g.num.splits = NULL, Q.method = "sl", Q.sl.cands = "default", Q.num.folds = 5, Q.num.splits = 1, Q.type = NULL, adjust.for.other.As = TRUE, truncate = 0.05, return.final.models = TRUE, na.action, verbose = FALSE, extra.cands = NULL, standardize = TRUE, ...)

Arguments

Y

a data frame of outcomes containing only numeric (integer or double) values. See details section of multiPIM for the default method of determining, based on the values in Y, which regression types to allow for modelling Q. Must have unique names.

A

a data frame containing binary exposure variables. Binary means that all values must be either 0 (indicating unexposed, or part of target group) or 1 (indicating exposed or not part of target group). Must have unique names.

W

an optional data frame containing possible confounders of the effects of the variables in A on the variables in Y. No effect measures will be calculated for these variables. May contain numeric (integer or double), or factor values. Must be left as NULL if not required. If not NULL, must have unique names.

times

single integer greater than or equal to 2. The number of bootstrap replicates of Y, A and W to generate and pass to multiPIM.

id

vector which identifies clusters. If observations i and j are in the same cluster, then id[i] should be equal to id[j]. Bootstrapping will be carried out by sampling with replacement from the clusters. Keeping the default value will result in sampling with replacement from the observations (i.e. no clustering).
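The cluster resampling described for id can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's internal code: the toy data, the cluster labels and the names drawn, rows, Y.star and A.star are all hypothetical.

```r
# Hypothetical toy data: 6 observations in 3 clusters of 2.
Y  <- data.frame(y1 = c(0, 1, 1, 0, 1, 0))
A  <- data.frame(a1 = c(1, 0, 1, 1, 0, 0))
id <- c("c1", "c1", "c2", "c2", "c3", "c3")

set.seed(1)

# Draw whole clusters with replacement, then keep every row of each drawn
# cluster (a cluster drawn twice contributes its rows twice).
clusters <- unique(id)
drawn    <- sample(clusters, size = length(clusters), replace = TRUE)
rows     <- unlist(lapply(drawn, function(cl) which(id == cl)))

# One bootstrap replicate of Y and A (W would be subset the same way).
Y.star <- Y[rows, , drop = FALSE]
A.star <- A[rows, , drop = FALSE]
```

With the default id = 1:nrow(Y), every observation is its own cluster, so the same steps reduce to ordinary row resampling.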

multicore

logical value indicating whether bootstrapping should be done using multiple simultaneous jobs. As of multiPIM version 1.3-1 this requires the parallel package, which is distributed with R version 2.14.0 or later. Earlier versions of multiPIM relied on the CRAN packages multicore and rlecuyer for this feature.

mc.num.jobs

number of simultaneous multicore jobs, e.g. if you want to use a quad core processor with hyperthreading, use mc.num.jobs = 8. This must be specified whenever multicore is TRUE. Automatic detection of the number of cores is no longer available.

mc.seed

integer value with which to seed the RNG when using parallel processing (internally, RNGkind will be called to set the RNG to "L'Ecuyer-CMRG"). Ignored if multicore is FALSE. In that case, one “should” (depending on the candidates used) be able to get reproducible results by setting the seed normally (with set.seed) prior to running multiPIMboot.
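As a rough illustration of why the "L'Ecuyer-CMRG" generator suits parallel jobs, the base R sketch below seeds it and derives one independent stream per job with parallel::nextRNGStream. How multiPIMboot actually distributes streams is not shown on this page, so the loop, the job count and the helper draw are assumptions made for the example.

```r
library(parallel)

# Switch to the L'Ecuyer-CMRG generator and seed it.
old.kind <- RNGkind("L'Ecuyer-CMRG")
set.seed(123)

# Derive one independent stream per job (3 hypothetical jobs).
num.jobs <- 3
streams  <- vector("list", num.jobs)
s <- .Random.seed
for (j in seq_len(num.jobs)) {
  streams[[j]] <- s
  s <- nextRNGStream(s)
}

# Hypothetical job: install the job's own stream, then sample.
draw <- function(stream) {
  assign(".Random.seed", stream, envir = globalenv())
  sample(10, 3)
}

# Rerunning the jobs with the same streams reproduces the same draws.
first  <- lapply(streams, draw)
second <- lapply(streams, draw)
reproducible <- identical(first, second)

# Restore the previous generator.
RNGkind(old.kind[1])
```

Because each job starts from its own stream, the results do not depend on the order in which the jobs happen to finish.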

estimator

the estimator to be used. The default is "TMLE", for the targeted maximum likelihood estimator. Alternatively, one may specify "DR-IPCW", for the Double-Robust Inverse Probability of Censoring-Weighted estimator, or "IPCW", for the regular IPCW estimator. If the regular IPCW estimator is selected, all arguments which begin with the letter Q are ignored, since only g (the regression of each exposure on possible confounders) needs to be modeled in this case. The usage also lists "G-COMP"; as noted under the individual arguments below, the arguments relating to g and the truncate argument are ignored when it is selected.

g.method

a length one character vector indicating the regression method to use in modelling g. The default value, "main.terms.logistic", is meant to be used with the default TMLE estimator. If a different estimator is used, it is recommended to use super learning by specifying "sl". In this case, the arguments g.sl.cands, g.num.folds and g.num.splits must also be specified. Other possible values for the g.method argument are: one of the elements of the vector all.bin.cands, or, if extra.cands is supplied, one of the names of the extra.cands list of functions. Ignored if estimator is "G-COMP".

g.sl.cands

character vector of length at least 2 indicating the candidate algorithms that the super learner fits for g should use. The possible values may be taken from the vector all.bin.cands, or from the names of the extra.cands list of functions, if it is supplied. Ignored if estimator is "G-COMP", or if g.method is not "sl". NOTE: The TMLE estimator is recommended, but if one is using either of the IPCW estimators, a reasonable choice is to specify g.method = "sl" and g.sl.cands = default.bin.cands.

g.num.folds

the number of folds to use in cross-validating the super learner fit for g (i.e. the v for v-fold cross-validation). Ignored if estimator is "G-COMP", or if g.method is not "sl".

g.num.splits

the number of times to randomly split the data into g.num.folds folds in cross-validating the super learner fit for g. Cross-validation results will be averaged over all splits. Ignored if estimator is "G-COMP", or if g.method is not "sl".
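To make the roles of g.num.folds and g.num.splits concrete, here is a minimal base R sketch of repeated random fold assignment followed by averaging over splits. It is not the package's implementation; n, fold.ids and the risk values are made up for illustration.

```r
set.seed(1)
n          <- 20  # hypothetical number of observations
num.folds  <- 5   # plays the role of g.num.folds
num.splits <- 2   # plays the role of g.num.splits

# One column per split; each entry is the fold an observation falls in.
fold.ids <- replicate(num.splits, sample(rep_len(seq_len(num.folds), n)))

# Hypothetical cross-validation results for one candidate, one per split,
# averaged over all splits.
cv.by.split <- c(0.61, 0.59)
cv.averaged <- mean(cv.by.split)
```

More splits reduce the dependence of the result on any single random partition, at the cost of refitting every candidate num.folds times per split.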

Q.method

character vector of length 1. The regression method to use in modelling Q. See details to find out which values are allowed. The default value, "sl", indicates that super learning should be used for modelling Q. Ignored if estimator is "IPCW".

Q.sl.cands

either of the length 1 character values "default" or "all", or a character vector of length at least 2 containing elements of either all.bin.cands or of all.cont.cands, or of the names of the extra.cands list of functions, if it is supplied. See details. Ignored if estimator is "IPCW" or if Q.method is not "sl".

Q.num.folds

the number of folds to use in cross-validating the super learner fit for Q (i.e. the v for v-fold cross-validation). Ignored if estimator is "IPCW" or if Q.method is not "sl".

Q.num.splits

the number of times to randomly split the data into Q.num.folds folds in cross-validating the super learner fit for Q. Ignored if estimator is "IPCW" or if Q.method is not "sl".

Q.type

either NULL or a length 1 character vector (which must be either "binary.outcome" or "continuous.outcome"). This provides a way to override the default mechanism for deciding which candidates will be allowed for modeling Q (see details). Ignored if estimator is "IPCW".

adjust.for.other.As

a single logical value indicating whether the other columns of A should be included (for TRUE) or not (for FALSE) in the g and Q models used to calculate the effect of each column of A on each column of Y. See details. Ignored if A has only one column.

truncate

either FALSE, or a single number greater than 0 and less than 0.5 at which the values of g(0, W) should be truncated in order to avoid instability of the estimator. Ignored if estimator is "G-COMP".
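The effect of truncate can be pictured with a small base R sketch. It assumes that truncation means raising values of g(0, W) that fall below the truncation level up to that level; the exact rule used inside multiPIM is not spelled out on this page, and the vector g0 is invented for the example.

```r
# Hypothetical fitted values of g(0, W) for five observations.
g0       <- c(0.01, 0.20, 0.04, 0.65, 0.50)
truncate <- 0.05

# Raise any value below the truncation level up to that level.
g0.trunc <- pmax(g0, truncate)

# Flag whether any value actually had to be changed.
needed.truncation <- any(g0 < truncate)

# Inverse weights are now bounded above by 1 / truncate.
max.weight <- max(1 / g0.trunc)
```

Without truncation the smallest value here, 0.01, would give that observation a weight of 100, which is the kind of instability the argument is meant to avoid.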

return.final.models

single logical value indicating whether final g and Q models should be returned by the function (in the slots g.final.models and Q.final.models). Default is TRUE. If memory is a concern, you will probably want to set this to FALSE. Note that only g and Q models for the main multiPIM run will be returned, not for each of the bootstrap runs.

na.action

currently ignored. If any of Y, A or (a non-null) W has missing values, multiPIMboot will throw an error.

verbose

single logical value indicating whether messages about the progress of the evaluation should be printed. Some of the candidate algorithms may print messages even when verbose is set to FALSE.

extra.cands

a named list of functions. This argument provides a way for the user to specify his or her own functions to use either as stand-alone regression methods, or as candidates for a super learner. See details section of multiPIM.

standardize

logical value indicating whether all predictor variables should be standardized before certain regression methods are run. Passed to all candidates, but only used by some (at this point, lars, penalized.bin and penalized.cont).

...

currently ignored.

Value

param.estimates: a matrix of dimensions ncol(A) by ncol(Y) with rownames equal to names(A) and colnames equal to names(Y), with each element being the estimated causal attributable risk for the exposure given by its row name vs. the outcome given by its column name.
plug.in.stand.errs: a matrix with the same dimensions as param.estimates containing the corresponding plug-in standard errors of the parameter estimates. These are obtained from the influence curve. Note: plug-in standard errors are not available for estimator = "G-COMP". This field will be set to NA in this case.
call: a copy of the call to multiPIMboot which generated this object.
num.exposures: this will be set to ncol(A).
num.outcomes: this will be set to ncol(Y).
W.names: the names attribute of the W data frame, if one was supplied. If no W was supplied, this will be NA.
estimator: the estimator used.
g.method: the method used for modelling g.
g.sl.cands: in case super learning was used for g, the candidates used in the super learner. Will be NA if g.method was not "sl".
g.winning.cands: if super learning was used for g, this will be a named character vector with ncol(A) elements. The ith element will be the name of the candidate which "won" the cross validation in the g model for the ith column of A.
g.cv.risk.array: array with dim attribute c(ncol(A), g.num.splits, length(g.sl.cands)) containing cross-validated risks from super learner modeling for g for each exposure-split-candidate triple. Has informative dimnames attribute. Note: the values are technically not risks, but log likelihoods (i.e. winning candidate is the one for which this is a max, not a min).
g.final.models: a list of length ncol(A) containing the objects returned by the candidate functions used in the final g models (see Candidates).
g.num.folds: the number of folds used for cross validation in the super learner for g. Will be NA if g.method was not "sl".
g.num.splits: the number of splits used for cross validation in the super learner for g. Will be NA if g.method was not "sl".
Q.method: the method used for modeling Q. Will be NA if estimator was "IPCW".
Q.sl.cands: in case super learning was used for Q, the candidates used in the super learner. Will be NA if estimator was "IPCW" or if Q.method was not "sl".
Q.winning.cands: if super learning was used for Q, this will be a named character vector with ncol(Y) elements. The ith element is the name of the candidate which "won" the cross validation in the super learner for the Q model for the ith column of Y.
Q.cv.risk.array: array with dim attribute c(ncol(A), ncol(Y), Q.num.splits, length(Q.sl.cands)) containing cross-validated risks from super learner modeling for Q. Has informative dimnames attribute. Note: the values will be log likelihoods when Q.type is "binary.outcome" (see note above for g.cv.risk.array), and they will be mean squared errors when Q.type is "continuous.outcome".
Q.final.models: a list of length ncol(A), each element of which is another list of length ncol(Y) containing the objects returned by the candidate functions used for the Q models. I.e. Q.final.models[[i]][[j]] contains the Q model information for exposure i and outcome j.
Q.num.folds: the number of folds used for cross validation in the super learner for Q. Will be NA if estimator was "IPCW" or if Q.method was not "sl".
Q.num.splits: the number of splits used for cross validation in the super learner for Q. Will be NA if estimator was "IPCW" or if Q.method was not "sl".
Q.type: either "continuous.outcome" or "binary.outcome", depending on the contents of Y or on the value of the Q.type argument, if supplied.
adjust.for.other.As: logical value indicating whether the other columns of A were included in models used to calculate the effect of each column of A on each column of Y. Will be set to NA when A has only one column.
truncate: the value of the truncate argument. Will be set to NA if estimator was "G-COMP".
truncation.occured: logical value indicating whether it was necessary to truncate. FALSE when truncate is FALSE. Will be set to NA if estimator was "G-COMP".
standardize: the value of the standardize argument.
boot.param.array: a three dimensional array with dim attribute equal to c(times, ncol(A), ncol(Y)) containing the corresponding parameter estimate for each bootstrap replicate-exposure-outcome trio. Also has an informative dimnames attribute for easy printing.
main.time: time (in seconds) taken for main run of multiPIM on the original data.
g.time: time in seconds taken for running g models.
Q.time: time in seconds taken for running Q models.
g.sl.time: if g.method is "sl", time in seconds taken for running cross-validation of g models.
Q.sl.time: if Q.method is "sl", time in seconds taken for running cross-validation of Q models.
g.sl.cand.times: if g.method is "sl", named vector containing time taken, with each element corresponding to a super learner candidate for g.
Q.sl.cand.times: if Q.method is "sl", named vector containing time taken, with each element corresponding to a super learner candidate for Q.
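As an illustration of how g.cv.risk.array relates to g.winning.cands, the sketch below builds a stand-in array with the documented layout (exposure, split, candidate), averages over splits and picks the candidate with the largest value, since the entries for g are log likelihoods. The numbers and the names A1, A2, cand1 to cand3 are hypothetical, and the package may select winners differently in detail.

```r
# Hypothetical array: 2 exposures, 2 splits, 3 candidates.
cv <- array(
  c(-0.70, -0.62, -0.69, -0.60,   # cand1
    -0.66, -0.65, -0.67, -0.66,   # cand2
    -0.71, -0.58, -0.72, -0.59),  # cand3
  dim = c(2, 2, 3),
  dimnames = list(c("A1", "A2"), c("split1", "split2"),
                  c("cand1", "cand2", "cand3"))
)

# Average over the split dimension: one row per exposure,
# one column per candidate.
avg <- apply(cv, c(1, 3), mean)

# Log likelihoods: the winner is the maximum, not the minimum.
winners <- colnames(avg)[apply(avg, 1, which.max)]
names(winners) <- rownames(avg)
winners
```

For Q.cv.risk.array with continuous outcomes the entries are mean squared errors, so the same sketch would use which.min instead.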

Details

Bootstrap standard errors can be calculated by running the summary function on the multiPIMboot result (see summary.multiPIM).
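For intuition, a bootstrap standard error can be thought of as the standard deviation of the bootstrap estimates across replicates. The base R sketch below applies that idea to a simulated stand-in for boot.param.array; it assumes this is the quantity of interest and does not reproduce the exact computation in summary.multiPIM.

```r
set.seed(1)

# Stand-in for boot.param.array: 200 replicates, 2 exposures, 1 outcome.
boot.param.array <- array(
  rnorm(200 * 2 * 1, mean = 0.1, sd = 0.05),
  dim = c(200, 2, 1),
  dimnames = list(NULL, c("A1", "A2"), "Y1")
)

# Standard deviation over the replicate dimension for each
# exposure-outcome pair; the result has ncol(A) rows and ncol(Y) columns.
boot.se <- apply(boot.param.array, c(2, 3), sd)
boot.se
```

The result has the same shape as param.estimates, so the two can be compared element by element alongside plug.in.stand.errs.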

As of multiPIM version 1.3-1, support for multicore processing is through R's parallel package (distributed with R as of version 2.14.0).

For more details on how to use the arguments, see the details section for multiPIM.

References

Ritter, Stephan J., Jewell, Nicholas P. and Hubbard, Alan E. (2014) “R Package multiPIM: A Causal Inference Approach to Variable Importance Analysis.” Journal of Statistical Software 57, 8: 1-29. http://www.jstatsoft.org/v57/i08/.

Hubbard, Alan E. and van der Laan, Mark J. (2008) “Population Intervention Models in Causal Inference.” Biometrika 95, 1: 35-47.

Young, Jessica G., Hubbard, Alan E., Eskenazi, Brenda, and Jewell, Nicholas P. (2009) “A Machine-Learning Algorithm for Estimating and Ranking the Impact of Environmental Risk Factors in Exploratory Epidemiological Studies.” U.C. Berkeley Division of Biostatistics Working Paper Series, Working Paper 250. http://www.bepress.com/ucbbiostat/paper250

van der Laan, Mark J. and Rose, Sherri (2011) Targeted Learning, Springer, New York. ISBN: 978-1441997814

Sinisi, Sandra E., Polley, Eric C., Petersen, Maya L, Rhee, Soo-Yon and van der Laan, Mark J. (2007) “Super learning: An Application to the Prediction of HIV-1 Drug Resistance.” Statistical Applications in Genetics and Molecular Biology 6, 1: article 7. http://www.bepress.com/sagmb/vol6/iss1/art7

van der Laan, Mark J., Polley, Eric C. and Hubbard, Alan E. (2007) “Super learner.” Statistical Applications in Genetics and Molecular Biology 6, 1: article 25. http://www.bepress.com/sagmb/vol6/iss1/art25

Examples

## Warning: This would take a very long time to run!
## Not run:
## load example from multiPIM help file

example(multiPIM)

## this would run 5000 bootstrap replicates:

boot.result <- multiPIMboot(Y, A)

summary(boot.result)
## End(Not run)