Usage

multiPIM(Y, A, W = NULL,
         estimator = c("TMLE", "DR-IPCW", "IPCW", "G-COMP"),
         g.method = "main.terms.logistic", g.sl.cands = NULL,
         g.num.folds = NULL, g.num.splits = NULL,
         Q.method = "sl", Q.sl.cands = "default",
         Q.num.folds = 5, Q.num.splits = 1, Q.type = NULL,
         adjust.for.other.As = TRUE, truncate = 0.05,
         return.final.models = TRUE, na.action,
         check.input = TRUE, verbose = FALSE,
         extra.cands = NULL, standardize = TRUE, ...)
Arguments

Y: the data frame of outcomes. Its contents are used, in some cases, to determine which regression types to allow for modelling Q. Must have unique names.

A: the data frame of exposure variables, which must be binary, with the value 0 indicating membership in the target (or unexposed) group and the value 1 indicating membership in the non-target (or exposed) group (see details). Must have unique names.

W: an optional data frame of possible confounders of the effects of the variables in A on the variables in Y. No effect measures will be calculated for these variables. May contain numeric (integer or double), or factor values. Must be left as NULL if not required. See details.

estimator: the default is "TMLE", for the targeted maximum likelihood estimator. Alternatively, one may specify "DR-IPCW", for the Double-Robust Inverse Probability of Censoring-Weighted estimator, "IPCW", for the regular IPCW estimator, or "G-COMP", for the Graphical Computation estimator. If the regular IPCW estimator is selected, all arguments which begin with the letter Q are ignored, since only g (the regression of each exposure on possible confounders) needs to be modeled in this case. Similarly, if the G-COMP estimator is selected, all arguments which begin with the letter g, as well as the truncate argument, will be ignored, since only Q needs to be modeled in this case. Note: an additional characteristic of the G-COMP estimator is that no plug-in standard errors are available for it. If you want to use G-COMP and you need standard errors, the multiPIMboot function is available and will provide bootstrap standard errors.

g.method: the default, "main.terms.logistic", is meant to be used with the default TMLE estimator. If a different estimator is used, it is recommended to use super learning by specifying "sl"; in this case, the arguments g.sl.cands, g.num.folds and g.num.splits must also be specified. Other possible values for the g.method argument are: one of the elements of the vector all.bin.cands, or, if extra.cands is supplied, one of the names of the extra.cands list of functions. Ignored if estimator is "G-COMP".

g.sl.cands: the candidates to use in super learner modelling of g. Must be taken from all.bin.cands, or from the names of the extra.cands list of functions, if it is supplied. Ignored if estimator is "G-COMP" or if g.method is not "sl". NOTE: the TMLE estimator is recommended, but if one is using either of the IPCW estimators, a reasonable choice is to specify g.method = "sl" and g.sl.cands = default.bin.cands.

g.num.folds: the number of folds to use in cross-validating the super learner fit for g. Ignored if estimator is "G-COMP", or if g.method is not "sl".

g.num.splits: the number of times to split the data into g.num.folds folds in cross-validating the super learner fit for g. Cross-validation results will be averaged over all splits. Ignored if estimator is "G-COMP", or if g.method is not "sl".

Q.method: the default, "sl", indicates that super learning should be used for modelling Q. Ignored if estimator is "IPCW".

Q.sl.cands: either "default", or "all", or a character vector of length >= 2 containing elements of either all.bin.cands or of all.cont.cands, or of the names of the extra.cands list of functions, if it is supplied. See details. Ignored if estimator is "IPCW" or if Q.method is not "sl".

Q.num.folds: the number of folds to use in cross-validating the super learner fit for Q. Ignored if estimator is "IPCW" or if Q.method is not "sl".

Q.num.splits: the number of times to split the data into Q.num.folds folds in cross-validating the super learner fit for Q. Ignored if estimator is "IPCW" or if Q.method is not "sl".

Q.type: NULL or a length 1 character vector (which must be either "binary.outcome" or "continuous.outcome"). This provides a way to override the default mechanism for deciding which candidates will be allowed for modeling Q (see details). Ignored if estimator is "IPCW".

adjust.for.other.As: logical value indicating whether the other columns of A should be included (for TRUE) or not (for FALSE) in the g and Q models used to calculate the effect of each column of A on each column of Y. See details. Ignored if A has only one column.

truncate: FALSE, or a single number greater than 0 and less than 0.5 at which the values of g(0, W) should be truncated in order to avoid instability of the estimator. Ignored if estimator is "G-COMP".

return.final.models: logical value indicating whether the final g and Q models should be returned (in the slots g.final.models and Q.final.models). Default is TRUE. If memory is a concern, you will probably want to set this to FALSE.

na.action: missing values are not allowed; if Y, A or (a non-null) W has missing values, multiPIM will throw an error.

check.input: logical value indicating whether the arguments should be checked for validity (see details). Setting this to FALSE is not recommended.

verbose: logical value; no progress messages will be printed if verbose is set to FALSE.

extra.cands: a named list of functions to be used as stand-alone regression methods, or as super learner candidates. See details.

standardize: logical value; its value is recorded in the standardize slot of the returned object.

Value

An object of class "multiPIM" with the following elements:
param.estimates: a matrix of dimension ncol(A) by ncol(Y), with rownames equal to names(A) and colnames equal to names(Y). Each element is the estimated causal attributable risk for the exposure given by its row name vs. the outcome given by its column name.

plug.in.stand.errs: a matrix of the same dimensions as param.estimates containing the corresponding plug-in standard errors of the parameter estimates. These are obtained from the influence curve. Note: plug-in standard errors are not available for estimator = "G-COMP"; this field will be set to NA in that case.

call: a copy of the call to multiPIM which generated this object.

num.exposures: the number of exposures, ncol(A).

num.outcomes: the number of outcomes, ncol(Y).

W: the W data frame, if one was supplied. If no W was supplied, this will be NA.

g.sl.cands: the g.sl.cands argument. Will be NA if g.method was not "sl".

g.winning.cands: a character vector with ncol(A) elements. The ith element is the name of the candidate which "won" the cross-validation in the g model for the ith column of A.

g.cv.risk.array: an array of dimension c(ncol(A), g.num.splits, length(g.sl.cands)) containing cross-validated risks from super learner modeling for g, for each exposure-split-candidate triple. Has an informative dimnames attribute. Note: the values are technically not risks, but log likelihoods (i.e. the winning candidate is the one for which this is a max, not a min).

g.final.models: a list of length ncol(A) containing the objects returned by the candidate functions used in the final g models (see Candidates).

g.num.folds: the g.num.folds argument. Will be NA if g.method was not "sl".

g.num.splits: the g.num.splits argument. Will be NA if g.method was not "sl".

Q.method: the Q.method argument. Will be NA if double.robust was FALSE.

Q.sl.cands: the Q.sl.cands argument. Will be NA if double.robust was FALSE or if Q.method was not "sl".

Q.winning.cands: a character vector with ncol(Y) elements. The ith element is the name of the candidate which "won" the cross-validation in the super learner for the Q model for the ith column of Y.

Q.cv.risk.array: an array of dimension c(ncol(A), ncol(Y), Q.num.splits, length(Q.sl.cands)) containing cross-validated risks from super learner modeling for Q. Has an informative dimnames attribute. Note: the values will be log likelihoods when Q.type is "binary.outcome" (see the note above for g.cv.risk.array), and mean squared errors when Q.type is "continuous.outcome".

Q.final.models: a list of length ncol(A), each element of which is another list of length ncol(Y), containing the objects returned by the candidate functions used for the Q models; i.e. Q.final.models[[i]][[j]] contains the Q model information for exposure i and outcome j.

Q.num.folds: the Q.num.folds argument. Will be NA if double.robust was FALSE or if Q.method was not "sl".

Q.num.splits: the Q.num.splits argument. Will be NA if double.robust was FALSE or if Q.method was not "sl".

Q.type: "continuous.outcome" or "binary.outcome", depending on the contents of Y, or on the value of the Q.type argument, if supplied.

adjust.for.other.As: logical value indicating whether other columns of A were included in the models used to calculate the effect of each column of A on each column of Y. Will be set to NA when A has only one column.

truncate: the truncate argument. Will be set to NA if estimator was "G-COMP".

truncation.occurred: logical value; will be FALSE when truncate is FALSE. Will be set to NA if estimator was "G-COMP".

standardize: the standardize argument.

boot.param.array: NULL for objects returned by the multiPIM function. See multiPIMboot for details on what this slot is actually used for.
for details on what this slot is actually used for.Y
for the units in
the target (or unexposed) group and the overall mean value of
Y
. Units which are in the target (or unexposed) group with
respect to one of the variables in A
are characterized as such by
having the value 0 in the respective column of A
. Members of the
the non-target (or exposed) group should have a 1 in that column of
A
. Assuming all causal assumptions hold (see the paper), each
parameter estimate can be thought of as estimating the hypothetical
effect on the respective outcome of totally eliminating the respective
exposure from the population (i.e. setting everyone to 0 for that
exposure). For example, in the case of a binary outcome, a parameter
estimate for exposure x and outcome y of -0.03 could be interpreted
as follows: the effect of an intervention in which the entire population
was set to exposure x = 0 would be to reduce the level of outcome y
by 3 percentage points.If check.input
is TRUE
(which is the default and is highly
recommended), all of the arguments will be checked to make sure they
have permissible values. Many of the arguments, especially those for
which a single logical value (TRUE
or FALSE
) or a single
character value (such as, for example, "all"
) is expected, are
checked using the identical
function, which means that if any of
these arguments has any extraneous attributes (such as names), this may
cause multiPIM
to throw an error.
On the other hand, the arguments Y
and A
(and W
if
it is non-null) must have valid names attributes. multiPIM
will throw an error if there is any overlap between the names of the
columns of these data frames, or if any of the names cannot be used in a
formula
(for example, because it begins with a number and not a
letter).
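The naming rules above can be illustrated with a short sketch (the data frame names y.frame and a.frame are hypothetical stand-ins):

```r
## Give each data frame unique, formula-safe column names
## (a letter first, and no overlap between Y, A and W):
names(y.frame) <- paste0("Y", seq_len(ncol(y.frame)))
names(a.frame) <- paste0("A", seq_len(ncol(a.frame)))

## A name beginning with a number would be rejected;
## make.names() shows one way to repair such a name:
make.names("2013.income")  # "X2013.income"
```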
By default, the regression methods which will be allowed
for fitting models for Q will be determined from the contents of
Y
as follows: if all values in Y are either 0 or 1 (i.e. all
outcomes are binary), then logistic-type regression methods
will be used (and only these methods will be allowed in the arguments
Q.method
and Q.sl.cands
); however, if there are any values
in Y
which are not equal to 0 or 1 then it will be assumed that
all outcomes are continuous, linear-type regression will be
used, and the values allowed for Q.method
and Q.sl.cands
will change accordingly. This behavior can be overridden by specifying
Q.type
as either "binary.outcome"
(for logistic-type
regression), or as "continuous.outcome"
(for linear-type
regression). If Q.type
is specified, Y
will not be checked to determine whether all outcomes are binary.
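As a sketch of the override (Y and A as constructed in the Examples section below; Y01 is a hypothetical data frame in which every outcome is coded 0/1):

```r
library(multiPIM)

## Default: the outcome type is inferred from the contents of Y,
## so linear-type Q regressions are used for a continuous Y.
fit <- multiPIM(Y, A)

## Hypothetical override: treat a 0/1-coded Y01 as continuous,
## forcing linear-type Q regressions instead of logistic-type:
## fit <- multiPIM(Y01, A, Q.type = "continuous.outcome")
```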
The values allowed for Q.method
(which should have length 1) are:
either "sl"
if one would like to use super learning, or one of the
elements of the vector all.bin.cands
(for the binary outcome case),
or of all.cont.cands
(for the continuous outcome case), if one would
like to use only a
particular regression method for all modelling of Q. If Q.method
is given as "sl"
, then the candidates used by the super learner
will be determined from the value of Q.sl.cands
. If the value of
Q.sl.cands
is "default"
, then the candidates listed in either
default.bin.cands
or default.cont.cands
will
be used. If the value of Q.sl.cands
is "all"
, then the candidates
listed in either all.bin.cands
or all.cont.cands
will be used. The function will automatically choose the candidates which
correspond to the correct outcome type (binary or continuous). Alternatively,
one may specify Q.sl.cands
explicitly as a vector of names of the
candidates to be used.
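For example, following the NOTE under g.sl.cands above, a double-robust IPCW fit with super learning for both g and Q might be specified as follows (a sketch, assuming Y and A as in the Examples section):

```r
library(multiPIM)

## DR-IPCW with super learning; default candidate sets for g and Q
fit <- multiPIM(Y, A, estimator = "DR-IPCW",
                g.method = "sl", g.sl.cands = default.bin.cands,
                g.num.folds = 5, g.num.splits = 1,
                Q.method = "sl", Q.sl.cands = "default")
```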
If A
has more than one column, the adjust.for.other.As
argument can be used to specify whether the other
columns of A
should possibly be included in the g and Q models
which will be used in calculating the effect of a
certain column of A
on each column of Y
.
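A minimal sketch of this option (Y and A as in the Examples section, with more than one exposure column):

```r
library(multiPIM)

## Exclude the other columns of A from each exposure's g and Q models:
fit <- multiPIM(Y, A, adjust.for.other.As = FALSE)
```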
With the argument extra.cands
, one may supply alternative R
functions to be used as stand-alone regression methods, or as super
learner candidates, within the multiPIM
function. extra.cands
should be given as a named list of
functions. See Candidates for the form (e.g. arguments) that the
functions in this list should have. In order to supply your own stand
alone regression method for g or Q, simply specify g.method
or
Q.method
as the name of the function you want to use (i.e. the
corresponding element of the names attribute of extra.cands
). To
add candidates to a super learner, simply use the corresponding names of
your functions (from the names attribute of extra.cands
) when you
supply the g.sl.cands
or Q.sl.cands
arguments. Note that
you may mix and match between your own extra candidates and the built-in
candidates given in the all.bin.cands
and
all.cont.cands
vectors. Note
also that extra candidates must be explicitly specified as
g.method
, Q.method
, or as elements of g.sl.cands
or
Q.sl.cands
; specifying Q.sl.cands
as "all"
will not
cause any extra candidates to be used.
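The mechanics can be sketched as follows (my.rf and my.gam are hypothetical functions, assumed to be written to the interface described in the Candidates help page; Y and A as in the Examples section):

```r
library(multiPIM)

my.extra <- list(my.rf = my.rf, my.gam = my.gam)

## Use an extra candidate as a stand-alone Q regression method ...
fit1 <- multiPIM(Y, A, extra.cands = my.extra, Q.method = "my.rf")

## ... or mix extra and built-in candidates in the super learner.
## (Q.sl.cands = "all" would NOT pick up the extra candidates.)
fit2 <- multiPIM(Y, A, extra.cands = my.extra,
                 Q.sl.cands = c(default.cont.cands, "my.rf", "my.gam"))
```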
References

Hubbard, Alan E. and van der Laan, Mark J. (2008) Population Intervention Models in Causal Inference. Biometrika 95, 1: 35--47.
Young, Jessica G., Hubbard, Alan E., Eskenazi, Brenda, and Jewell, Nicholas P. (2009) A Machine-Learning Algorithm for Estimating and Ranking the Impact of Environmental Risk Factors in Exploratory Epidemiological Studies. U.C. Berkeley Division of Biostatistics Working Paper Series, Working Paper 250. http://www.bepress.com/ucbbiostat/paper250
van der Laan, Mark J. and Rose, Sherri (2011) Targeted Learning, Springer, New York. ISBN: 978-1441997814
Sinisi, Sandra E., Polley, Eric C., Petersen, Maya L., Rhee, Soo-Yon and van der Laan, Mark J. (2007) Super learning: An Application to the Prediction of HIV-1 Drug Resistance. Statistical Applications in Genetics and Molecular Biology 6, 1: article 7. http://www.bepress.com/sagmb/vol6/iss1/art7
van der Laan, Mark J., Polley, Eric C. and Hubbard, Alan E. (2007) Super learner. Statistical applications in genetics and molecular biology 6, 1: article 25. http://www.bepress.com/sagmb/vol6/iss1/art25
See Also

multiPIMboot for running multiPIM with automatic bootstrapping to get standard errors.

summary.multiPIM for printing summaries of the results.

Candidates to see which candidates are currently available, and for information on writing user-defined super learner candidates and regression methods.
Examples

num.columns <- 3
num.obs <- 250
set.seed(23)
## use rbinom with size = 1 to make a data frame of binary data
A <- as.data.frame(matrix(rbinom(num.columns*num.obs, 1, .5),
nrow = num.obs, ncol = num.columns))
## let Y[,i] depend only on A[,i] plus some noise
## (start with the noise then add a multiple of A[,i] to Y[,i])
Y <- as.data.frame(matrix(rnorm(num.columns*num.obs),
nrow = num.obs, ncol = num.columns))
for(i in 1:num.columns)
Y[,i] <- Y[,i] + i * A[,i]
## make sure the names are unique
names(A) <- paste("A", 1:num.columns, sep = "")
names(Y) <- paste("Y", 1:num.columns, sep = "")
result <- multiPIM(Y, A)
summary(result)