Candidates: Super learner candidates (regression methods) available for use with the multiPIM and multiPIMboot functions

Description

When the multiPIM package is loaded, four character vectors are made available to the user. They are defined as follows:

all.bin.cands <- c("polyclass", "penalized.bin", "main.terms.logistic", "rpart.bin")

default.bin.cands <- c("polyclass", "penalized.bin", "main.terms.logistic")

all.cont.cands <- c("polymars", "lars", "main.terms.linear", "penalized.cont", "rpart.cont")

default.cont.cands <- c("polymars", "lars", "main.terms.linear")

These vectors (or subsets thereof) can be supplied as arguments to the multiPIM and multiPIMboot functions in order to specify which regression methods should be used to estimate the nuisance parameters g(0|W) and Q(0,W). The user may also supply custom-written regression methods or super learner candidates. The mechanism for this is described in a section below.
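As a rough illustration of working with subsets of these vectors, the sketch below uses only base R. The four vectors are repeated so that it runs without the package loaded. The names fast.bin.cands and extra.cont.cands are hypothetical, and the argument names in the commented-out call (g.sl.cands, Q.sl.cands) are assumptions that should be checked against ?multiPIM.

```r
# The candidate vectors as listed above (multiPIM defines these itself when
# loaded; they are repeated here only so the sketch runs on its own).
all.bin.cands <- c("polyclass", "penalized.bin", "main.terms.logistic", "rpart.bin")
default.bin.cands <- c("polyclass", "penalized.bin", "main.terms.logistic")
all.cont.cands <- c("polymars", "lars", "main.terms.linear", "penalized.cont", "rpart.cont")
default.cont.cands <- c("polymars", "lars", "main.terms.linear")

# Drop one candidate from the binary defaults (hypothetical name).
fast.bin.cands <- setdiff(default.bin.cands, "polyclass")

# Continuous candidates that are available but not used by default.
extra.cont.cands <- setdiff(all.cont.cands, default.cont.cands)

# Guard against typos in candidate names before starting a long run.
stopifnot(all(fast.bin.cands %in% all.bin.cands))

# Hypothetical call; Y, A and W stand for the user's own data frames and the
# argument names are assumed, not verified here:
# fit <- multiPIM(Y, A, W, g.method = "sl", g.sl.cands = fast.bin.cands,
#                 Q.method = "sl", Q.sl.cands = default.cont.cands)
```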

Candidates

polyclass and polymars:

These candidates use the functions polyclass and polymars from the package polspline.

penalized.bin and penalized.cont:

These candidates perform L1 penalized logistic (penalized.bin) or linear (penalized.cont) regression using the function penalized from the package penalized. The value of the L1 penalty is selected by cross validation (using the profL1 function).

lars:

This candidate uses the function lars from the package lars. Cross validation is performed using the function cv.lars.

main.terms.logistic and main.terms.linear:

These candidates perform standard main terms logistic or linear regression, using the functions glm and lm.

rpart.bin and rpart.cont:

These candidates use the function rpart from the package rpart. They are not included as default candidates because methods based on an individual tree have many drawbacks; see e.g. Hastie, Tibshirani and Friedman (2009, section 9.2).

Forcing of Variables into Q Models

Since some of the available candidates (such as polyclass, polymars and the penalized candidates) will sometimes completely drop an input variable from the model, some mechanism is needed to make sure that the relevant exposure variable stays in the model. How this is done for each candidate is described in greater detail in the paper referenced below (Ritter, Jewell and Hubbard, 2014).

User-Defined Regression Methods and Super Learner Candidates

Below is the code which defines the main.terms.logistic candidate function. It is an example of the form that functions passed as elements of the extra.cands argument should have. (See the code for the multiPIM function in the file multiPIM/R/functions.R for other examples of how candidates are defined.)

candidate.functions$main.terms.logistic <- function(X, Y, newX, force, standardize) {

  result <- vector("list", length = 4)
  names(result) <- c("preds", "fitted", "main.model", "force.model")
  class(result) <- "main.terms.logistic.result"

  formula.string <- paste(names(Y), "~", paste(names(X), collapse = "+"), sep = "")

  result$main.model <- glm(formula.string, data = cbind(Y, X), family = binomial, model = FALSE, y = FALSE)

  result$preds <- predict(result$main.model, newdata = newX, type = "response")
  result$fitted <- predict(result$main.model, type = "response")
  return(result)
}

The functions must take these five arguments: X will be a data frame of predictors; Y will be a single-column data frame containing the outcome; newX will be a data frame with columns corresponding to the columns of X, containing the data on which to predict the outcome based on the model fit by the function; force will be an integer specifying the column number (of X and newX) of the variable which should be forced into the model when the function is being used to fit a Q model; and standardize will be a logical value indicating whether or not to standardize the input variables before running the fitting algorithm (this is meant for algorithms like penalized, where the scale of the predictors makes a difference).
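One way a user-written candidate might act on the standardize argument is sketched below in base R. The helper standardize.inputs is hypothetical and is not part of multiPIM. It assumes all columns are numeric, and it rescales newX with the means and standard deviations computed from X so that both data frames end up on the same scale.

```r
# Hypothetical helper: centre and scale X, then apply the same
# transformation to newX.
standardize.inputs <- function(X, newX) {
  centers <- vapply(X, mean, numeric(1))
  scales <- vapply(X, sd, numeric(1))
  scales[scales == 0] <- 1  # avoid dividing by zero for constant columns
  list(
    X = as.data.frame(scale(X, center = centers, scale = scales)),
    newX = as.data.frame(scale(newX[names(X)], center = centers, scale = scales))
  )
}

# Small simulated example with two predictors on very different scales.
set.seed(42)
X <- data.frame(W1 = rnorm(50, mean = 10, sd = 5),
                W2 = rnorm(50, mean = -3, sd = 0.1))
newX <- data.frame(W1 = c(10, 15), W2 = c(-3, -2.9))
std <- standardize.inputs(X, newX)
```

Inside a candidate function, the helper would typically be called only when standardize is TRUE, before the fitting algorithm is run on the transformed data.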

For g models, the force argument will be missing when the function is called, so if the function must do something differently in order to force in a variable (unlike the main.terms.logistic function above), then one can use a conditional such as:

if(missing(force)) ...

in order to differentiate between g and Q models.
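The sketch below shows one way such a conditional might be used. The candidate screened.logistic is hypothetical and is not one of the multiPIM candidates: it keeps the k predictors with the largest absolute correlation with the outcome and, when force is supplied (a Q model), adds the forced column back in if the screening step dropped it. The extra argument k and the simulated data are assumptions made for illustration, and standardize is accepted but unused.

```r
# Hypothetical candidate: screen predictors by marginal correlation, then
# fit a main terms logistic regression on the retained columns.
screened.logistic <- function(X, Y, newX, force, standardize, k = 2) {
  y <- Y[[1]]
  # Absolute marginal correlation of each predictor with the outcome.
  scores <- abs(vapply(X, function(x) cor(as.numeric(x), y), numeric(1)))
  keep <- order(scores, decreasing = TRUE)[seq_len(min(k, ncol(X)))]
  if (!missing(force)) {
    # Q model: make sure the exposure column survives the screening step.
    keep <- union(force, keep)
  }
  fit <- glm(reformulate(names(X)[keep], response = names(Y)),
             data = cbind(Y, X[, keep, drop = FALSE]), family = binomial)
  list(preds = predict(fit, newdata = newX, type = "response"),
       fitted = predict(fit, type = "response"),
       main.model = fit)
}

set.seed(1)
n <- 200
X <- data.frame(A = rbinom(n, 1, 0.5), W1 = rnorm(n), W2 = rnorm(n), W3 = rnorm(n))
Y <- data.frame(Y = rbinom(n, 1, plogis(1.5 * X$W1 - X$W2)))

# Q model: the exposure A (column 1) is forced into the model.
q.fit <- screened.logistic(X, Y, newX = X, force = 1, standardize = FALSE)

# g model: the exposure is the outcome and force is left missing.
g.fit <- screened.logistic(X[, -1], data.frame(A = X$A), newX = X[, -1],
                           standardize = FALSE)
```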

The list returned by the functions must have a slot named preds containing the predictions (or predicted probabilities for the binary outcome case), based on newX. Also, for the TMLE estimator it is necessary to have the fitted values, i.e. the predictions on X. These should be returned in the fitted slot. Note that for binary outcomes, the predictions and fitted values should be predicted probabilities that the outcome is equal to 1. Thus, if the candidate is being used for a g model, where the outcome is an exposure variable, the returned values will be estimated probabilities that the exposure variable is equal to 1. The probabilities will be converted as necessary elsewhere in the multiPIM function.

The other slots (main.model and force.model), and setting the class of the object returned, are not necessary for the multiPIM function to work correctly, but may be useful if one would like to inspect the final g and Q models after running the function (see the return.final.models argument).
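To illustrate the return value described above for a continuous outcome, here is a minimal sketch of a user-defined candidate built on lm. The name my.main.terms.linear, its class name and the simulated data are hypothetical. The commented-out call assumes that extra.cands takes a named list of functions, which should be confirmed in ?multiPIM.

```r
# Hypothetical continuous-outcome candidate following the same interface.
my.main.terms.linear <- function(X, Y, newX, force, standardize) {
  fit <- lm(reformulate(names(X), response = names(Y)), data = cbind(Y, X))
  result <- list(
    preds = predict(fit, newdata = newX),  # predictions on newX
    fitted = predict(fit),                 # predictions on X (needed for TMLE)
    main.model = fit                       # optional, for later inspection
  )
  class(result) <- "my.main.terms.linear.result"
  result
}

set.seed(7)
X <- data.frame(A = rbinom(100, 1, 0.5), W1 = rnorm(100))
Y <- data.frame(Y = 2 * X$A + X$W1 + rnorm(100))
newX <- data.frame(A = c(0, 1), W1 = c(0, 0))
out <- my.main.terms.linear(X, Y, newX, force = 1, standardize = TRUE)

# Hypothetical use with multiPIM (argument form assumed, not verified here):
# multiPIM(Y, A, W, Q.method = "my.main.terms.linear",
#          extra.cands = list(my.main.terms.linear = my.main.terms.linear))
```

A main terms model never drops a predictor, so this sketch ignores force, just as the main.terms.logistic function above does.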

References

multiPIM:

Ritter, Stephan J., Jewell, Nicholas P. and Hubbard, Alan E. (2014). R Package multiPIM: A Causal Inference Approach to Variable Importance Analysis. Journal of Statistical Software, 57(8):1-29. http://www.jstatsoft.org/v57/i08/.

General Machine Learning Reference:

Hastie, T, Tibshirani, R and Friedman, J (2009). The Elements of Statistical Learning. Springer, 2nd edition. ISBN: 0387848576

lars:

Efron, B et al. (2004). Least angle regression. The Annals of Statistics, 32(2):407-499.

penalized:

Goeman, J. J. (2010). L1 penalized estimation in the Cox proportional hazards model. Biometrical Journal, 52(1):70-84.

polyclass and polymars:

Friedman, J. H. (1991). Multivariate adaptive regression splines (with discussion). The Annals of Statistics, 19:1-141.

Kooperberg, C. et al. (1997). Polychotomous regression. Journal of the American Statistical Association, 92(437):117-127.

Stone, C. J. et al. (1997). The use of polynomial splines and their tensor products in extended linear modeling (with discussion). The Annals of Statistics, 25:1371-1470.

rpart: