all.bin.cands <- c("polyclass", "penalized.bin", "main.terms.logistic", "rpart.bin")
default.bin.cands <- c("polyclass", "penalized.bin", "main.terms.logistic")
all.cont.cands <- c("polymars", "lars", "main.terms.linear", "penalized.cont", "rpart.cont")
default.cont.cands <- c("polymars", "lars", "main.terms.linear")
These vectors (or subsets thereof) can be supplied as arguments to the multiPIM and multiPIMboot functions in order to specify which regression methods should be used to estimate the nuisance parameters g(0, W) and Q(0|W). The user may also supply custom-written regression methods or super learner candidates. The mechanism for this is described in a section below.
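As a small illustration of taking subsets, the sketch below uses only base R to build custom candidate lists from the vectors above. The vectors are re-declared so that the snippet runs on its own, and the names my.bin.cands and my.cont.cands are hypothetical. The commented-out call at the end assumes that the super learner candidate arguments of multiPIM are named g.sl.cands and Q.sl.cands; check the multiPIM help page before relying on that.

```r
# Re-declare the candidate vectors so this snippet is self-contained.
all.bin.cands <- c("polyclass", "penalized.bin", "main.terms.logistic", "rpart.bin")
default.bin.cands <- c("polyclass", "penalized.bin", "main.terms.logistic")
all.cont.cands <- c("polymars", "lars", "main.terms.linear", "penalized.cont", "rpart.cont")

# Dropping the tree-based candidate from the full binary list gives the default list.
identical(setdiff(all.bin.cands, "rpart.bin"), default.bin.cands)

# Hypothetical custom subsets (names chosen for illustration only).
my.bin.cands <- c("penalized.bin", "main.terms.logistic")
my.cont.cands <- setdiff(all.cont.cands, c("rpart.cont", "lars"))

# Guard against typos: every requested candidate must be a known one.
stopifnot(all(my.bin.cands %in% all.bin.cands),
          all(my.cont.cands %in% all.cont.cands))

# Assumed usage (argument names not confirmed here; see the multiPIM help page):
# result <- multiPIM(Y, A, W, g.method = "sl", g.sl.cands = my.bin.cands,
#                    Q.method = "sl", Q.sl.cands = my.cont.cands)
```

Validating the subset against the full vector before the call is a cheap way to catch a misspelled candidate name early.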
These candidates use the functions polyclass and polymars from the package polspline.
These candidates perform L1-penalized logistic (penalized.bin) or linear (penalized.cont) regression using the function penalized from the package penalized. The value of the L1 penalty is selected by cross-validation (using the profL1 function).
This candidate uses the function lars from the package lars. Cross-validation is performed using the function cv.lars.
These candidates perform standard main terms logistic or linear regression, using the functions glm and lm.
These candidates use the function rpart from the package rpart. They are not included as default candidates because methods such as this, which are based on an individual tree, have many drawbacks; see, e.g., Hastie, Tibshirani and Friedman (2009, section 9.2).
The following is the definition of the main.terms.logistic candidate function. It is an example of the form that functions passed as elements of the extra.cands argument should have. (See the code for the multiPIM function in the file multiPIM/R/functions.R for other examples of how candidates are defined.)

candidate.functions$main.terms.logistic <- function(X, Y, newX, force, standardize) {
  # Pre-allocate the result list with the expected slot names
  result <- vector("list", length = 4)
  names(result) <- c("preds", "fitted", "main.model", "force.model")
  class(result) <- "main.terms.logistic.result"
  # Regress the outcome on all columns of X (main terms only)
  formula.string <- paste(names(Y), "~", paste(names(X), collapse = "+"), sep = "")
  result$main.model <- glm(formula.string, data = cbind(Y, X), family = binomial, model = FALSE, y = FALSE)
  # Predicted probabilities on newX, and fitted probabilities on X
  result$preds <- predict(result$main.model, newdata = newX, type = "response")
  result$fitted <- predict(result$main.model, type = "response")
  return(result)
}
The functions must take these five arguments: X will be a data frame of predictors, Y will be a single-column data frame containing the outcome, and newX will be a data frame with columns corresponding to the columns of X, containing data on which to predict the outcome based on the model fit by the function. force will be an integer specifying the column number (of X and newX) corresponding to the variable that should be forced into the model when the function is being used to fit a Q model, and standardize will be a logical value indicating whether or not to standardize the input variables before running the fitting algorithm (this is meant for algorithms like penalized, where the scale of the predictors makes a difference).
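To make the argument shapes concrete, here is a minimal sketch of a candidate for a continuous outcome, together with toy inputs. The function name example.linear.cand and the simulated data are hypothetical, and the function is an illustration of the five-argument form, not the package's own main.terms.linear code.

```r
# Hypothetical candidate for a continuous outcome, following the same
# five-argument form. This is an illustration, not the package's own code.
example.linear.cand <- function(X, Y, newX, force, standardize) {
  # Main terms only, so a forced-in variable needs no special handling;
  # 'force' and 'standardize' are accepted but unused.
  formula.string <- paste(names(Y), "~", paste(names(X), collapse = "+"))
  fit <- lm(as.formula(formula.string), data = cbind(Y, X))
  list(preds = predict(fit, newdata = newX),  # predictions on newX
       fitted = predict(fit),                 # predictions on X
       main.model = fit)
}

# Toy inputs with the shapes described above.
set.seed(1)
X <- data.frame(A = rbinom(50, 1, 0.5), W1 = rnorm(50), W2 = rnorm(50))
Y <- data.frame(Y1 = 1 + 2 * X$A + X$W1 + rnorm(50))  # single-column data frame
newX <- X
newX$A <- 0  # same columns as X; exposure column set to 0 for illustration

res <- example.linear.cand(X, Y, newX, force = 1, standardize = FALSE)
length(res$preds) == nrow(newX)
```

Because Y is a data frame, names(Y) supplies the left-hand side of the formula, which is why a bare vector would not work here.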
For g models, the force argument will be missing when the function is called, so if the function must do something differently in order to force in a variable (unlike the main.terms.logistic function above), then one can use a conditional such as:
if(missing(force)) ...
in order to differentiate between g and Q models.
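The sketch below shows one way such a conditional might be used. The candidate example.screened.logistic is hypothetical: it screens predictors by correlation with the outcome before fitting, so the forced-in variable could be dropped unless it is added back explicitly for Q models. The screening rule (keep the top two predictors) is an arbitrary choice made for illustration.

```r
# Hypothetical candidate that screens predictors before fitting a logistic
# regression. Screening could drop the exposure, so Q models need the
# forced-in column added back, while g models (force missing) do not.
example.screened.logistic <- function(X, Y, newX, force, standardize) {
  y <- Y[[1]]
  # Rank predictors by absolute correlation with the outcome; keep the top two.
  scores <- sapply(X, function(x) abs(cor(x, y)))
  keep <- names(sort(scores, decreasing = TRUE))[seq_len(min(2, ncol(X)))]
  if (!missing(force)) {
    # Q model: make sure the forced-in variable is part of the model.
    keep <- union(names(X)[force], keep)
  }
  formula.string <- paste(names(Y), "~", paste(keep, collapse = "+"))
  fit <- glm(as.formula(formula.string), data = cbind(Y, X), family = binomial)
  list(preds = predict(fit, newdata = newX, type = "response"),
       fitted = predict(fit, type = "response"),
       main.model = fit)
}

# Toy data: A is the exposure, W1..W3 are covariates, Y1 is a binary outcome.
set.seed(2)
X <- data.frame(A = rbinom(100, 1, 0.5), W1 = rnorm(100),
                W2 = rnorm(100), W3 = rnorm(100))
Y <- data.frame(Y1 = rbinom(100, 1, plogis(X$W1 - X$W2)))

# g-style call: the exposure is the outcome and 'force' is not supplied.
g.fit <- example.screened.logistic(X[-1], X["A"], X[-1], standardize = FALSE)
# Q-style call: 'force = 1' points at the exposure column of X.
Q.fit <- example.screened.logistic(X, Y, X, force = 1, standardize = FALSE)
"A" %in% names(coef(Q.fit$main.model))
```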
The list returned by the functions must have a slot named preds containing the predictions (or predicted probabilities for the binary outcome case), based on newX. Also, for the TMLE estimator it is necessary to have the fitted values, i.e. the predictions on X. These should be returned in the fitted slot. Note that for binary outcomes, these predictions and fitted values should be predicted probabilities that the outcome is equal to 1. Thus, if the candidate is being used for a g model, where the outcome is an exposure variable, the returned values will be estimated probabilities that the exposure variable is equal to 1. The probabilities will be converted as necessary elsewhere in the multiPIM function.
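When writing a custom candidate, it can help to check the returned list against these requirements before passing the function to multiPIM. The helper below is a hypothetical sketch (check.candidate.result is not part of the package) and tests only what is stated above: the two required slots, their lengths, and, for binary outcomes, that the values are valid probabilities.

```r
# Hypothetical helper: checks the slots that the text above says are required.
check.candidate.result <- function(result, X, newX, binary = TRUE) {
  stopifnot(is.list(result),
            all(c("preds", "fitted") %in% names(result)),
            length(result$preds) == nrow(newX),
            length(result$fitted) == nrow(X))
  if (binary) {
    # Binary outcomes: values must be probabilities that the outcome equals 1.
    stopifnot(all(result$preds >= 0 & result$preds <= 1),
              all(result$fitted >= 0 & result$fitted <= 1))
  }
  invisible(TRUE)
}

# Toy check using a plain glm fit in place of a real candidate.
set.seed(3)
X <- data.frame(W1 = rnorm(40), W2 = rnorm(40))
Y <- data.frame(A = rbinom(40, 1, plogis(X$W1)))
newX <- data.frame(W1 = rnorm(10), W2 = rnorm(10))
fit <- glm(A ~ W1 + W2, data = cbind(Y, X), family = binomial)
result <- list(preds = predict(fit, newdata = newX, type = "response"),
               fitted = predict(fit, type = "response"))
check.candidate.result(result, X, newX)
```

Using type = "response" is what puts the glm predictions on the probability scale; the default link scale would fail the range check.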
The other slots (main.model and force.model), and setting the class of the object returned, are not necessary for the multiPIM function to work correctly, but may be useful if one would like to inspect the final g and Q models after running the function (see the return.final.models argument).
Ritter, Stephan J., Jewell, Nicholas P. and Hubbard, Alan E. (2014)
General Machine Learning Reference:
Hastie, T, Tibshirani, R and Friedman, J (2009). The Elements of Statistical Learning. Springer, 2nd edition. ISBN: 0387848576
lars:
Efron, B et al. (2004).
penalized:
Goeman, J. J. (2010).
polyclass and polymars:
Friedman, J. H. (1991).
Kooperberg, C. et al. (1997).
Stone, C. J. et al. (1997).
rpart:
Breiman, L. et al. (1984). Classification and regression trees. Wadsworth International Group, Belmont, CA. ISBN: 0534980538.
multiPIM, multiPIMboot