randomGLM (version 1.00-1)

randomGLM: Random generalized linear model predictor

Description

Ensemble predictor comprised of individual generalized linear model predictors.

Usage

randomGLM(
  # Input data
  x, y, xtest = NULL,

  # Include interactions?
  maxInteractionOrder = 1,

  # Prediction type
  classify = is.factor(y) | length(unique(y)) < 4,

  # Multi-level classification options - only apply to classification with multi-level response
  multiClass.global = TRUE,
  multiClass.pairwise = FALSE,
  multiClass.minObs = 1,
  multiClass.ignoreLevels = NULL,

  # Sampling options
  nBags = 100,
  replace = TRUE,
  sampleWeight=NULL,
  nObsInBag = if (replace) nrow(x) else as.integer(0.632 * nrow(x)),
  nFeaturesInBag = ceiling(ifelse(ncol(x)
  ...)

Arguments

Value

The function returns an object of class randomGLM. For continuous prediction or two-level classification, this is a list with the following components:

predictedOOB: the continuous prediction (if classify is FALSE) or predicted classification (if classify is TRUE) of the input data based on out-of-bag samples.

predictedOOB.response: in case of a binary outcome, the predicted probability of each outcome specified by y based on out-of-bag samples. In case of a continuous outcome, the predicted value based on out-of-bag samples (i.e., a copy of predictedOOB).

predictedTest.cont: if a test set is given, the predicted probability of each outcome specified by y for the test data (binary outcomes). In case of a continuous outcome, this is the test set predicted value.

predictedTest: if a test set is given, the predicted classification for the test data. Only for binary outcomes.

candidateFeatures: candidate features in each bag. A list with one component per bag. Each component is a matrix with maxInteractionOrder rows and nCandidateCovariates columns. Each column represents one interaction obtained by multiplying the features indicated by the entries in that column (0 means no feature, i.e. a lower-order interaction).

featuresInForwardRegression: features selected by forward selection in each bag. A list with one component per bag. Each component is a matrix with maxInteractionOrder rows. Each column represents one interaction obtained by multiplying the features indicated by the entries in that column (0 means no feature, i.e. a lower-order interaction).

coefOfForwardRegression: coefficients of the forward regression. A list with one component per bag. Each component is a vector giving the coefficients of the model determined by forward selection in the corresponding bag. The order of the coefficients is the same as the order of the terms in the corresponding component of featuresInForwardRegression.

interceptOfForwardRegression: a vector with one component per bag giving the intercept of the regression model in each bag.

bagObsIndx: a matrix with nObsInBag rows and nBags columns, giving the indices of observations selected for each bag.

timesSelectedByForwardRegression: a matrix with maxInteractionOrder rows and one column per feature. Each entry gives the number of times the corresponding feature appeared in a predictor model at the corresponding order of interaction. Interactions where a single feature enters more than once (e.g., a quadratic interaction of the feature with itself) are counted once.

models: the regression models for each bag.

In addition, the output contains a copy of several input arguments. These are included to facilitate prediction using the predict method. These returned values should be considered undocumented and may change in the future.

In the multi-level classification case, the returned list (still considered a valid randomGLM object) contains the following components:

binaryPredictors: a list with one component per binary variable, containing the randomGLM predictor trained on that binary variable as the response. The list is named by the corresponding binary variable. For example, if the input response y contains observations with values (levels) "A", "B", "C", the binary variables (and components of this list) will have names "all.vs.A" (1 means "A", 0 means all others), "all.vs.B", "all.vs.C", and optionally also "A.vs.B" (0 means "A", 1 means "B", NA means neither "A" nor "B"), "A.vs.C", and "B.vs.C".

predictedOOB: a matrix in which columns correspond to the binary variables and rows to samples, containing the predicted binary classification for each binary variable. Column names and the meaning of 0 and 1 are described above.

predictedOOB.response: a matrix with two columns per binary variable, giving the class probabilities for each of the two classes in each binary variable. Column names contain the variable and class names.

levelMatrix: a character matrix with two rows and one column per binary variable, giving the level corresponding to value 0 (row 1) and the level corresponding to value 1 (row 2). This encodes the same information as the names of the binaryPredictors list, but in a more programmer-friendly way.

If the input xtest is non-NULL, the components predictedTest and predictedTest.response contain test set predictions analogous to predictedOOB and predictedOOB.response.
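The binary re-coding that underlies the multi-level components can be illustrated in base R. This is a minimal sketch independent of the package; only the variable names mirror the documented naming scheme.

```r
# A 3-level response, as in the "A", "B", "C" example above
y = factor(c("A", "A", "B", "B", "C", "C"))

# "Level vs. all": 1 when y equals the level, 0 otherwise
all.vs.A = as.integer(y == "A")
all.vs.B = as.integer(y == "B")

# "Level vs. level": 0 for level A, 1 for level B, NA otherwise
A.vs.B = ifelse(y == "A", 0L, ifelse(y == "B", 1L, NA_integer_))
```

Note how observations with level "C" contribute to the "all" side of all.vs.A and all.vs.B, but are NA in the pairwise variable A.vs.B.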

Details

At this point, the function randomGLM can be used to predict a binary outcome or a quantitative (continuous) outcome. The ensemble predictor proceeds along the following steps.

Step 1 (bagging): nBags bootstrapped data sets are generated by randomly sampling from the original training data set (x, y).

Step 2 (random subspace): For each bag, nFeaturesInBag features are randomly selected (without replacement) from the columns of x. Optionally, interaction terms between the selected features can be formed (see the argument maxInteractionOrder).

Step 3 (feature ranking): In each bag, features are ranked according to their correlation with the outcome. The top nCandidateCovariates are then considered for forward selection in each GLM (and in each bag).

Step 4 (forward selection): Forward variable selection is employed to define a multivariate GLM model of the outcome in each bag.

Step 5 (aggregating the predictions): Predictions from the individual bags are aggregated. In case of a quantitative outcome, the predictions are simply averaged across the bags.

Additional comments: Generally, nCandidateCovariates > 100 is not recommended, because the forward selection process is time-consuming. If the arguments nBags=1, replace=FALSE, nObsInBag=nrow(x) are used, the function becomes a forward selection GLM predictor without bagging.

Classification of multi-level categorical responses is performed indirectly by turning the single multi-class response into a set of binary variables. The set can include two types of binary variables: level vs. all others (this binary variable is 1 when the original response equals the level and 0 otherwise), and level A vs. level B (this binary variable is 0 when the response equals level A, 1 when the response equals level B, and NA otherwise). For example, if the input response y contains observations with values (levels) "A", "B", "C", the binary variables will have names "all.vs.A" (1 means "A", 0 means all others), "all.vs.B", "all.vs.C", and optionally also "A.vs.B" (0 means "A", 1 means "B", NA means neither "A" nor "B"), "A.vs.C", and "B.vs.C". Note that using pairwise level vs. level binary variables can be very time-consuming, since the number of such binary variables grows quadratically with the number of levels in the response. The user has the option to limit which levels of the original response will have their "own" binary variables, by setting the minimum number of observations a level must have to qualify for its own binary variable (multiClass.minObs), and by explicitly enumerating levels that should not have their own binary variables (multiClass.ignoreLevels). Note that such "ignored" levels are still included on the "all" side of "level vs. all" binary variables. At this time the predictor does not attempt to summarize the binary variable classifications into a single multi-level classification.
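The five steps above can be sketched in base R for a continuous outcome, using step() for forward selection. This is an illustration only, not the package implementation; the simulated data and the use of lm()/step() are assumptions made for the sketch, with variable names (nBags, nFeaturesInBag, nCandidateCovariates) mirroring the function arguments.

```r
set.seed(1)
n = 60; p = 10
x = matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("f", 1:p)))
y = x[, 1] + 0.5 * x[, 2] + rnorm(n)   # hypothetical continuous outcome

nBags = 5; nFeaturesInBag = 5; nCandidateCovariates = 3
pred = matrix(NA_real_, n, nBags)

for (b in 1:nBags) {
  obs  = sample(n, replace = TRUE)                 # Step 1: bootstrap a bag
  feat = sample(p, nFeaturesInBag)                 # Step 2: random feature subspace
  ranks = order(abs(cor(x[obs, feat], y[obs])), decreasing = TRUE)
  cand = feat[ranks[1:nCandidateCovariates]]       # Step 3: rank by correlation
  d = data.frame(y = y[obs], x[obs, cand, drop = FALSE])
  fit = step(lm(y ~ 1, data = d),                  # Step 4: forward selection
             scope = reformulate(colnames(d)[-1]),
             direction = "forward", trace = 0)
  pred[, b] = predict(fit, newdata = as.data.frame(x))
}
yHat = rowMeans(pred)                              # Step 5: average across bags
```

A binary outcome would instead use glm(..., family = binomial) in each bag and aggregate predicted probabilities.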

References

Lin Song, Peter Langfelder, Steve Horvath: Random generalized linear model: a highly accurate and interpretable ensemble predictor. BMC Bioinformatics (2013)

Examples

## binary outcome prediction
# data generation
data(iris)
# Restrict data to first 100 observations
iris=iris[1:100,]
# Turn Species into a factor
iris$Species = as.factor(as.character(iris$Species))
# Select a training and a test subset of the 100 observations
set.seed(1)
indx = sample(100, 67, replace=FALSE)
xyTrain = iris[indx,]
xyTest = iris[-indx,]
xTrain = xyTrain[, -5]
yTrain = xyTrain[, 5]

xTest = xyTest[, -5]
yTest = xyTest[, 5]

# predict with a small number of bags - normally nBags should be at least 100.
RGLM = randomGLM(xTrain, yTrain, xTest, nCandidateCovariates=ncol(xTrain), nBags=30, nThreads = 1)
yPredicted = RGLM$predictedTest
table(yPredicted, yTest)


## continuous outcome prediction

x=matrix(rnorm(100*20),100,20)
y=rnorm(100)

xTrain = x[1:50,]
yTrain = y[1:50]
xTest = x[51:100,]
yTest = y[51:100]

RGLM = randomGLM(xTrain, yTrain, xTest, classify=FALSE, nCandidateCovariates=ncol(xTrain),
                 nBags=10, keepModels = TRUE, nThreads = 1)
# Continuous test set predictions are returned in predictedTest.cont
yPredicted = RGLM$predictedTest.cont
cor(yPredicted, yTest)
