Usage

randomGLM(
  # Input data
  x, y, xtest = NULL,

  # Include interactions?
  maxInteractionOrder = 1,

  # Prediction type
  classify = is.factor(y) | length(unique(y)) < 4,

  # Multi-level classification options - only apply to classification with multi-level response
  multiClass.global = TRUE,
  multiClass.pairwise = FALSE,
  multiClass.minObs = 1,
  multiClass.ignoreLevels = NULL,

  # Sampling options
  nBags = 100,
  replace = TRUE,
  sampleWeight = NULL,
  nObsInBag = if (replace) nrow(x) else as.integer(0.632 * nrow(x)),
  nFeaturesInBag = ceiling(ifelse(ncol(x) <= 10, ncol(x),
                      ifelse(ncol(x) <= 300, (1.0276 - 0.00276 * ncol(x)) * ncol(x),
                             ncol(x) / 5))),
  minInBagObs = min(max(nrow(x) / 2, 5), 2 * nrow(x) / 3),

  # Individual ensemble member predictor options
  nCandidateCovariates = 50,
  corFncForCandidateCovariates = cor,
  corOptionsForCandidateCovariates = list(method = "pearson", use = "p"),
  mandatoryCovariates = NULL,
  interactionsMandatory = FALSE,
  keepModels = is.null(xtest),

  # Miscellaneous options
  thresholdClassProb = 0.5,
  interactionSeparatorForCoefNames = ".times.",
  randomSeed = 12345,
  nThreads = NULL,
  verbose = 0)
Arguments

x: a matrix whose rows correspond to observations (samples) and whose columns correspond to features (covariates). Thus, x contains the training data.

y: the outcome corresponding to the rows of x: at this point, one can either use a binary class outcome (factor variable) or a quantitative outcome (numeric variable).

xtest: an optional matrix of a second data set, referred to as the test data set (the data in x are interpreted as training data). The number of rows can (and typically will) be different from the number of rows in x.

maxInteractionOrder: integer giving the maximum order of interactions between features to consider; 1 means no interactions.

classify: logical. If TRUE the response y will be interpreted as a binary variable and logistic regression will be used. If FALSE the response y will be interpreted as a quantitative numeric variable and a least squares regression model will be used to arrive at base learners. Multi-level classification is split into a series of binary classification problems according to the multiClass... arguments described below.

multiClass.global: for multi-level classification, logical controlling whether binary variables of the type "level vs. all others" are included.

multiClass.pairwise: for multi-level classification, logical controlling whether pairwise binary variables of the type "level A vs. level B" are included.

multiClass.minObs: minimum number of observations a level of y must have to be considered for its own binary variables.

multiClass.ignoreLevels: optional specification of levels of y that are to be ignored when constructing level vs. all and level vs. level binary responses. Note that observations with these values will be included in the "all" side but will not have their own "level vs. all" variables.

nBags: number of bags (bootstrap samples) to generate.

replace: logical. If TRUE then each bootstrap sample (bag) is defined by sampling with replacement; otherwise, sampling is carried out without replacement. We recommend TRUE.

sampleWeight: weight of each observation during the bootstrap sampling. NULL corresponds to equal weights.

nObsInBag: number of observations selected for each bag (by default, the number of rows in x).

nFeaturesInBag: number of features randomly selected for each bag; features are selected from the columns of x.

minInBagObs: minimum number of unique observations that a valid bag must contain; bags with fewer unique observations are discarded and re-sampled until they contain at least minInBagObs. This helps prevent too few unique observations in a bag, which would lead to problems with model selection.

nCandidateCovariates: number of top-ranked features considered for forward selection in each bag.

corFncForCandidateCovariates: the correlation function used to rank features by their association with the outcome.

corOptionsForCandidateCovariates: a list of options passed to the correlation function. If you want to use the robust correlation bicor, use the argument "robustY=FALSE".

mandatoryCovariates: optional indices of features that are included in every model (mandatory covariates).

interactionsMandatory: logical: should interactions of mandatory covariates also be mandatory? Interactions are included only up to the level given by maxInteractionOrder.

keepModels: logical: should the regression models for each bag be kept? The models are necessary for future predictions using the predict function (predict() generic).

thresholdClassProb: number in [0,1] giving the threshold applied to predicted class probabilities to arrive at a binary classification.

interactionSeparatorForCoefNames: a character string used to separate feature names when forming names of interaction terms. This is used only when interactions are actually taken into account (see maxInteractionOrder above) and only affects coefficient names in models and column names in the returned featuresInForwardRegression (see output value below). We recommend setting it so the interaction separator does not conflict with any feature name, since this may improve interpretability of the results.

randomSeed: integer seed for the random number generator; set to make results reproducible.

nThreads: number of worker threads used for the calculation. If NULL, the number is determined automatically; nThreads=1 means serial execution.

verbose: integer controlling the verbosity of diagnostic output; 0 means silent.

Value

The function returns an object of class randomGLM. For continuous prediction or two-level
classification, this is a list with the following components:
predictedOOB: the continuous prediction (if classify is FALSE) or predicted classification (if classify is TRUE) of the input data based on out-of-bag samples.

predictedOOB.response: in case of a binary outcome, the predicted probability of each outcome specified by y based on out-of-bag samples. In case of a continuous outcome, this is the predicted value based on out-of-bag samples (i.e., a copy of predictedOOB).

predictedTest: if test data were supplied, the predicted classification of the test data (binary outcomes only).

predictedTest.response: in case of a binary outcome, the predicted probability of each outcome specified by y for the test data. In case of a continuous outcome, this is the test set predicted
value.

candidateFeatures: candidate features in each bag. A list with one component per bag; each component is a matrix with maxInteractionOrder rows and nCandidateCovariates columns. Each column represents one interaction obtained by multiplying the features indicated by the entries in the column (0 means no feature, i.e. a lower-order interaction).

featuresInForwardRegression: features selected by forward selection in each bag. A list with one component per bag; each component is a matrix with maxInteractionOrder rows. Each column represents one interaction obtained by multiplying the features indicated by the entries in the column (0 means no feature, i.e. a lower-order interaction). The column names contain human-readable names for the terms.

coefOfForwardRegression: coefficients of the forward regression in each bag, in the same order as the columns of featuresInForwardRegression.

interceptOfForwardRegression: the intercept of the forward regression in each bag.

bagObsIndx: a matrix with nObsInBag rows and nBags columns, giving the indices of observations selected for each bag.

timesSelectedByForwardRegression: a matrix with maxInteractionOrder rows and one column per feature. Each entry gives the number of times the corresponding feature appeared in a predictor model at the corresponding order of interactions. Interactions where a single feature enters more than once (e.g., a quadratic interaction of the feature with itself) are counted once.
featureNamesChanged: logical: whether feature names were copied verbatim from column names of x (FALSE) or whether they had to be changed to make them valid and unique names (TRUE).

nameTranslationTable: only present if featureNamesChanged is TRUE. A data frame with three columns and one row per input feature (column of input x) giving the feature number, original feature name, and modified feature name that is used for model fitting.

In addition, the output contains components needed by the predict method. These returned values should be considered undocumented and may change in the future.

In the multi-level classification case, the returned list (still considered a valid randomGLM object) contains the following components:

binaryPredictors: a list with one component per binary variable, containing the randomGLM
predictor trained on that binary variable as the response. The list is named by the corresponding binary
variable. For example, if the input response y contains observations with values (levels) "A", "B",
"C", the binary variables (and components of this list)
will have names "all.vs.A" (1 means "A", 0 means all others), "all.vs.B",
"all.vs.C", and optionally also "A.vs.B" (0 means "A", 1 means "B", NA means neither "A" nor "B"), "A.vs.C",
and "B.vs.C". binaryPredictors list but in a more programmer-friendly way.xTest is non-NULL, the components predictedTest and predictedTest.response
contain test set predictions analogous to predictedOOB and predictedOOB.response.At this point, the function randomGLM can be used to predict a binary outcome or a
quantitative numeric outcome. This ensemble predictor proceeds along the following steps.
Step 1 (bagging): nBags bootstrapped data sets are generated by randomly sampling from the
original training data set (x, y). If a bag contains fewer than minInBagObs unique
observations, or if it contains all observations, it is discarded and re-sampled.
Step 2 (random subspace): For each bag, nFeaturesInBag features are randomly selected (without
replacement) from the columns of x. Optionally, interaction terms between the selected features can
be formed (see the argument maxInteractionOrder).
Step 3 (feature ranking): In each bag, features are ranked according to their correlation with the outcome
measure. Next, the top nCandidateCovariates are considered for forward selection in each GLM
(and in each bag).
Step 4 (forward selection): Forward variable selection is employed to define a multivariate GLM model of the outcome in each bag.
Step 5 (aggregating the predictions): Predictions from each bag are aggregated. In the case of a quantitative outcome, the predictions are simply averaged across the bags; for a binary outcome, the predicted class probabilities are averaged across the bags and converted to a classification using thresholdClassProb.
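To make Step 5 concrete, the following is a minimal illustrative sketch (not the package's internal code) of the aggregation for a binary outcome; the objects bagProbs, meanProb, and predictedClass are hypothetical names introduced here:
set.seed(1)
# hypothetical per-bag class-1 probabilities: rows = observations, columns = bags
bagProbs = matrix(runif(5*3), nrow=5, ncol=3)
# average the class probabilities across the bags
meanProb = rowMeans(bagProbs)
# classify using the default thresholdClassProb of 0.5
predictedClass = as.numeric(meanProb > 0.5)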
Generally, nCandidateCovariates>100 is not recommended, because the forward
selection process is
time-consuming. If arguments "nBags=1, replace=FALSE, nObsInBag=nrow(x)" are used,
the function becomes a forward selection GLM predictor without bagging.
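For instance, using the training data defined in the examples below, such a single forward-selection GLM without bagging could be fitted as follows (a sketch; xTrain and yTrain are as in the examples):
# one bag containing every observation exactly once = no bagging
RGLM1 = randomGLM(xTrain, yTrain, nBags=1, replace=FALSE, nObsInBag=nrow(xTrain), nThreads=1)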
Classification of multi-level categorical responses is performed indirectly by turning the single
multi-class response into a set of binary variables. The set can include two types of binary variables:
Level vs. all others (this binary variable is 1 when the original response equals the level and zero
otherwise), and level A vs. level B (this binary variable is 0 when the response equals level A, 1 when the
response equals level B, and NA otherwise).
For example, if the input response y contains observations with values (levels) "A", "B",
"C", the binary variables
will have names "all.vs.A" (1 means "A", 0 means all others), "all.vs.B",
"all.vs.C", and optionally also "A.vs.B" (0 means "A", 1 means "B", NA means neither "A" nor "B"), "A.vs.C",
and "B.vs.C".
Note that using pairwise level vs. level binary variables can be
very time-consuming since the number of such binary variables grows quadratically with the number of levels
in the response. The user has the option to limit which levels of the original response will have their
"own" binary variables, by setting the minimum observations a level must have to qualify for its own binary
variable, and by explicitly enumerating levels that should not have their own binary variables. Note that
such "ignored" levels are still included on the "all" side of "level vs. all" binary variables.
At this time the predictor does not attempt to summarize the binary variable classifications into a single multi-level classification.
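A minimal sketch of multi-level classification, using the full three-level iris data (the argument values here are illustrative, not recommendations):
data(iris)
set.seed(1)
# three-level response: level vs. all and pairwise binary variables are constructed
RGLM3 = randomGLM(iris[, -5], iris$Species, nBags=30, multiClass.pairwise=TRUE, nThreads=1)
# one randomGLM predictor per binary variable
names(RGLM3$binaryPredictors)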
Training this predictor on data with fewer than 8 observations is not recommended (and the function will warn about it). Due to the bagging step, the number of unique observations in each bag is less than the number of observations in the input data; the low number of unique observations can (and often will) lead to an essentially perfect fit, which makes it impossible to perform meaningful stepwise model selection.
Feature names: In general, the column names of input x are assumed to be the feature names. If
x has no column names (i.e., colnames(x) is NULL), standard column names of the form
"F01", "F02", ... are used. If x has non-NULL column names, they are turned into valid and
unique names using the function make.names. If the function make.names returns
names that are not the same as the column names of x, the component featureNamesChanged will
be TRUE and the component nameTranslationTable contains the mapping between input and actually
used feature names. The feature names are used as predictor names in the individual models in each bag.
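For example, make.names converts invalid or duplicate column names as follows:
# make.names produces valid, unique names from problematic column names
make.names(c("a-1", "a-1", "2b"), unique=TRUE)
# [1] "a.1"   "a.1.1" "X2b"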
Examples

## binary outcome prediction
# data generation
data(iris)
# Restrict data to first 100 observations
iris=iris[1:100,]
# Turn Species into a factor
iris$Species = as.factor(as.character(iris$Species))
# Select a training and a test subset of the 100 observations
set.seed(1)
indx = sample(100, 67, replace=FALSE)
xyTrain = iris[indx,]
xyTest = iris[-indx,]
xTrain = xyTrain[, -5]
yTrain = xyTrain[, 5]
xTest = xyTest[, -5]
yTest = xyTest[, 5]
# predict with a small number of bags - normally nBags should be at least 100.
RGLM = randomGLM(xTrain, yTrain, xTest, nCandidateCovariates=ncol(xTrain), nBags=30, nThreads = 1)
yPredicted = RGLM$predictedTest
table(yPredicted, yTest)
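The out-of-bag predictions on the training data (component predictedOOB described above) can be inspected the same way:
# out-of-bag classification of the training samples
table(RGLM$predictedOOB, yTrain)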
## continuous outcome prediction
x=matrix(rnorm(100*20),100,20)
y=rnorm(100)
xTrain = x[1:50,]
yTrain = y[1:50]
xTest = x[51:100,]
yTest = y[51:100]
RGLM = randomGLM(xTrain, yTrain, xTest, classify=FALSE, nCandidateCovariates=ncol(xTrain), nBags=10,
keepModels = TRUE, nThreads = 1)
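Because keepModels=TRUE, the kept models can be reused for later predictions via the predict() generic mentioned above; a minimal sketch, assuming the predict method accepts the new data through its newdata argument:
# predict the test data from the kept models; should agree with the stored test predictions
yPredictedTest = predict(RGLM, newdata=xTest)
cor(yPredictedTest, RGLM$predictedTest.response)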