The function fits generalized boosted models via Rank-Based Trees on both binary and multi-class problems. It converts continuous gene expression profiles into
ranked gene pairs, for which variable importance indices are computed and used for dimension reduction. The boosting implementation is imported directly from the gbm
package. For technical details, see the
vignette: utils::browseVignettes("gbm").
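A minimal illustration of the rank-pair idea (a sketch for intuition only, not the package's internal implementation): within a single sample, each gene pair (i, j) is converted into a binary indicator of whether gene i is expressed lower than gene j.

# One sample's expression values for three hypothetical genes
x <- c(g1 = 2.1, g2 = 0.4, g3 = 3.7)
gene.pairs <- combn(names(x), 2)                 # all pairs of gene names
rank.pairs <- apply(gene.pairs, 2, function(p) as.integer(x[p[1]] < x[p[2]]))
names(rank.pairs) <- apply(gene.pairs, 2, paste, collapse = "<")
rank.pairs
# g1<g2 g1<g3 g2<g3
#     0     1     1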
rboost(
formula,
data,
dimreduce = TRUE,
datrank = TRUE,
distribution = "multinomial",
weights,
ntree = 100,
nodedepth = 3,
nodesize = 5,
shrinkage = 0.05,
bag.fraction = 0.5,
train.fraction = 1,
cv.folds = 5,
keep.data = TRUE,
verbose = TRUE,
class.stratify.cv = TRUE,
n.cores = NULL
)

fit: A vector containing the fitted values on the scale of the regression function (e.g., log-odds scale for bernoulli).
train.error: A vector of length equal to the number of fitted trees containing the value of the loss function for each boosting iteration evaluated on the training data.
valid.error: A vector of length equal to the number of fitted trees containing the value of the loss function for each boosting iteration evaluated on the validation data.
cv.error: If cv.folds < 2, this component is NULL. Otherwise, it is a vector of length equal to the number of fitted trees containing a cross-validated estimate of the loss function for each boosting iteration.
oobag.improve: A vector of length equal to the number of fitted trees containing an out-of-bag estimate of the marginal reduction in the expected value of the loss function. The out-of-bag estimate uses only the training data and is useful for estimating the optimal number of boosting iterations. See gbm.perf and the sketch after this list.
cv.fitted: If cross-validation was performed, the cross-validation predicted values on the scale of the linear predictor, i.e., the fitted values from the i-th CV fold for the model trained on the data in all other folds.
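As a hedged sketch (assuming the returned object stores the components above under the gbm-style names train.error and cv.error, which this page does not guarantee), the iteration minimizing the cross-validated loss can be located directly:

data(tnbc)
fit <- rboost(subtype ~ ., data = tnbc[, c(1:10, 337)], cv.folds = 5)
best.iter <- which.min(fit$cv.error)   # iteration with the smallest CV loss
best.iter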
formula: Object of class 'formula' describing the model to fit.
data: Data frame containing the y-outcome and x-variables.
dimreduce: Dimension reduction via variable-importance-weighted forests. FALSE means no dimension reduction; TRUE means removing 75% of the variables before binary rank conversion and then fitting a weighted forest; a numeric value x between 0 and 1 means removing x * 100% of the variables before binary rank conversion and then fitting a weighted forest (see the sketch after this argument list).
datrank: Logical indicating whether to use ranked raw data when fitting the dimension reduction model.
distribution: A character string specifying the name of the distribution to use. If unspecified, it is guessed: if the response has only 2 unique values, bernoulli is assumed; otherwise, if the response is a factor, multinomial is assumed.
weights: An optional vector of weights to be used in the fitting process. The weights must be positive but do not need to be normalized.
ntree: Integer specifying the total number of trees to fit. This is equivalent to the number of iterations and the number of basis functions in the additive expansion, and corresponds to n.trees in the gbm package.
nodedepth: Integer specifying the maximum depth of each tree. A value of 1 implies an additive model. This corresponds to interaction.depth in the gbm package.
nodesize: Integer specifying the minimum number of observations in the terminal nodes of the trees; corresponds to n.minobsinnode in the gbm package. Note that this is the actual number of observations, not the total weight.
shrinkage: A shrinkage parameter applied to each tree in the expansion, also known as the learning rate or step-size reduction. Values from 0.001 to 0.1 usually work, but a smaller learning rate typically requires more trees. Default is 0.05.
bag.fraction: The fraction of the training set observations randomly selected to propose the next tree in the expansion. This introduces randomness into the model fit: if bag.fraction < 1, running the same model twice will result in similar but different fits. gbm uses the R random number generator, so calling set.seed beforehand ensures that the model can be reconstructed. Preferably, the user can save the returned gbm.object using save. Default is 0.5.
train.fraction: The first train.fraction * nrow(data) observations are used to fit the gbm, and the remaining observations are used to compute out-of-sample estimates of the loss function.
cv.folds: Number of cross-validation folds to perform. If cv.folds > 1, then gbm, in addition to the usual fit, will perform cross-validation and calculate an estimate of the generalization error, returned in cv.error.
keep.data: Logical indicating whether to keep the data and an index of the data stored with the object. Keeping the data and index makes subsequent calls to gbm.more faster, at the cost of storing an extra copy of the dataset.
verbose: Logical indicating whether or not to print out progress and performance indicators (TRUE). If this option is left unspecified for gbm.more, it uses verbose from object. Default is TRUE.
class.stratify.cv: Logical indicating whether or not the cross-validation should be stratified by class. Stratifying the cross-validation helps avoid situations in which training sets do not contain all classes.
n.cores: The number of CPU cores to use. The cross-validation loop will attempt to send different CV folds to different cores. If n.cores is not specified by the user, it is guessed using the detectCores function in the parallel package. Note that the documentation for detectCores makes clear that it is not failsafe and could return a spurious number of available cores.
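As referenced in the dimreduce entry above, a hedged sketch combining several of these arguments (the values shown are illustrative choices, not recommendations):

data(tnbc)
set.seed(100)                          # fixes the bag.fraction subsampling
fit <- rboost(subtype ~ ., data = tnbc[, c(1:10, 337)],
              dimreduce = 0.5,         # remove 50% of variables before rank conversion
              ntree = 200,
              nodedepth = 3,
              shrinkage = 0.05,
              bag.fraction = 0.5,
              cv.folds = 3)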
Ruijie Yin (Maintainer, <ruijieyin428@gmail.com>), Chen Ye and Min Lu
Lu M., Yin R. and Chen X.S. (2024). Ensemble methods of rank-based trees for single sample classification with gene expression profiles. Journal of Translational Medicine 22, 140. doi: 10.1186/s12967-024-04940-2
library(ranktreeEnsemble)  # load the package providing rboost() (assumed here)
data(tnbc)
# Fit a boosted rank-based tree model on 10 genes plus the subtype outcome
obj <- rboost(subtype ~ ., data = tnbc[, c(1:10, 337)])
obj
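Assuming the package provides a predict method for the fitted object (the exact interface is not documented on this page and may differ), predictions for new samples might be obtained along these lines:

# Hypothetical usage: predict class labels for the first five samples
pred <- predict(obj, newdata = tnbc[1:5, c(1:10, 337)])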