dismo (version 0.9-1)

gbm.step: Fit a gbm model, selecting the number of trees by cross-validation

Description

Function to assess the optimal number of boosting trees using k-fold cross-validation. Implements the cross-validation procedure described on page 215 of Hastie, T., R. Tibshirani and J.H. Friedman (2001) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York. The data are divided into 10 subsets, with stratification by prevalence if required for presence/absence data. The function then fits a gbm model of increasing complexity along the sequence from n.trees to n.trees + (n.steps * step.size), calculating the residual deviance at each step along the way. After each fold is processed, the function calculates the average holdout residual deviance and its standard error, and identifies the optimal number of trees as that at which the holdout deviance is minimised. It then fits a model with this number of trees, returning it as a gbm model along with additional information from the cross-validation selection process.

Usage

gbm.step(data, gbm.x, gbm.y, offset = NULL, fold.vector = NULL, tree.complexity = 1,
 learning.rate = 0.01, bag.fraction = 0.75, site.weights = rep(1, nrow(data)), 
 var.monotone = rep(0, length(gbm.x)), n.folds = 10, prev.stratify = TRUE, 
 family = "bernoulli", n.trees = 50, step.size = n.trees, max.trees = 10000,
 tolerance.method = "auto", tolerance = 0.001, keep.data = FALSE, plot.main = TRUE,
 plot.folds = FALSE, verbose = TRUE, silent = FALSE, keep.fold.models = FALSE, 
 keep.fold.vector = FALSE, keep.fold.fit = FALSE, ...)

Arguments

data
input data.frame
gbm.x
predictor variables
gbm.y
response variable
offset
offset
fold.vector
a fold vector to be read in for cross validation with offsets
tree.complexity
sets the complexity of individual trees (equivalent to interaction.depth in gbm)
learning.rate
sets the weight applied to individual trees
bag.fraction
sets the proportion of observations used in selecting variables
site.weights
allows varying weighting for sites
var.monotone
restricts responses to individual predictors to monotone
n.folds
number of folds
prev.stratify
prevalence stratify the folds - only for presence/absence data
family
family - bernoulli (=binomial), poisson, laplace or gaussian
n.trees
number of initial trees to fit
step.size
numbers of trees to add at each cycle
max.trees
max number of trees to fit before stopping
tolerance.method
method to use in deciding to stop - "fixed" or "auto"
tolerance
tolerance value to use: if tolerance.method is "fixed" the value is absolute; if "auto" it is a multiplier of total mean deviance
keep.data
Logical. keep raw data in final model
plot.main
Logical. plot hold-out deviance curve
plot.folds
Logical. plot the individual folds as well
verbose
Logical. control amount of screen reporting
silent
Logical. allow running with no screen output (e.g. when called from gbm.simplify)
keep.fold.models
Logical. keep the fold models from cross-validation
keep.fold.vector
Logical. allows the vector defining fold membership to be kept
keep.fold.fit
Logical. allows the predicted values for observations from cross-validation to be kept
...
allows for any additional arguments, e.g. plotting parameters, to be passed on

Value

  • object of S3 class gbm

References

Elith, J., J.R. Leathwick and T. Hastie, 2008. A working guide to boosted regression trees. Journal of Animal Ecology 77: 802-813.

Examples

data(Anguilla_train)
# reduce data set to speed things up a bit
Anguilla_train <- Anguilla_train[1:200, ]
angaus.tc5.lr01 <- gbm.step(data=Anguilla_train, gbm.x = 3:14, gbm.y = 2, family = "bernoulli",
       tree.complexity = 5, learning.rate = 0.01, bag.fraction = 0.5)
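
A sketch of how the fitted object might be inspected and used for prediction (the component names below follow the list returned by gbm.step; the number of trees selected will vary between runs because fold membership is assigned at random):

```r
# number of trees selected by cross-validation
angaus.tc5.lr01$gbm.call$best.trees

# relative influence of each predictor
summary(angaus.tc5.lr01)

# predicted values on the response scale, using the CV-selected tree count
preds <- predict(angaus.tc5.lr01, Anguilla_train,
                 n.trees = angaus.tc5.lr01$gbm.call$best.trees,
                 type = "response")
```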
