wsrf (version 1.5.47)

wsrf: Build a Forest of Weighted Subspace Decision Trees

Description

Build weighted subspace C4.5-based decision trees to construct a forest.

Usage

wsrf(formula, data, nvars, mtry, ntrees = 500, weights = TRUE,
     parallel = TRUE, na.action = na.fail, importance = FALSE,
     clusterlogfile)

Arguments

formula
a formula, with a response but no interaction terms.
data
a data frame in which to interpret the variables named in the formula.
ntrees
number of trees to build on each server; the default is 500.
nvars, mtry
number of variables to choose when splitting a node, by default the largest integer less than or equal to $\log_2(\mathit{ninputs}) + 1$, where ninputs is the number of input variables. For compatibility with other R packages such as randomForest, both nvars and mtry are supported, but only one of them should be specified. A sketch of the default computation follows this list.
weights
logical. TRUE (the default) for weighted subspace selection; FALSE for random subspace selection, in which case the trees are plain C4.5-based trees.
na.action
a function indicating the behaviour when NA values are encountered in the data; the default is na.fail.
parallel
whether to build the trees in parallel on multiple cores (TRUE), distributed across a set of nodes, or sequentially (FALSE).
importance
logical. Should the importance of predictors be assessed?
clusterlogfile
character. The path name of a log file to write to when building the model on a cluster; useful for debugging.
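
As a rough sketch of the default subspace size, the following computes the largest integer no greater than $\log_2(\mathit{ninputs}) + 1$ and passes it explicitly; here ds and form are hypothetical placeholders for a prepared data frame and formula, and passing nvars this way should be equivalent to accepting the default.

  # Default subspace size: the integer part of log2(ninputs) + 1.
  # Assumes every column of ds other than the response is an input.
  ninputs <- ncol(ds) - 1
  nvars   <- floor(log2(ninputs) + 1)   # e.g. 20 inputs -> 5 variables
  model   <- wsrf(form, data=ds, nvars=nvars)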

Value

An object of class wsrf, which is a list with the following components:
confusion
the confusion matrix of the prediction (based on OOB data).
oob.times
number of times cases are 'out-of-bag' (and thus used in computing the OOB error estimate).
predicted
the predicted values of the input data based on out-of-bag samples.
useweights
logical. Whether weighted subspace selection was used. NULL if the model was obtained by combining multiple wsrf models and one of them has a different value of useweights.
mtry
integer. The number of variables chosen when splitting a node.
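
Assuming model.wsrf is a model fitted as in the Examples below, the documented components can be read straight off the returned list; this is a sketch of inspecting the value, not additional API.

  # Inspect the components of a fitted wsrf object.
  model.wsrf$confusion        # confusion matrix from OOB predictions
  model.wsrf$oob.times        # how often each case was out-of-bag
  head(model.wsrf$predicted)  # OOB predictions for the training data
  model.wsrf$useweights       # TRUE if weighted subspace selection used
  model.wsrf$mtry             # variables considered at each split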

Details

See Xu, Huang, Williams, Wang, and Ye (2012) for more details of the algorithm.

Currently, wsrf can only be used for classification. When weights=FALSE, wsrf grows C4.5-based trees (Quinlan (1993)), using binary splits for continuous predictors (variables) and k-way splits for categorical ones. For continuous predictors, the observed values themselves serve as candidate split points; no discretization is performed. The only stopping condition for a split is a minimum node size of 2.
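
To contrast the two subspace selection schemes, the sketch below builds one forest with weighted selection and one with plain random selection; form and ds stand for a formula and a prepared data frame as in the Examples section.

  # Weighted subspace selection (the default) versus random selection,
  # in which case C4.5-based trees are grown.
  set.seed(42)
  model.weighted <- wsrf(form, data=ds, weights=TRUE)
  model.random   <- wsrf(form, data=ds, weights=FALSE)
  model.weighted$useweights   # TRUE
  model.random$useweights     # FALSE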

References

Xu B, Huang JZ, Williams G, Wang Q, Ye YM (2012). "Classifying very high-dimensional data with random forests built from small subspaces." International Journal of Data Warehousing and Mining (IJDWM), 8(2), 44-63.

Quinlan JR (1993). "C4.5: Programs for Machine Learning". Morgan Kaufmann.

Examples

  library("wsrf")

  # Prepare parameters.
  ds <- rattle::weather
  dim(ds)
  names(ds)
  target <- "RainTomorrow"
  id     <- c("Date", "Location")
  risk   <- "RISK_MM"
  ignore <- c(id, if (exists("risk")) risk) 
  vars   <- setdiff(names(ds), ignore)
  if (sum(is.na(ds[vars]))) ds[vars] <- randomForest::na.roughfix(ds[vars])
  ds[target] <- as.factor(ds[[target]])
  (tt  <- table(ds[target]))
  form <- as.formula(paste(target, "~ ."))
  set.seed(42)
  train <- sample(nrow(ds), 0.7*nrow(ds))
  test  <- setdiff(seq_len(nrow(ds)), train)
  
  # Build model.
  model.wsrf <- wsrf(form, data=ds[train, vars])
  
  # View model.
  print(model.wsrf)
  print(model.wsrf, tree=1)
  
  # Evaluate.
  strength(model.wsrf)
  correlation(model.wsrf)
  cl <- predict(model.wsrf, newdata=ds[test, vars], type="response")
  actual <- ds[test, target]
  (accuracy.wsrf <- sum(cl==actual)/length(actual))
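
  # A confusion matrix on the held-out data gives a finer-grained view
  # than overall accuracy; this assumes cl and actual as computed above.
  (confusion.test <- table(Predicted=cl, Actual=actual))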
