wsrf (version 1.5.47)

wsrf: Build a Forest of Weighted Subspace Decision Trees

Description

Build weighted subspace C4.5-based decision trees to construct a forest.

Usage

wsrf(formula, data, nvars, mtry, ntrees = 500, weights = TRUE,
     parallel = TRUE, na.action = na.fail, importance = FALSE,
     clusterlogfile)

Arguments

formula
a formula, with a response but no interaction terms.
data
a data frame in which to interpret the variables named in the formula.
ntrees
number of trees to build on each server; the default is 500.
nvars, mtry
number of variables to choose when splitting a node, by default the largest integer less than or equal to $\log_2(\mathit{ninputs}) + 1$, where ninputs is the number of input variables. For compatibility with other R packages such as randomForest, both nvars and mtry are supported, but only one of them should be specified. A sketch of the default computation follows this list.
weights
logical. TRUE (the default) for weighted subspace selection; FALSE for random subspace selection, in which case the trees are plain C4.5-based trees.
na.action
a function indicating the behaviour when NA values are encountered in the data; the default is na.fail.
parallel
whether to build the trees in parallel on multiple cores (TRUE), distributed across a set of nodes, or sequentially (FALSE).
importance
logical. Should the importance of predictors be assessed?
clusterlogfile
character. The path name of a log file to write to when building the model on a cluster; useful for debugging.
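
As a rough sketch of the default subspace size, the following computes the largest integer no greater than $\log_2(\mathit{ninputs}) + 1$ and passes it explicitly; here ds and form are hypothetical placeholders for a prepared data frame and formula, and passing nvars this way should be equivalent to accepting the default.

  # Default subspace size: the integer part of log2(ninputs) + 1.
  # Assumes every column of ds other than the response is an input.
  ninputs <- ncol(ds) - 1
  nvars   <- floor(log2(ninputs) + 1)   # e.g. 20 inputs -> 5 variables
  model   <- wsrf(form, data=ds, nvars=nvars)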

Value

An object of class wsrf, which is a list with the following components:
confusion
the confusion matrix of the prediction (based on OOB data).
oob.times
number of times cases are 'out-of-bag' (and thus used in computing the OOB error estimate).
predicted
the predicted values of the input data based on out-of-bag samples.
useweights
logical. Whether weighted subspace selection was used. NULL if the model was obtained by combining multiple wsrf models and one of them has a different value of useweights.
mtry
integer. The number of variables chosen when splitting a node.
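
Assuming model.wsrf is a model fitted as in the Examples below, the documented components can be read straight off the returned list; this is a sketch of inspecting the value, not additional API.

  # Inspect the components of a fitted wsrf object.
  model.wsrf$confusion        # confusion matrix from OOB predictions
  model.wsrf$oob.times        # how often each case was out-of-bag
  head(model.wsrf$predicted)  # OOB predictions for the training data
  model.wsrf$useweights       # TRUE if weighted subspace selection used
  model.wsrf$mtry             # variables considered at each split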

Details

See Xu, Huang, Williams, Wang, and Ye (2012) for more details of the algorithm.

Currently, wsrf can only be used for classification. When weights=FALSE, wsrf grows C4.5-based trees (Quinlan (1993)), using binary splits for continuous predictors (variables) and k-way splits for categorical ones. For continuous predictors, the observed values themselves serve as candidate split points; no discretization is performed. The only stopping condition for a split is a minimum node size of 2.
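
To contrast the two subspace selection schemes, the sketch below builds one forest with weighted selection and one with plain random selection; form and ds stand for a formula and a prepared data frame as in the Examples section.

  # Weighted subspace selection (the default) versus random selection,
  # in which case C4.5-based trees are grown.
  set.seed(42)
  model.weighted <- wsrf(form, data=ds, weights=TRUE)
  model.random   <- wsrf(form, data=ds, weights=FALSE)
  model.weighted$useweights   # TRUE
  model.random$useweights     # FALSE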

References

Xu B, Huang JZ, Williams G, Wang Q, Ye YM (2012). "Classifying very high-dimensional data with random forests built from small subspaces." International Journal of Data Warehousing and Mining (IJDWM), 8(2), 44-63.

Quinlan JR (1993). "C4.5: Programs for Machine Learning". Morgan Kaufmann.

Examples

  library("wsrf")

  # Prepare parameters.
  ds <- rattle::weather
  dim(ds)
  names(ds)
  target <- "RainTomorrow"
  id     <- c("Date", "Location")
  risk   <- "RISK_MM"
  ignore <- c(id, if (exists("risk")) risk) 
  vars   <- setdiff(names(ds), ignore)
  if (sum(is.na(ds[vars]))) ds[vars] <- randomForest::na.roughfix(ds[vars])
  ds[target] <- as.factor(ds[[target]])
  (tt  <- table(ds[target]))
  form <- as.formula(paste(target, "~ ."))
  set.seed(42)
  train <- sample(nrow(ds), 0.7*nrow(ds))
  test  <- setdiff(seq_len(nrow(ds)), train)
  
  # Build model.
  model.wsrf <- wsrf(form, data=ds[train, vars])
  
  # View model.
  print(model.wsrf)
  print(model.wsrf, tree=1)
  
  # Evaluate.
  strength(model.wsrf)
  correlation(model.wsrf)
  cl <- predict(model.wsrf, newdata=ds[test, vars], type="response")
  actual <- ds[test, target]
  (accuracy.wsrf <- sum(cl==actual)/length(actual))
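
  # A confusion matrix on the held-out data gives a finer-grained view
  # than overall accuracy; this assumes cl and actual as computed above.
  (confusion.test <- table(Predicted=cl, Actual=actual))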
