
wsrf (version 1.5.14)

wsrf: Build a Forest of Weighted Subspace Decision Trees

Description

Build weighted subspace decision trees to construct a forest.

Usage

wsrf(formula, data, nvars, mtry, ntrees=500, weights=TRUE,
     parallel=TRUE, na.action=na.fail, importance=FALSE, clusterlogfile)

Arguments

formula
a formula, with a response but no interaction terms.
data
a data frame in which to interpret the variables named in the formula.
ntrees
number of trees to build on each server; by default, 500.
nvars, mtry
number of variables to choose as candidates when splitting a node, by default the largest integer less than or equal to $\log_2(ninputs) + 1$. For compatibility with other R packages like randomForest, both nvars and mtry are accepted and refer to the same parameter.
weights
logical. TRUE (the default) for weighted subspace selection; FALSE for random subspace selection, in which case the trees grown are based on C4.5.
na.action
indicates the behaviour when NA values are encountered in the data.
parallel
whether to build the trees in parallel: TRUE to use multiple cores on the local machine, a vector of node names to distribute across a cluster, or FALSE to run sequentially.
importance
should importance of predictors be assessed?
clusterlogfile
character. The path of a log file to be written while building the model on a cluster; used for debugging.
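
As a quick sketch of how these arguments fit together (the weather data and RainTomorrow target come from the Examples below; na.omit is standard R):

  library(wsrf)
  library(rattle)                      # provides the weather dataset

  # Build a smaller forest sequentially, dropping rows with NAs
  # instead of failing (the default na.action is na.fail).
  ds    <- weather[setdiff(names(weather), c("Date", "Location", "RISK_MM"))]
  model <- wsrf(RainTomorrow ~ ., data=ds,
                ntrees=100,            # fewer trees than the default 500
                weights=TRUE,          # weighted subspace selection (default)
                na.action=na.omit,     # drop incomplete rows
                parallel=FALSE)        # single-threaded build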

Value

An object of class wsrf, which is a list with the following components:

  • confusion: the confusion matrix of the prediction (based on OOB data).
  • oob.times: number of times cases are `out-of-bag' (and thus used in computing the OOB error estimate).
  • predicted: the predicted values of the input data based on out-of-bag samples.
  • useweights: logical. Whether weighted subspace selection was used; NULL if the model was obtained by combining multiple wsrf models and one of them has a different value of useweights.
  • mtry: integer. The number of variables chosen when splitting a node.
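
Since the returned object is a plain list, these components can be inspected directly; for example, given the model.wsrf object built in the Examples below:

  model.wsrf$confusion         # OOB confusion matrix
  head(model.wsrf$predicted)   # OOB predictions for the training data
  model.wsrf$mtry              # subspace size used at each split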

Concepts

  • weighted subspace decision trees
  • weighted subspace random forest

Details

See Xu, Huang, Williams, Wang, and Ye (2012) for more details of the algorithm.

Currently, wsrf can only be used for classification. When weights=FALSE, wsrf grows C4.5-based trees (Quinlan (1993)), using binary splits for continuous predictors (variables) and k-way splits for categorical ones. For continuous predictors, the observed values themselves are used as candidate split points, with no discretization. The only stopping condition for splitting is a minimum node size of 2.
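
For example, the default subspace size for a dataset with 20 input variables works out as:

  ninputs <- 20
  floor(log2(ninputs) + 1)   # default nvars: 5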

References

Xu B, Huang JZ, Williams G, Wang Q, Ye YM (2012). "Classifying very high-dimensional data with random forests built from small subspaces." International Journal of Data Warehousing and Mining (IJDWM), 8(2), 44-63.

Quinlan J. R. (1993). "C4.5: Programs for Machine Learning". Morgan Kaufmann.

Examples

  library(wsrf)
  library(rattle)
  library(randomForest)
  
  # prepare parameters
  ds <- get("weather")
  dim(ds)
  names(ds)
  target <- "RainTomorrow"
  id     <- c("Date", "Location")
  risk   <- "RISK_MM"
  ignore <- c(id, if (exists("risk")) risk) 
  vars   <- setdiff(names(ds), ignore)
  if (any(is.na(ds[vars]))) ds[vars] <- na.roughfix(ds[vars])
  ds[target] <- as.factor(ds[[target]])
  (tt  <- table(ds[target]))
  form <- as.formula(paste(target, "~ ."))
  set.seed(42)
  train <- sample(nrow(ds), 0.7*nrow(ds))
  test  <- setdiff(seq_len(nrow(ds)), train)
  
  # build model
  model.wsrf <- wsrf(form, data=ds[train, vars])
  
  # view model
  print(model.wsrf)
  print(model.wsrf, tree=1)
  
  # evaluate
  strength(model.wsrf)
  correlation(model.wsrf)
  cl <- predict(model.wsrf, newdata=ds[test, vars], type="response")
  actual <- ds[test, target]
  (accuracy.wsrf <- sum(cl==actual)/length(actual))
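  
  # Variable importance requires importance=TRUE at build time.
  # A sketch, assuming wsrf provides an importance() accessor for
  # such models (as randomForest does); check the package index to confirm.
  model.imp <- wsrf(form, data=ds[train, vars], importance=TRUE)
  importance(model.imp)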
