
randomUniformForest (version 1.0.8)

randomUniformForest: Random Uniform Forests for Classification and Regression

Description

Ensemble model, for classification and regression, based on a forest of unpruned and randomized binary trees. Each tree is grown by sampling, with replacement, a set of variables at each node. Each cut-point is generated randomly, according to the Uniform distribution on the support of each candidate variable. The optimal random node is then chosen by maximizing information gain (classification) or by minimizing the 'L2' (or 'L1') distance (regression). Data are either bootstrapped or sub-sampled for each tree. Random Uniform Forests aim to lower the correlation between trees, to offer more details about variable importance and selection, and to allow native incremental learning.
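The node-splitting idea above can be illustrated in a few lines of base R. This is a rough sketch only, not the package's internal implementation: one cut-point is drawn uniformly on the support of a candidate variable and scored by information gain; the forest compares many such random candidate nodes and keeps the best one.

data(iris)
x <- iris$Petal.Length
y <- iris$Species

## entropy of a class vector (in bits)
entropy <- function(labels) {
  p <- prop.table(table(labels))
  -sum(ifelse(p > 0, p * log2(p), 0))
}

## one cut-point drawn from the Uniform distribution on the support of x
cutpoint <- runif(1, min(x), max(x))

## information gain of the split 'x <= cutpoint' versus 'x > cutpoint'
left <- y[x <= cutpoint]
right <- y[x > cutpoint]
gain <- entropy(y) - (length(left)/length(y)) * entropy(left) -
	(length(right)/length(y)) * entropy(right)
gain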

Usage

## S3 method for class 'formula':
randomUniformForest(formula, data = NULL, ...)
## S3 method for class 'default':
randomUniformForest(X, Y = NULL, xtest = NULL, ytest = NULL, 
	ntree = 100,
	mtry = ifelse(bagging, ncol(X), floor(4/3*ncol(X))),
	nodesize = 1,
	maxnodes = Inf,
	depth = Inf,
	depthcontrol = NULL,
	regression = ifelse(is.factor(Y), FALSE, TRUE),
	replace = ifelse(regression, FALSE, TRUE),
	OOB = TRUE,
	BreimanBounds = ifelse(OOB, TRUE, FALSE),
	subsamplerate = ifelse(regression, 0.7, 1),
	importance = TRUE,
	bagging = FALSE,
	unsupervised = FALSE,
	proximities = FALSE,
	classwt = NULL,
	oversampling = 0,
	targetclass = -1,
	outputperturbationsampling = FALSE,
	rebalancedsampling = FALSE,
	featureselectionrule = c("entropy", "gini", "random", "L2", "L1"),
	randomcombination = 0,
	randomfeature = FALSE,
	categoricalvariablesidx = NULL,
	na.action = c("fastImpute", "accurateImpute", "omit"),
	logX = FALSE,
	classcutoff = c(0,0),
	threads = "auto",
	parallelpackage = "doParallel",
	...)	
## S3 method for class 'randomUniformForest':
print(x, ...)
## S3 method for class 'randomUniformForest':
summary(object, maxVar = 30, border = NA, ...)
## S3 method for class 'randomUniformForest':
plot(x, threads = "auto", ...)

Arguments

maxVar
maximum number of variables to plot and print when summarizing a randomUniformForest object.
border
positive integer value or NA. Changes the color of the borders when plotting variable importance. By default, NA, which disables borders.
x, object
an object of class randomUniformForest.
data
in the case of the formula interface, a data frame or matrix containing the variables (including the response) and their values.
X, formula
a data frame or matrix of predictors, or a formula describing the model to be fitted. Note that it is strongly recommended to avoid the formula interface when using options or with large samples.
Y
a response vector. If it is a factor, classification is assumed, otherwise regression is computed.
xtest
a data frame or matrix (like X) containing predictors for the test set.
ytest
responses for the test set, if provided.
ntree
number of trees to grow. Default value is 100. Do not set it too small.
mtry
number of variables randomly sampled with replacement as candidates at each split. Default value is floor(4/3*ncol(X)) unless 'bagging' or 'randomfeature' options are specified. One can also set mtry = "random". For regression, increasing the 'mtry' value usually leads to better accuracy.
nodesize
minimal size of terminal nodes. Default value is 1 (for both classification and regression) and usually produces the best results, as it reduces bias when trees are fully grown. Variance is increased, but that is exactly what Random Uniform Forests need.
maxnodes
maximal number of nodes for each tree. Default value is 'Inf', growing trees to maximum size. A random number of nodes is allowed by setting the option to "random".
depth
depth of each tree. By default, trees are fully grown. Maximum depth is floor(log(n)/log(2)), where n = nrow(X). Stumps are not allowed, hence the smallest depth is 3. Note that 'depth' has an effect when assessing variable importance.
depthcontrol
lets the algorithm control depth by setting depthcontrol = "random", or set it lower than 16 for regression. Small values greatly increase speed while reducing accuracy. For classification, set it less than or equal to 0.01, but accuracy is more sensitive to this value.
regression
only needed if either classification or regression has to be set explicitly. Otherwise, the model checks whether 'Y' is a factor (classification) or not (regression) before computing the task. If Y is not a factor and one wants to do classification, 'regression' must be set to FALSE.
replace
if TRUE, sampling of cases is done with replacement. By default, TRUE for classification, FALSE for regression.
OOB
if replace is TRUE, then if OOB is TRUE, "Out-of-bag" evaluation is done, resulting in an estimate of the generalization (and mean squared) error and its bounds. The OOB option adds overhead to computing time, but it is one of the most useful options.
BreimanBounds
if TRUE, computes all theoretical properties provided by Breiman (2001), since Random Uniform Forests inherit Random Forests' properties. For classification, it gives the two bounds of prediction error, the average correlation between trees, the strength and the standard deviation of strength.
subsamplerate
value is the rate of sub-sampling (Bühlmann and Yu, 2002) for the training sample. By default, 0.7 for regression (1, i.e. no sub-sampling, in classification). If 'replace' is TRUE, 'subsamplerate' can be set to values greater than 1. For regression, if only accuracy matters, setting 'subsamplerate' to 1 and 'replace' to TRUE can slightly improve results.
importance
should importance of predictors be assessed? By default, TRUE.
bagging
if TRUE, "Bagging" (Breiman, 1996) of random uniform decision trees (unpruned trees whose variables are, usually, sampled with replacement and with cut-points chosen randomly using the Uniform distribution on the support of each candidate variable) is don
unsupervised
not yet implemented.
proximities
not yet fully implemented.
classwt
for classification only. Priors of the classes. They need not add up to one. Useful for imbalanced classes. Note that if one wants to compute many forests and combine them, with 'classwt' enabled for only a few of them, all other forests must have 'classwt' enabled as well.
oversampling
for classification, a scalar between 1 and -1 for over- or under-sampling of the minority or majority class. Must be used with 'targetclass'. For example, if set to -0.3, and 'targetclass' set to 1, then the first class (assumed to be the majority class) will be under-sampled by 30 percent (see the sketch at the end of this section).
targetclass
for classification only. Which class (given by its subscript, e.g. 1 for the first class) should be targeted by the 'oversampling' or 'outputperturbationsampling' option?
outputperturbationsampling
if TRUE, lets the model apply a random perturbation to the response vector. For classification, 'targetclass' must be set to the class (given by its position) that will be perturbed. By default, 5 percent of the values will be perturbed, but more is allowed.
rebalancedsampling
for classification only. Can be set to TRUE or to a vector containing the desired sample size for each class. If TRUE, the model builds samples where all classes are equally distributed, leading to exactly balanced classes, by either over-sampling or under-sampling.
featureselectionrule
which optimization criterion should be chosen for growing trees? By default, the model uses "entropy" (in classification) to compute the information gain function. If set to "random", the model chooses randomly between the Gini criterion and entropy for each node of each tree.
randomcombination
vector containing feature indices and, optionally, weight(s) for (random) combinations of features. For example, if a combination of feature 1 and feature 2 is desired with a weight of 0.2 for the first, then randomcombination = c(1, 2, 0.2).
randomfeature
if TRUE, a forest of totally randomized trees (i.e. a purely random forest) will be grown. In this case, there is no optimization. Useful as a baseline for forests of randomized trees, since it is statistically consistent for a suitable choice of 'nodesize' (see Biau et al., 2008).
categoricalvariablesidx
which variables should be considered as categorical? By default, the value is NULL, and categorical variables are then treated in the same way as continuous ones. If 'X' is a data frame, the value can be set to "all", in which case the model will automatically identify categorical variables.
na.action
how to deal with NA values? By default, na.action = "fastImpute", using rough replacement with the median or the most frequent values. If speed is not required, na.action = "accurateImpute" can lead to better results, using the model itself to impute NA values; na.action = "omit" simply removes incomplete observations (see the sketch at the end of this section).
logX
applies logarithm transformation to all predictors whose values are strictly positive, and ignores the others.
classcutoff
for classification only. Changes the proportion of votes needed to get the majority. The first value of the vector is the name of the class (between quotes) that has to be assessed. The second value is a kind of weight needed to get the majority (see the sketch at the end of this section).
threads
compute the model in parallel for computers with many cores. Default value is "auto", letting the model run on all logical cores minus 1. The user can set 'threads' to any value greater than 1. Note that, on Windows, logical cores consume the same memory as physical ones.
parallelpackage
which parallel back-end to use for computing parallel tasks? By default, and for ease of use, 'doParallel' is the package retained for now and should not be modified. It is not the fastest, but it has the great advantage of allowing a task to be killed, e.g. by pushing the 'Stop' button, without freezing R.
...
not currently used.
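The sampling, cutoff and missing-value options above can be combined in a single call. The following sketch is not run and purely illustrative: 'X.imb' and 'Y.imb' are hypothetical placeholders for an imbalanced two-class problem whose majority class is the first one, and the weight passed to 'classcutoff' is arbitrary.

## under-sample the majority (first) class by 30 percent
# imb.ruf <- randomUniformForest(X.imb, Y.imb,
# oversampling = -0.3, targetclass = 1, threads = 1)

## require more votes before assigning the class named "1" (arbitrary weight)
# cutoff.ruf <- randomUniformForest(X.imb, Y.imb,
# classcutoff = c("1", 0.7), threads = 1)

## let the model itself impute missing values before training
# imputed.ruf <- randomUniformForest(X.imb, Y.imb,
# na.action = "accurateImpute", threads = 1)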

Value

An object of class randomUniformForest, which is a list with the following components:
  • forest: list of tree objects, OOB objects (if OOB = TRUE) and variable importance objects (if importance = TRUE).
  • predictionObject: if 'xtest' is not NULL, prediction objects.
  • errorObject: statistics about the errors of the model.
  • forestParams: almost all parameters of the model.
  • classes: original labels of the response vector, in case of classification.
  • logX: TRUE, if logarithm transformation has been called.
  • y: training responses.
  • variablesNames: vector of variable names.
  • call: the original call to randomUniformForest.
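For instance, once a model has been fitted (reusing 'iris.ruf' from the Examples section), these components can be inspected directly. This is a sketch of ordinary list access:

# iris.ruf <- randomUniformForest(Species ~ ., data = iris, threads = 1)
# iris.ruf$forestParams   ## parameters used to grow the forest
# iris.ruf$errorObject    ## OOB (and test, if 'xtest' was given) error statistics
# iris.ruf$variablesNames ## names of the predictors
# iris.ruf$call           ## the original call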

Details

Random Uniform Forests are inspired by Bagging and Breiman's Random Forests (tm), but have many differences at both the theoretical and algorithmic levels. Random Uniform Forests build many randomized and unpruned trees by: sampling data with replacement (or sub-sampling, in the case of regression), sampling features with replacement, and choosing random cut-points according to the Uniform distribution. Cut-points usually do not belong to the data but are virtual points, drawn between the minimum and the maximum of each candidate variable at each node using the Uniform distribution, since all points are (or will always be converted to) numeric values. Each node is then built using the information gain (or distance) computed over many fully random candidate nodes. Classification is done by majority vote, and regression by averaging the trees. In the latter case, post-processing (designed first to reduce bias) can be applied to achieve better accuracy.

Note that Random Uniform Forests do not make assumptions about node size or the number of candidate features at each node. Default options usually lead to good results, but this is not a rule and one can try many options. Trees are designed to have low bias and large variance, and are thus optimized only to reach a high level of randomness. The forest maintains the bias and reduces the variance, since the variance of the forest is approximately (in regression) the product of the average correlation between tree residuals and the average variance of the trees. The same scheme holds for the prediction error.

Other main features are: a deep analysis of variable importance and selection, treatment of imbalanced classes, quantile regression, prediction and confidence intervals, partial dependencies, and visualization to help interpretation.

At the algorithmic level, Random Uniform Forests are natively parallel and support distributed computing, following the principle "compute everywhere, combine in one place": one can compute many Random Uniform Forests, using different options and different data (sharing, at least, some features), on many computers, and at the end simply retrieve and combine them in the place where the test data belong. As a consequence, incremental learning is also native, and one can remove or add trees (but not modify them) at each step.

Note that Random Uniform Forests are strongly randomized, so results will not be reproducible using the set.seed() function. One reason is that many (including essential) options run at the tree (or node) level in order to decrease correlation. Hence, for a given training sample with enough trees and enough data, bound estimates (OOB and Breiman's) should act as upper bounds which will vary (depending on both the training sample and the runs of the model), while the test error, for any sample used with the given training sample and model, should remain below them.

Note also that speed is currently not state of the art, except for high dimensions ('p' far greater than 'n') or tasks with many variables and observations (for large samples, one may use incremental learning with rUniformForest.combine or rUniformForest.big), but options give the ability to increase it by a large factor (losing, however, some accuracy).
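The "compute everywhere, combine in one place" principle can be sketched as follows (not run). 'X1'/'Y1' and 'X2'/'Y2' are hypothetical data chunks sharing the same features, 'Xtest' is a hypothetical test sample, and the exact interface of rUniformForest.combine is assumed here; see its help page.

## grow two forests separately, e.g. on two machines or on two chunks of data
# ruf.part1 <- randomUniformForest(X1, Y1, ntree = 50, threads = 1)
# ruf.part2 <- randomUniformForest(X2, Y2, ntree = 50, threads = 1)

## combine them in the place where the test data belong, then predict as usual
# ruf.all <- rUniformForest.combine(ruf.part1, ruf.part2)
# predictions <- predict(ruf.all, Xtest)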

References

Biau, G., Devroye, L., Lugosi, G., 2008. Consistency of random forests and other averaging classifiers. The Journal of Machine Learning Research 9, 2015-2033.
Breiman, L., 1996. Heuristics of instability and stabilization in model selection. The Annals of Statistics 24(6), 2350-2383.
Breiman, L., 1996. Bagging predictors. Machine Learning 24, 123-140.
Breiman, L., 2001. Random Forests. Machine Learning 45(1), 5-32.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C., 1984. Classification and Regression Trees. New York: Chapman and Hall.
Ciss, S., 2014. PhD thesis: Forets uniformement aleatoires et detection des irregularites aux cotisations sociales. Universite Paris Ouest Nanterre, France. In French. English title: Random Uniform Forests and irregularity detection in social security contributions. Link: https://www.dropbox.com/s/q7hbgeafrdd8qtc/Saip_Ciss_These.pdf?dl=0
Ho, T.K., 1998. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 832-844.

See Also

predict.randomUniformForest, rUniformForest.big, rUniformForest.combine, rUniformForest.grow, importance.randomUniformForest, rm.trees, roc.curve, fillNA2.randomUniformForest, getTree.randomUniformForest

Examples

# NOTE : use option 'threads = 1' (disabling parallel processing) to speed up computing 
# for small samples, since parallel processing is useful only for computationally intensive tasks

###### Part One : quick guide

## not run
#### Classification 
# data(iris)
# iris.ruf <- randomUniformForest(Species ~ ., data = iris, threads = 1)
# iris.ruf ## or print(iris.ruf)

## plot OOB error: 
# plot(iris.ruf, threads = 1)

## print and plot (global) variable importance and some statistics about trees:
# summary(iris.ruf)

#### Regression

## Note that when formula is used, missing values are automatically deleted and dummies
## are built for categorical features
# data(airquality)
# ozone.ruf <- randomUniformForest(Ozone ~ ., data = airquality, threads = 1)
# ozone.ruf

## plot OOB error: 
# plot(ozone.ruf, threads = 1)

## Bagging
# ozone.bagging.ruf <- randomUniformForest(Ozone ~ ., data = airquality,
# bagging = TRUE, threads = 1)

## Ensemble of totally randomized trees, e.g. purely random forest
# ozone.prf <- randomUniformForest(Ozone ~ ., data = airquality, randomfeature = TRUE, threads = 1)

#### Common case: use X, as a matrix or data frame, and Y, as a response vector, 
#### training and testing (or validating)

#### Classification : iris data, training and testing
data(iris)

## define random train and test sample. "Species" is the response vector
# iris.train_test <- init_values(iris[,-which(colnames(iris) == "Species")], iris$Species,
# sample.size = 1/2)

## iris train and test samples
# iris.train = iris.train_test$xtrain
# species.train = iris.train_test$ytrain
# iris.test = iris.train_test$xtest
# species.test = iris.train_test$ytest

## iris train and test modelling
# iris.train_and_test.ruf <- randomUniformForest(iris.train, species.train,
# xtest = iris.test, ytest = species.test, threads = 1)
# iris.train_and_test.ruf

## Balanced sampling : equal sample size for all labels
# iris.train_and_test.balancedsampling.ruf <- randomUniformForest(iris.train, species.train,
# xtest = iris.test, ytest = species.test, rebalancedsampling = TRUE, threads = 1)
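
## predict labels for the test sample and compare with the truth
## (sketch: 'predict' is assumed to return the predicted labels by default,
## see predict.randomUniformForest)
# species.pred <- predict(iris.train_and_test.ruf, iris.test)
# table(species.pred, species.test)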

								
###### Part Two : Summarized case studies (remove comments to run)

#### Classification : Wine Quality data
## http://archive.ics.uci.edu/ml/datasets/Wine+Quality
## We use red wine quality file : data have 1599 observations, 12 variables and 6 classes.
 
## data(wineQualityRed)
# wineQualityRed.data = wineQualityRed

## class and observations
# Y = wineQualityRed.data[,"quality"]
# X = wineQualityRed.data[, -which(colnames(wineQualityRed.data) == "quality")]

## First look : train model with default parameters (and retrieve estimates)
## call it standard model.
# wineQualityRed.std.ruf <- randomUniformForest(X, as.factor(Y), threads = 2)

## see OOB evaluation and parameters
# wineQualityRed.std.ruf 

## see statistics about the forest and global variable importance
# summary(wineQualityRed.std.ruf)

## But some labels do not have enough observations to assess variable importance,
## so we merge classes 3 and 4, and classes 7 and 8, to get enough data.
# Y[Y == 3] = 4
# Y[Y == 8] = 7

## make Y as a factor, change names and get a summary
# Y = as.factor(Y)
# levels(Y) = c("3 or 4", "5", "6", "7 or 8")
# table(Y)

## learn a new model to get a better view on variable importance
## note : Y is now a factor, so the model will treat the learning task as classification
# wineQualityRed.new.ruf <- randomUniformForest(X, Y)
# wineQualityRed.new.ruf 

## global variable importance is more consistent
# summary(wineQualityRed.new.ruf)

## plot OOB error (needs some computing)
# plot(wineQualityRed.new.ruf, threads = 2)

## go deeper in assessing variable importance, using a high level of interaction
# importance.wineQualityRed <- importance(wineQualityRed.new.ruf, Xtest = X, maxInteractions = 6)
									
## visualize: global importance, importance based on interactions, 
## importance based on labels, partial dependencies for all influential variables and interactions.
## Loop over the prompt to get the other partial dependencies;
## get more points using option whichOrder = "all" (the default).
# plot(importance.wineQualityRed, Xtest = X, whichOrder = "first")

## look at some specific labels from a (very) local viewpoint
## which features make a very good wine (class 7 or 8)?
# pImportance.wineQualityRed.class7or8 <- partialImportance(X, importance.wineQualityRed, 
# whichClass = "7 or 8", nLocalFeatures = 6)
											
## but how do they act?
## get it feature after feature, recalling partial dependence
## and considering the feature at the first order,
## assuming it is the most important, at least for the class one needs to assess.
# pDependence.wineQualityRed.totalSulfurDioxide <- partialDependenceOverResponses(X, 
# importance.wineQualityRed, whichFeature = "total.sulfur.dioxide", 
# whichOrder = "first", outliersFilter = TRUE)
											
## see what happens then for "alcohol"
# pDependence.wineQualityRed.alcohol <- partialDependenceOverResponses(X, 
# importance.wineQualityRed, whichFeature = "alcohol", 
# whichOrder = "first", outliersFilter = TRUE)

#### Regression : Auto MPG 
## http://archive.ics.uci.edu/ml/datasets/Auto+MPG
## 398 observations, 8 variables, missing values
## Variable to predict : "mpg", miles per gallon 

## data(autoMPG)
# autoMPG.data = autoMPG

# Y = autoMPG.data[,"mpg"]
# X = autoMPG.data[,-which(colnames(autoMPG.data) == "mpg")]

## remove "car name" which is a variable with unique ID (car models)
# X = X[, -which(colnames(X) == "car name")]

## train the model and get OOB evaluation
# autoMPG.ruf <- randomUniformForest(X, Y)

## assess variable importance
# importance.autoMPG <- importance(autoMPG.ruf, Xtest = X)
# plot(importance.autoMPG, Xtest = X)

## what are the features that lead to a lower consumption?
# pImportance.autoMPG.low <- partialImportance(X, importance.autoMPG, 
# threshold = mean(Y), thresholdDirection = "low", nLocalFeatures = 6)
											
## Look at "weight" and "acceleration" dependence
## note that option perspective = TRUE allows a 3D representation
# pDependence.autoMPG.weightAndAcceleration <- 
# partialDependenceBetweenPredictors(X, importance.autoMPG, c("weight", "acceleration"),
# whichOrder = "all", perspective = FALSE, outliersFilter =  TRUE)
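
## Out-of-sample check for the regression case (sketch, not run):
## an illustrative random split; 'predict' is assumed to return the predicted
## responses by default (see predict.randomUniformForest)
# idx <- sample(nrow(X), floor(0.7*nrow(X)))
# autoMPG.train.ruf <- randomUniformForest(X[idx, ], Y[idx])
# mpg.pred <- predict(autoMPG.train.ruf, X[-idx, ])
# mean((mpg.pred - Y[-idx])^2) ## test mean squared error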

