
randomUniformForest (version 1.1.2)

randomUniformForest: Random Uniform Forests for Classification, Regression and Unsupervised Learning

Description

Ensemble model for classification, regression and unsupervised learning, based on a forest of unpruned and randomized binary decision trees. Unlike Breiman's Random Forests, each tree is grown by sampling, with replacement, a set of variables at each node. Each cut-point is generated randomly, according to the continuous Uniform distribution between two random points of each candidate variable. The optimal random node is then selected among many fully random nodes by maximizing the information gain (classification) or minimizing the 'L2' (or 'L1') distance (regression). Unlike Extremely Randomized Trees, data are either bootstrapped or sub-sampled for each tree. Random Uniform Forests aim to lower the correlation between trees, to offer a deep analysis of variable importance and to allow native distributed and incremental learning. The unsupervised mode introduces clustering and dimension reduction, using a three-layer engine (dissimilarity matrix, Multidimensional Scaling and k-means or hierarchical clustering).
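
A minimal sketch of the unsupervised mode, using the 'unsupervised' and 'unsupervisedMethod' options shown in Usage (the number of trees below is arbitrary; see also unsupervised.randomUniformForest for the dedicated interface):

library(randomUniformForest)
data(iris)
## unsupervised mode: no response vector; clustering and dimension reduction
iris.uruf <- randomUniformForest(iris[, -5], unsupervised = TRUE,
	unsupervisedMethod = "uniform univariate sampling", ntree = 50, threads = 1)
iris.uruf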

Usage

## S3 method for class 'formula':
randomUniformForest(formula, data = NULL, subset = NULL, ...)
## S3 method for class 'default':
randomUniformForest(X, Y = NULL, xtest = NULL, ytest = NULL, 
	ntree = 100,
	mtry = ifelse(bagging, ncol(X), floor(4/3*ncol(X))),
	nodesize = 1,
	maxnodes = Inf,
	depth = Inf,
	depthcontrol = NULL,
	regression = ifelse(is.factor(Y), FALSE, TRUE),
	replace = ifelse(regression, FALSE, TRUE),
	OOB = TRUE,
	BreimanBounds = ifelse(OOB, TRUE, FALSE),
	subsamplerate = ifelse(regression, 0.7, 1),
	importance = TRUE,
	bagging = FALSE,
	unsupervised = FALSE,
	unsupervisedMethod = c("uniform univariate sampling", 
	"uniform multivariate sampling", "with bootstrap"),
	classwt = NULL,
	oversampling = 0,
	targetclass = -1,
	outputperturbationsampling = FALSE,
	rebalancedsampling = FALSE,
	featureselectionrule = c("entropy", "gini", "random", "L2", "L1"),
	randomcombination = 0,
	randomfeature = FALSE,
	categoricalvariablesidx = NULL,
	na.action = c("fastImpute", "accurateImpute", "omit"),
	logX = FALSE,
	classcutoff = c(0,0),
	subset = NULL,
	usesubtrees = FALSE,
	threads = "auto",
	parallelpackage = "doParallel",
	...)	
## S3 method for class 'randomUniformForest':
print(x, ...)
## S3 method for class 'randomUniformForest':
summary(object, maxVar = 30, border = NA, ...)
## S3 method for class 'randomUniformForest':
plot(x, threads = "auto", ...)

Arguments

Value

An object of class randomUniformForest, which is a list with the following components:

  • forest: list of tree objects, OOB objects (if OOB = TRUE) and variable importance objects (if importance = TRUE).
  • predictionObject: prediction objects, if 'xtest' is not NULL.
  • errorObject: statistics about the errors of the model.
  • forestParams: almost all parameters of the model.
  • classes: original labels of the response vector, in case of classification.
  • logX: TRUE if the logarithm transformation has been called.
  • y: training responses.
  • variablesNames: vector of variable names.
  • call: the original call to randomUniformForest.
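
For example, these components can be inspected directly on a fitted object (a minimal sketch; the small number of trees is only there to keep the fit quick):

library(randomUniformForest)
data(iris)
iris.ruf <- randomUniformForest(Species ~ ., data = iris, ntree = 20, threads = 1)
names(iris.ruf)        # the components listed above
iris.ruf$forestParams  # parameters used to grow the forest
iris.ruf$classes       # original class labels (classification only)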

Details

Random Uniform Forests are inspired by Bagging and Breiman's Random Forests (tm) but have many differences at the theoretical and algorithmic levels. Random Uniform Forests build many randomized and unpruned trees, and the four main differences with Random Forests are:
- sampling, with replacement, features at each node,
- subsampling data, in the case of regression,
- generating random cut-points according to the Uniform distribution, i.e. cut-points usually do not belong to the data but are virtual points drawn between the minimum and the maximum, or between two random points, of each candidate variable at each node, using the continuous Uniform distribution, since all points are (or will always be converted to) numeric values,
- the optimization criterion. Maximizing the information gain is preferably used for classification. For regression, the sum of squared (or absolute) residuals is computed for each candidate node (region); then, for each sampled feature, the metrics are summed for each pair of complementary nodes. The chosen pair is the one that reaches the minimum. More precisely, in regression only sums are involved, and only in the candidate nodes (not in the current one). Note that this could also be the case for Breiman's Random Forests.
The enumeration above leads to a large and deep tree that is grown using global optimization, for the current partition, to select each node. Sampling features with replacement increases the competition between nodes, in order to limit variance, especially in the regression case, where prediction error depends more on the model than in the classification case. Other differences also appear at the node level. Like Random Forests, classification is done by majority vote, and regression by averaging the trees' outputs, but:
- trees can be updated with streaming data (currently disabled for further tests),
- trees with different parameters and data can be combined,
- trees are explicitly designed to have a low average bias, while trying to tame the increase in variance, and are thus optimized to reach a high level of randomness.
The forest maintains the bias and reduces variance, since the variance of the forest is approximately (in regression) the product of the average correlation between tree residuals and the average variance of the trees. This leads to the same scheme for the prediction error. Note that a decrease in correlation cannot be obtained at the same time as a decrease in variance; the main work is to decrease correlation faster than the variance grows. Note also that low correlation is mandatory to reach convergence, especially in regression, where the average correlation tends to be high. Type vignette("randomUniformForestsOverview", package = "randomUniformForest") at the R prompt to get a summary of the technical details.
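
As a conceptual illustration of the random cut-points described above (a sketch in plain R, not the package's internal C++ implementation):

## draw one random cut-point for a candidate variable
data(iris)
x <- iris$Sepal.Length
bounds <- sort(sample(x, 2))               # two random points of the candidate variable
cutpoint <- runif(1, bounds[1], bounds[2]) # continuous Uniform draw between them
cutpoint                                   # a virtual point, usually not belonging to the data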
Other main features, thanks to Breiman's ideas, to the ensemble structure and to the Bayesian framework, are:
- some other paradigms of ensemble learning (like Bagging), using options,
- functions to manipulate and plot trees, see getTree.randomUniformForest and friends,
- all Breiman's bounds,
- post-processing of votes in order to lower the MSE by reducing bias, see postProcessingVotes,
- changing the majority vote, using the 'classcutoff' option,
- output perturbation sampling, lowering the correlation further by replacing completely (for regression and for each tree) the training vector of responses with an independent (and bootstrapped) random Gaussian one with the same mean but a different variance,
- deep analysis of variable importance and selection, see importance.randomUniformForest and partialImportance,
- partial dependencies, opening the way to extrapolation, see partialDependenceOverResponses and partialDependenceBetweenPredictors,
- visualization tools and tables to help interpretation, see importance.randomUniformForest and other methods and functions,
- a generic function to assess results, see model.stats,
- a generic cross-validation function, see generic.cv,
- missing values imputation, see fillNA2.randomUniformForest,
- treatment of imbalanced classes, using the 'oversampling', 'rebalancedsampling', 'classwt' and 'usesubtrees' options (a sketch follows this list),
- cost-sensitive learning, using the 'classwt' option (which is dual) and friends,
- native handling of categorical variables, using a randomization mechanism at the node level. More precisely, for each candidate node and before the splitting process, the algorithm randomly selects two values. The first one keeps its position while the second temporarily replaces all the other values of the variable. This leads to a binary variable that can be treated like a numerical one. After the splitting, the variable recovers its original values. Since cut-points are almost virtual and random (a cut-point is not a point of the training sample), one has only to take care that the random splitting does not weaken the variable,
- quantile regression, see predict.randomUniformForest,
- new methods for prediction and confidence intervals, see bCI,
- native parallelism, thanks to the parallel, doParallel and foreach packages,
- an internal MapReduce paradigm for large datasets that can fit in memory, see rUniformForest.big,
- incremental learning for large datasets that cannot fit in memory, see rUniformForest.combine,
- distributed learning, allowing many different models to be run on different data (sharing, at least, some features) on many computers, and combined into a single one, in different manners, for prediction; see rUniformForest.combine. Note that one has to carefully manage the i.i.d. assumption in order to see convergence happen,
- unsupervised learning and dimension reduction, see unsupervised.randomUniformForest.
In particular, incremental learning is native, since the model uses random cut-points, and one can remove, duplicate, add or modify/update trees at each step of the incremental process. Note that the model does not allow results to be exactly reproduced using the set.seed() function. One reason is that many (including essential) options run at the tree (or node) level in order to decrease correlation, and many random seeds are used internally.
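
As an illustration of the options for imbalanced classes and cost-sensitive learning (a sketch only; the artificial two-class split and the 'classwt' values are arbitrary assumptions, not recommendations):

library(randomUniformForest)
data(iris)
Y <- factor(ifelse(iris$Species == "setosa", "setosa", "other"))  # artificial imbalance
X <- iris[, 1:4]
## equal sample size for all labels in each tree
ruf.rebalanced <- randomUniformForest(X, Y, ntree = 50,
	rebalancedsampling = TRUE, threads = 1)
## cost-sensitive learning through class weights (arbitrary values)
ruf.weighted <- randomUniformForest(X, Y, ntree = 50,
	classwt = c(1, 2), threads = 1)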
Since convergence is the primal property of Random Forests, for the same, large enough, training sample, even if results slightly vary, one has to consider the OOB estimate and Breiman's upper bound (in classification) as the main guarantees. They are effective only under the i.i.d. assumption. If enough data are available, one can derive an OOB bound (see the vignette), giving conditions under which the test error would remain below the OOB estimate and, consequently, below Breiman's bounds (see also the vignette). Note that speed is currently not state of the art for small datasets, due to the mostly R code and some constant overhead that seems to come from the parallelism. However, some of the critical parts of the algorithm are written in C++, thanks to the Rcpp package. For large datasets the gap is greatly reduced, due to shortcuts added to the R code and increased randomness; this is the case when the dimension gets high, or for regression. A great speed-up can also be achieved with the 'depth' option (for values close to 10), or the 'maxnodes' one, at the cost of some loss in accuracy (see the sketch below).
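
For instance, a sketch of this speed/accuracy trade-off (the values below are only starting points):

library(randomUniformForest)
data(iris)
## constrain the trees to gain speed, at the cost of some accuracy
fast.ruf <- randomUniformForest(Species ~ ., data = iris,
	ntree = 50, depth = 10, threads = 1)
fast.ruf   # compare the OOB estimate with that of a default, unconstrained model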

References

Biau, G., Devroye, L., Lugosi, G., 2008. Consistency of random forests and other averaging classifiers. The Journal of Machine Learning Research 9, 2015-2033.
Breiman, L., 1996. Heuristics of instability and stabilization in model selection. The Annals of Statistics 24(6), 2350-2383.
Breiman, L., 1996. Bagging predictors. Machine Learning 24, 123-140.
Breiman, L., 2001. Random Forests. Machine Learning 45(1), 5-32.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C., 1984. Classification and Regression Trees. New York: Chapman and Hall.
Ciss, S., 2014. PhD thesis: Forets uniformement aleatoires et detection des irregularites aux cotisations sociales. Universite Paris Ouest Nanterre, France. In French. English title: Random Uniform Forests and irregularity detection in social security contributions. Link: https://www.dropbox.com/s/q7hbgeafrdd8qtc/Saip_Ciss_These.pdf?dl=0
Ciss, S., 2014a. Random Uniform Forests. Pre-print.
Ciss, S., 2014b. Variable Importance in Random Uniform Forests. Pre-print.
Ho, T.K., 1998. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 832-844.

See Also

predict.randomUniformForest, rUniformForest.big, rUniformForest.combine, rUniformForest.grow, importance.randomUniformForest, rm.trees, roc.curve, fillNA2.randomUniformForest, getTree.randomUniformForest, unsupervised.randomUniformForest

Examples

## not run
## NOTE : use option 'threads = 1' (disabling parallel processing) to speed up computing 
## for small samples, since parallel processing is useful only for computationally 
## intensive tasks

###### PART ONE : QUICK GUIDE

#### Classification 

# data(iris)
# iris.ruf <- randomUniformForest(Species ~ ., data = iris, threads = 1)

## MODEL, PARAMETERS, STATISTICS:
# iris.ruf ## or print(iris.ruf)

## plot OOB error: 
# plot(iris.ruf, threads = 1)

## print and plot (global) variable importance and some statistics about trees:
# summary(iris.ruf)

#### Regression

## NOTE: when formula is used, missing values are automatically deleted and dummies
## are built for categorical features

# data(airquality)
# ozone.ruf <- randomUniformForest(Ozone ~ ., data = airquality, threads = 1)
# ozone.ruf

## plot OOB error: 
# plot(ozone.ruf, threads = 1)

## BAGGING:
# ozone.bagging.ruf <- randomUniformForest(Ozone ~ ., data = airquality,
# bagging = TRUE, threads = 1)

## Ensemble of totally randomized trees, e.g. PURELY RANDOM FOREST:
# ozone.prf <- randomUniformForest(Ozone ~ ., data = airquality, 
# randomfeature = TRUE, threads = 1)

#### Common case: use X as a matrix or data frame, and Y as a response vector

#### Classification : iris data, training and testing

# data(iris)

## define random train and test sample. "Species" is the response vector:
# iris.train_test <- init_values(iris[,-which(colnames(iris) == "Species")], iris$Species,
# sample.size = 1/2)

## iris train and test samples:
# iris.train = iris.train_test$xtrain
# species.train = iris.train_test$ytrain
# iris.test = iris.train_test$xtest
# species.test = iris.train_test$ytest

## iris train and test modelling:
# iris.train_test.ruf <- randomUniformForest(iris.train, species.train,
# xtest = iris.test, ytest = species.test, threads = 1)

## view model and statistics:
# iris.train_test.ruf

## BALANCED SAMPLING : equal sample size for all labels
# iris.train_test.balancedsampling.ruf <- randomUniformForest(iris.train, species.train,
# xtest = iris.test, ytest = species.test, rebalancedsampling = TRUE, threads = 1)
								
###### PART TWO : SUMMARIZED CASE STUDIES

#### Classification : Wine Quality data
## http://archive.ics.uci.edu/ml/datasets/Wine+Quality
## We use the red wine quality file: the data have 1599 observations, 12 variables and 6 classes.
 
# data(wineQualityRed)
# wineQualityRed.data = wineQualityRed

## class and observations
# Y = wineQualityRed.data[,"quality"]
# X = wineQualityRed.data[, -which(colnames(wineQualityRed.data) == "quality")]

## First look : train model with default parameters (and retrieve estimates)
## call it standard model.
# wineQualityRed.std.ruf <- randomUniformForest(X, as.factor(Y), threads = 2)
# wineQualityRed.std.ruf 

## GLOBAL VARIABLE IMPORTANCE:
# summary(wineQualityRed.std.ruf)

## But some labels do not have enough observations to assess variable importance:
## merge classes 3 and 4, and classes 7 and 8, to get enough data.
# Y[Y == 3] = 4
# Y[Y == 8] = 7

## make Y a factor, change the level names and get a summary
# Y = as.factor(Y)
# levels(Y) = c("3 or 4", "5", "6", "7 or 8")
# table(Y)

## learn a new model to get a better view of variable importance
## NOTE: Y is now a factor, so the model will treat the learning task as classification
# wineQualityRed.new.ruf <- randomUniformForest(X, Y)
# wineQualityRed.new.ruf 

## global variable importance is more consistent
# summary(wineQualityRed.new.ruf)

## plot OOB error (needs some computing)
# plot(wineQualityRed.new.ruf, threads = 2)

## go deeper into assessing variable importance, using a high level of interaction
# importance.wineQualityRed <- importance(wineQualityRed.new.ruf, Xtest = X, maxInteractions = 6)
									
## VISUALIZING IMPORTANCE: global importance, interactions, importance based on interactions, 
## importance based on labels, partial dependencies for all influential variables 
## (loop over the prompt to get the other partial dependencies)
## get more points, using the option whichOrder = "all" (the default).

# plot(importance.wineQualityRed, Xtest = X, whichOrder = "first")

## LINKS BETWEEN OBSERVATIONS AND VARIABLES:
# featuresAndObs = as.data.frame(importance.wineQualityRed$localVariableImportance$obs)
# frequencyFeaturesIdx = grep("Frequency", colnames(featuresAndObs))
# featuresNames = apply(featuresAndObs[,-c(1,frequencyFeaturesIdx)], 2, 
# function(Z) colnames(X)[Z])
# featuresAndObs[,-c(1,frequencyFeaturesIdx)] = featuresNames

# head(featuresAndObs)

## PARTIAL IMPORTANCE: look at some specific labels from a (very) local viewpoint
## which features matter for a very good wine (class 7 or 8)?
# pImportance.wineQualityRed.class7or8 <- partialImportance(X, importance.wineQualityRed, 
# whichClass = "7 or 8", nLocalFeatures = 6)
											
## PARTIAL DEPENDENCIES: how does the response depend on the variables?
## get it feature after feature, recalling partial dependence 
## and considering the feature at first order, assuming it is the most important one, 
## at least for the class one needs to assess.

# pDependence.wineQualityRed.totalSulfurDioxide <- partialDependenceOverResponses(X, 
# importance.wineQualityRed, whichFeature = "total.sulfur.dioxide", 
# whichOrder = "first", outliersFilter = TRUE)
											
## see what happens then for "alcohol" (ask for more points using the option 'whichOrder = "all"')
# pDependence.wineQualityRed.alcohol <- partialDependenceOverResponses(X, 
# importance.wineQualityRed, whichFeature = "alcohol",  
# whichOrder = "first", outliersFilter = TRUE)

#### Regression : Auto MPG 
## http://archive.ics.uci.edu/ml/datasets/Auto+MPG
## 398 observations, 8 variables,
## Variable to predict : "mpg", miles per gallon 

# data(autoMPG)
# autoMPG.data = autoMPG

# Y = autoMPG.data[,"mpg"]
# X = autoMPG.data[,-which(colnames(autoMPG.data) == "mpg")]

## remove "car name" which is a variable with unique ID (car models)
# X = X[, -which(colnames(X) == "car name")]

## train the default model and get OOB evaluation
# autoMPG.ruf <- randomUniformForest(X, Y)

## assess variable importance (asking for more points with the 'maxInteractions' option)
## NOTE: importance strongly depends on the 'ntree' and 'mtry' parameters
# importance.autoMPG <- importance(autoMPG.ruf, Xtest = X)
# plot(importance.autoMPG, Xtest = X)

## opening the way for EXTRAPOLATION (recalling partial dependencies and getting points)
## NOTE : the points are outputs of the forest, not the training responses
# pDependence.autoMPG.weight <- partialDependenceOverResponses(X, importance.autoMPG,
# whichFeature = "weight", whichOrder = "all", outliersFilter = TRUE)

## visualize 'model year' again, as a discrete variable and not as a continuous one 
# pDependence.autoMPG.modelyear <- partialDependenceOverResponses(X, importance.autoMPG,
# whichFeature = "model year", whichOrder = "all", maxClasses = 30)

## what are the features that lead to a lower consumption (and high mpg)?
# pImportance.autoMPG.high <- partialImportance(X, importance.autoMPG, 
# threshold = mean(Y), thresholdDirection = "high", nLocalFeatures = 6)
											
## PARTIAL DEPENDENCIES BETWEEN COVARIATES : look at the dependence between "weight" and "acceleration",
## and get information about their interactions, relative to all the other variables
# pDependence.autoMPG.weightAndAcceleration <- 
# partialDependenceBetweenPredictors(X, importance.autoMPG, c("weight", "acceleration"),
# whichOrder = "all", perspective = FALSE, outliersFilter = TRUE)

## Visualize in 3D (follow the prompt to start the animation)
## NOTE : requires some computation
# pDependence.autoMPG.weightAndAcceleration <- 
# partialDependenceBetweenPredictors(X, importance.autoMPG, c("weight", "acceleration"),
# whichOrder = "all", perspective = TRUE, outliersFilter = FALSE)

