rUniformForest.big: Random Uniform Forests for Classification and Regression with large data sets

Description

Implements random uniform forests for data sets that are too large to fit in physical memory but small enough to fit in virtual memory. The data set is cut, randomly or not, into many subsamples and each one is processed separately, yielding one base model per subsample, each base model being itself an ensemble. At the end, all base models are combined into a single ensemble-of-ensembles model.

If the data cannot reside in physical memory but can reside in virtual memory (physical memory + swap file), consider the R packages 'bigmemory', 'data.table' or 'ff' to load them. To save memory (and computing time), working on subsamples of the data (e.g. in a Hadoop-like environment) is well suited: compute one forest per subsample, then combine all trees from all forests using randomUniformForest.combine().

Note that rUniformForest.big() is primarily designed to process large files on a small computer, at the expense of accuracy. However, when the distribution shifts, the model may be more robust than the standard one (at least for regression).
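The divide-and-combine pattern described above can be sketched in base R. This is only an illustration under stated assumptions: lm() stands in for the forest fitted on each subsample, and averaging predictions stands in for combining trees (for regression forests with the same number of trees, pooling all trees gives the same prediction as averaging the forests). The names dat, chunkId, models and combined are hypothetical and are not part of the package.

```r
# Minimal sketch: fit one stand-in model per subsample, then combine them.
set.seed(1)
n <- 200
nforest <- 4

# Synthetic regression data (hypothetical, for illustration only)
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 * dat$x1 - dat$x2 + rnorm(n, sd = 0.1)

# Assign each row to one of 'nforest' subsamples
chunkId <- sort(rep(seq_len(nforest), length.out = n))

# One base model per subsample; lm() is a stand-in for a forest
models <- lapply(seq_len(nforest), function(k) {
  lm(y ~ x1 + x2, data = dat[chunkId == k, ])
})

# Combine: average the base models' predictions
preds <- sapply(models, predict, newdata = dat)  # n x nforest matrix
combined <- rowMeans(preds)

mean((combined - dat$y)^2)  # training mean squared error of the combination
```

Each base model only ever sees n / nforest rows, which is what keeps the memory footprint of a single fit small.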

Usage

rUniformForest.big(X, Y = NULL,
    xtest = NULL,
    ytest = NULL,
    nforest = 2,
    randomCut = FALSE,
    reduceDimension = FALSE,
    reduceAll = FALSE,
    replacement = FALSE,
    subsample = FALSE,
    ntree = 100,
    nodesize = 1,
    maxnodes = Inf,
    mtry = ifelse(bagging, ncol(X), floor(4/3*ncol(X))),
    regression = ifelse(is.factor(Y), FALSE, TRUE),
    subsamplerate = ifelse(regression, 0.7, 1),
    replace = ifelse(regression, FALSE, TRUE),
    OOB = TRUE,
    BreimanBounds = ifelse(OOB, TRUE, FALSE),
    depth = Inf,
    depthcontrol = NULL,
    importance = TRUE,
    bagging = FALSE,
    unsupervised = FALSE,
    proximities = FALSE,
    classwt = NULL,
    oversampling = 0,
    targetclass = -1,
    outputperturbationsampling = FALSE,
    rebalancedsampling = FALSE,
    featureselectionrule = c("entropy", "gini", "random", "L1", "L2"),
    randomcombination = 0,
    randomfeature = FALSE,
    categoricalvariablesidx = NULL,
    na.action = c("fastImpute", "accurateImpute", "omit"),
    logX = FALSE,
    classcutoff = c(0,0),
    threads = "auto",
    parallelpackage = "doParallel")

Arguments

X

a large data frame, or matrix, of predictors describing the model to be fitted.

Y

a response vector. If it is a factor, classification is assumed; otherwise regression is computed.

xtest

an optional data frame or matrix (like X) containing predictors for the test set.

ytest

optional responses for the test set.

nforest

number of forests to compute. The size of each subsample will be 'number of observations in the original data set / nforest'. If 'nforest' is too high, accuracy will be lost. If it is too low, computing time will be long.

randomCut

should the original data be cut randomly?
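As a rough illustration of how 'nforest' and 'randomCut' interact, the base R sketch below cuts the row indices of a data set into 'nforest' subsamples of about n / nforest rows each. How the package performs the cut internally is not shown in this page, so the code is an assumption about the general idea, and the names idx, chunkId and chunks are hypothetical.

```r
# Minimal sketch: cut n row indices into 'nforest' subsamples.
n <- 1000
nforest <- 4
randomCut <- TRUE

set.seed(42)
# Shuffle the row order first if the cut should be random
idx <- if (randomCut) sample(n) else seq_len(n)

# Contiguous blocks of (about) n / nforest positions
chunkId <- sort(rep(seq_len(nforest), length.out = n))
chunks <- split(idx, chunkId)

lengths(chunks)  # number of rows assigned to each subsample
```

With randomCut = FALSE the subsamples follow the original row order, which matters if the data are sorted (for example by date or by class).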

reduceDimension

should the dimension of the original data (or of the subsamples, if nforest > 1) be reduced? Useful for speed, but can dramatically reduce accuracy.

reduceAll

should subsamples have lower dimension? Recommended only as a way to quickly get a baseline result.

replacement

should the data be sampled with replacement? E.g. sample n observations among n with replacement, then divide them into 'nforest' subsamples.

subsample

value of the subsample rate (m/n), e.g. sample m observations (m < n) among n, then divide them into 'nforest' subsamples. Note that 'subsample' can be combined with the 'replacement' option.
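The combination of 'subsample' and 'replacement' can be sketched in base R as follows: draw m = subsample * n row indices, with or without replacement, then divide them into 'nforest' subsamples. This mirrors the description above but is not taken from the package source; the names m, drawn and chunks are hypothetical.

```r
# Minimal sketch: subsample m of n rows, then divide into 'nforest' parts.
n <- 1000
nforest <- 4
subsample <- 0.7     # subsample rate m/n
replacement <- TRUE  # draw with replacement

set.seed(7)
m <- round(subsample * n)
drawn <- sample(n, size = m, replace = replacement)

# Divide the drawn indices into 'nforest' subsamples
chunkId <- sort(rep(seq_len(nforest), length.out = m))
chunks <- split(drawn, chunkId)

lengths(chunks)        # rows per subsample
anyDuplicated(drawn) > 0  # TRUE if some row was drawn more than once
```

With replacement = TRUE the same observation may appear in several subsamples, or several times in one.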

ntree

see randomUniformForest.

mtry

see randomUniformForest.

nodesize

see randomUniformForest.

maxnodes

see randomUniformForest.

depth

see randomUniformForest.

depthcontrol

see randomUniformForest.

regression

see randomUniformForest.

replace

see randomUniformForest.

OOB

see randomUniformForest.

BreimanBounds

see randomUniformForest.

subsamplerate

see randomUniformForest.

importance

see randomUniformForest.

bagging

see randomUniformForest.

unsupervised

see randomUniformForest.

proximities

see randomUniformForest.

classwt

see randomUniformForest.

oversampling

see randomUniformForest.

targetclass

see randomUniformForest.

outputperturbationsampling

see randomUniformForest.

rebalancedsampling

see randomUniformForest.

featureselectionrule

see randomUniformForest.

randomcombination

see randomUniformForest.

randomfeature

see randomUniformForest.

categoricalvariablesidx

see randomUniformForest.

na.action

see randomUniformForest.

logX

see randomUniformForest.

classcutoff

see randomUniformForest.

threads

see randomUniformForest.

parallelpackage

see randomUniformForest.

Value

an object of class randomUniformForest.

Examples

## not run
## Classification: synthetic data
# n = 100; p = 10  ## small 'n' is used here for ease of use
## Simulate 'p' Gaussian vectors with random parameters between -10 and 10.
# X <- simulationData(n, p)

## Make a rule to create response vector
# epsilon1 = runif(n,-1,1)
# epsilon2 = runif(n,-1,1)
# rule = 2*(X[,1]*X[,2] + X[,3]*X[,4]) + epsilon1*X[,5] + epsilon2*X[,6]

# Y <- as.factor(ifelse(rule > mean(rule), 1, 0))

# big.ruf <- timer(rUniformForest.big(X, Y, nforest = 2, 
# threads = 1, BreimanBounds = FALSE, replacement = TRUE, importance = FALSE))

## elapsed time
# big.ruf$time 

## OOB accuracy
# big.ruf$object

## standard model
# std.ruf <- timer(randomUniformForest(X, Y, threads = 1, ntree = 20, BreimanBounds = FALSE))

## elapsed time. Note that for small 'n' the standard model will be faster.
# std.ruf$time

## OOB accuracy
# std.ruf$object

## not run
##  regression
# Y = rule
# big.ruf <- timer(rUniformForest.big(X, Y, nforest = 2, 
# threads = 2, BreimanBounds = FALSE, subsample = 0.7))
# big.ruf  

## classic random uniform forest 
# std.ruf <- timer(randomUniformForest(X, Y, threads = 2, BreimanBounds = FALSE))
# std.ruf  # accuracy gap is much larger in case of regression

## but one can consider a new case, e.g. a shifting distribution, to see how it works
# newX <- simulationData(n, p)
# epsilon1 = runif(n,-1,1)
# epsilon2 = runif(n,-1,1)
## the new responses are built from 'newX', not from the training predictors 'X'
# newRule = 2*(newX[,1]*newX[,2] + newX[,3]*newX[,4]) + epsilon1*newX[,5] + epsilon2*newX[,6]
# newY = newRule

## predict using standard model
# pred.std.ruf <- predict(std.ruf$object, newX)

## get mean squared error
# sum( (pred.std.ruf - newY)^2 )/length(newY)

## predict using rUniformForest.big
# pred.big.ruf <- predict(big.ruf$object, newX)

## get mean squared error: the two errors will be closer and, for large 'n' (and more trees),
## rUniformForest.big might have the lower error
# sum( (pred.big.ruf - newY)^2 )/length(newY)
