drandomForest runs the randomForest function of the randomForest package in a distributed fashion, with parallelism at the sub-forest level. It calls several instances of randomForest distributed across a cluster in order to create sub-forests concurrently. To do so, the master distributes the input data among all R-executors of the distributedR environment, and the trees in different sub-sections of the forest are created simultaneously. At the end, all these trees are combined into a single forest. The interface of drandomForest is similar to that of randomForest. It adds two arguments, nExecutor and trace, and removes several others: subset, do.trace, corr.bias, keep.inbag, and oob.prox. Note that the default values of some arguments have also been changed to make the algorithm more scalable for big data problems; e.g., proximity is FALSE by default. The returned result is fully compatible with the result of randomForest.
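The "grow sub-forests, then combine" idea can be sketched on a single machine with the plain randomForest package. This is only an illustration under stated assumptions, not the drandomForest implementation: `n_workers` is a hypothetical stand-in for nExecutor, the sub-forests are grown sequentially with `lapply` rather than on R-executors, and every sub-forest sees the full data set.

```r
library(randomForest)

set.seed(42)
n_workers   <- 4    # hypothetical stand-in for nExecutor
ntree_total <- 500  # same default as ntree in the usage

# Split the tree budget as evenly as possible across the workers.
trees_per_worker <- rep(ntree_total %/% n_workers, n_workers)
extra <- ntree_total %% n_workers
trees_per_worker[seq_len(extra)] <- trees_per_worker[seq_len(extra)] + 1

# Grow one sub-forest per worker (sequentially here, for illustration only).
sub_forests <- lapply(trees_per_worker, function(k) {
  randomForest(x = iris[, -5], y = iris$Species, ntree = k)
})

# Merge the sub-forests into a single forest.
forest <- do.call(randomForest::combine, sub_forests)
forest$ntree
```

Per the randomForest documentation, the object returned by `combine` does not carry the `confusion`, `err.rate`, `mse`, and `rsq` components, so those cannot be read from `forest` in this sketch.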
## S3 method for class 'formula':
drandomForest(formula, data=NULL, ..., ntree=500,
na.action=na.fail, nExecutor, trace=FALSE,
completeModel=FALSE, setSeed)
## S3 method for class 'default':
drandomForest(x, y=NULL, xtest=NULL, ytest=NULL, ntree=500,
mtry=if (!is.null(y) && !is.factor(y) && !is.dframe(y))
max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
replace=TRUE, classwt=NULL, cutoff, strata,
sampsize = if (replace) nrow(x) else ceiling(.632*nrow(x)),
nodesize = if (!is.null(y) && !is.factor(y) &&
!is.dframe(y)) 5 else 1,
maxnodes=NULL, importance=FALSE, localImp=FALSE, nPerm=1,
proximity=FALSE,norm.votes=TRUE, keep.forest=TRUE,
nExecutor, trace=FALSE, completeModel=FALSE, ...,
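setSeed, formula, na.action = na.fail)

y: a response vector. If omitted, hpdRF_parallelForest will run in unsupervised mode. When x is a darray, y should also be a darray.
xtest: a data frame or matrix (like x) containing predictors for the test set. When x is a darray, it should be of the same type.
ytest: response for the test set, analogous to y. Moreover, it should have a single column.
mtry: number of variables randomly sampled as candidates at each split. The defaults differ for classification (sqrt(p), where p is the number of variables in x) and regression (p/3).
maxnodes: maximum number of terminal nodes that trees in the forest can have. If not given, trees are grown to the maximum possible (subject to limits by nodesize). If set larger than maximum possible, a warning is issued.
localImp: should casewise importance measures be computed? (Setting this to TRUE will override importance.)
norm.votes: if TRUE (default), the final votes are expressed as fractions. If FALSE, raw vote counts are returned (useful for combining results from different runs). Ignored for regression.
keep.forest: if set to FALSE, the forest will not be retained in the output object.

An object of class randomForest. The result is similar to the result of the combine function in the randomForest package and will contain the following components.

call: the original call to drandomForest.
type: one of regression, classification, or unsupervised.
importance: a matrix with nclass + 2 (for classification) or two (for regression) columns. For classification, the first nclass columns are the class-specific measures computed as mean decrease in accuracy. Column nclass + 1 is the mean decrease in accuracy over all classes. The last column is the mean decrease in Gini index. For regression, the first column is the mean decrease in accuracy and the second the mean decrease in MSE. If importance=FALSE, the last measure is still returned as a vector.
importanceSD: the standard errors of the permutation-based importance measure. For classification, a p by nclass + 1 matrix corresponding to the first nclass + 1 columns of the importance matrix. For regression, a length p vector.
localImp: a matrix containing the casewise importance measures. NULL if localImp=FALSE.
forest: a list that contains the entire forest. NULL if hpdRF_parallelForest is run in unsupervised mode or if keep.forest=FALSE.
proximity: if proximity=TRUE, a matrix of proximity measures among the input.
mse: (regression only) mean squared errors: sum of squared residuals divided by n.
rsq: (regression only) pseudo R-squared: 1 - mse / Var(y).
test: if a test set is given (through the xtest or additionally ytest arguments), this component is a list which contains the corresponding predicted, err.rate, confusion, votes (for classification) or predicted, mse and rsq (for regression) for the test set.

Random Forests V4.6-10,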
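The remark that raw vote counts are "useful for combining results from different runs" can be illustrated with base R alone. The sketch below uses made-up vote matrices (`votes_run1` and `votes_run2` are hypothetical, not output of drandomForest) to show why counts are added first and converted to fractions afterwards.

```r
# Hypothetical raw vote counts from two runs with norm.votes=FALSE:
# rows are observations, columns are classes.
votes_run1 <- matrix(c(30, 20,
                       10, 40),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(NULL, c("a", "b")))
votes_run2 <- matrix(c(45,  5,
                       25, 25),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(NULL, c("a", "b")))

# Raw counts from different runs can simply be added ...
combined <- votes_run1 + votes_run2

# ... and normalized once at the end, so every tree carries equal weight.
fractions <- combined / rowSums(combined)
fractions
```

Averaging already-normalized fractions would give the same answer only when both runs grew the same number of trees, which is why keeping raw counts is the safer choice when merging.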
library(ddR.randomForest)
## Classification:
##data(iris)
iris.rf <- drandomForest(Species ~ ., data=iris, importance=TRUE)
print(iris.rf)
## The 'unsupervised' case:
iris.urf <- drandomForest(iris[, -5],
proximity=TRUE, completeModel=TRUE)
MDSplot(iris.urf, iris$Species)
## stratified sampling: draw 20, 30, and 20 of the species to grow each tree.
(iris.rf2 <- drandomForest(iris[1:4], iris$Species,
sampsize=c(20, 30, 20)))
## Regression:
## data(airquality)
ozone.rf <- drandomForest(Ozone ~ ., data=airquality, mtry=3,
importance=TRUE, na.action=na.omit,
completeModel=TRUE)
print(ozone.rf)
## Show "importance" of variables: higher values mean more important:
round(importance(ozone.rf), 2)
## "x" can be a matrix instead of a data frame:
x <- matrix(runif(5e2), 100)
y <- gl(2, 50)
(myrf <- drandomForest(x, y))
(predict(myrf, x))
## "complicated" formula:
(swiss.rf <- drandomForest(sqrt(Fertility)~. - Catholic + I(Catholic<50),
data=swiss))
(predict(swiss.rf, swiss))
## Test use of 32-level factor as a predictor:
x <- data.frame(x1=gl(32, 5), x2=runif(160), y=rnorm(160))
(rf1 <- drandomForest(x[-3], x[[3]], ntree=10))
## Grow no more than 4 nodes per tree:
(treesize(drandomForest(Species ~ ., data=iris, maxnodes=4, ntree=30)))
distributedR_shutdown()