
MultivariateRandomForest (version 1.1)

MultivariateRandomForest: Prediction using Random Forest or Multivariate Random Forest

Description

Builds a Random Forest model (if the number of output features is 1) or a Multivariate Random Forest model (if the number of output features is greater than 1) using the training samples, and predicts the responses of the testing samples using this model.

Usage

MultivariateRandomForest(trainX, trainY, n_tree, mtree, min_leaf, testX)

Arguments

trainX
Input matrix of size M x N, where M is the number of training samples and N is the number of features
trainY
Output response matrix of size M x T, where M is the number of training samples and T is the number of output features (responses)
n_tree
Number of trees in the forest
mtree
Number of randomly selected features used for each split
min_leaf
Minimum number of samples in a leaf node. If a node has fewer than or equal to min_leaf samples, it is not split further and becomes a leaf node.
testX
Testing samples of size Q x N, where Q is the number of testing samples and N is the number of features (same order and size as used in training)

Value

Prediction results for the testing samples, a matrix of size Q x T

Details

Random Forest (RF) regression refers to an ensemble of regression trees in which a set of n_tree un-pruned regression trees is generated from bootstrap samples of the original training data. For each node, the optimal feature for node splitting is selected from a random subset of m features (mtree) out of the total N features. Selecting the splitting feature from a random subset decreases the correlation between trees, so the average prediction of multiple regression trees is expected to have lower variance than that of an individual tree. A larger m can improve the predictive capability of individual trees, but it also increases the correlation between trees and can negate the gains from averaging multiple predictions. The bootstrap resampling of the data used to train each tree further increases the variation between trees.
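The two randomization steps described above (bootstrap resampling of rows, and a random feature subset per split) can be sketched as follows; this is an illustration of the idea, not the package's internal code.

```r
# Illustrative sketch (not the package's internal code) of the two
# randomization steps described above, for one tree of the forest.
set.seed(7)
M <- 50    # number of training samples (rows)
N <- 100   # number of features (columns)
m <- 10    # features considered at each split (mtree)

# Bootstrap step: each tree is trained on M rows drawn with replacement,
# so typically some rows repeat and others are left out.
boot_idx <- sample(M, M, replace = TRUE)

# Split step: at each node, only a random subset of m of the N features
# is considered as splitting candidates.
split_candidates <- sample(N, m)

length(boot_idx)          # M indices, with repeats
length(split_candidates)  # m candidate features
```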

In a node with training predictor features (X) and output feature vectors (Y), node splitting selects a feature from a random subset of m features and a threshold z that partitions the node into two child nodes: a left node (samples with feature value < z) and a right node (samples with feature value >= z). In multivariate trees (MRF) the node cost is measured as the sum of squared Mahalanobis distances of the node's output vectors from their mean, whereas in univariate trees (RF) it is the sum of squared Euclidean deviations from the node mean.
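The two node costs above can be sketched in a few lines of R. The function names here are hypothetical, not part of the package's API, and the inverse covariance matrix is passed in as an argument (following Segal and Xiao, the covariance is typically estimated once, e.g. at a parent node, rather than re-estimated inside every candidate child).

```r
# Hypothetical sketch of the node costs described above (not the package API).
node_cost_mrf <- function(Y, S_inv) {
  # Sum of squared Mahalanobis distances of the node's output vectors
  # from their mean, given an inverse covariance matrix S_inv.
  d <- sweep(Y, 2, colMeans(Y))      # deviations from the node mean
  sum(rowSums((d %*% S_inv) * d))
}

# Univariate (RF) node cost: sum of squared deviations from the node mean.
node_cost_rf <- function(y) sum((y - mean(y))^2)

set.seed(1)
Y <- matrix(rnorm(20 * 3), 20, 3)    # 20 samples, 3 output features
S_inv <- solve(cov(Y))               # inverse covariance (e.g. from parent node)
node_cost_mrf(Y, S_inv)
node_cost_rf(Y[, 1])
```

With S_inv set to the identity matrix, the multivariate cost reduces to the plain sum of squared deviations over all output features, which makes the relation between the two costs easy to check.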

After the Model of the forest is built using training predictor features(trainX) and output feature vectors(trainY), the Model is used to do the prediction of features of the testing samples(testX).

Multivariate Random Forest (MRF) predicts all output features with a single model, which is built using the training output features. When the output features are highly correlated, prediction using MRF can therefore be considerably better than prediction using a separate RF per output feature.
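To see the setting the paragraph above describes, one can construct training outputs whose columns share a common signal; this is an illustrative data-generation sketch, not part of the package.

```r
# Illustrative (not part of the package): build an output matrix whose
# columns are strongly correlated, the setting where MRF is expected to
# outperform fitting an independent RF per output feature.
set.seed(42)
n <- 50
base <- runif(n)                               # shared latent signal
trainY <- cbind(base + rnorm(n, sd = 0.1),     # each column is a noisy
                2 * base + rnorm(n, sd = 0.1), # transformation of the
                -base + rnorm(n, sd = 0.1))    # same signal

round(cor(trainY), 2)   # off-diagonal correlations are large in magnitude
```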

References

[1] Breiman, Leo. "Random forests." Machine learning 45.1 (2001): 5-32.

[2] Segal, Mark, and Yuanyuan Xiao. "Multivariate random forests." Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1.1 (2011): 80-87.

Examples

trainX <- matrix(runif(50 * 100), 50, 100)  # 50 training samples, 100 features
trainY <- matrix(runif(50 * 5), 50, 5)      # 5 output features per sample
n_tree <- 5                                 # number of trees in the forest
mtree <- 10                                 # features tried at each split
min_leaf <- 5                               # minimum samples in a leaf node
testX <- matrix(runif(10 * 100), 10, 100)   # 10 testing samples, 100 features
# Prediction size is 10 x 5, where 10 is the number
# of testing samples and 5 is the number of output features
Prediction <- MultivariateRandomForest(trainX, trainY, n_tree, mtree, min_leaf, testX)
