randomUniformForest (version 1.1.0)

unsupervised.randomUniformForest: Unsupervised Learning with Random Uniform Forests

Description

Unsupervised mode of Random Uniform Forests, providing clustering, dimension reduction, visualization and variable importance through a three-layer engine: dissimilarity matrix, multidimensional scaling (MDS) and k-means or hierarchical clustering. The unsupervised mode does not require the number of clusters to be known, thanks to the gap statistic, and it inherits the main algorithmic properties of the supervised mode, hence allowing (almost) any type of variable.

Usage

## S3 method for class 'randomUniformForest':
unsupervised(object,
	baseModel = c("proximity", "proximityThenDistance",  "importanceThenDistance"),
	endModel = c("MDSkMeans", "MDShClust", "MDS"),
	endModelMetric = NULL,
	samplingMethod = c("uniform univariate sampling", 
	"uniform multivariate sampling", "with bootstrap"),
	MDSmetric = c("metricMDS", "nonMetricMDS"),
	computeFullMatrix = TRUE,
	proximityMatrix = NULL,
	outliersFilter = FALSE, 
	Xtest = NULL, 
	predObject = NULL, 
	nBiggestProximities = NULL,
	metricDimension = 2, 
	coordinates = c(1,2),
	bootstrapReplicates = 100,
	clusters = NULL,
	maxIters = NULL,
	importanceObject = NULL,
	maxInteractions = 2,
	reduceClusters = FALSE, 
	maxClusters = 5,
	mapAndReduce = FALSE,
	OOB = FALSE, 
	uthreads = "auto",
	...)	

## S3 method for class 'unsupervised':
print(x, ...)

## S3 method for class 'unsupervised':
plot(x, importanceObject = NULL, xlim = NULL, ylim = NULL, ...)

Arguments

Value

An object of class unsupervised, which is a list with the following components:

  • proximityMatrix: the resulting dissimilarity matrix.
  • MDSModel: the resulting multidimensional scaling model.
  • unsupervisedModel: the resulting unsupervised model with clustered observations.
  • gapStatistics: if the K-means algorithm has been called, the results of the gap statistic; otherwise NULL.
  • rUFObject: the Random Uniform Forests object.
  • nbClusters: the number of clusters found.
  • params: the options of the model.
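
Components are accessed as in any R list. A minimal sketch, assuming the returned object has been stored in a hypothetical variable named 'model':

model$nbClusters     # number of clusters found
model$gapStatistics  # gap statistic output, or NULL if hierarchical clustering was used
model$MDSModel       # the multidimensional scaling model
model$params         # options used by the model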

Details

The unsupervised mode of Random Uniform Forests is designed to provide dimension reduction, clustering and a full analysis of features and observations. The process uses a three-layer engine built around a randomUniformForest object and can be summarized by the following chain: randomUniformForest object --> dissimilarity matrix --> multidimensional scaling --> clustering algorithm --> clusters, which can then be computed into an object of class importance.randomUniformForest. The latter is, optionally, used to analyse the features, their links with the clusters and the links between observations, features and clusters.

The first step involves Breiman's ideas. Since Random Uniform Forests inherit all the properties of Breiman's Random Forests, they can implement the key concepts he provided for the unsupervised case:
- create a synthetic data set by scrambling each column of the original data set independently (for example uniformly),
- merge the synthetic and the original data set, giving each one a label,
- run Random (Uniform) Forests on this new data set and retrieve the OOB errors,
- the lower the errors, the easier the clustering will be; if the error is too high, say close to 50 percent, the original data are barely distinguishable from the synthetic ones and little structure can be exploited,
- once the forest classifier is built, one can move on to the second step.

The second step uses the proximities matrix. If the data form an 'n x p' matrix (or data frame), the proximities matrix has size 'n x n': for each observation and for each tree, one searches which other observations fall in the same terminal node, then increases the proximity of the two observations by one. At the end, all observations and all trees have been processed, leading to a proximities matrix that is normalized by the number of trees. Since it can be very large, it is recommended to use an 'n x B' proximities matrix instead, where 'B' is the number of trees. A further step is to use an 'n x q' (biggest) proximities matrix, where 'q' can be as small as 2, in order to compress the data as much as possible.

Once the proximities matrix (or the dissimilarity matrix, using for example '1 - proximities') has been computed, the third step is to enter the MDS process. MDS is needed for dimension reduction (if it has not already happened) and, mainly, to generate decorrelated components in which the points will reside. The first two components are usually enough to get a good visualization. Unlike PCA, they are used only to achieve the best possible separation between points. Note that proximities may also be transformed into distances, since we found that this sometimes produces better results.

The next step concludes the three-layer engine by calling a clustering algorithm, preferably K-means, to partition the MDS points. This may seem strange, but one has to remember that the features are never used again after the early phase of the algorithm: we manipulate coordinates and points in a new space where MDS is the rule that matters. The K-means algorithm is simply a way to obtain a measurable partition of clusters that already exist, since the clustering is almost done earlier. Note that the number of clusters is automatically adjusted by the gap statistic. If that is not enough, the cluster structure can be modified instantly, letting the silhouette coefficient have the last word. The unsupervised learning is then partially achieved.
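
As an illustration, the first step can be sketched with a few lines of base R; the uniform column scrambling below is only a stand-in for the sampling schemes the engine handles internally (see the 'samplingMethod' argument):

## sketch of the synthetic data set used in the first step
data(iris)
X <- iris[, -5]

## scramble each column independently to break the dependence structure
Xsynthetic <- as.data.frame(lapply(X, sample))

## stack the original and synthetic data and label their origin
XX <- rbind(X, Xsynthetic)
Y <- factor(c(rep("original", nrow(X)), rep("synthetic", nrow(Xsynthetic))))

## a classifier trained on (XX, Y) with a low OOB error means the original data
## carry structure that the next layers can exploit

The second and third layers can likewise be pictured with standard tools from the stats package; in the sketch below a plain Euclidean distance stands in for the forest-based dissimilarity matrix that the engine actually provides:

## sketch of the 'dissimilarity matrix --> MDS --> k-means' chain
D <- dist(scale(iris[, -5]))           # stand-in dissimilarity matrix
mdsPoints <- cmdscale(D, k = 2)        # metric MDS, keeping two coordinates
km <- kmeans(mdsPoints, centers = 3)   # partition the MDS points
plot(mdsPoints, col = km$cluster, pch = 19,
     xlab = "Coordinate 1", ylab = "Coordinate 2")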
An object containing all the required tools is generated and can be passed back to the randomUniformForest algorithm in order to get a deeper analysis and visualization tools:
- what are the important features, e.g. the most discriminant ones?
- what are their interactions, e.g. how do the features relate to the whole clustering scheme?
- partial dependencies,
- links between observations and features,
- ...

The unsupervised model is turned into a supervised one using the as.supervised function, with all the options one needs for the supervised case, and the analysis is then carried out by the importance.randomUniformForest function. This last step is essential if one wants access to the full details. Moreover, the data are now turned into a training sample for the algorithm: if new data become available, one may use either the supervised case or the unsupervised one. The former has the great advantage of a complexity in O(B*n*log(n)). Indeed, due to the amount of computation involved, the unsupervised case requires much more time than a simple call to the K-means algorithm, but it also provides many more details.
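
In code, the follow-up chain described above reduces to a few calls (a sketch only; 'model' and 'X' are hypothetical names standing for a fitted unsupervised object and its data, and the full versions with all options spelled out are given in the Examples):

supervisedModel <- as.supervised(model, X)      # turn the clustering into a supervised problem
imp <- importance(supervisedModel, Xtest = X)   # full analysis of features and clusters
plot(imp, Xtest = X)                            # importance, interactions, partial dependencies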

References

Abreu, N., 2011. Analise do perfil do cliente Recheio e desenvolvimento de um sistema promocional. Mestrado em Marketing, ISCTE-IUL, Lisbon.

Breiman, L., Cutler, A. Random Forests website: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

Cox, T. F., Cox, M. A. A., 2001. Multidimensional Scaling. Second edition. Chapman and Hall.

Gower, J. C., 1966. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53, 325-328.

Kaufman, L., Rousseeuw, P. J., 1990. Finding Groups in Data: An Introduction to Cluster Analysis (1st ed.). New York: John Wiley.

Lloyd, S. P., 1957, 1982. Least squares quantization in PCM. Technical Note, Bell Laboratories. Published in 1982 in IEEE Transactions on Information Theory 28, 128-137.

Murtagh, F., Legendre, P., 2013. Ward's hierarchical agglomerative clustering method: which algorithms implement Ward's criterion? Journal of Classification (in press).

Tibshirani, R., Walther, G., Hastie, T., 2001. Estimating the number of data clusters via the Gap statistic. Journal of the Royal Statistical Society B, 63, 411-423.

See Also

modifyClusters, mergeClusters, clusteringObservations, as.supervised

Examples

## not run
## 1 - the famous iris dataset

## load data
# data(iris)

## run unsupervised modelling, removing labels
## NOTE : due to the stochastic nature of the algorithm, stabilization is an important step :
## increasing the number of trees (ntree) and the minimal node size (nodesize) seems mandatory
# iris.rufUnsupervised = unsupervised.randomUniformForest(iris[,-5], 
# ntree = 200, nodesize = 10, threads = 1)

## view a summary
# iris.rufUnsupervised

## one may assess the gap statistic by calling the modifyClusters() function, increasing
## or decreasing the number of clusters and looking at the variations of the silhouette coefficient.
## For example, if 4 clusters are found (since we know there are 3) :
#  iris.rufUnsupervised2 = modifyClusters(iris.rufUnsupervised, decreaseBy = 1)

## plot clusters 
# plot(iris.rufUnsupervised)
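
## since the iris labels are known, one may also cross-check the clustering :
## retrieve each observation's cluster with mergeOutliers(), as in the second
## example below, and cross-tabulate it against the true species
# irisClusters = mergeOutliers(iris.rufUnsupervised)
# table(irisClusters, iris$Species)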

## 2 - Full example with details
## Wholesale customers data (UCI machine learning repository)

# URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/00292/"
# datasetName = "Wholesale%20customers%20data.csv"
# wholesaleCustomers = read.csv(paste(URL, datasetName, sep =""))

## modelling, letting the algorithm deal with all problems :
## categorical features, number of clusters, dimension reduction, visualization,
## variable importance, links between features and observations,...

# wholesaleCustomers.rufUnsupervised = unsupervised.randomUniformForest(wholesaleCustomers, 
# nodesize = 10, bagging = TRUE, ntree = 200, categoricalvariablesidx = "all")

## assess the quality of the clustering :
## (and possibly change model parameters, e.g. 'baseModel', running the model again
## to get a better clustering, looking at the average silhouette or the inter-class variance)

# wholesaleCustomers.rufUnsupervised

## visualization : only clusters in the first step
# plot(wholesaleCustomers.rufUnsupervised)

## but we may need more :
## get the details, first turning the model into a supervised one

# wholesaleCustomers.rufSupervised = as.supervised(wholesaleCustomers.rufUnsupervised, 
# wholesaleCustomers, bagging = TRUE, ntree = 200, 
# nodesize = 10, categoricalvariablesidx = "all")

## Is the learning effective (using OOB evaluation) ?
# wholesaleCustomers.rufSupervised

## get to variable importance, leading to a full analysis and visualization
# wholesaleCustomers.importance = importance(wholesaleCustomers.rufSupervised, 
# Xtest = wholesaleCustomers, maxInteractions = 3)

## a - visualize : features, interactions, partial dependencies, features in clusters
## NOTE : tile the windows in the R menu to see all the plots. Loop over the prompt to see
## all the partial dependencies

# plot(wholesaleCustomers.importance, Xtest = wholesaleCustomers)

## we get global variable importance (information gain), interactions, partial dependencies,
## and variable importance over labels. See vignette for more details.

## b - more visualization : (another look at 'variable importance over labels')
# featuresCluster1 = partialImportance(wholesaleCustomers, wholesaleCustomers.importance, 
# whichClass = 1)

## c - visualization : clusters and most important features
# plot(wholesaleCustomers.rufUnsupervised, importanceObject = wholesaleCustomers.importance)

## d - table : see individual links between observations and features
## the table shows each observation with its associated features and their frequencies of occurrence

# featuresAndObs = wholesaleCustomers.importance$localVariableImportance$obs
# rownames(featuresAndObs) = 1:nrow(wholesaleCustomers)
# head(featuresAndObs)

## NOTE : since the features are mostly in monetary units, one may assess clusters by looking at the sum
## of all features per cluster, turning the problem into a 'revenue per cluster and feature' one
## that can be linked with the clustering process and visualization tools.

## first, merge outliers and retrieve clusters
# Class = mergeOutliers(wholesaleCustomers.rufUnsupervised)

## then add classes
# wholesaleCustomersClusterized = cbind(wholesaleCustomers, Class)

## finally compute revenues per cluster and feature.
## Note that this view may give more insights on how the algorithm clusters data.

# revenuePerClusterAndFeature = 
# aggregate(wholesaleCustomersClusterized[,-c(1,2,9)], list(Class), sum)

# revenuePerClusterAndFeature

## revenuePerCluster : showing where and how further work might be focused...
# rowSums(revenuePerClusterAndFeature[,-1])
