randomUniformForest (version 1.1.2)

unsupervised.randomUniformForest: Unsupervised Learning with Random Uniform Forests

Description

Unsupervised mode of Random Uniform Forests, providing clustering, dimension reduction, visualization and variable importance through a three-layer engine: dissimilarity matrix, Multidimensional Scaling (MDS) and k-means or hierarchical clustering. The unsupervised mode does not require the number of clusters to be known, thanks to the gap statistic, and it inherits the main algorithmic properties of the supervised mode, hence allowing (almost) any type of variable.

Usage

## S3 method for class 'randomUniformForest':
unsupervised(object,
	baseModel = c("proximity", "proximityThenDistance",  "importanceThenDistance"),
	endModel = c("MDSkMeans", "MDShClust", "MDS"),
	endModelMetric = NULL,
	samplingMethod = c("uniform univariate sampling", 
	"uniform multivariate sampling", "with bootstrap"),
	MDSmetric = c("metricMDS", "nonMetricMDS"),
	proximityMatrix = NULL,
	outliersFilter = FALSE, 
	Xtest = NULL, 
	predObject = NULL, 
	metricDimension = 2, 
	coordinates = c(1,2),
	bootstrapReplicates = 100,
	clusters = NULL,
	maxIters = NULL,
	importanceObject = NULL,
	maxInteractions = 2,
	reduceClusters = FALSE, 
	maxClusters = 5,
	mapAndReduce = FALSE,
	OOB = FALSE,
	subset = NULL, 
	seed = 2014,
	uthreads = "auto",
	...)	
	## S3 method for class 'unsupervised':
	print(x, ...)

	## S3 method for class 'unsupervised':
	plot(x, importanceObject = NULL, xlim = NULL, ylim = NULL, ...)

Arguments

Value

An object of class unsupervised, which is a list with the following components:

  • proximityMatrix: the resulting dissimilarity matrix.
  • MDSModel: the resulting Multidimensional Scaling model.
  • unsupervisedModel: the resulting unsupervised model, with clustered observations in unsupervisedModel$cluster.
  • largeDataLearningModel: if the dataset is large, the model that learned a sample of the MDS points and predicted the other points.
  • gapStatistics: if the k-means algorithm has been called, the results of the gap statistic; otherwise NULL.
  • rUFObject: the Random Uniform Forests object.
  • nbClusters: the number of clusters found.
  • params: the options of the model.
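
A brief sketch of how these components might be inspected, where 'uruf.obj' is a hypothetical name for an object returned by unsupervised.randomUniformForest():

# uruf.obj$nbClusters                          ## number of clusters found
# table(uruf.obj$unsupervisedModel$cluster)    ## cluster sizes
# uruf.obj$gapStatistics                       ## gap statistic results, NULL if k-means was not called
# str(uruf.obj$params)                         ## options of the model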

Details

The unsupervised mode of Random Uniform Forests is designed to provide dimension reduction, clustering and a full analysis of features and observations. The process uses a three-layer engine built around a randomUniformForest object and can be summarized by the following chain: randomUniformForest object --> dissimilarity matrix --> multidimensional scaling --> clustering algorithm --> clusters, which can then be computed into an object of class importance.randomUniformForest. The latter is, optionally, used to analyse features, their links with clusters and the links between observations, features and clusters.

The first step involves Breiman's ideas. Since Random Uniform Forests inherit all properties of Random Forests, they can implement the key concepts Breiman provided for the unsupervised case: create a synthetic data set by scrambling each column of the original dataset independently (for example uniformly); merge the synthetic and original datasets and give each one a label; run Random (Uniform) Forests on this new dataset and retrieve the OOB errors. The lower the errors, the easier clustering will be; if the error is too high, say close to 50 percent, then there is (almost) no hope. Once the forest classifier is built, one can move to the second step.

The second step uses the proximity matrix. If the data form an 'n x p' matrix (or data frame), the proximity matrix has size 'n x n': for each observation and each tree, one searches which other observations fall in the same terminal node, and increases the proximity of each such pair by one. Once all observations and all trees have been processed, the matrix is normalized by the number of trees. Since it can be very large, it is possible (but currently disabled, due to the compromise needed between accuracy and computing time on large datasets) to use an 'n x B' proximity matrix, where 'B' is the number of trees. A further step is to use a proximity matrix of at most 'n x q', where 'q' can be as small as 2, in order to compress the data as much as possible.

Once the proximity matrix (or the dissimilarity matrix, using for example '1 - proximities') has been computed, the third step is the MDS process. MDS is needed for dimension reduction (if it has not already happened) and, mainly, to generate decorrelated components in which the points will reside. The first two components are usually enough to get a good visualization. Unlike PCA, they are used only to achieve the best possible separation between points. Note that proximities may be transformed into distances, since we found that this sometimes produces better results.

The next step concludes the three-layer engine by calling a clustering algorithm, preferably k-means, to partition the MDS points, since the features are only used in the early phase of the algorithm. Hence, coordinates and points are manipulated in a new space in which MDS is the rule that matters. The k-means algorithm is simply a way to give a measurable delimitation of clusters that already exist, since the clustering is almost done earlier. Note that the number of clusters is automatically adjusted by the gap statistic. If that is not enough, the cluster structure can be instantaneously modified (see modifyClusters), letting the silhouette coefficient (Rousseeuw) have the last word. The unsupervised learning is then partially achieved.
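
To make this chain concrete, here is a minimal base R sketch of the same ideas on the iris data. It is only an illustration under simplifying assumptions: the forest-based proximities are replaced by a plain Euclidean dissimilarity so the code stays self-contained, and the object names (syntheticX, mdsPoints, ...) are purely illustrative.

## Breiman's trick: scramble each column independently, then label original vs.
## synthetic rows; a forest trained on (Z, label) would yield the proximities,
## and a low OOB error means the original data carry structure that can be clustered
data(iris)
X = iris[, -5]
syntheticX = as.data.frame(lapply(X, function(col) sample(col, replace = TRUE)))
Z = rbind(X, syntheticX)
label = factor(rep(c("original", "synthetic"), each = nrow(X)))

## dissimilarity matrix -> metric MDS -> k-means on the first two coordinates
## (here a Euclidean distance stands in for '1 - proximities')
D = dist(scale(X))
mdsPoints = cmdscale(D, k = 2)
km = kmeans(mdsPoints, centers = 3)   ## the number of clusters is normally chosen by the gap statistic
plot(mdsPoints, col = km$cluster, xlab = "Coordinate 1", ylab = "Coordinate 2")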
An object with all tools is generated and can be passed back to the randomUniformForest algorithm, which will provide a deeper analysis and visualization tools: what are the important features, e.g. the most discriminant ones? What are their interactions, e.g. how do features relate to the whole clustering scheme? Partial dependencies; links between observations and features; and so on. The unsupervised model is turned into a supervised one using the as.supervised function, with all the options one would use in the supervised case, and the analysis is then done by the importance.randomUniformForest function. This last step is essential if one wants access to the full details. Moreover, the data are now turned into a training sample for the algorithm.

If new data become available, one may use either the supervised case or the unsupervised one. The former has the great advantage of a complexity in O(B*n*log(n)). Indeed, due to the large amount of computation involved, the unsupervised case requires much more time than a simple call to the k-means algorithm, but it also provides more details. In comparison with a faster algorithm, and for a small dataset, the unsupervised mode of Random Uniform Forests always provides: dimension reduction, leading to a visualization in the MDS space (in two dimensions); a clustering scheme that processes the MDS points, thus integrating all features in the same representation; an object that can be used to evaluate and analyse the most important features, their interactions and their links with observations, according to the clustering scheme.

When moving toward large datasets, the unsupervised mode becomes hybrid, with a fully unsupervised mode for a subsample of the data and a supervised mode that learns the MDS points and predicts them for the remaining rows of the sample. This strongly reduces computation time and makes the resulting object recyclable: new data can be learned incrementally (combining or updating objects) and/or separately from the former object. The main argument is that the combination of Random Uniform Forests and the MDS space allows data to be clustered dynamically over time (i.e. as new data arrive) or in the space, by changing the clustering representation on the fly.

Note that the whole engine is stochastic, with almost no possibility of reproducibility using the set.seed() function. However, since Random Uniform Forests converge, a seed has been added to the second primary part of the unsupervised mode, the creation of the synthetic dataset. It means that most of the work needed to achieve a good clustering representation is devoted to the randomUniformForest part, allowing one to assess parameters independently for each layer and to look for the ones that have the main effect on the clustering.
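
As a hedged sketch of this workflow, assuming the object returned by as.supervised behaves like a standard randomUniformForest classifier (as suggested by the Examples section below) and using illustrative names such as 'iris.unsup', new observations could be assigned to existing clusters through the supervised model rather than by re-running the whole unsupervised engine:

# iris.unsup = unsupervised.randomUniformForest(iris[, -5])
#
# ## clusters become the labels of a supervised problem
# iris.sup = as.supervised(iris.unsup, iris[, -5])
#
# ## new observations are then assigned to existing clusters by the forest classifier,
# ## in O(B*n*log(n)) time, without rebuilding the whole clustering
# newObs = iris[sample(nrow(iris), 10), -5]
# predictedClusters = predict(iris.sup, newObs)
#
# ## full analysis of features relative to the clusters
# iris.imp = importance(iris.sup, Xtest = iris[, -5], maxInteractions = 2)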

References

Abreu, N., 2011. Analise do perfil do cliente Recheio e desenvolvimento de um sistema promocional. Mestrado em Marketing, ISCTE-IUL, Lisbon.

Breiman and Cutler's web site: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

Cox, T. F., Cox, M. A. A., 2001. Multidimensional Scaling. Second edition. Chapman and Hall.

Gower, J. C., 1966. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53, 325-328.

Kaufman, L., Rousseeuw, P. J., 1990. Finding Groups in Data: An Introduction to Cluster Analysis (1st ed.). New York: John Wiley.

Lloyd, S. P., 1957, 1982. Least squares quantization in PCM. Technical Note, Bell Laboratories. Published in 1982 in IEEE Transactions on Information Theory 28, 128-137.

Murtagh, F., Legendre, P., 2013. Ward's hierarchical agglomerative clustering method: which algorithms implement Ward's criterion? Journal of Classification (in press).

Tibshirani, R., Walther, G., Hastie, T., 2001. Estimating the number of data clusters via the Gap statistic. Journal of the Royal Statistical Society B, 63, 411-423.

See Also

modifyClusters, mergeClusters, clusteringObservations, as.supervised

Examples

## not run
## 1 - the famous iris dataset

## load data
# data(iris)

## run unsupervised modelling, removing labels :
## Default options, letting the 'gap statistic' find the number of clusters, 
## except for 'baseModel' which is slightly more efficient 
## with the "proximityThenDistance" argument

# iris.rufUnsupervised = unsupervised.randomUniformForest(iris[,-5], 
# baseModel = "proximityThenDistance", seed = 2014, threads = 1)

## view a summary
# iris.rufUnsupervised

## one may assess the gap statistic by calling the 'modifyClusters()' function,
## increasing or decreasing the number of clusters and looking at the variations
## of the silhouette coefficient.
## For example, if 4 clusters are found (since we know there are 3) :

# iris.rufUnsupervised2 = modifyClusters(iris.rufUnsupervised, decreaseBy = 1)

## plot clusters 
# plot(iris.rufUnsupervised)

## 2 - Full example with details
## Wholesale customers data (UCI machine learning repository)

# URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/00292/"
# datasetName = "Wholesale%20customers%20data.csv"
# wholesaleCustomers = read.csv(paste(URL, datasetName, sep =""))

## modelling, letting the algorithm deal with all problems :
## categorical features, number of clusters, dimension reduction, visualization,
## variable importance, links between features and observations,...

# wholesaleCustomers.rufUnsupervised = unsupervised.randomUniformForest(wholesaleCustomers, 
# nodesize = 10, bagging = TRUE, ntree = 200, categoricalvariablesidx = "all")

## assess the quality of the clustering :
## (and eventually change model parameters, e.g. 'baseModel' or 'endModel',
## running the model again to get a better clustering, looking at the average silhouette
## or the distance between clusters)

# wholesaleCustomers.rufUnsupervised

## visualization : at first, only clusters 
# plot(wholesaleCustomers.rufUnsupervised)

## but we may need more :
## get details, first turning the model into a supervised one

# wholesaleCustomers.rufSupervised = as.supervised(wholesaleCustomers.rufUnsupervised, 
# wholesaleCustomers, bagging = TRUE, ntree = 200, 
# nodesize = 10, categoricalvariablesidx = "all")

## Is the learning efficient (using OOB evaluation) ?
# wholesaleCustomers.rufSupervised

## get variable importance, leading to a full analysis and visualization,
## while limiting interactions between variables to the third order

# wholesaleCustomers.importance = importance(wholesaleCustomers.rufSupervised, 
# Xtest = wholesaleCustomers, maxInteractions = 3)

## a - visualize : features, interactions, partial dependencies, features in clusters
## NOTE : tile the windows in the R menu to see all plots. Loop over the prompt to see
## all matched partial dependencies

# plot(wholesaleCustomers.importance, Xtest = wholesaleCustomers)

## we get global variable importance (information gain), interactions, partial dependencies,
## and variable importance over labels. See vignette for more details.

## b - more visualization : (another look on 'variable importance over labels')
# featuresCluster1 = partialImportance(wholesaleCustomers, wholesaleCustomers.importance, 
# whichClass = 1)

## c - visualization : clusters and most important features
# plot(wholesaleCustomers.rufUnsupervised, importanceObject = wholesaleCustomers.importance)

## d - table : see individual links between observations and features
## the table shows each observation with its associated features
## and their frequencies of occurrence

# featuresAndObs = as.data.frame(wholesaleCustomers.importance$localVariableImportance$obs)
# frequencyFeaturesIdx = grep("Frequency", colnames(featuresAndObs))
# featuresNames = apply(featuresAndObs[,-c(1,frequencyFeaturesIdx)], 2, 
# function(Z) colnames(wholesaleCustomers)[Z])
# featuresAndObs[,-c(1,frequencyFeaturesIdx)] = featuresNames

# head(featuresAndObs)

## NOTE : since the features are mostly in monetary units, one may assess clusters
## by looking at the sum of all features per cluster, turning the problem
## into a 'revenue per cluster and feature' one that can be linked
## with the clustering process and visualization tools.

## first, merge outliers and retrieve clusters
# Class = mergeOutliers(wholesaleCustomers.rufUnsupervised)

## then add classes
# wholesaleCustomersClusterized = cbind(wholesaleCustomers, Class)

## finally compute revenues per cluster and feature.
## Note that this view may give more insights on how the algorithm clusters data.

# revenuePerClusterAndFeature = 
# aggregate(wholesaleCustomersClusterized[,-c(1,2,9)], list(Class), sum)

## see results
# revenuePerClusterAndFeature

## revenue per cluster : showing where and how further work might be focused...
# rowSums(revenuePerClusterAndFeature[,-1])
