The 'clusterAnalysis( )' function is designed to provide a straightforward interpretation of a clustering problem or, in classification, of the relation the model finds between features, observations and labels.
The main ingredient of the function is an object of class importance, which gives the local importance and its derivatives (see Ciss, 2015b). For each observation, one records the features that come, over all trees, in first, second, ..., position in the terminal nodes where the observation falls. One also records the frequency (number of times divided by number of trees) at which each of the recorded features appears. This gives the first output table ('featuresAndObs', see below), with all observations in rows and, in columns, their class or cluster, their position (from first to last) and their frequency. Note that a position is a local component, named 'localVariable'; e.g., 'localVariable1' means that, for the i-th observation, the recorded variable was the one with the highest frequency over the terminal nodes where the observation has been dropped.
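As a minimal sketch of the whole pipeline on the iris data (the calls below follow the package interface; argument names such as 'components' and 'maxFeatures', and the object names, are assumptions that may differ between versions):

library(randomUniformForest)
data(iris)
X = iris[, -5]; Y = iris[, 5]
# a classification model, then its object of class importance
iris.ruf = randomUniformForest(X, as.factor(Y), ntree = 100)
iris.imp = importance(iris.ruf, Xtest = X)
# clusterAnalysis( ) prints its tables; 'featuresAndObs' is the first one,
# with observations in rows and class/cluster, positions and frequencies in columns
iris.CA = clusterAnalysis(iris.imp, X, components = 3, maxFeatures = 2)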
The 'featuresAndObs' table is then aggregated, giving in rows the cluster (or the class) predicted by the model and in columns a component (serving the same purpose as an eigenvector) that aggregates the variables recorded over all observations of the dataset. We call this table 'clusterAnalysis', since it connects clusters (or classes) with dependent variables on each component. The table is granular, since one may choose (using options) both the number of components and the number of variables to display, allowing an easy view of the variables that matter. In the table, the variables of each component are displayed in decreasing order of influence and are usually co-dependent.
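Continuing the sketch above, the display may be restricted, for instance, to the two leading components and the two most influential variables per component (again assuming the 'components' and 'maxFeatures' arguments):

clusterAnalysis(iris.imp, X, components = 2, maxFeatures = 2)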
The importance (between 0 and 1) of each component is displayed in the third table, 'componentAnalysis', allowing the user to decide how many components are needed to explain the clusters. More components mean more variables, and the assistance of the object of class importance, its friends and methods (plot, print, summary) will be needed. Fewer components mean an easy, but possibly partial, interpretation. Note that the 'clusterAnalysis( )' function is, partially, a summarized view of importance.randomUniformForest.
The three resulting tables come directly from the modelling and, in classification, do not use training labels but predictions (just as clusters are predictions computed by the model). One may then want to know how the modelling relies on the true nature of the data. That is the main interest of the 'clusterAnalysis( )' function: it unifies modelling and observed points. The last tables, named 'numericalFeaturesAnalysis' and/or 'categoricalFeaturesAnalysis', aggregate (by a sum, mean, standard deviation or most frequent state) the values of each feature for each predicted cluster or class.
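The lines below are illustrative only, not the package's internal code: continuing the sketch, they mimic the idea of 'numericalFeaturesAnalysis' by averaging each numerical feature per predicted class.

# predictions of the model, then a per-class mean of each feature
predictedClasses = predict(iris.ruf, X)
aggregate(X, by = list(Class = predictedClasses), FUN = mean)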
1 - If the clustering scheme is pertinent, then these tables must show differences between clusters when looking at the aggregated values of the features.
2 - They must provide insights, especially in a cost-benefit analysis, about the features one needs to take care of in order to gain more knowledge of each cluster (or class).
3 - In the general case, and in Random Uniform Forests, the modelling may not match, even if the clusters are well separated, either the true distribution (number of cases per cluster or class) or the true effects (relations between covariates, relations within each cluster, ...) present in the data. The 'clusterAnalysis( )' function, unifying all the provided tables, must ensure that:
3.a - true, or at least main or possible, effects are found,
3.b - the approximated distribution generated by the model is close enough to the true one.
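For point 3.b, and continuing the sketch, a quick check, when true labels are available, is to compare the distribution generated by the model with the observed one:

# approximated distribution vs. true distribution
table(Predicted = predict(iris.ruf, X))
table(True = Y)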
From the points above, the main argument is that Random Uniform Forests are almost entirely stochastic; hence random effects are expected first. If repeated modelling leads to similar results (clusters and within-cluster relations), then the results are, at least, self-consistent, and clusterAnalysis( ) provides a way to look at how the modelling enlightens the data.
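A simple self-consistency sketch (the number of runs is arbitrary) repeats the modelling and compares the predicted distributions across runs:

# repeat the modelling; similar tables across runs suggest self-consistent results
runs = lapply(1:3, function(i) predict(randomUniformForest(X, as.factor(Y), ntree = 100), X))
lapply(runs, table)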