This function produces outlier ranking factors for a set of cases. The
methodology used for obtaining these factors is described in Section
4.4.1.3 of the book Data Mining with R (Torgo, 2010) and more details
can be obtained in Torgo (2007). The methodology is based on the
simple idea of using the information provided by an agglomerative
hierarchical clustering algorithm to infer the degree of outlyingness
of the observations. The basic assumption is that outliers should
offer "more resistance" to being clustered, i.e. being merged on large
groups of observations. The function was written to be used with the outcome of the
hclust()
R function that implements several agglomerative
clustering methods. Although in theory the methodology could be used with
any other agglomerative hierarchical clustering algorithm, the fact is
that the code of this implementation strongly depends on the data
structures produced by the hclust()
function. As such if you
wish to change the function to be able to use other clustering
algorithms you should ensure that the data structures it produces are
compatible with the requirements of our function. Specifically, your
clustering algorithm should produce a list with a component named
merge
that should be a matrix describing the merging steps of
the clustering process (see the help page of the hclust()
function for a full description of this data structure). This is the
only data structure that is required by our function and that is used
from the object returned by clustering algorithm. The diana()
clustering algorithm
also produces this type of information and thus can also be used with
our function by providing the value "diana" on the component
alg
of the list forming the parameter clus
.
There are essentially two ways of using this function. The first
consists in giving it a data set on the parameter data
and the
function will rank these observations according to their
outlyingness. The other consists in specifying two sets of data. One
is the set for which you want the outlyingness factors that should be given on the
parameter test.data
. The second set is provided on the
data
parameter and it is used to increase
the ammount of data used in the clustering process to improve the
statistical reliability of the process.
In the first way of using this function that was described above the
user can either supply the data set or the respective distance matrix.
If the data set is provided then the user should specify the type of
distance metric it should be used to calculate the distances between
the observations. This is done by including a distance calculation
method in the "dist" component of the list provided in parameter
clus
. This method should be a valid value of the parameter
method
of the R function dist()
(see its help for details).
This function currently implements three different methods for obtaing
outlier ranking factors from the clustering process. These are:
"linear", "sigmoid" and "sizeDiff" (the default). Irrespectively, of
this method the outlyingness factor of observation X is obtained by:
OF_H(X) = max_i of_i(X), where i represents the different merging
steps of the clustering process and it goes from 1 to N-1, where N is
the size of the data set to be clustered. The three methods differ in
the way they calculate of_i(X) for each merging step. In the "linear"
method of_i(X) = i / (N-1) * p(|g|), where g is the group to which X
belongs at the merging step i (each merging step involves two groups),
and |g| is the size of that group. The function p() is a penalization
factor depending on the size of the group. The larger this size the
smaller the value of p(), p(s) = I(s < thr) * ( 1 - (s-1) / (N-2)),
where I() is an indicator function and thr is a threshold defined as
perc*N. The user should set the value of perc by including a component
named "sz.perc" in the list provided in the parameter
method.pars
. In the "sigmoid" method of_i(X) = exp( -2 * (i -
(N-1))^2 / (N-1)^2) * p(|g|), where the p() function has the same
meaning as in the "linear" method but this time is defined as p(s) =
I(s < 2*thr) * ( 1 - exp( -4 * (s-2*thr)^2 / (2*thr)^2)). Again thr is
perc*N and the user must set the value of perc by including a component
named "sz.perc" in the list provided in the parameter
method.pars
. Finally, the method "sizeDiff" defines of_i(X) =
max ( 0, ( |g_y,i| - |g_x,i| ) / ( |g_y,i| + |g_x,i| ) ),
where g_y,i and g_x,i are the two groups involved in the merge at
step i, and g_x,i is the group which X belongs to. Note that if X
belongs to the larger of the two groups this will get X a value of
of_i() equals to zero.