fixmahal: Mahalanobis Fixed Point Clusters

Description

Computes Mahalanobis fixed point clusters (FPCs), i.e., subsets of the data, which consist exactly of the non-outliers w.r.t. themselves, and may be interpreted as generated from a homogeneous normal population. FPCs may overlap, are not necessarily exhausting and do not need a specification of the number of clusters.

Note that while fixmahal has lots of parameters, only one (or few) of them have usually to be specified, cf. the examples. The philosophy is to allow much flexibility, but to always provide sensible defaults.

Usage

fixmahal(dat, n = nrow(as.matrix(dat)), p = ncol(as.matrix(dat)), 
                      method = "fuzzy", cgen = "fixed",
                      ca = NA, ca2 = NA,
                      calpha = ifelse(method=="fuzzy",0.95,0.99),
                      calpha2 = 0.995,
                      pointit = TRUE, subset = n,
                      nc1 = 100+20*p,
                      startn = 18+p, mnc = floor(startn/2), 
                      mer = ifelse(pointit,0.1,0), 
                      distcut = 0.85, maxit = 5*n, iter = n*1e-5,
                      init.group = list(), 
                      ind.storage = TRUE, countmode = 100, 
                      plot = "none")

## S3 method for class 'mfpc':
summary(object, ...)
## S3 method for class 'summary.mfpc':
print(x, maxnc=30, ...)
## S3 method for class 'mfpc':
plot(x, dat, no, bw=FALSE, main=c("Representative FPC No. ",no),
                    xlab=NULL, ylab=NULL,
                    pch=NULL, col=NULL, ...)
## S3 method for class 'mfpc':
fpclusters(object, dat=NA, ca=object$ca, p=object$p, ...)
fpmi(dat, n = nrow(as.matrix(dat)), p = ncol(as.matrix(dat)),
                  gv, ca, ca2, method = "ml", plot,
                  maxit = 5*n, iter = n*1e-6)

Arguments

dat

something that can be coerced to a numerical matrix or vector. Data matrix, rows are points, columns are variables. fpclusters.rfpc does not need specification of dat if fixmahal has been run with

optional positive integer. Number of cases.

optional positive integer. Number of independent variables.

method

a string. method="classical" means 0-1 weighting of observations by Mahalanobis distances and use of the classical normal covariance estimator. method="ml" uses the ML-covariance estimator (division by n

cgen

optional string. "fixed" means that the same tuning constant ca is used for all iterations. "auto" means that ca is generated dependently on the size of the current data subset in each iterat

optional positive number. Tuning constant, specifying required cluster separation. By default determined as calpha-quantile of the chisquared distribution with p degrees of freedom.

ca2

optional positive number. Second tuning constant needed if method="fuzzy". By default determined as calpha2-quantile of the chisquared distribution with p degrees of freedom.

calpha

number between 0 and 1. See ca.

calpha2

number between 0 and 1, larger than calpha. See ca2.

pointit

optional logical. If TRUE, subset fixed point algorithms are started from initial configurations, which are built around single points of the dataset, cf. mahalconf.

subset

optional positive integer smaller or equal than n. Initial configurations for the fixed point algorithm (cf. mahalconf) are built from a subset of subset points from

nc1

optional positive integer. Tuning constant needed by cmahal to generate ca automatically. Only needed for cgen="auto".

startn

optional positive integer. Size of the initial configurations. The default value is chosen to prevent that small meaningless FPCs are found, but it should be decreased if clusters of size smaller than the default value are of interest.

mnc

optional positive integer. Minimum size of clusters to be reported.

mer

optional nonnegative number. FPCs (groups of them, respectively, see details) are only reported as stable if the ratio of the number of their findings to their number of points exceeds mer. This holds under pointit=

distcut

optional value between 0 and 1. A similarity measure between FPCs, given in Hennig (2002), and the corresponding Single Linkage groups of FPCs with similarity larger than distcut are computed. A single representative FPC is se

maxit

optional integer. Maximum number of iterations per algorithm run (usually an FPC is found much earlier).

iter

positive number. Algorithm stops when difference between subsequent weight vectors is smaller than iter. Only needed for method="fuzzy".

init.group

optional list of logical vectors of length n. Every vector indicates a starting configuration for the fixed point algorithm. This can be used for datasets with high dimension, where the vectors of init.group indic

ind.storage

optional logical. If TRUE, then all indicator vectors of found FPCs are given in the value of fixmahal. May need lots of memory, but is a bit faster.

countmode

optional positive integer. Every countmode algorithm runs fixmahal shows a message.

plot

optional string. If "start", you get a scatterplot of the first two variables to highlight the initial configuration, "iteration" generates such a plot at each iteration, "both" does both (this may be ver

object

object of class mfpc, output of fixmahal.

maxnc

positive integer. Maximum number of FPCs to be reported.

positive integer. Number of the representative FPC to be plotted.

optional logical. If TRUE, plot is black/white, FPC is indicated by different symbol. Else FPC is indicated red.

main

plot title.

xlab

label for x-axis. If NULL, a default text is used.

ylab

label for y-axis. If NULL, a default text is used.

pch

plotting symbol, see par. If NULL, the default is used.

col

plotting color, see par. If NULL, the default is used.

logical vector (or, with method="fuzzy", vector of weights between 0 and 1) of length n. Indicates the initial configuration for the fixed point algorithm.

...

additional parameters to be passed to plot (no effects elsewhere).

Value

fixmahal returns an object of class mfpc. This is a list containing the components nc, g, means, covs, nfound, er, tsc, ncoll, skc, grto, imatrix, smatrix, stn, stfound, ser, sfpc, ssig, sto, struc, n, p, method, cgen, ca, ca2, cvec, calpha, pointit, subset, mnc, startn, mer, distcut.
summary.mfpc returns an object of class summary.mfpc. This is a list containing the components means, covs, stn, stfound, sn, ser, tskip, skc, tsc, sim, ca, ca2, calpha, mer, method, cgen, pointit.
fpclusters.mfpc returns a list of indicator vectors for the representative FPCs of stable groups.
fpmi returns a list with the components mg, covg, md, gv, coll, method, ca.
ncinteger. Number of FPCs.
glist of logical vectors. Indicator vectors of FPCs. FALSE if ind.storage=FALSE.
meanslist of numerical vectors. Means of FPCs. In summary.mfpc, only for representative FPCs of stable groups and sorted according to ser.
covslist of numerical matrices. Covariance matrices of FPCs. In summary.mfpc, only for representative FPCs of stable groups and sorted according to ser.
nfoundvector of integers. Number of findings for the FPCs.
ernumerical vector. Ratio of number of findings of FPCs to their size. Under pointit=TRUE, this can be taken as a measure of stability of FPCs.
tscinteger. Number of algorithm runs leading to too small or too seldom found FPCs.
ncollinteger. Number of algorithm runs where collinear covariance matrices occurred.
skcinteger. Number of skipped clusters.
grtovector of integers. Numbers of FPCs to which algorithm runs led, which were started by init.group.
imatrixvector of integers. Size of intersection between FPCs. See sseg.
smatrixnumerical vector. Similarities between FPCs. See sseg.
stninteger. Number of representative FPCs of stable groups. In summary.mfpc, sorted according to ser.
stfoundvector of integers. Number of findings of members of all groups of FPCs. In summary.mfpc, sorted according to ser.
sernumerical vector. Ratio of number of findings of groups of FPCs to their size. Under pointit=TRUE, this can be taken as a measure of stability of FPCs. In summary.mfpc, sorted from largest to smallest.
sfpcvector of integers. Numbers of representative FPCs of all groups.
ssigvector of integers of length stn. Numbers of representative FPCs of the stable groups.
stovector of integers. Numbers of groups ordered according to largest ser.
strucvector of integers. Number of group an FPC belongs to.
nsee arguments.
psee arguments.
methodsee arguments.
cgensee arguments.
casee arguments, if cgen has been "fixed". Else numerical vector of length nc (see below), giving the final values of ca for all FPC. In fpmi, tuning constant for the iterated FPC.
ca2see arguments.
cvecnumerical vector of length n for cgen="auto". The values for the tuning constant ca corresponding to the cluster sizes from 1 to n.
calphasee arguments.
pointitsee arguments.
subsetsee arguments.
mncsee arguments.
startnsee arguments.
mersee arguments.
distcutsee arguments.
snvector of integers. Number of points of representative FPCs.
tskipinteger. Number of algorithm runs leading to skipped FPCs.
simvector of integers. Size of intersections between representative FPCs of stable groups. See sseg.
mgmean vector.
covgcovariance matrix.
mdMahalanobis distances.
gvlogical (numerical, respectively, if method="fuzzy") indicator vector of iterated FPC.
colllogical. TRUE means that singular covariance matrices occurred during the iterations.

Details

A (crisp) Mahalanobis FPC is a data subset that reproduces itself under the following operation: Compute mean and covariance matrix estimator for the data subset, and compute all points of the dataset for which the squared Mahalanobis distance is smaller than ca. Fixed points of this operation can be considered as clusters, because they contain only non-outliers (as defined by the above mentioned procedure) and all other points are outliers w.r.t. the subset. The current default is to compute fuzzy Mahalanobis FPCs, where the points in the subset have a membership weight between 0 and 1 and give rise to weighted means and covariance matrices. The new weights are then obtained by computing the weight function wfu of the squared Mahalanobis distances, i.e., full weight for squared distances smaller than ca, zero weight for squared distances larger than ca2 and decreasing weights (linear function of squared distances) in between. A fixed point algorithm is started from the whole dataset, algorithms are started from the subsets specified in init.group, and further algorithms are started from further initial configurations as explained under subset and in the function mahalconf. Usually some of the FPCs are unstable, and more than one FPC may correspond to the same significant pattern in the data. Therefore the number of FPCs is reduced: A similarity matrix is computed between FPCs. Similarity between sets is defined as the ratio between 2 times size of intersection and the sum of sizes of both sets. The Single Linkage clusters (groups) of level distcut are computed, i.e. the connectivity components of the graph where edges are drawn between FPCs with similarity larger than distcut. Groups of FPCs whose members are found often enough (cf. parameter mer) are considered as stable enough. A representative FPC is chosen for every Single Linkage cluster of FPCs according to the maximum expectation ratio ser. ser is the ratio between the number of findings of an FPC and the number of points of an FPC, adjusted suitably if

subset.
  Usually only the representative FPCs of stable groups
  are of interest. 
Default tuning constants are taken from Hennig (2005).
Generally, the default settings are recommended for
  fixmahal. For large datasets, the use of
  init.group together with pointit=FALSE
  is useful. Occasionally, mnc and startn may be chosen
  smaller than the default,
  if smaller clusters are of interest, but this may lead to too many
  clusters. Decrease of
  ca will often lead to too many clusters, even for homogeneous
  data. Increase of ca will produce only very strongly
  separated clusters. Both may be of interest occasionally.
Singular covariance matrices during the iterations are handled by
  solvecov.
  summary.mfpc gives a summary about the representative FPCs of
  stable groups.
  plot.mfpc is a plot method for the representative FPC of stable
  group no. no. It produces a scatterplot, where
  the points belonging to the FPC are highlighted, the mean is and
  for p<3< code=""> also the region of the FPC is shown. For p>=3,
  the optimal separating projection computed by batcoord
  is shown.  
  fpclusters.mfpc produces a list of indicator vectors for the
  representative FPCs of stable groups.
  fpmi is called by fixmahal for a single fixed point
  algorithm and will usually not be executed alone.

References

Hennig, C. (2002) Fixed point clusters for linear regression: computation and comparison, Journal of Classification 19, 249-276.

Hennig, C. (2005) Fuzzy and Crisp Mahalanobis Fixed Point Clusters, in Baier, D., Decker, R., and Schmidt-Thieme, L. (eds.): Data Analysis and Decision Support. Springer, Heidelberg, 47-56, http://www.homepages.ucl.ac.uk/~ucakche/papers/fuzzyfix.ps

Hennig, C. and Christlieb, N. (2002) Validating visual clusters in large datasets: Fixed point clusters of spectral features, Computational Statistics and Data Analysis 40, 723-739.

Examples

Run this code

set.seed(20000)
  face <- rFace(400,dMoNo=2,dNoEy=0, p=3)
  # The first example uses grouping information via init.group.
  initg <- list()
  grface <- as.integer(attr(face,"grouping"))
  for (i in 1:5) initg[[i]] <- (grface==i)
  ff0 <- fixmahal(face, pointit=FALSE, init.group=initg)
  summary(ff0)
  cff0 <- fpclusters(ff0)
  plot(face, col=1+cff0[[1]])
  plot(face, col=1+cff0[[4]]) # Why does this come out as a cluster? 
  plot(ff0, face, 4) # A bit clearer...
  # Without grouping information, examples need more time:
  # ff1 <- fixmahal(face)
  # summary(ff1)
  # cff1 <- fpclusters(ff1)
  # plot(face, col=1+cff1[[1]])
  # plot(face, col=1+cff1[[6]]) # Why does this come out as a cluster? 
  # plot(ff1, face, 6) # A bit clearer...
  # ff2 <- fixmahal(face,method="ml")
  # summary(ff2)
  # ff3 <- fixmahal(face,method="ml",calpha=0.95,subset=50)
  # summary(ff3)
  ## ...fast, but lots of clusters. mer=0.3 may be useful here.
  # set.seed(3000)
  # face2 <- rFace(400,dMoNo=2,dNoEy=0)
  # ff5 <- fixmahal(face2)
  # summary(ff5)
  ## misses right eye of face data; with p=6,
  ## initial configurations are too large for 40 point clusters 
  # ff6 <- fixmahal(face2, startn=30)
  # summary(ff6)
  # cff6 <- fpclusters(ff6)
  # plot(face2, col=1+cff6[[3]])
  # plot(ff6, face2, 3)
  # x <- c(1,2,3,6,6,7,8,120)
  # ff8 <- fixmahal(x)
  # summary(ff8)
  # ...dataset a bit too small for the defaults...
  # ff9 <- fixmahal(x, mnc=3, startn=3)
  # summary(ff9)

Run the code above in your browser using DataLab