cluster.com: Functional data clustering via concentration inequalities

Description

cluster.com clusters sets of functional data by their covariance operators, using an EM-style algorithm based on concentration inequalities.

Usage

cluster.com(dat, labl = NULL, grpCnt = 2, iter = 30, SOFT = FALSE,
  PRINTLK = TRUE, LOADING = FALSE, IGNORESTOP = FALSE)

Arguments

dat

An (n x m) data matrix of n samples, each a vector of length m.

labl

An optional vector of n labels used to group curves (see Details).

grpCnt

Number of clusters into which to split the data.

iter

Maximum number of iterations for the EM-style algorithm.

SOFT

Boolean flag; if TRUE, category probabilities are returned instead of hard labels (see Details).

PRINTLK

Boolean flag, which if TRUE, prints likelihood values for each iteration.

LOADING

Boolean flag, which if TRUE, prints a loading bar.

IGNORESTOP

Boolean flag, which if TRUE, ignores the early stopping conditions so that the EM algorithm runs for the full number of iterations given by iter.

Value

cluster.com returns a vector of labels with one entry for each row of data, each corresponding to one of the k = grpCnt categories (or an array of probability vectors if SOFT=TRUE).

Details

This function clusters individual curves or sets of curves by considering the distance between their covariance operator and each estimated category covariance operator. The implemented algorithm reworks the concentration-inequality-based classification method classif.com into an EM-style algorithm. This method iteratively updates the probability of a given observation belonging to each of the k categories, where k is set by grpCnt. These probabilities are in turn used to update the category means. This process continues until either the total number of iterations is reached or a computed likelihood begins to decrease, signaling the arrival of a local optimum.
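The alternating update described above can be illustrated with a toy loop. This is only a sketch under stated assumptions and is not the fdcov implementation: it uses the Frobenius (Hilbert-Schmidt) distance between rank-one covariance estimates and an exponential weighting as stand-ins for the concentration-inequality machinery, and it omits the likelihood-based stopping rule. All variable names (obsCov, catCov, prob) are hypothetical.

```r
# Toy EM-style covariance clustering loop (illustrative only, NOT cluster.com itself)
set.seed(1)
n <- 40; m <- 10; k <- 2
# Simulated zero-mean curves: 20 with small variance, 20 with large variance
dat <- rbind(matrix(rnorm(20 * m, sd = 1), 20, m),
             matrix(rnorm(20 * m, sd = 3), 20, m))

# Rank-one covariance estimate x x^T per curve, flattened into a row of length m*m
obsCov <- t(apply(dat, 1, function(x) as.vector(tcrossprod(x))))

# Random initial membership probabilities (each row sums to 1)
prob <- matrix(runif(n * k), n, k)
prob <- prob / rowSums(prob)

for (it in 1:30) {
  # Update category covariances as probability-weighted averages
  catCov <- t(prob) %*% obsCov / colSums(prob)
  # Squared Frobenius distance from every observation to every category
  d2 <- sapply(1:k, function(j) rowSums(sweep(obsCov, 2, catCov[j, ])^2))
  # Assumed weighting: closer categories get higher probability
  w <- exp(-(d2 - apply(d2, 1, min)) / median(d2))
  prob <- w / rowSums(w)
}

# Hard labels: the most probable category for each observation
labels <- max.col(prob, ties.method = "first")
```

Subtracting the row-wise minimum before exponentiating keeps the weights numerically stable without changing the normalized probabilities.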

If the argument labl is NULL, then every curve is clustered separately. If labl is a factor used to group the curves, then each set of curves is classified as one group. For example, if you have multiple speakers and multiple speech samples from each speaker, you can group the data from each speaker together in order to cluster based on each speaker's covariance operator rather than based on each speech sample individually.
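To make the grouping concrete, the sketch below builds a label factor with gl and computes one empirical covariance matrix per labelled group using base R only. The data layout (3 speakers, 4 samples each, curves of length 5) is hypothetical, and how cluster.com forms its group covariance operators internally is not shown here.

```r
# Hypothetical layout: 3 speakers, 4 speech samples each, curves of length 5
set.seed(2)
dat <- matrix(rnorm(12 * 5), nrow = 12, ncol = 5)
lab <- gl(3, 4)   # factor with values 1,1,1,1,2,2,2,2,3,3,3,3

# One empirical covariance matrix per labelled group of rows
grpCov <- lapply(split(seq_len(nrow(dat)), lab),
                 function(idx) cov(dat[idx, , drop = FALSE]))

length(grpCov)     # one entry per speaker
dim(grpCov[[1]])   # each entry is an m x m matrix
```

With labels like these, the clustering unit is the group (here, the speaker), so one label per group would be expected in the result rather than one per row.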

If the flag SOFT is set to TRUE, then soft clustering occurs. In this case, given k different labels, a k-long probability vector is returned for each observation; its entries give the probabilities that the observed function belongs to each label.
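Soft output can be reduced to hard labels by taking the most probable category for each observation. The sketch below uses a made-up probability matrix and assumes one row per observation and one column per category; the orientation of the array actually returned by cluster.com should be checked before applying this.

```r
# Hypothetical soft output: 4 observations (rows), k = 3 categories (columns)
prob <- rbind(c(0.7, 0.2, 0.1),
              c(0.1, 0.8, 0.1),
              c(0.3, 0.3, 0.4),
              c(0.5, 0.4, 0.1))

# Each probability vector should sum to 1 (up to rounding)
stopifnot(all(abs(rowSums(prob) - 1) < 1e-8))

# Hard label = index of the largest probability in each row
hard <- max.col(prob, ties.method = "first")
hard   # 1 2 3 1

# Largest probability per row, a rough measure of assignment confidence
conf <- apply(prob, 1, max)
```

Rows such as the third one, where the largest probability is only 0.4, are the observations for which soft clustering is more informative than a single label.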

References

Kashlak, Adam B, John A D Aston, and Richard Nickl (2016). "Inference on covariance operators via concentration inequalities: k-sample tests, classification, and clustering via Rademacher complexities", in review

Examples

# NOT RUN {
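 # Load the phoneme data (fds) and cluster.com (assumed to come from the fdcov package)
 library(fds)
 library(fdcov)
 # Set up the data: 20 curves from each of three phonemes, one curve per row
 dat  <- rbind(t(aa$y[, 1:20]), t(iy$y[, 1:20]), t(sh$y[, 1:20]))
 # Cluster the 60 curves individually into three groups
 clst <- cluster.com(dat, grpCnt = 3)
 matrix(clst, 3, 20, byrow = TRUE)

 # Cluster groups of curves: 40 curves per phoneme, in 30 sets of 4
 dat  <- rbind(t(aa$y[, 1:40]), t(iy$y[, 1:40]), t(sh$y[, 1:40]))
 lab  <- gl(30, 4)
 # Cluster the 30 sets into three groups
 clst <- cluster.com(dat, labl = lab, grpCnt = 3)
 matrix(clst, 3, 10, byrow = TRUE)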
# }
