kml: ~ Algorithm kml: K-means for Longitudinal data ~

Description

kml is a new implementation of k-means for longitudinal data (or trajectories). The algorithm can handle missing values and provides an easy way to rerun k-means several times, varying the starting conditions and/or the number of clusters sought. This page describes the algorithm. For an overview of the package, see kml-package.

Usage

kml(Object, nbClusters = 2:6, nbRedrawing = 20, saveFreq = 100,
    maxIt = 200, trajMinSize = 2, print.cal = FALSE,
    print.traj = FALSE, imputationMethod = "copyMean",
    distance, power = 2, centerMethod = meanNA, startingCond = "allMethods",
    distanceStartingCond = "euclidean", ...)

Arguments

Object

[ClusterizLongData]: contains the trajectories to cluster, as well as any previous Clusterization objects.

nbClusters

[vector(numeric)]: Vector containing the numbers of clusters with which kml must work. By default, nbClusters is 2:6, which indicates that kml must search for partitions with respectively 2, the

nbRedrawing

[numeric]: Sets the number of times that k-means must be rerun (with different starting conditions) for each number of clusters.

saveFreq

[numeric]: Long computations can take several days, so it is possible to save the ClusterizLongData object once in a while. saveFreq defines the frequency of the saving process. The ClusterizLongData is

maxIt

[numeric]: Sets a limit on the number of iterations if convergence is not reached.
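To make the role of maxIt concrete, here is a minimal base R sketch of a k-means loop on trajectories that stops either at convergence or after maxIt iterations. The helper `simple_kmeans` and the toy data are hypothetical and are not part of kml; the sketch assumes complete data and Euclidean distance.

```r
# Hypothetical helper, not kml's implementation.
# traj: matrix with one trajectory per row; centers: matrix with one center per row.
simple_kmeans <- function(traj, centers, maxIt = 200) {
  assignment <- rep(0L, nrow(traj))
  for (it in seq_len(maxIt)) {
    # Squared Euclidean distance from every trajectory to every center
    d <- sapply(seq_len(nrow(centers)), function(k) {
      ctr <- matrix(centers[k, ], nrow(traj), ncol(traj), byrow = TRUE)
      rowSums((traj - ctr)^2)
    })
    newAssignment <- max.col(-d, ties.method = "first")
    # Convergence: no trajectory changed cluster
    if (all(newAssignment == assignment)) {
      return(list(cluster = assignment, iterations = it, converged = TRUE))
    }
    assignment <- newAssignment
    # Recompute each center as the mean trajectory of its cluster
    for (k in seq_len(nrow(centers))) {
      if (any(assignment == k)) {
        centers[k, ] <- colMeans(traj[assignment == k, , drop = FALSE])
      }
    }
  }
  # The iteration cap was hit before the partition stabilised
  list(cluster = assignment, iterations = maxIt, converged = FALSE)
}

set.seed(1)
traj <- rbind(matrix(rnorm(50, mean = 0), 10, 5),
              matrix(rnorm(50, mean = 10), 10, 5))
res <- simple_kmeans(traj, centers = traj[c(1, 11), ], maxIt = 200)
res$converged
```

With a very small cap such as maxIt = 1, the same call returns before the partition has been confirmed stable, which is the situation the limit is meant to bound.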

trajMinSize

[numeric]: The trajectories that include missing values can either be excluded or included. trajMinSize sets the minimum number of values that a trajectory must contain not to be excluded. For example, if the trajectories have
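The filtering rule stated for trajMinSize can be illustrated in base R. This is only a sketch on an invented matrix `traj` (one trajectory per row); it does not call kml and does not show how kml stores the excluded trajectories.

```r
# Invented example data: 4 trajectories measured at 4 time points
traj <- rbind(
  c(1, 2, 3, 4),
  c(NA, 2, NA, NA),
  c(5, NA, 7, 8),
  c(NA, NA, NA, NA)
)
trajMinSize <- 2

# Number of non-missing values in each trajectory
nbObserved <- rowSums(!is.na(traj))

# Keep only trajectories with at least trajMinSize observed values
kept <- traj[nbObserved >= trajMinSize, , drop = FALSE]
kept
```

Here the second and fourth trajectories hold fewer than two observed values, so only the first and third are kept.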

print.cal

[logical]: If TRUE, the quality criterion is printed on screen during computation (if the number of redrawings is large, this can slow down the overall calculation).

print.traj

[logical]: If TRUE, each step of k-means is printed on screen during the calculation. This can slow down the overall calculation by a factor of 25, see "Optimisation" below.

imputationMethod

[character]: The quality criterion cannot be calculated if some values are missing. imputationMethod defines the method used to impute the missing values. It should be one of "LOCF","LOCB","linearInterpolation","line
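As an illustration of what two of these imputation methods typically do, here is a base R sketch. The helpers `locf` and `locb` are hypothetical, not kml functions. The sketch assumes "LOCF" carries the last observed value forward and "LOCB" fills a gap with the next observed value carried backward; kml's own handling of leading or trailing gaps may differ.

```r
# Hypothetical helper: carry the last observed value forward.
# Leading NAs have nothing before them and are left untouched in this sketch.
locf <- function(x) {
  for (i in seq_along(x)[-1]) {
    if (is.na(x[i])) x[i] <- x[i - 1]
  }
  x
}

# Hypothetical helper: carry the next observed value backward
# (implemented as a forward fill on the reversed trajectory).
locb <- function(x) rev(locf(rev(x)))

x <- c(NA, 2, NA, NA, 5, NA)
locf(x)
locb(x)
```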

distance

[numeric <- function(trajectory,trajectory)]: Function that computes the distance between two trajectories. If no function is specified, the Euclidean distance with Gower adjustment is used (the Gower adjustment takes missing values into account). Us

power

[numeric]: power defines the parameter of the Minkowski distance, if used.
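The following base R sketch shows how a power parameter enters a Minkowski distance between two trajectories that may contain missing values. The helper `minkowskiNA` is hypothetical. As an assumption, missing values are handled by rescaling the sum over the observed time points by (total time points / observed time points), which is a Gower-style adjustment and may not match kml's exact formula.

```r
# Hypothetical helper, not a kml function.
minkowskiNA <- function(x, y, power = 2) {
  ok <- !is.na(x) & !is.na(y)        # time points observed in both trajectories
  if (!any(ok)) return(NA_real_)     # nothing to compare
  # Assumed adjustment: scale the partial sum up to the full trajectory length
  (sum(abs(x[ok] - y[ok])^power) * length(x) / sum(ok))^(1 / power)
}

a <- c(0, 0, 0, 0)
b <- c(3, NA, 3, 3)
minkowskiNA(a, b, power = 2)  # Euclidean case
minkowskiNA(a, b, power = 1)  # Manhattan case
```

With power = 2 this gives 6, the same value as if the missing point had also differed by 3. Base R's `dist` applies the same proportional rescaling when columns are excluded because of missing values, so `dist(rbind(a, b))` agrees with the sketch on this example.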

centerMethod

[numeric <- function(vector(numeric))]: The k-means algorithm computes the center of each cluster. It is possible to personalise the definition of "center" by defining a function "centerMethod". This function should take a vector of numeric as a
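A short base R sketch of what a center function of this shape could look like, following the signature given above (a numeric vector in, a single numeric out). The names `meanNASketch` and `medianNASketch` are hypothetical stand-ins; the first is only assumed to behave like the default meanNA, that is, a mean that ignores missing values.

```r
# Hypothetical center functions: numeric vector in, one numeric value out
meanNASketch <- function(x) mean(x, na.rm = TRUE)
medianNASketch <- function(x) median(x, na.rm = TRUE)

# Invented cluster of 4 trajectories at 3 time points; the last one is an outlier at time 3
cluster <- rbind(
  c(1, 2, NA),
  c(3, NA, 7),
  c(5, 6, 8),
  c(7, 10, 100)
)

# The center of the cluster is obtained by applying the function at each time point
apply(cluster, 2, meanNASketch)
apply(cluster, 2, medianNASketch)
```

The two centers agree at the first two time points and differ at the third, where the median is far less sensitive to the outlying value.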

startingCond

[character]: Specifies the starting condition. It should be one of "maxDist", "randomAll", "randomK" or "allMethods". See Details.
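The sketch below illustrates two of the starting conditions and the "allMethods" schedule in base R. All three helpers are hypothetical. The authoritative definitions are in partitionInitialise; the behaviour assumed here is that "randomAll" assigns every trajectory to a random cluster with no cluster left empty, and that "randomK" picks k trajectories as seeds and leaves the others unassigned. The schedule follows the Details section: one "maxDist", one "randomAll", then "randomK" for the remaining reruns. "maxDist" itself is not sketched.

```r
# Assumed "randomAll": every trajectory gets a random cluster, each cluster non-empty
randomAllSketch <- function(n, k) {
  sample(c(seq_len(k), sample(k, n - k, replace = TRUE)))
}

# Assumed "randomK": k trajectories become seeds (one per cluster), the rest are NA
randomKSketch <- function(n, k) {
  part <- rep(NA_integer_, n)
  part[sample(n, k)] <- seq_len(k)
  part
}

# "allMethods" schedule for a given number of reruns
allMethodsSchedule <- function(nbRedrawing) {
  c("maxDist", "randomAll", rep("randomK", nbRedrawing))[seq_len(nbRedrawing)]
}

set.seed(1)
randomAllSketch(10, 3)
randomKSketch(10, 3)
table(allMethodsSchedule(20))
```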

distanceStartingCond

[character]: Some starting conditions need to compute the distance matrix of the trajectories. distanceStartingCond defines the distance that will be used to calculate this matrix. It should be one of "euclidean", "maximum",

...

For graphical parameters.

Value

A ClusterizLongData object, to which some Clusterization objects have been added.

Optimisation

Behind kml, there are two different procedures:

Fast: when the parameter distance is set to a classical distance (one of "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski") and print.traj is set to FALSE (the default), kml calls a compiled (optimized) C procedure.
Slow: when the user defines their own distance, or wants to watch the clusters being built by setting print.traj = TRUE, kml uses a non-compiled R procedure.

The C procedure is 25 times faster than the R one. We therefore advise using the R procedure 1/ to try a new method (such as a new distance) or 2/ to "see" the very first cluster constructions, in order to check that everything goes right, and then switching to the C procedure (as we do in the Examples section). If you need a different distance for a specific use, feel free to contact the author.

Author(s)

Christophe Genolini PSIGIAM: Paris Sud Innovation Group in Adolescent Mental Health INSERM U669 / Maison de Solenn / Paris Contact author:

English translation

Raphaël Ricaud Laboratoire "Sport & Culture" / "Sports & Culture" Laboratory University of Paris 10 / Nanterre

Details

kml works on objects of class ClusterizLongData. For each number included in nbClusters, kml computes a Clusterization and stores it in the field clusters of the ClusterizLongData object, according to its number of clusters. The algorithm starts over as many times as specified by nbRedrawing. By default, it is executed 20 times each for 2, 3, 4, 5 and 6 clusters, that is, 100 times in total.

When a Clusterization has been found, it is added to the slot clusters. clusters is a list of 52 sublists named c1, c2, c3, up to c52. The sublist cX stores all the Clusterization objects with X clusters. Inside a sublist, the Clusterization objects are sorted from the highest quality criterion to the lowest (the best are stored first).

Note that Clusterization objects are saved throughout the algorithm. If the user interrupts the execution of kml, the result is not lost. If the user runs kml on an object and then runs kml again on the same object, the Clusterization objects computed the second time are added to those already present in the object (unless you "clear" some list, see Object["clusters","clear"]<-value in ClusterizLongData).

The possible starting conditions are "randomAll", "randomK" and "maxDist", as defined in partitionInitialise. In addition, the method "allMethods" is a shortcut that runs one "maxDist", one "randomAll", and then "randomK" for all the remaining reruns.
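The rerun-and-store logic described in these details can be sketched in base R. This is not kml's code. `stats::kmeans` stands in for a single run (it starts from k randomly chosen rows), a plain list stands in for the clusters slot, and the Calinski-Harabasz index is only an assumed quality criterion. The toy data and all variable names are invented for illustration.

```r
set.seed(42)
# Toy trajectories: 30 individuals, 5 time points, three well-separated groups
traj <- rbind(matrix(rnorm(50, mean = 0), 10, 5),
              matrix(rnorm(50, mean = 5), 10, 5),
              matrix(rnorm(50, mean = 10), 10, 5))

nbClusters <- 2:6
nbRedrawing <- 20

# Assumed quality criterion (Calinski-Harabasz); higher is better
qualityCriterion <- function(fit, n) {
  k <- length(fit$size)
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
}

clusters <- list()
for (k in nbClusters) {
  # Rerun the stand-in k-means nbRedrawing times with different random starts
  runs <- lapply(seq_len(nbRedrawing), function(i) {
    fit <- kmeans(traj, centers = k, iter.max = 200)
    list(partition = fit$cluster, criterion = qualityCriterion(fit, nrow(traj)))
  })
  # Store the runs for k clusters under "ck", best criterion first
  crit <- vapply(runs, function(r) r$criterion, numeric(1))
  clusters[[paste0("c", k)]] <- runs[order(crit, decreasing = TRUE)]
}

names(clusters)
sum(lengths(clusters))  # 5 cluster counts x 20 reruns
```

After the loop, `clusters$c3[[1]]` holds the best of the 20 partitions with 3 clusters under the assumed criterion, mirroring the "best are stored first" ordering. Only the sublists c2 to c6 exist in this sketch, whereas the real slot always holds c1 to c52.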

References

Article submitted. Web site: http://christophe.genolini.free.fr/kml

Examples


### Generation of some data
cld1 <- as.cld(generateArtificialLongData())

### We suspect between 2 and 6 clusters, and we want 3 redrawings.
#     We want to "see" what happens (so print.cal and print.traj are TRUE)
kml(cld1,2:6,3,print.cal=TRUE,print.traj=TRUE)

### 4 seems to be the best. But to be sure, we try more redrawings with 4 or 6 clusters only.
#     We don't want to watch again; we want the result as fast as possible.
kml(cld1,c(4,6),10)
