dtwclust (version 0.1.0)

dtwclust: Time series clustering under DTW

Description

This function uses the DTW distance and related lower bounds to cluster time series. For now, all series must be univariate and, in the case of partitional methods, have equal lengths.

Usage

dtwclust(data = NULL, type = "partitional", k = 2, method = "average",
  distance = "dtw", centroid = "pam", window.size = NULL, norm = "L1",
  dc = NULL, control = NULL, save.data = FALSE, seed = NULL,
  trace = FALSE, ...)

Arguments

data
A list where each element is a time series, or a numerical matrix where each row is a time series. Only the latter is supported in case of type = "partitional".
type
What type of clustering method to use: "partitional", "hierarchical" or "tadpole".
k
Number of desired clusters in partitional methods.
method
Which linkage method to use in hierarchical methods. See hclust.
distance
One of the supported distance definitions (see Distance section). Ignored for type = "tadpole".
centroid
Either a supported string or an appropriate function to calculate centroids when using partitional methods (see Centroid section).
window.size
Window constraint for DTW and LB calculations. See Sakoe-Chiba section.
norm
Pointwise distance for DTW and LB calculations. Either "L1" for Manhattan distance or "L2" for Euclidean. Ignored for distance = "dtw" (which always uses L1) and distance = "dtw2" (which always uses L2).
dc
Cutoff distance for TADPole algorithm.
control
Parameters for partitional clustering algorithms. See flexclustControl.
save.data
Return a copy of the data in the returned object? Ignored for hierarchical clustering.
seed
Random seed for reproducibility of partitional algorithms.
trace
Boolean flag. If TRUE, more output regarding the progress is printed to screen.
...
Additional arguments to pass to dist or a custom distance function.

Value

  • An object with formal class dtwclust-class if type = "partitional" or type = "tadpole". Otherwise an object with class hclust as returned by hclust.

Distance

If a custom distance function is provided, it will receive the data as its first argument. For partitional algorithms, the second argument will be the cluster centers (i.e. other time series) in the form of a matrix where each row is a center series. If hierarchical algorithms are used, the function will also receive the elements of .... For partitional algorithms, the function could make use of the window.size and norm parameters, which should be detected thanks to R's lexical scoping; however, this cannot be guaranteed. The function should return a distance matrix, ideally of class crossdist. In the case of partitional algorithms, the time series in the data should be along the rows, and the cluster centers along the columns, of the distance matrix.

The other option is to provide a string. For partitional algorithms, it can be a supported distance of kccaFamily. For hierarchical algorithms, it can be any distance function registered in dist. In the latter case, all extra parameters should be provided in .... Additionally, with either type of algorithm, it can be one of the following custom implementations:
  • "dtw": DTW with L1 norm and optionally a Sakoe-Chiba constraint.
  • "dtw2": DTW with L2 norm and optionally a Sakoe-Chiba constraint.
  • "dtw_lb": DTW with L1 or L2 norm and optionally a Sakoe-Chiba constraint. Some computations are avoided by first estimating the distance matrix with Lemire's lower bound and then iteratively refining with DTW. See dtw_lb.
  • "lbk": Keogh's lower bound with either L1 or L2 norm for the Sakoe-Chiba constraint.
  • "lbi": Lemire's lower bound with either L1 or L2 norm for the Sakoe-Chiba constraint.
  • "sbd": Shape-based distance. Each series is z-normalized in this case. As a result, the cluster centers (for partitional methods) are also z-normalized. See SBD for more details.
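As a purely illustrative sketch of the custom-function contract described above (the function name and the use of base R only are assumptions, not part of the package), the following computes an L1 cross-distance matrix between the data series and the cluster centers, with the series along the rows and the centers along the columns, as required for partitional algorithms:

```r
# Hypothetical custom distance sketch (base R only).
# `x` is the data matrix (one series per row); `centers` holds the
# cluster centers, also one series per row, as described above.
l1_cross_dist <- function(x, centers, ...) {
  d <- matrix(0, nrow = nrow(x), ncol = nrow(centers))
  for (i in seq_len(nrow(x))) {
    for (j in seq_len(nrow(centers))) {
      # Manhattan (L1) pointwise distance between two equal-length series
      d[i, j] <- sum(abs(x[i, ] - centers[j, ]))
    }
  }
  d  # series along the rows, centers along the columns
}

# Small usage check: two 3-point series against one center
x <- rbind(c(0, 0, 0), c(1, 1, 1))
centers <- rbind(c(1, 0, 0))
l1_cross_dist(x, centers)  # 2x1 matrix with distances 1 and 2
```

A real implementation would ideally return an object of class crossdist, but the plain matrix above already has the expected orientation.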

Centroid

In the case of partitional algorithms, a suitable function should calculate the cluster centers. In this case, the centers are themselves time series. If a custom function is provided, it will receive a matrix as only argument. Each row will be a time series that belongs to a given cluster. The function should return a numeric vector with the center time series. The other option is to provide a character string. The following options are available:
  • "mean": The average along each dimension. In other words, the average of all $x^j_i$ among the $j$ series that belong to the same cluster, for all time points $t_i$.
  • "median": The median along each dimension. Similar to mean.
  • "shape": Shape averaging. See shape_extraction for more details.
  • "dba": DTW Barycenter Averaging. See DBA for more details.
  • "pam": Partition around medoids. This basically means that the cluster centers are always one of the time series in the data. In this case, the distance matrix is pre-computed once using all time series in the data and then re-used at each iteration.
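The custom-function contract described above (a matrix in, one member series per row; a numeric vector out) can be sketched as follows. This hypothetical function is equivalent in spirit to centroid = "mean"; the name is illustrative only:

```r
# Hypothetical custom centroid: per-time-point average of the
# cluster members, mirroring what centroid = "mean" does.
mean_centroid <- function(members) {
  # `members` is a matrix with one member series per row;
  # return a numeric vector with the center series.
  colMeans(members)
}

members <- rbind(c(0, 2, 4), c(2, 4, 6))
mean_centroid(members)  # 1 3 5
```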

Sakoe-Chiba Constraint

A global constraint to speed up the DTW calculation is the Sakoe-Chiba band (Sakoe and Chiba, 1978). To use it, a window width must be defined via window.size. The windowing constraint uses a centered window. The calculations expect a value in window.size that represents the distance between the point considered and one of the edges of the window. Therefore, if, for example, window.size = 10, the warping for an observation $x_i$ considers the points between $x_{i-10}$ and $x_{i+10}$, resulting in 10*2 + 1 = 21 observations falling within the window.
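The window arithmetic above can be checked with a short snippet; the helper below is illustrative only (it is not part of the package) and simply enumerates the indices that fall within a centered window, clipping at the series edges:

```r
# Illustrative check of the centered-window arithmetic: for
# window.size = 10, observation i considers points i-10 .. i+10.
window_indices <- function(i, window.size, n) {
  # clip to the valid index range 1..n at the series edges
  seq(max(1, i - window.size), min(n, i + window.size))
}

idx <- window_indices(i = 50, window.size = 10, n = 100)
length(idx)  # 10 * 2 + 1 = 21 observations fall within the window
```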

Details

Partitional algorithms are implemented via kcca. Hierarchical algorithms use the hclust function. The TADPole algorithm uses the TADPole function. For partitional algorithms, data should be in the form of a matrix. In the other cases, it may be a matrix or a list, but a matrix will be coerced to a list. A matrix input requires that all time series have equal lengths. If the lengths vary slightly between series, reinterpolating them to a common length is most likely an acceptable approach (Ratanamahatana and Keogh, 2004). If this is not the case, then clustering them directly is probably ill-advised. See the examples.
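The reinterpolation mentioned above can be sketched in base R with linear interpolation via approx(); the examples below use the package's reinterpolate() helper instead, so this is only an illustration of the idea, and the function name is an assumption:

```r
# Sketch of reinterpolating a series to a common length using base R's
# approx(); the package's reinterpolate() is the intended tool.
to_common_length <- function(series, new.length) {
  approx(x = seq_along(series), y = series, n = new.length)$y
}

s <- c(0, 1, 2, 3)                  # length 4
longer <- to_common_length(s, 7)    # linearly resampled to length 7
length(longer)  # 7
c(longer[1], longer[7])  # endpoints are preserved: 0 and 3
```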

References

Sakoe H and Chiba S (1978). "Dynamic programming algorithm optimization for spoken word recognition." IEEE Transactions on Acoustics, Speech and Signal Processing, 26(1), pp. 43-49. ISSN 0096-3518, http://doi.org/10.1109/TASSP.1978.1163055.

Ratanamahatana A and Keogh E (2004). "Everything you know about dynamic time warping is wrong." In 3rd Workshop on Mining Temporal and Sequential Data, in conjunction with 10th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD-2004), Seattle, WA.

Paparrizos J and Gravano L (2015). "k-Shape: Efficient and Accurate Clustering of Time Series." In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pp. 1855-1870. ISBN 978-1-4503-2758-9, http://doi.org/10.1145/2723372.2737793.

Examples

# Load data
data(uciCT)

# Reinterpolate to same length and coerce as matrix
data <- t(sapply(CharTraj, reinterpolate, newLength = 205))

# Simple partitional clustering with L2 distance and PAM
kc.l2 <- dtwclust(data, k = 20, distance = "L2", centroid = "pam", seed = 3247, trace = TRUE)
cat("Rand index for L2+PAM:", randIndex(kc.l2, CharTrajLabels), "\n\n")

# TADPole clustering (takes around 5 seconds)
kc.tadp <- dtwclust(data, type = "tadpole", k = 20, window.size = 20, dc = 1.5, save.data = TRUE)
cat("Rand index for TADPole:", randIndex(kc.tadp, CharTrajLabels), "\n\n")
plot(kc.tadp)

# Hierarchical clustering based on shape-based distance
hc.sbd <- dtwclust(data, type = "hierarchical", distance = "sbd")
cl.sbd <- cutree(hc.sbd, 20)
cat("Rand index for HC+SBD:", randIndex(cl.sbd, CharTrajLabels), "\n\n")

# Use full DTW and PAM (takes around two minutes)
kc.dtw <- dtwclust(data, k = 20, seed = 3251, trace = TRUE)

# Use full DTW with DBA centroids (takes around five minutes)
kc.dba <- dtwclust(data, k = 20, centroid = "dba", seed = 3251, trace = TRUE)
