dtwclust: Time series clustering under DTW

Description

This function uses the DTW distance and related lower bounds to cluster time series. For now, all series must be univariate.

Usage

dtwclust(data = NULL, type = "partitional", k = 2, method = "average",
  distance = "dtw", centroid = "pam", preproc = NULL,
  window.size = NULL, norm = "L1", dc = NULL, dba.iter = 50,
  pam.precompute = TRUE, control = NULL, save.data = TRUE, seed = NULL,
  trace = FALSE, ...)

Arguments

data

A list where each element is a time series, or a numerical matrix where each row is a time series. All series must have equal lengths in case of type = "tadpole".

type

What type of clustering method to use, partitional, hierarchical or tadpole.

Numer of desired clusters in partitional methods.

method

Which linkage method to use in hierarchical methods. See hclust.

distance

One of the supported distance definitions (see Distance section). Ignored for type = "tadpole".

centroid

Either a supported string or an appropriate function to calculate centroids when using partitional methods (see Centroid section).

preproc

Function to preprocess data. Defaults to zscore only if centroid = "shape", but will be replaced by a custom function if provided. See Preprocessing section.

window.size

Window constraint for DTW and LB calculations. See Sakoe-Chiba section.

norm

Pointwise distance for DTW, DBA and the LB. Either L1 for Manhattan distance or L2 for Euclidean. Ignored for distance = "DTW" (which always uses L1) and distance = "DTW2" (which always us

Cutoff distance for TADPole algorithm.

dba.iter

Maximum number of iterations for DBA centroids.

pam.precompute

Precompute the whole distance matrix once and reuse it at each iteration if using PAM centroids. Otherwise calculate distances at every iteration.

control

Parameters for partitional clustering algorithms. See flexclustControl.

save.data

Return a copy of the data in the returned object? Ignored for hierarchical clustering.

seed

Random seed for reproducibility of partitional algorithms.

trace

Boolean flag. If true, more output regarding the progress is printed to screen.

...

Additional arguments to pass to dist or a custom function.

Value

An object with formal class dtwclust-class if type = "partitional" | "tadpole". Otherwise an object with class hclust as returned by hclust.

Distance

If a custom distance function is provided, it will receive the data as the first argument. For partitional algorithms, the second argument will be the cluster centers (i.e. other time series). If data is a matrix, the cluster centers will also be given in the form of a matrix where each row is a center series; if data is a list of series, so will the centers. If hierarchical algorithms are used, the function will also receive the elements of .... For partitional algorithms, the function could make use of the window.size and norm parameters, which should be detected thanks to R's lexical scoping, however this cannot be guaranteed. The function should return a distance matrix, ideally of class crossdist. In case of partitional algorithms, the time series in the data should be along the rows, and the cluster centers along the columns of the distance matrix. The other option is to provide a string. The string can represent a compatible registered distance of dist. Extra parameters can be provided in .... See the examples. Additionally, with either type of algorithm, it can be one of the following custom implementations:

"dtw": DTW with L1 norm and optionally a Sakoe-Chiba/Slanted-band constraint.
"dtw2": DTW with L2 norm and optionally a Sakoe-Chiba/Slanted-band constraint.
"dtw_lb": DTW with L1 or L2 norm and optionally a Sakoe-Chiba constraint. Some computations are avoided by first estimating the distance matrix with Lemire's lower bound and then iteratively refining with DTW. Seedtw_lb.
"lbk": Keogh's lower bound with either L1 or L2 norm for the Sakoe-Chiba constraint.
"lbi": Lemire's lower bound with either L1 or L2 norm for the Sakoe-Chiba constraint.
"sbd": Shape-based distance. Each series is z-normalized in this case. As a result, the cluster centers (for partitional methods) are also z-normalized. SeeSBDfor more details.

Note that only dtw, dtw2 and sbd support series of different lengths.

Centroid

In the case of partitional algorithms, a suitable function should calculate the cluster centers. In this case, the centers are themselves time series. If a custom function is provided, it will receive different inputs depending on the format of data:

For matrix input, it will receive a matrix as single input. Each row will be a series that belongs to a given cluster. The function should return a numeric vector with the centroid time series.
For a list input, the function will receive three inputs in the following order: thewholedata list; a numeric vector with length equal to the number of series indata, indicating which cluster a series belongs to; the current number of total clusters.

The other option is to provide a character string. The following options are available:

"mean": The average along each dimension. In other words, the average of all$x^j_i$among the$j$series that belong to the same cluster for all time points$t_i$.
"median": The median along each dimension. Similar to mean.
"shape": Shape averaging. Seeshape_extractionfor more details.
"dba": DTW Barycenter Averaging. SeeDBAfor more details.
"pam": Partition around medoids. This basically means that the cluster centers are always one of the time series in the data. In this case, the distance matrix can be pre-computed once using all time series in the data and then re-used at each iteration. It usually saves overhead overall.

Note that only dba and pam support series of different lengths

Sakoe-Chiba Constraint

A global constraint to speed up the DTW calculation is the Sakoe-Chiba band (Sakoe and Chiba, 1978). To use it, a window width must be defined via window.size. The windowing constraint uses a centered window. The calculations expect a value in window.size that represents the distance between the point considered and one of the edges of the window. Therefore, if, for example, window.size = 10, the warping for an observation $x_i$ considers the points between $x_{i-10}$ and $x_{i+10}$, resulting in 10*2 + 1 = 21 observations falling within the window.

Preprocessing

It is strongly advised to use z-normalization in case of centroid = "shape", because the resulting series have this normalization (see shape_extraction). The user can, however, specify a custom function that performs any transformation on the data, but the user must make sure that the format stays consistent, i.e. a matrix where each row is a series or a list of time series. For example, the z-normalization could be implemented as t(apply(data, 1, zscore)) or lapply(data, zscore) respectively. The function will receive the data as first argument and, in case hierarchical methods are used, the contents of ... as the second argument.

Notes

Notice that the lower bounds are defined only for time series of equal lengths. DTW and DTW2 don't require this, but they are much slower to compute. The lower bounds are not symmetrical, and DTW is only symmetrical if series are of equal lengths. Specifying distance = "sbd" and centroid = "shape" is equivalent to the k-Shape algorithm (Papparizos and Gravano, 2015). See SBD and shape_extraction for more info.

Details

Partitional algorithms are implemented via kcca. Hierarchical algorithms use the hclust function. The tadpole algorithm uses the TADPole function. The data may be a matrix or a list, but the matrix will be coerced to a list. A matrix input requires that all time series have equal lengths. If the lengths vary slightly between time series, reinterpolating them to a common length is most likely an acceptable approach (Ratanamahatana and Keogh, 2004). If this is not the case, then clustering them directly is probably ill-advised. See the examples.

References

Sakoe H and Chiba S (1978). ``Dynamic programming algorithm optimization for spoken word recognition.'' Acoustics, Speech and Signal Processing, IEEE Transactions on, 26(1), pp. 43-49. ISSN 0096-3518, http://doi.org/10.1109/TASSP.1978.1163055. Ratanamahatana A and Keogh E (2004). ``Everything you know about dynamic time warping is wrong.'' In 3rd Workshop on Mining Temporal and Sequential Data, in conjunction with 10th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD-2004), Seattle, WA. Paparrizos J and Gravano L (2015). ``k-Shape: Efficient and Accurate Clustering of Time Series.'' In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, series SIGMOD '15, pp. 1855-1870. ISBN 978-1-4503-2758-9, http://doi.org/10.1145/2723372.2737793.

Examples

Run this code

#### Load data
data(uciCT)

# Reinterpolate to same length and coerce as matrix
data <- t(sapply(CharTraj, reinterpolate, newLength = 180))

# Subset for speed
data <- data[1:20, ]
labels <- CharTrajLabels[1:20]

#### Simple partitional clustering with L2 distance and PAM
kc.l2 <- dtwclust(data, k = 4, distance = "L2", centroid = "pam",
                  seed = 3247, trace = TRUE)
cat("Rand index for L2+PAM:", randIndex(kc.l2, labels), "\n\n")

#### TADPole clustering
kc.tadp <- dtwclust(data, type = "tadpole", k = 4,
                    window.size = 20, dc = 1.5,
                    trace = TRUE)
cat("Rand index for TADPole:", randIndex(kc.tadp, labels), "\n\n")
plot(kc.tadp)

# Modify plot
plot(kc.tadp, cl = 1:2, labs.arg = list(title = "TADPole, clusters 1 and 2",
                                        x = "time", y = "series"))

#### Registering a custom distance with the 'proxy' package and using it
# Normalized DTW distance
ndtw <- function(x, y, ...) {
  dtw::dtw(x, y, step.pattern = symmetric2,
           distance.only = TRUE, ...)$normalizedDistance
}

# Registering the function with 'proxy'
proxy::pr_DB$set_entry(FUN = ndtw, names=c("nDTW"),
                       loop = TRUE, type = "metric", distance = TRUE,
                       description = "Normalized DTW with L1 norm")

# Subset of (original) data for speed
# Change pam.precompute to FALSE to see time difference
kc.ndtw <- dtwclust(CharTraj[31:40], distance = "nDTW",
                    trace = TRUE, pam.precompute = TRUE,
                    seed = 8319)
cat("Rand index for nDTW (subset):",
    randIndex(kc.ndtw, CharTrajLabels[31:40]), "\n\n")
plot(kc.ndtw)

#### Hierarchical clustering based on shabe-based distance (different lengths)
hc.sbd <- dtwclust(CharTraj, type = "hierarchical",
                   distance = "sbd", trace = TRUE)
cl.sbd <- cutree(hc.sbd, 20)
cat("Rand index for HC+SBD:", randIndex(cl.sbd, CharTrajLabels), "\n\n")

#### Saving and modifying the ggplot object with custom time
t <- seq(Sys.Date(), len = 205, by = "day")
gkc <- plot(kc.l2, time = t, plot = FALSE)

require(scales)
gkc + scale_x_date(labels = date_format("%b-%Y"),
                   breaks = date_breaks("2 months"))

#### Use full DTW and PAM (takes around two minutes)
kc.dtw <- dtwclust(CharTraj, k = 20, seed = 3251, trace = TRUE)

#### Use full DTW with DBA centroids (takes around five minutes)
kc.dba <- dtwclust(CharTraj, k = 20, centroid = "dba", seed = 3251, trace = TRUE)

#### Use constrained DTW with original series of different lengths (around one minute)
kc.cdtw <- dtwclust(CharTraj, k = 20, window.size = 20,
                    seed = 3251, trace = TRUE)

# Plot one of the clusters
plot(kc.cdtw, cl=18)

Run the code above in your browser using DataLab