cluster_sequences: Cluster Sequences via Dissimilarity Matrix based on String Distances

Description

Clusters sequence data using a specified dissimilarity measure and clustering method. Each sequence is first converted to a string, and the strings are then compared pairwise using the stringdist package to build a dissimilarity matrix.
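
The pipeline described above can be illustrated with a minimal sketch. This is not the package's implementation: it assumes only base R and the cluster package, and it replaces the stringdist call with a hand-written Hamming distance (the helper `hamming` and the object names are hypothetical).

```r
# Sketch only: mimics the described pipeline with base R and 'cluster'.
library(cluster)  # provides pam()

seqs <- data.frame(
  T1 = c("A", "B", "A", "C", "A", "B"),
  T2 = c("B", "A", "B", "A", "C", "A"),
  T3 = c("C", "C", "A", "B", "B", "C")
)

# 1. Collapse each row (sequence) into a single string, e.g. "ABC".
strings <- apply(seqs, 1, paste, collapse = "")

# 2. Hypothetical helper: Hamming distance between two equal-length strings,
#    i.e. the number of positions at which they differ.
hamming <- function(a, b) {
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# 3. Pairwise dissimilarity matrix over all sequences.
d <- outer(strings, strings, Vectorize(hamming))

# 4. Partitioning around medoids on the precomputed dissimilarities.
fit <- pam(as.dist(d), k = 2, diss = TRUE)
fit$clustering
```

Working from a precomputed dissimilarity matrix is what allows the same distances to be passed to either a partitioning method or a hierarchical one.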

Usage

cluster_sequences(
  data,
  k,
  dissimilarity = "hamming",
  method = "pam",
  na_syms = c("*", "%"),
  weighted = FALSE,
  lambda = 1,
  ...
)
# S3 method for tna_clustering
print(x, ...)

Value

A tna_clustering object which is a list containing:

data: The original data.
k: The number of clusters.
assignments: An integer vector of cluster assignments.
silhouette: Silhouette score measuring clustering quality.
sizes: An integer vector of cluster sizes.
method: The clustering method used.
distance: The distance matrix.
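
As a rough illustration of how the assignments, sizes, and silhouette components relate to one another, the sketch below computes an average silhouette width and cluster sizes from a toy distance matrix using cluster::silhouette(). Whether the package derives its silhouette score in exactly this way is an assumption; the distances and the object names here are made up for the example.

```r
# Sketch only: toy distances, not output of cluster_sequences().
library(cluster)  # provides silhouette()

# Symmetric toy distance matrix for four sequences: 1-2 and 3-4 are close.
d <- as.dist(matrix(
  c(0, 1, 4, 5,
    1, 0, 4, 5,
    4, 4, 0, 1,
    5, 5, 1, 0),
  nrow = 4
))

# Hypothetical cluster assignments, one integer label per sequence.
assignments <- c(1L, 1L, 2L, 2L)

# Per-sequence silhouette widths, then their mean as a single quality score.
sil <- silhouette(assignments, d)
avg_width <- mean(sil[, "sil_width"])
avg_width

# Cluster sizes are the counts of each label.
sizes <- table(assignments)
sizes
```

Values of the average width close to 1 indicate well-separated clusters, while values near 0 or below suggest overlapping or misassigned sequences.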

Arguments

data: A data.frame or a matrix where the rows are sequences and the columns are time points.
k: An integer giving the number of clusters.
dissimilarity: A character string specifying the dissimilarity measure. The available options are: "osa", "lv", "dl", "hamming", "qgram", "cosine", "jaccard", and "jw". See stringdist::stringdist-metrics for more information on these measures.
method: A character string specifying the clustering method. The available methods are "pam", "ward.D", "ward.D2", "complete", "average", "single", "mcquitty", "median", and "centroid". See cluster::pam() and stats::hclust() for more information on these methods.
na_syms: A character vector of symbols or factor levels to convert to explicit missing values.
weighted: A logical value indicating whether the dissimilarity measure should be weighted (the default is FALSE for no weighting). If TRUE, earlier observations of the sequences receive a greater weight in the distance calculation with an exponential decay. Currently only supported for the Hamming distance.
lambda: A numeric value defining the strength of the decay when weighted = TRUE. The default is 1.0.
...: Additional arguments passed to stringdist::stringdist().
x: A tna_clustering object.
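
To make the effect of weighted and lambda more concrete, here is a minimal sketch of an exponentially decaying Hamming distance. The exact weighting formula used by the package is not stated on this page, so the weights exp(-lambda * (t - 1)) are an assumption, and `weighted_hamming` is a hypothetical helper rather than a package function.

```r
# Hypothetical helper, not part of the package.
weighted_hamming <- function(a, b, lambda = 1) {
  stopifnot(length(a) == length(b))
  # Assumed weights: w_t = exp(-lambda * (t - 1)), so the first time point
  # has weight 1 and later time points count progressively less.
  w <- exp(-lambda * (seq_along(a) - 1))
  sum(w * (a != b))
}

x <- c("A", "B", "C")
y <- c("B", "B", "A")  # differs from x at time points 1 and 3

# With lambda = 0 all weights equal 1: the plain Hamming count.
unweighted <- weighted_hamming(x, y, lambda = 0)

# With lambda = 1 the late mismatch contributes only exp(-2).
decayed <- weighted_hamming(x, y, lambda = 1)

c(unweighted = unweighted, decayed = decayed)
```

Under this assumed scheme, larger values of lambda make early disagreements dominate the distance, so sequences that start alike tend to end up in the same cluster.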

Examples

data <- data.frame(
  T1 = c("A", "B", "A", "C", "A", "B"),
  T2 = c("B", "A", "B", "A", "C", "A"),
  T3 = c("C", "C", "A", "B", "B", "C")
)

# PAM clustering with the Hamming distance (the defaults)
result <- cluster_sequences(data, k = 2)
print(result)
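
The remaining arguments can be combined in the same way. The calls below are a sketch based only on the signature and argument descriptions on this page; they assume the function is exported by the tna package, they have not been checked against a specific package version, and the object names are illustrative.

```r
# Sketch only: runs the calls just when the 'tna' package is installed.
seqs <- data.frame(
  T1 = c("A", "B", "A", "C", "A", "B"),
  T2 = c("B", "A", "B", "A", "C", "A"),
  T3 = c("C", "C", "A", "B", "B", "C")
)

if (requireNamespace("tna", quietly = TRUE)) {
  # Hierarchical clustering (Ward's method) on Levenshtein distances.
  res_lv <- tna::cluster_sequences(
    seqs, k = 2, dissimilarity = "lv", method = "ward.D2"
  )
  print(res_lv)

  # Weighted Hamming distance with a milder decay than the default.
  res_w <- tna::cluster_sequences(seqs, k = 2, weighted = TRUE, lambda = 0.5)
  print(res_w)
}
```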
