hmm.clust: DBHC Algorithm

Description

Implementation of the DBHC algorithm, an HMM clustering algorithm that finds a mixture of discrete-output HMMs. The algorithm uses heuristics based on the Bayesian information criterion (BIC) to search for the optimal number of hidden states in each HMM and the optimal number of clusters.
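
To make the BIC-based comparison concrete, here is a minimal base R sketch of the standard BIC formula, BIC = -2 * logLik + k * log(n), applied to two candidate models. The helper name `bic_score` and all numeric values are illustrative assumptions and are not part of the package; the heuristics used by hmm.clust may differ in detail.

```r
# Hypothetical helper (not part of DBHC): the standard BIC formula.
# loglik: maximized log-likelihood, k: number of free parameters,
# n: sample size.
bic_score <- function(loglik, k, n) {
  -2 * loglik + k * log(n)
}

# Made-up values for two candidate HMMs fitted to the same data
n_obs <- 2500
bic_small <- bic_score(loglik = -1700, k = 5, n = n_obs)
bic_large <- bic_score(loglik = -1695, k = 11, n = n_obs)

# Lower BIC is preferred: the larger model must improve the
# log-likelihood enough to offset the penalty for its extra parameters.
best <- if (bic_small <= bic_large) "small" else "large"
print(c(small = bic_small, large = bic_large))
print(best)
```

In this made-up case the larger model gains only 5 log-likelihood units for 6 extra parameters, so the penalty term outweighs the gain and the smaller model is selected.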

Usage

hmm.clust(
  sequences,
  id = NULL,
  smoothing = 1e-04,
  eps = 0.001,
  init.size = 2,
  alphabet = NULL,
  K.max = NULL,
  log_space = FALSE,
  print = FALSE,
  seed.size = 3
)

Value

A list with components:

sequences: An stslist object of sequences with discrete observations.
id: A vector with ids that identify the sequences in sequences.
cluster: A vector with found cluster memberships for the sequences.
partition: A list object with the partition, a mixture of HMMs. Each element in the list is an hmm object.
memberships: A matrix with cluster memberships for each sequence.
n.clusters: Numerical, the found number of clusters.
sizes: A vector with the number of HMM states for each cluster model.
bic: A vector with the BICs for each cluster model.

Arguments

sequences: Sequences with discrete observations, given either as an stslist object (see seqdef) or as a data.frame.
id: A vector with ids that identify the sequences in sequences.
smoothing: Smoothing parameter for absolute discounting in smooth.probabilities.
eps: A threshold epsilon for counting parameters in count.parameters.
init.size: The number of HMM states in an initial HMM.
alphabet: The alphabet of output labels. If not provided, the alphabet is taken from the stslist object (see seqdef).
K.max: Maximum number of clusters. If not provided, the algorithm searches for the optimal number itself.
log_space: Logical, passed to fit_model; whether to run the optimization in log space.
print: Logical, whether to print intermediate steps or not.
seed.size: Seed size, the number of sequences to be selected for a seed.
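
To give an intuition for the smoothing and eps arguments, the sketch below shows one common form of absolute discounting on a single probability vector, followed by a threshold-based parameter count. Both helpers (`discount_row`, `count_above`) are hypothetical illustrations written for this page. They are not the package's smooth.probabilities or count.parameters, whose exact rules may differ.

```r
# Hypothetical illustration (not the DBHC implementation) of absolute
# discounting: zero entries receive `smoothing`, and the added mass is
# subtracted evenly from the non-zero entries so the vector still sums
# to 1. Assumes every non-zero entry is larger than the amount subtracted.
discount_row <- function(p, smoothing = 1e-04) {
  zero <- p == 0
  if (!any(zero) || all(zero)) return(p)
  p[zero] <- smoothing
  p[!zero] <- p[!zero] - smoothing * sum(zero) / sum(!zero)
  p
}

# Hypothetical illustration of threshold-based counting: only entries
# larger than eps are counted as parameters (an assumed rule).
count_above <- function(p, eps = 0.001) {
  sum(p > eps)
}

p <- c(0.7, 0.3, 0, 0)
smoothed <- discount_row(p, smoothing = 1e-04)
n_params <- count_above(smoothed, eps = 0.001)

print(smoothed)
print(n_params)
```

Under these assumptions, smoothing keeps every probability strictly positive, while eps keeps the tiny smoothed entries from being counted as parameters.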

Examples

## Simulated data
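library(DBHC)      # provides hmm.clust and the heatmap functions
library(seqHMM)    # provides simulate_hmm
library(TraMineR)  # provides seqdef and the biofam data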
output.labels <-  c("H", "T")

# HMM 1
states.1 <- c("A", "B", "C")
transitions.1 <- matrix(c(0.8,0.1,0.1,0.1,0.8,0.1,0.1,0.1,0.8), nrow = 3)
rownames(transitions.1) <- states.1
colnames(transitions.1) <- states.1
emissions.1 <- matrix(c(0.5,0.75,0.25,0.5,0.25,0.75), nrow = 3)
rownames(emissions.1) <- states.1
colnames(emissions.1) <- output.labels
initials.1 <- c(1/3,1/3,1/3)

# HMM 2
states.2 <- c("A", "B")
transitions.2 <- matrix(c(0.75,0.25,0.25,0.75), nrow = 2)
rownames(transitions.2) <- states.2
colnames(transitions.2) <- states.2
emissions.2 <- matrix(c(0.8,0.6,0.2,0.4), nrow = 2)
rownames(emissions.2) <- states.2
colnames(emissions.2) <- output.labels
initials.2 <- c(0.5,0.5)

# Simulate
hmm.sim.1 <- simulate_hmm(n_sequences = 100,
                          initial_probs = initials.1,
                          transition_probs = transitions.1,
                          emission_probs = emissions.1,
                          sequence_length = 25)
hmm.sim.2 <- simulate_hmm(n_sequences = 100,
                          initial_probs = initials.2,
                          transition_probs = transitions.2,
                          emission_probs = emissions.2,
                          sequence_length = 25)
sequences <- rbind(hmm.sim.1$observations, hmm.sim.2$observations)
n <- nrow(sequences)

# Clustering algorithm
id <- paste0("K-", 1:n)
rownames(sequences) <- id
sequences <- sequences[sample(1:n, n),]
# \donttest{
res <- hmm.clust(sequences, id = rownames(sequences))
# }


#############################################################################

## Swiss Household Data
data("biofam", package = "TraMineR")

# Clustering algorithm
new.alphabet <- c("P", "L", "M", "LM", "C", "LC", "LMC", "D")
sequences <- seqdef(biofam[,10:25], alphabet = 0:7, states = new.alphabet)
if (FALSE) {
res <- hmm.clust(sequences)

# Heatmaps
cluster <- 1  # display heatmaps for cluster 1
transition.heatmap(res$partition[[cluster]]$transition_probs,
                   res$partition[[cluster]]$initial_probs)
emission.heatmap(res$partition[[cluster]]$emission_probs)
}


## A smaller example, which takes less time to run
# \donttest{
subset <- sequences[sample(1:nrow(sequences), 20, replace = FALSE),]

# Clustering algorithm, limiting number of clusters to 2
res <- hmm.clust(subset, K.max = 2)

# Number of clusters
print(res$n.clusters)

# Table of cluster memberships
table(res$memberships[,"cluster"])

# BIC for each cluster model
print(res$bic)

# Heatmaps
cluster <- 1  # display heatmaps for cluster 1
transition.heatmap(res$partition[[cluster]]$transition_probs,
                   res$partition[[cluster]]$initial_probs)
emission.heatmap(res$partition[[cluster]]$emission_probs)
# }

