Dirichlet Process Clustering with Dissimilarities
The DPCD package implements a Bayesian hierarchical model for clustering dissimilarity data. The two main objectives of DPCD are to 1. Find a configuration of the objects such that the distance between them approximate the input dissimilarities, accounting for measurement error, and 2. Obtain a meaningful clustering structure for the objects.
Installation
You can install the DPCD package from CRAN with:
install.packages('DPCD')Usage
The main function of the DPCD package is run_dpcd(), which fits a
specified DPCD model to dissimilarity data. Here’s a basic example:
library(DPCD)
# Simulate data and calculate distances
set.seed(123)
x <- rbind(matrix(rnorm(20), ncol = 2), matrix(rnorm(30, mean = 4), ncol = 2))
dis_mat <- dist(x)
# Fit the Equal Spherical DPCD model
fit <- run_dpcd(dis_mat, model = "ES", p = 2, niter = 50000, nburn = 10000)
#> |-------------|-------------|-------------|-------------|
#> |-------------------------------------------------------|The posterior samples can be accessed via the samples attribute of the
returned object. You can use the extract_clusters() function to obtain
a cluster assignment for each observation:
clusters <- extract_clusters(fit$samples)
clusters
#> [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2Posterior predictive checks can be performed using the
post_predictive() function to ensure that the model fits well. Note
that, by default, the input dissimilarities are scaled so that the
maximum value is 1. This can be toggled off with scale = TRUE, but it
is recommended to provide different hyperparameters (using the
hyper_params argument) if scaling is disabled.
ppc <- post_predictive(fit$samples, dis_mat)The latent object configuration can be plotted using the
plot_objects() function. Since the latent coordinates of the objects
are non-identifiable, a target matrix must be provided to which the
posterior draws of the object coordinates are aligned.
target_matrix <- cmdscale(dis_mat, k = 2)
plot_objects(fit$samples, target_matrix = target_matrix, show_clusters = TRUE,
main = "DPCD Estimated Configuration")