scclust (version 0.2.2)

get_clustering_stats: Get clustering statistics

Description

get_clustering_stats calculates statistics of a clustering.

Usage

get_clustering_stats(distances, clustering)

Value

Returns a list of class clustering_stats containing the statistics.

Arguments

distances

a distances object describing the distances between the data points in clustering.

clustering

a scclust object containing a non-empty clustering.

Details

The function reports the following measures:

num_data_points: total number of data points
num_assigned: number of points assigned to a cluster
num_clusters: number of clusters
min_cluster_size: size of the smallest cluster
max_cluster_size: size of the largest cluster
avg_cluster_size: average cluster size
sum_dists: sum of all within-cluster distances
min_dist: smallest within-cluster distance
max_dist: largest within-cluster distance
avg_min_dist: average of the clusters' smallest distances
avg_max_dist: average of the clusters' largest distances
avg_dist_weighted: average of the clusters' average distances, weighted by cluster size
avg_dist_unweighted: average of the clusters' average distances (unweighted)
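
The size-related measures in this list can be illustrated with a minimal base R sketch that does not call scclust. It takes the cluster labels from the Examples section as a plain character vector and assumes that unassigned points would be marked NA; the names labels, sizes, and size_stats are illustrative and not part of the package.

```r
# Cluster labels as a plain character vector (illustrative, not a scclust object).
labels <- c("A", "A", "B", "C", "B", "C", "C", "A", "B", "B")

# Cluster sizes; table() drops NA labels by default, so points marked NA
# (assumed here to mean "unassigned") would not be counted.
sizes <- as.vector(table(labels))

size_stats <- c(
  num_data_points  = length(labels),
  num_assigned     = sum(!is.na(labels)),
  num_clusters     = length(sizes),
  min_cluster_size = min(sizes),
  max_cluster_size = max(sizes),
  avg_cluster_size = mean(sizes)
)
print(size_stats)
```

Here avg_cluster_size is simply num_assigned divided by num_clusters, so it need not be an integer.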

Let \(d(i,j)\) denote the distance between data points \(i\) and \(j\). Let \(c\) be a cluster, represented by the set of indices of the data points assigned to it. Let $$D(c) = \{d(i,j): i,j \in c \wedge i>j\}$$ be the collection of all within-cluster distances in \(c\), with each pair of points counted once. Let \(C\) be the set of all clusters.

sum_dists is defined as: $$\sum_{c\in C} sum(D(c))$$

min_dist is defined as: $$\min_{c\in C} \min(D(c))$$

max_dist is defined as: $$\max_{c\in C} \max(D(c))$$

avg_min_dist is defined as: $$\sum_{c\in C} \frac{\min(D(c))}{|C|}$$

avg_max_dist is defined as: $$\sum_{c\in C} \frac{\max(D(c))}{|C|}$$

Let $$AD(c) = \frac{sum(D(c))}{|D(c)|}$$ be the average within-cluster distance in cluster \(c\).

avg_dist_weighted is defined as: $$\sum_{c\in C} \frac{|c| AD(c)}{\mathit{num\_assigned}}$$ where num_assigned is the number of assigned data points (see above).

avg_dist_unweighted is defined as: $$\sum_{c\in C} \frac{AD(c)}{|C|}$$
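
The definitions above can be followed step by step in a minimal base R sketch that does not call scclust. It assumes Euclidean distances computed with stats::dist, which is an assumption about the metric and not something this page states, and it reuses the data and labels from the Examples section. The names pts, within_dists, clusters, and manual_stats are hypothetical helpers for illustration only.

```r
# Data and labels from the Examples section, held as plain R objects.
pts <- data.frame(x = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0),
                  y = c(10, 9, 8, 7, 6, 10, 9, 8, 7, 6))
labels <- c("A", "A", "B", "C", "B", "C", "C", "A", "B", "B")

# Full distance matrix (assumption: Euclidean metric).
d <- as.matrix(dist(pts))

# D(c): within-cluster distances with i > j, i.e. the lower triangle only.
within_dists <- function(idx) {
  sub <- d[idx, idx, drop = FALSE]
  sub[lower.tri(sub)]
}

clusters <- split(seq_along(labels), labels)  # C: point indices per cluster
D <- lapply(clusters, within_dists)           # D(c) for each cluster
sizes <- lengths(clusters)                    # |c|
AD <- sapply(D, mean)                         # AD(c) = sum(D(c)) / |D(c)|

manual_stats <- c(
  sum_dists           = sum(sapply(D, sum)),
  min_dist            = min(sapply(D, min)),
  max_dist            = max(sapply(D, max)),
  avg_min_dist        = mean(sapply(D, min)),
  avg_max_dist        = mean(sapply(D, max)),
  avg_dist_weighted   = sum(sizes * AD) / sum(sizes),
  avg_dist_unweighted = mean(AD)
)
print(round(manual_stats, 7))
```

Under the Euclidean assumption, these hand-computed values agree with the output printed in the Examples section (for instance, sum_dists is approximately 18.2013097 and min_dist is 0.5). The weighted and unweighted averages differ because the four-point cluster B contributes more to the weighted version than each of the three-point clusters.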

Examples

# Load the packages that provide distances() and scclust()
library(distances)
library(scclust)

my_data_points <- data.frame(x = c(0.1, 0.2, 0.3, 0.4, 0.5,
                                   0.6, 0.7, 0.8, 0.9, 1.0),
                             y = c(10, 9, 8, 7, 6,
                                   10, 9, 8, 7, 6))

my_distances <- distances(my_data_points)

my_scclust <- scclust(c("A", "A", "B", "C", "B",
                        "C", "C", "A", "B", "B"))

get_clustering_stats(my_distances, my_scclust)

# >                     Value
# > num_data_points     10.0000000
# > num_assigned        10.0000000
# > num_clusters         3.0000000
# > min_cluster_size     3.0000000
# > max_cluster_size     4.0000000
# > avg_cluster_size     3.3333333
# > sum_dists           18.2013097
# > min_dist             0.5000000
# > max_dist             3.0066593
# > avg_min_dist         0.8366584
# > avg_max_dist         2.4148611
# > avg_dist_weighted    1.5575594
# > avg_dist_unweighted  1.5847484
