sc_clustering: Size-constrained clustering

Description

sc_clustering constructs near-optimal size-constrained clusterings. Subject to user-specified constraints on the size and composition of the clusters, a clustering is constructed so that within-cluster pair-wise distances are minimized. The function does not restrict the number of clusters.

Usage

sc_clustering(distances, size_constraint = NULL, type_labels = NULL,
  type_constraints = NULL, seed_method = "inwards_updating",
  primary_data_points = NULL,
  primary_unassigned_method = "closest_seed",
  secondary_unassigned_method = "ignore", seed_radius = NULL,
  primary_radius = "seed_radius",
  secondary_radius = "estimated_radius", batch_size = 100L)

Value

Returns a scclust object with the derived clustering.

Arguments

distances: a distances object with distances between the data points.
size_constraint: an integer with the required minimum cluster size.
type_labels: a vector containing the type of each data point. May be NULL when type_constraints is NULL.
type_constraints: a named integer vector containing type-specific size constraints. If NULL, only the overall constraint given by size_constraint will be imposed on the clustering.
seed_method: a character scalar indicating how seeds should be selected.
primary_data_points: a vector specifying primary data points, either with point indices or with a logical vector of length equal to the number of points. NULL indicates that all data points are "primary".
primary_unassigned_method: a character scalar indicating how unassigned (primary) points should be assigned to clusters.
secondary_unassigned_method: a character scalar indicating how unassigned secondary points should be assigned to clusters.
seed_radius: a positive numeric scalar restricting the maximum length of an edge in the graph used to construct the clustering. NULL indicates no restriction. This parameter (together with primary_radius and secondary_radius) can be used to control the maximum distance between points assigned to the same cluster (see below for details).
primary_radius: a positive numeric scalar, a character scalar or NULL restricting the match distance for unassigned primary points. If numeric, the value is used to restrict the distances. If character, it must be one of "no_radius", "seed_radius" or "estimated_radius" (see below for details). NULL indicates no radius restriction.
secondary_radius: a positive numeric scalar, a character scalar or NULL restricting the match distance for unassigned secondary points. If numeric, the value is used to restrict the distances. If character, it must be one of "no_radius", "seed_radius" or "estimated_radius" (see below for details). NULL indicates no radius restriction.
batch_size: an integer scalar specifying batch size when seed_method is set to "batches".

Details

sc_clustering constructs a clustering so to minimize within-cluster dissimilarities while ensuring that constraints on the size and composition of the clusters are satisfied. It is possible to impose an overall size constraint so that each cluster must contain at least a certain number of points in total. It is also possible to impose constraints on the composition of the clusters so that each cluster must contain a certain number of points of different types. For example, in a sample with "red" and "blue" data points, one can constrain the clustering so that each cluster must contain at least 10 points in total of which at least 3 must be "red" and at least 2 must be "blue".

The function implements an algorithm that first summaries the distances between data points in a sparse graph and then constructs the clustering based on the graph. This admits fast execution while ensuring near-optimal performance. In particular, the maximum within-cluster distance is garantueed to be at most four times the maximum distance in the optimal clustering. The average performance is much closer to optimal than the worst case bound.

In more detail, the clustering algorithm has four steps:

Construct sparse graph encoding clustering constraints.
Select a set of vertices in the graph to act as "seeds".
Construct initial clusters from the seeds' neighborhoods in the graph.
Assign remaining data points to clusters.

Each data point is represented by a vertex in the sparse graph, and arcs are weighted by the distance between the data points they connect. The graph is constructed so that each vertex's neighborhood satisfies the clustering constraints supplied by the user. For example, if the clusters must contain at least five points, each closed neighborhood in the graph will contain five vertices. sc_clustering constructs the graph that minimize the arc weights subject to the neighborhood constraints; this ensures near-optimal performance. The function selects "seeds" that have non-overlapping neighborhood in the graph. By constructing clusters as supersets of the neighborhoods, it ensures that the clustering will satisfy the constraints imposed by the user.

The seed_method option governs how the seeds are chosen. Any set of seeds yields a near-optimal clustering, but, heuristically, performance can be improved by picking the seeds more carefully. In most cases, smaller clusters are desirable since they tend to minimize within-cluster distances. As the number of data points is fixed, we minimize the cluster size by maximizing the number of clusters (and, thus, the number of seeds). When seed_method is set to "lexical", seeds are chosen in lexical order. This admits a fast solution, but the function might pick points that are central in the graph. Central points tend to exclude many other points from being seeds and, thus, lead to larger clusters.

The "exclusion_order" and "exclusion_updating" options calculate for each vertex how many other vertices are excluded if the vertex is picked as a seed. By picking seeds that exclude few other vertices, the function avoids central points and increases the number of seeds. "exclusion_updating" updates the count after each picked seed so that already excluded vertices are not counted twice; "exclusion_order" derives the count once. The former option is, thus, better-performing but slower.

Deriving the exclusion count when using the "exclusion_order" and "exclusion_updating" options is an expensive operation, and it might not be feasible to do so in large samples. This is particularly problematic when the data points have equidistant nearest neighbors, which tends to happen when the dataset contains only discrete variables. It also happens when the dataset contains many identical data points. The exclusion count operation may in these cases take several orders of magnitude longer to run than the rest of the operations combined. To ensure sane behavior for the default options, the exclusion count is not used by default. If the dataset contains at least one continuous variable, it is generally safe to call sc_clustering with either "exclusion_order" or "exclusion_updating", and this will often improve performance over the default.

The "inwards_order" and "inwards_updating" options count the number of inwards-pointing arcs in the graph, which approximates the exclusion count. "inwards_updating" updates the count after each picked seed, while "inwards_order" derives the count once. The inwards counting options work well with nearly all types of data, and "inwards_updating" is the default.

The "batches" option is identical to "lexical" but it derives the graph in batches. This limits the use of memory to a value proportional to batch_size irrespectively of the size of the dataset. This can be useful when imposing large size constraints, which consume a lot of memory in large datasets. The "batches" option is still experimental and can currently only be used when one does not impose type constraints.

Once the function has selected seeds so that no additional points can be selected without creating overlap in the graph, it constructs the initial clusters. A unique cluster label is assigned to each seed, and all points in the seed's neighborhood is assigned the same label. Some data points might not be in a neighborhood of a seed and will, therefore, not be assigned to a cluster after the initial clusters have been formed. primary_unassigned_method specifies how the unassigned points are assigned. With the "ignore" option, the points are left unassigned. With the "any_neighbor", the function re-uses the sparse graph and assigns the unassigned points to any adjacent cluster in the graph. The "closest_assigned" option assigns the points to the clusters that contain their closest assigned vertex, and the "closest_seed" option assigns to the cluster that contain the closest seed. All these options ensure that the clustering is near-optimal.

Occasionally, some data points are allowed to be left unassigned. Consider the following prediction problem as an example. We have a set of data points with a known outcome value ("training points") and another set of points for which the outcome is unknown ("prediction points"). We want to use the training points to predict the outcome for the prediction points. By clustering the data, we can use the cluster mean of the training points to predict the values of the prediction points in the same cluster. To make this viable, we need to ensure that each cluster contains at least one training point (the cluster mean would otherwise be undefined). We can impose this constraint using the type_labels and type_constraints options. We also need to make sure that each prediction point is assigned to a cluster. We do, however, not need that all training points are assigned a cluster; some training points might not provide useful information (e.g., if they are very dissimilar to all prediction points). In this case, by specifying only the prediction points in primary_data_points, we ensure that all those points are assigned to clusters. The points not specified in primary_data_points (i.e., the training points) will be assigned to clustering only insofar that it is needed to satisfy the clustering constraints. This can lead to large improvements in the clustering if the types of points are unevenly distributed in the metric space. Points not specified in primary_data_points are called "secondary".

Generally, one does not want to discard all unassigned secondary points -- some of them will, occasionally, be close to a cluster and contain useful information. Similar to primary_unassigned_method, we can use the secondary_unassigned_method to specify how the leftover secondary points should be assigned. The three possible options are "ignore", "closest_assigned" and "closest_seed". In nearly all cases, it is beneficial to impose a radius constraint when assigning secondary points (see below for details).

sc_clustering tries to minimize within-cluster distances subject to that all (primary) data points are assigned to clusters. In some cases, it might be beneficial to leave some points unassigned to avoid clusters with very dissimilar points. The function has options that can be used to indirectly constrain the maximum within-cluster distance. Specifically, by restricting the maximum distance in the four steps of the function, we can bound the maximum within-cluster distance. The bound is, however, a blunt tool. A too small bound might lead to that only a few data points are assigned to clusters. If a bound is needed, it is recommended to use a very liberal bound. This will avoid the very dissimilar clusters, but let the function optimize the remaining cluster assignments.

The seed_radius option limits the maximum distance between adjacent vertices in the sparse graph. If the distance to a neighbor in a point's neighborhood is greater than seed_radius, the neighborhood will be deleted and the point is not allowed be a seed (it can, however, still be in other points' neighborhoods). The primary_radius option limits the maximum distance when a primary point is assigned to in the fourth step. When primary_radius is set to a positive numeric value, this will be used to as the radius restriction. "no_radius" and NULL indicate no restriction. "seed_radius" indicates that the same restriction as seed_radius should be used, and "estimated_radius" sets the restriction to the estimated average distance between the seeds and their neighbors. When primary_unassigned_method is set to "any_neighbor", primary_radius must be set to "seed_radius".

The way the radius constraints restrict the maximum within-cluster distance depends on the primary_unassigned_method option. When primary_unassigned_method is "ignore", the maximum distance is bounded by 2 * seed_radius. When primary_unassigned_method is "any_neighbor", it is bounded by 4 * seed_radius. When primary_unassigned_method is "closest_assigned", it is bounded by 2 * seed_radius + 2 * primary_radius. When primary_unassigned_method is "closest_seed", it is bounded by the maximum of 2 * seed_radius and 2 * primary_radius.

The secondary_radius option restricts how secondary points are assigned, but is otherwise identical to the primary_radius option.

References

Higgins, Michael J., Fredrik Sävje and Jasjeet S. Sekhon (2016), ‘Improving massive experiments with threshold blocking’, Proceedings of the National Academy of Sciences, 113:27, 7369--7376. http://www.pnas.org/lookup/doi/10.1073/pnas.1510504113

Sävje, Fredrik and Michael J. Higgins and Jasjeet S. Sekhon (2017), ‘Generalized Full Matching’, arXiv 1703.03882. https://arxiv.org/abs/1703.03882

Examples

Run this code

# Make example data
my_data <- data.frame(id = 1:50000,
                      type = factor(rbinom(50000, 3, 0.3),
                                    labels = c("A", "B", "C", "D")),
                      x1 = rnorm(50000),
                      x2 = rnorm(50000),
                      x3 = rnorm(50000))

# Construct distance metric
my_dist <- distances(my_data,
                     id_variable = "id",
                     dist_variables = c("x1", "x2", "x3"))

# Make clustering with at least 3 data points in each cluster
my_clustering <- sc_clustering(my_dist, 3)

# Check so clustering satisfies constraints
check_clustering(my_clustering, 3)
# > TRUE

# Get statistics about the clustering
get_clustering_stats(my_dist, my_clustering)
# > num_data_points        5.000000e+04
# > ...

# Make clustering with at least one point of each type in each cluster
my_clustering <- sc_clustering(my_dist,
                               type_labels = my_data$type,
                               type_constraints = c("A" = 1, "B" = 1,
                                                    "C" = 1, "D" = 1))

# Check so clustering satisfies constraints
check_clustering(my_clustering,
                 type_labels = my_data$type,
                 type_constraints = c("A" = 1, "B" = 1,
                                      "C" = 1, "D" = 1))
# > TRUE

# Make clustering with at least 8 points in total of which at least
# one must be "A", two must be "B" and five can be any type
my_clustering <- sc_clustering(my_dist,
                               size_constraint = 8,
                               type_labels = my_data$type,
                               type_constraints = c("A" = 1, "B" = 2))

Run the code above in your browser using DataLab