quickblock: Construct threshold blockings

Description

quickblock constructs near-optimal threshold blockings. The function expects the user to provide distances measuring the similarity of units and a required minimum block size. It then constructs a blocking so that units assigned to the same block are as similar as possible while satisfying the minimum block size.

Usage

quickblock(distances, size_constraint = 2L, caliper = NULL,
  break_large_blocks = FALSE, ...)

Value

Returns a qb_blocking object with the constructed blocks.

Arguments

distances: distances object or a numeric vector, matrix or data frame. The parameter describes the similarity of the units to be blocked. It can either be preprocessed distance information using a distances object, or raw covariate data. When called with covariate data, Euclidean distances are calculated unless otherwise specified.
size_constraint: integer with the required minimum number of units in each block.
caliper: numeric value restricting the maximum within-block distance. With the default, NULL, no caliper is imposed.
break_large_blocks: logical indicating whether large blocks should be broken up into smaller blocks.
...: additional parameters to be sent either to the distances function when the distances parameter contains covariate data, or to the underlying sc_clustering function.

Details

The caliper parameter constrains the maximum distance between units assigned to the same block. This is implemented by restricting the edge weight in the graph used to construct the blocks (see sc_clustering for details). As a result, the caliper will affect all blocks and, in general, make it harder for the function to find good matches even for blocks where the caliper is not binding. In particular, an overly tight caliper can lead to discarded units that would otherwise be assigned to a block satisfying both the matching constraints and the caliper. For this reason, it is recommended to set the caliper value quite high and use it only to avoid particularly poor blocks. It is strongly recommended to use the caliper parameter only when primary_unassigned_method = "closest_seed" in the underlying sc_clustering function (which is the default behavior).
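The effect of a tight caliper can be checked by counting discarded units. The following is a minimal sketch, not part of the package documentation: the caliper values are arbitrary, and it assumes that units left without a block appear as NA in the returned labels.

```r
library(distances)
library(quickblock)

set.seed(1)
my_data <- data.frame(x1 = runif(100), x2 = runif(100))
my_distances <- distances(my_data, dist_variables = c("x1", "x2"))

# Same data, one loose and one tight caliper (values chosen for illustration)
loose <- quickblock(my_distances, caliper = 0.5)
tight <- quickblock(my_distances, caliper = 0.1)

# Assumption: discarded units are stored as NA block labels
n_unassigned_loose <- sum(is.na(as.integer(loose)))
n_unassigned_tight <- sum(is.na(as.integer(tight)))

c(loose = n_unassigned_loose, tight = n_unassigned_tight)
```

If the tight caliper discards many units, that suggests the caliper is doing more than trimming the worst blocks and should probably be raised.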

The main algorithm used to construct the blocking may produce some blocks that are much larger than the minimum size constraint. If break_large_blocks is TRUE, all blocks at least twice as large as size_constraint will be broken into two or more smaller blocks. Blocks are broken in a way that ensures the new blocks still satisfy the size constraint. In general, large blocks are produced when units are highly clustered, so breaking up large blocks will often lead to only small improvements. The blocks are broken using the hierarchical_clustering function.
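To see what break_large_blocks changes in practice, the block size distributions can be compared. This is an illustrative sketch only: it assumes the returned object can be converted to a plain vector of block labels with as.integer, and the data are simulated.

```r
library(distances)
library(quickblock)

set.seed(1)
my_data <- data.frame(x1 = runif(100), x2 = runif(100))
my_distances <- distances(my_data, dist_variables = c("x1", "x2"))

default_blocks <- quickblock(my_distances, size_constraint = 3L)
broken_blocks <- quickblock(my_distances, size_constraint = 3L,
                            break_large_blocks = TRUE)

# Number of units in each block (assumes labels are recoverable via as.integer)
sizes_default <- table(as.integer(default_blocks))
sizes_broken <- table(as.integer(broken_blocks))

# Compare the largest block with and without breaking
c(default = max(sizes_default), broken = max(sizes_broken))
```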

quickblock calls sc_clustering with seed_method = "inwards_updating". The seed_method parameter governs how the seeds are selected in the nearest neighborhood graph that is used to construct the blocks (see sc_clustering for details). The "inwards_updating" option generally works well and is safe with most datasets. Using seed_method = "exclusion_updating" often leads to better performance (in the sense of blocks with more similar units), but it may increase run time. Discrete data (or more generally when units tend to be at equal distance to many other units) will lead to particularly poor run time with this option. If the dataset has at least one continuous covariate, "exclusion_updating" is typically quick. A third option is seed_method = "lexical", which decreases the run time relative to "inwards_updating" (sometimes considerably) at the cost of performance. quickblock passes parameters on to sc_clustering, so to change seed_method, call quickblock with the parameter specified as usual: quickblock(..., seed_method = "exclusion_updating").
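The trade-off between the seed methods can be explored on one's own data before committing to one. The sketch below times the three options named above and counts the resulting blocks; the helper code and the simulated data are illustrative assumptions, and timings on 100 units will be too small to be meaningful.

```r
library(distances)
library(quickblock)

set.seed(1)
my_data <- data.frame(x1 = runif(100), x2 = runif(100))
my_distances <- distances(my_data, dist_variables = c("x1", "x2"))

seed_methods <- c("lexical", "inwards_updating", "exclusion_updating")

results <- lapply(seed_methods, function(method) {
  # seed_method is passed through `...` to sc_clustering
  elapsed <- system.time(
    blocking <- quickblock(my_distances, seed_method = method)
  )[["elapsed"]]
  labels <- as.integer(blocking)
  data.frame(seed_method = method,
             seconds = elapsed,
             n_blocks = length(unique(labels[!is.na(labels)])))
})

comparison <- do.call(rbind, results)
comparison
```

More blocks for the same size constraint generally means smaller blocks, which is one rough indication of how the options differ on a given dataset.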

References

Higgins, Michael J., Fredrik Sävje and Jasjeet S. Sekhon (2016), ‘Improving massive experiments with threshold blocking’, Proceedings of the National Academy of Sciences, 113:27, 7369–7376. http://www.pnas.org/lookup/doi/10.1073/pnas.1510504113

Examples

# Construct example data
my_data <- data.frame(x1 = runif(100),
                      x2 = runif(100))

# Make distances
my_distances <- distances(my_data, dist_variables = c("x1", "x2"))

# Make blocking with at least two units in each block
quickblock(my_distances)

# Require at least three units in each block
quickblock(my_distances, size_constraint = 3)

# Impose caliper
quickblock(my_distances, caliper = 0.2)

# Break large blocks
quickblock(my_distances, break_large_blocks = TRUE)

# Call `quickblock` directly with covariate data (i.e., not pre-calculating distances)
quickblock(my_data[c("x1", "x2")])

# Call `quickblock` directly with covariate data using Mahalanobis distances
quickblock(my_data[c("x1", "x2")], normalize = "mahalanobize")