# block

##### Block units into homogeneous experimental blocks

Block units into experimental blocks, with one unit per treatment condition. Blocking begins by creating a measure of multivariate distance between all possible pairs of units. Maximum, minimum, or an allowable range of differences between units on one variable can be set.

- Keywords
- multivariate, design

##### Usage

`block(data, vcov.data = NULL, groups = NULL, n.tr = 2, id.vars, block.vars = NULL, algorithm = "optGreedy", distance = "mahalanobis", weight = NULL, optfactor = 10^7, row.sort = NULL, level.two = FALSE, valid.var = NULL, valid.range = NULL, seed.dist, namesCol = NULL, verbose = FALSE, ...)`

##### Arguments

- data
- a dataframe or matrix, with units in rows and variables in columns.
- vcov.data
- an optional matrix of data used to estimate the variance-covariance matrix for calculating multivariate distance.
- groups
- an optional column name from `data`, specifying subgroups within which blocking occurs.
- n.tr
- the number of treatment conditions per block.
- id.vars
- a required string or vector of two strings specifying which column(s) of `data` contain identifying information.
- block.vars
- an optional string or vector of strings specifying which column(s) of `data` contain the numeric blocking variables.
- algorithm
- a string specifying the blocking algorithm. `"optGreedy"`, `"optimal"`, `"naiveGreedy"`, `"randGreedy"`, and `"sortGreedy"` algorithms are currently available. See Details for more information.
- distance
- either a) a string defining how the multivariate distance used for blocking is calculated (options include `"mahalanobis"`, `"mcd"`, `"mve"`, and `"euclidean"`), or b) a user-defined $k$-by-$k$ matrix of distances, where $k$ is the number of rows in `data`.
- weight
- either a vector of length equal to the number of blocking variables or a square matrix with dimensions equal to the number of blocking variables used to explicitly weight blocking variables.
- optfactor
- a number by which distances are multiplied then divided when `algorithm = "optimal"`.
- row.sort
- an optional vector of integers from 1 to `nrow(data)` used to sort the rows of data when `algorithm = "sortGreedy"`.
- level.two
- a logical defining the level of blocking.
- valid.var
- an optional string defining a variable on which units in the same block must fall within the range defined by `valid.range`.
- valid.range
- an optional vector defining the range of `valid.var` within which units in the same block must fall.
- seed.dist
- an optional integer value for the random seed set in `cov.rob`, used to calculate measures of the variance-covariance matrix robust to outliers.
- namesCol
- an optional vector of column names for the output table.
- verbose
- a logical specifying whether `groups` names and block numbers are printed as blocks are created.
- ...
- additional arguments passed to `cov.rob`.

##### Details

If `vcov.data = NULL`, then `block` calculates the variance-covariance matrix using the `block.vars` from `data`.
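To make the distance step concrete, the following is a minimal base-R sketch of pairwise Mahalanobis distances in which the variance-covariance matrix is estimated from the blocking variables themselves. The toy data frame `dat` and the names `block_vars`, `Sigma`, and `dist_mat` are illustrative assumptions, not objects from blockTools, and the sketch does not claim to reproduce the package's internal computation (for instance, whether squared or unsquared distances are used).

```r
# Toy data: 5 units with two hypothetical blocking variables, b1 and b2.
dat <- data.frame(
  id = 1:5,
  b1 = c(1.0, 2.0, 4.0, 7.0, 8.5),
  b2 = c(10, 14, 9, 20, 18)
)
block_vars <- c("b1", "b2")
X <- as.matrix(dat[, block_vars])

# Variance-covariance matrix estimated from the blocking variables,
# analogous to the vcov.data = NULL case.
Sigma <- cov(X)

# Pairwise distances: column i holds the distance from every unit to unit i.
n <- nrow(X)
dist_mat <- matrix(0, n, n, dimnames = list(dat$id, dat$id))
for (i in seq_len(n)) {
  # stats::mahalanobis() returns squared distances, so take the square root.
  dist_mat[, i] <- sqrt(mahalanobis(X, center = X[i, ], cov = Sigma))
}
round(dist_mat, 3)
```

Supplying a different data matrix to `cov()` in place of `X` would correspond to the role described for `vcov.data`.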

If `groups` is not user-specified, `block` temporarily creates a variable in `data` called `groups`, which takes the value 1 for every unit.

Where possible, one unit is assigned to each condition in each block. If there are fewer available units than treatment conditions, available units are used.

If `n.tr` $> 2$, then the `optGreedy` algorithm finds the best possible pair match, then the best match to either member of the pair, then the best match to any member of the triple, etc. After finding the best pair match to a given unit, the other greedy algorithms proceed by finding the third, fourth, etc. best match to that given unit.
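The block-growing rule described for `optGreedy` when `n.tr` exceeds 2 can be sketched in base R as follows. The helper `opt_greedy_block()` is hypothetical and written only for illustration: it builds a single block, takes the first minimum instead of breaking ties at random, and is not the package's C implementation.

```r
# Hypothetical helper: grow one block of size n_tr from a distance matrix d.
opt_greedy_block <- function(d, n_tr) {
  d <- as.matrix(d)
  diag(d) <- Inf  # a unit cannot be matched to itself

  # Seed the block with the closest pair in the whole matrix.
  block <- unname(which(d == min(d), arr.ind = TRUE)[1, ])

  # Add the unit closest to ANY current member until the block is full.
  while (length(block) < n_tr && length(block) < nrow(d)) {
    rest <- setdiff(seq_len(nrow(d)), block)
    nearest <- apply(d[block, rest, drop = FALSE], 2, min)
    block <- c(block, rest[which.min(nearest)])
  }
  block
}

# One-dimensional toy example: units 1, 2, and 3 sit close together.
x <- c(0, 1, 3, 10, 11.5)
blk <- opt_greedy_block(as.matrix(dist(x)), n_tr = 3)
sort(blk)
```

Replacing `apply(..., 2, min)` with the distances to one fixed unit would give the behavior described for the other greedy algorithms, which look for the third, fourth, etc. best match to that given unit.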

An example of `id.vars` is `id.vars = c("id", "id2")`. If two-level blocking is selected, `id.vars` should be ordered (unit id, subunit id). See details for `level.two` below for more information.

If `block.vars = NULL`, then all variables in `data` except the `id.vars` are taken as blocking variables. An example of an explicit specification is `block.vars = c("b1", "b2")`.

The algorithm `optGreedy` calls an optimal-greedy algorithm, repeatedly finding the best remaining match in the entire dataset; `optimal` finds the set of blocks that minimizes the sum of the distances in all blocks; `naiveGreedy` finds the best match proceeding down the dataset from the first unit to the last; `randGreedy` randomly selects a unit, finds its best match, and repeats; `sortGreedy` re-sorts the dataset according to `row.sort`, then implements the `naiveGreedy` algorithm. The `optGreedy` algorithm breaks ties by randomly selecting one of the minimum-distance pairs. The `naiveGreedy`, `sortGreedy`, and `randGreedy` algorithms break ties by randomly selecting one of the minimum-distance matches to the particular unit in question.
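The difference between the optimal-greedy and naive-greedy strategies can be illustrated for pairs with a short base-R sketch. Both helper functions below are hypothetical illustrations, not blockTools code, and they simplify tie-breaking by taking the first minimum where the package is described as choosing at random.

```r
# Hypothetical helper: repeatedly pair the closest two remaining units.
opt_greedy_pairs <- function(d) {
  d <- as.matrix(d)
  diag(d) <- Inf
  pairs <- list()
  while (any(is.finite(d))) {
    idx <- unname(which(d == min(d), arr.ind = TRUE)[1, ])
    pairs[[length(pairs) + 1]] <- sort(idx)
    d[idx, ] <- Inf  # remove both units from further matching
    d[, idx] <- Inf
  }
  do.call(rbind, pairs)
}

# Hypothetical helper: walk down the rows, pairing each free unit
# with its closest free partner.
naive_greedy_pairs <- function(d) {
  d <- as.matrix(d)
  available <- rep(TRUE, nrow(d))
  pairs <- list()
  for (i in seq_len(nrow(d))) {
    if (!available[i]) next
    available[i] <- FALSE
    if (!any(available)) break  # an odd unit is left unmatched
    cand <- which(available)
    j <- cand[which.min(d[i, cand])]
    available[j] <- FALSE
    pairs[[length(pairs) + 1]] <- c(i, j)
  }
  do.call(rbind, pairs)
}

x <- c(0, 4, 5, 9)  # toy one-dimensional covariate
d <- as.matrix(dist(x))
p_opt <- opt_greedy_pairs(d)      # pairs (2, 3) then (1, 4)
p_naive <- naive_greedy_pairs(d)  # pairs (1, 2) then (3, 4)
sum(d[p_opt])    # total within-pair distance: 10
sum(d[p_naive])  # total within-pair distance: 8
```

In this toy case the greedy choice of the single closest pair forces a poor second pair, which is the kind of outcome an algorithm that minimizes the sum of distances over all blocks is designed to avoid.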

As of version 0.5-1, blocking is done in C for all algorithms except `optimal` (see following paragraphs for more details on the `optimal` algorithm implementation).

The `optimal` algorithm uses two functions from the nbpMatching package: `distancematrix` prepares a distance matrix for optimal blocking, and `nonbimatch` performs the optimal blocking by minimizing the sum of distances in blocks. `nonbimatch`, and thus the `block` algorithm `optimal`, requires that `n.tr = 2`.

Because `distancematrix` takes the integer `floor` of the distances, and one may want much finer precision, the multivariate distances calculated within `block` are multiplied by `optfactor` prior to optimal blocking. Then `distancematrix` prepares the resulting distance matrix, and `nonbimatch` is called on the output. The distances are then untransformed by dividing by `optfactor` before being returned by `block`.
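The effect of the multiply, floor, and divide sequence on precision can be seen with plain arithmetic. The sketch below uses a hypothetical helper, `rescale()`, and made-up distances; it does not call nbpMatching and only mimics the rounding described above.

```r
# Hypothetical helper mimicking: multiply by optfactor, take the integer
# floor, then divide by optfactor to return to the original scale.
rescale <- function(d, optfactor) floor(d * optfactor) / optfactor

d <- c(0.123456789, 0.123456999, 2.5)  # made-up distances

r3 <- rescale(d, 10^3)  # first two distances collapse to the same value
r7 <- rescale(d, 10^7)  # first two distances remain distinguishable
r3
r7
```

A small `optfactor` can therefore create artificial ties between nearly equal distances, which is the precision cost that larger values are meant to avoid.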

The choice of `optfactor` can determine whether the Fortran code can allocate enough memory to solve the optimization problem. For example, blocking the first 14 units of `x100` by executing

```
block(x100[1:14, ], id.vars = "id", block.vars = c("b1", "b2"),
      algorithm = "optimal", optfactor = 10^8)
```

fails for Fortran memory reasons, while the same code with `optfactor = 10^5` runs successfully. Smaller values of `optfactor` imply easier computation, but less precision.

Most of the algorithms in `block` make prohibited blockings by using a distance of `Inf`. However, the optimal algorithm calls Fortran code from nbpMatching and requires integers. Thus, a distance of `99999*max(dist.mat)` is used to effectively prohibit blockings. This follows the procedure demonstrated in the example of `help(nonbimatch)`.
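The two ways of prohibiting a pairing can be compared on a small matrix. This is a sketch under assumptions: the matrix `dist_mat` and the logical mask `prohibited` are invented for illustration, and only the `Inf` and `99999 * max(...)` conventions come from the description above.

```r
# Made-up symmetric distance matrix for three units.
dist_mat <- matrix(c(0.0, 1.2, 3.4,
                     1.2, 0.0, 0.7,
                     3.4, 0.7, 0.0), nrow = 3, byrow = TRUE)

# Hypothetical restriction: units 2 and 3 may not share a block.
prohibited <- matrix(FALSE, 3, 3)
prohibited[2, 3] <- prohibited[3, 2] <- TRUE

# Greedy-style representation: an infinite distance is never the minimum.
d_inf <- dist_mat
d_inf[prohibited] <- Inf

# Integer-friendly representation: a very large but finite penalty.
d_big <- dist_mat
d_big[prohibited] <- 99999 * max(dist_mat)

d_inf
d_big
```

The finite penalty keeps the matrix usable by code that requires integers, at the cost that a prohibited pairing is only made very unattractive instead of strictly impossible.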

In order to enable comparisons of block-quality across groups, when `distance` is a string, the variance-covariance matrix $\Sigma$ is calculated using units from all groups.

The `distance = "mcd"` and `distance = "mve"` options call `cov.rob` to calculate measures of multivariate spread robust to outliers. The `distance = "mcd"` option calculates the Minimum Covariance Determinant estimate (Rousseeuw 1985); the `distance = "mve"` option calculates the Minimum Volume Ellipsoid estimate (Rousseeuw and van Zomeren 1990). When `distance = "mcd"`, the interquartile range on blocking variables should not be zero.

A user-specified distance matrix must have diagonals equal to 0, indicating zero distance between a unit and itself. Only the lower triangle of the matrix is used.

If `weight` is a vector, then it is used as the diagonal of a square weighting matrix with non-diagonal elements equal to zero. The weighting is done by using as the Mahalanobis distance scaling matrix $\left(\left(\left(\mathrm{chol}(\Sigma)'\right)^{-1}\right)' W \left(\mathrm{chol}(\Sigma)'\right)^{-1}\right)^{-1}$, where $\mathrm{chol}(\Sigma)$ is the Cholesky decomposition of the usual variance-covariance matrix and $W$ is the weighting matrix. Differences should be smaller on covariates given higher weights.
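The scaling-matrix formula can be checked numerically in base R. The sketch below assumes that $\mathrm{chol}(\Sigma)$ corresponds to R's `chol()`, which returns the upper-triangular factor, so that its transpose is the lower-triangular factor; the matrix `Sigma`, the weights `w`, and the points `x1` and `x2` are invented for illustration.

```r
# Made-up variance-covariance matrix for two blocking variables.
Sigma <- matrix(c(4.0, 1.5,
                  1.5, 9.0), nrow = 2, byrow = TRUE)

w <- c(1, 4)  # hypothetical weights: second variable weighted more heavily
W <- diag(w)  # weight vector placed on the diagonal

L <- t(chol(Sigma))  # assumed meaning of (chol(Sigma))': lower-triangular factor
L_inv <- solve(L)

# Scaling matrix from the formula: ((L^{-1})' W L^{-1})^{-1}
scaling <- solve(t(L_inv) %*% W %*% L_inv)

# Sanity check: identity weights recover the usual matrix Sigma.
scaling_unweighted <- solve(t(L_inv) %*% diag(2) %*% L_inv)
all.equal(scaling_unweighted, Sigma)

# Distance between two made-up units, with and without weights.
x1 <- c(1, 10)
x2 <- c(3, 13)
d_weighted <- sqrt(mahalanobis(x1, center = x2, cov = scaling))
d_plain <- sqrt(mahalanobis(x1, center = x2, cov = Sigma))
c(weighted = d_weighted, plain = d_plain)
```

Because the weights here are all at least 1, the weighted distance is no smaller than the unweighted one, which penalizes differences on the more heavily weighted variable.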

If `level.two = TRUE`, then the best subunit block-matches in different units are found. E.g., provinces could be matched based on the most similar cities within them. All subunits in the data should have unique names. Thus, if subunits are numbered 1 to (number of subunits in unit) within each unit, then they should be renumbered, e.g., 1 to (total number of subunits in all units). `level.two` blocking is not currently implemented for `algorithm = "optimal"`. Units with no blocked subunit are put into their own blocks. However, unblocked subunits within a unit that does have a blocked subunit are not put into their own blocks.
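The renumbering of subunits mentioned above can be done in one line of base R. The data frame `dat` and its columns are hypothetical; only the requirement that subunit names be unique across all units comes from the text.

```r
# Hypothetical data: subunits are numbered within each unit, so id2 repeats.
dat <- data.frame(
  id  = c("A", "A", "B", "B", "B"),  # unit, e.g. a province
  id2 = c(1, 2, 1, 2, 3)             # subunit, e.g. a city within the province
)
had_duplicates <- anyDuplicated(dat$id2) > 0

# Renumber subunits 1 to (total number of subunits in all units).
dat$id2 <- seq_len(nrow(dat))
dat
```

An alternative under the same assumption would be to build unique labels from both columns, for example with `paste(dat$id, dat$id2, sep = "_")` applied before renumbering.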

An example of a variable restriction is `valid.var = "b2"`, `valid.range = c(10,50)`, which requires that units in the same block be at least 10 units apart, but no more than 50 units apart, on variable `"b2"`. As of version 0.5-3, variable restrictions are implemented in all algorithms except `optimal`. Note that employing a variable restriction may result in fewer than the maximum possible number of blocks. See http://www.ryantmoore.org/html/software.blockTools.html for details.
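The range restriction in this example can be expressed as a logical mask over pairs of units. The sketch below uses invented values of `b2` and assumes that both endpoints of the range are inclusive, which the text does not state.

```r
b2 <- c(5, 20, 40, 70)    # made-up values of the restricted variable
valid_range <- c(10, 50)  # as in valid.range = c(10, 50)

# Absolute pairwise differences on the restricted variable.
diffs <- abs(outer(b2, b2, "-"))

# A pair is allowed when its difference lies within the range
# (endpoints treated as inclusive: an assumption of this sketch).
allowed <- diffs >= valid_range[1] & diffs <= valid_range[2]
diag(allowed) <- FALSE    # a unit is never paired with itself
allowed
```

Here units 1 and 4 differ by 65 and so cannot share a block, while units 2 and 4 differ by exactly 50 and are allowed under the inclusive-endpoint assumption.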

If `namesCol = NULL`, then “Unit 1”, “Unit 2”, ... are used. If `level.two = FALSE`, then `namesCol` should be of length `n.tr`; if `level.two = TRUE`, then `namesCol` should be of length `2*n.tr`, and in the order shown in the example below.

##### Value

##### References

King, Gary, Emmanuela Gakidou, Nirmala Ravishankar, Ryan T. Moore, Jason Lakin, Manett Vargas, Martha María Téllez-Rojo, Juan Eugenio Hernández Ávila, Mauricio Hernández Ávila, and Héctor Hernández Llamas. 2007. "A 'Politically Robust' Experimental Design for Public Policy Evaluation, with Application to the Mexican Universal Health Insurance Program". *Journal of Policy Analysis and Management* 26(3):479-509.

Moore, Ryan T. 2012. "Multivariate Continuous Blocking to Improve Political Science Experiments". *Political Analysis* 20(4):460-479.

Rousseeuw, Peter J. 1985. "Multivariate Estimation with High Breakdown Point". *Mathematical Statistics and Applications* 8:283-297.

Rousseeuw, Peter J. and Bert C. van Zomeren. 1990. "Unmasking Multivariate Outliers and Leverage Points". *Journal of the American Statistical Association* 85(411):633-639.

##### See Also

##### Examples

```
library(blockTools)
data(x100)

# One-level blocking within the groups in "g", restricting differences on b1
out <- block(x100, groups = "g", n.tr = 2, id.vars = c("id"),
             block.vars = c("b1", "b2"), algorithm = "optGreedy",
             distance = "mahalanobis", level.two = FALSE, valid.var = "b1",
             valid.range = c(0, 500), verbose = TRUE)
## out$blocks contains 3 data frames

## To illustrate two-level blocking, with multiple level two units per
## level one unit:
# Give each even-numbered row the id of the row above it, so that each
# level-one unit contains two level-two units
for (i in seq_len(nrow(x100))) {
  if ((i %% 2) == 0) {
    x100$id[i] <- x100$id[i - 1]
  }
}
out2 <- block(x100, groups = "g", n.tr = 2, id.vars = c("id", "id2"),
              block.vars = c("b1", "b2"), algorithm = "optGreedy",
              distance = "mahalanobis", level.two = TRUE, valid.var = "b1",
              valid.range = c(0, 500),
              namesCol = c("State 1", "City 1", "State 2", "City 2"),
              verbose = TRUE)
```

*Documentation reproduced from package blockTools, version 0.6-3, License: GPL (>= 2) | file LICENSE*