match.template: Template Matching

Description

This function solves the multidimensional assignment problem with decomposable costs (MDADC) by matching the data to a pre-specified set of vectors (the template). The dissimilarity function is the squared Euclidean distance.

Usage

match.template(x, template = 1L, unit = NULL, w = NULL, 
	method = c("hungarian","bruteforce"), equal.variance = FALSE)

Value

A list of class matchFeat with fields

sigma: best assignement as set of permutations (\(m\times n\) matrix)
cluster: best assignement as cluster indicators (\(m \times n\) matrix)
objective: minimum objective value
mu: mean vector for each class/label (\(p \times m\) matrix)
V: covariance matrix for each class/label (\(p \times p \times m\) array if equal.variance is FALSE, \(p \times p \) matrix otherwise
call: function call

Arguments

x: data: matrix of dimensions \(mn \times p\) or 3D array of dimensions \((p,m,n)\) with \(m\) = number of labels/classes, \(n\) = number of sample units, and \(p\) = number of variables)
template: integer (= which sample unit to take as template) or \((p,m)\) matrix
unit: integer (=number of units/datasets) or vector mapping rows of x to sample units (length \(mn\)). Must be specified only if x is a matrix.
w: weights for the loss function. Can be specified as a \(p\)-vector of diagonal weights or as a full \(p \times p\) (positive definite) matrix
method: method for the linear assignment problem: hungarian algorithm or bruteforce
equal.variance: logical; if TRUE, resp. FALSE, return common, resp. label-specific, covariance of matched features

Details

Given \(n\) datasets or statistical units, each containing \(m\) feature vectors, the one-to-one matching problem is to find a set of \(n\) label permutations that produce the best match of feature vectors across units. The objective function to minimize is the sum of squared (Euclidean) distances between all feature vectors having the same (new) label. This amounts to minimizing the sum of the within-label variances.

The template-based method consists in relabeling successively each sample unit to best match a template matrix of feature vectors. This method is very fast but its optimization performance is only as good as the template. For best results, the template should be representative of the collected data.

If x is a matrix, the rows should be sorted by increasing unit label and unit should be a nondecreasing sequence of integers, for example \((1,...,1,2,...,2,...,n,...,n)\) with each integer \(1,...,n\) replicated \(m\) times.

The argument w can be specified as a vector of positive numbers (will be recycled to length \(p\) if needed) or as a positive definite matrix of size \((p,p)\).

References

Degras (2022) "Scalable feature matching across large data collections." tools:::Rd_expr_doi("10.1080/10618600.2022.2074429")
https://en.wikipedia.org/wiki/Assignment_problem
https://en.wikipedia.org/wiki/Hungarian_algorithm

Examples

Run this code

## Generate data
n <- 10
k <- 3
d <- 5
mu <- matrix(1:k, nrow=d, ncol=k, byrow=TRUE)
sigma <- 0.3
x <- array(mu, c(d,k,n)) + rnorm(d*k*n,sigma)
## Match all feature vectors with first case as template
result <- match.template(x,1)
## Display results 
result$cost # cost function
xmatched <- array(dim=dim(x)) 
# re-arranged (matched) feature vectors
for (i in 1:n)
	xmatched[,,i] <- x[,result$sigma[,i],i]

Run the code above in your browser using DataLab