reclin (version 0.1.2)

select_greedy: Select matching pairs enforcing one-to-one linkage

Description

Select matching pairs enforcing one-to-one linkage

Usage

select_greedy(
  pairs,
  threshold = NULL,
  weight,
  var = "select",
  preselect = NULL,
  id_x = NULL,
  id_y = NULL,
  ...
)

select_n_to_m(
  pairs,
  threshold = NULL,
  weight = NULL,
  var = "select",
  preselect = NULL,
  n = 1,
  m = 1,
  id_x = NULL,
  id_y = NULL,
  ...
)

Arguments

pairs

a pairs object, such as generated by pair_blocking

threshold

the threshold to apply. Pairs with a score above the threshold are selected.

weight

name of the score/weight variable of the pairs. When not given and attr(pairs, "score") is defined, that is used.

var

the name of the new variable to create in pairs. This will be a logical variable with a value of TRUE for the selected pairs.

preselect

a logical variable with the same length as pairs has rows, or the name of such a variable in pairs. Pairs are only selected when preselect is TRUE. This interacts with threshold (pairs have to be selected with both conditions).

id_x

an integer vector with the same length as the number of rows in pairs, or the name of a column in x. This vector should identify unique objects in x. When not specified, it is assumed that each element in x is unique.

id_y

an integer vector with the same length as the number of rows in pairs, or the name of a column in y. This vector should identify unique objects in y. When not specified, it is assumed that each element in y is unique.

...

passed on to other methods.

n

the number of records from x that can at most be linked to a record in y.

m

the number of records from y that can at most be linked to a record in x.
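
As a rough illustration of how threshold and preselect combine, the base R sketch below computes which pairs remain candidates when both conditions must hold. The vectors score and preselect are made-up values for illustration, and this is not the package's internal code.

```r
# Made-up scores and preselection flags for four candidate pairs.
score <- c(3, -1, 6, 2)
preselect <- c(TRUE, TRUE, FALSE, TRUE)
threshold <- 0

# A pair stays a candidate only when it is preselected AND its score
# lies strictly above the threshold.
candidate <- preselect & (score > threshold)
print(candidate)
```

Here the second pair fails the threshold and the third fails preselect, so only the first and fourth pairs could still be selected.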

Value

Returns the pairs with the variable given by var added. This is a logical variable indicating which pairs are selected as matches.

Details

Both methods force one-to-one matching. select_greedy uses a greedy algorithm that selects the first pair with the highest weight. select_n_to_m tries to optimise the total weight of all of the selected pairs. In general this will result in a better selection. However, select_n_to_m uses much more memory and is much slower and, therefore, can only be used when the number of possible pairs is not too large.
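
The difference between the two strategies can be seen on a toy example. The sketch below is a minimal base R illustration of greedy one-to-one selection, assuming ties are resolved in the original row order. The helper greedy_one_to_one is hypothetical and is not the implementation used by select_greedy.

```r
# Hypothetical helper, not part of reclin: greedy one-to-one selection.
# x, y: record identifiers for each pair; weight: score of each pair.
greedy_one_to_one <- function(x, y, weight, threshold = -Inf) {
  stopifnot(length(x) == length(y), length(x) == length(weight))
  selected <- logical(length(weight))
  used_x <- c()
  used_y <- c()
  # Visit pairs from highest to lowest weight; ties keep their row order.
  for (i in order(-weight)) {
    # All remaining pairs have a weight at or below the threshold.
    if (weight[i] <= threshold) break
    # Skip the pair when either record has already been linked.
    if (x[i] %in% used_x || y[i] %in% used_y) next
    selected[i] <- TRUE
    used_x <- c(used_x, x[i])
    used_y <- c(used_y, y[i])
  }
  selected
}

# Toy pairs: record 1 of x scores well against both records of y.
x <- c(1, 1, 2, 2)
y <- c(1, 2, 1, 2)
w <- c(5, 4, 4, 1)
sel <- greedy_one_to_one(x, y, w, threshold = 0)
print(sel)
print(sum(w[sel]))
```

On this input the greedy pass takes the pair with weight 5 first, which blocks both pairs with weight 4, and ends with a total weight of 6. Linking record 1 of x to record 2 of y and record 2 of x to record 1 of y would give a total of 8, which is the kind of selection an approach that optimises the total weight aims to find.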

Examples

library(reclin)

data("linkexample1", "linkexample2")
pairs <- pair_blocking(linkexample1, linkexample2, "postcode")
pairs <- compare_pairs(pairs, c("lastname", "firstname", "address", "sex"))
pairs <- score_simsum(pairs)

# Select pairs with a simsum > 0 and force one-to-one linkage
pairs <- select_n_to_m(pairs, 0, var = "ntom")
pairs <- select_greedy(pairs, 0, var = "greedy")

# Cross-tabulate the two selections to see where they agree
table(pairs[c("ntom", "greedy")])