links: Multistage deterministic record linkage

Description

Match records in consecutive stages with different matching criteria. Each set of linked records is assigned a unique identifier with relevant group-level information.

Usage

links(
  criteria,
  sub_criteria = NULL,
  sn = NULL,
  strata = NULL,
  data_source = NULL,
  data_links = "ANY",
  display = "none",
  group_stats = FALSE,
  expand = TRUE,
  shrink = FALSE,
  recursive = FALSE,
  check_duplicates = FALSE,
  tie_sort = NULL
)

Arguments

criteria

[list|atomic]. Attributes to compare. Each element of the list is a stage in the linkage process. See Details.

sub_criteria

[list|sub_criteria]. Additional matching criteria for each stage of the linkage process. See sub_criteria.

sn

[integer]. Unique record identifier. Useful for creating familiar pid identifiers.

strata

[atomic]. Subsets of the dataset. Record-groups are created separately for each stratum. See Details.

data_source

[character]. Data source identifier. Adds the list of data sources in each record-group to the pid. Useful when the data is from multiple sources.

data_links

[list|character]. A set of data_sources required in each pid. A record-group without records from these data_sources will be unlinked. See Details.

display

[character]. Display or produce a status update. Options are: "none" (default), "progress", "stats", "none_with_report", "progress_with_report" or "stats_with_report".

group_stats

[logical]. If TRUE, return group-specific information, such as record counts, for each pid. The default shown in Usage is FALSE.

expand

[logical]. If TRUE, allows a record-group to expand with each subsequent stage of the linkage process. Not interchangeable with shrink.

shrink

[logical]. If TRUE, forces a record-group to shrink with each subsequent stage of the linkage process. Not interchangeable with expand.

recursive

[logical]. If TRUE, within each iteration of the process, a match can spawn new matches.

check_duplicates

[logical]. If TRUE, within each iteration of the process, duplicate values of an attribute are not checked. The outcome of the logical test on the first instance of the value will be recycled for the duplicate values.

tie_sort

[atomic]. Preferential order for breaking tied matches within a stage.

Value

pid; list

Details

Match priority decreases with each subsequent stage of the linkage process, i.e. earlier stages (criteria) are considered superior. Therefore, criteria should be listed in order of decreasing relevance.

Records with missing criteria (NA) are skipped at each stage, while records with missing strata (NA) are skipped from the entire linkage process.

If a record is skipped, another attempt will be made to match the record at the next stage. If a record does not match any other record by the end of the linkage process (or it has a missing strata), it is assigned to a unique record-group.
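The handling of missing values described above can be tried on a small made-up dataset. This is a minimal sketch, assuming the diyar package is installed; the vectors `forename` and `region` are hypothetical and not part of the package.

```r
library(diyar)

# Hypothetical data: six records with some missing values
forename <- c("Ann", "Ann", NA, "Bob", "Bob", "Ann")
region   <- c("N",   "N",   "N", "S",  NA,    "S")

# One stage (forename), with linkage done separately within each region
p <- links(criteria = forename, strata = region)

# Record 3 has a missing criteria and record 5 has a missing strata,
# so, per the rules above, neither should be linked to another record.
as.data.frame(p)
```

Under the rules above, records 1 and 2 would be expected to share a record-group, while record 6 stays separate from them because it belongs to a different strata.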

A sub_criteria can be used to request additional matching conditions for each stage of the linkage process. When used, only records with a matching criteria and sub_criteria are linked.

In links, each sub_criteria must be linked to a criteria. This is done by adding each sub_criteria to a named element of a list. Each element's name must correspond to a stage. For example, the following links 3 sub_criteria to criteria 1, 5 and 13:

list("cr1" = sub_criteria(...), "cr5" = sub_criteria(...), "cr13" = sub_criteria(...))

sub_criteria can be nested to achieve nested conditions.

A sub_criteria can be linked to different criteria but any unlinked sub_criteria will be ignored.
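Nesting can be sketched as follows. This is an illustration only: it assumes the diyar package is installed and that `sub_criteria()` accepts an `operator` argument ("and" or "or"), which is not shown on this page. The data vectors are hypothetical.

```r
library(diyar)

# Hypothetical attributes for four records of the same age
age      <- c(30, 30, 30, 30)
initials <- c("AB", "AB", "CD", "CD")
hair     <- c("red", "red", "red", "black")
office   <- c("X", "Y", "Y", "Y")

# Intended condition: initials AND (hair OR office)
# `operator` is an assumption here; check ?sub_criteria for your version.
nested <- sub_criteria(
  initials,
  sub_criteria(hair, office, operator = "or"),
  operator = "and"
)

# Link the nested sub_criteria to stage 1 via the name "cr1"
p <- links(criteria = age, sub_criteria = list(cr1 = nested))
p
```

A sub_criteria stored under any name that does not correspond to a stage (for example "cr2" when there is only one criteria) would be ignored, as noted above.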

By default, attributes in a sub_criteria are compared for an exact_match. However, user-defined functions are also permitted. Each such function must meet three requirements:

It must be able to compare the attributes.
It must have two arguments named `x` and `y`, where `y` is the value for one observation being compared against all other observations (`x`).
It must return a logical object i.e. TRUE or FALSE.
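A function meeting these three requirements might look like the sketch below. `age_within()` and `within_5()` are hypothetical helpers written for illustration; they are not part of diyar.

```r
# Hypothetical factory: builds a match function that is TRUE when two
# ages are no more than `max_gap` years apart.
age_within <- function(max_gap) {
  function(x, y) {
    d <- abs(x - y)
    # Treat missing ages as non-matches so the result is always TRUE/FALSE
    !is.na(d) & d <= max_gap
  }
}

within_5 <- age_within(5)

# `y` is one observation compared against all observations in `x`
res <- within_5(x = c(30, 28, 40, 25, NA), y = 30)
res
```

Such a function could then be supplied through the `match_funcs` argument of `sub_criteria()`, in the same way `f1` is used in the Examples.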

Every element in data_links must be named "l" (links) or "g" (groups). Unnamed elements of data_links will be assumed to be "l".

If named "l", only groups with records from every listed data_source will remain linked.
If named "g", only groups with records from any listed data_source will remain linked.
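The difference between "l" and "g" can be explored with a small made-up example. This is a sketch, assuming the diyar package is installed; `id` and `src` are hypothetical vectors.

```r
library(diyar)

# Hypothetical data: id 1 appears in sources A and B, id 2 only in A,
# and id 3 only in B
id  <- c(1, 1, 2, 2, 3)
src <- c("A", "B", "A", "A", "B")

# "l": keep only groups containing records from BOTH A and B
p_l <- links(criteria = id, data_source = src,
             data_links = list(l = c("A", "B")))

# "g": keep only groups containing records from B
p_g <- links(criteria = id, data_source = src,
             data_links = list(g = "B"))

as.data.frame(p_l)
as.data.frame(p_g)
```

In both calls, the two records with `id` 2 come only from source A, so under the rules above that group would be expected to be unlinked.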

See vignette("links") for more information.

Examples

# NOT RUN {
# Exact match
attr_1 <- c(1, 1, 1, NA, NA, NA, NA, NA)
attr_2 <- c(NA, NA, 2, 2, 2, NA, NA, NA)
links(criteria = list(attr_1, attr_2))

# User-defined tests using `sub_criteria()`
# Matching `sex` and a 20-year age range
age <- c(30, 28, 40, 25, 25, 29, 27)
sex <- c("M", "M", "M", "F", "M", "M", "F")
f1 <- function(x, y) abs(y - x) %in% 0:20
links(criteria = sex,
      sub_criteria = list(cr1 = sub_criteria(age, match_funcs = f1)))

# Multistage matches
# Relevance of matches: `forename` > `surname`
data(staff_records); staff_records
links(criteria = list(staff_records$forename, staff_records$surname),
      data_source = staff_records$sex)

# Relevance of matches:
# `staff_id` > `age` (AND (`initials`, `hair_colour` OR `branch_office`))
data(missing_staff_id); missing_staff_id
links(criteria = list(missing_staff_id$staff_id, missing_staff_id$age),
      sub_criteria = list(cr2 = sub_criteria(missing_staff_id$initials,
                                          missing_staff_id$hair_colour,
                                          missing_staff_id$branch_office)),
      data_source = missing_staff_id$source_1)

# Group expansion
match_cri <- list(c(1,NA,NA,1,NA,NA),
                  c(1,1,1,2,2,2),
                  c(3,3,3,2,2,2))
links(criteria = match_cri, expand = TRUE)
links(criteria = match_cri, expand = FALSE)
links(criteria = match_cri, shrink = TRUE)

# }