links: Multistage deterministic record linkage

Description

Link records in ordered stages with flexible matching conditions.

Usage

links(
  criteria,
  sub_criteria = NULL,
  sn = NULL,
  strata = NULL,
  data_source = NULL,
  data_links = "ANY",
  display = "progress",
  group_stats = FALSE,
  expand = TRUE,
  shrink = FALSE
)
record_group(df, ..., to_s4 = TRUE)

Arguments

criteria

list of attributes to compare at each stage. Comparisons are exact matches (==). See Details.

sub_criteria

list of additional attributes to compare at each stage. Comparisons are done as an exact match or with a user-defined logical test (function). See sub_criteria.

sn

Unique numerical record identifier. Useful for creating familiar episode identifiers.

strata

Subsets. Record groups are tracked separately within each subset.

data_source

Unique data source identifier. Useful when the dataset contains data from multiple sources.

data_links

A set of data_sources required in each record group. Strata without records from these data sources will be skipped, and record groups without records from them will be unlinked. See Details.

display

The messages printed on screen. Options are "none", "progress" (default, as shown in Usage) for a progress update, or "stats" for a more detailed breakdown of the linkage process.

group_stats

If TRUE, group-specific information like record counts is returned. The default shown in Usage is FALSE. See Value.

expand

If TRUE, allows increases in the size of a record group at subsequent stages of the linkage process.

shrink

If TRUE, allows reductions in the size of a record group at subsequent stages of the linkage process.

df

data.frame. One or more datasets appended together. See Details.

...

Arguments passed to links

to_s4

Data type of returned object. pid (TRUE) or data.frame (FALSE).

Value

pid object (or data.frame if to_s4 is FALSE)

sn - unique record identifier as provided (or generated)
pid | .Data - unique group identifier
link_id - unique record identifier of matching records
pid_cri - matching criteria
pid_dataset - data sources in each group
pid_total - number of records in each group
iteration - iteration of the process when each record was linked to its record group

Details

links() performs an ordered multistage deterministic linkage. The relevance or priority of each stage is determined by the order in which they have been listed.

sub_criteria specifies additional matching conditions for each stage (criteria) of the process. If sub_criteria is not NULL, only records with matching criteria and sub_criteria are linked. If a record has a missing value for the criteria at a given stage, that record is skipped at that stage, and another attempt is made at the next stage. If a record has no match at any stage, it is assigned a unique group ID.
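The staged behaviour described above can be sketched in base R. This is only a simplified illustration, not the algorithm used by links(): link_sketch() is a hypothetical helper, it handles criteria only (no sub_criteria, strata or data_links), and it never merges two existing groups.

```r
# Hypothetical helper for illustration only; not part of diyar.
link_sketch <- function(criteria) {
  n <- length(criteria[[1]])
  group_id <- rep(NA_integer_, n)  # NA means "not linked yet"
  stage <- rep(0L, n)              # 0 means "no match at any stage"
  for (i in seq_along(criteria)) {
    attr_i <- criteria[[i]]
    # Records with a missing value are skipped at this stage
    for (v in unique(attr_i[!is.na(attr_i)])) {
      idx <- which(!is.na(attr_i) & attr_i == v)
      if (length(idx) < 2) next    # nothing to link to
      linked <- idx[!is.na(group_id[idx])]
      # Earlier stages take priority: reuse an existing group if there is one,
      # otherwise start a new group named after its first record
      gid <- if (length(linked) > 0) group_id[linked[1]] else idx[1]
      new <- idx[is.na(group_id[idx])]
      group_id[new] <- gid
      stage[new] <- i
    }
  }
  # Records without a match at any stage get their own group ID
  single <- which(is.na(group_id))
  group_id[single] <- single
  data.frame(group_id = group_id, stage = stage)
}

forename <- c("Ann", "Ann", NA, "Bob", "Cy")
surname  <- c("Lee", "Kay", "Kay", "Moe", "Poe")
res <- link_sketch(list(forename, surname))
res
```

In this toy data, record 3 has a missing forename, so it is skipped at stage 1 and picked up at stage 2 through its surname.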

By default, records are compared for an exact match. However, user-defined logical tests (functions) are also permitted. Such a function must be able to compare two atomic vectors and return either TRUE or FALSE. It must have two arguments: x for the attribute and y for what it will be compared against.
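As a rough illustration of the shape such a function takes, the two tests below are hypothetical examples written in base R; neither is supplied by diyar. Each takes x and y and returns a logical vector.

```r
# Hypothetical test: TRUE when `y` is 0 to 5 units above `x`
within_5 <- function(x, y) (y - x) %in% 0:5

# Hypothetical test: equality that ignores letter case
same_ignoring_case <- function(x, y) tolower(x) == tolower(y)

age_check  <- within_5(c(30, 28, 40), c(33, 40, 40))
name_check <- same_ignoring_case(c("Ojay", "JAMES"), c("ojay", "Jon"))
age_check
name_check
```

Note that == returns NA when either value is missing, so a test built on it may need explicit NA handling to meet the TRUE-or-FALSE requirement, whereas %in% returns FALSE in that case.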

A match at each stage is considered more relevant than a match at the next stage. Therefore, criteria should always be listed in order of decreasing relevance.

data_source - including this populates the pid_dataset slot. See Value.

data_links should be a list of atomic vectors with every element named "l" (links) or "g" (groups).

"l" - Record groups with records from every listed data source will be retained.
"g" - Record groups with records from any listed data source will be retained.

data_links is useful for skipping record groups that are not required.
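The difference between "l" and "g" can be sketched in base R as an all-versus-any check on the data sources present in each group. keep_group() is a hypothetical helper and the group IDs and sources are made-up values; this is not how diyar implements the option.

```r
# Hypothetical helper for illustration only; not part of diyar.
keep_group <- function(sources_in_group, required, type = c("l", "g")) {
  type <- match.arg(type)
  found <- required %in% sources_in_group
  # "l": every listed source must be present; "g": any one is enough
  if (type == "l") all(found) else any(found)
}

group_id    <- c(1, 1, 2, 2, 3)
data_source <- c("A", "B", "A", "A", "C")
required    <- c("A", "B")

by_group <- split(data_source, group_id)
keep_l <- vapply(by_group, keep_group, logical(1), required = required, type = "l")
keep_g <- vapply(by_group, keep_group, logical(1), required = required, type = "g")
rbind(l = keep_l, g = keep_g)
```

Under these assumptions, group 1 (sources A and B) is retained either way, group 2 (source A only) is retained only under "g", and group 3 (source C only) is dropped under both.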

record_group() as it existed before v0.2.0 has been retired. It now exists only to support previous code and arguments with minimal disruption. Please use links() moving forward.

See vignette("links") for more information.

Examples

# NOT RUN {
library(diyar)
# Exact match
links(criteria = c("Obinna","James","Ojay","James","Obinna"))

# User-defined tests using `sub_criteria()`
# Matching `sex` and an age gap (`y - x`) of 0 to 20 years
age <- c(30, 28, 40, 25, 25, 29, 27)
sex <- c("M", "M", "M", "F", "M", "M", "F")
f1 <- function(x, y) (y - x) %in% 0:20
links(criteria = sex,
      sub_criteria = list(s1 = sub_criteria(age, funcs = f1)))

# Multistage linkage
# Relevance of matches: `forename` > `surname`
data(staff_records); staff_records
links(criteria = list(staff_records$forename, staff_records$surname),
      data_source = staff_records$sex)

# Relevance of matches:
# `staff_id` > `age` AND (`initials` OR `hair_colour` OR `branch_office`)
data(missing_staff_id); missing_staff_id
links(criteria = list(missing_staff_id$staff_id, missing_staff_id$age),
      sub_criteria = list(s2 = sub_criteria(missing_staff_id$initials,
                                          missing_staff_id$hair_colour,
                                          missing_staff_id$branch_office)),
      data_source = missing_staff_id$source_1)

# }
