diyar (version 0.4.3)

links: Multistage and nested record linkage

Description

Assign unique identifiers to records based on multiple stages of different match criteria.

Usage

links(
  criteria,
  sub_criteria = NULL,
  sn = NULL,
  strata = NULL,
  data_source = NULL,
  data_links = "ANY",
  display = "none",
  group_stats = FALSE,
  expand = TRUE,
  shrink = FALSE,
  recursive = FALSE,
  check_duplicates = FALSE,
  tie_sort = NULL,
  batched = "yes",
  repeats_allowed = FALSE,
  permutations_allowed = FALSE,
  ignore_same_source = FALSE
)

Value

pid; list

Arguments

criteria

[list|atomic]. Attributes to be compared. Each element of the list is a stage in the linkage process. See Details.

sub_criteria

[list|sub_criteria]. Additional match criteria. Each must be paired with a stage of the linkage process (criteria). See sub_criteria.

sn

[integer]. Unique record identifier. Useful for creating familiar pid identifiers.

strata

[atomic]. Subsets of the dataset. Record-groups are created separately for each stratum. See Details.

data_source

[character]. Data source identifier. Adds the list of data sources in each record-group to the pid. Useful when the data is from multiple sources.

data_links

[list|character]. Data sources (data_source values) required in each pid. A record-group without records from these data sources will be unlinked. See Details.

display

[character]. Display a status update or generate a status report. Options are: "none" (default), "progress", "stats", "none_with_report", "progress_with_report" or "stats_with_report".

group_stats

[logical]. If TRUE, return group-specific information, such as record counts, for each pid. The default shown in Usage is FALSE.

expand

[logical]. If TRUE, a record-group gains new records if a match is found at the next stage of the linkage process. Not interchangeable with shrink.

shrink

[logical]. If TRUE, a record-group loses existing records if no match is found at the next stage of the linkage process. Not interchangeable with expand.

recursive

[logical]. If TRUE, within each iteration of the process, a match can spawn new matches. Ignored when batched is "no".

check_duplicates

[logical]. If TRUE, within each iteration of the process, duplicate values of an attribute are not checked. Instead, the outcome of the logical test on the first instance of a value is recycled for its duplicates. Ignored when batched is "no".

tie_sort

[atomic]. Preferential order for breaking ties within an iteration.

batched

[character]. Determines if record-pairs are created and compared in batches. Options are "yes" (default) or "no".

repeats_allowed

[logical]. If TRUE, record-pairs with repeat values are created and compared. Ignored when batched is "yes".

permutations_allowed

[logical]. If TRUE, permutations of record-pairs are created and compared. Ignored when batched is "yes".

ignore_same_source

[logical]. If TRUE, only record-pairs with a different data_source are created and compared.

Details

The priority of matches decreases with each subsequent stage of the linkage process, i.e. earlier stages (criteria) are considered superior. It is therefore important to list the criteria in order of decreasing relevance.

Records with missing criteria (NA values) are skipped at their respective stage, while records with missing strata (NA) are skipped at every stage.

If a record is skipped, another attempt will be made to match the record at the next stage. If a record does not match any other record by the end of the linkage process (or its strata value is missing), it is assigned to a unique record-group.

A sub_criteria can be used to introduce additional and/or nested matching conditions at each stage of the linkage process. This results in only records with a matching criteria and sub_criteria being linked.

In links, each sub_criteria must be linked to a criteria. This is done by adding the sub_criteria to a named element of a list. Each element's name must correspond to a stage. For example, the list for 3 sub_criteria linked to criteria 1, 5 and 13 will be:

list(cr1 = sub_criteria(...), cr5 = sub_criteria(...), cr13 = sub_criteria(...))

Any unlinked sub_criteria will be ignored.

sub_criteria objects themselves can be nested.

By default, attributes in a sub_criteria are compared for an exact_match. However, user-defined functions are also permitted.
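As an illustration of a user-defined test, the sketch below defines a hypothetical helper, same_text (not part of diyar), that treats two values as a match when they are equal after ignoring case and surrounding whitespace. It assumes, as the age_diff function in the Examples section does, that a match function takes two arguments, x and y, and returns a logical vector. The diyar calls run only if the package is installed.

```r
# Hypothetical match function (not part of diyar):
# TRUE when both values are present and equal after
# ignoring case and surrounding whitespace.
same_text <- function(x, y) {
  both_present <- !is.na(x) & !is.na(y)
  clean <- function(v) tolower(trimws(as.character(v)))
  # FALSE & NA is FALSE, so pairs with a missing value never match
  both_present & clean(x) == clean(y)
}

# Quick check of the function on its own
check <- same_text(c("Smith ", "JONES", NA), c("smith", "Jonas", "x"))

# Assumed usage: pass the function through match_funcs, as in Examples
if (requireNamespace("diyar", quietly = TRUE)) {
  library(diyar)
  data(patient_records)
  name_cond <- sub_criteria(patient_records$middlename,
                            match_funcs = list(same_text))
  pids_custom <- links(criteria = patient_records$surname,
                       sub_criteria = list(cr1 = name_cond))
}
```

Returning FALSE rather than NA for missing values mirrors the `!is.na(diff)` guard in age_diff, so an incomplete record-pair is never counted as a match.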

Every element in data_links must be named "l" (links) or "g" (groups). Unnamed elements of data_links will be assumed to be "l".

  • If named "l", only groups with records from every listed data_source will remain linked.

  • If named "g", only groups with records from any listed data_source will remain linked.
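The difference between "l" and "g" can be sketched in base R. This is only an illustration of the rule as described above, not diyar's implementation; group_sources and required are made-up names and values.

```r
# Made-up example: the data sources present in three record-groups
group_sources <- list(g1 = c("A", "B"), g2 = "A", g3 = "C")

# Data sources listed in data_links
required <- c("A", "B")

# "l": a group stays linked only if it has records from EVERY listed source
stays_linked_l <- vapply(group_sources,
                         function(s) all(required %in% s), logical(1))

# "g": a group stays linked if it has records from ANY listed source
stays_linked_g <- vapply(group_sources,
                         function(s) any(required %in% s), logical(1))

# The corresponding arguments would be built as named lists
data_links_l <- list(l = required)
data_links_g <- list(g = required)
```

Under the "l" rule only g1 would remain linked, while under the "g" rule both g1 and g2 would.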

See vignette("links") for more information.

See Also

links_sv_probabilistic; episodes; partitions; predefined_tests; sub_criteria; schema

Examples

library(diyar)
data(patient_records)
# An exact match on surname followed by an exact match on forename
stages <- as.list(patient_records[c("surname", "forename")])
pids_1 <- links(criteria = stages)

# An exact match on forename followed by an exact match on surname
pids_2 <- links(criteria = rev(stages))

# Nested matches
# Same sex OR year of birth
multi_cond1 <- sub_criteria(format(patient_records$dateofbirth, "%Y"),
                           patient_records$sex,
                           operator = "or")

# Same middle name AND dates of birth no more than 10 years (3,650 days) apart
age_diff <- function(x, y){
  diff <- abs(as.numeric(x) - as.numeric(y))
  wgt <-  diff %in% 0:(365 * 10) & !is.na(diff)
  wgt
}
multi_cond2 <- sub_criteria(patient_records$dateofbirth,
                           patient_records$middlename,
                           operator = "and",
                           match_funcs = c(age_diff, exact_match))

# 'multi_cond1' OR 'multi_cond2'
nested_cond1 <- sub_criteria(multi_cond1,
                             multi_cond2,
                             operator = "or")

# Record linkage with nested conditions
pids_3 <- links(criteria = stages,
                sub_criteria = list(cr1 = multi_cond1,
                                    cr2 = multi_cond2))

# Record linkage with multiple (two) layers of nested conditions
pids_4 <- links(criteria = stages,
                sub_criteria = list(cr1 = nested_cond1,
                                    cr2 = nested_cond1))

# Record linkage without group expansion
pids_5 <- links(criteria = stages,
                sub_criteria = list(cr1 = multi_cond1,
                                    cr2 = multi_cond2),
                expand = FALSE)

# Record linkage with shrinking record groups
pids_6 <- links(criteria = stages,
                sub_criteria = list(cr1 = multi_cond1,
                                    cr2 = multi_cond2),
                shrink = TRUE)