recordLink: Probabilistic Patient Record Linkage

Description

Probabilistic Patient Record Linkage

Usage

recordLink(
  data1,
  data2,
  dates1 = NULL,
  dates2 = NULL,
  eps_plus,
  eps_minus,
  aggreg_2ways = "mean",
  min_prev = 0.01,
  data1_cont2diff = NULL,
  data2_cont2diff = NULL,
  d_max,
  use_diff = TRUE
)

Arguments

data1

either a binary (1 or 0 values only) matrix or binary data frame of dimension n1 x K whose rownames are the observation identifiers.

data2

either a binary (1 or 0 values only) matrix or a binary data frame of dimension n2 x K whose rownames are the observation identifiers. Columns should be in the same order as in data1.

dates1

matrix or data frame of dimension n1 x K containing the concatenated date intervals for each corresponding diagnosis code in data1. Default is NULL, in which case dates are not used. See details.

dates2

matrix or data frame of dimension n2 x K containing the concatenated date intervals for each corresponding diagnosis code in data2. Default is NULL, in which case dates are not used. See details.

eps_plus

discrepancy rate between data1 and data2

eps_minus

discrepancy rate between data2 and data1

aggreg_2ways

a character string indicating how to merge the two posterior probability matrices obtained for each of the 2 databases. Five possibilities are currently implemented: "maxnorm", "max", "min", "mean" and "prod". Default is "mean".

min_prev

minimum prevalence for the variables used in matching. Default is 1%.

data1_cont2diff

either a matrix or data frame of continuous features, such as age, for which the similarity measure uses the difference with data2_cont2diff, whose rownames are the observation identifiers. Default is NULL.

data2_cont2diff

either a matrix or data frame of continuous features, such as age, for which the similarity measure uses the difference with data1_cont2diff, whose rownames are the observation identifiers. Default is NULL.

d_max

a numeric vector of length K giving the minimum difference from which two values are considered a discrepancy.

use_diff

logical flag indicating whether continuous differentiable variables should be used in the

Value

a matrix of size n1 x n2 containing the posterior probability of matching for each of the n1*n2 pairs
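
As a rough illustration of what the aggreg_2ways options could mean, the sketch below merges two toy posterior matrices element by element. It is not the package's internal code: the helper merge_two_ways and the matrices post_a and post_b are hypothetical, the element-wise reading of "mean", "max", "min" and "prod" is an assumption based on the option names, and "maxnorm" is left out because its definition is not given here.

```r
# Hypothetical illustration only: two toy posterior matrices for the same
# 2 x 3 set of candidate pairs (rows = data1 records, columns = data2 records).
post_a <- matrix(c(0.90, 0.10, 0.05,
                   0.20, 0.70, 0.30), nrow = 2, byrow = TRUE)
post_b <- matrix(c(0.80, 0.20, 0.15,
                   0.10, 0.90, 0.10), nrow = 2, byrow = TRUE)

# Element-wise merging rules (an assumed reading of the option names,
# not taken from the package source).
merge_two_ways <- function(a, b, how = c("mean", "max", "min", "prod")) {
  how <- match.arg(how)
  stopifnot(identical(dim(a), dim(b)))
  switch(how,
         mean = (a + b) / 2,
         max  = pmax(a, b),
         min  = pmin(a, b),
         prod = a * b)
}

merged_mean <- merge_two_ways(post_a, post_b, "mean")
merged_min  <- merge_two_ways(post_a, post_b, "min")
```

Under this reading, "min" and "prod" are the more conservative choices, since a pair only scores high when both matrices agree, while "max" keeps a pair that either matrix supports.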

Details

Dates: when dates1 and dates2 are supplied, an agreement on a diagnosis code between data1 and data2 requires that at least one date interval matches across dates1 and dates2, in addition to that same code being recorded in both.
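
The date rule above can be sketched for a single code and a single pair of records. This is a minimal sketch under stated assumptions, not the package's implementation: the helper agree_on_code is hypothetical, and the date intervals are assumed to be available as character vectors of interval labels, since the exact concatenated format expected in dates1 and dates2 is not described here.

```r
# Hypothetical helper, not part of the package: decides agreement on ONE
# diagnosis code for ONE pair of records.
#   code1, code2           : 0/1 indicators of the code in each record
#   intervals1, intervals2 : character vectors of date-interval labels
#                            (assumed representation, already split out)
agree_on_code <- function(code1, code2, intervals1, intervals2) {
  both_recorded   <- code1 == 1 && code2 == 1
  shared_interval <- length(intersect(intervals1, intervals2)) > 0
  both_recorded && shared_interval
}

# Code recorded in both records, with one shared interval
agree_on_code(1, 1, c("2010-Q1", "2011-Q3"), c("2011-Q3"))  # TRUE
# Code recorded in both records, but no shared interval
agree_on_code(1, 1, c("2010-Q1"), c("2012-Q2"))             # FALSE
# Code missing from the second record
agree_on_code(1, 0, c("2010-Q1"), character(0))             # FALSE
```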

References

Hejblum BP, Weber G, Liao KP, Palmer N, Churchill S, Szolovits P, Murphy S, Kohane I and Cai T (2019). Probabilistic Record Linkage of De-Identified Research Datasets Using Diagnosis Codes. Scientific Data, 6:180298. doi:10.1038/sdata.2018.298.

Examples

# NOT RUN {
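set.seed(123)
ncodes <- 500  # number of diagnosis codes (columns)
npat <- 200    # number of simulated patients
# Per-code incidence, drawn around 0.15
incid <- abs(rnorm(n=ncodes, 0.15, 0.07))
# Simulate a binary patient x code matrix (one row per patient)
bin_codes <- rbinom(n=npat*ncodes, size=1,  prob=rep(incid, npat))
bin_codes_mat <- matrix(bin_codes, ncol=ncodes, byrow = TRUE)
# data1 holds patients 1 to 120; data2 holds patients 1 to 20 and 121 to 200,
# so the first 20 patients appear in both datasets
data1_ex <- bin_codes_mat[1:(npat/2+npat/10),]
data2_ex <- bin_codes_mat[c(1:(npat/10), (npat/2+npat/10 + 1):npat), ]
rownames(data1_ex) <- paste0("ID", 1:(npat/2+npat/10), "_data1")
rownames(data2_ex) <- paste0("ID", c(1:(npat/10), (npat/2+npat/10 + 1):npat), "_data2")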

if(interactive()){
res <- recordLink(data1 = data1_ex, data2 = data2_ex, 
                 use_diff = FALSE, eps_minus = 0.01, eps_plus = 0.01)
round(res[c(1:3, 19:23), c(1:3, 19:23)], 3)
}

# }
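
One possible way to read the returned matrix is to keep, for each record of data1, its most probable partner in data2 when the posterior exceeds a cut-off. The sketch below uses a small hand-made matrix in place of a real recordLink() result. The objects post, threshold and links are illustrative names, and the 0.5 cut-off is an arbitrary assumption rather than a recommendation from the package.

```r
# Toy posterior matrix standing in for the output of recordLink()
post <- matrix(c(0.95, 0.02, 0.01,
                 0.03, 0.10, 0.05,
                 0.01, 0.04, 0.88),
               nrow = 3, byrow = TRUE,
               dimnames = list(paste0("ID", 1:3, "_data1"),
                               paste0("ID", 1:3, "_data2")))

threshold <- 0.5  # arbitrary cut-off chosen for this illustration

# Column index and value of the largest posterior in each row
best_col  <- apply(post, 1, which.max)
best_prob <- post[cbind(seq_len(nrow(post)), best_col)]

# Candidate links, keeping only pairs above the cut-off
links <- data.frame(
  id_data1  = rownames(post),
  id_data2  = colnames(post)[best_col],
  posterior = best_prob,
  stringsAsFactors = FALSE
)
links <- links[links$posterior > threshold, ]
links
```

This row-wise rule does not prevent two data1 records from being linked to the same data2 record, so a one-to-one assignment would need an extra step.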
