FlexRL (version 0.1.0)

stEM: Stochastic Expectation Maximisation (StEM) for Record Linkage

Description

Stochastic Expectation Maximisation (StEM) for Record Linkage

Usage

stEM(
  data,
  StEMIter,
  StEMBurnin,
  GibbsIter,
  GibbsBurnin,
  musicOn = TRUE,
  newDirectory = NULL,
  saveInfoIter = FALSE
)

Value

A list with:

  • Delta, the summary of a sparse matrix, i.e. a data frame with 3 columns: the indices from the first data source A, the indices from the second data source B, and the non-zero posterior probability that the records associated with this pair of indices are linked. To obtain a valid set of linked records, select the pairs for which this probability is greater than 0.5. This threshold on the linkage probability is necessary to ensure the one-to-one assignment constraint of record linkage, which states that a record in one file can be linked to at most one record in the other file.

  • gamma, a vector with the chain of the parameter gamma representing the proportion of linked records as a fraction of the smallest file,

  • eta, a vector with the chain of the parameter eta representing the distribution of the PIVs,

  • alpha, a vector with the chain of the parameter alpha representing the hazard coefficient of the model for instability,

  • phi, a vector with the chain of the parameter phi representing the registration error parameters.

More details on the method can be found in our paper, in the experiments repository of our paper, and in the vignettes.
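As a minimal sketch of the thresholding step described for Delta, the snippet below filters a toy data frame that stands in for the Delta element of the returned list. The column names (idxA, idxB, proba) and the values are hypothetical; only the three-column layout comes from the description above, so the columns are accessed by position.

```r
# Toy stand-in for the Delta element of the stEM() output.
# Column names and values are hypothetical; only the 3-column layout
# (index in A, index in B, linkage probability) is taken from the text.
Delta <- data.frame(
  idxA  = c(1, 1, 2, 3, 4),
  idxB  = c(10, 11, 12, 13, 13),
  proba = c(0.62, 0.30, 0.91, 0.45, 0.55)
)

# Keep only the pairs whose posterior linkage probability exceeds 0.5.
linked <- Delta[Delta[[3]] > 0.5, ]

# Check the one-to-one constraint: no index of A or B appears twice.
stopifnot(anyDuplicated(linked[[1]]) == 0,
          anyDuplicated(linked[[2]]) == 0)

print(linked)
```

Accessing the columns by position avoids relying on column names that are not documented on this page.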

Arguments

data

A list with elements:

  • A: the smallest data source (encoded: the categorical values of the Partially Identifying Variables (PIVs) have to be mapped to sets of natural numbers and missing values are encoded as 0).

  • B: the largest data source (encoded).

  • Nvalues: A vector (of size number of PIVs) with the number of unique values per PIV (in the order of the PIVs defined in PIVs_config).

  • PIVs_config: A list (of size number of PIVs) where element names are the PIVs and element values are lists with elements: stable (boolean for whether the PIV is stable), conditionalHazard (boolean for whether external covariates are available to model instability, only required if stable is FALSE), pSameH.cov.A and pSameH.cov.B (vectors of strings with the names of the covariates from file A and file B to use to model instability, only required if stable is FALSE; empty vectors may be provided if conditionalHazard is FALSE).

  • controlOnMistakes: A vector (of size number of PIVs) of booleans indicating whether to bound the mistake probability of each PIV. If TRUE, the mistake probability for that PIV will not go above 10%. If FALSE, there is no bound on the probability of mistake. WATCH OUT: if you suspect that a variable is unstable but you do not have data to model its dynamics, the boolean value should be set to FALSE to allow the parameter for mistakes to adapt to the instability. However, if you model this instability, the boolean value should be set to TRUE to help the algorithm differentiate mistakes from changes over time.

  • sameMistakes: A boolean value for whether there should be a single parameter for the mistakes in A and B or whether each source should have its own parameter. Setting sameMistakes=TRUE is recommended for small data sources: in that case, estimation with 2 parameters will fail to capture the mistakes correctly, while 1 parameter is better suited.

  • phiMistakesAFixed: A vector (of size number of PIVs) of booleans indicating whether the parameters for mistakes should be fixed in case of instability. It should be FALSE, except for unstable PIVs for which it may be set to TRUE in order to avoid estimability problems between the parameter for mistakes and the parameter for changes across time.

  • phiMistakesBFixed: A vector (of size number of PIVs) of booleans indicating whether the parameters for mistakes should be fixed in case of instability. It should be FALSE, except for unstable PIVs for which it may be set to TRUE in order to avoid estimability problems between the parameter for mistakes and the parameter for changes across time.

  • phiForMistakesA: A vector (of size number of PIVs) of NA or fixed values for the parameters for mistakes. It should be NA, except for unstable PIVs for which one wants to fix the parameter to avoid estimability problems (as indicated by the boolean values in phiMistakesAFixed). In that case it should be set to the expected value of the probability of mistake. If you have no idea, you can set it to 0; the algorithm is quite robust to wrongly fixed parameters.

  • phiForMistakesB: A vector (of size number of PIVs) of NA or fixed values for the parameters for mistakes. It should be NA, except for unstable PIVs for which one wants to fix the parameter to avoid estimability problems (as indicated by the boolean values in phiMistakesBFixed). In that case it should be set to the expected value of the probability of mistake. If you have no idea, you can set it to 0; the algorithm is quite robust to wrongly fixed parameters.
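The encoding required for A and B (categorical PIV values mapped to natural numbers, missing values coded as 0) can be sketched as follows. This is only an illustration under assumptions: the raw data frames, the PIV names and the helper encodePIV are hypothetical and not part of FlexRL. The sketch also derives Nvalues from the levels observed in both sources.

```r
# Hypothetical raw data sources with two categorical PIVs.
rawA <- data.frame(sex    = c("F", "M", NA),
                   region = c("north", "south", "north"),
                   stringsAsFactors = FALSE)
rawB <- data.frame(sex    = c("M", "F", "F", "M"),
                   region = c("south", NA, "east", "north"),
                   stringsAsFactors = FALSE)
PIVs <- c("sex", "region")

# Hypothetical helper: map values to 1..K using a shared set of levels,
# and code missing values as 0.
encodePIV <- function(x, levels) {
  code <- match(x, levels)
  code[is.na(code)] <- 0L
  code
}

encodedA <- rawA
encodedB <- rawB
Nvalues  <- integer(length(PIVs))
for (i in seq_along(PIVs)) {
  piv <- PIVs[i]
  # Shared levels across both sources; sort() drops NA by default.
  lev <- sort(unique(c(rawA[[piv]], rawB[[piv]])))
  encodedA[[piv]] <- encodePIV(rawA[[piv]], lev)
  encodedB[[piv]] <- encodePIV(rawB[[piv]], lev)
  Nvalues[i] <- length(lev)
}

print(encodedA)
print(encodedB)
print(Nvalues)
```

Using one set of levels for both sources keeps a given code referring to the same category in A and in B, which is what comparing PIV values across files requires.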

StEMIter

An integer with the total number of iterations of the Stochastic EM algorithm (including the period to discard as burn-in)

StEMBurnin

An integer with the number of iterations of the Stochastic EM algorithm to discard as burn-in

GibbsIter

An integer with the total number of iterations of the Gibbs sampler (done in each iteration of the StEM) (including the period to discard as burn-in)

GibbsBurnin

An integer with the number of iterations of the Gibbs sampler to discard as burn-in
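Since both iteration counts include the burn-in period, the number of iterations actually retained is the total minus the burn-in. The sketch below shows that arithmetic and a generic way to drop a burn-in period from a chain. The helper keptIterations and the toy chain are hypothetical; whether the chains returned by stEM still contain the burn-in iterations is not stated on this page.

```r
# Hypothetical helper (not part of FlexRL): check the settings and
# return how many iterations remain after discarding the burn-in.
keptIterations <- function(nIter, nBurnin) {
  stopifnot(nIter > 0, nBurnin >= 0, nBurnin < nIter)
  nIter - nBurnin
}

keptStEM  <- keptIterations(nIter = 50, nBurnin = 30)  # StEMIter, StEMBurnin
keptGibbs <- keptIterations(nIter = 50, nBurnin = 30)  # GibbsIter, GibbsBurnin

# Generic burn-in removal on a toy chain (not output of stEM).
set.seed(1)
chain      <- cumsum(rnorm(50))
postBurnin <- chain[-seq_len(30)]

print(c(keptStEM = keptStEM, keptGibbs = keptGibbs))
print(length(postBurnin))
```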

musicOn

A boolean value. If TRUE, music is played when the algorithm finishes; this is useful as an alarm if you have to wait for the record linkage to run.

newDirectory

NULL, or a string with the name of (or path to) the directory (which should already exist) in which to save the environment variables at the end of each iteration. This is useful when record linkage takes very long, so that you do not lose everything and have to restart from scratch if your computer shuts down before record linkage is finished.

saveInfoIter

A boolean value indicating whether the environment variables should be saved at the end of each iteration. This is useful when record linkage takes very long, so that you do not lose everything and have to restart from scratch if your computer shuts down before record linkage is finished.
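Because the directory passed as newDirectory should already exist, one option is to create it before calling stEM. The sketch below uses only base R; the directory name is a hypothetical choice placed under the session's temporary directory, and the commented call shows how the two arguments would be passed.

```r
# Hypothetical location for the per-iteration saves.
outDir <- file.path(tempdir(), "stEM_checkpoints")

# Create the directory if it does not exist yet.
if (!dir.exists(outDir)) {
  dir.create(outDir, recursive = TRUE)
}
stopifnot(dir.exists(outDir))

# The directory can then be passed to stEM, for example:
# fit = stEM( data = data, StEMIter = 50, StEMBurnin = 30,
#             GibbsIter = 50, GibbsBurnin = 30, musicOn = FALSE,
#             newDirectory = outDir, saveInfoIter = TRUE )
```

For a long run, a persistent path is a better choice than tempdir(), since the temporary directory is removed when the R session ends.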

Examples

# \donttest{
PIVs_config = list( V1 = list(stable = TRUE),
                    V2 = list(stable = TRUE),
                    V3 = list(stable = TRUE),
                    V4 = list(stable = TRUE),
                    V5 = list( stable = FALSE,
                               conditionalHazard = FALSE,
                               pSameH.cov.A = c(),
                               pSameH.cov.B = c()) )
PIVs = names(PIVs_config)
PIVs_stable = sapply(PIVs_config, function(x) x$stable)
Nval = c(6, 7, 8, 9, 15)
NRecords = c(500, 800)
Nlinks = 300
PmistakesA = c(0.02, 0.02, 0.02, 0.02, 0.02)
PmistakesB = c(0.02, 0.02, 0.02, 0.02, 0.02)
PmissingA = c(0.007, 0.007, 0.007, 0.007, 0.007)
PmissingB = c(0.007, 0.007, 0.007, 0.007, 0.007)
moving_params = list(V1=c(),V2=c(),V3=c(),V4=c(),V5=c(0.28))
enforceEstimability = TRUE
DATA = DataCreation( PIVs_config,
                     Nval,
                     NRecords,
                     Nlinks,
                     PmistakesA,
                     PmistakesB,
                     PmissingA,
                     PmissingB,
                     moving_params,
                     enforceEstimability)
A                    = DATA$A
B                    = DATA$B
Nvalues              = DATA$Nvalues

encodedA = A
encodedB = B

encodedA[,PIVs][ is.na(encodedA[,PIVs]) ] = 0
encodedB[,PIVs][ is.na(encodedB[,PIVs]) ] = 0

data = list( A = encodedA,
             B = encodedB,
             Nvalues = Nvalues,
             PIVs_config = PIVs_config,
             controlOnMistakes = c(TRUE,TRUE,FALSE,FALSE,FALSE),
             sameMistakes = TRUE,
             phiMistakesAFixed = FALSE,
             phiMistakesBFixed = FALSE,
             phiForMistakesA = c(NA,NA,NA,NA,NA),
             phiForMistakesB = c(NA,NA,NA,NA,NA))
fit = stEM( data = data,
            StEMIter = 50,
            StEMBurnin = 30,
            GibbsIter = 50,
            GibbsBurnin = 30,
            musicOn = TRUE,
            newDirectory = NULL,
            saveInfoIter = FALSE )
# }
