Learn R Programming

corlink (version 1.0.0)

linkd: Function to impute missing agreement patterns and then to link data

Description

Function to impute missing agreement patterns and then to link data

Usage

linkd(d, initial_m = NULL, initial_u = NULL, p_init = 0.5, fixed_col = NULL, alg = "m")

Arguments

d
Matrix of agreement patterns with final column counting the number of times that pattern was observed. See Details
initial_m
starting probabilities for per-field agreement in record pairs, both records being generated from the same individual. Defaults to NULL
initial_u
starting probabilities for per-field agreement in record pairs, with the two records being generated from differing individuals Defaults to NULL
p_init
starting probability that both records for a randomly selected record pair is associated with the same individual
fixed_col
vector indicating columns that are not to be updated in initial EM algorithm. Useful if good prior estimates of the mis-match probabilities. See details
alg
character; see Details

Value

A list, the first component is a matrix - the posterior probabilities of being a true match is the last column, the second component are the fitted models used to generate the predicted probabilities

Details

d is a numeric matrix with N rows corresponding to N record pairs, and L+1 columns the first L of which show the field agreement patterns observed over the record pairs, and the last column the total number of times that pattern was observed in the database. The code 0 is used for a field that differs for two record, 1 for a field that agrees, and 2 for a missing field. fixed_col indicates the components of the u vector (per field probabilities of agreement for 2 records from differing individuals) that are not to be updated when applying the EM algorithm to estimate components of the Feligi Sunter model. alg has four possible values. The default 'm' fits a log-linear model for the agreement counts only within the record pairs that corresponds to the same individual, 'b' fits differing log-linear models for the 2 clusters, 'i' corresponds to the original Feligi Sunter algorithm, with probabilities estimated via the EM algorithm, 'a' fits all the previously listed models

Examples

Run this code

# Simulate data
m_probs <- rep(0.8,6)
u_probs <- rep(0.2,6)
means_match <- -1*qnorm(1-m_probs)
means_mismatch <- -1*qnorm(1-u_probs)
missingprobs <- rep(.2,6)
thedata <- do_sim(cor_match=0.2,cor_mismatch=0,nsample=10^4,pi_match=.5,
m_probs=rep(0.8,5),u_probs=rep(0.2,5),missingprobs=rep(0.4,5))
colnames(thedata) <- c(paste("V",1:5,sep="_"),"count")
output <- linkd(thedata)
output$fitted_probs

Run the code above in your browser using DataLab