nameReweight: nameReweight

Description

Reweights posterior match probabilities to account for the observed frequency of first names. The posterior probability of a match is downweighted when the first name is common and upweighted when it is uncommon.
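The adjustment depends on how often each first name occurs in the data. As a minimal sketch in base R, the observed frequencies can be tabulated as below. The toy data frame and the column name `firstname` are assumptions for illustration, and the code does not reproduce the reweighting that nameReweight() itself applies.

```r
# Hypothetical toy data; the column name `firstname` is an assumption.
dfA <- data.frame(
  firstname = c("john", "john", "john", "mary", "zebulon"),
  stringsAsFactors = FALSE
)

# Relative frequency of each distinct first name in the dataset.
name_freq <- prop.table(table(dfA$firstname))

# Frequency attached to each record: common names get large values,
# uncommon names get small ones.
rec_freq <- as.numeric(name_freq[dfA$firstname])

print(data.frame(dfA, freq = rec_freq))
```

Records with a high `freq` are the ones whose match probability would be pushed down, and records with a low `freq` are the ones that would be pushed up.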

Usage

nameReweight(dfA, dfB, EM, gammalist, matchesLink,
varnames, stringdist.match, partial.match,
firstname.field, threshold.match, stringdist.method, cut.a, cut.p,
jw.weight, n.cores)

Arguments

dfA

The full version of dataset A that is being matched.

dfB

The full version of dataset B that is being matched.

EM

The EM object from emlinkMARmov().

gammalist

The list of gamma objects calculated on the full dataset that indicate matching patterns, which is fed into tableCounts() and matchesLink().

matchesLink

The output from matchesLink().

varnames

A vector of variable names to use for matching. Must be present in both dfA and dfB.

stringdist.match

A vector of booleans, indicating whether to use string distance matching when determining matching patterns on each variable. Must be same length as varnames.

partial.match

A vector of booleans, indicating whether to include a partial matching category for the string distances. Must be same length as varnames. Default is FALSE for all variables.

firstname.field

A vector of booleans, indicating whether each field in varnames is a first-name field. TRUE if so, otherwise FALSE.
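The vectors varnames, stringdist.match, partial.match, and firstname.field are aligned by position. A minimal sketch of building them consistently, assuming three hypothetical field names that are not taken from this page:

```r
# Hypothetical field names; they must exist as columns in both datasets.
varnames <- c("firstname", "lastname", "birthyear")

# One flag per entry of varnames, in the same order.
stringdist.match <- c(TRUE, TRUE, FALSE)  # compare the name fields by string distance
partial.match    <- c(TRUE, TRUE, FALSE)  # allow a partial-match category for names
firstname.field  <- varnames == "firstname"

# Check the length requirement before passing the vectors on.
stopifnot(
  length(stringdist.match) == length(varnames),
  length(partial.match)    == length(varnames),
  length(firstname.field)  == length(varnames)
)
```

Deriving firstname.field from varnames, rather than typing it by hand, keeps the two vectors in step if the field list changes.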

threshold.match

Either a single number between 0 and 1, giving the lower bound on the posterior probability required to declare a match, or a vector of two such numbers, giving the range of posterior probabilities to declare as matches. For instance, threshold.match = .85 will return all pairs with posterior probability greater than .85 as matches, while threshold.match = c(.85, .95) will return all pairs with posterior probability between .85 and .95 as matches.
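The two forms of threshold.match can be illustrated on a plain numeric vector. This is a sketch in base R with made-up posterior probabilities; whether the endpoints are treated as inclusive inside the package is not stated here, so the strict comparisons below are an assumption.

```r
# Hypothetical posterior match probabilities for five candidate pairs.
posterior <- c(0.40, 0.86, 0.90, 0.96, 0.99)

# Single number: treated as a lower bound.
threshold.match <- 0.85
lower_only <- which(posterior > threshold.match)

# Two numbers: treated as a range (strict bounds assumed).
threshold.match <- c(0.85, 0.95)
in_range <- which(posterior > threshold.match[1] &
                  posterior < threshold.match[2])

print(lower_only)
print(in_range)
```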

stringdist.method

String distance method for calculating similarity. Options are "jw" (Jaro-Winkler, the default), "jaro" (Jaro), and "lv" (Levenshtein edit distance).

cut.a

Lower bound for full string-distance match, ranging between 0 and 1. Default is 0.92.

cut.p

Lower bound for partial string-distance match, ranging between 0 and 1. Default is 0.88.
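Together, cut.a and cut.p split a string similarity score into three categories. The sketch below shows that split in base R on hypothetical similarity values; treating both lower bounds as inclusive is an assumption, not something this page specifies.

```r
cut.a <- 0.92  # lower bound for a full match
cut.p <- 0.88  # lower bound for a partial match

# Hypothetical similarity scores in [0, 1] for four pairs of strings.
sim <- c(1.00, 0.95, 0.90, 0.70)

# Assumed rule: inclusive lower bounds, checked from strictest to loosest.
category <- ifelse(sim >= cut.a, "match",
            ifelse(sim >= cut.p, "partial", "no match"))

print(data.frame(sim, category))
```

With partial.match set to FALSE for a field, only the cut.a boundary would matter for that field.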

jw.weight

Parameter that describes the importance of the first characters of a string (only needed if stringdist.method = "jw"). Default is 0.10.
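To show what this weight does, the sketch below applies the standard Jaro-Winkler prefix adjustment, jw = jaro + l * p * (1 - jaro), where l is the length of the shared prefix (capped at 4) and p plays the role of jw.weight. The helper `common_prefix_len` is hypothetical, the Jaro score is supplied by hand rather than computed, and the code illustrates the textbook formula rather than the package's internals.

```r
# Hypothetical helper: length of the shared prefix, capped at max_len.
common_prefix_len <- function(a, b, max_len = 4) {
  ca <- strsplit(a, "")[[1]]
  cb <- strsplit(b, "")[[1]]
  n <- min(length(ca), length(cb), max_len)
  if (n == 0) return(0)
  same <- ca[seq_len(n)] == cb[seq_len(n)]
  sum(cumprod(same))  # counts leading matches, stops at the first mismatch
}

jw.weight <- 0.10

# Jaro similarity of "martha" and "marhta" is 17/18 (supplied, not computed).
jaro <- 17 / 18
l <- common_prefix_len("martha", "marhta")

# Winkler adjustment: a shared prefix raises the similarity.
jw <- jaro + l * jw.weight * (1 - jaro)

print(c(jaro = jaro, prefix = l, jw = jw))
```

A larger jw.weight gives more credit to strings that agree on their first characters, which makes it easier for such pairs to clear cut.a or cut.p.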

n.cores

Number of cores to parallelize over. Default is NULL.

Value

nameReweight() returns a list containing the following elements:

zetaA

The reweighted zeta estimates for each matched element in dataset A.

zetaB

The reweighted zeta estimates for each matched element in dataset B.