match.data.frame: Identify the row of `y` best matching each row of `x`

Description

For each row of x[, by.x], find the best matching row of y[, by.y], where the best match is defined by the arguments grep. and split. Both grep. and split must either be missing or have the same length as by.x and by.y. If grep.[i] and split[i] are both NA, require a complete match between x[, by.x[i]] and y[, by.y[i]]. Otherwise, for each row j, look for a match for

strsplit(x[j, by.x[i]],
    split[i])[[1]][1]

among strsplit(y[, by.y[i]], split[i]). See details.
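
The split-based comparison described above can be sketched for a single column using base R only. This is an illustration rather than the function's actual code: the vectors x_col and y_col, the split character, and the rule that a "match" means the first token of x appears among the tokens of a row of y are all assumptions made for the example.

```r
# Hypothetical columns; not taken from the package.
x_col <- c("Rogers (AL)", "Smith, Jr.")
y_col <- c("Smith", "Rogers (MI)", "Rogers")
split_i <- " "  # assumed split character for this column

# First token of row j = 1 of x, as in strsplit(x[j, by.x[i]], split[i])[[1]][1]
x_first <- strsplit(x_col[1], split_i)[[1]][1]

# Tokens for every row of y, as in strsplit(y[, by.y[i]], split[i])
y_tokens <- strsplit(y_col, split_i)

# Assumed matching rule: rows of y whose tokens contain x_first exactly
hits <- which(vapply(y_tokens, function(tok) x_first %in% tok, logical(1)))
```

Here two rows of y qualify, so this column alone does not identify a unique row; further columns would be needed to narrow the candidates.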

Usage

match.data.frame(x, y, by, by.x=by, by.y=by, grep., split, sep=':')

Arguments

x, y

data.frames

by, by.x, by.y

names of columns of x and y to match.

grep.

a character vector of the type of match for each element of by.x and by.y. If NA, require a perfect match. Alternatives are grep and

split

a character vector of split characters to pass to strsplit; strsplit is not called for elements where is.na(split[i]).

sep

a sep argument to use with paste to produce a matching key for the columns of x and y for which perfect matches are required. If(missing(sep) && not(miss

Value

an integer vector of length nrow(x) containing the index of the best matching row of y or NA if no adequate match was found.
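
A returned index vector of this shape might be used as follows to pull the matched rows out of y. This is a usage sketch under assumptions: the data frame y and the index vector idx are made up for illustration and do not come from a real call to match.data.frame.

```r
# Hypothetical reference table and a hypothetical index vector
# with one NA marking a row of x that found no adequate match.
y <- data.frame(id = 1:3, name = c("a", "b", "c"),
                stringsAsFactors = FALSE)
idx <- c(2L, NA, 3L)

# Separate rows of x that found a match from those that did not
found <- !is.na(idx)
matched <- y[idx[found], , drop = FALSE]  # rows of y that were matched
unmatched_rows <- which(!found)           # rows of x without a match
```

Filtering on is.na(idx) before indexing avoids the all-NA rows that y[idx, ] would otherwise produce.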

Details

1. Check by.x, by.y, grep. and split. If ((missing(by.x) | missing(by.y)) && missing(by)), set by <- names(x).

2. Compute fullMatch <- (is.na(grep.) & is.na(split)). Create keyfx and keyfy by pasting columns of x[, by.x[fullMatch]] and y[, by.y[fullMatch]]. Also create x. and y. = strsplit of x[, by.x[!fullMatch]].

3. Iterate over rows of x looking for the best match. This includes an inner loop over columns of x[, by.x[!fullMatch]], stopping on the first unique match. Return (-1) if no unique match is found.
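
The key-building part of step 2 can be sketched with base R. This is a minimal illustration, not the function's implementation: the data frames, the column set by_full, and the use of match to compare the keys are assumptions for the example. Only the names keyfx, keyfy and the default sep = ':' come from the text above.

```r
# Hypothetical data; both columns are treated as full-match columns.
x <- data.frame(state = c("AL", "MI", "NY"),
                surname = c("Rogers", "Rogers", "Smith"),
                stringsAsFactors = FALSE)
y <- data.frame(state = c("NY", "MI", "AL"),
                surname = c("Smith", "Rogers", "Rogers"),
                stringsAsFactors = FALSE)
by_full <- c("state", "surname")  # assumed columns with fullMatch == TRUE
sep <- ":"

# Paste the full-match columns row by row into a single key per row
keyfx <- do.call(paste, c(x[, by_full, drop = FALSE], sep = sep))
keyfy <- do.call(paste, c(y[, by_full, drop = FALSE], sep = sep))

# Assumed comparison: position in keyfy of the first exact match per key
idx <- match(keyfx, keyfy)
```

A separator that does not occur in the data keeps distinct column combinations from collapsing into the same key.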

Examples

newdata <- data.frame(state=c("AL", "MI","NY"),
                      surname=c("Rogers", "Rogers", "Smith"),
                      givenName=c("Mike R.", "Mike K.", "Al"),
                      stringsAsFactors=FALSE)
reference <- data.frame(state=c("NY", "NY", "MI", "AL", "NY", "MI"),
                      surname=c("Smith", "Rogers", "Rogers (MI)",
                                "Rogers (AL)", "Smith", 'Jones'),
                      givenName=c("John", "Mike", "Mike", "Mike",
                                "T. Albert", 'Al Thomas'),
                      stringsAsFactors=FALSE)
newInRef <- match.data.frame(newdata, reference,
       grep.=c(NA, 'agrep', 'agrep'))

stopifnot(
all.equal(newInRef, c(4, 3, 5))
)
