statar (version 0.3.0)

fuzzy_join: Experimental fuzzy join function

Description

fuzzy_join uses record linkage methods to match observations between two datasets where no perfect key fields exist. For each row in x, fuzzy_join finds the closest row(s) in y. The distance is a weighted average of the string distances defined in method over multiple columns.
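The weighted-average idea can be sketched in base R. This is an illustration only, not statar's implementation: it assumes the score is sum(w * d) / sum(w), it uses base R's adist (Levenshtein distance, normalized by the longer string's length) as a stand-in for the Jaro-Winkler distance from stringdist, and the helper names norm_lev and weighted_distance are hypothetical. The numbers will therefore differ from what fuzzy_join reports.

```r
# Normalized Levenshtein distance in [0, 1]; a stand-in for Jaro-Winkler.
# Returns a matrix with one row per element of a and one column per element of b.
norm_lev <- function(a, b) {
  d <- adist(a, b)
  len <- outer(nchar(a), nchar(b), pmax)
  d / pmax(len, 1)
}

# Weighted average of per-column distances (assumed form: sum(w * d) / sum(w)).
# Missing values are scored with na.score, mirroring the na.score argument.
weighted_distance <- function(x, y, cols, w = rep(1, length(cols)),
                              na.score = 1/3) {
  total <- matrix(0, nrow(x), nrow(y))
  for (i in seq_along(cols)) {
    d <- norm_lev(x[[cols[i]]], y[[cols[i]]])
    d[is.na(d)] <- na.score
    total <- total + w[i] * d
  }
  total / sum(w)
}

x <- data.frame(a = c("france", "franc"), b = c("arras", "dijon"),
                stringsAsFactors = FALSE)
y <- data.frame(a = c("franc", "france"), b = c("arvars", "dijjon"),
                stringsAsFactors = FALSE)

scores <- weighted_distance(x, y, cols = c("a", "b"))
best <- apply(scores, 1, which.min)  # closest row of y for each row of x
```

Here scores[i, j] is the combined distance between row i of x and row j of y, and best holds the index of the closest row of y for each row of x.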

Usage

fuzzy_join(x, y, exact = NULL, fuzzy = NULL, gen = "distance",
  suffixes = c(".x", ".y"), which = FALSE, w = rep(1, length(fuzzy)),
  na.score = 1/3, method = "jw", p = 0.1, ...)

Arguments

x
The master data.frame
y
The using data.frame
exact
Character vector specifying variables on which to match exactly.
fuzzy
Character vector specifying the columns on which to match approximately.
gen
Name of the new variable holding the distance between matched observations. Defaults to "distance".
suffixes
A character vector of length 2 specifying the suffixes added to overlapping column names. Defaults to ".x" and ".y".
which
If which = TRUE, returns a three-column data.table in which the first column is the row number in x, the second column is the row number in y, and the third column is the score of the match.
w
Numeric vector of the same length as fuzzy specifying the weights to use when combining distances across the columns in fuzzy. Defaults to rep(1, length(fuzzy)).
na.score
Numeric value specifying the distance between NA and another string. Defaults to 1/3.
method
String distance method passed to stringdist. See the stringdist documentation. Defaults to "jw" (Jaro-Winkler).
p
Prefix factor for the Jaro-Winkler distance, passed to stringdist. See the stringdist documentation. Defaults to 0.1.
...
Other arguments to pass to stringdist. See the stringdist documentation.

Details

Typically, x is a dataset with dirty names, while y is the dataset with the true names. When exact is specified, rows of x without an exact match in y are returned with distance NA.
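The behaviour with exact can be pictured as blocking: candidates in y are first restricted to rows that share the exact key, and the fuzzy distance is only computed within that group. The sketch below is a base R illustration under stated assumptions, not statar's code: block_match is a hypothetical helper, it handles a single exact column and a single fuzzy column, and it uses a normalized Levenshtein distance from adist rather than Jaro-Winkler.

```r
# For each row of x, keep only rows of y with the same exact key,
# then pick the closest candidate on the fuzzy column.
# Rows of x with no candidate keep y_row = NA and distance = NA.
block_match <- function(x, y, exact, fuzzy) {
  out <- data.frame(x_row = seq_len(nrow(x)),
                    y_row = NA_integer_,
                    distance = NA_real_)
  for (i in seq_len(nrow(x))) {
    cand <- which(y[[exact]] == x[[exact]][i])
    if (length(cand) == 0) next
    xi <- x[[fuzzy]][i]
    yc <- y[[fuzzy]][cand]
    # Levenshtein distance scaled by the longer string (stand-in for "jw")
    d <- as.vector(adist(xi, yc)) / pmax(nchar(xi), nchar(yc), 1)
    best <- which.min(d)
    out$y_row[i] <- cand[best]
    out$distance[i] <- d[best]
  }
  out
}

x <- data.frame(a = c(1, 2), b = c("arras", "dijon"),
                stringsAsFactors = FALSE)
y <- data.frame(a = c(1, 1), b = c("arvars", "dijjon"),
                stringsAsFactors = FALSE)

res <- block_match(x, y, exact = "a", fuzzy = "b")
```

The second row of x has a = 2, which does not occur in y, so its y_row and distance stay NA. The three-column layout loosely mirrors what the which argument describes.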

Examples

library(statar)
library(stringdist)
library(dplyr)

# Fuzzy match on two string columns with equal weights
x <- data_frame(a = c("france", "franc"), b = c("arras", "dijon"))
y <- data_frame(a = c("franc", "france"), b = c("arvars", "dijjon"))
fuzzy_join(x, y, fuzzy = c("a", "b"))
# Weight column a more heavily than column b
fuzzy_join(x, y, fuzzy = c("a", "b"), w = c(0.9, 0.1))
# Give column a zero weight, so only column b contributes
fuzzy_join(x, y, fuzzy = c("a", "b"), w = c(0, 0.9))

# Match exactly on a, then fuzzily on b
x <- data_frame(a = c(1, 1), b = c("arras", "dijon"))
y <- data_frame(a = c(1, 1), b = c("arvars", "dijjon"))
fuzzy_join(x, y, exact = "a", fuzzy = "b")

# The second row of x (a = 2) has no exact match in y,
# so it is returned with distance NA
x <- data_frame(a = c(1, 2), b = c("arras", "dijon"))
y <- data_frame(a = c(1, 1), b = c("arvars", "dijjon"))
fuzzy_join(x, y, exact = "a", fuzzy = "b")