Performs fuzzy string grouping in which similar strings are assigned to the
same group. Uses the cluster_fast_greedy() community detection algorithm
from the igraph package to create the groups. Must have igraph installed
in order to use this function.
jaccard_string_group(
  string,
  n_gram_width = 2,
  n_bands = 45,
  band_width = 8,
  threshold = 0.7,
  progress = FALSE
)
Returns a string vector storing the group of each element in the original input strings. The input vector is grouped so that similar strings belong to the same group, which is given a standardized name.
string: a character vector you wish to perform entity resolution on.
n_gram_width: the length of the n-grams used in calculating the
Jaccard similarity. For best performance, set this large enough that the
chance any string has a specific n-gram is low (e.g. n_gram_width = 2
or 3 when matching on first names, 5 or 6 when matching on entire
sentences).
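To make the effect of n_gram_width concrete, here is a minimal base R sketch of splitting a string into overlapping character n-grams. The helper char_ngrams() is hypothetical and is not part of the package; the package's internal tokenization may differ in details such as padding or handling of short strings.

```r
# Hypothetical helper (not part of the package): return the set of
# overlapping character n-grams of width n found in a single string.
char_ngrams <- function(x, n) {
  # A string shorter than n has no n-grams of that width.
  if (nchar(x) < n) {
    return(character(0))
  }
  starts <- seq_len(nchar(x) - n + 1)
  unique(substring(x, starts, starts + n - 1))
}

char_ngrams("jacky", 2) # "ja" "ac" "ck" "ky"
char_ngrams("jacky", 3) # "jac" "ack" "cky"
```

Wider n-grams produce fewer, more distinctive tokens per string, which is why short inputs such as first names call for a small width.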
n_bands: the number of bands used in the minhash algorithm (default
is 45). Use this in conjunction with band_width to determine the
performance of the hashing. The default settings are for a
(.2,.8,.001,.999)-sensitive hash, i.e. pairs with a similarity of less
than .2 have a <.1% chance of being compared, while pairs with a similarity
of greater than .8 have a >99.9% chance of being compared.
band_width: the length of each band used in the minhash algorithm
(default is 8). Use this in conjunction with n_bands to determine
the performance of the hashing. The default settings are for a
(.2,.8,.001,.999)-sensitive hash, i.e. pairs with a similarity of less
than .2 have a <.1% chance of being compared, while pairs with a similarity
of greater than .8 have a >99.9% chance of being compared.
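The sensitivity figures quoted for n_bands and band_width can be checked with the textbook banding formula for minhash locality-sensitive hashing: a pair with Jaccard similarity s becomes a candidate with probability 1 - (1 - s^r)^b, where r is the band width and b the number of bands. The sketch below assumes the package follows this standard scheme; candidate_prob() is a hypothetical helper, not a package function.

```r
# Hypothetical helper (not part of the package): probability that two
# strings with the given Jaccard similarity share at least one
# identical band, under the standard minhash banding scheme.
candidate_prob <- function(similarity, n_bands, band_width) {
  1 - (1 - similarity^band_width)^n_bands
}

# Probability of comparison at the two similarity levels quoted above.
p <- candidate_prob(c(0.2, 0.8), n_bands = 40, band_width = 8)
p[1] < 0.001 # dissimilar pairs are rarely compared
p[2] > 0.999 # similar pairs are almost always compared
```

Raising n_bands makes more pairs candidates (higher recall, more comparisons), while raising band_width makes the filter stricter.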
threshold: the Jaccard similarity threshold above which two strings should be considered a match (default is .7). The similarity is equal to 1
minus the Jaccard distance between the two strings, so 1 implies the strings are identical, while a similarity of zero implies the strings are completely dissimilar.
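As a rough illustration of the quantity the threshold is compared against, the base R sketch below computes the Jaccard similarity of two strings from their sets of character n-grams. The function jaccard_similarity() is hypothetical and only approximates what the package computes; the package estimates matches through hashing and may tokenize differently.

```r
# Hypothetical helper (not part of the package): Jaccard similarity of
# two strings, computed on their sets of character n-grams of width n.
jaccard_similarity <- function(a, b, n = 2) {
  grams <- function(x) {
    starts <- seq_len(max(nchar(x) - n + 1, 0))
    unique(substring(x, starts, starts + n - 1))
  }
  ga <- grams(a)
  gb <- grams(b)
  # Shared n-grams divided by all distinct n-grams in either string.
  length(intersect(ga, gb)) / length(union(ga, gb))
}

sim <- jaccard_similarity("benjamin", "beniamin", n = 2) # 5 / 9
dist <- 1 - sim                                          # Jaccard distance
jaccard_similarity("jack", "jack")                       # 1 for identical strings
```

Under these assumptions, "benjamin" and "beniamin" share 5 of 9 distinct bigrams, so they would fall below a threshold of .7 but above one of .5.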
progress: set to TRUE to report progress of the algorithm.
string <- c(
"beniamino", "jack", "benjamin", "beniamin",
"jacky", "giacomo", "gaicomo"
)
jaccard_string_group(string, threshold = .2, n_bands = 90, n_gram_width = 1)