Joins two tables using approximate string matching on key columns. Optionally blocks by a second column (e.g., genus) for performance: only rows sharing the same blocking key are compared.
fuzzy_join(
  x,
  y,
  by,
  method = "dl",
  max_dist = 0.2,
  block_by = NULL,
  n_threads = 4L,
  suffix = ".y"
)
Returns a vectra_node with all probe columns, all build columns (suffixed
on collision), and a fuzzy_dist column (double).
x: A vectra_node object (probe / query side).
y: A vectra_node object (build / reference side).
by: A named character vector of length 1: c("probe_col" = "build_col").
The columns to compute string distance on.
method: Character. Distance algorithm: "dl" (Damerau-Levenshtein,
default), "levenshtein", or "jw" (Jaro-Winkler).
max_dist: Numeric. Maximum normalized distance (0-1) to keep a match.
Default 0.2.
block_by: Optional named character vector of length 1:
c("probe_col" = "build_col"). Rows must match exactly on these columns
before distance is computed, which reduces the number of comparisons
because only rows sharing a blocking key are compared. Default NULL
(no blocking).
n_threads: Integer. Number of OpenMP threads for parallel distance
computation over partitions. Default 4L.
suffix: Character. Suffix appended to build-side column names that
collide with probe-side names. Default ".y".
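The interaction of by, block_by, max_dist, and suffix can be illustrated with a small base-R sketch on plain data frames. This is not the package's implementation: fuzzy_match_sketch() and the toy data are hypothetical, it runs single-threaded, and it assumes that "normalized distance" means the edit distance divided by the length of the longer string, which the text above does not specify. Because utils::adist() only provides Levenshtein distance, that stands in for the default "dl" method here.

```r
# Hypothetical base-R sketch of blocked fuzzy matching (not the vectra code).
# Assumption: normalized distance = Levenshtein distance / longer string length.
fuzzy_match_sketch <- function(x, y, by, block_by = NULL,
                               max_dist = 0.2, suffix = ".y") {
  # Start from every (probe row, build row) pair
  pairs <- expand.grid(i = seq_len(nrow(x)), j = seq_len(nrow(y)))
  pairs <- pairs[order(pairs$i, pairs$j), , drop = FALSE]

  # Blocking: keep only pairs that agree exactly on the blocking columns
  if (!is.null(block_by)) {
    bx <- as.character(x[[names(block_by)]])[pairs$i]
    by_ <- as.character(y[[unname(block_by)]])[pairs$j]
    pairs <- pairs[bx == by_, , drop = FALSE]
  }

  # Distance is computed only for the surviving candidate pairs
  a <- as.character(x[[names(by)]])[pairs$i]
  b <- as.character(y[[unname(by)]])[pairs$j]
  raw <- vapply(seq_along(a),
                function(k) as.numeric(utils::adist(a[k], b[k])),
                numeric(1))
  dist <- raw / pmax(nchar(a), nchar(b), 1)

  # Keep pairs within the threshold
  keep <- dist <= max_dist
  pairs <- pairs[keep, , drop = FALSE]

  # Suffix build-side columns whose names collide with probe-side names
  build <- y[pairs$j, , drop = FALSE]
  clash <- names(build) %in% names(x)
  names(build)[clash] <- paste0(names(build)[clash], suffix)

  out <- cbind(x[pairs$i, , drop = FALSE], build, fuzzy_dist = dist[keep])
  rownames(out) <- NULL
  out
}

# Toy data (made up for illustration): probe names contain typos
probe <- data.frame(
  genus = c("Quercus", "Pinus", "Quercus"),
  name  = c("Quercus rubor", "Pinus silvestris", "Quercus petrea"),
  stringsAsFactors = FALSE
)
ref <- data.frame(
  genus = c("Quercus", "Quercus", "Pinus"),
  name  = c("Quercus robur", "Quercus petraea", "Pinus sylvestris"),
  id    = 1:3,
  stringsAsFactors = FALSE
)

matches <- fuzzy_match_sketch(
  probe, ref,
  by = c("name" = "name"),
  block_by = c("genus" = "genus"),
  max_dist = 0.2
)
print(matches)
```

With blocking on genus, only 5 of the 9 possible row pairs reach the distance step in this toy data, and each probe row keeps the single reference row within the 0.2 threshold. The build-side genus and name columns come back as genus.y and name.y, mirroring the suffix behavior described above.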