match_companies: Match Company Names against a Dictionary

Description

Runs a cascading matching pipeline with four stages, applied in order: Exact -> Fuzzy (Zoomer) -> FTS5 -> Rarity. Queries matched in an earlier stage are removed before the later stages run, so each query is matched by at most one stage.
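The cascade described above can be sketched in base R. This is only an illustration of the "match, then remove matched queries" pattern, not the package's implementation: the second stage here is a hypothetical prefix match standing in for the fuzzy stages, and the sample names are made up for the example.

```r
# Illustrative two-stage cascade in base R (NOT the package's internals).
queries <- data.frame(
  query_id = 1:3,
  company_name = c("BMW AG", "Siemens", "Deutsche Bank"),
  stringsAsFactors = FALSE
)
dictionary <- data.frame(
  orbis_id = c("D001", "D002", "D003"),
  company_name = c("BMW AG", "Siemens Aktiengesellschaft", "Commerzbank AG"),
  stringsAsFactors = FALSE
)

# Stage 1: exact match on the name column.
idx <- match(queries$company_name, dictionary$company_name)
hit <- !is.na(idx)
exact <- data.frame(
  query_id = queries$query_id[hit],
  dict_id = dictionary$orbis_id[idx[hit]],
  match_type = rep("exact", sum(hit)),
  stringsAsFactors = FALSE
)

# Remove matched queries so the next stage only sees the remainder.
remaining <- queries[!(queries$query_id %in% exact$query_id), ]

# Stage 2 (hypothetical stand-in for a fuzzy stage):
# accept the first dictionary name that starts with the query name.
prefix_hits <- lapply(seq_len(nrow(remaining)), function(i) {
  cand <- which(startsWith(dictionary$company_name, remaining$company_name[i]))
  if (length(cand) == 0) return(NULL)
  data.frame(
    query_id = remaining$query_id[i],
    dict_id = dictionary$orbis_id[cand[1]],
    match_type = "prefix",
    stringsAsFactors = FALSE
  )
})

# Combine the stages; queries with no hit in any stage are simply absent.
results <- rbind(exact, do.call(rbind, prefix_hits))
print(results)
```

Because matched queries are dropped between stages, the cheap exact stage shrinks the input that the more expensive stages have to process.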

Usage

match_companies(
  queries,
  dictionary,
  query_col = "company_name",
  dict_col = "company_name",
  unique_id_col = "query_id",
  dict_id_col = "orbis_id",
  threshold_jw = 0.8,
  threshold_zoomer = 0.4,
  threshold_rarity = 1,
  n_cores = 1
)

Value

A data.table containing query_id, dict_id, and match_type.

Arguments

queries: Data frame. Must contain columns specified in query_col and unique_id_col.
dictionary: Data frame. Must contain columns specified in dict_col and dict_id_col.
query_col: String. Column name for company names in queries.
dict_col: String. Column name for company names in dictionary.
unique_id_col: String. ID column in queries.
dict_id_col: String. ID column in dictionary.
threshold_jw: Numeric (0-1). Minimum Jaro-Winkler similarity. Default 0.8.
threshold_zoomer: Numeric (0-1). Jaccard threshold for blocking. Default 0.4.
threshold_rarity: Numeric. Minimum score for rarity matching. Default 1.0.
n_cores: Integer. Number of cores. Reserved for a future parallel implementation. Default 1.
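To give a feel for what a Jaccard threshold such as threshold_zoomer = 0.4 means, here is a minimal base R sketch of Jaccard similarity on character bigrams. The helper names bigrams() and jaccard() are hypothetical. The shingle size, the lower-casing, and the inclusive comparison are assumptions made for illustration; the package may tokenize and compare names differently.

```r
# Character bigrams of a lower-cased string (assumption: 2-character shingles).
bigrams <- function(x) {
  x <- tolower(x)
  n <- nchar(x)
  if (n < 2) return(x)
  unique(substring(x, 1:(n - 1), 2:n))
}

# Jaccard similarity: size of the intersection divided by size of the union.
jaccard <- function(a, b) {
  sa <- bigrams(a)
  sb <- bigrams(b)
  length(intersect(sa, sb)) / length(union(sa, sb))
}

threshold <- 0.4

# "bmw" has 2 bigrams, "bmw ag" has 5, and they share 2, so the score is 2/5.
sim_bmw <- jaccard("BMW", "BMW AG")
sim_siemens <- jaccard("Siemens AG", "Siemens Aktiengesellschaft")

print(c(bmw = sim_bmw, siemens = sim_siemens))
print(c(bmw = sim_bmw >= threshold, siemens = sim_siemens >= threshold))
```

A long legal-form suffix such as "Aktiengesellschaft" adds many bigrams to the union, which lowers the score even when the core name is identical. Lowering the threshold lets more such pairs through at the cost of more candidates to score.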

Examples

# Create sample query data
queries <- data.frame(
  query_id = 1:3,
  company_name = c("BMW", "Siemens AG", "Deutsche Bank")
)

# Create sample dictionary
dictionary <- data.frame(
  orbis_id = c("D001", "D002", "D003"),
  company_name = c("BMW AG", "Siemens Aktiengesellschaft", "Commerzbank AG")
)

# Match companies (uses multi-threaded Rust internals via zoomerjoin)
results <- match_companies(
  queries = queries,
  dictionary = dictionary,
  query_col = "company_name",
  dict_col = "company_name",
  unique_id_col = "query_id",
  dict_id_col = "orbis_id"
)

print(results)