get_training_set

Creates a training set from a list of similarity matrices and labels it using a zero-shot GPT prompt.

Links datasets through fuzzy string matching using pretrained text embeddings. Produces more accurate record linkage when lexical string distance metrics are a poor guide to match quality (e.g., "Patricia" is more lexically similar to "Patrick" than it is to "Trish"). Capable of performing multilingual record linkage. Methods are described in Ornstein (2025) <https://joeornstein.github.io/publications/fuzzylink.pdf>.
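The "Patricia" example can be checked with a lexical metric directly. The sketch below uses `adist()` from base R (the `utils` package), which computes generalized Levenshtein edit distance. It only illustrates the kind of string distance being contrasted with embeddings; it is not part of fuzzylink and is not how the package scores matches.

```r
# Illustration only: edit distance ranks the non-match ("Patrick")
# as closer to "Patricia" than the true nickname ("Trish").
name <- "Patricia"
candidates <- c("Patrick", "Trish")

# adist() returns a 1 x 2 matrix of edit distances; drop() flattens it
d <- drop(adist(name, candidates))
names(d) <- candidates

# A smaller distance means "more lexically similar"
closest <- names(d)[which.min(d)]
print(d)
print(closest)
```

A purely lexical matcher would therefore prefer "Patrick", which is the failure case that motivates using text embeddings.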

Joe Ornstein

fuzzylink

Probabilistic Record Linkage Using Pretrained Text Embeddings

Arguments

<dl>

<dt>sim</dt>
<dd>A matrix of similarity scores.</dd>

<dt>num_bins</dt>
<dd>Number of bins into which the similarity scores are split for stratified random sampling (defaults to 50).</dd>

<dt>samples_per_bin</dt>
<dd>Number of string pairs to sample from each bin (defaults to 5).</dd>

<dt>n</dt>
<dd>Sample size for the training dataset.</dd>

<dt>record_type</dt>
<dd>A character string describing what type of entity the rows and columns of <code>sim</code> represent. Should be a singular noun (e.g., "person", "organization", "interest group", "city").</dd>

<dt>instructions</dt>
<dd>A string containing additional instructions to include in the LLM prompt during validation.</dd>

<dt>model</dt>
<dd>Which OpenAI model to prompt; defaults to 'gpt-3.5-turbo-instruct'.</dd>

<dt>openai_api_key</dt>
<dd>Your OpenAI API key. By default, the function looks for a system environment variable called "OPENAI_API_KEY" (the recommended option). Otherwise, it will prompt you to enter the API key as an argument.</dd>

<dt>parallel</dt>
<dd>Set to TRUE to submit API requests in parallel. Setting it to FALSE can reduce rate limit errors at the expense of longer runtime.</dd>

</dl>
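The roles of `num_bins` and `samples_per_bin` can be sketched in base R. This is a minimal illustration of stratified random sampling over a similarity matrix, not the package's implementation: the helper name `sample_pairs_by_bin`, the equal-width binning via `cut()`, and the output column names are all assumptions made for this example.

```r
# Hypothetical helper (not part of fuzzylink): draw string pairs from a
# similarity matrix, stratified by similarity score.
sample_pairs_by_bin <- function(sim, num_bins = 50, samples_per_bin = 5) {
  # Flatten the matrix into one row per (row name, column name) pair
  pairs <- data.frame(
    A = rownames(sim)[as.vector(row(sim))],
    B = colnames(sim)[as.vector(col(sim))],
    sim = as.vector(sim),
    stringsAsFactors = FALSE
  )
  # Assumption: equal-width bins across the observed range of scores
  pairs$bin <- cut(pairs$sim, breaks = num_bins, labels = FALSE)
  # Draw up to samples_per_bin pairs from every non-empty bin
  sampled <- lapply(split(pairs, pairs$bin), function(d) {
    d[sample.int(nrow(d), min(samples_per_bin, nrow(d))), , drop = FALSE]
  })
  do.call(rbind, sampled)
}

# Toy similarity matrix with made-up record names
set.seed(42)
sim <- matrix(runif(200), nrow = 10, ncol = 20,
              dimnames = list(paste0("a", 1:10), paste0("b", 1:20)))
train <- sample_pairs_by_bin(sim, num_bins = 10, samples_per_bin = 3)
```

Sampling within bins, rather than uniformly over all pairs, keeps high-similarity pairs in the training set even though they are usually a small share of the matrix.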
