get_embeddings

Get pretrained text embeddings from the OpenAI or Mistral API. Automatically batches requests to handle rate limits.
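The batching mentioned above can be pictured with a minimal base R sketch. It is not the package's internal code: the helper name `split_into_batches` and the batch size of 3 are hypothetical, chosen only to show how a character vector might be divided into chunks before each chunk is sent as its own request.

```r
# Hypothetical helper (not part of fuzzylink): split a character vector
# into consecutive batches of at most `batch_size` elements.
split_into_batches <- function(text, batch_size = 3) {
  stopifnot(is.character(text), batch_size >= 1)
  if (length(text) == 0) return(list())
  # Batch index for each element: 1, 1, 1, 2, 2, 2, 3, ...
  batch_id <- ceiling(seq_along(text) / batch_size)
  unname(split(text, batch_id))
}

batches <- split_into_batches(
  c("dog", "cat", "canine", "feline", "wolf", "lion", "fox")
)
length(batches)   # number of requests that would be sent
lengths(batches)  # number of strings in each request
```

Smaller batches mean more requests but less text per request; the right trade-off depends on the rate limits of the API account in use.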

Links datasets through fuzzy string matching using pretrained text embeddings. Produces more accurate record linkage when lexical string distance metrics are a poor guide to match quality (e.g., "Patricia" is more lexically similar to "Patrick" than it is to "Trish"). Capable of performing multilingual record linkage. Methods are described in Ornstein (2025) <https://joeornstein.github.io/publications/fuzzylink.pdf>.
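The "Patricia" example can be checked with base R's `utils::adist()`, which computes generalized Levenshtein (edit) distance. This sketch only illustrates the weakness of a lexical metric; it is one of several string distance measures and is not how fuzzylink scores matches.

```r
# Edit distance between the full name and each candidate match.
d_patrick <- drop(utils::adist("Patricia", "Patrick"))
d_trish   <- drop(utils::adist("Patricia", "Trish"))

# A smaller distance means "more lexically similar".
c(Patrick = d_patrick, Trish = d_trish)

# TRUE: edit distance ranks "Patrick" as the closer match,
# even though "Trish" is the nickname for "Patricia".
d_patrick < d_trish
```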

Joe Ornstein

fuzzylink

Probabilistic Record Linkage Using Pretrained Text Embeddings

get_embeddings function

<dl><dt>text</dt>
<dd>A character vector of the strings to embed.</dd>
<dt>model</dt>
<dd>Which embedding model to use. Defaults to 'text-embedding-3-large'.</dd>
<dt>dimensions</dt>
<dd>The dimension of the embedding vectors to return. Defaults to 256. Note that the 'mistral-embed' model always returns 1024-dimensional vectors, regardless of this setting.</dd>
<dt>openai_api_key</dt>
<dd>Your OpenAI API key. By default, looks for a system environment variable called "OPENAI_API_KEY".</dd>
<dt>parallel</dt>
<dd>TRUE to submit API requests in parallel. Setting to FALSE can reduce rate limit errors at the expense of longer runtime.</dd></dl>
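A call using the arguments documented above might look like the following sketch. The call is guarded so it runs only when the package is installed and an API key is set. Two things here are assumptions rather than documented behavior: that the result is a numeric matrix with one row per input string, in input order, and the helper `cosine_similarity`, which is hypothetical and not part of fuzzylink.

```r
# Hypothetical helper (not part of fuzzylink): cosine similarity
# between two numeric vectors.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Only call the API if the package is installed and a key is available.
if (requireNamespace("fuzzylink", quietly = TRUE) &&
    nzchar(Sys.getenv("OPENAI_API_KEY"))) {
  emb <- fuzzylink::get_embeddings(
    text = c("Patricia", "Trish", "Patrick"),
    model = "text-embedding-3-large",
    dimensions = 256,
    openai_api_key = Sys.getenv("OPENAI_API_KEY"),
    parallel = FALSE  # slower, but less likely to hit rate limits
  )
  # Assumption: `emb` is a numeric matrix with one row per input string,
  # in the same order as `text`.
  print(cosine_similarity(emb[1, ], emb[2, ]))  # Patricia vs. Trish
  print(cosine_similarity(emb[1, ], emb[3, ]))  # Patricia vs. Patrick
}
```

Reading the key from the environment variable, as the default does, keeps it out of scripts that may be shared or committed to version control.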

