Creates a training set from a list of similarity matrices and labels it using a zero-shot GPT prompt.
get_training_set(
  sim,
  num_bins = 50,
  samples_per_bin = 10,
  n = 500,
  record_type = "entity",
  instructions = NULL,
  model = "gpt-3.5-turbo-instruct",
  openai_api_key = Sys.getenv("OPENAI_API_KEY"),
  parallel = TRUE
)
Returns a dataset with string pairs A and B, along with a match column indicating whether they match.
sim: A matrix of similarity scores.
num_bins: Number of bins to split similarity scores into for stratified random sampling (defaults to 50).
samples_per_bin: Number of string pairs to sample from each bin (defaults to 10).
n: Sample size for the training dataset (defaults to 500).
record_type: A character string describing what type of entity the rows and columns of sim represent. Should be a singular noun (e.g. "person", "organization", "interest group", "city").
instructions: A string containing additional instructions to include in the LLM prompt during validation.
model: Which OpenAI model to prompt; defaults to 'gpt-3.5-turbo-instruct'.
openai_api_key: Your OpenAI API key. By default, looks for a system environment variable called "OPENAI_API_KEY" (recommended option). Otherwise, it will prompt you to enter the API key as an argument.
parallel: TRUE to submit API requests in parallel. Setting to FALSE can reduce rate limit errors at the expense of longer runtime.
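
The num_bins and samples_per_bin arguments describe stratified random sampling over similarity scores. The base R sketch below illustrates that idea only. It assumes equal-width bins and a similarity matrix with row and column names. sample_pairs_by_bin is a hypothetical helper written for this example, not the package's implementation, and it does not perform the LLM labeling step that produces the match column.

```r
# Hypothetical helper (not part of the package): draws string pairs from
# equal-width similarity bins so that every score range is represented.
sample_pairs_by_bin <- function(sim, num_bins = 50, samples_per_bin = 10) {
  # Flatten the matrix into one row per (A, B) pair; row() and col() follow
  # the same column-major order as as.vector(sim)
  pairs <- data.frame(
    A = rownames(sim)[as.vector(row(sim))],
    B = colnames(sim)[as.vector(col(sim))],
    sim = as.vector(sim),
    stringsAsFactors = FALSE
  )
  # Assign each pair to one of num_bins equal-width bins (integer codes)
  pairs$bin <- cut(pairs$sim, breaks = num_bins, labels = FALSE)
  # Draw up to samples_per_bin pairs from every non-empty bin
  sampled <- lapply(split(pairs, pairs$bin), function(d) {
    d[sample.int(nrow(d), min(samples_per_bin, nrow(d))), , drop = FALSE]
  })
  out <- do.call(rbind, sampled)
  rownames(out) <- NULL
  out
}

# Toy similarity matrix with made-up names and random scores
set.seed(42)
sim <- matrix(
  runif(12), nrow = 3,
  dimnames = list(c("a1", "a2", "a3"), c("b1", "b2", "b3", "b4"))
)
train <- sample_pairs_by_bin(sim, num_bins = 4, samples_per_bin = 2)
```

Sampling per bin, rather than uniformly over all pairs, keeps high-similarity pairs in the training set even when they are rare compared with the many low-similarity pairs in a typical matrix.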