manydata (version 1.1.3)

code_extend: Extending codes

Description

These functions use text embeddings and multinomial logistic regression to suggest missing codes or flag potentially incorrect codes based on text data. Two approaches are provided: one using GloVe embeddings trained on the input text, and another using pre-trained BERT embeddings via the {text} package.

Both functions require a vector of text (e.g., titles or descriptions) and a corresponding vector of categorical codes, with NA or empty strings indicating missing codes to be inferred. The functions train a multinomial logistic regression model with glmnet on the text embeddings of the entries with known codes, then predict codes for the entries with missing codes. They also validate the model's performance on a holdout set and report per-class precision, recall, and F1 score. If no missing codes are present, the functions instead check the existing codes for potential mismatches and report them.
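
To make the training-and-inference step concrete, here is a minimal sketch of the core idea in plain R, assuming a numeric embedding matrix; the toy data and variable names are illustrative, not the package internals:

library(glmnet)
set.seed(1)
emb <- matrix(rnorm(100 * 50), nrow = 100)    # stand-in for text embeddings
codes <- sample(c("A", "B", "C", NA), 100, replace = TRUE)
known <- !is.na(codes) & codes != ""          # entries with usable codes
# fit a multinomial model on the coded entries only
fit <- cv.glmnet(emb[known, ], factor(codes[known]), family = "multinomial")
# suggest codes for the uncoded entries
suggested <- predict(fit, emb[!known, ], s = "lambda.min", type = "class")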

Usage

code_extend_glove(titles, var, req_f1 = 0.8, rarity_threshold = 8)

code_extend_bert(titles, var, req_f1 = 0.8, rarity_threshold = 8, emb_texts)

Arguments

titles

A character vector of text entries (e.g., titles or descriptions).

var

A character vector of (categorical) codes that might be coded from the titles or texts. Entries with missing codes should be NA_character_ or empty strings. The function will suggest codes for these entries. If no missing codes are present, the function will check existing codes for potential mismatches.
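
For example, placeholder values can be converted to missing before calling the function; this fragment is purely illustrative:

var <- c("treaty", "unknown", NA, "agreement")
var[var == "unknown"] <- NA       # treat placeholders as missing
is.na(var) | var == ""            # TRUE entries will receive suggested codes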

req_f1

The required macro-F1 score on the validation set before proceeding with inference. Default is 0.80.
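
Here, macro-F1 is the unweighted mean of the per-class F1 scores, so rare classes count as much as common ones. A hypothetical sketch of the computation (not the package's own code):

macro_f1 <- function(truth, pred) {
  f1 <- sapply(unique(truth), function(k) {
    tp   <- sum(pred == k & truth == k)
    prec <- tp / max(sum(pred == k), 1)   # guard against division by zero
    rec  <- tp / max(sum(truth == k), 1)
    if (prec + rec == 0) 0 else 2 * prec * rec / (prec + rec)
  })
  mean(f1)  # average across classes, weighting each class equally
}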

rarity_threshold

Minimum number of occurrences for a code to be included in training. Codes with fewer occurrences are excluded from training to ensure sufficient data for learning. Default is 8.
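
Conceptually, the filter behaves like this illustrative fragment (the internals may differ):

counts <- table(var)                              # occurrences per code
trainable <- var %in% names(counts[counts >= 8])  # codes meeting the threshold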

emb_texts

For code_extend_bert(), pre-computed embeddings from text::textEmbed(). Supplying these avoids re-computing embeddings if they are already available. A Hugging Face model can be specified via textEmbed()'s model argument; the default is "sentence-transformers/all-MiniLM-L6-v2". Other models can be used, but they should produce sentence-level embeddings.
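
To reuse embeddings across calls, compute them once up front. This sketch assumes the {text} package and its Python backend are installed, and that the value returned by text::textEmbed() is accepted as-is by emb_texts:

emb <- text::textEmbed(titles,
                       model = "sentence-transformers/all-MiniLM-L6-v2")
ber <- code_extend_bert(titles, var, emb_texts = emb)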

Examples

library(manydata)

# Combine several text fields from the emperors dataset into one string per row
titles <- paste(emperors$Wikipedia$CityBirth,
                emperors$Wikipedia$ProvinceBirth,
                emperors$Wikipedia$Rise,
                emperors$Wikipedia$Dynasty,
                emperors$Wikipedia$Cause)

# Use the recorded killer as the code to extend, marking "Unknown" as missing
var <- emperors$Wikipedia$Killer
var[var == "Unknown"] <- NA

# Collapse the remaining killers into three broader categories
var[var %in% c("Senate", "Court Officials", "Opposing Army")] <- "Enemies"
var[var %in% c("Fire", "Lightning", "Aneurism", "Heart Failure")] <- "God"
var[var %in% c("Wife", "Usurper", "Praetorian Guard", "Own Army")] <- "Friends"

# Suggest codes for the missing entries using GloVe embeddings
glo <- code_extend_glove(titles, var)
