Extract layers of hidden states (word embeddings) for all character variables in a given dataframe.
textEmbedRawLayers(
texts,
model = "bert-base-uncased",
layers = -2,
return_tokens = TRUE,
word_type_embeddings = FALSE,
decontextualize = FALSE,
keep_token_embeddings = TRUE,
device = "cpu",
tokenizer_parallelism = FALSE,
model_max_length = NULL,
max_token_to_sentence = 4,
logging_level = "error",
sort = TRUE
)
textEmbedRawLayers() takes text as input and returns the hidden states for each token of the text, including the [CLS] and [SEP] tokens. Note that layer 0 is the input embedding to the transformer and should normally not be used.
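A minimal sketch of a call (assuming the text package is installed and the default model can be downloaded); str() is used to inspect the result because the exact component names of the returned list can vary across package versions:

library(text)

# Extract the second-to-last layer for one short text.
raw <- textEmbedRawLayers("hello world", layers = -2)

# Inspect the output; the token-level rows include the special
# [CLS] and [SEP] tokens added by the tokenizer.
str(raw, max.level = 2)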
texts: A character variable or a tibble with at least one character variable.
model: (character) Character string specifying the pre-trained language model (default = "bert-base-uncased"). For the full list of options see the pretrained models at HuggingFace. For example, use "bert-base-multilingual-cased", "openai-gpt", "gpt2", "ctrl", "transfo-xl-wt103", "xlnet-base-cased", "xlm-mlm-enfr-1024", "distilbert-base-cased", "roberta-base", or "xlm-roberta-base". Only load models that you trust from HuggingFace; loading a malicious model can execute arbitrary code on your computer.
layers: (character or numeric) The layers to extract (default = -2, which gives the second-to-last layer). It is more efficient to extract only the layers you need (e.g., 11). You can also extract several layers (e.g., 11:12), or all layers by setting this parameter to "all". Layer 0 is the decontextualized input layer (i.e., not comprising hidden states) and should normally not be used. Extracted layers can then be aggregated with the textEmbedLayerAggregation function (see the sketch after this list).
return_tokens: (boolean) Whether to provide the tokens used in the specified transformer model (default = TRUE).
word_type_embeddings: (boolean) Whether to provide embeddings for each word/token type (default = FALSE).
decontextualize: (boolean) Whether to decontextualize embeddings (i.e., embed one word at a time) (default = FALSE).
keep_token_embeddings: (boolean) Whether to keep token-level embeddings in the output (when using word_types aggregation) (default = TRUE).
device: (character) Name of the device to use: "cpu", "gpu", "gpu:k", or "mps"/"mps:k" for macOS, where k is a specific device number (default = "cpu").
tokenizer_parallelism: (boolean) If TRUE, turn on tokenizer parallelism (default = FALSE).
model_max_length: (numeric) The maximum length (in number of tokens) for the inputs to the transformer model (default = the value stored for the associated model).
max_token_to_sentence: (numeric) Maximum number of tokens in a string before switching to embedding the text sentence by sentence (default = 4).
logging_level: (character) Set the logging level (default = "error"). Options, ordered from less to more logging: critical, error, warning, info, debug.
sort: (boolean) If TRUE, sort the output into tidy format (default = TRUE).
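As noted for the layers argument, extracted layers can be aggregated afterwards with textEmbedLayerAggregation(). A minimal sketch; the $context_tokens component name is an assumption about the returned list (inspect your output with str() first), and only the layers argument of textEmbedLayerAggregation() is set, leaving its aggregation settings at their defaults:

library(text)

# Extract the hidden states of layers 11 and 12.
raw <- textEmbedRawLayers("I am fine", layers = 11:12)

# Aggregate the two extracted layers into one embedding per text,
# using textEmbedLayerAggregation()'s default aggregation settings.
# NOTE: $context_tokens is an assumed component name; check str(raw).
aggregated <- textEmbedLayerAggregation(raw$context_tokens, layers = 11:12)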
See textEmbedLayerAggregation and textEmbed.
# Get hidden states of layers 11 and 12 for "I am fine".
if (FALSE) {
imf_embeddings_11_12 <- textEmbedRawLayers(
"I am fine",
layers = 11:12
)
# Show the hidden states of layers 11 and 12.
imf_embeddings_11_12
}
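Since texts may also be a tibble (see the texts argument above), the same call extracts hidden states for several character variables at once. A hedged sketch; the tibble and its column names are made up for illustration:

library(text)
library(tibble)

# Hypothetical data: two character variables in one tibble.
df <- tibble(
  diary = c("Today was calm.", "I felt stressed."),
  goals = c("Sleep more.", "Exercise daily.")
)

# Hidden states are extracted for every character variable in df.
embeddings <- textEmbedRawLayers(df, layers = -2)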