
Extract layers of hidden states (word embeddings) for all character variables in a given dataframe.
textEmbedLayersOutput(
x,
contexts = TRUE,
decontexts = TRUE,
model = "bert-base-uncased",
layers = 11,
return_tokens = TRUE,
device = "cpu",
print_python_warnings = FALSE,
tokenizer_parallelism = FALSE
)
x: A character variable or a tibble/dataframe with at least one character variable.
contexts: Provide word embeddings based on word contexts (standard method; default = TRUE).
decontexts: Provide word embeddings of single words as input (embeddings used for plotting; default = TRUE).
model: Character string specifying the pre-trained language model (default = "bert-base-uncased"). For the full list of options, see the pretrained models at HuggingFace. For example, use "bert-base-multilingual-cased", "openai-gpt", "gpt2", "ctrl", "transfo-xl-wt103", "xlnet-base-cased", "xlm-mlm-enfr-1024", "distilbert-base-cased", "roberta-base", or "xlm-roberta-base".
layers: Specify the layers that should be extracted (default = 11). It is more efficient to extract only the layers that you need (e.g., 11). You can also extract several (e.g., 11:12), or all of them by setting this parameter to "all". Layer 0 is the decontextualized input layer (i.e., it does not comprise hidden states) and thus should normally not be used. The extracted layers can then be aggregated with the textEmbedLayerAggregation function.
return_tokens: If TRUE, provide the tokens used in the specified transformer model (default = TRUE).
device: Name of the device to use: "cpu", "gpu", or "gpu:k", where k is a specific device number (default = "cpu").
print_python_warnings: Boolean; when TRUE, any warnings from the Python environment are printed to the console (default = FALSE). Either way, warnings are saved in the comment of the embedding.
tokenizer_parallelism: If TRUE, turn on tokenizer parallelism (default = FALSE).
A tibble with tokens, a column specifying the layer, and the word embeddings. Note that layer 0 is the input embedding to the transformer and should normally not be used.
See textEmbedLayerAggregation and textEmbed.
# NOT RUN {
# x <- Language_based_assessment_data_8[1:2, 1:2]
# word_embeddings_with_layers <- textEmbedLayersOutput(x, layers = 11:12)
# }
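The layers description says that extracted layers can be passed on to textEmbedLayerAggregation. A minimal sketch of that two-step workflow follows. It assumes that the text package and its Python backend are installed, that the example data can be reached as text::Language_based_assessment_data_8, and that the returned object stores the contextualized layers in an element named context. The run_example switch is hypothetical and only keeps the block from downloading a model when it is sourced as-is.

```r
# Sketch only: the guarded calls need the 'text' package and its Python backend.
layers_to_extract <- 11:12   # the two layers requested from the model
run_example <- FALSE         # hypothetical switch; set to TRUE once the backend is installed

if (run_example && requireNamespace("text", quietly = TRUE)) {
  # Example data shipped with the package: first two rows, first two columns.
  x <- text::Language_based_assessment_data_8[1:2, 1:2]

  # Step 1: extract only the layers that are needed (cheaper than "all").
  word_embeddings_with_layers <- text::textEmbedLayersOutput(
    x,
    layers = layers_to_extract
  )

  # Step 2: aggregate the extracted layers into one embedding per text.
  # Assumption: contextualized layers are stored in the `context` element.
  word_embeddings <- text::textEmbedLayerAggregation(
    word_embeddings_with_layers$context,
    layers = layers_to_extract,
    aggregate_layers = "concatenate",
    aggregate_tokens = "mean"
  )
}
```

Requesting the same layers in both calls avoids asking the aggregation step for a layer that was never extracted.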