- texts
A character variable or a tibble/dataframe with at least one character variable.
- model
Character string specifying the pre-trained language model (default "bert-base-uncased").
For the full list of options, see the pretrained models at
HuggingFace.
For example, use "bert-base-multilingual-cased", "openai-gpt",
"gpt2", "ctrl", "transfo-xl-wt103", "xlnet-base-cased", "xlm-mlm-enfr-1024", "distilbert-base-cased",
"roberta-base", or "xlm-roberta-base". Only load models that you trust from HuggingFace; loading a
malicious model can execute arbitrary code on your computer.
- layers
(string or numeric) Specify the layers to extract
(default -2, which gives the second-to-last layer). It is more efficient to extract only the layers
that you need (e.g., 11). You can also extract several layers (e.g., 11:12), or all layers by setting this
parameter to "all". Layer 0 is the decontextualized input layer (i.e., it does not comprise hidden states) and
thus should normally not be used. The extracted layers can then be aggregated with the textEmbedLayerAggregation
function.
- dim_name
(boolean) If TRUE, the text variable name is appended to each dimension name in the output
(this differentiates word embedding dimension names across variables; e.g., Dim1_text_variable_name).
See textDimName
to change names back and forth.
- aggregation_from_layers_to_tokens
(string) Method to aggregate the contextualized layers of each token
(e.g., "mean", "min", or "max", which take the mean, minimum, or maximum, respectively,
across each column; or "concatenate", which links together each token's layers into one long row).
- aggregation_from_tokens_to_texts
(string) Aggregates token embeddings to the individual text level (i.e., aggregation across
all tokens/words given to the transformer).
- aggregation_from_tokens_to_word_types
(string) Aggregates to the word type (i.e., the individual words)
rather than texts.
- keep_token_embeddings
(boolean) Whether to also keep token embeddings when using text or word-type
aggregation.
- tokens_select
Option to select word embeddings linked to specific tokens
such as [CLS] and [SEP] for the context embeddings.
- tokens_deselect
Option to deselect embeddings linked to specific tokens
such as [CLS] and [SEP] for the context embeddings.
- decontextualize
(boolean) Provide word embeddings of single words as input to the model
(these embeddings are, e.g., used for plotting). If using this, set
single_context_embeddings to FALSE.
- model_max_length
The maximum length (in number of tokens) of inputs to the transformer model
(defaults to the value stored for the associated model).
- max_token_to_sentence
(numeric) Maximum number of tokens in a string to handle before
switching to embedding text sentence by sentence.
- tokenizer_parallelism
(boolean) If TRUE, turns on tokenizer parallelism (default FALSE).
- device
Name of the device to use: 'cpu', 'gpu', 'gpu:k', or 'mps'/'mps:k' for macOS, where k is a
specific device number (e.g., 'mps:1').
- logging_level
Set the logging level (default "warning").
Options, ordered from least to most logging: critical, error, warning, info, debug.
- ...
Additional settings passed on to textEmbedRawLayers().
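As an illustrative sketch of how these arguments fit together (assuming the R `text` package with a working Python backend, e.g., installed via `textrpp_install()`; the example texts and argument values are hypothetical, not defaults):

```r
library(text)

# Two hypothetical example texts
texts <- c("I am feeling great today.", "The weather is terrible.")

# Extract the second-to-last layer and aggregate tokens to one
# embedding per text by taking the mean across tokens.
embeddings <- textEmbed(
  texts,
  model = "bert-base-uncased",
  layers = -2,
  aggregation_from_tokens_to_texts = "mean",
  keep_token_embeddings = FALSE,
  device = "cpu"
)
```

Extracting a single layer and dropping token embeddings, as above, keeps memory use low when only text-level embeddings are needed.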