- texts
A character variable or a tibble/dataframe with at least one character variable.
- model
Character string specifying the pre-trained language model (default 'bert-base-uncased').
For the full list of options see the pretrained models at
HuggingFace.
For example, use "bert-base-multilingual-cased", "openai-gpt",
"gpt2", "ctrl", "transfo-xl-wt103", "xlnet-base-cased", "xlm-mlm-enfr-1024", "distilbert-base-cased",
"roberta-base", or "xlm-roberta-base". Only load models that you trust from HuggingFace; loading a
malicious model can execute arbitrary code on your computer.
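A minimal sketch of selecting a non-default model, assuming these are the arguments of text::textEmbed() and that the Python backend has already been set up (e.g., via textrpp_install() and textrpp_initialize()):

```r
library(text)

# Non-English input calls for a multilingual model; any trusted model
# name from the Hugging Face Hub can be supplied here.
texts <- c("Jag mår bra idag.", "Das Wetter ist schön.")

embeddings <- textEmbed(
  texts = texts,
  model = "bert-base-multilingual-cased"
)
```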
- layers
(string or numeric) Specify the layers that should be extracted
(default -2, which gives the second-to-last layer). It is more efficient to extract only the layers
that you need (e.g., 11). You can also extract several layers (e.g., 11:12), or all layers by setting this parameter
to "all". Layer 0 is the decontextualized input layer (i.e., not comprising hidden states) and
thus should normally not be used. These layers can then be aggregated in the textEmbedLayerAggregation
function.
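For example, a sketch of requesting only the last two hidden layers of a 12-layer model (again assuming the textEmbed() interface described here):

```r
library(text)

# Layers 11:12 are the last two hidden layers of a 12-layer model
# such as bert-base-uncased; extracting fewer layers is faster.
embeddings <- textEmbed(
  texts = c("Hello world."),
  model = "bert-base-uncased",
  layers = 11:12
)
```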
- dim_name
(boolean) If TRUE, append the text variable's name to all dimension names in the output.
(This differentiates dimension names across word embeddings; e.g., Dim1_text_variable_name.)
See textDimName
to change names back and forth.
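A small illustration, assuming textEmbed() returns a list with a $texts element and that textDimName() takes the embeddings plus a logical dim_names flag (both assumptions drawn from this documentation, not verified here):

```r
library(text)

df <- data.frame(satisfaction_text = c("I feel content.", "Life is hard."))

# With dim_name = TRUE, output columns are named like
# Dim1_satisfaction_text, Dim2_satisfaction_text, ...
emb <- textEmbed(texts = df, dim_name = TRUE)

# Assumed usage: strip the variable-name suffix again
# (back to plain Dim1, Dim2, ...).
emb_plain <- textDimName(emb$texts, dim_names = FALSE)
```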
- aggregation_from_layers_to_tokens
(string) Method to aggregate the contextualized layers of each token (e.g., "mean", "min" or "max",
which take the mean, minimum or maximum, respectively, across each column; or "concatenate", which
links together each word embedding layer into one long row); see the combined sketch after the next argument.
- aggregation_from_tokens_to_texts
(string) Method to aggregate the word embeddings
across the words/tokens of each text, including "min", "max" and "mean", which take the minimum, maximum or mean across each column;
or "concatenate", which links together each word embedding into one long row (default = "mean"). If set to NULL, embeddings are not
aggregated.
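A combined sketch of the two aggregation arguments, under the same textEmbed() assumptions as above: concatenating two layers per token and then averaging the tokens into one embedding per text:

```r
library(text)

emb <- textEmbed(
  texts = c("This is a short example sentence."),
  layers = 11:12,
  # Link the two layers of each token into one long row ...
  aggregation_from_layers_to_tokens = "concatenate",
  # ... then average across tokens to get one row per text.
  aggregation_from_tokens_to_texts = "mean"
)
```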
- aggregation_from_tokens_to_word_types
(string) Aggregates to the word type (i.e., the individual words)
rather than to texts. If set to "individually", duplicate words are not aggregated (i.e., the context of each
individual occurrence is preserved). (default = NULL).
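For instance, a sketch of aggregating to word types instead (argument values taken from the description above):

```r
library(text)

# Duplicate words (here "happy") are averaged into a single word-type
# row; use "individually" to keep each occurrence's context instead.
emb <- textEmbed(
  texts = c("happy happy sad"),
  aggregation_from_tokens_to_word_types = "mean"
)
```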
- keep_token_embeddings
(boolean) Whether to also keep token embeddings when using texts or word
types aggregation.
- batch_size
Number of rows in each batch.
- remove_non_ascii
(boolean) If TRUE, warns about and removes non-ASCII characters (using textFindNonASCII()).
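A brief sketch of this cleaning behavior; the non-breaking hyphen below is one example of a non-ASCII character that would be reported and removed:

```r
library(text)

texts <- c("Plain ASCII text.",
           "Text with a non\u2011ASCII hyphen.")  # \u2011 = non-breaking hyphen

# With remove_non_ascii = TRUE, offending characters are warned about
# and stripped (internally via textFindNonASCII()) before embedding.
emb <- textEmbed(
  texts = texts,
  remove_non_ascii = TRUE
)
```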
- tokens_select
Option to select word embeddings linked to specific tokens
such as [CLS] and [SEP] for the context embeddings.
- tokens_deselect
Option to deselect embeddings linked to specific tokens
such as [CLS] and [SEP] for the context embeddings.
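A sketch of both options; the exact token strings ([CLS], [SEP]) depend on the tokenizer of the chosen model, and passing them as a character vector is an assumption of this example:

```r
library(text)

texts <- c("A sentence for special-token handling.")

# Keep only the [CLS] embedding ...
emb_cls <- textEmbed(texts = texts, tokens_select = "[CLS]")

# ... or exclude the special tokens from the aggregation.
emb_body <- textEmbed(texts = texts, tokens_deselect = c("[CLS]", "[SEP]"))
```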
- decontextualize
(boolean) Provide word embeddings of single words as input to the model
(these embeddings are, e.g., used for plotting; the default is FALSE, i.e., contextualized embeddings are used). If using this, then set
single_context_embeddings to FALSE.
- model_max_length
The maximum length (in number of tokens) for the inputs to the transformer model
(defaults to the value stored for the associated model).
- max_token_to_sentence
(numeric) Maximum number of tokens in a string to handle before
switching to embedding text sentence by sentence.
- tokenizer_parallelism
(boolean) If TRUE, turns on tokenizer parallelism (default = FALSE).
- device
Name of device to use: 'cpu', 'gpu', 'gpu:k', or 'mps'/'mps:k' for macOS, where k is a
specific device number such as 'mps:1'.
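For example (a sketch; adjust the device string to the hardware actually available):

```r
library(text)

# 'gpu' targets the default CUDA device; on Apple silicon use
# 'mps' or 'mps:0'; fall back to 'cpu' when no accelerator exists.
emb <- textEmbed(
  texts = c("Run this batch on a specific accelerator."),
  device = "gpu",
  batch_size = 64
)
```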
- hg_gated
Set to TRUE if the accessed model is gated.
- hg_token
The token needed to access the gated model.
Create a token from the ['Settings' page](https://huggingface.co/settings/tokens) of
the Hugging Face website. Alternatively, the environment variable HUGGINGFACE_TOKEN can
be set to avoid the need to enter the token each time.
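A sketch of both ways to supply the token; the token string and the gated model name below are placeholders, not real values:

```r
library(text)

# Option 1: set the environment variable once per session
# (or persist it in ~/.Renviron).
Sys.setenv(HUGGINGFACE_TOKEN = "hf_xxx")  # placeholder token

# Option 2: pass it explicitly for a gated model.
emb <- textEmbed(
  texts = c("Gated-model example."),
  model = "some-org/some-gated-model",  # placeholder gated model
  hg_gated = TRUE,
  hg_token = Sys.getenv("HUGGINGFACE_TOKEN")
)
```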
- logging_level
Set the logging level (default: "warning").
Options, ordered from least to most logging: critical, error, warning, info, debug.
- implementation
(boolean; experimental) If TRUE, the text is split using the DLATK method; this method appears better for longer texts (but it does not
return token-level word embeddings, nor word_types embeddings at this stage).
- trust_remote_code
(boolean) Whether to use a model with custom code from the Huggingface Hub.
- ...
Settings from textEmbedRawLayers().
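Finally, a sketch putting several of the arguments above together in one call (same textEmbed() assumptions as in the earlier examples); anything not listed here would be forwarded to textEmbedRawLayers() via ...:

```r
library(text)

df <- data.frame(
  harmony_text = c("I feel at peace with my life.",
                   "Everything feels like a struggle.")
)

emb <- textEmbed(
  texts = df,
  model = "bert-base-uncased",
  layers = -2,
  aggregation_from_tokens_to_texts = "mean",
  keep_token_embeddings = FALSE,
  device = "cpu",
  logging_level = "error"
)
```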