textEmbed: Extract layers and aggregate them to word embeddings, for all character variables in a given dataframe.

Description

Extract layers and aggregate them to word embeddings, for all character variables in a given dataframe.

Usage

textEmbed(
  x,
  model = "bert-base-uncased",
  layers = 11:12,
  contexts = TRUE,
  context_layers = layers,
  context_aggregation_layers = "concatenate",
  context_aggregation_tokens = "mean",
  context_tokens_select = NULL,
  context_tokens_deselect = NULL,
  decontexts = TRUE,
  decontext_layers = layers,
  decontext_aggregation_layers = "concatenate",
  decontext_aggregation_tokens = "mean",
  decontext_tokens_select = NULL,
  decontext_tokens_deselect = NULL,
  device = "cpu",
  print_python_warnings = FALSE
)

Arguments

x

A character variable or a tibble/dataframe with at least one character variable.

model

Character string specifying the pre-trained language model (default 'bert-base-uncased'). For a full list of options, see the pretrained models at HuggingFace. For example, use "bert-base-multilingual-cased", "openai-gpt", "gpt2", "ctrl", "transfo-xl-wt103", "xlnet-base-cased", "xlm-mlm-enfr-1024", "distilbert-base-cased", "roberta-base", or "xlm-roberta-base".

layers

Specify the layers that should be extracted (default 11:12). It is more efficient to extract only the layers that you need (e.g., 12). Layer 0 is the decontextualized input layer (i.e., not comprising hidden states), so using it is not advised. The extracted layers can then be aggregated with the textEmbedLayerAggregation function. To extract all layers, use 'all'.

contexts

Provide word embeddings based on word contexts (standard method; default = TRUE).

context_layers

Specify the layers that should be aggregated (default: the layers extracted above, i.e., the value of layers). Layer 0 is the decontextualized input layer (i.e., not comprising hidden states), so using it is not advised.

context_aggregation_layers

Method to aggregate the contextualized layers: "mean", "min" or "max", which take the mean, minimum or maximum, respectively, across each column; or "concatenate", which links together each word embedding layer into one long row.
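
The four aggregation methods described above can be illustrated with a minimal base R sketch. This is not the package's internal code; the matrix layer_embeddings and its values are made up for illustration, with one row per layer and one column per embedding dimension.

```r
# Hypothetical data: one token's embedding in two layers (rows),
# with three embedding dimensions (columns).
layer_embeddings <- rbind(
  layer11 = c(0.2, -1.0, 0.5),
  layer12 = c(0.4,  1.0, 0.1)
)

# "mean", "min" and "max" reduce across layers, column by column,
# so the result has as many values as one layer has dimensions.
agg_mean <- colMeans(layer_embeddings)
agg_min  <- apply(layer_embeddings, 2, min)
agg_max  <- apply(layer_embeddings, 2, max)

# "concatenate" joins the layers end to end into one long row,
# so the result has (number of layers) x (number of dimensions) values.
agg_concat <- as.vector(t(layer_embeddings))
```

Note that concatenating two layers doubles the width of the resulting embedding, whereas the other three methods keep the width of a single layer.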

context_aggregation_tokens

Method to aggregate the contextualized tokens: "mean", "min" or "max", which take the mean, minimum or maximum, respectively, across each column; or "concatenate", which links together each word embedding layer into one long row.

context_tokens_select

Option to select word embeddings linked to specific tokens such as [CLS] and [SEP] for the context embeddings.

context_tokens_deselect

Option to deselect embeddings linked to specific tokens such as [CLS] and [SEP] for the context embeddings.
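
As a rough illustration of what selecting and deselecting tokens means, the base R sketch below filters a made-up token table and then averages the remaining tokens. The data frame, its column names (token, Dim1, Dim2) and its values are assumptions for illustration only and do not reproduce the package's internal data structures.

```r
# Hypothetical token-level embeddings for one text, including special tokens.
token_embeddings <- data.frame(
  token = c("[CLS]", "i", "am", "happy", "[SEP]"),
  Dim1  = c(0.9, 0.2, 0.4, 0.6, 0.1),
  Dim2  = c(0.5, 1.0, 2.0, 3.0, 0.7),
  stringsAsFactors = FALSE
)

# Deselecting: drop the rows linked to the special tokens.
deselected <- token_embeddings[!token_embeddings$token %in% c("[CLS]", "[SEP]"), ]

# Selecting: keep only the rows linked to a specific token.
selected <- token_embeddings[token_embeddings$token %in% "[CLS]", ]

# Aggregating the remaining tokens with "mean" gives one embedding for the text.
text_embedding <- colMeans(deselected[, c("Dim1", "Dim2")])
```

Which tokens are kept matters for the aggregated result: in this sketch the mean is computed over the three word tokens only, so the special tokens do not influence it.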

decontexts

Provide word embeddings of single words as input (embeddings used, e.g., for plotting; default = TRUE).

decontext_layers

Layers to aggregate for the decontext embeddings (default: the layers extracted above, i.e., the value of layers).

decontext_aggregation_layers

Method to aggregate the decontextualized layers: "mean", "min" or "max", which take the mean, minimum or maximum, respectively, across each column; or "concatenate", which links together each word embedding layer into one long row.

decontext_aggregation_tokens

Method to aggregate the decontextualized tokens: "mean", "min" or "max", which take the mean, minimum or maximum, respectively, across each column; or "concatenate", which links together each word embedding layer into one long row.

decontext_tokens_select

Option to select embeddings linked to specific tokens such as [CLS] and [SEP] for the decontext embeddings.

decontext_tokens_deselect

Option to deselect embeddings linked to specific tokens such as [CLS] and [SEP] for the decontext embeddings.

device

Name of the device to use: 'cpu', 'gpu', or 'gpu:k', where k is a specific device number.

print_python_warnings

Boolean; when TRUE, any warnings from the Python environment are printed to the console.

Value

A tibble with tokens, a column for the layer identifier, and word embeddings. Note that layer 0 is the input embedding to the transformer.

Examples

# NOT RUN {
# x <- Language_based_assessment_data_8[1:2, 1:2]
# Example 1
# word_embeddings <- textEmbed(x, layers = 9:11, context_layers = 11, decontext_layers = 9)
# Show the information saved with the embeddings about how they were constructed
# comment(word_embeddings$satisfactionwords)
# comment(word_embeddings$singlewords_we)
# comment(word_embeddings)
# Example 2
# word_embeddings <- textEmbed(x, layers = "all", context_layers = "all", decontext_layers = "all")
# }
