
text (version 0.9.50)

textEmbedLayersOutput: Extract layers of hidden states (word embeddings) for all character variables in a given dataframe.

Description

Extract layers of hidden states (word embeddings) for all character variables in a given dataframe.

Usage

textEmbedLayersOutput(
  x,
  contexts = TRUE,
  decontexts = TRUE,
  model = "bert-base-uncased",
  layers = 11,
  return_tokens = TRUE,
  device = "cpu",
  print_python_warnings = FALSE,
  tokenizer_parallelism = FALSE
)

Arguments

x

A character variable or a tibble/dataframe with at least one character variable.

contexts

Provide word embeddings based on word contexts (standard method; default = TRUE).

decontexts

Provide word embeddings of single words as input (embeddings used for plotting; default = TRUE).

model

Character string specifying the pre-trained language model (default = 'bert-base-uncased'). For the full list of options, see the pretrained models at HuggingFace. For example, use "bert-base-multilingual-cased", "openai-gpt", "gpt2", "ctrl", "transfo-xl-wt103", "xlnet-base-cased", "xlm-mlm-enfr-1024", "distilbert-base-cased", "roberta-base", or "xlm-roberta-base".

layers

Specify the layers to extract (default = 11). It is more efficient to extract only the layers you need (e.g., 11). You can also extract several (e.g., 11:12), or all of them by setting this parameter to "all". Layer 0 is the decontextualized input layer (i.e., it does not comprise hidden states) and should normally not be used. The extracted layers can then be aggregated with the textEmbedLayerAggregation function; see the sketch after this argument list.

return_tokens

If TRUE, provide the tokens used in the specified transformer model.

device

Name of device to use: 'cpu', 'gpu', or 'gpu:k', where k is a specific device number.

print_python_warnings

Boolean; when TRUE, any warnings from the Python environment are printed to the console. (Either way, warnings are saved in the comment of the embedding.)

tokenizer_parallelism

If TRUE, tokenizer parallelism is turned on (default = FALSE).
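
To make the argument descriptions above concrete, here is a minimal sketch of a call that sets several of them explicitly. It assumes the package's Python backend has already been set up (in the text package that is typically done with textrpp_install() and textrpp_initialize(); treat those setup calls as assumptions about your environment).

library(text)

# Two short example texts; any character vector or tibble with
# character variables works as input.
texts <- c("I am feeling great today.", "The weather is rather gloomy.")

embeddings <- textEmbedLayersOutput(
  x = texts,
  contexts = TRUE,             # contextualized embeddings (standard method)
  decontexts = FALSE,          # skip single-word embeddings in this sketch
  model = "bert-base-uncased", # any pretrained model name from HuggingFace
  layers = 11:12,              # extract only the layers you need
  device = "cpu"
)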

Value

A tibble with tokens, a column specifying the layer, and the word embeddings. Note that layer 0 is the input embedding to the transformer and should normally not be used.
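
A quick way to check what came back is to inspect the structure of the returned object. The layout suggested in the comments below (one tibble per character variable, with token and layer columns) is an assumption; verify it against the str() output for your version.

# Inspect the result of the sketch above; names are assumptions.
str(embeddings, max.level = 2)
# For a single character variable, expect a tibble containing the tokens,
# a column identifying the layer, and one column per embedding dimension.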

See Also

See textEmbedLayerAggregation and textEmbed.

Examples

# NOT RUN {
x <- Language_based_assessment_data_8[1:2, 1:2]
word_embeddings_with_layers <- textEmbedLayersOutput(x, layers = 11:12)
# }
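
Building on the example above, a sketch of passing the extracted layers on to textEmbedLayerAggregation, as suggested under the layers argument. The $context element name is an assumption about the output structure; check str(word_embeddings_with_layers) and ?textEmbedLayerAggregation before relying on it.

# NOT RUN {
# Aggregate the extracted layers into one embedding per text.
aggregated <- textEmbedLayerAggregation(
  word_embeddings_with_layers$context,  # assumed element name; verify with str()
  layers = 11:12
)
# }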
