Tokenize text according to different Hugging Face transformers
Usage

textTokenize(
  texts,
  model = "bert-base-uncased",
  max_token_to_sentence = 4,
  device = "cpu",
  tokenizer_parallelism = FALSE,
  model_max_length = NULL,
  logging_level = "error"
)
Value

Returns tokens according to the specified Hugging Face transformer.
Arguments

texts
  A character variable or a tibble/dataframe with at least one character variable.

model
  Character string specifying the pre-trained language model (default "bert-base-uncased"). For a full list of options, see the pretrained models at Hugging Face. For example, use "bert-base-multilingual-cased", "openai-gpt", "gpt2", "ctrl", "transfo-xl-wt103", "xlnet-base-cased", "xlm-mlm-enfr-1024", "distilbert-base-cased", "roberta-base", or "xlm-roberta-base".

max_token_to_sentence
  (numeric) Maximum number of tokens in a string to handle before switching to embedding the text sentence by sentence.

device
  Name of device to use: "cpu", "gpu", "gpu:k", or "mps"/"mps:k" for macOS, where k is a specific device number.

tokenizer_parallelism
  If TRUE, tokenizer parallelism is turned on. Default FALSE.

model_max_length
  The maximum length (in number of tokens) for the inputs to the transformer model (defaults to the value stored for the associated model).

logging_level
  Set the logging level. Default: "error". Options (ordered from least to most logging): critical, error, warning, info, debug. A call setting several of these arguments explicitly is sketched below.
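As a minimal sketch of a fuller call, assuming the text package is installed and its Python backend has been set up (e.g., via the package's setup helpers textrpp_install() and textrpp_initialize()), several arguments can be set explicitly:

library(text)

# One-time setup per session (part of the text package's setup workflow):
# textrpp_initialize()

# Tokenize two short strings with an explicitly chosen model and device.
tokens <- textTokenize(
  texts = c("hello are you?", "I am fine, thank you."),
  model = "bert-base-uncased",   # any Hugging Face model from the list above
  device = "cpu",                # or "gpu", "gpu:0", "mps" where available
  tokenizer_parallelism = FALSE,
  logging_level = "error"
)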
See also

textEmbed
Examples

# Requires a working Python backend for transformers
tokens <- textTokenize("hello are you?")
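Since texts also accepts a tibble/dataframe, a character column can be tokenized the same way. A sketch, where the column name is purely illustrative and the printed structure (one element per text variable) is an assumption about the return value:

library(text)
library(tibble)

# A tibble with one character variable (illustrative data).
dat <- tibble(responsetext = c("I feel at peace.", "Life is in balance."))

# Tokenize the character variable(s) in the tibble.
tokens_df <- textTokenize(dat, model = "bert-base-uncased")

# Inspect the returned object (structure assumed, not guaranteed).
str(tokens_df, max.level = 2)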