FMAT (version 2025.4)

BERT_vocab: Check if mask words are in the model vocabulary.

Description

Check if mask words are in the model vocabulary.

Usage

BERT_vocab(
  models,
  mask.words,
  add.tokens = FALSE,
  add.method = c("sum", "mean"),
  add.verbose = TRUE
)

Value

A data.table containing the model name, mask word, real token (replaced if the mask word is out of vocabulary), and token id (0 to N).

Arguments

models

A character vector of model names at HuggingFace.

mask.words

Option words to fill in the mask.

add.tokens

Whether to add new tokens (for out-of-vocabulary words or phrases) to the model vocabulary. Defaults to FALSE. Tokens are added only temporarily for the task; the raw model file is not changed.

add.method

Method used to produce the token embeddings of newly added tokens. Can be "sum" (default) or "mean" of subword token embeddings.

add.verbose

Whether to print composition information for new tokens (for out-of-vocabulary words or phrases). Defaults to TRUE.
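
To make the difference between the two add.method options concrete, here is a minimal base-R sketch of combining subword embeddings by "sum" versus "mean". It is an illustration only, not the package's implementation: the matrix subword_emb, its values, and the two-way subword split of the phrase are made-up assumptions.

```r
# Hypothetical toy embeddings: 2 subword tokens x 4 dimensions.
# The values and the subword split are invented for illustration.
subword_emb <- rbind(
  artificial   = c(0.2, -0.1,  0.4, 0.0),
  intelligence = c(0.6,  0.3, -0.2, 1.0)
)

# "sum": add the subword vectors element-wise
emb_sum <- colSums(subword_emb)

# "mean": average the subword vectors element-wise
emb_mean <- colMeans(subword_emb)

# The two results point in the same direction and differ only in scale:
# sum = mean * number of subwords
stopifnot(isTRUE(all.equal(emb_sum, emb_mean * nrow(subword_emb))))

print(emb_sum)
print(emb_mean)
```

Under this sketch, the choice between "sum" and "mean" changes the magnitude of the new token's embedding but not its direction, and the gap grows with the number of subwords in the added word or phrase.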

See Also

BERT_download

BERT_info

FMAT_run

Examples

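if (FALSE) {
library(FMAT)

# Two BERT variants with different (uncased vs. cased) vocabularies
models = c("bert-base-uncased", "bert-base-cased")
BERT_info(models)

# Check a lowercase and a capitalized form of the same word
BERT_vocab(models, c("bruce", "Bruce"))

BERT_vocab(models, 2020:2025)  # some are out-of-vocabulary
BERT_vocab(models, 2020:2025, add.tokens=TRUE)  # add vocab

# Add a long word and a multi-word phrase as new tokens
# if they are out of vocabulary
BERT_vocab(models,
           c("individualism", "artificial intelligence"),
           add.tokens=TRUE)
}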