
FMAT (version 2025.3)

BERT_vocab: Check if mask words are in the model vocabulary.

Description

Check if mask words are in the model vocabulary.

Usage

BERT_vocab(
  models,
  mask.words,
  add.tokens = FALSE,
  add.method = c("sum", "mean")
)

Value

A data.table with the model name, mask word, real token (replaced if out of vocabulary), and token ID (0 to N).

Arguments

models

Model names at HuggingFace.

mask.words

Option words filling in the mask.

add.tokens

Whether to add new tokens (for out-of-vocabulary words or even phrases) to the model vocabulary. Defaults to FALSE. Tokens are added only temporarily for the task at hand; the raw model file is not changed.

add.method

Method used to produce the token embeddings of newly added tokens. Can be "sum" (default) or "mean" of the subword token embeddings.
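
The difference between "sum" and "mean" can be illustrated with a toy calculation in base R. This is only a sketch of the arithmetic, not FMAT's internal code: the embedding values are made up, the subword split of "individualism" is hypothetical, and combine_subwords() is an illustrative helper that is not part of the package.

```r
# Toy illustration only: made-up 4-dimensional embeddings,
# not values taken from any real model.
# The subword split below is hypothetical.
subword_embeddings <- rbind(
  individual = c(0.2, -0.1,  0.4, 0.0),
  "##ism"    = c(0.1,  0.3, -0.2, 0.5)
)

# Hypothetical helper (not an FMAT function): combine the subword
# rows into a single embedding vector for the new token.
combine_subwords <- function(emb, method = c("sum", "mean")) {
  method <- match.arg(method)  # the first choice, "sum", is the default
  if (method == "sum") colSums(emb) else colMeans(emb)
}

emb_sum  <- combine_subwords(subword_embeddings, "sum")
emb_mean <- combine_subwords(subword_embeddings, "mean")
```

With two subwords, the "mean" vector is simply the "sum" vector divided by two, so the two methods differ in scale but not in direction.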

See Also

BERT_download

BERT_info

FMAT_run

Examples

if (FALSE) {
# Two BERT variants: one uncased, one cased
models = c("bert-base-uncased", "bert-base-cased")
BERT_info(models)

# Check a lowercase and a capitalized form of the same word
BERT_vocab(models, c("bruce", "Bruce"))

BERT_vocab(models, 2020:2025)  # some are out-of-vocabulary
BERT_vocab(models, 2020:2025, add.tokens=TRUE)  # add vocab

# add.tokens=TRUE also accepts phrases
BERT_vocab(models,
           c("individualism", "artificial intelligence"),
           add.tokens=TRUE)
}