RBERT (version 0.1.11)

tokenize_word: Tokenize a single "word" (no whitespace).

Description

In BERT's tokenization.py, this code lives inside the tokenize method of WordpieceTokenizer objects. I've moved it into its own function for clarity. Punctuation should already have been removed from the word.
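
As a rough illustration of the greedy longest-match-first strategy that WordPiece tokenization applies, here is a simplified sketch in plain R. It is not the package's implementation; the helper name wordpiece_sketch is invented, and it assumes the vocabulary is supplied either as a character vector of tokens or as a named vector whose names are the tokens (as in the Examples below).

# Simplified sketch of greedy longest-match-first WordPiece tokenization
# (illustrative only; not the RBERT implementation).
wordpiece_sketch <- function(word, vocab, unk_token = "[UNK]", max_chars = 100) {
  if (nchar(word) > max_chars) return(list(unk_token))
  # Accept either a plain character vector of tokens or a named vector
  # whose names are the tokens (as in the Examples below).
  lookup <- if (!is.null(names(vocab))) names(vocab) else vocab
  tokens <- list()
  start <- 1
  while (start <= nchar(word)) {
    end <- nchar(word)
    match <- NULL
    # Try the longest remaining substring first, shrinking from the right.
    while (end >= start) {
      piece <- substr(word, start, end)
      if (start > 1) piece <- paste0("##", piece)  # continuation pieces get "##"
      if (piece %in% lookup) {
        match <- piece
        break
      }
      end <- end - 1
    }
    # If no piece of the remainder is in the vocabulary, the word is unknown.
    if (is.null(match)) return(list(unk_token))
    tokens <- c(tokens, match)
    start <- end + 1
  }
  tokens
}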

Usage

tokenize_word(word, vocab, unk_token = "[UNK]", max_chars = 100)

Arguments

word

Word to tokenize.

vocab

Character vector containing vocabulary words.

unk_token

Token to represent unknown words.

max_chars

Maximum length of word recognized.

Value

The input word as a list of tokens.

Examples

# NOT RUN {
tokenize_word("unknown", vocab = c("un" = 0, "##known" = 1))
tokenize_word("known", vocab = c("un" = 0, "##known" = 1))
# }