RBERT (version 0.1.11)

tokenize_word: Tokenize a single "word" (no whitespace).

Description

In BERT's tokenization.py, this code lives inside the tokenize method of WordpieceTokenizer objects. I've moved it into its own function for clarity. Punctuation should already have been removed from the word.
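
As a rough illustration of the greedy longest-match-first strategy that WordPiece tokenization applies, here is a simplified sketch in plain R. It is not the package's implementation; the helper name wordpiece_sketch is invented, and it assumes the vocabulary is supplied either as a character vector of tokens or as a named vector whose names are the tokens (as in the Examples below).

# Simplified sketch of greedy longest-match-first WordPiece tokenization
# (illustrative only; not the RBERT implementation).
wordpiece_sketch <- function(word, vocab, unk_token = "[UNK]", max_chars = 100) {
  if (nchar(word) > max_chars) return(list(unk_token))
  # Accept either a plain character vector of tokens or a named vector
  # whose names are the tokens (as in the Examples below).
  lookup <- if (!is.null(names(vocab))) names(vocab) else vocab
  tokens <- list()
  start <- 1
  while (start <= nchar(word)) {
    end <- nchar(word)
    match <- NULL
    # Try the longest remaining substring first, shrinking from the right.
    while (end >= start) {
      piece <- substr(word, start, end)
      if (start > 1) piece <- paste0("##", piece)  # continuation pieces get "##"
      if (piece %in% lookup) {
        match <- piece
        break
      }
      end <- end - 1
    }
    # If no piece of the remainder is in the vocabulary, the word is unknown.
    if (is.null(match)) return(list(unk_token))
    tokens <- c(tokens, match)
    start <- end + 1
  }
  tokens
}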

Usage

tokenize_word(word, vocab, unk_token = "[UNK]", max_chars = 100)

Arguments

word

Word to tokenize.

vocab

Character vector containing vocabulary words.

unk_token

Token to represent unknown words.

max_chars

Maximum length of word recognized.

Value

The input word as a list of tokens.

Examples

# NOT RUN {
tokenize_word("unknown", vocab = c("un" = 0, "##known" = 1))
tokenize_word("known", vocab = c("un" = 0, "##known" = 1))
# }