Tokenize a single "word" (no whitespace). The word can technically contain punctuation, but in BERT's tokenization, punctuation has been split out by this point.
Usage:

.wp_tokenize_word(word, vocab, unk_token = "[UNK]", max_chars = 100)

Arguments:

word        Word to tokenize.
vocab       Character vector of vocabulary tokens. The tokens are assumed
            to be in order of index, with the first index taken as zero to
            be compatible with Python implementations.
unk_token   Token to represent unknown words.
max_chars   Maximum length of word recognized.

Value:

Input word as a list of tokens.
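To illustrate the behavior described above, here is a minimal sketch of the greedy longest-match-first WordPiece algorithm that BERT-style tokenizers use for a single word. It is written in Python rather than R, uses a set for vocabulary lookup instead of an indexed character vector, and follows the BERT convention of a "##" prefix on word-internal pieces; it is an assumption-laden illustration, not the package's implementation.

```python
def wp_tokenize_word(word, vocab, unk_token="[UNK]", max_chars=100):
    """Tokenize one word by greedy longest-match-first WordPiece.

    `vocab` is assumed to be a set of token strings, where
    non-initial subword pieces carry a "##" prefix (BERT convention).
    """
    # Words longer than max_chars are mapped to the unknown token.
    if len(word) > max_chars:
        return [unk_token]

    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        cur_piece = None
        # Shrink the candidate substring from the right until it
        # matches a vocabulary entry (longest match wins).
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation piece
            if piece in vocab:
                cur_piece = piece
                break
            end -= 1
        if cur_piece is None:
            # No prefix of the remainder is in the vocabulary:
            # the whole word becomes the unknown token.
            return [unk_token]
        tokens.append(cur_piece)
        start = end
    return tokens

vocab = {"un", "aff", "##aff", "##able"}
print(wp_tokenize_word("unaffable", vocab))  # ['un', '##aff', '##able']
```

Note that a single unmatched character makes the entire word collapse to `unk_token`; the word is not partially tokenized.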