Tokenize a single "word" (no whitespace). The word can technically contain punctuation, but in BERT's tokenization, punctuation has been split out by this point.
Usage:

.wp_tokenize_word(word, vocab, unk_token = "[UNK]", max_chars = 100)

Arguments:

word        Word to tokenize.
vocab       Character vector of vocabulary tokens. The tokens are assumed
            to be in order of index, with the first index taken as zero to
            be compatible with Python implementations.
unk_token   Token to represent unknown words.
max_chars   Maximum length of word recognized.

Value:

Input word as a list of tokens.
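To illustrate the behavior described above, here is a minimal sketch of the greedy longest-match-first WordPiece algorithm that BERT-style tokenizers use for a single word. It is written in Python rather than R, uses a set for vocabulary lookup instead of an indexed character vector, and follows the BERT convention of a "##" prefix on word-internal pieces; it is an assumption-laden illustration, not the package's implementation.

```python
def wp_tokenize_word(word, vocab, unk_token="[UNK]", max_chars=100):
    """Tokenize one word by greedy longest-match-first WordPiece.

    `vocab` is assumed to be a set of token strings, where
    non-initial subword pieces carry a "##" prefix (BERT convention).
    """
    # Words longer than max_chars are mapped to the unknown token.
    if len(word) > max_chars:
        return [unk_token]

    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        cur_piece = None
        # Shrink the candidate substring from the right until it
        # matches a vocabulary entry (longest match wins).
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation piece
            if piece in vocab:
                cur_piece = piece
                break
            end -= 1
        if cur_piece is None:
            # No prefix of the remainder is in the vocabulary:
            # the whole word becomes the unknown token.
            return [unk_token]
        tokens.append(cur_piece)
        start = end
    return tokens

vocab = {"un", "aff", "##aff", "##able"}
print(wp_tokenize_word("unaffable", vocab))  # ['un', '##aff', '##able']
```

Note that a single unmatched character makes the entire word collapse to `unk_token`; the word is not partially tokenized.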