wordpiece_tokenize

The input text to tokenize.

text

Character vector of vocabulary tokens. The tokens are assumed to
be in order of index, with the first index taken as zero to be compatible
with Python implementations.

vocab
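
The zero-based indexing convention described for this argument can be illustrated with a small base R sketch. The toy vocabulary and the `token_ids` name are made up for illustration; the point is only that the first token gets id 0, not 1.

```r
# Hypothetical toy vocabulary; a real one would be read from a vocabulary file.
vocab <- c("[PAD]", "[UNK]", "taco", "##s", "love")

# R vectors are one-based, so subtract 1 to obtain zero-based token ids.
token_ids <- seq_along(vocab) - 1L
names(token_ids) <- vocab

token_ids[["[PAD]"]]  # 0: the first token has index zero
token_ids[["##s"]]    # 3: fourth position in the vector, id 3
```

Keeping the ids zero-based means they line up with ids produced from the same vocabulary file by a Python tokenizer.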

Token used to represent unknown words.

unk_token

Maximum length, in characters, of a word that will be recognized.

max_chars

Given a sequence of text and a wordpiece vocabulary, tokenizes the text.

Apply 'Wordpiece' (<arXiv:1609.08144>) tokenization to input text,
given an appropriate vocabulary. The 'BERT' (<arXiv:1810.04805>) tokenization
conventions are used by default.
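
As a rough illustration of what wordpiece tokenization does to a single word, the sketch below follows the greedy longest-match-first lookup used by the BERT reference tokenizer, with "##" marking pieces that continue a word. The helper `sketch_wordpiece_word`, its default values, and the toy vocabulary are hypothetical and are not part of this package; that over-long words and unmatched words map to `unk_token` is the BERT reference behavior and is assumed, not confirmed, to match this function.

```r
# Illustrative helper (not part of the package): tokenize ONE word with
# greedy longest-match-first lookup, as in the BERT reference tokenizer.
sketch_wordpiece_word <- function(word, vocab,
                                  unk_token = "[UNK]", max_chars = 100L) {
  n <- nchar(word)
  # Assumption: words longer than max_chars become the unknown token.
  if (n > max_chars) return(unk_token)

  pieces <- character(0)
  start_pos <- 1L
  while (start_pos <= n) {
    end_pos <- n
    found <- NA_character_
    # Try the longest remaining substring first, then shrink from the right.
    while (end_pos >= start_pos) {
      candidate <- substr(word, start_pos, end_pos)
      # Pieces that do not start the word carry the "##" prefix.
      if (start_pos > 1L) candidate <- paste0("##", candidate)
      if (candidate %in% vocab) {
        found <- candidate
        break
      }
      end_pos <- end_pos - 1L
    }
    # No piece matched at this position: the whole word is unknown.
    if (is.na(found)) return(unk_token)
    pieces <- c(pieces, found)
    start_pos <- end_pos + 1L
  }
  pieces
}

toy_vocab <- c("[UNK]", "taco", "##s", "love", "un", "##lov", "##ed")
sketch_wordpiece_word("tacos", toy_vocab)    # "taco" "##s"
sketch_wordpiece_word("unloved", toy_vocab)  # "un" "##lov" "##ed"
sketch_wordpiece_word("apple", toy_vocab)    # "[UNK]"
```

A full tokenizer would first split the text into words and punctuation and then apply a lookup like this to each word; that preprocessing is omitted here.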

wordpiece_tokenize: Tokenize Sequence with Word Pieces
