Apply .mp_tokenize_word from both directions and pick the result with fewer pieces.
.mp_tokenize_word_bidir(
word,
vocab_split,
unk_token,
max_chars,
allow_compounds = TRUE
)

word: Character scalar; word to tokenize.
vocab_split: List of character vectors containing vocabulary words. Should have components named "prefixes", "words", and "suffixes".

unk_token: Token used to represent unknown words.

max_chars: Maximum length of word recognized.

allow_compounds: Logical; whether to allow multiple whole words in the breakdown. Default is TRUE. This option is not exposed to end users; it is kept here for documentation and development purposes.
Returns the input word as a list of tokens.
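The "pick the result with fewer pieces" selection can be sketched as below. This is an illustrative Python sketch, not the package's implementation: a hypothetical greedy longest-match tokenizer stands in for .mp_tokenize_word, and a flat set of pieces stands in for the vocab_split list (the real function distinguishes "prefixes", "words", and "suffixes" components, which this simplification ignores). The max_chars cutoff mirrors the documented argument.

```python
def tokenize_ltr(word, pieces, unk="[UNK]"):
    """Greedy longest-match, scanning left to right."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in pieces:
                tokens.append(word[i:j])
                i = j
                break
        else:
            return [unk]  # no piece matches here: treat whole word as unknown
    return tokens

def tokenize_rtl(word, pieces, unk="[UNK]"):
    """Greedy longest-match, scanning right to left."""
    tokens, j = [], len(word)
    while j > 0:
        for i in range(0, j):
            if word[i:j] in pieces:
                tokens.append(word[i:j])
                j = i
                break
        else:
            return [unk]
    return tokens[::-1]  # collected back-to-front; restore reading order

def tokenize_bidir(word, pieces, unk="[UNK]", max_chars=100):
    """Tokenize from both directions; keep the breakdown with fewer pieces."""
    if len(word) > max_chars:
        return [unk]  # word too long to attempt
    ltr = tokenize_ltr(word, pieces, unk)
    rtl = tokenize_rtl(word, pieces, unk)
    return ltr if len(ltr) <= len(rtl) else rtl
```

With pieces {"abc", "d", "a", "b", "cd"}, the word "abcd" splits left-to-right into two pieces ("abc" + "d") but right-to-left into three ("a" + "b" + "cd"), so the left-to-right result wins; swapping in {"ab", "bcd", "a", "c", "d"} reverses which direction gives the shorter breakdown.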