Apply .mp_tokenize_word from both directions and pick the result with fewer pieces.
.mp_tokenize_word_bidir(
word,
vocab_split,
unk_token,
max_chars,
allow_compounds = TRUE
)

word: Character scalar; word to tokenize.
vocab_split: List of character vectors containing vocabulary words. Should have components named "prefixes", "words", and "suffixes".

unk_token: Token used to represent unknown words.

max_chars: Maximum length of word recognized.

allow_compounds: Logical; whether to allow multiple whole words in the breakdown. Default is TRUE. This option is not exposed to end users; it is kept here for documentation and development purposes.
Returns the input word as a list of tokens.
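The "pick the result with fewer pieces" selection can be sketched as below. This is an illustrative Python sketch, not the package's implementation: a hypothetical greedy longest-match tokenizer stands in for .mp_tokenize_word, and a flat set of pieces stands in for the vocab_split list (the real function distinguishes "prefixes", "words", and "suffixes" components, which this simplification ignores). The max_chars cutoff mirrors the documented argument.

```python
def tokenize_ltr(word, pieces, unk="[UNK]"):
    """Greedy longest-match, scanning left to right."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in pieces:
                tokens.append(word[i:j])
                i = j
                break
        else:
            return [unk]  # no piece matches here: treat whole word as unknown
    return tokens

def tokenize_rtl(word, pieces, unk="[UNK]"):
    """Greedy longest-match, scanning right to left."""
    tokens, j = [], len(word)
    while j > 0:
        for i in range(0, j):
            if word[i:j] in pieces:
                tokens.append(word[i:j])
                j = i
                break
        else:
            return [unk]
    return tokens[::-1]  # collected back-to-front; restore reading order

def tokenize_bidir(word, pieces, unk="[UNK]", max_chars=100):
    """Tokenize from both directions; keep the breakdown with fewer pieces."""
    if len(word) > max_chars:
        return [unk]  # word too long to attempt
    ltr = tokenize_ltr(word, pieces, unk)
    rtl = tokenize_rtl(word, pieces, unk)
    return ltr if len(ltr) <= len(rtl) else rtl
```

With pieces {"abc", "d", "a", "b", "cd"}, the word "abcd" splits left-to-right into two pieces ("abc" + "d") but right-to-left into three ("a" + "b" + "cd"), so the left-to-right result wins; swapping in {"ab", "bcd", "a", "c", "d"} reverses which direction gives the shorter breakdown.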